[jira] [Created] (SPARK-28332) SQLMetric wrong initValue
Song Jun created SPARK-28332: Summary: SQLMetric wrong initValue Key: SPARK-28332 URL: https://issues.apache.org/jira/browse/SPARK-28332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Song Jun Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set to -1. If a ShuffleMapStage has many tasks that read 0 bytes of data, those tasks send the metric to the driver with its value still at the initValue of -1. The driver then merges the metrics for the stage in DAGScheduler.updateAccumulators, which causes the merged metric value for the stage to become negative. This is incorrect; we should set the initValue to 0. The same problem exists in SQLMetrics.createTimingMetric. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
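The effect described in this issue can be reproduced outside Spark. The sketch below is a minimal, illustrative Python model (not Spark's actual SQLMetric or accumulator classes): the driver-side merge is modeled as a plain sum, and tasks that read no data report the untouched initValue.

```python
# Minimal model of the bug: a size metric initialized to -1 instead of 0.
# Tasks that read no data never update the metric, so they report the raw
# initValue, and the driver-side merge (modeled here as a sum) goes negative.

def merge_task_metrics(task_values):
    """Driver-side merge: sum the per-task metric values for a stage."""
    total = 0
    for v in task_values:
        total += v
    return total

INIT_VALUE_BUGGY = -1   # what createSizeMetric uses today
INIT_VALUE_FIXED = 0    # the proposed fix

# 100 tasks, none of which read any data, so none touch the metric.
buggy = merge_task_metrics([INIT_VALUE_BUGGY] * 100)
fixed = merge_task_metrics([INIT_VALUE_FIXED] * 100)

print(buggy)  # -100: the stage's merged "bytes read" is negative
print(fixed)  # 0: with initValue 0 the merged value stays sensible
```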
[jira] [Comment Edited] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833558#comment-16833558 ] Song Jun edited comment on SPARK-27227 at 5/6/19 7:32 AM: -- [~cloud_fan] [~smilegator] could you please help to review this SPIP? Thanks very much! was (Author: windpiger): [~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? Thanks very much! > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major > > When we equi-join a big table with a smaller table, we can collect some statistics from the smaller table side and apply them to the scan of the big table, to do partition pruning or data filtering before executing the join. This can significantly improve SQL performance. > A simple example: > select * from A, B where A.a = B.b > where A is a big table and B is a small table. > There are two scenarios: > 1. A.a is a partition column of table A: we can collect all the values of B.b and send them to table A to do partition pruning on A.a. > 2. A.a is not a partition column of table A: we can collect some statistics (such as min/max/bloom filter) of B.b at runtime by executing an extra query (select max(b), min(b), bbf(b) from B), and send them to table A to filter on A.a. > Additionally, for a more complex query such as select * from A join (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics (min/max/bloom filter) of X by executing an extra query (select max(b), min(b), bbf(b) from X). > In both scenarios we can filter out lots of data through partition pruning or data filtering, and thus improve performance. A 10TB TPC-DS run gained about 35% improvement in our tests. > I will submit a SPIP later.
> SPIP: https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt
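Scenario 2 above (A.a is not a partition column) can be illustrated with a small simulation. The sketch below is a hypothetical Python model, not Spark code: the "extra query" step is modeled by computing min/max plus an exact membership set standing in for the bloom filter, and the runtime filter is applied to the big side's rows before the join.

```python
# Illustrative runtime-filter simulation (hypothetical, not Spark's implementation).
# Small-side stats (min/max plus an exact set standing in for a bloom filter)
# are collected first, then applied as a pre-join filter on the big side.

def collect_stats(small_side_values):
    """Models the 'extra SQL' step: select max(b), min(b), bbf(b) from B."""
    return min(small_side_values), max(small_side_values), set(small_side_values)

def runtime_filter(big_side_rows, key, stats):
    lo, hi, membership = stats
    # Cheap min/max pruning first, then the membership (bloom-filter) check.
    return [r for r in big_side_rows if lo <= r[key] <= hi and r[key] in membership]

B_b = [10, 12, 15]                      # join keys of the small table B
A = [{"a": k} for k in range(0, 100)]   # rows of the big table A

stats = collect_stats(B_b)
pruned = runtime_filter(A, "a", stats)
print(len(A), "->", len(pruned))  # 100 -> 3 rows survive to the join
```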
[jira] [Commented] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833558#comment-16833558 ] Song Jun commented on SPARK-27227: -- [~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? Thanks very much! > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major
[jira] [Updated] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27227: - Description: When we equi-join a big table with a smaller table, we can collect some statistics from the smaller table side and apply them to the scan of the big table, to do partition pruning or data filtering before executing the join. This can significantly improve SQL performance. A simple example: select * from A, B where A.a = B.b where A is a big table and B is a small table. There are two scenarios: 1. A.a is a partition column of table A: we can collect all the values of B.b and send them to table A to do partition pruning on A.a. 2. A.a is not a partition column of table A: we can collect some statistics (such as min/max/bloom filter) of B.b at runtime by executing an extra query (select max(b), min(b), bbf(b) from B), and send them to table A to filter on A.a. Additionally, for a more complex query such as select * from A join (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics (min/max/bloom filter) of X by executing an extra query (select max(b), min(b), bbf(b) from X). In both scenarios we can filter out lots of data through partition pruning or data filtering, and thus improve performance. A 10TB TPC-DS run gained about 35% improvement in our tests. I will submit a SPIP later. SPIP: https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt was: When we equi-join a big table with a smaller table, we can collect some statistics from the smaller table side and apply them to the scan of the big table, to do partition pruning or data filtering before executing the join. This can significantly improve SQL performance. A simple example: select * from A, B where A.a = B.b where A is a big table and B is a small table. There are two scenarios: 1. A.a is a partition column of table A: we can collect all the values of B.b and send them to table A to do partition pruning on A.a. 2.
A.a is not a partition column of table A: we can collect some statistics (such as min/max/bloom filter) of B.b at runtime by executing an extra query (select max(b), min(b), bbf(b) from B), and send them to table A to filter on A.a. Additionally, for a more complex query such as select * from A join (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics (min/max/bloom filter) of X by executing an extra query (select max(b), min(b), bbf(b) from X). In both scenarios we can filter out lots of data through partition pruning or data filtering, and thus improve performance. A 10TB TPC-DS run gained about 35% improvement in our tests. I will submit a SPIP later. > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major
> SPIP: https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt
[jira] [Updated] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27227: - Summary: Spark Runtime Filter (was: Dynamic Partition Prune in Spark) > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819844#comment-16819844 ] Song Jun commented on SPARK-19842: -- I think constraints should be designed together with DataSource v2, and they can do more than this JIRA covers. Constraints can be used for: 1. data integrity (not included in this JIRA); 2. query rewrites in the optimizer to gain performance (not just PK/FK; unique/not null are also useful). For data integrity, there are two scenarios: 1.1 The DataSource natively supports data integrity, such as MySQL/Oracle. In this case Spark should only call the read/write API of the DataSource and do nothing extra for data integrity. 1.2 The DataSource does not support data integrity, such as CSV/JSON/Parquet. In this case Spark can provide data integrity for the DataSource like Hive does (perhaps with a switch to turn it off), and we can discuss which kinds of constraints to support. For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan (https://issues.apache.org/jira/browse/HIVE-16605). For optimizer query rewrites: 2.1 We can add constraint information to CatalogTable, which is returned by the catalog.getTable API; the optimizer can then use it to rewrite queries.
2.2 If we cannot get constraint information, we can supply it via a SQL hint. Putting this together, we can bring the constraint feature into the DataSource v2 design: a) To support feature 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#). b) To support data integrity, we can add a ConstraintSupport mix-in to DataSource v2: if a DataSource supports constraints, Spark does nothing extra when inserting data; if a DataSource does not support constraints but still wants constraint checks, Spark should perform the check like Hive does (e.g. for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan); if a DataSource neither supports constraints nor wants checks, Spark does nothing. The Hive catalog supports constraints, so we can implement this logic in its createTable/alterTable APIs. We can then use Spark SQL DDL to create a table with constraints, which is stored to the Hive metastore through the Hive catalog API. For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet; As for how to store the constraints: since Hive 2.1 already provides constraint APIs in Hive.java, we can call them directly in the createTable/alterTable APIs of the Hive catalog; there is no need for Spark to store the constraint information in table properties. There are some concerns about using the Hive 2.1 catalog APIs directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1, but the upgrade of Hive to 2.3.4 is in progress (https://issues.apache.org/jira/browse/SPARK-23710). [~cloud_fan] [~ioana-delaney] If this proposal is reasonable, please give me some feedback. Thanks!
> Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, >
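The Hive-style NOT NULL enforcement mentioned in point 1.2 above (an extra check injected into the plan on the write path) can be sketched outside Spark. The following is an illustrative Python stand-in, not Hive's GenericUDFEnforceNotNullConstraint or any actual Spark code; the ConstraintViolation name is invented for the example.

```python
# Hypothetical stand-in for a not-null enforcement check on the write path.
# Hive implements this as an extra UDF inserted into the plan; here it is a
# plain row-level generator that raises on the first violation.

class ConstraintViolation(Exception):
    pass

def enforce_not_null(rows, column):
    """Pass rows through unchanged, raising if `column` is NULL in any row."""
    for row in rows:
        if row.get(column) is None:
            raise ConstraintViolation(
                f"NOT NULL constraint violated on column '{column}'")
        yield row

ok_rows = [{"a": "x", "b": "y"}, {"a": "z", "b": "w"}]
print(list(enforce_not_null(ok_rows, "b")))  # rows pass through unchanged

bad_rows = [{"a": "x", "b": None}]
try:
    list(enforce_not_null(bad_rows, "b"))
except ConstraintViolation as e:
    print(e)  # the insert is rejected instead of writing bad data
```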
[jira] [Created] (SPARK-27280) infer filters from Join's OR condition
Song Jun created SPARK-27280: Summary: infer filters from Join's OR condition Key: SPARK-27280 URL: https://issues.apache.org/jira/browse/SPARK-27280 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Song Jun In some cases, we can infer filters from a join condition that contains OR expressions. For example, TPC-DS query 48: {code:java} select sum (ss_quantity) from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 2000 and ( ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'S' and cd_education_status = 'Secondary' and ss_sales_price between 100.00 and 150.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'M' and cd_education_status = 'College' and ss_sales_price between 50.00 and 100.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'U' and cd_education_status = '2 yr Degree' and ss_sales_price between 150.00 and 200.00 ) ) and ( ( ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('AL', 'OH', 'MD') and ss_net_profit between 0 and 2000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('VA', 'TX', 'IA') and ss_net_profit between 150 and 3000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('RI', 'WI', 'KY') and ss_net_profit between 50 and 25000 ) ) ; {code} We can infer two filters from the join's OR condition: {code:java} for customer_demographics: cd_marital_status in ('S', 'M', 'U') and cd_education_status in ('Secondary', 'College', '2 yr Degree') for store_sales: (ss_sales_price between 100.00 and 150.00 or ss_sales_price between 50.00 and 100.00 or ss_sales_price between 150.00 and 200.00) {code} We can then push these two filters down to filter customer_demographics/store_sales. A PR will be submitted soon.
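The inference step above can be sketched as factoring predicates out of the OR branches: for each table, collect the per-branch constraints on that table's columns and OR them together, which for equality predicates reduces to an IN-list. The following is a hypothetical Python model of that idea, not the optimizer rule itself.

```python
# Hypothetical sketch of inferring a pushed-down filter from a join's OR
# condition. Each OR branch in TPC-DS q48 pins cd_marital_status and
# cd_education_status to one value; together the branches imply an IN-list
# per column that can be pushed down to customer_demographics.

branches = [
    {"cd_marital_status": "S", "cd_education_status": "Secondary"},
    {"cd_marital_status": "M", "cd_education_status": "College"},
    {"cd_marital_status": "U", "cd_education_status": "2 yr Degree"},
]

def infer_in_lists(or_branches):
    """Union the equality constraints across branches into per-column IN-lists."""
    inferred = {}
    for branch in or_branches:
        for column, value in branch.items():
            inferred.setdefault(column, set()).add(value)
    return inferred

filters = infer_in_lists(branches)
print(sorted(filters["cd_marital_status"]))    # ['M', 'S', 'U']
print(sorted(filters["cd_education_status"]))  # ['2 yr Degree', 'College', 'Secondary']
```

Note the inferred filter is only an implied (necessary) condition: it prunes rows early, while the full OR condition is still evaluated at the join.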
[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27229: - Priority: Major (was: Minor) > GroupBy Placement in Intersect Distinct > --- > > Key: SPARK-27229 > URL: https://issues.apache.org/jira/browse/SPARK-27229 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major > > The Intersect operator is replaced by a Left Semi Join in the optimizer. > For example: > SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 > ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 > If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the table data before the join by placing a group-by operator below it: > ==> > SELECT a1, a2 FROM > (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X > LEFT SEMI JOIN > (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y > ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 > The join then executes over much smaller inputs, because the group by has already removed a lot of data. > A PR will be submitted soon
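The rewrite above can be checked on small in-memory "tables". The sketch below is an illustrative Python model (not Spark's optimizer code) of INTERSECT as distinct-plus-left-semi-join, with and without the group-by pushed below the join; both plans must return the same result.

```python
# Illustrative model of the GroupBy placement rewrite (not Spark code).
# INTERSECT == DISTINCT over a left semi join; deduplicating each side
# first shrinks the join inputs without changing the result.

def left_semi_join(left, right):
    right_keys = set(right)
    return [row for row in left if row in right_keys]

def intersect_naive(tab1, tab2):
    # DISTINCT after the semi join, joining the full (possibly huge) inputs.
    return sorted(set(left_semi_join(tab1, tab2)))

def intersect_with_pushed_groupby(tab1, tab2):
    # GROUP BY (dedup) each side first, then semi-join much smaller inputs.
    return sorted(set(left_semi_join(sorted(set(tab1)), sorted(set(tab2)))))

tab1 = [(1, "a"), (1, "a"), (2, "b"), (3, "c")] * 1000   # heavy duplication
tab2 = [(2, "b"), (3, "c"), (4, "d")] * 1000

# Same answer either way; the pushed-down version joins 3 rows against 3
# rows instead of 4000 against 4000.
assert intersect_naive(tab1, tab2) == intersect_with_pushed_groupby(tab1, tab2)
print(intersect_with_pushed_groupby(tab1, tab2))  # [(2, 'b'), (3, 'c')]
```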
[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27229: - Priority: Minor (was: Major) > GroupBy Placement in Intersect Distinct > --- > > Key: SPARK-27229 > URL: https://issues.apache.org/jira/browse/SPARK-27229 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Minor
[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798748#comment-16798748 ] Song Jun commented on SPARK-27229: -- Thanks > GroupBy Placement in Intersect Distinct > --- > > Key: SPARK-27229 > URL: https://issues.apache.org/jira/browse/SPARK-27229 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major
[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27229: - Description: The Intersect operator is replaced by a Left Semi Join in the optimizer. For example: SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the table data before the join by placing a group-by operator below it: ==> SELECT a1, a2 FROM (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X LEFT SEMI JOIN (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 The join then executes over much smaller inputs, because the group by has already removed a lot of data. A PR will be submitted soon was: The Intersect operator is replaced by a Left Semi Join in the optimizer. For example: SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the table data before the join by placing a group-by operator below it: ==> SELECT a1, a2 FROM (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X LEFT SEMI JOIN (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 The join then executes over much smaller inputs, because the group by has already removed a lot of data > GroupBy Placement in Intersect Distinct > --- > > Key: SPARK-27229 > URL: https://issues.apache.org/jira/browse/SPARK-27229 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Critical > > The Intersect operator is replaced by a Left Semi Join in the optimizer.
> For example: > SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 > ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 > If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the table data before the join by placing a group-by operator below it: > ==> > SELECT a1, a2 FROM > (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X > LEFT SEMI JOIN > (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y > ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 > The join then executes over much smaller inputs, because the group by has already removed a lot of data. > A PR will be submitted soon
[jira] [Created] (SPARK-27229) GroupBy Placement in Intersect Distinct
Song Jun created SPARK-27229: Summary: GroupBy Placement in Intersect Distinct Key: SPARK-27229 URL: https://issues.apache.org/jira/browse/SPARK-27229 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Song Jun The Intersect operator is replaced by a Left Semi Join in the optimizer. For example: SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the table data before the join by placing a group-by operator below it: ==> SELECT a1, a2 FROM (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X LEFT SEMI JOIN (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 The join then executes over much smaller inputs, because the group by has already removed a lot of data.
[jira] [Created] (SPARK-27227) Dynamic Partition Prune in Spark
Song Jun created SPARK-27227: Summary: Dynamic Partition Prune in Spark Key: SPARK-27227 URL: https://issues.apache.org/jira/browse/SPARK-27227 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Song Jun When we equi-join a big table with a smaller table, we can collect some statistics from the smaller table side and apply them to the scan of the big table, to do partition pruning or data filtering before executing the join. This can significantly improve SQL performance. A simple example: select * from A, B where A.a = B.b where A is a big table and B is a small table. There are two scenarios: 1. A.a is a partition column of table A: we can collect all the values of B.b and send them to table A to do partition pruning on A.a. 2. A.a is not a partition column of table A: we can collect some statistics (such as min/max/bloom filter) of B.b at runtime by executing an extra query (select max(b), min(b), bbf(b) from B), and send them to table A to filter on A.a. Additionally, for a more complex query such as select * from A join (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics (min/max/bloom filter) of X by executing an extra query (select max(b), min(b), bbf(b) from X). In both scenarios we can filter out lots of data through partition pruning or data filtering, and thus improve performance. A 10TB TPC-DS run gained about 35% improvement in our tests. I will submit a SPIP later.
[jira] [Updated] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20960: - Description: ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in the data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow (basically just a wrapper over Arrow), so that external systems (like Python Pandas DataFrame) can build ColumnVector very easily. was: _emphasized text_ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in the data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow (basically just a wrapper over Arrow), so that external systems (like Python Pandas DataFrame) can build ColumnVector very easily. > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0 > Reporter: Wenchen Fan > > ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in the data source framework, we can use ColumnVector to build a more efficient UDF API, etc. > We also want to introduce a new ColumnVector implementation based on Apache Arrow (basically just a wrapper over Arrow), so that external systems (like Python Pandas DataFrame) can build ColumnVector very easily. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20960: - Description: _emphasized text_ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in the data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow (basically just a wrapper over Arrow), so that external systems (like Python Pandas DataFrame) can build ColumnVector very easily. was: ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in the data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow (basically just a wrapper over Arrow), so that external systems (like Python Pandas DataFrame) can build ColumnVector very easily. > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0 > Reporter: Wenchen Fan > > _emphasized text_ColumnVector is an internal interface in Spark SQL, which is only used by the vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a > more efficient way for data exchanges between Spark and external systems. For > example, we can use ColumnVector to build the columnar read API in the data > source framework, we can use ColumnVector to build a more efficient UDF API, > etc. > We also want to introduce a new ColumnVector implementation based on Apache > Arrow (basically just a wrapper over Arrow), so that external systems (like > Python Pandas DataFrame) can build ColumnVector very easily. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
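As an aside for readers unfamiliar with the interface discussed above: a column vector is a read-only, per-column container with typed accessors and null tracking. Below is a minimal Python sketch of the concept only; the class and method names are invented for illustration and this is not Spark's actual ColumnVector API.

```python
# Illustrative, list-backed stand-in for a column vector. None models SQL NULL.

class SimpleColumnVector:
    """A read-only column of values with per-row typed access and null checks."""

    def __init__(self, values):
        self._values = list(values)

    def num_rows(self):
        return len(self._values)

    def is_null_at(self, row):
        return self._values[row] is None

    def get_int(self, row):
        value = self._values[row]
        if value is None:
            raise ValueError(f"null value at row {row}")
        return value

# One vector per column; a table is just a list of such vectors.
col = SimpleColumnVector([1, None, 3])
```

An Arrow-backed implementation would keep the same accessor surface while delegating storage to Arrow buffers, which is what makes cheap interchange with systems like Pandas possible.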
[jira] [Updated] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-20013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20013: - Description: Currently, when we create or rename a managed table, we need to compute its defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog; additionally, there is another defaultTablePath in SessionCatalog, so we now have three copies of defaultTablePath in three classes. We had better unify them in SessionCatalog. To unify them, we should move some logic from ExternalCatalog to SessionCatalog, and renameTable is one such case. However, limited by renameTable's simple signature {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we cannot pass the computed path to renameTable. So we can add a newTablePath parameter to renameTable in ExternalCatalog. was: There are several reasons to merge renameTable into alterTable in ExternalCatalog: 1. In Hive, we rename a table via alterTable. 2. Currently, when we create or rename a managed table, we need to compute its defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog; additionally, there is another defaultTablePath in SessionCatalog, so we now have three copies of defaultTablePath in three classes. We had better unify them in SessionCatalog. To unify them, we should move some logic from ExternalCatalog to SessionCatalog, and renameTable is one such case. However, limited by renameTable's simple signature {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we cannot pass the computed path to renameTable. So we can merge renameTable into alterTable, and perform the rename in alterTable.
> merge renameTable to alterTable in ExternalCatalog > -- > > Key: SPARK-20013 > URL: https://issues.apache.org/jira/browse/SPARK-20013 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > Currently, when we create or rename a managed table, we need to compute its > defaultTablePath in ExternalCatalog, so the defaultTablePath logic is > duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog; > additionally, there is another defaultTablePath in SessionCatalog, so we now > have three copies of defaultTablePath in three classes. We had better unify > them in SessionCatalog. > To unify them, we should move some logic from ExternalCatalog to > SessionCatalog, and renameTable is one such case. > However, limited by renameTable's simple signature > {code} > def renameTable(db: String, oldName: String, newName: String): Unit > {code} > even if we move the defaultTablePath logic to SessionCatalog, we cannot pass > the computed path to renameTable. > So we can add a newTablePath parameter to renameTable in ExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog
Song Jun created SPARK-20013: Summary: merge renameTable to alterTable in ExternalCatalog Key: SPARK-20013 URL: https://issues.apache.org/jira/browse/SPARK-20013 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun There are several reasons to merge renameTable into alterTable in ExternalCatalog: 1. In Hive, we rename a table via alterTable. 2. Currently, when we create or rename a managed table, we need to compute its defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog; additionally, there is another defaultTablePath in SessionCatalog, so we now have three copies of defaultTablePath in three classes. We had better unify them in SessionCatalog. To unify them, we should move some logic from ExternalCatalog to SessionCatalog, and renameTable is one such case. However, limited by renameTable's simple signature {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we cannot pass the computed path to renameTable. So we can merge renameTable into alterTable, and perform the rename in alterTable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
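The API change discussed in SPARK-20013 above can be sketched as follows. This is a hypothetical Python illustration of the idea (the session layer computes the default table path once and passes it down to the external catalog's rename), not Spark's actual code; the warehouse path and all names are stand-ins.

```python
# Sketch: one defaultTablePath, computed at the session layer, handed to the
# external catalog via an extra newTablePath parameter on rename.

import posixpath

WAREHOUSE = "/user/hive/warehouse"  # illustrative warehouse root

def default_table_path(db, table):
    # Single source of truth, analogous to SessionCatalog.defaultTablePath.
    return posixpath.join(WAREHOUSE, f"{db}.db", table)

class ToyExternalCatalog:
    def __init__(self):
        self.tables = {}  # (db, name) -> location

    def create_table(self, db, name, location=None):
        # Managed tables get the default path; external tables keep theirs.
        self.tables[(db, name)] = location or default_table_path(db, name)

    def rename_table(self, db, old, new, new_table_path):
        # The extra parameter carries the path computed by the session layer,
        # so no defaultTablePath logic is needed down here.
        location = self.tables.pop((db, old))
        if location == default_table_path(db, old):  # managed: move location
            location = new_table_path
        self.tables[(db, new)] = location

catalog = ToyExternalCatalog()
catalog.create_table("db1", "t1")
catalog.rename_table("db1", "t1", "t2", default_table_path("db1", "t2"))
```

The design point is simply that rename of a managed table must relocate its data directory, and with the original three-argument signature the catalog implementation cannot receive the destination path from above.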
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM: --- The root cause is that [the csv file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]), because cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite in SPARK-19235 (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is how we end up with an illegal path like 'jar:file:/xxx' above. It is not related to SPARK-19763. I will fix this by providing a new test directory under sql/ that contains the test files, and having the test case use that path. Thanks~ was (Author: windpiger): The root cause is that [the csv file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]), because cars.csv is stored in the core module's resources.
After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is how we end up with an illegal path like 'jar:file:/xxx' above. It is not related to SPARK-19763. I will fix this by providing a new test directory under sql/ that contains the test files, and having the test case use that path. Thanks~ > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using > and is possibly caused by SPARK-19763. > Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM: --- The root cause is that [the csv file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]), because cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is how we end up with an illegal path like 'jar:file:/xxx' above. It is not related to SPARK-19763. I will fix this by providing a new test directory under sql/ that contains the test files, and having the test case use that path. Thanks~ was (Author: windpiger): The root cause is that [the csv file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]), because cars.csv is stored in the core module's resources.
After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is how we end up with an illegal path like 'jar:file:/xxx' above. > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using > and is possibly caused by SPARK-19763. > Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.execution.command.DDLSuit
[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun commented on SPARK-19990: -- The root cause is that [the csv file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]), because cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is how we end up with an illegal path like 'jar:file:/xxx' above. > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using > and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705) > at > 
org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186) > at > org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuit
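The failure discussed in the comments above comes down to a "jar:file:...!/..." URI being treated as a filesystem path. Below is a small stdlib illustration of why such a URI has no usable path component; the jar path here is shortened and invented for the example, and Python's parser is only a stand-in for the `java.net.URI`/Hadoop `Path` behavior in the stack trace.

```python
# A test resource loaded from inside a jar resolves to a nested 'jar:' URI,
# not a plain file path, so handing it to a file API cannot work.

from urllib.parse import urlparse

resource = ("jar:file:/workspace/sql/core/target/"
            "spark-sql-tests.jar!/test-data/cars.csv")

parsed = urlparse(resource)
# The outer scheme is 'jar'; everything after it, including the nested
# 'file:' URI and the '!/' entry separator, is left as one opaque string,
# so there is no filesystem path to extract.
```

The fix described in the comments avoids the problem entirely by keeping the test files in a plain directory under sql/ rather than packaged inside a test jar.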
[jira] [Created] (SPARK-19961) unify an exception error message for dropDatabase
Song Jun created SPARK-19961: Summary: unify an exception error message for dropDatabase Key: SPARK-19961 URL: https://issues.apache.org/jira/browse/SPARK-19961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Unify the exception error message for dropDatabase when the database still has some tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19945) Add test case for SessionCatalog with HiveExternalCatalog
Song Jun created SPARK-19945: Summary: Add test case for SessionCatalog with HiveExternalCatalog Key: SPARK-19945 URL: https://issues.apache.org/jira/browse/SPARK-19945 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite for HiveExternalCatalog. Moreover, some DDL functions are not suitable to test in ExternalCatalogSuite, because their logic is not fully implemented in ExternalCatalog; these DDL functions are fully implemented in SessionCatalog, so it is better to test them in SessionCatalogSuite. So we should add a test suite for SessionCatalog with HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19945: - Summary: Add test suite for SessionCatalog with HiveExternalCatalog (was: Add test case for SessionCatalog with HiveExternalCatalog) > Add test suite for SessionCatalog with HiveExternalCatalog > -- > > Key: SPARK-19945 > URL: https://issues.apache.org/jira/browse/SPARK-19945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite > for HiveExternalCatalog. > Moreover, some DDL functions are not suitable to test in > ExternalCatalogSuite, because their logic is not fully implemented in > ExternalCatalog; these DDL functions are fully implemented in SessionCatalog, > so it is better to test them in SessionCatalogSuite. > So we should add a test suite for SessionCatalog with HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19917) qualified partition location stored in catalog
Song Jun created SPARK-19917: Summary: qualified partition location stored in catalog Key: SPARK-19917 URL: https://issues.apache.org/jira/browse/SPARK-19917 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun A partition path should be qualified before being stored in the catalog. There are several scenarios: 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' qualified: file:/path/x 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' qualified: file:/tablelocation/x 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x' qualified: file:/path/x 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x' qualified: file:/tablelocation/x Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive SerDe tables produces the expected qualified path; we should make the other scenarios consistent with it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
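The qualification rule behind the four scenarios above can be sketched like this. This is an illustrative Python stand-in for the behavior, not Spark's actual makeQualifiedPath; the `file:` scheme and the `/tablelocation` default mirror the examples in the ticket.

```python
# Qualify a partition location: keep an already-qualified URI, prefix an
# absolute path with the filesystem scheme, and resolve a relative path
# under the table location.

import posixpath

def make_qualified(location, table_location="file:/tablelocation"):
    if ":" in location.split("/", 1)[0]:
        return location                          # already has a scheme
    if posixpath.isabs(location):
        return "file:" + location                # absolute: add scheme only
    return table_location + "/" + location       # relative: resolve under table
```

With this single rule, all four ALTER TABLE forms store the same kind of fully qualified location in the catalog, which is what makes later path comparisons reliable.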
[jira] [Updated] (SPARK-19869) move table-related DDL from ddl.scala to tables.scala
[ https://issues.apache.org/jira/browse/SPARK-19869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19869: - Issue Type: Improvement (was: Bug) > move table-related DDL from ddl.scala to tables.scala > - > > Key: SPARK-19869 > URL: https://issues.apache.org/jira/browse/SPARK-19869 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > move table-related DDL from ddl.scala to tables.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19869) move table-related DDL from ddl.scala to tables.scala
Song Jun created SPARK-19869: Summary: move table-related DDL from ddl.scala to tables.scala Key: SPARK-19869 URL: https://issues.apache.org/jira/browse/SPARK-19869 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor move table-related DDL from ddl.scala to tables.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19864: - Summary: add makeQualifiedPath in SQLTestUtils to optimize some code (was: add makeQualifiedPath in CatalogUtils to optimize some code) > add makeQualifiedPath in SQLTestUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > Currently there are many places that qualify a path; it is better to provide > a helper function for this, which will simplify the code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19867) merge defaultTablePath logic when creating tables for InMemoryCatalog and HiveExternalCatalog
Song Jun created SPARK-19867: Summary: merge defaultTablePath logic when creating tables for InMemoryCatalog and HiveExternalCatalog Key: SPARK-19867 URL: https://issues.apache.org/jira/browse/SPARK-19867 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun When we create a managed table, we set a defaultTablePath for it. Currently the defaultTablePath logic exists in both InMemoryCatalog and HiveExternalCatalog; it is better to merge it up into SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
Song Jun created SPARK-19864: Summary: add makeQualifiedPath in CatalogUtils to optimize some code Key: SPARK-19864 URL: https://issues.apache.org/jira/browse/SPARK-19864 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently there are many places that qualify a path; it is better to provide a helper function for this, which will simplify the code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899292#comment-15899292 ] Song Jun commented on SPARK-19836: -- I have done something similar in https://github.com/apache/spark/pull/16803 > Customizable remote repository url for hive versions unit test > -- > > Key: SPARK-19836 > URL: https://issues.apache.org/jira/browse/SPARK-19836 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Elek, Marton > Labels: ivy, unittest > > When the VersionSuite test runs from sql/hive, it downloads different versions > of Hive. > Unfortunately the IsolatedClientClassloader (which is used by the > VersionSuite) uses hardcoded, fixed repositories: > {code} > val classpath = quietly { > SparkSubmitUtils.resolveMavenCoordinates( > hiveArtifacts.mkString(","), > SparkSubmitUtils.buildIvySettings( > Some("http://www.datanucleus.org/downloads/maven2"), > ivyPath), > exclusions = version.exclusions) > } > {code} > The problem is with the hard-coded repositories: > 1. it's hard to run unit tests in an environment where only one internal > maven repository is available (and central/datanucleus is not) > 2. it's impossible to run unit tests against custom-built hive/hadoop > artifacts (which are not available from the central repository) > VersionSuite already has a specific SPARK_VERSIONS_SUITE_IVY_PATH environment > variable to define a custom local repository as the ivy cache. > I suggest adding an additional environment variable > (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES) to HiveClientBuilder.scala, to > make it possible to add new remote repositories for testing the different > hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19845) failed to uncache datasource table after the table location altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898626#comment-15898626 ] Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM: -- Yes, this JIRA is addressing that: https://issues.apache.org/jira/browse/SPARK-19784 Since the location changed, it becomes more complex to uncache the table and recache the other tables that reference it. I will dig into it more. was (Author: windpiger): Yes, this JIRA is addressing that: https://issues.apache.org/jira/browse/SPARK-19784 Since the location changed, it becomes more complex to uncache the table and recache the other tables that reference it. > failed to uncache datasource table after the table location altered > --- > > Key: SPARK-19845 > URL: https://issues.apache.org/jira/browse/SPARK-19845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > Currently, if we first cache a datasource table, then alter the table > location, > and then drop the table, uncaching the table fails in DropTableCommand, > because the location has changed and sameResult for two InMemoryFileIndex > instances with different locations returns false, so we cannot find the > table's key in the cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898626#comment-15898626 ] Song Jun commented on SPARK-19845: -- Yes, this JIRA is addressing that: https://issues.apache.org/jira/browse/SPARK-19784 Since the location changed, it becomes more complex to uncache the table and recache the other tables that reference it. > failed to uncache datasource table after the table location altered > --- > > Key: SPARK-19845 > URL: https://issues.apache.org/jira/browse/SPARK-19845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > Currently, if we first cache a datasource table, then alter the table > location, > and then drop the table, uncaching the table fails in DropTableCommand, > because the location has changed and sameResult for two InMemoryFileIndex > instances with different locations returns false, so we cannot find the > table's key in the cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered
Song Jun created SPARK-19845: Summary: failed to uncache datasource table after the table location altered Key: SPARK-19845 URL: https://issues.apache.org/jira/browse/SPARK-19845 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table fails in DropTableCommand, because the location has changed and sameResult for two InMemoryFileIndex instances with different locations returns false, so we cannot find the table key in the cache.
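The failure mode above can be sketched with a minimal, self-contained stand-in (hypothetical names; in Spark the cache is keyed by the analyzed plan and compared via sameResult, not by a raw path string):

```scala
// Hypothetical sketch of a plan cache keyed by the table's root location.
// A plain string stands in for the InMemoryFileIndex-backed plan to show
// why the lookup misses once the location has been altered.
object CacheSketch {
  private val cache = scala.collection.mutable.Map[String, String]()

  def cacheTable(location: String): Unit =
    cache(location) = "cached relation"

  // Returns true only if an entry keyed by this location was found and dropped.
  def uncacheTable(location: String): Boolean =
    cache.remove(location).isDefined
}
```

After an ALTER TABLE ... SET LOCATION, the drop path looks up the new location, misses, and the stale entry keyed by the old location is left behind.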
[jira] [Created] (SPARK-19833) remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exists
Song Jun created SPARK-19833: Summary: remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exists Key: SPARK-19833 URL: https://issues.apache.org/jira/browse/SPARK-19833 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun In SPARK-5068, we introduced the SQLConf spark.sql.hive.verifyPartitionPath; if it is set to true, it avoids task failures when the partition location does not exist in the filesystem. This situation should always return an empty result instead of failing the task, so we remove this conf here.
[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
Song Jun created SPARK-19832: Summary: DynamicPartitionWriteTask should escape the partition name Key: SPARK-19832 URL: https://issues.apache.org/jira/browse/SPARK-19832 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently in DynamicPartitionWriteTask, when we build the partitionPath of a partition, we only escape the partition value, not the partition name. This causes problems when the partition name contains special characters, for example: 1) if the partition name contains '%' etc., two partition paths are created in the filesystem, one for the escaped path like '/path/a%25b=1' and another for the unescaped path like '/path/a%b=1'. The inserted data is stored under the unescaped path, while SHOW PARTITIONS returns 'a%25b=1', in which the partition name is escaped, so the two are inconsistent. I think the data should be stored under the escaped path in the filesystem, which is also what Hive 2.0.0 does. 2) if the partition name contains ':', new Path("/path", "a:b") throws an exception, because a colon in a relative path is illegal:
{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
{code}
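A simplified stand-in for the escaping (hypothetical helper names; Hive's real rules cover a larger character set and live alongside the catalog utilities) shows how escaping both the name and the value yields a single, legal path:

```scala
// Simplified percent-escaping of path-hostile characters, applied to BOTH
// the partition name and the partition value before joining them with '='.
object PartitionPathSketch {
  // Assumed subset of characters needing escaping; the real list is longer.
  private val charsToEscape = "%:/#?\\=".toSet

  def escapePathName(s: String): String =
    s.flatMap { c =>
      if (charsToEscape.contains(c)) f"%%${c.toInt}%02X" else c.toString
    }

  def partitionPath(name: String, value: String): String =
    s"${escapePathName(name)}=${escapePathName(value)}"
}
```

With this, a name like `a%b` becomes `a%25b` and `a:b` becomes `a%3Ab`, so the directory written and the directory reported by SHOW PARTITIONS agree, and no colon ever reaches `new Path(...)`.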
[jira] [Created] (SPARK-19784) refresh datasource table after alter the location
Song Jun created SPARK-19784: Summary: refresh datasource table after alter the location Key: SPARK-19784 URL: https://issues.apache.org/jira/browse/SPARK-19784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently, if we alter the location of a datasource table and then select from it, it still returns the data from the old location.
[jira] [Created] (SPARK-19763) qualified external datasource table location stored in catalog
Song Jun created SPARK-19763: Summary: qualified external datasource table location stored in catalog Key: SPARK-19763 URL: https://issues.apache.org/jira/browse/SPARK-19763 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we create an external datasource table with a non-qualified location, we should qualify it before storing it in the catalog.
{code}
CREATE TABLE t(a string) USING parquet LOCATION '/path/xx'
CREATE TABLE t1(a string, b string) USING parquet PARTITIONED BY(b) LOCATION '/path/xx'
{code}
When we get the table from the catalog, the location should be qualified, e.g. 'file:/path/xx'.
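A hedged sketch of the qualification step (hypothetical helper; Spark actually delegates to Hadoop's `Path.makeQualified`, which also fills in the authority from the default filesystem):

```scala
import java.net.URI

// Hypothetical helper: qualify a raw location string before storing it in
// the catalog. If the URI carries no scheme, resolve it against an assumed
// default filesystem scheme; otherwise keep it untouched.
object LocationSketch {
  def qualifiedLocation(location: String, defaultScheme: String = "file"): String = {
    val uri = new URI(location)
    if (uri.getScheme == null)
      new URI(defaultScheme, null, uri.getPath, null, null).toString
    else
      location
  }
}
```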
[jira] [Created] (SPARK-19761) create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero
Song Jun created SPARK-19761: Summary: create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero Key: SPARK-19761 URL: https://issues.apache.org/jira/browse/SPARK-19761 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we create an InMemoryFileIndex with empty rootPaths while PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it throws an exception:
{code}
Positive number of slices required
java.lang.IllegalArgumentException: Positive number of slices required
  at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119)
  at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(InMemoryFileIndex.scala:50)
  at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186)
  at org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105)
  at org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33)
  at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185)
  at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
  at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{code}
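One plausible guard, shown as a hedged sketch (hypothetical helper name; the assumption is that the slice count passed to parallelize is derived from the number of paths): clamp it so an empty rootPaths can never reach SparkContext.parallelize with zero slices, which is what raises "Positive number of slices required".

```scala
// Hypothetical guard: parallelize requires numSlices >= 1, so clamp the
// value derived from the number of paths into [1, maxParallelism].
object SliceSketch {
  def numSlices(numPaths: Int, maxParallelism: Int = 10000): Int =
    math.max(1, math.min(numPaths, maxParallelism))
}
```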
[jira] [Commented] (SPARK-19742) When using SparkSession to write a dataset to Hive the schema is ignored
[ https://issues.apache.org/jira/browse/SPARK-19742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885399#comment-15885399 ] Song Jun commented on SPARK-19742: -- this is expected, see the comment.
{code}
/**
 * Inserts the content of the `DataFrame` to the specified table. It requires that
 * the schema of the `DataFrame` is the same as the schema of the table.
 *
 * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses
 * position-based resolution. For example:
 *
 * {{{
 *   scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
 *   scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
 *   scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
 *   scala> sql("select * from t1").show
 *   +---+---+
 *   |  i|  j|
 *   +---+---+
 *   |  5|  6|
 *   |  3|  4|
 *   |  1|  2|
 *   +---+---+
 * }}}
 *
 * Because it inserts data to an existing table, format or options will be ignored.
 *
 * @since 1.4.0
 */
def insertInto(tableName: String): Unit = {
  insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
}
{code}
> When using SparkSession to write a dataset to Hive the schema is ignored > > > Key: SPARK-19742 > URL: https://issues.apache.org/jira/browse/SPARK-19742 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.1 > Environment: Running on Ubuntu with HDP 2.4. >Reporter: Navin Goel > > I am saving a Dataset that is created from reading a json and some selects > and filters into a hive table. The dataset.write().insertInto function does > not look at the schema when writing to the table but instead writes in order to > the hive table. > The schemas for both the tables are same.
> schema printed from spark of the dataset being written: > StructType(StructField(countrycode,StringType,true), > StructField(systemflag,StringType,true), > StructField(classcode,StringType,true), > StructField(classname,StringType,true), > StructField(rangestart,StringType,true), > StructField(rangeend,StringType,true), > StructField(tablename,StringType,true), > StructField(last_updated_date,TimestampType,true)) > Schema of the dataset after loading the same table from Hive: > StructType(StructField(systemflag,StringType,true), > StructField(RangeEnd,StringType,true), > StructField(classcode,StringType,true), > StructField(classname,StringType,true), > StructField(last_updated_date,TimestampType,true), > StructField(countrycode,StringType,true), > StructField(rangestart,StringType,true), > StructField(tablename,StringType,true))
[jira] [Created] (SPARK-19748) refresh for InMemoryFileIndex with FileStatusCache does not work correctly
Song Jun created SPARK-19748: Summary: refresh for InMemoryFileIndex with FileStatusCache does not work correctly Key: SPARK-19748 URL: https://issues.apache.org/jira/browse/SPARK-19748 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we refresh an InMemoryFileIndex that has a FileStatusCache, it first uses the FileStatusCache to generate cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll. The order of these two actions is wrong; as a result the refresh does not take effect.
{code}
override def refresh(): Unit = {
  refresh0()
  fileStatusCache.invalidateAll()
}

private def refresh0(): Unit = {
  val files = listLeafFiles(rootPaths)
  cachedLeafFiles = new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
  cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
  cachedPartitionSpec = null
}
{code}
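The bug and the fix can be sketched with stub types (hypothetical names; the real classes live in org.apache.spark.sql.execution.datasources): invalidate the status cache before re-listing, so the rebuild sees fresh filesystem state.

```scala
// Stub of a file index whose listing consults a status cache, as the real
// InMemoryFileIndex consults its FileStatusCache.
class FileIndexSketch(lister: () => Seq[String]) {
  private val statusCache = scala.collection.mutable.Map[String, Seq[String]]()
  var cachedLeafFiles: Seq[String] = Nil

  // Serves a cached listing when one is present.
  private def listLeafFiles(): Seq[String] =
    statusCache.getOrElseUpdate("root", lister())

  def refresh(): Unit = {
    statusCache.clear()               // invalidateAll() first ...
    cachedLeafFiles = listLeafFiles() // ... then rebuild from the filesystem
  }
}
```

With the original order (rebuild first, invalidate second), a second refresh after new files appear would still return the stale listing.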
[jira] [Updated] (SPARK-19724) create a managed table with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Summary: create a managed table with an existed default location should throw an exception (was: create managed table for hive tables with an existed default location should throw an exception) > create a managed table with an existed default location should throw an > exception > - > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is follow-up work after > [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583) > As we discussed in that [PR](https://github.com/apache/spark/pull/16938), > the following DDL for a managed table with an existing default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > CREATE TABLE ... (PARTITIONED BY ...) > {code} > Currently there are some situations which are not consistent with the above logic: > 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default > location > situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog) > 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > situation: a hive table succeeds with an existing default location
[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Description: This JIRA is follow-up work after [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583) As we discussed in that [PR](https://github.com/apache/spark/pull/16938), the following DDL for a managed table with an existing default location should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... CREATE TABLE ... (PARTITIONED BY ...) {code} Currently there are some situations which are not consistent with the above logic: 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default location situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog) 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... situation: a hive table succeeds with an existing default location was: This JIRA is follow-up work after SPARK-19583 As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a hive table with an existing default location should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... {code} Currently it succeeds in this situation > create managed table for hive tables with an existed default location should > throw an exception > --- > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is follow-up work after > [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583) > As we discussed in that [PR](https://github.com/apache/spark/pull/16938), > the following DDL for a managed table with an existing default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > CREATE TABLE ... (PARTITIONED BY ...) > {code} > Currently there are some situations which are not consistent with the above logic: > 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default > location > situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog) > 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > situation: a hive table succeeds with an existing default location
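The intended check can be sketched as follows (hypothetical helper; in Spark the validation would sit in the create-table code path, and the location would come from the catalog's default-location rule):

```scala
import java.nio.file.{Files, Paths}

// Hypothetical pre-create check: fail fast when the managed table's default
// location already exists, covering both CREATE TABLE and CTAS.
object CreateTableSketch {
  def assertLocationFree(defaultLocation: String): Unit =
    require(
      !Files.exists(Paths.get(defaultLocation)),
      s"Cannot create the managed table: the associated location '$defaultLocation' already exists.")
}
```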
[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Summary: create managed table for hive tables with an existed default location should throw an exception (was: create table for hive tables with an existed default location should throw an exception) > create managed table for hive tables with an existed default location should > throw an exception > --- > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is follow-up work after SPARK-19583 > As we discussed in that [PR|https://github.com/apache/spark/pull/16938], > the following DDL for a hive table with an existing default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > {code} > Currently it succeeds in this situation
[jira] [Created] (SPARK-19724) create table for hive tables with an existed default location should throw an exception
Song Jun created SPARK-19724: Summary: create table for hive tables with an existed default location should throw an exception Key: SPARK-19724 URL: https://issues.apache.org/jira/browse/SPARK-19724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun This JIRA is follow-up work after SPARK-19583 As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a hive table with an existing default location should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... {code} Currently it succeeds in this situation
[jira] [Created] (SPARK-19723) create table for data source tables should work with an non-existent location
Song Jun created SPARK-19723: Summary: create table for data source tables should work with an non-existent location Key: SPARK-19723 URL: https://issues.apache.org/jira/browse/SPARK-19723 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun This JIRA is follow-up work after SPARK-19583 As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a datasource table with a non-existent location should work: {code} CREATE TABLE ... (PARTITIONED BY ...) LOCATION path {code} Currently it throws an exception that the path does not exist.
[jira] [Created] (SPARK-19667) Create table with HiveEnabled in default database use warehouse path instead of the location of default database
Song Jun created SPARK-19667: Summary: Create table with HiveEnabled in default database use warehouse path instead of the location of default database Key: SPARK-19667 URL: https://issues.apache.org/jira/browse/SPARK-19667 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Song Jun Currently, when we create a managed table with HiveEnabled in the default database, Spark uses the location of the default database as the table's location. This is fine with a non-shared metastore, but if we use a metastore shared between different clusters, for example: 1) there is a hive metastore in Cluster-A, which uses a remote mysql as its db; the default database created in that metastore has a location which is a path in Cluster-A 2) we then create another cluster, Cluster-B, which uses the same remote mysql for its metastore's db, so the default database conf in Cluster-B is downloaded from mysql, and its location is a path in Cluster-A 3) when we then create a table in the default database on Cluster-B, it throws an UnknownHost exception for Cluster-A. In Hive 2.0.0, it is allowed to create a table in the default database shared between clusters, while this is not allowed in any other database, only in default. As a Spark user, we want the same behavior as Hive, so that we can create tables in the default database.
[jira] [Updated] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place
[ https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19664: - Description: In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, when the value of 'spark.sql.warehouse.dir' is used to overwrite 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it should be put in 'sparkContext.hadoopConfiguration' and overwrite the original value in hadoopConf https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 was: In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, when the value of 'spark.sql.warehouse.dir' is used to overwrite 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it should be put in 'sparkContext.hadoopConfiguration' https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 > put 'hive.metastore.warehouse.dir' in hadoopConf place > -- > > Key: SPARK-19664 > URL: https://issues.apache.org/jira/browse/SPARK-19664 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Song Jun >Priority: Minor > > In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, > when the value of 'spark.sql.warehouse.dir' is used to overwrite > 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it > should be put in 'sparkContext.hadoopConfiguration' and overwrite the original > value in hadoopConf > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
[jira] [Created] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place
Song Jun created SPARK-19664: Summary: put 'hive.metastore.warehouse.dir' in hadoopConf place Key: SPARK-19664 URL: https://issues.apache.org/jira/browse/SPARK-19664 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Song Jun Priority: Minor In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, when the value of 'spark.sql.warehouse.dir' is used to overwrite 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it should be put in 'sparkContext.hadoopConfiguration' https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869623#comment-15869623 ] Song Jun commented on SPARK-19598: -- Thanks~ let me investigate more~ > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files.
[jira] [Comment Edited] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867548#comment-15867548 ] Song Jun edited comment on SPARK-19598 at 2/15/17 9:46 AM: --- [~rxin] While working on this jira, I found that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, alias){quote} there are lots of *match case* clauses for *UnresolvedRelation* whose matched logic uses the alias parameter of *UnresolvedRelation*. Currently a table with or without an alias can be processed in one *match case UnresolvedRelation* clause; after this change, we would have to process tables with and without an alias separately, in two *match case* clauses: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? was (Author: windpiger): [~rxin] While working on this jira, I found that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, alias){quote} there are lots of *match case* clauses for *UnresolvedRelation* whose matched logic uses the alias parameter of *UnresolvedRelation*. Currently a table with or without an alias can be processed in one *match case UnresolvedRelation* clause; after this change, we would have to process tables with and without an alias separately, in two *match case* clauses: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files.
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867548#comment-15867548 ] Song Jun commented on SPARK-19598: -- [~rxin] While working on this jira, I found that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, alias){quote} there are lots of *match case* clauses for *UnresolvedRelation* whose matched logic uses the alias parameter of *UnresolvedRelation*. Currently a table with or without an alias can be processed in one *match case UnresolvedRelation* clause; after this change, we would have to process tables with and without an alias separately, in two *match case* clauses: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files.
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867398#comment-15867398 ] Song Jun commented on SPARK-19598: -- OK~ I'd like to do this. Thank you very much! > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files.
[jira] [Closed] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
[ https://issues.apache.org/jira/browse/SPARK-19166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19166. Resolution: Not A Bug minor issue
> change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
>
> Key: SPARK-19166
> URL: https://issues.apache.org/jira/browse/SPARK-19166
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Song Jun
> Priority: Minor
>
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files that match a static prefix, such as a partition file path (/table/foo=1) or a non-partition file path (/xxx/a.json).
> The name deleteMatchingPartitions, however, suggests that only partition files will be deleted, which is confusing. It would be better to rename the method.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-19491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19491. Resolution: Duplicate duplicate with https://github.com/apache/spark/pull/16736 > add a config for tableRelation cache size in SessionCatalog > --- > > Key: SPARK-19491 > URL: https://issues.apache.org/jira/browse/SPARK-19491 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > currently the table relation cache size is hardcode to 1000, it is better to > add a config to set its size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19484) continue work to create a table with an empty schema
[ https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19484. Resolution: Won't Fix this has been contained in https://github.com/apache/spark/pull/16787 > continue work to create a table with an empty schema > > > Key: SPARK-19484 > URL: https://issues.apache.org/jira/browse/SPARK-19484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > after SPARK-19279, we could not create a Hive table with an empty schema, > we should tighten up the condition when create a hive table in > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 > That is if a CatalogTable t has an empty schema, and (there is no > `spark.sql.schema.numParts` or its value is 0), we should not add a default > `col` schema, if we did, a table with an empty schema will be created, that > is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19583) CTAS for data source tables with a created location does not work
[ https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865111#comment-15865111 ] Song Jun commented on SPARK-19583: -- ok, I'd like to take this one, thanks a lot! > CTAS for data source tables with an created location does not work > -- > > Key: SPARK-19583 > URL: https://issues.apache.org/jira/browse/SPARK-19583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > spark.sql( > s""" > |CREATE TABLE t > |USING parquet > |PARTITIONED BY(a, b) > |LOCATION '$dir' > |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d >""".stripMargin) > {noformat} > Failed with the error message: > {noformat} > path > file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 > already exists.; > org.apache.spark.sql.AnalysisException: path > file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 > already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed
[ https://issues.apache.org/jira/browse/SPARK-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15863511#comment-15863511 ] Song Jun commented on SPARK-19577: -- I am working on this~ > insert into a partition datasource table with InMemoryCatalog after the > partition location alter by alter command failed > > > Key: SPARK-19577 > URL: https://issues.apache.org/jira/browse/SPARK-19577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > If we use InMemoryCatalog, then we insert into a partition datasource table, > which partition location has changed by `alter table t partition(a="xx") set > location $newpath`, the insert operation is ok, and the data can be insert > into $newpath, while if we then select partition from the table, it will not > return the value we inserted. > The reason is that the InMemoryFileIndex to inferPartition by the table's > rootPath, it does not track the user specific $newPath which provided by > alter command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19575) Reading from or writing to a hive serde table with a non pre-existing location should succeed
Song Jun created SPARK-19575: Summary: Reading from or writing to a hive serde table with a non pre-existing location should succeed Key: SPARK-19575 URL: https://issues.apache.org/jira/browse/SPARK-19575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently, selecting from a hive serde table whose location does not pre-exist throws an exception:
```
Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:258)
```
This is a follow-up to SPARK-19329, which unified the behavior when reading from or writing to a datasource table with a non pre-existing location; here we should also unify the behavior for hive serde tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed
Song Jun created SPARK-19577: Summary: insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed Key: SPARK-19577 URL: https://issues.apache.org/jira/browse/SPARK-19577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we use InMemoryCatalog and insert into a partitioned datasource table whose partition location has been changed by `alter table t partition(a="xx") set location $newpath`, the insert operation succeeds and the data is written into $newpath, but a subsequent select on that partition will not return the values we inserted. The reason is that InMemoryFileIndex infers partitions from the table's rootPath; it does not track the user-specified $newpath provided by the alter command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862768#comment-15862768 ] Song Jun commented on SPARK-19558: -- Isn't sparkSession.listenerManager.register enough? > Provide a config option to attach QueryExecutionListener to SparkSession > > > Key: SPARK-19558 > URL: https://issues.apache.org/jira/browse/SPARK-19558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > Provide a configuration property(just like spark.extraListeners) to attach a > QueryExecutionListener to a SparkSession -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857570#comment-15857570 ] Song Jun commented on SPARK-19496: -- [~hyukjin.kwon] 😁 > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:18 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also returns null; that is, mysql returns null both when the date is invalid and when the format is invalid. Hive, in contrast, will transform an invalid date into a valid one, e.g. 2014-31-12 -> 31/12 = 2 -> 2014+2=2016, 31 - 12*2=7 -> 2016-07-12. Currently spark can already handle a wrong format / wrong date when to_date has the format parameter (like hive's transform); what about making to_date without the format parameter follow the same behavior, that is, return a transformed date instead of null? was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null that is mysql both return null when the date is invalidate or the formate is invalidate. and hive will transform the invalidate date to valid, e.g 2014-31-12 -> 31/12 = 2 -> 2014+2=2016 , 31 - 12*2=7 -> 2016-07-12 > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
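The carrying arithmetic above (2014-31-12, with a month field of 31, becoming 2016-07-12) can be written out as a small sketch. This only models the month-overflow step of lenient parsing, ignores day overflow, and is an illustration rather than Hive's actual implementation:

```scala
// Carry whole years out of an overflowed month field, the way a lenient
// date parser normalizes fields. Day overflow is deliberately not handled.
def normalizeMonth(year: Int, month: Int, day: Int): (Int, Int, Int) = {
  val carry = (month - 1) / 12        // e.g. month 31 hides 2 whole years
  (year + carry, month - 12 * carry, day)
}
```

`normalizeMonth(2014, 31, 12)` gives `(2016, 7, 12)`, matching hive's 2016-07-12 above, and `normalizeMonth(2015, 22, 7)` gives `(2016, 10, 7)`, matching the `to_date('2015-07-22', '-dd-MM')` result of 2016-10-07 quoted in the issue description.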
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:11 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null that is mysql both return null when the date is invalidate or the formate is invalidate. and hive will transform the invalidate date to valid, e.g 2014-31-12 -> 31/12 = 2 -> 2014+2=2016 , 31 - 12*2=7 -> 2016-07-12 was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null mysql both return null when the date is invalidate or the formate is invalidate > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:09 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null mysql both return null when the date is invalidate or the formate is invalidate was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857541#comment-15857541 ] Song Jun commented on SPARK-19496: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856205#comment-15856205 ] Song Jun commented on SPARK-19496: -- I am working on this~ > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog
Song Jun created SPARK-19491: Summary: add a config for tableRelation cache size in SessionCatalog Key: SPARK-19491 URL: https://issues.apache.org/jira/browse/SPARK-19491 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently the table relation cache size is hardcoded to 1000; it would be better to add a config to set its size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
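A configurable capacity along these lines can be sketched with a plain LRU map. Note this is only an illustration: Spark's actual table relation cache is a Guava cache inside SessionCatalog, and any config key name used to feed the capacity (e.g. `spark.sql.tableRelationCacheSize`) is hypothetical here:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// LRU cache whose capacity is injected (e.g. from a SQL config) rather
// than hardcoded to 1000. accessOrder = true gives LRU eviction order.
class TableRelationCache[K, V](capacity: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > capacity
  }
  def put(key: K, value: V): Unit = underlying.put(key, value)
  def get(key: K): Option[V] =
    if (underlying.containsKey(key)) Some(underlying.get(key)) else None
  def entryCount: Int = underlying.size()
}
```

With a capacity of 2, inserting a third relation evicts the least recently used one, which is the behavior a size config would bound.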
[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855677#comment-15855677 ] Song Jun commented on SPARK-19477: -- thanks, I got it~ > [SQL] Datasets created from a Dataframe with extra columns retain the extra > columns > --- > > Key: SPARK-19477 > URL: https://issues.apache.org/jira/browse/SPARK-19477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Drake > > In 1.6, when you created a Dataset from a Dataframe that had extra columns, > the columns not in the case class were dropped from the Dataset. > For example in 1.6, the column c4 is gone: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import sqlContext.implicits._ > import sqlContext.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: > string] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] > scala> ds.show > +---+---+---+ > | f1| f2| f3| > +---+---+---+ > | a| b| c| > | d| e| f| > | h| i| j| > {code} > This seems to have changed in Spark 2.0 and also 2.1: > Spark 2.1.0: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import spark.implicits._ > import spark.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more > fields] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 
2 more > fields] > scala> ds.show > +---+---+---+---+ > | f1| f2| f3| c4| > +---+---+---+---+ > | a| b| c| x| > | d| e| f| y| > | h| i| j| z| > +---+---+---+---+ > scala> import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Encoders > scala> val fEncoder = Encoders.product[F] > fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: > string, f3[0]: string] > scala> fEncoder.schema == ds.schema > res2: Boolean = false > scala> ds.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true), StructField(c4,StringType,true)) > scala> fEncoder.schema > res4: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema
Song Jun created SPARK-19484: Summary: continue work to create a table with an empty schema Key: SPARK-19484 URL: https://issues.apache.org/jira/browse/SPARK-19484 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor After SPARK-19279, we can no longer create a Hive table with an empty schema, so we should tighten up the condition used when creating a hive table in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 That is, if a CatalogTable t has an empty schema and there is no `spark.sql.schema.numParts` property (or its value is 0), we should not add a default `col` schema; if we did, a table with an empty schema would be created, which is not what we expect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19430) Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853466#comment-15853466 ] Song Jun commented on SPARK-19430: -- I think this is not a bug. If you want to access the hive table, you can directly use ` spark.table("orc_varchar_test").show ` > Cannot read external tables with VARCHAR columns if they're backed by ORC > files written by Hive 1.2.1 > - > > Key: SPARK-19430 > URL: https://issues.apache.org/jira/browse/SPARK-19430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Sameer Agarwal > > Spark throws an exception when trying to read external tables with VARCHAR > columns if they're backed by ORC files that were written by Hive 1.2.1 (and > possibly other versions of hive). > Steps to reproduce (credits to [~lian cheng]): > # Write an ORC table using Hive 1.2.1 with >{noformat} > CREATE TABLE orc_varchar_test STORED AS ORC > AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat} > # Get the raw path of the written ORC file > # Create an external table pointing to this file and read the table using > Spark > {noformat} > val path = "/tmp/orc_varchar_test" > sql(s"create external table if not exists test (c0 varchar(10)) stored as orc > location '$path'") > spark.table("test").show(){noformat} > The problem here is that the metadata in the ORC file written by Hive is > different from those written by Spark. We can inspect the ORC file written > above: > {noformat} > $ hive --orcfiledump > file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0 > Structure for > file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0 > File Version: 0.12 with HIVE_8732 > Rows: 1 > Compression: ZLIB > Compression size: 262144 > Type: struct<_col0:varchar(10)> < > ... 
> {noformat} > On the other hand, if you create an ORC table using the same DDL and inspect > the written ORC file, you'll see: > {noformat} > ... > Type: struct > ... > {noformat} > Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set > to {{false}}, which is the default case. > I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of > the following error: > {code} > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19447) Fix input metrics for range operator
[ https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853462#comment-15853462 ] Song Jun commented on SPARK-19447: -- When I run spark.range(1,100).show, there is information in the SQL UI like: ` Range number of output rows: 99 ` I didn't see anything like `0 rows`; maybe I'm not looking in the right place. Could you describe it more clearly? Thanks! > Fix input metrics for range operator > > > Key: SPARK-19447 > URL: https://issues.apache.org/jira/browse/SPARK-19447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Reynold Xin > > Range operator currently does not output any input metrics, and as a result > in the SQL UI the number of rows shown is always 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19463) refresh the table cache after InsertIntoHadoopFsRelation
Song Jun created SPARK-19463: Summary: refresh the table cache after InsertIntoHadoopFsRelation Key: SPARK-19463 URL: https://issues.apache.org/jira/browse/SPARK-19463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we first cache a DataSource table and then insert some data into it, we should refresh the cached data after the insert command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19458) loading hive jars from the local repo which has already downloaded
Song Jun created SPARK-19458: Summary: loading hive jars from the local repo which has already downloaded Key: SPARK-19458 URL: https://issues.apache.org/jira/browse/SPARK-19458 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently, when we create a new HiveClient for a specific metastore version and `spark.sql.hive.metastore.jars` is set to `maven`, Spark will download the hive jars from a remote repo (http://www.datanucleus.org/downloads/maven2). We should allow the user to load hive jars from a local repo into which they have already been downloaded. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
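For context, the settings involved look like the fragment below. `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` are existing Spark SQL options; the local-repo lookup proposed here would change only where the `maven` mode fetches jars from:

```
# spark-defaults.conf (fragment)
spark.sql.hive.metastore.version  1.2.1
# "maven" currently downloads the hive jars from remote repositories;
# the proposal is to first check a repo already present on the local machine.
spark.sql.hive.metastore.jars     maven
```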
[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation
Song Jun created SPARK-19448: Summary: unify some duplication function in MetaStoreRelation Key: SPARK-19448 URL: https://issues.apache.org/jira/browse/SPARK-19448 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor
1. MetastoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's toHiveTable.
2. MetastoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's toHiveColumn.
3. Process the remaining TODO at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters
[ https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837407#comment-15837407 ] Song Jun commented on SPARK-19340: -- The reason is that Spark SQL treats test{00-1}.txt as a glob path. We cannot put a file named test{00-1}.txt on HDFS; it throws an exception. I think this is not a bug > Opening a file in CSV format will result in an exception if the filename > contains special characters > > > Key: SPARK-19340 > URL: https://issues.apache.org/jira/browse/SPARK-19340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0 >Reporter: Reza Safi >Priority: Minor > > If you open a file whose name looks like {noformat} "*{*}*.*" > {noformat} or {noformat} "*[*]*.*" {noformat} using the CSV format, you will get > "org.apache.spark.sql.AnalysisException: Path does not exist" whether the > file is local or on HDFS. > This bug can be reproduced on master and all other Spark 2 branches. 
> To reproduce: > # Create a file like "test{00-1}.txt" on a local directory (like in > /Users/reza/test/test{00-1}.txt) > # Run spark-shell > # Execute this command: > {noformat} > val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt") > {noformat} > You will see the following stack trace: > {noformat} > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/Users/reza/test/test\{00-01\}.txt; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360) > ... 
48 elided > {noformat} > If you put the file on hadoop (like on /user/root) when you try to run the > following: > {noformat} > val df=spark.read.option("header", false).csv("/user/root/*.txt") > {noformat} > > You will get the following exception: > {noformat} > org.apache.hadoop.mapred.InvalidInputException: Input Pattern > hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.pa
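The failure in both stack traces comes from Spark/Hadoop treating `{...}`, `[...]`, `*`, and `?` in an input path as glob syntax, so `test{00-1}.txt` is parsed as a (non-matching) pattern. One way to address such a file literally is to escape the metacharacters first. An illustrative helper; Spark does not ship this function:

```scala
// Escape glob metacharacters so a literal file name like test{00-1}.txt
// is not interpreted as a path pattern. Backslash itself must be escaped too.
def escapeGlob(path: String): String =
  path.flatMap {
    case c @ ('{' | '}' | '[' | ']' | '*' | '?' | '\\') => "\\" + c
    case c                                              => c.toString
  }
```

For example, `escapeGlob("/Users/reza/test/test{00-1}.txt")` yields a path whose braces are backslash-escaped and no longer parse as a glob alternation.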
[jira] [Updated] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case
[ https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19359: - Issue Type: Improvement (was: Bug) > partition path created by Hive should be deleted after rename a partition > with upper-case > - > > Key: SPARK-19359 > URL: https://issues.apache.org/jira/browse/SPARK-19359 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Song Jun >Priority: Minor > > The Hive metastore is not case preserving and keeps partition columns with > lower-case names. > If Spark SQL creates a table with an upper-case partition name using > HiveExternalCatalog, then when we rename a partition it first calls the > HiveClient's renamePartition, which creates a new lower-case partition path; > Spark SQL then renames the lower-case path to the upper-case one. > However, if the renamed partition is more than one level deep, e.g. > A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames it > back to A=1/B=2, but a=1 still exists on the filesystem; we should delete it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case
Song Jun created SPARK-19359: Summary: partition path created by Hive should be deleted after rename a partition with upper-case Key: SPARK-19359 URL: https://issues.apache.org/jira/browse/SPARK-19359 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun Priority: Minor The Hive metastore is not case preserving and keeps partition columns with lower-case names. If Spark SQL creates a table with an upper-case partition name using HiveExternalCatalog, then when we rename a partition it first calls the HiveClient's renamePartition, which creates a new lower-case partition path; Spark SQL then renames the lower-case path to the upper-case one. However, if the renamed partition is more than one level deep, e.g. A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames it back to A=1/B=2, but a=1 still exists on the filesystem; we should delete it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
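The leftover directories can be computed from the cased partition path: every strict ancestor whose lower-cased form differs from its cased form is a stale directory Hive left behind. A hypothetical helper sketching this, not Spark's actual implementation:

```scala
// Given the cased partition path Spark SQL renamed to (e.g. "A=1/B=2"),
// return the lower-cased ancestor directories Hive's renamePartition
// created and that are now stale (here: "a=1").
def staleLowerCaseDirs(casedPath: String): Seq[String] = {
  val parts = casedPath.split('/').filter(_.nonEmpty).toList
  (1 until parts.length)                        // strict ancestors only
    .map(n => parts.take(n))
    .map(p => (p.mkString("/"), p.map(_.toLowerCase).mkString("/")))
    .collect { case (cased, lower) if cased != lower => lower }
}
```

A single-level partition (or one whose spec is already lower-case) yields nothing to delete, which matches the report: the problem only appears with multi-level, cased partition specs.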
[jira] [Created] (SPARK-19332) table's location should check if a URI is legal
Song Jun created SPARK-19332: Summary: table's location should check if a URI is legal Key: SPARK-19332 URL: https://issues.apache.org/jira/browse/SPARK-19332 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to `URI`, but this has some problems: 1. `CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is fine for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contain unencoded whitespace, and such a location makes URI parsing throw an exception. For example `/path/2014-01-01 00%3A00%3A00` is a partition location that contains whitespace. So changing the type to URI is bad for `CatalogTablePartition`. I found that Hive had the same issue in HIVE-6185: before Hive 0.13 the location was a URI; after that PR it was changed to Path, with some checks done at DDL time. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 So I think we can do the URI check on the table's location, and it is not proper to change the type to URI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
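The proposed check can be as simple as attempting to parse the location with `java.net.URI` at DDL time while keeping the field a `String`. An unencoded space, which is legal in a partition location, makes the parse fail, which is exactly the case described above. Illustrative sketch:

```scala
// Validate a table location at DDL time without changing locationUri's type.
// java.net.URI rejects raw (unencoded) spaces, so partition locations like
// "/path/2014-01-01 00:00:00" would fail this check while their
// percent-encoded forms pass.
import java.net.{URI, URISyntaxException}

def isLegalUri(location: String): Boolean =
  try { new URI(location); true }
  catch { case _: URISyntaxException => false }
```

This mirrors Hive's approach after HIVE-6185: validate at DDL time rather than carry a URI-typed field everywhere.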
[jira] [Updated] (SPARK-19332) table's location should check if a URI is legal
[ https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19332: - Description: SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's locationUri to `URI`, while it has some problem: 1.`CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is ok for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contains a not encode whitespace, so if a partition location contains this not encode whitespace, and it will throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a partition location which has whitespace so if we change the type to URI, it is bad for `CatalogTablePartition` and I found Hive has the same issue HIVE-6185 before hive 0.13 the location is URI, while after above PR, it change it to Path, and do some check when DDL. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 so I think ,we can do the URI check for the table's location , and it is not proper to change the type to URI. was: ~SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's locationUri to `URI`, while it has some problem: 1.`CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is ok for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contains a not encode whitespace, so if a partition location contains this not encode whitespace, and it will throw an exception for URI. 
for example `/path/2014-01-01 00%3A00%3A00` is a partition location which has whitespace so if we change the type to URI, it is bad for `CatalogTablePartition` and I found Hive has the same issue ~HIVE-6185 before hive 0.13 the location is URI, while after above PR, it change it to Path, and do some check when DDL. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 so I think ,we can do the URI check for the table's location , and it is not proper to change the type to URI. > table's location should check if a URI is legal > --- > > Key: SPARK-19332 > URL: https://issues.apache.org/jira/browse/SPARK-19332 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's > locationUri to `URI`, while it has some problem: > 1.`CatalogTable` and `CatalogTablePartition` use the same class > `CatalogStorageFormat` > 2. the type URI is ok for `CatalogTable`, but it is not proper for > `CatalogTablePartition` > 3. the location of a table partition can contains a not encode whitespace, so > if a partition location contains this not encode whitespace, and it will > throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a > partition location which has whitespace > so if we change the type to URI, it is bad for `CatalogTablePartition` > and I found Hive has the same issue HIVE-6185 > before hive 0.13 the location is URI, while after above PR, it change it to > Path, and do some check when DDL. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 > so I think ,we can do the URI check for the table's location , and it is not > proper to change the type to URI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 11:22 AM: I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 HIVE-0.12 the table's location method's paramenter Type is URI, after that version change to Path https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L501 this is the same issue in hive: https://issues.apache.org/jira/browse/HIVE-6185 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? 
b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception
Song Jun created SPARK-19329: Summary: after alter a datasource table's location to a not exist location and then insert data throw Exception Key: SPARK-19329 URL: https://issues.apache.org/jira/browse/SPARK-19329 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun spark.sql("create table t(a string, b int) using parquet") spark.sql(s"alter table t set location '$notexistedlocation'") spark.sql("insert into table t select 'c', 1") this will throw an exception: com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: $notexistedlocation; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:41 AM: --- I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? 
[~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:42 AM: --- I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? 
b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:00 AM: --- I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/21/17 11:09 AM: I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) was (Author: windpiger): I found it than `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832935#comment-15832935 ] Song Jun commented on SPARK-19257: -- I found that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail in `CatalogTablePartition`, because a table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI (no encoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19284) append to a existed partitioned datasource table should have no CustomPartitionLocations
Song Jun created SPARK-19284: Summary: append to a existed partitioned datasource table should have no CustomPartitionLocations Key: SPARK-19284 URL: https://issues.apache.org/jira/browse/SPARK-19284 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun Priority: Minor When we append data to an existing partitioned datasource table, InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations currently returns the same location as the Hive default; it should return None. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
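The intended semantics can be sketched as follows: a partition location is "custom" only when it differs from the Hive-default layout `<tableDir>/<col1=v1>/<col2=v2>`, so partitions sitting at the default location should yield None. A hypothetical helper, not the actual InsertIntoHadoopFsRelationCommand code:

```scala
// Return Some(location) only for a genuinely custom partition location;
// a location equal to the Hive-default layout is not custom.
def customPartitionLocation(
    tableDir: String,
    spec: Seq[(String, String)], // ordered partition column -> value
    actualLocation: String): Option[String] = {
  val defaultLocation =
    (tableDir +: spec.map { case (k, v) => s"$k=$v" }).mkString("/")
  if (actualLocation == defaultLocation) None else Some(actualLocation)
}
```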
[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825645#comment-15825645 ] Song Jun commented on SPARK-19257: -- I am working on this~ thanks > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19246) CatalogTable's partitionSchema should check order&exist
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Summary: CatalogTable's partitionSchema should check order&exist (was: CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames) > CatalogTable's partitionSchema should check order&exist > --- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > CatalogTable's partitionSchema should check that each column name in > partitionColumnNames matches exactly one field in schema; if not, we should > throw an exception. > CatalogTable's partitionSchema should also keep the same order as > partitionColumnNames
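The check this issue asks for can be sketched outside Spark. The helper below is hypothetical (Spark's real implementation lives in `CatalogTable`, in Scala, over `StructType` fields) and treats the schema as a plain list of field names: each partition column must match exactly one schema field (case-insensitively, mirroring Spark's default resolution), and the result preserves the order of partitionColumnNames rather than the schema order.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSchemaCheck {
    // Hypothetical helper: selects the partition fields out of a schema,
    // validating existence/uniqueness and preserving the order of
    // partitionColumnNames.
    static List<String> partitionSchema(List<String> schema, List<String> partitionColumnNames) {
        List<String> result = new ArrayList<>();
        for (String name : partitionColumnNames) {
            long matches = schema.stream().filter(f -> f.equalsIgnoreCase(name)).count();
            if (matches != 1) {
                throw new IllegalArgumentException(
                    "Partition column '" + name + "' must match exactly one schema field, found " + matches);
            }
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("num", "day", "month");
        // Order follows partitionColumnNames, not the schema: [month, day]
        System.out.println(partitionSchema(schema, List.of("month", "day")));
    }
}
```

A name missing from the schema (or matching two fields that differ only in case) raises the exception instead of silently producing a wrong partition schema.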
[jira] [Updated] (SPARK-19246) CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Summary: CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames (was: CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema) > CatalogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, and keep > order with partitionColumnNames > -- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > CatalogTable's partitionSchema should check that each column name in > partitionColumnNames matches exactly one field in schema; if not, we should > throw an exception
[jira] [Updated] (SPARK-19246) CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Description: CatalogTable's partitionSchema should check that each column name in partitionColumnNames matches exactly one field in schema; if not, we should throw an exception. CatalogTable's partitionSchema should also keep the same order as partitionColumnNames (was: CatalogTable's partitionSchema should check that each column name in partitionColumnNames matches exactly one field in schema; if not, we should throw an exception) > CatalogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, and keep > order with partitionColumnNames > -- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > CatalogTable's partitionSchema should check that each column name in > partitionColumnNames matches exactly one field in schema; if not, we should > throw an exception > and CatalogTable's partitionSchema should keep the same order as > partitionColumnNames
[jira] [Created] (SPARK-19246) CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema
Song Jun created SPARK-19246: Summary: CatalogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema Key: SPARK-19246 URL: https://issues.apache.org/jira/browse/SPARK-19246 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun CatalogTable's partitionSchema should check that each column name in partitionColumnNames matches exactly one field in schema; if not, we should throw an exception
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823657#comment-15823657 ] Song Jun commented on SPARK-19153: -- I am sorry, it is my fault; I forgot to comment. > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >
[jira] [Commented] (SPARK-19241) remove hive generated table properties if they are not useful in Spark
[ https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823654#comment-15823654 ] Song Jun commented on SPARK-19241: -- I am working on this. > remove hive generated table properties if they are not useful in Spark > -- > > Key: SPARK-19241 > URL: https://issues.apache.org/jira/browse/SPARK-19241 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > When we save a table into the hive metastore, hive will generate some table > properties automatically, e.g. transient_lastDdlTime, last_modified_by, > rawDataSize, etc. Some of them are useless in Spark SQL; we should remove > them. > It would be good if we could get the list of Hive-generated table properties via > a Hive API, so that we don't need to hardcode them. > We can take a look at the Hive code to see how it excludes these auto-generated > table properties when describing a table.
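Until such a Hive API is wired in, the removal step amounts to filtering the table-properties map against a deny-list. A sketch, assuming a hardcoded list built only from the property names mentioned above (the real set of Hive-generated properties is longer):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class HiveTablePropertiesFilter {
    // Property names taken from the issue text; a real implementation would
    // want the complete list from Hive rather than hardcoding it.
    static final Set<String> HIVE_GENERATED = Set.of(
        "transient_lastDdlTime", "last_modified_by", "rawDataSize");

    // Returns a copy of the map with the Hive-generated entries removed.
    static Map<String, String> stripHiveGenerated(Map<String, String> props) {
        Map<String, String> cleaned = new HashMap<>(props);
        cleaned.keySet().removeAll(HIVE_GENERATED);
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("transient_lastDdlTime", "1484700000");
        props.put("owner", "spark");
        // Only the non-generated "owner" entry survives.
        System.out.println(stripHiveGenerated(props));
    }
}
```

Working on a copy keeps the original metastore-supplied map untouched, which matters if other code still needs the raw properties.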
[jira] [Created] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
Song Jun created SPARK-19166: Summary: change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix Key: SPARK-19166 URL: https://issues.apache.org/jira/browse/SPARK-19166 Project: Spark Issue Type: Improvement Components: SQL Reporter: Song Jun Priority: Minor InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files that match a static prefix, such as a partition path (/table/foo=1) or a non-partition file path (/xxx/a.json), while the method name deleteMatchingPartitions suggests that only partition files will be deleted. This name is confusing; it is better to rename the method.
[jira] [Commented] (SPARK-19154) support read and overwrite a same table
[ https://issues.apache.org/jira/browse/SPARK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816942#comment-15816942 ] Song Jun commented on SPARK-19154: -- I am working on this. > support read and overwrite a same table > --- > > Key: SPARK-19154 > URL: https://issues.apache.org/jira/browse/SPARK-19154 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > In SPARK-5746, we forbade users to read and overwrite the same table. It seems > like we don't need this limitation now; we can remove the check and add > regression tests. We may need to take care of partitioned tables though.
[jira] [Comment Edited] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784678#comment-15784678 ] Song Jun edited comment on SPARK-18930 at 12/29/16 6:44 AM: From the Hive documentation, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert : "Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause." Testing on Hive shows the same behavior you describe. I think we can close this jira? [~srowen] was (Author: windpiger): From the Hive documentation, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert : "Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause." Testing on Hive shows the same behavior you describe. I think we can close this jira? > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine, these numbers are the counts of records. But when I do select * > from temp.test_partitioning_4 the data is correct.
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784678#comment-15784678 ] Song Jun commented on SPARK-18930: -- From the Hive documentation, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert : "Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause." Testing on Hive shows the same behavior you describe. I think we can close this jira? > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine, these numbers are the counts of records. But when I do select * > from temp.test_partitioning_4 the data is correct.
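The positional rule quoted from the Hive tutorial explains the reporter's directories: with `PARTITION (day)`, the last select column feeds the partition value, regardless of its name. A small illustration of just that binding rule (not Spark or Hive code; the helper and its names are made up for this sketch):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicPartitionDemo {
    // Binds dynamic partition columns positionally: the last N output
    // columns of the select feed the N partition columns, in order.
    static Map<String, Object> bindPartitions(List<String> partitionCols, List<Object> selectRow) {
        Map<String, Object> binding = new LinkedHashMap<>();
        int offset = selectRow.size() - partitionCols.size();
        for (int i = 0; i < partitionCols.size(); i++) {
            binding.put(partitionCols.get(i), selectRow.get(offset + i));
        }
        return binding;
    }

    public static void main(String[] args) {
        // "select day, count(*) as num ..." with PARTITION (day):
        // the LAST column (num) becomes the partition value, not "day".
        List<Object> row = List.of("2016-04-01", 62456298L);
        System.out.println(bindPartitions(List.of("day"), row));
        // day=62456298 -- the surprising directory name the reporter saw
    }
}
```

Swapping the select to `select count(*) as num, day ...` would put `day` last and produce the intended `day=2016-04-01` directories.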
[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported
[ https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18742: - Description: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory, and the spark.broadcast.factory conf has been removed. However the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ So we should update the comment to clarify that SparkContext will not use a user-specified BroadcastFactory implementation. [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 was: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. However the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ So we should update the comment to clarify that SparkContext will not use a user-specified BroadcastFactory implementation. [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 > Clarify that user-defined BroadcastFactory is not supported > --- > > Key: SPARK-18742 > URL: https://issues.apache.org/jira/browse/SPARK-18742 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Song Jun >Priority: Trivial > > After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation > of BroadcastFactory is TorrentBroadcastFactory, and the > spark.broadcast.factory conf has been removed. > However the scaladoc says [2]: > /** > * An interface for all the broadcast implementations in Spark (to allow > * multiple broadcast implementations). SparkContext uses a user-specified > * BroadcastFactory implementation to instantiate a particular broadcast for > the > * entire Spark job. > */ > So we should update the comment to clarify that SparkContext will not use a > user-specified BroadcastFactory implementation > [1] https://issues.apache.org/jira/browse/SPARK-12588 > [2] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30
[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported
[ https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18742: - Description: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. However the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ So we should update the comment to clarify that SparkContext will not use a user-specified BroadcastFactory implementation. [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 was: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. No code in Spark 2 uses BroadcastFactory (other than TorrentBroadcastFactory). However the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ which is not correct, since there is no way to plug in a custom user-specified BroadcastFactory. It would be better to re-add spark.broadcast.factory to support a user-defined BroadcastFactory. [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 > Clarify that user-defined BroadcastFactory is not supported > --- > > Key: SPARK-18742 > URL: https://issues.apache.org/jira/browse/SPARK-18742 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Song Jun >Priority: Trivial > > After SPARK-12588 Remove HTTPBroadcast [1], the one and only > implementation of BroadcastFactory is TorrentBroadcastFactory. However > the scaladoc says [2]: > /** > * An interface for all the broadcast implementations in Spark (to allow > * multiple broadcast implementations). SparkContext uses a user-specified > * BroadcastFactory implementation to instantiate a particular broadcast for > the > * entire Spark job. > */ > So we should update the comment to clarify that SparkContext will not use a > user-specified BroadcastFactory implementation > [1] https://issues.apache.org/jira/browse/SPARK-12588 > [2] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30