[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651980#comment-15651980 ]

Srinath commented on SPARK-18390:
---------------------------------

FYI, these are in branch 2.1:
{noformat}
commit e6132a6cf10df8b12af8dd8d1a2c563792b5cc5a
Author: Srinath Shankar
Date:   Sat Sep 3 00:20:43 2016 +0200

    [SPARK-17298][SQL] Require explicit CROSS join for cartesian products
{noformat}
and
{noformat}
commit 2d96d35dc0fed6df249606d9ce9272c0f0109fa2
Author: Srinath Shankar
Date:   Fri Oct 14 18:24:47 2016 -0700

    [SPARK-17946][PYSPARK] Python crossJoin API similar to Scala
{noformat}
With the above 2 changes, if a user requests a cross join (with the crossJoin API), the join will always be performed regardless of the physical plan chosen.


> Optimized plan tried to use Cartesian join when it is not enabled
> ------------------------------------------------------------------
>
>         Key: SPARK-18390
>         URL: https://issues.apache.org/jira/browse/SPARK-18390
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Affects Versions: 2.0.1
>    Reporter: Xiangrui Meng
>    Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who suggested using a cartesian product. SQL picked it while knowing it is not enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
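A minimal sketch of the resulting API (assuming a spark-shell style session where spark is in scope; illustrative, not taken from the patches above): the explicit crossJoin call is always honored, while the plain join API still expects a condition relating the two sides.
{code}
val left  = spark.range(1000)
val right = spark.range(1000)

// Explicit cartesian product: always executed, whatever physical join is chosen.
left.crossJoin(right).count()

// Ordinary equi-join: no CROSS JOIN syntax needed.
left.join(right, left("id") === right("id")).count()
{code}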
[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626971#comment-15626971 ]

Srinath commented on SPARK-18209:
---------------------------------

Given that the Hive metastore lets you store column names and types with the view, we could probably just forgo the nesting and store just a database hint.
{code}
CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
{code}
would store the following definition
{code}
SELECT * FROM my_table WHERE id > 10 /* current database */
{code}
In fact, for now the "current database" will always be the database that contains the view. We should also store or document the effect of other settings such as case-[in]sensitive resolution.


> More robust view canonicalization without full SQL expansion
> -------------------------------------------------------------
>
>         Key: SPARK-18209
>         URL: https://issues.apache.org/jira/browse/SPARK-18209
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Reynold Xin
>    Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then generating fully expanded SQL out of the analyzed logical plan. This is actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super careful because it might break SQL generation. This is the main reason the broadcast join hint has taken forever to be merged, because it is very difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide the context for the database as well as star expansion, I think we can do this through a simpler approach, by taking the user-given SQL, analyzing it, and just wrapping the original SQL with a SELECT clause at the outer level and storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE id > 10);
> {code}
> During parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes and the stored schema of the view might differ from the actual SQL schema. In that case, I think we should throw an exception at runtime to warn users. This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
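To illustrate the query-time expansion this enables, here is a rough sketch (a hypothetical helper, not actual Spark code) of resolving the stored view text against the stored database hint:
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch only: resolve the stored view text against the database that
// was current when the view was created, then restore the caller's database.
def expandView(spark: SparkSession, storedSql: String, storedDb: String): DataFrame = {
  val callerDb = spark.catalog.currentDatabase
  spark.catalog.setCurrentDatabase(storedDb)
  try spark.sql(storedSql)             // e.g. "SELECT * FROM my_table WHERE id > 10"
  finally spark.catalog.setCurrentDatabase(callerDb)
}
{code}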
[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626838#comment-15626838 ]

Srinath commented on SPARK-18209:
---------------------------------

A practical (positive) consequence of this is query-time expansion when we have nested views:
{noformat}
create table T(a int)
create view A as select * from T
create view B as select * from A
{noformat}
As it stands, the definition of B is frozen at view creation time, so
{noformat}
drop view A
create view A as select * from T2
select * from B
{noformat}
would return data from T even though the definition of A has changed. If we only expand the view definition at query time, then the above would return data from T2.


> More robust view canonicalization without full SQL expansion
> -------------------------------------------------------------
>
>         Key: SPARK-18209
>         URL: https://issues.apache.org/jira/browse/SPARK-18209
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Reynold Xin
>    Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then generating fully expanded SQL out of the analyzed logical plan. This is actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super careful because it might break SQL generation. This is the main reason the broadcast join hint has taken forever to be merged, because it is very difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide the context for the database as well as star expansion, I think we can do this through a simpler approach, by taking the user-given SQL, analyzing it, and just wrapping the original SQL with a SELECT clause at the outer level and storing the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE id > 10);
> {code}
> During parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating the high-level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes and the stored schema of the view might differ from the actual SQL schema. In that case, I think we should throw an exception at runtime to warn users. This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
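A runnable version of the same scenario (a spark-shell style sketch, assuming the freeze-at-creation behavior described above):
{code}
spark.sql("create table T(a int)")
spark.sql("create table T2(a int)")
spark.sql("create view A as select * from T")
spark.sql("create view B as select * from A")

spark.sql("drop view A")
spark.sql("create view A as select * from T2")

// Today B was expanded and frozen at creation time, so this still reads T.
// With expansion deferred to query time, it would read T2 through the new A.
spark.sql("select * from B").show()
{code}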
[jira] [Created] (SPARK-18127) Add hooks and extension points to Spark
Srinath created SPARK-18127:
--------------------------------

             Summary: Add hooks and extension points to Spark
                 Key: SPARK-18127
                 URL: https://issues.apache.org/jira/browse/SPARK-18127
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
            Reporter: Srinath


We need hooks in Spark for:
1. Custom parsers
2. Additional custom analyzer and optimizer rules
3. Extending Catalog operations



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
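A purely hypothetical sketch of what one such hook could look like (names and shape are illustrative, not an existing Spark API at the time of this issue): a small registry into which users inject extra optimizer rules that the session then appends to its built-in batches.
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: collect user-supplied optimizer rules so that the session
// can append them when the optimizer is constructed.
class ExtensionRegistry {
  private val optimizerRules = scala.collection.mutable.Buffer.empty[Rule[LogicalPlan]]

  def injectOptimizerRule(rule: Rule[LogicalPlan]): Unit = optimizerRules += rule

  def registeredOptimizerRules: Seq[Rule[LogicalPlan]] = optimizerRules.toList
}
{code}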
[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-18106: Description: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not "noscan" produces an AnalyzeTableCommand with noscan=false was: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not noscan produces an AnalyzeTableCommand with noscan=false > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not "noscan" produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
[ https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-18106: Description: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not noscan produces an AnalyzeTableCommand with noscan=false was: {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not {noformat}noscan{noformat} produces an AnalyzeTableCommand with {code}noscan=false{code} > Analyze Table accepts a garbage identifier at the end > - > > Key: SPARK-18106 > URL: https://issues.apache.org/jira/browse/SPARK-18106 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Srinath >Priority: Minor > > {noformat} > scala> sql("create table test(a int)") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("analyze table test compute statistics blah") > res3: org.apache.spark.sql.DataFrame = [] > {noformat} > An identifier that is not noscan produces an AnalyzeTableCommand with > noscan=false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18106) Analyze Table accepts a garbage identifier at the end
Srinath created SPARK-18106: --- Summary: Analyze Table accepts a garbage identifier at the end Key: SPARK-18106 URL: https://issues.apache.org/jira/browse/SPARK-18106 Project: Spark Issue Type: Bug Components: SQL Reporter: Srinath Priority: Minor {noformat} scala> sql("create table test(a int)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("analyze table test compute statistics blah") res3: org.apache.spark.sql.DataFrame = [] {noformat} An identifier that is not {noformat}noscan{noformat} produces an AnalyzeTableCommand with {code}noscan=false{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
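One way to tighten this (an illustrative sketch, not the actual AstBuilder change) is to reject any trailing identifier other than NOSCAN when parsing ANALYZE TABLE ... COMPUTE STATISTICS:
{code}
// Hypothetical helper: `trailing` is the optional identifier after COMPUTE STATISTICS.
def parseNoscan(trailing: Option[String]): Boolean = trailing match {
  case None                                      => false   // full scan
  case Some(id) if id.equalsIgnoreCase("noscan") => true
  case Some(id) =>
    throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$id`")
}
{code}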
[jira] [Created] (SPARK-18013) R cross join API similar to python and Scala
Srinath created SPARK-18013: --- Summary: R cross join API similar to python and Scala Key: SPARK-18013 URL: https://issues.apache.org/jira/browse/SPARK-18013 Project: Spark Issue Type: Bug Reporter: Srinath https://github.com/apache/spark/pull/14866 and https://github.com/apache/spark/pull/15493 added an explicit cross join to the dataset api in scala and python, requiring crossJoin to be used when there is no join condition. (JIRA: https://issues.apache.org/jira/browse/SPARK-17298) Add an explicit crossJoin to R as well so the API behavior is similar to Scala and python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17946) Python crossJoin API similar to Scala
Srinath created SPARK-17946: --- Summary: Python crossJoin API similar to Scala Key: SPARK-17946 URL: https://issues.apache.org/jira/browse/SPARK-17946 Project: Spark Issue Type: Bug Reporter: Srinath https://github.com/apache/spark/pull/14866 added an explicit cross join to the dataset api in scala, requiring crossJoin to be used when there is no join condition. (JIRA: https://issues.apache.org/jira/browse/SPARK-17298) The "join" API in python was implemented using cross join in that patch. Add an explicit crossJoin to python as well so the API behavior is similar to Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17074) generate histogram information for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542830#comment-15542830 ] Srinath commented on SPARK-17074: - IMO if you can get reasonable error bounds (as Tim points out) the method with lower overhead is preferable. In general you can't rely on exact statistics during optimization anyway since new data may have arrived since the last stats collection > generate histogram information for column > - > > Key: SPARK-17074 > URL: https://issues.apache.org/jira/browse/SPARK-17074 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We support two kinds of histograms: > - Equi-width histogram: We have a fixed width for each column interval in > the histogram. The height of a histogram represents the frequency for those > column values in a specific interval. For this kind of histogram, its height > varies for different column intervals. We use the equi-width histogram when > the number of distinct values is less than 254. > - Equi-height histogram: For this histogram, the width of column interval > varies. The heights of all column intervals are the same. The equi-height > histogram is effective in handling skewed data distribution. We use the equi- > height histogram when the number of distinct values is equal to or greater > than 254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
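As a concrete example of a lower-overhead method with a tunable error bound, equi-height bin boundaries can be approximated with the existing approximate-quantile support (a sketch, not necessarily the mechanism this ticket ends up using; df and column "x" are placeholders):
{code}
// Approximate boundaries of a 10-bucket equi-height histogram on column "x",
// with a 1% relative error bound on each quantile.
val numBuckets = 10
val probabilities = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
val boundaries = df.stat.approxQuantile("x", probabilities, 0.01)
{code}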
[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework
[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478341#comment-15478341 ] Srinath commented on SPARK-16026: - Thanks for the response: 1. You’re correct that the search space will increase compared to the attached proposal in the pdf. But typically the time spent in optimization (specifically search space exploration) is negligible compared to the time spent executing the query itself. So it’s usually beneficial to err on the side of exploring more plans, especially to explore plans that specifically reduce exchange times. Something to think about. 2. Thanks for the explanation. Yes, I believe that some common sense measures like updating table stats when column stats are updated will help. 3. You’re absolutely correct that correlated statistics would help with those kinds of predicates. But I was pointing out the more basic problem of estimating selectivity in the absence of statistics. In the absence of statistics we want to ensure that if we’re just “guessing” selectivities for filters F1, F2, (say 0.15 each), then we don’t assign unduly low selectivities (0.15 * 0.15) to F1 && F2. More generally, we have to ensure that the rules for cardinality estimation for various operators are implemented in a way that accounts for such guesses — for instance that we don’t compound guesses. Again, something to think about. > Cost-based Optimizer framework > -- > > Key: SPARK-16026 > URL: https://issues.apache.org/jira/browse/SPARK-16026 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: Spark_CBO_Design_Spec.pdf > > > This is an umbrella ticket to implement a cost-based optimizer framework > beyond broadcast join selection. This framework can be used to implement some > useful optimizations such as join reordering. > The design should discuss how to break the work down into multiple, smaller > logical units. For example, changes to statistics class, system catalog, cost > estimation/propagation in expressions, cost estimation/propagation in > operators can be done in decoupled pull requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
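To make the last point concrete, one common guard (a hypothetical sketch, not part of the design doc) is to damp conjunctions of guessed selectivities instead of multiplying them outright, and to floor the result:
{code}
// Exponential back-off over guessed filter selectivities: the most selective guess
// counts fully, the next at its square root, the next at its 4th root, and so on,
// with a floor to avoid absurdly low estimates.
def combineGuesses(guesses: Seq[Double], floor: Double = 0.01): Double = {
  val damped = guesses.sorted.zipWithIndex.map { case (s, i) => math.pow(s, 1.0 / (1L << i)) }
  math.max(damped.product, floor)
}

combineGuesses(Seq(0.15, 0.15))   // ~0.058 instead of 0.15 * 0.15 = 0.0225
{code}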
[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework
[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468891#comment-15468891 ]

Srinath commented on SPARK-16026:
---------------------------------

I have a couple of comments/questions on the proposal.

Regarding the join reordering algorithm: one of the big wins we should be able to get is avoiding shuffles/broadcasts. If the costing and dynamic programming algorithm doesn't take into account exchange costs and output partitioning, we may produce some bad plans. Here's an example: suppose we start with completely unpartitioned tables A(a), B(b1, b2), C(c) and D(d), in increasing order of size, and let's assume none of them are small enough to broadcast. Suppose we want to optimize the following join:
(A join B on A.a = B.b1) join C on (B.b2 = C.c) join D on (B.b1 = D.d).
Since A, B, C and D are in increasing order of size and we try to minimize intermediate result size, we end up with the following "cheapest" plan (join order A-B-C-D):
{noformat}
Plan I
Join(B.b1 = D.d)
|-Exchange(b1)
| Join(B.b2 = C.c)
| |-Exchange(b2)
| | Join(A.a = B.b1)
| | |-Exchange(a)
| | | A
| | |-Exchange(b1)
| |   B
| |-Exchange(c)
|   C
|-Exchange(d)
  D
{noformat}
Ignoring leaf node sizes, the cost according to the proposed model, i.e. the intermediate data size, is size(A join B) + size(ABC). This is also the size of intermediate data exchanged. But a better plan may be to join to D before C (i.e. join order A-B-D-C) because that would avoid a re-shuffle:
{noformat}
Plan II
Join(B.b2 = C.c)
|-Exchange(B.b2)
| Join(B.b1 = D.d)
| |-Join(A.a = B.b1)
| | |-Exchange(a)
| | | A
| | |-Exchange(b1)
| |   B
| |-Exchange(d)
|   D
|-Exchange(c)
  C
{noformat}
The cost of this plan, i.e. the intermediate data size, is size(AB) + size(ABD), which is higher than Plan I. But the size of intermediate data exchanged is size(ABD), which may be lower than the size(AB) + size(ABC) of Plan I. This plan could be significantly faster as a result.

It should be relatively painless to incorporate partition-awareness into the dynamic programming proposal for cost-based join ordering, with a couple of tweaks:
i) Take into account intermediate data exchanged, not just total intermediate data. For example, a good and simple start would be to use (exchanged-data, total-data) as the cost function, with a preference for the former (i.e. prefer lower exchanged data, and lower total data if the exchanged data is the same). You could certainly have a more complex model, though.
ii) Preserve (i.e. don't prune) partial plans based on output partitioning. E.g. consider a partial plan involving A, B and C: A join B join C may have a different output partitioning than A join C join B. If ACB is more expensive but has an output partitioning scheme that is useful for further joins, it's worth preserving.

Another question I have is regarding statistics: with separate analyze column/analyze table statements it's possible for your statistics to have two different views of the data, leading to weird results and inconsistent cardinality estimates.

For filter factor, what are the default selectivities assumed? We may also want to cap the minimum selectivity, so that C1 && C2 && C3 etc. doesn't lead to ridiculously low cardinality estimates.
> Cost-based Optimizer framework > -- > > Key: SPARK-16026 > URL: https://issues.apache.org/jira/browse/SPARK-16026 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: Spark_CBO_Design_Spec.pdf > > > This is an umbrella ticket to implement a cost-based optimizer framework > beyond broadcast join selection. This framework can be used to implement some > useful optimizations such as join reordering. > The design should discuss how to break the work down into multiple, smaller > logical units. For example, changes to statistics class, system catalog, cost > estimation/propagation in expressions, cost estimation/propagation in > operators can be done in decoupled pull requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
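A small sketch of the two-part cost suggested in point (i) of the comment above (illustrative only): compare candidate plans primarily by bytes exchanged and only then by total intermediate bytes.
{code}
// Hypothetical cost record for a candidate join order.
case class PlanCost(exchangedBytes: BigInt, totalIntermediateBytes: BigInt)

// Lexicographic preference: less data exchanged wins; ties broken by total size.
implicit val planCostOrdering: Ordering[PlanCost] =
  Ordering.by(c => (c.exchangedBytes, c.totalIntermediateBytes))

def cheaper(a: PlanCost, b: PlanCost): PlanCost = planCostOrdering.min(a, b)
{code}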
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447642#comment-15447642 ] Srinath commented on SPARK-17298: - I've updated the description. Hopefully it is clearer. Note that before this change, even with spark.sql.crossJoin.enabled = false, case 1.a may sometimes NOT throw an error (i.e. execute successfully) depending on the physical plan chosen. With the proposed change, it would always throw an error > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration (spark.sql.crossJoin.enabled = false). > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration (spark.sql.crossJoin.enabled = false). By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration with spark.sql.crossJoin.enabled = false. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration (spark.sql.crossJoin.enabled = false). > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default cross_join_. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default cross_join_. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration with spark.sql.crossJoin.enabled = false. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default cross_join_. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration with spark.sql.crossJoin.enabled = false. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Summary: Require explicit CROSS join for cartesian products by default (was: Require explicit CROSS join for cartesian products) > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446920#comment-15446920 ]

Srinath commented on SPARK-17298:
---------------------------------

So if I do the following:

create temporary view nt1 as select * from values
  ("one", 1),
  ("two", 2),
  ("three", 3)
  as nt1(k, v1);

create temporary view nt2 as select * from values
  ("one", 1),
  ("two", 22),
  ("one", 5)
  as nt2(k, v2);

SELECT * FROM nt1, nt2;
-- or
select * FROM nt1 inner join nt2;

The SELECT queries do not in fact result in an error. The proposed change would have them return an error.


> Require explicit CROSS join for cartesian products
> ---------------------------------------------------
>
>         Key: SPARK-17298
>         URL: https://issues.apache.org/jira/browse/SPARK-17298
>     Project: Spark
>  Issue Type: Story
>  Components: SQL
>    Reporter: Srinath
>    Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446682#comment-15446682 ] Srinath commented on SPARK-17298: - You are correct that with this change, queries of the form {noformat} select * from A inner join B {noformat} will now throw an error where previously they would not. The reason for this suggestion is that users may often forget to specify join conditions altogether, leading to incorrect, long-running queries. Requiring explicit cross joins helps clarify intent. Turning on the spark.sql.crossJoin.enabled flag will revert to previous behavior. > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
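As a short sketch of the two escape hatches under the proposed behavior (spark-shell style; A and B as in the example above):
{code}
// Either opt back in to implicit cartesian products globally...
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// ...or state the intent explicitly per query.
spark.sql("SELECT * FROM A CROSS JOIN B")
{code}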
[jira] [Created] (SPARK-17298) Require explicit CROSS join for cartesian products
Srinath created SPARK-17298: --- Summary: Require explicit CROSS join for cartesian products Key: SPARK-17298 URL: https://issues.apache.org/jira/browse/SPARK-17298 Project: Spark Issue Type: Story Components: SQL Reporter: Srinath Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17158) Improve error message for numeric literal parsing
Srinath created SPARK-17158:
--------------------------------

             Summary: Improve error message for numeric literal parsing
                 Key: SPARK-17158
                 URL: https://issues.apache.org/jira/browse/SPARK-17158
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Srinath
            Priority: Minor


Spark currently gives confusing and inconsistent error messages for numeric literals. For example:
{noformat}
scala> sql("select 123456Y")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456Y
---^^^

scala> sql("select 123456S")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456S
---^^^

scala> sql("select 12345623434523434564565L")
org.apache.spark.sql.catalyst.parser.ParseException:
For input string: "12345623434523434564565"(line 1, pos 7)

== SQL ==
select 12345623434523434564565L
---^^^
{noformat}
The problem is that we are relying on the JDK's implementations for parsing, and those functions throw different error messages. This code can be found in the AstBuilder.numericLiteral function.

The proposal is that instead of using `_.toByte` to turn a string into a byte, we always turn the numeric literal string into a BigDecimal, and then we validate the range before turning it into a numeric value. This way, we have more control over the data. If BigDecimal fails to parse the number, we should throw a better exception than "For input string ...".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
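A rough sketch of the proposed direction (illustrative only, not the actual AstBuilder code), shown for the TINYINT (Y) suffix:
{code}
// Parse through BigDecimal first, then range-check before narrowing, so the error
// can name the offending literal and the target type's range. BigDecimal(raw) still
// throws NumberFormatException if the text is not a number at all.
def parseTinyIntLiteral(raw: String): Byte = {
  val v = BigDecimal(raw)
  val min = BigDecimal(Byte.MinValue.toInt)
  val max = BigDecimal(Byte.MaxValue.toInt)
  if (v < min || v > max) {
    throw new IllegalArgumentException(
      s"Numeric literal $raw does not fit in the range [${Byte.MinValue}, ${Byte.MaxValue}] for type tinyint")
  }
  v.toByte
}
{code}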