[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651980#comment-15651980
 ] 

Srinath commented on SPARK-18390:
-

FYI, these are in branch 2.1
{noformat}
commit e6132a6cf10df8b12af8dd8d1a2c563792b5cc5a
Author: Srinath Shankar 
Date:   Sat Sep 3 00:20:43 2016 +0200

[SPARK-17298][SQL] Require explicit CROSS join for cartesian products

{noformat}
and
{noformat}
commit 2d96d35dc0fed6df249606d9ce9272c0f0109fa2
Author: Srinath Shankar 
Date:   Fri Oct 14 18:24:47 2016 -0700

[SPARK-17946][PYSPARK] Python crossJoin API similar to Scala
{noformat}

With the above two changes, if a user requests a cross join (with the crossJoin 
API), the join will always be performed regardless of the physical plan chosen.
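A minimal sketch of the resulting behavior in spark-shell, assuming Spark 2.1 with 
both commits above and spark.sql.crossJoin.enabled left at its default (false):
{code}
val df1 = spark.range(5)
val df2 = spark.range(5)

df1.join(df2).count()      // no join condition: fails with an AnalysisException
df1.crossJoin(df2).count() // explicit cross join: always runs, returns 25
{code}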

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> asked for a cartesian product: Spark SQL picked it while knowing it is not 
> enabled.






[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626971#comment-15626971
 ] 

Srinath commented on SPARK-18209:
-

Given that the Hive metastore lets you store column names and types with the 
view, we could probably forgo the nesting and store just a database hint. 
For example,
{code}
CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
{code}
would store the following definition:
{code}
SELECT * FROM my_table WHERE id > 10 /* current database */
{code}
In fact, for now the "current database" will always be the database of the view.
We should also store or document the effect of other settings, such as 
case-\[in\]sensitive resolution.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide 
> the database context and to perform star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, wrap the 
> original SQL in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parse time, we expand the view using the provided database context.
> (We don't need to use exactly this hint syntax, as I'm merely illustrating 
> the high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change, 
> so the stored schema of the view might differ from the schema the SQL 
> actually produces. In that case, I think we should throw an exception at 
> runtime to warn users. This exception can be controlled by a flag.






[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626838#comment-15626838
 ] 

Srinath commented on SPARK-18209:
-

A practical (positive) consequence of this is query expansion when we have 
nested views:
{noformat}
create table T(a int)
create view A as select * from T
create view B as select * from A
{noformat}
As it stands, the definition of B is frozen at view creation time, so
{noformat}
drop view A
create view A as select * from T2
select * from B
{noformat}
would return data from T even though the definition of A has changed.
If we only expand the view definition at query time, then the above would 
return data from T2.

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide 
> the database context and to perform star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, wrap the 
> original SQL in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parse time, we expand the view using the provided database context.
> (We don't need to use exactly this hint syntax, as I'm merely illustrating 
> the high-level approach here.)
> Note that there is a chance that the underlying base tables' schemas change, 
> so the stored schema of the view might differ from the schema the SQL 
> actually produces. In that case, I think we should throw an exception at 
> runtime to warn users. This exception can be controlled by a flag.






[jira] [Created] (SPARK-18127) Add hooks and extension points to Spark

2016-10-26 Thread Srinath (JIRA)
Srinath created SPARK-18127:
---

 Summary: Add hooks and extension points to Spark
 Key: SPARK-18127
 URL: https://issues.apache.org/jira/browse/SPARK-18127
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Srinath


We need hooks in Spark for:
1. Custom parsers
2. Additional custom analyzer and optimizer rules (a sketch of an existing, related hook follows below)
3. Extending Catalog operations
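
As context, a minimal sketch of one related hook that already exists today, the 
experimental injection of extra optimizer rules; this issue asks for analogous, 
supported extension points for parsers, analyzer rules and catalog operations. 
The rule below is a hypothetical no-op, purely for illustration.
{code}
// Assumes a spark-shell / SparkSession context; the rule itself is a placeholder.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MyNoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // would rewrite the plan here
}

// Register the custom rule through the existing experimental hook.
spark.experimental.extraOptimizations ++= Seq(MyNoopRule)
{code}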






[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-18106:

Description: 
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not "noscan" produces an AnalyzeTableCommand with 
noscan=false

  was:
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not noscan produces an AnalyzeTableCommand with 
noscan=false


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Updated] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-18106:

Description: 
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not noscan produces an AnalyzeTableCommand with 
noscan=false

  was:
{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not {noformat}noscan{noformat} produces an 
AnalyzeTableCommand with {code}noscan=false{code}


> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not noscan produces an AnalyzeTableCommand with 
> noscan=false






[jira] [Created] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-25 Thread Srinath (JIRA)
Srinath created SPARK-18106:
---

 Summary: Analyze Table accepts a garbage identifier at the end
 Key: SPARK-18106
 URL: https://issues.apache.org/jira/browse/SPARK-18106
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Srinath
Priority: Minor


{noformat}
scala> sql("create table test(a int)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("analyze table test compute statistics blah")
res3: org.apache.spark.sql.DataFrame = []
{noformat}

An identifier that is not {noformat}noscan{noformat} produces an 
AnalyzeTableCommand with {code}noscan=false{code}
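
For illustration, a minimal sketch of the intended check, using a hypothetical 
helper rather than the actual AstBuilder code: only the NOSCAN identifier should 
be accepted after COMPUTE STATISTICS, and anything else should be rejected 
instead of silently yielding noscan=false.
{code}
// Hypothetical helper, not the real parser code.
def parseNoscanOption(identifier: Option[String]): Boolean = identifier match {
  case None                                      => false // ANALYZE TABLE t COMPUTE STATISTICS
  case Some(id) if id.equalsIgnoreCase("noscan") => true  // ANALYZE TABLE t COMPUTE STATISTICS NOSCAN
  case Some(other) =>
    sys.error(s"Unexpected token '$other' after COMPUTE STATISTICS") // e.g. "blah"
}
{code}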






[jira] [Created] (SPARK-18013) R cross join API similar to python and Scala

2016-10-19 Thread Srinath (JIRA)
Srinath created SPARK-18013:
---

 Summary: R cross join API similar to python and Scala
 Key: SPARK-18013
 URL: https://issues.apache.org/jira/browse/SPARK-18013
 Project: Spark
  Issue Type: Bug
Reporter: Srinath


https://github.com/apache/spark/pull/14866
and
https://github.com/apache/spark/pull/15493
added an explicit cross join to the Dataset API in Scala and Python, requiring 
crossJoin to be used when there is no join condition. 
(JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
Add an explicit crossJoin to R as well so the API behavior is similar to Scala 
and Python.






[jira] [Created] (SPARK-17946) Python crossJoin API similar to Scala

2016-10-14 Thread Srinath (JIRA)
Srinath created SPARK-17946:
---

 Summary: Python crossJoin API similar to Scala
 Key: SPARK-17946
 URL: https://issues.apache.org/jira/browse/SPARK-17946
 Project: Spark
  Issue Type: Bug
Reporter: Srinath


https://github.com/apache/spark/pull/14866
added an explicit cross join to the Dataset API in Scala, requiring crossJoin 
to be used when there is no join condition. 
(JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
The "join" API in Python was implemented using a cross join in that patch.
Add an explicit crossJoin to Python as well so the API behavior is similar to 
Scala.






[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-10-03 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542830#comment-15542830
 ] 

Srinath commented on SPARK-17074:
-

IMO, if you can get reasonable error bounds (as Tim points out), the method with 
lower overhead is preferable. In general you can't rely on exact statistics 
during optimization anyway, since new data may have arrived since the last stats 
collection.
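
As a concrete (if simplistic) sketch of the lower-overhead, bounded-error 
approach being discussed, DataFrame.stat.approxQuantile (available since Spark 
2.0) can produce approximate equi-height bucket boundaries; the column and 
bucket count below are made up for illustration:
{code}
// spark-shell sketch: approximate equi-height bucket boundaries for one column.
val df = spark.range(0, 1000).toDF("col")

// 5 cut points => 4 buckets, each holding roughly 25% of the rows; the last
// argument is the relative error bound (smaller = more accurate, more work).
val boundaries = df.stat.approxQuantile("col", Array(0.0, 0.25, 0.5, 0.75, 1.0), 0.01)
{code}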

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of a column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distributions. We use the 
> equi-height histogram when the number of distinct values is equal to or 
> greater than 254.






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-09 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478341#comment-15478341
 ] 

Srinath commented on SPARK-16026:
-

Thanks for the response:
1. You’re correct that the search space will increase compared to the attached 
proposal in the pdf. But typically the time spent in optimization (specifically 
search space exploration) is negligible compared to the time spent executing 
the query itself. So it’s usually beneficial to err on the side of exploring 
more plans, especially to explore plans that specifically reduce exchange 
times. Something to think about.
2. Thanks for the explanation. Yes, I believe that some common sense measures 
like updating table stats when column stats are updated will help.
3. You’re absolutely correct that correlated statistics would help with those 
kinds of predicates. But I was pointing out the more basic problem of 
estimating selectivity in the absence of statistics. In the absence of 
statistics we want to ensure that if we’re just “guessing” selectivities for 
filters F1, F2, (say 0.15 each), then we don’t assign unduly low selectivities 
(0.15 * 0.15) to F1 && F2. More generally, we have to ensure that the rules for 
cardinality estimation for various operators are implemented in a way that 
accounts for such guesses — for instance that we don’t compound guesses. Again, 
something to think about.

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-06 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468891#comment-15468891
 ] 

Srinath commented on SPARK-16026:
-

I have a couple of comments/questions on the proposal.
Regarding the join reordering algorithm: 
One of the big wins we should be able to get is avoiding shuffles/broadcasts. 
If the costing and dynamic programming algorithm doesn't take into account 
exchange costs and output partitioning, we may produce some bad plans.
Here's an example: Suppose we start with completely unpartitioned tables A(a), 
B(b1, b2), C(c) and D(d), in increasing order of size and let's assume none of 
them are small enough to broadcast. Suppose we want to optimize the following 
join 
(A join B on A.a = B.b1) join C on (B.b2 = C.c) join D on (B.b1 = D.d).
Since A, B, C and D are in increasing order of size and we try to minimize 
intermediate result size, we end up with the following “cheapest” plan (join 
order A-B-C-D):
{noformat}
Plan I
Join(B.b1 = D.d)
|-Exchange(b1)
|   Join(B.b2 = C.c)
|   |-Exchange(b2)
|   |   Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   |-Exchange(b1)
|   |       B
|   |-Exchange(c)
|       C
|-Exchange(d)
    D
{noformat}
Ignoring leaf node sizes, the cost according to the proposed model, i.e. the 
intermediate data size, is size(AB) + size(ABC). This is also the size of the 
intermediate data exchanged.
But a better plan may be to join D before C (i.e. join order A-B-D-C), because 
that would avoid a re-shuffle:
{noformat}
Plan II
Join(B.b2 = C.c)
|-Exchange(b2)
|   Join(B.b1 = D.d)
|   |-Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   |-Exchange(b1)
|   |       B
|   |-Exchange(d)
|       D
|-Exchange(c)
    C
{noformat}
The cost of this plan, i.e. the intermediate data size, is size(AB) + 
size(ABD), which is higher than that of Plan I. But the size of the intermediate 
data exchanged is size(ABD), which may be lower than Plan I's size(AB) + 
size(ABC). This plan could be significantly faster as a result.

It should be relatively painless to incorporate partition-awareness into the 
dynamic programming proposal for cost-based join ordering, with a couple of 
tweaks:
i) Take into account intermediate data exchanged, not just total intermediate 
data. For example, a good and simple start would be to use (exchanged-data, 
total-data) as the cost function, with a preference for the former (i.e. prefer 
lower exchanged data, and lower total-data if the exchanged data is the same); 
a sketch follows after this list. You could certainly have a more complex 
model, though.
ii) Preserve (i.e. don't prune) partial plans based on output partitioning. 
e.g. consider a partial plan involving A, B and C. A join B join C may have a 
different output partitioning than A join C join B. If ACB is more expensive 
but has an output partitioning scheme that is useful for further joins, its 
worth preserving.
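
A minimal sketch of tweak (i): the Cost class and the byte figures below are 
hypothetical, chosen only to mirror the Plan I / Plan II comparison above, and 
this is not Spark's actual cost model.
{code}
// Hypothetical (exchanged-data, total-data) cost: prefer lower exchanged data,
// breaking ties on lower total intermediate data.
case class Cost(exchangedBytes: Long, totalBytes: Long)

implicit val costOrdering: Ordering[Cost] =
  Ordering.by((c: Cost) => (c.exchangedBytes, c.totalBytes))

// Made-up figures: Plan I exchanges size(AB) + size(ABC); Plan II exchanges only size(ABD).
val planI  = Cost(exchangedBytes = 900L, totalBytes = 900L)
val planII = Cost(exchangedBytes = 600L, totalBytes = 1100L)

val preferred = Seq(planI, planII).min // picks planII despite its larger total size
{code}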

Another question I have is regarding statistics: With separate analyze 
column/analyze table statements it's possible for your statistics to have two 
different views of data, leading to weird results and inconsistent cardinality 
estimates.

For filter factor, what are the default selectivities assumed? We may also 
want to cap the minimum selectivity, so that C1 && C2 && C3 etc. doesn’t lead 
to ridiculously low cardinality estimates.

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447642#comment-15447642
 ] 

Srinath commented on SPARK-17298:
-

I've updated the description. Hopefully it is clearer.
Note that before this change, even with spark.sql.crossJoin.enabled = false,
case 1.a may sometimes NOT throw an error (i.e. execute successfully), depending 
on the physical plan chosen.
With the proposed change, it would always throw an error.

> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration (spark.sql.crossJoin.enabled = false).
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration (spark.sql.crossJoin.enabled = false).
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration with spark.sql.crossJoin.enabled = false.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration (spark.sql.crossJoin.enabled = false).
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
cross_join_.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 

Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration 
flag will disable this check and allow cartesian products without an explicit 
cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default cross_join_.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration with spark.sql.crossJoin.enabled = false.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
cross_join_.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 

Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration with spark.sql.crossJoin.enabled = false.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Summary: Require explicit CROSS join for cartesian products by default  
(was: Require explicit CROSS join for cartesian products)

> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.






[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446920#comment-15446920
 ] 

Srinath commented on SPARK-17298:
-

So if I do the following:

{noformat}
create temporary view nt1 as select * from values
  ("one", 1),
  ("two", 2),
  ("three", 3)
  as nt1(k, v1);

create temporary view nt2 as select * from values
  ("one", 1),
  ("two", 22),
  ("one", 5)
  as nt2(k, v2);

SELECT * FROM nt1, nt2; -- or
select * FROM nt1 inner join nt2;
{noformat}

The SELECT queries do not in fact result in an error today. The proposed change 
would have them throw an error.

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.






[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446682#comment-15446682
 ] 

Srinath commented on SPARK-17298:
-

You are correct that with this change, queries of the form
{noformat}
select * from A inner join B
{noformat}
will now throw an error where previously they would not. 
The reason for this suggestion is that users may often forget to specify join 
conditions altogether, leading to incorrect, long-running queries. Requiring 
explicit cross joins helps clarify intent.

Turning on the spark.sql.crossJoin.enabled flag will revert to previous 
behavior.
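
A short sketch of the proposed behavior from spark-shell, assuming tables A and 
B already exist and spark.sql.crossJoin.enabled is left at its default (false):
{code}
spark.sql("SELECT * FROM A INNER JOIN B").count() // fails: implicit cartesian product
spark.sql("SELECT * FROM A CROSS JOIN B").count() // allowed: the cartesian product is explicit

// Turning the flag on reverts to the previous, permissive behavior.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
{code}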

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.






[jira] [Created] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)
Srinath created SPARK-17298:
---

 Summary: Require explicit CROSS join for cartesian products
 Key: SPARK-17298
 URL: https://issues.apache.org/jira/browse/SPARK-17298
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Srinath


Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration 
flag will disable this check and allow cartesian products without an explicit 
cross join.






[jira] [Created] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Srinath (JIRA)
Srinath created SPARK-17158:
---

 Summary: Improve error message for numeric literal parsing
 Key: SPARK-17158
 URL: https://issues.apache.org/jira/browse/SPARK-17158
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Srinath
Priority: Minor


Spark currently gives confusing and inconsistent error messages for numeric 
literals. For example:
{noformat}
scala> sql("select 123456Y")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456Y
---^^^

scala> sql("select 123456S")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456S
---^^^

scala> sql("select 12345623434523434564565L")
org.apache.spark.sql.catalyst.parser.ParseException:
For input string: "12345623434523434564565"(line 1, pos 7)

== SQL ==
select 12345623434523434564565L
---^^^
{noformat}
The problem is that we are relying on the JDK's implementations for parsing, and 
those functions throw different error messages. This code can be found in the 
AstBuilder.numericLiteral function.
The proposal is that instead of using `_.toByte` to turn a string into a byte, 
we always turn the numeric literal string into a BigDecimal, and then we 
validate the range before turning it into a numeric value. This way, we have 
more control over the data.
If BigDecimal fails to parse the number, we should throw a better exception 
than "For input string ...".


