[jira] [Deleted] (SPARK-13553) Migrate basic inspection operations
[ https://issues.apache.org/jira/browse/SPARK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13553: --- > Migrate basic inspection operations > --- > > Key: SPARK-13553 > URL: https://issues.apache.org/jira/browse/SPARK-13553 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Basic inspection operations > - dtypes > - columns > - printSchema > - explain > - Column accessors > - col > - apply > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
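For reference, the inspection operations and column accessors listed above can be sketched as follows. This is a hypothetical example, not code from the ticket; it assumes a running Spark 2.x {{SparkSession}} named {{spark}} and a made-up {{Person}} case class:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

ds.dtypes        // basic inspection: array of (column name, type string) pairs
ds.columns       // column names
ds.printSchema() // prints the schema tree to stdout
ds.explain()     // prints the physical plan
ds.col("age")    // column accessor: col
ds("age")        // column accessor: apply
```

The point of the migration is that all of these, previously DataFrame-only, become available on any {{Dataset[T]}}.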
[jira] [Deleted] (SPARK-13555) Migrate untyped relational operations
[ https://issues.apache.org/jira/browse/SPARK-13555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13555: --- > Migrate untyped relational operations > - > > Key: SPARK-13555 > URL: https://issues.apache.org/jira/browse/SPARK-13555 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Relational operations > - Untyped relational operations > - select(Column*): Dataset[Row] > - select(String, String*): Dataset[Row] > - selectExpr(String*): Dataset[Row] > {noformat}
[jira] [Deleted] (SPARK-13556) Migrate untyped joins
[ https://issues.apache.org/jira/browse/SPARK-13556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13556: --- > Migrate untyped joins > - > > Key: SPARK-13556 > URL: https://issues.apache.org/jira/browse/SPARK-13556 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Joins > - Untyped joins > - join[U: Encoder](Dataset[U]): Dataset[Row] > - join[U: Encoder](Dataset[U], String): Dataset[Row] > - join[U: Encoder](Dataset[U], Seq[String]): Dataset[Row] > - join[U: Encoder](Dataset[U], Seq[String], String): Dataset[Row] > - join[U: Encoder](Dataset[U], Column): Dataset[Row] > - join[U: Encoder](Dataset[U], Column, String): Dataset[Row] > {noformat}
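The untyped join overloads listed above roughly correspond to the following usage. A hypothetical sketch, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

left.join(right, "id")                        // join on a single column name
left.join(right, Seq("id"))                   // join on a list of column names
left.join(right, Seq("id"), "left_outer")     // ... with an explicit join type
left.join(right, left("id") === right("id"))  // join with a Column condition
```

Each variant returns an untyped {{Dataset[Row]}} (i.e. a DataFrame), regardless of the element types of the inputs.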
[jira] [Deleted] (SPARK-13557) Migrate gather-to-driver actions
[ https://issues.apache.org/jira/browse/SPARK-13557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13557: --- > Migrate gather-to-driver actions > > > Key: SPARK-13557 > URL: https://issues.apache.org/jira/browse/SPARK-13557 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Gather-to-driver actions > - head(Int): Array[T] > - head(): T > - first(): T > - collect(): Array[T] > - collectAsList(): java.util.List[T] > - take(Int): Array[T] > - takeAsList(Int): java.util.List[T] > {noformat}
[jira] [Deleted] (SPARK-13558) Migrate basic GroupedDataset methods
[ https://issues.apache.org/jira/browse/SPARK-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13558: --- > Migrate basic GroupedDataset methods > > > Key: SPARK-13558 > URL: https://issues.apache.org/jira/browse/SPARK-13558 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Support GroupType (GroupBy/GroupingSet/Rollup/Cube) > - Untyped aggregations > - agg((String, String), (String, String)*): Dataset[Row] > - agg(Map[String, String]): Dataset[Row] > - agg(java.util.Map[String, String]): Dataset[Row] > - agg(Column, Column*): Dataset[Row] > {noformat}
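The untyped {{agg}} overloads above can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

df.groupBy("key").agg("value" -> "max")              // agg((String, String), ...)
df.groupBy("key").agg(Map("value" -> "sum"))         // agg(Map[String, String])
df.groupBy("key").agg(max($"value"), avg($"value"))  // agg(Column, Column*)
```

All three forms produce an untyped {{Dataset[Row]}} with one row per group.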
[jira] [Deleted] (SPARK-13559) Migrate common GroupedDataset aggregations
[ https://issues.apache.org/jira/browse/SPARK-13559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13559: --- > Migrate common GroupedDataset aggregations > -- > > Key: SPARK-13559 > URL: https://issues.apache.org/jira/browse/SPARK-13559 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Common untyped aggregations > - mean(String*): Dataset[Row] > - max(String*): Dataset[Row] > - avg(String*): Dataset[Row] > - min(String*): Dataset[Row] > - sum(String*): Dataset[Row] > - Common typed aggregations > - count(): Dataset[(K, Long)] > {noformat}
[jira] [Deleted] (SPARK-13560) Migrate GroupedDataset pivoting methods
[ https://issues.apache.org/jira/browse/SPARK-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13560: --- > Migrate GroupedDataset pivoting methods > --- > > Key: SPARK-13560 > URL: https://issues.apache.org/jira/browse/SPARK-13560 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Pivoting > - pivot(String): GroupedDataset[Row, V] > - pivot(String, Seq[Any]): GroupedDataset[Row, V] > - pivot(String, java.util.List[Any]): GroupedDataset[Row, V] > {noformat}
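The pivoting overloads above can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val df = Seq(("2015", "java", 100), ("2015", "scala", 200), ("2016", "scala", 300))
  .toDF("year", "lang", "sales")

// pivot(String): distinct pivot values are inferred from the data
df.groupBy("year").pivot("lang").sum("sales")

// pivot(String, Seq[Any]): pivot values given explicitly, which avoids an
// extra pass over the data to collect them
df.groupBy("year").pivot("lang", Seq("java", "scala")).sum("sales")
```

Passing the pivot values explicitly is the cheaper form, since the inferred variant must first compute the distinct values of the pivot column.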
[jira] [Created] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
Cheng Lian created SPARK-13817: -- Summary: Re-enable MiMA check after unifying DataFrame and Dataset API Key: SPARK-13817 URL: https://issues.apache.org/jira/browse/SPARK-13817 Project: Spark Issue Type: Test Components: Build Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the DataFrame and Dataset APIs. Since this PR made a large number of API changes, we temporarily disabled the MiMA check for convenience. Now that it is merged, we should re-enable the MiMA check.
[jira] [Deleted] (SPARK-13564) Migrate DataFrameStatFunctions to Dataset
[ https://issues.apache.org/jira/browse/SPARK-13564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13564: --- > Migrate DataFrameStatFunctions to Dataset > - > > Key: SPARK-13564 > URL: https://issues.apache.org/jira/browse/SPARK-13564 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > After the migration, we should have a separate namespace {{Dataset.stat}} for > statistics methods, just like {{DataFrame.stat}}.
[jira] [Deleted] (SPARK-13563) Migrate DataFrameNaFunctions to Dataset
[ https://issues.apache.org/jira/browse/SPARK-13563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13563: --- > Migrate DataFrameNaFunctions to Dataset > --- > > Key: SPARK-13563 > URL: https://issues.apache.org/jira/browse/SPARK-13563 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > After the migration, we should have a separate namespace {{Dataset.na}}, just > like {{DataFrame.na}}.
[jira] [Deleted] (SPARK-13562) Migrate Dataset typed aggregations
[ https://issues.apache.org/jira/browse/SPARK-13562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13562: --- > Migrate Dataset typed aggregations > -- > > Key: SPARK-13562 > URL: https://issues.apache.org/jira/browse/SPARK-13562 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - Untyped aggregations (depends on GroupedDataset) > - groupBy(Column*): GroupedDataset[Row, T] > - groupBy(String, String*): GroupedDataset[Row, T] > - rollup(Column*): GroupedDataset[Row, T] > - rollup(String, String*): GroupedDataset[Row, T] > - cube(Column*): GroupedDataset[Row, T] > - cube(String, String*): GroupedDataset[Row, T] > - agg((String, String), (String, String)*): Dataset[Row] > - agg(Map[String, String]): Dataset[Row] > - agg(java.util.Map[String, String]): Dataset[Row] > - agg(Column, Column*): Dataset[Row] > {noformat}
[jira] [Deleted] (SPARK-13561) Migrate Dataset untyped aggregations
[ https://issues.apache.org/jira/browse/SPARK-13561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13561: --- > Migrate Dataset untyped aggregations > > > Key: SPARK-13561 > URL: https://issues.apache.org/jira/browse/SPARK-13561 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - Typed aggregations (depends on GroupedDataset) > - groupBy[K: Encoder](T => K): GroupedDataset[K, T] // rename to > groupByKey > - groupBy[K](MapFunction[T, K], Encoder[K]): GroupedDataset[K, T] // > Rename to groupByKey > - count > {noformat}
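The typed grouping listed above (renamed to {{groupByKey}} during the unification, as the inline comments note) can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val words = Seq("apple", "avocado", "banana").toDS()

// Typed grouping on a key function (formerly groupBy[K: Encoder](T => K)):
val grouped = words.groupByKey(w => w.substring(0, 1))

// Typed aggregation: one (key, count) pair per group.
grouped.count()  // Dataset[(String, Long)]
```

Unlike the untyped {{groupBy(Column*)}} variants, the result here keeps a typed key, so downstream operations stay in the typed Dataset world.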
[jira] [Deleted] (SPARK-13565) Migrate DataFrameReader/DataFrameWriter to Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13565: --- > Migrate DataFrameReader/DataFrameWriter to Dataset API > -- > > Key: SPARK-13565 > URL: https://issues.apache.org/jira/browse/SPARK-13565 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > We'd like to be able to read/write a Dataset from/to specific data sources. > After the migration, we should have {{Dataset.read}}/{{Dataset.write}}, just > like {{DataFrame.read}}/{{DataFrame.write}}.
[jira] [Created] (SPARK-13822) Follow-ups of DataFrame/Dataset API unification
Cheng Lian created SPARK-13822: -- Summary: Follow-ups of DataFrame/Dataset API unification Key: SPARK-13822 URL: https://issues.apache.org/jira/browse/SPARK-13822 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian This is an umbrella ticket for all follow-up work of DataFrame/Dataset API unification (SPARK-13244).
[jira] [Updated] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13817: --- Issue Type: Sub-task (was: Test) Parent: SPARK-13822 > Re-enable MiMA check after unifying DataFrame and Dataset API > - > > Key: SPARK-13817 > URL: https://issues.apache.org/jira/browse/SPARK-13817 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the > DataFrame and Dataset APIs. Since this PR made a large number of API changes, we > temporarily disabled the MiMA check for convenience. Now that it is merged, we should > re-enable the MiMA check.
[jira] [Created] (SPARK-13826) Revise ScalaDoc of the new Dataset API
Cheng Lian created SPARK-13826: -- Summary: Revise ScalaDoc of the new Dataset API Key: SPARK-13826 URL: https://issues.apache.org/jira/browse/SPARK-13826 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Many DataFrame operations were migrated to Dataset in SPARK-13244. We should revise the ScalaDoc of these APIs. The following things should be updated: - {{@since}} tag - {{@group}} tag - Example code
[jira] [Resolved] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13817. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11656 [https://github.com/apache/spark/pull/11656] > Re-enable MiMA check after unifying DataFrame and Dataset API > - > > Key: SPARK-13817 > URL: https://issues.apache.org/jira/browse/SPARK-13817 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the > DataFrame and Dataset APIs. Since this PR made a large number of API changes, we > temporarily disabled the MiMA check for convenience. Now that it is merged, we should > re-enable the MiMA check.
[jira] [Created] (SPARK-13828) QueryExecution's assertAnalyzed needs to preserve the stacktrace
Cheng Lian created SPARK-13828: -- Summary: QueryExecution's assertAnalyzed needs to preserve the stacktrace Key: SPARK-13828 URL: https://issues.apache.org/jira/browse/SPARK-13828 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian SPARK-13244 made Dataset analysis always eager, and added an extra {{plan}} argument to {{AnalysisException}} to facilitate logical plan analysis debugging using {{QueryExecution.assertAnalyzed}}. (Previously we used to temporarily disable DataFrame eager analysis to report the partially analyzed plan tree.) However, the exception stack trace wasn't properly preserved. It should be added back.
[jira] [Created] (SPARK-13841) Remove Dataset.collectRows() and Dataset.takeRows()
Cheng Lian created SPARK-13841: -- Summary: Remove Dataset.collectRows() and Dataset.takeRows() Key: SPARK-13841 URL: https://issues.apache.org/jira/browse/SPARK-13841 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian These two methods were added because the original {{DataFrame.collect()}} and {{DataFrame.take()}} methods became methods of {{Dataset\[T\]}} after merging the DataFrame and Dataset APIs. However, Java doesn't allow returning generic arrays, and type erasure thus turned their return type into {{Object}}, which broke compilation of Java code. After discussion, we decided to simply resort to the existing {{collectAsList()}} and {{takeAsList()}} methods and remove these two extra specialized ones.
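The erasure problem described above can be illustrated outside Spark with a small standalone Scala sketch ({{Box}} is hypothetical, not one of Spark's classes): because the element type {{T}} is unbounded, {{Array[T]}} erases to {{java.lang.Object}} in the bytecode, which is exactly the unusable return type Java callers would see on {{collect()}}:

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for a generic container with a collect(): Array[T] method.
class Box[T: ClassTag](elems: Seq[T]) {
  def collect(): Array[T] = elems.toArray
}

// From Java's point of view, the erased return type of collect() is Object,
// not an array type, so Java code calling it fails to compile as expected:
val returnType = classOf[Box[_]].getMethods
  .find(_.getName == "collect").get.getReturnType

println(returnType) // class java.lang.Object
```

This is why {{collectAsList(): java.util.List[T]}}, whose erased return type is still {{java.util.List}}, remains Java-friendly while {{collect(): Array[T]}} is not.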
[jira] [Resolved] (SPARK-13841) Remove Dataset.collectRows() and Dataset.takeRows()
[ https://issues.apache.org/jira/browse/SPARK-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13841. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11678 [https://github.com/apache/spark/pull/11678] > Remove Dataset.collectRows() and Dataset.takeRows() > --- > > Key: SPARK-13841 > URL: https://issues.apache.org/jira/browse/SPARK-13841 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > These two methods were added because the original {{DataFrame.collect()}} and > {{DataFrame.take()}} methods became methods of {{Dataset\[T\]}} after merging > the DataFrame and Dataset APIs. However, Java doesn't allow returning generic > arrays, and type erasure thus turned their return type into {{Object}}, which broke > compilation of Java code. After discussion, we decided to simply resort to the > existing {{collectAsList()}} and {{takeAsList()}} methods and remove these > two extra specialized ones.
[jira] [Updated] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12718: --- Assignee: Wenchen Fan (was: Xiao Li) > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details.
[jira] [Created] (SPARK-13910) Should provide a factory method for constructing DataFrames using unresolved logical plan
Cheng Lian created SPARK-13910: -- Summary: Should provide a factory method for constructing DataFrames using unresolved logical plan Key: SPARK-13910 URL: https://issues.apache.org/jira/browse/SPARK-13910 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Before DataFrame and Dataset were merged, there was a public DataFrame constructor that accepted an unresolved logical plan. Now this constructor is gone, replaced by {{Dataset.newDataFrame}}, but {{object Dataset}} is marked as {{private\[sql\]}}. We should make this method public.
[jira] [Created] (SPARK-13911) Having condition and order by cannot both have aggregate functions
Cheng Lian created SPARK-13911: -- Summary: Having condition and order by cannot both have aggregate functions Key: SPARK-13911 URL: https://issues.apache.org/jira/browse/SPARK-13911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1, 1.5.2, 1.4.1, 1.3.1, 2.0.0 Reporter: Cheng Lian Given the following temporary table: {code} sqlContext range 10 select ('id as 'a, 'id as 'b) registerTempTable "t" {code} The following SQL statement can't pass analysis: {noformat} scala> sqlContext sql "SELECT * FROM t GROUP BY a HAVING COUNT(b) > 0 ORDER BY COUNT(b)" show () org.apache.spark.sql.AnalysisException: expression '`t`.`b`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:36) at org.apache.spark.sql.Dataset$.newDataFrame(Dataset.scala:58) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:784) ... 49 elided {noformat} The reason is that analysis rule {{ResolveAggregateFunctions}} only handles the first {{Filter}} _or_ {{Sort}} directly above an {{Aggregate}}.
[jira] [Commented] (SPARK-13911) Having condition and order by cannot both have aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-13911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195821#comment-15195821 ] Cheng Lian commented on SPARK-13911: Another problem related to the {{ResolveAggregateFunctions}} rule is that it invokes the analyzer recursively, which is pretty tricky to understand and maintain. Here is a possible fix that both fixes the above issue and removes the recursive invocation. Considering having condition and order by over aggregations are actually tightly coupled with aggregation during resolution, we probably shouldn't view them as separate constructs while resolving them. One possible fix of this issue is to introduce a new unresolved logical plan node {{UnresolvedAggregate}}: {code} case class UnresolvedAggregate( child: LogicalPlan, groupingExpressions: Seq[Expression], aggregateExpressions: Seq[NamedExpression], havingCondition: Option[Expression] = None, order: Seq[SortOrder] = Nil ) extends UnaryLogicalPlan {code} The major difference between {{UnresolvedAggregate}} and {{Aggregate}} is that it also contains an optional having condition and a list of sort orders. In other words, it's a filtered, ordered aggregate operator. 
Then, we can have two simple rules to merge all adjacent {{Sort}} and {{Filter}} operators directly above an {{UnresolvedAggregate}}: {code} object MergeHavingConditions extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Filter(condition, (agg: UnresolvedAggregate)) => // Combines all having conditions val combinedCondition = (agg.havingCondition.toSeq :+ condition).reduce(And) agg.copy(havingCondition = Some(combinedCondition)) } } object MergeSortsOverAggregates extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Sort(order, _, (agg: UnresolvedAggregate)) => // Only preserves the last sort order agg.copy(order = order) } } {code} (Of course, we also need to make the Dataset API and {{GlobalAggregates}} produce {{UnresolvedAggregate}} instead of {{Aggregate}}.) Finally, we only need to resolve {{UnresolvedAggregate}} into {{Aggregate}} with optional {{Filter}} and {{Sort}}, which is relatively straightforward. Now, we no longer need to invoke the analyzer recursively in {{ResolveAggregateFunctions}} to resolve aggregate functions appearing in having and order by clauses, since they are already merged into {{UnresolvedAggregate}} and can be resolved all together with grouping expressions and aggregate expressions. 
cc yhuai > Having condition and order by cannot both have aggregate functions > -- > > Key: SPARK-13911 > URL: https://issues.apache.org/jira/browse/SPARK-13911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Cheng Lian > > Given the following temporary table: > {code} > sqlContext range 10 select ('id as 'a, 'id as 'b) registerTempTable "t" > {code} > The following SQL statement can't pass analysis: > {noformat} > scala> sqlContext sql "SELECT * FROM t GROUP BY a HAVING COUNT(b) > 0 ORDER > BY COUNT(b)" show () > org.apache.spark.sql.AnalysisException: expression '`t`.`b`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:36) > at org.apache.spark.sql.Dataset$.newDataFrame(Dataset.scala:58) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:784) > ... 49 elided > {noformat} > The reason is that analysis rule {{ResolveAggregateFunctions}} only handles > the first {{Filter}} _or_ {{Sort}} directly above an {{Aggregate}}.
[jira] [Comment Edited] (SPARK-13911) Having condition and order by cannot both have aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-13911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195821#comment-15195821 ] Cheng Lian edited comment on SPARK-13911 at 3/15/16 6:15 PM: - Another problem related to the {{ResolveAggregateFunctions}} rule is that it invokes the analyzer recursively, which is pretty tricky to understand and maintain. Here is a possible fix that both fixes the above issue and removes the recursive invocation. Considering having condition and order by over aggregations are actually tightly coupled with aggregation during resolution, we probably shouldn't view them as separate constructs while resolving them. One possible fix of this issue is to introduce a new unresolved logical plan node {{UnresolvedAggregate}}: {code} case class UnresolvedAggregate( child: LogicalPlan, groupingExpressions: Seq[Expression], aggregateExpressions: Seq[NamedExpression], havingCondition: Option[Expression] = None, order: Seq[SortOrder] = Nil ) extends UnaryLogicalPlan {code} The major difference between {{UnresolvedAggregate}} and {{Aggregate}} is that it also contains an optional having condition and a list of sort orders. In other words, it's a filtered, ordered aggregate operator. 
Then, we can have two simple rules to merge all adjacent {{Sort}} and {{Filter}} operators directly above an {{UnresolvedAggregate}}: {code} object MergeHavingConditions extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Filter(condition, (agg: UnresolvedAggregate)) => // Combines all having conditions val combinedCondition = (agg.havingCondition.toSeq :+ condition).reduce(And) agg.copy(havingCondition = Some(combinedCondition)) } } object MergeSortsOverAggregates extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Sort(order, _, (agg: UnresolvedAggregate)) => // Only preserves the last sort order agg.copy(order = order) } } {code} (Of course, we also need to make the Dataset API and {{GlobalAggregates}} produce {{UnresolvedAggregate}} instead of {{Aggregate}}.) Finally, we only need to resolve {{UnresolvedAggregate}} into {{Aggregate}} with optional {{Filter}} and {{Sort}}, which is relatively straightforward. Now, we no longer need to invoke the analyzer recursively in {{ResolveAggregateFunctions}} to resolve aggregate functions appearing in having and order by clauses, since they are already merged into {{UnresolvedAggregate}} and can be resolved all together with grouping expressions and aggregate expressions. cc [~yhuai]
[jira] [Resolved] (SPARK-13972) hive tests should fail if SQL generation failed
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13972. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11782 [https://github.com/apache/spark/pull/11782] > hive tests should fail if SQL generation failed > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Resolved] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14001. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11818 [https://github.com/apache/spark/pull/11818] > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Created] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
Cheng Lian created SPARK-14002: -- Summary: SQLBuilder should add subquery to Aggregate child when necessary Key: SPARK-14002 URL: https://issues.apache.org/jira/browse/SPARK-14002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian

Adding the following test case to {{LogicalPlanToSQLSuite}} reproduces this issue:
{code}
test("bug") {
  checkHiveQl(
    """SELECT COUNT(id)
      |FROM
      |(
      |  SELECT id FROM t0
      |) subq
    """.stripMargin
  )
}
{code}
The generated (wrong) SQL is:
{code:sql}
SELECT `gen_attr_46` AS `count(id)`
FROM
(
  SELECT count(`gen_attr_45`) AS `gen_attr_46`
  FROM
    SELECT `gen_attr_45`          --
    FROM                          --
    (                             -- A subquery
      SELECT `id` AS `gen_attr_45`-- is missing
      FROM `default`.`t0`         --
    ) AS gen_subquery_0           --
) AS gen_subquery_1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
Cheng Lian created SPARK-14004: -- Summary: AttributeReference and Alias should only use their first qualifier to build SQL representations Key: SPARK-14004 URL: https://issues.apache.org/jira/browse/SPARK-14004 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Current implementation joins all qualifiers, which is wrong. However, this doesn't cause any real SQL generation bugs as there is always at most one qualifier for any given {{AttributeReference}} or {{Alias}}. We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
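The proposed change could look like this self-contained sketch (a hypothetical {{QualifiedName}} model; the real {{AttributeReference}} and {{Alias}} classes carry qualifiers and build SQL differently):
{code}
// Only the first qualifier contributes to the SQL representation;
// any extra qualifiers are ignored instead of being joined together.
case class QualifiedName(name: String, qualifiers: Seq[String]) {
  def sql: String =
    (qualifiers.headOption.toSeq :+ name).map(part => s"`$part`").mkString(".")
}

QualifiedName("id", Seq("t0")).sql      // "`t0`.`id`"
QualifiedName("id", Seq("a", "b")).sql  // "`a`.`id`", not "`a`.`b`.`id`"
{code}
Representing qualifiers as {{Option\[String\]}} would make this at-most-one invariant explicit in the type.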
[jira] [Updated] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14004: --- Priority: Minor (was: Major) > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13974) sub-query names do not need to be globally unique while generate SQL
[ https://issues.apache.org/jira/browse/SPARK-13974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13974: --- Assignee: Wenchen Fan > sub-query names do not need to be globally unique while generate SQL > > > Key: SPARK-13974 > URL: https://issues.apache.org/jira/browse/SPARK-13974 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
[ https://issues.apache.org/jira/browse/SPARK-14002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14002. Resolution: Duplicate Fix Version/s: 2.0.0 This issue is actually covered by SPARK-13976. > SQLBuilder should add subquery to Aggregate child when necessary > > > Key: SPARK-14002 > URL: https://issues.apache.org/jira/browse/SPARK-14002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > Adding the following test case to {{LogicalPlanToSQLSuite}} to reproduce this > issue: > {code} > test("bug") { > checkHiveQl( > """SELECT COUNT(id) > |FROM > |( > | SELECT id FROM t0 > |) subq > """.stripMargin > ) > } > {code} > Generated wrong SQL is: > {code:sql} > SELECT `gen_attr_46` AS `count(id)` > FROM > ( > SELECT count(`gen_attr_45`) AS `gen_attr_46` > FROM > SELECT `gen_attr_45`-- > FROM-- > ( -- A subquery > SELECT `id` AS `gen_attr_45`-- is missing > FROM `default`.`t0` -- > ) AS gen_subquery_0 -- > ) AS gen_subquery_1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12719: --- Assignee: Wenchen Fan > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14001: --- Assignee: Wenchen Fan > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13972) hive tests should fail if SQL generation failed
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13972: --- Assignee: Wenchen Fan > hive tests should fail if SQL generation failed > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14004. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11820 [https://github.com/apache/spark/pull/11820] > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 2.0.0 > > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14000) case class with a tuple field can't work in Dataset
[ https://issues.apache.org/jira/browse/SPARK-14000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14000. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11816 [https://github.com/apache/spark/pull/11816] > case class with a tuple field can't work in Dataset > --- > > Key: SPARK-14000 > URL: https://issues.apache.org/jira/browse/SPARK-14000 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Wenchen Fan > Fix For: 2.0.0 > > > For example, given `case class TupleClass(data: (Int, String))`, we can create an > encoder for it, but when we create a Dataset with it, we fail while > validating the encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
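A hedged repro sketch (requires a Spark session of that era, so it is not runnable standalone; {{Encoders.product}} and {{SQLContext.createDataset}} are the public API entry points assumed here):
{code}
case class TupleClass(data: (Int, String))

// Deriving an encoder for the case class succeeds...
val encoder = org.apache.spark.sql.Encoders.product[TupleClass]

// ...but constructing a Dataset from it reportedly failed during
// encoder validation before this fix.
val ds = sqlContext.createDataset(Seq(TupleClass((1, "a"))))(encoder)
{code}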
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Assignee: Wenchen Fan > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Labels: releasenotes (was: ) > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When > defining a view, now Spark SQL canonicalizes view definition by generating a > canonical SQL statement from the parsed logical query plan, and then stores > it into the catalog. If you hit any problems, you may try to turn off native > view by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Description: Release note update: {quote} Starting from 2.0.0, Spark SQL handles views natively by default. When defining a view, now Spark SQL canonicalizes view definition by generating a canonical SQL statement from the parsed logical query plan, and then stores it into the catalog. If you hit any problems, you may try to turn off native view by setting {{spark.sql.nativeView}} to false. {quote} > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When > defining a view, now Spark SQL canonicalizes view definition by generating a > canonical SQL statement from the parsed logical query plan, and then stores > it into the catalog. If you hit any problems, you may try to turn off native > view by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13774: --- Assignee: Sunitha Kambhampati > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > > Think the error message should be improved for files that could not be found. > The {{Path}} seems given. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13774. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11775 [https://github.com/apache/spark/pull/11775] > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > Fix For: 2.0.0 > > > Think the error message should be improved for files that could not be found. > The {{Path}} seems given. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spar
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Description: Code snippet to reproduce this issue using 1.6.0: {code} select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test {code} It will throw exceptions like this: {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 {noformat} I also tested: {code} select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; {code} {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 {noformat} was: I found a bug: select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test It will throw exceptions like this: Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 I also test: select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > 
Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Target Version/s: 1.6.2 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Assignee: cen yuhai > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13772. Resolution: Fixed Fix Version/s: 1.6.2 Issue resolved by pull request 11605 [https://github.com/apache/spark/pull/11605] > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 1.6.2 > > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
Cheng Lian created SPARK-3414: - Summary: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Critical

Paste the following snippet into {{spark-shell}} (needs Hive support) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. 
When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
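The ordering problem can be sketched with a toy model in plain Scala — hypothetical names throughout, not real Catalyst classes: an early batch lowercases attribute references, so a plan that is registered before analysis never receives that treatment, and the later resolution step cannot match its mixed-case references.

```scala
// Toy model of the ordering bug (hypothetical names, not real Catalyst
// classes): a plan registered in the catalog *before* analysis keeps
// mixed-case references that the later resolution step cannot match.
case class ToyPlan(references: Seq[String], catalogAttributes: Seq[String]) {
  // Stand-in for the CaseInsensitiveAttributeReferences batch.
  def lowercaseRefs: ToyPlan = copy(references = references.map(_.toLowerCase))
  // Stand-in for the Resolution batch: every reference must match an attribute.
  def resolves: Boolean = references.forall(catalogAttributes.contains)
}

// Attributes exposed by the lowercased output schema.
val attrs = Seq("rawlogs.filename", "files.name")
// Registering the unanalyzed plan keeps the mixed-case "rawLogs" reference.
val registeredUnanalyzed = ToyPlan(Seq("rawLogs.filename", "files.name"), attrs)
```

Running `lowercaseRefs` first (i.e. registering the analyzed plan) lets `resolves` succeed; skipping it, as happens when the unanalyzed plan is stored, makes it fail.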
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. 
And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,messag
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. 
And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapP
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Summary: Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names (was: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names) > Case insensitivity breaks when unresolved relation contains attributes with > uppercase letters in their names > > > Key: SPARK-3414 > URL: https://issues.apache.org/jira/browse/SPARK-3414 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Cheng Lian >Priority: Critical > > Paste the following snippet to {{spark-shell}} (need Hive support) to > reproduce this issue: > {code} > import org.apache.spark.sql.hive.HiveContext > val hiveContext = new HiveContext(sc) > import hiveContext._ > case class LogEntry(filename: String, message: String) > case class LogFile(name: String) > sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") > sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") > val srdd = sql( > """ > SELECT name, message > FROM rawLogs > JOIN ( > SELECT name > FROM logFiles > ) files > ON rawLogs.filename = files.name > """) > srdd.registerTempTable("boom") > sql("select * from boom") > {code} > Exception thrown: > {code} > SchemaRDD[7] at RDD at SchemaRDD.scala:103 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > attributes: *, tree: > Project [*] > LowerCaseSchema > Subquery boom >Project ['name,'message] > Join Inner, Some(('rawLogs.filename = name#2)) > LowerCaseSchema > Subquery rawlogs >SparkLogicalPlan (ExistingRdd [filename#0,message#1], > MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) > Subquery files > Project [name#2] >LowerCaseSchema > Subquery logfiles > SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at > 
mapPartitions at basicOperators.scala:208) > {code} > Notice that {{rawLogs}} in the join operator is not lowercased. > The reason is that, during analysis phase, the > {{CaseInsensitiveAttributeReferences}} batch is only executed before the > {{Resolution}} batch. > When {{srdd}} is registered as temporary table {{boom}}, its original > (unanalyzed) logical plan is stored into the catalog: > {code} > Join Inner, Some(('rawLogs.filename = 'files.name)) > UnresolvedRelation None, rawLogs, None > Subquery files > Project ['name] >UnresolvedRelation None, logFiles, None > {code} > notice that attributes referenced in the join operator (esp. {{rawLogs}}) are > not lowercased yet. > And then, when {{select * from boom}} is being analyzed, its input logical > plan is: > {code} > Project [*] > UnresolvedRelation None, boom, None > {code} > here the unresolved relation points to the unanalyzed logical plan of > {{srdd}}, which is later discovered by rule {{ResolveRelations}}: > {code} > === Applying Rule > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === > Project [*]Project [*] > ! UnresolvedRelation None, boom, NoneLowerCaseSchema > ! Subquery boom > ! Project ['name,'message] > ! Join Inner, > Some(('rawLogs.filename = 'files.name)) > !LowerCaseSchema > ! Subquery rawlogs > ! SparkLogicalPlan (ExistingRdd > [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at > basicOperators.scala:208) > !Subquery files > ! Project ['name] > ! LowerCaseSchema > ! Subquery logfiles > !SparkLogicalPlan > (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:208) > {code} > Because the {{CaseInsensitiveAttributeReferences}} batch happens before the > {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) > is not lowercased, which causes the resolution failure. > A reasonable fix for this could be to always register the analyzed logical plan in > the catalog when registering temporary tables. 
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename =
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is therefore not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOper
[jira] [Created] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
Cheng Lian created SPARK-3421: - Summary: StructField.toString should quote the name field to allow arbitrary character as struct field name Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian The original use case is something like this: {code} // JSON snippet with "illegal" characters in field names val json = """{ "a(b)": { "c(d)": "hello" } }""" :: """{ "a(b)": { "c(d)": "world" } }""" :: Nil val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found {code} The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as struct field names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
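The shape of the fix suggested by the summary can be sketched in plain Scala: the parser's field-name rule accepts only bare identifiers, so {{StructField.toString}} could quote names, letting the parser additionally accept a quoted form with arbitrary characters. The regexes and helper names below are hypothetical illustrations, not the actual Spark {{DataType}} parser grammar:

```scala
// Sketch of the limitation and a possible fix. Hypothetical regexes and
// helpers; not the real Spark DataType parser.
val bareIdent   = "^[a-zA-Z0-9_]+$".r   // what the parser accepts today
val quotedIdent = "^`[^`]+`$".r         // a possible quoted form

def isParsableFieldName(name: String): Boolean =
  bareIdent.findFirstIn(name).isDefined || quotedIdent.findFirstIn(name).isDefined

// StructField.toString could emit the quoted form for arbitrary names.
def quote(name: String): String = s"`$name`"
```

A name like `a(b)` fails the bare-identifier rule but would round-trip once quoted.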
[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123556#comment-14123556 ] Cheng Lian commented on SPARK-2537: --- PR [#1440|https://github.com/apache/spark/pull/1440] fixes this issue. > Workaround Timezone specific Hive tests > --- > > Key: SPARK-2537 > URL: https://issues.apache.org/jira/browse/SPARK-2537 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1, 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: > - {{timestamp_1}} > - {{timestamp_2}} > - {{timestamp_3}} > - {{timestamp_udf}} > Their answers differ between different timezones. Caching golden answers > naively causes build failures in other timezones. Currently these tests are > blacklisted. A not so clever solution is to cache golden answers of all > timezones for these tests, then select the right version for the current > build according to system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-2537. --- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 > Workaround Timezone specific Hive tests > --- > > Key: SPARK-2537 > URL: https://issues.apache.org/jira/browse/SPARK-2537 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1, 1.1.0 >Reporter: Cheng Lian >Priority: Minor > Fix For: 1.1.0 > > > Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: > - {{timestamp_1}} > - {{timestamp_2}} > - {{timestamp_3}} > - {{timestamp_udf}} > Their answers differ between different timezones. Caching golden answers > naively causes build failures in other timezones. Currently these tests are > blacklisted. A not so clever solution is to cache golden answers of all > timezones for these tests, then select the right version for the current > build according to system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
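The "cache golden answers of all timezones, then select by system timezone" workaround sketched in the ticket could look like the following — the key/file naming scheme and the UTC fallback are assumptions for illustration, not Spark's actual test-harness layout:

```scala
// Sketch of per-timezone golden-answer selection: pick the cached answer
// matching the build machine's timezone, falling back to a UTC baseline.
// Naming scheme and fallback are assumptions, not Spark's real harness.
import java.util.TimeZone

def goldenAnswerKey(test: String,
                    cached: Set[String],
                    tz: TimeZone = TimeZone.getDefault): String = {
  val specific = s"$test-${tz.getID}"
  if (cached.contains(specific)) specific else s"$test-UTC"
}
```

A build in `America/Los_Angeles` would then compare against `timestamp_1-America/Los_Angeles` when that answer is cached, and against the UTC baseline otherwise.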
[jira] [Created] (SPARK-3440) HiveServer2 and CLI should retrieve Hive result set schema
Cheng Lian created SPARK-3440: - Summary: HiveServer2 and CLI should retrieve Hive result set schema Key: SPARK-3440 URL: https://issues.apache.org/jira/browse/SPARK-3440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Cheng Lian When executing Hive native queries/commands with {{HiveContext.runHive}}, Spark SQL only calls {{Driver.getResults}} and returns a {{Seq\[String\]}}. The schema of the result set is not retrieved, making it impossible to split the row strings into proper columns and assign column names to them. For example, currently every {{NativeCommand}} returns only a single column named {{result}}. For existing Hive applications that rely on result set schemas, this breaks compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
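To see why the schema matters, consider what becomes possible once it is retrieved: the raw row strings can at best be split on a delimiter and paired with the retrieved column names. A minimal sketch — the tab delimiter and helper name are assumptions for illustration, not what Hive or Spark actually guarantees:

```scala
// Sketch: pairing Driver.getResults' raw row strings with column names
// retrieved from the result set schema. The tab delimiter and helper are
// illustrative assumptions only.
def toRows(raw: Seq[String], columnNames: Seq[String]): Seq[Map[String, String]] =
  raw.map { line =>
    val cells = line.split("\t", -1) // assumed CLI-style tab-separated output
    columnNames.zip(cells).toMap
  }
```

Without the schema, only the single-column `result` representation is possible; with it, each row becomes properly named columns.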
[jira] [Created] (SPARK-3448) SpecificMutableRow.update doesn't check for null
Cheng Lian created SPARK-3448: - Summary: SpecificMutableRow.update doesn't check for null Key: SPARK-3448 URL: https://issues.apache.org/jira/browse/SPARK-3448 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Minor Fix For: 1.1.1 {code} test("SpecificMutableRow.update with null") { val row = new SpecificMutableRow(Seq(IntegerType)) row(0) = null assert(row.isNullAt(0)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
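A fix along the lines the failing test above expects might look like this. This is a simplified stand-in class, not Spark's actual {{SpecificMutableRow}}; the point is only that {{update}} must check for null before delegating to a type-specific setter.

```scala
// Simplified sketch of a mutable row whose update method checks for null
// instead of blindly unboxing the value.
class SimpleMutableRow(size: Int) {
  private val values = new Array[Any](size)
  private val nulls  = new Array[Boolean](size)

  def update(ordinal: Int, value: Any): Unit =
    if (value == null) setNullAt(ordinal)            // the missing null check
    else { values(ordinal) = value; nulls(ordinal) = false }

  def setNullAt(ordinal: Int): Unit = {
    values(ordinal) = null
    nulls(ordinal) = true
  }

  def isNullAt(ordinal: Int): Boolean = nulls(ordinal)
}
```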
[jira] [Created] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven
Cheng Lian created SPARK-3515: - Summary: ParquetMetastoreSuite fails when executed together with other suites under Maven Key: SPARK-3515 URL: https://issues.apache.org/jira/browse/SPARK-3515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Reproduction step: {code} mvn -Phive,hadoop-2.4 -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite -pl core,sql/catalyst,sql/core,sql/hive test {code} Maven first instantiates all discovered test suite objects, and only then starts executing the test cases. {{ParquetMetastoreSuite}} sets up several temporary tables in its constructor, but these tables are deleted immediately because {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}. To fix this issue, side effects of this kind should be moved out of the constructor and into {{beforeAll}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven
[ https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132406#comment-14132406 ] Cheng Lian commented on SPARK-3515: --- The bug SPARK-3481 fixed actually covered up the bug mentioned in this ticket. > ParquetMetastoreSuite fails when executed together with other suites under > Maven > > > Key: SPARK-3515 > URL: https://issues.apache.org/jira/browse/SPARK-3515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian > > Reproduction step: > {code} > mvn -Phive,hadoop-2.4 > -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite > -pl core,sql/catalyst,sql/core,sql/hive test > {code} > Maven instantiates all discovered test suite object at first, and then starts > executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary > tables in constructor, but these tables are deleted immediately since > {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}. > To fix this issue, we shouldn't put this kind of side effect in constructor, > but in {{beforeAll}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3552) Thrift server doesn't reset current database for each connection
Cheng Lian created SPARK-3552: - Summary: Thrift server doesn't reset current database for each connection Key: SPARK-3552 URL: https://issues.apache.org/jira/browse/SPARK-3552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Reproduction steps: - Start Thrift server - Connect with beeline {code} ./bin/beeline -u jdbc:hive2://localhost:1/default -n lian {code} - Create an empty database and switch to it {code} 0: jdbc:hive2://localhost:1/default> create database test; 0: jdbc:hive2://localhost:1/default> use test; {code} - Exit beeline and reconnect, specifying "default" as the current database {code} ./bin/beeline -u jdbc:hive2://localhost:1/default -n lian {code} - Now {{SHOW TABLES}} returns nothing, indicating that the current database is still {{test}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3609) Add sizeInBytes statistics to Limit operator
Cheng Lian created SPARK-3609: - Summary: Add sizeInBytes statistics to Limit operator Key: SPARK-3609 URL: https://issues.apache.org/jira/browse/SPARK-3609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly precisely when all output attributes are of native data types, since all native data types except {{StringType}} have fixed sizes. For {{StringType}}, we can use a relatively large default size (say, 4K). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
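The estimation described above amounts to simple arithmetic over the output schema. The sketch below is illustrative, not Spark's actual statistics code: the type hierarchy is reduced to a few stand-in types, the fixed widths are plain JVM primitive sizes, and only the 4K string default comes from the description.

```scala
// Sketch: estimate sizeInBytes of LIMIT n as n * (sum of per-column sizes).
// Fixed-size native types contribute their primitive widths; StringType
// falls back to a large default (4K) because string lengths are unknown.
sealed trait NativeType
case object IntegerType extends NativeType
case object LongType    extends NativeType
case object DoubleType  extends NativeType
case object StringType  extends NativeType

val defaultStringSize: Int = 4 * 1024

def columnSize(t: NativeType): Int = t match {
  case IntegerType           => 4
  case LongType | DoubleType => 8
  case StringType            => defaultStringSize
}

def limitSizeInBytes(limit: Int, schema: Seq[NativeType]): Long =
  limit.toLong * schema.map(columnSize).sum
```

Because the estimate for string columns is deliberately pessimistic, a LIMIT over string-heavy output still errs toward larger sizes, which is the safe direction for broadcast-join decisions.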
[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141510#comment-14141510 ] Cheng Lian commented on SPARK-2271: --- [~pwendell] I can't find a Maven artifact for this. From the Hive JIRA Reynold pointed out, the {{Decimal128}} comes from Microsoft PolyBase, which I think is not open source. > Use Hive's high performance Decimal128 to replace BigDecimal > > > Key: SPARK-2271 > URL: https://issues.apache.org/jira/browse/SPARK-2271 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Lian > > Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
Cheng Lian created SPARK-3654: - Summary: Implement all extended HiveQL statements/commands with a separate parser combinator Key: SPARK-3654 URL: https://issues.apache.org/jira/browse/SPARK-3654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} are currently parsed in a rather hacky way, like this: {code} if (sql.trim.toLowerCase.startsWith("cache table")) { sql.trim.toLowerCase.startsWith("cache table") match { ... } } {code} It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
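The "extension parser first, Hive parser as fallback" idea could look roughly like this. This is a toy sketch, not Spark's actual implementation: the grammar covers only two commands, the result classes are stand-ins for real logical plans, and it assumes {{scala.util.parsing.combinator}} is available (it shipped with the Scala standard library in the versions Spark used at the time).

```scala
import scala.util.parsing.combinator.RegexParsers

// Sketch: a small combinator grammar recognizes the syntax extensions;
// anything it cannot parse falls through to the normal Hive parser.
object ExtendedSqlParser extends RegexParsers {
  case class SetCommand(kv: String)   // stand-ins for real logical plans
  case class CacheTable(name: String)

  private val set: Parser[Any] =
    "(?i)SET".r ~> ".*".r ^^ SetCommand
  private val cache: Parser[Any] =
    "(?i)CACHE\\s+TABLE".r ~> "\\w+".r ^^ CacheTable

  // Try the extension grammar first; on failure, delegate to Hive's parser.
  def parseSql(sql: String, hiveFallback: String => Any): Any =
    parseAll(set | cache, sql) match {
      case Success(plan, _) => plan
      case _                => hiveFallback(sql)
    }
}
```

The key property is that the fallback keeps the extension grammar small: it never needs to understand full HiveQL, only the handful of extra commands.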
[jira] [Created] (SPARK-3713) Use JSON to serialize DataType
Cheng Lian created SPARK-3713: - Summary: Use JSON to serialize DataType Key: SPARK-3713 URL: https://issues.apache.org/jira/browse/SPARK-3713 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Currently we use the compiler-generated {{toString}} method of case classes to serialize {{DataType}} objects, which is fragile and has already introduced bugs (e.g. SPARK-3421). Moreover, we also serialize schemas in this format and write them into generated Parquet metadata. Using JSON would fix all these known and potential issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
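To see why JSON sidesteps the {{toString}} parsing problem, consider a tiny stand-in type tree. This is not Spark's actual {{DataType}} API; the classes below are reduced illustrations, and a real implementation would use a JSON library and escape field names properly. The point is that a field name like {{a(b)}} (which broke the old {{toString}} parser, see SPARK-3421) is unremarkable inside a quoted JSON string.

```scala
// Sketch: serialize a (very reduced) DataType tree to JSON instead of
// relying on case-class toString. Field names land inside quoted JSON
// strings, so characters like '(' and ')' need no special grammar.
sealed trait DataType { def json: String }
case object StringType  extends DataType { def json = "\"string\"" }
case object IntegerType extends DataType { def json = "\"integer\"" }

case class StructField(name: String, dataType: DataType) {
  // Real code would escape quotes/backslashes in `name`.
  def json: String = s"""{"name":"$name","type":${dataType.json}}"""
}

case class StructType(fields: Seq[StructField]) extends DataType {
  def json: String =
    fields.map(_.json).mkString("""{"type":"struct","fields":[""", ",", "]}")
}
```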
[jira] [Updated] (SPARK-3713) Use JSON to serialize DataType
[ https://issues.apache.org/jira/browse/SPARK-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3713: -- Issue Type: Improvement (was: Bug) > Use JSON to serialize DataType > -- > > Key: SPARK-3713 > URL: https://issues.apache.org/jira/browse/SPARK-3713 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > Currently we are using compiler generated {{toString}} method for case > classes to serialize {{DataType}} objects, which is dangerous and already > introduced some bugs (e.g. SPARK-3421). Moreover, we also serialize schema in > this format and write into generated Parquet metadata. Using JSON can fix all > these known and potential issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
Cheng Lian created SPARK-3738: - Summary: InsertIntoHiveTable can't handle strings with "\n" Key: SPARK-3738 URL: https://issues.apache.org/jira/browse/SPARK-3738 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Blocker Try the following snippet in {{sbt/sbt hive/console}} to reproduce: {code} sql("drop table if exists z") case class Str(s: String) sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") table("z").count() {code} Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152765#comment-14152765 ] Cheng Lian commented on SPARK-3738: --- False alarm... it's because Hive's default SerDe uses '\n' as the record delimiter. > InsertIntoHiveTable can't handle strings with "\n" > -- > > Key: SPARK-3738 > URL: https://issues.apache.org/jira/browse/SPARK-3738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Blocker > > Try the following snippet in {{sbt/sbt hive/console}} to reproduce: > {code} > sql("drop table if exists z") > case class Str(s: String) > sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") > table("z").count() > {code} > Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-3738. - Resolution: Invalid False alarm, it's because of Hive's default SerDe, which uses '\n' as record delimiter. > InsertIntoHiveTable can't handle strings with "\n" > -- > > Key: SPARK-3738 > URL: https://issues.apache.org/jira/browse/SPARK-3738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Blocker > > Try the following snippet in {{sbt/sbt hive/console}} to reproduce: > {code} > sql("drop table if exists z") > case class Str(s: String) > sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") > table("z").count() > {code} > Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3791) HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call
Cheng Lian created SPARK-3791: - Summary: HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call Key: SPARK-3791 URL: https://issues.apache.org/jira/browse/SPARK-3791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian The "DBMS Server version" should be Spark version rather than Hive version: {code} ... {"ts":"2014-10-03T07:01:21.679","pid":23188,"tid":"2034","sev":"info","req":"-","sess":"-","site":"-","user":"-","k":"msg","v":"GenericODBCProtocol: DBMS Server version: 0.12.0"} ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3791) HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call
[ https://issues.apache.org/jira/browse/SPARK-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3791: -- Target Version/s: 1.1.1 (was: 1.2.0) > HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call > > > Key: SPARK-3791 > URL: https://issues.apache.org/jira/browse/SPARK-3791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > The "DBMS Server version" should be Spark version rather than Hive version: > {code} > ... > {"ts":"2014-10-03T07:01:21.679","pid":23188,"tid":"2034","sev":"info","req":"-","sess":"-","site":"-","user":"-","k":"msg","v":"GenericODBCProtocol: > DBMS Server version: 0.12.0"} > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
Cheng Lian created SPARK-3810: - Summary: Rule PreInsertionCasts doesn't handle partitioned table properly Key: SPARK-3810 URL: https://issues.apache.org/jira/browse/SPARK-3810 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Minor This issue can be reproduced by the following {{sbt/sbt hive/console}} session: {code} scala> loadTestTable("src") ... scala> loadTestTable("srcpart") ... scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, value FROM src").queryExecution ... == Parsed Logical Plan == InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> Some(hello), hr -> Some(world)), false Project ['key,'value] UnresolvedRelation None, src, None == Analyzed Logical Plan == InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> Some(hello), hr -> Some(world)), false Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
[ https://issues.apache.org/jira/browse/SPARK-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160151#comment-14160151 ] Cheng Lian commented on SPARK-3810: --- This issue is marked as MINOR because it doesn't affect correctness. All the redundant {{Project}}s can be removed by the subsequent optimization phase. > Rule PreInsertionCasts doesn't handle partitioned table properly > > > Key: SPARK-3810 > URL: https://issues.apache.org/jira/browse/SPARK-3810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > This issue can be reproduced by the following {{sbt/sbt hive/console}} > session: > {code} > scala> loadTestTable("src") > ... > scala> loadTestTable("srcpart") > ... > scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, > value FROM src").queryExecution > ... > == Parsed Logical Plan == > InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project ['key,'value] > UnresolvedRelation None, src, None > == Analyzed Logical Plan == > InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
[ https://issues.apache.org/jira/browse/SPARK-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160151#comment-14160151 ] Cheng Lian edited comment on SPARK-3810 at 10/6/14 10:26 AM: - This issue is marked as MINOR because it doesn't affect correctness. All the redundant Projects can be removed by the subsequent optimization phase. was (Author: lian cheng): This issue is marked as MINOR because it doesn't affect correctness. All the redundant {{Project}}s can be removed by the subsequent optimization phase. > Rule PreInsertionCasts doesn't handle partitioned table properly > > > Key: SPARK-3810 > URL: https://issues.apache.org/jira/browse/SPARK-3810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > This issue can be reproduced by the following {{sbt/sbt hive/console}} > session: > {code} > scala> loadTestTable("src") > ... > scala> loadTestTable("srcpart") > ... > scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, > value FROM src").queryExecution > ... > == Parsed Logical Plan == > InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project ['key,'value] > UnresolvedRelation None, src, None > == Analyzed Logical Plan == > InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
[ https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3421. --- Resolution: Fixed > StructField.toString should quote the name field to allow arbitrary character > as struct field name > -- > > Key: SPARK-3421 > URL: https://issues.apache.org/jira/browse/SPARK-3421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The original use case is something like this: > {code} > // JSON snippet with "illegal" characters in field names > val json = > """{ "a(b)": { "c(d)": "hello" } }""" :: > """{ "a(b)": { "c(d)": "world" } }""" :: > Nil > val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) > jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") > java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: > StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), > [1.37] failure: `,' expected but `(' found > {code} > The reason is that, the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} > as struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166812#comment-14166812 ] Cheng Lian commented on SPARK-3892: --- Actually {{MapType.simpleName}} can simply be removed; it's not used anywhere. I forgot to remove it while refactoring. {{DataType.typeName}} is defined as: {code} def typeName: String = this.getClass.getSimpleName.stripSuffix("$").dropRight(4).toLowerCase {code} So concrete {{DataType}} classes don't need to override {{typeName}} as long as their names end with {{Type}}. > Map type should have typeName > - > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
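The derivation in the quoted definition above can be checked with plain strings. Here bare class-name strings stand in for calling {{getClass.getSimpleName}} on the actual {{DataType}} subclasses; note that Scala {{object}}s get a trailing {{$}} in their class names, which is why the definition strips it first.

```scala
// typeName strips a trailing "$" (Scala object class names end in '$'),
// drops the 4-character "Type" suffix, then lowercases:
// "IntegerType$" -> "integer", "MapType" -> "map".
def typeNameOf(simpleClassName: String): String =
  simpleClassName.stripSuffix("$").dropRight(4).toLowerCase
```

This also shows the constraint stated in the comment: the derivation only works for classes whose names end in {{Type}}, since exactly four characters are dropped.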
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166822#comment-14166822 ] Cheng Lian commented on SPARK-3892: --- [~adrian-wang] You're right, it's a typo. So would you mind changing the priority of this ticket to Minor? > Map type should have typeName > - > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166848#comment-14166848 ] Cheng Lian commented on SPARK-3892: --- Ah, while working on the {{DataType}} JSON ser/de PR ([#2563|https://github.com/apache/spark/pull/2563]), I had refactored {{simpleString}} to {{simpleName}}, eventually arrived at the current version, and removed all overrides from subclasses. {{MapType.simpleName}} was not removed partly because it's a member of {{object MapType}}, which is not a subclass of {{DataType}}. Sorry for the trouble and confusion. > Map type do not need simpleName > --- > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166886#comment-14166886 ] Cheng Lian commented on SPARK-3892: --- Please see my comments in the PR :) > Map type do not need simpleName > --- > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3914) InMemoryRelation should inherit statistics of its child to enable broadcast join
Cheng Lian created SPARK-3914: - Summary: InMemoryRelation should inherit statistics of its child to enable broadcast join Key: SPARK-3914 URL: https://issues.apache.org/jira/browse/SPARK-3914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian When a table/query is cached, {{InMemoryRelation}} stores the physical plan rather than the logical plan of the original table/query, thus loses the statistics information and disables broadcast join optimization. Sample {{spark-shell}} session to reproduce this issue: {code} val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Sale(year: Int) makeRDD((1 to 100).map(Sale(_))).registerTempTable("sales") sql("select distinct year from sales limit 10").registerTempTable("tinyTable") cacheTable("tinyTable") sql("select * from sales join tinyTable on sales.year = tinyTable.year").queryExecution.executedPlan ... res3: org.apache.spark.sql.execution.SparkPlan = Project [year#4,year#5] ShuffledHashJoin [year#4], [year#5], BuildRight Exchange (HashPartitioning [year#4], 200) PhysicalRDD [year#4], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Exchange (HashPartitioning [year#5], 200) InMemoryColumnarTableScan [year#5], [], (InMemoryRelation [year#5], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} A workaround for this is to add a {{LIMIT}} operator above the {{InMemoryColumnarTableScan}} operator: {code} sql("select * from sales join (select * from tinyTable limit 10) tiny on sales.year = tiny.year").queryExecution.executedPlan ... 
res8: org.apache.spark.sql.execution.SparkPlan = Project [year#12,year#13] BroadcastHashJoin [year#12], [year#13], BuildRight PhysicalRDD [year#12], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Limit 10 InMemoryColumnarTableScan [year#13], [], (InMemoryRelation [year#13], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
Cheng Lian created SPARK-3919: - Summary: HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure Key: SPARK-3919 URL: https://issues.apache.org/jira/browse/SPARK-3919 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian When using MySQL backed Metastore with {{hive.metastore.schema.verification}} set to {{true}}, HiveThriftServer2 fails to start: {code} 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 org.apache.hive.service.ServiceException: Failed to Start HiveServer2 at org.apache.hive.service.CompositeService.start(CompositeService.java:80) at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hive.service.ServiceException: Unable to connect to MetaStore! at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) at org.apache.hive.service.CompositeService.start(CompositeService.java:70) ... 
10 more Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does not match metastore's schema version 0.12.0 Metastore is not upgraded or corrupt) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) ... 11 more {code} Seems that recent Akka/Protobuf dependency changes are related to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
[ https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3919: -- Description: When using MySQL backed Metastore with {{hive.metastore.schema.verification}} set to {{true}}, HiveThriftServer2 fails to start: {code} 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 org.apache.hive.service.ServiceException: Failed to Start HiveServer2 at org.apache.hive.service.CompositeService.start(CompositeService.java:80) at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hive.service.ServiceException: Unable to connect to MetaStore! at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) at org.apache.hive.service.CompositeService.start(CompositeService.java:70) ... 
10 more Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does not match metastore's schema version 0.12.0 Metastore is not upgraded or corrupt) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) ... 11 more {code} Seems that recent Akka/Protobuf dependency changes are related to this. A valid workaround is to set {{hive.metastore.schema.verification}} to {{false}}. 
[jira] [Commented] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
[ https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168664#comment-14168664 ] Cheng Lian commented on SPARK-3919: --- [~pwendell] Hive Metastore schema verification requires the version string to be exactly the same (except the {{-SNAPSHOT}} suffix). > HiveThriftServer2 fails to start because of Hive 0.12 metastore schema > verification failure > --- > > Key: SPARK-3919 > URL: https://issues.apache.org/jira/browse/SPARK-3919 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > When using MySQL backed Metastore with {{hive.metastore.schema.verification}} > set to {{true}}, HiveThriftServer2 fails to start: > {code} > 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 > org.apache.hive.service.ServiceException: Failed to Start HiveServer2 > at > org.apache.hive.service.CompositeService.start(CompositeService.java:80) > at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.hive.service.ServiceException: Unable to connect to > MetaStore! 
> at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) > at > org.apache.hive.service.CompositeService.start(CompositeService.java:70) > ... 10 more > Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does > not match metastore's schema version 0.12.0 Metastore is not upgraded or > corrupt) > at > org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) > at > org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) > at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) > at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) > ... 11 more > {code} > Seems that recent Akka/Protobuf dependency changes are related to this. 
> A valid workaround is to set {{hive.metastore.schema.verification}} to > {{false}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
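For reference, the workaround above corresponds to a {{hive-site.xml}} fragment along these lines (a minimal illustrative sketch; the exact file location depends on the deployment, and the property can usually also be passed on the command line):

```xml
<!-- Hypothetical hive-site.xml fragment: disables Metastore schema
     verification as a workaround for the version-string mismatch.
     Placement and surrounding configuration depend on your setup. -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
```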
[jira] [Commented] (SPARK-3938) Set RDD name to table name during cache operations
[ https://issues.apache.org/jira/browse/SPARK-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170454#comment-14170454 ] Cheng Lian commented on SPARK-3938: --- A problem here is that after PR [#2501|https://github.com/apache/spark/pull/2501], cached tables may share in-memory columnar RDDs, for example: {code} sql("CREATE TABLE src(key INT, value STRING)") val aRDD = sql("CACHE TABLE a AS SELECT * FROM src") val bRDD = sql("CACHE TABLE b AS SELECT key, value FROM src") {code} The two tables {{a}} and {{b}} share the same underlying in-memory columnar RDD because their queries produce the same results. This can be easily verified from the Web UI. Furthermore, setting names on the resulting {{SchemaRDD}}s ({{aRDD}} and {{bRDD}}) is useless, since the RDD that is actually cached is the underlying in-memory columnar RDD compiled from the logical plan. I think the best we can do is to list all in-memory table names in the in-memory columnar RDD name. On the other hand, the current in-memory columnar RDD name is the string representation of the physical plan, which can be more useful for debugging. > Set RDD name to table name during cache operations > -- > > Key: SPARK-3938 > URL: https://issues.apache.org/jira/browse/SPARK-3938 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Patrick Wendell >Assignee: Cheng Lian > > When we create a table via "CACHE TABLE tbl" or "CACHE TABLE tbl AS SELECT", > we should name the created RDD with the table name. This will allow it to > render nicely in the storage tab, which is necessary when people look at the > storage tab to understand the caching behavior of Spark (e.g. percentage in > cache, etc).
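The sharing behavior described in the comment can be illustrated with a toy model (a hypothetical Python sketch, not Spark code): caching is keyed by the normalized query plan rather than by table name, so two table names can map to a single cached entry, and the best a naming scheme can do is list every table name on the shared entry.

```python
# Toy model of a cache keyed by the (normalized) query plan rather than
# by table name. "SELECT *" and an explicit full column list over the
# same table normalize to the same plan, so both names end up sharing
# one cached entry.
class PlanKeyedCache:
    def __init__(self):
        self._entries = {}  # normalized plan -> cached object
        self._names = {}    # normalized plan -> table names using it

    def cache_table(self, name, plan):
        # Assume `plan` is already a normalized string representation.
        entry = self._entries.setdefault(plan, {"rdd_name": None})
        self._names.setdefault(plan, set()).add(name)
        # Best effort: list every table name in the shared entry's name.
        entry["rdd_name"] = ", ".join(sorted(self._names[plan]))
        return entry

cache = PlanKeyedCache()
a = cache.cache_table("a", "Scan src [key, value]")
b = cache.cache_table("b", "Scan src [key, value]")
assert a is b                 # one shared in-memory entry, not two
print(a["rdd_name"])          # -> "a, b"
```

Renaming the per-table handle (`aRDD` / `bRDD` in the comment) never touches the shared entry, which is why the naming has to happen on the shared object itself.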
[jira] [Created] (SPARK-4000) Gathers unit tests logs to Jenkins master at the end of a Jenkins build
Cheng Lian created SPARK-4000: - Summary: Gathers unit tests logs to Jenkins master at the end of a Jenkins build Key: SPARK-4000 URL: https://issues.apache.org/jira/browse/SPARK-4000 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.1.0 Reporter: Cheng Lian
[jira] [Updated] (SPARK-4000) Gathers unit tests logs to Jenkins master at the end of a Jenkins build
[ https://issues.apache.org/jira/browse/SPARK-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4000: -- Description: Unit test logs can be useful for debugging Jenkins failures. Currently these logs are deleted together with the build directory. We can scp the archived logs to the build history directory on the Jenkins master, and then serve them via HTTP. > Gathers unit tests logs to Jenkins master at the end of a Jenkins build > --- > > Key: SPARK-4000 > URL: https://issues.apache.org/jira/browse/SPARK-4000 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > Unit test logs can be useful for debugging Jenkins failures. Currently these > logs are deleted together with the build directory. We can scp the archived > logs to the build history directory on the Jenkins master, and then serve them > via HTTP.
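The proposed log-gathering step could look roughly like the sketch below (all paths are illustrative placeholders; a real Jenkins job would scp the archive to the master rather than copy it locally):

```shell
# Sketch: collect unit test logs into one archive before the build
# directory is deleted. Directory names here are stand-ins only.
set -e
build_dir=$(mktemp -d)     # stands in for the Jenkins build workspace
history_dir=$(mktemp -d)   # stands in for the master's build-history dir

# Simulate a module that produced a unit test log during the build.
mkdir -p "$build_dir/core/target"
echo "test output" > "$build_dir/core/target/unit-tests.log"

# Archive every unit-tests.log found under the build directory...
tar -czf "$history_dir/unit-test-logs.tgz" -C "$build_dir" \
    $(cd "$build_dir" && find . -name 'unit-tests.log')

# ...then a real job would ship the archive to the Jenkins master, e.g.:
#   scp "$history_dir/unit-test-logs.tgz" jenkins-master:/path/to/history/
# after which the master can serve it over HTTP.
ls "$history_dir"
```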
[jira] [Commented] (SPARK-4037) NPE in JDBC server when calling SET
[ https://issues.apache.org/jira/browse/SPARK-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179305#comment-14179305 ] Cheng Lian commented on SPARK-4037: --- This is a regression of SPARK-2814, introduced in SPARK-3729. One solution is to let HiveContext always reuse the SessionState started within the same thread, and create a new one only if there is none. In this way, we don't need to override the SessionState field in HiveThriftServer2, thus eliminating this issue. > NPE in JDBC server when calling SET > --- > > Key: SPARK-4037 > URL: https://issues.apache.org/jira/browse/SPARK-4037 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > {code} > SET spark.sql.shuffle.partitions=10; > {code} > {code} > 14/10/21 18:00:47 ERROR server.SparkSQLOperationManager: Error executing > query: > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244) > at > org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64) > ... > {code}
[jira] [Commented] (SPARK-4037) NPE in JDBC server when calling SET
[ https://issues.apache.org/jira/browse/SPARK-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179598#comment-14179598 ] Cheng Lian commented on SPARK-4037: --- I think we can safely remove the global singleton SessionState created in HiveThriftServer2, and replace it with the SessionState field instance in HiveContext. The current global singleton SessionState design dates back to Shark (SharkContext.sessionState). It basically breaks session isolation in Shark, and later in HiveThriftServer2, because all connections/sessions share a single SessionState instance. For example, switching the current database in one connection also affects other concurrent connections (SPARK-3552). Fixing this properly requires major refactoring of HiveContext. Considering that the 1.2.0 release is approaching, and Hive support will probably be rewritten against the newly introduced external data source API, I'd like to fix this specific SessionState initialization issue for the Spark 1.2.0 release and leave multi-user support to a later release. > NPE in JDBC server when calling SET > --- > > Key: SPARK-4037 > URL: https://issues.apache.org/jira/browse/SPARK-4037 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > {code} > SET spark.sql.shuffle.partitions=10; > {code} > {code} > 14/10/21 18:00:47 ERROR server.SparkSQLOperationManager: Error executing > query: > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244) > at > org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64) > ... 
> {code}
[jira] [Closed] (SPARK-3939) NPE caused by SessionState.out not set in thriftserver2
[ https://issues.apache.org/jira/browse/SPARK-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-3939. - Resolution: Duplicate > NPE caused by SessionState.out not set in thriftserver2 > --- > > Key: SPARK-3939 > URL: https://issues.apache.org/jira/browse/SPARK-3939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > > a simple 'set' query can reproduce this in thriftserver.
[jira] [Commented] (SPARK-3939) NPE caused by SessionState.out not set in thriftserver2
[ https://issues.apache.org/jira/browse/SPARK-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179749#comment-14179749 ] Cheng Lian commented on SPARK-3939: --- Ah, actually it's SPARK-4037 that duplicates this ticket, since this one came earlier. I only realized this after closing the ticket; sorry for the confusion... > NPE caused by SessionState.out not set in thriftserver2 > --- > > Key: SPARK-3939 > URL: https://issues.apache.org/jira/browse/SPARK-3939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > > a simple 'set' query can reproduce this in thriftserver.
[jira] [Commented] (SPARK-4021) Issues observed after upgrading Jenkins to JDK7u71
[ https://issues.apache.org/jira/browse/SPARK-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181199#comment-14181199 ] Cheng Lian commented on SPARK-4021: --- Hi [~shaneknapp], I think this shows some clue, notice the GCJ 1.5.0.0 {{javac}}: {code} [lian@amp-jenkins-slave-05 ~]$ locate javac /etc/alternatives/javac /usr/bin/javac /usr/java/jdk1.7.0_71/bin/javac /usr/java/jdk1.7.0_71/man/ja_JP.UTF-8/man1/javac.1 /usr/java/jdk1.7.0_71/man/man1/javac.1 /usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/bin/javac /usr/lib64/R/etc/javaconf /usr/share/vim/vim72/compiler/javac.vim /usr/share/vim/vim72/syntax/javacc.vim /var/lib/alternatives/javac [lian@amp-jenkins-slave-05 ~]$ ls -alh /etc/alternatives/javac lrwxrwxrwx. 1 root root 37 Sep 29 17:17 /etc/alternatives/javac -> /usr/lib/jvm/java-1.5.0-gcj/bin/javac [lian@amp-jenkins-slave-05 ~]$ {code} > Issues observed after upgrading Jenkins to JDK7u71 > -- > > Key: SPARK-4021 > URL: https://issues.apache.org/jira/browse/SPARK-4021 > Project: Spark > Issue Type: Bug > Components: Project Infra > Environment: JDK 7u71 >Reporter: Patrick Wendell >Assignee: shane knapp > > The following compile failure was observed after adding JDK7u71 to Jenkins. > However, this is likely a misconfiguration from Jenkins rather than an issue > with Spark (these errors are specific to JDK5, in fact). > {code} > [error] -- > [error] 1. WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 83) > [error] private static final Logger logger = > Logger.getLogger(JavaKinesisWordCountASL.class); > [error] ^^ > [error] The field JavaKinesisWordCountASL.logger is never read locally > [error] -- > [error] 2. 
WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 151) > [error] JavaDStream words = unionStreams.flatMap(new > FlatMapFunction() { > [error] > ^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 3. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 153) > [error] public Iterable call(byte[] line) { > [error] ^ > [error] The method call(byte[]) of type new > FlatMapFunction(){} must override a superclass method > [error] -- > [error] 4. WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 160) > [error] new PairFunction() { > [error] ^^^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 5. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 162) > [error] public Tuple2 call(String s) { > [error] ^^ > [error] The method call(String) of type new > PairFunction(){} must override a superclass method > [error] -- > [error] 6. 
WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 165) > [error] }).reduceByKey(new Function2() { > [error] ^^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 7. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 167) > [error] public Integer call(Integer i1, Integer i2) { > [error] > [error] The method call(Integer, Integer) of type new > Function2(){} must override a superclass method > [error] -- > [error] 7 prob
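On the misconfigured {{javac}} noted in the comment above: a generic way to see which real binary an alternatives-managed command resolves to is to chase the symlink chain (a sketch; demonstrated here with {{sh}}, since {{javac}} may not be installed, but on the affected box `resolve javac` would have revealed the GCJ binary):

```shell
# Resolve a command on PATH through any symlink chain (such as
# /etc/alternatives) to its real path.
resolve() {
    readlink -f "$(command -v "$1")"
}

# `sh` exists everywhere; prints the real path of the sh binary.
resolve sh
```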
[jira] [Created] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
Cheng Lian created SPARK-4091: - Summary: Occasionally spark.local.dir can be deleted twice and causes test failure Key: SPARK-4091 URL: https://issues.apache.org/jira/browse/SPARK-4091 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Cheng Lian By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following 
result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtEx
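The failure mode above (two code paths, {{DiskBlockManager.stop}} and the shutdown hook, racing to delete the same directory) can be avoided by making the recursive delete tolerant of an already-removed tree. A minimal Python sketch of that idea (not the actual Spark fix):

```python
import os
import shutil
import tempfile

def delete_recursively(path):
    # Idempotent recursive delete: silently succeed if another code
    # path (e.g. a shutdown hook) already removed the directory,
    # instead of failing while trying to list its files.
    shutil.rmtree(path, ignore_errors=True)

d = tempfile.mkdtemp()
open(os.path.join(d, "block"), "w").close()

delete_recursively(d)   # first caller, e.g. DiskBlockManager.stop
delete_recursively(d)   # second caller, e.g. the shutdown hook: no error
assert not os.path.exists(d)
```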
[jira] [Updated] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4091: -- Description: By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook 
installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:145)] {code} When this bug happens during Jenkins build, it fails {{CliSuite}}. was: By p
[jira] [Commented] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184741#comment-14184741 ] Cheng Lian commented on SPARK-4091: --- Yes, thanks [~joshrosen], closing this. > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at 
org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) >
[jira] [Closed] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-4091. - Resolution: Duplicate > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, 
setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > 
org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) > > org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.appl
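The two stack traces above show the same `spark.local.dir` subdirectory being deleted once from {{DiskBlockManager.stop}} and once from the shutdown hook installed in {{Utils}}; the second attempt fails because listing a directory that no longer exists returns null. A minimal Python sketch (not Spark's actual code) of an idempotent recursive delete that tolerates this race:

```python
import os
import shutil
import tempfile


def delete_recursively(path):
    """Idempotent recursive delete: tolerates the directory having
    already been removed by another cleanup path (e.g. a shutdown
    hook racing with an explicit stop()), instead of failing the
    way a null listFiles() result does on the second attempt."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # Already deleted elsewhere -- treat as success.
        pass


# Simulate the double deletion described in the stack traces above:
local_dir = tempfile.mkdtemp(prefix="spark-local-")
delete_recursively(local_dir)  # first caller, e.g. DiskBlockManager.stop
delete_recursively(local_dir)  # second caller, e.g. the Utils shutdown hook
assert not os.path.exists(local_dir)
```

Spark's eventual fix instead avoided registering the same directory for cleanup twice, but either approach makes the second delete harmless.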
[jira] [Created] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
Cheng Lian created SPARK-4119: - Summary: Don't rely on HIVE_DEV_HOME to find .q files Key: SPARK-4119 URL: https://issues.apache.org/jira/browse/SPARK-4119 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Priority: Minor After merging in Hive 0.13.1 support, a bunch of .q files and golden answer files got updated. Unfortunately, some .q files were also updated in Hive itself. For example, an ORDER BY clause was added to groupby1_limit.q as a bug fix. With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with false test failures, because .q files are looked up from HIVE_DEV_HOME and outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
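The failure mode above is a stale local checkout shadowing the .q files the tests were written against. One way to avoid it is to resolve each .q file from the project's bundled copy first and fall back to HIVE_DEV_HOME only when no bundled copy exists. A hedged sketch (the `resolve_q_file` helper and the directory layout are hypothetical, for illustration only):

```python
import os


def resolve_q_file(name, bundled_dir, hive_dev_home=None):
    """Resolve a Hive test .q file, preferring the copy bundled with
    the project over one found under HIVE_DEV_HOME, so a checkout of
    a different Hive version cannot shadow the expected file."""
    bundled = os.path.join(bundled_dir, name)
    if os.path.isfile(bundled):
        return bundled
    if hive_dev_home:
        # Assumed query-file location under a Hive source tree.
        candidate = os.path.join(
            hive_dev_home, "ql", "src", "test", "queries",
            "clientpositive", name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(name)
```

With this lookup order, groupby1_limit.q always resolves to the version matching the bundled golden answer files, regardless of which Hive branch HIVE_DEV_HOME points at.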
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187757#comment-14187757 ] Cheng Lian commented on SPARK-3683: --- [~davies] I removed this special case for "NULL" because with it we have no way to represent the literal string {{"NULL"}}. Maybe I was wrong here; I'm validating Hive's behavior in this case. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187768#comment-14187768 ] Cheng Lian commented on SPARK-3683: --- Actually, a Hive session was illustrated in SPARK-1959, and it seems that Hive interprets {{"NULL"}} as a literal string whose content is "NULL" rather than a null value. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188194#comment-14188194 ] Cheng Lian commented on SPARK-3683: --- [~jamborta] Your concern is legitimate. Unfortunately, however, we have to take Hive compatibility into consideration in this case; otherwise people who run legacy Hive scripts with Spark SQL may get wrong query results. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
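The thread above turns on one ambiguity: if the driver maps the string "NULL" coming back from Hive to Python None, a genuine literal string "NULL" becomes unrepresentable; if it does not, real SQL NULLs must be distinguished some other way. A hypothetical converter (not PySpark's actual code) makes the trade-off concrete:

```python
def convert_hive_value(value, treat_null_string_as_none=False):
    """Hypothetical converter illustrating the trade-off discussed
    above. With the special case enabled, a literal string "NULL"
    becomes indistinguishable from a SQL NULL; per the SPARK-1959
    discussion, Hive itself keeps "NULL" as ordinary text."""
    if treat_null_string_as_none and value == "NULL":
        return None
    return value


# The special case collapses two distinct values into one:
assert convert_hive_value("NULL", treat_null_string_as_none=True) is None
assert convert_hive_value("NULL") == "NULL"  # Hive-compatible behavior
assert convert_hive_value(None) is None      # real SQL NULLs still map to None
```

This is why the conversion has to happen at the Hive boundary, where a real NULL and the four-character string are still distinguishable, rather than by pattern-matching string values afterwards.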