[GitHub] spark pull request: [SPARK-9935][SQL] EqualNotNull not processed i...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8163 [SPARK-9935][SQL] EqualNotNull not processed in ORC https://issues.apache.org/jira/browse/SPARK-9935 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8163.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8163 commit 755affe90e76535e5968017c0a2186d6b7e886a8 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-13T10:40:24Z [SPARK-9935][SQL] EqualNotNull not processed in ORC --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10035][SQL] Parquet filters does not pr...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8275 [SPARK-10035][SQL] Parquet filters does not process EqualNullSafe filter. As I discussed with Lian: 1. Added EqualNullSafe to ParquetFilters - it uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs a null-safe equality comparison. 2. Updated the test code (ParquetFilterSuite) - convert catalyst.Expression to sources.Filter - removed Cast, since only Literal is picked up as a proper Filter in DataSourceStrategy - added an EqualNullSafe comparison. 3. Removed the deprecated createFilter for catalyst.Expression. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8275 commit 98ccf859fbfd7cce89bf56d625b0bd214638621a Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:41:00Z [SPARK-10035][SQL] Parquet filters does not process EqualNullSafe filter. commit cf24b243eeb03eebb38ff14eeb9c61b9ead3178f Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:41:57Z [SPARK-10035][SQL] Update ParquetFilterSuite for sources.Filter test. commit f35030f74f4bc0cf89fe815758f443a1912d1a95 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:46:42Z [SPARK-10035][SQL] Remove deprecated createFilter() in ParquetFilters (for catalyst.Expression).
[GitHub] spark pull request: [SPARK-10035][SQL] Parquet filters does not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8275#issuecomment-132819486 Could you merge this :) ?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8391 [SPARK-10180] [SQL] JDBCRDD does not process EqualNullSafe filter. https://issues.apache.org/jira/browse/SPARK-10180 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8391.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8391 commit d167acbe5a6007f2a41c27f552795976512c847e Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-24T07:09:18Z [SPARK-10180] [SQL] JDBCRDD does not process EqualNullSafe filter.
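For context, compiling `EqualNullSafe` for JDBC amounts to emitting a WHERE fragment that is true when both sides are NULL, or both are non-NULL and equal. A minimal standalone sketch (the names `JdbcFilterSketch`, `compileEqualNullSafe` and `quote` are illustrative, not Spark's actual API):

```scala
// Hedged sketch: one portable-SQL translation of null-safe equality.
// Not Spark's actual JDBCRDD code; the names here are hypothetical.
object JdbcFilterSketch {
  // Render a literal for embedding in SQL (strings get single quotes).
  private def quote(v: Any): String = v match {
    case s: String => s"'$s'"
    case other     => other.toString
  }

  // True when both sides are NULL, or both are non-NULL and equal.
  def compileEqualNullSafe(col: String, value: Any): String = {
    val lit = quote(value)
    s"((NOT ($col != $lit OR $col IS NULL OR $lit IS NULL)) " +
      s"OR ($col IS NULL AND $lit IS NULL))"
  }
}
```

For example, `compileEqualNullSafe("a", 1)` yields a predicate that treats two NULLs as equal, which a plain `a = 1` would not.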
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8096 [SPARK-9814][SQL] EqualNotNull not passing to data sources You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8096.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8096 commit fed1c4e30d37b8a0e9e2f3c87c15bc6cfe0c4534 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-11T05:22:06Z [SPARK-9814][SQL] EqualNotNull not passing to data sources
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8096#discussion_r36715585 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala --- @@ -38,6 +38,14 @@ case class EqualTo(attribute: String, value: Any) extends Filter /** * A filter that evaluates to `true` iff the attribute evaluates to a value + * null safe and equal to `value`. --- End diff -- Should I write like this? A filter that evaluates to `true` if the attribute evaluates to a value equal to `value` and if both the attribute and the value are null.
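The semantics being documented above can be sketched in plain Scala, modelling nullable values as `Option` (an illustrative sketch only, not Spark's implementation): the filter evaluates to true when both sides are null, or when both are non-null and equal.

```scala
// Hedged sketch of null-safe equality semantics, using Option as "nullable".
object NullSafeEq {
  def eqNullSafe(a: Option[Any], b: Option[Any]): Boolean = (a, b) match {
    case (None, None)       => true   // both null: true (unlike EqualTo)
    case (Some(x), Some(y)) => x == y // both present: ordinary equality
    case _                  => false  // exactly one side null: false
  }
}
```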
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8096#issuecomment-129835273 Reset this please :)
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8391#issuecomment-134615781 Hm.. one thing I want to mention: it looks like there is no test code for JdbcRelation, so I tested this with separate copied functions. It is really just about building text (the SQL query) though.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9327 [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails When schema merging (mergedSchema) and predicate filtering are enabled together, this fails because Parquet filters are pushed down regardless of the schema of each split (or rather, each file). This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when a merged schema is used. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11103 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9327.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9327 commit 85dadbcb458eec1067b6bae838ca930b8d958f4a Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-10-28T08:30:15Z [SPARK-11103][SQL] Disable predicate push down when using merged schema for Parquet.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9327#issuecomment-152054929 /cc @liancheng
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155720490 I will try to find and test them first tomorrow before adding a commit!
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155280757 I used `sortBy` instead of `sortWith`
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154890873 retest this please
[GitHub] spark pull request: [SPARK-9557][SQL] Refactor ParquetFilterSuite ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9554 [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code Actually this was resolved by https://github.com/apache/spark/pull/8275, but the JIRA issue is not marked as resolved, because that PR was opened for another issue (it ended up resolving both). I commented there that it is resolved by that PR; still, I opened this PR because I would like to add a few small corrections. In the previous PR, I refactored the test to collect filters without reducing them; however, DataSourceStrategy actually creates filters after reducing, which is what the original test code (before the refactoring) followed. In this PR, I go back to the original way of collecting filters by reducing. I would be happy to close this if it is inappropriate, or if somebody would rather deal with it in a separate related PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-9557 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9554.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9554 commit b2b3631d2e49b232954c64fc196a49b8736fa87b Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-09T02:15:21Z [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247117 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
       // You should enable this configuration only if you are very sure that for the parquet
       // part-files to read there are corresponding summary files containing correct schema.
+      // As filed in SPARK-11500, the order of the files to touch matters, as it might affect
+      // the ordering of the output columns. There are several things to mention here.
+      //
+      //  1. If the mergeRespectSummaries config is false, then it merges schemas by reducing from
+      //     the first part-file, so that the columns of the first file show first.
+      //
+      //  2. If the mergeRespectSummaries config is true, then there should be, at least,
+      //     "_metadata"s for all given files. So, we can ensure that the columns of the first file
+      //     show first.
+      //
+      //  3. If shouldMergeSchemas is false but multiple files are given, there is
+      //     no guarantee of the output order, since there might not be a summary file for the
+      //     first file, which ends up putting the columns of the other files ahead. However,
+      //     this should be okay, since not enabling shouldMergeSchemas means (assumes) all the
+      //     files have the same schemas.
+
       val needMerged: Seq[FileStatus] = if (mergeRespectSummaries) {
         Seq()
       } else {
         dataStatuses
       }
-      (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+      needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff -- Yes, I think I should sort them. It looks like it is not really recommended to rely on the listing order as-is, even though the results appear sorted, judging from [this link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html).
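The ordering argument in point 1 of the comment above can be illustrated with a toy merge over column-name lists (a sketch, not Spark's actual schema merging): reducing from the first part-file keeps its columns in front, because each merge step only appends columns not yet seen.

```scala
// Illustrative sketch: a left-to-right schema merge preserves first-seen
// column order. SchemaMergeOrder and mergeColumns are hypothetical names.
object SchemaMergeOrder {
  def mergeColumns(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.foldLeft(Seq.empty[String]) { (merged, cols) =>
      merged ++ cols.filterNot(merged.contains)
    }
}
```

With the three files from the test in this PR (columns `a,b`, `c,b`, `d,b`), the merged order comes out as `a, b, c, d` as long as the first file is touched first.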
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247087 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff -- Thanks for the comments!
[GitHub] spark pull request: [SPARK-10113][SQL] Support for unsigned Parque...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9646 [SPARK-10113][SQL] Support for unsigned Parquet logical types Parquet supports some unsigned datatypes. However, since Spark does not support unsigned datatypes, it needs to emit an exception with a clear message rather than one saying the datatype is illegal. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-10113 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9646.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9646 commit 49e162480e33a0379e1b04979666533e66e33f6e Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T01:00:56Z [SPARK-10113][SQL] Support for unsigned Parquet logical types commit 2752c14331539bda7ccd7e1bc5a2887da2a2b4cd Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T01:01:26Z [SPARK-10113][SQL] Add a variable for readability
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44612005 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } } + + test("actual filter push down") { --- End diff -- It is just a nit (maybe only for me).. should this be named something like "SPARK-11661 Still pushdown filters returned by unhandledFilters"?
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155973698 @liancheng I gave it a few tries to figure out the version, but.. as you said, it is pretty tricky to check the writer version, as it only changes the version of the data pages, which we can know only from Parquet's internals. Would it be too inappropriate if we write Parquet files with both version 1 and version 2 and then check whether the sizes of the two are equal? Since the encoding types are different, the sizes should also differ.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44607471 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff -- I wanted to file a single issue for this but made separate ones, since a separate test might be needed for ORC: there is no such OrcFilterPredicateSuite (I think), due to the difficulty of checking the objects that are pushed down. Filed here for Parquet and ORC. https://issues.apache.org/jira/browse/SPARK-11676 https://issues.apache.org/jira/browse/SPARK-11677
[GitHub] spark pull request: [SPARK-10113][SQL] Support for unsigned Parque...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9646#issuecomment-155986820 cc @liancheng
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156046292 retest this please
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9655 [SPARK-11687] Mixed usage of fold and foldLeft, reduce and reduceLeft and reduceOption and reduceLeftOption https://issues.apache.org/jira/browse/SPARK-11687 As can be seen here https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/TraversableOnce.scala, `fold`, `reduce` and `reduceOption` have been available since Scala 2.9.x. In addition, the implementations of `fold`, `reduce` and `reduceOption` simply delegate to `foldLeft`, `reduceLeft` and `reduceLeftOption` without any additional logic. For developers not used to Scala, the mixed usage might raise the question of whether a call behaves like reduceRight or reduceLeft (as everyone knows, the direction can change the result). In some places both forms are even used for identical work. This is just a small nit, but I think it would be better to use only `xxxLeft` for readability. Fortunately, there are not many usages of the plain `xxx` forms, so in this PR I changed them to `xxxLeft`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11687 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9655.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9655 commit 794919ffdba242595bf3ad14dd9c53f68b82 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T07:19:35Z [SPARK-11687][SQL] Mixed usage of fold and foldLeft, reduce and reduceLeft and reduceOption and reduceLeftOption
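The delegation described above is easy to check: for standard Scala collections, `fold` and `reduce` produce the same results as `foldLeft` and `reduceLeft`, so switching to the explicit `xxxLeft` variants changes readability, not behavior. A minimal sketch:

```scala
// Sketch: fold/reduce vs their explicit Left counterparts on a Seq.
object FoldVsFoldLeft {
  val xs = Seq(1, 2, 3, 4)
  val viaFold     = xs.fold(0)(_ + _)     // delegates to foldLeft for Seq
  val viaFoldLeft = xs.foldLeft(0)(_ + _) // direction is explicit

  // With a non-commutative operation, being explicit about direction helps:
  val concatLeft = Seq("a", "b", "c").reduceLeft(_ + _) // "abc"
}
```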
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9658 [SPARK-11692][SQL] Support for Parquet logical types, JSON and BSON (embedded types) Parquet supports some JSON and BSON datatypes. They are represented internally as binary for BSON and string (UTF-8) for JSON. I searched a bit and found that Apache Drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/). You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11692 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9658.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9658 commit 0c9b77a54f47dfd42ccd2454d63f272c5c7b4464 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:03:37Z [SPARK-11692][SQL] Support for Parquet logical types, JSON and BISON (embedded types)
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9659 [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down Currently Parquet predicate tests all pass even if filters are not pushed down or this is disabled. In this PR, For checking evaluating filters, Simply it makes the expression from `expression.Filter` and then try to create filters just like Spark does. For checking the results, this manually accesses to the child rdd (of `expression.Filter`) and produces the results which should be filtered properly, and then compares it to expected values. Now, if filters are not pushed down or this is disabled, this throws exceptions. cc @liancheng You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11676 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9659.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9659 commit 1c6a5480df53d16fe38d14146deedaf035a6024c Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:31:41Z [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down commit 6b722ef4ad3a29a1bd0cde0961f6bc12638e2f31 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:37:27Z [SPARK-11676][SQL] Update indentation and comments commit d007c3f996f1fd4e093a590741b92a4bc4072d41 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:39:05Z [SPARK-11676][SQL] Remove unused imports commit 8cc842fa3f97c39e61fa072a916f7056a85be1b7 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:49:11Z [SPARK-111676][SQL] Set true to pushing down filters --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156052665 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9660#discussion_r44737977
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -91,6 +91,33 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11694 Parquet logical types are not being tested properly") {
+    val parquetSchema = MessageTypeParser.parseMessageType(
+      """message root {
+        |  required int32 a(INT_8);
+        |  required int32 b(INT_16);
+        |  required int32 c(DATE);
+        |  required int32 d(DECIMAL(1,0));
+        |  required int64 e(DECIMAL(10,0));
+        |}
+      """.stripMargin)
+
+    withTempPath { location =>
+      val extraMetadata = Map.empty[String, String].asJava
+      val fileMetadata = new FileMetaData(parquetSchema, extraMetadata, "Spark")
+      val path = new Path(location.getCanonicalPath)
+      val footer = List(
+        new Footer(path, new ParquetMetadata(fileMetadata, Collections.emptyList()))
+      ).asJava
+
+      ParquetFileWriter.writeMetadataFile(sparkContext.hadoopConfiguration, path, footer)
--- End diff --
Done!
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9655#issuecomment-156262080 I think I should have opened a thread in the mailing list. Sorry, closing this.
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/9655
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9660 [SPARK-11694][SQL] Parquet logical types are not being tested properly All the physical types are properly tested in `ParquetIOSuite`, but the logical type mapping is not being tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9660.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9660 commit 870a37a0ed2ce3af0a1507c529620a160e854ece Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T10:17:53Z [SPARK-11694][SQL] Parquet logical types are not being tested properly
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156059610 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9660#discussion_r44650410
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -91,6 +91,32 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11694 Parquet logical types are not being tested properly") {
+    val parquetSchema = MessageTypeParser.parseMessageType(
+      """message root {
+        |  required int32 a(INT_8);
+        |  required int32 b(INT_16);
+        |  required int32 c(DATE);
+        |  required int32 d(DECIMAL(1,0));
+        |  required int64 e(DECIMAL(10,0));
+        |}
+      """.stripMargin)
--- End diff --
Hm. I should get rid of this. I ran this test several times before opening the PR. This is weird. Sorry.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156076494 Thank you very much. I will try it that way.
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9655#issuecomment-156087424 I agree with you. However, if you look at the code, they are used in a mixed way, even within the same class. Most of the usages are `_ and _` or `+`, yet `xxx` and `xxxLeft` appear side by side even in the same class. In other words, this does not suggest that the order matters. If the order clearly does not matter, then I think we should consistently use `xxx` instead of `xxxLeft`. Anyhow, I am happy to close this if you think it is inappropriate.
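The distinction being discussed can be shown with plain Scala (a generic illustration, not code from the PR): for an associative operator the two spellings agree, which is why mixing them is harmless but inconsistent.

```scala
val xs = List(1, 2, 3, 4)

// With an associative, commutative operator such as +, fold and foldLeft
// return the same result, so the choice is purely stylistic.
val viaFold     = xs.fold(0)(_ + _)
val viaFoldLeft = xs.foldLeft(0)(_ + _)

// With an order-sensitive operator, only foldLeft has a specified result:
// ((((0 - 1) - 2) - 3) - 4) = -10. fold makes no ordering guarantee.
val leftDiff = xs.foldLeft(0)(_ - _)
```

This is why `foldLeft` is the safer default when the operator is not associative, and why the choice is interchangeable otherwise.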
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306727 Fortunately, I have worked with parquet-tools before and have looked through the Parquet code several times :). Thank you very much for your help. This was done much more easily than I thought thanks to you.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156321461 Also added an import for `Collections` in `ParquetIOSuite`. Does the build also compile the older versions from all the commits in this PR?
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156355260 All the builds pass all the tests in `ParquetIOSuite`, and I do not think this affects other modules such as ML. I will retest this.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156355277 retest this please
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156320809 retest this please
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9687 [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or pushdown is disabled. In this PR, I add some logic for this. A few things to mention here. First, since ORC does not fully filter record by record, I check both the row count and that the result contains the expected values. Second, I wonder whether it is okay to put `extractSourceRDDToDataFrame` in `QueryTest`. I did not do so, but I think `extractSourceRDDToDataFrame` could be shared with `ParquetFilterSuite`. Lastly, I originally wanted to add a separate `OrcFilterSuite` in order to test actual filter evaluation; however, I decided not to do it here (I will do it in a separate issue or follow-up PR) and instead make the existing tests work properly first. cc @liancheng You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11677 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9687.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9687 commit 7a13c8e6c73e3824b8188b865fcaeda4c7e04117 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-13T05:18:10Z [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. commit 82d0aa773d58115b0a2b3d5fd782d473e26c2671 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-13T07:43:00Z [SPARK-11677][SQL] Add tests for is-not-null operator and in-operator
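The count-and-containment strategy described in the ORC PR above can be sketched in plain Scala (simulated rows, no ORC involved), under the assumption that the reader may return unfiltered rows from stripes it could not skip:

```scala
// Hypothetical rows returned by a (simulated) ORC reader after pushdown,
// and the rows the predicate should keep. Because ORC may skip only whole
// stripes rather than individual records, the test checks the row count
// and that every expected value is present, instead of assuming exact
// row-level filtering.
case class Row(a: Int, b: String)

val returned = Seq(Row(2, "2"))
val expected = Seq(Row(2, "2"))

val countMatches = returned.length == expected.length
val allPresent   = expected.forall(returned.contains)
```

The `Row` case class and sample data are illustrative only; the PR performs the same two checks against real query results.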
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156099372 Thanks! I will follow that approach.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156416602 Finally, all tests pass! Please review the code!
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156371302 restart this please
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156296565 hm. I'm not using `FileMetaData` at line 234 in `ParquetIOSuite`.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156309901 I added an import for `FileMetaData` in `ParquetIOSuite`. This is weird; I am using Scala 2.10 and it compiles fine on my local machine.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156296578 retest this please
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156313536 Also added an import for `ParquetMetadata` in `ParquetIOSuite`.
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9754 [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY As discussed in https://github.com/apache/spark/pull/9660 and https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing Parquet metadata. For the fixed-length byte array test, I tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools). You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694-followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9754.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9754 commit dc5ef4f73f76e68a60590d7203f82b3ebecc6fb0 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T02:03:32Z [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9754#issuecomment-157276165 I will cc you, @liancheng, just so that you can find this easily :).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156879272 I accidentally came across `TODO Adds test case for reading dictionary encoded decimals written as 'FIXED_LEN_BYTE_ARRAY'`. I will also add this test in the follow-up PR that uses the overloaded `writeMetadataFile`.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156878712 Thanks! I changed this.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44598625
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff --
Can I open a separate issue for this and work on it there, if that is the right place (after this issue is resolved)? I would like to add some code for ORC and Parquet that I have already written.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44600842
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff --
Oh, sorry, my language skills are lacking. I mean that the test code checks whether filters are really pushed down or not, which I thought was enough for this issue. So, after this issue is resolved, I would like to change `ParquetFilterSuite` and add some tests for ORC, maybe in a separate issue for testing all of this properly.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-157373552 Oh. I just got confused for a bit. Sorry.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-157341844 Please note that https://github.com/apache/spark/pull/9754 unintentionally updated this as a clean-up on the master branch; however, that PR is supposed to be merged into branch 1.6, while this one is for version 1.7. I will alert you or make a PR to backport this as soon as branch 1.7 is available.
[GitHub] spark pull request: [SPARK-11692] [SPARK-11694] [SQL] Backports #9...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9763#issuecomment-157335573 I mistakenly added SPARK-11692 here, as I thought it was supposed to go into branch 1.6.0, but it is classified for version 1.7. I will take it out.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155020044 In this commit, I added partitioned tables to the test and sorted the `FileStatus`es. There are several things to mention here. First, we no longer need to change `Set` to `LinkedHashSet` and `Map` to `LinkedHashMap` for this issue, since the `FileStatus`es are now sorted manually. However, I left those changes in, as I thought the files had better stay in the order in which they are retrieved anyway. If that looks weird, I will revert it. Second, in any case the columns of the lexicographically first file appear first, which might be a problem for file names starting with or containing numeric values. However, I left this as is, since it is at least deterministic.
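The ordering caveat above can be illustrated with plain Scala (hypothetical file names, no `FileStatus` involved): sorting lexicographically is deterministic, but numeric name segments do not sort numerically.

```scala
// Hypothetical part-file names. Lexicographic sorting is deterministic,
// but "part-10" sorts before "part-2" because the comparison is
// character by character ('1' < '2'), not by numeric value.
val files  = Seq("part-2.parquet", "part-10.parquet", "part-1.parquet")
val sorted = files.sorted
// sorted == Seq("part-1.parquet", "part-10.parquet", "part-2.parquet")
```

This is the trade-off the comment accepts: the schema-merge order may look surprising for numbered files, but it is the same on every run.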
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9327#issuecomment-152424928 @liancheng oh, right. I just added it in `ParquetFilterSuite`.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43724135
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I will try to check this. Thanks.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43488432
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
--- End diff --
`var` was used by mistake... Thanks for the comments!
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43600244
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I see. I just wonder if the inconsistent order is another issue. I think users might find it weird if they run the same script with `SELECT *` but the column order of the results is different every time they run it. Could I open an issue for this if you think it is a separate issue?
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-154597634 @liancheng I assume you missed this.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43846677
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I investigated that. It does not guarantee the order. This is because of `FileStatusCache` in `HadoopFsRelation` (which `ParquetRelation` extends, as you know). When `FileStatusCache.listLeafFiles()` is called, it returns a `Set[FileStatus]`, which messes up the order of the `Array[FileStatus]`. So, after retrieving the list of leaf files including `_metadata` and `_common_metadata`, it starts to merge (separately if necessary) the `Set`s of `_metadata`, `_common_metadata` and part-files in `ParquetRelation.mergeSchemasInParallel()`. I think this can be resolved by using `LinkedHashSet`. I will open an issue for this.
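The difference can be seen with plain Scala collections (a sketch, independent of Spark): `mutable.LinkedHashSet` iterates in insertion order, while a hash-based `Set` orders elements by hash code.

```scala
import scala.collection.mutable

object InsertionOrderSketch {
  def main(args: Array[String]): Unit = {
    // LinkedHashSet preserves the order in which elements were added,
    // so files come back in the order they were listed.
    val files = mutable.LinkedHashSet("part-00002", "part-00001", "_metadata")
    println(files.toSeq.mkString(", "))
  }
}
```

With a plain `HashSet` the iteration order of the same three names is an implementation detail and may differ from the insertion order.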
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9517 [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. https://issues.apache.org/jira/browse/SPARK-11500 As filed in SPARK-11500, if merging schemas is enabled, the order of the files to touch matters because it can affect the ordering of the output columns. This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order. Also, the reduce order is kept left-to-right, and the order of `filesToTouch` was changed from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first, which always have the schema in their footers, whereas the others might not exist. One nit: if merging schemas is disabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead. However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas. In addition, in the test code for this, I only checked the names of fields. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11500 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9517.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9517 commit b0e6ce2729f584a9f95996707f60eb650c2a58b9 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:38:26Z [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.
commit 08fc91ca8d21902677e78f0adb3b36769f2cba51 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:38:55Z [SPARK-11500][SQL] Add a test to check the deterministic order. commit bcf72d3ca308f9a69993803d9c8939696c915b07 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:40:17Z [SPARK-11500][SQL] Remove trailing newline.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154338571 cc @liancheng
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-146380223 @liancheng Would this casting check be unsafe? I came across Parquet's downcasting check against the actual value:
```java
public static int checkedCast(long value) {
  int valueI = (int) value;
  if (valueI != value) {
    throw new IllegalArgumentException(String.format("Overflow casting %d to an int", value));
  }
  return valueI;
}
```
and I see the use of `null` from `eval()` in other code in Spark. I would like to close this PR if this approach is inappropriate for other reasons.
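A Scala analogue of that check might look like the following (a sketch with a hypothetical helper name, not code from Spark or Parquet): the downcast is accepted only when the value round-trips through `Int` unchanged.

```scala
object CheckedCastSketch {
  // Accept the Long only if narrowing to Int loses no information;
  // otherwise signal the overflow instead of silently truncating.
  def checkedCast(value: Long): Int = {
    val valueI = value.toInt
    if (valueI != value) {
      throw new IllegalArgumentException(s"Overflow casting $value to an int")
    }
    valueI
  }
}
```

This is the same "cast and compare" idea as the Java snippet above: `value.toInt` truncates, and the comparison (with `valueI` widened back to `Long`) detects whether truncation changed the value.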
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r41705069
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
@@ -431,6 +431,7 @@ private[parquet] object CatalystWriteSupport {
     configuration.set(SPARK_ROW_SCHEMA, schema.json)
     configuration.set(
       ParquetOutputFormat.WRITER_VERSION,
-      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
+      configuration.get(ParquetOutputFormat.WRITER_VERSION,
--- End diff --
Yeap, I just updated.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9060 [SPARK-11044][SQL] Parquet writer version fixed as version1 https://issues.apache.org/jira/browse/SPARK-11044 Spark always writes Parquet files with writer version 1, ignoring the writer version given by the user. So, in this PR, it keeps the writer version if given and sets version 1 as the default. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11044 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9060.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9060 commit 5e72fbc93ec0783d5a440f8f70c7653f8fc39d9a Author: HyukjinKwon <gurwls...@gmail.com> Date: 2015-10-10T06:59:52Z [SPARK-11044][SQL] Apply the writer version if given.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/8718
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-157561940 Several things to mention. Firstly, I wonder if it is okay to put `extractSourceRDDToDataFrame` in `QueryTest`. I did not put it there, but I think `extractSourceRDDToDataFrame` can be shared with `ParquetFilterSuite`. Secondly, I originally wanted to add `OrcFilterSuite` separately in order to test actual filter evaluation (using the enums in `ExpressionTree.Operator` and `PredicateLeaf.Operator` in Hive); however, I decided not to do it here (I will do it in a separate issue or follow-up PR) and just make the original tests work properly first.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-157564511 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9754#issuecomment-157300557 yes I will.
[GitHub] spark pull request: [SPARK-11692] [SPARK-11694] [SQL] Backports #9...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9763 [SPARK-11692] [SPARK-11694] [SQL] Backports #9658 and #9754 The main purpose of this PR is to backport https://github.com/apache/spark/pull/9754 and https://github.com/apache/spark/pull/9658. I added several commits, but they are identical to both PRs. I will cc @liancheng just so this is found easily. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694-followup-backporting Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9763.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9763 commit c91d279aaf3895acb63871da1db0b810522c4e24 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T08:52:45Z [SPARK-11692] [SPARK-11694] [SQL] Backports #9658 and #9754 commit e14b40ecc85394bbd0fcc024673093b23036af49 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T08:59:40Z [SPARK-11692] [SPARK-11694] [SQL] Add a binary file for test. commit 1f252ecf3527b8566c8f5c7b87a1a2af9aa33d06 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T09:03:50Z [SPARK-11692] [SPARK-11694] [SQL] Add a support for bson and json
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37833992
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
Yes, I will go for the second one with some comments. Thanks :)
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37830657
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
I thought the comparison operator was official and a standard one (because I am used to MySQL and thought Spark and Hive use standard SQL syntax). However, it looks like it is a MySQL dialect. In detail: Firstly, I looked through several documents. I assume these are the standard documents, as mentioned in "Where can I get a copy of the SQL standards?" (https://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F): SQL-92 (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), SQL:1999 (http://web.cs.ualberta.ca/~yuan/courses/db_readings/ansi-iso-9075-2-1999.pdf), SQL:2003 (http://www.wiscorp.com/sql_2003_standard.zip), SQL:201x (preliminary) (http://www.wiscorp.com/sql20nn.zip). Though I can't guarantee it, it looks like null-safe equality comparison is not standard; there seems to be no mention of it. Secondly, I got a list of the top 10 databases (http://www.improgrammer.net/top-10-databases-should-learn-2015/) and reviewed whether such an operator exists:
1. Oracle - not supported (http://docs.oracle.com/html/A95915_01/sqopr.htm)
2. MySQL - supported (https://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html)
3. Microsoft SQL Server - not supported (https://msdn.microsoft.com/en-us/library/ms188074.aspx)
4. PostgreSQL - not supported (http://www.postgresql.org/docs/9.2/static/functions-comparison.html)
5. MongoDB - N/A
6. DB2 - not supported (http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_72/sqlp/rbafycompop.htm)
7. Microsoft Access - not supported (https://support.office.com/en-za/article/Table-of-operators-e1bc04d5-8b76-429f-a252-e9223117d6bd)
8. SQLite - not supported (http://www.tutorialspoint.com/sqlite/sqlite_comparison_operators.htm)
9. Cassandra - N/A
10. Redis - N/A
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8391#issuecomment-134564538 Hm.. one thing I want to say is, it looks like there is no test code for `JdbcRelation`, so I tested this with separately copied functions. It seems to be just about text parsing (the SQL query), though.
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37831947
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
Do you think this should be treated differently for each dialect (maybe only for MySQL), or should this be closed?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r39345066
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,10 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    // Since the null-safe equality operator is not a standard SQL operator,
+    // this was written using is-null and normal equality.
+    case EqualNullSafe(attr, value) =>
+      s"($attr = ${compileValue(value)} OR ($attr IS NULL AND ${compileValue(value)} IS NULL))"
--- End diff --
Yes, it looks so. If Spark SQL creates SQL (for datasources) like the below, it can be a problem:
```
SELECT 1 <=> 1, NULL <=> NULL, 1 <=> NULL;  -> 1, 1, 0
SELECT 1 = 1, NULL = NULL, 1 = NULL;        -> 1, NULL, NULL
```
However, as I read the code, `compileFilter()` is only used to construct the WHERE clause. I tested the expression on MySQL and it looks OK. In detail, I tested four cases:
1. attr is a column, value is null.
2. attr is a column, value is not null.
3. attr is null, value is null.
4. attr is null, value is not null.

- `CREATE`
```SQL
CREATE TABLE TestTable (`id` int, `test` varchar(7));
```
- `INSERT`
```SQL
INSERT INTO TestTable (`id`, `test`) VALUES
  (1, null),
  (2, 'OpenAM'),
  (3, 'OpenDJ');
```
- `SELECT`
```SQL
/* 1. attr is a column, value is null. */
SELECT * FROM TestTable WHERE (test = NULL OR (test IS NULL AND NULL IS NULL));
SELECT * FROM TestTable WHERE test <=> NULL;
```
```SQL
/* 2. attr is a column, value is not null. */
SELECT * FROM TestTable WHERE (test = 'OpenAM' OR (test IS NULL AND 'OpenAM' IS NULL));
SELECT * FROM TestTable WHERE test <=> 'OpenAM';
```
```SQL
/* 3. attr is null, value is null. */
SELECT * FROM TestTable WHERE (NULL = NULL OR (NULL IS NULL AND NULL IS NULL));
SELECT * FROM TestTable WHERE NULL <=> NULL;
```
```SQL
/* 4. attr is null, value is not null. */
SELECT * FROM TestTable WHERE (NULL = 'OpenAM' OR (NULL IS NULL AND 'OpenAM' IS NULL));
SELECT * FROM TestTable WHERE NULL <=> 'OpenAM';
```
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r39345352
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,10 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    // Since the null-safe equality operator is not a standard SQL operator,
+    // this was written using is-null and normal equality.
+    case EqualNullSafe(attr, value) =>
+      s"($attr = ${compileValue(value)} OR ($attr IS NULL AND ${compileValue(value)} IS NULL))"
--- End diff --
Yes, it looks so. There can be a problem if the query uses the result of the `<=>` comparison. Let me correct this as `(NOT ($attr <> ${compileValue(value)} OR $attr IS NULL OR ${compileValue(value)} IS NULL) OR ($attr IS NULL AND ${compileValue(value)} IS NULL))`. (BTW, I found `<=>` is actually in the standard, but as `IS NOT DISTINCT FROM`, since SQL:1999.)
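As a sketch of how that corrected rewrite renders, here is a standalone Scala helper (hypothetical names; the real `compileFilter` and `compileValue` live in `JDBCRDD`, and the toy `compileValue` below only quotes strings):

```scala
object NullSafeEqualSketch {
  // Toy literal compiler: quote strings, render everything else as-is.
  private def compileValue(value: Any): String = value match {
    case s: String => s"'$s'"
    case v => v.toString
  }

  // Null-safe equality expressed with standard SQL operators only: true
  // exactly when both sides are equal and non-null, or both are NULL.
  def equalNullSafe(attr: String, value: Any): String = {
    val v = compileValue(value)
    s"(NOT ($attr <> $v OR $attr IS NULL OR $v IS NULL) " +
      s"OR ($attr IS NULL AND $v IS NULL))"
  }
}
```

For example, `equalNullSafe("test", "OpenAM")` produces a WHERE-clause fragment that evaluates to true or false (never NULL), matching the semantics of MySQL's `<=>`.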
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8743 [SPARK-10180] [SQL] JDBC datasource are not processing EqualNullSafe filter https://github.com/apache/spark/pull/8391 @rxin I apologize; I removed the forked repo by mistake with the PR not closed, so I opened this again. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-10180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8743.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8743 commit cb1dd0af31fc6835333cea0fdc63cb700d2336fd Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-14T10:23:26Z [SPARK-10180] [SQL] Update query to prevent to return null.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140693582 It is OK for JDBC, but for Parquet and ORC, it looks like the conversion from `StringType` to `NumericType` is not safe. When the field type is `StringType`, the given numeric value is converted to `String` and both are compared as strings, which ends up as a wrong comparison. Also, it looks like `Cast()` sometimes fails when trying to cast unrealistic values, so it returns false when the cast fails, just in case.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140569493 It looks OK for the JDBC datasource though. I am wondering whether the conversion from `NumericType` to `StringType` should be prevented, or handled differently for each datasource, as it looks unsafe for Parquet and ORC. For example, when column `A` is a string, `A > 100` will push down a filter with a string value, performing a string comparison.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8718 [SPARK-9182][SQL] Cast filters are not passed through to datasources As mentioned in https://issues.apache.org/jira/browse/SPARK-9182, some Cast filters are not passed through to datasources (the original PR was reverted because of some downcasting issues). So, for this PR, rather than checking the range or constant-folding, I just perform the actual cast to check whether down-casting actually happens. The tests and some of the code are largely similar to the previous ones (@yjshen). The code is a bit messy and might need to be cleaned up. @liancheng Could you please review this PR? You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8718.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8718 commit 2acbeb36c3d1a9d96277102788ef4fae95aa799d Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:50:10Z [SPARK-9182][SQL] Cast filters are not passed through to datasources commit b3f02b29b02ae255f2cee71912e12e4f34d3935a Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:52:49Z [SPARK-9182][SQL] Remove a newline commit 53bcddb4fb204e5a5b67596a8e5b55d030220fd4 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:55:36Z [SPARK-9182][SQL] Update indentation
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140368436 I think in that case it does not create any condition, since `double -> int -> double` means the field type is int and the given value type is double. Losing precision during this conversion means the given value cannot be converted to the field type. (I might have misunderstood your comment.) I tested and added some more tests for other type conversions. Could you give me a specific case?
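The round-trip idea above can be sketched in plain Scala; this is a minimal illustration of the principle, not the actual `Cast` implementation:

```scala
object RoundTripCastDemo {
  // A down-cast is detected when casting the value to the field's
  // type (int here) and back does not return the original value.
  def survivesRoundTrip(value: Double): Boolean =
    value.toInt.toDouble == value

  def main(args: Array[String]): Unit = {
    println(survivesRoundTrip(2500.0)) // nothing lost: the condition can be kept
    println(survivesRoundTrip(2500.5)) // fraction dropped by toInt: no condition created
  }
}
```

When the round trip loses information, the literal cannot be represented in the field's type, so no pushed-down condition is created.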
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-140919876 Would it be better if I wrote a test for this?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/8391
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-141679720 Yes, I will do so.
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-141679756 Would it be better if I wrote some tests for this?
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-141960719 It looks like the original case became a downcast from `Decimal(10, 0)` to `Decimal(7, 2)`, i.e. the precision and scale of the target are less than those of the source, rather than int to decimal. I think there has been an update to the Optimizer. It looks like, in most cases, the conversions require downcasts. Here is a list I tested:

1. Where `A` is `IntegerType`: `WHERE A <= 2500` becomes int to int; `WHERE A <= 2500.0` becomes decimal(11, 1) to decimal(10, 0).
2. Where `A` is `Decimal(7, 2)`: `WHERE A <= 2500` becomes decimal(10, 0) to decimal(7, 2); `WHERE A <= 2500.0` becomes decimal(7, 2) to decimal(7, 2); `WHERE A <= 2500.00` becomes decimal(12, 2) to decimal(7, 2).
3. Where `A` is `Double`: `WHERE A <= 2500` becomes double to double; `WHERE A <= 2500.0` becomes double to double.

So even if I check for widening casts, handling the types as well as both precision and scale in `Decimal`, this issue cannot be resolved and many use cases are not covered. I'm not sure whether I should 1. just go with this approach, 2. handle only numbers, including downcasts, with `Cast(...).eval()`, or 3. just close this.
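As a rough illustration of the precision/scale check being discussed, here is a self-contained sketch using `java.math.BigDecimal`; the `fits` helper and its semantics are my own simplification for illustration, not Spark's `Decimal` type:

```scala
import java.math.{BigDecimal, RoundingMode}

object DecimalFitDemo {
  // True if `value` fits Decimal(precision, scale) without losing
  // information; a false result corresponds to a down-cast.
  def fits(value: BigDecimal, precision: Int, scale: Int): Boolean =
    try {
      // UNNECESSARY throws if rescaling would have to round digits away.
      value.setScale(scale, RoundingMode.UNNECESSARY).precision <= precision
    } catch {
      case _: ArithmeticException => false
    }

  def main(args: Array[String]): Unit = {
    println(fits(new BigDecimal("2500"), 7, 2))      // 2500.00 fits Decimal(7, 2)
    println(fits(new BigDecimal("123456.78"), 7, 2)) // needs 8 digits, does not fit
  }
}
```

This mirrors case 2 above: the literal 2500 arrives as decimal(10, 0) but still fits Decimal(7, 2), so a purely type-level widening check would reject conversions that are actually safe.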
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-143179571 This commit only deals with numbers. I removed the round-trip cast; it now only supports comparisons among numeric types other than `Decimal`, and between `Decimal` and `Decimal`. Anything else (e.g. between `Decimal` and a non-`Decimal` type) is considered incompatible.
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9659#discussion_r47059656

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -50,27 +50,33 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
     val output = predicate.collect { case a: Attribute => a }.distinct

     withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
-      val query = df
-        .select(output.map(e => Column(e)): _*)
-        .where(Column(predicate))
-
-      val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
-        case PhysicalOperation(_, filters, LogicalRelation(_: ParquetRelation, _)) => filters
-      }.flatten.reduceLeftOption(_ && _)
-      assert(maybeAnalyzedPredicate.isDefined)
-
-      val selectedFilters = maybeAnalyzedPredicate.flatMap(DataSourceStrategy.translateFilter)
-      assert(selectedFilters.nonEmpty)
-
-      selectedFilters.foreach { pred =>
-        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
-        assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
-        maybeFilter.foreach { f =>
-          // Doesn't bother checking type parameters here (e.g. `Eq[Integer]`)
-          assert(f.getClass === filterClass)
+      withSQLConf(SQLConf.PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED.key -> "false") {
--- End diff --

I see. Thanks!
[GitHub] spark pull request: [SPARK-12236][SQL] JDBC filter tests all pass ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10221 [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really pushed down https://issues.apache.org/jira/browse/SPARK-12236 Currently the JDBC filters are not tested properly: all the tests pass even if the filters are not pushed down, due to Spark-side filtering. In this PR, firstly, I corrected the tests to properly check the pushed-down filters by removing Spark-side filtering. Also, `!=` was being tested even though it is not actually pushed down, so I removed those cases. Lastly, I moved the `stripSparkFilter()` function to `SQLTestUtils`, as this function will be shared by all tests for pushed-down filters. It will also be shared with the ORC datasource, as its filters are not being tested properly either. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12236 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10221.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10221 commit 3294b31303d4f67d6a7e7b9125c3f38df0f01fda Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T07:31:36Z Correct the test for checking pushed down filters
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-163142135 Just a question: it looks like the PR for the end-to-end Docker tests has been merged now. Do you think this needs tests for all the databases (namely `MySQLIntegrationSuite` and `PostgresIntegrationSuite` for now)?
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9659#issuecomment-163088089 In this commit, I resolved conflicts, renamed the function `extractSourceRDDToDataFrame` to `stripSparkFilter`, and set `spark.sql.parquet.enableUnsafeRowRecordReader` to false.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163081502 I will add that function to `SharedSQLContext` if it is okay.
[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10218 [SPARK-12227][SQL] Support drop multiple columns specified by Column class in DataFrame API https://issues.apache.org/jira/browse/SPARK-12227 In this PR, I added support for dropping multiple columns by `Column` reference; previously this was only supported by string name. As varargs overloads alone cannot be distinguished by element type, I changed `drop(colNames: String*)` to `drop(colName: String, colNames: String*)`, matching `orderBy(sortCol: String, sortCols: String*)` and `sort(sortCol: String, sortCols: String*)`. The call in `drop(colName: String)` is changed accordingly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12227 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10218.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10218 commit cae52921a1491f2dee9d130aaeadc21b7e9cb168 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T06:25:20Z Support to drop multiple columns by reference commit 22235269995f5c525da0d9aef948803883ca91ae Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T06:36:58Z Change comments
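The erasure problem motivating the signature change can be sketched in plain Scala (illustrative names, not the actual `DataFrame` API): two varargs-only overloads erase to the same signature and will not compile, while requiring the first argument explicitly disambiguates them:

```scala
object DropOverloadDemo {
  // Varargs-only overloads such as
  //   def drop(colNames: String*): Seq[String]
  //   def drop(cols: Int*): Seq[String]
  // clash after erasure (both become drop(Seq)), so the first argument
  // is made explicit, as in orderBy(sortCol: String, sortCols: String*).
  def drop(colName: String, colNames: String*): Seq[String] = colName +: colNames
  def drop(col: Int, cols: Int*): Seq[String] = (col +: cols).map(_.toString)

  def main(args: Array[String]): Unit = {
    println(drop("a", "b", "c")) // string-name variant
    println(drop(1, 2))          // by-reference variant (Int stands in for Column)
  }
}
```

The explicit first parameter also enforces at compile time that at least one column is supplied.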
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9659#issuecomment-163102930 retest this please
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163089723 In this commit, I renamed the function `extractSourceRDDToDataFrame` to `stripSparkFilter`.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163104687 Can I add the `stripSparkFilter` function to `SQLTestUtils` for `ParquetFilterSuite` and `OrcQuerySuite` (and possibly an `OrcFilterSuite` that I will create)?
[GitHub] spark pull request: [SPARK-12249][SQL] JDBC not-equality compariso...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10233 [SPARK-12249][SQL] JDBC not-equality comparison operator not pushed down. https://issues.apache.org/jira/browse/SPARK-12249 Currently the `!=` operator is not pushed down. I simply added a case for it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12249 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10233 commit 6cdd5f31f01af4c91f43c6a3fec12d3416dbdaff Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-10T00:23:26Z Add not-equality comparison operator.
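A minimal sketch of the kind of change involved, using hypothetical filter classes rather than Spark's actual `sources.Filter` hierarchy and `JDBCRDD` internals:

```scala
// Hypothetical, simplified stand-ins for the source filter classes;
// the real JDBC compilation handles many more operators and quoting rules.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class Not(child: Filter) extends Filter

object FilterCompiler {
  // Compiles a filter to a SQL WHERE fragment; None means "not pushed down",
  // leaving the predicate to be evaluated on the Spark side.
  def compileFilter(f: Filter): Option[String] = f match {
    case EqualTo(attr, value)      => Some(s"$attr = '$value'")
    // The added case: translate a not-equality predicate into `!=`
    // instead of falling through and losing the pushdown.
    case Not(EqualTo(attr, value)) => Some(s"$attr != '$value'")
    case _                         => None
  }
}
```

Before such a case exists, `Not(EqualTo(...))` falls into the default branch, so the predicate is silently evaluated in Spark instead of in the database.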