[GitHub] spark pull request: [SPARK-9935][SQL] EqualNotNull not processed i...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8163 [SPARK-9935][SQL] EqualNotNull not processed in ORC https://issues.apache.org/jira/browse/SPARK-9935 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8163.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8163 commit 755affe90e76535e5968017c0a2186d6b7e886a8 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-13T10:40:24Z [SPARK-9935][SQL] EqualNotNull not processed in ORC --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10035][SQL] Parquet filters does not pr...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8275 [SPARK-10035][SQL] Parquet filters does not process EqualNullSafe filter. As I discussed with Lian: 1. Added EqualNullSafe to ParquetFilters - it uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs a null-safe equality comparison. 2. Updated the test code (ParquetFilterSuite) - convert catalyst.Expression to sources.Filter - removed Cast, since only Literal is picked up as a proper Filter in DataSourceStrategy - added an EqualNullSafe comparison. 3. Removed the deprecated createFilter for catalyst.Expression. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8275 commit 98ccf859fbfd7cce89bf56d625b0bd214638621a Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:41:00Z [SPARK-10035][SQL] Parquet filters does not process EqualNullSafe filter. commit cf24b243eeb03eebb38ff14eeb9c61b9ead3178f Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:41:57Z [SPARK-10035][SQL] Update ParquetFilterSuite for sources.Filter test. commit f35030f74f4bc0cf89fe815758f443a1912d1a95 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-18T12:46:42Z [SPARK-10035][SQL] Remove deprecated createFilter() in ParquetFilters (for catalyst.Expression).
[GitHub] spark pull request: [SPARK-10035][SQL] Parquet filters does not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8275#issuecomment-132819486 Could you merge this :) ?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8391 [SPARK-10180] [SQL] JDBCRDD does not process EqualNullSafe filter. https://issues.apache.org/jira/browse/SPARK-10180 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8391.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8391 commit d167acbe5a6007f2a41c27f552795976512c847e Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-24T07:09:18Z [SPARK-10180] [SQL] JDBCRDD does not process EqualNullSafe filter.
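For context, compiling `EqualNullSafe` for JDBC amounts to emitting a WHERE fragment that is true when both sides are NULL, or both are non-NULL and equal. A minimal standalone sketch (the names `JdbcFilterSketch`, `compileEqualNullSafe` and `quote` are illustrative, not Spark's actual API):

```scala
// Hedged sketch: one portable-SQL translation of null-safe equality.
// Not Spark's actual JDBCRDD code; the names here are hypothetical.
object JdbcFilterSketch {
  // Render a literal for embedding in SQL (strings get single quotes).
  private def quote(v: Any): String = v match {
    case s: String => s"'$s'"
    case other     => other.toString
  }

  // True when both sides are NULL, or both are non-NULL and equal.
  def compileEqualNullSafe(col: String, value: Any): String = {
    val lit = quote(value)
    s"((NOT ($col != $lit OR $col IS NULL OR $lit IS NULL)) " +
      s"OR ($col IS NULL AND $lit IS NULL))"
  }
}
```

For example, `compileEqualNullSafe("a", 1)` yields a predicate that treats two NULLs as equal, which a plain `a = 1` would not.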
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8096 [SPARK-9814][SQL] EqualNotNull not passing to data sources You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8096.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8096 commit fed1c4e30d37b8a0e9e2f3c87c15bc6cfe0c4534 Author: hyukjinkwon gurwls...@gmail.com Date: 2015-08-11T05:22:06Z [SPARK-9814][SQL] EqualNotNull not passing to data sources
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8096#discussion_r36715585 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala --- @@ -38,6 +38,14 @@ case class EqualTo(attribute: String, value: Any) extends Filter /** * A filter that evaluates to `true` iff the attribute evaluates to a value + * null safe and equal to `value`. --- End diff -- Should I write like this? A filter that evaluates to `true` if the attribute evaluates to a value equal to `value` and if both the attribute and the value are null.
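The semantics being documented above can be sketched in plain Scala, modelling nullable values as `Option` (an illustrative sketch only, not Spark's implementation): the filter evaluates to true when both sides are null, or when both are non-null and equal.

```scala
// Hedged sketch of null-safe equality semantics, using Option as "nullable".
object NullSafeEq {
  def eqNullSafe(a: Option[Any], b: Option[Any]): Boolean = (a, b) match {
    case (None, None)       => true   // both null: true (unlike EqualTo)
    case (Some(x), Some(y)) => x == y // both present: ordinary equality
    case _                  => false  // exactly one side null: false
  }
}
```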
[GitHub] spark pull request: [SPARK-9814][SQL] EqualNotNull not passing to ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8096#issuecomment-129835273 Reset this please :)
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8391#issuecomment-134615781 Hm.. one thing I want to mention: it looks like there is no test code for JdbcRelation, so I tested this with separate copied functions. It is really just about building text (the SQL query) though.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9327 [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails When schema merging (mergedSchema) and predicate filtering are enabled together, this fails because Parquet filters are pushed down regardless of the schema of each split (or rather, each file). This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when a merged schema is used. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11103 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9327.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9327 commit 85dadbcb458eec1067b6bae838ca930b8d958f4a Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-10-28T08:30:15Z [SPARK-11103][SQL] Disable predicate push down when using merged schema for Parquet.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9327#issuecomment-152054929 /cc @liancheng
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155720490 I will try to find and test them first tomorrow before adding a commit!
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155280757 I used `sortBy` instead of `sortWith`
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154890873 retest this please
[GitHub] spark pull request: [SPARK-9557][SQL] Refactor ParquetFilterSuite ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9554 [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code Actually this was resolved by https://github.com/apache/spark/pull/8275, but the JIRA issue is not marked as resolved, because that PR was opened for another issue (it ended up resolving both). I commented there that it is resolved by that PR; still, I opened this PR because I would like to add a few small corrections. In the previous PR, I refactored the test to collect filters without reducing them; however, DataSourceStrategy actually creates filters after reducing, which is what the original test code (before the refactoring) followed. In this PR, I go back to the original way of collecting filters by reducing. I would be happy to close this if it is inappropriate, or if somebody would rather deal with it in a separate related PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-9557 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9554.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9554 commit b2b3631d2e49b232954c64fc196a49b8736fa87b Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-09T02:15:21Z [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247117 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
       // You should enable this configuration only if you are very sure that for the parquet
       // part-files to read there are corresponding summary files containing correct schema.
+      // As filed in SPARK-11500, the order of the files to touch matters, as it might affect
+      // the ordering of the output columns. There are several things to mention here.
+      //
+      //  1. If the mergeRespectSummaries config is false, then it merges schemas by reducing from
+      //     the first part-file, so that the columns of the first file show first.
+      //
+      //  2. If the mergeRespectSummaries config is true, then there should be, at least,
+      //     "_metadata"s for all given files. So, we can ensure that the columns of the first file
+      //     show first.
+      //
+      //  3. If shouldMergeSchemas is false but multiple files are given, there is
+      //     no guarantee of the output order, since there might not be a summary file for the
+      //     first file, which ends up putting the columns of the other files ahead. However,
+      //     this should be okay, since not enabling shouldMergeSchemas means (assumes) all the
+      //     files have the same schemas.
+
       val needMerged: Seq[FileStatus] = if (mergeRespectSummaries) {
         Seq()
       } else {
         dataStatuses
       }
-      (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+      needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff -- Yes, I think I should sort them. It looks like it is not really recommended to rely on the listing order as-is, even though the results appear sorted, judging from [this link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html).
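The ordering argument in point 1 of the comment above can be illustrated with a toy merge over column-name lists (a sketch, not Spark's actual schema merging): reducing from the first part-file keeps its columns in front, because each merge step only appends columns not yet seen.

```scala
// Illustrative sketch: a left-to-right schema merge preserves first-seen
// column order. SchemaMergeOrder and mergeColumns are hypothetical names.
object SchemaMergeOrder {
  def mergeColumns(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.foldLeft(Seq.empty[String]) { (merged, cols) =>
      merged ++ cols.filterNot(merged.contains)
    }
}
```

With the three files from the test in this PR (columns `a,b`, `c,b`, `d,b`), the merged order comes out as `a, b, c, d` as long as the first file is touched first.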
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247087 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff -- Thanks for the comments!
[GitHub] spark pull request: [SPARK-10113][SQL] Support for unsigned Parque...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9646 [SPARK-10113][SQL] Support for unsigned Parquet logical types Parquet supports some unsigned datatypes. However, since Spark does not support unsigned datatypes, it needs to emit an exception with a clear message rather than one saying the datatype is illegal. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-10113 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9646.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9646 commit 49e162480e33a0379e1b04979666533e66e33f6e Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T01:00:56Z [SPARK-10113][SQL] Support for unsigned Parquet logical types commit 2752c14331539bda7ccd7e1bc5a2887da2a2b4cd Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T01:01:26Z [SPARK-10113][SQL] Add a variable for readability
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44612005 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } } + + test("actual filter push down") { --- End diff -- It is just a nit (maybe only for me).. should this be named something like "SPARK-11661 Still pushdown filters returned by unhandledFilters"?
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-155973698 @liancheng I gave it a few tries to figure out the version, but.. as you said, it is pretty tricky to check the writer version, as it only changes the version of the data pages, which we can know only from Parquet's internals. Would it be too inappropriate if we write Parquet files with both version 1 and version 2 and then check whether the sizes of the two are equal? Since the encoding types are different, the sizes should also differ.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44607471 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff -- I wanted to file a single issue for this but made separate ones, since a separate test might be needed for ORC: there is no such OrcFilterPredicateSuite (I think), due to the difficulty of checking the objects that are pushed down. Filed here for Parquet and ORC. https://issues.apache.org/jira/browse/SPARK-11676 https://issues.apache.org/jira/browse/SPARK-11677
[GitHub] spark pull request: [SPARK-10113][SQL] Support for unsigned Parque...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9646#issuecomment-155986820 cc @liancheng
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156046292 retest this please
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9655 [SPARK-11687] Mixed usage of fold and foldLeft, reduce and reduceLeft and reduceOption and reduceLeftOption https://issues.apache.org/jira/browse/SPARK-11687 As can be seen here https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/TraversableOnce.scala, `fold`, `reduce` and `reduceOption` have been available since Scala 2.9.x. In addition, the implementations of `fold`, `reduce` and `reduceOption` simply delegate to `foldLeft`, `reduceLeft` and `reduceLeftOption` without any additional logic. For developers not used to Scala, the mixed usage might raise the question of whether a call behaves like reduceRight or reduceLeft (as everyone knows, the direction can change the result). In some places both forms are even used for identical work. This is just a small nit, but I think it would be better to use only `xxxLeft` for readability. Fortunately, there are not many usages of the plain `xxx` forms, so in this PR I changed them to `xxxLeft`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11687 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9655.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9655 commit 794919ffdba242595bf3ad14dd9c53f68b82 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T07:19:35Z [SPARK-11687][SQL] Mixed usage of fold and foldLeft, reduce and reduceLeft and reduceOption and reduceLeftOption
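The delegation described above is easy to check: for standard Scala collections, `fold` and `reduce` produce the same results as `foldLeft` and `reduceLeft`, so switching to the explicit `xxxLeft` variants changes readability, not behavior. A minimal sketch:

```scala
// Sketch: fold/reduce vs their explicit Left counterparts on a Seq.
object FoldVsFoldLeft {
  val xs = Seq(1, 2, 3, 4)
  val viaFold     = xs.fold(0)(_ + _)     // delegates to foldLeft for Seq
  val viaFoldLeft = xs.foldLeft(0)(_ + _) // direction is explicit

  // With a non-commutative operation, being explicit about direction helps:
  val concatLeft = Seq("a", "b", "c").reduceLeft(_ + _) // "abc"
}
```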
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9658 [SPARK-11692][SQL] Support for Parquet logical types, JSON and BSON (embedded types) Parquet supports some JSON and BSON datatypes. They are represented internally as binary for BSON and string (UTF-8) for JSON. I searched a bit and found that Apache Drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/). You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11692 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9658.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9658 commit 0c9b77a54f47dfd42ccd2454d63f272c5c7b4464 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:03:37Z [SPARK-11692][SQL] Support for Parquet logical types, JSON and BISON (embedded types)
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9659 [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down Currently Parquet predicate tests all pass even if filters are not pushed down or this is disabled. In this PR, For checking evaluating filters, Simply it makes the expression from `expression.Filter` and then try to create filters just like Spark does. For checking the results, this manually accesses to the child rdd (of `expression.Filter`) and produces the results which should be filtered properly, and then compares it to expected values. Now, if filters are not pushed down or this is disabled, this throws exceptions. cc @liancheng You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11676 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9659.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9659 commit 1c6a5480df53d16fe38d14146deedaf035a6024c Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:31:41Z [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down commit 6b722ef4ad3a29a1bd0cde0961f6bc12638e2f31 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:37:27Z [SPARK-11676][SQL] Update indentation and comments commit d007c3f996f1fd4e093a590741b92a4bc4072d41 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:39:05Z [SPARK-11676][SQL] Remove unused imports commit 8cc842fa3f97c39e61fa072a916f7056a85be1b7 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T09:49:11Z [SPARK-111676][SQL] Set true to pushing down filters --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156052665 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9660#discussion_r44737977
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -91,6 +91,33 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11694 Parquet logical types are not being tested properly") {
+    val parquetSchema = MessageTypeParser.parseMessageType(
+      """message root {
+        |  required int32 a(INT_8);
+        |  required int32 b(INT_16);
+        |  required int32 c(DATE);
+        |  required int32 d(DECIMAL(1,0));
+        |  required int64 e(DECIMAL(10,0));
+        |}
+      """.stripMargin)
+
+    withTempPath { location =>
+      val extraMetadata = Map.empty[String, String].asJava
+      val fileMetadata = new FileMetaData(parquetSchema, extraMetadata, "Spark")
+      val path = new Path(location.getCanonicalPath)
+      val footer = List(
+        new Footer(path, new ParquetMetadata(fileMetadata, Collections.emptyList()))
+      ).asJava
+
+      ParquetFileWriter.writeMetadataFile(sparkContext.hadoopConfiguration, path, footer)
--- End diff --
Done!
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9655#issuecomment-156262080 I think I should have opened a thread in the mailing list. Sorry, closing this.
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/9655
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9660 [SPARK-11694][SQL] Parquet logical types are not being tested properly All the physical types are properly tested in `ParquetIOSuite`, but the logical type mapping is not being tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9660.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9660 commit 870a37a0ed2ce3af0a1507c529620a160e854ece Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-12T10:17:53Z [SPARK-11694][SQL] Parquet logical types are not being tested properly
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156059610 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9660#discussion_r44650410
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
@@ -91,6 +91,32 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
     }
   }

+  test("SPARK-11694 Parquet logical types are not being tested properly") {
+    val parquetSchema = MessageTypeParser.parseMessageType(
+      """message root {
+        |  required int32 a(INT_8);
+        |  required int32 b(INT_16);
+        |  required int32 c(DATE);
+        |  required int32 d(DECIMAL(1,0));
+        |  required int64 e(DECIMAL(10,0));
+        |}
+      """.stripMargin)
--- End diff --
Hm. I should get rid of this. I ran this test several times before opening the PR. This is weird. Sorry.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156076494 Thank you very much. I will try it that way.
[GitHub] spark pull request: [SPARK-11687] Mixed usage of fold and foldLeft...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9655#issuecomment-156087424 I agree with you. However, if you look at the code, they are used in a mixed way, even within the same class. Most of the usages are `_ and _` or `+`, yet `xxx` and `xxxLeft` appear side by side even in the same class. In other words, this does not suggest that the order matters. If the order clearly does not matter, then I think we should consistently use `xxx` instead of `xxxLeft`. Anyhow, I am happy to close this if you think it is inappropriate.
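The distinction being discussed can be shown with plain Scala (a generic illustration, not code from the PR): for an associative operator the two spellings agree, which is why mixing them is harmless but inconsistent.

```scala
val xs = List(1, 2, 3, 4)

// With an associative, commutative operator such as +, fold and foldLeft
// return the same result, so the choice is purely stylistic.
val viaFold     = xs.fold(0)(_ + _)
val viaFoldLeft = xs.foldLeft(0)(_ + _)

// With an order-sensitive operator, only foldLeft has a specified result:
// ((((0 - 1) - 2) - 3) - 4) = -10. fold makes no ordering guarantee.
val leftDiff = xs.foldLeft(0)(_ - _)
```

This is why `foldLeft` is the safer default when the operator is not associative, and why the choice is interchangeable otherwise.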
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156306727 Fortunately, I have worked with parquet-tools before and have looked through the Parquet code several times :). Thank you very much for your help. This was done much more easily than I thought thanks to you.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156321461 Also added an import for `Collections` in `ParquetIOSuite`. Does the build also compile the older versions from all the commits in this PR?
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156355260 All the builds pass all the tests in `ParquetIOSuite`, and I do not think this affects other modules such as ML. I will retest this.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156355277 retest this please
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156320809 retest this please
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9687 [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or pushdown is disabled. In this PR, I add some logic for this. A few things to mention here. First, since ORC does not fully filter record by record, I check both the row count and that the result contains the expected values. Second, I wonder whether it is okay to put `extractSourceRDDToDataFrame` in `QueryTest`. I did not do so, but I think `extractSourceRDDToDataFrame` could be shared with `ParquetFilterSuite`. Lastly, I originally wanted to add a separate `OrcFilterSuite` in order to test actual filter evaluation; however, I decided not to do it here (I will do it in a separate issue or follow-up PR) and instead make the existing tests work properly first. cc @liancheng You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11677 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9687.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9687 commit 7a13c8e6c73e3824b8188b865fcaeda4c7e04117 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-13T05:18:10Z [SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. commit 82d0aa773d58115b0a2b3d5fd782d473e26c2671 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-13T07:43:00Z [SPARK-11677][SQL] Add tests for is-not-null operator and in-operator
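The count-and-containment strategy described in the ORC PR above can be sketched in plain Scala (simulated rows, no ORC involved), under the assumption that the reader may return unfiltered rows from stripes it could not skip:

```scala
// Hypothetical rows returned by a (simulated) ORC reader after pushdown,
// and the rows the predicate should keep. Because ORC may skip only whole
// stripes rather than individual records, the test checks the row count
// and that every expected value is present, instead of assuming exact
// row-level filtering.
case class Row(a: Int, b: String)

val returned = Seq(Row(2, "2"))
val expected = Seq(Row(2, "2"))

val countMatches = returned.length == expected.length
val allPresent   = expected.forall(returned.contains)
```

The `Row` case class and sample data are illustrative only; the PR performs the same two checks against real query results.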
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156099372 Thanks! I will follow that approach.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156416602 Finally, all tests pass! Please review the code!
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156371302 restart this please
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156296565 hm. I'm not using `FileMetaData` at line 234 in `ParquetIOSuite`.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156309901 I added an import for `FileMetaData` in `ParquetIOSuite`. This is weird; I am using Scala 2.10 and it compiles fine on my local machine.
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156296578 retest this please
[GitHub] spark pull request: [SPARK-11694][SQL] Parquet logical types are n...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9660#issuecomment-156313536 Also added an import for `ParquetMetadata` in `ParquetIOSuite`.
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9754 [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY As discussed in https://github.com/apache/spark/pull/9660 and https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing Parquet metadata. For the fixed-length byte array test, I tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools). You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694-followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9754.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9754 commit dc5ef4f73f76e68a60590d7203f82b3ebecc6fb0 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T02:03:32Z [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9754#issuecomment-157276165 I will cc you, @liancheng, just so that you can find this easily :).
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-156879272 I accidentally came across `TODO Adds test case for reading dictionary encoded decimals written as 'FIXED_LEN_BYTE_ARRAY'`. I will also add this test in the follow-up PR that uses the overloaded `writeMetadataFile`.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-156878712 Thanks! I changed this.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44598625
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff --
Can I open a separate issue for this and work on it there, if that is the right place (after this issue is resolved)? I would like to add some code for ORC and Parquet that I have already written.
[GitHub] spark pull request: [SPARK-11661] [SQL] Still pushdown filters ret...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9634#discussion_r44600842
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -336,4 +336,29 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("actual filter push down") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val path = s"${dir.getCanonicalPath}/part=1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
+        val df = sqlContext.read.parquet(path).filter("a = 2")
+
+        // This is the source RDD without Spark-side filtering.
+        val childRDD =
+          df
+            .queryExecution
+            .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter]
+            .child
+            .execute()
+
+        // The result should be a single row.
+        // When a filter is pushed to Parquet, Parquet can apply it to every row.
+        // So, we can check the number of rows returned from Parquet
+        // to make sure our filter pushdown works.
+        assert(childRDD.count == 1)
--- End diff --
Oh, sorry, my language skills are lacking. I mean that the test code checks whether filters are really pushed down or not, which I thought was enough for this issue. So, after this issue is resolved, I would like to change `ParquetFilterSuite` and add some tests for ORC, maybe in a separate issue for testing all of this properly.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-157373552 Oh. I just got confused for a bit. Sorry.
[GitHub] spark pull request: [SPARK-11692][SQL] Support for Parquet logical...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9658#issuecomment-157341844 Please note that https://github.com/apache/spark/pull/9754 unintentionally updated this as a clean-up on the master branch; however, that PR is supposed to be merged into branch 1.6, while this one is for version 1.7. I will alert you or make a PR to backport this as soon as branch 1.7 is available.
[GitHub] spark pull request: [SPARK-11692] [SPARK-11694] [SQL] Backports #9...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9763#issuecomment-157335573 I mistakenly added SPARK-11692 here, as I thought it was supposed to go into branch 1.6.0, but it is classified for version 1.7. I will take it out.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155020044 In this commit, I added partitioned tables to the test and sorted the `FileStatus`es. There are several things to mention here. First, we no longer need to change `Set` to `LinkedHashSet` and `Map` to `LinkedHashMap` for this issue, since the `FileStatus`es are now sorted manually. However, I left those changes in, as I thought the files had better stay in the order in which they are retrieved anyway. If that looks weird, I will revert it. Second, in any case the columns of the lexicographically first file appear first, which might be a problem for file names starting with or containing numeric values. However, I left this as is, since it is at least deterministic.
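The ordering caveat above can be illustrated with plain Scala (hypothetical file names, no `FileStatus` involved): sorting lexicographically is deterministic, but numeric name segments do not sort numerically.

```scala
// Hypothetical part-file names. Lexicographic sorting is deterministic,
// but "part-10" sorts before "part-2" because the comparison is
// character by character ('1' < '2'), not by numeric value.
val files  = Seq("part-2.parquet", "part-10.parquet", "part-1.parquet")
val sorted = files.sorted
// sorted == Seq("part-1.parquet", "part-10.parquet", "part-2.parquet")
```

This is the trade-off the comment accepts: the schema-merge order may look surprising for numbered files, but it is the same on every run.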
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9327#issuecomment-152424928 @liancheng oh, right. I just added it in `ParquetFilterSuite`.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43724135
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I will try to check this. Thanks.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43488432
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
--- End diff --
`var` was used by mistake... Thanks for the comments!
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43600244
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I see. I just wonder if the inconsistent order is another issue. I think users might find it weird if they run the same script with `SELECT *` but the column order of the results is different every time they run it. Could I open an issue for this if you think it is a separate issue?
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9060#issuecomment-154597634 @liancheng I assume you missed this.
[GitHub] spark pull request: [SPARK-11103][SQL] Filter applied on Merged Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9327#discussion_r43846677
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -314,4 +314,24 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 }
+
+  test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
+      SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        var pathOne = s"${dir.getCanonicalPath}/table1"
+        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(pathOne)
+        var pathTwo = s"${dir.getCanonicalPath}/table2"
+        (1 to 3).map(i => (i, i.toString)).toDF("c", "b").write.parquet(pathTwo)
+
+        // If the "c = 1" filter gets pushed down, this query will throw an exception which
+        // Parquet emits. This is a Parquet issue (PARQUET-389).
+        checkAnswer(
+          sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1"),
--- End diff --
I investigated that. It does not guarantee the order. This is because of `FileStatusCache` in `HadoopFsRelation` (which `ParquetRelation` extends, as you know). When `FileStatusCache.listLeafFiles()` is called, it returns a `Set[FileStatus]`, which messes up the order of the `Array[FileStatus]`. So, after retrieving the list of leaf files including `_metadata` and `_common_metadata`, it starts to merge (separately if necessary) the `Set`s of `_metadata`, `_common_metadata` and part-files in `ParquetRelation.mergeSchemasInParallel()`. I think this can be resolved by using `LinkedHashSet`. I will open an issue for this.
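The difference can be seen with plain Scala collections (a sketch, independent of Spark): `mutable.LinkedHashSet` iterates in insertion order, while a hash-based `Set` orders elements by hash code.

```scala
import scala.collection.mutable

object InsertionOrderSketch {
  def main(args: Array[String]): Unit = {
    // LinkedHashSet preserves the order in which elements were added,
    // so files come back in the order they were listed.
    val files = mutable.LinkedHashSet("part-00002", "part-00001", "_metadata")
    println(files.toSeq.mkString(", "))
  }
}
```

With a plain `HashSet` the iteration order of the same three names is an implementation detail and may differ from the insertion order.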
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9517 [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. https://issues.apache.org/jira/browse/SPARK-11500 As filed in SPARK-11500, if merging schemas is enabled, the order of the files to touch matters because it can affect the ordering of the output columns. This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order. Also, the reduce order is kept left-to-right, and the order of `filesToTouch` was changed from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first, which always have the schema in their footers, whereas the others might not exist. One nit: if merging schemas is disabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead. However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas. In addition, in the test code for this, I only checked the names of fields. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11500 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9517.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9517 commit b0e6ce2729f584a9f95996707f60eb650c2a58b9 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:38:26Z [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.
commit 08fc91ca8d21902677e78f0adb3b36769f2cba51 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:38:55Z [SPARK-11500][SQL] Add a test to check the deterministic order. commit bcf72d3ca308f9a69993803d9c8939696c915b07 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-06T07:40:17Z [SPARK-11500][SQL] Remove trailing newline.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154338571 cc @liancheng
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-146380223 @liancheng Would this casting check be unsafe? I came across Parquet's downcasting check against the actual value:
```java
public static int checkedCast(long value) {
  int valueI = (int) value;
  if (valueI != value) {
    throw new IllegalArgumentException(String.format("Overflow casting %d to an int", value));
  }
  return valueI;
}
```
and I see the use of `null` from `eval()` in other code in Spark. I would like to close this PR if this approach is inappropriate for other reasons.
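A Scala analogue of that check might look like the following (a sketch with a hypothetical helper name, not code from Spark or Parquet): the downcast is accepted only when the value round-trips through `Int` unchanged.

```scala
object CheckedCastSketch {
  // Accept the Long only if narrowing to Int loses no information;
  // otherwise signal the overflow instead of silently truncating.
  def checkedCast(value: Long): Int = {
    val valueI = value.toInt
    if (valueI != value) {
      throw new IllegalArgumentException(s"Overflow casting $value to an int")
    }
    valueI
  }
}
```

This is the same "cast and compare" idea as the Java snippet above: `value.toInt` truncates, and the comparison (with `valueI` widened back to `Long`) detects whether truncation changed the value.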
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9060#discussion_r41705069
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
@@ -431,6 +431,7 @@ private[parquet] object CatalystWriteSupport {
     configuration.set(SPARK_ROW_SCHEMA, schema.json)
     configuration.set(
       ParquetOutputFormat.WRITER_VERSION,
-      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
+      configuration.get(ParquetOutputFormat.WRITER_VERSION,
--- End diff --
Yeap, I just updated.
[GitHub] spark pull request: [SPARK-11044][SQL] Parquet writer version fixe...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9060 [SPARK-11044][SQL] Parquet writer version fixed as version1 https://issues.apache.org/jira/browse/SPARK-11044 Spark always writes Parquet files with writer version 1, ignoring the writer version given by the user. So, in this PR, it keeps the writer version if given and sets version 1 as the default. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11044 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9060.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9060 commit 5e72fbc93ec0783d5a440f8f70c7653f8fc39d9a Author: HyukjinKwon <gurwls...@gmail.com> Date: 2015-10-10T06:59:52Z [SPARK-11044][SQL] Apply the writer version if given.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/8718
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-157561940 Several things to mention. Firstly, I wonder if it is okay to put `extractSourceRDDToDataFrame` in `QueryTest`. I did not put it there, but I think `extractSourceRDDToDataFrame` can be shared with `ParquetFilterSuite`. Secondly, I originally wanted to add `OrcFilterSuite` separately in order to test actual filter evaluation (using the enums in `ExpressionTree.Operator` and `PredicateLeaf.Operator` in Hive); however, I decided not to do it here (I will do it in a separate issue or follow-up PR) and just make the original tests work properly first.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-157564511 cc @liancheng
[GitHub] spark pull request: [SPARK-11694][FOLLOW-UP] Clean up imports, use...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9754#issuecomment-157300557 yes I will.
[GitHub] spark pull request: [SPARK-11692] [SPARK-11694] [SQL] Backports #9...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9763 [SPARK-11692] [SPARK-11694] [SQL] Backports #9658 and #9754 The main purpose of this PR is to backport https://github.com/apache/spark/pull/9754 and https://github.com/apache/spark/pull/9658. I added several commits, but they are identical to both PRs. I will cc @liancheng just so this is found easily. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-11694-followup-backporting Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9763.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9763 commit c91d279aaf3895acb63871da1db0b810522c4e24 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T08:52:45Z [SPARK-11692] [SPARK-11694] [SQL] Backports #9658 and #9754 commit e14b40ecc85394bbd0fcc024673093b23036af49 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T08:59:40Z [SPARK-11692] [SPARK-11694] [SQL] Add a binary file for test. commit 1f252ecf3527b8566c8f5c7b87a1a2af9aa33d06 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-11-17T09:03:50Z [SPARK-11692] [SPARK-11694] [SQL] Add a support for bson and json
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37833992
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
Yes, I will go for the second one with some comments. Thanks :)
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37830657
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
I thought the comparison operator was official and a standard one (because I am used to MySQL and thought Spark and Hive use standard SQL syntax). However, it looks like it is a MySQL dialect. In detail: Firstly, I looked through several documents. I assume these are the standard documents, as mentioned in "Where can I get a copy of the SQL standards?" (https://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F): SQL-92 (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), SQL:1999 (http://web.cs.ualberta.ca/~yuan/courses/db_readings/ansi-iso-9075-2-1999.pdf), SQL:2003 (http://www.wiscorp.com/sql_2003_standard.zip), SQL:201x (preliminary) (http://www.wiscorp.com/sql20nn.zip). Though I can't guarantee it, it looks like null-safe equality comparison is not standard; there seems to be no mention of it. Secondly, I got a list of the top 10 databases (http://www.improgrammer.net/top-10-databases-should-learn-2015/) and reviewed whether such an operator exists:
1. Oracle - not supported (http://docs.oracle.com/html/A95915_01/sqopr.htm)
2. MySQL - supported (https://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html)
3. Microsoft SQL Server - not supported (https://msdn.microsoft.com/en-us/library/ms188074.aspx)
4. PostgreSQL - not supported (http://www.postgresql.org/docs/9.2/static/functions-comparison.html)
5. MongoDB - N/A
6. DB2 - not supported (http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_72/sqlp/rbafycompop.htm)
7. Microsoft Access - not supported (https://support.office.com/en-za/article/Table-of-operators-e1bc04d5-8b76-429f-a252-e9223117d6bd)
8. SQLite - not supported (http://www.tutorialspoint.com/sqlite/sqlite_comparison_operators.htm)
9. Cassandra - N/A
10. Redis - N/A
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8391#issuecomment-134564538 Hm.. one thing I want to say is, it looks like there is no test code for `JdbcRelation`, so I tested this with separately copied functions. It seems to be just about text parsing (the SQL query), though.
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r37831947
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,7 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    case EqualNullSafe(attr, value) => s"$attr = ${compileValue(value)}"
--- End diff --
Do you think this should be treated differently for each dialect (maybe only for MySQL), or should this be closed?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r39345066
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,10 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    // Since the null-safe equality operator is not a standard SQL operator,
+    // this was written using is-null and normal equality.
+    case EqualNullSafe(attr, value) =>
+      s"($attr = ${compileValue(value)} OR ($attr IS NULL AND ${compileValue(value)} IS NULL))"
--- End diff --
Yes, it looks so. If Spark SQL creates SQL (for datasources) like the below, it can be a problem:
```
SELECT 1 <=> 1, NULL <=> NULL, 1 <=> NULL;  -> 1, 1, 0
SELECT 1 = 1, NULL = NULL, 1 = NULL;        -> 1, NULL, NULL
```
However, as I read the code, `compileFilter()` is only used to construct the WHERE clause. I tested the expression on MySQL and it looks OK. In detail, I tested four cases:
1. attr is a column, value is null.
2. attr is a column, value is not null.
3. attr is null, value is null.
4. attr is null, value is not null.

- `CREATE`
```SQL
CREATE TABLE TestTable (`id` int, `test` varchar(7));
```
- `INSERT`
```SQL
INSERT INTO TestTable (`id`, `test`) VALUES
  (1, null),
  (2, 'OpenAM'),
  (3, 'OpenDJ');
```
- `SELECT`
```SQL
/* 1. attr is a column, value is null. */
SELECT * FROM TestTable WHERE (test = NULL OR (test IS NULL AND NULL IS NULL));
SELECT * FROM TestTable WHERE test <=> NULL;
```
```SQL
/* 2. attr is a column, value is not null. */
SELECT * FROM TestTable WHERE (test = 'OpenAM' OR (test IS NULL AND 'OpenAM' IS NULL));
SELECT * FROM TestTable WHERE test <=> 'OpenAM';
```
```SQL
/* 3. attr is null, value is null. */
SELECT * FROM TestTable WHERE (NULL = NULL OR (NULL IS NULL AND NULL IS NULL));
SELECT * FROM TestTable WHERE NULL <=> NULL;
```
```SQL
/* 4. attr is null, value is not null. */
SELECT * FROM TestTable WHERE (NULL = 'OpenAM' OR (NULL IS NULL AND 'OpenAM' IS NULL));
SELECT * FROM TestTable WHERE NULL <=> 'OpenAM';
```
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/8391#discussion_r39345352
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -275,6 +275,10 @@ private[sql] class JDBCRDD(
   */
   private def compileFilter(f: Filter): String = f match {
     case EqualTo(attr, value) => s"$attr = ${compileValue(value)}"
+    // Since the null-safe equality operator is not a standard SQL operator,
+    // this was written using is-null and normal equality.
+    case EqualNullSafe(attr, value) =>
+      s"($attr = ${compileValue(value)} OR ($attr IS NULL AND ${compileValue(value)} IS NULL))"
--- End diff --
Yes, it looks so. There can be a problem if the query uses the result of the `<=>` comparison. Let me correct this as `(NOT ($attr <> ${compileValue(value)} OR $attr IS NULL OR ${compileValue(value)} IS NULL) OR ($attr IS NULL AND ${compileValue(value)} IS NULL))`. (BTW, I found `<=>` is actually in the standard, but as `IS NOT DISTINCT FROM`, since SQL:1999.)
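As a sketch of how that corrected rewrite renders, here is a standalone Scala helper (hypothetical names; the real `compileFilter` and `compileValue` live in `JDBCRDD`, and the toy `compileValue` below only quotes strings):

```scala
object NullSafeEqualSketch {
  // Toy literal compiler: quote strings, render everything else as-is.
  private def compileValue(value: Any): String = value match {
    case s: String => s"'$s'"
    case v => v.toString
  }

  // Null-safe equality expressed with standard SQL operators only: true
  // exactly when both sides are equal and non-null, or both are NULL.
  def equalNullSafe(attr: String, value: Any): String = {
    val v = compileValue(value)
    s"(NOT ($attr <> $v OR $attr IS NULL OR $v IS NULL) " +
      s"OR ($attr IS NULL AND $v IS NULL))"
  }
}
```

For example, `equalNullSafe("test", "OpenAM")` produces a WHERE-clause fragment that evaluates to true or false (never NULL), matching the semantics of MySQL's `<=>`.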
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8743 [SPARK-10180] [SQL] JDBC datasource are not processing EqualNullSafe filter https://github.com/apache/spark/pull/8391 @rxin I apologize; I removed the forked repo by mistake with the PR not closed, so I opened this again. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-10180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8743.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8743 commit cb1dd0af31fc6835333cea0fdc63cb700d2336fd Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-14T10:23:26Z [SPARK-10180] [SQL] Update query to prevent to return null.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140693582 It is OK for JDBC, but for Parquet and ORC, it looks like the conversion from `StringType` to `NumericType` is not safe. When the field type is `StringType`, the given numeric value is converted to `String` and both are compared as strings, which ends up as a wrong comparison. Also, it looks like `Cast()` sometimes fails when trying to cast unrealistic values, so it returns false when the cast fails, just in case.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140569493 It looks OK for the JDBC datasource though. I am wondering whether the conversion from `NumericType` to `StringType` should be prevented, or handled differently for each datasource, as it looks unsafe for Parquet and ORC. For example, when column `A` is a string, `A > 100` will push down a filter with a string value, performing a string comparison.
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/8718 [SPARK-9182][SQL] Cast filters are not passed through to datasources As mentioned in https://issues.apache.org/jira/browse/SPARK-9182, some Cast filters are not passed through to datasources (the original PR was reverted because of some downcasting issues). So, for this PR, rather than checking the range or constant-folding, I just perform the actual cast to check whether down-casting actually happens. The tests and some of the code are largely similar to the previous ones (@yjshen). The code is a bit messy and might need to be cleaned up. @liancheng Could you please review this PR? You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8718.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8718 commit 2acbeb36c3d1a9d96277102788ef4fae95aa799d Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:50:10Z [SPARK-9182][SQL] Cast filters are not passed through to datasources commit b3f02b29b02ae255f2cee71912e12e4f34d3935a Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:52:49Z [SPARK-9182][SQL] Remove a newline commit 53bcddb4fb204e5a5b67596a8e5b55d030220fd4 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-09-11T10:55:36Z [SPARK-9182][SQL] Update indentation
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-140368436 I think in that case it does not create any condition, since `double -> int -> double` means the field type is int and the given value type is double. Losing precision during this conversion means the given value cannot be converted to the field type. (I might have misunderstood your comment.) I tested and added some more tests for other type conversions. Could you give me a specific case?
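The round-trip idea above can be sketched in plain Scala; this is a minimal illustration of the principle, not the actual `Cast` implementation:

```scala
object RoundTripCastDemo {
  // A down-cast is detected when casting the value to the field's
  // type (int here) and back does not return the original value.
  def survivesRoundTrip(value: Double): Boolean =
    value.toInt.toDouble == value

  def main(args: Array[String]): Unit = {
    println(survivesRoundTrip(2500.0)) // nothing lost: the condition can be kept
    println(survivesRoundTrip(2500.5)) // fraction dropped by toInt: no condition created
  }
}
```

When the round trip loses information, the literal cannot be represented in the field's type, so no pushed-down condition is created.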
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-140919876 Would it be better if I wrote a test for this?
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBCRDD does not process E...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/8391
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-141679720 Yes, I will do so.
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-141679756 Would it be better if I wrote some tests for this?
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-141960719 It looks like the original case became a downcast from `Decimal(10, 0)` to `Decimal(7, 2)`, i.e. the precision and scale of the target are less than those of the source, rather than int to decimal. I think there has been an update to the Optimizer. It looks like, in most cases, the conversions require downcasts. Here is a list I tested:

1. Where `A` is `IntegerType`: `WHERE A <= 2500` becomes int to int; `WHERE A <= 2500.0` becomes decimal(11, 1) to decimal(10, 0).
2. Where `A` is `Decimal(7, 2)`: `WHERE A <= 2500` becomes decimal(10, 0) to decimal(7, 2); `WHERE A <= 2500.0` becomes decimal(7, 2) to decimal(7, 2); `WHERE A <= 2500.00` becomes decimal(12, 2) to decimal(7, 2).
3. Where `A` is `Double`: `WHERE A <= 2500` becomes double to double; `WHERE A <= 2500.0` becomes double to double.

So even if I check for widening casts, handling the types as well as both precision and scale in `Decimal`, this issue cannot be resolved and many use cases are not covered. I'm not sure whether I should 1. just go with this approach, 2. handle only numbers, including downcasts, with `Cast(...).eval()`, or 3. just close this.
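As a rough illustration of the precision/scale check being discussed, here is a self-contained sketch using `java.math.BigDecimal`; the `fits` helper and its semantics are my own simplification for illustration, not Spark's `Decimal` type:

```scala
import java.math.{BigDecimal, RoundingMode}

object DecimalFitDemo {
  // True if `value` fits Decimal(precision, scale) without losing
  // information; a false result corresponds to a down-cast.
  def fits(value: BigDecimal, precision: Int, scale: Int): Boolean =
    try {
      // UNNECESSARY throws if rescaling would have to round digits away.
      value.setScale(scale, RoundingMode.UNNECESSARY).precision <= precision
    } catch {
      case _: ArithmeticException => false
    }

  def main(args: Array[String]): Unit = {
    println(fits(new BigDecimal("2500"), 7, 2))      // 2500.00 fits Decimal(7, 2)
    println(fits(new BigDecimal("123456.78"), 7, 2)) // needs 8 digits, does not fit
  }
}
```

This mirrors case 2 above: the literal 2500 arrives as decimal(10, 0) but still fits Decimal(7, 2), so a purely type-level widening check would reject conversions that are actually safe.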
[GitHub] spark pull request: [SPARK-9182][SQL] Cast filters are not passed ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8718#issuecomment-143179571 This commit only deals with numbers. I removed the round-trip cast; it now only supports comparisons among numeric types other than `Decimal`, and between `Decimal` and `Decimal`. Anything else (e.g. between `Decimal` and a non-`Decimal` type) is considered incompatible.
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9659#discussion_r47059656

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -50,27 +50,33 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
     val output = predicate.collect { case a: Attribute => a }.distinct

     withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
-      val query = df
-        .select(output.map(e => Column(e)): _*)
-        .where(Column(predicate))
-
-      val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
-        case PhysicalOperation(_, filters, LogicalRelation(_: ParquetRelation, _)) => filters
-      }.flatten.reduceLeftOption(_ && _)
-      assert(maybeAnalyzedPredicate.isDefined)
-
-      val selectedFilters = maybeAnalyzedPredicate.flatMap(DataSourceStrategy.translateFilter)
-      assert(selectedFilters.nonEmpty)
-
-      selectedFilters.foreach { pred =>
-        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
-        assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
-        maybeFilter.foreach { f =>
-          // Doesn't bother checking type parameters here (e.g. `Eq[Integer]`)
-          assert(f.getClass === filterClass)
+      withSQLConf(SQLConf.PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED.key -> "false") {
--- End diff --

I see. Thanks!
[GitHub] spark pull request: [SPARK-12236][SQL] JDBC filter tests all pass ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10221 [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really pushed down https://issues.apache.org/jira/browse/SPARK-12236 Currently the JDBC filters are not tested properly: all the tests pass even if the filters are not pushed down, due to Spark-side filtering. In this PR, firstly, I corrected the tests to properly check the pushed-down filters by removing Spark-side filtering. Also, `!=` was being tested even though it is not actually pushed down, so I removed those cases. Lastly, I moved the `stripSparkFilter()` function to `SQLTestUtils`, as this function will be shared by all tests for pushed-down filters. It will also be shared with the ORC datasource, as its filters are not being tested properly either. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12236 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10221.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10221 commit 3294b31303d4f67d6a7e7b9125c3f38df0f01fda Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T07:31:36Z Correct the test for checking pushed down filters
[GitHub] spark pull request: [SPARK-10180] [SQL] JDBC datasource are not pr...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/8743#issuecomment-163142135 Just a question: it looks like the PR for the end-to-end Docker tests has been merged now. Do you think this needs tests for all the databases (namely `MySQLIntegrationSuite` and `PostgresIntegrationSuite` for now)?
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9659#issuecomment-163088089 In this commit, I resolved conflicts, renamed the function `extractSourceRDDToDataFrame` to `stripSparkFilter`, and set `spark.sql.parquet.enableUnsafeRowRecordReader` to false.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163081502 I will add that function to `SharedSQLContext` if it is okay.
[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10218 [SPARK-12227][SQL] Support drop multiple columns specified by Column class in DataFrame API https://issues.apache.org/jira/browse/SPARK-12227 In this PR, I added support for dropping multiple columns by `Column` reference; previously this was only supported by string name. As varargs overloads alone cannot be distinguished by element type, I changed `drop(colNames: String*)` to `drop(colName: String, colNames: String*)`, matching `orderBy(sortCol: String, sortCols: String*)` and `sort(sortCol: String, sortCols: String*)`. The call in `drop(colName: String)` is changed accordingly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12227 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10218.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10218 commit cae52921a1491f2dee9d130aaeadc21b7e9cb168 Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T06:25:20Z Support to drop multiple columns by reference commit 22235269995f5c525da0d9aef948803883ca91ae Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-09T06:36:58Z Change comments
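The erasure problem motivating the signature change can be sketched in plain Scala (illustrative names, not the actual `DataFrame` API): two varargs-only overloads erase to the same signature and will not compile, while requiring the first argument explicitly disambiguates them:

```scala
object DropOverloadDemo {
  // Varargs-only overloads such as
  //   def drop(colNames: String*): Seq[String]
  //   def drop(cols: Int*): Seq[String]
  // clash after erasure (both become drop(Seq)), so the first argument
  // is made explicit, as in orderBy(sortCol: String, sortCols: String*).
  def drop(colName: String, colNames: String*): Seq[String] = colName +: colNames
  def drop(col: Int, cols: Int*): Seq[String] = (col +: cols).map(_.toString)

  def main(args: Array[String]): Unit = {
    println(drop("a", "b", "c")) // string-name variant
    println(drop(1, 2))          // by-reference variant (Int stands in for Column)
  }
}
```

The explicit first parameter also enforces at compile time that at least one column is supplied.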
[GitHub] spark pull request: [SPARK-11676][SQL] Parquet filter tests all pa...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9659#issuecomment-163102930 retest this please
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163089723 In this commit, I renamed the function `extractSourceRDDToDataFrame` to `stripSparkFilter`.
[GitHub] spark pull request: [SPARK-11677][SQL] ORC filter tests all pass i...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9687#issuecomment-163104687 Can I add the `stripSparkFilter` function to `SQLTestUtils` for `ParquetFilterSuite` and `OrcQuerySuite` (and possibly an `OrcFilterSuite` that I will create)?
[GitHub] spark pull request: [SPARK-12249][SQL] JDBC not-equality compariso...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/10233 [SPARK-12249][SQL] JDBC not-equality comparison operator not pushed down. https://issues.apache.org/jira/browse/SPARK-12249 Currently the `!=` operator is not pushed down. I simply added a case for it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-12249 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10233 commit 6cdd5f31f01af4c91f43c6a3fec12d3416dbdaff Author: hyukjinkwon <gurwls...@gmail.com> Date: 2015-12-10T00:23:26Z Add not-equality comparison operator.
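A minimal sketch of the kind of change involved, using hypothetical filter classes rather than Spark's actual `sources.Filter` hierarchy and `JDBCRDD` internals:

```scala
// Hypothetical, simplified stand-ins for the source filter classes;
// the real JDBC compilation handles many more operators and quoting rules.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class Not(child: Filter) extends Filter

object FilterCompiler {
  // Compiles a filter to a SQL WHERE fragment; None means "not pushed down",
  // leaving the predicate to be evaluated on the Spark side.
  def compileFilter(f: Filter): Option[String] = f match {
    case EqualTo(attr, value)      => Some(s"$attr = '$value'")
    // The added case: translate a not-equality predicate into `!=`
    // instead of falling through and losing the pushdown.
    case Not(EqualTo(attr, value)) => Some(s"$attr != '$value'")
    case _                         => None
  }
}
```

Before such a case exists, `Not(EqualTo(...))` falls into the default branch, so the predicate is silently evaluated in Spark instead of in the database.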