[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4792 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76516145 [Test build #28117 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28117/consoleFull) for PR 4792 at commit [`538f506`](https://github.com/apache/spark/commit/538f506851d7e2eba6a20d0ad4a5909486bf8516). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76516146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28117/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76514046 @yhuai Thanks for the review! I've addressed the comments. Will merge this to master and branch-1.3 after Jenkins approves. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76513950 [Test build #28117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28117/consoleFull) for PR 4792 at commit [`538f506`](https://github.com/apache/spark/commit/538f506851d7e2eba6a20d0ad4a5909486bf8516). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/4792#discussion_r25554933 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala --- @@ -483,16 +487,42 @@ private[sql] case class ParquetRelation2( val requiredPartOrdinal = partitionKeyLocations.keys.toSeq -iterator.map { pair => - val row = pair._2.asInstanceOf[SpecificMutableRow] - var i = 0 - while (i < requiredPartOrdinal.size) { -// TODO Avoids boxing cost here! -val partOrdinal = requiredPartOrdinal(i) -row.update(partitionKeyLocations(partOrdinal), partValues(partOrdinal)) -i += 1 +if (primitiveRow) { + iterator.map { pair => +// We are using CatalystPrimitiveRowConverter and it returns a SpecificMutableRow. +val row = pair._2.asInstanceOf[SpecificMutableRow] +var i = 0 +while (i < requiredPartOrdinal.size) { + // TODO Avoids boxing cost here! + val partOrdinal = requiredPartOrdinal(i) + row.update(partitionKeyLocations(partOrdinal), partValues(partOrdinal)) + i += 1 +} +row + } +} else { + // Create a mutable row since we need to fill in values from partition columns. + val mutableRow = new GenericMutableRow(requestedSchema.size) --- End diff -- This is not necessary. Unlike `output.size` in `ParquetTableScan`, `requestedSchema` here is already a local variable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/4792#discussion_r25533656 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala --- @@ -143,19 +146,47 @@ private[sql] case class ParquetTableScan( relation.partitioningAttributes .map(a => Cast(Literal(partValues(a.name)), a.dataType).eval(EmptyRow)) -new Iterator[Row] { - def hasNext = iter.hasNext - def next() = { -val row = iter.next()._2.asInstanceOf[SpecificMutableRow] - -// Parquet will leave partitioning columns empty, so we fill them in here. -var i = 0 -while (i < requestedPartitionOrdinals.size) { - row(requestedPartitionOrdinals(i)._2) = -partitionRowValues(requestedPartitionOrdinals(i)._1) - i += 1 +if (primitiveRow) { + new Iterator[Row] { +def hasNext = iter.hasNext +def next() = { + // We are using CatalystPrimitiveRowConverter and it returns a SpecificMutableRow. + val row = iter.next()._2.asInstanceOf[SpecificMutableRow] + + // Parquet will leave partitioning columns empty, so we fill them in here. + var i = 0 + while (i < requestedPartitionOrdinals.size) { +row(requestedPartitionOrdinals(i)._2) = + partitionRowValues(requestedPartitionOrdinals(i)._1) +i += 1 + } + row +} + } +} else { + // Create a mutable row since we need to fill in values from partition columns. + val mutableRow = new GenericMutableRow(output.size) --- End diff -- Seems it will be safer if we get `output.size` out side of `mapPartitionsWithInputSplit`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/4792#discussion_r25533640 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala --- @@ -483,16 +487,42 @@ private[sql] case class ParquetRelation2( val requiredPartOrdinal = partitionKeyLocations.keys.toSeq -iterator.map { pair => - val row = pair._2.asInstanceOf[SpecificMutableRow] - var i = 0 - while (i < requiredPartOrdinal.size) { -// TODO Avoids boxing cost here! -val partOrdinal = requiredPartOrdinal(i) -row.update(partitionKeyLocations(partOrdinal), partValues(partOrdinal)) -i += 1 +if (primitiveRow) { + iterator.map { pair => +// We are using CatalystPrimitiveRowConverter and it returns a SpecificMutableRow. +val row = pair._2.asInstanceOf[SpecificMutableRow] +var i = 0 +while (i < requiredPartOrdinal.size) { + // TODO Avoids boxing cost here! + val partOrdinal = requiredPartOrdinal(i) + row.update(partitionKeyLocations(partOrdinal), partValues(partOrdinal)) + i += 1 +} +row + } +} else { + // Create a mutable row since we need to fill in values from partition columns. + val mutableRow = new GenericMutableRow(requestedSchema.size) --- End diff -- Seems it will be safer if we get `requestedSchema.size` out side of `mapPartitionsWithInputSplit`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/4792#discussion_r25533533 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala --- @@ -476,6 +476,10 @@ private[sql] case class ParquetRelation2( // When the data does not include the key and the key is requested then we must fill it in // based on information from the input split. if (!partitionKeysIncludedInDataSchema && partitionKeyLocations.nonEmpty) { + // This check if based on CatalystConverter.createRootConverter. --- End diff -- same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/4792#discussion_r25533517 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala --- @@ -126,6 +126,9 @@ private[sql] case class ParquetTableScan( conf) if (requestedPartitionOrdinals.nonEmpty) { + // This check if based on CatalystConverter.createRootConverter. --- End diff -- I made a typo at here... if => is. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76328427 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28039/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76328421 [Test build #28039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28039/consoleFull) for PR 4792 at commit [`cee55cf`](https://github.com/apache/spark/commit/cee55cfb849355e21363e6a4fdc155d27db88401). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76323022 [Test build #28039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28039/consoleFull) for PR 4792 at commit [`cee55cf`](https://github.com/apache/spark/commit/cee55cfb849355e21363e6a4fdc155d27db88401). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76322637 @marmbrus @yhuai Thanks for the suggestion and PR, much better! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76307166 https://github.com/liancheng/spark/pull/4 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76270205 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28012/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76270178 [Test build #28012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28012/consoleFull) for PR 4792 at commit [`ca6e038`](https://github.com/apache/spark/commit/ca6e0384f30b8b6d07840a2c555cfc6069126679). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76264043 Can't we do the same check that we do inside of parquet to see if the row is all primitives once and switch on that, instead of doing per tuple extraction and run time type checking? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76251578 @marmbrus As discussed offline, I tried to use `SpecificMutableRow` throughout the whole Parquet support implementation, but it touches to many places, which I think is too risky as 1.3 release is so close. So falled back to current implementation. I can do the refactoring later on master (mainly focus on the `CatalystConverter` class hierarchy). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4792#issuecomment-76250658 [Test build #28012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28012/consoleFull) for PR 4792 at commit [`ca6e038`](https://github.com/apache/spark/commit/ca6e0384f30b8b6d07840a2c555cfc6069126679). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5775] [SQL] BugFix: GenericRow cannot b...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/4792 [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table This PR adapts @anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-5775 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4792.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4792 commit ca6e0384f30b8b6d07840a2c555cfc6069126679 Author: Cheng Lian Date: 2015-02-26T19:24:37Z Fixes SPARK-5775 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org