[GitHub] spark pull request: [SPARK-14557][SQL] Reading textfile (created t...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/12356#issuecomment-209762085

I think we can eliminate the applyFilterIfNeeded method as well.
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-169523800

Thanks for the comments and the merge :)
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168934153

Fixed
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168956141

Seems to be some other issue; tests in PySpark MLlib are failing?
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-169240395

Hey @marmbrus, I have reverted to 46e7419.
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-167422237

Hey @marmbrus, sorry for the delay on this update. I have added the same thing to the planner, and also rebased onto the latest master. How does it look now?
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/9858#discussion_r46020227

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala ---

```
@@ -488,6 +488,12 @@ private[sql] case class EnsureRequirements(sqlContext: SQLContext) extends Rule[
   }

   def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
+    case operator @ Exchange(partitioning, child, _) =>
+      child.children match {
+        case Exchange(childPartitioning, baseChild, _) :: Nil =>
```

--- End diff --

Yes, I thought the same, but then it would again not be as generic as this, since SparkStrategies are applied first, and by that time we don't have the Exchanges added yet. So it would be similar to my previous change in the optimizer, in that it would check whether the child plan is an aggregate instead of testing for an Exchange. Would that be acceptable?
[GitHub] spark pull request: [SPARK-11878][SQL][WIP]: Eliminate distribute ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-159798845

@marmbrus Added the same thing to Exchange planning. How does it look now?
[GitHub] spark pull request: [SPARK-11878][SQL][WIP]: Eliminate distribute ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-159507098

Thanks for the feedback! Let me take a look at the Exchange code.
[GitHub] spark pull request: SPARK-11878: Eliminate distribute by in case g...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/9858

SPARK-11878: Eliminate distribute by in case group by is present with exactly the same grouping expressions

For queries like:

```
select <> from table group by a distribute by a
```

we can eliminate the distribute by, since the group by will do a hash partitioning anyway. This also applies when the user goes through the DataFrame API, as long as the number of partitions in RepartitionByExpression is not specified (None).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark eliminatedistribute

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9858.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9858

commit a86feca6e2b9aaba9babed8854a39c97b59f34cd
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-11-20T07:43:47Z

SPARK-11878: Eliminate distribute by in case group by is present with exactly the same grouping expressions
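[Editor's note] The pattern being eliminated can be expressed as a small Catalyst rule. The sketch below is a hypothetical optimizer-side formulation of the idea only — the rule name is invented, and it assumes a RepartitionByExpression node carrying an optional partition count; the thread below ends up moving the actual fix into Exchange planning instead:

```
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, RepartitionByExpression}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop a user-requested hash repartition when the
// Aggregate directly above it will hash-partition on exactly the same
// expressions anyway, and no explicit partition count was requested.
object EliminateRedundantDistributeBy extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(groupingExprs, _,
        RepartitionByExpression(partitionExprs, child, None))
        if partitionExprs == groupingExprs =>
      agg.copy(child = child)
  }
}
```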
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-141396161

@yhuai thanks for the help!
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-141483053

thanks for the merge :)
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139615421

This time JsonHadoopFsRelationSuite fails! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42341/testReport/junit/org.apache.spark.sql.sources/JsonHadoopFsRelationSuite/test_all_data_types/
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139589217

phew! ohk :)
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139500144

Added comments.
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r39256778

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

@cloud-fan could you please explain a bit more when and how converting to "And" may not be an optimization? I was wondering whether it would actually result in any kind of performance hit. Also, could you tell how #8200 is more reasonable?
[GitHub] spark pull request: SPARK-7142: Incorporate review comments
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8716

SPARK-7142: Incorporate review comments

Adding changes suggested by @cloud-fan in #5700. cc @marmbrus

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark bool_simp

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8716.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8716

commit 2861453c92021ac0108267b67169ac9d2cd37192
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-11T10:29:34Z

SPARK-7142: Incorporate review comments
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139519988

Some OrcHadoopFSRelationSuite test is failing. Can you help with this one, @liancheng? I don't understand; I just added a comment!
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5700#issuecomment-139138411

Added test cases.
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/8678#discussion_r39153871

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---

```
@@ -901,7 +901,7 @@ class DAGScheduler(
       // the stage as completed here in case there are no tasks to run
       markStageAsFinished(stage, None)

-      val debugString = stage match {
+      def debugString: String = stage match {
```

--- End diff --

Even if it's not heavy, we can easily make it lazy. By passing it directly, it will still always evaluate the expression first.
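[Editor's note] For context on the trade-off in this thread: Spark's logDebug takes its message by name, so the string is only built when debug logging is actually enabled. A minimal, self-contained illustration of that mechanism (not Spark code; the names here are invented for the demo):

```
object LazyLogDemo {
  var debugEnabled = false

  // msg is a by-name parameter: the string is built only if it is used,
  // which is the same trick Spark's Logging.logDebug relies on.
  def logDebug(msg: => String): Unit =
    if (debugEnabled) println(s"DEBUG: $msg")

  def main(args: Array[String]): Unit = {
    def expensive(): String = { println("building message"); "details" }
    logDebug(expensive())   // prints nothing: expensive() never runs
    debugEnabled = true
    logDebug(expensive())   // now both "building message" and "DEBUG: details" print
  }
}
```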
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139283491

@srowen thanks for the detailed explanation; passed the value directly.
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139331711

@andrewor14 regarding the "harder to read" part: initially I had only changed the val to a def; the rest was the same, so I am fairly sure it was not affecting readability, if that is your concern here. @srowen then suggested passing the value directly (I don't know why). I can revert to my original change if you prefer it that way. On "the reduction in delay is extremely negligible": agreed! But it is still there!
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/8678
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139459104

@andrewor14, @srowen thanks for the feedback. I was actually working on a very low-latency Spark job (300-500 ms) and thought it better to improve the obvious things. As suggested, it might not be applicable to a wider audience, so I am closing this PR.
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5700#issuecomment-139327218

thanks @marmbrus :)
[GitHub] spark pull request: SPARK-10527: Minor enhancement to evaluate de...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8678

SPARK-10527: Minor enhancement to evaluate debugString only when log level is debug in DAGScheduler

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark slog

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8678.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8678

commit e4c4c10db44cbec79e190265fc0351731ff664ec
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-10T02:14:25Z

SPARK-10527: Minor enhancement to evaluate debugString only when log level is debug in DAGScheduler
[GitHub] spark pull request: [WIP][SQL][SPARK-10451]: Prevent unnecessary s...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137932280

I get this failure:

```
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/SQLConfSuite.scala:83: not found: value ctx
[error]     assert(ctx.conf.numShufflePartitions === 10)
[error]            ^
[error] one error found
[error] (sql/test:compile) Compilation failed
[error] Total time: 108 s, completed Sep 5, 2015 1:47:01 AM
```
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137972091

thnx @rxin
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137830950

cc @liancheng, thoughts?
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8604

[SQL][SPARK-10451]: Prevent unnecessary serializations in InMemoryColumnarTableScan

Many of the fields in InMemoryColumnarTableScan and InMemoryRelation can be made transient. This reduces my 1000 ms job to about 700 ms, and the serialized task size drops from 2.8 MB to about 1.3 MB.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark serde

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8604.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8604

commit 5afb9ebdf3ff2ae3321b89dd80f0207fe1e330a6
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-04T18:55:19Z

SPARK-10451: Prevent unnecessary serializations in InMemoryColumnarTableScan
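[Editor's note] To illustrate the mechanism behind this PR (a generic Java-serialization demo, not the actual Spark classes; all names below are invented): marking a driver-only field @transient keeps it out of the serialized object that gets shipped with each task.

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a plan node: plannerState is only needed on
// the driver, so @transient excludes it from Java serialization.
class Scan(@transient val plannerState: Array[Byte], val column: String)
  extends Serializable

object TransientDemo {
  def serializedSize(o: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(o)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val heavy = Array.fill[Byte](1 << 20)(0) // 1 MB of driver-only state
    // Stays tiny: the 1 MB array is skipped during serialization.
    println(serializedSize(new Scan(heavy, "col")))
  }
}
```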
[GitHub] spark pull request: [SPARK-6566][SQL]: Related changes for newer p...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-111014258

@liancheng does this look ok to you now?
[GitHub] spark pull request: [SPARK-6566][SQL]: Related changes for newer p...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-109370628

Incorporated review comments. Retest please.
[GitHub] spark pull request: [SPARK-6566][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-109170412

cc @liancheng I have rebased. Can we retest this? How do I determine what is failing?
[GitHub] spark pull request: [SPARK-7743] [SQL] Parquet 1.7
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/6597#issuecomment-108265494

Hey @liancheng, sounds ok to me. We can rebase once these changes are merged.
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5888

[SPARK-7340][SQL]: Change parquet version to latest release

This brings in a major improvement in that footers are no longer read on the driver. It also cleans up the code in ParquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark parquet_1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5888.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5888

commit 3e3cbf978f0980669cb5d7492dec38a0061c2974
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-05-04T12:14:14Z

SPARK-7340: Change parquet version to latest release
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5888#issuecomment-98741146

Is this some problem with Jenkins?

```
[info] Updating {file:/home/jenkins/workspace/SparkPullRequestBuilder/}core...
[error] oro#oro;2.0.8!oro.jar origin location must be absolute: file:/home/jenkins/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
java.lang.IllegalArgumentException: oro#oro;2.0.8!oro.jar origin location must be absolute: file:/home/jenkins/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
	at org.apache.ivy.util.Checks.checkAbsolute(Checks.java:57)
	at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.getArchiveFileInCache(DefaultRepositoryCacheManager.java:385)
	at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.download(DefaultRepositoryCacheManager.java:849)
	at org.apache.ivy.plugins.resolver.BasicResolver.download(BasicResolver.java:835)
	at org.apache.ivy.plugins.resolver.RepositoryResolver.download(RepositoryResolver.java:282)
	at org.apache.ivy.plugins.resolver.ChainResolver.download(ChainResolver.java:219)
	at org.apache.ivy.plugins.resolver.ChainResolver.download(ChainResolver.java:219)
	at org.apache.ivy.core.resolve.ResolveEngine.downloadArtifacts(ResolveEngine.java:388)
	at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:331)
	at org.apache.ivy.Ivy.resolve(Ivy.java:517)
	at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:266)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
	at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
	at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
	at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
	at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
	at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
	at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
	at xsbt.boot.Using$.withResource(Using.scala:10)
	at xsbt.boot.Using$.apply(Using.scala:9)
	at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
	at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
	at xsbt.boot.Locks$.apply0(Locks.scala:31)
	at xsbt.boot.Locks$.apply(Locks.scala:28)
	at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
	at sbt.IvySbt.withIvy(Ivy.scala:123)
	at sbt.IvySbt.withIvy(Ivy.scala:120)
	at sbt.IvySbt$Module.withModule(Ivy.scala:151)
	at sbt.IvyActions$.updateEither(IvyActions.scala:157)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
	at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
	at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
	at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:235)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[error] (core/*:update
```
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5888#issuecomment-98743527

opening PR against the original ticket
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/5888
[GitHub] spark pull request: [SPARK-6566][SQL]: Change parquet version to l...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5889

[SPARK-6566][SQL]: Change parquet version to latest release

This brings in a major improvement in that footers are no longer read on the driver. It also cleans up the code in ParquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.

cc @liancheng, are there any other changes we need for this?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark parquet_1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5889.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5889

commit 3e3cbf978f0980669cb5d7492dec38a0061c2974
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-05-04T12:14:14Z

SPARK-7340: Change parquet version to latest release
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r29110517

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

So for example, suppose the filter is not(Or(l, r)), where r might be some filter on a partitioned column like part=12. In the present case this filter cannot be pushed down, since while evaluating it we would encounter a reference to the partitioned column. Whereas if this rule is applied, we get And(not(l), not(part=12)), and then not(l) might be pushed down, since splitting into conjunctive predicates is now possible.
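[Editor's note] To make the pushdown argument concrete, here is a tiny self-contained sketch — a toy expression tree, not Catalyst's actual classes — showing that the negated disjunction is one opaque predicate, while its De Morgan rewrite splits into two conjuncts that can be handled independently:

```
object PushdownDemo extends App {
  sealed trait Expr
  case class Pred(name: String) extends Expr
  case class Not(e: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr
  case class And(l: Expr, r: Expr) extends Expr

  // Mirrors the idea behind Catalyst's splitting of conjunctive predicates.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  val before = Not(Or(Pred("l"), Pred("part=12")))
  val after  = And(Not(Pred("l")), Not(Pred("part=12")))

  println(splitConjuncts(before)) // one opaque predicate: the whole Not(Or(...))
  println(splitConjuncts(after))  // two conjuncts: Not(l) is now pushable alone
}
```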
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r29121549

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

This is inside a case match:

```
case not @ Not(exp) => exp match {
  case Or(l, r) => And(Not(l), Not(r))
}
```
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5700

[SPARK-7142][SQL]: Minor enhancement to BooleanSimplification Optimizer rule

Use these in the optimizer as well:

```
A and (not(A) or B)  =  A and B
not(A and B)         =  not(A) or not(B)
not(A or B)          =  not(A) and not(B)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark bool_simp

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5700.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5700

commit 3eb813e66281b53ca029faa7928cc6c50a69b509
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-25T12:45:48Z

SPARK-7142: Minor enhancement to BooleanSimplification Optimizer rule, using these rules: A and (not(A) or B) = A and B; not(A and B) = not(A) or not(B); not(A or B) = not(A) and not(B)
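[Editor's note] As a quick sanity check of the three identities (a throwaway snippet, not part of the PR), exhaustively testing every boolean assignment:

```
object BoolIdentitiesCheck extends App {
  val bools = Seq(true, false)
  for (a <- bools; b <- bools) {
    assert((a && (!a || b)) == (a && b)) // A and (not(A) or B) = A and B
    assert(!(a && b) == (!a || !b))      // not(A and B) = not(A) or not(B)
    assert(!(a || b) == (!a && !b))      // not(A or B) = not(A) and not(B)
  }
  println("all three identities hold for every assignment")
}
```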
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95678594

retest please
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5668

[SPARK-7097][SQL]: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

This PR adds support for better size estimation for partitioned tables, so that only the referred partitions' sizes are taken into consideration when testing against autoBroadcastJoinThreshold and deciding whether to create a broadcast join or a shuffled hash join.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark part_size

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5668.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5668

commit b0beb34d6a77c738660cb161306c947411d70ab5
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-23T17:58:17Z

SPARK-7097: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold
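[Editor's note] The core idea can be sketched in a few lines (hypothetical types and names, not the actual Spark implementation): sum the sizes of only those partitions that the query's partition predicates select, and compare that sum to the broadcast threshold.

```
object SizeEstimationSketch extends App {
  // Hypothetical metadata record for one Hive-style partition.
  case class PartitionMeta(spec: Map[String, String], sizeInBytes: Long)

  // Estimate table size from just the partitions a predicate selects.
  def estimatedSize(partitions: Seq[PartitionMeta])
                   (selected: Map[String, String] => Boolean): Long =
    partitions.filter(p => selected(p.spec)).map(_.sizeInBytes).sum

  val parts = Seq(
    PartitionMeta(Map("day" -> "2015-04-22"), 40L << 30), // 40 GB
    PartitionMeta(Map("day" -> "2015-04-23"), 8L << 20)   // 8 MB
  )
  val autoBroadcastJoinThreshold = 10L << 20 // 10 MB, the classic default

  // A query filtering day = '2015-04-23' touches only the small partition,
  // so the table can still qualify for a broadcast join.
  val size = estimatedSize(parts)(_.get("day").contains("2015-04-23"))
  println(size <= autoBroadcastJoinThreshold) // true
}
```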
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-92803029

thanks @marmbrus. Let me refactor this then and open another PR later.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/4764
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-92450385

Ok @liancheng, thanks for the comments. In the meantime, let me try to address your suggestions. Can we keep this open in a WIP state for now?
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5298#discussion_r28214791

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---

```
@@ -98,12 +98,32 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
     val metadata = new JHashMap[String, String]()
     val requestedAttributes = RowReadSupport.getRequestedSchema(configuration)

+    // convert fileSchema to attributes
+    val fileAttributes = ParquetTypesConverter.convertToAttributes(fileSchema, true, true)
```

--- End diff --

These booleans are for finding the datatype of the attribute, whereas here we are just interested in finding out the names of the columns, to reconcile with the metastore schema. Hence it is safe to always pass these parameters as true, since we do not have a SQL context here from which to derive them.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5298#discussion_r28214717

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---

```
@@ -98,12 +98,32 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
     val metadata = new JHashMap[String, String]()
     val requestedAttributes = RowReadSupport.getRequestedSchema(configuration)

+    // convert fileSchema to attributes
+    val fileAttributes = ParquetTypesConverter.convertToAttributes(fileSchema, true, true)
+    val fileAttMap = fileAttributes.map(f => f.name.toLowerCase -> f.name).toMap
     if (requestedAttributes != null) {
+      // reconcile names of requested Attributes
+      val modRequestedAttributes = requestedAttributes.map(attr => {
+        val lName = attr.name.toLowerCase
+        if (fileAttMap.contains(lName)) {
+          attr.withName(fileAttMap(lName))
+        } else {
+          if (attr.nullable) {
+            attr
+          } else {
+            // field is not nullable but not present in the parquet file schema!!
+            // this is just a safety check since in hive all columns are nullable
+            // throw exception here
+            throw new RuntimeException(s"""Field ${attr.name} is non-nullable,
+              |but not found in parquet file schema: ${fileSchema}""".stripMargin)
+          }
+        }
+      })
```

--- End diff --

Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema happens on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-91996558

Hey @marmbrus, this property is needed because:

1. The 'hijacked' parquet read path would not use the mapreduce property while reading schema/footers; see the refresh method in ParquetRelation2.scala.
2. HiveTableScan would need the pathFilter while creating the RDD; see makeRDDForPartitionedTable in TableReader.scala.

These are not using the mapreduce.input.pathFilter property (even if it is set by the user), and hence the extra code.
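[Editor's note] For reference, the standard Hadoop hook under discussion looks like this — a generic illustration using the public Hadoop API, not Spark's wiring of it; the filter class here is invented:

```
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// A made-up filter that hides in-progress files from input listings.
class SkipTmpFiles extends PathFilter {
  override def accept(p: Path): Boolean = !p.getName.endsWith(".tmp")
}

object PathFilterSetup {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    // Registers the filter under the mapreduce.input.pathFilter.class key,
    // so FileInputFormat's listStatus skips the filtered paths.
    FileInputFormat.setInputPathFilter(job, classOf[SkipTmpFiles])
    println(job.getConfiguration.get("mapreduce.input.pathFilter.class"))
  }
}
```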
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-91997683

Hi @marmbrus, can you share the other plans for modifying aggregates that you mentioned earlier? Can I help with that? Otherwise I'll modify this one for now as you have suggested.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-92069665

Hey @liancheng, this change now reconciles the schema within the tasks. Do suggest. After that, I will remove the merge-schema functions that are no longer needed.
[GitHub] spark pull request: [SQL][SPARK-6742]: Don't push down predicates ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5390#issuecomment-91925973

Added a test case. Please test.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-91868767

please retest
[GitHub] spark pull request: [SQL][SPARK-6742]: Don't push down predicates ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5390

[SQL][SPARK-6742]: Don't push down predicates which reference partition column(s)

cc @liancheng

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark fpush

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5390.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5390

commit 8592acc665241e2304c77427df35221fa7bfc020
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-07T12:09:20Z

SPARK-6742: Don't push down predicates which reference partition column(s)
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89756295

Fixed the test failures caused by class cast exceptions. Please retest.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-89779951

Thanks for the suggestions @marmbrus. I have refactored PathFilter creation into SQLContext and covered more instances of listStatus. Please review.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89604768 Fixed the test case for zero count when there is no data. Rebased onto the latest master. Please retest. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89632303 Hmm, I see. I will definitely go through these PRs. Anyway, I have fixed the whitespace problem here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-89613886 Hi @marmbrus , this is a pretty common scenario in production: the data is generated in some directory, and partitions are later added to tables using "alter table tablename add partition (col=value) location '<directory where data is generated>'", where the path does not contain the partition key=value segment. In the old parquet path in v1.2.1 this is not possible. It is doable in the new parquet path in spark 1.3, though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-4226]: Add support for subqueries ...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/3888 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-4226]: Add support for subqueries ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-89118699 Hey @marmbrus , thanks for the feedback. I'll close this one and work on another PR incorporating the changes you have suggested. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5298 [WIP][SQL][SPARK-6632]: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time Hey @liancheng, how about this approach for schema reconciliation, where we use the metastore schema and reconcile within the ReadSupport init function? This way we handle each input file in its own map task, and there is no need to read the schema from all part files and merge them before initiating the tasks. I have not removed the merge code for now. Let me know your thoughts on this one. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark SPARK-6632 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5298.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5298 commit 304daccb4e6b947eb10a8feb893ca5b47c42e16e Author: Yash Datta yash.da...@guavus.com Date: 2015-03-31T13:11:40Z SPARK-6632: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
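A rough sketch of the per-split reconciliation this PR proposes to run inside the ReadSupport init hook; the (name, type) pair representation and the helper are illustrative stand-ins, not the actual Spark or parquet-mr API:

    // Reconcile the metastore schema with the schema of one split's file, so
    // each map task resolves its own file locally instead of the driver
    // merging the schemas of every part file up front.
    def reconcileSchemas(
        metastoreSchema: Seq[(String, String)],
        fileSchema: Seq[(String, String)]): Seq[(String, String)] = {
      val fileColumns = fileSchema.map { case (name, tpe) => name.toLowerCase -> tpe }.toMap
      // Keep the metastore's names, types and ordering (it is the source of
      // truth), but only request columns that exist in this particular file.
      metastoreSchema.filter { case (name, _) => fileColumns.contains(name.toLowerCase) }
    }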
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-87182630 Thanks a lot @liancheng ! :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86900047 Sorry for so many queries. What if I simply skip reading the schema from the parquet part files and rely only on the metastore schema (I will pass it from the hive strategy to ParquetRelation)? Do you think that would cause issues? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86880657 Hi @liancheng , thanks for the references. I have already gone through these, but I was talking about ParquetRelation (the old parquet path, the default one in spark 1.2) and not ParquetRelation2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86971965 Thanks for confirming this. I hope there is no other reason for reconciling the schema? (In our use cases we can safely ensure that our schema is lowercase and that all columns are nullable, so it should be easy for me to use the metastore schema itself in the ParquetRelation.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86443584 Hi @liancheng , thanks for reviewing. One small query on a separate note: in the current implementation of mergeMetastoreParquetSchema, I see that part files from all the partitions are used to compute the merged parquet schema. Does this scale? If we have millions of partitions, doesn't this slow down every read query even when only a small number of partitions are referenced? I was wondering if we could change this to derive a unified schema just from the referenced partitions? (Though in that case I think we would need a summary file containing all the columns in the base path of the table.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86832254 Hi @liancheng , we do have use cases where 100K partitions will be registered in tables (partitioned on timestamps, with data added as a new partition for every 5min interval), but it could be more in other cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-85857964 Fixed the test case. Added a new test case as well. Please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6471][SQL]: Metastore schema should onl...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5141 [SPARK-6471][SQL]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns Currently in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema. But to support cases like deleting a column using the replace columns command, we can relax the restriction so that the query still works when the metastore schema is a subset of the merged parquet schema. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark replace_col Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5141.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5141 commit 5f2f4674084b4f6202c0eb884b798f0980659b4b Author: Yash Datta yash.da...@guavus.com Date: 2015-03-23T17:35:45Z SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
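A minimal sketch of the relaxed check described above, with an illustrative Field type standing in for Catalyst's StructField: instead of demanding that the merged Parquet schema equal the metastore schema exactly, it is enough that every metastore column appears in the merged schema, which is exactly what a column drop via replace columns leaves behind.

    case class Field(name: String, dataType: String) // illustrative stand-in

    // True when the metastore schema is a subset of the merged Parquet schema.
    def metastoreIsSubset(metastore: Seq[Field], mergedParquet: Seq[Field]): Boolean = {
      val parquetTypes = mergedParquet.map(f => f.name.toLowerCase -> f.dataType).toMap
      metastore.forall(f => parquetTypes.get(f.name.toLowerCase).contains(f.dataType))
    }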
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-77828501 Hi @liancheng , any update on this one? I think it will be useful for people using spark 1.2.1, since the old parquet path might suit their needs better in that version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-77510275 please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76347215 please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76184234 Fixed the null count test failure. The optimization applies only when there is a single count distinct in the select clause. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-6006: Optimize count distinct for high c...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4764 SPARK-6006: Optimize count distinct for high cardinality columns Currently the plan for count distinct looks like this:

    Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
     Exchange SinglePartition
      Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
       !OutputFaker [snAppProtocol#448]
        ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This can be slow if there are too many distinct values in a column. This PR changes the above plan to:

    Aggregate false, [], [SUM(_c0#437L) AS totalCount#514L]
     Exchange SinglePartition
      Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
       Exchange (HashPartitioning [snAppProtocol#448], 200)
        Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
         !OutputFaker [snAppProtocol#448]
          ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This way, even if there are too many distinct values, we insert them into partial sets and the computation remains distributed and thus faster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark optcountdis Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4764.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4764 commit 3e6d227184451026dbfda9866ae1e114bde002b1 Author: Yash Datta yash.da...@guavus.com Date: 2015-02-25T12:09:01Z SPARK-6006: Optimize count distinct for high cardinality columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
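The rewrite above is logically the same as hash-distributing the distinct values before the final count. As a rough analogue of the plan shape (written against the later DataFrame API for brevity; this is an illustration, not the planner code itself):

    import org.apache.spark.sql.DataFrame

    // distinct() shuffles values by hash, so no single reducer has to hold
    // every distinct value; the final count sums the per-partition results,
    // mirroring the extra Exchange (HashPartitioning ...) step in the new plan.
    def distinctCount(df: DataFrame, column: String): Long =
      df.select(column).distinct().count()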
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-75952342 @marmbrus can you please advise how to rewrite this in a better way? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76135270 can we test this again please ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-73839872 Hi @liancheng , thanks for the comments. We are using spark-1.2.1, where the old parquet support is used. Can this be merged so that we have proper partitioning with different locations as well? I tried partitioning on 2 columns and it worked fine (I also applied this patch for specifying a different location). On a different note: when I create a parquet table with a smallint type in spark, the schema used in parquet shows 'int32' type. Is that by design in spark, or is it a parquet limitation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/4469#discussion_r24315891

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
@@ -362,7 +362,7 @@ case object BooleanType extends NativeType with PrimitiveType {
  * @group dataType
  */
 @DeveloperApi
-case object TimestampType extends NativeType {
+case object TimestampType extends NativeType with PrimitiveType {
--- End diff --

This is done because, when the table is partitioned on a timestamp type column, the parquet iterator returns a GenericRow, due to this in ParquetTypes.scala:

    def isPrimitiveType(ctype: DataType): Boolean =
      classOf[PrimitiveType] isAssignableFrom ctype.getClass

and in ParquetConverter.scala we have:

    protected[parquet] def createRootConverter(
        parquetSchema: MessageType,
        attributes: Seq[Attribute]): CatalystConverter = {
      // For non-nested types we use the optimized Row converter
      if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
        new CatalystPrimitiveRowConverter(attributes.toArray)
      } else {
        new CatalystGroupConverter(attributes.toArray)
      }
    }

which fails here later:

    new Iterator[Row] {
      def hasNext = iter.hasNext
      def next() = {
        val row = iter.next()._2.asInstanceOf[SpecificMutableRow]

throwing a ClassCastException that GenericRow cannot be cast to SpecificMutableRow. Am I missing something here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4469 SPARK-5684: Pass in partition name along with location information, as the location can be different (that is, it may not contain the partition keys) While parsing the partition keys from the locations in parquetRelations, it is assumed that the location path string will always contain the partition keys, which is not true. A different location can be specified while adding partitions to the table, which results in a key-not-found exception while reading from such partitions.

Create a partitioned parquet table:

    create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;

Add a partition to the table and specify a different location:

    alter table test_table add partition (timestamp=9) location '/data/pth/different'

Run a simple select * query and we get an exception:

    15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark partition_bug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4469.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4469 commit 5aeeb6db8a3651b7b13d641ec0ed0dea21025438 Author: Yash Datta yash.da...@guavus.com Date: 2015-02-09T08:53:40Z SPARK-5684: Pass in partition name along with location information, as the location can be different (that is may not contain the partition keys) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
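A small sketch of the central idea of the fix, with a hypothetical helper name: derive the partition key values from the partition name that the metastore already knows (e.g. "timestamp=9") instead of parsing them out of the location path, which under alter table ... location need not contain any key=value segments.

    // Parse "k1=v1/k2=v2" partition names into a key -> value map.
    def parsePartitionSpec(partitionName: String): Map[String, String] =
      partitionName.split("/").map { kv =>
        val Array(key, value) = kv.split("=", 2)
        key -> value
      }.toMap

    // parsePartitionSpec("timestamp=9") == Map("timestamp" -> "9")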
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/4469#discussion_r24316073

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -310,7 +310,10 @@ class SQLContext(@transient val sparkContext: SparkContext)
   @scala.annotation.varargs
   def parquetFile(path: String, paths: String*): DataFrame =
     if (conf.parquetUseDataSourceApi) {
-      baseRelationToDataFrame(parquet.ParquetRelation2(path +: paths, Map.empty)(this))
+      // not fixed for ParquetRelation2 !
+      val sPaths = path +: paths
+      baseRelationToDataFrame(parquet.ParquetRelation2(sPaths.map(p =>
+        p.split("-").head), Map.empty)(this))
--- End diff --

Please suggest how to proceed in the case of ParquetRelation2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-73478985 @liancheng please suggest ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4246 [SQL][SPARK-5453] Use property 'mapreduce.input.pathFilter.class' to set a custom filter class for input files This PR adds support for using a custom filter class for input files for queries. We can re-use the existing property in hive-site.xml for this. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark hive_site Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4246 commit 53e86c88890932f40502ab1c81647e321ba8 Author: Yash Datta yash.da...@guavus.com Date: 2015-01-28T10:43:21Z SPARK-5453: Use property 'mapreduce.input.pathFilter.class' to set a custom filter class for input files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
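A hedged sketch of how such a filter can be picked up and applied while listing input files; this illustrates the mechanism only, and the helper name is invented, not the PR's actual code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}
    import org.apache.hadoop.util.ReflectionUtils

    // List a directory, applying the PathFilter class configured under
    // 'mapreduce.input.pathFilter.class' (e.g. via hive-site.xml), if any.
    def listFilteredStatus(conf: Configuration, dir: Path): Seq[FileStatus] = {
      val fs = dir.getFileSystem(conf)
      val statuses = fs.listStatus(dir).toSeq
      val filterClass =
        conf.getClass("mapreduce.input.pathFilter.class", null, classOf[PathFilter])
      if (filterClass == null) statuses
      else {
        val filter = ReflectionUtils.newInstance(filterClass, conf)
        statuses.filter(s => filter.accept(s.getPath))
      }
    }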
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-71578360 fixed the styling issues. @liancheng thanks for the feedback! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-71428920 Added test case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4786: Parquet filter pushdown for castab...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4156 SPARK-4786: Parquet filter pushdown for castable types Enable parquet filter pushdown of castable types like short, byte that can be cast to integer You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark filter_short Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4156.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4156 commit cb2e0d94102bedb961f403bdc2420fabc021fe1a Author: Yash Datta yash.da...@guavus.com Date: 2015-01-22T06:00:18Z SPARK-4786: Parquet filter pushdown for castable types --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
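An illustrative sketch of the technique (the helper is hypothetical, and the parquet package names are as in the parquet-mr releases of that era): parquet's filter2 API exposes no byte or short columns, so predicates on those columns can still be pushed down by widening the literal to Int and filtering on an int column.

    import parquet.filter2.predicate.{FilterApi, FilterPredicate}

    // Build an equality predicate, widening byte/short literals to Int so that
    // castable integral types can also be pushed down to the parquet reader.
    def makeEq(columnName: String, value: Any): Option[FilterPredicate] = value match {
      case b: Byte  => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(b.toInt)))
      case s: Short => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(s.toInt)))
      case i: Int   => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(i)))
      case _        => None // other types omitted in this sketch
    }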
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-70976498 done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-69716627 Can we have some kind of hint mechanism in the query itself, for when the user knows the subquery is small? Then perhaps we can change the plan accordingly? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68853317 Hi Michael, thanks for the feedback.

1. Yes, it does not handle correlated queries. It definitely makes more sense to convert correlated queries to joins, but for uncorrelated queries I think that is too slow if the table size is large and the user is querying on smaller data. E.g. with 2 tables of ~45 million rows each, where the subquery returns only 90 rows:

    select * from Y1 where Y1.id in (select Y2.id from Y2 where Y2.id < 90);

takes about 12 seconds to run with this approach on a single machine (--executor-memory 16G --driver-memory 8G). Following the join approach, the query is changed to:

    select * from Y1 left semi join (select Y2.id as sqc0 from Y2 where id < 90) subquery on Y1.id = subquery.sqc0;

which takes 660 seconds to run on the same machine.

2. This approach can handle arbitrary nesting of subqueries:

    select * from Y1 where Y1.id in (select Y2.id where Y2.timestamp in (select Y3.timestamp limit 20))

Can we take some hybrid approach from the two? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3888 SPARK-4226: Add support for subqueries in where in clause This PR adds support for subqueries in the where-in clause by adding a dynamic filter class that first computes the value list from the subquery, then creates a hash set and uses it as input to the InSet class. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark subquery_where_clause Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3888.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3888 commit 4019e0d6e0bf31a123f2817eb964562891211635 Author: Yash Datta yash.da...@guavus.com Date: 2015-01-04T09:06:55Z SPARK-4226: Add support for subqueries in where in clause --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
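A minimal sketch of the strategy, using the DataFrame-style API of later Spark versions as a stand-in for the PR's catalyst-level dynamic filter (the names here are illustrative): run the uncorrelated subquery eagerly, collect its single output column into a set, and filter the outer relation with a hash-based membership test.

    import org.apache.spark.sql.DataFrame

    def rewriteInSubquery(outer: DataFrame, column: String, subquery: DataFrame): DataFrame = {
      // Execute the subquery first and materialize its values on the driver.
      val values: Array[Any] = subquery.collect().map(_.get(0)).distinct
      // Replace the subquery with a set-membership test on the outer column.
      outer.filter(outer.col(column).isin(values: _*))
    }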
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68626500 Hi @marmbrus , can you please take a look and suggest changes? I have tested a few queries, and this approach looks simpler than an already existing PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4968: takeOrdered to skip reduce step in...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3830 SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions takeOrdered should skip the reduce step in case the mapped RDDs have no partitions. This prevents the following exception, hit when running a query such as:

    SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;

Error trace:

    java.lang.UnsupportedOperationException: empty collection
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
        at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark fix_takeorder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3830.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3830 commit 5974d10c619dac2ca2433d331e43ed48e6822f90 Author: Yash Datta yash.da...@guavus.com Date: 2014-12-29T19:06:32Z SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
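A simplified sketch of the guard this PR adds, using a plain sort-and-take per partition instead of Spark's internal BoundedPriorityQueue (so the helper is illustrative, not the merged code): when the mapped RDD has no partitions there is nothing to reduce, and takeOrdered should return an empty array rather than let RDD.reduce throw on an empty collection.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def takeOrderedSafe[T: Ordering : ClassTag](rdd: RDD[T], num: Int): Array[T] = {
      val ord = implicitly[Ordering[T]]
      // Each partition keeps only its `num` smallest elements.
      val mapRDDs: RDD[Array[T]] = rdd.mapPartitions { items =>
        Iterator.single(items.toArray.sorted(ord).take(num))
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty[T] // the guard: no partitions at all, so skip the reduce step
      } else {
        mapRDDs.reduce { (a, b) => (a ++ b).sorted(ord).take(num) }
      }
    }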
[GitHub] spark pull request: SPARK-4762: Add support for tuples in 'where i...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3618 SPARK-4762: Add support for tuples in 'where in' clause query Currently, in the where-in clause the filter is applied only on a single column. We can enhance it to accept a filter on multiple columns. So the current support is for queries like:

    Select * from table where c1 in (value1, value2, ... value n);

This adds support for queries like:

    Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), (value1', value2', ... value n'))

It also adds an optimized version of the where-in clause for tuples, where we create a hash set of the filter tuples for matching rows. This also requires a change in the hive parser, since currently there is no support for multiple columns in the IN clause. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark tuple_where_clause Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3618.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3618 commit c877926c64c7c6f2048d31759f35446c9cec1cdc Author: Yash Datta yash.da...@guavus.com Date: 2014-12-05T08:55:29Z SPARK-4762: 1. Add support for tuples in 'where in' clause query 2. Also adds optimized version of the same, which uses hashset to filter rows --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
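An illustrative sketch of the optimized evaluation described above, with plain Scala collections standing in for the PR's expression classes: the literal tuples are materialized into a hash set once, and each row's column tuple is then tested for membership.

    // The IN list from: where (c1, c2) in ((1, 'a'), (2, 'b'))
    val allowed: Set[(Int, String)] = Set((1, "a"), (2, "b"))

    // Rows of (c1, c2, some other column); keep rows whose (c1, c2) is in the set.
    val rows = Seq((1, "a", 0.5), (3, "c", 1.0))
    val matched = rows.filter { case (c1, c2, _) => allowed.contains((c1, c2)) }
    // matched == Seq((1, "a", 0.5))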
[GitHub] spark pull request: SPARK-4762: Add support for tuples in 'where i...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3618#issuecomment-65764552 @pwendell this PR requires a change in the hive parser, for which I created a PR against the hive trunk here: https://github.com/apache/hive/pull/25 Can you please advise whether I need to open this request against some other branch that is used for the spark build? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4365: Remove unnecessary filter call on ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3229#issuecomment-63162128 Thanks everyone! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4365: Remove unnecessary filter call on ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3229 SPARK-4365: Remove unnecessary filter call on records returned from parquet library Since the parquet library has been updated, we no longer need to filter the records returned from the parquet library for null records, as the library now skips those. From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

    public boolean nextKeyValue() throws IOException, InterruptedException {
      boolean recordFound = false;

      while (!recordFound) {
        // no more records left
        if (current >= total) { return false; }

        try {
          checkRead();
          currentValue = recordReader.read();
          current ++;
          if (recordReader.shouldSkipCurrentRecord()) {
            // this record is being filtered via the filter2 package
            if (DEBUG) LOG.debug("skipping record");
            continue;
          }
          if (currentValue == null) {
            // only happens with FilteredRecordReader at end of block
            current = totalCountLoadedSoFar;
            if (DEBUG) LOG.debug("filtered record reader reached end of block");
            continue;
          }
          recordFound = true;
          if (DEBUG) LOG.debug("read value: " + currentValue);
        } catch (RuntimeException e) {
          throw new ParquetDecodingException(
              format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
        }
      }
      return true;
    }

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark remove_filter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3229.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3229 commit 8909ae921db25971259d3c4463af7af8db4a4152 Author: Yash Datta yash.da...@guavus.com Date: 2014-11-12T14:12:12Z SPARK-4365: Remove unnecessary filter call on records returned from parquet library --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-61052070 Yes. In the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops row groups. So if I am using yarn and the data is huge (the metadata is large), the memory is consumed on the yarn side; but in the case of the client-side metadata strategy, the whole of the metadata is read on a single node before the tasks are spawned. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-61216332 @marmbrus , @mateiz thanks for all the help ! @marmbrus you may want to close this ticket as well : https://issues.apache.org/jira/browse/SPARK-1847 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org