[GitHub] spark pull request: [SPARK-14557][SQL] Reading textfile (created t...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/12356#issuecomment-209762085

I think we can eliminate the applyFilterIfNeeded method as well.
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-169523800

Thanks for the comments and the merge :)
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168934153

Fixed
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-168956141

Seems to be some other issue; tests in PySpark MLlib are failing?
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-169240395

Hey @marmbrus, I have reverted to 46e7419.
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-167422237

Hey @marmbrus, sorry for the delay on this update. I have added the same thing to the planner, and also rebased onto the latest master. How does it look now?
[GitHub] spark pull request: [SPARK-11878][SQL]: Eliminate distribute by in...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/9858#discussion_r46020227

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala ---

```
@@ -488,6 +488,12 @@ private[sql] case class EnsureRequirements(sqlContext: SQLContext) extends Rule[
   }

   def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
+    case operator @ Exchange(partitioning, child, _) =>
+      child.children match {
+        case Exchange(childPartitioning, baseChild, _) :: Nil =>
```

--- End diff --

Yes, I thought the same, but then it would again not be as generic as this, since SparkStrategies are applied first, and by that time we don't have the Exchanges added yet. So it would be similar to my previous change in the optimizer, in that it would check whether the child plan is an aggregate instead of testing for an Exchange. Would that be acceptable?
[GitHub] spark pull request: [SPARK-11878][SQL][WIP]: Eliminate distribute ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-159798845

@marmbrus Added the same thing to Exchange planning. How does it look now?
[GitHub] spark pull request: [SPARK-11878][SQL][WIP]: Eliminate distribute ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/9858#issuecomment-159507098

Thanks for the feedback! Let me take a look at the Exchange code.
[GitHub] spark pull request: SPARK-11878: Eliminate distribute by in case g...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/9858

SPARK-11878: Eliminate distribute by in case group by is present with exactly the same grouping expressions

For queries like:

```
select <> from table group by a distribute by a
```

we can eliminate the distribute by, since the group by will do a hash partitioning anyway. This also applies when the user goes through the DataFrame API, as long as the number of partitions in RepartitionByExpression is not specified (None).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark eliminatedistribute

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9858.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9858

commit a86feca6e2b9aaba9babed8854a39c97b59f34cd
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-11-20T07:43:47Z

SPARK-11878: Eliminate distribute by in case group by is present with exactly the same grouping expressions
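[Editor's note] The pattern being eliminated can be expressed as a small Catalyst rule. The sketch below is a hypothetical optimizer-side formulation of the idea only — the rule name is invented, and it assumes a RepartitionByExpression node carrying an optional partition count; the thread below ends up moving the actual fix into Exchange planning instead:

```
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, RepartitionByExpression}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop a user-requested hash repartition when the
// Aggregate directly above it will hash-partition on exactly the same
// expressions anyway, and no explicit partition count was requested.
object EliminateRedundantDistributeBy extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(groupingExprs, _,
        RepartitionByExpression(partitionExprs, child, None))
        if partitionExprs == groupingExprs =>
      agg.copy(child = child)
  }
}
```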
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-141396161

@yhuai thanks for the help!
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-141483053

thanks for the merge :)
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139615421

This time JsonHadoopFsRelationSuite fails! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42341/testReport/junit/org.apache.spark.sql.sources/JsonHadoopFsRelationSuite/test_all_data_types/
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139589217

phew! ohk :)
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139500144

Added comments.
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r39256778

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

@cloud-fan could you please explain a bit more when and how converting to "And" may not be an optimization? I was wondering whether it would actually result in any kind of performance hit. Also, could you tell how #8200 is more reasonable?
[GitHub] spark pull request: SPARK-7142: Incorporate review comments
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8716

SPARK-7142: Incorporate review comments

Adding changes suggested by @cloud-fan in #5700. cc @marmbrus

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark bool_simp

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8716.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8716

commit 2861453c92021ac0108267b67169ac9d2cd37192
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-11T10:29:34Z

SPARK-7142: Incorporate review comments
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-139519988

Some OrcHadoopFSRelationSuite test is failing. Can you help with this one, @liancheng? I don't understand; I just added a comment!
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5700#issuecomment-139138411

Added test cases.
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/8678#discussion_r39153871

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---

```
@@ -901,7 +901,7 @@ class DAGScheduler(
       // the stage as completed here in case there are no tasks to run
       markStageAsFinished(stage, None)

-      val debugString = stage match {
+      def debugString: String = stage match {
```

--- End diff --

Even if it's not heavy, we can easily make it lazy. By passing it directly, it will still always evaluate the expression first.
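[Editor's note] For context on the trade-off in this thread: Spark's logDebug takes its message by name, so the string is only built when debug logging is actually enabled. A minimal, self-contained illustration of that mechanism (not Spark code; the names here are invented for the demo):

```
object LazyLogDemo {
  var debugEnabled = false

  // msg is a by-name parameter: the string is built only if it is used,
  // which is the same trick Spark's Logging.logDebug relies on.
  def logDebug(msg: => String): Unit =
    if (debugEnabled) println(s"DEBUG: $msg")

  def main(args: Array[String]): Unit = {
    def expensive(): String = { println("building message"); "details" }
    logDebug(expensive())   // prints nothing: expensive() never runs
    debugEnabled = true
    logDebug(expensive())   // now both "building message" and "DEBUG: details" print
  }
}
```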
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139283491

@srowen thanks for the detailed explanation; passed the value directly.
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139331711

@andrewor14 regarding the "harder to read" part: initially I had only changed the val to a def; the rest was the same, so I am fairly sure it was not affecting readability, if that is your concern here. @srowen then suggested passing the value directly (I don't know why). I can revert to my original change if you prefer it that way. On "the reduction in delay is extremely negligible": agreed! But it is still there!
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/8678
[GitHub] spark pull request: [CORE][SPARK-10527]: Minor enhancement to eval...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8678#issuecomment-139459104

@andrewor14, @srowen thanks for the feedback. I was actually working on a very low-latency Spark job (300-500 ms) and thought it better to improve the obvious things. As suggested, it might not be applicable to a wider audience, so I am closing this PR.
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5700#issuecomment-139327218

thanks @marmbrus :)
[GitHub] spark pull request: SPARK-10527: Minor enhancement to evaluate de...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8678

SPARK-10527: Minor enhancement to evaluate debugString only when log level is debug in DAGScheduler

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark slog

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8678.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8678

commit e4c4c10db44cbec79e190265fc0351731ff664ec
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-10T02:14:25Z

SPARK-10527: Minor enhancement to evaluate debugString only when log level is debug in DAGScheduler
[GitHub] spark pull request: [WIP][SQL][SPARK-10451]: Prevent unnecessary s...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137932280

I get this failure:

```
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/SQLConfSuite.scala:83: not found: value ctx
[error]     assert(ctx.conf.numShufflePartitions === 10)
[error]            ^
[error] one error found
[error] (sql/test:compile) Compilation failed
[error] Total time: 108 s, completed Sep 5, 2015 1:47:01 AM
```
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137972091

thnx @rxin
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/8604#issuecomment-137830950

cc @liancheng, thoughts?
[GitHub] spark pull request: [SQL][SPARK-10451]: Prevent unnecessary serial...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/8604

[SQL][SPARK-10451]: Prevent unnecessary serializations in InMemoryColumnarTableScan

Many of the fields in InMemoryColumnarTableScan and InMemoryRelation can be made transient. This reduces my 1000 ms job to about 700 ms, and the serialized task size drops from 2.8 MB to about 1.3 MB.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark serde

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8604.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8604

commit 5afb9ebdf3ff2ae3321b89dd80f0207fe1e330a6
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-09-04T18:55:19Z

SPARK-10451: Prevent unnecessary serializations in InMemoryColumnarTableScan
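[Editor's note] To illustrate the mechanism behind this PR (a generic Java-serialization demo, not the actual Spark classes; all names below are invented): marking a driver-only field @transient keeps it out of the serialized object that gets shipped with each task.

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a plan node: plannerState is only needed on
// the driver, so @transient excludes it from Java serialization.
class Scan(@transient val plannerState: Array[Byte], val column: String)
  extends Serializable

object TransientDemo {
  def serializedSize(o: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(o)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val heavy = Array.fill[Byte](1 << 20)(0) // 1 MB of driver-only state
    // Stays tiny: the 1 MB array is skipped during serialization.
    println(serializedSize(new Scan(heavy, "col")))
  }
}
```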
[GitHub] spark pull request: [SPARK-6566][SQL]: Related changes for newer p...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-111014258

@liancheng does this look ok to you now?
[GitHub] spark pull request: [SPARK-6566][SQL]: Related changes for newer p...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-109370628

Incorporated review comments. Retest please.
[GitHub] spark pull request: [SPARK-6566][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5889#issuecomment-109170412

cc @liancheng I have rebased. Can we retest this? How do I determine what is failing?
[GitHub] spark pull request: [SPARK-7743] [SQL] Parquet 1.7
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/6597#issuecomment-108265494

Hey @liancheng, sounds ok to me. We can rebase once these changes are merged.
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5888

[SPARK-7340][SQL]: Change parquet version to latest release

This brings in a major improvement in that footers are no longer read on the driver. It also cleans up the code in ParquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark parquet_1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5888.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5888

commit 3e3cbf978f0980669cb5d7492dec38a0061c2974
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-05-04T12:14:14Z

SPARK-7340: Change parquet version to latest release
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5888#issuecomment-98741146

Is this some problem with Jenkins?

```
[info] Updating {file:/home/jenkins/workspace/SparkPullRequestBuilder/}core...
[error] oro#oro;2.0.8!oro.jar origin location must be absolute: file:/home/jenkins/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
java.lang.IllegalArgumentException: oro#oro;2.0.8!oro.jar origin location must be absolute: file:/home/jenkins/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
	at org.apache.ivy.util.Checks.checkAbsolute(Checks.java:57)
	at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.getArchiveFileInCache(DefaultRepositoryCacheManager.java:385)
	at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.download(DefaultRepositoryCacheManager.java:849)
	at org.apache.ivy.plugins.resolver.BasicResolver.download(BasicResolver.java:835)
	at org.apache.ivy.plugins.resolver.RepositoryResolver.download(RepositoryResolver.java:282)
	at org.apache.ivy.plugins.resolver.ChainResolver.download(ChainResolver.java:219)
	at org.apache.ivy.plugins.resolver.ChainResolver.download(ChainResolver.java:219)
	at org.apache.ivy.core.resolve.ResolveEngine.downloadArtifacts(ResolveEngine.java:388)
	at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:331)
	at org.apache.ivy.Ivy.resolve(Ivy.java:517)
	at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:266)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
	at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
	at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
	at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
	at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
	at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
	at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
	at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
	at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
	at xsbt.boot.Using$.withResource(Using.scala:10)
	at xsbt.boot.Using$.apply(Using.scala:9)
	at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
	at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
	at xsbt.boot.Locks$.apply0(Locks.scala:31)
	at xsbt.boot.Locks$.apply(Locks.scala:28)
	at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
	at sbt.IvySbt.withIvy(Ivy.scala:123)
	at sbt.IvySbt.withIvy(Ivy.scala:120)
	at sbt.IvySbt$Module.withModule(Ivy.scala:151)
	at sbt.IvyActions$.updateEither(IvyActions.scala:157)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
	at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
	at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
	at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
	at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
	at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
	at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
	at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:235)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[error] (core/*:update
```
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5888#issuecomment-98743527

opening PR against the original ticket
[GitHub] spark pull request: [SPARK-7340][SQL]: Change parquet version to l...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/5888
[GitHub] spark pull request: [SPARK-6566][SQL]: Change parquet version to l...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5889

[SPARK-6566][SQL]: Change parquet version to latest release

This brings in a major improvement in that footers are no longer read on the driver. It also cleans up the code in ParquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.

cc @liancheng, are there any other changes we need for this?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark parquet_1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5889.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5889

commit 3e3cbf978f0980669cb5d7492dec38a0061c2974
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-05-04T12:14:14Z

SPARK-7340: Change parquet version to latest release
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r29110517

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

So for example, suppose the filter is not(Or(l, r)), where r might be some filter on a partitioned column like part=12. In the present case this filter cannot be pushed down, since while evaluating it we would encounter a reference to the partitioned column. Whereas if this rule is applied, we get And(not(l), not(part=12)), and then not(l) might be pushed down, since splitting into conjunctive predicates is now possible.
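[Editor's note] To make the pushdown argument concrete, here is a tiny self-contained sketch — a toy expression tree, not Catalyst's actual classes — showing that the negated disjunction is one opaque predicate, while its De Morgan rewrite splits into two conjuncts that can be handled independently:

```
object PushdownDemo extends App {
  sealed trait Expr
  case class Pred(name: String) extends Expr
  case class Not(e: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr
  case class And(l: Expr, r: Expr) extends Expr

  // Mirrors the idea behind Catalyst's splitting of conjunctive predicates.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  val before = Not(Or(Pred("l"), Pred("part=12")))
  val after  = And(Not(Pred("l")), Not(Pred("part=12")))

  println(splitConjuncts(before)) // one opaque predicate: the whole Not(Or(...))
  println(splitConjuncts(after))  // two conjuncts: Not(l) is now pushable alone
}
```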
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5700#discussion_r29121549

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---

```
@@ -413,6 +418,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
         case LessThan(l, r) => GreaterThanOrEqual(l, r)
         // not(l <= r) => l > r
         case LessThanOrEqual(l, r) => GreaterThan(l, r)
+        // not(l || r) => not(l) && not(r)
+        case Or(l, r) => And(Not(l), Not(r))
+        // not(l && r) => not(l) or not(r)
+        case And(l, r) => Or(Not(l), Not(r))
```

--- End diff --

This is inside a case match:

```
case not @ Not(exp) => exp match {
  case Or(l, r) => And(Not(l), Not(r))
}
```
[GitHub] spark pull request: [SPARK-7142][SQL]: Minor enhancement to Boolea...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5700

[SPARK-7142][SQL]: Minor enhancement to BooleanSimplification Optimizer rule

Use these in the optimizer as well:

```
A and (not(A) or B)  =  A and B
not(A and B)         =  not(A) or not(B)
not(A or B)          =  not(A) and not(B)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark bool_simp

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5700.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5700

commit 3eb813e66281b53ca029faa7928cc6c50a69b509
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-25T12:45:48Z

SPARK-7142: Minor enhancement to BooleanSimplification Optimizer rule, using these rules: A and (not(A) or B) = A and B; not(A and B) = not(A) or not(B); not(A or B) = not(A) and not(B)
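[Editor's note] As a quick sanity check of the three identities (a throwaway snippet, not part of the PR), exhaustively testing every boolean assignment:

```
object BoolIdentitiesCheck extends App {
  val bools = Seq(true, false)
  for (a <- bools; b <- bools) {
    assert((a && (!a || b)) == (a && b)) // A and (not(A) or B) = A and B
    assert(!(a && b) == (!a || !b))      // not(A and B) = not(A) or not(B)
    assert(!(a || b) == (!a && !b))      // not(A or B) = not(A) and not(B)
  }
  println("all three identities hold for every assignment")
}
```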
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95678594

retest please
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5668

[SPARK-7097][SQL]: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

This PR adds support for better size estimation for partitioned tables, so that only the referred partitions' sizes are taken into consideration when testing against autoBroadcastJoinThreshold and deciding whether to create a broadcast join or a shuffled hash join.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark part_size

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5668.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5668

commit b0beb34d6a77c738660cb161306c947411d70ab5
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-23T17:58:17Z

SPARK-7097: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold
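[Editor's note] The core idea can be sketched in a few lines (hypothetical types and names, not the actual Spark implementation): sum the sizes of only those partitions that the query's partition predicates select, and compare that sum to the broadcast threshold.

```
object SizeEstimationSketch extends App {
  // Hypothetical metadata record for one Hive-style partition.
  case class PartitionMeta(spec: Map[String, String], sizeInBytes: Long)

  // Estimate table size from just the partitions a predicate selects.
  def estimatedSize(partitions: Seq[PartitionMeta])
                   (selected: Map[String, String] => Boolean): Long =
    partitions.filter(p => selected(p.spec)).map(_.sizeInBytes).sum

  val parts = Seq(
    PartitionMeta(Map("day" -> "2015-04-22"), 40L << 30), // 40 GB
    PartitionMeta(Map("day" -> "2015-04-23"), 8L << 20)   // 8 MB
  )
  val autoBroadcastJoinThreshold = 10L << 20 // 10 MB, the classic default

  // A query filtering day = '2015-04-23' touches only the small partition,
  // so the table can still qualify for a broadcast join.
  val size = estimatedSize(parts)(_.get("day").contains("2015-04-23"))
  println(size <= autoBroadcastJoinThreshold) // true
}
```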
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-92803029

thanks @marmbrus. Let me refactor this then and open another PR later.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/4764
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-92450385

Ok @liancheng, thanks for the comments. In the meantime, let me try to address your suggestions. Can we keep this open in a WIP state for now?
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5298#discussion_r28214791

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---

```
@@ -98,12 +98,32 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
     val metadata = new JHashMap[String, String]()
     val requestedAttributes = RowReadSupport.getRequestedSchema(configuration)

+    // convert fileSchema to attributes
+    val fileAttributes = ParquetTypesConverter.convertToAttributes(fileSchema, true, true)
```

--- End diff --

These booleans are for finding the datatype of the attribute, whereas here we are just interested in finding out the names of the columns, to reconcile with the metastore schema. Hence it is safe to always pass these parameters as true, since we do not have a SQL context here from which to derive them.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/5298#discussion_r28214717

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---

```
@@ -98,12 +98,32 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
     val metadata = new JHashMap[String, String]()
     val requestedAttributes = RowReadSupport.getRequestedSchema(configuration)

+    // convert fileSchema to attributes
+    val fileAttributes = ParquetTypesConverter.convertToAttributes(fileSchema, true, true)
+    val fileAttMap = fileAttributes.map(f => f.name.toLowerCase -> f.name).toMap
     if (requestedAttributes != null) {
+      // reconcile names of requested Attributes
+      val modRequestedAttributes = requestedAttributes.map(attr => {
+        val lName = attr.name.toLowerCase
+        if (fileAttMap.contains(lName)) {
+          attr.withName(fileAttMap(lName))
+        } else {
+          if (attr.nullable) {
+            attr
+          } else {
+            // field is not nullable but not present in the parquet file schema!!
+            // this is just a safety check since in hive all columns are nullable
+            // throw exception here
+            throw new RuntimeException(s"""Field ${attr.name} is non-nullable,
+              |but not found in parquet file schema: ${fileSchema}""".stripMargin)
+          }
+        }
+      })
```

--- End diff --

Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema happens on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-91996558

Hey @marmbrus, this property is needed because:

1. The 'hijacked' parquet read path would not use the mapreduce property while reading schema/footers; see the refresh method in ParquetRelation2.scala.
2. HiveTableScan would need the pathFilter while creating the RDD; see makeRDDForPartitionedTable in TableReader.scala.

These are not using the mapreduce.input.pathFilter property (even if it is set by the user), and hence the extra code.
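[Editor's note] For reference, the standard Hadoop hook under discussion looks like this — a generic illustration using the public Hadoop API, not Spark's wiring of it; the filter class here is invented:

```
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// A made-up filter that hides in-progress files from input listings.
class SkipTmpFiles extends PathFilter {
  override def accept(p: Path): Boolean = !p.getName.endsWith(".tmp")
}

object PathFilterSetup {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    // Registers the filter under the mapreduce.input.pathFilter.class key,
    // so FileInputFormat's listStatus skips the filtered paths.
    FileInputFormat.setInputPathFilter(job, classOf[SkipTmpFiles])
    println(job.getConfiguration.get("mapreduce.input.pathFilter.class"))
  }
}
```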
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-91997683

Hi @marmbrus, can you share the other plans for modifying aggregates that you mentioned earlier? Can I help with that? Otherwise I'll modify this one for now as you have suggested.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-92069665

Hey @liancheng, this change now reconciles the schema within the tasks. Do suggest. After that, I will remove the merge-schema functions that are no longer needed.
[GitHub] spark pull request: [SQL][SPARK-6742]: Don't push down predicates ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5390#issuecomment-91925973

Added a test case. Please test.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-91868767

please retest
[GitHub] spark pull request: [SQL][SPARK-6742]: Don't push down predicates ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5390

[SQL][SPARK-6742]: Don't push down predicates which reference partition column(s)

cc @liancheng

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark fpush

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5390.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5390

commit 8592acc665241e2304c77427df35221fa7bfc020
Author: Yash Datta <yash.da...@guavus.com>
Date: 2015-04-07T12:09:20Z

SPARK-6742: Don't push down predicates which reference partition column(s)
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89756295

Fixed the test failures caused by class cast exceptions. Please retest.
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4246#issuecomment-89779951

Thanks for the suggestions @marmbrus. I have refactored PathFilter creation into SQLContext and covered more instances of listStatus. Please review.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89604768 Fixed the test case for zero count when there is no data. Rebased onto the latest master. Please retest. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89632303 Hmm, I see. I will definitely go through these PRs. Anyway, I have fixed the whitespace problem here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-89613886 Hi @marmbrus , this is a pretty common scenario in production: the data is generated in some directory, and partitions are later added to tables using "alter table tablename add partition (col=value) location '<directory where data is generated>'", where the path does not contain the partition key=value segment. In the old parquet path in v1.2.1 this is not possible. It is doable in the new parquet path in spark 1.3, though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-4226]: Add support for subqueries ...
Github user saucam closed the pull request at: https://github.com/apache/spark/pull/3888 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-4226]: Add support for subqueries ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-89118699 Hey @marmbrus , thanks for the feedback. I'll close this one and work on another PR incorporating the changes you have suggested. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5298 [WIP][SQL][SPARK-6632]: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time Hey @liancheng, how about this approach for schema reconciliation, where we use the metastore schema and reconcile within the ReadSupport init function? This way we handle each input file in its own map task, and there is no need to read the schema from all part files and merge them before initiating the tasks. I have not removed the merge code for now. Let me know your thoughts on this one. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark SPARK-6632 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5298.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5298 commit 304daccb4e6b947eb10a8feb893ca5b47c42e16e Author: Yash Datta yash.da...@guavus.com Date: 2015-03-31T13:11:40Z SPARK-6632: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
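A rough sketch of the per-split reconciliation this PR proposes to run inside the ReadSupport init hook; the (name, type) pair representation and the helper are illustrative stand-ins, not the actual Spark or parquet-mr API:

    // Reconcile the metastore schema with the schema of one split's file, so
    // each map task resolves its own file locally instead of the driver
    // merging the schemas of every part file up front.
    def reconcileSchemas(
        metastoreSchema: Seq[(String, String)],
        fileSchema: Seq[(String, String)]): Seq[(String, String)] = {
      val fileColumns = fileSchema.map { case (name, tpe) => name.toLowerCase -> tpe }.toMap
      // Keep the metastore's names, types and ordering (it is the source of
      // truth), but only request columns that exist in this particular file.
      metastoreSchema.filter { case (name, _) => fileColumns.contains(name.toLowerCase) }
    }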
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-87182630 Thanks a lot @liancheng ! :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86900047 Sorry for so many queries. What if I simply skip reading the schema from the parquet part files and rely only on the metastore schema (I will pass it from the hive strategy to ParquetRelation)? Do you think that would cause issues? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86880657 Hi @liancheng , thanks for the references. I have already gone through these, but I was talking about ParquetRelation (the old parquet path, the default one in spark 1.2) and not ParquetRelation2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86971965 Thanks for confirming this. I hope there is no other reason for reconciling the schema? (In our use cases we can safely ensure that our schema is lowercase and that all columns are nullable, so it should be easy for me to use the metastore schema itself in the ParquetRelation.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86443584 Hi @liancheng , thanks for reviewing. One small query on a separate note: in the current implementation of mergeMetastoreParquetSchema, I see that part files from all the partitions are used to compute the merged parquet schema. Does this scale? If we have millions of partitions, doesn't this slow down every read query even when only a small number of partitions are referenced? I was wondering if we could change this to derive a unified schema just from the referenced partitions? (Though in that case I think we would need a summary file containing all the columns in the base path of the table.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-86832254 Hi @liancheng , we do have use cases where 100K partitions will be registered in tables (partitioned on timestamps, with data added as a new partition for every 5min interval), but it could be more in other cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-6471]: Metastore schema should onl...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5141#issuecomment-85857964 Fixed the test case. Added a new test case as well. Please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6471][SQL]: Metastore schema should onl...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5141 [SPARK-6471][SQL]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns Currently in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema. But to support cases like deleting a column using the replace columns command, we can relax the restriction so that the query still works when the metastore schema is a subset of the merged parquet schema. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark replace_col Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5141.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5141 commit 5f2f4674084b4f6202c0eb884b798f0980659b4b Author: Yash Datta yash.da...@guavus.com Date: 2015-03-23T17:35:45Z SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
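A minimal sketch of the relaxed check described above, with an illustrative Field type standing in for Catalyst's StructField: instead of demanding that the merged Parquet schema equal the metastore schema exactly, it is enough that every metastore column appears in the merged schema, which is exactly what a column drop via replace columns leaves behind.

    case class Field(name: String, dataType: String) // illustrative stand-in

    // True when the metastore schema is a subset of the merged Parquet schema.
    def metastoreIsSubset(metastore: Seq[Field], mergedParquet: Seq[Field]): Boolean = {
      val parquetTypes = mergedParquet.map(f => f.name.toLowerCase -> f.dataType).toMap
      metastore.forall(f => parquetTypes.get(f.name.toLowerCase).contains(f.dataType))
    }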
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-77828501 Hi @liancheng , any update on this one? I think it will be useful for people using spark 1.2.1, since the old parquet path might suit their needs better in that version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-77510275 please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76347215 please retest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76184234 Fixed the null count test failure. The optimization applies only when there is a single count distinct in the select clause. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-6006: Optimize count distinct for high c...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4764 SPARK-6006: Optimize count distinct for high cardinality columns Currently the plan for count distinct looks like this:

    Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
     Exchange SinglePartition
      Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
       !OutputFaker [snAppProtocol#448]
        ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This can be slow if there are too many distinct values in a column. This PR changes the above plan to:

    Aggregate false, [], [SUM(_c0#437L) AS totalCount#514L]
     Exchange SinglePartition
      Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
       Exchange (HashPartitioning [snAppProtocol#448], 200)
        Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
         !OutputFaker [snAppProtocol#448]
          ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This way, even if there are too many distinct values, we insert them into partial sets and the computation remains distributed and thus faster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark optcountdis Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4764.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4764 commit 3e6d227184451026dbfda9866ae1e114bde002b1 Author: Yash Datta yash.da...@guavus.com Date: 2015-02-25T12:09:01Z SPARK-6006: Optimize count distinct for high cardinality columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
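The rewrite above is logically the same as hash-distributing the distinct values before the final count. As a rough analogue of the plan shape (written against the later DataFrame API for brevity; this is an illustration, not the planner code itself):

    import org.apache.spark.sql.DataFrame

    // distinct() shuffles values by hash, so no single reducer has to hold
    // every distinct value; the final count sums the per-partition results,
    // mirroring the extra Exchange (HashPartitioning ...) step in the new plan.
    def distinctCount(df: DataFrame, column: String): Long =
      df.select(column).distinct().count()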
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-75952342 @marmbrus can you please advise how to rewrite this in a better way? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-76135270 can we test this again please ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-73839872 Hi @liancheng , thanks for the comments. We are using spark-1.2.1, where the old parquet support is used. Can this be merged so that we have proper partitioning with different locations as well? I tried partitioning on 2 columns and it worked fine (I also applied this patch for specifying a different location). On a different note: when I create a parquet table with a smallint type in spark, the schema used in parquet shows 'int32' type. Is that by design in spark, or is it a parquet limitation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/4469#discussion_r24315891

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/dataTypes.scala ---
@@ -362,7 +362,7 @@ case object BooleanType extends NativeType with PrimitiveType {
  * @group dataType
  */
 @DeveloperApi
-case object TimestampType extends NativeType {
+case object TimestampType extends NativeType with PrimitiveType {
--- End diff --

This is done because, when the table is partitioned on a timestamp type column, the parquet iterator returns a GenericRow, due to this in ParquetTypes.scala:

    def isPrimitiveType(ctype: DataType): Boolean =
      classOf[PrimitiveType] isAssignableFrom ctype.getClass

and in ParquetConverter.scala we have:

    protected[parquet] def createRootConverter(
        parquetSchema: MessageType,
        attributes: Seq[Attribute]): CatalystConverter = {
      // For non-nested types we use the optimized Row converter
      if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
        new CatalystPrimitiveRowConverter(attributes.toArray)
      } else {
        new CatalystGroupConverter(attributes.toArray)
      }
    }

which fails here later:

    new Iterator[Row] {
      def hasNext = iter.hasNext
      def next() = {
        val row = iter.next()._2.asInstanceOf[SpecificMutableRow]

throwing a ClassCastException that GenericRow cannot be cast to SpecificMutableRow. Am I missing something here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4469 SPARK-5684: Pass in partition name along with location information, as the location can be different (that is, it may not contain the partition keys) While parsing the partition keys from the locations in parquetRelations, it is assumed that the location path string will always contain the partition keys, which is not true. A different location can be specified while adding partitions to the table, which results in a key-not-found exception while reading from such partitions.

Create a partitioned parquet table:

    create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;

Add a partition to the table and specify a different location:

    alter table test_table add partition (timestamp=9) location '/data/pth/different'

Run a simple select * query and we get an exception:

    15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark partition_bug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4469.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4469 commit 5aeeb6db8a3651b7b13d641ec0ed0dea21025438 Author: Yash Datta yash.da...@guavus.com Date: 2015-02-09T08:53:40Z SPARK-5684: Pass in partition name along with location information, as the location can be different (that is may not contain the partition keys) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
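A small sketch of the central idea of the fix, with a hypothetical helper name: derive the partition key values from the partition name that the metastore already knows (e.g. "timestamp=9") instead of parsing them out of the location path, which under alter table ... location need not contain any key=value segments.

    // Parse "k1=v1/k2=v2" partition names into a key -> value map.
    def parsePartitionSpec(partitionName: String): Map[String, String] =
      partitionName.split("/").map { kv =>
        val Array(key, value) = kv.split("=", 2)
        key -> value
      }.toMap

    // parsePartitionSpec("timestamp=9") == Map("timestamp" -> "9")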
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/4469#discussion_r24316073

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -310,7 +310,10 @@ class SQLContext(@transient val sparkContext: SparkContext)
   @scala.annotation.varargs
   def parquetFile(path: String, paths: String*): DataFrame =
     if (conf.parquetUseDataSourceApi) {
-      baseRelationToDataFrame(parquet.ParquetRelation2(path +: paths, Map.empty)(this))
+      // not fixed for ParquetRelation2 !
+      val sPaths = path +: paths
+      baseRelationToDataFrame(parquet.ParquetRelation2(sPaths.map(p =>
+        p.split("-").head), Map.empty)(this))
--- End diff --

Please suggest how to proceed in the case of ParquetRelation2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5684: Pass in partition name along with ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-73478985 @liancheng please suggest ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL][SPARK-5453] Use property 'mapreduce.inpu...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4246 [SQL][SPARK-5453] Use property 'mapreduce.input.pathFilter.class' to set a custom filter class for input files This PR adds support for using a custom filter class for input files for queries. We can re-use the existing property in hive-site.xml for this. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark hive_site Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4246 commit 53e86c88890932f40502ab1c81647e321ba8 Author: Yash Datta yash.da...@guavus.com Date: 2015-01-28T10:43:21Z SPARK-5453: Use property 'mapreduce.input.pathFilter.class' to set a custom filter class for input files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
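A hedged sketch of how such a filter can be picked up and applied while listing input files; this illustrates the mechanism only, and the helper name is invented, not the PR's actual code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}
    import org.apache.hadoop.util.ReflectionUtils

    // List a directory, applying the PathFilter class configured under
    // 'mapreduce.input.pathFilter.class' (e.g. via hive-site.xml), if any.
    def listFilteredStatus(conf: Configuration, dir: Path): Seq[FileStatus] = {
      val fs = dir.getFileSystem(conf)
      val statuses = fs.listStatus(dir).toSeq
      val filterClass =
        conf.getClass("mapreduce.input.pathFilter.class", null, classOf[PathFilter])
      if (filterClass == null) statuses
      else {
        val filter = ReflectionUtils.newInstance(filterClass, conf)
        statuses.filter(s => filter.accept(s.getPath))
      }
    }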
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-71578360 fixed the styling issues. @liancheng thanks for the feedback! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-71428920 Added test case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4786: Parquet filter pushdown for castab...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/4156 SPARK-4786: Parquet filter pushdown for castable types Enable parquet filter pushdown of castable types like short, byte that can be cast to integer You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark filter_short Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4156.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4156 commit cb2e0d94102bedb961f403bdc2420fabc021fe1a Author: Yash Datta yash.da...@guavus.com Date: 2015-01-22T06:00:18Z SPARK-4786: Parquet filter pushdown for castable types --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
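An illustrative sketch of the technique (the helper is hypothetical, and the parquet package names are as in the parquet-mr releases of that era): parquet's filter2 API exposes no byte or short columns, so predicates on those columns can still be pushed down by widening the literal to Int and filtering on an int column.

    import parquet.filter2.predicate.{FilterApi, FilterPredicate}

    // Build an equality predicate, widening byte/short literals to Int so that
    // castable integral types can also be pushed down to the parquet reader.
    def makeEq(columnName: String, value: Any): Option[FilterPredicate] = value match {
      case b: Byte  => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(b.toInt)))
      case s: Short => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(s.toInt)))
      case i: Int   => Some(FilterApi.eq(FilterApi.intColumn(columnName), Int.box(i)))
      case _        => None // other types omitted in this sketch
    }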
[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4156#issuecomment-70976498 done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-69716627 Can we have some kind of hint mechanism in the query itself, for when the user knows the subquery is small? Then perhaps we can change the plan accordingly? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68853317 Hi Michael, thanks for the feedback.

1. Yes, it does not handle correlated queries. It definitely makes more sense to convert correlated queries to joins, but for uncorrelated queries I think that is too slow if the table size is large and the user is querying on smaller data. E.g. with 2 tables of ~45 million rows each, where the subquery returns only 90 rows:

    select * from Y1 where Y1.id in (select Y2.id from Y2 where Y2.id < 90);

takes about 12 seconds to run with this approach on a single machine (--executor-memory 16G --driver-memory 8G). Following the join approach, the query is changed to:

    select * from Y1 left semi join (select Y2.id as sqc0 from Y2 where id < 90) subquery on Y1.id = subquery.sqc0;

which takes 660 seconds to run on the same machine.

2. This approach can handle arbitrary nesting of subqueries:

    select * from Y1 where Y1.id in (select Y2.id where Y2.timestamp in (select Y3.timestamp limit 20))

Can we take some hybrid approach from the two? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3888 SPARK-4226: Add support for subqueries in where in clause This PR adds support for subqueries in the where-in clause by adding a dynamic filter class that first computes the value list from the subquery, then creates a hash set and uses it as input to the InSet class. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark subquery_where_clause Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3888.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3888 commit 4019e0d6e0bf31a123f2817eb964562891211635 Author: Yash Datta yash.da...@guavus.com Date: 2015-01-04T09:06:55Z SPARK-4226: Add support for subqueries in where in clause --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
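A minimal sketch of the strategy, using the DataFrame-style API of later Spark versions as a stand-in for the PR's catalyst-level dynamic filter (the names here are illustrative): run the uncorrelated subquery eagerly, collect its single output column into a set, and filter the outer relation with a hash-based membership test.

    import org.apache.spark.sql.DataFrame

    def rewriteInSubquery(outer: DataFrame, column: String, subquery: DataFrame): DataFrame = {
      // Execute the subquery first and materialize its values on the driver.
      val values: Array[Any] = subquery.collect().map(_.get(0)).distinct
      // Replace the subquery with a set-membership test on the outer column.
      outer.filter(outer.col(column).isin(values: _*))
    }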
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68626500 Hi @marmbrus , can you please take a look and suggest changes? I have tested a few queries, and this approach looks simpler than an already existing PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4968: takeOrdered to skip reduce step in...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3830 SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions takeOrdered should skip the reduce step in case the mapped RDDs have no partitions. This prevents the following exception, hit when running a query such as:

    SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;

Error trace:

    java.lang.UnsupportedOperationException: empty collection
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
        at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark fix_takeorder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3830.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3830 commit 5974d10c619dac2ca2433d331e43ed48e6822f90 Author: Yash Datta yash.da...@guavus.com Date: 2014-12-29T19:06:32Z SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
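A simplified sketch of the guard this PR adds, using a plain sort-and-take per partition instead of Spark's internal BoundedPriorityQueue (so the helper is illustrative, not the merged code): when the mapped RDD has no partitions there is nothing to reduce, and takeOrdered should return an empty array rather than let RDD.reduce throw on an empty collection.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def takeOrderedSafe[T: Ordering : ClassTag](rdd: RDD[T], num: Int): Array[T] = {
      val ord = implicitly[Ordering[T]]
      // Each partition keeps only its `num` smallest elements.
      val mapRDDs: RDD[Array[T]] = rdd.mapPartitions { items =>
        Iterator.single(items.toArray.sorted(ord).take(num))
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty[T] // the guard: no partitions at all, so skip the reduce step
      } else {
        mapRDDs.reduce { (a, b) => (a ++ b).sorted(ord).take(num) }
      }
    }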
[GitHub] spark pull request: SPARK-4762: Add support for tuples in 'where i...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3618 SPARK-4762: Add support for tuples in 'where in' clause query Currently, in the where-in clause the filter is applied only on a single column. We can enhance it to accept a filter on multiple columns. So the current support is for queries like:

    Select * from table where c1 in (value1, value2, ... value n);

This adds support for queries like:

    Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), (value1', value2', ... value n'))

It also adds an optimized version of the where-in clause for tuples, where we create a hash set of the filter tuples for matching rows. This also requires a change in the hive parser, since currently there is no support for multiple columns in the IN clause. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark tuple_where_clause Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3618.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3618 commit c877926c64c7c6f2048d31759f35446c9cec1cdc Author: Yash Datta yash.da...@guavus.com Date: 2014-12-05T08:55:29Z SPARK-4762: 1. Add support for tuples in 'where in' clause query 2. Also adds optimized version of the same, which uses hashset to filter rows --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
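An illustrative sketch of the optimized evaluation described above, with plain Scala collections standing in for the PR's expression classes: the literal tuples are materialized into a hash set once, and each row's column tuple is then tested for membership.

    // The IN list from: where (c1, c2) in ((1, 'a'), (2, 'b'))
    val allowed: Set[(Int, String)] = Set((1, "a"), (2, "b"))

    // Rows of (c1, c2, some other column); keep rows whose (c1, c2) is in the set.
    val rows = Seq((1, "a", 0.5), (3, "c", 1.0))
    val matched = rows.filter { case (c1, c2, _) => allowed.contains((c1, c2)) }
    // matched == Seq((1, "a", 0.5))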
[GitHub] spark pull request: SPARK-4762: Add support for tuples in 'where i...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3618#issuecomment-65764552 @pwendell this PR requires a change in the hive parser, for which I created a PR against the hive trunk here: https://github.com/apache/hive/pull/25 Can you please advise whether I need to open this request against some other branch that is used for the spark build? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4365: Remove unnecessary filter call on ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3229#issuecomment-63162128 Thanks everyone! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4365: Remove unnecessary filter call on ...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3229 SPARK-4365: Remove unnecessary filter call on records returned from parquet library Since the parquet library has been updated, we no longer need to filter the records returned from the parquet library for null records, as the library now skips those. From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

    public boolean nextKeyValue() throws IOException, InterruptedException {
      boolean recordFound = false;

      while (!recordFound) {
        // no more records left
        if (current >= total) { return false; }

        try {
          checkRead();
          currentValue = recordReader.read();
          current ++;
          if (recordReader.shouldSkipCurrentRecord()) {
            // this record is being filtered via the filter2 package
            if (DEBUG) LOG.debug("skipping record");
            continue;
          }
          if (currentValue == null) {
            // only happens with FilteredRecordReader at end of block
            current = totalCountLoadedSoFar;
            if (DEBUG) LOG.debug("filtered record reader reached end of block");
            continue;
          }
          recordFound = true;
          if (DEBUG) LOG.debug("read value: " + currentValue);
        } catch (RuntimeException e) {
          throw new ParquetDecodingException(
              format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
        }
      }
      return true;
    }

You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark remove_filter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3229.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3229 commit 8909ae921db25971259d3c4463af7af8db4a4152 Author: Yash Datta yash.da...@guavus.com Date: 2014-11-12T14:12:12Z SPARK-4365: Remove unnecessary filter call on records returned from parquet library --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-61052070 Yes. In the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops row groups. So if I am using yarn and the data is huge (the metadata is large), the memory is consumed on the yarn side; but in the case of the client-side metadata strategy, the whole of the metadata is read on a single node before the tasks are spawned. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-61216332 @marmbrus , @mateiz thanks for all the help ! @marmbrus you may want to close this ticket as well : https://issues.apache.org/jira/browse/SPARK-1847 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org