Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-17 Thread Jimmy Xiang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/
---

(Updated April 17, 2015, 2:54 p.m.)


Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.


Bugs: HIVE-10302
https://issues.apache.org/jira/browse/HIVE-10302


Repository: hive-git


Description (updated)
---

Cached the small table containter so that mapjoin tasks can use it if the task 
is executed on the same Spark executor.
The cache is released right before the next job after the mapjoin job is done.


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java fe108c4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 
7e33a3f 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 
059016d 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java 
97b3471 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 72ab913 

Diff: https://reviews.apache.org/r/33251/diff/


Testing
---

Ran several queries in live cluster. ptest pending.


Thanks,

Jimmy Xiang



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-20 Thread Jimmy Xiang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/
---

(Updated April 21, 2015, 1:37 a.m.)


Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.


Changes
---

Changed the assumption. The small tables are cache only for the same work.


Bugs: HIVE-10302
https://issues.apache.org/jira/browse/HIVE-10302


Repository: hive-git


Description
---

Cached the small table containter so that mapjoin tasks can use it if the task 
is executed on the same Spark executor.
The cache is released right before the next job after the mapjoin job is done.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java fe108c4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
3f240f5 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java 
97b3471 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 72ab913 

Diff: https://reviews.apache.org/r/33251/diff/


Testing
---

Ran several queries in live cluster. ptest pending.


Thanks,

Jimmy Xiang



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-21 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/#review80969
---



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java


1. For clarity, it might be good to put this in a separate private method.
2. Does it work if we just synchronize on mapJoinTables[pos]?



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java


Method naming, see below.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java


Using thread-local makes me a little nervous, but let's discuss about this 
offline.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java


The method name suggests no indication of a side effect of setting thread 
local value. We'd better put this outside of this method.

In addition, the method name seems also a little confusing in that it 
suggests cleanup is for sure but in fact it's conditional.


- Xuefu Zhang


On April 21, 2015, 1:37 a.m., Jimmy Xiang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33251/
> ---
> 
> (Updated April 21, 2015, 1:37 a.m.)
> 
> 
> Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-10302
> https://issues.apache.org/jira/browse/HIVE-10302
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Cached the small table containter so that mapjoin tasks can use it if the 
> task is executed on the same Spark executor.
> The cache is released right before the next job after the mapjoin job is done.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 
> fe108c4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
> 3f240f5 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java 
> 97b3471 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 
> 72ab913 
> 
> Diff: https://reviews.apache.org/r/33251/diff/
> 
> 
> Testing
> ---
> 
> Ran several queries in live cluster. ptest pending.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-22 Thread Jimmy Xiang


> On April 21, 2015, 2:17 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java, line 
> > 120
> > 
> >
> > 1. For clarity, it might be good to put this in a separate private 
> > method.
> > 2. Does it work if we just synchronize on mapJoinTables[pos]?

1. Have put it in a separate method. 2. Probabl it does work to synchronize on 
mapJoinTables[pos] since it is not the same object in different tasks.


> On April 21, 2015, 2:17 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java, 
> > line 52
> > 
> >
> > Method naming, see below.

Fixed.  Changed the method a little so that it doesn't do more than clean up 
small table caches.


> On April 21, 2015, 2:17 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java, line 
> > 51
> > 
> >
> > Using thread-local makes me a little nervous, but let's discuss about 
> > this offline.

Per our discussion, we don't use thread-local any more.


> On April 21, 2015, 2:17 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java, line 
> > 60
> > 
> >
> > The method name suggests no indication of a side effect of setting 
> > thread local value. We'd better put this outside of this method.
> > 
> > In addition, the method name seems also a little confusing in that it 
> > suggests cleanup is for sure but in fact it's conditional.

Removed the thread local variable usage.  As to the method name, it means we 
clean up the cache a little, it doesn't mean we remove all the cahced contents. 
I added some javadoc to make it a little clear.


- Jimmy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/#review80969
---


On April 21, 2015, 1:37 a.m., Jimmy Xiang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33251/
> ---
> 
> (Updated April 21, 2015, 1:37 a.m.)
> 
> 
> Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-10302
> https://issues.apache.org/jira/browse/HIVE-10302
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Cached the small table containter so that mapjoin tasks can use it if the 
> task is executed on the same Spark executor.
> The cache is released right before the next job after the mapjoin job is done.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 
> fe108c4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
> 3f240f5 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkRecordHandler.java 
> 97b3471 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 
> 72ab913 
> 
> Diff: https://reviews.apache.org/r/33251/diff/
> 
> 
> Testing
> ---
> 
> Ran several queries in live cluster. ptest pending.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-22 Thread Jimmy Xiang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/
---

(Updated April 22, 2015, 4:36 p.m.)


Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.


Changes
---

Addressed Xuefu's review comments: removed threadlocal variable, added some 
javadoc, fixed some code clarification issue.
In this patch, we still clean up cache based on work id so that we can avoid 
extra memory usage for other works in the same job. Unfortunately, this means, 
if there are other works running in parallel with the mapjoin work, the cache 
may be released when it can still be kept for a while.


Bugs: HIVE-10302
https://issues.apache.org/jira/browse/HIVE-10302


Repository: hive-git


Description
---

Cached the small table containter so that mapjoin tasks can use it if the task 
is executed on the same Spark executor.
The cache is released right before the next job after the mapjoin job is done.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java fe108c4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HivePairFlatMapFunction.java 
2f137f9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
3f240f5 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 72ab913 

Diff: https://reviews.apache.org/r/33251/diff/


Testing
---

Ran several queries in live cluster. ptest pending.


Thanks,

Jimmy Xiang



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-23 Thread Jimmy Xiang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/
---

(Updated April 23, 2015, 5:48 p.m.)


Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.


Bugs: HIVE-10302
https://issues.apache.org/jira/browse/HIVE-10302


Repository: hive-git


Description
---

Cached the small table containter so that mapjoin tasks can use it if the task 
is executed on the same Spark executor.
The cache is released right before the next job after the mapjoin job is done.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java fe108c4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HivePairFlatMapFunction.java 
2f137f9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
3f240f5 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 72ab913 

Diff: https://reviews.apache.org/r/33251/diff/


Testing
---

Ran several queries in live cluster. ptest pending.


Thanks,

Jimmy Xiang



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-23 Thread Jimmy Xiang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/
---

(Updated April 23, 2015, 11:22 p.m.)


Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.


Bugs: HIVE-10302
https://issues.apache.org/jira/browse/HIVE-10302


Repository: hive-git


Description
---

Cached the small table containter so that mapjoin tasks can use it if the task 
is executed on the same Spark executor.
The cache is released right before the next job after the mapjoin job is done.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java fe108c4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HivePairFlatMapFunction.java 
2f137f9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SmallTableCache.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 72ab913 

Diff: https://reviews.apache.org/r/33251/diff/


Testing
---

Ran several queries in live cluster. ptest pending.


Thanks,

Jimmy Xiang



Re: Review Request 33251: HIVE-10302 Cache small tables in memory [Spark Branch]

2015-04-23 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33251/#review81444
---

Ship it!


Ship It!

- Xuefu Zhang


On April 23, 2015, 11:22 p.m., Jimmy Xiang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33251/
> ---
> 
> (Updated April 23, 2015, 11:22 p.m.)
> 
> 
> Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-10302
> https://issues.apache.org/jira/browse/HIVE-10302
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Cached the small table containter so that mapjoin tasks can use it if the 
> task is executed on the same Spark executor.
> The cache is released right before the next job after the mapjoin job is done.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 
> fe108c4 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HivePairFlatMapFunction.java 
> 2f137f9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SmallTableCache.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 
> 72ab913 
> 
> Diff: https://reviews.apache.org/r/33251/diff/
> 
> 
> Testing
> ---
> 
> Ran several queries in live cluster. ptest pending.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>