-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27117/#review58183
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java
<https://reviews.apache.org/r/27117/#comment99118>

    We don't need this, as this class is only used for Spark.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java
<https://reviews.apache.org/r/27117/#comment99120>

    Let's use a less conflicting name, such as SPARK_MAP_IO_CONTEXT. Same 
below. Better to define a constant in SparkUtils.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java
<https://reviews.apache.org/r/27117/#comment99122>

    We may need to copy other fields in IOContext besides input path.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java
<https://reviews.apache.org/r/27117/#comment99124>

    Same as above.



ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java
<https://reviews.apache.org/r/27117/#comment99126>

    We need to copy every field.
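
    To make the intent concrete, here is a minimal sketch of what I mean.
    The class and field names below are illustrative stand-ins, not the
    actual IOContext members; the point is to centralize the copy in one
    helper so callers cannot silently miss a field when a new one is added:

    ```java
    // Hedged sketch: a simplified stand-in for IOContext (the real class
    // lives in org.apache.hadoop.hive.ql.io and has more fields).
    public class IOContextCopySketch {

        static class IOContext {
            long currentBlockStart;
            long currentRow;
            boolean isBlockPointer;
            String inputPath;

            // Copy every field from another instance, not just inputPath.
            void copyFrom(IOContext other) {
                this.currentBlockStart = other.currentBlockStart;
                this.currentRow = other.currentRow;
                this.isBlockPointer = other.isBlockPointer;
                this.inputPath = other.inputPath;
            }
        }

        public static void main(String[] args) {
            IOContext src = new IOContext();
            src.currentBlockStart = 128L;
            src.currentRow = 42L;
            src.isBlockPointer = true;
            src.inputPath = "hdfs://example/table/part-00000";

            IOContext dst = new IOContext();
            dst.copyFrom(src);

            // All fields, not only the input path, survive the copy.
            System.out.println(dst.currentRow);
            System.out.println(dst.inputPath);
        }
    }
    ```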


- Xuefu Zhang


On Oct. 23, 2014, 11:56 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27117/
> -----------------------------------------------------------
> 
> (Updated Oct. 23, 2014, 11:56 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-8457
>     https://issues.apache.org/jira/browse/HIVE-8457
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Currently, on the Spark branch, each thread is bound to a thread-local 
> IOContext, which gets initialized when we generate an input HadoopRDD and is 
> later used in MapOperator, FilterOperator, etc.
> Given the introduction of HIVE-8118, we may have multiple downstream 
> RDDs that share the same input HadoopRDD, and we would like the 
> HadoopRDD to be cached, to avoid scanning the same table multiple times. A 
> typical case looks like the following:
>      inputRDD     inputRDD
>         |            |
>        MT_11        MT_12
>         |            |
>        RT_1         RT_2
> Here, MT_11 and MT_12 are MapTrans from a split MapWork,
> and RT_1 and RT_2 are two ReduceTrans. Note that this example is simplified, 
> as we may also have a ShuffleTran between a MapTran and a ReduceTran.
> When multiple Spark threads are running, MT_11 may be executed first. Its 
> asking for an iterator from the HadoopRDD will trigger the creation of the 
> iterator, which in turn triggers the initialization of the IOContext 
> associated with that particular thread.
> Now, the problem is: before MT_12 starts executing, it will also ask for an 
> iterator from the HadoopRDD, and since the RDD is already cached, instead of 
> creating a new iterator, it will just fetch it from the cached result. However, 
> this skips the initialization of the IOContext associated with this particular 
> thread. Then, when MT_12 starts executing and tries to initialize the 
> MapOperator, it will fail miserably, because the IOContext is not initialized.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java 
> 20ea977 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
> 00a6f3d 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 
> 58e1ceb 
> 
> Diff: https://reviews.apache.org/r/27117/diff/
> 
> 
> Testing
> -------
> 
> All multi-insertion related tests are passing on my local machine.
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
