Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

Marta Kuczora Wed, 12 Jul 2017 07:06:02 -0700


> On July 11, 2017, 11:14 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 305-308 (patched)
> > <https://reviews.apache.org/r/60432/diff/2/?file=1763797#file1763797line305>
> >
> >     Is the if statement necessary? Is there a problem with always doing 
> > this?
> >     
> >     I think it would make more sense to always do this. Right now the code 
> > assumes `mvTask.getWork() == mrAnvMvTask.getChildTasks().get(0).getWork()` 
> > which doesn't seem like a smart idea. If `mvAndMvTask` is the tasks that is 
> > going to be added to `resTask` then it would make more sense to just use 
> > the `MoveWork` from `mvAndMvTask.getChildTasks.get(0)`


Thanks a lot for the review Sahil!
I absolutely agree with you that it would be better to get the move task from 
mergeAndMoveMoveTask always. The reason why I added the if is that I wasn't 
100% sure if there cannot be any use-case when this approach is not working. So 
I wanted to be cautious and only change the behavior for the blobstore 
optimization use-case and leave it as it was for the other use-cases. But if 
you think that this caution is not necessary, I an happy to remove the if 
condition. 
Do you think it would make sense to upload a patch where I always use the child 
task and see if all tests are successful in the pre-commit run?


- Marta


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
-----------------------------------------------------------


On June 27, 2017, 1:02 p.m., Marta Kuczora wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60432/
> -----------------------------------------------------------
> 
> (Updated June 27, 2017, 1:02 p.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-16845
>     https://issues.apache.org/jira/browse/HIVE-16845
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following steps lead to the NPE in the 
> ConditionalResolverMergeFiles.generateActualTasks method:
> 
> In the GenMapRedUtils.createCondTask method, the tasks for the merge, move 
> and "merge and move" use cases are created and set as task list to the 
> ConditionalWork. Originally the moveOnlyMoveTask and the mergeAndMoveMoveTask 
> was created from the same moveWork, which was the dummyWork created like this 
> in the createMRWorkForMergingFiles method:
> 
>     MoveWork dummyMv = new MoveWork(null, null, null,
>          new LoadFileDesc(fsInputDesc.getFinalDirName(), finalName, true, 
> null, null), false);
> 
> 
> Then in the ConditionalResolverMergeFiles.generateActualTasks method we get 
> these tasks and use them to create result "resTsks" list.
> 
> For the "merge and move" use case, the code looks like this:
> 
>     if (toMove.size() > 0) {
>         resTsks.add(mrAndMvTask);
> 
>         MoveWork mvWork = (MoveWork) mvTask.getWork();
>         LoadFileDesc lfd = mvWork.getLoadFileWork();
>         
>         ...
>         
>         LoadMultiFilesDesc lmfd = new LoadMultiFilesDesc(toMove,
>             targetDirs, lfd.getIsDfsDir(), lfd.getColumns(), 
> lfd.getColumnTypes());
>         mvWork.setLoadFileWork(null);
>         mvWork.setLoadTableWork(null);
>         mvWork.setMultiFilesDesc(lmfd);
>       }
> 
> It adds the mrAndMvTask task to the resTsks list and modifies the move work 
> to move all necessary files in one-step. The mrAndMvTask contains a move task 
> as child task, which is the same as the mvWork work. 
> 
> With the blobstore optimization on, the moveOnlyMoveTask task is created from 
> a different move work, not from the dummyMoveWork as before:
> 
>     MoveWork workForMoveOnlyTask;
>     if (shouldMergeMovePaths) {
>       workForMoveOnlyTask = mergeMovePaths(condInputPath, 
> moveTaskToLink.getWork());
>     } else {
>       workForMoveOnlyTask = dummyMoveWork;
>     }
> 
>     ...
> 
>     Task<? extends Serializable> mergeOnlyMergeTask = 
> TaskFactory.get(mergeWork, conf);
>     Task<? extends Serializable> moveOnlyMoveTask = 
> TaskFactory.get(workForMoveOnlyTask, conf);
>     Task<? extends Serializable> mergeAndMoveMergeTask = 
> TaskFactory.get(mergeWork, conf);
>     Task<? extends Serializable> mergeAndMoveMoveTask = 
> TaskFactory.get(dummyMoveWork, conf);
> 
> Because of this the mvWork in the 
> ConditionalResolverMergeFiles.generateActualTasks method will also be 
> different. It has the LoadTableDesc variable set and not the LoadFileDesc, 
> that causes the NPE.
> 
> When the blobstore optimization is on and the move work is changed, we should 
> use the child move task of the mrAndMvTask in the generateActualTasks method, 
> instead of the mvTask. Not just to avoid the NPE, but because this is the 
> correct move task for the "merge and move" use case.
> 
> 
> Diffs
> -----
> 
>   
> itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q
>  PRE-CREATION 
>   
> itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q
>  PRE-CREATION 
>   
> itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_move_only.q
>  PRE-CREATION 
>   
> itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q.out
>  PRE-CREATION 
>   
> itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q.out
>  PRE-CREATION 
>   
> itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_move_only.q.out
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 88bf829 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 4266569 
> 
> 
> Diff: https://reviews.apache.org/r/60432/diff/2/
> 
> 
> Testing
> -------
> 
> Created q tests for the move-only, merge-only and move-and-merge use cases. 
> The tests do an "insert overwiew" with blobstore optimizations on and off.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>

Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

Reply via email to