Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-14 Thread Sahil Takiar


> On July 11, 2017, 11:14 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 305-308 (patched)
> > 
> >
> > Is the if statement necessary? Is there a problem with always doing 
> > this?
> > 
> > I think it would make more sense to always do this. Right now the code 
> > assumes `mvTask.getWork() == mrAndMvTask.getChildTasks().get(0).getWork()`, 
> > which doesn't seem like a smart idea. If `mrAndMvTask` is the task that is 
> > going to be added to `resTsks`, then it would make more sense to just use 
> > the `MoveWork` from `mrAndMvTask.getChildTasks().get(0)`.
> 
> Marta Kuczora wrote:
> Thanks a lot for the review, Sahil!
> I absolutely agree with you that it would be better to always get the move 
> task from mergeAndMoveMoveTask. The reason why I added the if is that I 
> wasn't 100% sure there couldn't be a use case where this approach doesn't 
> work. So I wanted to be cautious and only change the behavior for the 
> blobstore optimization use case and leave it as it was for the other 
> use cases. But if you think that this caution is not necessary, I am happy 
> to remove the if condition. 
> Do you think it would make sense to upload a patch where I always use the 
> child task and see if all tests are successful in the pre-commit run?
> 
> Marta Kuczora wrote:
> Updated the patch accordingly and the pre-commit tests were successful 
> except for 5 tests which have been failing for a long time.

Yeah, I think it's worth creating a patch where we always use the child task 
and seeing if all tests are successful in pre-commit.


- Sahil


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
---



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-13 Thread Marta Kuczora

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/
---

(Updated July 13, 2017, 3:05 p.m.)


Review request for hive and Sergio Pena.


Bugs: HIVE-16845
https://issues.apache.org/jira/browse/HIVE-16845


Repository: hive-git


Description
---

The following steps lead to the NPE in the 
ConditionalResolverMergeFiles.generateActualTasks method:

In the GenMapRedUtils.createCondTask method, the tasks for the merge, move and 
"merge and move" use cases are created and set as the task list of the 
ConditionalWork. Originally the moveOnlyMoveTask and the mergeAndMoveMoveTask 
were created from the same MoveWork: the dummy move work created like this in 
the createMRWorkForMergingFiles method:

MoveWork dummyMv = new MoveWork(null, null, null,
    new LoadFileDesc(fsInputDesc.getFinalDirName(), finalName, true, null, null),
    false);


Then, in the ConditionalResolverMergeFiles.generateActualTasks method, we get 
these tasks and use them to create the result "resTsks" list.

For the "merge and move" use case, the code looks like this:

if (toMove.size() > 0) {
  resTsks.add(mrAndMvTask);

  MoveWork mvWork = (MoveWork) mvTask.getWork();
  LoadFileDesc lfd = mvWork.getLoadFileWork();

  ...

  LoadMultiFilesDesc lmfd = new LoadMultiFilesDesc(toMove,
      targetDirs, lfd.getIsDfsDir(), lfd.getColumns(), lfd.getColumnTypes());
  mvWork.setLoadFileWork(null);
  mvWork.setLoadTableWork(null);
  mvWork.setMultiFilesDesc(lmfd);
}

It adds the mrAndMvTask task to the resTsks list and modifies the move work to 
move all necessary files in one step. The mrAndMvTask contains a move task as a 
child task, whose work is the same as mvWork.

With the blobstore optimization on, the moveOnlyMoveTask task is created from a 
different move work, not from the dummyMoveWork as before:

MoveWork workForMoveOnlyTask;
if (shouldMergeMovePaths) {
  workForMoveOnlyTask = mergeMovePaths(condInputPath, moveTaskToLink.getWork());
} else {
  workForMoveOnlyTask = dummyMoveWork;
}

...

Task<? extends Serializable> mergeOnlyMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> moveOnlyMoveTask = TaskFactory.get(workForMoveOnlyTask, conf);
Task<? extends Serializable> mergeAndMoveMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> mergeAndMoveMoveTask = TaskFactory.get(dummyMoveWork, conf);

Because of this, the mvWork in the 
ConditionalResolverMergeFiles.generateActualTasks method will also be 
different: it has the LoadTableDesc set instead of the LoadFileDesc, which 
causes the NPE.

When the blobstore optimization is on and the move work is changed, we should 
use the child move task of the mrAndMvTask in the generateActualTasks method 
instead of the mvTask, not just to avoid the NPE, but because this is the 
correct move task for the "merge and move" use case.
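
As an illustration, here is a minimal sketch of where the NPE surfaces and of 
the fix described above. This is a simplified reading of the code, reusing the 
variable names from the snippets in this description, and it assumes (as 
stated above) that mrAndMvTask has a single child move task:

// Inside ConditionalResolverMergeFiles.generateActualTasks (sketch):
// with the blobstore optimization on, mvTask's MoveWork has its
// LoadTableDesc set, so getLoadFileWork() returns null here...
MoveWork mvWork = (MoveWork) mvTask.getWork();
LoadFileDesc lfd = mvWork.getLoadFileWork();  // null in the blobstore case

// ...and the first dereference of lfd (lfd.getIsDfsDir() above) throws the
// NullPointerException. Taking the MoveWork from the move task actually
// chained under mrAndMvTask avoids this, because that work is built from the
// dummy MoveWork and has its LoadFileDesc set for the "merge and move" case:
MoveWork childMvWork = (MoveWork) mrAndMvTask.getChildTasks().get(0).getWork();
LoadFileDesc childLfd = childMvWork.getLoadFileWork();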


Diffs (updated)
---

  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_move_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_move_only.q.out
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
4266569 


Diff: https://reviews.apache.org/r/60432/diff/4/

Changes: https://reviews.apache.org/r/60432/diff/3-4/


Testing
---

Created q tests for the move-only, merge-only and move-and-merge use cases. The 
tests do an "insert overwrite" with blobstore optimizations on and off.


Thanks,

Marta Kuczora



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-13 Thread Marta Kuczora


> On July 11, 2017, 11:14 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 305-308 (patched)
> > 
> >
> > Is the if statement necessary? Is there a problem with always doing 
> > this?
> > 
> > I think it would make more sense to always do this. Right now the code 
> > assumes `mvTask.getWork() == mrAndMvTask.getChildTasks().get(0).getWork()`, 
> > which doesn't seem like a smart idea. If `mrAndMvTask` is the task that is 
> > going to be added to `resTsks`, then it would make more sense to just use 
> > the `MoveWork` from `mrAndMvTask.getChildTasks().get(0)`.
> 
> Marta Kuczora wrote:
> Thanks a lot for the review, Sahil!
> I absolutely agree with you that it would be better to always get the move 
> task from mergeAndMoveMoveTask. The reason why I added the if is that I 
> wasn't 100% sure there couldn't be a use case where this approach doesn't 
> work. So I wanted to be cautious and only change the behavior for the 
> blobstore optimization use case and leave it as it was for the other 
> use cases. But if you think that this caution is not necessary, I am happy 
> to remove the if condition. 
> Do you think it would make sense to upload a patch where I always use the 
> child task and see if all tests are successful in the pre-commit run?

Updated the patch accordingly and the pre-commit tests were successful except 
for 5 tests which have been failing for a long time.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
---



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-13 Thread Marta Kuczora

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/
---

(Updated July 13, 2017, 2:58 p.m.)


Review request for hive and Sergio Pena.


Bugs: HIVE-16845
https://issues.apache.org/jira/browse/HIVE-16845


Repository: hive-git


Description
---

The following steps lead to the NPE in the 
ConditionalResolverMergeFiles.generateActualTasks method:

In the GenMapRedUtils.createCondTask method, the tasks for the merge, move and 
"merge and move" use cases are created and set as the task list of the 
ConditionalWork. Originally the moveOnlyMoveTask and the mergeAndMoveMoveTask 
were created from the same MoveWork: the dummy move work created like this in 
the createMRWorkForMergingFiles method:

MoveWork dummyMv = new MoveWork(null, null, null,
    new LoadFileDesc(fsInputDesc.getFinalDirName(), finalName, true, null, null),
    false);


Then, in the ConditionalResolverMergeFiles.generateActualTasks method, we get 
these tasks and use them to create the result "resTsks" list.

For the "merge and move" use case, the code looks like this:

if (toMove.size() > 0) {
  resTsks.add(mrAndMvTask);

  MoveWork mvWork = (MoveWork) mvTask.getWork();
  LoadFileDesc lfd = mvWork.getLoadFileWork();

  ...

  LoadMultiFilesDesc lmfd = new LoadMultiFilesDesc(toMove,
      targetDirs, lfd.getIsDfsDir(), lfd.getColumns(), lfd.getColumnTypes());
  mvWork.setLoadFileWork(null);
  mvWork.setLoadTableWork(null);
  mvWork.setMultiFilesDesc(lmfd);
}

It adds the mrAndMvTask task to the resTsks list and modifies the move work to 
move all necessary files in one step. The mrAndMvTask contains a move task as a 
child task, whose work is the same as mvWork.

With the blobstore optimization on, the moveOnlyMoveTask task is created from a 
different move work, not from the dummyMoveWork as before:

MoveWork workForMoveOnlyTask;
if (shouldMergeMovePaths) {
  workForMoveOnlyTask = mergeMovePaths(condInputPath, moveTaskToLink.getWork());
} else {
  workForMoveOnlyTask = dummyMoveWork;
}

...

Task<? extends Serializable> mergeOnlyMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> moveOnlyMoveTask = TaskFactory.get(workForMoveOnlyTask, conf);
Task<? extends Serializable> mergeAndMoveMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> mergeAndMoveMoveTask = TaskFactory.get(dummyMoveWork, conf);

Because of this, the mvWork in the 
ConditionalResolverMergeFiles.generateActualTasks method will also be 
different: it has the LoadTableDesc set instead of the LoadFileDesc, which 
causes the NPE.

When the blobstore optimization is on and the move work is changed, we should 
use the child move task of the mrAndMvTask in the generateActualTasks method 
instead of the mvTask, not just to avoid the NPE, but because this is the 
correct move task for the "merge and move" use case.


Diffs (updated)
---

  HIVE-16845.3.patch PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_move_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_move_only.q.out
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
4266569 


Diff: https://reviews.apache.org/r/60432/diff/3/

Changes: https://reviews.apache.org/r/60432/diff/2-3/


Testing
---

Created q tests for the move-only, merge-only and move-and-merge use cases. The 
tests do an "insert overwrite" with blobstore optimizations on and off.


Thanks,

Marta Kuczora



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-12 Thread Marta Kuczora


> On July 11, 2017, 11:14 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 305-308 (patched)
> > 
> >
> > Is the if statement necessary? Is there a problem with always doing 
> > this?
> > 
> > I think it would make more sense to always do this. Right now the code 
> > assumes `mvTask.getWork() == mrAndMvTask.getChildTasks().get(0).getWork()`, 
> > which doesn't seem like a smart idea. If `mrAndMvTask` is the task that is 
> > going to be added to `resTsks`, then it would make more sense to just use 
> > the `MoveWork` from `mrAndMvTask.getChildTasks().get(0)`.

Thanks a lot for the review, Sahil!
I absolutely agree with you that it would be better to always get the move task 
from mergeAndMoveMoveTask. The reason why I added the if is that I wasn't 
100% sure there couldn't be a use case where this approach doesn't work. So 
I wanted to be cautious and only change the behavior for the blobstore 
optimization use case and leave it as it was for the other use cases. But if 
you think that this caution is not necessary, I am happy to remove the if 
condition. 
Do you think it would make sense to upload a patch where I always use the child 
task and see if all tests are successful in the pre-commit run?


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
---



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-07-11 Thread Sahil Takiar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
---




ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
Lines 305-308 (patched)


Is the if statement necessary? Is there a problem with always doing this?

I think it would make more sense to always do this. Right now the code 
assumes `mvTask.getWork() == mrAndMvTask.getChildTasks().get(0).getWork()`, 
which doesn't seem like a smart idea. If `mrAndMvTask` is the task that is 
going to be added to `resTsks`, then it would make more sense to just use the 
`MoveWork` from `mrAndMvTask.getChildTasks().get(0)`.
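
For concreteness, a short sketch of the suggested access pattern (a hedged 
sketch, not the actual patch), assuming as the description does that 
mrAndMvTask carries exactly one child move task:

// Fetch the MoveWork from the task that is actually added to resTsks,
// instead of relying on mvTask.getWork() being the same object as the
// child move task's work:
Task<? extends Serializable> childMoveTask = mrAndMvTask.getChildTasks().get(0);
MoveWork mvWork = (MoveWork) childMoveTask.getWork();
LoadFileDesc lfd = mvWork.getLoadFileWork();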


- Sahil Takiar



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-06-27 Thread Marta Kuczora

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/
---

(Updated June 27, 2017, 1:02 p.m.)


Review request for hive and Sergio Pena.


Changes
---

Fixed the q tests and the comment.


Bugs: HIVE-16845
https://issues.apache.org/jira/browse/HIVE-16845


Repository: hive-git


Description
---

The following steps lead to the NPE in the 
ConditionalResolverMergeFiles.generateActualTasks method:

In the GenMapRedUtils.createCondTask method, the tasks for the merge, move and 
"merge and move" use cases are created and set as the task list of the 
ConditionalWork. Originally the moveOnlyMoveTask and the mergeAndMoveMoveTask 
were created from the same MoveWork: the dummy move work created like this in 
the createMRWorkForMergingFiles method:

MoveWork dummyMv = new MoveWork(null, null, null,
    new LoadFileDesc(fsInputDesc.getFinalDirName(), finalName, true, null, null),
    false);


Then, in the ConditionalResolverMergeFiles.generateActualTasks method, we get 
these tasks and use them to create the result "resTsks" list.

For the "merge and move" use case, the code looks like this:

if (toMove.size() > 0) {
  resTsks.add(mrAndMvTask);

  MoveWork mvWork = (MoveWork) mvTask.getWork();
  LoadFileDesc lfd = mvWork.getLoadFileWork();

  ...

  LoadMultiFilesDesc lmfd = new LoadMultiFilesDesc(toMove,
      targetDirs, lfd.getIsDfsDir(), lfd.getColumns(), lfd.getColumnTypes());
  mvWork.setLoadFileWork(null);
  mvWork.setLoadTableWork(null);
  mvWork.setMultiFilesDesc(lmfd);
}

It adds the mrAndMvTask task to the resTsks list and modifies the move work to 
move all necessary files in one step. The mrAndMvTask contains a move task as a 
child task, whose work is the same as mvWork.

With the blobstore optimization on, the moveOnlyMoveTask task is created from a 
different move work, not from the dummyMoveWork as before:

MoveWork workForMoveOnlyTask;
if (shouldMergeMovePaths) {
  workForMoveOnlyTask = mergeMovePaths(condInputPath, moveTaskToLink.getWork());
} else {
  workForMoveOnlyTask = dummyMoveWork;
}

...

Task<? extends Serializable> mergeOnlyMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> moveOnlyMoveTask = TaskFactory.get(workForMoveOnlyTask, conf);
Task<? extends Serializable> mergeAndMoveMergeTask = TaskFactory.get(mergeWork, conf);
Task<? extends Serializable> mergeAndMoveMoveTask = TaskFactory.get(dummyMoveWork, conf);

Because of this, the mvWork in the 
ConditionalResolverMergeFiles.generateActualTasks method will also be 
different: it has the LoadTableDesc set instead of the LoadFileDesc, which 
causes the NPE.

When the blobstore optimization is on and the move work is changed, we should 
use the child move task of the mrAndMvTask in the generateActualTasks method 
instead of the mvTask, not just to avoid the NPE, but because this is the 
correct move task for the "merge and move" use case.


Diffs (updated)
---

  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_move_only.q
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q.out
 PRE-CREATION 
  
itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_move_only.q.out
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 88bf829 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
4266569 


Diff: https://reviews.apache.org/r/60432/diff/2/

Changes: https://reviews.apache.org/r/60432/diff/1-2/


Testing
---

Created q tests for the move-only, merge-only and move-and-merge use cases. The 
tests do an "insert overwrite" with blobstore optimizations on and off.


Thanks,

Marta Kuczora



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-06-26 Thread Marta Kuczora


> On June 26, 2017, 2:53 p.m., Barna Zsombor Klara wrote:
> > Thank you for the review, Marta. I just had one comment.
> > Also, do you think it makes sense to add tests for HoS as well, or is this 
> > problem only MR-related?

Thanks a lot for the review.
That's a good question. I don't think this issue is MR-related, but at the 
moment we don't have blobstore q tests with Spark. We can create a new driver 
and tests, but that is a bigger piece of work, so I think it would make sense 
to do it in the scope of another Jira.


> On June 26, 2017, 2:53 p.m., Barna Zsombor Klara wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 304 (patched)
> > 
> >
> > If this all depends on *hive.blobstore.optimizations.enabled*, could 
> > we use the HiveConf object and check that value instead of introducing a 
> > boolean instance variable?

It doesn't just depend on the hive.blobstore.optimizations.enabled parameter. 
There are other checks in the GenMapRedUtils.shouldMergeMovePaths method to 
decide whether or not the move work should be changed. Maybe I should rephrase 
the comment on my change.
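
To illustrate, a hedged sketch of the distinction (the real checks live in 
GenMapRedUtils.shouldMergeMovePaths; otherStructuralChecks below is a 
hypothetical stand-in for them, and the ConfVars constant is assumed to be 
the one backing hive.blobstore.optimizations.enabled):

// The conf flag is only one of several conditions, so the resolver cannot
// recover the decision by re-reading HiveConf alone; it has to be handed
// the already-computed outcome.
boolean blobstoreOptimizationsOn =
    conf.getBoolVar(HiveConf.ConfVars.HIVE_BLOBSTORE_OPTIMIZATIONS_ENABLED);
boolean shouldMergeMovePaths =
    blobstoreOptimizationsOn && otherStructuralChecks(condInputPath, moveTaskToLink);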


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review178891
---



Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE

2017-06-26 Thread Barna Zsombor Klara

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review178891
---



Thank you for the review, Marta. I just had one comment.
Also, do you think it makes sense to add tests for HoS as well, or is this 
problem only MR-related?


ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
Lines 304 (patched)


If this all depends on *hive.blobstore.optimizations.enabled*, could we 
use the HiveConf object and check that value instead of introducing a boolean 
instance variable?


- Barna Zsombor Klara

