Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-27 Thread Deepak Jaiswal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
---

(Updated July 27, 2017, 9:25 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
---

Fixed the assert introduced in last rev. to compare the path values instead of 
comparing the path objects.


Bugs: HIVE-16965
https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
---

Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
sequentially, however, incase of SMB join, the inputs are read based on key 
ordering. This invalidates the current IOContext assumption that the input path 
once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is 
derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/5/

Changes: https://reviews.apache.org/r/61087/diff/4-5/


Testing
---

Added a new test.


Thanks,

Deepak Jaiswal



Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-27 Thread Deepak Jaiswal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
---

(Updated July 27, 2017, 8:37 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
---

Added a better assert to verify uniqueness of the paths for a given input.


Bugs: HIVE-16965
https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
---

Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
sequentially, however, incase of SMB join, the inputs are read based on key 
ordering. This invalidates the current IOContext assumption that the input path 
once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is 
derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/4/

Changes: https://reviews.apache.org/r/61087/diff/3-4/


Testing
---

Added a new test.


Thanks,

Deepak Jaiswal



Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-25 Thread Gopal V

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/#review181369
---




ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Lines 86 (patched)


That is not always true.

splits.get(1) could have a different path.

You might want to add a loop + assert there.


- Gopal V


On July 25, 2017, 8:01 p.m., Deepak Jaiswal wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61087/
> ---
> 
> (Updated July 25, 2017, 8:01 p.m.)
> 
> 
> Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.
> 
> 
> Bugs: HIVE-16965
> https://issues.apache.org/jira/browse/HIVE-16965
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
> sequentially, however, incase of SMB join, the inputs are read based on key 
> ordering. This invalidates the current IOContext assumption that the input 
> path once set wont change unless the input changes.
> This was resulting in incorrect partition information in results as it is 
> derived from the input path in IOContext.
> The new logic changes the input path as and when input changes.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
> add7d08c40 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
> 698fa7f69e 
>   ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 
> 
> 
> Diff: https://reviews.apache.org/r/61087/diff/3/
> 
> 
> Testing
> ---
> 
> Added a new test.
> 
> 
> Thanks,
> 
> Deepak Jaiswal
> 
>



Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-25 Thread Deepak Jaiswal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
---

(Updated July 25, 2017, 8:01 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
---

Use llap_smb.q as main test instead of smb_join1.q
Remove assert based on failing tests. Irrespective of number of splits the path 
is same.


Bugs: HIVE-16965
https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
---

Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
sequentially, however, incase of SMB join, the inputs are read based on key 
ordering. This invalidates the current IOContext assumption that the input path 
once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is 
derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/3/

Changes: https://reviews.apache.org/r/61087/diff/2-3/


Testing
---

Added a new test.


Thanks,

Deepak Jaiswal



Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-24 Thread Deepak Jaiswal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
---

(Updated July 24, 2017, 9:35 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
---

Implemented comments from Sergey and Gopal.


Bugs: HIVE-16965
https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
---

Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
sequentially, however, incase of SMB join, the inputs are read based on key 
ordering. This invalidates the current IOContext assumption that the input path 
once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is 
derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-

  itests/src/test/resources/testconfiguration.properties f66e19be3e 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
698fa7f69e 
  ql/src/test/queries/clientpositive/smb_join1.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/smb_join1.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/61087/diff/2/

Changes: https://reviews.apache.org/r/61087/diff/1-2/


Testing
---

Added a new test.


Thanks,

Deepak Jaiswal



Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

2017-07-24 Thread Gopal V

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/#review181246
---




ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Lines 66 (patched)


IdentityHashMap - don't trust the hashCode() for KeyValueReader to be safe.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Line 109 (original), 130 (patched)


Clear the prev and IOContext refs - interrupts do leave leaky state behind 
sometimes


- Gopal V


On July 24, 2017, 6:47 p.m., Deepak Jaiswal wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61087/
> ---
> 
> (Updated July 24, 2017, 6:47 p.m.)
> 
> 
> Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.
> 
> 
> Bugs: HIVE-16965
> https://issues.apache.org/jira/browse/HIVE-16965
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Usually, in a JOIN with multiple inputs (partitions), the inputs are read 
> sequentially, however, incase of SMB join, the inputs are read based on key 
> ordering. This invalidates the current IOContext assumption that the input 
> path once set wont change unless the input changes.
> This was resulting in incorrect partition information in results as it is 
> derived from the input path in IOContext.
> The new logic changes the input path as and when input changes.
> 
> 
> Diffs
> -
> 
>   itests/src/test/resources/testconfiguration.properties f66e19be3e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java 
> add7d08c40 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 
> 698fa7f69e 
>   ql/src/test/queries/clientpositive/smb_join1.q PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/smb_join1.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/61087/diff/1/
> 
> 
> Testing
> ---
> 
> Added a new test.
> 
> 
> Thanks,
> 
> Deepak Jaiswal
> 
>