Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-07-07 Thread cheng xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review90838
---

Ship it!



- cheng xu


On July 8, 2015, 12:25 a.m., Sergio Pena wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35950/
> ---
> 
> (Updated July 8, 2015, 12:25 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-07-07 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
---

(Updated July 7, 2015, 4:25 p.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Changes
---

Addressed review feedback.


Bugs: HIVE-11131
https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
---

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the type-specific add() method
used by Parquet for that data type.
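
The idea in the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `Consumer` is a hypothetical stand-in for the Parquet consumer, and `FieldWriter`, `buildWriters`, `writeRow`, and the string type names are invented for the example. The type dispatch happens once, when the writers are built, and each row then only invokes the prepared writers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrebuiltWriters {
    // Hypothetical stand-in for the Parquet consumer: one add method per type.
    interface Consumer {
        void addLong(long v);
        void addBoolean(boolean v);
        void addDouble(double v);
    }

    // One writer per column, chosen once from the column type.
    interface FieldWriter {
        void write(Object value, Consumer out);
    }

    // Resolve the type dispatch a single time, before any row is written.
    static List<FieldWriter> buildWriters(List<String> columnTypes) {
        List<FieldWriter> writers = new ArrayList<>();
        for (String type : columnTypes) {
            switch (type) {
                case "bigint":
                    writers.add((v, out) -> out.addLong((Long) v));
                    break;
                case "boolean":
                    writers.add((v, out) -> out.addBoolean((Boolean) v));
                    break;
                case "double":
                    writers.add((v, out) -> out.addDouble((Double) v));
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + type);
            }
        }
        return writers;
    }

    // Per-row work: no type inspection, only calls to the prepared writers.
    static void writeRow(List<FieldWriter> writers, Object[] row, Consumer out) {
        for (int i = 0; i < writers.size(); i++) {
            writers.get(i).write(row[i], out);
        }
    }

    // Test helper that records each call as text.
    static class RecordingConsumer implements Consumer {
        final List<String> calls = new ArrayList<>();
        public void addLong(long v) { calls.add("long:" + v); }
        public void addBoolean(boolean v) { calls.add("boolean:" + v); }
        public void addDouble(double v) { calls.add("double:" + v); }
    }

    public static void main(String[] args) {
        List<FieldWriter> writers = buildWriters(Arrays.asList("bigint", "boolean", "double"));
        RecordingConsumer out = new RecordingConsumer();
        writeRow(writers, new Object[] {42L, true, 1.5}, out);
        System.out.println(out.calls);
    }
}
```

The per-row loop contains no branching on the data type, which is the property the description attributes to the new writers.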


Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428

Diff: https://reviews.apache.org/r/35950/diff/


Testing
---

Tests from TestDataWritableWriter run OK.

I ran other tests with micro-benchmarks and got better results from the new
implementation.

Using repeated rows across the file, this is the throughput before and after
the change, using 1 million records:

bigint  boolean double  float   int     string
7.598   7.491   7.488   7.588   7.53    0.270   (before)
10.137  11.511  10.155  10.297  10.242  0.286   (after)

Using random rows across the file, this is the throughput before and after
the change, using 1 million records:

bigint  boolean double  float   int     string
5.268   7.723   4.107   4.173   4.729   0.20    (before)
6.236   10.466  5.944   4.749   5.234   0.22    (after)
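
As a worked check of these figures, the relative gain per type is (after - before) / before. The sketch below applies that to the repeated-rows numbers, taking the int and string "before" values as 7.53 and 0.270. The class and method names are illustrative only.

```java
import java.util.Locale;

class ThroughputGain {
    // Relative increase of `after` over `before`, in percent.
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] types = {"bigint", "boolean", "double", "float", "int", "string"};
        // Repeated-rows throughput figures from the message above.
        double[] before = {7.598, 7.491, 7.488, 7.588, 7.53, 0.270};
        double[] after = {10.137, 11.511, 10.155, 10.297, 10.242, 0.286};
        for (int i = 0; i < types.length; i++) {
            System.out.printf(Locale.ROOT, "%-8s %.2f%%%n",
                types[i], percentIncrease(before[i], after[i]));
        }
    }
}
```

For bigint this gives roughly 33.42%, which agrees with the percentage reported in the earlier revisions of this review request.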


Thanks,

Sergio Pena



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-29 Thread Dong Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review89860
---



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 63)


Shall we keep this as 'final'?


Nice refactor. The change looks good. Thanks

- Dong Chen


On June 28, 2015, 12:29 a.m., Sergio Pena wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35950/
> ---
> 
> (Updated June 28, 2015, 12:29 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-29 Thread Ryan Blue

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review89812
---

Ship it!



- Ryan Blue


On June 27, 2015, 5:29 p.m., Sergio Pena wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35950/
> ---
> 
> (Updated June 27, 2015, 5:29 p.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-29 Thread cheng xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review89740
---


Thanks, Sergio, for this patch. Will this have a negative impact on the
initialization part? Thank you.


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 416)


Is it possible to use a generic type to avoid creating a DataWriter for each
type, since they are quite similar?
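
One possible reading of this suggestion, as a rough sketch only: a single generic writer class parameterized by an extraction function and a type-specific add call. `Sink`, `PrimitiveWriter`, and the lambdas below are hypothetical names for illustration and do not come from the patch or from the Parquet API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

class GenericWriterSketch {
    // Hypothetical stand-in for the Parquet consumer; it records calls as text.
    static class Sink {
        final List<String> calls = new ArrayList<>();
        void addInteger(int v) { calls.add("int:" + v); }
        void addLong(long v) { calls.add("long:" + v); }
    }

    // One generic writer: `extract` plays the role of the object inspector,
    // `emit` is the type-specific add call.
    static final class PrimitiveWriter<T> {
        private final Function<Object, T> extract;
        private final BiConsumer<Sink, T> emit;

        PrimitiveWriter(Function<Object, T> extract, BiConsumer<Sink, T> emit) {
            this.extract = extract;
            this.emit = emit;
        }

        void write(Object value, Sink sink) {
            emit.accept(sink, extract.apply(value));
        }
    }

    public static void main(String[] args) {
        PrimitiveWriter<Integer> intWriter =
            new PrimitiveWriter<>(v -> (Integer) v, Sink::addInteger);
        PrimitiveWriter<Long> longWriter =
            new PrimitiveWriter<>(v -> (Long) v, Sink::addLong);

        Sink sink = new Sink();
        intWriter.write(5, sink);
        longWriter.write(7L, sink);
        System.out.println(sink.calls);
    }
}
```

A trade-off worth noting: Java type parameters require boxed types, so a generic writer like this may add boxing on the write path, which could matter for a patch whose goal is write performance.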


- cheng xu


On June 28, 2015, 8:29 a.m., Sergio Pena wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35950/
> ---
> 
> (Updated June 28, 2015, 8:29 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-29 Thread cheng xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review89721
---



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(lines 71 - 72)


Add comments before the declaration of messageWriter.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 73)


No need to initialize this with a null value.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 107)


I don't follow why this is renamed to schema here.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 182)


groupSchema -> groupType



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 352)


ByteDataWrter?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
(line 538)


DateDataWriter


- cheng xu


On June 28, 2015, 8:29 a.m., Sergio Pena wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35950/
> ---
> 
> (Updated June 28, 2015, 8:29 a.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-27 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
---

(Updated June 28, 2015, 12:29 a.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Bugs: HIVE-11131
https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
---

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the type-specific add() method
used by Parquet for that data type.


Diffs
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428

Diff: https://reviews.apache.org/r/35950/diff/


Testing (updated)
---

Tests from TestDataWritableWriter run OK.

I ran other tests with micro-benchmarks and got better results from the new
implementation.

Using repeated rows across the file, this is the throughput before and after
the change, using 1 million records:

bigint  boolean double  float   int     string
7.598   7.491   7.488   7.588   7.53    0.270   (before)
10.137  11.511  10.155  10.297  10.242  0.286   (after)

Using random rows across the file, this is the throughput before and after
the change, using 1 million records:

bigint  boolean double  float   int     string
5.268   7.723   4.107   4.173   4.729   0.20    (before)
6.236   10.466  5.944   4.749   5.234   0.22    (after)


Thanks,

Sergio Pena



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-27 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
---

(Updated June 28, 2015, 12:24 a.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Bugs: HIVE-11131
https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
---

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the type-specific add() method
used by Parquet for that data type.


Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428

Diff: https://reviews.apache.org/r/35950/diff/


Testing
---

Tests from TestDataWritableWriter run OK.

I ran other tests with micro-benchmarks and got better results from the new
implementation.

Using repeated rows across the file, the speed increased by:

bigint  boolean double  float   int     string
33.42%  53.66%  35.62%  35.70%  36.02%  5.93%

Using random rows across the file, the speed increased by:

bigint  boolean double  float   int     string
18.38%  35.52%  44.73%  13.80%  10.68%  10.00%


Thanks,

Sergio Pena



Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-26 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
---

(Updated June 27, 2015, 2:51 a.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Changes
---

Changed DataListWriter to iterate over the list values with an indexed for()
loop instead of a for-each loop, because ListObjectInspector.getList() is more
expensive than getListLength() and getListElement().
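
The two loop styles can be contrasted in a small sketch. `ListInspector` below is a hypothetical, trimmed-down stand-in that only borrows the three method names mentioned above. Its signatures and the `ArrayInspector` implementation are assumptions for illustration, not Hive code.

```java
import java.util.Arrays;
import java.util.List;

class ListLoopSketch {
    // Hypothetical stand-in for an inspector over list-typed data.
    interface ListInspector {
        int getListLength(Object data);
        Object getListElement(Object data, int index);
        List<?> getList(Object data);
    }

    // Inspector over Object[] data: getList() must wrap the array in a List,
    // while length and element access read the array directly.
    static class ArrayInspector implements ListInspector {
        public int getListLength(Object data) { return ((Object[]) data).length; }
        public Object getListElement(Object data, int index) { return ((Object[]) data)[index]; }
        public List<?> getList(Object data) { return Arrays.asList((Object[]) data); }
    }

    // Indexed loop: uses only getListLength() and getListElement().
    static long sumIndexed(ListInspector inspector, Object data) {
        long sum = 0;
        int length = inspector.getListLength(data);
        for (int i = 0; i < length; i++) {
            sum += (Long) inspector.getListElement(data, i);
        }
        return sum;
    }

    // For-each loop: needs getList() to materialize a List first.
    static long sumForEach(ListInspector inspector, Object data) {
        long sum = 0;
        for (Object element : inspector.getList(data)) {
            sum += (Long) element;
        }
        return sum;
    }

    public static void main(String[] args) {
        Object data = new Object[] {1L, 2L, 3L};
        ListInspector inspector = new ArrayInspector();
        System.out.println(sumIndexed(inspector, data));
        System.out.println(sumForEach(inspector, data));
    }
}
```

Both loops visit the same elements. The difference is only in which inspector calls are made per list.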


Bugs: HIVE-11131
https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
---

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the type-specific add() method
used by Parquet for that data type.


Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428

Diff: https://reviews.apache.org/r/35950/diff/


Testing
---

Tests from TestDataWritableWriter run OK.

I ran other tests with micro-benchmarks and got better results from the new
implementation.

Using repeated rows across the file, the speed increased by:

bigint  boolean double  float   int     string
33.42%  53.66%  35.62%  35.70%  36.02%  5.93%

Using random rows across the file, the speed increased by:

bigint  boolean double  float   int     string
18.38%  35.52%  44.73%  13.80%  10.68%  10.00%


Thanks,

Sergio Pena



Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

2015-06-26 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
---

Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Bugs: HIVE-11131
https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
---

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the type-specific add() method
used by Parquet for that data type.


Diffs
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428

Diff: https://reviews.apache.org/r/35950/diff/


Testing
---

Tests from TestDataWritableWriter run OK.

I ran other tests with micro-benchmarks and got better results from the new
implementation.

Using repeated rows across the file, the speed increased by:

bigint  boolean double  float   int     string
33.42%  53.66%  35.62%  35.70%  36.02%  5.93%

Using random rows across the file, the speed increased by:

bigint  boolean double  float   int     string
18.38%  35.52%  44.73%  13.80%  10.68%  10.00%


Thanks,

Sergio Pena