Hi,

 

As you know, much of this year's work on Pig went into performance
optimization. One of the main sources of performance problems is high
memory usage. To address it, we propose switching the internal
implementation of strings from Java String to Hadoop Text, because Text
has lower memory overhead. Examples (assuming ASCII data; sizes are in
bytes):

 

Raw string data    Java String    Hadoop Text
5                  46             37
10                 56             42
20                 76             52
40                 116            72
80                 196            112

 

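As the strings grow, so does the gap between the two implementations:
a Java String stores each character in two bytes, whereas Text stores
UTF-8, which takes one byte per ASCII character.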
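
The figures in the table fit a simple linear model, which the sketch
below reproduces. The fixed overheads (36 and 32 bytes) are assumptions
fitted to the table rows, not values measured on a JVM, and the class
and method names are made up for illustration. The per-character costs
reflect a Java String holding 2-byte chars and Text holding UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringFootprint {
    // Assumption: fixed per-object overheads fitted to the table, not measured.
    static final int JAVA_STRING_FIXED = 36;
    static final int HADOOP_TEXT_FIXED = 32;

    // Estimated size of a Java String: fixed overhead plus 2 bytes per char.
    static int javaStringBytes(String s) {
        return JAVA_STRING_FIXED + 2 * s.length();
    }

    // Estimated size of a Hadoop Text: fixed overhead plus the UTF-8 byte count.
    static int hadoopTextBytes(String s) {
        return HADOOP_TEXT_FIXED + s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Builds an ASCII string of the given length.
    static String ascii(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 10, 20, 40, 80}) {
            String s = ascii(n);
            System.out.println(n + "\t" + javaStringBytes(s) + "\t" + hadoopTextBytes(s));
        }
    }
}
```

Under this model the gap is 4 + n bytes for an n-character ASCII string,
so it widens linearly with string length.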

 

Making this change would have no impact on Pig users; however, it will
affect existing UDFs that work with Strings. Our question is whether
UDF writers/owners are comfortable with the proposed transition and
willing to update their UDFs.

 

Please let us know by the end of next week if you strongly object to
this proposal. Otherwise, we will go forward with this plan.

 

Thanks,

 

Olga 

 

 
