[ 
https://issues.apache.org/jira/browse/PIG-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy reassigned PIG-2835:
--------------------------------------

    Assignee: Jie Li
    
> Optimizing the convertion from bytes to Integer/Long
> ----------------------------------------------------
>
>                 Key: PIG-2835
>                 URL: https://issues.apache.org/jira/browse/PIG-2835
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jie Li
>            Assignee: Jie Li
>             Fix For: 0.11
>
>         Attachments: PIG-2835.1.patch
>
>
> Currently Pig doesn't support lazy see/de, so as one of the best practices, 
> we recommend users not to declare types in the schema so that Pig will guess 
> the right types and cast them lazily. However, if Pig guesses a wrong type, 
> especially mistakes a double field as an integer field, the overhead of 
> casting is tremendous due to the exception handling.
> See Utf8StorageConverter#bytesToIntege. It first casts bytes to Integer by 
> Integer.parseInt(), and if exception occurs, it tries to cast it to Double by 
> Double.parseDouble() and convert it back to Integer. The problem is that the 
> exception handling can be 10x slower than the actual casting. bytesToLong has 
> the same problem. Below is a mini-benchmark:
> {code}        
>         int i;
>         Exception ex = null;
>         long start = System.nanoTime();
>         for (i = 0; i < 100000000; i++) {
>             try {
>                 // Double.parseDouble(i+ ".0");
>                 // Integer.parseInt(i + ".0");
>                 Integer.parseInt(i + "");
>                 // Double.parseDouble(i + "");
>             } catch (NumberFormatException e) {
>                 ex = e;
>             }
>         }
>         System.out.println("time: " + (System.nanoTime() - start)
>                 / 1000000000.0);
>         if (ex != null) {
>             ex.printStackTrace();
>         }
> {code}
> And the results:
> ||casting||running time(sec)||
> |Double.parseDouble(i+ ".0");| 17 |
> |Integer.parseInt(i + ".0");| *118* |
> |Integer.parseInt(i + "");| 13 |
> |Double.parseDouble(i + "");| 16 |
> We can see Integer.parseInt(i + ".0") is 10x slower than the other due to the 
> exception handling.
> This issue was found when I benchmark TPC-H Query 1, for which Pig was 
> terribly slower than Hive:
> {code}
> LineItems = LOAD '$input/lineitem' USING PigStorage('|') AS (orderkey, 
> partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, 
> returnflag, linestatus, shipdate, commitdate, receiptdate, shipinstruct, 
> shipmode, comment);
> SubLineItems = FILTER LineItems BY shipdate <= '1998-09-02';
> SubLine = FOREACH SubLineItems GENERATE returnflag, linestatus, quantity, 
> extendedprice, extendedprice*(1-discount) AS disc_price, 
> extendedprice*(1-discount)*(1+tax) AS charge, discount;
> StatusGroup = GROUP SubLine BY (returnflag, linestatus);
> PriceSummary = FOREACH StatusGroup GENERATE group.returnflag AS returnflag, 
> group.linestatus AS linestatus, SUM(SubLine.quantity) AS sum_qty, 
> SUM(SubLine.extendedprice) AS sum_base_price, SUM(SubLine.disc_price) as 
> sum_disc_price, SUM(SubLine.charge) as sum_charge, AVG(SubLine.quantity) as 
> avg_qty, AVG(SubLine.extendedprice) as avg_price, AVG(SubLine.discount) as 
> avg_disc, COUNT(SubLine) as count_order;
> SortedSummary = ORDER PriceSummary BY returnflag, linestatus;
> STORE SortedSummary INTO '$output/Q1out';
> {code}
> After declaring three double fields as double, the performance was boosted. 
> || pig without types || pig with three doubles || hive ||
> | 76 min | 34 min | 16 min |
> Besides recommending users to declare actual double fields as double, we can 
> also improve the casting to avoid this happening. Maybe the easiest way is to 
> remove the Integer.parseInt and only use the Double.parseDouble and convert 
> back to Integer. The mini benchmark above shows Double.parseDouble + range 
> checking + Integer.valueOf(Double.intValue()) takes about 17 seconds. I think 
> the small percent of extra overhead (3 seconds compared to 
> Integer.parseInt()) is acceptable as it won't be the dominant bottleneck?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to