[
https://issues.apache.org/jira/browse/HIVE-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093270#comment-14093270
]
Vishal Kamath commented on HIVE-7661:
-------------------------------------
I think this might help.
InputSampler.SplitSampler<K,V> samples the first n records from each of s splits; its getSample() takes the first numSamples/numSplits records per split.
Would it be possible for you to explain your approach with an example? Let us say
that you have 1,000,000 rows in your file and, say, 10 splits.
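To make that concrete, here is a plain-Python sketch of the SplitSampler strategy described above (this is not the actual Hadoop code; the function name split_sample and the toy row counts are assumptions for illustration only):

```python
# Illustrative sketch of InputSampler.SplitSampler's strategy:
# take the first numSamples/numSplits records from each split.
# (Plain Python, not Hadoop; split_sample is a hypothetical name.)

def split_sample(rows, num_splits, num_samples):
    """Return the first num_samples/num_splits records of each split."""
    split_size = len(rows) // num_splits
    per_split = num_samples // num_splits
    sample = []
    for s in range(num_splits):
        start = s * split_size
        sample.extend(rows[start:start + per_split])
    return sample

# 1,000,000 sorted keys, 10 splits, 1,000 samples -> 100 per split
rows = list(range(1, 1_000_001))
sample = split_sample(rows, num_splits=10, num_samples=1000)
print(len(sample))                          # 1000
print(sample[0], sample[100], sample[200])  # 1 100001 200001
```

Because each split contributes only its leading records, the sample still touches all ten regions of the sorted file, one reason this sampler behaves differently from a contiguous-percentage sample.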
> Observed performance issues while sorting using Hive's Parallel Order by
> clause while retaining pre-existing sort order.
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-7661
> URL: https://issues.apache.org/jira/browse/HIVE-7661
> Project: Hive
> Issue Type: Bug
> Components: Logical Optimizer
> Affects Versions: 0.12.0
> Environment: Cloudera 5.0
> hive-0.12.0-cdh5.0.0
> Red Hat Linux
> Reporter: Vishal Kamath
> Labels: performance
> Fix For: 0.12.1
>
>
> Improve Hive's sampling logic to accommodate use cases that require retaining
> the pre-existing sort order in the underlying source table.
> In order to support the parallel ORDER BY clause, Hive samples the source table
> based on the values of hive.optimize.sampling.orderby.number and
> hive.optimize.sampling.orderby.percent.
> This works with reasonable performance when sorting on columns with a random
> distribution of data, but it has severe performance issues when the
> pre-existing sort order is retained.
> Let us try to understand this with an example.
> insert overwrite table lineitem_temp_report
> select
> l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
> l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate,
> l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
> from
> lineitem
> order by l_orderkey, l_partkey, l_suppkey;
> Sample data set for lineitem table. The first column represents the
> l_orderKey and is sorted.
>
> l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment
> 197|1771022|96040|2|8|8743.52|0.09|0.02|A|F|1995-04-17|1995-07-01|1995-04-27|DELIVER IN PERSON|SHIP|y blithely even deposits. blithely fina|
> 197|1558290|83306|3|17|22919.74|0.06|0.02|N|O|1995-08-02|1995-06-23|1995-08-03|COLLECT COD|REG AIR|ts. careful|
> 197|179355|29358|4|25|35858.75|0.04|0.01|N|F|1995-06-13|1995-05-23|1995-06-24|TAKE BACK RETURN|FOB|s-- quickly final accounts|
> 197|414653|39658|5|14|21946.82|0.09|0.01|R|F|1995-05-08|1995-05-24|1995-05-12|TAKE BACK RETURN|RAIL|use slyly slyly silent depo|
> 197|1058800|8821|6|1|1758.75|0.07|0.05|N|O|1995-07-15|1995-06-21|1995-08-11|COLLECT COD|RAIL| even, thin dependencies sno|
> 198|560609|60610|1|33|55096.14|0.07|0.02|N|O|1998-01-05|1998-03-20|1998-01-10|TAKE BACK RETURN|TRUCK|carefully caref|
> 198|152287|77289|2|20|26785.60|0.03|0.00|N|O|1998-01-15|1998-03-31|1998-01-25|DELIVER IN PERSON|FOB|carefully final escapades a|
> 224|1899665|74720|3|41|68247.37|0.07|0.04|A|F|1994-09-01|1994-09-15|1994-09-02|TAKE BACK RETURN|SHIP|after the furiou|
> When we either sort on a presorted column or do a multi-column sort while
> retaining the sort order of the source table, we don't see an equal
> distribution of data to the reducers. The source table "lineitem" has 600
> million rows. Out of 100 reducers, 99 complete in less than 40 seconds; the
> last reducer does the bulk of the work, processing nearly 570 million rows.
> So, let us understand what is going wrong here.
> On a table having 600 million records with the orderkey column sorted, I
> created a temp table with 10% sampling:
> insert overwrite table sampTempTbl (select * from lineitem tablesample (10
> percent) t);
> select min(l_orderkey), max(l_orderkey) from sampTempTbl;
> 12306309, 142321700
> whereas on the source table, the orderkey range (select min(l_orderkey),
> max(l_orderkey) from lineitem) is 1 to 600000000.
> So naturally the bulk of the records will be directed towards a single reducer.
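The resulting skew can be sketched numerically. The following plain-Python toy model (not Hive code; cutpoints, reducer_for, and the scaled-down key counts are assumptions, with one unit standing in for roughly one million orderkeys) shows what happens when reducer boundaries are computed from a sample that covers only a narrow band of the key range:

```python
# Toy model: partition boundaries derived from a sample that covers only
# a narrow key band (like 12.3M..142.3M out of 1..600M above) send every
# key above the sampled band to the last reducer.

def cutpoints(sample, num_reducers):
    """Pick num_reducers-1 boundary keys, evenly spaced in the sorted sample."""
    s = sorted(sample)
    step = len(s) / num_reducers
    return [s[int(step * i)] for i in range(1, num_reducers)]

def reducer_for(key, cuts):
    """Route a key to the first reducer whose upper boundary exceeds it."""
    for i, c in enumerate(cuts):
        if key < c:
            return i
    return len(cuts)

NUM_REDUCERS = 100
all_keys = range(1, 601)        # stands in for orderkeys 1..600,000,000
sample = list(range(12, 143))   # narrow sampled band, like 12.3M..142.3M

cuts = cutpoints(sample, NUM_REDUCERS)
counts = [0] * NUM_REDUCERS
for k in all_keys:
    counts[reducer_for(k, cuts)] += 1

print(counts[-1], sum(counts[:-1]))  # last reducer vs. all 99 others
```

In this scaled-down model the last reducer receives 460 of the 600 key units, more than the other 99 reducers combined, mirroring the 570-million-row straggler described above.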
> One way to work around this problem is to increase
> hive.optimize.sampling.orderby.number to a larger value (as close as possible
> to the number of rows in the input source table). But then we have to provide
> a higher heap for Hive (in hive-env.sh); otherwise it fails while creating the
> sampling data. With larger data volumes, it is not practical to sample the
> entire data set.
--
This message was sent by Atlassian JIRA
(v6.2#6252)