[ 
https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Klish updated PIG-2293:
-----------------------------

    Release Note: 
Overview
--------

 * This patch adds a new join type, 'merge-sparse' to PIG.
 * This patch adds a new 'IndexedStorage' UDF to piggybank. 

'merge-sparse' join operator
-----------------------------

The new functionality is similar to the existing 'merge' join in the following 
ways:
 * It expects the input to be sorted in both the left and right tables.
 * The join can be accomplished as a map-only merge of the left and right 
tables.

The new functionality is different from the existing 'merge' join in the 
following ways:
 * The performance characteristics of the join operations are different.
 * For every key in the left table, the join operator will seek to the 
corresponding key in the right table. 
 * Unlike a 'merge' join, PIG will not create an index of the right table if it 
is not already indexed.  The right table loader >must< implement 
IndexableLoadFunc.  The feature is disabled for 'merge-sparse' joins because 
the implementation of the DefaultIndexableLoader was not designed for multiple 
seeks.
  
The existing 'merge' join implementation only seeks to the first key from each 
input split of the left table.
It performs similar to a table scan with the optimization to skip records that 
belong to other input splits.
This implementation performs best when the join is dense.  In this case, 
scanning block by block will be 
efficient because of the likelihood of finding matches in many of the scanned 
blocks.

The new 'merge-sparse' join implementation performs best when the join is 
sparse and the right table
implements an efficient seek operation (typically, the right table should be 
indexed by the join keys).
In this case, the additional overhead of per record seeks is less than the IO & 
processing 
overhead of a partial table scan.

The performance tradeoff between the two methods depends on a number of factors:
 * The density of matching join keys in the right table.
 * The overhead/efficiency of the seek operator implementation of the right 
loader.  
 * The size and number of keys in the index.

Finding the exact performance inflection point will require trial and error for 
each data set and implementation of the right loader.

See the performance data below for guidance on when to use this operator.

IndexedStorage UDF
------------------

IndexedStorage is a form of PigStorage that supports a per record seek.  
IndexedStorage creates a separate (hidden) index file for
every data file that is written.  The format of the index file is:
| Header     |
| Index Body |
| Footer     |
The Header contains the list of record indices (field numbers) that represent 
index keys.
The Index Body contains a PIG Tuple for each record in the data.  
The fields of the Tuple are:
 * The index key(s) as a PIG Tuple 
 * The number of records that share this index key. 
 * Offset into the data file to read the first matching record. 
The Footer contains sequentially:
 * The smallest key(s) Tuple in the index. 
 * The largest key(s) Tuple in the index. 
 * The offset in bytes to the start of the footer. 

IndexedStorage implements IndexableLoadFunc and
can be used as the 'right table' in a PIG 'merge' or 'merge-sparse' join.

IndexedStorage does not require the data to be partitioned by index key(s).  
However, each partition (separate index) must be >locally< sorted by the index 
key(s) 
(partitions can contain overlapping ranges of keys).

Performance
------------------
Here is some performance data:

merge-sparse (Using IndexedStorage)
------------
SLOTS_MILLIS_MAPS 1,711,722
HDFS_BYTES_READ 13,305,558,069
HDFS_BYTES_WRITTEN 485,765,545
Execution Time: (2mins, 46sec)

merge (Using standard PigStorage)

SLOTS_MILLIS_MAPS 3,633,465
HDFS_BYTES_READ 91,896,826,321
HDFS_BYTES_WRITTEN 485,765,545
Execution Time: (6mins, 53sec) This does not include the index phase.

Left Table
131,540 Records

Right Table
28,476,640 Records
90,549,549,781 bytes

Join
96,174 Records

The join key was 37 bytes.

Sparseness by records: 0.34%
Sparseness by bytes: 0.0039%

Number of Mappers: 20

Other runs have similar data.

  was:
Overview
--------

 * This patch adds a new join type, 'merge-sparse' to PIG.
 * This patch adds a new 'IndexedStorage' UDF to piggybank. 

'merge-sparse' join operator
-----------------------------

The new functionality is similar to the existing 'merge' join in the following 
ways:
 * It expects the input to be sorted in both the left and right tables.
 * The join can be accomplished as a map-only merge of the left and right 
tables.

The new functionality is different from the existing 'merge' join in the 
following ways:
 * The performance characteristics of the join operations are different.
 * For every key in the left table, the join operator will seek to the 
corresponding key in the right table. 
 * The right table loader >must< implement IndexableLoadFunc.
  
The existing 'merge' join implementation only seeks to the first key from each 
input split of the left table.
It performs similar to a table scan with the optimization to skip records that 
belong to other input splits.
This implementation performs best when the join is dense.  In this case, 
scanning block by block will be 
efficient because of the likelihood of finding matches in many of the scanned 
blocks.

The new 'merge-sparse' join implementation performs best when the join is 
sparse and the right table
implements an efficient seek operation (typically, the right table should be 
indexed by the join keys).
In this case, the additional overhead of per record seeks is less than the IO & 
processing 
overhead of a partial table scan.

The performance tradeoff between the two methods depends on a number of factors:
 * The density of matching join keys in the right table.
 * The overhead/efficiency of the seek operator implementation of the right 
loader.  
 * The size and number of keys in the index.

Finding the exact performance inflection point will require trial and error for 
each data set and implementation of the right loader.

See the performance data below for guidance on when to use this operator.

IndexedStorage UDF
------------------

IndexedStorage is a form of PigStorage that supports a per record seek.  
IndexedStorage creates a separate (hidden) index file for
every data file that is written.  The format of the index file is:
| Header     |
| Index Body |
| Footer     |
The Header contains the list of record indices (field numbers) that represent 
index keys.
The Index Body contains a PIG Tuple for each record in the data.  
The fields of the Tuple are:
 * The index key(s) as a PIG Tuple 
 * The number of records that share this index key. 
 * Offset into the data file to read the first matching record. 
The Footer contains sequentially:
 * The smallest key(s) Tuple in the index. 
 * The largest key(s) Tuple in the index. 
 * The offset in bytes to the start of the footer. 

IndexedStorage implements IndexableLoadFunc and
can be used as the 'right table' in a PIG 'merge' or 'merge-sparse' join.

IndexedStorage does not require the data to be partitioned by index key(s).  
However, each partition (separate index) must be >locally< sorted by the index 
key(s) 
(partitions can contain overlapping ranges of keys).

Performance
------------------
Here is some performance data:

merge-sparse (Using IndexedStorage)
------------
SLOTS_MILLIS_MAPS 1,711,722
HDFS_BYTES_READ 13,305,558,069
HDFS_BYTES_WRITTEN 485,765,545
Execution Time: (2mins, 46sec)

merge (Using standard PigStorage)

SLOTS_MILLIS_MAPS 3,633,465
HDFS_BYTES_READ 91,896,826,321
HDFS_BYTES_WRITTEN 485,765,545
Execution Time: (6mins, 53sec) This does not include the index phase.

Left Table
131,540 Records

Right Table
28,476,640 Records
90,549,549,781 bytes

Join
96,174 Records

The join key was 37 bytes.

Sparseness by records: 0.34%
Sparseness by bytes: 0.0039%

Number of Mappers: 20

Other runs have similar data.

          Status: Patch Available  (was: Reopened)
    
> Pig should support a more efficient merge join against data sources that 
> natively support point lookups or where the join is against large, sparse 
> tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>             Fix For: 0.10
>
>         Attachments: PIG-2293-1.patch, PIG-2293-2.patch, PIG-2293-3.patch, 
> PIG-2293-4.patch, PIG-2293-5.patch, PIG-2293-6.patch.txt, PIG-2293-7.patch, 
> e2e_test.txt, patch.txt, patch.txt
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing PIG merge join has the following limitations:
>    1. It assumes the right side of the table must be accessed sequentially - 
> record by record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface 
> IndexableLoadFunc.  This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next 
> record).  
> The merge join physical operator only calls 'seekNear' for the first key in 
> each split (effectively eliminating splits
> where the first and subsequent keys will not be found).  Subsequent joins are 
> found by reading sequentially through
> the records on the right table looking for matches from the left table.
> While this method works well for dense join tables - it performs poorly 
> against large sparse tables or data sources that support 
> point lookups natively (HBase for example).
> The proposed enhancement is to add a new join type - 'merge-sparse' to PIG 
> latin.  When specified in the PIG script, this join type
> will cause the merge join operator to call seekNear on each and every key 
> (rather than just the first in each split).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to