[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780735#action_12780735 ]
Yan Zhou commented on PIG-1077:
-------------------------------

This patch is also targeted for the 0.6 release, so it needs to be on the 0.6 branch too.

> [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1077
>                 URL: https://issues.apache.org/jira/browse/PIG-1077
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Chao Wang
>            Assignee: Chao Wang
>             Fix For: 0.7.0
>
>         Attachments: patch_Pig1077
>
>
> TFile currently supports split by record sequence number (see JIRA HADOOP-6218). We want to utilize this to provide record(row)-based input split support in Zebra.
> One prominent benefit: for very large data files, we can create much more fine-grained input splits than before, when we could only create one big split for one big file.
> In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to generate) as follows:
> 1) Select the biggest column group in terms of data size, split all of its TFiles according to the HDFS block size (64 MB or 128 MB), and get a list of physical byte offsets as the output per TFile. For example, assume that for the first TFile we get offset1, offset2, ..., offset10.
> 2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10.
> 3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, respectively, to form 11 record-based input splits for the first TFile.
> 4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).
> Note: the conversion from byte offset to record number is done by each mapper, rather than at the job initialization phase. This is due to a performance concern, since the conversion incurs some TFile reading overhead.
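The stitching in step 3 above can be sketched as follows. This is a minimal, self-contained illustration, not Zebra's actual implementation: the class and method names (RecordSplitSketch, RecordRange, stitchSplits) are hypothetical, and the record numbers passed in stand for the values that the real code would obtain from TFile.Reader.getRecordNumNear(offset).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of stitching boundary record numbers into contiguous, inclusive
// record-range splits: [0, r1], [r1+1, r2], ..., [rN+1, lastRecordNum].
// All names here are illustrative, not part of Zebra's API.
public class RecordSplitSketch {

    static final class RecordRange {
        final long begin; // first record number in the split, inclusive
        final long end;   // last record number in the split, inclusive
        RecordRange(long begin, long end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // boundaryRecordNums: record numbers nearest each block-boundary byte
    // offset of one TFile (in the real code, one per HDFS-block offset,
    // obtained via getRecordNumNear); lastRecordNum: the file's final record.
    static List<RecordRange> stitchSplits(long[] boundaryRecordNums,
                                          long lastRecordNum) {
        List<RecordRange> splits = new ArrayList<>();
        long begin = 0;
        for (long r : boundaryRecordNums) {
            splits.add(new RecordRange(begin, r)); // close the current split at r
            begin = r + 1;                         // next split starts just past it
        }
        // Final split covers everything after the last boundary.
        splits.add(new RecordRange(begin, lastRecordNum));
        return splits;
    }
}
```

With 10 boundary record numbers this yields the 11 splits described in step 3; each resulting range would then back one scanner created via TFile's createScannerByRecordNum in step 4.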