[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988159#action_12988159
 ] 

Ashutosh Chauhan commented on PIG-1828:
---------------------------------------

Thanks, Lukas, for checking. This indicates that TableSplits are not really 
combinable. Thinking more about it, I think Pig's basic assumption, that splits 
can be combined in general and that only special cases (which Pig checks for 
itself) are excluded, is not correct. The question of whether splits can be 
combined should really be put to the loader, not assumed. Also, this OLF thing 
is too complicated. The condition imposed by OLF is one possibility, but I 
assume there are other scenarios where a loader is not an OLF but its splits 
are still not combinable. I would propose adding a new method to LoadFunc, 
asking the loader directly, and dropping all the logic that tries to determine 
whether splits are combinable.
{java}
// By default, splits generated by a loader are considered combinable, to
// preserve current behavior.
public boolean isCombinable() {
    return true;
}
{java}

The good thing is that LoadFunc is an abstract class, so adding the method 
with a default implementation won't break backward compatibility.
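
To make the proposal concrete, here is a rough, self-contained sketch. It does 
not use Pig's real classes: SketchLoadFunc, FileLikeLoader, TableLikeLoader and 
planSplits() are hypothetical stand-ins, and planSplits() lumps all splits into 
one group, which is a simplification of whatever the real combination logic 
does. It only illustrates the default-true method and a loader opting out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinableSketch {

    // Hypothetical stand-in for Pig's abstract LoadFunc; only the proposed
    // method is modeled here.
    static abstract class SketchLoadFunc {
        // Default keeps current behavior: splits are treated as combinable.
        public boolean isCombinable() {
            return true;
        }
    }

    // An existing loader that never heard of isCombinable(): it inherits the
    // default, so nothing changes for it.
    static class FileLikeLoader extends SketchLoadFunc {
    }

    // A loader whose splits must stay separate (assumed: one per table region).
    static class TableLikeLoader extends SketchLoadFunc {
        @Override
        public boolean isCombinable() {
            return false;
        }
    }

    // Hypothetical planning step: ask the loader instead of guessing.
    // Simplification: combinable splits all go into a single group.
    static List<List<String>> planSplits(SketchLoadFunc loader, List<String> splits) {
        List<List<String>> plan = new ArrayList<>();
        if (loader.isCombinable()) {
            plan.add(new ArrayList<>(splits));
        } else {
            for (String split : splits) {
                List<String> single = new ArrayList<>();
                single.add(split);
                plan.add(single);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("region-1", "region-2", "region-3");
        System.out.println("file-like groups:  "
                + planSplits(new FileLikeLoader(), splits).size());
        System.out.println("table-like groups: "
                + planSplits(new TableLikeLoader(), splits).size());
    }
}
```

With three input splits, the file-like loader ends up with one group and the 
table-like loader with three, i.e. one per region.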

@Dmitriy,
As I pointed out above, adding OLF to HBaseStorage will not help, though it 
won't hurt either. A quick fix for the HBaseStorage loader for now is to set 
the key to false somewhere early. I think setLocation() or setSchema() is one 
of the first methods called on a LoadFunc, and since the checks that decide on 
combination happen much later, the loader's setting of that key to false will 
be seen and combination won't happen. That avoids the need to tell users of 
HBaseStorage to set the key themselves.
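
A rough sketch of the ordering argument, again with stand-ins rather than Pig's 
real API: java.util.Properties plays the role of the job configuration, 
COMBINE_KEY is a placeholder name (not the actual key), and TableLikeLoader and 
shouldCombine() are hypothetical. The point is only that a value written in an 
early call is visible to a check that runs later.

```java
import java.util.Properties;

public class EarlyKeySketch {

    // Placeholder key name for illustration only; not the real Pig property.
    static final String COMBINE_KEY = "sketch.split.combination";

    // Hypothetical loader: switches combination off as soon as it is given
    // its location, which is assumed to happen early in planning.
    static class TableLikeLoader {
        void setLocation(String location, Properties jobConf) {
            jobConf.setProperty(COMBINE_KEY, "false");
        }
    }

    // Hypothetical later check: combination is on unless the key says false.
    static boolean shouldCombine(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(COMBINE_KEY, "true"));
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        System.out.println("before setLocation: " + shouldCombine(jobConf));

        new TableLikeLoader().setLocation("some-table", jobConf);
        System.out.println("after setLocation:  " + shouldCombine(jobConf));
    }
}
```

The user never touches the key; the loader flips it before the check runs.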


> HBaseStorage has problems with processing multiregion tables
> ------------------------------------------------------------
>
>                 Key: PIG-1828
>                 URL: https://issues.apache.org/jira/browse/PIG-1828
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>         Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>            Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html) Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
