[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772623#action_12772623 ]
Thejas M Nair commented on PIG-1062:
------------------------------------

Even after the interface changes, Pig can compute the file size by adding up the size of each split (from InputSplit.getLength()). The documentation of that function in the interface does not make it clear whether this is the size on disk, compressed or uncompressed, etc. Assuming it is the size on disk (uncompressed), estimating the total memory it will require is a challenge: one has to make assumptions about the compression ratio and the serialization method. Using Tuple.getMemorySize() while sampling will give more accurate numbers for the memory the data will consume in the reducer.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses, RandomSampleLoader and PoissonSampleLoader, need to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.
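
The two estimates discussed in the comment above could be sketched roughly as follows. This is only an illustration, not code from the Pig source tree. SizedSplit is a hypothetical stand-in for Hadoop's InputSplit (only getLength() is modelled, without its checked exceptions), and SplitSizeEstimate, totalSplitBytes and estimateMemoryBytes are made-up names. Scaling the on-disk size by a ratio measured on sampled tuples is an assumption about how the Tuple.getMemorySize() numbers might be used; the comment does not prescribe it.

```java
import java.util.List;

// Hypothetical stand-in for Hadoop's InputSplit; only the length accessor is modelled.
interface SizedSplit {
    long getLength();
}

final class SplitSizeEstimate {
    private SplitSizeEstimate() {}

    // Total input size: the sum of the lengths reported by each split.
    static long totalSplitBytes(List<? extends SizedSplit> splits) {
        long total = 0L;
        for (SizedSplit split : splits) {
            total += split.getLength();
        }
        return total;
    }

    // Assumed approach: scale the total on-disk size by the ratio of
    // in-memory bytes to on-disk bytes observed on the sampled tuples.
    // sampledMemoryBytes would be the sum of Tuple.getMemorySize() over the sample.
    static long estimateMemoryBytes(long totalDiskBytes,
                                    long sampledDiskBytes,
                                    long sampledMemoryBytes) {
        if (sampledDiskBytes <= 0) {
            throw new IllegalArgumentException("sampledDiskBytes must be positive");
        }
        double inflation = (double) sampledMemoryBytes / (double) sampledDiskBytes;
        return Math.round(totalDiskBytes * inflation);
    }
}
```

Measuring the ratio on the sample avoids having to guess the compression ratio and serialization overhead separately, which is the difficulty the comment points out. The estimate is only as good as the sample is representative.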