[ https://issues.apache.org/jira/browse/PIG-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983758#action_12983758 ]
Vivek Padmanabhan commented on PIG-1561:
----------------------------------------

In the current XML loader, XMLLoaderBufferedPositionedInputStream reads the entire XML file without considering the split's start and end offsets. Hence, if an XML file is larger than the block size, the MapReduce job runs multiple mappers, but the loader in every mapper loads the entire XML file.

For example, if I have an XML file of 256 MB and the block size is 128 MB, there will be two mappers, but because of the loader both mappers will read the entire file regardless of the split boundaries. This is functionally wrong, and it is the reason why I marked it as unsplittable.

> XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
> ------------------------------------------------------------------------
>
> Key: PIG-1561
> URL: https://issues.apache.org/jira/browse/PIG-1561
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Viraj Bhat
> Assignee: Vivek Padmanabhan
> Attachments: PIG-1561-1.patch
>
>
> I have a simple Pig script which uses the XMLLoader after the Piggybank is built.
> {code}
> register piggybank.jar;
> A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
> B = limit A 1;
> dump B;
> --store B into '/user/viraj/handlegz' using PigStorage();
> {code}
> returns empty tuple
> {code}
> ()
> {code}
> If you supply the uncompressed XML file, you get
> {code}
> (<property>
> <name>mapred.capacity-scheduler.queue.my.capacity</name>
> <value>10</value>
> <description>Percentage of the number of slots in the cluster that are
> guaranteed to be available for jobs in this queue.
> </description>
> </property>)
> {code}
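
The split problem described in the comment above can be illustrated with a small sketch. This is not the Piggybank code or the attached patch: `read_split` and its parameters are hypothetical names, and the sketch assumes the common convention that a record belongs to the split in which its opening tag starts. It also only matches a bare `<tag>` with no attributes or nesting.

```python
from typing import List


def read_split(data: bytes, tag: str, start: int, end: int) -> List[bytes]:
    """Return the <tag>...</tag> blocks whose opening tag begins in [start, end).

    Hypothetical helper for illustration only; assumes no attributes on the
    tag and no nested elements with the same name.
    """
    open_tag = b"<" + tag.encode() + b">"
    close_tag = b"</" + tag.encode() + b">"
    records = []
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        # Stop when no record is left or the next one belongs to a later split.
        if begin == -1 or begin >= end:
            break
        finish = data.find(close_tag, begin)
        if finish == -1:
            break  # truncated record, nothing complete to emit
        finish += len(close_tag)
        # Reading past `end` is intentional: the split that sees the opening
        # tag is the one that emits the whole record.
        records.append(data[begin:finish])
        pos = finish
    return records


# Two "mappers" over one file, split at the midpoint.
xml = (b"<conf><property>a</property><property>b</property>"
       b"<property>c</property></conf>")
mid = len(xml) // 2
first = read_split(xml, "property", 0, mid)
second = read_split(xml, "property", mid, len(xml))
```

With this convention each record is emitted by exactly one reader, so `first + second` contains every `<property>` block once. A loader that ignores `start` and `end`, as the comment describes, would instead return all three records from both readers.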