[
https://issues.apache.org/jira/browse/PIG-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983758#action_12983758
]
Vivek Padmanabhan commented on PIG-1561:
----------------------------------------
In the current XML loader, XMLLoaderBufferedPositionedInputStream reads the
entire XML file without considering the split's start and end offsets.
Hence, if an XML file is larger than the block size, the MR job runs multiple
mappers, but the loader in every mapper loads the entire XML file.
For example, if I have a 256 MB XML file and the block size is 128 MB, there
will be two mappers, but because of the loader both mappers will read the
entire file regardless of the split boundaries. This is functionally wrong,
and it is the reason why I marked the loader as unsplittable.
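
To illustrate the difference, here is a small standalone Python sketch. It is
not the Piggybank code: the function names, the sample data and the tag
matching (plain <tag> with no attributes, no nesting of the same tag) are all
assumptions made for the illustration. It contrasts a reader that honours the
split boundaries with one that ignores them, which is roughly what the current
loader does.

```python
def records_for_split(data: bytes, tag: str, start: int, end: int) -> list:
    """Return the <tag>...</tag> elements whose opening tag begins in [start, end).

    A record that starts inside the split is read to its closing tag even if
    that lies beyond `end`; a record that starts before `start` is skipped,
    because the previous split owns it.
    """
    open_tag = b"<" + tag.encode() + b">"
    close_tag = b"</" + tag.encode() + b">"
    records = []
    pos = data.find(open_tag, start)
    while pos != -1 and pos < end:
        close = data.find(close_tag, pos)
        if close == -1:
            break  # unterminated element, stop reading
        stop = close + len(close_tag)
        records.append(data[pos:stop])
        pos = data.find(open_tag, stop)
    return records


def records_ignoring_split(data: bytes, tag: str, start: int, end: int) -> list:
    """Hypothetical stand-in for the current behaviour: the split is ignored."""
    return records_for_split(data, tag, 0, len(data))


# Hypothetical sample file, cut into two "splits" at its midpoint.
xml = (b"<r><property>a</property><property>b</property>"
       b"<property>c</property></r>")
mid = len(xml) // 2
splits = [(0, mid), (mid, len(xml))]

split_aware = [records_for_split(xml, "property", s, e) for s, e in splits]
split_blind = [records_ignoring_split(xml, "property", s, e) for s, e in splits]

print(sum(len(r) for r in split_aware))  # each record counted once overall
print(sum(len(r) for r in split_blind))  # each record counted once per mapper
```

In this sketch the second record straddles the midpoint, so the first split
reads past its end to finish it and the second split skips ahead to the next
opening tag. A splittable loader would need this kind of boundary handling;
until then, marking the input as unsplittable avoids the duplicate reads.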
> XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
> ------------------------------------------------------------------------
>
> Key: PIG-1561
> URL: https://issues.apache.org/jira/browse/PIG-1561
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Viraj Bhat
> Assignee: Vivek Padmanabhan
> Attachments: PIG-1561-1.patch
>
>
> I have a simple Pig script which uses the XMLLoader after the Piggybank is
> built.
> {code}
> register piggybank.jar;
> A = load '/user/viraj/capacity-scheduler.xml.gz' using
> org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
> B = limit A 1;
> dump B;
> --store B into '/user/viraj/handlegz' using PigStorage();
> {code}
> This returns an empty tuple:
> {code}
> ()
> {code}
> If you supply the uncompressed XML file, you get
> {code}
> (<property>
> <name>mapred.capacity-scheduler.queue.my.capacity</name>
> <value>10</value>
> <description>Percentage of the number of slots in the cluster that are
> guaranteed to be available for jobs in this queue.
> </description>
> </property>)
> {code}