[ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999315#comment-12999315 ]

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

Hi Alan,
Below is how I have handled these cases:

Note :-
The XMLLoader considers one record to span from the beginning tag to the end tag,
just like a line record reader searching for the newline character.
Split start and end locations are provided by the default FileInputFormat.
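To make that record model concrete, here is a minimal, self-contained sketch (this is
not the actual XMLLoader code; the method name and the byte-at-a-time reading are just
for illustration):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TagRecordSketch {

    // Read one beginTag ... endTag record from the stream, the way a line
    // record reader scans for '\n'.  Returns null if no complete record is
    // found before EOF.  Simplified: no split-boundary handling yet.
    static String nextRecord(InputStream in, String beginTag, String endTag)
            throws IOException {
        StringBuilder window = new StringBuilder();   // bytes seen while searching
        StringBuilder record = new StringBuilder();   // the record being collected
        boolean inRecord = false;
        int b;
        while ((b = in.read()) != -1) {
            char c = (char) b;
            if (!inRecord) {
                window.append(c);
                if (window.toString().endsWith(beginTag)) {
                    inRecord = true;
                    record.append(beginTag);
                }
            } else {
                record.append(c);
                if (record.toString().endsWith(endTag)) {
                    return record.toString();          // full record collected
                }
            }
        }
        return null;                                   // EOF before a full record
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "noise<a>hello</a>noise".getBytes(StandardCharsets.UTF_8));
        System.out.println(nextRecord(in, "<a>", "</a>"));   // prints <a>hello</a>
    }
}
{code}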




Describing the entire process in simple steps (a rough sketch in code follows this list):

* The loader collects the start and end tags and creates a record out of them.
  (XMLLoaderBufferedPositionedInputStream.collectTag)
        * For the begin tag:
                * Read until the tag is found in this block.
                        * If the tag is not found and the split end has been reached,
                          then no record is found in this split (return an empty array).
                        * If a partial tag is found in the current split, then even though
                          the split end has been reached, continue reading the rest of the
                          file beyond the split end location (handled by the condition in
                          the while loop).
        * For the end tag:
                * Read until the end tag is found, even if the split end location has
                  been reached.
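
A rough, self-contained sketch of that control flow (the real implementation lives in
XMLLoaderBufferedPositionedInputStream.collectTag and uses its own buffering; the method
signature, variable names, and the simplified tag matcher below are my own assumptions,
purely for illustration):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class CollectTagSketch {

    // Collect one begin-tag .. end-tag record for a split covering the byte
    // range [splitStart, splitEnd).  The stream is assumed to already be
    // positioned at splitStart.  Returns an empty array when no record
    // belonging to this split is found.
    static byte[] collectRecord(InputStream in, long splitStart, long splitEnd,
                                byte[] beginTag, byte[] endTag) throws IOException {
        long pos = splitStart;
        int matched = 0;                 // bytes of beginTag matched so far

        // Begin tag: search only inside the split, but if a partial match is
        // in progress at split end, keep reading past it (simplified matcher;
        // the real code buffers the partial match in matchBuf).
        while (pos < splitEnd || matched > 0) {
            int b = in.read();
            if (b == -1) {
                return new byte[0];      // EOF: no record in this split
            }
            pos++;
            matched = (b == (beginTag[matched] & 0xff)) ? matched + 1
                    : (b == (beginTag[0] & 0xff)) ? 1 : 0;
            if (matched == beginTag.length) {
                break;                   // full begin tag found
            }
        }
        if (matched < beginTag.length) {
            return new byte[0];          // split end reached, no begin tag
        }

        // End tag: read until it is found, even beyond splitEnd.
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.write(beginTag, 0, beginTag.length);
        matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            record.write(b);
            matched = (b == (endTag[matched] & 0xff)) ? matched + 1
                    : (b == (endTag[0] & 0xff)) ? 1 : 0;
            if (matched == endTag.length) {
                return record.toByteArray();   // complete record
            }
        }
        return new byte[0];              // unterminated record at EOF
    }
}
{code}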
        
                
>>How far will split 1 read? It seems like it has to read to "</a>" or else the 
>>map processing split one will not be able to process this as a coherent 
>>document. 
>>Yet from the setting of maxBytesReadable on line 132 it looks to me like it 
>>won't read past the end point.

The other condition in the while loop, (matchBuf.size() > 0), keeps the read going
past that point.

In this case, let's say my tag identifier is <a>. The loader will read up to the
split end searching for the beginning tag. Then, for the end tag, it reads the rest
of the file starting from the last read position. If the split end is reached in
between, it checks whether it has found a match or a partial match; if not, it
proceeds with the reading until it finds an end tag.
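
A toy example of that boundary case, operating on an in-memory string instead of the
loader's buffered stream (the helper name and the substring-based "splits" are made up
purely for illustration, and it glosses over the partial-tag case described in the
list above):

{code:java}
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryExample {

    // Collect every <a>...</a> record whose begin tag starts inside the split
    // [start, end).  The closing tag may lie past 'end'; scanning continues
    // until it is found, as described above.  Illustrative only: real splits
    // are byte ranges over HDFS blocks, not substring indexes.
    static List<String> recordsForSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int from = start;
        while (true) {
            int tagStart = data.indexOf("<a>", from);
            if (tagStart < 0 || tagStart >= end) {
                break;                   // no more records owned by this split
            }
            int tagEnd = data.indexOf("</a>", tagStart);
            if (tagEnd < 0) {
                break;                   // unterminated record at EOF
            }
            records.add(data.substring(tagStart, tagEnd + "</a>".length()));
            from = tagEnd + "</a>".length();
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<a>one</a><a>two</a>";
        int boundary = 13;               // falls inside the second record
        System.out.println(recordsForSplit(data, 0, boundary));
        // -> [<a>one</a>, <a>two</a>]   (split 1 reads past its end for </a>)
        System.out.println(recordsForSplit(data, boundary, data.length()));
        // -> []                         (record 2 already belongs to split 1)
    }
}
{code}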

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch
>
>
> The current XMLLoader for Pig does not work well for large datasets such as 
> the wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
> extremely slow run times.
> Viraj

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
