This is a bad way to do it, but I wonder what would happen in that case.
Assume we have the following file:

<element>
               ...
</element>
<element>
               ...
</element>

and we use TextInputFormat to get single lines as records, check each line for 
an <element> tag, and then buffer all the lines until we find the </element> 
tag. Then we use an XML parser to parse the buffered string in order to get our 
XML object.

From my understanding now, as Pig/Hadoop will only make sure that we read 
entire records, it could happen that our file gets sliced like

---slice1---
<element>
               ...
---slice2---
</element>
<element>
               ...
</element>

then, since we still read entire records (lines), the reads themselves will be 
fine, but we lose the first element, right?
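
To make the scenario concrete, here is a minimal sketch of the buffering idea, 
written in Python for brevity rather than as a Hadoop mapper. The function name 
parse_elements and the <id> child tag are made up for illustration, and it 
assumes the tags sit on their own lines and are not nested. Each slice is 
processed on its own, the way separate map tasks would see it.

```python
import xml.etree.ElementTree as ET


def parse_elements(lines, tag="element"):
    """Buffer lines from <tag> to </tag> and parse each complete buffer."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    buffer = None  # None means we are not inside an element
    for line in lines:
        text = line.strip()
        if buffer is None:
            if not text.startswith(open_tag):
                continue  # stray line, e.g. a closing tag with no opener
            buffer = []
        buffer.append(line)
        if text.endswith(close_tag):
            yield ET.fromstring("\n".join(buffer))
            buffer = None
    # Anything still buffered here was never closed and is silently dropped.


whole = [
    "<element>", "  <id>1</id>", "</element>",
    "<element>", "  <id>2</id>", "</element>",
]
slice1, slice2 = whole[:2], whole[2:]  # cut inside the first element

for name, part in [("whole", whole), ("slice1", slice1), ("slice2", slice2)]:
    print(name, [e.findtext("id") for e in parse_elements(part)])
```

In this sketch the first element shows up only when the file is read as a 
whole. Neither slice sees both its opening and closing tag, so it is lost.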

It's kind of a record granularity mismatch.

Would be great if someone could confirm that.

Thanks,
Will
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, 1 March 2011 22:05
To: user@pig.apache.org
Cc: Lai Will
Subject: Re: Custom Slicer

Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can 
read up on what those entail in the Hadoop documentation and books.

As far as dealing with partial records at the beginning and end of the slice, 
the normal pattern is to always read a full record even if it takes you past 
the configured range, and to ignore any partial records in the beginning of a 
slice (because the previous slice will pick them up as part of its read). So if 
I were to represent records as letters, and slice boundaries as dots, it would 
look something like this:

aaabbb.bbccccdd.ddeee.eeee

Would be read in as follows:

Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --
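
A rough Python sketch of that pattern, for illustration only (this is not 
Hadoop's actual reader code). It assumes newline-delimited records and 
half-open byte ranges; read_split is a made-up name. A record belongs to the 
slice in which its first byte falls.

```python
def read_split(data: bytes, start: int, end: int, delim: bytes = b"\n"):
    """Return the records that begin inside [start, end).

    Assumes a single-byte delimiter. The last record may extend past `end`.
    """
    pos = start
    if start != 0:
        # Skip the partial record at the front; the previous slice reads it.
        # Searching from start - 1 keeps a record that begins exactly at start.
        idx = data.find(delim, start - 1)
        if idx == -1:
            return []
        pos = idx + len(delim)
    records = []
    while pos < end and pos < len(data):
        idx = data.find(delim, pos)
        stop = len(data) if idx == -1 else idx  # may run past `end`
        records.append(data[pos:stop])
        pos = stop + len(delim)
    return records


data = b"aaa\nbbbbb\ncccc\ndddd\neeeeeee\n"
splits = [(0, 7), (7, 14), (14, 21), (21, 28)]
for start, end in splits:
    print((start, end), read_split(data, start, end))
```

Every record is returned by exactly one slice, and a slice that starts in the 
middle of the last record returns nothing.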

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <l...@student.ethz.ch> wrote:
Hello,

The data I want to process is XML. It boils down to

<element>
               ...
</element>
<element>
               ...
</element>

According to what I read in the documentation, when loading the file using the 
default Slicer, I end up with block-sized chunks that will very likely contain 
partial <element>s at the beginning and at the end. I don't want to ignore 
those.
I want to have slices at the element boundaries, and have reasonably sized 
chunks (e.g. the largest chunk that is smaller than block size and that 
contains only whole <element>s).
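
One way to picture element-aligned reading without moving the slice boundaries 
themselves is to assign each element to the slice in which its opening tag 
begins, and to read through to the closing tag even if that runs past the end 
of the slice. The Python sketch below is only an illustration under those 
assumptions: read_elements is a hypothetical name, and it assumes elements 
have no attributes and are not nested.

```python
def read_elements(data: str, start: int, end: int, tag: str = "element"):
    """Return whole elements whose opening tag begins inside [start, end)."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    elements = []
    pos = data.find(open_tag, start)  # first opening tag at or after start
    while pos != -1 and pos < end:
        close = data.find(close_tag, pos)
        if close == -1:
            break  # unterminated element at the end of the data
        stop = close + len(close_tag)  # may lie beyond `end`
        elements.append(data[pos:stop])
        pos = data.find(open_tag, stop)
    return elements


data = (
    "<element>\n  first\n</element>\n"
    "<element>\n  second\n</element>\n"
)
cut = 15  # falls inside the first element
print(read_elements(data, 0, cut))
print(read_elements(data, cut, len(data)))
```

With this rule each chunk contains only whole elements, at the cost of chunks 
no longer matching the block size exactly.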

Unfortunately the user documentation is not very helpful to me, so can anyone 
help me with that?

I found an XMLLoader in the Piggybank, but that does not solve my issue with 
slicing.

Best,
Will
