This is a bad way to do it, but I wonder what would happen in that case. Assume we have the following file:

    <element> ... </element>
    <element> ... </element>

and we use TextInputFormat to get single lines as records. We check each line for an <element> tag and, once we find one, buffer all the lines until we find the matching </element> tag. Then we use an XML parser to parse the buffered string in order to get our XML object.

From my understanding, as Pig/Hadoop will only make sure that we read entire records (here, entire lines), it could happen that our file gets sliced like this:

    ---slice1---
    <element>
    ...
    ---slice2---
    </element>
    <element> ... </element>

Since we read entire records, each line will be read intact, but we lose the first element, right? It's a kind of record granularity mismatch.

It would be great if someone could confirm that.

Thanks,
Will

From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, March 1, 2011 22:05
To: user@pig.apache.org
Cc: Lai Will
Subject: Re: Custom Slicer

Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can read up on what those entail in the Hadoop documentation and books.

As far as dealing with partial records at the beginning and end of the slice, the normal pattern is to always read a full record even if it takes you past the configured range, and to ignore any partial records at the beginning of a slice (because the previous slice will pick them up as part of its read).

So if I were to represent records as letters, and slice boundaries as dots, something like this:

    aaabbb.bbccccdd.ddeee.eeee

would be read in as follows:

    Slice 1: aaabbbbb
    Slice 2: (skips bb) ccccdddd
    Slice 3: (skips dd) eeeeeee
    Slice 4: (skips eeee) -- nothing --

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <l...@student.ethz.ch> wrote:

Hello,

The data I want to process is XML. It boils down to

    <element> ... </element>
    <element> ... </element>

According to what I read in the documentation, when loading the file using the default Slicer, I end up with block-sized chunks that will very likely contain partial <element>s at the beginning and at the end.
I don't want to ignore those. I want to slice at the element boundaries and have reasonably sized chunks (e.g. the largest chunk that is smaller than the block size and contains only whole <element>s).

Unfortunately the user documentation is not very helpful to me, so can anyone help me with that? I found an XMLLoader in the Piggybank, but that does not solve my issue with slicing.

Best,
Will
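
To make the boundary pattern from Dmitriy's reply concrete, here is a minimal Python sketch applied to <element> records. It is only an in-memory simulation, not Pig or Hadoop code: the names read_split and read_all, the string-based input, and the assumption that elements are not nested are all made up for illustration. A slice owns every element whose opening tag starts inside its range, reads past its end to finish the last element, and skips anything before the first opening tag it finds.

```python
from typing import List

OPEN, CLOSE = "<element>", "</element>"


def read_split(data: str, start: int, end: int) -> List[str]:
    """Return the elements owned by the slice [start, end).

    Hypothetical helper: an element belongs to the slice in which its
    opening tag begins. Assumes elements are not nested.
    """
    records = []
    # Skip any partial element at the beginning of the slice: the search
    # only finds opening tags that begin at or after `start`.
    pos = data.find(OPEN, start)
    while pos != -1 and pos < end:
        close = data.find(CLOSE, pos)
        if close == -1:
            break  # truncated last element: nothing complete to emit
        stop = close + len(CLOSE)  # may lie beyond `end`; read on anyway
        records.append(data[pos:stop])
        pos = data.find(OPEN, stop)
    return records


def read_all(data: str, split_size: int) -> List[str]:
    """Read every slice of `split_size` characters and concatenate the results."""
    out = []
    for start in range(0, len(data), split_size):
        out.extend(read_split(data, start, start + split_size))
    return out


if __name__ == "__main__":
    sample = "<element>a</element>\n<element>bb</element>\n<element>ccc</element>\n"
    # Slices of 25 characters cut through the middle of elements.
    for start in range(0, len(sample), 25):
        print(start, read_split(sample, start, start + 25))
```

With this ownership rule, no element is lost or read twice regardless of where the slice boundaries fall. In Hadoop the same logic would presumably live in a custom InputFormat's RecordReader, working on a byte stream rather than an in-memory string.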