Hello,

I'm currently encountering the following problem.

I have an XML file that gets loaded using a custom LoadFunc.

Boiled down, my XML file looks like this:
<files>
<file>
<id>
                1
                </id>
                <text>
                                This is a sample text that contains newlines,
which should be preserved when parsing.
                </text>
</file>
<file> ... </file>
<file> ... </file>
...
</files>

So the text contains a newline (whether it is \r\n or \n does not matter).
When parsing the XML, I read the contents of <text> into a string and add it
to the list that is returned by the LoadFunc.
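
The parsing step can be sketched roughly as follows. This is only an
illustration, not the actual loader: the class FileRecordParser and its
parse method are hypothetical names, it uses the JDK's DOM parser, and it
leaves out all Pig classes (a real LoadFunc would wrap the fields in a
Tuple inside getNext()).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of Pig).
class FileRecordParser {

    // Turns every <file> element into a list of fields: [id, text].
    static List<List<Object>> parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<List<Object>> records = new ArrayList<>();
        NodeList files = doc.getElementsByTagName("file");
        for (int i = 0; i < files.getLength(); i++) {
            Element file = (Element) files.item(i);
            String id = file.getElementsByTagName("id").item(0).getTextContent();
            String text = file.getElementsByTagName("text").item(0).getTextContent();

            List<Object> fields = new ArrayList<>();
            fields.add(Integer.parseInt(id.trim()));
            // trim() only removes whitespace at both ends of the string;
            // newlines inside the text are kept as they are.
            fields.add(text.trim());
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<files><file><id>\n 1\n </id><text>\n"
                + " This is a sample text that contains newlines,\n"
                + "which should be preserved when parsing.\n </text></file></files>";
        for (List<Object> fields : parse(xml)) {
            // Print the newline in escaped form so it is visible in the output.
            System.out.println(fields.get(0) + "\t"
                    + ((String) fields.get(1)).replace("\n", "\\n"));
        }
    }
}
```

At this point the string still contains its inner newline, so the loss
described below would have to happen after the fields are handed over.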

The problem is that whenever I dump, store, or use the intermediate result
in another UDF, e.g. with

raw = LOAD 'data/files.xml' USING org.my.MyCustomXMLLoader() AS (id:int, text:chararray);
dump raw;

or

raw = LOAD 'data/files.xml' USING org.my.MyCustomXMLLoader() AS (id:int, text:chararray);
clean = FOREACH raw GENERATE id, org.my.MyCleaner(text) as clean_text;

the newlines are completely stripped away:

1              This is a sample text that contains newlines,which should be preserved when parsing.

In the latter example, this causes MyCleaner() to fail.

How can I preserve the newline in Pig?

Best,
Will


