Greetings, I have an interesting problem I'm trying to solve. I currently store a large number of webpages in one big XML file in Hadoop. I'm trying to parse information out of these webpages with a complex C# program that runs on Mono (I'm in a Linux environment), so I'm using Hadoop Streaming with the StreamXmlRecordReader to feed the records to my C# parser. The problem is that even though each page is wrapped in XML, Hadoop Streaming still ends records at newlines, so a page whose content contains newlines reaches the mapper as several separate fragments. This makes the map input data pretty useless. Does anyone have any hints on how to get around this?
Here's the XML structure I'm trying to use:

<ContentRecord><RecordURL>http://www.blah</RecordURL><PageContent><![CDATA[page text would be here, including newlines ]]></PageContent></ContentRecord>

Any ideas?

Cheers,
Bradford
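
P.S. One workaround I can imagine, which I have not yet tried on the cluster, is to have the mapper reassemble records itself: keep reading lines from stdin and only hand text to the parser once the closing </ContentRecord> tag has been seen. Below is a minimal sketch of that idea, written in Python only for brevity (the same loop could sit in front of the C# parser). The names iter_records and END_TAG are made up for illustration, and it assumes the literal string </ContentRecord> never appears inside the CDATA page text.

```python
from typing import Iterable, Iterator

# Closing tag of one record, taken from the XML structure above.
END_TAG = "</ContentRecord>"


def iter_records(lines: Iterable[str], end_tag: str = END_TAG) -> Iterator[str]:
    """Glue newline-split input back into whole records.

    Hypothetical helper: accumulates incoming lines and yields one string
    per record, each ending at `end_tag`. Assumes `end_tag` never occurs
    inside the page text itself.
    """
    buf = ""
    for line in lines:
        buf += line
        # A single line may complete one record (or, in principle, several).
        while True:
            pos = buf.find(end_tag)
            if pos == -1:
                break
            cut = pos + len(end_tag)
            # Drop the newline left over between consecutive records.
            yield buf[:cut].lstrip()
            buf = buf[cut:]
    # Anything left in buf is an incomplete record and is ignored here.


# Small in-memory demonstration: the first record is spread over three lines.
sample = [
    "<ContentRecord><RecordURL>http://www.blah</RecordURL>"
    "<PageContent><![CDATA[line one\n",
    "line two\n",
    "]]></PageContent></ContentRecord>\n",
    "<ContentRecord><RecordURL>http://www.blah</RecordURL>"
    "<PageContent><![CDATA[x]]></PageContent></ContentRecord>\n",
]
records = list(iter_records(sample))
print(len(records))
```

In an actual streaming mapper the same generator would be fed sys.stdin instead of the sample list. This only papers over the symptom on the mapper side, so I'd still like to know whether there is a cleaner fix on the Hadoop side.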