I can't think of any advantage to storing XML inside Accumulo. I am
interested to learn some details about your view. Storing the
extracted information and the location of the HDFS file that sourced
the information does make sense to me. In fact, it might be useful to
store file positions in Accumulo so it's easy to get back to specific
spots in the XML file. That would help, for example, if you had an XML
file with many records in it and no reason to decompose each record
immediately.
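
To make the file-position idea concrete, here is a rough sketch in plain
Python. It is only an illustration under several assumptions: the records
are <foo> elements carrying a hypothetical name="..." attribute, the file
is on a local disk rather than HDFS, and a dict stands in for the Accumulo
table. In a real setup the (offset, length) pair would be written as the
value of a row keyed by the record name, next to the HDFS path. The
function names build_index and read_record are made up for this example.

```python
import re

# Assumed record shape: <foo name="...">...</foo>, not nested.
RECORD = re.compile(rb'<foo\s+name="([^"]+)"[^>]*>.*?</foo>', re.DOTALL)


def build_index(path):
    """Map each record name to the (byte offset, byte length) of its element.

    Reads the whole file into memory, which is fine for a sketch but not
    for a 50 GB file; a real job would scan the file in splits.
    """
    with open(path, "rb") as f:
        data = f.read()
    return {
        m.group(1).decode("utf-8"): (m.start(), m.end() - m.start())
        for m in RECORD.finditer(data)
    }


def read_record(path, offset, length):
    """Seek straight to one record and return its raw bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

With an index like this, a later job can look up a record by name, seek to
its offset in the source file, and parse only that record when it actually
needs it.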

On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[email protected]> wrote:
> There are advantages to using Accumulo to store the contents of your
> XML documents, depending on their structure and what you want to end
> up taking out of them. Are you trying to emulate the document store
> pattern that the Wikipedia example uses?
>
> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[email protected]> wrote:
>> Hi,  I am working with large chunks of XML, anywhere from 1 – 50 GB each.  I 
>> am running several different MapReduce jobs on the XML to pull out various 
>> pieces of data, do analytics, etc.  I am using an XML input type based on 
>> the WikipediaInputFormat from the examples.  What I have been doing is 1) 
>> loading the entire XML into HDFS as a single document 2) parsing the XML on 
>> some tag <foo> and storing each one of these instances as the content of a 
>> new row in Accumulo, using the name of the instance as the row id.  I then 
>> run other MR jobs that scan this table, pull out and parse the XML and do 
>> whatever I need to do with the data.
>>
>> My question is, is there any advantage to storing the XML in Accumulo versus 
>> just leaving it in HDFS and parsing it from there?  Either as a large block 
>> of XML or as individual chunks, perhaps  using Hadoop Archive to handle the 
>> small-file problem?  The actual XML will not be queried in and of itself but 
>> is part of other analysis processes.
>>
>> Thanks,
>> Ralph
>>
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>>
>>
