If your XML documents are really just lists of elements/objects, and you want to run your analytics on subsets of those elements (even across XML documents), then it makes sense to take a document store approach similar to the one the Wikipedia example uses. This lets you index specific portions of elements, create graphs, and apply visibility labels to specific attributes in a given object tree.
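As a rough sketch of the decomposition described above: the snippet below is plain Python with no Accumulo client involved, and it only shows the shape of the entries you might write. The `Entry` tuple, the `name` attribute used as the row id, the `field`/`ref` column families, the `index:` row prefix, and the `labels` mapping are all assumptions made up for illustration; this is not the Wikipedia example's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, NamedTuple


class Entry(NamedTuple):
    """One key/value pair shaped like an Accumulo entry (illustrative only)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    value: str


def decompose(xml_text: str, tag: str, labels: Dict[str, str]) -> Iterator[Entry]:
    """Yield a field entry and an index entry for each child of every <tag> element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        row = elem.get("name")  # assumption: each element has a unique name attribute
        if row is None:
            continue
        for child in elem:
            value = (child.text or "").strip()
            vis = labels.get(child.tag, "")  # hypothetical per-field visibility label
            # Field entry: one attribute of the element, stored under the element's row.
            yield Entry(row, "field", child.tag, vis, value)
            # Index entry: lets a scan locate rows by field value.
            yield Entry(f"index:{child.tag}:{value}", "ref", row, vis, "")


SAMPLE = """<docs>
  <foo name="a1"><title>first</title><owner>alice</owner></foo>
  <foo name="b2"><title>second</title><owner>bob</owner></foo>
</docs>"""

# Sorting the tuples roughly mimics the way a sorted key/value store orders keys.
entries = sorted(decompose(SAMPLE, "foo", {"owner": "private"}))
for e in entries:
    print(e)
```

With a layout along these lines, a job that only cares about one field can scan the matching index rows instead of re-parsing every element, and the visibility label travels with each individual attribute rather than with the whole document.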

On Wed, Jun 6, 2012 at 10:06 PM, David Medinets <[email protected]> wrote:
> I can't think of any advantage to storing XML inside Accumulo. I am
> interested to learn some details about your view. Storing the
> extracted information and the location of the HDFS file that sourced
> the information does make sense to me. In fact, it might be useful to
> store file positions in Accumulo so it's easy to get back to specific
> spots in the XML file. For example, if you had an XML file with many
> records in it and there was no reason to immediately decompose each
> record.
>
> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[email protected]> wrote:
>> There are advantages to using Accumulo to store the contents of your
>> XML documents, depending on their structure and what you want to end
>> up taking out of them. Are you trying to emulate the document store
>> pattern that the Wikipedia example uses?
>>
>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[email protected]> wrote:
>>> Hi, I am working with large chunks of XML, anywhere from 1 – 50 GB each.
>>> I am running several different MapReduce jobs on the XML to pull out
>>> various pieces of data, do analytics, etc. I am using an XML input type
>>> based on the WikipediaInputFormat from the examples. What I have been
>>> doing is 1) loading the entire XML into HDFS as a single document 2)
>>> parsing the XML on some tag <foo> and storing each one of these instances
>>> as the content of a new row in Accumulo, using the name of the instance as
>>> the row id. I then run other MR jobs that scan this table, pull out and
>>> parse the XML and do whatever I need to do with the data.
>>>
>>> My question is, is there any advantage to storing the XML in Accumulo
>>> versus just leaving it in HDFS and parsing it from there? Either as a
>>> large block of XML or as individual chunks, perhaps using Hadoop Archive
>>> to handle the small-file problem? The actual XML will not be queried in
>>> and of itself but is part of other analysis processes.
>>>
>>> Thanks,
>>> Ralph
>>>
>>> __________________________________________________
>>> Ralph Perko
>>> Pacific Northwest National Laboratory
