Hi, my data is available in XML; it looks something like this:
<data>
  <doc>
    <title> ... </title>
    <abstract> ... </abstract>
    <keyword> ... </keyword>
    ...
    <keyword> ... </keyword>
    <keyword> ... </keyword>
  </doc>
  <doc> ... </doc>
</data>

I looked into the Wikipedia example and I have a few questions:

1. The first step of the example is to chunk the data into pieces. Is this necessary, given that my data is already in pieces? Each XML file contains ~1000 documents, and I want to use ~250 XML files in a first test. Could I just put the existing XML files into an HDFS folder in Hadoop?

2. The second step uses the wikipediaDataSetCreator on the chunk files (chunk-****.xml). I found the WikipediaDataSetCreatorDriver, -Mapper and -Reducer. Can someone explain how they work? For example, I don't understand how the label (the category "country") is defined. In my case there would also be more than one label per document.

3. In the third step the classifier is trained; here I would use the ComplementaryBayes. When I test the classifier, I would also need all possible candidates (not only the top one). How can I list all possible candidates with their weights? I only found a way to list the top candidates; did I miss something?

Overall it should be the same as the Wikipedia example, only with more labels (XML + text + multiple possible categories).

Thanks and regards,
David
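For what it's worth, here is how I currently picture the pre-processing step, as a minimal sketch in Python (this is not Mahout code; the element names are taken from my sample above, and the idea that title + abstract form the text while every <keyword> becomes a label is my own assumption):

```python
import xml.etree.ElementTree as ET

def parse_docs(xml_string):
    """Parse a <data> file into (text, labels) pairs.

    Each <doc> contributes its title plus abstract as the text and
    all of its <keyword> elements as labels (multi-label).
    """
    root = ET.fromstring(xml_string)
    records = []
    for doc in root.findall("doc"):
        title = doc.findtext("title", default="").strip()
        abstract = doc.findtext("abstract", default="").strip()
        labels = [k.text.strip() for k in doc.findall("keyword") if k.text]
        records.append((title + " " + abstract, labels))
    return records

# Invented sample data in the shape described above.
sample = """<data>
  <doc>
    <title>Some title</title>
    <abstract>Some abstract text.</abstract>
    <keyword>germany</keyword>
    <keyword>france</keyword>
  </doc>
</data>"""

for text, labels in parse_docs(sample):
    print(text, labels)
```

So the open question for me is whether the WikipediaDataSetCreator can be pointed at this structure directly, or whether I should emit one training record per (text, label) pair myself.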
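On question 3, what I am after is essentially the following (toy Python, not the Mahout API; the label names and log scores are made up here, where in practice they would come from the trained model): given one score per candidate label, list every candidate sorted by weight rather than only the argmax.

```python
import math

# Hypothetical per-label log scores; with Mahout these would come
# from the trained complementary Bayes model, not be hard-coded.
log_scores = {"germany": -12.4, "france": -13.1, "spain": -15.8}

# Rank ALL candidates by score instead of keeping only the top one.
ranked = sorted(log_scores.items(), key=lambda kv: kv[1], reverse=True)

# Optionally normalise the scores to pseudo-probabilities for readability.
z = sum(math.exp(s) for s in log_scores.values())
for label, score in ranked:
    print(f"{label}\tlog-score={score}\tp={math.exp(score) / z:.3f}")
```

Is there an option (or a small code change) that makes the classifier return this full ranked list instead of just the best category?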