Hi, my data is available in XML; it looks something like this:
<data>
  <doc>
    <title> ... </title>
    <abstract> ... </abstract>
    <keyword> ... </keyword>
    ...
    <keyword> ... </keyword>
    <keyword> ... </keyword>
  </doc>
  <doc> ... </doc>
</data>

I looked into the Wikipedia example and I have a few questions:

1. The first step of the example is to chunk the data into pieces. Is this necessary, given that my data is already in pieces? Each XML file contains ~1000 documents, and I want to use ~250 XML files in a first test. Could I just put the existing XML files into an HDFS folder in Hadoop?

2. The second step uses the wikipediaDataSetCreator on the chunk files (chunk-****.xml). I found the WikipediaDataSetCreatorDriver, -Mapper and -Reducer. Can someone explain how they work? For example, I don't understand how the label (the category "country") is defined. In my case there would also be more than one label per document.

3. In the third step the classifier is trained; here I would use the ComplementaryBayes. When I test the classifier, I would also need all possible candidates (not only the top one). How can I list all possible candidates with their weights? I only found a way to list the top candidates; did I miss something?

Overall it should be the same as the Wikipedia example, only with more labels (XML + text + multiple possible categories).

Thanks and regards,
David
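For what it's worth, here is how I currently picture the pre-processing step, as a minimal sketch in Python (this is not Mahout code; the element names are taken from my sample above, and the idea that title + abstract form the text while every <keyword> becomes a label is my own assumption):

```python
import xml.etree.ElementTree as ET

def parse_docs(xml_string):
    """Parse a <data> file into (text, labels) pairs.

    Each <doc> contributes its title plus abstract as the text and
    all of its <keyword> elements as labels (multi-label).
    """
    root = ET.fromstring(xml_string)
    records = []
    for doc in root.findall("doc"):
        title = doc.findtext("title", default="").strip()
        abstract = doc.findtext("abstract", default="").strip()
        labels = [k.text.strip() for k in doc.findall("keyword") if k.text]
        records.append((title + " " + abstract, labels))
    return records

# Invented sample data in the shape described above.
sample = """<data>
  <doc>
    <title>Some title</title>
    <abstract>Some abstract text.</abstract>
    <keyword>germany</keyword>
    <keyword>france</keyword>
  </doc>
</data>"""

for text, labels in parse_docs(sample):
    print(text, labels)
```

So the open question for me is whether the WikipediaDataSetCreator can be pointed at this structure directly, or whether I should emit one training record per (text, label) pair myself.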
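On question 3, what I am after is essentially the following (toy Python, not the Mahout API; the label names and log scores are made up here, where in practice they would come from the trained model): given one score per candidate label, list every candidate sorted by weight rather than only the argmax.

```python
import math

# Hypothetical per-label log scores; with Mahout these would come
# from the trained complementary Bayes model, not be hard-coded.
log_scores = {"germany": -12.4, "france": -13.1, "spain": -15.8}

# Rank ALL candidates by score instead of keeping only the top one.
ranked = sorted(log_scores.items(), key=lambda kv: kv[1], reverse=True)

# Optionally normalise the scores to pseudo-probabilities for readability.
z = sum(math.exp(s) for s in log_scores.values())
for label, score in ranked:
    print(f"{label}\tlog-score={score}\tp={math.exp(score) / z:.3f}")
```

Is there an option (or a small code change) that makes the classifier return this full ranked list instead of just the best category?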