[Nutch-dev] [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

Armel Nene (JIRA) Sat, 25 Nov 2006 05:51:30 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12452579 ] 
            
Armel Nene commented on NUTCH-185:
----------------------------------


Hi, I have run the parser and it works fine. Now, instead of the parser 
storing everything under a default field (i.e. <xmlcontent>), I want it to create 
index fields based on the elements in the XML document. The reason is that I 
will be parsing large XML documents that have no XPath configured for them: they 
are generated from database tables and therefore have no XPath to validate 
against. Is this possible to implement in parse-xml?

In the populateField() method of the XMLParser class, the fields are checked 
against the ones in the properties map. To work around the issue, I tried to 
generate another XMLIndexerProperties object and set its fields to the list 
of fields that I want. The logic was:

1. If the properties map contains "default", create a new XMLIndexerProperties 
object
2. Create a collection to hold the new XML fields (i.e. Collection xmlFields = 
new ArrayList())
3. Loop over the elements in the XML document
4. Create a new XMLField object for each element (i.e. 
XMLField.setName(Element.getName())) 
5. Add the XMLField object to the new collection
6. Set the fields of the new XMLIndexerProperties object to the newly created 
collection of fields using the XMLIndexerProperties setXMLFields(Collection 
fields) method
7. Pass that object to the extractDataFromFields method
8. Finally, return the populated field collection.
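
Steps 2 to 5 above can be sketched without the plugin classes. This is only a 
minimal illustration, assuming the JDK's built-in DOM parser instead of JDOM; 
FieldDef is a hypothetical stand-in for XMLField, and the generated "//name" 
XPath is an assumption about what each field would need, not something taken 
from the parse-xml code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementFieldSketch {

    /** Hypothetical stand-in for the plugin's XMLField; only name and xpath are modelled. */
    static final class FieldDef {
        final String name;
        final String xpath;

        FieldDef(String name, String xpath) {
            this.name = name;
            this.xpath = xpath;
        }
    }

    /** Builds one field definition per distinct child element of the document root. */
    static List<FieldDef> fieldsFromElements(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<FieldDef> fields = new ArrayList<>();
            List<String> seen = new ArrayList<>();
            // Walk the children of the root element, not of the document itself.
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n instanceof Element) {
                    String name = ((Element) n).getTagName();
                    if (!seen.contains(name)) {
                        seen.add(name);
                        // Assumed XPath: match this element anywhere in the document.
                        fields.add(new FieldDef(name, "//" + name));
                    }
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<row><id>7</id><title>Nutch</title><title>again</title></row>";
        for (FieldDef f : fieldsFromElements(xml)) {
            System.out.println(f.name + " -> " + f.xpath);
        }
    }
}
```

Repeated element names are collapsed so that each name yields a single field 
definition; whether the real plugin would want that is an open question.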

When I run the code, the method parses XML files with a valid XPath, but when 
parsing an XML document with no XPath, the program throws a ClassCastException: 
java.lang.String
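
One plausible cause, which I have not confirmed against the plugin source, is 
that a raw Collection ended up holding String element names where the consumer 
casts every entry to XMLField. The sketch below reproduces that kind of failure 
in isolation; FieldDef and both method names are hypothetical stand-ins, not 
parse-xml classes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;

public class RawCollectionCastSketch {

    /** Hypothetical stand-in for the plugin's XMLField. */
    static final class FieldDef {
        final String name;

        FieldDef(String name) {
            this.name = name;
        }
    }

    /** Mimics a consumer that casts every entry of a raw collection to the field type. */
    static String firstFieldName(Collection fields) {
        Iterator it = fields.iterator();
        FieldDef f = (FieldDef) it.next(); // fails at runtime if the entry is a String
        return f.name;
    }

    /** Returns the simple name of the exception raised, or null if the cast succeeded. */
    static String tryRead(Collection fields) {
        try {
            firstFieldName(fields);
            return null;
        } catch (ClassCastException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Collection wrong = new ArrayList();
        wrong.add("title");               // element name stored as a plain String
        Collection right = new ArrayList();
        right.add(new FieldDef("title")); // element name wrapped in a field object
        System.out.println(tryRead(wrong));
        System.out.println(tryRead(right));
    }
}
```

Because the collection is raw, the compiler only warns here; the mismatch 
shows up when the entry is cast.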

I then modified the code again to make sure the XMLField objects are actually 
created. This time, when I run the application, the document is parsed 
with no errors, but the default field <xmlcontent> is the one being stored in 
the index and not the elements from the XML document. The reason I 
decided to create an XMLField object before storing it in a collection 
is that the extractDataFromElement method looks for an object of that type 
when iterating over the elements. Anyway, below is the logic I implemented, which 
doesn't differ much from the initial implementation:



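if (xmlIndexersProperties.containsKey("default")) {

    // Raw collection that will hold one XMLField per element
    Collection fields = new ArrayList();
    // Note: getContent() on a JDOM Document returns only the top-level
    // content (the root element plus any comments or processing
    // instructions), not the elements nested inside the root
    List docx = ((org.jdom.Document) xml).getContent();
    Iterator children = docx.listIterator();
    while (children.hasNext()) {
        Object o = children.next();
        if (o instanceof Element) {
            XMLField xmlfield = new XMLField();
            Element el = (Element) o;
            xmlfield.setFieldName(el.getName());
            //xmlfield.setFieldType(el.)
            fields.add(xmlfield); // step 5 of the list above
        }
    }
}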

I hope someone can help with this and let me know how to implement it in 
a better way.

Armel

> XMLParser is configurable xml parser plugin.
> --------------------------------------------
>
>                 Key: NUTCH-185
>                 URL: http://issues.apache.org/jira/browse/NUTCH-185
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, indexer
>    Affects Versions: 0.7.2, 0.8.1, 0.8
>         Environment: OS Independent
>            Reporter: Rida Benjelloun
>         Assigned To: Chris A. Mattmann
>         Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
>
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to do the 
> mapping between the XML elements and Lucene fields. 
> Information:
> 1- Copy "xmlparser-conf.xml" to the nutch/conf dir
> 2- To index your custom XML file, you have to modify 
> "xmlparser-conf.xml". 
> This parser uses namespaces and XPath to parse XML content.
> The config file does the mapping between the XML nodes (using XPath) and Lucene 
> fields. 
> Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
> 3- The xmlIndexerProperties element encapsulates a set of fields associated 
> with a namespace. 
> If the namespace is found in the XML document, the fields represented by the 
> namespace will be indexed.
> Example : 
> <xmlIndexerProperties type="filePerDocument" 
> namespace="http://purl.org/dc/elements/1.1/">
>   <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
>   <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" /> 
> </xmlIndexerProperties>
> 4- It is possible to define a default namespace that will be applied when the 
> parser 
> does not find any namespace in the document or when the namespace found in the 
> XML document doesn't match the namespace defined in the 
> xmlIndexerProperties. 
> Example :
> <xmlIndexerProperties type="filePerDocument" namespace="default">
>   <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" /> 
> </xmlIndexerProperties>
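
To make the two quoted examples concrete, here is a small sketch of how a 
namespace-qualified XPath such as //dc:title behaves compared with the default 
//* expression. It assumes the JDK's javax.xml.xpath API rather than whatever 
XPath library the plugin uses internally, and the class and method names are 
hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceXPathSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Evaluates an XPath expression as a string, with the "dc" prefix bound to Dublin Core. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }

                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }

                public Iterator<String> getPrefixes(String uri) {
                    String p = getPrefix(uri);
                    return (p == null
                            ? Collections.<String>emptyList()
                            : Collections.singletonList(p)).iterator();
                }
            });
            // String evaluation returns the text of the first matching node, or "" if none.
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String dcXml = "<record xmlns:dc=\"" + DC_NS + "\"><dc:title>Nutch</dc:title></record>";
        String plainXml = "<row><title>Nutch</title></row>";
        System.out.println("[" + evaluate(dcXml, "//dc:title") + "]");
        System.out.println("[" + evaluate(plainXml, "//dc:title") + "]");
        System.out.println("[" + evaluate(plainXml, "//*") + "]");
    }
}
```

A document without the Dublin Core namespace yields nothing for //dc:title, 
which is the case the default xmlIndexerProperties block with xpath="//*" is 
meant to catch.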

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
