You can write a simple parse filter plugin for this. With NodeWalker you can
walk every node of the DOM and read the alt attribute of each img tag.

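        NodeWalker walker = new NodeWalker(doc);
        // Visit every node of the DOM in turn.
        while (walker.hasNext()) {
          Node currentNode = walker.nextNode();
          if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
            if ("img".equalsIgnoreCase(currentNode.getNodeName())) {
              HashMap<String,String> atts = getAttributes(currentNode);
              // Keys are lower-cased by getAttributes(), so "alt" also matches ALT.
              String alt = atts.get("alt");
            }
          }
        }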
 
  // Copies all attributes of a node into a map keyed by lower-cased name.
  protected HashMap<String,String> getAttributes(Node node) {
    HashMap<String,String> attribMap = new HashMap<String,String>();

    NamedNodeMap attributes = node.getAttributes();

    for(int i = 0 ; i < attributes.getLength(); i++) {
      Attr attribute = (Attr)attributes.item(i);
      attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
    }

    return attribMap;
  }
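
To try the idea outside of Nutch, here is a minimal standalone sketch. It
assumes well-formed XHTML input and uses only the JDK: a plain recursive walk
stands in for Nutch's NodeWalker, and the class name AltCollector and the
method collectAlts are made up for this illustration. In a real parse filter
you would work on the DOM that Nutch passes to the plugin instead of parsing a
string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects alt texts from img elements.
class AltCollector {

  // Parses a well-formed XHTML string and returns the alt values in document order.
  static List<String> collectAlts(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      List<String> alts = new ArrayList<>();
      walk(doc.getDocumentElement(), alts);
      return alts;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input", e);
    }
  }

  // Depth-first walk; a stand-in for iterating with Nutch's NodeWalker.
  private static void walk(Node node, List<String> alts) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      String alt = getAttributes(node).get("alt");
      if (alt != null) {
        alts.add(alt);
      }
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      walk(child, alts);
    }
  }

  // Same idea as getAttributes() above: attribute names are lower-cased.
  static Map<String, String> getAttributes(Node node) {
    Map<String, String> attribMap = new HashMap<>();
    NamedNodeMap attributes = node.getAttributes();
    if (attributes == null) {
      return attribMap; // non-element nodes have no attributes
    }
    for (int i = 0; i < attributes.getLength(); i++) {
      Attr attribute = (Attr) attributes.item(i);
      attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
    }
    return attribMap;
  }

  public static void main(String[] args) {
    String page = "<html><body><img src=\"a.png\" alt=\"A cat\"/></body></html>";
    System.out.println(collectAlts(page));
  }
}
```

Images without an alt attribute are simply skipped here; whether you would
rather index an empty string for them depends on your schema.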

-----Original message-----
> From:Alexandre <alex.hura...@gmail.com>
> Sent: Mon 01-Oct-2012 15:05
> To: user@nutch.apache.org
> Subject: Re: Parsing/Indexing alt tag
> 
> Hi Patrick,
> 
> I have the same problem.
> Did you find a way to parse the alt attributes without rewriting a complete
> parse plugin?
> 
> Alex.