[ 
https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-7027:
--------------------------------
    Affects Version/s: 5.0

> ExtractingRequestHandler indiscriminantly dumps all source HTML attributes 
> into the catch-all field when captureAttr=false, but it should be more 
> selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.0
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 5.1
>
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source 
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, 
> Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contains lots of unwanted cruft: {{class}} and {{style}} tags, 
> etc.
> It would be much better if only attribute values containing addresses or 
> tooltip text, etc. were dumped into the catch-all field.  Here are a couple 
> of places where this kind of attribute are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to