[ https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated SOLR-7027:
--------------------------------
    Affects Version/s: 5.0

> ExtractingRequestHandler indiscriminantly dumps all source HTML attributes
> into the catch-all field when captureAttr=false, but it should be more
> selective, something like only href, title, alt, etc. attributes
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.0
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 5.1
>
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: {{class}} and {{style}}
> attribute values, etc.
> It would be much better if only attribute values that carry addresses,
> tooltip text, and similar content were dumped into the catch-all field.
> Here are a couple of places where these kinds of attributes are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}
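
One possible shape for the selective behavior requested above is a small whitelist check applied before an attribute value is appended to the catch-all builder. This is only a sketch under stated assumptions: the class CatchAllAttributeFilter and its methods are hypothetical and are not part of Solr or Tika, and the attribute set is simply Tika's URI_ATTRIBUTES combined with the title and alt attributes named in the issue summary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;

/** Hypothetical helper for illustration only; not part of Solr or Tika. */
final class CatchAllAttributeFilter {

  // Assumed whitelist: Tika's URI attributes plus "title" and "alt",
  // which the issue summary names as worth keeping.
  private static final Set<String> INDEXABLE_ATTRIBUTES =
      new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

  private CatchAllAttributeFilter() {}

  /** Returns true if the attribute name is on the whitelist (case-insensitive). */
  static boolean isIndexable(String attributeName) {
    return attributeName != null
        && INDEXABLE_ATTRIBUTES.contains(attributeName.toLowerCase(Locale.ROOT));
  }

  /** Appends only whitelisted attribute values, each preceded by a space. */
  static void appendIndexable(Attributes attributes, StringBuilder target) {
    for (int i = 0; i < attributes.getLength(); i++) {
      String name = attributes.getLocalName(i);
      // The local name can be empty when namespace processing is off,
      // so fall back to the qualified name.
      if (name == null || name.isEmpty()) {
        name = attributes.getQName(i);
      }
      if (isIndexable(name)) {
        target.append(' ').append(attributes.getValue(i));
      }
    }
  }
}
```

Under these assumptions, the else branch on lines 282-284 could call CatchAllAttributeFilter.appendIndexable(attributes, bldrStack.getLast()) in place of the unconditional loop, so class and style values would no longer reach the catch-all field. Whether the whitelist should be fixed or configurable is left open here.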