[ https://issues.apache.org/jira/browse/TIKA-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-1102:
------------------------------------

    Component/s: parser

> Can we add <div> to the list of heuristics for bad html fragments?
> ------------------------------------------------------------------
>
>                 Key: TIKA-1102
>                 URL: https://issues.apache.org/jira/browse/TIKA-1102
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: I'm using Solr 4.0 final with tika v1.2 and ManifoldCF
> v1.2dev all on tomcat 7.0.37
>            Reporter: David Morana
>
> Good morning,
> Crawling legacy sites with poorly written html fragments causes severe Solr
> Xml parse errors and in turn causes ManifoldCF to abort.
> Can we add <div> to the list of heuristics so the html parser is used instead
> of the xml parser?
> see this ticket for further information:
> [TIKA-1101|https://issues.apache.org/jira/browse/TIKA-1101]
> Thank you,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
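
For readers unfamiliar with the request, the idea of a tag-based heuristic can be sketched roughly as follows. This is only an illustration of the technique, not Tika's actual detection code: the function name `looks_like_html`, the tag set `HTML_HINT_TAGS`, and the 64-byte window are all assumptions made for this example.

```python
import re

# Hypothetical hint list for illustration only; this is NOT Tika's real list.
# "div" is the entry the issue asks to have added.
HTML_HINT_TAGS = {"html", "head", "body", "title", "div"}

# Matches the first element-style opening tag, e.g. b"<div" or b"<HTML".
# Declarations such as b"<?xml" or b"<!DOCTYPE" do not match this pattern.
_FIRST_TAG = re.compile(rb"<([A-Za-z][A-Za-z0-9]*)")


def looks_like_html(data: bytes, window: int = 64) -> bool:
    """Guess whether a fragment should go to an HTML parser.

    Only the first `window` bytes are inspected (an assumed limit), and the
    decision is based solely on the name of the first opening tag found there.
    """
    match = _FIRST_TAG.search(data[:window])
    if match is None:
        return False
    return match.group(1).decode("ascii").lower() in HTML_HINT_TAGS
```

With `div` in the hint list, a legacy fragment such as `<div class='x'>text<br>more` would be routed to a lenient HTML parser, while a document whose first element is not in the list would still be treated as XML.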