[ https://issues.apache.org/jira/browse/LUCENE-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Sprochi updated LUCENE-1704: ---------------------------------- Description: Parsing HTML documents using the org.apache.lucene.ant.HtmlDocument.Document method resulted in many error messages such as this: line 152 column 725 - Error: <as-html> is not recognized! This document has errors that must be fixed before using HTML Tidy to generate a tidied up version. The solution is to configure Tidy to accept these abnormal tags by adding the tag name to the "new-inline-tags" option in the Tidy config file (or the command line which does not make sense in this context), like so: new-inline-tags: as-html Tidy needs to know where the configuration file is, so a new constructor and Document method can be added. Here is the code: {code} /** * Constructs an <code>HtmlDocument</code> from a {...@link * java.io.File}. * *...@param file the <code>File</code> containing the * HTML to parse *...@param tidyConfigFile the <code>String</code> containing * the full path to the Tidy config file *...@exception IOException if an I/O exception occurs */ public HtmlDocument(File file, String tidyConfigFile) throws IOException { Tidy tidy = new Tidy(); tidy.setConfigurationFromFile(tidyConfigFile); tidy.setQuiet(true); tidy.setShowWarnings(false); org.w3c.dom.Document root = tidy.parseDOM(new FileInputStream(file), null); rawDoc = root.getDocumentElement(); } /** * Creates a Lucene <code>Document</code> from a {...@link * java.io.File}. * *...@param file *...@param tidyConfigFile the full path to the Tidy config file *...@exception IOException */ public static org.apache.lucene.document.Document Document(File file, String tidyConfigFile) throws IOException { HtmlDocument htmlDoc = new HtmlDocument(file, tidyConfigFile); org.apache.lucene.document.Document luceneDoc = new org.apache.lucene.document.Document(); luceneDoc.add(new Field("title", htmlDoc.getTitle(), Field.Store.YES, Field.Index.ANALYZED)); luceneDoc.add(new Field("contents", htmlDoc.getBody(), Field.Store.YES, Field.Index.ANALYZED)); String contents = null; BufferedReader br = new BufferedReader(new FileReader(file)); StringWriter sw = new StringWriter(); String line = br.readLine(); while (line != null) { sw.write(line); line = br.readLine(); } br.close(); contents = sw.toString(); sw.close(); luceneDoc.add(new Field("rawcontents", contents, Field.Store.YES, Field.Index.NO)); return luceneDoc; } {code} I am using this now and it is working fine. The configuration file is being passed to Tidy and now I am able to index thousands of HTML pages with no more Tidy tag errors. was: Parsing HTML documents using the org.apache.lucene.ant.HtmlDocument.Document method resulted in many error messages such as this: line 152 column 725 - Error: <as-html> is not recognized! This document has errors that must be fixed before using HTML Tidy to generate a tidied up version. The solution is to configure Tidy to accept these abnormal tags by adding the tag name to the "new-inline-tags" option in the Tidy config file (or the command line which does not make sense in this context), like so: new-inline-tags: as-html Tidy needs to know where the configuration file is, so a new constructor and Document method can be added. Here is the code: /** * Constructs an <code>HtmlDocument</code> from a {...@link * java.io.File}. * *...@param file the <code>File</code> containing the * HTML to parse *...@param tidyConfigFile the <code>String</code> containing * the full path to the Tidy config file *...@exception IOException if an I/O exception occurs */ public HtmlDocument(File file, String tidyConfigFile) throws IOException { Tidy tidy = new Tidy(); tidy.setConfigurationFromFile(tidyConfigFile); tidy.setQuiet(true); tidy.setShowWarnings(false); org.w3c.dom.Document root = tidy.parseDOM(new FileInputStream(file), null); rawDoc = root.getDocumentElement(); } /** * Creates a Lucene <code>Document</code> from a {...@link * java.io.File}. * *...@param file *...@param tidyConfigFile the full path to the Tidy config file *...@exception IOException */ public static org.apache.lucene.document.Document Document(File file, String tidyConfigFile) throws IOException { HtmlDocument htmlDoc = new HtmlDocument(file, tidyConfigFile); org.apache.lucene.document.Document luceneDoc = new org.apache.lucene.document.Document(); luceneDoc.add(new Field("title", htmlDoc.getTitle(), Field.Store.YES, Field.Index.ANALYZED)); luceneDoc.add(new Field("contents", htmlDoc.getBody(), Field.Store.YES, Field.Index.ANALYZED)); String contents = null; BufferedReader br = new BufferedReader(new FileReader(file)); StringWriter sw = new StringWriter(); String line = br.readLine(); while (line != null) { sw.write(line); line = br.readLine(); } br.close(); contents = sw.toString(); sw.close(); luceneDoc.add(new Field("rawcontents", contents, Field.Store.YES, Field.Index.NO)); return luceneDoc; } I am using this now and it is working fine. The configuration file is being passed to Tidy and now I am able to index thousands of HTML pages with no more Tidy tag errors. Much better, thanks. I guess I should have RTFM. > org.apache.lucene.ant.HtmlDocument added Tidy config file passthrough > availability > ---------------------------------------------------------------------------------- > > Key: LUCENE-1704 > URL: https://issues.apache.org/jira/browse/LUCENE-1704 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* > Affects Versions: 2.4.1 > Reporter: Keith Sprochi > Priority: Trivial > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Parsing HTML documents using the org.apache.lucene.ant.HtmlDocument.Document > method resulted in many error messages such as this: > line 152 column 725 - Error: <as-html> is not recognized! > This document has errors that must be fixed before > using HTML Tidy to generate a tidied up version. > The solution is to configure Tidy to accept these abnormal tags by adding the > tag name to the "new-inline-tags" option in the Tidy config file (or the > command line which does not make sense in this context), like so: > new-inline-tags: as-html > Tidy needs to know where the configuration file is, so a new constructor and > Document method can be added. Here is the code: > {code} > /** > > > * Constructs an <code>HtmlDocument</code> from a {...@link > > > * java.io.File}. > > > * > > > *...@param file the <code>File</code> containing the > > > * HTML to parse > > > *...@param tidyConfigFile the <code>String</code> containing > > > * the full path to the Tidy config file > > > *...@exception IOException if an I/O exception occurs > > > */ > public HtmlDocument(File file, String tidyConfigFile) throws IOException { > Tidy tidy = new Tidy(); > tidy.setConfigurationFromFile(tidyConfigFile); > tidy.setQuiet(true); > tidy.setShowWarnings(false); > org.w3c.dom.Document root = > tidy.parseDOM(new FileInputStream(file), null); > rawDoc = root.getDocumentElement(); > } > /** > > > * Creates a Lucene <code>Document</code> from a {...@link > > > * java.io.File}. > > > * > > > *...@param file > > > *...@param tidyConfigFile the full path to the Tidy config file > > > *...@exception IOException > > > */ > public static org.apache.lucene.document.Document > Document(File file, String tidyConfigFile) throws IOException { > HtmlDocument htmlDoc = new HtmlDocument(file, tidyConfigFile); > org.apache.lucene.document.Document luceneDoc = new > org.apache.lucene.document.Document(); > luceneDoc.add(new Field("title", htmlDoc.getTitle(), Field.Store.YES, > Field.Index.ANALYZED)); > luceneDoc.add(new Field("contents", htmlDoc.getBody(), > Field.Store.YES, Field.Index.ANALYZED)); > String contents = null; > BufferedReader br = > new BufferedReader(new FileReader(file)); > StringWriter sw = new StringWriter(); > String line = br.readLine(); > while (line != null) { > sw.write(line); > line = br.readLine(); > } > br.close(); > contents = sw.toString(); > sw.close(); > luceneDoc.add(new Field("rawcontents", contents, Field.Store.YES, > Field.Index.NO)); > return luceneDoc; > } > {code} > I am using this now and it is working fine. The configuration file is being > passed to Tidy and now I am able to index thousands of HTML pages with no > more Tidy tag errors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org