On 2/8/07, Peter W. <[EMAIL PROTECTED]> wrote:
Using a parser to get text out of HTML, XML (including RSS, ATOM) is only
easy if you control the source documents.
HTML pages in the wild are much different, generating exceptions you must
catch and deal with.
Yes, that's why the Solr version isn
http://issues.apache.org/jira/browse/SOLR-42
: Date: Wed, 7 Feb 2007 17:04:54 -0800 (PST)
: From: Joe Tang <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: How to not tokenize HTML tag from input string
:
:
: My task is to index keywords from a document. In my case, the document
: contains HTML tags, which I don't want to index.
:
: For example:
Input Document:
You are welcome
Testing text
Expected Keywords:
keywords:You
keywords:are
keywords:welcome
keywords:Testing
keywords:text
Is there any way I can keep the HTML tags from becoming keywords?
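
For illustration, here is a minimal sketch of one way to get the expected keywords above: strip anything that looks like a tag before the text is tokenized. It uses only the JDK, not any Lucene or Solr API. The class name TagStripper, the method keywords, and the sample markup in main are all assumptions, since the tags of the original input document did not survive in the archive. The regex is deliberately naive and will mishandle scripts, comments, entities, and malformed markup, which is the "pages in the wild" problem mentioned earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
final class TagStripper {
    // Naive tag pattern: everything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the whitespace-separated words left after removing tags.
    static List<String> keywords(String html) {
        // Replace each tag with a space so words on either side stay separate.
        String text = TAG.matcher(html).replaceAll(" ");
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the real input's tags are unknown.
        String input = "<h1>You are welcome</h1>\n<p>Testing text</p>";
        for (String k : keywords(input)) {
            System.out.println("keywords:" + k);
        }
    }
}
```

In an indexing pipeline, the cleaned string (rather than the split list) would be handed to whatever analyzer builds the keywords field, so the tags never reach the tokenizer.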
--
View this message in context:
http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190778.html#a8857789
Sent from the Lucene - Java Users mailing list archive at Nabbl