RE: HTMLParser.getReader returning null

2004-11-12 Thread Vanlerberghe, Luc
If you use the Field.Text(String name, Reader value) version of the
Field.Text constructor, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.
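
The stored versus not-stored distinction described above can be illustrated with a small self-contained sketch. This is a toy model only: ToyDocument, addText, get and matches are hypothetical names made up for illustration and are not Lucene classes or methods. It assumes only what the reply states, namely that a Reader-valued field is tokenized and indexed but not stored, while a String-valued field is also stored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: ToyDocument is a hypothetical class, not part of Lucene.
class ToyDocument {
    private final Map<String, String> stored = new HashMap<>();
    private final Map<String, Set<String>> indexed = new HashMap<>();

    // String-valued field: tokenized, indexed, and stored.
    void addText(String name, String value) {
        index(name, value);
        stored.put(name, value);
    }

    // Reader-valued field: tokenized and indexed, but NOT stored.
    void addText(String name, Reader value) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = value.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        index(name, sb.toString());
    }

    // Splits the text on non-word characters and records the lowercased tokens.
    private void index(String name, String text) {
        Set<String> terms = indexed.computeIfAbsent(name, k -> new HashSet<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    // Returns the stored value, or null if the field was never stored.
    String get(String name) {
        return stored.get(name);
    }

    // True if the field was indexed with the given term.
    boolean matches(String name, String term) {
        Set<String> terms = indexed.get(name);
        return terms != null && terms.contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.addText("contents", new StringReader("Hello Lucene"));
        doc.addText("title", "Hello Lucene");
        System.out.println(doc.get("contents"));               // null
        System.out.println(doc.matches("contents", "lucene")); // true
        System.out.println(doc.get("title"));                  // Hello Lucene
    }
}
```

In this model the "contents" field is searchable but get("contents") returns null, which matches the symptom reported in the original question.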

 -Original Message-
 From: Luke Shannon [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, November 11, 2004 20:17
 To: Lucene Users List
 Subject: HTMLParser.getReader returning null
 
 Hello;
 
 Things were working fine. I have been re-organizing my code 
 to drop into QA when I noticed I was no longer getting search 
 results for my HTML files.
 When I checked things out I confirmed I was still creating 
 the Documents but realized no content was being indexed.
 
  HTMLParser parser = new HTMLParser(f);
 
 // Add the tag-stripped contents as a Reader-valued Text field so it
 // will get tokenized and indexed.
 doc.add(Field.Text("contents", parser.getReader()));
 System.out.println("The content is " + doc.get("contents"));
 
 The SOP line above outputs null where the contents used to be. Has
 anyone seen this before?
 
 Thanks,
 
 Luke
 
 - Original Message -
 From: Will Allen [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 1:59 PM
 Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
 
 
 Any wildcard search will automatically expand your query into one
 clause for each term in the index that matches the wildcard.
 
 For example, wild* would become wild OR wilderness OR wildman, etc.,
 with one clause for each matching term that exists in your index.
 
 Because of this, you quickly reach the default limit of 1024 clauses.
 I set the limit to the maximum int value with the following line:
 
 BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
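
 The expansion described above can be sketched with a toy model. The class PrefixExpansion and its expand method are hypothetical names made up for illustration, not Lucene classes, and the sketch assumes only what the message states: one clause per matching term, and a failure once the clause limit is exceeded. The sorted set stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: these names are hypothetical and are not Lucene classes.
class PrefixExpansion {
    // The default clause limit mentioned in the thread.
    static final int MAX_CLAUSES = 1024;

    // Expands prefix* into one clause per matching term in the term dictionary.
    // Throws once the number of matching terms exceeds maxClauses.
    static List<String> expand(SortedSet<String> termDictionary, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In a sorted set, all terms sharing a prefix are contiguous,
        // starting at the first term >= prefix.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last matching term
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses: limit is " + maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("wild");
        terms.add("wilderness");
        terms.add("wildman");
        terms.add("spectrum");
        // prints "wild OR wilderness OR wildman"
        System.out.println(String.join(" OR ", expand(terms, "wild", MAX_CLAUSES)));

        // 3000 variants of cab*, as in the example from the thread.
        for (int i = 0; i < 3000; i++) {
            terms.add("cab" + i);
        }
        try {
            expand(terms, "cab", MAX_CLAUSES);
        } catch (IllegalStateException e) {
            // prints "too many clauses: limit is 1024"
            System.out.println(e.getMessage());
        }
    }
}
```

 Note that the expansion in this sketch looks only at the term dictionary and never at the other clauses of the query, so it fails for cab* regardless of what the wildcard is combined with.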
 
 
 -Original Message-
 From: Sanyi [mailto:[EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 6:46 AM
 To: [EMAIL PROTECTED]
 Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
 
 
 Hi!
 
 First of all, I've read about BooleanQuery$TooManyClauses, so I know
 that it has a limit of 1024 clauses by default, which is good enough
 for me, but I still think it works strangely.
 
 Example:
 I have an index with about 20 million documents.
 Let's say that there are about 3000 variants of this word mask in the
 entire document set: cab*
 Let's say that about 500 documents contain the word: spectrum
 Now, when I search for cab* AND spectrum, I don't expect it to throw
 an exception.
 It should first restrict the search to the 500 documents containing
 the word spectrum, then it should collect the variants of cab* within
 these documents, which turns out to be two or three variants of cab*
 (cable, cables, maybe some more), and the search should return, let's
 say, 10 documents.
 
 Similar example: When I search for cab* AND nonexistingword, it still
 throws a TooManyClauses exception instead of saying "No results".
 Since there is no nonexistingword in my document set, it doesn't even
 have to start collecting the variations of cab*.
 
 Is there any patch for this issue?
 Thank you for your time!
 
 Sanyi
 (I'm using: lucene 1.4.2)
 
 p.s.: Sorry for re-sending this message; I first sent it by accident
 as a reply to the wrong thread.
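
 The evaluation order Sanyi proposes above (restrict to the documents containing the required word first, then collect the wildcard variants only within those documents) can be sketched as follows. This is a toy model of the proposal, not a description of how Lucene evaluates queries: RestrictedExpansion, expandWithin and the in-memory document map are hypothetical names made up for illustration. A real inverted index maps terms to documents rather than documents to terms, so this loop says nothing about how cheaply the same idea could be implemented in Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: a hypothetical in-memory index, not Lucene's data structures.
class RestrictedExpansion {
    // Collects the terms starting with prefix that occur in at least one
    // document that also contains requiredTerm.
    static SortedSet<String> expandWithin(Map<Integer, Set<String>> docs,
                                          String requiredTerm, String prefix) {
        SortedSet<String> variants = new TreeSet<>();
        for (Set<String> terms : docs.values()) {
            if (!terms.contains(requiredTerm)) {
                continue; // document cannot match the AND query
            }
            for (String term : terms) {
                if (term.startsWith(prefix)) {
                    variants.add(term);
                }
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> docs = new HashMap<>();
        docs.put(1, new HashSet<>(Arrays.asList("spectrum", "cable")));
        docs.put(2, new HashSet<>(Arrays.asList("spectrum", "cables", "analyzer")));
        docs.put(3, new HashSet<>(Arrays.asList("cabin", "cabbage")));
        docs.put(4, new HashSet<>(Arrays.asList("cabinet")));

        // prints "[cable, cables]": only variants that co-occur with "spectrum"
        System.out.println(expandWithin(docs, "spectrum", "cab"));
        // prints "[]": nothing to expand when the required word is absent
        System.out.println(expandWithin(docs, "nonexistingword", "cab"));
    }
}
```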
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: HTMLParser.getReader returning null

2004-11-12 Thread Luke Shannon
Hi;

I am using the HTMLParser that comes with the latest version of Lucene (in
the demo).

Here is the import line:

import org.apache.lucene.demo.html.HTMLParser;

If you have lucene-demos-1.4-final.jar in your class path the system will
find the Parser Class.

I am happy with the results.

Let me know if you need anything else.

L


- Original Message - 
From: sergiu gordea [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, November 12, 2004 3:39 AM
Subject: Re: HTMLParser.getReader returning null


 Luke Shannon wrote:

  Hi,

  May I ask which library you are using to parse HTML pages?
   I need to index HTML pages and I want to use a good parser to
 eliminate the HTML tags.
   Can you recommend a simple parser that has a demo?

  Thanks,

   Sergiu




Re: HTMLParser.getReader returning null

2004-11-12 Thread Luke Shannon
Ah. That would explain it. Thank you Luc.

- Original Message - 
From: Vanlerberghe, Luc [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, November 12, 2004 5:41 AM
Subject: RE: HTMLParser.getReader returning null


If you use the Field.Text(String name, Reader value) version of the
Field.Text constructor, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.
