Re: HTMLParser.getReader returning null

2004-11-12 Thread Luke Shannon
Ah. That would explain it. Thank you Luc.

----- Original Message -----
From: "Vanlerberghe, Luc" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, November 12, 2004 5:41 AM
Subject: RE: HTMLParser.getReader returning null


If you use the Field.Text(String name, Reader value) overload of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.
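
A minimal sketch of one way to get the stored behaviour described above,
assuming the parsed text fits comfortably in memory: drain the Reader into a
String and hand that String to the String overload. The class name and the
readFully helper are illustrative and not part of Lucene; the Lucene calls
appear only in comments so that the snippet compiles without the Lucene jar.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderToString {
    /** Reads everything from the reader and returns it as one String. */
    static String readFully(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // StringReader stands in for parser.getReader() in this sketch.
        String contents = readFully(new StringReader("tag-stripped page text"));
        // With Lucene 1.4 on the classpath, the String overload would be used:
        //   doc.add(Field.Text("contents", contents));
        // so that doc.get("contents") returns the stored text rather than null.
        System.out.println("The content is " + contents);
    }
}
```

The trade-off is index size: storing the full text of every page makes the
index larger, which is why the Reader overload leaves the contents unstored.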

> -----Original Message-----
> From: Luke Shannon [mailto:[EMAIL PROTECTED] 
> Sent: donderdag 11 november 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
> 
> Hello;
> 
> Things were working fine. I have been re-organizing my code 
> to drop into QA when I noticed I was no longer getting search 
> results for my HTML files.
> When I checked things out I confirmed I was still creating 
> the Documents but realized no content was being indexed.
> 
>  HTMLParser parser = new HTMLParser(f);
> 
> // Add the tag-stripped contents as a Reader-valued Text field so it
> // will get tokenized and indexed.
> doc.add(Field.Text("contents", parser.getReader()));
> System.out.println("The content is " + doc.get("contents"));
> 
> The SOP line above outputs a null where the contents used to 
> be. Has anyone seen this before?
> 
> Thanks,
> 
> Luke
> 
> ----- Original Message -----
> From: "Will Allen" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, November 11, 2004 1:59 PM
> Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
> 
> 
> Any wildcard search will automatically expand your query into one
> clause for each term found in the index that matches the wildcard.
> 
> For example:
> 
> wild* would become wild OR wilderness OR wildman etc., with one clause
> for each matching term that exists in your index.
> 
> Because of this, you quickly reach the limit of 1024 clauses. I
> raise it to the maximum int value with the following line:
> 
> BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
> 
> 
> -----Original Message-----
> From: Sanyi [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 11, 2004 6:46 AM
> To: [EMAIL PROTECTED]
> Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> 
> 
> Hi!
> 
> First of all, I've read about BooleanQuery$TooManyClauses, so I know
> that it has a limit of 1024 clauses by default, which is good enough
> for me, but I still think it behaves strangely.
> 
> Example:
> I have an index with about 20 million documents.
> Let's say that there are about 3000 variants in the entire document
> set matching this word mask: cab*
> Let's say that about 500 documents contain the word: spectrum
> Now, when I search for "cab* AND spectrum", I don't expect it to throw
> an exception.
> It should first restrict the search to the 500 documents containing
> the word "spectrum", then it should collect the variants of "cab*"
> within these documents, which turns out to be two or three variants of
> "cab*" (cable, cables, maybe some more), and the search should return,
> let's say, 10 documents.
> 
> Similar example: when I search for "cab* AND nonexistingword" it still
> throws a TooManyClauses exception instead of saying "No results". Since
> there is no "nonexistingword" in my document set, it doesn't even have
> to start collecting the variations of "cab*".
> 
> Is there any patch for this issue?
> Thank you for your time!
> 
> Sanyi
> (I'm using: lucene 1.4.2)
> 
> p.s.: Sorry for re-sending this message, I was first sending it as an
> accidental reply to a wrong thread..
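
The wildcard expansion described in the quoted TooManyClauses thread can be
illustrated with a toy model. This is a sketch under stated assumptions, not
Lucene's implementation: the index's term dictionary is modelled as a sorted
set of strings, and the class name, the expand method and the
IllegalStateException are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixExpansionSketch {
    // The default clause limit mentioned in the thread.
    static final int MAX_CLAUSES = 1024;

    /**
     * Collects every term that starts with the prefix, one clause per term,
     * and fails as soon as the clause limit would be exceeded.
     */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last matching term
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses for " + prefix + "*");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("wild", "wilderness", "wildman", "window"));
        // Expansion looks only at the term dictionary, never at the documents.
        System.out.println(expand(terms, "wild", MAX_CLAUSES));
    }
}
```

In this model the expansion sees only the term dictionary of the whole index
and knows nothing about the other clauses of the query, which would be
consistent with "cab* AND nonexistingword" failing before any document is
matched.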

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: HTMLParser.getReader returning null

2004-11-12 Thread Luke Shannon
Hi;

I am using the HTMLParser that comes with the latest version of Lucene (in
the demo).

Here is the import line:

import org.apache.lucene.demo.html.HTMLParser;

If you have lucene-demos-1.4-final.jar in your class path the system will
find the Parser Class.

I am happy with the results.

Let me know if you need anything else.

L


----- Original Message -----
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, November 12, 2004 3:39 AM
Subject: Re: HTMLParser.getReader returning null


> Luke Shannon wrote:
>
>  Hi,
>
>  May I ask which library you are using for parsing HTML pages?
>  I need to index HTML pages and I want to use a good parser to
>  eliminate the HTML tags.
>  Can you recommend a simple parser that has a demo?
>
>  Thanks,
>
>   Sergiu
>
> >Hello;
> >
> >Things were working fine. I have been re-organizing my code to drop into
> >QA when I noticed I was no longer getting search results for my HTML
> >files. When I checked things out I confirmed I was still creating the
> >Documents but realized no content was being indexed.
> >
> > HTMLParser parser = new HTMLParser(f);
> >
> >// Add the tag-stripped contents as a Reader-valued Text field so it
> >// will get tokenized and indexed.
> >doc.add(Field.Text("contents", parser.getReader()));
> >System.out.println("The content is " + doc.get("contents"));
> >
> >The SOP line above outputs a null where the contents used to be. Has
> >anyone seen this before?
> >
> >Thanks,
> >
> >Luke



RE: HTMLParser.getReader returning null

2004-11-12 Thread Vanlerberghe, Luc
If you use the Field.Text(String name, Reader value) overload of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.

> -----Original Message-----
> From: Luke Shannon [mailto:[EMAIL PROTECTED] 
> Sent: donderdag 11 november 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
> 
> Hello;
> 
> Things were working fine. I have been re-organizing my code 
> to drop into QA when I noticed I was no longer getting search 
> results for my HTML files.
> When I checked things out I confirmed I was still creating 
> the Documents but realized no content was being indexed.
> 
>  HTMLParser parser = new HTMLParser(f);
> 
> // Add the tag-stripped contents as a Reader-valued Text field so it
> // will get tokenized and indexed.
> doc.add(Field.Text("contents", parser.getReader()));
> System.out.println("The content is " + doc.get("contents"));
> 
> The SOP line above outputs a null where the contents used to 
> be. Has anyone seen this before?
> 
> Thanks,
> 
> Luke