Re: HTMLParser.getReader returning null
Ah. That would explain it. Thank you, Luc.

----- Original Message -----
From: "Vanlerberghe, Luc" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, November 12, 2004 5:41 AM
Subject: RE: HTMLParser.getReader returning null

If you use the Field.Text(String name, Reader value) version of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored. This means you will be able to search and find that document, but
to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the document
String itself, so that's probably the origin of the confusion.

> -----Original Message-----
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 11, 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
>
> Hello;
>
> Things were working fine. I have been re-organizing my code to drop into
> QA when I noticed I was no longer getting search results for my HTML
> files. When I checked things out I confirmed I was still creating the
> Documents but realized no content was being indexed.
>
> HTMLParser parser = new HTMLParser(f);
>
> // Add the tag-stripped contents as a Reader-valued Text field so it will
> // get tokenized and indexed.
> doc.add(Field.Text("contents", parser.getReader()));
> System.out.println("The content is " + doc.get("contents"));
>
> The System.out.println line above outputs a null where the contents used
> to be. Has anyone seen this before?
>
> Thanks,
>
> Luke
>
> ----- Original Message -----
> From: "Will Allen" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, November 11, 2004 1:59 PM
> Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
> Any wildcard search will automatically expand your query to all the terms
> it finds in the index that match the wildcard.
>
> For example:
>
> wild* would become wild OR wilderness OR wildman etc., one clause for
> each of the matching terms that exist in your index.
>
> Because of this, you quickly reach the limit of 1024 clauses. I set it to
> max int with the following line:
>
> BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
>
> -----Original Message-----
> From: Sanyi [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 11, 2004 6:46 AM
> To: [EMAIL PROTECTED]
> Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
> Hi!
>
> First of all, I've read about BooleanQuery$TooManyClauses, so I know that
> it has a limit of 1024 clauses by default, which is good enough for me,
> but I still think it behaves strangely.
>
> Example:
> I have an index with about 20 million documents.
> Let's say that there are about 3000 variants in the entire document set
> of this word mask: cab*
> Let's say that about 500 documents contain the word: spectrum
> Now, when I search for "cab* AND spectrum", I don't expect it to throw an
> exception.
> It should first restrict the search to the 500 documents containing the
> word "spectrum", then it should collect the variants of "cab*" within
> these documents, which turn out to be two or three variants of "cab*"
> (cable, cables, maybe some more), and the search should return, let's
> say, 10 documents.
>
> Similar example: when I search for "cab* AND nonexistingword" it still
> throws a TooManyClauses exception instead of saying "No results". Since
> there is no "nonexistingword" in my document set, it doesn't even have to
> start collecting the variations of "cab*".
>
> Is there any patch for this issue?
> Thank you for your time!
>
> Sanyi
> (I'm using: lucene 1.4.2)
>
> p.s.: Sorry for re-sending this message; I first sent it as an accidental
> reply to the wrong thread.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
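The wildcard expansion that Will Allen describes in the quoted reply above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lucene's code: the class PrefixExpansion, the method expand and the use of IllegalStateException are all made up for illustration, and a sorted set of strings stands in for the index's term list. What it tries to show is that the expansion looks only at the terms present in the whole index, so the clause count for "cab*" is the same whatever other terms the query contains, which would be consistent with what Sanyi observed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: none of these names are Lucene classes or methods.
final class PrefixExpansion {
    // The default clause limit quoted in the thread.
    static final int MAX_CLAUSES = 1024;

    private PrefixExpansion() {}

    // Collects every index term that starts with the prefix, one clause per term.
    // Throws once the number of clauses would exceed maxClauses.
    static List<String> expand(SortedSet<String> indexTerms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In sorted order, all terms sharing a prefix are contiguous and
        // begin at the first term that is >= the prefix itself.
        for (String term : indexTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("more than " + maxClauses + " clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("spectrum", "wild", "wilderness", "wildman"));
        System.out.println("wild* expands to " + expand(terms, "wild", MAX_CLAUSES));

        // 3000 synthetic variants of cab*, as in Sanyi's example.
        SortedSet<String> many = new TreeSet<>();
        for (int i = 0; i < 3000; i++) {
            many.add("cab" + i);
        }
        try {
            expand(many, "cab", MAX_CLAUSES);
        } catch (IllegalStateException e) {
            System.out.println("cab* failed: " + e.getMessage());
        }
    }
}
```

In this model, raising the limit (as Will does with Integer.MAX_VALUE) removes the exception but not the cost: every matching term still becomes a clause, so memory and query time grow with the number of variants.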
Re: HTMLParser.getReader returning null
Hi;

I am using the HTMLParser that comes with the latest version of Lucene (in
the demo). Here is the import line:

import org.apache.lucene.demo.html.HTMLParser;

If you have lucene-demos-1.4-final.jar in your class path, the system will
find the parser class. I am happy with the results. Let me know if you need
anything else.

L

----- Original Message -----
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, November 12, 2004 3:39 AM
Subject: Re: HTMLParser.getReader returning null

> Luke Shannon wrote:
>
> Hi,
>
> May I ask which library you are using for parsing HTML pages?
> I need to index HTML pages and I want to use a good parser to eliminate
> the HTML tags.
> Can you recommend a simple parser that has a demo?
>
> Thanks,
>
> Sergiu
>
> >Hello;
> >
> >Things were working fine. I have been re-organizing my code to drop into
> >QA when I noticed I was no longer getting search results for my HTML
> >files. When I checked things out I confirmed I was still creating the
> >Documents but realized no content was being indexed.
> >
> >HTMLParser parser = new HTMLParser(f);
> >
> >// Add the tag-stripped contents as a Reader-valued Text field so it will
> >// get tokenized and indexed.
> >doc.add(Field.Text("contents", parser.getReader()));
> >System.out.println("The content is " + doc.get("contents"));
> >
> >The System.out.println line above outputs a null where the contents used
> >to be. Has anyone seen this before?
> >
> >Thanks,
> >
> >Luke
RE: HTMLParser.getReader returning null
If you use the Field.Text(String name, Reader value) version of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored. This means you will be able to search and find that document, but
to know the original contents you will have to store a copy of it
elsewhere.

The Field.Text(String name, String value) version does store the document
String itself, so that's probably the origin of the confusion.

> -----Original Message-----
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 11, 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
>
> Hello;
>
> Things were working fine. I have been re-organizing my code to drop into
> QA when I noticed I was no longer getting search results for my HTML
> files. When I checked things out I confirmed I was still creating the
> Documents but realized no content was being indexed.
>
> HTMLParser parser = new HTMLParser(f);
>
> // Add the tag-stripped contents as a Reader-valued Text field so it will
> // get tokenized and indexed.
> doc.add(Field.Text("contents", parser.getReader()));
> System.out.println("The content is " + doc.get("contents"));
>
> The System.out.println line above outputs a null where the contents used
> to be. Has anyone seen this before?
>
> Thanks,
>
> Luke
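Following Luc's explanation, one possible way to get a non-null value back from doc.get("contents") is to read the Reader into a String first and pass that String to the Field.Text(String name, String value) overload. The sketch below uses only java.io so that it runs without Lucene on the class path. The class ReaderText and the method readAll are hypothetical helpers, not part of Lucene or its demo, and a StringReader stands in for parser.getReader(). The Lucene call itself appears only as a comment.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Lucene: drains a Reader into a String.
final class ReaderText {
    private ReaderText() {}

    // Reads characters until end of stream and returns them as one String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this StringReader stands in for parser.getReader().
        Reader tagStripped = new StringReader("Hello from the page body");
        String contents = readAll(tagStripped);
        System.out.println("The content is " + contents);

        // With Lucene 1.4 on the class path, the String overload described
        // in the thread could then be used so that the value is stored:
        // doc.add(Field.Text("contents", contents));
    }
}
```

The trade-off is that the whole tag-stripped document is held in memory and stored in the index, which makes the index larger. If only searching is needed, the Reader-valued field from the original post is enough, and the original text can be kept elsewhere as Luc suggests.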