QueryFilter and Memory

2006-07-13 Thread Chun Wei Ho
Hi, I've been trying to adjust the weightings for my searches (thanks Chris for his replies on that thread), and have been using ConstantScoreQuery to even out scores from portions in my query that I want to match but not to contribute to the ranking of that result. I convert a

Re: modify existing non-indexed field

2006-07-13 Thread dan2000
can't access the file: Forbidden Remote Host: [62.172.205.164] You do not have permission to access http://cdoronc.20m.com/tmp/indexingThreads.zip Data files must be stored on the same site they are linked from. Thank you for using 20m.com -- View this message in context:

RE: Searching for a phrase which spans on 2 pages

2006-07-13 Thread Ramesh Salla
Yes, this can be easily done using TokenStream class and hence getting the BestTokens. But of course you have to have this content in the index. DONE Ramesh Reddy On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote: The simplest solution is always the best - when storing the page,

Re: question regarding Field.Index.UN_TOKENZED

2006-07-13 Thread Ramesh Salla
Are you using the StandardAnalyzer at the time of indexing? Which one do you use at the time of querying? Ramesh Reddy On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote: : I'm storing a field in an index with that option : (Field.Index.UN_TOKENIZED). the key to understanding your

accented characters, wildcards and other problems

2006-07-13 Thread Tomi NA
I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in Top ranking terms, Overview tab) I

Re: Can I do Google Suggest Like Search?

2006-07-13 Thread karl wettin
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote: So when I type “L” it will give me search options names which will start from “L”. Then when I will type “Lu” then it should give me options for names which are starting from “Lu”. so on …… Vikas, the Jira now contains code that does
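The prefix behaviour Vikas describes ("L", then "Lu") can be sketched with a sorted set of names. This is only an illustration under stated assumptions: `PrefixSuggester` is a hypothetical class, it is not the code attached to the Jira issue, and it uses no Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not Lucene code and not the Jira patch.
class PrefixSuggester {
    // Terms are kept lowercased and sorted so that all terms sharing a prefix are adjacent.
    private final TreeSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term.toLowerCase(Locale.ROOT));
    }

    // Returns up to 'limit' stored terms that start with 'prefix', in sorted order.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // tailSet(p) begins at the first term >= prefix; stop as soon as the prefix stops matching.
        for (String t : terms.tailSet(p)) {
            if (!t.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggester s = new PrefixSuggester();
        for (String name : new String[] {"Lucene", "Luke", "Lars", "Nutch"}) {
            s.add(name);
        }
        System.out.println(s.suggest("L", 10));
        System.out.println(s.suggest("Lu", 10));
    }
}
```

In a real application the names would presumably come from the index's term list rather than from a hard-coded array.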

Re: Can I do Google Suggest Like Search?

2006-07-13 Thread Mark Miller
Another option is to use Sun's free and soon to be open source Java Studio Creator2. It's a great way to do JSF and provides an AJAX google suggest type component. You can hook this component up to a lucene search and *BOOM*...google suggest. Here is a link to a did you mean tutorial as well (it

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap. If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText

Re: Out of memory error

2006-07-13 Thread Suba Suresh
Thanks. I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion. suba suresh. Rob Staveley (Tom) wrote: If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get

Re: Out of memory error

2006-07-13 Thread Ben Litchfield
By 300MG I assume you mean 300MB. You can also try extracting the text outside of lucene by using a PDFBox command line app. java org.pdfbox.ExtractText pdffile you may need to increase the JRE memory like this java -Xmx512m org.pdfbox.ExtractText pdffile OR java -Xmx1024m
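To check that an -Xmx setting like the ones above actually reached the JVM, a small stand-alone class can print the maximum heap. This is a minimal sketch; `HeapCheck` is a hypothetical name and is not part of PDFBox or Lucene.

```java
// Hypothetical helper for illustration only: reports the heap ceiling the JVM was started with.
class HeapCheck {
    // Runtime.maxMemory() returns the maximum amount of memory the JVM will attempt to use, in bytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it as java -Xmx512m HeapCheck; the reported figure can be somewhat lower than the requested value, depending on the JVM.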

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
Let us know how you get on. There are a lot of people fighting very similar battles on this list. -----Original Message----- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 13 July 2006 15:30 To: java-user@lucene.apache.org Subject: Re: Out of memory error Thanks. I am using the

Re: Out of memory error

2006-07-13 Thread Suba Suresh
Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo). suba suresh. Rob Staveley (Tom) wrote: Let us know how you get on. There are a lot of people fighting very similar battles on this list. -----Original Message----- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 13 July

Re: accented characters, wildcards and other problems

2006-07-13 Thread Otis Gospodnetic
Bok Tomi, What do you mean by terms are misrepresented? What should they be, and what are you seeing? What I'm not clear on is how can I see the problematic *terms* in the list of terms, but not the documents they're stored in? Are you saying that the content got indexed, but the file

Re: modify existing non-indexed field

2006-07-13 Thread Doron Cohen
can't access the file: http://cdoronc.20m.com/tmp/indexingThreads.zip Yes, this Web host sometimes behaves strangely when clicking a link from a mail program. Please try to copy cdoronc.20m.com/tmp to the Web browser (e.g. Firefox) and press Enter. This should show the content of that tmp folder,

lengthnorm again

2006-07-13 Thread Zhao, Xin
Hi, I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under field name mesh, and set the boost accordingly from 20 to 1 (just some arbitrary numbers). But when I checked the index from Luke, the boosts all

Re: lengthnorm again

2006-07-13 Thread Yonik Seeley
On 7/13/06, Zhao, Xin [EMAIL PROTECTED] wrote: Hi, I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under field name mesh, and set the boost accordingly from 20 to 1 (just some arbitrary numbers). But when I

HTMLParser

2006-07-13 Thread Ross Rankin
Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here that has used it successfully can help me out. I have HTML text stored in a database field which I want to add to a Lucene document, but I want to remove the HTML tags, so

Are Search Joins Possible between two Physically separate Indexes?

2006-07-13 Thread Dejan Nenov
Here is a use case I am trying to address. I have two separate indexes, which contain sets of the same document pool/corpus. The two indexes have a different set of indexed fields. One of the indexed fields is an external DocumentID. I would like to perform searches, like a relational join,

Re: Are Search Joins Possible between two Physically separate Indexes?

2006-07-13 Thread Paul Borgermans
Though I'm a newbie (which means I may be completely wrong), I don't think this is possible out of the box. The quickest would be to write a filter which looks up document IDs in the first index and applies this to the second index to get the desired subset to search over. I may need this too,
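The two-step idea in this reply (collect the external DocumentIDs that match in the first index, then restrict the search of the second index to those IDs) can be sketched as follows. Everything here is an assumption made for illustration: plain maps from DocumentID to text stand in for the two indexes, `TwoIndexJoin` and its methods are hypothetical names, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in: each "index" is a map from external DocumentID to indexed text.
class TwoIndexJoin {
    // Step 1: collect the external DocumentIDs of documents matching 'term' in the first index.
    static Set<String> matchingIds(Map<String, String> index, String term) {
        Set<String> ids = new LinkedHashSet<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getValue().contains(term)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Step 2: search the second index, keeping only documents whose ID passed step 1.
    static List<String> filteredSearch(Map<String, String> index, String term, Set<String> allowed) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (allowed.contains(e.getKey()) && e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> first = new TreeMap<>();
        first.put("d1", "apache lucene");
        first.put("d2", "apache solr");
        first.put("d3", "nutch");

        Map<String, String> second = new TreeMap<>();
        second.put("d1", "2006 report");
        second.put("d2", "2005 report");
        second.put("d3", "2006 notes");

        Set<String> ids = matchingIds(first, "apache");
        System.out.println(filteredSearch(second, "2006", ids));
    }
}
```

With real indexes, the ID set from step 1 would presumably be turned into a filter over the second index's DocumentID field, which is the part the reply suggests writing.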

Re: HTMLParser

2006-07-13 Thread Yonik Seeley
I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e It's pretty stand-alone,

file format of index

2006-07-13 Thread Beady Geraghty
As I understand from earlier answers to my question, one can create an index on machine A, and use it (search and merge with other indices) on Machine B. I was reading the file format today. http://lucene.apache.org/java/docs/fileformats.html The index has Byte UInt32 UInt64 in most

Re: file format of index

2006-07-13 Thread Beady Geraghty
I think that I may be misreading the documentation. I didn't see the description of the Long and Int type under the Primitive Types section, while reading about the description of Byte, UInt32, UInt64, VInt. So, for some reason I thought that Long and Int are byte order sensitive. Upon
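For readers following the byte-order question, here is a small sketch of two of the primitive types named in this thread, as the file formats page describes them: UInt32 written as four bytes with the high-order byte first, and VInt written seven bits at a time with the high bit of each byte marking that more bytes follow. `VIntDemo` is a hypothetical class written for this illustration, not Lucene's own code. Because both encodings fix the byte order explicitly, the bytes come out the same on any machine.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of the documented encodings; not Lucene source code.
class VIntDemo {
    // UInt32: four bytes, high-order byte first, regardless of the host CPU's byte order.
    static byte[] writeUInt32(int value) {
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }

    // VInt: low-order seven bits first; the high bit is set on every byte except the last.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reverses writeVInt by accumulating seven bits per byte until the high bit is clear.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(128).length);      // number of bytes used for 128
        System.out.println(readVInt(writeVInt(16384))); // round trip
    }
}
```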