Re: indexing and searching real numbers

2007-04-05 Thread Otis Gospodnetic
Hmmm... I never use range queries, but that "" part looks suspicious. Sorry, can't help more now, maybe somebody else will have the answer for you. Sit tight. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Otis Gospodnetic
John, Look at coord Similarity method. That may help you solve the e.g., "Nissan Altima Sports Package" will be the #1 hit even though there was an exact document matching every term. Otis

Re: multi user multi index?

2007-04-05 Thread Otis Gospodnetic
Simpy.com has a similar setup. You have to be careful about open files, making sure you don't run out of open file descriptors. You'll also want to minimize IndexReader/Searcher/Writer open/close as much as you can. The good side of this setup is that searches go against small indices, you do
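One way to picture the advice above about open file descriptors and minimizing open/close is a small bounded cache of open handles. The following is only a sketch under stated assumptions: `HandleCache`, `maxOpen` and the `opener` function are hypothetical names, not Lucene API, and the handle type is any `Closeable` (in a real setup it would wrap whatever searcher or reader object you keep per user).

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: keeps at most maxOpen handles open, closing the
// least recently used one when a new handle would exceed the limit.
final class HandleCache<V extends Closeable> {
    private final Map<String, V> open;
    private final Function<String, V> opener;

    HandleCache(final int maxOpen, Function<String, V> opener) {
        this.opener = opener;
        // accessOrder=true makes iteration order least-recently-used first.
        this.open = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                try {
                    eldest.getValue().close();   // release file descriptors
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return true;
            }
        };
    }

    // Returns the cached handle for this user, opening it on first use.
    synchronized V get(String userId) {
        V handle = open.get(userId);
        if (handle == null) {
            handle = opener.apply(userId);
            open.put(userId, handle);
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }
}
```

Note that this sketch closes an evicted handle immediately; if another thread may still be searching with it, some form of reference counting would be needed before closing.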

Re: indexing and searching real numbers

2007-04-05 Thread Leon
Thanks for your reply Otis, wquery.toString() returns westbc:[* TO ] query.toString() returns westbc:[* TO ] If I compare these two strings for equality like wquery.toString().equals(query.toString()) I get true. I also got bytes of those strings and compared them bytewise - they are

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Andrew Hudson
> Also, i don't understand why the encode/decode functions have a range of 7x10^9 to 2x10^-9, when it seems to me the most common values are (boosts set to 1.0) something between 1.0 and 0. When would somebody have a monster huge value like 7x10^9? Even with a huge index time boost of 20.0 or s

Re: short documents = help me tweak Similarity??

2007-04-05 Thread John Kleven
Thank you kindly for the responses. This was the solution that I dreamed up initially as well (overriding lengthNorm) and making the returned values for small numTerms values (e.g. 3 and 4) more discrete. So I did that in multiple ways, and I ran into a different problem. If lengthNorm returns

Re: How many Searches is a Searcher Worth?

2007-04-05 Thread Craig W Conway
Wow. Thanks Erick! So I guess the issue isn't with the test code... I wonder what kind of environmental problem I could have? I am also running on XP with JDK 1.5, Lucene 2.1, default memory and gc... The queries I am running are a bit more complex, and return 0-10,000 hits. When I close and re-

Re: How many Searches is a Searcher Worth?

2007-04-05 Thread Erick Erickson
Having to put a counter in and close/open your searcher should not be necessary. I'm afraid I'm not going to be very helpful, because I took your test case and made some very minor modifications to make it run in an environment I happen to have lying around (mostly, just instantiated the Quer

Re: How many Searches is a Searcher Worth?

2007-04-05 Thread Andy Goodell
My approach to dealing with these kinds of issues (which has worked well for me thus far) is: - Run java with -XX:+HeapDumpOnOutOfMemoryError command-line option - use jhat to inspect the heap dump, like so: $ /usr/java/jdk1.6/bin/jhat ./java_pid1347.hprof jhat will take a while to parse the hea

Re: How many Searches is a Searcher Worth?

2007-04-05 Thread Craig W Conway
So, forgetting the RMI stuff, I put together a test client very similar to the one in the book "Lucene in Action" page 182. The client: 1. instantiates a IndexSearcher 2. loops through queries, searches, prints hit count, saves nothing I am only able to run through about 40 searches before I

Re: multi user multi index?

2007-04-05 Thread Erick Erickson
What's the aggregate size of all your user indexes? And how many servers could you potentially spread the load across? What kind of queries do you allow? wildcards? simple term? Arbitrary Boolean expressions? What kind of throughput are you expecting? Opening and closing a reader for each search

multi user multi index?

2007-04-05 Thread nesrka sri
Hi, I am currently working on indexing the documents present in a web based document management system. The system currently has around 200,000 users and each user has approximately 10 to 100 documents. We currently have around 50 GB of data. The system should allow the users only to search a

Re: Lucene for name matching

2007-04-05 Thread Nilesh Bansal
On 4/5/07, moraleslos <[EMAIL PROTECTED]> wrote: something specific as this, or are there better algorithms and/or software out there that does name matching. Thanks in advance! Approximate string matching is an active research field. There are many systems that implement different algorithms t

Re: Not able to search on UN_TOKENIZED fields

2007-04-05 Thread Erick Erickson
See below On 4/5/07, Ryan O'Hara <[EMAIL PROTECTED]> wrote: Hey Erick, Thanks for the quick response. I need a truly exact match. What I ended up doing was using a TOKENIZED field, but altering the StandardAnalyzer's stop word list to include only the word/letter 'a'. Below is my searching

Re: Explanation from FunctionQuery

2007-04-05 Thread Chris Hostetter
1) which version of FunctionQuery are you using (from the solr repository or from a Jira issue attachment?) 2) what is the full stacktrace? (ie: which function/line is throwing the Exception) FunctionQuery supports explain just fine, not sure why you'd have problems, oh wait ... i see exactly wha

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Chris Hostetter
: The problem comes when your float value is encoded into that 8 bit : field norm, the 3 length and 4 length both become the same 8 bit : value. Call Similarity.encodeNorm on the values you calculate for the : different numbers of terms and make sure they return different byte : values. bingo.

Re: Can Query.toString() output be parsed to the same query?

2007-04-05 Thread Chris Hostetter
: I am new to Lucene. I find that the output : of the Query.toString() method cannot be parsed : back to the same query. Is it true? If it is : true, I am wondering why not make the output of : Query.toString() parsable to the same query again? some of the more simplified query classes generate a

Re: Not able to search on UN_TOKENIZED fields

2007-04-05 Thread Ryan O'Hara
Hey Erick, Thanks for the quick response. I need a truly exact match. What I ended up doing was using a TOKENIZED field, but altering the StandardAnalyzer's stop word list to include only the word/letter 'a'. Below is my searching code: String[] stopWords = {"a"}; StandardA

Re: Lucene for name matching

2007-04-05 Thread moraleslos
Hi Grant! Thanks for the reply. I'll look into the links you suggested. Just curious though, what did you do to implement this--if you can spill some of the beans ;-) You think what you did was better than the FuzzyQuery approach? Was it a custom algorithm or did you utilize some framework f

Re: Lucene for name matching

2007-04-05 Thread Grant Ingersoll
It's like deja vu all over again. I literally just finished up a similar task (about 2 hours ago). I didn't use Lucene for it, although I suppose I could have. Lucene does have the FuzzyQuery (http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Andrew Hudson
The problem comes when your float value is encoded into that 8 bit field norm, the 3 length and 4 length both become the same 8 bit value. Call Similarity.encodeNorm on the values you calculate for the different numbers of terms and make sure they return different byte values. Andrew On 4/5/07,
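To see why two nearby norm values can collapse into one byte, here is a self-contained sketch of a one-byte float encoding of the same general shape as the norm encoding discussed above. It is an illustration only: `NormSketch` is a hypothetical class, and in real code you would call `Similarity.encodeNorm` itself rather than rely on this copy.

```java
// Sketch only: a one-byte float encoding that keeps the exponent and just
// the top two fraction bits of the float. Not a substitute for Lucene's code.
final class NormSketch {
    private NormSketch() {}

    // Packs a float into one byte by dropping the 21 low mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int zero = (63 - 15) << 3;                    // smallest encodable value
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;                         // overflow: clamp to the top
        }
        return (byte) (small - zero);
    }

    // Expands a byte produced by encode() back into a float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Print the byte each default-style length norm ends up as.
        for (int numTerms = 1; numTerms <= 6; numTerms++) {
            float norm = (float) (1.0 / Math.sqrt(numTerms));
            byte b = encode(norm);
            System.out.println(numTerms + " terms: norm=" + norm
                    + " byte=" + b + " decoded=" + decode(b));
        }
    }
}
```

In this sketch, 1/sqrt(3) and 1/sqrt(4) encode to the same byte, while 1/sqrt(2) gets a different one, which is the collision described above.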

Lucene for name matching

2007-04-05 Thread moraleslos
I was wondering if anyone has done people name matching using Lucene. For example, I have a name coming from some external source that I would like to match with the one I have in my DB. Let's say my DB contains the name "John Smith". If the external source has something like "Smith John", "Smit

Re: Not able to search on UN_TOKENIZED fields

2007-04-05 Thread Erick Erickson
Yes, you can search on UN_TOKENIZED fields, but they're exact, really, really exact. I'd recommend that you get a copy of Luke (google lucene luke) and examine your index to see what you actually have in your index. Also, you haven't provided us a clue what the actual query is. I'd use Query.to

Not able to search on UN_TOKENIZED fields

2007-04-05 Thread Ryan O'Hara
Hey, I was just wondering if you are supposed to be able to search on UN_TOKENIZED fields? It seems like you can from the docs, but I have been unsuccessful. I want to do exact string matching on a certain field without analyzer interference. Thanks, Ryan --

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Otis Gospodnetic
As far as I know, this is the case where you want your custom Similarity that knows how to deal with a small number of terms. public float lengthNorm(String fieldName, int numTerms) { if (numTerms < N) // return something smart return (float)(1.0 / Math.sqrt(numTerms)); } I thi
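A minimal sketch of the "return something smart" branch, written as plain Java so it runs without Lucene on the classpath. The class name `ShortFieldNorm` and the table values are assumptions chosen for illustration; in Lucene you would put the same logic in a `lengthNorm(String, int)` override on your own Similarity subclass, and check the results with `Similarity.encodeNorm` to confirm they stay distinct once stored.

```java
// Hypothetical stand-in for a custom Similarity's lengthNorm logic.
final class ShortFieldNorm {
    // Assumed norms for fields of 1..5 terms, spaced widely on purpose.
    private static final float[] SHORT_NORMS = {1.0f, 0.875f, 0.75f, 0.625f, 0.5f};

    private ShortFieldNorm() {}

    static float lengthNorm(String fieldName, int numTerms) {
        if (numTerms < 1) {
            numTerms = 1;                      // treat empty fields like one term
        }
        if (numTerms <= SHORT_NORMS.length) {
            return SHORT_NORMS[numTerms - 1];  // short field: use the table
        }
        // Longer fields fall back to the usual 1/sqrt(numTerms).
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

The table keeps the norm decreasing as fields get longer, so the fallback for six or more terms still ranks below every table entry.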

Re: short documents = help me tweak Similarity??

2007-04-05 Thread Grant Ingersoll
It is the right forum, silence just means either no one knows the answer or no one who knows the answer has read it... Such is the nature of the community. Have you looked at overriding similarity with your own implementation? Have you done explain() calls on the docs to see where the s

Re: indexing and searching real numbers

2007-04-05 Thread Otis Gospodnetic
You can't really rely on Query.toString() to produce a valid query identical to the query in that Query instance. Are you sure both produce the same query string? You didn't include that. Otis

Re: short documents = help me tweak Similarity??

2007-04-05 Thread John Kleven
Sorry to re-post -- is this the correct forum for questions like this? I think that writing a new encode/decode operation should help alleviate my problem, but thought that this must be fairly widespread issue for people using lucene for "non-web-page" searches (i.e., shorter documents) Thanks a

indexing and searching real numbers

2007-04-05 Thread Leon
Hello everybody, I need to index and search real numbers in Lucene. I found NumberUtils class in Solr project which permits one to encode doubles into string so that alpha numeric ordering would correctly correspond to the ordering on numbers. When I use ConstantScoreRangeQuery programmatically e
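For readers unfamiliar with the idea of encoding doubles so that string order matches numeric order, here is a self-contained sketch. It is not Solr's NumberUtils encoding: `SortableDouble` is a hypothetical class that writes a fixed-width hex string, and NaN handling is ignored.

```java
// Sketch only: maps doubles to 16-character hex strings whose
// lexicographic order matches the numeric order of the doubles.
final class SortableDouble {
    private SortableDouble() {}

    static String encode(double value) {
        long bits = Double.doubleToLongBits(value);
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;  // negatives: reverse magnitude order
        }
        bits ^= Long.MIN_VALUE;           // flip sign bit so negatives sort first
        return String.format("%016x", bits);
    }

    static double decode(String encoded) {
        long bits = Long.parseUnsignedLong(encoded, 16) ^ Long.MIN_VALUE;
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;  // undo the flip applied to negatives
        }
        return Double.longBitsToDouble(bits);
    }
}
```

Because every encoded value has the same width, a range over the encoded terms covers exactly the doubles between the two endpoints, which is what a term-order range query needs.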

Re: How does lucene handle content-type

2007-04-05 Thread Erick Erickson
Lucene has no built-in recognition of anything. You have to parse the header and index the relevant bits as you need to. There are projects *based* upon lucene that do web crawls that you might want to look into, Nutch comes to mind. Erick On 4/5/07, Developer Developer <[EMAIL PROTECTED]> wrot

How does lucene handle content-type

2007-04-05 Thread Developer Developer
I am using WGET to download content from the www with the --save-headers option. The --save-headers option saves the HTTP header to the downloaded files. Does Lucene make use of content type while indexing or do I have to parse the header, determine the content-type and determine the right set of action

luke v0.7 and SnowBallAnalyzer

2007-04-05 Thread Paul Hermans
I'm running lukeall-0.7.jar. In the Search Tab, when I try to use the SnowBallAnalyzer with name "German" for a Query, I do receive the message "java.lang.ClassNotFound: net.sf.snowball.ext.GermansStemmer". In the PlugIns tab, when I'm using the SnowballAnalyzer, I do get "Couldn't instantiate

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Mohammad Norouzi
Well Philipp and Ronnie Thank you very much indeed -- Regards, Mohammad

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Philipp Nanz
As long as there are no deletions, the ids will remain unchanged and it is safe to use them outside. But in a case where you delete some document, the resulting gap in the document list will be filled during the next optimize (triggered manually) or merge operation (may be triggered automatically

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Mohammad Norouzi
Thanks Philipp 2007/4/5, Philipp Nanz <[EMAIL PROTECTED]>: > That *is* the actual id in the index. There is no other. > You should be careful using it outside of Lucene though, because > Lucene may rearrange the document ids during optimization for example. > > If you need an application id, ad

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Philipp Nanz
Ahh, now i know what you mean... Forget the above :-) Use result.id( i ) 2007/4/5, Philipp Nanz <[EMAIL PROTECTED]>: That *is* the actual id in the index. There is no other. You should be careful using it outside of Lucene though, because Lucene may rearrange the document ids during optimizati

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Ronnie Kolehmainen
It's in the FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ#head-e1de2630fe33fb6eb6733747a5bf870f600e1b4c Mohammad Norouzi wrote: but the question is, if I add, say, a document to my index, is lucene going to re arrange the internal IDs? can't I trust them? Would you tell me in exactly which

I need the internal lucene's document id from Hits

2007-04-05 Thread Mohammad Norouzi
Hi I need the id of the document that is returned by Hits as a result of a query. Hits result = searchable.find(myQuery); now I need something like result.getId() is there any way to get it? Thanks so much -- Regards, Mohammad Norouzi

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Philipp Nanz
That *is* the actual id in the index. There is no other. You should be careful using it outside of Lucene though, because Lucene may rearrange the document ids during optimization for example. If you need an application id, add it as an additional stored field to each document and retrieve that.

Re: I need the internal lucene's document id from Hits

2007-04-05 Thread Mohammad Norouzi
sorry to correct my answer: I need something like this result.doc( i ).getId(); this id from the result (the i ) is starting from 1 but I need the actual id in the index. On 4/5/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote: Hi I need the id of the document that returned by Hits as a result

Re: Better parsing of Queries

2007-04-05 Thread Chris Hostetter
deja vu ... wasn't someone else just asking about "tolerant" query parsing, and then following up that they had found this suggestion from past me... http://www.nabble.com/Error-tolerant-query-parsing-tf108987.html ...inspecting the ParseException should allow you to do all sorts of iterative "fixi