distinct results

2007-04-10 Thread Melanie Langlois
Hi, I'm indexing documents, and some of them are provided in several languages. Thanks to the participants on this mailing list, I know that I have two choices for indexing these multiple instances of documents. Either I create language-specific fields, or I index the translations in different doc

Search text file and return position

2007-04-10 Thread Marius Cirsta
Hello, please excuse my inexperience, but I need Lucene to do a simple task and I haven't been able to find out how. I just need to search some text files for a given string, say "brown fox", and get the filename, which I found out how to do, but I also need the position in that file (so I can replace that t
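
A minimal sketch of the "position in the file" part of this question, using only the Java standard library rather than Lucene. It assumes the file's content has already been read into a String; the class and method names (PhrasePositions, findAll) are hypothetical and not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class PhrasePositions {
    /** Returns the character offset of every occurrence of phrase in text. */
    static List<Integer> findAll(String text, String phrase) {
        List<Integer> offsets = new ArrayList<>();
        if (phrase.isEmpty()) {
            return offsets; // an empty phrase has no meaningful position
        }
        int from = 0;
        while (true) {
            int hit = text.indexOf(phrase, from);
            if (hit < 0) {
                break;
            }
            offsets.add(hit);
            from = hit + 1; // step by one so overlapping matches are found too
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox met another brown fox.";
        System.out.println(findAll(text, "brown fox"));
    }
}
```

The offsets are character positions, so they can be used directly with String.substring or StringBuilder.replace on the same text.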

Re: distinct results

2007-04-10 Thread Erick Erickson
You might get some good pointers by searching the mail archive for "faceted search", or perhaps just "faceted". I vaguely remember that the whole notion of sub-dividing result sets into bags of documents was discussed under that heading, quite an extensive discussion as I remember, and certainly n

Re: IndexReader.deleteDocuement(); How to use it with our code??

2007-04-10 Thread Donna L Gresh
>Hi, >I din't get the exact message from this sentence what exactly you want to >say?? >Can you please brief it with some more sentences??? I believe he meant that in one place (with no error) you have "E:/eclipse/310307/objtest/crawl-result/indexes/part-0"; but in the other you have indexDir

Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread Sengly Heng
Hello all, I would like to extract the term frequency vector from the hit results as one total vector, not per document. I have searched the mailing list and found that many have discussed this issue, but I still could not find the right solution. Everyone just suggested to look at getTermFreqVe

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread thomas arni
Hello Sengly, first of all you have to make sure that you create new Fields, which you add to a Document, with the appropriate constructor. You have to specify the usage of term vectors (Field.TermVector.YES): new Field("text", "your text...", Field.Store.YES, Field.Index.TOKENIZED,Field.Ter
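
The aggregation step this thread asks about can be sketched without Lucene: given one term-frequency map per hit, sum them into a single total. This is only an illustration under stated assumptions. In a real setup the per-document maps would come from the term vectors stored via Field.TermVector.YES; here a hypothetical countTerms helper builds them from plain strings, and the class name TotalTermFrequency is likewise made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TotalTermFrequency {
    /** Hypothetical stand-in for a per-document term vector: term -> count. */
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Sums the per-document frequencies into one total vector. */
    static Map<String, Integer> total(List<Map<String, Integer>> perDocument) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> doc : perDocument) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }
}
```

The cost is linear in the total number of distinct terms across the hits, which is why the later replies in the thread turn to caching and precalculation for large collections.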

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread karl wettin
10 Apr 2007 at 16:58, Sengly Heng wrote: I wanted to do it this way as well, but I am a bit worried about computational time, as I have many documents and each document is fairly large. I am looking for more solutions. We don't really know what your problem is. Explaining that rather than

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread Sengly Heng
Thanks so much Thomas for your prompt reply. First of all you have to make sure, that you create new Fields, which you add to a Document, with the appropriate constructor. You have to specify the usage of term vectors (Field.TermVector.YES): new Field("text", "your text...", Field.Store.YES,

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread Sengly Heng
Dear Karl, Thank you for taking the time to look at my problem. We don't really know what your problem is. Explaining that rather than the solution you have thought of might render a couple of alternate solutions. Perhaps something could be precalculated and stored in the documents. Perhaps feature

index the whole plain text file's content

2007-04-10 Thread Chen Li
Hello, I used the demo code (IndexFiles.java) from Lucene to index around 100 text files. doc.add(new Field("contents", new FileReader(f))); Interestingly, for some larger files (around 500 KB), only query terms near the top of the file are searchable; once the term is at the end or a

Copy index while updating the index

2007-04-10 Thread Rajendranath, Divya
Hello, I have a scenario where we need to set up our application, which uses Lucene (and has on-demand indexing of documents), at a disaster-recovery (DR) site. The plain files/attachments used by our application can simply be copied to the DR site by syncing (manual copying). Yes, we can also cop

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread karl wettin
10 Apr 2007 at 17:48, Sengly Heng wrote: We don't really know what your problem is. Explaining that rather than the solution you have thought of might render a couple of alternate solutions. Perhaps something could be precalculated and stored in the documents. Perhaps feature selection (reduct

Re: index the whole plain text file's content

2007-04-10 Thread karl wettin
10 Apr 2007 at 17:58, Chen Li wrote: Interestingly, for some larger files (around 500 KB), only query terms near the top of the file are searchable; once the term is at the end or after an unknown point of the file, I couldn't use SearchFiles.java, which also came with demo code

Re: Copy index while updating the index

2007-04-10 Thread Otis Gospodnetic
Here is one way to do it: You can read/open an index at any point, even when it's being modified. You can then open a new FSDirectory pointing to a new directory and add your original FSDirectory to that new FSDirectory. That will copy the index. Of course, any new documents you add to the or

Re: distinct results

2007-04-10 Thread Doron Cohen
> > I'm indexing documents, and some of them are provided in several > > languages. ... Either, I create > > languages specific field, either I index the translations in different > > documents, adding the language field. > > > > I choose the second solution, because first, the translated docum

Re: Copy index while updating the index

2007-04-10 Thread Michael McCandless
You do need to be careful with this because if a writer commits while you are copying you can easily get a copy that's unusable (is missing files). When you instantiate an IndexReader, it actually holds open most files that it uses which protects them from being deleted. So in theory if you coul

StopAnalyzer- Stop List Words

2007-04-10 Thread sai hariharan
Hi, Where can I find the list of words that StopAnalyzer uses to remove common English words? Can I add additional words to the stop list? Regards, -- சாய் Hari

Re: StopAnalyzer- Stop List Words

2007-04-10 Thread Ryan O'Hara
You can find the list in StopAnalyzer.java: public static final String[] ENGLISH_STOP_WORDS = { "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these",
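
To illustrate the second half of the question (adding words to the stop list), here is a hedged, Lucene-free sketch. It copies only the portion of ENGLISH_STOP_WORDS visible in the excerpt above, which is cut off, so the array is deliberately incomplete. The names StopWordFilter, SHOWN_STOP_WORDS, withExtras and filter are hypothetical; how a custom list is passed to StopAnalyzer itself is not shown in the excerpt, so check the Javadoc for that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Only the words visible in the truncated excerpt; the real list is longer.
    static final String[] SHOWN_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these"
    };

    /** Builds a stop set from the shown words plus caller-supplied extras. */
    static Set<String> withExtras(String... extras) {
        Set<String> words = new HashSet<>(Arrays.asList(SHOWN_STOP_WORDS));
        words.addAll(Arrays.asList(extras));
        return words;
    }

    /** Drops every token whose lower-cased form is in the stop set. */
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```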

Re: Standard Parser Behavior

2007-04-10 Thread Walt Stoneburner
Steven Parkes points out: Lucene doesn't use a pure Boolean algebra, so things don't always do what one might expect and things like De Morgan's law don't hold. You're exactly on to what I was pondering about. With boolean logic, I understand the operators inside and out, so something like De

How to sort on a tokenised field?

2007-04-10 Thread Joe Tang
My task is to index lots of documents with different fields. Some of the fields are tokenized and will need to be sorted on later, when a result set has to be ordered by a particular field. Unfortunately, Lucene complains about sorting on a tokenized field. So is there any way to get around it? Thank

Re: How to sort on a tokenised field?

2007-04-10 Thread Erick Erickson
Lucene sorting is intended to sort documents relative to each other. So it makes no sense to allow sorts on tokenized fields in the Lucene context. Imagine the separate tokens in a field for doc1 of a, c and e, and for doc2 b, d and f. Where should doc1 go in relation to doc2 when sorting on that

Re: How to sort on a tokenised field?

2007-04-10 Thread Joe Tang
I understand what you are trying to say about the problem of sorting a tokenized field. The reason I am trying to sort on a tokenized field is that I need a field to be both sortable and searchable at different times. A searchable field requires a tokenized field, while a sortable field requires un-

Re: How to sort on a tokenised field?

2007-04-10 Thread Chris Hostetter
: The worse solution is to have another duplicated field which is un-tokenized : but it is not scalable when we have lots of fields need to be searchable. That is really the only solution that exists in Lucene at the moment. Typically the number of fields people want to sort on isn't that big
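
The "duplicated field" idea in this thread can be sketched in plain Java: each document keeps a tokenized form for matching and a single untokenized key for ordering. This is a conceptual sketch only, not Lucene code; SortKeyDemo, Doc, matches and searchSorted are hypothetical names, and lower-casing plus whitespace splitting stand in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortKeyDemo {
    static final class Doc {
        final String title;          // original value
        final String[] searchTokens; // "tokenized" copy, used for matching
        final String sortKey;        // "untokenized" copy, used for ordering

        Doc(String title) {
            this.title = title;
            this.searchTokens = title.toLowerCase().split("\\s+");
            this.sortKey = title.toLowerCase();
        }

        boolean matches(String term) {
            String wanted = term.toLowerCase();
            for (String token : searchTokens) {
                if (token.equals(wanted)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Finds docs containing term, then orders them by the untokenized key. */
    static List<String> searchSorted(List<Doc> docs, String term) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.matches(term)) {
                hits.add(d);
            }
        }
        hits.sort(Comparator.comparing((Doc d) -> d.sortKey));
        List<String> titles = new ArrayList<>();
        for (Doc d : hits) {
            titles.add(d.title);
        }
        return titles;
    }
}
```

Because every document has exactly one sort key, the ordering between any two documents is well defined, which is the property the tokenized copy lacks.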

Re: Standard Parser Behavior

2007-04-10 Thread Chris Hostetter
: The problem is the grouping operator ( ) and how it works with distributed : operators, I don't quite get what the specific transformation rules are. You shouldn't think of parens as a grouping operator; you should think of them as a way to force the explicit creation of a BooleanQuery object.

Re: Standard Parser Behavior

2007-04-10 Thread Mike Klaas
On 4/10/07, Walt Stoneburner <[EMAIL PROTECTED]> wrote: Furthermore syntax like +(-A +B) and -(-A +B) appear to be legal to Luke, though I have no clue what this even means in simple English. Let me try: +(-A +B) -> must match (-A +B) -> must contain B and must not contain A -(-A +B) -> must
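
The reading given above for +(-A +B), "must contain B and must not contain A", can be written as a small predicate over a document's term set. This is a sketch of that one interpretation only; it does not model Lucene's scoring or its query parser, and the names ClauseDemo and matches are hypothetical.

```java
import java.util.Collections;
import java.util.Set;

class ClauseDemo {
    /**
     * True when the document has every required term (the "+" clauses)
     * and none of the prohibited terms (the "-" clauses).
     */
    static boolean matches(Set<String> docTerms,
                           Set<String> required,
                           Set<String> prohibited) {
        return docTerms.containsAll(required)
            && Collections.disjoint(docTerms, prohibited);
    }
}
```

With required = {B} and prohibited = {A}, a document containing only B matches, while documents containing A and B, only A, or neither do not.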

Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-04-10 Thread Daniel Einspanjer
I asked this question on the Solr user list because that is the current lucene server implementation I'm using, but I didn't get any feedback there and the problem isn't really Solr specific so I thought I'd cross post here just in case any non-Solr users might have some ideas. Thank you very muc

Re: How to update index dynamically

2007-04-10 Thread Daniel Noll
Otis Gospodnetic wrote: Anson, That's not your real code, is it? Those $ characters in it look incorrect. Are you sure? $ is legal at the front of a variable in Java. :-) Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http
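
The point about $ is easy to check with a few lines of standard Java. The class and member names below (DollarIdentifier, $label, $answer) are made up for the demonstration; by convention $ is usually left to generated code, but the compiler accepts it.

```java
class DollarIdentifier {
    // A field whose name starts with '$' compiles like any other identifier.
    static final String $label = "legal";

    // Method names may start with '$' as well.
    static int $answer() {
        return 42;
    }

    public static void main(String[] args) {
        // The standard library agrees: '$' may start a Java identifier.
        System.out.println(Character.isJavaIdentifierStart('$'));
        System.out.println($label + " " + $answer());
    }
}
```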

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread Sengly Heng
Once again, thank you for your help. >> We don't really know what your problem is. Explaining that rather >> than the solution you have thought of might render a couple of >> alternate solutions. Perhaps something could be precalculated and >> stored in the documents. Perhaps feature selection

Re: How to update index dynamically

2007-04-10 Thread Otis Gospodnetic
Wow, you are right. I never realized that! - Original Message From: Daniel Noll <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, April 10, 2007 8:39:28 PM Subject: Re: How to update index dynamically Otis Gospodnetic wrote: > Anson, > > That's not your real code, is

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-04-10 Thread Grant Ingersoll
On Apr 10, 2007, at 8:03 PM, Daniel Einspanjer wrote: The people reviewing this matching process need some way of determining why a particular match was made other than the overall score. Was it because the title was a perfect match or was it because the title wasn't that close, but the direct

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread Grant Ingersoll
Would some sort of caching strategy work? How big is your overall collection? Also, lately there have been a few threads on TV (term vector) performance. I don't recall anyone having actively profiled or examined it for improvements, so perhaps that would be helpful. Another thought: co

Re: Issue with search() Help Appreciated.

2007-04-10 Thread Lokeya
I solved the issue by: 1. Using the same Analyser. 2. Tokenizing terms when indexing. Now I am facing issues with the following code, which I have pasted after the code. I searched the forum but couldn't find a relevant post: QueryParser parser = new QueryParser("Title", analyzer); Que

Issue with : Searcher.search() returning Hits of same length for different searches

2007-04-10 Thread Lokeya
I am following all the points which are mentioned in the following link: http://wiki.apache.org/lucene-java/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71 I am having the following issues: 1. For different Queries I give I get a Hits object where there are always 21 documents, but gett

How to access Levenstein distance number?

2007-04-10 Thread Michael Barbarelli
Hello. I am using Lucene to submit fuzzy queries against an index. I have noticed that relevant matches are often retrieved, but the scoring is not at all what I expected. For example, if my query is "rightches~", a reference to a text file with the single word "righteous" is returned with a sco
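
For reference, the raw Levenshtein (edit) distance between the two words in this question can be computed independently of Lucene. The sketch below is the textbook dynamic-programming algorithm with unit costs for insert, delete and substitute; the class name EditDistance is hypothetical, and how Lucene turns such a distance into a fuzzy-query score is not shown in the excerpt.

```java
class EditDistance {
    /** Classic Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // match or substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("rightches", "righteous"));
    }
}
```

The two words share the prefix "right", so the whole distance comes from turning "ches" into "eous".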