drawback addindexes method

2007-05-03 Thread Chandan Tamrakar
I found that IndexWriter.addIndexes(Directory[]) always calls the optimize method twice. I am indexing documents in batches, i.e. I call this method when X documents are buffered in RAM using a RAMDirectory. So as the index size grows, the optimize calls will only increase the indexing time C

Doubt in FuzzyQuery

2007-05-03 Thread sccarrera
Hi! I have a problem dealing with a fuzzy query in Lucene 2.1.0. To explain my problem, I will illustrate it with a simple example: I would like to retrieve files containing the strings "société américaine" and "sociétés américaines" from a fuzzy query relating to the string "société

MergeFactor advice wanted

2007-05-03 Thread Aleksander M. Stensby
Hello everyone! I'm wondering if any of you have any helpful advice on what mergeFactor I should use... The indexing process is handling a large number of documents and I would like to index as fast as possible. My initial thought was to increase the mergeFactor to make the indexer work more in

Re: MergeFactor advice wanted

2007-05-03 Thread Mark Miller
I think it is worth your time to do some benchmarking. I think mergeFactor is not very helpful in the end...if you set it high, you'll index faster but then your searches will be slower prompting you to optimize...after which you'll find that you paid all your gains back. Test things out for yo
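
A rough way to see the trade-off described above is a toy model of level-based segment merging. This is a sketch under stated assumptions, not Lucene code: the class name `MergeModel` and the method `simulate` are hypothetical, every flush is assumed to produce one unit-sized segment, and a level is assumed to merge as soon as it holds `mergeFactor` segments. Real behaviour depends on the Lucene version and settings, so it is no substitute for benchmarking.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging (hypothetical, not Lucene code). */
final class MergeModel {

    /**
     * Simulates a number of buffer flushes, each producing one unit-sized segment.
     * Whenever mergeFactor segments pile up on a level, they are merged into one
     * segment on the next level. Returns {segments left, units copied by merges}.
     */
    static long[] simulate(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> perLevel = new ArrayList<>(); // segment count on each level
        long copied = 0;
        for (int i = 0; i < flushes; i++) {
            long size = 1; // size of the segment being placed, in flush units
            for (int level = 0; ; level++) {
                if (perLevel.size() <= level) {
                    perLevel.add(0);
                }
                int count = perLevel.get(level) + 1;
                if (count < mergeFactor) {
                    perLevel.set(level, count);
                    break;
                }
                // Level is full: merge its segments into one on the next level.
                perLevel.set(level, 0);
                size *= mergeFactor;
                copied += size;
            }
        }
        long segments = 0;
        for (int count : perLevel) {
            segments += count;
        }
        return new long[] {segments, copied};
    }
}
```

In this model, 1000 flushes with a merge factor of 10 end with 1 segment after copying 3000 units, while a merge factor of 100 copies only 1000 units but leaves 10 segments to search. That matches the point made in the reply: the indexing time saved by a high mergeFactor tends to be paid back at search time or in a later optimize.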

Re: MergeFactor advice wanted

2007-05-03 Thread Aleksander M. Stensby
OK, but then you would not optimize at all? Not even at the end of the indexing run? On Thu, 03 May 2007 12:17:40 +0200, Mark Miller <[EMAIL PROTECTED]> wrote: I think it is worth your time to do some benchmarking. I think mergeFactor is not very helpful in the end...if you set it high, y

RE: MergeFactor advice wanted

2007-05-03 Thread Chandan Tamrakar
What if we are using the addIndexes method with a RAMDirectory? It calls optimize inside the method itself. Any solution to this? -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Thursday, May 03, 2007 4:03 PM To: java-user@lucene.apache.org Subject: Re: MergeFac

RE: drawback addindexes method

2007-05-03 Thread Steven Parkes
See IndexWriter#addIndexesNoOptimize, released with 2.1. Note that it doesn't optimize before or after, so if you want an optimize at the end, you need to ask for it manually. -Original Message- From: Chandan Tamrakar [mailto:[EMAIL PROTECTED] Sent: Thursday, May 03, 2007 12:46 AM To: jav

Re: Email Definition in StandardTokenizer.jj

2007-05-03 Thread Erick Erickson
I can just see Hatcher's reply now... Would you be willing to submit the correct code? Erick On 5/2/07, Winton Davies <[EMAIL PROTECTED]> wrote: Hey guys, Does someone who makes commits want to fix the EMAIL definition in StandardTokenizer.jj? It's a not very well known exception to the n

Re: Doubt in FuzzyQuery

2007-05-03 Thread Erick Erickson
It would help a lot if you can either post a snippet of code showing how you construct the fuzzy query or create a small, self-contained program illustrating the problem. With the latter approach, I've often found that in the middle of creating the program, what I'm doing wrong surfaces ... Best

Re: MergeFactor advice wanted

2007-05-03 Thread Erick Erickson
I don't think (but don't know for sure) that optimizing before the end of the run buys you anything. And you're right, it takes a while. I've assumed that it was best done at the end of the entire run, but that's only an assumption. Search the archives for the thread titled MergeFactor and Ma

Re: MergeFactor advice wanted

2007-05-03 Thread Erick Erickson
I don't think you're doing yourself any good by explicitly using a RAMDirectory in the first place. If you use a simple FSDirectory, a number of documents are added in RAM before being flushed to the FS. Why do you add this complexity to your code with no proof that it does you any good? Or do yo

Implementing large secure Lucene search system questions.

2007-05-03 Thread jim shirreffs
Hi, I'm a relative Lucene newbie and would appreciate some expert advice. I would like to make files distributed on various local hosts in the intranet full-text searchable. My startup plan is to index these files locally and then merge all the little indexes into a master index on a search

Re: Doubt in FuzzyQuery

2007-05-03 Thread Stefan Will
It seems to me like a French stemmer is what you need instead of a fuzzy query. What analyzer are you using for your documents and queries? -- Stefan [EMAIL PROTECTED] wrote: Hi! I have a problem dealing with a fuzzy query in Lucene 2.1.0. In order to explain my problem, I illustrate it
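
For context on how close the singular and plural forms in this thread are, the sketch below computes a plain Levenshtein edit distance and a similarity score of the form 1 - distance / length of the shorter term. That formula is my understanding of how Lucene's fuzzy matching scores a candidate term when no prefix length is set, so treat it as an assumption; the class `FuzzySketch` and its methods are hypothetical names, not Lucene API. Note also that a fuzzy query matches one term at a time, so it applies to "société" and "américaine" separately, not to the two-word string as a unit.

```java
/** Plain edit distance and a fuzzy-style similarity (hypothetical helper, not Lucene API). */
final class FuzzySketch {

    /** Classic two-row Levenshtein distance: insertions, deletions, substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed scoring: 1 - distance / length of the shorter term. */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }
}
```

Under these assumptions, "société" and "sociétés" are one edit apart, giving a similarity of about 0.86, well above a 0.5 threshold. If the terms still fail to match, the analyzer used at index and query time is the more likely place to look, which is also why a stemmer may be the better tool here.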

For indexing: how to estimate needed memory?

2007-05-03 Thread david m
Our application includes an indexing server that writes to multiple indexes in parallel (each thread writes to a single index). In order to avoid an OutOfMemoryError, each request to index a document is checked to see if the JVM has enough memory available to index the document. I know that Index

Language detection library

2007-05-03 Thread Mordo, Aviran (EXP N-NANNATEK)
Does anyone know of a good language detection library that can detect what language a document (text) is in?

Re: Language detection library

2007-05-03 Thread Otis Gospodnetic
LingPipe - commercial unless your data/product/service is free. Nutch language id plugin. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: "Mordo, Aviran (EXP N-NANNATEK)" <[EMAIL PROTEC

Re: Language detection library

2007-05-03 Thread Jason Pump
http://software.wise-guys.nl/libtextcat/ Otis Gospodnetic wrote: LingPipe - commercial unless your data/product/service is free. Nutch language id plugin. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Origin

Re: For indexing: how to estimate needed memory?

2007-05-03 Thread Erick Erickson
Coincidentally, I'm hacking at this very problem. First, are you sure your free memory calculation is OK? Why not just use freeMemory? Perhaps also calling the GC if the available memory isn't enough. Although I confess I don't know the innards of the interplay of getting the various memory amounts
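
One possible shape for the check discussed here, using only `java.lang.Runtime`. It is a sketch, not the original poster's code: the class `MemoryHeadroom` and the per-document estimate passed to `hasRoomFor` are hypothetical. The one point worth noting is that `freeMemory()` alone only reports free space inside the currently allocated heap, so the sketch also counts the room the heap may still grow into, up to `maxMemory()`.

```java
/** Heap headroom check before indexing a document (hypothetical helper). */
final class MemoryHeadroom {

    /** Bytes the JVM could still hand out: max heap minus what is in use now. */
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Returns true if the estimated need fits. If it does not fit at first,
     * asks for a collection and checks once more; System.gc() is only a hint.
     */
    static boolean hasRoomFor(long estimatedBytes) {
        if (availableBytes() >= estimatedBytes) {
            return true;
        }
        System.gc();
        return availableBytes() >= estimatedBytes;
    }
}
```

A caller would estimate the cost of the next document (how to do that is the open question of this thread) and delay or reject the request when `hasRoomFor` returns false. With several threads indexing in parallel, the answer can be stale by the time it is used, so some safety margin is advisable.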

Re: Language detection library

2007-05-03 Thread Andrzej Bialecki
Jason Pump wrote: http://software.wise-guys.nl/libtextcat/ ... which is what Nutch implements in its language-identifier plugin. -- Best regards, Andrzej Bialecki
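
The n-gram technique behind text categorizers of this kind can be sketched in a few lines. This is a simplified illustration, not the libtextcat or Nutch code: it uses character trigrams only, builds language profiles from one sample sentence each, and the class `TextCatSketch`, its methods, and the sample texts are all hypothetical. Each text is reduced to a ranked list of its most frequent trigrams, and the language whose ranking is least "out of place" relative to the document wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Rank-order trigram language guesser (hypothetical sketch). */
final class TextCatSketch {
    /** Profile size cap, also the penalty for a trigram the language profile lacks. */
    static final int MAX_PENALTY = 400;

    /** Maps each character trigram of the text to its rank (0 = most frequent). */
    static Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic.
        grams.sort((x, y) -> {
            int byCount = counts.get(y) - counts.get(x);
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < grams.size() && i < MAX_PENALTY; i++) {
            ranks.put(grams.get(i), i);
        }
        return ranks;
    }

    /** Out-of-place distance: sum of rank differences, MAX_PENALTY when missing. */
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            total += rank == null ? MAX_PENALTY : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the key of the sample text whose profile is closest to the input. */
    static String detect(String text, Map<String, String> samples) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            int d = distance(doc, profile(e.getValue()));
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real profile would be trained on far more text per language and would typically mix several n-gram lengths. With single-sentence samples the sketch only separates inputs that share vocabulary with them, and very short inputs stay unreliable either way.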

Re: Language detection library

2007-05-03 Thread karl wettin
On 3 May 2007 at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote: Does anyone know of a good language detection library that can detect what language a document (text) is in? I posted this some time back: https://issues.apache.org/jira/browse/LUCENE-826 A bit proof-of-concept-ish, but it does the job

customizing index file name

2007-05-03 Thread Shaw, James
Does anyone know how to fix the .cfs file name in an index directory? The deletable and segments file names are always the same, but we have observed that the .cfs file name changes each time you index a content directory with some changes to the directory (some deleted files, added files, etc). H

Re: Implementing large secure Lucene search system questions.

2007-05-03 Thread Daniel Noll
jim shirreffs wrote: Hi, I'm a relative Lucene newbie and would appreciate some expert advice. Sounds like you might want to start a new thread, otherwise people who know the answer to your problem might not see your post. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo N

Re: customizing index file name

2007-05-03 Thread Erick Erickson
Uh, what do you mean "fix"? You shouldn't have to do anything with it at all. What behavior are you observing that you want to change and why? Erick On 5/3/07, Shaw, James <[EMAIL PROTECTED]> wrote: Does anyone know how to fix the .cfs file name in an index directory? The deletable and segment

Re: Language detection library

2007-05-03 Thread Chris Lu
I suppose if a document is indexed as English or French, then when users search the document, we need to parse the query as English or French also? -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.db

RE: customizing index file name

2007-05-03 Thread Shaw, James
I mean specifying the name of the .cfs file, rather than letting Lucene come up with a name by itself. I'm actually using Lucene.Net, and we pre-index during our build and want to include the index in the installer, but the installer can only reference named files, and it wouldn't work if the .cfs

Re: customizing index file name

2007-05-03 Thread Erick Erickson
Oh, fix as in make constant, not fix as in broken ... No, I don't know of any way to do this. Can your installer just pack up everything in a directory? Erick On 5/3/07, Shaw, James <[EMAIL PROTECTED]> wrote: I mean specifying the name of the .cfs file, rather than letting Lucene come up with

Re: Language detection library

2007-05-03 Thread karl wettin
On 4 May 2007 at 02:20, Chris Lu wrote: I suppose if a document is indexed as English or French, then when users search the document, we need to parse the query as English or French also? If you do some language-specific token analysis such as stemming, yes. Detecting the language on such small t