RE: Disk full while optimizing an index

2009-11-30 Thread Uwe Schindler
Hi Siraj, There is no way to find out the free space on a partition using Java 5 (Lucene 3.0) or Java 1.4 (Lucene 2.9) without native JNI calls. So Lucene cannot calculate it before optimizing. With Java 6 it would be possible, but Lucene 3.0 is only allowed to use Java 5: File#getUsableSpac
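For reference, the Java 6 call the thread mentions looks like this in use; a minimal sketch, where "." is just a placeholder for whatever directory your index lives in:

```java
import java.io.File;

public class FreeSpaceCheck {
    /** Bytes available to this JVM on the partition containing path (Java 6+ only). */
    public static long usableBytes(String path) {
        return new File(path).getUsableSpace();
    }

    /** Total size in bytes of the partition containing path (Java 6+ only). */
    public static long totalBytes(String path) {
        return new File(path).getTotalSpace();
    }

    public static void main(String[] args) {
        System.out.println("usable: " + usableBytes(".") + " of " + totalBytes("."));
    }
}
```

As Uwe notes, Lucene 2.9/3.0 cannot use this internally because it must stay on Java 1.4/5, but application code running on Java 6 can make the check before calling optimize.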

RE: Disk full while optimizing an index

2009-11-30 Thread Uwe Schindler
And if you have open IndexReaders/Searchers at the same time, use 3.5 as the factor (because some files were already deleted from the directory but still occupy space - *nix delete on last close) :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de >

Re: Disk full while optimizing an index

2009-11-30 Thread Siraj Haider
Jason, Thank you for your suggestion. That is what I am planning to do, but I overheard or read somewhere that the new Lucene version can take care of that internally, so I was just trying to see if somebody knows something about it. regards -siraj Jason Rutherglen wrote: Siraj, You could

Re: Disk full while optimizing an index

2009-11-30 Thread Jason Rutherglen
Siraj, You could estimate the maximum size used during optimization at 2.5 (a sort of rough maximum) times your current index size, and not optimize if your index (at 2.5 times) would exceed your allowable disk space. Jason On Mon, Nov 30, 2009 at 2:50 PM, Siraj Haider wrote: > Index optimizati
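Jason's rule of thumb can be turned into a pre-flight check; a sketch, with the 2.5x factor (or 3.5x with open readers, per Uwe's follow-up) taken as a rough estimate from this thread, not a guarantee:

```java
public class OptimizeSpaceEstimate {
    /**
     * Rough pre-flight check from this thread: optimizing may need about
     * factor times the current index size in free space (2.5 normally,
     * 3.5 if open IndexReaders still hold deleted files on *nix).
     * The factor is a rule of thumb, not a guarantee.
     */
    public static boolean safeToOptimize(long indexBytes, long freeBytes, double factor) {
        return (long) Math.ceil(indexBytes * factor) <= freeBytes;
    }

    public static void main(String[] args) {
        // 100 units of index, 250 free: 2.5x fits exactly.
        System.out.println(safeToOptimize(100L, 250L, 2.5)); // true
        // With open readers (3.5x) the same index needs ~350 free.
        System.out.println(safeToOptimize(100L, 300L, 3.5)); // false
    }
}
```

The index size itself can be computed by summing file lengths in the index directory; the free-space side needs Java 6's File#getUsableSpace, which is why Lucene itself cannot do this check.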

Disk full while optimizing an index

2009-11-30 Thread Siraj Haider
Index optimization fails if we don't have enough space on the drive and leaves the hard drive almost full. Is there a way not to even start optimization if we don't have enough space on the drive? regards -siraj

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
On Mon, Nov 30, 2009 at 4:07 PM, Shai Erera wrote: > Thanks again, I'll use this table as well. you should only use it if you are normalizing to NFKC or NFKD afterwards... > What I do is read those tables > and store in a char[], for fast lookups of folding chars. I noticed your > comments in

Re: Deciding on strategy for storing indexed fields

2009-11-30 Thread Paul Taylor
Uwe Schindler wrote: There are two answers: It's often a good idea, if you mostly need the full representation in one call. E.g. we have the complete XML representation in a stored field and use it for display with XSLT and so on. Other fields are for indexing only and do not get stored. I alw

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Thanks again, I'll use this table as well. What I do is read those tables and store in a char[], for fast lookups of folding chars. I noticed your comments in the code about not doing so because then the tables would need to be updated once in a while, and I agree. But ICU's lack of char[] API drov

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Shai, no, behind the scenes I am using just that table, via ICU library. The only reason the CaseFoldingFilter in my patch is more complex, is because I also apply FC_NFKC_Closure mappings. You can apply these tables in your impl too if you are also using normalization, they are here: http://unico

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Thanks Robert. In my Analyzer I do case folding according to Unicode tables. So ß is converted to "SS". I do the same for diacritic removal and Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the "SS" to "ss". I checked the filter's output on "AĞACIN" and it's "AGACIN". If I t

RE: Moving to Lucene 3.0

2009-11-30 Thread Uwe Schindler
> I'm trying to fix my code to remove everything that is deprecated in order > to move to Lucene 3.0. I fixed many items but I can't find the answer > to some questions. See items in red below: > > *#1. Opening an index* > *idx = FSDirectory.getDirectory(new File(INDEX)); > reader = IndexRead

Moving to Lucene 3.0

2009-11-30 Thread Michel Nadeau
Hi! I'm trying to fix my code to remove everything that is deprecated in order to move to Lucene 3.0. I fixed many items but I can't find the answer to some questions. See items in red below: *#1. Opening an index* *idx = FSDirectory.getDirectory(new File(INDEX)); reader = IndexReader.open(

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera wrote: > Robert, what if I need to do additional filtering after CollationKeyFilter, > like stopwords removal, abbreviations handling, stemming etc? Will that be > possible if I use CollationKeyFilter? > > Shai, great point. This won't work with Collati

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Shai, again the problem is not really performance (I am ignoring that for now), but the fact that lowercasing and case folding are different. An easy example: the lowercase of ß is ß itself; it is already lowercase. It will not match 'SS' if you use a lowercase filter. If you use case folding,
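Robert's ß example can be verified with nothing but the JDK; a small sketch (the locale is pinned to English so the Turkish I issue discussed elsewhere in this thread doesn't interfere):

```java
import java.util.Locale;

public class FoldingVsLowercase {
    // Plain lowercasing, roughly what a LowerCaseFilter does per character.
    public static String lower(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // ß is already lowercase, so lowercasing leaves it unchanged ...
        System.out.println(lower("ß"));        // ß
        // ... while SS lowercases to ss, so "straße" and "STRASSE" end up
        // as different index terms. Unicode case folding maps ß -> ss, so
        // both sides would match; plain lowercasing cannot do that.
        System.out.println(lower("STRASSE"));  // strasse
    }
}
```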

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Robert, what if I need to do additional filtering after CollationKeyFilter, like stopwords removal, abbreviations handling, stemming etc? Will that be possible if I use CollationKeyFilter? I also noticed CKF creates a String out of the char[]. If the code already does that, why not use String.toLo

RE: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Uwe Schindler
Hi Simon, > > and RussianLowerCaseFilter is deprecated now, it does the exact same > thing > > as LowerCaseFilter. > btw. we should fix supplementary chars in there too even if it is > deprecated. Deprecated classes should never change and for sure not add Version ctors! If somebody wants to use

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Simon Willnauer
On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir wrote: >> I am not sure if it is worth to add a new TokenFilter for Turkish language. >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would >> be nice to see TurkishLowerCaseFilter in Lucene. >> >> >> > just to clarify, GreekLow

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
yes, this is what I would do! The downside to using collation in your filter chain right now, is that then your terms in the index will not be human-readable. The upside is they will both sort and search the way your users expect for a huge list of languages. On Mon, Nov 30, 2009 at 2:22 PM, AHMET

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread AHMET ARSLAN
> just to clarify, GreekLowerCaseFilter really shouldn't > exist either. The > final sigma problem it has (where there are two lowercase > forms depending > upon position in word), this is also solved with unicode > case folding or > collation. This is a perfect example of how lowercase is > the wr

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
> I am not sure if it is worth to add a new TokenFilter for Turkish language. > I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would > be nice to see TurkishLowerCaseFilter in Lucene. > > > just to clarify, GreekLowerCaseFilter really shouldn't exist either. The final sigma p

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Hello, there is already an issue for this. The basic problem is that lowercasing with a locale is still not quite right, because it's intended for presentation (display), not for case folding. The problem is that case folding is not exposed in the JDK, and you have to use the alternate "turkish/azeri" mappings an

LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread AHMET ARSLAN
In the Turkish alphabet the lowercase of I is not i; it is LATIN SMALL LETTER DOTLESS I. LowerCaseFilter, which uses Character.toLowerCase(), makes a mistake just for that character. http://java.sun.com/javase/6/docs/api/java/lang/String.html#toLowerCase() I am not sure if it is worth to add a new TokenFi
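The special case AHMET describes is easy to demonstrate with the plain JDK, which does apply the Turkish mappings when a locale is passed explicitly (as later posts note, this is presentation-oriented lowercasing, not true case folding):

```java
import java.util.Locale;

public class TurkishLowercase {
    private static final Locale TURKISH = new Locale("tr", "TR");

    // Default (English) lowercasing: I -> i
    public static String lowerDefault(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    // Turkish lowercasing: I -> ı (LATIN SMALL LETTER DOTLESS I, U+0131)
    public static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        System.out.println(lowerDefault("AĞACIN")); // "ağacin" - wrong for Turkish
        System.out.println(lowerTurkish("AĞACIN")); // "ağacın" - correct dotless ı
    }
}
```

LowerCaseFilter cannot do this because Character.toLowerCase(char) is locale-independent, which is exactly why the thread ends up discussing case folding and ICU instead.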

RE: Deciding on strategy for storing indexed fields

2009-11-30 Thread Uwe Schindler
There are two answers: It's often a good idea, if you mostly need the full representation in one call. E.g. we have the complete XML representation in a stored field and use it for display with XSLT and so on. Other fields are for indexing only and do not get stored. BUT: If you only need parts o

Deciding on strategy for storing indexed fields

2009-11-30 Thread Paul Taylor
Currently in our Lucene Search we have a number of distinct fields that are indexed and stored, so that the fields can be searched and we can then construct an xml representation of the match (http://wiki.musicbrainz.org/Next_Generation_Schema/SearchServerXML) but on further reading it appears

Re: What does "out of order" mean?

2009-11-30 Thread Nick Burch
On Mon, Nov 30, 2009 at 12:22 PM, Stefan Trcek wrote: I would, but I was not successful in getting the svn repo some months ago. I have to ask the sysadmin to open a door through the firewall for any svn repo. Gave up due to: $ nmap -p3690 svn.apache.org     PORT     STATE    SERVICE     3690/tcp fi

Re: What does "out of order" mean?

2009-11-30 Thread Michael McCandless
I was able to apply that git patch just fine -- so I think it'll work? Thanks! Mike On Mon, Nov 30, 2009 at 12:22 PM, Stefan Trcek wrote: > On Monday 30 November 2009 14:24:20 Michael McCandless wrote: >> I agree, it's silly we label things like TopDocs/TopFieldDocs as >> expert -- they are no

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
The total number is also returned in Top(Field)Docs, there is a getter method. - Uwe Schindler > -Original Message- > From: Michel Nadeau [mailto:aka...@gmail.com] > Sent: Monday, November 30, 2009 6:

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
The problem with this method is that I won't be able to know how many total results / pages a search has. For example, if I do a search X that returns 1,000,000 records, so 5,000 pages of 200 items, I will only know if I have more when I hit "next page" - I won't be able to display "1,000,000 r

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
> you think that something like this - > TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, > 200, > cluSort); This is a little bit faster as it does not need to intersect all query hits with the filter. > Would be more performant than using MatchAllDocsQuery with Filte

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
> Now I have another question... is there a way to specify a "start from" so > I > could get page 2, 3, 4, etc.. ? Search the mailing list, this was explained quite often (by others and me). The trick is: If you have 200 results per page, with n = 200 you get the top ranking results for the first
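Uwe's paging trick comes down to two small computations (a sketch; the names here are made up for illustration): for 1-based page p with n results per page, request the top n*p hits and display only the last n of them.

```java
public class PagingWindow {
    /** How many top hits to request from the searcher for this page. */
    public static int fetchCount(int pageSize, int page) {
        return pageSize * page;
    }

    /** Index of the first hit to display within the returned hits. */
    public static int windowStart(int pageSize, int page) {
        return pageSize * (page - 1);
    }

    public static void main(String[] args) {
        // Page 3 at 200 per page: request the top 600, display hits 400..599.
        System.out.println(fetchCount(200, 3));  // 600
        System.out.println(windowStart(200, 3)); // 400
    }
}
```

With Lucene 2.9 you would pass fetchCount(...) as the n argument of the Top(Field)Docs-returning search method and then iterate scoreDocs starting at windowStart(...). Re-collecting n*p hits per page is cheap for small page numbers, which is why the list recommends it over stateful cursors.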

Re: What does "out of order" mean?

2009-11-30 Thread Stefan Trcek
On Monday 30 November 2009 14:24:20 Michael McCandless wrote: > I agree, it's silly we label things like TopDocs/TopFieldDocs as > expert -- they are no longer for "low level" APIs (or, perhaps since > we've removed the "high level" API (= Hits), what remains should no > longer be considered low le

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
Uwe, you think that something like this - TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, 200, cluSort); Would be more performant than using MatchAllDocsQuery with Filters like this - TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200, cluSort); Thanks

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
I'm currently trying something like this - TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200, cluSort); cluCF = filters cluSort = sorts Now I have another question... is there a way to specify a "start from" so I could get page 2, 3, 4, etc.. ? - Mike aka...@gmail.com On

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
> And sorting is done by the > collector, Lucene has no idea how to sort. Sorting is done by the internal collector behind the Top(Field)Docs-returning method (your own collectors would have to do it themselves). If you call search(Query, n,... Sort), internally a collector is created that does t

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
You should use ConstantScoreQuery(filter) as the query if you want to filter all docs and need no scoring! This disables scoring automatically. It is the same as (but more performant than) combining MatchAllDocs with a Filter. If you only need the top 200 results, use TopDocs search(Query, int) and set t

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Ian Lea
Since you are just interested in retrieving the top n hits, sounds to me that TopDocs is the way to go. It's not a drop-in replacement for Hits but the transition is pretty straightforward. Then if MatchAllDocsQuery + filters gives you good enough performance you could stop, if it doesn't look at

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
I'll definitely switch to a Collector. It's just not clear to me if I should use BooleanQueries or MatchAllDocuments+Filters? And should I write my own collector, or is the TopDocs one enough for me? - Mike aka...@gmail.com On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson wrote: > The prob

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Erick Erickson
The problem with Hits is that it re-executes the query every N documents, where N is 100 (?). So, a loop like for (int idx : hits.length) { do something } Assuming my memory is right and it's every 100, your query will re-execute (length/100) times. Which is unfortunate. The very quick t

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
Great, thanks! So what do you guys think would be the best road for my application? I NEVER want to retrieve -all- documents, only like maximum 200. I always need to apply some filters and some sorts. From what I understand, in all cases I should switch from Hits to a Collector for performance rea

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
Hits is deprecated and should no longer be used. The replacements are TopDocs *or* Collectors. If you want a fixed number of top-scoring results (e.g. to display the first 10 results of a google-like query on a web page), use TopDocs. The method for that is Searcher.search(Query q, int n

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
What is the main difference between Hits and Collectors? - Mike aka...@gmail.com On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler wrote: > And if you only have a filter and apply it to all documents, make a > ConstantScoreQuery on top of the filter: > > Query q=new ConstantScoreQuery(cluCF); >

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
Hi ! Thanks so much !! * I'll check the documentation for MatchAllDocsQuery. * I'm already changing my code to create BooleanQueries instead of filters - is that better than MatchAllDocsQuery or it's the same? * Is using MatchAllDocsQuery the only way to disable scoring? * Would you have any good

RE: Performance problems with Lucene 2.9

2009-11-30 Thread Uwe Schindler
And if you only have a filter and apply it to all documents, make a ConstantScoreQuery on top of the filter: Query q=new ConstantScoreQuery(cluCF); Then remove the filter from your search method call and only execute this query. And if you iterate over all results never-ever use Hits! (its alre

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Shai Erera
Hi. First, you can use MatchAllDocsQuery, which matches all documents. It will save a HUGE posting list (TAG:TAG) and perform much faster. For example, TAG:TAG computes a score for each doc even though you don't need it; MatchAllDocsQuery doesn't. Second, move away from Hits! :) Use Collectors i

Performance problems with Lucene 2.9

2009-11-30 Thread Michel Nadeau
Hi, we use Lucene to store around 300 million records. We use the index both for conventional searching and for all the system's data - we replaced MySQL with Lucene because MySQL was simply not working at all due to the amount of records. Our problem is that we have HUGE perform

Re: MergePolicy$MergeException CorruptIndexException in lucene2.4.1

2009-11-30 Thread jm
On Mon, Nov 30, 2009 at 2:34 PM, Michael McCandless wrote: > On Mon, Nov 30, 2009 at 7:22 AM, jm wrote: >> No other exceptions I could spot. > > OK > >> OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a >> mac. > > That should be fine... > >> jvm: I made sure, java versio

Re: SpanQuery for Terms at same position

2009-11-30 Thread Christopher Tignor
It would take a bit of work / learning (I haven't used a RAMDirectory yet) to make them into test cases usable by others, and I am deep into this project and under the gun right now. But if some time surfaces I will for sure... thanks - C>T> On Wed, Nov 25, 2009 at 7:49 PM, Erick Erickson wrote

Re: splitting words

2009-11-30 Thread Erick Erickson
Hmmm, didn't we discuss this already? What about that discussion needs further clarification? The answer is probably a variant of SynonymAnalyzer. If you have a list of known words you could do the synonyms at index time, which is preferable. Best Erick On Mon, Nov 30, 2009 at 7:22 AM, m.harig

Re: MergePolicy$MergeException CorruptIndexException in lucene2.4.1

2009-11-30 Thread Michael McCandless
On Mon, Nov 30, 2009 at 7:22 AM, jm wrote: > No other exceptions I could spot. OK > OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a mac. That should be fine... > jvm: I made sure, java version "1.6.0_14" Good. > IndexWriter settings: >        writer.setMaxFieldLengt

Re: What does "out of order" mean?

2009-11-30 Thread Michael McCandless
I agree, it's silly we label things like TopDocs/TopFieldDocs as expert -- they are no longer for "low level" APIs (or, perhaps since we've removed the "high level" API (= Hits), what remains should no longer be considered low level). Do you wanna cough up a patch to correct these? Mike On Mon,

Re: MergePolicy$MergeException CorruptIndexException in lucene2.4.1

2009-11-30 Thread jm
No other exceptions I could spot. OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a mac. jvm: I made sure, java version "1.6.0_14" IndexWriter settings: writer.setMaxFieldLength(maxFieldLength); writer.setMergeFactor(10); writer.setRAMBufferSizeMB(

splitting words

2009-11-30 Thread m.harig
hello all, I have a doubt about word splitting in Lucene search: for example, if I search for dualcore it should return dual core. How do I split this word? Is there any analyzer in Lucene to do it? Please can anyone help me. -- View this message in context: http://old.nabble.com/splitting-words-tp265
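As far as I know there is no stock analyzer that splits an unknown compound like "dualcore" without help; the usual approach is dictionary-based decompounding (Lucene's contrib compound-word token filters work in a similar spirit). A minimal recursive sketch, where the tiny dictionary passed in is purely an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordSplitter {
    /**
     * Tries to split a concatenated word into known dictionary words,
     * longest prefix first, backtracking on failure. Returns null if no
     * complete segmentation exists. Exponential in the worst case - a
     * sketch, not production decompounding.
     */
    public static List<String> split(String word, Set<String> dict) {
        if (word.length() == 0) return new ArrayList<String>();
        for (int end = word.length(); end > 0; end--) {
            String head = word.substring(0, end);
            if (dict.contains(head)) {
                List<String> rest = split(word.substring(end), dict);
                if (rest != null) {
                    rest.add(0, head);
                    return rest;
                }
            }
        }
        return null; // no segmentation found
    }

    /** Convenience wrapper: returns the parts joined by spaces, or null. */
    public static String splitJoined(String word, String... dictWords) {
        List<String> parts = split(word, new HashSet<String>(Arrays.asList(dictWords)));
        if (parts == null) return null;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(parts.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(splitJoined("dualcore", "dual", "core")); // dual core
    }
}
```

At index time you would emit the recovered parts as extra tokens (the SynonymAnalyzer variant Erick suggests elsewhere in this digest), so both "dualcore" and "dual core" match at search time.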

Re: What does "out of order" mean?

2009-11-30 Thread Stefan Trcek
On Friday 27 November 2009 14:49:07 Michael McCandless wrote: > So the "don't care" equivalent here is to use IndexSearcher's normal > search APIs (ie, we don't use Version to switch this on or off). Hmm - Searcher/IndexSearchers search methods are "Low level", "Expert", "Expert + low level" or r