Hi siraj,
There is no way to find out the free space on a partition using Java 5
(Lucene 3.0) or Java 1.4 (Lucene 2.9) without native JNI calls, so
Lucene cannot calculate it before optimizing.
With Java 6 it would be possible, but Lucene 3.0 is only allowed to use Java
5: File#getUsableSpace
And if you have open IndexReaders/Searchers at the same time, use 3.5 as the
factor (because some files were already deleted from the directory but still
occupy space - *nix delete-on-last-close) :-)
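If you can run a check yourself on Java 6 before calling optimize, something
like this would work (untested sketch; the directory scan and the factor
argument are assumptions based on the numbers in this thread):

import java.io.File;

public class OptimizeGuard {
    // Rough heuristics from this thread: optimize can transiently need about
    // 2.5x the index size, or 3.5x if open IndexReaders still pin deleted files.
    public static boolean enoughSpaceToOptimize(File indexDir, double factor) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return false; // not a directory, or not readable
        }
        long indexSize = 0;
        for (File f : files) {
            indexSize += f.length();
        }
        // File#getUsableSpace is Java 6+ only, which is why Lucene cannot call it
        return indexDir.getUsableSpace() > (long) (indexSize * factor);
    }
}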
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Jason,
Thank you for your suggestion. That is what I am planning to do, but I
overheard or read somewhere that the new Lucene version can take care of
that internally, so I was just trying to see if somebody knows something
about it.
regards
-siraj
Jason Rutherglen wrote:
Siraj,
You could estimate the maximum size used during optimization at 2.5 (a
sort of rough maximum) times your current index size, and not optimize
if your index (at 2.5 times) would exceed your allowable disk space.
Jason
On Mon, Nov 30, 2009 at 2:50 PM, Siraj Haider wrote:
Index optimization fails if we don't have enough space on the drive and
leaves the hard drive almost full. Is there a way to not even start the
optimization if we don't have enough space on the drive?
regards
-siraj
On Mon, Nov 30, 2009 at 4:07 PM, Shai Erera wrote:
> Thanks again, I'll use this table as well.
you should only use it if you are normalizing to NFKC or NFKD afterwards...
> What I do is read those tables
> and store in a char[], for fast lookups of folding chars. I noticed your
> comments in
Uwe Schindler wrote:
There are two answers:
It's often a good idea if you mostly need the full representation in one
call. E.g. we have the complete XML representation in a stored field and use
it for display with XSLT and so on. Other fields are for indexing only and
do not get stored.
I alw
Thanks again, I'll use this table as well. What I do is read those tables
and store them in a char[], for fast lookups of folding chars. I noticed your
comments in the code about not doing so because then the tables would need
to be updated once in a while, and I agree. But ICU's lack of a char[] API
drov
Shai, no, behind the scenes I am using just that table, via the ICU library.
The only reason the CaseFoldingFilter in my patch is more complex is that
I also apply FC_NFKC_Closure mappings.
You can apply these tables in your impl too if you are also using
normalization; they are here:
http://unico
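Something like the following is the core of that idea (untested sketch using
ICU4J's UCharacter.foldCase; the FC_NFKC_Closure handling from my patch is
left out and the class name is made up):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import com.ibm.icu.lang.UCharacter;

public final class SimpleCaseFoldingFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public SimpleCaseFoldingFilter(TokenStream input) {
        super(input);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Unicode case folding: handles e.g. ss and final sigma correctly,
        // unlike a plain lowercase filter.
        String folded = UCharacter.foldCase(termAtt.term(),
                                            UCharacter.FOLD_CASE_DEFAULT);
        termAtt.setTermBuffer(folded);
        return true;
    }
}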
Thanks Robert. In my Analyzer I do case folding according to Unicode tables.
So ß is converted to "SS". I do the same for diacritic removal and
Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the
"SS" to "ss".
I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
t
> I'm trying to fix my code to remove everything that is deprecated in order
> to move to Lucene 3.0. I fixed many, many items but I can't find the answers
> to some questions. See items in red below:
>
> *#1. Opening an index*
> *idx = FSDirectory.getDirectory(new File(INDEX));
> reader = IndexRead
Hi !
I'm trying to fix my code to remove everything that is deprecated in order
to move to Lucene 3.0. I fixed many, many items but I can't find the answers
to some questions. See items in red below:
*#1. Opening an index*
*idx = FSDirectory.getDirectory(new File(INDEX));
reader = IndexReader.open(
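For #1, the usual 3.0-style replacement looks like this (untested sketch; the
index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OpenIndex {
    public static void main(String[] args) throws Exception {
        // FSDirectory.getDirectory(File) is gone; FSDirectory.open picks the
        // best Directory implementation for the platform.
        Directory idx = FSDirectory.open(new File("/path/to/index"));
        // Second argument: open read-only, which is faster for pure searching.
        IndexReader reader = IndexReader.open(idx, true);
        System.out.println("numDocs=" + reader.numDocs());
        reader.close();
        idx.close();
    }
}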
On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera wrote:
> Robert, what if I need to do additional filtering after CollationKeyFilter,
> like stopwords removal, abbreviations handling, stemming etc? Will that be
> possible if I use CollationKeyFilter?
>
>
Shai, great point. This won't work with CollationKeyFilter
Shai, again the problem is not really performance (I am ignoring that for
now), but the fact that lowercasing and case folding are different.
An easy example: the lowercase of ß is ß itself, it is already lowercase.
It will not match 'SS' if you use a lowercase filter.
If you use case folding,
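A quick way to see the difference (output in comments; needs ICU4J on the
classpath):

import com.ibm.icu.lang.UCharacter;

public class FoldVsLower {
    public static void main(String[] args) {
        System.out.println("ß".toLowerCase());  // ß - already lowercase, unchanged
        System.out.println("SS".toLowerCase()); // ss - so ß and SS never meet
        // case folding maps both to the same form:
        System.out.println(UCharacter.foldCase("ß", UCharacter.FOLD_CASE_DEFAULT));  // ss
        System.out.println(UCharacter.foldCase("SS", UCharacter.FOLD_CASE_DEFAULT)); // ss
    }
}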
Robert, what if I need to do additional filtering after CollationKeyFilter,
like stopwords removal, abbreviations handling, stemming etc? Will that be
possible if I use CollationKeyFilter?
I also noticed CKF creates a String out of the char[]. If the code already
does that, why not use String.toLowerCase
Hi Simon,
> > and RussianLowerCaseFilter is deprecated now, it does the exact same thing
> > as LowerCaseFilter.
> btw. we should fix supplementary chars in there too even if it is
> deprecated.
Deprecated classes should never change, and certainly not add Version ctors!
If somebody wants to use
On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir wrote:
>> I am not sure if it is worth to add a new TokenFilter for Turkish language.
>> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
>> be nice to see TurkishLowerCaseFilter in Lucene.
>>
>>
>>
> just to clarify, GreekLowerCaseFilter
yes, this is what I would do! The downside to using collation in your filter
chain right now is that your terms in the index will not be human-readable.
The upside is they will both sort and search the way your users expect for a
huge list of languages.
On Mon, Nov 30, 2009 at 2:22 PM, AHMET
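For example, a chain ending in CollationKeyFilter could look like this
(untested sketch; the tokenizer choice and the locale are assumptions):

import java.io.Reader;
import java.text.Collator;
import java.util.Locale;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.collation.CollationKeyFilter;

public class CollatingAnalyzer extends Analyzer {
    private final Collator collator = Collator.getInstance(new Locale("tr"));

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Any stopword/stemming filters must run before this point, because
        // CollationKeyFilter replaces each term with a binary collation key.
        return new CollationKeyFilter(new WhitespaceTokenizer(reader), collator);
    }
}

The same Collator (same locale and strength) has to be used at query time,
otherwise the keys will not match.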
> just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> final sigma problem it has (where there are two lowercase forms depending
> upon position in word), this is also solved with unicode case folding or
> collation. This is a perfect example of how lowercase is the wr
> I am not sure if it is worth to add a new TokenFilter for Turkish language.
> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
> be nice to see TurkishLowerCaseFilter in Lucene.
>
>
>
just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
final sigma p
Hello, there is already an issue of this.
The basics are that lowercase with locale is still not even right. because,
its intended for presentation (display), not for case folding.
the problem is case folding is not exposed in the JDK, and you have to use
the alternate "turkish/azeri" mappings an
In the Turkish alphabet, the lowercase of I is not i; it is LATIN SMALL LETTER
DOTLESS I. LowerCaseFilter, which uses Character.toLowerCase(), makes a mistake
just for that character.
http://java.sun.com/javase/6/docs/api/java/lang/String.html#toLowerCase()
I am not sure if it is worth to add a new TokenFilter for Turkish language.
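To see the problem directly (output in comments):

import java.util.Locale;

public class TurkishLower {
    public static void main(String[] args) {
        System.out.println("I".toLowerCase(Locale.ENGLISH));   // i
        System.out.println("I".toLowerCase(new Locale("tr"))); // ı (dotless i)
        // Character.toLowerCase has no locale parameter at all, so a
        // LowerCaseFilter built on it always produces the English mapping:
        System.out.println(Character.toLowerCase('I'));        // i
    }
}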
There are two answers:
It's often a good idea if you mostly need the full representation in one
call. E.g. we have the complete XML representation in a stored field and use
it for display with XSLT and so on. Other fields are for indexing only and
do not get stored.
BUT:
If you only need parts o
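In code, the first approach (one stored XML field plus index-only fields)
looks roughly like this (untested sketch; field names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoredVsIndexed {
    static Document build(String artist, String xml) {
        Document doc = new Document();
        // index-only field: searchable, but not retrievable
        doc.add(new Field("artist", artist, Field.Store.NO, Field.Index.ANALYZED));
        // store-only field: kept verbatim for display (e.g. XSLT), never searched
        doc.add(new Field("xml", xml, Field.Store.YES, Field.Index.NO));
        return doc;
    }
}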
Currently in our Lucene Search we have a number of distinct fields that
are indexed and stored, so that the fields can be searched and we can
then construct an xml representation of the match
(http://wiki.musicbrainz.org/Next_Generation_Schema/SearchServerXML) but
on further reading it appears
On Mon, Nov 30, 2009 at 12:22 PM, Stefan Trcek wrote:
I would, but I was not able to reach the svn repo some months ago. I
have to ask the sysadmin to open a door through the firewall for any svn
repo. Gave up due to:
$ nmap -p3690 svn.apache.org
PORT STATE SERVICE
3690/tcp filtered svn
I was able to apply that git patch just fine -- so I think it'll work?
Thanks!
Mike
On Mon, Nov 30, 2009 at 12:22 PM, Stefan Trcek wrote:
> On Monday 30 November 2009 14:24:20 Michael McCandless wrote:
>> I agree, it's silly we label things like TopDocs/TopFieldDocs as
>> expert -- they are no
The total number is also returned in Top(Field)Docs, there is a getter
method.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Michel Nadeau [mailto:aka...@gmail.com]
> Sent: Monday, November 30, 2009 6:
The problem with this method is that I won't be able to know how many total
results / pages a search has.
For example, if I do a search X that returns 1,000,000 records, i.e. 5,000
pages of 200 items, I will only know if I have more when I hit "next
page" - I won't be able to display "1,000,000 results"
> you think that something like this -
> TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null,
> 200,
> cluSort);
This is a little bit faster, as it does not need to intersect all the query
hits with the filter.
> Would be more performant than using MatchAllDocsQuery with Filte
> Now I have another question... is there a way to specify a "start from" so
> I
> could get page 2, 3, 4, etc.. ?
Search the mailing list, this was explained quite often (by others and me).
The trick is:
If you have 200 results per page, with n = 200 you get the top ranking
results for the first
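In code the trick looks like this (untested sketch; page is 1-based):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;

public class Pager {
    public static ScoreDoc[] page(IndexSearcher searcher, Query q, Sort sort,
                                  int page, int pageSize) throws IOException {
        // Ask for all hits up to the end of the requested page...
        TopFieldDocs tfd = searcher.search(q, null, page * pageSize, sort);
        // ...tfd.totalHits is the full match count, for the "N results" display.
        int start = (page - 1) * pageSize;
        if (start >= tfd.scoreDocs.length) {
            return new ScoreDoc[0]; // page is past the end of the results
        }
        int end = Math.min(tfd.scoreDocs.length, start + pageSize);
        ScoreDoc[] result = new ScoreDoc[end - start];
        // ...and keep only the slice that belongs to this page.
        System.arraycopy(tfd.scoreDocs, start, result, 0, result.length);
        return result;
    }
}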
On Monday 30 November 2009 14:24:20 Michael McCandless wrote:
> I agree, it's silly we label things like TopDocs/TopFieldDocs as
> expert -- they are no longer for "low level" APIs (or, perhaps since
> we've removed the "high level" API (= Hits), what remains should no
> longer be considered low le
Uwe,
do you think that something like this -
TopFieldDocs tfd = searcher.search(new ConstantScoreQuery(cluCF), null, 200,
cluSort);
Would be more performant than using MatchAllDocsQuery with Filters like this
-
TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200,
cluSort);
Thanks
I'm currently trying something like this -
TopFieldDocs tfd = searcher.search(new MatchAllDocsQuery(), cluCF, 200,
cluSort);
cluCF = filters
cluSort = sorts
Now I have another question... is there a way to specify a "start from" so I
could get page 2, 3, 4, etc.. ?
- Mike
aka...@gmail.com
On
> And sorting is done by the
> collector, Lucene has no idea how to sort.
Sorting is done by the internal collector behind the
Top(Field)Docs-returning method (your own collectors would have to do it
themselves). If you call search(Query, n,... Sort), internally a collector
is created that does t
You should use ConstantScoreQuery(filter) as the query if you want to filter all
docs and need no scoring! This disables scoring automatically. It is the
same as (but more performant than) combining MatchAllDocsQuery with a Filter.
If you only need the top 200 results, use TopDocs search(Query, int) and set
t
Since you are just interested in retrieving the top n hits, it sounds to
me that TopDocs is the way to go. It's not a drop-in replacement for
Hits, but the transition is pretty straightforward.
Then, if MatchAllDocsQuery + filters gives you good enough performance
you could stop; if it doesn't, look at
I'll definitely switch to a Collector.
It's just not clear to me if I should use BooleanQueries or
MatchAllDocsQuery+Filters.
And should I write my own collector, or is the TopDocs one perfect for me?
- Mike
aka...@gmail.com
On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson wrote:
> The prob
The problem with Hits is that it re-executes the query
every N documents, where N is 100 (?).
So, a loop like
for (int i = 0; i < hits.length(); i++) {
    // do something with hits.doc(i)
}
Assuming my memory is right and it's every 100, your query will
re-execute (length/100) times. Which is unfortunate.
The very quick t
Great, thanks!
So what do you guys think would be the best road for my application? I NEVER
want to retrieve -all- documents, only a maximum of around 200. I always need to
apply some filters and some sorts. From what I understand, in all cases I
should switch from Hits to a Collector for performance reasons
Hits is deprecated and should no longer be used. The replacements are
TopDocs *or* Collectors.
If you want a number of top-scoring results (e.g. to display
the first 10 results of a google-like query on a web page), use TopDocs. The
method for that is Searcher.search(Query q, int n
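The basic pattern is (untested sketch; the field name is made up):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TopDocsExample {
    static void showTop10(IndexSearcher searcher, Query query) throws IOException {
        // replaces the deprecated: Hits hits = searcher.search(query)
        TopDocs topDocs = searcher.search(query, 10); // top 10 by score
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = searcher.doc(sd.doc); // load the stored fields
            System.out.println(doc.get("title"));
        }
    }
}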
What is the main difference between Hits and Collectors?
- Mike
aka...@gmail.com
On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler wrote:
> And if you only have a filter and apply it to all documents, make a
> ConstantScoreQuery on top of the filter:
>
> Query q=new ConstantScoreQuery(cluCF);
>
Hi !
Thanks so much !!
* I'll check the documentation for MatchAllDocsQuery.
* I'm already changing my code to create BooleanQueries instead of filters -
is that better than MatchAllDocsQuery or is it the same?
* Is using MatchAllDocsQuery the only way to disable scoring?
* Would you have any good
And if you only have a filter and apply it to all documents, make a
ConstantScoreQuery on top of the filter:
Query q=new ConstantScoreQuery(cluCF);
Then remove the filter from your search method call and only execute this
query.
And if you iterate over all results, never-ever use Hits! (it's already deprecated
Hi
First, you can use MatchAllDocsQuery, which matches all documents. It
saves reading a HUGE posting list (TAG:TAG) and performs much faster. For
example, TAG:TAG computes a score for each doc, even though you don't need it;
MatchAllDocsQuery doesn't.
Second, move away from Hits! :) Use Collectors instead
Hi,
we use Lucene to store around 300 million records. We use the index both
for conventional searching and for all the system's data - we replaced
MySQL with Lucene because it was simply not working at all due to
the amount of records. Our problem is that we have HUGE performance
On Mon, Nov 30, 2009 at 2:34 PM, Michael McCandless
wrote:
> On Mon, Nov 30, 2009 at 7:22 AM, jm wrote:
>> No other exceptions I could spot.
>
> OK
>
>> OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a
>> mac.
>
> That should be fine...
>
>> jvm: I made sure, java versio
It would take a bit of work / learning (I haven't used a RAMDirectory
yet) to make them into test cases usable by others, and I am deep into this
project and under the gun right now. But if some time surfaces I will for
sure...
thanks -
C>T>
On Wed, Nov 25, 2009 at 7:49 PM, Erick Erickson wrote
Hmmm, didn't we discuss this already? What about that
discussion needs further clarification?
The answer is probably a variant of SynonymAnalyzer. If
you have a list of known words you could do the synonyms
at index time, which is preferable.
Best
Erick
On Mon, Nov 30, 2009 at 7:22 AM, m.harig
On Mon, Nov 30, 2009 at 7:22 AM, jm wrote:
> No other exceptions I could spot.
OK
> OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a mac.
That should be fine...
> jvm: I made sure, java version "1.6.0_14"
Good.
> IndexWriter settings:
> writer.setMaxFieldLengt
I agree, it's silly we label things like TopDocs/TopFieldDocs as
expert -- they are no longer for "low level" APIs (or, perhaps since
we've removed the "high level" API (= Hits), what remains should no
longer be considered low level).
Do you wanna cough up a patch to correct these?
Mike
On Mon,
No other exceptions I could spot.
OS: win2003 32bits, with NTFS. This is a vm running on vmware fusion on a mac.
jvm: I made sure, java version "1.6.0_14"
IndexWriter settings:
writer.setMaxFieldLength(maxFieldLength);
writer.setMergeFactor(10);
writer.setRAMBufferSizeMB(
hello all,
I have a doubt about Lucene splitting words in search. For example, if I search
for dualcore it should return dual core. How do I split this word? Is
there any analyzer in Lucene to do it? Please, can anyone help me?
On Friday 27 November 2009 14:49:07 Michael McCandless wrote:
> So the "don't care" equivalent here is to use IndexSearcher's normal
> search APIs (ie, we don't use Version to switch this on or off).
Hmm - Searcher/IndexSearcher's search methods are "Low
level", "Expert", "Expert + low level" or r