Use of In-like query and performance implications

2005-03-02 Thread Paul Smith
. my question is, are there any performance concerns here if ("...In(g,h,i,j,) ") starts getting longer and longer? Can Lucene handle this in an optimal manner, without a serious scalability issue? (memory/cpu/io etc). Or would it be better that a different design is used for th

Re: Strategies for updating indexes.

2005-04-05 Thread Paul Smith
your application too, which is very useful for a single instance, and can be easily broken out to be used in a clustered environment. cheers, Paul Smith - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [

Hungarian notation analyzer and phrase queries

2005-04-12 Thread Paul Smith
I am writing a document management system for my company, and many of our feature names are in Hungarian notation (PowerQuery, TransactionManager, etc.). This can make it hard to find some things with a default analyzer. I'd like to be able to index text like "Use PowerQuery for advanced searches"
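
A minimal sketch of the splitting step such an analyzer would need, assuming plain Java rather than any Lucene API. The class name `CamelCaseSplitSketch` and the method `tokens` are hypothetical, and token positions (which matter for phrase queries) are not modelled here; this only shows how "PowerQuery" could yield the whole term plus its parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns one CamelCase word into
// the lowercased whole term followed by its lowercased parts.
final class CamelCaseSplitSketch {

    static List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        // Always keep the whole term so a search for "powerquery" still matches.
        out.add(word.toLowerCase(Locale.ROOT));
        // Split wherever a lowercase letter or digit is followed by an uppercase letter.
        String[] parts = word.split("(?<=[a-z0-9])(?=[A-Z])");
        if (parts.length > 1) {
            for (String part : parts) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("PowerQuery"));
        System.out.println(tokens("TransactionManager"));
        System.out.println(tokens("searches"));
    }
}
```

In a real analyzer these extra tokens would have to be emitted from a token filter, and the choice of position for each part decides whether a phrase query needs slop.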

Re: Hungarian notation analyzer and phrase queries

2005-04-13 Thread Paul Smith
Thanks for your help guys! If you put the term query at position 2 then you need slop to find "Use PowerQuery for advanced searches", which is the exact text in the document. I think I'd rather have that phrase query work without any slop, and require some slop for "use power query for advanced

Re: Re[2]: multi word synonym (was Hungarian notation analyzer and phrase queries)

2005-04-29 Thread Paul Smith
Indexing every multi-word synonym as a single token would introduce spaces into the tokens. In that case searching for (java) would not match "i love jsp and tomcat". I think that searching for (java*) would match. Rewriting the query is also problematic. If you search for (java server), you don't

Re: Index Replication / Clustering

2005-06-26 Thread Paul Smith
hout the main application knowing anything about it. Paul Smith On 26/06/2005, at 2:35 AM, Stephane Bailliez wrote: I have been browsing the archives concerning this particular topic. I'm in the same boat and the customer has clustering requirements. To give some background: I ha

Re: Index Replication / Clustering

2005-06-27 Thread Paul Smith
If you use ActiveMQ for JMS, you can take advantage of its Composite Destination feature and have a virtual Queue/Topic that is actually several Queues/Topics. This is what we use to keep a mirror index server completely in sync. The application sends an update message to a queue

Re: Index Replication / Clustering

2005-06-27 Thread Paul Smith
On 27/06/2005, at 7:14 PM, Nader Henein wrote: I implemented a JMS based solution about a year ago because I thought it would solve my atomicity problem and give me a centralized way of indexing, you'll have to use the pluggable persistence (if you use ActiveMQ) to be able to recover from

Index Partitioning ( was Re: Search deadlocking under load)

2005-07-08 Thread Paul Smith
omatically closed? Appreciate any thoughts on this. I'd rather know now while I have the opportunity to change the design than later when in production.. :) cheers, Paul Smith On 09/07/2005, at 5:39 AM, Otis Gospodnetic wrote: Nathan, 3) is the recommended usage. Your index is on an

Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-10 Thread Paul Smith
On 11/07/2005, at 9:15 AM, Chris Hostetter wrote: : Nathan's point about pooling Searchers is something that we also : addressed by a LRU cache mechanism. In testing we also found that Generally speaking, you only ever need one active Searcher, which all of your threads should be able to u
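
The LRU cache mechanism mentioned above might be sketched as follows. This is an assumption about the general shape, not the poster's code: `LruCloseableCacheSketch` is a hypothetical name, the cached values are any `AutoCloseable` (standing in for an open searcher), and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keeps at most maxOpen resources and closes the
// least recently used one when a new entry pushes the cache over the limit.
final class LruCloseableCacheSketch<K, V extends AutoCloseable> extends LinkedHashMap<K, V> {
    private final int maxOpen;

    LruCloseableCacheSketch(int maxOpen) {
        // accessOrder = true makes iteration order follow recency of access.
        super(16, 0.75f, true);
        this.maxOpen = maxOpen;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() <= maxOpen) {
            return false;
        }
        try {
            eldest.getValue().close();
        } catch (Exception e) {
            // Sketch only: a real cache would log or rethrow.
        }
        return true;
    }
}
```

A production version would also need synchronization and some way to avoid closing a resource that another thread is still using.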

Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-10 Thread Paul Smith
On 11/07/2005, at 10:43 AM, Chris Hostetter wrote: : > Generally speaking, you only ever need one active Searcher, which : > all of : > your threads should be able to use. (Of course, Nathan says that : > in his : > code base, doing this causes his JVM to freeze up, but I've never seen : >

Re: Re[2]: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-11 Thread Paul Smith
Many thanks for confirming the principles should work fine. It is a load off my mind! :) On index update, a small Event is placed into a Buffer, which is processed periodically (every 30 seconds) to coalesce the events and then ensure that any open IndexSearcher in the cache is closed. On 12/07
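
The buffering described in this message could look roughly like the sketch below. It assumes that events are identified by an index name and that closing a cached searcher is delegated to a callback; `UpdateEventBufferSketch`, `indexUpdated` and `flush` are hypothetical names, not taken from the thread.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch: collects "index X was updated" events and, on each
// flush, invokes the callback once per distinct index.
final class UpdateEventBufferSketch {
    private final Set<String> pending = new LinkedHashSet<>();
    private final Consumer<String> closeCachedSearcher;

    UpdateEventBufferSketch(Consumer<String> closeCachedSearcher) {
        this.closeCachedSearcher = closeCachedSearcher;
    }

    // Called on every index update; repeated updates to one index coalesce.
    synchronized void indexUpdated(String indexName) {
        pending.add(indexName);
    }

    // Called periodically; returns how many distinct indexes were processed.
    int flush() {
        Set<String> batch;
        synchronized (this) {
            batch = new LinkedHashSet<>(pending);
            pending.clear();
        }
        // Run callbacks outside the lock so new updates are not blocked.
        batch.forEach(closeCachedSearcher);
        return batch.size();
    }
}
```

The 30 second period could be driven by something like `ScheduledExecutorService.scheduleWithFixedDelay(buffer::flush, 30, 30, TimeUnit.SECONDS)`.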

Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-12 Thread Paul Smith
On 13/07/2005, at 1:34 AM, Chris Hostetter wrote: : Since this isn't in production yet, I'd rather be proven wrong now : rather than later! :) it sounds like what you're doing makes a lot of sense given your situation, and the nature of your data. the one thing you might not have considered

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
ghput problem on that too). Would love to see something like this work really well, and perhaps generalize it a bit more. I do like the simplicity of the SEDA principles. cheers, Paul Smith On 14/07/2005, at 11:50 PM, Peter Gelderbloem wrote: I am currently looking into building a

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
m in the same index? Maybe you need those individual, smaller indices to be separate How do you deal with the possibility of the same Document being present in multiple indices? Otis --- Paul Smith <[EMAIL PROTECTED]> wrote: I had a crack at whipping up something along this li

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
, Paul Smith wrote: My punt was that having workers create sub-indexes (creating the documents and making a partial index) and shipping the partial index back to the queen to merge may be more efficient. It's probably not, I was just using the day as a chance to see if it looked prom

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
answering my own question: nutch.org -> lucene.apache.org/nutch/ Excellent! Paul On 15/07/2005, at 11:45 AM, Paul Smith wrote: Cl, I should go have a look at that.. That begs another question though, where does Nutch stand in terms of the ASF? Did I read (or dream) that Nutch may

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
On 15/07/2005, at 3:57 PM, Otis Gospodnetic wrote: The problem that I saw (from your email only) with the "ship the full little index to the Queen" approach is that, from what I understand, you eventually do addIndexes(Directory[]) in there, and as this optimizes things in the end, this means y

RAID Stripe sizes - suggestions?

2009-02-21 Thread Paul Smith
I'm just wondering if anyone can share with us their learnings on optimizing their storage configurations for relatively large indexes (millions of documents, 10+Gb in size). Is there a 'suggested best' Stripe size for RAID-10 configurations? I did some Googling, and was surprised I couldn't f

Re: Memory Leak?

2009-03-24 Thread Paul Smith
No, I don't hit OOME if I comment out the call to getHTMLTitle. The heap behaves perfectly. I completely agree with you, the thread count goes haywire the moment I call the HTMLParser.getTitle(). I have seen a thread count of like 600 before I hit OOME (with the getTitle() call on) and

Re: Huge number of Term objects in memory gives OutOfMemory error

2008-03-17 Thread Paul Smith
duce the Terms in memory, but I have not seen how to set this value for Lucene. Any help would be greatly appreciated. Rich Paul Smith Core Engineering Manager Aconex

Re: Memory Usage

2008-07-03 Thread Paul Smith
en once you get past the synchronization bottleneck in the CollationKey stuff). cheers, Paul Smith

Re: Searching Log Files

2008-10-14 Thread Paul Smith
On 15/10/2008, at 7:37 AM, Chris Gilliam wrote: Hello Everyone, New to Lucene.. We currently have roughly 100Gig of log files. We need to build a search application that can return rows of data from the files and combine the results. Does Lucene index the content in the files? Will i

Re: Multisearcher will maintain index order sorting?

2008-10-22 Thread Paul Smith
On 23/10/2008, at 4:20 PM, Ganesh wrote: My Index DB has 10 million records and it will grow to 30 million. Currently I am using millisecond timestamps and the RAM consumption is high. I will change the resolution to minute. I am using 2 searcher objects refreshing each other every min

Re: Performance of never optimizing

2008-11-05 Thread Paul Smith
I don't believe our large users have enough memory for Lucene indexes to fit in RAM. (Especially given we use quite a bit of RAM for other stuff.) I think we also close readers pretty frequently (whenever any user updates a JIRA issue, which I am assuming is happening nearly constantly

Sorting, RuleBasedCollater, and synchronization bottleneck

2007-02-14 Thread Paul Smith
is synchronized. I wonder if a ThreadLocal based collator would be better here... ? There doesn't appear to be a reason for other threads searching the same index to wait on this sort. Be just as easy to use their own. (Is RuleBasedCollator a "heavy" object memory wise? Wou
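
The ThreadLocal idea floated here could be sketched as below. This uses `java.text.Collator` directly and is not Lucene's sorting code; the class name `PerThreadCollatorSketch`, the choice of `Locale.ENGLISH`, and the helper methods are assumptions made for illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: each thread lazily gets its own Collator, so
// threads never contend on one shared, synchronized instance.
final class PerThreadCollatorSketch {
    private static final ThreadLocal<Collator> COLLATOR =
            ThreadLocal.withInitial(() -> Collator.getInstance(Locale.ENGLISH));

    // Returns the calling thread's private Collator.
    static Collator collator() {
        return COLLATOR.get();
    }

    // Locale-sensitive comparison using the calling thread's Collator.
    static int compare(String a, String b) {
        return collator().compare(a, b);
    }

    // Returns a sorted copy; the input array is left untouched.
    static String[] sorted(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy, PerThreadCollatorSketch::compare);
        return copy;
    }
}
```

The trade-off is one collator instance per searching thread instead of one per locale, which is the memory question raised in the message.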

Re: IndexSearcher on multi-core CPU machine

2007-02-18 Thread Paul Smith
are you using Locale-sensitive sorting at all? https://issues.apache.org/jira/browse/LUCENE-806 Just wondering if you're seeing the same problem we are having. cheers, Paul Smith On 19/02/2007, at 8:52 AM, dmitri wrote: We have search (no update) web app on 2 dual core CPU machin

Re: Field Boosting

2005-11-17 Thread Paul Smith
This would be a good candidate for an IllegalStateException to be thrown if the user calls this method when it's not valid. Save the user some hassles? (one can JavaDoc until one is blue in the face, but throwing a good RuntimeException with a message trains the users much quicker... :) ) P

Re: References to deleted file handles in long-running server application

2005-11-18 Thread Paul Smith
he user has probably navigated away and given up on the long running search anyway). Paul Smith On 18/11/2005, at 6:57 PM, Matt Magoffin wrote: I'm updating nearly continuously (probably average about every 10 seconds). I don't explicitly close the IndexSearcher objects I create, as I sha

"Starts with" query?

2006-01-05 Thread Paul Smith
How do I do that with Lucene? I'm sure this is a dumb question, and I know that Lucene's searching is way more useful than that, but you know these pesky compatibility requirements. It's screwing with my unit tests

Re: "Starts with" query?

2006-01-05 Thread Paul Smith
On 06/01/2006, at 9:33 AM, Chris Hostetter wrote: : Think SQL of " where title like 'The quick%' ". I solved this problem by having a variation of my field that was not tokenized, and did PrefixQueries on that field (so in your case, leave your title field alone for generic matches, and

Re: "Starts with" query?

2006-01-05 Thread Paul Smith
1) also index the field untokenized and use a straight prefix query See my reply to Chris, not sure I can afford the index size increment. 2) index a magic token at the start of the title and include that in a phrase query: "_START_ the quick" h, that's clever. 3) use a SpanFirst quer

Re: "Starts with" query?

2006-01-05 Thread Paul Smith
2) index a magic token at the start of the title and include that in a phrase query: "_START_ the quick" Ok, I've gone and chosen "0start0" as my start token, because our analyzer is stripping _. Now, second dumb question of the day, given the search for starts with "The qui*", that has t

Re: "Starts with" query?

2006-01-05 Thread Paul Smith
query trick works if it searches on title:"0start0 auto*" but does not find any matches for title:"0start0 aut*" I'm a bit stuck. Paul On 06/01/2006, at 10:43 AM, Paul Smith wrote: 2) index a magic token at the start of the title and include tha

Re: "Starts with" query?

2006-01-05 Thread Paul Smith
one thing you may not have thought about yet that may affect your decision: sorting in lucene requires the field be indexed but untokenized. so if you want to support sorting on the conceptual "title", you'll still need a version of your title field that's untokenized, which can then be u

Re: Memory

2006-01-16 Thread Paul Smith
his all the time in a Tomcat app server box, where each Http Connector is a thread, and appears as its own process. cheers, Paul Smith On 17/01/2006, at 7:11 AM, Aigner, Thomas wrote: Hi all, Is anyone experiencing possible memory problems on LINUX with Lucene search? Here is o

CompoundFileReader question/'leaking' file descriptors ?

2006-02-12 Thread Paul Smith
assumably, close the file. The guard here is that the finalizer method in FSInputStream does call close() so that would well explain the releasing of file handles at garbage collection intervals. Why would CompoundFileReader not need to call .close()? Am I going mad here and just seeing ghosts? Comments appreciated. Paul Smith

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Paul Smith
On 14/02/2006, at 7:44 AM, Doug Cutting wrote: Paul Smith wrote: We're using Lucene 1.4.3, and after hunting around in the source code just to see what I might be missing, I came across this, and I'd just like some comments. Please try using a 1.9 build to see if this is

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Paul Smith
or deletion. Waiting the amount of time for the IndexSearcher to close sees the file descriptor released. Sorry for the intrusion. cheers, Paul Smith

Re: javadoc lookup

2006-03-01 Thread Paul Smith
That is neat... nice work. On 02/03/2006, at 10:23 AM, Larry Ogrodnek wrote: Hey, I put together a little ajax / lucene javadoc lookup site that I just wanted to share I've found it pretty useful to be able to just type a few letters instead of navigating through the standard javadoc fr

Re: Poor performance "race condition" in FieldSortedHitQueue

2006-08-08 Thread Paul Smith
y 3.75 US cents). Paul Smith