Re: Concurrent Indexing and Searching

2009-09-25 Thread Jake Mannix
Hi Klaus, If you've really still got 500MB of changes to your index since the last time you commit()'ed, then the call to commit() will be costly and take a while to complete. If in another thread, you reopen() an IndexReader pointing to that index, it will only see changes since the most recen

Re: How to setup a scalable deployment?

2009-10-06 Thread Jake Mannix
of queries a day in real time (meaning milliseconds, even under fairly high indexing load) for the past year. -jake mannix

Re: Best strategy for reindexing large amount of data

2009-10-07 Thread Jake Mannix
I think a Hadoop cluster is maybe a bit overkill for this kind of thing - it's pretty common to have to do "grandfathering" of an index when you have new features, and just doing it in place with IndexWriter.updateDocument() can work just fine as long as you are not very frequently reopening your index. T

Re: Index splitter

2009-10-07 Thread Jake Mannix
As long as you don't have to split up a fully optimized index, or one with the wrong number of segments for the division you want to do, that would be useful. Of course, sometimes you need to split up the big segments into smaller ones too, but the only way I've done that in the past is basically:

Re: Best strategy for reindexing large amount of data

2009-10-07 Thread Jake Mannix
stantly updating the index with new info, we're also reopening it very > frequently to make the new info appear in query results. Would that > disqualify the update method? And what do you mean by "not very > frequently". > Is every 5 min too much? > > Thanks agai

Re: Help needed figuring out reason for maxClauseCount is set to 1024 error

2009-10-07 Thread Jake Mannix
On Wed, Oct 7, 2009 at 4:42 PM, mitu2009 wrote: > > Hi, > > I've two sets of search indexes. TestIndex (used in our test environment) > and ProdIndex(used in PRODUCTION environment). Lucene search query: > +date:[20090410184806 TO 20091007184806] works fine for test index but > gives > this error

Re: 2.9: TopScoreDocCollector

2009-10-07 Thread Jake Mannix
Hi Eric, Different Query classes have different options on whether they can score docs out of order, or if they always proceed in order, so the way to make sure you're choosing the right value, if you don't know which you need, is to ask your Query (or more appropriately, its Weight): Query

Re: Help needed figuring out reason for maxClauseCount is set to 1024 error

2009-10-07 Thread Jake Mannix
taphi.de > > > > -Original Message- > > From: Jake Mannix [mailto:jake.man...@gmail.com] > > Sent: Thursday, October 08, 2009 2:35 AM > > To: java-user@lucene.apache.org > > Subject: Re: Help needed figuring out reason for maxClauseCount is set

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
Jason, On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote: > Today near realtime search (with or without SSDs) comes at a > price, that is reduced indexing speed due to continued in RAM > merging. People typically hack something together where indexes > are held in a RAMDir until being flush

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote: > > Does anyone have any recommendations? I've looked at Katta, but it doesn't > seem to support realtime searching. It also uses hdfs, which I've heard can > be slow. I'm looking to serve 40gb of indexes and support about 1 million > updates

Re: Realtime & distributed

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote: > There is the Zoie system which uses the RAMDir > solution, > Also, to clarify: zoie does not index into a RAMDir and then periodically merge that down to disk, as for one thing, this has a bad failure mode when the system crashes, as you

Re: How to setup a scalable deployment?

2009-10-08 Thread Jake Mannix
On Thu, Oct 8, 2009 at 9:32 PM, Chris Were wrote: > Zoie looks very close to what I'm after, however my whole app is written in > Python and uses PyLucene, so there is a non-trivial amount of work to make > things work with Zoie. > I've never used PyLucene before, but since it's a wrapper, plugg

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
Scott, To reiterate what Erick and Andrzej said: calling IndexReader.document(docId) in your inner scoring loop is the source of your performance problem - iterating over all these stored fields is what is killing you. To do this a better way, can you try to explain exactly what this Scorer

Re: Realtime & distributed

2009-10-09 Thread Jake Mannix
est way to go about > this is to post benchmarks that others may run in their > environment which can then be tweaked for their unique edge > cases. I wish I had more time to work on it. > > -J > > On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix wrote: > > Jason, > > >

Re: Using Numeric Field

2009-10-09 Thread Jake Mannix
If you are really using all of that precision (down to the second) the short answer is YES. If you can remove much of that precision (only keep down to the day, for example), then you may be able to get perfectly good performance with strings alone when the range is only over a small set of terms,
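
A rough sketch of why dropping precision helps, in plain Java with no Lucene dependency. It assumes timestamps stored as yyyyMMddHHmmss strings (the format seen elsewhere in this archive); the class and method names are made up for illustration. Fewer distinct terms in the field means fewer terms for a string range query to walk.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
class DatePrecision {
    /** Keeps only the yyyyMMdd prefix of a yyyyMMddHHmmss timestamp string. */
    static String toDay(String timestamp) {
        return timestamp.substring(0, 8);
    }

    /** Counts the distinct terms these values would contribute to a field. */
    static int distinctTerms(List<String> values, boolean dayPrecision) {
        Set<String> terms = new TreeSet<>();
        for (String v : values) {
            terms.add(dayPrecision ? toDay(v) : v);
        }
        return terms.size();
    }
}
```

With second precision nearly every document gets its own term; with day precision a year of data collapses to at most 366 terms.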

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
ou > have a default set of weights and you want to adjust them on the fly > although our use case is a little different. > > thanks, > Scott > > On Fri, Oct 9, 2009 at 10:40 AM, Jake Mannix > wrote: > > > Scott, > > > > To reiterate what Erick and Andrzej

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
On Fri, Oct 9, 2009 at 3:07 PM, scott w wrote: > Example Document: > model_1_score = 0.9 > model_2_score = 0.3 > model_3_score = 0.7 > > I want to be able to pass in the following map at query time: > {model_1_score=0.4, model_2_score=0.7} and have that map get used as input > to a custom score f

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
ccidentally hang onto references to those IndexReaders past when needed. -jake On Fri, Oct 9, 2009 at 3:52 PM, scott w wrote: > Thanks Jake! I will test this out and report back soon in case it's helpful > to others. Definitely appreciate the help. > > Scott > > On Fri

Re: Realtime & distributed

2009-10-09 Thread Jake Mannix
ote: > Hi Jake, > > Zoie looks like a really cool project. I'd like to learn more about > the distributed part of the setup. Any way you could describe that > here or on the wiki? > > -Mike > > On Thu, Oct 8, 2009 at 9:24 PM, Jake Mannix wrote: > > On

Re: Question about new TopScoreDocCollector class in Lucene 2.9

2009-10-10 Thread Jake Mannix
Hi Michael, If you just want the top "n" hits (the way you used to use the Hits class), just call TopDocs topDocs = Searcher.search(query, n); Don't worry about the Collector interface unless you actually need it. -jake On Sat, Oct 10, 2009 at 1:12 PM, M R wrote: > Hi > > This is the

Re: Question about how to speed up custom scoring

2009-10-11 Thread Jake Mannix
d it yet but looking at it closer it looks like it's not > something I can plug in on top of my original query. I am definitely happy > using an approximation for the sake of performance but I do need to be able > to have the original results stay the same. > > On Fri, Oct 9, 2

Re: Realtime & distributed

2009-10-11 Thread Jake Mannix
Hey Eric, One clarification before letting the rest of this discussion sneak over to the zoie list: On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric wrote: * Am I wrong to assume that the RAMDir holds the entire index - just as the > FSDir? Or does RAMDir only hold a portion of the index that ha

Re: Realtime & distributed

2009-10-11 Thread Jake Mannix
-jake On Sun, Oct 11, 2009 at 3:36 PM, Jake Mannix wrote: > Hey Eric, > > One clarification before letting the rest of this discussion sneak over > to the zoie list: > > On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric wrote: > > * Am I wrong to assume that the RAMDir hol

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Hi Cedric, I don't know of anyone with a substantial throughput production system who is doing realtime search with the 2.9 improvements yet (and in fact, no serious performance analysis has been done on these even "in the lab" so to speak: follow https://issues.apache.org/jira/browse/LUCENE-157

Re: faceted search performance

2009-10-12 Thread Jake Mannix
Hey Chris, On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz < christoph.bo...@googlemail.com> wrote: > Thanks for your reply. > Yes, it's likely that many terms occur in few documents. > > If I understand you right, I should do the following: > -Write a HitCollector that simply increments a coun

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Wait, so according to the javadocs, the IndexReader which you got from the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(), which means that if the user has a NRT reader, and the user keeps calling reopen() on it, they're getting uncommitted changes as well, while if they cal

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 12:26 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix > wrote: > > > Wait, so according to the javadocs, the IndexReader which you got from > > the IndexWriter forward

Re: querying multi-value fields

2009-10-12 Thread Jake Mannix
Or else just make sure that you use PhraseQuery to hit this field when you want "value1 aaa". If you don't tokenize these pairs, then you will have to do prefix/wildcard matching to hit just "value1" by itself (if this is allowed by your business logic). -jake On Mon, Oct 12, 2009 at 1:21 PM,

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Thanks Yonik, It may be surprising, but in fact I have read that javadoc. It talks about not needing to close the writer, but doesn't specifically talk about what the relationship between commit() calls and getReader() calls is. I suppose I should have interpreted: "@returns a new reader

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
> * costly {@link #commit}. > > Mike > > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > Thanks Yonik, > > > > It may be surprising, but in fact I have read that > > javadoc. It talks about not needing to close the > > writer, but doesn't

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote: > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > It may be surprising, but in fact I have read that > > javadoc. > > It was not your email I responded to. > Sorry, my bad then - you said "guys"

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
> * open a new reader. But the turnaround time of this > * method should be faster since it avoids the potentially > * costly {@link #commit}. > > Mike > > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > Thanks Yonik, > > > > It may be sur

Indexing Speed: 2.3 vs 2.2 (real world numbers)

2008-02-03 Thread Jake Mannix
Hello all, I know you Lucene devs did a lot of work on indexing performance in 2.3, and I just tested it out last Thursday, so I thought I'd let you know how it fared: On a 2.17 million document index, a recent test gave indexing time to be: * lucene 2.2: 4.83 hours * lucene 2.3: 26 m

Re: Indexing Speed: 2.3 vs 2.2 (real world numbers)

2008-02-03 Thread Jake Mannix
ood. :) -jake On Feb 3, 2008 2:11 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Awesome! We are glad to hear that :) > > You might be able to make it even faster with the steps here: > > http://wiki.apache.org/lucene-java/ImproveIndexingSpeed > > Mi

Re: Indexing Speed: 2.3 vs 2.2 (real world numbers)

2008-02-03 Thread Jake Mannix
ge of multiple threads / cores? If so, I could rerun it again multithreaded and see if that's even better... -jake On Feb 3, 2008 9:02 PM, ajay_garg <[EMAIL PROTECTED]> wrote: > > Hi Jake. > > Was the test conducted with a single indexing thread, or multiple on

Re: Indexing Speed: 2.3 vs 2.2 (real world numbers)

2008-02-03 Thread Jake Mannix
te: > Damn, really? I haven't had the opportunity to test this yet. Has > anyone else seen this kind of improvement? > > > > On Feb 3, 2008 2:57 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > > Hello all, > > I know you lucene devs did a lot of work on ind

Re: How to promote an unstemmed match over a stemmed match in an index that's stemmed...

2008-02-11 Thread Jake Mannix
The way I've always done this was to index two fields: say, "contents" and "contents_unstemmed", (using a PerFieldAnalyzerWrapper) and then query on both of them. This has the double effect of a) boosting unstemmed hits, because every unstemmed match is also a stemmed one, so the BooleanQuery combining
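
To see why querying both fields boosts exact matches, here is a toy model in plain Java. The stemmer (it only strips a trailing "s") and the one-point-per-field scoring are stand-ins invented for illustration, not Lucene's analysis or scoring.

```java
import java.util.Locale;

// Toy model of a two-field (stemmed + unstemmed) query; names are hypothetical.
class StemBoost {
    /** Toy stemmer: lowercases and strips a single trailing "s". */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 1 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** One point per field that matches: the stemmed field and the unstemmed field. */
    static int score(String queryWord, String docWord) {
        int score = 0;
        if (stem(queryWord).equals(stem(docWord))) {
            score++; // hit in the stemmed "contents" field
        }
        if (queryWord.equalsIgnoreCase(docWord)) {
            score++; // hit in "contents_unstemmed"
        }
        return score;
    }
}
```

An exact match collects from both fields, a stem-only match from one, which is the ordering the post describes.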

Re: Searching for multiple criteria (accross 2 tables)

2008-02-15 Thread Jake Mannix
What the other posters are referring to is that you will probably have to write some Java code to do Lucene indexing: you can get access to your model objects (with all their dependent data) in Java (since you are using Hibernate, this should be easy), then create Lucene documents from your mode

Re: Security filtering from external DB

2008-03-03 Thread Jake Mannix
Gabriel, You can make this search much more efficient as follows: say that you have a method public BooleanQuery createQuery(Collection allowedUUIDs); that works as you describe. Then you can easily create a useful reusable filter as follows: Filter filter = new CachingWrapperFilter(new Q

LUCENE-933 / SOLR-261

2008-03-18 Thread Jake Mannix
Hey folks, I was wondering what the status of LUCENE-933 is (stop words can cause the queryparser to end up with no results, due to, e.g., a +(the) clause in the resultant BooleanQuery). According to the tracking bug, it's resolved, and there's a patch, but where has that patch been applied? I trie

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Jake Mannix
u can also > see the actual diffs that took place. > > Best, > Doron > > On Tue, Mar 18, 2008 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > Hey folks, > > I was wondering what the status of LUCENE-933 (stop words can cause the > > q

Re: factor in stopwords when searching

2008-03-21 Thread Jake Mannix
I think the way I've seen it done most often is to either index some bi-grams which contain stop words (so "the database" and "search the" are in the index as individual tokens), or else to index that piece of content twice - once with stop words removed (and stemming, if you use it), and then agai
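
A minimal sketch of the first option, in plain Java: emit a bigram token for every adjacent pair where at least one side is a stop word. The class name and the caller-supplied stop set are assumptions for illustration; a real setup would do this inside a token filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Lucene TokenFilter.
class StopWordBigrams {
    /** Returns "a b" for each adjacent token pair containing at least one stop word. */
    static List<String> bigrams(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String a = tokens.get(i);
            String b = tokens.get(i + 1);
            if (stopWords.contains(a) || stopWords.contains(b)) {
                out.add(a + " " + b);
            }
        }
        return out;
    }
}
```

For the tokens "search the database quickly" with "the" as the only stop word, this yields "search the" and "the database", matching the example in the post.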

Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Jake Mannix
org > Betreff: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1 > > Hi Uwe, > > Could you tell what Analyzer do you use when you marked so big indexing > speedup? > If you use StandardAnalyzer (that uses StandardTokenizer) may be the > reason is in it. You can see the pr

Re: Pooled searcher (was: Solid State Drives vs. RAMDirectory)

2008-04-16 Thread Jake Mannix
We started doing the same thing (pooling 1 searcher per core) at my work when profiling showed a lot of time hitting synchronized blocks deep inside the SegmentTermReader (? Might be messing the class up) under high load, due to file read()'s using instance variables for seeking. I could dig up the

Re: Maximum index file size

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe < hrishikesh_aga...@persistent.co.in> wrote: > Can I create an index file with very large size, like 1 TB or so? Is there > any limit on how large index file one can create? Also, will I be able to > search on this 1 TB index file at all? > Leav

Re: Maximum index file size

2009-10-22 Thread Jake Mannix
er there is any limit as such. And obviously whether > such huge index files can be searched at all. > > From your response it appears that 1 TB of 1 index file is too much. Is > there any guideline to what kind of hardware will be required to handle > (10GB, 50GB, 100GB, 500GB et

Re: Split single string into several fields?

2009-10-27 Thread Jake Mannix
On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson wrote: > Could you go into your use case a bit more? Because I'm confused. > Why don't you want your text tokenized? You say you want to search it, > which means you have to analyze it. I think Will is suggesting that he doesn't want to have to ana

Re: Lucene 2.9.0 / BooleanQuery problem

2009-10-28 Thread Jake Mannix
Hi Michel, I don't have time to look in too much detail right now, but I'll bet ya $5 it's because your query is for "sector:IT" - 'IT' lowercases to 'it' which is in the default stopword list, and if you're not careful about how you query with this, you'll end up with TermQuery instances which
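
The effect is easy to reproduce without Lucene. The sketch below is a stand-in analyzer that lowercases and then drops stop words; the class name is made up and the stop set is only a small sample that happens to include "it", as the default English list does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-in for a lowercase + stop-filter analysis chain; not Lucene's analyzer.
class TinyAnalyzer {
    // Sample stop words only, assumed for illustration.
    static final Set<String> STOP = new HashSet<>(
            Arrays.asList("a", "an", "and", "it", "of", "the", "to"));

    /** Lowercases, splits on whitespace, and removes stop words. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!tok.isEmpty() && !STOP.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }
}
```

Analyzing "IT" produces no tokens at all, so a term built from the raw value at query time has nothing in the index to match.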

Re: Facets

2009-11-03 Thread Jake Mannix
If you need faceting on top of Lucene and you're not using Solr, Bobo-browse ( http://bobo-browse.googlecode.com ) is a high-performance open source faceting library which may suit your needs. You're asking for "all facet values", which in bobo isn't terribly hard to get: because of the way bobo k

Re: Creating tag clouds with lucene

2009-11-05 Thread Jake Mannix
Well you can do it as a facet search, but in addition to doing multi-valued faceting, you can also normalize the counts by dividing by the docFreq of the term, which instead of getting you the most popular tags which overlap your query, you get the tags which are more popular for documents matching
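
A small sketch of that normalization in plain Java. It assumes you already have, per tag, the facet count among the query's hits and the index-wide docFreq; the class and parameter names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: normalizes facet counts by index-wide document frequency.
class TagCloud {
    /** Returns hitCount / docFreq for every tag with a positive docFreq. */
    static Map<String, Double> normalize(Map<String, Integer> hitCounts,
                                         Map<String, Integer> docFreq) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : hitCounts.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df != null && df > 0) {
                out.put(e.getKey(), e.getValue() / (double) df);
            }
        }
        return out;
    }
}
```

A tag with 50 hits but a docFreq of 1000 scores 0.05, while one with 40 hits and a docFreq of 50 scores 0.8, so the rarer tag that is concentrated in the result set wins.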

Re: Creating tag clouds with lucene

2009-11-06 Thread Jake Mannix
On Fri, Nov 6, 2009 at 12:25 AM, Mathias Bank wrote: > Well, it could be a facet search, if there would be tags available but > if you just wanna have a "tag cloud" generated by full-text, I don't > see how a facet search could help to generate this cloud. > Unfortunately, I don't have tags in my

Re: OutOfMemoryError when using Sort

2009-11-12 Thread Jake Mannix
Sorting utilizes a FieldCache: the forward lookup - the value a document has for a particular field (as opposed to the usual "inverted" way of looking at all documents which contain a given term), which lives in memory and takes up as much space as 4 bytes * numDocs. If you've indexed the en
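
As a back-of-the-envelope check of that figure, a tiny sketch in plain Java. The class and method names are made up, and the estimate covers only the per-document 4-byte array described above; anything else a particular sort type keeps in memory is not counted.

```java
// Hypothetical estimator for the per-document part of a sort cache.
class FieldCacheEstimate {
    /** 4 bytes per document, for every document in the index (not just the hits). */
    static long bytesPerSortField(long numDocs) {
        return 4L * numDocs;
    }
}
```

For 100 million documents that is 400,000,000 bytes per sort field, regardless of how many hits the query returns.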

Re: OutOfMemoryError when using Sort

2009-11-12 Thread Jake Mannix
|." > > I understood that only the hits (50 in this) for the current search would > be sorted... > I'll just do the ordering afterwards. Thank you for clarifying this issue. > > > -- > Nuno Seco > > > > Jake Mannix wrote: > >> Sorting

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Hi Max, You want a query like ("San Francisco" OR "California") AND ("John Smith" OR "John Smith Manufacturing") essentially? You can give Lucene exactly this query and it will require that either "John Smith" or "John Smith Manufacturing" be present, but will score results which have these

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Did I do that wrong? I always mess up the AND/OR human-readable form of this - it's clearer when you use +/- unary operators instead: query: "San Francisco" "California" +("John Smith" "John Smith Manufacturing") Here the San Fran and CA clauses are optional, and the ("John Smith" OR "John Smith
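
The +/- form can be modeled in a few lines of plain Java. This is only a sketch of the matching rule (substring checks stand in for phrase queries, and a hit count stands in for scoring); the class and method names are invented for illustration.

```java
import java.util.List;

// Toy model of: optional1 optional2 +(requiredA requiredB)
class BooleanSketch {
    /** Counts how many of the phrases occur in the document text. */
    static int countHits(String doc, List<String> phrases) {
        int hits = 0;
        for (String p : phrases) {
            if (doc.contains(p)) {
                hits++;
            }
        }
        return hits;
    }

    /**
     * Returns -1 if none of the required alternatives match (document rejected),
     * otherwise the number of optional clauses that matched (used for ranking only).
     */
    static int evaluate(String doc, List<String> optional, List<String> requiredAnyOf) {
        if (countHits(doc, requiredAnyOf) == 0) {
            return -1;
        }
        return countHits(doc, optional);
    }
}
```

A document mentioning only "San Francisco" and "California" is rejected, while one mentioning "John Smith" is accepted and ranks higher the more optional clauses it also contains.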

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch wrote: > > query: "San Francisco" "California" +("John Smith" "John Smith > > Manufacturing") > > > > Here the San Fran and CA clauses are optional, and the ("John Smith" OR > > "John Smith Manufacturing") is required. > > > > Thanks Jake, that works nic

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch wrote: > > > Now, I would like to know exactly what term was found. For example, if > a > > > result comes back from the query above, how do I know whether John > Smith > > > was > > > found, or both John Smith and his company, or just John Smith > > Ma

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch wrote: > Well already, without doing any boosting, documents matching more of the > > terms > > in your query will score higher. If you really want to make this effect > > more > > pronounced, yes, you can boost the more important query terms higher. >

Re: share some numbers for range queries

2009-11-15 Thread Jake Mannix
On Sun, Nov 15, 2009 at 11:02 PM, Uwe Schindler wrote: > the second approach is slower, when deleted docs > are involved and 0 is inside the range (need to consult TermDocs). > This is a good point (and should be mentioned in your blog, John) - for while custom FieldCache-like implementations (

Re: What is the best way to handle the primary key case during lucene indexing

2009-11-16 Thread Jake Mannix
The usual way to do this is to use: IndexWriter.updateDocument(Term, Document) This method deletes all documents with the given Term in it (this would be your primary key), and then adds the Document you want to add. This is the traditional way to do updates, and it is fast. -jake On Mo
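
The delete-then-add behaviour described above can be mimicked with an in-memory list, which makes the primary-key semantics easy to see. This is a stand-in written in plain Java, not Lucene code; TinyIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for an index with update-by-term semantics.
class TinyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    /** Deletes every document whose field equals value, then adds newDoc. */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(newDoc);
    }

    int numDocs() {
        return docs.size();
    }

    /** Looks up a field of the first document whose key field equals keyValue. */
    String get(String keyField, String keyValue, String field) {
        for (Map<String, String> d : docs) {
            if (keyValue.equals(d.get(keyField))) {
                return d.get(field);
            }
        }
        return null;
    }
}
```

Calling updateDocument twice with the same key leaves one document holding the latest values; a key that is not present yet simply results in an add.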

Re: What is the best way to handle the primary key case during lucene indexing

2009-11-16 Thread Jake Mannix
You will want to have one Lucene field which contains this composite key - they could be the un-tokenized concatenation of all of the subkeys, for example, and then one Term would have the full composite key, and the updateDocument technique would work fine. -jake On Mon, Nov 16, 2009 at 11:09
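
One way to build such a composite key, sketched in plain Java. The "|" separator and the class name are assumptions; the only real requirement is that the separator cannot occur inside a subkey, otherwise two different key tuples could concatenate to the same string.

```java
// Hypothetical helper for building a single untokenized key value.
class CompositeKey {
    static final String SEP = "|"; // assumed separator, must not appear in any subkey

    /** Joins the subkeys into one string suitable for a single un-tokenized field. */
    static String of(String... subKeys) {
        for (String k : subKeys) {
            if (k.contains(SEP)) {
                throw new IllegalArgumentException("subkey contains separator: " + k);
            }
        }
        return String.join(SEP, subKeys);
    }
}
```

The resulting string goes into one field, indexed without tokenization, and the same string is used as the term when updating.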

Re: Top field count scoring across documents

2009-11-22 Thread Jake Mannix
Peter, You want to do a facet query. This kind of functionality is not in Lucene-core (sadly), but both Solr (the fully featured search application built on Lucene) and bobo-browse (just a library, like Lucene itself) are open-source and work with Lucene to provide faceting capabilities for yo

Re: Caching analyzed query

2009-12-02 Thread Jake Mannix
What kind of queries are these? I.e. How much work goes into step 4? Is this a fairly standard combination of Boolean/Phrase/other stock Lucene queries built up out of tokenizing the text? If so, it's going to be nowhere near the bottleneck in your runtime (we're talking often way less than a mi

Re: How to get a apache public license

2009-12-23 Thread Jake Mannix
Merry Christmas to you, Weiwei. If you want to release your software under *exactly* the Apache License (version 2.0 is the most current form of it), you may do so very easily - just read the appendix at the end of this page: http://www.apache.org/licenses/LICENSE-2.0 In particular, note that

Re: file open handles?

2010-01-26 Thread Jake Mannix
Hi Jamie, How fast are you indexing (number of documents per second)? We also ran into this when trying to perf test heavy query throughput while doing rapid indexing under exactly these conditions: call getReader() every time a search is executed (so that it's "really real time"). The answe

Re: file open handles?

2010-01-26 Thread Jake Mannix
On Tue, Jan 26, 2010 at 11:13 PM, Jamie wrote: > > Hi Jake > > Thanks for the info. Are you specifically referring to > http://issues.apache.org/jira/browse/LUCENE-2120? > Yep, that's the issue I'm referring to. > Our app indexes about 170 50k documents per second in heavy load. In any > case,

Re: file open handles?

2010-01-27 Thread Jake Mannix
On Wed, Jan 27, 2010 at 12:17 AM, Jamie wrote: > Hi Jake > > > You were indexing but not searching? So you are never calling getReader() >> in the first place? >> >> > Of course, the call exists, it's just that during testing we did not execute > any searches at all. Oh! Re-reading your initi

Re: "one of the terms"

2010-01-29 Thread Jake Mannix
coord won't help him, I don't think. Doesn't he just want a DisjunctionMaxQuery instead of BooleanQuery? -jake On Fri, Jan 29, 2010 at 9:28 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Paul, > > Custom Similarity perhaps, oui. Not 100% sure, maybe have this always > return 1.0

Re: Sort memory usage

2010-02-03 Thread Jake Mannix
The FieldCache loads per segment, and the NRT reader is reloading only new segments from disk, so yes, it's "smarter" about this caching in this case. -jake On Wed, Feb 3, 2010 at 1:07 PM, tsuraan wrote: > Is the cache used by sorting on strings separated by reader, or is it > a global thing?
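
A simplified model of that per-segment behaviour, in plain Java. It keys a cache by segment name and counts how many loads happen across a reopen; this is an illustration with made-up names, not how the real FieldCache is implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy per-segment cache: only segments not seen before are loaded.
class SegmentCache {
    private final Map<String, int[]> cache = new HashMap<>();
    int loads = 0; // number of segment loads performed so far

    /** Ensures every listed segment has a cache entry, loading only the missing ones. */
    void warm(List<String> segments) {
        for (String seg : segments) {
            cache.computeIfAbsent(seg, s -> {
                loads++;           // stands in for the expensive read from disk
                return new int[0]; // placeholder for the per-document values
            });
        }
    }
}
```

Warming with segments _0 and _1 costs two loads; after a reopen that adds _2, warming again costs only one more.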

Re: Sort memory usage

2010-02-03 Thread Jake Mannix
On Wed, Feb 3, 2010 at 1:33 PM, tsuraan wrote: > > The FieldCache loads per segment, and the NRT reader is reloading only > > new segments from disk, so yes, it's "smarter" about this caching in this > > case. > > Ok, so the cache is tied to the index, and not to any particular > reader. The act

Re: Scale Out

2010-02-08 Thread Jake Mannix
On Mon, Feb 8, 2010 at 9:33 AM, Chris Lu wrote: > Since you already have RMI interface, maybe you can parallel search on > several nodes, collect the data, pick top ones, and send back results via > RMI. > One thing to be careful about this, which you might already be aware of: Query (and subcla

Re: Query about Query.ToString()

2010-02-17 Thread Jake Mannix
On Wed, Feb 17, 2010 at 10:55 AM, Erick Erickson wrote: > Well, Query *does* implement the Serializable interface, so that > might work. WARNING: I haven't personally used the Serializable > interface on Query, so I have no real clue whether it's applicable! > Query is serializable (lots of peopl