Re: Scalability of Lucene indexes
Hi Bryan,

How big is your index? Also, what is the advantage of binding a user to a server?

Thanks,
Andy

--- Bryan McCormick <[EMAIL PROTECTED]> wrote:
> Hi Chris,
>
> I'm responsible for the webshots.com search index and we've had very
> good results with Lucene. It currently indexes over 100 million
> documents and performs 4 million searches / day.
>
> We initially tested running multiple small copies with a MultiSearcher
> and merging the results, compared to running a very large single index.
> We actually found that the single large instance performed better. To
> improve load handling we clustered multiple identical copies together,
> then session-bind a user to a particular server and cache the results,
> but each server is running a single index.
>
> Bryan McCormick
>
> On Fri, 2005-02-18 at 08:01, Chris D wrote:
> > Hi all,
> >
> > I have a question about scaling Lucene across a cluster, and good ways
> > of breaking up the work.
> >
> > We have a very large index and searches sometimes take more time than
> > they're allowed. What we have been doing is, during indexing, we index
> > into 256 separate indexes (depending on the md5sum), then distribute
> > the indexes to the search machines. So if a machine has 128 indexes it
> > would have to do 128 searches. I gave ParallelMultiSearcher a try and
> > it was significantly slower than simply iterating through the indexes
> > one at a time.
> >
> > Our new plan is to somehow have only one index per search machine and
> > a larger main index stored on the master.
> >
> > What I'm interested to know is whether having one extremely large
> > index for the master and then splitting the index into several smaller
> > indexes (if this is possible) would be better than having several
> > smaller indexes and merging them on the search machines into one
> > index.
> >
> > I would also be interested to know how others have divided up search
> > work across a cluster.
> >
> > Thanks,
> > Chris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Access Lucene from PHP or Perl
Greetings. Can anyone point me to a how-to tutorial on how to access Lucene from a web page generated by PHP or Perl? I've been looking but couldn't find anything.

Thanks a lot,
And
Re: Penalty for storing unrelated field?
You should be fine.

On Fri, 28 Jan 2005 15:21:50 -0600, Bill Tschumy <[EMAIL PROTECTED]> wrote:
> I just want to make sure that adding the unrelated field to a single
> doc won't cause all the other documents to increase their storage space.

I have lots of fields that only occur in one document, but it doesn't faze Lucene. Actually, when choosing an indexing solution, we chose Lucene mostly because of its ability to index and store unlimited kinds of metadata.

- andy g
Re: Filtering w/ Multiple Terms
Maybe you should try making a BooleanQuery out of the TermQuerys and then passing that to QueryFilter. I've never tried it, but it should work, right?

- andy g

On Thu, 20 Jan 2005 16:02:26 -0600, Jerry Jalenak <[EMAIL PROTECTED]> wrote:
> In looking at the examples for filtering of hits, it looks like I can only
> specify a single term, i.e.:
>
>     Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1")));
>
> I need to specify more than one term in my filter. Short of using something
> like ChainFilter, how are others handling this?
>
> Thanks!
>
> Jerry Jalenak
> Senior Programmer / Analyst, Web Publishing
> LabOne, Inc.
> 10101 Renner Blvd.
> Lenexa, KS 66219
> (913) 577-1496
>
> [EMAIL PROTECTED]
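Conceptually, combining several term filters is just an OR over the per-document bit sets each filter produces (which is what a ChainFilter does, and what a BooleanQuery wrapped in a QueryFilter amounts to). A minimal standalone sketch of that combining step, using plain `java.util.BitSet` rather than the Lucene classes (the class and method names here are hypothetical, for illustration only):

```java
import java.util.BitSet;

// Each term's filter yields a BitSet with one bit per document id;
// OR-ing the sets accepts any document that matches at least one term.
public class OrFilterSketch {
    public static BitSet orFilters(BitSet... termBits) {
        BitSet combined = new BitSet();
        for (BitSet bits : termBits) {
            combined.or(bits);  // in-place union with this term's doc set
        }
        return combined;
    }
}
```

The same idea with AND (`combined.and(bits)`) would require every term to match, which is the other common way people chain filters.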
Re: lucene integration with relational database
I do these kinds of queries all the time. I found that the fastest performance for my collections (millions of documents) came from subclassing Filter, using the set of primary keys from the database to build the Filter, and then doing the query through the Searcher.search(query, filter) interface. I was previously using the in-memory merge, but the memory requirements were crashing the JVM when we had a lot of simultaneous users.

- andy g

On Sat, 15 Jan 2005 23:03:00 +0530, sunil goyal <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Thanks for the answers. I was looking for a best-practice guide to do
> the same. If anyone already has practical experience with this kind of
> query, it would be great to know their thoughts.
>
> Thanks
>
> Regards
> Sunil
>
> On Sat, 15 Jan 2005 09:00:35 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Still minor additions to the steps:
> >
> > 1) do the Lucene query and get the hits (keyed by the database primary
> > key, for example, employee id)
> >
> > 2) do the database query and get the primary keys (i.e., employee id)
> > for the result rows, ordered by primary key
> >
> > 3) for each Lucene query result, look into the db query result and see
> > if the primary key is there (since the db query result is already
> > sorted by primary key, a binary search can be applied);
> > if the primary key is there, keep this result, else discard it
> >
> > 4) when the top k results are obtained, send them back to the user.
> >
> > How does this sound?
> >
> > Cheers,
> >
> > Jian
> >
> > On Sat, 15 Jan 2005 08:36:16 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > To further the discussion. Would the following detailed steps work:
> > >
> > > 1) do the Lucene query and get the hits (keyed by the database
> > > primary key, for example, employee id)
> > >
> > > 2) do the database query and get the primary keys (i.e., employee
> > > id) for the result rows, ordered by primary key
> > >
> > > 3) merge the two sets of primary keys (for example, an in-memory
> > > two-way merge) and take the top k records
> > >
> > > 4) display the top k result rows
> > >
> > > Cheers,
> > >
> > > Jian
> > >
> > > On Sat, 15 Jan 2005 12:40:04 +, Peter Pimley <[EMAIL PROTECTED]> wrote:
> > > > sunil goyal wrote:
> > > >
> > > > > But can I do, for instance, a unified query where I want to take
> > > > > certain parameters (non-textual, e.g. age < 30) from relational
> > > > > databases and keywords from the Lucene index?
> > > >
> > > > When I have had to do this, I've done the Lucene search first, and
> > > > then manually filtered out the hits that fail on other criteria.
> > > >
> > > > I'd suggest doing that first (as it's easiest) and then seeing
> > > > whether the performance is acceptable.
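The numbered steps above (intersect the Lucene hits with the sorted database keys via binary search, keeping score order, until k results are found) can be sketched in plain Java. This is an illustrative sketch only; the class name and the use of long primary keys are assumptions, not part of the original posts:

```java
import java.util.Arrays;

// Keep only the Lucene hits whose primary key also appears in the
// (sorted) database result set, preserving Lucene's score order,
// stopping once k results have been collected.
public class HitDbIntersect {
    public static long[] topK(long[] luceneHitKeys, long[] sortedDbKeys, int k) {
        long[] out = new long[Math.min(k, luceneHitKeys.length)];
        int n = 0;
        for (long key : luceneHitKeys) {
            if (n == k) break;
            // db keys are ordered by primary key, so binary search applies
            if (Arrays.binarySearch(sortedDbKeys, key) >= 0) {
                out[n++] = key;
            }
        }
        return Arrays.copyOf(out, n);  // trim if fewer than k matched
    }
}
```

In practice the `luceneHitKeys` would come from the Hits in score order and `sortedDbKeys` from an ORDER BY on the primary key, as described in step 2.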
Corrupted indexes
Recently, I've been getting a lot of corrupted Lucene indexes. They appear to return search results normally, but there is really no good way to test whether information is missing. The main problem is that when I try to optimize, I get the following exception:

java.io.IOException: read past EOF
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
        at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:422)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)

This is preventing me from optimizing the indexes, and it also scares me that information might be missing. Does anybody know what's going on here, and what might be wrong?

Thanks for your time,
- andy g
Re: index files version and lucene 1.4
I had this problem when I initially upgraded to 1.4, but Tomcat was still searching with the old 1.3 jar. Make sure you have fully updated its path variables, include directories, etc.

- andy g

On Fri, 22 Oct 2004 16:00:42 +0200, gaudinat <[EMAIL PROTECTED]> wrote:
> Thanks,
>
> Finally my problem seems to come from the Tomcat (5.0) and Lucene 1.4
> installation.
>
> To summarize:
>
> Through Tomcat, with the same application (Lucene 1.4), I get no hits
> with a 1.4 index, while I get hits with a 1.3 index.
> Without Tomcat, the same application (Lucene 1.4) gets hits for both
> versions of the index files, 1.3 and 1.4.
>
> Does someone have an idea, please?
>
> Arno.
>
> Aviran wrote:
>
> > Lucene 1.4 changed the file format for indexes. You can access an old
> > index using Lucene 1.4, but you can't access an index which was
> > created using Lucene 1.4 with older versions.
> > I suggest you rebuild your index using Lucene 1.4.
> >
> > Aviran
> > http://aviran.mordos.com
> >
> > -----Original Message-----
> > From: arnaud gaudinat [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, October 21, 2004 12:10 PM
> > To: Lucene Users List
> > Subject: index files version and lucene 1.4
> >
> > Hi,
> > Certainly a stupid question!
> > I have just upgraded to 1.4. I have succeeded in accessing my 1.3
> > index files but not my new 1.4 index files. In fact I get no error,
> > but no hits for the 1.4 index files. Moreover, I don't know if it's
> > normal, but now I have just 3 files for my index (.cfs, deletable and
> > segments). However, if I use Luke with the 1.4 index files, it works
> > perfectly.
> >
> > An idea?
> >
> > Regards,
> >
> > Arno.
removing duplicate Documents from Hits
Hello,

I've searched previous posts on this topic but couldn't find an answer. I want to query my index (which is a number of 'flattened' Oracle tables) for some criteria, then return Hits such that there are no Documents that duplicate a particular field. In the case where table A has a one-to-many relationship to table B, I get one Document for each pair (A1-B1, A1-B2, A1-B3, ...). My index needs to have each of these records, as 'B' is a searchable field in the index. However, after the query is executed, I want my resulting Hits to be unique on 'A'. I'm only returning the Oracle object ID, so once I've seen it once I don't need it again.

It looks like some sort of custom Filter is in order. My fix at the moment is to run the query, then store unique ids in a Map to build another query that will return singletons on field 'A'. I could skip this step if there were a way to remove documents from Hits (I didn't see one). Has anyone written a filter that does this?

Are there others using Lucene to mimic a relational DB? I've got a complex SQL search that joins (mostly outer) some 40 tables. Query performance is important, and the tables are relatively static. I find the IDs of the objects that match the users' criteria, then go to the DB to instantiate them.

Any comments are appreciated.
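The "store unique ids in a Map" workaround described above boils down to walking the hits in score order and keeping only the first Document per 'A' id. A minimal standalone sketch of that dedupe step (class and method names are hypothetical; the 'A' ids are shown as plain Strings rather than Lucene field values):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Walk hit 'A' ids in score order and keep only the first occurrence
// of each, so the result is unique on 'A' with order preserved.
public class DedupeHits {
    public static List<String> uniqueByA(List<String> hitAIds) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String aId : hitAIds) {
            if (seen.add(aId)) {   // add() returns false if already seen
                unique.add(aId);
            }
        }
        return unique;
    }
}
```

This avoids the second query entirely: one pass over the Hits, at the cost of a Set sized by the number of distinct 'A' values.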
Re: Problems with Lucene + BDB (Berkeley DB) integration
I used BDB + Lucene successfully with the Lucene 1.3 distribution, but it broke in my application with the 1.4 distribution. The 1.4 dist uses a different index file format by default, the compound file format (.cfs), so maybe that is the source of the issues.

good luck,
andy g

On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez <[EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> I am trying to use the Lucene + BDB integration from the sandbox
> (http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/).
> I installed C Berkeley DB 4.2.52 and I have the Lucene jar file.
>
> I have an example program that indexes 4 small text files in a
> directory (it's very similar to the IndexFiles.java in the Lucene demo,
> except that it uses BDB + Lucene). The problem I have is that executing
> the indexing program generates different results each time I run it.
> For example: if I start with an empty index, run the indexing program
> and then query the index, I get the correct results; then I delete the
> index to start from scratch again, perform the same sequence, and I get
> no results. (?)
>
> What puzzles me is the non-deterministic results... the same execution
> sequence generates two different results. I then wrote a program to
> dump the index and I found out that the list of files that end up in
> the index is different every time I index those 4 files.
>
> For example:
> 1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm,
> _4.frq, _4.prx, _4.tii, segments, deletable. (9 files)
> 2nd run: contents of directory: _4.f1, _4.cfs, _4.fdt, _4.fdx, _4.fnm,
> _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11 files)
>
> Does anyone have any idea why this is happening?
> Has anyone been able to use the BDB + Lucene integration with no
> problems?
>
> I'd appreciate any help or pointers.
> Thanks!
> Xtian
Re: Can I prevent Sort fields from influencing score?
I build the query myself; it's really easy. I just use the normal query parser with IndexReader.getFieldNames(true) and loop through all of them to search everything at once. You can either make a really big BooleanQuery or make a bunch of small queries and merge the results, depending on what kind of results you are looking for. It's probably not as fast as the one-big-data-field method, but speed is not an issue yet for anything I've done, whereas code maintenance is a pain; witness my question that started this thread.

- andy g

On Wed, 2 Jun 2004 13:43:41 -0700, Gus Kormeier <[EMAIL PROTECTED]> wrote:
> Just curious,
> Are you building your query or using a particular query parser? Which
> one?
>
> Are you using MultiFieldQueryParser? I had problems with MFQP before
> and was looking for other solutions besides dumping fields into a
> massive "content" field.
>
> TIA,
> -Gus
>
> -----Original Message-----
> From: Andy Goodell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 02, 2004 1:30 PM
> To: Lucene Users List
> Subject: Re: Can I prevent Sort fields from influencing score?
>
> thanks, that was my problem: i had code extending the search out to all
> the fields, now it only extends the search out to the fields i'm
> interested in.
>
> - andy g
>
> On Wed, 2 Jun 2004 14:21:24 -0500, Tim Jones <[EMAIL PROTECTED]> wrote:
> > This seems like it would be determined by how you generate your query -
> > if your query doesn't search in the sorted fields, they shouldn't
> > affect the scoring of your documents ...
> >
> > > -----Original Message-----
> > > From: Andy Goodell [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, June 02, 2004 12:22 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: Can I prevent Sort fields from influencing score?
> > >
> > > I have been using the new Lucene 1.4 SortField implementation with
> > > some custom fields added to old indexes so that the results can be
> > > sorted by them. My problem here is that some of the String fields
> > > that I add to the index come up in the search terms, so my results
> > > in sort-by-score order are different. Here's an example:
> > >
> > > I added the field AUTHOR_SORTABLE to most of the documents in the
> > > index. But if the AUTHOR_SORTABLE field in a document is set to
> > > "andy", and I search for "andy", this document gets a very different
> > > score than it used to.
> > >
> > > Since my added fields aren't set in stone, I'm interested in a
> > > general solution, where all fields containing the text "SORTABLE" in
> > > the name aren't considered for matches, only for sorting. Could I do
> > > this by overriding Similarity? I tried doing this to set the
> > > lengthNorm() for each of my sortable fields to 0, but it hasn't
> > > worked yet. Is there a different way to store the sortable fields
> > > that will prevent this?
> > >
> > > Any help would be greatly appreciated.
> > >
> > > - andy g
Re: Can I prevent Sort fields from influencing score?
Thanks, that was my problem: I had code extending the search out to all the fields; now it only extends the search out to the fields I'm interested in.

- andy g

On Wed, 2 Jun 2004 14:21:24 -0500, Tim Jones <[EMAIL PROTECTED]> wrote:
> This seems like it would be determined by how you generate your query -
> if your query doesn't search in the sorted fields, they shouldn't affect
> the scoring of your documents ...
>
> > -----Original Message-----
> > From: Andy Goodell [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 02, 2004 12:22 PM
> > To: [EMAIL PROTECTED]
> > Subject: Can I prevent Sort fields from influencing score?
> >
> > I have been using the new Lucene 1.4 SortField implementation with
> > some custom fields added to old indexes so that the results can be
> > sorted by them. My problem here is that some of the String fields that
> > I add to the index come up in the search terms, so my results in
> > sort-by-score order are different. Here's an example:
> >
> > I added the field AUTHOR_SORTABLE to most of the documents in the
> > index. But if the AUTHOR_SORTABLE field in a document is set to
> > "andy", and I search for "andy", this document gets a very different
> > score than it used to.
> >
> > Since my added fields aren't set in stone, I'm interested in a general
> > solution, where all fields containing the text "SORTABLE" in the name
> > aren't considered for matches, only for sorting. Could I do this by
> > overriding Similarity? I tried doing this to set the lengthNorm() for
> > each of my sortable fields to 0, but it hasn't worked yet. Is there a
> > different way to store the sortable fields that will prevent this?
> >
> > Any help would be greatly appreciated.
> >
> > - andy g
Can I prevent Sort fields from influencing score?
I have been using the new Lucene 1.4 SortField implementation with some custom fields added to old indexes so that the results can be sorted by them. My problem here is that some of the String fields that I add to the index come up in the search terms, so my results in sort-by-score order are different. Here's an example:

I added the field AUTHOR_SORTABLE to most of the documents in the index. But if the AUTHOR_SORTABLE field in a document is set to "andy", and I search for "andy", this document gets a very different score than it used to.

Since my added fields aren't set in stone, I'm interested in a general solution, where all fields containing the text "SORTABLE" in the name aren't considered for matches, only for sorting. Could I do this by overriding Similarity? I tried doing this to set the lengthNorm() for each of my sortable fields to 0, but it hasn't worked yet. Is there a different way to store the sortable fields that will prevent this?

Any help would be greatly appreciated.

- andy g
Re: 1.4 Sort API compatible with 1.3 index?
In my experience, the only barrier to using Sort with a 1.3 index is that the Sort interface requires sortable fields to be indexed in a certain way (not analyzed, indexed, and it doesn't matter if stored). From the javadoc:

        document.add(new Field("byNumber", Integer.toString(x), false, true, false));

So unless you already have fields of this prototype in your index, you may need to do some degree of re-indexing. If you do already have the fields indexed in this fashion, then you should be in good shape, although I have only done cursory testing of this setup, since I have migrated my setup entirely to 1.4.

- andy g

On Tue, 1 Jun 2004 07:56:27 -0700 (PDT), Greg Gershman <[EMAIL PROTECTED]> wrote:
> I looked around a bit, but couldn't find an answer to this question.
> There doesn't seem to be any reason why it wouldn't, from what I can
> see, but I just want to make sure I don't have to rebuild my index to
> use the Sort functionality provided in 1.4 with an index built with
> 1.3.
>
> Thanks!
>
> Greg Gershman
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
In our application we had a similar problem with non-date ranges, until we realized that it wasn't so much that we were searching for the values in the range as restricting the search to that range. We then used an extension of the org.apache.lucene.search.Filter class, and our implementation got much simpler and faster.

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have over 60,000 documents in my index, which is slightly over 1 GB
> in size. The documents range from the late seventies up to now. I have
> indexed dates as a keyword field using a string, because the dates are
> in YYYYMMDD format. When I do range queries things are OK as long as I
> don't exceed the built-in number of boolean clauses, so that's a range
> of 3 years, e.g. 1979 to 1981. The users are not only doing complex
> queries but also want to query over long ranges, e.g. [19790101 TO
> 19991231].
>
> Given these requirements, I am thinking of doing a query without the
> date range, bringing the unique ids back from the hits, and then doing
> a date query in the SQL database I have that contains the same data.
> Another alternative is to do the query without the date range in
> Lucene and then sort the results within the range. I still have to
> learn how to use the new sorting code, and I confess I did not have
> time to look at it yet.
>
> Is there a simpler, easier way to do this?
>
> Claude
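The reason a Filter sidesteps the too-many-clauses problem is that instead of expanding [19790101 TO 19991231] into one boolean clause per date term, it just marks which documents fall inside the range. A minimal standalone sketch of that marking step (class name hypothetical; the per-document date strings stand in for what a real Filter would read from the index), using the fact that zero-padded YYYYMMDD strings compare correctly as plain strings:

```java
import java.util.BitSet;

// Build a per-document bit set of the docs whose YYYYMMDD keyword
// falls inside [from, to], inclusive on both ends.
public class DateRangeBits {
    public static BitSet inRange(String[] docDates, String from, String to) {
        BitSet bits = new BitSet(docDates.length);
        for (int doc = 0; doc < docDates.length; doc++) {
            String d = docDates[doc];
            // zero-padded YYYYMMDD strings order lexicographically by date
            if (d.compareTo(from) >= 0 && d.compareTo(to) <= 0) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

In Lucene terms, a custom Filter subclass would return a bit set like this from its bits() method, and the searcher would intersect it with the query results, so the range never appears in the query at all.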
Re: clean up html before indexing or add tags to ignore list
If you are running Linux, I recommend that before indexing with Lucene you use the program lynx with the option -dump, which dumps the formatted text without the tags and runs really, really fast in most cases.

- andy g

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Clean-up seems cleaner. Just extract the textual information from HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
>
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important than H2, which is more important than BODY,
> which is more important than DIV, etc.) and combine it with field
> boosting at index time.
>
> Otis
>
> --- Sebastian Ho <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing and search application
> > development. I have written my crawler and am planning to add Lucene
> > next. One question popped into my mind: in terms of performance, do I
> > clean up the HTML, removing all tags before indexing, or do I add all
> > tags to the ignore list during the indexing/search stage?
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho
Re: Query performance on a 315 Million document index (1TB)
Although I've never indexed anything quite that large, I've had good experiences with splitting the index out over a cluster. (For example, a set that would take about 4 seconds per complicated query on one of our machines drops to around a second when spread out over 6.) I think the reason this helps is the disk I/O bound on performance that the others have mentioned: adding another disk array adds to the effective disk bandwidth.

good luck
- andy g

On Fri, 07 May 2004 04:47:55 +0500, Will Allen <[EMAIL PROTECTED]> wrote:
> Hi,
> I am considering a project that would index 315+ million documents. I
> am comfortable that the indexing will work well in creating an index
> ~800GB in size, but am concerned about the query performance. (Is this
> a bad assumption?)
>
> What are the bottlenecks of performance as an index scales? Memory?
> Cost is not a concern, so what would be the shortcomings of a
> theoretical machine with 16GB of ram, 4-16 cpus and 1-2 terabytes of
> space? Would it be better to cluster machines to break apart the query?
>
> Thank you for your serious responses,
> Will Allen
Bug in Sandbox - Berkeley DB
IndexReader.delete(int docid) doesn't work with the Berkeley DB implementation of org.apache.lucene.store.Directory. This error message appears when closing an IndexReader which has a deletion:

PANIC: Invalid argument

I get this stack trace:

java.io.IOException: DB_RUNRECOVERY: Fatal error, run database recovery
        at org.apache.lucene.store.db.Block.put(Block.java:128)
        at org.apache.lucene.store.db.DbOutputStream.close(DbOutputStream.java:111)
        at org.apache.lucene.util.BitVector.write(BitVector.java:155)
        at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:162)
        at org.apache.lucene.store.Lock$With.run(Lock.java:148)
        at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:157)
        at org.apache.lucene.index.IndexReader.close(IndexReader.java:422)

Help!
- andy g

Code that triggers this:

        // dbdir is a working DbDirectory, docid is a search result
        IndexReader read = IndexReader.open(dbdir);
        read.delete(docid);
        read.close();
Re: Lucene MBean service for JBoss
Thanks Otis... With any luck my current employer will also chip in a few bucks to help maintain the project (I'm working on it)...

cheers
-andy

Otis Gospodnetic wrote:
> Thanks, I'm finally including this on the Contributions page.
>
> Otis
>
> --- Andy Scholz <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > For those that may be interested, I have written a full-text indexing
> > service for the JBoss application server that uses Lucene as its
> > engine. It allows Lucene to be used as a service rather than a
> > standalone app, with thread pooling, access synchronization,
> > management, etc. Index and search interfaces are accessible via JNDI
> > and remotely via session EJBs. Additionally, I have provided content
> > filters for common formats like HTML, MSWord, MSExcel, XML, etc.
> > (with some help from other projects). A simple interface also allows
> > you to write your own filters for different formats.
> >
> > It is available under an LGPL license, and source code, binaries and
> > info are available here:
> >
> > http://ejindex.sourceforge.net
> >
> > I'd love to get some feedback, so if you're interested, please let me
> > know your comments or suggestions ;)
> >
> > regards,
> > Andy Scholz
Re: Lucene MBean service for JBoss
Hi Dan,

I'm not sure what's going on there; I've checked the moveNext and it seems OK. It seems that somehow the page got indexed twice, or two hits are being returned for some reason. I can't replicate this though, so I'd appreciate any more info you might be able to give me - e.g. if you comment out the setDocumentURL (so that only metadata is indexed) and change the query to something like "title:jboss", does it still return two hits?

Also, there are some unit tests in the ejindexXX_tests.zip file you might want to run - this has a bunch of tests that exercise the service both locally and remotely. These should(!) fail if there is a problem and hopefully give more indication as to what the problem might be.

Thanks for your feedback!

Regards,
Andy Scholz

> Hi Andy
>
> This looks like a very useful MBean (quite a bit more developed than
> the one I was working on). One quick query on the quickstart example
> though: when I run it I get the output twice.
Re: Spanish analyzer and Indexing StarOffice docs
We tried the UDK approach late last year, but it was an awfully clumsy solution, requiring you to actually run an OO app instance as a 'server'. It kind of worked, but the show-stopper was that OO is so tied into the UI: whenever an error occurred (i.e. file not found, etc.), a dialog box pops up and nothing else happens until it gets acknowledged. In fact, AFAIK you have to have X Windows (or some other GUI) for it to run at all (the giveaway being the splash screen that shows when you start OO via the SDK interface). Apparently a new command-line option was being added to suppress the UI (but then how do you get error messages?), but it was only in the CVS head (as of late last year), so we gave up on it because it was just too messy a solution for our server-side needs. If you want to use it as an app on a workstation, though, it might work fine for you.

There also was (is) a project underway to provide a filter interface (called x-filter, I think) that provides a set of import/export filters for OO that would be ideal for text-indexing purposes, but I think it will be quite some time before that becomes available.

I haven't looked at it in a while, but I'd stick to Peter's first suggestion: unzip it and read the XML.

cheers
-andy
Lucene MBean service for JBoss
Hi All,

For those that may be interested, I have written a full-text indexing service for the JBoss application server that uses Lucene as its engine. It allows Lucene to be used as a service rather than a standalone app, with thread pooling, access synchronization, management, etc. Index and search interfaces are accessible via JNDI and remotely via session EJBs. Additionally, I have provided content filters for common formats like HTML, MSWord, MSExcel, XML, etc. (with some help from other projects). A simple interface also allows you to write your own filters for different formats.

It is available under an LGPL license, and source code, binaries and info are available here:

http://ejindex.sourceforge.net

I'd love to get some feedback, so if you're interested, please let me know your comments or suggestions ;)

regards,
Andy Scholz