RE: Does lucene support distributed indexing?
Solr does not do distributed indexing, but index replication: all copies are identical. Lucene has some built-in support for distributed search; please take a look at RemoteSearchable. For indexing, you can put a load balancer in front in a naïve way.

Regards,

-----Original Message-----
From: Samuel Guo [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 27, 2008 4:22 PM
To: java-user@lucene.apache.org
Subject: Re: Does lucene support distributed indexing?

Thanks a lot :)

2008/4/26 Grant Ingersoll <[EMAIL PROTECTED]>:
>
> On Apr 26, 2008, at 2:33 AM, Samuel Guo wrote:
>
> > Hi all,
> >
> > I am a Lucene newbie :)
> >
> > It seems that Lucene doesn't support distributed indexing :(
> > As some IR research papers mention, when the document collection becomes
> > large, the index becomes large as well. When a single machine can't hold
> > the whole index, some strategy is needed: for example, we can partition
> > the collection into several smaller sub-collections. Depending on how we
> > partition, we get different strategies: document-partitioning and
> > term-partitioning. I don't know why Lucene doesn't support these ways :(
> > Can anyone explain it?
>
> Because no one has donated the code to do it. You can do distributed
> indexing via Nutch, and some (albeit non fault tolerant) distributed
> search in Lucene. Solr also now has distributed search.
>
> -Grant
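For context, here is a minimal sketch of the RemoteSearchable setup mentioned above, for Lucene 2.x; the host names, index path, and RMI binding name are illustrative assumptions:

    // Server side: export an IndexSearcher over RMI.
    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RemoteSearchable;

    public class SearchServer {
        public static void main(String[] args) throws Exception {
            LocateRegistry.createRegistry(1099); // default RMI registry port
            IndexSearcher local = new IndexSearcher("/path/to/index");
            Naming.rebind("//localhost/Searchable", new RemoteSearchable(local));
        }
    }

A client can then look up one Searchable per node and wrap them in a MultiSearcher (or ParallelMultiSearcher):

    Searchable s1 = (Searchable) Naming.lookup("//host1/Searchable");
    Searchable s2 = (Searchable) Naming.lookup("//host2/Searchable");
    MultiSearcher searcher = new MultiSearcher(new Searchable[] { s1, s2 });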
Re: TrecDocMaker
Yeah, these classes are a bit weird in that they are configured via properties, not setters. They are really designed to run inside the benchmarker, and not much attention was paid to using them elsewhere. However, one can co-opt them for your purposes. Something like:

    TrecDocMaker docMaker = new TrecDocMaker();
    Properties properties = new Properties();
    properties.setProperty("doc.maker.forever", "false");
    ...
    docMaker.setConfig(new Config(properties));

(Note: I was using the EnWikiDocMaker in the above example, but it should work for Trec, too.) I often also do something like:

    while ((doc = docMaker.makeDocument()) != null && i < numDocs) {
        ...

where numDocs is the maximum number of docs I want.

HTH,
Grant

On Apr 27, 2008, at 2:31 PM, DanaWhite wrote:

> Greetings,
>
> I am trying to use TrecDocMaker so I can index and evaluate Lucene on a
> TREC collection. It seems like I would just repeatedly call makeDocument()
> until all the documents have been created, but makeDocument appears to
> just read forever.
>
> In general, TrecDocMaker seems like an odd class, and I just can't figure
> out how to use it right. I have been changing the class so it works with
> an uncompressed collection, and trying to modify it so makeDocument
> doesn't read endlessly, but no matter what I have done it just causes a
> different error. Clearly I am trying too hard.
>
> In short, what I want to know is: how am I supposed to use TrecDocMaker to
> parse my collection? Either the current Lucene implementation doesn't work
> right, or I am using it wrong.
>
> Thanks,
> Dana

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
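For what it's worth, here is a fuller, self-contained sketch of driving a doc maker into an IndexWriter along the lines Grant describes. The docs.dir property name and all paths are assumptions for illustration; check the benchmark contrib's documentation for your Lucene version:

    import java.util.Properties;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.benchmark.byTask.feeds.TrecDocMaker;
    import org.apache.lucene.benchmark.byTask.utils.Config;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class TrecIndexer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("doc.maker.forever", "false"); // stop at end of collection
            props.setProperty("docs.dir", "/path/to/trec");  // assumed property for the collection root
            TrecDocMaker docMaker = new TrecDocMaker();
            docMaker.setConfig(new Config(props));

            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            int numDocs = 10000; // cap on how many docs to index
            int i = 0;
            Document doc;
            // Loop until the doc maker runs out (following Grant's example) or we hit the cap.
            while ((doc = docMaker.makeDocument()) != null && i < numDocs) {
                writer.addDocument(doc);
                i++;
            }
            writer.close();
        }
    }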
search performance & caching
I'm using lucene 2.2.0 & have two questions: 1) Should search times be linear wrt number of queries hitting a single searcher? I've run multiple search threads against a single searcher, and the search times are very linear - 10x slower for 10 threads vs 1 thread, etc. I'm using a paralle multi-searcher with a custom hit collector. 2) I'm performing some field caching during search warmup. For an index of 3.4 million doc's and 7GB, it's taking up to 30 minutes to execute the code snippet below. Most of this time is involved with the multireader.document call (where it says "THIS TAKES THE MOST TIME"). I want to know if anyone has any ideas for speeding this up. There are multiple documents containing the same recordId. I want to figure out which two documents with the same recordId also have a documentName of CORE or WL. Then for each document in the index I store three pieces of information: - it's associated recordId - the CORE doc number for this recordId. - the WL doc number for this recordId Ideally, since the multiReader.document call is taking the most time, I'd like to not have to perform this. Although I can't figure out how to get around needing to read in the recordId. What I really need is something like a two dimensional termEnum I could iterate over - for the recordId and documentName fields. Any ideas are appreciated. // Now loop through all documents in the indexes and set the cache values. TermDocs termDocs = multiReader.termDocs(); TermEnum termEnum = multiReader.terms (new Term ("RECORD_ID", "")); try { FieldSelector fieldSelector = getFieldSelector(); List docList = new ArrayList(); int regularCoreDocId = -1; int wlCoreDocId = -1; int docId = -1; Document document = null; String documentName = null; // Loop through each RECORD_ID with termEnums do { docList.clear(); regularCoreDocId = -1; wlCoreDocId = -1; Term term = termEnum.term(); if (term == null || term.field() != field) { break; } String recordId = term.text(); // Now loop through all documents with the same recordId // using the termDocs. termDocs.seek(termEnum); while (termDocs.next()) { docId = termDocs.doc(); docList.add(Integer.valueOf(docId)); // THIS TAKES THE MOST TIME document = multiReader.document(docId, fieldSelector); documentName = document.get("DOCUMENT_NAME"); if ("CORE".equals(documentName)) { regularCoreDocId = docId; } else if ("WL".equals(documentName)) { wlCoreDocId = docId; } } // Map all docId's associated with this recordId for (Integer i : docList) { doc2RecordId [i] = recordId; } // Map from the docId to the coreData docId for // regular core and wl core documents. for (Integer i : docList) { doc2RegularCoreDoc[i] = regularCoreDocId; wlCoreDocId [i] = wlCoreDocId; } } while (termEnum.next()); } finally { termDocs.close(); termEnum.close(); } - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
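One possible direction, sketched here under the assumption that DOCUMENT_NAME is indexed (not only stored) and that CORE and WL are single untokenized terms: since only the CORE/WL distinction is read from the stored document, it could instead be precomputed from the postings of the DOCUMENT_NAME field, removing the multiReader.document call from the loop entirely:

    import java.util.BitSet;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Walk the postings for DOCUMENT_NAME:CORE and DOCUMENT_NAME:WL once,
    // recording which doc ids carry each value.
    BitSet coreDocs = new BitSet(multiReader.maxDoc());
    BitSet wlDocs = new BitSet(multiReader.maxDoc());
    TermDocs td = multiReader.termDocs(new Term("DOCUMENT_NAME", "CORE"));
    while (td.next()) {
        coreDocs.set(td.doc());
    }
    td.seek(new Term("DOCUMENT_NAME", "WL"));
    while (td.next()) {
        wlDocs.set(td.doc());
    }
    td.close();

    // Inside the RECORD_ID loop, the stored-field lookup then becomes:
    //     if (coreDocs.get(docId))    regularCoreDocId = docId;
    //     else if (wlDocs.get(docId)) wlCoreDocId = docId;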
RE: Does lucene support distributed indexing?
Solr does not do distributed indexing, but the development version _does_ do distributed search, in addition to replication. Currently, you can manually shard your data across a set of Solr instances, and then query them by adding a 'shards=localhost:8080/solr_1,localhost:8080/solr_2' parameter. See https://issues.apache.org/jira/browse/SOLR-303

Thanks,
Stu

-----Original Message-----
From: [EMAIL PROTECTED]
Sent: Monday, April 28, 2008 5:04am
To: java-user@lucene.apache.org
Subject: RE: Does lucene support distributed indexing?
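For illustration, a distributed query against two such shards might look like the following; the host names, ports, core names, and query are placeholders:

    http://localhost:8080/solr_1/select?q=title:lucene&shards=localhost:8080/solr_1,localhost:8080/solr_2

The instance that receives the request fans the query out to each shard listed in the shards parameter and merges the results before responding.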
Re: Does lucene support distributed indexing?
: There are actually several distributed indexing or searching projects in
: Lucene (the top-level ASF Lucene project, not Lucene Java), and it's
: time to start thinking about the possibility of bringing them together,
: finding commonalities, etc.

I would actually argue that almost all of the examples you listed describe "distributed searching": querying multiple shards.

As far as I know, none of them address the "distributed indexing" aspect: throw some raw data at the system and trust that it will be indexed by one (or more) shard(s) in a way that "evenly" distributes the indexing "load".

-Hoss
Re: Does lucene support distributed indexing?
That's right - most of them are about distributed searching (hence my notes about sharding being up to the app).

Hadoop's contrib/index is about distributed indexing: "This contrib package provides a utility to build or update an index using Map/Reduce. A distributed "index" is partitioned into "shards". Each shard corresponds to a Lucene instance."

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Does lucene support distributed indexing?
Hi all,

How about adding Hadoop support for distributed indexing? If required, I can start working on this, if Hadoop is the feasible option. Also, what other techniques can one think of for doing distributed indexing? Currently I am planning on extending SolrJ to keep a map of where each document has gone, to get distributed indexing.

--Thanks and Regards
Vaijanath
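A minimal sketch of the SolrJ routing idea described above; the shard URLs, field names, and hash-mod routing rule are all illustrative assumptions, using the SolrJ client API of that era:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Route each document to a shard by hashing its unique key, and
    // remember which shard each document went to.
    public class ShardedIndexer {
        private final SolrServer[] shards;
        private final Map<String, Integer> docToShard = new HashMap<String, Integer>();

        public ShardedIndexer(String[] shardUrls) throws Exception {
            shards = new SolrServer[shardUrls.length];
            for (int i = 0; i < shardUrls.length; i++) {
                shards[i] = new CommonsHttpSolrServer(shardUrls[i]);
            }
        }

        public void index(String id, String text) throws Exception {
            int shard = (id.hashCode() & 0x7fffffff) % shards.length; // naive routing rule
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("text", text);
            shards[shard].add(doc);
            docToShard.put(id, shard); // the "map of where the document has gone"
        }
    }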