Re: Date faceting - how to improve performance
You mean doc A and doc B will become one doc after adding index 2 to index 1? I don't think this is currently supported at either the Lucene level or the Solr level. If index 1 has m docs and index 2 has n docs, index 1 will have m+n docs after adding index 2 to it. Documents themselves are not modified by an index merge.

Cheers,
Ning

On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

Hmm, looking at the code for the index merger in Solr (org.apache.solr.update.DirectUpdateHandler2), I see that IndexWriter.addIndexesNoOptimize(dirs) is used (a union of indexes). And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase suggests:

add doc A to index1 with id=AAA,name=core1
add doc B to index2 with id=BBB,name=core2
merge the two indexes into one index which then contains both docs.

The resulting index will have 2 docs. Great, but in my case I think it should work more like this:

add doc A to index1 with id=X,title=blog entry title,description=blog entry description
add doc B to index2 with id=X,score=1.2
somehow add index2 to index1 so that id=X has score=1.2 when searching in index1

The resulting index should have 1 doc. So this is not really what I want, right? Sorry for being a smart-ass...

Kindly
//Marcus

On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

Guys! Thanks for these insights. I think we will head for a Lucene-level merging strategy (two or more indexes). When merging, I guess the second index needs to have the same doc ids somehow. That is an internal id in Lucene, not that easy to get hold of, right? So you are saying that the Solr ExternalFileField + FunctionQuery stuff would not work very well performance-wise, or what do you mean?
I sure like the bleeding edge :)

Cheers dudes
//Marcus

On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

I should emphasize that the ParallelReader trick I mentioned is something you'd do at the Lucene level, outside Solr, and then you'd just slip the modified index back into Solr. Or, if you like the bleeding edge, perhaps you can make use of Ning Li's Solr index merging functionality (patch in JIRA).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Otis Gospodnetic otis_gospodne...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Saturday, April 25, 2009 9:41:45 AM
Subject: Re: Date faceting - how to improve performance

Yes, you could simply round the date; no need for a non-date type field. Yes, you can add a field after the fact by making use of ParallelReader and merging (I don't recall the details; search the mailing list for ParallelReader and Andrzej) - I remember he once provided a working recipe.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Marcus Herou
To: solr-user@lucene.apache.org
Sent: Saturday, April 25, 2009 6:54:02 AM
Subject: Date faceting - how to improve performance

Hi.

One of our faceting use cases: we are creating trend graphs of how many blog posts contain a certain term, grouped by day/week/year etc. with the nice DateMathParser functions. The performance degrades really fast and consumes a lot of memory, which forces OOM from time to time. We think it is due to the fact that the cardinality of the field publishedDate in our index is huge, almost equal to the number of documents in the index. We need to address that...

Some questions:

1. Can a date field have date formats other than the default of yyyy-MM-dd HH:mm:ssZ ?

2. We are thinking of adding a field to the index which has the format yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it could perhaps be a string, but the question then is whether faceting can still be used?

3.
Since we already have such a huge index, is there a way to add a field afterwards and apply it to all documents without actually reindexing the whole shebang?

4. If the field cannot be a string, can we just leave out the hour/minute/second information to reduce the cardinality and improve performance? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work (which negates Q3). We currently have 8 shards; what would be the most efficient way to reindex the whole shebang? Dump the entire database to disk (sigh), create many XML file splits, and use curl on them in a random/hash(numServers) manner?

Kindly
//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
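The rounding idea behind questions 2 and 4 can be sketched outside Solr. The helper below is hypothetical (Solr's own date math, e.g. rounding with DateMathParser, does the equivalent server-side); it just shows why truncating timestamps to day granularity collapses the field's cardinality, which is what makes faceting cheaper:

```python
from datetime import datetime

def truncate_to_day(iso_timestamp: str) -> str:
    """Round a full timestamp down to midnight, keeping the
    'yyyy-MM-ddTHH:mm:ssZ' shape so the field can stay a date type."""
    dt = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return dt.strftime("%Y-%m-%dT00:00:00Z")

# Cardinality before vs. after truncation for a batch of post timestamps.
posts = [
    "2009-01-01T08:15:02Z",
    "2009-01-01T17:40:59Z",
    "2009-01-02T03:07:11Z",
]
full = set(posts)
daily = {truncate_to_day(p) for p in posts}
print(len(full), len(daily))  # 3 distinct values collapse to 2
```

With second-resolution timestamps the number of distinct values approaches the number of documents; after truncation it is bounded by the number of days covered, which is what the extra yyyy-MM-dd field would buy.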
Re: solr index size
Slightly different index sizes (even optimized) are normal: the same document may get different internal docids in different runs. I don't know why the numbers of terms are slightly different.

On Fri, Apr 3, 2009 at 7:21 PM, Jun Rao jun...@almaden.ibm.com wrote:

Hi,

We built a Solr index on a set of documents a few times. Each time, we did an optimize to reduce the index to a single segment. The index sizes are slightly different across different runs. Even though the documents are not inserted in the same order across runs, it seems to me that the final optimized index should be identical. Running CheckIndex showed that the numbers of docs and fields are the same, but the numbers of terms are slightly different. Does anyone know how to explain this?

Thanks,
Jun

IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA 95120-6099
jun...@almaden.ibm.com
Re: Merging Solr Indexes
There is a JIRA issue on supporting index merge: https://issues.apache.org/jira/browse/SOLR-1051. But I agree with Otis that you should go with a single index first.

Cheers,
Ning

On Wed, Apr 1, 2009 at 12:06 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,

Yes, you can write to the same index from multiple threads. You still need to keep track of the index size manually, whether you create 1 or N indices/cores. I'd go with a single index first.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: vivek sar vivex...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, April 1, 2009 4:26:04 AM
Subject: Re: Merging Solr Indexes

Thanks Otis. Could you write to the same core (same index) from multiple threads at the same time? I thought each writer would lock the index so others could not write at the same time. I'll try it though. Another reason for putting indexes in separate cores was to limit the index size. Our index can grow up to 50G a day, so I was hoping writing to smaller indexes in separate cores would be faster, and if needed I could merge them at a later point (like end of day). I want to keep daily cores. Isn't this a good idea? How else can I limit the index size (besides multiple instances or separate boxes)?

Thanks,
-vivek

On Tue, Mar 31, 2009 at 8:28 PM, Otis Gospodnetic wrote:

Let me start with 4) Have you tried simply using multiple threads to send your docs to a single Solr instance/core? You should get about the same performance as what you are trying with your approach below, but without the headache of managing multiple cores and index merging (not yet possible to do programmatically).
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: vivek sar
To: solr-user@lucene.apache.org
Sent: Tuesday, March 31, 2009 1:59:01 PM
Subject: Merging Solr Indexes

Hi,

As part of speeding up the index process I'm thinking of spawning multiple threads which will write to different temporary SolrCores. Once the index process is done I want to merge all the indexes in the temporary cores into a master core. For example, if I want one SolrCore per day, then every index cycle I'll spawn 4 threads which will index into some temporary index, and once they are done I want to merge all of these into the day core. My questions:

1) I want to use the same schema and solrconfig.xml for all cores without duplicating them - how do I do that?
2) How do I merge the temporary Solr cores into one master core programmatically? I've read the wiki on MergingSolrIndexes, but I want to do it programmatically (like in Lucene - writer.addIndexes(..)) once the temporary indices are done.
3) Can I remove the temporary indices once the merge process is done?
4) Is this the right strategy to speed up indexing?

Thanks,
-vivek
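Otis's suggestion - many feeder threads, one index - can be sketched without Lucene at all. The toy SingleCoreWriter below is hypothetical (a real Lucene IndexWriter handles its own internal synchronization); it only illustrates that concurrent producers writing through one guarded writer lose no documents, which is why separate per-thread cores plus a merge step buy little:

```python
import threading

class SingleCoreWriter:
    """Stand-in for one index core: a single writer object, guarded by a
    lock, so many feeder threads can call add() concurrently."""
    def __init__(self):
        self._lock = threading.Lock()
        self.docs = {}

    def add(self, doc_id, fields):
        with self._lock:
            self.docs[doc_id] = fields

def feeder(writer, batch):
    # Each thread plays the role of one indexing client.
    for doc_id, fields in batch:
        writer.add(doc_id, fields)

writer = SingleCoreWriter()
batches = [[(f"doc-{t}-{i}", {"thread": t}) for i in range(100)]
           for t in range(4)]
threads = [threading.Thread(target=feeder, args=(writer, b))
           for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(writer.docs))  # 400 - all four threads landed in the one "core"
```

The same shape applies to Solr clients: several HTTP-posting threads against one core, with the server serializing the actual writes.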
Lucene-based Distributed Index Leveraging Hadoop
a consistent view of the shards in the index. The results of a search query include either all or none of a recent update to the index. The details of the algorithm to accomplish this are omitted here, but the basic flow is pretty simple. After the Map/Reduce job to update the shards completes, the master will tell each shard server to prepare the new version of the index. After all the shard servers have responded affirmatively to the prepare message, the new index is ready to be queried. An index client will then lazily learn about the new index when it makes its next getShardLocations() call to the master. In essence, a lazy two-phase commit protocol is used, with prepare and commit messages piggybacked on heartbeats. After a shard has switched to the new index, the Lucene files in the old index that are no longer needed can safely be deleted.

ACHIEVING FAULT-TOLERANCE

We rely on the fault-tolerance of Map/Reduce to guarantee that an index update will eventually succeed. All shards are stored in HDFS and can be read by any shard server in a cluster. For a given shard, if one of its shard servers dies, new search requests are handled by its surviving shard servers. To ensure that there is always enough coverage for a shard, the master will instruct other shard servers to take over the shards of a dead shard server.

PERFORMANCE ISSUES

Currently, each shard server reads a shard directly from HDFS. Experiments have shown that this approach does not perform very well, with HDFS causing Lucene to slow down fairly dramatically (by well over 5x when data blocks are accessed over the network). Consequently, we are exploring different ways to leverage the fault tolerance of HDFS and, at the same time, work around its performance problems. One simple alternative is to add a local file system cache on each shard server. Another alternative is to modify HDFS so that an application has more control over where to store the primary and replicas of an HDFS block.
This feature may be useful for other HDFS applications (e.g., HBase). We would like to collaborate with other people who are interested in adding this feature to HDFS.

Regards,
Ning Li
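The lazy two-phase commit described above can be sketched in a few lines. The class and method names here are made up for illustration (the post mentions only prepare/commit messages, heartbeats, and getShardLocations()); the sketch shows the key property: a version becomes committed only after every shard server acks the prepare, and servers switch over lazily on their next heartbeat:

```python
class ShardServer:
    """Toy shard server: serves `current` index version and can stage a
    prepared-but-not-yet-live new version."""
    def __init__(self, name):
        self.name = name
        self.current = 0
        self.prepared = None

    def prepare(self, version):
        # Phase 1: stage the new shard files; ack back to the master.
        self.prepared = version
        return True

    def heartbeat(self, committed):
        # Phase 2, piggybacked on the heartbeat: switch only once the
        # master has committed the version we prepared.
        if self.prepared is not None and self.prepared <= committed:
            self.current = self.prepared  # old index files now deletable
            self.prepared = None

class Master:
    def __init__(self, servers):
        self.servers = servers
        self.committed = 0

    def publish(self, version):
        # Commit only after every shard server has acked the prepare.
        if all(s.prepare(version) for s in self.servers):
            self.committed = version

servers = [ShardServer(f"s{i}") for i in range(3)]
master = Master(servers)
master.publish(1)
for s in servers:          # servers learn lazily, one heartbeat later
    s.heartbeat(master.committed)
print([s.current for s in servers])  # [1, 1, 1]
```

Queries between publish() and the heartbeats still see the old version everywhere, which is the all-or-none consistency property the post describes.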
Re: Lucene-based Distributed Index Leveraging Hadoop
I work for IBM Research. I read the Rackspace article; Rackspace's Mailtrust has a similar design. Happy to see an existing application of such a system. Do they plan to open-source it? Is the AOL project an open-source project?

On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:

There seem to be a few other players in this space too. Are you from Rackspace? (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there, although Yonik and I had talked about it a bunch - but that was long ago.

--cw

Clay Webster tel:1.908.541.3724
Associate VP, Platform Infrastructure
http://www.cnet.com
CNET, Inc. (Nasdaq:CNET)
mailto:[EMAIL PROTECTED]
Re: Lucene-based Distributed Index Leveraging Hadoop
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:

I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this?

J.D.
Re: Lucene-based Distributed Index Leveraging Hadoop
One main focus is to provide fault-tolerance in this distributed index system. Correct me if I'm wrong, but I think SOLR-303 is focusing on merging results from multiple shards right now. We'd like to start an open source project for a fault-tolerant distributed index system (or join one if it already exists) if there is enough interest. Making Solr work on top of such a system could be an important goal, and SOLR-303 would be a big part of it in that case. I should have made it clear that disjoint data sets are not a requirement of the system.

On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote:

Hi. AOL has a couple of projects going on in the lucene/hadoop/solr space, and we will be pushing more stuff out as we can. We don't have anything going with Solr over Hadoop at the moment. I'm not sure if this would be better than what SOLR-303 does, but you should have a look at the work being done there. One of the things you mentioned is that the data sets are disjoint. SOLR-303 doesn't require this, and allows us to have a document stored in multiple shards (with different caching/update characteristics).