Re: Authenticated Indexing Not working
Hi Otis,

I am using HttpClient for authentication. When I use the server with authentication enabled for searching, it works fine. But when I use it for indexing, it throws an error.

Regards,
Allahbaksh

--
Allahbaksh Mohammedali Asadullah,
Software Engineering & Technology Labs,
Infosys Technologies Limited, Electronic City, Hosur Road, Bangalore 560 100, India.
(Board: 91-80-28520261 | Extn: 73927 | Direct: 41173927. Fax: 91-80-28520362 | Mobile: 91-9845505322.)
Re: Solr-1.4 indexing slower ?
Strange, now I am reindexing a lot of items and I am getting 1000 docs/sec again... This is really, really nice. Sorry for bothering you...

/M

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Re: Date faceting - howto improve performance
Hmm, looking in the code for the IndexMerger in Solr (org.apache.solr.update.DirectUpdateHandler(2)), I see that IndexWriter.addIndexesNoOptimize(dirs) is used (a union of the indexes)?

And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase suggests:

add doc A to index1 with id=AAA,name=core1
add doc B to index2 with id=BBB,name=core2
merge the two indexes into one index which then contains both docs.
The resulting index will have 2 docs.

Great, but in my case I think it should work more like this:

add doc A to index1 with id=X,title=blog entry title,description=blog entry description
add doc B to index2 with id=X,score=1.2
somehow add index2 to index1 so id=X has score=1.2 when searching in index1
The resulting index should have 1 doc.

So this is not really what I want, right?

Sorry for being a smart-ass...

Kindly

//Marcus
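The difference between the two merge behaviours discussed in this message can be sketched in a few lines. This is only an illustration with plain Python dictionaries standing in for documents; it does not use Lucene or Solr, and the function names `union_merge` and `join_by_id` are made up for the example.

```python
# Two tiny "indexes", modelled as lists of documents (plain dicts).
index1 = [{"id": "X", "title": "blog entry title",
           "description": "blog entry description"}]
index2 = [{"id": "X", "score": 1.2}]


def union_merge(a, b):
    """Union-style merge: every document from both inputs is kept as-is."""
    return list(a) + list(b)


def join_by_id(a, b):
    """Join-style merge: fields of documents sharing an id are combined."""
    merged = {}
    for doc in list(a) + list(b):
        merged.setdefault(doc["id"], {}).update(doc)
    return list(merged.values())


print(len(union_merge(index1, index2)))  # 2 documents, both with id=X
print(join_by_id(index1, index2))        # 1 document carrying title and score
```

A union keeps two separate documents that happen to share id=X, while the join yields the single combined document that the message asks for.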
Re: Date faceting - howto improve performance
Oh, and the indexing strategy is just a stupid random distribution across the shards. What I asked about was a "best practice" for achieving the most MB/sec when indexing. I feel that the Java API should be less efficient than something more raw like curl, but that is just my hunch.

/M
Re: Date faceting - howto improve performance
Guys!

Thanks for these insights. I think we will head for a Lucene-level merging strategy (two or more indexes).
When merging, I guess the second index needs to have the same doc ids somehow. This is an internal id in Lucene, not that easy to get hold of, right?

So are you saying that the Solr ExternalFileField + FunctionQuery stuff would not work very well performance-wise, or what do you mean?

I sure like bleeding edge :)

Cheers dudes

//Marcus
Get the field value that caused the result
Hi,

I've been following this mailing list for some time now, but I don't think this question has come up recently. If it has, sorry for repeating it :-)

I'm looking for a way to determine which value of a field caused a result to be returned.

For example, we're searching through blog posts by individual "tags". This is a multiValued field. A blog post about "Solr" can have tags like "search", "web search" and "search engine".

When I run the search, I want to know which tag caused the result to score highly. If I searched for "search", this would be the exactly matching tag "search", rather than the looser tags "search engine" or "web search". There may also be other criteria causing the post to score.

Can this tag value be retrieved? Or do I have to use debug mode and calculate the score myself? Isn't debug mode slow? Speed is obviously crucial.

Thanks in advance

Wouter Samaey
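To make the desired behaviour concrete, here is a minimal client-side sketch of "exact tag beats looser tag". It is not a Solr feature and says nothing about how Solr scores; it assumes the tags of each hit are returned as stored values, and the function name `best_matching_tag` is hypothetical.

```python
def best_matching_tag(query, tags):
    """Return the tag that matches the query most tightly, or None.

    An exact (case-insensitive) match wins over a tag that merely
    contains the query as one of its words.
    """
    q = query.strip().lower()
    partial = None
    for tag in tags:
        t = tag.strip().lower()
        if t == q:
            return tag                      # exact match: stop looking
        if partial is None and q in t.split():
            partial = tag                   # remember the first looser match
    return partial


tags = ["web search", "search", "search engine"]
print(best_matching_tag("search", tags))   # search
print(best_matching_tag("engine", tags))   # search engine
```

This only covers the single-word case from the example. Other criteria that contribute to the score are not visible to code like this.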
Re: Date faceting - howto improve performance
I should emphasize that the PR (ParallelReader) trick I mentioned is something you'd do at the Lucene level, outside Solr, and then you'd just slip the modified index back into Solr.
Or, if you like the bleeding edge, perhaps you can make use of Ning Li's Solr index merging functionality (patch in JIRA).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Authenticated Indexing Not working
My guess is you could provide the credentials to the underlying HttpClient (used by SolrJ), and let it do the authentication. I don't have the API handy, sorry.
But this may slow things down, and I have to wonder whether you really need authentication there, or whether HTTP authentication is the best way to do it.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
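At the HTTP level, "providing the credentials" for an update request amounts to sending a Basic Authorization header along with the POST. The sketch below shows only that idea with Python's standard library; it is not the SolrJ or HttpClient API, and the URL, user name and password are placeholders. The request is built but not sent.

```python
import base64
import urllib.request


def build_update_request(url, xml_body, user, password):
    """Build a POST request that carries HTTP Basic credentials up front."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Placeholder endpoint and credentials, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr/update",
    '<add><doc><field name="id">1</field></doc></add>',
    "solr_user",
    "secret",
)
print(req.get_method(), req.get_header("Authorization"))
```

Calling `urllib.request.urlopen(req)` would send it. Whether the same approach resolves the exception reported in this thread is not something the thread confirms.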
Re: Date faceting - howto improve performance
Yes, you could simply round the date; there is no need for a non-date field type.
Yes, you can add a field after the fact by making use of ParallelReader and merging. I don't recall the details; search the ML for ParallelReader and Andrzej. I remember he once provided a working recipe.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Solr-1.4 indexing slower ?
Yes, the version of Lucene has changed, but it should only be faster. A simple way to see what's happening is to get a thread dump (e.g. through the Solr admin pages) to see what the JVM is doing when things slow down. Do a few dumps and see. Perhaps the average indexing rate is slower due to larger index segments?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Authenticated Indexing Not working
Hi,

I have configured basic authentication in Solr using web.xml. It works fine when I search using SolrJ. But when I try to index with authentication enabled using SolrJ, it throws an exception.

Is secured indexing not enabled? How am I supposed to use secured indexing?

Regards,
Allahbaksh
RE: Date faceting - howto improve performance
Hi Marcus.

You must supply dates in the format that you are using now: ISO-8601 with the Z to indicate that there is no time-zone offset. To reduce cardinality to the day level instead of the second level that you currently have, the date you supply can include DateMathParser operations. So if you supply:

2009-04-01T20:15:01Z/DAY

then this will do what you think it does. Of course you then lose the ability to search at a granularity finer than a day. And the date you get back (i.e. the stored value) is the rounded date, not the date prior to rounding.

Yes, you will certainly need to re-index. Since you have architected your indexing strategy, only you know how to go about doing that. By now I'm sure you are aware that you cannot update individual fields. By the way, if your current strategy involves periodic updates, then you could simply wait until all your data eventually gets re-indexed. There's no harm in some of the dates being rounded and some not; it's just that until most of them are rounded, you will still have your current problem of sporadic OOM.

~ David
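The effect of day-level rounding on cardinality can be sketched outside Solr. The snippet below is not DateMathParser; it only mimics what rounding down to the day does to a handful of made-up timestamps written in the `T`-separated ISO-8601 form, and the helper name `round_to_day` is an assumption for the example.

```python
from datetime import datetime, timezone

ISO = "%Y-%m-%dT%H:%M:%SZ"


def round_to_day(ts):
    """Round a UTC timestamp string down to midnight of the same day."""
    dt = datetime.strptime(ts, ISO).replace(tzinfo=timezone.utc)
    return dt.replace(hour=0, minute=0, second=0).strftime(ISO)


# Made-up publication dates, unique to the second.
published = ["2009-04-01T20:15:01Z", "2009-04-01T08:00:59Z", "2009-04-02T00:00:01Z"]
rounded = [round_to_day(ts) for ts in published]

print(len(set(published)), len(set(rounded)))   # 3 2
print(round_to_day("2009-04-01T20:15:01Z"))     # 2009-04-01T00:00:00Z
```

With one distinct value per day instead of one per document, the number of unique terms in the date field is bounded by the number of days covered.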
Solr-1.4 indexing slower ?
Hi.

We upgraded to solr-trunk (1.4-dev) a few weeks ago and I've noticed that the performance really went down. I'm not sure I can blame Solr 100% though, so take these comments for what they might be (bullshit).

However:
A few months ago I know we had an indexing speed of about 100 docs/sec per shard = 800-1000 docs/sec across all shards, but now we are lucky if we get over 10 docs/sec per shard...

This is merely an observation with very little scientific research behind it, since I did not profile the app before launching 1.4 to see how well 1.3 behaved at that exact time... I launched 1.4 because the rumours said that date faceting was faster in solr-1.4, which I believe it is. That's why I failed to profile indexing speed.

Didn't Lucene change version between the two as well?

Wondering if anyone else is experiencing the same issue.

//Marcus
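A quick worked calculation puts the reported numbers side by side. The per-shard rates come from the message above; the shard count of 8 is taken from the date-faceting thread, and the corpus size of 10 million documents is purely hypothetical.

```python
def total_rate(docs_per_sec_per_shard, num_shards):
    """Aggregate indexing rate across all shards, in docs/sec."""
    return docs_per_sec_per_shard * num_shards


def hours_to_index(num_docs, docs_per_sec):
    """Wall-clock hours needed to index num_docs at a steady rate."""
    return num_docs / docs_per_sec / 3600.0


NUM_SHARDS = 8           # assumption: taken from the date-faceting thread
NUM_DOCS = 10_000_000    # hypothetical corpus size

before = total_rate(100, NUM_SHARDS)   # 800 docs/sec
after = total_rate(10, NUM_SHARDS)     # 80 docs/sec

print(before, after)
print(round(hours_to_index(NUM_DOCS, before), 1),
      round(hours_to_index(NUM_DOCS, after), 1))   # 3.5 34.7
```

A tenfold drop per shard turns a reindex that fits in an afternoon into one that takes well over a day, which is why measuring the rate before and after an upgrade is worth the effort.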
Date faceting - howto improve performance
Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts contain a certain term, grouped by day/week/year etc. with the nice DateMathParser functions.

Performance degrades really fast and the faceting consumes a lot of memory, which causes an OOM from time to time.
We think this is due to the fact that the cardinality of the field publishedDate in our index is huge, almost equal to the number of documents in the index.

We need to address that...

Some questions:

1. Can a date field have other date formats than the default of yyyy-MM-dd HH:mm:ssZ ?

2. We are thinking of adding a field to the index with the format yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it could perhaps be a string, but the question then is whether faceting can be used?

3. Since we already have such a huge index, is there a way to add a field afterwards and apply it to all documents without actually reindexing the whole shebang?

4. If the field cannot be a string, can we just leave out the hour/minute/second information to reduce the cardinality and improve performance? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work (which negates Q3). We currently have 8 shards; what would be the most efficient way to reindex the whole shebang? Dump the entire database to disk (sigh), create many XML file splits and use curl on them in a random/hash(numServers) manner?

Kindly

//Marcus
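The "hash(numServers)" distribution mentioned in question 5 could look roughly like the sketch below. It is an assumption about what is meant, not a description of the poster's setup: the document ids are invented, and CRC32 is used only because it gives the same result on every run, unlike Python's built-in `hash()` for strings.

```python
import zlib


def shard_for(doc_id, num_servers):
    """Map a document id to a shard number in [0, num_servers)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_servers


NUM_SERVERS = 8  # the message mentions 8 shards

# Assign 1000 invented document ids to shards.
shards = {n: [] for n in range(NUM_SERVERS)}
for i in range(1000):
    doc_id = f"post-{i}"
    shards[shard_for(doc_id, NUM_SERVERS)].append(doc_id)

for n in range(NUM_SERVERS):
    print(n, len(shards[n]))
```

Unlike purely random assignment, a deterministic hash always sends the same id to the same shard, so re-sending a document replaces the old copy instead of creating a duplicate on another shard.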