Re: Authenticated Indexing Not working

2009-04-25 Thread Allahbaksh Asadullah
Hi Otis,
I am using HttpClient for authentication. When I use the server with
authentication enabled for searching, it works fine. But when I use it for
indexing, it throws an error.
Regards,
Allahbaksh

On 4/25/09, Otis Gospodnetic  wrote:
>
> My guess is you could provide the credentials to the underlying HttpClient
> (used by SolrJ), and let it do the authentication.  I don't have the API
> handy, sorry.
> But this may slow things down and I have to wonder if you really really need
> authentication there or whether using HTTP authentication is the best way to
> do it.
>
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
>> From: Allahbaksh Asadullah 
>> To: solr-user@lucene.apache.org
>> Sent: Saturday, April 25, 2009 9:31:35 AM
>> Subject: Authenticated Indexing Not working
>>
>> Hi, I have configured basic authentication in Solr using web.xml. It is
>> working fine when I search using SolrJ. But when I try to index with
>> Authentication enabled using SolrJ it is throwing exception.
>>
>> Is secured indexing is not enabled? How I am suppose to use secured
>> indexing.
>>
>> Regards,
>> Allahbaksh
>
>


-- 
Allahbaksh Mohammedali Asadullah,
Software Engineering & Technology Labs,
Infosys Technolgies Limited, Electronic City,
Hosur Road, Bangalore 560 100, India.
(Board: 91-80-28520261 | Extn: 73927 | Direct: 41173927.
Fax: 91-80-28520362 | Mobile: 91-9845505322.


Re: Solr-1.4 indexing slower ?

2009-04-25 Thread Marcus Herou
Strange, now I am reindexing a lot of items and have 1000 docs/sec again...

This is really, really nice, sorry for bothering...

/M

On Sat, Apr 25, 2009 at 3:39 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

>
> Yes, versions of Lucene have changed, but should only be faster.  A simple
> way to see what's happening is to get the thread dump (e.g. through Solr
> admin pages) to see what the JVM is doing when things slow down.  Do a few
> dumps and see.  Perhaps the avg indexing rate is slower due to larger index
> segments?
>
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Marcus Herou 
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 7:00:57 AM
> > Subject: Solr-1.4 indexing slower ?
> >
> > Hi.
> > We upgraded to solr-trunk (1.4-dev) for a few weeks ago and I've notices
> > that the performance really went down. Not sure if I can blame solr 100%
> > though so take these comments for what they might be (bullshit)
> >
> > However:
> > For a few months ago I know we had indexing speed of about 100 docs/sec
> per
> > shard = 800-1000 docs/sec on all shards but now we are lucky if we get
> over
> > 10 docs/sec per shard...
> >
> > This is merely an observation with very little scientific research behind
> to
> > support it since I did not profile the app before launching 1.4 to see
> how
> > good 1.3 behaved at that exact time... I launched 1.4 due to the fact
> that
> > the rumours said that date faceting was faster in solr-1.4 which I
> believe
> > it is. That's why I missed to profile indexing speed.
> >
> > Did not Lucene as well change version between the two ?
> >
> > Wondering if anyone else experience the same issues.
> >
> > //Marcus
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.he...@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Re: Date faceting - howto improve performance

2009-04-25 Thread Marcus Herou
Hmm, looking at the code for the index merger in Solr
(org.apache.solr.update.DirectUpdateHandler(2)),

I see that IndexWriter.addIndexesNoOptimize(dirs) is used (a union of
indexes)?

And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
suggests:
add doc A to index1 with id=AAA,name=core1
add doc B to index2 with id=BBB,name=core2
merge the two indexes into one index which then contains both docs.
The resulting index will have 2 docs.

Great, but in my case I think it should work more like this:

add doc A to index1 with id=X,title=blog entry title,description=blog entry
description
add doc B to index2 with id=X,score=1.2
somehow add index2 to index1 so id=X has score=1.2 when searching in index1
The resulting index should have 1 doc.

So this is not really what I want, right?
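
The difference between the two behaviours can be sketched with a toy model. This is only an illustration in plain Python, not the Lucene or Solr API: the function names union_merge and join_by_id are hypothetical, and each "index" is just a list of dicts.

```python
def union_merge(*indexes):
    """Concatenate indexes: every document survives as its own entry."""
    merged = []
    for index in indexes:
        merged.extend(index)
    return merged


def join_by_id(primary, extra):
    """Fold the fields of `extra` into the `primary` doc with the same id."""
    extra_by_id = {doc["id"]: doc for doc in extra}
    joined = []
    for doc in primary:
        combined = dict(doc)
        combined.update(extra_by_id.get(doc["id"], {}))
        joined.append(combined)
    return joined


index1 = [{"id": "X", "title": "blog entry title",
           "description": "blog entry description"}]
index2 = [{"id": "X", "score": 1.2}]

# Union semantics: two entries, even though both carry id=X.
print(len(union_merge(index1, index2)))
# Join semantics: one entry that carries title, description and score.
print(join_by_id(index1, index2))
```

A union-style merge keeps both entries, which is what the test class describes; the join is the behaviour asked for here.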

Sorry for being a smart-ass...

Kindly

//Marcus





On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou wrote:

> Guys!
>
> Thanks for these insights, I think we will head for Lucene level merging
> strategy (two or more indexes).
> When merging I guess the second index need to have the same doc ids
> somehow. This is an internal id in Lucene, not that easy to get hold of
> right ?
>
> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> would not work very well performance wise or what do you mean ?
>
> I sure like bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>>
>> I should emphasize that the PR trick I mentioned is something you'd do at
>> the Lucene level, outside Solr, and then you'd just slip the modified index
>> back into Solr.
>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> - Original Message 
>> > From: Otis Gospodnetic 
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> and
>> > merging (I don't recall the details, search the ML for ParallelReader
>> and
>> > Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > - Original Message 
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts that contains a
>> certain
>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>> > > functions.
>> > >
>> > > The performance degrades really fast and consumes a lot of memory
>> which
>> > > forces OOM from time to time
>> > > We think it is due the fact that the cardinality of the field
>> publishedDate
>> > > in our index is huge, almost equal to the nr of documents in the
>> index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date-formats than the default of
>> yyyy-MM-dd
>> > > HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which have the
>> format
>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>> it
>> > > could perhaps be a string, but the question then is if faceting can be
>> used
>> > > ?
>> > >
>> > > 3. Since we now already have such a huge index, is there a way to add
>> a
>> > > field afterwards and apply it to all documents without actually
>> reindexing
>> > > the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string can we just leave out the
>> > > hour/minute/second information and to reduce the cardinality and
>> improve
>> > > performance ? Example: 2009-01-01 00:00:00Z
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to work
>> > > (negates Q3). We have 8 shards as of current, what would the most
>> efficient
>> > > way be to reindexing the whole shebang ? Dump the entire database to
>> disk
>> > > (sigh), create many xml file splits and use curl in a
>> > > random/hash(numServers) manner on them ?
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.he...@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.

Re: Date faceting - howto improve performance

2009-04-25 Thread Marcus Herou
Oh, and the indexing strategy is just a stupid random distribution across the
shards.

What I asked about was a "best practice" for achieving the most MB/sec when
indexing. I feel that the Java API should be less efficient than something
more raw like curl, but that is just my hunch.
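
As a contrast to random assignment, here is a minimal sketch of hash-based routing. It assumes every document has a stable string id; the names shard_for and partition are made up for illustration, and CRC32 is just one convenient stable hash, not something Solr prescribes.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in a repeatable way."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Group ids into one batch per shard, ready to be posted separately."""
    batches = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        batches[shard_for(doc_id, num_shards)].append(doc_id)
    return batches


batches = partition(["post-%d" % i for i in range(1000)], 8)
print([len(batch) for batch in batches])
```

The point of hashing rather than picking at random is that a re-sent document goes to the same shard as before, so it can replace the earlier copy instead of ending up on two shards.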

/M


On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou wrote:

> Guys!
>
> Thanks for these insights, I think we will head for Lucene level merging
> strategy (two or more indexes).
> When merging I guess the second index need to have the same doc ids
> somehow. This is an internal id in Lucene, not that easy to get hold of
> right ?
>
> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> would not work very well performance wise or what do you mean ?
>
> I sure like bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>>
>> I should emphasize that the PR trick I mentioned is something you'd do at
>> the Lucene level, outside Solr, and then you'd just slip the modified index
>> back into Solr.
>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> - Original Message 
>> > From: Otis Gospodnetic 
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> and
>> > merging (I don't recall the details, search the ML for ParallelReader
>> and
>> > Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > - Original Message 
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts that contains a
>> certain
>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>> > > functions.
>> > >
>> > > The performance degrades really fast and consumes a lot of memory
>> which
>> > > forces OOM from time to time
>> > > We think it is due the fact that the cardinality of the field
>> publishedDate
>> > > in our index is huge, almost equal to the nr of documents in the
>> index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date-formats than the default of
>> yyyy-MM-dd
>> > > HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which have the
>> format
>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>> it
>> > > could perhaps be a string, but the question then is if faceting can be
>> used
>> > > ?
>> > >
>> > > 3. Since we now already have such a huge index, is there a way to add
>> a
>> > > field afterwards and apply it to all documents without actually
>> reindexing
>> > > the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string can we just leave out the
>> > > hour/minute/second information and to reduce the cardinality and
>> improve
>> > > performance ? Example: 2009-01-01 00:00:00Z
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to work
>> > > (negates Q3). We have 8 shards as of current, what would the most
>> efficient
>> > > way be to reindexing the whole shebang ? Dump the entire database to
>> disk
>> > > (sigh), create many xml file splits and use curl in a
>> > > random/hash(numServers) manner on them ?
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.he...@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Re: Date faceting - howto improve performance

2009-04-25 Thread Marcus Herou
Guys!

Thanks for these insights. I think we will head for a Lucene-level merging
strategy (two or more indexes).
When merging, I guess the second index needs to have the same doc ids somehow.
This is an internal id in Lucene, not that easy to get hold of, right?

So you are saying that the Solr ExternalFileField + FunctionQuery stuff
would not work very well performance-wise, or what do you mean?

I sure like bleeding edge :)

Cheers dudes

//Marcus




On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

>
> I should emphasize that the PR trick I mentioned is something you'd do at
> the Lucene level, outside Solr, and then you'd just slip the modified index
> back into Solr.
> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
> Solr index merging functionality (patch in JIRA).
>
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Otis Gospodnetic 
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 9:41:45 AM
> > Subject: Re: Date faceting - howto improve performance
> >
> >
> > Yes, you could simply round the date, no need for a non-date type field.
> > Yes, you can add a field after the fact by making use of ParallelReader
> and
> > merging (I don't recall the details, search the ML for ParallelReader and
> > Andrzej), I remember he once provided the working recipe.
> >
> >
> > Otis --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> > > From: Marcus Herou
> > > To: solr-user@lucene.apache.org
> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> > > Subject: Date faceting - howto improve performance
> > >
> > > Hi.
> > >
> > > One of our faceting use-cases:
> > > We are creating trend graphs of how many blog posts that contains a
> certain
> > > term and groups it by day/week/year etc. with the nice DateMathParser
> > > functions.
> > >
> > > The performance degrades really fast and consumes a lot of memory which
> > > forces OOM from time to time
> > > We think it is due the fact that the cardinality of the field
> publishedDate
> > > in our index is huge, almost equal to the nr of documents in the index.
> > >
> > > We need to address that...
> > >
> > > Some questions:
> > >
> > > 1. Can a datefield have other date-formats than the default of
> yyyy-MM-dd
> > > HH:mm:ssZ ?
> > >
> > > 2. We are thinking of adding a field to the index which have the format
> > > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > > could perhaps be a string, but the question then is if faceting can be
> used
> > > ?
> > >
> > > 3. Since we now already have such a huge index, is there a way to add a
> > > field afterwards and apply it to all documents without actually
> reindexing
> > > the whole shebang ?
> > >
> > > 4. If the field cannot be a string can we just leave out the
> > > hour/minute/second information and to reduce the cardinality and
> improve
> > > performance ? Example: 2009-01-01 00:00:00Z
> > >
> > > 5. I am afraid that we need to reindex everything to get this to work
> > > (negates Q3). We have 8 shards as of current, what would the most
> efficient
> > > way be to reindexing the whole shebang ? Dump the entire database to
> disk
> > > (sigh), create many xml file splits and use curl in a
> > > random/hash(numServers) manner on them ?
> > >
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.he...@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Get the field value that caused the result

2009-04-25 Thread Wouter Samaey
Hi,

I've been following this mailing list for some time now, but I don't
think this question has come up recently. If it has, sorry for repeating
it :-)

I'm looking for a way to determine the value of a field that caused
the result to be returned.
For example, we're searching through blog posts by individual "tags".
This is a multiValued field.

A blog post about "Solr" can have tags like "search", "web search" and
"search engine". When I run the search, I want to know what tag caused
the result to score big. If I searched for "search", this would be the
exact matching tag "search", over the looser tag "search engine" or
"web search". There may also be other criteria, causing the post to
score.

Can this tag value be retrieved?
Or do I have to use the debug mode, and calculate the score? Isn't
debug mode slow?
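
One possible approach, sketched here purely as client-side post-processing, is to look at the tag values that come back with each result and rank them against the query. This is an assumption-laden approximation: best_matching_tag is a hypothetical helper, it uses naive lowercase and whitespace tokenization, and it does not reproduce Solr's analysis or scoring.

```python
def best_matching_tag(query, tags):
    """Return the tag that matches the query most closely, or None.

    An exact match (rank 2) beats a tag that merely shares a word with
    the query (rank 1); among equals, the first tag wins.
    """
    q = query.strip().lower()
    q_tokens = set(q.split())
    best, best_rank = None, 0
    for tag in tags:
        t = tag.strip().lower()
        if t == q:
            rank = 2
        elif q_tokens & set(t.split()):
            rank = 1
        else:
            rank = 0
        if rank > best_rank:
            best, best_rank = tag, rank
    return best


print(best_matching_tag("search", ["web search", "search", "search engine"]))
```

Because this runs on the returned tag values only, it adds no query-time cost on the server, but it can disagree with the real score when other criteria contributed to the match.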

Speed is obviously crucial.

Thanks in advance

Wouter Samaey


Re: Date faceting - howto improve performance

2009-04-25 Thread Otis Gospodnetic

I should emphasize that the PR (ParallelReader) trick I mentioned is something
you'd do at the Lucene level, outside Solr, and then you'd just slip the
modified index back into Solr.
Or, if you like the bleeding edge, perhaps you can make use of Ning Li's Solr 
index merging functionality (patch in JIRA).


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Otis Gospodnetic 
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 9:41:45 AM
> Subject: Re: Date faceting - howto improve performance
> 
> 
> Yes, you could simply round the date, no need for a non-date type field.
> Yes, you can add a field after the fact by making use of ParallelReader and 
> merging (I don't recall the details, search the ML for ParallelReader and 
> Andrzej), I remember he once provided the working recipe.
> 
> 
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
> > From: Marcus Herou 
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 6:54:02 AM
> > Subject: Date faceting - howto improve performance
> > 
> > Hi.
> > 
> > One of our faceting use-cases:
> > We are creating trend graphs of how many blog posts that contains a certain
> > term and groups it by day/week/year etc. with the nice DateMathParser
> > functions.
> > 
> > The performance degrades really fast and consumes a lot of memory which
> > forces OOM from time to time
> > We think it is due the fact that the cardinality of the field publishedDate
> > in our index is huge, almost equal to the nr of documents in the index.
> > 
> > We need to address that...
> > 
> > Some questions:
> > 
> > 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> > HH:mm:ssZ ?
> > 
> > 2. We are thinking of adding a field to the index which have the format
> > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > could perhaps be a string, but the question then is if faceting can be used
> > ?
> > 
> > 3. Since we now already have such a huge index, is there a way to add a
> > field afterwards and apply it to all documents without actually reindexing
> > the whole shebang ?
> > 
> > 4. If the field cannot be a string can we just leave out the
> > hour/minute/second information and to reduce the cardinality and improve
> > performance ? Example: 2009-01-01 00:00:00Z
> > 
> > 5. I am afraid that we need to reindex everything to get this to work
> > (negates Q3). We have 8 shards as of current, what would the most efficient
> > way be to reindexing the whole shebang ? Dump the entire database to disk
> > (sigh), create many xml file splits and use curl in a
> > random/hash(numServers) manner on them ?
> > 
> > 
> > Kindly
> > 
> > //Marcus
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -- 
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.he...@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/



Re: Authenticated Indexing Not working

2009-04-25 Thread Otis Gospodnetic

My guess is you could provide the credentials to the underlying HttpClient 
(used by SolrJ), and let it do the authentication.  I don't have the API handy, 
sorry.
But this may slow things down and I have to wonder if you really really need 
authentication there or whether using HTTP authentication is the best way to do 
it.
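
To make the idea of handing credentials to the HTTP layer concrete, here is a minimal sketch that sends them preemptively on an update request. It does not use SolrJ or HttpClient; it is plain Python, and the URL, user name, and password are placeholders. The underlying guess, which is an assumption and not confirmed in this thread, is that an update is a POST with a body, which some clients cannot replay after a 401 challenge, whereas a search is a simple GET.

```python
import base64
import urllib.request


def build_update_request(url, xml_body, user, password):
    """Build a POST request that already carries a Basic auth header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url, data=xml_body.encode("utf-8"), method="POST")
    request.add_header("Content-Type", "text/xml; charset=utf-8")
    # Sent up front, so the server never has to answer with a 401 first.
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder endpoint and credentials, for illustration only.
request = build_update_request(
    "http://localhost:8983/solr/update",
    '<add><doc><field name="id">1</field></doc></add>',
    "solr_user", "secret")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```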


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Allahbaksh Asadullah 
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 9:31:35 AM
> Subject: Authenticated Indexing Not working
> 
> Hi, I have configured basic authentication in Solr using web.xml. It is
> working fine when I search using SolrJ. But when I try to index with
> Authentication enabled using SolrJ it is throwing exception.
> 
> Is secured indexing is not enabled? How I am suppose to use secured
> indexing.
> 
> Regards,
> Allahbaksh



Re: Date faceting - howto improve performance

2009-04-25 Thread Otis Gospodnetic

Yes, you could simply round the date; no need for a non-date type field.
Yes, you can add a field after the fact by making use of ParallelReader and 
merging (I don't recall the details; search the mailing list archives for
ParallelReader and Andrzej, who I remember once provided a working recipe).
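
As a rough picture of why the ParallelReader approach is fiddly: as far as I understand it, the parallel indexes are lined up by internal document number, so the side index must contain the same number of documents, added in the same order as the main index. The toy model below shows only that alignment rule. It is plain Python, not the Lucene API, and parallel_view and build_side_index are hypothetical names.

```python
def build_side_index(main_index, values_by_id, field, default=None):
    """Create one side-index entry per main doc, in the main index's order."""
    return [{field: values_by_id.get(doc["id"], default)} for doc in main_index]


def parallel_view(main_index, side_index):
    """Combine entry i of both indexes into one logical document."""
    if len(main_index) != len(side_index):
        raise ValueError("parallel indexes must hold the same number of docs")
    return [{**main_doc, **side_doc}
            for main_doc, side_doc in zip(main_index, side_index)]


main = [{"id": "a", "publishedDate": "2009-04-01T20:15:01Z"},
        {"id": "b", "publishedDate": "2009-04-02T08:00:00Z"}]
days = {"b": "2009-04-02", "a": "2009-04-01"}

side = build_side_index(main, days, "publishedDay")
print(parallel_view(main, side))
```

Note that the side index is built by walking the main index in order, not by iterating over the new values; that is what keeps position i meaning the same document in both.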


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Marcus Herou 
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 6:54:02 AM
> Subject: Date faceting - howto improve performance
> 
> Hi.
> 
> One of our faceting use-cases:
> We are creating trend graphs of how many blog posts that contains a certain
> term and groups it by day/week/year etc. with the nice DateMathParser
> functions.
> 
> The performance degrades really fast and consumes a lot of memory which
> forces OOM from time to time
> We think it is due the fact that the cardinality of the field publishedDate
> in our index is huge, almost equal to the nr of documents in the index.
> 
> We need to address that...
> 
> Some questions:
> 
> 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> HH:mm:ssZ ?
> 
> 2. We are thinking of adding a field to the index which have the format
> yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> could perhaps be a string, but the question then is if faceting can be used
> ?
> 
> 3. Since we now already have such a huge index, is there a way to add a
> field afterwards and apply it to all documents without actually reindexing
> the whole shebang ?
> 
> 4. If the field cannot be a string can we just leave out the
> hour/minute/second information and to reduce the cardinality and improve
> performance ? Example: 2009-01-01 00:00:00Z
> 
> 5. I am afraid that we need to reindex everything to get this to work
> (negates Q3). We have 8 shards as of current, what would the most efficient
> way be to reindexing the whole shebang ? Dump the entire database to disk
> (sigh), create many xml file splits and use curl in a
> random/hash(numServers) manner on them ?
> 
> 
> Kindly
> 
> //Marcus
> 
> 
> 
> 
> 
> 
> 
> -- 
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/



Re: Solr-1.4 indexing slower ?

2009-04-25 Thread Otis Gospodnetic

Yes, versions of Lucene have changed, but should only be faster.  A simple way 
to see what's happening is to get the thread dump (e.g. through Solr admin 
pages) to see what the JVM is doing when things slow down.  Do a few dumps and 
see.  Perhaps the avg indexing rate is slower due to larger index segments?


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Marcus Herou 
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 7:00:57 AM
> Subject: Solr-1.4 indexing slower ?
> 
> Hi.
> We upgraded to solr-trunk (1.4-dev) for a few weeks ago and I've notices
> that the performance really went down. Not sure if I can blame solr 100%
> though so take these comments for what they might be (bullshit)
> 
> However:
> For a few months ago I know we had indexing speed of about 100 docs/sec per
> shard = 800-1000 docs/sec on all shards but now we are lucky if we get over
> 10 docs/sec per shard...
> 
> This is merely an observation with very little scientific research behind to
> support it since I did not profile the app before launching 1.4 to see how
> good 1.3 behaved at that exact time... I launched 1.4 due to the fact that
> the rumours said that date faceting was faster in solr-1.4 which I believe
> it is. That's why I missed to profile indexing speed.
> 
> Did not Lucene as well change version between the two ?
> 
> Wondering if anyone else experience the same issues.
> 
> //Marcus
> 
> -- 
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/



Authenticated Indexing Not working

2009-04-25 Thread Allahbaksh Asadullah
Hi, I have configured basic authentication in Solr using web.xml. It is
working fine when I search using SolrJ. But when I try to index with
Authentication enabled using SolrJ it is throwing exception.

Is secured indexing is not enabled? How I am suppose to use secured
indexing.

Regards,
Allahbaksh


RE: Date faceting - howto improve performance

2009-04-25 Thread Smiley, David W.
Hi Marcus.

You must supply dates in the format that you are using now -- ISO-8601 with the 
Z to indicate there is no time-zone offset.  To reduce cardinality to 
the day level instead of the second-level precision you currently have, the 
date you supply can include DateMathParser operations.  So if you supply:  
2009-04-01T20:15:01Z/DAY then the value is rounded down to the start of that
day.  Of course you then lose the ability to search at a granularity finer
than a day.  
And the date you get back (i.e. the stored value) is the rounded date, not the 
date prior to rounding.
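
For anyone who prefers to round on the client before sending documents, the effect of /DAY can be mimicked as follows. This is a minimal sketch that assumes dates arrive in the canonical UTC form with a T separator; round_down_to_day is a made-up helper name.

```python
from datetime import datetime, timezone

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def round_down_to_day(solr_date):
    """Drop the time-of-day part, keeping the UTC calendar day."""
    parsed = datetime.strptime(solr_date, SOLR_FORMAT).replace(tzinfo=timezone.utc)
    floored = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored.strftime(SOLR_FORMAT)


print(round_down_to_day("2009-04-01T20:15:01Z"))  # 2009-04-01T00:00:00Z
```

As with /DAY, the original time of day is gone once the rounded value is what gets indexed and stored.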

Yes you will certainly need to re-index.  Since you have architected your 
indexing strategy, only you know how to go about doing that.  By now I'm sure 
you are aware that you cannot update individual fields.  By the way, if your 
current strategy involves periodic updates then you could take the strategy of 
simply waiting until all your data eventually gets re-indexed.  There's no harm 
in some of the dates being rounded and some not -- it's just that until most of 
them are rounded, you have your current problem of sporadic OOM.

~ David

From: Marcus Herou [marcus.he...@tailsweep.com]
Sent: Saturday, April 25, 2009 6:54 AM
To: solr-user@lucene.apache.org
Subject: Date faceting - howto improve performance

Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts that contains a certain
term and groups it by day/week/year etc. with the nice DateMathParser
functions.

The performance degrades really fast and consumes a lot of memory which
forces OOM from time to time
We think it is due the fact that the cardinality of the field publishedDate
in our index is huge, almost equal to the nr of documents in the index.

We need to address that...

Some questions:

1. Can a datefield have other date-formats than the default of yyyy-MM-dd
HH:mm:ssZ ?

2. We are thinking of adding a field to the index which have the format
yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
could perhaps be a string, but the question then is if faceting can be used
?

3. Since we now already have such a huge index, is there a way to add a
field afterwards and apply it to all documents without actually reindexing
the whole shebang ?

4. If the field cannot be a string can we just leave out the
hour/minute/second information and to reduce the cardinality and improve
performance ? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work
(negates Q3). We have 8 shards as of current, what would the most efficient
way be to reindexing the whole shebang ? Dump the entire database to disk
(sigh), create many xml file splits and use curl in a
random/hash(numServers) manner on them ?


Kindly

//Marcus







--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Solr-1.4 indexing slower ?

2009-04-25 Thread Marcus Herou
Hi.
We upgraded to solr-trunk (1.4-dev) a few weeks ago and I've noticed
that the performance really went down. Not sure if I can blame Solr 100%
though, so take these comments for what they might be (bullshit).

However:
A few months ago I know we had an indexing speed of about 100 docs/sec per
shard = 800-1000 docs/sec on all shards, but now we are lucky if we get over
10 docs/sec per shard...

This is merely an observation with very little scientific research behind
it, since I did not profile the app before launching 1.4 to see how
well 1.3 behaved at that exact time... I launched 1.4 because
rumours said that date faceting was faster in solr-1.4, which I believe
it is. That's why I failed to profile indexing speed.
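
For a before/after comparison, the numbers above can be captured with a very small timing harness. This is only a sketch: index_batch stands for whatever call actually sends documents to a shard, and all three function names are hypothetical.

```python
import time


def docs_per_second(doc_count, elapsed_seconds):
    """Indexing rate for one batch."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return doc_count / elapsed_seconds


def time_batch(index_batch, docs, clock=time.perf_counter):
    """Run index_batch(docs) once and return the observed docs/sec."""
    start = clock()
    index_batch(docs)
    return docs_per_second(len(docs), clock() - start)


def total_rate(per_shard_rate, num_shards):
    """Aggregate rate if every shard sustains the same per-shard rate."""
    return per_shard_rate * num_shards


# 100 docs/sec on each of 8 shards is the 800 docs/sec quoted above.
print(total_rate(100, 8))
```

Recording the rate per shard on each version, with the same documents, would show whether the slowdown is uniform or limited to some shards.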

Didn't Lucene change version as well between the two?

Wondering if anyone else is experiencing the same issues.

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Date faceting - howto improve performance

2009-04-25 Thread Marcus Herou
Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts contain a certain
term, grouped by day/week/year etc. with the nice DateMathParser
functions.

Performance degrades really fast and consumes a lot of memory, which
forces an OOM from time to time.
We think it is due to the fact that the cardinality of the field publishedDate
in our index is huge, almost equal to the number of documents in the index.
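
To put a rough number on the cardinality point, the sketch below counts distinct field values at second and at day resolution. The data is synthetic (one post every 37 seconds, an arbitrary choice for illustration) and distinct_values is a hypothetical helper, so the figures say nothing about the real index.

```python
from datetime import datetime, timedelta


def distinct_values(timestamps, fmt):
    """Number of distinct strings the field would hold at this resolution."""
    return len({ts.strftime(fmt) for ts in timestamps})


start = datetime(2009, 4, 1)
# Synthetic data: 10,000 posts, one every 37 seconds.
posts = [start + timedelta(seconds=37 * i) for i in range(10000)]

per_second = distinct_values(posts, "%Y-%m-%dT%H:%M:%SZ")
per_day = distinct_values(posts, "%Y-%m-%d")
print(per_second, per_day)
```

At second resolution nearly every document has its own value, while at day resolution the same documents collapse into a handful of values.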

We need to address that...

Some questions:

1. Can a datefield have other date formats than the default of yyyy-MM-dd
HH:mm:ssZ ?

2. We are thinking of adding a field to the index which has the format
yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
could perhaps be a string, but the question then is whether faceting can be
used?

3. Since we already have such a huge index, is there a way to add a
field afterwards and apply it to all documents without actually reindexing
the whole shebang?

4. If the field cannot be a string, can we just leave out the
hour/minute/second information to reduce the cardinality and improve
performance? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work
(negates Q3). We currently have 8 shards; what would be the most efficient
way to reindex the whole shebang? Dump the entire database to disk
(sigh), create many XML file splits and use curl in a
random/hash(numServers) manner on them?


Kindly

//Marcus







-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/