Re: Solr performance issue

2011-03-23 Thread Doğacan Güney
Hello,

The problem turned out to be some sort of sharding/searching weirdness. We
modified some code in sharding, but I don't think that is related. In any
case, we just added a new server that only does the sharding fan-out (it
performs no searching of its own and holds no index), and performance is now
very, very good.
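
For the archives: the aggregator is an ordinary Solr instance that holds no
index and only merges shard responses. Queried from solrj it looks roughly
like this sketch (hostnames are made up, and whether the shards parameter
lives in the request or in solrconfig is a deployment detail):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class AggregatorQuery {
      public static void main(String[] args) throws Exception {
        // The index-less aggregator node.
        CommonsHttpSolrServer aggregator =
            new CommonsHttpSolrServer("http://aggregator:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        // Fan the query out to the slaves that hold the actual index.
        q.set("shards", "slave1:8983/solr,slave2:8983/solr");
        q.setFields("id");

        QueryResponse rsp = aggregator.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
      }
    }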

Thanks for all the help.

On Tue, Mar 22, 2011 at 14:30, Alexey Serba  wrote:

> > Btw, I am monitoring via jconsole with 8gb of ram and the heap still
> > goes to 8gb every 20 seconds or so; gc runs and it falls back down to
> > 1gb.
>
> Hmm, the jvm eating 8Gb every 20 seconds - that sounds like a lot.
>
> Do you return all results (ids) for your queries? Any tricky
> faceting/sorting/function queries?
>



-- 
Doğacan Güney


Re: Solr performance issue

2011-03-14 Thread Doğacan Güney
2011/3/14 Markus Jelsma 

> Mmm. SearchHandler.handleRequestBody takes care of sharding. Could your
> system suffer from
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock ?
>
>
We increased the thread limit (which was 1 before), but it did not help.

Anyway, we will try to disable sharding tomorrow. Maybe this can give us a
better picture.

Thanks for the help, everyone.


> I'm not sure, I haven't seen a similar issue in a sharded environment,
> probably because it was a controlled environment.
>
>
> > Hello,
> >
> > 2011/3/14 Markus Jelsma
> >
> > > That depends on your GC settings and generation sizes. And, instead of
> > > UseParallelGC you'd better use UseParNewGC in combination with CMS.
> >
> > JConsole now shows a different profile output but load is still high and
> > performance is still bad.
> >
> > Btw, here is the thread profile from newrelic:
> >
> > https://skitch.com/meralan/rwscm/thread-profiler-solr-new-relic-rpm
> >
> > Note that we do use a form of sharding, so maybe all the time spent
> > waiting in handleRequestBody results from sharding?
> >
> > > See 22: http://java.sun.com/docs/hotspot/gc1.4.2/faq.html
> > >
> > > > It's actually, as I understand it, expected JVM behavior to see the
> > > > heap rise to close to its limit before it gets GC'd; that's how Java
> > > > GC works.  Whether that should happen every 20 seconds or what, I
> > > > don't know.
> > > >
> > > > Another option is setting better JVM garbage collection arguments, so
> > > > GC doesn't "stop the world" so often. I have had good luck with my
> > > > Solr using this:  -XX:+UseParallelGC
> > > >
> > > > On 3/14/2011 4:15 PM, Doğacan Güney wrote:
> > > > > Hello again,
> > > > >
> > > > > 2011/3/14 Markus Jelsma
> > > > >
> > > > >>> Hello,
> > > > >>>
> > > > >>> 2011/3/14 Markus Jelsma
> > > > >>>
> > > > >>>> Hi Doğacan,
> > > > >>>>
> > > > >>>> Are you, at some point, running out of heap space? In my
> > > > >>>> experience, that's the common cause of increased load and
> > > > >>>> excessively high response times (or timeouts).
> > > > >>>
> > > > >>> How much of a heap size would be enough? Our index size is
> > > > >>> growing slowly, but we did not have this problem a couple of
> > > > >>> weeks ago, when the index was maybe 100mb smaller.
> > > > >>
> > > > >> Telling how much heap space is needed isn't easy. It usually
> > > > >> needs to be increased when you run out of memory and get those
> > > > >> nasty OOM errors; are you getting them?
> > > > >> Replication events will increase heap usage due to cache warming
> > > > >> queries and autowarming.
> > > > >
> > > > > Nope, no OOM errors.
> > > > >
> > > > >>> We left most of the caches in solrconfig at their defaults and
> > > > >>> only increased filterCache to 1024. We only ask for "id"s (which
> > > > >>> are unique) and no other fields during queries (though we do
> > > > >>> faceting). Btw, 1.6gb of our index is stored fields (we store
> > > > >>> everything for now, even though we do not fetch them during
> > > > >>> queries), and about 1gb is index data.
> > > > >>
> > > > >> Hmm, it seems 4000 would be enough indeed. What about the
> > > > >> fieldCache, are there a lot of entries? Is there an insanity
> > > > >> count? Do you use boost functions?
> > > > >
> > > > > Insanity count is 0 and fieldCache has 12 entries. We do use some
> > > > > boosting functions.

Re: Solr performance issue

2011-03-14 Thread Doğacan Güney
Hello,

2011/3/14 Markus Jelsma 

> That depends on your GC settings and generation sizes. And, instead of
> UseParallelGC you'd better use UseParNewGC in combination with CMS.
>
>
JConsole now shows a different profile output but load is still high and
performance is still bad.
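
For reference, my understanding is that this advice translates to launcher
flags roughly like the following (a sketch only; the heap size is ours and
nothing here is tuned):

    java -Xms8000m -Xmx8000m \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -jar start.jar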

Btw, here is the thread profile from newrelic:

https://skitch.com/meralan/rwscm/thread-profiler-solr-new-relic-rpm

Note that we do use a form of sharding, so maybe all the time spent waiting
in handleRequestBody results from sharding?


> See 22: http://java.sun.com/docs/hotspot/gc1.4.2/faq.html
>
> > It's actually, as I understand it, expected JVM behavior to see the heap
> > rise to close to its limit before it gets GC'd; that's how Java GC
> > works.  Whether that should happen every 20 seconds or what, I don't
> > know.
> >
> > Another option is setting better JVM garbage collection arguments, so GC
> > doesn't "stop the world" so often. I have had good luck with my Solr
> > using this:  -XX:+UseParallelGC
> >
> > On 3/14/2011 4:15 PM, Doğacan Güney wrote:
> > > Hello again,
> > >
> > > 2011/3/14 Markus Jelsma
> > >
> > >>> Hello,
> > >>>
> > >>> 2011/3/14 Markus Jelsma
> > >>>
> > >>>> Hi Doğacan,
> > >>>>
> > >>>> Are you, at some point, running out of heap space? In my experience,
> > >>>> that's the common cause of increased load and excessively high
> > >>>> response times (or timeouts).
> > >>>
> > >>> How much of a heap size would be enough? Our index size is growing
> > >>> slowly, but we did not have this problem a couple of weeks ago, when
> > >>> the index was maybe 100mb smaller.
> > >>
> > >> Telling how much heap space is needed isn't easy. It usually needs to
> > >> be increased when you run out of memory and get those nasty OOM
> > >> errors; are you getting them?
> > >> Replication events will increase heap usage due to cache warming
> > >> queries and autowarming.
> > >
> > > Nope, no OOM errors.
> > >
> > >>> We left most of the caches in solrconfig at their defaults and only
> > >>> increased filterCache to 1024. We only ask for "id"s (which are
> > >>> unique) and no other fields during queries (though we do faceting).
> > >>> Btw, 1.6gb of our index is stored fields (we store everything for
> > >>> now, even though we do not fetch them during queries), and about
> > >>> 1gb is index data.
> > >>
> > >> Hmm, it seems 4000 would be enough indeed. What about the fieldCache,
> > >> are there a lot of entries? Is there an insanity count? Do you use
> > >> boost functions?
> > >
> > > Insanity count is 0 and fieldCache has 12 entries. We do use some
> > > boosting functions.
> > >
> > > Btw, I am monitoring via jconsole with 8gb of ram and the heap still
> > > goes to 8gb every 20 seconds or so; gc runs and it falls back down to
> > > 1gb.
> > >
> > > Btw, our current revision was just a random choice, but up until two
> > > weeks ago it had been rock-solid, so we have been reluctant to update
> > > to another version. Would you recommend upgrading to latest trunk?
> > >
> > >> It might not have anything to do with memory at all but i'm just
> > >> asking. There may be a bug in your revision causing this.
> > >>
> > >>> Anyway, Xmx was 4000m, we tried increasing it to 8000m but did not
> > >>> get any improvement in load. I can try monitoring with JConsole
> > >>> with 8 gigs of heap to see if it helps.
> > >>>
> > >>>> Cheers,
> > >>>>
> > >>>>> Hello everyone,
> > >>>>>
> > >>>>> First of all here is our Solr setup:
> > >>>>>
> > >>>>> - Solr nightly build 986158
> > >>>>> - Running solr inside the default jetty that comes with the solr
> > >>>>>   build
> > >>>>> - 1 write-only master, 4 read-only slaves (quad core 5640 with
> > >>>>>   24gb of RAM) -

Re: Solr performance issue

2011-03-14 Thread Doğacan Güney
Hello again,

2011/3/14 Markus Jelsma 

> > Hello,
> >
> > 2011/3/14 Markus Jelsma 
> >
> > > Hi Doğacan,
> > >
> > > Are you, at some point, running out of heap space? In my experience,
> > > that's the common cause of increased load and excessively high
> > > response times (or timeouts).
> >
> > How much of a heap size would be enough? Our index size is growing
> > slowly, but we did not have this problem a couple of weeks ago, when
> > the index was maybe 100mb smaller.
>
> Telling how much heap space is needed isn't easy. It usually needs to be
> increased when you run out of memory and get those nasty OOM errors; are
> you getting them?
> Replication events will increase heap usage due to cache warming queries
> and autowarming.
>
>
Nope, no OOM errors.


> >
> > We left most of the caches in solrconfig at their defaults and only
> > increased filterCache to 1024. We only ask for "id"s (which are unique)
> > and no other fields during queries (though we do faceting). Btw, 1.6gb
> > of our index is stored fields (we store everything for now, even though
> > we do not fetch them during queries), and about 1gb is index data.
>
> Hmm, it seems 4000 would be enough indeed. What about the fieldCache, are
> there
> a lot of entries? Is there an insanity count? Do you use boost functions?
>
>
Insanity count is 0 and fieldCache has 12 entries. We do use some boosting
functions.

Btw, I am monitoring via jconsole with 8gb of ram and the heap still goes
to 8gb every 20 seconds or so; gc runs and it falls back down to 1gb.
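
The same sawtooth shows up from the shell as well; assuming a standard JDK
on the slave, the following prints heap-generation occupancy and GC counts
every 5 seconds (<solr-pid> being the jetty process id):

    jstat -gcutil <solr-pid> 5000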

Btw, our current revision was just a random choice, but up until two weeks
ago it had been rock-solid, so we have been reluctant to update to another
version. Would you recommend upgrading to latest trunk?


> It might not have anything to do with memory at all but i'm just asking.
> There
> may be a bug in your revision causing this.
>
> >
> > Anyway, Xmx was 4000m, we tried increasing it to 8000m but did not get
> > any improvement in load. I can try monitoring with JConsole with 8 gigs
> > of heap to see if it helps.
> >
> > > Cheers,
> > >
> > > > Hello everyone,
> > > >
> > > > First of all here is our Solr setup:
> > > >
> > > > - Solr nightly build 986158
> > > > - Running solr inside the default jetty that comes with the solr
> > > >   build
> > > > - 1 write-only master, 4 read-only slaves (quad core 5640 with 24gb
> > > >   of RAM)
> > > > - Index replicated (on optimize) to slaves via Solr Replication
> > > > - Size of index is around 2.5gb
> > > > - No incremental writes; the index is created from scratch (delete
> > > >   old documents -> commit new documents -> optimize) every 6 hours
> > > > - Avg # of requests per second is around 60 (for a single slave)
> > > > - Avg time per request is around 25ms (before having problems)
> > > > - Load on each slave is around 2
> > > >
> > > > We have been using this set-up for months without any problem.
> > > > However, last week we started to experience very weird performance
> > > > problems like:
> > > >
> > > > - Avg time per request increased from 25ms to 200-300ms (even
> > > >   higher if we don't restart the slaves)
> > > > - Load on each slave increased from 2 to 15-20 (solr uses 400-600%
> > > >   cpu)
> > > >
> > > > When we profile solr we see two very strange things:
> > > >
> > > > 1 - This is the jconsole output:
> > > >
> > > > https://skitch.com/meralan/rwwcf/mail-886x691
> > > >
> > > > As you see, gc runs every 10-15 seconds and collects more than 1gb
> > > > of memory. (Actually, if you wait more than 10 minutes, you see
> > > > spikes up to 4gb consistently.)
> > > >
> > > > 2 - This is the newrelic output:
> > > >
> > > > https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm
> > > >
> > > > As you see, solr spends a ridiculously long time in the
> > > > SolrDispatchFilter.doFilter() method.
> > > >
> > > >
> > > > Apart from these, when we clean the index directory, re-replicate,
> > > > and restart each slave one by one, we see some relief in the
> > > > system, but after some time the servers start to melt down again.
> > > > Although deleting the index and re-replicating doesn't solve the
> > > > problem, we think these problems are somehow related to
> > > > replication, because the symptoms started after a replication and
> > > > the system temporarily heals right after a replication. I also see
> > > > lucene-write.lock files on the slaves (we don't have write.lock
> > > > files on the master), which I think we shouldn't see.
> > > >
> > > >
> > > > If anyone can give any sort of ideas, we will appreciate it.
> > > >
> > > > Regards,
> > > > Dogacan Guney
>



-- 
Doğacan Güney


Re: Solr performance issue

2011-03-14 Thread Doğacan Güney
Hello,

2011/3/14 Markus Jelsma 

> Hi Doğacan,
>
> Are you, at some point, running out of heap space? In my experience,
> that's the common cause of increased load and excessively high response
> times (or timeouts).
>
>
How much of a heap size would be enough? Our index size is growing slowly,
but we did not have this problem a couple of weeks ago, when the index was
maybe 100mb smaller.

We left most of the caches in solrconfig at their defaults and only
increased filterCache to 1024. We only ask for "id"s (which are unique) and
no other fields during queries (though we do faceting). Btw, 1.6gb of our
index is stored fields (we store everything for now, even though we do not
fetch them during queries), and about 1gb is index data.

Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get any
improvement in load. I can try monitoring with JConsole with 8 gigs of heap
to see if it helps.


> Cheers,
>
> > Hello everyone,
> >
> > First of all here is our Solr setup:
> >
> > - Solr nightly build 986158
> > - Running solr inside the default jetty that comes with the solr build
> > - 1 write-only master, 4 read-only slaves (quad core 5640 with 24gb of
> >   RAM)
> > - Index replicated (on optimize) to slaves via Solr Replication
> > - Size of index is around 2.5gb
> > - No incremental writes; the index is created from scratch (delete old
> >   documents -> commit new documents -> optimize) every 6 hours
> > - Avg # of requests per second is around 60 (for a single slave)
> > - Avg time per request is around 25ms (before having problems)
> > - Load on each slave is around 2
> >
> > We have been using this set-up for months without any problem. However,
> > last week we started to experience very weird performance problems like:
> >
> > - Avg time per request increased from 25ms to 200-300ms (even higher if
> >   we don't restart the slaves)
> > - Load on each slave increased from 2 to 15-20 (solr uses 400-600% cpu)
> >
> > When we profile solr we see two very strange things:
> >
> > 1 - This is the jconsole output:
> >
> > https://skitch.com/meralan/rwwcf/mail-886x691
> >
> > As you see, gc runs every 10-15 seconds and collects more than 1gb of
> > memory. (Actually, if you wait more than 10 minutes, you see spikes up
> > to 4gb consistently.)
> >
> > 2 - This is the newrelic output:
> >
> > https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm
> >
> > As you see, solr spends a ridiculously long time in the
> > SolrDispatchFilter.doFilter() method.
> >
> >
> > Apart from these, when we clean the index directory, re-replicate, and
> > restart each slave one by one, we see some relief in the system, but
> > after some time the servers start to melt down again. Although deleting
> > the index and re-replicating doesn't solve the problem, we think these
> > problems are somehow related to replication, because the symptoms
> > started after a replication and the system temporarily heals right
> > after a replication. I also see lucene-write.lock files on the slaves
> > (we don't have write.lock files on the master), which I think we
> > shouldn't see.
> >
> >
> > If anyone can give any sort of ideas, we will appreciate it.
> >
> > Regards,
> > Dogacan Guney
>



-- 
Doğacan Güney


Solr performance issue

2011-03-14 Thread Doğacan Güney
Hello everyone,

First of all here is our Solr setup:

- Solr nightly build 986158
- Running solr inside the default jetty that comes with the solr build
- 1 write-only master, 4 read-only slaves (quad core 5640 with 24gb of RAM)
- Index replicated (on optimize) to slaves via Solr Replication
- Size of index is around 2.5gb
- No incremental writes; the index is created from scratch (delete old
  documents -> commit new documents -> optimize) every 6 hours, roughly as
  sketched below
- Avg # of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before having problems)
- Load on each slave is around 2
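
The rebuild cycle is plain solrj against the master, as in this minimal
sketch (the URL and the document fields are illustrative, and the real
document loading is omitted):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RebuildIndex {
      public static void main(String[] args) throws Exception {
        SolrServer master = new CommonsHttpSolrServer("http://master:8983/solr");

        master.deleteByQuery("*:*");   // delete old documents

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");   // stand-in for a real document
        master.add(doc);               // ...repeated for every new document
        master.commit();               // commit new documents

        master.optimize();             // optimize; slaves replicate on this
      }
    }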

We have been using this set-up for months without any problem. However,
last week we started to experience very weird performance problems like:

- Avg time per request increased from 25ms to 200-300ms (even higher if we
  don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (solr uses 400-600% cpu)

When we profile solr we see two very strange things:

1 - This is the jconsole output:

https://skitch.com/meralan/rwwcf/mail-886x691

As you see, gc runs every 10-15 seconds and collects more than 1gb of
memory. (Actually, if you wait more than 10 minutes, you see spikes up to
4gb consistently.)

2 - This is the newrelic output :

https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm

As you see, solr spends a ridiculously long time in the
SolrDispatchFilter.doFilter() method.


Apart from these, when we clean the index directory, re-replicate, and
restart each slave one by one, we see some relief in the system, but after
some time the servers start to melt down again. Although deleting the index
and re-replicating doesn't solve the problem, we think these problems are
somehow related to replication, because the symptoms started after a
replication and the system temporarily heals right after a replication. I
also see lucene-write.lock files on the slaves (we don't have write.lock
files on the master), which I think we shouldn't see.


If anyone can give any sort of ideas, we will appreciate it.

Regards,
Dogacan Guney


Re: Nutch with SOLR

2007-09-26 Thread Doğacan Güney
On 9/26/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> > Sami has a patch in there which used an older version of the solr
> > client. with the current solr client in the SVN tree, his patch
> > becomes much easier.
> > your job would be to upgrade the patch and mail it back to him so
> > he can update his blog, or post it as a patch for inclusion in
> > nutch/contrib (if sami is ok with that). If you have issues with
> > how to use the solr client api, solr-user is here to help.
> >
>
> I've done this. Apparently someone else has taken on the solr-nutch
> job and made it a bit more complicated (which is good for the long
> term) than sami's original patch --
> https://issues.apache.org/jira/browse/NUTCH-442

That someone else is me :)

NUTCH-442 is one of the issues that I really want to see resolved.
Unfortunately, I haven't received many comments (as in, none), so I
haven't made further progress on it.

The patch at NUTCH-442 tries to integrate SOLR as a "first-class" citizen
(so to speak), so that you can index to solr or to lucene (or both) within
the same Indexer job, and retrieve search results in nutch's web UI from a
solr server, from nutch's home-grown index servers, or from a combination
of both. And I think the patch lays the groundwork for generating summaries
from solr.
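
For reference, the solrj side of such indexing is roughly the following
minimal sketch (the URL and field names are illustrative, not the patch's
actual schema):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexSketch {
      public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // One crawled page mapped onto a Solr document.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/");
        doc.addField("title", "Example");
        doc.addField("content", "Page text extracted by the crawler.");

        solr.add(doc);
        solr.commit();
      }
    }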

>
> But we still use a version of Sami's patch that works on both trunk
> nutch and trunk solr (solrj.) I sent my changes to sami when we did
> it, if you need it let me know...
>
>
> -b
>
>
>


-- 
Doğacan Güney


Re: Passing arguments to analyzers

2007-07-23 Thread Doğacan Güney

On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 7/17/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 7/17/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > > Hi all,
> > >
> > > Is there a way to pass arguments to analyzers per document? Let's say
> > > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > > can stem more than one language but (obviously) it needs to know the
> > > language of the document it is working on. So what I need is to
> > > specify the language per document (actually per field).
> > >
> > > Here is an example:
> > > <field name="foo" lang="en">
> > >   My spam egg bars baz.
> > > </field>
> > >
> > > Is something like this possible with Solr?
> >
> > You can pass extra args to a factory in the field-type definition, but
> > that means you would need a separate field-type per language.
>
> Thanks for the answer.
>
> Your suggestion would work for this particular use case, but IMHO there
> are other use cases out there that could benefit from this (for example,
> one may process the whole document and add parameters for each field
> based on document-level analysis).
>
> Would this be a useful feature for Solr? I would actually like to work
> on it if others consider it a useful add-on. It seems simple to
> accomplish and would probably be a good introduction to Solr internals.

wrt passing more info to the analyzer at runtime to alter its
behavior: analyzers are singletons per field-type, and
Analyzer.tokenStream(String fieldName, Reader reader) is called to
analyze a particular value.  There isn't really a good place to pass
in extra info.

During XML parsing, we *could* build up a Map of the parameters we
don't know about, but then the question is what to do with them.  One
hackish solution would be to store them in a thread-local where your
analyzer could check it.  Perhaps a custom request processor could do
that task.
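
A minimal sketch of that thread-local hack, with all names hypothetical
(an update processor would set the language before a document's fields are
analyzed and clear it afterwards):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Holds per-document analysis parameters for the current thread.
    public final class PerDocParams {
      private static final ThreadLocal<String> LANG = new ThreadLocal<String>();
      public static void setLanguage(String lang) { LANG.set(lang); }
      public static String getLanguage() { return LANG.get(); }
      public static void clear() { LANG.remove(); }
    }

    class LanguageAwareAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // Null when nothing was set for the current document.
        String lang = PerDocParams.getLanguage();
        TokenStream ts = new WhitespaceTokenizer(reader);
        // A stemming filter like MyCustomStemmingFilter would wrap ts
        // here, picking its stemmer based on lang.
        return ts;
      }
    }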

It seems there does need to be some kind of framework more aligned
with parsing documents (word docs, pdf, etc), for adding metadata to
fields at runtime (how does UIMA or Tika fit into this?), and for
mapping the fields+metadata to Solr/Lucene document fields.


I opened SOLR-313 for this.



-Yonik




--
Doğacan Güney


Re: Passing arguments to analyzers

2007-07-17 Thread Doğacan Güney

Hi,

On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 7/17/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Is there a way to pass arguments to analyzers per document? Let's say
> that I have a field "foo" which is tokenized by WhitespaceTokenizer
> and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> can stem more than one language but (obviously) it needs to know the
> language of the document it is working on. So what I need is to
> specify the language per document (actually per field).
>
> Here is an example:
> <field name="foo" lang="en">
>   My spam egg bars baz.
> </field>
>
> Is something like this possible with Solr?

You can pass extra args to a factory in the field-type definition, but
that means you would need a separate field-type per language.


Thanks for the answer.

Your suggestion would work for this particular use case, but IMHO there
are other use cases out there that could benefit from this (for example,
one may process the whole document and add parameters for each field based
on document-level analysis). Also, again IMHO, per-field parameters are
more flexible.

Would this be a useful feature for Solr? I would actually like to work on
it if others consider it a useful add-on. It seems simple to accomplish
and would probably be a good introduction to Solr internals.



-Yonik




--
Doğacan Güney


Passing arguments to analyzers

2007-07-17 Thread Doğacan Güney

Hi all,

Is there a way to pass arguments to analyzers per document? Let's say
that I have a field "foo" which is tokenized by WhitespaceTokenizer
and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
can stem more than one language but (obviously) it needs to know the
language of the document it is working on. So what I need is to
specify the language per document (actually per field).

Here is an example:

<field name="foo" lang="en">
  My spam egg bars baz.
</field>

Is something like this possible with Solr?

--
Doğacan Güney