Re: solr performance with >1 NUMAs

2020-10-22 Thread matthew sporleder
Great updates.  Thanks for keeping us all in the loop!

On Thu, Oct 22, 2020 at 7:43 PM Wei  wrote:
>
> Hi Shawn,
>
> I'm circling back with some new findings on our 2-NUMA issue.  After a
> few iterations, we do see improvement with the useNUMA flag and other JVM
> setting changes. Here are the current settings, with Java 11:
>
> -XX:+UseNUMA
>
> -XX:+UseG1GC
>
> -XX:+AlwaysPreTouch
>
> -XX:+UseTLAB
>
> -XX:G1MaxNewSizePercent=20
>
> -XX:MaxGCPauseMillis=150
>
> -XX:+DisableExplicitGC
>
> -XX:+DoEscapeAnalysis
>
> -XX:+ParallelRefProcEnabled
>
> -XX:+UnlockDiagnosticVMOptions
>
> -XX:+UnlockExperimentalVMOptions
>
>
> Compared to the previous Java 8 + CMS setup on 2-NUMA servers, P99 latency
> has improved by over 20%.
>
>
> Thanks,
>
> Wei
>
>
>
>
> On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey  wrote:
>
> > On 9/28/2020 12:17 PM, Wei wrote:
> > > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do
> > you
> > > see any backward compatibility issue for Solr 8 with Java 11? Can we run
> > > Solr 8 built with JDK 8 in Java 11 JRE, or need to rebuild solr with Java
> > > 11 JDK?
> >
> > I do not know of any problems running the binary release of Solr 8
> > (which is most likely built with the Java 8 JDK) with a newer release
> > like Java 11 or higher.
> >
> > I think Sun was really burned by such problems cropping up in the days
> > of Java 5 and 6, and their developers have worked really hard to make
> > sure that never happens again.
> >
> > If you're running Java 11, you will need to pick a different garbage
> > collector if you expect the NUMA flag to function.  The most recent
> > releases of Solr are defaulting to G1GC, which as previously mentioned,
> > did not gain NUMA optimizations until Java 14.
> >
> > It is not clear to me whether the NUMA optimizations will work with any
> > collector other than Parallel until Java 14.  You would need to check
> > Java documentation carefully or ask someone involved with development of
> > Java.
> >
> > If you do see an improvement using the NUMA flag with Java 11, please
> > let us know exactly what options Solr was started with.
> >
> > Thanks,
> > Shawn
> >


Re: solr performance with >1 NUMAs

2020-10-22 Thread Wei
Hi Shawn,

I'm circling back with some new findings on our 2-NUMA issue.  After a
few iterations, we do see improvement with the useNUMA flag and other JVM
setting changes. Here are the current settings, with Java 11:

-XX:+UseNUMA

-XX:+UseG1GC

-XX:+AlwaysPreTouch

-XX:+UseTLAB

-XX:G1MaxNewSizePercent=20

-XX:MaxGCPauseMillis=150

-XX:+DisableExplicitGC

-XX:+DoEscapeAnalysis

-XX:+ParallelRefProcEnabled

-XX:+UnlockDiagnosticVMOptions

-XX:+UnlockExperimentalVMOptions


Compared to the previous Java 8 + CMS setup on 2-NUMA servers, P99 latency
has improved by over 20%.
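For reference, with the stock bin/solr scripts these flags typically go into solr.in.sh via the GC_TUNE variable; a sketch under that assumption (the file location varies by install):

```shell
# solr.in.sh sketch -- GC_TUNE replaces Solr's default GC flags;
# the heap size is set separately via SOLR_HEAP.
GC_TUNE="-XX:+UseNUMA \
  -XX:+UseG1GC \
  -XX:+AlwaysPreTouch \
  -XX:+UseTLAB \
  -XX:G1MaxNewSizePercent=20 \
  -XX:MaxGCPauseMillis=150 \
  -XX:+DisableExplicitGC \
  -XX:+DoEscapeAnalysis \
  -XX:+ParallelRefProcEnabled \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+UnlockExperimentalVMOptions"
```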


Thanks,

Wei




On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey  wrote:

> On 9/28/2020 12:17 PM, Wei wrote:
> > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do
> you
> > see any backward compatibility issue for Solr 8 with Java 11? Can we run
> > Solr 8 built with JDK 8 in Java 11 JRE, or need to rebuild solr with Java
> > 11 JDK?
>
> I do not know of any problems running the binary release of Solr 8
> (which is most likely built with the Java 8 JDK) with a newer release
> like Java 11 or higher.
>
> I think Sun was really burned by such problems cropping up in the days
> of Java 5 and 6, and their developers have worked really hard to make
> sure that never happens again.
>
> If you're running Java 11, you will need to pick a different garbage
> collector if you expect the NUMA flag to function.  The most recent
> releases of Solr are defaulting to G1GC, which as previously mentioned,
> did not gain NUMA optimizations until Java 14.
>
> It is not clear to me whether the NUMA optimizations will work with any
> collector other than Parallel until Java 14.  You would need to check
> Java documentation carefully or ask someone involved with development of
> Java.
>
> If you do see an improvement using the NUMA flag with Java 11, please
> let us know exactly what options Solr was started with.
>
> Thanks,
> Shawn
>


Re: solr performance with >1 NUMAs

2020-09-28 Thread Shawn Heisey

On 9/28/2020 12:17 PM, Wei wrote:

Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issues for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 on a Java 11 JRE, or do we need to rebuild Solr with
the Java 11 JDK?


I do not know of any problems running the binary release of Solr 8 
(which is most likely built with the Java 8 JDK) with a newer release 
like Java 11 or higher.


I think Sun was really burned by such problems cropping up in the days 
of Java 5 and 6, and their developers have worked really hard to make 
sure that never happens again.


If you're running Java 11, you will need to pick a different garbage 
collector if you expect the NUMA flag to function.  The most recent 
releases of Solr are defaulting to G1GC, which as previously mentioned, 
did not gain NUMA optimizations until Java 14.


It is not clear to me whether the NUMA optimizations will work with any 
collector other than Parallel until Java 14.  You would need to check 
Java documentation carefully or ask someone involved with development of 
Java.


If you do see an improvement using the NUMA flag with Java 11, please 
let us know exactly what options Solr was started with.


Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-28 Thread Wei
Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issues for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 on a Java 11 JRE, or do we need to rebuild Solr with
the Java 11 JDK?

Best,
Wei

On Sat, Sep 26, 2020 at 6:44 PM Shawn Heisey  wrote:

> On 9/26/2020 1:39 PM, Wei wrote:
> > Thanks Shawn! Currently we are still using the CMS collector for solr
> with
> > Java 8. When last evaluated with Solr 7, CMS performs better than G1 for
> > our case. When using G1, is it better to upgrade from Java 8 to Java 11?
> >  From
> https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
> > seems Java 14 is not officially supported for Solr 8.
>
> It has been a while since I was working with Solr every day, and when I
> was, Java 11 did not yet exist.  I have no idea whether Java 11 improves
> things beyond Java 8.  That said ... all software evolves and usually
> improves as time goes by.  It is likely that the newer version has SOME
> benefit.
>
> Regarding whether or not Java 14 is supported:  There are automated
> tests where all the important code branches are run with all major
> versions of Java, including pre-release versions, and those tests do
> include various garbage collectors.  Somebody notices when a combination
> doesn't work, and big problems with newer Java versions are something
> that gets discussed on our mailing lists.
>
> Java 14 has been out for a while, with no big problems being discussed
> so far.  So it is likely that it works with Solr.  Can I say for sure?
> No.  I haven't tried it myself.
>
> I don't have any hardware available where there is more than one NUMA,
> or I would look deeper into this myself.  It would be interesting to
> find out whether the -XX:+UseNUMA option makes a big difference in
> performance.
>
> Thanks,
> Shawn
>


Re: solr performance with >1 NUMAs

2020-09-26 Thread Shawn Heisey

On 9/26/2020 1:39 PM, Wei wrote:

Thanks Shawn! Currently we are still using the CMS collector for Solr with
Java 8. When we last evaluated with Solr 7, CMS performed better than G1 for
our case. When using G1, is it better to upgrade from Java 8 to Java 11?
From https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
it seems Java 14 is not officially supported for Solr 8.


It has been a while since I was working with Solr every day, and when I 
was, Java 11 did not yet exist.  I have no idea whether Java 11 improves 
things beyond Java 8.  That said ... all software evolves and usually 
improves as time goes by.  It is likely that the newer version has SOME 
benefit.


Regarding whether or not Java 14 is supported:  There are automated 
tests where all the important code branches are run with all major 
versions of Java, including pre-release versions, and those tests do 
include various garbage collectors.  Somebody notices when a combination 
doesn't work, and big problems with newer Java versions are something 
that gets discussed on our mailing lists.


Java 14 has been out for a while, with no big problems being discussed 
so far.  So it is likely that it works with Solr.  Can I say for sure? 
No.  I haven't tried it myself.


I don't have any hardware available where there is more than one NUMA, 
or I would look deeper into this myself.  It would be interesting to 
find out whether the -XX:+UseNUMA option makes a big difference in 
performance.


Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-26 Thread Wei
Thanks Shawn! Currently we are still using the CMS collector for Solr with
Java 8. When we last evaluated with Solr 7, CMS performed better than G1 for
our case. When using G1, is it better to upgrade from Java 8 to Java 11?
From https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
it seems Java 14 is not officially supported for Solr 8.

Best,
Wei


On Fri, Sep 25, 2020 at 5:50 PM Shawn Heisey  wrote:

> On 9/23/2020 7:42 PM, Wei wrote:
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> > noticed that query latency almost doubled compared to deployment on
> single
> > NUMA machines. Not sure what's causing the huge difference. Is there any
> > tuning to boost the performance on multiple NUMA machines? Any pointer is
> > appreciated.
>
> If you're running with standard options, Solr 8.4.1 will start using the
> G1 garbage collector.
>
> As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option,
> which makes better decisions about memory allocations and multiple
> NUMAs.  If you're running a new enough Java, it would probably be
> beneficial to add this to the garbage collector options.  Solr itself is
> unaware of things like NUMA -- Java must handle that.
>
> https://openjdk.java.net/jeps/345
>
> Thanks,
> Shawn
>


Re: solr performance with >1 NUMAs

2020-09-25 Thread Shawn Heisey

On 9/23/2020 7:42 PM, Wei wrote:

Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
noticed that query latency almost doubled compared to deployment on single
NUMA machines. Not sure what's causing the huge difference. Is there any
tuning to boost the performance on multiple NUMA machines? Any pointer is
appreciated.


If you're running with standard options, Solr 8.4.1 will start using the 
G1 garbage collector.


As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option, 
which makes better decisions about memory allocations and multiple 
NUMAs.  If you're running a new enough Java, it would probably be 
beneficial to add this to the garbage collector options.  Solr itself is 
unaware of things like NUMA -- Java must handle that.


https://openjdk.java.net/jeps/345
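Mapping the thread's statements to startup flags, a small sketch (the version cutoffs follow the claims above, not an authoritative compatibility matrix):

```shell
# Sketch: choose NUMA-capable GC flags by Java major version. Per this
# thread, only the Parallel collector is known to honor -XX:+UseNUMA
# before Java 14; G1 gains NUMA awareness in Java 14 (JEP 345).
pick_numa_gc_flags() {
  major="$1"  # Java major version, e.g. 11 or 14
  if [ "$major" -ge 14 ]; then
    printf '%s\n' "-XX:+UseG1GC -XX:+UseNUMA"
  else
    printf '%s\n' "-XX:+UseParallelGC -XX:+UseNUMA"
  fi
}

pick_numa_gc_flags 11   # -XX:+UseParallelGC -XX:+UseNUMA
pick_numa_gc_flags 14   # -XX:+UseG1GC -XX:+UseNUMA
```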

Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-25 Thread Wei
Thanks Dominique. I'll start with the -XX:+UseNUMA option.

Best,
Wei

On Fri, Sep 25, 2020 at 7:04 AM Dominique Bejean 
wrote:

> Hi,
>
> This would be a Java VM option, not something Solr itself can know about.
> Take a look at the comments on this article. Maybe it will help.
>
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?showComment=1347033706559#c229885263664926125
>
> Regards
>
> Dominique
>
>
>
> Le jeu. 24 sept. 2020 à 03:42, Wei  a écrit :
>
> > Hi,
> >
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> > noticed that query latency almost doubled compared to deployment on
> single
> > NUMA machines. Not sure what's causing the huge difference. Is there any
> > tuning to boost the performance on multiple NUMA machines? Any pointer is
> > appreciated.
> >
> > Best,
> > Wei
> >
>


Re: solr performance with >1 NUMAs

2020-09-25 Thread Dominique Bejean
Hi,

This would be a Java VM option, not something Solr itself can know about.
Take a look at the comments on this article. Maybe it will help.
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?showComment=1347033706559#c229885263664926125

Regards

Dominique



Le jeu. 24 sept. 2020 à 03:42, Wei  a écrit :

> Hi,
>
> Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> noticed that query latency almost doubled compared to deployment on single
> NUMA machines. Not sure what's causing the huge difference. Is there any
> tuning to boost the performance on multiple NUMA machines? Any pointer is
> appreciated.
>
> Best,
> Wei
>


solr performance with >1 NUMAs

2020-09-23 Thread Wei
Hi,

Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
noticed that query latency almost doubled compared to deployment on single
NUMA machines. Not sure what's causing the huge difference. Is there any
tuning to boost the performance on multiple NUMA machines? Any pointer is
appreciated.

Best,
Wei


Re: question about setup for maximizing solr performance

2020-06-01 Thread Shawn Heisey

On 6/1/2020 9:29 AM, Odysci wrote:

Hi,
I'm looking for some advice on improving performance of our solr setup.




Does anyone have any insights on what would be better for maximizing
throughput on multiple searches being done at the same time?
thanks!


In almost all cases, adding memory will provide the best performance 
boost.  This is because memory is faster than disks, even SSD.  I have 
put relevant information on a wiki page so that it is easy for people to 
find and digest:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn


question about setup for maximizing solr performance

2020-06-01 Thread Odysci
Hi,
I'm looking for some advice on improving performance of our solr setup. In
particular, about the trade-offs between applying larger machines, vs more
smaller machines. Our full index has just over 100 million docs, and we do
almost all searches using fq's (with q=*:*) and facets. We are using solr
8.3.

Currently, I have a solrcloud setup with 2 physical machines (let's call
them A and B), and my index is divided into 2 shards, and 2 replicas, such
that each machine has a full copy of the index.
The nodes and replicas are as follows:
Machine A:
  core_node3 / shard1_replica_n1
  core_node7 / shard2_replica_n4
Machine B:
  core_node5 / shard1_replica_n2
  core_node8 / shard2_replica_n6

My Zookeeper setup uses 3 instances. It's also the case that most of the
searches we do, we have results returning from both shards (from the same
search).

My experiments indicate that our setup is cpu-bound.
Due to cost constraints, I could, either, double the cpu in each of the 2
machines, or make it a 4-machine setup (using current size machines) and 2
shards and 4 replicas (or 4 shards w/ 4 replicas). I assume that keeping
the full index on all machines will allow all searches to be evenly
distributed.

Does anyone have any insights on what would be better for maximizing
throughput on multiple searches being done at the same time?
thanks!

Reinaldo


Re: Solr performance using fq with multiple values

2020-04-18 Thread Shawn Heisey

On 4/18/2020 12:20 PM, Odysci wrote:

We don't use this field for general queries (q=*:*), only for fq and
faceting.
Do you think making it indexed="true" would make a difference in fq
performance?


fq means "filter query".  It's still a query.  So yes, the field should 
be indexed.  The query you're doing only works because docValues is true 
... but queries using docValues have terrible performance.
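In schema terms that means enabling indexed on the filtered field. A sketch using the field name from this thread (the type shown is an assumption):

```xml
<!-- schema.xml sketch: indexed="true" lets fq use the inverted index;
     docValues can stay enabled for faceting -->
<field name="field1_name" type="plong" indexed="true" stored="false"
       required="false" multiValued="false" docValues="true"/>
```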


Thanks,
Shawn


Re: Solr performance using fq with multiple values

2020-04-18 Thread Odysci
We don't use this field for general queries (q=*:*), only for fq and
faceting.
Do you think making it indexed="true" would make a difference in fq
performance?
Thanks

Reinaldo

On Sat, Apr 18, 2020 at 3:06 PM Sylvain James 
wrote:

> Hi Reinaldo,
>
> Involved fields should be indexed for better performance ?
>
> <field name="field1_name" type="..." indexed="false" stored="false"
> required="false" multiValued="false" docValues="true" />
>
> Sylvain
>
> Le sam. 18 avr. 2020 à 18:46, Odysci  a écrit :
>
> > Hi,
> >
> > We are seeing significant performance degradation on single queries that
> > use fq with multiple values as in:
> >
> > fq=field1_name:(V1 V2 V3 ...)
> >
> > If we use only one value in the fq (say only V1) we get Qtime = T ms
> > As we increase the number of values, say to 5 values, Qtime more than
> > triples, even if the number of results is small. In my tests I made sure
> > cache was not an issue and nothing else was using the cpu.
> >
> > We commonly need to use fq with multiple values (on the same field name,
> > which is normally a long).
> > Is this performance hit to be expected?
> > Is there a better way to do this?
> >
> > We use Solr Cloud 8.3, and the field that we use fq on is defined as:
> >
> > <field name="field1_name" type="..." indexed="false" stored="false"
> > required="false" multiValued="false" docValues="true" />
> >
> > Thanks
> >
> > Reinaldo
> >
>


Re: Solr performance using fq with multiple values

2020-04-18 Thread Sylvain James
Hi Reinaldo,

Involved fields should be indexed for better performance ?

<field name="field1_name" type="..." indexed="false" stored="false" multiValued="false" docValues="true" />

Sylvain

Le sam. 18 avr. 2020 à 18:46, Odysci  a écrit :

> Hi,
>
> We are seeing significant performance degradation on single queries that
> use fq with multiple values as in:
>
> fq=field1_name:(V1 V2 V3 ...)
>
> If we use only one value in the fq (say only V1) we get Qtime = T ms
> As we increase the number of values, say to 5 values, Qtime more than
> triples, even if the number of results is small. In my tests I made sure
> cache was not an issue and nothing else was using the cpu.
>
> We commonly need to use fq with multiple values (on the same field name,
> which is normally a long).
> Is this performance hit to be expected?
> Is there a better way to do this?
>
> We use Solr Cloud 8.3, and the field that we use fq on is defined as:
>
> <field name="field1_name" type="..." indexed="false" stored="false"
> required="false" multiValued="false" docValues="true" />
>
> Thanks
>
> Reinaldo
>


Solr performance using fq with multiple values

2020-04-18 Thread Odysci
Hi,

We are seeing significant performance degradation on single queries that
use fq with multiple values as in:

fq=field1_name:(V1 V2 V3 ...)

If we use only one value in the fq (say only V1) we get Qtime = T ms
As we increase the number of values, say to 5 values, Qtime more than
triples, even if the number of results is small. In my tests I made sure
cache was not an issue and nothing else was using the cpu.

We commonly need to use fq with multiple values (on the same field name,
which is normally a long).
Is this performance hit to be expected?
Is there a better way to do this?

We use Solr Cloud 8.3, and the field that we use fq on is defined as:

<field name="field1_name" type="..." indexed="false" stored="false" required="false" multiValued="false" docValues="true" />

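One commonly suggested alternative for long value lists (not something tested in this thread) is the terms query parser, which applies a set-membership filter over many values without parsing a large boolean query. A sketch that only echoes the request URL rather than sending it (host and collection names are placeholders):

```shell
# Build a terms-query-parser filter URL; {!terms f=...} matches any of
# the comma-separated values in one pass over the field.
FIELD=field1_name
VALUES="V1,V2,V3,V4,V5"
printf 'http://localhost:8983/solr/mycollection/select?q=*:*&fq={!terms f=%s}%s\n' \
  "$FIELD" "$VALUES"
```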
Thanks

Reinaldo


Re: SOLR PERFORMANCE Warning

2020-02-20 Thread Emir Arnautović
Hi,
It means that you are either committing too frequently or your searcher warm-up
takes too long. If you are committing on every bulk, stop doing that and use
autocommit.
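As a sketch, autocommit is configured in solrconfig.xml; the intervals below are placeholders to tune, not recommendations:

```xml
<!-- solrconfig.xml sketch: let Solr commit on its own schedule
     instead of committing on every bulk update -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to disk without opening a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: opens a new searcher so updates become visible -->
  <autoSoftCommit>
    <maxTime>120000</maxTime>
  </autoSoftCommit>
</updateHandler>
```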

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Feb 2020, at 06:54, Akreeti Agarwal  wrote:
> 
> Hi All,
> 
> 
> 
> I am using SOLR 7.5 version with master slave architecture.
> 
> I am getting :
> 
> 
> 
> "PERFORMANCE WARNING: Overlapping onDeckSearchers=2"
> 
> 
> 
> continuously on my master logs for all cores. Please help me to resolve this.
> 
> 
> 
> 
> 
> Thanks & Regards,
> 
> Akreeti Agarwal
> 
> ::DISCLAIMER::
> 
> The contents of this e-mail and any attachment(s) are confidential and 
> intended for the named recipient(s) only. E-mail transmission is not 
> guaranteed to be secure or error-free as information could be intercepted, 
> corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses 
> in transmission. The e mail and its contents (with or without referred 
> errors) shall therefore not attach any liability on the originator or HCL or 
> its affiliates. Views or opinions, if any, presented in this email are solely 
> those of the author and may not necessarily reflect the views or opinions of 
> HCL or its affiliates. Any form of reproduction, dissemination, copying, 
> disclosure, modification, distribution and / or publication of this message 
> without the prior written consent of authorized representative of HCL is 
> strictly prohibited. If you have received this email in error please delete 
> it and notify the sender immediately. Before opening any email and/or 
> attachments, please check them for viruses and other defects.
> 



SOLR PERFORMANCE Warning

2020-02-20 Thread Akreeti Agarwal
Hi All,



I am using SOLR 7.5 version with master slave architecture.

I am getting :



"PERFORMANCE WARNING: Overlapping onDeckSearchers=2"



continuously on my master logs for all cores. Please help me to resolve this.





Thanks & Regards,

Akreeti Agarwal




Re: solr SSL encryption degrades solr performance

2019-02-06 Thread Zheng Lin Edwin Yeo
Hi,

Which Solr version are you using?

Also, how many collections do you have, and how many records have you
indexed in those collections?

Regards,
Edwin

On Mon, 4 Feb 2019 at 23:33, Anchal Sharma2  wrote:

>
>
> Hi All,
>
> We recently enabled SSL on Solr. Afterwards, our application
> performance degraded significantly: the time for the source
> application to fetch a single record from Solr increased from
> approximately 4 ms to 200 ms. This adds up to a lot of time when
> multiple calls are made to Solr.
>
> Has anyone experienced this? If so, please share any suggestions.
>
> Thanks & Regards,
> -
> Anchal Sharma
>


solr SSL encryption degrades solr performance

2019-02-04 Thread Anchal Sharma2


Hi All,

We recently enabled SSL on Solr. Afterwards, our application
performance degraded significantly: the time for the source
application to fetch a single record from Solr increased from
approximately 4 ms to 200 ms. This adds up to a lot of time when
multiple calls are made to Solr.

Has anyone experienced this? If so, please share any suggestions.

Thanks & Regards,
-
Anchal Sharma


Re: SOLR Performance Statistics

2018-11-21 Thread Shawn Heisey

On 11/21/2018 8:59 AM, Marc Schöchlin wrote:

Is it possible to modify the log4j appender to also log other query attributes
like response/request size in bytes and the number of documents returned?


Changing the log4j config might not do anything useful at all.  In order 
for such a change to be useful, the application must have code that logs 
the information you're after.


If you change the default logging level in your log4j config to DEBUG 
instead of INFO, you'll get a LOT more information in your logs.  The 
information you're after *MIGHT* be logged, but it might not -- I really 
have no idea without checking the source code about precisely what 
information Solr is logging.


The number of documents that match each query *IS* mentioned in 
solr.log, and you won't even have to change anything to get it.  You'll 
see "hits=nn" when a query is logged.
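For example, the hit counts can be pulled straight out of the log; the sample lines below are a shortened, hypothetical rendering of the query-log format, not verbatim Solr output:

```shell
# Extract the hits= values from Solr query log lines.
cat > /tmp/solr_sample.log <<'EOF'
INFO  - 2018-11-21 10:00:01.123; org.apache.solr.core.SolrCore; [core1] path=/select params={q=*:*} hits=1542 status=0 QTime=12
INFO  - 2018-11-21 10:00:02.456; org.apache.solr.core.SolrCore; [core1] path=/select params={q=id:7} hits=7 status=0 QTime=3
EOF

# Prints one hit count per logged query: 1542, then 7
grep -o 'hits=[0-9]*' /tmp/solr_sample.log | cut -d= -f2
```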



I think about snooping on the ethernet interface of a client or on the server to
gather libpcap data. Is there a chance to analyze captured data if the format is
e.g. "wt=javabin&version=2"?


The javabin format is binary.  You would need something that understands 
it.  If you added solr jars to a custom program, you could probably feed 
the data to it and make sense of it.  That would require some research 
into how Solr works at a low level, to learn how to take information 
gathered from a packet capture and decode it into an actual response.


Thanks,
Shawn



SOLR Performance Statistics

2018-11-21 Thread Marc Schöchlin
Hello list,

I am using the pretty old Solr 4.7 release (*sigh*) and am currently
investigating performance problems.
The Solr instance currently runs very expensive queries with huge results, and
I want to find the most promising queries to optimize.

I am currently using the solr logfiles and a simple tool (enhanced by me) to 
analyze the queries: https://github.com/scoopex/solr-loganalyzer

Is it possible to modify the log4j appender to also log other query attributes
like response/request size in bytes and the number of documents returned?

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n
log4j.appender.file.bufferedIO=true

Is there a better way to create detailed query stats and to replay queries on a 
test system?

I think about snooping on the ethernet interface of a client or on the server
to gather libpcap data. Is there a chance to analyze captured data if the
format is e.g. "wt=javabin&version=2"?
I do similar things for MySQL to get non-intrusive performance analytics using
pt-query-digest (Percona Toolkit).

This works like that on mysql:

1.) Capture data
 # Capture all data on port 3306
 tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > mysql.tcp.txt
 # capture only 1/7 of the connections, using a modulus of 7 on the source
 # port, if you have a very busy network connection
 tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and
tcp[3] & 7 == 2' > mysql.tcp.txt

2.) Create statistics on a other system using the tcpdump file
 pt-query-digest  --watch-server '127.0.0.1:3307' --limit 110 --type 
tcpdump mysql.tcp.txt

If I can extract the streams of the connections, do you have an idea how to
parse the binary data?
(Can I use parts of the Solr client?)

Is there a comparable tool out there?

Regards
Marc




Re: Live publishing and solr performance optimization

2018-11-20 Thread Zheng Lin Edwin Yeo
Sharding can be one of the option.

But what is the size of your documents? And which Solr version are you
using?

Regards,
Edwin

On Tue, 20 Nov 2018 at 01:40, Balanathagiri Ayyasamypalanivel <
bala.cit...@gmail.com> wrote:

> Hi,
> We are in the process for live Publishing document in solr and the same
> time we have to maintain the search performance.
>
> Total existing docs : 120 million
> Expected data for live publishing : 1 million
>
> For every 1 hour, we will get 1m docs to publish in live to the hot solr
> collection, can you please provide your suggestions on how effectively we
> can do this.
>
> Regards,
> Bala.
>


Live publishing and solr performance optimization

2018-11-19 Thread Balanathagiri Ayyasamypalanivel
Hi,
We are in the process of live-publishing documents in Solr, and at the same
time we have to maintain search performance.

Total existing docs : 120 million
Expected data for live publishing : 1 million

For every 1 hour, we will get 1m docs to publish in live to the hot solr
collection, can you please provide your suggestions on how effectively we
can do this.

Regards,
Bala.


Re: Solr performance issue

2018-02-15 Thread Shawn Heisey
On 2/15/2018 2:00 AM, Srinivas Kashyap wrote:
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the 
> child entities in data-config.xml. And i'm using the same for full-import 
> only. And in the beginning of my implementation, i had written delta-import 
> query to index the modified changes. But my requirement grew and i have 17 
> child entities for a single parent entity now. When doing delta-import for 
> huge data, the number of requests being made to datasource(database)  became 
> more and CPU utilization was 100% when concurrent users started modifying the 
> data. For this instead of calling delta-import which imports based on last 
> index time, I did full-import('SortedMapBackedCache' ) based on last index 
> time.
>
> Though the parent entity query would return only records that are modified, 
> the child entity queries pull all the data from the database and the indexing 
> happens 'in-memory', which is causing the JVM to run out of memory.

Can you provide your DIH config file (with passwords redacted) and the
precise URL you are using to initiate dataimport?  Also, I would like to
know what field you have defined as your uniqueKey.  I may have more
questions about the data in your system, depending on what I see.

That cache implementation should only cache entries from the database
that are actually requested.  If your query is correctly defined, it
should not pull all records from the DB table.

> Is there a way to specify in the child query entity to pull the record 
> related to parent entity in the full-import mode.

If I am understanding your question correctly, this is one of the fairly
basic things that DIH does.  Look at this config example in the
reference guide:

https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#configuring-the-dih-configuration-file

In the entity named feature in that example config, the query string
uses ${item.ID} to reference the ID column from the parent entity, which
is item.
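A sketch of such a parent/child pair with a cached child entity (table, column, and field names here are hypothetical; cacheKey/cacheLookup tie the child's cache to the parent's key):

```xml
<!-- data-config.xml sketch: the child's rows are loaded into the cache
     once, then looked up per parent row instead of issuing one SQL
     query per parent -->
<entity name="parent" pk="ID"
        query="SELECT ID, NAME FROM PARENT
               WHERE LAST_MODIFIED &gt; '${dataimporter.last_index_time}'">
  <entity name="child" processor="SqlEntityProcessor"
          cacheImpl="SortedMapBackedCache"
          cacheKey="PARENT_ID" cacheLookup="parent.ID"
          query="SELECT PARENT_ID, DETAIL FROM CHILD"/>
</entity>
```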

I should warn you that a cached entity does not always improve
performance.  This is particularly true if the lookup into the cache is
the information that goes to your uniqueKey field.  When the lookup is
by uniqueKey, every single row requested from the database will be used
exactly once, so there's not really any point to caching it.

Thanks,
Shawn



Re: Solr performance issue

2018-02-15 Thread Erick Erickson
Srinivas:

Not an answer to your question, but when DIH starts getting this
complicated, I start to seriously think about SolrJ, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

In particular, it moves the heavy lifting of acquiring the data from a
Solr node (which I'm assuming also has to index docs) to "some
client". It also lets you play some tricks with the code to make
things faster.

Best,
Erick

On Thu, Feb 15, 2018 at 1:00 AM, Srinivas Kashyap
 wrote:
> Hi,
>
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the
> child entities in data-config.xml, and I'm using it for full-import only.
> At the beginning of my implementation, I had written a delta-import
> query to index the modified changes. But my requirement grew and I now have 17
> child entities for a single parent entity. When doing a delta-import for
> huge data, the number of requests made to the datasource (database) became
> very large and CPU utilization hit 100% when concurrent users started modifying
> the data. Because of this, instead of calling delta-import, which imports based
> on last index time, I did a full-import ('SortedMapBackedCache') based on last
> index time.
>
> Though the parent entity query would return only records that were modified,
> the child entity queries pull all the data from the database, and the indexing
> happens in-memory, which is causing the JVM to run out of memory.
>
> Is there a way, in full-import mode, to make the child entity query pull only
> the records related to the modified parent entities?
>
> Thanks and Regards,
> Srinivas Kashyap
>
> DISCLAIMER:
> E-mails and attachments from TradeStone Software, Inc. are confidential.
> If you are not the intended recipient, please notify the sender immediately by
> replying to the e-mail, and then delete it without making copies or using it
> in any way. No representation is made that this email or any attachments are
> free of viruses. Virus scanning is recommended and is the responsibility of
> the recipient.


Solr performance issue

2018-02-15 Thread Srinivas Kashyap
Hi,

I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the
child entities in data-config.xml, and I'm using it for full-import only.
At the beginning of my implementation, I had written a delta-import query to
index the modified changes. But my requirement grew and I now have 17 child
entities for a single parent entity. When doing a delta-import for huge data,
the number of requests made to the datasource (database) became very large and
CPU utilization hit 100% when concurrent users started modifying the data.
Because of this, instead of calling delta-import, which imports based on last
index time, I did a full-import ('SortedMapBackedCache') based on last index
time.

Though the parent entity query would return only records that were modified, the
child entity queries pull all the data from the database, and the indexing
happens in-memory, which is causing the JVM to run out of memory.

Is there a way, in full-import mode, to make the child entity query pull only
the records related to the modified parent entities?
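One hedged possibility (the entity, table, and column names below are hypothetical placeholders, not from the actual data-config.xml) is to keep the child entity cached but restrict what it caches to rows whose parent was modified since the last index time:

```xml
<entity name="parent"
        query="SELECT ID FROM PARENT
               WHERE LAST_MODIFIED &gt; '${dataimporter.last_index_time}'">
  <!-- Cache only child rows belonging to recently modified parents,
       instead of pulling the whole child table into memory -->
  <entity name="child" processor="SqlEntityProcessor"
          cacheImpl="SortedMapBackedCache"
          cacheKey="PARENT_ID" cacheLookup="parent.ID"
          query="SELECT PARENT_ID, VALUE FROM CHILD
                 WHERE PARENT_ID IN
                   (SELECT ID FROM PARENT
                    WHERE LAST_MODIFIED &gt; '${dataimporter.last_index_time}')"/>
</entity>
```

This keeps the single-query-per-entity behavior of the cache while bounding the cache size by the number of modified parents.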

Thanks and Regards,
Srinivas Kashyap

DISCLAIMER: 
E-mails and attachments from TradeStone Software, Inc. are confidential.
If you are not the intended recipient, please notify the sender immediately by
replying to the e-mail, and then delete it without making copies or using it
in any way. No representation is made that this email or any attachments are
free of viruses. Virus scanning is recommended and is the responsibility of
the recipient.

Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-30 Thread sasarun
Hi Erick, 

As suggested, I did try a non-HDFS SolrCloud instance and its response times look
much better. On the configuration side too, I am mostly using default
configurations, with block.cache.direct.memory.allocation set to false. On
analysis of the HDFS cache, evictions seem to be on the higher side.

Thanks, 
Arun



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-27 Thread Emir Arnautović
Hi Arun,
It is hard to measure something without affecting it, but we could use the debug 
results and combine them with the QTime without debug: if we ignore merging results, 
it seems that the majority of the time is spent retrieving docs (~500ms). You should 
consider reducing the number of rows if you want better response time (you can ask 
for rows=0 to see the maximum possible time). Also, as Erick suggested, reducing the 
number of shards (to 1, if you do not plan on many more docs) will trim some of the 
overhead of merging results.

Thanks,
Emir

I noticed that you removed bq - is the time with bq acceptable as well?
> On 27 Sep 2017, at 12:34, sasarun  wrote:
> 
> Hi Emir, 
> 
> Please find the response without the bq parameter and with debugQuery set to true.
> Also, it was noted that QTime comes down drastically, to about 700-800, without
> the debug parameter.
> 
> 
> true
> 0
> 3446
> 
> 
> ("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
> "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
> Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
> "hybrid electric" "electric powerplant")
> 
> edismax
> on
> 
> host
> title
> url
> customContent
> contentSpecificSearch
> 
> 
> id
> contentOntologyTagsCount
> 
> 0
> OR
> 3985d7e2-3e54-48d8-8336-229e85f5d9de
> 600
> true
> 
> 
>  maxScore="56.74194">...
> 
> 
> 
> solr-prd-cluster-m-GooglePatent_shard4_replica2-1506504238282-20
> 
> 
> 
> 35
> 159
> GET_TOP_IDS
> 41294
> ...
> 
> 
> 29
> 165
> GET_TOP_IDS
> 40980
> ...
> 
> 
> 31
> 200
> GET_TOP_IDS
> 41006
> ...
> 
> 
> 43
> 208
> GET_TOP_IDS
> 41040
> ...
> 
> 
> 181
> 466
> GET_TOP_IDS
> 41138
> ...
> 
> 
> 
> 
> 1518
> 1523
> GET_FIELDS,GET_DEBUG
> 110
> ...
> 
> 
> 1562
> 1573
> GET_FIELDS,GET_DEBUG
> 115
> ...
> 
> 
> 1793
> 1800
> GET_FIELDS,GET_DEBUG
> 120
> ...
> 
> 
> 2153
> 2161
> GET_FIELDS,GET_DEBUG
> 125
> ...
> 
> 
> 2957
> 2970
> GET_FIELDS,GET_DEBUG
> 130
> ...
> 
> 
> 
> 
> 10302.0
> 
> 2.0
> 
> 2.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 
> 10288.0
> 
> 661.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 9627.0
> 
> 
> 
> 
> ("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
> "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
> Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
> "hybrid electric" "electric powerplant")
> 
> 
> ("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
> "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
> Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
> "hybrid electric" "electric powerplant")
> 
> 
> (+(DisjunctionMaxQuery((host:hybrid electric powerplant |
> contentSpecificSearch:"hybrid electric powerplant" | customContent:"hybrid
> electric powerplant" | title:hybrid electric powerplant | url:hybrid
> electric powerplant)) DisjunctionMaxQuery((host:hybrid electric powerplants
> | contentSpecificSearch:"hybrid electric powerplants" |
> customContent:"hybrid electric powerplants" | title:hybrid electric
> powerplants | url:hybrid electric powerplants))
> DisjunctionMaxQuery((host:Electric | contentSpecificSearch:electric |
> customContent:electric | title:Electric | url:Electric))
> DisjunctionMaxQuery((host:Electrical | contentSpecificSearch:electrical |
> customContent:electrical | title:Electrical | url:Electrical))
> DisjunctionMaxQuery((host:Electricity | contentSpecificSearch:electricity |
> customContent:electricity | title:Electricity | url:Electricity))
> DisjunctionMaxQuery((host:Engine | contentSpecificSearch:engine |
> customContent:engine | title:Engine | url:Engine))
> DisjunctionMaxQuery((host:fuel economy | contentSpecificSearch:"fuel
> economy" | customContent:"fuel economy" | title:fuel economy | url:fuel
> economy)) DisjunctionMaxQuery((host:fuel efficiency |
> contentSpecificSearch:"fuel efficiency" | customContent:"fuel efficiency" |
> title:fuel efficiency | url:fuel efficiency))
> DisjunctionMaxQuery((host:Hybrid Electric Propulsion |
> contentSpecificSearch:"hybrid electric propulsion" | customContent:"hybrid
> electric propulsion" | title:Hybrid Electric Propulsion | url:Hybrid
> Electric Propulsion)) DisjunctionMaxQuery((host:Power Systems |
> contentSpecificSearch:"power systems" | customContent:"power systems" |
> title:Power Systems | url:Power Systems))
> DisjunctionMaxQuery((host:Powerplant | contentSpecificSearch:powerplant |
> customContent:powerplant | title:Powerplant | url:Powerplant))
> DisjunctionMaxQuery((host:Propulsion | contentSpecificSearch:propulsion |
> customContent:propulsion | title:Propulsion | url:Propulsion))
> DisjunctionMaxQuery((host:hybrid | contentSpecificSearch:hybrid |
> customContent:hybrid | title:hybrid | url:hybrid))
> DisjunctionMaxQuery((host:hybrid electric | contentSpecificSearch:"hybrid
> electric" | customContent:"hybrid 

Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-27 Thread sasarun
Hi Emir, 

Please find the response without the bq parameter and with debugQuery set to true.
Also, it was noted that QTime comes down drastically, to about 700-800, without
the debug parameter.


true
0
3446


("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
"Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
"hybrid electric" "electric powerplant")

edismax
on

host
title
url
customContent
contentSpecificSearch


id
contentOntologyTagsCount

0
OR
3985d7e2-3e54-48d8-8336-229e85f5d9de
600
true


...



solr-prd-cluster-m-GooglePatent_shard4_replica2-1506504238282-20



35
159
GET_TOP_IDS
41294
...


29
165
GET_TOP_IDS
40980
...


31
200
GET_TOP_IDS
41006
...


43
208
GET_TOP_IDS
41040
...


181
466
GET_TOP_IDS
41138
...




1518
1523
GET_FIELDS,GET_DEBUG
110
...


1562
1573
GET_FIELDS,GET_DEBUG
115
...


1793
1800
GET_FIELDS,GET_DEBUG
120
...


2153
2161
GET_FIELDS,GET_DEBUG
125
...


2957
2970
GET_FIELDS,GET_DEBUG
130
...




10302.0

2.0

2.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0



10288.0

661.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0


9627.0




("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
"Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
"hybrid electric" "electric powerplant")


("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
"Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
"hybrid electric" "electric powerplant")


(+(DisjunctionMaxQuery((host:hybrid electric powerplant |
contentSpecificSearch:"hybrid electric powerplant" | customContent:"hybrid
electric powerplant" | title:hybrid electric powerplant | url:hybrid
electric powerplant)) DisjunctionMaxQuery((host:hybrid electric powerplants
| contentSpecificSearch:"hybrid electric powerplants" |
customContent:"hybrid electric powerplants" | title:hybrid electric
powerplants | url:hybrid electric powerplants))
DisjunctionMaxQuery((host:Electric | contentSpecificSearch:electric |
customContent:electric | title:Electric | url:Electric))
DisjunctionMaxQuery((host:Electrical | contentSpecificSearch:electrical |
customContent:electrical | title:Electrical | url:Electrical))
DisjunctionMaxQuery((host:Electricity | contentSpecificSearch:electricity |
customContent:electricity | title:Electricity | url:Electricity))
DisjunctionMaxQuery((host:Engine | contentSpecificSearch:engine |
customContent:engine | title:Engine | url:Engine))
DisjunctionMaxQuery((host:fuel economy | contentSpecificSearch:"fuel
economy" | customContent:"fuel economy" | title:fuel economy | url:fuel
economy)) DisjunctionMaxQuery((host:fuel efficiency |
contentSpecificSearch:"fuel efficiency" | customContent:"fuel efficiency" |
title:fuel efficiency | url:fuel efficiency))
DisjunctionMaxQuery((host:Hybrid Electric Propulsion |
contentSpecificSearch:"hybrid electric propulsion" | customContent:"hybrid
electric propulsion" | title:Hybrid Electric Propulsion | url:Hybrid
Electric Propulsion)) DisjunctionMaxQuery((host:Power Systems |
contentSpecificSearch:"power systems" | customContent:"power systems" |
title:Power Systems | url:Power Systems))
DisjunctionMaxQuery((host:Powerplant | contentSpecificSearch:powerplant |
customContent:powerplant | title:Powerplant | url:Powerplant))
DisjunctionMaxQuery((host:Propulsion | contentSpecificSearch:propulsion |
customContent:propulsion | title:Propulsion | url:Propulsion))
DisjunctionMaxQuery((host:hybrid | contentSpecificSearch:hybrid |
customContent:hybrid | title:hybrid | url:hybrid))
DisjunctionMaxQuery((host:hybrid electric | contentSpecificSearch:"hybrid
electric" | customContent:"hybrid electric" | title:hybrid electric |
url:hybrid electric)) DisjunctionMaxQuery((host:electric powerplant |
contentSpecificSearch:"electric powerplant" | customContent:"electric
powerplant" | title:electric powerplant | url:electric
powerplant/no_coord


+((host:hybrid electric powerplant | contentSpecificSearch:"hybrid electric
powerplant" | customContent:"hybrid electric powerplant" | title:hybrid
electric powerplant | url:hybrid electric powerplant) (host:hybrid electric
powerplants | contentSpecificSearch:"hybrid electric powerplants" |
customContent:"hybrid electric powerplants" | title:hybrid electric
powerplants | url:hybrid electric powerplants) (host:Electric |
contentSpecificSearch:electric | customContent:electric | title:Electric |
url:Electric) (host:Electrical | contentSpecificSearch:electrical |
customContent:electrical | title:Electrical | url:Electrical)
(host:Electricity | contentSpecificSearch:electricity |
customContent:electricity | title:Electricity | url:Electricity)
(host:Engine | contentSpecificSearch:engine | customContent:engine |
title:Engine | url:Engine) (host:fuel 

Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-27 Thread sasarun
Hi Erick, 

QTime comes down with rows set to 1. It was also noted that QTime comes down to
about 900 when the debug parameter is not added to the query.

Thanks, 
Arun 



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-27 Thread Toke Eskildsen
On Tue, 2017-09-26 at 07:43 -0700, sasarun wrote:
> Allocated heap size for young generation is about 8 gb and old 
> generation is about 24 gb. And gc analysis showed peak
> size utlisation is really low compared to these values.

That does not come as a surprise. Your collections would normally be
considered small, if not tiny, looking only at their size measured in
bytes. Again, if you expect them to grow significantly (more than 10x),
your allocation might make sense. If you do not expect such a growth in
the near future, you will be better off with a much smaller heap: The
peak heap utilization that you have logged (or twice that to err on the
cautious side) seems a good starting point.

And whatever you do, don't set Xmx to 32GB. Use <31GB or significantly
more than 32GB:
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/


Are you indexing while you search? If so, you need to set up auto-warming or
state a few explicit warmup queries. If not, your measurements will not be
representative, as they will be of first searches, which are always slower
than warmed searches.
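A sketch of explicit warmup queries in solrconfig.xml (the query text here is only a placeholder; use terms typical of production traffic):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- placeholder warmup query run each time a new searcher opens -->
    <lst>
      <str name="q">hybrid electric</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>
```

The same listener registered for event="firstSearcher" warms the very first searcher after startup.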


- Toke Eskildsen, Royal Danish Library



Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-27 Thread Emir Arnautović
Hi Arun,
This is not the simplest query either - a dozen phrase queries on several 
fields, plus the same query as bq. Can you provide the debugQuery info?
I did not look much into the debug times and what they include, but one thing that 
is strange to me is that QTime is 4s while the query time in debug is 1.3s. Can you 
try running without bq? Can you include the boost factors in the main query?
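For example (a sketch only, reusing the boost values from the earlier bq), the boosts could be attached to the phrases in q itself, since the main query is already an OR of the same phrases:

```
q=("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0
   "Electric"^50.0 "Electrical"^50.0 ... "hybrid"^15.0
   "hybrid electric"^15.0 "electric powerplant"^15.0)&defType=edismax
```

That removes the second pass over the same clauses that a separate bq implies.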

Thanks,
Emir

> On 26 Sep 2017, at 16:43, sasarun  wrote:
> 
> Hi All, 
> I have been using Solr for some time now, but mostly in standalone mode. Now
> my current project is using Solr 6.5.1 hosted on Hadoop. My solrconfig.xml
> has the following configuration. In the prod environment the performance on
> querying seems to be really slow. Can anyone help me with a few pointers on
> how to improve it?
> 
> 
> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>   <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
>   <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
>   <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
>   <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:false}</bool>
>   <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
>   <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
>   <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:false}</bool>
>   <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
>   <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
>   <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
> </directoryFactory>
> <lockType>hdfs</lockType>
> It has 6 collections of the following sizes:
> Collection 1 --> 6.41 MB
> Collection 2 --> 634.51 KB
> Collection 3 --> 4.59 MB
> Collection 4 --> 1,020.56 MB
> Collection 5 --> 607.26 MB
> Collection 6 --> 102.4 KB
> Each collection has 5 shards. The allocated heap size for the young generation
> is about 8 GB and for the old generation about 24 GB, and GC analysis showed that
> peak utilisation is really low compared to these values.
> But querying Collection 4 and Collection 5 gives really slow responses,
> even though we are not using any complex queries. The output of debug queries run
> with debug=timing is given below for reference. Can anyone suggest a way to
> improve the performance?
> 
> Response to query
> 
> 
> true
> 0
> 3962
> 
> 
> ("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
> "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
> Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
> "hybrid electric" "electric powerplant")
> 
> edismax
> true
> on
> 
> host
> title
> url
> customContent
> contentSpecificSearch
> 
> 
> id
> contentTagsCount
> 
> 0
> OR
> OR
> 3985d7e2-3e54-48d8-8336-229e85f5d9de
> 600
> 
> ("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0
> "Electric"^50.0 "Electrical"^50.0 "Electricity"^50.0 "Engine"^50.0 "fuel
> economy"^50.0 "fuel efficiency"^50.0 "Hybrid Electric Propulsion"^50.0
> "Power Systems"^50.0 "Powerplant"^50.0 "Propulsion"^50.0 "hybrid"^15.0
> "hybrid electric"^15.0 "electric powerplant"^15.0)
> 
> 
> 
> 
> 
> 15374.0
> 
> 2.0
> 
> 2.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 
> 15363.0
> 
> 1313.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 14048.0
> 
> 
> 
> 
> 
> Thanks,
> Arun
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-26 Thread Erick Erickson
Well, 15 second responses are not what I'd expect either. But two
things (just looked again)

1> note that the time to assemble the debug information is a large
majority of your total time (14 of 15.3 seconds).
2> you're specifying 600 rows, which is quite a lot, as each one
requires that a 16K block of data be read from disk and decompressed
to assemble the "fl" list.

so one quick test would be to set rows=1 or something. All that said,
the QTime value returned does _not_ include <1> or <2> above and even
4 seconds seems excessive.
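One way to run that quick test (the parameter values are illustrative) is to cut the request down to just the top document id, which skips most of the stored-field reads and all of the debug assembly:

```
q=("hybrid electric powerplant" ...)&defType=edismax&rows=1&fl=id
```

Drop the debug/debugQuery parameters entirely for this timing run, so the reported QTime reflects only the search itself.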

Best,
Erick

On Tue, Sep 26, 2017 at 10:54 AM, sasarun  wrote:
> Hi Erick,
>
> Thank you for the quick response. Query time was relatively faster once it
> was read from memory. But personally I always felt the response time could be
> far better. As suggested, we will try to set up a non-HDFS environment and
> report the results.
>
> Thanks,
> Arun
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-26 Thread sasarun
Hi Erick, 

Thank you for the quick response. Query time was relatively faster once it
was read from memory. But personally I always felt the response time could be
far better. As suggested, we will try to set up a non-HDFS environment and
report the results.

Thanks, 
Arun



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr performance issue on querying --> Solr 6.5.1

2017-09-26 Thread Erick Erickson
Does the query time _stay_ low? Once the data is read from HDFS it
should pretty much stay in memory. So my question is whether, once
Solr warms up you see this kind of query response time.

Have you tried this on a non HDFS system? That would be useful to help
figure out where to look.

And given the sizes of your collections, unless you expect them to get
much larger, there's no reason to shard any of them. Sharding should
only really be used when the collections are too big for a single
shard as distributed searches inevitably have increased overhead. I
expect _at least_ 20M documents/shard, and have seen 200M docs/shard.
YMMV of course.

Best,
Erick

On Tue, Sep 26, 2017 at 7:43 AM, sasarun  wrote:
> Hi All,
> I have been using Solr for some time now, but mostly in standalone mode. Now
> my current project is using Solr 6.5.1 hosted on Hadoop. My solrconfig.xml
> has the following configuration. In the prod environment the performance on
> querying seems to be really slow. Can anyone help me with a few pointers on
> how to improve it?
>
> 
> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>   <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
>   <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
>   <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
>   <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:false}</bool>
>   <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
>   <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
>   <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:false}</bool>
>   <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
>   <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
>   <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
> </directoryFactory>
> <lockType>hdfs</lockType>
> It has 6 collections of the following sizes:
> Collection 1 --> 6.41 MB
> Collection 2 --> 634.51 KB
> Collection 3 --> 4.59 MB
> Collection 4 --> 1,020.56 MB
> Collection 5 --> 607.26 MB
> Collection 6 --> 102.4 KB
> Each collection has 5 shards. The allocated heap size for the young generation
> is about 8 GB and for the old generation about 24 GB, and GC analysis showed that
> peak utilisation is really low compared to these values.
> But querying Collection 4 and Collection 5 gives really slow responses,
> even though we are not using any complex queries. The output of debug queries run
> with debug=timing is given below for reference. Can anyone suggest a way to
> improve the performance?
>
> Response to query
> 
> 
> true
> 0
> 3962
> 
> 
> ("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
> "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
> Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
> "hybrid electric" "electric powerplant")
> 
> edismax
> true
> on
> 
> host
> title
> url
> customContent
> contentSpecificSearch
> 
> 
> id
> contentTagsCount
> 
> 0
> OR
> OR
> 3985d7e2-3e54-48d8-8336-229e85f5d9de
> 600
> 
> ("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0
> "Electric"^50.0 "Electrical"^50.0 "Electricity"^50.0 "Engine"^50.0 "fuel
> economy"^50.0 "fuel efficiency"^50.0 "Hybrid Electric Propulsion"^50.0
> "Power Systems"^50.0 "Powerplant"^50.0 "Propulsion"^50.0 "hybrid"^15.0
> "hybrid electric"^15.0 "electric powerplant"^15.0)
> 
> 
> 
> 
> 
> 15374.0
> 
> 2.0
> 
> 2.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 
> 15363.0
> 
> 1313.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 0.0
> 
> 
> 14048.0
> 
> 
> 
>
>
> Thanks,
> Arun
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr performance issue on querying --> Solr 6.5.1

2017-09-26 Thread sasarun
Hi All, 
I have been using Solr for some time now, but mostly in standalone mode. Now
my current project is using Solr 6.5.1 hosted on Hadoop. My solrconfig.xml
has the following configuration. In the prod environment the performance on
querying seems to be really slow. Can anyone help me with a few pointers on
how to improve it?


<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
  <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
  <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:false}</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
  <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:false}</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>
<lockType>hdfs</lockType>
It has 6 collections of the following sizes:
Collection 1 --> 6.41 MB
Collection 2 --> 634.51 KB
Collection 3 --> 4.59 MB
Collection 4 --> 1,020.56 MB
Collection 5 --> 607.26 MB
Collection 6 --> 102.4 KB
Each collection has 5 shards. The allocated heap size for the young generation
is about 8 GB and for the old generation about 24 GB, and GC analysis showed that
peak utilisation is really low compared to these values.
But querying Collection 4 and Collection 5 gives really slow responses,
even though we are not using any complex queries. The output of debug queries run
with debug=timing is given below for reference. Can anyone suggest a way to
improve the performance?

Response to query


true
0
3962


("hybrid electric powerplant" "hybrid electric powerplants" "Electric"
"Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid
Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid"
"hybrid electric" "electric powerplant")

edismax
true
on

host
title
url
customContent
contentSpecificSearch


id
contentTagsCount

0
OR
OR
3985d7e2-3e54-48d8-8336-229e85f5d9de
600

("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0
"Electric"^50.0 "Electrical"^50.0 "Electricity"^50.0 "Engine"^50.0 "fuel
economy"^50.0 "fuel efficiency"^50.0 "Hybrid Electric Propulsion"^50.0
"Power Systems"^50.0 "Powerplant"^50.0 "Propulsion"^50.0 "hybrid"^15.0
"hybrid electric"^15.0 "electric powerplant"^15.0)





15374.0

2.0

2.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0



15363.0

1313.0


0.0


0.0


0.0


0.0


0.0


0.0


0.0


14048.0





Thanks,
Arun



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Suggestions on scaling up Solr performance.

2017-05-11 Thread Erick Erickson
Impossible to answer as Shawn says. Or even recommend. For instance,
you say "but once we launch our application all across the world it
may give performance issues."

You haven't defined at all what changes when you "launch our
application all across the world". Increasing your query traffic 10
fold? Trying to index 100 times the number of docs you have now?
10,000 times the number of docs you have now?

Best,
Erick

On Thu, May 11, 2017 at 8:11 AM, Venkateswarlu Bommineni
 wrote:
> Thanks, Shawn.
>
> As of now we don't have any performance issues; we are just planning ahead
> for the future.
>
> So I was looking for any general architecture that is agreed on by many
> Solr experts.
>
> Thanks,
> Venkat.
>
> On Thu, May 11, 2017 at 8:19 PM, Shawn Heisey  wrote:
>
>> On 5/11/2017 7:39 AM, Venkateswarlu Bommineni wrote:
>> > In current design we have below configuration: *One collection with
>> > one shard with 4 replication factor with 4 nodes.* As of now it is
>> > working fine, but once we launch our application all across the world
>> > it may give performance issues. To improve the performance below is
>> > our thought: one of the design we found is: *Adding a new node and
>> > adding a new replication to existing solrcloud.* Please suggest any
>> > other approaches which give better performance.
>>
>> Knowing the number of nodes, shards, and replicas is not enough
>> information to even make guesses.
>>
>> https://lucidworks.com/sizing-hardware-in-the-abstract-why-
>> we-dont-have-a-definitive-answer/
>>
>> Even with a LOT more information, any recommendations we made would be
>> just that -- guesses.  Those guesses might be completely wrong, or
>> represent a lot more expense than you really need.
>>
>> The exact kind of setup you need is affected by a great many things.
>> Here's a few of them: request rate, complexity of queries, contents of the
>> index, size of the index, Solr cache settings, schema settings, number of
>> documents, number of shards, amount of memory in the server, amount of
>> memory in the java heap.
>>
>> Even the phrase "improve our performance" is vague.  What kind of
>> performance issue are you having?
>>
>> Thanks,
>> Shawn
>>
>>


Re: Suggestions on scaling up Solr performance.

2017-05-11 Thread Venkateswarlu Bommineni
Thanks, Shawn.

As of now we don't have any performance issues; we are just planning ahead
for the future.

So I was looking for any general architecture that is agreed on by many
Solr experts.

Thanks,
Venkat.

On Thu, May 11, 2017 at 8:19 PM, Shawn Heisey  wrote:

> On 5/11/2017 7:39 AM, Venkateswarlu Bommineni wrote:
> > In current design we have below configuration: *One collection with
> > one shard with 4 replication factor with 4 nodes.* As of now it is
> > working fine, but once we launch our application all across the world
> > it may give performance issues. To improve the performance below is
> > our thought: one of the design we found is: *Adding a new node and
> > adding a new replication to existing solrcloud.* Please suggest any
> > other approaches which give better performance.
>
> Knowing the number of nodes, shards, and replicas is not enough
> information to even make guesses.
>
> https://lucidworks.com/sizing-hardware-in-the-abstract-why-
> we-dont-have-a-definitive-answer/
>
> Even with a LOT more information, any recommendations we made would be
> just that -- guesses.  Those guesses might be completely wrong, or
> represent a lot more expense than you really need.
>
> The exact kind of setup you need is affected by a great many things.
> Here's a few of them: request rate, complexity of queries, contents of the
> index, size of the index, Solr cache settings, schema settings, number of
> documents, number of shards, amount of memory in the server, amount of
> memory in the java heap.
>
> Even the phrase "improve our performance" is vague.  What kind of
> performance issue are you having?
>
> Thanks,
> Shawn
>
>


Re: Suggestions on scaling up Solr performance.

2017-05-11 Thread Shawn Heisey
On 5/11/2017 7:39 AM, Venkateswarlu Bommineni wrote:
> In current design we have below configuration: *One collection with
> one shard with 4 replication factor with 4 nodes.* As of now it is
> working fine, but once we launch our application all across the world
> it may give performance issues. To improve the performance below is
> our thought: one of the design we found is: *Adding a new node and
> adding a new replication to existing solrcloud.* Please suggest any
> other approaches which give better performance.

Knowing the number of nodes, shards, and replicas is not enough
information to even make guesses.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Even with a LOT more information, any recommendations we made would be just 
that -- guesses.  Those guesses might be completely wrong, or represent a lot 
more expense than you really need.

The exact kind of setup you need is affected by a great many things.  Here's a 
few of them: request rate, complexity of queries, contents of the index, size 
of the index, Solr cache settings, schema settings, number of documents, number 
of shards, amount of memory in the server, amount of memory in the java heap.

Even the phrase "improve our performance" is vague.  What kind of performance 
issue are you having?

Thanks,
Shawn



Suggestions on scaling up Solr performance.

2017-05-11 Thread Venkateswarlu Bommineni
Hello Guys,


In the current design we have the below configuration:

*One collection with one shard, a replication factor of 4, and 4 nodes.*

As of now it is working fine, but once we launch our application all across
the world it may give performance issues.

To improve the performance below is our thought:

one of the design we found is:

*Adding a new node and adding a new replication to existing solrcloud.*


Please suggest any other approaches which give better performance.


Thanks,
Venkat.
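The replica-adding approach proposed above is done through the SolrCloud Collections API (`action=ADDREPLICA`). A minimal sketch of building such a request; the base URL, collection, shard, and node names are hypothetical placeholders, not values from this thread:

```python
from urllib.parse import urlencode

def addreplica_url(base, collection, shard, node=None):
    """Build a Collections API ADDREPLICA request URL."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node is not None:
        # Without an explicit node, Solr chooses a live node itself.
        params["node"] = node
    return f"{base}/admin/collections?{urlencode(params)}"

# Hypothetical example: add a replica of shard1 on a newly launched node.
url = addreplica_url("http://localhost:8983/solr", "mycoll", "shard1",
                     node="newhost:8983_solr")
print(url)
```

Whether an extra replica actually helps depends, as Shawn notes, on query rate and memory headroom; it mainly adds query capacity, not lower per-query latency.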


Re: Solr performance on EC2 linux

2017-05-03 Thread Walter Underwood
Already have a Jira issue for next week. I have a script to run prod logs 
against a cluster. I’ll be testing a four shard by two replica cluster with 17 
million docs and very long queries. We are working on getting the 95th 
percentile under one second, so we should exercise the timeAllowed feature.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 3, 2017, at 3:53 PM, Rick Leir  wrote:
> 
> +Walter test it
> 
> Jeff,
> How much CPU does the EC2 hypervisor use? I have heard 5% but that is for a 
> normal workload, and is mostly consumed during system calls or context 
> changes. So it is quite understandable that frequent time calls would take a 
> bigger bite in the AWS cloud compared to bare metal. Sorry, my words are 
> mostly conjecture so please ignore. Cheers -- Rick
> 
> On May 3, 2017 2:35:33 PM EDT, Jeff Wartes  wrote:
>> 
>> It’s presumably not a small degradation - this guy very recently
>> suggested it’s 77% slower:
>> https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
>> 
>> The other reason that blog post is interesting to me is that his
>> benchmark utility showed the work of entering the kernel as high system
>> time, which is also what I was seeing.
>> 
>> I really want to go back and try some more tests, including (now)
>> disabling the timeAllowed param in my query corpus. 
>> I think I’m still a few weeks of higher priority issues away from that
>> though.
>> 
>> 
>> On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" 
>> wrote:
>> 
>> I remember seeing some performance impact (even when not using it) and
>> it
>> was attributed to the calls to System.nanoTime. See SOLR-7875 and
>> SOLR-7876
>> (fixed for 5.3 and 5.4). Those two Jiras fix the impact when
>> timeAllowed is
>>  not used, but I don't know if there were more changes to improve the
>> performance of the feature itself. The problem was that System.nanoTime
>> may
>> be called too many times on indices with many different terms. If this
>> is
>> the problem Jeff is seeing, a small degradation of System.nanoTime
>> could
>>   have a big impact.
>> 
>>   Tomás
>> 
>> On Tue, May 2, 2017 at 10:23 AM, Walter Underwood
>> 
>>   wrote:
>> 
>>> Hmm, has anyone measured the overhead of timeAllowed? We use it all
>> the
>>> time.
>>> 
>>> If nobody has, I’ll run a benchmark with and without it.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> 
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On May 2, 2017, at 9:52 AM, Chris Hostetter
>> 
>>> wrote:
 
 
 : I specify a timeout on all queries, 
 
 Ah -- ok, yeah -- you mean using "timeAllowed" correct?
 
 If the root issue you were seeing is in fact clocksource related,
 then using timeAllowed would probably be a significant compounding
 factor there since it would involve a lot of time checks in a single
 request (even w/o any debugging enabled)
 
 (did your coworker's experiments with ES use any sort of equivalent
 timeout feature?)
 
 
 
 
 
 -Hoss
 
 http://www.lucidworks.com/
>>> 
>>> 
>> 
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: Solr performance on EC2 linux

2017-05-03 Thread Rick Leir
+Walter test it

Jeff,
How much CPU does the EC2 hypervisor use? I have heard 5% but that is for a 
normal workload, and is mostly consumed during system calls or context changes. 
So it is quite understandable that frequent time calls would take a bigger bite 
in the AWS cloud compared to bare metal. Sorry, my words are mostly conjecture 
so please ignore. Cheers -- Rick

On May 3, 2017 2:35:33 PM EDT, Jeff Wartes  wrote:
>
>It’s presumably not a small degradation - this guy very recently
>suggested it’s 77% slower:
>https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
>
>The other reason that blog post is interesting to me is that his
>benchmark utility showed the work of entering the kernel as high system
>time, which is also what I was seeing.
>
>I really want to go back and try some more tests, including (now)
>disabling the timeAllowed param in my query corpus. 
>I think I’m still a few weeks of higher priority issues away from that
>though.
>
>
>On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" 
>wrote:
>
>I remember seeing some performance impact (even when not using it) and
>it
>was attributed to the calls to System.nanoTime. See SOLR-7875 and
>SOLR-7876
>(fixed for 5.3 and 5.4). Those two Jiras fix the impact when
>timeAllowed is
>   not used, but I don't know if there were more changes to improve the
>performance of the feature itself. The problem was that System.nanoTime
>may
>be called too many times on indices with many different terms. If this
>is
>the problem Jeff is seeing, a small degradation of System.nanoTime
>could
>have a big impact.
>
>Tomás
>
>On Tue, May 2, 2017 at 10:23 AM, Walter Underwood
>
>wrote:
>
>> Hmm, has anyone measured the overhead of timeAllowed? We use it all
>the
>> time.
>>
>> If nobody has, I’ll run a benchmark with and without it.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>>
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On May 2, 2017, at 9:52 AM, Chris Hostetter
>
>> wrote:
>> >
>> >
>> > : I specify a timeout on all queries, 
>> >
>> > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
>> >
>> > If the root issue you were seeing is in fact clocksource related,
>> > then using timeAllowed would probably be a significant compounding
>> > factor there since it would involve a lot of time checks in a single
>> > request (even w/o any debugging enabled)
>> >
>> > (did your coworker's experiments with ES use any sort of equivalent
>> > timeout feature?)
>> >
>> >
>> >
>> >
>> >
>> > -Hoss
>> >
>> > http://www.lucidworks.com/
>>
>>
>

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Solr performance on EC2 linux

2017-05-03 Thread Jeff Wartes

It’s presumably not a small degradation - this guy very recently suggested it’s 
77% slower:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/

The other reason that blog post is interesting to me is that his benchmark 
utility showed the work of entering the kernel as high system time, which is 
also what I was seeing.

I really want to go back and try some more tests, including (now) disabling the 
timeAllowed param in my query corpus. 
I think I’m still a few weeks of higher priority issues away from that though.


On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe"  wrote:

I remember seeing some performance impact (even when not using it) and it
was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
(fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
not used, but I don't know if there were more changes to improve the
performance of the feature itself. The problem was that System.nanoTime may
be called too many times on indices with many different terms. If this is
the problem Jeff is seeing, a small degradation of System.nanoTime could
have a big impact.

Tomás

On Tue, May 2, 2017 at 10:23 AM, Walter Underwood 
wrote:

> Hmm, has anyone measured the overhead of timeAllowed? We use it all the
> time.
>
> If nobody has, I’ll run a benchmark with and without it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> 
> http://observer.wunderwood.org/  (my blog)
>
>
> > On May 2, 2017, at 9:52 AM, Chris Hostetter 
> wrote:
> >
> >
> > : I specify a timeout on all queries, 
> >
> > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
> >
> > If the root issue you were seeing is in fact clocksource related,
> > then using timeAllowed would probably be a significant compounding
> > factor there since it would involve a lot of time checks in a single
> > request (even w/o any debugging enabled)
> >
> > (did your coworker's experiments with ES use any sort of equivalent
> > timeout feature?)
> >
> >
> >
> >
> >
> > -Hoss
> > 
> > http://www.lucidworks.com/
>
>




Re: Solr performance on EC2 linux

2017-05-02 Thread Tomás Fernández Löbbe
I remember seeing some performance impact (even when not using it) and it
was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
(fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
not used, but I don't know if there were more changes to improve the
performance of the feature itself. The problem was that System.nanoTime may
be called too many times on indices with many different terms. If this is
the problem Jeff is seeing, a small degradation of System.nanoTime could
have a big impact.

Tomás
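The effect Tomás describes is that timeAllowed can trigger a clock read per term visited, so even a tiny per-call cost multiplies across a big index. As a rough, assumption-laden illustration (Python's monotonic clock standing in for Java's System.nanoTime; the 10M-term figure is hypothetical), the per-call cost can be measured and scaled like this:

```python
import time

def clock_call_cost_ns(calls=200_000):
    """Average cost of one monotonic clock read, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.monotonic_ns()  # one clock read per loop iteration
    return (time.perf_counter_ns() - start) / calls

per_call = clock_call_cost_ns()
# Hypothetical: a request touching 10 million terms would pay roughly
# this many milliseconds in clock reads alone.
est_ms = per_call * 10_000_000 / 1e6
print(f"~{per_call:.0f} ns per call, ~{est_ms:.0f} ms per 10M checks")
```

On a slow clocksource (as suspected on EC2/Xen in this thread), per_call grows sharply, which is exactly why a "small degradation of System.nanoTime could have a big impact."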

On Tue, May 2, 2017 at 10:23 AM, Walter Underwood 
wrote:

> Hmm, has anyone measured the overhead of timeAllowed? We use it all the
> time.
>
> If nobody has, I’ll run a benchmark with and without it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On May 2, 2017, at 9:52 AM, Chris Hostetter 
> wrote:
> >
> >
> > : I specify a timeout on all queries, 
> >
> > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
> >
> > If the root issue you were seeing is in fact clocksource related,
> > then using timeAllowed would probably be a significant compounding
> > factor there since it would involve a lot of time checks in a single
> > request (even w/o any debugging enabled)
> >
> > (did your coworker's experiments with ES use any sort of equivalent
> > timeout feature?)
> >
> >
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
>
>


Re: Solr performance on EC2 linux

2017-05-02 Thread Walter Underwood
Hmm, has anyone measured the overhead of timeAllowed? We use it all the time.

If nobody has, I’ll run a benchmark with and without it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 2, 2017, at 9:52 AM, Chris Hostetter  wrote:
> 
> 
> : I specify a timeout on all queries, 
> 
> Ah -- ok, yeah -- you mean using "timeAllowed" correct?
> 
> If the root issue you were seeing is in fact clocksource related,
> then using timeAllowed would probably be a significant compounding 
> factor there since it would involve a lot of time checks in a single 
> request (even w/o any debugging enabled)
> 
> (did your coworker's experiments with ES use any sort of equivalent 
> timeout feature?)
> 
> 
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/



Re: Solr performance on EC2 linux

2017-05-02 Thread Chris Hostetter

: I specify a timeout on all queries, 

Ah -- ok, yeah -- you mean using "timeAllowed" correct?

If the root issue you were seeing is in fact clocksource related,
then using timeAllowed would probably be a significant compounding 
factor there since it would involve a lot of time checks in a single 
request (even w/o any debugging enabled)

(did your coworker's experiments with ES use any sort of equivalent 
timeout feature?)





-Hoss
http://www.lucidworks.com/
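For reference on the "timeAllowed" parameter discussed here: it is a standard per-request Solr parameter, in milliseconds, after which Solr may return partial results. A sketch of attaching it to every query; the host, collection, and query below are hypothetical:

```python
from urllib.parse import urlencode

def search_url(base, collection, q, time_allowed_ms=500):
    """Build a Solr select URL with a per-request time budget."""
    params = {"q": q, "timeAllowed": time_allowed_ms}
    return f"{base}/{collection}/select?{urlencode(params)}"

url = search_url("http://localhost:8983/solr", "mycoll", "title:solr")
print(url)
```

As the thread notes, enforcing this budget means extra clock checks inside the request, which is the suspected interaction with a slow clocksource.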


Re: Solr performance on EC2 linux

2017-05-01 Thread Jeff Wartes
Yes, that’s the Xenial I tried. Ubuntu 16.04.2 LTS.

On 5/1/17, 7:22 PM, "Will Martin"  wrote:

Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?




On 5/1/2017 6:37 PM, Jeff Wartes wrote:
> I tried a few variations of various things before we found and tried that 
linux/EC2 tuning page, including:
>- EC2 instance type: r4, c4, and i3
>- Ubuntu version: Xenial and Trusty
>- EBS vs local storage
>- Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m 
aware of the issues with early java8 versions and I’m not using G1)
>
> Most of those attempts were to help reduce differences between the data 
center and the EC2 cluster. In all cases I re-indexed from scratch. I got the 
same very high system-time symptom in all cases. With the linux changes in 
place, we settled on r4/Xenial/EBS/Stock.
>
> Again, this was a slightly modified Solr 5.4, (I added backup requests, 
and two memory allocation rate tweaks that have long since been merged into 
mainline - released in 6.2 I think. I can dig up the jira numbers if anyone’s 
interested) I’ve never used Solr 6.x in production though.
> The only reason I mentioned 6.x at all is because I’m aware that ES 5.x 
is based on Lucene 6.2. I don’t believe my coworker spent any time on tuning 
his ES setup, although I think he did try G1.
>
> I definitely do want to binary-search those settings until I understand 
better what exactly did the trick.
> The problem is the long cycle time per test, but hopefully within the next 
couple of weeks.
>
>
>
> On 5/1/17, 7:26 AM, "John Bickerstaff"  wrote:
>
>  It's also very important to consider the type of EC2 instance you are
>  using...
>  
>  We settled on the R4.2XL...  The R series is labeled "High-Memory"
>  
>  Which instance type did you end up using?
>  
>  On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey  
wrote:
>  
>  > On 4/28/2017 10:09 AM, Jeff Wartes wrote:
>  > > tldr: Recently, I tried moving an existing solrcloud 
configuration from
>  > a local datacenter to EC2. Performance was roughly 1/10th what I’d
>  > expected, until I applied a bunch of linux tweaks.
>  >
>  > How very strange.  I knew virtualization would have overhead, 
possibly
>  > even measurable overhead, but that's insane.  Running on bare 
metal is
>  > always better if you can do it.  I would be curious what would 
happen on
>  > your original install if you applied similar tuning to that.  
Would you
>  > see a speedup there?
>  >
>  > > Interestingly, a coworker playing with an ElasticSearch (ES 5.x, 
so a
>  > much more recent release) alternate implementation of the same 
index was
>  > not seeing this high-system-time behavior on EC2, and was getting
>  > throughput consistent with our general expectations.
>  >
>  > That's even weirder.  ES 5.x will likely be using Points field 
types for
>  > numeric fields, and although those are faster than what Solr 
currently
>  > uses, I doubt it could explain that difference.  The implication 
here is
>  > that the ES systems are running with stock EC2 settings, not the 
tuned
>  > settings ... but I'd like you to confirm that.  Same Java version 
as
>  > with Solr?  IMHO, Java itself is more likely to cause issues like 
you
>  > saw than Solr.
>  >
>  > > I’m writing this for a few reasons:
>  > >
>  > > 1.   The performance difference was so crazy I really feel 
like this
>  > should really be broader knowledge.
>  >
>  > Definitely agree!  I would be very interested in learning which of 
the
>  > tunables you changed were major contributors to the improvement.  
If it
>  > turns out that Solr's code is sub-optimal in some way, maybe we 
can fix it.
>  >
>  > > 2.   If anyone is aware of anything that changed in Lucene 
between
>  > 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering 
from
>  > this? If it’s the clocksource that’s the issue, there’s an 
implication that
>  > Solr was using tons more system calls like gettimeofday that the 
EC2 (xen)
>  > hypervisor doesn’t allow in userspace.
>  >
>  > I had not considered the performance regression in 6.4.0 and 6.4.1 
that
>  > Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x 
version?
>  >
>  > =
>  >
>  > Specific thoughts on the tuning:
>  >
>  > The noatime option is very good to use.  I also use nodiratime on 
my
>  > systems.  Turning these 

Re: Solr performance on EC2 linux

2017-05-01 Thread Jeff Wartes
I started with the same three-node 15-shard configuration I’d been used to, in 
an RF1 cluster. (the index is almost 700G so this takes three r4.8xlarge’s if I 
want to be entirely memory-resident) I eventually dropped down to a 1/3rd size 
index on a single node (so 5 shards, 100M docs each) so I could test 
configurations more quickly. The system time usage was present on all solr 
nodes regardless. I adjusted for a difference in the CPU count on the EC2 nodes 
when I picked my load testing rates. 

Zookeeper is a separate cluster on separate nodes. It is NOT collocated with 
Solr, although it’s dedicated exclusively to Solr’s use.

I specify a timeout on all queries, and as mentioned, use SOLR-4449. So there’s 
possibly an argument I’m doing a lot more timing related calls than most. 
There’s nothing particularly exotic there though, just another Executor 
Service, and you’ll never get a backup request on an RF1 cluster because 
there’s no alternate to try. 


On 5/1/17, 6:28 PM, "Walter Underwood"  wrote:

Might want to measure the single CPU performance of your EC2 instance. The 
last time I checked, my MacBook was twice as fast as the EC2 instance I was 
using.

wunder
Walter Underwood
wun...@wunderwood.org

http://observer.wunderwood.org/  (my blog)


> On May 1, 2017, at 6:24 PM, Chris Hostetter  
wrote:
> 
> 
> : tldr: Recently, I tried moving an existing solrcloud configuration from 
> : a local datacenter to EC2. Performance was roughly 1/10th what I’d 
> : expected, until I applied a bunch of linux tweaks.
> 
> How many total nodes in your cluster?  How many of them running ZooKeeper?
> 
> Did you observe the heavy increase in system time CPU usage on all nodes, 
> or just the ones running zookeeper?
> 
> I ask because if your speculation is correct and it is an issue of 
> clocksource, then perhaps ZK is where the majority of those system calls 
> are happening, and perhaps that's why you didn't see any similar heavy 
> system CPU load in ES?  
> 
> (Then again: at the lowest levels "lucene" really shouldn't care about 
> anything clock related at all. Any "time" related code would live in the 
> Solr level ... hmmm.)
> 
> 
> -Hoss
> 
> http://www.lucidworks.com/





Re: Solr performance on EC2 linux

2017-05-01 Thread Will Martin
Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?




On 5/1/2017 6:37 PM, Jeff Wartes wrote:
> I tried a few variations of various things before we found and tried that 
> linux/EC2 tuning page, including:
>- EC2 instance type: r4, c4, and i3
>- Ubuntu version: Xenial and Trusty
>- EBS vs local storage
>- Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of 
> the issues with early java8 versions and I’m not using G1)
>
> Most of those attempts were to help reduce differences between the data 
> center and the EC2 cluster. In all cases I re-indexed from scratch. I got the 
> same very high system-time symptom in all cases. With the linux changes in 
> place, we settled on r4/Xenial/EBS/Stock.
>
> Again, this was a slightly modified Solr 5.4, (I added backup requests, and 
> two memory allocation rate tweaks that have long since been merged into 
> mainline - released in 6.2 I think. I can dig up the jira numbers if anyone’s 
> interested) I’ve never used Solr 6.x in production though.
> The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is 
> based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his 
> ES setup, although I think he did try G1.
>
> I definitely do want to binary-search those settings until I understand 
> better what exactly did the trick.
> The problem is the long cycle time per test, but hopefully within the next 
> couple of weeks.
>
>
>
> On 5/1/17, 7:26 AM, "John Bickerstaff"  wrote:
>
>  It's also very important to consider the type of EC2 instance you are
>  using...
>  
>  We settled on the R4.2XL...  The R series is labeled "High-Memory"
>  
>  Which instance type did you end up using?
>  
>  On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey  wrote:
>  
>  > On 4/28/2017 10:09 AM, Jeff Wartes wrote:
>  > > tldr: Recently, I tried moving an existing solrcloud configuration 
> from
>  > a local datacenter to EC2. Performance was roughly 1/10th what I’d
>  > expected, until I applied a bunch of linux tweaks.
>  >
>  > How very strange.  I knew virtualization would have overhead, possibly
>  > even measurable overhead, but that's insane.  Running on bare metal is
>  > always better if you can do it.  I would be curious what would happen 
> on
>  > your original install if you applied similar tuning to that.  Would you
>  > see a speedup there?
>  >
>  > > Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a
>  > much more recent release) alternate implementation of the same index 
> was
>  > not seeing this high-system-time behavior on EC2, and was getting
>  > throughput consistent with our general expectations.
>  >
>  > That's even weirder.  ES 5.x will likely be using Points field types 
> for
>  > numeric fields, and although those are faster than what Solr currently
>  > uses, I doubt it could explain that difference.  The implication here 
> is
>  > that the ES systems are running with stock EC2 settings, not the tuned
>  > settings ... but I'd like you to confirm that.  Same Java version as
>  > with Solr?  IMHO, Java itself is more likely to cause issues like you
>  > saw than Solr.
>  >
>  > > I’m writing this for a few reasons:
>  > >
>  > > 1.   The performance difference was so crazy I really feel like 
> this
>  > should really be broader knowledge.
>  >
>  > Definitely agree!  I would be very interested in learning which of the
>  > tunables you changed were major contributors to the improvement.  If it
>  > turns out that Solr's code is sub-optimal in some way, maybe we can 
> fix it.
>  >
>  > > 2.   If anyone is aware of anything that changed in Lucene 
> between
>  > 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
>  > this? If it’s the clocksource that’s the issue, there’s an implication 
> that
>  > Solr was using tons more system calls like gettimeofday that the EC2 
> (xen)
>  > hypervisor doesn’t allow in userspace.
>  >
>  > I had not considered the performance regression in 6.4.0 and 6.4.1 that
>  > Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x 
> version?
>  >
>  > =
>  >
>  > Specific thoughts on the tuning:
>  >
>  > The noatime option is very good to use.  I also use nodiratime on my
>  > systems.  Turning these off can have *massive* impacts on disk
>  > performance.  If these are the source of the speedup, then the machine
>  > doesn't have enough spare memory.
>  >
>  > I'd be wary of the "nobarrier" mount option.  If the underlying storage
>  > has battery-backed write caches, or is SSD without write caching, it
>  > wouldn't be a problem.  Here's info about the "discard" mount option, I
>  > 

Re: Solr performance on EC2 linux

2017-05-01 Thread Walter Underwood
Might want to measure the single CPU performance of your EC2 instance. The last 
time I checked, my MacBook was twice as fast as the EC2 instance I was using.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 1, 2017, at 6:24 PM, Chris Hostetter  wrote:
> 
> 
> : tldr: Recently, I tried moving an existing solrcloud configuration from 
> : a local datacenter to EC2. Performance was roughly 1/10th what I’d 
> : expected, until I applied a bunch of linux tweaks.
> 
> How many total nodes in your cluster?  How many of them running ZooKeeper?
> 
> Did you observe the heavy increase in system time CPU usage on all nodes, 
> or just the ones running zookeeper?
> 
> I ask because if your speculation is correct and it is an issue of 
> clocksource, then perhaps ZK is where the majority of those system calls 
> are happening, and perhaps that's why you didn't see any similar heavy 
> system CPU load in ES?  
> 
> (Then again: at the lowest levels "lucene" really shouldn't care about 
> anything clock related at all. Any "time" related code would live in the 
> Solr level ... hmmm.)
> 
> 
> -Hoss
> http://www.lucidworks.com/



Re: Solr performance on EC2 linux

2017-05-01 Thread Chris Hostetter

: tldr: Recently, I tried moving an existing solrcloud configuration from 
: a local datacenter to EC2. Performance was roughly 1/10th what I’d 
: expected, until I applied a bunch of linux tweaks.

How many total nodes in your cluster?  How many of them running ZooKeeper?

Did you observe the heavy increase in system time CPU usage on all nodes, 
or just the ones running zookeeper?

I ask because if your speculation is correct and it is an issue of 
clocksource, then perhaps ZK is where the majority of those system calls 
are happening, and perhaps that's why you didn't see any similar heavy 
system CPU load in ES?  

(Then again: at the lowest levels "lucene" really shouldn't care about 
anything clock related at all. Any "time" related code would live in the 
Solr level ... hmmm.)


-Hoss
http://www.lucidworks.com/
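The clocksource speculation above can be checked directly on a running instance: the Linux kernel exposes the active clock source under sysfs, and on Xen-era EC2 guests it was commonly "xen" (clock reads go through the hypervisor) rather than "tsc" (readable from userspace via the vDSO). A small sketch; the sysfs path is Linux-specific, so anywhere else this just reports "unknown":

```python
from pathlib import Path

CLOCKSOURCE = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)

def current_clocksource():
    """Report the kernel's active clock source, or 'unknown' off-Linux."""
    try:
        return CLOCKSOURCE.read_text().strip()
    except OSError:
        return "unknown"

print(current_clocksource())
```

If this prints "xen", frequent gettimeofday/clock_gettime calls (as with timeAllowed) would plausibly show up as the high system time described in this thread.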

Re: Solr performance on EC2 linux

2017-05-01 Thread Jeff Wartes
I tried a few variations of various things before we found and tried that 
linux/EC2 tuning page, including:
  - EC2 instance type: r4, c4, and i3
  - Ubuntu version: Xenial and Trusty
  - EBS vs local storage
  - Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of 
the issues with early java8 versions and I’m not using G1)

Most of those attempts were to help reduce differences between the data center 
and the EC2 cluster. In all cases I re-indexed from scratch. I got the same 
very high system-time symptom in all cases. With the linux changes in place, we 
settled on r4/Xenial/EBS/Stock.

Again, this was a slightly modified Solr 5.4, (I added backup requests, and two 
memory allocation rate tweaks that have long since been merged into mainline - 
released in 6.2 I think. I can dig up the jira numbers if anyone’s interested) 
I’ve never used Solr 6.x in production though. 
The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is 
based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his 
ES setup, although I think he did try G1.

I definitely do want to binary-search those settings until I understand better 
what exactly did the trick. 
The problem is the long cycle time per test, but hopefully within the next 
couple of weeks.



On 5/1/17, 7:26 AM, "John Bickerstaff"  wrote:

It's also very important to consider the type of EC2 instance you are
using...

We settled on the R4.2XL...  The R series is labeled "High-Memory"

Which instance type did you end up using?

On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey  wrote:

> On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> > tldr: Recently, I tried moving an existing solrcloud configuration from
> a local datacenter to EC2. Performance was roughly 1/10th what I’d
> expected, until I applied a bunch of linux tweaks.
>
> How very strange.  I knew virtualization would have overhead, possibly
> even measurable overhead, but that's insane.  Running on bare metal is
> always better if you can do it.  I would be curious what would happen on
> your original install if you applied similar tuning to that.  Would you
> see a speedup there?
>
> > Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a
> much more recent release) alternate implementation of the same index was
> not seeing this high-system-time behavior on EC2, and was getting
> throughput consistent with our general expectations.
>
> That's even weirder.  ES 5.x will likely be using Points field types for
> numeric fields, and although those are faster than what Solr currently
> uses, I doubt it could explain that difference.  The implication here is
> that the ES systems are running with stock EC2 settings, not the tuned
> settings ... but I'd like you to confirm that.  Same Java version as
> with Solr?  IMHO, Java itself is more likely to cause issues like you
> saw than Solr.
>
> > I’m writing this for a few reasons:
> >
> > 1.   The performance difference was so crazy I really feel like this
> should really be broader knowledge.
>
> Definitely agree!  I would be very interested in learning which of the
> tunables you changed were major contributors to the improvement.  If it
> turns out that Solr's code is sub-optimal in some way, maybe we can fix 
it.
>
> > 2.   If anyone is aware of anything that changed in Lucene between
> 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
> this? If it’s the clocksource that’s the issue, there’s an implication 
that
> Solr was using tons more system calls like gettimeofday that the EC2 (xen)
> hypervisor doesn’t allow in userspace.
>
> I had not considered the performance regression in 6.4.0 and 6.4.1 that
> Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x 
version?
>
> =
>
> Specific thoughts on the tuning:
>
> The noatime option is very good to use.  I also use nodiratime on my
> systems.  Turning these off can have *massive* impacts on disk
> performance.  If these are the source of the speedup, then the machine
> doesn't have enough spare memory.
>
> I'd be wary of the "nobarrier" mount option.  If the underlying storage
> has battery-backed write caches, or is SSD without write caching, it
> wouldn't be a problem.  Here's info about the "discard" mount option, I
> don't know whether it applies to your amazon storage:
>
>discard/nodiscard
>   Controls  whether ext4 should issue discard/TRIM commands
> to the
>   underlying block device when blocks are freed.  This  is
> useful
>   for  SSD  devices  and sparse/thinly-provisioned LUNs, but
> it is
>   
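Shawn's mount-option advice above would typically land in /etc/fstab. A sketch with a hypothetical device UUID and mount point (nobarrier deliberately omitted, per his caution about storage without battery-backed or safe write caching):

```
# /etc/fstab sketch -- UUID and mount point are hypothetical placeholders
UUID=0a1b2c3d-0000-0000-0000-000000000000  /var/solr  ext4  defaults,noatime,nodiratime,discard  0 2
```

Whether "discard" is appropriate depends on the underlying EBS/SSD volume; some setups prefer periodic fstrim instead of the mount option.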

Re: Solr performance on EC2 linux

2017-05-01 Thread John Bickerstaff
It's also very important to consider the type of EC2 instance you are
using...

We settled on the R4.2XL...  The R series is labeled "High-Memory"

Which instance type did you end up using?

On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey  wrote:

> On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> > tldr: Recently, I tried moving an existing solrcloud configuration from
> a local datacenter to EC2. Performance was roughly 1/10th what I’d
> expected, until I applied a bunch of linux tweaks.
>
> How very strange.  I knew virtualization would have overhead, possibly
> even measurable overhead, but that's insane.  Running on bare metal is
> always better if you can do it.  I would be curious what would happen on
> your original install if you applied similar tuning to that.  Would you
> see a speedup there?
>
> > Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a
> much more recent release) alternate implementation of the same index was
> not seeing this high-system-time behavior on EC2, and was getting
> throughput consistent with our general expectations.
>
> That's even weirder.  ES 5.x will likely be using Points field types for
> numeric fields, and although those are faster than what Solr currently
> uses, I doubt it could explain that difference.  The implication here is
> that the ES systems are running with stock EC2 settings, not the tuned
> settings ... but I'd like you to confirm that.  Same Java version as
> with Solr?  IMHO, Java itself is more likely to cause issues like you
> saw than Solr.
>
> > I’m writing this for a few reasons:
> >
> > 1.   The performance difference was so crazy I really feel like this
> should really be broader knowledge.
>
> Definitely agree!  I would be very interested in learning which of the
> tunables you changed were major contributors to the improvement.  If it
> turns out that Solr's code is sub-optimal in some way, maybe we can fix it.
>
> > 2.   If anyone is aware of anything that changed in Lucene between
> 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
> this? If it’s the clocksource that’s the issue, there’s an implication that
> Solr was using tons more system calls like gettimeofday that the EC2 (xen)
> hypervisor doesn’t allow in userspace.
>
> I had not considered the performance regression in 6.4.0 and 6.4.1 that
> Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
>
> =
>
> Specific thoughts on the tuning:
>
> The noatime option is very good to use.  I also use nodiratime on my
> systems.  Turning these off can have *massive* impacts on disk
> performance.  If these are the source of the speedup, then the machine
> doesn't have enough spare memory.
>
> I'd be wary of the "nobarrier" mount option.  If the underlying storage
> has battery-backed write caches, or is SSD without write caching, it
> wouldn't be a problem.  Here's info about the "discard" mount option, I
> don't know whether it applies to your amazon storage:
>
>    discard/nodiscard
>           Controls whether ext4 should issue discard/TRIM commands to the
>           underlying block device when blocks are freed.  This is useful
>           for SSD devices and sparse/thinly-provisioned LUNs, but it is
>           off by default until sufficient testing has been done.
>
> The network tunables would have more of an effect in a distributed
> environment like EC2 than they would on a LAN.
>
> Thanks,
> Shawn
>
>


Re: Solr performance on EC2 linux

2017-05-01 Thread Shawn Heisey
On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> tldr: Recently, I tried moving an existing solrcloud configuration from a 
> local datacenter to EC2. Performance was roughly 1/10th what I’d expected, 
> until I applied a bunch of linux tweaks.

How very strange.  I knew virtualization would have overhead, possibly
even measurable overhead, but that's insane.  Running on bare metal is
always better if you can do it.  I would be curious what would happen on
your original install if you applied similar tuning to that.  Would you
see a speedup there?

> Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a much
> more recent release) alternate implementation of the same index was not 
> seeing this high-system-time behavior on EC2, and was getting throughput 
> consistent with our general expectations.

That's even weirder.  ES 5.x will likely be using Points field types for
numeric fields, and although those are faster than what Solr currently
uses, I doubt it could explain that difference.  The implication here is
that the ES systems are running with stock EC2 settings, not the tuned
settings ... but I'd like you to confirm that.  Same Java version as
with Solr?  IMHO, Java itself is more likely to cause issues like you
saw than Solr.

> I’m writing this for a few reasons:
>
> 1.   The performance difference was so crazy I really feel like this 
> should really be broader knowledge.

Definitely agree!  I would be very interested in learning which of the
tunables you changed were major contributors to the improvement.  If it
turns out that Solr's code is sub-optimal in some way, maybe we can fix it.

> 2.   If anyone is aware of anything that changed in Lucene between 5.4 
> and 6.x that could explain why Elasticsearch wasn’t suffering from this? If 
> it’s the clocksource that’s the issue, there’s an implication that Solr was 
> using tons more system calls like gettimeofday that the EC2 (xen) hypervisor 
> doesn’t allow in userspace.

I had not considered the performance regression in 6.4.0 and 6.4.1 that
Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?

=

Specific thoughts on the tuning:

The noatime option is very good to use.  I also use nodiratime on my
systems.  Turning these off can have *massive* impacts on disk
performance.  If these are the source of the speedup, then the machine
doesn't have enough spare memory.

I'd be wary of the "nobarrier" mount option.  If the underlying storage
has battery-backed write caches, or is SSD without write caching, it
wouldn't be a problem.  Here's info about the "discard" mount option, I
don't know whether it applies to your amazon storage:

   discard/nodiscard
          Controls whether ext4 should issue discard/TRIM commands to the
          underlying block device when blocks are freed.  This is useful
          for SSD devices and sparse/thinly-provisioned LUNs, but it is
          off by default until sufficient testing has been done.
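A quick way to check which of these options are currently in effect is to inspect the mount table for the index filesystem. This is a sketch; `/var/solr/data` is an assumed index path, so point `INDEX_DIR` at your own data directory:

```shell
# Show the mount options for the filesystem holding the Solr index.
# /var/solr/data is an assumption -- substitute your own index directory.
INDEX_DIR=${INDEX_DIR:-/var/solr/data}
opts=$(findmnt -no OPTIONS -T "$INDEX_DIR" 2>/dev/null || echo "")
echo "mount options for $INDEX_DIR: ${opts:-unknown}"
case ",$opts," in
  *,noatime,*) echo "noatime already enabled" ;;
  *) echo "to enable without a reboot: sudo mount -o remount,noatime,nodiratime $INDEX_DIR" ;;
esac
```

For a permanent change, the same options go in the fstab entry for that filesystem (e.g. `defaults,noatime,nodiratime`).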

The network tunables would have more of an effect in a distributed
environment like EC2 than they would on a LAN.

Thanks,
Shawn



Re: Solr performance on EC2 linux

2017-04-30 Thread Jeff Wartes
I’d like to think I helped a little with the metrics upgrade that got released 
in 6.4, so I was already watching that and I’m aware of the resulting 
performance issue.
This was 5.4 though, patched with https://github.com/whitepages/SOLR-4449 - an 
index we’ve been running for some time now.

Mganeshs’s comment that he doesn’t see a difference on EC2 with Solr 6.2 lends 
some additional strength to the thought that something changed between Lucene 
5.4 and 6.2 (which is used in ES 5), but of course it’s all still pretty 
anecdotal.


On 4/28/17, 11:44 AM, "Erick Erickson"  wrote:

Well, 6.4.0 had a pretty severe performance issue so if you were using
that release you might see this, 6.4.2 is the most recent 6.4 release.
But I have no clue how changing linux settings would alter that and I
sure can't square that issue with you having such different
performance between local and EC2

But thanks for telling us about this! It's totally baffling

Erick

On Fri, Apr 28, 2017 at 9:09 AM, Jeff Wartes  wrote:
>
> tldr: Recently, I tried moving an existing solrcloud configuration from a 
local datacenter to EC2. Performance was roughly 1/10th what I’d expected, 
until I applied a bunch of linux tweaks.
>
> This should’ve been a straight port: one datacenter server -> one EC2 
node. Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such 
that the entire index could be cached in memory, and the JVM settings were 
identical in both places. I applied what should’ve been a comfortable load to 
the EC2 cluster, and everything exploded. I had to back the rate down to 
something close to 10% of what I had been getting in the datacenter before 
latency improved.
> Looking around, I was interested to note that under load, user-time CPU 
usage was being shadowed by an almost equal amount of system CPU time. This was 
not IOWait, but system time. Strace showed a bunch of time being spent in futex 
and restart_syscall, but I couldn’t see where to go from there.
>
> Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a much
more recent release) alternate implementation of the same index was not seeing 
this high-system-time behavior on EC2, and was getting throughput consistent 
with our general expectations.
>
> Eventually, we came across this: 
> http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
> In direct opposition to the author’s intent, (something about taking 
expired medication) we applied these settings blindly to see what happened. The 
difference was breathtaking. The system time usage disappeared, and I could 
apply load at and even a little above my expected rates, well within my latency 
goals.
>
> There are a number of settings involved, and we haven’t isolated for sure 
which ones made the biggest difference, but my guess at the moment is that it’s 
the change of clocksource. I think this would be consistent with the observed 
system time. Note however that using the “tsc” clocksource on EC2 is generally 
discouraged, because it’s possible to get backwards clock drift.
>
> I’m writing this for a few reasons:
>
> 1.   The performance difference was so crazy I really feel like this 
should really be broader knowledge.
>
> 2.   If anyone is aware of anything that changed in Lucene between 
5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If 
it’s the clocksource that’s the issue, there’s an implication that Solr was 
using tons more system calls like gettimeofday that the EC2 (xen) hypervisor 
doesn’t allow in userspace.
>
> 3.   Has anyone run Solr with the “tsc” clocksource, and is aware of 
any concrete issues?
>
>




Re: Solr performance on EC2 linux

2017-04-29 Thread mganeshs
We use Solr 6.2 on an EC2 instance with CentOS 6.2, and we don't see any
difference in performance between EC2 and our local environment.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-on-EC2-linux-tp4332467p4332553.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance on EC2 linux

2017-04-28 Thread Erick Erickson
Well, 6.4.0 had a pretty severe performance issue so if you were using
that release you might see this, 6.4.2 is the most recent 6.4 release.
But I have no clue how changing linux settings would alter that and I
sure can't square that issue with you having such different
performance between local and EC2

But thanks for telling us about this! It's totally baffling

Erick

On Fri, Apr 28, 2017 at 9:09 AM, Jeff Wartes  wrote:
>
> tldr: Recently, I tried moving an existing solrcloud configuration from a 
> local datacenter to EC2. Performance was roughly 1/10th what I’d expected, 
> until I applied a bunch of linux tweaks.
>
> This should’ve been a straight port: one datacenter server -> one EC2 node. 
> Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such that 
> the entire index could be cached in memory, and the JVM settings were 
> identical in both places. I applied what should’ve been a comfortable load to 
> the EC2 cluster, and everything exploded. I had to back the rate down to 
> something close to 10% of what I had been getting in the datacenter before 
> latency improved.
> Looking around, I was interested to note that under load, user-time CPU usage 
> was being shadowed by an almost equal amount of system CPU time. This was not 
> IOWait, but system time. Strace showed a bunch of time being spent in futex 
> and restart_syscall, but I couldn’t see where to go from there.
>
> Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a much
> more recent release) alternate implementation of the same index was not 
> seeing this high-system-time behavior on EC2, and was getting throughput 
> consistent with our general expectations.
>
> Eventually, we came across this: 
> http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
> In direct opposition to the author’s intent, (something about taking expired 
> medication) we applied these settings blindly to see what happened. The 
> difference was breathtaking. The system time usage disappeared, and I could 
> apply load at and even a little above my expected rates, well within my 
> latency goals.
>
> There are a number of settings involved, and we haven’t isolated for sure 
> which ones made the biggest difference, but my guess at the moment is that 
> it’s the change of clocksource. I think this would be consistent with the 
> observed system time. Note however that using the “tsc” clocksource on EC2 is 
> generally discouraged, because it’s possible to get backwards clock drift.
>
> I’m writing this for a few reasons:
>
> 1.   The performance difference was so crazy I really feel like this 
> should really be broader knowledge.
>
> 2.   If anyone is aware of anything that changed in Lucene between 5.4 
> and 6.x that could explain why Elasticsearch wasn’t suffering from this? If 
> it’s the clocksource that’s the issue, there’s an implication that Solr was 
> using tons more system calls like gettimeofday that the EC2 (xen) hypervisor 
> doesn’t allow in userspace.
>
> 3.   Has anyone run Solr with the “tsc” clocksource, and is aware of any 
> concrete issues?
>
>


Solr performance on EC2 linux

2017-04-28 Thread Jeff Wartes

tldr: Recently, I tried moving an existing solrcloud configuration from a local 
datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I 
applied a bunch of linux tweaks.

This should’ve been a straight port: one datacenter server -> one EC2 node. 
Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such that 
the entire index could be cached in memory, and the JVM settings were identical 
in both places. I applied what should’ve been a comfortable load to the EC2 
cluster, and everything exploded. I had to back the rate down to something 
close to 10% of what I had been getting in the datacenter before latency 
improved.
Looking around, I was interested to note that under load, user-time CPU usage 
was being shadowed by an almost equal amount of system CPU time. This was not 
IOWait, but system time. Strace showed a bunch of time being spent in futex and 
restart_syscall, but I couldn’t see where to go from there.

Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a much more
recent release) alternate implementation of the same index was not seeing this 
high-system-time behavior on EC2, and was getting throughput consistent with 
our general expectations.

Eventually, we came across this: 
http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
In direct opposition to the author’s intent, (something about taking expired 
medication) we applied these settings blindly to see what happened. The 
difference was breathtaking. The system time usage disappeared, and I could 
apply load at and even a little above my expected rates, well within my latency 
goals.

There are a number of settings involved, and we haven’t isolated for sure which 
ones made the biggest difference, but my guess at the moment is that it’s the 
change of clocksource. I think this would be consistent with the observed 
system time. Note however that using the “tsc” clocksource on EC2 is generally 
discouraged, because it’s possible to get backwards clock drift.
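The clocksource in effect can be inspected through sysfs. A Linux-only sketch (these are the standard sysfs paths; on EC2 Xen guests the current source is often "xen", which makes each gettimeofday a hypercall rather than a cheap vDSO read):

```shell
# Inspect the kernel clocksource in use and the alternatives available.
cs=/sys/devices/system/clocksource/clocksource0
current=$(cat "$cs/current_clocksource" 2>/dev/null || echo unknown)
available=$(cat "$cs/available_clocksource" 2>/dev/null || echo unknown)
echo "current clocksource:    $current"
echo "available clocksources: $available"
# Switching at runtime (root; mind the backwards-drift caveat with tsc):
#   echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```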

I’m writing this for a few reasons:

1.   The performance difference was so crazy I really feel like this should 
really be broader knowledge.

2.   If anyone is aware of anything that changed in Lucene between 5.4 and 
6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s 
the clocksource that’s the issue, there’s an implication that Solr was using 
tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t 
allow in userspace.

3.   Has anyone run Solr with the “tsc” clocksource, and is aware of any 
concrete issues?




RE: Solr performance issue on indexing

2017-04-04 Thread Allison, Timothy B.
>  Also we will try to decouple tika to solr.
+1


-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com] 
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing

Hi, thanks for the feedback.

Yes, it is about OOM; indeed, the Solr instance even becomes unavailable. As I
was saying, I can't find any more relevant information in the logs.

We are able to increase the JVM memory, so that is the first thing we'll do.

As far as I know, all documents are bounded to that amount (14K); just the
processing could change. We are making some tests on indexing, and it seems
to work without concurrent threads. We will also try to decouple Tika from
Solr.

By the way, would making it available with SolrCloud improve performance? Or
would there be no perceptible improvement?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread Erick Erickson
If, by chance, the docs you're sending get routed to different Solr
nodes then all the processing is in parallel. I don't know if there's
a good way to insure that the docs get sent to different replicas on
different Solr instances. You could try addressing specific Solr
replicas, something like "blah
blah/solr/collection1_shard1_replica1/export" but I'm not totally sure
that'll do what you want either.

 But that still doesn't decouple Tika from the Solr instances running
those replicas. So if Tika has a problem it has the potential to bring
the Solr node down.

Best,
Erick

On Fri, Mar 31, 2017 at 1:31 PM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi, thanks for the feedback.
>
> Yes, it is about OOM; indeed, the Solr instance even becomes unavailable. As I
> was saying, I can't find any more relevant information in the logs.
>
> We are able to increase the JVM memory, so that is the first thing we'll do.
>
> As far as I know, all documents are bounded to that amount (14K); just the
> processing could change. We are making some tests on indexing, and it seems
> to work without concurrent threads. We will also try to decouple Tika from
> Solr.
>
> By the way, would making it available with SolrCloud improve performance? Or
> would there be no perceptible improvement?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi, thanks for the feedback.

Yes, it is about OOM; indeed, the Solr instance even becomes unavailable. As I
was saying, I can't find any more relevant information in the logs.

We are able to increase the JVM memory, so that is the first thing we'll do.

As far as I know, all documents are bounded to that amount (14K); just the
processing could change. We are making some tests on indexing, and it seems
to work without concurrent threads. We will also try to decouple Tika from
Solr.

By the way, would making it available with SolrCloud improve performance? Or
would there be no perceptible improvement?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread Erick Erickson
First, running multiple threads with PDF files to a Solr running 4G of
JVM is...ambitious. You say it crashes; how? OOMs?

Second while the extracting request handler is a fine way to get up
and running, any problems with Tika will affect Solr. Tika does a
great job of extraction, but there are so many variants of so many
file formats that this scenario isn't recommended for production.
Consider extracting the PDF on a client and sending the docs to Solr.
Tika can run as a server also so you aren't coupling Solr and Tika.
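A minimal sketch of that decoupling, assuming default ports and a hypothetical core name and field (adjust to your setup; note the extracted text must be JSON-escaped before posting, which is left as a placeholder here):

```shell
# Sketch: run Tika as a separate server so a bad PDF can't take Solr down.
# URLs, core name, and field names below are assumptions.
TIKA_URL=${TIKA_URL:-http://localhost:9998}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr/mycore}

# 1) On a separate process/host, start the server once (flag may vary by
#    Tika version):  java -jar tika-server-<version>.jar --port 9998
# 2) Extract plain text from a PDF without touching Solr:
text=$(curl -sf -T report.pdf -H 'Accept: text/plain' "$TIKA_URL/tika" || echo "")
# 3) Send only the extracted text to Solr as a normal document
#    (escape $text into valid JSON before substituting it):
curl -sf "$SOLR_URL/update?commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id": "report-1", "content_txt": "(escaped text goes here)"}]' \
  || echo "indexing skipped (no Solr running in this sketch)"
```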

For a sample SolrJ program, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files; the main handler we use to index
> is /extract, where we perform simple processing (extract relevant fields
> and store them).
>
> The PDF files are about 10M~100M in size, and we need the extracted text to
> be available. Everything works correctly in the test stages, but when we try
> to index all 14K files (around 120Gb) from a client application that only
> sends HTTP curls through 3-4 concurrent threads to the /extract handler, it
> crashes. I can't find any relevant information in the Solr logs (we checked
> in server/logs & in core_dir/tlog).
>
> My question is about performance. I think it is a small amount of data we
> are processing; the deployment scenario is a Docker container with 4GB of
> JVM memory and ~50GB of physical memory (reported through the dashboard),
> and we are using a single instance.
>
> I don't think it is normal behaviour for the handler to crash. So, what are
> some general tips for improving performance in this scenario?
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi there.

We are currently indexing some PDF files; the main handler we use to index is
/extract, where we perform simple processing (extract relevant fields and
store them).

The PDF files are about 10M~100M in size, and we need the extracted text to be
available. Everything works correctly in the test stages, but when we try to
index all 14K files (around 120Gb) from a client application that only sends
HTTP curls through 3-4 concurrent threads to the /extract handler, it crashes.
I can't find any relevant information in the Solr logs (we checked in
server/logs & in core_dir/tlog).

My question is about performance. I think it is a small amount of data we are
processing; the deployment scenario is a Docker container with 4GB of JVM
memory and ~50GB of physical memory (reported through the dashboard), and we
are using a single instance.

I don't think it is normal behaviour for the handler to crash. So, what are
some general tips for improving performance in this scenario?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr | performance warning

2016-11-21 Thread Prateek Jain J

Thanks, Erick


Regards,
Prateek Jain

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 21 November 2016 04:32 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: solr | performance warning

_when_ are you seeing this? I see this on startup upon occasion, and I _think_ 
there's a JIRA about startup opening more than one searcher on startup.
If it _is_ on startup, you can simply ignore it.

If it's after the system is up and running, then you're probably committing too 
frequently. "Too frequently" means your autowarm interval is longer than your 
commit interval. It's usually best to just let autocommit handle this BTW.

This is totally on a per-core basis. You won't get this warning if you commit 
to coreA and coreB simultaneously, only if you commit to an individual core too 
frequently.

Best,
Erick

On Mon, Nov 21, 2016 at 7:47 AM, Prateek Jain J <prateek.j.j...@ericsson.com> 
wrote:
>
> Hi All,
>
> I am observing following error in logs, any clues about this:
>
> 2016-11-06T23:15:53.066069+00:00@solr@@ 
> org.apache.solr.core.SolrCore:1650 - [my_custom_core] PERFORMANCE 
> WARNING: Overlapping onDeckSearchers=2
>
> A quick web search suggests that it could be a case of too-frequent commits.
> I have multiple cores running in Solr, so one or the other would be
> committing at any given time. Any clues/pointers/suggestions?
>
> I am using solr 4.8.1.
>
> Regards,
> Prateek Jain
>


Re: solr | performance warning

2016-11-21 Thread Erick Erickson
_when_ are you seeing this? I see this on startup upon occasion, and I _think_
there's a JIRA about startup opening more than one searcher on startup.
If it _is_ on startup, you can simply ignore it.

If it's after the system is up and running, then you're probably committing too
frequently. "Too frequently" means your autowarm interval is longer than
your commit interval. It's usually best to just let autocommit handle this BTW.

This is totally on a per-core basis. You won't get this warning if you commit
to coreA and coreB simultaneously, only if you commit to an individual core
too frequently.
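For reference, a hedged solrconfig.xml sketch of commit settings that avoid overlapping warming searchers (the intervals are illustrative examples, not recommendations; tune them to your load and warmup time):

```xml
<!-- Sketch of solrconfig.xml commit settings; values are examples only. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to disk, but do not open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: controls visibility; keep longer than warming takes -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
</updateHandler>
<query>
  <!-- Raising this is only a band-aid; the real fix is committing less often -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```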

Best,
Erick

On Mon, Nov 21, 2016 at 7:47 AM, Prateek Jain J
 wrote:
>
> Hi All,
>
> I am observing following error in logs, any clues about this:
>
> 2016-11-06T23:15:53.066069+00:00@solr@@ org.apache.solr.core.SolrCore:1650 - 
> [my_custom_core] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> A quick web search suggests that it could be a case of too-frequent commits.
> I have multiple cores running in Solr, so one or the other would be
> committing at any given time. Any clues/pointers/suggestions?
>
> I am using solr 4.8.1.
>
> Regards,
> Prateek Jain
>


solr | performance warning

2016-11-21 Thread Prateek Jain J

Hi All,

I am observing following error in logs, any clues about this:

2016-11-06T23:15:53.066069+00:00@solr@@ org.apache.solr.core.SolrCore:1650 - 
[my_custom_core] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

A quick web search suggests that it could be a case of too-frequent commits. I
have multiple cores running in Solr, so one or the other would be committing at
any given time. Any clues/pointers/suggestions?

I am using solr 4.8.1.

Regards,
Prateek Jain



Re: Disable hyper-threading for better Solr performance?

2016-03-09 Thread Ilan Schwarts
Solrcloud.. Faster discs.. Multiple cores on different physical discs would
help
On Mar 9, 2016 2:22 PM, "Vincenzo D'Amore" <v.dam...@gmail.com> wrote:

> Upgrading to Solr 5 should improve your indexing performance.
>
>
> http://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/
>
> On Wed, Mar 9, 2016 at 1:13 PM, Avner Levy <av...@checkpoint.com> wrote:
>
> > Currently I'm using Solr 4.8.1 but I can move to another version if it
> > performs significantly faster.
> > My target is to reach the max indexing throughput possible on the
> machine.
> > Since it seems the indexing process is CPU-bound, I was wondering whether
> > 32 logical cores with twice as many indexing threads would perform better.
> > Thanks,
> >  Avner
> >
> > -Original Message-
> > From: Ilan Schwarts [mailto:ila...@gmail.com]
> > Sent: Wednesday, March 09, 2016 9:09 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Disable hyper-threading for better Solr performance?
> >
> > What is the solr version and shard config? Standalone? Multiple cores?
> > Spread over RAID ?
> > On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote:
> >
> > > I have a machine with 16 real cores (32 with HT enabled).
> > > I'm running on it a Solr server and trying to reach maximum
> > > performance for indexing and queries (indexing 20k documents/sec by a
> > > number of threads).
> > > I've read on multiple places that in some scenarios / products
> > > disabling the hyper-threading may result in better performance results.
> > > I'm looking for inputs / insights about HT on Solr setups.
> > > Thanks in advance,
> > >   Avner
> > >
> >
> >
> > Email secured by Check Point
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Disable hyper-threading for better Solr performance?

2016-03-09 Thread Vincenzo D'Amore
Upgrading to Solr 5 should improve your indexing performance.

http://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

On Wed, Mar 9, 2016 at 1:13 PM, Avner Levy <av...@checkpoint.com> wrote:

> Currently I'm using Solr 4.8.1 but I can move to another version if it
> performs significantly faster.
> My target is to reach the max indexing throughput possible on the machine.
> Since it seems the indexing process is CPU-bound, I was wondering whether
> 32 logical cores with twice as many indexing threads would perform better.
> Thanks,
>  Avner
>
> -Original Message-
> From: Ilan Schwarts [mailto:ila...@gmail.com]
> Sent: Wednesday, March 09, 2016 9:09 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Disable hyper-threading for better Solr performance?
>
> What is the solr version and shard config? Standalone? Multiple cores?
> Spread over RAID ?
> On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote:
>
> > I have a machine with 16 real cores (32 with HT enabled).
> > I'm running on it a Solr server and trying to reach maximum
> > performance for indexing and queries (indexing 20k documents/sec by a
> > number of threads).
> > I've read on multiple places that in some scenarios / products
> > disabling the hyper-threading may result in better performance results.
> > I'm looking for inputs / insights about HT on Solr setups.
> > Thanks in advance,
> >   Avner
> >
>
>
> Email secured by Check Point
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


RE: Disable hyper-threading for better Solr performance?

2016-03-09 Thread Avner Levy
Currently I'm using Solr 4.8.1 but I can move to another version if it performs 
significantly faster.
My target is to reach the max indexing throughput possible on the machine.
Since it seems the indexing process is CPU-bound, I was wondering whether 32
logical cores with twice as many indexing threads would perform better.
Thanks,
 Avner

-Original Message-
From: Ilan Schwarts [mailto:ila...@gmail.com] 
Sent: Wednesday, March 09, 2016 9:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Disable hyper-threading for better Solr performance?

What is the solr version and shard config? Standalone? Multiple cores?
Spread over RAID ?
On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote:

> I have a machine with 16 real cores (32 with HT enabled).
> I'm running on it a Solr server and trying to reach maximum 
> performance for indexing and queries (indexing 20k documents/sec by a 
> number of threads).
> I've read on multiple places that in some scenarios / products 
> disabling the hyper-threading may result in better performance results.
> I'm looking for inputs / insights about HT on Solr setups.
> Thanks in advance,
>   Avner
>


Email secured by Check Point


RE: Disable hyper-threading for better Solr performance?

2016-03-09 Thread Markus Jelsma
Hi - I can't remember having seen any threads on this topic for the past seven
years. Can you perform a controlled test with a lot of concurrent users? I
would suspect that nowadays HT would boost highly concurrent environments such
as search engines.
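Such a controlled test can toggle HT without a BIOS reboot by offlining sibling threads. A Linux-only sketch (listing is safe for any user; actually offlining requires root):

```shell
# List hyper-thread siblings so HT can be approximated "off" at runtime
# for an A/B benchmark, instead of rebooting into the BIOS.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "logical CPUs online: $ncpu"
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  [ -r "$cpu/topology/thread_siblings_list" ] || continue
  # e.g. "0,16" means cpu0 and cpu16 share one physical core;
  # offlining the second sibling approximates HT-off:
  #   echo 0 | sudo tee "$cpu/online"
  echo "${cpu##*/}: siblings $(cat "$cpu/topology/thread_siblings_list")"
done
```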

Markus

 
 
-Original message-
> From:Avner Levy <av...@checkpoint.com>
> Sent: Wednesday 9th March 2016 8:00
> To: solr-user@lucene.apache.org
> Subject: Disable hyper-threading for better Solr performance?
> 
> I have a machine with 16 real cores (32 with HT enabled).
> I'm running on it a Solr server and trying to reach maximum performance for 
> indexing and queries (indexing 20k documents/sec by a number of threads).
> I've read on multiple places that in some scenarios / products disabling the 
> hyper-threading may result in better performance results.
> I'm looking for inputs / insights about HT on Solr setups.
> Thanks in advance,
>   Avner
> 


Re: Disable hyper-threading for better Solr performance?

2016-03-08 Thread Ilan Schwarts
What is the solr version and shard config? Standalone? Multiple cores?
Spread over RAID ?
On Mar 9, 2016 9:00 AM, "Avner Levy"  wrote:

> I have a machine with 16 real cores (32 with HT enabled).
> I'm running on it a Solr server and trying to reach maximum performance
> for indexing and queries (indexing 20k documents/sec by a number of
> threads).
> I've read on multiple places that in some scenarios / products disabling
> the hyper-threading may result in better performance results.
> I'm looking for inputs / insights about HT on Solr setups.
> Thanks in advance,
>   Avner
>


Disable hyper-threading for better Solr performance?

2016-03-08 Thread Avner Levy
I have a machine with 16 real cores (32 with HT enabled).
I'm running on it a Solr server and trying to reach maximum performance for 
indexing and queries (indexing 20k documents/sec by a number of threads).
I've read on multiple places that in some scenarios / products disabling the 
hyper-threading may result in better performance results.
I'm looking for inputs / insights about HT on Solr setups.
Thanks in advance,
  Avner


Re: solr performance issue

2016-02-09 Thread Zheng Lin Edwin Yeo
1 million documents isn't considered big for Solr. How much RAM does your
machine have?

Regards,
Edwin

On 8 February 2016 at 23:45, Susheel Kumar <susheel2...@gmail.com> wrote:

> 1 million document shouldn't have any issues at all.  Something else is
> wrong with your hw/system configuration.
>
> Thanks,
> Susheel
>
> On Mon, Feb 8, 2016 at 6:45 AM, sara hajili <hajili.s...@gmail.com> wrote:
>
> > On Mon, Feb 8, 2016 at 3:04 AM, sara hajili <hajili.s...@gmail.com>
> wrote:
> >
> > > sorry i made a mistake i have a bout 1000 K doc.
> > > i mean about 100 doc.
> > >
> > > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
> > > emir.arnauto...@sematext.com> wrote:
> > >
> > >> Hi Sara,
> > >> Not sure if I am reading this right, but I read it as you have 1000
> doc
> > >> index and issues? Can you tell us bit more about your setup: number of
> > >> servers, hw, index size, number of shards, queries that you run, do
> you
> > >> index at the same time...
> > >>
> > >> It seems to me that you are running Solr on server with limited RAM
> and
> > >> probably small heap. Swapping for sure will slow things down and GC is
> > most
> > >> likely reason for high CPU.
> > >>
> > >> You can use http://sematext.com/spm to collect Solr and host metrics
> > and
> > >> see where the issue is.
> > >>
> > >> Thanks,
> > >> Emir
> > >>
> > >> --
> > >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > >> Solr & Elasticsearch Support * http://sematext.com/
> > >>
> > >>
> > >>
> > >> On 08.02.2016 10:27, sara hajili wrote:
> > >>
> > >>> hi all.
> > >>> i have a problem with my solr performance and usage hardware like a
> > >>> ram,cup...
> > >>> i have a lot of document and so indexed file about 1000 doc in solr
> > that
> > >>> every doc has about 8 field in average.
> > >>> and each field has about 60 char.
> > >>> i set my field as a storedfield = "false" except of  1 field. // i
> read
> > >>> that this help performance.
> > >>> i used copy field and dynamic field if it was necessary . // i read
> > that
> > >>> this help performance.
> > >>> and now my question is that when i run a lot of query on solr i faced
> > >>> with
> > >>> a problem solr use more cpu and ram and after that filled ,it use a
> lot
> > >>>   swapped storage and then use hard,but doesn't create a system file!
> > >>> solr
> > >>> fill hard until i forced to restart server to release hard disk.
> > >>> and now my question is why solr treat in this way? and how i can
> avoid
> > >>> solr
> > >>> to use huge cpu space?
> > >>> any config need?!
> > >>>
> > >>>
> > >>
> > >
> >
>


Re: solr performance issue

2016-02-08 Thread Susheel Kumar
1 million documents shouldn't cause any issues at all.  Something else is
wrong with your hw/system configuration.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 6:45 AM, sara hajili <hajili.s...@gmail.com> wrote:

> On Mon, Feb 8, 2016 at 3:04 AM, sara hajili <hajili.s...@gmail.com> wrote:
>
> > sorry i made a mistake i have a bout 1000 K doc.
> > i mean about 100 doc.
> >
> > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Sara,
> >> Not sure if I am reading this right, but I read it as you have 1000 doc
> >> index and issues? Can you tell us bit more about your setup: number of
> >> servers, hw, index size, number of shards, queries that you run, do you
> >> index at the same time...
> >>
> >> It seems to me that you are running Solr on server with limited RAM and
> >> probably small heap. Swapping for sure will slow things down and GC is
> most
> >> likely reason for high CPU.
> >>
> >> You can use http://sematext.com/spm to collect Solr and host metrics
> and
> >> see where the issue is.
> >>
> >> Thanks,
> >> Emir
> >>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
> >>
> >> On 08.02.2016 10:27, sara hajili wrote:
> >>
> >>> hi all.
> >>> i have a problem with my solr performance and usage hardware like a
> >>> ram,cup...
> >>> i have a lot of document and so indexed file about 1000 doc in solr
> that
> >>> every doc has about 8 field in average.
> >>> and each field has about 60 char.
> >>> i set my field as a storedfield = "false" except of  1 field. // i read
> >>> that this help performance.
> >>> i used copy field and dynamic field if it was necessary . // i read
> that
> >>> this help performance.
> >>> and now my question is that when i run a lot of query on solr i faced
> >>> with
> >>> a problem solr use more cpu and ram and after that filled ,it use a lot
> >>>   swapped storage and then use hard,but doesn't create a system file!
> >>> solr
> >>> fill hard until i forced to restart server to release hard disk.
> >>> and now my question is why solr treat in this way? and how i can avoid
> >>> solr
> >>> to use huge cpu space?
> >>> any config need?!
> >>>
> >>>
> >>
> >
>


solr performance issue

2016-02-08 Thread sara hajili
hi all.
I have a problem with my Solr performance and hardware usage (RAM, CPU, ...).
I have a lot of documents indexed: about 1000 docs in Solr, where
every doc has about 8 fields on average,
and each field has about 60 chars.
I set my fields to stored="false" except for 1 field. // I read
that this helps performance.
I used copy fields and dynamic fields only where necessary. // I read that
this helps performance.
My question is: when I run a lot of queries against Solr, it uses
more and more CPU and RAM, and once RAM is full it uses a lot of
swap space and then the hard disk, but doesn't create a system file! Solr
fills the disk until I am forced to restart the server to release disk space.
Why does Solr behave this way, and how can I keep it from using so much
CPU and memory?
Is any config needed?!


Re: solr performance issue

2016-02-08 Thread Emir Arnautovic

Hi Sara,
It is still considered a small index. Can you give us a bit more detail
about your setup?


Thanks,
Emir

On 08.02.2016 12:04, sara hajili wrote:

sorry i made a mistake i have a bout 1000 K doc.
i mean about 100 doc.

On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Sara,
Not sure if I am reading this right, but I read it as you have 1000 doc
index and issues? Can you tell us bit more about your setup: number of
servers, hw, index size, number of shards, queries that you run, do you
index at the same time...

It seems to me that you are running Solr on server with limited RAM and
probably small heap. Swapping for sure will slow things down and GC is most
likely reason for high CPU.

You can use http://sematext.com/spm to collect Solr and host metrics and
see where the issue is.

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 08.02.2016 10:27, sara hajili wrote:


hi all.
i have a problem with my solr performance and usage hardware like a
ram,cup...
i have a lot of document and so indexed file about 1000 doc in solr that
every doc has about 8 field in average.
and each field has about 60 char.
i set my field as a storedfield = "false" except of  1 field. // i read
that this help performance.
i used copy field and dynamic field if it was necessary . // i read that
this help performance.
and now my question is that when i run a lot of query on solr i faced with
a problem solr use more cpu and ram and after that filled ,it use a lot
   swapped storage and then use hard,but doesn't create a system file! solr
fill hard until i forced to restart server to release hard disk.
and now my question is why solr treat in this way? and how i can avoid
solr
to use huge cpu space?
any config need?!




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr performance issue

2016-02-08 Thread Emir Arnautovic

Hi Sara,
Not sure if I am reading this right, but I read it as you have a 1000-doc
index and issues? Can you tell us a bit more about your setup: number of
servers, hw, index size, number of shards, queries that you run, do you
index at the same time...


It seems to me that you are running Solr on server with limited RAM and 
probably small heap. Swapping for sure will slow things down and GC is 
most likely reason for high CPU.


You can use http://sematext.com/spm to collect Solr and host metrics and 
see where the issue is.


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 08.02.2016 10:27, sara hajili wrote:

hi all.
i have a problem with my solr performance and usage hardware like a
ram,cup...
i have a lot of document and so indexed file about 1000 doc in solr that
every doc has about 8 field in average.
and each field has about 60 char.
i set my field as a storedfield = "false" except of  1 field. // i read
that this help performance.
i used copy field and dynamic field if it was necessary . // i read that
this help performance.
and now my question is that when i run a lot of query on solr i faced with
a problem solr use more cpu and ram and after that filled ,it use a lot
  swapped storage and then use hard,but doesn't create a system file! solr
fill hard until i forced to restart server to release hard disk.
and now my question is why solr treat in this way? and how i can avoid solr
to use huge cpu space?
any config need?!
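Emir's swapping diagnosis is easy to check on Linux from /proc/meminfo: nonzero and growing swap usage while Solr is under load usually means the heap plus the OS page cache don't fit in RAM. A minimal read-only sketch:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of ints (kB)."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap in use, in kB; 0 if the fields are missing."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        print("swap used (kB):", swap_used_kb(info))
    except OSError:
        print("no /proc/meminfo on this platform")
```

Sample this while replaying your query load; if swap usage climbs, shrink the heap or add RAM before tuning anything else.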





Re: solr performance issue

2016-02-08 Thread sara hajili
Sorry, I made a mistake: I have about 1000 K docs.
I mean about 100 docs.

On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Sara,
> Not sure if I am reading this right, but I read it as you have 1000 doc
> index and issues? Can you tell us bit more about your setup: number of
> servers, hw, index size, number of shards, queries that you run, do you
> index at the same time...
>
> It seems to me that you are running Solr on server with limited RAM and
> probably small heap. Swapping for sure will slow things down and GC is most
> likely reason for high CPU.
>
> You can use http://sematext.com/spm to collect Solr and host metrics and
> see where the issue is.
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 08.02.2016 10:27, sara hajili wrote:
>
>> hi all.
>> i have a problem with my solr performance and usage hardware like a
>> ram,cup...
>> i have a lot of document and so indexed file about 1000 doc in solr that
>> every doc has about 8 field in average.
>> and each field has about 60 char.
>> i set my field as a storedfield = "false" except of  1 field. // i read
>> that this help performance.
>> i used copy field and dynamic field if it was necessary . // i read that
>> this help performance.
>> and now my question is that when i run a lot of query on solr i faced with
>> a problem solr use more cpu and ram and after that filled ,it use a lot
>>   swapped storage and then use hard,but doesn't create a system file! solr
>> fill hard until i forced to restart server to release hard disk.
>> and now my question is why solr treat in this way? and how i can avoid
>> solr
>> to use huge cpu space?
>> any config need?!
>>
>>
>


Re: solr performance issue

2016-02-08 Thread sara hajili
On Mon, Feb 8, 2016 at 3:04 AM, sara hajili <hajili.s...@gmail.com> wrote:

> sorry i made a mistake i have a bout 1000 K doc.
> i mean about 100 doc.
>
> On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Sara,
>> Not sure if I am reading this right, but I read it as you have 1000 doc
>> index and issues? Can you tell us bit more about your setup: number of
>> servers, hw, index size, number of shards, queries that you run, do you
>> index at the same time...
>>
>> It seems to me that you are running Solr on server with limited RAM and
>> probably small heap. Swapping for sure will slow things down and GC is most
>> likely reason for high CPU.
>>
>> You can use http://sematext.com/spm to collect Solr and host metrics and
>> see where the issue is.
>>
>> Thanks,
>> Emir
>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>>
>> On 08.02.2016 10:27, sara hajili wrote:
>>
>>> hi all.
>>> i have a problem with my solr performance and usage hardware like a
>>> ram,cup...
>>> i have a lot of document and so indexed file about 1000 doc in solr that
>>> every doc has about 8 field in average.
>>> and each field has about 60 char.
>>> i set my field as a storedfield = "false" except of  1 field. // i read
>>> that this help performance.
>>> i used copy field and dynamic field if it was necessary . // i read that
>>> this help performance.
>>> and now my question is that when i run a lot of query on solr i faced
>>> with
>>> a problem solr use more cpu and ram and after that filled ,it use a lot
>>>   swapped storage and then use hard,but doesn't create a system file!
>>> solr
>>> fill hard until i forced to restart server to release hard disk.
>>> and now my question is why solr treat in this way? and how i can avoid
>>> solr
>>> to use huge cpu space?
>>> any config need?!
>>>
>>>
>>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Toke Eskildsen
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

 Now I've tried to increase the carrot.fragSize to 75 and
 carrot.summarySnippets to 2, and set the carrot.produceSummary to
 true. With this setting, I'm mostly able to get the cluster results
 back within 2 to 3 seconds when I set rows=200. I'm still trying out
 to see if the cluster labels are ok, but in theory do you think this
 is a suitable setting to attempt to improve the clustering results and
 at the same time improve the performance?

I don't know - the quality/performance point as well as which knobs to
tweak is extremely dependent on your corpus and your hardware. A person
with better understanding of carrot might be able to do better sanity
checking, but I am not at all at that level.

Related, it seems to me that the question of how to tweak the clustering
has little to do with Solr and a lot to do with carrot (assuming here
that carrot is the bottleneck). You might have more success asking in a
carrot forum?


- Toke Eskildsen, State and University Library, Denmark





Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for the link.

I'm using Solr 5.2.1, but I think the bundled Carrot2 will be a slightly older
version, as I'm using the latest carrot2-workbench-3.10.3, which was only
released recently. I've changed all the settings like fragSize and
desiredClusterCountBase to be the same on both sides, and I'm now able to
get very similar cluster results.

Now I've tried to increase the carrot.fragSize to 75 and
carrot.summarySnippets to 2, and set the carrot.produceSummary to true.
With this setting, I'm mostly able to get the cluster results back within 2
to 3 seconds when I set rows=200. I'm still trying out to see if the
cluster labels are ok, but in theory do you think this is a suitable
setting to attempt to improve the clustering results and at the same time
improve the performance?

Regards,
Edwin



On 26 August 2015 at 13:58, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
  I'm currently trying out on the Carrot2 Workbench and get it to call Solr
  to see how they did the clustering. Although it still takes some time to
 do
  the clustering, but the results of the cluster is much better than mine.
 I
  think its probably due to the different settings like the fragSize and
  desiredClusterCountBase?

 Either that or the carrot bundled with Solr is an older version.

  By the way, the link on the clustering example
  https://cwiki.apache.org/confluence/display/solr/Result is not working
 as
  it says 'Page Not Found'.

 That is because it is too long for a single line. Try copy-pasting it:

 https://cwiki.apache.org/confluence/display/solr/Result
 +Clustering#ResultClustering-Configuration

 - Toke Eskildsen, State and University Library, Denmark
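The parameter combination Edwin describes above can be sketched as a request builder. The `carrot.*` parameter names follow the Solr Result Clustering reference guide; the core name ("mycore") and the `/clustering` handler path are assumptions taken from the Solr example configs, so match them to your own solrconfig.xml.

```python
from urllib.parse import urlencode


def clustering_url(base="http://localhost:8983/solr/mycore/clustering",
                   q="*:*", rows=200):
    """Build a Solr clustering request URL with the tuning knobs discussed."""
    params = {
        "q": q,
        "rows": rows,                      # clustering only sees these top rows
        "carrot.produceSummary": "true",   # cluster on highlighter fragments
        "carrot.fragSize": 75,             # fragment size in characters
        "carrot.summarySnippets": 2,       # fragments taken per document
        "wt": "json",
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(clustering_url(q="solr", rows=100))
```

Clustering fragments (produceSummary plus a modest fragSize) instead of whole stored fields is usually the biggest lever for response time, since it bounds the text each document contributes.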





Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Thanks for your recommendation Toke.

Will try to ask in the carrot forum.

Regards,
Edwin

On 26 August 2015 at 18:45, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

  Now I've tried to increase the carrot.fragSize to 75 and
  carrot.summarySnippets to 2, and set the carrot.produceSummary to
  true. With this setting, I'm mostly able to get the cluster results
  back within 2 to 3 seconds when I set rows=200. I'm still trying out
  to see if the cluster labels are ok, but in theory do you think this
  is a suitable setting to attempt to improve the clustering results and
  at the same time improve the performance?

 I don't know - the quality/performance point as well as which knobs to
 tweak is extremely dependent on your corpus and your hardware. A person
 with better understanding of carrot might be able to do better sanity
 checking, but I am not at all at that level.

 Related, it seems to me that the question of how to tweak the clustering
 has little to do with Solr and a lot to do with carrot (assuming here
 that carrot is the bottleneck). You might have more success asking in a
 carrot forum?


 - Toke Eskildsen, State and University Library, Denmark






Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
 I'm currently trying out on the Carrot2 Workbench and get it to call Solr
 to see how they did the clustering. Although it still takes some time to do
 the clustering, but the results of the cluster is much better than mine. I
 think its probably due to the different settings like the fragSize and
 desiredClusterCountBase?

Either that or the carrot bundled with Solr is an older version.

 By the way, the link on the clustering example
 https://cwiki.apache.org/confluence/display/solr/Result is not working as
 it says 'Page Not Found'.

That is because it is too long for a single line. Try copy-pasting it:

https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration

- Toke Eskildsen, State and University Library, Denmark




Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
 Would like to confirm, when I set rows=100, does it mean that it only build
 the cluster based on the first 100 records that are returned by the search,
 and if I have 1000 records that matches the search, all the remaining 900
 records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from
reading the comments in the third example at
https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration

 As if that is the case, the result of the cluster may not be so accurate as
 there is a possibility that the first 100 records might have a large amount
 of similarities in the records, while the subsequent 900 records have
 differences that could have impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be
as representative of your search result as possible. Assigning more
weight to the higher scoring documents (in this case: All the weight, as
those beyond the top-100 are not even considered) does this.

If that does not fit your expectations, maybe you need something else?
Plain faceting perhaps? Or maybe enrichment of the documents with some
sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark




Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for your reply.

I'm currently trying out the Carrot2 Workbench and getting it to call Solr
to see how it does the clustering. Although it still takes some time to do
the clustering, the results of the cluster are much better than mine. I
think it's probably due to the different settings like fragSize and
desiredClusterCountBase?

By the way, the link on the clustering example
https://cwiki.apache.org/confluence/display/solr/Result is not working as
it says 'Page Not Found'.

Regards,
Edwin


On 25 August 2015 at 15:29, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
  Would like to confirm, when I set rows=100, does it mean that it only
 build
  the cluster based on the first 100 records that are returned by the
 search,
  and if I have 1000 records that matches the search, all the remaining 900
  records will not be considered for clustering?

 That is correct. It is not stated very clearly, but it follows from
 reading the comments in the third example at
 https://cwiki.apache.org/confluence/display/solr/Result
 +Clustering#ResultClustering-Configuration

  As if that is the case, the result of the cluster may not be so accurate
 as
  there is a possibility that the first 100 records might have a large
 amount
  of similarities in the records, while the subsequent 900 records have
  differences that could have impact on the cluster result.

 Such is the nature of on-the-fly clustering. The clustering aims to be
 as representative of your search result as possible. Assigning more
 weight to the higher scoring documents (in this case: All the weight, as
 those beyond the top-100 are not even considered) does this.

 If that does not fit your expectations, maybe you need something else?
 Plain faceting perhaps? Or maybe enrichment of the documents with some
 sort of entity extraction?

 - Toke Eskildsen, State and University Library, Denmark





Re: Solr performance is slow with just 1GB of data indexed

2015-08-24 Thread Upayavira
I honestly suspect your performance issue is down to the number of terms
you are passing into the clustering algorithm, not to memory usage as
such. If you have *huge* documents and cluster across them, performance
will be slower, by definition.

Clustering is usually done offline, for example on a large dataset
taking a few hours or even days. Carrot2 manages to reduce this time to
a reasonable online task by only clustering a few search results. If
you increase the number of documents (from say 100 to 1000) and increase
the number of terms in each document, you are inherently making the
clustering algorithm have to work harder, and therefore it *IS* going to
 take longer. Either use fewer documents, or only use the first 1000 terms
 when clustering, or do your clustering offline and include the results
 of the clustering in your index.

Upayavira

On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
 Hi Alexandre,
 
 I've tried to use just index=true, and the speed is still the same and
 not
 any faster. If I set to store=false, there's no results that came back
 with
 the clustering. Is this due to the index are not stored, and the
 clustering
 requires indexed that are stored?
 
 I've also increase my heap size to 16GB as I'm using a machine with 32GB
 RAM, but there is no significant improvement with the performance too.
 
 Regards,
 Edwin
 
 
 
 On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com
 wrote:
 
  Yes, I'm using store=true.
   <field name="content" type="text_general" indexed="true" stored="true"
   omitNorms="true" termVectors="true"/>
 
  However, this field needs to be stored as my program requires this field
  to be returned during normal searching. I tried the lazyLoading=true, but
  it's not working.
 
  Will you do a copy field for the content, and not to set stored=true for
  that field. So that field will just be referenced to for the clustering,
  and the normal search will reference to the original content field?
 
  Regards,
  Edwin
 
 
 
 
  On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com
  wrote:
 
  Are you by any chance doing store=true on the fields you want to search?
 
  If so, you may want to switch to just index=true. Of course, they will
  then not come back in the results, but do you really want to sling
  huge content fields around.
 
  The other option is to do lazyLoading=true and not request that field.
  This, as a test, you could actually do without needing to reindex
  Solr, just with restart. This could give you a way to test whether the
  field stored size is the issue.
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com
  wrote:
   Hi Shawn and Toke,
  
   I only have 520 docs in my data, but each of the documents is quite big
  in
   size, In the Solr, it is using 221MB. So when i set to read from the top
   1000 rows, it should just be reading all the 520 docs that are indexed?
  
   Regards,
   Edwin
  
  
   On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:
  
   On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
Hi Shawn,
   
Yes, I've increased the heap size to 4GB already, and I'm using a
  machine
with 32GB RAM.
   
Is it recommended to further increase the heap size to like 8GB or
  16GB?
  
   Probably not, but I know nothing about your data.  How many Solr docs
   were created by indexing 1GB of data?  How much disk space is used by
   your Solr index(es)?
  
   I know very little about clustering, but it looks like you've gotten a
   reply from Toke, who knows a lot more about that part of the code than
  I
   do.
  
   Thanks,
   Shawn
  
  
 
 
 


Re: Solr performance is slow with just 1GB of data indexed

2015-08-24 Thread Zheng Lin Edwin Yeo
Thank you Upayavira for your reply.

Would like to confirm: when I set rows=100, does it mean that it only builds
the cluster based on the first 100 records that are returned by the search,
and if I have 1000 records that match the search, all the remaining 900
records will not be considered for clustering?
If that is the case, the result of the cluster may not be so accurate, as
there is a possibility that the first 100 records might have a large amount
of similarity among them, while the subsequent 900 records have
differences that could have an impact on the cluster result.

Regards,
Edwin


On 24 August 2015 at 17:50, Upayavira u...@odoko.co.uk wrote:

 I honestly suspect your performance issue is down to the number of terms
 you are passing into the clustering algorithm, not to memory usage as
 such. If you have *huge* documents and cluster across them, performance
 will be slower, by definition.

 Clustering is usually done offline, for example on a large dataset
 taking a few hours or even days. Carrot2 manages to reduce this time to
 a reasonable online task by only clustering a few search results. If
 you increase the number of documents (from say 100 to 1000) and increase
 the number of terms in each document, you are inherently making the
 clustering algorithm have to work harder, and therefore it *IS* going to
 take longer. Either use less documents, or only use the first 1000 terms
 when clustering, or do your clustering offline and include the results
 of the clustering into your index.

 Upayavira

 On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
  Hi Alexandre,
 
  I've tried to use just index=true, and the speed is still the same and
  not
  any faster. If I set to store=false, there's no results that came back
  with
  the clustering. Is this due to the index are not stored, and the
  clustering
  requires indexed that are stored?
 
  I've also increase my heap size to 16GB as I'm using a machine with 32GB
  RAM, but there is no significant improvement with the performance too.
 
  Regards,
  Edwin
 
 
 
  On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com
  wrote:
 
   Yes, I'm using store=true.
    <field name="content" type="text_general" indexed="true" stored="true"
    omitNorms="true" termVectors="true"/>
  
   However, this field needs to be stored as my program requires this
 field
   to be returned during normal searching. I tried the lazyLoading=true,
 but
   it's not working.
  
   Will you do a copy field for the content, and not to set stored=true
 for
   that field. So that field will just be referenced to for the
 clustering,
   and the normal search will reference to the original content field?
  
   Regards,
   Edwin
  
  
  
  
   On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com
   wrote:
  
   Are you by any chance doing store=true on the fields you want to
 search?
  
   If so, you may want to switch to just index=true. Of course, they will
   then not come back in the results, but do you really want to sling
   huge content fields around.
  
   The other option is to do lazyLoading=true and not request that field.
   This, as a test, you could actually do without needing to reindex
   Solr, just with restart. This could give you a way to test whether the
   field stored size is the issue.
  
   Regards,
  Alex.
   
   Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
   http://www.solr-start.com/
  
  
   On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com
 
   wrote:
Hi Shawn and Toke,
   
I only have 520 docs in my data, but each of the documents is quite
 big
   in
size, In the Solr, it is using 221MB. So when i set to read from
 the top
1000 rows, it should just be reading all the 520 docs that are
 indexed?
   
Regards,
Edwin
   
   
On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org
 wrote:
   
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
 Hi Shawn,

 Yes, I've increased the heap size to 4GB already, and I'm using a
   machine
 with 32GB RAM.

 Is it recommended to further increase the heap size to like 8GB
 or
   16GB?
   
Probably not, but I know nothing about your data.  How many Solr
 docs
were created by indexing 1GB of data?  How much disk space is used
 by
your Solr index(es)?
   
I know very little about clustering, but it looks like you've
 gotten a
reply from Toke, who knows a lot more about that part of the code
 than
   I
do.
   
Thanks,
Shawn
   
   
  
  
  



Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Upayavira
And be aware that I'm sure the more terms in your documents, the slower
clustering will be. So it isn't just the number of docs, the size of
them counts in this instance.

A simple test would be to build an index with just the first 1000 terms
of your clustering fields, and see if that makes a difference to
performance.

Upayavira

On Sun, Aug 23, 2015, at 05:32 PM, Erick Erickson wrote:
 You're confusing clustering with searching. Sure, Solr can index
  and search lots of data, but clustering is essentially finding ad-hoc
 similarities between arbitrary documents. It must take each of
 the documents in the result size you specify from your result
 set and try to find commonalities.
 
 For perf issues in terms of clustering, you'd be better off
 talking to the folks at the carrot project.
 
 Best,
 Erick
 
 On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
  Are you by any chance doing store=true on the fields you want to search?
 
  If so, you may want to switch to just index=true. Of course, they will
  then not come back in the results, but do you really want to sling
  huge content fields around.
 
  The other option is to do lazyLoading=true and not request that field.
  This, as a test, you could actually do without needing to reindex
  Solr, just with restart. This could give you a way to test whether the
  field stored size is the issue.
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com 
  wrote:
  Hi Shawn and Toke,
 
  I only have 520 docs in my data, but each of the documents is quite big in
  size, In the Solr, it is using 221MB. So when i set to read from the top
  1000 rows, it should just be reading all the 520 docs that are indexed?
 
  Regards,
  Edwin
 
 
  On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:
 
  On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
   Hi Shawn,
  
   Yes, I've increased the heap size to 4GB already, and I'm using a 
   machine
   with 32GB RAM.
  
   Is it recommended to further increase the heap size to like 8GB or 16GB?
 
  Probably not, but I know nothing about your data.  How many Solr docs
  were created by indexing 1GB of data?  How much disk space is used by
  your Solr index(es)?
 
  I know very little about clustering, but it looks like you've gotten a
  reply from Toke, who knows a lot more about that part of the code than I
  do.
 
  Thanks,
  Shawn
 
 


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Toke Eskildsen
Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
 However, I find that clustering is exceedingly slow after I indexed this 1GB of
 data. It took almost 30 seconds to return the cluster results when I set it
 to cluster the top 1000 records, and it still takes more than 3 seconds when I
 set it to cluster the top 100 records.

Your clustering uses Carrot2, which fetches the top documents and performs 
real-time clustering on them - that process is (nearly) independent of index 
size. The relevant numbers here are top 1000 and top 100, not 1GB. The unknown 
part is whether it is the fetching of top 1000 (the Solr part) or the 
clustering itself (the Carrot part) that is the bottleneck.

- Toke Eskildsen
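[A small sketch of how one might separate the two costs Toke describes: run the same query once with clustering enabled and once with clustering=false, and compare Solr's reported QTime. The responses below are illustrative stand-ins rather than real measurements:]

```python
import json

def qtime(response_body: str) -> int:
    """Pull Solr's reported query time in milliseconds from a JSON response."""
    return json.loads(response_body)["responseHeader"]["QTime"]

# Illustrative captured responses for the same query, e.g. from
# /clustering?q=...&rows=1000 with and without &clustering=false:
with_clustering = '{"responseHeader": {"status": 0, "QTime": 29400}}'
without_clustering = '{"responseHeader": {"status": 0, "QTime": 350}}'

# A large difference points at the Carrot2 side;
# a small one points at the Solr fetch side.
overhead_ms = qtime(with_clustering) - qtime(without_clustering)
print(f"clustering overhead ~ {overhead_ms} ms")
```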


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Alexandre Rafalovitch
Are you by any chance doing store=true on the fields you want to search?

If so, you may want to switch to just index=true. Of course, they will
then not come back in the results, but do you really want to sling
huge content fields around?

The other option is to do lazyLoading=true and not request that field.
As a test, you could actually do this without reindexing
Solr, just with a restart. This would give you a way to test whether the
stored field size is the issue.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
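[A schema sketch of the indexed-but-not-stored setup described above, reusing the field name from Edwin's schema. Note that later in the thread it turns out the Carrot2 clustering component reads stored text, so this trade-off matters for clustering:]

```xml
<!-- Searchable but not returnable: no stored copy is kept. -->
<field name="content" type="text_general" indexed="true" stored="false"/>

<!-- The lazy-loading alternative is a solrconfig.xml setting instead: -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```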


On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
 Hi Shawn and Toke,

 I only have 520 docs in my data, but each of the documents is quite big in
 size, In the Solr, it is using 221MB. So when i set to read from the top
 1000 rows, it should just be reading all the 520 docs that are indexed?

 Regards,
 Edwin


 On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:

 On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
  Hi Shawn,
 
  Yes, I've increased the heap size to 4GB already, and I'm using a machine
  with 32GB RAM.
 
  Is it recommended to further increase the heap size to like 8GB or 16GB?

 Probably not, but I know nothing about your data.  How many Solr docs
 were created by indexing 1GB of data?  How much disk space is used by
 your Solr index(es)?

 I know very little about clustering, but it looks like you've gotten a
 reply from Toke, who knows a lot more about that part of the code than I
 do.

 Thanks,
 Shawn




Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Erick Erickson
You're confusing clustering with searching. Sure, Solr can index
lots of data, but clustering is essentially finding ad-hoc
similarities between arbitrary documents. It must take each of
the documents in the result size you specify from your result
set and try to find commonalities.

For perf issues in terms of clustering, you'd be better off
talking to the folks at the Carrot2 project.

Best,
Erick

On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 Are you by any chance doing store=true on the fields you want to search?

 If so, you may want to switch to just index=true. Of course, they will
 then not come back in the results, but do you really want to sling
 huge content fields around.

 The other option is to do lazyLoading=true and not request that field.
 This, as a test, you could actually do without needing to reindex
 Solr, just with restart. This could give you a way to test whether the
 field stored size is the issue.

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:
 Hi Shawn and Toke,

 I only have 520 docs in my data, but each of the documents is quite big in
 size, In the Solr, it is using 221MB. So when i set to read from the top
 1000 rows, it should just be reading all the 520 docs that are indexed?

 Regards,
 Edwin


 On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:

 On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
  Hi Shawn,
 
  Yes, I've increased the heap size to 4GB already, and I'm using a machine
  with 32GB RAM.
 
  Is it recommended to further increase the heap size to like 8GB or 16GB?

 Probably not, but I know nothing about your data.  How many Solr docs
 were created by indexing 1GB of data?  How much disk space is used by
 your Solr index(es)?

 I know very little about clustering, but it looks like you've gotten a
 reply from Toke, who knows a lot more about that part of the code than I
 do.

 Thanks,
 Shawn




Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Jimmy Lin
unsubscribe

On Sat, Aug 22, 2015 at 9:31 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com
wrote:

 Hi,

 I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.

 However, I find that clustering is exceedingly slow after I indexed this 1GB of
 data. It took almost 30 seconds to return the cluster results when I set it
 to cluster the top 1000 records, and it still takes more than 3 seconds when I
 set it to cluster the top 100 records.

 Is this speed normal? I understand Solr can index terabytes of data
 without the performance being impacted so much, but now the collection is
 slowing down with just 1GB of data.

 Below is my clustering configurations in solrconfig.xml.

 <requestHandler name="/clustering"
                 startup="lazy"
                 enable="${solr.clustering.enabled:true}"
                 class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">1000</int>
     <str name="wt">json</str>
     <str name="indent">true</str>
     <str name="df">text</str>
     <str name="fl">null</str>

     <bool name="clustering">true</bool>
     <bool name="clustering.results">true</bool>
     <str name="carrot.title">subject content tag</str>
     <bool name="carrot.produceSummary">true</bool>

     <int name="carrot.fragSize">20</int>
     <!-- the maximum number of labels per cluster -->
     <int name="carrot.numDescriptions">20</int>
     <!-- produce sub clusters -->
     <bool name="carrot.outputSubClusters">false</bool>
     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">7</str>

     <!-- Configure the remaining request handler parameters. -->
     <str name="defType">edismax</str>
   </lst>
   <arr name="last-components">
     <str>clustering</str>
   </arr>
 </requestHandler>


 Regards,
 Edwin



Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Hi Shawn and Toke,

I only have 520 docs in my data, but each of the documents is quite big in
size; in Solr, it is using 221MB. So when I set it to read from the top
1000 rows, it should just be reading all the 520 docs that are indexed?

Regards,
Edwin


On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:

 On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
  Hi Shawn,
 
  Yes, I've increased the heap size to 4GB already, and I'm using a machine
  with 32GB RAM.
 
  Is it recommended to further increase the heap size to like 8GB or 16GB?

 Probably not, but I know nothing about your data.  How many Solr docs
 were created by indexing 1GB of data?  How much disk space is used by
 your Solr index(es)?

 I know very little about clustering, but it looks like you've gotten a
 reply from Toke, who knows a lot more about that part of the code than I
 do.

 Thanks,
 Shawn




Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Shawn Heisey
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
 Hi Shawn,
 
 Yes, I've increased the heap size to 4GB already, and I'm using a machine
 with 32GB RAM.
 
 Is it recommended to further increase the heap size to like 8GB or 16GB?

Probably not, but I know nothing about your data.  How many Solr docs
were created by indexing 1GB of data?  How much disk space is used by
your Solr index(es)?

I know very little about clustering, but it looks like you've gotten a
reply from Toke, who knows a lot more about that part of the code than I do.

Thanks,
Shawn



Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Bill Bell
We use 8GB to 10GB heaps for indexes of that size all the time.


Bill Bell
Sent from mobile


 On Aug 23, 2015, at 8:52 AM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
 Hi Shawn,
 
 Yes, I've increased the heap size to 4GB already, and I'm using a machine
 with 32GB RAM.
 
 Is it recommended to further increase the heap size to like 8GB or 16GB?
 
 Probably not, but I know nothing about your data.  How many Solr docs
 were created by indexing 1GB of data?  How much disk space is used by
 your Solr index(es)?
 
 I know very little about clustering, but it looks like you've gotten a
 reply from Toke, who knows a lot more about that part of the code than I do.
 
 Thanks,
 Shawn
 


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Hi Alexandre,

I've tried using just index=true, and the speed is still the same, not any
faster. If I set store=false, no results come back from the clustering. Is
this because the fields are not stored, and the clustering requires fields
that are stored?

I've also increased my heap size to 16GB as I'm using a machine with 32GB
RAM, but there is no significant improvement in performance either.

Regards,
Edwin



On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com
wrote:

 Yes, I'm using store=true.
 <field name="content" type="text_general" indexed="true" stored="true"
        omitNorms="true" termVectors="true"/>

 However, this field needs to be stored as my program requires this field
 to be returned during normal searching. I tried the lazyLoading=true, but
 it's not working.

 Will you do a copy field for the content, and not to set stored=true for
 that field. So that field will just be referenced to for the clustering,
 and the normal search will reference to the original content field?

 Regards,
 Edwin




 On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

 Are you by any chance doing store=true on the fields you want to search?

 If so, you may want to switch to just index=true. Of course, they will
 then not come back in the results, but do you really want to sling
 huge content fields around.

 The other option is to do lazyLoading=true and not request that field.
 This, as a test, you could actually do without needing to reindex
 Solr, just with restart. This could give you a way to test whether the
 field stored size is the issue.

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com
 wrote:
  Hi Shawn and Toke,
 
  I only have 520 docs in my data, but each of the documents is quite big
 in
  size, In the Solr, it is using 221MB. So when i set to read from the top
  1000 rows, it should just be reading all the 520 docs that are indexed?
 
  Regards,
  Edwin
 
 
  On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:
 
  On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
   Hi Shawn,
  
   Yes, I've increased the heap size to 4GB already, and I'm using a
 machine
   with 32GB RAM.
  
   Is it recommended to further increase the heap size to like 8GB or
 16GB?
 
  Probably not, but I know nothing about your data.  How many Solr docs
  were created by indexing 1GB of data?  How much disk space is used by
  your Solr index(es)?
 
  I know very little about clustering, but it looks like you've gotten a
  reply from Toke, who knows a lot more about that part of the code than
 I
  do.
 
  Thanks,
  Shawn
 
 





Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Yes, I'm using store=true.
<field name="content" type="text_general" indexed="true" stored="true"
       omitNorms="true" termVectors="true"/>

However, this field needs to be stored, as my program requires this field to
be returned during normal searching. I tried lazyLoading=true, but it's
not working.

Would you do a copyField for the content, and not set stored=true for
that field? That way the copied field would just be referenced for the
clustering, and the normal search would reference the original content field?

Regards,
Edwin
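[A sketch of the copyField arrangement Edwin asks about, with a hypothetical name for the copy. One caveat, consistent with Edwin's store=false observation above: Carrot2 clusters stored text, so an unstored copy can be searched but cannot feed the clustering output:]

```xml
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_search" type="text_general" indexed="true" stored="false"/>
<copyField source="content" dest="content_search"/>
```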




On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 Are you by any chance doing store=true on the fields you want to search?

 If so, you may want to switch to just index=true. Of course, they will
 then not come back in the results, but do you really want to sling
 huge content fields around.

 The other option is to do lazyLoading=true and not request that field.
 This, as a test, you could actually do without needing to reindex
 Solr, just with restart. This could give you a way to test whether the
 field stored size is the issue.

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com
 wrote:
  Hi Shawn and Toke,
 
  I only have 520 docs in my data, but each of the documents is quite big
 in
  size, In the Solr, it is using 221MB. So when i set to read from the top
  1000 rows, it should just be reading all the 520 docs that are indexed?
 
  Regards,
  Edwin
 
 
  On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote:
 
  On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
   Hi Shawn,
  
   Yes, I've increased the heap size to 4GB already, and I'm using a
 machine
   with 32GB RAM.
  
   Is it recommended to further increase the heap size to like 8GB or
 16GB?
 
  Probably not, but I know nothing about your data.  How many Solr docs
  were created by indexing 1GB of data?  How much disk space is used by
  your Solr index(es)?
 
  I know very little about clustering, but it looks like you've gotten a
  reply from Toke, who knows a lot more about that part of the code than I
  do.
 
  Thanks,
  Shawn
 
 



Re: Solr performance is slow with just 1GB of data indexed

2015-08-22 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Yes, I've increased the heap size to 4GB already, and I'm using a machine
with 32GB RAM.

Is it recommended to further increase the heap size to like 8GB or 16GB?

Regards,
Edwin
On 23 Aug 2015 10:23, Shawn Heisey apa...@elyograg.org wrote:

 On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote:
  I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.
 
  However, I find that clustering is exceedingly slow after I indexed this
  1GB of data. It took almost 30 seconds to return the cluster results when
  I set it to cluster the top 1000 records, and it still takes more than 3
  seconds when I set it to cluster the top 100 records.

  Is this speed normal? I understand Solr can index terabytes of data
  without the performance being impacted so much, but now the collection is
  slowing down with just 1GB of data.

 Have you increased the heap size?  If you simply start Solr 5.x with the
 included script and don't use any commandline options, Solr will only
 have a 512MB heap.  This is *extremely* small.  A significant chunk of
 that 512MB heap will be required just to start Jetty and Solr, so
 there's not much memory left for manipulating the index data and serving
 queries.  Assuming you have at least 4GB of RAM, try adding -m 2g to
 the start commandline.

 Thanks,
 Shawn
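[The startup option Shawn refers to, as it would look on the Solr 5.x command line; the SOLR_HEAP line is the equivalent persistent setting in solr.in.sh:]

```
# Start Solr with a 2GB heap instead of the 512MB default:
bin/solr start -m 2g

# Or set it persistently in solr.in.sh:
SOLR_HEAP="2g"
```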




Solr performance is slow with just 1GB of data indexed

2015-08-22 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.

However, I find that clustering is exceedingly slow after I indexed this 1GB of
data. It took almost 30 seconds to return the cluster results when I set it
to cluster the top 1000 records, and it still takes more than 3 seconds when I
set it to cluster the top 100 records.

Is this speed normal? I understand Solr can index terabytes of data
without the performance being impacted so much, but now the collection is
slowing down with just 1GB of data.

Below is my clustering configuration in solrconfig.xml.

<requestHandler name="/clustering"
                startup="lazy"
                enable="${solr.clustering.enabled:true}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">null</str>

    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">subject content tag</str>
    <bool name="carrot.produceSummary">true</bool>

    <int name="carrot.fragSize">20</int>
    <!-- the maximum number of labels per cluster -->
    <int name="carrot.numDescriptions">20</int>
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">7</str>

    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>


Regards,
Edwin

