Re: solr performance with >1 NUMAs
Great updates. Thanks for keeping us all in the loop!

On Thu, Oct 22, 2020 at 7:43 PM Wei wrote:
> Hi Shawn,
>
> I'm circling back with some new findings on our 2 NUMA issue. After a
> few iterations, we do see improvement with the UseNUMA flag and other JVM
> setting changes. Here are the current settings, with Java 11:
>
> -XX:+UseNUMA
> -XX:+UseG1GC
> -XX:+AlwaysPreTouch
> -XX:+UseTLAB
> -XX:G1MaxNewSizePercent=20
> -XX:MaxGCPauseMillis=150
> -XX:+DisableExplicitGC
> -XX:+DoEscapeAnalysis
> -XX:+ParallelRefProcEnabled
> -XX:+UnlockDiagnosticVMOptions
> -XX:+UnlockExperimentalVMOptions
>
> Compared to the previous Java 8 + CMS on 2 NUMA servers, P99 latency has
> improved over 20%.
>
> Thanks,
> Wei
>
> On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey wrote:
> > On 9/28/2020 12:17 PM, Wei wrote:
> > > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA.
> > > Do you see any backward compatibility issue for Solr 8 with Java 11?
> > > Can we run Solr 8 built with JDK 8 in a Java 11 JRE, or do we need to
> > > rebuild Solr with the Java 11 JDK?
> >
> > I do not know of any problems running the binary release of Solr 8
> > (which is most likely built with the Java 8 JDK) with a newer release
> > like Java 11 or higher.
> >
> > I think Sun was really burned by such problems cropping up in the days
> > of Java 5 and 6, and their developers have worked really hard to make
> > sure that never happens again.
> >
> > If you're running Java 11, you will need to pick a different garbage
> > collector if you expect the NUMA flag to function. The most recent
> > releases of Solr are defaulting to G1GC, which, as previously mentioned,
> > did not gain NUMA optimizations until Java 14.
> >
> > It is not clear to me whether the NUMA optimizations will work with any
> > collector other than Parallel until Java 14. You would need to check the
> > Java documentation carefully or ask someone involved with the
> > development of Java.
> >
> > If you do see an improvement using the NUMA flag with Java 11, please
> > let us know exactly what options Solr was started with.
> >
> > Thanks,
> > Shawn
Re: solr performance with >1 NUMAs
Hi Shawn,

I'm circling back with some new findings on our 2 NUMA issue. After a few iterations, we do see improvement with the UseNUMA flag and other JVM setting changes. Here are the current settings, with Java 11:

-XX:+UseNUMA
-XX:+UseG1GC
-XX:+AlwaysPreTouch
-XX:+UseTLAB
-XX:G1MaxNewSizePercent=20
-XX:MaxGCPauseMillis=150
-XX:+DisableExplicitGC
-XX:+DoEscapeAnalysis
-XX:+ParallelRefProcEnabled
-XX:+UnlockDiagnosticVMOptions
-XX:+UnlockExperimentalVMOptions

Compared to the previous Java 8 + CMS on 2 NUMA servers, P99 latency has improved over 20%.

Thanks,
Wei

On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey wrote:
> On 9/28/2020 12:17 PM, Wei wrote:
> > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA.
> > Do you see any backward compatibility issue for Solr 8 with Java 11?
> > Can we run Solr 8 built with JDK 8 in a Java 11 JRE, or do we need to
> > rebuild Solr with the Java 11 JDK?
>
> I do not know of any problems running the binary release of Solr 8
> (which is most likely built with the Java 8 JDK) with a newer release
> like Java 11 or higher.
>
> I think Sun was really burned by such problems cropping up in the days
> of Java 5 and 6, and their developers have worked really hard to make
> sure that never happens again.
>
> If you're running Java 11, you will need to pick a different garbage
> collector if you expect the NUMA flag to function. The most recent
> releases of Solr are defaulting to G1GC, which, as previously mentioned,
> did not gain NUMA optimizations until Java 14.
>
> It is not clear to me whether the NUMA optimizations will work with any
> collector other than Parallel until Java 14. You would need to check the
> Java documentation carefully or ask someone involved with the
> development of Java.
>
> If you do see an improvement using the NUMA flag with Java 11, please
> let us know exactly what options Solr was started with.
>
> Thanks,
> Shawn
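[For anyone wanting to try these settings in a stock Solr install, the usual place to put them is the GC_TUNE variable in solr.in.sh. The fragment below is only a sketch applying Wei's flags, not a general recommendation; flag availability varies by JDK build, and the Unlock* options are placed first in case any later flag is diagnostic or experimental on your JVM.]

```shell
# Hypothetical solr.in.sh fragment applying the flags reported above.
# Values are from this thread, not tuned recommendations for every workload.
GC_TUNE="-XX:+UnlockDiagnosticVMOptions \
-XX:+UnlockExperimentalVMOptions \
-XX:+UseNUMA \
-XX:+UseG1GC \
-XX:+AlwaysPreTouch \
-XX:+UseTLAB \
-XX:G1MaxNewSizePercent=20 \
-XX:MaxGCPauseMillis=150 \
-XX:+DisableExplicitGC \
-XX:+DoEscapeAnalysis \
-XX:+ParallelRefProcEnabled"
```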
Re: solr performance with >1 NUMAs
On 9/28/2020 12:17 PM, Wei wrote:
> Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do
> you see any backward compatibility issue for Solr 8 with Java 11? Can we
> run Solr 8 built with JDK 8 in a Java 11 JRE, or do we need to rebuild
> Solr with the Java 11 JDK?

I do not know of any problems running the binary release of Solr 8 (which is most likely built with the Java 8 JDK) with a newer release like Java 11 or higher.

I think Sun was really burned by such problems cropping up in the days of Java 5 and 6, and their developers have worked really hard to make sure that never happens again.

If you're running Java 11, you will need to pick a different garbage collector if you expect the NUMA flag to function. The most recent releases of Solr are defaulting to G1GC, which, as previously mentioned, did not gain NUMA optimizations until Java 14.

It is not clear to me whether the NUMA optimizations will work with any collector other than Parallel until Java 14. You would need to check the Java documentation carefully or ask someone involved with the development of Java.

If you do see an improvement using the NUMA flag with Java 11, please let us know exactly what options Solr was started with.

Thanks,
Shawn
Re: solr performance with >1 NUMAs
Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you see any backward compatibility issue for Solr 8 with Java 11? Can we run Solr 8 built with JDK 8 in a Java 11 JRE, or do we need to rebuild Solr with the Java 11 JDK?

Best,
Wei

On Sat, Sep 26, 2020 at 6:44 PM Shawn Heisey wrote:
> On 9/26/2020 1:39 PM, Wei wrote:
> > Thanks Shawn! Currently we are still using the CMS collector for solr
> > with Java 8. When last evaluated with Solr 7, CMS performs better than
> > G1 for our case. When using G1, is it better to upgrade from Java 8 to
> > Java 11? From
> > https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
> > it seems Java 14 is not officially supported for Solr 8.
>
> It has been a while since I was working with Solr every day, and when I
> was, Java 11 did not yet exist. I have no idea whether Java 11 improves
> things beyond Java 8. That said ... all software evolves and usually
> improves as time goes by. It is likely that the newer version has SOME
> benefit.
>
> Regarding whether or not Java 14 is supported: There are automated
> tests where all the important code branches are run with all major
> versions of Java, including pre-release versions, and those tests do
> include various garbage collectors. Somebody notices when a combination
> doesn't work, and big problems with newer Java versions are something
> that gets discussed on our mailing lists.
>
> Java 14 has been out for a while, with no big problems being discussed
> so far. So it is likely that it works with Solr. Can I say for sure?
> No. I haven't tried it myself.
>
> I don't have any hardware available where there is more than one NUMA,
> or I would look deeper into this myself. It would be interesting to
> find out whether the -XX:+UseNUMA option makes a big difference in
> performance.
>
> Thanks,
> Shawn
Re: solr performance with >1 NUMAs
On 9/26/2020 1:39 PM, Wei wrote:
> Thanks Shawn! Currently we are still using the CMS collector for solr
> with Java 8. When last evaluated with Solr 7, CMS performs better than G1
> for our case. When using G1, is it better to upgrade from Java 8 to Java
> 11? From
> https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
> it seems Java 14 is not officially supported for Solr 8.

It has been a while since I was working with Solr every day, and when I was, Java 11 did not yet exist. I have no idea whether Java 11 improves things beyond Java 8. That said ... all software evolves and usually improves as time goes by. It is likely that the newer version has SOME benefit.

Regarding whether or not Java 14 is supported: There are automated tests where all the important code branches are run with all major versions of Java, including pre-release versions, and those tests do include various garbage collectors. Somebody notices when a combination doesn't work, and big problems with newer Java versions are something that gets discussed on our mailing lists.

Java 14 has been out for a while, with no big problems being discussed so far. So it is likely that it works with Solr. Can I say for sure? No. I haven't tried it myself.

I don't have any hardware available where there is more than one NUMA, or I would look deeper into this myself. It would be interesting to find out whether the -XX:+UseNUMA option makes a big difference in performance.

Thanks,
Shawn
Re: solr performance with >1 NUMAs
Thanks Shawn! Currently we are still using the CMS collector for solr with Java 8. When last evaluated with Solr 7, CMS performs better than G1 for our case. When using G1, is it better to upgrade from Java 8 to Java 11? From https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html, it seems Java 14 is not officially supported for Solr 8.

Best,
Wei

On Fri, Sep 25, 2020 at 5:50 PM Shawn Heisey wrote:
> On 9/23/2020 7:42 PM, Wei wrote:
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs.
> > I noticed that query latency almost doubled compared to deployment on
> > single NUMA machines. Not sure what's causing the huge difference. Is
> > there any tuning to boost the performance on multiple NUMA machines?
> > Any pointer is appreciated.
>
> If you're running with standard options, Solr 8.4.1 will start using the
> G1 garbage collector.
>
> As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option,
> which makes better decisions about memory allocations and multiple
> NUMAs. If you're running a new enough Java, it would probably be
> beneficial to add this to the garbage collector options. Solr itself is
> unaware of things like NUMA -- Java must handle that.
>
> https://openjdk.java.net/jeps/345
>
> Thanks,
> Shawn
Re: solr performance with >1 NUMAs
On 9/23/2020 7:42 PM, Wei wrote:
> Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> noticed that query latency almost doubled compared to deployment on
> single NUMA machines. Not sure what's causing the huge difference. Is
> there any tuning to boost the performance on multiple NUMA machines? Any
> pointer is appreciated.

If you're running with standard options, Solr 8.4.1 will start using the G1 garbage collector.

As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option, which makes better decisions about memory allocations and multiple NUMAs. If you're running a new enough Java, it would probably be beneficial to add this to the garbage collector options. Solr itself is unaware of things like NUMA -- Java must handle that.

https://openjdk.java.net/jeps/345

Thanks,
Shawn
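[Whether a given JDK build actually accepts and enables UseNUMA with your chosen collector can be checked from the JVM's flag dump. This is only a sketch for a machine with a local `java` on the PATH; output format and flag names vary across JDK vendors and versions, and with G1 before JDK 14 the flag is typically accepted but has no effect.]

```shell
# Ask the JVM which NUMA-related flags end up set after applying our options.
java -XX:+UseG1GC -XX:+UseNUMA -XX:+PrintFlagsFinal -version | grep -i numa
```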
Re: solr performance with >1 NUMAs
Thanks Dominique. I'll start with the -XX:+UseNUMA option.

Best,
Wei

On Fri, Sep 25, 2020 at 7:04 AM Dominique Bejean wrote:
> Hi,
>
> This would be a Java VM option, not something Solr itself can know about.
> Take a look at the comments on this article. Maybe it will help.
>
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?showComment=1347033706559#c229885263664926125
>
> Regards,
> Dominique
>
> Le jeu. 24 sept. 2020 à 03:42, Wei a écrit :
> > Hi,
> >
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs.
> > I noticed that query latency almost doubled compared to deployment on
> > single NUMA machines. Not sure what's causing the huge difference. Is
> > there any tuning to boost the performance on multiple NUMA machines?
> > Any pointer is appreciated.
> >
> > Best,
> > Wei
Re: solr performance with >1 NUMAs
Hi,

This would be a Java VM option, not something Solr itself can know about. Take a look at the comments on this article. Maybe it will help.

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?showComment=1347033706559#c229885263664926125

Regards,
Dominique

Le jeu. 24 sept. 2020 à 03:42, Wei a écrit :
> Hi,
>
> Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> noticed that query latency almost doubled compared to deployment on
> single NUMA machines. Not sure what's causing the huge difference. Is
> there any tuning to boost the performance on multiple NUMA machines? Any
> pointer is appreciated.
>
> Best,
> Wei
solr performance with >1 NUMAs
Hi,

Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I noticed that query latency almost doubled compared to deployment on single NUMA machines. Not sure what's causing the huge difference. Is there any tuning to boost the performance on multiple NUMA machines? Any pointer is appreciated.

Best,
Wei
Re: question about setup for maximizing solr performance
On 6/1/2020 9:29 AM, Odysci wrote:
> Hi, I'm looking for some advice on improving performance of our solr
> setup. Does anyone have any insights on what would be better for
> maximizing throughput on multiple searches being done at the same time?
> thanks!

In almost all cases, adding memory will provide the best performance boost. This is because memory is faster than disks, even SSD.

I have put relevant information on a wiki page so that it is easy for people to find and digest:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn
question about setup for maximizing solr performance
Hi,

I'm looking for some advice on improving the performance of our solr setup, in particular about the trade-offs between using larger machines versus more, smaller machines.

Our full index has just over 100 million docs, and we do almost all searches using fq's (with q=*:*) and facets. We are using solr 8.3.

Currently, I have a solrcloud setup with 2 physical machines (let's call them A and B), and my index is divided into 2 shards and 2 replicas, such that each machine has a full copy of the index. The nodes and replicas are as follows:

Machine A:
core_node3 / shard1_replica_n1
core_node7 / shard2_replica_n4

Machine B:
core_node5 / shard1_replica_n2
core_node8 / shard2_replica_n6

My Zookeeper setup uses 3 instances. It's also the case that for most of the searches we do, results return from both shards (for the same search). My experiments indicate that our setup is cpu-bound.

Due to cost constraints, I could either double the cpu in each of the 2 machines, or make it a 4-machine setup (using current-size machines) with 2 shards and 4 replicas (or 4 shards with 4 replicas). I assume that keeping the full index on all machines will allow all searches to be evenly distributed.

Does anyone have any insights on what would be better for maximizing throughput on multiple searches being done at the same time?

thanks!

Reinaldo
Re: Solr performance using fq with multiple values
On 4/18/2020 12:20 PM, Odysci wrote:
> We don't use this field for general queries (q:*), only for fq and
> faceting. Do you think making it indexed="true" would make a difference
> in fq performance?

fq means "filter query". It's still a query. So yes, the field should be indexed. The query you're doing only works because docValues is true ... but queries using docValues have terrible performance.

Thanks,
Shawn
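[Concretely, the change Shawn suggests amounts to flipping indexed="true" on the field and reindexing. A hedged sketch only: the field name is taken from this thread, while the type and other attributes are assumptions, since the original definition was stripped from the archived message.]

```
<field name="field1_name" type="plong" indexed="true" stored="false"
       required="false" multiValued="false" docValues="true" />
```

[For long value lists, the terms query parser can also be cheaper than a large Boolean OR, since it skips per-clause scoring overhead:]

```
fq={!terms f=field1_name}V1,V2,V3
```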
Re: Solr performance using fq with multiple values
We don't use this field for general queries (q:*), only for fq and faceting. Do you think making it indexed="true" would make a difference in fq performance?

Thanks,
Reinaldo

On Sat, Apr 18, 2020 at 3:06 PM Sylvain James wrote:
> Hi Reinaldo,
>
> Involved fields should be indexed for better performance?
>
> stored="false" required="false" multiValued="false" docValues="true" />
>
> Sylvain
>
> Le sam. 18 avr. 2020 à 18:46, Odysci a écrit :
> > Hi,
> >
> > We are seeing significant performance degradation on single queries
> > that use fq with multiple values, as in:
> >
> > fq=field1_name:(V1 V2 V3 ...)
> >
> > If we use only one value in the fq (say only V1) we get QTime = T ms.
> > As we increase the number of values, say to 5 values, QTime more than
> > triples, even if the number of results is small. In my tests I made
> > sure cache was not an issue and nothing else was using the cpu.
> >
> > We commonly need to use fq with multiple values (on the same field
> > name, which is normally a long).
> > Is this performance hit to be expected?
> > Is there a better way to do this?
> >
> > We use Solr Cloud 8.3, and the field that we use fq on is defined as:
> >
> > stored="false" required="false" multiValued="false" docValues="true" />
> >
> > Thanks
> >
> > Reinaldo
Re: Solr performance using fq with multiple values
Hi Reinaldo,

Involved fields should be indexed for better performance?

Sylvain

Le sam. 18 avr. 2020 à 18:46, Odysci a écrit :
> Hi,
>
> We are seeing significant performance degradation on single queries that
> use fq with multiple values, as in:
>
> fq=field1_name:(V1 V2 V3 ...)
>
> If we use only one value in the fq (say only V1) we get QTime = T ms.
> As we increase the number of values, say to 5 values, QTime more than
> triples, even if the number of results is small. In my tests I made sure
> cache was not an issue and nothing else was using the cpu.
>
> We commonly need to use fq with multiple values (on the same field name,
> which is normally a long).
> Is this performance hit to be expected?
> Is there a better way to do this?
>
> We use Solr Cloud 8.3, and the field that we use fq on is defined as:
>
> stored="false" required="false" multiValued="false" docValues="true" />
>
> Thanks
>
> Reinaldo
Solr performance using fq with multiple values
Hi,

We are seeing significant performance degradation on single queries that use fq with multiple values, as in:

fq=field1_name:(V1 V2 V3 ...)

If we use only one value in the fq (say only V1) we get QTime = T ms. As we increase the number of values, say to 5 values, QTime more than triples, even if the number of results is small. In my tests I made sure cache was not an issue and nothing else was using the cpu.

We commonly need to use fq with multiple values (on the same field name, which is normally a long). Is this performance hit to be expected? Is there a better way to do this?

We use Solr Cloud 8.3, and the field that we use fq on is defined as:

Thanks

Reinaldo
Re: SOLR PERFORMANCE Warning
Hi,

It means that you are either committing too frequently or your warm-up takes too long. If you are committing on every bulk, stop doing that and use autocommit.

Regards,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 21 Feb 2020, at 06:54, Akreeti Agarwal wrote:
>
> Hi All,
>
> I am using SOLR 7.5 version with master slave architecture.
> I am getting:
>
> "PERFORMANCE WARNING: Overlapping onDeckSearchers=2"
>
> continuously on my master logs for all cores. Please help me to resolve
> this.
>
> Thanks & Regards,
> Akreeti Agarwal
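[For reference, the autocommit Emir describes lives in solrconfig.xml. A sketch with illustrative intervals only; tune maxTime for your indexing load. The key point is openSearcher=false on the hard commit, so frequent commits don't each trigger a warming searcher.]

```
<!-- solrconfig.xml: hard commit persists the tlog without opening a new
     searcher; the soft commit controls how quickly updates become visible.
     This avoids overlapping warm-ups caused by explicit per-bulk commits. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
```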
SOLR PERFORMANCE Warning
Hi All,

I am using SOLR 7.5 version with master slave architecture. I am getting:

"PERFORMANCE WARNING: Overlapping onDeckSearchers=2"

continuously on my master logs for all cores. Please help me to resolve this.

Thanks & Regards,
Akreeti Agarwal

::DISCLAIMER::
The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
Re: solr SSL encryption degrades solr performance
Hi,

Which Solr version are you using? Also, how many collections do you have, and how many records have you indexed in those collections?

Regards,
Edwin

On Mon, 4 Feb 2019 at 23:33, Anchal Sharma2 wrote:
> Hi All,
>
> We had recently enabled SSL on solr, but afterwards our application
> performance has degraded significantly, i.e. the time for the source
> application to fetch a record from solr has increased from approx 4 ms to
> 200 ms (this is for a single record). This amounts to a lot of time when
> multiple calls are made to solr.
>
> Has anyone experienced this? Please share if you have any suggestion.
>
> Thanks & Regards,
> Anchal Sharma
solr SSL encryption degrades solr performance
Hi All,

We had recently enabled SSL on solr, but afterwards our application performance has degraded significantly, i.e. the time for the source application to fetch a record from solr has increased from approx 4 ms to 200 ms (this is for a single record). This amounts to a lot of time when multiple calls are made to solr.

Has anyone experienced this? Please share if you have any suggestion.

Thanks & Regards,
Anchal Sharma
Re: SOLR Performance Statistics
On 11/21/2018 8:59 AM, Marc Schöchlin wrote:
> Is it possible to modify the log4j appender to also log other query
> attributes like response/request size in bytes and number of resulted
> documents?

Changing the log4j config might not do anything useful at all. In order for such a change to be useful, the application must have code that logs the information you're after.

If you change the default logging level in your log4j config to DEBUG instead of INFO, you'll get a LOT more information in your logs. The information you're after *MIGHT* be logged, but it might not -- I really have no idea without checking the source code about precisely what information Solr is logging.

The number of documents that match each query *IS* mentioned in solr.log, and you won't even have to change anything to get it. You'll see "hits=nn" when a query is logged.

> I think about snooping on the ethernet interface of a client or on the
> server to gather libpcap data. Is there a chance to analyze captured
> data if the format is e.g. "wt=javabin&version=2"?

The javabin format is binary. You would need something that understands it. If you added Solr jars to a custom program, you could probably feed the data to it and make sense of it. That would require some research into how Solr works at a low level, to learn how to take information gathered from a packet capture and decode it into an actual response.

Thanks,
Shawn
SOLR Performance Statistics
Hello list,

I am using the pretty old solr 4.7 *sigh* release and I am currently investigating performance problems. The solr instance currently runs very expensive queries with huge results, and I want to find the most promising queries for optimization.

I am currently using the solr logfiles and a simple tool (enhanced by me) to analyze the queries: https://github.com/scoopex/solr-loganalyzer

Is it possible to modify the log4j appender to also log other query attributes like response/request size in bytes and number of resulted documents?

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n
log4j.appender.file.bufferedIO=true

Is there a better way to create detailed query stats and to replay queries on a test system?

I am thinking about snooping on the ethernet interface of a client or on the server to gather libpcap data. Is there a chance to analyze captured data if the format is e.g. "wt=javabin&version=2"?

I do similar things for mysql to make non-intrusive performance analytics using pt-query-digest (Percona Toolkit). This works like this on mysql:

1.) Capture data

# Capture all data on port 3306
tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > mysql.tcp.txt

# Capture only 1/7 of the connections, using a modulus of 7 on the source
# port, if you have a very busy network connection
tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' > mysql.tcp.txt

2.) Create statistics on another system using the tcpdump file

pt-query-digest --watch-server '127.0.0.1:3307' --limit 110 --type tcpdump mysql.tcp.txt

If I can extract the streams of the connections - do you have an idea how to parse the binary data? (Can I use parts of the solr client?) Is there a comparable tool out there?

Regards,
Marc
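[Since Solr already logs hits= and QTime= for every query, a first pass at per-query stats can be plain text processing over solr.log, no log4j changes needed. A sketch only; the sample log lines below are invented to illustrate the general shape of the default format, and real lines may differ by version.]

```shell
# Build a tiny sample solr.log (the two lines are made up for illustration).
cat > /tmp/sample-solr.log <<'EOF'
INFO  - 2018-11-21 08:59:01.123; org.apache.solr.core.SolrCore; [core1] webapp=/solr path=/select params={q=foo&wt=javabin&version=2} hits=42 status=0 QTime=137
INFO  - 2018-11-21 08:59:02.456; org.apache.solr.core.SolrCore; [core1] webapp=/solr path=/select params={q=bar&wt=javabin&version=2} hits=7 status=0 QTime=12
EOF

# Extract the hit count and QTime for every query line.
summary=$(awk 'match($0, /hits=[0-9]+/) {
    h = substr($0, RSTART + 5, RLENGTH - 5)
    match($0, /QTime=[0-9]+/)
    q = substr($0, RSTART + 6, RLENGTH - 6)
    printf "hits=%s qtime=%s\n", h, q
}' /tmp/sample-solr.log)
printf '%s\n' "$summary"
```

On the sample file this prints one "hits=... qtime=..." line per query, which can then be sorted or aggregated to find the most expensive queries.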
Re: Live publishing and solr performance optimization
Sharding can be one of the options. But what is the size of your documents? And which Solr version are you using?

Regards,
Edwin

On Tue, 20 Nov 2018 at 01:40, Balanathagiri Ayyasamypalanivel <bala.cit...@gmail.com> wrote:
> Hi,
>
> We are in the process of live publishing documents in solr, and at the
> same time we have to maintain search performance.
>
> Total existing docs: 120 million
> Expected data for live publishing: 1 million
>
> Every 1 hour, we will get 1M docs to publish live to the hot solr
> collection. Can you please provide your suggestions on how we can do
> this effectively?
>
> Regards,
> Bala.
Live publishing and solr performance optimization
Hi,

We are in the process of live publishing documents in solr, and at the same time we have to maintain search performance.

Total existing docs: 120 million
Expected data for live publishing: 1 million

Every 1 hour, we will get 1M docs to publish live to the hot solr collection. Can you please provide your suggestions on how we can do this effectively?

Regards,
Bala.
Re: Solr performance issue
On 2/15/2018 2:00 AM, Srinivas Kashyap wrote:
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for
> the child entities in data-config.xml, and I'm using the same for
> full-import only. In the beginning of my implementation, I had written a
> delta-import query to index the modified changes. But my requirement
> grew, and I now have 17 child entities for a single parent entity. When
> doing delta-import for huge data, the number of requests being made to
> the datasource (database) grew, and CPU utilization was 100% when
> concurrent users started modifying the data. For this, instead of calling
> delta-import, which imports based on last index time, I did full-import
> ('SortedMapBackedCache') based on last index time.
>
> Though the parent entity query returns only records that are modified,
> the child entity queries pull all the data from the database, and the
> indexing happens 'in-memory', which is causing the JVM to run out of
> memory.

Can you provide your DIH config file (with passwords redacted) and the precise URL you are using to initiate dataimport? Also, I would like to know what field you have defined as your uniqueKey. I may have more questions about the data in your system, depending on what I see.

That cache implementation should only cache entries from the database that are actually requested. If your query is correctly defined, it should not pull all records from the DB table.

> Is there a way to specify in the child entity query to pull only the
> records related to the parent entity in full-import mode?

If I am understanding your question correctly, this is one of the fairly basic things that DIH does. Look at this config example in the reference guide:

https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#configuring-the-dih-configuration-file

In the entity named "feature" in that example config, the query string uses ${item.ID} to reference the ID column from the parent entity, which is "item".

I should warn you that a cached entity does not always improve performance. This is particularly true if the lookup into the cache is the information that goes to your uniqueKey field. When the lookup is by uniqueKey, every single row requested from the database will be used exactly once, so there's not really any point to caching it.

Thanks,
Shawn
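[The reference-guide pattern Shawn points to looks roughly like the sketch below. Table and column names are the guide's example names, not Srinivas's schema. The first child entity issues one query per parent row via ${item.ID}; the second shows the cached variant, where cacheKey/cacheLookup tell SortedMapBackedCache to key its single bulk query on the parent reference.]

```
<document>
  <entity name="item" query="SELECT * FROM ITEM">
    <!-- uncached: one database query per parent row -->
    <entity name="feature"
            query="SELECT DESCRIPTION FROM FEATURE WHERE ITEM_ID='${item.ID}'"/>
    <!-- cached: one query total, with an in-memory lookup per parent row -->
    <entity name="feature_cached"
            processor="SqlEntityProcessor"
            cacheImpl="SortedMapBackedCache"
            cacheKey="ITEM_ID"
            cacheLookup="item.ID"
            query="SELECT ITEM_ID, DESCRIPTION FROM FEATURE"/>
  </entity>
</document>
```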
Re: Solr performance issue
Srinivas:

Not an answer to your question, but when DIH starts getting this complicated, I start to seriously think about SolrJ; see:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

In particular, it moves the heavy lifting of acquiring the data from a Solr node (which I'm assuming also has to index docs) to "some client". It also lets you play some tricks with the code to make things faster.

Best,
Erick

On Thu, Feb 15, 2018 at 1:00 AM, Srinivas Kashyap wrote:
> Hi,
>
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for
> the child entities in data-config.xml, and I'm using the same for
> full-import only. In the beginning of my implementation, I had written a
> delta-import query to index the modified changes. But my requirement
> grew, and I now have 17 child entities for a single parent entity. When
> doing delta-import for huge data, the number of requests being made to
> the datasource (database) grew, and CPU utilization was 100% when
> concurrent users started modifying the data. For this, instead of calling
> delta-import, which imports based on last index time, I did full-import
> ('SortedMapBackedCache') based on last index time.
>
> Though the parent entity query returns only records that are modified,
> the child entity queries pull all the data from the database, and the
> indexing happens 'in-memory', which is causing the JVM to run out of
> memory.
>
> Is there a way to specify in the child entity query to pull only the
> records related to the parent entity in full-import mode?
>
> Thanks and Regards,
> Srinivas Kashyap
Solr performance issue
Hi,

I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the child entities in data-config.xml, and I'm using the same for full-import only. In the beginning of my implementation, I had written a delta-import query to index the modified changes. But my requirement grew, and I now have 17 child entities for a single parent entity. When doing delta-import for huge data, the number of requests being made to the datasource (database) grew, and CPU utilization was 100% when concurrent users started modifying the data. For this, instead of calling delta-import, which imports based on last index time, I did full-import ('SortedMapBackedCache') based on last index time.

Though the parent entity query returns only records that are modified, the child entity queries pull all the data from the database, and the indexing happens 'in-memory', which is causing the JVM to run out of memory.

Is there a way to specify in the child entity query to pull only the records related to the parent entity in full-import mode?

Thanks and Regards,
Srinivas Kashyap

DISCLAIMER:
E-mails and attachments from TradeStone Software, Inc. are confidential. If you are not the intended recipient, please notify the sender immediately by replying to the e-mail, and then delete it without making copies or using it in any way. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
Re: Solr performance issue on querying --> Solr 6.5.1
Hi Erick, As suggested, I did try a non-HDFS Solr Cloud instance, and its response looks much better. On the configuration side too, I am mostly using default configurations, with block.cache.direct.memory.allocation set to false. On analysis of the HDFS cache, evictions seem to be on the higher side. Thanks, Arun -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
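The high eviction rate Arun reports is consistent with a small block cache. As a rough check, assuming the cache block size is 8 KB (an assumption; verify for your Solr version), the capacity implied by the slab defaults quoted elsewhere in this thread works out to:

```shell
slab_count=1           # solr.hdfs.blockcache.slab.count default from the thread
blocks_per_bank=16384  # solr.hdfs.blockcache.blocksperbank default from the thread
block_bytes=8192       # assumed cache block size (8 KB)
cache_mb=$(( slab_count * blocks_per_bank * block_bytes / 1024 / 1024 ))
echo "${cache_mb} MB of block cache"
```

That is small next to the roughly 1.6 GB of index in Collections 4 and 5, which alone would explain heavy evictions; raising the slab count (with enough direct memory to back it) is the usual lever.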
Re: Solr performance issue on querying --> Solr 6.5.1
Hi Arun, It is hard to measure something without affecting it, but we could use debug results and combine with QTime without debug: If we ignore merging results, it seems that majority of time is spent for retrieving docs (~500ms). You should consider reducing number of rows if you want better response time (you can ask for rows=0 to see max possible time). Also, as Erick suggested, reducing number of shards (1 if not plan much more doc) will trim some overhead of merging results. Thanks, Emir I noticed that you removed bq - is time with bq acceptable as well? > On 27 Sep 2017, at 12:34, sasarunwrote: > > Hi Emir, > > Please find the response without bq parameter and debugQuery set to true. > Also it was noted that Qtime comes down drastically without the debug > parameter to about 700-800. > > > true > 0 > 3446 > > > ("hybrid electric powerplant" "hybrid electric powerplants" "Electric" > "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid > Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid" > "hybrid electric" "electric powerplant") > > edismax > on > > host > title > url > customContent > contentSpecificSearch > > > id > contentOntologyTagsCount > > 0 > OR > 3985d7e2-3e54-48d8-8336-229e85f5d9de > 600 > true > > > maxScore="56.74194">... > > > > solr-prd-cluster-m-GooglePatent_shard4_replica2-1506504238282-20 > > > > 35 > 159 > GET_TOP_IDS > 41294 > ... > > > 29 > 165 > GET_TOP_IDS > 40980 > ... > > > 31 > 200 > GET_TOP_IDS > 41006 > ... > > > 43 > 208 > GET_TOP_IDS > 41040 > ... > > > 181 > 466 > GET_TOP_IDS > 41138 > ... > > > > > 1518 > 1523 > GET_FIELDS,GET_DEBUG > 110 > ... > > > 1562 > 1573 > GET_FIELDS,GET_DEBUG > 115 > ... > > > 1793 > 1800 > GET_FIELDS,GET_DEBUG > 120 > ... > > > 2153 > 2161 > GET_FIELDS,GET_DEBUG > 125 > ... > > > 2957 > 2970 > GET_FIELDS,GET_DEBUG > 130 > ... 
> > > > > 10302.0 > > 2.0 > > 2.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > > 10288.0 > > 661.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 9627.0 > > > > > ("hybrid electric powerplant" "hybrid electric powerplants" "Electric" > "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid > Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid" > "hybrid electric" "electric powerplant") > > > ("hybrid electric powerplant" "hybrid electric powerplants" "Electric" > "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid > Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid" > "hybrid electric" "electric powerplant") > > > (+(DisjunctionMaxQuery((host:hybrid electric powerplant | > contentSpecificSearch:"hybrid electric powerplant" | customContent:"hybrid > electric powerplant" | title:hybrid electric powerplant | url:hybrid > electric powerplant)) DisjunctionMaxQuery((host:hybrid electric powerplants > | contentSpecificSearch:"hybrid electric powerplants" | > customContent:"hybrid electric powerplants" | title:hybrid electric > powerplants | url:hybrid electric powerplants)) > DisjunctionMaxQuery((host:Electric | contentSpecificSearch:electric | > customContent:electric | title:Electric | url:Electric)) > DisjunctionMaxQuery((host:Electrical | contentSpecificSearch:electrical | > customContent:electrical | title:Electrical | url:Electrical)) > DisjunctionMaxQuery((host:Electricity | contentSpecificSearch:electricity | > customContent:electricity | title:Electricity | url:Electricity)) > DisjunctionMaxQuery((host:Engine | contentSpecificSearch:engine | > customContent:engine | title:Engine | url:Engine)) > DisjunctionMaxQuery((host:fuel economy | contentSpecificSearch:"fuel > economy" | customContent:"fuel economy" | title:fuel economy | url:fuel > economy)) DisjunctionMaxQuery((host:fuel efficiency | > 
contentSpecificSearch:"fuel efficiency" | customContent:"fuel efficiency" | > title:fuel efficiency | url:fuel efficiency)) > DisjunctionMaxQuery((host:Hybrid Electric Propulsion | > contentSpecificSearch:"hybrid electric propulsion" | customContent:"hybrid > electric propulsion" | title:Hybrid Electric Propulsion | url:Hybrid > Electric Propulsion)) DisjunctionMaxQuery((host:Power Systems | > contentSpecificSearch:"power systems" | customContent:"power systems" | > title:Power Systems | url:Power Systems)) > DisjunctionMaxQuery((host:Powerplant | contentSpecificSearch:powerplant | > customContent:powerplant | title:Powerplant | url:Powerplant)) > DisjunctionMaxQuery((host:Propulsion | contentSpecificSearch:propulsion | > customContent:propulsion | title:Propulsion | url:Propulsion)) > DisjunctionMaxQuery((host:hybrid | contentSpecificSearch:hybrid | > customContent:hybrid | title:hybrid | url:hybrid)) > DisjunctionMaxQuery((host:hybrid electric | contentSpecificSearch:"hybrid > electric" | customContent:"hybrid
Re: Solr performance issue on querying --> Solr 6.5.1
Hi Erick, Qtime comes down with rows set to 1. It was also noted that Qtime comes down when the debug parameter is not added to the query; it comes to about 900. Thanks, Arun -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr performance issue on querying --> Solr 6.5.1
On Tue, 2017-09-26 at 07:43 -0700, sasarun wrote: > Allocated heap size for young generation is about 8 gb and old > generation is about 24 gb. And gc analysis showed peak > size utilisation is really low compared to these values. That does not come as a surprise. Your collections would normally be considered small, if not tiny, looking only at their size measured in bytes. Again, if you expect them to grow significantly (more than 10x), your allocation might make sense. If you do not expect such growth in the near future, you will be better off with a much smaller heap: the peak heap utilization that you have logged (or twice that, to err on the cautious side) seems a good starting point. And whatever you do, don't set Xmx to 32GB. Use <31GB or significantly more than 32GB: https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/ Are you indexing while you search? If so, you need to set auto-warm or state a few explicit warmup queries. If not, your measuring will not be representative, as it will be on first searches, which are always slower than warmed searches. - Toke Eskildsen, Royal Danish Library
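Toke's 32 GB warning is about compressed ordinary object pointers, which the JVM silently drops once -Xmx reaches roughly 32 GB. A trivial guard one might add to a deploy script; the heap value is illustrative only:

```shell
heap_gb=8   # proposed -Xmx in GB; illustrative value, not a recommendation
if [ "$heap_gb" -lt 32 ]; then
  echo "Xmx=${heap_gb}g keeps compressed oops"
else
  echo "Xmx=${heap_gb}g loses compressed oops; use <31g or go well above 32g"
fi
```

On a real JVM, `java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops` shows whether the flag survived the chosen heap.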
Re: Solr performance issue on querying --> Solr 6.5.1
Hi Arun, This is not the simplest query either - a dozen phrase queries on several fields, plus the same query as bq. Can you provide debugQuery info? I did not look much into debug times and what includes what, but one thing that is strange to me is that QTime is 4s while query in debug is 1.3s. Can you try running without bq? Can you include boost factors in the main query? Thanks, Emir > On 26 Sep 2017, at 16:43, sasarun wrote: > > Hi All, > I have been using Solr for some time now, but mostly in standalone mode. Now > my current project is using Solr 6.5.1 hosted on hadoop. My solrconfig.xml > has the following configuration. In the prod environment the performance on > querying seems to be really slow. Can anyone help me with a few pointers on > how to improve on the same. > > ${solr.hdfs.home:} > ${solr.hdfs.blockcache.enabled:true} > ${solr.hdfs.blockcache.slab.count:1} > ${solr.hdfs.blockcache.direct.memory.allocation:false} > ${solr.hdfs.blockcache.blocksperbank:16384} > ${solr.hdfs.blockcache.read.enabled:true} > ${solr.hdfs.blockcache.write.enabled:false} > ${solr.hdfs.nrtcachingdirectory.enable:true} > ${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16} > ${solr.hdfs.nrtcachingdirectory.maxcachedmb:192} > hdfs > It has 6 collections of the following sizes: > Collection 1 --> 6.41 MB > Collection 2 --> 634.51 KB > Collection 3 --> 4.59 MB > Collection 4 --> 1,020.56 MB > Collection 5 --> 607.26 MB > Collection 6 --> 102.4 kb > Each collection has 5 shards. Allocated heap size for young generation > is about 8 gb and old generation is about 24 gb. 
And gc analysis showed peak > size > utlisation is really low compared to these values. > But querying to Collection 4 and collection 5 is giving really slow response > even thoughwe are not using any complex queries.Output of debug quries run > with debug=timing > are given below for reference. Can anyone help suggest a way improve the > performance. > > Response to query > > > true > 0 > 3962 > > > ("hybrid electric powerplant" "hybrid electric powerplants" "Electric" > "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid > Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid" > "hybrid electric" "electric powerplant") > > edismax > true > on > > host > title > url > customContent > contentSpecificSearch > > > id > contentTagsCount > > 0 > OR > OR > 3985d7e2-3e54-48d8-8336-229e85f5d9de > 600 > > ("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0 > "Electric"^50.0 "Electrical"^50.0 "Electricity"^50.0 "Engine"^50.0 "fuel > economy"^50.0 "fuel efficiency"^50.0 "Hybrid Electric Propulsion"^50.0 > "Power Systems"^50.0 "Powerplant"^50.0 "Propulsion"^50.0 "hybrid"^15.0 > "hybrid electric"^15.0 "electric powerplant"^15.0) > > > > > > 15374.0 > > 2.0 > > 2.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > > 15363.0 > > 1313.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 14048.0 > > > > > > Thanks, > Arun > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr performance issue on querying --> Solr 6.5.1
Well, 15 second responses are not what I'd expect either. But two things (just looked again): 1> note that the time to assemble the debug information is a large majority of your total time (14 of 15.3 seconds). 2> you're specifying 600 rows, which is quite a lot, as each one requires that a 16K block of data be read from disk and decompressed to assemble the "fl" list. So one quick test would be to set rows=1 or something. All that said, the QTime value returned does _not_ include <1> or <2> above, and even 4 seconds seems excessive. Best, Erick On Tue, Sep 26, 2017 at 10:54 AM, sasarun wrote: > Hi Erick, > > Thank you for the quick response. Query time was relatively faster once it > is read from memory. But personally I always felt response time could be far > better. As suggested, we will try to set up in a non-HDFS environment and > update on the results. > > Thanks, > Arun > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
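Erick's rows experiment is easy to script. A sketch with a hypothetical collection name and a stand-in query; only rows changes between requests, so any QTime difference points at stored-field retrieval (rows=0, as Emir suggested earlier, shows the floor):

```shell
q='"hybrid electric powerplant"'   # stand-in for the full OR query in the thread
for rows in 600 1 0; do
  echo "GET /solr/collection4/select?q=${q}&rows=${rows}"
done
# With a live cluster, each line would be fetched with curl and QTime compared.
```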
Re: Solr performance issue on querying --> Solr 6.5.1
Does the query time _stay_ low? Once the data is read from HDFS it should pretty much stay in memory. So my question is whether, once Solr warms up, you see this kind of query response time. Have you tried this on a non-HDFS system? That would be useful to help figure out where to look. And given the sizes of your collections, unless you expect them to get much larger, there's no reason to shard any of them. Sharding should only really be used when the collections are too big for a single shard, as distributed searches inevitably have increased overhead. I expect _at least_ 20M documents/shard, and have seen 200M docs/shard. YMMV of course. Best, Erick On Tue, Sep 26, 2017 at 7:43 AM, sasarun wrote: > Hi All, > I have been using Solr for some time now, but mostly in standalone mode. Now > my current project is using Solr 6.5.1 hosted on hadoop. My solrconfig.xml > has the following configuration. In the prod environment the performance on > querying seems to be really slow. Can anyone help me with a few pointers on > how to improve on the same. 
> > ${solr.hdfs.home:} > ${solr.hdfs.blockcache.enabled:true} > ${solr.hdfs.blockcache.slab.count:1} > ${solr.hdfs.blockcache.direct.memory.allocation:false} > ${solr.hdfs.blockcache.blocksperbank:16384} > ${solr.hdfs.blockcache.read.enabled:true} > ${solr.hdfs.blockcache.write.enabled:false} > ${solr.hdfs.nrtcachingdirectory.enable:true} > ${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16} > ${solr.hdfs.nrtcachingdirectory.maxcachedmb:192} > hdfs > It has 6 collections of the following sizes: > Collection 1 --> 6.41 MB > Collection 2 --> 634.51 KB > Collection 3 --> 4.59 MB > Collection 4 --> 1,020.56 MB > Collection 5 --> 607.26 MB > Collection 6 --> 102.4 kb > Each collection has 5 shards. Allocated heap size for young generation > is about 8 gb and old generation is about 24 gb. And gc analysis showed peak > size utilisation is really low compared to these values. > But querying Collection 4 and Collection 5 gives a really slow response, > even though we are not using any complex queries. Output of debug queries run > with debug=timing > is given below for reference. Can anyone suggest a way to improve the > performance. 
> > Response to query > > > true > 0 > 3962 > > > ("hybrid electric powerplant" "hybrid electric powerplants" "Electric" > "Electrical" "Electricity" "Engine" "fuel economy" "fuel efficiency" "Hybrid > Electric Propulsion" "Power Systems" "Powerplant" "Propulsion" "hybrid" > "hybrid electric" "electric powerplant") > > edismax > true > on > > host > title > url > customContent > contentSpecificSearch > > > id > contentTagsCount > > 0 > OR > OR > 3985d7e2-3e54-48d8-8336-229e85f5d9de > 600 > > ("hybrid electric powerplant"^100.0 "hybrid electric powerplants"^100.0 > "Electric"^50.0 "Electrical"^50.0 "Electricity"^50.0 "Engine"^50.0 "fuel > economy"^50.0 "fuel efficiency"^50.0 "Hybrid Electric Propulsion"^50.0 > "Power Systems"^50.0 "Powerplant"^50.0 "Propulsion"^50.0 "hybrid"^15.0 > "hybrid electric"^15.0 "electric powerplant"^15.0) > > > > > > 15374.0 > > 2.0 > > 2.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > > 15363.0 > > 1313.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 14048.0 > > > > > > Thanks, > Arun > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Suggestions on scaling up Solr performance.
Impossible to answer, as Shawn says. Or even to recommend. For instance, you say "but once we launch our application all across the world it may give performance issues." You haven't defined at all what changes when you "launch our application all across the world". Increasing your query traffic 10-fold? Trying to index 100 times the number of docs you have now? 10,000 times the number of docs you have now? Best, Erick On Thu, May 11, 2017 at 8:11 AM, Venkateswarlu Bommineni wrote: > Thanks, Shawn. > > As of now, we don't have any performance issues. We are just working for > the future purpose. > > So I was looking for any general architecture which is agreed by many of > Solr experts. > > Thanks, > Venkat. > > On Thu, May 11, 2017 at 8:19 PM, Shawn Heisey wrote: >> On 5/11/2017 7:39 AM, Venkateswarlu Bommineni wrote: >> > In current design we have below configuration: *One collection with >> > one shard with 4 replication factor with 4 nodes.* as of now, it is >> > working fine.but once we launch our application all across the world >> > it may give performance issues. To improve the performance below is >> > our thought: one of the design we found is: *Adding a new node and >> > adding a new replication to existing solrcloud.* Please suggest any >> > other approaches which give better performance. >> >> Knowing the number of nodes, shards, and replicas is not enough >> information to even make guesses. >> >> https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ >> >> Even with a LOT more information, any recommendations we made would be >> just that -- guesses. Those guesses might be completely wrong, or >> represent a lot more expense than you really need. >> >> The exact kind of setup you need is affected by a great many things. 
>> Here's a few of them: request rate, complexity of queries, contents of the >> index, size of the index, Solr cache settings, schema settings, number of >> documents, number of shards, amount of memory in the server, amount of >> memory in the java heap. >> >> Even the phrase "improve our performance" is vague. What kind of >> performance issue are you having? >> >> Thanks, >> Shawn >> >>
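Venkat's "add a node, then add a replica" plan maps to a single Collections API call once the new node has joined ZooKeeper. A hedged sketch; the host, collection, and shard names are assumptions:

```shell
# Hedged sketch: host, collection, and shard names are assumptions.
url="http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1"
echo "$url"
# With a live cluster: curl "$url"; append &node=<nodeName> to pin the replica
# to the newly added node instead of letting Solr pick one.
```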
Re: Solr performance on EC2 linux
Already have a Jira issue for next week. I have a script to run prod logs against a cluster. I’ll be testing a four shard by two replica cluster with 17 million docs and very long queries. We are working on getting the 95th percentile under one second, so we should exercise the timeAllowed feature. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 3, 2017, at 3:53 PM, Rick Leir wrote: > > +Walter test it > > Jeff, > How much CPU does the EC2 hypervisor use? I have heard 5%, but that is for a > normal workload, and is mostly consumed during system calls or context > switches. So it is quite understandable that frequent time calls would take a > bigger bite in the AWS cloud compared to bare metal. Sorry, my words are > mostly conjecture, so please ignore. Cheers -- Rick > > On May 3, 2017 2:35:33 PM EDT, Jeff Wartes wrote: >> >> It’s presumably not a small degradation - this guy very recently >> suggested it’s 77% slower: >> https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/ >> >> The other reason that blog post is interesting to me is that his >> benchmark utility showed the work of entering the kernel as high system >> time, which is also what I was seeing. >> >> I really want to go back and try some more tests, including (now) >> disabling the timeAllowed param in my query corpus. >> I think I’m still a few weeks of higher-priority issues away from that >> though. >> >> >> On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" wrote: >> >> I remember seeing some performance impact (even when not using it), and >> it was attributed to the calls to System.nanoTime. See SOLR-7875 and >> SOLR-7876 (fixed for 5.3 and 5.4). Those two Jiras fix the impact when >> timeAllowed is not used, but I don't know if there were more changes to improve the >> performance of the feature itself. The problem was that System.nanoTime >> may be called too many times on indices with many different terms. 
>> If this is the problem Jeff is seeing, a small degradation of System.nanoTime could have a big impact. >> >> Tomás >> >> On Tue, May 2, 2017 at 10:23 AM, Walter Underwood wrote: >>> Hmm, has anyone measured the overhead of timeAllowed? We use it all the time. >>> >>> If nobody has, I’ll run a benchmark with and without it. >>> >>> wunder >>> Walter Underwood >>> wun...@wunderwood.org >>> http://observer.wunderwood.org/ (my blog) >>> >>>> On May 2, 2017, at 9:52 AM, Chris Hostetter wrote: >>>> >>>> : I specify a timeout on all queries, >>>> >>>> Ah -- ok, yeah -- you mean using "timeAllowed" correct? >>>> >>>> If the root issue you were seeing is in fact clocksource related, then using timeAllowed would probably be a significant compounding factor there since it would involve a lot of time checks in a single request (even w/o any debugging enabled) >>>> >>>> (did your coworker's experiments with ES use any sort of equivalent timeout feature?) >>>> >>>> -Hoss >>>> http://www.lucidworks.com/ > > -- > Sorry for being brief. Alternate email is rickleir at yahoo dot com
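Walter's benchmark is essentially an A/B replay of the same query corpus with and without timeAllowed. A sketch of the two request shapes; the collection name and the query are stand-ins:

```shell
q='body:example'   # stand-in query; collection name below is an assumption
baseline="/solr/test/select?q=${q}"
capped="/solr/test/select?q=${q}&timeAllowed=1000"   # cap in milliseconds
echo "baseline: ${baseline}"
echo "capped:   ${capped}"
# Replaying production logs through both forms and comparing 95th-percentile
# latency isolates the cost of the extra System.nanoTime checks.
```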
Re: Solr performance on EC2 linux
+Walter test it

Jeff,
How much CPU does the EC2 hypervisor use? I have heard 5%, but that is for a normal workload, and is mostly consumed during system calls or context switches. So it is quite understandable that frequent time calls would take a bigger bite in the AWS cloud compared to bare metal. Sorry, my words are mostly conjecture so please ignore.

Cheers -- Rick

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Solr performance on EC2 linux
It’s presumably not a small degradation - this guy very recently suggested it’s 77% slower:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/

The other reason that blog post is interesting to me is that his benchmark utility showed the work of entering the kernel as high system time, which is also what I was seeing.

I really want to go back and try some more tests, including (now) disabling the timeAllowed param in my query corpus. I think I’m still a few weeks of higher-priority issues away from that though.
Re: Solr performance on EC2 linux
I remember seeing some performance impact (even when not using it) and it was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876 (fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is not used, but I don't know if there were more changes to improve the performance of the feature itself. The problem was that System.nanoTime may be called too many times on indices with many different terms. If this is the problem Jeff is seeing, a small degradation of System.nanoTime could have a big impact.

Tomás
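The per-call cost Tomás describes is easy to eyeball in isolation. Below is a rough, self-contained microbenchmark; treat it as a sketch rather than the SOLR-7875 methodology, since absolute numbers vary wildly with JIT warmup and with the active clocksource (the "xen" source makes each read far more expensive than "tsc").

```java
// Rough microbenchmark of the per-call cost of System.nanoTime().
// On a slow clocksource each call goes through the hypervisor/kernel,
// so the per-call figure can be an order of magnitude higher than on
// bare metal. Numbers are only meaningful relative to each other.
public class NanoTimeCost {
    public static void main(String[] args) {
        final long iterations = 10_000_000L;
        long sink = 0; // accumulate results so the JIT can't drop the calls

        // Warm up so the loop is compiled before we measure.
        for (long i = 0; i < 1_000_000L; i++) sink += System.nanoTime();

        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) sink += System.nanoTime();
        long elapsed = System.nanoTime() - start;

        System.out.printf("~%.1f ns per System.nanoTime() call (sink=%d)%n",
                (double) elapsed / iterations, sink);
    }
}
```

Running this on the datacenter hosts and the EC2 nodes side by side would show directly whether the clock reads themselves got slower.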
Re: Solr performance on EC2 linux
Hmm, has anyone measured the overhead of timeAllowed? We use it all the time.

If nobody has, I’ll run a benchmark with and without it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Solr performance on EC2 linux
: I specify a timeout on all queries,

Ah -- ok, yeah -- you mean using "timeAllowed" correct?

If the root issue you were seeing is in fact clocksource related, then using timeAllowed would probably be a significant compounding factor there, since it would involve a lot of time checks in a single request (even w/o any debugging enabled).

(Did your coworker's experiments with ES use any sort of equivalent timeout feature?)

-Hoss
http://www.lucidworks.com/
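For reference, timeAllowed is just a per-request parameter in milliseconds; when the budget is exceeded Solr returns whatever it has found so far, flagged with partialResults=true in the response header. A minimal sketch of attaching it to a select URL with plain JDK classes (the host, collection name, and field are placeholders, not anything from this thread):

```java
// Build a Solr /select URL carrying the timeAllowed parameter.
// No SolrJ dependency; just JDK URL encoding.
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TimeAllowedQuery {
    // timeAllowedMs bounds how long the search phase may run before
    // Solr gives up and returns partial results.
    static String buildSelectUrl(String base, String q, int timeAllowedMs) {
        String enc = URLEncoder.encode(q, StandardCharsets.UTF_8);
        return base + "/select?q=" + enc + "&timeAllowed=" + timeAllowedMs;
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrl(
                "http://localhost:8983/solr/mycollection", "title:solr", 900));
    }
}
```

The relevant point for this thread is not the URL itself but that enforcing the budget requires the query code to re-check the clock frequently, which is exactly where a slow time source compounds.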
Re: Solr performance on EC2 linux
Yes, that’s the Xenial I tried. Ubuntu 16.04.2 LTS.
Re: Solr performance on EC2 linux
I started with the same three-node, 15-shard configuration I’d been used to, in an RF1 cluster. (The index is almost 700G, so this takes three r4.8xlarges if I want to be entirely memory-resident.) I eventually dropped down to a 1/3rd-size index on a single node (so 5 shards, 100M docs each) so I could test configurations more quickly. The high system time usage was present on all Solr nodes regardless. I adjusted for the difference in CPU count on the EC2 nodes when I picked my load testing rates.

ZooKeeper is a separate cluster on separate nodes. It is NOT collocated with Solr, although it’s dedicated exclusively to Solr’s use.

I specify a timeout on all queries, and as mentioned, use SOLR-4449. So there’s possibly an argument I’m doing a lot more timing-related calls than most. There’s nothing particularly exotic there though, just another ExecutorService, and you’ll never get a backup request on an RF1 cluster because there’s no alternate to try.
Re: Solr performance on EC2 linux
Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?
Re: Solr performance on EC2 linux
Might want to measure the single-CPU performance of your EC2 instance. The last time I checked, my MacBook was twice as fast as the EC2 instance I was using.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Solr performance on EC2 linux
: tldr: Recently, I tried moving an existing solrcloud configuration from
: a local datacenter to EC2. Performance was roughly 1/10th what I’d
: expected, until I applied a bunch of linux tweaks.

How many total nodes in your cluster? How many of them running ZooKeeper?

Did you observe the heavy increase in system-time CPU usage on all nodes, or just the ones running ZooKeeper?

I ask because if your speculation is correct and it is an issue of clocksource, then perhaps ZK is where the majority of those system calls are happening, and perhaps that's why you didn't see any similar heavy system CPU load in ES?

(Then again: at the lowest levels "lucene" really shouldn't care about anything clock related at all. Any "time"-related code would live in the Solr level ... hmmm.)

-Hoss
http://www.lucidworks.com/
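On Linux the active clocksource is visible in sysfs, which makes the clocksource speculation easy to check per node. A small sketch (these are the standard sysfs paths; on EC2 Xen instances "xen" is the usual default, and "tsc" is what the tuning page switches to, with the backwards-drift caveat):

```java
// Print the kernel's current and available clocksources (Linux only).
// Run on each node; if "current" is "xen" rather than "tsc", every
// gettimeofday/clock_gettime goes through the hypervisor.
import java.nio.file.Files;
import java.nio.file.Path;

public class ClockSource {
    private static final Path BASE =
            Path.of("/sys/devices/system/clocksource/clocksource0");

    public static void main(String[] args) throws Exception {
        System.out.println("current:   "
                + Files.readString(BASE.resolve("current_clocksource")).trim());
        System.out.println("available: "
                + Files.readString(BASE.resolve("available_clocksource")).trim());
    }
}
```

The same check is of course a one-liner with `cat` in a shell; reading it from the JVM is only useful if you want to log it alongside other node diagnostics.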
Re: Solr performance on EC2 linux
I tried a few variations of various things before we found and tried that linux/EC2 tuning page, including:
- EC2 instance type: r4, c4, and i3
- Ubuntu version: Xenial and Trusty
- EBS vs local storage
- Stock OpenJDK vs Zulu OpenJDK (recent Java 8 in both cases - I’m aware of the issues with early Java 8 versions, and I’m not using G1)

Most of those attempts were to help reduce differences between the data center and the EC2 cluster. In all cases I re-indexed from scratch. I got the same very high system-time symptom in all cases. With the linux changes in place, we settled on r4/Xenial/EBS/Stock.

Again, this was a slightly modified Solr 5.4. (I added backup requests, and two memory-allocation-rate tweaks that have long since been merged into mainline - released in 6.2, I think. I can dig up the Jira numbers if anyone’s interested.) I’ve never used Solr 6.x in production though. The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his ES setup, although I think he did try G1.

I definitely do want to binary-search those settings until I understand better what exactly did the trick. The long cycle time per test is the problem, but hopefully in the next couple of weeks.
Re: Solr performance on EC2 linux
It's also very important to consider the type of EC2 instance you are using...

We settled on the R4.2XL... The R series is labeled "High-Memory".

Which instance type did you end up using?
Re: Solr performance on EC2 linux
On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.

How very strange. I knew virtualization would have overhead, possibly even measurable overhead, but that's insane. Running on bare metal is always better if you can do it. I would be curious what would happen on your original install if you applied similar tuning to that. Would you see a speedup there?

> Interestingly, a coworker playing with an ElasticSearch (ES 5.x, so a much more recent release) alternate implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.

That's even weirder. ES 5.x will likely be using Points field types for numeric fields, and although those are faster than what Solr currently uses, I doubt it could explain that difference. The implication here is that the ES systems are running with stock EC2 settings, not the tuned settings ... but I'd like you to confirm that. Same Java version as with Solr? IMHO, Java itself is more likely to cause issues like you saw than Solr.

> I’m writing this for a few reasons:
>
> 1. The performance difference was so crazy I really feel like this should really be broader knowledge.

Definitely agree! I would be very interested in learning which of the tunables you changed were major contributors to the improvement. If it turns out that Solr's code is sub-optimal in some way, maybe we can fix it.

> 2. If anyone is aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s the clocksource that’s the issue, there’s an implication that Solr was using tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t allow in userspace.

I had not considered the performance regression in 6.4.0 and 6.4.1 that Erick mentioned. Were you still running Solr 5.4, or was it a 6.x version?

=

Specific thoughts on the tuning:

The noatime option is very good to use. I also use nodiratime on my systems. Turning these off can have *massive* impacts on disk performance. If these are the source of the speedup, then the machine doesn't have enough spare memory.

I'd be wary of the "nobarrier" mount option. If the underlying storage has battery-backed write caches, or is SSD without write caching, it wouldn't be a problem. Here's info about the "discard" mount option; I don't know whether it applies to your Amazon storage:

    discard/nodiscard
        Controls whether ext4 should issue discard/TRIM commands to the
        underlying block device when blocks are freed. This is useful
        for SSD devices and sparse/thinly-provisioned LUNs, but it is
        off by default until sufficient testing has been done.

The network tunables would have more of an effect in a distributed environment like EC2 than they would on a LAN.

Thanks,
Shawn
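Whether noatime (and nodiratime) actually took effect is easy to verify against /proc/mounts. A sketch below; the line parsing is generic, and which mount point holds your index depends entirely on your layout:

```java
// Check /proc/mounts for a given mount option (e.g. "noatime").
// Each line is: device mountpoint fstype options freq passno
import java.nio.file.Files;
import java.nio.file.Path;

public class MountCheck {
    // Returns true if this /proc/mounts line carries the exact option.
    // Options are compared whole, so "relatime" does not match "atime".
    static boolean hasOption(String procMountsLine, String option) {
        String[] fields = procMountsLine.split("\\s+");
        if (fields.length < 4) return false;
        for (String opt : fields[3].split(",")) {
            if (opt.equals(option)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        for (String line : Files.readAllLines(Path.of("/proc/mounts"))) {
            String[] f = line.split("\\s+");
            if (f.length >= 4) {
                System.out.printf("%-30s noatime=%b%n",
                        f[1], hasOption(line, "noatime"));
            }
        }
    }
}
```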
Re: Solr performance on EC2 linux
I’d like to think I helped a little with the metrics upgrade that got released in 6.4, so I was already watching that and I’m aware of the resulting performance issue. This was 5.4 though, patched with https://github.com/whitepages/SOLR-4449 - an index we’ve been running for some time now.

Mganeshs’s comment that he doesn’t see a difference on EC2 with Solr 6.2 lends some additional strength to the thought that something changed between Lucene 5.4 and 6.2 (which is used in ES 5), but of course it’s all still pretty anecdotal.

On 4/28/17, 11:44 AM, "Erick Erickson" wrote:

Well, 6.4.0 had a pretty severe performance issue so if you were using that release you might see this; 6.4.2 is the most recent 6.4 release. But I have no clue how changing linux settings would alter that, and I sure can't square that issue with you having such different performance between local and EC2. But thanks for telling us about this! It's totally baffling.

Erick

On Fri, Apr 28, 2017 at 9:09 AM, Jeff Wartes wrote:
>
> tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.
>
> This should’ve been a straight port: one datacenter server -> one EC2 node. Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such that the entire index could be cached in memory, and the JVM settings were identical in both places. I applied what should’ve been a comfortable load to the EC2 cluster, and everything exploded. I had to back the rate down to something close to 10% of what I had been getting in the datacenter before latency improved.
> Looking around, I was interested to note that under load, user-time CPU usage was being shadowed by an almost equal amount of system CPU time. This was not IOWait, but system time. Strace showed a bunch of time being spent in futex and restart_syscall, but I couldn’t see where to go from there.
>
> Interestingly, a coworker playing with an Elasticsearch (ES 5.x, so a much more recent release) alternate implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.
>
> Eventually, we came across this: http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
> In direct opposition to the author’s intent (something about taking expired medication), we applied these settings blindly to see what happened. The difference was breathtaking. The system time usage disappeared, and I could apply load at and even a little above my expected rates, well within my latency goals.
>
> There are a number of settings involved, and we haven’t isolated for sure which ones made the biggest difference, but my guess at the moment is that it’s the change of clocksource. I think this would be consistent with the observed system time. Note however that using the “tsc” clocksource on EC2 is generally discouraged, because it’s possible to get backwards clock drift.
>
> I’m writing this for a few reasons:
>
> 1. The performance difference was so crazy I really feel like this should really be broader knowledge.
>
> 2. If anyone is aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s the clocksource that’s the issue, there’s an implication that Solr was using tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t allow in userspace.
>
> 3. Has anyone run Solr with the “tsc” clocksource, and is aware of any concrete issues?
Re: Solr performance on EC2 linux
We use Solr 6.2 in an EC2 instance with CentOS 6.2 and we don't see any difference in performance between EC2 and our local environment. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-on-EC2-linux-tp4332467p4332553.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr performance on EC2 linux
Well, 6.4.0 had a pretty severe performance issue so if you were using that release you might see this; 6.4.2 is the most recent 6.4 release. But I have no clue how changing linux settings would alter that, and I sure can't square that issue with you having such different performance between local and EC2. But thanks for telling us about this! It's totally baffling.

Erick

On Fri, Apr 28, 2017 at 9:09 AM, Jeff Wartes wrote:
>
> tldr: Recently, I tried moving an existing solrcloud configuration from a
> local datacenter to EC2. Performance was roughly 1/10th what I’d expected,
> until I applied a bunch of linux tweaks.
>
> This should’ve been a straight port: one datacenter server -> one EC2 node.
> Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such that
> the entire index could be cached in memory, and the JVM settings were
> identical in both places. I applied what should’ve been a comfortable load to
> the EC2 cluster, and everything exploded. I had to back the rate down to
> something close to 10% of what I had been getting in the datacenter before
> latency improved.
> Looking around, I was interested to note that under load, user-time CPU usage
> was being shadowed by an almost equal amount of system CPU time. This was not
> IOWait, but system time. Strace showed a bunch of time being spent in futex
> and restart_syscall, but I couldn’t see where to go from there.
The system time usage disappeared, and I could > apply load at and even a little above my expected rates, well within my > latency goals. > > There are a number of settings involved, and we haven’t isolated for sure > which ones made the biggest difference, but my guess at the moment is that > it’s the change of clocksource. I think this would be consistent with the > observed system time. Note however that using the “tsc” clocksource on EC2 is > generally discouraged, because it’s possible to get backwards clock drift. > > I’m writing this for a few reasons: > > 1. The performance difference was so crazy I really feel like this > should really be broader knowledge. > > 2. If anyone is aware of anything that changed in Lucene between 5.4 > and 6.x that could explain why Elasticsearch wasn’t suffering from this? If > it’s the clocksource that’s the issue, there’s an implication that Solr was > using tons more system calls like gettimeofday that the EC2 (xen) hypervisor > doesn’t allow in userspace. > > 3. Has anyone run Solr with the “tsc” clocksource, and is aware of any > concrete issues? > >
Solr performance on EC2 linux
tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.

This should’ve been a straight port: one datacenter server -> one EC2 node. Solr 5.4, Solrcloud, Ubuntu xenial. Nodes were sized in both cases such that the entire index could be cached in memory, and the JVM settings were identical in both places. I applied what should’ve been a comfortable load to the EC2 cluster, and everything exploded. I had to back the rate down to something close to 10% of what I had been getting in the datacenter before latency improved.

Looking around, I was interested to note that under load, user-time CPU usage was being shadowed by an almost equal amount of system CPU time. This was not IOWait, but system time. Strace showed a bunch of time being spent in futex and restart_syscall, but I couldn’t see where to go from there.

Interestingly, a coworker playing with an Elasticsearch (ES 5.x, so a much more recent release) alternate implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.

Eventually, we came across this: http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html

In direct opposition to the author’s intent (something about taking expired medication), we applied these settings blindly to see what happened. The difference was breathtaking. The system time usage disappeared, and I could apply load at and even a little above my expected rates, well within my latency goals.

There are a number of settings involved, and we haven’t isolated for sure which ones made the biggest difference, but my guess at the moment is that it’s the change of clocksource. I think this would be consistent with the observed system time.
Note however that using the “tsc” clocksource on EC2 is generally discouraged, because it’s possible to get backwards clock drift.

I’m writing this for a few reasons:

1. The performance difference was so crazy I really feel like this should really be broader knowledge.

2. If anyone is aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s the clocksource that’s the issue, there’s an implication that Solr was using tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t allow in userspace.

3. Has anyone run Solr with the “tsc” clocksource, and is aware of any concrete issues?
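On point 2 above, the cost of clock reads is easy to probe from userspace. A minimal sketch follows, in Python rather than Java, so the absolute numbers won't match a JVM; but the difference between a hypervisor-trapping clocksource and a cheap vDSO read shows up the same way in either runtime:

```python
# Micro-benchmark the cost of reading the system clock.
# On a Xen guest using the "xen" clocksource, each read may trap to the
# hypervisor; with "tsc" it is typically a cheap userspace (vDSO) read.
import time

def time_clock_calls(n=100_000):
    """Return average seconds per time.time() call."""
    start = time.perf_counter()
    for _ in range(n):
        time.time()
    return (time.perf_counter() - start) / n

per_call = time_clock_calls()
print(f"~{per_call * 1e9:.0f} ns per clock read")
```

Running this before and after switching clocksource (via `/sys/devices/system/clocksource/clocksource0/current_clocksource`) would show whether gettimeofday-style calls are the expensive path on a given instance.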
RE: Solr performance issue on indexing
> Also we will try to decouple tika to solr. +1 -Original Message- From: tstusr [mailto:ulfrhe...@gmail.com] Sent: Friday, March 31, 2017 4:31 PM To: solr-user@lucene.apache.org Subject: Re: Solr performance issue on indexing Hi, thanks for the feedback. Yes, it is about OOM, indeed even solr instance makes unavailable. As I was saying I can't find more relevant information on logs. We're are able to increment JVM amout, so, the first thing we'll do will be that. As far as I know, all documents are bounded to that amount (14K), just the processing could change. We are making some tests on indexing and it seems it works without concurrent threads. Also we will try to decouple tika to solr. By the way, make it available with solr cloud will improve performance? Or there will be no perceptible improvement? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr performance issue on indexing
If, by chance, the docs you're sending get routed to different Solr nodes then all the processing is in parallel. I don't know if there's a good way to ensure that the docs get sent to different replicas on different Solr instances. You could try addressing specific Solr replicas, something like "blah blah/solr/collection1_shard1_replica1/export", but I'm not totally sure that'll do what you want either.

But that still doesn't decouple Tika from the Solr instances running those replicas. So if Tika has a problem it has the potential to bring the Solr node down.

Best,
Erick

On Fri, Mar 31, 2017 at 1:31 PM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi, thanks for the feedback.
>
> Yes, it is about OOM, indeed even solr instance makes unavailable. As I was
> saying I can't find more relevant information on logs.
>
> We're are able to increment JVM amout, so, the first thing we'll do will be
> that.
>
> As far as I know, all documents are bounded to that amount (14K), just the
> processing could change. We are making some tests on indexing and it seems
> it works without concurrent threads. Also we will try to decouple tika to
> solr.
>
> By the way, make it available with solr cloud will improve performance? Or
> there will be no perceptible improvement?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr performance issue on indexing
Hi, thanks for the feedback.

Yes, it is about OOM; indeed, the Solr instance even becomes unavailable. As I was saying, I can't find more relevant information in the logs.

We are able to increase the JVM memory, so that will be the first thing we do.

As far as I know, all documents are bounded to that amount (14K); just the processing could change. We are making some tests on indexing and it seems to work without concurrent threads. We will also try to decouple Tika from Solr.

By the way, would making it available with SolrCloud improve performance? Or would there be no perceptible improvement?

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr performance issue on indexing
First, running multiple threads with PDF files to a Solr running 4G of JVM heap is... ambitious. You say it crashes; how? OOMs?

Second, while the extracting request handler is a fine way to get up and running, any problems with Tika will affect Solr. Tika does a great job of extraction, but there are so many variants of so many file formats that this scenario isn't recommended for production. Consider extracting the PDF on a client and sending the docs to Solr. Tika can run as a server also, so you aren't coupling Solr and Tika. For a sample SolrJ program, see: https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files, the main handler to index is
> /extract where we perform simple processing (extract relevant fields and
> store on some fields).
>
> The PDF files are about 10M~100M size and we have to have available the text
> extracted. So, everything works correct on test stages, but when we try to
> index all the 14K files (around 120Gb) on a client application that only
> sends http curls through 3-4 concurrent threads to /extract handler it
> crashes. I can't find some relevant information about on solr logs (We
> checked in server/logs & in core_dir/tlog).
>
> My question is about performance. I think it is a small amount of info we
> are processing, the deploy scenario is in a docker container with 4gb of JVM
> Memory and ~50gb of physical memory (reported through dashboard) we are
> using a single instance.
>
> I don't think is a normal behaviour that handler crashes. So, what are some
> general tips about improving performance for this scenario?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
> Sent from the Solr - User mailing list archive at Nabble.com.
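The shape of Erick's suggestion (extract on the client, send plain documents to Solr) can be sketched as below. Both `extract_text` and `send_batch` are hypothetical stand-ins, not real library calls: a real client would invoke Tika (e.g. a tika-server instance) in the first and POST to Solr's update handler in the second. The point is only the pipeline structure that keeps extraction out of the Solr JVM:

```python
# Sketch: client-side extraction decoupled from Solr indexing.
# extract_text() and send_batch() are stand-ins for real Tika and Solr
# calls; a Tika crash here only loses one client worker, not a Solr node.
from concurrent.futures import ThreadPoolExecutor

def extract_text(path):
    # Stand-in: a real client would send the file to Tika here.
    return {"id": path, "text": f"extracted contents of {path}"}

def send_batch(docs):
    # Stand-in: a real client would POST these docs to Solr's update handler.
    return len(docs)

paths = [f"report_{i}.pdf" for i in range(10)]

# Extraction (the Tika-bound work) runs in parallel on the client side.
with ThreadPoolExecutor(max_workers=4) as pool:
    docs = list(pool.map(extract_text, paths))

# Indexing then receives only small, already-extracted documents, batched.
batch_size = 5
sent = sum(send_batch(docs[i:i + batch_size])
           for i in range(0, len(docs), batch_size))
print(f"indexed {sent} docs")
```

The design choice mirrors the SolrJ article Erick links: Solr never sees a raw PDF, so a malformed file can only break a client worker.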
Solr performance issue on indexing
Hi there.

We are currently indexing some PDF files; the main handler we use to index is /extract, where we perform simple processing (extract relevant fields and store them on some fields).

The PDF files are about 10M~100M in size, and we need the extracted text to be available. Everything works correctly in the test stages, but when we try to index all 14K files (around 120Gb) from a client application that only sends HTTP curls through 3-4 concurrent threads to the /extract handler, it crashes. I can't find any relevant information about it in the solr logs (we checked server/logs & core_dir/tlog).

My question is about performance. I think this is a small amount of data to be processing. The deploy scenario is a docker container with 4gb of JVM memory and ~50gb of physical memory (reported through the dashboard), and we are using a single instance.

I don't think it's normal behaviour for the handler to crash. So, what are some general tips for improving performance in this scenario?

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: solr | performance warning
Thanks Erick

Regards,
Prateek Jain

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 21 November 2016 04:32 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: solr | performance warning

_when_ are you seeing this? I see this on startup upon occasion, and I _think_ there's a JIRA about startup opening more than one searcher on startup. If it _is_ on startup, you can simply ignore it. If it's after the system is up and running, then you're probably committing too frequently. "Too frequently" means your autowarm interval is longer than your commit interval. It's usually best to just let autocommit handle this BTW. This is totally on a per-core basis. You won't get this warning if you commit to coreA and coreB simultaneously, only if you commit to an individual core too frequently.

Best,
Erick

On Mon, Nov 21, 2016 at 7:47 AM, Prateek Jain J <prateek.j.j...@ericsson.com> wrote:
>
> Hi All,
>
> I am observing following error in logs, any clues about this:
>
> 2016-11-06T23:15:53.066069+00:00@solr@@ org.apache.solr.core.SolrCore:1650 - [my_custom_core] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> Slight web search suggests that it could be a case of too-frequent commits. I have multiple cores running in solr, so one or the other would be commiting at some time. Any clues/pointers/suggestions?
>
> I am using solr 4.8.1.
>
> Regards,
> Prateek Jain
>
Re: solr | performance warning
_when_ are you seeing this? I see this on startup upon occasion, and I _think_ there's a JIRA about startup opening more than one searcher. If it _is_ on startup, you can simply ignore it.

If it's after the system is up and running, then you're probably committing too frequently. "Too frequently" means your autowarm interval is longer than your commit interval. It's usually best to just let autocommit handle this, BTW.

This is totally on a per-core basis. You won't get this warning if you commit to coreA and coreB simultaneously, only if you commit to an individual core too frequently.

Best,
Erick

On Mon, Nov 21, 2016 at 7:47 AM, Prateek Jain J wrote:
>
> Hi All,
>
> I am observing following error in logs, any clues about this:
>
> 2016-11-06T23:15:53.066069+00:00@solr@@ org.apache.solr.core.SolrCore:1650 - [my_custom_core] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> Slight web search suggests that it could be a case of too-frequent commits. I have multiple cores running in solr, so one or the other would be commiting at some time. Any clues/pointers/suggestions?
>
> I am using solr 4.8.1.
>
> Regards,
> Prateek Jain
>
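Erick's "too frequently" rule can be turned into simple arithmetic: if autowarming a new searcher takes longer than the gap between commits, warming searchers pile up. A rough model, using made-up timings rather than anything measured on a real core:

```python
# Model of overlapping warming searchers: every commit opens a new
# searcher, and each searcher takes warm_time_s to autowarm. If warming
# outlasts the commit interval, warming searchers overlap ("on deck").
import math

def on_deck_searchers(commit_interval_s, warm_time_s):
    """Approximate number of concurrently-warming searchers."""
    return math.ceil(warm_time_s / commit_interval_s)

# Hypothetical numbers: committing every 5s while warming takes 12s
# exceeds the default maxWarmingSearchers and triggers the warning.
print(on_deck_searchers(5, 12))
# Committing every 60s lets each searcher finish warming first.
print(on_deck_searchers(60, 12))
```

The fix Erick describes is exactly lowering the first argument's frequency (commit less often) or shrinking the second (smaller autowarm counts).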
solr | performance warning
Hi All,

I am observing the following error in the logs; any clues about this?

2016-11-06T23:15:53.066069+00:00@solr@@ org.apache.solr.core.SolrCore:1650 - [my_custom_core] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

A quick web search suggests that it could be a case of too-frequent commits. I have multiple cores running in solr, so one or the other would be committing at some time. Any clues/pointers/suggestions?

I am using solr 4.8.1.

Regards,
Prateek Jain
Re: Disable hyper-threading for better Solr performance?
SolrCloud, faster disks, and multiple cores on different physical disks would help. On Mar 9, 2016 2:22 PM, "Vincenzo D'Amore" <v.dam...@gmail.com> wrote: > Upgrading to Solr 5 should improve your indexing performance. > > > http://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/ > > On Wed, Mar 9, 2016 at 1:13 PM, Avner Levy <av...@checkpoint.com> wrote: > > > Currently I'm using Solr 4.8.1 but I can move to another version if it > > performs significantly faster. > > My target is to reach the max indexing throughput possible on the > machine. > > Since it seems the indexing process is CPU bound I was wondering whether > > 32 logical cores with twice indexing threads will perform better. > > Thanks, > > Avner > > > > -Original Message- > > From: Ilan Schwarts [mailto:ila...@gmail.com] > > Sent: Wednesday, March 09, 2016 9:09 AM > > To: solr-user@lucene.apache.org > > Subject: Re: Disable hyper-threading for better Solr performance? > > > > What is the solr version and shard config? Standalone? Multiple cores? > > Spread over RAID ? > > On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote: > > > > > I have a machine with 16 real cores (32 with HT enabled). > > > I'm running on it a Solr server and trying to reach maximum > > > performance for indexing and queries (indexing 20k documents/sec by a > > > number of threads). > > > I've read on multiple places that in some scenarios / products > > > disabling the hyper-threading may result in better performance results. > > > I'm looking for inputs / insights about HT on Solr setups. > > > Thanks in advance, > > > Avner > > > > > > > > > Email secured by Check Point > > > > > > -- > Vincenzo D'Amore > email: v.dam...@gmail.com > skype: free.dev > mobile: +39 349 8513251 >
Re: Disable hyper-threading for better Solr performance?
Upgrading to Solr 5 should improve your indexing performance. http://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/ On Wed, Mar 9, 2016 at 1:13 PM, Avner Levy <av...@checkpoint.com> wrote: > Currently I'm using Solr 4.8.1 but I can move to another version if it > performs significantly faster. > My target is to reach the max indexing throughput possible on the machine. > Since it seems the indexing process is CPU bound I was wondering whether > 32 logical cores with twice indexing threads will perform better. > Thanks, > Avner > > -Original Message- > From: Ilan Schwarts [mailto:ila...@gmail.com] > Sent: Wednesday, March 09, 2016 9:09 AM > To: solr-user@lucene.apache.org > Subject: Re: Disable hyper-threading for better Solr performance? > > What is the solr version and shard config? Standalone? Multiple cores? > Spread over RAID ? > On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote: > > > I have a machine with 16 real cores (32 with HT enabled). > > I'm running on it a Solr server and trying to reach maximum > > performance for indexing and queries (indexing 20k documents/sec by a > > number of threads). > > I've read on multiple places that in some scenarios / products > > disabling the hyper-threading may result in better performance results. > > I'm looking for inputs / insights about HT on Solr setups. > > Thanks in advance, > > Avner > > > > > Email secured by Check Point > -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251
RE: Disable hyper-threading for better Solr performance?
Currently I'm using Solr 4.8.1 but I can move to another version if it performs significantly faster. My target is to reach the max indexing throughput possible on the machine. Since the indexing process seems to be CPU bound, I was wondering whether 32 logical cores with twice the indexing threads would perform better. Thanks, Avner -Original Message- From: Ilan Schwarts [mailto:ila...@gmail.com] Sent: Wednesday, March 09, 2016 9:09 AM To: solr-user@lucene.apache.org Subject: Re: Disable hyper-threading for better Solr performance? What is the solr version and shard config? Standalone? Multiple cores? Spread over RAID ? On Mar 9, 2016 9:00 AM, "Avner Levy" <av...@checkpoint.com> wrote: > I have a machine with 16 real cores (32 with HT enabled). > I'm running on it a Solr server and trying to reach maximum > performance for indexing and queries (indexing 20k documents/sec by a > number of threads). > I've read on multiple places that in some scenarios / products > disabling the hyper-threading may result in better performance results. > I'm looking for inputs / insights about HT on Solr setups. > Thanks in advance, > Avner > Email secured by Check Point
RE: Disable hyper-threading for better Solr performance?
Hi - I can't remember having seen any threads on this topic for the past seven years. Can you perform a controlled test with a lot of concurrent users? I would suspect nowadays HT would boost highly concurrent environments such as search engines. Markus -Original message- > From:Avner Levy <av...@checkpoint.com> > Sent: Wednesday 9th March 2016 8:00 > To: solr-user@lucene.apache.org > Subject: Disable hyper-threading for better Solr performance? > > I have a machine with 16 real cores (32 with HT enabled). > I'm running on it a Solr server and trying to reach maximum performance for > indexing and queries (indexing 20k documents/sec by a number of threads). > I've read on multiple places that in some scenarios / products disabling the > hyper-threading may result in better performance results. > I'm looking for inputs / insights about HT on Solr setups. > Thanks in advance, > Avner >
Re: Disable hyper-threading for better Solr performance?
What is the solr version and shard config? Standalone? Multiple cores? Spread over RAID ? On Mar 9, 2016 9:00 AM, "Avner Levy" wrote: > I have a machine with 16 real cores (32 with HT enabled). > I'm running on it a Solr server and trying to reach maximum performance > for indexing and queries (indexing 20k documents/sec by a number of > threads). > I've read on multiple places that in some scenarios / products disabling > the hyper-threading may result in better performance results. > I'm looking for inputs / insights about HT on Solr setups. > Thanks in advance, > Avner >
Disable hyper-threading for better Solr performance?
I have a machine with 16 real cores (32 with HT enabled). I'm running a Solr server on it and trying to reach maximum performance for indexing and queries (indexing 20k documents/sec by a number of threads). I've read in multiple places that in some scenarios / products disabling hyper-threading may result in better performance. I'm looking for inputs / insights about HT on Solr setups. Thanks in advance, Avner
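For what it's worth, a starting point for sizing the indexing thread pool before running the controlled test Markus suggests might look like the toy helper below. The 1.25x hyper-threading factor is purely an assumption to benchmark against (HT siblings share execution units, so they rarely count as full cores); it is not a measured number:

```python
# Rule-of-thumb starting point for a CPU-bound indexing thread pool.
# The HT bonus factor (1.25x) is an assumption to validate by benchmark,
# not a measurement; real gains vary widely by workload.

def suggested_indexing_threads(logical_cores, ht_enabled):
    """Size the pool off physical cores; treat hyper-threads as a
    modest bonus rather than doubling the thread count."""
    physical = logical_cores // 2 if ht_enabled else logical_cores
    return int(physical * 1.25) if ht_enabled else physical

# 32 logical cores with HT on: start near 20 threads, not 32 or 64.
print(suggested_indexing_threads(32, True))
# 16 real cores with HT off: one thread per core.
print(suggested_indexing_threads(16, False))
```

The actual answer to the HT question still has to come from measuring documents/sec at several thread counts with HT on and off.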
Re: solr performance issue
1 million documents isn't considered big for Solr. How much RAM does your machine have? Regards, Edwin On 8 February 2016 at 23:45, Susheel Kumar <susheel2...@gmail.com> wrote: > 1 million document shouldn't have any issues at all. Something else is > wrong with your hw/system configuration. > > Thanks, > Susheel > > On Mon, Feb 8, 2016 at 6:45 AM, sara hajili <hajili.s...@gmail.com> wrote: > > > On Mon, Feb 8, 2016 at 3:04 AM, sara hajili <hajili.s...@gmail.com> > wrote: > > > > > sorry i made a mistake i have a bout 1000 K doc. > > > i mean about 100 doc. > > > > > > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic < > > > emir.arnauto...@sematext.com> wrote: > > > > > >> Hi Sara, > > >> Not sure if I am reading this right, but I read it as you have 1000 > doc > > >> index and issues? Can you tell us bit more about your setup: number of > > >> servers, hw, index size, number of shards, queries that you run, do > you > > >> index at the same time... > > >> > > >> It seems to me that you are running Solr on server with limited RAM > and > > >> probably small heap. Swapping for sure will slow things down and GC is > > most > > >> likely reason for high CPU. > > >> > > >> You can use http://sematext.com/spm to collect Solr and host metrics > > and > > >> see where the issue is. > > >> > > >> Thanks, > > >> Emir > > >> > > >> -- > > >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management > > >> Solr & Elasticsearch Support * http://sematext.com/ > > >> > > >> > > >> > > >> On 08.02.2016 10:27, sara hajili wrote: > > >> > > >>> hi all. > > >>> i have a problem with my solr performance and usage hardware like a > > >>> ram,cup... > > >>> i have a lot of document and so indexed file about 1000 doc in solr > > that > > >>> every doc has about 8 field in average. > > >>> and each field has about 60 char. > > >>> i set my field as a storedfield = "false" except of 1 field. // i > read > > >>> that this help performance. 
> > >>> i used copy field and dynamic field if it was necessary . // i read > > that > > >>> this help performance. > > >>> and now my question is that when i run a lot of query on solr i faced > > >>> with > > >>> a problem solr use more cpu and ram and after that filled ,it use a > lot > > >>> swapped storage and then use hard,but doesn't create a system file! > > >>> solr > > >>> fill hard until i forced to restart server to release hard disk. > > >>> and now my question is why solr treat in this way? and how i can > avoid > > >>> solr > > >>> to use huge cpu space? > > >>> any config need?! > > >>> > > >>> > > >> > > > > > >
Re: solr performance issue
1 million documents shouldn't have any issues at all. Something else is wrong with your hw/system configuration. Thanks, Susheel On Mon, Feb 8, 2016 at 6:45 AM, sara hajili <hajili.s...@gmail.com> wrote: > On Mon, Feb 8, 2016 at 3:04 AM, sara hajili <hajili.s...@gmail.com> wrote: > > > sorry i made a mistake i have a bout 1000 K doc. > > i mean about 100 doc. > > > > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic < > > emir.arnauto...@sematext.com> wrote: > > > >> Hi Sara, > >> Not sure if I am reading this right, but I read it as you have 1000 doc > >> index and issues? Can you tell us bit more about your setup: number of > >> servers, hw, index size, number of shards, queries that you run, do you > >> index at the same time... > >> > >> It seems to me that you are running Solr on server with limited RAM and > >> probably small heap. Swapping for sure will slow things down and GC is > most > >> likely reason for high CPU. > >> > >> You can use http://sematext.com/spm to collect Solr and host metrics > and > >> see where the issue is. > >> > >> Thanks, > >> Emir > >> > >> -- > >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management > >> Solr & Elasticsearch Support * http://sematext.com/ > >> > >> > >> > >> On 08.02.2016 10:27, sara hajili wrote: > >> > >>> hi all. > >>> i have a problem with my solr performance and usage hardware like a > >>> ram,cup... > >>> i have a lot of document and so indexed file about 1000 doc in solr > that > >>> every doc has about 8 field in average. > >>> and each field has about 60 char. > >>> i set my field as a storedfield = "false" except of 1 field. // i read > >>> that this help performance. > >>> i used copy field and dynamic field if it was necessary . // i read > that > >>> this help performance. 
> >>> and now my question is that when i run a lot of query on solr i faced > >>> with > >>> a problem solr use more cpu and ram and after that filled ,it use a lot > >>> swapped storage and then use hard,but doesn't create a system file! > >>> solr > >>> fill hard until i forced to restart server to release hard disk. > >>> and now my question is why solr treat in this way? and how i can avoid > >>> solr > >>> to use huge cpu space? > >>> any config need?! > >>> > >>> > >> > > >
solr performance issue
hi all.
I have a problem with my Solr performance and hardware usage (RAM, CPU, ...). I have a lot of documents indexed, about 1000 docs in Solr, where every doc has about 8 fields on average and each field has about 60 chars.
I set my fields to stored="false" except for 1 field. // i read that this helps performance.
I used copy fields and dynamic fields only where necessary. // i read that this helps performance.
Now my question: when I run a lot of queries against Solr, it uses more and more CPU and RAM, and once RAM is filled it uses a lot of swap space and then the hard disk, but doesn't create a system file! Solr fills the hard disk until I am forced to restart the server to release the disk space.
Why does Solr behave this way, and how can I keep it from using so much CPU and disk space? Is any config needed?
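The swapping described above usually means the JVM heap plus the OS disk cache needed for the index don't fit in RAM together. A back-of-the-envelope check, using hypothetical sizes rather than anything from this thread:

```python
# Rough memory budget: RAM must hold the Solr heap, some OS overhead,
# and ideally enough page cache to hold the whole index. If it can't,
# the OS starts swapping, which matches the symptoms described above.

def memory_budget(ram_gb, heap_gb, index_gb, os_overhead_gb=1.0):
    """Return what's left for disk caching and whether the index fits."""
    cache_available = ram_gb - heap_gb - os_overhead_gb
    return {
        "cache_available_gb": cache_available,
        "index_fully_cacheable": cache_available >= index_gb,
    }

# Hypothetical sizes: 4 GB RAM, 2 GB heap, 3 GB index -> index won't
# fit in the remaining cache, so heavy query load hits disk and swap.
print(memory_budget(ram_gb=4, heap_gb=2, index_gb=3))
```

This is the same sizing logic Emir hints at below: limited RAM plus a small heap makes swap and GC, not Solr itself, the likely culprits.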
Re: solr performance issue
Hi Sara,

That is still considered a small index. Can you give us a bit more detail about your setup?

Thanks,
Emir

On 08.02.2016 12:04, sara hajili wrote:
sorry i made a mistake i have about 1000 K doc. i mean about 100 doc. On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: Hi Sara, Not sure if I am reading this right, but I read it as you have 1000 doc index and issues? Can you tell us bit more about your setup: number of servers, hw, index size, number of shards, queries that you run, do you index at the same time... It seems to me that you are running Solr on server with limited RAM and probably small heap. Swapping for sure will slow things down and GC is most likely reason for high CPU. You can use http://sematext.com/spm to collect Solr and host metrics and see where the issue is. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On 08.02.2016 10:27, sara hajili wrote: hi all. i have a problem with my solr performance and usage hardware like a ram,cup... i have a lot of document and so indexed file about 1000 doc in solr that every doc has about 8 field in average. and each field has about 60 char. i set my field as a storedfield = "false" except of 1 field. // i read that this help performance. i used copy field and dynamic field if it was necessary . // i read that this help performance. and now my question is that when i run a lot of query on solr i faced with a problem solr use more cpu and ram and after that filled ,it use a lot swapped storage and then use hard,but doesn't create a system file! solr fill hard until i forced to restart server to release hard disk. and now my question is why solr treat in this way? and how i can avoid solr to use huge cpu space? any config need?!

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
Re: solr performance issue
Hi Sara,

Not sure if I am reading this right, but I read it as you have a 1000-doc index and issues? Can you tell us a bit more about your setup: number of servers, hw, index size, number of shards, queries that you run, whether you index at the same time...

It seems to me that you are running Solr on a server with limited RAM and probably a small heap. Swapping will for sure slow things down, and GC is the most likely reason for the high CPU.

You can use http://sematext.com/spm to collect Solr and host metrics and see where the issue is.

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 08.02.2016 10:27, sara hajili wrote:
hi all. i have a problem with my solr performance and usage hardware like a ram,cup... i have a lot of document and so indexed file about 1000 doc in solr that every doc has about 8 field in average. and each field has about 60 char. i set my field as a storedfield = "false" except of 1 field. // i read that this help performance. i used copy field and dynamic field if it was necessary . // i read that this help performance. and now my question is that when i run a lot of query on solr i faced with a problem solr use more cpu and ram and after that filled ,it use a lot swapped storage and then use hard,but doesn't create a system file! solr fill hard until i forced to restart server to release hard disk. and now my question is why solr treat in this way? and how i can avoid solr to use huge cpu space? any config need?!
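As a concrete illustration of the advice above (the heap value is an assumption; tune it to your data): give Solr an explicit heap at startup, and keep it well below physical RAM so the OS page cache can hold the index files instead of pushing them into swap.

```shell
# Solr 5.x bin/solr script: start with an explicit 4 GB heap.
# Lucene relies on the OS page cache for the index itself, so
# oversized heaps hurt more than they help.
bin/solr start -m 4g
```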
Re: solr performance issue
sorry i made a mistake i have a bout 1000 K doc. i mean about 100 doc. On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Sara, > Not sure if I am reading this right, but I read it as you have 1000 doc > index and issues? Can you tell us bit more about your setup: number of > servers, hw, index size, number of shards, queries that you run, do you > index at the same time... > > It seems to me that you are running Solr on server with limited RAM and > probably small heap. Swapping for sure will slow things down and GC is most > likely reason for high CPU. > > You can use http://sematext.com/spm to collect Solr and host metrics and > see where the issue is. > > Thanks, > Emir > > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > > > > On 08.02.2016 10:27, sara hajili wrote: > >> hi all. >> i have a problem with my solr performance and usage hardware like a >> ram,cup... >> i have a lot of document and so indexed file about 1000 doc in solr that >> every doc has about 8 field in average. >> and each field has about 60 char. >> i set my field as a storedfield = "false" except of 1 field. // i read >> that this help performance. >> i used copy field and dynamic field if it was necessary . // i read that >> this help performance. >> and now my question is that when i run a lot of query on solr i faced with >> a problem solr use more cpu and ram and after that filled ,it use a lot >> swapped storage and then use hard,but doesn't create a system file! solr >> fill hard until i forced to restart server to release hard disk. >> and now my question is why solr treat in this way? and how i can avoid >> solr >> to use huge cpu space? >> any config need?! >> >> >
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote: Now I've tried to increase the carrot.fragSize to 75 and carrot.summarySnippets to 2, and set the carrot.produceSummary to true. With this setting, I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still trying out to see if the cluster labels are ok, but in theory do you think this is a suitable setting to attempt to improve the clustering results and at the same time improve the performance? I don't know - the quality/performance point as well as which knobs to tweak is extremely dependent on your corpus and your hardware. A person with better understanding of carrot might be able to do better sanity checking, but I am not at all at that level. Related, it seems to me that the question of how to tweak the clustering has little to do with Solr and a lot to do with carrot (assuming here that carrot is the bottleneck). You might have more success asking in a carrot forum? - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke,

Thank you for the link. I'm using Solr 5.2.1, but I think the bundled carrot2 will be a slightly older version, as I'm using the latest carrot2-workbench-3.10.3, which was only released recently. I've changed all the settings like fragSize and desiredClusterCountBase to be the same on both sides, and I'm now able to get very similar cluster results.

Now I've tried to increase carrot.fragSize to 75 and carrot.summarySnippets to 2, and set carrot.produceSummary to true. With this setting, I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still checking whether the cluster labels are ok, but in theory do you think this is a suitable setting to improve the clustering results and at the same time improve the performance?

Regards,
Edwin

On 26 August 2015 at 13:58, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote: I'm currently trying out on the Carrot2 Workbench and get it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, but the results of the cluster is much better than mine. I think its probably due to the different settings like the fragSize and desiredCluserCountBase? Either that or the carrot bundled with Solr is an older version. By the way, the link on the clustering example https://cwiki.apache.org/confluence/display/solr/Result is not working as it says 'Page Not Found'. That is because it is too long for a single line. Try copy-pasting it: https://cwiki.apache.org/confluence/display/solr/Result +Clustering#ResultClustering-Configuration - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
Thanks for your recommendation Toke. Will try to ask in the carrot forum. Regards, Edwin On 26 August 2015 at 18:45, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote: Now I've tried to increase the carrot.fragSize to 75 and carrot.summarySnippets to 2, and set the carrot.produceSummary to true. With this setting, I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still trying out to see if the cluster labels are ok, but in theory do you think this is a suitable setting to attempt to improve the clustering results and at the same time improve the performance? I don't know - the quality/performance point as well as which knobs to tweak is extremely dependent on your corpus and your hardware. A person with better understanding of carrot might be able to do better sanity checking, but I am not at all at that level. Related, it seems to me that the question of how to tweak the clustering has little to do with Solr and a lot to do with carrot (assuming here that carrot is the bottleneck). You might have more success asking in a carrot forum? - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote: I'm currently trying out on the Carrot2 Workbench and get it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, but the results of the cluster is much better than mine. I think its probably due to the different settings like the fragSize and desiredCluserCountBase? Either that or the carrot bundled with Solr is an older version. By the way, the link on the clustering example https://cwiki.apache.org/confluence/display/solr/Result is not working as it says 'Page Not Found'. That is because it is too long for a single line. Try copy-pasting it: https://cwiki.apache.org/confluence/display/solr/Result +Clustering#ResultClustering-Configuration - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:

Would like to confirm, when I set rows=100, does it mean that it only build the cluster based on the first 100 records that are returned by the search, and if I have 1000 records that matches the search, all the remaining 900 records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from reading the comments in the third example at https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

As if that is the case, the result of the cluster may not be so accurate as there is a possibility that the first 100 records might have a large amount of similarities in the records, while the subsequent 900 records have differences that could have impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be as representative of your search result as possible. Assigning more weight to the higher scoring documents (in this case: all the weight, as those beyond the top-100 are not even considered) does this.

If that does not fit your expectations, maybe you need something else? Plain faceting perhaps? Or maybe enrichment of the documents with some sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark
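A toy sketch of the point above (document names are made up): with rows=100, the clustering component only ever receives the 100 highest-ranked hits; the other 900 matches contribute nothing to the clusters.

```python
# Hypothetical ranked result list of 1000 matching documents.
ranked_hits = ["doc%d" % i for i in range(1000)]

rows = 100
clustering_input = ranked_hits[:rows]   # what Carrot2 gets to see
ignored = ranked_hits[rows:]            # never considered for clustering

print(len(clustering_input), len(ignored))  # 100 900
```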
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke, Thank you for your reply. I'm currently trying out on the Carrot2 Workbench and get it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, but the results of the cluster is much better than mine. I think its probably due to the different settings like the fragSize and desiredCluserCountBase? By the way, the link on the clustering example https://cwiki.apache.org/confluence/display/solr/Result is not working as it says 'Page Not Found'. Regards, Edwin On 25 August 2015 at 15:29, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote: Would like to confirm, when I set rows=100, does it mean that it only build the cluster based on the first 100 records that are returned by the search, and if I have 1000 records that matches the search, all the remaining 900 records will not be considered for clustering? That is correct. It is not stated very clearly, but it follows from trading the comments in the third example at https://cwiki.apache.org/confluence/display/solr/Result +Clustering#ResultClustering-Configuration As if that is the case, the result of the cluster may not be so accurate as there is a possibility that the first 100 records might have a large amount of similarities in the records, while the subsequent 900 records have differences that could have impact on the cluster result. Such is the nature of on-the-fly clustering. The clustering aims to be as representative of your search result as possible. Assigning more weight to the higher scoring documents (in this case: All the weight, as those beyond the top-100 are not even considered) does this. If that does not fit your expectations, maybe you need something else? Plain faceting perhaps? Or maybe enrichment of the documents with some sort of entity extraction? - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
I honestly suspect your performance issue is down to the number of terms you are passing into the clustering algorithm, not to memory usage as such. If you have *huge* documents and cluster across them, performance will be slower, by definition. Clustering is usually done offline, for example on a large dataset taking a few hours or even days. Carrot2 manages to reduce this time to a reasonable online task by only clustering a few search results. If you increase the number of documents (from say 100 to 1000) and increase the number of terms in each document, you are inherently making the clustering algorithm have to work harder, and therefore it *IS* going to take longer. Either use less documents, or only use the first 1000 terms when clustering, or do your clustering offline and include the results of the clustering into your index. Upayavira On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote: Hi Alexandre, I've tried to use just index=true, and the speed is still the same and not any faster. If I set to store=false, there's no results that came back with the clustering. Is this due to the index are not stored, and the clustering requires indexed that are stored? I've also increase my heap size to 16GB as I'm using a machine with 32GB RAM, but there is no significant improvement with the performance too. Regards, Edwin On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Yes, I'm using store=true. field name=content type=text_general indexed=true stored=true omitNorms=true termVectors=true/ However, this field needs to be stored as my program requires this field to be returned during normal searching. I tried the lazyLoading=true, but it's not working. Will you do a copy field for the content, and not to set stored=true for that field. So that field will just be referenced to for the clustering, and the normal search will reference to the original content field? 
Regards, Edwin On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size, In the Solr, it is using 221MB. So when i set to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
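The "only use the first 1000 terms" suggestion above could be applied client-side before indexing. A sketch (the function name is mine; inside Solr the same effect is usually achieved with a LimitTokenCountFilterFactory in the field's analyzer chain):

```python
def truncate_terms(text, max_terms=1000):
    """Keep only the first max_terms whitespace-separated terms,
    so the clustering field stays small for huge documents."""
    return " ".join(text.split()[:max_terms])

doc = "alpha beta gamma " * 1000          # a ~3000-term document
clustering_field = truncate_terms(doc)
print(len(clustering_field.split()))      # 1000
```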
Re: Solr performance is slow with just 1GB of data indexed
Thank you Upayavira for your reply. Would like to confirm, when I set rows=100, does it mean that it only build the cluster based on the first 100 records that are returned by the search, and if I have 1000 records that matches the search, all the remaining 900 records will not be considered for clustering? As if that is the case, the result of the cluster may not be so accurate as there is a possibility that the first 100 records might have a large amount of similarities in the records, while the subsequent 900 records have differences that could have impact on the cluster result. Regards, Edwin On 24 August 2015 at 17:50, Upayavira u...@odoko.co.uk wrote: I honestly suspect your performance issue is down to the number of terms you are passing into the clustering algorithm, not to memory usage as such. If you have *huge* documents and cluster across them, performance will be slower, by definition. Clustering is usually done offline, for example on a large dataset taking a few hours or even days. Carrot2 manages to reduce this time to a reasonable online task by only clustering a few search results. If you increase the number of documents (from say 100 to 1000) and increase the number of terms in each document, you are inherently making the clustering algorithm have to work harder, and therefore it *IS* going to take longer. Either use less documents, or only use the first 1000 terms when clustering, or do your clustering offline and include the results of the clustering into your index. Upayavira On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote: Hi Alexandre, I've tried to use just index=true, and the speed is still the same and not any faster. If I set to store=false, there's no results that came back with the clustering. Is this due to the index are not stored, and the clustering requires indexed that are stored? 
I've also increase my heap size to 16GB as I'm using a machine with 32GB RAM, but there is no significant improvement with the performance too. Regards, Edwin On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Yes, I'm using store=true. field name=content type=text_general indexed=true stored=true omitNorms=true termVectors=true/ However, this field needs to be stored as my program requires this field to be returned during normal searching. I tried the lazyLoading=true, but it's not working. Will you do a copy field for the content, and not to set stored=true for that field. So that field will just be referenced to for the clustering, and the normal search will reference to the original content field? Regards, Edwin On 23 August 2015 at 23:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size, In the Solr, it is using 221MB. So when i set to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. 
Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
Re: Solr performance is slow with just 1GB of data indexed
And be aware that I'm sure the more terms in your documents, the slower clustering will be. So it isn't just the number of docs, the size of them counts in this instance. A simple test would be to build an index with just the first 1000 terms of your clustering fields, and see if that makes a difference to performance. Upayavira On Sun, Aug 23, 2015, at 05:32 PM, Erick Erickson wrote: You're confusing clustering with searching. Sure, Solr can index and lots of data, but clustering is essentially finding ad-hoc similarities between arbitrary documents. It must take each of the documents in the result size you specify from your result set and try to find commonalities. For perf issues in terms of clustering, you'd be better off talking to the folks at the carrot project. Best, Erick On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size, In the Solr, it is using 221MB. So when i set to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? 
Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
Re: Solr performance is slow with just 1GB of data indexed
Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: However, I find that clustering is exceeding slow after I index this 1GB of data. It took almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and still take more than 3 seconds when I set it to cluster the top 100 records. Your clustering uses Carrot2, which fetches the top documents and performs real-time clustering on them - that process is (nearly) independent of index size. The relevant numbers here are top 1000 and top 100, not 1GB. The unknown part is whether it is the fetching of top 1000 (the Solr part) or the clustering itself (the Carrot part) that is the bottleneck. - Toke Eskildsen
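One way to separate those two costs (my suggestion, not from the thread): Solr's debug=timing parameter adds per-search-component prepare/process times to the response, so the clustering component's share can be read off directly. A sketch of building such a request URL (host and collection names are assumptions):

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "rows": 1000,        # the same top-N the clustering operates on
    "clustering": "true",
    "debug": "timing",   # per-component timings in the response
    "wt": "json",
}
# Hypothetical host/collection; adjust to your setup.
url = "http://localhost:8983/solr/collection1/clustering?" + urlencode(params)
print(url)
```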
Re: Solr performance is slow with just 1GB of data indexed
Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size, In the Solr, it is using 221MB. So when i set to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
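For reference, the lazyLoading mentioned above is the enableLazyFieldLoading switch in the <query> section of solrconfig.xml; a minimal sketch:

```xml
<query>
  <!-- Stored fields not requested in fl are not read from disk
       until something actually asks for them. -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```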
Re: Solr performance is slow with just 1GB of data indexed
You're confusing clustering with searching. Sure, Solr can index and lots of data, but clustering is essentially finding ad-hoc similarities between arbitrary documents. It must take each of the documents in the result size you specify from your result set and try to find commonalities. For perf issues in terms of clustering, you'd be better off talking to the folks at the carrot project. Best, Erick On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Shawn and Toke, I only have 520 docs in my data, but each of the documents is quite big in size, In the Solr, it is using 221MB. So when i set to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed? Regards, Edwin On 23 August 2015 at 22:52, Shawn Heisey apa...@elyograg.org wrote: On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. 
Thanks, Shawn
Re: Solr performance is slow with just 1GB of data indexed
unsubscribe

On Sat, Aug 22, 2015 at 9:31 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi,

I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr. However, I find that clustering is exceedingly slow after I index this 1GB of data. It took almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and it still takes more than 3 seconds when I set it to cluster the top 100 records.

Is this speed normal? Because I understand Solr can index terabytes of data without the performance being impacted so much, but now the collection is slowing down even with just 1GB of data.

Below is my clustering configuration in solrconfig.xml.

<requestHandler name="/clustering" startup="lazy" enable="${solr.clustering.enabled:true}" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">null</str>
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">subject content tag</str>
    <bool name="carrot.produceSummary">true</bool>
    <int name="carrot.fragSize">20</int>
    <!-- the maximum number of labels per cluster -->
    <int name="carrot.numDescriptions">20</int>
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">7</str>
    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Regards,
Edwin
Re: Solr performance is slow with just 1GB of data indexed
Hi Shawn and Toke,

I only have 520 docs in my data, but each of the documents is quite big; in Solr, the index is using 221MB. So when I set it to read the top 1000 rows, it should just be reading all 520 docs that are indexed?

Regards,
Edwin
Re: Solr performance is slow with just 1GB of data indexed
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> Hi Shawn,
> Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size, to something like 8GB or 16GB?

Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)?

I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do.

Thanks,
Shawn
Re: Solr performance is slow with just 1GB of data indexed
We use 8GB to 10GB heaps for indexes of that size all the time.

Bill Bell
Sent from mobile
Re: Solr performance is slow with just 1GB of data indexed
Hi Alexandre,

I've tried using just indexed=true, and the speed is still the same, not any faster. If I set stored=false, no results come back from the clustering. Is this because the clustering component requires the fields it reads to be stored?

I've also increased my heap size to 16GB, as I'm using a machine with 32GB RAM, but there is no significant improvement in performance either.

Regards,
Edwin
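A sketch of the field split Edwin is asking about, with an illustrative field name of my own (content_search is not from the thread). One caveat, grounded in Edwin's own observation above that stored=false makes clustering return nothing: the clustering component reads document text from stored fields, so the field handed to carrot.title must remain stored. That suggests the opposite split may be more workable: keep content stored for display and clustering, and give search its own indexed-only copy.

```xml
<!-- schema.xml sketch (illustrative field names, not from the thread) -->

<!-- Stored copy: returned to the application and read by the
     clustering component, both of which need stored="true". -->
<field name="content" type="text_general" indexed="true" stored="true"
       omitNorms="true" termVectors="true"/>

<!-- Search-only copy: indexed but not stored, so queries that match
     on it never pull the large stored value off disk. -->
<field name="content_search" type="text_general" indexed="true" stored="false"/>
<copyField source="content" dest="content_search"/>
```

Queries would then target content_search (e.g. df=content_search) and leave content out of fl unless the application actually needs it returned.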
Re: Solr performance is slow with just 1GB of data indexed
Yes, I'm using stored=true:

<field name="content" type="text_general" indexed="true" stored="true" omitNorms="true" termVectors="true"/>

However, this field needs to be stored, as my program requires it to be returned during normal searching. I tried lazy field loading, but it's not working.

Would you do a copyField of the content, and not set stored=true on that copy, so that the copy is referenced only for the clustering while normal search references the original content field?

Regards,
Edwin
Re: Solr performance is slow with just 1GB of data indexed
Hi Shawn,

Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size, to something like 8GB or 16GB?

Regards,
Edwin

On 23 Aug 2015 10:23, Shawn Heisey apa...@elyograg.org wrote:
> On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote:
> > It took almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and still takes more than 3 seconds when I set it to cluster the top 100 records. Is this speed normal?
>
> Have you increased the heap size? If you simply start Solr 5.x with the included script and don't use any command-line options, Solr will only have a 512MB heap. This is *extremely* small. A significant chunk of that 512MB heap is required just to start Jetty and Solr, so there's not much memory left for manipulating the index data and serving queries.
>
> Assuming you have at least 4GB of RAM, try adding -m 2g to the start command line.
>
> Thanks,
> Shawn
Solr performance is slow with just 1GB of data indexed
Hi,

I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr. However, I find that clustering is exceedingly slow after indexing this 1GB of data. It takes almost 30 seconds to return the cluster results when I set it to cluster the top 1000 records, and still more than 3 seconds when I set it to cluster the top 100 records.

Is this speed normal? I understand Solr can index terabytes of data without performance being impacted so much, but this collection is slowing down with just 1GB of data.

Below is my clustering configuration in solrconfig.xml:

<requestHandler name="/clustering" startup="lazy"
                enable="${solr.clustering.enabled:true}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">null</str>
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">subject content tag</str>
    <bool name="carrot.produceSummary">true</bool>
    <int name="carrot.fragSize">20</int>
    <!-- the maximum number of labels per cluster -->
    <int name="carrot.numDescriptions">20</int>
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">7</str>
    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Regards,
Edwin
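Since the clustering component has to process every document in the rows window, one low-effort mitigation consistent with the timings reported in this thread (about 30 seconds for 1000 rows versus about 3 seconds for 100) is simply lowering the handler's default rows. A sketch changing only that one default in the configuration above:

```xml
<!-- Same handler as above, with a smaller clustering window:
     the component clusters only the top N documents per query,
     so cost grows with rows times average document size. -->
<requestHandler name="/clustering" startup="lazy"
                enable="${solr.clustering.enabled:true}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <int name="rows">100</int>
    <!-- ... remaining defaults unchanged ... -->
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

Clients that genuinely need a wider window can still pass rows explicitly per request; the default only bounds the common case.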