Re: Solr performance is slow with just 1GB of data indexed
Thanks for your recommendation Toke. Will try to ask in the carrot forum. Regards, Edwin On 26 August 2015 at 18:45, Toke Eskildsen wrote: > On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote: > > > Now I've tried to increase the carrot.fragSize to 75 and > > carrot.summarySnippets to 2, and set the carrot.produceSummary to > > true. With this setting, I'm mostly able to get the cluster results > > back within 2 to 3 seconds when I set rows=200. I'm still trying out > > to see if the cluster labels are ok, but in theory do you think this > > is a suitable setting to attempt to improve the clustering results and > > at the same time improve the performance? > > I don't know - the quality/performance point as well as which knobs to > tweak is extremely dependent on your corpus and your hardware. A person > with better understanding of carrot might be able to do better sanity > checking, but I am not at all at that level. > > Related, it seems to me that the question of how to tweak the clustering > has little to do with Solr and a lot to do with carrot (assuming here > that carrot is the bottleneck). You might have more success asking in a > carrot forum? > > > - Toke Eskildsen, State and University Library, Denmark > > > >
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:
> Now I've tried to increase the carrot.fragSize to 75 and
> carrot.summarySnippets to 2, and set the carrot.produceSummary to
> true. With this setting, I'm mostly able to get the cluster results
> back within 2 to 3 seconds when I set rows=200. I'm still trying out
> to see if the cluster labels are ok, but in theory do you think this
> is a suitable setting to attempt to improve the clustering results and
> at the same time improve the performance?

I don't know - the quality/performance sweet spot, as well as which knobs to tweak, is extremely dependent on your corpus and your hardware. A person with a better understanding of carrot might be able to do better sanity checking, but I am not at all at that level.

Related: it seems to me that the question of how to tweak the clustering has little to do with Solr and a lot to do with carrot (assuming here that carrot is the bottleneck). You might have more success asking in a carrot forum?

- Toke Eskildsen, State and University Library, Denmark
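For reference, the parameter combination Edwin describes can be passed straight on the clustering request. A minimal sketch of building such a URL (host and core name are made up; the `carrot.*` names are the standard Solr ClusteringComponent request parameters):

```python
from urllib.parse import urlencode

# Hypothetical host/core; carrot.* parameters as documented for the
# Solr Result Clustering component.
params = {
    "q": "some query",
    "rows": 200,
    "clustering": "true",
    "clustering.results": "true",
    "carrot.produceSummary": "true",
    "carrot.fragSize": 75,
    "carrot.summarySnippets": 2,
}
url = "http://localhost:8983/solr/mycore/clustering?" + urlencode(params)
print(url)
```

Passing the parameters per request like this makes it easy to A/B different `fragSize` values without editing solrconfig.xml.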
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke,

Thank you for the link. I'm using Solr 5.2.1, but I think the bundled carrot2 is a slightly older version, as I'm using the latest carrot2-workbench-3.10.3, which was only released recently.

I've changed all the settings like fragSize and desiredClusterCountBase to be the same on both sides, and I'm now able to get very similar cluster results.

Now I've tried to increase the carrot.fragSize to 75 and carrot.summarySnippets to 2, and set the carrot.produceSummary to true. With this setting, I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still checking whether the cluster labels are ok, but in theory do you think this is a suitable setting to attempt to improve the clustering results and at the same time improve the performance?

Regards,
Edwin

On 26 August 2015 at 13:58, Toke Eskildsen wrote:
> On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
> > I'm currently trying out on the Carrot2 Workbench and get it to call Solr
> > to see how they did the clustering. Although it still takes some time to do
> > the clustering, but the results of the cluster is much better than mine. I
> > think its probably due to the different settings like the fragSize and
> > desiredCluserCountBase?
>
> Either that or the carrot bundled with Solr is an older version.
>
> > By the way, the link on the clustering example
> > https://cwiki.apache.org/confluence/display/solr/Result is not working as
> > it says 'Page Not Found'.
>
> That is because it is too long for a single line. Try copy-pasting it:
>
> https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration
>
> - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
> I'm currently trying out on the Carrot2 Workbench and get it to call Solr
> to see how they did the clustering. Although it still takes some time to do
> the clustering, but the results of the cluster is much better than mine. I
> think its probably due to the different settings like the fragSize and
> desiredCluserCountBase?

Either that or the carrot bundled with Solr is an older version.

> By the way, the link on the clustering example
> https://cwiki.apache.org/confluence/display/solr/Result is not working as
> it says 'Page Not Found'.

That is because it is too long for a single line. Try copy-pasting it:

https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

- Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke,

Thank you for your reply. I'm currently trying out the Carrot2 Workbench and getting it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, the results of the cluster are much better than mine. I think it's probably due to the different settings like the fragSize and desiredClusterCountBase?

By the way, the link on the clustering example https://cwiki.apache.org/confluence/display/solr/Result is not working as it says 'Page Not Found'.

Regards,
Edwin

On 25 August 2015 at 15:29, Toke Eskildsen wrote:
> On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
> > Would like to confirm, when I set rows=100, does it mean that it only build
> > the cluster based on the first 100 records that are returned by the search,
> > and if I have 1000 records that matches the search, all the remaining 900
> > records will not be considered for clustering?
>
> That is correct. It is not stated very clearly, but it follows from
> reading the comments in the third example at
> https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration
>
> > As if that is the case, the result of the cluster may not be so accurate as
> > there is a possibility that the first 100 records might have a large amount
> > of similarities in the records, while the subsequent 900 records have
> > differences that could have impact on the cluster result.
>
> Such is the nature of on-the-fly clustering. The clustering aims to be
> as representative of your search result as possible. Assigning more
> weight to the higher-scoring documents (in this case: all the weight, as
> those beyond the top-100 are not even considered) does this.
>
> If that does not fit your expectations, maybe you need something else?
> Plain faceting perhaps? Or maybe enrichment of the documents with some
> sort of entity extraction?
>
> - Toke Eskildsen, State and University Library, Denmark
Re: Solr performance is slow with just 1GB of data indexed
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
> Would like to confirm, when I set rows=100, does it mean that it only build
> the cluster based on the first 100 records that are returned by the search,
> and if I have 1000 records that matches the search, all the remaining 900
> records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from reading the comments in the third example at
https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

> As if that is the case, the result of the cluster may not be so accurate as
> there is a possibility that the first 100 records might have a large amount
> of similarities in the records, while the subsequent 900 records have
> differences that could have impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be as representative of your search result as possible. Assigning more weight to the higher-scoring documents (in this case: all the weight, as those beyond the top-100 are not even considered) does this.

If that does not fit your expectations, maybe you need something else? Plain faceting perhaps? Or maybe enrichment of the documents with some sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark
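The mechanics Toke confirms can be sketched in a few lines: whichever algorithm runs, it only ever sees the page of results Solr returns, i.e. the top `rows` documents by score (a toy illustration, not the Carrot2 API):

```python
def cluster_top_results(all_matches, rows, cluster_fn):
    # Solr hands the clustering component only the page of results
    # it returns to the client -- the top `rows` documents by score.
    visible = all_matches[:rows]
    return cluster_fn(visible)

# 1000 matches, rows=100: the remaining 900 never reach the clusterer.
matches = [f"doc{i}" for i in range(1000)]
clusters = cluster_top_results(matches, 100, lambda docs: {"size": len(docs)})
print(clusters["size"])  # 100
```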
Re: Solr performance is slow with just 1GB of data indexed
Thank you Upayavira for your reply.

Would like to confirm: when I set rows=100, does it mean that it only builds the cluster based on the first 100 records that are returned by the search, and if I have 1000 records that match the search, all the remaining 900 records will not be considered for clustering?

If that is the case, the result of the cluster may not be so accurate, as there is a possibility that the first 100 records might have a large amount of similarities, while the subsequent 900 records have differences that could have an impact on the cluster result.

Regards,
Edwin

On 24 August 2015 at 17:50, Upayavira wrote: > I honestly suspect your performance issue is down to the number of terms > you are passing into the clustering algorithm, not to memory usage as > such. If you have *huge* documents and cluster across them, performance > will be slower, by definition. > > Clustering is usually done offline, for example on a large dataset > taking a few hours or even days. Carrot2 manages to reduce this time to > a reasonable "online" task by only clustering a few search results. If > you increase the number of documents (from say 100 to 1000) and increase > the number of terms in each document, you are inherently making the > clustering algorithm have to work harder, and therefore it *IS* going to > take longer. Either use less documents, or only use the first 1000 terms > when clustering, or do your clustering offline and include the results > of the clustering into your index. > > Upayavira > > On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote: > > Hi Alexandre, > > > > I've tried to use just index=true, and the speed is still the same and > > not > > any faster. If I set to store=false, there's no results that came back > > with > > the clustering. Is this due to the index are not stored, and the > > clustering > > requires indexed that are stored?
> > > > I've also increase my heap size to 16GB as I'm using a machine with 32GB > > RAM, but there is no significant improvement with the performance too. > > > > Regards, > > Edwin > > > > > > > > On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo > > wrote: > > > > > Yes, I'm using store=true. > > > > > omitNorms="true" termVectors="true"/> > > > > > > However, this field needs to be stored as my program requires this > field > > > to be returned during normal searching. I tried the lazyLoading=true, > but > > > it's not working. > > > > > > Will you do a copy field for the content, and not to set stored="true" > for > > > that field. So that field will just be referenced to for the > clustering, > > > and the normal search will reference to the original content field? > > > > > > Regards, > > > Edwin > > > > > > > > > > > > > > > On 23 August 2015 at 23:51, Alexandre Rafalovitch > > > wrote: > > > > > >> Are you by any chance doing store=true on the fields you want to > search? > > >> > > >> If so, you may want to switch to just index=true. Of course, they will > > >> then not come back in the results, but do you really want to sling > > >> huge content fields around. > > >> > > >> The other option is to do lazyLoading=true and not request that field. > > >> This, as a test, you could actually do without needing to reindex > > >> Solr, just with restart. This could give you a way to test whether the > > >> field stored size is the issue. > > >> > > >> Regards, > > >>Alex. > > >> > > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > > >> http://www.solr-start.com/ > > >> > > >> > > >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo > > > >> wrote: > > >> > Hi Shawn and Toke, > > >> > > > >> > I only have 520 docs in my data, but each of the documents is quite > big > > >> in > > >> > size, In the Solr, it is using 221MB. So when i set to read from > the top > > >> > 1000 rows, it should just be reading all the 520 docs that are > indexed? 
> > >> > > > >> > Regards, > > >> > Edwin > > >> > > > >> > > > >> > On 23 August 2015 at 22:52, Shawn Heisey > wrote: > > >> > > > >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: > > >> >> > Hi Shawn, > > >> >> > > > >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a > > >> machine > > >> >> > with 32GB RAM. > > >> >> > > > >> >> > Is it recommended to further increase the heap size to like 8GB > or > > >> 16GB? > > >> >> > > >> >> Probably not, but I know nothing about your data. How many Solr > docs > > >> >> were created by indexing 1GB of data? How much disk space is used > by > > >> >> your Solr index(es)? > > >> >> > > >> >> I know very little about clustering, but it looks like you've > gotten a > > >> >> reply from Toke, who knows a lot more about that part of the code > than > > >> I > > >> >> do. > > >> >> > > >> >> Thanks, > > >> >> Shawn > > >> >> > > >> >> > > >> > > > > > > >
Re: Solr performance is slow with just 1GB of data indexed
I honestly suspect your performance issue is down to the number of terms you are passing into the clustering algorithm, not to memory usage as such. If you have *huge* documents and cluster across them, performance will be slower, by definition.

Clustering is usually done offline, for example on a large dataset taking a few hours or even days. Carrot2 manages to reduce this time to a reasonable "online" task by only clustering a few search results. If you increase the number of documents (from say 100 to 1000) and increase the number of terms in each document, you are inherently making the clustering algorithm work harder, and therefore it *IS* going to take longer. Either use fewer documents, or only use the first 1000 terms when clustering, or do your clustering offline and include the results of the clustering into your index.

Upayavira

On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote: > Hi Alexandre, > > I've tried to use just index=true, and the speed is still the same and > not > any faster. If I set to store=false, there's no results that came back > with > the clustering. Is this due to the index are not stored, and the > clustering > requires indexed that are stored? > > I've also increase my heap size to 16GB as I'm using a machine with 32GB > RAM, but there is no significant improvement with the performance too. > > Regards, > Edwin > > > > On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo > wrote: > > > Yes, I'm using store=true. > > > omitNorms="true" termVectors="true"/> > > > > However, this field needs to be stored as my program requires this field > > to be returned during normal searching. I tried the lazyLoading=true, but > > it's not working. > > > > Will you do a copy field for the content, and not to set stored="true" for > > that field. So that field will just be referenced to for the clustering, > > and the normal search will reference to the original content field?
> > > > Regards, > > Edwin > > > > > > > > > > On 23 August 2015 at 23:51, Alexandre Rafalovitch > > wrote: > > > >> Are you by any chance doing store=true on the fields you want to search? > >> > >> If so, you may want to switch to just index=true. Of course, they will > >> then not come back in the results, but do you really want to sling > >> huge content fields around. > >> > >> The other option is to do lazyLoading=true and not request that field. > >> This, as a test, you could actually do without needing to reindex > >> Solr, just with restart. This could give you a way to test whether the > >> field stored size is the issue. > >> > >> Regards, > >>Alex. > >> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > >> http://www.solr-start.com/ > >> > >> > >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo > >> wrote: > >> > Hi Shawn and Toke, > >> > > >> > I only have 520 docs in my data, but each of the documents is quite big > >> in > >> > size, In the Solr, it is using 221MB. So when i set to read from the top > >> > 1000 rows, it should just be reading all the 520 docs that are indexed? > >> > > >> > Regards, > >> > Edwin > >> > > >> > > >> > On 23 August 2015 at 22:52, Shawn Heisey wrote: > >> > > >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: > >> >> > Hi Shawn, > >> >> > > >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a > >> machine > >> >> > with 32GB RAM. > >> >> > > >> >> > Is it recommended to further increase the heap size to like 8GB or > >> 16GB? > >> >> > >> >> Probably not, but I know nothing about your data. How many Solr docs > >> >> were created by indexing 1GB of data? How much disk space is used by > >> >> your Solr index(es)? > >> >> > >> >> I know very little about clustering, but it looks like you've gotten a > >> >> reply from Toke, who knows a lot more about that part of the code than > >> I > >> >> do. > >> >> > >> >> Thanks, > >> >> Shawn > >> >> > >> >> > >> > > > >
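Upayavira's "first 1000 terms" suggestion can be approximated client-side, truncating the text before it is fed to a clustering-only field (a sketch under that assumption; the 1000-term cutoff is just his example figure):

```python
def truncate_terms(text, max_terms=1000):
    # Keep only the leading terms: usually enough signal for clustering,
    # and far less work for the clustering algorithm.
    return " ".join(text.split()[:max_terms])

doc = " ".join(f"term{i}" for i in range(5000))
short = truncate_terms(doc)
print(len(short.split()))  # 1000
```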
Re: Solr performance is slow with just 1GB of data indexed
Hi Alexandre,

I've tried to use just index=true, and the speed is still the same and not any faster. If I set store=false, there are no results that come back with the clustering. Is this because the fields are not stored, and the clustering requires indexed fields that are also stored?

I've also increased my heap size to 16GB as I'm using a machine with 32GB RAM, but there is no significant improvement in the performance either.

Regards,
Edwin

On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo wrote: > Yes, I'm using store=true. > omitNorms="true" termVectors="true"/> > > However, this field needs to be stored as my program requires this field > to be returned during normal searching. I tried the lazyLoading=true, but > it's not working. > > Will you do a copy field for the content, and not to set stored="true" for > that field. So that field will just be referenced to for the clustering, > and the normal search will reference to the original content field? > > Regards, > Edwin > > > > > On 23 August 2015 at 23:51, Alexandre Rafalovitch > wrote: > >> Are you by any chance doing store=true on the fields you want to search? >> >> If so, you may want to switch to just index=true. Of course, they will >> then not come back in the results, but do you really want to sling >> huge content fields around. >> >> The other option is to do lazyLoading=true and not request that field. >> This, as a test, you could actually do without needing to reindex >> Solr, just with restart. This could give you a way to test whether the >> field stored size is the issue. >> >> Regards, >>Alex. >> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: >> http://www.solr-start.com/ >> >> >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo >> wrote: >> > Hi Shawn and Toke, >> > >> > I only have 520 docs in my data, but each of the documents is quite big >> in >> > size, In the Solr, it is using 221MB.
So when i set to read from the top >> > 1000 rows, it should just be reading all the 520 docs that are indexed? >> > >> > Regards, >> > Edwin >> > >> > >> > On 23 August 2015 at 22:52, Shawn Heisey wrote: >> > >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: >> >> > Hi Shawn, >> >> > >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a >> machine >> >> > with 32GB RAM. >> >> > >> >> > Is it recommended to further increase the heap size to like 8GB or >> 16GB? >> >> >> >> Probably not, but I know nothing about your data. How many Solr docs >> >> were created by indexing 1GB of data? How much disk space is used by >> >> your Solr index(es)? >> >> >> >> I know very little about clustering, but it looks like you've gotten a >> >> reply from Toke, who knows a lot more about that part of the code than >> I >> >> do. >> >> >> >> Thanks, >> >> Shawn >> >> >> >> >> > >
Re: Solr performance is slow with just 1GB of data indexed
Yes, I'm using store=true.

However, this field needs to be stored as my program requires this field to be returned during normal searching. I tried the lazyLoading=true, but it's not working.

Would you do a copyField for the content, and not set stored="true" on that field? That way the field would just be referenced for the clustering, and the normal search would reference the original content field?

Regards,
Edwin

On 23 August 2015 at 23:51, Alexandre Rafalovitch wrote: > Are you by any chance doing store=true on the fields you want to search? > > If so, you may want to switch to just index=true. Of course, they will > then not come back in the results, but do you really want to sling > huge content fields around. > > The other option is to do lazyLoading=true and not request that field. > This, as a test, you could actually do without needing to reindex > Solr, just with restart. This could give you a way to test whether the > field stored size is the issue. > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo > wrote: > > Hi Shawn and Toke, > > > > I only have 520 docs in my data, but each of the documents is quite big > in > > size, In the Solr, it is using 221MB. So when i set to read from the top > > 1000 rows, it should just be reading all the 520 docs that are indexed? > > > > Regards, > > Edwin > > > > > > On 23 August 2015 at 22:52, Shawn Heisey wrote: > > > >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: > >> > Hi Shawn, > >> > > >> > Yes, I've increased the heap size to 4GB already, and I'm using a > machine > >> > with 32GB RAM. > >> > > >> > Is it recommended to further increase the heap size to like 8GB or > 16GB? > >> > >> Probably not, but I know nothing about your data. How many Solr docs > >> were created by indexing 1GB of data? How much disk space is used by > >> your Solr index(es)?
> >> > >> I know very little about clustering, but it looks like you've gotten a > >> reply from Toke, who knows a lot more about that part of the code than I > >> do. > >> > >> Thanks, > >> Shawn > >> > >> >
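The copyField arrangement Edwin is asking about would look roughly like this in schema.xml. This is a sketch with made-up field and type names; note that, as Edwin observes elsewhere in the thread, the Carrot2 component reads stored text, so the clustering source field keeps stored="true" while the search-only copy can be index-only:

```xml
<!-- Lean copy for ordinary searching: indexed, not stored -->
<field name="content_search" type="text_general" indexed="true" stored="false"/>

<!-- Stored original, returned in results and readable by the
     clustering component -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" omitNorms="true"/>

<copyField source="content" dest="content_search"/>
```

Queries would then target `content_search` while `carrot.snippet` points at `content`.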
Re: Solr performance is slow with just 1GB of data indexed
We use 8gb to 10gb for those size indexes all the time. Bill Bell Sent from mobile > On Aug 23, 2015, at 8:52 AM, Shawn Heisey wrote: > >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: >> Hi Shawn, >> >> Yes, I've increased the heap size to 4GB already, and I'm using a machine >> with 32GB RAM. >> >> Is it recommended to further increase the heap size to like 8GB or 16GB? > > Probably not, but I know nothing about your data. How many Solr docs > were created by indexing 1GB of data? How much disk space is used by > your Solr index(es)? > > I know very little about clustering, but it looks like you've gotten a > reply from Toke, who knows a lot more about that part of the code than I do. > > Thanks, > Shawn >
Re: Solr performance is slow with just 1GB of data indexed
unsubscribe

On Sat, Aug 22, 2015 at 9:31 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.
>
> However, I find that clustering is exceedingly slow after I index this 1GB of
> data. It took almost 30 seconds to return the cluster results when I set it
> to cluster the top 1000 records, and still takes more than 3 seconds when I
> set it to cluster the top 100 records.
>
> Is this speed normal? Because I understand Solr can index terabytes of data
> without the performance being impacted so much, but now the collection is
> slowing down even with just 1GB of data.
>
> Below is my clustering configuration in solrconfig.xml:
>
> startup="lazy" enable="${solr.clustering.enabled:true}" class="solr.SearchHandler">
> explicit 1000 json true text null true true subject content tag true
> 20 20 false 7 edismax clustering
>
> Regards,
> Edwin
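The configuration quoted above lost its XML element names in transit; only attribute values survive. For orientation, a clustering request handler of this general shape (per the Solr Result Clustering docs) would look something like the following. The mapping of the surviving values to parameters is a guess, so treat this as a reconstruction sketch, not Edwin's exact config:

```xml
<requestHandler name="/clustering" startup="lazy"
                enable="${solr.clustering.enabled:true}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Field mapping guessed from the values "subject content tag" -->
    <str name="carrot.title">subject</str>
    <str name="carrot.snippet">content</str>
    <bool name="carrot.produceSummary">true</bool>
    <str name="defType">edismax</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```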
Re: Solr performance is slow with just 1GB of data indexed
And be aware that I'm sure the more terms in your documents, the slower clustering will be. So it isn't just the number of docs, the size of them counts in this instance. A simple test would be to build an index with just the first 1000 terms of your clustering fields, and see if that makes a difference to performance. Upayavira On Sun, Aug 23, 2015, at 05:32 PM, Erick Erickson wrote: > You're confusing clustering with searching. Sure, Solr can index > and lots of data, but clustering is essentially finding ad-hoc > similarities between arbitrary documents. It must take each of > the documents in the result size you specify from your result > set and try to find commonalities. > > For perf issues in terms of clustering, you'd be better off > talking to the folks at the carrot project. > > Best, > Erick > > On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch > wrote: > > Are you by any chance doing store=true on the fields you want to search? > > > > If so, you may want to switch to just index=true. Of course, they will > > then not come back in the results, but do you really want to sling > > huge content fields around. > > > > The other option is to do lazyLoading=true and not request that field. > > This, as a test, you could actually do without needing to reindex > > Solr, just with restart. This could give you a way to test whether the > > field stored size is the issue. > > > > Regards, > >Alex. > > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > > http://www.solr-start.com/ > > > > > > On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo > > wrote: > >> Hi Shawn and Toke, > >> > >> I only have 520 docs in my data, but each of the documents is quite big in > >> size, In the Solr, it is using 221MB. So when i set to read from the top > >> 1000 rows, it should just be reading all the 520 docs that are indexed? 
> >> > >> Regards, > >> Edwin > >> > >> > >> On 23 August 2015 at 22:52, Shawn Heisey wrote: > >> > >>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: > >>> > Hi Shawn, > >>> > > >>> > Yes, I've increased the heap size to 4GB already, and I'm using a > >>> > machine > >>> > with 32GB RAM. > >>> > > >>> > Is it recommended to further increase the heap size to like 8GB or 16GB? > >>> > >>> Probably not, but I know nothing about your data. How many Solr docs > >>> were created by indexing 1GB of data? How much disk space is used by > >>> your Solr index(es)? > >>> > >>> I know very little about clustering, but it looks like you've gotten a > >>> reply from Toke, who knows a lot more about that part of the code than I > >>> do. > >>> > >>> Thanks, > >>> Shawn > >>> > >>>
Re: Solr performance is slow with just 1GB of data indexed
You're confusing clustering with searching. Sure, Solr can index and search lots of data, but clustering is essentially finding ad-hoc similarities between arbitrary documents. It must take each of the documents in the result size you specify from your result set and try to find commonalities.

For perf issues in terms of clustering, you'd be better off talking to the folks at the carrot project.

Best,
Erick

On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch wrote: > Are you by any chance doing store=true on the fields you want to search? > > If so, you may want to switch to just index=true. Of course, they will > then not come back in the results, but do you really want to sling > huge content fields around. > > The other option is to do lazyLoading=true and not request that field. > This, as a test, you could actually do without needing to reindex > Solr, just with restart. This could give you a way to test whether the > field stored size is the issue. > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo wrote: >> Hi Shawn and Toke, >> >> I only have 520 docs in my data, but each of the documents is quite big in >> size, In the Solr, it is using 221MB. So when i set to read from the top >> 1000 rows, it should just be reading all the 520 docs that are indexed? >> >> Regards, >> Edwin >> >> >> On 23 August 2015 at 22:52, Shawn Heisey wrote: >> >>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: >>> > Hi Shawn, >>> > >>> > Yes, I've increased the heap size to 4GB already, and I'm using a machine >>> > with 32GB RAM. >>> > >>> > Is it recommended to further increase the heap size to like 8GB or 16GB? >>> >>> Probably not, but I know nothing about your data. How many Solr docs >>> were created by indexing 1GB of data? How much disk space is used by >>> your Solr index(es)?
>>> >>> I know very little about clustering, but it looks like you've gotten a >>> reply from Toke, who knows a lot more about that part of the code than I >>> do. >>> >>> Thanks, >>> Shawn >>> >>>
Re: Solr performance is slow with just 1GB of data indexed
Are you by any chance doing store=true on the fields you want to search? If so, you may want to switch to just index=true. Of course, they will then not come back in the results, but do you really want to sling huge content fields around. The other option is to do lazyLoading=true and not request that field. This, as a test, you could actually do without needing to reindex Solr, just with restart. This could give you a way to test whether the field stored size is the issue. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo wrote: > Hi Shawn and Toke, > > I only have 520 docs in my data, but each of the documents is quite big in > size, In the Solr, it is using 221MB. So when i set to read from the top > 1000 rows, it should just be reading all the 520 docs that are indexed? > > Regards, > Edwin > > > On 23 August 2015 at 22:52, Shawn Heisey wrote: > >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: >> > Hi Shawn, >> > >> > Yes, I've increased the heap size to 4GB already, and I'm using a machine >> > with 32GB RAM. >> > >> > Is it recommended to further increase the heap size to like 8GB or 16GB? >> >> Probably not, but I know nothing about your data. How many Solr docs >> were created by indexing 1GB of data? How much disk space is used by >> your Solr index(es)? >> >> I know very little about clustering, but it looks like you've gotten a >> reply from Toke, who knows a lot more about that part of the code than I >> do. >> >> Thanks, >> Shawn >> >>
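The lazy-loading switch Alexandre mentions lives in the `<query>` section of solrconfig.xml, so it can indeed be toggled with just a restart and no reindex:

```xml
<query>
  <!-- Read large stored fields from disk only when a request
       actually asks for them -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```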
Re: Solr performance is slow with just 1GB of data indexed
Hi Shawn and Toke,

I only have 520 docs in my data, but each of the documents is quite big in size; in Solr, the index is using 221MB. So when I set it to read from the top 1000 rows, it should just be reading all the 520 docs that are indexed?

Regards,
Edwin

On 23 August 2015 at 22:52, Shawn Heisey wrote:
> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> > Hi Shawn,
> >
> > Yes, I've increased the heap size to 4GB already, and I'm using a machine
> > with 32GB RAM.
> >
> > Is it recommended to further increase the heap size to like 8GB or 16GB?
>
> Probably not, but I know nothing about your data. How many Solr docs
> were created by indexing 1GB of data? How much disk space is used by
> your Solr index(es)?
>
> I know very little about clustering, but it looks like you've gotten a
> reply from Toke, who knows a lot more about that part of the code than I
> do.
>
> Thanks,
> Shawn
Re: Solr performance is slow with just 1GB of data indexed
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote: > Hi Shawn, > > Yes, I've increased the heap size to 4GB already, and I'm using a machine > with 32GB RAM. > > Is it recommended to further increase the heap size to like 8GB or 16GB? Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do. Thanks, Shawn
Re: Solr performance is slow with just 1GB of data indexed
Zheng Lin Edwin Yeo wrote: > However, I find that clustering is exceeding slow after I index this 1GB of > data. It took almost 30 seconds to return the cluster results when I set it > to cluster the top 1000 records, and still take more than 3 seconds when I > set it to cluster the top 100 records. Your clustering uses Carrot2, which fetches the top documents and performs real-time clustering on them - that process is (nearly) independent of index size. The relevant numbers here are top 1000 and top 100, not 1GB. The unknown part is whether it is the fetching of top 1000 (the Solr part) or the clustering itself (the Carrot part) that is the bottleneck. - Toke Eskildsen
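Toke's open question (is the Solr fetch or the Carrot2 clustering the bottleneck?) can be answered by adding `debug=timing` to the request and comparing per-component times. A small sketch that ranks components from such a response; the JSON shape is assumed from the Solr debug output and the sample numbers are invented:

```python
def slowest_components(timing):
    # `timing` is the debug "timing" section of a Solr response
    # (structure assumed from the Solr docs; component names vary
    # by configuration).
    process = timing["process"]
    return sorted(
        ((name, entry["time"]) for name, entry in process.items()
         if name != "time"),
        key=lambda kv: kv[1], reverse=True,
    )

sample = {"process": {"time": 3120.0,
                      "query": {"time": 180.0},
                      "clustering": {"time": 2900.0}}}
print(slowest_components(sample)[0][0])  # clustering
```

If the clustering component dominates, tuning Carrot2 (or clustering fewer/shorter documents) is the lever; if the query phase dominates, it is a Solr-side problem.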
Re: Solr performance is slow with just 1GB of data indexed
Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB? Regards, Edwin On 23 Aug 2015 10:23, "Shawn Heisey" wrote: > On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote: > > I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr. > > > > However, I find that clustering is exceeding slow after I index this 1GB > of > > data. It took almost 30 seconds to return the cluster results when I set > it > > to cluster the top 1000 records, and still take more than 3 seconds when > I > > set it to cluster the top 100 records. > > > > Is this speed normal? Cos i understand Solr can index terabytes of data > > without having the performance impacted so much, but now the collection > is > > slowing down even with just 1GB of data. > > Have you increased the heap size? If you simply start Solr 5.x with the > included script and don't use any commandline options, Solr will only > have a 512MB heap. This is *extremely* small. A significant chunk of > that 512MB heap will be required just to start Jetty and Solr, so > there's not much memory left for manipulating the index data and serving > queries. Assuming you have at least 4GB of RAM, try adding "-m 2g" to > the start commandline. > > Thanks, > Shawn > >
Re: Solr performance is slow with just 1GB of data indexed
On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote: > I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr. > > However, I find that clustering is exceeding slow after I index this 1GB of > data. It took almost 30 seconds to return the cluster results when I set it > to cluster the top 1000 records, and still take more than 3 seconds when I > set it to cluster the top 100 records. > > Is this speed normal? Cos i understand Solr can index terabytes of data > without having the performance impacted so much, but now the collection is > slowing down even with just 1GB of data. Have you increased the heap size? If you simply start Solr 5.x with the included script and don't use any commandline options, Solr will only have a 512MB heap. This is *extremely* small. A significant chunk of that 512MB heap will be required just to start Jetty and Solr, so there's not much memory left for manipulating the index data and serving queries. Assuming you have at least 4GB of RAM, try adding "-m 2g" to the start commandline. Thanks, Shawn