Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Thanks for your recommendation, Toke.

Will try to ask in the carrot forum.

Regards,
Edwin

On 26 August 2015 at 18:45, Toke Eskildsen  wrote:

> On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:
>
> > Now I've tried increasing carrot.fragSize to 75 and
> > carrot.summarySnippets to 2, and setting carrot.produceSummary to
> > true. With these settings, I'm mostly able to get the cluster results
> > back within 2 to 3 seconds when I set rows=200. I'm still checking
> > whether the cluster labels are OK, but in theory, do you think this
> > is a suitable setting to improve the clustering results and the
> > performance at the same time?
>
> I don't know - the quality/performance trade-off, as well as which knobs
> to tweak, is extremely dependent on your corpus and your hardware. A
> person with a better understanding of carrot might be able to do better
> sanity checking, but I am not at all at that level.
>
> Related, it seems to me that the question of how to tweak the clustering
> has little to do with Solr and a lot to do with carrot (assuming here
> that carrot is the bottleneck). You might have more success asking in a
> carrot forum?
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Toke Eskildsen
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

> Now I've tried increasing carrot.fragSize to 75 and
> carrot.summarySnippets to 2, and setting carrot.produceSummary to
> true. With these settings, I'm mostly able to get the cluster results
> back within 2 to 3 seconds when I set rows=200. I'm still checking
> whether the cluster labels are OK, but in theory, do you think this
> is a suitable setting to improve the clustering results and the
> performance at the same time?

I don't know - the quality/performance trade-off, as well as which knobs to
tweak, is extremely dependent on your corpus and your hardware. A person
with a better understanding of carrot might be able to do better sanity
checking, but I am not at all at that level.

Related, it seems to me that the question of how to tweak the clustering
has little to do with Solr and a lot to do with carrot (assuming here
that carrot is the bottleneck). You might have more success asking in a
carrot forum?


- Toke Eskildsen, State and University Library, Denmark





Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for the link.

I'm using Solr 5.2.1, but I think the Carrot2 bundled with it is a slightly
older version, as I'm using the latest carrot2-workbench-3.10.3, which was
only released recently. I've changed all the settings like fragSize and
desiredClusterCountBase to be the same on both sides, and I'm now able to
get very similar cluster results.

Now I've tried increasing carrot.fragSize to 75 and carrot.summarySnippets
to 2, and setting carrot.produceSummary to true. With these settings, I'm
mostly able to get the cluster results back within 2 to 3 seconds when I
set rows=200. I'm still checking whether the cluster labels are OK, but in
theory, do you think this is a suitable setting to improve the clustering
results and the performance at the same time?
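
For reference, a request with these overrides would look something like this
(the host, collection, and handler names here are placeholders):

http://localhost:8983/solr/collection1/clustering?q=test&rows=200&carrot.produceSummary=true&carrot.fragSize=75&carrot.summarySnippets=2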

Regards,
Edwin



On 26 August 2015 at 13:58, Toke Eskildsen  wrote:

> On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
> > I'm currently trying out the Carrot2 Workbench and getting it to call
> > Solr to see how it does the clustering. Although it still takes some
> > time to do the clustering, the results of the clusters are much better
> > than mine. I think it's probably due to the different settings, like
> > fragSize and desiredClusterCountBase?
>
> Either that or the carrot bundled with Solr is an older version.
>
> > By the way, the link to the clustering example
> > (https://cwiki.apache.org/confluence/display/solr/Result) is not
> > working; it says 'Page Not Found'.
>
> That is because it is too long for a single line. Try copy-pasting it:
>
> https://cwiki.apache.org/confluence/display/solr/Result
> +Clustering#ResultClustering-Configuration
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
> I'm currently trying out the Carrot2 Workbench and getting it to call Solr
> to see how it does the clustering. Although it still takes some time to do
> the clustering, the results of the clusters are much better than mine. I
> think it's probably due to the different settings, like fragSize and
> desiredClusterCountBase?

Either that or the carrot bundled with Solr is an older version.

> By the way, the link to the clustering example
> (https://cwiki.apache.org/confluence/display/solr/Result) is not working;
> it says 'Page Not Found'.

That is because it is too long for a single line. Try copy-pasting it:

https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration

- Toke Eskildsen, State and University Library, Denmark




Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for your reply.

I'm currently trying out the Carrot2 Workbench and getting it to call Solr
to see how it does the clustering. Although it still takes some time to do
the clustering, the results of the clusters are much better than mine. I
think it's probably due to the different settings, like fragSize and
desiredClusterCountBase?

By the way, the link to the clustering example
(https://cwiki.apache.org/confluence/display/solr/Result) is not working;
it says 'Page Not Found'.

Regards,
Edwin


On 25 August 2015 at 15:29, Toke Eskildsen  wrote:

> On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
> > Would like to confirm: when I set rows=100, does it mean that it only
> > builds the cluster based on the first 100 records that are returned by
> > the search, and if I have 1000 records that match the search, all the
> > remaining 900 records will not be considered for clustering?
>
> That is correct. It is not stated very clearly, but it follows from
> reading the comments in the third example at
> https://cwiki.apache.org/confluence/display/solr/Result
> +Clustering#ResultClustering-Configuration
>
> > If that is the case, the result of the clustering may not be so
> > accurate, as there is a possibility that the first 100 records have a
> > large amount of similarity between them, while the subsequent 900
> > records have differences that could have an impact on the cluster
> > result.
>
> Such is the nature of on-the-fly clustering. The clustering aims to be
> as representative of your search result as possible. Assigning more
> weight to the higher scoring documents (in this case: All the weight, as
> those beyond the top-100 are not even considered) does this.
>
> If that does not fit your expectations, maybe you need something else?
> Plain faceting perhaps? Or maybe enrichment of the documents with some
> sort of entity extraction?
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
> Would like to confirm: when I set rows=100, does it mean that it only
> builds the cluster based on the first 100 records that are returned by the
> search, and if I have 1000 records that match the search, all the
> remaining 900 records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from
reading the comments in the third example at
https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration

> If that is the case, the result of the clustering may not be so accurate,
> as there is a possibility that the first 100 records have a large amount
> of similarity between them, while the subsequent 900 records have
> differences that could have an impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be
as representative of your search result as possible. Assigning more
weight to the higher scoring documents (in this case: All the weight, as
those beyond the top-100 are not even considered) does this.
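
As a concrete illustration (the host, collection, and handler names are
placeholders), the rows parameter is what determines how many of the
matching documents are handed to the clusterer:

http://localhost:8983/solr/collection1/clustering?q=test&rows=100

Raising rows widens the clustering input, at the cost of fetching and
clustering more documents.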

If that does not fit your expectations, maybe you need something else?
Plain faceting perhaps? Or maybe enrichment of the documents with some
sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark




Re: Solr performance is slow with just 1GB of data indexed

2015-08-24 Thread Zheng Lin Edwin Yeo
Thank you Upayavira for your reply.

Would like to confirm: when I set rows=100, does it mean that it only builds
the cluster based on the first 100 records that are returned by the search,
and if I have 1000 records that match the search, all the remaining 900
records will not be considered for clustering?
If that is the case, the result of the clustering may not be so accurate, as
there is a possibility that the first 100 records have a large amount of
similarity between them, while the subsequent 900 records have differences
that could have an impact on the cluster result.

Regards,
Edwin


On 24 August 2015 at 17:50, Upayavira  wrote:

> I honestly suspect your performance issue is down to the number of terms
> you are passing into the clustering algorithm, not to memory usage as
> such. If you have *huge* documents and cluster across them, performance
> will be slower, by definition.
>
> Clustering is usually done offline, for example on a large dataset
> taking a few hours or even days. Carrot2 manages to reduce this time to
> a reasonable "online" task by only clustering a few search results. If
> you increase the number of documents (from say 100 to 1000) and increase
> the number of terms in each document, you are inherently making the
> clustering algorithm have to work harder, and therefore it *IS* going to
> take longer. Either use fewer documents, or only use the first 1000 terms
> when clustering, or do your clustering offline and include the results
> of the clustering into your index.
>
> Upayavira
>
> On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
> > Hi Alexandre,
> >
> > I've tried using just index=true, and the speed is still the same, not
> > any faster. If I set store=false, no results come back with the
> > clustering. Is this because the fields are not stored, and the
> > clustering requires fields that are stored?
> >
> > I've also increased my heap size to 16GB, as I'm using a machine with
> > 32GB RAM, but there is no significant improvement in performance either.
> >
> > Regards,
> > Edwin
> >
> >
> >
> > On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Yes, I'm using store=true.
> > > <field name="content" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > >
> > > However, this field needs to be stored, as my program requires this
> > > field to be returned during normal searching. I tried
> > > lazyLoading=true, but it's not working.
> > >
> > > Would you do a copyField for the content, and not set stored="true"
> > > on that field? Then that field would just be referenced for the
> > > clustering, and the normal search would reference the original
> > > content field?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > >
> > > On 23 August 2015 at 23:51, Alexandre Rafalovitch 
> > > wrote:
> > >
> > >> Are you by any chance doing store=true on the fields you want to
> search?
> > >>
> > >> If so, you may want to switch to just index=true. Of course, they will
> > >> then not come back in the results, but do you really want to sling
> > >> huge content fields around.
> > >>
> > >> The other option is to do lazyLoading=true and not request that field.
> > >> This, as a test, you could actually do without needing to reindex
> > >> Solr, just with restart. This could give you a way to test whether the
> > >> field stored size is the issue.
> > >>
> > >> Regards,
> > >>Alex.
> > >> 
> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > >> http://www.solr-start.com/
> > >>
> > >>
> > >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo  >
> > >> wrote:
> > >> > Hi Shawn and Toke,
> > >> >
> > >> > I only have 520 docs in my data, but each of the documents is quite
> > >> > big in size; in Solr, it is using 221MB. So when I set it to read
> > >> > from the top 1000 rows, it should just be reading all 520 docs that
> > >> > are indexed?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >> >
> > >> >
> > >> > On 23 August 2015 at 22:52, Shawn Heisey 
> wrote:
> > >> >
> > >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> > >> >> > Hi Shawn,
> > >> >> >
> > >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a
> > >> machine
> > >> >> > with 32GB RAM.
> > >> >> >
> > >> >> > Is it recommended to further increase the heap size to like 8GB
> or
> > >> 16GB?
> > >> >>
> > >> >> Probably not, but I know nothing about your data.  How many Solr
> docs
> > >> >> were created by indexing 1GB of data?  How much disk space is used
> by
> > >> >> your Solr index(es)?
> > >> >>
> > >> >> I know very little about clustering, but it looks like you've
> gotten a
> > >> >> reply from Toke, who knows a lot more about that part of the code
> than
> > >> I
> > >> >> do.
> > >> >>
> > >> >> Thanks,
> > >> >> Shawn
> > >> >>
> > >> >>
> > >>
> > >
> > >
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-24 Thread Upayavira
I honestly suspect your performance issue is down to the number of terms
you are passing into the clustering algorithm, not to memory usage as
such. If you have *huge* documents and cluster across them, performance
will be slower, by definition.

Clustering is usually done offline, for example on a large dataset
taking a few hours or even days. Carrot2 manages to reduce this time to
a reasonable "online" task by only clustering a few search results. If
you increase the number of documents (from say 100 to 1000) and increase
the number of terms in each document, you are inherently making the
clustering algorithm have to work harder, and therefore it *IS* going to
take longer. Either use fewer documents, or only use the first 1000 terms
when clustering, or do your clustering offline and include the results
of the clustering into your index.
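
As a sketch of the truncation idea (the field names and length are
hypothetical, and this clips by characters rather than terms), an update
processor chain along these lines could populate a clipped copy of the
content at index time:

<updateRequestProcessorChain name="truncate-for-clustering">
  <!-- copy the full field into a clustering-only field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">content</str>
    <str name="dest">content_cluster</str>
  </processor>
  <!-- clip the copy so the clusterer sees a bounded amount of text -->
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content_cluster</str>
    <int name="maxLength">10000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The content_cluster field would need to be stored so the clustering
component can read it, with carrot.snippet pointed at it instead of the
full field.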

Upayavira

On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
> 
> I've tried using just index=true, and the speed is still the same, not
> any faster. If I set store=false, no results come back with the
> clustering. Is this because the fields are not stored, and the
> clustering requires fields that are stored?
> 
> I've also increased my heap size to 16GB, as I'm using a machine with
> 32GB RAM, but there is no significant improvement in performance either.
> 
> Regards,
> Edwin
> 
> 
> 
> On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo 
> wrote:
> 
> > Yes, I'm using store=true.
> > <field name="content" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> > However, this field needs to be stored, as my program requires this field
> > to be returned during normal searching. I tried lazyLoading=true, but
> > it's not working.
> >
> > Would you do a copyField for the content, and not set stored="true" on
> > that field? Then that field would just be referenced for the clustering,
> > and the normal search would reference the original content field?
> >
> > Regards,
> > Edwin
> >
> >
> >
> >
> > On 23 August 2015 at 23:51, Alexandre Rafalovitch 
> > wrote:
> >
> >> Are you by any chance doing store=true on the fields you want to search?
> >>
> >> If so, you may want to switch to just index=true. Of course, they will
> >> then not come back in the results, but do you really want to sling
> >> huge content fields around.
> >>
> >> The other option is to do lazyLoading=true and not request that field.
> >> This, as a test, you could actually do without needing to reindex
> >> Solr, just with restart. This could give you a way to test whether the
> >> field stored size is the issue.
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo 
> >> wrote:
> >> > Hi Shawn and Toke,
> >> >
> >> > I only have 520 docs in my data, but each of the documents is quite big
> >> > in size; in Solr, it is using 221MB. So when I set it to read from the
> >> > top 1000 rows, it should just be reading all 520 docs that are indexed?
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 23 August 2015 at 22:52, Shawn Heisey  wrote:
> >> >
> >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> >> >> > Hi Shawn,
> >> >> >
> >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a
> >> machine
> >> >> > with 32GB RAM.
> >> >> >
> >> >> > Is it recommended to further increase the heap size to like 8GB or
> >> 16GB?
> >> >>
> >> >> Probably not, but I know nothing about your data.  How many Solr docs
> >> >> were created by indexing 1GB of data?  How much disk space is used by
> >> >> your Solr index(es)?
> >> >>
> >> >> I know very little about clustering, but it looks like you've gotten a
> >> >> reply from Toke, who knows a lot more about that part of the code than
> >> I
> >> >> do.
> >> >>
> >> >> Thanks,
> >> >> Shawn
> >> >>
> >> >>
> >>
> >
> >


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Hi Alexandre,

I've tried using just index=true, and the speed is still the same, not
any faster. If I set store=false, no results come back with the clustering.
Is this because the fields are not stored, and the clustering requires
fields that are stored?

I've also increased my heap size to 16GB, as I'm using a machine with 32GB
RAM, but there is no significant improvement in performance either.

Regards,
Edwin



On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo 
wrote:

> Yes, I'm using store=true.
> <field name="content" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
>
> However, this field needs to be stored, as my program requires this field
> to be returned during normal searching. I tried lazyLoading=true, but
> it's not working.
>
> Would you do a copyField for the content, and not set stored="true" on
> that field? Then that field would just be referenced for the clustering,
> and the normal search would reference the original content field?
>
> Regards,
> Edwin
>
>
>
>
> On 23 August 2015 at 23:51, Alexandre Rafalovitch 
> wrote:
>
>> Are you by any chance doing store=true on the fields you want to search?
>>
>> If so, you may want to switch to just index=true. Of course, they will
>> then not come back in the results, but do you really want to sling
>> huge content fields around.
>>
>> The other option is to do lazyLoading=true and not request that field.
>> This, as a test, you could actually do without needing to reindex
>> Solr, just with restart. This could give you a way to test whether the
>> field stored size is the issue.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo 
>> wrote:
>> > Hi Shawn and Toke,
>> >
>> > I only have 520 docs in my data, but each of the documents is quite big
>> > in size; in Solr, it is using 221MB. So when I set it to read from the
>> > top 1000 rows, it should just be reading all 520 docs that are indexed?
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 23 August 2015 at 22:52, Shawn Heisey  wrote:
>> >
>> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
>> >> > Hi Shawn,
>> >> >
>> >> > Yes, I've increased the heap size to 4GB already, and I'm using a
>> machine
>> >> > with 32GB RAM.
>> >> >
>> >> > Is it recommended to further increase the heap size to like 8GB or
>> 16GB?
>> >>
>> >> Probably not, but I know nothing about your data.  How many Solr docs
>> >> were created by indexing 1GB of data?  How much disk space is used by
>> >> your Solr index(es)?
>> >>
>> >> I know very little about clustering, but it looks like you've gotten a
>> >> reply from Toke, who knows a lot more about that part of the code than
>> I
>> >> do.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >>
>>
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Yes, I'm using store=true.
<field name="content" indexed="true" stored="true" omitNorms="true" termVectors="true"/>

However, this field needs to be stored, as my program requires this field to
be returned during normal searching. I tried lazyLoading=true, but it's
not working.

Would you do a copyField for the content, and not set stored="true" on
that field? Then that field would just be referenced for the clustering,
and the normal search would reference the original content field?
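
In schema.xml terms, the arrangement being asked about would look roughly
like this (the content_cluster name and the field type are assumptions):

<field name="content" type="text_general" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
<field name="content_cluster" type="text_general" indexed="true" stored="false"/>
<copyField source="content" dest="content_cluster"/>

One caveat: the clustering component reads the stored values of the fields
named in carrot.snippet, so a stored="false" copy would give it nothing to
work with - which matches the store=false behaviour described above.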

Regards,
Edwin




On 23 August 2015 at 23:51, Alexandre Rafalovitch 
wrote:

> Are you by any chance doing store=true on the fields you want to search?
>
> If so, you may want to switch to just index=true. Of course, they will
> then not come back in the results, but do you really want to sling
> huge content fields around.
>
> The other option is to do lazyLoading=true and not request that field.
> This, as a test, you could actually do without needing to reindex
> Solr, just with restart. This could give you a way to test whether the
> field stored size is the issue.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo 
> wrote:
> > Hi Shawn and Toke,
> >
> > I only have 520 docs in my data, but each of the documents is quite big
> > in size; in Solr, it is using 221MB. So when I set it to read from the
> > top 1000 rows, it should just be reading all 520 docs that are indexed?
> >
> > Regards,
> > Edwin
> >
> >
> > On 23 August 2015 at 22:52, Shawn Heisey  wrote:
> >
> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> >> > Hi Shawn,
> >> >
> >> > Yes, I've increased the heap size to 4GB already, and I'm using a
> machine
> >> > with 32GB RAM.
> >> >
> >> > Is it recommended to further increase the heap size to like 8GB or
> 16GB?
> >>
> >> Probably not, but I know nothing about your data.  How many Solr docs
> >> were created by indexing 1GB of data?  How much disk space is used by
> >> your Solr index(es)?
> >>
> >> I know very little about clustering, but it looks like you've gotten a
> >> reply from Toke, who knows a lot more about that part of the code than I
> >> do.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Bill Bell
We use 8GB to 10GB heaps for indexes of that size all the time.


Bill Bell
Sent from mobile


> On Aug 23, 2015, at 8:52 AM, Shawn Heisey  wrote:
> 
>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
>> Hi Shawn,
>> 
>> Yes, I've increased the heap size to 4GB already, and I'm using a machine
>> with 32GB RAM.
>> 
>> Is it recommended to further increase the heap size to like 8GB or 16GB?
> 
> Probably not, but I know nothing about your data.  How many Solr docs
> were created by indexing 1GB of data?  How much disk space is used by
> your Solr index(es)?
> 
> I know very little about clustering, but it looks like you've gotten a
> reply from Toke, who knows a lot more about that part of the code than I do.
> 
> Thanks,
> Shawn
> 


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Jimmy Lin
unsubscribe

On Sat, Aug 22, 2015 at 9:31 PM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.
>
> However, I find that clustering is exceedingly slow after I index this 1GB
> of data. It took almost 30 seconds to return the cluster results when I set
> it to cluster the top 1000 records, and it still takes more than 3 seconds
> when I set it to cluster the top 100 records.
>
> Is this speed normal? Because I understand Solr can index terabytes of data
> without the performance being impacted so much, but now the collection is
> slowing down even with just 1GB of data.
>
> Below is my clustering configuration in solrconfig.xml.
>
> <requestHandler name="/clustering"
>                 startup="lazy"
>                 enable="${solr.clustering.enabled:true}"
>                 class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <int name="rows">1000</int>
>     <str name="wt">json</str>
>     <bool name="indent">true</bool>
>     <str name="carrot.title">text</str>
>     <str name="carrot.url">null</str>
>
>     <bool name="clustering">true</bool>
>     <bool name="clustering.results">true</bool>
>     <str name="carrot.snippet">subject content tag</str>
>     <bool name="carrot.produceSummary">true</bool>
>
>     <int name="carrot.fragSize">20</int>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>     <bool name="carrot.outputSubClusters">false</bool>
>     <int name="carrot.numDescriptions">7</int>
>
>     <str name="defType">edismax</str>
>   </lst>
>   <arr name="last-components">
>     <str>clustering</str>
>   </arr>
> </requestHandler>
>
>
> Regards,
> Edwin
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Upayavira
And be aware that the more terms in your documents, the slower
clustering will be. So it isn't just the number of docs; the size of
them counts in this instance.

A simple test would be to build an index with just the first 1000 terms
of your clustering fields, and see if that makes a difference to
performance.

Upayavira

On Sun, Aug 23, 2015, at 05:32 PM, Erick Erickson wrote:
> You're confusing clustering with searching. Sure, Solr can index
> and search lots of data, but clustering is essentially finding ad-hoc
> similarities between arbitrary documents. It must take each of
> the documents in the result size you specify from your result
> set and try to find commonalities.
> 
> For perf issues in terms of clustering, you'd be better off
> talking to the folks at the carrot project.
> 
> Best,
> Erick
> 
> On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch
>  wrote:
> > Are you by any chance doing store=true on the fields you want to search?
> >
> > If so, you may want to switch to just index=true. Of course, they will
> > then not come back in the results, but do you really want to sling
> > huge content fields around.
> >
> > The other option is to do lazyLoading=true and not request that field.
> > This, as a test, you could actually do without needing to reindex
> > Solr, just with restart. This could give you a way to test whether the
> > field stored size is the issue.
> >
> > Regards,
> >Alex.
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo  
> > wrote:
> >> Hi Shawn and Toke,
> >>
> >> I only have 520 docs in my data, but each of the documents is quite big
> >> in size; in Solr, it is using 221MB. So when I set it to read from the
> >> top 1000 rows, it should just be reading all 520 docs that are indexed?
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On 23 August 2015 at 22:52, Shawn Heisey  wrote:
> >>
> >>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> >>> > Hi Shawn,
> >>> >
> >>> > Yes, I've increased the heap size to 4GB already, and I'm using a 
> >>> > machine
> >>> > with 32GB RAM.
> >>> >
> >>> > Is it recommended to further increase the heap size to like 8GB or 16GB?
> >>>
> >>> Probably not, but I know nothing about your data.  How many Solr docs
> >>> were created by indexing 1GB of data?  How much disk space is used by
> >>> your Solr index(es)?
> >>>
> >>> I know very little about clustering, but it looks like you've gotten a
> >>> reply from Toke, who knows a lot more about that part of the code than I
> >>> do.
> >>>
> >>> Thanks,
> >>> Shawn
> >>>
> >>>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Erick Erickson
You're confusing clustering with searching. Sure, Solr can index
and search lots of data, but clustering is essentially finding ad-hoc
similarities between arbitrary documents. It must take each of
the documents in the result size you specify from your result
set and try to find commonalities.

For perf issues in terms of clustering, you'd be better off
talking to the folks at the carrot project.

Best,
Erick

On Sun, Aug 23, 2015 at 8:51 AM, Alexandre Rafalovitch
 wrote:
> Are you by any chance doing store=true on the fields you want to search?
>
> If so, you may want to switch to just index=true. Of course, they will
> then not come back in the results, but do you really want to sling
> huge content fields around.
>
> The other option is to do lazyLoading=true and not request that field.
> This, as a test, you could actually do without needing to reindex
> Solr, just with restart. This could give you a way to test whether the
> field stored size is the issue.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo  wrote:
>> Hi Shawn and Toke,
>>
>> I only have 520 docs in my data, but each of the documents is quite big in
>> size; in Solr, it is using 221MB. So when I set it to read from the top
>> 1000 rows, it should just be reading all 520 docs that are indexed?
>>
>> Regards,
>> Edwin
>>
>>
>> On 23 August 2015 at 22:52, Shawn Heisey  wrote:
>>
>>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
>>> > Hi Shawn,
>>> >
>>> > Yes, I've increased the heap size to 4GB already, and I'm using a machine
>>> > with 32GB RAM.
>>> >
>>> > Is it recommended to further increase the heap size to like 8GB or 16GB?
>>>
>>> Probably not, but I know nothing about your data.  How many Solr docs
>>> were created by indexing 1GB of data?  How much disk space is used by
>>> your Solr index(es)?
>>>
>>> I know very little about clustering, but it looks like you've gotten a
>>> reply from Toke, who knows a lot more about that part of the code than I
>>> do.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Alexandre Rafalovitch
Are you by any chance doing store=true on the fields you want to search?

If so, you may want to switch to just index=true. Of course, they will
then not come back in the results, but do you really want to sling
huge content fields around.

The other option is to do lazyLoading=true and not request that field.
This, as a test, you could actually do without needing to reindex
Solr, just with restart. This could give you a way to test whether the
field stored size is the issue.
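
A minimal sketch of that test (enableLazyFieldLoading is already on in many
stock configs, and the field names here are hypothetical): make sure the
<query> section of solrconfig.xml has

<enableLazyFieldLoading>true</enableLazyFieldLoading>

then restart and query with an fl that leaves out the big field, e.g.
fl=id,subject, so the large stored content is never loaded.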

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo  wrote:
> Hi Shawn and Toke,
>
> I only have 520 docs in my data, but each of the documents is quite big in
> size; in Solr, it is using 221MB. So when I set it to read from the top
> 1000 rows, it should just be reading all 520 docs that are indexed?
>
> Regards,
> Edwin
>
>
> On 23 August 2015 at 22:52, Shawn Heisey  wrote:
>
>> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
>> > Hi Shawn,
>> >
>> > Yes, I've increased the heap size to 4GB already, and I'm using a machine
>> > with 32GB RAM.
>> >
>> > Is it recommended to further increase the heap size to like 8GB or 16GB?
>>
>> Probably not, but I know nothing about your data.  How many Solr docs
>> were created by indexing 1GB of data?  How much disk space is used by
>> your Solr index(es)?
>>
>> I know very little about clustering, but it looks like you've gotten a
>> reply from Toke, who knows a lot more about that part of the code than I
>> do.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Zheng Lin Edwin Yeo
Hi Shawn and Toke,

I only have 520 docs in my data, but each of the documents is quite big in
size; in Solr, it is using 221MB. So when I set it to read from the top
1000 rows, it should just be reading all 520 docs that are indexed?

Regards,
Edwin


On 23 August 2015 at 22:52, Shawn Heisey  wrote:

> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> > Hi Shawn,
> >
> > Yes, I've increased the heap size to 4GB already, and I'm using a machine
> > with 32GB RAM.
> >
> > Is it recommended to further increase the heap size to like 8GB or 16GB?
>
> Probably not, but I know nothing about your data.  How many Solr docs
> were created by indexing 1GB of data?  How much disk space is used by
> your Solr index(es)?
>
> I know very little about clustering, but it looks like you've gotten a
> reply from Toke, who knows a lot more about that part of the code than I
> do.
>
> Thanks,
> Shawn
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Shawn Heisey
On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> Hi Shawn,
> 
> Yes, I've increased the heap size to 4GB already, and I'm using a machine
> with 32GB RAM.
> 
> Is it recommended to further increase the heap size to like 8GB or 16GB?

Probably not, but I know nothing about your data.  How many Solr docs
were created by indexing 1GB of data?  How much disk space is used by
your Solr index(es)?

I know very little about clustering, but it looks like you've gotten a
reply from Toke, who knows a lot more about that part of the code than I do.

Thanks,
Shawn



Re: Solr performance is slow with just 1GB of data indexed

2015-08-23 Thread Toke Eskildsen
Zheng Lin Edwin Yeo  wrote:
> However, I find that clustering is exceedingly slow after I index this 1GB
> of data. It took almost 30 seconds to return the cluster results when I set
> it to cluster the top 1000 records, and it still takes more than 3 seconds
> when I set it to cluster the top 100 records.

Your clustering uses Carrot2, which fetches the top documents and performs 
real-time clustering on them - that process is (nearly) independent of index 
size. The relevant numbers here are top 1000 and top 100, not 1GB. The unknown 
part is whether it is the fetching of top 1000 (the Solr part) or the 
clustering itself (the Carrot part) that is the bottleneck.
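
One way to see which part dominates (the host, collection, and handler names
are placeholders) is to time the same request with the clustering component
toggled off and then on, since the clustering switch can be overridden per
request:

http://localhost:8983/solr/collection1/clustering?q=test&rows=1000&clustering=false
http://localhost:8983/solr/collection1/clustering?q=test&rows=1000&clustering=true

The first timing approximates the Solr fetch; the difference approximates
the Carrot2 cost.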

- Toke Eskildsen


Re: Solr performance is slow with just 1GB of data indexed

2015-08-22 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Yes, I've increased the heap size to 4GB already, and I'm using a machine
with 32GB RAM.

Is it recommended to further increase the heap size to like 8GB or 16GB?

Regards,
Edwin
On 23 Aug 2015 10:23, "Shawn Heisey"  wrote:

> On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote:
> > I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.
> >
> > However, I find that clustering is exceedingly slow after I index this
> > 1GB of data. It took almost 30 seconds to return the cluster results
> > when I set it to cluster the top 1000 records, and it still takes more
> > than 3 seconds when I set it to cluster the top 100 records.
> >
> > Is this speed normal? Because I understand Solr can index terabytes of
> > data without the performance being impacted so much, but now the
> > collection is slowing down even with just 1GB of data.
>
> Have you increased the heap size?  If you simply start Solr 5.x with the
> included script and don't use any commandline options, Solr will only
> have a 512MB heap.  This is *extremely* small.  A significant chunk of
> that 512MB heap will be required just to start Jetty and Solr, so
> there's not much memory left for manipulating the index data and serving
> queries.  Assuming you have at least 4GB of RAM, try adding "-m 2g" to
> the start commandline.
>
> Thanks,
> Shawn
>
>


Re: Solr performance is slow with just 1GB of data indexed

2015-08-22 Thread Shawn Heisey
On 8/22/2015 7:31 PM, Zheng Lin Edwin Yeo wrote:
> I'm using Solr 5.2.1, and I've indexed about 1GB of data into Solr.
> 
> However, I find that clustering is exceedingly slow after I index this 1GB
> of data. It took almost 30 seconds to return the cluster results when I set
> it to cluster the top 1000 records, and it still takes more than 3 seconds
> when I set it to cluster the top 100 records.
> 
> Is this speed normal? Because I understand Solr can index terabytes of data
> without the performance being impacted so much, but now the collection is
> slowing down even with just 1GB of data.

Have you increased the heap size?  If you simply start Solr 5.x with the
included script and don't use any commandline options, Solr will only
have a 512MB heap.  This is *extremely* small.  A significant chunk of
that 512MB heap will be required just to start Jetty and Solr, so
there's not much memory left for manipulating the index data and serving
queries.  Assuming you have at least 4GB of RAM, try adding "-m 2g" to
the start commandline.
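
For example, with the included script ("-m" sets both the minimum and
maximum heap for the Solr JVM):

bin/solr start -m 2g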

Thanks,
Shawn