I honestly suspect your performance issue is down to the number of terms
you are passing into the clustering algorithm, not to memory usage as
such. If you have *huge* documents and cluster across them, performance
will be slower, by definition.

Clustering is usually done offline, for example over a large dataset,
taking a few hours or even days. Carrot2 manages to reduce this to a
reasonable "online" task by only clustering a few search results. If
you increase the number of documents (from, say, 100 to 1000) and increase
the number of terms in each document, you are inherently making the
clustering algorithm work harder, and therefore it *IS* going to
take longer. Either use fewer documents, or only use the first 1000 terms
when clustering, or do your clustering offline and include the results
of the clustering in your index.
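
For the "first N terms" option, Solr's clustering component can cluster
highlighter-generated snippets instead of whole stored fields. Roughly
something like this in the clustering handler's defaults in
solrconfig.xml (parameter names from memory, so verify against your
Solr version; note fragSize is measured in characters, not terms):

  <str name="carrot.snippet">content</str>
  <str name="carrot.produceSummary">true</str>
  <str name="carrot.fragSize">1000</str>

That should drastically cut the amount of text Carrot2 sees per
document.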

Upayavira

On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
> 
> I've tried using just index=true, and the speed is still the same, not
> any faster. If I set store=false, no results come back from the
> clustering. Is this because the field is not stored, and the clustering
> requires fields that are both indexed and stored?
> 
> I've also increased my heap size to 16GB, as I'm using a machine with
> 32GB RAM, but there is no significant improvement in performance either.
> 
> Regards,
> Edwin
> 
> 
> 
> On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
> 
> > Yes, I'm using store=true.
> > <field name="content" type="text_general" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> > However, this field needs to be stored, as my program requires it to be
> > returned during normal searching. I tried lazyLoading=true, but it's
> > not working.
> >
> > Would you do a copyField for the content, and not set stored="true" on
> > that field? Then that field would only be referenced for the
> > clustering, and normal search would reference the original content
> > field.
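> >
> > Roughly what I have in mind (untested; the copied field name is just
> > an example):
> >
> > <field name="content_cluster" type="text_general" indexed="true"
> > stored="false" termVectors="true"/>
> > <copyField source="content" dest="content_cluster"/>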
> >
> > Regards,
> > Edwin
> >
> >
> >
> >
> > On 23 August 2015 at 23:51, Alexandre Rafalovitch <arafa...@gmail.com>
> > wrote:
> >
> >> Are you by any chance doing store=true on the fields you want to search?
> >>
> >> If so, you may want to switch to just index=true. Of course, those
> >> fields will then not come back in the results, but do you really want
> >> to sling huge content fields around?
> >>
> >> The other option is to do lazyLoading=true and not request that field.
> >> As a test, you could actually do that without needing to reindex,
> >> just with a restart. That would give you a way to check whether the
> >> stored field size is the issue.
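> >>
> >> For the lazy loading route, that's enableLazyFieldLoading in the
> >> <query> section of solrconfig.xml, plus leaving the big field out of
> >> fl. Roughly (the fl fields below are just examples):
> >>
> >>   <query>
> >>     <enableLazyFieldLoading>true</enableLazyFieldLoading>
> >>   </query>
> >>
> >> and then query with fl=id,title so the content field is never
> >> loaded.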
> >>
> >> Regards,
> >>    Alex.
> >> ----
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> wrote:
> >> > Hi Shawn and Toke,
> >> >
> >> > I only have 520 docs in my data, but each of the documents is quite
> >> > big in size; in Solr, the index is using 221MB. So when I set it to
> >> > read the top 1000 rows, it should just be reading all 520 docs that
> >> > are indexed?
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 23 August 2015 at 22:52, Shawn Heisey <apa...@elyograg.org> wrote:
> >> >
> >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> >> >> > Hi Shawn,
> >> >> >
> >> >> > Yes, I've increased the heap size to 4GB already, and I'm using a
> >> >> > machine with 32GB RAM.
> >> >> >
> >> >> > Is it recommended to further increase the heap size to like 8GB
> >> >> > or 16GB?
> >> >>
> >> >> Probably not, but I know nothing about your data.  How many Solr docs
> >> >> were created by indexing 1GB of data?  How much disk space is used by
> >> >> your Solr index(es)?
> >> >>
> >> >> I know very little about clustering, but it looks like you've gotten
> >> >> a reply from Toke, who knows a lot more about that part of the code
> >> >> than I do.
> >> >>
> >> >> Thanks,
> >> >> Shawn
> >> >>
> >> >>
> >>
> >
> >
