Thank you Upayavira for your reply.

I would like to confirm: when I set rows=100, does it mean that the
clusters are built based only on the first 100 records returned by the
search, so that if 1000 records match the search, the remaining 900
records will not be considered for clustering?
If that is the case, the clustering result may not be very accurate, as
there is a possibility that the first 100 records are largely similar
to one another, while the subsequent 900 records have differences that
could have an impact on the clustering result.
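
For reference, the clustering request I am sending looks roughly like
this (the collection name and handler path are just examples from my
setup, not necessarily what others use):

http://localhost:8983/solr/collection1/clustering?q=content:test&rows=100&wt=json

So if my understanding is correct, only the 100 documents returned by
this request are ever passed to the clustering engine.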

Regards,
Edwin


On 24 August 2015 at 17:50, Upayavira <u...@odoko.co.uk> wrote:

> I honestly suspect your performance issue is down to the number of terms
> you are passing into the clustering algorithm, not to memory usage as
> such. If you have *huge* documents and cluster across them, performance
> will be slower, by definition.
>
> Clustering is usually done offline, for example on a large dataset
> taking a few hours or even days. Carrot2 manages to reduce this time to
> a reasonable "online" task by only clustering a few search results. If
> you increase the number of documents (from say 100 to 1000) and increase
> the number of terms in each document, you are inherently making the
> clustering algorithm have to work harder, and therefore it *IS* going to
> take longer. Either use fewer documents, or only use the first 1000
> terms when clustering, or do your clustering offline and include the
> results of the clustering in your index.
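>
> As a concrete sketch (untested, and the parameter names are from the
> Carrot2 clustering component documentation as I recall them, so do
> verify against the docs), you can ask the component to cluster
> highlighter-produced summaries instead of whole documents:
>
>   carrot.produceSummary=true    (cluster snippets, not whole fields)
>   carrot.fragSize=150           (snippet size in characters)
>   carrot.summarySnippets=1      (snippets taken per document)
>
> That bounds the number of terms per document the algorithm sees, which
> is usually what dominates the clustering time.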
>
> Upayavira
>
> On Mon, Aug 24, 2015, at 04:59 AM, Zheng Lin Edwin Yeo wrote:
> > Hi Alexandre,
> >
> > I've tried using just indexed=true, and the speed is still the same,
> > not any faster. If I set stored=false, no results come back from the
> > clustering. Is this because the fields are not stored, and the
> > clustering requires fields that are stored?
> >
> > I've also increased my heap size to 16GB, as I'm using a machine
> > with 32GB of RAM, but there is no significant improvement in
> > performance either.
> >
> > Regards,
> > Edwin
> >
> >
> >
> > On 24 August 2015 at 10:16, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Yes, I'm using stored=true.
> > > <field name="content" type="text_general" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > >
> > > However, this field needs to be stored, as my program requires this
> > > field to be returned during normal searching. I tried
> > > lazyLoading=true, but it's not working.
> > >
> > > Would you do a copyField for the content and not set stored="true"
> > > on that copy, so that the copied field is used only for clustering,
> > > while normal search still references the original content field?
> > > Something along the lines of the sketch below.
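> > >
> > > Something like this in schema.xml is what I have in mind (the
> > > copy-field name is just an example, and I have not verified that
> > > clustering works against an unstored copy):
> > >
> > > <field name="content" type="text_general" indexed="true"
> > >        stored="true"/>
> > > <!-- unstored copy, intended only as clustering input -->
> > > <field name="content_cluster" type="text_general" indexed="true"
> > >        stored="false" termVectors="true"/>
> > > <copyField source="content" dest="content_cluster"/>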
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > >
> > > On 23 August 2015 at 23:51, Alexandre Rafalovitch <arafa...@gmail.com>
> > > wrote:
> > >
> > >> Are you by any chance doing stored=true on the fields you want to
> > >> search?
> > >>
> > >> If so, you may want to switch to just indexed=true. Of course, they
> > >> will then not come back in the results, but do you really want to
> > >> sling huge content fields around?
> > >>
> > >> The other option is to do lazyLoading=true and not request that
> > >> field. As a test, you could actually do this without needing to
> > >> reindex Solr, just with a restart. This would give you a way to
> > >> test whether the stored field size is the issue.
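> > >>
> > >> The lazy-loading switch lives in solrconfig.xml under the <query>
> > >> section (it should already be present in the default config, so
> > >> just check its value):
> > >>
> > >> <query>
> > >>   <!-- load stored fields only when they are actually requested -->
> > >>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
> > >> </query>
> > >>
> > >> Then run your query with fl excluding the big field, e.g.
> > >> fl=id,title, and compare response times.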
> > >>
> > >> Regards,
> > >>    Alex.
> > >> ----
> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > >> http://www.solr-start.com/
> > >>
> > >>
> > >> On 23 August 2015 at 11:13, Zheng Lin Edwin Yeo
> > >> <edwinye...@gmail.com> wrote:
> > >> > Hi Shawn and Toke,
> > >> >
> > >> > I only have 520 docs in my data, but each of the documents is
> > >> > quite big; in Solr, the index is using 221MB. So when I set it to
> > >> > read the top 1000 rows, it should just be reading all 520 docs
> > >> > that are indexed?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >> >
> > >> >
> > >> > On 23 August 2015 at 22:52, Shawn Heisey <apa...@elyograg.org>
> > >> > wrote:
> > >> >
> > >> >> On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
> > >> >> > Hi Shawn,
> > >> >> >
> > >> >> > Yes, I've increased the heap size to 4GB already, and I'm
> > >> >> > using a machine with 32GB RAM.
> > >> >> >
> > >> >> > Is it recommended to further increase the heap size to, say,
> > >> >> > 8GB or 16GB?
> > >> >>
> > >> >> Probably not, but I know nothing about your data.  How many Solr
> > >> >> docs were created by indexing 1GB of data?  How much disk space
> > >> >> is used by your Solr index(es)?
> > >> >>
> > >> >> I know very little about clustering, but it looks like you've
> > >> >> gotten a reply from Toke, who knows a lot more about that part of
> > >> >> the code than I do.
> > >> >>
> > >> >> Thanks,
> > >> >> Shawn
> > >> >>
> > >> >>
> > >>
> > >
> > >
>
