Re: Merging documents from a distributed search

2015-09-08 Thread tedsolr
Joel,

It needs to perform. Typically users will have 1 - 5 million rows in a
query, returning 10 - 15 fields. Grouping normally reduces the return by 50%
or more. Responses tend to be less than half a second.

It sounds like the manipulation of docs at the collector level has been left
to single Solr node implementations, and that your streaming API is the
way forward for cloud implementations, even if it does have some performance
drawbacks. I can bear slower searches as long as they are not seconds
slower.

I could implement some business strategy that forks searching to either the
AnalyticsQuery or the streaming API based on the shard count in the
collection. Most of my customers will have single shard collections. A goal
of mine is to keep each collection whole as long as possible. If one gets
too big for the pond I'll move it to a bigger pond, until some heap limit is
reached when it will have to be split. 
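The forking strategy described above can be sketched as a small dispatcher. This is a hypothetical illustration, not Solr API; the class and method names are mine, and in practice the shard count would come from the collection's cluster state at request time.

```java
// Hypothetical dispatcher: route a request to the AnalyticsQuery path for
// single-shard collections, or to the Streaming API path once sharded.
public class DedupeStrategyPicker {
    public enum Path { ANALYTICS_QUERY, STREAMING_API }

    public static Path pick(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // A single shard can dedupe locally in a DelegatingCollector;
        // multiple shards need a distributed merge, hence streaming.
        return shardCount == 1 ? Path.ANALYTICS_QUERY : Path.STREAMING_API;
    }
}
```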



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Merging-documents-from-a-distributed-search-tp4226802p4227595.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Merging documents from a distributed search

2015-09-04 Thread Joel Bernstein
It's possible that the ReducerStream's buffer can grow too large if
document groups are very large. But the ReducerStream only needs to hold
one group at a time in memory. The RollupStream, in trunk, has a grouping
implementation that doesn't hang on to all the Tuples from a group. You
could also implement a custom stream that does exactly what you need.
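The one-group-at-a-time behavior of the ReducerStream can be sketched without any Solr dependencies. This is an illustrative reduction over a pre-sorted key stream, not Solr's actual ReducerStream implementation: because tuples arrive sorted by group key, only the current group's state is held in memory, and a finished group is emitted as soon as the key changes.

```java
import java.util.LinkedHashMap;
import java.util.List;

// Sketch of the ReducerStream idea: hold one group at a time, flush on key change.
public class GroupReducer {
    // keys must be pre-sorted, as a sorted/shuffled stream would deliver them
    public static LinkedHashMap<String, Integer> reduce(List<String> sortedKeys) {
        LinkedHashMap<String, Integer> out = new LinkedHashMap<>();
        String current = null;
        int count = 0;
        for (String key : sortedKeys) {
            if (!key.equals(current)) {
                if (current != null) out.put(current, count); // flush finished group
                current = key;
                count = 0;
            }
            count++;
        }
        if (current != null) out.put(current, count); // flush the last group
        return out;
    }
}
```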

The AnalyticsQuery is much more efficient because the data is left in place.
The Streaming API has considerable streaming overhead. But it's
the Stream "shuffling" that gives you the power to do things like fully
distributed grouping.
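The reason shuffling enables fully distributed grouping can be shown in a few lines. This is an illustrative sketch, not Solr's shuffle implementation: if tuples are partitioned by a hash of the group key, every tuple of a group lands on the same worker, so each worker can reduce its groups independently.

```java
// Sketch of hash partitioning for a shuffle: same group key -> same worker.
public class Shuffle {
    public static int partitionFor(String groupKey, int numWorkers) {
        // Math.floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(groupKey.hashCode(), numWorkers);
    }
}
```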

How many records are processed in a typical query and what type of response
time do you need?

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 3, 2015 at 3:25 PM, tedsolr  wrote:

> Thanks Joel, that link looks promising. The CloudSolrStream bypasses my
> issue
> of multiple shards. Perhaps the ReducerStream would provide what I need. At
> first glance I worry that the buffer would grow too large - if it's
> really holding the values for all the fields in each document
> (Tuple.getMaps()). I use a Map in my DelegatingCollector to store the
> "unique" docs, but I only keep the docId, my stats, and the ordinals for
> each field. Would you expect the new streams API to perform as well as my
> implementation of an AnalyticsQuery and a DelegatingCollector?
>
>
>
>


Re: Merging documents from a distributed search

2015-09-04 Thread tedsolr
Upayavira ,

The docs are all unique. In my example the two docs are considered to be
dupes because the requested fields all have the same values.
fields:    A     B   C   D    E
Doc 1:  apple,  10, 15, bye, yellow
Doc 2:  apple,  12, 15, by,  green

The two docs are certainly unique. Say they are on different shards in the
same collection. If the search request has fl:A,C then the two are dupes and
the user wants to see them collapsed. If the search request has fl:A,B,C
then the two are unique from the user's perspective and display separately.

Each doc typically has a couple hundred fields. When viewed through the lens
of just 3 or 4 fields, lots of docs, sometimes thousands, will be rolled up and
I'll compute some stats on that group. Bringing all those docs back to the
calling app for processing is too slow. The AnalyticsQuery does a great job
of filtering out the dupes, but it looks like I need another solution for
multi shard collections.
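The run-time "dupe" definition described above amounts to projecting each doc onto the requested fields and grouping on that projection. The sketch below is illustrative only (names are mine, and real collectors work on docIds and ordinals rather than stored values): with fl:A,C the two example docs collapse into one group; with fl:A,B,C they stay distinct.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: dupes are docs whose requested-field values all match.
public class FieldProjectionDedupe {
    // returns group key (projected field values) -> number of docs collapsed into it
    public static Map<List<Object>, Integer> collapse(
            List<Map<String, Object>> docs, List<String> fl) {
        Map<List<Object>, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            List<Object> key = new ArrayList<>();
            for (String f : fl) key.add(doc.get(f)); // project onto requested fields
            counts.merge(key, 1, Integer::sum);      // collapse and count dupes
        }
        return counts;
    }
}
```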





RE: Merging documents from a distributed search

2015-09-03 Thread Markus Jelsma
Hello - another current topic also covers this issue; you may want to
check it out:
http://lucene.472066.n3.nabble.com/Merging-documents-from-a-distributed-search-td4226802.html

 
 
-Original message-
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Thursday 3rd September 2015 10:27
> To: solr-user@lucene.apache.org
> Subject: RE: Merging documents from a distributed search
> 
> Hello - We're doing something similar and ended up overriding QueryComponent 
> (https://issues.apache.org/jira/browse/SOLR-7968) which needs protected 
> members instead of private members first. We could do a RankQuery and use its 
> cool MergeStrategy, but we would also need RankQuery to provide an entry 
> point for QueryComponent.createMainQuery(). That would be ideal because we 
> can then use the Collector there for local deduplication, and a combination 
> of createMainQuery and mergeIds to do the distributed deduplication.
> 
> Markus
>  
> -Original message-
> > From:Joel Bernstein <joels...@gmail.com>
> > Sent: Wednesday 2nd September 2015 23:46
> > To: solr-user@lucene.apache.org
> > Subject: Re: Merging documents from a distributed search
> > 
> > The merge strategy probably won't work for the type of distributed collapse
> > you're describing.
> > 
> > You may want to begin exploring the Streaming API which supports real-time
> > map/reduce operations,
> > 
> > http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html
> > 
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> > 
> > On Wed, Sep 2, 2015 at 5:12 PM, tedsolr <tsm...@sciquest.com> wrote:
> > 
> > > I've read from  http://heliosearch.org/solrs-mergestrategy/
> > > <http://heliosearch.org/solrs-mergestrategy/>   that the AnalyticsQuery
> > > component only works for a single instance of Solr. I'm planning to
> > > "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> > > that collapses what I consider to be duplicate documents, keeping stats
> > > like
> > > a "count" of the dupes. For my purposes "dupes" are determined at run time
> > > and vary by the search request. Once a collection has multiple shards I
> > > will
> > > not be able to prevent "dupes" from appearing across those shards. A 
> > > custom
> > > merge strategy should allow me to merge my stats, but I don't see how I 
> > > can
> > > drop duplicate docs at that point.
> > >
> > > If shard1 returns docs A & B and shard2 returns docs B & C (letters
> > > denoting
> > > what I consider to be unique docs), can my implementation of a merge
> > > strategy return only docs A, B, & C, rather than A, B, B, & C?
> > >
> > > thanks!
> > > solr 5.2.1
> > >
> > >
> > >
> > >
> > 
> 


RE: Merging documents from a distributed search

2015-09-03 Thread tedsolr
Markus, did you mistakenly post a link to this same thread?





Re: Merging documents from a distributed search

2015-09-03 Thread tedsolr
Thanks Joel, that link looks promising. The CloudSolrStream bypasses my issue
of multiple shards. Perhaps the ReducerStream would provide what I need. At
first glance I worry that the buffer would grow too large - if it's
really holding the values for all the fields in each document
(Tuple.getMaps()). I use a Map in my DelegatingCollector to store the
"unique" docs, but I only keep the docId, my stats, and the ordinals for
each field. Would you expect the new streams API to perform as well as my
implementation of an AnalyticsQuery and a DelegatingCollector?





Re: Merging documents from a distributed search

2015-09-03 Thread Upayavira


On Wed, Sep 2, 2015, at 10:12 PM, tedsolr wrote:
> I've read from  http://heliosearch.org/solrs-mergestrategy/
>    that the AnalyticsQuery
> component only works for a single instance of Solr. I'm planning to
> "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> that collapses what I consider to be duplicate documents, keeping stats
> like
> a "count" of the dupes. For my purposes "dupes" are determined at run
> time
> and vary by the search request. Once a collection has multiple shards I
> will
> not be able to prevent "dupes" from appearing across those shards. A
> custom
> merge strategy should allow me to merge my stats, but I don't see how I
> can
> drop duplicate docs at that point.
> 
> If shard1 returns docs A & B and shard2 returns docs B & C (letters
> denoting
> what I consider to be unique docs), can my implementation of a merge
> strategy return only docs A, B, & C, rather than A, B, B, & C?

How did you end up with document B in both shard1 and shard2? Can't you
prevent that from happening, and thus not have this issue?

Upayavira


RE: Merging documents from a distributed search

2015-09-03 Thread Markus Jelsma
Hello - We're doing something similar and ended up overriding QueryComponent 
(https://issues.apache.org/jira/browse/SOLR-7968) which needs protected members 
instead of private members first. We could do a RankQuery and use its cool 
MergeStrategy, but we would also need RankQuery to provide an entry point for 
QueryComponent.createMainQuery(). That would be ideal because we can then use 
the Collector there for local deduplication, and a combination of 
createMainQuery and mergeIds to do the distributed deduplication.
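The split between local and distributed deduplication can be sketched in miniature. This is illustrative only, not Solr's mergeIds machinery: assume each shard has already collapsed its own dupes into (key, count) pairs, so the merge step only has to combine the per-shard maps, summing counts. With the thread's example, shard1 {A=1, B=1} and shard2 {B=1, C=1} merge to {A=1, B=2, C=1} rather than A, B, B, C.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the distributed merge step: combine per-shard deduped results,
// summing the dupe counts for keys that appear on more than one shard.
public class DistributedMerge {
    @SafeVarargs
    public static Map<String, Integer> merge(Map<String, Integer>... shardResults) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> shard : shardResults) {
            shard.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }
}
```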

Markus
 
-Original message-
> From:Joel Bernstein <joels...@gmail.com>
> Sent: Wednesday 2nd September 2015 23:46
> To: solr-user@lucene.apache.org
> Subject: Re: Merging documents from a distributed search
> 
> The merge strategy probably won't work for the type of distributed collapse
> you're describing.
> 
> You may want to begin exploring the Streaming API which supports real-time
> map/reduce operations,
> 
> http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, Sep 2, 2015 at 5:12 PM, tedsolr <tsm...@sciquest.com> wrote:
> 
> > I've read from  http://heliosearch.org/solrs-mergestrategy/
> > <http://heliosearch.org/solrs-mergestrategy/>   that the AnalyticsQuery
> > component only works for a single instance of Solr. I'm planning to
> > "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> > that collapses what I consider to be duplicate documents, keeping stats
> > like
> > a "count" of the dupes. For my purposes "dupes" are determined at run time
> > and vary by the search request. Once a collection has multiple shards I
> > will
> > not be able to prevent "dupes" from appearing across those shards. A custom
> > merge strategy should allow me to merge my stats, but I don't see how I can
> > drop duplicate docs at that point.
> >
> > If shard1 returns docs A & B and shard2 returns docs B & C (letters
> > denoting
> > what I consider to be unique docs), can my implementation of a merge
> > strategy return only docs A, B, & C, rather than A, B, B, & C?
> >
> > thanks!
> > solr 5.2.1
> >
> >
> >
> >
> 


RE: Merging documents from a distributed search

2015-09-03 Thread Markus Jelsma
It seems so indeed. Please look up the thread titled "Custom merge logic in 
SolrCloud."   

 
 
-Original message-
> From:tedsolr <tsm...@sciquest.com>
> Sent: Thursday 3rd September 2015 21:28
> To: solr-user@lucene.apache.org
> Subject: RE: Merging documents from a distributed search
> 
> Markus, did you mistakenly post a link to this same thread?
> 
> 
> 
> 


Re: Merging documents from a distributed search

2015-09-02 Thread Joel Bernstein
The merge strategy probably won't work for the type of distributed collapse
you're describing.

You may want to begin exploring the Streaming API which supports real-time
map/reduce operations,

http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Sep 2, 2015 at 5:12 PM, tedsolr  wrote:

> I've read from  http://heliosearch.org/solrs-mergestrategy/
>    that the AnalyticsQuery
> component only works for a single instance of Solr. I'm planning to
> "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> that collapses what I consider to be duplicate documents, keeping stats
> like
> a "count" of the dupes. For my purposes "dupes" are determined at run time
> and vary by the search request. Once a collection has multiple shards I
> will
> not be able to prevent "dupes" from appearing across those shards. A custom
> merge strategy should allow me to merge my stats, but I don't see how I can
> drop duplicate docs at that point.
>
> If shard1 returns docs A & B and shard2 returns docs B & C (letters
> denoting
> what I consider to be unique docs), can my implementation of a merge
> strategy return only docs A, B, & C, rather than A, B, B, & C?
>
> thanks!
> solr 5.2.1
>
>
>
>