On Thu, Mar 28, 2013 at 12:27 PM, Mike Haas <mikehaas...@gmail.com> wrote:

> Thanks for your reply, Roman. Unfortunately, the business has been running
> this way forever so I don't think it would be feasible to switch to a whole
>

sure, no arguing against that :)


> document store versus a segment store. Even then, if I understand you
> correctly, it would not work for our needs, since we don't care about
> any other parts of the document, just the segment. If a similar
> segment is in an entirely different document, we want that segment.
>

the algo should work for this case - the beauty of local winnowing is
that it is *local*, i.e. it tends to select the same segments from the
text (say you process two documents written by two different people -
if they cited the same thing, and it is longer than 'm' tokens, you
will get at least one identical fingerprint from both documents -
which means: match!). then of course, you can store the position
offset of the original words of the fingerprint and retrieve the
original, compute the ratio of overlap, etc... but a database seems
better suited for that kind of job...
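
if you are curious what the selection step looks like, here is a
minimal sketch in Java (K, W and the token hashing are illustrative
choices of mine, not values from the paper):

import java.util.ArrayList;
import java.util.List;

public class Winnow {

    static final int K = 5; // tokens per k-gram
    static final int W = 4; // window of k-gram hashes; any match of at
                            // least W + K - 1 tokens is guaranteed to
                            // share at least one fingerprint

    // hash every K-token shingle of the input
    static long[] kgramHashes(String[] tokens) {
        if (tokens.length < K) return new long[0];
        long[] hashes = new long[tokens.length - K + 1];
        for (int i = 0; i < hashes.length; i++) {
            long h = 0;
            for (int j = 0; j < K; j++) {
                h = 31 * h + tokens[i + j].hashCode();
            }
            hashes[i] = h;
        }
        return hashes;
    }

    // slide a window of W hashes and keep the minimum of each window
    // (rightmost on ties), deduplicating consecutive picks - this is
    // the winnowing selection from the sigmod03 paper
    static List<Long> fingerprints(String[] tokens) {
        long[] h = kgramHashes(tokens);
        List<Long> selected = new ArrayList<>();
        int lastPicked = -1;
        for (int win = 0; win + W <= h.length; win++) {
            int min = win;
            for (int i = win + 1; i < win + W; i++) {
                if (h[i] <= h[min]) min = i; // <= keeps the rightmost
            }
            if (min != lastPicked) {
                selected.add(h[min]);
                lastPicked = min;
            }
        }
        return selected;
    }
}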

let us know what you adopt!

ps: MoreLikeThis extracts the 'significant' tokens from the document
you selected and then constructs a new boolean query searching for
those.
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
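
if you want to see roughly what that looks like in code, here is a
sketch against the Lucene 4.x API (the field name "text" and the
tuning values below are assumptions, not anything from your setup):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// reader is an open IndexReader over your index
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "text" }); // field holding the segment text
mlt.setMinTermFreq(1); // the defaults are tuned for whole documents;
mlt.setMinDocFreq(1);  // short segments may need lower thresholds
Query q = mlt.like(42); // 42 = internal id of the 'seed' document
TopDocs similar = new IndexSearcher(reader).search(q, 10);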

>
> I'll keep taking any and all feedback, however, so that I can develop
> an idea and present it to my manager.
>
>
> On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Apologies if you already do something similar, but perhaps of general
> > interest...
> >
> > One different approach to your problem is to implement local
> > fingerprinting - if you want to find documents with overlapping
> > segments, this algorithm will dramatically reduce the number of
> > segments you create/search for every document:
> >
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> > Then you simply index each document, and upon submission: compute
> > fingerprints and query for them. I don't remember the exact numbers,
> > but my feeling is that you end up storing ~13% of the document text
> > (besides, each fingerprint is a single token, therefore quite fast to
> > search for - you could even try one huge boolean query with 1024
> > clauses, ouch... :))
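> >
> > a rough SolrJ sketch of that idea (the field name "fp", the core URL
> > and the row count are made up; also note that Solr caps boolean
> > queries at maxBooleanClauses, 1024 by default in solrconfig.xml):
> >
> > import org.apache.solr.client.solrj.SolrQuery;
> > import org.apache.solr.client.solrj.impl.HttpSolrServer;
> > import org.apache.solr.client.solrj.response.QueryResponse;
> >
> > HttpSolrServer server =
> >     new HttpSolrServer("http://localhost:8983/solr/segments");
> > StringBuilder q = new StringBuilder("fp:(");
> > for (long fp : fingerprints) { // output of the winnowing step
> >     q.append(fp).append(' ');  // space = OR under the default q.op
> > }
> > q.append(')');
> > QueryResponse rsp =
> >     server.query(new SolrQuery(q.toString()).setRows(100));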
> >
> > roman
> >
> > On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas <mikehaas...@gmail.com>
> > wrote:
> >
> > > Hello. My company is currently thinking of switching over to Solr 4.2,
> > > coming off of SQL Server. However, what we need to do is a bit weird.
> > >
> > > Right now, we have ~12 million segments and growing. Usually these
> > > are sentences but can be other things. These segments are what will
> > > be stored in Solr. I’ve already done that.
> > >
> > > Now, what happens is a user will upload, say, a Word document to
> > > us. We then parse it and process it into segments. There could very
> > > well be 5000 segments or even more in that Word document. Each one
> > > of those ~5000 segments needs to be searched for similar segments
> > > in Solr. I’m not quite sure how I will do the query (whether a
> > > proximity query or something else). The point, though, is to get
> > > back similar results for each segment.
> > >
> > > However, I think I’m seeing a bigger problem first. I have to
> > > search against ~5000 segments. That would be 5000 HTTP requests.
> > > That’s a lot! I’m pretty sure that would take a LOT of hardware.
> > > Keep in mind this could be happening with maybe 4 different users
> > > at once right now (and of course more in the future). Is there a
> > > good way to send a batch query over one (or at least a lot fewer)
> > > HTTP requests?
> > >
> > > If not, what kinds of things could I do to implement such a feature
> > > (if feasible, of course)?
> > >
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> >
>
