I will definitely let you all know what we end up doing. I realized I
forgot to mention something that might make what we do more clear.

Right now we use SQL Server full-text search to get back fairly similar
matches for each segment. We do this with some funky SQL which I didn't
write and haven't even looked at. It gives us back about 100 results. They
aren't really all that good as matches, though; they just give us something
to work with. So although some results are good, some are horrible. Then, to
make sure we truly have a good match, we take each of those ~100 results and
run it through a Levenshtein algorithm implemented in C#. Levenshtein gives
back a % match, and we then use the highest match as long as it is above
85%.
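
In case a concrete sketch helps, here's roughly what that re-ranking step
looks like. It's written in Java just for illustration (our real code is
C#), and the class/method names are made up, but the logic is the same:
score every candidate with Levenshtein and keep the best one only if it's
above 85%.

import java.util.List;

public class SegmentMatcher {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity as a percentage: 100 means the strings are identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100.0;
        return 100.0 * (1.0 - (double) levenshtein(a, b) / maxLen);
    }

    // Take the ~100 full-text candidates and return the best one,
    // but only if its similarity is above 85%.
    static String bestMatch(String segment, List<String> candidates) {
        String best = null;
        double bestScore = 85.0;
        for (String c : candidates) {
            double score = similarity(segment, c);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best; // null if nothing was above 85%
    }
}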

Hope this makes it a little clearer what we are doing.


On Thu, Mar 28, 2013 at 11:39 AM, Roman Chyla <roman.ch...@gmail.com> wrote:

> On Thu, Mar 28, 2013 at 12:27 PM, Mike Haas <mikehaas...@gmail.com> wrote:
>
> > Thanks for your reply, Roman. Unfortunately, the business has been
> > running this way forever so I don't think it would be feasible to switch
> > to a whole
> >
>
> sure, no arguing against that :)
>
>
> > document store versus a segments store. Even then, if I understand you
> > correctly, it would not work for our needs. That's because we don't care
> > about any other parts of the document, just the segment. If a similar
> > segment is in an entirely different document, we want that segment.
> >
>
> the algo should work for this case - the beauty of local winnowing is
> that it is *local*, i.e. it tends to select the same segments from the
> text. If you process two documents written by two different people, but
> they cited the same thing and it is longer than 'm' tokens, you will have
> at least one identical fingerprint from both documents - which means:
> match! Then, of course, you can store the position offsets of the original
> words of the fingerprint, retrieve the original, compute the ratio of
> overlap, etc. - but a database seems better suited for that kind of job...
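>
> roughly, the token-level winnowing selection could be sketched like this
> (K and W below are arbitrary example values, not from the paper; the
> guarantee is that any shared run of at least W + K - 1 tokens produces at
> least one shared fingerprint):
>
> import java.util.ArrayList;
> import java.util.List;
>
> public class Winnow {
>
>     static final int K = 5; // tokens per k-gram (example value)
>     static final int W = 4; // winnowing window size (example value)
>
>     static List<Long> fingerprints(String[] tokens) {
>         // hash every k-gram of tokens
>         int n = Math.max(tokens.length - K + 1, 0);
>         long[] hashes = new long[n];
>         for (int i = 0; i < n; i++) {
>             long h = 1125899906842597L;
>             for (int j = 0; j < K; j++) {
>                 h = 31 * h + tokens[i + j].hashCode();
>             }
>             hashes[i] = h;
>         }
>         // slide a window over the hashes and keep the minimum of each
>         // window, recording it only when the chosen position changes
>         List<Long> selected = new ArrayList<Long>();
>         int lastPicked = -1;
>         for (int i = 0; i + W <= n; i++) {
>             int min = i;
>             for (int j = i + 1; j < i + W; j++) {
>                 if (hashes[j] <= hashes[min]) min = j;
>             }
>             if (min != lastPicked) {
>                 selected.add(hashes[min]);
>                 lastPicked = min;
>             }
>         }
>         return selected; // index these per document, one token each
>     }
> }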
>
> let us know what you adopt!
>
> ps: MoreLikeThis selects 'significant' tokens from the document you
> selected and then constructs a new boolean query searching for those.
> http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
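>
> a bare-bones Lucene sketch of that, assuming a field called "segment"
> (the tuning values are arbitrary):
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.queries.mlt.MoreLikeThis;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.Version;
>
> public class MltExample {
>     // Build a "more like this" query from an already-indexed document.
>     public static TopDocs similar(IndexReader reader, int docId) throws Exception {
>         MoreLikeThis mlt = new MoreLikeThis(reader);
>         mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_42));
>         mlt.setFieldNames(new String[] { "segment" });
>         mlt.setMinTermFreq(1); // segments are short, keep rare terms
>         mlt.setMinDocFreq(1);
>         Query q = mlt.like(docId); // the 'significant' terms OR'ed together
>         return new IndexSearcher(reader).search(q, 10);
>     }
> }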
>
> >
> > I'll keep taking any and all feedback, however, so that I can develop an
> > idea and present it to my manager.
> >
> >
> > On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla <roman.ch...@gmail.com>
> > wrote:
> >
> > > Apologies if you already do something similar, but perhaps of general
> > > interest...
> > >
> > > One (different) approach to your problem is to implement a local
> > > fingerprint - if you want to find documents with overlapping segments,
> > > this algorithm will dramatically reduce the number of segments you
> > > create/search for every document:
> > >
> > > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> > >
> > > Then you simply end up indexing each document and, upon submission,
> > > computing fingerprints and querying for them. I don't know (i.e.
> > > remember) the exact numbers, but my feeling is that you end up storing
> > > ~13% of the document text (besides, it is a one-token fingerprint,
> > > therefore quite fast to search for - you could even try one huge
> > > boolean query with 1024 clauses, ouch... :))
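> > >
> > > for illustration, querying the indexed fingerprints from SolrJ could
> > > look roughly like this (the field name "fp" and the core URL are made
> > > up; the 1024 figure is Solr's default maxBooleanClauses limit):
> > >
> > > import java.util.List;
> > > import org.apache.solr.client.solrj.SolrQuery;
> > > import org.apache.solr.client.solrj.impl.HttpSolrServer;
> > > import org.apache.solr.common.SolrDocumentList;
> > >
> > > public class FingerprintLookup {
> > >     // Find all segments/documents sharing at least one fingerprint.
> > >     // Assumes a non-empty fingerprint list.
> > >     public static SolrDocumentList lookup(List<Long> fps) throws Exception {
> > >         HttpSolrServer solr =
> > >             new HttpSolrServer("http://localhost:8983/solr/segments");
> > >         StringBuilder q = new StringBuilder("fp:(");
> > >         for (int i = 0; i < fps.size(); i++) {
> > >             if (i > 0) q.append(" OR ");
> > >             q.append(fps.get(i));
> > >         }
> > >         q.append(")");
> > >         return solr.query(new SolrQuery(q.toString())).getResults();
> > >     }
> > > }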
> > >
> > > roman
> > >
> > > On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas <mikehaas...@gmail.com>
> > wrote:
> > >
> > > > Hello. My company is currently thinking of switching over to Solr 4.2,
> > > > coming off of SQL Server. However, what we need to do is a bit weird.
> > > >
> > > > Right now, we have ~12 million segments and growing. Usually these are
> > > > sentences but can be other things. These segments are what will be
> > > > stored in Solr. I've already done that.
> > > >
> > > > Now, what happens is a user will upload, say, a Word document to us.
> > > > We then parse it and process it into segments. It could very well be
> > > > 5000 segments or even more in that Word document. Each one of those
> > > > ~5000 segments needs to be searched for similar segments in Solr. I'm
> > > > not quite sure how I will do the query (whether a proximity query or
> > > > something else). The point, though, is to get back similar results for
> > > > each segment.
> > > >
> > > > However, I think I'm seeing a bigger problem first. I have to search
> > > > against ~5000 segments. That would be 5000 HTTP requests. That's a
> > > > lot! I'm pretty sure that would take a LOT of hardware. Keep in mind
> > > > this could be happening with maybe 4 different users at once right now
> > > > (and of course more in the future). Is there a good way to send a
> > > > batch query over one (or at least far fewer) HTTP requests?
> > > >
> > > > If not, what kinds of things could I do to implement such a feature
> > > > (if feasible, of course)?
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > >
> >
>
