This might not be a good match for Solr, or for many other systems. It does 
seem like a natural fit for MarkLogic, which natively searches and selects 
over XML documents.

Disclaimer: I worked at MarkLogic for a couple of years.

wunder

On Mar 28, 2013, at 9:27 AM, Mike Haas wrote:

> Thanks for your reply, Roman. Unfortunately, the business has been running
> this way forever, so I don't think it would be feasible to switch from a
> segment store to a whole-document store. Even then, if I understand you
> correctly, it would not work for our needs, because we don't care about any
> other parts of the document, just the segment. If a similar segment is in
> an entirely different document, we want that segment.
> 
> I'll keep taking any and all feedback, however, so that I can develop an
> idea and present it to my manager.
> 
> 
> On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla <roman.ch...@gmail.com> wrote:
> 
>> Apologies if you already do something similar, but perhaps of general
>> interest...
>> 
>> A different approach to your problem is to compute a local fingerprint
>> for each document: if you want to find documents with overlapping
>> segments, this algorithm dramatically reduces the number of segments you
>> create/search for every document:
>> 
>> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
>> 
>> Then you simply index each document and, upon submission, compute its
>> fingerprints and query for them. I don't remember the exact numbers, but
>> my feeling is that you end up storing ~13% of the document text (besides,
>> each fingerprint is a single token, therefore quite fast to search for -
>> you could even try one huge boolean query with 1024 clauses, ouch... :))
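>> 
>> A rough Python sketch of the winnowing idea from that paper; k, w, and
>> the multivalued "fp" Solr field are illustrative assumptions, not values
>> from the paper, and min() skips the paper's positional tie-breaking:
>> 
>>     import zlib
>> 
>>     def kgram_hashes(text, k=5):
>>         # Normalize (lowercase, drop whitespace), then hash every
>>         # k-character substring with a stable hash function.
>>         text = "".join(text.lower().split())
>>         return [zlib.crc32(text[i:i + k].encode("utf-8"))
>>                 for i in range(len(text) - k + 1)]
>> 
>>     def winnow(hashes, w=4):
>>         # Keep the minimum hash of each window of w consecutive
>>         # hashes; any match of length >= w + k - 1 is guaranteed
>>         # to share at least one selected fingerprint.
>>         if len(hashes) < w:          # short segment: keep everything
>>             return set(hashes)
>>         return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}
>> 
>>     def fingerprint_query(fingerprints, field="fp"):
>>         # One boolean OR query per segment instead of one request
>>         # per fingerprint; POST it to avoid URL-length limits.
>>         return "%s:(%s)" % (field,
>>                             " OR ".join(str(h) for h in sorted(fingerprints)))
>> 
>>     # e.g. fingerprint_query(winnow(kgram_hashes("some segment text")))
>>     # -> a query like fp:(h1 OR h2 OR ...)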
>> 
>> roman
>> 
>> On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas <mikehaas...@gmail.com> wrote:
>> 
>>> Hello. My company is currently thinking of switching over to Solr 4.2,
>>> coming off of SQL Server. However, what we need to do is a bit weird.
>>> 
>>> Right now, we have ~12 million segments and growing. Usually these are
>>> sentences but can be other things. These segments are what will be stored
>>> in Solr. I’ve already done that.
>>> 
>>> Now, what happens is a user will upload, say, a Word document to us. We
>>> then parse it and process it into segments. It very well could be 5000
>>> segments or even more in that Word document. Each one of those ~5000
>>> segments needs to be searched for similar segments in Solr. I'm not quite
>>> sure how I will do the query (whether proximate or something else). The
>>> point, though, is to get back similar results for each segment.
>>> 
>>> However, I think I'm seeing a bigger problem first. I have to search
>>> against ~5000 segments. That would be 5000 HTTP requests. That's a lot!
>>> I'm pretty sure that would take a LOT of hardware. Keep in mind this
>>> could be happening with maybe 4 different users at once right now (and
>>> of course more in the future). Is there a good way to send a batch query
>>> over one (or at least a lot fewer) HTTP requests?
>>> 
>>> If not, what kinds of things could I do to implement such a feature (if
>>> feasible, of course)?
>>> 
>>> 
>>> Thanks,
>>> 
>>> Mike
>>> 
>> 

--
Walter Underwood
wun...@wunderwood.org


