This might not be a good match for Solr, or for many other systems. It does seem like a natural fit for MarkLogic, which natively searches and selects over XML documents.
Disclaimer: I worked at MarkLogic for a couple of years.

wunder

On Mar 28, 2013, at 9:27 AM, Mike Haas wrote:

> Thanks for your reply, Roman. Unfortunately, the business has been running
> this way forever, so I don't think it would be feasible to switch to a
> whole-document store instead of a segment store. Even then, if I understand
> you correctly, it would not work for our needs, because we don't care about
> any other parts of the document, just the segment. If a similar segment is
> in an entirely different document, we want that segment.
>
> I'll keep taking any and all feedback, however, so that I can develop an
> idea and present it to my manager.
>
> On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla <roman.ch...@gmail.com> wrote:
>
>> Apologies if you already do something similar, but perhaps of general
>> interest...
>>
>> One different approach to your problem is to implement a local
>> fingerprint. If you want to find documents with overlapping segments, this
>> algorithm will dramatically reduce the number of segments you create and
>> search for every document:
>>
>> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
>>
>> Then you simply index each document, and upon submission compute its
>> fingerprints and query for them. I don't remember the exact numbers, but
>> my feeling is that you end up storing ~13% of the document text. Besides,
>> each fingerprint is a single token, therefore quite fast to search for --
>> you could even try one huge boolean query with 1024 clauses, ouch... :)
>>
>> roman
>>
>> On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas <mikehaas...@gmail.com> wrote:
>>
>>> Hello. My company is currently thinking of switching over to Solr 4.2,
>>> coming off of SQL Server. However, what we need to do is a bit weird.
>>>
>>> Right now, we have ~12 million segments, and growing. Usually these are
>>> sentences, but they can be other things. These segments are what will be
>>> stored in Solr. I've already done that.
>>> Now, what happens is a user will upload, say, a Word document to us. We
>>> then parse it and process it into segments. There could easily be 5000
>>> segments or even more in that Word document. Each one of those ~5000
>>> segments needs to be searched for similar segments in Solr. I'm not quite
>>> sure how I will do the query (whether a proximity query or something
>>> else). The point, though, is to get back similar results for each
>>> segment.
>>>
>>> However, I think I'm seeing a bigger problem first. I have to search
>>> against ~5000 segments. That would be 5000 HTTP requests. That's a lot!
>>> I'm pretty sure that would take a LOT of hardware. Keep in mind this
>>> could be happening with maybe 4 different users at once right now (and of
>>> course more in the future). Is there a good way to send a batch query
>>> over one (or at least a lot fewer) HTTP requests?
>>>
>>> If not, what kinds of things could I do to implement such a feature (if
>>> feasible, of course)?
>>>
>>> Thanks,
>>>
>>> Mike

--
Walter Underwood
wun...@wunderwood.org
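The fingerprinting scheme Roman links in the thread (the "winnowing" algorithm of Schleimer, Wilkerson, and Aiken, SIGMOD 2003) can be sketched roughly as follows. The parameter values `k` and `w` and the use of Python's built-in `hash()` are illustrative assumptions; the paper uses a rolling (Karp-Rabin style) hash for efficiency.

```python
# Rough sketch of winnowing (Schleimer, Wilkerson, Aiken, SIGMOD 2003),
# the local fingerprinting algorithm linked in the thread. Parameter
# values and the built-in hash() are illustrative, not from the paper.

def kgram_hashes(text, k):
    """Hash every k-character substring (k-gram) of the text."""
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(text, k=5, w=4):
    """Keep one fingerprint per window of w consecutive k-gram hashes:
    the minimum hash in the window, rightmost occurrence on ties.
    Returns a set of (hash, position) pairs."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        j = w - 1 - window[::-1].index(m)  # rightmost minimum
        fingerprints.add((window[j], i + j))
    return fingerprints
```

The property that makes this useful here: any two texts sharing a substring of length at least w + k - 1 are guaranteed to share at least one fingerprint hash, so indexing only the fingerprints (a small fraction of the text) is sufficient for overlap detection.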
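On Mike's batching question, one common workaround is the one Roman hints at: collapse many per-segment lookups into a few large boolean OR queries, one HTTP POST each, keeping each query under Lucene's default limit of 1024 boolean clauses (configurable via `maxBooleanClauses` in `solrconfig.xml`). A minimal sketch, in which the endpoint URL, the field name `fingerprint`, and non-negative integer fingerprint values are all assumptions for illustration:

```python
# Hedged sketch: batching many term lookups into a few boolean OR
# queries, one POST each, instead of one HTTP request per segment.
# The endpoint URL and the field name "fingerprint" are assumptions;
# fingerprint values are assumed non-negative (a leading "-" would be
# parsed as the NOT operator by the Lucene query parser).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_URL = "http://localhost:8983/solr/select"  # assumed endpoint

def build_queries(fingerprints, max_clauses=1024):
    """Split fingerprint values into boolean OR queries, each staying
    within Lucene's default maxBooleanClauses limit of 1024."""
    return [
        "fingerprint:(" + " OR ".join(str(h) for h in fingerprints[i:i + max_clauses]) + ")"
        for i in range(0, len(fingerprints), max_clauses)
    ]

def batched_search(fingerprints, rows=100):
    """Run one POST per chunked query and collect the matching docs."""
    docs = []
    for q in build_queries(fingerprints):
        body = urlencode({"q": q, "rows": rows, "wt": "json"}).encode()
        # POST keeps very long queries out of URL-length limits
        with urlopen(SOLR_URL, data=body) as resp:
            docs.extend(json.load(resp)["response"]["docs"])
    return docs
```

With ~5000 fingerprints per uploaded document this turns 5000 requests into about 5, at the cost of having to attribute each hit back to the segment(s) whose fingerprints matched on the client side.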