> PS I am relicensing skwish under Apache 2.0 and am in the process of
> re-releasing it from the project website under this new license. (I'm
> not that efficient with such chores, so bear with me :)

DONE

On Sat, Dec 27, 2008 at 1:15 AM, Babak Farhang <farh...@gmail.com> wrote:
>> - there will be a trade-off where reading the info from a 2nd system would
>> be slower than just a single call which has all the results. Especially if
>> you have to fetch a couple of these things.
>
> I agree. So, for example, when an app is displaying the search
> results on a web page, it is certainly more efficient if the stored
> fields used in the presentation layer are stored directly in Lucene.
> But if the search results also contain links for fetching the original
> documents that were indexed by Lucene, and those documents are also
> managed by the app, then reading from the 2nd system will be
> advantageous. I suggest a rule-of-thumb may be that when it's the
> end-user that initiates the fetch of a stored field then the trade-off
> is worthwhile.
>
>> - how is this different than BDB, and a UUID. couldn't you just store it
>> using that?
>
> You could, and there are certainly a lot of advantages to using BDB.
> I think skwish might enjoy an advantage over BDB in the efficient I/O
> interface it exposes. For example, using skwish you can retrieve a
> stored value as a java FileChannel which you can then write more
> efficiently to a socket. (This idea is also fleshed out in the
> prototype non-blocking HTTP server included in the last release.)
>
>> - how are you going to deal with situations where the commit fails in
>> lucene. does the client have to recognize this and rollback skwish?
>
> Here's an example work flow from an experimental lucene / skwish mash
> up I'm trying:
>
> 1. New documents come in to be stored and indexed.
> 2. I store the contents of each document in a SegmentStore (skwish
> term. for a blob store) called *cache* (I guess after google cache?)
> 3. I store meta information about each document in another
> SegmentStore called *meta*. This meta information includes the skwish
> id of the document contents in the *cache* store from step 2.
> 4. I run a Lucene IndexWriter over the new documents. String fields
> are read from the *meta* store; stream fields are read from the
> *cache* store. Most of the fields in the *meta* store are duplicated
> (stored) and/or indexed by Lucene. Nothing from the *cache* store is
> stored directly by Lucene.
>
> At this point, the documents are stored and indexed in the mash up.
> If any of those steps were to fail, the step would have to be retried.
> Note there is enough information in the 2 skwish stores cache and
> meta, to recreate the entire Lucene index.
>
> 5. The mash up returns search results. Each search result item
> contains a link to the original document stored in the cache store
> (indirected through the cache id stored in a Lucene Field).
>
> The implementation aside, I don't imagine this work flow is too
> atypical. You typically have to store the documents *somewhere*, and
> from the application's perspective, it makes sense to leave enough
> [meta] information around in order to be able to recreate a Lucene
> index from scratch.
>
>> on a positive note, it would shrink the index size and allow more records to
>> fit in memory.
>
> Especially important if a stored field (e.g. the contents of the
> original document) does not fit in memory.
>
> Regards
> -Babak
>
> PS I am relicensing skwish under Apache 2.0 and am in the process of
> re-releasing it from the project website under this new license. (I'm
> not that efficient with such chores, so bear with me :)
>
> On Fri, Dec 26, 2008 at 3:40 AM, Ian Holsman <li...@holsman.net> wrote:
>> Babak Farhang wrote:
>>>
>>> Most of all, I'm trying to communicate an *idea* which itself cannot
>>> be encumbered by any license, anyway. But if you want to incorporate
>>> some of this code into an asf project, I'd be happy to also release it
>>> under the apache license. Hope the license I chose for my project
>>> doesn't get in the way of this conversation..
>>>
>>
>> as an idea, let me offer some thoughts.
>>
>> - there will be a trade-off where reading the info from a 2nd system would
>> be slower than just a single call which has all the results. Especially if
>> you have to fetch a couple of these things.
>>
>> - how is this different than BDB, and a UUID. couldn't you just store it
>> using that?
>>
>> - how are you going to deal with situations where the commit fails in
>> lucene. does the client have to recognize this and rollback skwish?
>>
>> - there will need to be some kind of reconciliation process that will need
>> to deal with inconsistencies where someone forgets to delete the skwish
>> object when they have deleted the lucene record.
>>
>> on a positive note, it would shrink the index size and allow more records to
>> fit in memory.
>>
>> Regards
>> Ian
>>>
>>> On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള് नोब्ळ्
>>> <noble.p...@gmail.com> wrote:
>>>
>>>> The license is GPL. It cannot be used directly in any apache projects
>>>>
>>>> On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang <farh...@gmail.com>
>>>> wrote:
>>>>
>>>>>> I assume one could use Skwish instead of Lucene's normal stored fields
>>>>>> to store & retrieve document data?
>>>>>
>>>>> Exactly: instead of storing the field's value directly in Lucene, you
>>>>> could store it in skwish and then store its skwish id in the Lucene
>>>>> field instead. This works well for serving large streams (e.g.
>>>>> original document contents).
>>>>>
>>>>>> Have you run any threaded performance tests comparing the two?
>>>>>
>>>>> No direct comps, yet.
>>>>>
>>>>> -b
>>>>>
>>>>> On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless
>>>>> <luc...@mikemccandless.com> wrote:
>>>>>
>>>>>> This looks interesting!
>>>>>> I assume one could use Skwish instead of Lucene's normal stored fields
>>>>>> to store & retrieve document data?
>>>>>> Have you run any threaded performance tests comparing the two?
>>>>>> Mike
>>>>>>
>>>>>> Babak Farhang <farh...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I've been working on a library called Skwish to complement indexes
>>>>>>> like Lucene, for blob storage and retrieval. This is nothing more
>>>>>>> than a structured implementation of storing all the files in one file
>>>>>>> and managing their offsets in another. The idea is to provide a fast,
>>>>>>> concurrent, lock-free way to serve lots of files to lots of users.
>>>>>>>
>>>>>>> Hope you find it useful or interesting.
>>>>>>>
>>>>>>> -Babak
>>>>>>> http://skwish.sourceforge.net/
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>>>>
>>>>
>>>> --
>>>> --Noble Paul
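
For readers following the thread: the announcement describes the core idea as "storing all the files in one file and managing their offsets in another", and a later reply mentions handing a stored value to a socket through a java FileChannel. The sketch below illustrates only that idea. It is not the Skwish API: the class name TinyBlobStore, the file names data.bin and offsets.bin, and the put/transferTo methods are hypothetical, and put is simply synchronized rather than lock-free as the announcement describes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; this is NOT the Skwish API. */
final class TinyBlobStore implements AutoCloseable {
    private final FileChannel data;    // all blob contents, concatenated
    private final FileChannel offsets; // one 8-byte END offset per blob id

    TinyBlobStore(Path dir) throws IOException {
        data = open(dir.resolve("data.bin"));
        offsets = open(dir.resolve("offsets.bin"));
    }

    private static FileChannel open(Path p) throws IOException {
        return FileChannel.open(p, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Appends a blob to the data file and returns its id (0, 1, 2, ...). */
    synchronized long put(byte[] blob) throws IOException {
        long id = offsets.size() / Long.BYTES;
        long pos = data.size();
        ByteBuffer src = ByteBuffer.wrap(blob);
        while (src.hasRemaining()) {
            pos += data.write(src, pos);
        }
        // Record where this blob ends; it starts where the previous one ended.
        ByteBuffer end = ByteBuffer.allocate(Long.BYTES);
        end.putLong(pos);
        end.flip();
        long offPos = id * Long.BYTES;
        while (end.hasRemaining()) {
            offPos += offsets.write(end, offPos);
        }
        return id;
    }

    /** Reads the end offset of blob {@code id}; id -1 means "start of file". */
    private long endOffset(long id) throws IOException {
        if (id < 0) {
            return 0;
        }
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        long base = id * Long.BYTES;
        while (buf.hasRemaining()) {
            if (offsets.read(buf, base + buf.position()) < 0) {
                throw new IOException("no such blob id: " + id);
            }
        }
        buf.flip();
        return buf.getLong();
    }

    /** Copies blob {@code id} to {@code out}; returns the number of bytes sent. */
    long transferTo(long id, WritableByteChannel out) throws IOException {
        if (id < 0) {
            throw new IllegalArgumentException("negative id: " + id);
        }
        long start = endOffset(id - 1);
        long end = endOffset(id);
        long pos = start;
        while (pos < end) {
            // FileChannel.transferTo lets the OS move the bytes where it can.
            pos += data.transferTo(pos, end - pos, out);
        }
        return end - start;
    }

    @Override
    public void close() throws IOException {
        try {
            data.close();
        } finally {
            offsets.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tinyblob");
        try (TinyBlobStore store = new TinyBlobStore(dir)) {
            store.put("hello".getBytes(StandardCharsets.UTF_8));
            long id = store.put("lucene".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            store.transferTo(id, Channels.newChannel(sink));
            System.out.println(id + " -> " + sink.toString("UTF-8")); // prints "1 -> lucene"
        }
    }
}
```

In a setup like the one discussed in the thread, the id returned by put would be what gets stored in a Lucene field in place of the blob itself, and the channel passed to transferTo could be a SocketChannel rather than the in-memory sink used here.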