Re: Using lucene as a database... good idea or bad idea?

ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S) Tue, 29 Jul 2008 07:29:10 -0700

"Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
 Connect data storage with simple, fast lookup and Lucene."
Thanks, Grant, for the clarification. I see now.


Nagesh

On Tue, Jul 29, 2008 at 7:55 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
>  Connect data storage with simple, fast lookup and Lucene.
>
> One field is the key (i.e. the filename); the other field is a binary,
> stored Field containing the contents of the file.  Of course, there are
> other ways of slicing and dicing, such that one can search (in the fuzzy
> sense) the content and the key by adding tokenization, etc.  This is the
> more traditional model for Lucene.
>
> Also, have a look at Apache Jackrabbit.  It is a content repository that is
> implemented with Lucene.
>
> -Grant
>
>
> On Jul 29, 2008, at 10:02 AM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S) wrote:
>
>> Hi Ian,
>> Yes, I see that we are discussing an "option" here.
>>
>> But, as I said before (the three parts to search-based solution), I do not
>> know (but, would like to know) how Lucene (java only - not Nutch, Solr,
>> etc.) can be used as a datastore.
>>
>> Basically, I am not able to connect "database" and Lucene java. :)
>>
>> Nagesh
>>
>>
>> On Tue, Jul 29, 2008 at 6:51 PM, Ian Lea <[EMAIL PROTECTED]> wrote:
>>
>>> I don't think that anyone in this thread has said "should", just
>>> "could" - it is a valid option (IMHO).  Personally, I use it as a
>>> store for Lucene-related data because I know and like and trust it, it
>>> is already there for this project so no need to introduce another
>>> software dependency, and because it is blindingly fast.
>>>
>>>
>>> --
>>> Ian.
>>>
>>>
>>> On Tue, Jul 29, 2008 at 1:43 PM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> The way I see it, search solutions (on whatever scale) have three
>>>> components - data aggregation, indexing/searching and presentation of
>>>> results. I thought Lucene did the second part only.
>>>>
>>>> So, I do not quite follow: why should Lucene be used as a datastore?
>>>>
>>>> Nagesh
>>>>
>>>> On Tue, Jul 29, 2008 at 6:01 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I think the answer is it can be done and probably quite well.  I also
>>>>> think it's informative that Nutch does not use Lucene for this function,
>>>>> as I understand it, but that shouldn't stop you either.  You might also
>>>>> have a look at Apache Jackrabbit, which uses Lucene underneath as a
>>>>> content repository.
>>>>>
>>>>> -Grant
>>>>>
>>>>>
>>>>> On Jul 29, 2008, at 5:34 AM, Ganesh - yahoo wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am also interested in this. I want to archive the content of the
>>>>>> document using Lucene.
>>>>>>
>>>>>> Is it a good idea to use Lucene as a storage engine?
>>>>>>
>>>>>> Regards
>>>>>> Ganesh
>>>>>>
>>>>>> ----- Original Message ----- From: "Ian Lea" <[EMAIL PROTECTED]>
>>>>>> To: <[email protected]>
>>>>>> Sent: Tuesday, July 29, 2008 2:18 PM
>>>>>> Subject: Re: Using lucene as a database... good idea or bad idea?
>>>>>>
>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>> I think it's a great idea, and do exactly this to store 5 million+
>>>>>>> documents with info that it takes way too long to get out of our
>>>>>>> Oracle database (think days).  Not as many docs as you are talking
>>>>>>> about, and less data for each doc, but I wouldn't have any concerns
>>>>>>> about scaling.  There are certainly Lucene indexes out there bigger
>>>>>>> than what you propose.  You can compress the stored data to save some
>>>>>>> space.  Run times for optimization might get interesting but see
>>>>>>> recent threads for suggestions on that.  And since you are not too
>>>>>>> concerned about performance you may not need to optimize much, or
>>>>>>> even at all.
>>>>>>>
>>>>>>> Of course you need to remember that this is not a DBMS solution in
>>>>>>> the sense of transactions, recovery, etc. but I'm sure you are
>>>>>>> already aware of that.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ian.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 29, 2008 at 2:53 AM, John Evans <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have successfully used Lucene in the "traditional" way to provide
>>>>>>>> full-text search for various websites.  Now I am tasked with
>>>>>>>> developing a data-store to back a web crawler.  The crawler can be
>>>>>>>> configured to retrieve arbitrary fields from arbitrary pages, so the
>>>>>>>> result is that each document may have a random assortment of fields.
>>>>>>>> It seems like Lucene may be a natural fit for this scenario since you
>>>>>>>> can obviously add arbitrary fields to each document and you can store
>>>>>>>> the actual data in the database.  I've done some research to make
>>>>>>>> sure that it would meet all of our individual requirements (that we
>>>>>>>> can iterate over documents, update (delete/replace) documents, etc.)
>>>>>>>> and everything looks good.  I've also seen a couple of references
>>>>>>>> around the net to other people trying similar things... however, I
>>>>>>>> know it's not meant to be used this way, so I thought I would post
>>>>>>>> here and ask for guidance.  Has anyone done something similar?  Is
>>>>>>>> there any specific reason to think this is a bad idea?
>>>>>>>>
>>>>>>>> The one thing that I am least certain about is how well it will
>>>>>>>> scale.  We may reach the point where we have tens of millions of
>>>>>>>> documents and a high percentage of those documents may be relatively
>>>>>>>> large (10k-50k each).  We actually would NOT be expecting/needing
>>>>>>>> Lucene's normal extremely fast text search times for this, but we
>>>>>>>> would need reasonable times for adding new documents to the index,
>>>>>>>> retrieving documents by ID (for iterating over all documents),
>>>>>>>> optimizing the index after a series of changes, etc.
>>>>>>>>
>>>>>>>> Any advice/input/theories anyone can contribute would be greatly
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>>>>
>>>>>>>
>>>>>>  --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>>
>>>>> Lucene Helpful Hints:
>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
