Re: Modelling Relational Lucene Index

Erick Erickson Wed, 27 Dec 2006 09:50:07 -0800

One other note. If you do NOT store the article text, you can still search
it, but the index will be MUCH smaller because it no longer holds the text
data. The catch is that you need access to the actual text somewhere else in
order to return it to the user, but it's a possibility. The scenario runs
something like this....


Index the text of the article WITHOUT storing it for each company. Have the
actual text out on disk someplace so you can fetch it for display to the
user.

Another alternative is to store the text in the index once (without
indexing) and index it (but not store it) for each company. Then you could
fetch the text out of the index when you needed it without paying the penalty
for storing the text for each company, and you wouldn't have to coordinate
separate disk storage, which can be a pain.
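
To make that second alternative concrete, here is a rough sketch. It is NOT Lucene code: TinyIndex is a made-up toy class that uses plain Python dictionaries to stand in for indexed and stored fields, and the field names "storeid", "artid" and "company" are assumptions chosen for illustration. The text is stored exactly once in a "holder" document and indexed (unstored) once per company.

```python
from collections import defaultdict

class TinyIndex:
    """Toy stand-in for an index: postings for indexed fields, a dict for stored fields."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc numbers
        self.stored = {}                  # doc number -> stored fields
        self.next_doc = 0

    def add(self, indexed=None, stored=None):
        doc = self.next_doc
        self.next_doc += 1
        for field, value in (indexed or {}).items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc)
        self.stored[doc] = dict(stored or {})
        return doc

    def search(self, **terms):
        """AND together one term per field and return matching doc numbers."""
        sets = [self.postings.get((f, str(t).lower()), set()) for f, t in terms.items()]
        return sorted(set.intersection(*sets)) if sets else []

index = TinyIndex()
article_text = "Acme announces record profits"

# One holder document: text stored but not indexed, looked up by "storeid".
index.add(indexed={"storeid": "a1"}, stored={"text": article_text})

# One document per company: text indexed but not stored, article id stored.
for company in ("c1", "c2", "c3"):
    index.add(indexed={"text": article_text, "company": company},
              stored={"artid": "a1"})

# Search by text and company, then fetch the single stored copy of the text.
hits = index.search(text="profits", company="c2")
artid = index.stored[hits[0]]["artid"]
holder = index.search(storeid=artid)
text = index.stored[holder[0]]["text"]
```

Note that the holder document and the per-company documents use different field names for the article id, so the two kinds of documents don't get indexed together under one field.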

But I wouldn't try any of that until I generated some numbers. Take some
representative articles and index them, say, 10 different times, stored and
unstored and see what the index size is. Then do it again storing them, say,
100 times and look at the difference. You simply can't make architectural
decisions like this without some data to back them up. What is the eventual
number of articles you intend to index? How many times do you expect each
article to be indexed (that is, how many different companies do you expect
to associate with each article)? Why do you fear doing two searches? Do you
have any evidence at all that this will be unacceptably slow? The reason
that I keep asking these questions whenever someone starts talking about
efficiency is that I've spent far too much time making things complicated
for the sake of efficiency. In the end it was wasted effort: the program
was delivered later than it should have been and had more bugs in it than
it needed to.
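
Once the test indexes are built, comparing them just means adding up the file sizes in each index directory. A minimal sketch, assuming each test index lives in its own directory; the function names and the labels are made up for illustration.

```python
import os

def index_size_bytes(index_dir):
    """Sum the sizes of all files under index_dir (0 if the directory is missing)."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def size_report(index_dirs):
    """Map each label to (bytes, ratio to the smallest non-empty index)."""
    sizes = {label: index_size_bytes(path) for label, path in index_dirs.items()}
    nonzero = [size for size in sizes.values() if size > 0]
    base = min(nonzero) if nonzero else 1
    return {label: (size, size / base) for label, size in sizes.items()}
```

Calling size_report with something like {"10x unstored": ..., "10x stored": ..., "100x stored": ...} (one directory path per label) gives the raw sizes plus how many times larger each index is than the smallest one.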

Best
Erick

On 12/27/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:


Hi Erick,

Thank you for the detailed response.

First I would like to mention that my application has an index with the
company id & name indexed for each article, for the following reasons:
1. A search interface where we search across articles and companies.
2. Paging - I need to page the results after loading the hits, so I
don't want to separate the text search from the article-company
matching logic. I want to load the articles using one single Lucene query.

I am using a MySQL database to store the relations. But since I need to
search across companies & keywords in articles, I am also storing the
company name and id in the index. Option 3 looks good to me, but I
am concerned about degrading the performance of the existing system if I
make the search into a two-step process.
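
For reference, the two-step lookup in option 3 (quoted below) can be sketched roughly as follows. This is plain Python standing in for the index, not Lucene API: the sample documents are invented, and "companyid" is an assumed spelling of the "company id" field.

```python
from collections import defaultdict

# Article documents carry "artid" and "text" fields.
articles = [
    {"artid": "a1", "text": "acme posts record profits"},
    {"artid": "a2", "text": "globex recalls product"},
]

# Relation documents carry "id" (the article id) and "companyid" fields.
relations = [
    {"id": "a1", "companyid": "c1"},
    {"id": "a1", "companyid": "c2"},
    {"id": "a2", "companyid": "c3"},
]

def build_postings(docs, field):
    """Map each whitespace-separated term in `field` to the positions of matching docs."""
    postings = defaultdict(set)
    for doc_num, doc in enumerate(docs):
        for term in doc[field].lower().split():
            postings[term].add(doc_num)
    return postings

text_postings = build_postings(articles, "text")
id_postings = build_postings(relations, "id")

def companies_for_text(term):
    """Step 1: text search -> article ids. Step 2: article ids -> company ids."""
    artids = {articles[n]["artid"] for n in text_postings.get(term.lower(), ())}
    result = defaultdict(set)
    for artid in artids:
        for n in id_postings.get(artid.lower(), ()):
            result[artid].add(relations[n]["companyid"])
    return dict(result)
```

The second step is one lookup per article id returned by the first, so its cost grows with the number of hits on a page rather than with the size of the index.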

However I will try to evaluate your suggestions in detail.

Thank you again,
Harini

Erick Erickson wrote:

> First, it probably would have been a good thing to start a new thread on
> this topic, since it's only vaguely related to disk space <G>...
>
> That said, sure. Note that there's no requirement in Lucene that all
> documents in an index have the same fields. Also, there's no reason you
> can't use two separate indexes. Finally, you have to think about how many
> times you are going to add or update a given article when choosing your
> approach. Here are several possibilities.
>
> 1> Add a field (tokenized) to each article in your index that contains IDs
> of the companies you want to associate with that article. The downside here
> is that you need to delete and re-add the document every time you want to
> add a company to that article.
>
> 2> Create a separate index that contains that relationship.
>
> 3> Have two kinds of documents in your index, one that indexes articles and
> one that relates those to companies. Something like this:
>
> Articles are indexed with "text" and "artid" fields. (NOTE: artid is NOT the
> Lucene document ID; those change.)
> Relations are indexed with "id" and "company id" fields.
>
> id and artid are your relationship. You *don't* want to name the field the
> same for both kinds of documents since they would be indexed together.
>
> Now, given a search over some text, you get back a bunch of article IDs. You
> then search on the id field of the relations documents to extract company id
> fields.
>
> You may be able to do some interesting things with termdocs/termenums to
> make this efficient, but don't go there unless you need to.
>
> At this point, though, I've got to ask if you have access to a database in
> your application. If you do, why not store the relations there? Lucene is a
> text-search engine, not a relational database. This kind of relation may be
> perfectly valid to implement in Lucene, but you want to be careful if you
> find yourself trying to do any more RDBMS-like things.
>
> Best
> Erick
>
> On 12/26/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi,
>>
>> I have another related problem. I am adding news articles for a company
>> to the Lucene index. As of now, if an article is mapped to more than
>> one company, it is added to the index once per company. As the number of
>> companies mapped to each article increases, this will not scale,
>> because documents will be duplicated in the index. Is there a
>> way to model the Lucene index in a relational way such that the articles
>> can be stored in an index and the article-company mapping can be modelled
>> separately?
>>
>> Thanks,
>> Harini
>>

Harini Raghavan
Software Engineer
Office : +91-40-23556255
[EMAIL PROTECTED]
we think, you sell
www.InsideView.com
InsideView

