One other note. If you do NOT store the article text, you can still search it, but your index will be MUCH smaller because it no longer holds the stored text. This requires that you have access to the actual text somewhere else in order to return it to the user, but it's a possibility. The scenario runs something like this:
Index the text of the article WITHOUT storing it for each company. Have the actual text out on disk someplace so you can fetch it for display to the user.

Another alternative is to store the text in the index once (without indexing) and index it (but not store it) for each company. Then you could fetch the text out of the index when you needed it without paying the penalty for storing the text for each company, and you wouldn't have to coordinate the disk storage, which can be a pain.

But I wouldn't try any of that until I generated some numbers. Take some representative articles and index them, say, 10 different times, stored and unstored, and see what the index size is. Then do it again indexing them, say, 100 times and look at the difference. You simply can't make architectural decisions like this without some data to back them up.

What is the eventual number of articles you intend to index? How many times do you expect each article to be indexed (that is, how many different companies do you expect to associate with each article)?

Why do you fear doing two searches? Do you have any evidence at all that this will be unacceptably slow?

The reason that I keep asking these questions whenever someone starts talking about efficiency is that I've spent far too much time making things complicated for the sake of efficiency that, in the end, was wasted effort, meant that the program was delivered later than it should have been, and had more bugs in it than it needed to.

Best
Erick

On 12/27/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:
Hi Erick,

Thank you for the detailed response. First I would like to mention that my application has an index with company id & name indexed for each article, for the following reasons:

1. A search interface where we search across articles and companies.
2. Paging - I need to page the results after loading the hits, so I don't want to separate the text search from the article-company matching logic. I want to load the articles using one single Lucene query.

I am using a MySQL database to store the relations. But since I need to search across companies & keywords in articles, I am also storing the company name and id in the index.

Option 3 looks good to me. But I am concerned about degrading the performance of the existing system if I make the search a 2-step process. However, I will try to evaluate your suggestions in detail.

Thank you again,
Harini

Erick Erickson wrote:
> First, it probably would have been a good thing to start a new thread on
> this topic, since it's only vaguely related to disk space <G>...
>
> That said, sure. Note that there's no requirement in Lucene that all
> documents in an index have the same fields. Also, there's no reason you
> can't use two separate indexes. Finally, you have to think about how many
> times you are going to add or update a given article when choosing your
> approach. Here are several possibilities.
>
> 1> Add a field (tokenized) to each article in your index that contains IDs
> of the companies you want to associate with that article. The downside here
> is that you need to delete and re-add the document every time you want to
> add a company to that article.
>
> 2> Create a separate index that contains that relationship.
>
> 3> Have two kinds of documents in your index, one that indexes articles and
> one that relates those to companies. Something like this:
>
> Articles are indexed with "text" and "artid" fields.
> (NOTE: artid is NOT the Lucene document ID, those change)
> Relations are indexed with "id" and "company id" fields.
>
> id and artid are your relationship. You *don't* want to name the field the
> same for both kinds of documents since they would be indexed together.
>
> Now, given a search over some text, you get back a bunch of article IDs. You
> then search on the id field of the relations documents to extract company id
> fields.
>
> You may be able to do some interesting things with termdocs/termenums to
> make this efficient, but don't go there unless you need to.
>
> At this point, though, I've got to ask if you have access to a database in
> your application. If you do, why not store the relations there? Lucene is a
> text-search engine, not a relational database. This kind of relation may be
> perfectly valid to implement in Lucene, but you want to be careful if you
> find yourself trying to do any more RDBMS-like things.
>
> Best
> Erick
>
> On 12/26/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have another related problem. I am adding news articles for a company
>> to the lucene index. As of now if the articles are mapped to more than
>> one company, they are added so many times in the index. As the no. of
>> companies mapped to each article increases, this will not be a scalable
>> implementation as documents will be duplicated in the index. Is there a
>> way to model the lucene index in a relational way such that the articles
>> can be stored in an index and article-company mapping can be modelled
>> separately?
>>
>> Thanks,
>> Harini
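
For readers following the thread, option 3 (two kinds of documents plus a two-step lookup) can be sketched roughly as below. This is a toy in-memory model in Python, not the Lucene API: ToyIndex, companies_for_text, the sample articles and the "companyid" field name are all made up for illustration. It only shows the shape of the idea: article text is indexed but not stored, and each article-company pair is a small separate document.

```python
from collections import defaultdict


class ToyIndex:
    """Tiny in-memory stand-in for a search index (hypothetical, NOT Lucene)."""

    def __init__(self):
        self.docs = []                    # stored field values, one dict per document
        self.postings = defaultdict(set)  # (field, term) -> set of document numbers

    def add(self, stored=None, indexed=None):
        """Add a document: 'stored' fields can be read back, 'indexed' fields can be searched."""
        doc_no = len(self.docs)
        self.docs.append(dict(stored or {}))
        for field, value in (indexed or {}).items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_no)
        return doc_no

    def search(self, field, term):
        """Return the stored fields of every document whose 'field' contains 'term'."""
        hits = sorted(self.postings.get((field, term.lower()), ()))
        return [self.docs[n] for n in hits]


index = ToyIndex()

# Article documents: "text" is indexed but NOT stored; "artid" is stored.
index.add(stored={"artid": "a1"}, indexed={"text": "Acme buys Widgets Inc"})
index.add(stored={"artid": "a2"}, indexed={"text": "Widgets market shrinks"})

# Relation documents: one small document per (article, company) pair.
# The key field is named "id", not "artid", so the two document kinds stay apart.
for art_id, company_id in [("a1", "c10"), ("a1", "c20"), ("a2", "c20")]:
    index.add(stored={"id": art_id, "companyid": company_id}, indexed={"id": art_id})


def companies_for_text(idx, term):
    """Step 1: text search gives article IDs. Step 2: relation lookup gives company IDs."""
    result = {}
    for article in idx.search("text", term):
        art_id = article["artid"]
        result[art_id] = sorted(r["companyid"] for r in idx.search("id", art_id))
    return result


print(companies_for_text(index, "widgets"))
```

Adding a company to an article here means adding one small relation document, so the article itself never has to be deleted and re-added. Whether the second lookup costs anything noticeable is exactly the kind of thing to measure on representative data before deciding.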