What's a reasonable upper limit on the number of files? Because I think it would be simpler, at least to start, to allow your field to be larger (say, 1B tokens: 1,000 files of 1M tokens each) but restrict the input to 1M tokens per file. The most elegant way would probably be to subclass the TokenStream of your choice and have it signal end-of-input (e.g. next() returns null) after every 1M tokens, regardless of whether more remain. You'd have to add a method that lets you signal the start of a new file, but that shouldn't be difficult....
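Something like this minimal sketch, assuming the Lucene 3.x TokenStream API (incrementToken() rather than the older next()); the class name CappedTokenFilter and the startNewFile() method are made up for illustration:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class CappedTokenFilter extends TokenFilter {
    private final int maxTokensPerFile;
    private int count = 0;

    public CappedTokenFilter(TokenStream input, int maxTokensPerFile) {
        super(input);
        this.maxTokensPerFile = maxTokensPerFile;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Report end-of-input once the per-file cap is reached,
        // even if the wrapped stream still has tokens.
        if (count >= maxTokensPerFile) {
            return false;
        }
        if (input.incrementToken()) {
            count++;
            return true;
        }
        return false;
    }

    // The extra signal: reset the counter when a new file's text starts.
    public void startNewFile() {
        count = 0;
    }
}

You'd wrap whatever stream your analyzer produces, e.g. new CappedTokenFilter(analyzer.tokenStream("contents", reader), 1000000), and call startNewFile() between files.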
The tricky part here would be if you had to answer the question "which file linked to this card matched for this search?". It's doable; I'd think about storing the begin/end offsets of each file in a metadata field for the card (probably stored but not indexed; rough sketch in the P.S. at the very end of this mail)......

A cruder approach would be to read the input files yourself, approximate 1M tokens, and feed *that* to the analyzer, but that's perilously close to rewriting a tokenizer.....

HTH
Erick

On Tue, Jan 19, 2010 at 9:35 AM, Anna Hunecke <annahune...@yahoo.de> wrote:

> The field size is restricted to 1 million tokens, for the very reasons
> you mentioned.
> So, even if I have one separate field for the content of a file, I might
> reach the limit if the file is really big. But I can't help that. What I
> want to avoid is that the whole content of some files cannot be found
> because I used one field for the content of all files and they just
> could not be appended anymore.
>
> --- Erick Erickson <erickerick...@gmail.com> wrote on Tue, 19.1.2010:
>
> > From: Erick Erickson <erickerick...@gmail.com>
> > Subject: Re: Indexing and Searching linked files
> > To: java-user@lucene.apache.org
> > Date: Tuesday, 19 January 2010, 14:43
> >
> > What field size limit are you talking about here? Because 10,000
> > tokens is the default, but you can increase it to Integer.MAX_VALUE.
> >
> > So are you really talking billions of tokens here? Your index
> > quickly becomes unmanageable if you allow it to grow by such
> > increments.
> >
> > One can argue, IMO, that the first N (10M, say) tokens per file are
> > "enough" and there's not much real value in the rest, but that can
> > be a weak argument depending on the problem space....
> >
> > But if you're really committed to indexing an unbounded number of
> > arbitrarily large files...you'll fail. Sometime, somewhere, somebody
> > will want to index enough to violate whatever limits you have (disk,
> > memory, time, whatever). So I think you'd be farther ahead to ask
> > your product manager what limits are reasonable and go from there...
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke <annahune...@yahoo.de>
> > wrote:
> >
> > > Hi!
> > > I have been working with Lucene for a while now. So far I have
> > > found helpful tips on this list, so I hope somebody can help me
> > > with my problem:
> > >
> > > In our app, information is grouped in so-called cards. Now it
> > > should be made possible to also search the files linked to the
> > > cards. You can link arbitrarily many files to a card, and the size
> > > of the files is also not restricted.
> > > As far as I can see, there are two ways to do this:
> > >
> > > 1. Add the content of the files to the search index of the card.
> > > First I thought that I could just have an additional field in the
> > > index which contains the content of all the files. But then, if
> > > the files are very big, I could hit the field size limit and
> > > possibly not get the content of all files indexed. So I would need
> > > one field per file. The problem I have then is that I don't know
> > > how many files I have and how large the index would get. This is
> > > risky, because some customers have a lot of data.
> > >
> > > 2. Create a separate index for files. The documents in this index
> > > would contain one file each, so I would not have the problem that
> > > I don't know how many fields I have.
> > > But then, the searching is a problem: I would need to search on
> > > both the card index and the file index, and somehow merge the
> > > results together. I always sort by score, but, as I understand it,
> > > the scores of results from two different indexes are not
> > > comparable.
> > >
> > > So, which way do you think is better?
> > >
> > > Best,
> > > Anna
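P.S. A rough sketch of the stored-offsets idea from the top of this mail, assuming the Lucene 3.x Field API; the field name "fileOffsets" and the string encoding are made up, and card is the Document you're building for the card:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document card = new Document();
// One entry per linked file: name plus begin/end token offsets.
// Stored so you can read it back at search time, but not indexed.
card.add(new Field("fileOffsets",
    "report.pdf:0-1000000;notes.txt:1000000-1350000",
    Field.Store.YES, Field.Index.NO));

At search time you'd load that field from the matching document and compare the match positions against the per-file ranges.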
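P.P.S. For completeness, the 10,000-token default mentioned in the quoted mail is IndexWriter's maxFieldLength; in the 2.9/3.0 API you can raise it like so (dir and analyzer being whatever Directory and Analyzer you already use):

import org.apache.lucene.index.IndexWriter;

// UNLIMITED is just MaxFieldLength(Integer.MAX_VALUE)
IndexWriter writer =
    new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
// or, on an existing writer:
writer.setMaxFieldLength(Integer.MAX_VALUE);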