Another aspect is that if you index such large documents, you also receive these documents in your search results, which is then again a bit ambiguous for a user (if there is one in the use case). The search problem is only partially solved in this case. Maybe it would be better to index single chapters or something similar, to make it useful for the consumer.
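To make the chapter idea concrete, a minimal sketch (Lucene 4.x style; the class and field names are made up, and how you split the extracted text into chapters is left open, since that is the use-case-dependent part):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ChapterIndexer {

    // One Lucene document per chapter instead of one per file. How the
    // chapter texts are obtained (outline, headings, page ranges, ...) is
    // not shown here.
    public static void indexChapters(IndexWriter writer, String fileName,
                                     List<String> chapters) throws IOException {
        int n = 0;
        for (String chapter : chapters) {
            Document doc = new Document();
            // remembers which file a hit belongs to, so results can be
            // grouped or deduplicated per file at search time
            doc.add(new StringField("source", fileName, Store.YES));
            doc.add(new StringField("chapter", Integer.toString(++n), Store.YES));
            // analyzed for search, but not stored
            doc.add(new TextField("content", chapter, Store.NO));
            writer.addDocument(doc);
        }
    }
}

With this, a hit points the user to the chapter that actually matched rather than to a whole book.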
Another aspect is that such huge documents tend to have everything (i.e. every term) inside, which results in bad statistics (there are maybe no characteristic terms left). In the worst case, the document becomes part of every search result, but with low scores in any case.

I would say that for 'normal', human-readable documents the extracted texts have such a small memory footprint that there is no problem at all. To avoid an OOM for rare cases that are maybe invocation bugs, you can set a simple threshold, cut the document, print a warning, etc. - there is a small sketch of this further below. Of course, everything depends on the use case ;)

On 02.07.2014 17:45, Sergey Beryozkin wrote:
> Hi Tim
>
> Thanks for sharing your thoughts. I find them very helpful.
>
> On 02/07/14 14:32, Allison, Timothy B. wrote:
>> Hi Sergey,
>>
>> I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader), which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.
>>
>> Some thoughts:
>>
>> At the least, you could create a separate Lucene document for each container document and each of its embedded documents.
>>
>> You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.
>
> Right. I think this is something we might investigate further. The goal is to generalize some Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate ContentHandler, as we don't know the concrete/final requirements of the would-be API consumers.
>
> What is your opinion of a Tika Parser ContentHandler that would try to do it in a minimal kind of way and store character sequences as unique individual Lucene fields? Suppose we have a single PDF file and a content handler reporting every line in such a file. Instead of storing all the PDF content in a single "content" field, we'd have "content1":"line1", "content2":"line2", etc., and then offer support for searching across all of these contentN fields.
>
> I guess it would be somewhat similar to your idea of having a separate Lucene Document per logical chunk, except that in this case we'd have a single Document with many fields covering a single PDF/etc.
>
> Does it make any sense at all from the performance point of view, or is it maybe not worth it?
>
>> In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs, depending on which highlighter you use. In the old days there used to be a 10k default limit on the number of tokens, but that is now long gone.
>>
> Sounds reasonable.
>
>> For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.
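This is where the simple threshold I mentioned above can help. A rough sketch, assuming Tika 1.x's WriteOutContentHandler write limit (the class name BoundedExtractor is made up):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

public class BoundedExtractor {

    // Extracts at most maxChars characters of body text. If the document is
    // bigger, keep what was written so far, print a warning and carry on
    // instead of buffering everything and risking an OOM.
    public static String extract(InputStream in, int maxChars) throws Exception {
        WriteOutContentHandler inner = new WriteOutContentHandler(maxChars);
        try {
            new AutoDetectParser().parse(in, new BodyContentHandler(inner),
                                         new Metadata(), new ParseContext());
        } catch (SAXException e) {
            if (!inner.isWriteLimitReached(e)) {
                throw e; // a real parse problem, not just "document too big"
            }
            System.err.println("WARN: content truncated after " + maxChars + " chars");
        }
        return inner.toString();
    }
}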
> Sure, if we get the users reporting OOM or similar related issues against our API, then it would be a good start :-)
>
> Thanks, Sergey
>
>> Cheers,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>> Sent: Wednesday, July 02, 2014 8:27 AM
>> To: user@tika.apache.org
>> Subject: How to index the parsed content effectively
>>
>> Hi All,
>>
>> We've been experimenting with indexing the parsed content in Lucene, and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.
>>
>> This is unlikely to be effective for large files, so I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.
>>
>> Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.
>>
>> The feedback will be appreciated. Cheers, Sergey
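As a starting point for that last experiment, a rough sketch of such a handler (the class name PerChunkContentHandler and the chunking by size are made up; it flushes on a size threshold rather than strictly per characters() call, since parsers may call characters() very often with tiny fragments):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

public class PerChunkContentHandler extends DefaultHandler {

    private final Document doc = new Document();
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;
    private int fieldCount = 0;

    public PerChunkContentHandler(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        if (buffer.length() >= chunkSize) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        flush();
    }

    // "content1", "content2", ... as proposed above; indexed but not stored,
    // so only the analyzed tokens end up in the index
    private void flush() {
        if (buffer.length() > 0) {
            doc.add(new TextField("content" + (++fieldCount), buffer.toString(), Store.NO));
            buffer.setLength(0);
        }
    }

    // call after parsing and pass the result to IndexWriter.addDocument()
    public Document getDocument() {
        return doc;
    }
}

You would pass this handler to parser.parse(...) and afterwards hand getDocument() to IndexWriter.addDocument(). Tim's caveat still applies: the Document holds all field values in memory until addDocument() is called, so this avoids one big String but not the overall footprint. Also, adding every chunk under the same field name "content" (Lucene treats repeated fields with the same name as multivalued) might be simpler than contentN, since then an ordinary single-field query works and you don't need something like MultiFieldQueryParser over an unknown number of field names.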