Hi
On 02/07/14 17:32, Christian Reuschling wrote:

Another aspect is that if you index such large documents, you also receive these documents inside your search results, which is then again a bit ambiguous for a user (if there is one in the use case). The search problem is only partially solved in this case. Maybe it would be better to index single chapters or something, to make it useful for the consumer in this case.

This is another nice idea. We expect users to customize the process of indexing the Tika-produced content if they aren't satisfied with the default approach of storing the content in a single field. But as we move along and start getting more experience and feedback, we may be able to find a way to generalize some of the ideas that you and Tim talked about. For example, we may ship a boilerplate ContentHandler that can react to new-chapter or new-document indicators, etc.
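
To make the idea a bit more concrete, something along these lines is what I have in mind - only a rough, untested sketch, and the ChapterListener callback is purely hypothetical, not an agreed API. It relies on Tika parsers emitting XHTML SAX events, so chapter boundaries could be approximated by h1/h2 start elements:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: collects the text between heading elements of Tika's XHTML
// output and reports each chunk as a "chapter" to a hypothetical listener.
public class ChapterSplittingHandler extends DefaultHandler {

    public interface ChapterListener {
        void chapter(String text);
    }

    private final ChapterListener listener;
    private final StringBuilder buffer = new StringBuilder();

    public ChapterSplittingHandler(ChapterListener listener) {
        this.listener = listener;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Treat every h1/h2 as the start of a new chapter.
        if ("h1".equals(localName) || "h2".equals(localName)) {
            flush();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        flush();
    }

    private void flush() {
        if (buffer.length() > 0) {
            listener.chapter(buffer.toString());
            buffer.setLength(0);
        }
    }
}

The listener could then create one Lucene Document per chapter, or do whatever the particular use case needs.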

Another aspect is that such huge documents tend to have everything (i.e. every term) inside, which results in poor statistics (there may be no characteristic terms left). In the worst case, the document becomes part of every search result, but always with low scores.

I would say that for 'normal', human-readable documents the extracted texts have such a small memory footprint that there is no problem at all. To avoid an OOM for rare cases that are maybe invocation bugs, you can set a simple threshold, cut the document, print a warning, etc.

Sure
Of course, everything depends on the use case ;)

I agree,
Many thanks for the feedback,
It has definitely been useful for me and hopefully for some other users :-)
Cheers, Sergey

On 02.07.2014 17:45, Sergey Beryozkin wrote:
Hi Tim

Thanks for sharing your thoughts. I find them very helpful,

On 02/07/14 14:32, Allison, Timothy B. wrote:
Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does.  If you want to store the field, you need to create the field with a String (as opposed to a Reader), which means you have to have the whole thing in memory.  Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field.  If you aren't storing the field, then you could try a Reader.
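
In Lucene API terms the difference is roughly the following (just a sketch, the field names are arbitrary):

import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

static Document buildDoc(String wholeExtractedText, Reader streamedText) {
    Document doc = new Document();
    // Stored: the whole extracted text has to exist in memory as one String.
    doc.add(new TextField("content", wholeExtractedText, Field.Store.YES));
    // Not stored: a Reader is tokenized as a stream, so the full text never
    // has to be materialized, but it cannot be retrieved later.
    doc.add(new TextField("content_streamed", streamedText));
    return doc;
}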

Some thoughts:

At the least, you could create a separate Lucene document for each container document and each of its embedded documents.

You could also break large documents into logical sections and index those as separate documents, but that gets very use-case dependent.
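
A rough sketch of the per-section variant (the Section holder and how you split the text are left open on purpose, they aren't part of any existing API):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Hypothetical holder for one logical section of a container document.
class Section {
    final String title;
    final String text;
    Section(String title, String text) { this.title = title; this.text = text; }
}

class SectionIndexer {
    // One Lucene document per section, with a link back to the container
    // document so that hits can be grouped or deduplicated at search time.
    static void indexSections(IndexWriter writer, String parentId, List<Section> sections)
            throws IOException {
        for (Section section : sections) {
            Document doc = new Document();
            doc.add(new StringField("parentId", parentId, Field.Store.YES));
            doc.add(new TextField("title", section.title, Field.Store.YES));
            doc.add(new TextField("content", section.text, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}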

Right. I think this is something we might investigate further. The goal is to generalize some Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate ContentHandler, as we don't know the concrete/final requirements of the would-be API consumers.

What is your opinion of having a Tika Parser ContentHandler that would try to do it in a minimal kind of way, storing character sequences as unique individual Lucene fields? Suppose we have a single PDF file and a content handler reporting every line in such a file. Instead of storing all the PDF content in a single "content" field, we'd have "content1":"line1", "content2":"line2", etc., and then offer support for searching across all of these contentN fields.

I guess it would be somewhat similar to your idea of having a separate Lucene Document per logical chunk, except that in this case we'd have a single Document with many fields covering a single PDF, etc.

Does it make any sense at all from the performance point of view, or is it maybe not worth it?
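
Just to illustrate what I mean (a rough sketch only, the field naming scheme is exactly what I'd like feedback on): every characters() callback becomes its own "contentN" field on a single Document, and the search layer would then have to expand queries across all the generated field names, e.g. with MultiFieldQueryParser, which is part of why I'm unsure about the performance.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: every characters() event from the Tika parser becomes a separate
// "contentN" field on one Lucene Document.
public class PerFragmentFieldHandler extends DefaultHandler {

    private final Document doc = new Document();
    private int counter;

    @Override
    public void characters(char[] ch, int start, int length) {
        String fragment = new String(ch, start, length).trim();
        if (!fragment.isEmpty()) {
            doc.add(new TextField("content" + (++counter), fragment, Field.Store.YES));
        }
    }

    public Document getDocument() {
        return doc;
    }
}

The alternative you mentioned, adding every fragment as another value of one multivalued "content" field, would keep the field name fixed but, as you say, still needs the whole Document in memory before addDocument() if the values are stored.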


In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber."  There may be a hit at highlighting time for large docs, depending on which highlighter you use.  In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.

Sounds reasonable
For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.

Sure, if we get users reporting OOM or similar related issues against our API, then that would be a good start :-)

Thanks, Sergey


Cheers,

Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene, and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.
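
In code it is roughly this (simplified, error handling omitted):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;

class SimpleTikaIndexer {
    static void indexFile(IndexWriter writer, String path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ToTextContentHandler handler = new ToTextContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        Document doc = new Document();
        // The whole extracted text is materialized as a single String here.
        doc.add(new TextField("content", handler.toString(), Field.Store.YES));
        writer.addDocument(doc);
    }
}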

This is unlikely to be effective for large files, so I wonder what strategies exist for more effective indexing/tokenization of possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey

