RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
Hi Sergey,

 Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.

In some cases, agreed, but in general, this is probably a good default idea.  
As you point out, you aren't quite duplicating the document -- one copy contains 
the original bytes, and the other contains the text (and metadata?) that was 
extracted from the document.  One reason to store the content in the field is 
for easy highlighting.  You could configure the highlighter to pull the text 
content of the document from a db or other source, but that adds complexity and 
perhaps lookup time.  What you really would not want to do from a time 
perspective is ask Tika to parse the raw bytes to pull the content for 
highlighting at search time.  In general, Lucene's storage of the content is 
very reasonable; on one big batch of text files I have, the Lucene index with 
stored fields is the same size as the uncompressed text files.
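
For what it's worth, a rough sketch of the "parse once, store the extracted 
text" approach -- the class and field names here are just placeholders, and I'm 
assuming current-ish Tika 1.x and Lucene 4.x APIs:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class StoredContentIndexer {
        // Parse the file once with Tika and index the extracted text as a stored
        // field, so a highlighter can read it back at search time without
        // re-parsing the PDF.
        public static void index(IndexWriter writer, String path) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream stream = Files.newInputStream(Paths.get(path))) {
                parser.parse(stream, handler, metadata);
            }
            Document doc = new Document();
            doc.add(new StoredField("path", path)); // where the original bytes live
            doc.add(new TextField("content", handler.toString(), Store.YES)); // indexed + stored
            writer.addDocument(doc);
        }
    }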

So I wonder, is it possible somehow for a given Tika Parser, let's say a 
PDF parser, to report, via the Metadata, the start and end indexes of the 
content? So the consumer would create, say, an InputStreamReader for that 
content region and use Store.NO with this Reader?

I don't think I quite understand what you're proposing.  The start and end 
indexes of the extracted content?  Wouldn't that just be 0 and the length of 
the string in most cases (beyond-BMP issues aside)?  Or, are you suggesting 
that there may be start and end indexes for content within the actual raw bytes 
of the PDF?  If the latter, for PDFs at least that would effectively require a 
full reparse ... if it were possible, and it probably wouldn't save much in 
time.  For other formats, where that might work, it would create far more 
complexity than value...IMHO.

In general, I'd say store the field.  Perhaps let the user choose to not store 
the field. 
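
Roughly, the two options look like this (just a sketch; the method names are 
made up):

    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;

    // Store the extracted text: the whole String has to sit in memory while
    // indexing, but the highlighter can later fetch it straight from the index.
    static void addStoredContent(Document doc, String extractedText) {
        doc.add(new TextField("content", extractedText, Store.YES));
    }

    // Don't store it: a Reader-backed field is indexed (tokenized) but never
    // stored, so highlighting would have to get the text from somewhere else.
    static void addUnstoredContent(Document doc, Reader extractedText) {
        doc.add(new TextField("content", extractedText));
    }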

Always interested to hear input from others.

Best,

  Tim


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively

Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
 Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does.  If you want 
 to store the field, you need to create the field with a String (as opposed to 
 a Reader); which means you have to have the whole thing in memory.  Also, if 
 you're proposing adding a field entry in a multivalued field for a given SAX 
 event, I don't think that will help, because you still have to hold the 
 entire document in memory before calling addDocument() if you are storing the 
 field.  If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using Tika parser and a Reader (and 
Lucene Store.NO)

Consider a case where we have a service which accepts a very large PDF 
file. This file will be stored on the disk or may be in some DB. And 
this service will also use Tika to extract content and populate a Lucene 
Document.
Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.

So I wonder, is it possible somehow for a given Tika Parser, let's say a 
PDF parser, to report, via the Metadata, the start and end indexes of the 
content? So the consumer would create, say, an InputStreamReader for that 
content region and use Store.NO with this Reader?

Does it really make sense at all? Should I create a minor enhancement 
request for parsers to get access to low-level info, like the start/stop 
delimiters of the content, so they can report it?

Cheers, Sergey





Some thoughts:

At the least, you could create a separate Lucene document for each 
 container document and each of its embedded documents.

You could also break large documents into logical sections and index those 
 as separate documents; but that gets very use-case dependent.

  In practice, for many, many use cases I've come across, you can index 
 quite large documents with no problems, e.g. Moby Dick or Dream of the Red 
 Chamber.  There may be a hit at highlighting time for large docs depending 
 on which highlighter you use.  In the old days, there used to be a 10k 
 default limit on the number of tokens, but that is now long gone.

For truly large docs (probably machine generated), yes, you could run into 
 problems if you need to hold the whole thing in memory.

   Cheers,

Tim
 -Original Message-
 From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
 Sent: Wednesday, July 02, 2014 8:27 AM
 To: user@tika.apache.org
 Subject: How to index the parsed content effectively

 Hi All,

 We've been experimenting with indexing the parsed content in Lucene and
 our initial attempt was to index the output from
 ToTextContentHandler.toString() as a Lucene Text field.

Re: How to index the parsed content effectively

2014-07-11 Thread Sergey Beryozkin

Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

   I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader); which means you have to have the whole thing in memory.  Also, if 
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.


I'd like to ask something about using Tika parser and a Reader (and 
Lucene Store.NO)


Consider a case where we have a service which accepts a very large PDF 
file. This file will be stored on the disk or may be in some DB. And 
this service will also use Tika to extract content and populate a Lucene 
Document.
Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.


So I wonder, is it possible somehow for a given Tika Parser, let's say a 
PDF parser, to report, via the Metadata, the start and end indexes of the 
content? So the consumer would create, say, an InputStreamReader for that 
content region and use Store.NO with this Reader?
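
To make it a bit more concrete -- this is purely hypothetical, no parser 
reports such offsets today, the helper below is made up, and it assumes the 
region would actually be plain text -- the consumer side could look roughly 
like:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.commons.io.IOUtils;
    import org.apache.commons.io.input.BoundedInputStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;

    // Purely hypothetical sketch: 'start' and 'end' would be the offsets reported
    // by the parser, 'raw' a stream over the stored original bytes.
    static void addContentRegion(Document doc, InputStream raw, long start, long end)
            throws IOException {
        IOUtils.skipFully(raw, start);
        InputStream region = new BoundedInputStream(raw, end - start);
        // A Reader-based field is indexed but never stored, so nothing is duplicated.
        doc.add(new TextField("content",
                new InputStreamReader(region, StandardCharsets.UTF_8)));
    }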


Does it really make sense at all? Should I create a minor enhancement 
request for parsers to get access to low-level info, like the start/stop 
delimiters of the content, so they can report it?


Cheers, Sergey






   Some thoughts:

   At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.

   You could also break large documents into logical sections and index those 
as separate documents; but that gets very use-case dependent.

 In practice, for many, many use cases I've come across, you can index quite large documents 
with no problems, e.g. Moby Dick or Dream of the Red Chamber.  There may be 
a hit at highlighting time for large docs depending on which highlighter you use.  In the old days, 
there used to be a 10k default limit on the number of tokens, but that is now long gone.

   For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.

  Cheers,

   Tim
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey





Re: How to index the parsed content effectively

2014-07-02 Thread Ken Krugler

On Jul 2, 2014, at 5:27am, Sergey Beryozkin sberyoz...@gmail.com wrote:

 Hi All,
 
 We've been experimenting with indexing the parsed content in Lucene and
 our initial attempt was to index the output from
 ToTextContentHandler.toString() as a Lucene Text field.
 
 This is unlikely to be effective for large files.

What are your concerns here?

And what's the max amount of text in one file you think you'll need to index?

-- Ken

 So I wonder what
 strategies exist for a more effective indexing/tokenization of the
 possibly large content.
 
 Perhaps a custom ContentHandler can index content fragments in a unique
 Lucene field every time its characters(...) method is called, something
 I've been planning to experiment with.
 
 The feedback will be appreciated
 Cheers, Sergey

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: How to index the parsed content effectively

2014-07-02 Thread Christian Reuschling

If you want to have a try, we created a crawling Tika parser, which gives 
recursive, incremental crawling capabilities to Tika. There we also 
implemented a handler as a decorator that writes into a Lucene index.

Check out 'Create a Lucene index' here:

https://github.com/leechcrawler/leech/blob/master/codeSnippets.md

Maybe it can also serve as a starting point if you look into the code.

best

Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:
 Hi All,
 
 We've been experimenting with indexing the parsed content in Lucene and our 
 initial attempt was
 to index the output from ToTextContentHandler.toString() as a Lucene Text 
 field.
 
 This is unlikely to be effective for large files. So I wonder what strategies 
 exist for a more
 effective indexing/tokenization of the possibly large content.
 
 Perhaps a custom ContentHandler can index content fragments in a unique 
 Lucene field every time
 its characters(...) method is called, something I've been planning to 
 experiment with.
 
 The feedback will be appreciated Cheers, Sergey

-- 
__
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de  http://www.dfki.uni-kl.de/~reuschling/

- Legal Company Information Required by German Law--
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
__


Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin

Hi,
On 02/07/14 13:54, Ken Krugler wrote:


On Jul 2, 2014, at 5:27am, Sergey Beryozkin sberyoz...@gmail.com
mailto:sberyoz...@gmail.com wrote:


Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files.


What are your concerns here?

We're writing a utility for (CXF JAX-RS) users to start experimenting with 
search with the help of Tika and Lucene. As such, my concerns are rather 
vague for now. I suspect that parsing a large file into a possibly massive 
String and indexing it as a single Lucene Text field won't be optimal 
memory- and/or performance-wise.



And what's the max amount of text in one file you think you'll need to
index?
This is something I've no idea about. I'd like to make sure our utility 
can help other users to effectively index Tika output into Lucene if 
they ever need it.


Thanks, Sergey



-- Ken


So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr








--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com


RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey,

  I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader); which means you have to have the whole thing in memory.  Also, if 
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.
 
  Some thoughts:

  At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.
  
  You could also break large documents into logical sections and index those as 
separate documents; but that gets very use-case dependent.
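
On the first point, a rough sketch of what that could look like with a custom 
EmbeddedDocumentExtractor registered on the ParseContext (class and field 
names are just placeholders):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Sketch: every embedded document becomes its own Lucene document.
    public class LuceneEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
        private final IndexWriter writer;
        private final AutoDetectParser parser = new AutoDetectParser();

        public LuceneEmbeddedDocumentExtractor(IndexWriter writer) {
            this.writer = writer;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml) throws SAXException, IOException {
            // Ignore the handler Tika passes in and collect the embedded text ourselves.
            BodyContentHandler text = new BodyContentHandler(-1);
            try {
                parser.parse(stream, text, metadata);
            } catch (Exception e) {
                throw new SAXException(e);
            }
            Document doc = new Document();
            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
            doc.add(new StoredField("resourceName", name == null ? "" : name));
            doc.add(new TextField("content", text.toString(), Store.YES));
            writer.addDocument(doc);
        }
    }

You'd register it with something like context.set(EmbeddedDocumentExtractor.class, 
new LuceneEmbeddedDocumentExtractor(writer)) before parsing the container file.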

In practice, for many, many use cases I've come across, you can index quite 
large documents with no problems, e.g. Moby Dick or Dream of the Red 
Chamber.  There may be a hit at highlighting time for large docs depending on 
which highlighter you use.  In the old days, there used to be a 10k default 
limit on the number of tokens, but that is now long gone.
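
By "a hit at highlighting time" I mean re-analyzing the stored text at search 
time -- roughly this kind of thing (just a sketch):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    // Sketch: highlight straight from the stored "content" field; with a very
    // large stored value, re-analyzing the whole text here is where the cost
    // shows up.
    static String highlight(Query query, Analyzer analyzer, String storedContent)
            throws IOException, InvalidTokenOffsetsException {
        Highlighter highlighter =
                new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
        return highlighter.getBestFragment(analyzer, "content", storedContent);
    }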
  
  For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.  
  
 Cheers,

  Tim
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey


Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin

Hi Tim

Thanks for sharing your thoughts. I find them very helpful,

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

   I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader); which means you have to have the whole thing in memory.  Also, if 
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.

   Some thoughts:

   At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.

   You could also break large documents into logical sections and index those 
as separate documents; but that gets very use-case dependent.


Right. I think this is something we might investigate further. The goal 
is to generalize some Tika-Parser-to-Lucene code sequences, and perhaps 
we can offer some boilerplate ContentHandler, as we don't know the 
concrete/final requirements of the would-be API consumers.


What is your opinion of having a Tika Parser ContentHandler that would 
try to do it in a minimal kind of way and store character sequences as 
unique individual Lucene fields? Suppose we have a single PDF file, and 
we have a content handler reporting every line in such a file. So 
instead of storing all the PDF content in a single content field we'd 
have content1:line1, content2:line2, etc., and then offer 
support for searching across all of these contentN fields?
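
Something like the following is what I have in mind -- untested, just to make 
the idea concrete; every characters(...) callback would become its own field:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.xml.sax.helpers.DefaultHandler;

    // Untested sketch: each non-empty characters(...) event becomes a "contentN" field.
    public class PerFragmentFieldHandler extends DefaultHandler {
        private final Document doc;
        private int counter;

        public PerFragmentFieldHandler(Document doc) {
            this.doc = doc;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String fragment = new String(ch, start, length).trim();
            if (!fragment.isEmpty()) {
                doc.add(new TextField("content" + counter++, fragment, Store.NO));
            }
        }
    }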


I guess it would be somewhat similar to your idea of having a separate 
Lucene Document for every logical chunk, except that in this case we'd 
have a single Document with many fields covering a single PDF/etc.


Does it make any sense at all from the performance point of view, or is 
it maybe not worth it?




 In practice, for many, many use cases I've come across, you can index quite large documents 
with no problems, e.g. Moby Dick or Dream of the Red Chamber.  There may be 
a hit at highlighting time for large docs depending on which highlighter you use.  In the old days, 
there used to be a 10k default limit on the number of tokens, but that is now long gone.


Sounds reasonable

   For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.


Sure, if we get users reporting OOM or similar issues against our API, 
then that would be a good starting point :-)


Thanks, Sergey



  Cheers,

   Tim
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey



Re: How to index the parsed content effectively

2014-07-02 Thread Christian Reuschling

Another aspect is that if you index such large documents, you also receive 
these documents inside your search results, which is then again a bit 
ambiguous for a user (if there is one in the use case). The search problem 
is only partially solved in this case. Maybe it would be better to index 
single chapters or something, to make it useful for the consumer in this case.

Another aspect is that such huge documents tend to have everything (i.e. 
every term) inside, which results in bad statistics (there are maybe no 
characteristic terms left). In the worst case, the document becomes part of 
every search result, but with low scores each time.

I would say that for 'normal', human-readable documents, the extracted texts 
have such a small memory footprint that there is no problem at all. To avoid 
an OOM for rare cases that are maybe invocation bugs, you can set a simple 
threshold, cut the document, print a warning, etc.
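
If I remember the Tika API correctly, such a threshold is almost free -- 
roughly (a sketch, the limit value is arbitrary):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.xml.sax.SAXException;

    // Cap the extracted text (here at 10 million chars) and just warn when a
    // document got cut, instead of risking an OOM on a pathological input.
    static String extractWithThreshold(InputStream stream)
            throws IOException, TikaException, SAXException {
        WriteOutContentHandler capped = new WriteOutContentHandler(10 * 1000 * 1000);
        try {
            new AutoDetectParser().parse(stream, new BodyContentHandler(capped),
                    new Metadata(), new ParseContext());
        } catch (SAXException e) {
            if (!capped.isWriteLimitReached(e)) {
                throw e; // a real parse error
            }
            System.err.println("warning: document truncated at the write limit");
        }
        return capped.toString();
    }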

Of course, everything depends on the use case ;)


On 02.07.2014 17:45, Sergey Beryozkin wrote:
 Hi Tim
 
 Thanks for sharing your thoughts. I find them very helpful,
 
 On 02/07/14 14:32, Allison, Timothy B. wrote:
 Hi Sergey,
 
 I'd take a look at what the DataImportHandler in Solr does.  If you want to 
 store the field, 
 you need to create the field with a String (as opposed to a Reader); which 
 means you have to
 have the whole thing in memory.  Also, if you're proposing adding a field 
 entry in a
 multivalued field for a given SAX event, I don't think that will help, 
 because you still have
 to hold the entire document in memory before calling addDocument() if you 
 are storing the
 field.  If you aren't storing the field, then you could try a Reader.
 
 Some thoughts:
 
 At the least, you could create a separate Lucene document for each container 
 document and
 each of its embedded documents.
 
 You could also break large documents into logical sections and index those 
 as separate 
 documents; but that gets very use-case dependent.
 
 Right. I think this is something we might investigate further. The goal is to 
 generalize some
 Tika Parser to Lucene code sequences, and perhaps we can offer some 
 boilerplate ContentHandler
 as we don't know of the concrete/final requirements of the would be API 
 consumers.
 
 What is your opinion of having a Tika Parser ContentHandler that would try to 
 do it in a
 minimal kind of way, store character sequences as unique individual Lucene 
 fields. Suppose we
 have a single PDF file, and we have a content handler reporting every line in 
 such a file. So
 instead of storing all the PDF content in a single content field we'd have
 content1:line1, content2:line2, etc and then offer a support for 
 searching across all
 of these contentN fields ?
 
 I guess it would be somewhat similar to your idea of having a separate Lucene 
 Document per
 every logical chunk, except that in this case we'd have a single Document 
 with many fields
 covering a single PDF/etc
 
 Does it make any sense at all from the performance point of view or may be 
 not worth it ?
 
 
 In practice, for many, many use cases I've come across, you can index quite 
 large documents 
 with no problems, e.g. Moby Dick or Dream of the Red Chamber.  There may 
 be a hit at 
 highlighting time for large docs depending on which highlighter you use.  In 
 the old days,
 there used to be a 10k default limit on the number of tokens, but that is 
 now long gone.
 
 Sounds reasonable
 For truly large docs (probably machine generated), yes, you could run into 
 problems if you
 need to hold the whole thing in memory.
 
 Sure, if we get the users reporting OOM or similar related issues against our 
 API then it would
 be a good start :-)
 
 Thanks, Sergey
 
 
 Cheers,
 
 Tim -Original Message- From: Sergey Beryozkin 
 [mailto:sberyoz...@gmail.com] Sent:
 Wednesday, July 02, 2014 8:27 AM To: user@tika.apache.org Subject: How to 
 index the parsed
 content effectively
 
 Hi All,
 
 We've been experimenting with indexing the parsed content in Lucene and our 
 initial attempt
 was to index the output from ToTextContentHandler.toString() as a Lucene 
 Text field.
 
 This is unlikely to be effective for large files. So I wonder what 
 strategies exist for a
 more effective indexing/tokenization of the possibly large content.
 
 Perhaps a custom ContentHandler can index content fragments in a unique 
 Lucene field every
 time its characters(...) method is called, something I've been planning to 
 experiment with.
 
 The feedback will be appreciated Cheers, Sergey
 

-- 
__
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de  

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin

Hi
On 02/07/14 17:32, Christian Reuschling wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Another aspect is that if you index such large documents, you also receive 
these documents inside your search results, which is then again a bit 
ambiguous for a user (if there is one in the use case). The search problem 
is only partially solved in this case. Maybe it would be better to index 
single chapters or something, to make it useful for the consumer in this case.

This is another nice idea. We expect users to customize the process of 
indexing the Tika-produced content if they aren't satisfied with the 
default approach of storing the content in a single field.
But as we move along and start getting more experience/feedback, we may 
be able to find a way to generalize some of the ideas that you and Tim 
talked about. For example, we may ship a boilerplate ContentHandler that 
is able to react to new-chapter or new-document indicators, etc.
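
For example -- just a sketch, and whether Tika reports anything that looks 
like a chapter boundary obviously depends on the format -- a handler could 
flush a new Lucene document whenever it sees an h1 element in Tika's XHTML 
output:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Rough idea only: treat every <h1> in Tika's XHTML output as a "new chapter"
    // indicator and flush the text gathered so far as its own Lucene document.
    public class ChapterSplittingHandler extends DefaultHandler {
        private final IndexWriter writer;
        private final StringBuilder chapter = new StringBuilder();

        public ChapterSplittingHandler(IndexWriter writer) {
            this.writer = writer;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if ("h1".equals(localName)) {
                flush();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            chapter.append(ch, start, length);
        }

        @Override
        public void endDocument() throws SAXException {
            flush();
        }

        private void flush() throws SAXException {
            String text = chapter.toString().trim();
            chapter.setLength(0);
            if (text.isEmpty()) {
                return;
            }
            try {
                Document doc = new Document();
                doc.add(new TextField("content", text, Store.YES));
                writer.addDocument(doc);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }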



Another aspect is that such huge documents tend to have everything (i.e. 
every term) inside, which results in bad statistics (there are maybe no 
characteristic terms left). In the worst case, the document becomes part of 
every search result, but with low scores each time.

I would say that for 'normal', human-readable documents, the extracted texts 
have such a small memory footprint that there is no problem at all. To avoid 
an OOM for rare cases that are maybe invocation bugs, you can set a simple 
threshold, cut the document, print a warning, etc.


Sure

Of course, everything depends on the use case ;)


I agree,
Many thanks for the feedback,
Definitely has been useful for me and hopefully for some other users :-)
Cheers, Sergey


On 02.07.2014 17:45, Sergey Beryozkin wrote:

Hi Tim

Thanks for sharing your thoughts. I find them very helpful,

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field,
you need to create the field with a String (as opposed to a Reader); which 
means you have to
have the whole thing in memory.  Also, if you're proposing adding a field entry 
in a
multivalued field for a given SAX event, I don't think that will help, because 
you still have
to hold the entire document in memory before calling addDocument() if you are 
storing the
field.  If you aren't storing the field, then you could try a Reader.

Some thoughts:

At the least, you could create a separate Lucene document for each container 
document and
each of its embedded documents.

You could also break large documents into logical sections and index those as 
separate
documents; but that gets very use-case dependent.


Right. I think this is something we might investigate further. The goal is to 
generalize some
Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate 
ContentHandler
as we don't know of the concrete/final requirements of the would be API 
consumers.

What is your opinion of having a Tika Parser ContentHandler that would try to 
do it in a
minimal kind of way, store character sequences as unique individual Lucene 
fields. Suppose we
have a single PDF file, and we have a content handler reporting every line in 
such a file. So
instead of storing all the PDF content in a single content field we'd have
content1:line1, content2:line2, etc and then offer a support for 
searching across all
of these contentN fields ?

I guess it would be somewhat similar to your idea of having a separate Lucene 
Document per
every logical chunk, except that in this case we'd have a single Document with 
many fields
covering a single PDF/etc

Does it make any sense at all from the performance point of view or may be not 
worth it ?



In practice, for many, many use cases I've come across, you can index quite 
large documents
with no problems, e.g. Moby Dick or Dream of the Red Chamber.  There may be 
a hit at
highlighting time for large docs depending on which highlighter you use.  In 
the old days,
there used to be a 10k default limit on the number of tokens, but that is now 
long gone.


Sounds reasonable

For truly large docs (probably machine generated), yes, you could run into 
problems if you
need to hold the whole thing in memory.


Sure, if we get the users reporting OOM or similar related issues against our 
API then it would
be a good start :-)

Thanks, Sergey



Cheers,

Tim -Original Message- From: Sergey Beryozkin 
[mailto:sberyoz...@gmail.com] Sent:
Wednesday, July 02, 2014 8:27 AM To: user@tika.apache.org Subject: How to index 
the parsed
content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our 
initial attempt
was to index the output from ToTextContentHandler.toString() as a Lucene Text 
field.

This is unlikely to be effective for large files. So I wonder what strategies 
exist for a
more effective indexing/tokenization of the possibly large content.

Perhaps a custom