Re: How to index the parsed content effectively
Hi Tim

On 14/07/14 14:53, Allison, Timothy B. wrote:

Hi Sergey,

Now, we already have the original PDF occupying some space, so duplicating it (its content) with a Document with Store.YES fields may not be the best idea in some cases.

In some cases, agreed, but in general, this is probably a good default idea. As you point out, you aren't quite duplicating the document -- one copy contains the original bytes, and the other contains the text (and metadata?) that was extracted from the document. One reason to store the content in the field is for easy highlighting. You could configure the highlighter to pull the text content of the document from a db or other source, but that adds complexity and perhaps lookup time. What you really would not want to do from a time perspective is ask Tika to parse the raw bytes to pull the content for highlighting at search time. In general, Lucene's storage of the content is very reasonable; on one big batch of text files I have, the Lucene index with stored fields is the same size as the uncompressed text files.

OK. I'm sure Lucene is very good at what it does. I'm just trying to figure out what the limits may be. By the way, I apologize if it is off topic. For now it seems to me that Tika and Lucene can make a perfect combination, something Solr and other implementations build upon AFAIK...

So I wonder, is it possible somehow for a given Tika Parser, let's say a PDF parser, to report, via the Metadata, the start and end indexes of the content? So the consumer would create, say, an InputStreamReader for a content region and use Store.NO with this Reader?

I don't think I quite understand what you're proposing. The start and end indexes of the extracted content? Wouldn't that just be 0 and the length of the string in most cases (beyond-BMP issues aside)? Or, are you suggesting that there may be start and end indexes for content within the actual raw bytes of the PDF? If the latter, for PDFs at least that would effectively require a full reparse ... if it were possible, and it probably wouldn't save much in time. For other formats, where that might work, it would create far more complexity than value...IMHO.

Start and end indexes for content within the actual raw bytes... It's theoretical for me at this point in time. I was thinking of this case: we have a PDF stored on the disk, Tika parsing the content against a NOP content handler and providing these indexes, and we have a Document in memory only, using a Reader to populate the content field. When we restart, we pay the penalty (the only penalty) of having Lucene repopulate the Document from a Reader; on the plus side, the content is only ever stored on the disk or in a DB as part of the original PDF image. We'd only have to persist these indexes to avoid having Tika reparse again. I think I won't worry about getting it all over-optimized at this stage; from what I understand, working with Lucene means no major storage limits exist :-)

In general, I'd say store the field. Perhaps let the user choose to not store the field.

I guess in the context of working with Tika I'd go for storing the fields for now...

Thanks, Sergey

Always interested to hear input from others.

Best,
Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively

Hi Tim, All.

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using a Tika parser and a Reader (and Lucene Store.NO). Consider a case where we have a service which accepts a very large PDF file. This file will be stored on the disk or maybe in some DB. And this service will also use Tika to extract content and populate a Lucene Document. Now, we already have the original PDF occupying some space, so duplicating it (its content) with a Document with Store.YES fields may not be the best idea in some cases.

So I wonder, is it possible somehow for a given Tika Parser, let's say a PDF parser, to report, via the Metadata, the start and end indexes of the content? So the consumer would create, say, an InputStreamReader for a content region and use Store.NO with this Reader? Does it really make sense at all? I can create a minor enhancement request for parsers to get access to low-level info, like the start/stop delimiters of the content, and report it?
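[Editorial note: to make the Store.YES vs. Store.NO trade-off discussed above concrete, here is a minimal sketch (not from the thread) of the two field configurations, assuming Lucene 4.x-era APIs; the "path" and "content" field names are illustrative.]

import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class ContentFieldSketch {

    // Option 1: store the extracted text alongside the indexed terms.
    // Costs disk space, but highlighting can read the text straight
    // from the stored field at search time.
    public static Document storedDoc(String path, String extractedText) {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("content", extractedText, Field.Store.YES));
        return doc;
    }

    // Option 2: index only, streaming the text from a Reader.
    // A TextField built from a Reader is never stored, so the index stays
    // smaller, but highlighting must re-fetch the text from the original
    // PDF (or a DB), which is the cost Tim describes above.
    public static Document unstoredDoc(String path, Reader extractedText) {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("content", extractedText));
        return doc;
    }
}

Either Document would then be passed to IndexWriter.addDocument(); the only difference at search time is whether the raw text can be read back from the index.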
RE: How to index the parsed content effectively
Hi Sergey,

>> Now, we already have the original PDF occupying some space, so duplicating it (its content) with a Document with Store.YES fields may not be the best idea in some cases.

In some cases, agreed, but in general, this is probably a good default idea. As you point out, you aren't quite duplicating the document -- one copy contains the original bytes, and the other contains the text (and metadata?) that was extracted from the document. One reason to store the content in the field is for easy highlighting. You could configure the highlighter to pull the text content of the document from a db or other source, but that adds complexity and perhaps lookup time. What you really would not want to do from a time perspective is ask Tika to parse the raw bytes to pull the content for highlighting at search time. In general, Lucene's storage of the content is very reasonable; on one big batch of text files I have, the Lucene index with stored fields is the same size as the uncompressed text files.

>> So I wonder, is it possible somehow for a given Tika Parser, let's say a PDF parser, to report, via the Metadata, the start and end indexes of the content? So the consumer would create, say, an InputStreamReader for a content region and use Store.NO with this Reader?

I don't think I quite understand what you're proposing. The start and end indexes of the extracted content? Wouldn't that just be 0 and the length of the string in most cases (beyond-BMP issues aside)? Or, are you suggesting that there may be start and end indexes for content within the actual raw bytes of the PDF? If the latter, for PDFs at least that would effectively require a full reparse ... if it were possible, and it probably wouldn't save much in time. For other formats, where that might work, it would create far more complexity than value...IMHO.

In general, I'd say store the field. Perhaps let the user choose to not store the field.

Always interested to hear input from others.

Best,
Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively

Hi Tim, All.

On 02/07/14 14:32, Allison, Timothy B. wrote:
> Hi Sergey,
>
> I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using a Tika parser and a Reader (and Lucene Store.NO). Consider a case where we have a service which accepts a very large PDF file. This file will be stored on the disk or maybe in some DB. And this service will also use Tika to extract content and populate a Lucene Document. Now, we already have the original PDF occupying some space, so duplicating it (its content) with a Document with Store.YES fields may not be the best idea in some cases. So I wonder, is it possible somehow for a given Tika Parser, let's say a PDF parser, to report, via the Metadata, the start and end indexes of the content? So the consumer would create, say, an InputStreamReader for a content region and use Store.NO with this Reader?
Does it really make sense at all? I can create a minor enhancement request for parsers to get access to low-level info, like the start/stop delimiters of the content, and report it?

Cheers, Sergey

> Some thoughts:
>
> At the least, you could create a separate Lucene document for each container document and each of its embedded documents.
>
> You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.
>
> In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.
>
> For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.
>
> Cheers,
>
> Tim
> -Original Message-
> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
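[Editorial note: as a companion to Tim's point that a stored "content" field makes highlighting easy, here is a hedged sketch of pulling snippets straight from the stored field with Lucene's Highlighter, assuming Lucene 4.x-era APIs and the field layout from the earlier sketch.]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class HighlightFromStoredField {

    public static void search(Directory dir, Analyzer analyzer, String queryText) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser(Version.LUCENE_4_9, "content", analyzer).parse(queryText);

            Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                // The text comes straight from the stored field; no Tika re-parse
                // of the original PDF is needed at search time.
                String stored = doc.get("content");
                String snippet = highlighter.getBestFragment(analyzer, "content", stored);
                System.out.println(doc.get("path") + ": " + snippet);
            }
        }
    }
}

If the field were not stored, the text passed to getBestFragment() would have to come from the original file or a database instead, which is exactly the extra complexity and lookup time mentioned above.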
Re: How to index the parsed content effectively
Hi Tim, All.

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using a Tika parser and a Reader (and Lucene Store.NO). Consider a case where we have a service which accepts a very large PDF file. This file will be stored on the disk or maybe in some DB. And this service will also use Tika to extract content and populate a Lucene Document. Now, we already have the original PDF occupying some space, so duplicating it (its content) with a Document with Store.YES fields may not be the best idea in some cases.

So I wonder, is it possible somehow for a given Tika Parser, let's say a PDF parser, to report, via the Metadata, the start and end indexes of the content? So the consumer would create, say, an InputStreamReader for a content region and use Store.NO with this Reader? Does it really make sense at all? I can create a minor enhancement request for parsers to get access to low-level info, like the start/stop delimiters of the content, and report it?

Cheers, Sergey

Some thoughts:

At the least, you could create a separate Lucene document for each container document and each of its embedded documents.

You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.

In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.

For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.

Cheers,

Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey
Re: How to index the parsed content effectively
Hi

On 02/07/14 17:32, Christian Reuschling wrote:

another aspect is, if you index such large documents, you also receive these documents inside your search results, which is then again a bit ambiguous for a user (if there is one in the use case). The search problem is only partially solved in this case. Maybe it would be better to index single chapters or something, to make it useful for the consumer in this case.

This is another nice idea. We'll expect the users to customize the process of indexing the Tika-produced content if they aren't satisfied with the default approach of storing the content in a single field. But as we move along and start getting more experience/feedback we may be able to find a way to generalize some of the ideas that you and Tim talked about. For example, we may ship a boilerplate ContentHandler that may be able to react to new chapter or new document indicators, etc.

Another aspect is that such huge documents tend to have everything (i.e. every term) inside, which results in bad statistics (there are maybe no characteristic terms left). In the worst case, the document becomes part of every search result, but with low scores in any case. I would say, for 'normal', human-readable documents, the extracted texts are so small in memory footprint that there is no problem at all - to avoid an OOM for rare cases that are maybe invocation bugs, you can set a simple threshold, cutting the document, print a warning, etc.

Sure

Of course, everything depends on the use case ;)

I agree. Many thanks for the feedback - it has definitely been useful for me and hopefully for some other users :-)

Cheers, Sergey

On 02.07.2014 17:45, Sergey Beryozkin wrote:

Hi Tim

Thanks for sharing your thoughts. I find them very helpful,

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.

Some thoughts:

At the least, you could create a separate Lucene document for each container document and each of its embedded documents.

You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.

Right. I think this is something we might investigate further. The goal is to generalize some Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate ContentHandler as we don't know of the concrete/final requirements of the would-be API consumers.

What is your opinion of having a Tika Parser ContentHandler that would try to do it in a minimal kind of way, storing character sequences as unique individual Lucene fields. Suppose we have a single PDF file, and we have a content handler reporting every line in such a file. So instead of storing all the PDF content in a single "content" field we'd have "content1":"line1", "content2":"line2", etc. and then offer support for searching across all of these contentN fields?

I guess it would be somewhat similar to your idea of having a separate Lucene Document for every logical chunk, except that in this case we'd have a single Document with many fields covering a single PDF, etc.

Does it make any sense at all from the performance point of view, or maybe it is not worth it?

In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.

Sounds reasonable

For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.

Sure, if we get the users reporting OOM or similar related issues against our API then it would be a good start :-)

Thanks, Sergey

Cheers,

Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files.
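[Editorial note: a boilerplate handler reacting to new-chapter indicators, as suggested above, could look roughly like the sketch below. It assumes Tika's XHTML SAX output, treats h1/h2 elements as chapter boundaries, and uses illustrative field names; it is only one possible shape for such a handler, not an agreed design.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Emits one Lucene Document per "chapter", using <h1>/<h2> elements in
// Tika's XHTML SAX stream as the new-chapter indicator.
public class ChapterSplittingHandler extends DefaultHandler {

    private final IndexWriter writer;
    private final String sourcePath;
    private final StringBuilder chapter = new StringBuilder();
    private int chapterNo = 0;

    public ChapterSplittingHandler(IndexWriter writer, String sourcePath) {
        this.writer = writer;
        this.sourcePath = sourcePath;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("h1".equals(localName) || "h2".equals(localName)) {
            flushChapter(); // a heading starts a new logical section
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chapter.append(ch, start, length);
    }

    @Override
    public void endDocument() throws SAXException {
        flushChapter(); // index whatever is left at the end of the file
    }

    private void flushChapter() throws SAXException {
        if (chapter.length() == 0) {
            return;
        }
        try {
            Document doc = new Document();
            doc.add(new StringField("path", sourcePath, Field.Store.YES));
            doc.add(new StringField("chapter", Integer.toString(chapterNo++), Field.Store.YES));
            doc.add(new TextField("content", chapter.toString(), Field.Store.YES));
            writer.addDocument(doc);
        } catch (Exception e) {
            throw new SAXException(e);
        }
        chapter.setLength(0);
    }
}

Each chapter then shows up as its own hit, which also addresses the "huge document matches every query" statistics problem mentioned in this exchange.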
Re: How to index the parsed content effectively
another aspect is, if you index such large documents, you also receive these documents inside your search results, which is then again a bit ambiguous for a user (if there is one in the use case). The search problem is only partially solved in this case. Maybe it would be better to index single chapters or something, to make it useful for the consumer in this case.

Another aspect is that such huge documents tend to have everything (i.e. every term) inside, which results in bad statistics (there are maybe no characteristic terms left). In the worst case, the document becomes part of every search result, but with low scores in any case.

I would say, for 'normal', human-readable documents, the extracted texts are so small in memory footprint that there is no problem at all - to avoid an OOM for rare cases that are maybe invocation bugs, you can set a simple threshold, cutting the document, print a warning, etc.

Of course, everything depends on the use case ;)

On 02.07.2014 17:45, Sergey Beryozkin wrote:
> Hi Tim
>
> Thanks for sharing your thoughts. I find them very helpful,
>
> On 02/07/14 14:32, Allison, Timothy B. wrote:
>> Hi Sergey,
>>
>> I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.
>>
>> Some thoughts:
>>
>> At the least, you could create a separate Lucene document for each container document and each of its embedded documents.
>>
>> You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.
>
> Right. I think this is something we might investigate further. The goal is to generalize some Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate ContentHandler as we don't know of the concrete/final requirements of the would-be API consumers.
>
> What is your opinion of having a Tika Parser ContentHandler that would try to do it in a minimal kind of way, storing character sequences as unique individual Lucene fields. Suppose we have a single PDF file, and we have a content handler reporting every line in such a file. So instead of storing all the PDF content in a single "content" field we'd have "content1":"line1", "content2":"line2", etc. and then offer support for searching across all of these contentN fields?
>
> I guess it would be somewhat similar to your idea of having a separate Lucene Document for every logical chunk, except that in this case we'd have a single Document with many fields covering a single PDF, etc.
>
> Does it make any sense at all from the performance point of view, or maybe it is not worth it?
>
>> In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.
>
> Sounds reasonable
>
>> For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.
>
> Sure, if we get the users reporting OOM or similar related issues against our API then it would be a good start :-)
>
> Thanks, Sergey
>
>> Cheers,
>>
>> Tim
>> -Original Message-
>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>> Sent: Wednesday, July 02, 2014 8:27 AM
>> To: user@tika.apache.org
>> Subject: How to index the parsed content effectively
>>
>> Hi All,
>>
>> We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.
>>
>> This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.
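[Editorial note: the "simple threshold, cutting the document, print a warning" idea above is already supported by Tika's WriteOutContentHandler write limit. A small sketch follows; the limit value and the logging are illustrative.]

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;

public class ThresholdedExtraction {

    // Extract at most maxChars characters; keep the truncated text and warn
    // instead of failing (or risking an OOM) on a pathological file.
    public static String extract(InputStream in, int maxChars) throws Exception {
        WriteOutContentHandler limited = new WriteOutContentHandler(maxChars);
        AutoDetectParser parser = new AutoDetectParser();
        try {
            parser.parse(in, new BodyContentHandler(limited), new Metadata(), new ParseContext());
        } catch (Exception e) {
            if (limited.isWriteLimitReached(e)) {
                System.err.println("Write limit of " + maxChars + " chars reached; content truncated");
            } else {
                throw e;
            }
        }
        return limited.toString();
    }
}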
Re: How to index the parsed content effectively
Hi Tim

Thanks for sharing your thoughts. I find them very helpful,

On 02/07/14 14:32, Allison, Timothy B. wrote:

Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader.

Some thoughts:

At the least, you could create a separate Lucene document for each container document and each of its embedded documents.

You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent.

Right. I think this is something we might investigate further. The goal is to generalize some Tika Parser to Lucene code sequences, and perhaps we can offer some boilerplate ContentHandler as we don't know of the concrete/final requirements of the would-be API consumers.

What is your opinion of having a Tika Parser ContentHandler that would try to do it in a minimal kind of way, storing character sequences as unique individual Lucene fields. Suppose we have a single PDF file, and we have a content handler reporting every line in such a file. So instead of storing all the PDF content in a single "content" field we'd have "content1":"line1", "content2":"line2", etc. and then offer support for searching across all of these contentN fields?

I guess it would be somewhat similar to your idea of having a separate Lucene Document for every logical chunk, except that in this case we'd have a single Document with many fields covering a single PDF, etc.

Does it make any sense at all from the performance point of view, or maybe it is not worth it?

In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone.

Sounds reasonable

For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory.

Sure, if we get the users reporting OOM or similar related issues against our API then it would be a good start :-)

Thanks, Sergey

Cheers,

Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey
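[Editorial note: one way to realize the per-characters() idea discussed here is to reuse the same field name for every SAX text event; Lucene treats repeated adds as a multivalued field (the option Tim refers to), so a single query on "content" covers all fragments. A hedged sketch follows, with illustrative names; note Tim's caveat that the whole Document is still built in memory before addDocument(), so this mainly avoids one giant concatenated String rather than reducing peak memory.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

// Collects Tika's character events into a single Lucene Document,
// adding one field value per SAX characters() call instead of one
// huge concatenated string.
public class PerEventFieldHandler extends DefaultHandler {

    private final Document doc = new Document();

    @Override
    public void characters(char[] ch, int start, int length) {
        String fragment = new String(ch, start, length).trim();
        if (!fragment.isEmpty()) {
            // Same field name each time: Lucene treats repeated adds as a
            // multivalued field, so one query on "content" searches them all.
            doc.add(new TextField("content", fragment, Field.Store.NO));
        }
    }

    public Document getDocument() {
        return doc;
    }
}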
Re: How to index the parsed content effectively
Hi,

On 02/07/14 14:05, Christian Reuschling wrote:

If you want to have a try, we created a crawling Tika parser, which gives recursive, incremental crawling capabilities to Tika. There we also implemented a handler as a decorator that writes into a Lucene index. Check out 'Create a Lucene index' here: https://github.com/leechcrawler/leech/blob/master/codeSnippets.md Maybe also as a starting point by looking into the code

Thanks for the link. Our requirements are fairly simple: we want to provide utility code for our users to do an effective enough indexing of the content passed via a ContentHandler. We will check the code and see if something similar can be applied to our case, and will get back with confirmation if so...

Thanks, Sergey

best Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey
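[Editorial note: this is not the leech implementation itself, just a minimal sketch of the same decorator idea - a Tika ContentHandlerDecorator that collects plain text and, on endDocument(), turns it plus the Tika Metadata into a Lucene Document. Field handling is illustrative.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.apache.tika.sax.ToTextContentHandler;
import org.xml.sax.SAXException;

// Decorates a plain-text handler and, once parsing finishes, indexes the
// collected text plus the Tika Metadata as one Lucene Document.
public class LuceneIndexingDecorator extends ContentHandlerDecorator {

    private final ToTextContentHandler text;
    private final IndexWriter writer;
    private final Metadata metadata;

    public LuceneIndexingDecorator(ToTextContentHandler text, IndexWriter writer, Metadata metadata) {
        super(text);
        this.text = text;
        this.writer = writer;
        this.metadata = metadata;
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
        try {
            Document doc = new Document();
            // Copy whatever metadata the parser reported (author, type, ...).
            for (String name : metadata.names()) {
                doc.add(new StringField(name, metadata.get(name), Field.Store.YES));
            }
            doc.add(new TextField("content", text.toString(), Field.Store.YES));
            writer.addDocument(doc);
        } catch (Exception e) {
            throw new SAXException(e);
        }
    }
}

It would be passed to parser.parse(...) in place of the plain handler, so the indexing happens as a side effect of parsing, which is roughly the decorator pattern the leech snippet demonstrates.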
RE: How to index the parsed content effectively
Hi Sergey, I'd take a look at what the DataImportHandler in Solr does. If you want to store the field, you need to create the field with a String (as opposed to a Reader); which means you have to have the whole thing in memory. Also, if you're proposing adding a field entry in a multivalued field for a given SAX event, I don't think that will help, because you still have to hold the entire document in memory before calling addDocument() if you are storing the field. If you aren't storing the field, then you could try a Reader. Some thoughts: At the least, you could create a separate Lucene document for each container document and each of its embedded documents. You could also break large documents into logical sections and index those as separate documents; but that gets very use-case dependent. In practice, for many, many use cases I've come across, you can index quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be a hit at highlighting time for large docs depending on which highlighter you use. In the old days, there used to be a 10k default limit on the number of tokens, but that is now long gone. For truly large docs (probably machine generated), yes, you could run into problems if you need to hold the whole thing in memory. Cheers, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Wednesday, July 02, 2014 8:27 AM To: user@tika.apache.org Subject: How to index the parsed content effectively Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field. This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content. Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with. The feedback will be appreciated Cheers, Sergey
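[Editorial note: for the "separate Lucene document for each container document and each of its embedded documents" suggestion above, one possible approach (a sketch, not necessarily what Tim had in mind) is a custom EmbeddedDocumentExtractor registered on the ParseContext. Field names and the recursion strategy are illustrative assumptions.]

import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Indexes each embedded document (attachment, archive entry, etc.) as its
// own Lucene Document, separate from the container document.
public class IndexingEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {

    private final IndexWriter writer;
    private final AutoDetectParser parser = new AutoDetectParser();

    public IndexingEmbeddedDocumentExtractor(IndexWriter writer) {
        this.writer = writer;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return true; // parse every embedded document
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // Ignore the container's handler and collect this attachment on its own.
        BodyContentHandler body = new BodyContentHandler(-1);
        try {
            parser.parse(stream, body, metadata, new ParseContext());
            Document doc = new Document();
            String name = metadata.get("resourceName"); // name reported by the container parser, if any
            doc.add(new StringField("name", name == null ? "embedded" : name, Field.Store.YES));
            doc.add(new TextField("content", body.toString(), Field.Store.YES));
            writer.addDocument(doc);
        } catch (Exception e) {
            throw new SAXException(e);
        }
    }
}

It would be registered before parsing the container file with something like: context.set(EmbeddedDocumentExtractor.class, new IndexingEmbeddedDocumentExtractor(writer)); the container itself is then indexed from whatever handler is passed to the top-level parse() call.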
Re: How to index the parsed content effectively
Hi,

On 02/07/14 13:54, Ken Krugler wrote:

On Jul 2, 2014, at 5:27am, Sergey Beryozkin <mailto:sberyoz...@gmail.com> wrote:

Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field. This is unlikely to be effective for large files.

What are your concerns here?

We're writing a utility for (CXF JAX-RS) users to start experimenting with searching with the help of Tika and Lucene. As such, my concerns are rather vague for now. I suspect that parsing a large file into a possibly very large/massive String and indexing it in a single Lucene Text field won't be memory and/or performance optimal.

And what's the max amount of text in one file you think you'll need to index?

This is something I've no idea about. I'd like to make sure our utility can help other users to effectively index Tika output into Lucene if they ever need it.

Thanks, Sergey

-- Ken

So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
Blog: http://sberyozkin.blogspot.com
Re: How to index the parsed content effectively
If you want to have a try, we created a crawling Tika parser, which gives recursive, incremental crawling capabilities to Tika. There we also implemented a handler as a decorator that writes into a Lucene index. Check out 'Create a Lucene index' here: https://github.com/leechcrawler/leech/blob/master/codeSnippets.md

Maybe also as a starting point by looking into the code

best Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:
> Hi All,
>
> We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.
>
> This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.
>
> Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.
>
> The feedback will be appreciated. Cheers, Sergey

--
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer, Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de  http://www.dfki.uni-kl.de/~reuschling/
Re: How to index the parsed content effectively
On Jul 2, 2014, at 5:27am, Sergey Beryozkin wrote: > Hi All, > > We've been experimenting with indexing the parsed content in Lucene and > our initial attempt was to index the output from > ToTextContentHandler.toString() as a Lucene Text field. > > This is unlikely to be effective for large files. What are your concerns here? And what's the max amount of text in one file you think you'll need to index? -- Ken > So I wonder what > strategies exist for a more effective indexing/tokenization of the > possibly large content. > > Perhaps a custom ContentHandler can index content fragments in a unique > Lucene field every time its characters(...) method is called, something > I've been planning to experiment with. > > The feedback will be appreciated > Cheers, Sergey -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
How to index the parsed content effectively
Hi All,

We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what strategies exist for a more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with.

The feedback will be appreciated.

Cheers, Sergey
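[Editorial note: for reference, the starting-point approach described in this message looks roughly like the sketch below, assuming a Lucene 4.x-era setup; the index path and field name are illustrative, and the Version constant would be adjusted to the Lucene release in use.]

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;

public class IndexParsedContent {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/tika-index"));
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9));

        try (IndexWriter writer = new IndexWriter(dir, config);
             InputStream in = new FileInputStream(args[0])) {

            // Extract the whole document as one String...
            ToTextContentHandler handler = new ToTextContentHandler();
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());

            // ...and index it as a single Text field. Simple, but the full
            // text is held in memory, which is the concern raised above.
            Document doc = new Document();
            doc.add(new TextField("content", handler.toString(), Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}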