html of the crawled pages.

2011-07-10 Thread Cam Bazz
Hello All, Is there a way to save the plain HTML of the crawled pages? Or is it already stored in the segments dir? Best Regards, -C.B.

Re: html of the crawled pages.

2011-07-10 Thread lewis john mcgibbney
Hi C.B., Can you please expand on this description?

On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz camb...@gmail.com wrote:
> Hello All, Is there a way to save the plain HTML of the crawled pages? Or is it already stored in the segments dir? Best Regards, -C.B.

-- Lewis

Re: html of the crawled pages.

2011-07-10 Thread Cam Bazz
I would like to access it and run my own parser/analyzer if necessary. Can I read this segment data? Best

On Sun, Jul 10, 2011 at 9:08 PM, Markus Jelsma markus.jel...@openindex.io wrote:
> Well, the raw data is stored inside the segment. Without it there would be nothing to parse. What do
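
On reading the segment data directly: Nutch ships a segment reader, `bin/nutch readseg`, whose `-dump` mode writes a segment's records to text and accepts options such as `-nofetch` and `-noparse` to skip the parts you do not need. The following is a minimal Python sketch, not something posted in this thread, that builds such a command so that only the raw fetched content is dumped. The helper names, the `bin/nutch` location, and the example segment path are assumptions for illustration.

```python
import subprocess

# readseg general options that skip every segment part except `content`,
# which holds the raw fetched bytes (the unparsed HTML).
SKIP_ALL_BUT_CONTENT = [
    "-nofetch",
    "-nogenerate",
    "-noparse",
    "-noparsedata",
    "-noparsetext",
]


def build_readseg_dump_cmd(segment_dir, output_dir, nutch_bin="bin/nutch"):
    """Return the argv list for dumping only the raw content of one segment.

    `nutch_bin` is an assumed path to the Nutch launcher script; adjust it
    to match the local installation.
    """
    return [
        str(nutch_bin),
        "readseg",
        "-dump",
        str(segment_dir),
        str(output_dir),
        *SKIP_ALL_BUT_CONTENT,
    ]


def dump_raw_content(segment_dir, output_dir, nutch_bin="bin/nutch"):
    """Run the dump; raises CalledProcessError if Nutch exits with an error."""
    subprocess.run(
        build_readseg_dump_cmd(segment_dir, output_dir, nutch_bin),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical segment path; this only prints the command, it does not run Nutch.
    print(" ".join(build_readseg_dump_cmd("crawl/segments/SEGMENT_NAME", "segment_dump")))
```

The dump is plain text, so it can be fed to a custom parser or analyzer outside Nutch. For processing inside the crawl itself, a parser plugin is the more natural route.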

Re: html of the crawled pages.

2011-07-10 Thread Markus Jelsma
Yes. You can build a plugin that implements a parser. Check the wiki [1] to get started. If you intend to write a parser for an exotic MIME type, consider contributing to Apache Tika. What exactly are you trying to accomplish? There may be an easier method. [1]: