Thank you very much!
On Thu, Feb 11, 2010 at 7:46 AM, Claudio Martella <claudio.marte...@tis.bz.it> wrote:
> it's already in.
>
> here's a snippet from my nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(text|html|pdf|mspowerpoint|msword|msexcel|oo)|language-identifier|urlfilter-regex|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> remember that nutch needs you to add manually a couple of jars to handle
> pdfs correctly. check out the README.txt.
>
> Kelly Vista wrote:
>> Thanks. I know this seems like it is any day, but does anyone have
>> more details on exactly when this will happen? I saw some traffic on
>> the nutch-dev list about 0.6 of Tika possibly facilitating this, but I
>> don't know whether I should wait for things to just be there by
>> default or whether I should find a way to do it myself. I am sorry if
>> I come across as "complaining" about free software. :-) I'm not, I do
>> appreciate it.
>>
>> On Wed, Feb 10, 2010 at 7:30 PM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>>
>>> On Feb 10, 2010, at 4:25pm, Kelly Vista wrote:
>>>
>>>> It seems like using Tika as a plug-in to Nutch for processing various
>>>> non HTML formats is somewhat bleeding-edge. Can someone point me (or
>>>> tell me) how I can simply use Tika in Nutch to crawl and index MS
>>>> Office or PDF docs? Or is it now in there by default?
>>>
>>> Should be there by default, once the Tika plug-in gets rolled in.
>>>
>>> -- Ken
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> http://bixolabs.com
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
> TIS innovation park
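
A side note on the plugin.includes value quoted above: its description says that any plugin directory whose name does not match the expression is excluded. The sketch below is one rough way to check a directory name against that value outside of Nutch. It is written in Python rather than taken from Nutch itself, and it assumes the whole directory name has to match the expression, which the thread does not confirm. The helper name is_included and the sample name "parse-rss" are made up for illustration.

```python
import re

# Value copied from the plugin.includes property quoted in the thread.
PLUGIN_INCLUDES = (
    "protocol-http|parse-(text|html|pdf|mspowerpoint|msword|msexcel|oo)"
    "|language-identifier|urlfilter-regex|index-(basic|anchor)"
    "|query-(basic|site|url)|response-(json|xml)|summary-basic"
    "|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_dir: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the directory name matches the expression.

    Assumption (not stated in the thread): the entire name must match,
    so re.fullmatch is used instead of a prefix or substring search.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # "parse-rss" is an illustrative name that the quoted value does not list.
    for name in ["parse-pdf", "parse-msword", "protocol-http", "parse-rss"]:
        status = "included" if is_included(name) else "excluded"
        print(f"{name}: {status}")
```

Under that assumption, a plugin is switched on by adding its suffix to the matching alternation group (as pdf, msword and the others are added to the parse-(...) group above) rather than by listing it in a separate property.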