Thank you very much!

On Thu, Feb 11, 2010 at 7:46 AM, Claudio Martella
<claudio.marte...@tis.bz.it> wrote:
> it's already in.
>
> here's a snippet from my nutch-site.xml:
>
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-http|parse-(text|html|pdf|mspowerpoint|msword|msexcel|oo)|language-identifier|urlfilter-regex|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.  By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins.
>  </description>
> </property>
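
For anyone checking their own setup against the snippet above, here is a minimal sketch of how a plugin.includes value can be tested against plugin directory names. It assumes the expression must match the whole directory name, which is my reading of the description text, not something confirmed from the Nutch source. The helper name is_included and the sample names parse-tika and parse-rss are only illustrations.

```python
import re

# The plugin.includes value from the nutch-site.xml snippet above.
PLUGIN_INCLUDES = (
    "protocol-http"
    "|parse-(text|html|pdf|mspowerpoint|msword|msexcel|oo)"
    "|language-identifier"
    "|urlfilter-regex"
    "|index-(basic|anchor)"
    "|query-(basic|site|url)"
    "|response-(json|xml)"
    "|summary-basic"
    "|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the directory name matches the whole expression.

    Assumption: matching is against the full name, so a bare "parse"
    would not count as a match.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # Sample names chosen for illustration only.
    for name in ("parse-pdf", "parse-msword", "parse-tika", "parse-rss"):
        status = "included" if is_included(name) else "excluded"
        print(f"{name}: {status}")
```

With this value the PDF and MS Office parsers are covered, while any directory name not listed in the expression would be excluded until it is added.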
>
>
> Remember that Nutch needs you to manually add a couple of jars to handle
> PDFs correctly. Check out the README.txt.
>
>
>
> Kelly Vista wrote:
>> Thanks.  I know it seems like this could happen any day now, but does
>> anyone have more details on exactly when?  I saw some traffic on
>> the nutch-dev list about 0.6 of Tika possibly facilitating this, but I
>> don't know whether I should wait for things to just be there by
>> default or whether I should find a way to do it myself.  I am sorry if
>> I come across as "complaining" about free software. :-)  I'm not, I do
>> appreciate it.
>>
>> On Wed, Feb 10, 2010 at 7:30 PM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>>
>>> On Feb 10, 2010, at 4:25pm, Kelly Vista wrote:
>>>
>>>
>>>> It seems like using Tika as a plug-in to Nutch for processing various
>>>> non-HTML formats is somewhat bleeding-edge.  Can someone point me (or
>>>> tell me) how I can simply use Tika in Nutch to crawl and index MS
>>>> Office or PDF docs?  Or is it now in there by default?
>>>>
>>> Should be there by default, once the Tika plug-in gets rolled in.
>>>
>>> -- Ken
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.marte...@tis.bz.it http://www.tis.bz.it
>
