I would just use Nutch and specify the -solr param on the command line. That 
will add the extracted content to your instance of Solr.
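
If you would rather stay with DIH, one option might be to declare one file-listing entity per directory. The following is only a sketch and has not been tested against your setup: the entity names (docsC, docsD, tikaC, tikaD), the forward-slash paths, and the "bin" data source definition are assumptions based on your quoted config.

```xml
<dataConfig>
  <!-- Assumed: a binary data source named "bin", as referenced in the quoted config -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- First directory; the entity name "docsC" is hypothetical -->
    <entity name="docsC"
            processor="FileListEntityProcessor"
            baseDir="C:/docs"
            fileName=".*\.(docx?|pdf|xlsx?|html|rtf|txt)$"
            recursive="false"
            rootEntity="true"
            onError="continue">
      <entity name="tikaC"
              processor="TikaEntityProcessor"
              url="${docsC.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>

    <!-- Second directory; same layout, different baseDir and entity names -->
    <entity name="docsD"
            processor="FileListEntityProcessor"
            baseDir="D:/docs"
            fileName=".*\.(docx?|pdf|xlsx?|html|rtf|txt)$"
            recursive="false"
            rootEntity="true"
            onError="continue">
      <entity name="tikaD"
              processor="TikaEntityProcessor"
              url="${docsD.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
```

Each top-level entity should be processed during a full-import, so both directories would end up in the same index. The duplication is the price of this approach; it gets unwieldy beyond a handful of directories.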

Adam

On Jan 25, 2011, at 5:29 AM, pankaj bhatt <panbh...@gmail.com> wrote:

> Hi All,
>         I need to index the documents present in my file system at various
> locations (e.g. C:\docs, D:\docs).
>    Is there any way I can specify this in my DIH
> configuration?
>    Here is my configuration:
> 
> <document>
>      <entity name="sd"
>        processor="FileListEntityProcessor"
>        fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
>        baseDir="G:\\Desktop\\"
>        recursive="false"
>        rootEntity="true"
>        transformer="DateFormatTransformer"
>        onError="continue">
>        <entity name="tikatest"
>          processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>          url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>          <field column="Author" name="author" meta="true"/>
>          <field column="Content-Type" name="title" meta="true"/>
>          <!-- field column="title" name="title" meta="true"/ -->
>          <field column="text" name="all_text"/>
>        </entity>
> 
>        <!-- field column="fileLastModified" name="date"
>          dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" / -->
>        <field column="fileSize" name="size"/>
>        <field column="file" name="filename"/>
>    </entity>
> <!--baseDir="../site"-->
>  </document>
> 
> / Pankaj Bhatt.
