Thanks Adam. It seems like Nutch would solve most of my concerns.
It would be great if you could share some Nutch resources with us.

/ Pankaj Bhatt.

On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> I would just use Nutch and specify the -solr param on the command line.
> That will add the extracted content to your instance of Solr.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 25, 2011, at 5:29 AM, pankaj bhatt <panbh...@gmail.com> wrote:
>
> > Hi All,
> >    I need to index the documents present in my file system at various
> > locations (e.g. C:\docs, D:\docs).
> >    Is there any way I can specify this in my DIH configuration?
> >    Here is my configuration:
> >
> > <document>
> >      <entity name="sd"
> >        processor="FileListEntityProcessor"
> >        fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
> >        baseDir="G:\\Desktop\\"
> >        recursive="false"
> >        rootEntity="true"
> >        transformer="DateFormatTransformer"
> >        onError="continue">
> >        <entity name="tikatest"
> >          processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
> >          url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
> >          <field column="Author" name="author" meta="true"/>
> >          <field column="Content-Type" name="title" meta="true"/>
> >          <!-- field column="title" name="title" meta="true"/ -->
> >          <field column="text" name="all_text"/>
> >        </entity>
> >
> >        <!-- field column="fileLastModified" name="date"
> >             dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" / -->
> >        <field column="fileSize" name="size"/>
> >        <field column="file" name="filename"/>
> >    </entity>
> > <!--baseDir="../site"-->
> >  </document>
> >
> > / Pankaj Bhatt.
>
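
On the original question about several base directories: one possible approach within DIH itself, offered only as a sketch, is to declare one FileListEntityProcessor entity per directory inside the same <document>. The entity names (docsC, docsD, tikaC, tikaD) are made up for illustration, the paths come from the example locations in the question and follow the escaping used in the posted config, and the "bin" data source is assumed to be a BinFileDataSource, since its definition was not included in the post.

```xml
<dataConfig>
  <!-- Assumption: "bin" is a BinFileDataSource; the original post does not show it. -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- First location; entity names here are hypothetical. -->
    <entity name="docsC"
            processor="FileListEntityProcessor"
            baseDir="C:\\docs"
            fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
            recursive="false"
            rootEntity="true"
            onError="continue">
      <entity name="tikaC"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${docsC.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>

    <!-- Second location; same settings, different baseDir. -->
    <entity name="docsD"
            processor="FileListEntityProcessor"
            baseDir="D:\\docs"
            fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
            recursive="false"
            rootEntity="true"
            onError="continue">
      <entity name="tikaD"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${docsD.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="all_text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
```

A full-import should then walk both entities in turn. If files with the same name can exist in both locations, the unique key would probably need to be based on the absolute path rather than the file name.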
