I would just use Nutch and specify the -solr param on the command line. That will add the extracted content to your Solr instance.
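
Roughly something like the following, as a sketch only. It assumes a Nutch 1.x install whose crawl command accepts -solr, a Solr instance at the default local URL, and a seed file name (urls/seed.txt) that I made up for illustration. The two directories are the ones from your mail. Nutch also has to be allowed to fetch file: URLs, which as far as I know means enabling the protocol-file plugin in plugin.includes and relaxing the default URL filter that skips file: links.

```shell
# Hypothetical seed list: one file: URL per directory you want indexed
mkdir -p urls
printf '%s\n' 'file:///C:/docs/' 'file:///D:/docs/' > urls/seed.txt

# Crawl the seeds and push the parsed content to Solr.
# The Solr URL, depth and topN values are assumptions; adjust them to your setup.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 1000
```

The depth only needs to be as large as the deepest folder nesting you want followed.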

Adam

Sent from my iPhone

On Jan 25, 2011, at 5:29 AM, pankaj bhatt <panbh...@gmail.com> wrote:

> Hi All,
> I need to index the documents presents in my file system at various
> locations (e.g. C:\docs , d:\docs ).
> Is there any way through which i can specify this in my DIH
> Configuration.
> Here is my configuration:-
>
> <document>
>   <entity name="sd"
>           processor="FileListEntityProcessor"
>           fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
>           *baseDir="G:\\Desktop\\"*
>           recursive="false"
>           rootEntity="true"
>           transformer="DateFormatTransformer"
>           onerror="continue">
>     <entity name="tikatest"
>             processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>             url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>       <field column="Author" name="author" meta="true"/>
>       <field column="Content-Type" name="title" meta="true"/>
>       <!-- field column="title" name="title" meta="true"/ -->
>       <field column="text" name="all_text"/>
>     </entity>
>
>     <!-- field column="fileLastModified" name="date"
>          dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" / -->
>     <field column="fileSize" name="size"/>
>     <field column="file" name="filename"/>
>   </entity>
>   <!--baseDir="../site"-->
> </document>
>
> / Pankaj Bhatt.