Yes, I understand the reasons for putting this work on a client rather than using Solr directly, and that may well be my next task. But first I need to finish my current task: indexing PDF files stored in a SqlBase database. The PDF files are pretty simple, sometimes only a few dozen lines of text.

Regards,
Aruna

On Wed, Apr 3, 2019 at 5:03 PM Erick Erickson <erickerick...@gmail.com> wrote:

> For a lot of reasons, I greatly prefer to put this work on a client rather
> than use Solr directly. Here’s a place to get started: it connects to a DB
> and also scans a local file directory for docs to push through (local) Tika
> and index. So you should be able to modify it relatively easily to get the
> data from SqlBase, read the associated PDF, combine the two and send to
> Solr.
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> The code itself is a bit old, but illustrates the process.
>
> Best,
> Erick
>
> > On Apr 2, 2019, at 11:46 PM, Arunas Spurga <arunas2...@gmail.com> wrote:
> >
> > Hello,
> >
> > I got a task to index in Solr 7.7.1 PDF files which are stored in a
> > SqlBase database. I did half the job: I can index all the table fields
> > and I can search in these fields, except for the field in which the PDF
> > file content is stored. As I am totally new to Solr, I have spent a lot
> > of time trying unsuccessfully to understand how to force it to extract
> > and index the field with the PDF content. I need help.
> >
> > Regards,
> >
> > Aruna
> >
> > In solrconfig.xml I have:
> >
> > <lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib" regex=".*\.jar" />
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
> > <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
> >
> > <requestHandler name="/update/extract"
> >                 startup="lazy"
> >                 class="solr.extraction.ExtractingRequestHandler">
> >   <lst name="defaults">
> >     <str name="lowernames">true</str>
> >     <str name="fmap.meta">ignored_</str>
> >     <str name="fmap.content">_text_</str>
> >   </lst>
> > </requestHandler>
> >
> > <requestHandler name="/dataimport"
> >                 class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">db-data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> > ------------------------------------------------------------
> > db-data-config.xml
> >
> > <dataConfig>
> >   <dataSource type="JdbcDataSource"
> >               driver="jdbc.unify.sqlbase.SqlbaseDriver"
> >               url="jdbc:sqlbase://localhost:2155/PDFDOCS"
> >               user="sysadm" password="sysadm" />
> >   <document>
> >     <entity name="PDFDOCUMENTS" query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
> >       <field column="ID" name="idx" />
> >       <field column="PDOCUMENT" name="PDF" />
> >       <field column="UNIT" name="division" />
> >     </entity>
> >   </document>
> > </dataConfig>
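
One possible way to extract the PDF content with the Data Import Handler itself, rather than with a client, is to nest a TikaEntityProcessor entity that reads the PDOCUMENT column through a FieldStreamDataSource. The following is a minimal sketch of db-data-config.xml along those lines, not a tested configuration. The data source names "db" and "fieldStream" and the inner entity name "pdf" are made up for illustration. It assumes PDOCUMENT is a BLOB column that the SqlBase JDBC driver returns as binary data, and that the solr-dataimporthandler-extras jar (which contains TikaEntityProcessor) and the extraction libraries are loaded, as the <lib> lines in the quoted solrconfig.xml appear to arrange.

```xml
<dataConfig>
  <!-- JDBC connection, unchanged from the original config apart from the added name -->
  <dataSource name="db"
              type="JdbcDataSource"
              driver="jdbc.unify.sqlbase.SqlbaseDriver"
              url="jdbc:sqlbase://localhost:2155/PDFDOCS"
              user="sysadm" password="sysadm" />

  <!-- Assumption: exposes a binary column of the parent row as a stream for Tika -->
  <dataSource name="fieldStream" type="FieldStreamDataSource" />

  <document>
    <entity name="PDFDOCUMENTS" dataSource="db"
            query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
      <field column="ID" name="idx" />
      <field column="UNIT" name="division" />

      <!-- Nested entity: Tika parses the BLOB from the parent row.
           "text" is the column TikaEntityProcessor emits for the extracted body. -->
      <entity name="pdf"
              processor="TikaEntityProcessor"
              dataSource="fieldStream"
              dataField="PDFDOCUMENTS.PDOCUMENT"
              format="text">
        <field column="text" name="PDF" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

After a full-import, a query against the PDF field should show whether the extracted text arrived. If the driver returns the column as something other than a Blob or a byte array, this approach would likely need adjusting, and the client-side route suggested earlier in the thread may be the simpler option.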