I am using Solr 3.3 and I am trying to extract and index metadata from PDF files. I am using the DataImportHandler with the TikaEntityProcessor to add the documents. Here are the fields as defined in my schema.xml file:

    <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
    <field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
    <field name="imgName" type="string" indexed="false" stored="true" multiValued="false" required="false"/>
    <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>

So I assume the remaining metadata should be indexed and stored in fields prefixed with "attr_".

Here is my data config file. It reads a source directory path from a database and passes it to a FileListEntityProcessor, which in turn passes each PDF file found in that directory to the TikaEntityProcessor to extract and index the content.

    <entity onError="skip" name="fileSourcePaths" rootEntity="false"
            dataSource="dbSource" fileName=".*pdf"
            query="select path from file_sources">
      <entity name="fileSource" processor="FileListEntityProcessor"
              transformer="ThumbnailTransformer"
              baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
        <field name="link" column="fileAbsolutePath" thumbnail="true"/>
        <field name="imgName" column="imgName"/>
        <entity rootEntity="true" onError="abort" name="file"
                processor="TikaEntityProcessor"
                url="${fileSource.fileAbsolutePath}"
                dataSource="fileSource" format="text">
          <field column="resourceName" name="title" meta="true"/>
          <field column="Creation-Date" name="date_published" meta="true"/>
          <field column="text" name="description"/>
        </entity>
      </entity>
    </entity>

It extracts the description and Creation-Date just fine, but it does not seem to extract resourceName, so the documents have no title field when I query the index. This is odd because both Creation-Date and resourceName are metadata.

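One explanation I have not been able to rule out is that the TikaEntityProcessor only emits metadata keys that are declared explicitly with meta="true", rather than routing unmapped keys to a prefix the way the ExtractingRequestHandler's uprefix parameter does. If that assumption holds, each attribute would need its own mapping inside the Tika entity, roughly as sketched below. The attr_producer and attr_creator field names are hypothetical; they are chosen only so that they match the attr_* dynamic field in my schema.

```xml
<!-- Sketch only: assumes Tika metadata must be mapped field by field. -->
<entity rootEntity="true" onError="abort" name="file"
        processor="TikaEntityProcessor"
        url="${fileSource.fileAbsolutePath}"
        dataSource="fileSource" format="text">
  <field column="Creation-Date" name="date_published" meta="true"/>
  <field column="text" name="description"/>
  <!-- Hypothetical target names, picked to match the attr_* dynamicField. -->
  <field column="producer" name="attr_producer" meta="true"/>
  <field column="creator" name="attr_creator" meta="true"/>
</entity>
```

I have not verified this against the 3.3 source, so it is a guess to test rather than a confirmed fix.
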
Also, none of the other available metadata was stored under the attr_ fields. I came across some threads that said there are known problems with Tika 0.8, so I downloaded Tika 0.9 and replaced 0.8 with it. I also upgraded pdfbox, jempbox and fontbox from 1.3 to 1.4.

I tested one of the PDFs separately with just Tika to see what metadata is stored with the file. This is what I found:

    Content-Length: 546459
    Content-Type: application/pdf
    Creation-Date: 2010-06-09T12:11:12Z
    Last-Modified: 2010-06-09T14:53:38Z
    created: Wed Jun 09 08:11:12 EDT 2010
    creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
    producer: Antenna House PDF Output Library 2.6.0 (Windows)
    resourceName: Argentina.pdf
    trapped: False
    xmpTPg:NPages: 2

As you can see, the file does have resourceName metadata. I tried indexing again but got the same result: Creation-Date is extracted and indexed just fine, but resourceName is not. The rest of the attributes are still not indexed under the attr_ fields either.

What's going wrong?

--
View this message in context: http://lucene.472066.n3.nabble.com/Error-with-Extracting-PDF-metadata-tp3210813p3210813.html
Sent from the Solr - User mailing list archive at Nabble.com.