Hi Folks,
I'm new to Jackrabbit and I'm trying out full-text search using Jackrabbit
2.6.0 (with Tika 1.3). I have a custom node type that lets me store some
custom properties and multiple HTML files (stored as binary). I have the
following configuration:
*workspace.xml:*
<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
  <!--
    virtual file system of the workspace:
    class: FQN of class implementing the FileSystem interface
  -->
  <FileSystem class="org.apache.jackrabbit.core.fs.db.OracleFileSystem">
    <param name="dataSourceName" value="ds1"/>
    <param name="schemaObjectPrefix" value="fs_${wsp.name}_"/>
  </FileSystem>
  <!--
    persistence manager of the workspace:
    class: FQN of class implementing the PersistenceManager interface
  -->
  <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.pool.OraclePersistenceManager">
    <param name="dataSourceName" value="ds1"/>
    <param name="schemaObjectPrefix" value="pm_${wsp.name}_"/>
  </PersistenceManager>
  <!--
    Search index and the file system it uses.
    class: FQN of class implementing the QueryHandler interface
  -->
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="analyzer"
           value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <param name="queryClass"
           value="org.apache.jackrabbit.core.query.QueryImpl"/>
    <param name="excerptProviderClass"
           value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
    <param name="supportHighlighting" value="true"/>
    <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
  </SearchIndex>
</Workspace>
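The SearchIndex element above does not set an indexing configuration. For reference, here is a minimal sketch of one, not a verified fix for this setup. Jackrabbit's indexing configuration supports aggregate rules, and an aggregate on nt:file that includes jcr:content is meant to make text extracted from the child node's jcr:data searchable through the nt:file node itself. That is the node that CONTAINS(file.*, ...) looks at. The file name indexing_configuration.xml and its location under ${wsp.home} are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.2.dtd">
<!--
  Hypothetical file: ${wsp.home}/indexing_configuration.xml
  It would be referenced from workspace.xml by one extra SearchIndex parameter:
    <param name="indexingConfiguration"
           value="${wsp.home}/indexing_configuration.xml"/>
-->
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- Fold the full-text content of jcr:content into its parent nt:file node -->
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>
```

A change like this would presumably only take effect once the workspace is re-indexed, for example by removing the index folder again.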
*tika-config.xml:*
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"
                      magic="false"/>
  <parsers>
    <parser name="parse-html"
            class="org.apache.tika.parser.html.HtmlParser">
      <mime>text/html</mime>
      <mime>application/xhtml+xml</mime>
      <mime>application/x-asp</mime>
    </parser>
  </parsers>
</properties>
*JCR-SQL2 queries tested:*
1) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This')
2) SELECT * FROM [nt:file] as file WHERE CONTAINS(file.*, 'This*')
3)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This')
4)
SELECT file.*, resource.* FROM [nt:file] AS file
INNER JOIN [nt:resource] AS resource ON ISCHILDNODE(resource, file)
WHERE resource.[jcr:mimeType] = 'text/html'
AND CONTAINS(file.*, 'This*')
*Result:*
None of these queries returns any rows. If I remove the CONTAINS() clause,
all four queries return rows, and for queries #3 and #4 I can see in the
log dump of the result that the field resource.[jcr:data] contains the text
I am searching for ("This"). I've also tried deleting the index folder so
that the repository gets re-indexed, but full-text search still returns
nothing.
What am I missing? In addition, is there any documentation on how to
configure Tika (tika-config.xml)?
Thanks and Regards,
Orlando