Hello Drillers,

I have been working on a Lucene format plugin. In its current state, the
sample query below successfully searches a Lucene index and returns the
results.

select path from dfs_test.`/search-index` where contents='maxItemsPerBlock'
and contents = 'BlockTreeTermsIndex'



*High Level Overview of Current Implementation:*

*Parallelization:* A Lucene segment is the lowest level of parallelization.
*Filter Pushdown:* Currently the format plugin is designed to push the
complete filter into the scan.
*Filter Evaluation:* Each condition in the filter is treated as a Lucene
TermQuery
<http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/TermQuery.html>
and multiple conditions are joined using a BooleanQuery
<http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/BooleanQuery.html>.
If we *do not* use a TermQuery, then we have to know the exact type of
Analyzer
<https://lucene.apache.org/core/5_2_1/core/org/apache/lucene/analysis/Analyzer.html>
to use with each field in the query.
    Ex: 'contents' field might have been analyzed using a StandardAnalyzer
<https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html>
and the 'path' field might not have been analyzed at all.
If desired, support for raw Lucene queries through a reserved word should
be easy to add.
    Ex: select * from dfs.`search-index` where searchQuery =
        "+contents:maxItemsPerBlock +path:/home/file.txt";
*Converting SqlFilter to Lucene Query:* Currently only the "=" and "!="
operators are handled when converting a SQL filter into a Lucene query.
For indexed fields this may be sufficient to cover a good number of
cases. For non-indexed fields, operators such as ">", "<", and "like"
still need to be handled.
*FileSystems:* Currently the format plugin only works on a local filesystem.
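
As a rough illustration of the filter conversion described above, here is
a minimal sketch that renders "=" and "!=" conditions in Lucene's classic
query syntax ("+" for a required term, "-" for a prohibited one). It is
not the plugin's code: the LuceneFilterSketch and Condition names are
hypothetical, it assumes all conditions are joined with AND, and it builds
a plain string rather than TermQuery/BooleanQuery objects so that it runs
without the Lucene jars.

```java
import java.util.ArrayList;
import java.util.List;

public class LuceneFilterSketch {

    /** One SQL comparison such as contents = 'maxItemsPerBlock' (hypothetical type). */
    static final class Condition {
        final String field;
        final String op;    // only "=" and "!=" are handled in this sketch
        final String value;

        Condition(String field, String op, String value) {
            this.field = field;
            this.op = op;
            this.value = value;
        }
    }

    /**
     * Renders AND-ed conditions as a Lucene classic-syntax query string.
     * Assumption: values are used verbatim as single terms; no analysis
     * and no escaping of query-syntax characters is attempted.
     */
    static String toLuceneQuery(List<Condition> conditions) {
        StringBuilder sb = new StringBuilder();
        for (Condition c : conditions) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            switch (c.op) {
                case "=":
                    sb.append('+');   // required clause
                    break;
                case "!=":
                    sb.append('-');   // prohibited clause
                    break;
                default:
                    // Anything else is left for the engine to evaluate after the scan.
                    throw new UnsupportedOperationException("Operator not pushed down: " + c.op);
            }
            sb.append(c.field).append(':').append(c.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Condition> filter = new ArrayList<>();
        filter.add(new Condition("contents", "=", "maxItemsPerBlock"));
        filter.add(new Condition("contents", "=", "BlockTreeTermsIndex"));
        // prints: +contents:maxItemsPerBlock +contents:BlockTreeTermsIndex
        System.out.println(toLuceneQuery(filter));
    }
}
```

One caveat with this mapping: a Lucene boolean query made up only of
prohibited clauses matches no documents, so a filter containing nothing
but "!=" conditions would need an extra match-all clause or a fallback to
ordinary filtering.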


Though it is far from complete, I want to work with the community to get
some feedback and avoid any duplication of work. Kindly let me know your
thoughts.

- Rahul
