Re: Lucene Format Plugin

rahul challapalli Sat, 29 Aug 2015 03:07:58 -0700

Stefan,

I rebased my branch on top of latest master. Let me know if you hit any
issues.


- Rahul

On Wed, Aug 26, 2015 at 11:46 AM, rahul challapalli <
[email protected]> wrote:

> Stefan,
>
> I have some changes to push. I will push them and also rebase the branch
> on top of the latest master. I will do it sometime tomorrow.
>
> - Rahul
>
> On Tue, Aug 25, 2015 at 11:49 PM, Stefán Baxter <[email protected]
> > wrote:
>
>> Hi Rahul,
>>
>> I will start working on this later this week and over the weekend. I'm
>> not sure how long it will take me to become productive but hopefully I will
>> be able to share something soon.
>>
>> I will fork your repo on github. Can you please make sure it's up to date
>> with master?
>> I'm assuming that it runs in current state so I can get straight to work
>> :).
>>
>> Best regards,
>>  -Stefan
>>
>> On Sun, Aug 23, 2015 at 1:28 AM, rahul challapalli <
>> [email protected]> wrote:
>>
>>> Hi Stefan,
>>>
>>> I was not able to make any further progress on this. Below is a
>>> high-level list of things still to do:
>>>
>>> 1. Clean up LuceneScanSpec: The current implementation serializes a lot
>>> of low-level state in order to serialize/de-serialize Lucene's
>>> SegmentReader. This has to change; otherwise the plugin is tightly
>>> coupled to Lucene's implementation details.
>>> 2. Serialization of Lucene Query object
>>> 3. Convert the SQL filter into a Lucene Query object: I just started on
>>> this and made it work in the simplest case. You can take a look at it here:
>>>
>>> https://github.com/rchallapalli/drill/blob/lucene/contrib/format-lucene/src/main/java/org/apache/drill/exec/planner/logical/SqlFilterToLuceneQuery.java
>>>
>>>     As part of the ElasticSearch storage plugin, Andrew has converted
>>> the SQL filter to an Elastic Search query, and it looks like he handled
>>> many cases. We can leverage this for the Lucene format plugin. Below is
>>> his code:
>>>
>>> https://github.com/aleph-zero/drill/blob/elastic/contrib/storage-elasticsearch/src/main/java/org/apache/drill/exec/store/elasticsearch/rules/PredicateAnalyzer.java
>>> 4. Currently the Lucene format plugin does not work on HDFS/MapR-FS. This
>>> should be handled.
>>> 5. Pushing Agg functions and Limits into the scan. (This will be an
>>> improvement)
>>> 6. Testing
>>>
>>> I want to work on (1) sometime next week.
>>>
>>> - Rahul
>>>
>>>
>>> On Sat, Aug 22, 2015 at 12:00 AM, Stefán Baxter <
>>> [email protected]> wrote:
>>>
>>>> Hi Rahul,
>>>>
>>>> Can you elaborate a bit on the status of the Lucene plugin and what
>>>> needs to be done before using it?
>>>>
>>>> Also let me know if there are specific things that need improving. We
>>>> want to try using it in our project and perhaps we can contribute
>>>> something meaningful.
>>>>
>>>> Regards,
>>>>  -Stefan
>>>>
>>>>
>>>>
>>>> On Mon, Aug 10, 2015 at 5:01 AM, Sudip Mukherjee <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Rahul,
>>>>>
>>>>> Thanks for sharing your code. I have been trying to build a plugin for
>>>>> the Solr engine, but I planned to use Solr's REST API to run the queries,
>>>>> get schema metadata, etc.
>>>>> The goal for me is to expose a Solr engine to tools like Tableau or
>>>>> MS Excel so that users can work with the data there.
>>>>>
>>>>> I am still very new to this and there is a learning curve. It would be
>>>>> great if you can comment/review whatever I've done so far.
>>>>>
>>>>>
>>>>> https://github.com/sudipmukherjee/drill/tree/master/contrib/storage-solr
>>>>>
>>>>> Thanks,
>>>>> Sudip
>>>>>
>>>>> -----Original Message-----
>>>>> From: rahul challapalli [mailto:[email protected]]
>>>>> Sent: 10 August 2015 AM 05:21
>>>>> To: [email protected]
>>>>> Subject: Re: Lucene Format Plugin
>>>>>
>>>>> Below is the link to my branch which contains the changes related to
>>>>> the format plugin.
>>>>>
>>>>> https://github.com/rchallapalli/drill/tree/lucene/contrib/format-lucene
>>>>>
>>>>> Any thoughts on how to handle contributions like this which still have
>>>>> some work to be done?
>>>>>
>>>>> - Rahul
>>>>>
>>>>>
>>>>> On Mon, Aug 3, 2015 at 12:21 PM, rahul challapalli <
>>>>> [email protected]> wrote:
>>>>>
>>>>> > Thanks Jason.
>>>>> >
>>>>> > I want to look at the solr plugin and see where we can collaborate or
>>>>> > if we already duplicated part of the effort.
>>>>> >
>>>>> > I still need to push a few commits. I will share the code once I get
>>>>> > these changes pushed.
>>>>> >
>>>>> > - Rahul
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Mon, Aug 3, 2015 at 11:31 AM, Jason Altekruse
>>>>> > <[email protected]
>>>>> > > wrote:
>>>>> >
>>>>> >> Hey Rahul,
>>>>> >>
>>>>> >> This is really cool! Thanks for all of the time you put into writing
>>>>> >> this. I think we have a lot of opportunities to reach new
>>>>> >> communities with efforts like this.
>>>>> >>
>>>>> >> I noticed last week that another contributor opened a JIRA for a Solr
>>>>> >> plugin. There might be a good opportunity for the two of you to join
>>>>> >> efforts, as I believe he has likely started working on a Lucene reader
>>>>> >> as part of his Solr work.
>>>>> >>
>>>>> >> Would you like to post a link to your work on Github or another
>>>>> >> public host of your code?
>>>>> >>
>>>>> >> https://issues.apache.org/jira/browse/DRILL-3585
>>>>> >>
>>>>> >> On Mon, Aug 3, 2015 at 2:29 AM, Stefán Baxter
>>>>> >> <[email protected]>
>>>>> >> wrote:
>>>>> >>
>>>>> >> > Hi,
>>>>> >> >
>>>>> >> > I'm pretty new around here but I just wanted to tell you how much
>>>>> >> > your work can benefit us. This is great!
>>>>> >> >
>>>>> >> > Look forward to trying it out.
>>>>> >> >
>>>>> >> > Regards,
>>>>> >> >  -Stefán
>>>>> >> >
>>>>> >> > On Mon, Aug 3, 2015 at 8:38 AM, rahul challapalli <
>>>>> >> > [email protected]> wrote:
>>>>> >> >
>>>>> >> > > Hello Drillers,
>>>>> >> > >
>>>>> >> > > I have been working on a Lucene format plugin. In its current
>>>>> >> > > state, the sample query below successfully searches a Lucene index
>>>>> >> > > and returns the results.
>>>>> >> > >
>>>>> >> > > select path from dfs_test.`/search-index` where
>>>>> >> > > contents='maxItemsPerBlock' and contents = 'BlockTreeTermsIndex'
>>>>> >> > >
>>>>> >> > >
>>>>> >> > >
>>>>> >> > > *High Level Overview of Current Implementation:*
>>>>> >> > >
>>>>> >> > > *Parallelization:* A Lucene segment is the lowest level of
>>>>> >> > > parallelization.
>>>>> >> > > *Filter Pushdown:* Currently the format plugin is designed to
>>>>> >> > > push the complete filter into the scan.
>>>>> >> > > *Filter Evaluation:* Each condition in the filter is treated as a
>>>>> >> > > Lucene TermQuery
>>>>> >> > > <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/TermQuery.html>
>>>>> >> > > and multiple conditions are joined using a BooleanQuery
>>>>> >> > > <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/search/BooleanQuery.html>.
>>>>> >> > > If we *do not* use a TermQuery, then we have to know the exact
>>>>> >> > > type of Analyzer
>>>>> >> > > <https://lucene.apache.org/core/5_2_1/core/org/apache/lucene/analysis/Analyzer.html>
>>>>> >> > > to use with each field in the query.
>>>>> >> > >     Ex: The 'contents' field might have been analyzed using a
>>>>> >> > > StandardAnalyzer
>>>>> >> > > <https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html>
>>>>> >> > > and the 'path' field might not have been analyzed at all.
>>>>> >> > > If desired, support for raw Lucene queries with a reserved word
>>>>> >> > > should be easy to add.
>>>>> >> > >     Ex: select * from dfs.`search-index` where searchQuery =
>>>>> >> > > "+contents:maxItemsPerBlock +path:/home/file.txt";
>>>>> >> > > *Converting SqlFilter to Lucene Query:* Currently only the "=" and
>>>>> >> > > "!=" operators are handled while converting a SQL filter into a
>>>>> >> > > Lucene query. For indexed fields this might be sufficient to handle
>>>>> >> > > a good number of cases. For non-indexed fields, operators like ">",
>>>>> >> > > "<", "like", etc. need to be handled.
>>>>> >> > > *FileSystems:* Currently the format plugin only works on a local
>>>>> >> > > filesystem.
>>>>> >> > >
>>>>> >> > >
>>>>> >> > > Though this is far from complete, I want to work with the
>>>>> >> > > community to get some feedback and avoid any chance of duplication
>>>>> >> > > of work. Kindly let me know your thoughts.
>>>>> >> > >
>>>>> >> > > - Rahul
>>>>> >> > >
>>>>> >> >
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
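
The filter handling described in the thread (each "=" or "!=" condition becomes one term clause, the clauses are joined into a boolean query, and a raw query can be written as "+contents:maxItemsPerBlock +path:/home/file.txt") can be sketched roughly as below. This is a minimal, dependency-free illustration and not the plugin's actual code: the names FilterToLuceneQuerySketch, Condition, toLuceneQuery and escape are hypothetical, and the sketch emits a string in Lucene's classic query syntax rather than building real TermQuery/BooleanQuery objects. It assumes all conditions are ANDed, that "=" maps to a required (+) clause and "!=" to a prohibited (-) clause, and that each value is a single un-analyzed term.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterToLuceneQuerySketch {

    /** One SQL comparison, e.g. contents = 'maxItemsPerBlock' (hypothetical helper type). */
    static final class Condition {
        final String field;
        final String op;
        final String value;

        Condition(String field, String op, String value) {
            this.field = field;
            this.op = op;
            this.value = value;
        }
    }

    // Characters that carry special meaning in Lucene's classic query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escapes query-syntax characters so the value is read as a literal term. */
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Turns ANDed conditions into one query string; only "=" and "!=" are handled. */
    static String toLuceneQuery(List<Condition> conjuncts) {
        StringBuilder sb = new StringBuilder();
        for (Condition c : conjuncts) {
            String prefix;
            if ("=".equals(c.op)) {
                prefix = "+";          // required clause
            } else if ("!=".equals(c.op)) {
                prefix = "-";          // prohibited clause
            } else {
                throw new UnsupportedOperationException("Operator not handled: " + c.op);
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Field names are assumed to be plain identifiers and are not escaped.
            sb.append(prefix).append(c.field).append(':').append(escape(c.value));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Condition> filter = new ArrayList<>();
        filter.add(new Condition("contents", "=", "maxItemsPerBlock"));
        filter.add(new Condition("contents", "=", "BlockTreeTermsIndex"));
        // prints: +contents:maxItemsPerBlock +contents:BlockTreeTermsIndex
        System.out.println(toLuceneQuery(filter));
    }
}
```

Operators such as ">", "<" and "like" deliberately throw here, mirroring the limitation noted in the thread. A real implementation would also need to decide, per field, whether an analyzer has to be applied to the value before it is used as a term.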
