Hi,

Sure. Thanks a lot for the helpful pointers. I will take a look at the 
classes and create a plugin. If there are any gotchas or preferred ways of 
doing things in this plugin, please tell me so that I can take note.

It seems the plugin would be small, with just the parser/builder pair and a 
simple plugin class that uses the IndexQueryParserModule to register an 
XContentFilterParser.

I have not checked the Cache classes, and will take a look there as well.

I do have a follow-up question:
Is it possible for my filter implementation to be shard-aware? I mean, if 
during indexing there is a routing policy (I am not sure whether something 
like that exists) that directs documents to a particular shard based on some 
range, function, or hash (please let me know if there is such a mechanism, 
as I need to check that implementation as well), then at query time I would 
want the filter to be created only for the DocIds that correspond to the 
shard where the query will execute. This seems like a problem that is not 
unusual. 
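
[Editor's note: Elasticsearch does have such a mechanism - custom routing. A 
routing value supplied at index time is hashed to choose the shard, and the 
same value at search time restricts execution to that shard. A rough sketch 
against the 1.x Java API; index/type names and the routing value are made up:]

```java
// Sketch (ES 1.x Java API): supplying the same routing value at index and
// search time keeps both operations on one shard. Names here are illustrative.
IndexResponse indexed = client.prepareIndex("myindex", "mytype", "doc-1")
        .setSource("{\"field1\":\"A\"}")
        .setRouting("field1-A")            // hashed to pick the target shard
        .execute().actionGet();

SearchResponse found = client.prepareSearch("myindex")
        .setRouting("field1-A")            // query executes only on that shard
        .setQuery(QueryBuilders.matchAllQuery())
        .execute().actionGet();
```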

Please let me know if this is possible.

Thanks,
Sandeep


On Tuesday, 8 July 2014 02:01:35 UTC+5:30, Jörg Prante wrote:
>
> In Elasticsearch, you can extend the existing queries and filters from a 
> plugin, with the help of addQuery/addFilter on the IndexQueryParserModule.
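
A sketch of what such a registration might look like in a 1.x plugin (class 
and filter names are made up; the exact addFilter signature may vary by 
version):

```java
// Hypothetical plugin class registering a custom filter parser (ES 1.x).
public class BitsetFilterPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "bitset-filter";
    }

    @Override
    public String description() {
        return "Registers a custom bitset-backed filter";
    }

    // Elasticsearch calls onModule() for each module it wires up; adding the
    // parser here exposes the filter under the given name in the query DSL.
    public void onModule(IndexQueryParserModule module) {
        module.addFilter("bitset", BitsetFilterParser.class);
    }
}
```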
>
> Each query or filter comes in a pair of classes, a builder and a parser.
>
> A filter builder manages the syntax: the content serialization, with the 
> help of the XContent classes, for the inner/outer representation of the 
> filter specification.
>
> A filter parser parses such a structure and turns it into a Lucene Filter 
> for internal processing.
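
As a rough sketch of such a pair (all names are illustrative; signatures 
follow the 1.x FilterBuilder/FilterParser interfaces as I understand them, 
and the parser body is left as a placeholder):

```java
// Hypothetical builder: serializes the filter into its JSON (XContent) form.
public class BitsetFilterBuilder extends BaseFilterBuilder {
    private final String field;
    private final String value;

    public BitsetFilterBuilder(String field, String value) {
        this.field = field;
        this.value = value;
    }

    @Override
    protected void doXContent(XContentBuilder builder, ToXContent.Params params)
            throws IOException {
        builder.startObject("bitset");
        builder.field(field, value);
        builder.endObject();
    }
}

// Hypothetical parser: reads that structure back and produces a Lucene Filter.
public class BitsetFilterParser implements FilterParser {
    @Override
    public String[] names() {
        return new String[] { "bitset" };
    }

    @Override
    public Filter parse(QueryParseContext parseContext) throws IOException {
        XContentParser parser = parseContext.parser();
        // ... walk the tokens, read field/value, then build the Lucene Filter
        return null; // placeholder
    }
}
```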
>
> So one approach would be to look at your bit set implementation and see how 
> it can be turned into a Lucene Filter. An instructive example to start from 
> is org.elasticsearch.index.query.TermsFilterParser/TermsFilterBuilder
>
> An example where terms from fielddata cache are read and turned into a 
> filter is org.elasticsearch.index.search.FielddataTermsFilter
>
> A key part is the method 
>
> public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) 
> throws IOException
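
A minimal sketch of a Filter built around that method, assuming the 
segment-local doc ids come from some application-specific source (the 
docIdsFor helper is hypothetical):

```java
// Sketch (Lucene 4.x, as bundled with ES 1.x): expose a precomputed bit set
// as a Filter. FixedBitSet extends DocIdSet in Lucene 4.
public class BitsetFilter extends Filter {

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException {
        FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
        for (int docId : docIdsFor(context)) { // hypothetical lookup
            bits.set(docId);                   // mark matching segment-local ids
        }
        // Respect acceptDocs (e.g. deleted documents) per Lucene convention.
        return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
    }

    // Placeholder: in practice these ids would come from a warmed cache.
    private int[] docIdsFor(AtomicReaderContext context) {
        return new int[0];
    }
}
```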
>
> An example for caching filters 
> is org.elasticsearch.indices.cache.filter.terms.IndicesTermsFilterCache
> (the caching of filters in ES is done with Guava's cache classes)
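
ES itself uses Guava for this; as a self-contained stand-in for the idea 
(plain modern JDK, hypothetical key scheme), the cache maps a filter key to 
its bit set and computes it only once:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// JDK stand-in for the idea behind IndicesTermsFilterCache: build a filter's
// bit set once per key (e.g. "field=value") and reuse it on later lookups.
public class BitsetCache {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();

    // loader builds the bit set on a cache miss, e.g. from an external DB.
    public BitSet get(String key, Function<String, BitSet> loader) {
        return cache.computeIfAbsent(key, loader);
    }
}
```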
>
> Also, it could be helpful to study helper classes in this context like in 
> package org.elasticsearch.common.lucene.docset
>
> I am not aware of a filter plugin yet but it is possible that I could 
> sketch a demo filter plugin source code on github.
>
> Jörg
>
>
>
>
> On Mon, Jul 7, 2014 at 3:49 PM, Sandeep Ramesh Khanzode <
> k.san...@gmail.com> wrote:
>
>> Hi,
>>
>> A little clarification:
>>
>> Assume a sample data set of 50M documents. The documents need to be 
>> filtered by a field, Field1. However, at indexing time, this field is NOT 
>> written to the document in Lucene through ES. Field1 is a frequently 
>> changing field, and hence we would like to maintain it outside.
>>
>> (This following paragraph can be skipped.)
>> Now assume that there are a few such fields, Field1, ..., FieldN. For 
>> every document in the corpus, the value for Field1 may come from a pool of 
>> 100-odd values. Thus, for example, at most, Field1 can hold 1M documents 
>> that correspond to one of the 100-odd values, and at the tail end may 
>> correspond to as few as 10 documents.  
>>
>>
>> (Continue reading) :-)
>> I would, at system startup time, make sure that I have loaded all 
>> relevant BitSets that I plan to use for any Filters in memory, so that my 
>> cache framework is warm and I can lookup the relevant filter values for a 
>> particular query from this cache at query run time. The mechanisms for this 
>> loading are still unknown, but please assume that this BitSet will be 
>> available readily during query time. 
>>
>> This BitSet will correspond to the DocIDs in Lucene for a particular 
>> value of Field1 that I want to filter. I plan to create a Filter class 
>> overridden in Lucene that will accept this DocIdSet.
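
The intended mechanics can be shown with java.util.BitSet alone (doc ids are 
made up): keep one bit set per Field1 value, and intersect it with whatever 
the query matched:

```java
import java.util.BitSet;

// Demo of the filtering idea: a per-value bit set of allowed doc ids is
// intersected with the query's candidate doc ids.
public class BitsetFilterDemo {
    public static BitSet filter(BitSet queryHits, BitSet allowedDocs) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(allowedDocs);  // keep only doc ids present in both sets
        return result;
    }
}
```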
>>
>> What I am unable to understand is how I can achieve this in ES. I have 
>> been exploring the different mail threads on this forum, and it seems 
>> that certain plugins can achieve this. Please see the list below of what I 
>> could find on this forum.
>>
>> Can you please tell me how an IndexQueryParserModule will serve my use 
>> case? If you can provide some pointers on writing a plugin that can 
>> leverage a CustomFilter, that will be immensely helpful. Thanks, 
>>
>> 1. 
>> https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
>> 2. https://groups.google.com/forum/#!topic/elasticsearch/1jiHl4kngJo
>> 3. https://github.com/elasticsearch/elasticsearch/issues/208
>> 4. 
>> http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html
>>
>> Thanks,
>> Sandeep
>>
>> On Mon, Jul 7, 2014 at 2:17 AM, joerg...@gmail.com <
>> joerg...@gmail.com> wrote:
>>
>>> Thanks for being so patient with me :)
>>>
>>> I understand now the following: there are 50M documents in an external 
>>> DB, from which up to 1M are to be exported in the form of document 
>>> identifiers to work as a filter in ES. The idea is to use internal 
>>> mechanisms like bit sets. There is no API for manipulating filters in ES 
>>> on that level; ES receives the terms and passes them into the Lucene 
>>> TermsFilter class according to the type of the filter.
>>>
>>> What is a bit unclear to me: how is the filter set constructed? I assume 
>>> it should be a select statement on the database?
>>>
>>> Next, if you have this large set of document identifiers selected, I do 
>>> not understand what the base query is that you want to apply the filter 
>>> to. Is there a user-given query for ES? What does such a query look like? 
>>> Is it assumed there are other documents in ES that are somehow related to 
>>> the 50M documents? An illustrative example of the steps in the scenario 
>>> would really help in understanding the data model.
>>>
>>> Just some food for thought: it is close to impossible to filter in ES on 
>>> 1M unique terms in a single step - the default maximum clause count of a 
>>> Lucene query is, for good reason, limited to 1024 terms. A workaround 
>>> would be to iterate over the 1M terms, execute about 1000 filter queries, 
>>> and add up the results. This takes a long time and may not be the desired 
>>> solution. 
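
The iteration described above amounts to chunking the term list; a 
self-contained sketch (1024 matches Lucene's default maximum clause count):

```java
import java.util.ArrayList;
import java.util.List;

public class TermBatcher {
    // Split a large term list into chunks that stay within Lucene's
    // default maximum clause count (1024), one filter query per chunk.
    public static List<List<String>> batches(List<String> terms, int maxClauses) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += maxClauses) {
            result.add(terms.subList(i, Math.min(i + maxClauses, terms.size())));
        }
        return result;
    }
}
```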
>>>
>>> Fortunately, in most situations, it is possible to find more concise 
>>> grouping to reduce the 1m document identifiers into fewer ones for more 
>>> efficient filtering.
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via 
>>> elasticsearch <elasti...@googlegroups.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Appreciate your continued assistance. :) Thanks,
>>>>
>>>> Disclaimer: I have yet to understand the ES sources well enough to 
>>>> depict my scenario completely. Some info below may be conjecture.
>>>>
>>>> I would have a corpus of 50M docs (actually a lot more, but this is for 
>>>> testing now), out of which I would have, say, up to 1M DocIds to be used 
>>>> as a filter. This set of 1M docs can differ across use cases; the point 
>>>> is that up to 1M DocIds can form one logical set of documents for 
>>>> filtering results. If I use a simple IdsFilter from the ES Java API, I 
>>>> would have to keep adding these 1M docs to the internal List 
>>>> implementation, and I have a feeling it may not scale very well, as the 
>>>> set may change per use case and per combinations within a single use 
>>>> case.
>>>>
>>>> As I debug the code, the IdsFilter gets converted to a Lucene filter. 
>>>> Lucene filters, on the other hand, operate on a docId bitset type. That 
>>>> gels very well with my requirement, since I can scale with BitSets (I 
>>>> assume).
>>>>
>>>> If I can find a way to directly plug this BitSet in as a Lucene Filter 
>>>> for the Lucene search() call, bypassing the ES filters using, I don't 
>>>> know, maybe some sort of plugin, I believe that may support my cause. I 
>>>> assume I may not get to use the filter cache from ES, but I can probably 
>>>> cache these BitSets for subsequent use. 
>>>>
>>>> Please let me know. And thanks!
>>>>
>>>> Thanks,
>>>> Sandeep
>>>>
>>>>
>>>> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>>>>
>>>>> What I understand is that a TermsFilter is required
>>>>>
>>>>> http://www.elasticsearch.org/guide/en/elasticsearch/
>>>>> reference/current/query-dsl-terms-filter.html
>>>>>
>>>>> and the source of the terms is a DB. That is no problem. The plan is: 
>>>>> fetch the terms from the DB, build the query (either Java API or JSON) 
>>>>> and 
>>>>> execute it.
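
That plan can be sketched without any ES classes by assembling the 1.x 
filtered-query JSON directly from the fetched terms (the field name is 
illustrative; real code would also need JSON escaping):

```java
import java.util.List;

public class TermsFilterJson {
    // Build the body of a filtered query with a terms filter, e.g.
    // {"filtered":{"query":{"match_all":{}},"filter":{"terms":{"field1":["a","b"]}}}}
    public static String build(String field, List<String> terms) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"filtered\":{\"query\":{\"match_all\":{}},\"filter\":{\"terms\":{\"")
          .append(field).append("\":[");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(terms.get(i)).append('"');
        }
        sb.append("]}}}}");
        return sb.toString();
    }
}
```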
>>>>>
>>>>> What I don't understand is the part with the "quick mapping", Lucene, 
>>>>> and the doc ids. Lucene doc IDs are not reliable and are not exposed by 
>>>>> Elasticsearch; Elasticsearch uses its own document identifiers, which 
>>>>> are stable and augmented with info about the index type they belong to, 
>>>>> in order to make them unique. But I do not understand why this is 
>>>>> important in this context.
>>>>>
>>>>> The Elasticsearch API uses query builders and filter builders to build 
>>>>> search requests. A "quick mapping" is then just fetching the terms from 
>>>>> the DB as a string array before this API is called.
>>>>>
>>>>> I also do not understand the role of the number "1M" - is this the 
>>>>> number of fields, or the number of terms? Is it a total number or a 
>>>>> number per query?
>>>>>
>>>>> Did I misunderstand anything else? I am not really sure what the 
>>>>> challenge is...
>>>>>
>>>>> Jörg
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via 
>>>>> elasticsearch <elasti...@googlegroups.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Just to give some background: I will have a large-ish corpus of more 
>>>>>> than 100M documents indexed. The filters that I want to apply will be 
>>>>>> on a field that is not indexed. I mean, I prefer not to have it 
>>>>>> indexed in ES/Lucene since it will be frequently changing. So, for 
>>>>>> that, I will be maintaining it elsewhere, like a DB etc.
>>>>>>
>>>>>> Every time I have a query, I would want to filter the results by 
>>>>>> those fields that are not indexed in Lucene. And I am guessing that 
>>>>>> number may well be more than 1M. In that case, since we will maintain 
>>>>>> some sort of TermsFilter, I think it may not scale linearly. What I 
>>>>>> would prefer is to have a hook inside the ES query so that I can 
>>>>>> inject the required filter values at query time. Since the filter 
>>>>>> values have to be recognized by Lucene, and I will not be indexing 
>>>>>> them, I will need to do some quick mapping to get those fields and map 
>>>>>> them quickly to some field in Lucene that I can save in the filter. I 
>>>>>> am not sure whether we can access and set Lucene DocIDs in the filter, 
>>>>>> or whether they are even exposed in ES.
>>>>>>
>>>>>> Please assist with this query. Thanks,
>>>>>>
>>>>>> Thanks,
>>>>>> Sandeep
>>>>>>
>>>>>>
>>>>>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>>>>>
>>>>>>> Maybe I do not fully understand, but in a client, you can fetch the 
>>>>>>> required filter terms from any external source before a JSON query is 
>>>>>>> constructed?
>>>>>>>
>>>>>>> Can you give an example what you want to achieve?
>>>>>>>
>>>>>>> Jörg
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via 
>>>>>>> elasticsearch <elasti...@googlegroups.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am new to ES and I have the following requirement:
>>>>>>>> I need to specify a list of strings as a filter that applies to a 
>>>>>>>> specific field in the document. Like what a filter does, but instead 
>>>>>>>> of sending them in the query, I would like them to be populated from 
>>>>>>>> an external source, like a DB or something. Can you please guide me 
>>>>>>>> to the relevant examples or references for achieving this on v1.1.2? 
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sandeep
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "elasticsearch" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f4
>>>>>>>> 7-48e9-ba19-85b0850eda89%40googlegroups.com 
>>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>
>
