Hi,

Sure. Thanks a lot for the helpful pointers. I will take a look at the classes and create a plugin. If there are any gotchas or preferred ways of doing things in this plugin, please let me know so that I can take note.
It seems that the plugin would be small, with just the Parser/Builder pair and a simple plugin that uses the IndexQueryParserModule to register an XContentFilterParser. I have not checked the Cache classes, and will take a look there as well.

I do have a follow-up question: if I have to be aware of the shards in my filter implementation, is that possible? I mean, if at indexing time there is a routing policy (I am not sure whether something like that exists) that directs documents to a particular shard based on some range, function, or hash (please let me know if there is something like that, as I would need to check that implementation as well), then at query time I would want the filter to be created only for the DocIds that correspond to the shard where the query will execute. This seems like a problem that is not unusual. Please tell me if this is possible.

Thanks again,
Sandeep

On Tuesday, 8 July 2014 02:01:35 UTC+5:30, Jörg Prante wrote:
>
> In Elasticsearch, you can extend the existing queries and filters by a
> plugin, with the help of addQuery/addFilter on IndexQueryParserModule.
>
> Each query or filter comes as a pair of classes, a builder and a parser.
>
> A filter builder manages the syntax: the content serialization, with the
> help of the XContent classes, for the inner/outer representation of the
> filter specification.
>
> A filter parser parses such a structure and turns it into a Lucene Filter
> for internal processing.
>
> So one approach would be to look at how your bit set implementation
> can be turned into a Lucene Filter.
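On the routing question: Elasticsearch does route documents to shards deterministically, by default hashing the document id (or an explicit routing value) modulo the number of primary shards. A self-contained sketch of that idea, assuming a djb2-style hash purely for illustration (ES 1.x uses its own internal hash function, so real shard assignments will differ):

```java
// Sketch of Elasticsearch-style shard routing:
// shard = hash(routing) % numPrimaryShards.
// The djb2 hash below is an assumption for illustration only; ES 1.x
// ships its own hash function, so actual shard assignments differ.
public class ShardRouting {

    // djb2 string hash, masked to stay non-negative so the modulo
    // always yields a valid shard id
    static int hash(String routing) {
        int h = 5381;
        for (int i = 0; i < routing.length(); i++) {
            h = h * 33 + routing.charAt(i);
        }
        return h & 0x7fffffff;
    }

    static int shardFor(String routing, int numPrimaryShards) {
        return hash(routing) % numPrimaryShards;
    }

    public static void main(String[] args) {
        // With default routing, the document id itself is the routing value
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, 5));
        }
    }
}
```

The relevant point for a filter plugin: because the assignment is deterministic, a per-shard filter could in principle precompute which of its DocIds live on the local shard.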
> An instructive example to start from is
> org.elasticsearch.index.query.TermsFilterParser/TermsFilterBuilder.
>
> An example where terms from the fielddata cache are read and turned into a
> filter is org.elasticsearch.index.search.FielddataTermsFilter.
>
> A key line is the method
>
> public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
> throws IOException
>
> An example for caching filters is
> org.elasticsearch.indices.cache.filter.terms.IndicesTermsFilterCache
> (the caching of filters in ES is done with Guava's cache classes).
>
> It could also be helpful to study the helper classes in this context, e.g.
> in the package org.elasticsearch.common.lucene.docset.
>
> I am not aware of a filter plugin yet, but it is possible that I could
> sketch a demo filter plugin source code on github.
>
> Jörg
>
> On Mon, Jul 7, 2014 at 3:49 PM, Sandeep Ramesh Khanzode
> <k.san...@gmail.com> wrote:
>
>> Hi,
>>
>> A little clarification:
>>
>> Assume a sample data set of 50M documents. The documents need to be
>> filtered by a field, Field1. However, at indexing time, this field is NOT
>> written to the document in Lucene through ES. Field1 is a frequently
>> changing field, and hence we would like to maintain it outside.
>>
>> (This following paragraph can be skipped.)
>> Now assume that there are a few such fields, Field1, ..., FieldN. For
>> every document in the corpus, the value of Field1 may come from a pool of
>> 100-odd values. Thus, for example, at the high end, Field1 can hold 1M
>> documents that correspond to one of the 100-odd values, and at the low
>> end, may correspond to perhaps only 10 documents.
>>
>> (Continue reading.) :-)
>> At system startup time, I would make sure that I have loaded into memory
>> all relevant BitSets that I plan to use for any Filters, so that my cache
>> framework is warm and I can look up the relevant filter values for a
>> particular query from this cache at query run time.
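The getDocIdSet contract named above can be simulated in plain Java: the filter owns a precomputed set of matching doc ids and must intersect it with acceptDocs (the live/visible docs of the segment) before returning. Here java.util.BitSet stands in for Lucene's FixedBitSet/DocIdSet/Bits types, so the class and method shapes are only illustrative:

```java
import java.util.BitSet;

// Simulates the core of Filter.getDocIdSet(context, acceptDocs):
// intersect the filter's precomputed matches with acceptDocs so that
// deleted or otherwise invisible documents are never returned.
// java.util.BitSet stands in for Lucene's DocIdSet here.
public class BitSetFilter {

    private final BitSet matching; // doc ids this filter accepts

    public BitSetFilter(BitSet matching) {
        this.matching = matching;
    }

    // acceptDocs == null means "all docs are acceptable", as in Lucene
    public BitSet getDocIdSet(BitSet acceptDocs) {
        BitSet result = (BitSet) matching.clone();
        if (acceptDocs != null) {
            result.and(acceptDocs); // drop deleted/filtered-out docs
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        matches.set(1); matches.set(3); matches.set(7);

        BitSet live = new BitSet();
        live.set(0, 8);
        live.clear(3); // doc 3 was deleted

        System.out.println(new BitSetFilter(matches).getDocIdSet(live)); // {1, 7}
    }
}
```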
>> The mechanisms for this loading are still unknown, but please assume
>> that this BitSet will be readily available at query time.
>>
>> This BitSet will correspond to the DocIDs in Lucene for a particular
>> value of Field1 that I want to filter on. I plan to create a Filter
>> class, overridden in Lucene, that will accept this DocIdSet.
>>
>> What I am unable to understand is how I can achieve this in ES. I have
>> been exploring the different mail threads on this forum, and it seems
>> that certain plugins can achieve this. Please see the list below of what
>> I could find on this forum.
>>
>> Can you please tell me how an IndexQueryParserModule will serve my use
>> case? If you can provide some pointers on writing a plugin that can
>> leverage a CustomFilter, that would be immensely helpful. Thanks.
>>
>> 1. https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
>> 2. https://groups.google.com/forum/#!topic/elasticsearch/1jiHl4kngJo
>> 3. https://github.com/elasticsearch/elasticsearch/issues/208
>> 4. http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html
>>
>> Thanks,
>> Sandeep
>>
>> On Mon, Jul 7, 2014 at 2:17 AM, joerg...@gmail.com
>> <joerg...@gmail.com> wrote:
>>
>>> Thanks for being so patient with me :)
>>>
>>> I understand now the following: there are 50M documents in an
>>> external DB, from which up to 1M are to be exported in the form of
>>> document identifiers to act as a filter in ES. The idea is to use
>>> internal mechanisms like bit sets. There is no API for manipulating
>>> filters in ES at that level; ES receives the terms and passes them into
>>> the Lucene TermsFilter class according to the type of the filter.
>>>
>>> What is a bit unclear to me: how is the filter set constructed? I assume
>>> it would be a select statement on the database?
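On the question of how the filter set is constructed: one plausible shape is a SELECT over (doc ordinal, Field1 value) rows, folded into one BitSet per distinct value. A sketch under that assumption, with a hard-coded row list standing in for the actual JDBC query:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Builds one BitSet of doc ordinals per Field1 value from an external
// source (e.g. rows of "SELECT doc_ord, field1 FROM ..."). The Row type
// and the in-memory row list are illustrative; real code would stream
// JDBC rows instead.
public class FieldBitSets {

    record Row(int docOrd, String value) {}

    static Map<String, BitSet> build(List<Row> rows) {
        Map<String, BitSet> byValue = new HashMap<>();
        for (Row r : rows) {
            byValue.computeIfAbsent(r.value(), v -> new BitSet()).set(r.docOrd());
        }
        return byValue;
    }

    public static void main(String[] args) {
        Map<String, BitSet> sets = build(List.of(
                new Row(0, "red"), new Row(2, "red"), new Row(5, "blue")));
        System.out.println(sets.get("red"));  // {0, 2}
        System.out.println(sets.get("blue")); // {5}
    }
}
```

These per-value BitSets are exactly what a custom Filter could hand back from getDocIdSet, keeping the frequently changing Field1 out of the Lucene index.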
>>>
>>> Next, once you have this large set of document identifiers selected, I
>>> do not understand what the base query is that you want to apply the
>>> filter to. Is there a user-given query for ES? What does such a query
>>> look like? Is it assumed there are other documents in ES that are
>>> somehow related to the 50M documents? An illustrative example of the
>>> steps in the scenario would really help in understanding the data model.
>>>
>>> Just some food for thought: it is close to impossible to filter in ES
>>> on 1M unique terms in a single step - the default maximum clause count
>>> of a Lucene query is limited to 1024 for good reason. A workaround
>>> would be to iterate over the 1M terms, execute roughly 1000 filter
>>> queries, and add up the results. This takes a long time and may not be
>>> the desired solution.
>>>
>>> Fortunately, in most situations it is possible to find a more concise
>>> grouping that reduces the 1M document identifiers to fewer ones for
>>> more efficient filtering.
>>>
>>> Jörg
>>>
>>> On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
>>> elasticsearch <elasti...@googlegroups.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I appreciate your continued assistance. :) Thanks.
>>>>
>>>> Disclaimer: I have yet to understand the ES sources well enough to
>>>> depict my scenario completely. Some info below may be conjecture.
>>>>
>>>> I would have a corpus of 50M docs (actually a lot more, but this is
>>>> for testing now), out of which I would have, say, up to 1M DocIds to
>>>> be used as a filter. This set of 1M docs can be different for
>>>> different use cases; the point being, up to 1M DocIds can form one
>>>> logical set of documents for filtering results.
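The chunking workaround described above (split the term list into batches of at most 1024, run one filter query per batch, and OR the results together) can be sketched as follows; only the batching is shown, not the per-batch ES calls:

```java
import java.util.ArrayList;
import java.util.List;

// Works around Lucene's default maximum clause count by splitting a
// large term list into batches of at most 1024 terms; each batch would
// become one terms filter, and the per-batch results would be OR-ed
// together by the caller. Executing the batches against ES is omitted.
public class TermBatcher {

    static final int MAX_CLAUSES = 1024; // Lucene's default clause limit

    static List<List<String>> batches(List<String> terms) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += MAX_CLAUSES) {
            out.add(terms.subList(i, Math.min(i + MAX_CLAUSES, terms.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 2500; i++) terms.add("term" + i);
        System.out.println(batches(terms).size()); // 3 batches: 1024 + 1024 + 452
    }
}
```

This makes Jörg's cost argument concrete: 1M terms means roughly 1000 round trips, which is why a more concise grouping is preferable.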
>>>> If I use a simple IdsFilter from the ES Java API, I would have to keep
>>>> adding these 1M docs to the internal List implementation, and I have a
>>>> feeling it may not scale very well, as they may change per use case
>>>> and per combinations internal to a single use case as well.
>>>>
>>>> As I debug the code, the IdsFilter gets converted to a Lucene filter.
>>>> Lucene filters, on the other hand, operate on a DocId bitset type.
>>>> That gels very well with my requirement, since I can scale with
>>>> BitSets (I assume).
>>>>
>>>> If I can find a way to plug this BitSet directly into the Lucene
>>>> search() call as a Lucene Filter, bypassing the ES filters using, I
>>>> don't know, maybe some sort of plugin, I believe that may support my
>>>> cause. I assume I may not get to use the filter cache from ES, but I
>>>> can probably cache these BitSets for subsequent use.
>>>>
>>>> Please let me know. And thanks!
>>>>
>>>> Thanks,
>>>> Sandeep
>>>>
>>>> On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
>>>>
>>>>> What I understand is that a TermsFilter is required
>>>>>
>>>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html
>>>>>
>>>>> and the source of the terms is a DB. That is no problem. The plan is:
>>>>> fetch the terms from the DB, build the query (either Java API or
>>>>> JSON), and execute it.
>>>>>
>>>>> What I don't understand is the part with the "quick mapping", Lucene,
>>>>> and the doc ids. Lucene doc IDs are not reliable and are not exposed
>>>>> by Elasticsearch. Elasticsearch uses its own document identifiers,
>>>>> which are stable and augmented with info about the index type they
>>>>> belong to, in order to make them unique. But I do not understand why
>>>>> this is important in this context.
>>>>>
>>>>> The Elasticsearch API uses query builders and filter builders to
>>>>> build search requests.
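A minimal stand-in for the "cache these BitSets for subsequent use" idea: a bounded map keyed by filter value that loads on miss and evicts least-recently-used entries. ES itself does this with Guava's cache classes; the loader function and capacity here are assumptions for illustration:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal filter-cache stand-in: BitSets keyed by filter value, loaded
// on miss and evicted LRU-style beyond a fixed capacity. ES uses Guava
// caches for the real thing; this only shows the access pattern.
public class FilterCache {

    private final int capacity;
    private final LinkedHashMap<String, BitSet> cache;

    public FilterCache(int capacity) {
        this.capacity = capacity;
        // access-order LinkedHashMap gives simple LRU semantics
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> e) {
                return size() > FilterCache.this.capacity;
            }
        };
    }

    // the loader would rebuild the BitSet from the external store on a miss
    public BitSet get(String value, Function<String, BitSet> loader) {
        return cache.computeIfAbsent(value, loader);
    }

    public int size() { return cache.size(); }
}
```

Warming such a cache at startup, as proposed above, would just mean calling get() once for every filter value expected at query time.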
A "quick mapping" is just fetching the terms from the >>>>> DB >>>>> as a string array before this API is called. >>>>> >>>>> I also do not understand the role of the number "1M", is this the >>>>> number of fields, or the number of terms? Is it a total number or a >>>>> number >>>>> per query? >>>>> >>>>> Did I misunderstand anything more? I am not really sure what is the >>>>> challenge... >>>>> >>>>> Jörg >>>>> >>>>> >>>>> >>>>> On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via >>>>> elasticsearch <elasti...@googlegroups.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Just to give some background. I will have a large-ish corpus of more >>>>>> than 100M documents indexed. The filters that I want to apply will be on >>>>>> a >>>>>> field that is not indexed. I mean, I prefer to not have them indexed in >>>>>> ES/Lucene since they will be frequently changing. So, for that, I will >>>>>> be >>>>>> maintaining them elsewhere, like a DB etc. >>>>>> >>>>>> Everytime I have a query, I would want to filter the results by those >>>>>> fields that are not indexed in Lucene. And I am guessing that number may >>>>>> well be more than 1M. In that case, I think, since we will maintain some >>>>>> sort of TermsFilter, it may not scale linearly. What I would want to do, >>>>>> preferably, is to have a hook inside the ES query, so that I can, at >>>>>> query >>>>>> time, inject the required filter values. Since the filter values have to >>>>>> be >>>>>> recognized by Lucene, and I will not be indexing them, I will need to do >>>>>> some quick mapping to get those fields and map them quickly to some >>>>>> field >>>>>> in Lucene that I can save in the filter. I am not sure whether we can >>>>>> access and set Lucene DocIDs in the filter or whether they are even >>>>>> exposed >>>>>> in ES. >>>>>> >>>>>> Please assist with this query. 
>>>>>> Thanks,
>>>>>> Sandeep
>>>>>>
>>>>>> On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
>>>>>>
>>>>>>> Maybe I do not fully understand, but in a client, you can fetch the
>>>>>>> required filter terms from any external source before a JSON query
>>>>>>> is constructed?
>>>>>>>
>>>>>>> Can you give an example of what you want to achieve?
>>>>>>>
>>>>>>> Jörg
>>>>>>>
>>>>>>> On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via
>>>>>>> elasticsearch <elasti...@googlegroups.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am new to ES, and I have the following requirement:
>>>>>>>> I need to specify a list of strings as a filter that applies to a
>>>>>>>> specific field in the document. Like what a filter does, but
>>>>>>>> instead of sending them in the query, I would like them to be
>>>>>>>> populated from an external source, like a DB or something. Can you
>>>>>>>> please guide me to the relevant examples or references to achieve
>>>>>>>> this on v1.1.2?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sandeep

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/30f3d287-752d-4b2e-8a9d-4ba216a514d0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.