Simone,

Your findings are great. You captured the key problems with EFF in SolrCloud; let me summarize them:
1. the file should be replicated somehow;
2. the file needs to be sliced per shard, to avoid a wasteful full read;
3. delta EFF updates - when you change the popularity for a few docs, you have to fully reload the file;
4. frequent NRT (soft)commits lead to a full refresh of the top-level float array, since EFF is held by the top-level searcher.
Although all these points are highly coupled, let me group them:
- 1 & 2 are SolrCloud-caused architecture puzzles, which I want to discuss;
- 3 & 4 are coding issues, which from my POV can be solved today or tomorrow - just raise an issue and attach a patch, you know.

Developers, I'm kindly asking about your plan regarding EFF's future in SolrCloud: how will its data be spread?

* Simone proposed quite a smart mechanism for distributing EFF - rely on the current document distribution, but rip the EFF data off the SolrInputDoc before indexing and persist it somehow at every node. I'm just concerned that not all failures can be recovered this way. I also see a gap when some secondary key is used in EFF, e.g. if popularity is keyed by brand name rather than by productID, an EFF update doc cannot be routed to the right shard; keep custom sharding in mind as well.

* My vision is slightly different: I propose to cut the current EFF into two pieces: a FunctionQuery/ValueSource and a data-loading mechanism (the current impl reads floats from property files). The former should call the latter via some efficient interface. We can hope that further refactoring of the first part gives us per-segment ValueSources, which would be great for NRT, I suppose. Then we could provide better data loaders, e.g. jdbc- or memcached-backed; in that case failover should be supported by the external data-loading machinery. Also, if the data-loading interface is efficient enough, it can also cover sharding of the external data - it will pull only those values which are required at the particular shard/reader.

What is the strategic EFF direction in the Cloud: "internal" or "external"? Beyond the horizon I see updatable DocValues and modifiable fields. Does their imminent arrival block EFF improvements? Might EFF be retired soon? Or is the right way to do that to implement some Codec?

Excuse me for being so verbose. Thanks for your feedback!
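To make the proposed split concrete, here is a minimal sketch of what the "efficient interface" between the ValueSource side and the data-loading side could look like. All names (ExternalDataLoader, InMemoryLoader) are hypothetical, not existing Solr APIs; the point is that a loader pulls only the values needed at a particular shard/reader:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical interface separating EFF's value-source side from its
 * data-loading side. A jdbc- or memcached-backed implementation would
 * plug in here; the names are illustrative only.
 */
interface ExternalDataLoader {
    /** Pull only the values for the keys present on this shard/reader. */
    Map<String, Float> load(Set<String> keysOnThisShard);
}

/** Sketch standing in for the current behavior: floats held in memory. */
class InMemoryLoader implements ExternalDataLoader {
    private final Map<String, Float> all;

    InMemoryLoader(Map<String, Float> all) {
        this.all = all;
    }

    @Override
    public Map<String, Float> load(Set<String> keysOnThisShard) {
        Map<String, Float> result = new HashMap<>();
        for (String key : keysOnThisShard) {
            Float v = all.get(key);
            if (v != null) {
                result.put(key, v); // only shard-local keys are pulled
            }
        }
        return result;
    }
}
```

With such an interface, sharding of the external data falls out naturally: each shard asks only for its own keys.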
On Fri, Nov 23, 2012 at 11:39 PM, Simone Gianni <simo...@apache.org> wrote:
> Hi all,
> this is my first posting on this list, so bear with me if this is not the right place/the right topic etc.
>
> I'm currently migrating a Solr 3.x system to SolrCloud. It uses ExternalFileField for the common "popularity" ranking.
>
> I've tried to get ExternalFileField to work in SolrCloud, but it is quite a problem now that the data dir is not directly accessible (it's "in the cloud"). Moreover, external files served the purpose well, but with millions of KV pairs that need to be generated for each update they are not "cloud scale".
>
> After asking on the users mailing list for advice, I started coding and came up with a prototype of a possible replacement that I'd like to contribute. It's currently "working", but far from having been heavily tested under the various scenarios of a ZooKeeper-based system. It is currently implemented as a plugin.
>
> Basically it's an UpdateProcessor, to be placed in the update chain after the DistributedUpdateProcessor and before the RunUpdateProcessor, as usual.
>
> This processor looks at the add/update request, searches for a specific field (the "popularity" field for example), and delegates its persistence to a different system than the Lucene index. My current stupid implementation simply caches it in a concurrent sorted map which is "dumped" to a file upon commit. It's possible, however, to plug in implementations for any embedded simple KV store (like JDBM for example) to provide much faster commit times and avoid loading everything into RAM at startup.
>
> This makes it possible to send to SolrCloud (or Solr without the cloud) updates about the "popularity" field as normal document updates. Soft commit (to have NRT) and rollback are already supported.
>
> The fields persisted in the alternative system are removed from the document, so they should never reach Lucene at all.
> If an update consists only of these specific fields, then it is not propagated to the Lucene index, so that (AFAIU) no reindexing or index growth has to take place (not even a Lucene commit, if the commit involves only updates containing only these fields).
>
> Then a specific field type (code mostly copy-pasted from ExternalFileField :) ) is able to use the same instance to get an array of float values for a specific Reader and use it in a ValueSource for all the common uses.
>
> Since the processor is placed after the DistributedUpdateProcessor, it is (AFAIU) sharded and replicated (I know it's sharded for sure; replication is not yet fully tested) the same way the Lucene indexes are, so each core has its index and its "external files".
>
>
> Now my questions are:
> 1) do you think this is interesting, right, wrong, dangerous, etc.?
> 2) do you see any error in my reasoning about update handlers?
> 3) how does creation/synchronization of a new replica happen? (i.e. where would I have to plug in to also replicate the "external files")
> 4) I can contribute this code, if you think it's worthwhile, but it needs to be worked on a bit before being ready to be submitted as a "ready to apply" patch; could we use an Apache Lab or a sandbox space to cooperate on this if someone is willing to help?
>
>
> Let me know,
> Simone
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
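P.S. For readers following along, the storage part of the quoted prototype ("a concurrent sorted map which is dumped to a file upon commit") can be sketched in isolation, outside Solr. The class and method names below are hypothetical, not Simone's actual code; it only illustrates the described mechanics:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Standalone sketch of the quoted design: popularity values are stripped
 * from the document by an update processor, held in a concurrent sorted
 * map, and "dumped" to a file in EFF's key=value format on commit.
 */
class PopularityStore {
    private final ConcurrentSkipListMap<String, Float> pending =
            new ConcurrentSkipListMap<>();

    /** Called by the update processor after removing the field from the doc. */
    void put(String docId, float popularity) {
        pending.put(docId, popularity);
    }

    Float get(String docId) {
        return pending.get(docId);
    }

    /** On (soft) commit: persist the KV pairs, sorted by key. */
    void dump(Path file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (Map.Entry<String, Float> e : pending.entrySet()) {
                out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}
```

An embedded KV store (JDBM etc.) would replace the map and the dump step, which is exactly where the faster commits mentioned above would come from.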