Mikhail,

I haven't experimented further yet. I think that the previous experiment of issuing a commit to a specific core proved that all cores get the commit, so I don't think that this approach will work.
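For reference, the two commit forms compared earlier in this thread differ only in the URL path. Below is a minimal Python sketch that builds both; the host and port are taken from the curl command quoted further down, while the core name `core3` and the helper name `commit_url` are hypothetical. In the setup described in this thread, either form was observed to trigger commits on all cores.

```python
from urllib.parse import urlencode


def commit_url(base="http://localhost:8080/solr", core=None):
    """Build a Solr update URL that asks for an explicit commit.

    With core=None the request goes to the default update handler;
    with a core name it targets that core's handler only.
    """
    path = f"{base}/{core}/update" if core else f"{base}/update"
    return f"{path}?{urlencode({'commit': 'true'})}"


# Global form versus per-core form ("core3" is a made-up core name).
print(commit_url())
print(commit_url(core="core3"))
```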
Thanks,
/Martin

On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Martin,
>
> It's still not clear to me whether you solved the problem completely or partially:
> Does reducing the number of cores free some resources for searching during commit?
> Does committing the cores one by one prevent the "freeze"?
>
> Thanks
>
> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
>
>> Mikhail
>>
>> To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug 3985). But this wasn't good enough, because SOLR would still take very long to restart when that was necessary.
>>
>> I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more.
>>
>> IMO it would be ideal if the lucene/solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the lucene/solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only.
>>
>> /Martin
>>
>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>
>> > Martin,
>> >
>> > I don't think solrconfig.xml would shed any light. I've just found what I didn't get in your setup - how to explicitly assign a core to a collection. Now I understand most of the details after all!
>> > The ball is in your court: let us know whether you have managed to get your cores to commit one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware.
>> > Thanks in advance!
>> >
>> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
>> >
>> > > Mikhail,
>> > >
>> > > PSB
>> > >
>> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>> > >
>> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
>> > > >
>> > > > > I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.
>> > > > >
>> > > > You should see something like this in the log:
>> > > > ... SolrCmdDistributor .... Distrib commit to: ...
>> > > >
>> > > Yup, a commit towards a single core results in a commit on all cores.
>> > >
>> > > > > Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.
>> > > > >
>> > > > I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?
>> > > >
>> > > We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments:
>> > >
>> > > -DnumShards=16
>> > > -DzkHost=<zookeeperhost:port>
>> > > -Xmx10G
>> > > -Xms10G
>> > > -Xmn2G
>> > > -server
>> > >
>> > > Would you like to see the solrconfig.xml?
>> > >
>> > > /Martin
>> > >
>> > > > > > Also, from my POV such deployments should start with at least *16* 4-way vboxes. It's more expensive, but they should be much more available during CPU-consuming operations.
>> > > > > >
>> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?
>> > > > >
>> > > > I prefer to start from 16 hosts with 4 cores each.
>> > > >
>> > > > > > Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough?
>> > > > > > You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?
>> > > > > >
>> > > > > The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect.
>> > > > >
>> > > > > Thanks,
>> > > > > /Martin
>> > > > >
>> > > > > > Thanks
>> > > > > >
>> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
>> > > > > >
>> > > > > > > Mikhail
>> > > > > > >
>> > > > > > > PSB
>> > > > > > >
>> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>> > > > > > >
>> > > > > > > > Martin,
>> > > > > > > >
>> > > > > > > > Please find additional questions from me below.
>> > > > > > > >
>> > > > > > > > Simone,
>> > > > > > > >
>> > > > > > > > I'm sorry for hijacking your thread.
>> > > > > > > > The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to learning how it works with huge files in production.
>> > > > > > > >
>> > > > > > > > Thank you, guys!
>> > > > > > > >
>> > > > > > > > On 20.11.2012 at 18:06, "Martin Koch" <m...@issuu.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Mikhail
>> > > > > > > > >
>> > > > > > > > > Please see answers below.
>> > > > > > > > >
>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Martin,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for telling your own "war-story". It's really useful for the community.
>> > > > > > > > > > The first question might not seem well thought out, but would you tell me what blocks searching during EFF reload, when it's triggered by a handler or by a listener?
>> > > > > > > > > >
>> > > > > > > > > We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force a reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).
>> > > > > > > > >
>> > > > > > > > Is there a chance to get a thread dump when they are blocked?
>> > > > > > > >
>> > > > > > > Well, I could try to recreate the situation. But the setup is fairly simple: create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.
>> > > > > > >
>> > > > > > > > > > I don't really get the sentence about sequential commits and the number of cores. Do I understand correctly that the file is replicated via Zookeeper? Doesn't it
>> > > > > > > > > >
>> > > > > > > > > Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.
>> > > > > > > > >
>> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions SolrCloud.
>> > > > > > > >
>> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?
>> > > > > > > >
>> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core.
>> > > > > > >
>> > > > > > > > Also, about the symlink - don't you share that file via some NFS?
>> > > > > > > >
>> > > > > > > No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.
>> > > > > > >
>> > > > > > > > How many cores do you run per box?
>> > > > > > > >
>> > > > > > > This box has 16 virtual cores (8 hyperthreaded cores) and 60GB of RAM. We run 16 solr cores on this box in Jetty.
>> > > > > > >
>> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
>> > > > > > > >
>> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
>> > > > > > >
>> > > > > > > > I assume you use 64-bit linux and mmap directory. Please confirm that.
>> > > > > > > >
>> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or where that would be configured in solr - can you explain that?
>> > > > > > >
>> > > > > > > > > > cause a scalability problem or a long time to reload? Will it help if we have, let's say, an ExternalDatabaseField which will pull values from jdbc, i.e.
>> > > > > > > > > >
>> > > > > > > > > I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra).
>> > > > > > > >
>> > > > > > > > Ok. Let's keep it in mind as a possible direction.
>> > > > > > > >
>> > > > > > > Alternatively, an API that would allow updating a single field for a document might be an option.
>> > > > > > >
>> > > > > > > > > > why can't all cores read these values simultaneously?
>> > > > > > > > > >
>> > > > > > > > > Again, this is a solr implementation detail that I can't answer :)
>> > > > > > > > >
>> > > > > > > > > > Can you confirm that the IDs in the file are ordered by the index term order?
>> > > > > > > > > >
>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>> > > > > > > > >
>> > > > > > > > > > AFAIK it can impact load time.
>> > > > > > > > > >
>> > > > > > > > > Yes, it does.
>> > > > > > > > >
>> > > > > > > > Ok, I understand that you are aware of it, and that your IDs are just strings, not integers.
>> > > > > > > >
>> > > > > > > Yes, ids are strings.
>> > > > > > >
>> > > > > > > > > > Regarding your post-query solution: if a query found 10000 docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by rank?
>> > > > > > > > > >
>> > > > > > > > > In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end.
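The post-query step described in the quoted reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: the thread does not describe the actual API or data store, so the document shape (an `id` key), the `reads` field name, the in-memory `read_counts` mapping and the function name `augment_page` are all hypothetical.

```python
def augment_page(docs, read_counts, field="reads"):
    """Attach the current read count to each document on the page.

    docs        -- documents returned by the search for the current page
    read_counts -- mapping from document id to the live count
                   (stand-in for whatever external store is used)
    Returns new dicts; the input documents are left untouched.
    """
    return [{**doc, field: read_counts.get(doc["id"], 0)} for doc in docs]


# Only the page that is actually returned needs to be augmented.
page = [{"id": "doc-a", "title": "A"}, {"id": "doc-b", "title": "B"}]
live_counts = {"doc-a": 1000001}
print(augment_page(page, live_counts))
```

Because only the returned page is touched, the cost of this step grows with the page size rather than with the total number of hits.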
>> > > > > > > > >
>> > > > > > > > > In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the API to augment only the results that needed to be returned to the front-end.
>> > > > > > > > >
>> > > > > > > > > Even so, patching up a JSON structure with 10000 results should be possible.
>> > > > > > > > >
>> > > > > > > > You are right. I'm concerned anyway, because retrieving the whole result is expensive, and not always possible.
>> > > > > > > >
>> > > > > > > In our case, getting the whole result is almost impossible, because that would be millions of documents, and returning the Nth result seems to be a quadratic (or worse) operation in SOLR.
>> > > > > > >
>> > > > > > > > > > I'd really appreciate it if you commented on the questions above.
>> > > > > > > > > > PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free ExternalFileField" help you?
>> > > > > > > > > >
>> > > > > > > > > It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed?
>> > > > > > > > >
>> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file content, but after a commit you need to read the whole file and scan all key terms and postings.
>> > > > > > > > That's because the EFF sits on top of the top-level searcher; it's the Solr-like way. In the future we might have a per-segment EFF; in that case adding a segment will still trigger a full file scan, but in the index only the new segment will be scanned. It should be faster. You know, straightforward sharing of internal data structures between different index views/generations is not possible.
>> > > > > > > > If you are asking about applying delta changes to the external file, that's something we did ourselves: http://goo.gl/P8GFq . This feature is much more doubtful and vague, although it might be the next contribution after SOLR-4085.
>> > > > > > > >
>> > > > > > > > > /Martin
>> > > > > > > > >
>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you what you're hoping for.
>> > > > > > > > > > >
>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>> > > > > > > > > > >
>> > > > > > > > > > > The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for.
>> > > > > > > > > > >
>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF.
>> > > > > > > > > > > Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
>> > > > > > > > > > >
>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45 character ID.
>> > > > > > > > > > >
>> > > > > > > > > > > We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see here <https://issues.apache.org/jira/browse/SOLR-3985> for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs.
>> > > > > > > > > > > Again, this is not good enough for our needs.
>> > > > > > > > > > >
>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields (though you can return them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1 branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
>> > > > > > > > > > >
>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
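The threshold rule in the quoted message above can be sketched as below. This is not the actual implementation from the thread: the message does not say how reads map to a boost, so the logarithmic `boost` function is purely an assumption, as are the names `boost` and `needs_reindex`. Only the 2% figure comes from the message.

```python
import math


def boost(reads):
    """Hypothetical boost function; the thread does not specify one."""
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True when the boost has drifted by more than `threshold`.

    indexed_reads -- read count stored in the index at the last re-index
    current_reads -- authoritative read count right now
    threshold     -- relative change in boost that justifies a re-index
    """
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0.0:
        # No boost was indexed yet: re-index as soon as there is any.
        return new > 0.0
    return abs(new - old) / old > threshold


print(needs_reindex(1000000, 1000001))  # tiny relative change in boost
print(needs_reindex(10, 100))           # large relative change in boost
```

With a rule like this, a document going from 1000000 to 1000001 reads is left alone, while the displayed count can still be patched in at query time.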
>> > > > > > > > > > >
>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Are there other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external?
>> > > > > > > > > > > > >> > > > > > > > > > > > Thanks in advance, >> > > > > > > > > > > > Simone >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > -- >> > > > > > > > > > Sincerely yours >> > > > > > > > > > Mikhail Khludnev >> > > > > > > > > > Principal Engineer, >> > > > > > > > > > Grid Dynamics >> > > > > > > > > > >> > > > > > > > > > <http://www.griddynamics.com> >> > > > > > > > > > <mkhlud...@griddynamics.com> >> > > > > > > > > > >> > > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <m...@issuu.com >> > >> > > > > написал: >> > > > > > > > >> > > > > > > > > Hi Mikhail >> > > > > > > > > >> > > > > > > > > Please see answers below. >> > > > > > > > > >> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev < >> > > > > > > > > mkhlud...@griddynamics.com> wrote: >> > > > > > > > > >> > > > > > > > > > Martin, >> > > > > > > > > > >> > > > > > > > > > Thank you for telling your own "war-story". It's really >> > > useful >> > > > > for >> > > > > > > > > > community. >> > > > > > > > > > The first question might seems not really conscious, but >> > > would >> > > > > you >> > > > > > > tell >> > > > > > > > > me >> > > > > > > > > > what blocks searching during EFF reload, when it's >> > triggered >> > > by >> > > > > > > handler >> > > > > > > > > or >> > > > > > > > > > by listener? >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > We continuously index new documents using CommitWithin to >> get >> > > > > regular >> > > > > > > > > commits. However, we observed that the EFFs were not >> re-read, >> > > so >> > > > we >> > > > > > had >> > > > > > > > to >> > > > > > > > > do external commits (curl '.../solr/update?commit=true') >> to >> > > force >> > > > > > > reload. >> > > > > > > > > When this is done, solr blocks. I can't tell you exactly >> why >> > > it's >> > > > > > doing >> > > > > > > > > that (it was related to SOLR-3985). 
>> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > I don't really get the sentence about sequential commits >> > and >> > > > > number >> > > > > > > of >> > > > > > > > > > cores. Do I get right that file is replicated via >> > Zookeeper? >> > > > > > Doesn't >> > > > > > > it >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > Again, this is observed behavior. When we issue a commit >> on a >> > > > > system >> > > > > > > > with a >> > > > > > > > > system with many solr cores using EFFs, the system blocks >> > for a >> > > > > long >> > > > > > > time >> > > > > > > > > (15 minutes). We do NOT use zookeeper for anything. The >> EFF >> > > is a >> > > > > > > symlink >> > > > > > > > > from each cores index dir to the actual file, which is >> > updated >> > > by >> > > > > an >> > > > > > > > > external process. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > causes scalability problem or long time to reload? Will >> it >> > > help >> > > > > if >> > > > > > > > we'll >> > > > > > > > > > have, let's say ExternalDatabaseField which will pull >> > values >> > > > from >> > > > > > > jdbc. >> > > > > > > > > ie. >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > I think the possibility of having some fields being >> retrieved >> > > > from >> > > > > an >> > > > > > > > > external, dynamically updatable store would be really >> > > > interesting. >> > > > > > This >> > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql >> > > product >> > > > > > (e.g. >> > > > > > > > > Cassandra). >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > why all cores can't read these values simultaneously? >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > Again, this is a solr implementation detail that I can't >> > answer >> > > > :) >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > Can you confirm that IDs in the file is ordered by the >> > index >> > > > term >> > > > > > > > order? 
>> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > Yes, we sorted the files (standard UNIX sort). >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > AFAIK it can impact load time. >> > > > > > > > > > >> > > > > > > > > Yes, it does. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > Regarding your post-query solution can you tell me if >> query >> > > > found >> > > > > > > 10000 >> > > > > > > > > > docs, but I need to display only first page with 100 >> rows, >> > > > > whether >> > > > > > I >> > > > > > > > need >> > > > > > > > > > to pull all 10K results to frontend to order them by the >> > > rank? >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > In our architecture, the clients query an API that >> generates >> > > the >> > > > > SOLR >> > > > > > > > > query, retrieves the relevant additional fields that we >> > needs, >> > > > and >> > > > > > > > returns >> > > > > > > > > the relevant JSON to the front-end. >> > > > > > > > > >> > > > > > > > > In our use case, results are returned from SOLR by the >> 10's, >> > > not >> > > > by >> > > > > > the >> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr >> returned >> > > > > > thousands >> > > > > > > of >> > > > > > > > > results, it would be up to the implementation of the api >> to >> > > > augment >> > > > > > > only >> > > > > > > > > the results that needed to be returned to the front-end. >> > > > > > > > > >> > > > > > > > > Even so, patching up a JSON structure with 10000 results >> > should >> > > > be >> > > > > > > > > possible. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > I'm really appreciate if you comment on the questions >> > above. >> > > > > > > > > > PS: It's time to pitch, how much >> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 >> "Commit-free >> > > > > > > > > > ExternalFileField" can help you? 
>> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > It looks very interesting :) Does it make it possible to >> > > avoid >> > > > > > > > re-reading >> > > > > > > > > the EFF on every commit, and only re-read the values that >> > have >> > > > > > actually >> > > > > > > > > changed? >> > > > > > > > > >> > > > > > > > > /Martin >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch < >> > m...@issuu.com> >> > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not >> give >> > you >> > > > > what >> > > > > > > > you're >> > > > > > > > > > > hoping fore. >> > > > > > > > > > > >> > > > > > > > > > > We tried using Solr Cloud, and have given up again. >> > > > > > > > > > > >> > > > > > > > > > > The EFF is placed in the parent of the index >> directory in >> > > > each >> > > > > > > core; >> > > > > > > > > each >> > > > > > > > > > > core reads the entire EFF and picks out the IDs that >> it >> > is >> > > > > > > > responsible >> > > > > > > > > > for. >> > > > > > > > > > > >> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks >> > (doesn't >> > > > > answer >> > > > > > > > > > queries) >> > > > > > > > > > > while re-reading the EFF. Even worse, it seems that >> the >> > > time >> > > > to >> > > > > > > > re-read >> > > > > > > > > > the >> > > > > > > > > > > EFF is multiplied by the number of cores in use (i.e. >> the >> > > EFF >> > > > > is >> > > > > > > > > re-read >> > > > > > > > > > by >> > > > > > > > > > > each core sequentially). The contents of the EFF >> become >> > > > active >> > > > > > > after >> > > > > > > > > the >> > > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work >> here) >> > > after >> > > > > the >> > > > > > > > file >> > > > > > > > > > has >> > > > > > > > > > > been updated. 
>> > > > > > > > > > > >> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - >> and >> > > we >> > > > > use >> > > > > > 16 >> > > > > > > > > > shards, >> > > > > > > > > > > so when we triggered an external commit to force >> > > re-reading, >> > > > > the >> > > > > > > > whole >> > > > > > > > > > > system would block for several (10-15) minutes. This >> > won't >> > > > work >> > > > > > in >> > > > > > > a >> > > > > > > > > > > production environment. The reason for the size of the >> > EFF >> > > is >> > > > > > that >> > > > > > > we >> > > > > > > > > > have >> > > > > > > > > > > around 7M documents in the index; each document has a >> 45 >> > > > > > character >> > > > > > > > ID. >> > > > > > > > > > > >> > > > > > > > > > > We got some help to try to fix the problem so that the >> > > > re-read >> > > > > of >> > > > > > > the >> > > > > > > > > EFF >> > > > > > > > > > > proceeds in the background (see >> > > > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> >> > for >> > > > > > > > > > > a fix on the 4.1 branch). However, even though the >> > re-read >> > > > > > proceeds >> > > > > > > > in >> > > > > > > > > > the >> > > > > > > > > > > background, the time required to launch solr now >> takes at >> > > > least >> > > > > > as >> > > > > > > > long >> > > > > > > > > > as >> > > > > > > > > > > re-reading the EFFs. Again, this is not good enough >> for >> > our >> > > > > > needs. >> > > > > > > > > > > >> > > > > > > > > > > The next issue is that you cannot sort on EFF fields >> > > (though >> > > > > you >> > > > > > > can >> > > > > > > > > > return >> > > > > > > > > > > them as values using &fl=field(my_eff_field). This is >> > also >> > > > > fixed >> > > > > > in >> > > > > > > > the >> > > > > > > > > > 4.1 >> > > > > > > > > > > branch here < >> > > https://issues.apache.org/jira/browse/SOLR-4022 >> > > > >. 
>> > > > > > > > > > >
>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
>> > > > > > > > > > >
>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
>> > > > > > > > > > > > However, in this index, an external file field is used for popularity ranking.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Are there otherwise other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks in advance,
>> > > > > > > > > > > > Simone
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > Sincerely yours
>> > > > > > > > > > Mikhail Khludnev
>> > > > > > > > > > Principal Engineer,
>> > > > > > > > > > Grid Dynamics
>> > > > > > > > > >
>> > > > > > > > > > <http://www.griddynamics.com>
>> > > > > > > > > > <mkhlud...@griddynamics.com>
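
The threshold rule Martin describes in the quoted message (re-index a document only when its read count has moved enough to matter for the boost, "say, more than 2%") could be sketched roughly as below. This is only an illustration, not the code ISSUU runs: the function names, the dict-based inputs, and the use of the relative change in the raw read count as a stand-in for "impact on the boost" are all assumptions.

```python
def should_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True if the read count drifted by more than `threshold`
    (relative to the value currently in the index).

    Assumption: relative change in raw reads is used as a proxy for the
    change in boost; the thread does not give the actual boost function.
    """
    if indexed_reads == 0:
        # Any reads at all on a document indexed with zero count as a change.
        return current_reads != 0
    return abs(current_reads - indexed_reads) / indexed_reads > threshold


def docs_to_reindex(indexed, current, threshold=0.02):
    """Given {doc_id: reads} as stored in the index and as currently
    measured, return the sorted IDs whose boost input is stale."""
    return sorted(
        doc_id
        for doc_id, reads in current.items()
        if should_reindex(indexed.get(doc_id, 0), reads, threshold)
    )
```

With this rule a document that goes from 1000000 to 1000001 reads is left alone, while one that goes from 1000 to 1030 is queued for re-indexing. The exact, up-to-date count would still be merged into the response by the query front-end, as described above.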