On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:

> I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of
> curl localhost:8080/solr/update?commit=true
> but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.

You should see something like this in the log:
... SolrCmdDistributor .... Distrib commit to: ...

> Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

I still don't understand how you deploy/launch Solr. How many Jetty instances do you start? Do you pass -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create, and with which settings?

> > Also from my POV such deployments should start at least from *16* 4-way vboxes, it's more expensive, but should be much better available during cpu-consuming operations.
>
> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.

> > Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough?
> > You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?
>
> The total index size is 230 GB, so it won't fit in ram, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect.
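
For anyone repeating the per-core commit experiment discussed above, here is a minimal sketch that only builds the two kinds of update URLs (it does not send them). The host and port come from the curl command quoted in this thread; the core names `core0` .. `core15` are an assumption based on the `coreN` naming and the 16 cores mentioned, so adjust them to your own setup. Note that, as the log line above suggests, a commit sent to one core may still be distributed to the others.

```python
from typing import Optional
from urllib.parse import urlencode


def commit_url(base_url: str, core: Optional[str] = None) -> str:
    """Build an update URL that asks Solr for an explicit commit.

    base_url: e.g. "http://localhost:8080/solr" (taken from the thread).
    core:     hypothetical core name such as "core3"; None targets the
              default update handler, as in the quoted curl command.
    """
    root = base_url.rstrip("/")
    path = f"{root}/{core}/update" if core else f"{root}/update"
    return path + "?" + urlencode({"commit": "true"})


if __name__ == "__main__":
    base = "http://localhost:8080/solr"
    print(commit_url(base))            # commit via the default handler
    for n in range(16):                # one URL per (assumed) core name
        print(commit_url(base, f"core{n}"))
```

Timing each of these requests separately (for example with curl) should show whether the commit really stays on one core or fans out to all of them.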
>
> Thanks,
> /Martin
>
> > Thanks
> >
> > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
> >
> > > Mikhail
> > >
> > > PSB
> > >
> > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > >
> > > > Martin,
> > > >
> > > > Please find additional questions from me below.
> > > >
> > > > Simone,
> > > >
> > > > I'm sorry for hijacking your thread. The only thing I've heard about it, at recent ApacheCon sessions, is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to learning how it works with huge files in production.
> > > >
> > > > Thank You, Guys!
> > > >
> > > > On 20.11.2012 18:06, "Martin Koch" <m...@issuu.com> wrote:
> > > >
> > > > > Hi Mikhail
> > > > >
> > > > > Please see answers below.
> > > > >
> > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > > > >
> > > > > > Martin,
> > > > > >
> > > > > > Thank you for telling your own "war-story". It's really useful for the community.
> > > > > > The first question might not seem very well informed, but would you tell me what blocks searching during the EFF reload, when it's triggered by a handler or by a listener?
> > > > >
> > > > > We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force a reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).
> > > >
> > > > Is there a chance to get a thread dump when they are blocked?
> > >
> > > Well I could try to recreate the situation.
> > > But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.
> > >
> > > > > > I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it
> > > > >
> > > > > Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.
> > > >
> > > > Hold on, I asked about Zookeeper because the subject mentions SolrCloud.
> > > >
> > > > Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?
> > >
> > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core.
> > >
> > > > Also, about the symlink - don't you share that file via some NFS?
> > >
> > > No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.
> > >
> > > > How many cores do you run per box?
> > >
> > > This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of RAM. We run 16 solr cores on this box in Jetty.
> > >
> > > > Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
> > >
> > > Yes.
> > > We've allocated 10GB for jetty, and left the rest for the OS.
> > >
> > > > I assume you use 64 bit linux and mmap directory. Please confirm that.
> > >
> > > We use 64-bit linux. I'm not sure about the mmap directory or where that would be configured in solr - can you explain that?
> > >
> > > > > > cause a scalability problem or a long time to reload? Would it help if we had, let's say, an ExternalDatabaseField which pulls values from JDBC, i.e.
> > > > >
> > > > > I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra).
> > > >
> > > > Ok. Let's have it in mind as a possible direction.
> > >
> > > Alternatively, an API that would allow updating a single field for a document might be an option.
> > >
> > > > > > why can't all cores read these values simultaneously?
> > > > >
> > > > > Again, this is a solr implementation detail that I can't answer :)
> > > > >
> > > > > > Can you confirm that the IDs in the file are ordered by the index term order?
> > > > >
> > > > > Yes, we sorted the files (standard UNIX sort).
> > > > >
> > > > > > AFAIK it can impact load time.
> > > > >
> > > > > Yes, it does.
> > > >
> > > > Ok, I've got that you are aware of it, and your IDs are just strings, not integers.
> > >
> > > Yes, ids are strings.
> > >
> > > > > > Regarding your post-query solution: can you tell me, if a query found 10000 docs but I need to display only the first page with 100 rows, whether I need to pull all 10K results to the frontend to order them by rank?
> > > > >
> > > > > In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end.
> > > > >
> > > > > In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end.
> > > > >
> > > > > Even so, patching up a JSON structure with 10000 results should be possible.
> > > >
> > > > You are right. I'm concerned anyway because retrieving the whole result is expensive, and not always possible.
> > >
> > > In our case, getting the whole result is almost impossible, because that would be millions of documents, and returning the Nth result seems to be a quadratic (or worse) operation in SOLR.
> > >
> > > > > > I'd really appreciate it if you commented on the questions above.
> > > > > > PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free ExternalFileField" help you?
> > > > >
> > > > > It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed?
> > > >
> > > > You don't need a commit (in SOLR-4085) to reload the file content, but after a commit you need to read the whole file and scan all key terms and postings. That's because the EFF sits on top of the top-level searcher. It's a Solr-like way.
> > > > In some future we might have a per-segment EFF; in this case adding a segment will trigger a full file scan, but in the index only that new segment will be scanned. It should be faster. You know, straightforward sharing of internal data structures between different index views/generations is not possible.
> > > > If you are asking about applying delta changes on the external file, that's something we did ourselves: http://goo.gl/P8GFq . This feature is much more doubtful and vague, although it might be the next contribution after SOLR-4085.
> > > >
> > > > > /Martin
> > > > >
> > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
> > > > > >
> > > > > > > Solr 4.0 does support using EFFs, but it might not give you what you're hoping for.
> > > > > > >
> > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > >
> > > > > > > The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for.
> > > > > > >
> > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
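
As a rough illustration of the file each core re-reads, here is a minimal sketch that produces EFF lines sorted by key. It assumes the usual `key=value` line format of ExternalFileField and that sorting by the UTF-8 bytes of the ID matches the index term order (which is also what UNIX sort gives only under LC_ALL=C); the document IDs and the read counts below are made up for the example.

```python
from typing import Dict, List


def eff_lines(values: Dict[str, float]) -> List[str]:
    """Return 'key=value' lines sorted by the UTF-8 bytes of the key.

    Sorting by encoded bytes (rather than by Python's default string
    order) is an assumption about how the index orders its ID terms.
    """
    ordered = sorted(values.items(), key=lambda kv: kv[0].encode("utf-8"))
    return [f"{key}={value}" for key, value in ordered]


if __name__ == "__main__":
    # Hypothetical document IDs and read counts.
    reads = {"doc-b": 2.5, "doc-a": 10, "doc-c": 7}
    print("\n".join(eff_lines(reads)))
```

With around 7M documents and 45-character IDs, a file built this way would be in the same range as the 450MB file mentioned in this thread.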
> > > > > > >
> > > > > > > In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45 character ID.
> > > > > > >
> > > > > > > We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see here <https://issues.apache.org/jira/browse/SOLR-3985> for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs.
> > > > > > >
> > > > > > > The next issue is that you cannot sort on EFF fields (though you can return them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1 branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > >
> > > > > > > So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query.
> > > > > > > The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
> > > > > > >
> > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > >
> > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > > I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
> > > > > > > >
> > > > > > > > Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud?
> > > > > > > >
> > > > > > > > Are there otherwise other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external?
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > > Simone
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > > > > Principal Engineer,
> > > > > > Grid Dynamics
> > > > > >
> > > > > > <http://www.griddynamics.com>
> > > > > > <mkhlud...@griddynamics.com>
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
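
The "only re-index when the change is significant" rule from Martin's first message in this thread could look roughly like the sketch below. It is a simplification under stated assumptions: the 2% threshold is applied to the relative change in the read count itself rather than to the resulting boost, and the function name and arguments are hypothetical, not part of anyone's actual code.

```python
def needs_reindex(indexed_reads: int, current_reads: int,
                  threshold: float = 0.02) -> bool:
    """Decide whether a document's indexed popularity is stale.

    indexed_reads: the read count currently stored in the index.
    current_reads: the authoritative read count right now.
    threshold:     relative change that counts as significant
                   (0.02 stands for the "say, more than 2%" above).
    """
    if indexed_reads <= 0:
        # No meaningful baseline: any change counts as significant.
        return current_reads != indexed_reads
    change = abs(current_reads - indexed_reads) / indexed_reads
    return change > threshold


if __name__ == "__main__":
    print(needs_reindex(1_000_000, 1_000_001))  # tiny relative change
    print(needs_reindex(100, 103))              # 3% relative change
```

The authoritative value shown to the user would still be patched into the response after the query, so the indexed value only has to be good enough for ranking.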