Re: SolrCloud and external file fields

Martin Koch Wed, 28 Nov 2012 01:16:55 -0800

Mikhail

I haven't experimented further yet. I think that the previous experiment of
issuing a commit to a specific core proved that all cores get the commit,
so I don't think that this approach will work.
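
For reference, a minimal sketch of the experiment described above: one
explicit commit per core, sent one after the other. The host, port and
"coreN" naming are taken from the curl commands quoted further down; the
helper names are hypothetical and nothing here is part of Solr itself.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host/port as in the curl commands quoted later in this thread.
SOLR_BASE = "http://localhost:8080/solr"


def commit_url(core=None, base=SOLR_BASE):
    """Build the update URL that triggers an explicit commit.

    core=None targets the default update handler, otherwise the named core.
    """
    path = f"{base}/{core}/update" if core else f"{base}/update"
    return f"{path}?{urlencode({'commit': 'true'})}"


def commit_one_by_one(cores, send=urlopen):
    """Send one commit per core, waiting for each response before the next."""
    for core in cores:
        with send(commit_url(core)) as response:
            response.read()  # block until this core has answered
```

Calling commit_one_by_one(["core1", "core2"]) issues the requests in order.
As reported in this thread, each such request was still distributed to all
cores, so the loop by itself does not stagger the EFF reloads.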


Thanks,
/Martin


On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Martin,
>
> It's still not clear to me whether you have solved the problem completely
> or only partially:
> Does reducing the number of cores free up some resources for searching
> during a commit?
> Does committing the cores one by one prevent the "freeze"?
>
> Thanks
>
>
> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
>
>> Mikhail
>>
>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>> (bug 3985). But this wasn't good enough, because SOLR would still take
>> very long to restart when that was necessary.
>>
>> I don't see how we could throw more hardware at the problem without making
>> it worse, really - the only solution here would be *fewer* shards, not
>> more.
>>
>> IMO it would be ideal if the lucene/solr community could come up with a
>> good way of updating fields in a document without reindexing. This could
>> be done by linking to some external data store, or in the lucene/solr
>> internals. If it would make things easier, a good first step would be to
>> have dynamically updateable numerical fields only.
>>
>> /Martin
>>
>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>> > Martin,
>> >
>> > I don't think solrconfig.xml would shed any light on this. I've just
>> > found what I didn't get about your setup - how cores are explicitly
>> > assigned to a collection. Now I understand most of the details after all!
>> > The ball is in your court: let us know whether you have managed to get
>> > your cores to commit one by one to avoid the freeze, or whether you could
>> > eliminate the pauses by allocating more hardware.
>> > Thanks in advance!
>> >
>> >
>> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
>> >
>> > > Mikhail,
>> > >
>> > > PSB
>> > >
>> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>> > > mkhlud...@griddynamics.com> wrote:
>> > >
>> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com>
>> wrote:
>> > > >
>> > > > >
>> > > > > I wasn't aware until now that it is possible to send a commit to
>> > > > > one core only. What we observed was the effect of curl
>> > > > > localhost:8080/solr/update?commit=true but perhaps we should
>> > > > > experiment with solr/coreN/update?commit=true. A quick trial run
>> > > > > seems to indicate that a commit to a single core causes commits on
>> > > > > all cores.
>> > > > >
>> > > > You should see something like this in the log:
>> > > > ... SolrCmdDistributor .... Distrib commit to: ...
>> > > >
>> > > Yup, a commit towards a single core results in a commit on all cores.
>> > >
>> > >
>> > > > >
>> > > > >
>> > > > > Perhaps I should clarify that we are using SOLR as a black box;
>> we do
>> > > not
>> > > > > touch the code at all - we only install the distribution WAR file
>> and
>> > > > > proceed from there.
>> > > > >
>> > > > I still don't understand how you deploy/launch Solr. How many jettys
>> > > > do you start? Do you pass -DzkRun -DzkHost -DnumShards=2, or do you
>> > > > specify the shards= param for every request and distribute updates
>> > > > yourself? What collections do you create, and with which settings?
>> > > >
>> > > We let SOLR do the sharding using one collection with 16 SOLR cores
>> > > holding one shard each. We launch only one instance of jetty with the
>> > > following arguments:
>> > >
>> > > -DnumShards=16
>> > > -DzkHost=<zookeeperhost:port>
>> > > -Xmx10G
>> > > -Xms10G
>> > > -Xmn2G
>> > > -server
>> > >
>> > > Would you like to see the solrconfig.xml?
>> > >
>> > > /Martin
>> > >
>> > >
>> > > > >
>> > > > >
>> > > > > > Also, from my POV such deployments should start from at least
>> > > > > > *16* 4-way vboxes; it's more expensive, but it should stay much
>> > > > > > more available during cpu-consuming operations.
>> > > > > >
>> > > > >
>> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4
>> > > > > hosts with 16 cores? Or am I misunderstanding something :) ?
>> > > > >
>> > > > I prefer to start from 16 hosts with 4 cores each.
>> > > >
>> > > >
>> > > > >
>> > > > >
>> > > > > > Other details: if you use a single jetty for all of them, are you
>> > > > > > sure that jetty's threadpool doesn't limit requests? Is it large
>> > > > > > enough? You have 60G and set -Xmx=10G. Are you sure that the total
>> > > > > > size of the cores' index directories is less than 45G?
>> > > > > >
>> > > > > The total index size is 230 GB, so it won't fit in RAM, but we're
>> > > > > using an SSD disk to minimize disk access time. We have tried
>> > > > > putting the EFF onto a RAM disk, but this didn't have a measurable
>> > > > > effect.
>> > > > >
>> > > > > Thanks,
>> > > > > /Martin
>> > > > >
>> > > > >
>> > > > > > Thanks
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com>
>> > wrote:
>> > > > > >
>> > > > > > > Mikhail
>> > > > > > >
>> > > > > > > PSB
>> > > > > > >
>> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>> > > > > > > mkhlud...@griddynamics.com> wrote:
>> > > > > > >
>> > > > > > > > Martin,
>> > > > > > > >
>> > > > > > > > Please find additional question from me below.
>> > > > > > > >
>> > > > > > > > Simone,
>> > > > > > > >
>> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
>> > > > > > > > about it at recent ApacheCon sessions is that Zookeeper is
>> > > > > > > > supposed to replicate those files as configs under solr home.
>> > > > > > > > I'm really looking forward to learning how it works with huge
>> > > > > > > > files in production.
>> > > > > > > >
>> > > > > > > > Thank You, Guys!
>> > > > > > > >
>> > > > > > > > On 20.11.2012 at 18:06, "Martin Koch" <m...@issuu.com> wrote:
>> > > > > > > > >
>> > > > > > > > > Hi Mikhail
>> > > > > > > > >
>> > > > > > > > > Please see answers below.
>> > > > > > > > >
>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>> > > > > > > > > mkhlud...@griddynamics.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Martin,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>> > > > > > > > > > useful for the community.
>> > > > > > > > > > The first question might seem naive, but would you tell me
>> > > > > > > > > > what blocks searching during an EFF reload, when it's
>> > > > > > > > > > triggered by a handler or by a listener?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > We continuously index new documents using CommitWithin to
>> > > > > > > > > get regular commits. However, we observed that the EFFs were
>> > > > > > > > > not re-read, so we had to do external commits (curl
>> > > > > > > > > '.../solr/update?commit=true') to force reload. When this is
>> > > > > > > > > done, solr blocks. I can't tell you exactly why it's doing
>> > > > > > > > > that (it was related to SOLR-3985).
>> > > > > > > >
>> > > > > > > > Is there a chance to get a thread dump when they are blocked?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > Well, I could try to recreate the situation. But the setup is
>> > > > > > > fairly simple: Create a large EFF in a largeish index with many
>> > > > > > > shards. Issue a commit, and then try to do a search. Solr will
>> > > > > > > not respond to the search before the commit has completed, and
>> > > > > > > this will take a long time.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I don't really get the sentence about sequential commits
>> > > > > > > > > > and the number of cores. Do I understand correctly that
>> > > > > > > > > > the file is replicated via Zookeeper? Doesn't it
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is observed behavior. When we issue a commit on
>> > > > > > > > > a system with many solr cores using EFFs, the system blocks
>> > > > > > > > > for a long time (15 minutes). We do NOT use zookeeper for
>> > > > > > > > > anything. The EFF is a symlink from each core's index dir to
>> > > > > > > > > the actual file, which is updated by an external process.
>> > > > > > > >
>> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions
>> > > > > > > > SolrCloud.
>> > > > > > > >
>> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just
>> > > > > > > > replicas of the same index?
>> > > > > > > >
>> > > > > > >
>> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
>> > > > > > > I'm a bit unsure about the terminology here, but we've got a
>> > > > > > > single index divided into 16 shards. Each shard is hosted in a
>> > > > > > > solr core.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Also, about the symlink - don't you share that file via some
>> > > > > > > > NFS?
>> > > > > > > >
>> > > > > > > No, we generate the EFF on the local solr host (there is only
>> > > > > > > one physical host that holds all shards), so there is no need
>> > > > > > > for NFS or copying files around. No need for Zookeeper either.
>> > > > > > >
>> > > > > > >
>> > > > > > > > How many cores do you run per box?
>> > > > > > > >
>> > > > > > > This box is a 16-virtual core (8 hyperthreaded cores) machine
>> > > > > > > with 60GB of RAM. We run 16 solr cores on this box in Jetty.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides
>> > > > > > > > the JVM heaps?
>> > > > > > > >
>> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
>> > > > > > > OS.
>> > > > > > >
>> > > > > > >
>> > > > > > > > I assume you use 64 bit linux and mmap directory. Please
>> > > > > > > > confirm that.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
>> > > > > > > where that would be configured in solr - can you explain that?
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > cause a scalability problem or a long time to reload? Would
>> > > > > > > > > > it help if we had, let's say, an ExternalDatabaseField
>> > > > > > > > > > which pulls values from JDBC? I.e.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I think the possibility of having some fields being
>> > > > > > > > > retrieved from an external, dynamically updatable store would
>> > > > > > > > > be really interesting. This could be JDBC, something
>> > > > > > > > > in-memory like redis, or a NoSql product (e.g. Cassandra).
>> > > > > > > >
>> > > > > > > > Ok. Let's have it in mind as a possible direction.
>> > > > > > > >
>> > > > > > >
>> > > > > > > Alternatively, an API that would allow updating a single field
>> > > > > > > for a document might be an option.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > why can't all cores read these values simultaneously?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is a solr implementation detail that I can't
>> > > > > > > > > answer :)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Can you confirm that the IDs in the file are ordered by
>> > > > > > > > > > index term order?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
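
A minimal sketch of producing such a sorted file, assuming the usual
ExternalFileField layout of one key=value line per document. The function
name is hypothetical. Python orders strings by code point, which should
match index term order for UTF-8 terms, whereas UNIX sort follows the
locale unless LC_ALL=C is set.

```python
import os
import tempfile


def write_sorted_eff(values, path):
    """Write id=value lines sorted by id, then swap the file in atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for doc_id in sorted(values):  # code point order
            out.write(f"{doc_id}={values[doc_id]}\n")
    os.replace(tmp_path, path)  # readers never see a half-written file
```

For example, write_sorted_eff({"doc-b": 2.5, "doc-a": 10}, "external_popularity")
writes the doc-a line before the doc-b line; the file name here is only an
illustration.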
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > AFAIK it can impact load time.
>> > > > > > > > > >
>> > > > > > > > > Yes, it does
>> > > > > > > >
>> > > > > > > > Ok, I see that you are aware of it, and that your IDs are just
>> > > > > > > > strings, not integers.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > Yes, ids are strings.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Regarding your post-query solution: if a query found 10000
>> > > > > > > > > > docs, but I need to display only the first page with 100
>> > > > > > > > > > rows, do I need to pull all 10K results to the frontend to
>> > > > > > > > > > order them by rank?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > In our architecture, the clients query an API that generates
>> > > > > > > > > the SOLR query, retrieves the relevant additional fields that
>> > > > > > > > > we need, and returns the relevant JSON to the front-end.
>> > > > > > > > >
>> > > > > > > > > In our use case, results are returned from SOLR by the 10's,
>> > > > > > > > > not by the 1000's, so it is a manageable job. Even so, if
>> > > > > > > > > solr returned thousands of results, it would be up to the
>> > > > > > > > > implementation of the api to augment only the results that
>> > > > > > > > > needed to be returned to the front-end.
>> > > > > > > > >
>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>> > > > > > > > > should be possible.
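
A minimal sketch of that post-query step, assuming Solr's standard JSON
response layout (response.docs). The function name, the lookup_reads
callable and the "reads" key are hypothetical stand-ins for whatever the
front-end API and its popularity store actually use.

```python
def augment_page(solr_response, lookup_reads, id_field="id"):
    """Attach the current read count to each document on the returned page.

    lookup_reads takes a list of ids and returns a dict of id -> reads.
    Only the documents Solr actually returned are looked up.
    """
    docs = solr_response["response"]["docs"]
    reads = lookup_reads([doc[id_field] for doc in docs])  # one batched lookup
    for doc in docs:
        doc["reads"] = reads.get(doc[id_field], 0)  # unknown ids default to 0
    return solr_response
```

Because only the returned page is touched, the cost depends on the rows
requested, not on numFound.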
>> > > > > > > >
>> > > > > > > > You are right. I'm concerned anyway, because retrieving the
>> > > > > > > > whole result is expensive, and not always possible.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > In our case, getting the whole result is almost impossible,
>> > > > > > > because that would be millions of documents, and returning the
>> > > > > > > Nth result seems to be a quadratic (or worse) operation in SOLR.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I'd really appreciate it if you could comment on the
>> > > > > > > > > > questions above.
>> > > > > > > > > > PS: It's time to pitch: how much can
>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>> > > > > > > > > > "Commit-free ExternalFileField" help you?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > It looks very interesting :) Does it make it possible to
>> > > > > > > > > avoid re-reading the EFF on every commit, and only re-read
>> > > > > > > > > the values that have actually changed?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file
>> > > > > > > > content, but after a commit you need to read the whole file and
>> > > > > > > > scan all key terms and postings. That's because the EFF sits on
>> > > > > > > > top of the top-level searcher; it's the Solr-like way.
>> > > > > > > > In some future we might have a per-segment EFF; in that case
>> > > > > > > > adding a segment will still trigger a full file scan, but in
>> > > > > > > > the index only that new segment will be scanned. It should be
>> > > > > > > > faster. You know, straightforward sharing of internal data
>> > > > > > > > structures between different index views/generations is not
>> > > > > > > > possible.
>> > > > > > > > If you are asking about applying delta changes to the external
>> > > > > > > > file, that's something we did ourselves: http://goo.gl/P8GFq .
>> > > > > > > > This feature is much more doubtful and vague, although it might
>> > > > > > > > be the next contribution after SOLR-4085.
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > /Martin
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>> > m...@issuu.com>
>> > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give
>> > > > > > > > > > > you what you're hoping for.
>> > > > > > > > > > >
>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>> > > > > > > > > > >
>> > > > > > > > > > > The EFF is placed in the parent of the index directory in
>> > > > > > > > > > > each core; each core reads the entire EFF and picks out
>> > > > > > > > > > > the IDs that it is responsible for.
>> > > > > > > > > > >
>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
>> > > > > > > > > > > (doesn't answer queries) while re-reading the EFF. Even
>> > > > > > > > > > > worse, it seems that the time to re-read the EFF is
>> > > > > > > > > > > multiplied by the number of cores in use (i.e. the EFF is
>> > > > > > > > > > > re-read by each core sequentially). The contents of the
>> > > > > > > > > > > EFF become active after the first EXTERNAL commit
>> > > > > > > > > > > (commitWithin does NOT work here) after the file has been
>> > > > > > > > > > > updated.
>> > > > > > > > > > >
>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and
>> > > > > > > > > > > we use 16 shards, so when we triggered an external commit
>> > > > > > > > > > > to force re-reading, the whole system would block for
>> > > > > > > > > > > several (10-15) minutes. This won't work in a production
>> > > > > > > > > > > environment. The reason for the size of the EFF is that
>> > > > > > > > > > > we have around 7M documents in the index; each document
>> > > > > > > > > > > has a 45 character ID.
>> > > > > > > > > > >
>> > > > > > > > > > > We got some help to try to fix the problem so that the
>> > > > > > > > > > > re-read of the EFF proceeds in the background (see here
>> > > > > > > > > > > <https://issues.apache.org/jira/browse/SOLR-3985> for a
>> > > > > > > > > > > fix on the 4.1 branch). However, even though the re-read
>> > > > > > > > > > > proceeds in the background, launching solr now takes at
>> > > > > > > > > > > least as long as re-reading the EFFs. Again, this is not
>> > > > > > > > > > > good enough for our needs.
>> > > > > > > > > > >
>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
>> > > > > > > > > > > (though you can return them as values using
>> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
>> > > > > > > > > > > branch here
>> > > > > > > > > > > <https://issues.apache.org/jira/browse/SOLR-4022>.
>> > > > > > > > > > >
>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
>> > > > > > > > > > > great. Our solution is as follows: The actual value of
>> > > > > > > > > > > the popularity measure (say, reads) that we want to
>> > > > > > > > > > > report to the user is inserted into the search response
>> > > > > > > > > > > post-query by our query front-end. This value will then
>> > > > > > > > > > > be the authoritative value at the time of the query. The
>> > > > > > > > > > > value of the popularity measure that we use for boosting
>> > > > > > > > > > > in the ranking of the search results is only updated when
>> > > > > > > > > > > the value has changed enough so that the impact on the
>> > > > > > > > > > > boost will be significant (say, more than 2%). This does
>> > > > > > > > > > > require frequent re-indexing of the documents that have
>> > > > > > > > > > > significant changes in the number of reads, but at least
>> > > > > > > > > > > we won't have to update a document if it moves from, say,
>> > > > > > > > > > > 1000000 to 1000001 reads.
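
A minimal sketch of that re-index rule. The thread does not say how the
boost is derived from the read count, so the logarithmic log_boost below is
purely an assumption for illustration, as are the function names. Only the
"more than 2%" threshold comes from the message.

```python
import math


def log_boost(reads):
    """Hypothetical boost function: grows with the logarithm of the reads."""
    return math.log10(1 + reads)


def boost_changed_enough(indexed_reads, current_reads, boost=log_boost,
                         threshold=0.02):
    """Return True when the boost moved by more than threshold (relative)."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        return new != 0  # any boost at all is a change from zero
    return abs(new - old) / abs(old) > threshold
```

With this assumed boost, a document going from 1000000 to 1000001 reads is
left alone, while one going from 100 to 200 reads is queued for re-indexing.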
>> > > > > > > > > > >
>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
>> > > > > > simo...@apache.org
>> > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > I'm planning to move a quite big Solr index to
>> > > > > > > > > > > > SolrCloud. However, in this index, an external file
>> > > > > > > > > > > > field is used for popularity ranking.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Does SolrCloud support external file fields? How does
>> > > > > > > > > > > > it cope with sharding and replication? Where should the
>> > > > > > > > > > > > external file be placed now that the index folder is
>> > > > > > > > > > > > not local but in the cloud?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Are there otherwise other best practices to deal with
>> > > > > > > > > > > > the use cases external file fields were used for, like
>> > > > > > > > > > > > popularity/ranking, in SolrCloud? Custom ValueSources
>> > > > > > > > > > > > going to something external?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks in advance,
>> > > > > > > > > > > > Simone
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > Sincerely yours
>> > > > > > > > > > Mikhail Khludnev
>> > > > > > > > > > Principal Engineer,
>> > > > > > > > > > Grid Dynamics
>> > > > > > > > > >
>> > > > > > > > > > <http://www.griddynamics.com>
>> > > > > > > > > >  <mkhlud...@griddynamics.com>
>> > > > > > > > > >
>> > > > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <m...@issuu.com
>> >
>> > > > > написал:
>> > > > > > > >
>> > > > > > > > > Hi Mikhail
>> > > > > > > > >
>> > > > > > > > > Please see answers below.
>> > > > > > > > >
>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>> > > > > > > > > mkhlud...@griddynamics.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Martin,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>> > > useful
>> > > > > for
>> > > > > > > > > > community.
>> > > > > > > > > > The first question might seems not really conscious, but
>> > > would
>> > > > > you
>> > > > > > > tell
>> > > > > > > > > me
>> > > > > > > > > > what blocks searching during EFF reload, when it's
>> > triggered
>> > > by
>> > > > > > > handler
>> > > > > > > > > or
>> > > > > > > > > > by listener?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > We continuously index new documents using CommitWithin to
>> get
>> > > > > regular
>> > > > > > > > > commits. However, we observed that the EFFs were not
>> re-read,
>> > > so
>> > > > we
>> > > > > > had
>> > > > > > > > to
>> > > > > > > > > do external commits (curl '.../solr/update?commit=true')
>> to
>> > > force
>> > > > > > > reload.
>> > > > > > > > > When this is done, solr blocks. I can't tell you exactly
>> why
>> > > it's
>> > > > > > doing
>> > > > > > > > > that (it was related to SOLR-3985).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I don't really get the sentence about sequential commits
>> > and
>> > > > > number
>> > > > > > > of
>> > > > > > > > > > cores. Do I get right that file is replicated via
>> > Zookeeper?
>> > > > > > Doesn't
>> > > > > > > it
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is observed behavior. When we issue a commit
>> on a
>> > > > > system
>> > > > > > > > with a
>> > > > > > > > > system with many solr cores using EFFs, the system blocks
>> > for a
>> > > > > long
>> > > > > > > time
>> > > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The
>> EFF
>> > > is a
>> > > > > > > symlink
>> > > > > > > > > from each cores index dir to the actual file, which is
>> > updated
>> > > by
>> > > > > an
>> > > > > > > > > external process.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > causes scalability problem or long time to reload? Will
>> it
>> > > help
>> > > > > if
>> > > > > > > > we'll
>> > > > > > > > > > have, let's say ExternalDatabaseField which will pull
>> > values
>> > > > from
>> > > > > > > jdbc.
>> > > > > > > > > ie.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I think the possibility of having some fields being
>> retrieved
>> > > > from
>> > > > > an
>> > > > > > > > > external, dynamically updatable store would be really
>> > > > interesting.
>> > > > > > This
>> > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
>> > > product
>> > > > > > (e.g.
>> > > > > > > > > Cassandra).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > why all cores can't read these values simultaneously?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is a solr implementation detail that I can't
>> > answer
>> > > > :)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Can you confirm that IDs in the file is ordered by the
>> > index
>> > > > term
>> > > > > > > > order?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > AFAIK it can impact load time.
>> > > > > > > > > >
>> > > > > > > > > Yes, it does.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Regarding your post-query solution can you tell me if
>> query
>> > > > found
>> > > > > > > 10000
>> > > > > > > > > > docs, but I need to display only first page with 100
>> rows,
>> > > > > whether
>> > > > > > I
>> > > > > > > > need
>> > > > > > > > > > to pull all 10K results to frontend to order them by the
>> > > rank?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > In our architecture, the clients query an API that
>> generates
>> > > the
>> > > > > SOLR
>> > > > > > > > > query, retrieves the relevant additional fields that we
>> > needs,
>> > > > and
>> > > > > > > > returns
>> > > > > > > > > the relevant JSON to the front-end.
>> > > > > > > > >
>> > > > > > > > > In our use case, results are returned from SOLR by the
>> 10's,
>> > > not
>> > > > by
>> > > > > > the
>> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr
>> returned
>> > > > > > thousands
>> > > > > > > of
>> > > > > > > > > results, it would be up to the implementation of the api
>> to
>> > > > augment
>> > > > > > > only
>> > > > > > > > > the results that needed to be returned to the front-end.
>> > > > > > > > >
>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>> > should
>> > > > be
>> > > > > > > > > possible.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I'm really appreciate if you comment on the questions
>> > above.
>> > > > > > > > > > PS: It's time to pitch, how much
>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>> "Commit-free
>> > > > > > > > > > ExternalFileField" can help you?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > It looks very interesting :) Does it make it possible to
>> > > avoid
>> > > > > > > > re-reading
>> > > > > > > > > the EFF on every commit, and only re-read the values that
>> > have
>> > > > > > actually
>> > > > > > > > > changed?
>> > > > > > > > >
>> > > > > > > > > /Martin
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>> > m...@issuu.com>
>> > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not
>> give
>> > you
>> > > > > what
>> > > > > > > > you're
>> > > > > > > > > > > hoping fore.
>> > > > > > > > > > >
>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>> > > > > > > > > > >
>> > > > > > > > > > > The EFF is placed in the parent of the index
>> directory in
>> > > > each
>> > > > > > > core;
>> > > > > > > > > each
>> > > > > > > > > > > core reads the entire EFF and picks out the IDs that
>> it
>> > is
>> > > > > > > > responsible
>> > > > > > > > > > for.
>> > > > > > > > > > >
>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
>> > (doesn't
>> > > > > answer
>> > > > > > > > > > queries)
>> > > > > > > > > > > while re-reading the EFF. Even worse, it seems that
>> the
>> > > time
>> > > > to
>> > > > > > > > re-read
>> > > > > > > > > > the
>> > > > > > > > > > > EFF is multiplied by the number of cores in use (i.e.
>> the
>> > > EFF
>> > > > > is
>> > > > > > > > > re-read
>> > > > > > > > > > by
>> > > > > > > > > > > each core sequentially). The contents of the EFF
>> become
>> > > > active
>> > > > > > > after
>> > > > > > > > > the
>> > > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work
>> here)
>> > > after
>> > > > > the
>> > > > > > > > file
>> > > > > > > > > > has
>> > > > > > > > > > > been updated.
>> > > > > > > > > > >
>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB -
>> and
>> > > we
>> > > > > use
>> > > > > > 16
>> > > > > > > > > > shards,
>> > > > > > > > > > > so when we triggered an external commit to force
>> > > re-reading,
>> > > > > the
>> > > > > > > > whole
>> > > > > > > > > > > system would block for several (10-15) minutes. This
>> > won't
>> > > > work
>> > > > > > in
>> > > > > > > a
>> > > > > > > > > > > production environment. The reason for the size of the
>> > EFF
>> > > is
>> > > > > > that
>> > > > > > > we
>> > > > > > > > > > have
>> > > > > > > > > > > around 7M documents in the index; each document has a
>> 45
>> > > > > > character
>> > > > > > > > ID.
>> > > > > > > > > > >
>> > > > > > > > > > > We got some help to try to fix the problem so that the
>> > > > > > > > > > > re-read of the EFF proceeds in the background (see
>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-3985
>> > > > > > > > > > > for a fix on the 4.1 branch). However, even though the
>> > > > > > > > > > > re-read proceeds in the background, launching Solr now
>> > > > > > > > > > > takes at least as long as re-reading the EFFs. Again,
>> > > > > > > > > > > this is not good enough for our needs.
>> > > > > > > > > > >
>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
>> > > > > > > > > > > (though you can return them as values using
>> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
>> > > > > > > > > > > branch; see
>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4022
>> > > > > > > > > > >
>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
>> > > > > > > > > > > great. Our solution is as follows: The actual value of
>> > > > > > > > > > > the popularity measure (say, reads) that we want to
>> > > > > > > > > > > report to the user is inserted into the search response
>> > > > > > > > > > > post-query by our query front-end. This value will then
>> > > > > > > > > > > be the authoritative value at the time of the query.
>> > > > > > > > > > > The value of the popularity measure that we use for
>> > > > > > > > > > > boosting in the ranking of the search results is only
>> > > > > > > > > > > updated when the value has changed enough so that the
>> > > > > > > > > > > impact on the boost will be significant (say, more than
>> > > > > > > > > > > 2%). This does require frequent re-indexing of the
>> > > > > > > > > > > documents that have significant changes in the number
>> > > > > > > > > > > of reads, but at least we won't have to update a
>> > > > > > > > > > > document if it moves from, say, 1000000 to 1000001
>> > > > > > > > > > > reads.
>> > > > > > > > > > >
>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni
>> > > > > > > > > > > <simo...@apache.org> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > I'm planning to move a quite big Solr index to
>> > > > > > > > > > > > SolrCloud. However, in this index, an external file
>> > > > > > > > > > > > field is used for popularity ranking.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Does SolrCloud support external file fields? How does
>> > > > > > > > > > > > it cope with sharding and replication? Where should
>> > > > > > > > > > > > the external file be placed now that the index folder
>> > > > > > > > > > > > is not local but in the cloud?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Otherwise, are there other best practices in SolrCloud
>> > > > > > > > > > > > for the use cases external file fields were used for,
>> > > > > > > > > > > > like popularity/ranking? Custom ValueSources going to
>> > > > > > > > > > > > something external?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks in advance,
>> > > > > > > > > > > > Simone
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > Sincerely yours
>> > > > > > > > > > Mikhail Khludnev
>> > > > > > > > > > Principal Engineer,
>> > > > > > > > > > Grid Dynamics
>> > > > > > > > > >
>> > > > > > > > > > <http://www.griddynamics.com>
>> > > > > > > > > >  <mkhlud...@griddynamics.com>
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>
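
A minimal sketch of the thresholded re-indexing described in the quoted
message above, for illustration only. The 2% threshold comes from that
message; the function names and the logarithmic boost formula are
assumptions made here, not the setup actually deployed.

```python
import math
from typing import Callable


def log_boost(reads: int) -> float:
    # Hypothetical boost function (an assumption, not taken from the thread):
    # a damped logarithm of the read count, offset so that 0 reads gives 1.0.
    return math.log10(reads + 10)


def needs_reindex(
    indexed_reads: int,
    current_reads: int,
    boost: Callable[[int], float] = log_boost,
    threshold: float = 0.02,
) -> bool:
    """Return True when the boost has drifted by more than `threshold`
    (relative change) since the document was last indexed."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        # Avoid division by zero; any non-zero new boost counts as a change.
        return new != 0
    return abs(new - old) / abs(old) > threshold


if __name__ == "__main__":
    # One extra read on a very popular document barely moves the boost.
    print(needs_reindex(1000000, 1000001))
    # A jump from 10 to 1000 reads changes the boost substantially.
    print(needs_reindex(10, 1000))
```

With a check like this, the front-end can keep reporting the exact read
count to the user, while only documents whose boost has really moved are
queued for re-indexing.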
