[ https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260257#comment-14260257 ]
Erick Erickson commented on SOLR-6888:
--------------------------------------

Right, thanks!

> Decompressing documents on first-pass distributed queries to get docId is
> inefficient, use indexed values instead?
> ------------------------------------------------------------------------
>
> Key: SOLR-6888
> URL: https://issues.apache.org/jira/browse/SOLR-6888
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 5.0, Trunk
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Attachments: SOLR-6888-hacktiming.patch
>
> Assigning this to myself just so I don't lose track of it, but I won't be
> working on it in the near term; anyone feeling ambitious should feel free
> to grab it.
>
> Note: docId as used here is whatever is defined for <uniqueKey>...
>
> Since Solr 4.1, the compression/decompression process has been based on 16K
> blocks; it is automatic and not configurable. So to fetch a single stored
> value, one must decompress at least an entire 16K block.
>
> For SolrCloud (and distributed processing in general), we make two trips:
> one to get the doc ID and score (or other sort criteria), and one to return
> the actual data.
>
> The first pass requires that each sub-request return its top N doc IDs and
> sort criteria, which means each and every sub-request has to unpack at
> least one 16K block (and sometimes more) just to get the doc ID. So if we
> have 20 shards and only want 20 rows, 95% of the decompression cycles will
> be wasted, not to mention all the disk reads.
>
> It seems like we should be able to do better than that. Can we argue that
> doc IDs are 'special' and should be cached somehow? Let's discuss what this
> would look like. I can think of a couple of approaches:
>
> 1> Since doc IDs are "special", can we say that for this purpose returning
> the indexed version is OK?
> We'd need to return the actual stored value when the full doc is requested,
> but for the sub-request, what about returning the indexed value instead of
> the stored one? On the surface I don't see a problem here, but what do I
> know? Storing these as DocValues seems useful in this case.
>
> 1a> A variant is treating numeric docIds specially, since there the indexed
> value and the stored value should be identical. DocValues would seem useful
> here too. But this looks like an unnecessary specialization if <1> is
> implemented well.
>
> 2> We could cache individual doc IDs, although I'm not sure what use that
> really is. Would maintaining the cache overwhelm the savings of not
> decompressing? I really don't like this idea, but am throwing it out there.
> Populating such a cache from stored data up front would essentially mean
> decompressing every doc, so that seems untenable.
>
> 3> We could maintain an array[maxDoc] that holds document IDs, perhaps
> lazily initialized. I'm not particularly a fan of this either; it doesn't
> seem like a Good Thing. I can see lazy loading being almost, but not quite
> totally, useless, i.e. a hit ratio near 0, especially since the array would
> be thrown out on every openSearcher.
>
> Really, the only one of these that seems viable is <1>/<1a>. The others
> would all involve decompressing the docs anyway to get the ID, and I
> suspect that caching would be of very limited usefulness. I guess <1>'s
> viability hinges on whether, for internal use, the indexed form of the
> docId is interchangeable with the stored value.
>
> Or are there other ways to approach this? Or isn't it something to really
> worry about?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
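The "95% wasted" figure above follows from simple arithmetic: with 20 shards each decompressing at least one block per candidate doc, the first pass touches 20 x 20 = 400 docs, while only the merged top 20 are ever returned. A back-of-envelope sketch (my own illustration, not part of the attached patch):

```java
public class FirstPassWaste {
    // Percentage of first-pass block decompressions whose doc never makes
    // the final merged top-N. Assumes each shard decompresses at least one
    // 16K block per candidate doc just to read its uniqueKey.
    static double wastedPercent(int shards, int rows) {
        int decompressed = shards * rows; // every shard returns its own top-N
        int needed = rows;                // only the merged top-N survive
        return 100.0 * (decompressed - needed) / decompressed;
    }

    public static void main(String[] args) {
        // The issue's example: 20 shards, 20 rows requested.
        System.out.println("wasted: " + wastedPercent(20, 20) + "%"); // 95.0
    }
}
```

The waste grows with shard count: at 100 shards the same request would discard 99% of the decompression work.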
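For option <1>, the schema change is small: enable docValues on the uniqueKey field so the first pass can read IDs columnar-style without touching the stored-field blocks. A minimal schema.xml sketch (field and type names are illustrative, not prescribed by this issue):

```xml
<!-- Hypothetical schema.xml fragment: docValues on the uniqueKey field
     lets the first-pass sort/ID retrieval avoid stored-field decompression. -->
<field name="id" type="string" indexed="true" stored="true"
       docValues="true" required="true"/>
<uniqueKey>id</uniqueKey>
```

Whether the docValues (i.e. indexed-form) representation is interchangeable with the stored value for internal use is exactly the open question raised in <1>.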