[ https://issues.apache.org/jira/browse/SOLR-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854890#comment-16854890 ]
Andrzej Bialecki commented on SOLR-13512:
------------------------------------------

This patch contains an {{IndexSizeEstimator}} tool, which is both a command-line utility and a Solr component. It provides the functionality described in the issue. The patch also contains extensions to {{/admin/<collection>/segments}} and {{/admin/collections?action=COLSTATUS}} that efficiently report this data from the live {{IndexReader}} instances of a collection.

Here's an example of COLSTATUS output that contains just the overview section (the collection contains a partial Wikipedia dump):

{code}
curl 'http://localhost:8983/solr/admin/collections?action=COLSTATUS&rawSizeInfo=true'
{
  "responseHeader": {
    "status": 0,
    "QTime": 49406
  },
  "gettingstarted": {
    ...
    "shards": {
      "shard1": {
        ...
        "leader": {
          ...
          "segInfos": {
            ...
            "rawSize": {
              "fieldsBySize": {
                "761 MB": "revision.text",
                "88.7 MB": "revision.text_str",
                "29.4 MB": "revision",
                "26.4 MB": "revision.sha1",
                "24.8 MB": "revision.comment",
                "18.9 MB": "revision.comment_str",
                "13.5 MB": "title",
                "12.5 MB": "revision.contributor",
                "11.9 MB": "revision.sha1_str",
                "9.2 MB": "revision.timestamp",
                "8.8 MB": "revision.contributor.id",
                "7.3 MB": "revision.format",
                "7.1 MB": "id",
                "6.8 MB": "revision.parentid",
                "6.3 MB": "revision.contributor.username",
                "6.1 MB": "revision.model",
                "4.6 MB": "title_str",
                "4.3 MB": "revision.format_str",
                "3.8 MB": "revision.contributor.username_str",
                "3.1 MB": "_version_",
                "2.8 MB": "revision.model_str",
                "2.7 MB": "revision.contributor_str",
                ...
              }
            }
          }
        }
      },
      "shard2": {
        ...
        "leader": {
          ...
          "segInfos": {
            ...
            "rawSize": {
              "fieldsBySize": {
                "769.4 MB": "revision.text",
                "89.2 MB": "revision.text_str",
                "31.2 MB": "revision",
                "28 MB": "revision.sha1",
                "26.4 MB": "revision.comment",
                "20.7 MB": "revision.comment_str",
                "14.2 MB": "title",
                "13.3 MB": "revision.contributor",
                "12.6 MB": "revision.sha1_str",
                "9.8 MB": "revision.timestamp",
                "9.4 MB": "revision.contributor.id",
                "7.7 MB": "revision.format",
                "7.6 MB": "id",
                "6.9 MB": "revision.parentid",
                "6.7 MB": "revision.contributor.username",
                "6.5 MB": "revision.model",
                "4.7 MB": "title_str",
                "4.5 MB": "revision.format_str",
                "3.9 MB": "revision.contributor.username_str",
                "3.3 MB": "_version_",
                "2.9 MB": "revision.contributor_str",
                ...
              }
            }
          }
        }
      }
    }
  }
}
{code}

I attached outputs from the command: one provides a summary breakdown per type of data, and the other a much more detailed breakdown that includes per-field and per-type statistical summaries.

> Raw index data analysis tool
> ----------------------------
>
> Key: SOLR-13512
> URL: https://issues.apache.org/jira/browse/SOLR-13512
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Attachments: SOLR-13512.patch
>
>
> A common question from Solr users is how much a given schema field and all
> its related index data contribute to the total index size.
> It's possible to estimate this information by doing a single full pass
> through all index data, aggregating the estimated sizes of terms, postings,
> doc values and stored fields. The totals of course represent the worst-case
> scenario, where there's no index compression at all, but they should still
> be useful for answering the question above.