[ 
https://issues.apache.org/jira/browse/SOLR-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854890#comment-16854890
 ] 

Andrzej Bialecki  commented on SOLR-13512:
------------------------------------------

This patch contains an {{IndexSizeEstimator}} tool, which is both a 
command-line utility and a Solr component, and which provides the 
functionality described in the issue. The patch also extends 
{{/admin/<collection>/segments}} and {{/admin/collections?action=COLSTATUS}} to 
efficiently report this data from the live {{IndexReader}}-s of a collection.

Here's an example output of COLSTATUS that contains just the overview section 
(the collection contains a partial Wikipedia dump):

{code}
curl 'http://localhost:8983/solr/admin/collections?action=COLSTATUS&rawSizeInfo=true'

{
    "responseHeader": {
        "status": 0,
        "QTime": 49406
    },
    "gettingstarted": {
...
        "shards": {
            "shard1": {
...
                "leader": {
...
                    "segInfos": {
...
                        "rawSize": {
                            "fieldsBySize": {
                                "761 MB": "revision.text",
                                "88.7 MB": "revision.text_str",
                                "29.4 MB": "revision",
                                "26.4 MB": "revision.sha1",
                                "24.8 MB": "revision.comment",
                                "18.9 MB": "revision.comment_str",
                                "13.5 MB": "title",
                                "12.5 MB": "revision.contributor",
                                "11.9 MB": "revision.sha1_str",
                                "9.2 MB": "revision.timestamp",
                                "8.8 MB": "revision.contributor.id",
                                "7.3 MB": "revision.format",
                                "7.1 MB": "id",
                                "6.8 MB": "revision.parentid",
                                "6.3 MB": "revision.contributor.username",
                                "6.1 MB": "revision.model",
                                "4.6 MB": "title_str",
                                "4.3 MB": "revision.format_str",
                                "3.8 MB": "revision.contributor.username_str",
                                "3.1 MB": "_version_",
                                "2.8 MB": "revision.model_str",
                                "2.7 MB": "revision.contributor_str",
...
                            }
                        }
                    }
                }
            },
            "shard2": {
...
                "leader": {
...
                    "segInfos": {
...
                        "rawSize": {
                            "fieldsBySize": {
                                "769.4 MB": "revision.text",
                                "89.2 MB": "revision.text_str",
                                "31.2 MB": "revision",
                                "28 MB": "revision.sha1",
                                "26.4 MB": "revision.comment",
                                "20.7 MB": "revision.comment_str",
                                "14.2 MB": "title",
                                "13.3 MB": "revision.contributor",
                                "12.6 MB": "revision.sha1_str",
                                "9.8 MB": "revision.timestamp",
                                "9.4 MB": "revision.contributor.id",
                                "7.7 MB": "revision.format",
                                "7.6 MB": "id",
                                "6.9 MB": "revision.parentid",
                                "6.7 MB": "revision.contributor.username",
                                "6.5 MB": "revision.model",
                                "4.7 MB": "title_str",
                                "4.5 MB": "revision.format_str",
                                "3.9 MB": "revision.contributor.username_str",
                                "3.3 MB": "_version_",
                                "2.9 MB": "revision.contributor_str",
...
                            }
                        }
                    }
                }
            }
        }
    }
}
{code}
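
To compare fields across shards, the {{fieldsBySize}} sections of a response 
like the one above can be added up per field. The following is only a sketch: 
the helper names ({{parse_size}}, {{find_fields_by_size}}, 
{{totals_per_field}}) are hypothetical, the unit table is an assumption based 
on the strings visible in the sample, and the search is recursive because parts 
of the response structure are elided in the sample.

```python
import re

# Assumption: sizes are printed with these unit labels and binary multiples.
# Only "MB" is actually visible in the sample output.
_UNITS = {"bytes": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a string such as '88.7 MB' into an approximate byte count."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def find_fields_by_size(node, path=()):
    """Yield (path, mapping) for every 'fieldsBySize' object in the response."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "fieldsBySize" and isinstance(value, dict):
                yield path, value
            else:
                yield from find_fields_by_size(value, path + (key,))
    elif isinstance(node, list):
        for item in node:
            yield from find_fields_by_size(item, path)


def totals_per_field(response):
    """Sum the estimated raw size of each field over all sections found."""
    totals = {}
    for _path, mapping in find_fields_by_size(response):
        # In the sample, keys are size strings and values are field names.
        for size_text, field in mapping.items():
            totals[field] = totals.get(field, 0) + parse_size(size_text)
    return totals
```

With the response body parsed by {{json.loads}}, {{totals_per_field(parsed)}} 
returns a dict of field name to approximate bytes. Because the sizes appear as 
keys, {{json.loads}} would keep only the last of two fields that share the same 
displayed size; whether a response can contain such duplicates is not clear 
from the sample.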

I attached outputs from the command that provide a summary breakdown per type 
of data and a very detailed breakdown, including per-field and per-type 
statistical summaries.

> Raw index data analysis tool
> ----------------------------
>
>                 Key: SOLR-13512
>                 URL: https://issues.apache.org/jira/browse/SOLR-13512
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Major
>         Attachments: SOLR-13512.patch
>
>
> A common question from Solr users is how a given schema field and all its 
> related index data contribute to the total index size.
> It's possible to estimate this information by doing a single full pass 
> through all index data, aggregating estimated sizes of terms, postings, doc 
> values and stored fields. The totals of course represent the worst-case 
> scenario, with no index compression at all, but they should still be 
> useful for answering the question above.
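
The single-pass aggregation idea from the description can be illustrated with 
a toy example. This is not how the patch works internally: it runs over plain 
Python dicts instead of Lucene index structures, the function name 
{{estimate_raw_sizes}} is hypothetical, and it counts only the uncompressed 
UTF-8 bytes of field values, leaving out terms, postings and doc values.

```python
from collections import defaultdict


def estimate_raw_sizes(docs):
    """One pass over all documents, summing uncompressed bytes per field."""
    sizes = defaultdict(int)
    for doc in docs:
        for field, value in doc.items():
            # Treat single values and multi-valued fields uniformly.
            values = value if isinstance(value, list) else [value]
            for v in values:
                sizes[field] += len(str(v).encode("utf-8"))
    return dict(sizes)
```

As in the description, the totals are an upper bound in the sense that no 
compression is taken into account.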



