[ https://issues.apache.org/jira/browse/OAK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davide Giannella updated OAK-7193: ---------------------------------- Fix Version/s: (was: 1.14.0) > DataStore: API to retrieve statistic (file headers, size estimation) > -------------------------------------------------------------------- > > Key: OAK-7193 > URL: https://issues.apache.org/jira/browse/OAK-7193 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: blob > Reporter: Thomas Mueller > Priority: Major > Fix For: 1.16.0 > > > Extension of OAK-6254: in addition to retrieving the size, it would be good > to retrieve the estimated number and total size per file type. A simple (and > in my view sufficient) solution is to use the first few bytes ("magic > numbers", 2 bytes should be enough) to get the file type. That would allow to > estimate, for example, the number of, and total size, of PDF files, JPEG, > Lucene index and so on. A histogram would be nice as well, but I think is not > needed. > To speed up calculation, the blob ID could be extended with the first 2 bytes > of the file content, that is: <hash>#<length>@<magic> where magic is the > first two bytes, in hex. That would allow to quickly get the data from the > blob ids (no need to actually read content). > Sampling should be enough. The longer it takes, the more accurate the data. > We could store the data while doing datastore GC, in which case the returned > data would be somewhat stale; that's OK. -- This message was sent by Atlassian JIRA (v7.6.3#76005)