
Thomas Mueller updated OAK-6254:
--------------------------------
    Description: 
The estimated size of the datastore (on disk) is needed to:
* monitor growth over time, or the growth caused by certain operations
* monitor whether garbage collection is effective
* avoid running out of disk space
* estimate backup size
* gather statistics (for example, if there are many repositories, to group
them by size)

Datastore size: we could use the following heuristic: read the total size of
the files under ./datastore/00/00 (if that directory exists) and multiply by
65536, or the total size under ./datastore/00 and multiply by 256. Because
the FileDataStore names directories after the leading hex digits of the
content hash, the data is spread roughly uniformly across them, so one
directory is a representative sample. This gives a rough estimate (within
about 20% for repositories with a datastore larger than 50 GB).
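
A minimal sketch of this heuristic (assuming the standard FileDataStore
layout, where directory names are the leading hex digits of the content
hash; the class and method names are illustrative, not existing Oak API):

{code:java}
import java.io.File;

public class FileDataStoreSizeEstimator {

    /**
     * Estimate the total on-disk size of a FileDataStore by reading one
     * sample directory of the hash-based layout and extrapolating.
     * Returns -1 if neither sample directory exists ("unknown").
     */
    public static long estimateSize(File dataStoreRoot) {
        // Each ".../00/00" subtree holds roughly 1/65536 of the data,
        // each ".../00" subtree roughly 1/256, because file names are
        // content hashes and therefore close to uniformly distributed.
        File sample = new File(dataStoreRoot, "00/00");
        if (sample.isDirectory()) {
            return directorySize(sample) * 65536L;
        }
        sample = new File(dataStoreRoot, "00");
        if (sample.isDirectory()) {
            return directorySize(sample) * 256L;
        }
        return -1;
    }

    /** Recursively sum the sizes of all files below the given directory. */
    private static long directorySize(File dir) {
        long total = 0;
        File[] children = dir.listFiles();
        if (children == null) {
            return total;
        }
        for (File child : children) {
            total += child.isDirectory() ? directorySize(child) : child.length();
        }
        return total;
    }
}
{code}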

I think this is mainly important for the FileDataStore. For the S3 datastore,
if there is a simple and fast S3 API to read the total size, then supporting
that would be good as well; if there is none, then returning "unknown" is
fine for me.
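
For illustration only, here is what a listing-based sum would look like with
the AWS SDK for Java (v1); it walks every object, so it is O(number of
objects) and probably does not qualify as "simple and fast" for large
buckets:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3SizeEstimator {

    /**
     * Sum the sizes of all objects in the given bucket by paging through
     * the object listing. Slow for buckets with many objects.
     */
    public static long totalSize(AmazonS3 s3, String bucketName) {
        long total = 0;
        ObjectListing listing = s3.listObjects(bucketName);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                total += summary.getSize();
            }
            if (!listing.isTruncated()) {
                return total;
            }
            // Fetch the next page of up to 1000 object summaries.
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}
{code}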

As for the API, I would use something like this: {{long
getEstimatedStorageSize(int accuracyLevel)}} with accuracyLevel 1 for
inaccurate (fastest), 2 for more accurate (slower), ..., 9 for precise
(possibly very slow). This is similar to
[java.util.zip.Deflater.setLevel|https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#setLevel(int)].
I would expect it to take up to 1 second for accuracyLevel 1, up to 5 seconds
for level 2, and possibly hours for level 9. With level 1, I would read the
files in 00/00; with levels 2 to 8, I would read the files in 00; and with
level 9, I would read all the files. For level 1, I wouldn't stop early; for
level 2, if it takes more than 5 seconds, I would stop and return the current
best estimate.
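
A sketch of the proposed method signature (the interface name and the
SIZE_UNKNOWN constant are placeholders, not existing Oak API):

{code:java}
public interface EstimatedStorageSize {

    /**
     * Returned when no estimate can be computed, e.g. for backends
     * without a fast way to read the total size.
     */
    long SIZE_UNKNOWN = -1;

    /**
     * Estimate the on-disk storage size of this datastore, in bytes.
     *
     * @param accuracyLevel 1 = inaccurate but fastest (sample 00/00),
     *        2 to 8 = more accurate (sample 00, stop after a time limit),
     *        9 = precise but possibly very slow (read all files);
     *        similar to java.util.zip.Deflater.setLevel(int)
     * @return the estimated size in bytes, or SIZE_UNKNOWN
     */
    long getEstimatedStorageSize(int accuracyLevel);
}
{code}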

> DataStore: API to retrieve approximate storage size
> ---------------------------------------------------
>
>                 Key: OAK-6254
>                 URL: https://issues.apache.org/jira/browse/OAK-6254
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: blob
>            Reporter: Thomas Mueller
>