[ 
https://issues.apache.org/jira/browse/JENA-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Seaborne updated JENA-804:
-------------------------------
    Description: 
Our product is built on Jena TDB and both inserts quads into and deletes quads 
from it.  We understand the architectural decision to favor performance over 
space by not cleaning up deleted nodeids from the indexes. However, our disk 
usage suggests that Jena TDB is not reusing space that it allocated 
previously.  Given the following comment, something appears to be wrong with 
file space utilization, 
http://mail-archives.apache.org/mod_mbox/jena-users/201310.mbox/%3cce7d7929.2a707%[email protected]%3E:
 "The indexes won't shrink - TDB never gives disk space back to the OS -  but 
disk space is reused when reallocated within the same JVM.".

In this scenario, running in a single JVM with no server stops or starts, we 
add 27765 graphs to IndexTdb and immediately remove them, repeating this 
process several times. 
{noformat}
             MB    Bytes        Diff (Bytes)
Start        193   203239424

Reindex 5    249   262066176    58826752
Reindex 6    249   262086656    20480
Reindex 10   298   312500224    50413568
Reindex 11   298   312520704    20480
Reindex 12   298   312541184    20480
Reindex 13   298   312586240    45056
Reindex 14   306   320995328    8409088
Reindex 15   330   346181632    25186304
Reindex 16   330   346198538    16906
Reindex 17   346   362999808    16801270
Reindex 18   346   363020288    20480
Reindex 19   346   363040768    20480
Reindex 20   346   363061248    20480
Reindex 21   346   363081728    20480
Reindex 22   354   371490816    8409088
Reindex 23   378   396677120    25186304

End          193   203239424
{noformat}
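
As a cross-check of the table, the MB and Diff columns can be recomputed from the Bytes column. The following is a minimal Python sketch, not part of the original report: the name `growth_rows` is hypothetical, the MB column is assumed to be whole mebibytes (bytes divided by 1024*1024, rounded down), and each diff is assumed to be taken against the previous listed row (so Reindex 10 is compared with Reindex 6, since runs 7 to 9 are not listed). The End row is left out because it was measured after the export and reload.

```python
# Byte sizes copied from the Bytes column of the table (End row omitted).
SIZES = [
    ("Start", 203239424),
    ("Reindex 5", 262066176),
    ("Reindex 6", 262086656),
    ("Reindex 10", 312500224),
    ("Reindex 11", 312520704),
    ("Reindex 12", 312541184),
    ("Reindex 13", 312586240),
    ("Reindex 14", 320995328),
    ("Reindex 15", 346181632),
    ("Reindex 16", 346198538),
    ("Reindex 17", 362999808),
    ("Reindex 18", 363020288),
    ("Reindex 19", 363040768),
    ("Reindex 20", 363061248),
    ("Reindex 21", 363081728),
    ("Reindex 22", 371490816),
    ("Reindex 23", 396677120),
]


def growth_rows(sizes):
    """Return (label, whole MiB, bytes, growth since the previous listed row)."""
    rows = []
    previous = None
    for label, size in sizes:
        # The first row has no predecessor, so its diff is left empty.
        diff = None if previous is None else size - previous
        rows.append((label, size // (1024 * 1024), size, diff))
        previous = size
    return rows


for label, mb, size, diff in growth_rows(SIZES):
    diff_text = "" if diff is None else str(diff)
    print(f"{label:<12}{mb:>5}{size:>12}  {diff_text}")
```

Under these assumptions the recomputed values agree with the MB and Diff columns as reported.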

The system starts with 193MB of data allocated by indexTdb.  A reindex consists 
of a remove followed by an add of these graphs. As the data shows, the size of 
indexTdb on disk increases dramatically after repeatedly removing and adding 
graphs.  After Reindex 23, 378 MB of disk space is in use.  If Jena TDB reused 
allocated space, there would be no need to allocate more space beyond what the 
deleted node ids occupy (unless nodeid storage is eating all of this space?).  
Jena does not appear to be reusing the allocated disk space.  At the very end 
of this scenario, we exported the nquads and reloaded them, which brought the 
disk usage back to the original 193MB. 
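
The report does not say how the on-disk size of indexTdb was measured. One way to take a comparable measurement after each reindex is to sum the file sizes under the database directory. The following is a minimal Python sketch under that assumption; `directory_size_bytes` is a hypothetical helper, not part of Jena.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            # getsize reports the logical file length, which can differ from
            # the blocks actually allocated on disk (for example sparse files).
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling `directory_size_bytes` on the TDB directory after each remove-and-add cycle, and subtracting the previous result, would give figures comparable to the Bytes and Diff columns.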

We believe Jena TDB is not reusing the space allocated by the TDB file system 
within the same JVM.

  was:
Our product is built on Jena TDB and both inserts quads into and deletes quads 
from it.  We understand the architectural decision to favor performance over 
space by not cleaning up deleted nodeids from the indexes. However, our disk 
usage suggests that Jena TDB is not reusing space that it allocated 
previously.  Given the following comment, something appears to be wrong with 
file space utilization, 
http://mail-archives.apache.org/mod_mbox/jena-users/201310.mbox/%3cce7d7929.2a707%[email protected]%3E:
 "The indexes won't shrink - TDB never gives disk space back to the OS -  but 
disk space is reused when reallocated within the same JVM.".

In this scenario, running in a single JVM with no server stops or starts, we 
add 27765 graphs to IndexTdb and immediately remove them, repeating this 
process several times. 
             MB    Bytes        Diff (Bytes)
Start        193   203239424

Reindex 5    249   262066176    58826752
Reindex 6    249   262086656    20480
Reindex 10   298   312500224    50413568
Reindex 11   298   312520704    20480
Reindex 12   298   312541184    20480
Reindex 13   298   312586240    45056
Reindex 14   306   320995328    8409088
Reindex 15   330   346181632    25186304
Reindex 16   330   346198538    16906
Reindex 17   346   362999808    16801270
Reindex 18   346   363020288    20480
Reindex 19   346   363040768    20480
Reindex 20   346   363061248    20480
Reindex 21   346   363081728    20480
Reindex 22   354   371490816    8409088
Reindex 23   378   396677120    25186304

End          193   203239424

The system starts with 193MB of data allocated by indexTdb.  A reindex consists 
of a remove followed by an add of these graphs. As the data shows, the size of 
indexTdb on disk increases dramatically after repeatedly removing and adding 
graphs.  After Reindex 23, 378 MB of disk space is in use.  If Jena TDB reused 
allocated space, there would be no need to allocate more space beyond what the 
deleted node ids occupy (unless nodeid storage is eating all of this space?).  
Jena does not appear to be reusing the allocated disk space.  At the very end 
of this scenario, we exported the nquads and reloaded them, which brought the 
disk usage back to the original 193MB. 

We believe Jena TDB is not reusing the space allocated by the TDB file system 
within the same JVM.


> Jena is not reusing already allocated space on the file system which results 
> in large amounts of disk space reserved by Jena files
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JENA-804
>                 URL: https://issues.apache.org/jira/browse/JENA-804
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: Jena
>    Affects Versions: TDB 1.0.2
>         Environment: Windows 7, IBM JRE 1.7, Tomcat 7.0.54
>            Reporter: Keith Wells
>
> Our product is built on Jena TDB and both inserts quads into and deletes 
> quads from it.  We understand the architectural decision to favor 
> performance over space by not cleaning up deleted nodeids from the indexes. 
> However, our disk usage suggests that Jena TDB is not reusing space that it 
> allocated previously.  Given the following comment, something appears to be 
> wrong with file space utilization, 
> http://mail-archives.apache.org/mod_mbox/jena-users/201310.mbox/%3cce7d7929.2a707%[email protected]%3E:
>  "The indexes won't shrink - TDB never gives disk space back to the OS -  but 
> disk space is reused when reallocated within the same JVM.".
> In this scenario, running in a single JVM with no server stops or starts, we 
> add 27765 graphs to IndexTdb and immediately remove them, repeating this 
> process several times. 
> {noformat}
>              MB    Bytes        Diff (Bytes)
> Start        193   203239424
>
> Reindex 5    249   262066176    58826752
> Reindex 6    249   262086656    20480
> Reindex 10   298   312500224    50413568
> Reindex 11   298   312520704    20480
> Reindex 12   298   312541184    20480
> Reindex 13   298   312586240    45056
> Reindex 14   306   320995328    8409088
> Reindex 15   330   346181632    25186304
> Reindex 16   330   346198538    16906
> Reindex 17   346   362999808    16801270
> Reindex 18   346   363020288    20480
> Reindex 19   346   363040768    20480
> Reindex 20   346   363061248    20480
> Reindex 21   346   363081728    20480
> Reindex 22   354   371490816    8409088
> Reindex 23   378   396677120    25186304
>
> End          193   203239424
> {noformat}
> The system starts with 193MB of data allocated by indexTdb.  A reindex 
> consists of a remove followed by an add of these graphs. As the data shows, 
> the size of indexTdb on disk increases dramatically after repeatedly 
> removing and adding graphs.  After Reindex 23, 378 MB of disk space is in 
> use.  If Jena TDB reused allocated space, there would be no need to allocate 
> more space beyond what the deleted node ids occupy (unless nodeid storage is 
> eating all of this space?).  Jena does not appear to be reusing the 
> allocated disk space.  At the very end of this scenario, we exported the 
> nquads and reloaded them, which brought the disk usage back to the original 
> 193MB. 
> We believe Jena TDB is not reusing the space allocated by the TDB file system 
> within the same JVM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
