[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622013#comment-14622013 ]

Julian Reschke commented on OAK-3085:

Adding a new column at this point will create a backwards-compatibility problem with existing DBs. Do we really need it?

Add timestamp property to journal entries

Key: OAK-3085
URL: https://issues.apache.org/jira/browse/OAK-3085
Project: Jackrabbit Oak
Issue Type: Improvement
Components: core, mongomk
Affects Versions: 1.2.2, 1.3.2
Reporter: Stefan Egli
Fix For: 1.2.3, 1.3.3
Attachments: OAK-3085.patch, OAK-3085.v2.patch

OAK-3001 is about improving the JournalGarbageCollector by querying on a separated-out timestamp property (rather than the id that encapsulates the timestamp). In order to remove OAK-3001 as a blocker ticket from the 1.2.3 release, this ticket is about adding a timestamp property to the journal entry without making use of it yet. Later on, when OAK-3001 is tackled, the timestamp property will already exist and migration will no longer be an issue (as 1.2.3 introduces the journal entry for the first time).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
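For context on why a separate timestamp property helps: journal entry ids in the DocumentMK embed a revision string whose first segment is a hex-encoded millisecond timestamp, so garbage collection by age currently has to parse ids instead of running a simple range query on a {{_ts}} column. A minimal illustrative sketch; the id format shown (r&lt;hex-millis&gt;-&lt;counter&gt;-&lt;clusterId&gt;) follows Oak's revision scheme, but the class and method names here are hypothetical, not Oak's actual code:

```java
// Sketch only: extracting the creation time that is encapsulated in a
// revision-style id such as "r14ff65348d8-0-1". The hex segment after 'r'
// is the timestamp in milliseconds since the epoch.
class JournalIdTimestamp {

    static long timestampFromId(String id) {
        int dash = id.indexOf('-');
        // skip the leading 'r', parse the hex millisecond timestamp
        return Long.parseLong(id.substring(1, dash), 16);
    }
}
```

A dedicated {{_ts}} property would make this parsing unnecessary: the collector could issue `_ts < cutoff` directly against the store.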
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622076#comment-14622076 ]

Stefan Egli commented on OAK-3085:

[~reschke], is {{_modified}} a unix timestamp too (as is the newly suggested {{_ts}})? If it is, what would RDB do if {{JournalEntry.asUpdateOp}} tried to explicitly set {{_modified}} (which I assume RDBDocumentStore already does somehow itself)?
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622086#comment-14622086 ]

Julian Reschke commented on OAK-3085:

It's a unix timestamp with 5s resolution, maintained by the DocumentStore.
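The 5s resolution mentioned above means {{_modified}} is stored in seconds, rounded down to a multiple of five, so all writes inside the same 5-second window share one value. A sketch of that rounding (illustrative; the DocumentStore maintains the value itself, and the helper name here is an assumption):

```java
// Sketch of a 5-second-resolution "modified" value as described in the
// comment above: a millisecond clock value is bucketed into whole seconds
// that are a multiple of the resolution.
class ModifiedResolution {
    static final int RESOLUTION_SECS = 5;

    // Round a millisecond timestamp down to its 5-second bucket, in seconds.
    static long modifiedInSecs(long timeMillis) {
        return timeMillis / 1000 / RESOLUTION_SECS * RESOLUTION_SECS;
    }
}
```

A garbage-collection query for "entries older than T" then only needs to round its cutoff the same way and compare against the stored value.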
[jira] [Commented] (OAK-3090) Caching BlobStore implementation
[ https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622215#comment-14622215 ]

Chetan Mehrotra commented on OAK-3090:

[~tmueller] Would it be possible to add support for a [RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html] to {{CacheLIRS}}? Then we could use that to implement the caching described above.

Caching BlobStore implementation

Key: OAK-3090
URL: https://issues.apache.org/jira/browse/OAK-3090
Project: Jackrabbit Oak
Issue Type: New Feature
Components: blob
Reporter: Chetan Mehrotra
Fix For: 1.3.4

Storing binaries in Mongo puts lots of pressure on MongoDB for reads. To reduce the read load it would be useful to have a filesystem-based cache of frequently used binaries. This would be similar to CachingFDS (OAK-3005) but would be implemented on top of the BlobStore API.

Requirements
* Specify the max binary size which can be cached on the file system
* Limit the total size of all binary content present in the cache
[jira] [Commented] (OAK-3090) Caching BlobStore implementation
[ https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622214#comment-14622214 ]

Chetan Mehrotra commented on OAK-3090:

Given we already use Guava in Oak, it might be better to just make use of the Guava cache [1] (or CacheLIRS if it supports a removal listener) and have a simple CachingDataStore implementation. Have a look at {{DataStoreBlobStore#getInputStream}}, where we already do some on-heap caching for small binaries. Extrapolating that design in the following way would allow us to implement a simple FS-based caching layer:
# Have a new cache where the cached value is a File (or some instance which keeps a reference to a File)
# Provide support for weight via a simple [Weigher|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/Weigher.html] based on file size
# Register a [RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html] which removes the file from the file system upon eviction
# Provide a [CacheLoader|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/CacheLoader.html] which spools the remote binary to the local filesystem

This is a small amount of logic and would provide all the benefits of the Guava cache, including cache stats. And it would transparently work for any DataStore. Maybe we should implement it at the {{BlobStore}} level itself; then it would be useful for other BlobStores too. Doing it at the BlobStore level would require some support from {{BlobStore}} to determine the blob length from the blobId itself.

[1] https://code.google.com/p/guava-libraries/wiki/CachesExplained
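The four steps above can be sketched end to end. The following is a hypothetical, JDK-only illustration of the proposed design; in the actual proposal Guava's CacheBuilder with a Weigher, RemovalListener and CacheLoader would play these roles. All class and method names are invented for illustration, and file I/O is stubbed out as sizes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a size-bounded, file-backed blob cache.
class FileBlobCache {
    // CacheLoader analogue: spools the blob to disk, returns the file size.
    interface Loader { long load(String blobId); }

    private final long maxBytes;       // cap on total cached bytes
    private final Loader loader;
    private long totalBytes;
    // Access-order map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Long> sizes =
            new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for a RemovalListener: records which blobs were evicted.
    private final List<String> evicted = new ArrayList<>();

    FileBlobCache(long maxBytes, Loader loader) {
        this.maxBytes = maxBytes;
        this.loader = loader;
    }

    // On miss, spool the blob locally and account for its weight (= file size).
    long get(String blobId) {
        Long size = sizes.get(blobId);
        if (size == null) {
            size = loader.load(blobId);
            sizes.put(blobId, size);
            totalBytes += size;        // Weigher analogue
            evictIfNeeded();
        }
        return size;
    }

    // Evict least-recently-used entries until the total weight fits.
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, Long>> it = sizes.entrySet().iterator();
        while (totalBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, Long> eldest = it.next();
            totalBytes -= eldest.getValue();
            evicted.add(eldest.getKey()); // RemovalListener would delete the file here
            it.remove();
        }
    }

    List<String> evictedIds() { return evicted; }
}
```

The Guava version replaces all of this bookkeeping with `CacheBuilder.maximumWeight(...)`, a `Weigher` returning the file size, a `RemovalListener` deleting the file, and a `CacheLoader` doing the spooling, and adds cache statistics for free.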
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622095#comment-14622095 ]

Chetan Mehrotra commented on OAK-3085:

bq. Adding a new column at this point will create a backwards compatibility problem with existing DBs. Do we really need it?

[~julian.resc...@gmx.de] This column would be added to the table created for JournalEntry, which is a new table being introduced. Would it still pose a backwards-compatibility problem? Also, it seems that RDBDocumentStore is using the same schema for all collections [1], which looks incorrect.

[1] http://markmail.org/thread/xratik7dsrw3o7og
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622094#comment-14622094 ]

Stefan Egli commented on OAK-3085:

Hm, but in the Mongo case I don't see this being created for the journal case.
[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175 ]

Chetan Mehrotra commented on OAK-2892:

Done initial implementation in http://svn.apache.org/r1690247. [~tmueller] Can you review the commit to see if the comments you made are addressed? If anything needs to be changed there, let me know.

Speed up lucene indexing post migration by pre extracting the text content from binaries

Key: OAK-2892
URL: https://issues.apache.org/jira/browse/OAK-2892
Project: Jackrabbit Oak
Issue Type: New Feature
Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Labels: performance
Fix For: 1.3.3, 1.0.18

While migrating large repositories, say having 3M docs (250k PDFs), Lucene indexing takes a long time to complete (at times 4 days!). Currently the text extraction logic is coupled with Lucene indexing and hence is performed in single-threaded mode, which slows down the indexing process. Further, if reindexing has to be triggered, it has to be done all over again. To speed up the Lucene indexing we can decouple the text extraction from the actual indexing. It is partly based on the discussion on OAK-2787:
# Introduce a new ExtractedTextProvider which can provide extracted text for a given Blob instance
# In oak-run introduce a new indexer mode - this would take a path in the repository and would then traverse the repository, look for existing binaries and extract text from them

So, before or after migration is done, one can run this oak-run tool to create a store which has the text already extracted. Then post startup we need to wire up the ExtractedTextProvider instance (backed by the BlobStore populated before), and the indexing logic can just get content from that. This avoids performing expensive text extraction in the indexing thread.

See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
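The decoupling described above can be sketched roughly as follows. Only the name ExtractedTextProvider comes from the ticket; the other types and method signatures here are assumptions made for illustration and differ from Oak's actual interfaces:

```java
// Illustrative shape of decoupling text extraction from Lucene indexing.
// Hypothetical types: only "ExtractedTextProvider" is named in the ticket.
class TextExtractionSketch {
    interface Blob { String getContentIdentity(); }

    interface ExtractedTextProvider {
        /** Returns pre-extracted text for the blob, or null if none is stored. */
        String getText(Blob blob);
    }

    // Indexer side: prefer the pre-extracted store, fall back to in-line parsing.
    static String textFor(Blob blob, ExtractedTextProvider provider) {
        String text = (provider == null) ? null : provider.getText(blob);
        return (text != null) ? text : parseInline(blob);
    }

    // Stand-in for the expensive parser-based extraction that currently
    // runs single-threaded inside the indexing thread.
    private static String parseInline(Blob blob) {
        return "<parsed:" + blob.getContentIdentity() + ">";
    }
}
```

The point of the design is that `textFor` only falls back to in-line parsing for binaries the oak-run pre-extraction pass did not cover, so reindexing no longer repeats the extraction work.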
[jira] [Commented] (OAK-2953) Implement text extractor as part of oak-run
[ https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622172#comment-14622172 ]

Chetan Mehrotra commented on OAK-2953:

Applied the patch in http://svn.apache.org/r1690249

Implement text extractor as part of oak-run

Key: OAK-2953
URL: https://issues.apache.org/jira/browse/OAK-2953
Project: Jackrabbit Oak
Issue Type: Sub-task
Components: run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Fix For: 1.3.3
Attachments: OAK-2953.patch

Implement a crawler and indexer which can find all binary content in the repository under a certain path, extract text from it, and store it somewhere.
[jira] [Updated] (OAK-3005) OSGI wrapper service for Jackrabbit CachingFDS
[ https://issues.apache.org/jira/browse/OAK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-3005:

Labels: candidate_oak_1_0 candidate_oak_1_2 features performance (was: features performance)

OSGI wrapper service for Jackrabbit CachingFDS

Key: OAK-3005
URL: https://issues.apache.org/jira/browse/OAK-3005
Project: Jackrabbit Oak
Issue Type: Improvement
Components: blob
Affects Versions: 1.0.15
Reporter: Shashank Gupta
Assignee: Shashank Gupta
Labels: candidate_oak_1_0, candidate_oak_1_2, features, performance
Fix For: 1.3.1
Attachments: OAK-2729.patch, org.apache.jackrabbit.oak.plugins.blob.datastore.CachingFDS.sample.config

OSGi service wrapper for JCR-3869, which provides CachingDataStore capabilities for SAN/NAS storage.
[jira] [Created] (OAK-3090) Caching BlobStore implementation
Chetan Mehrotra created OAK-3090:

Summary: Caching BlobStore implementation
Key: OAK-3090
URL: https://issues.apache.org/jira/browse/OAK-3090
Project: Jackrabbit Oak
Issue Type: New Feature
Components: blob
Reporter: Chetan Mehrotra
Fix For: 1.3.4