[jira] [Created] (OAK-3091) Remove duplicate logback-classic dependency entry from oak-lucene pom
Chetan Mehrotra created OAK-3091:
------------------------------------

             Summary: Remove duplicate logback-classic dependency entry from oak-lucene pom
                 Key: OAK-3091
                 URL: https://issues.apache.org/jira/browse/OAK-3091
             Project: Jackrabbit Oak
          Issue Type: Bug
          Components: lucene
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
            Priority: Minor
             Fix For: 1.2.3, 1.3.3, 1.0.18

The following warning is seen when building the oak-lucene component:

{noformat}
[WARNING] Some problems were encountered while building the effective model for org.apache.jackrabbit:oak-lucene:bundle:1.4-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: ch.qos.logback:logback-classic:jar -> duplicate declaration of version (?) @ org.apache.jackrabbit:oak-lucene:[unknown-version], /path/to/jackrabbit-oak/oak-lucene/pom.xml, line 279, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
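The fix is to drop one of the two declarations of the same (groupId, artifactId) pair; a hypothetical before-state sketch of the pom (the scope value and placement are assumed for illustration, only the coordinates come from the warning):

```xml
<dependencies>
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <scope>test</scope> <!-- scope assumed for illustration -->
  </dependency>
  <!-- ... other dependencies ... -->
  <!-- duplicate entry around line 279: remove this whole block -->
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <scope>test</scope>
  </dependency>
</dependencies>
```

Maven keys dependency uniqueness on groupId:artifactId:type:classifier, so only one of the two blocks may remain.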
[jira] [Commented] (OAK-3090) Caching BlobStore implementation
[ https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622215#comment-14622215 ]

Chetan Mehrotra commented on OAK-3090:
--------------------------------------

[~tmueller] Would it be possible to add support for [RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html] to {{CacheLIRS}}? Then we can use that to implement the above caching.
[jira] [Commented] (OAK-3090) Caching BlobStore implementation
[ https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622214#comment-14622214 ]

Chetan Mehrotra commented on OAK-3090:
--------------------------------------

Given that we already use Guava in Oak, it might be better to just make use of it (or {{CacheLIRS}}, if it supports a removal listener) and have a simple CachingDataStore implementation. Have a look at {{DataStoreBlobStore#getInputStream}}, where we already do some on-heap caching for small binaries. Extrapolating that design in the following way would allow us to implement a simple FS-based caching layer:

# Have a new cache where the cached value is a File (or some instance which keeps a reference to a File)
# Provide support for weight via a simple [Weigher|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/Weigher.html] based on file size
# Register a [RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html] which removes the file from the file system upon eviction
# Provide a [CacheLoader|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/CacheLoader.html] which spools the remote binary to the local file system

This should require only a small amount of logic and would provide all the benefits of the Guava cache [1], including cache stats. And it would transparently work for any DataStore. Maybe we implement it at the {{BlobStore}} level itself; then it would be useful for other BlobStore implementations as well. Doing it at the BlobStore level would require some support from {{BlobStore}} to determine the blob length from the blobId itself.

[1] https://code.google.com/p/guava-libraries/wiki/CachesExplained
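The four steps above can be sketched with the Guava cache API roughly as follows. This is a minimal illustration, not the actual Oak implementation: the class name, the blobId-as-filename convention, and the {{openRemoteStream}} hook are all hypothetical.

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.Weigher;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical FS-backed blob cache illustrating the Guava-based design. */
public class FileCachingBlobStore {

    private final LoadingCache<String, File> cache;

    public FileCachingBlobStore(File cacheDir, long maxWeightBytes) {
        cache = CacheBuilder.newBuilder()
                // weight = file size, so maximumWeight caps total bytes on disk
                .maximumWeight(maxWeightBytes)
                .weigher((Weigher<String, File>) (blobId, file) -> (int) file.length())
                // delete the spooled file when the entry is evicted
                .removalListener((RemovalListener<String, File>) n -> n.getValue().delete())
                // spool the remote binary to the local file system on a cache miss
                .build(new CacheLoader<String, File>() {
                    @Override
                    public File load(String blobId) throws IOException {
                        File f = new File(cacheDir, blobId);
                        try (InputStream in = openRemoteStream(blobId)) {
                            Files.copy(in, f.toPath(), StandardCopyOption.REPLACE_EXISTING);
                        }
                        return f;
                    }
                });
    }

    /** Returns the locally cached file for the blob, loading it on a miss. */
    public File getFile(String blobId) throws Exception {
        return cache.get(blobId);
    }

    /** Placeholder for the actual remote read, e.g. via the BlobStore API. */
    protected InputStream openRemoteStream(String blobId) throws IOException {
        throw new UnsupportedOperationException("wire this to the backing store");
    }
}
```

Because the weigher reports file size and the removal listener deletes the file, the on-disk footprint stays bounded by {{maxWeightBytes}}, and {{cache.stats()}} gives hit/miss counts for free.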
[jira] [Created] (OAK-3090) Caching BlobStore implementation
Chetan Mehrotra created OAK-3090:
------------------------------------

             Summary: Caching BlobStore implementation
                 Key: OAK-3090
                 URL: https://issues.apache.org/jira/browse/OAK-3090
             Project: Jackrabbit Oak
          Issue Type: New Feature
          Components: blob
            Reporter: Chetan Mehrotra
             Fix For: 1.3.4

Storing binaries in Mongo puts a lot of pressure on MongoDB for reads. To reduce the read load it would be useful to have a file-system-based cache of frequently used binaries.

This would be similar to CachingFDS (OAK-3005) but would be implemented on top of the BlobStore API.

Requirements:
* Specify the max binary size which can be cached on the file system
* Limit the total size of all binary content present in the cache
[jira] [Updated] (OAK-3005) OSGI wrapper service for Jackrabbit CachingFDS
[ https://issues.apache.org/jira/browse/OAK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-3005:
---------------------------------
    Labels: candidate_oak_1_0 candidate_oak_1_2 features performance  (was: features performance)

> OSGI wrapper service for Jackrabbit CachingFDS
> ----------------------------------------------
>
>                 Key: OAK-3005
>                 URL: https://issues.apache.org/jira/browse/OAK-3005
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: blob
>    Affects Versions: 1.0.15
>            Reporter: Shashank Gupta
>            Assignee: Shashank Gupta
>              Labels: candidate_oak_1_0, candidate_oak_1_2, features, performance
>             Fix For: 1.3.1
>
>         Attachments: OAK-2729.patch, org.apache.jackrabbit.oak.plugins.blob.datastore.CachingFDS.sample.config
>
> OSGi service wrapper for JCR-3869, which provides CachingDataStore capabilities for SAN & NAS storage
[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175 ]

Chetan Mehrotra commented on OAK-2892:
--------------------------------------

Done initial implementation in http://svn.apache.org/r1690247

[~tmueller] Can you review the commit to see if the comments you made are addressed? If anything needs to be changed, let me know.

> Speed up lucene indexing post migration by pre extracting the text content from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.3, 1.0.18
>
> While migrating large repositories, say having 3 M docs (250k PDFs), Lucene indexing takes a long time to complete (at times 4 days!). Currently the text extraction logic is coupled with Lucene indexing and hence is performed in single-threaded mode, which slows down the indexing process. Further, if reindexing has to be triggered, it has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction from the actual indexing. It is partly based on the discussion in OAK-2787:
> # Introduce a new ExtractedTextProvider which can provide extracted text for a given Blob instance
> # In oak-run introduce a new indexer mode - this would take a path in the repository and would then traverse the repository, look for existing binaries, and extract text from them
> So before or after migration is done, one can run this oak-run tool to create this store which has the text already extracted. Then post startup we need to wire up the ExtractedTextProvider instance (which is backed by the BlobStore populated before) and the indexing logic can just get content from that. This would avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
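The provider-plus-fallback flow described above could look roughly like this. This is a hypothetical sketch: the interface shape, the blobId key, and the map-backed store are illustrative stand-ins, not the actual Oak API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a provider serving pre-extracted text, keyed by blob id. */
interface ExtractedTextProvider {
    /** @return the pre-extracted text for the blob, or null if not available */
    String getText(String blobId);
}

/** In-memory stand-in for a store populated by the oak-run extraction pass. */
class MapExtractedTextProvider implements ExtractedTextProvider {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void put(String blobId, String text) { store.put(blobId, text); }

    @Override
    public String getText(String blobId) { return store.get(blobId); }
}

/**
 * Indexing side: consult the provider first and fall back to in-process
 * (Tika-style) extraction only on a miss, keeping the hot path cheap.
 */
class Indexer {
    private final ExtractedTextProvider provider;

    Indexer(ExtractedTextProvider provider) { this.provider = provider; }

    String textFor(String blobId) {
        String cached = provider.getText(blobId);
        return cached != null ? cached : extractWithParser(blobId);
    }

    /** Placeholder for the expensive extraction that the provider avoids. */
    String extractWithParser(String blobId) { return "<expensive extraction>"; }
}
```

The point of the design is that the expensive branch only runs for binaries the pre-extraction pass missed, so a reindex over already-extracted content never touches the parser.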
[jira] [Commented] (OAK-2953) Implement text extractor as part of oak-run
[ https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622172#comment-14622172 ]

Chetan Mehrotra commented on OAK-2953:
--------------------------------------

Applied the patch in http://svn.apache.org/r1690249

> Implement text extractor as part of oak-run
> -------------------------------------------
>
>                 Key: OAK-2953
>                 URL: https://issues.apache.org/jira/browse/OAK-2953
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.3
>
>         Attachments: OAK-2953.patch
>
> Implement a crawler and indexer which can find all binary content in the repository under a certain path, extract text from it, and store the result somewhere
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622095#comment-14622095 ]

Chetan Mehrotra commented on OAK-3085:
--------------------------------------

bq. Adding a new column at this point will create a backwards compatibility problem with existing DBs. Do we really need it?

[~julian.resc...@gmx.de] This column is to be added to the table created for JournalEntry, which is a new table being introduced. So would it still pose a backwards-compatibility problem? Also, it seems that RDBDocumentStore is using the same schema for all collections [1], which looks incorrect.

[1] http://markmail.org/thread/xratik7dsrw3o7og

> Add timestamp property to journal entries
> -----------------------------------------
>
>                 Key: OAK-3085
>                 URL: https://issues.apache.org/jira/browse/OAK-3085
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core, mongomk
>    Affects Versions: 1.2.2, 1.3.2
>            Reporter: Stefan Egli
>             Fix For: 1.2.3, 1.3.3
>
>         Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a separated-out timestamp property (rather than the id that encapsulates the timestamp).
> In order to remove OAK-3001 as a blocker ticket from the 1.2.3 release, this ticket is about adding a timestamp property to the journal entry but not making use of it yet. Later on, when OAK-3001 is tackled, this timestamp property will already exist and migration will not be an issue anymore (as 1.2.3 introduces the journal entry for the first time).
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622094#comment-14622094 ]

Stefan Egli commented on OAK-3085:
----------------------------------

Hm, but in the Mongo case I don't see this being created for the journal case...
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622086#comment-14622086 ]

Julian Reschke commented on OAK-3085:
-------------------------------------

It's a Unix timestamp with 5s resolution, maintained by the DocumentStore.
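A "5s resolution" timestamp of the kind described is typically obtained by truncating the millisecond clock to a 5-second boundary; a small illustrative sketch (class and method names are hypothetical, not the actual RDBDocumentStore code):

```java
/** Illustrative helper: truncate a millisecond epoch time to 5-second resolution. */
public class TimestampResolution {
    static final long RESOLUTION_MS = 5_000L;

    /** @return the given epoch-millis value rounded down to a 5s boundary */
    static long truncate(long epochMillis) {
        return (epochMillis / RESOLUTION_MS) * RESOLUTION_MS;
    }
}
```

Any two writes within the same 5-second window then carry the same stored value, which is sufficient for range queries like the garbage-collection cutoff discussed here.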
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622076#comment-14622076 ]

Stefan Egli commented on OAK-3085:
----------------------------------

[~reschke], is {{_modified}} a Unix timestamp too (as is the newly suggested {{_ts}})? If it is, what would RDB do if {{JournalEntry.asUpdateOp}} tried to explicitly set {{_modified}} (whereas I assume RDBDocumentStore does this somehow too)?
[jira] [Commented] (OAK-3085) Add timestamp property to journal entries
[ https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622013#comment-14622013 ]

Julian Reschke commented on OAK-3085:
-------------------------------------

Adding a new column at this point will create a backwards-compatibility problem with existing DBs. Do we really need it?