[jira] [Created] (OAK-3091) Remove duplicate logback-classic dependency entry from oak-lucene pom

2015-07-10 Thread Chetan Mehrotra (JIRA)
Chetan Mehrotra created OAK-3091:


 Summary: Remove duplicate logback-classic dependency entry from 
oak-lucene pom
 Key: OAK-3091
 URL: https://issues.apache.org/jira/browse/OAK-3091
 Project: Jackrabbit Oak
  Issue Type: Bug
  Components: lucene
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Priority: Minor
 Fix For: 1.2.3, 1.3.3, 1.0.18


The following warning is seen when building the oak-lucene component:

{noformat}
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.jackrabbit:oak-lucene:bundle:1.4-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must 
be unique: ch.qos.logback:logback-classic:jar -> duplicate declaration of 
version (?) @ org.apache.jackrabbit:oak-lucene:[unknown-version], 
/path/to/jackrabbit-oak/oak-lucene/pom.xml, line 279, column 17 


[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten 
the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support 
building such malformed projects.
[WARNING] 

{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622215#comment-14622215
 ] 

Chetan Mehrotra commented on OAK-3090:
--

[~tmueller] Would it be possible to add support for 
[RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html]
 to {{CacheLIRS}}? Then we could use that to implement the caching described above.
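For illustration, this is the Guava listener contract being referenced; {{CacheLIRS}} would need an equivalent hook. The file-deletion use case is an assumption about how we would use it here, not existing code:

{code:java}
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;

import java.io.File;

// Sketch only: deletes the locally spooled file once the cache evicts its entry.
public class FileEvictionListener implements RemovalListener<String, File> {
    @Override
    public void onRemoval(RemovalNotification<String, File> notification) {
        File cached = notification.getValue();
        if (cached != null) {
            cached.delete(); // failures ignored here; a real impl should retry or log
        }
    }
}
{code}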

> Caching BlobStore implementation 
> -
>
> Key: OAK-3090
> URL: https://issues.apache.org/jira/browse/OAK-3090
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: blob
>Reporter: Chetan Mehrotra
> Fix For: 1.3.4
>
>
> Storing binaries in Mongo puts a lot of pressure on MongoDB for reads. To 
> reduce the read load it would be useful to have a filesystem-based cache of 
> frequently used binaries. 
> This would be similar to CachingFDS (OAK-3005) but would be implemented on 
> top of the BlobStore API. 
> Requirements
> * Specify the maximum binary size which can be cached on the file system
> * Limit the total size of all binary content present in the cache



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622214#comment-14622214
 ] 

Chetan Mehrotra commented on OAK-3090:
--

Given that we already use Guava in Oak it might be better to just make use of its 
caches (or {{CacheLIRS}} if it supports a removal listener) and have a simple 
CachingDataStore impl. Have a look at {{DataStoreBlobStore#getInputStream}}, 
where we already do some on-heap caching of small binaries. 

Extrapolating that design in the following way would allow us to implement a simple 
FS-based caching layer (a rough sketch follows the list):
# Have a new cache where the cached value is a File (or some instance which keeps 
a reference to a File)
# Provide support for entry weights via a simple 
[Weigher|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/Weigher.html]
 based on file size
# Register a 
[RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html]
 which removes the file from the file system upon eviction
# Provide a 
[CacheLoader|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/CacheLoader.html]
 which spools the remote binary to the local filesystem
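A rough sketch of that wiring using Guava's cache builder; {{BlobSpooler}} and the 1 GB limit are illustrative assumptions, not existing Oak API:

{code:java}
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;
import com.google.common.cache.Weigher;

import java.io.File;

public class FileCacheSketch {

    // Illustrative limit on the total size of spooled binaries (1 GB)
    private static final long MAX_WEIGHT_BYTES = 1024L * 1024 * 1024;

    // Hypothetical hook for copying a remote blob to a local file
    public interface BlobSpooler {
        File spoolToLocalFile(String blobId) throws Exception;
    }

    public static LoadingCache<String, File> build(final BlobSpooler spooler) {
        return CacheBuilder.newBuilder()
                // (2) weigh entries by file size so the limit is in bytes
                .weigher(new Weigher<String, File>() {
                    @Override
                    public int weigh(String blobId, File file) {
                        return (int) Math.min(file.length(), Integer.MAX_VALUE);
                    }
                })
                .maximumWeight(MAX_WEIGHT_BYTES)
                // (3) delete the spooled file when its entry is evicted
                .removalListener(new RemovalListener<String, File>() {
                    @Override
                    public void onRemoval(RemovalNotification<String, File> n) {
                        File f = n.getValue();
                        if (f != null) {
                            f.delete();
                        }
                    }
                })
                // (4) spool the remote binary to the local filesystem on a miss
                .build(new CacheLoader<String, File>() {
                    @Override
                    public File load(String blobId) throws Exception {
                        return spooler.spoolToLocalFile(blobId);
                    }
                });
    }
}
{code}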

This should be a small amount of logic and would provide all the benefits of the 
Guava cache [1], including cache stats. And it would work transparently for any 
DataStore. 

Maybe we should implement it at the {{BlobStore}} level itself; then it would be 
useful for other BlobStore implementations as well. Doing it at the BlobStore level 
would require some support from {{BlobStore}} to determine the blob length from the 
blobId itself; a brief sketch of that follows below. 

[1] https://code.google.com/p/guava-libraries/wiki/CachesExplained
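For the last point, a hedged sketch of what determining the length from the blobId could look like, assuming ids of the form {{<digest>#<length>}}; BlobStore implementations that do not encode the length are exactly the missing support mentioned above:

{code:java}
public class BlobIdLength {

    // Assumption: the blobId ends with "#<length>"; returns -1 when no
    // length is encoded in the id.
    public static long lengthFromBlobId(String blobId) {
        int idx = blobId.lastIndexOf('#');
        if (idx == -1) {
            return -1;
        }
        try {
            return Long.parseLong(blobId.substring(idx + 1));
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
{code}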

> Caching BlobStore implementation 
> -
>
> Key: OAK-3090
> URL: https://issues.apache.org/jira/browse/OAK-3090
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: blob
>Reporter: Chetan Mehrotra
> Fix For: 1.3.4
>
>
> Storing binaries in Mongo puts a lot of pressure on MongoDB for reads. To 
> reduce the read load it would be useful to have a filesystem-based cache of 
> frequently used binaries. 
> This would be similar to CachingFDS (OAK-3005) but would be implemented on 
> top of the BlobStore API. 
> Requirements
> * Specify the maximum binary size which can be cached on the file system
> * Limit the total size of all binary content present in the cache



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)
Chetan Mehrotra created OAK-3090:


 Summary: Caching BlobStore implementation 
 Key: OAK-3090
 URL: https://issues.apache.org/jira/browse/OAK-3090
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: blob
Reporter: Chetan Mehrotra
 Fix For: 1.3.4


Storing binaries in Mongo puts a lot of pressure on MongoDB for reads. To 
reduce the read load it would be useful to have a filesystem-based cache of 
frequently used binaries. 

This would be similar to CachingFDS (OAK-3005) but would be implemented on top 
of the BlobStore API. 

Requirements
* Specify the maximum binary size which can be cached on the file system
* Limit the total size of all binary content present in the cache




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (OAK-3005) OSGI wrapper service for Jackrabbit CachingFDS

2015-07-10 Thread Chetan Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-3005:
-
Labels: candidate_oak_1_0 candidate_oak_1_2 features performance  (was: 
features performance)

> OSGI wrapper service for Jackrabbit CachingFDS
> --
>
> Key: OAK-3005
> URL: https://issues.apache.org/jira/browse/OAK-3005
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: blob
>Affects Versions: 1.0.15
>Reporter: Shashank Gupta
>Assignee: Shashank Gupta
>  Labels: candidate_oak_1_0, candidate_oak_1_2, features, 
> performance
> Fix For: 1.3.1
>
> Attachments: OAK-2729.patch, 
> org.apache.jackrabbit.oak.plugins.blob.datastore.CachingFDS.sample.config
>
>
> OSGI service wrapper for JCR-3869 which provides CachingDataStore 
> capabilities for SAN & NAS storage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175
 ] 

Chetan Mehrotra commented on OAK-2892:
--

Done initial implementation in http://svn.apache.org/r1690247

[~tmueller] Can you review the commit to see if the comments you made are 
addressed? If anything needs to be changed, let me know.
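For reviewers, a minimal sketch of the provider shape this issue describes; the actual interface and names in r1690247 may differ:

{code:java}
import org.apache.jackrabbit.oak.api.Blob;

// Sketch of the described contract, not necessarily the committed API.
public interface ExtractedTextProvider {

    /**
     * @return pre-extracted text for the given binary, or null if none is
     *         available (the indexer would then fall back to normal extraction)
     */
    String getExtractedText(Blob blob);
}
{code}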

> Speed up lucene indexing post migration by pre extracting the text content 
> from binaries
> 
>
> Key: OAK-2892
> URL: https://issues.apache.org/jira/browse/OAK-2892
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: lucene, run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
>  Labels: performance
> Fix For: 1.3.3, 1.0.18
>
>
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
> indexing takes a long time to complete (at times 4 days!). Currently the text 
> extraction logic is coupled with Lucene indexing and hence is performed in 
> single-threaded mode, which slows down the indexing process. Further, if 
> reindexing has to be triggered it has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787:
> # Introduce a new ExtractedTextProvider which can provide extracted text for 
> a given Blob instance
> # In oak-run introduce a new indexer mode - this would take a path in the 
> repository, traverse the repository looking for existing binaries, and 
> extract text from them
> So before or after migration one can run this oak-run tool to create a store 
> which has the text already extracted. Then post startup we need to wire up 
> the ExtractedTextProvider instance (backed by the BlobStore populated 
> before) and the indexing logic can just get content from it. This would 
> avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-2953) Implement text extractor as part of oak-run

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622172#comment-14622172
 ] 

Chetan Mehrotra commented on OAK-2953:
--

Applied the patch in http://svn.apache.org/r1690249

> Implement text extractor as part of oak-run
> ---
>
> Key: OAK-2953
> URL: https://issues.apache.org/jira/browse/OAK-2953
> Project: Jackrabbit Oak
>  Issue Type: Sub-task
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.3.3
>
> Attachments: OAK-2953.patch
>
>
> Implement a crawler and indexer which can find all binary content in the 
> repository under a certain path, extract text from it, and store the result 
> somewhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622095#comment-14622095
 ] 

Chetan Mehrotra commented on OAK-3085:
--

bq. Adding a new column at this point will create a backwards compatibility 
problem with existing DBs. Do we really need it?

[~julian.resc...@gmx.de] This column is to be added to the table created for 
JournalEntry, which is a new table being introduced. So would it still pose a 
backwards-compatibility problem? Also it seems that RDBDocumentStore uses the 
same schema for all collections [1], which looks incorrect. 

[1] http://markmail.org/thread/xratik7dsrw3o7og

> Add timestamp property to journal entries
> -
>
> Key: OAK-3085
> URL: https://issues.apache.org/jira/browse/OAK-3085
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, mongomk
>Affects Versions: 1.2.2, 1.3.2
>Reporter: Stefan Egli
> Fix For: 1.2.3, 1.3.3
>
> Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a 
> separated-out timestamp property (rather than the id that encapsulates the 
> timestamp).
> In order to remove OAK-3001 as a blocker for the 1.2.3 release, this ticket 
> is about adding a timestamp property to the journal entry without making use 
> of it yet. Later on, when OAK-3001 is tackled, this timestamp property will 
> already exist and migration will no longer be an issue (as 1.2.3 introduces 
> the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622094#comment-14622094
 ] 

Stefan Egli commented on OAK-3085:
--

Hmm, but in the Mongo case I don't see this being created for the journal...

> Add timestamp property to journal entries
> -
>
> Key: OAK-3085
> URL: https://issues.apache.org/jira/browse/OAK-3085
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, mongomk
>Affects Versions: 1.2.2, 1.3.2
>Reporter: Stefan Egli
> Fix For: 1.2.3, 1.3.3
>
> Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a 
> separated-out timestamp property (rather than the id that encapsulates the 
> timestamp).
> In order to remove OAK-3001 as a blocker for the 1.2.3 release, this ticket 
> is about adding a timestamp property to the journal entry without making use 
> of it yet. Later on, when OAK-3001 is tackled, this timestamp property will 
> already exist and migration will no longer be an issue (as 1.2.3 introduces 
> the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622086#comment-14622086
 ] 

Julian Reschke commented on OAK-3085:
-

It's a Unix timestamp with 5s resolution, maintained by the DocumentStore.
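Illustratively, a value with 5-second resolution can be derived as below; the rounding-down convention is an assumption, not a quote of the DocumentStore code:

{code:java}
import java.util.concurrent.TimeUnit;

public class TimestampResolution {
    // Round an epoch-millis value down to a multiple of 5 seconds.
    public static long toFiveSecondResolution(long millis) {
        return TimeUnit.MILLISECONDS.toSeconds(millis) / 5 * 5;
    }

    public static void main(String[] args) {
        System.out.println(toFiveSecondResolution(System.currentTimeMillis()));
    }
}
{code}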

> Add timestamp property to journal entries
> -
>
> Key: OAK-3085
> URL: https://issues.apache.org/jira/browse/OAK-3085
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, mongomk
>Affects Versions: 1.2.2, 1.3.2
>Reporter: Stefan Egli
> Fix For: 1.2.3, 1.3.3
>
> Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a 
> separated-out timestamp property (rather than the id that encapsulates the 
> timestamp).
> In order to remove OAK-3001 as a blocker for the 1.2.3 release, this ticket 
> is about adding a timestamp property to the journal entry without making use 
> of it yet. Later on, when OAK-3001 is tackled, this timestamp property will 
> already exist and migration will no longer be an issue (as 1.2.3 introduces 
> the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622076#comment-14622076
 ] 

Stefan Egli commented on OAK-3085:
--

[~reschke], is {{_modified}} a Unix timestamp too (as is the newly suggested 
{{_ts}})? If it is, what would RDB do if {{JournalEntry.asUpdateOp}} tried to 
explicitly set {{_modified}} (which I assume RDBDocumentStore does somehow 
too)?

> Add timestamp property to journal entries
> -
>
> Key: OAK-3085
> URL: https://issues.apache.org/jira/browse/OAK-3085
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, mongomk
>Affects Versions: 1.2.2, 1.3.2
>Reporter: Stefan Egli
> Fix For: 1.2.3, 1.3.3
>
> Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a 
> separated-out timestamp property (rather than the id that encapsulates the 
> timestamp).
> In order to remove OAK-3001 as a blocker for the 1.2.3 release, this ticket 
> is about adding a timestamp property to the journal entry without making use 
> of it yet. Later on, when OAK-3001 is tackled, this timestamp property will 
> already exist and migration will no longer be an issue (as 1.2.3 introduces 
> the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622013#comment-14622013
 ] 

Julian Reschke commented on OAK-3085:
-

Adding a new column at this point will create a backwards compatibility problem 
with existing DBs. Do we really need it?

> Add timestamp property to journal entries
> -
>
> Key: OAK-3085
> URL: https://issues.apache.org/jira/browse/OAK-3085
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: core, mongomk
>Affects Versions: 1.2.2, 1.3.2
>Reporter: Stefan Egli
> Fix For: 1.2.3, 1.3.3
>
> Attachments: OAK-3085.patch, OAK-3085.v2.patch
>
>
> OAK-3001 is about improving the JournalGarbageCollector by querying on a 
> separated-out timestamp property (rather than the id that encapsulates the 
> timestamp).
> In order to remove OAK-3001 as a blocker for the 1.2.3 release, this ticket 
> is about adding a timestamp property to the journal entry without making use 
> of it yet. Later on, when OAK-3001 is tackled, this timestamp property will 
> already exist and migration will no longer be an issue (as 1.2.3 introduces 
> the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)