[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622013#comment-14622013
 ] 

Julian Reschke commented on OAK-3085:
-

Adding a new column at this point will create a backwards compatibility problem 
with existing DBs. Do we really need it?

 Add timestamp property to journal entries
 -

 Key: OAK-3085
 URL: https://issues.apache.org/jira/browse/OAK-3085
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: core, mongomk
Affects Versions: 1.2.2, 1.3.2
Reporter: Stefan Egli
 Fix For: 1.2.3, 1.3.3

 Attachments: OAK-3085.patch, OAK-3085.v2.patch


 OAK-3001 is about improving the JournalGarbageCollector by querying on a 
 separated-out timestamp property (rather than the id that encapsulated the 
 timestamp).
 In order to remove OAK-3001 as a blocker ticket from the 1.2.3 release, this 
 ticket is about adding a timestamp property to the journal entry but not 
 making use of it yet. Later on, when OAK-3001 is tackled, this timestamp 
 property will already exist and migration will no longer be an issue (as 1.2.3 
 introduces the journal entry for the first time).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622076#comment-14622076
 ] 

Stefan Egli commented on OAK-3085:
--

[~reschke], is {{_modified}} a unix timestamp too (like the newly suggested 
{{_ts}})? If so, what would RDB do if {{JournalEntry.asUpdateOp}} tried to 
explicitly set {{_modified}} (I assume RDBDocumentStore sets it itself somehow 
as well)?



[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622086#comment-14622086
 ] 

Julian Reschke commented on OAK-3085:
-

It's a unix timestamp with 5s resolution, maintained by the DocumentStore.
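For illustration, a 5-second resolution can be obtained by bucketing the millisecond clock. This is a hedged sketch of what the DocumentStore presumably does for {{_modified}}; the class and method names here are made up for the example and are not necessarily Oak's actual code:

```java
// Sketch (assumption): deriving a unix timestamp with 5-second resolution
// from the millisecond clock, as the DocumentStore maintains for _modified.
public class ModifiedResolution {
    /** Truncate a millisecond timestamp to the start of its 5-second bucket. */
    public static long getModifiedInSecs(long timeMillis) {
        // seconds, rounded down to a multiple of 5
        return timeMillis / 1000 / 5 * 5;
    }
}
```

Two timestamps less than 5 seconds apart can thus land in the same bucket, which is why such a column is useful for coarse range queries but not for ordering individual entries.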



[jira] [Commented] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622215#comment-14622215
 ] 

Chetan Mehrotra commented on OAK-3090:
--

[~tmueller] Would it be possible to add support for 
[RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html]
 to {{CacheLIRS}}? We could then use that to implement the caching described 
above.

 Caching BlobStore implementation 
 -

 Key: OAK-3090
 URL: https://issues.apache.org/jira/browse/OAK-3090
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: blob
Reporter: Chetan Mehrotra
 Fix For: 1.3.4


 Storing binaries in Mongo puts lots of pressure on the MongoDB for reads. To 
 reduce the read load it would be useful to have a filesystem based cache of 
 frequently used binaries. 
 This would be similar to CachingFDS (OAK-3005) but would be implemented on 
 top of BlobStore API. 
 Requirements
 * Specify the max binary size which can be cached on file system
 * Limit the size of all binary content present in the cache





[jira] [Commented] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622214#comment-14622214
 ] 

Chetan Mehrotra commented on OAK-3090:
--

Given that we already use Guava in Oak, it might be better to just make use of 
its cache (or {{CacheLIRS}}, if it supports a removal listener) and have a 
simple CachingDataStore impl. Have a look at 
{{DataStoreBlobStore#getInputStream}}, where we already do some on-heap caching 
for small binaries.

Extrapolating that design in the following way would allow us to implement a 
simple FS-based caching layer:
# Have a new cache where the cached value is a File (or some instance which 
keeps a reference to a File)
# Provide support for weight via a simple 
[Weigher|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/Weigher.html]
 based on the file size
# Register a 
[RemovalListener|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/RemovalListener.html]
 which removes the file from the file system upon eviction
# Provide a 
[CacheLoader|http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/cache/CacheLoader.html]
 which spools the remote binary to the local filesystem

This should require only a small amount of logic and would provide all the 
benefits of the Guava cache [1], including cache stats. And it would 
transparently work for any DataStore.

Maybe we should implement it at the {{BlobStore}} level itself; then it would 
be useful for other BlobStore implementations too. Doing it at the BlobStore 
level would require some support from {{BlobStore}} to determine the blob 
length from the blobId itself.

[1] https://code.google.com/p/guava-libraries/wiki/CachesExplained
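The four steps above could be sketched roughly as follows. This is a dependency-free illustration, not actual Oak code: a {{LinkedHashMap}} in access order stands in for the Guava cache, and the Weigher, RemovalListener and CacheLoader roles are marked in comments; the class and method names are assumptions made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the FS-based caching layer described above.
// A real implementation would use Guava's CacheBuilder with weigher(),
// removalListener() and a CacheLoader; here a LinkedHashMap in access
// order plays the cache, the weight is the spooled file's size, and
// eviction deletes the file (the RemovalListener's job).
public class FsBlobCacheSketch {
    private final Path cacheDir;
    private final long maxWeightBytes; // limit on total cached bytes
    private long currentWeight = 0;
    private final LinkedHashMap<String, Path> cache =
            new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

    public FsBlobCacheSketch(Path cacheDir, long maxWeightBytes) {
        this.cacheDir = cacheDir;
        this.maxWeightBytes = maxWeightBytes;
    }

    // CacheLoader analogue: spool the remote binary to the local filesystem
    // on a cache miss (remoteContent stands in for the remote BlobStore read).
    public Path get(String blobId, byte[] remoteContent) throws IOException {
        Path cached = cache.get(blobId);
        if (cached != null) {
            return cached;
        }
        Path file = Files.write(cacheDir.resolve(blobId), remoteContent);
        cache.put(blobId, file);
        currentWeight += Files.size(file); // Weigher analogue: weigh by size
        evictIfNeeded();
        return file;
    }

    // RemovalListener analogue: delete evicted entries' files from disk,
    // least recently used first, until under the weight limit.
    private void evictIfNeeded() throws IOException {
        Iterator<Map.Entry<String, Path>> it = cache.entrySet().iterator();
        while (currentWeight > maxWeightBytes && it.hasNext()) {
            Map.Entry<String, Path> eldest = it.next();
            currentWeight -= Files.size(eldest.getValue());
            Files.delete(eldest.getValue());
            it.remove();
        }
    }
}
```

The sketch covers both requirements from the issue (per-entry weight via file size, total-size limit via the weight cap); a real Guava-based version would additionally get hit/miss statistics for free.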



[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622095#comment-14622095
 ] 

Chetan Mehrotra commented on OAK-3085:
--

bq. Adding a new column at this point will create a backwards compatibility 
problem with existing DBs. Do we really need it?

[~julian.resc...@gmx.de] This column is to be added to the table created for 
JournalEntry, which is a new table being introduced, so would it still pose a 
backwards compatibility problem? Also, it seems that RDBDocumentStore uses the 
same schema for all collections [1], which looks incorrect.

[1] http://markmail.org/thread/xratik7dsrw3o7og



[jira] [Commented] (OAK-3085) Add timestamp property to journal entries

2015-07-10 Thread Stefan Egli (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622094#comment-14622094
 ] 

Stefan Egli commented on OAK-3085:
--

Hm, but in the Mongo case I don't see this being created for the journal 
collection...



[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622175#comment-14622175
 ] 

Chetan Mehrotra commented on OAK-2892:
--

Done initial implementation in http://svn.apache.org/r1690247

[~tmueller] Can you review the commit to see if the comments you made are 
addressed? If anything needs to be changed there, let me know.

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.3, 1.0.18


 While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
 indexing takes a long time to complete (at times 4 days!). Currently the text 
 extraction logic is coupled with Lucene indexing and hence is performed in 
 single-threaded mode, which slows down the indexing process. Further, if 
 reindexing has to be triggered it has to be done all over again.
 To speed up the Lucene indexing we can decouple the text extraction
 from actual indexing. It is partly based on discussion on OAK-2787
 # Introduce a new ExtractedTextProvider which can provide extracted text for 
 a given Blob instance
 # In oak-run introduce a new indexer mode - This would take a path in 
 repository and would then traverse the repository and look for existing 
 binaries and extract text from that
 So, before or after migration, one can run this oak-run tool to create a 
 store which has the text already extracted. Then, post startup, we wire up 
 the ExtractedTextProvider instance (backed by the BlobStore populated 
 before) and the indexing logic can just get content from it. This avoids 
 performing expensive text extraction in the indexing thread.
 See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
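The provider from step 1 could take roughly the following shape. The interface name comes from the description above, but the method signature and the map-backed store are assumptions for illustration only; Oak's actual API may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the decoupling described above: at index time, the
// indexer first asks a provider for pre-extracted text, and only falls back
// to running extraction in-line when nothing was pre-extracted.
interface ExtractedTextProvider {
    /** Pre-extracted text for the given blob, or null if none was stored. */
    String getText(String blobId);
}

// A provider backed by a simple map, standing in for the store that the
// oak-run tool would populate before or after migration.
class PreExtractedTextStore implements ExtractedTextProvider {
    private final Map<String, String> byBlobId = new HashMap<>();

    void put(String blobId, String text) {
        byBlobId.put(blobId, text);
    }

    @Override
    public String getText(String blobId) {
        return byBlobId.get(blobId);
    }
}
```

With this split, reindexing only replays cheap map lookups instead of re-parsing every binary, which is the speed-up the issue is after.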





[jira] [Commented] (OAK-2953) Implement text extractor as part of oak-run

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622172#comment-14622172
 ] 

Chetan Mehrotra commented on OAK-2953:
--

Applied the patch in http://svn.apache.org/r1690249

 Implement text extractor as part of oak-run
 ---

 Key: OAK-2953
 URL: https://issues.apache.org/jira/browse/OAK-2953
 Project: Jackrabbit Oak
  Issue Type: Sub-task
  Components: run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
 Fix For: 1.3.3

 Attachments: OAK-2953.patch


 Implement a crawler and indexer which can find all binary content in the 
 repository under a certain path, extract text from it, and store the 
 extracted text somewhere.





[jira] [Updated] (OAK-3005) OSGI wrapper service for Jackrabbit CachingFDS

2015-07-10 Thread Chetan Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-3005:
-
Labels: candidate_oak_1_0 candidate_oak_1_2 features performance  (was: 
features performance)

 OSGI wrapper service for Jackrabbit CachingFDS
 --

 Key: OAK-3005
 URL: https://issues.apache.org/jira/browse/OAK-3005
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: blob
Affects Versions: 1.0.15
Reporter: Shashank Gupta
Assignee: Shashank Gupta
  Labels: candidate_oak_1_0, candidate_oak_1_2, features, 
 performance
 Fix For: 1.3.1

 Attachments: OAK-2729.patch, 
 org.apache.jackrabbit.oak.plugins.blob.datastore.CachingFDS.sample.config


 OSGI service wrapper for JCR-3869, which provides CachingDataStore 
 capabilities for SAN and NAS storage





[jira] [Created] (OAK-3090) Caching BlobStore implementation

2015-07-10 Thread Chetan Mehrotra (JIRA)
Chetan Mehrotra created OAK-3090:


 Summary: Caching BlobStore implementation 
 Key: OAK-3090
 URL: https://issues.apache.org/jira/browse/OAK-3090
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: blob
Reporter: Chetan Mehrotra
 Fix For: 1.3.4


Storing binaries in Mongo puts lots of pressure on the MongoDB for reads. To 
reduce the read load it would be useful to have a filesystem based cache of 
frequently used binaries. 

This would be similar to CachingFDS (OAK-3005) but would be implemented on top 
of BlobStore API. 

Requirements
* Specify the max binary size which can be cached on file system
* Limit the size of all binary content present in the cache



