Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Thomas Mueller
Hi, I think removing binaries directly without going through the GC logic is dangerous, because we can't be sure whether there are other references. The one exception is when each file is guaranteed to be unique. To guarantee that, we could for example append a unique UUID to each file. The Lucene file sy
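A minimal sketch of the uniqueness idea above, assuming the UUID is appended to the file content as a 16-byte trailer (the snippet is truncated, so this reading is an assumption). The class `UniqueBlob` and its methods are hypothetical and not part of Oak; SHA-256 stands in for whatever digest the DataStore actually uses. The point is only that two files with identical payloads then get different content-addressed IDs, so deleting one cannot affect another reference.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;

// Hypothetical helper, not an Oak class.
class UniqueBlob {
    static final int TRAILER_BYTES = 16; // one UUID = two longs

    // Returns payload followed by a random UUID, making the stored bytes unique.
    static byte[] withUniqueTrailer(byte[] payload) {
        UUID id = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(payload.length + TRAILER_BYTES);
        buf.put(payload);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    // Strips the trailer again when the file is read back.
    static byte[] payloadOf(byte[] stored) {
        return Arrays.copyOf(stored, stored.length - TRAILER_BYTES);
    }

    // Content-addressed ID: hex digest of the stored bytes (digest choice is an assumption).
    static String contentId(byte[] stored) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(stored)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "same index bytes".getBytes(StandardCharsets.UTF_8);
        String first = contentId(withUniqueTrailer(payload));
        String second = contentId(withUniqueTrailer(payload));
        System.out.println("IDs differ: " + !first.equals(second));
    }
}
```

The cost of this design is that identical files are no longer deduplicated by the store, which is the trade-off for being able to delete them without a reference scan.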

Re: Parallelize text extraction from binary fields

2015-03-10 Thread Ian Boston
Hi, On 10 March 2015 at 09:52, Chetan Mehrotra wrote: > > Is Oak already single instance when it comes to the identification and > storage of binaries ? > > Yes. Oak uses content addressable storage for binaries > > > Are the existing TextExtractors also single instance ? > > No. If same binary

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
That's one approach we can think about. Thinking further, Lucene's design of immutable files makes things simpler (ignoring the reindex case). In normal usage Lucene never reuses a file name and never modifies an existing file. So we would not have to worry about reading older revisions. We on

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Could the Lucene indexer explicitly track these files (e.g. as a property in the index definition)? And also take care of removing them? (the latter part is assuming that the same index file is not identical across various definitions) > On 10 Mar 2015, at 12:18, Chetan Mehrotra wrote: > > On

Re: working lucene fulltext index

2015-03-10 Thread Torgeir Veimo
Thank you! This example helped me iron out the errors in my index configuration! It would be good to have a bit more example code online for these things. On 6 March 2015 at 04:16, Chetan Mehrotra wrote: > Hi Torgeir, > > Sorry for the delay here as got stuck with other issues. I tried your > ap

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig wrote: > The problem is that you don't even have a list of all previous revisions of > the root node state. Revisions are created on the fly and kept as needed. hmm yup. Then we would need to think of some other approach to know all the blobId referr

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Dürig
On 10.3.15 11:32, Chetan Mehrotra wrote: On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig wrote: SegmentMK doesn't even have the concept of a previous revision of a NodeState. Yes, that is to be thought about. I want to read all previous revisions for path /oak:index/lucene/:data. For segment

[RESULT][VOTE] Release Apache Jackrabbit Oak 1.0.12

2015-03-10 Thread Marcel Reutegger
Hi, the vote passes as follows: +1 Michael Dürig +1 Amit Jain +1 Alex Parvulescu +1 Davide Giannella +1 Julian Reschke +1 Thomas Mueller I'll push the release out. Thomas, your vote was a bit unclear. Your first statement was a +1 vote. Later you voiced concerns and suggested to not release the

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 3:33 PM, Michael Dürig wrote: > SegmentMK doesn't even have the concept of a previous revision of a > NodeState. Yes, that is to be thought about. I want to read all previous revisions for path /oak:index/lucene/:data. For segment I believe I would need to start at root refe

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Dürig
On 10.3.15 10:49 , Chetan Mehrotra wrote: For Segment I am not sure how to easily read previous revisions of given NodeState SegmentMK doesn't even have the concept of a previous revision of a NodeState. Michael

Re: Parallelize text extraction from binary fields

2015-03-10 Thread Chetan Mehrotra
> Is Oak already single instance when it comes to the identification and > storage of binaries ? Yes. Oak uses content-addressable storage for binaries. > Are the existing TextExtractors also single instance ? No. If the same binary is referenced in multiple places then text extraction would be perfor
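To illustrate what "single instance" text extraction could look like on top of content-addressable storage, here is a rough sketch that keys extracted text by a content hash so each distinct binary is extracted once. Everything in it is an assumption for illustration: `ExtractionCache` is a hypothetical class, SHA-256 stands in for the store's real identifier scheme, and the extractor is a plain function rather than Oak's or Tika's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not an Oak class.
class ExtractionCache {
    private final Map<String, String> textByBlobId = new ConcurrentHashMap<>();
    private final AtomicInteger extractions = new AtomicInteger();
    private final Function<byte[], String> extractor; // stand-in for a real text extractor

    ExtractionCache(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    // Content-addressed ID: identical bytes always map to the same ID.
    static String blobId(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the extractor only the first time a given content ID is seen.
    String extract(byte[] content) {
        return textByBlobId.computeIfAbsent(blobId(content), id -> {
            extractions.incrementAndGet();
            return extractor.apply(content);
        });
    }

    int extractionCount() {
        return extractions.get();
    }

    public static void main(String[] args) {
        ExtractionCache cache = new ExtractionCache(
                bytes -> new String(bytes, StandardCharsets.UTF_8).toUpperCase());
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        cache.extract(doc);
        cache.extract(doc.clone()); // same content referenced from a second place
        System.out.println("extractions run: " + cache.extractionCount());
    }
}
```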

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Chetan Mehrotra
On Tue, Mar 10, 2015 at 1:50 PM, Michael Marth wrote: > But I wonder: how do you envision that this new index cleanup would locate > indexes in the content-addressed DS That's a bit tricky. I have a rough idea of how to approach it, but it would require more thinking. The approach I am thinking of i

Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Hi Chetan, I like the idea. But I wonder: how do you envision that this new index cleanup would locate indexes in the content-addressed DS? Michael > On 10 Mar 2015, at 07:46, Chetan Mehrotra wrote: > > Hi Team, > > With storing of Lucene index files within DataStore our usage pattern > of D

Re: Parallelize text extraction from binary fields

2015-03-10 Thread Ian Boston
Hi, Is Oak already single instance when it comes to the identification and storage of binaries ? Are the existing TextExtractors also single instance ? By Single instance I mean, 1 copy of the binary and its token stream in the repository regardless of how many times its referenced. Best Regards I

Parallelize text extraction from binary fields

2015-03-10 Thread Chetan Mehrotra
LuceneIndexEditor currently extracts binary contents via Tika in the same thread that processes the commit. This approach does not make good use of multiprocessor systems, particularly when the index is being built as part of a migration. Looking at JR2 I see LazyTextExtractor
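As a rough illustration of moving extraction off the commit thread, the sketch below submits each binary to a fixed-size thread pool and collects the results in submission order. It is not how LuceneIndexEditor or JR2's LazyTextExtractor is implemented; `ParallelExtraction`, `extractAll`, and the `extractor` function (a stand-in for a Tika call) are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not an Oak class.
class ParallelExtraction {

    // Extracts text from every binary on a pool of worker threads.
    // Results are returned in the same order as the input list.
    static List<String> extractAll(List<byte[]> binaries,
                                   Function<byte[], String> extractor,
                                   int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] binary : binaries) {
                Callable<String> task = () -> extractor.apply(binary);
                futures.add(pool.submit(task));
            }
            List<String> texts = new ArrayList<>();
            for (Future<String> future : futures) {
                texts.add(future.get()); // blocks until that extraction finishes
            }
            return texts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> binaries = Arrays.asList(
                "first".getBytes(StandardCharsets.UTF_8),
                "second".getBytes(StandardCharsets.UTF_8));
        List<String> texts = extractAll(
                binaries, b -> new String(b, StandardCharsets.UTF_8), 2);
        System.out.println(texts);
    }
}
```

Here the caller still waits for all results; a lazy variant would hand the futures to the indexer and resolve them only when the text is actually needed.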