[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-07-15 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627613#comment-14627613
 ] 

Chetan Mehrotra commented on OAK-2892:
--

Documentation updated in 
http://jackrabbit.apache.org/oak/docs/query/lucene.html#text-extraction

Feature merged to 1.0 and 1.2 branches

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.2.3, 1.3.3, 1.0.18


 While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
 indexing takes a long time to complete (at times 4 days!). Currently the text 
 extraction logic is coupled with Lucene indexing and hence runs in a 
 single-threaded mode, which slows down the indexing process. Further, if 
 reindexing has to be triggered, the extraction has to be done all over again.
 To speed up Lucene indexing we can decouple the text extraction
 from the actual indexing. This is partly based on the discussion in OAK-2787.
 # Introduce a new ExtractedTextProvider which can provide extracted text for 
 a given Blob instance
 # In oak-run introduce a new indexer mode. This would take a path in the 
 repository, traverse the repository from there, look for existing 
 binaries and extract text from them
 So before or after the migration one can run this oak-run tool to create 
 a store which holds the already extracted text. Then post startup we need to 
 wire up the ExtractedTextProvider instance (backed by the BlobStore 
 populated before) and the indexing logic can just get content from that. This 
 would avoid performing expensive text extraction in the indexing thread.
 See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
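
 The lookup-then-fallback flow described above can be sketched as follows. This is a minimal illustration, not the Oak implementation: the interface is reduced to a String blob id instead of a Blob, the store is an in-memory map rather than a BlobStore, and the names InMemoryExtractedTextProvider and TextResolver are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Simplified stand-in for the provider proposed in the issue (assumption: keyed by a String blob id). */
interface ExtractedTextProvider {
    /** Returns text extracted earlier for the blob id, or null if none is stored. */
    String getText(String blobId);
}

/** In-memory store used only for illustration; the issue proposes a BlobStore-backed one. */
class InMemoryExtractedTextProvider implements ExtractedTextProvider {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void put(String blobId, String text) {
        store.put(blobId, text);
    }

    @Override
    public String getText(String blobId) {
        return store.get(blobId);
    }
}

/** Resolves the text to index: pre-extracted text first, expensive extraction only as a fallback. */
class TextResolver {
    private final ExtractedTextProvider provider;
    private final Function<String, String> extractor;
    private int extractionCount;

    TextResolver(ExtractedTextProvider provider, Function<String, String> extractor) {
        this.provider = provider;
        this.extractor = extractor;
    }

    String resolve(String blobId) {
        String text = provider.getText(blobId);
        if (text != null) {
            return text; // already extracted before indexing started
        }
        extractionCount++; // cache miss: extraction runs in the calling (indexing) thread
        return extractor.apply(blobId);
    }

    int getExtractionCount() {
        return extractionCount;
    }
}
```

 With such a split, the indexing thread only pays the extraction cost for binaries that the pre-extraction run did not cover.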



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-07-14 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626341#comment-14626341
 ] 

Julian Reschke commented on OAK-2892:
-

Getting a test failure, probably related to Windows:

copyOnWriteAndLocks(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorTest)  Time elapsed: 0.11 sec   ERROR!
org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0003: Failed to index the node /test
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.addOrUpdate(LuceneIndexEditor.java:306)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:198)
    at org.apache.jackrabbit.oak.spi.commit.CompositeEditor.leave(CompositeEditor.java:74)
    at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:130)
    at org.apache.jackrabbit.oak.plugins.memory.ModifiedNodeState.compareAgainstBaseState(ModifiedNodeState.java:396)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:52)
    at org.apache.jackrabbit.oak.spi.commit.EditorHook.processCommit(EditorHook.java:54)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorTest.copyOnWriteAndLocks(LuceneIndexEditorTest.java:376)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
    at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: java.io.IOException: Cannot overwrite: C:\tmp\junit308522119715585104\2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae\1\_1.fdt
    at org.apache.lucene.store.FSDirectory.ensureCanWrite(FSDirectory.java:293)
    at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:282)
    at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.createOutput(IndexCopier.java:848)
    at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.createOutput(IndexCopier.java:618)
    at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:44)
at 

[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-07-10 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175
 ] 

Chetan Mehrotra commented on OAK-2892:
--

Done initial implementation in http://svn.apache.org/r1690247

[~tmueller] Can you review the commit to see if the comments you made are 
addressed? If anything needs to be changed there, let me know.

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.3, 1.0.18




[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-06-16 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587746#comment-14587746
 ] 

Chetan Mehrotra commented on OAK-2892:
--

[~tmueller]

bq.  There should be a way to distinguish between not extracted and 
extraction failed and extraction returned no data

On the write side this is already being done. Not sure if we want to expose this on 
the reading side. As per the current implementation, if parsing fails then 
a token text, {{TextExtractionError}}, is returned.

bq. I would probably store the data in a async index. This is a new type of 
index, similar to the counter index.

For now the focus is on the one-off migration use case. For incremental indexing we 
would go with an async index, as discussed offline some time back.
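
To make the three outcomes discussed above concrete, here is a minimal sketch of how a reader could distinguish them if the status were exposed. The type and method names (ExtractionStatus, ExtractionResult, textForIndexing) are hypothetical and not part of Oak; only the {{TextExtractionError}} token comes from the comment above.

```java
/** Outcome of a pre-extraction attempt for one binary (hypothetical names). */
enum ExtractionStatus { NOT_EXTRACTED, FAILED, EMPTY, EXTRACTED }

final class ExtractionResult {
    /** Token text mentioned in the comment above for parse failures. */
    static final String ERROR_TOKEN = "TextExtractionError";

    final ExtractionStatus status;
    final String text;

    private ExtractionResult(ExtractionStatus status, String text) {
        this.status = status;
        this.text = text;
    }

    static ExtractionResult notExtracted() {
        return new ExtractionResult(ExtractionStatus.NOT_EXTRACTED, null);
    }

    static ExtractionResult failed() {
        return new ExtractionResult(ExtractionStatus.FAILED, null);
    }

    /** Successful extraction; an empty string is tracked separately from real text. */
    static ExtractionResult of(String text) {
        return text.isEmpty()
                ? new ExtractionResult(ExtractionStatus.EMPTY, "")
                : new ExtractionResult(ExtractionStatus.EXTRACTED, text);
    }

    /**
     * Collapses the status into the single string a reader would see.
     * A null return means "nothing was pre-extracted, extract now".
     */
    String textForIndexing() {
        switch (status) {
            case EXTRACTED:
                return text;
            case EMPTY:
                return "";
            case FAILED:
                return ERROR_TOKEN;
            default:
                return null;
        }
    }
}
```

Collapsing to a single string keeps the reading side simple, at the cost of hiding why a given text is empty or an error token.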

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.1, 1.0.16




[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-06-16 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587738#comment-14587738
 ] 

Thomas Mueller commented on OAK-2892:
-

It might make sense to store / retrieve some more info. There should be a way 
to distinguish between not extracted, extraction failed, and extraction 
returned no data. Maybe we should also store the content type (mimetype) if 
available (this sometimes doesn't match the file name suffix), the version and 
type of the text extraction tool used, and the information that extraction 
failed (and the reason for it). This is quite a rich set of information, so we 
should probably store the data in some other way, for example as nodes.

I would probably store the data in an async index. This is a new type of index, 
similar to the counter index.
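
As an illustration of the extra information listed above, the following sketch models it as a small record that can be flattened into name/value pairs, for example to be stored as properties of a node. The class name, field names and property names are all hypothetical; nothing here reflects an existing Oak schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical metadata kept next to extracted text; all names are illustrative. */
final class ExtractionInfo {
    final String mimeType;         // may be null if unknown
    final String extractorName;    // type of the text extraction tool
    final String extractorVersion; // version of the text extraction tool
    final String failureReason;    // null when extraction succeeded

    ExtractionInfo(String mimeType, String extractorName,
                   String extractorVersion, String failureReason) {
        this.mimeType = mimeType;
        this.extractorName = extractorName;
        this.extractorVersion = extractorVersion;
        this.failureReason = failureReason;
    }

    boolean failed() {
        return failureReason != null;
    }

    /** Flattens the record into name/value pairs; optional fields are skipped when absent. */
    Map<String, String> toProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        if (mimeType != null) {
            props.put("mimeType", mimeType);
        }
        props.put("extractorName", extractorName);
        props.put("extractorVersion", extractorVersion);
        if (failureReason != null) {
            props.put("failureReason", failureReason);
        }
        return props;
    }
}
```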



 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.1, 1.0.16




[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-06-15 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585526#comment-14585526
 ] 

Chetan Mehrotra commented on OAK-2892:
--

There are 3 parts to this work:

# Scan binaries and pre-extract text - This part is dealt with in the OAK-2953 
sub-task
# Persisting extracted text - There can be multiple ways to store the extracted 
text. For the migration case we can go for a simple solution where the extracted 
text is saved as a file on the FS, similar to FileDataStore. 
# Read from pre-extracted text
{code:java}
package org.apache.jackrabbit.oak.plugins.index.fulltext;

import javax.annotation.CheckForNull;

import org.apache.jackrabbit.oak.api.Blob;

public interface PreExtractedTextProvider {

    /**
     * Get pre extracted text for given blob
     *
     * @param propertyPath path of the binary property
     * @param blob binary property value
     *
     * @return pre extracted text or null if no pre extracted
     * text found
     */
    @CheckForNull
    String getText(String propertyPath, Blob blob);
}
{code}

Of these, #1 can be made part of oak-run and it would rely on #2, while #3 would 
be used by oak-lucene and oak-solr and would also rely on #2. So the proposal 
is:

# Add the interface from #3 to {{org.apache.jackrabbit.oak.plugins.index.fulltext}}
# Provide an implementation for the above (and also a writer to be used by oak-run) 
under {{org.apache.jackrabbit.oak.plugins.blob.datastore}}. The implementation 
can be enabled via OSGi config
# Have LuceneIndexEditor and SolrEditor use the interface in #1 to check if 
text is already extracted
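
A file-system backed store of the kind proposed above, with a writer for the pre-extraction tool and a reader for the indexer, could look roughly like this. It is a sketch under assumptions: the class name FileTextStore is hypothetical, blobs are identified by a String id that is safe to use as a file name (such as a hex digest), and the two-level directory sharding only imitates the general idea of FileDataStore rather than its exact layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical store keeping one UTF-8 text file per blob id under a root directory. */
class FileTextStore {
    private final Path root;

    FileTextStore(Path root) {
        this.root = root;
    }

    /** Maps a blob id to a sharded path, e.g. "abcdef12" -> root/ab/cd/abcdef12 (assumed layout). */
    Path fileFor(String blobId) {
        if (blobId.length() < 4) {
            return root.resolve(blobId); // too short to shard
        }
        return root.resolve(blobId.substring(0, 2))
                   .resolve(blobId.substring(2, 4))
                   .resolve(blobId);
    }

    /** Writer side: called by the pre-extraction tool for each binary it processes. */
    void write(String blobId, String text) {
        Path file = fileFor(blobId);
        try {
            Files.createDirectories(file.getParent());
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reader side: returns the stored text, or null if nothing was pre-extracted for the id. */
    String getText(String blobId) {
        Path file = fileFor(blobId);
        if (!Files.isRegularFile(file)) {
            return null;
        }
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning null on a miss matches the contract of the interface above, so an editor can fall back to regular extraction when no file exists.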

[~tmueller] [~alex.parvulescu] [~teofili] Thoughts on the above?

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.1, 1.0.16




[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

2015-06-08 Thread Davide Giannella (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576836#comment-14576836
 ] 

Davide Giannella commented on OAK-2892:
---

Move to 1.3.1

 Speed up lucene indexing post migration by pre extracting the text content 
 from binaries
 

 Key: OAK-2892
 URL: https://issues.apache.org/jira/browse/OAK-2892
 Project: Jackrabbit Oak
  Issue Type: New Feature
  Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
  Labels: performance
 Fix For: 1.3.1, 1.0.15

