[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627613#comment-14627613 ] Chetan Mehrotra commented on OAK-2892:
--
Documentation updated at http://jackrabbit.apache.org/oak/docs/query/lucene.html#text-extraction
Feature merged to the 1.0 and 1.2 branches.

Speed up lucene indexing post migration by pre extracting the text content from binaries

Key: OAK-2892
URL: https://issues.apache.org/jira/browse/OAK-2892
Project: Jackrabbit Oak
Issue Type: New Feature
Components: lucene, run
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Labels: performance
Fix For: 1.2.3, 1.3.3, 1.0.18

While migrating large repositories, say with 3 M documents (250k PDFs), Lucene indexing takes a long time to complete (at times 4 days!). Currently the text extraction logic is coupled with Lucene indexing and hence is performed in single-threaded mode, which slows down the indexing process. Further, if reindexing has to be triggered, extraction has to be done all over again.

To speed up Lucene indexing we can decouple text extraction from the actual indexing. This is partly based on the discussion in OAK-2787:
# Introduce a new ExtractedTextProvider which can provide extracted text for a given Blob instance
# In oak-run introduce a new indexer mode - this would take a path in the repository, traverse the repository looking for existing binaries, and extract text from them

So before or after migration one can run this oak-run tool to create a store which already has the text extracted. Then post startup we need to wire up the ExtractedTextProvider instance (backed by the BlobStore populated before) and the indexing logic can just get content from it. This avoids performing expensive text extraction in the indexing thread.

See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
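The "store which already has the text extracted" could be sketched as a simple file-per-blob layout on disk, written by the traversal and read back at indexing time. This is a minimal sketch, assuming a stable blob identifier as the key; the class and method names are hypothetical, not Oak's shipped API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch (not Oak's actual API) of the proposed store: text
// extracted ahead of time is written to disk keyed by a stable blob id, so
// that indexing - and any later reindexing - can read it back instead of
// re-running expensive extraction in the indexing thread.
public class ExtractedTextStore {

    private final Path baseDir;

    public ExtractedTextStore(Path baseDir) {
        try {
            this.baseDir = Files.createDirectories(baseDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writer side: called by the (hypothetical) oak-run traversal per binary.
    public void put(String blobId, String extractedText) {
        try {
            Files.write(pathFor(blobId),
                    extractedText.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: consulted at indexing time; null means "not pre-extracted".
    public String get(String blobId) {
        Path p = pathFor(blobId);
        try {
            return Files.exists(p)
                    ? new String(Files.readAllBytes(p), StandardCharsets.UTF_8)
                    : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path pathFor(String blobId) {
        // One file per blob, similar in spirit to FileDataStore's layout
        return baseDir.resolve(blobId + ".txt");
    }
}
```

With such a store in place, the indexing thread only pays for a file read on a hit, which is the point of the decoupling.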
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626341#comment-14626341 ] Julian Reschke commented on OAK-2892:
--
Getting a test failure, probably related to Windows:

{noformat}
copyOnWriteAndLocks(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorTest)  Time elapsed: 0.11 sec  ERROR!
org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0003: Failed to index the node /test
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.addOrUpdate(LuceneIndexEditor.java:306)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:198)
	at org.apache.jackrabbit.oak.spi.commit.CompositeEditor.leave(CompositeEditor.java:74)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:130)
	at org.apache.jackrabbit.oak.plugins.memory.ModifiedNodeState.compareAgainstBaseState(ModifiedNodeState.java:396)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:52)
	at org.apache.jackrabbit.oak.spi.commit.EditorHook.processCommit(EditorHook.java:54)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorTest.copyOnWriteAndLocks(LuceneIndexEditorTest.java:376)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: java.io.IOException: Cannot overwrite: C:\tmp\junit308522119715585104\2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae\1\_1.fdt
	at org.apache.lucene.store.FSDirectory.ensureCanWrite(FSDirectory.java:293)
	at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:282)
	at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.createOutput(IndexCopier.java:848)
	at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.createOutput(IndexCopier.java:618)
	at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:44)
	at
{noformat}
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175 ] Chetan Mehrotra commented on OAK-2892:
--
Done initial implementation in http://svn.apache.org/r1690247. [~tmueller] Can you review the commit to see if the comments you made are addressed? If anything needs to be changed there, let me know.
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587746#comment-14587746 ] Chetan Mehrotra commented on OAK-2892:
--
[~tmueller]
bq. There should be a way to distinguish between not extracted and extraction failed and extraction returned no data
On the write side this is already being done. Not sure if we want to expose this on the reading side. As per the current implementation, if parsing fails then a sentinel token {{TextExtractionError}} is returned as the text.
bq. I would probably store the data in a async index. This is a new type of index, similar to the counter index.
For now the focus is on the one-off migration use case. For incremental indexing we would go with an async index, as discussed offline some time back.
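If the three states ever were exposed on the reading side, one way to do it is to map the stored value (absent / error token / text) onto an explicit result type instead of overloading a single String. A sketch, assuming the {{TextExtractionError}} sentinel from the comment above; the class and enum names are hypothetical, not Oak's API:

```java
// Hypothetical sketch of surfacing "not extracted" vs "extraction failed" vs
// "extracted" as distinct states on the read side. Only the sentinel token
// "TextExtractionError" comes from the discussion; everything else is
// illustrative.
public class ExtractionResult {

    public enum State { NOT_EXTRACTED, EXTRACTION_FAILED, EXTRACTED }

    private static final String ERROR_TOKEN = "TextExtractionError";

    private final State state;
    private final String text; // null unless state == EXTRACTED

    private ExtractionResult(State state, String text) {
        this.state = state;
        this.text = text;
    }

    // Maps the stored value (null / error token / text) to an explicit state.
    public static ExtractionResult fromStoredValue(String stored) {
        if (stored == null) {
            return new ExtractionResult(State.NOT_EXTRACTED, null);
        }
        if (ERROR_TOKEN.equals(stored)) {
            return new ExtractionResult(State.EXTRACTION_FAILED, null);
        }
        return new ExtractionResult(State.EXTRACTED, stored);
    }

    public State getState() { return state; }
    public String getText() { return text; }
}
```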
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587738#comment-14587738 ] Thomas Mueller commented on OAK-2892:
--
It might make sense to store / retrieve some more info. There should be a way to distinguish between "not extracted", "extraction failed", and "extraction returned no data". Maybe we should also store the content type (mimetype) if available (this sometimes doesn't match the file name suffix), the version and type of the text extraction tool used, and the information that extraction failed (and the reason for it). This is quite a rich set of information, so we should probably store the data in some other way, for example as nodes. I would probably store the data in an async index. This is a new type of index, similar to the counter index.
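The richer per-binary record described above could be sketched as a simple value class. All names here are hypothetical illustrations of the suggestion, not an Oak type:

```java
// Hypothetical sketch of the richer per-binary record suggested above:
// extracted text plus the detected content type, the extractor used, and any
// failure reason. None of these names come from Oak's actual API.
public class ExtractionRecord {

    private final String text;          // extracted text; null on failure
    private final String contentType;   // detected mimetype, may differ from file suffix
    private final String extractorInfo; // version/type of the extraction tool
    private final String failureReason; // null if extraction succeeded

    public ExtractionRecord(String text, String contentType,
                            String extractorInfo, String failureReason) {
        this.text = text;
        this.contentType = contentType;
        this.extractorInfo = extractorInfo;
        this.failureReason = failureReason;
    }

    public boolean isFailure() { return failureReason != null; }

    // "extraction returned no data" as distinct from "extraction failed"
    public boolean isEmpty() {
        return !isFailure() && (text == null || text.isEmpty());
    }

    public String getText() { return text; }
    public String getContentType() { return contentType; }
    public String getExtractorInfo() { return extractorInfo; }
    public String getFailureReason() { return failureReason; }
}
```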
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585526#comment-14585526 ] Chetan Mehrotra commented on OAK-2892:
--
There are 3 parts to this work:
# Scan binaries and pre-extract text - this part is dealt with in the OAK-2953 sub-task
# Persist the extracted text - there can be multiple ways to store the extracted text. For the migration case we can go for a simple solution where the extracted text is saved as files on the FS, similar to FileDataStore.
# Read from the pre-extracted text
{code:java}
package org.apache.jackrabbit.oak.plugins.index.fulltext;

public interface PreExtractedTextProvider {

    /**
     * Get pre extracted text for given blob
     *
     * @param propertyPath path of the binary property
     * @param blob binary property value
     *
     * @return pre extracted text or null if no pre extracted
     *         text found
     */
    @CheckForNull
    String getText(String propertyPath, Blob blob);
}
{code}
Of these, #1 can be made part of oak-run and would rely on #2, while #3 would be used by oak-lucene and oak-solr and would also rely on #2. So the proposal is:
# Add the interface from #3 to {{org.apache.jackrabbit.oak.plugins.index.fulltext}}
# Provide an implementation for the above (and also a writer to be used by oak-run) under {{org.apache.jackrabbit.oak.plugins.blob.datastore}}. The implementation can be enabled via OSGi config
# Have LuceneIndexEditor and SolrEditor use the interface from #1 to check if text is already extracted
[~tmueller] [~alex.parvulescu] [~teofili] Thoughts on the above?
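The editor-side check in proposal step 3 might look roughly like this. The provider's `getText` contract matches the interface above, but the surrounding class, the fallback method, and the local stand-in types are hypothetical, so the sketch is self-contained:

```java
// Sketch of the fallback flow in an index editor: consult the pre-extracted
// text provider first and only run the (expensive) in-process extraction when
// it returns null. Blob and PreExtractedTextProvider below are local
// stand-ins for Oak's types, kept minimal so this compiles on its own.
public class FallbackExtractionSketch {

    public interface Blob {
        String getContentIdentity();
    }

    public interface PreExtractedTextProvider {
        // null means "no pre-extracted text found" (matches the proposed API)
        String getText(String propertyPath, Blob blob);
    }

    private final PreExtractedTextProvider provider;

    public FallbackExtractionSketch(PreExtractedTextProvider provider) {
        this.provider = provider;
    }

    public String textFor(String propertyPath, Blob blob) {
        String preExtracted = provider == null
                ? null
                : provider.getText(propertyPath, blob);
        if (preExtracted != null) {
            return preExtracted; // cheap path: no parse in the indexing thread
        }
        return extractFullText(blob); // hypothetical expensive fallback
    }

    // Stand-in for real Tika-based extraction.
    private String extractFullText(Blob blob) {
        return "<extracted:" + blob.getContentIdentity() + ">";
    }
}
```

The key design point is that a null provider or a provider miss degrades gracefully to the current behaviour, so enabling the provider via OSGi config stays optional.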
[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576836#comment-14576836 ] Davide Giannella commented on OAK-2892:
---
Move to 1.3.1