[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-08-10 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414854#comment-15414854
 ] 

Thomas Mueller commented on OAK-4585:
-

http://svn.apache.org/r1755681 (1.4 branch)

> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
> Fix For: 1.5.8, 1.4.7
>
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-08-04 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407381#comment-15407381
 ] 

Thomas Mueller commented on OAK-4585:
-

I will backport this to 1.4 as well.

> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
> Fix For: 1.4.6, 1.5.7
>
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-08-04 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407380#comment-15407380
 ] 

Thomas Mueller commented on OAK-4585:
-

http://svn.apache.org/r1755145 (trunk)

Thanks Chetan. 

* Path: I now mainly log the path, as I see it's better than the content 
identity for the SegmentStore case. I still log the content identity for the 
FileDataStore case, so it should be easy to get hold of the binary (and even if 
the repository is not available).
* Trace level: I still use debug level, not sure why trace level would be 
better? I think people mainly use debug level when trying to find problems, and 
this doesn't seem to increase the log file dramatically.
* Log time taken, source size, extracted text size: done
* By the way there was a bug in the first patch, Long.parseLong instead of 
Long.getLong.


> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
> Fix For: 1.4.6, 1.5.7
>
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-07-21 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387714#comment-15387714
 ] 

Chetan Mehrotra commented on OAK-4585:
--

Few minor points
* Instead of content identity we can use path
* Might be better to log this at trace and also log after finish time taken, 
source size, extracted text size

> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-07-21 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387684#comment-15387684
 ] 

Thomas Mueller commented on OAK-4585:
-

[~chetanm] could you review the patch please?

Please note Thread.setName is a native method, so it's a little bit slow, 
that's why I only did that for binaries larger than 16 KB. Microbenchmark to 
verify Thread.setName has measurable performance effect if used for very small 
binaries, but not if only used for binaries larger than 16 KB:

{noformat}
   public static void main(String... args) throws Exception {
test();
test();
test();
}

static void test() {
int sum = 0;
for (int size = 16; size < 32 * 1024; size *= 2) {
System.out.print(size + " bytes: with Thread.setName ");
sum += test(size, size);
System.out.print("; without: ");
sum += test(size, size * 10);
System.out.println();
}
System.out.println("dummy " + sum);
}

static int test(int size, int min) {
Random r = new Random(1);
byte[] data = new byte[size];
r.nextBytes(data);
long time = System.currentTimeMillis();
int sum = 0;
for (int i = 0; i < (1024 * 1024 * 1024) / size; i++) {
Thread t = Thread.currentThread();
String oldName = null;
if (data.length >= min) {
oldName = t.getName();
t.setName(oldName + ": Extrating " + i + ": " + data.length);
}
for (int j = 0; j < data.length; j++) {
sum += data[j];
}
if (oldName != null) {
t.setName(oldName);
}
}
System.out.print((System.currentTimeMillis() - time) + " ms");
return sum;
}

16 bytes: with Thread.setName 5102 ms; without: 1135 ms
32 bytes: with Thread.setName 2790 ms; without: 953 ms
64 bytes: with Thread.setName 1787 ms; without: 857 ms
128 bytes: with Thread.setName 1295 ms; without: 806 ms
256 bytes: with Thread.setName 1041 ms; without: 788 ms
512 bytes: with Thread.setName 893 ms; without: 770 ms
1024 bytes: with Thread.setName 844 ms; without: 781 ms
2048 bytes: with Thread.setName 807 ms; without: 780 ms
4096 bytes: with Thread.setName 784 ms; without: 772 ms
8192 bytes: with Thread.setName 782 ms; without: 767 ms
16384 bytes: with Thread.setName 766 ms; without: 774 ms
dummy -588513280
{noformat}


> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-4585) Text extraction: runtime status monitoring

2016-07-21 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387306#comment-15387306
 ] 

Thomas Mueller commented on OAK-4585:
-

Proposed patch (not yet tested):

{noformat}
--- 
src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditor.java
 (revision 1751620)
+++ 
src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditor.java
 (working copy)
@@ -85,6 +85,9 @@
 
 private static final Logger log =
 LoggerFactory.getLogger(LuceneIndexEditor.class);
+
+private static final long SMALL_BINARY = 
Long.parseLong("oak.lucene.smallBinary", 16 * 1024);
+
 static final String TEXT_EXTRACTION_ERROR = "TextExtractionError";
 
 private final LuceneIndexEditorContext context;
@@ -948,6 +951,14 @@
 WriteOutContentHandler handler = new 
WriteOutContentHandler(context.getDefinition().getMaxExtractLength());
 long start = System.currentTimeMillis();
 long bytesRead = 0;
+long length = v.length();
+log.debug("Extracting {}, {} bytes", v.getContentIdentity(), length);
+String oldThreadName = null;
+if (length > SMALL_BINARY) {
+Thread t = Thread.currentThread();
+oldThreadName = t.getName();
+t.setName(oldThreadName + ": Extracting " + v.getContentIdentity() 
+ ", " + length + " bytes");
+}
 try {
 CountingInputStream stream = new CountingInputStream(new 
LazyInputStream(new BlobByteSource(v)));
 try {
@@ -974,6 +985,10 @@
 context.getExtractedTextCache().put(v, ExtractedText.ERROR);
 return TEXT_EXTRACTION_ERROR;
 }
+} finally {
+if (oldThreadName != null) {
+Thread.currentThread().setName(oldThreadName);
+}
 }
 String result = handler.toString();
 if (bytesRead > 0) {
{noformat}

> Text extraction: runtime status monitoring
> --
>
> Key: OAK-4585
> URL: https://issues.apache.org/jira/browse/OAK-4585
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>
> Text extraction is sometimes slow, and, in case of a bug in the text 
> extraction library, can even get stuck in an endless loop.
> Right now, it is not easy to understand what is going on, even when looking 
> at full thread dumps. (Debug) log information about the current state of text 
> extraction would be nice as well.
> I suggest we add debug level logging for the current extracted binary 
> (content identity). For larger binaries, we can also temporarily set the 
> thread name (append "Extracting "). That way, it is 
> relatively easy to see if text extraction is stuck simply looking at full 
> thread dumps, without having to change the log level and then reindex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)