[jira] Commented: (JCR-415) Enhance indexing of binary content
[ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12459384 ]

Marcel Reutegger commented on JCR-415:
--------------------------------------

I would like to get this change into the next major release (1.3) and propose the following changes:

- Create a new module jackrabbit-text-extractors which will initially contain the jackrabbit-extractor patch provided by Jukka
- Migrate the jackrabbit-text-filters into the new extractors module
- Add jackrabbit-text-filters as dependency to jackrabbit-core
- Remove the jackrabbit-text-filters module and do not create releases anymore for this module.

Jackrabbit would still support existing releases of jackrabbit-text-filters, but the TextFilter interface will be deprecated (see Jukka's patch) and developers are encouraged to use the new TextExtractor interface.

Does this make sense?

Enhance indexing of binary content
----------------------------------

Key: JCR-415
URL: http://issues.apache.org/jira/browse/JCR-415
Project: Jackrabbit
Issue Type: Improvement
Components: indexing
Affects Versions: 1.0, 1.0.1, 0.9
Reporter: Marcel Reutegger
Priority: Minor
Attachments: jackrabbit-extractor-r420472.patch, jackrabbit-query-r420472.patch, jackrabbit-query-r421461.patch, org.apache.jackrabbit.core.query-extractor.jpg, org.apache.jackrabbit.core.query.lucene-extractor.jpg, org.apache.jackrabbit.extractor.jpg

Indexing of binary content should be enhanced to either allow configuring which fields are indexed or provide better support for custom NodeIndexer implementations.

The current design has a couple of flaws that should be addressed at the same time:

- Reader instances are requested from the text filters even though the reader might never be used
- Only jcr:data properties of nt:resource nodes are fulltext indexed
- It is up to the text filter implementation to decide the Lucene field name for the text representation; this responsibility should be moved to the NodeIndexer. A text filter should only provide a Reader instance.

With those changes a custom NodeIndexer can then decide if a binary property has one or more representations in the index.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
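The split of responsibilities described in the issue (the extractor only returns a Reader, the indexer picks the field names) could look roughly like the sketch below. This is an illustration only, not Jackrabbit code: the TextExtractor interface is reduced to the single method used in this thread, and SketchIndexer, PlainTextExtractor and the field names "fulltext" and "fulltext:<property>" are hypothetical stand-ins.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldNameSketch {

    /** Assumed shape of the extractor: it only turns a stream into text. */
    interface TextExtractor {
        Reader extractText(InputStream stream, String type, String encoding)
                throws IOException;
    }

    /** Hypothetical extractor for plain text content. */
    static class PlainTextExtractor implements TextExtractor {
        @Override
        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            return new InputStreamReader(stream, encoding);
        }
    }

    /** Hypothetical stand-in for a NodeIndexer: it alone chooses field names. */
    static class SketchIndexer {
        private final TextExtractor extractor;

        SketchIndexer(TextExtractor extractor) {
            this.extractor = extractor;
        }

        /** Gives one binary value two representations in the (pretend) index. */
        Map<String, Reader> fieldsFor(String propertyName, byte[] data,
                                      String type, String encoding) throws IOException {
            Map<String, Reader> fields = new LinkedHashMap<>();
            fields.put("fulltext",
                    extractor.extractText(new ByteArrayInputStream(data), type, encoding));
            fields.put("fulltext:" + propertyName,
                    extractor.extractText(new ByteArrayInputStream(data), type, encoding));
            return fields;
        }
    }

    /** Drains a reader into a string and closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        reader.close();
        return text.toString();
    }

    /** Runs the sketch and returns one "field=text" line per index field. */
    static String demo() {
        try {
            byte[] data = "hello index".getBytes(StandardCharsets.UTF_8);
            Map<String, Reader> fields = new SketchIndexer(new PlainTextExtractor())
                    .fieldsFor("jcr:data", data, "text/plain", "UTF-8");
            StringBuilder out = new StringBuilder();
            for (Map.Entry<String, Reader> field : fields.entrySet()) {
                out.append(field.getKey()).append('=')
                   .append(readAll(field.getValue())).append('\n');
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the extractor never sees a field name, a different indexer could map the same binary value to one field, to several, or to none without touching any extractor.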
[jira] Commented: (JCR-415) Enhance indexing of binary content
[ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420256 ]

Jukka Zitting commented on JCR-415:
-----------------------------------

Marcel:
> NodeIndexer.addBinaryValue() is protected to allow subclasses to override it but it uses the private method getValue(). Thus getValue() should be protected final in order to be usable for a subclass.

OK.

> Extracting text should be deferred to the time when the lucene Document actually requests characters from the Reader that is assigned to a Field. See http://issues.apache.org/jira/browse/JCR-264.

I think it would make more design sense to try to postpone the creation of the Document instances instead of delaying text extraction. But I'm not too familiar with the details, so I'm OK with adding lazy reading to the mix.

In any case I think it's best to layer the lazy reading on top of the TextExtractor interface instead of below it. A utility class like the following could achieve this as long as the given InputStream remains valid until the document has been read.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Reader;

    class TextExtractorReader extends Reader {

        private final TextExtractor extractor;
        private final InputStream stream;
        private final String type;
        private final String encoding;

        // Created on the first read() call, null until then
        private Reader reader;

        public TextExtractorReader(
                TextExtractor extractor, InputStream stream,
                String type, String encoding) {
            this.extractor = extractor;
            this.stream = stream;
            this.type = type;
            this.encoding = encoding;
            this.reader = null;
        }

        public int read(char[] buffer, int offset, int length)
                throws IOException {
            // Start the text extraction only when characters are requested
            if (reader == null) {
                reader = extractor.extractText(stream, type, encoding);
            }
            return reader.read(buffer, offset, length);
        }

        public void close() throws IOException {
            // If nothing was ever read, close the untouched stream instead
            if (reader != null) {
                reader.close();
            } else {
                stream.close();
            }
        }

    }

I can update the query patch accordingly.
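To make the effect of the TextExtractorReader proposed in Jukka's comment easier to see, here is a small self-contained sketch. It repeats the wrapper, assumes a one-method TextExtractor interface inferred from the call in the comment, and adds a hypothetical CountingExtractor that records how often extraction is triggered. None of the helper names come from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LazyExtractionDemo {

    /** Assumed shape of the interface, inferred from the call in the comment. */
    interface TextExtractor {
        Reader extractText(InputStream stream, String type, String encoding)
                throws IOException;
    }

    /** Hypothetical extractor that counts how often extraction is triggered. */
    static class CountingExtractor implements TextExtractor {
        int calls = 0;

        @Override
        public Reader extractText(InputStream stream, String type, String encoding)
                throws IOException {
            calls++;
            return new InputStreamReader(stream, encoding);
        }
    }

    /** Lazy wrapper following the utility class from the comment. */
    static class TextExtractorReader extends Reader {
        private final TextExtractor extractor;
        private final InputStream stream;
        private final String type;
        private final String encoding;
        private Reader reader; // stays null until the first read()

        TextExtractorReader(TextExtractor extractor, InputStream stream,
                            String type, String encoding) {
            this.extractor = extractor;
            this.stream = stream;
            this.type = type;
            this.encoding = encoding;
        }

        @Override
        public int read(char[] buffer, int offset, int length) throws IOException {
            if (reader == null) {
                reader = extractor.extractText(stream, type, encoding);
            }
            return reader.read(buffer, offset, length);
        }

        @Override
        public void close() throws IOException {
            if (reader != null) {
                reader.close();
            } else {
                stream.close();
            }
        }
    }

    /** Reports the extraction count before and after the reader is consumed. */
    static String demo() {
        try {
            CountingExtractor extractor = new CountingExtractor();
            InputStream stream = new ByteArrayInputStream(
                    "lazy text".getBytes(StandardCharsets.UTF_8));
            Reader reader = new TextExtractorReader(
                    extractor, stream, "text/plain", "UTF-8");
            int callsBeforeRead = extractor.calls;

            StringBuilder text = new StringBuilder();
            char[] buffer = new char[16];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            reader.close();
            return "before=" + callsBeforeRead
                    + " after=" + extractor.calls + " text=" + text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Constructing the wrapper costs nothing. The extractor runs once, on the first read(), which is why the InputStream has to stay valid until the document has been read.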
Enhance indexing of binary content
----------------------------------

Key: JCR-415
URL: http://issues.apache.org/jira/browse/JCR-415
Project: Jackrabbit
Type: Improvement
Components: indexing
Versions: 1.0, 1.0.1, 0.9
Reporter: Marcel Reutegger
Priority: Minor
Fix For: 1.1
Attachments: jackrabbit-extractor-r420472.patch, jackrabbit-query-r420472.patch, org.apache.jackrabbit.core.query-extractor.jpg, org.apache.jackrabbit.core.query.lucene-extractor.jpg, org.apache.jackrabbit.extractor.jpg
[jira] Commented: (JCR-415) Enhance indexing of binary content
[ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420293 ]

Marcel Reutegger commented on JCR-415:
--------------------------------------

Jukka wrote:
> I think it would make more design sense to try to postpone the creation of the Document instances instead of delaying text extraction. But I'm not too familiar with the details, so I'm OK with adding lazy reading to the mix.
>
> In any case I think it's best to layer the lazy reading on top of the TextExtractor interface instead of below it. A utility class like the following could achieve this as long as the given InputStream remains valid until the document has been read.

Yes, you are right. I thought I could get away with the dirty solution ;)

While going through your patch I was actually also thinking about a design that creates the document only when it is really added to the index. For now we can maybe use the TextExtractorReader you proposed and then, in a next step, change the design to create the Document at a later stage of the indexing process.

Enhance indexing of binary content
----------------------------------

Key: JCR-415
URL: http://issues.apache.org/jira/browse/JCR-415
Project: Jackrabbit
Type: Improvement
Components: indexing
Versions: 1.0, 1.0.1, 0.9
Reporter: Marcel Reutegger
Priority: Minor
Fix For: 1.1
Attachments: jackrabbit-extractor-r420472.patch, jackrabbit-query-r420472.patch, org.apache.jackrabbit.core.query-extractor.jpg, org.apache.jackrabbit.core.query.lucene-extractor.jpg, org.apache.jackrabbit.extractor.jpg
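The follow-up idea in Marcel's comment, creating the Document only when it is really added to the index, could be approached by queuing a factory instead of a finished document. The sketch below only illustrates that idea and does not describe Jackrabbit's indexing code: PendingIndex and its methods are hypothetical, and a String stands in for a Lucene Document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class DeferredDocumentSketch {

    /** Hypothetical queue of pending index changes, keyed by node id. */
    static class PendingIndex {
        private final Map<String, Supplier<String>> pending = new LinkedHashMap<>();
        final List<String> index = new ArrayList<>();

        /** Registers how to build the document, without building it yet. */
        void add(String nodeId, Supplier<String> documentFactory) {
            pending.put(nodeId, documentFactory);
        }

        /** A node removed before the flush never has its document built. */
        void remove(String nodeId) {
            pending.remove(nodeId);
        }

        /** Builds the remaining documents and adds them to the index. */
        void flush() {
            for (Supplier<String> factory : pending.values()) {
                index.add(factory.get());
            }
            pending.clear();
        }
    }

    /** Counts how many documents were actually built before and after flush. */
    static String demo() {
        int[] built = {0};
        PendingIndex idx = new PendingIndex();
        idx.add("a", () -> { built[0]++; return "doc-a"; });
        idx.add("b", () -> { built[0]++; return "doc-b"; });
        idx.remove("b");
        int builtBeforeFlush = built[0];
        idx.flush();
        return "beforeFlush=" + builtBeforeFlush
                + " afterFlush=" + built[0] + " index=" + idx.index;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this arrangement the cost of building a document, including any text extraction, is paid only for entries that survive until the flush.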
[jira] Commented: (JCR-415) Enhance indexing of binary content
[ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12418774 ]

Jukka Zitting commented on JCR-415:
-----------------------------------

See a related email thread at http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/7609

Enhance indexing of binary content
----------------------------------

Key: JCR-415
URL: http://issues.apache.org/jira/browse/JCR-415
Project: Jackrabbit
Type: Improvement
Components: indexing
Versions: 1.0, 1.0.1, 0.9
Reporter: Marcel Reutegger
Priority: Minor
Fix For: 1.1