[jira] Commented: (JCR-415) Enhance indexing of binary content

2006-12-18 Thread Marcel Reutegger (JIRA)
[ 
http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12459384 ] 

Marcel Reutegger commented on JCR-415:
--

I would like to get this change into the next major release (1.3) and propose 
the following changes:

- Create a new module jackrabbit-text-extractors which will initially contain 
the jackrabbit-extractor patch provided by Jukka
- Migrate the jackrabbit-text-filters into the new extractors module
- Add jackrabbit-text-filters as dependency to jackrabbit-core
- Remove the jackrabbit-text-filters module and stop creating releases for
it. Jackrabbit would still support existing releases of
jackrabbit-text-filters, but the TextFilter interface will be deprecated (see
Jukka's patch) and developers are encouraged to use the new TextExtractor
interface.

Does this make sense?

 Enhance indexing of binary content
 --

 Key: JCR-415
 URL: http://issues.apache.org/jira/browse/JCR-415
 Project: Jackrabbit
  Issue Type: Improvement
  Components: indexing
Affects Versions: 1.0, 1.0.1, 0.9
Reporter: Marcel Reutegger
Priority: Minor
 Attachments: jackrabbit-extractor-r420472.patch, 
 jackrabbit-query-r420472.patch, jackrabbit-query-r421461.patch, 
 org.apache.jackrabbit.core.query-extractor.jpg, 
 org.apache.jackrabbit.core.query.lucene-extractor.jpg, 
 org.apache.jackrabbit.extractor.jpg


 Indexing of binary content should be enhanced to allow either configuring
 which fields are indexed or better support for custom NodeIndexer
 implementations.
 The current design has a couple of flaws that should be addressed at the same
 time:
 - Reader instances are requested from the text filters even though the reader
 might never be used
 - Only jcr:data properties of nt:resource nodes are fulltext indexed
 - It is up to the text filter implementation to decide the Lucene field name
 for the text representation. This responsibility should be moved to the
 NodeIndexer; a text filter should only provide a Reader instance.
 With those changes a custom NodeIndexer can then decide whether a binary
 property has one or more representations in the index.
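
 The split of responsibilities described above could look roughly like the
 following sketch. It is not Jackrabbit code: the TextExtractor stand-in,
 SketchNodeIndexer, TwoFieldIndexer, IndexerDemo and the field names are all
 hypothetical, and a plain Map stands in for a Lucene Document. Lazy creation
 of the Reader instances is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: an extractor only turns a binary stream into a Reader.
interface TextExtractor {
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

// Hypothetical indexer: it alone decides which index fields a binary property gets.
class SketchNodeIndexer {
    private final TextExtractor extractor;

    SketchNodeIndexer(TextExtractor extractor) {
        this.extractor = extractor;
    }

    // Subclasses override this to add or rename representations.
    protected List<String> fieldNamesFor(String propertyName) {
        return Collections.singletonList("fulltext");
    }

    // A Map stands in for a Lucene Document: field name -> Reader.
    Map<String, Reader> indexBinary(String propertyName, byte[] data,
                                    String type, String encoding)
            throws IOException {
        Map<String, Reader> fields = new LinkedHashMap<>();
        for (String field : fieldNamesFor(propertyName)) {
            // One fresh stream per field, since a Reader can only be consumed once.
            fields.put(field, extractor.extractText(
                    new ByteArrayInputStream(data), type, encoding));
        }
        return fields;
    }
}

// Custom indexer that gives every binary property two representations.
class TwoFieldIndexer extends SketchNodeIndexer {
    TwoFieldIndexer(TextExtractor extractor) {
        super(extractor);
    }

    @Override
    protected List<String> fieldNamesFor(String propertyName) {
        return Arrays.asList("fulltext", "fulltext:" + propertyName);
    }
}

class IndexerDemo {
    // Returns the field names the custom indexer chose for one binary property.
    static String run() {
        TextExtractor plainText =
                (stream, type, encoding) -> new InputStreamReader(stream, encoding);
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        try {
            return new TwoFieldIndexer(plainText)
                    .indexBinary("jcr:data", data, "text/plain", "UTF-8")
                    .keySet().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

 The extractor never sees a field name here; only the overridden
 fieldNamesFor() method changes how many representations end up in the index.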

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (JCR-415) Enhance indexing of binary content

2006-07-11 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420256 ] 

Jukka Zitting commented on JCR-415:
---

Marcel:
 NodeIndexer.addBinaryValue() is protected to allow subclasses to override it
 but it uses the private method getValue(). Thus getValue() should be
 protected final in order to be usable for a subclass.

OK.

 Extracting text should be deferred to the time when the lucene Document
 actually requests characters from the Reader that is assigned to a Field. See
 http://issues.apache.org/jira/browse/JCR-264.

I think it would make more design sense to try to postpone the creation of the 
Document instances instead of delaying text extraction. But I'm not too 
familiar with the details, so I'm OK with adding lazy reading to the mix. In 
any case I think it's best to layer the lazy reading on top of the 
TextExtractor interface instead of below it. A utility class like the following 
could achieve this as long as the given InputStream remains valid until the 
document has been read.

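import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

// TextExtractor is the interface introduced by the jackrabbit-extractor patch.

/** Reader that defers text extraction until the first read() call. */
class TextExtractorReader extends Reader {

    private final TextExtractor extractor;
    private final InputStream stream;
    private final String type;
    private final String encoding;

    // Created lazily on the first read; null until then.
    private Reader reader;

    public TextExtractorReader(
            TextExtractor extractor, InputStream stream,
            String type, String encoding) {
        this.extractor = extractor;
        this.stream = stream;
        this.type = type;
        this.encoding = encoding;
        this.reader = null;
    }

    public int read(char[] buffer, int offset, int length)
            throws IOException {
        if (reader == null) {
            // First read: run the extractor now.
            reader = extractor.extractText(stream, type, encoding);
        }
        return reader.read(buffer, offset, length);
    }

    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        } else {
            // Nothing was ever read, so release the untouched stream.
            stream.close();
        }
    }

}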

I can update the query patch accordingly.
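
The lazy behaviour of such a wrapper can be checked with a small,
self-contained sketch. The TextExtractor interface below is only a stand-in
whose signature is assumed from the extractText() call above, and
CountingExtractor and LazyReaderDemo are hypothetical helper names, not part
of any patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Stand-in for the TextExtractor interface (signature assumed, see lead-in).
interface TextExtractor {
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

// Hypothetical extractor that counts how often extraction is actually triggered.
class CountingExtractor implements TextExtractor {
    int calls = 0;

    @Override
    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        calls++;
        return new InputStreamReader(stream, encoding);
    }
}

// The lazy wrapper, repeated here so the sketch compiles on its own.
class TextExtractorReader extends Reader {
    private final TextExtractor extractor;
    private final InputStream stream;
    private final String type;
    private final String encoding;
    private Reader reader;

    TextExtractorReader(TextExtractor extractor, InputStream stream,
                        String type, String encoding) {
        this.extractor = extractor;
        this.stream = stream;
        this.type = type;
        this.encoding = encoding;
    }

    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        if (reader == null) {
            reader = extractor.extractText(stream, type, encoding);
        }
        return reader.read(buffer, offset, length);
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        } else {
            stream.close();
        }
    }
}

class LazyReaderDemo {
    // Returns "<calls before reading>|<extracted text>|<calls after reading>".
    static String run() {
        CountingExtractor extractor = new CountingExtractor();
        InputStream in = new ByteArrayInputStream(
                "hello".getBytes(StandardCharsets.UTF_8));
        try (Reader lazy = new TextExtractorReader(
                extractor, in, "text/plain", "UTF-8")) {
            int before = extractor.calls; // wrapper exists, nothing extracted yet
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[16];
            int n;
            while ((n = lazy.read(buffer, 0, buffer.length)) != -1) {
                text.append(buffer, 0, n);
            }
            return before + "|" + text + "|" + extractor.calls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Constructing the wrapper does not touch the extractor; extraction happens once,
on the first read() call.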


 Enhance indexing of binary content
 --

  Key: JCR-415
  URL: http://issues.apache.org/jira/browse/JCR-415
  Project: Jackrabbit
 Type: Improvement

   Components: indexing
 Versions: 1.0, 1.0.1, 0.9
 Reporter: Marcel Reutegger
 Priority: Minor
  Fix For: 1.1
  Attachments: jackrabbit-extractor-r420472.patch, 
 jackrabbit-query-r420472.patch, 
 org.apache.jackrabbit.core.query-extractor.jpg, 
 org.apache.jackrabbit.core.query.lucene-extractor.jpg, 
 org.apache.jackrabbit.extractor.jpg





[jira] Commented: (JCR-415) Enhance indexing of binary content

2006-07-11 Thread Marcel Reutegger (JIRA)
[ 
http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420293 ] 

Marcel Reutegger commented on JCR-415:
--

Jukka wrote:
 I think it would make more design sense to try to postpone the creation of
 the Document instances instead of delaying text extraction. But I'm not too
 familiar with the details, so I'm OK with adding lazy reading to the mix. In
 any case I think it's best to layer the lazy reading on top of the
 TextExtractor interface instead of below it. A utility class like the
 following could achieve this as long as the given InputStream remains valid
 until the document has been read.

Yes, you are right. I thought I could get away with the dirty solution ;)
While going through your patch I was actually also thinking about a design that
should create the document only when it is really added to the index.
For now we can maybe use the TextExtractorReader you proposed and then in a
next step change the design to create the Document in a later stage of the
indexing process.
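
A minimal sketch of what creating the document at a later stage could mean,
assuming the index queues document factories instead of finished documents.
All names here (SketchDocument, DeferredIndex, DeferredIndexDemo) are
hypothetical and do not correspond to existing Jackrabbit or Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a Lucene Document.
class SketchDocument {
    final String nodeId;

    SketchDocument(String nodeId) {
        this.nodeId = nodeId;
    }
}

// Hypothetical index that queues document factories and only builds
// a document at the moment it is written.
class DeferredIndex {
    private final List<Supplier<SketchDocument>> pending = new ArrayList<>();
    private final List<SketchDocument> written = new ArrayList<>();

    // Nothing is built here; the factory is just remembered.
    void enqueue(Supplier<SketchDocument> factory) {
        pending.add(factory);
    }

    // Documents are created here, not at enqueue time.
    int flush() {
        int count = 0;
        for (Supplier<SketchDocument> factory : pending) {
            written.add(factory.get());
            count++;
        }
        pending.clear();
        return count;
    }
}

class DeferredIndexDemo {
    // Returns "<documents built before flush>|<documents flushed>|<documents built after flush>".
    static String run() {
        int[] created = {0};
        DeferredIndex index = new DeferredIndex();
        index.enqueue(() -> {
            created[0]++;
            return new SketchDocument("node-1");
        });
        int before = created[0];
        int flushed = index.flush();
        return before + "|" + flushed + "|" + created[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With this shape, text extraction would not need its own lazy wrapper, because
no document (and therefore no Reader) exists until flush() runs.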

 Enhance indexing of binary content
 --

  Key: JCR-415
  URL: http://issues.apache.org/jira/browse/JCR-415
  Project: Jackrabbit
 Type: Improvement

   Components: indexing
 Versions: 1.0, 1.0.1, 0.9
 Reporter: Marcel Reutegger
 Priority: Minor
  Fix For: 1.1
  Attachments: jackrabbit-extractor-r420472.patch, 
 jackrabbit-query-r420472.patch, 
 org.apache.jackrabbit.core.query-extractor.jpg, 
 org.apache.jackrabbit.core.query.lucene-extractor.jpg, 
 org.apache.jackrabbit.extractor.jpg





[jira] Commented: (JCR-415) Enhance indexing of binary content

2006-07-01 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12418774 ] 

Jukka Zitting commented on JCR-415:
---

See a related email thread at 
http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/7609

 Enhance indexing of binary content
 --

  Key: JCR-415
  URL: http://issues.apache.org/jira/browse/JCR-415
  Project: Jackrabbit
 Type: Improvement

   Components: indexing
 Versions: 1.0, 1.0.1, 0.9
 Reporter: Marcel Reutegger
 Priority: Minor
  Fix For: 1.1


