[GitHub] jena pull request: Jena-text multilingual alternative implementati...

osma Wed, 13 May 2015 12:14:14 -0700

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/64#issuecomment-101780700
  
    This looks very promising. As I've said before, I think a single index is 
preferable to multiple indexes. It should require less book-keeping overall.
    
    I have to say I don't like the current division of labor between 
TextIndexLucene and TextIndexLuceneMultilingual, with the latter having an 
isMultilingual() method that returns true, and all the actual multilingual 
support in TextIndexLucene.
    
    I think that ideally, TextIndexLucene should not be in any way aware of the 
multilingual aspects and should merely provide hooks or extension/override 
points that can then be used by TextIndexLuceneMultilingual to provide 
multilingual functionality.
    
    As an example, instead of this (in TextIndexLucene around line 230):
    
    ```java
                Analyzer analyzer = null;
                if (isMultilingual())
                    analyzer = 
LuceneUtil.getLocalizedAnalyzer(entity.getLanguage());
    
                if (analyzer != null)
                    indexWriter.addDocument(doc, analyzer) ;
                else //use the default one
                    indexWriter.addDocument(doc) ;
    ```
    
    you could have something like this:
    
    ```java
    /* at the end of TextIndexLucene.addEntity() */
          addDocument(entity);
      }
    
      /* extension point; can be overridden if necessary */
      protected void addDocument(Entity entity) {
           Document doc = doc(entity);
           indexWriter.addDocument(doc);
      }
    
    /* in TextIndexLuceneMultilingual */
    
      @Override
      protected void addDocument(Entity entity) {
           Document doc = doc(entity);
           Analyzer analyzer = 
LuceneUtil.getLocalizedAnalyzer(entity.getLanguage());
           indexWriter.addDocument(doc, analyzer);
      }
    ```
    
    I understand that compromises sometimes have to be made, but putting all 
the multilingual support code in the TextIndexLucene class is, I think, pushing 
it too far. It also makes it hard to extend TextIndexLucene for other purposes.
    
    Another thing: I don't quite understand the need to handle 3 letter 
language tags. AFAIU the relevant standards (RDF 1.0, XML 1.0, RDF 1.1) all 
specify that language tags should be formatted according to BCP 47 / RFC 4646 
which, I think, states that when a 2 letter ISO 639-1 language tag exists, it 
should be used instead of the equivalent 3-letter ISO 639-2 or ISO 639-3 codes. 
So you should never see a literal such as "book"@eng in well-formed RDF data.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: Jena-text multilingual alternative implementati...

Reply via email to