[GitHub] RestfulBlue commented on issue #6189: Lucene indexing for free form text

GitBox Wed, 22 Aug 2018 00:09:01 -0700

RestfulBlue commented on issue #6189: Lucene indexing for free form text
URL: 
https://github.com/apache/incubator-druid/issues/6189#issuecomment-414933972
 
 
   Hi, multi-value dimensions will work only in simple generic cases, for 
example where logs have a simple form with space-separated words. But even with 
this form of data, it needs external preprocessing, which will grow over 
time. For example, at first we just split by space, then we realize we also want 
to split on all special characters, then we realize that we also want to search 
by part of a word, so we add k-skip-n-grams, etc. With that, the external 
preprocessing will slowly move toward the things Lucene already does. Also, with 
this approach we can't simply get the source text back, e.g. for 
`select * from table limit 100`, because the data in a multi-value column is 
split and optimized for search. So this requires denormalization of the data 
and costs additional space.
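   
   To make the growth of that preprocessing concrete, here is a minimal sketch 
of the three stages described above. The class and method names are 
hypothetical and not part of Druid or Lucene, and stage 3 uses plain character 
n-grams as a simpler stand-in for k-skip-n-grams:
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   import java.util.Locale;

   // Hypothetical helper, for illustration only.
   class PreprocessingSketch {

       // Stage 1: split on whitespace only.
       static List<String> splitOnWhitespace(String line) {
           return Arrays.asList(line.trim().split("\\s+"));
       }

       // Stage 2: split on every character that is not a letter or digit,
       // and lowercase the tokens so matching is case-insensitive.
       static List<String> splitOnSpecialChars(String line) {
           List<String> tokens = new ArrayList<>();
           for (String t : line.split("[^\\p{L}\\p{N}]+")) {
               if (!t.isEmpty()) {
                   tokens.add(t.toLowerCase(Locale.ROOT));
               }
           }
           return tokens;
       }

       // Stage 3: character n-grams, so that part of a word can be matched.
       static List<String> charNGrams(String token, int n) {
           List<String> grams = new ArrayList<>();
           for (int i = 0; i + n <= token.length(); i++) {
               grams.add(token.substring(i, i + n));
           }
           return grams;
       }

       public static void main(String[] args) {
           String line = "GET /api/v1/users?id=42 timeout";
           System.out.println(splitOnWhitespace(line));
           System.out.println(splitOnSpecialChars(line));
           System.out.println(charNGrams("timeout", 3));
       }
   }
   ```
   
   Each stage multiplies the number of values stored per row, and none of them 
keeps the original line, which is the denormalization cost mentioned above.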
   
   Simple Lucene indexing looks like this:
   
   ```java
   // Fragment intended to run inside a JUnit test method (it uses assertEquals).
   Analyzer analyzer = new StandardAnalyzer();

   // Store the index in memory:
   Directory directory = new RAMDirectory();
   // To store an index on disk, use this instead:
   // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
   IndexWriterConfig config = new IndexWriterConfig(analyzer);
   IndexWriter iwriter = new IndexWriter(directory, config);
   Document doc = new Document();
   String text = "This is the text to be indexed.";
   // TYPE_STORED keeps the original text so it can be returned with search hits.
   doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
   iwriter.addDocument(doc);
   iwriter.close();

   // Now search the index:
   DirectoryReader ireader = DirectoryReader.open(directory);
   IndexSearcher isearcher = new IndexSearcher(ireader);
   // Parse a simple query that searches for "text":
   QueryParser parser = new QueryParser("fieldname", analyzer);
   Query query = parser.parse("text");
   // search(Query, int) is used here; the older search(query, null, 1000)
   // overload took a Filter and is not available in newer Lucene releases.
   ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
   assertEquals(1, hits.length);
   // Iterate through the results:
   for (int i = 0; i < hits.length; i++) {
     Document hitDoc = isearcher.doc(hits[i].doc);
     assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
   }
   ireader.close();
   directory.close();
   ```
   
   I think adding it as a new column would be great. The main reason is that 
Lucene is heavier than simple token indexing. Mixing unindexed columns, 
tokenized columns, and Lucene-indexed columns in one table can greatly reduce 
the total disk space required compared to indexing everything with Lucene.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org
