RestfulBlue commented on issue #6189: Lucene indexing for free form text
URL: https://github.com/apache/incubator-druid/issues/6189#issuecomment-414933972

Hi, multi-value dimensions will work only in simple generic cases, for example where the logs have a simple form with space-separated words. But even with this kind of data, external preprocessing is needed, and it will grow over time. For example, at first we just split by space, then we realize we also want to split on all special characters, then we realize we also want to search by part of a word, so we add k-skip-n-grams, etc. That way the external preprocessing slowly turns into what Lucene already does. Also, with this approach we can't simply get the source text back, e.g. for `select * from table limit 100`, because the data in the multi-value column is split and optimized for search. So this requires denormalizing the data and costs additional space.

Simple Lucene indexing looks like this:

```java
// Imports for the snippet below; the statements belong inside a method
// that is allowed to throw IOException and ParseException.
import static org.junit.Assert.assertEquals;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Analyzer analyzer = new StandardAnalyzer();

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead
// (needs org.apache.lucene.store.FSDirectory and java.nio.file.Paths):
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
// TYPE_STORED keeps the original text so it can be read back from a hit.
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
// search(query, n) replaces the older search(query, null, n) overload that took a Filter.
ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
  Document hitDoc = isearcher.doc(hits[i].doc);
  assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
```

I think adding it as a new column would be great. The main reason is that Lucene is heavier than simple token indexing. Mixing disabled indexing, tokenizing, and Lucene in one table can greatly reduce the total amount of required disk space compared to full Lucene indexing.
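
To illustrate how the external preprocessing mentioned above tends to grow, here is a minimal sketch in plain Java (no Druid or Lucene dependency). The class name `TokenizeSketch`, its method names, and the sample log line are all hypothetical and exist only for illustration. It uses plain character n-grams instead of k-skip-n-grams to stay short, and it is not how Druid or Lucene tokenize text internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: three successive stages of hand-rolled preprocessing
// that would feed a multi-value dimension.
class TokenizeSketch {

    // Stage 1: split on whitespace only.
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stage 2: lowercase, then split on every run of characters
    // that are neither letters nor digits.
    static List<String> splitOnSpecialChars(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stage 3: character n-grams of one token, to allow partial-word search.
    static List<String> charNGrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String line = "GET /api/v1/users?id=42 failed: Timeout";
        // [GET, /api/v1/users?id=42, failed:, Timeout]
        System.out.println(splitOnWhitespace(line));
        // [get, api, v1, users, id, 42, failed, timeout]
        System.out.println(splitOnSpecialChars(line));
        // [tim, ime, meo, eou, out]
        System.out.println(charNGrams("timeout", 3));
    }
}
```

Each stage emits more, and smaller, values per row, and none of them keeps the original log line. That is the space and denormalization cost described in the comment.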