[ https://issues.apache.org/jira/browse/KYLIN-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoxiang Yu closed KYLIN-4810.
-------------------------------

Released in Kylin 3.1.2.

> TrieDictionary is not correctly build
> -------------------------------------
>
>                 Key: KYLIN-4810
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4810
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v2.3.2
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Critical
>              Labels: Dictionary
>             Fix For: v3.1.2
>
>
> Hi, recently I ran into a problem in our production environment: segments failed
> to merge because the TrieDictionaryForest was out of order.
> {code:java}
> java.lang.IllegalStateException: Invalid input data. Unordered data cannot be split into multi trees
>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:92)
>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:78)
>     at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.addValue(DictionaryGenerator.java:214)
>     at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:81)
>     at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:65)
>     at org.apache.kylin.dict.DictionaryGenerator.mergeDictionaries(DictionaryGenerator.java:106)
> {code}
> After some analysis, we found that when a dict-encoded column contains large
> values, iterating over a single TrieDictionaryTree returns unordered data.
>
> Digging into the source code, the root cause is as follows:
> # Kylin splits a TrieTree node into two parts when a single node's value is
> longer than 255 bytes.
> # These two parts are then added as values when building the TrieTree. In
> fact, the two split parts should not be added to the TrieTree as new values.
> # Step 2 causes the TrieDictionaryTree to build extra leaf nodes, and these
> extra leaf nodes become 'end-values' of the dictionary tree.
> # This has no impact on the correctness of the dict tree itself, apart from
> adding some additional nodes.
> # But if the split falls in the middle of a multi-byte UTF-8 character, you
> will get unordered data when iterating over the tree (this has something to
> do with Java's UTF-8 String serialization/deserialization; please refer to
> JDK sun.nio.cs.UTF_8.class).
>
> How to reproduce? Run this test code:
> {code:java}
> TrieDictionaryForestBuilder builder = new TrieDictionaryForestBuilder(new StringBytesConverter());
> // A single value longer than 255 bytes that ends in multi-byte UTF-8 characters
> String longUrl = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx你好~~~";
> builder.addValue(longUrl);
> TrieDictionaryForest<String> dict = builder.build();
> // Re-add every value of the built dictionary, as a segment merge does
> TrieDictionaryForestBuilder mergeBuild = new TrieDictionaryForestBuilder(new StringBytesConverter());
> for (int i = dict.getMinId(); i <= dict.getMaxId(); i++) {
>     String str = dict.getValueFromId(i);
>     System.out.println("add value into merge tree");
>     mergeBuild.addValue(str);
> }
> {code}
> The log output of this test code is:
> {code}
> add value into merge tree
> add value into merge tree
> 16:59:36 [main] INFO org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:127) values not in ascending order, previous 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xEF\xBF\xBD', current 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xE4\xBD\xA0\xE5\xA5\xBD~~~'
> {code}
> We can see from the test code's output:
> # We added only one value, but the trie dictionary tree turns out to have two
> end-values.
> # Iterating over the TrieDictionaryTree returns unordered data.
>
> We address this problem by:
> # distinguishing values that are whole column values from values that are
> split parts;
> # not marking a split part as the end-value of a TrieTree node.
>
> I wonder if I have got something wrong here. Thanks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
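The ordering flip in the log above can be reproduced without Kylin. The sketch below is only an illustration under stated assumptions: the class and method names (Utf8SplitOrderDemo, compareUnsigned, roundTrip, sampleValue, splitPrefix) are hypothetical, the sample value uses 254 'x' characters rather than the exact string from the report, and the cut at byte 255 is assumed to mirror the 255-byte node limit described in the issue. It shows that a prefix cut in the middle of a multi-byte character sorts before the full value as raw bytes, but sorts after it once it has been decoded to a String and encoded again, because the incomplete sequence is replaced by U+FFFD (bytes EF BF BD).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8SplitOrderDemo {

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Decode bytes to a String and encode them back, as happens when a value
    // is read from a dictionary as a String and then re-added to a builder.
    public static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Assumed sample: 254 ASCII bytes followed by two 3-byte characters and "~~~".
    public static byte[] sampleValue() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 254; i++) {
            sb.append('x');
        }
        sb.append("\u4F60\u597D~~~"); // U+4F60 U+597D, written as escapes
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Assumed split point: keep the first 255 bytes, which cuts the first
    // 3-byte character after its lead byte 0xE4.
    public static byte[] splitPrefix(byte[] full) {
        return Arrays.copyOf(full, 255);
    }

    public static void main(String[] args) {
        byte[] full = sampleValue();
        byte[] prefix = splitPrefix(full);

        // As raw bytes, the prefix is a strict prefix of the full value.
        System.out.println("raw prefix sorts before full value: "
                + (compareUnsigned(prefix, full) < 0));

        // After decoding and re-encoding, the dangling 0xE4 has become U+FFFD,
        // whose first byte 0xEF is greater than 0xE4.
        System.out.println("round-tripped prefix sorts after full value: "
                + (compareUnsigned(roundTrip(prefix), full) > 0));
    }
}
```

This only demonstrates the byte-level effect. Whether Kylin's split lands exactly on this byte depends on its internal node layout, which the sketch does not model.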