ShengJun Zheng created KYLIN-4810:
-------------------------------------
Summary: TrieDictionary is not correctly build
Key: KYLIN-4810
URL: https://issues.apache.org/jira/browse/KYLIN-4810
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v2.3.2
Reporter: ShengJun Zheng
Fix For: Future
Hi, we recently hit a problem in our production environment: segments failed to
merge because the TrieDictionaryForest was out of order.
{code:java}
java.lang.IllegalStateException: Invalid input data. Unordered data cannot be split into multi trees
    at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:92)
    at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:78)
    at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.addValue(DictionaryGenerator.java:214)
    at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:81)
    at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:65)
    at org.apache.kylin.dict.DictionaryGenerator.mergeDictionaries(DictionaryGenerator.java:106)
{code}
After some analysis, we found that when a dict-encoded column contains large
values, iterating over a single TrieDictionaryTree returns unordered data.
Digging into the source code, the root cause is as follows:
# Kylin splits a TrieTree node into two parts when a single node's value is
longer than 255 bytes.
# These two parts are then added as values when building the TrieTree. In fact,
the two split parts should not be added to the TrieTree as new values.
# Step 2 causes the TrieDictionaryTree to build extra leaf nodes, and these
extra leaf nodes are marked as 'end-value' nodes of the dictionary tree.
# This has no impact on the correctness of the dict tree itself, apart from
adding some additional nodes.
# But if the split falls inside a multi-byte UTF-8 character, iterating over
the tree returns unordered data. (This has to do with how Java serializes and
deserializes UTF-8 strings; please refer to the JDK UTF-8 implementation.)
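The ordering problem in the last step can be reproduced with the JDK alone. The following is a minimal sketch, not Kylin code: the class `SplitUtf8Demo` and its helpers are hypothetical names, and the cut position is chosen by hand rather than by a 255-byte limit. It assumes the split prefix is decoded to a `String` and encoded again, which is what the U+FFFD bytes in the log output suggest.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitUtf8Demo {
    // Unsigned lexicographic byte comparison (the order a byte-based trie would use).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Decode bytes as UTF-8 and encode them again; malformed input becomes U+FFFD.
    static byte[] roundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] full = "xx你好".getBytes(StandardCharsets.UTF_8);
        // Cut after the first byte of '你' (E4 BD A0), i.e. inside a character.
        byte[] prefix = Arrays.copyOf(full, 3);

        // The raw prefix sorts before the full value, as expected.
        System.out.println(compareUnsigned(prefix, full) < 0);            // prints "true"
        // After the round trip the prefix ends in EF BF BD, which sorts after E4.
        System.out.println(compareUnsigned(roundTrip(prefix), full) > 0); // prints "true"
    }
}
```

The raw prefix is ordered correctly; only the decoded and re-encoded prefix moves past the full value, which matches the `\xEF\xBF\xBD` versus `\xE4\xBD\xA0` pair in the log.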
How to reproduce? Run this test code:
{code:java}
TrieDictionaryForestBuilder builder =
        new TrieDictionaryForestBuilder(new StringBytesConverter());
String longUrl =
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx你好~~~";
builder.addValue(longUrl);
TrieDictionaryForest<String> dict = builder.build();
// Re-add every value read back from the built dictionary, as a merge would.
TrieDictionaryForestBuilder mergeBuild =
        new TrieDictionaryForestBuilder(new StringBytesConverter());
for (int i = dict.getMinId(); i <= dict.getMaxId(); i++) {
    String str = dict.getValueFromId(i);
    System.out.println("add value into merge tree");
    mergeBuild.addValue(str);
}
{code}
The log output of this test code is:
{code}
add value into merge tree
add value into merge tree
16:59:36 [main] INFO
org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:127)
values not in ascending order, previous
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xEF\xBF\xBD',
current
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xE4\xBD\xA0\xE5\xA5\xBD~~~'
{code}
We can see from the test code's output:
# We added only 1 value, but the trie dictionary tree turns out to have 2 end
values.
# Iterating over the TrieDictionaryTree returns unordered data.
We addressed this problem by:
# distinguishing values that are a whole column value from values that are a
split part, and
# not marking a split part as the end-value of a TrieTree node.
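The idea behind these two steps can be illustrated with a toy model. This is only a sketch under loud assumptions: `SplitNodeSketch`, `Node`, `MAX_PART` and `endOfValue` are hypothetical names, the structure is a single chain of nodes rather than a real trie, and it is not the actual Kylin patch. A long value is split across nodes by byte length, but only the node where the whole column value ends is flagged, so enumeration never returns a split part.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitNodeSketch {
    static final int MAX_PART = 4; // hypothetical stand-in for the 255-byte limit

    static final class Node {
        final byte[] part;
        boolean endOfValue; // true only where a whole column value ends
        Node child;

        Node(byte[] part) {
            this.part = part;
        }
    }

    // Build a chain of nodes for one value, splitting every MAX_PART bytes.
    static Node insert(byte[] value) {
        Node head = null;
        Node tail = null;
        for (int off = 0; off < value.length; off += MAX_PART) {
            int end = Math.min(off + MAX_PART, value.length);
            Node n = new Node(Arrays.copyOfRange(value, off, end));
            if (head == null) {
                head = n;
            } else {
                tail.child = n;
            }
            tail = n;
        }
        if (tail != null) {
            tail.endOfValue = true; // split parts before the tail stay unflagged
        }
        return head;
    }

    // Enumerate values: bytes are accumulated and decoded only at end-of-value nodes.
    static List<String> values(Node head) {
        List<String> out = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (Node n = head; n != null; n = n.child) {
            buf.write(n.part, 0, n.part.length);
            if (n.endOfValue) {
                out.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return out;
    }
}
```

With this flag, inserting `"xx你好~~~"` produces three nodes, one of which begins inside a multi-byte character, yet `values` returns the single original string because no partial byte sequence is ever decoded on its own.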
Please let me know if anything in this analysis is wrong. Thanks.