zhao jintao created KYLIN-3845:
----------------------------------

             Summary: Kylin build error if the Kafka data source lacks selected 
dimensions or metrics in the Kylin streaming build
                 Key: KYLIN-3845
                 URL: https://issues.apache.org/jira/browse/KYLIN-3845
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v2.5.2
         Environment: Fusion Insight
            Reporter: zhao jintao
             Fix For: Future


Hi dear team,
I'm developing an OLAP platform based on Kylin 2.5.2. As part of this work, I built a 
streaming cube from a Kafka source using the Kafka demo.
In my streaming project, I set country and currency as dimensions and userId as 
a metric. The cube build failed in the 3rd step ("Extract Fact Table Distinct 
Columns") with a java.lang.ArrayIndexOutOfBoundsException.
Here are the logs:
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 1334m
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Total rows: 127
2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 0
2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.YarnChild: Exception running child: java.lang.ArrayIndexOutOfBoundsException:2
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 1334m
 at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper.doMap(FactDistinctColumnsMapper.java:177)
 at org.apache.kylin.engine.mr.KylinMapper.map(KylinMapper.java:77)
 at org.apache.hadoop.mapreduce.Mapper.run(MapperTask.java:146)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:187)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:180)
 
Then I found that in the Kafka data source, some streaming records lack the userId 
column. Most records (country, currency, userId) look like 
("China","CNY","843c4d"), but a small number lack userId and look like 
("China","CNY"). So when the 3rd step ("Extract Fact Table Distinct 
Columns") runs, the MR engine throws the exception for any record that lacks userId.

Then I checked the Kylin source, FactDistinctColumnsMapper.java:

public void doMap(KEYIN key, Object record, Context context) throws IOException, InterruptedException {
    Collection<String[]> rowCollection = flatTableInputFormat.parseMapperInput(record);

    for (String[] row : rowCollection) {
        context.getCounter(RawDataCounter.BYTES).increment(countSizeInBytes(row));
        for (int i = 0; i < allCols.size(); i++) {
            String fieldValue = row[columnIndex[i]];
            if (fieldValue == null)
                continue;

            final DataType type = allCols.get(i).getType();
            ...

I found that columnIndex[i] equals the length of row when a streaming record 
lacks one column, so row[columnIndex[i]] throws the 
ArrayIndexOutOfBoundsException. I changed this code to compare columnIndex[i] 
with the length of row: if columnIndex[i] is equal to or larger than the length of 
row, I set fieldValue to an empty value. After this change, the 3rd 
step ("Extract Fact Table Distinct Columns") runs successfully.
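
The bounds check described above could be sketched roughly as follows. This is a minimal, stand-alone illustration and not the actual patch: the class SafeFieldAccess and the helpers fieldOrEmpty and readColumns are hypothetical names, and the column layout (country, currency, userId) is assumed from the example records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Kylin code base.
class SafeFieldAccess {

    // Returns row[index], or "" when the row is too short to hold that column.
    // Assumption: an empty string is an acceptable stand-in for a missing value.
    static String fieldOrEmpty(String[] row, int index) {
        if (index < 0 || index >= row.length) {
            return "";
        }
        return row[index];
    }

    // Mimics the inner loop of doMap: reads one value per configured column.
    static List<String> readColumns(String[] row, int[] columnIndex) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < columnIndex.length; i++) {
            values.add(fieldOrEmpty(row, columnIndex[i]));
        }
        return values;
    }

    public static void main(String[] args) {
        int[] columnIndex = {0, 1, 2}; // assumed layout: country, currency, userId
        String[] complete = {"China", "CNY", "843c4d"};
        String[] missingUserId = {"China", "CNY"};

        // Neither call throws, even though the second row has only two fields.
        System.out.println(readColumns(complete, columnIndex));
        System.out.println(readColumns(missingUserId, columnIndex));
    }
}
```

Returning null instead of an empty string would be another option, since the existing "if (fieldValue == null) continue;" check in doMap would then skip the missing column.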

That is what I found; I think it could cause problems for other developers.
What do you think?

Best regards,
jintao



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
