The "shard-by" column is used to distribute the cube data to different shards (each shard is an HBase region). The "shard-by" column usually needs to be a high-cardinality column, such as user_id or order_id, so that the shards end up similar in size. A partition column's cardinality is usually not high enough, so it is not suggested for this purpose.
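For reference, the shard-by column is declared in the cube descriptor's rowkey section. A minimal sketch (column name and encoding here are placeholders, and the exact JSON layout should be checked against the cube desc of your Kylin version):

```json
"rowkey": {
  "rowkey_columns": [
    { "column": "USER_ID", "encoding": "dict", "isShardBy": true },
    { "column": "REG_TIME", "encoding": "dict", "isShardBy": false }
  ]
}
```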
The default streaming parser in Kylin accepts JSON format. If your messages are in another format, you need to implement a parser for it by extending the StreamingParser class.

2017-10-10 16:30 GMT+08:00 崔苗 <[email protected]>:

> Thanks for your suggestion; we finally changed the timestamp into date
> format by SQL and it worked.
> Some other questions:
> 1. What is the meaning of the 'shard by' column? Is it proper to set the
> partition column as the 'shard by' column?
> 2. Are there limitations on the data format in Kafka when building
> streaming cubes? We succeeded in building a streaming cube on the sample
> data supplied by Kylin, but failed on our own data, which is in Avro
> format, not JSON format.
>
> On 2017-10-10 14:47:30, ShaoFeng Shi <[email protected]> wrote:
> > Hi Miao,
> >
> > Kylin doesn't understand your time format. You need to use the standard
> > Date format in Hive. Or you can implement your own logic with the
> > interface "IPartitionConditionBuilder".
> >
> > 2017-10-10 11:33 GMT+08:00 崔苗 <[email protected]>:
> >
> > > Well, the timestamp column was a bigint such as 1507547479434 in the
> > > Hive table. When I defined the end time to build the cube, I found
> > > the timestamp 1507547479434 was converted to '20171009', and the log
> > > showed that Kylin loaded data from Hive with the condition "WHERE
> > > (USER_REG.REG_TIME < 20171009)", so the intermediate flat Hive table
> > > was null. I want to know: could Kylin derive other time values like
> > > "year_start" and "day_start" from the bigint timestamp in Hive as it
> > > does for a Kafka table? Or must we change the bigint timestamp into a
> > > date format such as "2017-10-09" in Hive?
> > >
> > > At 2017-10-09 22:04:56, ShaoFeng Shi <[email protected]> wrote:
> > > > Hi Miao,
> > > >
> > > > What is the error, as you said, "kylin failed to load data from
> > > > hive tables"?
> > > >
> > > > In my opinion, it is not recommended to use a timestamp as the
> > > > partition column, since its granularity is too fine. Usually the
> > > > cube is partitioned by day/week/month; in some cases by the hour;
> > > > in a streaming case it might be partitioned by the minute; but
> > > > never by timestamp. I put some comments about this in this
> > > > document:
> > > > https://kylin.apache.org/docs21/tutorial/cube_streaming.html
> > > >
> > > > 2017-10-09 14:27 GMT+08:00 崔苗 <[email protected]>:
> > > >
> > > > > Hi,
> > > > > We want to use tables in Kafka as fact tables and tables in
> > > > > MySQL as lookup tables, so we put all the tables into Hive and
> > > > > want to join them as cubes.
> > > > >
> > > > > The time column in the fact table was a timestamp, so does
> > > > > Kylin 2.1 support timestamp for cube partition?
> > > > > I found this: https://issues.apache.org/jira/browse/KYLIN-633
> > > > > It seems Kylin already supports Timestamp for cube partition,
> > > > > but when we define timestamp as the partition, Kylin failed to
> > > > > load data from Hive tables.
> > > > >
> > > > > Thanks in advance for your reply.

--
Best regards,

Shaofeng Shi 史少锋
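The custom-parser route mentioned above could look roughly like the sketch below for Avro messages. This is only an illustration: the exact StreamingParser and StreamingMessageRow signatures vary by Kylin version, the schema, the field names ("timestamp", "user_id"), and the StreamingMessageRow constructor arguments here are assumptions, and the Avro decoding is simplified (it assumes the ByteBuffer is array-backed). Check both against the classes shipped with your Kylin release before using it.

```java
// Hypothetical Avro parser for a Kylin streaming table -- a sketch, not a
// drop-in implementation. Requires the Kylin and Apache Avro jars.
public class AvroStreamingParser extends StreamingParser {

    // MY_SCHEMA is a placeholder for your Avro schema (assumption).
    private final DatumReader<GenericRecord> reader =
            new GenericDatumReader<>(MY_SCHEMA);

    @Override
    public List<StreamingMessageRow> parse(ByteBuffer message) {
        try {
            // Assumes an array-backed buffer; otherwise copy the bytes out first.
            Decoder decoder = DecoderFactory.get()
                    .binaryDecoder(message.array(), null);
            GenericRecord record = reader.read(null, decoder);

            // Flatten the Avro record into one string per column declared
            // on the streaming table, timestamp column first.
            long ts = (Long) record.get("timestamp");
            List<String> row = new ArrayList<>();
            row.add(String.valueOf(ts));
            row.add(String.valueOf(record.get("user_id")));
            // ... one entry per remaining column

            return Collections.singletonList(
                    new StreamingMessageRow(row, 0, ts, Collections.emptyMap()));
        } catch (IOException e) {
            throw new RuntimeException("Failed to decode Avro message", e);
        }
    }

    @Override
    public boolean filter(StreamingMessageRow row) {
        return true; // accept every message
    }
}
```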
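On the bigint-timestamp issue discussed in the quoted thread: the conversion the thread settled on (epoch milliseconds to a "yyyy-MM-dd" date string) can be sketched in plain Java with java.time. The class name is mine; the sample value 1507547479434 is the one from the thread, which falls on 2017-10-09 in UTC:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochToDate {

    // Convert an epoch-millisecond bigint (as stored in the Hive table)
    // to the "yyyy-MM-dd" date string a date partition column expects.
    static String toDateString(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
    }

    public static void main(String[] args) {
        // 1507547479434 is the sample value from the thread above.
        System.out.println(toDateString(1507547479434L)); // prints 2017-10-09
    }
}
```

The equivalent can be done in the Hive flat-table SQL (e.g. with from_unixtime over the value divided by 1000), which is what the thread's author reported doing.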
