Re: Introduce mdrill project(opensource,maybe help full for apache drill`s develope)

Jacques Nadeau Thu, 22 Aug 2013 18:37:39 -0700

Interesting.  Is there any other English documentation about its purpose
and architecture?



On Mon, Aug 12, 2013 at 2:25 AM, 子落 <[email protected]> wrote:

> Its address is https://github.com/alibaba/mdrill . I think some of its
> information or design may be helpful for Apache Drill development.
>
>
>
> It is similar to Apache Drill or Google PowerDrill, and is based on
> hadoop, lucene, solr, and jstorm.
>
>
>
> My project currently has 10 tables, 47760506482 rows, and 80~400 columns
> (running on 10 machines; per machine: 48 GB RAM, 12 x 2 TB disks).
>
>
>
> Some example queries are shown below:
>
>
>
> select count(*) from r_rpt_cps_luna_item where thedate >='20130416' and
> thedate <'20130811' limit 0,100
>
> totalRecords: 1
>
>   count(*)
>   -----------
>   11108914892
>
> Time taken: 4.031 seconds
>
>
>
>
>
> select sum(landing_uv) from r_rpt_cps_luna_item where thedate >='20130416'
> and  thedate <'20130811' limit 0,100
>
> totalRecords: 1
>
>   sum(landing_uv)
>   ---------------
>   2.07678497E8
>
> Time taken: 56.081 seconds
>
>
>
> select dist(user_id) from r_rpt_cps_luna_item where thedate >='20130416'
> and  thedate <'20130811' limit 0,100
>
> totalRecords: 1
>
>   dist(user_id)
>   -------------
>   1483008.0
>
> Time taken: 246.147 seconds
>
>
>
> select thedate,count(*) as cnt from r_rpt_cps_luna_item where thedate
> >='20130416' and  thedate <'20130811' group by thedate order by cnt desc
> limit 0,3
>
> totalRecords: 118
>
>   thedate    cnt
>   --------   ---------
>   20130803   158301304
>   20130802   157748487
>   20130725   157047045
>
> Time taken: 34.727 seconds
>
>
>
> select thedate,user_id,count(*) as cnt from r_rpt_cps_luna_item where
> thedate >='20130416' and  thedate <'20130811' group by thedate,user_id
> order by cnt desc limit 0,3
>
> totalRecords: 10010
>
>   thedate    user_id     cnt
>   --------   ---------   ------
>   20130725   725677994   194397
>   20130725   101450072   192650
>   20130701   101450072   189107
>
> Time taken: 149.316 seconds
>
>
>
> select thedate,category_level1,count(*) as cnt from r_rpt_cps_luna_item
> where thedate >='20130416' and  thedate <'20130811' group by
> thedate,category_level1 order by cnt desc limit 0,3
>
> totalRecords: 10010
>
>   thedate    category_level1   cnt
>   --------   ---------------   --------
>   20130803   16                26487658
>   20130802   16                26306163
>   20130725   16                26128576
>
> Time taken: 94.989 seconds
>
>
>
> select thedate,category_level1,category_level2,count(*) as cnt from
> r_rpt_cps_luna_item where thedate >='20130416' and  thedate <'20130811'
> group by thedate,category_level1,category_level2 order by cnt desc limit
> 0,3
>
> totalRecords: 10010
>
>   thedate    category_level1   category_level2   cnt
>   --------   ---------------   ---------------   -------
>   20130725   16                50010850          7315606
>   20130803   16                50010850          7006255
>   20130802   16                50010850          6936059
>
> Time taken: 288.885 seconds
>
>
>
>
>
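> Introduction (translated from Chinese):
> 1: mdrill aims to help users analyze data at the ten-billion-row scale,
> across arbitrary combinations of dimensions, within a few seconds to a
> few tens of seconds.
> 2: mdrill is a distributed online analytical query system, implemented on
> top of open-source systems such as hadoop, lucene, solr, and jstorm, with
> a SQL-based query syntax. mdrill is a software framework for distributed
> processing of large amounts of data. mdrill is fast and high-performance:
> its lower layers use indexes, columnar storage, and in-memory caching,
> which greatly increase data scan speed. mdrill is distributed: it works
> in parallel, which speeds up processing.
> 3: The adhoc project, an application built on mdrill, uses 10 machines
> and stores 40 billion rows.
>   ==> Each query scans 3 billion rows, with response times of roughly
> 20 to 120 seconds (depending on the query conditions and the number of
> columns scanned).
>   ==> count(*) over 10 billion rows takes 2 seconds; a single-column sum
> takes 25 seconds; grouping by date with count and sum takes 47 seconds;
> grouping by user id, sorting by number of transactions, and taking the
> top N takes 243 seconds.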
>
>
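
For a rough sense of the scan rates the quoted timings imply, here is a
small Python sketch. The timings and the matched row count are copied from
the quoted message; the names (ROWS_IN_RANGE, rows_per_second, report) are
only illustrative. It assumes that every query touched the same
11108914892 rows, since all seven use the same table and the same thedate
filter, and that the 10 machines shared the work evenly. Neither
assumption is stated in the message, so treat the output as an estimate.

```python
# Rows matched by the shared filter (thedate >= '20130416' and
# thedate < '20130811'), as reported by the count(*) query in the message.
ROWS_IN_RANGE = 11_108_914_892

# (query label, seconds taken) copied from the quoted results.
TIMINGS = [
    ("count(*)", 4.031),
    ("sum(landing_uv)", 56.081),
    ("dist(user_id)", 246.147),
    ("group by thedate", 34.727),
    ("group by thedate,user_id", 149.316),
    ("group by thedate,category_level1", 94.989),
    ("group by thedate,category_level1,category_level2", 288.885),
]


def rows_per_second(rows, seconds):
    """Average number of rows processed per second of wall-clock time."""
    return rows / seconds


def report(rows=ROWS_IN_RANGE, machines=10):
    """Return one summary line per query.

    Assumes (not confirmed by the message) that each query scanned `rows`
    rows and that the load was spread evenly over `machines` machines.
    """
    lines = []
    for label, seconds in TIMINGS:
        rate = rows_per_second(rows, seconds)
        lines.append(
            f"{label}: {rate / 1e6:,.1f}M rows/s overall, "
            f"{rate / machines / 1e6:,.1f}M rows/s per machine"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Under these assumptions, count(*) comes out far faster per row than the
aggregations that have to read column values, which would be consistent
with the index being able to answer it without scanning the column data.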
