Re: Exact distinct count support

2016-01-28 Thread Abhilash L L
+user ml

Regards,
Abhilash

On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L 
wrote:

> Hello,
>
> Is there a way to ask Kylin for an exact distinct count? From what we
> understand, we can choose between hllc(10) and hllc(16).
>
> I understand that for every cuboid you would need to go through the
> whole data set again, but with the new cubing algorithm (2.x branch) it
> should be simpler to add?
>
> If this is not currently present, are there any plans to introduce it?
>
> Regards,
> Abhilash
>
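
For context on the hllc(10) to hllc(16) range mentioned above: assuming the number is the HyperLogLog precision p (that is, 2^p registers), which this thread does not spell out, the textbook relative standard error is roughly 1.04 / sqrt(2^p). A minimal sketch of that arithmetic follows; the function name is only illustrative and is not part of Kylin.

```python
import math


def hll_relative_error(precision: int) -> float:
    """Textbook HyperLogLog standard error for 2**precision registers."""
    registers = 2 ** precision
    return 1.04 / math.sqrt(registers)


# Print the approximate error for a few precisions in the hllc(10)..hllc(16) range.
for p in (10, 12, 14, 16):
    print(f"hllc({p}): about {hll_relative_error(p) * 100:.2f}% relative error")
```

Under that assumption the estimate is never exact, which is why an exact distinct count needs a different measure type.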


Re: Exact distinct count support

2016-01-28 Thread ShaoFeng Shi
Does this match your case? https://issues.apache.org/jira/browse/KYLIN-1186

2016-01-28 17:42 GMT+08:00 Abhilash L L :

> +user ml
>
> Regards,
> Abhilash
>
> On Thu, Jan 28, 2016 at 11:32 AM, Abhilash L L 
> wrote:
>
> > Hello,
> >
> >Is there a way to ask Kylin to get exact distinct count ?  From what
> we
> > understand, we can choose between hllc(10) to hllc(16)
> >
> >I understand that for every cuboid, you will need to go through the
> > whole data set again, but with the new cubing algo (2.x branch) should be
> > simpler to add ?
> >
> >If currently not present are there any plans to introduce this ?
> >
> > Regards,
> > Abhilash
> >
>



-- 
Best regards,

Shaofeng Shi


[jira] [Created] (KYLIN-1377) TopN measure should support more expressions

2016-01-28 Thread Shaofeng SHI (JIRA)
Shaofeng SHI created KYLIN-1377:
---

 Summary: TopN measure should support more expressions
 Key: KYLIN-1377
 URL: https://issues.apache.org/jira/browse/KYLIN-1377
 Project: Kylin
  Issue Type: New Feature
Reporter: Shaofeng SHI
 Fix For: v2.1


TopN should support not only SUM, but also MAX and MIN as the expression.

A possible case: find the sellers that sold the most expensive items:

select seller_id, max(price) from sals_records where region = 'US' and year = 
'2015' group by seller_id order by max(price) desc limit 100;
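
To make the intended semantics concrete, here is a small sketch of what a "TopN by MAX" answer would contain, computed naively in memory. It only illustrates the expected query result and says nothing about how Kylin would pre-aggregate it; the record layout and function name are assumptions for the example.

```python
import heapq


def top_sellers_by_max_price(records, region, year, limit):
    """Return up to `limit` (seller_id, max_price) pairs, highest price first.

    `records` is an iterable of dicts with the keys
    seller_id, price, region and year (a hypothetical layout).
    """
    max_price = {}
    for r in records:
        if r["region"] == region and r["year"] == year:
            sid = r["seller_id"]
            # Keep only the highest price seen so far for each seller.
            if sid not in max_price or r["price"] > max_price[sid]:
                max_price[sid] = r["price"]
    return heapq.nlargest(limit, max_price.items(), key=lambda kv: kv[1])


sample = [
    {"seller_id": "a", "price": 10, "region": "US", "year": "2015"},
    {"seller_id": "a", "price": 50, "region": "US", "year": "2015"},
    {"seller_id": "b", "price": 30, "region": "US", "year": "2015"},
    {"seller_id": "c", "price": 99, "region": "EU", "year": "2015"},
]
print(top_sellers_by_max_price(sample, "US", "2015", 2))
```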



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Exact distinct count support

2016-01-28 Thread Abhilash L L
Thanks ShaoFeng Shi,

We might need this for other data types as well, such as date and string
(e.g., a distinct count of the dates of a certain activity).

So in the REST call, instead of the hllc return type, should it be bitmap for
int, tinyint, etc.?

And do we still send it as hllc for other data types?


Also, one of the comments said that we cast long to int. Won't we be
losing data due to truncation?


Regards,
Abhilash

On Thu, Jan 28, 2016 at 3:43 PM, ShaoFeng Shi 
wrote:

> is this matched your case?
> https://issues.apache.org/jira/browse/KYLIN-1186


Hot swapping cube post build

2016-01-28 Thread Abhilash L L
We have a use case where we want to rebuild the cube with an updated data
set without downtime for requests.

Let's say we have cube C1.
We get some new data and we rebuild the cube.
Let's call the rebuilt cube, with the new data, C2. (Assume no change to the
cube structure.)

While C2 is building, we want C1 to keep serving requests.
Once C2 is done building, we hot swap C1 with C2.

This way there is no downtime for requests (or, if there is any, it is very
short).

Another problem: since C2 has to use the same Hive table names and
schema as C1, we can recreate the (external) tables pointing to the data
for C2.

We cannot use incremental cube builds, since as of now it is hard
to figure out the change set.

What is the best way to achieve this?

Assumption:
Since we cannot have two cubes with the same name under the same project, we
need two different cubes.
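
One way to picture the swap described above, independent of how Kylin itself handles it: the querying side keeps a small alias that names the cube currently serving requests, and flips it only after the new build has succeeded. This is a minimal sketch; CubeAlias and the cube names are hypothetical and not a Kylin API.

```python
import threading


class CubeAlias:
    """Hypothetical client-side alias pointing at the cube that serves queries."""

    def __init__(self, active_cube: str):
        self._active = active_cube
        self._lock = threading.Lock()

    def active(self) -> str:
        with self._lock:
            return self._active

    def swap(self, new_cube: str, build_succeeded: bool) -> bool:
        """Point the alias at new_cube, but only if its build succeeded."""
        if not build_succeeded:
            return False  # keep serving from the old cube
        with self._lock:
            self._active = new_cube
        return True


alias = CubeAlias("C1")
alias.swap("C2", build_succeeded=True)
print(alias.active())
```

Queries would read `alias.active()` on each request, so C1 keeps serving until the moment C2 is ready.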


Regards,
Abhilash


Re: [jira] [Created] (KYLIN-1377) TopN measure should support more expressions

2016-01-28 Thread hongbin ma
We also need to support order by desc as well as asc.

On Thu, Jan 28, 2016 at 6:30 PM, Shaofeng SHI (JIRA) 
wrote:

> Shaofeng SHI created KYLIN-1377:
> ---
>
>  Summary: TopN measure should support more expressions
>  Key: KYLIN-1377
>  URL: https://issues.apache.org/jira/browse/KYLIN-1377
>  Project: Kylin
>   Issue Type: New Feature
> Reporter: Shaofeng SHI
>  Fix For: v2.1
>
>
> TopN should support not only SUM, but also MAX, MIN as the expression.
>
> A possible case is, find out the sellers which sold the top expensive
> items:
>
> select seller_id, max(price) from sals_records where region = 'US' and
> year = '2015' order by max(price) desc limit 100;
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Re: Exact distinct count support

2016-01-28 Thread hongbin ma
KYLIN-1186 is not a mature feature yet, and it only supports integers.
We don't yet have plans to support any other forms of precise distinct
count, as it is too expensive to pre-calculate.

On Thu, Jan 28, 2016 at 6:56 PM, Abhilash L L  wrote:

> Thanks ShaoFeng Shi,
>
> We might need for other data types as well
>
> date & string
>
>  (eg, distinct count of dates of certain activity)
>
> So in the rest call instead of hllc return type it should be bitmap for
> int,tinyint etc ?
>
> And we still send it as hllc for other data types ?
>
>
> Also in one of the comments, it said we cast long to int..  wont we be
> losing data due to truncation ?
>
>
> Regards,
> Abhilash



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Re: Hot swapping cube post build

2016-01-28 Thread hongbin ma
Have you checked out the "refresh" function for cubes?

On Thu, Jan 28, 2016 at 7:07 PM, Abhilash L L  wrote:

> We have a use case where we want to rebuild the cube with an updated data
> set without downtime on requests
>
> Lets say we have cube C1.
> We get some new data and we rebuild the cube.
> Lets call this C2 with the new data. (Assume no change to cube structure)
>
> When C2 is building, we want C1 to be still serving requests.
> Once C2 is done building, we hot swap C1 with C2
>
> This way there is no downtime on the requests (even if it is, its very
> less)
>
> Another problem is, since it has to use the same hive table names and
> schema as C1, we can recreate the tables (external) pointing to the data
> for C2
>
> We cannot use  the incremental cube data addition since as of now its hard
> to figure out of the change set.
>
> What is the best way to achieve this ?
>
> Assumption:
> Since we cannot have two cubes with same name under same project, we need
> two different cubes.
>
>
> Regards,
> Abhilash
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Re: Exact distinct count support

2016-01-28 Thread ShaoFeng Shi
What's the cardinality of the dimension whose distinct values you want to
count? The integer range is enough for most cases. If your case falls within
that range, you can try the bitmap with integer, but you need to map each
value to a unique id and use that id within the bitmap. For example, if you
want to count distinct users, use the numeric user_id instead of the email
address. As for supporting other data types, as Hongbin mentioned, the storage
cost is very high, so we don't have that plan.
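
The mapping idea above can be sketched as follows: give every distinct value a dense integer id, set the bit for that id, and count the set bits. A plain Python integer stands in for the bitmap here; this is an illustration under that assumption, not Kylin's BitmapCounter, and the helper names are made up.

```python
def build_ids(values):
    """Assign each distinct value a dense integer id (a simple dictionary)."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return ids


def bitmap_of(values, ids):
    """Set one bit per value, using the id as the bit position."""
    bits = 0
    for v in values:
        bits |= 1 << ids[v]
    return bits


def distinct_count(bitmap: int) -> int:
    """Exact distinct count = number of set bits."""
    return bin(bitmap).count("1")


day1 = ["ann@example.com", "bob@example.com", "ann@example.com"]
day2 = ["bob@example.com", "cat@example.com"]
ids = build_ids(day1 + day2)

# Bitmaps from different segments merge with a bitwise OR, and the count stays exact.
merged = bitmap_of(day1, ids) | bitmap_of(day2, ids)
print(distinct_count(merged))
```

The design point is that bitmaps merge losslessly across segments, which is what makes the count exact after aggregation.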





2016-01-28 20:54 GMT+08:00 hongbin ma :

> KYLIN-1186  is not a
> mature feature yet and it only supports integer
> we don't yet have plans to support any other forms of precise distinct
> count, as it is too expensive to pre-calculate



-- 
Best regards,

Shaofeng Shi


Re: Exact distinct count support

2016-01-28 Thread ShaoFeng Shi
I removed the code for the long type in BitmapCounter, as the cast would
produce wrong results (while the goal is to provide an accurate value).
@Yerui, for your awareness: once we find a solution for long, we can add it
back.
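
To illustrate why the cast breaks exactness: a Java (int) cast of a long keeps only the low 32 bits, so two different long values can become the same int and be counted once. The sketch below mimics that narrowing in Python; to_int32 is an illustrative helper, not code from Kylin.

```python
def to_int32(value: int) -> int:
    """Mimic a Java (int) cast of a long: keep the low 32 bits, signed."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


a = 1
b = 1 + (1 << 32)  # a different long that shares the same low 32 bits

exact = len({a, b})                          # distinct longs
after_cast = len({to_int32(a), to_int32(b)})  # distinct values after the cast
print(exact, after_cast)
```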

2016-01-28 22:13 GMT+08:00 ShaoFeng Shi :

> what's the cardinality of the dimension that you want to count distinct
> values? Integer's range is enough for most cases, if your case is under
> this scope, you can try the bitmap with integer; but you need map the value
> to an unique id and use that within the bitmap. For example, if you want to
> count distinct users, use the numeric user_id, instead of email address; To
> support other data types, as Hongbin mentioned, the storage cost is very
> high, we don't have that plan.


-- 
Best regards,

Shaofeng Shi


Re: kylin concurrency test

2016-01-28 Thread zhong zhang
Hi Hongbin, Luke, Feng and everyone,

Thanks so much for taking the time to read my thread and for
your kind help. Luke, thanks for introducing Feng to me.

Feng, the concurrency test is vital for our application. We will definitely
use benchmark datasets to test it later. For now, we'd like to get a broad
understanding of Kylin's capacity. Can you help me answer the following
questions about the throughput graph?

(1) Parallel Thread #: 30 for high-level aggregation queries and 30 for
detail-level queries. How many high-level aggregation queries are used? Does
30 parallel threads mean that 30 queries are triggered at the same time?

(2) Does "raw records" mean the total records for all the queries?

(3) For the HBase scan, does a return count lower than the scan count mean
there is no hit in the cube?

(4) Latency: are the min, max, and median the statistical results for
all the test queries? What is the 90% Line?

(5) The throughput is 72.5/sec for high-level queries, so each query takes
about 13.8 ms. This seems to contradict the min latency of 67 ms.
Please correct me.
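
The apparent contradiction in (5) may come from dividing by the throughput as if queries ran one at a time. Assuming the 72.5 queries/sec was measured with the 30 parallel threads from (1) all busy, which this thread does not confirm, Little's law (concurrency = throughput x mean latency) gives a very different per-query figure. A minimal sketch of that arithmetic:

```python
def mean_latency_seconds(concurrency: int, throughput_qps: float) -> float:
    """Little's law: mean latency = requests in flight / throughput."""
    return concurrency / throughput_qps


serial_view = 1 / 72.5                          # about 13.8 ms, valid only for one query at a time
parallel_view = mean_latency_seconds(30, 72.5)  # about 414 ms with 30 queries in flight

print(f"serial: {serial_view * 1000:.1f} ms, parallel: {parallel_view * 1000:.1f} ms")
```

Under that assumption, a minimum latency of 67 ms is consistent with the reported throughput.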

Thanks once again for your help. Have a wonderful day!

Best regards,
Zhong

On Wed, Jan 27, 2016 at 10:34 PM, yu feng  wrote:

> Yes, it is QPS. This result comes from page 36 of "Apache
> Kylin-Hadoop上的大规模联机分析平台" (Apache Kylin: a large-scale online
> analytical platform on Hadoop)
> <
> http://events.linuxfoundation.org/sites/events/files/slides/Apache%20Kylin%202014%20Dec.pdf
> >.
> We ran the same test with one and with two Kylin query nodes and
> got similar results, so we used that picture for convenience. The bottleneck
> of Kylin query throughput is HBase scan performance, which depends on
> the number of regionservers, machine configuration, network, etc.
>
>
> 2016-01-28 10:08 GMT+08:00 Luke Han :
>
> > It's QPS, please contact Yu Feng (kylin committer) from NetEase for more
> > detail.
> >
> > Thanks.
> > Luke
> >
> >
> > Best Regards!
> > -
> >
> > Luke Han
> >
> > On Thu, Jan 28, 2016 at 9:43 AM, hongbin ma 
> wrote:
> >
> > > i think by default it is QPS (queries per second)
> > >
> > > On Thu, Jan 28, 2016 at 7:34 AM, zhong zhang 
> wrote:
> > >
> > > > Hi All,
> > > >
> > > > There is an article posted
> > > > by @Hu Wei at NetEase which introduces the concurrency test results.
> In
> > > the
> > > > article, there is a throughput result graph. Please see the attached.
> > > > Based on my understanding, the x-axis is the number of Kylin server.
> > > > What's the y-axis? Is it the requests at the same time?
> > > >
> > > > Best regards,
> > > > Zhong
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > > Apache Kylin: http://kylin.io
> > > Github: https://github.com/binmahone
> > >
> >
>


StringIndexOutOfBoundsException: String index out of range: -1

2016-01-28 Thread hd2
Hi Kylin-devs,

we are currently trying to build a cube / refresh a table but are
unable to do so. Kylin produces the following error:

[2016-01-28 16:14:41,472][INFO][org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:58)] - -table KYLIN_DK.DIM_DTM -output /tmp/kylin/cardinality/KYLIN_DK.DIM_DTM
Starting: Kylin Hive Column Cardinality Update Job
table=KYLIN_DK.DIM_DTM output=/tmp/kylin/cardinality/KYLIN_DK.DIM_DTM
The hadoop cardinality value is not valid
usage: HiveColumnCardinalityUpdateJob
 -output   Output path
 -table    The hive table name
[2016-01-28 16:14:41,502][ERROR][org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:64)] - error execute HadoopShellExecutable{id=a22a895d-d10f-4ab8-9bb0-defe1fdf1756-01, name=null, state=RUNNING}
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1911)
    at org.apache.kylin.job.hadoop.cardinality.HiveColumnCardinalityUpdateJob.updateKylinTableExd(HiveColumnCardinalityUpdateJob.java:113)
    at org.apache.kylin.job.hadoop.cardinality.HiveColumnCardinalityUpdateJob.run(HiveColumnCardinalityUpdateJob.java:80)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:62)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:51)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
    at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:130)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2016-01-28 16:14:41,504][DEBUG][org.apache.kylin.common.persistence.ResourceStore.putResource(ResourceStore.java:200)] - Saving resource /execute_output/a22a895d-d10f-4ab8-9bb0-defe1fdf1756-01 (Store kylin_metadata@hbase)
[2016-01-28 16:14:41,510][DEBUG][org.apache.kylin.common.persistence.ResourceStore.putResource(ResourceStore.java:200)] - Saving resource /execute_output/a22a895d-d10f-4ab8-9bb0-defe1fdf1756-01 (Store kylin_metadata@hbase)
[2016-01-28 16:14:41,513][INFO][org.apache.kylin.job.manager.ExecutableManager.updateJobOutput(ExecutableManager.java:241)] - job id:a22a895d-d10f-4ab8-9bb0-defe1fdf1756-01 from RUNNING to ERROR

Fun fact: the cube building process does not catch this error and
reports SUCCESS for this stage.

Any hints on what is going on and/or how to fix this issue?

Thanks
-Seb
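
As general background on this kind of exception: in Java, "String index out of range: -1" from String.substring usually means an index that came from a failed search (indexOf returns -1 when nothing is found) was passed on without a check. Whether that is what happens inside updateKylinTableExd is not confirmed here. The sketch below shows the defensive pattern in Python, where str.find also returns -1 on a miss; the tab-separated "column, cardinality" line format and the function name are purely hypothetical.

```python
def split_cardinality_line(line: str, sep: str = "\t"):
    """Return (column, cardinality), or None if the line is malformed."""
    pos = line.find(sep)  # -1 when sep is absent, like Java's indexOf
    if pos < 0:
        return None  # skip the line instead of slicing with -1
    column = line[:pos]
    value = line[pos + len(sep):].strip()
    if not value.isdigit():
        return None  # an empty or non-numeric cardinality is not valid
    return column, int(value)


print(split_cardinality_line("0\t42"))
print(split_cardinality_line("no separator here"))
```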


Re: Exact distinct count support

2016-01-28 Thread Yerui Sun
Thanks shaofeng and hongbin for your explanations.

Abhilash, I'm the reporter and contributor of KYLIN-1186, and here are some 
thoughts on the design:

We indeed support only the Int type for now, and casting Long to Int may cause 
precision loss (shaofeng removed the cast, and I agree with that). The main 
reason is that Int is enough for most cases.

I thought about supporting all types, including String and Date, and the 
conclusion is that it is difficult. One solution is to store all the values, 
which appears too costly. Another solution is to find a *precise* projection 
from string to int, for example a dict (not a hash, because hashed values may 
collide).
However, generating the dict is still difficult, especially when the 
cardinality is very high. I think KYLIN-1122 faces the same problem, so let's 
see what solution KYLIN-1122 arrives at; maybe we can borrow something.
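
To put a rough number on the "not hash" remark above: when n distinct values are hashed into a 32-bit space, the chance that at least two of them collide follows the birthday bound, approximately 1 - exp(-n(n-1) / (2 * 2^32)). The sketch below evaluates that approximation, assuming an idealized uniform hash; it is illustrative only.

```python
import math


def collision_probability(n_values: int, space_bits: int = 32) -> float:
    """Birthday-bound approximation of P(at least one hash collision)."""
    space = 2 ** space_bits
    return 1.0 - math.exp(-n_values * (n_values - 1) / (2.0 * space))


for n in (10_000, 100_000, 1_000_000):
    print(f"{n} values: collision probability about {collision_probability(n):.3f}")
```

Any collision makes two values share an id and undercounts them, which is why a collision-free dictionary is needed for an exact count.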

The reason for casting Long to Int is that the bitmap is based on 
RoaringBitmap, which is maintained by lemire (lem...@gmail.com) and supports 
only Integer. Extending it to Long is rather complicated, so I skipped that 
for now.

Overall, this feature fits only the common use case and definitely has room 
for improvement. Please let me know if you have any ideas; any comments are 
welcome.
 
 

> On Jan 28, 2016, at 22:33, ShaoFeng Shi wrote:
> 
> I removed the code for long type in BitmapCounter as the casting will get
> things wrong (but the target is to provide accurate value); @Yerui, for you
> awareness; once we find the solution for long, then add it back.



Re: [KYLIN-1186] Build failed in Jenkins: 2.x-staging-full-stagedcubing #124

2016-01-28 Thread Yerui Sun
Hi, lemire,
I have a question about RoaringBitmap: is there a theoretical upper bound on 
the size of a RoaringBitmap, given the max cardinality?



For hongbin,
I got your point, hongbin.

If we must have an upper length for every measure, the precision seems a good 
idea. 
But I'm still not sure whether there is a theoretical upper bound on the size 
of a RoaringBitmap given the max cardinality. If yes, that will be fine; 
otherwise we must find another way.
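
A hedged, back-of-envelope way to approach the question above: assuming the usual Roaring layout, in which each chunk of 2^16 values is stored either as a sorted array of 16-bit values (up to 4096 entries) or as a fixed 8192-byte bitmap, the container payload cannot exceed 2 bytes per value, nor 8192 bytes per chunk. The function below only encodes that reasoning. Its name is made up, headers and per-container bookkeeping are not modeled, and it is not an official bound from the RoaringBitmap project.

```python
def roaring_payload_upper_bound(cardinality: int, universe: int = 2 ** 32) -> int:
    """Rough upper bound, in bytes, on container payload (headers excluded)."""
    chunk = 1 << 16                     # values covered by one container
    bitmap_bytes = chunk // 8           # 8192 bytes for a dense container
    containers = -(-universe // chunk)  # ceiling division
    # Sparse containers cost 2 bytes per value; dense ones are capped at 8192 bytes.
    return min(2 * cardinality, bitmap_bytes * containers)


print(roaring_payload_upper_bound(1_000_000))  # sparse case: 2 bytes per value
print(roaring_payload_upper_bound(2 ** 32))    # fully dense 32-bit universe
```

If this reasoning holds, a per-measure cap on cardinality would translate fairly directly into a cap on bytes.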

> On Jan 27, 2016, at 12:41, Ma, Hongbin wrote:
> 
> I think an unbounded measure will be problematic from the maintenance 
> perspective as well as the optimization perspective.
> The situation is even worse in versions 2.0 and above, in which we 
> introduced the GTTable concept, where each column in the GTTable is treated 
> like a fixed-length cell.
> This means that for each dimension and measure, we'll always need to allocate 
> DataTypeSerializer.maxLength() of memory space.
> An unbounded measure will also increase the size of the cube's rows. As we 
> all know, HBase may have difficulty dealing with very large rows.
> Have you ever checked the largest row size of such cubes in your case? 
> 
> So it would be risky to release such an immature feature to our customers. I 
> suggest hiding the function from the Web UI (or even IT) until we come up 
> with a mature solution for this.
> Is it possible to follow the HLL implementation and set a "max" for the 
> measure? For example, we might have "bitmap(100)", "bitmap(1000)"?
> 
> I'm also forwarding this to the dev mailing list for public discussion.
> 
> Regards,
> 
> Bin Mahone | 马洪宾
> Apache Kylin: http://kylin.io 
> Github: https://github.com/binmahone  
> 
> From: Yerui Sun <sunye...@gmail.com>
> Date: Wednesday, January 27, 2016 at 12:25 PM
> To: ShaoFeng Shi <shaofeng...@apache.org>
> Cc: "Ma, Hongbin" <ho...@ebay.com>
> Subject: Re: Build failed in Jenkins: 2.x-staging-full-stagedcubing #124
> 
> Hi, shaofeng
> 
> Sorry for my slow response; I missed this mail because of a wrong mail 
> filter.
> 
> It is difficult to determine the max length of a bitmap, because its size 
> depends entirely on the data. I discussed this with liyang before, and his 
> suggestion was to just set a big number and see. Unfortunately, it seems 
> that was not a proper number.
> 
> I also have one question: why is 1.x-staging good while 2.x-staging failed? 
> What's the difference between them?
> 
>> On Jan 25, 2016, at 14:14, ShaoFeng Shi wrote:
>> 
>> Hi Yerui,
>> 
>> Jenkins failed to build the test cube 
>> "test_kylin_cube_without_slr_left_join_desc". I checked the log: the mapper 
>> throws an OOM exception when requesting memory. It should be caused by the 
>> bitmap measure, which estimates 8 MB as the max length, which might be too 
>> high.
>> 
>> Could you please check and improve the estimation? One idea is to add a 
>> precision for the bitmap measure, just like HLL or TopN, and then use the 
>> selected precision to do the estimation. Please let us know your thoughts. 
>> Thanks!
>> 
>> Here is the error trace from the mapper. Besides, we suggest you run a full 
>> regression test before pushing code to git. Just let me know if you want 
>> to know the detailed steps of a full CI.
>> 
>> 2016-01-24 22:11:40,202 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Failed to build cube in mapper 0
>>  at org.apache.kylin.engine.mr.steps.InMemCuboidMapper.cleanup(InMemCuboidMapper.java:124)
>>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
>>  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>>  at org.apache.kylin.engine.mr.steps.InMemCuboidMapper.cleanup(InMemCuboidMapper.java:122)
>>  ... 8 more
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>>  at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
>>  at org.apache.kylin.cube.inmemcubing.ConcurrentDiskStore$Reader$2.<init>(ConcurrentDiskStore.java:219)
>>  at org.apache.kylin.cube.inmemcubing.ConcurrentDiskStore$Reader.iterator(ConcurrentDiskStore.java:216)
>>  at org.apache.kylin.cube.inmemcubing.DoggedCubeBuilder$MergeSlot.fetchNext(DoggedCubeBuilder.java:403)
>>  at 
>> or

[KYLIN-1186] Build failed in Jenkins: 2.x-staging-full-stagedcubing #124

2016-01-28 Thread Li Yang
2.x failed because it started to do the right thing (allocating enough memory
for a row), while 1.x just hard-coded 1 MB for the row buffer.

I don't think the max length was much of an issue previously, because the goal
was to get the feature in. Now it is time to fix details like this. Adding
precision is a good idea.


Re: Exact distinct count support

2016-01-28 Thread Li Yang
If you can figure out a good mapping from date/string to int/long, then
the bitmap is a good solution. E.g., a date maps to an integer very well.

I expect the community will have more contributions in this area.
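
As a small illustration of the "date maps to integer" remark: counting days since a fixed epoch gives a compact, collision-free integer per date that fits easily in an int. The choice of 1970-01-01 as the epoch and the helper names are assumptions for this sketch, not something Kylin prescribes.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # assumed reference date for the example


def date_to_int(d: date) -> int:
    """Days since EPOCH: one unique integer per calendar date."""
    return (d - EPOCH).days


def int_to_date(n: int) -> date:
    """Inverse mapping, so the original date can always be recovered."""
    return EPOCH + timedelta(days=n)


activity_dates = [date(2016, 1, 28), date(2016, 1, 29), date(2016, 1, 28)]
ids = {date_to_int(d) for d in activity_dates}
print(len(ids))
```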


On Friday, January 29, 2016, Yerui Sun  wrote:

> Thanks shaofeng and hongbin for your explaining.
>
> Abhilash, I’m the reporter and contributor of KYLIN-1186, and here’s some
> thinking about the designing:
>
> We indeed just support Int type for now, and cast Long to Int may cause
> precision losing (shaofeng removed the casting and I agreed that), the
> reason mainly is Int has been enough for most cases.
>
> I thought about to support all types, including String or Date, and the
> conclusion is that’s difficult. One solution is store all the values,
> that’s appear too costly, and another solution is finding the *precisely*
> projecting from string to int, for example dict ( not hash, because the
> projecting maybe conflicting).
> However, the dict generating is still difficult, especially when the
> cardinality is very high. I think KYLIN-1122 facing the same problem, so
> let’s see what’s the solution in KYLIN-1122, maybe we could borrow
> something.
>
> The reason of casting Long to Int is that bitmap based on RoaringBitmap,
> which maintained by lemire(lem...@gmail.com ), just
> supporting Integer. Expanding it to Long is kind of complicated, so I
> skipped that for now.
>
> Overall, this feature just fitted the common user case, and has absolutely
> room for improvement. Please let me know if you have any idea, and any
> comment is welcome.


Re: Exact distinct count support

2016-01-28 Thread Sarnath
Just thinking out loud: for distinct counts on complex data types, a Bloom
filter could be considered after hashing the values to some hash code. A Bloom
filter is a probabilistic data structure that answers set-membership queries
faster, but only with a tentative answer.
Alternatively, a secondary index for the column (or for the distinct values of
that column) through Solr/ElasticSearch may also work.
On Jan 29, 2016 2:41 AM, "Li Yang" wrote:

> If you can figure out a good mapping from date/string to int/long, then
> the bitmap is a good solution. E.g., a date maps to an integer very well.
>
> I expect the community will make more contributions in this area.
>
>
> On Friday, January 29, 2016, Yerui Sun wrote:
>
> > Thanks Shaofeng and Hongbin for your explanations.
> >
> > Abhilash, I'm the reporter and contributor of KYLIN-1186, and here are some
> > thoughts on the design:
> >
> > We indeed only support the Int type for now, and casting Long to Int may
> > lose precision (Shaofeng removed the cast, and I agree with that). The main
> > reason is that Int is enough for most cases.
> >
> > I thought about supporting all types, including String and Date, and
> > concluded that it is difficult. One solution is to store all the values,
> > which appears too costly. Another is to find a *precise* mapping from
> > string to int, for example a dictionary (not a hash, because hash values
> > may collide).
> > However, generating the dictionary is still difficult, especially when the
> > cardinality is very high. I think KYLIN-1122 faces the same problem, so
> > let's see what the solution in KYLIN-1122 is; maybe we can borrow
> > something.
> >
> > Long was cast to Int because the bitmap is based on RoaringBitmap,
> > which is maintained by lemire (lem...@gmail.com) and only supports
> > Integer. Extending it to Long is rather complicated, so I
> > skipped that for now.
> >
> > Overall, this feature only fits the common use case and absolutely has
> > room for improvement. Please let me know if you have any ideas; any
> > comment is welcome.

Re: kylin concurrency test

2016-01-28 Thread hongbin ma
Hi,

The stats are only for reference; they were gathered from an early Kylin
version.
Also note that the stats may vary with your case, so I think a hands-on
exercise is necessary if you want to do a POC.

Kylin does not favor detail-level queries with a lot of result records,
because they are heavy to transfer in the JSON result format, and they do not
make much sense for analysts.
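
On question (5) in the quoted message below, a small sketch of the arithmetic may help. It assumes a closed-loop test in which each of the 30 threads sends its next query as soon as the previous one returns; that is an assumption about the test setup, not something the slides state. Under that assumption Little's law (concurrency = throughput x latency) applies, and 1/throughput is the time per query only when the queries run one after another. The function names are made up for illustration.

```python
def serial_time_per_query_ms(qps: float) -> float:
    """Time per query if queries ran strictly one after another."""
    return 1000.0 / qps


def mean_latency_ms(threads: int, qps: float) -> float:
    """Little's law for a closed loop: latency = concurrency / throughput."""
    return threads / qps * 1000.0


# 72.5 queries/sec is the high-level throughput quoted in the thread.
print(round(serial_time_per_query_ms(72.5), 1))  # 13.8
print(round(mean_latency_ms(30, 72.5), 1))       # 413.8
```

With 30 queries in flight, a mean latency of roughly 414 ms would be compatible with a 67 ms minimum, so the two figures need not contradict each other.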

On Thu, Jan 28, 2016 at 11:50 PM, zhong zhang  wrote:

> Hi Hongbin, Luke, Feng and everyone,
>
> Thanks so much for taking the time to read my thread and for
> your kind help. Luke, thanks for introducing Feng to me.
>
> Feng, the concurrency test is vital for our application case. We will
> definitely use benchmark datasets to test it later. For now, we'd like to
> get a broad understanding of Kylin's capacity. Can you help me answer the
> following questions about the throughput graph?
>
> (1) Parallel Thread #: 30 for high-level aggregation queries and 30 for
> detail-level queries. How many high-level aggregation queries are used? Do
> 30 parallel threads mean that 30 queries are triggered at the same time?
>
> (2) Does "raw records" mean the total records for all the queries?
>
> (3) For HBase scan, does a return count lower than the scan count mean
> there is no hit in the cube?
>
> (4) Latency: are the min, max, and median the statistical results for
> all the test queries? What is the 90% line?
>
> (5) The throughput is 72.5/sec for high level, so each query takes
> about 13.8 ms. This seems to contradict the min latency of 67 ms.
> Please correct me.
>
> Thanks once again for your help. Have a wonderful day!
>
> Best regards,
> Zhong
>
> On Wed, Jan 27, 2016 at 10:34 PM, yu feng  wrote:
>
> > Yes, it is QPS. This result comes from page 36 of "Apache Kylin: A
> > Large-Scale Online Analytical Platform on Hadoop"
> > <http://events.linuxfoundation.org/sites/events/files/slides/Apache%20Kylin%202014%20Dec.pdf>.
> > We ran the same test with one and with two Kylin query nodes and
> > got similar results, so we used that picture for convenience. The
> > bottleneck of Kylin query throughput is HBase scan performance, which
> > depends on the number of region servers, machine configuration, network, etc.
> >
> >
> > 2016-01-28 10:08 GMT+08:00 Luke Han :
> >
> > > It's QPS, please contact Yu Feng (kylin committer) from NetEase for
> more
> > > detail.
> > >
> > > Thanks.
> > > Luke
> > >
> > >
> > > Best Regards!
> > > -
> > >
> > > Luke Han
> > >
> > > On Thu, Jan 28, 2016 at 9:43 AM, hongbin ma 
> > wrote:
> > >
> > > > i think by default it is QPS (queries per second)
> > > >
> > > > On Thu, Jan 28, 2016 at 7:34 AM, zhong zhang 
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > There is an article <http://www.bitstech.net/2016/01/04/kylin-olap/>
> > > > > posted by @Hu Wei at NetEase which introduces the concurrency test
> > > > > results. In the article, there is a throughput result graph. Please
> > > > > see the attachment.
> > > > > Based on my understanding, the x-axis is the number of Kylin servers.
> > > > > What's the y-axis? Is it the number of requests at the same time?
> > > > >
> > > > > Best regards,
> > > > > Zhong
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > *Bin Mahone | 马洪宾*
> > > > Apache Kylin: http://kylin.io
> > > > Github: https://github.com/binmahone
> > > >
> > >
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Re: StringIndexOutOfBoundsException: String index out of range: -1

2016-01-28 Thread hongbin ma
HiveColumnCardinalityUpdateJob
description in the source code:

/**
 * This job will update save the cardinality result into Kylin table
 * metadata store.
 * @author shaoshi
 */


It does not belong to a cubing job; it's a separate task to help modeling.
Can you check the output in /tmp/kylin/cardinality/KYLIN_DK.DIM_DTM? It
seems the content format is not as expected:
https://github.com/apache/kylin/blob/kylin-1.2/job/src/main/java/org/apache/kylin/job/hadoop/cardinality/HiveColumnCardinalityUpdateJob.java#L113



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Kylin.sandbox=true in config

2016-01-28 Thread greg gu
Hi,

I have a 4-node Hadoop cluster. I tried to set Kylin.sandbox=false in
kylin.properties and then started Kylin, but the Kylin web site stopped
working after the change.

I use SSH tunneling to visit the web site. The URL I use is
http://localhost:7070/kylin

Before the change, the above web site worked and showed the Kylin UI. After I
changed sandbox to false, the browser says "the web page cannot be found".

Could you let me know if I need to change Kylin.sandbox to false?

Thanks,
Greg


Re: Kylin.sandbox=true in config

2016-01-28 Thread Jian Zhong
What's the error in $KYLIN_HOME/tomcat/logs/kylin.log when you start the server?

On Fri, Jan 29, 2016 at 10:20 AM, greg gu  wrote:

> Hi,
>
> I have 4 nodes hadoop cluster, I tried to set  Kylin.sandbox=false in
> kylin.properties, then I started kylin.  but the kylin web site stopped
> working after the change.
>
> I use ssh tunneling to visit web site. the url I used is
> http://localhost:7070/kylin
>
> before the changes, the above web site works, it shows kylin UI. after I
> changed sandbox to false, the browser says "the web page cannot be found"
>
> could you let me know if I need to change Kylin.sandbox to false ?
>
> thanks,
> Greg
>
>
>
>


[jira] [Created] (KYLIN-1378) Add UI for TopN measure

2016-01-28 Thread Shaofeng SHI (JIRA)
Shaofeng SHI created KYLIN-1378:
---

 Summary: Add UI for TopN measure
 Key: KYLIN-1378
 URL: https://issues.apache.org/jira/browse/KYLIN-1378
 Project: Kylin
  Issue Type: Sub-task
  Components: Web 
Reporter: Shaofeng SHI
Assignee: Zhong,Jason


A user interface is needed for users to define the TopN measure. The user
needs to select: 1) the literal column; 2) the metrics column; 3) the
expression (default SUM); 4) the sorting order (default DESC).

A sample is:

{
  "name" : "TOP_SELLER",
  "function" : {
    "expression" : "TOP_N",
    "parameter" : {
      "name" : "counter",
      "type" : "column",
      "value" : "PRICE",
      "next_parameter" : {
        "name" : "literal",
        "type" : "column",
        "value" : "SELLER_ID",
        "next_parameter" : {
          "name" : "expression",
          "type" : "",
          "value" : "SUM",
          "next_parameter" : {
            "name" : "order",
            "type" : "",
            "value" : "DESC"
          }
        }
      }
    },
    "returntype" : "topn(100)"
  }
}
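
The sample encodes its four settings as a chain of next_parameter objects. As a rough illustration of how a UI or validator might read that chain back, here is a Python sketch. The helper flatten_parameters is hypothetical and is not part of Kylin, and the closing braces of the embedded JSON are an assumption about how the sample is meant to nest.

```python
import json

# The TopN sample from this issue, with the braces balanced (assumed nesting).
SAMPLE = """
{
  "name": "TOP_SELLER",
  "function": {
    "expression": "TOP_N",
    "parameter": {
      "name": "counter", "type": "column", "value": "PRICE",
      "next_parameter": {
        "name": "literal", "type": "column", "value": "SELLER_ID",
        "next_parameter": {
          "name": "expression", "type": "", "value": "SUM",
          "next_parameter": {"name": "order", "type": "", "value": "DESC"}
        }
      }
    },
    "returntype": "topn(100)"
  }
}
"""


def flatten_parameters(measure: dict) -> dict:
    """Follow the next_parameter links and return {name: value} in chain order."""
    flat = {}
    node = measure["function"].get("parameter")
    while node is not None:
        flat[node["name"]] = node["value"]
        node = node.get("next_parameter")
    return flat


params = flatten_parameters(json.loads(SAMPLE))
print(params)
```

A flat view like this makes it easy to check that the four choices listed above (metrics column, literal column, expression, order) are all present.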




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Kylin.sandbox=true in config

2016-01-28 Thread ShaoFeng Shi
Setting Kylin.sandbox=false makes Kylin use LDAP for user authentication; it
has nothing to do with the cluster. Please change it back to true to
use the testing profile.

2016-01-29 10:52 GMT+08:00 Jian Zhong :

> what's the error in $KYLIN_HOME/tomcat/logs/kylin.log when you start server
>
> On Fri, Jan 29, 2016 at 10:20 AM, greg gu  wrote:
>
> > Hi,
> >
> > I have 4 nodes hadoop cluster, I tried to set  Kylin.sandbox=false in
> > kylin.properties, then I started kylin.  but the kylin web site stopped
> > working after the change.
> >
> > I use ssh tunneling to visit web site. the url I used is
> > http://localhost:7070/kylin
> >
> > before the changes, the above web site works, it shows kylin UI. after I
> > changed sandbox to false, the browser says "the web page cannot be found"
> >
> > could you let me know if I need to change Kylin.sandbox to false ?
> >
> > thanks,
> > Greg
> >
> >
> >
> >
>



-- 
Best regards,

Shaofeng Shi


Re: Kylin.sandbox=true in config

2016-01-28 Thread greg gu
Thanks for your help.

Another configuration option is
deploy.env=Dev|QA|prod

What are the differences between the three values?

Greg

Sent from my iPhone

> On Jan 28, 2016, at 7:06 PM, ShaoFeng Shi  wrote:
> 
> set Kylin.sandbox=false will start to use LDAP for user authentication, it
> has no relationship with the cluster. Please change it back to true to
> using the testing profile.
> 
> 2016-01-29 10:52 GMT+08:00 Jian Zhong :
> 
>> what's the error in $KYLIN_HOME/tomcat/logs/kylin.log when you start server
>> 
>>> On Fri, Jan 29, 2016 at 10:20 AM, greg gu  wrote:
>>> 
>>> Hi,
>>> 
>>> I have 4 nodes hadoop cluster, I tried to set  Kylin.sandbox=false in
>>> kylin.properties, then I started kylin.  but the kylin web site stopped
>>> working after the change.
>>> 
>>> I use ssh tunneling to visit web site. the url I used is
>>> http://localhost:7070/kylin
>>> 
>>> before the changes, the above web site works, it shows kylin UI. after I
>>> changed sandbox to false, the browser says "the web page cannot be found"
>>> 
>>> could you let me know if I need to change Kylin.sandbox to false ?
>>> 
>>> thanks,
>>> Greg
> 
> 
> 
> -- 
> Best regards,
> 
> Shaofeng Shi


[jira] [Created] (KYLIN-1379) More stable precise count distinct implements after KYLIN-1186

2016-01-28 Thread Yerui Sun (JIRA)
Yerui Sun created KYLIN-1379:


 Summary: More stable precise count distinct implements after 
KYLIN-1186
 Key: KYLIN-1379
 URL: https://issues.apache.org/jira/browse/KYLIN-1379
 Project: Kylin
  Issue Type: Improvement
  Components: Job Engine
Affects Versions: v2.1, v1.3
Reporter: Yerui Sun
Assignee: Yerui Sun


After KYLIN-1186, we gained the ability to count distinct values of int type
columns.
However, the implementation in KYLIN-1186 is not stable, especially in the
2.x-staging branch.
The reason is that the measure's maxlength is used to allocate memory in the
2.x version, and the BitmapMeasure is hardcoded to 8MB in KYLIN-1186, causing
OOM during cube building.
To resolve this problem, we can introduce a precision on the bitmap measure,
such as bitmap(100), bitmap(1), bitmap(100), meaning the measure can
accept a cardinality of at most 100/1/1M. This solution should be fine:
realistically, if the count value is over 100, the approximate result
produced by the hyperloglog measure should be acceptable.
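
A minimal sketch of the idea, written in Python rather than Kylin's Java and assuming that the precision N in bitmap(N) is simply a cap on how many distinct non-negative integer ids one measure may hold. The class BoundedBitmapCounter and its methods are hypothetical; they are not Kylin or RoaringBitmap APIs, and a plain Python integer stands in for the compressed bitmap.

```python
class BoundedBitmapCounter:
    """Exact distinct counter for non-negative int ids, capped at max_cardinality."""

    def __init__(self, max_cardinality: int):
        self.max_cardinality = max_cardinality
        self.bits = 0  # a Python int used as an arbitrarily long bitset

    def count(self) -> int:
        # Number of set bits equals the number of distinct ids seen.
        return bin(self.bits).count("1")

    def add(self, value: int) -> None:
        if value < 0:
            raise ValueError("only non-negative ids are supported")
        mask = 1 << value
        if self.bits & mask:
            return  # already counted, nothing changes
        if self.count() >= self.max_cardinality:
            raise OverflowError("cardinality limit exceeded")
        self.bits |= mask

    def merge(self, other: "BoundedBitmapCounter") -> None:
        # Union of two bitmaps; the cap is checked on the merged result.
        merged = self.bits | other.bits
        if bin(merged).count("1") > self.max_cardinality:
            raise OverflowError("cardinality limit exceeded")
        self.bits = merged


a = BoundedBitmapCounter(100)
for user_id in [3, 7, 3, 42]:
    a.add(user_id)
b = BoundedBitmapCounter(100)
for user_id in [7, 8]:
    b.add(user_id)
a.merge(b)
print(a.count())  # 4
```

Once the cap is hit, a caller would have to switch to an approximate measure such as hyperloglog, which is the trade-off described above.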



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [KYLIN-1186] Build failed in Jenkins: 2.x-staging-full-stagedcubing #124

2016-01-28 Thread Yerui Sun
I opened a new jira https://issues.apache.org/jira/browse/KYLIN-1379 to track 
this, and I will work on this in Feb.

> On Jan 29, 2016, at 03:33, Li Yang wrote:
> 
> 2.x failed because it started to do the right thing (allocating enough memory
> for a row), while 1.x just hard-coded 1 MB for the row buffer.
> 
> I don't think the max length was much of an issue previously, because the goal
> was to get the feature in. Now it is time to fix details like this. Adding
> precision is a good idea.



Re: [KYLIN-1186] Build failed in Jenkins: 2.x-staging-full-stagedcubing #124

2016-01-28 Thread hongbin ma
Cool, thanks for the continued contributions on this!

On Fri, Jan 29, 2016 at 2:15 PM, Yerui Sun  wrote:

> I opened a new jira https://issues.apache.org/jira/browse/KYLIN-1379 to
> track this, and I will work on this in Feb.
>
> > On Jan 29, 2016, at 03:33, Li Yang wrote:
> >
> > 2.x failed because it started to do the right thing -- allocate enough
> mem
> > for a row -- while 1.x just hard coded 1mb for the row buffer.
> >
> > I dont think the max length is too much an issue previously cause the
> goal
> > was to get the feature in. Now it is time to fix details like this.
> Adding
> > precision is a good idea.
>
>


-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone


Re: Exact distinct count support

2016-01-28 Thread Yerui Sun
Hi, Sarnath,

What we want is a *precise* distinct count algorithm. As you said, a Bloom
filter is a *probabilistic* data structure, so we can't get a precise result
from it.

A secondary index, or inverted index, is a big topic. As far as I know, we
haven't decided how to leverage an inverted index in Kylin; maybe we could
discuss this in another thread.
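
To illustrate the first point, here is a small Python sketch comparing a Bloom-filter-based count with an exact one. TinyBloom is a toy written for this example, not a Kylin class, and the filter is deliberately undersized (64 bits for 200 values) so that false positives are guaranteed; a realistically sized filter would err far less often, but the error would still not be zero.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter over strings, using a Python int as the bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_absent(self, item: str) -> bool:
        """Insert the item; return True if it looked new (some bit was unset)."""
        positions = list(self._positions(item))
        looked_new = any(((self.bits >> p) & 1) == 0 for p in positions)
        for p in positions:
            self.bits |= 1 << p
        return looked_new


values = [f"user{i}@example.com" for i in range(200)]

bloom = TinyBloom(num_bits=64, num_hashes=2)
approx = sum(bloom.add_if_absent(v) for v in values)  # items that looked new
exact = len(set(values))                               # true distinct count

print(exact, approx)
```

Every value counted as new sets at least one previously unset bit, so the Bloom-based count can never exceed the number of bits and falls short of the exact count of 200 here.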

> On Jan 29, 2016, at 08:16, Sarnath wrote:
> 
> Just thinking out loud: for distinct counts on complex data types, a Bloom
> filter could be considered after hashing the values to some hash code. A Bloom
> filter is a probabilistic data structure that answers set-membership queries
> faster, but only with a tentative answer.
> Alternatively, a secondary index for the column (or for the distinct values of
> that column) through Solr/ElasticSearch may also work.
> On Jan 29, 2016 2:41 AM, "Li Yang" wrote:
> 
>> If you can figure out a good mapping from date/string to int/long, then
>> the bitmap is a good solution. E.g., a date maps to an integer very well.
>> 
>> I expect the community will make more contributions in this area.
>> 

Re: StringIndexOutOfBoundsException: String index out of range: -1

2016-01-28 Thread hd2
Hi,

The output file is actually empty (that's probably the cause of "out
of range -1": length (0) - 1 = -1). There is no log output that
could be used to investigate why the file is empty. Any hints
on how we can debug why it is empty?


2016-01-29 2:52 GMT+01:00 hongbin ma :
> HiveColumnCardinalityUpdateJob
> desc in source code:
>
> /**
>  * This job will update save the cardinality result into Kylin table
> metadata store.
>  * @author shaoshi
>  */
>
>
>
> it does not belong to a cubing job, it's a separate task to help modeling.
> can you checkout the output in /tmp/kylin/cardinality/KYLIN_DK.DIM_DTM, it
> seems the content format is not as expected:
> https://github.com/apache/kylin/blob/kylin-1.2/job/src/main/java/org/apache/kylin/job/hadoop/cardinality/HiveColumnCardinalityUpdateJob.java#L113
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone