Hi Yong Zhang,

Thank you for analyzing CarbonData.
Yes, lazy decoding is only possible if the dictionaries are global.
CarbonData generates the global dictionary values at data load time.
There are two ways to generate the global dictionary values:
1. Launch a job that reads all the input data, finds the distinct values in
each column, and assigns dictionary values to them (see the sketch after
this list). Then the actual loading job starts; it simply encodes the data
with the already generated dictionary values and writes it out in the
CarbonData format.
2. Launch a dictionary server/client to generate the global dictionary
during the load job itself. The load job consults the dictionary server to
get the global dictionary values for the fields.
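
To make approach 1 concrete, here is a minimal sketch in plain Scala (not
CarbonData's actual code, and with made-up column values): a first pass
assigns integer surrogate keys to the distinct values of a dictionary
column, and the load pass then writes only those keys.

// Sketch of approach 1: pre-scan builds the global dictionary, the load
// pass encodes the column with it.
object GlobalDictionarySketch {
  def buildDictionary(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap   // e.g. "CN" -> 0, "IN" -> 1, "US" -> 2

  def encodeColumn(values: Seq[String], dict: Map[String, Int]): Seq[Int] =
    values.map(dict)                            // rows now carry ints, not strings

  def main(args: Array[String]): Unit = {
    val c3 = Seq("US", "CN", "US", "IN", "CN")  // hypothetical column values
    val dict = buildDictionary(c3)
    println(encodeColumn(c3, dict))             // List(2, 0, 2, 1, 0)
  }
}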

Yes, compared to a local dictionary this is a little more expensive, but with
this approach we get better compression and better performance through lazy
decoding, as illustrated below.
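
Here is a minimal sketch of what lazy decoding means for the query in your
mail. This is again plain Scala with hypothetical data, not the actual
CarbonData/Spark integration: because the dictionary is global, encoded keys
coming from different files and hosts can be grouped and summed directly,
and only the small final result is decoded back to strings.

// "select c3, sum(c2) from t1 group by c3" on already-encoded rows.
object LazyDecodingSketch {
  def main(args: Array[String]): Unit = {
    val reverseDict = Map(0 -> "CN", 1 -> "IN", 2 -> "US")  // surrogate key -> value

    // (encoded c3, c2) pairs, as if read from different files/hosts; mixing
    // them is safe only because the dictionary is global.
    val rows = Seq((2, 10L), (0, 5L), (2, 7L), (1, 3L), (0, 4L))

    // Group and sum on the integer keys; no string decoding happens here.
    val grouped = rows.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

    // Decode only the final result set.
    grouped.foreach { case (k, s) => println(s"${reverseDict(k)} -> $s") }
  }
}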



Regards,
Ravindra.

On 9 March 2017 at 00:01, Yong Zhang <java8...@hotmail.com> wrote:

> Hi,
>
>
> I watched one session about "Apache CarbonData" at Spark Summit 2017. The
> video is here: https://www.youtube.com/watch?v=lhsAg2H_GXc.
>
> (The talk is "Apache Carbondata: An Indexed Columnar File Format for
> Interactive Query" by Jacky Li/Jihong Ma.)
>
>
> Starting from 23:10, the speaker talks about the lazy decoding optimization,
> and the example given in the talk is the following:
>
> "select c3, sum(c2) from t1 group by c3". He says that c3 can be aggregated
> directly on the encoded value (maybe an integer, say if a String-typed c3 is
> encoded as int). I assume this is in fact done within the Spark executor
> engine, as the speaker described.
>
>
> But I am really not sure I understand how this is possible, especially in
> Spark. If CarbonData were the storage format for a framework on one box, I
> could imagine that and understand the value it brings. But for a distributed
> execution engine like Spark, the data will come from multiple hosts. Spark
> has to deserialize the data for grouping/aggregating (c3 in this case). Even
> if Spark somehow delegates this to the underlying storage engine, how will
> CarbonData make sure that all the values are encoded the same way globally?
> Won't it just encode consistently per file? Doing it globally seems too
> expensive, but without that I don't see how this lazy decoding can work.
>
>
> I have just started researching this project, so maybe there is something
> underlying it that I don't understand.
>
>
> Thanks
>
>
> Yong
>



-- 
Thanks & Regards,
Ravi
