Hi,
I watched a session on Apache CarbonData from Spark Summit 2017, "Apache CarbonData: An Indexed Columnar File Format for Interactive Query" by Jacky Li and Jihong Ma: https://www.youtube.com/watch?v=lhsAg2H_GXc

Starting at 23:10, the speaker talks about the lazy decoding optimization. The example given in the talk is "select c3, sum(c2) from t1 group by c3", and the claim is that c3 can be grouped and aggregated directly on its encoded value (an integer, say, if a String-typed c3 is dictionary-encoded as int). As the speaker describes it, this happens inside the Spark executor itself.

I am not sure I understand how this is possible, especially in Spark. If CarbonData were the storage format for a framework running on one box, I could see the value this brings. But for a distributed execution engine like Spark, the data comes from multiple hosts, and Spark has to deserialize the data for grouping/aggregating (c3 in this case). Even supposing Spark somehow delegates this to the underlying storage engine, how does CarbonData make sure that every value is encoded the same way globally? Wouldn't it just encode consistently per file? A global dictionary seems too expensive, but without one I don't see how lazy decoding can work.

I have just started researching this project, so maybe there is something underneath that I don't understand.

Thanks,
Yong
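P.S. To make my mental model concrete, here is a rough sketch in plain Scala of what I think lazy decoding does. The Row shape, the dictionary, and all the names are my own assumptions for illustration, not CarbonData's actual API:

object LazyDecodingSketch {

  // Assume c3 (a String column) was dictionary-encoded at write time,
  // so each row carries an Int surrogate key instead of the String.
  case class Row(c2: Long, c3Key: Int)

  // Hypothetical dictionary: surrogate key -> original String value.
  // For this to work across files/hosts, key 1 must mean the same
  // String everywhere -- which is exactly my question.
  val dictionary: Map[Int, String] = Map(1 -> "apple", 2 -> "banana")

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(10, 1), Row(5, 2), Row(7, 1))

    // "select c3, sum(c2) from t1 group by c3", but grouping and
    // aggregating on the cheap Int key, never touching the String.
    val aggregated: Map[Int, Long] =
      rows.groupBy(_.c3Key).map { case (k, rs) => k -> rs.map(_.c2).sum }

    // Decode only once per group, at the very end ("lazy" decoding).
    val result: Map[String, Long] =
      aggregated.map { case (k, total) => dictionary(k) -> total }

    result.foreach { case (c3, total) => println(s"$c3 -> $total") }
  }
}

If each file were free to assign key 1 to a different String, then the per-key partial sums computed on different executors could not be merged before decoding, which is the crux of what I don't understand.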