Re: [DISCUSSION] Distributed Index Cache Server

litao Tue, 12 Feb 2019 00:40:20 -0800

Some discussion points:
1.      The first time, it has to load all the datamaps based on the list of
segments provided by the main server. Index pruning will happen from the
cached datamaps.
        -> Does "main server" mean the carbon driver?
        Kunal : Yes, "main server" means the carbon driver. I will update the
same in the design.
2.      The pruned splits can either be written to a file, if there are too
many splits, or be sent back to the carbon driver directly.
        -> So the carbon driver has two ways to get the index result: either
it is returned directly, or the carbon driver loads it from a file on HDFS
whose path is returned to the carbon driver?
        Kunal : Correct. If the split count is huge, like 1 million, then the
serialization cost would be too high. Therefore it is better to write to a
file so that the carbon driver can read the files and generate splits
directly.
3.      If the number of splits exceeds the threshold, the index driver
will create a multi-block split on the directory in which the files are
written and serialize it to the carbon driver.
        -> What does "multi-block split" mean?
        Kunal : A multi-block split contains more than 1 split. It is
a way to tell the carbon driver that the splits are written to a file, with
the location present in the multi-block split object, and that the driver can
read directly from there.
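
To make points 2 and 3 concrete, here is a minimal sketch of the two return
paths. It is only an illustration, not the design: the names
(MultiBlockSplit, respond_with_splits, read_splits), the JSON file format and
the threshold value are all assumptions; the thread only gives 1 million as an
example of a huge split count.

```python
import json
import os
from dataclasses import dataclass
from typing import List, Union

# Assumed threshold, for illustration only.
SPLIT_COUNT_THRESHOLD = 1_000_000


@dataclass
class MultiBlockSplit:
    """Hypothetical marker object: tells the driver where the split files are."""
    location: str


def respond_with_splits(splits: List[str], out_dir: str,
                        threshold: int = SPLIT_COUNT_THRESHOLD
                        ) -> Union[List[str], MultiBlockSplit]:
    """Index-server side: return the splits directly, or write them to a file."""
    if len(splits) <= threshold:
        # Few splits: serializing them back to the driver is cheap enough.
        return list(splits)
    # Too many splits: write them to a file and return only the location.
    path = os.path.join(out_dir, "splits-0.json")
    with open(path, "w") as f:
        json.dump(splits, f)
    return MultiBlockSplit(location=out_dir)


def read_splits(response: Union[List[str], MultiBlockSplit]) -> List[str]:
    """Carbon-driver side: handle both kinds of response."""
    if isinstance(response, MultiBlockSplit):
        splits: List[str] = []
        # Read every split file found under the returned location.
        for name in sorted(os.listdir(response.location)):
            with open(os.path.join(response.location, name)) as f:
                splits.extend(json.load(f))
        return splits
    return response
```

The driver does not need to know in advance which path was taken; it only
checks the type of the object it gets back.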
4.      The server should be called when the size of the index files of the
table is more than 1GB. If the size is less than that, the main driver can
prune.
        -> How does the carbon driver decide whether to use the index server
or the carbon driver cache?
        Kunal : The carbon driver will read the table status file and
generate the LoadMetaDataDetails. The index size stored in the load details
can be used to calculate the total index size for the table. If it is more
than the configured value, the index server can be called; otherwise the
carbon driver can prune the splits (the way pruning is done currently).
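
The decision in point 4 could be sketched roughly as below. LoadDetail is a
hypothetical stand-in for one entry of the load details, and the field and
function names are assumptions for illustration; only the 1GB figure comes
from the discussion.

```python
from dataclasses import dataclass
from typing import Iterable

ONE_GB = 1024 ** 3  # threshold mentioned in the discussion


@dataclass
class LoadDetail:
    """Hypothetical stand-in for one segment entry read from the table status file."""
    segment_id: str
    index_size_bytes: int


def total_index_size(load_details: Iterable[LoadDetail]) -> int:
    """Sum the index size recorded for every segment of the table."""
    return sum(d.index_size_bytes for d in load_details)


def should_use_index_server(load_details: Iterable[LoadDetail],
                            threshold_bytes: int = ONE_GB) -> bool:
    """True if the total index size is above the configured threshold."""
    return total_index_size(load_details) > threshold_bytes
```

For example, two segments of 600MB index each would go to the index server,
while a single 10MB segment would be pruned in the carbon driver.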
5.      The index files to be divided between the executors should be based on
size and not count.
        -> Does this consider whether the index of the same table is
distributed on the same executor?
        Kunal : Yes, it will, because the index driver will have a
mapping for the same. The idea here is that the total size of the datamaps
handled by each executor has to be the same.
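
One possible way to divide by size rather than count is a greedy
"largest file first, least loaded executor next" heuristic. The thread does
not say which algorithm the index driver will use, so this is only a sketch
under that assumption; assign_by_size and its arguments are hypothetical
names.

```python
import heapq
from typing import Dict, List


def assign_by_size(index_files: Dict[str, int],
                   executors: List[str]) -> Dict[str, List[str]]:
    """Map each executor to a list of index files, balancing total bytes."""
    if not executors:
        raise ValueError("at least one executor is required")
    # Min-heap of (bytes assigned so far, executor id).
    heap = [(0, e) for e in executors]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {e: [] for e in executors}
    # Place the largest files first; ties are broken by file name.
    for name, size in sorted(index_files.items(),
                             key=lambda kv: (-kv[1], kv[0])):
        load, executor = heapq.heappop(heap)  # least loaded executor
        assignment[executor].append(name)
        heapq.heappush(heap, (load + size, executor))
    return assignment
```

The returned dictionary plays the role of the mapping kept by the index
driver: it records which executor holds which index files. The heuristic only
approximates equal sizes; it does not guarantee a perfect balance.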
6.      Dynamic allocation for the index server should be false.
        -> What does this mean?
        Kunal : Dynamic executor allocation is a property exposed by
Spark; if it is set to true, the executor processes are killed once their
jobs are complete. This is not good for the server, as it would then have to
reload the datamaps.
7.      Fallback
        -> If the index server recovers, the carbon driver needs to reuse
the index server cache instead of its own LRU cache.
        Kunal : Correct, I will update the same in the design.
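
The fallback in point 7 might look roughly like this. It is a sketch, not
the agreed design: PruningClient, the two callables and the use of
ConnectionError to signal an unreachable index server are all assumptions.

```python
from typing import Callable, List, Optional

PruneFn = Callable[[List[str]], List[str]]


class PruningClient:
    """Hypothetical driver-side helper: prefer the index server, else prune locally."""

    def __init__(self, server_prune: PruneFn, local_prune: PruneFn) -> None:
        self._server_prune = server_prune
        self._local_prune = local_prune
        self.last_source: Optional[str] = None

    def prune(self, segments: List[str]) -> List[str]:
        try:
            # The server is tried on every call, so once it recovers the
            # driver goes back to the index server cache automatically.
            splits = self._server_prune(segments)
            self.last_source = "index_server"
        except ConnectionError:
            # Server unreachable: fall back to pruning in the carbon driver.
            splits = self._local_prune(segments)
            self.last_source = "driver"
        return splits
```

Trying the server first on every request keeps the sketch simple; a real
implementation would probably want to avoid paying a connection timeout on
each query while the server is down.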






--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
