Some discussion points:

1. The first time it has to load all the datamaps based on the list of segments provided by the main server. Index pruning will happen from the cached datamaps.
-> Does "main server" mean the carbon driver?
Kunal : Yes, "main server" means the carbon driver. I will update this in the design.

2. The pruned splits can either be written to a file if the number of splits is too large, or be sent back to the carbon driver directly.
-> Does the carbon driver have two ways to get the index: either it is returned directly, or it is loaded from a file on HDFS whose path is returned to the carbon driver?
Kunal : Correct. If the split count is huge, say 1 million, the serialization cost would be too high. It is therefore better to write the splits to a file so that the carbon driver can read the files and generate the splits directly.

3. If the number of splits exceeds the threshold, the index driver will create a multi block split on the directory in which the files are written and serialize it to the carbon driver.
-> What does "multi block split" mean?
Kunal : A multi-block split is one that contains more than 1 split. It is a way to tell the carbon driver that the splits are written to a file, whose location is present in the multi-block split object, and that the driver can read directly from there.

4. The server should be called when the size of the index files of the table is more than 1GB. If the table size is less, then the main driver can prune.
-> How can the carbon driver decide whether to use the index server or the carbon driver cache?
Kunal : The carbon driver will read the table status file and generate the LoadMetaDataDetails. The index size stored in the load details can be used to calculate the total index size for the table. If it is more than the configured value, the index server can be called; otherwise the carbon driver can prune the splits (the way pruning is done currently).

5. The index files to be divided between the executors should be based on size and not count.
-> Does this consider whether the indexes of the same table are distributed on the same executor?
Kunal : Yes, it will, because the index driver will have a mapping for this. The idea here is that the size of the datamaps handled by each executor has to be the same.

6. Dynamic allocation for index server should be false
-> What does this mean?
Kunal : This refers to dynamic executor allocation, a property exposed by Spark. If it is set to true, the executor processes are killed once their jobs are complete. This is not good for the server, as it would then have to reload the datamaps.

7. Fallback
-> If the index server recovers, the carbon driver needs to reuse the index server cache instead of its own LRU cache.
Kunal : Correct, I will update this in the design.
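
To make the decision in point 4 concrete, here is a minimal sketch of the size check, written in Python only for illustration. The LoadDetail class, its index_size_bytes field and the should_use_index_server function are hypothetical names, not CarbonData's actual API; the 1GB default is taken from the design statement quoted above.

```python
from dataclasses import dataclass

ONE_GB = 1024 ** 3  # threshold from the design statement; configurable in practice


@dataclass
class LoadDetail:
    """Hypothetical stand-in for one load entry read from the table status file."""
    segment_id: str
    index_size_bytes: int


def should_use_index_server(load_details, threshold_bytes=ONE_GB):
    """Return True when the table's total index size exceeds the threshold."""
    total_index_size = sum(d.index_size_bytes for d in load_details)
    return total_index_size > threshold_bytes


small_table = [LoadDetail("0", 200 * 1024 ** 2), LoadDetail("1", 300 * 1024 ** 2)]
large_table = small_table + [LoadDetail("2", 700 * 1024 ** 2)]

print(should_use_index_server(small_table))  # 500 MB in total: driver prunes locally
print(should_use_index_server(large_table))  # 1200 MB in total: call the index server
```

When the check returns False, the carbon driver would keep pruning from its own cache as it does today.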
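
For point 5, the thread does not say which algorithm the index driver uses to balance by size. One common heuristic, shown here purely as an assumption, is to hand the largest remaining index file to the executor that currently holds the least data. The function name assign_by_size and the segment and executor labels are made up for the example.

```python
import heapq


def assign_by_size(index_sizes, executors):
    """Greedily assign index files to executors so that total sizes stay balanced.

    index_sizes: dict mapping a segment (or index file) name to its size.
    executors:   list of executor names.
    Returns a dict mapping each executor to the list of names assigned to it.
    """
    # Min-heap of (current load, executor) so the least-loaded executor pops first.
    heap = [(0, executor) for executor in executors]
    heapq.heapify(heap)
    assignment = {executor: [] for executor in executors}

    # Place the largest files first; small files then fill the remaining gaps.
    for name, size in sorted(index_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, executor = heapq.heappop(heap)
        assignment[executor].append(name)
        heapq.heappush(heap, (load + size, executor))
    return assignment


sizes = {"s1": 8, "s2": 7, "s3": 6, "s4": 5, "s5": 4}
result = assign_by_size(sizes, ["e1", "e2"])
loads = {e: sum(sizes[s] for s in segs) for e, segs in result.items()}
print(result)
print(loads)
```

Splitting by count alone would not take the file sizes into account, so executors holding the same number of files could still end up with very different amounts of cached data.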