Hi David,
a) Compressing the table status file is good, but we need to check the
decompression overhead and how much overall benefit we get.
b) I suggest we keep multiple files of 10MB each (or a configurable size) and
read them in a distributed way.
c) Once all the table status files are read, it is better to cache them at the driver
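A rough way to check points a) and b) is sketched below. It is only an illustration: the entry fields, the use of gzip, and the helper names `make_entries`, `measure`, and `split_by_size` are assumptions, not the actual tablestatus format or CarbonData code.

```python
import gzip
import json
import time


def make_entries(n):
    # Hypothetical segment entries; the field names are illustrative only.
    return [
        {
            "segmentId": str(i),
            "status": "Success",
            "loadStartTime": 1600000000000 + i,
            "loadEndTime": 1600000000500 + i,
            "dataSize": 1073741824,
            "indexSize": 4096,
        }
        for i in range(n)
    ]


def measure(entries):
    """Return (raw bytes, compressed bytes, decompression seconds)."""
    raw = json.dumps(entries).encode("utf-8")
    packed = gzip.compress(raw)
    start = time.perf_counter()
    unpacked = gzip.decompress(packed)
    elapsed = time.perf_counter() - start
    assert unpacked == raw  # the round trip must be lossless
    return len(raw), len(packed), elapsed


def split_by_size(entries, max_bytes):
    """Group entries into chunks whose serialized entries total at most max_bytes."""
    chunks, current, current_size = [], [], 0
    for entry in entries:
        size = len(json.dumps(entry).encode("utf-8"))
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    entries = make_entries(100_000)
    raw, packed, seconds = measure(entries)
    print(f"raw={raw} bytes, gzip={packed} bytes, decompress={seconds:.4f}s")
    print(f"chunks of <=10MB: {len(split_by_size(entries, 10 * 1024 * 1024))}")
```

Running it on entries that resemble the real table gives a first estimate of the size saving against the decompression cost, and of how many files a 10MB limit would produce.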
Hi David,
After discussing with you, it is a little clearer. Let me summarize in
a few lines.
*Goals*
1. Reduce the size of the status file (which reduces the overall size by some MBs).
2. Make the table status file less prone to failures and fast to read.
*For the above goals with your solution
Hi Akash
2. The new tablestatus stores only the latest status file name, not all
status files.
The status file will store all segment metadata (just like the old tablestatus).
3. If we have a delta file, there is no need to read the status file for each query. Only
reading delta file is enough if status file not chang
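The pointer idea in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions: a local filesystem where `os.replace` swaps files atomically, JSON as the file format, and hypothetical helper names `write_status` and `read_status`. The key name `statusFileName` is taken from the solution 4 format quoted elsewhere in this thread. Delta files are not covered.

```python
import json
import os
import tempfile
import uuid


def write_status(table_dir, segments):
    """Write a new status file, then repoint tablestatus at it."""
    status_name = "status-" + uuid.uuid4().hex
    with open(os.path.join(table_dir, status_name), "w") as f:
        json.dump(segments, f)
    # Write the small pointer file to a temp path first, then swap it in.
    # Assumption: os.replace is atomic here; HDFS or object stores may differ.
    fd, tmp_path = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"statusFileName": status_name}, f)
    os.replace(tmp_path, os.path.join(table_dir, "tablestatus"))
    return status_name


def read_status(table_dir):
    """Follow the pointer in tablestatus and load the full segment metadata."""
    with open(os.path.join(table_dir, "tablestatus")) as f:
        pointer = json.load(f)
    with open(os.path.join(table_dir, pointer["statusFileName"])) as f:
        return json.load(f)
```

Because a reader only ever sees the old pointer or the new one, it should never observe a half-written status file, which is the reliability benefit this design is after.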
Hi David,
Thanks for starting this discussion. I have some questions and inputs.
1. Solution 1 is just plain compression, where we get the benefit
of size,
but we will still face reliability issues in case of concurrency. So it can be
-1.
2. solution 2
writing, and reading to separate fil
Add solution 4 to separate the status file by segment status.
*solution 4:* Based on solution 2, support status.inprogress
1) new tablestatus file format
{
"statusFileName":"status-uuid1",
"inProgressStatusFileName": "status-uuid2.inprogess",
"updateStatusFileName":"updatest
Solution 2, +1
-
My English name is Sunday
[Background]
Currently, one segment metadata entry takes about 200 bytes in the
tablestatus file. If the table has 1 million segments and the mean segment
size is 1GB (meaning the table size is 1PB), the size of the tablestatus
file will reach 200MB.
Any reading/writing operation on this table
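The estimate in the background can be reproduced with a few lines. This is only a back-of-the-envelope sketch: the 200-byte figure comes from the message above, decimal units (1MB = 10^6 bytes) are assumed, and the function name `tablestatus_size_bytes` is hypothetical.

```python
BYTES_PER_ENTRY = 200  # approximate size of one segment entry, per the message


def tablestatus_size_bytes(num_segments, bytes_per_entry=BYTES_PER_ENTRY):
    """Estimated tablestatus size when every segment has one entry."""
    return num_segments * bytes_per_entry


if __name__ == "__main__":
    segments = 1_000_000
    size_mb = tablestatus_size_bytes(segments) / 10**6
    table_size_pb = segments * 1 / 10**6  # 1GB per segment, expressed in PB
    print(f"tablestatus ~ {size_mb:.0f} MB for a table of ~ {table_size_pb:.0f} PB")
```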