Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread akashrn5
Hi David,

Thanks for starting this discussion. I have some questions and inputs.

1. Solution 1 is just plain compression: we get the benefit of a smaller
file, but we will still face reliability issues under concurrency. So it
can be a -1.

2. Solution 2:
Writing and reading separate files is a pretty good idea, in order to
avoid many of the issues I mentioned in point 1.
You mentioned a new format. My understanding is that you will have a new
file which contains the list of all table status files, like
"statusFileName":"status-uuid1","status-uuid2",..., and you store the
"status-uuid1" files in the metadata folder.
Am I right?

If I am, then your plan is to read this new format first and then go to
the actual files, right?
When do you merge all these files, and what is the threshold for them? I
mean, on what basis do you decide to create a new status file?

3. Solution 3:
What is the obvious benefit of writing a delta file? Whenever I query, we
still need to read all the status entries and decide the valid segments,
right?

I don't think we get any benefit here; correct me if my understanding is
wrong.


4. Keeping the in-progress entries in a separate file is a better idea;
with this we can avoid some unnecessary validations in many operations.
But we need to decide which solution to combine this with; maybe once my
doubts are cleared, I can suggest some.

*Suggestion/Idea:* The table status file now carries many details, but in
most cases we do not read or require all of them. Can we have some
abstraction layer, or a status summary on top of the actual status, with
some of the above optimizations, so that we read less data (only what is
required), especially during query?
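A minimal sketch of what such an abstraction layer could look like (all names here are hypothetical, not existing CarbonData APIs): the query path asks a view only for the valid segment ids and never materializes the other status fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read-only view over the table status: callers that only
// need valid segment ids avoid touching every detail field of each entry.
public class TableStatusView {
    // A real entry carries many fields; this sketch keeps only two.
    static class SegmentEntry {
        final String id;
        final String status; // e.g. "Success", "Marked for Delete"
        SegmentEntry(String id, String status) { this.id = id; this.status = status; }
    }

    private final List<SegmentEntry> entries;

    TableStatusView(List<SegmentEntry> entries) { this.entries = entries; }

    // The only projection the query path needs.
    List<String> validSegmentIds() {
        List<String> ids = new ArrayList<>();
        for (SegmentEntry e : entries) {
            if ("Success".equals(e.status)) ids.add(e.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<SegmentEntry> all = new ArrayList<>();
        all.add(new SegmentEntry("0", "Success"));
        all.add(new SegmentEntry("1", "Marked for Delete"));
        all.add(new SegmentEntry("2", "Success"));
        System.out.println(new TableStatusView(all).validSegmentIds()); // [0, 2]
    }
}
```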

Regards,
Akash



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread David CaiQiang
Hi Akash,

2. The new tablestatus only stores the latest status file name, not all
status files.
   The status file will store all segment metadata (just like the old
tablestatus).

3. If we have a delta file, there is no need to read the status file for
each query; reading only the delta file is enough if the status file has
not changed.
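The read path described above could be sketched like this (names are hypothetical, and the file reads are stubbed out as maps): a reader remembers which status file it last fully parsed, and while the tablestatus pointer still names that file, applying the small delta is enough.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: full read only when the latest status file name changes,
// otherwise merge the delta on top of the cached segment map.
public class StatusReader {
    private String loadedStatusFile;                            // last fully-read status file
    private final Map<String, String> segments = new HashMap<>(); // segment id -> load status

    // fullRead / deltaRead stand in for actually reading the files on disk.
    void refresh(String latestStatusFile, Map<String, String> fullRead,
                 Map<String, String> deltaRead) {
        if (!latestStatusFile.equals(loadedStatusFile)) {
            segments.clear();
            segments.putAll(fullRead);   // status file changed: one full read
            loadedStatusFile = latestStatusFile;
        }
        segments.putAll(deltaRead);      // otherwise the delta alone is enough
    }

    Map<String, String> current() { return segments; }

    public static void main(String[] args) {
        StatusReader r = new StatusReader();
        Map<String, String> full = Map.of("seg-1", "Success");
        r.refresh("status-uuid1", full, Map.of());                   // first query: full read
        r.refresh("status-uuid1", full, Map.of("seg-2", "Success")); // later query: delta only
        System.out.println(r.current().size()); // 2
    }
}
```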




-
Best Regards
David Cai


[Discussion] Segment management enhance

2020-09-03 Thread David CaiQiang
[Background]
1. In some scenarios, two loading/compaction jobs may write data to the
same segment; this results in data confusion and impacts some features,
which will no longer work correctly.
2. Loading/compaction/update/delete operations need to clean stale data
before execution. Cleaning stale data is a high-risk operation: if it hits
an exception, it may delete valid data. But if the system doesn't clean
stale data, then in some scenarios the stale data gets added into a new
merged index file and can be queried.
3. Loading/compaction takes a long time, and in some scenarios the lock is
also held for a long time.

[Motivation & Goal]
We should avoid data confusion and the risk of cleaning stale data. Maybe
we can use a UUID as the segment id to avoid these troubles. We might even
be able to do loading/compaction without the segment/compaction lock.

[Modification]
1. segment id
  Use a UUID as the segment id instead of the current numeric value.
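To illustrate the point (a sketch, not the proposed implementation): with a sequential numeric id, two concurrent loads must coordinate on "next id", typically via a lock; with a random UUID, each job allocates independently.

```java
import java.util.UUID;

public class SegmentIdAllocation {
    public static void main(String[] args) {
        // Each loading job can pick its id with no coordination and no
        // realistic collision risk, so no lock is needed just for id
        // allocation.
        String jobA = UUID.randomUUID().toString();
        String jobB = UUID.randomUUID().toString();
        System.out.println(jobA.equals(jobB)); // false
    }
}
```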

2. segment layout
 a) move the segment data folder into the table folder
 b) move the carbonindexmerge file into the Metadata/segments folder

 tableFolder
   UUID1
    |_xxx.carbondata
    |_xxx.carbonindex
   UUID2
   Metadata
    |_segments
      |_UUID1_timestamp1.segment (segment index summary)
      |_UUID1_timestamp1.carbonindexmerge (segment index detail)
    |_schema
    |_tablestatus
   LockFiles

 partitionTableFolder
   partkey=value1
    |_xxx.carbondata
    |_xxx.carbonindex
   partkey=value2
   Metadata
    |_segments
      |_UUID1_timestamp1.segment (segment index summary)
      |_partkey=value1
        |_UUID1_timestamp1.carbonindexmerge (segment index detail)
      |_partkey=value2
    |_schema
    |_tablestatus
   LockFiles

3. segment management
Extract a segment interface that supports open/close, read/write, and
segment-level index pruning APIs.
The segment should support multiple data source types: file formats
(carbon, parquet, orc, ...), HBase, ...

4. clean stale data
It will become an optional operation.
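The extracted segment interface in point 3 might look roughly like this (a sketch; names and methods are illustrative, not an existing API), with a toy min/max implementation showing segment-level pruning:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical shape of the segment abstraction: lifecycle, read side,
// and a segment-level pruning hook usable across data source types.
interface Segment extends Closeable {
    List<String> listDataFiles() throws IOException;
    // Segment-level index pruning: can this segment contain the value?
    boolean mayContain(String column, long value);
}

// Minimal file-format-style implementation pruning on a min/max range.
public class MinMaxSegment implements Segment {
    private final long min, max;
    private final List<String> files;

    MinMaxSegment(long min, long max, List<String> files) {
        this.min = min; this.max = max; this.files = files;
    }
    public List<String> listDataFiles() { return files; }
    public boolean mayContain(String column, long value) {
        return value >= min && value <= max; // column ignored in this sketch
    }
    public void close() {}

    public static void main(String[] args) throws IOException {
        Segment seg = new MinMaxSegment(10, 20, Arrays.asList("part-0.carbondata"));
        System.out.println(seg.mayContain("id", 15)); // true: scan this segment
        System.out.println(seg.mayContain("id", 99)); // false: prune it
        seg.close();
    }
}
```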



-
Best Regards
David Cai


Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-03 Thread akashrn5
Hi David,

After discussing with you it is a little clearer; let me just summarize in
a few lines.

*Goals*
1. Reduce the size of the status file (which reduces the overall size by
some MBs).
2. Make the table status file less prone to failures, and faster to read.

*For the above goals, with your solutions*

1. Use a compressor to compress the table status file, so that reads
happen in memory and will be faster.
2. To make it less prone to failure, *+1 for solution 3*, which can be
combined with a little bit of solution 2 (the new table status format and
trace folder structure) and the delta file of solution 3, to separate
reads from writes so that reads will be faster and reliability-related
failures are avoided.
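For goal 1, even plain gzip helps because the tablestatus JSON is highly repetitive; a sketch (this is not a claim about CarbonData's actual compressor choice, and the JSON below is fabricated for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compress the status bytes on write; decompress fully in memory on read.
public class StatusCompression {
    static byte[] compress(byte[] raw) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(raw); }
        return bos.toByteArray();
    }
    static byte[] decompress(byte[] packed) throws Exception {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            return gz.readAllBytes();
        }
    }
    public static void main(String[] args) throws Exception {
        // Fake but representative table status content: many near-identical entries.
        StringBuilder status = new StringBuilder("[");
        for (int i = 0; i < 1000; i++) {
            status.append("{\"loadName\":\"").append(i)
                  .append("\",\"loadStatus\":\"Success\"},");
        }
        status.append("]");
        byte[] raw = status.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " -> " + packed.length);
    }
}
```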

Suggestion: One more point is to maintain a cache of the details after the
first read, instead of reading every time; only once the status-uuid file
is updated do we read again, and until then we read from the cache. This
will help with faster reads and help our queries.
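This suggestion could be sketched as a cache keyed by the latest status file name (hypothetical names; the loader stands in for the actual file read):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Queries reuse the cached snapshot until the latest status-uuid file
// name changes; only then is the loader invoked again.
public class StatusCache {
    static final class Snapshot {
        final String statusFile;
        final List<String> validSegments;
        Snapshot(String statusFile, List<String> validSegments) {
            this.statusFile = statusFile; this.validSegments = validSegments;
        }
    }

    interface Loader { List<String> load(String statusFile); }

    private final AtomicReference<Snapshot> cache = new AtomicReference<>();

    List<String> validSegments(String latestStatusFile, Loader loader) {
        Snapshot snap = cache.get();
        if (snap == null || !snap.statusFile.equals(latestStatusFile)) {
            snap = new Snapshot(latestStatusFile, loader.load(latestStatusFile));
            cache.set(snap); // refresh only when the status file name moved
        }
        return snap.validSegments;
    }

    public static void main(String[] args) {
        StatusCache c = new StatusCache();
        int[] loads = {0};
        Loader loader = f -> { loads[0]++; return List.of("seg-1"); };
        c.validSegments("status-uuid1", loader);
        c.validSegments("status-uuid1", loader); // served from cache
        System.out.println(loads[0]); // 1
    }
}
```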

I suggest you create a *jira and prepare a design document*; there we can
cover many impact areas and *avoid fixing small bugs after implementation.*





Re: [Discussion] Update feature enhancement

2020-09-03 Thread akashrn5
Hi David,

Please check the points below.

One advantage we get here is that when we insert as a new segment, it
takes the new insert flow without the converter step, which is faster.

But here are some points.

1. When you write new segments for each update, horizontal compaction in
the update case does not make sense, as it won't happen with this idea.
With this solution, horizontal compaction makes sense only in the delete
case.
2. You said we avoid reloading the indexes. We do avoid reloading the
indexes of the complete segment (the original segment on which the update
happened), but we still need to load the index of the newly added segment
which holds the updated rows, right?
3. When you keep adding multiple segments, we will have a larger number of
segments. If we do not compact, that is one problem, and the number of
entries and the size of the metadata (table status) increase so much,
which is another problem.

So how are you going to handle these cases?
Correct me if I'm wrong in my understanding.

Regards,
Akash





Re: [Discussion] Update feature enhancement

2020-09-03 Thread David CaiQiang
Hi Akash,

1. The update operation still has "delete delta" files; this stays the
same as before, and horizontal compaction is still needed.

2. Loading one carbonindexmerge file will be fast and does not impact
query performance. (A customer has faced this issue.)

3. For insert/loading, it can trigger compaction to avoid small segments.



-
Best Regards
David Cai


Re: [Discussion] Segment management enhance

2020-09-03 Thread Kunal Kapoor
Hi David,
I don't think changing the segment ID to a UUID is a good idea; it will
cause usability issues.

1. Seeing a UUID-named directory in the table structure would be weird and
not informative.
2. The show segments command would also have the same problem.

Thanks
Kunal Kapoor

On Fri, Sep 4, 2020 at 8:38 AM David CaiQiang  wrote:
