Re: [Design Discussion] Transaction manager, time travel and segment interface refactoring

2021-05-02 Thread Venkata Gollamudi
+1

A valuable feature for data consistency across tables, MVs and indexes.
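
As a rough illustration only (the type and method names below are
assumptions, not the design attached to CARBONDATA-4171), an optimistic,
per-table transaction interface could look like this:

// Hypothetical sketch: commit validates against the table version the
// transaction started from, so concurrent writers fail at commit time
// instead of blocking each other.
case class Transaction(id: Long, tableId: String, baseVersion: Long)
case class Snapshot(tableId: String, version: Long, segmentIds: Seq[String])

trait TransactionManager {
  def begin(tableId: String): Transaction                   // start a transaction on one table
  def commit(txn: Transaction): Boolean                      // false if another commit advanced the version first
  def rollback(txn: Transaction): Unit                       // discard uncommitted data
  def snapshotAt(tableId: String, version: Long): Snapshot   // committed view used for time travel reads
}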

On Wed, Apr 28, 2021 at 7:02 PM Ravindra Pesala 
wrote:

> +1
>
> Much needed feature and interface refactoring. Thanks for working on it.
>
> Regards,
> Ravindra.
>
> On Thu, 22 Apr 2021 at 2:36 PM, Ajantha Bhat 
> wrote:
>
> > Hi All,
> > In this thread, I am continuing the below discussion along with the
> > Transaction Manager and Time Travel feature design.
> >
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Taking-the-inputs-for-Segment-Interface-Refactoring-td101950.html
> >
> > The goals of this requirement are as follows:
> >
> >    1. Implement a "Transaction Manager" with optimistic concurrency to
> >       provide within-a-table transactions/versioning (the interfaces
> >       should also be flexible enough to support cross-table
> >       transactions).
> >    2. Support time travel in CarbonData.
> >    3. Decouple and clean up the segment interfaces (which should also
> >       help in supporting the segment concept for other open formats
> >       under the CarbonData metadata service).
> >
> >
> > The design document is attached in JIRA.
> > JIRA link: https://issues.apache.org/jira/browse/CARBONDATA-4171
> > GoogleDrive link:
> >
> >
> https://docs.google.com/document/d/1FsVsXjj5QCuFDrzrayN4Qo0LqWc0Kcijc_jL7pCzfXo/edit?usp=sharing
> >
> > Please have a look; suggestions are welcome.
> > I have mentioned some TODOs in the document and will update them in the
> > V2 version soon.
> > Implementation will be done by adding subtasks under the same JIRA.
> >
> > Thanks,
> > Ajantha
> >
> --
> Thanks & Regards,
> Ravi
>


Re: [VOTE] Apache CarbonData 2.1.1(RC1) release

2021-03-18 Thread Venkata Gollamudi
-1, as major fixes are pending.

Regards,
Ramana

On Thu, 18 Mar, 2021, 20:37 Kunal Kapoor,  wrote:

> -1, let's wait for the pending JIRAs to get resolved.
>
> On Thu, 18 Mar 2021, 7:30 pm Kumar Vishal, 
> wrote:
>
> > -1
> > -Regards
> > Kumar Vishal
> >
> > On Thu, 18 Mar 2021 at 6:35 PM, David CaiQiang 
> > wrote:
> >
> > > -1, please first fix the pending defects and merge the completed PRs.
> > >
> > >
> > >
> > > -
> > > Best Regards
> > > David Cai
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>


Re: [Discussion] Segment management enhance

2020-09-13 Thread Venkata Gollamudi
Hi David,

In the current design of the data load operation, isolation and consistency
are achieved with the following steps:
1. The data load operation needs to check whether it can execute
concurrently with any other operation such as update/delete (most
operations can be allowed, even multiple parallel data loads).
2. Once the operation is allowed, a lock is acquired on the tablestatus
file to create a segment, which the operation then continues to load.
3. During the load operation, a timestamp (long) is used as the transaction
id for the complete operation. This transaction id uniquely identifies the
operation.
4. When the data load is complete, the operation is committed to the table
status by taking the lock, reading and updating the file, and releasing the
lock.
5. When the data load fails, the operation is not committed and the data
can be cleaned up in the failure flow; in case of abrupt process failures,
the same data is cleaned up later.
6. Temporary data tagged with the transaction id is never read and should
never be discovered by any reader (e.g. it should never be found via file
listing and then used). All valid data should be read only through
committed transaction references.

This method of isolating and committing an operation applies to all
operations such as data load, insert, update and delete.
The segment ID does not have much significance in the above flow; a
sequence number is used currently just for convenience.
Segment file locks to ensure the atomic commit are a must and cannot be
avoided, even if we support complete optimistic concurrency control.

So replacing the segment ID with a UUID will not solve the concurrency,
data correctness, cleanup or stale data reading issues, and it also cannot
replace locking. There might be some other problem that you need to dig
into. There might be code that does not follow the steps mentioned above
(for example, discovering files via file listing without filtering on the
required transaction id), which might be causing the issues you have
mentioned.
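
A minimal, self-contained sketch of steps 4 and 6 (an in-memory stand-in
with hypothetical names, not the actual CarbonData tablestatus and locking
code) could look like this:

import java.util.concurrent.locks.ReentrantLock

// Simplified model of a tablestatus entry; the real LoadMetadataDetails
// carries many more fields.
case class SegmentEntry(transactionId: Long, status: String)

object TableStatusCommit {
  private val tableStatusLock = new ReentrantLock()    // stands in for the file-based lock
  private var tableStatus: Vector[SegmentEntry] = Vector.empty

  // Step 4: take the lock, read the current entries, append the committed
  // segment entry, write back, release the lock.
  def commit(txnId: Long): Unit = {
    tableStatusLock.lock()
    try {
      tableStatus = tableStatus :+ SegmentEntry(txnId, "SUCCESS")
    } finally {
      tableStatusLock.unlock()
    }
  }

  // Step 6: readers see only entries committed under the lock; temporary
  // data written with an uncommitted transaction id stays invisible to them.
  def committedTransactions(): Seq[Long] = {
    tableStatusLock.lock()
    try tableStatus.filter(_.status == "SUCCESS").map(_.transactionId)
    finally tableStatusLock.unlock()
  }
}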

Regards,
Ramana

On Sat, Sep 5, 2020 at 8:31 AM Ajantha Bhat  wrote:

> Hi David,
>
> a) Recently we tested huge concurrent loads and compactions but never
> faced the issue of two loads using the same segment id (because of the
> table status lock in recordNewLoadMetadata), so I am not sure whether we
> really need to switch to UUIDs.
>
> b) And about the other segment interfaces, we have to refactor them; it is
> long pending. Refactor such that we can support TIME TRAVEL. I have to
> analyze this further. If somebody has already done some analysis, they can
> use this thread for the segment interface refactoring discussion.
>
> Thanks,
> Ajantha
>
> On Fri, Sep 4, 2020 at 1:11 PM Kunal Kapoor 
> wrote:
>
> > Hi David,
> > Then we had better keep a mapping from the segment UUID to a virtual
> > segment number in the table status file as well.
> > Any API through which the user can get the segment details should
> > return the virtual segment id instead of the UUID.
> >
> > On Fri, Sep 4, 2020 at 12:59 PM David CaiQiang 
> > wrote:
> >
> > > Hi Kunal,
> > >
> > >    1. The user uses the SQL API or other interfaces. This UUID is a
> > > transaction id, and we already store the timestamp and other
> > > information in the segment metadata.
> > >    This transaction id can be used in the loading/compaction/update
> > > operations. We can append this id to the log if needed.
> > >    A Git commit id is also a long unique identifier, so we can
> > > consider a similar approach. What information do you want to get from
> > > the folder name?
> > >
> > >    2. It is easy to fix the show segments command's issue. Maybe we
> > > can sort segments by timestamp and UUID to generate the index id. The
> > > user can continue to use it in other commands.
> > >
> > >
> > >
> > > -
> > > Best Regards
> > > David Cai
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>


Re: [Discussion] Update feature enhancement

2020-09-13 Thread Venkata Gollamudi
Hi David,
+1

Initially, when the segment concept was introduced, a segment was viewed
as a folder that is added incrementally over time, so that data retention
use cases like "delete segments before a given date" could be supported.
In that case, if updated records were written into a new segment, old
records would become new records and the retention model would not work on
that data. So updated records were written to the same segment folder.

But later, as the partition concept was introduced, partitioning became a
cleaner way to implement retention, and even using a delete-by-time column
is a better method.
So inserting the updated records into a new segment makes sense.

The only disadvantage could be in later supporting the single-column data
update/replace feature which Likun mentioned previously.

So, to generalize, the update feature can support inserting the updated
records into a new segment. The logic to reload indexes when segments are
updated can remain; however, when no data is inserted into the old
segments, reloading their indexes needs to be avoided.

An increasing number of segments need not be a blocker for going ahead, as
the growth in the number of segments is a problem anyway and needs to be
solved using compaction, either horizontal or vertical. Also, optimization
of segment file storage, either file based or DB based (embedded or
external), for very large deployments needs to be solved independently.
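
A tiny sketch of the index-reload decision described above (hypothetical
names; the actual flow would sit around UpdateTableModel and the index
loaders):

object IndexReloadSketch {
  // Hypothetical segment model: an update writes its rows into a brand new
  // segment and only records delete deltas against the old segments.
  case class Segment(id: String, hasNewData: Boolean)

  // Only segments that actually received inserted data need their indexes
  // reloaded; old segments touched only by delete deltas are skipped.
  def segmentsToReindex(touched: Seq[Segment]): Seq[String] =
    touched.filter(_.hasNewData).map(_.id)

  // Example: an update that created segment "42" and only marked deletions
  // in segments "3" and "7" triggers an index reload for "42" alone.
  val toReload: Seq[String] = segmentsToReindex(Seq(
    Segment("42", hasNewData = true),
    Segment("3", hasNewData = false),
    Segment("7", hasNewData = false)))
}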

Regards,
Ramana

On Sat, Sep 5, 2020 at 7:58 AM Ajantha Bhat  wrote:

> Hi David. Thanks for proposing this.
>
> *+1 from my side.*
>
> I have seen users with a 200K-segment table stored in the cloud.
> It will be really slow to reload all the segments where an update
> happened for indexes like SI, min-max and MV.
>
> So, it is good to write the updated data as a new segment
> and just load the new segment's indexes (try to reuse the flow
> UpdateTableModel.loadAsNewSegment = true).
>
> The user can compact the segments to avoid the many new segments created
> by updates, and I guess we can also move the compacted segments to the
> table status history to avoid more entries in the table status.
>
> Thanks,
> Ajantha
>
>
>
> On Fri, Sep 4, 2020 at 1:48 PM David CaiQiang 
> wrote:
>
> > Hi Akash,
> >
> >     3. The update operation contains an insert operation. The update
> > operation will handle this issue in the same way the insert operation
> > does.
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Discussion]Query Regarding Task launch mechanism for data load operations

2020-08-17 Thread Venkata Gollamudi
Hi Varun,

Yes, previously most cases were tuned for LOCAL_SORT, where merging
happens automatically. But the data loading flow can certainly be improved
to decide this based on the data size rather than a fixed configuration.
However, the old behaviour might also be required if the user has to
control the maximum number of partitions when the data size is too big.
This configuration was introduced because data loading cores are not
transparent to Spark, mainly in the case of LOCAL_SORT.

The same applies to the insert-into scenario as well; as you said,
coalescing will reduce the load performance.
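
As an illustration of the size-based approach (a sketch only; the target
size per task and the names here are assumptions, not existing CarbonData
configuration):

object LoadPartitionPlanner {
  // Derive the number of load tasks from the total input size instead of a
  // fixed configuration, while still honouring an optional user-supplied cap.
  // e.g. partitionCount(10L * 1024 * 1024 * 1024) == 40 with the assumed default,
  //      partitionCount(10L * 1024 * 1024 * 1024, maxPartitions = Some(16)) == 16
  def partitionCount(totalInputBytes: Long,
                     targetBytesPerTask: Long = 256L * 1024 * 1024,  // assumed 256 MB per task
                     maxPartitions: Option[Int] = None): Int = {
    val bySize = math.max(1, math.ceil(totalInputBytes.toDouble / targetBytesPerTask).toInt)
    maxPartitions.fold(bySize)(cap => math.min(bySize, cap))
  }
}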

Regards,
Ramana

On Fri, Aug 14, 2020 at 3:25 PM David CaiQiang  wrote:

> This mechanism will work fine for LOCAL_SORT loading of big data on a
> small cluster with big executors.
>
> If these conditions are not met, we had better consider a new solution
> that adapts to the generic scenario.
>
> I suggest refactoring NO_SORT; maybe we can check and improve the
> GLOBAL_SORT solution.
>
> The solution should support both NO_SORT and GLOBAL_SORT, and
> automatically determine the number of partitions to avoid the small-file
> issue.
>
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Propose feature change in CarbonData 2.0

2019-12-03 Thread Venkata Gollamudi
1. Global dictionary: its value might be visible on small clusters, but
the local dictionary is definitely the scalable solution. It also depends
on HDFS features like appending to the previous dictionary file, and there
is no method to remove stale data. Also, we don't suggest the global
dictionary for high-cardinality dimensions, and for low-cardinality
dimensions the cost of duplicating the dictionary across files is not high.
So I think we had better deprecate this, considering value vs complexity
and the use cases it solves. Vote: +1
2. Bucketing: I think it is an important feature which will considerably
improve join performance, so I feel it should not be removed. Vote: -1
3, 4: Vote: +1
5.1 Inverted index: the current inverted index might not be space
efficient, and we don't have a method to detect when the inverted index
needs to be built and when it is not required. This area has to be further
explored and refactored to optimise various lookups; Druid, for example,
has an inverted index. Vote: -1
5.2 Old pre-aggregate and time series datamap implementation: Vote: +1
6. Lucene datamap: this needs to be improved rather than deprecated.
Vote: -1
7. STORED BY: Vote: +1

Refactoring:
1. Good to do, but we need to consider the effort. Vote: 0
2. The column order need not follow the schema order, as columns and their
order can logically change from file to file. Vote: -1
3, 4 are required. Vote: +1

Regards,
Ramana

On Tue, Dec 3, 2019 at 9:53 PM 恩爸 <441586...@qq.com> wrote:

> Hi:
> Thank you for proposing. My votes are below:
>
>  1, 3, 4, 5.1, 5.2, 7: +1
>  2: 0
>  6: -1, but it should be optimized.
>
> And there is some internal refactoring we can do:
>  1. Unify dimension and measure: +1.
>  2. Keep the column order the same as the schema order: 0.
>  3. Spark integration refactoring based on the Spark extension
> interface: +1.
>  4. Store optimization PR2729: +1.
>
> In my opinion, we can also do some other refactoring:
>  1. There are many places using String[] to store data during data
> loading; it can be replaced with InternalRow objects to save memory.
>  2. Remove the 'streaming' property and eliminate the difference between
> streaming and batch tables, so users can insert data into a table both in
> batch mode and in streaming mode.
>
>
>
>
>
>
> ------ Original ------
> From: "ravipesala [via Apache CarbonData Dev Mailing List archive]"
> <ml+s1130556n87707...@n5.nabble.com>
> Date: Tue, Dec 3, 2019 06:07 PM
> To: "恩爸" <441586...@qq.com>
>
> Subject: Re: Propose feature change in CarbonData 2.0
>
>
>
> Hi,
>
> Thank you for proposing. Please check my comments below.
>
> 1. Global dictionary: it was one of the prime features when the project
> was initially released to Apache. Even though Spark has introduced
> Tungsten, it still has benefits for compression, filtering and
> aggregation queries. After the introduction of the local dictionary this
> got partially solved, for compression and filtering (though it cannot get
> the same performance as a global dictionary). The major drawback here is
> the data load performance. In some cases, like a MOLAP cube (build once),
> it might still be useful. Vote: 0
>
> 2. Bucket: it is a very useful feature if we use it. If we are planning
> to remove it, we had better find an alternative to this feature first.
> Since this feature is available in Spark+Parquet, it would be helpful for
> users who want to migrate to Carbon. As far as I know, this feature was
> never productized and is still experimental. So if we are planning to
> keep it, we had better productize it. Vote: -1
>
> 3. Carbon custom partition: Vote: +1
>
> 4. Batch sort: Vote: +1
>
> 5.1 Page-level inverted index: storing these indexes makes the store size
> bigger. It is really helpful in the case of multiple IN filters, but that
> gets overshadowed by the IO and CPU cost due to its size. Vote: +1
>
> 5.2 Old pre-aggregate and time series datamap implementation: Vote: +1
> (remove pre-aggregate)
>
> 6. Lucene DataMap: it is a helpful feature, but I guess it had
> performance issues due to bad integration. It would be better if we can
> fix these issues instead of removing it. Moreover, it is a separate
> module, so there would not be any code maintenance problem. Vote: -1
>
> 7. STORED BY: Vote: +1
>
> Refactoring points:
> 1 & 2: I think at this point of time it would be a massive refactoring
> with very little outcome, so better not to do it. Vote: -1
>
> 3 & 4: Vote: +1
>
>
>
> Regards,
> Ravindra.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>

Re: [DISCUSSION] Implement file-level Min/Max index for streaming segment

2018-08-29 Thread Venkata Gollamudi
+1

Regards,
Venkata Ramana

On Mon, Aug 27, 2018 at 2:10 PM manish gupta 
wrote:

> +1
>
> Regards
> Manish Gupta
>
> On Mon, 27 Aug 2018 at 9:47 AM, Kumar Vishal 
> wrote:
>
> > +1
> > Regards
> > Kumar Vishal
> >
> > On Mon, 27 Aug 2018 at 07:15, xm_zzc <441586...@qq.com> wrote:
> >
> > > +1.
> > >
> > >
> > >
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>


Re: [VOTE] Apache CarbonData 1.4.1(RC2) release

2018-08-14 Thread Venkata Gollamudi
+1

Regards,
Venkata Ramana Gollamudi

On Tue, Aug 14, 2018, 11:05 Kunal Kapoor  wrote:

> +1
>
> Regards
> Kunal Kapoor
>
> On Fri, Aug 10, 2018, 8:14 AM Ravindra Pesala 
> wrote:
>
> > Hi
> >
> >
> > I submit the Apache CarbonData 1.4.1 (RC2) for your vote.
> >
> >
> > 1.Release Notes:
> >
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12343148
> >
> > Some key features and improvements in this release:
> >
> >    1. Supported local dictionary to improve IO and query performance.
> >    2. Improved and stabilized the Bloom filter datamap.
> >    3. Supported left outer join in the MV datamap (alpha feature).
> >    4. Supported driver min/max caching for specified columns and
> >    segregated the block and blocklet cache.
> >    5. Supported a flat folder structure in Carbon to maintain the same
> >    folder structure as Hive.
> >    6. Supported S3 read and write of CarbonData files.
> >    7. Supported projection push-down for the struct data type.
> >    8. Improved complex data type compression and performance through
> >    adaptive encoding.
> >    9. Many bug fixes; stabilized CarbonData.
> >
> >
> >  2. The tag to be voted upon : apache-carbondata-1.4.1.rc2(commit:
> > a17db2439aa51f6db7da293215f9732ffb200bd9)
> >
> >
> >
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.4.1-rc2
> >
> >
> > 3. The artifacts to be voted on are located here:
> >
> > https://dist.apache.org/repos/dist/dev/carbondata/1.4.1-rc2/
> >
> >
> > 4. A staged Maven repository is available for review at:
> >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1032
> >
> >
> > 5. Release artifacts are signed with the following key:
> >
> > https://people.apache.org/keys/committer/ravipesala.asc
> >
> >
> > Please vote on releasing this package as Apache CarbonData 1.4.1,  The
> vote
> >
> > will be open for the next 72 hours and passes if a majority of
> >
> > at least three +1 PMC votes are cast.
> >
> >
> > [ ] +1 Release this package as Apache CarbonData 1.4.1
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Ravindra.
> >
>


Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-04-25 Thread Venkata Gollamudi
I agree with Liang. We had better align with the CREATE TABLE terminology
and properties. The user can easily get the details of the properties from
the CREATE TABLE DDL documentation.

Regards,
Ramana

On Thu, Apr 26, 2018 at 8:17 AM, Liang Chen  wrote:

> Hi
>
> Attaching my proposed "desc_table_info":
> desc_table_info.txt
> <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t1/desc_table_info.txt>
>
> Regards
> Liang
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 1.2.0(RC3) release

2017-09-25 Thread Venkata Gollamudi
+1

Regards,
Venkata Ramana G

On Mon, Sep 25, 2017 at 7:32 AM, David CaiQiang 
wrote:

>
>  +1 Release this package as Apache CarbonData 1.2.0
>
> 1. Release
>   There are important new features and the integration of a new platform.
>
> 2. The tag
>  " mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format package" passed
>  "mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format install" passed
>
> 3. The artifacts
>   both md5sum and sha512 are correct
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 1.1.1(RC1) release

2017-07-09 Thread Venkata Gollamudi
+1

Regards,
Venkata Ramana G

On Mon, Jul 10, 2017 at 7:25 AM, Erlu Chen  wrote:

> +1
>
> Regards.
> Chenerlu.
>
>
>
> --
> View this message in context: http://apache-carbondata-dev-
> mailing-list-archive.1130556.n5.nabble.com/VOTE-Apache-
> CarbonData-1-1-1-RC1-release-tp17531p17715.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.
>


Re: [Discussion] CarbonOutputFormat Implementation

2017-07-04 Thread Venkata Gollamudi
+1
The OutputFormat should be based on a single pass and use job
configurations similar to CarbonInputFormat.
Please share the initial design and a code skeleton for review before
proceeding with the implementation.
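
A very rough, hypothetical skeleton (Hadoop mapreduce API; the key/value
types and internals are placeholders, not the proposed design) to anchor
the review:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, OutputFormat, RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

class SketchCarbonOutputFormat extends OutputFormat[Void, Array[AnyRef]] {

  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[Void, Array[AnyRef]] =
    new RecordWriter[Void, Array[AnyRef]] {
      // Single pass: each incoming row is encoded and written directly,
      // without a second read of the input.
      override def write(key: Void, row: Array[AnyRef]): Unit = ()
      // Flush the carbondata and index files for this task on close.
      override def close(context: TaskAttemptContext): Unit = ()
    }

  // Validate the table path, schema and load options from the job
  // configuration, mirroring how CarbonInputFormat is configured.
  override def checkOutputSpecs(context: JobContext): Unit = ()

  override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter =
    new FileOutputCommitter(
      new Path(context.getConfiguration.get("mapreduce.output.fileoutputformat.outputdir")),
      context)
}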

On Tue, Jul 4, 2017 at 4:30 PM, Kumar Vishal 
wrote:

> +1
> It's a long pending task.
> -Regards
> Kumar Vishal
>
> Sent from my iPhone
>
> > On 04-Jul-2017, at 16:26, Erlu Chen  wrote:
> >
> > Thanks very much.
> >
> > After you have raised a PR, we can start review.
> >
> >
> > Regards.
> > Chenerlu.
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-dev-
> mailing-list-archive.1130556.n5.nabble.com/Discussion-CarbonOutputFormat-
> Implementation-tp17113p17239.html
> > Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.
>


Re: [DISCUSSION] Propose to move notification of "jira Created" to issues@mailing list from dev

2017-07-04 Thread Venkata Gollamudi
+1
It is better to move these notifications.

Regards,
Venkata Ramana G

On Tue, Jul 4, 2017 at 4:40 PM, Kumar Vishal 
wrote:

> +1
> Better to move to issue mailing list
>
> Regards
> Kumar Vishal
>
> Sent from my iPhone
>
> > On 03-Jul-2017, at 15:02, Ravindra Pesala  wrote:
> >
> > +1
> > Yes, we should move to issues mailing list.
> >
> > Regards,
> > Ravindra.
> >
> >> On 30 June 2017 at 07:35, Erlu Chen  wrote:
> >>
> >> Agreed, we can separate discussion from the "jira Created"
> >> notifications.
> >>
> >> It will be better for developers to filter out some unnecessary
> >> messages and focus on discussion.
> >>
> >> Regards.
> >> Chenerlu.
> >>
> >>
> >>
> >> --
> >> View this message in context: http://apache-carbondata-dev-
> >> mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-
> >> Propose-to-move-notification-of-jira-Created-to-issues-
> >> mailing-list-from-dev-tp16835p16842.html
> >> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> >> archive at Nabble.com.
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>


Re: [Discussion] Add HEADER option to load data sql

2017-07-04 Thread Venkata Gollamudi
I agree that the user need not provide column names if no header is
present in the file and the column order is the same as the schema order.

However, a single header=true option will not cover all the cases: header
present, header not present, override header, etc. I have added an
intermediate approach below covering all the cases, while also taking care
of the current default values and backward compatibility.

CSV file without a header:
1. FILEHEADER="col1,col2,col3", IGNORE_FIRST_LINE="FALSE" (default):
   use the given header.
2. FILEHEADER="", IGNORE_FIRST_LINE="FALSE" (default):
   use the schema order.

CSV file with a header:
1. No options given, IGNORE_FIRST_LINE="FALSE" (default):
   expects the first line of the CSV to be the header.
2. FILEHEADER="col1,col2,col3", IGNORE_FIRST_LINE="TRUE":
   uses the explicitly given header, ignoring the header in the file.
3. FILEHEADER="", IGNORE_FIRST_LINE="TRUE":
   uses the schema order, ignoring the header in the file.
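
For illustration, with the options above a load could look like this from
Spark SQL (the table name and file paths are hypothetical, and
IGNORE_FIRST_LINE is the option proposed here, not an already existing
one):

// Assumes a SparkSession named `spark` with the CarbonData integration
// already configured.

// CSV file with a header line that should be overridden by an explicit list:
spark.sql(
  """LOAD DATA INPATH '/tmp/sales_with_header.csv' INTO TABLE sales
    |OPTIONS('FILEHEADER'='col1,col2,col3', 'IGNORE_FIRST_LINE'='TRUE')""".stripMargin)

// CSV file without a header, columns already in schema order:
spark.sql(
  """LOAD DATA INPATH '/tmp/sales_no_header.csv' INTO TABLE sales
    |OPTIONS('FILEHEADER'='')""".stripMargin)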

Regards,
Ramana

On Tue, Jul 4, 2017 at 6:51 AM, wangbin  wrote:

> I propose loading the CSV files with an explicitly given table schema,
> while using an option to ignore the CSV header if one is present.
>
>
>
> --
> View this message in context: http://apache-carbondata-dev-
> mailing-list-archive.1130556.n5.nabble.com/Discussion-Add-
> HEADER-option-to-load-data-sql-tp17080p17179.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.
>