Re: [DISCUSSION] Support write Flink streaming data to Carbon

2019-11-05 Thread Raghunandan S
+1

On Thu, 31 Oct, 2019, 9:13 AM Jacky Li,  wrote:

> +1 for this feature. In my opinion, flink-carbon is a good fit for near
> real-time analytics.
>
> One doubt is that in your design, the Collect Segment command and
> Compaction command are two separate commands, right?
>
> The Collect Segment command modifies the metadata files (tablestatus file
> and segment file), while the Compaction command merges small data files and
> builds indexes.
>
> Is my understanding right?
>
> Regards,
> Jacky
>
> On 2019/10/29 06:59:51, "爱在西元前" <371684...@qq.com> wrote:
> > The write process is:
> >
> > 1. Write Flink streaming data to the local file system of the Flink task
> > node using Flink StreamingFileSink and the Carbon SDK;
> >
> > 2. Copy the local carbon data files to the carbon data store system, such
> > as HDFS or S3;
> >
> > 3. Generate and write the segment file to ${tablePath}/load_details;
> >
> > 4. Run the "alter table ${tableName} collect segments" command on the
> > server to compact the segment files in ${tablePath}/load_details, move the
> > compacted segment file to ${tablePath}/Metadata/Segments/, and finally
> > update the table status file.
> >
> > I have raised a JIRA: https://issues.apache.org/jira/browse/CARBONDATA-3557
> >
> > Your opinions and suggestions are welcome.
>
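
For readers following the proposal, a minimal sketch of the server-side step
from a Carbon-enabled Spark session. The `collect segments` syntax is the one
proposed in this thread (not yet released), and the table name is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val spark = SparkSession.builder().master("local")
  .getOrCreateCarbonSession("/tmp/carbon.store")

// Compacts the segment files under ${tablePath}/load_details, moves the
// compacted segment file to ${tablePath}/Metadata/Segments/ and finally
// updates the table status file, as described above.
spark.sql("alter table sink_table collect segments")
```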


Re: [VOTE] Apache CarbonData 1.6.1(RC1) release

2019-10-24 Thread Raghunandan S
Hi all


The PMC vote has passed for the Apache CarbonData 1.6.1 release; the
result is as below:


+1(binding): 5(Kumar Vishal, Ravindra, Liang Chen)


+1(non-binding) : 2


Thanks all for your vote.


Regards

On Mon, Oct 14, 2019 at 4:56 PM Liang Chen  wrote:

> +1
>
> Please update the release notes accordingly.
>
> Regards
> Liang
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[ANNOUNCE] Apache CarbonData 1.6.1 release

2019-10-24 Thread Raghunandan S
Hi All,

Apache CarbonData community is pleased to announce the release of the
Version 1.6.1 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments; in one of the largest
deployments, it supports queries on a single table with 3 PB of data (more
than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.6.1/ and provide
feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.6.1?

CarbonData 1.6.1 aimed to move closer to unified analytics and to improve
stability. In this version of CarbonData, around 40 JIRA tickets related to
improvements and bugs have been resolved. The following is a summary.


Index Server performance improvements for Full Scan and TPCH Queries
Carbon currently prunes and caches all block/blocklet datamap index
information in the driver. If the cache size becomes huge (70-80% of the
driver memory), there can be excessive GC in the driver, which can slow
down queries, and the driver may even run out of memory. Moving the indexes
out to a separate JDBCServer reduced the overhead on the primary JDBCServer
but introduced a delay in fetching the bulk list of pruned blocks from the
Index Server. This has been improved in this release, and performance is
now the same as running without the Index Server.

Behaviour Change

None


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12345993


Sub-task

   - [CARBONDATA-3454] - Optimize the performance of select count(*) for index server
   - [CARBONDATA-3462] - Add usage and deployment document for index server

Bug

   - [CARBONDATA-3452] - Select query failure when substring on dictionary column with join
   - [CARBONDATA-3474] - Fix validate mvQuery having filter expression and correct error message
   - [CARBONDATA-3476] - Read time and scan time stats shown wrong in executor log for filter query
   - [CARBONDATA-3477] - Throw out exception when using SQL: 'update table select\n...'
   - [CARBONDATA-3478] - Fix ArrayIndexOutOfBoundsException issue on compaction after alter rename operation
   - [CARBONDATA-3480] - Remove Modified MDT and make relation refresh only when schema file is modified
   - [CARBONDATA-3481] - Multi-thread pruning fails when datamaps count is just near numOfThreadsForPruning
   - [CARBONDATA-3482] - Null pointer exception when concurrent select queries are executed from different beeline terminals
   - [CARBONDATA-3483] - Cannot run horizontal compaction when executing update SQL
   - [CARBONDATA-3485] - Data loading fails from S3 to HDFS table having ~2K carbon files
   - [CARBONDATA-3486] - Serialization/deserialization issue with Datatype
   - [CARBONDATA-3487] - Wrong input metrics (size/record) displayed in Spark UI during insert into
   - [CARBONDATA-3490] - Concurrent data load failure with carbondata FileNotFound exception
   - [CARBONDATA-3493] - Carbon query fails when enable.query.statistics is true in a specific scenario
   - [CARBONDATA-3494] - NullPointerException in case of drop table
   - [CARBONDATA-3495] - Insert into complex data type of Binary fails with Carbon & SparkFileFormat
   - [CARBONDATA-3499] - Fix insert failure with customFileProvider
   - [CARBONDATA-3502] - Select query fails with UDF having Match expression inside IN expression

[VOTE] Apache CarbonData 1.6.1(RC1) release

2019-10-07 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.6.1 (RC1) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12345993


Some key features and improvements in this release:


   1. Supported adding segment to CarbonData table




[Behaviour Changes]

   1. None


 2. The tag to be voted upon : apache-carbondata-1.6.1-rc1 (commit:

cabde6252d4a527fbfeb7f17627c6dce3e357f84)

https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.1-rc1



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.6.1-rc1/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1057/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.6.1. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.6.1


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


[ANNOUNCE] Apache CarbonData 1.6.0 release

2019-08-29 Thread Raghunandan S
Hi All,

Apache CarbonData community is pleased to announce the release of the
Version 1.6.0 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments; in one of the largest
deployments, it supports queries on a single table with 3 PB of data (more
than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.6.0/ and provide
feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.6.0?

CarbonData 1.6.0 aimed to move closer to unified analytics. We have added
an Index Server to distribute the index cache. We have also supported
incremental loading on MV datamaps to improve datamap loading time. We now
support reading CarbonData tables from Hive, and we also support the Arrow
format from the SDK.

In this version of CarbonData, around 75 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.
Index Server to distribute the index cache and parallelise the index pruning


Carbon currently prunes and caches all block/blocklet datamap index
information in the driver. If the cache size becomes huge (70-80% of the
driver memory), there can be excessive GC in the driver, which can slow
down queries, and the driver may even run out of memory. If multiple JDBC
drivers want to read from the same tables, then every JDBC server needs to
maintain its own copy of the cache. To solve these problems we have
introduced a distributed Index Cache Server: a separate, scalable server
that stores only index information, to which all drivers can connect to
prune data using the cached index information.
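
A hedged sketch of pointing a deployment at the Index Server; the property
names (carbon.enable.index.server, carbon.index.server.ip,
carbon.index.server.port) are assumed from the Index Server documentation and
should be verified against your version:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Property names assumed from the Index Server documentation.
val props = CarbonProperties.getInstance()
props.addProperty("carbon.enable.index.server", "true") // route pruning to the Index Server
props.addProperty("carbon.index.server.ip", "10.0.0.5") // illustrative host
props.addProperty("carbon.index.server.port", "9998")   // illustrative port
```
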
Incremental data loading on MV datamaps

Previously, MV datamaps could only be fully rebuilt on any new data load on
the parent table. We now support incremental loading on MV datamaps, so any
new load on the parent table triggers a load on the MV datamap only for the
incrementally added data.
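
A sketch of the MV datamap DDL this applies to, assuming the 1.6.x
`CREATE DATAMAP ... USING 'mv'` syntax; table, datamap and column names are
illustrative:

```scala
// Assumes a Carbon-enabled SparkSession `spark`.
spark.sql("""
  CREATE DATAMAP sales_agg
  USING 'mv'
  AS SELECT country, SUM(amount) FROM sales GROUP BY country
""")
// With incremental loading, each new load on `sales` triggers a load on
// `sales_agg` covering only the newly added data.
spark.sql("INSERT INTO sales SELECT * FROM staging_sales")
```
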
Supported Arrow format from Carbon SDK

The SDK reader now supports reading CarbonData files and filling them into
Apache Arrow vectors. This helps avoid unnecessary intermediate
serialisations when accessing data from other execution engines or
languages.
Supported read from Hive

CarbonData files can be read from Hive. This helps users easily migrate to
the CarbonData format on existing Hive deployments that use other formats.
Behaviour Change

None


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344965


Sub-task

   - [CARBONDATA-3306] - Implement a DistributableIndexPruneRDD and IndexPruneFileFormat
   - [CARBONDATA-3337] - Implement a Hadoop RPC framework for communication
   - [CARBONDATA-3338] - Incremental data load support to datamap on single table
   - [CARBONDATA-3349] - Add is_sorted and sort_columns information into show segments
   - [CARBONDATA-3350] - Enhance custom compaction to support re-sorting a single segment
   - [CARBONDATA-3357] - Support TableProperties from single parent table and restrict alter/delete/partition on mv
   - [CARBONDATA-3378] - Display original query in Indexserver Job
   - [CARBONDATA-3381] - Large response size Exception is thrown from index server
   - [CARBONDATA-3387] - Support Partition with MV datamap & Show DataMap Status
   - [CARBONDATA-3392] - Make use of LRU mandatory when using IndexServer
   - [CARBONDATA-3398] - Implement Show Cache for IndexServer and MV
   - [CARBONDATA-3399] - Implement Executor ID based task distribution for Index Server
   - [CARBONDATA-3402] - Block complex data types and validate dmproperties in mv
   - [CARBONDATA-3408] - CarbonSession partition support binary data type
   - [CARBONDATA-3409]

Re: [VOTE] Apache CarbonData 1.6.0(RC3) release

2019-08-21 Thread Raghunandan S
Hi all


The PMC vote has passed for the Apache CarbonData 1.6.0 release; the
result is as below:


+1(binding): 5(Jacky, Kumar Vishal, Ravindra, David CaiQiang, Liang Chen)


+1(non-binding) : 2


Thanks all for your vote.


Regards

On Mon, Aug 19, 2019 at 9:52 AM Liang Chen  wrote:

> +1 from my side
>
> regards
> Liang
>
> On Tuesday, 13 August 2019, Raghunandan S wrote:
>
> > Hi
> >
> >
> > I submit the Apache CarbonData 1.6.0 (RC3) for your vote.
> >
> >
> > 1.Release Notes:
> >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344965
> >
> >
> > Some key features and improvements in this release:
> >
> >
> >1. Supported Index Server to distribute the index cache and
> parallelise
> > the
> > index pruning.
> >
> >2. Supported incremental data loading on MV datamaps and stabilised
> MV.
> >
> >3. Supported Arrow format from Carbon SDK.
> >
> >4. Supported read from Hive.
> >
> >
> >
> >
> > [Behaviour Changes]
> >
> >1. None
> >
> >
> >  2. The tag to be voted upon : apache-CarbonData-1.6.0-rc3 (commit:
> >
> > 4729b4ccee18ada1898e27f130253ad06497f1fb)
> >
> > https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.0-rc3
> >
> >
> >
> > 3. The artifacts to be voted on are located here:
> >
> > https://dist.apache.org/repos/dist/dev/carbondata/1.6.0-rc3/
> >
> >
> >
> > 4. A staged Maven repository is available for review at:
> >
> > https://repository.apache.org/content/repositories/orgapachecarbondata-1055/
> >
> >
> >
> > 5. Release artifacts are signed with the following key:
> >
> >
> > https://people.apache.org/keys/committer/raghunandan.asc
> >
> >
> >
> > Please vote on releasing this package as Apache CarbonData 1.6.0. The
> > vote will be open for the next 72 hours and passes if a majority of at
> > least three +1 PMC votes are cast.
> >
> >
> >
> > [ ] +1 Release this package as Apache CarbonData 1.6.0
> >
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> >
> > [ ] -1 Do not release this package because...
> >
> >
> >
> > Regards,
> >
> > Raghunandan.
> >
>


[VOTE] Apache CarbonData 1.6.0(RC3) release

2019-08-13 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.6.0 (RC3) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344965


Some key features and improvements in this release:


   1. Supported Index Server to distribute the index cache and parallelise the
index pruning.

   2. Supported incremental data loading on MV datamaps and stabilised MV.

   3. Supported Arrow format from Carbon SDK.

   4. Supported read from Hive.




[Behaviour Changes]

   1. None


 2. The tag to be voted upon : apache-CarbonData-1.6.0-rc3 (commit:

4729b4ccee18ada1898e27f130253ad06497f1fb)

https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.0-rc3



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.6.0-rc3/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1055/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.6.0. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.6.0


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


[VOTE] Apache CarbonData 1.6.0(RC2) release

2019-08-03 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.6.0 (RC2) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344965


Some key features and improvements in this release:


   1. Supported Index Server to distribute the index cache and parallelise the
index pruning.

   2. Supported incremental data loading on MV datamaps and stabilised MV.

   3. Supported Arrow format from Carbon SDK.

   4. Supported read from Hive.




[Behaviour Changes]

   1. None


 2. The tag to be voted upon : apache-carbondata-1.6.0-rc2 (commit:

9ca7891d16313be66d8271c855f7c6f4c54c2e1b)

https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.0-rc2



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.6.0-rc2/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1054/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.6.0. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.6.0


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


[ANNOUNCE] Apache CarbonData 1.5.3 release

2019-04-09 Thread Raghunandan S
Hi All,

Apache CarbonData community is pleased to announce the release of the
Version 1.5.3 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments; in one of the largest
deployments, it supports queries on a single table with 3 PB of data (more
than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.3/ and provide
feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.3?

CarbonData 1.5.3 aimed to move closer to unified analytics. We now allow
DDL to operate on the LRU cache so that users can manage the LRU cache per
their requirements. We have also upgraded the integration to support the
latest Presto version. More importantly, we have further improved
CarbonData performance.

In this version of CarbonData, more than 20 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.

CarbonData Core

DDL Support on CarbonData LRU Cache

Previously, although the user could set the cache size, the functionality
was limited because the user had no clear picture of how much cache to
configure for their requirements.

From this version, we support DDL on the CarbonData LRU Cache, which allows
you to do the following operations (see the sketch after this list):

   - Show the current cache used per table.
   - Show the current cache used for a specific table.
   - Clear the cache for a specific table.
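
A short sketch of these operations from a Carbon-enabled session; SHOW
METACACHE matches the DDL referenced in the JIRA list below, while the DROP
METACACHE clearing command is an assumption to verify against your version's
documentation:

```scala
// Assumes a Carbon-enabled SparkSession `spark`.
spark.sql("SHOW METACACHE").show(false)                 // cache used per table
spark.sql("SHOW METACACHE ON TABLE sales").show(false)  // cache for one table
spark.sql("DROP METACACHE ON TABLE sales")              // clear one table's cache
```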

Supports SDK Read from Different Schema

This version allows the user to read two or more CarbonData files with
different schemas in the same location.
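
A minimal Scala sketch using the Java SDK reader, assuming the CarbonReader
builder API; the path, table name and projected column names are illustrative:

```scala
import org.apache.carbondata.sdk.file.CarbonReader

// With files of different schemas in one location, project only the
// columns you need; names here are illustrative.
val reader = CarbonReader.builder("/tmp/carbon_files", "_temp")
  .projection(Array("name", "age"))
  .build[Array[AnyRef]]()
while (reader.hasNext) {
  val row = reader.readNextRow
  println(row.mkString(", "))
}
reader.close()
```
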
Performance Improvements

Improved Single/Concurrent Query Performance

When the number of segments is large, query performance degrades due to a
higher memory footprint, multi-thread pruning, retrieval from the unsafe
datamap store, and so on.

In this version we have improved query performance through the following
modifications:

   - Reduced memory footprint during queries.
   - Added multi-thread pruning for non-filter queries.
   - Updated the driver cache's unsafe storage format for faster data
   retrieval.

Improved Count(*) Query Performance

Previously, pruning for count(*) was the same as for a select * query,
which is very time-consuming due to the different processes involved.

In this version, we have optimized count(*) query performance by reading
the blocklet row count directly from the DataMapRow. This reduces query
time and improves query performance.
Other Improvements

Presto Version Upgrade

CarbonData now integrates with Presto version 0.217.
Behavior Change

None


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344322
Bug

   - [CARBONDATA-3202] - Updated schema is not reflected in session catalog after add, drop or rename column
   - [CARBONDATA-3223] - Datasize and Indexsize showing 0B for 1.1 store when show segments is done
   - [CARBONDATA-3284] - Workaround for Create-PreAgg Datamap Fail
   - [CARBONDATA-3287] - Remove the validation of same schema data files in location for external table and file format
   - [CARBONDATA-3298] - Logs are getting printed when clean files is executed for old stores
   - [CARBONDATA-3301] - Array column is giving null data in case of spark carbon file format
   - [CARBONDATA-3313] - count(*) is not invalidating the invalid segments cache
   - [CARBONDATA-3314] - Index Cache Size printed in SHOW METACACHE ON TABLE DDL is not accurate
   - [CARBONDATA-3315] - Range Filter query with two between clauses with an OR gives wrong results
   - [CARBONDATA-3320] - Number of partitions are always zero in describe formatted for hive native partition
   - [CARBONDATA-3322] - After renaming table, "SHOW METACACHE ON TABLE" still works for old table
   - [CARBONDATA-3323] - Output is null when cache is empty
   - [CARBONDATA-3328]

Re: [VOTE] Apache CarbonData 1.5.3(RC1) release

2019-04-09 Thread Raghunandan S
Hi all


The PMC vote has passed for the Apache CarbonData 1.5.3 release; the
result is as below:


+1(binding): 5(Jacky, Kumar Vishal, Ravindra, David CaiQiang, Liang Chen)


+1(non-binding) : 1


Thanks all for your vote.


Regards

On Mon, Apr 8, 2019 at 6:31 PM Ravindra Pesala 
wrote:

> +1
>
> Regards,
> Ravindra.
>
> On Wed, 3 Apr 2019 at 1:23 PM, Raghunandan S <
> carbondatacontributi...@gmail.com> wrote:
>
> > Hi
> >
> >
> > I submit the Apache CarbonData 1.5.3 (RC1) for your vote.
> >
> >
> > 1.Release Notes:
> >
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344322
> >
> >
> > Some key features and improvements in this release:
> >
> >
> >1. Supported DDL to operate on CarbonData LRU Cache
> >
> >2. Improved single, concurrent query performance.
> >
> >3. Count(*) query performance enhanced by optimising datamaps pruning
> >
> >4. Supported adding new columns through SDK
> >
> >5. Presto version upgraded to 0.217
> >
> >
> >
> >
> > [Behavior Changes]
> >
> >1. None
> >
> >
> >  2. The tag to be voted upon : apache-carbondata-1.5.3-rc1 (commit:
> >
> > 7f271d0aba272f9fbe9642a4900cd4da61eb43bb)
> >
> >
> >
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.3-rc1
> >
> >
> >
> > 3. The artifacts to be voted on are located here:
> >
> > https://dist.apache.org/repos/dist/dev/carbondata/1.5.3-rc1/
> >
> >
> >
> > 4. A staged Maven repository is available for review at:
> >
> >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1039/
> >
> >
> >
> > 5. Release artifacts are signed with the following key:
> >
> >
> > https://people.apache.org/keys/committer/raghunandan.asc
> >
> >
> >
> > Please vote on releasing this package as Apache CarbonData 1.5.3. The
> > vote will be open for the next 72 hours and passes if a majority of at
> > least three +1 PMC votes are cast.
> >
> >
> >
> > [ ] +1 Release this package as Apache CarbonData 1.5.3
> >
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> >
> > [ ] -1 Do not release this package because...
> >
> >
> >
> > Regards,
> >
> > Raghunandan.
> >
> --
> Thanks & Regards,
> Ravi
>


[VOTE] Apache CarbonData 1.5.3(RC1) release

2019-04-03 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.5.3 (RC1) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344322


Some key features and improvements in this release:


   1. Supported DDL to operate on CarbonData LRU Cache

   2. Improved single, concurrent query performance.

   3. Count(*) query performance enhanced by optimising datamaps pruning

   4. Supported adding new columns through SDK

   5. Presto version upgraded to 0.217




[Behavior Changes]

   1. None


 2. The tag to be voted upon : apache-carbondata-1.5.3-rc1 (commit:

7f271d0aba272f9fbe9642a4900cd4da61eb43bb)

https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.3-rc1



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.5.3-rc1/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1039/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.5.3. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.5.3


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


[ANNOUNCE] Apache CarbonData 1.5.2 release

2019-02-04 Thread Raghunandan S
Hi All,


Apache CarbonData community is pleased to announce the release of the
Version 1.5.2 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments; in one of the largest
deployments, it supports queries on a single table with 3 PB of data (more
than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.2/ and provide
feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.2?

CarbonData 1.5.2 aimed to move even closer to unified analytics. We want to
enable CarbonData files to be read from more engines and libraries to
support various use cases. In this regard, we have enhanced and stabilized
Presto features, along with the following features and improvements.

In this version of CarbonData, more than 68 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.

CarbonData Core

Support Compaction for No-sort Load Segments

During data loading, if the sort scope is set to no-sort, data loading
performance increases significantly because the data is not sorted and is
written as it is received. However, no-sort loading degrades query
performance, because indexes are not built on these segments. Compacting
these no-sort loaded segments converts them into sorted segments and
thereby improves query performance, as indexes get generated. The ideal
scenario for this feature is when high-speed data loading is more important
than query performance until compaction is performed.
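
The conversion itself is triggered with the standard compaction DDL; a
sketch (table name illustrative):

```scala
// Assumes a Carbon-enabled SparkSession `spark`.
// Compaction rewrites the no-sort segments as sorted segments and
// generates the indexes.
spark.sql("ALTER TABLE sensor_data COMPACT 'MINOR'")
// or merge all eligible segments:
spark.sql("ALTER TABLE sensor_data COMPACT 'MAJOR'")
```
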
Support Rename of Column Names

Column names can be renamed to reflect the business scenario or
conventions.
Support GZIP Compressor for CarbonData Files

GZIP compression is supported to compress each page of a CarbonData file.
GZIP offers a better compression ratio, thereby reducing the store size: on
average, GZIP compression reduces store size by 20-30% compared to Snappy
compression. GZIP compression is also supported for the sort temp files
written during data loading. GZIP additionally has hardware support on some
platforms, so data loading performance increases on machines where GZIP is
supported natively in hardware.
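
A hedged configuration sketch; the property names (carbon.column.compressor,
carbon.sort.temp.compressor) are assumed from the configuration documentation
and should be verified for 1.5.2:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Property names assumed; verify against the 1.5.2 configuration docs.
val props = CarbonProperties.getInstance()
props.addProperty("carbon.column.compressor", "gzip")    // page compression
props.addProperty("carbon.sort.temp.compressor", "GZIP") // sort temp files
```
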
Performance Improvements

Support Range Partitioned Sort during Data Load

Global sort, supported during data loads, ensures the data is entirely
sorted and hence groups all identical data onto a particular node/machine.
This helps optimise Spark scan performance and also increases concurrency.
The drawback of global sort is that it is very slow, as the data has to be
globally sorted (heavy shuffle). Local sort, on the other hand, partitions
the data across multiple nodes/machines and ensures the data local to each
node/machine is sorted. This improves data loading performance, but query
performance degrades a bit, as more Spark tasks have to be launched to scan
the data. Range sort, in contrast, splits the data based on value ranges
and loads using local sort. This gives balanced performance for both load
and query.
Other Improvements

Presto Enhancements

CarbonData implemented features to better integrate with Presto. Presto can
now recognise CarbonData as a native format, and many bugs were fixed to
enhance stability.
Support Map Data Type through DDL

Version 1.5.0 supported adding the Map data type through the CarbonData
SDK. This version supports adding the Map data type through DDL.
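
A hedged DDL sketch for the map type; table and column names are
illustrative:

```scala
// Assumes a Carbon-enabled SparkSession `spark`.
spark.sql("""
  CREATE TABLE IF NOT EXISTS device_props (
    id INT,
    properties MAP<STRING, STRING>
  ) STORED BY 'carbondata'
""")
```
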
Behaviour Change

   1. If the user doesn't specify sort columns during table creation, the
   default sort scope is set to no-sort during data loading
   2. The default complex-value delimiters are changed from '$', ':' to
   '\001', '\002' respectively
   3. Inverted index generation is disabled by default

New Configuration Parameters

Configuration Name               | Default Value | Range
carbon.table.load.sort.scope     | LOCAL_SORT    | LOCAL_SORT, NO_SORT, GLOBAL_SORT, BATCH_SORT
carbon.range.column.scale.factor | 3             | 1-300
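
A sketch tying these together: RANGE_COLUMN is the table property that
enables range-partitioned sort during load, and
carbon.range.column.scale.factor (above) tunes the split granularity. The
table property name is assumed from the 1.5.2 feature documentation, and the
table/column names are illustrative:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Assumes a Carbon-enabled SparkSession `spark`.
spark.sql("""
  CREATE TABLE tx (id BIGINT, amount DOUBLE)
  STORED BY 'carbondata'
  TBLPROPERTIES ('SORT_COLUMNS'='id', 'RANGE_COLUMN'='id')
""")
// Scale factor from the table above (default 3, range 1-300).
CarbonProperties.getInstance()
  .addProperty("carbon.range.column.scale.factor", "3")
```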


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321
Sub-task

   - [CARBONDATA-2755] - Compaction of Complex DataType (STRUCT AND ARRAY)
   - [CARBONDATA-2838] - Add SDV test cases for Local Dictionary Support
   - [CARBONDATA-3017] - Create DDL Support for Map Type
   - [CARBONDATA-3073]

[VOTE] Apache CarbonData 1.5.2(RC2) release

2019-01-30 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.5.2 (RC2) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321


Some key features and improvements in this release:


   1. Presto Enhancements like supporting Hive metastore and stabilising
existing Presto features

   2. Supported Range sort for faster data loading and improved point query
performance.

   3. Supported Compaction for no-sort loaded segments

   4. Supported rename of column names

   5. Supported GZIP compressor for CarbonData files.

   6. Supported map data type from DDL.


[Behavior Changes]

   1. If user doesn’t specify sort columns during table creation, default
sort scope is set to no-sort during data loading


 2. The tag to be voted upon : apache-carbondata-1.5.2-rc2 (commit:

9e0ff5e4c06fecd2dc9253d6e02093f123f2e71b)

https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.2-rc2



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.5.2-rc2/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1038/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.5.2. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.5.2


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


[VOTE] Apache CarbonData 1.5.2(RC1) release

2019-01-21 Thread Raghunandan S
Hi


I submit the Apache CarbonData 1.5.2 (RC1) for your vote.


1. Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321


Some key features and improvements in this release:


   1. Presto Enhancements like supporting Hive metastore and stabilising
existing Presto features

   2. Supported Range sort for faster data loading and improved point query
performance.

   3. Supported Compaction for no-sort loaded segments

   4. Supported rename of column names

   5. Supported GZIP compressor for CarbonData files.

   6. Supported map data type from DDL.


[Behavior Changes]

   1. If user doesn’t specify sort columns during table creation, default
sort scope is set to no-sort during data loading


 2. The tag to be voted upon : apache-carbondata-1.5.2-rc1 (commit:

a8235fa3dd2d73497b6a9b7c57fd78fe589cd0cf)

https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.2-rc1



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.5.2-rc1/



4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1037/



5. Release artifacts are signed with the following key:


https://people.apache.org/keys/committer/raghunandan.asc



Please vote on releasing this package as Apache CarbonData 1.5.2. The vote
will be open for the next 72 hours and passes if a majority of at least
three +1 PMC votes are cast.



[ ] +1 Release this package as Apache CarbonData 1.5.2


[ ] 0 I don't feel strongly about it, but I'm okay with the release


[ ] -1 Do not release this package because...



Regards,

Raghunandan.


Re: [DISCUSS] Move to gitbox as per ASF infra team mail

2019-01-04 Thread Raghunandan S
+1

On Sat, 5 Jan 2019, 7:38 am Liang Chen,  wrote:

> Hi
>
> +1 from my side.
>
> Regards
> Liang
>
>
> Liang Chen wrote
> > Hi all,
> >
> > Background :
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/NOTICE-Mandatory-migration-of-git-repositories-to-gitbox-apache-org-td72614.html
> >
> > Apache Hadoop git repository is in git-wip-us server and it will be
> > decommissioned, ASF infra is proposing to move to gitbox. This discussion
> > is
> > for getting consensus, please discuss and vote.
> >
> > Regards
> > Liang
> >
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Chuanyin Xu as new PMC for Apache CarbonData

2019-01-01 Thread Raghunandan S
Congrats

On Wed, 2 Jan 2019, 5:49 am Liang Chen,  wrote:

> Hi
>
> We are pleased to announce that Chuanyin Xu as new PMC for Apache
> CarbonData
> .
>
> Congrats to Chuanyin Xu!
>
> Apache CarbonData PMC
>


Re: [discussion]should we improving algorithmic performance about LRU

2018-12-20 Thread Raghunandan S
Hi,
I'm unable to open the link. Could you please share a summary of the article?

Regards

On Fri, 21 Dec 2018, 7:51 am litao,  wrote:

> Hi. In CarbonData, LRU is used for caching; CarbonLRUCache uses the LRU-1
> algorithm. Can we improve the algorithm to something like LRU-2? An
> algorithm comparison is available at
> https://blog.csdn.net/elricboa/article/details/78847305. If we improve the
> LRU algorithm, the index cache hit rate will improve, which also helps
> query efficiency. Another question: if we could share the cache across all
> the drivers, there would be no need to build a separate cache for each
> driver. That is another question we should discuss.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [ANNOUNCE] Bo Xu as new Apache CarbonData committer

2018-12-07 Thread Raghunandan S
Congrats xubo. Welcome on board

On Sat, 8 Dec 2018, 8:37 am Liang Chen,  wrote:

> Hi all
>
> We are pleased to announce that the PMC has invited Bo Xu as new
> Apache CarbonData
> committer, and the invite has been accepted!
>
> Congrats to Bo Xu and welcome aboard.
>
> Regards
> Apache CarbonData PMC
>


Re: [DISCUSSION] Support DataLoad using Json for CarbonSession

2018-12-06 Thread Raghunandan S
What is the use case for supporting this?

-1

On Fri, 7 Dec 2018, 12:08 pm Indhumathi,  wrote:

> Hi xuchuanyin, thanks for your reply.
>
> The syntax for data load using JSON is to support the Load DDL with .json
> files.
> Example:
> LOAD DATA INPATH 'data.json' INTO TABLE 'tablename';
>
> As per your suggestion, if we read the input files (.json) using a Spark
> DataFrame, then we cannot handle bad records.
> I tried loading a JSON file which has a bad record in one column using a
> DataFrame, and that DataFrame returned null values for all the columns.
> So, Carbon does not know which column actually contains a bad record
> while loading. Hence, this case cannot be handled through a DataFrame.
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: SDK support LOCAL_DICTIONARY_INCLUDE and LOCAL_DICTIONARY_EXCLUDE

2018-12-06 Thread Raghunandan S
@Kumar Vishal, what is the fallback performance if a larger number of
columns need to fall back? Would it not increase the overhead of
generating a temporary dictionary and then discarding it?

On Fri, 7 Dec 2018, 12:56 pm ravipesala,  wrote:

>
> I agree with @kumarvishal; better not to add more options, as it confuses
> the user. We had better fall back automatically depending on the size of
> the dictionary.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 1.5.1(RC2) release

2018-12-03 Thread Raghunandan S
+1

On Mon, 3 Dec 2018, 3:03 pm xm_zzc, <441586...@qq.com> wrote:

> +1
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Refactory on spark related modules

2018-12-03 Thread Raghunandan S
I feel it is not required now. Let's wait for breaking changes in Spark
before we do the refactoring.

On Tue, 4 Dec 2018, 8:23 am xubo245, <601450...@qq.com> wrote:

> Who will Refactory on spark related modules?
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] CarbonData to support spark 2.3 version in next carbon version

2018-12-03 Thread Raghunandan S
+1

On Tue, 4 Dec 2018, 8:26 am xubo245, <601450...@qq.com> wrote:

> Hi all,
> Spark released Spark 2.4 more than a month ago. CarbonData should start
> to support Spark 2.4.
> I want to develop this, and have raised a JIRA for it:
> https://issues.apache.org/jira/browse/CARBONDATA-3144
> Is it ok?
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] CarbonData to support spark 2.4 version in next carbon version

2018-12-03 Thread Raghunandan S
Currently none. No major compatibility-breaking features were added in 2.4.
We need to do a small analysis of any changes in the classes used by
CarbonData.

Regards
Raghunandan

On Tue, 4 Dec 2018, 8:30 am xubo245, <601450...@qq.com> wrote:

> Are there any limit for supporting Spark-2.4?
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Streaming Analytics Meetup [24th November 2018 @ Radisson BLU, Bangalore]

2018-11-23 Thread Raghunandan S
Hi All,



We are glad to host the "Streaming Analytics" meet-up on 24 November 2018
at Radisson BLU, Bangalore.



Kindly find the list of topics presented and venue details:







Venue Map: https://goo.gl/maps/SrpwyaYFWEE2


Regards
Raghunandan


Re: [Proposal] Thoughts on general guidelines to follow in ApacheCarbonData community

2018-11-18 Thread Raghunandan S
Dear xuchuanyin,
For 2, 4 and 6: there are many ways the meetings can be held; using zoom.us
is one mechanism. It is community collaborative development and hence the
collective responsibility of all committers. The version manager can send
reminders for timely handling of PRs.

Point 7 can't be combined with 1 and 2: points 1 and 2 are activities
before coding starts, while 7 is the test report after coding.




On Mon, 19 Nov 2018, 9:10 am xuchuanyin,  wrote:

> Hi Ravin, very nice to see this proposal in the community!
> Guidelines are better if they are easy to follow. Even though I care more
> about code quality, I also care about the convenience for developers to
> contribute.
>
> After going through the points, I think:
> 1, 3, 5, 8, 9, 10: +1
> 2, 4, 6: How will the meetings be held? What if it is not convenient for
> some committers to attend online? As for collecting enough '+1's, how can
> we ensure this does not take too much time?
> 7: This can be merged into 1 and 2
>
>
>
>


Re: Proposal to integrate QATCodec into Carbondata

2018-11-01 Thread Raghunandan S
+1
This would further enhance the performance of queries where I/O is the
bottleneck.

Regards
Raghu

On Fri, 12 Oct 2018, 12:18 pm Xu, Cheng A,  wrote:

> Thanks Chuanyin. This PR looks cool. Allowing customized codec is a good
> option. Comparing with the existing built-in Snappy codec in CarbonData, I
> think QATCodec with better performance and better compression ratio is also
> a good candidate for built-in support. Any thoughts?
>
> Thanks
> Ferdinand Xu
>
> -Original Message-
> From: xuchuanyin [mailto:xuchuan...@hust.edu.cn]
> Sent: Friday, October 12, 2018 11:07 AM
> To: dev@carbondata.apache.org
> Subject: Re: Proposal to integrate QATCodec into Carbondata
>
> Hmm, if it only needs another compressor to be extended for a software
> implementation, I think it will be quite easy to integrate.
>
> Actually a PR has already been raised weeks ago to support customize
> compressor in carbondata, you can refer to this link:
> https://github.com/apache/carbondata/pull/2715. You can refer to the
> `CustomizeCompressor` in `TestLoadDataWithCompression.scala` for more
> information.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Raghunandan as new committer of Apache CarbonData

2018-09-26 Thread Raghunandan S
Thank you all.

On Thu, 27 Sep 2018, 7:46 am Lu Cao,  wrote:

> Congrats Raghu!
>
> Regards,
> Lionel
>
> On Wed, Sep 26, 2018 at 10:57 PM Jean-Baptiste Onofré 
> wrote:
>
> > Welcome aboard !
> >
> > Congrats
> >
> > Regards
> > JB
> >
> > On 26 September 2018 at 10:45, xm_zzc <441586...@qq.com> wrote:
> > >Congratulations Raghunandan, welcome aboard !!
> > >
> > >
> > >
> > >--
> > >Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Issue] Load auto compaction failed

2018-09-26 Thread Raghunandan S
Dear Aaron,
The memory requirements for the local dictionary are high compared to the
global dictionary. What are the off-heap configuration values? I suspect
low values might be the reason.

Regards
Raghu
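
For reference, the off-heap knobs in question; a hedged sketch, with property
names assumed from the configuration documentation:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Increase unsafe working memory for local-dictionary-heavy loads.
// Property names assumed; the size below is illustrative.
val props = CarbonProperties.getInstance()
props.addProperty("enable.unsafe.sort", "true")
props.addProperty("enable.offheap.sort", "true")
props.addProperty("carbon.unsafe.working.memory.in.mb", "2048")
```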

On Thu, 27 Sep 2018, 4:58 am aaron, <949835...@qq.com> wrote:

> Hi community,
>
> Based on 1.5.0, with local dictionary and local sort, the load failed
> when the record count reached 0.5 billion, but I had previously loaded 50
> billion records with global dictionary and sort. Do you have any ideas?
>
>
> 18/09/26 08:39:45 AUDIT CarbonTableCompactor:
> [ec2-dca-aa-p-sdn-16.appannie.org][hadoop][Thread-1]Compaction request
> completed for table default.store
> 18/09/26 08:46:39 WARN TaskSetManager: Lost task 1.0 in stage 216.0 (TID
> 1513, 10.2.3.249, executor 2):
> org.apache.spark.util.TaskCompletionListenerException:
> org.apache.carbondata.core.datastore.exception.CarbonDataWriterException
>
> Previous exception in task:
> org.apache.carbondata.core.datastore.exception.CarbonDataWriterException
>
>
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.processWriteTaskSubmitList(CarbonFactDataHandlerColumnar.java:353)
>
>
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.closeHandler(CarbonFactDataHandlerColumnar.java:377)
>
>
> org.apache.carbondata.processing.merger.RowResultMergerProcessor.execute(RowResultMergerProcessor.java:177)
>
>
> org.apache.carbondata.spark.rdd.CarbonMergerRDD$$anon$1.(CarbonMergerRDD.scala:224)
>
>
> org.apache.carbondata.spark.rdd.CarbonMergerRDD.internalCompute(CarbonMergerRDD.scala:87)
>
> org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:78)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> org.apache.spark.scheduler.Task.run(Task.scala:109)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> at
> org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
> at
>
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
> at org.apache.spark.scheduler.Task.run(Task.scala:119)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> 18/09/26 08:53:31 WARN TaskSetManager: Lost task 1.1 in stage 216.0 (TID
> 1515, 10.2.3.11, executor 1):
> org.apache.spark.util.TaskCompletionListenerException:
> org.apache.carbondata.core.datastore.exception.CarbonDataWriterException
>
> Previous exception in task:
> org.apache.carbondata.core.datastore.exception.CarbonDataWriterException
>
>
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.processWriteTaskSubmitList(CarbonFactDataHandlerColumnar.java:353)
>
>
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.closeHandler(CarbonFactDataHandlerColumnar.java:377)
>
>
> org.apache.carbondata.processing.merger.RowResultMergerProcessor.execute(RowResultMergerProcessor.java:177)
>
>
> org.apache.carbondata.spark.rdd.CarbonMergerRDD$$anon$1.(CarbonMergerRDD.scala:224)
>
>
> org.apache.carbondata.spark.rdd.CarbonMergerRDD.internalCompute(CarbonMergerRDD.scala:87)
>
> org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:78)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> org.apache.spark.scheduler.Task.run(Task.scala:109)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> at
> org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
> at
>
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
> at org.apache.spark.scheduler.Task.run(Task.scala:119)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> 18/09/26 09:00:22 WARN TaskSetManager: Lost task 1.2 in 

[DISCUSSION] Updates to CarbonData documentation and structure

2018-09-04 Thread Raghunandan S
Dear All,

 I wanted to propose some updates and changes to our current
documentation. Please let me know your inputs and comments.


1. Split our CarbonData commands into DDL and DML

2. Add Presto and Hive integration, along with Spark, into the quick start

3. Add a master reference manual which lists all the commands supported in
CarbonData. This manual shall have links to the supported DDL and DML

4. Add an introduction to CarbonData covering architecture, design and
supported features

5. Merge the FAQ and troubleshooting documents into a single document

6. Add a separate md file to explain to users how to navigate across our
documentation

7. Add a TOC (Table of Contents) to all the md files which have multiple
sections

8. Add the list of supported properties at the beginning of each DDL or DML
so that users know all the properties that are supported

9. Rewrite the configuration property descriptions to explain each property
in more detail, and also highlight when to use the command and any caveats

10. Reorder our configuration properties table to group by feature

11. Update our webpage (carbondata.apache.org) to have better navigation
for the documentation section

12. Add use cases about CarbonData usage and performance tuning tips


Regards

Raghu


Re: [DISCUSSION] Implement file-level Min/Max index for streaming segment

2018-08-26 Thread Raghunandan S
+1. This would improve query performance on streaming tables.

On Sun, Aug 26, 2018 at 4:12 PM Liang Chen  wrote:

> Hi
>
> +1 for this proposal.
>
> Regards
> Liang
>
>
> David CaiQiang wrote
> > Hi All,
> > Currently, the filter queries on the streaming table always scan all
> > streaming files, even when no data in the streaming files meets the
> > filter conditions.
> > So I am trying to support a file-level min/max index for streaming
> > segments. It helps to reduce the task count and improve the performance
> > of filter scans in some cases.
> > Please check the document in JIRA:
> > https://issues.apache.org/jira/browse/CARBONDATA-2853
> > Any question, suggestion?
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-08-21 Thread Raghunandan S
Hi,
In my opinion, it is not required to show the original select SQL. Also,
is there a way to get it? I don't think it can be obtained.


Regards
Raghu

On Tue, 21 Aug 2018, 8:02 pm Liang Chen,  wrote:

> Hi
>
> 1. Agree with likun's comments (4 points):
>
> 2. About the 'select sql' for CTAS, you can leave it; we can consider it
> later.
>
> Regards
> Liang
>
> Jacky Li wrote
> > Hi ZZC,
> >
> > I have checked the doc in CARBONDATA-2595. I have the following comments:
> > 1. In the Table Basic Information section, it is better to print the
> > Table Path instead of "CARBON Store Path"
> > 2. For the Table Data Size and Index Size, can you format the output in
> > GB, MB, KB, etc.
> > 3. For the Last Update Time, can you format the output in UTC time like
> > yyyy-MM-dd hh:mm:ss
> > 4. In the table properties, I think some properties may be missing, like
> > block size, blocklet size, long string
> > For implementation, I suggest to write the main logic of collecting these
> > information in java so that it is easier to write tools for it. One tool
> > can be this SQL command and another tool I can think of is an standalone
> > java executable that  can print these information on the screen by
> reading
> > the given table path. (We can put this standalone tool in SDK module)
> >
> > Regards,
> > Jacky
> >
> >
> >> On 20 August 2018 at 11:20 AM, xm_zzc <441586683@qq.com> wrote:
> >>
> >> Hi dev:
> >>  Now I am working on this; the new format is shown in the attachment.
> >> Please give me some feedback.
> >>  There is one question: if a user uses CTAS to create a table, do we
> >> need to show the 'select sql' in the result of 'desc formatted table'?
> >> If yes, how do we get the 'select sql'? Currently I can only get a
> >> non-formatted SQL from 'CarbonSparkSqlParser.scala' (as Jacky
> >> mentioned), for example:
> >>
> >> CREATE TABLE IF NOT EXISTS test_table
> >> STORED BY 'carbondata'
> >> TBLPROPERTIES(
> >> 'streaming'='false', 'sort_columns'='id,city',
> >> 'dictionary_include'='name')
> >> AS SELECT * from source_test;
> >>
> >> The non-formatted SQL I get is:
> >> SELECT*fromsource_test
> >>
> >> desc_formatted.txt
> >> <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t133/desc_formatted.txt>
> >> desc_formatted_external.txt
> >> <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t133/desc_formatted_external.txt>
>
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Sent from:
> >>
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >>
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Operation not allowed: STORED BY (from Spark Dataframe save)

2018-08-14 Thread Raghunandan S
Hi yannav,
Can you send the DataFrame API call and the code you have used?
Did you refer to the example in TestLoadDataFrame.scala?

Are you trying from a SparkSession or a CarbonSession?
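
For comparison, a minimal sketch along the lines of TestLoadDataFrame.scala,
writing through the Carbon source from a CarbonSession; the table name is
illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val spark = SparkSession.builder().master("local")
  .getOrCreateCarbonSession("/tmp/carbon.store")
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("c2", "number")
df.write
  .format("carbondata")
  .option("tableName", "carbon_df_table_test1")
  .mode("overwrite") // pick the SaveMode you need
  .save()
```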

Regards
Raghu

On Tue, 14 Aug 2018, 8:44 pm yannv,  wrote:

> Hello,
>
> I am trying to create a carbon data table from a Spark Data Frame, however
> I
> am getting an error with the (automatic create table statement)
>
> I run this code on spark-shell (passing the carbon data assembly jar file
> for 1.4.0 as well as master branch), on Azure HDInsight cluster with spark
> 2.2.1.
>
> Code used :
>
>
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: STORED BY(line 5, pos 1)
>
> == SQL ==
>
>  CREATE TABLE IF NOT EXISTS default.carbon_df_table_test1
>  (c2 STRING, number INT)
>  PARTITIONED BY (c1 string)
>  STORED BY 'carbondata'
> -^^^
>
>   TBLPROPERTIES ('STREAMING' = 'false')
>
>
>
>   at
>
> org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateFileFormat$1.apply(SparkSqlParser.scala:1194)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateFileFormat$1.apply(SparkSqlParser.scala:1186)
>   at
>
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateFileFormat(SparkSqlParser.scala:1185)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1$$anonfun$31.apply(SparkSqlParser.scala:1090)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1$$anonfun$31.apply(SparkSqlParser.scala:1090)
>   at scala.Option.map(Option.scala:146)
>   at
>
> org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1090)
>
>
>
> I tried various constructors for the carbon object without success.
>
> Note : I can create a Carbondata table and insert data from CSV file
> successfully (but I need to write carbon data from SparkDF), but it looks
> like when the save method is executed, it tries to create the (new) table
> and I get this error on "Stored by"...
>
>
>
> Regards,
> Yann
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: carbondata insert job has only one task

2018-06-30 Thread Raghunandan S
Hi chenxingyu,
How many executors do you have?
Can you check how many select tasks are fired when querying from Parquet?
Also, you can check the number of tasks created if you do a CTAS into a
Hive table.

Regards
Raghu

On Tue, 19 Jun 2018, 5:45 pm 陈星宇,  wrote:

> hi ,
>
>
> I wrote data into a carbondata table from a parquet table via the Spark
> SQL statement 'insert into carbondata_table select * from parquet_table',
> but the task number is always only one.
> This caused the insert job to be very slow.
> I tried increasing spark.default.parallelism to 1000, but that only
> increased the number of query tasks.
> There are more than 500 parquet files.
> How can I get better performance when inserting into the carbondata table?
>
>
> THANKS
> ChenXingYu


Re: Support updating/deleting data for stream table

2018-06-04 Thread Raghunandan S
Hi,
Those are two steps in the same solution, not different solutions. We can
create a JIRA covering everything and implement it part by part; the
parent JIRA gets closed when all the child JIRAs are implemented.

Regards
Raghu

On Sun, 3 Jun 2018, 1:07 pm Liang Chen,  wrote:

> Hi
>
> +1 for considering solution 1 first
>
> Regards
> Liang
>
> xm_zzc wrote
> > Hi  Raghu:
> >   Yep, you are right; that is why I said solution 1 is not very precise
> > when some of the data you want to update/delete is still stored in
> > stream segments. Solution 2 can handle the scenario you mentioned.
> >   But, in my opinion, deleting historical data is a more common scenario
> > than updating data. The data size of a stream table grows day by day,
> > and users generally want to delete specific data to keep the data size
> > from becoming too large. For example, if a user wants to keep data for
> > one year, they need to delete the year-old data every day. On the other
> > hand, solution 2 is more complicated than solution 1, and we need to
> > consider its implementation in depth.
> >   Based on the above reasons, Liang Chen, Jacky, David and I preferred
> > to implement solution 1 first. Is that ok for you?
> >
> >   Is there any other suggestion?
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Support updating/deleting data for stream table

2018-05-30 Thread Raghunandan S
Hi,
But this leads to inconsistent data being returned to the user. For example,
if a user wants to replace all '2g' values with '4g', the stream segments will
still contain '2g', and those rows would be returned by the user's query. I
think we need to handle this scenario as well.

Regards
Raghu

On Wed, 30 May 2018, 9:22 pm Liang Chen,  wrote:

> Hi
>
> Thanks for starting this discussion thread.
> I agree with solution 1: use the easy way to delete data for a stream table.
>
> Regards
> Liang
>
> xm_zzc wrote
> > Hi dev:
> >   Sometimes we need to delete some historical data from a stream table to
> > keep the table size from growing too large, but currently the stream table
> > can't support updating/deleting data, so we need to stop the app, use the
> > 'alter table COMPACT 'close_streaming'' command to close the stream table,
> > and then delete the data.
> >   According to a discussion with Jacky and David offline, there are two
> > solutions to resolve this without stopping the app:
> >
> >   1. Set all *non-stream* segments in the 'carbon.input.segments.tablename'
> > property so the delete skips stream segments. This is easy to implement,
> > but not very precise when some data is stored in stream segments.
> >   2. Support deleting data from stream segments too. This is more
> > complicated, but precise.
> >
> >   I think we can implement solution 1 first, and then consider the
> > implementation of solution 2 in depth.
> >
> >   Welcome your feedback, thanks.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
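
Solution 1 above hinges on the carbon.input.segments.<database>.<table>
session property, which restricts the statements that follow to the listed
segment ids. A minimal sketch, assuming the delete is run through Spark SQL on
a CarbonSession named spark and that segments 1,2,3 are the current
non-streaming segments of default.stream_table; the ids and the predicate are
illustrative:

// scope the delete to the non-streaming segments only
spark.sql("SET carbon.input.segments.default.stream_table = 1,2,3")
spark.sql("DELETE FROM stream_table WHERE event_time < '2017-06-01'")
// restore full-table visibility for subsequent queries
spark.sql("SET carbon.input.segments.default.stream_table = *")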


Re: Grammar about supporting string longer than 32000 characters

2018-05-03 Thread Raghunandan S
+1 for solution 2
On Wed, 2 May 2018 at 9:09 PM, ravipesala  wrote:

> Hi,
>
> I agree with option 2, but rather than a plain new datatype, use varchar(size).
> There are more optimizations we can do with a varchar(size) datatype:
> 1. If the size is small (less than 8 bytes), we can write with a fixed-length
> encoding instead of LV encoding, which saves a lot of space and memory.
> 2. If the size is less than 32000, use our current string datatype.
> 3. If the size is more than 32000, encode in LV format using an int as the
> length.
>
> In the Spark DataFrame support we can use string as the default datatype.
>
> Even if we take option 1, carbon should internally have a new datatype;
> otherwise the code will not be clean, as this property would need to be
> checked in many places. A new datatype leads to a cleaner set of
> implementations and is easier to code and maintain.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
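
The three cases above amount to choosing the width of the length prefix in a
length-value (LV) layout based on the declared size. A minimal sketch of that
dispatch; the thresholds come from the mail, while the layout itself is
illustrative rather than CarbonData's actual writer:

import java.nio.ByteBuffer

def encodeVarchar(value: Array[Byte], declaredSize: Int): Array[Byte] = {
  if (declaredSize < 8) {
    // case 1: small varchar -> fixed-length slot, no length prefix at all
    java.util.Arrays.copyOf(value, declaredSize)
  } else if (declaredSize <= 32000) {
    // case 2: fits the current string datatype -> 2-byte (short) length prefix
    ByteBuffer.allocate(2 + value.length).putShort(value.length.toShort).put(value).array()
  } else {
    // case 3: long varchar -> 4-byte (int) length prefix
    ByteBuffer.allocate(4 + value.length).putInt(value.length).put(value).array()
  }
}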


Re: How to reduce driver memory usage of carbon index

2018-04-16 Thread Raghunandan S
Hi Yaojinguo,
  The issue is that we currently load all the index info into driver memory,
which causes a large memory footprint irrespective of the query type (filter
or full scan).
 This can be avoided by loading only the required segments' index information
for filter queries.
  We could achieve this by creating a datamap containing segment-level
min/max information. Instead of loading all the datamaps down to the blocklet
level, we can load only the segment-level min/max at startup and load the
next-level datamaps based on the query.
This approach, combined with LRU, should be able to limit the memory
consumption on the driver side.

The datamap containing segment-level min/max needs to be implemented; it is
not currently supported in carbondata.

Regards
Raghu

On Wed, Apr 11, 2018 at 1:25 PM, yaojinguo  wrote:

> Hi community,
>   I am using CarbonData 1.3 + Spark 2.1, and I have found a potential
> bottleneck when using CarbonData. As I understand it, CarbonData loads all
> of the carbonindex files and turns them into DataMap or SegmentIndex (for
> earlier versions) structures which contain the start key, end key, and
> min/max value of each column. If I have one table with 200 columns that
> contains 1000 segments, and each segment has 2000 carbondata files, then
> assuming each column value occupies just 10 bytes, you need at least 20GB of
> memory to store the min/max values alone. Any suggestion to resolve this
> problem?
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
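
The shape of the fix described above is a two-level index: a small,
always-resident segment-level min/max table for coarse pruning, with
blocklet-level datamaps materialized on demand behind an LRU. A minimal sketch
of the idea; all names and types are illustrative and do not reflect
CarbonData's actual datamap API:

case class MinMax(min: Long, max: Long) {
  def mayContain(v: Long): Boolean = min <= v && v <= max
}

case class SegmentIndex(segmentId: String, ranges: Map[String, MinMax])

// level 1: always resident; one min/max pair per column per segment
def pruneSegments(all: Seq[SegmentIndex], col: String, v: Long): Seq[SegmentIndex] =
  all.filter(_.ranges.get(col).exists(_.mayContain(v)))

// level 2: blocklet-level indexes are loaded lazily for the surviving
// segments and bounded by an LRU, so driver memory tracks the working
// set rather than the whole table
val blockletCache =
  new java.util.LinkedHashMap[String, AnyRef](16, 0.75f, true) {
    override def removeEldestEntry(e: java.util.Map.Entry[String, AnyRef]): Boolean =
      size() > 100 // illustrative capacity
  }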


Re: The size of the tablestatus file is getting larger, does it impact the performance of reading this file?

2018-03-14 Thread Raghunandan S
Dear Jacky,
It was purposely done like that. The table status needs to give the history
of the transactions that happened on the system; it is like an audit trail.

Dear xm_zzc,
What is your use case?

In any case we cannot permanently remove the entries from our system. Based
on the use case we can consider moving them to a separate file; we can also
measure what the size would be and optimise reading it from the multiple
places that scan it.

Regards
Raghu
On Wed, 14 Mar 2018 at 12:18 PM, Jacky Li  wrote:

> Hi,
>
> Yes, I think you are right. Currently the CLEAN FILES command only deletes
> the segment data folders but does not delete the metadata entries in the
> table_status file; I think this is the problem.
> Please feel free to open a JIRA ticket and improve it. Thanks.
>
> Regards,
> Jacky
>
> > On 14 Mar 2018, at 10:28, xm_zzc <441586...@qq.com> wrote:
> >
> > Hi dev:
> >  The size of the tablestatus file keeps growing; does this impact the
> > performance of reading it, for example with 1 million segment entries in
> > the file? There are many places that scan this file.
> >  Why not delete the invisible segment entries to reduce the size of the
> > tablestatus file? Will they be used later?
> >
> >
> > --
> > Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
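
For reference, the command under discussion can be run as below; as noted
above, its current behaviour removes the segment data folders but leaves the
invisible entries in the tablestatus file (the table name is illustrative):

spark.sql("CLEAN FILES FOR TABLE default.my_table")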


Re: [VOTE] Apache CarbonData 1.3.1(RC1) release

2018-03-04 Thread Raghunandan S
+1
On Sun, 4 Mar 2018 at 10:33 PM, Jacky Li  wrote:

> +1
>
> Regards,
> Jacky
>
>
> > On 5 Mar 2018, at 00:55, Ravindra Pesala  wrote:
> >
> > Hi
> >
> > I submit the Apache CarbonData 1.3.1 (RC1) for your vote.
> >
> > 1. Release Notes:
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12342754
> >
> > Some key improvements in this patch release:
> >
> >   1. Restructured the carbon partitions to use the standard hive folder
> >   structure.
> >   2. Supported global sort on the partitioned table.
> >   3. Many bug fixes, stabilizing the 1.3.0 release.
> >
> >
> > 2. The tag to be voted upon: apache-carbondata-1.3.1-rc1 (commit:
> > 744032d3cc39ff009b4a24b2b43f6d9457f439f4)
> > https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.3.1-rc1
> >
> > 3. The artifacts to be voted on are located here:
> > https://dist.apache.org/repos/dist/dev/carbondata/1.3.1-rc1/
> >
> > 4. A staged Maven repository is available for review at:
> > https://repository.apache.org/content/repositories/orgapachecarbondata-1026
> >
> > 5. Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/ravipesala.asc
> >
> > Please vote on releasing this package as Apache CarbonData 1.3.1. The vote
> > will be open for the next 72 hours and passes if a majority of
> > at least three +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache CarbonData 1.3.1
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> > [ ] -1 Do not release this package because...
> >
> > Regards,
> > Ravindra.
>
>
>
>


Re: Problem with writing the loadStartTime in "dd-MM-yyyy HH:mm:ss:SSS" format

2017-12-15 Thread Raghunandan S
+1
On Fri, 15 Dec 2017 at 5:34 PM, Mohammad Shahid Khan <
mohdshahidkhan1...@gmail.com> wrote:

> Hi All,
> Please find the details.
> Please response as earliest as possible.
> *Problem*:
>
> If we move the table to an environment with a different timezone, or change
> the system's current timezone, then after an IUD operation some of the
> blocks are not treated as valid blocks.
>
> [{"timestamp":"15-12-2017 16:50:31:703","loadStatus":"Success","loadName":"0","partitionCount":"0","isDeleted":"FALSE","dataSize":"912","indexSize":"700","updateDeltaEndTimestamp":"","updateDeltaStartTimestamp":"","updateStatusFileName":"",*"loadStartTime":"15-12-2017 16:50:27:493"*,"visibility":"true","fileFormat":"COLUMNAR_V3"}]
>
> part-0-0_batchno0-0-*1513336827493*.carbondata
>
> If the timezone is different from the one in effect when the load was done,
> the value calculated from *loadStartTime* "15-12-2017 16:50:27:493" will not
> match the timestamp extracted from the block file name.
>
> *Solution*:
>
> We should stop writing the loadStartTime and timestamp in "dd-MM-yyyy
> HH:mm:ss:SSS" format.
> *We should write the long value of the timestamp instead, like below:*
> [{"timestamp":"*1513336827593*","loadStatus":"Success","loadName":"0","partitionCount":"0","isDeleted":"FALSE","dataSize":"912","indexSize":"700","updateDeltaEndTimestamp":"","updateDeltaStartTimestamp":"","updateStatusFileName":"",*"loadStartTime":"1513336827493"*,"visibility":"true","fileFormat":"COLUMNAR_V3"}]
>
> Backward compatibility:
> If the string-to-long parse fails, we can fall back to date parsing.
>
> private long convertTimeStampToLong(String factTimeStamp) {
>   // if the load is in progress, factTimeStamp will be null, so use the
>   // current time (checked up front, before any parsing can throw)
>   if (null == factTimeStamp) {
>     return System.currentTimeMillis();
>   }
>   try {
>     // for new loads factTimeStamp will be a long string,
>     // but for the old store it will be in the form of a date string
>     return Long.parseLong(factTimeStamp);
>   } catch (NumberFormatException nf) {
>     SimpleDateFormat parser =
>         new SimpleDateFormat(CarbonCommonConstants.CARBON_TIMESTAMP_MILLIS);
>     try {
>       return parser.parse(factTimeStamp).getTime();
>     } catch (ParseException e) {
>       LOGGER.error("Cannot convert " + factTimeStamp
>           + " to Time/Long type value " + e.getMessage());
>       parser = new SimpleDateFormat(CarbonCommonConstants.CARBON_TIMESTAMP);
>       try {
>         return parser.parse(factTimeStamp).getTime();
>       } catch (ParseException e1) {
>         LOGGER.error("Cannot convert " + factTimeStamp
>             + " to Time/Long type value " + e1.getMessage());
>         return 0;
>       }
>     }
>   }
> }
>
>
> Regards,
> Mohammad Shahid Khan
>
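
The failure mode is easy to reproduce: a wall-clock string such as the
loadStartTime above only round-trips to the same epoch value in the timezone
where it was written. A minimal sketch; the sample value is taken from the
mail above, and the two timezones are illustrative:

import java.text.SimpleDateFormat
import java.util.TimeZone

val pattern = "dd-MM-yyyy HH:mm:ss:SSS"
val loadStartTime = "15-12-2017 16:50:27:493"

def parseIn(tz: String): Long = {
  val f = new SimpleDateFormat(pattern)
  f.setTimeZone(TimeZone.getTimeZone(tz))
  f.parse(loadStartTime).getTime
}

val inIst = parseIn("Asia/Kolkata") // 1513336827493, matches the file name above
val inUtc = parseIn("UTC")          // 1513356627493, 5.5 hours later
// the two epochs differ, so the formatted string can no longer be matched
// against the epoch embedded in the carbondata file name; writing the long
// value directly avoids the problem
assert(inIst != inUtc)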


Re: Version upgrade for Presto Integration to 0.186

2017-11-02 Thread Raghunandan S
Any backward incompatibilities introduced?
+1 for the upgrade
On Thu, 2 Nov 2017 at 12:18 PM, Bhavya Aggarwal  wrote:

> Hi All,
>
> Presto version 0.186 has a lot of improvements that will increase
> performance and improve reliability. Some of the major fixes and
> improvements are listed below.
>
>
>    - Fix excessive GC overhead caused by map-to-map casts.
>    - Fix an issue that may cause queries containing expensive functions,
>    such as regular expressions, to continue using CPU resources even after
>    they are killed.
>    - Fix a performance issue caused by redundant casts.
>    - Fix a leak in the running-query counter for failed queries. The
>    counter would increment but never decrement for queries that failed
>    before starting.
>    - Reduce memory usage when building data of VARCHAR or VARBINARY types.
>    - Estimate memory usage for GROUP BY more precisely to avoid
>    out-of-memory errors.
>    - Add Spill to Disk for joins.
>
> Currently the Presto version that we are using in CarbonData is 0.166; I
> would like to suggest upgrading it to 0.186. Please let me know what the
> group thinks about it.
>
>
> Regards
>
> Bhavya
>


Re: [PROPOSAL] Tag Pull Request with feature tag

2017-10-29 Thread Raghunandan S
+1
On Sat, 28 Oct 2017 at 8:05 PM, Liang Chen  wrote:

> +1, agree with this proposal.
>
> Regards
> Liang
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] Carbon Store abstraction

2017-10-20 Thread Raghunandan S
I think we need to integrate with Presto and Hive first and then refactor;
this gives a clear idea of what we want to achieve. Each processing engine is
different in its own way, and integrating first would give us a clear idea of
what is required in CarbonData.
On Fri, 20 Oct 2017 at 1:01 PM, Liang Chen  wrote:

> Hi
>
> Thanks for starting this discussion thread. Agreed: to expose a clear
> interface to users, some optimization work is needed.
>
> Can you list more details about your proposal? For example: which classes
> you propose to move to the carbon store module, and which APIs you propose
> to create and expose to users.
> I suggest we discuss and confirm your proposal on dev first, then start to
> create sub-tasks in JIRA.
>
> Regards
> Liang
>
>
> Jacky Li wrote
> > Hi community,
> >
> > I am proposing to create a carbondata-store module to abstract the carbon
> > store concept. The reason is:
> >
> > 1. Initially, carbon was designed as a file format; as it evolved to
> > provide more features, it implemented more and more functionality in the
> > spark integration module. However, as the community tries to integrate
> > more and more compute frameworks with carbon, this functionality is
> > duplicated across integration layers. Ideally, it can be unified and
> > provided in one place.
> >
> > 2. The current interface of carbondata exposed to users is through SQL,
> > but the developer interface for those who want to do compute engine
> > integration is not very clear.
> >
> > 3. There are many SQL commands that carbon supports, but they are
> > implemented through spark RDDs only and are not sharable across compute
> > frameworks.
> >
> > For these reasons, and for the long-term future of carbondata, I think it
> > is better to abstract the interface for compute engine integration into a
> > new module called carbondata-store. It can wrap all store-level
> > functionality above the file format in a module independent of any compute
> > engine, so that every integration module can depend on it and duplicate
> > code is removed.
> >
> > This is a continuous long-term effort; I will break this work into
> > subtasks and start by creating JIRA issues, if you agree.
> >
> > Regards,
> > Jacky Li
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Lu Cao as new Apache CarbonData committer

2017-09-13 Thread Raghunandan S
Congrats, Lu Cao.
On Wed, 13 Sep 2017 at 7:18 PM, Liang Chen  wrote:

> Hi all
>
> We are pleased to announce that the PMC has invited Lu Cao to become a new
> Apache CarbonData committer, and the invitation has been accepted!
>
> Congrats to Lu Cao and welcome aboard.
>
> Regards
> The Apache CarbonData PMC
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Unify the sort column and sort scope in create table command

2017-09-02 Thread Raghunandan S
Sort scope is different: it should not be decided at create time alone.
Specifying it at create time should only act as a default value; the actual
decision should be made at load time, since it depends on the system load and
the balance required between load and query performance.
On Thu, 31 Aug 2017 at 4:25 PM, xuchuanyin  wrote:

> Both options prefer to make the sort scope the same across all segments
> (loads).
> Since carbondata supports a different sort scope per segment (load),
> I think there should be a third option.
>
> Option 3: the sort scope in the load data command takes higher priority than
> the one specified in the create table command, which means the sort scope in
> the create table command is a default value and will only be used if the
> user doesn't specify one when loading data.
>
> Option 3 leaves the user to balance loading and querying performance. Users
> can use global sort as the default scope and switch to local sort when
> encountering large amounts of data during peak periods.
> I am not sure whether this would count as complicated or advanced usage?
>
> Besides, an update is performed as a select followed by a load. So, what
> sort scope will that load use?
>
>
>
>
>
>
> On 08/31/2017 17:45, Erlu Chen wrote:
> 1 Requirement
> Currently, users can specify sort columns in table properties when creating
> a table, and when loading data they can also specify the sort scope in load
> options.
> To improve ease of use, it will be better to specify all the sort-related
> parameters in the create table command.
> Once the sort scope is specified in the create table command, it will be
> used during data load even if users have specified it in load options.
> 2 Detailed design
> 2.1 Task-01
> Requirement: create table can support specifying the sort scope.
> Implementation: use the table properties (a Map) to carry the sort scope as
> a key/value pair; the existing interface will then be called to write this
> key/value pair into the metastore.
> Global Sort, Local Sort and No Sort will be supported; they can be specified
> in the SQL command:
> CREATE TABLE tableWithGlobalSort (
>   shortField SHORT,
>   intField INT,
>   bigintField LONG,
>   doubleField DOUBLE,
>   stringField STRING,
>   timestampField TIMESTAMP,
>   decimalField DECIMAL(18,2),
>   dateField DATE,
>   charField CHAR(5)
> )
> STORED BY 'carbondata'
> TBLPROPERTIES('SORT_COLUMNS'='stringField', 'SORT_SCOPE'='GLOBAL_SORT')
> Tips: if the sort scope is global sort, users should specify
> GLOBAL_SORT_PARTITIONS; if they do not, the number of map tasks is used.
> GLOBAL_SORT_PARTITIONS should be an Integer in the range
> [1, Integer.MaxValue] and is only used when the sort scope is global sort.
> - Global Sort: uses the orderBy operator in Spark; data is ordered at the
>   segment level.
> - Local Sort: ordered per node; a carbondata file is ordered if it is
>   written by one task.
> - No Sort: no sorting.
> Tips: key and value are case-insensitive.
> 2.2 Task-02
> Requirement:
> Load data will support local sort, no sort and global sort, ignoring the
> sort scope specified in the load data command and using the parameter
> specified in the create table command.
> Currently, users can specify the sort scope and global sort partitions in
> load options. After this modification, the sort scope specified in load
> options will be ignored and the sort scope will be taken from the table
> properties.
> Current logic (sort scope comes from load options):
> 1. isSortTable is true && sort scope is Global Sort -> Global Sort (first
>    check)
> 2. isSortTable is false -> No Sort
> 3. isSortTable is true -> Local Sort
> Tips: isSortTable is true means the table contains sort columns or it
> contains dimensions (except complex types), like string type.
> For example:
> CREATE TABLE xxx1 (col1 string, col2 int) STORED BY 'carbondata' -- sort table
> CREATE TABLE xx1 (col1 int, col2 int) STORED BY 'carbondata' -- not a sort table
> CREATE TABLE xx (col1 int, col2 string) STORED BY 'carbondata'
> TBLPROPERTIES ('sort_column'='col1') -- sort table
> New logic (sort scope comes from create table):
> 1. isSortTable is true && sort scope is Global Sort -> Global Sort (first
>    check)
> 2. isSortTable is false || sort scope is No Sort -> No Sort
> 3. isSortTable is true && sort scope is Local Sort -> Local Sort
> 4. isSortTable is true, without a specified sort scope -> Local Sort (keeps
>    the current logic)
> 3 Acceptance criteria
> 1. Users can specify the sort scope (global, local, no sort) when creating a
>    carbon table in SQL.
> 2. Load data will ignore the sort scope specified in load options and use
>    the parameter specified in the create table command. If users still
>    specify the sort scope in load options, a warning will be given informing
>    them that the sort scope specified in create table will be used.
>
> Here is my JIRA: 
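
Option 3 in this thread reduces to a simple precedence chain: the per-load
option wins, then the table property set at create time, then a system
default. A minimal sketch; the names and the default value are illustrative:

def resolveSortScope(loadOptions: Map[String, String],
                     tableProperties: Map[String, String]): String =
  loadOptions.get("sort_scope")                // highest priority: per-load override
    .orElse(tableProperties.get("sort_scope")) // default chosen at create time
    .getOrElse("LOCAL_SORT")                   // fallback for sort tables

// resolveSortScope(Map.empty, Map("sort_scope" -> "GLOBAL_SORT")) == "GLOBAL_SORT"
// resolveSortScope(Map("sort_scope" -> "NO_SORT"), Map("sort_scope" -> "GLOBAL_SORT")) == "NO_SORT"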

Re: [DISCUSSION] Propose to move notification of "jira Created" to issues@mailing list from dev

2017-07-04 Thread Raghunandan S
+1

> On 05-Jul-2017, at 9:07 AM, Jacky Li  wrote:
> 
> +1
> 
>> On 4 Jul 2017, at 9:27 PM, Bhavya Aggarwal  wrote:
>> 
>> +1
>> Agreed, these should be two separate mailing lists.
>> 
>> Thanks and Regards
>> Bhavya
>> 
>> On Tue, Jul 4, 2017 at 5:20 PM, Venkata Gollamudi  wrote:
>> 
>>> +1
>>> It is better to be moved
>>> 
>>> Regards,
>>> Venkata Ramana G
>>> 
>>> On Tue, Jul 4, 2017 at 4:40 PM, Kumar Vishal  wrote:
>>> 
>>>> +1
>>>> Better to move to the issues mailing list
>>>>
>>>> Regards
>>>> Kumar Vishal
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 03-Jul-2017, at 15:02, Ravindra Pesala  wrote:
>>>>>
>>>>> +1
>>>>> Yes, we should move to the issues mailing list.
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
>>>>> On 30 June 2017 at 07:35, Erlu Chen  wrote:
>>>>>>
>>>>>> Agreed, we can separate discussion from the "jira Created" notifications.
>>>>>>
>>>>>> It will be better for developers to filter out some unnecessary messages
>>>>>> and focus on discussion.
>>>>>>
>>>>>> Regards.
>>>>>> Chenerlu.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-Propose-to-move-notification-of-jira-Created-to-issues-mailing-list-from-dev-tp16835p16842.html
>>>>>> Sent from the Apache CarbonData Dev Mailing List archive mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Ravi
>>> 
> 
> 
>