Re: [VOTE] Apache CarbonData 2.3.0(RC2) release

2022-01-20 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Wed, 19 Jan 2022 at 23:57, Kunal Kapoor  wrote:

> Hi All,
> I submit the Apache CarbonData 2.3.0(RC2) for your vote.
>
>
> *1.Release Notes:*
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349262=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_e7564140ee4c259084ecff7746af846d0c968ea9_lin
>
> *Some key features and improvements in this release:*
>
>- Support spatial index creation using data frame
>- Upgrade prestosql to 333 version
>- Support Carbondata Streamer tool to fetch data incrementally and merge
>- Support DPP for carbon filters
>- Alter support for complex types
>
>  *2. The tag to be voted upon* : apache-carbondata-2.3.0-rc2
> 
>
> Commit: 6db604a6389673194b30e3c45e7252af6400d54b
> <
> https://github.com/apache/carbondata/commit/6db604a6389673194b30e3c45e7252af6400d54b
> >
>
> *3. The artifacts to be voted on are located here:*
> https://dist.apache.org/repos/dist/dev/carbondata/2.3.0-rc2/
>
> *4. A staged Maven repository is available for review at:*
> https://repository.apache.org/content/repositories/orgapachecarbondata-1074
>
> *5. Release artifacts are signed with the following key:*
> https://people.apache.org/keys/committer/kunalkapoor.asc
>
>
> Please vote on releasing this package as Apache CarbonData 2.3.0,  The
> vote will
> be open for the next 72 hours and passes if a majority of at least three +1
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache CarbonData 2.3.0
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
> [ ] -1 Do not release this package because...
>
>
> Regards,
> Kunal Kapoor
>


-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

2021-09-01 Thread Ravindra Pesala
+1

I want to understand a few things regarding the design.
1. Generally, CDC includes IUD operations, so how are you planning to handle
them? Are you planning to use the merge command? If yes, how frequently do
you want to run the merge?
2. How can you guarantee Kafka's exactly-once semantics (i.e. how can you
ensure the data is written only once, without duplication)?


Regards,
Ravindra.

On Wed, 1 Sep 2021 at 1:48 AM, Akash Nilugal  wrote:

> Hi Community,
>
> OLTP systems like Mysql are used heavily for storing transactional data in
> real-time and the same data is later used for doing fraud detection and
> taking various data-driven business decisions. Since OLTP systems are not
> suited for analytical queries due to their row-based storage, there is a
> need to store this primary data into big data storage in a way that data on
> DFS is an exact replica of the data present in Mysql. Traditional ways for
> capturing data from primary databases, like Apache Sqoop, use pull-based
> CDC approaches which put additional load on the primary databases. Hence
> log-based CDC solutions became increasingly popular. However, there are 2
> aspects to this problem. We should be able to incrementally capture the
> data changes from primary databases and should be able to incrementally
> ingest the same in the data lake so that the overall latency decreases. The
> former is taken care of using log-based CDC systems like Maxwell and
> Debezium. Here we are proposing a solution for the second aspect using
> Apache Carbondata.
>
> Carbondata streamer tool enables users to incrementally ingest data from
> various sources, like Kafka and DFS into their data lakes. The tool comes
> with out-of-the-box support for almost all types of schema evolution use
> cases. Currently, this tool can be launched as a spark application either
> in continuous mode or a one-time job.
>
> Further details are present in the design document. Please review the
> design and help to improve it. I'm attaching the link to the google doc,
> you can directly comment on that. Any suggestions and improvements are most
> welcome.
>
>
> https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing
>
> Thanks
>
> Regards,
> Akash R Nilugal
>
-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 2.2.0(RC2) release

2021-08-04 Thread Ravindra Pesala
+1

Regards,
Ravindra

On Wed, 4 Aug 2021 at 7:23 PM, Liang Chen  wrote:

> +1
>
> Regards
> Liang
>
> Ajantha Bhat wrote on Tue, Aug 3, 2021 at 1:07 PM:
>
> > +1
> >
> > Regards,
> > Ajantha
> >
> > On Mon, Aug 2, 2021 at 9:03 PM Venkata Gollamudi 
> > wrote:
> >
> > > +1
> > >
> > > Regards,
> > > Venkata Ramana
> > >
> > > On Mon, 2 Aug, 2021, 20:18 Kunal Kapoor, 
> > wrote:
> > >
> > > > +1
> > > >
> > > > Regards
> > > > Kunal Kapoor
> > > >
> > > > On Mon, 2 Aug 2021, 4:53 pm Kumar Vishal,  >
> > > > wrote:
> > > >
> > > > > +1
> > > > > Regards
> > > > > Kumar Vishal
> > > > >
> > > > > On Mon, 2 Aug 2021 at 2:28 PM, Indhumathi M <
> indhumathi...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Regards,
> > > > > > Indhumathi M
> > > > > >
> > > > > > On Mon, Aug 2, 2021 at 12:33 PM Akash Nilugal <
> > > akashnilu...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I submit the Apache CarbonData 2.2.0(RC2) for your vote.
> > > > > > >
> > > > > > >
> > > > > > > *1.Release Notes:*
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12347869=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_d44fca7058ab2c2a2a4a24e02264cc701f7d10b8_lin
> > > > > > >
> > > > > > >
> > > > > > > *Some key features and improvements in this release:*
> > > > > > >- Integrate with Apache Spark-3.1
> > > > > > >- Leverage Secondary Index till segment level with SI as
> > datamap
> > > > and
> > > > > > SI
> > > > > > > with plan rewrite
> > > > > > >- Make Secondary Index as a coarse grain datamap and use
> > > secondary
> > > > > > > indexes for Presto queries
> > > > > > >- Support rename SI table
> > > > > > >- Support describe column
> > > > > > >- Local sort Partition Load and Compaction improvement
> > > > > > >- GeoSpatial Query Enhancements
> > > > > > >- Improve the table status and segment file writing
> > > > > > >- Improve the carbon CDC performance and introduce APIs to
> > > UPSERT,
> > > > > > > DELETE, UPDATE and DELETE
> > > > > > >- Improvements clean file and rename performance
> > > > > > >
> > > > > > > *2. The tag to be voted upon:* apache-carbondata-2.2.0-rc2
> > > > > > >
> > > >
> https://github.com/apache/carbondata/tree/apache-carbondata-2.2.0-rc2
> > > > > > >
> > > > > > > Commit: c3a908b51b2f590eb76eb4f4d875cd568dbece40
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/carbondata/commit/c3a908b51b2f590eb76eb4f4d875cd568dbece40
> > > > > > >
> > > > > > >
> > > > > > > *3. The artifacts to be voted on are located here:*
> > > > > > > https://dist.apache.org/repos/dist/dev/carbondata/2.2.0-rc2
> > > > > > >
> > > > > > > *4. A staged Maven repository is available for review at:*
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1071/
> > > > > > >
> > > > > > >
> > > > > > > Please vote on releasing this package as Apache CarbonData
> 2.2.0,
> > > > The
> > > > > > vote
> > > > > > > will be open for the next 72 hours and passes if a majority of
> at
> > > > least
> > > > > > > three +1
> > > > > > > PMC votes are cast.
> > > > > > >
> > > > > > > [ ] +1 Release this package as Apache CarbonData 2.2.0
> > > > > > >
> > > > > > > [ ] 0 I don't feel strongly about it, but I'm okay with the
> > release
> > > > > > >
> > > > > > > [ ] -1 Do not release this package because...
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Akash R Nilugal
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 2.2.0(RC1) release

2021-07-08 Thread Ravindra Pesala
-1

I suggest that PR 4148 be merged before the release.

Regards,
Ravindra.

On Thu, 8 Jul 2021 at 5:04 PM, Jacky Li  wrote:

> -1,
>
> I suggest following PR to be merged before release
> #4148
> #4157
> #4158
> #4162
>
> Regards,
> Jacky Li
>
>
> > On Jul 6, 2021, at 3:14 PM, Akash Nilugal wrote:
> >
> > Hi All,
> >
> > I submit the *Apache CarbonData 2.2.0(RC1) *for your vote.
> >
> >
> >
> > *1. Release Notes:*
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12347869=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_386c7cf69a9d53cc8715137e7dba91958dabef9b_lin
> >
> > *Some key features and improvements in this release:*
> >
> >   - Integrate Carbondata with spark-3.1
> >   - Leverage Secondary Index till segment level with SI as datamap and SI
> > with plan rewrite
> >   - Make Secondary Index as a coarse grain datamap and use secondary
> > indexes for Presto queries
> >   - Support rename SI table
> >   - Local sort Partition Load and Compaction improvement
> >   - GeoSpatial Query Enhancements
> >   - Improve the table status and segment file writing
> >
> > *2. The tag to be voted upon*: apache-carbondata-2.2.0-rc1
> > 
> >
> > *Commit: *d4e5d2337164b34fa19a42a40c03da26ff65ab9e
> > <
> >
> https://github.com/apache/carbondata/commit/d4e5d2337164b34fa19a42a40c03da26ff65ab9e
> >>
> >
> >
> > *3. The artifacts to be voted on are located here:*
> > https://dist.apache.org/repos/dist/dev/carbondata/2.2.0-rc1/
> >
> > *4. A staged Maven repository is available for review at:*
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1070/
> >
> >
> > Please vote on releasing this package as Apache CarbonData 2.2.0,  The
> > vote will
> > be open for the next 72 hours and passes if a majority of at least three
> +1
> > PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache CarbonData 2.2.0
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Akash R Nilugal
> >
>
> --
Thanks & Regards,
Ravi


Re: [Design Discussion] Transaction manager, time travel and segment interface refactoring

2021-04-28 Thread Ravindra Pesala
+1

A much-needed feature and interface refactoring. Thanks for working on it.

Regards,
Ravindra.

On Thu, 22 Apr 2021 at 2:36 PM, Ajantha Bhat  wrote:

> Hi All,
> In this thread, I am continuing the below discussion along with the
> Transaction Manager and Time Travel feature design.
>
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Taking-the-inputs-for-Segment-Interface-Refactoring-td101950.html
>
> The goal of this requirement is as follows.
>
>1.
>
>Implement a “Transaction Manager” with optimistic concurrency to provide
>within a table transaction/versioning. (interfaces should also be
>flexible enough to support across table transactions)
>2.
>
>Support time travel in carbonData.
>3.
>
>Decouple and clean up segment interfaces. (which should also help in
>supporting segment concepts to other open formats under carbonData
> metadata
>service)
>
>
> The design document is attached in JIRA.
> JIRA link: https://issues.apache.org/jira/browse/CARBONDATA-4171
> GoogleDrive link:
>
> https://docs.google.com/document/d/1FsVsXjj5QCuFDrzrayN4Qo0LqWc0Kcijc_jL7pCzfXo/edit?usp=sharing
>
> Please have a look. suggestions are welcome.
> I have mentioned some TODO in the document, I will be updating it in the V2
> version soon.
> Implementation will be done by adding subtasks under the same JIRA.
>
> Thanks,
> Ajantha
>
-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 2.1.1(RC2) release

2021-03-29 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Fri, 26 Mar 2021 at 11:02 PM, Indhumathi  wrote:

> +1
>
> Regards
> Indhumathi M
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
-- 
Thanks & Regards,
Ravi


Re: Improve carbondata CDC performance

2021-03-11 Thread Ravindra Pesala
+1
Instead of doing a cartesian join, we can broadcast the sorted min/max values
along with the file paths and do a binary search inside the map function.
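
Something along these lines would do it (just a sketch in spark-shell style,
assuming a SparkSession named `spark`; the FileRange shape and the key values
are illustrative, not the actual CDC code):

  // Target-side file/block min-max, sorted by min; assumed non-overlapping here.
  case class FileRange(min: Long, max: Long, path: String)
  val ranges = Array(FileRange(0L, 99L, "part-0"), FileRange(100L, 199L, "part-1")).sortBy(_.min)
  val bRanges = spark.sparkContext.broadcast(ranges)

  // Source (change) keys: binary-search the broadcast ranges inside the map
  // function instead of joining every source row with every target file.
  val matched = spark.sparkContext.parallelize(Seq(42L, 150L, 500L)).flatMap { key =>
    val rs = bRanges.value
    var lo = 0; var hi = rs.length - 1; var hit: Option[(Long, String)] = None
    while (lo <= hi && hit.isEmpty) {
      val mid = (lo + hi) >>> 1
      if (key < rs(mid).min) hi = mid - 1
      else if (key > rs(mid).max) lo = mid + 1
      else hit = Some((key, rs(mid).path))
    }
    hit
  }
  matched.collect().foreach(println)   // (42, part-0) and (150, part-1); 500 is pruned out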

Thank you

On Wed, 24 Feb 2021 at 13:02, akashrn5  wrote:

> Hi Venu,
>
> Thanks for your review.
>
> I have replied the same in the document.
> you are right
>
> 1. its taken care to group by extended blocklets on split path and get the
> min-max on block level
> 2. we need to do group by on the file path to avoid the duplicates from
> dataframe output. I have updated the same in the doc please have a look.
>
> Thanks,
> Akash R
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION]Improve Simple updates and delete performance in carbondata

2020-12-10 Thread Ravindra Pesala
+1
I am looking forward to this feature, as most update/delete operations are
simple and this can simplify the flow and improve performance as well.
Thank you.

On Thu, 19 Nov 2020 at 19:41, Akash Nilugal  wrote:

> Hi Community,
>
> Carbondata supports update and delete using spark. So basically update is
> delete + Insert, and delete is just delete
> But we use spark APIs or actions on collections that use spark jobs to do
> them, like map, partition etc
> So Spark adds overhead of task serialization cost, total job execution in
> remote nodes, shuffle etc
> So even just for simple updates, Carbon takes a lot of time, and the same
> for delete as well due to these overheads.
>
> Carbondata 2.1.0 supports update and delete for SDK. This is implemented at
> the carbon file format level
>
> so we can reuse the same for simple updates and deletes and avoid spark
> completely and can perform simple update
>
> and delete on transactional tables using simple java code. This helps to
> avoid all the overhead of spark and make
>
> updates and deletes faster.
>
> I have added an initial V1 design document, please check and give
> comments/inputs/suggestions.
>
>
> https://docs.google.com/document/d/1-M6xPKZG8l6yAu0c9qo3jdUKhpXHWgUR-h8HeUUmk8M/edit?usp=sharing
>
> Thanks,
>
> Regards,
> Akash R Nilugal
>


-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 2.1.0(RC2) release

2020-11-04 Thread Ravindra Pesala
+1

On Wed, 4 Nov 2020 at 12:17 PM, Kunal Kapoor 
wrote:

> Hi All,
>
> I submit the Apache CarbonData 2.1.0(RC2) for your vote.
>
>
> *1.Release Notes:*
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12347868=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_1b35cd2d01110783b464000df174503b2dc55b0c_lin
>
> *Some key features and improvements in this release:*
>
>- Support Float and Decimal in the Merge Flow
>- Implement delete and update feature in carbondata SDK.
>- Support array with SI
>- Support IndexServer with Presto Engine
>- Insert from stage command support partition table.
>- Implementing a new Reindex command to repair the missing SI Segments
>- Support Change Column Comment
>- Presto complex type read support
>-  SI global sort support
>
>  *2. The tag to be voted upon* : apache-carbondata-2.1.0-rc2
> 
>
> Commit: 52b5a2a08b00a4c7cdf34801201f4fa5393b3700
> <
> https://github.com/apache/carbondata/commit/52b5a2a08b00a4c7cdf34801201f4fa5393b3700
> >
>
> *3. The artifacts to be voted on are located here:*
> https://dist.apache.org/repos/dist/dev/carbondata/2.1.0-rc2/
>
> *4. A staged Maven repository is available for review at:*
> https://repository.apache.org/content/repositories/orgapachecarbondata-1065
>
> *5. Release artifacts are signed with the following key:*
> https://people.apache.org/keys/committer/kunalkapoor.asc
>
>
> Please vote on releasing this package as Apache CarbonData 2.1.0,
> The vote will be open for the next 72 hours and passes if a majority of at
> least three +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache CarbonData 2.1.0
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
> [ ] -1 Do not release this package because...
>
>
> Regards,
> Kunal Kapoor
>
-- 
Thanks & Regards,
Ravi


Re: [ANN] Indhumathi as new Apache CarbonData committer

2020-10-07 Thread Ravindra Pesala
Congrats Indumathi !


On Wed, 7 Oct 2020 at 10:29 AM, manish gupta 
wrote:

> Congratulations Indumathi
>
> Regards
> Manish Gupta
>
> On Wed, 7 Oct 2020 at 10:23 AM, brijoobopanna 
> wrote:
>
> > Congrats Indhumathi, best of luck for your new role in the community
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-24 Thread Ravindra Pesala
Hi Vikram,

+1

It is good to remove the automatic cleanup.
But I am still worried about the clean files command executed by the user as
well. We need to enhance the clean files command to introduce a dry run that
prints which segments are going to be deleted and which are kept. If the user
is OK with the dry-run result, then he can go for the actual run.
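
For example, something like this from the user side (a sketch in spark-shell
style; the table name and the 'dryrun' option name are illustrative, not a
committed syntax):

  // Dry run: only report which segments would be cleaned, delete nothing.
  spark.sql("CLEAN FILES FOR TABLE db.sales OPTIONS('dryrun'='true')").show(false)
  // If the reported list looks right, do the actual run.
  spark.sql("CLEAN FILES FOR TABLE db.sales")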

Regards,
Ravindra.

On Mon, 21 Sep 2020 at 1:27 PM, Vikram Ahuja 
wrote:

> Hi Ravi and David,
>
>
>
> 1. All the automatic clean data in the case of load/insert/compact/delete
>
> will be removed, so cleaning will only happen when the clean files command
>
> is called.
>
>
>
> 2. We will only add the data to trash when we try to clean data which is in
>
> IN PROGRESS state. In case of COmpacted/Marked For Delete it will not be
>
> moved to the trash, it will be directly deleted. The user will only be able
>
> to recover the In Progress segments if the user wants. @Ravi -> Is this
>
> okay for trash usage? Only using it for in progress segments.
>
>
>
> 3. No trash management will be implemented, the data will ONLY BE REMOVED
>
> from the trash folder immediately when the clean files command is called.
>
> There will be no time to live, the data can be kept in the trash folder
>
> untill the user triggers clean files command.
>
>
>
> Let me know if you have any questions.
>
>
>
> Vikram Ahuja
>
>
>
> On Fri, Sep 18, 2020 at 1:43 PM David CaiQiang 
> wrote:
>
>
>
> > agree with Ravindra,
>
> >
>
> > 1. stop all automatic clean data in load/insert/compact/update/delete...
>
> >
>
> > 2. when clean files command clean in-progress or uncertain data, we can
>
> > move
>
> > them to data trash.
>
> > it can prevent delete useful data by mistake, we already find this
>
> > issue
>
> > in some scenes.
>
> > other cases(for example clean mark_for_delete/compacted segment)
> should
>
> > not use the data trash folder, clean data directly.
>
> >
>
> > 3. no need data trash management, suggest keeping it simple.
>
> > The clean file command should support empty trash immediately, it
> will
>
> > be enough.
>
> >
>
> >
>
> >
>
> > -
>
> > Best Regards
>
> > David Cai
>
> > --
>
> > Sent from:
>
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
> >
>
>

-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-17 Thread Ravindra Pesala
-1

I don’t see any reason why we should use the trash. How does it change the
behaviour?
1. Are you still going with automatic clean up?
If yes, then you are adding extra time to move the data to the trash (on the
S3 file system).
2. Even if you move the data and keep the time-to-live as 3 days in the trash,
what if the user realises the data is wrong or lost only after that period?

Regards,
Ravindra


On Thu, 17 Sep 2020 at 3:12 PM, Vikram Ahuja 
wrote:

> Hi all,
>
> after all the suggestions the trash folder mechanism in carbondata will be
>
> implemented in 2 phases
>
> Phase1 :
>
> 1. Create a generic trash folder at table level. Trash folders will be
>
> hidden/invisible(like .trash or .recyclebin). The trash folder will be
>
> stored in the table dir.
>
> 2. If we delete any file/folder from a table it will be moved to the trash
>
> folder of that corresponding table (The call for adding to trash will be
>
> added in FileFactory delete api's)
>
> 3. A trash manager will be created, which will keep track of all the files
>
> that have been deleted and moved to the trash and will also maintain the
>
> time when it is deleted. All the trashmanager's api will be called from the
>
> FileFactory class
>
> 4. On clean files command, the trash folders will be cleared if the expiry
>
> time has been met. Each file moved to the trash will have some expiration
>
> time associated with it
>
>
>
> Phase 2: For phase 2 more enhancements are planned, and will be implemented
>
> after the phase 1 is completed. The plan for phase 2 development and
>
> changes shall be posted in this mail thread itself.
>
>
>
>
>
> Thanks
>
> Vikram Ahuja
>
>
>
>
>
> On Wed, Sep 16, 2020 at 8:43 AM PickUpOldDriver 
>
> wrote:
>
>
>
> > Hi Vikram,
>
> >
>
> > I agree to build a trash folder, +1.
>
> >
>
> > Currently, the data loading/compaction/update/merge flow has automatic
>
> > cleaning files actions, but they are written separately.  Most of them
> are
>
> > aimed at deleting the stale segments(MARKED_FOR_DELETE/COMPACTED). And
> they
>
> > rely on the precise of the table status. If you could build a general
> clean
>
> > file function, it can be applied to substitute the current automatic
>
> > deletion for stale folders.
>
> >
>
> > Besides, having a trash folder handle by Carbondata will be good, we can
>
> > find the deleted segments by this API.
>
> >
>
> > And I think we should also consider the status of INSERT_IN_PROGERSS &
>
> > INSERT_OVERWRITE _IN_PROGRESS
>
> >
>
> >
>
> >
>
> >
>
> > --
>
> > Sent from:
>
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
> >
>
>

-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-15 Thread Ravindra Pesala
+1 to Vishal's proposal.
It is not safe to clean up automatically without ensuring data integrity.
Let’s enhance the clean command to do a sanity check before removing anything.
Deleting data should be an administrative action, not an automatic framework
feature; the user can invoke it when he needs to delete data.

Regards,
Ravindra.

On Tue, 15 Sep 2020 at 10:50 PM, akashrn5  wrote:

> Hi David,
>
>
>
> 1. we cannot remove the code of clean up from all commands, because in case
>
> of any failures if we do not clean the stale files, there can be issues of
>
> wrong data or extra data.
>
>
>
> What i think is, we are calling the APIs which does may be say X amount of
>
> work, but we may just need some Y amount of clean up to be done (X >Y ). So
>
> may be what we can do is refactor in a proper way, just to delete or clean
>
> only the required files or folders specific to that command and not call
> the
>
> general or common clean up APIs which creates problem for us.
>
>
>
> 2. Yes, i agree that no need to clean up in progress in commads.
>
>
>
> Regards,
>
> AKash R Nilugal
>
>
>
>
>
>
>
> --
>
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
> --
Thanks & Regards,
Ravi


Re: [DISCUSSION] Parallel compaction and update

2020-09-14 Thread Ravindra Pesala
Hi Nihal,

I appreciate the design, but I don’t want to implement features without a
proper segment interface in place. Without the segment refactoring, trying to
implement this kind of feature will make the code more dirty.

Once we bring proper segment interfacing and transaction management into
place, we can make parallel execution simpler and less error prone.

Regards,
Ravindra.

On Mon, 14 Sep 2020 at 10:31 PM, Nihal  wrote:

> Dear community,
>
>
>
> This mail is regarding the parallel compaction and update.
>
> Current behavior: Currently we are not supporting concurrent compaction and
>
> update because It may cause data inconsistency or incorrect result.
>
> We take the compaction and update lock before any of these operations.
>
> Because of this behavior if one is executing then others have to wait and
>
> sometimes this waiting time is very long.
>
>
>
> To come out with this problem we are planning to support parallel
> compaction
>
> and update. And here I have proposed one of the solutions to implement this
>
> feature.
>
> Paraller_Compaction_And_Update.pdf
>
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t443/Paraller_Compaction_And_Update.pdf>
>
>
> Please go through this solution document and provide your input if this
>
> approach is ok or any drawback is there.
>
>
>
>
>
> Thanks & Regards
>
> Nihal kumar ojha
>
>
>
>
>
>
>
> --
>
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
> --
Thanks & Regards,
Ravi


Re: [Discussion] Update feature enhancement

2020-09-14 Thread Ravindra Pesala
+1
Partition loading already uses a new segment to write the update delta
data.

It is better to make this consistent across all flows; creating a new segment
simplifies the design.



On Mon, 14 Sep 2020 at 1:48 AM, Venkata Gollamudi 
wrote:

> Hi David,
>
> +1
>
>
>
> Initially when segments concept is started, it is viewed as a folder which
>
> is incrementally added with time, so that data retention use-cases like
>
> "delete segments before a given date" were thought of. In that case if
>
> updated records are written into new segment, then old records will become
>
> new records and retention model will not work on that data. So update
>
> records were written to the same segment folder.
>
>
>
> But later as the partition concept was introduced, that will be a clean
>
> method to implement retention or even using a delete by time column is a
>
> better method.
>
> So inserting new records into the new segment makes sense.
>
>
>
> Only disadvantage can be later supporting one column data update/replace
>
> feature which Likun was mentioning previously.
>
>
>
> So to generalize, update feature can support inserting the updated records
>
> to new segment. The logic to reload indexes when segments are updated can
>
> still be there, however when there is no insert of data to old segments,
>
> reload of indexes needs to be avoided.
>
>
>
> Increasing the number of segments need not be a reason for this to go
>
> ahead, as the problem of increasing segments anyway is a problem and needs
>
> to be solved using compaction either horizontal or vertical. Also
>
> optimization of segment file storage either filebased or DB based(embedded
>
> or external) for too big deployments needs to be solved independently.
>
>
>
> Regards,
>
> Ramana
>
>
>
> On Sat, Sep 5, 2020 at 7:58 AM Ajantha Bhat  wrote:
>
>
>
> > Hi David. Thanks for proposing this.
>
> >
>
> > *+1 from my side.*
>
> >
>
> > I have seen users with 200K segments table stored in cloud.
>
> > It will be really slow to reload all the segments where update happened
> for
>
> > indexes like SI, min-max, MV.
>
> >
>
> > So, it is good to write as a new segment
>
> > and just load new segment indexes. (try to reuse this flow
>
> > UpdateTableModel.loadAsNewSegment
>
> > = true)
>
> >
>
> > and user can compact the segments to avoid many new segments created by
>
> > update.
>
> > and we can also move the compacted segments to table status history I
> guess
>
> > to avoid more entries in table status.
>
> >
>
> > Thanks,
>
> > Ajantha
>
> >
>
> >
>
> >
>
> > On Fri, Sep 4, 2020 at 1:48 PM David CaiQiang 
>
> > wrote:
>
> >
>
> > > Hi Akash,
>
> > >
>
> > > 3. Update operation contain a insert operation.  Update operation
>
> > will
>
> > > do the same thing how the insert operation process this issue.
>
> > >
>
> > >
>
> > >
>
> > > -
>
> > > Best Regards
>
> > > David Cai
>
> > > --
>
> > > Sent from:
>
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
> > >
>
> >
>
> --
Thanks & Regards,
Ravi


Re: [Discussion] SI support Complex Array Type

2020-08-02 Thread Ravindra Pesala
Hi All,

+1 for solution 2, but don't store the row id, as it makes the storage very
big and gives very slow performance. Let's go with the current SI model,
which stores references only down to the blocklet level; don't make things
complicated by storing the row id.
Solution 1 makes the scan slower as it needs to construct the complex row for
every row, so it is better to flatten the data out to get better scan
performance and storage optimization.

Consider the following approach (a small sketch follows below).
*Array:* Flatten out each row into multiple rows and store references down to
the blocklet id.
*Struct:* It is up to the user which element exactly he wants to index. For
example, for *emp:struct*, the user can create a separate SI on individual
fields such as *emp.name* or *emp.address*.
*Map:* Here also we can flatten out the data like Array, but the user should
choose whether he wants the SI on the Map key or the value. If he wants both,
then he can create separate SIs.
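
A rough sketch of the Array case (plain Scala; the names and shapes are
illustrative only):

  // One main-table row carries a blocklet-level position and an array column.
  case class MainRow(blockletId: String, tags: Seq[String])
  val mainRows = Seq(
    MainRow("seg0/blocklet0", Seq("a", "b")),
    MainRow("seg0/blocklet1", Seq("b", "c")))

  // SI storage: one row per array element, keeping only (value, blockletId);
  // no row id, as discussed above.
  val siRows = mainRows.flatMap(r => r.tags.map(tag => (tag, r.blockletId))).distinct

  // A filter such as array_contains(tags, 'b') then prunes to the matching blocklets.
  val prunedBlocklets = siRows.filter(_._1 == "b").map(_._2).distinct
  println(prunedBlocklets)   // List(seg0/blocklet0, seg0/blocklet1)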

Regards,
Ravindra.

On Thu, 30 Jul 2020 at 17:35, Ajantha Bhat  wrote:

> Hi David & Indhumathi,
> Storing Array of String as just String column in SI by flattening [with row
> level position reference] can result in slow performance in case of
> * Multiple array_contains() or multiple array[0] = 'x'
> * The join solution mentioned can result in multiple scan (once for every
> complex filter condition) which can slow down the SI performance.
> * Row level SI can slow down SI performance when the filter results huge
> value.
> * To support multiple SI on a single table, complex SI will become row
> level position reference and primitive will become blocklet level position
> reference. Need extra logic /time for join.
> * Solution 2 cannot support struct column SI in the future. So, it cannot
> be a generic solution.
>
> Considering the above points, *solution2 is a very good solution if only
> one filter exist* for complex column. *But not a good solution for all the
> scenarios.*
>
> *So, I have to go with solution1 or need to wait for other people opinions
> or new solutions.*
>
> Thanks,
> Ajantha
>
> On Thu, Jul 30, 2020 at 1:19 PM David CaiQiang 
> wrote:
>
> > +1 for solution2
> >
> > Can we support more than one array_contains by using SI join (like SI on
> > primitive data type)?
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


-- 
Thanks & Regards,
Ravi


Re: [Disscuss] The precise of timestamp is limited to millisecond in carbondata, which is incompatiable with DB

2020-07-15 Thread Ravindra Pesala
Hi,

I think this is bigger than just changing to DateTimeFormatter. As of now,
Carbon uses only 64 bits to store a timestamp, so it can accommodate precision
only down to milliseconds. In order to support nanoseconds, we need to use 96
bits. If you check Spark's Parquet support, it uses 96 bits to store
timestamps; it would be good if we also go in that direction to support
nanosecond precision. Thank you.
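
For reference, a rough sketch of that 96-bit layout as Parquet does it (8 bytes
for the nanoseconds within the day plus 4 bytes for the Julian day number);
this is only an illustration, not CarbonData code:

  import java.nio.{ByteBuffer, ByteOrder}

  // Encode: 12 bytes = nanos-of-day (long) followed by Julian day (int), little-endian.
  def encodeInt96(julianDay: Int, nanosOfDay: Long): Array[Byte] = {
    val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    buf.putLong(nanosOfDay)
    buf.putInt(julianDay)
    buf.array()
  }

  // Decode back into (julianDay, nanosOfDay).
  def decodeInt96(bytes: Array[Byte]): (Int, Long) = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanos = buf.getLong()
    (buf.getInt(), nanos)
  }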

Regards,
Ravindra.

On Wed, 15 Jul 2020 at 14:50, Zhangshunyu  wrote:

> +1
>
>
>
> -
> My English name is Sunday
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


-- 
Thanks & Regards,
Ravi


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread Ravindra Pesala
Hi,

+1
I agree with Vishal; let's deprecate the configuration and keep it as an
internal property.

Regards,
Ravindra.

On Fri, 10 Jul 2020 at 01:54, Ajantha Bhat  wrote:

> Hi,
> I didn't reply to deprecation. *+1 for deprecating it*.
>
> *And +1 for issue fix also.*
> Issue fix, I didn't mean when *carbon.merge.index.in
> .segment = false.*
>
> but when when *carbon.merge.index.in
> .segment = true and merge index creation
> failed for some reason.*
> code needs to take care of
> a. Moving index files from temp folder to final folder in case of partition
> table load.
> b. Not failing the current partition load.  (same as normal load behavior)
> I think these two are not handled after partition optimization, you can
> check and handle it.
>
>
> Thanks,
> Ajantha
>
> On Thu, 9 Jul, 2020, 9:29 pm Akash r,  wrote:
>
> > Hi,
> >
> > +1, we can deprecate it and as Vishal suggested we can keep as internal
> > property for developer purpose.
> >
> > Regards,
> > Akash R Nilugal
> >
> > On Thu, Jul 9, 2020, 2:46 PM VenuReddy 
> wrote:
> >
> > > Dear Community.!
> > >
> > > Have recently encountered a problem of Segment directory and segment
> file
> > > in
> > > metadata directory are not created for partitioned table when
> > > 'carbon.merge.index.in.segment' property is set to 'false'. And actual
> > > index
> > > files which were present in respective partition's '.tmp' directory are
> > > also
> > > deleted without moving them out to respective partition directory where
> > its
> > > '.carbondata' file exist. Thus queries throw exception while reading
> > index
> > > files. Please refer jira issue -
> > > https://issues.apache.org/jira/browse/CARBONDATA-3834
> > > 
> > >
> > > To address this issue, we have 2 options to go with -
> > > 1. Either fix it to work for 'carbon.merge.index.in.segment' set to
> > false
> > > case. There is an open PR
> > https://github.com/apache/carbondata/pull/3776
> > >    for it.
> > >
> > > 2. Or Deprecate the 'carbon.merge.index.in.segment' property itself.
> As
> > > the
> > > query performance is better when merge index files are in use when
> > compared
> > > to normal index files, and default behavior is to generate merge index
> > > files, probably it is not necessary to support
> > > 'carbon.merge.index.in.segment' anymore.
> > >
> > > What do you think about it ? Please let me know your opinion.
> > >
> > > Thanks,
> > > Venu
> > >
> > >
> > >
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
> >
>


-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-17 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Sun, 17 May 2020 at 9:24 PM, Ajantha Bhat  wrote:

> +1
>
> Regards,
> Ajantha
>
>
>
> On Sun, 17 May, 2020, 6:41 pm Jacky Li,  wrote:
>
> > +1
> >
> > Regards,
> > Jacky
> >
> >
> > > On May 17, 2020, at 4:50 PM, Kunal Kapoor wrote:
> > >
> > > Hi All,
> > >
> > > I submit the Apache CarbonData 2.0.0(RC3) for your vote.
> > >
> > >
> > > *1.Release Notes:*
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346046=Html=12320220
> > >
> > > *Some key features and improvements in this release:*
> > >
> > >   - Adapt to SparkSessionExtensions
> > >   - Support integration with spark 2.4.5
> > >   - Support heterogeneous format segments in carbondata
> > >   - Support write Flink streaming data to Carbon
> > >   - Insert from stage command support partition table.
> > >   - Support secondary index on carbon table
> > >   - Support query of stage files
> > >   - Support TimeBased Cache expiration using ExpiringMap
> > >   - Improve insert into performance and decrease memory foot print
> > >   - Support PyTorch and TensorFlow
> > >
> > > *2. The tag to be voted upon* : apache-carbondata-2.0.0-rc3
> > >  >
> > >
> > > Commit: 29d78b78095ad02afde750d89a0e44f153bcc0f3
> > > <
> >
> https://github.com/apache/carbondata/commit/29d78b78095ad02afde750d89a0e44f153bcc0f3
> > >
> > >
> > > *3. The artifacts to be voted on are located here:*
> > > https://dist.apache.org/repos/dist/dev/carbondata/2.0.0-rc3/
> > >
> > > *4. A staged Maven repository is available for review at:*
> > >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1062/
> > >
> > > *5. Release artifacts are signed with the following key:*
> > > https://people.apache.org/keys/committer/kunalkapoor.asc
> > >
> > >
> > > Please vote on releasing this package as Apache CarbonData 2.0.0,
> > > The vote will be open for the next 72 hours and passes if a majority of
> > at
> > > least three +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache CarbonData 2.0.0
> > >
> > > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> > >
> > > [ ] -1 Do not release this package because...
> > >
> > >
> > > Regards,
> > > Kunal Kapoor
> >
> >
>
-- 
Thanks & Regards,
Ravi


Re: [Dissussion] Support FLOAT datatype in the CDC Flow

2020-05-11 Thread Ravindra Pesala
Hi,

CDC can support all primitive data types. If it is failing in a particular
scenario, please raise a JIRA with a proper test case to reproduce the
problem. Thank you.

Regards,
Ravindra.

On Mon, 11 May 2020 at 11:35 AM, haomarch  wrote:

> We don't support FLOAT datatype in the CDC Flow. This is a big issue.
>
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/348669693.jpg>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
-- 
Thanks & Regards,
Ravi


Re: Disable Adaptive encoding for Double and Float by default

2020-03-25 Thread Ravindra Pesala
Hi Ajantha,

I think it is better to fix the problem instead of disabling things. It has
already been observed that the store size increases proportionally; if the
data has more such columns, the impact will be even bigger. Store size
directly impacts query performance in the object-store world. It is better to
find a way to fix it rather than removing things.
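
For example, one possible direction (only a sketch, not the existing code) is
to derive the decimal count numerically instead of via toString()/substring(),
which removes the per-value object allocation shown in the profile:

  // Count decimal digits of a double up to a precision cutoff, without
  // allocating a String per value. A small relative tolerance absorbs binary
  // floating-point noise (e.g. 123.456 * 1000 is not exactly 123456).
  def decimalCount(value: Double, maxPrecision: Int = 5): Int = {
    var scaled = value
    var count = 0
    while (count < maxPrecision &&
           math.abs(scaled - math.rint(scaled)) > 1e-9 * math.max(1.0, math.abs(scaled))) {
      scaled *= 10
      count += 1
    }
    count
  }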

Regards,
Ravindra.

On Wed, 25 Mar 2020 at 5:04 PM, Ajantha Bhat  wrote:

> Hi Ravi, please find the performance readings below.
>
> On TPCH 10GB data, carbon to carbon insert in on HDFS standalone cluster:
>
>
> *By disabling adaptive encoding for float and double.*
> insert is *more than 10% faster* [before 139 seconds, after this it is
> 114 seconds] and
> *saves 25% memory in TLAB*store size *has increased by 10% *[before 2.3
> GB, after this it is 2.55 GB]
>
> Also we have below check. If data is more than 5 decimal precision. we
> don't apply adaptive encoding for double/float.
> So, I am not sure how much it is useful for real-world double precision
> data.
>
> [image: Screenshot from 2020-03-25 14-27-07.png]
>
>
> *Bottleneck is finding that decimal points from every float and double
> value [*PrimitivePageStatsCollector.getDecimalCount(double)*] *
> *where we convert to string and use substring().*
>
> so I want to disable adaptive encoding for double and float by default.
>
> Thanks,
> Ajantha
>
> On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala 
> wrote:
>
>> Hi ,
>>
>> It increases the store size.  Can you give me performance figures with and
>> without these changes.  And also provide how much store size impact if we
>> disable it.
>>
>>
>> Regards,
>> Ravindra.
>>
>> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat 
>> wrote:
>>
>> > Hi all,
>> >
>> > I have done insert into flow profiling using JMC with the latest code
>> > [with new optimized insert flow]
>> >
>> > It seems for *2.5GB* carbon to carbon insert, double and float stats
>> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
>> > buffer)]
>> >
>> > [image: Screenshot from 2020-03-25 11-18-04.png]
>> > *The problem is for every value of double and float in every row, we
>> call *
>> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
>> > every time.*
>> >
>> > So, I want to disable Adaptive encoding for float and double by default.
>> > *I will make this configurable.*
>
>
>> > If some user has a well-sorted double or float column and wants to apply
>> > adaptive encoding on that, they can enable it to reduce store size.
>> >
>> > Thanks,
>> > Ajantha
>> >
>> --
>> Thanks & Regards,
>> Ravi
>>
> --
Thanks & Regards,
Ravi


Re: Disable Adaptive encoding for Double and Float by default

2020-03-25 Thread Ravindra Pesala
Hi,

It increases the store size. Can you give me performance figures with and
without these changes, and also how much the store size is impacted if we
disable it?


Regards,
Ravindra.

On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat  wrote:

> Hi all,
>
> I have done insert into flow profiling using JMC with the latest code
> [with new optimized insert flow]
>
> It seems for *2.5GB* carbon to carbon insert, double and float stats
> collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> buffer)]
>
> [image: Screenshot from 2020-03-25 11-18-04.png]
> *The problem is for every value of double and float in every row, we call *
> *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
> every time.*
>
> So, I want to disable Adaptive encoding for float and double by default.
> *I will make this configurable.*
> If some user has a well-sorted double or float column and wants to apply
> adaptive encoding on that, they can enable it to reduce store size.
>
> Thanks,
> Ajantha
>
-- 
Thanks & Regards,
Ravi


Re: What is the transaction ability of CarbonData? Does it support the transaction like this.

2020-02-17 Thread Ravindra Pesala
Hi,

Yes, you are right. Carbon supports the behaviour you expect: when you run a
query concurrently in another session, it will see either the data from before
the overwrite or the data from after it. It never gives a
`FileNotFoundException`.

Regards,
Ravindra.

On Mon, 17 Feb 2020, 20:14 李书明,  wrote:

> Hi community!
>
> I found CarbonData provides some transaction capabilities, However I have
> some questions about the detail transaction ability which I cannot find in
> the document.
>
> There is a  problem like this:
>   SessionA: insert overwrite tableA partition(dt=A) select * from x;
>   SessionB: select * from tableA
>
> For a common `Spark`(parquet) table, this situation may cause
> `FileNotFoundException` because `overwrite` op drops partitoin `dt=A` when
> SessionB is scanning this file.
>
> However I tested `insert overwrite` of CarbonData table, it still dropped
> the tableA partition(dt=A).
>
> I don’t know whether `CarbonData` source can solve this problem like this,
> the expected result of SessionB could be the data before `overwrite` or the
> complete data after `overwrite` instead of `FileNotFoundException`?
>
>
>
>
>
>
>


Re: Discussion: change default compressor to ZSTD

2020-02-07 Thread Ravindra Pesala
Hi Jacky,

As per the original PR
https://github.com/apache/carbondata/pull/2628, query performance decreased
by 20% ~ 50% compared to snappy, so I am concerned about performance. Please
prepare a proper TPC-H performance report on a regular cluster, as we do for
every release, and decide based on that.
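
Also, even if the default stays as snappy for now, users who care more about
store size can already opt in per table; a sketch, assuming the table-level
'carbon.column.compressor' property and a CarbonData-enabled SparkSession
named `spark`:

  spark.sql(
    """CREATE TABLE lineitem_zstd (l_orderkey BIGINT, l_quantity DOUBLE)
      |STORED AS carbondata
      |TBLPROPERTIES('carbon.column.compressor'='zstd')""".stripMargin)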

Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li  wrote:

> Hi Ajantha,
>
>
> Yes, decoder will use the compressorName stored in ChunkCompressionMeta
> from the file header,
> but I think it is better to put it in the name so that user can know the
> compressor in the shell without reading it by launching engine.
>
>
> In spark, for parquet/orc the file name written
> is:part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
>
>
> In PR3606, I will handle the compatibility.
>
>
> Regards,
> Jacky
>
>
> ------ Original Message ------
> From: "Ajantha Bhat"    Sent: Thursday, Feb 6, 2020, 11:51 PM
> To: "dev"
> Subject: Re: Discussion: change default compressor to ZSTD
>
>
>
> Hi,
>
> 33% is huge a reduction in store size. If there is negligible difference in
> load and query time, we should definitely go for it.
>
> And does user really need to know about what compression is used ? change
> in file name may be need to handle compatibility.
> Already thrift *FileHeader, ChunkCompressionMeta* is storing the compressor
> name. query time decoding can be based on this.
>
> Thanks,
> Ajantha
>
>
> On Thu, Feb 6, 2020 at 4:27 PM Jacky Li 
>  Hi,
> 
> 
>  I compared snappy and zstd compressor using TPCH for carbondata.
> 
> 
>  For TPCH lineitem table:
>               carbon-zstd    carbon-snappy
>  loading (s)  53             51
>  size         795MB          1.2GB
> 
>  TPCH-query:
>  Q1    4.289      8.29
>  Q2    12.609     12.986
>  Q3    14.902     14.458
>  Q4    6.276      5.954
>  Q5    23.147     21.946
>  Q6    1.12       0.945
>  Q7    23.017     28.007
>  Q8    14.554     15.077
>  Q9    28.472     27.473
>  Q10   24.067     24.682
>  Q11   3.321      3.79
>  Q12   5.311      5.185
>  Q13   14.08      11.84
>  Q14   2.262      2.087
>  Q15   5.496      4.772
>  Q16   29.919     29.833
>  Q17   7.018      7.057
>  Q18   17.367     17.795
>  Q19   2.931      2.865
>  Q20   11.347     10.937
>  Q21   26.416     28.414
>  Q22   5.923      6.311
>  sum   283.844    290.704
> 
> 
>  As you can see, after using zstd, table size is 33% reduced comparing
> to
>  snappy. And the data loading and query time difference is negligible.
> So I
>  suggest to change the default compressor in carbondata from snappy to
> zstd.
> 
> 
>  To change the default compressor, we need to:
>  1. append the compressor name in the carbondata file name. So that
> from
>  the file name user can know what compressor is used.
>  For example, file name will be changed from
>    part-0-0_batchno0-0-0-1580982686749.carbondata
> 
> to   part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> 
> or   part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> 
> 
>  2. Change the compressor constant in the CarbonCommonConstants.java file
> to
>  use zstd as default compressor
> 
> 
>  What do you think?
> 
> 
>  Regards,
>  Jacky

-- 
Thanks & Regards,
Ravi


Re: [Discussion] Support Secondary Index on Carbon Table

2020-02-05 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Wed, 5 Feb 2020 at 8:03 PM, Indhumathi M  wrote:

> Hi Community,
>
> Currently we have datamaps like,* default datamaps* which are block and
> blocklet and *coarse grained datamaps* like bloom, and *fine grained
> datamaps* like lucene
> which helps in better pruning during query. What if we introduce another
> kind of datamap which can hold blockletId as index? Initial level, we call
> it as index which
> will work as a child table to the main table like we have MV in our current
> code.
>
> Yes, lets introduce the secondary index to carbon table which will be the
> child table to main table and it can be created on column like we create
> lucene datamap,
> where we give index columns to create index. In a similar way, we create
> secondary index on column, so indexes on these column will be blocklet IDs
> which will
> help in better pruning and faster query when we have a filter query on the
> index column.
>
> Currenlty we will take it as index table and then later part we will make
> it inline to datamap interface.
>
> So design document is attached in JIRA, please give your suggestion/inputs.
>
> JIRA Link: CARBONDATA-3680
> 
>
> Thanks & Regards,
> Indhumathi M
>
-- 
Thanks & Regards,
Ravi


Re: Optimize and refactor insert into command

2020-01-02 Thread Ravindra Pesala
Hi,

+1
It’s long-pending work. Most welcome.

Regards,
Ravindra.

On Fri, 20 Dec 2019 at 7:55 AM, Ajantha Bhat  wrote:

> Currently carbondata "insert into" uses the CarbonLoadDataCommand itself.
> Load process has steps like parsing and converter step with bad record
> support.
> Insert into doesn't require these steps as data is already validated and
> converted from source table or dataframe.
>
> Some identified changes are below.
>
> 1. Need to refactor and separate load and insert at driver side to skip
> converter step and unify flow for No sort and global sort insert.
> 2. Need to avoid reorder of each row. By changing select dataframe's
> projection order itself during the insert into.
> 3. For carbon to carbon insert, need to provide the ReadSupport and use
> RecordReader (vector reader currently doesn't support ReadSupport) to
> handle null values, time stamp cutoff (direct dictionary) from scanRDD
> result.
> 4. Need to handle insert into partition/non-partition table in local sort,
> global sort, no sort, range columns, compaction flow.
>
> The final goal is to improve insert performance by keeping only required
> logic and also decrease the memory footprint.
>
> If you have any other suggestions or optimizations related to this let me
> know.
>
> Thanks,
> Ajantha
>
-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 1.6.1(RC1) release

2019-10-14 Thread Ravindra Pesala
+1


On Sat, 12 Oct 2019 at 9:18 AM, 恩爸 <441586...@qq.com> wrote:

> +1.
> But it would be better to remove CARBONDATA-3540 and CARBONDATA-3544 from
> the release notes; these two improvements are not included in 1.6.1.
>
>
>
>
> ------ Original Message ------
> From: "kunalkapoor [via Apache CarbonData Dev Mailing List archive]"
> <ml+s1130556n85191...@n5.nabble.com>
> Date: Fri, Oct 11, 2019 09:52 PM
> To: "恩爸" <441586...@qq.com>
>
> Subject: Re: [VOTE] Apache CarbonData 1.6.1(RC1) release
>
>
>
> +1
>
> On Mon, Oct 7, 2019, 8:42 PM Raghunandan S <
> [hidden email] wrote:
>
>  Hi
> 
> 
>  I submit the Apache CarbonData 1.6.1 (RC1) for your vote.
> 
> 
>  1.Release Notes:
> 
> 
> 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220version=12345993
> 
>
> 
>Some key features and improvements in this release:
> 
> 
>   1. Supported adding segment to CarbonData table
> 
> 
> 
> 
>  [Behaviour Changes]
> 
>   1. None
> 
> 
>  2. The tag to be voted upon : apache-carbondata-1.6.1-rc1
> (commit:
> 
>  cabde6252d4a527fbfeb7f17627c6dce3e357f84)
> 
> 
> 
> https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.1-rc1
> 
>
> 
> 
>  3. The artifacts to be voted on are located here:
> 
>  https://dist.apache.org/repos/dist/dev/carbondata/1.6.1-rc1/
> 
> 
>  4. A staged Maven repository is available for review at:
> 
> 
> 
> https://repository.apache.org/content/repositories/orgapachecarbondata-1057/
> 
>
> 
> 
>  5. Release artifacts are signed with the following key:
> 
> 
>  *https://people.apache.org/keys/committer/raghunandan.asc*
> 
> 
> 
>  Please vote on releasing this package as Apache CarbonData 1.6.1,
> The vote
> 
> 
>  will be open for the next 72 hours and passes if a majority of
> 
> 
>  at least three +1 PMC votes are cast.
> 
> 
> 
>  [ ] +1 Release this package as Apache CarbonData 1.6.1
> 
> 
>  [ ] 0 I don't feel strongly about it, but I'm okay with the release
> 
> 
>  [ ] -1 Do not release this package because...
> 
> 
> 
>  Regards,
> 
>  Raghunandan.
> 
>
>
>
>

-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-10-07 Thread Ravindra Pesala
Hi Akash,

1. It is better to keep it simple and let the user provide the UDF he wants in
the query. Then there is no need to rewrite the query and no need to provide an
extra granularity property.

3. I got your point on why you want to use an accumulator to get the min/max.
What worries me is that it should not add complexity to generate the min/max,
as we already have this information available. I don't think we should be so
bothered about reading the min/max during the data loading phase, as it is
already a heavy-duty job and adding a few more milliseconds does not do any
harm. But as you mentioned it is easier your way, so we can go ahead with it.


Regards,
Ravindra.

> On 7 Oct 2019, at 5:38 PM, Akash Nilugal  wrote:
> 
> Hi Ravi,
> 
> 1. i) During create datamap, in ctas query, user does not mention udf, so if 
> granularity is present in DM properties, then internally we rewrite the ctas 
> query with udf and then load the data to datamap according to current design.
>   ii) but if we say user to give ctas query with udf only, then internally no 
> need to rewite the query, we can just load data to it and avoid giving the 
> granularity in DMproperties.
>   Currently im planning to do first one. Please give your input on this.
> 
> 2. Ok, we will not use the RP management in DMProperties, we will use as 
> separate command and do proper decoupling.
> 
> 3. I think you are referring to the cache pre-priming in index server. 
> Problem with this is that, we wil be not sure whether the cache loaded for 
> the segment or not, because as per pre-priming design, if loading to cache 
> fails after data load to main table, we ignore it as query takes care of it. 
> So we cannot completely rely on this feature for min max.
> So for accumulator, im not calculating again, i just take the minmax before 
> writing index file in dataload and use that in driver to prepare the dataload 
> ranges for datamaps.
> 
> The reason to keep the segment min max in the table status of datamap is 
> that, it will be helful in RP scenarios, second is we will not be missing any 
> data from loading to datamap from main table[if 1st time data came from 1 to 
> 4:15 , then next we get data 5:10 to 6, then there might be chance that we 
> can miss 15minutes of data from 4 to 4:15]. It will be helpful in querying 
> also. So that we can avoid the problem i mentioned above with datamaps loaded 
> in cache.
> 
> 4. I agree, your point is valid one. I will do more abalysis on this based on 
> the user use cases and then we can decide finally. That would be better.
> 
> Please give your inputs/suggestions on the above points.
> 
> regards,
> Akash R Nilugal
> 
> On 2019/10/07 03:03:35, Ravindra Pesala  wrote: 
>> HI Akash,
>> 
>> 1. I feel user providing granularity is redundant, he can just provide 
>> respective udf in select query should be enough.
>> 
>> 2. I think it is better to add the RP management now itself, otherwise if 
>> you start adding to DM properties as temporary then it will never be moved. 
>> Better put little more effort to decouple it from datamaps.
>> 
>> 3. I feel accumulator is a added cost, we already have feature in 
>> development to load datamap immediately after load happens, why not use 
>> that? If the datamap is already in memory why we need min/max at segment 
>> level?
>> 
>> 4. I feel there must be some reason why other timeseries db does not support 
>> union of data.  Consider a scenario that we have data from 1pm to 4.30 pm , 
>> it means 4 to 5pm data is still loading.  when user asks the data at hour 
>> level I feel it is safe to give data for 1,2,3 hours data, because providing 
>> 4pm is actually not a complete data. So atleast user comes to know that 4 pm 
>> data is not available and starts querying the low level data if he needs it.
>> I think better get some real uses how user wants this time series data.
>> 
>> Regards,
>> Ravindra.
>> 
>>> On 4 Oct 2019, at 9:39 PM, Akash Nilugal  wrote:
>>> 
>>> Hi Ravi,
>>> 
>>> 1. I forgot to mention the CTAS query in the create datamap statement, i 
>>> have updated the document, during create datamap user can give granularity, 
>>> during query just the UDF. That should be fine right.
>>> 2. I think may be we can mention the RP policy in DM properties also, and 
>>> then may be we provide add RP, drop RP, alter RP for existing and older 
>>> datamaps. RP will be taken as a separate subtask and will be handled in 
>>> later part. That should be fine i tink.
>>> 3. Actually consider a scenario when datamap is already created, then load 
>>> happened to main table, then i use accumulator to get all the min max

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-10-06 Thread Ravindra Pesala
Hi Akash,

1. I feel that the user providing the granularity is redundant; providing the
respective UDF in the select query should be enough.

2. I think it is better to add the RP management now itself; otherwise, if you
start adding it to DM properties as a temporary measure, it will never be
moved. Better to put in a little more effort and decouple it from datamaps.

3. I feel the accumulator is an added cost; we already have a feature in
development to load the datamap immediately after the load happens, so why not
use that? If the datamap is already in memory, why do we need the min/max at
the segment level?

4. I feel there must be some reason why other timeseries DBs do not support
the union of data. Consider a scenario where we have data from 1 pm to 4:30 pm,
which means the 4 to 5 pm data is still loading. When the user asks for data at
the hour level, I feel it is safe to give only the 1, 2 and 3 pm hours, because
the 4 pm data is not yet complete. That way the user at least comes to know
that the 4 pm data is not available and can query the lower-level data if he
needs it.
I think it is better to get some real use cases on how users want this time
series data.

Regards,
Ravindra.

> On 4 Oct 2019, at 9:39 PM, Akash Nilugal  wrote:
> 
> Hi Ravi,
> 
> 1. I forgot to mention the CTAS query in the create datamap statement, i have 
> updated the document, during create datamap user can give granularity, during 
> query just the UDF. That should be fine right.
> 2. I think may be we can mention the RP policy in DM properties also, and 
> then may be we provide add RP, drop RP, alter RP for existing and older 
> datamaps. RP will be taken as a separate subtask and will be handled in later 
> part. That should be fine i tink.
> 3. Actually consider a scenario when datamap is already created, then load 
> happened to main table, then i use accumulator to get all the min max to 
> driver, so that i can avoid reading index file in driver in order to load to 
> datamap. 
> other scenario is when main table already has segments and then 
> datamap is created, the we will read index files from each segments to decide 
> the min max of timestamp column.
> 4. We are not storing min max in main table  table status. We are storing in 
> datamap table's table status file, so that it will be used to prepare the 
> plan during the query phase.
> 
> 5. Other timeseries db supports only getting the data present in hour or day 
> .. aggregated data. Since we cannot miss the data, plan is to get the data 
> like higher to lower. May be it does not make much difference when its from 
> minute to second, but it makes difference from year to month , so that we 
> cannot avoid aggregations from main table.
> 
> 
> Regards,
> Akash R Nilugal
> 
> On 2019/10/04 11:35:46, Ravindra Pesala  wrote: 
>> Hi Akash,
>> 
>> I have following suggestions.
>> 
>> 1. I think it is redundant to use granularity inside create datamap, user 
>> can use the respective granularity UDF in his query like time(1h) or 
>> time(1d) etc.
>> 
>> 2. Better create separate RP commands and let user add the RP on the datamap 
>> or even on the main table also. It would be more manageable if you 
>> independent feature for RP instead of including in datamap.
>> 
>> 3. I am not getting why exactly we need accumulator instead of using index 
>> min/max? Can you explain with some scenario 
>> 
>> 4. Why to store min/max at segment level? We can get from datamap also right?
>> 
>> 4.  Union with high granularity tables to low granularity tables are really 
>> needed? Any other time series DB is doing it? Or any known use case we have?
>> 
>> Regards,
>> Ravindra.
>> 
>>> On 1 Oct 2019, at 5:49 PM, Akash Nilugal  wrote:
>>> 
>>> Hi Babu,
>>> 
>>> Thanks for the inputs. Please find the comments 
>>> 1. I will change from Union to UnionAll
>>> 2. For auto datamap loading, once the data is loaded to lower level 
>>> granularity datamap, then we load the higher level datamap from the lower 
>>> level datamap. But as per your point, i think you are telling to load from 
>>> main table itself.
>>> 3. similar to 2nd point, whether to need configuration or not we can decide 
>>> i think.
>>> 4. a. I think the max of the datamap is required to decide the range for 
>>> the load, because in case of failure case, we may need.
>>> b. This point will be taken care.
>>> 5. Yes, dataload is sync based on current design, as it is non lazy, it 
>>> will happen with main table load only.
>>> 6. Yes, this will be handled.
>>> 7. Already added a task in jira.
>>> On 2019/10/01 08:50:05, babu lal jangir  wrote: 
>>>> Hi Akash, Thanks fo

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-10-04 Thread Ravindra Pesala
Hi Akash,

I have the following suggestions.

1. I think it is redundant to specify the granularity inside create datamap;
the user can use the respective granularity UDF in the query, like time(1h) or
time(1d).

2. Better to create separate RP commands and let the user add the RP on the
datamap or even on the main table. It would be more manageable to have an
independent feature for RP instead of including it in the datamap.

3. I am not getting why exactly we need an accumulator instead of using the
index min/max. Can you explain with some scenario?

4. Why store min/max at the segment level? We can get it from the datamap also,
right?

5. Is a union of high granularity tables with low granularity tables really
needed? Is any other time series DB doing it? Or do we have any known use case?

Regards,
Ravindra.

> On 1 Oct 2019, at 5:49 PM, Akash Nilugal  wrote:
> 
> Hi Babu,
> 
> Thanks for the inputs. Please find the comments 
> 1. I will change from Union to UnionAll
> 2. For auto datamap loading, once the data is loaded to lower level 
> granularity datamap, then we load the higher level datamap from the lower 
> level datamap. But as per your point, i think you are telling to load from 
> main table itself.
> 3. similar to 2nd point, whether to need configuration or not we can decide i 
> think.
> 4. a. I think the max of the datamap is required to decide the range for the 
> load, because in case of failure case, we may need.
> b. This point will be taken care.
> 5. Yes, dataload is sync based on current design, as it is non lazy, it will 
> happen with main table load only.
> 6. Yes, this will be handled.
> 7. Already added a task in jira.
> On 2019/10/01 08:50:05, babu lal jangir  wrote: 
>> Hi Akash, Thanks for Time Series DataMap proposal.
>> Please check below Points.
>> 
>> 1. During Query Planing Change Union to Union All , Otherwise will loose
>> row if same value appears.
>> 2. Whether system start load for next granularity level table as soon it
>> matches the data condition or next granularity level table has to wait till
>> current  granularity level table is finished ? please handle if possible.
>> 3. Add Configuration to load multiple Ranges at a time(across granularity
>> tables).
>> 4. Please check if Current data loading min ,max is enough to find current
>> load . No need to refer the DataMap's min,max because data loading Range
>> prepration can go wrong if loading happens from multiple driver . i think
>> below rules are enough for loading.
>>4.a. Create MV should should sync data.   On any failure Rebuild should
>> sync again till than MV will be disabled.
>>4.b.  Each load has independent Ranges and should load only those
>> ranges. Any failure MV may go in disable state(only if intermediate ranges
>> load is failed ,last loads failure will NOT make MV disable).
>> 5. We can make Data loading sync because anyway queries can be served from
>> fact table if any segments is in-progress in  Datamap.
>> 6. In Data loading Pipleline ,failures in intermediate time series datamap,
>> still we can continue loading next level data. (ignore if already handled).
>>   For Example.
>>DataMaps:- Hour,Day,Month Level
>>Load Data(10 day):- 2018-01-01 01:00:00 to 2018-01-10 01:00:00
>>  Failure in hour level during below range
>>2018-01-06 01:00:00 to 2018-01-06 01:00:00
>> This point of time Hour level has 5 day data.so start loading on day
>> level .
>> 7. Add SubTask to support loading of in-between missing time.(Incremental
>> but old records if timeseries device stopped working for some time).
>> 
>> On Tue, Oct 1, 2019 at 10:41 AM Akash Nilugal 
>> wrote:
>> 
>>> Hi vishal,
>>> 
>>> In the design document, in the impacted analysis section, there is a topic
>>> compatibility/legacy stores, so basically For old tables when the datamap
>>> is created, we load all the timeseries datamaps with different granularity.
>>> I think this should do fine, please let me know for further
>>> suggestions/comments.
>>> 
>>> Regards,
>>> Akash R Nilugal
>>> 
>>> On 2019/09/30 17:09:44, Kumar Vishal  wrote:
 Hi Akash,
 
 In this desing document you haven't mentioned how to handle data loading
 for timeseries datamap for older segments[Existing table].
 If the customer's main table data is also stored based on time[increasing
 time] in different segments,he can use this feature as well.
 
 We can discuss and finalize the solution.
 
 -Regards
 Kumar Vishal
 
 On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal 
 wrote:
 
> Hi Ajantha,
> 
> Thanks for the queries and suggestions
> 
> 1. Yes, this is a good suggestion, i ll include this change. Both date
>>> and
> timestamp columns are supported, will be updated in document.
> 2. yes, you are right.
> 3. you are right, if the day level is not available, then we will try
>>> to
> get the whole day data from hour level, if not availaible, as
>>> explained in
> design document, 

Re: [ANNOUNCE] Ajantha as new Apache CarbonData committer

2019-10-03 Thread Ravindra Pesala
Congrats Ajantha and welcome.

Regards,
Ravindra.

> On 3 Oct 2019, at 8:00 PM, Liang Chen  wrote:
> 
> Hi
> 
> 
> We are pleased to announce that the PMC has invited Ajantha as new Apache 
> CarbonData committer and the invite has been accepted!
> 
> Congrats to Ajantha and welcome aboard.
> 
> Regards
> 
> Apache CarbonData PMC



[DISCUSSION] Support heterogeneous format segments in carbondata

2019-09-10 Thread Ravindra Pesala
Hi All,

 This discussion is regarding support of other formats in carbon. Existing
customers use other formats like Parquet, ORC etc., but if they want to migrate
to carbon there is no proper solution at hand. This feature allows all the old
data to be added as segments to a carbondata table. During query, the old data
is read in its respective format and all new segments are read as carbondata.

I have created the design document and attached to the jira. Please review
it.
https://issues.apache.org/jira/browse/CARBONDATA-3516


-- 
Thanks & Regards,
Ravindra
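
For illustration, a minimal sketch of how adding existing Parquet data as a
segment might look from Spark SQL; the ALTER TABLE ... ADD SEGMENT syntax,
option names and paths are assumptions based on this proposal, not a final
interface:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("mixed-format-segments").getOrCreate()

  // Register an existing parquet folder as one segment of a carbon table, so
  // old data stays in parquet while new loads are written as carbondata.
  spark.sql(
    """ALTER TABLE sales ADD SEGMENT
      |OPTIONS ('path'='hdfs://ns1/warehouse/old_sales_parquet', 'format'='parquet')""".stripMargin)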


Re: [DISCUSSION] implement MERGE INTO statement

2019-08-31 Thread Ravindra Pesala
Hi David,

+1

It is better to follow the Hive syntax rather than having our own. Please check
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge

It would also be better to have a design document explaining the changes to be
made to the current IUD flow.

Regards,
Ravindra.

On Sat, 31 Aug 2019 at 11:44 AM, David Cai  wrote:

> hi all,
> CarbonData has supported the insert/update/delete operations.
> Now we can start to discuss the MERGE INTO statement.
> It should combine insert, update and delete operations into a single
> statement, and it will be executed atomically.
>
> SQL maybe like :
>  MERGE INTO target_table
>  USING source_table
>  ON merge_condition
>  WHEN MATCHED [ AND condition] THEN
>  UPDATE | DELETE ...
>  WHEN NOT MATCHED [ AND condition]  THEN
>  INSERT ...
>
> Any question and suggestion are welcome.
>
> Regards
> David QiangCai
>
-- 
Thanks & Regards,
Ravi
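
For illustration, a minimal sketch of a merge that loosely follows the Hive
grammar linked above, issued through Spark SQL; the table names, the change
flag column and the exact set of supported clauses are assumptions:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("merge-into-sketch").getOrCreate()

  // Apply a change set atomically: delete flagged rows, update matched rows,
  // and insert the rest.
  spark.sql(
    """MERGE INTO target t
      |USING changes c
      |ON t.id = c.id
      |WHEN MATCHED AND c.op = 'D' THEN DELETE
      |WHEN MATCHED THEN UPDATE SET t.value = c.value
      |WHEN NOT MATCHED THEN INSERT VALUES (c.id, c.value)""".stripMargin)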


Re: Adapt to SparkSessionExtensions

2019-08-31 Thread Ravindra Pesala
Hi,

I think it is better to work on the master branch instead of the 2.0 branch. It
will avoid the rebase cost and unnecessary confusion, and it is better to go
with proper version quality.

Regards,
Ravindra.

On Mon, 26 Aug 2019 at 8:13 PM, Jacky Li  wrote:

> I have created branch-2.0, let's work on this feature in this branch.
>
> Regards,
> Jacky
>
> On 2019/08/22 04:58:53, Ajith shetty  wrote:
> > Hi Community
> >
> > From https://issues.apache.org/jira/browse/SPARK-18127 Spark provides
> SparkSessionExtensions in order to extended capabilities of spark. Carbon
> can use this in order to avoid the tight coupling due to CarbonSession in
> spark environment.
> >
> https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> >
> > Main Scope:
> > 1. Compatible with Spark 2.3.2+
> > 2. Make Carbon Parser Pluggable
> > a. Move to antlr4 based parser
> > 3. Make Analyzer Rules Pluggable
> > 4. Make Optimizer Rules Pluggable
> > 5. Make Planning Strategies Pluggable
> >
> > We can have Sub jiras in order to cover all the scenarios due to this.
> Please input your thoughts.
> >
> > Regards
> >
>
-- 
Thanks & Regards,
Ravi
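
For illustration, a minimal sketch of wiring an extensions class through
Spark's standard SparkSessionExtensions hook; the CarbonExtensions class name
is an assumption for the proposed plug-in:

  import org.apache.spark.sql.SparkSession

  // Register the extensions class via the standard Spark config hook (2.2+).
  // Inside such a class, SparkSessionExtensions exposes injectParser,
  // injectResolutionRule, injectOptimizerRule and injectPlannerStrategy, which
  // map to the pluggable parser, analyzer, optimizer and planning items above.
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("carbon-extensions-sketch")
    .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
    .getOrCreate()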


Time travel/versioning on carbondata.

2019-08-23 Thread Ravindra Pesala
Hi All,

CarbonData allows storing data incrementally and doing Update/Delete operations
on the stored data, but the user can only ever access the latest state of the
data at that point in time.

In the current system, it is not possible to access an old version of the data,
and it is not possible to roll back to an old version in case of some issues
with the current version of the data.

This proposal adds automatic versioning of the data that we store, so that we
can access any historical version of that data.


The design is attached on the jira
https://issues.apache.org/jira/browse/CARBONDATA-3500 Please check it.

-- 
Thanks & Regards,
Ravindra.


Re: [DISCUSSION] Cache Pre Priming

2019-08-23 Thread Ravindra Pesala
Hi Akash,

+1 for Vishal's suggestion. Better to focus on syncing the cache after data
load.

Regards,
Ravindra.

On Fri, 23 Aug 2019 at 16:35, Akash Nilugal  wrote:

> Hi vishal,
>
> Your point is correct, we can focus on just loading to cache after data
> load is finished (Async Operation).
> for DDL support, count(*) can be used for all the required tables to load
> into cache.
>
> Regards,
> Akash
>
> On 2019/08/22 14:44:34, Kumar Vishal  wrote:
> > Hi Akash,
> >
> > I think better to PrePrime only after each load(Async).
> >
> > As mentioned in design doc, when index servers is started, if the table
> or
> > db is configured, until and unless all the configured things are loaded
> > into cache, Index server won't be available for query. So query cannot
> get
> > benefit of pre-prime untill all the metadata is loaded to cache.
> > In order to avoid this user can run count(*) after startup to pre-prime
> > only required tables. Any extra ddl is not required as count(*) can be
> used
> > as a DDL to load the cache.
> >
> > -Regards
> > Kumar Vishal
> >
> > On Wed, Aug 21, 2019 at 6:57 PM Akash Nilugal 
> > wrote:
> >
> > > Hi chetan,
> > >
> > > As mentioned in design , loading to cache will be an asyc operation,
> and
> > > we will load only the corresponding segment to cache, so there wont be
> any
> > > hit.
> > > Logs will be added
> > >
> > > On 2019/08/21 13:18:05, chetan bhat  wrote:
> > > > Hi Akash,
> > > >
> > > > 1. Will the performance of end to end dataload operation be impacted
> if
> > > the segment datamap is loaded to cache once the load is finished.
> > > > 2. Will there be a notification in logs stating that the loading of
> > > datamap cache is completed.
> > > >
> > > > Regards
> > > >
> > > > On 2019/08/15 12:03:09, Akash Nilugal 
> wrote:
> > > > > Hi Community,
> > > > >
> > > > > Currently, we have an index server which basically helps in
> distributed
> > > > > caching of the datamaps in a separate spark application.
> > > > >
> > > > > The caching of the datamaps in index server will start once the
> query
> > > is
> > > > > fired on the table for the first time, all the datamaps will be
> loaded
> > > > >
> > > > > if the count(*) is fired and only required will be loaded for any
> > > filter
> > > > > query.
> > > > >
> > > > >
> > > > > Here the problem or the bottleneck is, until and unless the query
> is
> > > fired
> > > > > on table, the caching won’t be done for the table datamaps.
> > > > >
> > > > > So consider a scenario where we are just loading the data to table
> for
> > > > > whole day and then next day we query,
> > > > >
> > > > > so all the segments will start loading into cache. So first time
> the
> > > query
> > > > > will be slow.
> > > > >
> > > > >
> > > > > What if we load the datamaps into cache or preprime the cache
> without
> > > > > waititng for any query on the table?
> > > > >
> > > > > Yes, what if we load the cache after every load is done, what if we
> > > load
> > > > > the cache for all the segments at once,
> > > > >
> > > > > so that first time query need not do all this job, which makes it
> > > faster.
> > > > >
> > > > >
> > > > > Here i have attached the design document for the pre-priming of
> cache
> > > into
> > > > > index server. Please have a look at it
> > > > >
> > > > > and any suggestions or inputs on this are most welcomed.
> > > > >
> > > > >
> > > > >
> > >
> https://drive.google.com/file/d/1YUpDUv7ZPUyZQQYwQYcQK2t2aBQH18PB/view?usp=sharing
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Akash R Nilugal
> > > > >
> > > >
> > >
> >
>


-- 
Thanks & Regards,
Ravi
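
For illustration, a minimal sketch of the count(*)-based pre-priming discussed
above, assuming an existing SparkSession and a list of tables to warm up:

  import org.apache.spark.sql.SparkSession

  // Fire a cheap count(*) on each required table right after its load
  // finishes, so the index server loads the new segment's datamaps into its
  // cache before the first real query arrives.
  def prePrime(spark: SparkSession, tables: Seq[String]): Unit = {
    tables.foreach { table =>
      // count(*) touches all segments, which triggers datamap caching
      spark.sql(s"SELECT COUNT(*) FROM $table").collect()
    }
  }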


Re: [VOTE] Apache CarbonData 1.6.0(RC3) release

2019-08-14 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Tue, 13 Aug 2019 at 17:12, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> Hi
>
>
> I submit the Apache CarbonData 1.6.0 (RC3) for your vote.
>
>
> 1.Release Notes:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344965
>
>
> Some key features and improvements in this release:
>
>
>1. Supported Index Server to distribute the index cache and parallelise
> the
> index pruning.
>
>2. Supported incremental data loading on MV datamaps and stabilised MV.
>
>3. Supported Arrow format from Carbon SDK.
>
>4. Supported read from Hive.
>
>
>
>
> [Behaviour Changes]
>
>1. None
>
>
>  2. The tag to be voted upon : apache-CarbonData-1.6.0-rc3 (commit:
>
> 4729b4ccee18ada1898e27f130253ad06497f1fb)
>
>
> https://github.com/apache/carbondata/releases/tag/apache-CarbonData-1.6.0-rc3
> /
>
>
>
> 3. The artifacts to be voted on are located here:
>
> https://dist.apache.org/repos/dist/dev/carbondata/1.6.0-rc3/
>
>
>
> 4. A staged Maven repository is available for review at:
>
>
> https://repository.apache.org/content/repositories/orgapachecarbondata-1055/
>
>
>
> 5. Release artifacts are signed with the following key:
>
>
> *https://people.apache.org/keys/committer/raghunandan.asc*
>
>
>
> Please vote on releasing this package as Apache CarbonData 1.6.0,  The vote
>
>
> will be open for the next 72 hours and passes if a majority of
>
>
> at least three +1 PMC votes are cast.
>
>
>
> [ ] +1 Release this package as Apache CarbonData 1.6.0
>
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
>
> [ ] -1 Do not release this package because...
>
>
>
> Regards,
>
> Raghunandan.
>


-- 
Thanks & Regards,
Ravi


Re: Apache CarbonData 2 RoadMap Feedback

2019-07-18 Thread Ravindra Pesala
Hi,

Yes, Flink and CarbonData integration will definitely attract more users.
We welcome any contributions in that direction.

Regards,
Ravindra.

On Thu, 18 Jul 2019 at 07:55, 蒋晓峰  wrote:

> Hi Community,
>
>
>
>
>I have already read CarbonData 2 roadmap.I consider that integration
> with Flink of CarbonData 2 features should take more effort to focus on its
> implementation.As we all know,the 1.9 version of Flink will be released at
> the end of this month,which is merged with Blink of Alibaba.Building
> real-time data warehouses through the CarbonData integration of Flink will
> attract many engineers to use CarbonData to add more real-time artificial
> intelligence platform possibilities.It's just my option,and I have great
> interest in build integration with Flink together with you.
>
>
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
> Nicholas



-- 
Thanks & Regards,
Ravi


Re: [Discussion] Roadmap for Apache CarbonData 2

2019-07-18 Thread Ravindra Pesala
Hi Kevin,

Yes, we can improve it. The implementation is closely related to supporting
pre-aggregate datamaps on the streaming table, which we implemented some time
ago, and the same will be reimplemented for the MV datamap soon as well.
The implementation allows using the pre-aggregate datamap for non-streaming
segments and the main table for streaming segments. We update the query plan to
do a union on both tables and query only the streaming segments of the main
table.
So in our case we can use the same approach: do a union query of the MV table
and the main table (only the segments not yet loaded into the datamap) and
execute the query. We can definitely consider this after we support streaming
tables for the MV datamap.

Regards,
Ravindra.

On Wed, 17 Jul 2019 at 07:55, kevinjmh  wrote:

> currently, datamap in carbon applys to all segments.
> The roadmap refers to commands like add/drop segment, and also maybe
> something
> about incremental loading for MV. For these scenes, it is better to make
> datamap can be use on segment level instead of disable the datamap when any
> datamap data is not ready for any segment. Also this can make datamap
> fail-safe and enhance carbon's stablility.
> Maybe we can consider about this also.
>
>
>
>
> -
> Regards
> Manhua
>
>
>
> ---Original---
> From: "Ravindra Pesala"
> Date: Tue, Jul 16, 2019 22:31 PM
> To: "dev";
> Subject: [Discussion] Roadmap for Apache CarbonData 2
>
>
> Hi Community,
>
> Three years have passed since the launching of the Apache CarbonData
> project, CarbonData has become a popular data management solution for
> various scenarios. As new workload like AI and new runtime environment like
> the cloud is emerging quickly, I think we are reaching a point that needs
> to discuss the future of CarbonData.
>
> To bring CarbonData to a new level to satisfy those new requirements, Jacky
> and I drafted a roadmap for CarbonData 2 in the cwiki website.
> - English Version:
>
> https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+2+Roadmap+Proposal
> - Chinese Version:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120737492
>
> Please feel free to discuss the roadmap in this thread, and we welcome
> every feedback to make CarbonData better.
>
> Thanks and Regards,
> Ravindra.



-- 
Thanks & Regards,
Ravi


[Discussion] Roadmap for Apache CarbonData 2

2019-07-16 Thread Ravindra Pesala
Hi Community,

Three years have passed since the launch of the Apache CarbonData project, and
CarbonData has become a popular data management solution for various scenarios.
As new workloads like AI and new runtime environments like the cloud are
emerging quickly, I think we are reaching a point where we need to discuss the
future of CarbonData.

To bring CarbonData to a new level to satisfy those new requirements, Jacky
and I drafted a roadmap for CarbonData 2 in the cwiki website.
- English Version:
https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+2+Roadmap+Proposal
- Chinese Version:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120737492

Please feel free to discuss the roadmap in this thread, and we welcome
every feedback to make CarbonData better.

Thanks and Regards,
Ravindra.


[VOTE] Apache CarbonData 1.6.0(RC1) release

2019-07-15 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.6.0 (RC1) for your vote.

1.Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344965

Some key features and improvements in this release:
   1. Supported Index Server to distribute the index cache and parallelize
the index pruning.
   2. Supported incremental data loading on MV datamaps and stabilized MV.
   3. Supported Arrow format from Carbon SDK.
   4. Supported read from Hive.
[Behavior Changes]
   None

2. The tag to be voted upon : apache-carbondata-1.6.0-rc1 (commit:
9938633c6a80407876c7e7fa0ffd455164edff4b)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.6.0-rc1

3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.6.0-rc1/

4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1053

5. Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/ravipesala.asc

Please vote on releasing this package as Apache CarbonData 1.6.0,  The
vote will
be open for the next 72 hours and passes if a majority of at least three +1
PMC votes are cast.
[ ] +1 Release this package as Apache CarbonData 1.6.0
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


Re: [VOTE] Apache CarbonData 1.5.4(RC1) release

2019-05-29 Thread Ravindra Pesala
Hi all


PMC vote has passed for Apache Carbondata 1.5.4 release, the result as
below:
+1(binding): 4(Jacky, Kumar Vishal, Ravindra, Liang Chen)
+1(non-binding) : 3

Thanks all for your vote.

Regards,
Ravindra.

On Wed, 29 May 2019 at 15:28, Liang Chen  wrote:

> +1
>
> Regards
> Liang
>
> manishgupta88 wrote
> > +1
> >
> > Regards
> > Manish Gupta
> >
> > On Mon, May 27, 2019 at 11:34 AM kanaka 
>
> > kanakakumaravvaru@
>
> >  wrote:
> >
> >> +1
> >>
> >>
> >>
> >> --
> >> Sent from:
> >>
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >>
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.5.4(RC1) release

2019-05-17 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.5.4 (RC1) for your vote.

1.Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12345388

Some key features and improvements in this release:

   1. Supported alter SORT_COLUMNS property on the table to allow changing
sort columns.
   2. Supported configurable page size to control the memory in case of
storing complex types.
   3. Supported Binary Data Type.
   4. Supported compaction on range sorted segments.
[Behavior Changes]
   None

2. The tag to be voted upon : apache-carbondata-1.5.4-rc1 (commit:
1f2e184b81bef4e861b4dd32be94dc50bada6b68)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.4-rc1

3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.5.4-rc1/

4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1050

5. Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/ravipesala.asc

Please vote on releasing this package as Apache CarbonData 1.5.4,  The
vote will
be open for the next 72 hours and passes if a majority of at least three +1
PMC votes are cast.
[ ] +1 Release this package as Apache CarbonData 1.5.4
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


Re: [VOTE] Apache CarbonData 1.5.3(RC1) release

2019-04-08 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Wed, 3 Apr 2019 at 1:23 PM, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> Hi
>
>
> I submit the Apache CarbonData 1.5.3 (RC1) for your vote.
>
>
> 1.Release Notes:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344322
>
>
> Some key features and improvements in this release:
>
>
>1. Supported DDL to operate on CarbonData LRU Cache
>
>2. Improved single, concurrent query performance.
>
>3. Count(*) query performance enhanced by optimising datamaps pruning
>
>4. Supported adding new columns through SDK
>
>5. Presto version upgraded to 0.217
>
>
>
>
> [Behavior Changes]
>
>1. None
>
>
>  2. The tag to be voted upon : apache-carbondata-1.5.3-rc1 (commit:
>
> 7f271d0aba272f9fbe9642a4900cd4da61eb43bb)
>
>
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.3-rc1
>
>
>
> 3. The artifacts to be voted on are located here:
>
> https://dist.apache.org/repos/dist/dev/carbondata/1.5.3-rc1/
>
>
>
> 4. A staged Maven repository is available for review at:
>
>
> https://repository.apache.org/content/repositories/orgapachecarbondata-1039/
>
>
>
> 5. Release artifacts are signed with the following key:
>
>
> *https://people.apache.org/keys/committer/raghunandan.asc*
>
>
>
> Please vote on releasing this package as Apache CarbonData 1.5.3,  The vote
>
>
> will be open for the next 72 hours and passes if a majority of
>
>
> at least three +1 PMC votes are cast.
>
>
>
> [ ] +1 Release this package as Apache CarbonData 1.5.3
>
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
>
> [ ] -1 Do not release this package because...
>
>
>
> Regards,
>
> Raghunandan.
>
-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 1.5.2(RC2) release

2019-02-01 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Wed, 30 Jan 2019 at 10:54 PM, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> Hi
>
>
> I submit the Apache CarbonData 1.5.2 (RC2) for your vote.
>
>
> 1.Release Notes:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344321
>
>
> Some key features and improvements in this release:
>
>
>1. Presto Enhancements like supporting Hive metastore and stabilising
> existing Presto features
>
>2. Supported Range sort for faster data loading and improved point query
> performance.
>
>3. Supported Compaction for no-sort loaded segments
>
>4. Supported rename of column names
>
>5. Supported GZIP compressor for CarbonData files.
>
>6. Supported map data type from DDL.
>
>
> [Behavior Changes]
>
>1. If user doesn’t specify sort columns during table creation, default
> sort scope is set to no-sort during data loading
>
>
>  2. The tag to be voted upon : apache-carbondata-1.5.2-rc2 (commit:
>
> 9e0ff5e4c06fecd2dc9253d6e02093f123f2e71b)
>
>
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.2-rc
> <
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.2-rc2
> >
> 2
>
>
>
> 3. The artifacts to be voted on are located here:
>
> https://dist.apache.org/repos/dist/dev/carbondata/1.5.2-rc2/
>
>
>
> 4. A staged Maven repository is available for review at:
>
>
> https://repository.apache.org/content/repositories/orgapachecarbondata-1038/
>
>
>
> 5. Release artifacts are signed with the following key:
>
>
> *https://people.apache.org/keys/committer/raghunandan.asc*
>
>
>
> Please vote on releasing this package as Apache CarbonData 1.5.2,  The vote
>
>
> will be open for the next 72 hours and passes if a majority of
>
>
> at least three +1 PMC votes are cast.
>
>
>
> [ ] +1 Release this package as Apache CarbonData 1.5.2
>
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
>
> [ ] -1 Do not release this package because...
>
>
>
> Regards,
>
> Raghunandan.
>
-- 
Thanks & Regards,
Ravi


Re: [DISCUSS] Move to gitbox as per ASF infra team mail

2019-01-05 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On Sun, 6 Jan 2019 at 01:45, Kumar Vishal  wrote:

> +1
> Regards
> Kumar Vishal
>
> On Sat, 5 Jan 2019 at 10:23, xuchuanyin  wrote:
>
> > +1
> >
> > seems the committers only need to change the url for asf repo, that's OK.
> >
> > On 5/1/2019 10:08, Liang Chen wrote:
> > > Hi all,
> > >
> > > Background :
> > >
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/NOTICE-Mandatory-migration-of-git-repositories-to-gitbox-apache-org-td72614.html
> > >
> > > Apache Hadoop git repository is in git-wip-us server and it will be
> > > decommissioned, ASF infra is proposing to move to gitbox. This
> > discussion is
> > > for getting consensus, please discuss and vote.
> > >
> > > Regards
> > > Liang
> > >
> > >
> > >
> > >
> > > --
> > > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


-- 
Thanks & Regards,
Ravi


Re: [ANNOUNCE] Chuanyin Xu as new PMC for Apache CarbonData

2019-01-01 Thread Ravindra Pesala
Congrats.

Regards,
Ravindra.

On Wed, 2 Jan 2019 at 05:49, Liang Chen  wrote:

> Hi
>
> We are pleased to announce that Chuanyin Xu as new PMC for Apache
> CarbonData.
>
> Congrats to Chuanyin Xu!
>
> Apache CarbonData PMC
>


-- 
Thanks & Regards,
Ravi


[ANNOUNCE] Apache CarbonData 1.5.1 release

2018-12-04 Thread Ravindra Pesala
Hi,

Apache CarbonData community is pleased to announce the release of the
Version 1.5.1 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookups on detail records, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments; in one of the
largest scenarios it supports queries on a single table with 3PB of data (more
than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.1/, and feedback
through the CarbonData user mailing lists !

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.1?

The intention of CarbonData 1.5.1 was to move closer to unified analytics. We
want to enable CarbonData files to be read from more engines/libraries to
support various use cases. In this regard we have added support to write
CarbonData files from C++ libraries.

CarbonData added multiple optimizations to improve query and compaction
performance.

In this version of CarbonData, more than 78 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.
CarbonData Core
Support Custom Column Compressor

CarbonData supports customized column compressors so that users can add their
own compressor implementation. To customize the compressor, the user can
directly use its full class name while creating the table or set it in the
carbon property.
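
For illustration, a minimal sketch of configuring a compressor per table; the
TBLPROPERTIES key follows the one used for built-in compressors, and the custom
class name is a placeholder:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("compressor-sketch").getOrCreate()

  // Point the table at a custom compressor implementation by its full class name.
  spark.sql(
    """CREATE TABLE sales_custom (id INT, name STRING, amount DOUBLE)
      |STORED BY 'carbondata'
      |TBLPROPERTIES ('carbon.column.compressor'='com.example.MyColumnCompressor')""".stripMargin)
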
Performance Improvements
Optimized CarbonData Scan Performance

CarbonData scan performance is improved by avoiding multiple data copies in
the vector flow. This is achieved by short-circuiting the read and vector
filling: the data is filled directly into the vector after reading it from the
file, without any intermediate copies.

Row-level filter processing is now handled in the execution engine; only
blocklet and page pruning is handled in CarbonData for the vector flow. This is
controlled by the property *carbon.push.rowfilters.for.vector*, which is false
by default.
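
For illustration, a minimal sketch of toggling this behaviour programmatically
through CarbonProperties, using the property named above:

  import org.apache.carbondata.core.util.CarbonProperties

  // Push row-level filters back down to CarbonData (the default described
  // above is false, i.e. row filtering is left to the execution engine).
  CarbonProperties.getInstance()
    .addProperty("carbon.push.rowfilters.for.vector", "true")
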
Optimized Compaction Performance

Compaction performance is optimized through pre-fetching the data while
reading carbon files.
Improved Blocklet DataMap Pruning in Driver

Blocklet DataMap pruning is improved using multi-thread processing in
driver.
CarbonData SDK
SDK Supports C++ Interfaces for Writing CarbonData Files

To enable integration with non-Java-based execution engines, CarbonData
provides a C++ JNI wrapper to write CarbonData files. It can be integrated with
any execution engine to write data to CarbonData files without any dependency
on Spark or Hadoop.
Multi-Thread Read API in SDK

To improve read performance when using the SDK, CarbonData supports
multi-threaded read APIs. This enables applications to read data from multiple
CarbonData files in parallel, which significantly improves SDK read
performance.
Other Improvements

   - Added more CLI enhancements by adding more options.
   - Supported a fallback mechanism: when off-heap memory is not enough, switch
   to on-heap memory instead of failing the job.
   - Supported a separate audit log.
   - Supported batch row read in CSDK to improve performance.

Behavior Change

   - Local dictionary is now enabled by default.
   - Inverted index is now disabled by default.
   - Sort temp files during data loading are now compressed with Snappy by
   default to improve IO.

New Configuration Parameters

Configuration name                            | Default Value | Range
carbon.push.rowfilters.for.vector             | false         | NA
carbon.max.driver.threads.for.block.pruning   | 4             | 1-4


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344320
Sub-task

   - CARBONDATA-2930 - Support customize column compressor
   - CARBONDATA-2981 - Support read primitive data type in CSDK
   - CARBONDATA-2997 - Support read schema from index file and data file in CSDK
   - CARBONDATA-3000 - Provide C++ interface for writing carbon data
   - CARBONDATA-3003 - Support read batch row in CSDK
   - CARBONDATA-3004 - Fix bug in writing dataframe to carbon table while the field order is different
   - CARBONDATA-3038 - Add annotation for carbon properties and mark whether it is a dynamic configuration
   - CARBONDATA-3044 - Handle exception in CSDK
   - CARBONDATA-3056

Re: [VOTE] Apache CarbonData 1.5.1(RC2) release

2018-12-04 Thread Ravindra Pesala
Hi all

PMC vote has passed for Apache Carbondata 1.5.1 release, the result
as below:

+1(binding): 4(Liang Chen, Jacky, Kumar Vishal, Ravindra)

+1(non-binding) : 5

Thanks all for your vote.


Regards,
Ravindra

On Tue, 4 Dec 2018 at 16:57, Bhavya Aggarwal  wrote:

> +1
>
> Regards
> Bhavya
>
> On Tue, Dec 4, 2018 at 10:24 AM Liang Chen 
> wrote:
>
> > Hi
> >
> > +1
> >
> > Regards
> > Liang
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
>
> --
> *Bhavya Aggarwal*
> CTO & Partner
> Knoldus Inc. 
> +91-9910483067
> Canada - USA - India - Singapore
>  
>  
>


-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.5.1(RC2) release

2018-11-30 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.5.1 (RC2) for your vote.

1.Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344320


Some key features and improvements in this release:

   1. Optimized scan performance by avoiding multiple data copies and
   avoided double filtering for spark fileformat and presto by lazy loading
   and page pruning.
   2. Supported customize column compressor
   3. Supported concurrent reading through SDK reader to improve read
   performance.
   4. Supported fallback mechanism when offheap memory is not enough then
   switch to onheap
   5. Supported C++ interface for writing carbon data in CSDK
   6. Supported VectorizedReader for SDK Reader to improve read performance.
   7. Improved Blocklet DataMap pruning in driver using multi-threading.
   8. Make inverted index false by default
   9. Enable Local dictionary by default
   10. Support prefetch for compaction to improve compaction performance.
   11. Many Bug fixes and stabilized Carbondata.


 2. The tag to be voted upon : apache-carbondata-1.5.1-rc2 (commit:
1d1eb7bd625f1af1745c555274dd69298a79ab65)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.1-rc2


3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.5.1-rc2/


4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1036/


5. Release artifacts are signed with the following key:

*https://people.apache.org/keys/committer/ravipesala.asc
*


Please vote on releasing this package as Apache CarbonData 1.5.1,  The vote

will be open for the next 72 hours and passes if a majority of

at least three +1 PMC votes are cast.


[ ] +1 Release this package as Apache CarbonData 1.5.1

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ravindra.


[Discussion] Bloom memory and pruning optimisation using hierarchical pruning.

2018-11-28 Thread Ravindra Pesala
Hi,

Problem:
 The current bloom filter is calculated at the blocklet level, and if the
cardinality of a column is high and many blocklets are loaded, the bloom size
becomes large. In a few of our use cases it has grown to 60 GB, and it may
increase further as the data grows or more bloom datamap columns are added.
 It is not practical to keep this large amount of bloom data in driver memory.
So currently we have an option to launch a distributed job to prune the data
using the bloom datamap, but it takes more time as it needs to load the bloom
data into all executor memories and then prune it. There is also no guarantee
that subsequent queries will reuse the same loaded bloom data from the
executors, as the Spark scheduler does not guarantee it.

Solution: Create a hierarchical bloom index and pruning.

   - We can create the bloom in a hierarchical way, i.e. maintain a bloom at
   the task (carbonindex) level and at the blocklet level.
   - While loading the data we can create a bloom at the task level along with
   the blocklet level. The bloom at the task level is very small compared to
   the bloom at the blocklet level, so we can load it into driver memory.
   - Maintain a footer in the current blocklet-level bloom file to get the
   blocklet bloom offset information for a block. This information will be used
   in the executor to get the blocklet blooms for the corresponding block
   during the query. This footer information is also loaded into driver memory.
   - During pruning, first-level pruning happens at the task level using the
   task-level bloom to get the pruned blocks. Then launch the main job along
   with the respective block bloom information which is already available in
   the footer of the file.
   - In AbstractQueryExecutor, first read the blooms for the respective blocks
   using the footer information sent from the driver and prune the blocklets.

 In this way, we keep less information in memory and also avoid launching
multiple jobs for pruning. We also read only the necessary blocklet blooms
during pruning instead of reading all of them.

This is a draft discussion proposal, and we need to change the datamap pruning
flow design to implement it. But I feel that for these types of coarse-grained
datamaps we should avoid launching a job for pruning. I can produce the design
document after the initial discussion.

-- 
Thanks & Regards,
Ravindra.
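
For illustration, a minimal sketch of the proposed two-level pruning, assuming
a task-level bloom kept in driver memory plus per-blocklet blooms; the data
structures are illustrative, not the on-disk index format:

  import com.google.common.hash.BloomFilter

  case class TaskBloomIndex(taskBloom: BloomFilter[CharSequence],
                            blockletBlooms: Seq[BloomFilter[CharSequence]])

  def pruneBlocklets(tasks: Seq[TaskBloomIndex],
                     filterValue: String): Seq[(Int, Int)] = {
    tasks.zipWithIndex.flatMap { case (task, taskId) =>
      if (!task.taskBloom.mightContain(filterValue)) {
        Nil // first level: the whole task definitely does not contain the value
      } else {
        // second level: keep only blocklets whose bloom may contain the value
        task.blockletBlooms.zipWithIndex.collect {
          case (bloom, blockletId) if bloom.mightContain(filterValue) =>
            (taskId, blockletId)
        }
      }
    }
  }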


Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

2018-11-22 Thread Ravindra Pesala
+1. It will be helpful for pruning millions of data files in less time.
Please try to generalize it for all datamaps.

Thanks & Regards
Ravindra

On Fri, 23 Nov 2018 at 10:24, Ajantha Bhat  wrote:

> @xuchuanyin
> Yes, I will be handling this for all types of datamap pruning in the same
> flow when I am done with default datamap's implementation and testing.
>
> Thanks,
> Ajantha
>
>
>
> On Fri, Nov 23, 2018 at 6:36 AM xuchuanyin  wrote:
>
> > 'Parallelize pruning' is in my plan long time ago, nice to see your
> > proposal
> > here.
> >
> > While implementing this, I'd like you to make it common, that is to say
> not
> > only default datamap but also other index datamaps can also use
> parallelize
> > pruning.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


-- 
Thanks & Regards,
Ravi
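
For illustration, a minimal sketch of driver-side multi-threaded min/max
pruning over an in-memory list of block index entries; the entry type and the
range-overlap filter are illustrative, not the actual datamap interfaces:

  import java.util.concurrent.Executors
  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration.Duration

  case class BlockMinMax(blockPath: String, min: Long, max: Long)

  def pruneBlocksInParallel(blocks: Seq[BlockMinMax],
                            filterMin: Long,
                            filterMax: Long,
                            threads: Int = 4): Seq[BlockMinMax] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val chunkSize = math.max(1, blocks.size / threads)
      val futures = blocks.grouped(chunkSize).toSeq.map { chunk =>
        // each thread keeps only blocks whose [min, max] overlaps the filter
        Future(chunk.filter(b => b.max >= filterMin && b.min <= filterMax))
      }
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }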


[VOTE] Apache CarbonData 1.5.1(RC1) release

2018-11-21 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.5.1 (RC1) for your vote.

1.Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344320

Some key features and improvements in this release:

   1. Optimized scan performance by avoiding multiple data copies and
   avoided double filtering for spark fileformat and presto by lazy loading
   and page pruning.
   2. Added more CLI enhancements by adding more options.
   3. Supported fallback mechanism when offheap memory is not enough then
   switch to onheap
   4. Enable Local dictionary by default
   5. Supported C++ interface for writing carbon data in CSDK
   6. Supported concurrent reading through SDK reader to improve read
   performance.
   7. Supported VectorizedReader for SDK Reader to improve read performance.
   8. Supported customize column compressor
   9. Make inverted index false by default
   10. Support prefetch for compaction to improve compaction performance.
   11. Supported a separate audit log.
   12. Many Bug fixes and stabilized Carbondata.


 2. The tag to be voted upon : apache-carbondata-1.5.1-rc1 (commit:
696e5fe8cc1ac7374c980f5d0ff7d379364f9acf)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.1-rc1


3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.5.1-rc1/


4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1035/


5. Release artifacts are signed with the following key:

*https://people.apache.org/keys/committer/ravipesala.asc
*


Please vote on releasing this package as Apache CarbonData 1.5.1,  The vote

will be open for the next 72 hours and passes if a majority of

at least three +1 PMC votes are cast.


[ ] +1 Release this package as Apache CarbonData 1.5.1

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ravindra.


[Proposal] Thoughts on general guidelines to follow in Apache CarbonData community

2018-11-16 Thread Ravindra Pesala
Hi

Please find our thoughts on the guidelines we can follow in the community
to ensure the quality of Carbondata and make the community more
collaborative. Let us discuss here.


1.Let us discuss all the features in the community before starting
design. Let us attach the design document in the JIRA for future easy
reference

2.Let us have design review meetings for all new features and wait till
at least 3 committers give +1 and approve the design.

3.Let us do the impact analysis on base flows and share our analysis
along with the design review.

4.Let us wait for at least 2 committers to review our code and put
LGTM.

5.Let us discuss and assign release manager role to one of the
committers/PMCs for every version and empower the release manager to
actively track the PRs required for the release scope and also assign the
review owners for them so that PRs are merged timely.

6.Let us have weekly meetings for all important features and let the
feature developers/owners share the progress and other
developments.

7.Let us attach functional, compatibility and performance comparison
[TPCH] reports both in JIRA and PR for all feature requirements

8.Let us develop features or optimizations not aligned with the release
scope in a separate branch.

9.Let us add the mailing list link to the Jira to ensure easy tracking.
Suggest checking all the open Jira before proposing new features or
enhancements.

10. Let us add the JIRA link in the mailing list to ensure easy tracking.

--

Thanks & Regards,
Ravindra


Re: Throw 'NoSuchElementException: None.get' error when use CarbonSession to read parquet.

2018-11-16 Thread Ravindra Pesala
Hi,

I will check and fix it.

Regards,
Ravindra

On Fri, 16 Nov 2018 at 09:24, xm_zzc <441586...@qq.com> wrote:

> Hi:
>   Please help. I used CarbonSession to read parquet and it throws
> 'NoSuchElementException: None.get' error, reading carbondata files is ok.
>   *Env*: local mode, Spark 2.3 + CarbonData(master branch)
>   *Code*:
> import org.apache.spark.sql.CarbonSession._
> val spark = SparkSession
>   .builder()
>   .master("local[1]")
>   .appName("Carbon1_5")
>   .config("spark.sql.warehouse.dir", warehouse)
>   .config("spark.default.parallelism", 4)
>   .config("spark.sql.shuffle.partitions", 4)
>   .getOrCreateCarbonSession(storeLocation, Constants.METASTORE_DB)
> spark.conf.set("spark.sql.parquet.binaryAsString", true)
> val parquets = spark.read.parquet("/data1/parquets/")
> println(parquets.count())
>
>   *Error*:
> Exception in thread "main" java.util.ServiceConfigurationError:
> org.apache.spark.sql.sources.DataSourceRegister: Provider
> org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat
> could not be instantiated
> at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> at
> java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> at
> java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> at
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at
>
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:258)
> at
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:270)
> at
> scala.collection.AbstractTraversable.filter(Traversable.scala:104)
> at
>
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
> at
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
> at
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
> at
> cn.xm.zzc.carbonmaster.Carbon1_5$.testReadSpeed(Carbon1_5.scala:434)
> at cn.xm.zzc.carbonmaster.Carbon1_5$.main(Carbon1_5.scala:105)
> at cn.xm.zzc.carbonmaster.Carbon1_5.main(Carbon1_5.scala)
> Caused by: java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at
>
> org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat.(SparkCarbonFileFormat.scala:120)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at java.lang.Class.newInstance(Class.java:442)
> at
> java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> ... 17 more
>
>   Thanks.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


-- 
Thanks & Regards,
Ravi


[ANNOUNCE] Apache CarbonData 1.5.0 release

2018-10-16 Thread Ravindra Pesala
Hi,

Apache CarbonData community is pleased to announce the release of the
Version 1.5.0 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookups on detail record, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments, in one of the
largest scenarios, it supports queries on a single table with 3PB data
(more than 5 trillion records) with response time less than 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.0/, and feedback
through the CarbonData user mailing lists !

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.0?

The intention of CarbonData 1.5.0 was to move closer to unified analytics. We
want to enable CarbonData files to be read from more engines/libraries to
support various use cases. In this regard, we have added support to read
CarbonData files from C++ libraries. Additionally, CarbonData files can be
read using the Java SDK, the Spark FileFormat interface, Spark, and Presto.

CarbonData added multiple optimisations to reduce the store size so that
queries can take advantage of reduced IO. Several enhancements have been made
to CarbonData's streaming support.

In this version of CarbonData, more than 150 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.
Ecosystem Integration
Support Spark 2.3.2 Ecosystem Integration

Now CarbonData supports Spark 2.3.2

Spark 2.3.2 has many performance improvements in addition to critical bug
fixes, including many improvements related to streaming and unification of
interfaces. In version 1.5.0, CarbonData integrated with Spark 2.3.2 so that
future versions of CarbonData can add enhancements based on Spark's new and
improved capabilities.
Support Hadoop 3.1.1 ecosystem integration

Now CarbonData supports Hadoop 3.1.1, which is the latest stable Hadoop version
and supports many new features (EC, federation clusters, etc.).
Lightweight Integration with Spark

CarbonData now supports the Spark FileFormat data source APIs so that
CarbonData can be integrated into Spark as an external file source. This
integration helps to query CarbonData tables from a plain SparkSession, and it
also helps applications which need standards compliance with respect to
interfaces.

Spark data source APIs support file-format-level operations such as read and
write. CarbonData's enhanced features, namely IUD, Alter, Compaction, Segment
Management and Streaming, will not be available when CarbonData is integrated
as a Spark data source through the data source API.
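
For illustration, a minimal sketch of reading and writing CarbonData through
the standard data source API; the short format name and the paths are
placeholders to check against the data source registration:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("carbon-datasource-sketch").getOrCreate()

  // Write a DataFrame as CarbonData files through the FileFormat data source.
  val df = spark.range(0, 1000).toDF("id")
  df.write.format("carbon").save("/tmp/carbon_ds_table")

  // Read it back with a plain SparkSession, no CarbonSession required.
  spark.read.format("carbon").load("/tmp/carbon_ds_table").show()
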
CarbonData Core
Adaptive Encoding for Numeric Columns

CarbonData now supports adaptive encoding for numeric columns. Adaptive
encoding stores each value of a column as a delta from the Min/Max value of
that column, thereby reducing the effective bits required to store the value.
This results in a smaller store size, thereby increasing query performance due
to reduced IO. Adaptive encoding for dictionary columns has been supported
since version 1.1.0; it now works for all numeric columns.

Performance improvement measurement is not complete in 1.5.0. The results
will be published along with the 1.5.1 release.
Configurable Column Size for Generating Min/Max

CarbonData generates a Min/Max index for all columns and uses it for effective
pruning of data while querying. Generating Min/Max for columns with a longer
width (like an address column) leads to increased storage size and a larger
memory footprint, thereby reducing query performance. Moreover, filters are
often not applied on such columns, so there is no need to generate the indexes;
or the filters on such columns are so rare that it is wiser to accept lower
query performance in those scenarios rather than affect the overall performance
of other filter scenarios due to the increased index size. CarbonData now
supports configuring the limit of the column width (in terms of characters)
beyond which Min/Max generation is skipped.

By default the Min/Max is generated for all string columns. Users who are aware
of their data schema and know which columns contain many characters and will
not be filtered upon can configure to exclude such columns; alternatively, the
maximum character length up to which Min/Max is generated can be specified, so
that CarbonData skips Min/Max index generation if the column character length
crosses the configured threshold. By default, string columns with more than 200
bytes are skipped from Min/Max index generation. In Java each character
occupies 2 bytes; hence columns longer than 100 characters are skipped from
Min/Max index generation.
Support for Map Complex Data Type

CarbonData has integrated map complex data type 

Re: [VOTE] Apache CarbonData 1.5.0(RC2) release

2018-10-15 Thread Ravindra Pesala
Hi all

PMC vote has passed for Apache Carbondata 1.5.0 release, the result
as below:

+1(binding): 4(Liang Chen, JB, Kumar Vishal, Ravindra)

+1(non-binding) : 3

Thanks all for your vote.


Regards,

Ravindra

On Fri, 12 Oct 2018 at 00:20, Kumar Vishal 
wrote:

> +1
> Regards
> Kumar Vishal
>
> On Thu, 11 Oct 2018 at 12:21, xm_zzc <441586...@qq.com> wrote:
>
> > +1
> >
> > Regards
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


-- 
Thanks & Regards,
Ravi


Re: [ISSUE] carbondata1.5.0 and spark 2.3.2 query plan issue

2018-10-02 Thread Ravindra Pesala
Hi Aaron,

The CarbonData profiler is an untested feature added in an old version, so it
might be broken or not adding the correct information during the explain
command. We will try to correct it in the next version; meanwhile, can you
please check and make sure that the data you are getting from the query is
right.

Regards,
Ravindra.

On Mon, 1 Oct 2018 at 21:23, aaron <949835...@qq.com> wrote:

> I think the query plan info is not right,
>
> 1. Total blocklet from carbondata cli is 233 + 86 = 319
> 2. But query plan tell me that I have 560 blocklet
>
> I hope below info could help you to locate issue.
>
>
> ***
> I use carbondata cli print the blocklet summary like below:
>
> java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*"
> org.apache.carbondata.tool.CarbonCli -cmd summary -a -p
> hdfs://
> ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/storev3/
>
> ## Summary
> total: 80 blocks, 9 shards, 233 blocklets, 62,698 pages, 2,006,205,228
> rows,
> 12.40GB
> avg: 158.72MB/block, 54.50MB/blocklet, 25,077,565 rows/block, 8,610,322
> rows/blocklet
>
> java -cp "/home/hadoop/carbontool/*:/opt/spark/jars/*"
> org.apache.carbondata.tool.CarbonCli -cmd summary -a -p
> hdfs://
> ec2-dca-aa-p-sdn-16.appannie.org:9000/usr/carbon/data/default/usage_basickpi/
>
> ## Summary
> total: 30 blocks, 14 shards, 86 blocklets, 3,498 pages, 111,719,467 rows,
> 4.24GB
> avg: 144.57MB/block, 50.43MB/blocklet, 3,723,982 rows/block, 1,299,063
> rows/blocklet
>
>
>
> 
> But at the same time, I run a sql, carbon told me below info:
>
> |== CarbonData Profiler ==
> Table Scan on storev3
>  - total: 194 blocks, 560 blocklets
>  - filter: (granularity <> null and date <> null) and date >=
> 14726880 between date <= 14752800) and true) and
> granularity
> = monthly) and country_code in
> (LiteralExpression(US);LiteralExpression(CN);LiteralExpression(JP);)) and
> device_code in (LiteralExpression(ios-phone);)) and product_id <> null) and
> country_code <> null) and device_code <> null)
>  - pruned by Main DataMap
> - skipped: 192 blocks, 537 blocklets
>
>
>
> 
> The select sql like is
>
> SELECT f.country_code, f.date, f.product_id, f.category_id, f.arpu FROM (
> SELECT a.country_code, a.date, a.product_id, a.category_id,
> a.revenue/a.average_active_users as arpu
> FROM(
> SELECT r.device_code, r.category_id, r.country_code, r.date,
> r.product_id, r.revenue, u.average_active_users
> FROM
> (
> SELECT b.device_code, b.country_code, b.product_id,  b.date,
> b.category_id, sum(b.revenue) as revenue
> FROM (
> SELECT v.device_code, v.country_code, v.product_id,
> v.revenue, v.date, p.category_id FROM
> (
> SELECT device_code, country_code, product_id,
> est_revenue as revenue, timeseries(date, 'month') as date
> FROM storev3
> WHERE market_code='apple-store' AND date BETWEEN
> '2016-09-01' AND '2016-10-01' and device_code in ('ios-phone') and
> country_code in ('US', 'CN', 'JP')
> ) as v
> JOIN(
> SELECT DISTINCT product_id, category_id
> FROM storev3
> WHERE market_code='apple-store' AND date BETWEEN
> '2016-09-01' AND '2016-10-01' and device_code in ('ios-phone') and
> category_id in (10, 11, 100021) and country_code in ('US', 'CN',
> 'JP')
> ) as p
> ON p.product_id = v.product_id
> ) as b
> GROUP BY b.device_code, b.country_code, b.product_id, b.date,
> b.category_id
> ) AS r
> JOIN
> (
> SELECT country_code, date, product_id, (CASE WHEN
> est_average_active_users is not NULL THEN est_average_active_users ELSE 0
> END) as average_active_users, device_code
> FROM usage_basickpi
> WHERE date BETWEEN '2016-09-01' AND '2016-10-01'and granularity
> ='monthly' and country_code in ('US', 'CN', 'JP') AND device_code in
> ('ios-phone')
> ) AS u
> ON r.country_code=u.country_code AND r.date=u.date AND
> r.product_id=u.product_id AND r.device_code=u.device_code
> ) AS a
> )AS f
> ORDER BY f.arpu DESC
> LIMIT 10
>
> Thanks
> Aaron
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


-- 
Thanks & Regards,
Ravi


Re: [ANNOUNCE] Raghunandan as new committer of Apache CarbonData

2018-09-26 Thread Ravindra Pesala
Congrats Raghu

On Wed, 26 Sep 2018 at 12:53, sujith chacko 
wrote:

> Congratulations Raghu
>
> On Wed, 26 Sep 2018 at 12:44 PM, Rahul Kumar 
> wrote:
>
> > congrats Raghunandan !!
> >
> >
> > Rahul Kumar
> > *Sr. Software Consultant*
> > *Knoldus Inc.*
> > m: 9555480074
> > w: www.knoldus.com  e: rahul.ku...@knoldus.in
> > 
> >
> >
> > On Wed, Sep 26, 2018 at 12:41 PM Kumar Vishal  >
> > wrote:
> >
> > > Congratulations Raghunandan.
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > > On Wed, Sep 26, 2018 at 12:36 PM Liang Chen 
> > > wrote:
> > >
> > > > Hi all
> > > >
> > > > We are pleased to announce that the PMC has invited Raghunandan as
> new
> > > > committer of Apache CarbonData, and the invite has been accepted!
> > > >
> > > > Congrats to Raghunandan and welcome aboard.
> > > >
> > > > Regards
> > > > Apache CarbonData PMC
> > > >
> > >
> >
>


-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.5.0(RC1) release

2018-09-25 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.5.0 (RC1) for your vote.

1.Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12341006

Some key features and improvements in this release:

   1. Supported carbon as Spark datasource using Spark's
   Fileformat interface.
   2. Improved compression and performance of non-dictionary columns by
   applying adaptive encoding to them.
   3. Supported MAP datatype in carbon.
   4. Supported ZSTD compression for carbondata files.
   5. Supported C++ interfaces to read carbon through SDK API.
   6. Supported CLI tool for data summary and debug purpose.
   7. Supported BYTE and FLOAT datatypes in carbon.
   8. Limited min/max for large text by introducing a configurable limit to
   avoid file size bloat.
   9. Introduced multithread write API in SDK to speed up loading and query
   performance.
   10. Supported min/max stats for stream row format to improve query
   performance.
   11. Many bug fixes to stabilize CarbonData.


 2. The tag to be voted upon : apache-carbondata-1.5.0.rc1(commit:
2157741f1d8cf3f0418ab37e1755a8d4167141a5)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.0-rc1


3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.5.0-rc1/


4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1033


5. Release artifacts are signed with the following key:

*https://people.apache.org/keys/committer/ravipesala.asc
*


Please vote on releasing this package as Apache CarbonData 1.5.0,  The vote

will be open for the next 72 hours and passes if a majority of

at least three +1 PMC votes are cast.


[ ] +1 Release this package as Apache CarbonData 1.5.0

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ravindra.


Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-08-21 Thread Ravindra Pesala
Yes, I agree with Liang. We do not need to consider showing the SQL in describe
table in the case of CTAS.

Regards
Ravindra

On Tue, 21 Aug 2018 at 20:47, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> Hi,
> In opinion it is not required to show the original select sql. Also is
> there a way to get it? I don't think it can be got.
>
>
> Regards
> Raghu
>
> On Tue, 21 Aug 2018, 8:02 pm Liang Chen,  wrote:
>
> > Hi
> >
> > 1. Agree with likun's comments(4 points) :
> >
> > 2. About 'select sql' for CTAS , you can leave it. we can consider it
> > later.
> >
> > Regards
> > Liang
> >
> > Jacky Li wrote
> > > Hi ZZC,
> > >
> > > I have checked the doc in CARBONDATA-2595. I have following comments:
> > > 1. In the Table Basic Information section, it is better to print the
> > Table
> > > Path instead of "CARBON Store Path”
> > > 2. For the Table Data Size  and Index Size, can you format the output
> in
> > > GB, MB, KB, etc
> > > 3. For the Last Update Time, can you format the output in UTC time like
> > > -MM-DD hh:mm:ss
> > > 4. In table property, I think maybe some properties are missing, like
> > > block size, blocklet size, long string
> > >
> > > For implementation, I suggest to write the main logic of collecting
> these
> > > information in java so that it is easier to write tools for it. One
> tool
> > > can be this SQL command and another tool I can think of is an
> standalone
> > > java executable that  can print these information on the screen by
> > reading
> > > the given table path. (We can put this standalone tool in SDK module)
> > >
> > > Regards,
> > > Jacky
> > >
> > >
> > >> On 20 Aug 2018, at 11:20 AM, xm_zzc <
> >
> > > 441586683@
> >
> >> wrote:
> > >>
> > >> Hi dev:
> > >>  Now I am working on this, the new format is shown in attachment,
> please
> > >> give me some feedback.
> > >>  There is one question: if user uses CTAS to create table, do we need
> to
> > >> show the 'select sql' in the result of 'desc formatted table'? If yes,
> > >> how
> > >> to get 'select sql'? now I just can get a non-formatted sql from
> > >> 'CarbonSparkSqlParser.scala' (Jacky mentioned), for example:
> > >>
> > >> *CREATE TABLE IF NOT EXISTS test_table
> > >> STORED BY 'carbondata'
> > >> TBLPROPERTIES(
> > >> 'streaming'='false', 'sort_columns'='id,city',
> > >> 'dictionary_include'='name')
> > >> AS SELECT * from source_test ;*
> > >>
> > >> The non-formatted sql I get is :
> > >> *SELECT*fromsource_test*
> > >>
> > >> desc_formatted.txt
> > >> 
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t133/desc_formatted.txt
> ;
> >
> > >> desc_formatted_external.txt
> > >> 
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t133/desc_formatted_external.txt
> ;
> >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from:
> > >>
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >>
> >
> >
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


-- 
Thanks & Regards,
Ravi


[DISCUSSION] Support Standard Spark's FileFormat interface in Carbondata

2018-08-21 Thread Ravindra Pesala
Hi,

Currently CarbonData has a deep integration with Spark to provide performance
optimizations and also to support features like compaction, IUD, DataMaps
and metadata management, etc. This type of integration forces users to use a
CarbonSession instance even for plain read and write operations.

So I am proposing a standard Spark FileFormat implementation in carbon for
simpler integration with Spark. Please check the JIRA for the design
document.
https://issues.apache.org/jira/browse/CARBONDATA-2872
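
To illustrate the kind of usage this enables, here is a rough sketch (table and
path names are made up; the datasource name follows the existing 'USING
carbondata' syntax, and the exact options are part of the design in the JIRA):

  -- create, load and query a carbon table through the plain Spark
  -- datasource API, using a normal SparkSession instead of CarbonSession
  CREATE TABLE sales_fmt (id INT, name STRING, amount DOUBLE)
  USING carbondata
  LOCATION '/tmp/carbon/sales_fmt';

  INSERT INTO sales_fmt VALUES (1, 'aaa', 10.5);

  SELECT name, sum(amount) FROM sales_fmt GROUP BY name;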

-- 
Thanks & Regards,
Ravindra


[ANNOUNCE] Apache CarbonData 1.4.1 release

2018-08-15 Thread Ravindra Pesala
Hi,


Apache CarbonData community is pleased to announce the release of the
Version 1.4.1 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookups on detail record, streaming analytics, etc. CarbonData has been
deployed in many enterprise production environments, in one of the largest
scenarios, it supports queries on a single table with 3PB data (more than 5
trillion records)  with response time less than 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.4.1/, and feedback
through the CarbonData user mailing lists !

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in Version 1.4.1?

In this version of CarbonData, more than 230 JIRA tickets for new features,
improvements and bugs have been resolved. The following is a summary.
Carbon Core
Support Cloud Storage (S3)

This can be used to store or retrieve data on Amazon cloud, Huawei
Cloud(OBS) or on any other object stores conforming to S3 API. Storing data
in cloud is advantageous as there are no restrictions on the size of data
and the data can be accessed from anywhere at any time. Carbondata can
support any Object Storage that conforms to Amazon S3 API.  For more
detail, please refer to the S3 Guide.
Support Flat Folder

This feature allows all carbondata and index files to be kept directly under
the table path. This is useful for interoperability with other execution
engines and plugins such as Hive or Presto.
Support 32K Characters (Alpha Feature)

In common scenarios, the length of a string is less than 32000 characters. For
cases where the length of a string exceeds 32000 characters,
CarbonData introduces a table property called LONG_STRING_COLUMNS to handle
this scenario. For these columns, CarbonData internally stores the length
of the content using an Integer.
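For example, a table storing long text can be created like this (table and
column names are illustrative):

  CREATE TABLE page_content (
    id INT,
    url STRING,
    body STRING)
  STORED BY 'carbondata'
  TBLPROPERTIES ('LONG_STRING_COLUMNS'='body');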
Local Dictionary

Local dictionary helps in getting better compression. Filter queries and full
scan queries will be faster as filters are evaluated on encoded data. It
reduces the store size and memory footprint, as only unique values are stored
in the local dictionary and the corresponding data is stored as encoded
values, which also gives higher IO throughput.
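For example, local dictionary is controlled through table properties (table
name, columns and the threshold value here are illustrative):

  CREATE TABLE user_events (
    id BIGINT,
    city STRING,
    device STRING)
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'LOCAL_DICTIONARY_ENABLE'='true',
    'LOCAL_DICTIONARY_THRESHOLD'='10000',
    'LOCAL_DICTIONARY_INCLUDE'='city,device');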
Merge Index

CarbonData supports merging of all the index files inside a segment to a
single CarbonData index merge file. This enhances the first query
performance.
Shows History Segments

CarbonData introduces a 'SHOW HISTORY SEGMENTS' command to show all segment
information, including visible and invisible segments.
Custom Compaction

Custom compaction is a new compaction type in addition to MAJOR and MINOR
compaction. In custom compaction, you can directly specify the segment ids
to be merged.
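For example, specific segments can be merged like this (segment ids are
illustrative):

  ALTER TABLE sales COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (0, 1, 2);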
Enhancement for Detail Record Analysis
Supports Bloom Filter DataMap

CarbonData introduces BloomFilter as an index datamap to enhance the
performance of querying with precise value. It is well suitable for queries
that do precise matching on high cardinality columns (such as Name/ID). In
the concurrent filter query scenario (on a high cardinality column), we observe
a 3~5 times improvement in concurrent queries per second compared to the last
version. For more detail, please refer to the BloomFilter DataMap Guide.
Improved Complex Datatypes

Improved complex datatypes compression and performance through adaptive
encoding.


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12343148

New Feature

   - [CARBONDATA-2504] - Support StreamSQL for streaming job
   - [CARBONDATA-2638] - Implement driver min max caching for specified columns
     and segregate block and blocklet cache

Improvement

   - [CARBONDATA-2202] - Introduce local dictionary encoding for dimensions
   - [CARBONDATA-2309] - Add strategy to generate bigger carbondata files in
     case of a small amount of data
   - [CARBONDATA-2355] - Support running SQL directly on carbon files generated
     by the SDK
   - [CARBONDATA-2389] - Search mode support for lucene datamap
   - [CARBONDATA-2420] - Support strings longer than 32000 characters
   - [CARBONDATA-2428] - Support flat folder structure in carbon
   - [CARBONDATA-2482

[VOTE] Apache CarbonData 1.4.1(RC2) release

2018-08-09 Thread Ravindra Pesala
Hi


I submit the Apache CarbonData 1.4.1 (RC2) for your vote.


1.Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12343148

Some key features and improvements in this release:

   1. Supported Local dictionary to improve IO and query performance.
   2. Improved and stabilized Bloom filter datamap.
   3. Supported left outer join MV datamap(Alpha feature)
   4. Supported driver min max caching for specified columns and
   segregate block and blocklet cache.
   5. Support Flat folder structure in carbon to maintain the same folder
   structure as Hive.
   6. Supported S3 read and write on carbondata files
   7. Support projection push down for struct data type.
   8. Improved complex datatypes compression and performance through
   adaptive encoding.
   9. Many bug fixes to stabilize CarbonData.


 2. The tag to be voted upon : apache-carbondata-1.4.1.rc2(commit:
a17db2439aa51f6db7da293215f9732ffb200bd9)

https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.4.1-rc2


3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.4.1-rc2/


4. A staged Maven repository is available for review at:

https://repository.apache.org/content/repositories/orgapachecarbondata-1032


5. Release artifacts are signed with the following key:

*https://people.apache.org/keys/committer/ravipesala.asc
*


Please vote on releasing this package as Apache CarbonData 1.4.1,  The vote

will be open for the next 72 hours and passes if a majority of

at least three +1 PMC votes are cast.


[ ] +1 Release this package as Apache CarbonData 1.4.1

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ravindra.


[Discussion] Refactor Segment Management Interface.

2018-08-03 Thread Ravindra Pesala
Hi,

Carbon uses the tablestatus file to record the segment status and details of each
segment during each load. This tablestatus enables carbon to support
concurrent loads and reads without data inconsistency or corruption, so it
is a very important feature of carbondata and we should have clean
interfaces to maintain it. The current tablestatus updation is scattered across
multiple places and there is no clean interface, so I am proposing to
refactor the current SegmentStatusManager interface and bring all
tablestatus operations into a single interface. This new interface allows
adding table status to any other storage like a DB. This is needed for S3
type object stores as these are eventually consistent.

Please check the attached design in the jira
https://issues.apache.org/jira/browse/CARBONDATA-2827

Please share your ideas on it.

-- 
Thanks & Regards,
Ravi


Re: [Discussion] Blocklet DataMap caching in driver

2018-06-21 Thread Ravindra Pesala
Hi Manish,
Thanks for proposing the solutions of driver memory problem.

+1 for solution 1, but it may not be the complete solution. We should also
have solution 2 to solve the driver memory issue completely. I think in the very
near future we should have solution 2 as well.

I have a few doubts and suggestions related to solution 1.
1. What if the query comes on non-cached columns, will it start reading min/max
from disk on the driver side?
2. Are we planning to cache blocklet level information or block level
information on the driver side for cached columns?
3. What is the impact if we automatically choose the cached columns from the
user query instead of letting the user configure them?
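
For illustration only (not a final syntax proposal), solution 1 could be exposed
as a table level property so that the driver caches min/max only for the listed
filter columns, something like:

  -- hypothetical property name, to be decided in the design
  ALTER TABLE sales SET TBLPROPERTIES ('COLUMN_META_CACHE'='country,city');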

Regards,
Ravindra.

On Thu, 21 Jun 2018 at 14:54, manish gupta 
wrote:

> Hi Dev,
>
> The current implementation of Blocklet dataMap caching in driver is that it
> caches the min and max values of all the columns in schema by default.
>
> The problem with this implementation is that as the number of loads
> increases the memory required to hold min and max values also increases
> considerably. We know that in most of the scenarios there is a single
> driver and memory configured for driver is less as compared to executor.
> With continuos increase in memory requirement driver can even go out of
> memory which makes the situation further worse.
>
> *Proposed Solution to solve the above problem:*
>
> Carbondata uses min and max values for blocklet level pruning. It might not
> be necessary that user has filter on all the columns specified in the
> schema instead it could be only few columns that has filter applied on them
> in the query.
>
> 1. We provide user an option to cache the min and max values of only the
> required columns. Caching only the required columns can optimize the
> blocklet dataMap memory usage as well as solve the driver memory problem to
> a greater extent.
>
> 2. Using an external storage/DB to cache min and max values. We can also
> implement a solution to create a table in the external DB and store min and
> max values for all the columns in that table. This will not use any driver
> memory and hence the driver memory usage will be optimized further as
> compared to solution 1.
>
> *Solution 1* will not have any performance impact as the user will cache
> the required filter columns and it will not have any external dependency
> for query execution.
> *Solution 2* will degrade the query performance as it will involve querying
> for min and max values from external DB required for Blocklet pruning.
>
> *So from my point of view we should go with solution 1 and in near future
> propose a design for solution 2. User can have an option to select between
> the 2 options*. Kindly share your suggestions.
>
> Regards
> Manish Gupta
>


-- 
Thanks & Regards,
Ravi


Re: Complex DataType Enhancements

2018-06-14 Thread Ravindra Pesala
Hi Sounak,
Are you planning to do predicate push down or projection push down for the
struct type?
I guess adaptive encoding is only possible for integral datatypes like
long, int and short, not for all datatypes. So please list down what types of
encoding you are planning for complex types.
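
For example, the kind of struct query we would want optimized is a projection
and filter on a struct child (schema is illustrative), where ideally only
person.name is read and the filter is evaluated inside carbon instead of in Spark:

  -- person is a struct<name:string, age:int>
  SELECT person.name FROM people WHERE person.age > 30;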

Regards,
Ravindra.

On Wed, 13 Jun 2018 at 20:07, sounak  wrote:

> Hi Dev,
>
> We have identified the scope of phase1 activities for complex type
> enhancements.
>
> Below are the phase 1 enhancement activities.
>
>- Predicate push down for struct data type.
>- Provide adaptive encoding and decoding for all data type.
>- Support JSON data loading directly into Carbon table.
>
>
> Please find the detail design document attached in the JIRA
> [CARBONDATA-2605
> ]
> https://issues.apache.org/jira/browse/CARBONDATA-2605
>
> Thanks,
> Sounak
>
>
>
>
>
> On Mon, Jun 4, 2018 at 8:10 AM sounak  wrote:
>
> > Hi Dev,
> >
> > Complex types (also referred to as nested types) let you represent
> > multiple data values within a single row/column position.
> > CarbonData already has the support of Complex Types but it lacks major
> > enhancements which are present in other primitive Datatypes. As complex
> > type usages are increasing, we are planning to enhance the coverage of
> > Complex Types and apply some major optimization. I am listing down few of
> > the optimization which we have thought off.
> >
> > Request to the community to go through the listing and please give your
> > valuable suggestions.
> >
> > 1. Adaptive Encoding for Complex Type Page: Currently Complex Types
> > page doesn't have any encoding present, which leads to higher IO compared
> > to other DataTypes. Complex Page should be at par with other datatypes
> > encoding mechanism.
> >
> > 2. Optimize Array Type Reading: Optimizing Complex Type Array reading so
> > that it can be read faster. One of the ways is to reduce the Read IO for
> > Arrays after applying encoding mechanism like Adaptive or RLE on the
> Array
> > data type.
> >
> > 3.   Filter and Projection Push Down for Complex Datatypes: As of now in
> > case of Complex DataTypes filters and projections are handled in the
> upper
> > spark layer. In case they are pushed down Carbon will get better
> > performance as less IO will incur as all rows need not be send back to
> > spark for processing.
> >
> > 4. Support Multilevel Nesting in Complex Datatypes: Only 2 Level of
> > nesting is supported for Complex Datatype through Load and Insert into.
> > Make this to n-level support.
> >
> > 5. Update and Delete support for complex Datatype: Currently, only
> > primitive datatypes work for Update and Delete in CarbonData. Support
> > Complex DataType too for the DML operation.
> >
> > 6. Alter Table Support for Complex DataType : Alter table doesn't support
> > addition or deletion of complex columns as of now. This support needs to
> be
> > extended.
> >
> > 7. Map Datatype Support: Only Struct and Array datatypes are part of
> > Complex Datatype as of now. Map Datatype should be extended as part of
> > Complex.
> >
> > 8. Compaction support for Complex Datatype: Compaction works for the
> > primitive datatype, but should be extended for complex too.
> >
> >
> > Good to have features
> > --
> > 9. Geospatial Support through Complex Datatype: Geospatial datatypes like
> > ST_GEOMETRY and XMLs  object representation through complex datatypes.
> >
> > 10. Complex Datatype Transformation: Once complex datatype can transform
> > into different complex datatype. For e.g. User Inserted Data with
> ComplexA
> > datatype but want to transform the data and retrieve the data like
> ComplexB
> > datatype.
> >
> > 11. Virtual Tables for Complex Datatypes: Currently complex columns
> reside
> > in one column, but through virtual tables, the complex columns an be
> > denormalized and placed into a separate table called a virtual table for
> > faster processing and joins and applying to sort columns.
> >
> > 12. Including Complex Datatype to Sort Columns.
> >
> > Please let me know your suggestion on these enhancements.
> >
> > Thanks a lot
> >
> > --
> > Thanks
> > Sounak
> >
>
>
> --
> Thanks
> Sounak
>


-- 
Thanks & Regards,
Ravi


Re: [Discussion] Carbon Local Dictionary Support

2018-06-04 Thread Ravindra Pesala
Hi Vishal,

+1

Thank you for starting a discussion on it. It will be a very helpful
feature to improve query performance and reduce the memory footprint.
Please add the design document for the same.

Regards,
Ravindra.

On 5 June 2018 at 09:22, xuchuanyin  wrote:

> Hi, Kumar:
>   Local dictionary will be nice feature and other formats like parquet all
> support this.
>
>   My concern is that: How will you implement this feature?
>
>   1. What's the scope of the `local`? Page level (for all containing rows),
> Blocklet level (for all containing pages), Block level(for all containing
> blocklets)?
>
>   2. Where will you store the local dictionary?
>
>   3. How do you decide to enable the local dictionary for a column?
>
>   4. Have you considered to fall back to plain encoding if the local
> dictionary encoding consumes more space?
>
>   5. Will you still work on V3 format or start a new V4 (or v3.1) version?
>
>   Anyway, I'm concerning about the data loading performance. Please pay
> attention to it while you are implementing this feature.
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.4.0(RC2) release

2018-05-22 Thread Ravindra Pesala
Hi


I submit the Apache CarbonData 1.4.0 (RC2) for your vote.


1.Release Notes:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12341005


Some key features and improvements in this release:

   1. Provided Carbon SDK to write and read CarbonData files through Java
   API.
   2. Supported external table with location.
   3. Streaming with Pre-Aggregate table support
   4. Partition with Pre-Aggregate Support
   5. Data Load improvements
   6. Lucene index support for faster text search (Alpha feature)
   7. Supported S3 read on carbondata files
   8. Search Mode to improve concurrent queries (Alpha feature)
   9. Bloom Filter index for faster blocklet pruning (Alpha feature)
   10. Supported 'using carbondata' syntax to standardize carbondata with
   spark datasource
   11. Many Bug fixes.


 2. The tag to be voted upon : apache-carbondata-1.4.0-rc2(commit:
911a2a0150ec6ca9e029634c5747b6a0d1b4c08f)


*https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.4.0-rc2*



3. The artifacts to be voted on are located here:

https://dist.apache.org/repos/dist/dev/carbondata/1.4.0-rc2/


4. A staged Maven repository is available for review at:

*https://repository.apache.org/content/repositories/orgapachecarbondata-1030/
*


5. Release artifacts are signed with the following key:

*https://people.apache.org/keys/committer/ravipesala.asc
*


Please vote on releasing this package as Apache CarbonData 1.4.0,  The vote

will be open for the next 72 hours and passes if a majority of

at least three +1 PMC votes are cast.


[ ] +1 Release this package as Apache CarbonData 1.4.0

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ravindra.


Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-04-26 Thread Ravindra Pesala

I agree with Liang's suggestion to align the information with the table schema. And
I have one suggestion related to NO_INVERTED_INDEX: instead of mentioning the
no-inverted-index columns, it is better to mention which columns are inverted index
columns. It is very hard for a user to understand which columns have an inverted
index, and providing this information in the table describe command is useful,
especially if we change our default behaviour of selecting index columns.
Regards,
Ravindra.

On Apr 26 2018, at 10:34 am, manishgupta88  wrote:
>
> I agree with Liang. We can modify the complete describe formatted command
> display and show the detailed information as suggested by Liang.
> Liang we can make a small change in your suggestion. As we are displaying
> the information to the user we should not include Underscore(_) in the
> property names and in place of DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE we
> can just say Dictionary columns and No Dictionary Columns.
>
> ## Detailed Table Properties Information
> |Sort Columns |name,id
> |No Inverted Index |id
> |Dictionary Columns |name
> |Table BlockSize |1024 MB
> |Sort Scope |LOCAL_SORT
> |Streaming |false
>
> Regards
> Manish Gupta
>
>
>
> --
> Sent from: 
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>



Re: [VOTE] Apache CarbonData 1.3.1(RC1) release

2018-03-09 Thread Ravindra Pesala
Hi all

The PMC vote has passed for the Apache CarbonData 1.3.1 release; the result is as
below:

+1(binding): 5(Liang Chen, Jacky, David, JB, Ravindra)

+1(non-binding) :  1

Thanks all for your vote.


Regards,

Ravindra


On 5 March 2018 at 15:10, David CaiQiang  wrote:

> +1
>
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.3.1(RC1) release

2018-03-04 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.3.1 (RC1) for your vote.

1.Release Notes:
*https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12342754
*

Some key improvements in this patch release:

   1. Restructured the carbon partitions to use standard hive folder
   structure.
   2. Supported global sort on the partitioned table.
   3. Many Bug fixes and stabilized 1.3.0 release.


 2. The tag to be voted upon : apache-carbondata-1.3.1-rc1(commit:
744032d3cc39ff009b4a24b2b43f6d9457f439f4)
*https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.3.1-rc1
*

3.The artifacts to be voted on are located here:
*https://dist.apache.org/repos/dist/dev/carbondata/1.3.1-rc1/
*

4. A staged Maven repository is available for review at:
*https://repository.apache.org/content/repositories/orgapachecarbondata-1026
*

5. Release artifacts are signed with the following key:
*https://people.apache.org/keys/committer/ravipesala.asc
*

Please vote on releasing this package as Apache CarbonData 1.3.1,  The vote
will be open for the next 72 hours and passes if a majority of
at least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 1.3.1
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


Re: About bucket feature in carbon

2018-02-09 Thread Ravindra Pesala
Yes Jacky, we will do the refactoring and use the partition flow.

On 9 February 2018 at 13:44, Jacky Li <13561...@qq.com> wrote:

> Hi Ravindra,
>
> You mean we can do one round of refactory for bucketed table feature in
> CarbonData 1.4.
> I am fine with it.
>
> Regards,
> Jacky
>
>
> > On 9 Feb 2018, at 3:49 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> >
> > Hi Likun,
> >
> > I feel it is better to change the implementation to use sparks bucketing
> > generation just like how standard hive partitions generates. It will be
> > easy to change it after implementing of partition feature. And it is a
> > useful feature for joining big tables and hash based buckets and
> clustered
> > by enables the queries faster.  So it is better to change the
> > implementation instead of removing it.
> >
> > Regards,
> > Ravindra.
> >
> > On 9 February 2018 at 13:14, Jacky Li <jacky.li...@qq.com> wrote:
> >
> >> Hi,
> >>
> >> One year ago, CarbonData 1.0.0 has introduced bucket table feature, it
> was
> >> expected to improve join performance by avoiding shuffling if both
> tables
> >> are bucketed on same column with same number of buckets.
> >>
> >> However, after this feature was introduced, personally speaking it was
> not
> >> widely used in the community and it creates maintenance overhead for the
> >> developers in the community (for very new Pull Request, all bucket
> related
> >> testcase need to be fixed)
> >>
> >> And now carbon has integrated with spark standard partition, developer
> can
> >> add bucket support using spark bucketed table feature in future if it
> >> requires.
> >>
> >> So, I propose to remove bucket feature after CarbonData 1.3.0 version.
> >> What do you think?
> >>
> >> Regards,
> >> Jacky
> >>
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>


-- 
Thanks & Regards,
Ravi


Re: About bucket feature in carbon

2018-02-08 Thread Ravindra Pesala
Hi Likun,

I feel it is better to change the implementation to use Spark's bucketing
generation, just like how standard Hive partitions are generated. It will be
easy to change it after implementing the partition feature. Bucketing is a
useful feature for joining big tables, and hash based buckets with clustered
by make such queries faster. So it is better to change the
implementation instead of removing it.
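
For reference, the Spark-style bucketed table definition we would align with
looks roughly like this (table and column names are illustrative):

  CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
  USING carbondata
  CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 8 BUCKETS;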

Regards,
Ravindra.

On 9 February 2018 at 13:14, Jacky Li  wrote:

> Hi,
>
> One year ago, CarbonData 1.0.0 has introduced bucket table feature, it was
> expected to improve join performance by avoiding shuffling if both tables
> are bucketed on same column with same number of buckets.
>
> However, after this feature was introduced, personally speaking it was not
> widely used in the community and it creates maintenance overhead for the
> developers in the community (for very new Pull Request, all bucket related
> testcase need to be fixed)
>
> And now carbon has integrated with spark standard partition, developer can
> add bucket support using spark bucketed table feature in future if it
> requires.
>
> So, I propose to remove bucket feature after CarbonData 1.3.0 version.
> What do you think?
>
> Regards,
> Jacky
>
>


-- 
Thanks & Regards,
Ravi


Re: [VOTE] Apache CarbonData 1.3.0(RC2) release

2018-02-06 Thread Ravindra Pesala
Hi all

The PMC vote has passed for the Apache CarbonData 1.3.0 release; the result is as
below:

+1(binding): 5(Liang Chen, Jacky, Kumar Vishal, JB, Ravindra)

+1(non-binding) :  3

Thanks all for your vote.


Regards,

Ravindra


On 5 February 2018 at 14:05, xm_zzc <441586...@qq.com> wrote:

> +1
>
> The issue I fixed in pr 1928(https://github.com/
> apache/carbondata/pull/1928)
> is *not* a block issue for releasing CarbonData 1.3.0, we can merge this pr
> into 1.3.1 which will be released soon after 1.3.0.
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.3.0(RC2) release

2018-02-03 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.3.0 (RC2) for your vote.

1.Release Notes:
*https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12341004
*

Some key improvements in this patch release:

   1. Supported Streaming in CarbonData
   2. Supported Spark 2.2 version in Carbon.
   3. Added Pre-aggregation support to carbon.
   4. Supported Time Series (Alpha feature)
   5. Supported standard hive type of partitioning in carbon.
   6. Added CTAS support in Carbon
   7. Support Boolean DataType


 2. The tag to be voted upon : apache-carbondata-1.3.0-rc2(commit:
c055c8f33123bfb6e1103456bea23a0ff8c944ca)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.3.0-rc2

3.The artifacts to be voted on are located here:
*https://dist.apache.org/repos/dist/dev/carbondata/1.3.0-rc2/
*

4. A staged Maven repository is available for review at:
*https://repository.apache.org/content/repositories/orgapachecarbondata-1025/
*

5. Release artifacts are signed with the following key:
*https://people.apache.org/keys/committer/ravipesala.asc
*

Please vote on releasing this package as Apache CarbonData 1.3.0,  The vote
will be open for the next 72 hours and passes if a majority of
at least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 1.3.0
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


[VOTE] Apache CarbonData 1.3.0(RC1) release

2018-01-09 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.3.0 (RC1) for your vote.

1.Release Notes:
*https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12341004
*

Some key improvements in this patch release:

   1. Supported Streaming in CarbonData
   2. Supported Spark 2.2 version in Carbon.
   3. Added Pre-aggregation support to carbon.
   4. Supported standard hive type of partitioning in carbon.
   5. Added CTAS support in Carbon


 2. The tag to be voted upon : apache-carbondata-1.3.0-rc1(commit:
04b376b76c3e842e7f5cb64375005a93097baedb)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.3.0-rc1

3.The artifacts to be voted on are located here:
*https://dist.apache.org/repos/dist/dev/carbondata/1.3.0-rc1/
*

4. A staged Maven repository is available for review at:
*https://repository.apache.org/content/repositories/orgapachecarbondata-1024
*

5. Release artifacts are signed with the following key:
*https://people.apache.org/keys/committer/ravipesala.asc
*

Please vote on releasing this package as Apache CarbonData 1.3.0,  The vote
will be open for the next 72 hours and passes if a majority of
at least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 1.3.0
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


Re: [ANNOUNCE] Kunal Kapoor as new Apache CarbonData committer

2018-01-08 Thread Ravindra Pesala
Congrats Kunal

Regards,
Ravindra

On 8 January 2018 at 20:29, xm_zzc <441586...@qq.com> wrote:

> Congratulations Kunal  !!
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Initiating Apache CarbonData-1.3.0 Release

2017-12-23 Thread Ravindra Pesala
Hi All,

We are initiating the CarbonData 1.3.0 release, so no new features are allowed
to be committed on the master branch till the release is done. We will stabilize the
code and only defect fixes are allowed to be committed. Please let us know if
any urgent features need to be merged into the 1.3.0 version so that we can
plan accordingly.

Major features in the CarbonData 1.3.0 release:
1. Supported Streaming in CarbonData
2. Supported Spark 2.2 version in Carbon.
3. Added Pre-aggregation support to carbon.
4. Supported standard hive type of partitioning in carbon.
5. Added CTAS support in Carbon

-- 
Thanks & Regards,
Ravindra.


Re: [Discussion] Support Spark/Hive based partition in carbon

2017-12-09 Thread Ravindra Pesala
Hi Yuhai Cen,

Yes, you are right, we should support a standard folder structure like Hive to
generalize the file format, but we have a lot of other features which are
built upon the current folder structure, so removing it will have a lot of
impact on those features. Right now we are implementing
CarbonTableOutputFormat, which manages table segments while loading and
writes data in the current carbon folder structure. There is one more
output format called CarbonOutputFormat, along with CarbonInputFormat, which just
write and read the data to files totally managed by Spark/Hive,
so these interfaces will be the generalized file format interfaces to
integrate with systems like Hive/Presto.

Regards,
Ravindra.

On 9 December 2017 at 11:20, 岑玉海 <cenyuha...@163.com> wrote:

> I still insist that if we want to make carbon a general fileformt on
> hadoop ecosystem, we should support standard hive/spark folder structure.
>
>
> we can use the folder structure like this:
> TABLE_PATH
>
> Customer=US
>
> |--Segement_0
>
>  |---0-12212.carbonindex
>
>  |---PART-00-12212.carbondata
>
>  |---0-34343.carbonindex
>
>  |---PART-00-34343.carbondata
>
> or
> TABLE_PATH
>
> Customer=US
>
>   |--Part0
>
>|--Fact
>
> |--Segement_0
>
>  |---0-12212.carbonindex
>
>  |---PART-00-12212.carbondata
>
>  |---0-34343.carbonindex
>
>  |---PART-00-34343.carbondata
>
>
>
>
>
>
>
>
>
> I know there will be some impact on compaction and segment management.
>
> @Jacky @Ravindra @chenliang @David CaiQiang can you estimate the impact?
>
>
>
> Best regards!
> Yuhai Cen
>
>
> On 5 December 2017 at 15:29, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> Hi Jacky,
>
> Here we have the main problem with the underlying segment based design of
> carbon. For every increment load carbon creates a segment and manages the
> segments through the tablestatus file. The changes will be very big and
> impact is more if we try to change this design. And also we will have a
> problem with backward compatibility when the folder structure changes in
> new loads.
>
> Regards,
> Ravindra.
>
> On 5 December 2017 at 10:12, 岑玉海 <cenyuha...@163.com> wrote:
>
> > Hi,  Ravindra:
> >I read your design documents, why not use the standard hive/spark
> > folder structure, is there any problem if use the hive/spark folder
> > structure?
> >
> >
> >
> >
> >
> >
> >
> >
> > Best regards!
> > Yuhai Cen
> >
> >
> > On 4 December 2017 at 14:09, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> > Hi,
> >
> >
> > Please find the design document for standard partition support in carbon.
> > https://docs.google.com/document/d/1NJo_Qq4eovl7YRuT9O7yWTL0P378HnC8WT
> > 0-6pkQ7GQ/edit?usp=sharing
> >
> >
> >
> >
> >
> >
> >
> > Regards,
> > Ravindra.
> >
> >
> > On 27 November 2017 at 17:36, cenyuhai11 <cenyuha...@163.com> wrote:
> > The datasource api still have a problem that it do not support hybird
> > fileformat table.
> > Detail description about hybird fileformat table is in this issue:
> > https://issues.apache.org/jira/browse/CARBONDATA-1377.
> >
> > All partitions' fileformat of datasource table must be the same.
> > So we can't change fileformat to carbodata by command "alter table
> > table_xxx
> > set fileformat carbondata;"
> >
> > So I think implement TableReader is the right way.
> >
> >
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> > n5.nabble.com/
> >
> >
> >
> >
> >
> >
> > --
> >
> > Thanks & Regards,
> > Ravi
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: 回复: [DISCUSSION] Refactory on spark related modules

2017-12-05 Thread Ravindra Pesala
Hi Jacky,

I don't think it's a good idea to create new modules for the spark2.1 and
spark2.2 versions. We should not create a module for every Spark minor
version. Earlier we had the modules spark and spark2 because of a major version
change, and a lot of interfaces changed along with it. If it is a major
version we can create a module, but not for every minor version. Regarding
the IDE issue, it is just a matter of developers understanding how to switch
versions, so we can have an FAQ for that and also look for a solution to the
IDE problem.

And about merging the spark2 and spark-common modules, there is no harm in
keeping all the RDDs in the common package, because supporting any future major
Spark version like spark3.0 may require separating the modules
again.

Regards,
Ravindra.

On 6 December 2017 at 11:00, wyphao.2007  wrote:

> +1
>
>
>
>
>
>
> On 6 December 2017 at 11:44, "岑玉海" wrote:
>
> +1
>
>
>
>
>
>
> Best regards!
> Yuhai Cen
>
>
> On 6 December 2017 at 11:43, David CaiQiang wrote:
> +1
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Re: [Discussion] Support Spark/Hive based partition in carbon

2017-12-04 Thread Ravindra Pesala
Hi Jacky,

Here we have the main problem with the underlying segment based design of
carbon. For every incremental load carbon creates a segment and manages the
segments through the tablestatus file. The changes will be very big and the
impact will be high if we try to change this design. And we will also have a
problem with backward compatibility when the folder structure changes in
new loads.

Regards,
Ravindra.

On 5 December 2017 at 10:12, 岑玉海 <cenyuha...@163.com> wrote:

> Hi,  Ravindra:
>I read your design documents, why not use the standard hive/spark
> folder structure, is there any problem if use the hive/spark folder
> structure?
>
>
>
>
>
>
>
>
> Best regards!
> Yuhai Cen
>
>
> On 4 December 2017 at 14:09, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> Hi,
>
>
> Please find the design document for standard partition support in carbon.
> https://docs.google.com/document/d/1NJo_Qq4eovl7YRuT9O7yWTL0P378HnC8WT
> 0-6pkQ7GQ/edit?usp=sharing
>
>
>
>
>
>
>
> Regards,
> Ravindra.
>
>
> On 27 November 2017 at 17:36, cenyuhai11 <cenyuha...@163.com> wrote:
> The datasource api still have a problem that it do not support hybird
> fileformat table.
> Detail description about hybird fileformat table is in this issue:
> https://issues.apache.org/jira/browse/CARBONDATA-1377.
>
> All partitions' fileformat of datasource table must be the same.
> So we can't change fileformat to carbodata by command "alter table
> table_xxx
> set fileformat carbondata;"
>
> So I think implement TableReader is the right way.
>
>
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>
>
>
>
>
>
> --
>
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: [Discussion] Support Spark/Hive based partition in carbon

2017-12-03 Thread Ravindra Pesala
Hi,

Please find the design document for standard partition support in carbon.
https://docs.google.com/document/d/1NJo_Qq4eovl7YRuT9O7yWTL0P378HnC8WT0-6pkQ7GQ/edit?usp=sharing



Regards,
Ravindra.

On 27 November 2017 at 17:36, cenyuhai11  wrote:

> The datasource api still have a problem that it do not support hybird
> fileformat table.
> Detail description about hybird fileformat table is in this issue:
> https://issues.apache.org/jira/browse/CARBONDATA-1377.
>
> All partitions' fileformat of datasource table must be the same.
> So we can't change fileformat to carbodata by command "alter table
> table_xxx
> set fileformat carbondata;"
>
> So I think implement TableReader is the right way.
>
>
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Standard Partitioning Support in CarbonData.docx
Description: MS-Word 2007 document


[Discussion] Support Spark/Hive based partition in carbon

2017-11-21 Thread Ravindra Pesala
Partition features of Spark:

1. Creating table with partition
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING datasource
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]

2. Load data
  Static Partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country = 'US', state = 'CA')

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country = 'US', state = 'AL')
  SELECT * FROM another_user au
  WHERE au.country = 'US' AND au.state = 'AL';

   Dynamic Partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country, state)

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country, state)
  SELECT * FROM another_user;

 3. Drop, show partitions
  SHOW PARTITIONS [db_name.]table_name
  ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)

 4. Updating the partitions
  ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec



Currently, carbon supports partitions which are custom implemented by
carbon. So if community users want to use the partition features which are available
in Spark and Hive with carbondata, then a compatibility problem
arises. And carbondata also does not have built-in dynamic partitioning.
To use the partition feature of Spark we should comply with the interfaces
available in Spark while loading and reading the data.

Approach 1:
Comply with the pure Spark datasource API and implement standard interfaces for
reading and writing data at a file level, just like how Parquet and ORC
are implemented in Spark; carbondata can be implemented in the same way.
To support it we need to implement a FileFormat interface for reading and
writing the data at the file level, not the table level. For reading, we should
implement CarbonFileInputFormat (reads data at the file level) and
CarbonOutputFormat (writes data per partition).
Pros:
1. It is a clean interface to use on Spark; all features of Spark can
work without any impact.
2. Upgrading to new versions of Spark is straightforward and simple.
Cons:
All carbondata features such as IUD, compaction, alter table and data
management like show segments, delete segments, etc. cannot work.

Approach 2:
Improve and expand the current in-house partition features which already
exist in carbondata. Add all the missing features like dynamic partitioning
and comply with the syntax of loading data into partitions.
Pros:
All current features of carbondata work without much impact.
Cons:
The current partition implementation does not comply with Spark partitioning, so
a lot of effort is needed to implement it.

Approach 3:
It is a hybrid of the 1st approach. Basically, write the data using the
FileFormat and CarbonOutputFormat interfaces. All the partition
information would then be added to Hive automatically since we are creating the
datasource table. We make sure that the current folder structure does not
change while writing the data. Here we maintain a mapping file inside the
segment folder for the mapping between the partition and the carbonindex file. And
while reading, we first get the partition information from Hive, do
the pruning, and based on the pruned partitions read the partition mapping
file to get the carbonindex files for querying.
Here we will not support the current carbondata partition feature, but we
support the Spark partition features.
Pros:
1. Supports the standard interface for loading data, so features like
partitioning and bucketing are automatically supported.
2. All standard SQL syntax works fine with this approach.
3. All current features of carbon also work fine.
Cons:
1. The existing partition feature cannot work.
2. Minor impact on features like compaction, IUD and clean files because of
maintaining the partition mapping file.

-- 
Thanks & Regards,
Ravindra


Re: [Discussion] Support pre-aggregate table to improve OLAP performance

2017-11-06 Thread Ravindra Pesala
Hi Bill,

Please find my comments.

1. We are not supporting join queries in this design, so there will always be
one parent table for an aggregate table. We may consider join queries
for creating aggregation tables in the future.

2. The aggregation column name will be created internally and it would be like
agg_parentcolumnname.

3. Yes, if we create the agg table on a dictionary column of the parent table, then it
uses the same parent dictionary. The aggregation table does not generate any
dictionary files.

4. timeseries.eventtime is the time column of the main table; there should
be at least one timestamp column on the main table to create
timeseries tables. In the design, the granularity is replaced with a hierarchy, which
means the user can give a time hierarchy like minute, hour, day, so
three aggregation tables (minute, hour and day aggregation tables) will
be created automatically and data will be loaded into them for every load.

5. In the new design v1.1 this has changed, please check the same.

6. As I mentioned above, in the new v1.1 design it got changed to a hierarchy, so
the user can define his own time hierarchy.

7. Ok, we will discuss and check whether we can expose this SORT_COLUMNS
configuration on the aggregation table. Even if we don't support it now, we can
expose it in the future.

8. Yes, merge index is applicable to the aggregation table as well.
Regards,
Ravindra.

On 3 November 2017 at 09:05, bill.zhou  wrote:

> hi  Jacky & Ravindra, I have little more query about this design, thank you
> very much can clarify my query.
>
>
> 1. if we support create aggreagation tables from two or more tabels join,
> how to set the aggretate.parent?, whether can be like
> 'aggretate.parent'='fact1,dim1,dim1'
> 2. what's the agg table colum name ? for following create command it will
> be
> as: user_id,name,c2, price ?
> CREATE TABLE agg_sales
> STORED BY 'carbondata'
> TBLPROPERTIES ('aggregate.parent'='sales')
> AS SELECT user_id,user_name as name, sum(quantity) as c2, avg(price) FROM
> sales GROUP BY user_id.
> 3. if we create the dictioanry column in agg table, whether the dictionary
> file will use the same one main table?
>
> 4. for rollup table main table creation: what's the mean for
> timeseries.eventtime, granualarity? what's column can belong to this?
> 5. for rollup table main table creation: what's the mean for
> ‘timeseries.aggtype’ =’quantity:sum, max', it means the column quantity
> only
> support sum, max ?
>
> 6. In both the above cases carbon generates the 4 pre-aggregation tables
> automatically for
> year, month, day and hour. (their table name will be prefixed with
> agg_sales). -- in about cause only see the column hour, how to generate the
> year, month and day ?
>
> 7.In internal implementation, carbon will create these table with
> SORT_COLUMNS=’group by
> column defined above’, so that filter group by query on main table will be
> faster because it
> can leverage the index in pre-aggregate tables. -- I suggstion user can
> control the sort columns order
> 8. whether support merge index to agg table ? -- it is usefull.
>
>
> Jacky Li wrote
> > Hi community,
> >
> > In traditional data warehouse, pre-aggregate table or cube is a common
> > technology to improve OLAP query performance. To take carbondata support
> > for OLAP to next level, I’d like to propose pre-aggregate table support
> in
> > carbondata.
> >
> > Please refer to CARBONDATA-1516
> > https://issues.apache.org/jira/browse/CARBONDATA-1516; and the
> > design document attached in the JIRA ticket
> > (https://issues.apache.org/jira/browse/CARBONDATA-1516
> > https://issues.apache.org/jira/browse/CARBONDATA-1516;)
> >
> > This design is still in initial phase, proposed usage and SQL syntax are
> > subject to change. Please provide your comment to improve this feature.
> > Any suggestion on the design from community is welcomed.
> >
> > Regards,
> > Jacky Li
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION] Optimize the default value for some parameters

2017-10-25 Thread Ravindra Pesala
Hi Liang,

Now TABLE_BLOCKSIZE only limits the size of the carbondata file. It
is not considered for allocating tasks, so the value
of TABLE_BLOCKSIZE does not matter much.
But yes, we can consider changing it to 512M.

We can also change the default blocklet size
(carbon.blockletgroup.size.in.mb) to 128MB. Currently, it is only
64MB. Since the task allocation is derived from blocklets, it is
better to increase the blocklet size. And we should also add a table level
property for blocklet size, to configure it while creating a table.
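For example, both could then be set per table (TABLE_BLOCKSIZE already exists,
with the value in MB; the blocklet size property is only proposed here, so its
name is illustrative):

  CREATE TABLE sales (id INT, country STRING, amount DOUBLE)
  STORED BY 'carbondata'
  TBLPROPERTIES ('TABLE_BLOCKSIZE'='512', 'TABLE_BLOCKLET_SIZE'='128');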

Regards,
Ravindra.

On 11 October 2017 at 13:36, Liang Chen  wrote:

> Hi All
>
> As you know, some default value of parameters need to adjust for most of
> cases, this discussion is for collecting which parameters' default value
> need to be optimized:
>
> 1. TABLE_BLOCKSIZE:
> current default is 1G, propose to adjust to 512M
>
> 2.
> Please append at here if you propose to adjust which parameters' default
> value .
>
> Regards
> Liang
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION] Optimize the default value for some parameters

2017-10-25 Thread Ravindra Pesala
Hi,

Yes, it is a good suggestion; we can plan to set the number of loading cores
dynamically as per the available executor cores. Can you please raise a
JIRA for it?
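
For reference, until that is automated, the value can still be set explicitly
for a load, e.g. in carbon.properties, or via SET syntax if the property is made
session-configurable (shown here only for illustration):

  SET carbon.number.of.cores.while.loading=6;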

Regards,
Ravindra

On 25 October 2017 at 12:08, xm_zzc <441586...@qq.com> wrote:

> Hi:
>   If we are using carbondata + spark to load data, we can set
> carbon.number.of.cores.while.loading to the  number of executor cores.
>
>   When set the number of executor cores to 6, it shows that there are at
> least 6 cores per node for loading data, so we can set
> carbon.number.of.cores.while.loading to 6 automatically.
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


[Discussion] Merging carbonindex files for each segments and across segments

2017-10-20 Thread Ravindra Pesala
Hi,

Problem:
The first-time query of carbon becomes very slow, because many small
carbonindex files are read and cached in the driver the first time.
Many carbonindex files are created in two cases:
Case 1: Loading data in a large cluster
  For example, if the cluster size is 100 nodes, then for each load 100
index files are created per segment. So after 100 loads, the number of
carbonindex files becomes 10,000 (100 loads x 100 files per segment).
Case 2: Frequent loads
  For example, if a load happens every 5 minutes in a 4 node cluster,
there will be more than 10,000 index files after 10 days, even in a 4 node cluster.

It will be slower to read all these files from the driver, since it requires a lot of
namenode calls and IO operations.

Solution:
Merge the carbonindex files at two levels, so that we can reduce the IO
calls to the namenode and improve the read performance.

Level 1: Merge within a segment.
Merge the carbonindex files into a single file immediately after a load completes
within the segment. It would be named as a .carbonindexmerge file. It is
actually not a true data merge but a simple file merge, so the
current structure of the carbonindex files does not change. While reading, we
just read one file instead of many carbonindex files within the segment.

Level 2: Merge across segments.
The already merged carbonindex files of each segment would be merged again
after a configurable number of segments is reached. These files are placed
under the metadata folder of the table, and the information of these merged
carbonindex files will be updated in the table status file. While reading
the carbonindex files, we first check the tablestatus for the availability
of the merged file and read using the information available in it.
For example, if the configurable number of segments across which to merge index
files is 100, then for every 100 segments one new merged index file will be
created under the metadata folder and the tablestatus of these 100 segments is
updated with the information of this file.
This file is not updatable and it would be removed only if all the segments
of this merged index file are removed. This file is also a simple file merge,
not an actual data merge. By default this is disabled and the user can
enable it from the carbon properties.

There is also an issue with the driver cache for old segments. It is not
necessary to cache the old segments if queries are not interested in them.
I will start another discussion for this cache issue.

-- 
Thanks & Regards
Ravindra


Re: [Discussion] Carbon Store abstraction

2017-10-20 Thread Ravindra Pesala
Hi Jacky,

Thank you for steering this activity. Yes, there is a need to refactor the
code to move store management out of the spark integration module. It
becomes difficult to add another integration module if there is no clear
API for store management.
Please find my comments.
1. Is it really necessary to extract three modules? I think we can create a
carbon-store-management module and keep table, segment, compaction and data
management in it.
2. We had better rename the current carbon-core module to carbon-scan or
carbon-io, since we are extracting all store management out of it.
3. Table status creation and updating should also belong to segment
management.
4. I think the data loading map function is CarbonOutputFormat, and it
should belong to the carbon-processing and carbon-hadoop modules.

I think it is better to have an interface document for the public APIs we
are going to expose, so that it is easy to check whether the Presto and
Hive integration needs will be satisfied or not.
Since it is big work, we had better split the JIRAs so that they are
independent of each other and can be done across versions. Multiple people
can then work on them in parallel.
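
To make the discussion concrete, a rough sketch of what such a store-management API
could look like is below. TableInfo is the existing metadata class referenced in the
quoted mail; the trait and method names are illustrative assumptions for the
interface document, not an agreed design:

  import org.apache.carbondata.core.metadata.schema.table.TableInfo

  trait CarbonStore {
    // Table management
    def createTable(tableInfo: TableInfo, tablePath: String): Unit
    def dropTable(tablePath: String): Unit
    def getTable(tablePath: String): TableInfo
    def tableExists(tablePath: String): Boolean

    // Segment management (transactional)
    def openSegment(tablePath: String): String                      // returns new segment id
    def commitSegment(tablePath: String, segmentId: String): Unit
    def closeSegment(tablePath: String, segmentId: String): Unit
    def deleteSegment(tablePath: String, segmentId: String): Unit
  }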

Regards,
Ravindra.

On 20 October 2017 at 14:42, Jacky Li  wrote:

> The markup format in earlier mail is incorrect. Please refer to this one.
>
> carbondata-store is responsible for providing the following interfaces:
> 1. Table management:
> - Initialize and persist table metadata when integration module create
> table. Currently, the metadata includes `TableInfo`. Table path should be
> specified by integration module
> - Delete metadata and data in table path when integration module drop
> table
> - Retrieve `TableInfo` from table path
> - Check whether table exists
> - Alter metadata in `TableInfo`
> 2. Segment management. (Segment is operated in transactional way)
> - Open new segment when integration module load new data
> - Commit segment when data operation is done successfully
> - Close segment when data operation failed
> - Delete segment when integration module drop segment
> - Retrieve segment information by giving segmentId
> 3. Compaction management
> - Compaction policy for deciding whether compaction should be carried
> out
> 4. Data operation (carbondata-store provides map functions in map-reduce
> manner)
> - Data loading map function
> - Delete segment map function
> - other operation that involves map side operation. (basically, it is
> the `internalCompute` function in all RDD in current spark integration
> module)
>
>
> > On 20 October 2017, at 16:56, Raghunandan S wrote:
> >
> > I think we need to integrate with Presto and Hive first and then
> > refactor. This gives a clear idea of what we want to achieve. Each
> > processing engine is different in its own way, and integrating first
> > would give us a clear idea of what is required in CarbonData.
> > On Fri, 20 Oct 2017 at 1:01 PM, Liang Chen 
> wrote:
> >
> >> Hi
> >>
> >> Thank you for starting this discussion. I agree; to expose a clear
> >> interface to users, there is some optimization work to do.
> >>
> >> Can you list more details about your proposal? For example: which
> >> classes you propose to move to the carbon store, and which APIs you
> >> propose to create and expose to users.
> >> I suggest we discuss and confirm your proposal on dev first, then start
> >> to create sub-tasks in JIRA.
> >>
> >> Regards
> >> Liang
> >>
> >>
> >> Jacky Li wrote
> >>> Hi community,
> >>>
> >>> I am proposing to create a carbondata-store module to abstract the
> carbon
> >>> store concept. The reason is:
> >>>
> >>> 1. Initially, carbon was designed as a file format; as it evolved to
> >>> provide more features, it implemented more and more functionality in
> >>> the spark integration module. However, as the community tries to
> >>> integrate more and more compute frameworks with carbon, this
> >>> functionality is duplicated across the integration layers. Ideally,
> >>> this functionality can be unified and provided in one place.
> >>>
> >>> 2. The current interface of carbondata exposed to users is SQL, but
> >>> the developer interface for those who want to do compute engine
> >>> integration is not very clear.
> >>>
> >>> 3. There are many SQL commands that carbon supports, but they are
> >>> implemented through spark RDDs only, so they are not sharable across
> >>> compute frameworks.
> >>>
> >>> Due to these reasons, for the long-term future of carbondata, I think
> >>> it is better to abstract the interface for compute engine integration
> >>> within a new module called carbondata-store. It can wrap all
> >>> store-level functionality above the file format in a module that is
> >>> independent of the compute engine, so that every integration module
> >>> can depend on it and duplicate code is removed.
> >>>
> >>> This is a continuous effort 

Re: [Discussion] Support pre-aggregate table to improve OLAP performance

2017-10-17 Thread Ravindra Pesala
Hi Bhavya,

For a pre-aggregate table load, we will not delete the old data and
recalculate the aggregation every time. Aggregation tables are also loaded
incrementally along with the main table. For example, if we create an
aggregation table on the main table, the aggregation table is calculated
and loaded from the existing data of the main table. For subsequent loads
on the main table, the aggregation table is also calculated incrementally,
only for the new data, and loaded as a new segment.
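
As a concrete illustration, a hypothetical sequence is sketched below. The exact
pre-aggregate DDL was still under design in this thread, so the CREATE statement and
all table/column/paths here are assumptions, not the agreed syntax; a Carbon-enabled
SparkSession named `spark` is assumed:

  // Create a pre-aggregate table on the main table once; it is built from the
  // data already present in the main table.
  spark.sql("""
    CREATE DATAMAP sales_agg ON TABLE sales
    USING 'preaggregate'
    AS SELECT country, sum(amount) FROM sales GROUP BY country
  """)

  // Each subsequent load on the main table aggregates only the newly loaded
  // data and writes it as a new segment of the pre-aggregate table.
  spark.sql("LOAD DATA INPATH 'hdfs://host/data/sales_2.csv' INTO TABLE sales")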

Regards,
Ravindra.

On 17 October 2017 at 13:34, Bhavya Aggarwal  wrote:

> Hi Dev,
>
> For the pre-aggregate tables, how will we handle subsequent loads? Will we
> be running the query on the whole table, calculating the aggregations
> again, and then deleting the existing segment and creating new segments
> for the whole data? With that approach, as the data in the main table
> increases, the loading time will also increase substantially. The other
> way is to intelligently determine the new values by querying the latest
> segment and using them together with the existing pre-aggregated tables.
> Please share your thoughts about it in this discussion.
>
> Regards
> Bhavya
>
> On Mon, Oct 16, 2017 at 4:53 PM, Liang Chen 
> wrote:
>
> > +1, I agree with Jacky's points.
> > As we know, carbondata is already able to get very good performance for
> > filter query scenarios through the MDK index. Supporting pre-aggregate
> > in 1.3.0 would improve aggregated query scenarios, so users can use one
> > carbondata table to support all query cases (both filter and agg).
> >
> > To Lu Cao: you mentioned a solution that builds a cube schema; it is too
> > complex and there are many limitations, for example the CUBE data can't
> > support querying detail data, etc.
> >
> > Regards
> > Liang
> >
> >
> > Jacky Li wrote
> > > Hi Lu Cao,
> > >
> > > In my previous experience with "cube" engines, no matter whether it is
> > > ROLAP or MOLAP, it is something above the SQL layer, because it not
> > > only needs the user to establish a cube schema by transforming
> > > metadata from the data warehouse star schema, but the engine also
> > > defines its own query language like MDX, and many times these
> > > languages are not standardized, so different vendors need to provide
> > > different BI tools or adaptors for it.
> > > So, although some vendors provide easy-to-use cube management tools,
> > > this has at least two problems: vendor lock-in and the rigidity of the
> > > cube model once it is defined. I think these problems are similar in
> > > other vendor-specific solutions.
> > >
> > > Currently one of the strengths the carbon store provides is that it
> > > complies with standard SQL by integrating with SparkSQL, Hive, etc.
> > > The intention of providing pre-aggregate table support is that it can
> > > enable carbon to improve OLAP query performance while still sticking
> > > with standard SQL support; it means all users can still use the same
> > > BI/JDBC applications/tools that connect to SparkSQL, Hive, etc.
> > >
> > > If carbon were to support a "cube", it would not only need to define
> > > its configuration, which may be very complex and non-standard, but it
> > > would also force users to use vendor-specific tools for management and
> > > visualization.
> > > So, I think before going to this complexity, it is better to provide
> > > pre-agg tables as the first step.
> > >
> > > Although we do not want the full complexity of a "cube" on an
> > > arbitrary data schema, one special case is timeseries data. Because
> > > the time dimension hierarchy (year/month/day/hour/minute/second) is
> > > naturally understandable and consistent in all scenarios, we can
> > > provide native support for pre-aggregate tables on the time dimension.
> > > It is actually a cube on time, and we can do automatic rollup for all
> > > levels of time.
> > >
> > > Finally, please note that by using CTAS syntax we are not restricting
> > > carbon to support only pre-aggregate tables; it can also support
> > > arbitrary materialized views, if we want, in the future.
> > >
> > > Hope this make things more clear.
> > >
> > > Regards,
> > > Jacky
> > >
> > >
> > >
> > >  like mandarin provides, Actually, as you can see in the document, I am
> > > avoiding to call this “cube”.
> > >
> > >
> > >> On 15 October 2017, at 21:18, Lu Cao (whucaolu@) wrote:
> > >>
> > >> Hi Jacky,
> > >> If a user wants to create a cube on the main table, does he/she have
> > >> to create multiple pre-aggregate tables? It will be a heavy workload
> > >> to write so many CTAS commands. If the user only needs to create a
> > >> few pre-agg tables, current carbon can already support this
> > >> requirement: the user can create a table first and then use an insert
> > >> into select statement. The only difference is that the user needs to
> > >> query the pre-agg table instead of the main table.
> > >>
> > >> So maybe we can enable the user to create a cube model (in the schema
> > >> or a metafile?)
> > >> which contains multiple 

Re: DataMap Interface requires `IndexColumns` as Input

2017-10-09 Thread Ravindra Pesala
Hi,

The indexed columns on which a datamap is created are present in
DataMapFactory; you can check the getMeta method. By using the filter
expression tree during pruning, we can get the filter columns and prune the
related datamap.
Please don't refer to PR 1399 yet, as it is still incomplete and many
things will change in it.
We are updating the DataMap interfaces again to support FG (fine-grained)
datamaps, i.e. storing and retrieving row-id tuples from the datamap. We
will add a proper example for the same soon.
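
For readers following along, a small illustrative sketch of how an integration layer
could use the indexed-column information during pruning is below. The types and
signatures here are simplified stand-ins defined locally, not the real CarbonData
interfaces (which were still changing at this point):

  // Simplified stand-in for the metadata a datamap factory exposes
  case class DataMapMeta(indexedColumns: Seq[String], optimizedOperations: Seq[String])

  trait SimpleDataMapFactory {
    def getMeta: DataMapMeta
  }

  // Consider a datamap only if it indexes at least one of the filter columns
  def isUsableFor(factory: SimpleDataMapFactory, filterColumns: Set[String]): Boolean =
    factory.getMeta.indexedColumns.exists(filterColumns.contains)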

Regards,
Ravindra.

On 9 October 2017 at 23:22, Dong Xie  wrote:

> Hi,
>
> The Datamap API currently misses an important input parameter,
> `IndexColumns`. It is common that we only want to implement one type of
> DataMap but apply it to different data and different column sets. In PR
> 1399, there are no specified index columns. I think it would be nice to
> include that in the API.
>
> Thanks,
> Dong




-- 
Thanks & Regards,
Ravi


Re: [DISCUSSION] support user specified segment reading for query

2017-10-05 Thread Ravindra Pesala
Hi,

Instead of using a SET command to select segments, why don't you use a
QUERY HINT? Using a query hint, we can mention the segments inside the
query itself.

For example:  SELECT /*+ SEGMENTS(1,3,5) */ * FROM t1

By using the above custom hint we can query only the selected segments.
Hints are also supported in Spark, and this concept will be helpful in any
future optimizations.

Regards,
Ravindra.

On 5 October 2017 at 12:22, Rahul Kumar  wrote:

> @Jacky please find the reply of your doubts as follow :
>
>
> 1. If a user uses the following commands in two different beeline
> sessions, will there be a problem due to multithreading?
>   SET carbon.input.segments.default.carbontable=1,3,5;
>   select * from carbontable;
>   SET carbon.input.segments.default.carbontable=*;
>
> Ans: In case of multithreading, yes, there will be a problem.
>
>   So threadSet() can be used to set the same property in multithread
> mode.
>   The following syntax can be used to set segment ids in multithread
> mode:
>   Syntax: CarbonSession.threadSet("carbon.input.segments.<database_name>.<table_name>",
>           "<list of segment ids>")
>   e.g. =>
>     future {
>       CarbonSession.threadSet("carbon.input.segments.default.carbontable", "1,3,5")
>       sparkSession.sql("select * from carbontable").show
>       CarbonSession.threadSet("carbon.input.segments.default.carbontable", "*")
>     }
>
> The above will override the property at thread level, so the property
> will be set for each thread.
>
>
> 2. The RESET command is not clear; why is this needed? It seems SET
> carbon.input.segments.default.carbontable=* is enough, right? And what
> parameter does it have?
>
> Ans: The RESET command doesn't take any parameter. RESET is already
> implemented behavior that resets all properties to their default values;
> similarly, a RESET query will also set the above property to its default
> value.
>
>   Thanks and Regards
>
>   Rahul Kumar
>
>
>
> On Wed, Oct 4, 2017 at 7:21 PM, Jacky Li  wrote:
>
> > I have 2 doubts:
> > 1. If user uses following command in two different beeline session, will
> > there be problem due to multithreading?
> > SET carbon.input.segments.default.carbontable=1,3,5;
> > select * from carbontable;
> > SET carbon.input.segments.default.carbontable=*;
> >
> >
> > 2. The RESET command is not clear, why this is needed? It seems SET
> > carbon.input.segments.default.carbontable=* is enough, right? and what
> > parameter it has?
> >
> > Regards,
> > Jacky
> >
> > > On 4 October 2017, at 00:42, Rahul Kumar wrote:
> > >
> > > 
> >
> >
>



-- 
Thanks & Regards,
Ravi


[ANNOUNCE] Apache CarbonData 1.2.0 release

2017-09-29 Thread Ravindra Pesala
Hi All,

The Apache CarbonData PMC team is happy to announce the release of Apache
CarbonData version 1.2.0

1. Release Notes:
https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+1.2.0+Release

  Some key improvements in this release:

   1. Sort columns feature: it enables users to define that only the
   required columns (those used in query filters) are sorted while loading
   the data, which improves the loading speed (see the example after this
   list). Note: currently all data types are supported except decimal,
   float and double.
   2. Support 4 types of sort scope while creating the table: local sort,
   batch sort, global sort and no sort
   3. Support partition
   4. Optimize data update and delete for Spark 2.1
   5. Further improve performance by optimizing the measure filter feature
   6. DataMap framework to add custom indexes
   7. Ecosystem feature 1: support Presto integration
   8. Ecosystem feature 2: support Hive integration

You can follow this document to use these artifacts:
https://github.com/apache/carbondata/blob/master/docs/quick-start-guide.md

You can find the latest CarbonData document and learn more at:
http://carbondata.apache.org/


Thanks
The Apache CarbonData team


[VOTE] Apache CarbonData 1.2.0(RC3) release

2017-09-22 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.2.0 (RC3) for your vote.

1. Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12340260

Some key improvements in this release:

   1. Sort columns feature: it enables users to define that only the
   required columns (those used in query filters) are sorted while loading
   the data, which improves the loading speed. Note: currently all data
   types are supported except decimal, float and double.
   2. Support 4 types of sort scope while creating the table: local sort,
   batch sort, global sort and no sort
   3. Support partition
   4. Optimize data update and delete for Spark 2.1
   5. Further improve performance by optimizing the measure filter feature
   6. DataMap framework to add custom indexes
   7. Ecosystem feature 1: support Presto integration
   8. Ecosystem feature 2: support Hive integration


 2. The tag to be voted upon: apache-carbondata-1.2.0-rc3 (commit:
09e07296a8e2a94ce429f6af333a9b15abb785de)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.2.0-rc3

3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.2.0-rc3/

4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1023/

5. Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/ravipesala.asc

Please vote on releasing this package as Apache CarbonData 1.2.0. The vote
will be open for the next 72 hours and passes if a majority of
at least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 1.2.0
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...

Regards,
Ravindra.


Re: [DISCUSSION] Update the function of show segments

2017-09-21 Thread Ravindra Pesala
Hi,

I agree with Jacky and David.
But I suggest keeping the current 'show segments' command unchanged,
providing only brief information about segments, and adding a new extended
command like `extended show segments` to provide the additional information
required for power users only.
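
A hypothetical sketch of how the two levels could look from SQL, assuming a
Carbon-enabled SparkSession named `spark` (the brief form is the existing command;
the extended keyword and its output columns are only an illustration of the
suggestion, not a decided syntax):

  spark.sql("SHOW SEGMENTS FOR TABLE default.sales").show(false)
  // proposed extended variant, e.g. adding data/index file sizes and counts:
  spark.sql("EXTENDED SHOW SEGMENTS FOR TABLE default.sales").show(false)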

Regards,
Ravindra.

On 21 September 2017 at 09:03, David CaiQiang  wrote:

> I agree with Jacky.
>
> I think enhanced segment metadata will help us to understand the table.
>
> I suggest the following properties for segment metadata:
> 1. total data file size
> 2. total index file size
> 3. data file count
> 4. index file count
> 5. last modified time (last update time)
>
> Through this information, we can answer the following questions.
> 1. Is there a small-file issue? Does the table require compaction or not,
> and which compaction type should be used?
> 2. Are there too many index files? We can estimate the total size of the
> index in memory, and whether it is big or small for the driver memory
> configuration.
> 3. Does some segment have too many files? Maybe this is useful to locate
> some performance issues.
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Re: Fw:carbonthriftserver can not be load many times

2017-09-13 Thread Ravindra Pesala
Hi,

I am a bit confused here.

Are steps 1 and 2 done through one beeline session, and steps 3, 4 and 5
done from another beeline session?

Also, can you try it on the current master branch to see if the same issue
exists?


Regards,
Ravindra.

On 13 September 2017 at 15:14, dylan  wrote:

>
> hello ravipesala:
> thanks for your reply,
> I am using carbondata version 1.1.0 and spark version 1.6.0,
> and I reproduced the issue again following the official quick-start-guide
> steps:
> 1.Creating a Table
> cc.sql("create table IF NOT EXISTS  carbondb.test_table(id string,name
> String,city String,age int) stored by 'carbondata' ")
>
> 2.load data into table
>   cc.sql("load data inpath 'hdfs://nameservice1/user/zz/sample.csv' into
> table carbondb.test_table")
>
> 3.start carbonthriftserver
>   /home/zz/spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
>     --master local[*] \
>     --driver-java-options="-Dcarbon.properties.filepath=/home/zz/spark-1.6.0-bin-hadoop2.6/conf/carbon.properties" \
>     --executor-memory 4G --driver-memory 2g \
>     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>     --conf "spark.sql.shuffle.partitions=3" \
>     --conf spark.speculation=true \
>     --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
>     /home/zz/spark-1.6.0-bin-hadoop2.6/carbonlib/carbondata_2.10-1.1.0-shade-hadoop2.2.0.jar \
>     hdfs://nameservice1/user/zz/rp_carbon_store
>
>4.Connecting to CarbonData Thrift Server Using Beeline.
>
>  
>
>5.drop table
>cc.sql("drop table carbondb.test_table")
>
>6.recreate table and load data
> cc.sql("create table IF NOT EXISTS  carbondb.test_table(id string,name
> String,city String,age int) stored by 'carbondata' ")
> cc.sql("load data inpath 'hdfs://nameservice1/user/zz/sample.csv' into
> table carbondb.test_table")
>
>7. Select data using beeline.
> 
>The same error as above appears; the cache is not updated.
>
>And finally I want to ask a question:
>if I skip step 5 and reload the data directly, querying the data is ok,
> but the data is appended, not overwritten. Is it designed like this, or is
> it a bug?
>
>
> Trouble to help me see thank you!
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


Re: [ANNOUNCE] Lu Cao as new Apache CarbonData committer

2017-09-13 Thread Ravindra Pesala
Congratulations Lu Cao. Welcome

Regards,
Ravindra
On Wed, 13 Sep 2017 at 7:44 PM, Bhavya Aggarwal  wrote:

> Congrats Lu Cao ..
>
> Thanks and regards
> Bhavya
>
> On Wed, Sep 13, 2017 at 7:30 PM, Raghunandan S <
> carbondatacontributi...@gmail.com> wrote:
>
> > Congrats lu cao.
> > On Wed, 13 Sep 2017 at 7:18 PM, Liang Chen 
> > wrote:
> >
> > > Hi all
> > >
> > > We are pleased to announce that the PMC has invited Lu Cao as new
> > > Apache
> > > CarbonData committer, and the invite has been accepted !
> > >
> > >Congrats to Lu Cao and welcome aboard.
> > >
> > > Regards
> > > The Apache CarbonData PMC
> > >
> > >
> > >
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>


Re: Fw:carbonthriftserver can not be load many times

2017-09-12 Thread Ravindra Pesala
Hi,

This is not the intended behavior of carbondata; it must be a bug. Usually,
when you update, the cache refreshes for the next query.
Please provide the following information:
1. Carbondata and Spark version you are using.
2. Testcase to reproduce this issue.

Regards,
Ravindra.

On 12 September 2017 at 14:18, dylan  wrote:

>
>
>
>
>
>
>  Forwarding messages 
> From: "dylan" 
> Date: 2017-09-12 16:25:56
> To: user 
> Subject: carbonthriftserver can not be load many times
> hello:
>  when I use carbondata, I use the following steps:
> 1. create table and load data
> 2. use carbonthriftserver, select * from table limit 1 (it's ok)
> 3. update the table
> 4. use carbonthriftserver, select * from table limit 1 (it fails)
> with the following error:
>
>I know the carbonthriftserver uses a btree to cache the carbonindex,
>and when I update the table the index changes, but the
> carbonthriftserver doesn't know it has changed,
>so every time I have to restart the carbonthriftserver. Do you run
> into this problem?
>Is this a design flaw, or is there better advice to help me solve
> this problem? thanks!
>
>
>
>
>
>
>



-- 
Thanks & Regards,
Ravi


Re: Add an option such as 'carbon.update.storage.level' to configurate the storage level when updating data with 'carbon.update.persist.enable'='true'

2017-09-07 Thread Ravindra Pesala
Hi,

I don't see any problem in adding the option, although MEMORY_AND_DISK is
the preferable default. You can keep it as a developer-only option; there
is no need to expose it to the user.
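
A minimal sketch of how such an option could be wired up, assuming the property name
from the thread subject and defaulting to MEMORY_AND_DISK (the lookup code below is
illustrative, not the actual change):

  import org.apache.spark.storage.StorageLevel
  import org.apache.carbondata.core.util.CarbonProperties

  // Read the configured storage level, falling back to MEMORY_AND_DISK
  val levelName = CarbonProperties.getInstance()
    .getProperty("carbon.update.storage.level", "MEMORY_AND_DISK")
  val storageLevel = StorageLevel.fromString(levelName)
  // e.g. used when persisting the dataset during an update: df.persist(storageLevel)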

Regards,
Ravindra.

On 5 September 2017 at 11:15, xm_zzc <441586...@qq.com> wrote:

> I have searched and there was no other similar problem except this.
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> n5.nabble.com/
>



-- 
Thanks & Regards,
Ravi


[VOTE] Apache CarbonData 1.2.0(RC1) release

2017-09-01 Thread Ravindra Pesala
Hi

I submit the Apache CarbonData 1.2.0 (RC1) for your vote.

1. Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12340260

Some key improvements in this release:

   1. Sort columns feature: supports users defining any columns to become
   the MDK (Multi-Dimension Key) index. Note: currently all data types are
   supported except decimal, float and double.
   2. Support 4 types of sort scope: local sort, batch sort, global sort,
   no sort
   3. Support partition
   4. Optimize data update and delete for Spark 2.1
   5. Further improve performance by optimizing the measure filter feature
   6. DataMap framework to add custom indexes
   7. Ecosystem feature 1: support Presto integration
   8. Ecosystem feature 2: support Hive integration


 2. The tag to be voted upon: apache-carbondata-1.2.0-rc1 (commit:
52984c8707908004a0951761bf58bd6e14581194)
https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.2.0-rc1

3. The artifacts to be voted on are located here:
https://dist.apache.org/repos/dist/dev/carbondata/1.2.0-rc1/

4. A staged Maven repository is available for review at:
https://repository.apache.org/content/repositories/orgapachecarbondata-1021

5. Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/ravipesala.asc

Please vote on releasing this package as Apache CarbonData 1.2.0. The vote
will be open for the next 72 hours and passes if a majority of
at least three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 1.2.0
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...
-- 
Thanks & Regards,
Ravindra

