Re: Regarding Carbondata Benchmarking & Feature presentation

2020-09-17 Thread Ajantha Bhat
Hi Vimal,

*We have archived the latest presentation in the wiki now.*
https://cwiki.apache.org/confluence/display/CARBONDATA/Carbondata+2.0+Release+Meetup
Please check and let us know if any questions.

Regarding the performance report: the latest report will take some more
time. You can find the old reports archived in the wiki.
Once the latest report is ready, we will share it with you.
The summary of the performance tests is:
TPC-DS queries on a basic table (no sort, no secondary index, no
materialized view) are on par with or better than other formats.
With global sort, SI and MV, queries perform much better; a minimal DDL
sketch of these accelerators is below.
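For quick reference (table, column and object names are illustrative, and
the DDL follows my reading of the Carbondata 2.0 SQL syntax):

import org.apache.spark.sql.SparkSession

// Illustrative names only.
val spark = SparkSession.builder().appName("carbon-ddl-sketch").getOrCreate()

// Basic table, upgraded with global sort on a commonly filtered column.
spark.sql("""
  CREATE TABLE sales (
    id BIGINT, customer_id STRING, amount DOUBLE, sale_date DATE
  )
  STORED AS carbondata
  TBLPROPERTIES ('SORT_SCOPE'='GLOBAL_SORT', 'SORT_COLUMNS'='customer_id')
""")

// Secondary index (SI) on a filter column.
spark.sql("CREATE INDEX si_customer ON TABLE sales (customer_id) AS 'carbondata'")

// Materialized view (MV) for a frequent aggregation.
spark.sql("""
  CREATE MATERIALIZED VIEW mv_daily_sales AS
  SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date
""")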

Thanks,
Ajantha

On Thu, Sep 17, 2020 at 11:57 AM Ajantha Bhat  wrote:

> Hi, thanks for planning to propose carbon.
>
> Please join our Slack to discuss directly with the members as well:
>
> https://join.slack.com/t/carbondataworkspace/shared_invite/zt-g8sv1g92-pr3GTvjrW5H9DVvNl6H2dg
>
> We will get back to you on the presentations and benchmarks.
>
> Thanks,
> Ajantha
>
> On Thu, Sep 17, 2020 at 11:42 AM Vimal Das Kammath <
> vimaldas.kamm...@gmail.com> wrote:
>
>> Hi Carbondata Team,
>>
>> I am working on proposing Carbondata to the Data Analytics team at Uber.
>> It would be great if any of you could share the latest benchmarking and
>> feature/design presentation.
>>
>> Regards,
>> Vimal
>>
>


Re: [Discussion]Query Regarding Task launch mechanism for data load operations

2020-09-17 Thread VenuReddy
Hi Vishal,

Thank you for the response.
Configuring `load_min_size_inmb` has helped to control the number of tasks
launched in the case of load from CSV, and eventually reduced the number of
carbondata files as well.
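
For reference, roughly how I set the option (path, table name and size are
illustrative; assumes a SparkSession `spark` is in scope):

// Sketch: CSV load with load_min_size_inmb, so that each node gets at
// least this much data per task, reducing the task count and the number
// of output carbondata files.
spark.sql("""
  LOAD DATA INPATH 'hdfs://ns1/data/src.csv'
  INTO TABLE target_table
  OPTIONS ('load_min_size_inmb'='256')
""")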


But in the case of the insert-into-select flow `loadDataFrame()`, the
problem didn't get resolved, as it has a completely different task launching
approach (not the same as in `loadDataFile()`). Do you have suggestions
about any parameter to fine-tune in the insert flow?

1. Is there any way to launch more than 1 task per node?
 
2. Is there any way to control the number of output carbondata files for the
target table when there are too many small carbondata files to read/select
from the source table? Otherwise it generates as many output files as there
are input files.
-> I tried the carbon property
`carbon.task.distribution`=`merge_small_files`, which could reduce the
number of files generated for the target table. The scan RDD with
CARBON_TASK_DISTRIBUTION_MERGE_FILES uses a similar mechanism to the global
partition load (it considers filesMaxPartitionBytes, filesOpenCostInBytes
and defaultParallelism for the split size).
But this property cannot be configured dynamically, probably for some
reason? I am not sure whether it is a good option to use in this scenario;
a sketch of how I set it follows below.
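
For reference, how I set the property, together with my understanding of
the split sizing it triggers (the formula mirrors Spark's file-partition
logic; the carbon internals may differ):

import org.apache.carbondata.core.util.CarbonProperties

// Static carbon property; as noted above, it does not appear to be
// settable dynamically per session.
CarbonProperties.getInstance()
  .addProperty("carbon.task.distribution", "merge_small_files")

// My understanding of the split sizing that then applies: small files
// are packed together into splits of roughly maxSplitBytes each.
def maxSplitBytes(totalBytes: Long,
                  filesMaxPartitionBytes: Long, // spark.sql.files.maxPartitionBytes
                  filesOpenCostInBytes: Long,   // spark.sql.files.openCostInBytes
                  defaultParallelism: Int): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(filesMaxPartitionBytes, math.max(filesOpenCostInBytes, bytesPerCore))
}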

Any suggestions would be very helpful.

regards,
Venu





Re: Clean files enhancement

2020-09-17 Thread Ravindra Pesala
-1

I don’t see any reason why we should use trash. How does it change the
behaviour?
1. Are you still going with automatic clean up? If yes, then you are adding
extra time to move the data to trash (for the S3 file system).
2. Even if you move the data and keep the time-to-live as 3 days in the
trash, what if the user realises that the data is wrong or lost after that
time period?

Regards,
Ravindra


On Thu, 17 Sep 2020 at 3:12 PM, Vikram Ahuja 
wrote:

> Hi all,
>
> After considering all the suggestions, the trash folder mechanism in
> carbondata will be implemented in 2 phases.
>
> Phase 1:
>
> 1. Create a generic trash folder at table level. Trash folders will be
> hidden/invisible (like .trash or .recyclebin). The trash folder will be
> stored in the table directory.
>
> 2. If we delete any file/folder from a table, it will be moved to the
> trash folder of that corresponding table (the call for adding to trash
> will be added in the FileFactory delete APIs).
>
> 3. A trash manager will be created, which will keep track of all the
> files that have been deleted and moved to the trash and will also maintain
> the time when each was deleted. All the trash manager's APIs will be called
> from the FileFactory class.
>
> 4. On the clean files command, the trash folders will be cleared if the
> expiry time has been met. Each file moved to the trash will have an
> expiration time associated with it.
>
> Phase 2: For phase 2, more enhancements are planned and will be
> implemented after phase 1 is completed. The plan for phase 2 development
> and changes will be posted in this mail thread.
>
> Thanks
> Vikram Ahuja
>
> On Wed, Sep 16, 2020 at 8:43 AM PickUpOldDriver 
> wrote:
>
> > Hi Vikram,
> >
> > I agree with building a trash folder, +1.
> >
> > Currently, the data loading/compaction/update/merge flows have automatic
> > file cleaning actions, but they are written separately. Most of them are
> > aimed at deleting stale segments (MARKED_FOR_DELETE/COMPACTED), and they
> > rely on the table status being precise. If you could build a general
> > clean files function, it could substitute the current automatic deletion
> > of stale folders.
> >
> > Besides, having a trash folder handled by Carbondata will be good; we
> > can find the deleted segments through this API.
> >
> > And I think we should also consider the INSERT_IN_PROGRESS &
> > INSERT_OVERWRITE_IN_PROGRESS statuses.
>

-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-17 Thread Vikram Ahuja
Hi all,
After considering all the suggestions, the trash folder mechanism in
carbondata will be implemented in 2 phases.
Phase 1:
1. Create a generic trash folder at table level. Trash folders will be
hidden/invisible (like .trash or .recyclebin). The trash folder will be
stored in the table directory.
2. If we delete any file/folder from a table, it will be moved to the trash
folder of that corresponding table (the call for adding to trash will be
added in the FileFactory delete APIs; a minimal sketch follows this list).
3. A trash manager will be created, which will keep track of all the files
that have been deleted and moved to the trash and will also maintain the
time when each was deleted. All the trash manager's APIs will be called from
the FileFactory class.
4. On the clean files command, the trash folders will be cleared if the
expiry time has been met. Each file moved to the trash will have an
expiration time associated with it.
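
To make the intent concrete, a minimal sketch of the move-to-trash step
(object and folder names are hypothetical; in the actual implementation
this logic would sit behind the FileFactory delete APIs):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch of step 2: instead of deleting, move the target
// into the table-level hidden trash folder. The timestamp prefix in the
// destination name records the time of deletion, so that the clean files
// command can later check it against the expiry time (step 4).
object TrashSketch {
  def moveToTrash(tablePath: String, target: Path, conf: Configuration): Path = {
    val fs = FileSystem.get(target.toUri, conf)
    val trashDir = new Path(tablePath, ".trash")
    if (!fs.exists(trashDir)) {
      fs.mkdirs(trashDir)
    }
    val dest = new Path(trashDir, s"${System.currentTimeMillis()}_${target.getName}")
    fs.rename(target, dest)
    dest
  }
}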

Phase 2: For phase 2, more enhancements are planned and will be implemented
after phase 1 is completed. The plan for phase 2 development and changes
will be posted in this mail thread.


Thanks
Vikram Ahuja


On Wed, Sep 16, 2020 at 8:43 AM PickUpOldDriver 
wrote:

> Hi Vikram,
>
> I agree with building a trash folder, +1.
>
> Currently, the data loading/compaction/update/merge flows have automatic
> file cleaning actions, but they are written separately. Most of them are
> aimed at deleting stale segments (MARKED_FOR_DELETE/COMPACTED), and they
> rely on the table status being precise. If you could build a general clean
> files function, it could substitute the current automatic deletion of
> stale folders.
>
> Besides, having a trash folder handled by Carbondata will be good; we can
> find the deleted segments through this API.
>
> And I think we should also consider the INSERT_IN_PROGRESS &
> INSERT_OVERWRITE_IN_PROGRESS statuses.
>