Re: [DISCUSSION] Distributed Index Cache Server

2019-02-19 Thread Kunal Kapoor
Hi xuchuanyin,
I have uploaded version 2 of the design document with the desired changes.
Please review and let me know if anything is missing or needs to be changed.

Thanks
Kunal Kapoor

On Mon, Feb 18, 2019 at 12:15 PM Kunal Kapoor 
wrote:

> Hi xuchuanyin,
> I will expose an interface and put it in the design document soon.
>
> Thanks for the feedback
> Kunal Kapoor
>
>
> On Wed, Feb 13, 2019 at 8:04 PM ChuanYin Xu 
> wrote:
>
>> Hi Kunal, I think we can go further on 2.3 & 4.
>>
>> For 4, I think all functions of the IndexServer should be in an individual
>> module. We can think of the IndexServer as an enhancement component for
>> CarbonData, and inside that module we would handle the actual pruning
>> logic. Conversely, if we do not have this component, there will be no
>> pruning at all.
>>
>> As a consequence, for 2.3, I think the IndexServer should provide
>> interfaces that offer pruning services: for example, an interface that
>> accepts filter expressions and returns the pruning result.
>>
>> I think only in this way can the IndexServer be extensible enough to meet
>> more demanding requirements.
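
For illustration, one possible shape for such a pruning interface, sketched
in Scala; the trait and method names are hypothetical, and Expression /
ExtendedBlocklet merely stand in for whatever filter and result types the
design document finally settles on:

    import org.apache.carbondata.core.indexstore.ExtendedBlocklet
    import org.apache.carbondata.core.scan.expression.Expression

    // Sketch only, not the actual API: the IndexServer module exposes
    // pruning as a service that accepts a filter expression for a table
    // and returns the blocklets that may still match it.
    trait IndexPruneService {
      def prune(tableUniqueName: String,
                filter: Expression): Seq[ExtendedBlocklet]
    }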


[Discussion] How to pass some options into Insert Into command

2019-02-19 Thread David CaiQiang
Hi all,
For data loading, we can pass options into the LOAD DATA command through
its OPTIONS clause, but the INSERT INTO command cannot accept them.

How can we pass options into the INSERT INTO command? Some candidate
approaches:
1. Implement an OPTIONS clause for the INSERT INTO command.
2. Use a query hint.
3. Use SET key=value at the session level.
4. Use other methods that achieve the same result, for example
"clustered by random(2)" to implement "GLOBAL_SORT_PARTITIONS"="2".
5. ??

Any suggestions?
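
For illustration, hypothetical syntax for approaches 1 and 3; neither
exists today, and the property key in the second sketch is an assumption,
not a current CarbonData option:

    -- approach 1: an OPTIONS clause on INSERT INTO (hypothetical)
    INSERT INTO target_table OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
    SELECT * FROM source_table;

    -- approach 3: a session-level property (hypothetical key name)
    SET carbon.insert.global.sort.partitions=2;
    INSERT INTO target_table SELECT * FROM source_table;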



-
Best Regards
David Cai


Re: [DISCUSSION] Support OBSFS

2019-02-19 Thread sujith chacko
Hi Manish,

 May I know which version of Hadoop will provide OBSFS support? Is this
file system already supported in recent Hadoop releases?

Thanks
Sujith.

On Tue, 19 Feb 2019 at 4:18 PM, manish nalla 
wrote:

> Hi all,
>
> OBS is an Object-based Storage Service developed and maintained by
> HuaweiCloud. It provides large storage capacity and is capable of storing
> any type of file. OBS supports connections through both the S3 client and
> the OBS client.
>
> Currently CarbonData supports HDFS and S3 as file systems, and I am
> proposing to support OBSFS as another file system because of a few
> drawbacks of S3FileSystem.
>
> CarbonData needs OBSFS instead of S3 for two main reasons:
> 1. Append: to append, we first have to read the whole object and then
> rewrite it with the appended data, which is quite slow.
> 2. Atomic rename: S3 has no atomic rename, as also mentioned in JIRA
> [CARBONDATA-2670].
>
> So both these issues can be fixed if we use OBSFileSystem.
>
> Any suggestions from the community will be greatly appreciated. I would be
> uploading the design doc shortly.
>
> Thanks and regards
> Manish Nalla
> EI BigData Kernel,
> Huawei Technologies India Pvt. Ltd
>


Re: [DISCUSSION] Support OBSFS

2019-02-19 Thread Indhumathi
Hi Manish Nalla,

Thanks for proposing this feature. Please clarify the points below.

1. What will be the grammar for storing carbondata files on OBSFS?
2. S3 does not support concurrent data-manipulation operations or
file-leasing mechanisms; will OBSFS behave the same way?
3. What authentication properties have to be configured to store
carbondata files at an OBSFS location?
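
For point 1, the grammar would presumably mirror the existing S3 support,
i.e. an obs:// scheme in the table path; a hypothetical sketch:

    CREATE TABLE IF NOT EXISTS obs_table (id INT, name STRING)
    STORED AS carbondata
    LOCATION 'obs://bucket-name/warehouse/obs_table';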

I hope all these points will be covered in the design document.

Regards,
Indhumathi M






Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-19 Thread akashrn5
Hi Naman,

Thanks for proposing the feature. It looks really helpful from both the
user and the developer perspective.

Basically, we need the design document so that all doubts can be cleared.

1. Basically, how are you going to handle synchronization issues, e.g.
multiple queries running concurrently with drop-cache and show-cache
operations? Are you going to introduce any locking mechanism?

2. What if a user clears the cache during a query? How is it going to
behave? Is the clear allowed, or is the concurrent operation blocked?

3. How is it going to work with the distributed index server and its
variants (embedded, Presto and others), as well as the local server?
Basically, what is the impact there?

4. You said you will launch a job to get the size of all the blocks
present. Currently we create the block or blocklet datamap, calculate each
datamap's size, and then add it to the cache based on the configured LRU
cache size. So I wanted to know how you will calculate the size in your
case.

Regards,
Akash





[DISCUSSION] Support OBSFS

2019-02-19 Thread manish nalla
Hi all,

OBS is an Object-based Storage Service developed and maintained by
HuaweiCloud. It provides large storage capacity and is capable of storing
any type of file. OBS supports connections through both the S3 client and
the OBS client.

Currently CarbonData supports HDFS and S3 as file systems, and I am
proposing to support OBSFS as another file system because of a few
drawbacks of S3FileSystem.

CarbonData needs OBSFS instead of S3 for two main reasons:
1. Append: to append, we first have to read the whole object and then
rewrite it with the appended data, which is quite slow.
2. Atomic rename: S3 has no atomic rename, as also mentioned in JIRA
[CARBONDATA-2670].

So both these issues can be fixed if we use OBSFileSystem.
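
As a sketch, wiring in OBSFileSystem would presumably follow the usual
Hadoop scheme-to-implementation mapping. The property names below are
assumptions based on the hadoop-huaweicloud connector and should be
confirmed in the design doc:

    # core-site.xml entries (hypothetical values)
    fs.obs.impl=org.apache.hadoop.fs.obs.OBSFileSystem
    fs.obs.endpoint=obs.example-region.myhuaweicloud.com
    fs.obs.access.key=YOUR_ACCESS_KEY
    fs.obs.secret.key=YOUR_SECRET_KEY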

Any suggestions from the community will be greatly appreciated. I would be
uploading the design doc shortly.

Thanks and regards
Manish Nalla
EI BigData Kernel,
Huawei Technologies India Pvt. Ltd


Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-19 Thread sujith chacko
Hi Naman,

 Thanks for proposing this feature; it seems pretty interesting. A few
points I want to bring up here:

1) I think we need a detailed design for this feature in which all the
DDLs you are going to expose are clearly spelled out, as frequent changes
to DDLs are not recommended later on. You should also cover the scenarios
that can impact your DDL operations, such as cross-session DDL operations.

e.g. one user is trying to clear the cache/table while another user
executes the SHOW CACHE command. Basically, you should also mention how
you will handle all the synchronization scenarios.

2) Spark has already exposed DDLs for clearing caches, as shown below;
please refer to them for more insight. It is better to follow a standard
syntax (a brief usage sketch follows after these points).
"CLEAR CACHE"

"UNCACHE TABLE (IF EXISTS)? tableIdentifier"



3) How will you deal with the DROP TABLE case? I think you should clear
the respective cache as well. Mention these scenarios clearly in your
design document.

4) 0 for point 5, as I think you need to explain more in your design
document about the scenarios and the need for this feature; this DDL can
bring more complexity into the system. e.g. by the time the system
calculates the table size, a new segment can get added or an existing
segment can get modified, so basically you again need a lock so that these
kinds of synchronization issues can be tackled in a better manner.
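
A brief usage sketch of the Spark DDLs mentioned in point 2, with a
hypothetical table name:

    CLEAR CACHE
    UNCACHE TABLE IF EXISTS sales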

Overall, I think the approach should be well documented before you start
the implementation. Please let me know if you need any clarification or
have suggestions regarding the above points.

Regards,
Sujith


On Mon, Feb 18, 2019 at 3:35 PM Naman Rastogi 
wrote:

> Hi all,
>
> Currently carbon supports a caching mechanism for blocks/blocklets. Even
> though it allows the end user to set the cache size, it is still very
> limited in functionality: the user chooses the carbon property
> *carbon.max.driver.lru.cache.size* arbitrarily, because before launching
> the carbon session he/she has no idea how much cache should be set for
> his/her requirement.
>
> For this problem, I propose the following improvements in the carbon
> caching mechanism.
>
> 1. Support DDL for showing the current cache used, per table.
> 2. Support DDL for showing the current cache used for a particular table.
>     For these two points, QiangCai already has a PR:
>     https://github.com/apache/carbondata/pull/3078
>
> 3. Support DDL for clearing all the entries in the cache.
> This will look like:
> CLEAN CACHE
>
> 4. Support DDL for clearing cache for a particular table.
> This will clear all the entries in the cache which belong to a
> particular table. This will look like
> CLEAN CACHE FOR TABLE tablename
>
> 5. Support DDL to estimate required cache for a particular table.
> As explained above, the user does not know beforehand how much cache
> will be required for his/her current work. So this DDL will let the
> user estimate how much cache will be required for a particular
> table. For this we will launch a job and estimate the memory
> required for all the blocks, and sum it up.
>
> 6. Dynamic "max cache size" configuration
>     Suppose the user now knows the cache size he needs, but the
>     current system requires the user to set
>     *carbon.max.driver.lru.cache.size* and restart the JDBC server for
>     it to take effect. For this I am suggesting making the carbon
>     property *carbon.max.driver.lru.cache.size* dynamically configurable,
>     which allows the user to change the max LRU cache size on the fly.
>
> Any suggestion from the community is greatly appreciated.
>
> Thanks
>
> Regards
>
> Naman Rastogi
> Technical Lead - BigData Kernel
> Huawei Technologies India Pvt. Ltd.
>
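
A sketch of how the proposed commands might be used together, following
the syntax in the proposal above; the size value is hypothetical, and the
dynamic SET in the first line is what point 6 proposes, not current
behaviour:

    -- raise the LRU limit on the fly (point 6, proposed)
    SET carbon.max.driver.lru.cache.size=1024;
    -- drop one table's entries once it is no longer queried (point 4)
    CLEAN CACHE FOR TABLE tablename;
    -- or flush everything (point 3)
    CLEAN CACHE;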


Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-19 Thread David CaiQiang
+1 for 5 and 6: after point 5 estimates the cache size, point 6 can modify
the configuration dynamically.

+1 for 3 and 4: we may need to add a lock to synchronize the concurrent
operations. With these commands, releasing the cache will no longer
require restarting the driver.

Maybe we also need to check how to use these operations in "[DISCUSSION]
Distributed Index Cache Server".



-
Best Regards
David Cai


RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-19 Thread akashrn5
Hi Ravindra,

Got your point. As I replied to xuchuanyin, we can take up these index
datamap enhancements separately.

Thank you





RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-19 Thread xuchuanyin
+1 for Ravindra's advice.

Currently we only support lazy/incremental load/rebuild for OLAP datamaps
(MV/pre-aggregate), not for index datamaps.





RE: Re:[DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement

2019-02-19 Thread ravipesala
Hi Akash,

There is a difference between index datamaps (like bloom) and OLAP
datamaps (like MV). Index datamaps are used only for pruning data, while
OLAP datamaps serve as pre-computed data that can be fetched directly for
a query.

In the OLAP datamap case, a lazy or deferred build makes sense, as the
data always needs to be synchronized with the master data; otherwise we
will get stale results, so any difference in synchronization will disable
the datamap. An index datamap, on the other hand, is used only for faster
pruning, so synchronization with the master data is not mandatory,
provided we have a mechanism to prune synchronized data using the index
datamap and non-synchronized data using the default datamap. This is the
same point @xuchuanyin mentioned.

I feel this design is about OLAP datamap incremental loading, so it is
better not to change the behaviour of index datamaps. We can consider
index datamap improvements in the future, but they should not be part of
this work. Please update the design if it mentions anything related to
index datamaps.
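
For reference, a sketch of the deferred-rebuild flavour this implies for
an MV datamap; the table, column and datamap names are hypothetical, and
the grammar should be checked against the datamap management
documentation:

    CREATE DATAMAP agg_sales ON TABLE sales
    USING 'mv'
    WITH DEFERRED REBUILD
    AS SELECT country, sum(amount) FROM sales GROUP BY country;

    -- data loaded into 'sales' afterwards is not visible to the datamap
    -- until an explicit rebuild synchronizes it:
    REBUILD DATAMAP agg_sales;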

Regards,
Ravindra.


