[GitHub] carbondata pull request #2576: [CARBONDATA-2795] Add documentation for S3
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2576 ---
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207250192 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,64 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +S3 is an Object Storage API on cloud, it is recommended for storing large data files. You can use +this feature if you want to store data on Amazon cloud or Huawei cloud(OBS). +Since the data is stored on to cloud there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbondata can support any Object Storage that conforms to Amazon S3 API. --- End diff -- merged ---
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207250096 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,10 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to +be created. Recommended to configure HDFS lock path(to this property) in case of S3 file system +as locking is not feasible on S3. +**Note:** If this property is not set to HDFS location for S3 store, then there is a possibility of data corruption. --- End diff -- done ---
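The recommendation quoted above — pointing `carbon.lock.path` at HDFS when the store is on S3 — can be sketched as a carbon.properties fragment; the bucket, namenode host, and paths below are hypothetical placeholders, not values from the PR:

```
# carbon.properties (hypothetical values)
# Data files live on S3, but lock files are kept on HDFS,
# because file-based locking is not feasible on S3.
carbon.storelocation=s3a://mybucket/carbonstore
carbon.lock.type=HDFSLOCK
carbon.lock.path=hdfs://namenode:8020/carbon/locks
```

Without such a setting, concurrent loads against an S3 store may update the table status file unguarded, which is the data-corruption risk the note in the diff warns about.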
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207250066 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,10 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to --- End diff -- added description ---
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207249973 --- Diff: docs/data-management-on-carbondata.md --- @@ -730,6 +736,8 @@ Users can specify which columns to include and exclude for local dictionary gene * If the IGNORE option is used, then bad records are neither loaded nor written to the separate CSV file. * In loaded data, if all records are bad records, the BAD_RECORDS_ACTION is invalid and the load operation fails. * The maximum number of characters per column is 32000. If there are more than 32000 characters in a column, data loading will fail. + * Since Bad Records Path can be specified in both create, load and carbon properties. --- End diff -- done ---
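For context, the bad-records path discussed in this diff is one of the places the setting can be supplied — as a load option. The sketch below uses hypothetical table and path names and assumes the standard CarbonData load-option keys:

```
LOAD DATA INPATH 'hdfs://namenode:8020/data/input.csv'
INTO TABLE db1.table1
OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',
        'BAD_RECORDS_ACTION'='REDIRECT',
        'BAD_RECORD_PATH'='hdfs://namenode:8020/carbon/badrecords')
```

With `REDIRECT`, rows that fail parsing are written as CSV under the given path instead of being loaded.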
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207249849 --- Diff: docs/datamap/preaggregate-datamap-guide.md --- @@ -7,6 +24,7 @@ * [Querying Data](#querying-data) * [Compaction](#compacting-pre-aggregate-tables) * [Data Management](#data-management-with-pre-aggregate-tables) +* [Limitations](#Limitations) --- End diff -- removed ---
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207249941 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,63 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +Amazon S3 is a cloud storage service that is recommended for storing large data files. You can +use this feature if you want to store data on amazon cloud. Since the data is stored on to cloud +storage there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbon can support any Object store that conforms to Amazon S3 API. + +#Writing to Object Store +To store carbondata files on to Object Store location, you need to set `carbon +.storelocation` property to Object Store path in CarbonProperties file. For example, carbon +.storelocation=s3a://mybucket/carbonstore. By setting this property, all the tables will be created on the specified Object Store path. + +If your existing store is HDFS, and you want to store specific tables on S3 location, then `location` parameter has to be set during create +table. +For example: + +``` +CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore' +``` + +For more details on create table, Refer [data-management-on-carbondata](https://github.com/apache/carbondata/blob/master/docs/data-management-on-carbondata.md#create-table) + +#Authentication +You need to set authentication properties to store the carbondata files on to S3 location.
For +more details on authentication properties, refer +[hadoop authentication document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties) + +Another way of setting the authentication parameters is as follows: + +``` + SparkSession + .builder() + .master(masterURL) + .appName("S3Example") + .config("spark.driver.host", "localhost") + .config("spark.hadoop.fs.s3a.access.key", "") + .config("spark.hadoop.fs.s3a.secret.key", "") + .config("spark.hadoop.fs.s3a.endpoint", "1.1.1.1") + .getOrCreateCarbonSession() +``` + +#Recommendations +1. Object stores like S3 does not support file leasing mechanism(supported by HDFS) that is +required to take locks which ensure consistency between concurrent operations therefore, it is +recommended to set the configurable lock path property([carbon.lock.path](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md#miscellaneous-configuration)) + to a HDFS directory. +2. As Object stores are eventual consistent meaning that any put request can take some time to reflect when trying to list objects from that bucket therefore concurrent queries are not supported. --- End diff -- done ---
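As an alternative to the in-code `SparkSession` configuration quoted in the diff above, the same S3A credentials can be passed through Spark configuration; the fragment below is a sketch with placeholder values — the `spark.hadoop.` prefix forwards each property to the underlying Hadoop S3A filesystem:

```
# spark-defaults.conf (placeholder credentials — do not commit real keys)
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
spark.hadoop.fs.s3a.endpoint     s3.amazonaws.com
```

Keeping credentials in configuration rather than source code avoids leaking secrets into version control.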
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207249910 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,63 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +Amazon S3 is a cloud storage service that is recommended for storing large data files. You can --- End diff -- done ---
Github user kunal642 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207249485 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,12 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | Locks on the files are used to prevent concurrent operation from modifying the same files. This +configuration specifies the path where lock files have to be created. Recommended to configure +HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +**Note:** If this property is not set to HDFS location for S3 store, then there is a possibility +of data corruption because multiple data manipulation calls might try to update the status file +and as lock is not acquired before updation data might get overwritten. --- End diff -- added ---
Github user sgururajshetty commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207223686 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,12 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | Locks on the files are used to prevent concurrent operation from modifying the same files. This +configuration specifies the path where lock files have to be created. Recommended to configure +HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +**Note:** If this property is not set to HDFS location for S3 store, then there is a possibility +of data corruption because multiple data manipulation calls might try to update the status file +and as lock is not acquired before updation data might get overwritten. --- End diff -- Since it is a table, end the line with a pipe | ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207072493 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,64 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +S3 is an Object Storage API on cloud, it is recommended for storing large data files. You can use +this feature if you want to store data on Amazon cloud or Huawei cloud(OBS). +Since the data is stored on to cloud there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbondata can support any Object Storage that conforms to Amazon S3 API. --- End diff -- This sentence can be merged with the above sentence "You can use this feature if you want to store data " ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207071826 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,10 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to --- End diff -- Add a brief description of why locks are used in CarbonData. What is TABLEPATH? ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207073807 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,64 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +S3 is an Object Storage API on cloud, it is recommended for storing large data files. You can use +this feature if you want to store data on Amazon cloud or Huawei cloud(OBS). +Since the data is stored on to cloud there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbondata can support any Object Storage that conforms to Amazon S3 API. + +#Writing to Object Storage +To store carbondata files on to Object Store location, you need to set `carbon +.storelocation` property to Object Store path in CarbonProperties file. For example, carbon +.storelocation=s3a://mybucket/carbonstore. By setting this property, all the tables will be created on the specified Object Store path. + +If your existing store is HDFS, and you want to store specific tables on S3 location, then `location` parameter has to be set during create --- End diff -- If you don't wish to change the existing store location and want to store only specific tables on S3, this can be done by setting the 'location' option parameter in the CREATE TABLE DDL command. ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207072154 --- Diff: docs/data-management-on-carbondata.md --- @@ -730,6 +736,8 @@ Users can specify which columns to include and exclude for local dictionary gene * If the IGNORE option is used, then bad records are neither loaded nor written to the separate CSV file. * In loaded data, if all records are bad records, the BAD_RECORDS_ACTION is invalid and the load operation fails. * The maximum number of characters per column is 32000. If there are more than 32000 characters in a column, data loading will fail. + * Since Bad Records Path can be specified in both create, load and carbon properties. --- End diff -- The entire sentence needs to be rephrased; it is not a grammatically correct statement. ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207071907 --- Diff: docs/configuration-parameters.md --- @@ -106,7 +106,10 @@ This section provides the details of all the configurations required for CarbonD |-|--|-| | carbon.sort.file.write.buffer.size | 16384 | File write buffer size used during sorting. Minimum allowed buffer size is 10240 byte and Maximum allowed buffer size is 10485760 byte. | | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on table. There are following types of lock implementation: - LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking. | -| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. Recommended to configure zookeeper lock type or configure HDFS lock path(to this property) in case of S3 file system as locking is not feasible on S3. +| carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to +be created. Recommended to configure HDFS lock path(to this property) in case of S3 file system +as locking is not feasible on S3. +**Note:** If this property is not set to HDFS location for S3 store, then there is a possibility of data corruption. --- End diff -- can add a brief sentence as to why corruption might happen ---
Github user sraghunandan commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r207074600 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,64 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +S3 is an Object Storage API on cloud, it is recommended for storing large data files. You can use +this feature if you want to store data on Amazon cloud or Huawei cloud(OBS). +Since the data is stored on to cloud there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbondata can support any Object Storage that conforms to Amazon S3 API. + +#Writing to Object Storage +To store carbondata files on to Object Store location, you need to set `carbon +.storelocation` property to Object Store path in CarbonProperties file. For example, carbon +.storelocation=s3a://mybucket/carbonstore. By setting this property, all the tables will be created on the specified Object Store path. + +If your existing store is HDFS, and you want to store specific tables on S3 location, then `location` parameter has to be set during create +table. +For example: + +``` +CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore' +``` + +For more details on create table, Refer [data-management-on-carbondata](https://github.com/apache/carbondata/blob/master/docs/data-management-on-carbondata.md#create-table) + +#Authentication +You need to set authentication properties to store the carbondata files on to S3 location.
For +more details on authentication properties, refer +[hadoop authentication document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties) + +Another way of setting the authentication parameters is as follows: + +``` + SparkSession + .builder() + .master(masterURL) + .appName("S3Example") + .config("spark.driver.host", "localhost") + .config("spark.hadoop.fs.s3a.access.key", "") + .config("spark.hadoop.fs.s3a.secret.key", "") + .config("spark.hadoop.fs.s3a.endpoint", "1.1.1.1") + .getOrCreateCarbonSession() +``` + +#Recommendations +1. Object Storage like S3 does not support file leasing mechanism(supported by HDFS) that is +required to take locks which ensure consistency between concurrent operations therefore, it is +recommended to set the configurable lock path property([carbon.lock.path](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md#miscellaneous-configuration)) + to a HDFS directory. +2. As Object Storage are eventual consistent meaning that any put request can take some time to --- End diff -- Concurrent data manipulation operations are not supported. Object stores follow eventual consistency semantics, i.e., any put request might take some time to be reflected when listing objects. Because of this behaviour, a read is not guaranteed to be consistent or to return the latest data. ---
Github user KanakaKumar commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r206515601 --- Diff: docs/data-management-on-carbondata.md --- @@ -730,6 +736,8 @@ Users can specify which columns to include and exclude for local dictionary gene * If the IGNORE option is used, then bad records are neither loaded nor written to the separate CSV file. * In loaded data, if all records are bad records, the BAD_RECORDS_ACTION is invalid and the load operation fails. * The maximum number of characters per column is 32000. If there are more than 32000 characters in a column, data loading will fail. + * Since Bad Records Path can be specified in both create, load and carbon properties. --- End diff -- "both" does not fit in this statement. Please rewrite. ---
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r206481821 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,63 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +Amazon S3 is a cloud storage service that is recommended for storing large data files. You can +use this feature if you want to store data on amazon cloud. Since the data is stored on to cloud +storage there are no restrictions on the size of data and the data can be accessed from anywhere at any time. +Carbon can support any Object store that conforms to Amazon S3 API. + +#Writing to Object Store +To store carbondata files on to Object Store location, you need to set `carbon +.storelocation` property to Object Store path in CarbonProperties file. For example, carbon +.storelocation=s3a://mybucket/carbonstore. By setting this property, all the tables will be created on the specified Object Store path. + +If your existing store is HDFS, and you want to store specific tables on S3 location, then `location` parameter has to be set during create +table. +For example: + +``` +CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore' +``` + +For more details on create table, Refer [data-management-on-carbondata](https://github.com/apache/carbondata/blob/master/docs/data-management-on-carbondata.md#create-table) + +#Authentication +You need to set authentication properties to store the carbondata files on to S3 location.
For +more details on authentication properties, refer +[hadoop authentication document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties) + +Another way of setting the authentication parameters is as follows: + +``` + SparkSession + .builder() + .master(masterURL) + .appName("S3Example") + .config("spark.driver.host", "localhost") + .config("spark.hadoop.fs.s3a.access.key", "") + .config("spark.hadoop.fs.s3a.secret.key", "") + .config("spark.hadoop.fs.s3a.endpoint", "1.1.1.1") + .getOrCreateCarbonSession() +``` + +#Recommendations +1. Object stores like S3 does not support file leasing mechanism(supported by HDFS) that is +required to take locks which ensure consistency between concurrent operations therefore, it is +recommended to set the configurable lock path property([carbon.lock.path](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md#miscellaneous-configuration)) + to a HDFS directory. +2. As Object stores are eventual consistent meaning that any put request can take some time to reflect when trying to list objects from that bucket therefore concurrent queries are not supported. --- End diff -- Changes to : Object Storage ---
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r206481369 --- Diff: docs/s3-guide.md --- @@ -0,0 +1,63 @@ + + +#S3 Guide (Alpha Feature 1.4.1) +Amazon S3 is a cloud storage service that is recommended for storing large data files. You can --- End diff -- Suggest changing to: S3 is an object storage API on the cloud; it is recommended for storing large data files. You can use this feature if you want to store data on Amazon cloud or Huawei cloud (OBS). Since the data is stored on cloud storage, there are no restrictions on the size of data, and the data can be accessed from anywhere at any time. CarbonData can support any object storage that conforms to the Amazon S3 API. ---
Github user chenliang613 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2576#discussion_r206480055 --- Diff: docs/datamap/preaggregate-datamap-guide.md --- @@ -7,6 +24,7 @@ * [Querying Data](#querying-data) * [Compaction](#compacting-pre-aggregate-tables) * [Data Management](#data-management-with-pre-aggregate-tables) +* [Limitations](#Limitations) --- End diff -- Why does this item need to be added? ---