[drill-site] branch master updated: Document new Parquet format config opts.

dzamo Thu, 02 Dec 2021 03:59:04 -0800

This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git



The following commit(s) were added to refs/heads/master by this push:
     new 26b4425  Document new Parquet format config opts.
26b4425 is described below

commit 26b4425165abd7cccf912cd2336d0c1e5a0aca61
Author: James Turton <[email protected]>
AuthorDate: Thu Dec 2 13:58:13 2021 +0200

    Document new Parquet format config opts.
---
 .../040-parquet-format.md                          | 107 ++++++++++++---------
 .../drill-metastore/040-mongo-metastore.md         |   2 +
 2 files changed, 63 insertions(+), 46 deletions(-)

diff --git a/_docs/en/data-sources-and-file-formats/040-parquet-format.md b/_docs/en/data-sources-and-file-formats/040-parquet-format.md
index b297545..df99473 100644
--- a/_docs/en/data-sources-and-file-formats/040-parquet-format.md
+++ b/_docs/en/data-sources-and-file-formats/040-parquet-format.md
@@ -7,7 +7,7 @@ parent: "Data Sources and File Formats"
 
 * Self-describing
 * Columnar format
-* Language-independent 
+* Language-independent
 
 Self-describing data embeds the schema or structure with the data itself. 
Hadoop use cases drive the growth of self-describing data formats, such as 
Parquet and JSON, and of NoSQL databases, such as HBase. These formats and 
databases are well suited for the agile and iterative development cycle of 
Hadoop applications and BI/analytics. Optimized for working with large files, 
Parquet arranges data in columns, putting related values in close proximity to 
each other to optimize query perform [...]
 
@@ -18,74 +18,87 @@ Apache Drill includes the following support for Parquet:
 * Generating Parquet files that have evolving or changing schemas and querying 
the data on the fly
 * Handling Parquet data types
 
+## Configuration Options
+
+| Option                         | Description                                                          |
+| ------------------------------ | -------------------------------------------------------------------- |
+| enableStringsSignedMinMax      | See config opt store.parquet.reader.strings_signed_min_max           |
+| blockSize                      | See config opt store.parquet.block-size                              |
+| pageSize                       | See config opt store.parquet.page-size                               |
+| useSingleFsBlock               | See config opt store.parquet.writer.use_single_fs_block              |
+| writerCompressionType          | See config opt store.parquet.compression                             |
+| writerLogicalTypeForDecimals   | See config opt store.parquet.writer.logical_type_for_decimals        |
+| writerUsePrimitivesForDecimals | See config opt store.parquet.writer.use_primitive_types_for_decimals |
+| writerFormatVersion            | See config opt store.parquet.writer.format_version                   |
+
 ## Reading Parquet Files
 When a read of Parquet data occurs, Drill loads only the necessary columns of 
data, which reduces I/O. Reading only a small piece of the Parquet data from a 
data file or table, Drill can examine and analyze all values for a column 
across multiple files. You can create a Drill table from one format and store 
the data in another format, including Parquet.
 
 ## Writing Parquet Files
 CREATE TABLE AS (CTAS) can use any data source provided by the storage plugin. 
To write Parquet data using the CTAS command, set the `session store.format` 
option as shown in [Configuring the Parquet Storage 
Format]({{site.baseurl}}/docs/parquet-format/#configuring-the-parquet-storage-format).
 Alternatively, configure the storage plugin to point to the directory 
containing the Parquet files.
 
-Although the data resides in a single table, Parquet output generally consists 
of multiple files that resemble MapReduce output having numbered file names,  
such as 0_0_0.parquet in a directory.  
+Although the data resides in a single table, Parquet output generally consists 
of multiple files that resemble MapReduce output having numbered file names,  
such as 0_0_0.parquet in a directory.
 
 ### Date Value Auto-Correction
-As of Drill 1.10, Drill writes standard Parquet date values. Drill also has an 
automatic correction feature that automatically detects and corrects corrupted 
date values that Drill wrote into Parquet files prior to Drill 1.10. 
+As of Drill 1.10, Drill writes standard Parquet date values. Drill also has an 
automatic correction feature that automatically detects and corrects corrupted 
date values that Drill wrote into Parquet files prior to Drill 1.10.
 
-By default, the automatic correction feature is turned on and works for dates 
up to 5,000 years into the future. In the unlikely event that Drill needs to 
write dates thousands of years into the future, turn the auto-correction 
feature off.  
+By default, the automatic correction feature is turned on and works for dates 
up to 5,000 years into the future. In the unlikely event that Drill needs to 
write dates thousands of years into the future, turn the auto-correction 
feature off.
 
-To disable the auto-correction feature, navigate to the storage plugin 
configuration and change the `autoCorrectCorruptDates` option in the Parquet 
configuration to “false”, as shown in the example below:  
+To disable the auto-correction feature, navigate to the storage plugin 
configuration and change the `autoCorrectCorruptDates` option in the Parquet 
configuration to “false”, as shown in the example below:
 
        "formats": {
            "parquet": {
              "type": "parquet",
              "autoCorrectCorruptDates": false
-           }  
+           }
 
-Alternatively, you can set the option to false when you issue a query, as 
shown in the following example:  
+Alternatively, you can set the option to false when you issue a query, as 
shown in the following example:
 
-       SELECT l_shipdate, l_commitdate FROM table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem` 
-       (type => 'parquet', autoCorrectCorruptDates => false)) LIMIT 1; 
+       SELECT l_shipdate, l_commitdate FROM table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem`
+       (type => 'parquet', autoCorrectCorruptDates => false)) LIMIT 1;
 
 ### Configuring the Parquet Storage Format
-To read or write Parquet data, you need to include the Parquet format in the 
storage plugin format definitions. The `dfs` plugin definition includes the 
Parquet format. 
+To read or write Parquet data, you need to include the Parquet format in the 
storage plugin format definitions. The `dfs` plugin definition includes the 
Parquet format.
 
 Use the `store.format` option to set the CTAS output format of a Parquet row 
group at the session or system level.
 
-Use the ALTER command to set the `store.format` option.  
+Use the ALTER command to set the `store.format` option.
+
+``ALTER SYSTEM|SESSION SET `store.format` = 'parquet';``
 
-``ALTER SYSTEM|SESSION SET `store.format` = 'parquet';``  
-        
 ### Configuring the Size of Parquet Files
-Configuring the size of Parquet files by setting the 
`store.parquet.block-size` can improve write performance. The block size is the 
size of MFS, HDFS, or the file system. 
+Configuring the size of Parquet files by setting the 
`store.parquet.block-size` can improve write performance. The block size is the 
size of MFS, HDFS, or the file system.
 
 The larger the block size, the more memory Drill needs for buffering data. 
Parquet files that contain a single block maximize the amount of data Drill 
stores contiguously on disk. Given a single row group per file, Drill stores 
the entire Parquet file onto the block, avoiding network I/O.
 
-To maximize performance, set the target size of a Parquet row group to the 
number of bytes less than or equal to the block size of MFS, HDFS, or the file 
system using the `store.parquet.block-size` option, as shown:  
+To maximize performance, set the target size of a Parquet row group to the 
number of bytes less than or equal to the block size of MFS, HDFS, or the file 
system using the `store.parquet.block-size` option, as shown:
 
-``ALTER SYSTEM|SESSION SET `store.parquet.block-size` = 536870912;``  
+``ALTER SYSTEM|SESSION SET `store.parquet.block-size` = 536870912;``
 
-The default block size is 536870912 bytes.  
+The default block size is 536870912 bytes.
 
-### Configuring the HDFS Block Size for Parquet Files  
-Drill 1.11 introduces the `store.parquet.writer.use_single_fs_block` option, 
which enables Drill to write a Parquet file as a single file system block 
without changing the default file system block size. Query performance improves 
when Drill reads Parquet files as a single block on the file system. When the 
`store.parquet.writer.use_single_fs_block` option is enabled, the 
`store.parquet.block-size` setting determines the block size of the Parquet 
files created. The default setting for th [...]
+### Configuring the HDFS Block Size for Parquet Files
+Drill 1.11 introduces the `store.parquet.writer.use_single_fs_block` option, 
which enables Drill to write a Parquet file as a single file system block 
without changing the default file system block size. Query performance improves 
when Drill reads Parquet files as a single block on the file system. When the 
`store.parquet.writer.use_single_fs_block` option is enabled, the 
`store.parquet.block-size` setting determines the block size of the Parquet 
files created. The default setting for th [...]
 
-    ALTER SYSTEM|SESSION SET store.parquet.writer.use_single_fs_block = 'true|false';  
+    ALTER SYSTEM|SESSION SET store.parquet.writer.use_single_fs_block = 'true|false';
 
 ### Type Mapping
-The high correlation between Parquet and SQL data types makes reading Parquet 
files effortless in Drill. Writing to Parquet files takes more work than 
reading. Because SQL does not support all Parquet data types, to prevent Drill 
from inferring a type other than one you want, use the [cast function]({{ 
site.baseurl }}/docs/data-type-conversion/#cast) Drill offers more liberal 
casting capabilities than SQL for Parquet conversions if the Parquet data is of 
a logical type. 
+The high correlation between Parquet and SQL data types makes reading Parquet 
files effortless in Drill. Writing to Parquet files takes more work than 
reading. Because SQL does not support all Parquet data types, to prevent Drill 
from inferring a type other than one you want, use the [cast function]({{ 
site.baseurl }}/docs/data-type-conversion/#cast) Drill offers more liberal 
casting capabilities than SQL for Parquet conversions if the Parquet data is of 
a logical type.
 
 The following general process converts a file from JSON to Parquet:
 
 * Create or use an existing storage plugin that specifies the storage location 
of the Parquet file, mutability of the data, and supported file formats.
-* Take a look at the JSON data. 
+* Take a look at the JSON data.
 * Create a table that selects the JSON file.
 * In the CTAS command, cast JSON string data to corresponding [SQL types]({{ 
site.baseurl }}/docs/json-data-model/#data-type-mapping).
 
 ### Example: Read JSON, Write Parquet
-This example demonstrates a storage plugin definition, a sample row of data 
from a JSON file, and a Drill query that writes the JSON input to Parquet 
output. 
+This example demonstrates a storage plugin definition, a sample row of data 
from a JSON file, and a Drill query that writes the JSON input to Parquet 
output.
 
-### Storage Plugin Definition
+#### Storage Plugin Definition
 You can use the default dfs storage plugin installed with Drill for reading 
and writing Parquet files. The storage plugin needs to configure the writable 
option of the workspace to true, so Drill can write the Parquet output. The dfs 
storage plugin defines the tmp writable workspace, which you can use in the 
CTAS command to create a Parquet table.
 
-### Sample Row of JSON Data
+#### Sample Row of JSON Data
 A JSON file called sample.json contains data consisting of strings, typical of 
JSON data. The following example shows one row of the JSON file:
 
         
{"trans_id":0,"date":"2013-07-26","time":"04:56:59","amount":80.5,"user_info":
@@ -98,19 +111,21 @@ A JSON file called sample.json contains data consisting of strings, typical of J
                "purch_flag":"false"
               }
         }
-              
 
-### CTAS Query      
+
+#### CTAS Query
 The following example shows a CTAS query that creates a table from JSON data 
shown in the last example. The command casts the date, time, and amount strings 
to SQL types DATE, TIME, and DOUBLE. String-to-VARCHAR casting of the other 
strings occurs automatically.
 
-    CREATE TABLE dfs.tmp.sampleparquet AS 
-    (SELECT trans_id, 
-    cast(`date` AS date) transdate, 
-    cast(`time` AS time) transtime, 
-    cast(amount AS double) amountm,
-    user_info, marketing_info, trans_info 
-    FROM dfs.`/Users/drilluser/sample.json`);
-        
+```sql
+CREATE TABLE dfs.tmp.sampleparquet AS
+(SELECT trans_id,
+cast(`date` AS date) transdate,
+cast(`time` AS time) transtime,
+cast(amount AS double) amountm,
+user_info, marketing_info, trans_info
+FROM dfs.`/Users/drilluser/sample.json`);
+```
+
 The CTAS query does not specify a file name extension for the output. Drill 
creates a parquet file by default, as indicated by the file name in the output:
 
     |------------|---------------------------|
@@ -122,7 +137,7 @@ The CTAS query does not specify a file name extension for the output. Drill crea
 
 You can query the Parquet file to verify that Drill now interprets the 
converted string as a date.
 
-    SELECT extract(year from transdate) AS `Year`, t.user_info.cust_id AS Customer 
+    SELECT extract(year from transdate) AS `Year`, t.user_info.cust_id AS Customer
     FROM dfs.tmp.`sampleparquet` t;
 
     |------------|------------|
@@ -153,17 +168,17 @@ The first table in this section maps SQL data types to Parquet data types, limit
 
 \* Drill 1.10 and later can implicitly interpret the Parquet INT96 type as 
TIMESTAMP (with standard 8 byte/millisecond precision) when the 
`store.parquet.reader.int96_as_timestamp` option is enabled. In earlier 
versions of Drill (1.2 through 1.9) or when the 
`store.parquet.reader.int96_as_timestamp` option is disabled, you must use the 
CONVERT_FROM function for Drill to correctly interpret INT96 values as 
TIMESTAMP values.
 
-## About INT96 Support  
-As of Drill 1.10, Drill can implicitly interpret the INT96 timestamp data type 
in Parquet files when the `store.parquet.reader.int96_as_timestamp` option is 
enabled. For earlier versions of Drill,  or when the 
`store.parquet.reader.int96_as_timestamp` option is disabled, you must use the 
CONVERT_FROM function,   
+## About INT96 Support
+As of Drill 1.10, Drill can implicitly interpret the INT96 timestamp data type 
in Parquet files when the `store.parquet.reader.int96_as_timestamp` option is 
enabled. For earlier versions of Drill,  or when the 
`store.parquet.reader.int96_as_timestamp` option is disabled, you must use the 
CONVERT_FROM function,
 
-The `store.parquet.reader.int96_as_timestamp` option is disabled by default. 
Use the [ALTER SYSTEM|SESSION SET]({{site.baseurl}}/docs/alter-system/) command 
to enable the option. Unnecessarily enabling this option can cause queries to 
fail because the CONVERT_FROM(col, 'TIMESTAMP_IMPALA') function does not work 
when `store.parquet.reader.int96_as_timestamp` is enabled.  
+The `store.parquet.reader.int96_as_timestamp` option is disabled by default. 
Use the [ALTER SYSTEM|SESSION SET]({{site.baseurl}}/docs/alter-system/) command 
to enable the option. Unnecessarily enabling this option can cause queries to 
fail because the CONVERT_FROM(col, 'TIMESTAMP_IMPALA') function does not work 
when `store.parquet.reader.int96_as_timestamp` is enabled.
 
 ### Using CONVERT_FROM to Interpret INT96
-In earlier versions of Drill (1.2 through 1.9), you must use the CONVERT_FROM 
function for Drill to interpret the Parquet INT96 type. For example, to decode 
a timestamp from Hive or Impala, which is of type INT96, use the CONVERT_FROM 
function and the 
[TIMESTAMP_IMPALA]({{site.baseurl}}/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions)
 type argument:  
+In earlier versions of Drill (1.2 through 1.9), you must use the CONVERT_FROM 
function for Drill to interpret the Parquet INT96 type. For example, to decode 
a timestamp from Hive or Impala, which is of type INT96, use the CONVERT_FROM 
function and the 
[TIMESTAMP_IMPALA]({{site.baseurl}}/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions)
 type argument:
 
-``SELECT CONVERT_FROM(timestamp_field, 'TIMESTAMP_IMPALA') as timestamp_field FROM `dfs.file_with_timestamp.parquet`;``  
+``SELECT CONVERT_FROM(timestamp_field, 'TIMESTAMP_IMPALA') as timestamp_field FROM `dfs.file_with_timestamp.parquet`;``
 
-Because INT96 is supported for reads only, you cannot use the TIMESTAMP_IMPALA 
as a data type argument with CONVERT_TO. You can convert a SQL TIMESTAMP to 
VARBINARY using the CAST function, but the resultant VARBINARY is not the same 
as INT96. 
+Because INT96 is supported for reads only, you cannot use the TIMESTAMP_IMPALA 
as a data type argument with CONVERT_TO. You can convert a SQL TIMESTAMP to 
VARBINARY using the CAST function, but the resultant VARBINARY is not the same 
as INT96.
 
 For example, create a Drill table after reading INT96 and converting some data 
to a timestamp.
 
@@ -173,12 +188,12 @@ For example, create a Drill table after reading INT96 and converting some data t
 t1.created_ts is an INT96 (or Hive/Impala timestamp) , t2.created_ts is a SQL 
timestamp. These types are not comparable. You cannot use a condition like 
t1.created_ts = t2.created_ts.
 
 ### Configuring the Timezone
-By default, INT96 timestamp values represent the local date and time, which is 
similar to Hive. To get INT96 timestamp values in UTC, configure Drill for [UTC 
time]({{site.baseurl}}/docs/data-type-conversion/#time-zone-limitation).  
+By default, INT96 timestamp values represent the local date and time, which is 
similar to Hive. To get INT96 timestamp values in UTC, configure Drill for [UTC 
time]({{site.baseurl}}/docs/data-type-conversion/#time-zone-limitation).
 
 
 ### SQL Types to Parquet Logical Types
 Parquet also supports logical types, fully described on the [Apache Parquet 
site](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). 
Embedded types, JSON and BSON, annotate a binary primitive type representing a 
JSON or BSON document. The logical types and their mapping to SQL types are:
- 
+
 | SQL Type   | Drill Description                                                              | Parquet Logical Type | Parquet Description                                                                                                                        |
 |------------|--------------------------------------------------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
 | DATE       | Years months and days in the form in the form YYYY-MM-DD                       | DATE                 | Date, not including time of day. Uses the int32 annotation. Stores the number of days from the Unix epoch, 1 January 1970.                 |
@@ -203,6 +218,6 @@ Parquet supports the following data description languages:
 
 * Apache Avro
 * Apache Thrift
-* Google Protocol Buffers 
+* Google Protocol Buffers
 
-Implement custom storage plugins to create Parquet readers/writers for formats 
such as Thrift. 
+Implement custom storage plugins to create Parquet readers/writers for formats 
such as Thrift.
diff --git a/_docs/en/performance-tuning/drill-metastore/040-mongo-metastore.md b/_docs/en/performance-tuning/drill-metastore/040-mongo-metastore.md
index bc21925..17fa78b 100644
--- a/_docs/en/performance-tuning/drill-metastore/040-mongo-metastore.md
+++ b/_docs/en/performance-tuning/drill-metastore/040-mongo-metastore.md
@@ -4,6 +4,8 @@ slug: "Mongo Metastore"
 parent: "Drill Metastore"
 ---
 
+**Introduced in release:** 1.20.
+
 The Mongo Metastore implementation allows you store Drill Metastore metadata 
in a configured
  MongoDB.
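
The Configuration Options table added to 040-parquet-format.md maps each Parquet format option to an existing `store.parquet.*` option. As a rough sketch only, three of those options could be set at the session level as shown here. The option names and the block size default come from the documentation in this commit; the `'snappy'` value is an assumption and is not stated in the diff.

```sql
-- blockSize -> store.parquet.block-size
-- (536870912 bytes is the default quoted in the documentation)
ALTER SESSION SET `store.parquet.block-size` = 536870912;

-- useSingleFsBlock -> store.parquet.writer.use_single_fs_block
ALTER SESSION SET `store.parquet.writer.use_single_fs_block` = true;

-- writerCompressionType -> store.parquet.compression
-- 'snappy' is an assumed example value, not taken from this commit
ALTER SESSION SET `store.parquet.compression` = 'snappy';
```

Whether a value set in the format configuration overrides the session or system option is not described in this commit, so check that against the Drill release in use.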
