http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/sql-ref/nested/001-flatten.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/sql-ref/nested/001-flatten.md b/_docs/drill-docs/sql-ref/nested/001-flatten.md
deleted file mode 100644
index 124db91..0000000
--- a/_docs/drill-docs/sql-ref/nested/001-flatten.md
+++ /dev/null
@@ -1,89 +0,0 @@
---
title: "FLATTEN Function"
parent: "Nested Data Functions"
---
The FLATTEN function is useful for flexible exploration of repeated data.
FLATTEN separates the elements in a repeated field into individual records. To
maintain the association between each flattened value and the other fields in
the record, all of the other columns are copied into each new record. A very
simple example would turn this data (one record):

    {
      "x" : 5,
      "y" : "a string",
      "z" : [ 1,2,3]
    }

into three distinct records:

    select x, y, flatten(z) from table;
    | x           | y              | z         |
    +-------------+----------------+-----------+
    | 5           | "a string"     | 1         |
    | 5           | "a string"     | 2         |
    | 5           | "a string"     | 3         |

The function takes a single argument, which must be an array (the `z` column
in this example).

For a more interesting example, consider the JSON data in the publicly
available [Yelp](https://www.yelp.com/dataset_challenge/dataset) data set. The
first query below returns three columns from the
`yelp_academic_dataset_business.json` file: `name`, `hours`, and `categories`.
The query is restricted to distinct rows where the name is `zpizza`. The
query returns only one row that meets those criteria; however, note that this
row contains an array of four categories:

    0: jdbc:drill:zk=local> select distinct name, hours, categories
    from dfs.yelp.`yelp_academic_dataset_business.json`
    where name ='zpizza';
    +------------+------------+------------+
    |    name    |   hours    | categories |
    +------------+------------+------------+
    | zpizza     | {"Tuesday":{"close":"22:00","open":"10:00"},"Friday":{"close":"23:00","open":"10:00"},"Monday":{"close":"22:00","open":"10:00"},"Wednesday":{"close":"22:00","open":"10:00"},"Thursday":{"close":"22:00","open":"10:00"},"Sunday":{"close":"22:00","open":"10:00"},"Saturday":{"close":"23:00","open":"10:00"}} | ["Gluten-Free","Pizza","Vegan","Restaurants"] |

The FLATTEN function can operate on this single row and return multiple rows,
one for each category:

    0: jdbc:drill:zk=local> select distinct name, flatten(categories) as categories
    from dfs.yelp.`yelp_academic_dataset_business.json`
    where name ='zpizza' order by 2;
    +------------+-------------+
    |    name    | categories  |
    +------------+-------------+
    | zpizza     | Gluten-Free |
    | zpizza     | Pizza       |
    | zpizza     | Restaurants |
    | zpizza     | Vegan       |
    +------------+-------------+
    4 rows selected (2.797 seconds)

Having used the FLATTEN function to break down arrays into distinct rows, you
can run queries that do deeper analysis on the flattened result set. For
example, you can use FLATTEN in a subquery, then apply WHERE clause
constraints or aggregate functions to the results in the outer query.
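For instance, here is a sketch of the WHERE-clause pattern against the same
Yelp file (the subquery alias and the filter value are illustrative):

    select distinct sq.name
    from (select name, flatten(categories) as category
          from dfs.yelp.`yelp_academic_dataset_business.json`) sq
    where sq.category = 'Pizza' limit 5;

This returns up to five businesses whose flattened category list contains the
value `Pizza`.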
The following query uses the same data file as the previous query to flatten
the categories array, then runs a COUNT function on the flattened result:

    select celltbl.catl, count(celltbl.catl) catcount
    from (select flatten(categories) catl
    from dfs.yelp.`yelp_academic_dataset_business.json`) celltbl
    group by celltbl.catl
    order by count(celltbl.catl) desc limit 5;

    +---------------+------------+
    |     catl      |  catcount  |
    +---------------+------------+
    | Restaurants   | 14303      |
    | Shopping      | 6428       |
    | Food          | 5209       |
    | Beauty & Spas | 3421      |
    | Nightlife     | 2870       |
    +---------------+------------+

A common use case for FLATTEN is its use in conjunction with the
[KVGEN](/confluence/display/DRILL/KVGEN+Function) function.
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/sql-ref/nested/002-kvgen.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/sql-ref/nested/002-kvgen.md b/_docs/drill-docs/sql-ref/nested/002-kvgen.md
deleted file mode 100644
index a27a781..0000000
--- a/_docs/drill-docs/sql-ref/nested/002-kvgen.md
+++ /dev/null
@@ -1,150 +0,0 @@
---
title: "KVGEN Function"
parent: "Nested Data Functions"
---
KVGEN stands for _key-value generation_. This function is useful when complex
data files contain arbitrary maps that consist of relatively "unknown" column
names. Instead of having to specify columns in the map to access the data, you
can use KVGEN to return a list of the keys that exist in the map. KVGEN turns
a map with a wide set of columns into an array of key-value pairs.

In turn, you can write analytic queries that return a subset of the generated
keys or constrain the keys in some way. For example, you can use the
[FLATTEN](/confluence/display/DRILL/FLATTEN+Function) function to break the
array down into multiple distinct rows and further query those rows.

For example, assume that a JSON file contains this data:

    {"a": "valA", "b": "valB"}
    {"c": "valC", "d": "valD"}

KVGEN would operate on this data to generate:

    [{"key": "a", "value": "valA"}, {"key": "b", "value": "valB"}]
    [{"key": "c", "value": "valC"}, {"key": "d", "value": "valD"}]

Applying the [FLATTEN](/confluence/display/DRILL/FLATTEN+Function) function to
this data would return:

    {"key": "a", "value": "valA"}
    {"key": "b", "value": "valB"}
    {"key": "c", "value": "valC"}
    {"key": "d", "value": "valD"}

Assume that a JSON file called `kvgendata.json` includes multiple records that
look like this one:

    {
      "rownum": 1,
      "bigintegercol": {
        "int_1": 1,
        "int_2": 2,
        "int_3": 3
      },
      "varcharcol": {
        "varchar_1": "abc",
        "varchar_2": "def",
        "varchar_3": "xyz"
      },
      "boolcol": {
        "boolean_1": true,
        "boolean_2": false,
        "boolean_3": true
      },
      "float8col": {
        "f8_1": 1.1,
        "f8_2": 2.2
      },
      "complex": [
        {
          "col1": 3
        },
        {
          "col2": 2,
          "col3": 1
        },
        {
          "col1": 7
        }
      ]
    }

    {
      "rownum": 3,
      "bigintegercol": {
        "int_1": 1,
        "int_3": 3
      },
      "varcharcol": {
        "varchar_1": "abcde",
        "varchar_2": null,
        "varchar_3": "xyz",
        "varchar_4": "xyz2"
      },
      "boolcol": {
        "boolean_1": true,
        "boolean_2": false
      },
      "float8col": {
        "f8_1": 1.1,
        "f8_3": 6.6
      },
      "complex": [
        {
          "col1": 2,
          "col3": 1
        }
      ]
    }
    ...

A SELECT * query against this specific record returns the following row:

    0: jdbc:drill:zk=local> select * from dfs.yelp.`kvgendata.json` where rownum=1;

    +------------+---------------+------------+------------+------------+------------+
    |   rownum   | bigintegercol | varcharcol |  boolcol   | float8col  |  complex   |
    +------------+---------------+------------+------------+------------+------------+
    | 1          | {"int_1":1,"int_2":2,"int_3":3} | {"varchar_1":"abc","varchar_2":"def","varchar_3":"xyz"} | {"boolean_1":true,"boolean_2":false,"boolean_3":true} | {"f8_1":1.1,"f8_2":2.2} | [{"col1":3},{"col2":2,"col3":1},{"col1":7}] |
    +------------+---------------+------------+------------+------------+------------+
    1 row selected (0.122 seconds)

You can use the KVGEN function to turn the maps in this data into key-value
pairs.
For example:

    0: jdbc:drill:zk=local> select kvgen(varcharcol) from dfs.yelp.`kvgendata.json`;
    +------------+
    |   EXPR$0   |
    +------------+
    | [{"key":"varchar_1","value":"abc"},{"key":"varchar_2","value":"def"},{"key":"varchar_3","value":"xyz"}] |
    | [{"key":"varchar_1","value":"abcd"}] |
    | [{"key":"varchar_1","value":"abcde"},{"key":"varchar_3","value":"xyz"},{"key":"varchar_4","value":"xyz2"}] |
    | [{"key":"varchar_1","value":"abc"},{"key":"varchar_2","value":"def"}] |
    +------------+
    4 rows selected (0.091 seconds)

Now you can apply the FLATTEN function to break out the key-value pairs into
distinct rows:

    0: jdbc:drill:zk=local> select flatten(kvgen(varcharcol)) from dfs.yelp.`kvgendata.json`;
    +------------+
    |   EXPR$0   |
    +------------+
    | {"key":"varchar_1","value":"abc"} |
    | {"key":"varchar_2","value":"def"} |
    | {"key":"varchar_3","value":"xyz"} |
    | {"key":"varchar_1","value":"abcd"} |
    | {"key":"varchar_1","value":"abcde"} |
    | {"key":"varchar_3","value":"xyz"} |
    | {"key":"varchar_4","value":"xyz2"} |
    | {"key":"varchar_1","value":"abc"} |
    | {"key":"varchar_2","value":"def"} |
    +------------+
    9 rows selected (0.151 seconds)

See the description of [FLATTEN](/confluence/display/DRILL/FLATTEN+Function)
for an example of a query against the flattened data.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/sql-ref/nested/003-repeated-cnt.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/sql-ref/nested/003-repeated-cnt.md b/_docs/drill-docs/sql-ref/nested/003-repeated-cnt.md
deleted file mode 100644
index a66075c..0000000
--- a/_docs/drill-docs/sql-ref/nested/003-repeated-cnt.md
+++ /dev/null
@@ -1,34 +0,0 @@
---
title: "REPEATED_COUNT Function"
parent: "Nested Data Functions"
---
This function counts the values in an array. The following example returns the
counts for the `categories` array in the `yelp_academic_dataset_business.json`
file. The counts are restricted to rows that contain the string `pizza`.

    SELECT name, REPEATED_COUNT(categories)
    FROM dfs.yelp.`yelp_academic_dataset_business.json`
    WHERE name LIKE '%pizza%';

    +---------------+------------+
    |     name      |   EXPR$1   |
    +---------------+------------+
    | Villapizza    | 2          |
    | zpizza        | 4          |
    | zpizza        | 4          |
    | Luckys pizza  | 2          |
    | Zpizza        | 2          |
    | S2pizzabar    | 4          |
    | Dominos pizza | 5          |
    +---------------+------------+

    7 rows selected (2.03 seconds)

The function requires a single argument, which must be an array. Note that
this function is not a standard SQL aggregate function and does not require
the count to be grouped by other columns in the select list (such as `name` in
this example).

For another example of this function, see the following lesson in the Apache
Drill Tutorial for Hadoop: [Lesson 3: Run Queries on Complex Data Types](/confluence/display/DRILL/Lesson+3%3A+Run+Queries+on+Complex+Data+Types).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/001-install-sandbox.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/001-install-sandbox.md b/_docs/drill-docs/tutorial/001-install-sandbox.md
deleted file mode 100644
index e63ddd4..0000000
--- a/_docs/drill-docs/tutorial/001-install-sandbox.md
+++ /dev/null
@@ -1,56 +0,0 @@
---
title: "Installing the Apache Drill Sandbox"
parent: "Apache Drill Tutorial"
---
This tutorial uses the MapR Sandbox, which is a Hadoop environment pre-configured with Apache Drill.

To complete the tutorial on the MapR Sandbox with Apache Drill, work through
the following pages in order:

  * [Installing the Apache Drill Sandbox](/confluence/display/DRILL/Installing+the+Apache+Drill+Sandbox)
  * [Getting to Know the Drill Setup](/confluence/display/DRILL/Getting+to+Know+the+Drill+Setup)
  * [Lesson 1: Learn About the Data Set](/confluence/display/DRILL/Lesson+1%3A+Learn+About+the+Data+Set)
  * [Lesson 2: Run Queries with ANSI SQL](/confluence/display/DRILL/Lesson+2%3A+Run+Queries+with+ANSI+SQL)
  * [Lesson 3: Run Queries on Complex Data Types](/confluence/display/DRILL/Lesson+3%3A+Run+Queries+on+Complex+Data+Types)
  * [Summary](/confluence/display/DRILL/Summary)

# About Apache Drill

Drill is an Apache open-source SQL query engine for Big Data exploration.
Drill is designed from the ground up to support high-performance analysis on
the semi-structured and rapidly evolving data coming from modern Big Data
applications, while still providing the familiarity and ecosystem of ANSI SQL,
the industry-standard query language. Drill provides plug-and-play integration
with existing Apache Hive and Apache HBase deployments. Apache Drill 0.5 offers
the following key features:

  * Low-latency SQL queries

  * Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and MapR-DB/HBase tables, without requiring metadata definitions in the Hive metastore.

  * ANSI SQL

  * Nested data support

  * Integration with Apache Hive (queries on Hive tables and views, support for all Hive file formats and Hive UDFs)

  * BI/SQL tool integration using standard JDBC/ODBC drivers

# MapR Sandbox with Apache Drill

MapR includes Apache Drill as part of the Hadoop distribution. The MapR
Sandbox with Apache Drill is a fully functional single-node cluster that can
be used to get an overview of Apache Drill in a Hadoop environment. Business
and technical analysts, product managers, and developers can use the sandbox
environment to get a feel for the power and capabilities of Apache Drill by
performing various types of queries. Once you get a feel for the technology,
refer to the [Apache Drill web site](http://incubator.apache.org/drill/) and
[Apache Drill documentation](https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki)
for more details.

Note that Hadoop is not a prerequisite for Drill and users can start ramping
up with Drill by running SQL queries directly on the local file system. Refer
to [Apache Drill in 10 minutes](https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes)
for an introduction to using Drill in local (embedded) mode.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/002-get2kno-sb.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/002-get2kno-sb.md b/_docs/drill-docs/tutorial/002-get2kno-sb.md
deleted file mode 100644
index e7b24a8..0000000
--- a/_docs/drill-docs/tutorial/002-get2kno-sb.md
+++ /dev/null
@@ -1,235 +0,0 @@
---
title: "Getting to Know the Drill Sandbox"
parent: "Apache Drill Tutorial"
---
This section describes the configuration of the Apache Drill system that you
have installed and introduces the overall use case for the tutorial.

# Storage Plugins Overview

The Hadoop cluster within the sandbox is set up with MapR-FS, MapR-DB, and
Hive, which all serve as data sources for Drill in this tutorial. Before you
can run queries against these data sources, Drill requires each one to be
configured as a storage plugin. A storage plugin defines the abstraction over
a data source that Drill talks to, and provides interfaces to read and write
data and to get metadata from the data source. Each storage plugin also
exposes optimization rules that Drill leverages for efficient query execution.

Take a look at the pre-configured storage plugins by opening the Drill Web UI.

Feel free to skip this section and jump directly to the queries: [Lesson 1:
Learn About the Data
Set](/confluence/display/DRILL/Lesson+1%3A+Learn+About+the+Data+Set)

  * Launch a web browser and go to: `http://<IP address of the sandbox>:8047`
  * Go to the Storage tab
  * Open the configured storage plugins one at a time by clicking Update
  * You will see the following plugins configured.

## maprdb

A storage plugin configuration for MapR-DB in the sandbox. Drill uses a single
storage plugin for connecting to HBase as well as MapR-DB, which is an
enterprise-grade in-Hadoop NoSQL database. See the [Apache Drill
Wiki](https://cwiki.apache.org/confluence/display/DRILL/Registering+HBase) for
information on how to configure Drill to query HBase.

    {
      "type" : "hbase",
      "enabled" : true,
      "config" : {
        "hbase.table.namespace.mappings" : "*:/tables"
      }
    }

## dfs

This is a storage plugin configuration for the MapR file system (MapR-FS) in
the sandbox. The connection attribute indicates the type of distributed file
system: in this case, MapR-FS. Drill can work with any distributed file
system, including HDFS, S3, and so on.

The configuration also includes a set of workspaces; each one represents a
location in MapR-FS:

  * root: access to the root file system location
  * clicks: access to nested JSON log data
  * logs: access to flat (non-nested) JSON log data in the logs directory and its subdirectories
  * views: a workspace for creating views

A workspace in Drill is a location where users can easily access a specific
set of data and collaborate with each other by sharing artifacts. Users can
create as many workspaces as they need within Drill.

Each workspace can also be configured as "writable" or not, which indicates
whether users can write data to this location and defines the storage format
in which the data will be written (parquet, csv, json). These attributes
become relevant when you explore Drill SQL commands, especially CREATE TABLE
AS (CTAS) and CREATE VIEW.

Drill can query files and directories directly and can detect the file formats
based on the file extension or the first few bits of data within the file.
However, additional information around formats is required for Drill, such as
delimiters for text files, which are specified in the "formats" section below.

    {
      "type": "file",
      "enabled": true,
      "connection": "maprfs:///",
      "workspaces": {
        "root": {
          "location": "/mapr/demo.mapr.com/data",
          "writable": false,
          "storageformat": null
        },
        "clicks": {
          "location": "/mapr/demo.mapr.com/data/nested",
          "writable": true,
          "storageformat": "parquet"
        },
        "logs": {
          "location": "/mapr/demo.mapr.com/data/flat",
          "writable": true,
          "storageformat": "parquet"
        },
        "views": {
          "location": "/mapr/demo.mapr.com/data/views",
          "writable": true,
          "storageformat": "parquet"
        }
      },
      "formats": {
        "psv": {
          "type": "text",
          "extensions": [
            "tbl"
          ],
          "delimiter": "|"
        },
        "csv": {
          "type": "text",
          "extensions": [
            "csv"
          ],
          "delimiter": ","
        },
        "tsv": {
          "type": "text",
          "extensions": [
            "tsv"
          ],
          "delimiter": "\t"
        },
        "parquet": {
          "type": "parquet"
        },
        "json": {
          "type": "json"
        }
      }
    }

## hive

A storage plugin configuration for a Hive data warehouse within the sandbox.
Drill connects to the Hive metastore by using the configured metastore thrift
URI. Metadata for Hive tables is automatically available for users to query.

    {
      "type": "hive",
      "enabled": true,
      "configProps": {
        "hive.metastore.uris": "thrift://localhost:9083",
        "hive.metastore.sasl.enabled": "false"
      }
    }

# Client Application Interfaces

Drill also provides application interfaces for client tools to connect to and
access Drill. The interfaces include the following.

### ODBC/JDBC drivers

Drill provides ODBC/JDBC drivers to connect from BI tools such as Tableau,
MicroStrategy, SQuirreL, and Jaspersoft; refer to [Using ODBC to Access Apache
Drill from BI Tools](http://doc.mapr.com/display/MapR/Using+ODBC+to+Access+Apache+Drill+from+BI+Tools)
and [Using JDBC to Access Apache Drill](http://doc.mapr.com/display/MapR/Using+JDBC+to+Access+Apache+Drill+from+SQuirreL)
to learn more.

### SQLLine

SQLLine is a JDBC application that comes packaged with Drill. In order to
start working with it, you can use the command line on the demo cluster to log
in as root, then enter `sqlline`. Use `mapr` as the login password. For
example:

    $ ssh root@localhost -p 2222
    Password:
    Last login: Mon Sep 15 13:46:08 2014 from 10.250.0.28
    Welcome to your Mapr Demo virtual machine.
    [root@maprdemo ~]# sqlline
    sqlline version 1.1.6
    0: jdbc:drill:>

### Drill Web UI

The Drill Web UI is a simple user interface for configuring and managing
Apache Drill. This UI can be launched from any of the nodes in the Drill
cluster. The configuration for Drill includes setting up storage plugins that
represent the data sources on which Drill performs queries. The sandbox comes
with storage plugins configured for the Hive, HBase, MapR file system, and
local file system.

Users and developers can get the necessary information for tuning and
performing diagnostics on queries, such as the list of queries executed in a
session and detailed query plan profiles for each.

Detailed configuration and management of Drill is out of scope for this
tutorial.

The Web interface for Apache Drill also provides a query UI where users can
submit queries to Drill and observe results.
Here is a screen shot of the Web UI for Apache Drill:

![](../../img/DrillWebUI.png)

### REST API

Drill provides a simple REST API for users to query data as well as manage the
system. The Web UI leverages the REST API to talk to Drill.

This tutorial introduces sample queries that you can run by using SQLLine.
Note that you can run the queries just as easily by launching the Drill Web
UI. No additional installation or configuration is required.

# Use Case Overview

As you run through the queries in this tutorial, put yourself in the shoes of
an analyst with basic SQL skills. Let us imagine that the analyst works for an
emerging online retail business that accepts purchases from its customers
through both an established web-based interface and a new mobile application.

The analyst is data-driven and operates mostly on the business side with
little or no interaction with the IT department. Recently the central IT team
has implemented a Hadoop-based infrastructure to reduce the cost of the legacy
database system, and most of the DWH/ETL workload is now handled by
Hadoop/Hive. The master customer profile information and product catalog are
managed in MapR-DB, which is a NoSQL database. The IT team has also started
acquiring clickstream data that comes from web and mobile applications. This
data is stored in Hadoop as JSON files.

The analyst has a number of data sources that he could explore, but exploring
them in isolation is not the way to go. There are some potentially very
interesting analytical connections between these data sources. For example, it
would be good to be able to analyze customer records in the clickstream data
and tie them to the master customer data in MapR-DB.

The analyst decides to explore these various data sources, and he chooses to
do that by using Apache Drill. Think about the flexibility and analytic
capability of Apache Drill as you work through the tutorial.

# What's Next

Start running queries by going to [Lesson 1: Learn About the Data
Set](/confluence/display/DRILL/Lesson+1%3A+Learn+About+the+Data+Set).

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/003-lesson1.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/003-lesson1.md b/_docs/drill-docs/tutorial/003-lesson1.md
deleted file mode 100644
index 8f3465f..0000000
--- a/_docs/drill-docs/tutorial/003-lesson1.md
+++ /dev/null
@@ -1,423 +0,0 @@
---
title: "Lesson 1: Learn about the Data Set"
parent: "Apache Drill Tutorial"
---
## Goal

This lesson is simply about discovering what data is available, in what
format, using simple SQL SELECT statements. Drill is capable of analyzing data
without prior knowledge or definition of its schema. This means that you can
start querying data immediately (and even as it changes), regardless of its
format.

The data set for the tutorial consists of:

  * Transactional data: stored as a Hive table

  * Product catalog and master customer data: stored as MapR-DB tables

  * Clickstream and logs data: stored in the MapR file system as JSON files

## Queries in This Lesson

This lesson consists of select * queries on each data source.
## Before You Begin

### Start sqlline

If sqlline is not already started, use a Terminal or Command window to log
into the demo VM as root, then enter `sqlline`:

    $ ssh root@10.250.0.6
    Password:
    Last login: Mon Sep 15 13:46:08 2014 from 10.250.0.28
    Welcome to your Mapr Demo virtual machine.
    [root@maprdemo ~]# sqlline
    sqlline version 1.1.6
    0: jdbc:drill:>

You can run queries from this prompt to complete the tutorial. To exit from
`sqlline`, type:

    0: jdbc:drill:> !quit

Note that though this tutorial demonstrates the queries using SQLLine, you can
also execute queries using the Drill Web UI.

### List the available workspaces and databases:

    0: jdbc:drill:> show databases;
    +-------------+
    | SCHEMA_NAME |
    +-------------+
    | hive.default |
    | dfs.default |
    | dfs.logs |
    | dfs.root |
    | dfs.views |
    | dfs.clicks |
    | dfs.data |
    | dfs.tmp |
    | sys |
    | maprdb |
    | cp.default |
    | INFORMATION_SCHEMA |
    +-------------+
    12 rows selected

Note that this command exposes all the metadata available from the storage
plugins configured with Drill as a set of schemas. This includes the Hive and
MapR-DB databases as well as the workspaces configured in the file system. As
you run queries in the tutorial, you will switch among these schemas by
submitting the USE command. This behavior resembles the ability to use
different database schemas (namespaces) in a relational database system.

## Query Hive Tables

The orders table is a six-column Hive table defined in the Hive metastore.
This is a Hive external table pointing to the data stored in flat files on the
MapR file system. The orders table contains 122,000 rows.

### Set the schema to hive:

    0: jdbc:drill:> use hive;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'hive' |
    +------------+------------+

You will run the USE command throughout this tutorial. The USE command sets
the schema for the current session.

### Describe the table:

You can use the DESCRIBE command to show the columns and data types for a Hive
table:

    0: jdbc:drill:> describe orders;
    +-------------+------------+-------------+
    | COLUMN_NAME | DATA_TYPE  | IS_NULLABLE |
    +-------------+------------+-------------+
    | order_id    | BIGINT     | YES         |
    | month       | VARCHAR    | YES         |
    | cust_id     | BIGINT     | YES         |
    | state       | VARCHAR    | YES         |
    | prod_id     | BIGINT     | YES         |
    | order_total | INTEGER    | YES         |
    +-------------+------------+-------------+

The DESCRIBE command returns complete schema information for Hive tables based
on the metadata available in the Hive metastore.

### Select 5 rows from the orders table:

    0: jdbc:drill:> select * from orders limit 5;
    +------------+------------+------------+------------+------------+-------------+
    |  order_id  |   month    |  cust_id   |   state    |  prod_id   | order_total |
    +------------+------------+------------+------------+------------+-------------+
    | 67212      | June       | 10001      | ca         | 909        | 13          |
    | 70302      | June       | 10004      | ga         | 420        | 11          |
    | 69090      | June       | 10011      | fl         | 44         | 76          |
    | 68834      | June       | 10012      | ar         | 0          | 81          |
    | 71220      | June       | 10018      | az         | 411        | 24          |
    +------------+------------+------------+------------+------------+-------------+

Because orders is a Hive table, you can query the data in the same way that
you would query the columns in a relational database table. Note the use of
the standard LIMIT clause, which limits the result set to the specified number
of rows.
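For instance, to see the five largest orders, you might pair LIMIT with an
ORDER BY clause (a quick sketch against the same table):

    0: jdbc:drill:> select * from orders order by order_total desc limit 5;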
You can use LIMIT with or without an ORDER BY clause.

Drill provides seamless integration with Hive by allowing queries on Hive
tables defined in the metastore with no extra configuration. Note that Hive is
not a prerequisite for Drill, but simply serves as a storage plugin or data
source for Drill. Drill also lets users query all Hive file formats (including
custom SerDes). Additionally, any UDFs defined in Hive can be leveraged as
part of Drill queries.

Because Drill has its own low-latency SQL query execution engine, you can
query Hive tables with high performance and support for interactive and ad-hoc
data exploration.

## Query MapR-DB and HBase Tables

The customers and products tables are MapR-DB tables. MapR-DB is an enterprise
in-Hadoop NoSQL database. It exposes the HBase API to support application
development. Every MapR-DB table has a row_key, in addition to one or more
column families. Each column family contains one or more specific columns. The
row_key value is a primary key that uniquely identifies each row.

Drill allows direct queries on MapR-DB and HBase tables. Unlike other SQL on
Hadoop options, Drill requires no overlay schema definitions in Hive to work
with this data. Think about a MapR-DB or HBase table with thousands of
columns, such as a time-series database, and the pain of having to manage
duplicate schemas for it in Hive!

### Products Table

The products table has two column families.

Column Family|Columns
-------------|-------
 details | name
 | category
 pricing | price

The products table contains 965 rows.

### Customers Table

The Customers table has three column families.

Column Family|Columns
-------------|-------
 address | state
 loyalty | agg_rev
 | membership
 personal | age
 | gender

The customers table contains 993 rows.

### Set the workspace to maprdb:

    0: jdbc:drill:> use maprdb;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'maprdb' |
    +------------+------------+

### Describe the tables:

    0: jdbc:drill:> describe customers;
    +-------------+------------+-------------+
    | COLUMN_NAME | DATA_TYPE  | IS_NULLABLE |
    +-------------+------------+-------------+
    | row_key     | ANY        | NO          |
    | address     | (VARCHAR(1), ANY) MAP | NO |
    | loyalty     | (VARCHAR(1), ANY) MAP | NO |
    | personal    | (VARCHAR(1), ANY) MAP | NO |
    +-------------+------------+-------------+

    0: jdbc:drill:> describe products;
    +-------------+------------+-------------+
    | COLUMN_NAME | DATA_TYPE  | IS_NULLABLE |
    +-------------+------------+-------------+
    | row_key     | ANY        | NO          |
    | details     | (VARCHAR(1), ANY) MAP | NO |
    | pricing     | (VARCHAR(1), ANY) MAP | NO |
    +-------------+------------+-------------+

Unlike the Hive example, the DESCRIBE command does not return the full schema
up to the column level. Wide-column NoSQL databases such as MapR-DB and HBase
can be schema-less by design; every row has its own set of column name-value
pairs in a given column family, and the column value can be of any data type,
as determined by the application inserting the data.

A "MAP" complex type in Drill represents this variable column name-value
structure, and "ANY" represents the fact that the column value can be of any
data type. Observe the row_key, which is also simply bytes and has the type
ANY.
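As an aside, you can already reference an individual column inside a column
family with dot notation through a table alias, for example (a sketch; the
value comes back as undecoded bytes until you apply the CAST functions covered
in Lesson 2):

    0: jdbc:drill:> select t.address.state from customers t limit 1;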
### Select 5 rows from the products table:

    0: jdbc:drill:> select * from products limit 5;
    +------------+------------+------------+
    |  row_key   |  details   |  pricing   |
    +------------+------------+------------+
    | [B@a1a3e25 | {"category":"bGFwdG9w","name":"IlNvbnkgbm90ZWJvb2si"} | {"price":"OTU5"} |
    | [B@103a43af | {"category":"RW52ZWxvcGVz","name":"IzEwLTQgMS84IHggOSAxLzIgUHJlbWl1bSBEaWFnb25hbCBTZWFtIEVudmVsb3Blcw=="} | {"price":"MT |
    | [B@61319e7b | {"category":"U3RvcmFnZSAmIE9yZ2FuaXphdGlvbg==","name":"MjQgQ2FwYWNpdHkgTWF4aSBEYXRhIEJpbmRlciBSYWNrc1BlYXJs"} | {"price" |
    | [B@9bcf17 | {"category":"TGFiZWxz","name":"QXZlcnkgNDk4"} | {"price":"Mw=="} |
    | [B@7538ef50 | {"category":"TGFiZWxz","name":"QXZlcnkgNDk="} | {"price":"Mw=="} |

Given that Drill requires no up-front schema definitions indicating data
types, the query returns the raw byte arrays for column values, just as they
are stored in MapR-DB (or HBase). Observe that the column families (details
and pricing) have the map data type and appear as JSON strings.

In Lesson 2, you will use CAST functions to return typed data for each column.

### Select 5 rows from the customers table:

    0: jdbc:drill:> select * from customers limit 5;
    +------------+------------+------------+------------+
    |  row_key   |  address   |  loyalty   |  personal  |
    +------------+------------+------------+------------+
    | [B@284bae62 | {"state":"Imt5Ig=="} | {"agg_rev":"IjEwMDEtMzAwMCI=","membership":"ImJhc2ljIg=="} | {"age":"IjI2LTM1Ig==","gender":"Ik1B |
    | [B@7ffa4523 | {"state":"ImNhIg=="} | {"agg_rev":"IjAtMTAwIg==","membership":"ImdvbGQi"} | {"age":"IjI2LTM1Ig==","gender":"IkZFTUFMRSI= |
    | [B@7d13e79 | {"state":"Im9rIg=="} | {"agg_rev":"IjUwMS0xMDAwIg==","membership":"InNpbHZlciI="} | {"age":"IjI2LTM1Ig==","gender":"IkZFT |
    | [B@3a5c7df1 | {"state":"ImtzIg=="} | {"agg_rev":"IjMwMDEtMTAwMDAwIg==","membership":"ImdvbGQi"} | {"age":"IjUxLTEwMCI=","gender":"IkZF |
    | [B@e507726 | {"state":"Im5qIg=="} | {"agg_rev":"IjAtMTAwIg==","membership":"ImJhc2ljIg=="} | {"age":"IjIxLTI1Ig==","gender":"Ik1BTEUi" |
    +------------+------------+------------+------------+

Again the table returns byte data that needs to be cast to readable data
types.

## Query the File System

Along with querying a data source with full schemas (such as Hive) and partial
schemas (such as MapR-DB and HBase), Drill offers the unique capability to
perform SQL queries directly on the file system. The file system could be a
local file system, or a distributed file system such as MapR-FS, HDFS, or S3.

In the context of Drill, a file or a directory is considered synonymous with a
relational database "table." Therefore, you can perform SQL operations
directly on files and directories without the need for up-front schema
definitions or schema management for any model changes. The schema is
discovered on the fly based on the query. Drill supports queries on a variety
of file formats including text, CSV, Parquet, and JSON in the 0.5 release.

In this example, the clickstream data coming from the mobile/web applications
is in JSON format.
The JSON files have the following structure:

    {"trans_id":31920,"date":"2014-04-26","time":"12:17:12","user_info":{"cust_id":22526,"device":"IOS5","state":"il"},"trans_info":{"prod_id":[174,2],"purch_flag":"false"}}
    {"trans_id":31026,"date":"2014-04-20","time":"13:50:29","user_info":{"cust_id":16368,"device":"AOS4.2","state":"nc"},"trans_info":{"prod_id":[],"purch_flag":"false"}}
    {"trans_id":33848,"date":"2014-04-10","time":"04:44:42","user_info":{"cust_id":21449,"device":"IOS6","state":"oh"},"trans_info":{"prod_id":[582],"purch_flag":"false"}}

The clicks.json and clicks.campaign.json files contain metadata as part of the
data itself (referred to as "self-describing" data). Also note that the data
elements are complex, or nested. The initial queries below do not show how to
unpack the nested data, but they show that easy access to the data requires no
setup beyond the definition of a workspace.

### Query nested clickstream data

#### Set the workspace to dfs.clicks:

    0: jdbc:drill:> use dfs.clicks;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.clicks' |
    +------------+------------+

In this case, setting the workspace is a mechanism for making queries easier
to write. When you specify a file system workspace, you can shorten references
to files in the FROM clause of your queries. Instead of having to provide the
complete path to a file, you can provide the path relative to a directory
location specified in the workspace. For example:

    "location": "/mapr/demo.mapr.com/data/nested"

Any file or directory that you want to query in this path can be referenced
relative to this path. The clicks directory referred to in the following query
is directly below the nested directory.

#### Select 2 rows from the clicks.json file:

    0: jdbc:drill:> select * from `clicks/clicks.json` limit 2;
    +------------+------------+------------+------------+------------+
    |  trans_id  |    date    |    time    | user_info  | trans_info |
    +------------+------------+------------+------------+------------+
    | 31920      | 2014-04-26 | 12:17:12   | {"cust_id":22526,"device":"IOS5","state":"il"} | {"prod_id":[174,2],"purch_flag":"false"} |
    | 31026      | 2014-04-20 | 13:50:29   | {"cust_id":16368,"device":"AOS4.2","state":"nc"} | {"prod_id":[],"purch_flag":"false"} |
    +------------+------------+------------+------------+------------+
    2 rows selected

Note that the FROM clause reference points to a specific file. Drill expands
the traditional concept of a "table reference" in a standard SQL FROM clause
to refer to a file in a local or distributed file system.

The only special requirement is the use of back ticks to enclose the file
path. This is necessary whenever the file path contains Drill reserved words
or characters.
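Back ticks also let you bypass the workspace and query a file by its full
path; for example, a sketch that uses the absolute path from the dfs
configuration shown earlier:

    0: jdbc:drill:> select trans_id from dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` limit 1;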
#### Select 2 rows from the campaign.json file:

    0: jdbc:drill:> select * from `clicks/clicks.campaign.json` limit 2;
    +------------+------------+------------+------------+------------+------------+
    |  trans_id  |    date    |    time    | user_info  |  ad_info   | trans_info |
    +------------+------------+------------+------------+------------+------------+
    | 35232      | 2014-05-10 | 00:13:03   | {"cust_id":18520,"device":"AOS4.3","state":"tx"} | {"camp_id":"null"} | {"prod_id":[7,7],"purch_flag":"true"} |
    | 31995      | 2014-05-22 | 16:06:38   | {"cust_id":17182,"device":"IOS6","state":"fl"} | {"camp_id":"null"} | {"prod_id":[],"purch_flag":"false"} |
    +------------+------------+------------+------------+------------+------------+
    2 rows selected

Notice that with a select * query, any complex data types such as maps and
arrays return as JSON strings. You will see how to unpack this data using
various SQL functions and operators in the next lesson.

## Query Logs Data

Unlike the previous example where we performed queries against clicks data in
one file, logs data is stored as partitioned directories on the file system.
The logs directory has three subdirectories:

  * 2012

  * 2013

  * 2014

Each of these year directories fans out to a set of numbered month
directories, and each month directory contains a JSON file with log records
for that month. The total number of records in all log files is 48000.

The files in the logs directory and its subdirectories are JSON files. There
are many of these files, but you can use Drill to query them all as a single
data source, or to query a subset of the files.

#### Set the workspace to dfs.logs:

    0: jdbc:drill:> use dfs.logs;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.logs' |
    +------------+------------+

#### Select 2 rows from the logs directory:

    0: jdbc:drill:> select * from logs limit 2;
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+
    |    dir0    |    dir1    |  trans_id  |    date    |    time    |  cust_id   |   device   |   state    |  camp_id   |  keywords  |  prod_id   | purch_fl |
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+
    | 2014       | 8          | 24181      | 08/02/2014 | 09:23:52   | 0          | IOS5       | il         | 2          | wait       | 128        | false    |
    | 2014       | 8          | 24195      | 08/02/2014 | 07:58:19   | 243        | IOS5       | mo         | 6          | hmm        | 107        | false    |
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------+

Note that this is flat JSON data. The dfs.logs workspace location property
points to a directory that contains the logs directory, making the FROM clause
reference for this query very simple. You do not have to refer to the complete
directory path on the file system.

The column names dir0 and dir1 are special Drill variables that identify
subdirectories below the logs directory. In Lesson 3, you will do more complex
queries that leverage these dynamic variables.

#### Find the total number of rows in the logs directory (all files):

    0: jdbc:drill:> select count(*) from logs;
    +------------+
    |   EXPR$0   |
    +------------+
    | 48000      |
    +------------+

This query traverses all of the files in the logs directory and its
subdirectories to return the total number of rows in those files.
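To break that count down by partition, you could group on the directory
variables, for example (a sketch):

    0: jdbc:drill:> select dir0 as yr, count(*) as row_count from logs group by dir0;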
# What's Next

Go to [Lesson 2: Run Queries with ANSI
SQL](/confluence/display/DRILL/Lesson+2%3A+Run+Queries+with+ANSI+SQL).

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/004-lesson2.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/004-lesson2.md b/_docs/drill-docs/tutorial/004-lesson2.md
deleted file mode 100644
index d9c68d5..0000000
--- a/_docs/drill-docs/tutorial/004-lesson2.md
+++ /dev/null
@@ -1,392 +0,0 @@
---
title: "Lesson 2: Run Queries with ANSI SQL"
parent: "Apache Drill Tutorial"
---
## Goal

This lesson shows how to do some standard SQL analysis in Apache Drill: for
example, summarizing data by using simple aggregate functions and connecting
data sources by using joins. Note that Apache Drill provides ANSI SQL support,
not a "SQL-like" interface.

## Queries in This Lesson

Now that you know what the data sources look like in their raw form, using
select * queries, try running some simple but more useful queries on each data
source. These queries demonstrate how Drill supports ANSI SQL constructs and
also how you can combine data from different data sources in a single SELECT
statement.

  * Show an aggregate query on a single file or table. Use GROUP BY, WHERE, HAVING, and ORDER BY clauses.

  * Perform joins between Hive, MapR-DB, and file system data sources.

  * Use table and column aliases.

  * Create a Drill view.

## Aggregation

### Set the schema to hive:

    0: jdbc:drill:> use hive;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'hive' |
    +------------+------------+
    1 row selected

### Return sales totals by month:

    0: jdbc:drill:> select `month`, sum(order_total)
    from orders group by `month` order by 2 desc;
    +------------+------------+
    |   month    |   EXPR$1   |
    +------------+------------+
    | June       | 950481     |
    | May        | 947796     |
    | March      | 836809     |
    | April      | 807291     |
    | July       | 757395     |
    | October    | 676236     |
    | August     | 572269     |
    | February   | 532901     |
    | September  | 373100     |
    | January    | 346536     |
    +------------+------------+

Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN.
Standard SQL clauses work in the same way in Drill queries as in relational
database queries.

Note that back ticks are required for the `month` column only because "month"
is a reserved word in SQL.

### Return the top 20 sales totals by month and state:

    0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state
    order by 3 desc limit 20;
    +------------+------------+------------+
    |   month    |   state    |   sales    |
    +------------+------------+------------+
    | May        | ca         | 119586     |
    | June       | ca         | 116322     |
    | April      | ca         | 101363     |
    | March      | ca         | 99540      |
    | July       | ca         | 90285      |
    | October    | ca         | 80090      |
    | June       | tx         | 78363      |
    | May        | tx         | 77247      |
    | March      | tx         | 73815      |
    | August     | ca         | 71255      |
    | April      | tx         | 68385      |
    | July       | tx         | 63858      |
    | February   | ca         | 63527      |
    | June       | fl         | 62199      |
    | June       | ny         | 62052      |
    | May        | fl         | 61651      |
    | May        | ny         | 59369      |
    | October    | tx         | 55076      |
    | March      | fl         | 54867      |
    | March      | ny         | 52101      |
    +------------+------------+------------+
    20 rows selected

Note the alias for the result of the SUM function. Drill supports column
aliases and table aliases.

## HAVING Clause

This query uses the HAVING clause to constrain an aggregate result.
### Set the workspace to dfs.clicks

    0: jdbc:drill:> use dfs.clicks;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.clicks' |
    +------------+------------+
    1 row selected

### Return total number of clicks for devices that indicate high click-throughs:

    0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t
    group by t.user_info.device
    having count(*) > 1000;
    +------------+------------+
    |   EXPR$0   |   EXPR$1   |
    +------------+------------+
    | IOS5       | 11814      |
    | AOS4.2     | 5986       |
    | IOS6       | 4464       |
    | IOS7       | 3135       |
    | AOS4.4     | 1562       |
    | AOS4.3     | 3039       |
    +------------+------------+

The aggregate is a count of the records for each different mobile device in
the clickstream data. Only the activity for the devices that registered more
than 1000 transactions qualifies for the result set.

## UNION Operator

Use the same workspace as before (dfs.clicks).

### Combine clicks activity from before and after the marketing campaign

    0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t
    union all
    select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u limit 5;
    +-------------+------------+
    | transaction |  customer  |
    +-------------+------------+
    | 35232       | 18520      |
    | 31995       | 17182      |
    | 35760       | 18228      |
    | 37090       | 17015      |
    | 37838       | 18737      |
    +-------------+------------+

This UNION ALL query returns rows that exist in two files (and includes any
duplicate rows from those files): `clicks.campaign.json` and `clicks.json`.

## Subqueries

### Set the workspace to hive:

    0: jdbc:drill:> use hive;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'hive' |
    +------------+------------+

### Compare order totals across states:

    0: jdbc:drill:> select o1.cust_id, sum(o1.order_total) as ny_sales,
    (select sum(o2.order_total) from hive.orders o2
    where o1.cust_id=o2.cust_id and state='ca') as ca_sales
    from hive.orders o1 where o1.state='ny' group by o1.cust_id
    order by cust_id limit 20;
    +------------+------------+------------+
    |  cust_id   |  ny_sales  |  ca_sales  |
    +------------+------------+------------+
    | 1001       | 72         | 47         |
    | 1002       | 108        | 198        |
    | 1003       | 83         | null       |
    | 1004       | 86         | 210        |
    | 1005       | 168        | 153        |
    | 1006       | 29         | 326        |
    | 1008       | 105        | 168        |
    | 1009       | 443        | 127        |
    | 1010       | 75         | 18         |
    | 1012       | 110        | null       |
    | 1013       | 19         | null       |
    | 1014       | 106        | 162        |
    | 1015       | 220        | 153        |
    | 1016       | 85         | 159        |
    | 1017       | 82         | 56         |
    | 1019       | 37         | 196        |
    | 1020       | 193        | 165        |
    | 1022       | 124        | null       |
    | 1023       | 166        | 149        |
    | 1024       | 233        | null       |
    +------------+------------+------------+

This example demonstrates Drill support for correlated subqueries. This query
uses a subquery in the select list and correlates the result of the subquery
with the outer query, using the cust_id column reference. The subquery returns
the sum of order totals for California, and the outer query returns the
equivalent sum, for the same cust_id, for New York.

The result set is sorted by the cust_id and presents the sales totals side by
side for easy comparison. Null values indicate customer IDs that did not
register any sales in that state.
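To keep only the customers who registered sales in both states, you could wrap
the same query as a derived table and filter in the outer query, as in this
sketch:

    0: jdbc:drill:> select * from
    (select o1.cust_id, sum(o1.order_total) as ny_sales,
    (select sum(o2.order_total) from hive.orders o2
    where o1.cust_id=o2.cust_id and state='ca') as ca_sales
    from hive.orders o1 where o1.state='ny' group by o1.cust_id) sq
    where sq.ca_sales is not null
    order by sq.cust_id limit 5;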
## CAST Function

### Use the maprdb workspace:

    0: jdbc:drill:> use maprdb;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'maprdb' |
    +------------+------------+
    1 row selected

### Return customer data with appropriate data types

    0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name,
    cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
    cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
    cast(t.loyalty.membership as varchar(20)) as membership
    from customers t limit 5;
    +------------+------------+------------+------------+------------+------------+------------+
    |  cust_id   |    name    |   gender   |    age     |   state    |  agg_rev   | membership |
    +------------+------------+------------+------------+------------+------------+------------+
    | 10001      | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
    | 10005      | "Brittany Park" | "MALE" | "26-35" | "in" | 230.00 | "silver" |
    | 10006      | "Rose Lokey" | "MALE" | "26-35" | "ca" | 250.00 | "silver" |
    | 10007      | "James Fowler" | "FEMALE" | "51-100" | "me" | 263.00 | "silver" |
    | 10010      | "Guillermo Koehler" | "OTHER" | "51-100" | "mn" | 202.00 | "silver" |
    +------------+------------+------------+------------+------------+------------+------------+
    5 rows selected

Note the following features of this query:

  * The CAST function is required for every column in the table. This function returns the MapR-DB/HBase binary data as readable integers and strings. Alternatively, you can use CONVERT_TO/CONVERT_FROM functions to decode the columns. CONVERT_TO and CONVERT_FROM are more efficient than CAST in most cases.
  * The row_key column functions as the primary key of the table (a customer ID in this case).
  * The table alias t is required; otherwise the column family names would be parsed as table names and the query would return an error.

### Remove the quotes from the strings:

You can use the regexp_replace function to remove the quotes around the
strings in the query results. For example, to return a state name of va
instead of "va":

    0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
    from customers t limit 1;
    +------------+------------+
    |   EXPR$0   |   EXPR$1   |
    +------------+------------+
    | 10001      | va         |
    +------------+------------+
    1 row selected

## CREATE VIEW Command

    0: jdbc:drill:> use dfs.views;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.views' |
    +------------+------------+

### Use a mutable workspace:

A mutable (or writable) workspace is a workspace that is enabled for "write"
operations. This attribute is part of the storage plugin configuration. You
can create Drill views and tables in mutable workspaces.
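For example, the views workspace from the dfs storage plugin configuration
shown earlier in this tutorial is writable:

    "views": {
      "location": "/mapr/demo.mapr.com/data/views",
      "writable": true,
      "storageformat": "parquet"
    }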
### Create a view on a MapR-DB table

    0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
    cast(t.personal.name as varchar(20)) as name,
    cast(t.personal.gender as varchar(10)) as gender,
    cast(t.personal.age as varchar(10)) as age,
    cast(t.address.state as varchar(4)) as state,
    cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
    cast(t.loyalty.membership as varchar(20)) as membership
    from maprdb.customers t;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | View 'custview' replaced successfully in 'dfs.views' schema |
    +------------+------------+
    1 row selected

Drill provides CREATE OR REPLACE VIEW syntax similar to relational databases
to create views. Use the OR REPLACE option to make it easier to update the
view later without having to remove it first. Note that the FROM clause in
this example must refer to maprdb.customers. The MapR-DB tables are not
directly visible to the dfs.views workspace.

Unlike a traditional database where views typically are DBA/developer-driven
operations, file system-based views in Drill are very lightweight. A view is
simply a special file with a specific extension (.drill). You can store views
even in your local file system or point to a specific workspace. You can
specify any query against any Drill data source in the body of the CREATE VIEW
statement.

Drill provides a decentralized metadata model. Drill is able to query metadata
defined in data sources such as Hive, HBase, and the file system. Drill also
supports the creation of metadata in the file system.

### Query data from the view:

    0: jdbc:drill:> select * from custview limit 1;
    +------------+------------+------------+------------+------------+------------+------------+
    |  cust_id   |    name    |   gender   |    age     |   state    |  agg_rev   | membership |
    +------------+------------+------------+------------+------------+------------+------------+
    | 10001      | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
    +------------+------------+------------+------------+------------+------------+------------+

Once users have an idea of what data is available by exploring it directly
from the file system, views can be used as a way to bring the data into
downstream tools such as Tableau and MicroStrategy for analysis and
visualization. For these tools, a view appears simply as a "table" with
selectable "columns" in it.

## Query Across Data Sources

Continue using dfs.views for this query.

### Join the customers view and the orders table:

    0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
    where orders.cust_id=custview.cust_id
    group by membership order by 2;
    +------------+------------+
    | membership |   sales    |
    +------------+------------+
    | "basic"    | 380665     |
    | "silver"   | 708438     |
    | "gold"     | 2787682    |
    +------------+------------+
    3 rows selected

In this query, we are reading data from a MapR-DB table (represented by
custview) and combining it with the order information in Hive. When doing
cross data source queries such as this, you need to use fully qualified
table/view names. For example, the orders table is prefixed by "hive," which
is the storage plugin name registered with Drill. We are not using any prefix
for "custview" because we explicitly switched to the dfs.views workspace where
custview is stored.
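From any other schema, you could still reach the view through its fully
qualified name, for example (a sketch):

    0: jdbc:drill:> select * from dfs.views.custview limit 1;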
Note: If the results of any of your queries appear to be truncated because the
rows are wide, set the maximum width of the display to 10000:

    !set maxwidth 10000

Do not use a semicolon for this SET command.

### Join the customers, orders, and clickstream data:

    0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
    dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c
    where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
    group by custview.membership order by 2;
    +------------+------------+
    | membership |   sales    |
    +------------+------------+
    | "basic"    | 372866     |
    | "silver"   | 728424     |
    | "gold"     | 7050198    |
    +------------+------------+
    3 rows selected

This three-way join selects from three different data sources in one query:

  * hive.orders table
  * custview (a view of the HBase customers table)
  * clicks.json file

The join column for both sets of join conditions is the cust_id column. The
views workspace is used for this query so that custview can be accessed. The
hive.orders table is also visible to the query.

However, note that the JSON file is not directly visible from the views
workspace, so the query specifies the full path to the file:

    dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`

# What's Next

Go to [Lesson 3: Run Queries on Complex Data Types](/confluence/display/DRILL/Lesson+3%3A+Run+Queries+on+Complex+Data+Types).

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/005-lesson3.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/005-lesson3.md b/_docs/drill-docs/tutorial/005-lesson3.md
deleted file mode 100644
index d9b362a..0000000
--- a/_docs/drill-docs/tutorial/005-lesson3.md
+++ /dev/null
@@ -1,379 +0,0 @@
---
title: "Lesson 3: Run Queries on Complex Data Types"
parent: "Apache Drill Tutorial"
---
## Goal

This lesson focuses on queries that exercise functions and operators on self-
describing data and complex data types. Drill offers intuitive SQL extensions
to work with such data and offers high query performance with an architecture
built from the ground up for complex data.

## Queries in This Lesson

Now that you have run ANSI SQL queries against different tables and files with
relational data, you can try some examples including complex types.

  * Access directories and subdirectories of files in a single SELECT statement.
  * Demonstrate simple ways to access complex data in JSON files.
  * Demonstrate the repeated_count function to aggregate values in an array.

## Query Partitioned Directories

You can use special variables in Drill to refer to subdirectories in your
workspace path:

  * dir0
  * dir1
  * ...

Note that these variables are dynamically determined based on the partitioning
of the file system. No up-front definitions are required on what partitions
exist.
Here is a visual example of how this works:

![example_query.png](../../img/example_query.png)

### Set workspace to dfs.logs:

    0: jdbc:drill:> use dfs.logs;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.logs' |
    +------------+------------+

### Query logs data for a specific year:

    0: jdbc:drill:> select * from logs where dir0='2013' limit 10;
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-----------+------------+
    |    dir0    |    dir1    |  trans_id  |    date    |    time    |  cust_id   |   device   |   state    |  camp_id   |  keywords  |  prod_id  | purch_flag |
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-----------+------------+
    | 2013       | 11         | 12119      | 11/09/2013 | 02:24:51   | 262        | IOS5       | ny         | 0          | chamber    | 198       | false      |
    | 2013       | 11         | 12120      | 11/19/2013 | 09:37:43   | 0          | AOS4.4     | il         | 2          | outside    | 511       | false      |
    | 2013       | 11         | 12134      | 11/10/2013 | 23:42:47   | 60343      | IOS5       | ma         | 4          | and        | 421       | false      |
    | 2013       | 11         | 12135      | 11/16/2013 | 01:42:13   | 46762      | AOS4.3     | ca         | 4          | here's     | 349       | false      |
    | 2013       | 11         | 12165      | 11/26/2013 | 21:58:09   | 41987      | AOS4.2     | mn         | 4          | he         | 271       | false      |
    | 2013       | 11         | 12168      | 11/09/2013 | 23:41:48   | 8600       | IOS5       | in         | 6          | i          | 459       | false      |
    | 2013       | 11         | 12196      | 11/20/2013 | 02:23:06   | 15603      | IOS5       | tn         | 1          | like       | 324       | false      |
    | 2013       | 11         | 12203      | 11/25/2013 | 23:50:29   | 221        | IOS6       | tx         | 10         | if         | 323       | false      |
    | 2013       | 11         | 12206      | 11/09/2013 | 23:53:01   | 2488       | AOS4.2     | tx         | 14         | unlike     | 296       | false      |
    | 2013       | 11         | 12217      | 11/06/2013 | 23:51:56   | 0          | AOS4.2     | tx         | 9          | can't      | 54        | false      |
    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-----------+------------+

This query constrains files inside the subdirectory named 2013. The variable
dir0 refers to the first level down from logs, dir1 to the next level, and so
on. So this query returned 10 of the rows for November 2013.

### Further constrain the results using multiple predicates in the query:

This query returns a list of customer IDs for people who made a purchase via
an IOS5 device in August 2013.

    0: jdbc:drill:> select dir0 as yr, dir1 as mth, cust_id from logs
    where dir0='2013' and dir1='8' and device='IOS5' and purch_flag='true'
    order by `date`;
    +------------+------------+------------+
    |     yr     |    mth     |  cust_id   |
    +------------+------------+------------+
    | 2013       | 8          | 4          |
    | 2013       | 8          | 521        |
    | 2013       | 8          | 1          |
    | 2013       | 8          | 2          |
    | 2013       | 8          | 4          |
    | 2013       | 8          | 549        |
    | 2013       | 8          | 72827      |
    | 2013       | 8          | 38127      |
    ...

### Return monthly counts per customer for a given year:

    0: jdbc:drill:> select cust_id, dir1 month_no, count(*) month_count from logs
    where dir0=2014 group by cust_id, dir1 order by cust_id, month_no limit 10;
    +------------+------------+-------------+
    |  cust_id   |  month_no  | month_count |
    +------------+------------+-------------+
    | 0          | 1          | 143         |
    | 0          | 2          | 118         |
    | 0          | 3          | 117         |
    | 0          | 4          | 115         |
    | 0          | 5          | 137         |
    | 0          | 6          | 117         |
    | 0          | 7          | 142         |
    | 0          | 8          | 19          |
    | 1          | 1          | 66          |
    | 1          | 2          | 59          |
    +------------+------------+-------------+
    10 rows selected

This query groups the aggregate function by customer ID and month for one
year: 2014.
## Query Complex Data

Drill provides some specialized operators and functions that you can use to
analyze nested data natively without transformation. If you are familiar with
JavaScript notation, you will already know how some of these extensions work.

### Set the workspace to dfs.clicks:

    0: jdbc:drill:> use dfs.clicks;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.clicks' |
    +------------+------------+

### Explore clickstream data:

Note that the user_info and trans_info columns contain nested data: arrays and
arrays within arrays. The following queries show how to access this complex
data.

    0: jdbc:drill:> select * from `clicks/clicks.json` limit 5;
    +------------+------------+------------+------------+------------+
    | trans_id   | date       | time       | user_info  | trans_info |
    +------------+------------+------------+------------+------------+
    | 31920 | 2014-04-26 | 12:17:12 | {"cust_id":22526,"device":"IOS5","state":"il"} | {"prod_id":[174,2],"purch_flag":"false"} |
    | 31026 | 2014-04-20 | 13:50:29 | {"cust_id":16368,"device":"AOS4.2","state":"nc"} | {"prod_id":[],"purch_flag":"false"} |
    | 33848 | 2014-04-10 | 04:44:42 | {"cust_id":21449,"device":"IOS6","state":"oh"} | {"prod_id":[582],"purch_flag":"false"} |
    | 32383 | 2014-04-18 | 06:27:47 | {"cust_id":20323,"device":"IOS5","state":"oh"} | {"prod_id":[710,47],"purch_flag":"false"} |
    | 32359 | 2014-04-19 | 23:13:25 | {"cust_id":15360,"device":"IOS5","state":"ca"} | {"prod_id":[0,8,170,173,1,124,46,764,30,711,0,3,25],"purch_flag":"true"} |
    +------------+------------+------------+------------+------------+

### Unpack the user_info column:

    0: jdbc:drill:> select t.user_info.cust_id as custid, t.user_info.device as device,
    t.user_info.state as state
    from `clicks/clicks.json` t limit 5;
    +------------+------------+------------+
    | custid     | device     | state      |
    +------------+------------+------------+
    | 22526      | IOS5       | il         |
    | 16368      | AOS4.2     | nc         |
    | 21449      | IOS6       | oh         |
    | 20323      | IOS5       | oh         |
    | 15360      | IOS5       | ca         |
    +------------+------------+------------+

This query uses a simple table.column.column notation to extract nested column
data. For example:

    t.user_info.cust_id

where `t` is the table alias provided in the query, `user_info` is a top-level
column name, and `cust_id` is a nested column name.

The table alias is required; otherwise column names such as `user_info` are
parsed as table names by the SQL parser.

### Unpack the trans_info column:

    0: jdbc:drill:> select t.trans_info.prod_id as prodid, t.trans_info.purch_flag as purchased
    from `clicks/clicks.json` t limit 5;
    +------------+------------+
    | prodid     | purchased  |
    +------------+------------+
    | [174,2]    | false      |
    | []         | false      |
    | [582]      | false      |
    | [710,47]   | false      |
    | [0,8,170,173,1,124,46,764,30,711,0,3,25] | true |
    +------------+------------+
    5 rows selected

Note that this result reveals that the prod_id column contains an array of IDs
(one or more product ID values per row, separated by commas). The next step
shows how to access this kind of data.

## Query Arrays

Now use the [n] notation, where n is the position of the value in an array,
starting from position 0 (not 1) for the first value. You can use this
notation to write interesting queries against nested array data.
For example:

    trans_info.prod_id[0]

refers to the first value in the nested prod_id column and

    trans_info.prod_id[20]

refers to the 21st value, assuming one exists.

### Find the first product searched for in each transaction:

    0: jdbc:drill:> select t.trans_id, t.trans_info.prod_id[0] from `clicks/clicks.json` t limit 5;
    +------------+------------+
    | trans_id   | EXPR$1     |
    +------------+------------+
    | 31920      | 174        |
    | 31026      | null       |
    | 33848      | 582        |
    | 32383      | 710        |
    | 32359      | 0          |
    +------------+------------+
    5 rows selected

### For which transactions did customers search on at least 21 products?

    0: jdbc:drill:> select t.trans_id, t.trans_info.prod_id[20]
    from `clicks/clicks.json` t
    where t.trans_info.prod_id[20] is not null
    order by trans_id limit 5;
    +------------+------------+
    | trans_id   | EXPR$1     |
    +------------+------------+
    | 10328      | 0          |
    | 10380      | 23         |
    | 10701      | 1          |
    | 11100      | 0          |
    | 11219      | 46         |
    +------------+------------+
    5 rows selected

This query returns transaction IDs and product IDs for records that contain a
non-null product ID at the 21st position in the array.

### Return clicks for a specific product range:

    0: jdbc:drill:> select * from (select t.trans_id, t.trans_info.prod_id[0] as prodid,
    t.trans_info.purch_flag as purchased
    from `clicks/clicks.json` t) sq
    where sq.prodid between 700 and 750 and sq.purchased='true'
    order by sq.prodid;
    +------------+------------+------------+
    | trans_id   | prodid     | purchased  |
    +------------+------------+------------+
    | 21886      | 704        | true       |
    | 20674      | 708        | true       |
    | 22158      | 709        | true       |
    | 34089      | 714        | true       |
    | 22545      | 714        | true       |
    | 37500      | 717        | true       |
    | 36595      | 718        | true       |
    ...

This query assumes that there is some meaning to the array (that it is an
ordered list of products purchased rather than a random list).

## Perform Operations on Arrays

### Rank successful click conversions and count product searches for each session:

    0: jdbc:drill:> select t.trans_id, t.`date` as session_date, t.user_info.cust_id as
    cust_id, t.user_info.device as device, repeated_count(t.trans_info.prod_id) as
    prod_count, t.trans_info.purch_flag as purch_flag
    from `clicks/clicks.json` t
    where t.trans_info.purch_flag = 'true' order by prod_count desc;
    +------------+--------------+------------+------------+------------+------------+
    | trans_id   | session_date | cust_id    | device     | prod_count | purch_flag |
    +------------+--------------+------------+------------+------------+------------+
    | 37426      | 2014-04-06   | 18709      | IOS5       | 34         | true       |
    | 31589      | 2014-04-16   | 18576      | IOS6       | 31         | true       |
    | 11600      | 2014-04-07   | 4260       | AOS4.2     | 28         | true       |
    | 35074      | 2014-04-03   | 16697      | AOS4.3     | 27         | true       |
    | 17192      | 2014-04-22   | 2501       | AOS4.2     | 26         | true       |
    ...

This query uses a Drill SQL extension, the repeated_count function, to get an
aggregated count of the array values. The query returns the number of products
searched in each session that converted into a purchase and ranks the counts
in descending order. Only clicks that have resulted in a purchase are counted.

## Store a Result Set in a Table for Reuse and Analysis

Finally, run another correlated subquery that returns a fairly large result
set. To facilitate additional analysis on this result set, you can easily and
quickly create a Drill table from the results of the query.
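The statement to use is CREATE TABLE AS SELECT (CTAS). As a minimal sketch with
a hypothetical table name (purchased_sessions is not part of the tutorial
data), any query result can be materialized this way:

    -- materialize a query result as a table in the current writable workspace
    create table purchased_sessions as
    select t.trans_id, t.trans_info.purch_flag as purch_flag
    from `clicks/clicks.json` t
    where t.trans_info.purch_flag = 'true';

The concrete CTAS statement for the correlated subquery appears later in this
section.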
### Continue to use the dfs.clicks workspace

    0: jdbc:drill:> use dfs.clicks;
    +------------+------------+
    |     ok     |  summary   |
    +------------+------------+
    | true       | Default schema changed to 'dfs.clicks' |
    +------------+------------+

### Return product searches for high-value customers:

    0: jdbc:drill:> select o.cust_id, o.order_total, t.trans_info.prod_id[0] as prod_id
    from hive.orders as o, `clicks/clicks.json` t
    where o.cust_id=t.user_info.cust_id
    and o.order_total > (select avg(inord.order_total)
    from hive.orders inord where inord.state = o.state);
    +------------+-------------+------------+
    | cust_id    | order_total | prod_id    |
    +------------+-------------+------------+
    ...
    | 9650       | 69          | 16         |
    | 9650       | 69          | 560        |
    | 9650       | 69          | 959        |
    | 9654       | 76          | 768        |
    | 9656       | 76          | 32         |
    | 9656       | 76          | 16         |
    ...
    +------------+-------------+------------+
    106,281 rows selected

This query returns a list of products searched for by customers whose order
totals are above the average for their states.

### Materialize the result of the previous query:

    0: jdbc:drill:> create table product_search as select o.cust_id, o.order_total, t.trans_info.prod_id[0] as prod_id
    from hive.orders as o, `clicks/clicks.json` t
    where o.cust_id=t.user_info.cust_id and o.order_total > (select avg(inord.order_total)
    from hive.orders inord where inord.state = o.state);
    +------------+---------------------------+
    | Fragment   | Number of records written |
    +------------+---------------------------+
    | 0_0        | 106281                    |
    +------------+---------------------------+
    1 row selected

This example uses a CTAS statement to create a table based on the correlated
subquery that you ran previously. This table contains all of the rows that the
query returns (106,281) and stores them in the format specified by the storage
plugin (Parquet format in this example). You can create tables that store data
in CSV, Parquet, and JSON formats.

### Query the new table to verify the row count:

This example simply checks that the CTAS statement worked by verifying the
number of rows in the table.

    0: jdbc:drill:> select count(*) from product_search;
    +------------+
    | EXPR$0     |
    +------------+
    | 106281     |
    +------------+
    1 row selected

### Find the storage file for the table:

    [root@maprdemo product_search]# cd /mapr/demo.mapr.com/data/nested/product_search
    [root@maprdemo product_search]# ls -la
    total 451
    drwxr-xr-x. 2 mapr mapr 1 Sep 15 13:41 .
    drwxr-xr-x. 4 root root 2 Sep 15 13:41 ..
    -rwxr-xr-x. 1 mapr mapr 460715 Sep 15 13:41 0_0_0.parquet

Note that the table is stored in a file called `0_0_0.parquet`. This file is
stored in the location defined by the dfs.clicks workspace:

    "location": "/mapr/demo.mapr.com/data/nested"

with a subdirectory that has the same name as the table you created.

## What's Next

Complete the tutorial with the [Summary](/confluence/display/DRILL/Summary).
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/006-summary.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/006-summary.md b/_docs/drill-docs/tutorial/006-summary.md
deleted file mode 100644
index f210766..0000000
--- a/_docs/drill-docs/tutorial/006-summary.md
+++ /dev/null
@@ -1,14 +0,0 @@
---
title: "Summary"
parent: "Apache Drill Tutorial"
---
This tutorial introduced Apache Drill and its ability to run ANSI SQL queries
against various data sources, including Hive tables, MapR-DB/HBase tables, and
file system directories. The tutorial also showed how to work with and
manipulate complex and multi-structured data commonly found in Hadoop/NoSQL
systems.

Now that you are familiar with different ways to access the sample data with
Drill, you can try writing your own queries against your own data sources.
Refer to the [Apache Drill documentation](https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki) for more information.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/install-sandbox/001-install-mapr-vm.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/install-sandbox/001-install-mapr-vm.md b/_docs/drill-docs/tutorial/install-sandbox/001-install-mapr-vm.md
deleted file mode 100644
index f3d8953..0000000
--- a/_docs/drill-docs/tutorial/install-sandbox/001-install-mapr-vm.md
+++ /dev/null
@@ -1,55 +0,0 @@
---
title: "Installing the MapR Sandbox with Apache Drill on VMware Player/VMware Fusion"
parent: "Installing the Apache Drill Sandbox"
---
Complete the following steps to install the MapR Sandbox with Apache Drill on
VMware Player or VMware Fusion:

 1. Download the MapR Sandbox with Drill file to a directory on your machine:
<https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill>

 2. Open the virtual machine player, and select the **Open a Virtual Machine** option.

    Tip for VMware Fusion: If you are running VMware Fusion, select **Import**.

![](../../../img/vmWelcome.png)

 3. Navigate to the directory where you downloaded the MapR Sandbox with Apache Drill file, and select `MapR-Sandbox-For-Apache-Drill-4.0.1_VM.ova`.

![](../../../img/vmShare.png)

The Import Virtual Machine dialog appears.

 4. Click **Import**. The virtual machine player imports the sandbox.

![](../../../img/vmLibrary.png)

 5. Select `MapR-Sandbox-For-Apache-Drill-4.0.1_VM`, and click **Play virtual machine**. It takes a few minutes for the MapR services to start.
After the MapR services start and installation completes, the following screen
appears:

![](../../../img/loginSandbox.png)

Note the URL provided in the screen, which corresponds to the Web UI in Apache
Drill.

 6. Verify that a DNS entry was created on the host machine for the virtual machine. If not, create the entry.

    * For Linux and Mac, create the entry in `/etc/hosts`.

    * For Windows, create the entry in the `%WINDIR%\system32\drivers\etc\hosts` file.
Example: `127.0.1.1 <vm_hostname>`

 7. You can navigate to the URL provided to experience the Drill Web UI, or you can log in to the sandbox through the command line.

    a. To navigate to the MapR Sandbox with Apache Drill, enter the provided URL in your browser's address bar.

    b.
To log in to the virtual machine and access the command line, press Alt+F2 on Windows or Option+F5 on Mac. When prompted, enter `mapr` as the login and password.

# What's Next

After downloading and installing the sandbox, continue with the tutorial by
[Getting to Know the Drill Setup](/confluence/display/DRILL/Getting+to+Know+the+Drill+Setup).
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/drill-docs/tutorial/install-sandbox/002-install-mapr-vb.md
----------------------------------------------------------------------
diff --git a/_docs/drill-docs/tutorial/install-sandbox/002-install-mapr-vb.md b/_docs/drill-docs/tutorial/install-sandbox/002-install-mapr-vb.md
deleted file mode 100644
index 9ff26d5..0000000
--- a/_docs/drill-docs/tutorial/install-sandbox/002-install-mapr-vb.md
+++ /dev/null
@@ -1,72 +0,0 @@
---
title: "Installing the MapR Sandbox with Apache Drill on VirtualBox"
parent: "Installing the Apache Drill Sandbox"
---
The MapR Sandbox for Apache Drill on VirtualBox comes with NAT port forwarding
enabled, which allows you to access the sandbox using localhost as the hostname.

Complete the following steps to install the MapR Sandbox with Apache Drill on
VirtualBox:

 1. Download the MapR Sandbox with Apache Drill file to a directory on your machine:
<https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill>

 2. Open the virtual machine player.

 3. Select **File > Import Appliance**. The Import Virtual Appliance dialog appears.

    ![](../../../img/vbImport.png)

 4. Navigate to the directory where you downloaded the MapR Sandbox with Apache Drill and click **Next**. The Appliance Settings window appears.

    ![](../../../img/vbapplSettings.png)

 5. Select the check box at the bottom of the screen: **Reinitialize the MAC address of all network cards**, then click **Import**. VirtualBox imports the sandbox.

 6. When the import completes, select **File > Preferences**. The VirtualBox Settings dialog appears.

    ![](../../../img/vbNetwork.png)

 7. Select **Network**.

    The correct setting depends on your network connectivity when you run the
Sandbox. In general, if you are going to use a wired Ethernet connection,
select **NAT Networks** and **vboxnet0**. If you are going to use a wireless
network, select **Host-only Networks** and the **VirtualBox Host-Only Ethernet
Adapter**. If no adapters appear, click the green **+** button to add the
VirtualBox adapter.

    ![](../../../img/vbMaprSetting.png)

 8. Click **OK** to continue.

 9. Click ![](https://lh5.googleusercontent.com/6TjVEW28MJhPud2Nc2ButYB_GDqKTnadaluSulg0Zb259MgN1IRCgIlo-kMAEJ7lGWHf2aqc-nIjUsUFlaXP-LceAIKE5owNqXUWxXS0WXcBLWzUqg5X1VIXXswajb6oWA). The MapR-Sandbox-For-Apache-Drill-0.6.0-r2-4.0.1 - Settings dialog appears.

    ![](../../../img/vbGenSettings.png)

 10. Click **OK** to continue.

 11. Click **Start**. It takes a few minutes for the MapR services to start. After the MapR services start and installation completes, the following screen appears:

    ![](../../../img/vbloginSandbox.png)

 12. The client must be able to resolve the actual hostname of the Drill node(s) with the IP(s). Verify that a DNS entry was created on the client machine for the Drill node(s).
If a DNS entry does not exist, create the entry for the Drill node(s).

    * For Windows, create the entry in the `%WINDIR%\system32\drivers\etc\hosts` file.

    * For Linux and Mac, create the entry in `/etc/hosts`.
    <drill-machine-IP> <drill-machine-hostname>

Example: `127.0.1.1 maprdemo`

 13. You can navigate to the URL provided or to [localhost:8047](http://localhost:8047) to experience the Drill Web UI, or you can log in to the sandbox through the command line.

    a. To navigate to the MapR Sandbox with Apache Drill, enter the provided URL in your browser's address bar.

    b. To log in to the virtual machine and access the command line, press Alt+F2 on Windows or Option+F5 on Mac. When prompted, enter `mapr` as the login and password.

# What's Next

After downloading and installing the sandbox, continue with the tutorial by
[Getting to Know the Drill Setup](/confluence/display/DRILL/Getting+to+Know+the+Drill+Setup).
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/58.png
----------------------------------------------------------------------
diff --git a/_docs/img/58.png b/_docs/img/58.png
new file mode 100644
index 0000000..b957927
Binary files /dev/null and b/_docs/img/58.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/BI_to_Drill_2.png
----------------------------------------------------------------------
diff --git a/_docs/img/BI_to_Drill_2.png b/_docs/img/BI_to_Drill_2.png
new file mode 100644
index 0000000..a7f32cd
Binary files /dev/null and b/_docs/img/BI_to_Drill_2.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/HbaseViewCreation0.png
----------------------------------------------------------------------
diff --git a/_docs/img/HbaseViewCreation0.png b/_docs/img/HbaseViewCreation0.png
new file mode 100644
index 0000000..0ae4465
Binary files /dev/null and b/_docs/img/HbaseViewCreation0.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/HbaseViewDSN.png
----------------------------------------------------------------------
diff --git a/_docs/img/HbaseViewDSN.png b/_docs/img/HbaseViewDSN.png
new file mode 100644
index 0000000..988e48b
Binary files /dev/null and b/_docs/img/HbaseViewDSN.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/Hbase_Browse.png
----------------------------------------------------------------------
diff --git a/_docs/img/Hbase_Browse.png b/_docs/img/Hbase_Browse.png
new file mode 100644
index 0000000..729e0f8
Binary files /dev/null and b/_docs/img/Hbase_Browse.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/Hive_DSN.png
----------------------------------------------------------------------
diff --git a/_docs/img/Hive_DSN.png b/_docs/img/Hive_DSN.png
new file mode 100644
index 0000000..be49d00
Binary files /dev/null and b/_docs/img/Hive_DSN.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_CustomSQL.png
----------------------------------------------------------------------
diff --git a/_docs/img/ODBC_CustomSQL.png b/_docs/img/ODBC_CustomSQL.png
new file mode 100644
index 0000000..d2c7fb2
Binary files /dev/null and b/_docs/img/ODBC_CustomSQL.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_HbasePreview2.png
----------------------------------------------------------------------
diff --git a/_docs/img/ODBC_HbasePreview2.png b/_docs/img/ODBC_HbasePreview2.png
new file mode 100644
index 0000000..948f268
Binary files /dev/null and b/_docs/img/ODBC_HbasePreview2.png differ
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_HbaseView.png
---------------------------------------------------------------------- diff --git a/_docs/img/ODBC_HbaseView.png b/_docs/img/ODBC_HbaseView.png new file mode 100644 index 0000000..be3bf4f Binary files /dev/null and b/_docs/img/ODBC_HbaseView.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_HiveConnection.png ---------------------------------------------------------------------- diff --git a/_docs/img/ODBC_HiveConnection.png b/_docs/img/ODBC_HiveConnection.png new file mode 100644 index 0000000..a86d960 Binary files /dev/null and b/_docs/img/ODBC_HiveConnection.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_to_Drillbit.png ---------------------------------------------------------------------- diff --git a/_docs/img/ODBC_to_Drillbit.png b/_docs/img/ODBC_to_Drillbit.png new file mode 100644 index 0000000..7197d09 Binary files /dev/null and b/_docs/img/ODBC_to_Drillbit.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ODBC_to_Quorum.png ---------------------------------------------------------------------- diff --git a/_docs/img/ODBC_to_Quorum.png b/_docs/img/ODBC_to_Quorum.png new file mode 100644 index 0000000..bd77a28 Binary files /dev/null and b/_docs/img/ODBC_to_Quorum.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/Parquet_DSN.png ---------------------------------------------------------------------- diff --git a/_docs/img/Parquet_DSN.png b/_docs/img/Parquet_DSN.png new file mode 100644 index 0000000..a76eb4e Binary files /dev/null and b/_docs/img/Parquet_DSN.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/Parquet_Preview.png ---------------------------------------------------------------------- diff --git a/_docs/img/Parquet_Preview.png b/_docs/img/Parquet_Preview.png new file mode 100644 index 0000000..121dff5 Binary files /dev/null and b/_docs/img/Parquet_Preview.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/RegionParquet_table.png ---------------------------------------------------------------------- diff --git a/_docs/img/RegionParquet_table.png b/_docs/img/RegionParquet_table.png new file mode 100644 index 0000000..db914bb Binary files /dev/null and b/_docs/img/RegionParquet_table.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/SelectHbaseView.png ---------------------------------------------------------------------- diff --git a/_docs/img/SelectHbaseView.png b/_docs/img/SelectHbaseView.png new file mode 100644 index 0000000..a37b30e Binary files /dev/null and b/_docs/img/SelectHbaseView.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/Untitled.png ---------------------------------------------------------------------- diff --git a/_docs/img/Untitled.png b/_docs/img/Untitled.png new file mode 100644 index 0000000..7fea1e8 Binary files /dev/null and b/_docs/img/Untitled.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/VoterContributions_hbaseview.png ---------------------------------------------------------------------- diff --git a/_docs/img/VoterContributions_hbaseview.png b/_docs/img/VoterContributions_hbaseview.png new file mode 100644 index 0000000..2c37df9 Binary files /dev/null and b/_docs/img/VoterContributions_hbaseview.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/img/ngram_plugin.png ---------------------------------------------------------------------- 
diff --git a/_docs/img/ngram_plugin.png b/_docs/img/ngram_plugin.png new file mode 100644 index 0000000..c47148c Binary files /dev/null and b/_docs/img/ngram_plugin.png differ