Github user sraghunandan commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2568#discussion_r207436115

--- Diff: integration/presto/presto-integration-technical-note.md ---
@@ -0,0 +1,253 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements. See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+# Presto Integration Technical Note
+Integrating Presto with CarbonData involves the following steps:
+
+* Setting up a Presto cluster
+
+* Configuring the cluster to use CarbonData as a catalog, alongside the other catalogs provided by Presto
+
+This technical note first walks through these two steps and then describes how to tune Presto for better performance.
+
+## **Step 1: Presto Cluster Setup**
+
+
+* ### Installing Presto
+
+  1. Download Presto version 0.187:
+  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
+
+  2. Extract the Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
+
+  3. Download the Presto CLI for the coordinator and rename it to `presto`.
+
+  ```
+  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
+
+  mv presto-cli-0.187-executable.jar presto
+
+  chmod +x presto
+  ```
+
+### Create Configuration Files
+
+  1. Create an `etc` folder in the presto-server-0.187 directory.
+  2. Create the `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
+  3. Install `uuid` to generate a `node.id`.
+
+  ```
+  sudo apt-get install uuid
+
+  uuid
+  ```
+
+
+##### Contents of your node.properties file
+
+  ```
+  node.environment=production
+  node.id=<generated uuid>
+  node.data-dir=/home/ubuntu/data
+  ```
+
+##### Contents of your jvm.config file
+
+  ```
+  -server
+  -Xmx16G
+  -XX:+UseG1GC
+  -XX:G1HeapRegionSize=32M
+  -XX:+UseGCOverheadLimit
+  -XX:+ExplicitGCInvokesConcurrent
+  -XX:+HeapDumpOnOutOfMemoryError
+  -XX:OnOutOfMemoryError=kill -9 %p
+  ```
+
+##### Contents of your log.properties file
+  ```
+  com.facebook.presto=INFO
+  ```
+
+  The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
+
+### Coordinator Configurations
+
+##### Contents of your config.properties
+  ```
+  coordinator=true
+  node-scheduler.include-coordinator=false
+  http-server.http.port=8086
+  query.max-memory=50GB
+  query.max-memory-per-node=2GB
+  discovery-server.enabled=true
+  discovery.uri=<coordinator_ip>:8086
+  ```
+The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that this node is the coordinator and tell it not to do any of the computation work itself but to delegate that work to the workers.
+
+**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM max heap size configured in `jvm.config`, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
+
+The two properties should also be related as follows:
+if `query.max-memory-per-node=30GB`,
+then `query.max-memory=<30GB * number of nodes>`.
+
+### Worker Configurations
+
+##### Contents of your config.properties
+
+  ```
+  coordinator=false
+  http-server.http.port=8086
+  query.max-memory=50GB
+  query.max-memory-per-node=2GB
+  discovery.uri=<coordinator_ip>:8086
+  ```
+
+**Note**: The `jvm.config` and `node.properties` files are the same for all nodes (workers and coordinator), except that every node must have a different `node.id` (generated by the `uuid` command).
+
+### **With this, the Presto cluster setup is ready. To integrate with CarbonData, the following further steps are required:**
+
+### Catalog Configurations
+
+1. Create a folder named `catalog` in the `etc` directory of Presto on all nodes of the cluster, including the coordinator.
+
+##### Configuring CarbonData in Presto
+1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes, as sketched below.
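+
+  For example, a minimal catalog file might look like the following sketch. It assumes the connector is registered under the name `carbondata` and that the store location is given by a property such as `carbondata-store`; the exact property names and the store path depend on your CarbonData release, so check the CarbonData Presto guide for your version before reusing them.
+
+  ```
+  connector.name=carbondata
+  carbondata-store=hdfs://<namenode_host>:<port>/<path_to_carbon_store>
+  ```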
+
+### Add Plugins
+
+1. Create a directory named `carbondata` in the `plugin` directory of Presto.
+2. Copy the `carbondata` jars to the `plugin/carbondata` directory on all nodes.
+
+### Start Presto Server on all nodes
+
+To run it as a background process:
+
+```
+./presto-server-0.187/bin/launcher start
+```
+
+To run it in the foreground:
+
+```
+./presto-server-0.187/bin/launcher run
+```
+
+### Start Presto CLI
+```
+./presto
+```
+To connect to the carbondata catalog, use the following command:
+
+```
+./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
+```
+Execute the following command to ensure that the workers are connected:
+
+```
+select * from system.runtime.nodes;
+```
+Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
+
+**Note:** Table creation and data loading should be done before executing queries, as carbon tables cannot be created from this interface.
+
+## **Presto Performance Tuning**
+
+**Performance optimizations according to data types and schema:**
+
+- When the data can be stored as either Int or String (for example, keys of a table), using Int gives better performance.
+
+- Use Double instead of Decimal if the required precision is low.
+
+- Columns having low cardinality should be created as dictionary columns; this improves query performance to a great extent. A sketch of such a table definition follows this list.
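+
+**For example**, a low-cardinality column such as `country` could be declared as a dictionary column when the table is created through Spark SQL (recall that carbon tables cannot be created from the Presto CLI). The table name, columns, and the `DICTIONARY_INCLUDE` property below are only a sketch; whether this property is available, and its exact name, depends on your CarbonData version:
+
+```
+CREATE TABLE sales (
+  id INT,
+  amount DOUBLE,
+  country STRING
+)
+STORED BY 'carbondata'
+TBLPROPERTIES ('DICTIONARY_INCLUDE'='country')
+```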
+
+**Performance optimization by changing queries:**
+
+- GROUP BY can become a little faster if the list of fields within the GROUP BY clause is carefully ordered from highest to lowest cardinality.
+
+- Aggregate a series of LIKE clauses into one single regexp_like clause.
+
+**For example:**
+
+```
+SELECT
+  ...
+FROM
+  access
+WHERE
+  method LIKE '%GET%' OR
+  method LIKE '%POST%' OR
+  method LIKE '%PUT%' OR
+  method LIKE '%DELETE%'
+```
+
+can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
+
+```
+SELECT
+  ...
+FROM
+  access
+WHERE
+  regexp_like(method, 'GET|POST|PUT|DELETE')
+```
+
+- Specify large tables first in the join clause.
+
+The default join algorithm of Presto is the broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when the right-hand side table is small enough to fit within one node (usually less than 2GB). If you observe an "Exceeded max memory xxGB" error, it usually means the right-hand side table is too large. Presto does not perform automatic join reordering, so make sure your large table precedes the small tables in any join clause.
+
+**Note:** If you still see the memory issue, try the distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys, so the distributed join works even if the right-hand side table is large, but performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
+
+```
+PrestoCli> -- set session distributed_join = 'true'
+SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
+```
+
+**Performance optimizations by using certain configuration properties:**

--- End diff --

Put lower values as defaults if values need to be specified. Better not to give values, as they will be directly copy-pasted.
---