[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

sraghunandan Thu, 02 Aug 2018 21:11:40 -0700

Github user sraghunandan commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/2568#discussion_r207436115

--- Diff: integration/presto/presto-integration-technical-note.md ---
@@ -0,0 +1,253 @@
+
+
+# Presto Integration Technical Note
+Presto Integration with Carbon data include the below steps:
+
+* Setting up Presto Cluster
+
+* Setting up cluster to use carbondata as a catalog along with other
catalogs provided by presto.
+
+In this technical note we will first learn about the above two points and
after that we will see how we can do performance tuning with Presto.
+
+## **Let us begin with the first step of Presto Cluster Setup:**
+
+
+* ### Installing Presto
+
+ 1. Download the 0.187 version of Presto using:
+ `wget
https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
+
+ 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
+
+ 3. Download the Presto CLI for the coordinator and name it presto.
+
+ ```
+ wget
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
+
+ mv presto-cli-0.187-executable.jar presto
+
+ chmod +x presto
+ ```
+
+### Create Configuration Files
+
+ 1. Create `etc` folder in presto-server-0.187 directory.
+ 2. Create `config.properties`, `jvm.config`, `log.properties`, and
`node.properties` files.
+ 3. Install uuid to generate a node.id.
+
+ ```
+ sudo apt-get install uuid
+
+ uuid
+ ```
+
+
+##### Contents of your node.properties file
+
+ ```
+ node.environment=production
+ node.id=<generated uuid>
+ node.data-dir=/home/ubuntu/data
+ ```
+
+##### Contents of your jvm.config file
+
+ ```
+ -server
+ -Xmx16G
+ -XX:+UseG1GC
+ -XX:G1HeapRegionSize=32M
+ -XX:+UseGCOverheadLimit
+ -XX:+ExplicitGCInvokesConcurrent
+ -XX:+HeapDumpOnOutOfMemoryError
+ -XX:OnOutOfMemoryError=kill -9 %p
+ ```
+
+##### Contents of your log.properties file
+ ```
+ com.facebook.presto=INFO
+ ```
+
+ The default minimum level is `INFO`. There are four levels: `DEBUG`,
`INFO`, `WARN` and `ERROR`.
+
+### Coordinator Configurations
+
+##### Contents of your config.properties
+ ```
+ coordinator=true
+ node-scheduler.include-coordinator=false
+ http-server.http.port=8086
+ query.max-memory=50GB
+ query.max-memory-per-node=2GB
+ discovery-server.enabled=true
+ discovery.uri=<coordinator_ip>:8086
+ ```
+The options `node-scheduler.include-coordinator=false` and
`coordinator=true` indicate that the node is the coordinator and tells the
coordinator not to do any of the computation work itself and to use the workers.
+
+**Note**: We recommend setting `query.max-memory-per-node` to half of the
JVM config max memory, though if your workload is highly concurrent, you may
want to use a lower value for `query.max-memory-per-node`.
+
+Also relation between below two configuration-properties should be like:
+If, `query.max-memory-per-node=30GB`
+Then, `query.max-memory=<30GB * number of nodes>`.
+
+### Worker Configurations
+
+##### Contents of your config.properties
+
+ ```
+ coordinator=false
+ http-server.http.port=8086
+ query.max-memory=50GB
+ query.max-memory-per-node=2GB
+ discovery.uri=<coordinator_ip>:8086
+ ```
+
+**Note**: `jvm.config` and `node.properties` files are same for all the
nodes (worker + coordinator). All the nodes should have different
`node.id`.(generated by uuid command).
+
+### **With this we are ready with the Presto Cluster setup but to
integrate with carbon data further steps are required which are as follows:**
+
+### Catalog Configurations
+
+1. Create a folder named `catalog` in etc directory of presto on all the
nodes of the cluster including the coordinator.
+
+##### Configuring Carbondata in Presto
+1. Create a file named `carbondata.properties` in the `catalog` folder and
set the required properties on all the nodes.
+
+### Add Plugins
+
+1. Create a directory named `carbondata` in plugin directory of presto.
+2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
+
+### Start Presto Server on all nodes
+
+```
+./presto-server-0.187/bin/launcher start
+```
+To run it as a background process.
+
+```
+./presto-server-0.187/bin/launcher run
+```
+To run it in foreground.
+
+### Start Presto CLI
+```
+./presto
+```
+To connect to carbondata catalog use the following command:
+
+```
+./presto --server <coordinator_ip>:8086 --catalog carbondata --schema
<schema_name>
+```
+Execute the following command to ensure the workers are connected.
+
+```
+select * from system.runtime.nodes;
+```
+Now you can use the Presto CLI on the coordinator to query data sources in
the catalog using the Presto workers.
+
+**Note :** Create Tables and data loads should be done before executing
queries as we can not create carbon table from this interface.
+
+## **Presto Performance Tuning**
+
+**Performance Optimizations according to data types and schema:**
+
+- When the data could be stored in Int as well as String. Example: keys
for a table then using Int gives a better performance
+
+- Use Double instead of Decimal if required precision is low.
+
+- Columns having low-cardinality should be created as dictionary columns.
This will improve query performance to a great extent.
+
+**Performance Optimization by changing Queries:**
+
+- Thereâs a probability where GROUP BY becomes a little bit faster, by
carefully ordering a list of fields within GROUP BY in an order of high
cardinality.
+
+- Aggregating a series of LIKE clauses in one single regexp_like clause.
+
+**For example :**
+
+```
+SELECT
+ Â ...
+FROM
+ Â access
+WHERE
+ Â method LIKE '%GET%' OR
+ Â method LIKE '%POST%' OR
+ Â method LIKE '%PUT%' OR
+ Â method LIKE '%DELETE%'
+ ```
+
+ can be optimized by replacing the 4 LIKE clauses with a single
regexp_like clause:
+
+ ```
+ SELECT
+ Â ...
+FROM
+ Â access
+WHERE
+ Â regexp_like(method, 'GET|POST|PUT|DELETE')
+ ```
+
+- Specifying large tables first in join clause
+
+
+The default join algorithm of Presto is broadcast join, which partitions
the left-hand side table of a join and sends (broadcasts) a copy of the entire
right-hand side table to all of the worker nodes that have the partitions. This
works when your right table is small enough to fit within one node (usually
less than 2GB). If you observe âExceeded max memory xxGBâ error, this
usually means the right-hand side table is too large. Presto does not perform
automatic join-reordering, so please make sure your large table preceeds small
tables in any join clause.
+
+**Note :** If you still see the memory issue, try distributed hash join.
This algorithm partitions both the left and right tables using the hash values
of the join keys. So the distributed join would work even if the right-hand
side table is large, but the performance can be slower because it increases the
number of network data transfers. To turn on the distributed join, embed the
following session property as an SQL comment:
+
+```
+PrestoCli> -- set session distributed_join = 'true'
+SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
+```
+
+**Performance optimizations by using certain Configuration properties:**
--- End diff --

put lower values as default if needs to be specified. better not to give
values as it will be directly copy pasted

---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Reply via email to