jihoonson commented on a change in pull request #7117: Improve doc for auto
compaction
URL: https://github.com/apache/incubator-druid/pull/7117#discussion_r261449210
##########
File path: docs/content/operations/segment-optimization.md
##########
@@ -32,15 +32,57 @@ In Druid, it's important to optimize the segment size because
 which hold the input segments of the query. Each node has a processing thread pool and uses one thread per segment to
 process it. If the segment size is too large, data might not be well distributed over the
 whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small,
-each processing thread processes too small data. This might reduce the processing speed of other queries as well as
-the input query itself because the processing threads are shared for executing all queries.
+each processing thread might process too little data. This can reduce the overall processing speed because
+parallel processing involves some overhead like thread scheduling.
It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
-especially for the streaming ingestion because the amount of data ingested might vary over time. In this case,
-you can roughly set the segment size at ingestion time and optimize it later. You have two options:
+especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
+you can create segments with a sub-optimal size first and optimize them later.
+
+There might be several ways to check if compaction is necessary. One way
+is using the [System Schema](../querying/sql.html#system-schema). The
+system schema provides several tables about the current system status, including the `segments` table.
+By running the query below, you can get the average number of rows and average size for published segments.
+
+```sql
+SELECT
+ "start",
+ "end",
+ version,
+ COUNT(*) AS num_segments,
+ AVG("num_rows") AS avg_num_rows,
+ SUM("num_rows") AS total_num_rows,
+ AVG("size") AS avg_size,
+ SUM("size") AS total_size
+FROM
+  sys.segments
+WHERE
+ datasource = 'your_dataSource' AND
+ is_published = 1
+GROUP BY 1, 2, 3
+ORDER BY 1, 2, 3 DESC;
+```
+
+Please note that the query result might include overshadowed segments.
+In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`).
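+
+If many versions appear, one option is to keep only the rows whose version is the max for each interval.
+The following is a standard-SQL sketch, not tested against Druid (join support and `MAX` over string
+columns may not be available in older Druid SQL versions); version strings are timestamps, so their
+lexicographic max is the latest version:
+
+```sql
+-- Illustrative only: restrict sys.segments rows to the latest version per interval.
+SELECT s."start", s."end", s.version, COUNT(*) AS num_segments
+FROM sys.segments s
+JOIN (
+  SELECT datasource, "start", "end", MAX(version) AS max_version
+  FROM sys.segments
+  WHERE is_published = 1
+  GROUP BY 1, 2, 3
+) latest
+  ON s.datasource = latest.datasource
+  AND s."start" = latest."start"
+  AND s."end" = latest."end"
+  AND s.version = latest.max_version
+WHERE s.datasource = 'your_dataSource' AND s.is_published = 1
+GROUP BY 1, 2, 3;
+```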
+
+The recommended number of rows per segment and segment size are 5 million rows and 300 ~ 700MB, respectively.
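+
+Building on the query above, here is a hedged sketch (assuming the same `sys.segments` columns; the
+5-million-row threshold is the target mentioned here and should be tuned for your cluster) that flags
+intervals falling short of the recommended rows per segment:
+
+```sql
+-- Illustrative only: list intervals whose average published-segment row count is below target.
+SELECT
+  "start",
+  "end",
+  AVG("num_rows") AS avg_num_rows,
+  AVG("size") AS avg_size
+FROM sys.segments
+WHERE
+  datasource = 'your_dataSource' AND
+  is_published = 1
+GROUP BY 1, 2
+HAVING AVG("num_rows") < 5000000
+ORDER BY 1, 2;
+```
+
+Intervals returned by such a query are candidates for compaction (overshadowed segments may still skew
+the averages, as noted above).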
Review comment:
Sounds good. Moved this part and emphasized the importance of # of rows.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services