jihoonson commented on a change in pull request #7117: Improve doc for auto compaction
URL: https://github.com/apache/incubator-druid/pull/7117#discussion_r261449210
 
 

 ##########
 File path: docs/content/operations/segment-optimization.md
 ##########
 @@ -32,15 +32,57 @@ In Druid, it's important to optimize the segment size because
   which hold the input segments of the query. Each node has a processing thread pool and uses one thread per segment to
   process it. If the segment size is too large, data might not be well distributed over the
   whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small,
-  each processing thread processes too small data. This might reduce the processing speed of other queries as well as
-  the input query itself because the processing threads are shared for executing all queries.
+  each processing thread might process too little data. This can reduce the overall processing speed because
+  parallel processing involves some overhead, such as thread scheduling.
 
 It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
-especially for the streaming ingestion because the amount of data ingested might vary over time. In this case,
-you can roughly set the segment size at ingestion time and optimize it later. You have two options:
+especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
+you can create segments with a sub-optimized size first and optimize them later.
+
+There might be several ways to check whether compaction is necessary. One way
+is to use the [System Schema](../querying/sql.html#system-schema). The
+system schema provides several tables about the current system status, including the `segments` table.
+By running the query below, you can get the average number of rows and the average size of published segments.
+
+```sql
+SELECT
+  "start",
+  "end",
+  version,
+  COUNT(*) AS num_segments,
+  AVG("num_rows") AS avg_num_rows,
+  SUM("num_rows") AS total_num_rows,
+  AVG("size") AS avg_size,
+  SUM("size") AS total_size
+FROM
+  sys.segments A
+WHERE
+  datasource = 'your_dataSource' AND
+  is_published = 1
+GROUP BY 1, 2, 3
+ORDER BY 1, 2, 3 DESC;
+```
+
+Please note that the query result might include overshadowed segments.
+In this case, you may want to see only the rows of the max version per interval (pair of `start` and `end`).
+
+The recommended number of rows per segment and segment size are 5 million rows and 300MB~700MB, respectively.
 
 Review comment:
   Sounds good. Moved this part and emphasized the importance of # of rows.
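
    For reference, the "overshadowed segments" caveat in the diff above can also be handled in SQL. The sketch below is not part of the patch; it reuses the `sys.segments` columns from the query in the diff and assumes your Druid version's SQL supports joining a system table with a grouped subquery. It keeps only the segments whose `version` is the max for their interval:

    ```sql
    -- Sketch (not in the patch): restrict the stats to max-version segments per interval.
    SELECT
      s."start",
      s."end",
      s.version,
      COUNT(*) AS num_segments,
      AVG(s."num_rows") AS avg_num_rows,
      AVG(s."size") AS avg_size
    FROM sys.segments s
    JOIN (
      -- Find the max version for each (start, end) interval.
      SELECT "start", "end", MAX(version) AS max_version
      FROM sys.segments
      WHERE datasource = 'your_dataSource' AND is_published = 1
      GROUP BY 1, 2
    ) m
      ON s."start" = m."start" AND s."end" = m."end" AND s.version = m.max_version
    WHERE s.datasource = 'your_dataSource' AND s.is_published = 1
    GROUP BY 1, 2, 3
    ORDER BY 1, 2;
    ```

    If your Druid version exposes an `is_overshadowed` column in `sys.segments`, a plain `is_overshadowed = 0` filter on the original query is a simpler alternative to this self-join.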

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
