[ https://issues.apache.org/jira/browse/JAMES-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220451#comment-17220451 ]
René Cordier commented on JAMES-3430: ------------------------------------- https://github.com/linagora/james-project/pull/3930 is restructuring cassandra mailbox message properties storage and introducing messagev3 > Reduce Cassandra overall load by restructuring messageV2 table > -------------------------------------------------------------- > > Key: JAMES-3430 > URL: https://issues.apache.org/jira/browse/JAMES-3430 > Project: James Server > Issue Type: Improvement > Components: cassandra, mailbox > Affects Versions: master, 3.6.0 > Reporter: Benoit Tellier > Priority: Major > Fix For: master > > Attachments: Capture d’écran de 2020-10-20 12-11-19.png > > > *Analysis* > Today messagev2 metadata table is the largest table of our schema (11GB on 6 > million messages instalation). > Reads to it are abnormally long, and this is the query we spend the more time > executing while doing IMAP: > (see attached screenshot) > That is by far our most expensive query (per row), and that while LWT do not > even come into play! > Looking at table stats: > {code:java} > Number of partitions (estimate): 6100310 > SSTable Compression Ratio: 0.43905338269472705 > Space used (total): 2037635717 > {code} > Doing a nodetool status, we realize that total occupied space in Cassandra is > of `4.97 GiB`, so messagesv2 table occuppy 40% of total storage space which > is a lot compared to other messages metadata tables (~180MB so 10x less). > Doing a tablehistograms analysis: > {code:java} > Percentile SSTables Write Latency Read Latency Partition Size > Cell Count > (micros) (micros) (bytes) > > 50% 1.00 0.00 315.85 770 > 12 > 75% 1.00 0.00 545.79 770 > 12 > 95% 1.00 0.00 1131.75 1109 > 17 > 98% 1.00 0.00 1629.72 1331 > 17 > 99% 1.00 0.00 1955.67 1597 > 20 > Min 0.00 0.00 88.15 30 > 0 > Max 1.00 0.00 3379.39 1131752 > 770 > {code} > We realize that cell count is high, and that the byte count is high (other > message metadata table are between 125 -> 250 bytes so 6 times less). > Knowing our data model, each message have a set of properties, each composed > of a namespace, a name and a value. These are stored as a UDT list, thus is > really space inefficient. Compress ratio (see above) do not compensate for > this. > These properties are well defined, set by the StoreMailboxManager and only > include: > - mediaType > - subType > - ContentId > - ContentDescription > - ContentTrasferEncoding > - ContentLocation > - ContentDispositionType > - ContentDispositionParameters > - ContentTypeParameters > - ContentMD5 > - ContentLanguage > - Charset > - MimepartDelimites > *Why?* > Here is the conclusion: > - Slow reads on messagev2 table have a large impact on IMAP performance (and > on Cassandra performances) > - This slowness is due to the corresponding space used on disk (more data = > slower reads) > - This extra space is due to an inefficient storage format of the property > fields > - By restructuring the way we store these properties we can reclaim disk > space and thus query read speed > *How?* > - Remove unused properties: see > https://github.com/linagora/james-project/pull/3925 (charset and mime part > delimiters are unused) > - Restructure CassandraMessageDAO (and underlying table) to store known > properties in a column instead of a UDT list. Unknown properties shall be > rejected. > This avoids the use of a collection on a critical table thus should > significantly fasten related operations. > A data migration (messagev3) will be needed. > *Definition of done* > - Come up with space occupied per message update within tests for the old > and the new > - IMAP performance tests -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org For additional commands, e-mail: server-dev-h...@james.apache.org