Re: Implement ThreadIdGuessingAlgorithm for the distributed module

[email protected] Wed, 21 Jul 2021 03:03:41 -0700

Hello quan,

Nice proposal! I think the Cassandra data model you propose is evolved
enough so that we start implementing it.


Some comments inlined...

On 21/07/2021 15:55, Quan tran hong wrote:
> Hi Benoit,
> I did have another try on this. Please have a look.
> CREATE threadtable
>
> CREATE TABLE ThreadTable (messageId timeuuid, threadId timeuuid, username
> text, mimeMessageId text, baseSubject text, PRIMARY KEY((username,
> mimemessageid), messageid));
>
> => Partition key: (username, mimemessageid), clustering key: messageid.
>
> [...]
>
> SELECT basesubject, threadId FROM threadtable WHERE username = 'quan'  AND
> mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3');
This looks way better to me.

IN usage is PRIMARY KEY is discouraged as it leads to coordination
across partitions.

Read more for instance in
https://stackoverflow.com/questions/55604857/cassandra-query-performance-using-in-clause-for-one-portion-of-the-composite-pa

Either we should move "mimeMessageId" to the clustering key (but all the
messages of a user, including their subject would end up in a single
partition, which could be quite large... for instance 1 million messages
x size of the mime message ids and subject could be too much, as
partitions are recommended to be 100MBs at most).

Or we can just do several selects, one for each mimeMessageIds.I
personaly likely prefer this option.
> [...]
>
> CREATE threadtable_lookup for deletion purpose
>
> Supposed when we delete a message, we would need to delete that message’s
> thread-related data in the threadtable also. I guess we just need messageId
> to delete that message.
>
> We would need to have another table similar to the threadtable but looked
> up by messageId to get the needed params for deletion query.
>
> CREATE TABLE ThreadTable_lookup (messageId timeuuid, username text,
> mimeMessageIds set<text>, PRIMARY KEY(messageid));
The set should be frozen. We will never add or remove data in it, so we
do not need a CRDT (Commutative Replicated Data Type).

Forbidding adding - removing individual elements avoids lots of issues,
and lead to a more compact storage.

> To ease testing, I create a table with messageId’s type is int instead.
>
> We will insert the data as same as the original table.
>
>     insert into ThreadTable_lookup (messageId, username, mimeMessageIds)
> values (1, 'quan', {'MimeMessageID1', 'MimeMessageID2', 'MimeMessageID3'});
>
> [...]
>
>
> SELECT * FROM threadtable_lookup;
>
>
> Now we will do a query selection by messageId to get the needed username,
> mimeMessageIds for original threadtable deletion.
>
> Supposed we want to delete message 4.
>
> SELECT username, mimemessageids FROM threadtable_lookup WHERE messageid = 4;
>
>
> Then we will use these results to do a deletion query on threadtable.
>
Remember to do the delete of ThreadTable before threadtable_lookup,
otherwise if you delete the pointer before the data you might end up in
a case where the actual data is never deleted.

The algorithm that you propose looks good.

When/How do we call it?


Cheers,

Benoit
> The data of messageId 4 deleted.
>
>
>
> Best regards,
>
> Quan
>
>
> Vào Th 3, 20 thg 7, 2021 vào lúc 18:39 [email protected] <
> [email protected]> đã viết:
>
>> Hello Quan,
>>
>> On 20/07/2021 17:24, Quan tran hong wrote:
>>> [...]
>>>
>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND baseSubject
>> =
>>> 'baseSubject1' AND mimeMessageId IN ('MimeMessageID2', 'MimeMessageID3')
>>> LIMIT 1 ALLOW FILTERING;
>> ALLOW FILTERING should not be used as it will result in a full scan and
>> is thus a performance disaster.
>>
>> If you need it, this means you do not have the right table structure and
>> likely should rework the CREATE TABLE statement.
>>> => This new message should have this threadId.
>>> New unrelated message
>>>
>>> Assume that we do a query for a new unrelated message.
>>>
>>> SELECT threadId FROM threadtable WHERE username = 'quan' AND baseSubject
>> =
>>> 'unrelatedBaseSubject' AND mimeMessageId IN ('MimeMessageID2',
>>> 'MimeMessageID3') LIMIT 1 ALLOW FILTERING;
>>>
>>> => This new message should have a new threadId.
>>> Insert new message data
>>>
>>> After having a threadId, we need to insert new message data into the
>> thread
>>> table.
>>>
>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, 'quan',
>>> 'MimeMessageID2', 'baseSubject1');
>>>
>>> insert into ThreadTable (messageId, threadId, username, mimeMessageId,
>>> baseSubject) values (now(), 02294fe1-e941-11eb-a8ee-77de5498f1fa, 'quan',
>>> 'MimeMessageID3', 'baseSubject1');
>>> Conclusion
>>>
>>> I think this data model complies with the needed request for the guessing
>>> algorithm problem, but it looks like still maybe there is room for
>>> improvement.
>> What Cassandra request do we use to delete the data in there?
>>
>>>
>>> Best Regards,
>>>
>>> Quan
>>>
>>>
>>>
>>>
>>>
>>> Vào Th 2, 19 thg 7, 2021 vào lúc 18:23 [email protected] <
>>> [email protected]> đã viết:
>>>
>>>> Hello Quan,
>>>>
>>>> On 19/07/2021 17:59, Quan tran hong wrote:
>>>>> Hi,
>>>>> I am starting to implement ThreadIdGuessingAlgorithm for the
>> distributed
>>>>> module. Because this is a breaking change and I am new to Cassandra
>> also,
>>>>> therefore I want to have some discussion with you about how to do this.
>>>> As long as we introduce a new table there is no reason that it creates
>>>> breaking change, but getting the format right will ease our life down
>>>> the line.
>>>>> For the ones who did not catch up with this work, please have a look at
>>>>> JMAP Threads specs [1] and my work related to this [2].
>>>>>
>>>>> So my ideas on how to do this:
>>>>> - Add a needed inputs Cassandra Table for guessing threadId algorithm.
>>>>> Maybe a table likes:
>>>>>  CREATE TABLE ThreadRelatedTable (
>>>>> threadId       timeuuid,
>>>>> messageId      timeuuid,
>>>>> mimeMessageIds     SET<text>,
>>>>> subject       text,
>>>>> PRIMARY KEY (mimeMessageIds, subject)
>>>>> );
>>>>> - Whenever we guess threadId for a new message, we access this table
>> and
>>>> do
>>>>> the matching query to get related threadId(if there is) or decide new
>>>>> message should have a new threadId.
>>>>> - Whenever we save a new message, we save the thread-related data to
>> this
>>>>> table.
>>>>>
>>>>> This is my first come-up idea. Please express your thoughts about this.
>>>> Collections are an advanced data modeling tool, that should be used with
>>>> caution. I am not sure using it in a PRIMARY KEY is a good idea. I am
>>>> not sure that does what you want (the full primary key should be
>>>> specified to know which node hold the data.
>>>>
>>>> Also, once you found the message related to a thread you want to
>>>> validate that the subject matches. This can be done on application side
>>>> (James), and avoids complicated data model.
>>>>
>>>> I encourage you to validate your data model using a Cassandra in docker
>>>> and executing CQL commands locally with CQLSH tool to simulate the
>>>> queries you whish to do, and learn about your data model before even
>>>> starting to implement it. IMO sharing CQL commands for creating the
>>>> table, inserting data in it, and retrieving data from it would be a
>>>> great follow up to this email.
>>>>
>>>> How would you populate the data of that table?
>>>>
>>>> Best regards,
>>>>
>>>> Benoit
>>>>> Best regards,
>>>>>
>>>>> Quan
>>>>>
>>>>> [1] https://jmap.io/spec-mail.html#threads
>>>>> [2] https://issues.apache.org/jira/browse/JAMES-3516
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Implement ThreadIdGuessingAlgorithm for the distributed module

Reply via email to