Check out the text indexing support in the new SASI feature in Cassandra
3.4. You could write a custom tokenizer to extract entities and then
query for documents that contain those entities.

That said, using a SHA digest as the primary key has merit for direct
access to the document given the document text.
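
As a rough sketch of the digest-as-key idea (not something tested against
a real cluster), the key could be derived as below. SHA-256, the helper
name document_key, and the table and column names are all assumptions
made for illustration; the thread only says "a SHA digest". The text is
kept in a regular column because the background process still needs it.

```python
import hashlib


def document_key(text: str) -> str:
    """Return a fixed-length hex SHA-256 digest of the document text.

    The digest is always 64 hex characters, no matter how large the
    document is, so the primary key stays small.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical table layout (names are illustrative, not from this thread).
# The full text lives in a non-key column so a table scan can still read it.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS entity_cache (
    digest   text PRIMARY KEY,
    body     text,
    entities list<text>
)
"""

if __name__ == "__main__":
    key = document_key("Some document text")
    print(len(key), key)
```

The same text always yields the same key, so a lookup by document text
becomes a single-partition read on the digest.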

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman <ja...@carmanconsulting.com>
wrote:

> S3 maybe?
>
> On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <rwi...@fold3.com> wrote:
>
>> I do realize it's kind of a weird use case, but it is legitimate. I have a
>> collection of documents that I need to index, and I want to perform entity
>> extraction on them and give the extracted entities special treatment in my
>> full-text index. Because entity extraction costs money, and each document
>> will end up being indexed multiple times, I want to cache them in
>> Cassandra. The document text is the obvious key to retrieve entities from
>> the cache. If I use the document ID, then I have to track timestamps. I
>> know that sounds like a simple workaround, but I’m presenting a
>> much-simplified view of my actual data model.
>>
>> The reason for needing the text in the table, and not just a digest, is
>> that sometimes entity extraction has to be deferred due to license
>> limitations. In those cases, the entity extraction occurs on a background
>> process, and the entities will be included in the index the next time the
>> document is indexed.
>>
>> I will use a digest as the key. I suspected that would be the answer, but
>> it's good to get confirmation.
>>
>> Robert
>>
>> On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:
>>
>> > Hi Robert,
>> >
>> > why do you need the actual text as a key? It sounds a bit unnatural, at
>> least to me. Keep in mind that you cannot do "like" queries on keys in
>> Cassandra. For performance and to keep things more readable, I would prefer
>> hashing your text and using the hash as the key.
>> >
>> > You should also consider storing the keys (hashes) in a
>> separate table per day / hour or something like that, so you can quickly
>> get all keys for a time range. A query without the partition key may be
>> very slow.
>> >
>> > Jan
>> >
>> > Am 11.04.2016 um 23:43 schrieb Robert Wille:
>> >> I have a need to be able to use the text of a document as the primary
>> key in a table. These texts are usually less than 1K, but can sometimes be
>> 10’s of K’s in size. Would it be better to use a digest of the text as the
>> key? I have a background process that will occasionally need to do a full
>> table scan and retrieve all of the texts, so using the digest doesn’t
>> eliminate the need to store the text. Anyway, is it better to keep primary
>> keys small, or is C* okay with large primary keys?
>> >>
>> >> Robert
>> >>
>> >
>>
>>
