Check out the text indexing support in the new SASI feature in Cassandra 3.4. You could write a custom tokenizer to extract entities and then query for documents that contain those entities.
That said, using a SHA digest as the primary key has merit for direct access to the document given the document text.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman <ja...@carmanconsulting.com> wrote:

> S3 maybe?
>
> On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <rwi...@fold3.com> wrote:
>
>> I do realize it's kind of a weird use case, but it is legitimate. I have a
>> collection of documents that I need to index, and I want to perform entity
>> extraction on them and give the extracted entities special treatment in my
>> full-text index. Because entity extraction costs money, and each document
>> will end up being indexed multiple times, I want to cache them in
>> Cassandra. The document text is the obvious key to retrieve entities from
>> the cache. If I use the document ID, then I have to track timestamps. I
>> know that sounds like a simple workaround, but I'm presenting a
>> much-simplified view of my actual data model.
>>
>> The reason for needing the text in the table, and not just a digest, is
>> that sometimes entity extraction has to be deferred due to license
>> limitations. In those cases, the entity extraction occurs in a background
>> process, and the entities will be included in the index the next time the
>> document is indexed.
>>
>> I will use a digest as the key. I suspected that would be the answer, but
>> it's good to get confirmation.
>>
>> Robert
>>
>> On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:
>>
>> > Hi Robert,
>> >
>> > why do you need the actual text as a key? It sounds a bit unnatural, at
>> > least to me. Keep in mind that you cannot do "like" queries on keys in
>> > Cassandra. For performance and to keep things more readable, I would
>> > prefer hashing your text and using the hash as the key.
>> >
>> > You should also consider storing the keys (hashes) in a separate table
>> > per day / hour or something like that, so you can quickly get all keys
>> > for a time range. A query without the partition key may be very slow.
>> >
>> > Jan
>> >
>> > On 11.04.2016 at 23:43, Robert Wille wrote:
>> >> I have a need to be able to use the text of a document as the primary
>> >> key in a table. These texts are usually less than 1K, but can sometimes
>> >> be tens of K in size. Would it be better to use a digest of the text as
>> >> the key? I have a background process that will occasionally need to do
>> >> a full table scan and retrieve all of the texts, so using the digest
>> >> doesn't eliminate the need to store the text. Anyway, is it better to
>> >> keep primary keys small, or is C* okay with large primary keys?
>> >>
>> >> Robert
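
For readers of this thread, the digest-as-key approach that Jan and Jack suggest could be sketched roughly as follows. This is only an illustration under assumptions: the thread says "SHA digest" without naming a variant, so SHA-256 is a guess, and the table names, column names, and the helpers `document_key` and `day_bucket` are hypothetical. The full text stays in a regular column, since Robert needs it for deferred extraction and for full table scans.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical schema: the digest is the partition key, the text is a plain column.
CREATE_ENTITY_CACHE = """
CREATE TABLE IF NOT EXISTS entity_cache (
    text_digest   text PRIMARY KEY,
    document_text text,
    entities      list<text>
)
"""

# Hypothetical lookup table for Jan's suggestion: digests grouped per day,
# so all keys for a time range can be read with the partition key.
CREATE_KEYS_BY_DAY = """
CREATE TABLE IF NOT EXISTS entity_cache_keys_by_day (
    day         text,
    text_digest text,
    PRIMARY KEY (day, text_digest)
)
"""


def document_key(text: str) -> str:
    """Return a fixed-size hex digest of the document text (SHA-256 assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def day_bucket(ts: datetime) -> str:
    """Return the UTC day (YYYY-MM-DD) used as the partition key of the lookup table.

    Expects a timezone-aware datetime.
    """
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


if __name__ == "__main__":
    short_doc = "A document of well under 1K."
    long_doc = "x" * 50_000  # stands in for a text of tens of K

    # The key has the same length regardless of the size of the text.
    print(len(document_key(short_doc)))  # 64
    print(len(document_key(long_doc)))   # 64
    print(day_bucket(datetime(2016, 4, 11, 23, 43, tzinfo=timezone.utc)))  # 2016-04-11
```

Because the same text always produces the same digest, a cache lookup only needs the document text in hand: compute `document_key(text)` and query by `text_digest`. A single day partition may grow large on a busy system, which is presumably why Jan mentions hourly buckets as an alternative.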