Then just use a soundex function on the first word in the text... that will shrink it sufficiently and give nice buckets in near sequential order (http://en.wikipedia.org/wiki/Soundex)
On 13 October 2011 21:21, Matthias Pfau <[email protected]> wrote: > Hi Stephen, > we are hashing the first 8 byte (8 US-ASCII characters) of text that has > been written by humans. Wouldn't it be easy for the attacker to do a > dictionary attack on this text, especially if he knows the language of the > text? > > Kind regards > Matthias > > On 10/13/2011 08:20 PM, Stephen Connolly wrote: >> >> in theory, however they have less than 32 bits of entropy from which >> they can do that, leaving them with at least 32 more bits of >> combinations to try... that's 2 billion or so... must be a big dictionary >> >> - Stephen >> >> --- >> Sent from my Android phone, so random spelling mistakes, random nonsense >> words and other nonsense are a direct result of using swype to type on >> the screen >> >> On 13 Oct 2011 17:57, "Matthias Pfau" <[email protected] <mailto:[email protected]>> >> wrote: >> >> Hi Stephen, >> this sounds very reasonable. But wouldn't this enable an attacker to >> execute dictionary attacks in order to "decrypt" the first 8 bytes >> of the plain text? >> >> Kind regards >> Matthias >> >> On 10/13/2011 05:03 PM, Stephen Connolly wrote: >> >> It wouldn't be unencrypted... which is the point >> >> you use a one way linear hash function to take the first, say 8 >> bytes, >> of unencrypted data and turn it into 4 bytes of a sort prefix. >> >> You've used lost half the data in the process, so effectively >> each bit >> is an OR of two bits and you can only infer from 0 values... so >> data >> is still encrypted, but you have an approximate sorting. >> >> For example, if your data is US-ASCII text with no numbers, you >> could >> use Soundex to get the pre-key, so that worst case you have a >> bucket >> of values in the range. >> >> Using this technique, a random get will have to get the values >> at the >> desired prefix +/- a small amount rather than the whole row... >> on the >> client side you can then decrypt the data and sort that small >> bucket >> to get the correct index position. >> >> You could do a 1 byte prefix, but that only gives you at best 256 >> buckets and assumes that the first 2 bytes are uniformly >> distributed... you've said your data is not uniformly >> distributed, so >> a linear hash function sounds like your best bet. >> >> your hash function should have the property that hash(A)>= >> hash(B) if >> and only if A>= B >> >> On 13 October 2011 08:47, Matthias Pfau<[email protected] >> <mailto:[email protected]>> wrote: >> >> Hi Stephen, >> this is a great idea but unfortunately doesn't work for us >> either as we can >> not store the data in an unencrypted form. >> >> Kind regards >> Matthias >> >> On 10/12/2011 07:42 PM, Stephen Connolly wrote: >> >> >> could you prefix the data with 3-4 bytes of a linear >> hash of the >> unencypted data? it wouldn't be a perfect sort, but >> you'd have less of a >> range to query to get the sorted values? >> >> - Stephen >> >> --- >> Sent from my Android phone, so random spelling mistakes, >> random nonsense >> words and other nonsense are a direct result of using >> swype to type on >> the screen >> >> On 12 Oct 2011 17:57, "Matthias Pfau"<[email protected] >> <mailto:[email protected]><mailto:pfau@__l3s.de >> <mailto:[email protected]>>> >> wrote: >> >> Unfortunately, that is not an option as we have to >> store the data in >> an compressed and encrypted and therefore binary and >> non-sortable form. >> >> On 10/12/2011 06:39 PM, David McNelis wrote: >> >> Is it an option to not convert the data to >> binary prior to >> inserting >> into Cassandra? Also, how large are the strings >> you're sorting? >> If its >> viable to not convert to binary before writing >> to Cassandra, and >> you use >> one of the string based column ordering >> techniques (utf8, ascii, >> for >> example), then the data would be sorted without >> you needing to >> specifically worry about that. Of course, if >> the strings are >> lengthy >> you could run into additional issues. >> >> On Wed, Oct 12, 2011 at 11:34 AM, Matthias >> Pfau<[email protected] <mailto:[email protected]> >> <mailto:[email protected] <mailto:[email protected]>> >> <mailto:[email protected] >> <mailto:[email protected]><mailto:[email protected] >> <mailto:[email protected]>>>> wrote: >> >> Hi there, >> we are currently building a prototype based >> on cassandra and >> came >> into problems on implementing sorted lists >> containing >> millions of items. >> >> The special thing about the items of our >> lists is, that >> cassandra is >> not able to sort them as the data is stored >> in a binary >> format which >> is not sortable. However, we are able to >> sort the data >> before the >> plain data gets encoded (our application is >> responsible for >> the order). >> >> First Approach: Storing Lists in ColumnFamilies >> *** >> We first tried to map the list to a single >> row of a >> ColumnFamily in >> a way that the index of the list is mapped >> to the column >> names and >> the items of the list to the column values. >> The column names >> are >> increasing numbers which define the sort order. >> This has the major drawback that big parts >> of the list have >> to be >> rewritten on inserts (because the column >> names are numbered >> by their >> index), which are quite common. >> >> >> Second Approach: Storing the whole List as >> Binary Data: >> *** >> We tried to store the compressed list in a >> single column. >> However, >> this is only feasible for smaller lists. Our >> lists are far >> to big >> leading to multi megabyte reads and writes. >> As we need to >> read and >> update the lists quite often, this would put >> our Cassandra >> cluster >> under a lot of pressure. >> >> Ideal Solution: Native support for storing >> lists >> *** >> We would be very happy with a way to store a >> list of sorted >> values >> without making improper use of column names >> for the list >> index. This >> implies that we would need a possibility to >> insert values at >> defined >> positions. We know that this could lead to >> problems with >> concurrent >> inserts in a distributed environment, but >> this is handled by >> our >> application logic. >> >> >> What are your ideas on that? >> >> Thanks >> Matthias >> >> >> >> >> -- >> *David McNelis* >> Lead Software Engineer >> Agentis Energy >> www.agentisenergy.com >> >> <http://www.agentisenergy.com><http://__www.agentisenergy.com >> <http://www.agentisenergy.com>> >> <http://www.agentisenergy.com> >> c: 219.384.5143 >> <tel:219.384.5143><tel:219.384.5143 <tel:219.384.5143>> >> >> /A Smart Grid technology company focused on >> helping consumers of >> energy >> control an often under-managed resource./ >> >> >> >> >> >> > >
