Re: Storing pre-sorted data

Stephen Connolly Thu, 13 Oct 2011 13:43:48 -0700

Then just use a soundex function on the first word in the text... that
will shrink it sufficiently and give nice buckets in near sequential
order (http://en.wikipedia.org/wiki/Soundex)


On 13 October 2011 21:21, Matthias Pfau <p...@l3s.de> wrote:
> Hi Stephen,
> we are hashing the first 8 byte (8 US-ASCII characters) of text that has
> been written by humans. Wouldn't it be easy for the attacker to do a
> dictionary attack on this text, especially if he knows the language of the
> text?
>
> Kind regards
> Matthias
>
> On 10/13/2011 08:20 PM, Stephen Connolly wrote:
>>
>> in theory, however they have less than 32 bits of entropy from which
>> they can do that, leaving them with at least 32 more bits of
>> combinations to try... that's 2 billion or so... must be a big dictionary
>>
>> - Stephen
>>
>> ---
>> Sent from my Android phone, so random spelling mistakes, random nonsense
>> words and other nonsense are a direct result of using swype to type on
>> the screen
>>
>> On 13 Oct 2011 17:57, "Matthias Pfau" <p...@l3s.de <mailto:p...@l3s.de>>
>> wrote:
>>
>>    Hi Stephen,
>>    this sounds very reasonable. But wouldn't this enable an attacker to
>>    execute dictionary attacks in order to "decrypt" the first 8 bytes
>>    of the plain text?
>>
>>    Kind regards
>>    Matthias
>>
>>    On 10/13/2011 05:03 PM, Stephen Connolly wrote:
>>
>>        It wouldn't be unencrypted... which is the point
>>
>>        you use a one way linear hash function to take the first, say 8
>>        bytes,
>>        of unencrypted data and turn it into 4 bytes of a sort prefix.
>>
>>        You've used lost half the data in the process, so effectively
>>        each bit
>>        is an OR of two bits and you can only infer from 0 values... so
>> data
>>        is still encrypted, but you have an approximate sorting.
>>
>>        For example, if your data is US-ASCII text with no numbers, you
>>        could
>>        use Soundex to get the pre-key, so that worst case you have a
>> bucket
>>        of values in the range.
>>
>>        Using this technique, a random get will have to get the values
>>        at the
>>        desired prefix +/- a small amount rather than the whole row...
>>        on the
>>        client side you can then decrypt the data and sort that small
>> bucket
>>        to get the correct index position.
>>
>>        You could do a 1 byte prefix, but that only gives you at best 256
>>        buckets and assumes that the first 2 bytes are uniformly
>>        distributed... you've said your data is not uniformly
>>        distributed, so
>>        a linear hash function sounds like your best bet.
>>
>>        your hash function should have the property that hash(A)>=
>>        hash(B) if
>>        and only if A>= B
>>
>>        On 13 October 2011 08:47, Matthias Pfau<p...@l3s.de
>>        <mailto:p...@l3s.de>>  wrote:
>>
>>            Hi Stephen,
>>            this is a great idea but unfortunately doesn't work for us
>>            either as we can
>>            not store the data in an unencrypted form.
>>
>>            Kind regards
>>            Matthias
>>
>>            On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>>
>>
>>                could you prefix the data with 3-4 bytes of a linear
>>                hash of the
>>                unencypted data? it wouldn't be a perfect sort, but
>>                you'd have less of a
>>                range to query to get the sorted values?
>>
>>                - Stephen
>>
>>                ---
>>                Sent from my Android phone, so random spelling mistakes,
>>                random nonsense
>>                words and other nonsense are a direct result of using
>>                swype to type on
>>                the screen
>>
>>                On 12 Oct 2011 17:57, "Matthias Pfau"<p...@l3s.de
>>                <mailto:p...@l3s.de><mailto:pfau@__l3s.de
>>                <mailto:p...@l3s.de>>>
>>                wrote:
>>
>>                    Unfortunately, that is not an option as we have to
>>                store the data in
>>                    an compressed and encrypted and therefore binary and
>>                non-sortable form.
>>
>>                    On 10/12/2011 06:39 PM, David McNelis wrote:
>>
>>                        Is it an option to not convert the data to
>>                binary prior to
>>                inserting
>>                        into Cassandra?  Also, how large are the strings
>>                you're sorting?
>>                          If its
>>                        viable to not convert to binary before writing
>>                to Cassandra, and
>>                        you use
>>                        one of the string based column ordering
>>                techniques (utf8, ascii,
>>                for
>>                        example), then the data would be sorted without
>>                you  needing to
>>                        specifically worry about that.  Of course, if
>>                the strings are
>>                        lengthy
>>                        you could run into  additional issues.
>>
>>                        On Wed, Oct 12, 2011 at 11:34 AM, Matthias
>>                Pfau<p...@l3s.de <mailto:p...@l3s.de>
>>                <mailto:p...@l3s.de <mailto:p...@l3s.de>>
>>                <mailto:p...@l3s.de
>>                <mailto:p...@l3s.de><mailto:pfa...@l3s.de
>>                <mailto:p...@l3s.de>>>>  wrote:
>>
>>                            Hi there,
>>                            we are currently building a prototype based
>>                on cassandra and
>>                        came
>>                            into problems on implementing sorted lists
>>                containing
>>                        millions of items.
>>
>>                            The special thing about the items of our
>>                lists is, that
>>                        cassandra is
>>                            not able to sort them as the data is stored
>>                in a binary
>>                        format which
>>                            is not sortable. However, we are able to
>>                sort the data
>>                        before the
>>                            plain data gets encoded (our application is
>>                responsible for
>>                        the order).
>>
>>                            First Approach: Storing Lists in ColumnFamilies
>>                            ***
>>                            We first tried to map the list to a single
>>                row of a
>>                        ColumnFamily in
>>                            a way that the index of the list is mapped
>>                to the column
>>                        names and
>>                            the items of the list to the column values.
>>                The column names
>>                are
>>                            increasing numbers which define the sort order.
>>                            This has the major drawback that big parts
>>                of the list have
>>                        to be
>>                            rewritten on inserts (because the column
>>                names are numbered
>>                        by their
>>                            index), which are quite common.
>>
>>
>>                            Second Approach: Storing the whole List as
>>                Binary Data:
>>                            ***
>>                            We tried to store the compressed list in a
>>                single column.
>>                        However,
>>                            this is only feasible for smaller lists. Our
>>                lists are far
>>                        to big
>>                            leading to multi megabyte reads and writes.
>>                As we need to
>>                        read and
>>                            update the lists quite often, this would put
>>                our Cassandra
>>                        cluster
>>                            under a lot of pressure.
>>
>>                            Ideal Solution: Native support for storing
>> lists
>>                            ***
>>                            We would be very happy with a way to store a
>>                list of sorted
>>                        values
>>                            without making improper use of column names
>>                for the list
>>                        index. This
>>                            implies that we would need a possibility to
>>                insert values at
>>                        defined
>>                            positions. We know that this could lead to
>>                problems with
>>                        concurrent
>>                            inserts in a distributed environment, but
>>                this is handled by
>>                our
>>                            application logic.
>>
>>
>>                            What are your ideas on that?
>>
>>                            Thanks
>>                            Matthias
>>
>>
>>
>>
>>                        --
>>                        *David McNelis*
>>                        Lead Software Engineer
>>                        Agentis Energy
>>                www.agentisenergy.com
>>
>>  <http://www.agentisenergy.com><http://__www.agentisenergy.com
>>                <http://www.agentisenergy.com>>
>>                <http://www.agentisenergy.com>
>>                        c: 219.384.5143
>>                <tel:219.384.5143><tel:219.384.5143 <tel:219.384.5143>>
>>
>>                        /A Smart Grid technology company focused on
>>                helping consumers of
>>                        energy
>>                        control an often under-managed resource./
>>
>>
>>
>>
>>
>>
>
>

Re: Storing pre-sorted data

Reply via email to