Re: Inconsistent result in super range slice query (reversed order)

Tyler Hobbs Thu, 17 Feb 2011 23:00:12 -0800

I'm unable to reproduce this in pycassa starting with a clean database.  Are
you doing anything else to these rows besides inserting them?


Here's the complete script I'm using below.  Could you confirm that this
causes problems for you?

- Tyler

=========

import sys
import pycassa

pool = pycassa.ConnectionPool('Keyspace1')
cf = pycassa.ColumnFamily(pool, 'Super1')

KEY = 'key'

columns = [
    "20031210020333/190209-20031210-4476807-s/"  , #0
    "20031210020333/190209-20031210-4476807-s/0" , #1
    "20031210021940/190209-20031210-4476883-s/"  , #2
    "20031210021940/190209-20031210-4476883-s/0" , #3
    "20031210022059/190209-20031210-4476885-s/"  , #4
    "20031210022059/190209-20031210-4476885-s/0" , #5
    # <--Problem_around_here.
    "20031210022154/190209-20031210-4476888-s/"  , #6
    "20031210022154/190209-20031210-4476888-s/0"   #7
]

for supercolumn in columns:
    cf.insert(KEY, {supercolumn: {'subcol': 'subval', 'subcol2': 'subval'}})

def get_cols(start_date, end_date, reversed):
    for key, cols in cf.get_range(start = KEY,
                                  finish = KEY,
                                  column_reversed=reversed,
                                  column_count=10000,
                                  column_start=start_date,
                                  column_finish=end_date):
        for supercol, subcols in cols.iteritems():
            print "col='%s' \tlen = %d" % (supercol, len(subcols))

start = 0
for end in [0,3,5,7]:
    print "\nstart %d, end %d + 'z'" % (start, end)
    get_cols(columns[start], columns[end] + 'z', False)

end = 0
for start in [0, 3, 5, 7]:
    print "\nstart %d + 'z', end %d (reversed)" % (start, end)
    # Reversed slice: column_start is the upper bound, column_finish the lower.
    get_cols(columns[start] + 'z', columns[end], True)
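
For reference, here is a minimal sketch (Python 3, no Cassandra needed) of what the slices in this thread would be expected to return. It assumes the comparator orders super column names bytewise, which for these ASCII names matches Python string comparison. The helper `slice_columns` is hypothetical and is not part of pycassa or Cassandra; it only models an inclusive slice over a sorted list.

```python
from bisect import bisect_left, bisect_right

# The eight super column names from the thread, in comparator order
# (assumption: bytewise/ASCII ordering).
COLUMNS = sorted([
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
])


def slice_columns(names, start, finish, reverse=False):
    """Hypothetical model of an inclusive column slice over sorted names.

    Forward:  start <= name <= finish, returned in ascending order.
    Reversed: finish <= name <= start, returned in descending order.
    Empty ("unbounded") bounds are not modelled here.
    """
    low, high = (finish, start) if reverse else (start, finish)
    selected = names[bisect_left(names, low):bisect_right(names, high)]
    return selected[::-1] if reverse else selected


for end in (0, 3, 5, 7):
    bound = COLUMNS[end] + "z"  # a name that does not exist in the row
    forward = slice_columns(COLUMNS, COLUMNS[0], bound)
    backward = slice_columns(COLUMNS, bound, COLUMNS[0], reverse=True)
    # In this model the reversed slice is the forward slice read backwards.
    assert backward == forward[::-1]
    print("end %d + 'z': forward %d columns, reversed %d columns"
          % (end, len(forward), len(backward)))
```

In this model a reversed slice never comes back empty when the forward slice over the same bounds is non-empty, which is the behaviour the original report expects from the server.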


On Thu, Feb 17, 2011 at 11:09 PM, Shotaro Kamio <[email protected]> wrote:

> Hi Aaron,
>
> Range slice means get_range_slices() in thrift api,
> createSuperSliceQuery in hector, get_range() in pycassa. The example
> code in pycassa is attached below.
>
> The problem is a little bit complicated to explain. I'll try to
> describe it with examples.
> Here are 8 super column names that exist under one specific key,
> listed in forward order.
>
> #0: "20031210020333/190209-20031210-4476807-s/"
> #1: "20031210020333/190209-20031210-4476807-s/0"
> #2: "20031210021940/190209-20031210-4476883-s/"
> #3: "20031210021940/190209-20031210-4476883-s/0"
> #4: "20031210022059/190209-20031210-4476885-s/"
> #5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
> #6: "20031210022154/190209-20031210-4476888-s/"
> #7: "20031210022154/190209-20031210-4476888-s/0"
>
> There is no problem if I use super column names that exist on the key.
>
> * Range from #0 to #3 in forward order -> OK
> * Range from #0 to #5 in forward order -> OK
> * Range from #0 to #7 in forward order -> OK
>
> * Range from #7 to #0 in reverse order -> OK
> * Range from #5 to #0 in reverse order -> OK
> * Range from #3 to #0 in reverse order -> OK
>
>
> However, because I want to scan orders in a certain range, I use
> column names with the character "z" appended (higher than anything in
> order_id). Those column names are listed below as #1z, #3z, #5z and
> #7z. Note that these super column names don't actually exist on the key.
> (#4+ is a column name that sorts between #4 and #5.)
>
> #0 : "20031210020333/190209-20031210-4476807-s/"
> #1 : "20031210020333/190209-20031210-4476807-s/0"
> #1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
> #2 : "20031210021940/190209-20031210-4476883-s/"
> #3 : "20031210021940/190209-20031210-4476883-s/0"
> #3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
> #4 : "20031210022059/190209-20031210-4476885-s/"
> #4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
> #5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
> #5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
> #6 : "20031210022154/190209-20031210-4476888-s/"
> #7 : "20031210022154/190209-20031210-4476888-s/0"
> #7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)
>
> Then, try to range slice them.
>
> * Range from #0 to #3z in forward order -> OK
> * Range from #0 to #4+ in forward order -> OK
> * Range from #0 to #5z in forward order -> OK
> * Range from #0 to #7z in forward order -> OK
>
> * Range from #7z to #0 in reverse order -> OK
> * Range from #5z to #0 in reverse order -> FAIL (no result)
> * Range from #4+ to #0 in reverse order -> OK
> * Range from #3z to #0 in reverse order -> OK
>
> The problem occurs only in the #5z case. No error or warning appears in
> the Cassandra log.
>
> Also, I tried dumping data into json via sstable2json and restored it
> with json2sstable. But the same problem occurs.
>
>
> The code I used for the test is something like this.
> ----------------------
> client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
> cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)
>
> columns = [
> "20031210020333/190209-20031210-4476807-s/"  , #0
> "20031210020333/190209-20031210-4476807-s/0" , #1
> "20031210021940/190209-20031210-4476883-s/"  , #2
> "20031210021940/190209-20031210-4476883-s/0" , #3
> "20031210022059/190209-20031210-4476885-s/"  , #4
> "20031210022059/190209-20031210-4476885-s/0" , #5
> # <--Problem_around_here.
> "20031210022154/190209-20031210-4476888-s/"  , #6
> "20031210022154/190209-20031210-4476888-s/0"   #7
> ]
>
> reversed = False
> if len(sys.argv) > 1:
>    # Use reversed order if the "-r" option is given; "-f" or anything else
>    # means forward order. With no option, all column names are listed.
>    reversed = (sys.argv[1] == '-r')
>
>    start_date = columns[0]
>    end_date  = columns[7] + "z" # add "z" to trigger the problem.
>
>    if reversed:
>        # A reversed slice starts at the upper bound, so swap the bounds.
>        start_date, end_date = end_date, start_date
> else:
>    start_date = end_date = ''
>
> print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed
>
> for it in cf.get_range(start = A_KEY, finish = A_KEY,
>                        column_reversed=reversed, column_count=10000,
>                        column_start=start_date, column_finish=end_date):
>    for d in it[1].iteritems():
>        print "col='%s', len = %d" % (d[0], len(d[0]))
>
> -------------------------
>
>
> Regards,
> Shotaro
>
>
>
>
> On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <[email protected]>
> wrote:
> > First, some terminology: when you say range slice, do you mean getting
> > multiple rows? Or do you mean get_slice, where you return multiple super
> > columns from one row?
> >
> > Your examples look like you want to get multiple super columns from one
> > row, in which case the choice of partitioner is not important. The
> > comparator and sub-comparator specified in the CF definition control the
> > ordering of columns. If possible, I would suggest using the random
> > partitioner.
> >
> > Could you provide examples of how you are doing the queries using pycassa?
> > We may be able to help.
> >
> > My initial guess is that the ranges you specify for the query are not
> > correct when using ASCII ordering for column names, e.g.
> >
> > 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
> >
> > 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
> >
> > Try appending the highest-value ASCII character to the end of 20031210.
> >
> > Cheers
> > Aaron
> >
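The ordering argument above can be checked without a cluster. This is a small sketch assuming the column comparator sorts names bytewise, which for these ASCII names matches Python string comparison; the variable names are illustrative only, and "~" is used as the highest printable ASCII character.

```python
low = "20031210"
high = "20031210022059/190209-20031210-4476885-s/z"

# A prefix sorts before any longer string that starts with it.
print(low < high)
print(high < low)

# Appending "~" (0x7e) lifts the short bound above every name that
# continues the same prefix with a digit.
print(low + "~" > high)

# Code points of the characters that decide the ordering in this thread.
for ch in "+/0z~":
    print(repr(ch), hex(ord(ch)))
```

This also shows why the "+" suffix sorts just below the "0" suffix: "+" is 0x2b and "0" is 0x30.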
> > On 18/02/2011, at 4:35 AM, Shotaro Kamio <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> We are having trouble with strange behavior in Cassandra 0.7.2 (it
> >> also happened in 0.7.0). Could someone help us?
> >>
> >> The problem happens on a column family of super column type named
> >> "Order".
> >> Data structure is something like:
> >>  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
> >>
> >> For example,
> >> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
> >> is a super column.
> >> Because we want to scan them in the latest-first order, range slice
> >> query with reversed order is used. (Partitioner is
> >> ByteOrderedPartitioner).
> >>
> >> For some super columns in my Cassandra instance, the reversed query
> >> returns no results when it should return some.
> >> For instance,
> >>
> >> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
> >> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
> >> return results correctly.
> >>
> >> col='20031210014347/190209-20031210-4476668-s/'
> >> col='20031210014347/190209-20031210-4476668-s/0'
> >> col='20031210022059/190209-20031210-4476885-s/'
> >> col='20031210022059/190209-20031210-4476885-s/0'
> >>
> >> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/z" to  "20031210" ] ) will
> >> return NO result!
> >>
> >> Note that the super column name
> >> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
> >> should work. And, it succeeds in other super columns.
> >>
> >> * Range slice in reversed (latest-first)-order starting from an
> >> existing column name ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
> >> return the expected results.
> >>
> >> Both pycassa and hector show the same behavior on the same column
> >> name. I suspect a logic error in Cassandra.
> >>
> >>
> >> I'll appreciate any help.
> >>
> >>
> >> Best regards,
> >> Shotaro
> >
>
>
>
> --
> Shotaro Kamio
>



-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library
