I'm unable to reproduce this in pycassa starting with a clean database. Are
you doing anything else to these rows besides inserting them?
Below is the complete script I'm using. Could you confirm whether it
reproduces the problem for you?
- Tyler
=========
import sys
import pycassa
pool = pycassa.ConnectionPool('Keyspace1')
cf = pycassa.ColumnFamily(pool, 'Super1')
KEY = 'key'
columns = [
"20031210020333/190209-20031210-4476807-s/" , #0
"20031210020333/190209-20031210-4476807-s/0" , #1
"20031210021940/190209-20031210-4476883-s/" , #2
"20031210021940/190209-20031210-4476883-s/0" , #3
"20031210022059/190209-20031210-4476885-s/" , #4
"20031210022059/190209-20031210-4476885-s/0" , #5
# <--Problem_around_here.
"20031210022154/190209-20031210-4476888-s/" , #6
"20031210022154/190209-20031210-4476888-s/0" #7
]
for supercolumn in columns:
    cf.insert(KEY, {supercolumn: {'subcol': 'subval', 'subcol2': 'subval'}})
def get_cols(start_date, end_date, reversed):
    for key, cols in cf.get_range(start = KEY,
                                  finish = KEY,
                                  column_reversed=reversed,
                                  column_count=10000,
                                  column_start=start_date,
                                  column_finish=end_date):
        for supercol, subcols in cols.iteritems():
            print "col='%s' \tlen = %d" % (supercol, len(subcols))
# Forward slices: from column #0 up to each end column + 'z'.
start = 0
for end in [0,3,5,7]:
    print "\nstart %d, end %d + 'z'" % (start, end)
    get_cols(columns[start], columns[end] + 'z', False)
# Reversed slices: start at the higher name (+ 'z') and finish at column #0.
end = 0
for start in [0, 3, 5, 7]:
    print "\nstart %d + 'z', end %d (reversed)" % (start, end)
    get_cols(columns[start] + 'z', columns[end], True)
On Thu, Feb 17, 2011 at 11:09 PM, Shotaro Kamio <[email protected]> wrote:
> Hi Aaron,
>
> Range slice means get_range_slices() in thrift api,
> createSuperSliceQuery in hector, get_range() in pycassa. The example
> code in pycassa is attached below.
>
> The problem is a little bit complicated to explain. I'll try to
> describe in examples.
> Here are 8 super column names which exist in the specific key. The
> list is forward order.
>
> #0: "20031210020333/190209-20031210-4476807-s/"
> #1: "20031210020333/190209-20031210-4476807-s/0"
> #2: "20031210021940/190209-20031210-4476883-s/"
> #3: "20031210021940/190209-20031210-4476883-s/0"
> #4: "20031210022059/190209-20031210-4476885-s/"
> #5: "20031210022059/190209-20031210-4476885-s/0" <-- Problem around here.
> #6: "20031210022154/190209-20031210-4476888-s/"
> #7: "20031210022154/190209-20031210-4476888-s/0"
>
> There is no problem if I use super column names that exist on the key.
>
> * Range from #0 to #3 in forward order -> OK
> * Range from #0 to #5 in forward order -> OK
> * Range from #0 to #7 in forward order -> OK
>
> * Range from #7 to #0 in reverse order -> OK
> * Range from #5 to #0 in reverse order -> OK
> * Range from #3 to #0 in reverse order -> OK
>
>
> Because I want to scan orders in a certain range, however, I use
> column names with the character "z" appended (higher than anything in
> order_id). Those column names are listed below as #1z, #3z, #5z and
> #7z. Note that these super column names don't really exist on the key.
> (#4+ is a column name that sorts between #4 and #5.)
>
> #0 : "20031210020333/190209-20031210-4476807-s/"
> #1 : "20031210020333/190209-20031210-4476807-s/0"
> #1z: "20031210020333/190209-20031210-4476807-s/z" (don't exist)
> #2 : "20031210021940/190209-20031210-4476883-s/"
> #3 : "20031210021940/190209-20031210-4476883-s/0"
> #3z: "20031210021940/190209-20031210-4476883-s/z" (don't exist)
> #4 : "20031210022059/190209-20031210-4476885-s/"
> #4+: "20031210022059/190209-20031210-4476885-s/+" (don't exist)
> #5 : "20031210022059/190209-20031210-4476885-s/0" <-- Problem around here.
> #5z: "20031210022059/190209-20031210-4476885-s/z" (don't exist)
> #6 : "20031210022154/190209-20031210-4476888-s/"
> #7 : "20031210022154/190209-20031210-4476888-s/0"
> #7z: "20031210022154/190209-20031210-4476888-s/z" (don't exist)
>
> Then, try to range slice them.
>
> * Range from #0 to #3z in forward order -> OK
> * Range from #0 to #4+ in forward order -> OK
> * Range from #0 to #5z in forward order -> OK
> * Range from #0 to #7z in forward order -> OK
>
> * Range from #7z to #0 in reverse order -> OK
> * Range from #5z to #0 in reverse order -> FAIL (no result)
> * Range from #4+ to #0 in reverse order -> OK
> * Range from #3z to #0 in reverse order -> OK
>
> The problem happens in this case. No error or warning is shown in cassandra
> log.
>
> Also, I tried dumping data into json via sstable2json and restored it
> with json2sstable. But the same problem occurs.
>
>
> The code I used for the test is something like this.
> ----------------------
> client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
> cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)
>
> columns = [
> "20031210020333/190209-20031210-4476807-s/" , #0
> "20031210020333/190209-20031210-4476807-s/0" , #1
> "20031210021940/190209-20031210-4476883-s/" , #2
> "20031210021940/190209-20031210-4476883-s/0" , #3
> "20031210022059/190209-20031210-4476885-s/" , #4
> "20031210022059/190209-20031210-4476885-s/0" , #5
> # <--Problem_around_here.
> "20031210022154/190209-20031210-4476888-s/" , #6
> "20031210022154/190209-20031210-4476888-s/0" #7
> ]
>
> reversed = False
> if len(sys.argv) > 1:
>     # Use reversed order if the "-r" option is given; "-f" or anything else
>     # means forward order. With no option, all column names are listed.
>     reversed = (sys.argv[1] == '-r')
>
>     start_date = columns[0]
>     end_date = columns[7] + "z" # add "z" to make problem.
>
>     if reversed:
>         temp = start_date
>         start_date = end_date
>         end_date = temp
>         pass
> else:
>     start_date = end_date = ''
>     pass
>
> print "start_date =", start_date, "end_date =", end_date, "reversed =", reversed
>
> for it in cf.get_range(start = A_KEY, finish = A_KEY,
>         column_reversed=reversed, column_count=10000, column_start=start_date,
>         column_finish=end_date):
>
>     for d in it[1].iteritems():
>         print "col='%s', len = %d" % (d[0], len(d[0]))
>         pass
>     pass
>
> -------------------------
>
>
> Regards,
> Shotaro
>
>
>
>
> On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <[email protected]> wrote:
> > First some terminology, when you say range slice do you mean getting
> > multiple rows? Or do you mean get_slice where you return multiple super
> > columns from one row?
> >
> > Your examples look like you want to get multiple super columns from one
> > row. In which case the choice of partitioner is not important. The
> > comparator and sub comparator as specified in the CF definition control the
> > ordering of columns. If possible I would suggest using the random
> > partitioner.
> >
> > Could you provide examples of how you are doing the queries using pycassa?
> > We may be able to help.
> >
> > My initial guess is that the ranges you specify for the query are not
> > correct when using ASCII ordering for column names, e.g.
> >
> > 20031210 < 20031210022059/190209-20031210-4476885-s/z is true
> >
> > 20031210022059/190209-20031210-4476885-s/z < 20031210 is not true
> >
> > Try appending the highest value ASCII character to the end of 20031210
> >
> > Cheers
> > Aaron
> >
> > On 18/02/2011, at 4:35 AM, Shotaro Kamio <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> We are in trouble with a strange behavior in cassandra 0.7.2 (also
> >> happened in 0.7.0). Could someone help us?
> >>
> >> The problem happens on a column family of super column type named "Order".
> >> Data structure is something like:
> >> Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value
> >>
> >> For example,
> >> Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
> >> is a super column.
> >> Because we want to scan them in the latest-first order, range slice
> >> query with reversed order is used. (Partitioner is
> >> ByteOrderedPartitioner).
> >>
> >> In some supercolumns in my cassandra instance, reversed query returns
> >> no result while it should have results.
> >> For instance,
> >>
> >> * Range slice in normal (lexical)-order ( Order[ "100" ] [ from
> >> "20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
> >> return results correctly.
> >>
> >> col='20031210014347/190209-20031210-4476668-s/'
> >> col='20031210014347/190209-20031210-4476668-s/0'
> >> col='20031210022059/190209-20031210-4476885-s/'
> >> col='20031210022059/190209-20031210-4476885-s/0'
> >>
> >> * Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/z" to "20031210" ] ) will
> >> return NO result!
> >>
> >> Note that the super column name
> >> "20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
> >> should work. And, it succeeds in other super columns.
> >>
> >> * Range slice in reversed (latest-first)-order starting from existing
> >> column name ( Order[ "100" ] [ from
> >> "20031210022059/190209-20031210-4476885-s/0" to "20031210" ] ) will
> >> return the expected results.
> >>
> >> Both pycassa and hector show the same behavior on the same column
> >> name. I guess that cassandra has some logical error.
> >>
> >>
> >> I'll appreciate any help.
> >>
> >>
> >> Best regards,
> >> Shotaro
> >
>
>
>
> --
> Shotaro Kamio
>
--
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library