Re: Reviewing . . . RackAwareStrategy.java . . . ( rev 954657 )

2010-06-15 Thread Masood Mortazavi
On Tue, Jun 15, 2010 at 10:55 AM, Jonathan Ellis jbel...@gmail.com wrote:

 On Mon, Jun 14, 2010 at 11:03 PM, Masood Mortazavi
 masoodmortaz...@gmail.com wrote:
  The comment on the top of RackAwareStrategy says:

 You are correct.  RAS sort of works under other conditions but it is
 primarily intended for 2 DCs and RF=3.  I will update the comment in
 question.



An orthogonal but related problem is the following . . .

Currently, each replica placement strategy involves its own configuration
extensions, along with a great deal of repeated and intertwined code among
the strategies. (For example, all strategies currently need to iterate
through nodes. This is common functionality.)

The current approach not only affects construction of replica placement
strategies but also complicates their semantics.

It may be possible to refactor the code as follows:

(1) Each node has a set of properties assigned to it through the
configuration (right now, in the trunk, those properties are the rack and
DC position of a node but it should be possible to add any number of other
properties, and they should really all be in the same configuration file,
not separated as they are, today, in two or more separate files).

(2) Once these physical properties are assigned/defined for each node, a
pluggability architecture would allow whoever extends the node properties to
plug in a node Examiner as a complement to any additional properties.

(3) In the iteration that's common to all replica placement search logic,
the Examiner will either pass or fail an (iterated) node as a replica
location for a given primary based on the properties of that node.

Although such refactoring is not entirely trivial, it will lead to less
repetition across strategies, better factoring of concerns and more
reliable code, I believe.

It will also make maintenance and extension of strategies much easier . . .
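
As a rough illustration of points (1) to (3), here is a minimal Java sketch.
All of the names in it (NodeInfo, Examiner, ReplicaSelector,
DistinctRackExaminer) are hypothetical and do not exist in the trunk; the
rack rule is only an example of an Examiner, not the logic of
RackAwareStrategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A node plus the properties assigned to it in configuration (hypothetical type). */
final class NodeInfo {
    final String name;
    final Map<String, String> properties = new HashMap<String, String>();

    NodeInfo(String name, String dc, String rack) {
        this.name = name;
        // Rack and DC are the properties used today; others could be added the same way.
        properties.put("dc", dc);
        properties.put("rack", rack);
    }

    String get(String key) {
        return properties.get(key);
    }
}

/** Passes or fails a candidate node as a replica location (hypothetical interface). */
interface Examiner {
    boolean accept(NodeInfo primary, List<NodeInfo> chosen, NodeInfo candidate);
}

/** The iteration shared by every strategy; only the Examiner differs. */
final class ReplicaSelector {
    static List<NodeInfo> select(List<NodeInfo> ring, int primaryIndex,
                                 int replicas, Examiner examiner) {
        List<NodeInfo> chosen = new ArrayList<NodeInfo>();
        NodeInfo primary = ring.get(primaryIndex);
        chosen.add(primary);
        // Walk the ring once, starting just after the primary.
        // If too few nodes pass, fewer than 'replicas' nodes are returned.
        for (int i = 1; i < ring.size() && chosen.size() < replicas; i++) {
            NodeInfo candidate = ring.get((primaryIndex + i) % ring.size());
            if (examiner.accept(primary, chosen, candidate)) {
                chosen.add(candidate);
            }
        }
        return chosen;
    }
}

/** Example Examiner: reject a node that shares DC and rack with a chosen replica. */
final class DistinctRackExaminer implements Examiner {
    public boolean accept(NodeInfo primary, List<NodeInfo> chosen, NodeInfo candidate) {
        for (NodeInfo n : chosen) {
            if (n.get("dc").equals(candidate.get("dc"))
                    && n.get("rack").equals(candidate.get("rack"))) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a new strategy would only supply another Examiner (and
possibly new node properties); the ring walk itself would not be repeated.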




  There are other issues to think about. For example, for quorum write
  (consistency.quorum) to work faster, shouldn't the first replicas be as
  close as possible (i.e. on the same rack)?  The whole point of choosing
  this level of consistency is to improve performance. Right?

 No, the point is to improve reliability (there are a number of failure
 scenarios that will result in losing an entire rack at once).



Yes, I understand that.

What I was trying to say is that, if we agree to the above, we should select
the other-DC and other-Rack replica after we have selected all near
replicas.

(I imagine that, during actual replication, the replica placement list is
iterated sequentially and that the first replica will have to be the nearest,
with farther and farther replicas chosen and put on the list after it.)
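
To make the ordering I have in mind concrete, here is a small Java sketch.
The Endpoint and ProximityOrder names and the three-level distance
(same rack, same DC, other DC) are assumptions for illustration only, not
existing code. Note that it only reorders an already chosen list; it does
not change which nodes hold replicas.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Minimal stand-in for a node with a DC and rack position (hypothetical type). */
final class Endpoint {
    final String name;
    final String dc;
    final String rack;

    Endpoint(String name, String dc, String rack) {
        this.name = name;
        this.dc = dc;
        this.rack = rack;
    }
}

final class ProximityOrder {
    /** 0 = same rack, 1 = same DC but another rack, 2 = another DC. */
    static int distance(Endpoint from, Endpoint to) {
        if (!from.dc.equals(to.dc)) {
            return 2;
        }
        return from.rack.equals(to.rack) ? 0 : 1;
    }

    /** Returns a copy of the replica list ordered nearest-first relative to the primary. */
    static List<Endpoint> nearestFirst(final Endpoint primary, List<Endpoint> replicas) {
        List<Endpoint> sorted = new ArrayList<Endpoint>(replicas);
        // Collections.sort is stable, so equally distant replicas keep their ring order.
        Collections.sort(sorted, new Comparator<Endpoint>() {
            public int compare(Endpoint a, Endpoint b) {
                return Integer.compare(distance(primary, a), distance(primary, b));
            }
        });
        return sorted;
    }
}
```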

Thanks,
- m.




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: Secondary indexing and 0.6/0.7 integration with Datanucleus

2010-06-15 Thread Jonathan Ellis
What issue were you trying to link? :)

On Tue, Jun 15, 2010 at 6:56 PM, Todd Nine t...@spidertracks.co.nz wrote:
 Hi all,
  I'm implementing a Datanucleus plugin for Cassandra.  I'm finished
 with the basic functionality, and everything seems to work pretty well.
 Now my issue is performing secondary indexing on fields within my data.
 I have outlined some of the issues I'm facing in this post.

 http://www.datanucleus.org/servlet/forum/viewthread_thread,6087_lastpage,yes#32610

 Essentially, for each operand the user specifies, I will need to make a
 trip to Cassandra, load the key columns, then perform an intersection
 with the result from my previous read.  Eventually at the end of all the
 intersections, I will have a list of keys I will then load.  This
 obviously requires several trips to Cassandra, where from my
 understanding of secondary indexing, I would only need to make one trip
 for multiple operands over a column family.  I've read over this
 issue.

 http://issues.apache.org/jira/browse/CASSANDRA-32610

 And it seems to solve a lot of my woes.  Is it possible/recommended to
 patch the current code base of 0.6.2 to perform this functionality?

 Thanks,
 Todd







Re: Secondary indexing and 0.6/0.7 integration with Datanucleus

2010-06-15 Thread Jonathan Ellis
No chance that 749 can be backported to 0.6, sorry.

On Tue, Jun 15, 2010 at 10:35 PM, Todd Nine t...@spidertracks.co.nz wrote:

  Let's try that again.

 This is the intended issue.

 https://issues.apache.org/jira/browse/CASSANDRA-749

 thanks,
 Todd












Re: Secondary indexing and 0.6/0.7 integration with Datanucleus

2010-06-15 Thread Todd Nine
No problem,
  I didn't want to implement my own solution if an existing one could
easily be applied.  Since I'll be creating CFs that represent secondary
indexes, I'll need to perform range scans over the keys of those
secondary index CFs.  The column names within the CFs are the row keys
of the primary table.  Is there a way I can get the intersection of all
of the column names from multiple range scans over different column
families in one result set?  Otherwise I'll need to make multiple trips
and create the intersection myself in my plugin.  Here is an example of
what I'm trying to do.

CF: Person

key1: {
  firstName: John
  lastName: Smith
  email: smi...@foo.com
}

key2: {
  firstName: Jane
  lastName: Smith
  email: smi...@foo.com
}

key3: {
  firstName: Jane
  lastName: Doe
  email: smi...@foo.com
}


My secondary index tables would be the following

CF: Person_LastName

Smith:{
  key1: 0x00
  key2: 0x00
}

Doe: {
  key3: 0x00
}

CF: Person_Email

smi...@foo.com: {
  key1: 0x00
  key2: 0x00
  key3: 0x00
}

If my input is something similar to lastName == 'Smith' && email ==
smi...@foo.com, I would read all columns from key Smith in CF
Person_LastName, and all columns from key smi...@foo.com in CF
Person_Email.  The intersection of the two sets is key1 and key2, and I
would like Cassandra to return only those rows.
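
For reference, the client-side fallback I'd otherwise write in the plugin
looks roughly like the Java sketch below. It assumes each index lookup has
already been read into a set of row keys (the Cassandra calls themselves
are left out), and the KeyIntersection class name is hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class KeyIntersection {
    /**
     * Intersects the row-key sets returned by the individual index lookups.
     * The result keeps the iteration order of the first set.
     */
    static Set<String> intersect(List<Set<String>> perIndexKeys) {
        if (perIndexKeys.isEmpty()) {
            return new LinkedHashSet<String>();
        }
        Iterator<Set<String>> it = perIndexKeys.iterator();
        // Copy the first set so the caller's data is not modified.
        Set<String> result = new LinkedHashSet<String>(it.next());
        // Stop early once nothing is left to match.
        while (it.hasNext() && !result.isEmpty()) {
            result.retainAll(it.next());
        }
        return result;
    }
}
```

For the data above, the Person_LastName lookup gives {key1, key2} and the
Person_Email lookup gives {key1, key2, key3}, so the result is {key1, key2}.
It still costs one trip per operand plus the final load of the matching rows.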

Thanks,
Todd




