[ https://issues.apache.org/jira/browse/CASSANDRA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057002#comment-13057002 ]
Mck SembWever edited comment on CASSANDRA-2388 at 6/29/11 7:48 AM: ------------------------------------------------------------------- - This does happen already (i've seen it while testing initial patches that were no good). Problem is that the TT is blacklisted, reducing hadoop's throughput for all jobs running. I bet too that a fallback to a replica is faster than a fallback to another TT. - There is no guarantee that any given TT will have its split accessible via a local c* node - this is only a preference in CFRR. A failed task may just as likely go to a random c* node. At least now we can actually properly limit to the one DC and sort by proximity. - One thing we're not doing here is applying this same DC limit and sort by proximity in the case when there isn't a localhost preference. See CFRR.initialize(..) It would make sense to rewrite CFRR.getLocations(..) to {noformat} private Iterator<String> getLocations(final Configuration conf) throws IOException { return new SplitEndpointIterator(conf); }{noformat} and then to move the finding-a-preference-to-localhost code into SplitEndpointIterator... - A bug i can see in the patch that did get accepted already is in CassandraServer.java:763 when endpointValid is false and restrictToSameDC is true we end up restricting to a random DC. I can fix this so restrictToSameDC is disabled in such situations. This actually invalidates the previous point: we can't restrict to DC anymore and we can only sortByProximity to a random node... I think this supports Jonathan's point that it's overall a poor approach. I'm more and more in preference of my original approach using just client.getDatacenter(..) and not worrying about proximity within the datacenter. - Another bug is that, contray to my patch, the code committed bq. committed with a change to use the dynamic snitch id the passed endpoint is valid. can call {{DynamicEndpointSnitch.sortByProximity(..)}} with an address that is not localhost and this breaks the assertion in the method. was (Author: michaelsembwever): This does happen already (i've seen it while testing initial patches that were no good). Problem is that the TT is blacklisted, reducing hadoop's throughput for all jobs running. I bet too that a fallback to a replica is faster than a fallback to another TT. On a side note, there is no guarantee that any given TT will have its split accessible via a local c* node - this is only a preference in CFRR. A failed task may just as likely go to a random c* node. At least now we can actually properly limit to the one DC and sort by proximity. One thing we're not doing here is applying this same DC limit and sort by proximity in the case when there isn't a localhost preference. See CFRR.initialize(..) It would make sense to rewrite CFRR.getLocations(..) to {noformat} private Iterator<String> getLocations(final Configuration conf) throws IOException { return new SplitEndpointIterator(conf); }{noformat} and then to move the finding-a-preference-to-localhost code into SplitEndpointIterator... A bug i can see in the patch that did get accepted already is in CassandraServer.java:763 when endpointValid is false and restrictToSameDC is true we end up restricting to a random DC. I can fix this so restrictToSameDC is disabled in such situations. > ColumnFamilyRecordReader fails for a given split because a host is down, even > if records could reasonably be read from other replica. > ------------------------------------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-2388 > URL: https://issues.apache.org/jira/browse/CASSANDRA-2388 > Project: Cassandra > Issue Type: Bug > Components: Hadoop > Affects Versions: 0.7.6, 0.8.0 > Reporter: Eldon Stegall > Assignee: Jeremy Hanna > Labels: hadoop, inputformat > Fix For: 0.7.7, 0.8.2 > > Attachments: 0002_On_TException_try_next_split.patch, > CASSANDRA-2388-addition1.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch, > CASSANDRA-2388.patch > > > ColumnFamilyRecordReader only tries the first location for a given split. We > should try multiple locations for a given split. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira