[jira] [Updated] (CASSANDRA-6268) Poor performance of Hadoop if any DC is using VNodes

JIRA Tue, 29 Oct 2013 12:56:02 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Piotr Kołaczkowski updated CASSANDRA-6268:
------------------------------------------

    Description: 
Some customers are complaining about a huge number of splits in Hadoop caused 
by vnodes. Disabling vnodes only in the Hadoop DC does not fix it. Splits are 
generated from the results of describe_ring, which returns a huge number of 
ranges anyway and doesn't take into account that many of them are consecutive 
ranges residing on the same nodes we'd like the M/R job to run on.

The proposed fix:
1. allows for specifying the DC(s) the Hadoop job should be run in (in DSE - 
defaults to all Hadoop DCs)
2. merges consecutive ranges before generating Hadoop splits, so we don't have 
artificial range splitting caused by vnodes in the other DCs

For non-DSE users this feature is turned off by default and doesn't change the 
behaviour.
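
A minimal sketch of step 2, not the code from the attached patch. It assumes 
the ranges arrive in ring order, that each range carries the addresses of its 
replicas, and that a caller-supplied map gives each address's DC. The names 
TokenRange, merge_consecutive_ranges, endpoint_dc and the DC names are 
hypothetical. Wrap-around at the end of the ring is ignored for brevity.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class TokenRange:
    start: int                  # start token of the range
    end: int                    # end token of the range
    endpoints: FrozenSet[str]   # addresses of the replicas holding the range


def merge_consecutive_ranges(
    ranges: Iterable[TokenRange],
    endpoint_dc: Dict[str, str],
    target_dcs: Set[str],
) -> List[TokenRange]:
    """Merge adjacent ranges whose replicas in the target DCs are identical.

    Replicas outside target_dcs are dropped first, so range boundaries that
    exist only because of vnodes in other DCs no longer split the result.
    """
    merged: List[TokenRange] = []
    for r in ranges:
        # Keep only the replicas located in the DCs the job should run in.
        local = frozenset(e for e in r.endpoints if endpoint_dc.get(e) in target_dcs)
        prev = merged[-1] if merged else None
        if prev is not None and prev.end == r.start and prev.endpoints == local:
            # Same local replicas and contiguous tokens: extend the last range.
            merged[-1] = TokenRange(prev.start, r.end, local)
        else:
            merged.append(TokenRange(r.start, r.end, local))
    return merged


# Hypothetical cluster: one Hadoop DC without vnodes, one other DC with vnodes.
endpoint_dc = {
    "10.0.0.1": "analytics", "10.0.0.2": "analytics",
    "10.0.1.1": "cassandra", "10.0.1.2": "cassandra",
}
ring = [
    TokenRange(0, 10, frozenset({"10.0.0.1", "10.0.1.1"})),
    TokenRange(10, 20, frozenset({"10.0.0.1", "10.0.1.2"})),
    TokenRange(20, 30, frozenset({"10.0.0.1", "10.0.1.1"})),
    TokenRange(30, 40, frozenset({"10.0.0.2", "10.0.1.2"})),
    TokenRange(40, 50, frozenset({"10.0.0.2", "10.0.1.1"})),
]
merged = merge_consecutive_ranges(ring, endpoint_dc, {"analytics"})
```

With this input the five ranges reported for the ring collapse into two, one 
per node in the hypothetical "analytics" DC, and splits would then be 
generated from those two ranges.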

  was:
Some customers are complaining about huge number of splits in Hadoop caused by 
VNodes. Disabling vnodes only in Hadoop DC does not fix it, because splits are 
generated from the results of describe_ring, which returns a huge number of 
ranges. 

The proposed fix:
- allows for specifying the DCs the Hadoop job should be run
- merges the consecutive ranges before generating Hadoop splits, so we don't 
have artificial range splitting caused by vnodes in the other DCs


> Poor performance of Hadoop if any DC is using VNodes
> ----------------------------------------------------
>
>                 Key: CASSANDRA-6268
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6268
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>            Reporter: Piotr Kołaczkowski
>            Assignee: Piotr Kołaczkowski
>         Attachments: 
> 0001-DSP-2572-Adds-ability-to-set-target-DCs-where-a-Hado.patch
>
>
> Some customers are complaining about a huge number of splits in Hadoop caused 
> by vnodes. Disabling vnodes only in the Hadoop DC does not fix it. Splits are 
> generated from the results of describe_ring, which returns a huge number of 
> ranges anyway and doesn't take into account that many of them are consecutive 
> ranges residing on the same nodes we'd like the M/R job to run on.
> The proposed fix:
> 1. allows for specifying the DC(s) the Hadoop job should be run in (in DSE - 
> defaults to all Hadoop DCs)
> 2. merges consecutive ranges before generating Hadoop splits, so we don't 
> have artificial range splitting caused by vnodes in the other DCs
> For non-DSE users this feature is turned off by default and doesn't change 
> the behaviour.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
