[ https://issues.apache.org/jira/browse/CASSANDRA-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Artem Aliev updated CASSANDRA-10835:
------------------------------------
    Attachment: cassandra-3.0.1-10835-2.txt
                cassandra-2.2-10835-2.txt

Implemented Joshua's idea: an MB-based parameter was added.

> CqlInputFormat creates too small splits for map Hadoop tasks
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-10835
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10835
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Artem Aliev
>         Attachments: cassandra-2.2-10835-2.txt, cassandra-3.0.1-10835-2.txt, cassandra-3.0.1-10835.txt
>
>
> In C* versions < 2.2, CqlInputFormat uses the number of rows to define the split size.
> The default split size was 64K rows.
> {code}
> private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
> {code}
> The doc:
> {code}
> * You can also configure the number of rows per InputSplit with
> * ConfigHelper.setInputSplitSize. The default split size is 64k rows.
> {code}
> The new split algorithm assumes that the split size is in bytes, so it creates very
> small Hadoop map tasks by default (or with old configs).
> There are two ways to fix it:
> 1. Update the doc and increase the default value to something like 16MB.
> 2. Make C* compatible with older versions.
> I prefer the second option, as it will not surprise people who upgrade from
> old versions. I do not expect many new users to use Hadoop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
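
To illustrate the size of the mismatch described in the issue, here is a minimal sketch of the arithmetic only. It is not code from the attached patches. The class name, the helper method, and the 10 GiB data size are assumptions made for illustration; the two split sizes are the old default (64 * 1024, now read as bytes) and the roughly 16MB value suggested in option 1.

```java
public class SplitSizeSketch {
    // Old default from the issue: 64 * 1024, documented as "64k rows".
    // The new algorithm reads the same number as bytes, i.e. 64 KiB.
    static final long LEGACY_DEFAULT_SPLIT_SIZE = 64L * 1024;

    // Value suggested in option 1 of the issue ("something like 16MB").
    static final long SUGGESTED_SPLIT_BYTES = 16L * 1024 * 1024;

    /**
     * Hypothetical helper: number of splits needed to cover dataBytes
     * when each split holds at most splitBytes (ceiling division).
     */
    static long splitCount(long dataBytes, long splitBytes) {
        if (dataBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (dataBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // Assumed example: 10 GiB of data to scan.
        long dataBytes = 10L * 1024 * 1024 * 1024;

        // Old default interpreted as bytes: one map task per 64 KiB.
        System.out.println(splitCount(dataBytes, LEGACY_DEFAULT_SPLIT_SIZE));

        // Byte-based default of 16 MiB.
        System.out.println(splitCount(dataBytes, SUGGESTED_SPLIT_BYTES));
    }
}
```

Under these assumptions the same data yields 163840 map tasks with the old default read as bytes, versus 640 with a 16 MiB split, which is why jobs using old configs end up with very small tasks after upgrading.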