[ https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip Thompson updated CASSANDRA-9074:
---------------------------------------
    Assignee: Alex Liu

> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-9074
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>         Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java cassandra-driver-core 2.1.4
>            Reporter: fuggy_yama
>            Assignee: Alex Liu
>            Priority: Minor
>             Fix For: 2.0.14
>
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run
> a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads
> data from that table, and I see that only 7k rows are read in the map phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is built
> to select node-local data and that a LIMIT clause is also added (1k by default).
> So the 7k rows read can be explained:
> 7 nodes * 1k limit = 7k rows read in total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the rest of the rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row count?
> What if the number of rows I need to read per Cassandra node is very large?
> In other words, how do I avoid an OOM when setting InputCQLPageRowSize very
> large to handle all the data?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
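
The arithmetic in the report (7 nodes * 1k limit = 7k rows) can be reproduced with a small toy model. The sketch below is not CqlInputFormat code and touches no Cassandra API; the class PagingSketch and its methods are hypothetical names, and it assumes the 10k rows are spread evenly over the 7 nodes. It contrasts a reader that stops after the first page of each node-local split with one that keeps requesting pages until a short page signals the end of the split.

```java
public class PagingSketch {

    /** Models a reader that issues one LIMIT-ed query per split and then stops. */
    public static int readFirstPageOnly(int rowsInSplit, int pageSize) {
        return Math.min(rowsInSplit, pageSize);
    }

    /**
     * Models a reader that keeps fetching pages of at most pageSize rows.
     * A page shorter than pageSize means the split is exhausted.
     * Only one page is held at a time, so memory use is bounded by pageSize.
     */
    public static int readAllPages(int rowsInSplit, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int read = 0;
        int fetched;
        do {
            fetched = Math.min(pageSize, rowsInSplit - read);
            read += fetched;
        } while (fetched == pageSize);
        return read;
    }

    /** Assumption for illustration: rows are distributed evenly across nodes. */
    public static int[] evenSplits(int totalRows, int nodes) {
        int[] splits = new int[nodes];
        for (int i = 0; i < nodes; i++) {
            splits[i] = totalRows / nodes + (i < totalRows % nodes ? 1 : 0);
        }
        return splits;
    }

    public static void main(String[] args) {
        int pageSize = 1_000;                  // default page row size from the report
        int[] splits = evenSplits(10_000, 7);  // 10k rows on a 7-node cluster
        int firstPageOnly = 0;
        int allPages = 0;
        for (int rows : splits) {
            firstPageOnly += readFirstPageOnly(rows, pageSize);
            allPages += readAllPages(rows, pageSize);
        }
        System.out.println("first page only: " + firstPageOnly); // 7000
        System.out.println("all pages:       " + allPages);      // 10000
    }
}
```

In this model the page size only bounds how many rows are held at once; it does not need to match the (possibly unknown) row count, which is the behavior the questions about reading all data and avoiding OOM are asking for.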