[jira] [Created] (CASSANDRA-4258) Are we sorting the bloom filters in memory to increase the probability of getting proper result instead of just avoiding the false positive?

2012-05-18 Thread Samarth Gahire (JIRA)
Samarth Gahire created CASSANDRA-4258:
-

 Summary: Are we sorting the bloom filters in memory to increase 
the probability of getting proper result instead of just avoiding the false 
positive?
 Key: CASSANDRA-4258
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4258
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 1.1.1
Reporter: Samarth Gahire
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 1.1.1


I was wondering whether there is any logic that decides which bloom filter is 
checked first, so as to increase the probability of finding the result rather 
than just minimizing the probability of a false positive.

( *Note:* I have looked into the code, and I am not talking about *Getting 
BloomFilter with the lowest practical false positive probability* or *Getting 
smallest BloomFilter that can provide the given false positive probability rate 
for the given number of elements.* )

*Consider the following scenario:*

1) In our Cassandra cluster we insert 130 million rows daily into a single 
column family, and in practice we cannot keep this data compacted at all times. 
(Loading already takes a long time, and compaction may take so long that it 
would affect the schedule for loading the next day's data.)
2) We insert the same row keys every day (the set of 130 million row keys is 
the same), each day with a different super column.
{code}
For date 20120101 we have

super_CF = {
   row_1: { _super_column_20120101: { col1 : val1, col2 : val2 } },
   row_2: { _super_column_20120101: { col1 : val3, col2 : val4 } },
   row_3: { _super_column_20120101: { col1 : val5, col2 : val6 } }
}

and for date 20120102 it will be

super_CF = {
   row_1: { _super_column_20120102: { col1 : val7, col2 : val8 } },
   row_2: { _super_column_20120102: { col1 : val9, col2 : val10 } },
   row_3: { _super_column_20120102: { col1 : val11, col2 : val12 } }
}

Note that the set of row keys is the same for every day; only the super column changes.
{code}
3) So if we do not compact the data for, say, 30 days, each row key is present 
in 30 different sstables.
4) So in the worst case, even with zero probability of a false positive, there 
could be 30 unnecessary disk accesses.
5) Because of this we are experiencing severely degraded read performance.
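
To make the scenario concrete, here is a minimal sketch in Java. It is not Cassandra code: FakeSSTable, buildDailySSTables and countAccesses are hypothetical names, and the row-key filter is idealised as an exact set lookup (zero false positives). It only illustrates that a filter keyed on the row key lets every sstable through when the same row keys are written every day.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RowKeyFilterSketch {
    /** Hypothetical stand-in for one sstable: row key -> super column names it holds. */
    static final class FakeSSTable {
        final Map<String, Set<String>> rows = new HashMap<>();

        // Idealised row-key filter with zero false positives (an assumption of this sketch).
        boolean mightContainRow(String rowKey) {
            return rows.containsKey(rowKey);
        }

        boolean hasSuperColumn(String rowKey, String superColumn) {
            Set<String> cols = rows.get(rowKey);
            return cols != null && cols.contains(superColumn);
        }
    }

    /** Builds one sstable per day; every day writes the same row keys under a new super column. */
    static List<FakeSSTable> buildDailySSTables(int days, int rowCount) {
        List<FakeSSTable> tables = new ArrayList<>();
        for (int day = 1; day <= days; day++) {
            FakeSSTable table = new FakeSSTable();
            String superColumn = String.format("_super_column_201201%02d", day);
            for (int r = 1; r <= rowCount; r++) {
                Set<String> cols = new HashSet<>();
                cols.add(superColumn);
                table.rows.put("row_" + r, cols);
            }
            tables.add(table);
        }
        return tables;
    }

    /** Returns {sstables that pass the row-key filter, sstables that hold the super column}. */
    static int[] countAccesses(List<FakeSSTable> tables, String rowKey, String superColumn) {
        int passed = 0;
        int useful = 0;
        for (FakeSSTable table : tables) {
            if (table.mightContainRow(rowKey)) {
                passed++; // the filter says "maybe", so the sstable has to be read
                if (table.hasSuperColumn(rowKey, superColumn)) {
                    useful++;
                }
            }
        }
        return new int[] {passed, useful};
    }

    public static void main(String[] args) {
        List<FakeSSTable> tables = buildDailySSTables(30, 3);
        int[] counts = countAccesses(tables, "row_2", "_super_column_20120115");
        System.out.println("sstables consulted: " + counts[0]
                + ", holding the super column: " + counts[1]);
    }
}
```

With 30 uncompacted days, all 30 sstables pass the row-key filter although only one of them holds the requested super column.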

*Proposed solution:*
1) We could sort the bloom filters so that the bloom filter of an sstable that 
has successfully served a read request gets higher priority than the others. 
That is, we would first check the bloom filter of the sstable that was most 
recently accessed and that successfully returned the requested columns (an MRU 
approach, since the probability of getting the result from the MRU sstable is 
greater). This way we can reduce disk access.

2) The point is that we should have some logic for ordering bloom filters to 
boost read performance in cases where sstables are not yet compacted.
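
The MRU idea above could be sketched roughly as follows. This is an illustration only, not a patch: MruSSTableOrder and its methods are hypothetical, sstables are identified by plain strings, and the sketch assumes a read may stop at the first sstable that returns the requested columns (whether Cassandra can safely stop early is outside this sketch).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class MruSSTableOrder {
    // Front of the list = sstable that most recently served a read successfully.
    private final List<String> order;

    MruSSTableOrder(List<String> sstableNames) {
        this.order = new ArrayList<>(sstableNames);
    }

    /** Snapshot of the order in which sstables would be probed. */
    List<String> probeOrder() {
        return new ArrayList<>(order);
    }

    /** Records that a read was served from this sstable by moving it to the front. */
    void recordHit(String sstable) {
        if (order.remove(sstable)) {
            order.add(0, sstable);
        }
    }

    /** Probes sstables in MRU order until one holds the data; returns how many were probed. */
    int read(Predicate<String> holdsRequestedColumns) {
        int probes = 0;
        for (String sstable : probeOrder()) {
            probes++;
            if (holdsRequestedColumns.test(sstable)) {
                recordHit(sstable);
                return probes;
            }
        }
        return probes; // nothing found: every sstable was probed
    }

    public static void main(String[] args) {
        MruSSTableOrder mru = new MruSSTableOrder(Arrays.asList("hc-1", "hc-2", "hc-3", "hc-4"));
        Predicate<String> wanted = name -> name.equals("hc-3");
        System.out.println("first read probes:  " + mru.read(wanted));
        System.out.println("second read probes: " + mru.read(wanted));
        System.out.println("probe order now:    " + mru.probeOrder());
    }
}
```

A move-to-front list is the simplest form of this ordering; a real implementation would also need to be thread-safe and to cope with sstables appearing and disappearing during flushes and compaction.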

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-4258) Are we sorting the bloom filters in memory to increase the probability of getting proper result instead of just avoiding the false positive?

2012-05-18 Thread Samarth Gahire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278965#comment-13278965
 ] 

Samarth Gahire commented on CASSANDRA-4258:
---

Please try to understand the scenario I am describing. Even when the bloom 
filter result is not a false positive, there will be a disk access in this 
scenario: the row key is present in the sstable, but the super column we are 
looking for is not. So this is an unnecessary disk access.
E.g. a single row may have content in all the sstables, whether as super 
columns or as simple columns.
In such cases, choosing the right bloom filter first will certainly save some 
disk accesses and improve read performance.

 Are we sorting the bloom filters in memory to increase the probability of 
 getting proper result instead of just avoiding the false positive?
 

 Key: CASSANDRA-4258
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4258
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 1.1.1
Reporter: Samarth Gahire
Priority: Minor
  Labels: bloom-filter, read
 Fix For: 1.1.1

   Original Estimate: 336h
  Remaining Estimate: 336h






[jira] [Commented] (CASSANDRA-4196) While loading data using BulkOutPutFormat gettting an exception java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.a

2012-05-09 Thread Samarth Gahire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13271539#comment-13271539
 ] 

Samarth Gahire commented on CASSANDRA-4196:
---

I agree with Jonathan. Today, after reinstalling Cassandra properly, bulk 
loading is working fine. We might have mixed versions while upgrading.
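
For readers wondering how a version mix-up can surface as this exception, here is a generic Java sketch. Every name in it (FilterCastSketch, Murmur2Filter, Murmur3Filter, Type, serialize) is hypothetical and it does not reproduce Cassandra's FilterFactory; it only shows that a serializer which casts to the concrete type it was told to expect fails when handed a filter of the other type.

```java
class FilterCastSketch {
    interface Filter {}

    // Hypothetical stand-ins for two incompatible filter implementations.
    static final class Murmur2Filter implements Filter {}
    static final class Murmur3Filter implements Filter {}

    enum Type { MURMUR2, MURMUR3 }

    /** Hypothetical serializer that trusts the declared type and casts accordingly. */
    static String serialize(Filter filter, Type declared) {
        if (declared == Type.MURMUR2) {
            // Throws ClassCastException if the filter is really a Murmur3Filter.
            Murmur2Filter f = (Murmur2Filter) filter;
            return "murmur2:" + f.getClass().getSimpleName();
        }
        Murmur3Filter f = (Murmur3Filter) filter;
        return "murmur3:" + f.getClass().getSimpleName();
    }

    /** Returns true if serializing with the declared type fails with a ClassCastException. */
    static boolean failsWithClassCast(Filter filter, Type declared) {
        try {
            serialize(filter, declared);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Filter created = new Murmur3Filter();
        System.out.println("declared MURMUR2 fails: " + failsWithClassCast(created, Type.MURMUR2));
        System.out.println("declared MURMUR3 fails: " + failsWithClassCast(created, Type.MURMUR3));
    }
}
```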

 While loading data using BulkOutPutFormat gettting an exception 
 java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter 
 cannot be cast to org.apache.cassandra.utils.Murmur2BloomFilter
 -

 Key: CASSANDRA-4196
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4196
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop, Tools
Affects Versions: 1.2
Reporter: Samarth Gahire
Assignee: Dave Brosius
Priority: Minor
  Labels: bulkloader, cassandra, hadoop, hash
 Attachments: 4196_create_correct_bf_type.diff

   Original Estimate: 48h
  Remaining Estimate: 48h


[jira] [Commented] (CASSANDRA-4196) While loading data using BulkOutPutFormat gettting an exception java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.a

2012-05-06 Thread Samarth Gahire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269381#comment-13269381
 ] 

Samarth Gahire commented on CASSANDRA-4196:
---

So is this issue fixed in cassandra-1.1 rc1? Do I need to apply a patch to 
resolve it, or will it be fixed only in cassandra-1.2?

 While loading data using BulkOutPutFormat gettting an exception 
 java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter 
 cannot be cast to org.apache.cassandra.utils.Murmur2BloomFilter
 -

 Key: CASSANDRA-4196
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4196
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop, Tools
Affects Versions: 1.2
Reporter: Samarth Gahire
Assignee: Dave Brosius
Priority: Minor
  Labels: bulkloader, cassandra, hadoop, hash
 Fix For: 1.2

 Attachments: 4196_create_correct_bf_type.diff

   Original Estimate: 48h
  Remaining Estimate: 48h


[jira] [Created] (CASSANDRA-4196) While loading data using BulkOutPutFormat gettting an exception java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.apa

2012-04-30 Thread Samarth Gahire (JIRA)
Samarth Gahire created CASSANDRA-4196:
-

 Summary: While loading data using BulkOutPutFormat gettting an 
exception java.lang.ClassCastException: 
org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to 
org.apache.cassandra.utils.Murmur2BloomFilter
 Key: CASSANDRA-4196
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4196
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop, Tools
Affects Versions: 1.1.0
Reporter: Samarth Gahire
Assignee: Vijay
Priority: Minor
 Fix For: 1.1.1


We are using cassandra-1.1 rc1 for our production setup and are getting the 
following error while bulk loading data using BulkOutputFormat.

{code}
 WARN 09:04:52,384 Failed closing IndexWriter(/cassandra/production/Data_daily/production-Data_daily-tmp-hc-2692)
java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.apache.cassandra.utils.Murmur2BloomFilter
    at org.apache.cassandra.utils.FilterFactory.serialize(FilterFactory.java:50)
    at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:410)
    at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:94)
    at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:255)
    at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:154)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:92)
    at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:178)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
 WARN 09:04:52,393 Failed closing IndexWriter(/cassandra/production/Data_daily/production-Data_daily-tmp-hc-2693)
java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.apache.cassandra.utils.Murmur2BloomFilter
    at org.apache.cassandra.utils.FilterFactory.serialize(FilterFactory.java:50)
    at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:410)
    at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:94)
    at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:255)
    at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:154)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:92)
    at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:178)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
 WARN 09:04:52,544 Failed closing IndexWriter(/cassandra/production/Data_daily/production-Data_daily-tmp-hc-2698)
java.lang.ClassCastException: org.apache.cassandra.utils.Murmur3BloomFilter cannot be cast to org.apache.cassandra.utils.Murmur2BloomFilter
    at org.apache.cassandra.utils.FilterFactory.serialize(FilterFactory.java:50)
    at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:410)
    at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:94)
    at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:255)
    at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:154)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:92)
    at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:178)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
ERROR 09:04:52,544 Exception in thread Thread[Thread-39,5,main]
[3:02:34 PM] Mariusz Dymarek: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:520)
    at java.nio.HeapByteBuffer.getShort(HeapByteBuffer.java:289)
    at org.apache.cassandra.db.CounterColumn.create(CounterColumn.java:79)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:102)
    at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:251)
    at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:271)
    at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:228)
    at edu.stanford.ppl.concurrent.SnapTreeMap.init(SnapTreeMap.java:453)
    at org.apache.cassandra.db.AtomicSortedColumns$Holder.init(AtomicSortedColumns.java:301)
    at org.apache.cassandra.db.AtomicSortedColumns.init(AtomicSortedColumns.java:77)
    at org.apache.cassandra.db.AtomicSortedColumns.init(AtomicSortedColumns.java:48)
    at org.apache.cassandra.db.AtomicSortedColumns$1.fromSorted(AtomicSortedColumns.java:61)
    at