[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-04-18 Thread Jonathan Ellis (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256804#comment-13256804
 ] 

Jonathan Ellis commented on CASSANDRA-2392:
---

For the record, I'm still fine with saying "loading caches will slow down 
startup, deal with it," but I think we have a good plan of attack on 3762 now 
and it may be simpler to just do that first, before rebasing this.  Which is 
also fine.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.2
>
> Attachments: 0001-CASSANDRA-2392-v6.patch, 
> 0001-re-factor-first-and-last.patch, 0001-save-summaries-to-disk-v4.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch, 
> CASSANDRA-2392-v5.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-24 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192673#comment-13192673
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Great, now that we have this one ready, we really need to finish 
CASSANDRA-3762 to find out whether this design makes sense or whether we 
should go with another strategy.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.2
>
> Attachments: 0001-CASSANDRA-2392-v6.patch, 
> 0001-re-factor-first-and-last.patch, 0001-save-summaries-to-disk-v4.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch, 
> CASSANDRA-2392-v5.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190817#comment-13190817
 ] 

Vijay commented on CASSANDRA-2392:
--

>>> Will you be able to deal with CASSANDRA-3762 in time for 1.1 release?
Yeah, I have a prototype working; I just have to do a quick benchmark to see 
if it makes sense :)

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.2
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk-v4.patch, 0001-save-summaries-to-disk.patch, 
> 0002-save-summaries-to-disk-v2.patch, 0002-save-summaries-to-disk-v3.patch, 
> 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190814#comment-13190814
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Thanks for the patch, I'm going to take a look soon! Will you be able to deal 
with CASSANDRA-3762 in time for the 1.1 release? That way we will be able to 
move it and this one back to 1.1.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.2
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk-v4.patch, 0001-save-summaries-to-disk.patch, 
> 0002-save-summaries-to-disk-v2.patch, 0002-save-summaries-to-disk-v3.patch, 
> 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190792#comment-13190792
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


{quote}
Well, that's for the following...
if (!recreatebloom && !cacheLoading && loadSummaries)
return;

We still need to complete the indexes and builders even though you don't go 
through all the code in the try block.
{quote}

Then I suggest you skip the primary_index loop phase instead of returning, 
because using finally that way is neither good practice nor good style.

bq. Not sure if it can have different semantics, but I will remove the 
refactor; not a big deal.

Please do.

bq. I don't think we leak descriptors in either case; it is in the finally 
block already, and all we are planning to save is a variable assignment. I 
will do that, not a big deal.

I don't follow here. What I mean is to change the try blocks of the 
{load, save}Summaries methods to look like the code posted below, because in 
your version you close the stream only if an IOException is thrown:

{noformat}
+try
+{
+    iStream = new DataInputStream(new FileInputStream(inMemoryDataFile));
+    reader.indexSummary = IndexSummary.serializer.deserialize(iStream, reader.descriptor);
+    ibuilder.deserializeBounds(iStream);
+    dbuilder.deserializeBounds(iStream);
+}
+catch (IOException e)
+{
+    // corrupted hence delete it and let it load it now.
+    if (inMemoryDataFile.exists())
+        inMemoryDataFile.delete();
+    return false;
+}
+finally
+{
+    FileUtils.closeQuietly(iStream);
+}
{noformat}

and

{noformat}
+try
+{
+    oStream = new DataOutputStream(new FileOutputStream(summaryFile));
+    IndexSummary.serializer.serialize(reader.indexSummary, oStream);
+    ibuilder.serializeBounds(oStream);
+    dbuilder.serializeBounds(oStream);
+}
+catch (IOException e)
+{
+    // corrupted hence delete it and let it load it now.
+    if (summaryFile.exists())
+        summaryFile.delete();
+}
+finally
+{
+    FileUtils.closeQuietly(oStream);
+}
{noformat}

This makes sure that we close the stream every time and not only when an 
IOException is thrown. A file descriptor can still be closed after the 
original file has been deleted, so everything is safe even if an IOException 
is thrown and the file is deleted before the finally block runs.
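
A minimal, self-contained sketch of the close-in-finally pattern described above. It is not the patch itself: the class SummaryIo, its save/load methods, and the flat array of positions are hypothetical stand-ins, and closeQuietly here only mimics what FileUtils.closeQuietly is assumed to do.

```java
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical illustration; names and file layout are assumptions, not the patch.
class SummaryIo
{
    // Stand-in for FileUtils.closeQuietly: close and swallow any IOException.
    static void closeQuietly(Closeable c)
    {
        if (c == null)
            return;
        try
        {
            c.close();
        }
        catch (IOException ignored)
        {
        }
    }

    // Returns true if the positions were written; deletes the partial file on failure.
    static boolean save(File file, long[] positions)
    {
        DataOutputStream out = null;
        try
        {
            out = new DataOutputStream(new FileOutputStream(file));
            out.writeInt(positions.length);
            for (long p : positions)
                out.writeLong(p);
            return true;
        }
        catch (IOException e)
        {
            if (file.exists())
                file.delete();
            return false;
        }
        finally
        {
            // Runs on success and on failure, so the descriptor is never leaked.
            closeQuietly(out);
        }
    }

    // Returns the saved positions, or null if the file is missing or unreadable.
    static long[] load(File file)
    {
        DataInputStream in = null;
        try
        {
            in = new DataInputStream(new FileInputStream(file));
            int n = in.readInt();
            if (n < 0 || n > file.length() / 8)
                throw new IOException("implausible entry count: " + n);
            long[] positions = new long[n];
            for (int i = 0; i < n; i++)
                positions[i] = in.readLong();
            return positions;
        }
        catch (IOException e)
        {
            if (file.exists())
                file.delete();
            return null;
        }
        finally
        {
            closeQuietly(in);
        }
    }
}
```

The stream variable is declared outside the try block so that the finally block can see it, and closeQuietly tolerates null for the case where the constructor itself threw.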

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190784#comment-13190784
 ] 

Vijay commented on CASSANDRA-2392:
--

>>> indexSummary.complete() can be moved out from the try block because it 
>>> doesn't throw IOException
Well, that's for the following...
if (!recreatebloom && !cacheLoading && loadSummaries)
return;

We still need to complete the indexes and builders even though you don't go 
through all the code in the try block.

>>> is not a guaranteed thing which means that IndexSummary.last has a 
>>> different semantics
Not sure if it can have different semantics, but I will remove the refactor; 
not a big deal.

>>> Summaries methods are leaking file descriptors
I don't think we leak descriptors in either case; it is in the finally block 
already, and all we are planning to save is a variable assignment. I will do 
that, not a big deal.


> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190776#comment-13190776
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Also, the changes shown below are odd because you get the same behavior by 
keeping the code as it was: it will just throw an exception somewhere in the 
try block, run the code in the finally block, and never get to the 
indexSummary.complete() and {i,d}builder.complete(String) methods, which are 
no-ops in that case. Btw, indexSummary.complete() can be moved out of the try 
block because it doesn't throw IOException and is never reached if the code 
above it does, but that is not a big deal anyway.


{noformat}
+catch (IOException ex)
+{
+    exception = true;
+    throw ex;
 }
 finally
 {
+    // close the file first.
     FileUtils.closeQuietly(input);
+    if (!exception)
+    {
+        // finalize the load.
+        indexSummary.complete();
+        // finalize the state of the reader
+        ifile = ibuilder.complete(descriptor.filenameFor(Component.PRIMARY_INDEX));
+        dfile = dbuilder.complete(descriptor.filenameFor(Component.DATA));
+    }
 }
{noformat}
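
To illustrate the control flow being discussed, here is a small sketch, under the assumption that the finalization steps are placed after the try/finally rather than inside it. The class LoadFlow, the skipScan and failDuringScan flags, and the recorded step names are all hypothetical; they only model the ordering, not SSTableReader.load itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the load flow; it records which steps ran.
class LoadFlow
{
    static List<String> load(boolean skipScan, boolean failDuringScan)
    {
        List<String> steps = new ArrayList<String>();
        try
        {
            try
            {
                steps.add("open");
                // Skipping the scan with a condition (instead of an early return)
                // keeps the code after the try/finally reachable.
                if (!skipScan)
                {
                    steps.add("scan");
                    if (failDuringScan)
                        throw new IOException("simulated read failure");
                }
            }
            finally
            {
                steps.add("close"); // always runs
            }
            // Reached only when the try block completed normally, so no
            // "exception" flag is needed to guard the finalization.
            steps.add("complete");
        }
        catch (IOException e)
        {
            steps.add("failed");
        }
        return steps;
    }
}
```

With this shape, a failure during the scan still closes the input but never reaches the completion step, which is the behavior the flag in the finally block was trying to achieve.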

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190751#comment-13190751
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


bq. But the main idea is to reduce the code and the checks which we have to do 
just to populate the first and last variables. IMO it is better served in 
Index Summary, which already has the needed checks. By using maybeAddEntry() 
and marking the others private, we don't need extra checks elsewhere to 
populate the fields... first and last in an index is also a summary :)

Correct me if I'm wrong, but as I see in SSTableReader.load(...), the 
condition "SSTable.last == IndexSummary.last" is not guaranteed, which means 
that IndexSummary.last has different semantics from SSTable.last. As for the 
checks: I don't see many of those, and IndexSummary in its current state does 
not have anything to do with SSTable's last/first variables, so I don't really 
understand which checks you are talking about. If you really want to be 
pedantic about the domain of first/last, I agree that they could belong to the 
summary of the SSTable, but certainly not to the "index" one :)

bq. Because we read from the disk to populate the Index Summary? If yes, I can 
make sure that both patches go into the same release.

Because we would end up reading more data (e.g. some of the keys and all index 
and data positions would be read twice) from different files: primary_index 
and summary.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190743#comment-13190743
 ] 

Vijay commented on CASSANDRA-2392:
--

>>> I don't think that "0001-re-factor-first-and-last" is a good idea because 
>>> by moving first/last variables to IndexSummary
But the main idea is to reduce the code and the checks which we have to do just 
to populate the first and last variables. IMO it is better served in Index 
Summary, which already has the needed checks. By using maybeAddEntry() and 
marking the others private, we don't need extra checks elsewhere to populate 
the fields... first and last in an index is also a summary :)

>>> one release that could make start-up times even longer than right now
Because we read from the disk to populate the Index Summary? If yes, I can make 
sure that both patches go into the same release.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-22 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190740#comment-13190740
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Here are the last things with v3:

 - The {load, save}Summaries methods are leaking file descriptors because the 
{o, i}Stream is closed only when the method handles an IOException. 

 Nit: 

{code}
+FileInputStream input = new FileInputStream(inMemoryDataFile);
+iStream = new DataInputStream(input);
{code}
and
{code}
+FileOutputStream input = new FileOutputStream(summaryFile);
+oStream = new DataOutputStream(input);
{code}

can be changed to 
{noformat}
{i,o}Stream = new Data{Input, Output}Stream(new File{Input, Output}Stream(summaryFile));
{noformat}
because the input variable is not really needed.

I don't think that "0001-re-factor-first-and-last" is a good idea, because by 
moving the first/last variables to IndexSummary you change their semantics: 
they no longer indicate the first and last key that the SSTable keeps, but 
rather the first/last key covered by the IndexSummary of the individual 
SSTable. So I think we really should just keep those variables in the old 
place.

Also, I'm concerned that CASSANDRA-3762 is marked for 1.2 and this one for 
1.1, because if we don't get them into one release, start-up times could 
become even longer than they are now. That defeats the point of the current 
task, because there is a big chance that the key cache would be enabled on 
the big ColumnFamilies.
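
The semantic difference between the two "last" keys can be shown with a toy model. This sketch assumes, purely for illustration, that a summary keeps every Nth key of a sorted key list; the class SampledSummary and its sample method are hypothetical and are not how IndexSummary is necessarily implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a sampled summary over sorted keys.
class SampledSummary
{
    // Keeps every interval-th key, starting with the first one.
    static List<String> sample(List<String> sortedKeys, int interval)
    {
        List<String> sampled = new ArrayList<String>();
        for (int i = 0; i < sortedKeys.size(); i += interval)
            sampled.add(sortedKeys.get(i));
        return sampled;
    }
}
```

In this model the first sampled key always equals the first key of the table, but the last sampled key equals the last key of the table only when the key count happens to line up with the interval, so a "last" field stored in the summary would not in general be the last key of the SSTable.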

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk-v3.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-20 Thread Jonathan Ellis (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189948#comment-13189948
 ] 

Jonathan Ellis commented on CASSANDRA-2392:
---

I think it's fine to acknowledge that key cache load will negate the advantages 
of enabling saved indexsummaries for this ticket, and open another one to 
improve the key cache design.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-20 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189802#comment-13189802
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


bq. Renamed and made the recommended changes, except that we have all the 
in-memory data structures in one file instead of multiple files. They are 
handled differently and will be kind of throwaway data, so we can regenerate it.

I kind of liked it more when the component was Summary, because InMemoryData 
doesn't really tell what is inside. Please rename SegmentedFile 
serialize/deserialize to something like serializeBounds/deserializeBounds.

bq. I do see Keycache working in my tests... 

Sorry, I wasn't clear when I said that. It seems that summary save/load is 
pointless in its current form, because even if we have loaded the summary from 
disk we would still have to loop through the *whole* PRIMARY_INDEX if 
pre-cache (which is enabled by default) or re-create-BloomFilter is enabled. 
That practically means that we spend the same time on I/O there, with 
ibuilder.deserialize and dbuilder.deserialize on top of it. We would need to 
change the logic in SSTableReader.load(boolean, Set) so that it doesn't have 
such I/O overhead, because otherwise this will make it even slower compared to 
the time it takes now.


> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk-v2.patch, 
> 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-19 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189299#comment-13189299
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


bq. I am not sure how saving dataPosition will help, as we only have summaries 
every 128 keys or more, so how will we mark a boundary with it? For example, 
when each row is 100MB big.

Oh yes, you are right, we really need all boundary information from segmented 
files, my bad.


> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-19 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189198#comment-13189198
 ] 

Vijay commented on CASSANDRA-2392:
--

Hi Pavel

>> To avoid any seeks in the PRIMARY_INDEX file upon IndexSummary.deserialize I 
>> suggest to save key (only BB part) as well as index position on 
>> IndexSummary.serialize.
Will do. The initial idea was to save some disk space, as the keys in some 
cases can be really long :) and the index seeks were not that bad in my 
initial tests, but I will save it in v2.

>> I would also suggest to save dataPosition from the primary index into 
>> summaries file to avoid adding serialization to SegmentedFile because 
>> SegmentedFile serialize(...)/deserialize(...) are not really a 
>> serialize/deserialize 
I am not sure how saving dataPosition will help, as we only have summaries 
every 128 keys or more, so how will we mark a boundary with it? For example, 
when each row is 100MB big.

>> can you please explain this chunk of code to me? 
The idea is to save the summary when SSTable creation/load completes (as there 
isn't any temporary state for them and they fit in memory). If the file is 
corrupted, deleted, or not there, we will just recalculate the summaries 
instead of depending on it.
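
The "use it if present, otherwise recalculate" idea above can be sketched as follows. This is only an illustration under stated assumptions: the class SummaryCache, the loadOrRebuild method, the Supplier standing in for the expensive index scan, and the flat array of positions are all hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical illustration of treating the saved summary as throwaway data.
class SummaryCache
{
    static long[] loadOrRebuild(File file, Supplier<long[]> rebuild)
    {
        if (file.exists())
        {
            try (DataInputStream in = new DataInputStream(new FileInputStream(file)))
            {
                int n = in.readInt();
                if (n < 0 || n > file.length() / 8)
                    throw new IOException("implausible entry count: " + n);
                long[] positions = new long[n];
                for (int i = 0; i < n; i++)
                    positions[i] = in.readLong();
                return positions;
            }
            catch (IOException e)
            {
                // Corrupted or truncated: drop it and fall through to the rebuild.
                file.delete();
            }
        }

        long[] fresh = rebuild.get(); // the expensive path, e.g. scanning the index
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file)))
        {
            out.writeInt(fresh.length);
            for (long p : fresh)
                out.writeLong(p);
        }
        catch (IOException e)
        {
            // Saving is best-effort; the in-memory copy is still valid.
            file.delete();
        }
        return fresh;
    }
}
```

Because the saved file is never the source of truth, a failed read or write costs only a recomputation and never correctness.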

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2012-01-19 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189046#comment-13189046
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Thanks for the patch! Here is my review:

- Index summaries load in SSTableReader.load(boolean, Set) breaks 
key cache pre-load.

- The IndexSummary deserialize(...) method should be made static and return an 
IndexSummary object. This will also allow dropping the IndexSummary argument 
from SSTableReader.loadSummaries(...).

- To avoid any seeks in the PRIMARY_INDEX file upon IndexSummary.deserialize, I 
suggest saving the key (only the BB part) as well as the index position in 
IndexSummary.serialize.

- I would also suggest saving dataPosition from the primary index into the 
summaries file to avoid adding serialization to SegmentedFile, because 
SegmentedFile serialize(...)/deserialize(...) are not really a 
serialize/deserialize - they just save/read boundaries. This way you would be 
able to do deserialization and boundary load at the same time without 
saving/reading additional information to/from the disk, because only ibuilder 
needs indexPosition and only dbuilder needs dataPosition.

- loadSummaries should be renamed to something more appropriate, because that 
method does not only load index summaries, it also loads index and data 
builders. Strictly speaking it does not really load them but rather just 
deserializes boundaries into an existing object, which is not good practice.

- can you please explain this chunk of code to me?
{code}
+// don't rename summaries as it is not created yet and created while it is loaded.
+for (Component component : Sets.difference(components, Sets.newHashSet(Component.DATA, Component.SUMMARIES)))
     FBUtilities.renameWithConfirm(tmpdesc.filenameFor(component), newdesc.filenameFor(component));
{code}
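
The suggestions above about a static deserialize and storing the key next to its index position could look roughly like the sketch below. It is a hypothetical illustration only: the class SummaryEntries, its fields, and the on-disk layout (entry count, then length-prefixed key bytes followed by a long position) are assumptions and not the format used by the patch.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical summary holder; layout and names are assumptions for illustration.
class SummaryEntries
{
    final List<ByteBuffer> keys = new ArrayList<ByteBuffer>();
    final List<Long> indexPositions = new ArrayList<Long>();

    void add(ByteBuffer key, long indexPosition)
    {
        keys.add(key.duplicate());
        indexPositions.add(indexPosition);
    }

    // Writes each key together with its index position, so that reading the
    // summary back needs no seek into the primary index file.
    void serialize(DataOutput out) throws IOException
    {
        out.writeInt(keys.size());
        for (int i = 0; i < keys.size(); i++)
        {
            ByteBuffer key = keys.get(i).duplicate();
            byte[] bytes = new byte[key.remaining()];
            key.get(bytes);
            out.writeInt(bytes.length);
            out.write(bytes);
            out.writeLong(indexPositions.get(i));
        }
    }

    // Static factory: returns a new object instead of filling in an existing one.
    static SummaryEntries deserialize(DataInput in) throws IOException
    {
        SummaryEntries entries = new SummaryEntries();
        int count = in.readInt();
        for (int i = 0; i < count; i++)
        {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            entries.add(ByteBuffer.wrap(bytes), in.readLong());
        }
        return entries;
    }
}
```

Storing the key bytes costs extra disk space when keys are long, which is the trade-off against the index seeks that a position-only format would require on load.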



> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-re-factor-first-and-last.patch, 
> 0001-save-summaries-to-disk.patch, 0002-save-summaries-to-disk.patch
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-12-22 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174956#comment-13174956
 ] 

Vijay commented on CASSANDRA-2392:
--

Update: current plan for this ticket is to implement something like 
CASSANDRA-3623 for mmap'ed files and remove addPotentialBoundary() code.

> Saving IndexSummaries to disk
> -
>
> Key: CASSANDRA-2392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2392
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Vijay
>Priority: Minor
> Fix For: 1.1
>
>
> For nodes with millions of keys, doing rolling restarts that take over 10 
> minutes per node can be painful if you have 100 node cluster. All of our time 
> is spent on doing index summary computations on startup. It would be great if 
> we could save those to disk as well. Our indexes are quite large.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-12-19 Thread Jonathan Ellis (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172509#comment-13172509
 ] 

Jonathan Ellis commented on CASSANDRA-2392:
---

To answer the question: yes, let's ignore caches here.  Would like to do this 
for 1.1 as well.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-08 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146527#comment-13146527
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


Looking forward to seeing your patch.





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-08 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146511#comment-13146511
 ] 

Vijay commented on CASSANDRA-2392:
--

Cool, then we can do a small refactor of addPotentialBoundary() used by 
MmappedSegmentedFile from the indexFile and ignore keycache in this patch 
(which will be taken care of in CASSANDRA-3143)?





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-08 Thread Pavel Yaskevich (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146430#comment-13146430
 ] 

Pavel Yaskevich commented on CASSANDRA-2392:


We already hold a primary index file which has information about index/data 
segment boundaries, so saving boundary information twice would be redundant. 
Also, we don't want to save the key cache: first because it could be irrelevant 
by the time the node starts, and second because we are planning to add global 
key/row caches in 1.1. 





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-08 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146393#comment-13146393
 ] 

Vijay commented on CASSANDRA-2392:
--

Sorry, I wasn't clear earlier. 
In SSTR.load(recreatebloom, keysToLoadInCache) we call 
addPotentialBoundary() for mmap and compressionbuilder. 
We can save all of them (should we make it configurable?), and for the 
keycache we can do a testAndLoad. 

Basically, testAndLoad could work in one of two ways: 
Option 1) Check the bloom filter and then check the file for the keys; if a 
key is present, add it to the cache (if the cache was saved).
Option 2) Save the descriptor with the keycache file, so that testAndLoad 
can verify that the file exists (which will be cheaper during startup).
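
Option 2 could be sketched roughly as follows. This is only an illustration 
under stated assumptions: KeyCacheLoadSketch, SavedEntry and the idea of 
recording the data file path next to each saved key are hypothetical, and 
none of this is taken from existing Cassandra code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "Option 2": each saved key cache entry
// remembers which SSTable it came from, and entries are dropped on load
// when that SSTable no longer exists.
final class KeyCacheLoadSketch
{
    static final class SavedEntry
    {
        final String dataFilePath; // assumed to be derived from the descriptor
        final String key;
        final long position;

        SavedEntry(String dataFilePath, String key, long position)
        {
            this.dataFilePath = dataFilePath;
            this.key = key;
            this.position = position;
        }
    }

    // Keeps only entries whose data file is still on disk. This costs one
    // file existence check per entry, with no bloom filter or index reads.
    static List<SavedEntry> testAndLoad(List<SavedEntry> saved)
    {
        List<SavedEntry> live = new ArrayList<SavedEntry>();
        for (SavedEntry entry : saved)
        {
            if (new File(entry.dataFilePath).isFile())
                live.add(entry);
        }
        return live;
    }
}
```

A real implementation would probably check each distinct file only once 
rather than once per entry; the sketch keeps the per-entry check for brevity.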






[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-08 Thread Jonathan Ellis (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146308#comment-13146308
 ] 

Jonathan Ellis commented on CASSANDRA-2392:
---

What do you mean by saving ibuilder + dbuilder?  Serialize them somehow?





[jira] [Commented] (CASSANDRA-2392) Saving IndexSummaries to disk

2011-11-07 Thread Vijay (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146094#comment-13146094
 ] 

Vijay commented on CASSANDRA-2392:
--

We might also want to save: ibuilder and dbuilder + keycache with descriptor.
