[jira] [Commented] (ACCUMULO-1124) optimize index size in RFile
[ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297642#comment-15297642 ]

Josh Elser commented on ACCUMULO-1124:
--------------------------------------

bq. I experimented with shortening keys in the index and that gave some nice improvements, but not as much as I expected. I realized that even with those changes, bad keys were still being placed in the index. I added code to keep statistics on key sizes and used those statistics to try to select keys that were <=AVG(keySize). I also excluded keys that were too big (greater than 3 std dev from the mean).

I had the thought "how would we determine when index size is efficient" in the future (both evaluating the success of this change as well as identifying perf issues in the future). Did you give any thought about how we could expose this information more easily? Maybe we include some extra information in the file entry in metadata so that the master/monitor could easily aggregate/report on file statistics?

Not suggesting it needs to happen now, but wondering your thoughts (since I assume you were doing all this investigation by hand).

> optimize index size in RFile
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Accumulo-1.8-Integration-Tests - Build # 968 - Unstable! -- 1.6
Accumulo-1.8-Integration-Tests - Build # 968 - Unstable: Check console output at https://secure.penguinsinabox.com/jenkins/job/Accumulo-1.8-Integration-Tests/968/ to view the results.
Accumulo-Pull-Requests - Build # 288 - Aborted
The Apache Jenkins build system has built Accumulo-Pull-Requests (build #288) Status: Aborted Check console output at https://builds.apache.org/job/Accumulo-Pull-Requests/288/ to view the results.
[jira] [Created] (ACCUMULO-4314) Use statistics to choose better keys for RFile index
Keith Turner created ACCUMULO-4314:
--------------------------------------

             Summary: Use statistics to choose better keys for RFile index
                 Key: ACCUMULO-4314
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4314
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Keith Turner
            Assignee: Keith Turner
             Fix For: 1.6.6, 1.7.2

The commit for ACCUMULO-1124 makes two changes:

 * Generates shorter keys, which may not exist in the data, to place in the RFile index.
 * Uses statistics to make better choices about which keys to place in the index. These changes look for keys of average size or smaller and exclude large keys (keys more than 3 std dev from the mean).

The change to generate shorter keys cannot be made in 1.7.X and 1.6.X because it would generate RFiles that may not work properly with older 1.6 and 1.7 versions. However, the changes to use statistics to pick better keys could be made in 1.6 and 1.7.
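The statistics-based selection described in this issue can be illustrated with a small sketch. It is not the Accumulo implementation: the class `KeySizeStats`, the function `classify_index_candidate`, and the three labels are hypothetical names chosen for illustration, and keys are modeled as plain byte strings rather than Accumulo `Key` objects. The sketch keeps a running mean and standard deviation of key sizes, prefers keys no larger than the mean, and rejects keys more than 3 standard deviations above it.

```python
import math


class KeySizeStats:
    """Running mean and standard deviation of key sizes (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, size):
        self.count += 1
        delta = size - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (size - self.mean)

    def std_dev(self):
        # Population standard deviation; 0.0 before any key has been seen.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def classify_index_candidate(key, stats):
    """Label a key as an index candidate (hypothetical policy, for illustration).

    "prefer"     : size <= mean key size
    "reject"     : size > mean + 3 standard deviations
    "acceptable" : anything in between
    """
    size = len(key)
    if size > stats.mean + 3 * stats.std_dev():
        return "reject"
    if size <= stats.mean:
        return "prefer"
    return "acceptable"


if __name__ == "__main__":
    stats = KeySizeStats()
    for size in (8, 10, 12):
        stats.add(size)
    for key in (b"x" * 9, b"x" * 12, b"x" * 100):
        print(len(key), classify_index_candidate(key, stats))
```

A writer using such a policy would presumably keep appending to the current block until it sees a "prefer" key, with some upper bound on block growth so that a run of large keys cannot delay the index entry indefinitely; that bound is not specified in this thread.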
[jira] [Commented] (ACCUMULO-1124) optimize index size in RFile
[ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297313#comment-15297313 ]

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I experimented with shortening keys in the index and that gave some nice improvements, but not as much as I expected. I realized that even with those changes, bad keys were still being placed in the index. I added code to keep statistics on key sizes and used those statistics to try to select keys that were <=AVG(keySize). I also excluded keys that were too big (greater than 3 std dev from the mean).

With the key shortening and statistics changes I was able to reduce the index size for the file in my previous comment to that below.

{noformat}
RFile Version            : 8

Locality group           :
    Num   blocks           : 21,758
    Index level 1          : 3,048 bytes  1 blocks
    Index level 0          : 1,873,885 bytes  8 blocks
    First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
    Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
    Num entries            : 24,299,468
    Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 3,163 bytes
      Compressed size      : 1,515 bytes
      Compression type     : gz
{noformat}

At first I thought I could make these changes in 1.6 and 1.7. However, while working on this I realized the key shortening change is a breaking change, in that older RFile code would not be able to handle keys in the index that do not exist in the data. The changes to use statistics to choose better keys could be made in 1.6 and 1.7.
> optimize index size in RFile
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.
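The idea in the issue description, storing only enough of a key to direct the reader to the right block, can be sketched as a "shortest separator" computation. This is an illustration under assumptions, not the RFile code: keys are modeled as plain byte strings compared lexicographically (real Accumulo keys have row, column, visibility, and timestamp fields), and `shortest_separator` is a hypothetical helper name. Given the last key of one block and the first key of the next, it returns the shortest prefix of the second key that still sorts strictly after the first.

```python
def shortest_separator(prev_last, next_first):
    """Return the shortest s with prev_last < s <= next_first.

    Both arguments are byte strings and prev_last must sort strictly
    before next_first. The result is a prefix of next_first, so it may
    be a key that never appears in the data.
    """
    if not prev_last < next_first:
        raise ValueError("prev_last must sort strictly before next_first")

    # Length of the common prefix of the two keys.
    common = 0
    limit = min(len(prev_last), len(next_first))
    while common < limit and prev_last[common] == next_first[common]:
        common += 1

    # One byte past the common prefix is enough to sort after prev_last,
    # and any prefix of next_first sorts at or before next_first.
    return next_first[: common + 1]


if __name__ == "__main__":
    prev_last = b"row_0199:some/long/qualifier"
    next_first = b"row_0200:another/long/qualifier"
    print(shortest_separator(prev_last, next_first))
```

Because the separator may not exist in the data, a reader must treat index keys purely as boundaries, which is consistent with the comment above that older RFile code could not handle such keys.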
Accumulo-1.7 - Build # 233 - Aborted
The Apache Jenkins build system has built Accumulo-1.7 (build #233) Status: Aborted Check console output at https://builds.apache.org/job/Accumulo-1.7/233/ to view the results.
[jira] [Created] (ACCUMULO-4313) Improve Accumulo website
Mike Walch created ACCUMULO-4313:
------------------------------------

             Summary: Improve Accumulo website
                 Key: ACCUMULO-4313
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4313
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Mike Walch
            Assignee: Mike Walch
            Priority: Minor

Some issues:

 * Page width is not restricted.
 * Accumulo logo is not used in the navbar.
 * Navbar links need to be organized better.
 * Home page is very verbose and could be simplified.
 * Footer has too much wording/legalese.
 * ASF links need to exist on the website but could be put in their own section.

These issues are all aesthetic/subjective, so feel free to comment or disagree.
Accumulo-Master - Build # 1866 - Aborted
The Apache Jenkins build system has built Accumulo-Master (build #1866) Status: Aborted Check console output at https://builds.apache.org/job/Accumulo-Master/1866/ to view the results.
Accumulo-1.6 - Build # 984 - Fixed
The Apache Jenkins build system has built Accumulo-1.6 (build #984) Status: Fixed Check console output at https://builds.apache.org/job/Accumulo-1.6/984/ to view the results.
Accumulo-1.8 - Build # 13 - Aborted
The Apache Jenkins build system has built Accumulo-1.8 (build #13) Status: Aborted Check console output at https://builds.apache.org/job/Accumulo-1.8/13/ to view the results.
[jira] [Commented] (ACCUMULO-4164) Avoid copy of RFile Index blocks when in cache
[ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296890#comment-15296890 ]

Adam Fuchs commented on ACCUMULO-4164:
--------------------------------------

I would love to see the perf test results for this change. Can you post them, [~kturner]?

> Avoid copy of RFile Index blocks when in cache
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have been doing performance experiments with RFile. During the course of
> these experiments I noticed that RFile is not as fast as it should be in the
> case where index blocks are in cache and the RFile is not already open. The
> reason is that the RFile code copies and deserializes the index data even
> though it's already in memory.
>
> I made the following changes to RFile in a branch.
>
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stopped calling lots of synchronized methods during deserialization of
>    index info. The existing code uses ByteArrayInputStream, which results in
>    lots of fine-grained synchronization. Switching to an input stream that
>    offers the same functionality w/o sync showed a measurable performance
>    difference.
>
> These changes lead to better performance in the following two situations:
>
>  * When an RFile's data is in cache, but it's not open on the tserver.
>  * For RFiles with multilevel indexes with index data in cache. Currently
>    an open RFile only keeps the root node in memory. Lower level index nodes
>    are always read from the cache or DFS. The changes I made would always
>    avoid the copy and deserialization of lower level index nodes when in cache.
>
> I have seen significant performance improvements testing with the two cases
> above. My tests are currently based on a new API I am creating for RFile, so
> I cannot easily share them until I get that pushed.
>
> For the case where a tserver has all files frequently in use already open and
> those files have a single level index, these changes should not make a
> significant performance difference.
>
> These changes should result in less memory use for opening the same RFile
> multiple times for different scans (when data is in cache). In this case all
> of the RFiles would share the same byte array holding the serialized index
> data.
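Two of the changes listed in the issue description, avoiding a copy of cached index data and deserializing offsets lazily during binary search, can be sketched as follows. This is a minimal illustration, not RFile's actual index format or code: the serialized layout (a 4-byte entry count, a table of 4-byte offsets, then length-prefixed keys) and the names `build_index` and `LazyIndex` are assumptions made for this example.

```python
import struct


def build_index(keys):
    """Serialize sorted keys as [count][offset table][length-prefixed keys].

    Hypothetical layout for illustration only.
    """
    header_size = 4 + 4 * len(keys)
    offsets = []
    body = bytearray()
    for key in keys:
        offsets.append(header_size + len(body))
        body += struct.pack(">H", len(key)) + key
    table = b"".join(struct.pack(">I", off) for off in offsets)
    return struct.pack(">I", len(keys)) + table + bytes(body)


class LazyIndex:
    """Binary search over a serialized index without copying or pre-decoding it."""

    def __init__(self, buf):
        # memoryview wraps the cached buffer; no bytes are copied here.
        self._buf = memoryview(buf)
        (self._count,) = struct.unpack_from(">I", self._buf, 0)
        self.offsets_decoded = 0  # instrumentation for the example

    def _key_at(self, i):
        # Decode only the one offset this probe needs.
        (off,) = struct.unpack_from(">I", self._buf, 4 + 4 * i)
        self.offsets_decoded += 1
        (key_len,) = struct.unpack_from(">H", self._buf, off)
        # Only the probed key is materialized, so it can be compared.
        return bytes(self._buf[off + 2 : off + 2 + key_len])

    def find(self, key):
        """Return the position of the first entry >= key (lower bound)."""
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            if self._key_at(mid) < key:
                lo = mid + 1
            else:
                hi = mid
        return lo


if __name__ == "__main__":
    cached = build_index([b"%05d" % i for i in range(0, 2000, 2)])
    scan_a = LazyIndex(cached)  # two "open files" sharing one cached buffer
    scan_b = LazyIndex(cached)
    print(scan_a.find(b"00500"), scan_a.offsets_decoded)
```

In this sketch each lookup decodes only the offsets it probes, so a search over 1,000 entries touches at most 10 of them, and every `LazyIndex` opened over the same cached buffer shares that buffer, mirroring the memory-use point in the description.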
[jira] [Commented] (ACCUMULO-3470) Upgrade to Commons VFS 2.1
[ https://issues.apache.org/jira/browse/ACCUMULO-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296813#comment-15296813 ]

Dave Marion commented on ACCUMULO-3470:
---------------------------------------

I removed ReadOnlyHdfsFileProviderTest in 1.7 and beyond. I think my work is done here.

> Upgrade to Commons VFS 2.1
>
>                 Key: ACCUMULO-3470
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3470
>             Project: Accumulo
>          Issue Type: Task
>            Reporter: Dave Marion
>            Assignee: Dave Marion
>             Fix For: 1.6.6, 1.7.2, 1.8.0, 2.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Commons VFS 2.1 is nearing release. When released we need to remove the VFS
> related classes in the start module, update the imports, and update the
> version in the pom. Will set fixVersions when VFS is released.
[jira] [Commented] (ACCUMULO-3470) Upgrade to Commons VFS 2.1
[ https://issues.apache.org/jira/browse/ACCUMULO-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296791#comment-15296791 ]

Dave Marion commented on ACCUMULO-3470:
---------------------------------------

Ok, I reverted the commit for updating Commons VFS from 2.0 to 2.1 in the Accumulo 1.6 branch. I merged that change up to 1.7, reverted the revert commit, and merged that all the way up to master.

> Upgrade to Commons VFS 2.1
>
>                 Key: ACCUMULO-3470
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3470
>             Project: Accumulo
>          Issue Type: Task
>            Reporter: Dave Marion
>            Assignee: Dave Marion
>             Fix For: 1.6.6, 1.7.2, 1.8.0, 2.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Commons VFS 2.1 is nearing release. When released we need to remove the VFS
> related classes in the start module, update the imports, and update the
> version in the pom. Will set fixVersions when VFS is released.
Accumulo-1.8-Integration-Tests - Build # 967 - Failure! -- master
Accumulo-1.8-Integration-Tests - Build # 967 - Failure: Check console output at https://secure.penguinsinabox.com/jenkins/job/Accumulo-1.8-Integration-Tests/967/ to view the results.