[vfs] Cache size issue
Hi All,

We've been developing a system where we move around a lot of files. We started to use VFS because sometimes these files are on a remote filesystem, and VFS is a really great library which enables us to handle these cases uniformly. As the system came under heavy use recently, we experienced memory issues (OutOfMemoryErrors). After investigating the problem, the cause appears to be the FilesCache of the manager, which was growing without limit. We have a few questions regarding this:

1. Is this unbounded growth of a SoftRefFilesCache a bug or a feature? The javadoc says that if the JVM needs memory, the cache will be freed. But the one-by-one removal performed by the cache's maintenance thread doesn't seem to be very efficient.

2. We started to use LRUFilesCache, because with that we can limit the number of cached entries. Indeed it seems to have much lower memory usage (in line with the configured limit), but checking a heap dump showed that our LRU cache contained around ~21k entries even though its size was set to 500. Do you have any idea how that's possible?

I created a small test program which can reproduce the issue: https://svn.code.sf.net/p/screeningbee/code/trunk/vfs-cache-test/src/main/java/ch/systemsx/bee/vfscachetest/CacheSizeChecker.java

If you point the testroot at a directory containing a lot of files and limit the heap size (e.g. -Xmx128m), it will fail with the default SoftRefFilesCache and with WeakRefFilesCache, and it works with LRUFilesCache and NullFilesCache. Of course, in the real application we have a lot more memory, but it was also eaten up after a while.

We use Java 7 and VFS 2.0. Any help would be appreciated!

Bela
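For readers following along: the behaviour Bela expects from a size-bounded LRU cache can be sketched with nothing but the JDK. This is a minimal illustration of strict LRU eviction via LinkedHashMap, NOT the VFS LRUFilesCache implementation; the class and method names here are hypothetical. The point is that a correctly bounded cache stays at its limit no matter how many entries pass through it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a strictly size-bounded LRU cache (illustrative only,
// not the commons-vfs implementation).
public class LruSketch {
    static <K, V> Map<K, V> boundedLru(final int maxEntries) {
        // accessOrder=true: iteration order is least-recently-used first
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict as soon as the bound is exceeded
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> cache = boundedLru(500);
        for (int i = 0; i < 21000; i++) {
            cache.put("file" + i, "entry" + i);
        }
        // The bound holds: the cache never exceeds 500 entries.
        System.out.println(cache.size()); // prints 500
    }
}
```

If a heap dump shows far more entries than the bound, the extra objects are typically reachable through some other strong reference (for example, per-filesystem sub-maps or listeners), not through the LRU map itself.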
Re: [MATH] Restricted hierarchical clustering
Hi Thorsten,

this sounds like a very specific use-case of hierarchical clustering. I could imagine the following way to achieve it:

 * first cluster all data points with kmeans, with k=50, as you would like to have 50 clusters on level 2
 * take the 50 clusters and feed them into a HAC-like algorithm which finishes when the number of level 1 clusters has been formed (= 10)

I did experiment with this, see the code below. The test is very simple: I create sequences from the normal ASCII alphabet, AA .. ZZ. For the distance measure, I created a special variant of the Euclidean distance that takes the position in the sequence into account, e.g. AA is more similar to AB than to BA. The result of kmeans is directly used to bootstrap the HAC clustering. Is this something you had in mind?

(The generic type parameters below were stripped by the mail archive; they have been restored as intended.)

    public static class Sequence implements Clusterable {
        private char[] sequence;

        public Sequence(char[] seq) {
            sequence = seq;
        }

        public double[] getPoint() {
            double[] point = new double[sequence.length];
            for (int i = 0; i < point.length; i++) {
                point[i] = sequence[i] - 'A';
            }
            return point;
        }

        @Override
        public String toString() {
            return String.valueOf(sequence);
        }
    }

    public static class SequenceDistance implements DistanceMeasure {
        public double compute(double[] a, double[] b) {
            double dist = 0;
            for (int i = 0, j = 2; i < a.length; i++) {
                double diff = a[i] - b[i];
                dist += FastMath.pow(10, j) * diff * diff;
                --j;
                if (j < 1) {
                    j = 1;
                }
            }
            return FastMath.sqrt(dist);
        }
    }

    public static void main(String[] args) {
        List<Sequence> points = new ArrayList<Sequence>();
        for (char a = 'A'; a <= 'Z'; a++) {
            for (char b = 'A'; b <= 'Z'; b++) {
                points.add(new Sequence(new char[] { a, b }));
            }
        }

        KMeansPlusPlusClusterer<Sequence> kmeans =
                new KMeansPlusPlusClusterer<Sequence>(50, 100, new SequenceDistance());
        List<CentroidCluster<Sequence>> cluster = kmeans.cluster(points);

        Set<Cluster<Sequence>> currentClusters = new HashSet<Cluster<Sequence>>(cluster);
        while (currentClusters.size() > 10) {
            Cluster<Sequence> a = null;
            Cluster<Sequence> b = null;
            double minDistance = Double.MAX_VALUE;
            int i = 0;
            for (Cluster<Sequence> clusterA : currentClusters) {
                int j = 0;
                for (Cluster<Sequence> clusterB : currentClusters) {
                    if (j++ <= i) {
                        continue;
                    }
                    double distance = maxDistance(clusterA, clusterB, kmeans.getDistanceMeasure());
                    if (distance < minDistance) {
                        a = clusterA;
                        b = clusterB;
                        minDistance = distance;
                    }
                }
                i++;
            }
            currentClusters.remove(a);
            currentClusters.remove(b);
            Cluster<Sequence> merge = new HierarchicalCluster<Sequence>(minDistance, a, b);
            currentClusters.add(merge);
        }

        for (Cluster<Sequence> c : currentClusters) {
            System.out.println(c.getPoints());
            printLeafs(c);
            System.out.println("---");
        }
    }

    public static <T extends Clusterable> double maxDistance(Cluster<T> a, Cluster<T> b, DistanceMeasure dm) {
        double maxDistance = Double.MIN_VALUE;
        for (final T pA : a.getPoints()) {
            for (final T pB : b.getPoints()) {
                double d = dm.compute(pA.getPoint(), pB.getPoint());
                if (d > maxDistance) {
                    maxDistance = d;
                }
            }
        }
        return maxDistance;
    }

    public static <T extends Clusterable> void printLeafs(Cluster<T> cluster) {
        if (cluster instanceof HierarchicalCluster) {
            HierarchicalCluster<T> hc = (HierarchicalCluster<T>) cluster;
            //System.out.println("Cluster: distance=" + hc.getDistance() + " " + cluster.getPoints());
            if (hc.getLeftChild() != null) {
                printLeafs(hc.getLeftChild());
            }
            if (hc.getRightChild() != null) {
                printLeafs(hc.getRightChild());
            }
        } else {
            System.out.println("ClusterLeaf: " + cluster.getPoints());
        }
    }

On Tue, Nov 12, 2013 at 11:58 PM, em...@thorstenschaefer.de <
em...@thorstenschaefer.de> wrote:

> I saw Thomas’ patch in https://issues.apache.org/jira/browse/MATH-959
> which aims to add support for HAC to commons-math. However, I am just faced
> with a use case and wonder if/how this could be done either with existing
> methods or the proposed HAC algorithm there.
>
> Let's assume we have 1000 items to cluster. Each item represents a
> sequence, e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ and I can assign
> data points to each item which can
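Note that the merge loop above relies on the HierarchicalCluster class proposed in MATH-959, which is not part of a released commons-math version. For readers who want to try the same complete-linkage merge criterion without that patch, here is a self-contained sketch on plain 1-D points using only the JDK. The class and method names (HacSketch, completeLinkage, cluster) are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal complete-linkage agglomerative clustering on 1-D points.
// Clusters are plain lists of doubles; we repeatedly merge the pair whose
// *maximum* pairwise distance (complete linkage) is smallest, until the
// target number of clusters remains.
public class HacSketch {
    static double completeLinkage(List<Double> a, List<Double> b) {
        double max = 0;
        for (double x : a)
            for (double y : b)
                max = Math.max(max, Math.abs(x - y));
        return max;
    }

    static List<List<Double>> cluster(double[] points, int target) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) clusters.add(new ArrayList<>(Arrays.asList(p)));
        while (clusters.size() > target) {
            int bestA = -1, bestB = -1;
            double best = Double.MAX_VALUE;
            // find the closest pair under complete linkage
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = completeLinkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bestA = i; bestB = j; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // merge b into a
        }
        return clusters;
    }

    public static void main(String[] args) {
        // two obvious groups: {1.0, 1.1, 1.2} and {9.0, 9.1}
        List<List<Double>> result = cluster(new double[] {1.0, 1.1, 9.0, 1.2, 9.1}, 2);
        System.out.println(result); // prints [[1.0, 1.1, 1.2], [9.0, 9.1]]
    }
}
```

The same loop generalizes to the multi-dimensional case in the code above by swapping Math.abs for the DistanceMeasure; the O(n³) pair scan is fine when bootstrapping from only 50 kmeans clusters.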
Re: [compress] Random access of SevenZFile
On Wed, 13 Nov 2013 15:28:04 +0200, Damjan Jovanovic wrote:

> It is not possible to seek to an arbitrary file's contents in a 7z
> archive anyway, since 7z archives can use solid compression for some
> or all files, which means you potentially have to sequentially
> decompress some or all of the preceding files' contents to get to the
> contents of the one you want.

I see. That's unfortunate. I think I'll be dropping support for 7z, then.

M

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org
Re: [compress] Random access of SevenZFile
On Wed, Nov 13, 2013 at 3:18 PM, wrote:

> On Wed, 13 Nov 2013 06:05:06 +0100
> Stefan Bodewig wrote:
>
>> On 2013-11-12, wrote:
>>
>>> The 7z file format is (supposedly) a random access format, much like
>>> zip archives. However, the SevenZFile class seems to only expose a
>>> sequential interface (where I'm expected to seek over entries one at a
>>> time, presumably whilst unpacking files).
>>
>> Much like zip, 7z has file metadata at the end of the archive, so yes,
>> SevenZFile could build up a Map when opening the archive and provide
>> random access. Actually it does collect the information of all entries
>> (in Archive.files), only an API to use it for random access is missing.
>>
>> Things aren't all that bad, though. Repeatedly calling getNextEntry
>> will create streams for each entry but not consume them - so the files
>> are not unpacked while you iterate over the entries.
>>
>> Stefan
>
> Hello!
>
> I spent a bit of time yesterday implementing this; I built a HashMap of
> names to SevenZArchiveEntry instances by iterating over all entries upon
> archive loading.
>
> However, I'm having further problems actually obtaining streams to specific
> entries. The only interface exposed by SevenZFile is a set of mostly
> undocumented read() functions that don't state where the data comes from.
> The documentation for the no-argument read() function states
> "Read a byte of data".
>
> I'm assuming that the functions will actually read from the byte offset
> in the file described by the most recent entry returned by getNextEntry().
> Unfortunately, given that there's apparently no way to seek, this seems to
> imply that I can't do anything with a SevenZFile beyond sequentially
> decompressing all entries in order. This makes it essentially useless for
> my needs (writing an interactive archive management program).
>
> Am I missing something obvious here?
>
> M

It is not possible to seek to an arbitrary file's contents in a 7z archive anyway, since 7z archives can use solid compression for some or all files, which means you potentially have to sequentially decompress some or all of the preceding files' contents to get to the contents of the one you want.

Damjan
Re: [compress] Random access of SevenZFile
On Wed, 13 Nov 2013 06:05:06 +0100, Stefan Bodewig wrote:

> On 2013-11-12, wrote:
>
>> The 7z file format is (supposedly) a random access format, much like
>> zip archives. However, the SevenZFile class seems to only expose a
>> sequential interface (where I'm expected to seek over entries one at a
>> time, presumably whilst unpacking files).
>
> Much like zip, 7z has file metadata at the end of the archive, so yes,
> SevenZFile could build up a Map when opening the archive and provide
> random access. Actually it does collect the information of all entries
> (in Archive.files), only an API to use it for random access is missing.
>
> Things aren't all that bad, though. Repeatedly calling getNextEntry
> will create streams for each entry but not consume them - so the files
> are not unpacked while you iterate over the entries.
>
> Stefan

Hello!

I spent a bit of time yesterday implementing this; I built a HashMap of names to SevenZArchiveEntry instances by iterating over all entries upon archive loading.

However, I'm having further problems actually obtaining streams to specific entries. The only interface exposed by SevenZFile is a set of mostly undocumented read() functions that don't state where the data comes from. The documentation for the no-argument read() function states "Read a byte of data".

I'm assuming that the functions will actually read from the byte offset in the file described by the most recent entry returned by getNextEntry(). Unfortunately, given that there's apparently no way to seek, this seems to imply that I can't do anything with a SevenZFile beyond sequentially decompressing all entries in order. This makes it essentially useless for my needs (writing an interactive archive management program).

Am I missing something obvious here?

M
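The name-to-entry index Stefan suggests (and M built) is a one-pass scan over a sequential-only reader. Since a runnable example cannot use the SevenZFile API here, the sketch below models the pattern with a hypothetical Entry class and an Iterator standing in for repeated getNextEntry() calls; none of these names are from commons-compress.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of indexing a sequential-only archive: walk the entries once and
// build a name -> entry map for O(1) lookup by name. "Entry" and the
// iterator are stand-ins, NOT the SevenZFile API.
public class ArchiveIndexSketch {
    static class Entry {
        final String name;
        final long size;
        Entry(String name, long size) { this.name = name; this.size = size; }
    }

    // LinkedHashMap preserves encounter order, which matters if lookups are
    // later satisfied by decompressing entries sequentially up to the target.
    static Map<String, Entry> buildIndex(Iterator<Entry> entries) {
        Map<String, Entry> index = new LinkedHashMap<>();
        while (entries.hasNext()) {
            Entry e = entries.next();
            index.put(e.name, e);
        }
        return index;
    }

    public static void main(String[] args) {
        Iterator<Entry> archive = Arrays.asList(
                new Entry("readme.txt", 120),
                new Entry("data/a.bin", 4096),
                new Entry("data/b.bin", 2048)).iterator();
        Map<String, Entry> index = buildIndex(archive);
        System.out.println(index.get("data/a.bin").size); // prints 4096
    }
}
```

The index makes metadata lookup random-access; as the thread concludes, the entry *contents* may still require sequential decompression when the archive uses solid compression.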
Re: [math] eigenvector doubts and issues
Hi Thomas,

> are you sure that your eigenvectors are drawn correctly?

I assume yes. I used the eigenvectors with the highest and lowest eigenvalue, since in the simpler 3D case that gave me a plane cutting exactly through the dataset as I wished. But you are most probably right in using the plane that connects the two eigenvectors with the lower eigenvalues; I was using the wrong one for this purpose, I guess.

> I tried to reproduce your results with geogebra, and in my case the
> plane defined by the two eigenvectors with the least eigenvalue seems to
> split the values quite well, see this image:
>
> http://people.apache.org/~tn/pca.png

Actually it qualitatively splits the dataset more or less the same as using the first and last eigenvector: only 2 dots are on the wrong side of the plane. I was hoping to get more precise results for such simple example data. In fact, if I make the dataset more regular (i.e. equal intervals for x and y), things work as needed:

    double[] x = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4};
    double[] y = {0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3};
    double[] z = {1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2};

In that case the plane between the two lower-eigenvalue eigenvectors splits the dataset into higher and lower.

http://en.zimagez.com/zimage/eigenvectors3dcase3.php

I will do some more tests with complex data.

Thanks,
Andrea

> The results from the computations:
>
> covariance:
> {{0.7272727273, 0.0,          0.1818181818},
>  {0.0,          0.3409090909, 0.2272727273},
>  {0.1818181818, 0.2272727273, 0.2727272727}}
>
> lambda1 = 0.8056498828134406
> v1 = {0.9015723558; 0.1900593782; 0.3886447222}
>
> lambda2 = 0.4874287594020183
> v2 = {-0.3799516758; 0.7774478203; 0.5012101464}
>
> lambda3 = 0.04783044869363171
> v3 = {-0.2068913033; -0.5995434259; 0.7731388421}
>
> The plane is constructed with v2 and v3.
>
> Thomas
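The eigenvalues quoted above can be cross-checked without commons-math. Since the covariance matrix is symmetric and its eigenvalues are well separated, plain power iteration recovers the dominant eigenvalue; this sketch (hypothetical class/method names) computes it via the Rayleigh quotient.

```java
// Power iteration on the covariance matrix from the thread, to cross-check
// the reported dominant eigenvalue lambda1 without commons-math.
public class PowerIterationCheck {
    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                r[i] += m[i][j] * v[j];
        return r;
    }

    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    // Repeatedly apply the matrix and renormalize; the vector converges to
    // the dominant eigenvector, and the Rayleigh quotient to its eigenvalue.
    static double dominantEigenvalue(double[][] m, int iters) {
        double[] v = {1, 1, 1};
        for (int k = 0; k < iters; k++) {
            v = matVec(m, v);
            double n = norm(v);
            for (int i = 0; i < v.length; i++) v[i] /= n;
        }
        double[] mv = matVec(m, v);
        double dot = 0;
        for (int i = 0; i < v.length; i++) dot += v[i] * mv[i];
        return dot;
    }

    public static void main(String[] args) {
        double[][] cov = {
            {0.7272727273, 0.0,          0.1818181818},
            {0.0,          0.3409090909, 0.2272727273},
            {0.1818181818, 0.2272727273, 0.2727272727}};
        System.out.println(dominantEigenvalue(cov, 200)); // ~0.8056498828
    }
}
```

A quick sanity check on the quoted numbers: the trace 0.7272727273 + 0.3409090909 + 0.2727272727 = 1.3409090909 matches the sum of the three reported eigenvalues, so the decomposition is internally consistent.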