[vfs] Cache size issue

2013-11-13 Thread Béla Hullár
Hi All,

We've been developing a system in which we move around a lot of files. We
started to use VFS because sometimes these files are on a remote
filesystem, and VFS is a really great library which enables us to handle
these cases uniformly.

As the system recently started to be used very heavily, we experienced
memory issues (OutOfMemoryErrors). After investigating the problem, the
cause seems to be the FilesCache of the manager, which was growing without
limit. We have a few questions regarding this:

1. Is this unbounded growth of a SoftRefFilesCache a bug or a feature? The
javadoc says that if the JVM needs memory, the cache will be freed, but
this one-by-one removal in the maintenance thread of the cache doesn't
seem to be very efficient.

2. We started to use LRUFilesCache because with that we can limit the
number of cached entries. Indeed it seems to have much lower memory usage
(according to the set limit), but checking a heap dump showed that our
LRUFilesCache contained around 21k entries even though its size was set to
500. Do you have any idea how that is possible?
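
For context, we plug the cache into the manager roughly like this (a
minimal sketch of our setup, not the complete configuration):

import org.apache.commons.vfs2.cache.LRUFilesCache;
import org.apache.commons.vfs2.impl.DefaultFileSystemManager;

DefaultFileSystemManager manager = new DefaultFileSystemManager();
// The cache must be set before init(); we limit it to 500 entries.
manager.setFilesCache(new LRUFilesCache(500));
manager.init();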

I created a small test program which can reproduce the issue:
https://svn.code.sf.net/p/screeningbee/code/trunk/vfs-cache-test/src/main/java/ch/systemsx/bee/vfscachetest/CacheSizeChecker.java

If you set the test root to a directory containing a lot of files and
limit the heap size (e.g. -Xmx128m), it fails with the default
SoftRefFilesCache and with WeakRefFilesCache, but it works with
LRUFilesCache and NullFilesCache.
Of course, in the real application we have a lot more memory, but it was
also eaten up after a while.

We use Java 7 and VFS 2.0.

Any help would be appreciated!
Bela


Re: [MATH] Restricted hierarchical clustering

2013-11-13 Thread Thomas Neidhart
Hi Thorsten,

this sounds like a very specific use case of hierarchical clustering.
I could imagine the following way to achieve it:

 * first cluster all data points with k-means, with k = 50, as you would
like to have 50 clusters on level 2
 * take the 50 clusters and feed them into a HAC-like algorithm which
finishes when the desired number of level-1 clusters has been formed (= 10)

I did experiment with this; see the code below. The test is very simple: I
create sequences from the normal ASCII alphabet, AA .. ZZ.
For the distance measure, I created a special variant of the Euclidean
distance that takes the position within the sequence into account, e.g. AA
is more similar to AB than to BA.
The result of k-means is used directly to bootstrap the HAC clustering.
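
To make the weighting concrete: AA and AB differ only in position 1, which
is weighted by 10^1, so their distance is sqrt(10) ≈ 3.16, while AA and BA
differ in position 0, weighted by 10^2, giving a distance of 10.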

Is this something you had in mind?


// Imports assumed for this snippet; HierarchicalCluster comes from the
// MATH-959 patch, everything else from commons-math3 and the JDK.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.Clusterable;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;
import org.apache.commons.math3.ml.distance.DistanceMeasure;
import org.apache.commons.math3.util.FastMath;

public static class Sequence implements Clusterable {

    private final char[] sequence;

    public Sequence(char[] seq) {
        sequence = seq;
    }

    // Map each character to one coordinate: 'A' -> 0, 'B' -> 1, ...
    public double[] getPoint() {
        double[] point = new double[sequence.length];
        for (int i = 0; i < point.length; i++) {
            point[i] = sequence[i] - 'A';
        }
        return point;
    }

    @Override
    public String toString() {
        return String.valueOf(sequence);
    }
}

public static class SequenceDistance implements DistanceMeasure {

    // Position-weighted Euclidean distance: earlier positions get a
    // higher weight, so AA is more similar to AB than to BA.
    public double compute(double[] a, double[] b) {
        double dist = 0;
        for (int i = 0, j = 2; i < a.length; i++) {
            double diff = a[i] - b[i];
            dist += FastMath.pow(10, j) * diff * diff;

            --j;
            if (j < 1) {
                j = 1;
            }
        }
        return FastMath.sqrt(dist);
    }
}

public static void main(String[] args) {

    // Build all two-character sequences AA .. ZZ (26 * 26 = 676 points).
    List<Sequence> points = new ArrayList<Sequence>();
    for (char a = 'A'; a <= 'Z'; a++) {
        for (char b = 'A'; b <= 'Z'; b++) {
            points.add(new Sequence(new char[] { a, b }));
        }
    }

    // Level 2: k-means with k = 50.
    KMeansPlusPlusClusterer<Sequence> kmeans =
            new KMeansPlusPlusClusterer<Sequence>(50, 100, new SequenceDistance());
    List<CentroidCluster<Sequence>> cluster = kmeans.cluster(points);

    Set<Cluster<Sequence>> currentClusters =
            new HashSet<Cluster<Sequence>>(cluster);

    // Level 1: merge the two closest clusters (complete linkage)
    // until only 10 clusters remain.
    while (currentClusters.size() > 10) {
        Cluster<Sequence> a = null;
        Cluster<Sequence> b = null;
        double minDistance = Double.MAX_VALUE;
        int i = 0;
        for (Cluster<Sequence> clusterA : currentClusters) {
            int j = 0;
            for (Cluster<Sequence> clusterB : currentClusters) {
                // Only look at each unordered pair once.
                if (j++ <= i) {
                    continue;
                }
                double distance = maxDistance(clusterA, clusterB,
                        kmeans.getDistanceMeasure());
                if (distance < minDistance) {
                    a = clusterA;
                    b = clusterB;
                    minDistance = distance;
                }
            }
            i++;
        }

        currentClusters.remove(a);
        currentClusters.remove(b);
        Cluster<Sequence> merge =
                new HierarchicalCluster<Sequence>(minDistance, a, b);
        currentClusters.add(merge);
    }

    for (Cluster<Sequence> c : currentClusters) {
        System.out.println(c.getPoints());
        printLeafs(c);
        System.out.println("---");
    }
}

public static <T extends Clusterable> double maxDistance(Cluster<T> a,
        Cluster<T> b, DistanceMeasure dm) {
    // Complete linkage: the distance between two clusters is the
    // largest pairwise distance between their points.
    double maxDistance = Double.MIN_VALUE;
    for (final T pA : a.getPoints()) {
        for (final T pB : b.getPoints()) {
            double d = dm.compute(pA.getPoint(), pB.getPoint());
            if (d > maxDistance) {
                maxDistance = d;
            }
        }
    }
    return maxDistance;
}

public static <T extends Clusterable> void printLeafs(Cluster<T> cluster) {
    if (cluster instanceof HierarchicalCluster) {
        HierarchicalCluster<T> hc = (HierarchicalCluster<T>) cluster;
        //System.out.println("Cluster: distance=" + hc.getDistance() +
        //        " " + cluster.getPoints());

        if (hc.getLeftChild() != null) {
            printLeafs(hc.getLeftChild());
        }
        if (hc.getRightChild() != null) {
            printLeafs(hc.getRightChild());
        }
    } else {
        System.out.println("ClusterLeaf: " + cluster.getPoints());
    }
}





On Tue, Nov 12, 2013 at 11:58 PM, em...@thorstenschaefer.de <
em...@thorstenschaefer.de> wrote:

> I saw Thomas’ patch in https://issues.apache.org/jira/browse/MATH-959
> which aims to add support for HAC to commons-math. However, I am just faced
> with a use case and wonder if/how this could be done either with existing
> methods or the proposed HAC algorithm there.
>
> Let's assume we have 1000 items to cluster. Each item represents a
> sequence, e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ and I can assign
> data points  to each item which can 

Re: [compress] Random access of SevenZFile

2013-11-13 Thread org.apache.commons
On Wed, 13 Nov 2013 15:28:04 +0200
Damjan Jovanovic  wrote:
> 
> It is not possible to seek to an arbitrary file's contents in a 7z
> archive anyway, since 7z archives can use solid compression for some
> or all files, which means you potentially have to sequentially
> decompress some or all of the preceding files' contents to get to the
> contents of the one you want.

I see. That's unfortunate.

I think I'll be dropping support for 7z, then.

M




Re: [compress] Random access of SevenZFile

2013-11-13 Thread Damjan Jovanovic
On Wed, Nov 13, 2013 at 3:18 PM,   wrote:
> On Wed, 13 Nov 2013 06:05:06 +0100
> Stefan Bodewig  wrote:
>
>> On 2013-11-12,  wrote:
>>
>> > The 7z file format is (supposedly) a random access format, much like
>> > zip archives. However, the SevenZFile class seems to only expose a
>> > sequential interface (where I'm expected to seek over entries one at a
>> > time, presumably whilst unpacking files).
>>
>> Much like zip 7z has file metadata at the end of the archive, so yes,
>> SevenZFile could build up a Map when opening the archive and provide
>> random access.  Actually it does collect the information of all entries
>> (in Archive.files), only an API to use it for random access is missing.
>>
>> Things aren't all that bad, though.  Repeatedly calling getNextEntry
>> will create streams for each entry but not consume them - so the files
>> are not unpacked while you iterate over the entries.
>>
>> Stefan
>
> Hello!
>
> I spent a bit of time yesterday implementing this; I build a HashMap of
> names to SevenZArchiveEntry instances by iterating over all entries upon
> archive loading.
>
> However, I'm having further problems actually obtaining streams to specific
> entries. The only interface exposed by SevenZFile is a set of mostly
> undocumented read() functions that don't state where the data comes from.
> The documentation for the no-argument read() function states
> "Read a byte of data".
>
> I'm assuming that the functions will actually read from the byte offset
> in the file described by the most recent entry returned by getNextEntry().
> Unfortunately, given that there's apparently no way to seek, this seems to
> imply that I can't do anything with a SevenZFile beyond sequentially
> decompressing all entries in order. This makes it essentially useless for
> my needs (writing an interactive archive management program).
>
> Am I missing something obvious here?
>
> M

It is not possible to seek to an arbitrary file's contents in a 7z
archive anyway, since 7z archives can use solid compression for some
or all files, which means you potentially have to sequentially
decompress some or all of the preceding files' contents to get to the
contents of the one you want.
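
The best you can do is scan sequentially and decompress no more than
necessary, along these lines (an untested sketch; readEntry is a
hypothetical helper, not part of the SevenZFile API):

import java.io.File;
import java.io.IOException;

import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

// Scan forward to a named entry and read its contents. Entries we
// skip are not handed to the caller, but data in the same solid
// block may still be decompressed internally along the way.
static byte[] readEntry(File archiveFile, String name) throws IOException {
    SevenZFile sevenZ = new SevenZFile(archiveFile);
    try {
        SevenZArchiveEntry entry;
        while ((entry = sevenZ.getNextEntry()) != null) {
            if (!entry.getName().equals(name)) {
                continue;
            }
            byte[] content = new byte[(int) entry.getSize()];
            int off = 0;
            while (off < content.length) {
                int n = sevenZ.read(content, off, content.length - off);
                if (n < 0) {
                    break;
                }
                off += n;
            }
            return content;
        }
        return null; // entry not found
    } finally {
        sevenZ.close();
    }
}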

Damjan




Re: [compress] Random access of SevenZFile

2013-11-13 Thread org.apache.commons
On Wed, 13 Nov 2013 06:05:06 +0100
Stefan Bodewig  wrote:

> On 2013-11-12,  wrote:
> 
> > The 7z file format is (supposedly) a random access format, much like
> > zip archives. However, the SevenZFile class seems to only expose a
> > sequential interface (where I'm expected to seek over entries one at a
> > time, presumably whilst unpacking files).
> 
> Much like zip 7z has file metadata at the end of the archive, so yes,
> SevenZFile could build up a Map when opening the archive and provide
> random access.  Actually it does collect the information of all entries
> (in Archive.files), only an API to use it for random access is missing.
> 
> Things aren't all that bad, though.  Repeatedly calling getNextEntry
> will create streams for each entry but not consume them - so the files
> are not unpacked while you iterate over the entries.
> 
> Stefan

Hello!

I spent a bit of time yesterday implementing this; I build a HashMap of
names to SevenZArchiveEntry instances by iterating over all entries upon
archive loading. 
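
In outline it looks like this (a simplified sketch of what I wrote):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

SevenZFile archive = new SevenZFile(new File("archive.7z"));
Map<String, SevenZArchiveEntry> index =
        new HashMap<String, SevenZArchiveEntry>();
SevenZArchiveEntry entry;
// This only walks the metadata; no file contents are unpacked.
while ((entry = archive.getNextEntry()) != null) {
    index.put(entry.getName(), entry);
}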

However, I'm having further problems actually obtaining streams to specific 
entries. The only interface exposed by SevenZFile is a set of mostly
undocumented read() functions that don't state where the data comes from. 
The documentation for the no-argument read() function states 
"Read a byte of data".

I'm assuming that the functions will actually read from the byte offset
in the file described by the most recent entry returned by getNextEntry().
Unfortunately, given that there's apparently no way to seek, this seems to
imply that I can't do anything with a SevenZFile beyond sequentially 
decompressing all entries in order. This makes it essentially useless for
my needs (writing an interactive archive management program).

Am I missing something obvious here?

M




Re: [math] eigenvector doubts and issues

2013-11-13 Thread andrea antonello
Hi Thomas,

> are you sure that your eigenvectors are drawn correctly?

I assume so. I used the eigenvectors with the highest and lowest
eigenvalues, since in the simpler 3D case that gave me the plane
cutting exactly through the dataset as I wished. But you are most
probably right in using the plane spanned by the two eigenvectors with
the lowest eigenvalues; I was using the wrong one for this purpose, I
guess.

> I tried to reproduce your results with geogebra, and in my case the
> plane defined by the two eigenvectors with the least eigenvalue seems to
> split the values quite well, see this image:
>
> http://people.apache.org/~tn/pca.png

Actually, it qualitatively splits the dataset more or less the same as
using the first and last eigenvectors: only 2 dots are on the wrong side
of the plane.

I was hoping to be able to get more precise results for such simple
examples of data.

In fact, if I make the dataset more regular (i.e. equal intervals for
x and y), things work as needed:

double[] x = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4};
double[] y = {0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3};
double[] z = {1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2};
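
For reference, I compute the covariance matrix and the eigenvectors
roughly like this (a sketch of my setup; it assumes the Covariance and
EigenDecomposition classes from commons-math):

import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

double[][] data = new double[x.length][3];
for (int i = 0; i < x.length; i++) {
    data[i] = new double[] { x[i], y[i], z[i] };
}
RealMatrix cov = new Covariance(data).getCovarianceMatrix();
EigenDecomposition eig = new EigenDecomposition(cov);
// In my runs the eigenvalues come out sorted by decreasing value,
// so indices 1 and 2 are the two lowest-eigenvalue eigenvectors.
for (int i = 0; i < 3; i++) {
    System.out.println("lambda" + (i + 1) + " = " + eig.getRealEigenvalue(i)
            + ", v" + (i + 1) + " = " + eig.getEigenvector(i));
}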


In that case the plane spanned by the two eigenvectors with the lowest
eigenvalues splits the dataset into higher and lower points.

http://en.zimagez.com/zimage/eigenvectors3dcase3.php

I will do some more tests with complex data.

Thanks,
Andrea

>
> The results from the computations:
>
> covariance:
>  {{0.7272727273, 0.0,          0.1818181818},
>   {0.0,          0.3409090909, 0.2272727273},
>   {0.1818181818, 0.2272727273, 0.2727272727}}
>
> lambda1 =  0.8056498828134406
> v1 =  {0.9015723558; 0.1900593782; 0.3886447222}
>
> lambda2 = 0.4874287594020183
> v2 = {-0.3799516758; 0.7774478203; 0.5012101464}
>
>
> lambda3 = 0.04783044869363171
> v3 = {-0.2068913033; -0.5995434259; 0.7731388421}
>
> The plane is constructed with v2 and v3.
>
> Thomas
>
