As mentioned before, most file systems can provide callbacks on changes to files and directories. On Java SE 6, using this ability requires resorting to native code (JNA or JNI; see http://jnotify.sourceforge.net/). In SE 7 it will be implemented in NIO 2 (http://java.sun.com/developer/technicalArticles/javase/nio/).
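For reference, the NIO.2 API looks roughly like this — a minimal sketch that registers a temporary directory with a `WatchService` and reacts to a file being created there (the directory and file name are only for illustration):

```java
import java.nio.file.*;
import static java.nio.file.StandardWatchEventKinds.*;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        // Watch a temporary directory for created/modified/deleted entries.
        Path dir = Files.createTempDirectory("watchdemo");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);

        // Trigger an event by creating a file in the watched directory.
        Files.createFile(dir.resolve("hello.txt"));

        // Block until the key is signalled, then drain its events.
        WatchKey key = watcher.take();
        for (WatchEvent<?> event : key.pollEvents()) {
            System.out.println(event.kind() + " " + event.context());
        }
        key.reset();
        watcher.close();
    }
}
```

Note that on platforms without native change notification the default implementation falls back to polling, so event delivery may be delayed.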
I would look into rsync; as far as I know it implements a very efficient algorithm for comparing folders, and there is a Java implementation: http://jarsync.sourceforge.net/

On May 29, 4:09 pm, Korny Sietsma <ko...@sietsma.com> wrote:
> I keep reading this thread when I should be going to bed :) Sadly,
> this stuff is in what I call my "0.2% time" so I'm not working very
> hard on it right now.
>
> The ruby code, which basically works (but is rather ugly in parts),
> relies on reading the whole file tree into memory, and then traversing
> the tree, saving each node (file or dir) in a repository, which
> internally detects duplicates as they are added.
>
> The repository stores all known nodes, indexed by size, and then
> grouped into clumps of identical nodes. When you add a new node, it
> compares it to all clumps of the same size as the node, looking for an
> identical clump; if it finds one, the new node is added to the clump,
> otherwise it forms a new clump.
> (Sorry if this is a bit of a vague description; I haven't worked on
> this code for a while, so all the details are a bit vague.)
>
> Once the repository is built, it's pretty easy to throw away all
> non-duplicate nodes, and report on the duplicates.
>
> This works, but has some issues:
> - it's got some kind-of ugly handling for some special cases, like
>   making sure a directory doesn't match its own children if it only
>   has one child
> - it's a bit slow, and uses a lot of memory
> - it *only* handles exact matches.
>
> I've been playing with shingling and sketching algorithms that are
> used by search engines to identify nearly-identical documents, and I
> think they could be applied to this problem; in fact I suspect they
> could speed it up considerably.
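The size-indexed repository of "clumps" described above might be sketched like this in Java; the class and method names (`Repository`, `Clump`) are my own, not from Korny's Ruby code, and raw byte arrays stand in for real file/directory nodes:

```java
import java.util.*;

// One "clump" holds the paths of all nodes that compared identical.
class Clump {
    final byte[] content;                      // representative content
    final List<String> paths = new ArrayList<>();
    Clump(byte[] content, String path) { this.content = content; paths.add(path); }
}

class Repository {
    // Index clumps by size, so a new node is only compared to same-size clumps.
    private final Map<Integer, List<Clump>> bySize = new HashMap<>();

    void add(String path, byte[] content) {
        List<Clump> candidates = bySize.computeIfAbsent(content.length, k -> new ArrayList<>());
        for (Clump c : candidates) {
            if (Arrays.equals(c.content, content)) { // identical clump found
                c.paths.add(path);
                return;
            }
        }
        candidates.add(new Clump(content, path));    // no match: start a new clump
    }

    // Clumps with more than one path are the duplicates.
    List<List<String>> duplicates() {
        List<List<String>> out = new ArrayList<>();
        for (List<Clump> clumps : bySize.values())
            for (Clump c : clumps)
                if (c.paths.size() > 1) out.add(c.paths);
        return out;
    }
}

public class RepositoryDemo {
    public static void main(String[] args) {
        Repository repo = new Repository();
        repo.add("a.txt", "hello".getBytes());
        repo.add("b.txt", "hello".getBytes());
        repo.add("c.txt", "world".getBytes());
        System.out.println(repo.duplicates()); // a.txt and b.txt clump together
    }
}
```

Holding full contents in memory is exactly the memory problem Korny mentions; a real version would compare lazily or by digest.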
> (If you want to know more about this, the best reference online seems
> to be the book "Introduction to Information Retrieval", which is at
> http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
> - the chapter most relevant is at
> http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-...
> )
>
> But, like I said, this is my 0.2% time project, so it might be some
> time before I really do more than think about this. :)
>
> - Korny
>
> On Fri, May 29, 2009 at 4:51 PM, Daniel Lyons <fus...@storytotell.org> wrote:
> >
> > For whatever reason I just can't seem to put this problem down.
> >
> > I have rewritten the code substantially. A major bottleneck was using
> > Java's MD5 classes. The "Fast MD5" library really is, and that helped
> > a lot. I did get the -> notation to work, and I have a reasonable HOF
> > now for doing the winnowing, which might even be applicable to another
> > program someday, maybe.
> >
> > Anyway, I uploaded it here:
> > <http://clojure.googlegroups.com/web/dupfinder.clj>
> > and again I'd love any feedback anyone cares to give.
> >
> > Just to add insult to injury, I went ahead and programmed it again in
> > Ruby. The good news is that I can't seem to get Ruby to find all the
> > files the Clojure one finds, but the bad news is that the Ruby version
> > is like four times faster. I'd love to understand that. So I uploaded
> > that too: <http://clojure.googlegroups.com/web/dupfinder.rb>. Of
> > course it must be benefiting to some extent from the fact that the
> > Ruby version has a lot less abstraction, but I don't see how the
> > approach is fundamentally any different or why there would be such a
> > large performance disparity. I must be missing something big.
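The MD5-based grouping at the heart of a finder like dupfinder can be sketched with the standard `MessageDigest` API (the Fast MD5 library mentioned above replaces exactly this part; the file names and contents here are made up for illustration):

```java
import java.security.MessageDigest;
import java.util.*;

public class Md5Demo {
    // Hex-encode an MD5 digest of the given bytes.
    static String md5(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Group (name, content) pairs by MD5 — identical content collides.
        Map<String, List<String>> byHash = new LinkedHashMap<>();
        String[][] files = {{"a.txt", "hello"}, {"b.txt", "hello"}, {"c.txt", "world"}};
        for (String[] f : files)
            byHash.computeIfAbsent(md5(f[1].getBytes()), k -> new ArrayList<>()).add(f[0]);
        for (List<String> group : byHash.values())
            if (group.size() > 1) System.out.println("duplicates: " + group);
    }
}
```

A common optimization, in the spirit of the pre-sorting suggestion elsewhere in the thread, is to bucket files by size first and only hash files whose sizes collide.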
> > On May 28, 2009, at 6:50 AM, Korny Sietsma wrote:
> >> By the way, in response to whoever suggested pre-sorting files; I
> >> sort-of do this (in the old ruby version) but actually, mostly the
> >> program is looking for duplicate *directories* of files - the goal is
> >> to point it at my archive disk, and have it find the biggest identical
> >> subdirectories. Duplicate file checking is needed for this, but it's
> >> only a tiny part.
> >>
> >> And I'm playing with sketching algorithms at work right now, which
> >> look very handy for the next phase, which is to find the biggest
> >> *similar* subdirectories. That's the real goal - point a program at a
> >> terabyte archive disk, and have it spit out:
> >> "/archive/old_disks/laptop_2007a is 312gb and 99% similar to
> >> /archive/misc/stuff_from_2007"
> >> ... or sorting by file count:
> >> "/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97%
> >> similar to /archive/old/disks/laptop2006/unsorted/old_drives/
> >> old_archive/c_cpp_stuff/90s"
> >
> > I can think of three ways to approach this, none of which are
> > particularly easy.
> >
> > The first is to take the duplicate-file-finding function and look for
> > common suffixes of paths. It could almost be like running a fuzzy
> > duplicate finder against your duplicates. I suspect the performance
> > here will blow for no particular reason I can put my finger on.
> >
> > The second approach would be to use something like rsync's rolling
> > checksum. I bet you could use a rather stupid heuristic to find
> > candidate directories for similarity and then apply this algorithm
> > amongst the candidates. It would be handy if you could have a version
> > of diff that would look at two directories recursively and give up if
> > they were sufficiently different.
> >
> > Another option would be to ignore the filesystem altogether and look
> > for runs of similar blocks on the device directly.
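The rolling checksum behind the second approach above keeps two running sums that can be updated in constant time as a window slides over the data, instead of re-reading the whole window. A simplified Adler-32-style sketch (omitting the real algorithm's modulo), which checks each rolled value against a from-scratch computation:

```java
public class RollingChecksum {
    // Direct (non-rolling) computation of the two sums over a window.
    static long checksum(byte[] data, int start, int len) {
        long a = 0, b = 0;
        for (int i = 0; i < len; i++) {
            a += data[start + i] & 0xff;  // a: sum of window bytes
            b += a;                        // b: sum of running prefix sums
        }
        return (b << 16) | (a & 0xffff);
    }

    public static void main(String[] args) {
        byte[] data = "the quick brown fox".getBytes();
        int n = 8; // window size

        // Compute the first window directly, then roll it one byte at a time.
        long a = 0, b = 0;
        for (int i = 0; i < n; i++) { a += data[i] & 0xff; b += a; }
        for (int start = 1; start + n <= data.length; start++) {
            int out = data[start - 1] & 0xff;      // byte leaving the window
            int in  = data[start + n - 1] & 0xff;  // byte entering the window
            a = a - out + in;
            b = b - (long) n * out + a;
            // The rolled sums must match a from-scratch computation.
            long rolled = (b << 16) | (a & 0xffff);
            if (rolled != checksum(data, start, n))
                throw new AssertionError("mismatch at " + start);
        }
        System.out.println("all windows match");
    }
}
```

This cheap-to-slide property is what lets rsync scan one file at every offset for blocks of another, which is also why it looks promising for finding similar (not just identical) data.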
> > For some reason I
> > think this will perform better, but it'll probably be harder to reason
> > about during development, and I doubt it will be a cakewalk from Java.
> >
> > In any event, that's a very interesting problem, but for some reason
> > it's setting off my alarm bells. I think if it's going to be useful
> > it'll probably wind up being exponential time, and that the compromises
> > needed to make it run acceptably will probably wind up diminishing the
> > value. I have no idea why I think these things though... maybe I'm just
> > up too late.
> >
> > Another point: most OSes and filesystems have a change notification
> > API now (or an emulation of one) which you could use to maintain
> > indices, if you want to do something more than once. This probably
> > isn't relevant to this problem, though. It's on my mind because I was
> > just recently hacking on something for a friend that uses Mac OS X's
> > API to do that.
> >
> > Hope any of this helps; looking forward to (all) your feedback :)
> >
> > —
> > Daniel Lyons
> > http://www.storytotell.org -- Tell It!
>
> --
> Kornelis Sietsma   korny at my surname dot com
> "Every jumbled pile of person has a thinking part
> that wonders what the part that isn't thinking
> isn't thinking of"

You received this message because you are subscribed to the Google Groups "Clojure" group. For more options, visit this group at http://groups.google.com/group/clojure?hl=en