Re: How to find a needle in a haystack?
On Wed, 2010-05-19 at 15:40 -0430, Patrick O'Callaghan wrote:
> On Wed, 2010-05-19 at 14:07 -0400, arag...@dcsnow.com wrote:
> > The data in the files is of the unstructured binary type. When I do a
> > search, I have _most_ of the file name. Enough to uniquely identify
> > it.
>
> So you don't need to look into the file to get a match? Sounds like the
> best procedure would just be to keep an index of all the filenames and
> update it when files are added/removed (assuming you have control over
> both of these processes). A simple database should be able to handle
> this easily, which is pretty much what you suggested yourself. In fact
> it looks so simple that a Berkeley DB file would do it, without needing
> all the fancy DB machinery of MySQL or Postgres. See for example "man
> DB_File".

Is there any reason not to use the already existing updatedb/locate combo?

The Fedora updatedb is based on mlocate, which as far as I know uses the
mtime of directories to tell whether a directory has changed since the
last scan (a directory's mtime changes when files are added to it or
deleted from it). This should speed up runs considerably, unless a lot of
directories change between runs.

You can disable the default updatedb configuration and run it manually
(or from cron jobs), specifying one file system for each job. Let the
jobs run in parallel, each writing to its own database. Then set the
LOCATE_PATH environment variable globally so locate searches all the
databases.

Look at the man pages for updatedb, updatedb.conf, locate and
mlocate.db. The last one is very optional.

--
birger

--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
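[For the archive: the per-filesystem mlocate scheme described above might be
sketched roughly as below. The filesystem names and database paths are
invented for illustration, not anything from the thread.]

```shell
#!/bin/sh
# Index each backup filesystem into its own mlocate database, in parallel.
# /fs1../fs3 and the database paths are hypothetical examples.
for fs in /fs1 /fs2 /fs3; do
    updatedb -U "$fs" -o "/var/lib/mlocate/${fs#/}.db" &
done
wait

# Point locate at all the databases (a colon-separated list); with the
# default updatedb job disabled, export this globally, e.g. from a
# /etc/profile.d/ snippet, so every shell picks it up.
LOCATE_PATH=/var/lib/mlocate/fs1.db:/var/lib/mlocate/fs2.db:/var/lib/mlocate/fs3.db
export LOCATE_PATH

locate partial-file-name
```

Each updatedb run only walks one filesystem, so the jobs don't fight each
other for the same spindles.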
Re: How to find a needle in a haystack?
On Wed, 2010-05-19 at 14:07 -0400, arag...@dcsnow.com wrote:
> The data in the files is of the unstructured binary type. When I do a
> search, I have _most_ of the file name. Enough to uniquely identify
> it.

So you don't need to look into the file to get a match? Sounds like the
best procedure would just be to keep an index of all the filenames and
update it when files are added/removed (assuming you have control over
both of these processes). A simple database should be able to handle
this easily, which is pretty much what you suggested yourself. In fact
it looks so simple that a Berkeley DB file would do it, without needing
all the fancy DB machinery of MySQL or Postgres. See for example "man
DB_File".

poc
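[For the archive: even before reaching for Berkeley DB, a sorted flat file
of names queried with grep -F may well be enough at this scale. A minimal
sketch, with /fs1 and the index path invented for illustration:]

```shell
#!/bin/sh
# Build a flat-file index of every filename on one archive filesystem.
# /fs1 and /var/tmp/fs1.index are hypothetical example paths.
find /fs1 -type f | sort > /var/tmp/fs1.index

# Look a file up later by the partial name you have. -F does a
# fixed-string match, so regex metacharacters in the name are harmless.
grep -F 'partial-file-name' /var/tmp/fs1.index
```

Rebuilding the index is the expensive part (one full find per filesystem);
the grep itself is a fast linear scan of a text file.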
Re: How to find a needle in a haystack?
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories. The files
>> average
>
> OK, 4000 directories for 12 million files means you've got 3000 files per
> directory. If you are doing that, make sure your fs has htree enabled. The
> initial find is still going to suck, especially if on spinning media, and
> doubly (actually far more than doubly) so if the disk is also busy trying
> to do other work.

dir_index is enabled, as is noatime. I'm not sure, but I believe noatime
was supposed to speed things up a little as well.

>> size is 1MB. Every day I expect to get 20 or so requests for files from
>> this archive. The files were not stored in any logical structure that I
>> can use to narrow down the search. This will be different moving forward
>> but it does not help me for the old data. Additionally, every day data
>> is added and old data is removed to make space.
>
> What are you searching by - name or content?
>
> If you are searching by content then you really want to stuff the lot
> into a freetext engine like Omega. If you are searching by file name I'd
> take the time to turn the archive upside down and archive it that way up,
> assuming your tools can do it.
>
> That is, turn all the files backwards, so
>
> foo/bar/hello.c
>
> becomes
>
> hello.c/bar/foo
>
> or build a symlink farm of them that way up (so hello.c/bar/foo is a
> symlink to foo/bar/hello.c).
>
> It's then suddenly a lot less painful to find things! A database
> would also no doubt do the job.

I am searching by name only. I am a little unclear about this. The idea
here is to create a symlink for each file in the root of the filesystem?
That won't hurt something?

---
Will Y.

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
Re: How to find a needle in a haystack?
> On Tue, 2010-05-18 at 16:49 -0400, arag...@dcsnow.com wrote:
>> Hello all,
>>
>> I need some ideas.
>>
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories. The files
>> average size is 1MB.
>
> So each filesystem is about 12*10^6 * 1MB = 12*10^12 bytes, or 12
> terabytes?

Each filesystem is 2.5TB, so the average file size must be much smaller
(2.5TB over 12 million files is roughly 200KB per file). At last count,
one of the filesystems contained 20 million files.

> You don't say what the file contents are like, e.g. text, structured
> data, unstructured binary, etc., nor do you say how you match the file
> you want (e.g. is it equivalent to a text substring, a regular
> expression, or what?). Knowing what the contents look like would help to
> evaluate whether it's worth e.g. generating a hash for subsections of
> the file when it's being stored. Alternatively, it could conceivably
> make sense to search for strings in the raw disk and work backwards to
> calculate what files they belong to, who knows?

The data in the files is of the unstructured binary type. When I do a
search, I have _most_ of the file name. Enough to uniquely identify it.

I hope that helps.

---
Will Y.
Re: How to find a needle in a haystack?
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories. The files average

OK, 4000 directories for 12 million files means you've got 3000 files per
directory. If you are doing that, make sure your fs has htree enabled. The
initial find is still going to suck, especially if on spinning media, and
doubly (actually far more than doubly) so if the disk is also busy trying
to do other work.

> size is 1MB. Every day I expect to get 20 or so requests for files from
> this archive. The files were not stored in any logical structure that I
> can use to narrow down the search. This will be different moving forward
> but it does not help me for the old data. Additionally, every day data is
> added and old data is removed to make space.

What are you searching by - name or content?

If you are searching by content then you really want to stuff the lot
into a freetext engine like Omega. If you are searching by file name, I'd
take the time to turn the archive upside down and archive it that way up,
assuming your tools can do it.

That is, turn all the files backwards, so

foo/bar/hello.c

becomes

hello.c/bar/foo

or build a symlink farm of them that way up (so hello.c/bar/foo is a
symlink to foo/bar/hello.c).

It's then suddenly a lot less painful to find things! A database would
also no doubt do the job.
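[For the archive: the symlink farm described above can be built with
standard tools. A rough sketch, assuming the archive lives under /fs1 and
the reversed tree under /fs1-byname (both names invented here); a real run
over 12 million files would take a long while, and it assumes no newlines
in filenames:]

```shell
#!/bin/sh
# Build a "reversed" symlink farm: foo/bar/hello.c becomes
# hello.c/bar/foo, a symlink pointing back at the real file.
SRC=/fs1            # the archive (hypothetical path)
DST=/fs1-byname     # the reversed tree (hypothetical path)

find "$SRC" -type f | while IFS= read -r path; do
    rel=${path#"$SRC"/}                 # e.g. foo/bar/hello.c
    # Reverse the path components: hello.c/bar/foo
    rev=$(printf '%s\n' "$rel" | awk -F/ '{
        for (i = NF; i > 1; i--) printf "%s/", $i
        print $1
    }')
    mkdir -p "$DST/$(dirname "$rev")"
    ln -sf "$path" "$DST/$rev"
done
```

Because the filename is now the top-level directory component, finding a
file by name is a single readdir on $DST instead of a walk of the whole
archive.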
Re: How to find a needle in a haystack?
On Tue, 2010-05-18 at 16:49 -0400, arag...@dcsnow.com wrote:
> Hello all,
>
> I need some ideas.
>
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories. The files average
> size is 1MB.

So each filesystem is about 12*10^6 * 1MB = 12*10^12 bytes, or 12
terabytes?

> Every day I expect to get 20 or so requests for files from
> this archive. The files were not stored in any logical structure that I
> can use to narrow down the search. This will be different moving forward
> but it does not help me for the old data. Additionally, every day data is
> added and old data is removed to make space.
>
> So, now that you know a little about the environment, I need ideas on how
> to find the file I want to restore fast.
>
> Using find on the partition is slow.
>
> I thought about using find and piping the output to a file. I started it
> 50 minutes ago and it still isn't done on a single partition. Plus the
> file is currently about 1.3GB and how would I maintain such a file?
>
> Would putting the file names + path in a database be faster?

You don't say what the file contents are like, e.g. text, structured
data, unstructured binary, etc., nor do you say how you match the file
you want (e.g. is it equivalent to a text substring, a regular
expression, or what?). Knowing what the contents look like would help to
evaluate whether it's worth e.g. generating a hash for subsections of the
file when it's being stored. Alternatively, it could conceivably make
sense to search for strings in the raw disk and work backwards to
calculate what files they belong to, who knows?

In short, more info is needed to give a sensible answer.

poc
Re: How to find a needle in a haystack?
On 05/18/2010 04:49 PM, arag...@dcsnow.com wrote:
> Hello all,
>
> I need some ideas.
>
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories. The files average
> size is 1MB. Every day I expect to get 20 or so requests for files from
> this archive. The files were not stored in any logical structure that I
> can use to narrow down the search. This will be different moving forward
> but it does not help me for the old data. Additionally, every day data is
> added and old data is removed to make space.

Do you know when data will be added or deleted? Is it random during the
day, or at a fixed time?

> So, now that you know a little about the environment, I need ideas on how
> to find the file I want to restore fast.
>
> Using find on the partition is slow.

Have you thought about "locate" or any of its implementations (mlocate,
slocate, etc.)? It essentially indexes your disks for you and keeps the
result in a database which it can search. Unlike "find", its answers are
not current as of the moment you search; they reflect the state of the
disk when it was last indexed, so it won't know about any changes made
after the last index run.

> I thought about using find and piping the output to a file. I started it
> 50 minutes ago and it still isn't done on a single partition. Plus the
> file is currently about 1.3GB and how would I maintain such a file?
>
> Would putting the file names + path in a database be faster?
>
> As always, any help would be greatly appreciated.
>
> ---
> Will Y.

--
Kevin J. Cummings
kjch...@rcn.com
cummi...@kjchome.homeip.net
cummi...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)