Re: How to find a needle in a haystack?

2010-05-19 Thread birger
On Wed, 2010-05-19 at 15:40 -0430, Patrick O'Callaghan wrote:
> On Wed, 2010-05-19 at 14:07 -0400, arag...@dcsnow.com wrote:
> > The data in the files is of the unstructured binary type.  When I do a
> > search, I have _most_ of the file name.  Enough to uniquely identify
> > it.
> 
> So you don't need to look into the file to get a match? Sounds like the
> best procedure would just be to keep an index of all the filenames and
> update it when files are added/removed (assuming you have control over
> both of these processes). A simple database should be able to handle
> this easily, which is pretty much what you suggested yourself. In fact
> it looks so simple that a Berkeley DB file would do it, without needing
> all the fancy DB machinery or MySQL or Postgres. See for example "man
> DB_File".

Is there any reason to not use the already existing updatedb/locate
combo? The Fedora updatedb seems to be based on mlocate, which as far as
I know uses the mtime of directories to tell if a directory has changed
since the last scan (mtime of the directory will change if files have
been added or deleted). This should speed up runs unless a lot of
directories change between runs.
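
The directory-mtime idea can be sketched in a few lines of Python. This is
not mlocate's actual code; the function name scan and the cache layout are
made up for illustration. A directory is re-listed only when its mtime
differs from the cached value, but every subdirectory is still visited,
because a change deeper in the tree does not touch the parent's mtime.

```python
import os

def scan(root, cache):
    """Walk root, re-listing only directories whose mtime has changed.

    cache maps a directory path to (mtime_ns, file_names, subdir_names)
    from the previous run. Returns (new_cache, rescanned_dirs).
    """
    new_cache = {}
    rescanned = []
    stack = [root]
    while stack:
        d = stack.pop()
        mtime = os.stat(d).st_mtime_ns
        cached = cache.get(d)
        if cached is not None and cached[0] == mtime:
            # Nothing was added to or removed from this directory.
            files, subdirs = cached[1], cached[2]
        else:
            files, subdirs = [], []
            with os.scandir(d) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        subdirs.append(entry.name)
                    else:
                        files.append(entry.name)
            rescanned.append(d)
        new_cache[d] = (mtime, files, subdirs)
        # Always descend: a subdirectory can change without its parent changing.
        stack.extend(os.path.join(d, s) for s in subdirs)
    return new_cache, rescanned
```

On a second run with the cache from the first, only the directories that
gained or lost entries are read again; the rest cost one stat() each.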

You can disable the default updatedb configuration and run it manually
(or in cron jobs), specifying one file system for each job. Let them run
in parallel with output to separate databases. Then globally set the
environment variable (LOCATE_PATH in mlocate) to tell locate where to
look so it finds all the databases. Look at the man pages for updatedb,
updatedb.conf, locate and mlocate.db. The last one is very optional.
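
A minimal sketch of the per-file-system setup, written in Python only to
generate the command lines; nothing is executed. The mount points and the
database directory are hypothetical, and the option names
(--database-root, --output) and the LOCATE_PATH variable are as I
understand them from the mlocate man pages, so check them against the
installed version.

```python
import os
import shlex

def updatedb_command(fs_root, db_dir):
    """Build an updatedb command that indexes one file system into its own database."""
    name = fs_root.strip("/").replace("/", "_") or "root"
    db_path = os.path.join(db_dir, name + ".db")
    cmd = ["updatedb", "--database-root", fs_root, "--output", db_path]
    return cmd, db_path

# Hypothetical mount points and database directory.
filesystems = ["/backup/fs01", "/backup/fs02"]
db_dir = "/var/lib/mlocate"

jobs = [updatedb_command(fs, db_dir) for fs in filesystems]

# One command per cron job, so the file systems can be indexed in parallel.
for cmd, _db in jobs:
    print(shlex.join(cmd))

# locate picks up additional databases from LOCATE_PATH (colon-separated).
print("export LOCATE_PATH=" + ":".join(db for _cmd, db in jobs))
```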

-- 
birger

-- 
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines


Re: How to find a needle in a haystack?

2010-05-19 Thread Patrick O'Callaghan
On Wed, 2010-05-19 at 14:07 -0400, arag...@dcsnow.com wrote:
> The data in the files is of the unstructured binary type.  When I do a
> search, I have _most_ of the file name.  Enough to uniquely identify
> it.

So you don't need to look into the file to get a match? Sounds like the
best procedure would just be to keep an index of all the filenames and
update it when files are added/removed (assuming you have control over
both of these processes). A simple database should be able to handle
this easily, which is pretty much what you suggested yourself. In fact
it looks so simple that a Berkeley DB file would do it, without needing
all the fancy DB machinery or MySQL or Postgres. See for example "man
DB_File".
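
A rough sketch of such a filename index, assuming Python and its bundled
sqlite3 module rather than Berkeley DB / DB_File. SQLite is swapped in
here because it ships with Python and can match on part of a name; the
table layout, function names and sample paths are all made up for
illustration.

```python
import os
import sqlite3

def open_index(db_path=":memory:"):
    """Open (or create) the index; one row per archived file."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            " dir TEXT NOT NULL, name TEXT NOT NULL,"
            " PRIMARY KEY (dir, name))"
        )
    return conn

def add_file(conn, path):
    """Call this whenever a file is added to the archive."""
    d, name = os.path.split(path)
    with conn:
        conn.execute("INSERT OR IGNORE INTO files (dir, name) VALUES (?, ?)", (d, name))

def remove_file(conn, path):
    """Call this whenever a file is deleted from the archive."""
    d, name = os.path.split(path)
    with conn:
        conn.execute("DELETE FROM files WHERE dir = ? AND name = ?", (d, name))

def find(conn, fragment):
    """Return full paths of files whose name contains fragment (case-sensitive)."""
    rows = conn.execute(
        "SELECT dir, name FROM files WHERE instr(name, ?) > 0 ORDER BY dir, name",
        (fragment,),
    )
    return [os.path.join(d, name) for d, name in rows]
```

The substring match is a scan over the table rather than an indexed
lookup, but it reads one compact file instead of walking millions of
directory entries on disk.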

poc



Re: How to find a needle in a haystack?

2010-05-19 Thread aragonx
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories.  The files average
>
> OK 4000 directories for 12 million files means you've got 3000 files per
> directory. If you are doing that make sure your fs has htree enabled. The
> initial find is still going to suck especially if on spinning media and
> doubly (actually far more than doubly) if the disk is also busy trying to
> do other work.

dir_index (the ext3 feature flag for htree) is enabled, as is noatime.
I'm not sure but I believe the noatime was supposed to speed things up a
little also.

>> size is 1MB.  Every day I expect to get 20 or so requests for files from
>> this archive.  The files were not stored in any logical structure that I
>> can use to narrow down the search.  This will be different moving forward
>> but it does not help me for the old data.  Additionally, every day data is
>> added and old data is removed to make space.
>
> What are you searching by - name or content ?
>
> If you are searching by content then you really want to stuff the lot
> into a freetext engine like Omega. If you are searching by file name I'd
> take the time to turn the archive upside down and archive that way up
> assuming your tools can do it.
>
> That is turn all the files backwards so
>
>   foo/bar/hello.c
>
> becomes
>
>   hello.c/bar/foo
>
> or build a symlink farm of them that way up (so hello.c/bar/foo is a
> symlink to foo/bar/hello.c)
>
> It's then suddenly a lot less painful to find things! A database
> would also no doubt do the job.

I am searching by name only.  I am a little unclear about this.  The idea
here is to create a symlink for each file in the root of the filesystem? 
That won't hurt something?

---
Will Y.


-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



Re: How to find a needle in a haystack?

2010-05-19 Thread aragonx
> On Tue, 2010-05-18 at 16:49 -0400, arag...@dcsnow.com wrote:
>> Hello all,
>>
>> I need some ideas.
>>
>> I have a backup server that contains 10 ext3 file systems each with 12
>> million files scattered randomly over 4000 directories.  The files average
>> size is 1MB.
>
> So each filesystem is about 12*10^6 * 1MB = 12*10^12 bytes, or 12 terabytes?

Each filesystem is 2.5TB so the average file size must be much smaller. 
At last count, one of the filesystems contained 20 million files.
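
As a quick check of the numbers, a small Python calculation, assuming
decimal units (1 TB = 10^12 bytes) and a completely full file system, so
the result is an upper bound on the average rather than a measurement.
The function name is only for illustration.

```python
def average_file_kb(fs_bytes, n_files):
    """Upper bound on the mean file size, in decimal kilobytes."""
    return fs_bytes / n_files / 1000

TB = 10**12  # assumption: decimal terabytes

for n_files in (12_000_000, 20_000_000):
    kb = average_file_kb(2.5 * TB, n_files)
    print(f"{n_files:,} files: at most about {kb:.0f} KB each")
```

That works out to roughly 208 KB per file for 12 million files and 125 KB
for 20 million, well under the 1MB first estimated.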

> You don't say what the file contents are like, e.g. text, structured
> data, unstructured binary, etc, nor do you say how you match the file
> you want (e.g. is it equivalent to a text substring, a regular
> expression, or what?). Knowing what the contents look like would help to
> evaluate if it's worth e.g. generating a hash for subsections of the
> file when it's being stored. Alternatively, it could conceivably make
> sense to search for strings in the raw disk and work backwards to
> calculate what files they belong to, who knows?

The data in the files is of the unstructured binary type.  When I do a
search, I have _most_ of the file name.  Enough to uniquely identify it.

I hope that helps.

---
Will Y.




Re: How to find a needle in a haystack?

2010-05-18 Thread Alan Cox
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories.  The files average

OK 4000 directories for 12 million files means you've got 3000 files per
directory. If you are doing that make sure your fs has htree enabled. The
initial find is still going to suck especially if on spinning media and
doubly (actually far more than doubly) if the disk is also busy trying to
do other work.

> size is 1MB.  Every day I expect to get 20 or so requests for files from
> this archive.  The files were not stored in any logical structure that I
> can use to narrow down the search.  This will be different moving forward
> but it does not help me for the old data.  Additionally, every day data is
> added and old data is removed to make space.

What are you searching by - name or content ?

If you are searching by content then you really want to stuff the lot
into a freetext engine like Omega. If you are searching by file name I'd
take the time to turn the archive upside down and archive that way up
assuming your tools can do it.

That is turn all the files backwards so

foo/bar/hello.c

becomes

hello.c/bar/foo

or build a symlink farm of them that way up (so hello.c/bar/foo is a
symlink to foo/bar/hello.c)

It's then suddenly a lot less painful to find things! A database
would also no doubt do the job.
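
A minimal Python sketch of the symlink-farm variant, under a few
assumptions: the farm lives in its own directory outside the archive, and
no file's reversed path collides with a directory needed by another
(bar/hello.c and foo/bar/hello.c would collide; the sketch lets that
error propagate rather than resolving it). The function names are made up
for illustration.

```python
import os

def reversed_path(path):
    """Reverse the components of a relative path: foo/bar/hello.c -> hello.c/bar/foo."""
    parts = [p for p in path.split("/") if p]
    return "/".join(reversed(parts))

def build_symlink_farm(archive_root, farm_root):
    """For every file under archive_root, create a symlink at its reversed path under farm_root."""
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        for name in filenames:
            real = os.path.abspath(os.path.join(dirpath, name))
            rel = os.path.relpath(real, os.path.abspath(archive_root))
            link = os.path.join(farm_root, reversed_path(rel.replace(os.sep, "/")))
            os.makedirs(os.path.dirname(link), exist_ok=True)
            os.symlink(real, link)

def lookup(farm_root, fragment):
    """Return the real paths of all files whose name contains fragment."""
    hits = []
    with os.scandir(farm_root) as entries:
        for entry in entries:
            if fragment in entry.name:
                # Everything below this entry is a symlink to one copy of that name.
                for dirpath, _dirnames, links in os.walk(entry.path):
                    hits.extend(os.path.realpath(os.path.join(dirpath, l)) for l in links)
    return sorted(hits)
```

A search then only has to read the top level of the farm, where every
entry is a file name, instead of walking the whole archive.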



Re: How to find a needle in a haystack?

2010-05-18 Thread Patrick O'Callaghan
On Tue, 2010-05-18 at 16:49 -0400, arag...@dcsnow.com wrote:
> Hello all,
> 
> I need some ideas.
> 
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories.  The files average
> size is 1MB.

So each filesystem is about 12*10^6 * 1MB = 12*10^12 bytes, or 12 terabytes?

>  Every day I expect to get 20 or so requests for files from
> this archive.  The files were not stored in any logical structure that I
> can use to narrow down the search.  This will be different moving forward
> but it does not help me for the old data.  Additionally, every day data is
> added and old data is removed to make space.
> 
> So, now that you know a little about the environment, I need ideas on how
> to find the file I want to restore fast.
> 
> Using find on the partition is slow.
> 
> I thought about using find and piping the output to a file.  I started it
> 50 minutes ago and it still isn't done on a single partition.  Plus the
> file is currently about 1.3GB and how would I maintain such a file?
> 
> Would putting the file names + path in a database be faster?

You don't say what the file contents are like, e.g. text, structured
data, unstructured binary, etc, nor do you say how you match the file
you want (e.g. is it equivalent to a text substring, a regular
expression, or what?). Knowing what the contents look like would help to
evaluate if it's worth e.g. generating a hash for subsections of the
file when it's being stored. Alternatively, it could conceivably make
sense to search for strings in the raw disk and work backwards to
calculate what files they belong to, who knows?

In short, more info is needed to give a sensible answer.

poc



Re: How to find a needle in a haystack?

2010-05-18 Thread Kevin J. Cummings
On 05/18/2010 04:49 PM, arag...@dcsnow.com wrote:
> Hello all,
> 
> I need some ideas.
> 
> I have a backup server that contains 10 ext3 file systems each with 12
> million files scattered randomly over 4000 directories.  The files average
> size is 1MB.  Every day I expect to get 20 or so requests for files from
> this archive.  The files were not stored in any logical structure that I
> can use to narrow down the search.  This will be different moving forward
> but it does not help me for the old data.  Additionally, every day data is
> added and old data is removed to make space.

Do you know when data will be added or deleted?  Is it randomly during
the day or at a fixed time?

> So, now that you know a little about the environment, I need ideas on how
> to find the file I want to restore fast.
> 
> Using find on the partition is slow.

Have you thought about "locate" or any of its implementations (mlocate,
slocate, etc)?

It essentially indexes your disks for you and keeps the result in a
database which it can search.  Unlike "find", it does not reflect the
current state of the disk but the state when the disk was last indexed.
So it won't know about any changes made after the last index was done.

> I thought about using find and piping the output to a file.  I started it
> 50 minutes ago and it still isn't done on a single partition.  Plus the
> file is currently about 1.3GB and how would I maintain such a file?
> 
> Would putting the file names + path in a database be faster?
> 
> As always, any help would be greatly appreciated.
> 
> ---
> Will Y.

-- 
Kevin J. Cummings
kjch...@rcn.com
cummi...@kjchome.homeip.net
cummi...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)