Re: [Tracker] more issues with indexer-split
On Wed, 2008-09-03 at 16:35 +0200, Philip Van Hoof wrote:
> On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
> > Jamie McCracken wrote:
> > > trackerd should just pass directories at startup and let the indexer
> > > work out what to process. Dbus is not optimised for passing large number
> > > of strings. Can the current design easily accommodate this?
> >
> > DBus' optimisation is not an issue here. I can send ALL of my files over
> > quicker than the indexer can mtime check ALL the directories in the
> > database.
>
> DBus only starts to perform badly once the message size grows over
> 4 kb. In 4 kb you can put quite a lot of URIs.
>
> Therefore I don't think we should focus on reducing the number of URIs
> we send from the daemon to the indexer.
>

OK, but let's see how it performs first.

I want startup of a previously indexed machine to be as good as, or
close to, trunk.

jamie
___
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list
Re: [Tracker] more issues with indexer-split
On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
> Jamie McCracken wrote:
> > trackerd should just pass directories at startup and let the indexer
> > work out what to process. Dbus is not optimised for passing large number
> > of strings. Can the current design easily accommodate this?
>
> DBus' optimisation is not an issue here. I can send ALL of my files over
> quicker than the indexer can mtime check ALL the directories in the
> database.

DBus only starts to perform badly once the message size grows over
4 kb. In 4 kb you can put quite a lot of URIs.

Therefore I don't think we should focus on reducing the number of URIs
we send from the daemon to the indexer.

> Yes we can accommodate this. We simply send all files/directories to the
> indexer and the indexer can check each parent directory first then
> process the files or discard them if the parent directory mtime is up to
> date.

--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be
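To make the "quite a lot of URIs in 4 kb" point concrete, here is a minimal sketch of splitting a URI list into batches that stay under a byte budget. It is written in Python rather than Tracker's C and does not use any D-Bus API. The 4096-byte budget is simply the figure quoted above, and `batch_uris` and the one-byte-per-string overhead are assumptions for illustration; real D-Bus marshalling has its own per-string overhead.

```python
from typing import Iterable, Iterator, List

MAX_PAYLOAD = 4096  # bytes; the 4 kb figure quoted in the thread, not a measured limit


def batch_uris(uris: Iterable[str], max_payload: int = MAX_PAYLOAD) -> Iterator[List[str]]:
    """Yield lists of URIs whose combined UTF-8 size stays within max_payload.

    Each URI is counted with one extra byte for a terminator (an assumption).
    A single URI larger than the budget is still sent, alone in its own batch.
    """
    batch: List[str] = []
    used = 0
    for uri in uris:
        size = len(uri.encode("utf-8")) + 1
        # Flush the current batch before this URI would push it over budget.
        if batch and used + size > max_payload:
            yield batch
            batch, used = [], 0
        batch.append(uri)
        used += size
    if batch:
        yield batch
```

With URIs of a few dozen bytes each, one batch holds on the order of a hundred entries, which is the scale the argument above relies on.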
Re: [Tracker] more issues with indexer-split
On Wed, 2008-09-03 at 10:32 -0400, Jamie McCracken wrote:
> On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> > Hi,
> >
> > So I have been reading up on the things that are remaining for merging.
> > This is the list I have so far which I will be working on:
> >
> > * Check the move files/directories issue. I *think* it works.
>
> check the new directory name can be searched when doing a rename
>
> also check the new name is searchable against all items in that
> directory
>
> >
> > * Fix the get_file_contents() function so it checks for #13 in the first
> > 64Kb.

also do what trunk does and validate each line. If it fails utf-8
validation attempt to convert from locale. Best to exit with null if any
part fails.

I assume the gio stuff handles non utf-8?

jamie
Re: [Tracker] more issues with indexer-split
Jamie McCracken wrote:
> trackerd should just pass directories at startup and let the indexer
> work out what to process. Dbus is not optimised for passing large number
> of strings. Can the current design easily accommodate this?

DBus' optimisation is not an issue here. I can send ALL of my files over
quicker than the indexer can mtime check ALL the directories in the
database.

Yes we can accommodate this. We simply send all files/directories to the
indexer and the indexer can check each parent directory first then
process the files or discard them if the parent directory mtime is up to
date.

--
Regards,
Martyn
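The indexer-side check described above can be sketched as follows. This is an illustration in Python, not Tracker's code: `files_needing_index` is a hypothetical name, and the `known_dir_mtimes` dictionary stands in for whatever the indexer database actually stores per directory.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def files_needing_index(paths: Iterable[str], known_dir_mtimes: Dict[str, float]) -> List[str]:
    """Return the paths whose parent directory mtime differs from the stored one.

    known_dir_mtimes maps a directory path to the mtime recorded when that
    directory was last indexed (a stand-in for the database lookup).
    """
    by_parent: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_parent[os.path.dirname(path)].append(path)

    to_process: List[str] = []
    for parent, children in by_parent.items():
        try:
            current = os.stat(parent).st_mtime
        except OSError:
            continue  # parent vanished since the crawl; nothing to index
        # One stat per directory; an unchanged directory discards all its files.
        if known_dir_mtimes.get(parent) != current:
            to_process.extend(children)
    return to_process
```

Grouping by parent first means the cost is one `stat` and one lookup per directory rather than per file, which is where the reduction in DB lookups would come from.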
Re: [Tracker] more issues with indexer-split
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> Hi,
>
> So I have been reading up on the things that are remaining for merging.
> This is the list I have so far which I will be working on:
>
> * Check the move files/directories issue. I *think* it works.

check the new directory name can be searched when doing a rename

also check the new name is searchable against all items in that
directory

>
> * Fix the get_file_contents() function so it checks for #13 in the first
> 64Kb.
>
> * Make private libraries .so files to dynamically load them.

Also for the stemmers - make them dynamically loadable too

>
> * The directory mtime issue on startup.
>

also for summary files too - only check them if the mtime has changed

> Have I missed anything?

I think that is it. A lot of Prefs don't work but that can wait until
after the merge. I'm also adding my tracker-fts stuff into that branch so
will likely merge when the above + my stuff is ready

jamie
Re: [Tracker] more issues with indexer-split
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> Jamie McCracken wrote:
> >>> trunk only checks directories (If a file in a directory is modified then
> >>> the directories mtime is also altered so no need to check every file)
> >>> hence startup is much faster.
>
> Note: the mtime of the parent directory ONLY is updated. This is not
> recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
> updated, but not that of bar/ or foo/.
>
> This means you _HAVE_ to go into every directory to see if it has a
> subdirectory with an mtime that has been updated.

that is what trunk does - it only checks directories (and
subdirectories). There's no need to check the mtime of a file ever unless
the parent directory mtime has changed

>
> >> We can do this. Can you guarantee that on EVERY file system type the
> >> parent directory mtime is updated when a file changes? I am not 100%
> >> sure this is the case.
> >
> > on all major platforms yes (*nix and windows)
>
> Hmm. This worries me. How mtime is used across file systems tends to vary
> slightly and this might come back to bite us.

It's not been a problem in the past for tracker and certainly won't be
for our target audience

>
> > it is for me - its in the order of 3x slower than trunk at startup
>
> What exactly is 3x slower? The crawling?
>
> I have been thinking about this. The best solution here to me is to send
> ALL files/directories to the indexer and let the indexer check the mtime
> of a directory before deciding to process the files it holds. This
> should dramatically reduce the DB lookups on startup. But if the
> slowness is NOT in the indexer, then there is little you can do except
> increase the throttle. Have you tested it again recently since I made
> throttle mandatory whenever it is called (i.e. it is 5+config value).
> This made a lot of difference for me.
>

trackerd should just pass directories at startup and let the indexer
work out what to process. Dbus is not optimised for passing a large
number of strings. Can the current design easily accommodate this?

jamie
Re: [Tracker] more issues with indexer-split
Hi,

So I have been reading up on the things that are remaining for merging.
This is the list I have so far which I will be working on:

* Check the move files/directories issue. I *think* it works.

* Fix the get_file_contents() function so it checks for #13 in the first
  64Kb.

* Make private libraries .so files to dynamically load them.

* The directory mtime issue on startup.

Have I missed anything?

--
Regards,
Martyn
Re: [Tracker] more issues with indexer-split
Jamie McCracken wrote:
>>> trunk only checks directories (If a file in a directory is modified then
>>> the directories mtime is also altered so no need to check every file)
>>> hence startup is much faster.

Note: the mtime of the parent directory ONLY is updated. This is not
recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
updated, but not that of bar/ or foo/.

This means you _HAVE_ to go into every directory to see if it has a
subdirectory with an mtime that has been updated.

>> We can do this. Can you guarantee that on EVERY file system type the
>> parent directory mtime is updated when a file changes? I am not 100%
>> sure this is the case.
>
> on all major platforms yes (*nix and windows)

Hmm. This worries me. How mtime is used across file systems tends to vary
slightly and this might come back to bite us.

> it is for me - its in the order of 3x slower than trunk at startup

What exactly is 3x slower? The crawling?

I have been thinking about this. The best solution here to me is to send
ALL files/directories to the indexer and let the indexer check the mtime
of a directory before deciding to process the files it holds. This
should dramatically reduce the DB lookups on startup. But if the
slowness is NOT in the indexer, then there is little you can do except
increase the throttle. Have you tested it again recently since I made
throttle mandatory whenever it is called (i.e. it is 5+config value).
This made a lot of difference for me.

--
Regards,
Martyn
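A minimal sketch of the directory-only startup check under discussion, in Python rather than Tracker's C. `changed_directories` and the `known_dir_mtimes` mapping are hypothetical stand-ins for the crawler and the database. One caveat that bears on the portability question: on POSIX file systems a directory's mtime changes when entries are created, removed or renamed, while an in-place write to an existing file does not necessarily touch it, so a check like this can miss such edits.

```python
import os
from typing import Dict, List


def changed_directories(root: str, known_dir_mtimes: Dict[str, float]) -> List[str]:
    """Walk every directory under root and return those whose mtime is new or changed.

    Every directory has to be visited because a change deep in the tree does
    not update the mtime of its ancestors. No per-file mtime is compared.
    """
    changed: List[str] = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        if known_dir_mtimes.get(dirpath) != os.stat(dirpath).st_mtime:
            changed.append(dirpath)
    return changed
```

For the /foo/bar/baz/sliff.txt example, only baz/ would be reported after a file is added there; foo/ and bar/ keep their old mtimes, which is why the walk cannot stop early at an unchanged parent.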
Re: [Tracker] tracker-indexer does not index all files
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:14 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Another potential crasher - unlike trunk get_file_content does no utf-8
>>> validation and also if file is bigger than MAX_TEXT cuts it off which is
>>> likely to not land on a valid utf-8 word break
>> This is true.
>>
>>> ideally do what trunk does and read file line by line so that we will
>>> never have a partial utf-8 fragment and the resulting text can be
>>> validated and converted from locale to utf-8 if necessary
>> I don't think reading line by line is a good idea at all.
>> All we need to do is use g_utf8_validate () on the length we read and
>> find out where the end is and make sure we don't read half way through a
>> UTF8 character.
>
> how can you tell your are not in the middle of a utf8 char? Line break
> is the only char we can be sure of breaking on (CJK may not contain word
> breaks like spaces)

If you read the documentation for that function, it should return the
end position for what is valid if there is invalid utf8 in the stream
being read. It is safe to assume that we can read up to end-start for
parsing, since it will be valid UTF-8. Unless of course you are
expecting to be able to parse non-UTF-8 content?

> to read line by line you can still use streams but check for #13 line
> break

Isn't that just an unnecessary check that (depending on the file) could
be quite a performance hit for a file with a lot of line breaks?

> I suggest read it in 64kb chunks - if no line break (#13) is found then
> exit as its unlikely to be a valid text file that needs indexing

That is a good point. To some extent. I just worry about false positives
here, i.e. key/value files with some initial valid text and a binary
blob as a value. The first thing that springs to mind is a VCard. Not
sure to be honest.

> of course if you have a better idea (thats not slower) then Im all
> ears...

No, after some checking that seems the sanest idea actually.
The only issue there is false positives really. I can work on this.

--
Regards,
Martyn
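The two ideas in this exchange (reject files with no line break in the first 64 KB, and cut the text at the last valid UTF-8 boundary) can be sketched together. This is Python, not the actual get_file_contents(): `UnicodeDecodeError.start` plays the role of the `end` out-parameter of `g_utf8_validate()`. The names `valid_utf8_prefix` and `read_text_for_indexing` and the `MAX_TEXT` value are assumptions, and the sketch accepts either LF (10) or CR (13) as a line break, whereas the thread only mentions #13. It does not attempt the locale conversion Jamie asks for, nor the VCard-style false positives.

```python
from typing import Optional

CHUNK_SIZE = 64 * 1024  # the 64 KB window suggested in the thread
MAX_TEXT = 1024 * 1024  # hypothetical cap; the real MAX_TEXT value is not given


def valid_utf8_prefix(data: bytes) -> bytes:
    """Return the longest prefix of data that decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError as err:
        # err.start is the offset of the first invalid or incomplete byte.
        return data[: err.start]


def read_text_for_indexing(path: str, max_text: int = MAX_TEXT) -> Optional[str]:
    """Return indexable text, or None if the file does not look like text."""
    with open(path, "rb") as handle:
        first = handle.read(min(CHUNK_SIZE, max_text))
        # No line break in the first chunk: treat the file as non-text.
        if b"\n" not in first and b"\r" not in first:
            return None
        rest = handle.read(max_text - len(first))
    # A cut at max_text may land mid-character; keep only the valid prefix.
    text = valid_utf8_prefix(first + rest).decode("utf-8")
    return text or None
```

Because the truncation point is found by validation rather than by searching for a line break, this works for text without spaces (the CJK case raised above) while still reading in large chunks.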
Re: [Tracker] more issues with indexer-split
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Could we also reduce memory usage by not statically linking to the
>>> private libs libtracker-common and libtracker-db?
>> Those libraries should not be available for public use. Before doing so,
>> each API would have to be:
>>
>> a) Documented
>> b) Checked it needs to be public
>> c) Versioned
>> d) ...
>>
>> This is a lot of work and I don't think it is worth it.
>> I haven't looked at the footprints myself though.
>>
>>> currently my FTS module and the file-indexer-module are ~ 1MB in size
>>> due mostly to linking with them and im sure the size of trackerd and
>>> tracker-indexer could be made smaller too with only one instance of
>>> those libs in memory
>> How does the memory footprint compare to the old tracker?
>>
>
> having looked at the contents of libtracker-common, most of the memory
> used is for the stemmers - we load them all into memory even though we
> only use one of them. i think making each language stemmer a dynamically
> loaded module should help reduce things

I can look into doing this.

--
Regards,
Martyn
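The load-one-stemmer-on-demand idea can be illustrated with a small sketch. It is in Python and uses `importlib` where the C code would load a shared object at run time; the `StemmerLoader` class, the `stemmer_<language>.py` naming scheme and the `stem()` function each module is expected to expose are all hypothetical.

```python
import importlib.util
import os
from types import ModuleType
from typing import Dict, List


class StemmerLoader:
    """Load one stemmer module per language on first use and cache it."""

    def __init__(self, module_dir: str) -> None:
        self.module_dir = module_dir
        self._loaded: Dict[str, ModuleType] = {}

    def get(self, language: str) -> ModuleType:
        if language not in self._loaded:
            # Hypothetical naming scheme: <module_dir>/stemmer_<language>.py
            path = os.path.join(self.module_dir, f"stemmer_{language}.py")
            spec = importlib.util.spec_from_file_location(f"stemmer_{language}", path)
            if spec is None or spec.loader is None:
                raise ImportError(f"no stemmer module for {language!r}")
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)  # raises if the file is missing
            self._loaded[language] = module
        return self._loaded[language]

    def loaded_languages(self) -> List[str]:
        """Languages that have actually been loaded so far."""
        return sorted(self._loaded)
```

Only the stemmer for the configured language ever ends up in memory; the others stay on disk until something asks for them.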
Re: [Tracker] more issues with indexer-split
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Could we also reduce memory usage by not statically linking to the
>>> private libs libtracker-common and libtracker-db?
>> Those libraries should not be available for public use. Before doing so,
>> each API would have to be:
>>
>> a) Documented
>> b) Checked it needs to be public
>> c) Versioned
>> d) ...
>>
>> This is a lot of work and I don't think it is worth it.
>> I haven't looked at the footprints myself though.
>
>
> why we would do all that?
>
> we would not be exporting the headers for those libs so no other apps
> outside of tracker source tree will be able to use it effectively
>
> surely there are some examples of private libs that are not statically
> linked?

I clearly misunderstood. I thought you meant make it public for public
use. I think making them .so libs but privately used is a good idea.

--
Regards,
Martyn