Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
 On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
 Jamie McCracken wrote:
 Could we also reduce memory usage by not statically linking to the
 private libs libtracker-common and libtracker-db?
 Those libraries should not be available for public use. Before doing so,
 each API would have to be:

 a) Documented
 b) Checked it needs to be public
 c) Versioned
 d) ...

 This is a lot of work and I don't think it is worth it.
 I haven't looked at the footprints myself though.
 
 
 why would we do all that?
 
 we would not be exporting the headers for those libs so no other apps
 outside of tracker source tree will be able to use it effectively
 
 surely there are some examples of private libs that are not statically
 linked?

I clearly misunderstood. I thought you meant making them available for
public use. I think making them .so libs that are only used privately is
a good idea.

-- 
Regards,
Martyn
___
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
 On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
 Jamie McCracken wrote:
 Could we also reduce memory usage by not statically linking to the
 private libs libtracker-common and libtracker-db?
 Those libraries should not be available for public use. Before doing so,
 each API would have to be:

 a) Documented
 b) Checked it needs to be public
 c) Versioned
 d) ...

 This is a lot of work and I don't think it is worth it.
 I haven't looked at the footprints myself though.

 currently my FTS module and the file-indexer-module are ~ 1MB in size
 due mostly to linking with them and I'm sure the size of trackerd and
 tracker-indexer could be made smaller too with only one instance of
 those libs in memory
 How does the memory footprint compare to the old tracker?

 
 having looked at the contents of libtracker-common, most of the memory
 used is for the stemmers - we load them all into memory even though we
 only use one of them. I think making each language stemmer a dynamically
 loaded module should help reduce things

I can look into doing this.

-- 
Regards,
Martyn


Re: [Tracker] tracker-indexer does not index all files

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
 On Tue, 2008-09-02 at 12:14 +0100, Martyn Russell wrote:
 Jamie McCracken wrote:
 Another potential crasher - unlike trunk get_file_content does no utf-8
 validation and also if file is bigger than MAX_TEXT cuts it off which is
 likely to not land on a valid utf-8 word break
 This is true.

 ideally do what trunk does and read file line by line so that we will
 never have a partial utf-8 fragment and the resulting text can be
 validated and converted from locale to utf-8 if necessary
 I don't think reading line by line is a good idea at all.
 All we need to do is use g_utf8_validate () on the length we read and
 find out where the end is and make sure we don't read half way through a
 UTF8 character.
 
 how can you tell you are not in the middle of a utf8 char? Line break
 is the only char we can be sure of breaking on (CJK may not contain word
 breaks like spaces)

If you read the documentation for g_utf8_validate (), it reports the end
position of the valid data if there is invalid UTF-8 in the stream
being read. It is safe to assume that we can read up to end-start for
parsing, since that part will be valid UTF-8.
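
To make the "read up to the end of the valid data" idea concrete, here is a minimal Python sketch of the same behaviour. It is not the GLib call itself; `valid_utf8_prefix` is a made-up helper that relies on the offset Python reports for the first undecodable byte.

```python
def valid_utf8_prefix(data: bytes) -> bytes:
    """Return the longest leading slice of `data` that is valid UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return data[:err.start]


# A buffer cut in the middle of the two-byte sequence for "é".
chunk = "café".encode("utf-8")[:4]
text = valid_utf8_prefix(chunk).decode("utf-8")
```

The truncated trailing byte is dropped, so a cut-off at a fixed size never leaves half a character in the text handed to the parser.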

Unless of course you are expecting to be able to parse non-UTF-8 content?

 to read line by line you can still use streams but check for #13 line
 break

Isn't that just an unnecessary check that (depending on the file) could
be quite a performance hit for a file with a lot of line breaks?

 I suggest reading it in 64 KB chunks - if no line break (#13) is found then
 exit as it's unlikely to be a valid text file that needs indexing

That is a good point. To some extent. I just worry about false positives
here, i.e. key/value files with some initial valid text and a binary
blob as a value. The first thing that springs to mind is a VCard. Not
sure to be honest.

 of course if you have a better idea (that's not slower) then I'm all
 ears...

No, after some checking that actually seems the sanest idea. The only
real issue there is false positives. I can work on this.
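
A minimal Python sketch of the check being agreed on here: read the first 64 KB and give up if it contains no line break. The function name and the `max_bytes` parameter are invented for the example. The thread says #13 (carriage return); the sketch assumes either CR or LF counts as a line break.

```python
import io
from typing import BinaryIO, Optional

CHUNK_SIZE = 64 * 1024  # the 64 KB figure from the thread


def read_if_text(stream: BinaryIO, max_bytes: int) -> Optional[bytes]:
    """Return up to max_bytes of content, or None if the first chunk has no line break."""
    first = stream.read(CHUNK_SIZE)
    # Assumption: accept either LF or CR as "a line break".
    if b"\n" not in first and b"\r" not in first:
        return None
    if len(first) >= max_bytes:
        return first[:max_bytes]
    return first + stream.read(max_bytes - len(first))


content = read_if_text(io.BytesIO(b"hello\nworld\n"), 1024)
```

A file that starts with a few lines of text and then carries a binary blob would still pass this test, which is the false-positive case raised above.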

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
 trunk only checks directories (If a file in a directory is modified then
 the directory's mtime is also altered so no need to check every file)
 hence startup is much faster.

Note: ONLY the mtime of the parent directory is updated. This is not
recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
updated, but not that of bar/ or foo/.

This means you _HAVE_ to go into every directory to see if it has a
subdirectory with an mtime that has updated.
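
The non-recursive behaviour can be checked with a short Python sketch. The directory names follow the example above; the timestamp is arbitrary. One caveat worth keeping in mind: on POSIX file systems a directory's mtime changes when entries are created, removed or renamed in it, whereas an in-place write to an existing file does not touch it, so the sketch creates a new file.

```python
import os
import tempfile

OLD = 1_000_000_000  # arbitrary timestamp in the past

with tempfile.TemporaryDirectory() as root:
    foo = os.path.join(root, "foo")
    bar = os.path.join(foo, "bar")
    baz = os.path.join(bar, "baz")
    os.makedirs(baz)
    # Pretend all three directories were last changed long ago.
    for d in (foo, bar, baz):
        os.utime(d, (OLD, OLD))

    # Adding an entry to baz/ updates the mtime of baz/ only.
    with open(os.path.join(baz, "sliff.txt"), "w") as fh:
        fh.write("hello")

    changed = {
        name: os.stat(path).st_mtime != OLD
        for name, path in (("foo", foo), ("bar", bar), ("baz", baz))
    }
```

Here `changed` records, per directory, whether its mtime moved away from the old value.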

 We can do this. Can you guarantee that on EVERY file system type the
 parent directory mtime is updated when a file changes? I am not 100%
 sure this is the case.
 
 on all major platforms yes (*nix and windows)

Hmm. This worries me. How mtime is used across file systems tends to vary
slightly and this might come back to bite us.

 it is for me - it's on the order of 3x slower than trunk at startup

What exactly is 3x slower? The crawling?

I have been thinking about this. The best solution here, to me, is to send
ALL files/directories to the indexer and let the indexer check the mtime
of a directory before deciding to process the files it holds. This
should dramatically reduce the DB lookups on startup. But if the
slowness is NOT in the indexer, then there is little you can do except
increase the throttle. Have you tested it again recently, since I made
the throttle mandatory whenever it is called (i.e. it is 5+config value)?
This made a lot of difference for me.

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Hi,

So I have been reading up on the things that are remaining for merging.
This is the list I have so far which I will be working on:

* Check the move files/directories issue. I *think* it works.

* Fix the get_file_contents() function so it checks for #13 in the first
64Kb.

* Make private libraries .so files to dynamically load them.

* The directory mtime issue on startup.

Have I missed anything?

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
 Jamie McCracken wrote:
  trunk only checks directories (If a file in a directory is modified then
  the directory's mtime is also altered so no need to check every file)
  hence startup is much faster.
 
 Note: ONLY the mtime of the parent directory is updated. This is not
 recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
 updated, but not that of bar/ or foo/.
 
 This means you _HAVE_ to go into every directory to see if it has a
 subdirectory with an mtime that has updated.

that is what trunk does - it only checks directories (and
subdirectories). There's no need to check mtime for a file ever unless
the parent directory mtime has changed

 
  We can do this. Can you guarantee that on EVERY file system type the
  parent directory mtime is updated when a file changes? I am not 100%
  sure this is the case.
  
  on all major platforms yes (*nix and windows)
 
 Hmm. This worries me. How mtime is used across file systems tends to vary
 slightly and this might come back to bite us.


It's not been a problem in the past for tracker and certainly won't be for
our target audience

 
  it is for me - it's on the order of 3x slower than trunk at startup
 
 What exactly is 3x slower? The crawling?
 
 I have been thinking about this. The best solution here, to me, is to send
 ALL files/directories to the indexer and let the indexer check the mtime
 of a directory before deciding to process the files it holds. This
 should dramatically reduce the DB lookups on startup. But if the
 slowness is NOT in the indexer, then there is little you can do except
 increase the throttle. Have you tested it again recently, since I made
 the throttle mandatory whenever it is called (i.e. it is 5+config value)?
 This made a lot of difference for me.
 


trackerd should just pass directories at startup and let the indexer
work out what to process. Dbus is not optimised for passing a large
number of strings. Can the current design easily accommodate this?


jamie



Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
 Hi,
 
 So I have been reading up on the things that are remaining for merging.
 This is the list I have so far which I will be working on:
 
 * Check the move files/directories issue. I *think* it works.

check the new directory name can be searched when doing a rename

also check the new name is searchable against all items in that
directory

 
 * Fix the get_file_contents() function so it checks for #13 in the first
 64Kb.
 
 * Make private libraries .so files to dynamically load them. 
Also for stemmer - make them dynamically loadable too
 
 * The directory mtime issue on startup.
 

also for summary files too - only check them if mtime has changed

 Have I missed anything?

I think that is it. A lot of Prefs don't work but that can wait until
after the merge.

I'm also adding my tracker-fts stuff into that branch so will likely
merge when above + my stuff is ready

jamie





Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
 trackerd should just pass directories at startup and let the indexer
 work out what to process. Dbus is not optimised for passing a large
 number of strings. Can the current design easily accommodate this?

DBus' optimisation is not an issue here. I can send ALL of my files over
quicker than the indexer can mtime check ALL the directories in the
database.

Yes we can accommodate this. We simply send all files/directories to the
indexer and the indexer can check each parent directory first then
process the files or discard them if the parent directory mtime is up to
date.
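
As a rough sketch of the indexer-side check described above (in Python, with invented names): the indexer keeps the directory mtimes it saw at the last index, looks each parent directory up once, and discards files whose parent is unchanged. `indexed_dir_mtimes` stands in for whatever the database stores, and `get_mtime` for a stat call; neither is taken from the Tracker code.

```python
import os
from typing import Callable, Dict, Iterable, List


def files_to_process(paths: Iterable[str],
                     indexed_dir_mtimes: Dict[str, float],
                     get_mtime: Callable[[str], float]) -> List[str]:
    """Keep only files whose parent directory mtime differs from the stored one."""
    dir_changed: Dict[str, bool] = {}
    keep: List[str] = []
    for path in paths:
        parent = os.path.dirname(path)
        if parent not in dir_changed:
            # One lookup per directory, however many files it holds.
            dir_changed[parent] = indexed_dir_mtimes.get(parent) != get_mtime(parent)
        if dir_changed[parent]:
            keep.append(path)
    return keep


stored = {"/a": 10.0, "/b": 20.0}
current = {"/a": 10.0, "/b": 25.0, "/c": 5.0}
todo = files_to_process(["/a/1", "/b/2", "/b/3", "/c/4"], stored, current.__getitem__)
```

Directories that are not in the stored table at all (here `/c`) count as changed, so new directories are always processed.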

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 10:32 -0400, Jamie McCracken wrote:
 On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
  Hi,
  
  So I have been reading up on the things that are remaining for merging.
  This is the list I have so far which I will be working on:
  
  * Check the move files/directories issue. I *think* it works.
 
 check the new directory name can be searched when doing a rename
 
 also check the new name is searchable against all items in that
 directory
 
  
  * Fix the get_file_contents() function so it checks for #13 in the first
  64Kb.

also do what trunk does and validate each line. If it fails utf-8
validation attempt to convert from locale. Best to exit with null if any
part fails. I assume the gio stuff handles non utf-8?
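
A small Python sketch of the per-line approach just described: try UTF-8 first, fall back to the locale encoding, and return nothing if any line fails. This illustrates the idea and is not the trunk code; `decode_text` and its `fallback` parameter are made up for the example.

```python
import locale
from typing import Optional


def decode_text(data: bytes, fallback: Optional[str] = None) -> Optional[str]:
    """Decode line by line: UTF-8 first, then the fallback encoding, else give up."""
    fallback = fallback or locale.getpreferredencoding(False)
    lines = []
    for raw in data.splitlines(keepends=True):
        try:
            lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            try:
                lines.append(raw.decode(fallback))
            except (UnicodeDecodeError, LookupError):
                return None  # give up if any part fails
    return "".join(lines)


converted = decode_text(b"abc\ncaf\xe9\n", "latin-1")
```

Note that a single-byte fallback such as Latin-1 accepts any byte sequence, so with such a locale the "fail" branch is never reached and binary data would be converted rather than rejected.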

jamie



Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Philip Van Hoof
On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
 Jamie McCracken wrote:
  trackerd should just pass directories at startup and let the indexer
  work out what to process. Dbus is not optimised for passing a large
  number of strings. Can the current design easily accommodate this?
 
 DBus' optimisation is not an issue here. I can send ALL of my files over
 quicker than the indexer can mtime check ALL the directories in the
 database.

DBus only starts to perform badly once the message size grows beyond 4 KB.
In 4 KB you can put quite a lot of URIs.

Therefore I don't think we should focus on reducing the number of URIs
we send from the daemon to the indexer.
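
If message size does turn out to matter, one option is to group the URIs so that each message stays under the 4 KB figure. The Python sketch below shows only the grouping. It counts the UTF-8 bytes of each URI plus one separator byte and ignores real D-Bus header and marshalling overhead, so the limit is an approximation; `batch_uris` is a made-up name.

```python
from typing import Iterable, Iterator, List

LIMIT = 4096  # the 4 KB figure mentioned above


def batch_uris(uris: Iterable[str], limit: int = LIMIT) -> Iterator[List[str]]:
    """Group URIs so each batch's payload (URI bytes + 1 separator each) fits in limit."""
    batch: List[str] = []
    size = 0
    for uri in uris:
        cost = len(uri.encode("utf-8")) + 1
        if batch and size + cost > limit:
            yield batch
            batch, size = [], 0
        batch.append(uri)
        size += cost
    if batch:
        yield batch


uris = ["file:///home/user/doc%d.txt" % i for i in range(1000)]
batches = list(batch_uris(uris))
```

A single URI longer than the limit still gets a batch of its own rather than being dropped.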

 Yes we can accommodate this. We simply send all files/directories to the
 indexer and the indexer can check each parent directory first then
 process the files or discard them if the parent directory mtime is up to
 date.


-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be






Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 16:35 +0200, Philip Van Hoof wrote:
 On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
  Jamie McCracken wrote:
   trackerd should just pass directories at startup and let the indexer
   work out what to process. Dbus is not optimised for passing a large
   number of strings. Can the current design easily accommodate this?
  
  DBus' optimisation is not an issue here. I can send ALL of my files over
  quicker than the indexer can mtime check ALL the directories in the
  database.
 
 DBus only starts to perform badly once the message size grows beyond 4 KB.
 In 4 KB you can put quite a lot of URIs.
 
 Therefore I don't think we should focus on reducing the number of URIs
 we send from the daemon to the indexer.
 

ok, but let's see how it performs first

I want startup of a previously indexed machine to be as good as, or
close to, trunk

jamie
