Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 16:35 +0200, Philip Van Hoof wrote:
> On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
> > Jamie McCracken wrote:
> > > trackerd should just pass directories at startup and let the indexer
> > > work out what to process. Dbus is not optimised for passing large number
> > > of strings. Can the current design easily accommodate this?
> > 
> > DBus' optimisation is not an issue here. I can send ALL of my files over
> > quicker than the indexer can mtime check ALL the directories in the
> > database.
> 
> DBus only starts to perform badly once the message size grows over 4 KB.
> In 4 KB you can put quite a lot of URIs.
> 
> Therefore I don't think we should focus on reducing the number of URIs
> we send from the daemon to the indexer.
> 

OK, but let's see how it performs first.

I want startup on a previously indexed machine to be as good as, or close
to, trunk.

jamie

___
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Philip Van Hoof
On Wed, 2008-09-03 at 15:31 +0100, Martyn Russell wrote:
> Jamie McCracken wrote:
> > trackerd should just pass directories at startup and let the indexer
> > work out what to process. Dbus is not optimised for passing large number
> > of strings. Can the current design easily accommodate this?
> 
> DBus' optimisation is not an issue here. I can send ALL of my files over
> quicker than the indexer can mtime check ALL the directories in the
> database.

DBus only starts to perform badly once the message size grows over 4 KB.
In 4 KB you can put quite a lot of URIs.

Therefore I don't think we should focus on reducing the number of URIs
we send from the daemon to the indexer.
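
To make the 4 KB point concrete, here is a minimal sketch of how a sender could group URIs so that each batch stays under a byte budget. The function name uris_fitting and the BATCH_BUDGET constant are hypothetical, not Tracker code, and the size estimate counts only the string bytes plus a terminator; real D-Bus marshalling adds some per-string overhead, so treat the numbers as an approximation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed budget, taken from the 4 KB figure mentioned in the thread. */
#define BATCH_BUDGET 4096

/* Returns how many consecutive URIs, starting at index `start`, fit into
 * `budget` bytes. At least one URI is always accepted when any remain, so
 * an oversized URI ends up in a batch of its own instead of stalling. */
size_t uris_fitting(const char *const *uris, size_t n, size_t start, size_t budget)
{
    size_t used = 0;
    size_t count = 0;
    size_t i;

    assert(uris != NULL);

    for (i = start; i < n; i++) {
        size_t cost = strlen(uris[i]) + 1; /* string bytes + terminator */

        if (count > 0 && used + cost > budget)
            break;

        used += cost;
        count++;
    }

    return count;
}
```

A caller would loop, sending uris_fitting(uris, n, start, BATCH_BUDGET) entries per message and advancing start by that count until it reaches n.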

> Yes we can accommodate this. We simply send all files/directories to the
> indexer and the indexer can check each parent directory first then
> process the files or discard them if the parent directory mtime is up to
> date.


-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be






Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 10:32 -0400, Jamie McCracken wrote:
> On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> > Hi,
> > 
> > So I have been reading up on the things that are remaining for merging.
> > This is the list I have so far which I will be working on:
> > 
> > * Check the move files/directories issue. I *think* it works.
> 
> check the new directory name can be searched when doing a rename
> 
> also check the new name is searchable against all items in that
> directory
> 
> > 
> > * Fix the get_file_contents() function so it checks for #13 in the first
> > 64Kb.

Also do what trunk does and validate each line. If a line fails UTF-8
validation, attempt to convert it from the locale encoding. Best to exit
with NULL if any part fails. I assume the GIO stuff handles non-UTF-8?

jamie



Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
> trackerd should just pass directories at startup and let the indexer
> work out what to process. Dbus is not optimised for passing large number
> of strings. Can the current design easily accommodate this?

DBus' optimisation is not an issue here. I can send ALL of my files over
quicker than the indexer can mtime check ALL the directories in the
database.

Yes we can accommodate this. We simply send all files/directories to the
indexer and the indexer can check each parent directory first then
process the files or discard them if the parent directory mtime is up to
date.
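
A minimal sketch of that per-directory check, assuming the indexer keeps the mtime it saw at the last index run. The function name dir_needs_rescan and the stored_mtime parameter are hypothetical; the real lookup would come from the database.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Returns 1 if the files in dir_path should be processed, 0 if they can
 * be discarded because the directory mtime matches the stored value.
 * stored_mtime stands in for whatever the index database recorded. */
int dir_needs_rescan(const char *dir_path, time_t stored_mtime)
{
    struct stat st;

    assert(dir_path != NULL);

    /* If the path is gone or is no longer a directory, be conservative
     * and ask for a rescan so the index can be corrected. */
    if (stat(dir_path, &st) != 0 || !S_ISDIR(st.st_mode))
        return 1;

    /* Any difference counts, including an mtime that moved backwards. */
    return st.st_mtime != stored_mtime;
}
```

With this shape the cost per directory is one stat() call plus one stored value, and the per-file work only happens for directories that report a change.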

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> Hi,
> 
> So I have been reading up on the things that are remaining for merging.
> This is the list I have so far which I will be working on:
> 
> * Check the move files/directories issue. I *think* it works.

check the new directory name can be searched when doing a rename

also check the new name is searchable against all items in that
directory

> 
> * Fix the get_file_contents() function so it checks for #13 in the first
> 64Kb.
> 
> * Make private libraries .so files to dynamically load them.

Also for the stemmers: make them dynamically loadable too.

> 
> * The directory mtime issue on startup.
> 

The same goes for summary files: only check them if the mtime has changed.

> Have I missed anything?

I think that is it. A lot of the prefs don't work, but that can wait
until after the merge.

I'm also adding my tracker-fts stuff to that branch, so I will likely
merge when the above plus my stuff is ready.

jamie





Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Jamie McCracken
On Wed, 2008-09-03 at 12:34 +0100, Martyn Russell wrote:
> Jamie McCracken wrote:
> >>> trunk only checks directories (If a file in a directory is modified then
> >>> the directories mtime is also altered so no need to check every file)
> >>> hence startup is much faster.
> 
> Note: ONLY the mtime of the parent directory is updated. This is not
> recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
> updated, but not that of bar/ or foo/.
> 
> This means you _HAVE_ to go into every directory to see if it has a
> subdirectory with an mtime that has updated.

That is what trunk does: it only checks directories (and
subdirectories). There's never any need to check the mtime of a file
unless the parent directory's mtime has changed.

> 
> >> We can do this. Can you guarantee that on EVERY file system type the
> >> parent directory mtime is updated when a file changes? I am not 100%
> >> sure this is the case.
> > 
> > on all major platforms yes (*nix and windows)
> 
> Hmm. This worries me. How mtime is used across file systems tends to vary
> slightly and this might come back to bite us.


It has not been a problem in the past for Tracker and certainly won't be
for our target audience.

> 
> > it is for me - its in the order of 3x slower than trunk at startup 
> 
> What exactly is 3x slower? The crawling?
> 
> I have been thinking about this. The best solution here to me is to send
> ALL files/directories to the indexer and let the indexer check the mtime
> of a directory before deciding to process the files it holds. This
> should dramatically reduce the DB lookups on startup. But if the
> slowness is NOT in the indexer, then there is little you can do except
> increase the throttle. Have you tested it again recently since I made
> throttle mandatory whenever it is called (i.e. it is 5+config value).
> This made a lot of difference for me.
> 


trackerd should just pass directories at startup and let the indexer
work out what to process. D-Bus is not optimised for passing large numbers
of strings. Can the current design easily accommodate this?


jamie



Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Hi,

So I have been reading up on the things that are remaining for merging.
This is the list I have so far which I will be working on:

* Check the move files/directories issue. I *think* it works.

* Fix the get_file_contents() function so it checks for #13 in the first
64Kb.

* Make private libraries .so files to dynamically load them.

* The directory mtime issue on startup.

Have I missed anything?

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
>>> trunk only checks directories (If a file in a directory is modified then
>>> the directories mtime is also altered so no need to check every file)
>>> hence startup is much faster.

Note: ONLY the mtime of the parent directory is updated. This is not
recursive. So if you have /foo/bar/baz/sliff.txt, the mtime of baz/ is
updated, but not that of bar/ or foo/.

This means you _HAVE_ to go into every directory to see if it has a
subdirectory with an mtime that has updated.
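
As a rough sketch of what that walk looks like, the function below descends into every directory and compares only directory mtimes against a reference time. The name count_changed_dirs and the single last_index_time parameter are assumptions for illustration; a real indexer would compare each directory against its own stored mtime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Counts the directories at or below root whose mtime is newer than
 * last_index_time. Only directory mtimes are compared; entries that are
 * not directories are ignored. lstat() is used so symlinks are not
 * followed, which avoids loops. */
long count_changed_dirs(const char *root, time_t last_index_time)
{
    struct stat st;
    struct dirent *entry;
    DIR *dir;
    long changed;

    assert(root != NULL);

    if (lstat(root, &st) != 0 || !S_ISDIR(st.st_mode))
        return 0;

    changed = (st.st_mtime > last_index_time) ? 1 : 0;

    dir = opendir(root);
    if (dir == NULL)
        return changed;

    while ((entry = readdir(dir)) != NULL) {
        char child[PATH_MAX];
        int written;

        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        written = snprintf(child, sizeof child, "%s/%s", root, entry->d_name);
        if (written < 0 || (size_t) written >= sizeof child)
            continue; /* path too long for this sketch */

        /* Recurse unconditionally: a change deep in the tree does not
         * touch the mtime of its ancestors. */
        changed += count_changed_dirs(child, last_index_time);
    }

    closedir(dir);
    return changed;
}
```

The unconditional recursion is the cost being discussed here: an unchanged foo/ says nothing about foo/bar/baz/, so every directory has to be visited even though no file mtime is ever compared.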

>> We can do this. Can you guarantee that on EVERY file system type the
>> parent directory mtime is updated when a file changes? I am not 100%
>> sure this is the case.
> 
> on all major platforms yes (*nix and windows)

Hmm. This worries me. How mtime is used across file systems tends to vary
slightly and this might come back to bite us.

> it is for me - its in the order of 3x slower than trunk at startup 

What exactly is 3x slower? The crawling?

I have been thinking about this. The best solution here to me is to send
ALL files/directories to the indexer and let the indexer check the mtime
of a directory before deciding to process the files it holds. This
should dramatically reduce the DB lookups on startup. But if the
slowness is NOT in the indexer, then there is little you can do except
increase the throttle. Have you tested it again recently since I made
throttle mandatory whenever it is called (i.e. it is 5+config value).
This made a lot of difference for me.

-- 
Regards,
Martyn


Re: [Tracker] tracker-indexer does not index all files

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:14 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Another potential crasher - unlike trunk get_file_content does no utf-8
>>> validation and also if file is bigger than MAX_TEXT cuts it off which is
>>> likely to not land on a valid utf-8 word break
>> This is true.
>>
>>> ideally do what trunk does and read file line by line so that we will
>>> never have a partial utf-8 fragment and the resulting text can be
>>> validated and converted from locale to utf-8 if necessary
>> I don't think reading line by line is a good idea at all.
>> All we need to do is use g_utf8_validate () on the length we read and
>> find out where the end is and make sure we don't read half way through a
>> UTF8 character.
> 
> how can you tell you are not in the middle of a utf8 char? Line break
> is the only char we can be sure of breaking on (CJK may not contain word
> breaks like spaces)

If you read the documentation for that function, it should return the
end position for what is valid if there is invalid utf8 in the stream
being read. It is safe to assume that we can read up to end-start for
parsing, since it will be valid UTF-8.
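
For illustration, here is a dependency-free sketch of the same idea: find the longest prefix of a buffer that is valid UTF-8, so a read that is cut off part-way never ends in the middle of a character. GLib's g_utf8_validate() reports a comparable position through its end argument. The function name utf8_valid_prefix_len is hypothetical and not part of GLib or Tracker, and edge cases such as embedded NUL bytes may be treated differently by GLib.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length in bytes of the longest prefix of buf that is
 * well-formed UTF-8. A sequence truncated by the end of the buffer is
 * excluded, so the result always lands on a character boundary. */
size_t utf8_valid_prefix_len(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL);

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) { /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            break; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= need)
            break; /* sequence is cut off by the end of the buffer */

        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (k <= need)
            break; /* a continuation byte was missing */

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;

        i += need + 1;
    }

    return i;
}
```

If the returned length equals the number of bytes read, the whole chunk is usable; if it is shorter only by a truncated trailing sequence, the remaining bytes can be carried over to the next read.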

Unless of course you are expecting to be able to parse non-UTF-8 content?

> to read line by line you can still use streams but check for #13 line
> break

Isn't that just an unnecessary check that (depending on the file) could
be quite a performance hit for a file with a lot of line breaks?

> I suggest read it in 64kb  chunks - if no line break (#13) is found then
> exit as its unlikely to be a valid text file that needs indexing

That is a good point. To some extent. I just worry about false positives
here, i.e. key/value files with some initial valid text and a binary
blob as a value. The first thing that springs to mind is a VCard. Not
sure to be honest.

> of course if you have a better idea (thats not slower) then Im all
> ears...

No, after some checking that actually seems the sanest idea. The only
issue there is false positives really. I can work on this.
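
A minimal sketch of the heuristic being agreed on, assuming the first chunk of the file is already in memory. The name looks_like_text and the TEXT_PROBE_SIZE constant are hypothetical. The thread writes the line break as #13 (carriage return); accepting line feed as well is an assumption of this sketch, since Unix text files normally end lines with it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Only the first 64 KB of the buffer is examined, as suggested above. */
#define TEXT_PROBE_SIZE (64 * 1024)

/* Returns 1 if a line break occurs within the first TEXT_PROBE_SIZE
 * bytes of buf, 0 otherwise. A 0 result suggests the file is unlikely
 * to be plain text worth indexing. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t probe;

    assert(buf != NULL);

    probe = len < TEXT_PROBE_SIZE ? len : TEXT_PROBE_SIZE;

    return memchr(buf, '\n', probe) != NULL ||
           memchr(buf, '\r', probe) != NULL;
}
```

This check does nothing about the false-positive case raised above: a key/value file with text up front and a binary blob later still passes, so a validation step on the content would still be needed.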

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Could we also reduce memory usage by not statically linking to the
>>> private libs libtracker-common and libtracker-db?
>> Those libraries should not be available for public use. Before doing so,
>> each API would have to be:
>>
>> a) Documented
>> b) Checked it needs to be public
>> c) Versioned
>> d) ...
>>
>> This is a lot of work and I don't think it is worth it.
>> I haven't looked at the footprints myself though.
>>
>>> currently my FTS module and the file-indexer-module are ~ 1MB in size
>>> due mostly to linking with them and im sure the size of trackerd and
>>> tracker-indexer could be made smaller too with only one instance of
>>> those libs in memory
>> How does the memory footprint compare to the old tracker?
>>
> 
> having looked at the contents of libtracker-common, most of the memory
> used is for the stemmers - we load them all into memory even though we
> only use one of them. i think making each language stemmer a dynamically
> loaded module should help reduce things

I can look into doing this.

-- 
Regards,
Martyn


Re: [Tracker] more issues with indexer-split

2008-09-03 Thread Martyn Russell
Jamie McCracken wrote:
> On Tue, 2008-09-02 at 12:23 +0100, Martyn Russell wrote:
>> Jamie McCracken wrote:
>>> Could we also reduce memory usage by not statically linking to the
>>> private libs libtracker-common and libtracker-db?
>> Those libraries should not be available for public use. Before doing so,
>> each API would have to be:
>>
>> a) Documented
>> b) Checked it needs to be public
>> c) Versioned
>> d) ...
>>
>> This is a lot of work and I don't think it is worth it.
>> I haven't looked at the footprints myself though.
> 
> 
> why would we do all that?
> 
> we would not be exporting the headers for those libs so no other apps
> outside of tracker source tree will be able to use it effectively
> 
> surely there are some examples of private libs that are not statically
> linked?

I clearly misunderstood. I thought you meant making them public for
general use. I think making them .so libs but privately used is a good idea.

-- 
Regards,
Martyn