hi Martyn! I reviewed the branch before comprehending your overview of what it does, so there are some comments on patches that are later completely overwritten. I've left them in case they are helpful in some way. It'd be nice to squash a bunch of the commits in the branch together though before merging, if you get time, so the history is clearer!
I've attached some actual comments, as a text file because I imagine gmail will mangle it otherwise (I should really stop using gmail :) Sam On Fri, Jun 6, 2014 at 3:16 PM, Martyn Russell <mar...@lanedo.com> wrote: > Hello all, > > I've recently been involved in a project to make Tracker the main indexer > and search engine for content on an embedded platform. > > Part of that project is about making remote content available in Tracker, > for example: 3rd party online cloud based content. > > This already presents some interesting problems for Tracker, because the > only way to handle 3rd party content (right now) is to build an entire > miner yourself, skipping a lot of the useful logic and filtering work that > Carlos and the team have added over the years. Specifically, I am talking > about TrackerIndexingTree, TrackerFileNotifier, TrackerFileSystem, > TrackerCrawler and TrackerMinerFs classes in libtracker-miner. > > As a result of this and keeping in mind that we've just released 1.0 and > are trying to stay stable for users, I've written a new branch trying to > allow a way to "feed" URIs into Tracker from the bottom and up through the > stack. That branch is called 'external-crawler': > > https://git.gnome.org/browse/tracker/log/?h=external-crawler > > So let me try to outline what this branch does or allows: > > 1. URIs are no longer restricted to file:// only (yes current master is if > you want to use most of the libtracker-miner classes). > > 2. There is a new TrackerEnumerator interface that allows for 3rd parties > to implement their own URI crawling solution for their backend. > > 3. A new TrackerFileEnumerator class implementing the TrackerEnumerator > interface provides the functionality that exists right now. This is > essentially just calling the GFileEnumerator API as we do in > master/tracker-1.0 branches right now. > > 4. Improves existing classes like TrackerIndexingTree, > TrackerFileNotifier, etc to take a GFile for the root URI, instead of > always assuming it's file://. > > So, please review *looks at Carlos mainly* :) and let me know thoughts on > the branch. > > I would like to merge to master at some point in the next few months in > time for 1.2.x releases. > > Thanks all, > > -- > Regards, > Martyn > > Founder & Director @ Lanedo GmbH. > http://www.linkedin.com/in/martynrussell > _______________________________________________ > tracker-list mailing list > tracker-list@gnome.org > https://mail.gnome.org/mailman/listinfo/tracker-list >
Most of my complaints are about naming here. Code looks solid. I didn't like the name 'external-crawler' but actually it is quite appropriate, if we consider a crawler a specific kind of miner which traverses a tree of resources. This branch allows adding crawlers for arbitrary URI schemes, but it doesn't seem to support monitoring at all. Is that planned? Seems that you could subclass TrackerMonitor to implement stuff like monitoring RSS feeds for change notifications, or whatever else the data source requires. I notice that this 'generic tree crawler' functionality is kind of similar to the generic 'browse media source' functionality that Grilo provides ... https://developer.gnome.org/grilo/unstable/GrlSource.html#grl-source-browse I wonder if we could make use of Grilo plugins in implementing miners for some data sources? Although Grilo already wraps Tracker, so we start to end up in danger of disappearing up our own fundamentals ... --- > commit 2d43b26b0f768d629728e696e1f380ed28c99040 > Author: Martyn Russell <mar...@lanedo.com> > Date: Wed May 7 12:15:44 2014 +0100 > > libtracker-miner: Miner now uses a proper MinerError API > > Previously, the error code was always 0. Now we have an enum people can > refer > to. > > This is particularly important for miner implementations and cases where > classes want to return an error due to an API being used *while* being > paused. > This is the case in TrackerMinerFS. > > diff --git a/src/libtracker-miner/tracker-miner-fs.c > b/src/libtracker-miner/tracker-miner-fs.c > index 40bb2bf..9e1e9fb 100644 > --- a/src/libtracker-miner/tracker-miner-fs.c > +++ b/src/libtracker-miner/tracker-miner-fs.c > @@ -3943,20 +3943,51 @@ tracker_miner_fs_get_indexing_tree (TrackerMinerFS > *fs) > * tracker_miner_fs_manually_notify_file: > * @fs: a #TrackerMinerFS > * @file: a #GFile > + * @error: a #GError > * > - * Returns: %TRUE if successful, %FALSE otherwise. > + * This API is only useful where the @fs was created using the > + * #TrackerMinerFS:external-crawler property set to %TRUE. By default > + * this is %FALSE. This allows 3rd party developers to 'push' their > + * data into Tracker to be processed, rather than requiring Tracker to > + * index the content itself by crawling the file system and monitoring > + * changes. > + * > + * This is also very helpful for cases where @file is not based on the > + * local file system (for example a cloud implementation) and you want > + * to tell Tracker about content to index more directly. > + * > + * Returns: %TRUE if successful, otherwise %FALSE and @error will be > + * set if a pointer is supplied. > * > * Since: 1.2. > **/ > gboolean Be clearer about what "processed" means: my understanding is in this case that a file enqueued for "processing" by this method will go through the filesystem miner, which adds basic stuff about mtime, filename etc. and then added to the store. One or more extractor/decorator miners may then add additional data. For file:/// URIs of certain MIME types the built-in tracker-extract procss will be one of those. How is tracker_miner_fs_manually_notify_file() different from the D-Bus calls IndexFile() and the proposed (GB#680834) IndexFileForProcess() ? Could one not be implemented in terms of the other? --- > commit e1171cba1041daf4a9604b04041d9996e55de953 > Author: Martyn Russell <mar...@lanedo.com> > Date: Wed May 7 13:31:36 2014 +0100 > > libtracker-miner: Introduce MinerFSQueue for injecting file changes > > This is used to signal TrackerFileNotifier and TrackerMinerFS from > external > crawlers about files and what has happened regarding them. > > diff --git a/src/libtracker-miner/tracker-file-notifier.c > b/src/libtracker-miner/tracker-file-notifier.c > index 17b3bf4..2903718 100644 > --- a/src/libtracker-miner/tracker-file-notifier.c > +++ b/src/libtracker-miner/tracker-file-notifier.c > @@ -1668,38 +1668,42 @@ tracker_file_notifier_get_file_iri > (TrackerFileNotifier *notifier, > } > > gboolean > -tracker_file_notifier_add_file (TrackerFileNotifier *notifier, > - GFile *file) > +tracker_file_notifier_signal_file (TrackerFileNotifier *notifier, > + TrackerMinerFSQueue queue_type, > + GFile *file, > + GFileType file_type) Slightly odd spacing here. > diff --git a/src/libtracker-miner/tracker-miner-fs.h > b/src/libtracker-miner/tracker-miner-fs.h > index 54c53ca..6cf2582 100644 > --- a/src/libtracker-miner/tracker-miner-fs.h > +++ b/src/libtracker-miner/tracker-miner-fs.h > @@ -118,6 +118,14 @@ typedef enum { > TRACKER_MINER_FS_ERROR_HAVE_CRAWLER, > } TrackerMinerFSError; > > +typedef enum { > + TRACKER_MINER_FS_QUEUE_NONE, > + TRACKER_MINER_FS_QUEUE_CREATED, > + TRACKER_MINER_FS_QUEUE_UPDATED, > + TRACKER_MINER_FS_QUEUE_DELETED, > + TRACKER_MINER_FS_QUEUE_MOVED > +} TrackerMinerFSQueue; > + I think TrackerMinerFSEventType would be a much better name. The queues are an implementation detail. --- > commit aac6cf88391775f8db7cab9c47e10b9b72999c49 > Author: Martyn Russell <mar...@lanedo.com> > Date: Fri May 9 15:19:30 2014 +0100 > > libtracker-miner: Fix warning on root GFile unref in FileSystem > > diff --git a/src/libtracker-miner/tracker-file-system.c > b/src/libtracker-miner/tracker-file-system.c > index 7d9a236..dd81f49 100644 > --- a/src/libtracker-miner/tracker-file-system.c > +++ b/src/libtracker-miner/tracker-file-system.c > @@ -356,7 +356,7 @@ file_system_finalize (GObject *object) > NULL); > g_node_destroy (priv->file_tree); > > - if (!priv->root) { > + if (priv->root) { > g_object_unref (priv->root); > } Definitely squash this into 'libtracker-miner: FileSystem _new() now takes GFile for root' --- > commit 0e8a80fdef2d2a93a61c2778270f2af66ad99702 > Author: Martyn Russell <mar...@lanedo.com> > Date: Mon May 12 17:55:08 2014 +0100 > > libtracker-miner: Fix paused check in MinerFS before notifying files > > diff --git a/src/libtracker-miner/tracker-miner-fs.c > b/src/libtracker-miner/tracker-miner-fs.c > index c17c98c..a8368bc 100644 > --- a/src/libtracker-miner/tracker-miner-fs.c > +++ b/src/libtracker-miner/tracker-miner-fs.c > @@ -3985,7 +3985,7 @@ tracker_miner_fs_manually_notify_file (TrackerMinerFS > *fs, > return FALSE; > } > > - if (!priv->external_crawler) { > + if (!priv->is_paused) { > g_set_error (error, > tracker_miner_error_quark (), > TRACKER_MINER_ERROR_PAUSED, > Would it make sense instead to make it impossible to pause the miner if external_crawler == TRUE ? If the miner is being pushed events, rather than polling for them like normal, I don't see why you'd ever want to pause it. Pause the thing that's pushing the events instead! --- > commit 87c30cc73629ff54ff8e5170ded499c2cbbda61a > Author: Martyn Russell <mar...@lanedo.com> > Date: Thu Jun 5 12:32:08 2014 +0100 > > libtracker-miner: Added TrackerEnumerator interface, > TrackerFileEnumerator class > > These are used to enumerate over children of a URI. The > TrackerFileEnumerator > uses GFileEnumerator and the TrackerEnumerator allows for other > implementations to supply GFile/GFileInfo data to Tracker for 3rd parties. > > Updated MinerFS TrackerDirectory flags for external crawlers I'm glad things iterated to this point, this looks really promising! diff --git a/src/libtracker-miner/tracker-miner-fs.c b/src/libtracker-miner/tracker-miner-fs.c > index d14e134..6c2c95d 100644 > --- a/src/libtracker-miner/tracker-miner-fs.c > +++ b/src/libtracker-miner/tracker-miner-fs.c > @@ -3021,23 +3021,26 @@ tracker_miner_fs_directory_add (TrackerMinerFS *fs, > GFile *file, > gboolean recurse) > { > - TrackerDirectoryFlags flags; > + TrackerDirectoryFlags flags = TRACKER_DIRECTORY_FLAG_NONE; > > g_return_if_fail (TRACKER_IS_MINER_FS (fs)); > g_return_if_fail (G_IS_FILE (file)); > > - flags = TRACKER_DIRECTORY_FLAG_MONITOR; > - > if (recurse) { > flags |= TRACKER_DIRECTORY_FLAG_RECURSE; > } > > - if (fs->priv->mtime_checking) { > - flags |= TRACKER_DIRECTORY_FLAG_CHECK_MTIME; > + if (!fs->priv->external_crawler) { > + flags |= TRACKER_DIRECTORY_FLAG_MONITOR; > + > + if (fs->priv->mtime_checking) { > + flags |= TRACKER_DIRECTORY_FLAG_CHECK_MTIME; > + } > } > > So if an external crawler is used, the monitoring is disabled. How do we get change notifications? (I'm guessing that's a separate problem). --- > commit 6321e7d196610daf64cc7bd8d8f656aa544cd643 > Author: Martyn Russell <mar...@lanedo.com> > Date: Thu May 15 14:25:12 2014 +0100 > > libtracker-miner: Added FileEnumerator unit test This commit also got TrackerCrawler to make use of TrackerEnumerator! You should mention that in the commit message. --- > commit 8cdc0188e2c4d341e6e9788370e9381e085ae34b > Author: Martyn Russell <mar...@lanedo.com> > Date: Fri May 16 12:11:00 2014 +0100 > > libtracker-miner: Fixed finalization logic for root unref, tests were > failing Squash into 'libtracker-miner: IndexingTree _new() now takes a GFile' ! Various commits follow that could also do with being squashed. --- > commit 8dc9a253041c1be4a8841d6a5f1533b564d60486 > Author: Martyn Russell <mar...@lanedo.com> > Date: Wed Jun 4 13:20:02 2014 +0100 > > libtracker-miner: Update all documentation around TrackerEnumerator > interface > > Including: > - How we use it in TrackerMinerFS > - TrackerEnumerator itself > - TrackerFileEnumerator > diff --git a/src/libtracker-miner/tracker-enumerator.c > b/src/libtracker-miner/tracker-enumerator.c > index 5cda32f..cc84849 100644 > --- a/src/libtracker-miner/tracker-enumerator.c > +++ b/src/libtracker-miner/tracker-enumerator.c > @@ -25,6 +25,53 @@ > > #include "tracker-enumerator.h" > > +/** > + * SECTION:tracker-enumerator > + * @short_description: Enumerate URI locations > + * @include: libtracker-miner/miner.h > + * > + * #TrackerEnumerator allows you to operate on a set of #GFiles, > + * returning a #GFileInfo structure for each file enumerated (e.g. > + * tracker_enumerator_get_children() will return a #GSList with the > + * children within a parent directory or structure for the URI. > + * > + * This function is similar to g_file_enumerator_next_files_async(), > + * unlike the g_file_enumerator_next() which will return one #GFile at > + * a time, this will return all children at once. That sounds like it could risk locking up the process, or even the system. Some local filesystems take a LONG time to 'ls' sometimes, and chew up huge amounts of IO while doing so. Is this definitely not going to be a performance regression? What if the enumerator is operating on some crazy web API that lists millions of resources? --- > commit 6829ac1c53c995e702efb245f5777835528693d4 > Author: Martyn Russell <mar...@lanedo.com> > Date: Thu Jun 12 19:08:03 2014 +0100 > > libtracker-miner: Make it possible to start the miner via DBus Commit message should say why this is needed; I don't really understand why. --- > commit 410923dfc7aaaff9f862d51092d2209d8dad3e13 > Author: Martyn Russell <mar...@lanedo.com> > Date: Tue Jul 1 10:42:31 2014 +0100 > > libtracker-miner: Added tracker_indexing_tree_get_master_root() and set > in constructed() > > This function gets the top level root for all roots, e.g. file:/// > > We also create the root nodes in constructed() not init() because then the > root GFile has been set in the properties when the object is initiated > I already find the use of the term "root" and "config_root" in TrackerIndexingTree to be quite confusing. Now we have the concept of the "master root", which is mostly called "root", but is different the other type of "root". I think we need some better words :) "Config roots" should perhaps become "configured index points", and the "master root"/"root" can just be referred to as "root" everywhere. --- > commit 011eccbd624aa494e42482c0d809ec588ee2a671 > Author: Martyn Russell <mar...@lanedo.com> > Date: Tue Jul 1 10:48:11 2014 +0100 > > libtracker-miner: Create nodes / cahces in constructed() not init() Typo: caches > We do this because the GFile which is the root for the TrackerFileSystem > is a > property which is set on object construction. If we try to do this in > init() > the root is unset at that point. So we wait until constructed where > properties > are guaranteed to be set. > > Also Chain parent constructed() in TrackerFileSystem --- > commit e5b5280b5f336b80b00d9c9671a72391ab1f0a9b > Author: Martyn Russell <mar...@lanedo.com> > Date: Thu Jul 17 19:32:32 2014 +0100 > > libtracker-control: Allow miner status to be gained from running processes > > Using environment variable: TRACKER_MINERS_DIR_DISABLED That's a weird name for an environment variable ... Overall the branch looks very promising! The main use I see is allowing Tracker to mirror metadata of remote content provided by various internet services. I'm not quite sure which services it actually makes sense to do this for ... I guess only the ones which we want to be able to locally search, run queries on or display in applications like GNOME Documents, GNOME Music etc. While you could theoretically write a TrackerCrawler for Spotify, I don't particularly want to mirror the entire Spotify database on my laptop! What might be more interesting is if we could reuse TrackerEnumerator implementations in other contexts. I mentioned Grilo above already: they have grl_source_browse() and grl_source_search() methods that are implemented by plugins and basically return lists of results for a given query. If they exposed a TrackerEnumerator interface then Tracker could then crawl and save metadata from any data source supported by Grilo ... which would be neat!
_______________________________________________ tracker-list mailing list tracker-list@gnome.org https://mail.gnome.org/mailman/listinfo/tracker-list