Hi list,

**Warning** This is a _long_ mail. Probably way too long, as people's
attention is going to drop. The fact that it's that long probably
expresses how confused I am.
Core developers, please do read it: it's about a PR on which someone has
been putting many, many hours.

This long-running pull request, on which the author has been putting a
lot of effort, is the Kalman filter pull request:
https://github.com/scikit-learn/scikit-learn/pull/862

I have been spending quite a while looking at this code and trying to
come up with a fair review and guidelines on how to integrate it in the
scikit. While several of us have given low-level feedback on things like
coding style, I must confess that I am not completely happy with the big
picture. I'd like to have a high level discussion on the mailing list on
how such a codebase can be well integrated in the scikit, if it can.

The code as it currently stands does not feel very usable in the setting
of scikit-learn, which is based on simple APIs and mostly on prediction
or transformation.

Antipatterns that I see
========================

a. The Kalman filter parameters pretty much have to be specified. Learning
   from data is theoretically possible but:

   1. It takes a lot of time (probably fixable using spectral algorithms
      http://www.cs.cmu.edu/~ggordon/spectral-learning/boots-slides.pdf )

   2. The current parametrisation is not natural for this purpose
      (fixable, but requires more understanding than I currently have)

   3. In my experience, the current implementation fails to learn
      reasonable parameters on what seem like simple problems.

   Specifying the parameters seems very problematic to me. Indeed, as
   can be seen from the example, the parameters to specify can hardly be
   guessed from vague prior knowledge and probably require some
   understanding of the theory behind the Kalman filter and some pen and
   paper work. This is fairly "unscikity".

b. The current object can hardly work outside of the training samples.
   While the contributed implementation caters for missing data, which is
   a precious feature in itself, it can only compute out-of-sample
   predictions for very few points (i.e. points neighboring given data
   points). This is a property of the Kalman filter, I believe. It is not
   a show-stopper in itself, but it limits the usefulness of the code in
   the scikit.
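
To make antipattern a concrete, here is a sketch of mine (not the PR's
API; all names are illustrative) of everything a user has to write down
by hand even for a toy 1-D constant-velocity tracking model:

```python
import numpy as np

# Toy 1-D constant-velocity model: the hidden state is [position, velocity]
# and we only observe the (noisy) position. Every quantity below has to be
# derived with pen and paper before the filter can run at all.
dt = 1.0
transition_matrix = np.array([[1.0, dt],
                              [0.0, 1.0]])    # x_{t+1} = F x_t + noise
observation_matrix = np.array([[1.0, 0.0]])   # y_t = H x_t + noise
transition_covariance = 0.01 * np.eye(2)      # process noise: a guess
observation_covariance = np.array([[0.5]])    # sensor noise: another guess
initial_state_mean = np.zeros(2)
initial_state_covariance = np.eye(2)
```

Six quantities for about the simplest non-trivial model there is, and
none of them can be guessed from vague prior knowledge. That is exactly
the problem.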

Designing an 'Estimator' API for Kalman filters
===============================================

To merge the code into scikit-learn, it has to implement an estimator
interface that enables non-experts to use it as much as possible to solve
the typical problems that scikit-learn tackles. I am struggling with how
to do this best, as I don't know Kalman filters very well and have never
used them on real problems.

It seems to me that the question becomes: do we do 'transform' or
'predict'? Given an object that does some form of data processing, this
is immediately what I may want to do.

Kalman filters can do prediction, in the sense of extrapolation, but
that will work only for a small number of time points, so I think that
we can set that aside for a while.

In my eyes, Kalman filters can do data transforms, in two ways. First,
they can do filtering, and that's probably the most natural and obvious.
Second, they can output the state space, thus increasing the
dimensionality of the feature space.
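
As a thought experiment (my sketch, not the PR's code), the 'transform'
view could simply run the standard filtering recursions and map an
(n_samples, n_obs) measurement array to the (n_samples, n_states)
filtered state means:

```python
import numpy as np

def kalman_filter_transform(X, F, H, Q, R, x0, P0):
    """Run the standard Kalman filter recursions on the measurements X,
    shape (n_samples, n_obs), and return the filtered state means,
    shape (n_samples, n_states): a 'transform' into state space."""
    n_samples = X.shape[0]
    n_states = F.shape[0]
    states = np.empty((n_samples, n_states))
    x, P = x0, P0
    for t in range(n_samples):
        # Predict step: propagate the state through the dynamics
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step: correct with the new observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (X[t] - H @ x)
        P = P - K @ H @ P
        states[t] = x
    return states
```

Note that when n_states > n_obs, as in the constant-velocity example,
this transform actually increases the dimensionality of the feature
space, which is the second use I mentioned.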

The real challenge is that Kalman filters have a notion of dependence
across the samples. As long as we deal with one continuous measurement
vector, things are simple, but as soon as we start giving new
observations, we may want to relate them to those previously seen. We
will most probably break cross-validation. This is a general problem that
we will have with all models having some notion of time series.
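
For illustration (a sketch of mine, not an existing scikit-learn
helper), a time-series-aware cross-validation scheme would have to chain
forward instead of shuffling, so that no model is ever evaluated on
samples that precede its training data:

```python
import numpy as np

def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs that respect temporal
    order: each test block is only ever paired with strictly earlier
    training data. A plain shuffled k-fold would leak future samples
    into the training set, which is meaningless for time series."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = np.arange(0, k * fold_size)
        test = np.arange(k * fold_size, (k + 1) * fold_size)
        yield train, test
```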

To merge or not to merge?
===========================

Reasons to merge
-----------------

1. Kalman filters might come in handy to do feature transforms when
   working with time-series data. However, I don't do this myself, so I
   don't know how to design an API to make that possible.

2. The contributor is a fighter.

3. We already have HMMs, and Kalman filters directly relate to HMMs (it
   seems to me that HMMs currently have problem b but not problem a).

Reasons not to merge
---------------------

1. "Antipattern" a (the necessity to specify complex model parameters)
   is really a killer for me. In the current situation, I find that it
   limits the usefulness of the code. That said, going beyond these
   problems probably requires i) a reparametrisation of the problem, to
   be able to specify things like the dimensionality of the state space,
   and ii) applying regularization, which might be beyond the
   contributor's initial goals.

2. As a community, we do not really have the knowledge to maintain this
   codebase if the contributor goes MIA (missing in action). I'd like to
   be somewhat convinced that the guy is going to use it in something
   close to 'production' settings.

3. scikits.statsmodels already has other algorithms for Kalman filters.
   Maybe the problem would fit better in the corresponding API and use
   cases.

4. If we go down that path, we must start thinking about how time series
   should be supported in scikit-learn. I am not at all opposed to that.
   Actually, I am quite thrilled. However, we need to keep in mind that
   it will make many things more complicated. For this to be an option,
   I think that we need active developers with these use cases who are
   ready to invest significant effort in this direction. If not, I am
   afraid that it will remain wishful thinking.

I'd like to stress that I am not a believer in the strategy of merging as
many features as possible without worrying about how they fit together. The
bigger the scikit becomes, the harder it becomes to maintain it, and to
give a clear picture to our users. In my experience, a project should be
driven by a 'vision', that is simple to explain to potential users and
guides technical and API choices [*].


So, should we merge or not? In a better world, I would probably say that
such a code should be in a 'scikit-signal', and not 'scikit-learn'.
However, there is no scikit-signal.

I'd like a discussion, so that we can give the guy clear feedback.

Thanks for your input!

Gael

[*] http://jamesshore.com/Agile-Book/vision.html 
http://www.rastinmehr.com/2009/09/14/does-your-software-project-have-a-vision-and-design-philosophy/

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
