Milan, Jeff, Patrick,

Thank you for your comments and suggestions.

Milan,

This is far from a "completely theoretical problem".  I am performing text
analytics on a corpus of about 2 million documents.  There are tens of thousands
of distinct words (lemmata).  It seems to me that the natural
representation of words is as an "enumeration type" -- in R terms, a
"factor".

Why do I think factors are the "natural way" of representing such things?
Because for most kinds of analysis, only their *identity* matters (not
their spelling as words), but the human user would like to see names, not
numbers. That is pretty much the definition of an enumeration type. In
terms of R implementation, R is very efficient in dealing with integer
identities and indexing (e.g. tabulate) and not very efficient in dealing
with character identities -- indeed, 'table' first converts strings into
factors.  Of course I could represent the lemmata as integers, and perform
the translation between integers and strings myself, but that would just be
duplicating the function of an enumeration type.
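
As a small illustration of the point above, here is a minimal sketch on toy
data (the vector `lemmata` is a hypothetical stand-in, not the actual corpus).
It shows a factor carrying integer identities that 'tabulate' can count
directly, with the names reattached only for display:

```r
# Hypothetical toy data standing in for the lemmata of a corpus.
lemmata <- c("run", "walk", "run", "eat", "walk", "run")

# A factor stores each distinct string once (the levels) plus integer codes.
f <- factor(lemmata)
codes <- as.integer(f)        # integer identity of each token

# Counting works on the integer codes alone; no string comparison is needed.
counts <- tabulate(codes, nbins = nlevels(f))

# Names are reattached only so that a human reader sees words, not numbers.
names(counts) <- levels(f)
print(counts)

# The original strings can always be recovered from the codes.
recovered <- levels(f)[codes]
```

Doing the integer/string translation by hand would amount to reimplementing
exactly the `levels(f)[codes]` lookup shown in the last line.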

Jeffrey,

Extending R "via the mechanisms in place" is exactly what I have in mind.
Of course, if it's already been done, I'd rather reuse that work than
start from scratch, which is why my message explicitly asks if there is a
"factors package using this or some similar approach".  I did search CRAN,
and wasn't able to find such a thing, but I may have missed something,
which is why I sent my message to the list.

Patrick,

The data.table package certainly has some useful mechanisms, and I've been
experimenting with it as an implementation mechanism, though it's not a
drop-in substitute for factors.  Also, though it is efficient for set
operations between small sets and large sets, it is not very efficient for
operations between two large sets -- I am working with its implementors to
see if we can put in place a better algorithm based on e.g. Demaine et al.
<http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963> and
Barbay et al. <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.9365>.
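
To make the small-versus-large trade-off concrete, here is a rough sketch of
intersecting two sorted, duplicate-free integer vectors with a galloping
(doubling) search. This is not the algorithm from either paper above, and it
is not how data.table works internally; the function name `gallop_intersect`
is hypothetical, and the code only illustrates the general idea of skipping
ahead quickly in one set while walking through the other:

```r
# Assumes a and b are sorted in increasing order and contain no duplicates.
gallop_intersect <- function(a, b) {
  out <- integer(0)
  nb <- length(b)
  lo <- 1L                       # invariant: every b[j] with j < lo is < x
  for (x in a) {
    if (lo > nb) break
    # Gallop: double the step until b[hi] >= x or we pass the end of b.
    step <- 1L
    hi <- lo
    while (hi <= nb && b[hi] < x) {
      lo <- hi + 1L
      hi <- hi + step
      step <- step * 2L
    }
    # Binary search in [lo, min(hi, nb)] for the first position with b >= x.
    l <- lo
    r <- min(hi, nb) + 1L
    while (l < r) {
      m <- (l + r) %/% 2L
      if (b[m] < x) l <- m + 1L else r <- m
    }
    lo <- l
    if (lo <= nb && b[lo] == x) out <- c(out, x)
  }
  out
}

small <- c(1L, 3L, 5L, 7L, 11L)
large <- c(2L, 3L, 5L, 8L, 11L, 13L)
res <- gallop_intersect(small, large)
```

On sorted, duplicate-free input the result should agree with base R's
`intersect(small, large)`. Growing `out` with `c()` keeps the sketch short but
would need replacing with a preallocated buffer for large inputs.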

Thanks everyone, and if you do come across a relevant CRAN package, I'd be
very interested in hearing about it.

          -s

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
