Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Stavros Macrakis
Matthew,

Yes, the case I am thinking of is a 1-column key; sorry for the
overgeneralization.  I haven't thought much about the multi-column key case.

-s

On Mon, Nov 7, 2011 at 12:48, Matthew Dowle wrote:

> Stavros Macrakis (alum.mit.edu) writes:
> >
> > data.table certainly has some useful mechanisms, and I've been
> > experimenting with it as an implementation mechanism, though it's not a
> > drop-in substitute for factors.  Also, though it is efficient for set
> > operations between small sets and large sets, it is not very efficient
> for
> > operations between two large sets
>
> As a general statement, that could do with some clarification. ;) data.table
> likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
> believe) efficient for joining two large data sets keyed on 2+ columns,
> because the upper bound of each row's one-sided binary search is localised
> in that case (to the group formed by the previous key column).
>
> As I understand it, Stavros has a different type of 'two large datasets':
> English-language website data, where each set is one large vector of
> uniformly distributed unique strings. That is quite a different problem from
> multiple columns of heavily duplicated data.
>
> Matthew
>
> > Thanks everyone, and if you do come across a relevant CRAN package, I'd
> be
> > very interested in hearing about it.
> >
> >   -s
> >
>




Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Milan Bouchet-Valat
On Sunday, 6 November 2011 at 19:00 -0500, Stavros Macrakis wrote:
> Milan, Jeff, Patrick,
> 
> 
> Thank you for your comments and suggestions.
> 
> 
> Milan,
> 
> 
> This is far from a "completely theoretical problem".  I am performing
> text analytics on a corpus of about 2m documents.  There are tens of
> thousands of distinct words (lemmata).  It seems to me that the
> natural representation of words is as an "enumeration type" -- in R
> terms, a "factor".
Interesting. What does your data look like? I've used the tm package,
and for me there are only two representations of text corpora: a list of
texts, each of which is basically a character string with attributes; and
a document-term matrix, with documents as rows, terms as columns, and
counts at their intersections.
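
For concreteness, a minimal tm sketch of both representations (toy
corpus; object names are just illustrative):

library(tm)
docs <- c("the cat sat", "the dog sat", "the cat ran")
corp <- Corpus(VectorSource(docs))  # representation 1: a list of texts
dtm  <- DocumentTermMatrix(corp)    # representation 2: docs x terms, counts
inspect(dtm)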


So I wonder how you're using factors. Do you have a factor containing
words for each text?

> Why do I think factors are the "natural way" of representing such
> things?  Because for most kinds of analysis, only their identity
> matters (not their spelling as words), but the human user would like
> to see names, not numbers. That is pretty much the definition of an
> enumeration type. In terms of R implementation, R is very efficient in
> dealing with integer identities and indexing (e.g. tabulate) and not
> very efficient in dealing with character identities -- indeed, 'table'
> first converts strings into factors.  Of course I could represent the
> lemmata as integers, and perform the translation between integers and
> strings myself, but that would just be duplicating the function of an
> enumeration type.
My point was that the efficiency of factors is due to the redundancy of
their levels. You usually have very few levels and many observations
(in my work, often 10 levels and 100,000s of observations). If each
level appears only a few times on average, you don't save much memory
by using a factor.
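
A rough illustration of that point (exact sizes vary by platform and R
version, so treat this as a sketch):

x <- sample(letters[1:10], 1e5, replace = TRUE)  # 10 levels, much repeated
y <- paste("w", sample(1e5), sep = "")           # ~unique strings
object.size(x); object.size(factor(x))  # factor noticeably smaller
object.size(y); object.size(factor(y))  # factor saves little, if anything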

Since you have a real use case, I withdraw my criticism that your
suggestion is useless. ;-) But I'm still not sure the R core developers
would want to follow it, since your application could be considered
non-standard, and better served by a specialized class.


Cheers



Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Matthew Dowle
Stavros Macrakis (alum.mit.edu) writes:
> 
> data.table certainly has some useful mechanisms, and I've been
> experimenting with it as an implementation mechanism, though it's not a
> drop-in substitute for factors.  Also, though it is efficient for set
> operations between small sets and large sets, it is not very efficient for
> operations between two large sets

As a general statement, that could do with some clarification. ;) data.table
likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
believe) efficient for joining two large data sets keyed on 2+ columns,
because the upper bound of each row's one-sided binary search is localised
in that case (to the group formed by the previous key column).
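
To be concrete, a minimal sketch of the kind of keyed join I mean (sizes
and column names are just illustrative):

library(data.table)
n <- 1e6
X <- data.table(id = sample(1e3, n, replace = TRUE),
                date = sample(1e3, n, replace = TRUE), vx = rnorm(n))
Y <- data.table(id = sample(1e3, n, replace = TRUE),
                date = sample(1e3, n, replace = TRUE), vy = rnorm(n))
setkey(X, id, date)   # sort by (id, date); enables binary-search joins
setkey(Y, id, date)
ans <- X[Y]           # join: each lookup is bounded within its id group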

As I understand it, Stavros has a different type of 'two large datasets':
English-language website data, where each set is one large vector of
uniformly distributed unique strings. That is quite a different problem from
multiple columns of heavily duplicated data.

Matthew

> Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> very interested in hearing about it.
> 
>   -s
>



[Rd] Efficiency of factor objects

2011-11-06 Thread Stavros Macrakis
Milan, Jeff, Patrick,

Thank you for your comments and suggestions.

Milan,

This is far from a "completely theoretical problem".  I am performing text
analytics on a corpus of about 2m documents.  There are tens of thousands
of distinct words (lemmata).  It seems to me that the natural
representation of words is as an "enumeration type" -- in R terms, a
"factor".

Why do I think factors are the "natural way" of representing such things?
 Because for most kinds of analysis, only their *identity* matters (not
their spelling as words), but the human user would like to see names, not
numbers. That is pretty much the definition of an enumeration type. In
terms of R implementation, R is very efficient in dealing with integer
identities and indexing (e.g. tabulate) and not very efficient in dealing
with character identities -- indeed, 'table' first converts strings into
factors.  Of course I could represent the lemmata as integers, and perform
the translation between integers and strings myself, but that would just be
duplicating the function of an enumeration type.
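
A contrived illustration of the point:

words <- c("dog", "cat", "dog", "bird", "cat", "dog")
f <- factor(words)
counts <- tabulate(f, nbins = nlevels(f))  # fast counting on integer codes
names(counts) <- levels(f)                 # ...but names for the human reader
counts
levels(f)[which.max(counts)]               # the most frequent word, by name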

Jeffrey,

Extending R "via the mechanisms in place" is exactly what I have in mind.
 Of course, if it's already been done, I'd rather reuse that work than
start from scratch, which is why my message explicitly asks if there is a
"factors package using this or some similar approach".  I did search CRAN,
and wasn't able to find such a thing, but I may have missed something,
which is why I sent my message to the list.

Patrick,

Data.table certainly has some useful mechanisms, and I've been
experimenting with it as an implementation mechanism, though it's not a
drop-in substitute for factors.  Also, though it is efficient for set
operations between small sets and large sets, it is not very efficient for
operations between two large sets -- I am working with its implementors to
see if we can put in place a better algorithm based on, e.g., Demaine et
al. and Barbay et al. on adaptive set intersection.
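
To give the flavour of those papers, here is a rough pure-R sketch of one
ingredient, intersecting two sorted duplicate-free integer vectors by
galloping (doubling) search.  The function name is mine and this is
illustrative only -- a real implementation would be in C:

gallop_intersect <- function(a, b) {
  # move cursor i of v forward past elements known to be < target
  advance <- function(v, i, target) {
    step <- 1                         # gallop: double the step until we
    while (i + step <= length(v) && v[i + step] < target) step <- step * 2
    lo <- i + 1; hi <- min(i + step, length(v))
    while (lo < hi) {                 # then binary search in the window
      mid <- (lo + hi) %/% 2
      if (v[mid] < target) lo <- mid + 1 else hi <- mid
    }
    lo
  }
  out <- integer(0); i <- 1; j <- 1
  while (i <= length(a) && j <= length(b)) {
    if (a[i] == b[j]) { out <- c(out, a[i]); i <- i + 1; j <- j + 1 }
    else if (a[i] < b[j]) i <- advance(a, i, b[j])
    else                  j <- advance(b, j, a[i])
  }
  out                                 # grown naively with c() for brevity
}

The point is that the cost adapts to how interleaved the two sets are,
rather than being proportional to their combined length.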

Thanks everyone, and if you do come across a relevant CRAN package, I'd be
very interested in hearing about it.

  -s




Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Patrick Burns

Perhaps 'data.table' on CRAN would be an acceptable
package for this.

On 05/11/2011 16:45, Jeffrey Ryan wrote:

Or better still, extend R via the mechanisms in place.  Something akin
to a fast factor package.  Any change to R causes downstream issues in
(hundreds of?) millions of lines of deployed code.

It almost seems hard to fathom that a package for this doesn't already
exist. Have you searched CRAN?

Jeff



On Sat, Nov 5, 2011 at 11:30 AM, Milan Bouchet-Valat wrote:

On Friday, 4 November 2011 at 19:19 -0400, Stavros Macrakis wrote:

R factors are the natural way to represent factors -- and should be
efficient since they use small integers.  But in fact, for many (but
not all) operations, R factors are considerably slower than integers,
or even character strings.  This appears to be because whenever a
factor vector is subsetted, the entire levels vector is copied.

Is it so common for a factor to have so many levels? One can probably
argue that, in that case, using a numeric or character vector is
preferable -- factors are no longer the "natural way" of representing
this kind of data.

Adding code to fix a completely theoretical problem is generally not a
good idea. I think you'd have to come up with a real use case to have
any hope of convincing the developers that a change is needed. There are
probably many more interesting areas where speedups can be gained than
this one.


Regards


--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')



Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Jeffrey Ryan
Or better still, extend R via the mechanisms in place.  Something akin
to a fast factor package.  Any change to R causes downstream issues in
(hundreds of?) millions of lines of deployed code.

It almost seems hard to fathom that a package for this doesn't already
exist. Have you searched CRAN?

Jeff



On Sat, Nov 5, 2011 at 11:30 AM, Milan Bouchet-Valat wrote:
> On Friday, 4 November 2011 at 19:19 -0400, Stavros Macrakis wrote:
>> R factors are the natural way to represent factors -- and should be
>> efficient since they use small integers.  But in fact, for many (but
>> not all) operations, R factors are considerably slower than integers,
>> or even character strings.  This appears to be because whenever a
>> factor vector is subsetted, the entire levels vector is copied.
> Is it so common for a factor to have so many levels? One can probably
> argue that, in that case, using a numeric or character vector is
> preferable -- factors are no longer the "natural way" of representing
> this kind of data.
>
> Adding code to fix a completely theoretical problem is generally not a
> good idea. I think you'd have to come up with a real use case to have
> any hope of convincing the developers that a change is needed. There are
> probably many more interesting areas where speedups can be gained than
> this one.
>
>
> Regards
>



-- 
Jeffrey Ryan
jeffrey.r...@lemnica.com

www.lemnica.com
www.esotericR.com



Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Milan Bouchet-Valat
On Friday, 4 November 2011 at 19:19 -0400, Stavros Macrakis wrote:
> R factors are the natural way to represent factors -- and should be
> efficient since they use small integers.  But in fact, for many (but
> not all) operations, R factors are considerably slower than integers,
> or even character strings.  This appears to be because whenever a
> factor vector is subsetted, the entire levels vector is copied.
Is it so common for a factor to have so many levels? One can probably
argue that, in that case, using a numeric or character vector is
preferable -- factors are no longer the "natural way" of representing
this kind of data.

Adding code to fix a completely theoretical problem is generally not a
good idea. I think you'd have to come up with a real use case to have
any hope of convincing the developers that a change is needed. There are
probably many more interesting areas where speedups can be gained than
this one.


Regards



[Rd] Efficiency of factor objects

2011-11-04 Thread Stavros Macrakis
R factors are the natural way to represent factors -- and should be
efficient since they use small integers.  But in fact, for many (but
not all) operations, R factors are considerably slower than integers,
or even character strings.  This appears to be because whenever a
factor vector is subsetted, the entire levels vector is copied.  For
example:

> i1 <- sample(1e4,1e6,replace=T)
> c1 <- paste('x',i1)
> f1 <- factor(c1)
> system.time(replicate(1e4,{q1<-i1[100:200];1}))
   user  system elapsed
   0.03    0.00    0.04
> system.time(replicate(1e4,{q1<-c1[100:200];1}))
   user  system elapsed
   0.04    0.00    0.04
> system.time(replicate(1e4,{q1<-f1[100:200];1}))
   user  system elapsed
   0.67    0.00    0.68

Putting the levels vector in an environment speeds up subsetting:

myfactor <- function(...) {
  f <- factor(...)
  g <- unclass(f)                     # keep the integer codes
  class(g) <- "myfactor"
  # store the levels inside an environment, so that subsetting the
  # vector shares them by reference instead of copying them
  attr(g, "mylevels") <- as.environment(list(levels = levels(f)))
  g
}

`[.myfactor` <- function(x, ...) {
  y <- NextMethod("[")
  attributes(y) <- attributes(x)      # class and mylevels carried over
  y
}

> m1 <- myfactor(f1)
> system.time(replicate(1e4,{q1<-m1[100:200];1}))
   user  system elapsed
   0.05    0.00    0.04

Given R's value semantics, I believe this approach can be extended to
most of class factor's functionality without problems, copying the
environment when necessary.  Some quick tests suggest that this is no
slower than ordinary factors, even for very small numbers of levels.
To do this, appropriate methods for this class (print, [<-, levels<-,
etc.) would have to be written; perhaps some core R functions would
also have to be changed?
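
For example, a levels<- method might copy the environment before
modifying it, preserving value semantics for any other vectors sharing
it (an untested sketch):

levels.myfactor <- function(x) attr(x, "mylevels")$levels

`levels<-.myfactor` <- function(x, value) {
  e <- new.env()             # fresh environment: a manual copy-on-modify,
  e$levels <- value          # so other vectors still pointing at the old
  attr(x, "mylevels") <- e   # environment are unaffected
  x
}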

Am I missing some obvious flaw in this approach?  Has anyone already
implemented a factors package using this or some similar approach?

Thanks,

 -s
