[Rd] stringsAsFactors = FALSE

2008-11-17 Thread hadley wickham
Hi all,

I love the option to not automatically convert strings into factors,
but there are three places that the current option doesn't work where
I think it should:

options(stringsAsFactors = FALSE)

str(expand.grid(letters))
str(type.convert(letters))

df <- read.fwf(textConnection(paste(letters, collapse = "\n")), 1)
str(df)

I think type.convert and read.fwf can be fixed by giving them a
stringsAsFactors argument and then using as.is = !stringsAsFactors
(like read.table).  The key lines in expand.grid would seem to be

if (!is.factor(x) && is.character(x))
    x <- factor(x, levels = unique(x))

but I'm not sure why they are being converted to factors in the first place.
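
For reference, a minimal sketch of how the three calls can be made to return character data by passing the relevant argument explicitly instead of relying on the global option. It assumes a version of R in which expand.grid() accepts a stringsAsFactors argument (later versions do; the version discussed in this thread did not). The as.is argument is the existing one on type.convert() and read.table(), which read.fwf() forwards to.

```r
# Assumption: expand.grid() has a stringsAsFactors argument (later R versions).
g <- expand.grid(x = letters, stringsAsFactors = FALSE)

# as.is = TRUE asks type.convert() to leave strings as character.
tc <- type.convert(letters, as.is = TRUE)

# read.fwf() passes extra arguments such as as.is on to read.table().
con <- textConnection(paste(letters, collapse = "\n"))
df <- read.fwf(con, widths = 1, as.is = TRUE)
close(con)

str(g)
str(tc)
str(df)
```

With the arguments spelled out, the result does not depend on how options() happens to be set in the session.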

Regards,

Hadley

-- 
http://had.co.nz/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread Prof Brian Ripley

On Mon, 17 Nov 2008, hadley wickham wrote:


> Hi all,
>
> I love the option to not automatically convert strings into factors,
> but there are three places that the current option doesn't work where
> I think it should:


Perhaps you mean 'when I would like it to'?   Things *should* work as 
documented, surely?



> options(stringsAsFactors = FALSE)
>
> str(expand.grid(letters))
> str(type.convert(letters))
>
> df <- read.fwf(textConnection(paste(letters, collapse = "\n")), 1)
> str(df)


I get


> str(df)
'data.frame':   26 obs. of  1 variable:
 $ V1: chr  "a" "b" "c" "d" ...

so what is wrong with that?  read.fwf just calls read.table, so the 
default options of read.table apply.



> I think type.convert and read.fwf can be fixed by giving them a
> stringsAsFactors argument and then using as.is = !stringsAsFactors
> (like read.table).


Seems to me that there is nothing wrong with read.fwf.  For type.convert() 
we could have the default


as.is = !default.stringsAsFactors()

but I think a strong case needs to be made to change the documented 
behaviour.
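
One way a caller could get the proposed interface without any change to type.convert() itself is a thin wrapper. The sketch below is only an illustration: the name type_convert_saf and its stringsAsFactors argument are hypothetical, and the default is a fixed value rather than a lookup of the global option.

```r
# Hypothetical wrapper (not part of R): exposes a stringsAsFactors argument
# and maps it onto type.convert()'s existing as.is argument.
type_convert_saf <- function(x, stringsAsFactors = FALSE, ...) {
  utils::type.convert(x, as.is = !stringsAsFactors, ...)
}

class(type_convert_saf(letters))                           # "character"
class(type_convert_saf(letters, stringsAsFactors = TRUE))  # "factor"
class(type_convert_saf(c("1", "2", "3")))                  # "integer"
```

Numeric-looking input is still converted as usual; only the treatment of genuine strings changes.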



> The key lines in expand.grid would seem to be
>
>   if (!is.factor(x) && is.character(x))
>       x <- factor(x, levels = unique(x))
>
> but I'm not sure why they are being converted to factors in the first place.


Nor am I, but it goes back to at least r2107, over 10 years ago.  I don't 
see much problem with adding a 'stringsAsFactors' argument there.


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread William Dunlap
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of hadley wickham
> Sent: Monday, November 17, 2008 5:10 AM
> To: r-devel@r-project.org
> Subject: [Rd] stringsAsFactors = FALSE
> ...
> The key lines in
> expand.grid would seem to be
>
> if (!is.factor(x) && is.character(x))
>     x <- factor(x, levels = unique(x))
>
> but I'm not sure why they are being converted to factors in
> the first place.

I think expand.grid converts input strings to factors so they
retain the order they have in the input.  (Note that the levels
argument is unique(x), not the sort(unique(x)) that data.frame uses.)
People generally give expand.grid sorted input and expect it to
not alter the order (the order of the levels affects tables and
some plots).


> lapply(expand.grid(Grade=c("Bad","Good","Better"), Size=c("Small","Medium","Large")), levels)
$Grade
[1] "Bad"    "Good"   "Better"

$Size
[1] "Small"  "Medium" "Large"


> lapply(data.frame(Grade=c("Bad","Good","Better"), Size=c("Small","Medium","Large")), levels)
$Grade
[1] "Bad"    "Better" "Good"

$Size
[1] "Large"  "Medium" "Small"


I have nothing against adding the stringsAsFactors argument to
expand.grid.
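
The difference between the two level orders, and how it carries through to a table, can be seen without expand.grid() at all. The sketch below only uses factor() and table(); the variable names are illustrative.

```r
grade <- c("Bad", "Good", "Better")

# Levels in order of first appearance, as in the expand.grid() code quoted above.
as_given <- factor(grade, levels = unique(grade))

# Default factor() levels are sort(unique(x)).
as_sorted <- factor(grade)

levels(as_given)   # "Bad" "Good" "Better"
levels(as_sorted)  # "Bad" "Better" "Good"

# Tables (and anything else driven by level order) follow the levels,
# not the order of the data.
names(table(as_given))
names(table(as_sorted))
```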

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread Prof Brian Ripley

On Mon, 17 Nov 2008, Prof Brian Ripley wrote:


> On Mon, 17 Nov 2008, hadley wickham wrote:
>
>> Hi all,
>>
>> I love the option to not automatically convert strings into factors,
>> but there are three places that the current option doesn't work where
>> I think it should:
>
> Perhaps you mean 'when I would like it to'?   Things *should* work as
> documented, surely?
>
>> options(stringsAsFactors = FALSE)
>>
>> str(expand.grid(letters))
>> str(type.convert(letters))
>>
>> df <- read.fwf(textConnection(paste(letters, collapse = "\n")), 1)
>> str(df)
>
> I get
>
>> str(df)
> 'data.frame':   26 obs. of  1 variable:
>  $ V1: chr  "a" "b" "c" "d" ...
>
> so what is wrong with that?  read.fwf just calls read.table, so the default
> options of read.table apply.
>
>> I think type.convert and read.fwf can be fixed by giving them a
>> stringsAsFactors argument and then using as.is = !stringsAsFactors
>> (like read.table).
>
> Seems to me that there is nothing wrong with read.fwf.  For type.convert() we
> could have the default
>
>   as.is = !default.stringsAsFactors()
>
> but I think a strong case needs to be made to change the documented
> behaviour.

It seems only to be used in RODBC (where I have some extra control 
pending), simecol and BioC:beadarraySNP (both with as.is=TRUE) and reshape 
(author, one Hadley Wickham).  Given it is documented as a help utility, it 
seems up to the caller to set the behaviour he wants.





>> The key lines in expand.grid would seem to be
>>
>>   if (!is.factor(x) && is.character(x))
>>       x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in the first
>> place.
>
> Nor am I, but it goes back to at least r2107, over 10 years ago.  I don't see
> much problem with adding a 'stringsAsFactors' argument there.





--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread hadley wickham
On Mon, Nov 17, 2008 at 11:06 AM, William Dunlap [EMAIL PROTECTED] wrote:
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of hadley wickham
>> Sent: Monday, November 17, 2008 5:10 AM
>> To: r-devel@r-project.org
>> Subject: [Rd] stringsAsFactors = FALSE
>> ...
>> The key lines in
>> expand.grid would seem to be
>>
>> if (!is.factor(x) && is.character(x))
>>     x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in
>> the first place.
>
> I think expand.grid converts input strings to factors so they
> retain the order they have in the input.  (Note that the levels
> argument is unique(x), not the sort(unique(x)) that data.frame uses.)
> People generally give expand.grid sorted input and expect it to
> not alter the order (the order of the levels affects tables and
> some plots).

Ah, that makes sense.  (Although the conversion to factors just seems
to be a convenient way to achieve the desired effect in this case -
there's no reason they have to be factors in the output)

Hadley

-- 
http://had.co.nz/



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread hadley wickham
On Mon, Nov 17, 2008 at 9:03 AM, Prof Brian Ripley
[EMAIL PROTECTED] wrote:
> On Mon, 17 Nov 2008, hadley wickham wrote:
>
>> Hi all,
>>
>> I love the option to not automatically convert strings into factors,
>> but there are three places that the current option doesn't work where
>> I think it should:
>
> Perhaps you mean 'when I would like it to'?   Things *should* work as
> documented, surely?

In an ideal world, I think things should be documented *and* consistent.

>> options(stringsAsFactors = FALSE)
>>
>> str(expand.grid(letters))
>> str(type.convert(letters))
>>
>> df <- read.fwf(textConnection(paste(letters, collapse = "\n")), 1)
>> str(df)
>
> I get
>
>> str(df)
>
> 'data.frame':   26 obs. of  1 variable:
>  $ V1: chr  "a" "b" "c" "d" ...
>
> so what is wrong with that?  read.fwf just calls read.table, so the default
> options of read.table apply.

Ok, that's weird. I get factors.

>> I think type.convert and read.fwf can be fixed by giving them a
>> stringsAsFactors argument and then using as.is = !stringsAsFactors
>> (like read.table).
>
> Seems to me that there is nothing wrong with read.fwf.  For type.convert()
> we could have the default
>
>   as.is = !default.stringsAsFactors()
>
> but I think a strong case needs to be made to change the documented
> behaviour.

Well, my intuition was that type.convert should mirror the behaviour
of read.table, since it is what does the conversion behind the scenes.
I can of course change my own code.

>> The key lines in expand.grid would seem to be
>>
>>   if (!is.factor(x) && is.character(x))
>>       x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in the first
>> place.
>
> Nor am I, but it goes back to at least r2107, over 10 years ago.  I don't
> see much problem with adding a 'stringsAsFactors' argument there.

Great, thanks.

Hadley

-- 
http://had.co.nz/



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread Peter Dalgaard

William Dunlap wrote:

>> but I'm not sure why they are being converted to factors in
>> the first place.
>
> I think expand.grid converts input strings to factors so they
> retain the order they have in the input.


Yep. These things do matter. Incidentally, I recently got burned by 
cooking an example using expand.grid, writing the data to a file with 
write.table and reading it back in during lecture with read.table. Odds 
ratio turned upside down...
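
A minimal sketch of how that kind of round trip loses the level order. The variable names and data are made up for illustration; stringsAsFactors is passed explicitly to both expand.grid() and read.table(), which assumes a version of R where expand.grid() has that argument.

```r
# Build a grid whose factor levels follow the order given in the input.
d <- expand.grid(exposure = c("unexposed", "exposed"),
                 outcome  = c("healthy", "diseased"),
                 stringsAsFactors = TRUE)

# Write it out and read it back, converting strings to factors on the way in.
tmp <- tempfile(fileext = ".txt")
write.table(d, tmp)
d2 <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
unlink(tmp)

levels(d$exposure)   # "unexposed" "exposed"   (input order)
levels(d2$exposure)  # "exposed"   "unexposed" (sorted on re-reading)

# Restoring the intended reference level has to be done by hand.
d2$exposure <- factor(d2$exposure, levels = levels(d$exposure))
```

The text file stores only the strings, so the level order has to be re-established after reading; otherwise the reference category, and with it the direction of a ratio or contrast, can flip.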


--
   O__   Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])                         FAX: (+45) 35327907



Re: [Rd] stringsAsFactors = FALSE

2008-11-17 Thread Martin Maechler
>>>>> "WD" == William Dunlap [EMAIL PROTECTED]
>>>>>     on Mon, 17 Nov 2008 09:06:49 -0800 writes:

    >> From: [EMAIL PROTECTED]
    >> [mailto:[EMAIL PROTECTED] On Behalf Of hadley wickham
    >> Sent: Monday, November 17, 2008 5:10 AM
    >> To: r-devel@r-project.org
    >> Subject: [Rd] stringsAsFactors = FALSE
    >> ...
    >> The key lines in expand.grid would seem to be
    >>
    >>   if (!is.factor(x) && is.character(x))
    >>       x <- factor(x, levels = unique(x))
    >>
    >> but I'm not sure why they are being converted to factors
    >> in the first place.

WD I think expand.grid converts input strings to factors so
WD they retain the order they have in the input.  (Note
WD that the levels argument is unique(x), not the
WD sort(unique(x)) that data.frame uses.)  People generally
WD give expand.grid sorted input and expect it to not alter
WD the order (the order of the levels affects tables and
WD some plots).

 
WD > lapply(expand.grid(Grade=c("Bad","Good","Better"), Size=c("Small","Medium","Large")), levels)
WD $Grade
WD [1] "Bad"    "Good"   "Better"
WD
WD $Size
WD [1] "Small"  "Medium" "Large"

WD > lapply(data.frame(Grade=c("Bad","Good","Better"), Size=c("Small","Medium","Large")), levels)
WD $Grade
WD [1] "Bad"    "Better" "Good"
WD
WD $Size
WD [1] "Large"  "Medium" "Small"


WD I have nothing against adding the stringsAsFactors
WD argument to expand.grid.

That's fine, but I am VERY MUCH against 
making the default of that argument depend on the ominous
  default.stringsAsFactors()
which is determined by getOption("stringsAsFactors").

Why would I hate such a change very much : 
 Note that we have here an option which would change the
 result of a standard R (S) function  expand.grid().

Whereas I already did not like that change when it happened for
read.table(), in that case one could at least say that
read.table() is in some way platform dependent 
{(because it
  typically depends on files of the local platform, but as we
  know this is not true even there; even now, if I tell my
  students, or a book author tells her readers, to use
  read.table("http://...")  I can no longer be sure that my
  students get the same data frame, because they could have
  different settings of getOption("stringsAsFactors")
  horrible, really!! )}

Please, R should stay as much a functional language as possible
and sensible!
If we start having global options more and more influence
the result of standard R functions, we are going down a very
slippery slope, and one that is making R even more idiosyncratic
than it already needs to be. 
Please, no !!  
Rather revert the read.table() default of stringsAsFactors to
not depend on the option, and maybe provide another set of short
forms of the various
   read.table(*, stringsAsFactors=FALSE)
incantations such that
all the factor-haters-string-lovers can use these short forms...
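
Such short forms could be as small as the sketch below. The function names are hypothetical and not part of R; each one fixes stringsAsFactors in the call itself, so the result does not depend on any option.

```r
# Hypothetical short forms (illustrative names, not part of R).
read_table_chr <- function(...) utils::read.table(..., stringsAsFactors = FALSE)
read_table_fct <- function(...) utils::read.table(..., stringsAsFactors = TRUE)

txt <- "g n\na 1\nb 2"
chr <- read_table_chr(text = txt, header = TRUE)
fct <- read_table_fct(text = txt, header = TRUE)

class(chr$g)  # "character"
class(fct$g)  # "factor"
```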

At the very first DSC, 1999, Joe Eaton, author of GNU octave,
told us how he regretted that he had started going down that bad
path, because users had started asking for it.
In the extreme case, we are ending up with a language that
depends on a whole huge status setting, and what a given
function computes can no longer be predicted by looking at the
function calls, unless you simultaneously know that whole status.
Please, No !!
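
The concern can be illustrated with a toy function whose default is read from a global option. Both the function make_col and the option name demo.stringsAsFactors are made up for this sketch, so that no real R option is touched.

```r
# Hypothetical function and option name, for illustration only.
make_col <- function(x, as_factor = getOption("demo.stringsAsFactors", TRUE)) {
  if (as_factor) factor(x) else x
}

old <- options(demo.stringsAsFactors = TRUE)
a <- make_col(c("u", "v"))
options(demo.stringsAsFactors = FALSE)
b <- make_col(c("u", "v"))   # same call text, different global state
options(old)                 # restore the previous setting

class(a)  # "factor"
class(b)  # "character"

# With the argument given explicitly, the call alone determines the result.
explicit <- make_col(c("u", "v"), as_factor = FALSE)
class(explicit)  # "character"
```

The two identical calls return different classes, which is exactly the situation where a reader cannot predict the result from the function call alone.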

Martin Maechler, ETH Zurich


WD Bill Dunlap TIBCO Software Inc - Spotfire Division
WD wdunlap tibco.com
