Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

2014-09-07 Thread Henrik Bengtsson
Thank you David - it took me a while to get back to this and dig into
it.  It's clever to imitate gtools::mixedorder() as far as possible.
A few comments:

1. It took me a while to understand why you picked 3899 in your
Roman-to-integer table; it's because as.roman(x) is NA for x > 3899.
(BTW, in 'utils', there's utils:::.roman2numeric() which could be
utilized, but it's currently internal.)

2. I think you forgot D=500 and M=1000.

3. There was a typo in your code; I think you meant rank.roman instead
of rank.numeric in one place.

4. The idea behind nonnumeric() is to identify non-numeric substrings
by is.na(as.numeric()).  Unfortunately, for romans that does not work,
because as.numeric() returns NA for every roman numeral.  Instead, we
need to use is.na(numeric(x)) here, where numeric() is the local
lookup function, i.e.

  nonnumeric <- function(x) {
    suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
  }

Actually, gtools::mixedorder() could use the same.

5. I undid your renaming of .numeric to .roman to minimize any
differences to gtools::mixedorder().
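
To illustrate point 4, here is a minimal sketch of why is.na(as.numeric()) cannot separate roman numerals from ordinary text, while a lookup against the romanC table can. The token vector and the names not_decimal and not_roman are made up for illustration; they are not part of the code in this thread.

```r
# Lookup table of roman numerals, built the same way as in this thread
romanC <- as.character(as.roman(1:3899))

# Hypothetical tokens, as they might come out of splitting a string
tokens <- c("chr ", "XII", "iv", "7")

# as.numeric() only recognises decimal numbers, so every roman numeral is NA
not_decimal <- suppressWarnings(is.na(as.numeric(tokens)))

# match() against the table recognises numerals exactly as spelled in romanC
not_roman <- is.na(match(tokens, romanC))

print(not_decimal)
print(not_roman)
```

Note that the table holds upper-case numerals only, so a lower-case token such as "iv" is treated as plain text in this sketch.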


With the above fixes, we now have:

mixedorderRoman <- function (x)
{
    if (length(x) < 1)
        return(NULL)
    else if (length(x) == 1)
        return(1)
    if (is.numeric(x))
        return(order(x))
    delim = "\\$\\@\\$"
    # NOTE: as.roman(x) is NA for x > 3899
    romanC <- as.character( as.roman(1:3899) )
    # Roman numeral -> integer via table lookup (NA if not a numeral)
    numeric <- function(x) {
        suppressWarnings(match(x, romanC))
    }
    # Upper-cased text for non-numerals, NA for numerals
    nonnumeric <- function(x) {
        suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
            NA))
    }
    x <- as.character(x)
    which.nas <- which(is.na(x))
    which.blanks <- which(x == "")
    if (length(which.blanks) > 0)
        x[which.blanks] <- -Inf
    if (length(which.nas) > 0)
        x[which.nas] <- Inf
    # Surround each run of roman-numeral letters with the delimiter
    delimited <- gsub("([IVXCLDM]+)",
        paste(delim, "\\1", delim, sep = ""), x)
    step1 <- strsplit(delimited, delim)
    step1 <- lapply(step1, function(x) x[x > ""])
    step1.numeric <- lapply(step1, numeric)
    step1.character <- lapply(step1, nonnumeric)
    maxelem <- max(sapply(step1, length))
    step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
        function(x) x[i]))
    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
        function(x) x[i]))
    rank.numeric <- sapply(step1.numeric.t, rank)
    rank.character <- sapply(step1.character.t, function(x)
        as.numeric(factor(x)))
    rank.numeric[!is.na(rank.character)] <- 0
    rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
        2, max, na.rm = TRUE))
    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
        rank.character)
    order.frame <- as.data.frame(rank.overall)
    if (length(which.nas) > 0)
        order.frame[which.nas, ] <- Inf
    retval <- do.call("order", order.frame)
    return(retval)
}
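
The tokenising step used by the function can be tried in isolation. The following is a minimal sketch, assuming the same delimiter and the character class with D included; the input strings and the variable name tokens are made up for illustration.

```r
# Delimiter as in gtools::mixedorder(); the backslashes make it a literal
# "$@$" both in the gsub() replacement and in the strsplit() pattern
delim <- "\\$\\@\\$"

# Hypothetical input strings
x <- c("chr XII", "vol IV b")

# Wrap each run of roman-numeral letters in the delimiter ...
delimited <- gsub("([IVXLCDM]+)", paste(delim, "\\1", delim, sep = ""), x)

# ... then split on it and drop empty pieces
tokens <- strsplit(delimited, delim)
tokens <- lapply(tokens, function(s) s[s > ""])

print(tokens)
```

Each string ends up as an alternating sequence of text and numeral tokens, e.g. "chr " and "XII" for the first element. The match is case-sensitive, so lower-case letters such as the "v" and "l" in "vol" stay part of the text tokens.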


The difference to gtools::mixedorder() is minimal:

< romanC <- as.character( as.roman(1:3899) )
21c11
< suppressWarnings(match(x, romanC))
---
> suppressWarnings(as.numeric(x))
24c14
< suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
---
> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
34c24
< delimited <- gsub("([IVXCLDM]+)",
---
> delimited <- gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
59,62d48

This difference is so small that the above could now be an option to
mixedorder() with minimal overhead added, e.g. mixedorder(y,
type=c("decimal", "roman")).  One could even imagine adding support
for binary, octal and hexadecimal (not done).
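
One way such a type option could be wired up is sketched below. This is only an illustration under assumptions: the helper name numeral_spec(), its return structure and the simplified decimal pattern are hypothetical and are not taken from gtools.

```r
# Hypothetical helper: choose a token pattern and a converter per numeral type.
# The name, the 'type' values and the list layout are assumptions for this sketch.
numeral_spec <- function(type = c("decimal", "roman")) {
  type <- match.arg(type)
  switch(type,
    decimal = list(
      # simplified pattern: unsigned integers only
      pattern = "([0-9]+)",
      convert = function(s) suppressWarnings(as.numeric(s))
    ),
    roman = list(
      pattern = "([IVXLCDM]+)",
      # table lookup, as in mixedorderRoman(); NA for non-numerals
      convert = function(s) match(s, as.character(as.roman(1:3899)))
    )
  )
}

spec <- numeral_spec("roman")
print(spec$pattern)
print(spec$convert(c("IV", "chr", "XII")))
```

With the pattern and the converter bundled like this, the rest of the ordering code would not need to know which numeral system is in use; further types would only add another branch.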

Greg (maintainer of gtools; cc:ed), is this something you would
consider adding to gtools?  I've modified the gtools source code
available on CRAN (that's the only source I found), added package
tests, updated the Rd and verified it passes R CMD check.  If
interested, please find the updates at:

  https://github.com/HenrikBengtsson/gtools/compare/cran:master...master

Thanks

Henrik


  

Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

2014-09-07 Thread David Winsemius

On Sep 7, 2014, at 7:40 PM, Henrik Bengtsson wrote:

> Thank you David - it took me a while to get back to this and dig into
> it.  It's clever to imitate gtools::mixedorder() as far as possible.
> A few comments:
>
> 1. It took me a while to understand why you picked 3899 in your
> Roman-to-integer table; it's because as.roman(x) is NA for x > 3899.
> (BTW, in 'utils', there's utils:::.roman2numeric() which could be
> utilized, but it's currently internal.)

Yes, that was the reason. I didn't think I needed a Roman-to-numeric function 
because I discovered the roman numbers were actually simple numeric vectors to 
which a class had been assigned and that it was the class-facilities that did 
all the work. The standard Ops functions were just acting on numeric vectors.

If one doesn't take care, their romanity can be lost:

> R <- as.roman(10^(0:4))
> R
[1] I    X    C    M    <NA>

> unclass(R)
[1]    1   10  100 1000   NA

> sum(R, na.rm=TRUE)
[1] 
> as.roman(sum(R, na.rm=TRUE))
[1] MCXI
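
The same point can be reproduced with a short sketch that converts explicitly with as.integer(), so that it does not rely on how sum() treats the "roman" class. The variable names are made up for illustration.

```r
# A roman vector is an integer vector with a class attribute
r <- as.roman(c(1, 10, 100, 1000))
print(class(r))

# as.integer() drops the class and leaves the plain values
vals <- as.integer(r)
total <- sum(vals)

# The "romanity" has to be restored explicitly
back <- as.roman(total)
print(as.character(back))
```

Here total is the plain number 1111, and only the final as.roman() call turns it back into MCXI.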

 
> 2. I think you forgot D=500 and M=1000.

Quite possible. I suspect Greg will have corrected the omission, but if not, 
this will be helpful to him.

 
> 3. There was a typo in your code; I think you meant rank.roman instead
> of rank.numeric in one place.
 

I understood that Greg intends to wrap this into the mixedorder() and
mixedsort() duo.

Best;
David.


[R] Mixed sorting/ordering of strings acknowledging roman numerals?

2014-08-26 Thread Henrik Bengtsson
Hi,

does anyone know of an implementation/function that sorts strings that
*contain* roman numerals (I, II, III, IV, V, ...), treating them as
numbers?  In 'gtools' there is mixedsort() which does this for strings
that contain (decimal) numbers.  I'm looking for a mixedsortroman()
function that does the same but with roman numbers, e.g.

## DECIMAL NUMBERS
> x <- sprintf("chr %d", 12:1)
> x
 [1] "chr 12" "chr 11" "chr 10" "chr 9"  "chr 8"
 [6] "chr 7"  "chr 6"  "chr 5"  "chr 4"  "chr 3"
[11] "chr 2"  "chr 1"

> sort(x)
 [1] "chr 1"  "chr 10" "chr 11" "chr 12" "chr 2"
 [6] "chr 3"  "chr 4"  "chr 5"  "chr 6"  "chr 7"
[11] "chr 8"  "chr 9"

> gtools::mixedsort(x)
 [1] "chr 1"  "chr 2"  "chr 3"  "chr 4"  "chr 5"
 [6] "chr 6"  "chr 7"  "chr 8"  "chr 9"  "chr 10"
[11] "chr 11" "chr 12"


## ROMAN NUMBERS
> y <- sprintf("chr %s", as.roman(12:1))
> y
 [1] "chr XII"  "chr XI"   "chr X"    "chr IX"
 [5] "chr VIII" "chr VII"  "chr VI"   "chr V"
 [9] "chr IV"   "chr III"  "chr II"   "chr I"

> sort(y)
 [1] "chr I"    "chr II"   "chr III"  "chr IV"
 [5] "chr IX"   "chr V"    "chr VI"   "chr VII"
 [9] "chr VIII" "chr X"    "chr XI"   "chr XII"

> mixedsortroman(y)
 [1] "chr I"    "chr II"   "chr III"  "chr IV"
 [5] "chr V"    "chr VI"   "chr VII"  "chr VIII"
 [9] "chr IX"   "chr X"    "chr XI"   "chr XII"

The latter is what I'm looking for.

Before hacking together something myself (e.g. identify roman-numeral
substrings, translate them to decimal numbers, use gtools::mixedsort()
to sort them and then translate them back to roman numbers), I'd like
to hear if someone already has this implemented or knows of a package
that does this.
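
A minimal base-R sketch of that idea follows, assuming each string contains at most one upper-case roman numeral. It orders by an integer key rather than calling gtools::mixedsort(), and the helper name roman_key() is hypothetical.

```r
# Lookup table: position in the table equals the numeral's value
romanC <- as.character(as.roman(1:3899))

# Hypothetical helper: integer value of the first roman numeral in each string
roman_key <- function(x) {
  pos <- regexpr("[IVXLCDM]+", x)            # first run of numeral letters
  token <- rep(NA_character_, length(x))
  token[pos > 0] <- regmatches(x, pos)       # matched substrings only
  match(token, romanC)                       # NA when there is no valid numeral
}

y <- sprintf("chr %s", as.character(as.roman(12:1)))
sorted <- y[order(roman_key(y))]
print(sorted)
```

Because the original strings are only reordered, there is no need to translate the numbers back to roman numerals afterwards. Unlike a full mixed sort, this sketch ignores the surrounding text when ordering.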

Thanks,

Henrik

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

2014-08-26 Thread David Winsemius

On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:

> Hi,
>
> does anyone know of an implementation/function that sorts strings that
> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
> numbers.  In 'gtools' there is mixedsort() which does this for strings
> that contain (decimal) numbers.  I'm looking for a mixedsortroman()
> function that does the same but with roman numbers, e.g.

It's pretty easy to sort something you know to be congruent with the existing 
roman class:

romanC <- as.character( as.roman(1:3899) )
match(c("I", "II", "III","X","V"), romanC)
#[1]  1  2  3 10  5

But I guess you already know that, so you want a regex approach to parsing. 
Looking at the path taken by Warnes, it would involve doing something like his 
regex based insertion of a delimiter for Roman numeral but simpler because he 
needed to deal with decimal points and signs and exponent notation, none of 
which you appear to need. If you only need to consider character and Roman, 
then this hack of Warnes tools succeeds:

mixedorderRoman <- function (x)
{
    if (length(x) < 1)
        return(NULL)
    else if (length(x) == 1)
        return(1)
    if (is.numeric(x))
        return(order(x))
    delim = "\\$\\@\\$"
    roman <- function(x) {
        suppressWarnings(match(x, romanC))
    }
    nonnumeric <- function(x) {
        suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
            NA))
    }
    x <- as.character(x)
    which.nas <- which(is.na(x))
    which.blanks <- which(x == "")
    if (length(which.blanks) > 0)
        x[which.blanks] <- -Inf
    if (length(which.nas) > 0)
        x[which.nas] <- Inf
    delimited <- gsub("([IVXCL]+)",
        paste(delim, "\\1", delim, sep = ""), x)
    step1 <- strsplit(delimited, delim)
    step1 <- lapply(step1, function(x) x[x > ""])
    step1.roman <- lapply(step1, roman)
    step1.character <- lapply(step1, nonnumeric)
    maxelem <- max(sapply(step1, length))
    step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
        function(x) x[i]))
    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
        function(x) x[i]))
    rank.roman <- sapply(step1.roman.t, rank)
    rank.character <- sapply(step1.character.t, function(x)
        as.numeric(factor(x)))
    rank.roman[!is.na(rank.character)] <- 0
    rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
        2, max, na.rm = TRUE))
    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
        rank.character)
    order.frame <- as.data.frame(rank.overall)
    if (length(which.nas) > 0)
        order.frame[which.nas, ] <- Inf
    retval <- do.call("order", order.frame)
    return(retval)
}

y[mixedorderRoman(y)]
 [1] "chr I"    "chr II"   "chr III"  "chr IV"   "chr IX"
 [6] "chr V"    "chr VI"   "chr VII"  "chr VIII" "chr X"
[11] "chr XI"   "chr XII"


-- 
David.
 

David Winsemius
Alameda, CA, USA
