Re: [R] Adding SORT to UNIQUE

Stephen H. Dawson, DSL via R-help Wed, 22 Dec 2021 07:58:18 -0800

Avi,

Thanks for the detailed reply. I am unable to reply with the same detail. Please do not take my lack of response depth as demonstrating a lack of appreciation.

My intent was to post an open-ended question asking best process, not best practice, to amend sort to unique. I posted code showing how I arrived at my present status. The function read.csv reads a file in table format and creates a data frame from it. No column types are defined in the code. The scale of workload was not considered, as it is beyond scope at this point in the dialogue. What is present is defining what works, then selecting the efficient option for the scale of workload to accomplish.

The biggest benefit of an open-ended question is dialogue members asking questions to both confirm understandings and explore other considerations. Asking a specific question is necessary for an open-ended discussion, perhaps two questions. What often destroys an open-ended dialogue is placing boundaries on the dialogue.

This dialogue has shown there is no single best process to add sort to unique. I am fine with this outcome. It has been time well-spent for me, and the dialogue members from what I read in their positions on the concept of arranging data to be processed.


My definition of ease is simple: Whatever it takes to do what I need to do.

What is not included in this definition is mastering all aspects of step one before I move to step two.

Sort() does care what is fed to it. This has been the case with all occurrences of my experiences for both programming and using already-built code. Computers have a funny way of doing what they are told to do.

I do not want a language that has calculated every possible combination of ways to combine functions and already made tens of thousands available.

I look forward to learning more about the over 100 languages you can program during my journey to learn more about GNU R.



*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com <http://www.shdawson.com>


On 12/21/21 2:17 PM, Avi Gross via R-help wrote:

Stephen,

Languages have their own philosophies and are often focused initially on doing 
specific things well. Later, they tend to accumulate additional functionality 
both in the base language and extensions.

I am wondering if you have explained your need precisely enough to get the 
answers you want.

SQL and Python have their own ways and both have advantages but also huge 
deficiencies relative to just base R.

But there are rules you live with, and if you choose, say, a data.frame to store 
things in, the columns must all be the same length. The unique members of each 
column are likely to differ in number, so storing them in a data.frame 
does not work. They can be stored quite a few other ways, such as a list of 
lists.

And what is your definition of ease? I can program in Python and SQL and way 
over a hundred other languages and I know I need to adapt my thinking to the 
flow of the language and not the other way around. Base R was not designed to 
be like either SQL or Python. But it can be extended quite a few ways to do 
just about anything.

What you ran into, for example, is the fact that some functionality is more 
selective in what it works on. A data.frame with one column is logically the 
same as a matrix with one column and as a vector but in reality, they are not 
the same thing. Yes, they can be converted into each other fairly trivially. 
sort() seems to care what you feed it. If you did not worry about efficiency, 
you could have a version of sort that accepts a wide variety of inputs, 
converts any it can to some possibly common internal form, then converts the 
output back into the form it was received in, or uses an argument to 
specify the output format. It is not hard in R to make such a function as it 
has the primitives needed to examine an arbitrary object and see what 
dimensions it has for some number of types and so on, and has utilities to do 
the conversion.
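As a rough illustration of the "accept many inputs, return the same form" idea, here is a minimal sketch. The name flexsort is hypothetical (it is not a base R function), and the sketch assumes the data contain no NA values, since sort() drops them by default and that would change the lengths.

```r
# Hypothetical helper: sort the contents of a vector, matrix or data.frame
# and hand the result back in the form it arrived in.
# Assumption: no NA values (sort() drops NA unless na.last = TRUE is passed).
flexsort <- function(x, ...) {
  if (is.data.frame(x)) {
    # Sort every column independently; x[] keeps the data.frame structure.
    x[] <- lapply(x, sort, ...)
    return(x)
  }
  if (is.matrix(x)) {
    # Sort every column independently; x[] keeps dim and dimnames.
    x[] <- apply(x, 2, sort, ...)
    return(x)
  }
  # Plain vectors go straight to sort().
  sort(x, ...)
}

df <- data.frame(a = c(3, 1, 2), b = c("z", "x", "y"),
                 stringsAsFactors = FALSE)
m  <- matrix(c(3, 1, 2, 6, 5, 4), ncol = 2)

sorted_df  <- flexsort(df)        # still a data.frame
sorted_m   <- flexsort(m)         # still a matrix
sorted_vec <- flexsort(c(2, 1))   # still a vector
```

Note that sorting each column independently breaks the pairing of values across rows, so this only makes sense when the columns are being inspected one at a time.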

If you want a language that has calculated every possible combination of ways to combine functions and 
already made tens of thousands available, good luck. What languages (including Python and R) expect is 
for you to compose such combinations yourself in one of many ways. The annoying discussions here between 
purists and those wanting to use pre-made packages aside, your question can be handled in many of the 
ways we already discussed. They include making your own (often very small) function that implements 
consolidating the many steps into one logical step. It can mean using pipelines like the new 
"|>" operator recently added to base R or the older versions often used in the tidyverse 
packages like "%>%".

You want to take a data.frame and select a column at a time and ask for it to be made into unique 
values then ordered and shown. So you want a VECTOR, and your initial use of the "[" 
operator does not take the underlying list structure of a data.frame apart the way you might have 
thought but instead returns a narrow data.frame. So you MAY need to either extract the column using "[[" or 
use a routine R supplies like unlist() (as.vector() alone will not flatten a data.frame into an atomic vector).
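The difference between "[" and "[[" can be checked directly. The following is a small sketch on a made-up data.frame; the names d, one_col_df, col_vec and flat are chosen only for illustration.

```r
# Small made-up data.frame, used only for illustration.
d <- data.frame(n = c(5, 4, 3, 3, 4, 5),
                s = c("z", "i", "t", "s", "t", "i"),
                stringsAsFactors = FALSE)

one_col_df <- d[1]          # still a data.frame, one column wide
col_vec    <- d[[1]]        # the column itself, a numeric vector
flat       <- unlist(d[1])  # numeric vector with names n1, n2, ...

class(one_col_df)   # "data.frame"
class(col_vec)      # "numeric"

# unique() and sort() behave as expected once a vector is in hand.
sort(unique(col_vec))   # 3 4 5
```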

Here is a pipeline using this as my data:

mydf <- data.frame(ints=c(5,4,3,3,4,5), chars=c("z","i","t","s","t","i"))

Note the number of unique items differs, as does the data type:

   > mydf
     ints chars
   1    5     z
   2    4     i
   3    3     t
   4    3     s
   5    4     t
   6    5     i

To handle the columns one at a time can be done using a pipeline like:

   > mydf[2] |> unlist() |> unique() |> sort()
   [1] "i" "s" "t" "z"
   > mydf[1] |> unlist() |> unique() |> sort()
   [1] 3 4 5

The above takes a two-column data.frame and restricts it to a one-column 
data.frame, then passes that temporary object as the first argument 
of unlist(), which returns another temporary object, a vector 
(numeric in one case and character in the other). That 
result is passed as the argument of unique(), which returns a shorter 
vector in the same order, and finally sort() reorders it.

Note the first steps can be shortened if using the "[[" notation or by using 
the named way of asking for a column:

   > mydf[[1]] |> unique() |> sort()
   [1] 3 4 5
   > mydf$ints |> unique() |> sort()
   [1] 3 4 5

But pipelines are simply syntactic sugar mostly so you also can just nest 
function calls as in sort(unique(unlist(mydf[1]))) or do what I showed earlier 
of creating a function that does the work invisibly and call that.
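A function that does the work invisibly can be as short as one line. This is a minimal sketch; the name unisort is an assumption made here for illustration, and the data are the same small example used in this message.

```r
# Illustrative one-liner: unique values of a vector, sorted.
# Extra arguments (for example decreasing = TRUE) are passed on to sort().
unisort <- function(x, ...) sort(unique(x), ...)

mydf <- data.frame(ints = c(5, 4, 3, 3, 4, 5),
                   chars = c("z", "i", "t", "s", "t", "i"),
                   stringsAsFactors = FALSE)

unisort(mydf$chars)                     # "i" "s" "t" "z"
unisort(mydf$ints, decreasing = TRUE)   # 5 4 3
```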

Python often does its own version of pipelines by adding a dot at the end and 
calling a method and if needed another dot and then calling a method on the 
resulting object and so on. But that is arguably more limiting in some ways and 
more powerful in others. Different paradigms. In R, you do not do 
object.method1.method2(args).method3(args) so a pipeline method is used to sort 
of do something related.

Now if your need was to do your operation on an entire data.frame at once, then 
sometimes you will find a way to do it easily and sometimes use things like 
functional programming techniques. It is so common to calculate the sums or 
means of columns in a data.frame (or matrix) that functions like rowSums() and 
colSums() and colMeans() are available in R. But R also allows fairly 
arbitrary things to be done, as in the lapply() family of functions, which 
apply an arbitrary function, perhaps including arguments, like:

lapply(mydf, max)

sapply(mydf, `[`, 2)

The latter takes the second value in each and every column of the data.frame 
and, when possible, consolidates the results. Of course, taking unique values 
produces uneven numbers of results, which cannot be simplified that way, so a 
list is the natural result. Below I show how you can do many things including 
nested calls:


   > lapply(mydf, sort)
   $ints
   [1] 3 3 4 4 5 5

   $chars
   [1] "i" "i" "s" "t" "t" "z"

   > lapply(lapply(mydf, sort), unique)
   $ints
   [1] 3 4 5

   $chars
   [1] "i" "s" "t" "z"

   > lapply(lapply(mydf, unique), sort)
   $ints
   [1] 3 4 5

   $chars
   [1] "i" "s" "t" "z"

   > lapply(lapply(lapply(mydf, unique), sort), toupper)
   $ints
   [1] "3" "4" "5"

   $chars
   [1] "I" "S" "T" "Z"
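The nested lapply() calls can also be collapsed into a single pass with an anonymous function. This is only a sketch of an equivalent formulation, not a claim about which form is preferable; the variable names are illustrative.

```r
mydf <- data.frame(ints = c(5, 4, 3, 3, 4, 5),
                   chars = c("z", "i", "t", "s", "t", "i"),
                   stringsAsFactors = FALSE)

# One lapply() call: for each column, take unique values, then sort them.
# The result is a list because the columns yield different numbers of values.
uniq_sorted <- lapply(mydf, function(col) sort(unique(col)))

uniq_sorted$ints    # 3 4 5
uniq_sorted$chars   # "i" "s" "t" "z"
```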

R has plenty of other such primitives that allow you to compose things many 
ways including other variants like Filter, Reduce and Map, with 
way more in various packages, such as pmap in purrr.

It is simply wrong to insist that a language you are not very familiar with is 
not able to (often fairly easily) do all kinds of things.

Back to your question, if I may, I think one of my earlier posts on this topic 
suggested another. Use factors which are part of base-R to perform the unique() 
for you and then extract the unique levels and re-order them by sorting.

   > sort(levels(factor(mydf[[1]])))
   [1] "3" "4" "5"
   > sort(levels(factor(mydf[[2]])))
   [1] "i" "s" "t" "z"

But note this converts everything to characters so a numeric may need to be 
converted back, and yes, the sorting is not done numerically.
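The caveat about character levels shows up as soon as a two-digit number is present. The sketch below uses a made-up vector (ints, with a 10 added) to show the character sort and one way to convert the levels back to numbers first.

```r
# Made-up numeric vector that includes a two-digit value.
ints <- c(5, 4, 3, 3, 4, 5, 10)

# factor() finds the unique values; levels() returns them as character.
lv <- levels(factor(ints))        # "3" "4" "5" "10"

by_text   <- sort(lv)             # character sort puts "10" first
by_number <- sort(as.numeric(lv)) # 3 4 5 10
```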

Generally, there are oodles of ways to do anything. If this were Python, you 
might create an object that maintains a sorted set for example but that just 
hides the complexity as the various methods of the underlying object have to 
carefully keep track of the current order, dealing with how things 
are added into the right place or tightening up the data structure whenever 
something is removed. Others simply supply a sorted() method to use only 
when you actually need that. R can work in similar ways and you can create 
objects of quite a few kinds to implement some things but it does not often 
seem necessary, at least to me.

I can imagine writing a function that makes a data.frame even from vectors of 
unequal length by calculating the length of the longest vector and then setting 
each shorter vector to be longer with code like:

length(a) <- longest

You can then patch together all the results into a data.frame with trailing NA 
values on some columns.
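For readers who have not seen the assignment form of length(), here is a tiny sketch of the padding behaviour on made-up vectors. It shows both the usual replacement syntax and the functional spelling that can be handed to lapply().

```r
# Replacement form: extending a vector pads it with NA.
a <- c("x", "y")
length(a) <- 4
a                        # "x" "y" NA  NA

# Functional form: the same operation written as an ordinary call,
# which is what makes it usable as an argument to lapply().
b <- `length<-`(c(1, 2, 3), 5)
b                        # 1 2 3 NA NA
```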

I quickly cobbled together a few lines that can do that and can be placed 
inside a function to return this:

   lapply(lapply(lapply(mydf, unique), sort), toupper) -> uneven
   longest <- max(unlist(lapply(uneven, length)))
   answer <- data.frame(lapply(uneven, `length<-`, longest))
   print(answer)

     ints chars
   1    3     I
   2    4     S
   3    5     T
   4 <NA>     Z

Now this has a single NA, but I suggest it generalizes well to a more complex 
example:

    ints lower upper
1    10     k     Z
2     9     j     A
3     8     i     Z
4     7     h     A
5     6     g     Z
6     5     f     A
7     4     h     Z
8     3     i     A
9     2     j     Z
10    1     k     A
11    2     l     Z
12    3     m     A

These are uneven and there are three columns, so I tried a function version:

   mydf2 <- data.frame(ints = c(10:1, 2:3),
                       lower = c(letters[11:6], letters[8:13]),
                       upper = rep(c("Z", "A"), 6))

   unisortuneven <- function(anydf) {
     uneven <- lapply(lapply(lapply(anydf, unique), sort), toupper)
     longest <- max(unlist(lapply(uneven, length)))
     data.frame(lapply(uneven, `length<-`, longest))
   }

   > unisortuneven(mydf2)
      ints lower upper
   1     1     F     A
   2     2     G     Z
   3     3     H  <NA>
   4     4     I  <NA>
   5     5     J  <NA>
   6     6     K  <NA>
   7     7     L  <NA>
   8     8     M  <NA>
   9     9  <NA>  <NA>
   10   10  <NA>  <NA>

The above does not format great for text, sadly, so is better shown as the transpose for display purposes:


   > t(unisortuneven(mydf2))
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
   ints  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
   lower "F"  "G"  "H"  "I"  "J"  "K"  "L"  "M"  NA   NA
   upper "A"  "Z"  NA   NA   NA   NA   NA   NA   NA   NA

But hopefully it makes my point that a little thinking and KNOWING about features 
of R like how to use a functionalized version of length() that sets a changed 
value using the odd notation of `length<-` can let you solve all kinds of 
problems in a somewhat abstract manner. Of course the above function is not 
refined and will not handle some useful transformations or deal with errors. That 
can make it quite a bit harder and in some cases, make it a good idea to find 
someone sharing a package where they did the hard work and documented exactly what 
their function does.

I am eclectic and happy to switch tools at a moment's notice if they offer an 
interesting way to do something. But, within a language, I learn the darn rules 
and also the idioms often used and then choose from among many ways I can see 
to solve something and use what is available.  You had a trivial solution 
available to you to simply do one step at a time and save intermediate values, 
transforming at times. Some of us have sent you more general solutions. Do you 
still think what you want is so much harder to do in R, or that perhaps you are 
not thinking in R and thus want it to do it some other way other languages do?





-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Stephen H. Dawson, DSL 
via R-help
Sent: Tuesday, December 21, 2021 10:16 AM
To: Rui Barradas <ruipbarra...@sapo.pt>; Stephen H. Dawson, DSL via R-help 
<r-help@r-project.org>
Subject: Re: [R] Adding SORT to UNIQUE

Thanks everyone for the replies.

It is clear one either needs to write a function or put the unique entries into 
another dataframe.

It seems odd R cannot sort a list of unique column entries with ease.
Python and SQL can do it with ease.

QUESTION
Is there a simpler means, other than the unique function, to capture 
distinct column entries, then sort that list?




On 12/20/21 5:53 PM, Rui Barradas wrote:

Hello,

Inline.

At 21:18 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:

Thanks.

sort(unique(Data[[1]]))

This syntax provides row numbers, not column values.

This is not right.
The syntax Data[1] extracts a sub-data.frame, the syntax Data[[1]]
extracts the column vector.

As for my previous answer, it was not addressing the question, I
misinterpreted it as being a question on how to sort by numeric order
when the data is not numeric. Here is a, hopefully, complete answer.
Still with package stringr.


cols_to_sort <- 1:4

Data2 <- lapply(Data[cols_to_sort], \(x){
   stringr::str_sort(unique(x), numeric = TRUE)
})


Or using Avi's suggestion of writing a function to do all the work and
simplify the lapply loop later,


unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), ...)
Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)


Hope this helps,

Rui Barradas



On 12/20/21 11:58 AM, Stephen H. Dawson, DSL via R-help wrote:

Hi,


Running a simple syntax set to review entries in dataframe columns.
Here is the working code.

Data <- read.csv("./input/Source.csv", header=T)
describe(Data)
summary(Data)
unique(Data[1])
unique(Data[2])
unique(Data[3])
unique(Data[4])

I would like to sort the unique entries. The data in the various
columns are not all defined as numbers; some are text. I realize 1 and
10 will not sort properly, as the column is not defined as a number,
but want to see what I have in the columns viewed as sorted.
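The "1 and 10" concern can be reproduced on a made-up character vector. The sketch below shows the text ordering and, under the assumption that every entry parses as a number, one base R way to order the same strings numerically (as.numeric() would give NA with a warning for anything that does not parse).

```r
# Made-up column of numbers stored as text.
x <- c("10", "1", "2", "2", "1")

u <- unique(x)                         # "10" "1" "2"

as_text    <- sort(u)                  # text order: "1" "10" "2"
as_numbers <- u[order(as.numeric(u))]  # numeric order, still text: "1" "2" "10"
```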

QUESTION
What is the best process to sort unique output, please?


Thanks.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
