Re: [R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-19 Thread Deepayan Sarkar
On 6/18/07, Matthew Trunnell <[EMAIL PROTECTED]> wrote:
> Aha!  So to expand that from the original expression,
>
> > table(table(d$filename, d$email_addr))
>
>   0   1   2   3
> 253  20   8   9
>
> I think that is exactly what I'm looking for.  I knew it must be
> simple!!!  What does the 0 column represent?

Number of unique filename:email_addr combinations that don't occur in the data.
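
To see where the zero column comes from, here is a minimal sketch on a made-up data frame (toy, by_pair and the other names are illustrative, not from the thread). table() crosses every filename with every email_addr, so pairs that never occur still get a cell, with count 0.

```r
# Toy data (hypothetical): two files, two addresses, three downloads.
toy <- data.frame(
  filename   = c("fileA", "fileA", "fileB"),
  email_addr = c("u1", "u1", "u2")
)

# 2 x 2 table: every filename is crossed with every email_addr.
by_pair <- table(toy$filename, toy$email_addr)

# Tabulating the cell counts: two empty cells, one pair seen once,
# one pair seen twice.
dist_all <- table(by_pair)

# Dropping the empty cells keeps only the pairs that actually occur.
cell_counts <- as.vector(by_pair)
dist_seen <- table(cell_counts[cell_counts > 0])
```

If only the observed combinations matter, dist_seen is probably the more useful of the two.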

> Also, does this tell me the same thing, filtered by Japan?
> > table(table(d$filename, d$email_addr, 
> > d$country_residence)[d$country_residence=="Japan"])
>
>   0   1   2   3
> 958   5   2   1

No it doesn't.

> length(table(d$filename, d$email_addr, d$country_residence))
[1] 4350
> length(d$country_residence)
[1] 63

You are using an index that is meaningless: it has one element per row
of d (63), not one per cell of the table (4350).
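
The length mismatch matters because R recycles a logical index that is shorter than the object being indexed, so the cells it selects have no connection to the rows of d. A small sketch with made-up vectors (cells and is_japan are illustrative names):

```r
# Six table cells, flattened (hypothetical values).
cells <- c(10, 20, 30, 40, 50, 60)

# A logical index built from a 3-row data set, not from the 6 cells.
is_japan <- c(TRUE, FALSE, FALSE)

# The index is silently recycled to TRUE FALSE FALSE TRUE FALSE FALSE,
# so elements 1 and 4 are picked purely by position.
picked <- cells[is_japan]
```

No warning is given in this case, which is why the result can look plausible.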


There's an alternative tabulation function that uses a formula
interface similar to that used in modeling functions; this might be
more transparent for your case:

> count <-
+ xtabs(~filename + email_addr, data = d,
+   subset = country_residence == "Japan")
> xtabs(~count)
count
  0   1   3
284   2   4
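
The same idea on a toy data frame, as a sketch only (toy, japan and the column values are made up). The columns are created as factors here on purpose: that was the read.csv default when this thread was written, whereas from R 4.0.0 on read.csv returns character columns and, I believe, xtabs would then build its levels from the subset only, giving fewer zero cells.

```r
# Toy data (hypothetical), with columns stored as factors.
toy <- data.frame(
  filename   = factor(c("fileA", "fileA", "fileB", "fileB")),
  email_addr = factor(c("u1", "u1", "u2", "u3")),
  country    = factor(c("Japan", "Japan", "Japan", "India"))
)

# Count downloads per filename/email pair, restricted to rows from Japan.
japan <- xtabs(~ filename + email_addr, data = toy,
               subset = country == "Japan")

# Distribution of those counts. Unused factor levels are kept, so the
# cells for u3 (India only) appear as zeros.
dist_japan <- table(japan)

# Drop the empty cells to count only pairs that actually occur.
japan_counts <- as.vector(japan)
dist_japan_seen <- table(japan_counts[japan_counts > 0])
```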


> How does that differ logically from this?
>
> > table(table(d$filename, d$email_addr)[d$country_residence=="Japan"])
>
>  0  1  2  3
> 51  4  2  1

This is also using meaningless indexing.

Note, incidentally, that you are indexing a matrix of dimension 10x29
as if it were a vector of length 290, which is probably not what you
meant to do anyway:

> str(table(d$filename, d$email_addr))
 'table' int [1:10, 1:29] 1 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "file1" "file10" "file2" "file3" ...
  ..$ : chr [1:29] "email1" "email10" "email11" "email12" ...

You need to read help(Extract) carefully and play around with some
simple examples.
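
One such simple example, as a sketch on made-up data (toy and tab3 are illustrative names): to restrict a three-way table to one country, index the third dimension by name rather than passing a logical vector built from the data frame.

```r
# Toy data (hypothetical).
toy <- data.frame(
  filename   = c("fileA", "fileA", "fileB", "fileB"),
  email_addr = c("u1", "u1", "u2", "u3"),
  country    = c("Japan", "Japan", "Japan", "India")
)

# Three-way table: filename x email_addr x country.
tab3 <- table(toy$filename, toy$email_addr, toy$country)

# Index the third dimension by name; the two empty positions mean
# "all filenames" and "all addresses".
japan_slice <- tab3[, , "Japan"]

# A single index treats the array as one long vector in column-major
# order, which is what happens when a lone logical vector is used.
first_cell <- tab3[1]
```

Here japan_slice is a 2 x 3 matrix of counts for Japan only, and table(japan_slice) gives the distribution of those counts.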

> I don't understand why that produces different results.  The first one
> adds a third dimension to the table, but limits that third dimension
> to a single element, Japan.  Shouldn't it be the same?  And again,
> what's that zero column?

As before, they are the empty combinations.

-Deepayan

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-18 Thread Matthew Trunnell
Aha!  So to expand that from the original expression,

> table(table(d$filename, d$email_addr))

  0   1   2   3
253  20   8   9

I think that is exactly what I'm looking for.  I knew it must be
simple!!!  What does the 0 column represent?

Also, does this tell me the same thing, filtered by Japan?
> table(table(d$filename, d$email_addr, 
> d$country_residence)[d$country_residence=="Japan"])

  0   1   2   3
958   5   2   1

How does that differ logically from this?

> table(table(d$filename, d$email_addr)[d$country_residence=="Japan"])

 0  1  2  3
51  4  2  1

I don't understand why that produces different results.  The first one
adds a third dimension to the table, but limits that third dimension
to a single element, Japan.  Shouldn't it be the same?  And again,
what's that zero column?

Thank you,
Matt


On 6/18/07, jim holtman <[EMAIL PROTECTED]> wrote:
> If you are running on windows, make sure you have 'recording' checked in the
> history window of the graphics.  You can also put the output to a pdf file
> and view it later.
>
> If you use table on the counts matrix:
>
> > table(counts)
> counts
>   0   1   2   3
> 253  20   8   9
> >
>
> this shows that there were 20 single tries, 8 files downloaded twice and 9
> three times.  Is this what you want?
>
> You can also get the indices of the non-zero entries by:
>
> > which(counts != 0, arr.ind=TRUE)
>       row col
> file1   1   1
> file5   6   2
> file1   1   3
> file2   3   3
> file7   8   4
> file8   9   4
> file1   1   5
> file2   3   5
> file2   3   6
> .
>
>
>
>
> On 6/18/07, Matthew Trunnell <[EMAIL PROTECTED]> wrote:
> > Jim,
> > Thanks for the quick reply!  When I run your code, I end up with a
> > single barplot of one datapoint, file9 vs email20 == 2.0.  I see the
> > call to barplot is inside a for loop... maybe it's zooming through the
> > display of many barplots, but all I see is the last one?
> >
> > In any case, I need to figure out the distribution of the retries, such as
> > No. Retries   Count
> > 1 6
> > 2 13
> > 3 5
> > 4 3
> > 5 2
> > 6 1
> >
> > That is, 6 people retried the download once; 13 people retried the
> > download twice, etc.  So it would be counting the frequency of the
> > email-filename combination, and grouping those together by the number
> > of retries.  Does that make sense?
> >
> > When I look at the counts object from your code, I can see that it's
> > close to what I need.  How do I access the properties of the counts
> > object-- it's a table, right?  If I look at counts[1,1], that returns
> > 1.  But how do I get at the row/col name of that cell?  Is that cell
> > an object?  rownames(counts[1,1]) returns null.
> >
> > Thanks,
> > Matt
> >
> >
> > On 6/18/07, jim holtman <[EMAIL PROTECTED]> wrote:
> > > You should be using barplot and not hist.  I think this produces what you
> > > want:
> > >
> > > x <- "filename,last_modified,email_addr,country_residence
> > >
> > > file1,3/4/2006 13:54,email1,Korea (South)
> > > file2,3/4/2006 14:33,email2,United States
> > > file2,3/4/2006 16:03,email2,United States
> > > file2,3/4/2006 16:17,email3,United States
> > > file2,3/4/2006 16:28,email3,United States
> > > file3,3/4/2006 19:13,email4,United States
> > > file2,3/4/2006 21:22,email5,India
> > > file4,3/4/2006 21:46,email6,United States
> > > file1,3/4/2006 22:04,email7,Japan
> > > file2,3/4/2006 22:09,email8,Croatia
> > > file1,3/4/2006 22:22,email7,Japan
> > > file1,3/4/2006 22:29,email9,India
> > > file1,3/4/2006 23:06,email6,United States
> > > file1,3/4/2006 23:33,email6,United States
> > > file5,3/4/2006 23:44,email10,China
> > > file1,3/5/2006 0:13,email9,India
> > > file2,3/5/2006 0:52,email8,Croatia
> > > file2,3/5/2006 0:54,email8,Croatia
> > > file2,3/5/2006 1:10,email5,India
> > > file6,3/5/2006 2:17,email9,India
> > > file2,3/5/2006 2:24,email11,Italy
> > > file7,3/5/2006 2:36,email12,Italy
> > > file8,3/5/2006 2:52,email12,Italy
> > > file2,3/5/2006 3:09,email13,United Kingdom
> > > file2,3/5/2006 4:02,email14,India
> > > file2,3/5/2006 4:07,email14,India
> > > file2,3/5/2006 4:14,email14,India
> > > file2,3/5/2006 4:37,email5,India
> > > file2,3/5/2006 4:44,email15,Belgium
> > > file1,3/5/2006 5:02,email9,India
> > > file1,3/5/2006 5:24,email16,Taiwan
> > > file2,3/5/2006 6:06,email17,Saudi Arabia
> > > file2,3/5/2006 7:32,email17,Saudi Arabia
> > > file2,3/5/2006 8:12,email18,Brazil
> > > file2,3/5/2006 8:26,email18,Brazil
> > > file2,3/5/2006 9:49,email19,United Kingdom
> > > file1,3/5/2006 10:49,email11,Italy
> > > file1,3/5/2006 11:16,email13,United Kingdom
> > > file1,3/5/2006 11:16,email13,United Kingdom
> > > file1,3/5/2006 11:45,email13,United Kingdom
> > > file1,3/5/2006 14:34,email20,Australia
> > > file9,3/5/2006 14:56,email20,Australia
> > > file9,3/5/2006 14:56,email20,Australia
> > > file5,3/5/2006 16:43,email21,United States
> > > file1,3/5/2006 17:17,email7,Japan
>

Re: [R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-18 Thread jim holtman
If you are running on windows, make sure you have 'recording' checked in the
history window of the graphics.  You can also put the output to a pdf file
and view it later.
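
The PDF route could look like the sketch below. The data, the object names and the output path are made up for illustration; only pdf(), barplot() and dev.off() are standard R. Each barplot() call starts a new page, so every plot from the loop is kept.

```r
# Toy counts table (hypothetical data): rows are files, columns are addresses.
toy <- data.frame(
  filename   = c("fileA", "fileA", "fileB"),
  email_addr = c("u1", "u1", "u2")
)
counts <- table(toy$filename, toy$email_addr)

# Open a PDF device; the path is only an example.
out_file <- file.path(tempdir(), "barplots.pdf")
pdf(out_file)
for (i in seq_len(nrow(counts))) {
  used <- which(counts[i, ] > 0)   # addresses that downloaded file i
  barplot(counts[i, used], names.arg = colnames(counts)[used],
          main = rownames(counts)[i], las = 2)
}
dev.off()  # close the device so the file is written
```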

If you use table on the counts matrix:

> table(counts)
counts
  0   1   2   3
253  20   8   9
>

this shows that 20 filename/email combinations were downloaded once, 8
twice and 9 three times.  Is this what you want?

You can also get the indices of the non-zero entries by:

> which(counts != 0, arr.ind=TRUE)
      row col
file1   1   1
file5   6   2
file1   1   3
file2   3   3
file7   8   4
file8   9   4
file1   1   5
file2   3   5
file2   3   6
.
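
Another way to get at the row and column labels of each cell, sketched on made-up data (toy, cells and n_downloads are illustrative names), is to convert the table to a data frame with one row per cell:

```r
# Toy data (hypothetical).
toy <- data.frame(
  filename   = c("fileA", "fileA", "fileB"),
  email_addr = c("u1", "u1", "u2")
)
counts <- table(filename = toy$filename, email_addr = toy$email_addr)

# Long format: one row per cell, with the labels as ordinary columns.
cells <- as.data.frame(counts, responseName = "n_downloads")

# Keep only the pairs that actually occur.
seen <- cells[cells$n_downloads > 0, ]

# Labels can also be looked up by position from the dimnames.
first_row_name <- rownames(counts)[1]
first_col_name <- colnames(counts)[1]
```

A single cell such as counts[1,1] is just a number, so it carries no names of its own; the labels live in the dimnames of the whole table.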



On 6/18/07, Matthew Trunnell <[EMAIL PROTECTED]> wrote:
>
> Jim,
> Thanks for the quick reply!  When I run your code, I end up with a
> single barplot of one datapoint, file9 vs email20 == 2.0.  I see the
> call to barplot is inside a for loop... maybe it's zooming through the
> display of many barplots, but all I see is the last one?
>
> In any case, I need to figure out the distribution of the retries, such as
> No. Retries   Count
> 1 6
> 2 13
> 3 5
> 4 3
> 5 2
> 6 1
>
> That is, 6 people retried the download once; 13 people retried the
> download twice, etc.  So it would be counting the frequency of the
> email-filename combination, and grouping those together by the number
> of retries.  Does that make sense?
>
> When I look at the counts object from your code, I can see that it's
> close to what I need.  How do I access the properties of the counts
> object-- it's a table, right?  If I look at counts[1,1], that returns
> 1.  But how do I get at the row/col name of that cell?  Is that cell
> an object?  rownames(counts[1,1]) returns null.
>
> Thanks,
> Matt
>
>
> On 6/18/07, jim holtman <[EMAIL PROTECTED]> wrote:
> > You should be using barplot and not hist.  I think this produces what you
> > want:
> >
> > x <- "filename,last_modified,email_addr,country_residence
> >
> > file1,3/4/2006 13:54,email1,Korea (South)
> > file2,3/4/2006 14:33,email2,United States
> > file2,3/4/2006 16:03,email2,United States
> > file2,3/4/2006 16:17,email3,United States
> > file2,3/4/2006 16:28,email3,United States
> > file3,3/4/2006 19:13,email4,United States
> > file2,3/4/2006 21:22,email5,India
> > file4,3/4/2006 21:46,email6,United States
> > file1,3/4/2006 22:04,email7,Japan
> > file2,3/4/2006 22:09,email8,Croatia
> > file1,3/4/2006 22:22,email7,Japan
> > file1,3/4/2006 22:29,email9,India
> > file1,3/4/2006 23:06,email6,United States
> > file1,3/4/2006 23:33,email6,United States
> > file5,3/4/2006 23:44,email10,China
> > file1,3/5/2006 0:13,email9,India
> > file2,3/5/2006 0:52,email8,Croatia
> > file2,3/5/2006 0:54,email8,Croatia
> > file2,3/5/2006 1:10,email5,India
> > file6,3/5/2006 2:17,email9,India
> > file2,3/5/2006 2:24,email11,Italy
> > file7,3/5/2006 2:36,email12,Italy
> > file8,3/5/2006 2:52,email12,Italy
> > file2,3/5/2006 3:09,email13,United Kingdom
> > file2,3/5/2006 4:02,email14,India
> > file2,3/5/2006 4:07,email14,India
> > file2,3/5/2006 4:14,email14,India
> > file2,3/5/2006 4:37,email5,India
> > file2,3/5/2006 4:44,email15,Belgium
> > file1,3/5/2006 5:02,email9,India
> > file1,3/5/2006 5:24,email16,Taiwan
> > file2,3/5/2006 6:06,email17,Saudi Arabia
> > file2,3/5/2006 7:32,email17,Saudi Arabia
> > file2,3/5/2006 8:12,email18,Brazil
> > file2,3/5/2006 8:26,email18,Brazil
> > file2,3/5/2006 9:49,email19,United Kingdom
> > file1,3/5/2006 10:49,email11,Italy
> > file1,3/5/2006 11:16,email13,United Kingdom
> > file1,3/5/2006 11:16,email13,United Kingdom
> > file1,3/5/2006 11:45,email13,United Kingdom
> > file1,3/5/2006 14:34,email20,Australia
> > file9,3/5/2006 14:56,email20,Australia
> > file9,3/5/2006 14:56,email20,Australia
> > file5,3/5/2006 16:43,email21,United States
> > file1,3/5/2006 17:17,email7,Japan
> > file2,3/5/2006 17:26,email22,Japan
> > file2,3/5/2006 17:27,email22,Japan
> > file2,3/5/2006 17:33,email23,China
> > file1,3/5/2006 17:45,email22,Japan
> > file2,3/5/2006 17:45,email22,Japan
> > file2,3/5/2006 17:59,email23,China
> > file1,3/5/2006 18:27,email24,Japan
> > file1,3/5/2006 18:47,email25,Taiwan
> > file2,3/5/2006 18:48,email26,New Zealand
> > file2,3/5/2006 19:15,email27,Canada
> > file2,3/5/2006 19:23,email28,Canada
> > file2,3/5/2006 19:24,email28,Canada
> > file10,3/5/2006 19:49,email29,Japan
> > file10,3/5/2006 19:52,email29,Japan
> > file10,3/5/2006 19:57,email29,Japan
> > file2,3/5/2006 20:01,email29,Japan
> > file2,3/5/2006 20:02,email29,Japan
> > file2,3/5/2006 20:06,email29,Japan"
> > d <- read.csv(textConnection(x))
> > barplot(table(d$filename), main="All Files", las=2)  # plot counts for all the files
> > # generate plots for each file name showing which emails used them
> > counts <- table(d$filename, d$email_addr)
> > for (i in seq(nrow(counts))){
> > .index <- which(counts[i,] > 0)
> > barplot(counts[i, .index], las=2,
> > n

Re: [R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-18 Thread Matthew Trunnell
Jim,
Thanks for the quick reply!  When I run your code, I end up with a
single barplot of one datapoint, file9 vs email20 == 2.0.  I see the
call to barplot is inside a for loop... maybe it's zooming through the
display of many barplots, but all I see is the last one?

In any case, I need to figure out the distribution of the retries, such as
No. Retries   Count
1 6
2 13
3 5
4 3
5 2
6 1

That is, 6 people retried the download once; 13 people retried the
download twice, etc.  So it would be counting the frequency of the
email-filename combination, and grouping those together by the number
of retries.  Does that make sense?

When I look at the counts object from your code, I can see that it's
close to what I need.  How do I access the properties of the counts
object-- it's a table, right?  If I look at counts[1,1], that returns
1.  But how do I get at the row/col name of that cell?  Is that cell
an object?  rownames(counts[1,1]) returns null.

Thanks,
Matt


On 6/18/07, jim holtman <[EMAIL PROTECTED]> wrote:
> You should be using barplot and not hist.  I think this produces what you
> want:
>
> x <- "filename,last_modified,email_addr,country_residence
>
> file1,3/4/2006 13:54,email1,Korea (South)
> file2,3/4/2006 14:33,email2,United States
> file2,3/4/2006 16:03,email2,United States
> file2,3/4/2006 16:17,email3,United States
> file2,3/4/2006 16:28,email3,United States
> file3,3/4/2006 19:13,email4,United States
> file2,3/4/2006 21:22,email5,India
> file4,3/4/2006 21:46,email6,United States
> file1,3/4/2006 22:04,email7,Japan
> file2,3/4/2006 22:09,email8,Croatia
> file1,3/4/2006 22:22,email7,Japan
> file1,3/4/2006 22:29,email9,India
> file1,3/4/2006 23:06,email6,United States
> file1,3/4/2006 23:33,email6,United States
> file5,3/4/2006 23:44,email10,China
> file1,3/5/2006 0:13,email9,India
> file2,3/5/2006 0:52,email8,Croatia
> file2,3/5/2006 0:54,email8,Croatia
> file2,3/5/2006 1:10,email5,India
> file6,3/5/2006 2:17,email9,India
> file2,3/5/2006 2:24,email11,Italy
> file7,3/5/2006 2:36,email12,Italy
> file8,3/5/2006 2:52,email12,Italy
> file2,3/5/2006 3:09,email13,United Kingdom
> file2,3/5/2006 4:02,email14,India
> file2,3/5/2006 4:07,email14,India
> file2,3/5/2006 4:14,email14,India
> file2,3/5/2006 4:37,email5,India
> file2,3/5/2006 4:44,email15,Belgium
> file1,3/5/2006 5:02,email9,India
> file1,3/5/2006 5:24,email16,Taiwan
> file2,3/5/2006 6:06,email17,Saudi Arabia
> file2,3/5/2006 7:32,email17,Saudi Arabia
> file2,3/5/2006 8:12,email18,Brazil
> file2,3/5/2006 8:26,email18,Brazil
> file2,3/5/2006 9:49,email19,United Kingdom
> file1,3/5/2006 10:49,email11,Italy
> file1,3/5/2006 11:16,email13,United Kingdom
> file1,3/5/2006 11:16,email13,United Kingdom
> file1,3/5/2006 11:45,email13,United Kingdom
> file1,3/5/2006 14:34,email20,Australia
> file9,3/5/2006 14:56,email20,Australia
> file9,3/5/2006 14:56,email20,Australia
> file5,3/5/2006 16:43,email21,United States
> file1,3/5/2006 17:17,email7,Japan
> file2,3/5/2006 17:26,email22,Japan
> file2,3/5/2006 17:27,email22,Japan
> file2,3/5/2006 17:33,email23,China
> file1,3/5/2006 17:45,email22,Japan
> file2,3/5/2006 17:45,email22,Japan
> file2,3/5/2006 17:59,email23,China
> file1,3/5/2006 18:27,email24,Japan
> file1,3/5/2006 18:47,email25,Taiwan
> file2,3/5/2006 18:48,email26,New Zealand
> file2,3/5/2006 19:15,email27,Canada
> file2,3/5/2006 19:23,email28,Canada
> file2,3/5/2006 19:24,email28,Canada
> file10,3/5/2006 19:49,email29,Japan
> file10,3/5/2006 19:52,email29,Japan
> file10,3/5/2006 19:57,email29,Japan
> file2,3/5/2006 20:01,email29,Japan
> file2,3/5/2006 20:02,email29,Japan
> file2,3/5/2006 20:06,email29,Japan"
> d <- read.csv(textConnection(x))
> barplot(table(d$filename), main="All Files", las=2)  # plot counts for all the files
> # generate plots for each file name showing which emails used them
> counts <- table(d$filename, d$email_addr)
> for (i in seq(nrow(counts))){
> .index <- which(counts[i,] > 0)
> barplot(counts[i, .index], las=2,
> names.arg=colnames(counts)[.index], main=rownames(counts)[i])
> }
>
>
>
> On 6/18/07, Matthew Trunnell <[EMAIL PROTECTED]> wrote:
> >
> > Hello R gurus,
> >
> > I just spent my first weekend wrestling with R, but so far have come
> > up empty handed.
> >
> > I have a dataset that represents file downloads; it has 4 dimensions:
> > date, filename, email, and country.  (sample data below)
> >
> > My first goal is to get an idea of the frequency of repeated
> > downloads.  Let me explain that.  Some people tend to download
> > multiple times, e.g. if the download fails they keep trying over and
> > over.  I'm trying to build a histogram that shows the repeat count
> > along the x-axis, that is, how many people downloaded once, twice,
> > three times, etc.  I plan to compare the median of that before and
> > after we switched ISPs.
> >
> > To accomplish this, I'm assuming that I'll first need to combine

Re: [R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-18 Thread jim holtman
You should be using barplot and not hist.  I think this produces what you
want:

x <- "filename,last_modified,email_addr,country_residence
file1,3/4/2006 13:54,email1,Korea (South)
file2,3/4/2006 14:33,email2,United States
file2,3/4/2006 16:03,email2,United States
file2,3/4/2006 16:17,email3,United States
file2,3/4/2006 16:28,email3,United States
file3,3/4/2006 19:13,email4,United States
file2,3/4/2006 21:22,email5,India
file4,3/4/2006 21:46,email6,United States
file1,3/4/2006 22:04,email7,Japan
file2,3/4/2006 22:09,email8,Croatia
file1,3/4/2006 22:22,email7,Japan
file1,3/4/2006 22:29,email9,India
file1,3/4/2006 23:06,email6,United States
file1,3/4/2006 23:33,email6,United States
file5,3/4/2006 23:44,email10,China
file1,3/5/2006 0:13,email9,India
file2,3/5/2006 0:52,email8,Croatia
file2,3/5/2006 0:54,email8,Croatia
file2,3/5/2006 1:10,email5,India
file6,3/5/2006 2:17,email9,India
file2,3/5/2006 2:24,email11,Italy
file7,3/5/2006 2:36,email12,Italy
file8,3/5/2006 2:52,email12,Italy
file2,3/5/2006 3:09,email13,United Kingdom
file2,3/5/2006 4:02,email14,India
file2,3/5/2006 4:07,email14,India
file2,3/5/2006 4:14,email14,India
file2,3/5/2006 4:37,email5,India
file2,3/5/2006 4:44,email15,Belgium
file1,3/5/2006 5:02,email9,India
file1,3/5/2006 5:24,email16,Taiwan
file2,3/5/2006 6:06,email17,Saudi Arabia
file2,3/5/2006 7:32,email17,Saudi Arabia
file2,3/5/2006 8:12,email18,Brazil
file2,3/5/2006 8:26,email18,Brazil
file2,3/5/2006 9:49,email19,United Kingdom
file1,3/5/2006 10:49,email11,Italy
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:45,email13,United Kingdom
file1,3/5/2006 14:34,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file5,3/5/2006 16:43,email21,United States
file1,3/5/2006 17:17,email7,Japan
file2,3/5/2006 17:26,email22,Japan
file2,3/5/2006 17:27,email22,Japan
file2,3/5/2006 17:33,email23,China
file1,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:59,email23,China
file1,3/5/2006 18:27,email24,Japan
file1,3/5/2006 18:47,email25,Taiwan
file2,3/5/2006 18:48,email26,New Zealand
file2,3/5/2006 19:15,email27,Canada
file2,3/5/2006 19:23,email28,Canada
file2,3/5/2006 19:24,email28,Canada
file10,3/5/2006 19:49,email29,Japan
file10,3/5/2006 19:52,email29,Japan
file10,3/5/2006 19:57,email29,Japan
file2,3/5/2006 20:01,email29,Japan
file2,3/5/2006 20:02,email29,Japan
file2,3/5/2006 20:06,email29,Japan"
d <- read.csv(textConnection(x))
barplot(table(d$filename), main="All Files", las=2)  # plot counts for all the files
# generate plots for each file name showing which emails used them
counts <- table(d$filename, d$email_addr)
for (i in seq(nrow(counts))){
    .index <- which(counts[i,] > 0)
    barplot(counts[i, .index], las=2,
        names.arg=colnames(counts)[.index], main=rownames(counts)[i])
}



On 6/18/07, Matthew Trunnell <[EMAIL PROTECTED]> wrote:
>
> Hello R gurus,
>
> I just spent my first weekend wrestling with R, but so far have come
> up empty handed.
>
> I have a dataset that represents file downloads; it has 4 dimensions:
> date, filename, email, and country.  (sample data below)
>
> My first goal is to get an idea of the frequency of repeated
> downloads.  Let me explain that.  Some people tend to download
> multiple times, e.g. if the download fails they keep trying over and
> over.  I'm trying to build a histogram that shows the repeat count
> along the x-axis, that is, how many people downloaded once, twice,
> three times, etc.  I plan to compare the median of that before and
> after we switched ISPs.
>
> To accomplish this, I'm assuming that I'll first need to combine the
> email and filename columns so as to represent a single download
> attempt by an individual.  Does that sound right?  Later, it would be
> nice to limit the histogram to a single filename, country, or company.
> I can probably figure that out myself after I understand how to write
> this funky histogram expression.
>
> With the help of Verzani's introductory text, I've learned how to read
> in the CSV data and do some simple tables, like this:
>
> hist(table(d$filename))
> hist(table(d$filename[substring(d$filename, 1, 5)=="file1"]))
> hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"])))
>
> Obviously, these commands count the frequency of the files.  What I'd
> like to see are the repeats grouped along the x-axis;  I'd like to
> find, for all files, the distribution of retries.  I hope that makes
> sense. :)
>
> Can someone point me in the right direction?  I'm very new to R and to
> statistics, but I write code for a living.  At this point I'd almost
> be better off writing a program to do this kind of simple counting... but
> I have a feeling R would be so useful if I could just get past the
> initial learning curve.
>
> Thank you in advance,
> Matt
>
> Here's some real data, with the private info replaced :)
>
> d<-read.table
> (file="C:\\users\\trunnellm\\downloads\\stat

[R] Histograms with strings, grouped by repeat count (w/ data)

2007-06-18 Thread Matthew Trunnell
Hello R gurus,

I just spent my first weekend wrestling with R, but so far have come
up empty handed.

I have a dataset that represents file downloads; it has 4 dimensions:
date, filename, email, and country.  (sample data below)

My first goal is to get an idea of the frequency of repeated
downloads.  Let me explain that.  Some people tend to download
multiple times, e.g. if the download fails they keep trying over and
over.  I'm trying to build a histogram that shows the repeat count
along the x-axis, that is, how many people downloaded once, twice,
three times, etc.  I plan to compare the median of that before and
after we switched ISPs.

To accomplish this, I'm assuming that I'll first need to combine the
email and filename columns so as to represent a single download
attempt by an individual.  Does that sound right?  Later, it would be
nice to limit the histogram to a single filename, country, or company.
I can probably figure that out myself after I understand how to write
this funky histogram expression.
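
The combine-then-count idea could be sketched as below. This is only one possible reading of it, on made-up data; toy, pair and the other names are illustrative.

```r
# Toy data (hypothetical).
toy <- data.frame(
  filename   = c("fileA", "fileA", "fileB", "fileA"),
  email_addr = c("u1", "u1", "u2", "u2")
)

# One key per person/file pair.
pair <- paste(toy$email_addr, toy$filename, sep = ":")

# Downloads per pair; only pairs that occur are present, so no zeros.
per_pair <- table(pair)

# How many pairs were downloaded once, twice, and so on.
repeat_dist <- table(as.vector(per_pair))

# Median number of downloads per pair.
med <- median(as.vector(per_pair))
```

Calling barplot(repeat_dist) would then draw the repeat count along the x-axis.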

With the help of Verzani's introductory text, I've learned how to read
in the CSV data and do some simple tables, like this:

hist(table(d$filename))
hist(table(d$filename[substring(d$filename, 1, 5)=="file1"]))
hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"])))

Obviously, these commands count the frequency of the files.  What I'd
like to see are the repeats grouped along the x-axis;  I'd like to
find, for all files, the distribution of retries.  I hope that makes
sense. :)

Can someone point me in the right direction?  I'm very new to R and to
statistics, but I write code for a living.  At this point I'd almost
be better off writing a program to do this kind of simple counting... but
I have a feeling R would be so useful if I could just get past the
initial learning curve.

Thank you in advance,
Matt

Here's some real data, with the private info replaced :)

 
d<-read.table(file="C:\\users\\trunnellm\\downloads\\statistics\\downloads.csv",
sep=",", quote="\"", header=TRUE)

filename,last_modified,email_addr,country_residence
file1,3/4/2006 13:54,email1,Korea (South)
file2,3/4/2006 14:33,email2,United States
file2,3/4/2006 16:03,email2,United States
file2,3/4/2006 16:17,email3,United States
file2,3/4/2006 16:28,email3,United States
file3,3/4/2006 19:13,email4,United States
file2,3/4/2006 21:22,email5,India
file4,3/4/2006 21:46,email6,United States
file1,3/4/2006 22:04,email7,Japan
file2,3/4/2006 22:09,email8,Croatia
file1,3/4/2006 22:22,email7,Japan
file1,3/4/2006 22:29,email9,India
file1,3/4/2006 23:06,email6,United States
file1,3/4/2006 23:33,email6,United States
file5,3/4/2006 23:44,email10,China
file1,3/5/2006 0:13,email9,India
file2,3/5/2006 0:52,email8,Croatia
file2,3/5/2006 0:54,email8,Croatia
file2,3/5/2006 1:10,email5,India
file6,3/5/2006 2:17,email9,India
file2,3/5/2006 2:24,email11,Italy
file7,3/5/2006 2:36,email12,Italy
file8,3/5/2006 2:52,email12,Italy
file2,3/5/2006 3:09,email13,United Kingdom
file2,3/5/2006 4:02,email14,India
file2,3/5/2006 4:07,email14,India
file2,3/5/2006 4:14,email14,India
file2,3/5/2006 4:37,email5,India
file2,3/5/2006 4:44,email15,Belgium
file1,3/5/2006 5:02,email9,India
file1,3/5/2006 5:24,email16,Taiwan
file2,3/5/2006 6:06,email17,Saudi Arabia
file2,3/5/2006 7:32,email17,Saudi Arabia
file2,3/5/2006 8:12,email18,Brazil
file2,3/5/2006 8:26,email18,Brazil
file2,3/5/2006 9:49,email19,United Kingdom
file1,3/5/2006 10:49,email11,Italy
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:45,email13,United Kingdom
file1,3/5/2006 14:34,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file5,3/5/2006 16:43,email21,United States
file1,3/5/2006 17:17,email7,Japan
file2,3/5/2006 17:26,email22,Japan
file2,3/5/2006 17:27,email22,Japan
file2,3/5/2006 17:33,email23,China
file1,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:59,email23,China
file1,3/5/2006 18:27,email24,Japan
file1,3/5/2006 18:47,email25,Taiwan
file2,3/5/2006 18:48,email26,New Zealand
file2,3/5/2006 19:15,email27,Canada
file2,3/5/2006 19:23,email28,Canada
file2,3/5/2006 19:24,email28,Canada
file10,3/5/2006 19:49,email29,Japan
file10,3/5/2006 19:52,email29,Japan
file10,3/5/2006 19:57,email29,Japan
file2,3/5/2006 20:01,email29,Japan
file2,3/5/2006 20:02,email29,Japan
file2,3/5/2006 20:06,email29,Japan
