Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Yohan Chalabi
 TB == Ted Byers r.ted.by...@gmail.com
 on Tue, 27 Jan 2009 16:00:27 -0500

   TB I wasn't even aware I was using midnightStandard.  You won't
   TB find it in my
   TB script.
   TB
   TB Here is the relevant loop:
   TB
   TB date1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d")
   TB date1
   TB dow = 3;
   TB for (i in 1:length(V4) ) {
   TB x = read.csv(as.character(V4[[i]]), header = FALSE,
   TB na.strings="");
   TB y = x[,1];
   TB year = V2[[i]];
   TB week = V3[[i]];
   TB dtstr = sprintf("%i-%i-%i",year,week,dow);
   TB date2 = timeDate(dtstr, format = "%Y-%U-%w");
   TB resultsdataframe$dt[[i]] <- difftimeDate(date1,date2,units =
   TB "weeks");
   TB fp = fitdistr(y,"exponential");
   TB print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd));
   TB print(c(year,week,date2,resultsdataframe$dt[[i]]));
   TB resultsdataframe$estimate[[i]] <- fp$estimate;
   TB resultsdataframe$sd[[i]] <- fp$sd;
   TB }
   TB
   TB It fails with a little more than 100 records left in V4.
   TB
   TB The full error message is:
   TB
   TB Error in midnightStandard(charvec, format) :
   TB 'charvec' has non-NA entries of different number of characters

timeDate() uses the midnight standard. The function 'midnightStandard'
assumes that all entries in 'charvec' have the same 'format'. Can you
please check if this is the case?

This is all I can say from the information you provided. Please give us
a reproducible example.

We can continue this discussion off-list.

regards,
Yohan

   TB
   TB Until it fails, date2 and resultsdataframe$dt[[i]] get correct
   TB values.
   TB
   TB str() produces no surprises:
   TB
   TB  str(resultsdataframe);
   TB 'data.frame': 303 obs. of 6 variables:
   TB $ mid : int 171 206 206 206 206 206 206 206 206 218 ...
   TB $ year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008
   TB 2008 ...
   TB $ week : int 16 17 18 19 21 26 31 35 51 40 ...
   TB $ dt : num 39.9 38.9 37.9 36.9 34.9 ...
   TB $ estimate: num Inf 0.25 Inf 0.0408 0.2 ...
   TB $ sd : num Inf 0.1768 Inf 0.0289 0.1414 ...
   TB
   TB I would assume the error is related to my new code that
   TB manipulates dates,
   TB as it doesn't occur in the earlier version that did not
   TB manipulate dates
   TB (the relevant work being done, albeit very slowly, within
   TB the DB).
   TB
   TB FTR: The year and week values are generated by MySQL using
   TB the YEAR and WEEK
   TB functions applied to timestamps.  I do not know if it is
   TB relevant, but the
   TB week value, at the point of failure, is 0 (a value that does
   TB not occur
   TB earlier in the dataset, but several times subsequently),
   TB and I do not see
   TB how a value of 0 for the week (legitimate in posix date
   TB formats) could
   TB produce the error message I get.
   TB
   TB Any thoughts on what is really wrong, and how to fix it?
   TB
   TB Thanks
   TB
   TB Ted




-- 
PhD student
Swiss Federal Institute of Technology
Zurich

www.ethz.ch

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Ted Byers
Hi Yohan,  Thanks.

On Wed, Jan 28, 2009 at 4:57 AM, Yohan Chalabi chal...@phys.ethz.ch wrote:

 timeDate() uses the midnight standard. The function 'midnightStandard'
 assumes that all entries in 'charvec' have the same 'format'. Can you
 please check if this is the case?


It is certain that all entries have the same format, but I'm starting to
think that the error message is something of a red herring.  Consider this:

> year = 2009
> week = 0
> day = 3
> datestr = sprintf("%i-%i-%i",year,week,day);datestr
[1] "2009-0-3"
> date1 = timeDate(datestr, format = "%Y-%U-%w");
> date1
GMT
[1] [NA]
> day = 4
> datestr = sprintf("%i-%i-%i",year,week,day);datestr
[1] "2009-0-4"
> date1 = timeDate(datestr, format = "%Y-%U-%w");
> date1
GMT
[1] [2009-01-01]

> datestr = sprintf("%i-%i-%i",year,week,3);datestr
[1] "2009-0-3"
> date2 = timeDate(datestr, format = "%Y-%U-%w");date2
GMT
[1] [NA]
> difftimeDate(date2,date1, units = "weeks")
Error in midnightStandard(charvec, format) :
  'charvec' has non-NA entries of different number of characters
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf



The first values for year, week and day are the values on which my loop
dies.  It returns 'NA' here.  It seems clear that it is returning NA because
the date that data corresponds to is 2008-12-31.

The error is being produced by difftimeDate rather than timeDate (as shown
by the above session).  But that represents a flaw in the function design.
It should fail when taking the elapsed time between a null and the present,
but if I wrote such a function, I'd have it return null (perhaps with a
warning) rather than just die.

A bigger issue is that timeDate ought never to give null here (which is what I
assume 'NA' means), since all the data comes from transaction data with real
dates, so the elapsed time, measured in weeks, ought always to be a valid,
non-negative real number.  I have not yet come to any
conclusions as to how it ought to behave (whether to return New Year's Day,
along with a warning, or to return the date requested by reinvoking itself
with the year and week adjusted so a valid date is returned).

On a practical side, how would I test date2 to see if it is null, so I can
give it a sensible default value?
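
One way to test for a missing date and substitute a default can be sketched
in base R alone, without timeDate.  The helper name parse_week_date and the
choice of 1 January as the fallback are assumptions made for illustration,
not part of timeDate or of the original script.  The check goes through
as.POSIXct() because is.na() is defined for that class:

```r
# Hypothetical helper (not part of timeDate): build a "year-week-dow" string,
# parse it, and fall back to 1 January of that year if the result is NA.
parse_week_date <- function(year, week, dow) {
  dtstr <- sprintf("%d-%d-%d", year, week, dow)
  parsed <- as.POSIXct(strptime(dtstr, format = "%Y-%U-%w", tz = "GMT"))
  if (is.na(parsed)) {
    # Assumed default: the first day of the requested year.
    warning("could not parse '", dtstr, "'; using 1 January instead")
    parsed <- as.POSIXct(sprintf("%d-01-01", year), tz = "GMT")
  }
  parsed
}

# The combination on which the loop died: week 0, day 3 of 2009.
d <- suppressWarnings(parse_week_date(2009, 0, 3))
```

Whatever the parser does with week 0, the value returned here is never NA,
so a later difference calculation has a real date to work with.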

A more troubling thought is that with this handling of dates in this
combination of SQL (my group by clause uses
YEAR(transaction_date),WEEK(transaction_date)) to get the data and R to
process it, the week containing new years day will ALWAYS be split in two at
the first second of the new year. I'm going to have to either figure out a
way to correct this, or ignore it (as it doesn't actually make things wrong,
but rather it splits a sample into two unequal parts).
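
A possible way around the NA for week 0 is to do the date arithmetic
directly instead of parsing a string.  The sketch below assumes that the
week numbers follow the %U convention (weeks start on Sunday, and the days
before the first Sunday of the year belong to week 0), which is also what
MySQL's WEEK() returns in its default mode 0.  The function name
week_to_date is hypothetical:

```r
# Hypothetical helper: the calendar date for (year, week, dow), where week
# follows the %U convention and dow is 0 (Sunday) .. 6 (Saturday).
week_to_date <- function(year, week, dow) {
  jan1 <- as.Date(sprintf("%d-01-01", year))
  # Weekday of 1 January as 0 (Sunday) .. 6 (Saturday).
  jan1_wday <- as.POSIXlt(jan1)$wday
  # Week 1 starts on the first Sunday on or after 1 January.
  first_sunday <- jan1 + (7 - jan1_wday) %% 7
  # Week 0 lies before that Sunday, so the result may fall in the previous year.
  first_sunday + 7 * (week - 1) + dow
}

week_to_date(2009, 0, 3)  # 2008-12-31
week_to_date(2009, 0, 4)  # 2009-01-01
```

With this approach week 0, day 3 of 2009 maps to 2008-12-31 rather than to
NA.  It does not by itself merge the two halves of the week that the SQL
grouping splits at the new year.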

Thoughts?

Thanks

Ted



Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Yohan Chalabi
 TB == Ted Byers r.ted.by...@gmail.com
 on Wed, 28 Jan 2009 09:30:58 -0500

   TB The error is being produced by difftimeDate rather than timeDate
   TB (as shown
   TB by the above session).  But that represents a flaw in the
   TB function design.

This is not a flaw in timeDate. It behaves the same way as
'as.POSIXct':

strptime(datestr, format = "%Y-%U-%w")

Instead of claiming that there is a flaw in the function you could have
suggested an 'is.na' method for 'timeDate'.

I will add an 'is.na' method in the dev version of 'timeDate'.

regards,
Yohan 



Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Ted Byers
Hi Yohan,

On Wed, Jan 28, 2009 at 10:28 AM, Yohan Chalabi chal...@phys.ethz.ch wrote:

  TB == Ted Byers r.ted.by...@gmail.com
  on Wed, 28 Jan 2009 09:30:58 -0500

   TB The error is being produced by difftimeDate rather than timeDate
   TB (as shown
   TB by the above session).  But that represents a flaw in the
   TB function design.

 This is not a flaw in timeDate. it behaves the same way as
 'as.POSIXct'


That the two behave the same doesn't change the assessment that the design
is flawed.  That doesn't mean that the function is wrong.  It means only
that the behaviour can be made more useful.  For example, in SQL, if a given
calculation returns NULL, and the result is subsequently used in another
calculation, the result that returns is also NULL.  That is quite useful,
and admits algorithms that can react appropriately to NULLs when necessary.
That is arguably better than forcing the code to fail the moment a NULL is
used in a secondary calculation.  In C++, OTOH, one can catch the problem
earlier using, e.g., exceptions, again allowing the program to complete even
when problems arise for certain values or combinations thereof.
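
R offers both styles.  Base R's difftime() propagates NA much as SQL
propagates NULL, and tryCatch() lets a loop carry on after an error.  The
following is only a sketch in base R; na_on_error and always_fails are
hypothetical names, and always_fails merely stands in for a call such as
difftimeDate() on an NA date:

```r
# NA propagates through base R's difftime(), much like NULL in SQL.
gap <- difftime(as.POSIXct("2009-01-28", tz = "GMT"), as.POSIXct(NA),
                units = "weeks")
is.na(gap)

# Hypothetical wrapper: call f(...), returning NA with a warning on error.
na_on_error <- function(f, ...) {
  tryCatch(f(...), error = function(e) {
    warning("returning NA: ", conditionMessage(e))
    NA_real_
  })
}

# Stand-in for a call that dies on bad input.
always_fails <- function(x) {
  stop("'charvec' has non-NA entries of different number of characters")
}

# The loop completes even though the second iteration raises an error.
results <- vapply(1:3, function(i) {
  suppressWarnings(na_on_error(if (i == 2) always_fails else sqrt, i))
}, numeric(1))
```

Here the failing iteration leaves an NA in results and the remaining
iterations still run.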

As a software engineer, I understand the issues involved in creating
libraries.  If I want to incorporate the functionality of a given standard
suite of functions (e.g. ANSI C standard library functions, or posix
functions), my first step would be to ensure I can duplicate how they
behave.  But I would not stop there.  There are, for example, serious design
flaws in many ANSI C functions that, ignored, introduce serious security
defects in applications that use them.  I would therefore refactor them to
eliminate the security defects.  If they can not be eliminated, I would
replace the function in question by a similar function that does not have
that security defect.

Posix is a useful, but old, standard, and I am merely suggesting that once
you have duplicated it, look beyond it to ways it can be improved upon.
There is more to the design of a function than whether or not it gives the
right result with good input.  There is how it behaves when there is a
problem with the inputs and whether or not you force the calling code to die
when a problem arises or you give the calling code a way to react to such
problems.  When I add functions to my own C++ or Java libraries, I normally
include more bad input data in the unit tests than good data (though the
latter is sufficient to ensure correct results are invariably obtained),
precisely so I can document how it behaves when there is a problem and give
coders who use it a variety of options to use to deal with them.



 strptime(datestr, format = "%Y-%U-%w")

 Instead of claiming that there is a flaw in the function you could have
 suggested an 'is.na' method for 'timeDate'.


At the time, I did not know about is.na.  I have spent the past hour trying
is.na, but to no avail.  I guess that is no surprise to you, but that it
would fail is not reflected in the R documentation of is.na.  That mentions
S3, but not S4.  As I just recently started using R, I have not yet looked
at what S3 and S4 are, so that is a few more hours of study before I get
this problem solved.



 I will add an 'is.na' method in the dev version of 'timeDate'.


Thanks.  I'll benefit from that once it makes it into the production
release.  In the mean time, I need to find a way to make something similar
now, in my script.

Thanks

Ted


Re: [R] Mystery Error in midnightStandard

2009-01-28 Thread Yohan Chalabi
 TB == Ted Byers r.ted.by...@gmail.com
 on Wed, 28 Jan 2009 11:25:55 -0500

   TB Thanks.  I'll benefit from that once it makes it into the
   TB production
   TB release.  In the mean time, I need to find a way to make
   TB something similar
   TB now, in my script.

setMethod("is.na", "timeDate", function(x) is.na(as.POSIXct(x)))

   TB
   TB Thanks






[R] Mystery Error in midnightStandard

2009-01-27 Thread Ted Byers
I wasn't even aware I was using midnightStandard.  You won't find it in my
script.

Here is the relevant loop:

date1 = timeDate(charvec = Sys.Date(), format = "%Y-%m-%d")
date1
dow = 3;
for (i in 1:length(V4) ) {
  x = read.csv(as.character(V4[[i]]), header = FALSE, na.strings="");
  y = x[,1];
  year = V2[[i]];
  week = V3[[i]];
  dtstr = sprintf("%i-%i-%i",year,week,dow);
  date2 = timeDate(dtstr, format = "%Y-%U-%w");
  resultsdataframe$dt[[i]] <- difftimeDate(date1,date2,units = "weeks");
  fp = fitdistr(y,"exponential");
  print(c(V1[[i]],V2[[i]],V3[[i]],fp$estimate,fp$sd));
  print(c(year,week,date2,resultsdataframe$dt[[i]]));
  resultsdataframe$estimate[[i]] <- fp$estimate;
  resultsdataframe$sd[[i]] <- fp$sd;
}

It fails with a little more than 100 records left in V4.

The full error message is:

Error in midnightStandard(charvec, format) :
  'charvec' has non-NA entries of different number of characters

Until it fails, date2 and resultsdataframe$dt[[i]] get correct values.

str() produces no surprises:

> str(resultsdataframe);
'data.frame':	303 obs. of  6 variables:
 $ mid : int  171 206 206 206 206 206 206 206 206 218 ...
 $ year: int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ week: int  16 17 18 19 21 26 31 35 51 40 ...
 $ dt  : num  39.9 38.9 37.9 36.9 34.9 ...
 $ estimate: num  Inf 0.25 Inf 0.0408 0.2 ...
 $ sd  : num  Inf 0.1768 Inf 0.0289 0.1414 ...

I would assume the error is related to my new code that manipulates dates,
as it doesn't occur in the earlier version that did not manipulate dates
(the relevant work being done, albeit very slowly, within the DB).

FTR: The year and week values are generated by MySQL using the YEAR and WEEK
functions applied to timestamps.  I do not know if it is relevant, but the
week value, at the point of failure, is 0 (a value that does not occur
earlier in the dataset, but several times subsequently), and I do not see
how a value of 0 for the week (legitimate in posix date formats) could
produce the error message I get.

Any thoughts on what is really wrong, and how to fix it?

Thanks

Ted
