[R] Read data with different column lengths

2007-03-07 Thread Hofert Marius
Dear r-help users,

I have the following simple problem: Reading data from a file. The  
file is a .txt file exported (save as...) from Excel (see below for  
an example). The Excel file consists of two header rows (first row  
consists of ticker symbols of stocks, the second row consists of  
column explanations (Date,Px Last), followed by several rows of  
data. Now forget about the first two rows, I can deal with that (read  
separately, then extract the actual ticker symbols ADS,  
ALV, ...). For reading the rest, I tried several things, for example:
data=read.table(infile,quote=,fill=T,dec=,,skip=2,colClasses=rep(c 
(character,numeric),ntickers))
or
data=matrix(scan(file=infile,what=rep(c 
(character,numeric),ntickers),dec=,,skip=2),ncol=2*ntickers,byrow= 
T)
where infile specifies the path to the input file and ntickers is  
the number of ticker-columns in the data set, so in the example  
below, ntickers=2.

Both ways of reading the data work perfectly fine if all columns have  
the same length (i.e. the same number of filled rows), so if the data  
is given in a (filled) rectangular form. Now, as you can imagine,  
there are days when one stock is traded but not the other... so,  
there might be columns that do not have the same number of filled  
rows (see below, for the stock with ticker symbol ADS, only 3  
trading days are shown, so this column is shorter than the data  
column for the stock ALV). Now, if I export such a structure to  
a .txt file, then all (by default) blank fields will be replaced by  
\t, i.e. tabs. Both reading procedures as give above have problems  
as they either display that the number of rows/columns do not fit  
together or as they read the table, but some cells are shifted to the  
left (for the example below, the entry 07/02/05134,7 appears in  
the empty field of the stock ADS which is of course not what we want).
So the simple question is: How do I read such a structure?
Can there be a simple solution? The problem is simply that empty  
cells are replace by \t which are then ignored for reading. So how  
do we distinguish between the empty cells that are given between the  
columns and the empty cells that actually fill a column to have the  
same length as other columns. Of course I could manually put in a  
certain character (e.g. a *) to fill in the gaps, but the data set  
is simply too large. If it helps, these blank fields only appear in  
the end of each column, not in the middle.

As I work on a Mac (OS X 10.4), it was not possible (at least to me)  
to read the data directly from the Excel file vial the library RODBC  
or read.xls.

Note, that the same problem arises, when I export the Excel file as  
a .csv, then all blank fields are separated by ; instead of \t  
and the reading procedure can also not decide if the field  
corresponds to an empty separating column or actually to a column  
with given entries, but which is simply not as long as another column  
in the file.

Hope, you can help. I would really appreciate it.

Best regards.

Marius

Excel example (I hope it's displayed correctly, the entry in the last  
row should be aligned with the last column):

ADS GY Equity   ALV GY Equity   
DatePx Last DatePx Last
07/02/0441,395  07/01/31130,234
07/02/0542,134  07/02/01133,353
07/02/0641,875  07/02/04133,824
07/02/05134,734

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Read data with different column lengths

2007-03-07 Thread jim holtman
I would have thought that a csv file written out by Excel would have looked
like this for your data:

ADS GY Equity, ,ALV GY Equity,
Date,   Px Last,Date  , Px Last
07/02/04,   41,395  ,07/01/31  , 130,234
07/02/05,   42,134  ,07/02/01  , 133,353
07/02/06,   41,875  ,07/02/04  , 133,824
 , , 07/02/05, 134,734

Here we have commas separating the fields and this will read in the data:

 x - read.csv(/tempxx.csv, skip=2, header=FALSE, as.is=TRUE)
 str(x)
'data.frame':   4 obs. of  4 variables:
 $ V1: chr  07/02/04 07/02/05 07/02/06  
 $ V2: chr 41,395  42,134  41,875
   
 $ V3: chr  07/01/31   07/02/01   07/02/0407/02/05
 $ V4: chr   130,234  133,353  133,824  134,734
 x
 V1  V2 V3   V4
1  07/02/0441,395   07/01/31130,234
2  07/02/0542,134   07/02/01133,353
3  07/02/0641,875   07/02/04133,824
4 07/02/05  134,734


You will have to delete the ',' from the numeric fields, The 'dec=,' on
your command would have read '133,353' in as '133.353' which I don't think
is what you want.  You can delete the commas and convert to numeric with the
following:

 x$V4 - as.numeric(gsub(',', '', x$V4))
 x
 V1  V2 V3 V4
1  07/02/0441,395   07/01/31   130234
2  07/02/0542,134   07/02/01   133353
3  07/02/0641,875   07/02/04   133824
4 07/02/05 134734


So take a look at the options that you have with writing out the Excel data;
you should be able to get either tab or comma delimited.  If not, you can
try reading in each line as a character string and then using 'substr' to
split it apart assuming that you have constant width columns.


On 3/7/07, Hofert Marius [EMAIL PROTECTED] wrote:

 Dear r-help users,

 I have the following simple problem: Reading data from a file. The
 file is a .txt file exported (save as...) from Excel (see below for
 an example). The Excel file consists of two header rows (first row
 consists of ticker symbols of stocks, the second row consists of
 column explanations (Date,Px Last), followed by several rows of
 data. Now forget about the first two rows, I can deal with that (read
 separately, then extract the actual ticker symbols ADS,
 ALV, ...). For reading the rest, I tried several things, for example:
 data=read.table(infile,quote=,fill=T,dec=,,skip=2,colClasses=rep(c
 (character,numeric),ntickers))
 or
 data=matrix(scan(file=infile,what=rep(c
 (character,numeric),ntickers),dec=,,skip=2),ncol=2*ntickers,byrow=
 T)
 where infile specifies the path to the input file and ntickers is
 the number of ticker-columns in the data set, so in the example
 below, ntickers=2.

 Both ways of reading the data work perfectly fine if all columns have
 the same length (i.e. the same number of filled rows), so if the data
 is given in a (filled) rectangular form. Now, as you can imagine,
 there are days when one stock is traded but not the other... so,
 there might be columns that do not have the same number of filled
 rows (see below, for the stock with ticker symbol ADS, only 3
 trading days are shown, so this column is shorter than the data
 column for the stock ALV). Now, if I export such a structure to
 a .txt file, then all (by default) blank fields will be replaced by
 \t, i.e. tabs. Both reading procedures as give above have problems
 as they either display that the number of rows/columns do not fit
 together or as they read the table, but some cells are shifted to the
 left (for the example below, the entry 07/02/05134,7 appears in
 the empty field of the stock ADS which is of course not what we want).
 So the simple question is: How do I read such a structure?
 Can there be a simple solution? The problem is simply that empty
 cells are replace by \t which are then ignored for reading. So how
 do we distinguish between the empty cells that are given between the
 columns and the empty cells that actually fill a column to have the
 same length as other columns. Of course I could manually put in a
 certain character (e.g. a *) to fill in the gaps, but the data set
 is simply too large. If it helps, these blank fields only appear in
 the end of each column, not in the middle.

 As I work on a Mac (OS X 10.4), it was not possible (at least to me)
 to read the data directly from the Excel file vial the library RODBC
 or read.xls.

 Note, that the same problem arises, when I export the Excel file as
 a .csv, then all blank fields are separated by ; instead of \t
 and the reading procedure can also not decide if the field
 corresponds to an empty separating column or actually to a column
 with given entries, but which