I would have thought that a csv file written out by Excel would have looked
like this for your data:
ADS GY Equity, ,ALV GY Equity,
Date, Px Last,Date , Px Last
07/02/04, 41,395 ,07/01/31 , 130,234
07/02/05, 42,134 ,07/02/01 , 133,353
07/02/06, 41,875 ,07/02/04 , 133,824
, , 07/02/05, 134,734
Here we have commas separating the fields and this will read in the data:
x - read.csv(/tempxx.csv, skip=2, header=FALSE, as.is=TRUE)
str(x)
'data.frame': 4 obs. of 4 variables:
$ V1: chr 07/02/04 07/02/05 07/02/06
$ V2: chr 41,395 42,134 41,875
$ V3: chr 07/01/31 07/02/01 07/02/0407/02/05
$ V4: chr 130,234 133,353 133,824 134,734
x
V1 V2 V3 V4
1 07/02/0441,395 07/01/31130,234
2 07/02/0542,134 07/02/01133,353
3 07/02/0641,875 07/02/04133,824
4 07/02/05 134,734
You will have to delete the ',' from the numeric fields, The 'dec=,' on
your command would have read '133,353' in as '133.353' which I don't think
is what you want. You can delete the commas and convert to numeric with the
following:
x$V4 - as.numeric(gsub(',', '', x$V4))
x
V1 V2 V3 V4
1 07/02/0441,395 07/01/31 130234
2 07/02/0542,134 07/02/01 133353
3 07/02/0641,875 07/02/04 133824
4 07/02/05 134734
So take a look at the options that you have with writing out the Excel data;
you should be able to get either tab or comma delimited. If not, you can
try reading in each line as a character string and then using 'substr' to
split it apart assuming that you have constant width columns.
On 3/7/07, Hofert Marius [EMAIL PROTECTED] wrote:
Dear r-help users,
I have the following simple problem: Reading data from a file. The
file is a .txt file exported (save as...) from Excel (see below for
an example). The Excel file consists of two header rows (first row
consists of ticker symbols of stocks, the second row consists of
column explanations (Date,Px Last), followed by several rows of
data. Now forget about the first two rows, I can deal with that (read
separately, then extract the actual ticker symbols ADS,
ALV, ...). For reading the rest, I tried several things, for example:
data=read.table(infile,quote=,fill=T,dec=,,skip=2,colClasses=rep(c
(character,numeric),ntickers))
or
data=matrix(scan(file=infile,what=rep(c
(character,numeric),ntickers),dec=,,skip=2),ncol=2*ntickers,byrow=
T)
where infile specifies the path to the input file and ntickers is
the number of ticker-columns in the data set, so in the example
below, ntickers=2.
Both ways of reading the data work perfectly fine if all columns have
the same length (i.e. the same number of filled rows), so if the data
is given in a (filled) rectangular form. Now, as you can imagine,
there are days when one stock is traded but not the other... so,
there might be columns that do not have the same number of filled
rows (see below, for the stock with ticker symbol ADS, only 3
trading days are shown, so this column is shorter than the data
column for the stock ALV). Now, if I export such a structure to
a .txt file, then all (by default) blank fields will be replaced by
\t, i.e. tabs. Both reading procedures as give above have problems
as they either display that the number of rows/columns do not fit
together or as they read the table, but some cells are shifted to the
left (for the example below, the entry 07/02/05134,7 appears in
the empty field of the stock ADS which is of course not what we want).
So the simple question is: How do I read such a structure?
Can there be a simple solution? The problem is simply that empty
cells are replace by \t which are then ignored for reading. So how
do we distinguish between the empty cells that are given between the
columns and the empty cells that actually fill a column to have the
same length as other columns. Of course I could manually put in a
certain character (e.g. a *) to fill in the gaps, but the data set
is simply too large. If it helps, these blank fields only appear in
the end of each column, not in the middle.
As I work on a Mac (OS X 10.4), it was not possible (at least to me)
to read the data directly from the Excel file vial the library RODBC
or read.xls.
Note, that the same problem arises, when I export the Excel file as
a .csv, then all blank fields are separated by ; instead of \t
and the reading procedure can also not decide if the field
corresponds to an empty separating column or actually to a column
with given entries, but which