On 02/06/2010 09:05 AM, analys...@hotmail.com wrote:
On Feb 5, 8:57 am, Barry Rowlingson <b.rowling...@lancaster.ac.uk>
wrote:
On Fri, Feb 5, 2010 at 10:23 AM, analys...@hotmail.com
<analys...@hotmail.com> wrote:
the csv files are downloaded from a database and it looks like some
character fields contain the CR-LF sequence within them.
This causes R to see a new record/row and the number of rows it sees
is different (usually higher) from the number of rows actually
extracted.
Hard to tell without an example, but I just tried this in a file:
1,2,"this
is a test",99
2,3,"oneliner",45
and:
read.table("test.csv",sep=",")
  V1 V2              V3 V4
1  1  2 this\nis a test 99
2  2  3        oneliner 45
seemed to work. But if your strings aren't "quoted" (hard to tell
without an example) then you might have to find another way. Hard to
tell without an example.
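For anyone who wants to reproduce the test above, here is a minimal sketch. The temporary file path is an assumption made for illustration; the thread only names "test.csv". It relies on read.table's default quote setting, which includes the double quote.

```r
# Write the three-line test file to a temporary path (hypothetical location).
path <- tempfile(fileext = ".csv")
writeLines(c('1,2,"this', 'is a test",99', '2,3,"oneliner",45'), path)

# With the default quote characters, the line break inside the quoted
# field is kept as part of the field instead of starting a new row.
quoted <- read.table(path, sep = ",")

rows_parsed    <- nrow(quoted)             # records seen by read.table
lines_physical <- length(readLines(path))  # physical lines in the file
```

Comparing rows_parsed with lines_physical shows whether quoting is absorbing the embedded line break: the file has three physical lines but only two records.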
Barry
______________________________________________
r-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Here is a hex dump (please ignore the '>' at the start of each line) of
the file that results from extracting two rows.
EF BB BF 64 65 73 63 72-69 70 74 69 6F 6E 0D 0A  ...description..
22 3C 73 74 72 6F 6E 67-3E 55 6E 6B 6E 6F 77 6E  "<strong>Unknown
20 41 6E 79 74 69 6D 65-2C 20 41 6E 79 77 68 65   Anytime, Anywhe
72 65 20 4C 65 61 72 6E-69 6E 67 3C 62 72 20 2F  re Learning<br /
3E 0D 0A 3C 2F 73 74 72-6F 6E 67 3E 20 54 68 65  >..</strong> The
20 61 6E 73 77 65 72 20-69 73 20 55 6E 6B 6E 6F   answer is Unkno
77 6E 2E 20 3C 73 74 72-6F 6E 67 3E 20 79 6F 75  wn. <strong> you
20 63 61 6E 20 73 74 61-72 74 20 61 6E 64 20 66   can start and f
69 6E 69 73 68 20 69 6E-20 6C 65 73 73 20 74 68  inish in less th
65 6E 20 31 37 20 6D 6F-6E 74 68 73 2E 3C 2F 73  en 17 months.</s
74 72 6F 6E 67 3E 20 3C-62 72 20 2F 3E 0D 0A 3C  trong> <br />..<
62 72 20 2F 3E 0D 0A 55-6E 6B 6E 6F 77 6E 20 61  br />..Unknown a
62 6F 75 74 20 65 6E 73-75 72 69 6E 67 20 79 6F  bout ensuring yo
75 20 6C 65 61 72 6E 20-2E 22 0D 0A 03 D8 26 8A  u learn ."....&.
R, Fortran and Excel see five lines, but the database has only two
lines.
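The mismatch can be checked on a small stand-in for the dumped file. This is only a sketch: the text below is an abbreviated imitation of the dump, not the real data, and it assumes (as the dump suggests but the thread does not state) that the embedded CR-LF pairs always sit inside a double-quoted field.

```r
# Abbreviated stand-in for the dump: a header line plus one quoted field
# that contains three embedded CR-LF pairs.
txt <- paste0(
  "description\r\n",
  "\"<strong>A<br />\r\n</strong> B <br />\r\n<br />\r\nC .\"\r\n"
)

# Start position of every CR-LF pair and of every double quote.
crlf_pos  <- gregexpr("\r\n", txt, fixed = TRUE)[[1]]
quote_pos <- gregexpr("\"", txt, fixed = TRUE)[[1]]

# A CR-LF lies inside a quoted field when an odd number of quotes precedes it.
quotes_before <- vapply(crlf_pos, function(p) sum(quote_pos < p), integer(1))
inside_field  <- quotes_before %% 2 == 1

n_line_endings   <- length(crlf_pos)    # what a line-based reader counts
n_record_endings <- sum(!inside_field)  # CR-LF pairs outside any quoted field
```

On this stand-in the two counts come out as five and two, matching the five lines seen against the two lines extracted. Doubled quotes used as escapes inside a field do not disturb the parity test, since they arrive in pairs.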
Okay, you have five CR-LF pairs, two of which are end-of-record (EOR)
markers. It looks like the <br />CR-LF is the EOR sequence, so it should
be possible to preserve those while changing the others to something
like "~" or deleting them.
As I said previously, the regexperts can work out a way to distinguish
the CR-LF pairs that are _not_ in an EOR sequence.
You might want to think about dumping the control characters as well.
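A rough sketch of both clean-up steps in base R follows. The function name clean_export and the sample text are made up for illustration. The rule for telling the CR-LF pairs apart is also an assumption: a CR-LF is treated as embedded only when it falls between a pair of double quotes, which may not match how every export is laid out.

```r
# Hypothetical helper: replace CR-LF pairs inside double-quoted segments and
# drop stray control characters, leaving record-ending CR-LF pairs alone.
clean_export <- function(txt, replacement = "~") {
  # Locate each double-quoted segment; [^"] also matches CR and LF.
  quoted   <- gregexpr("\"[^\"]*\"", txt, perl = TRUE)
  segments <- regmatches(txt, quoted)[[1]]

  # Inside quoted segments only, swap CR-LF for the replacement string.
  regmatches(txt, quoted) <-
    list(gsub("\r\n", replacement, segments, fixed = TRUE))

  # Remove ASCII control characters except tab, LF and CR.
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "", txt, perl = TRUE)
}

# Made-up sample: header, one quoted field with an embedded CR-LF, and a
# trailing 0x03 byte like the one at the end of the dump.
sample_txt <- "description\r\n\"a<br />\r\nb\"\r\n\003"
cleaned <- clean_export(sample_txt)
```

The cleaned text could then be handed to something like read.csv(text = cleaned). Passing replacement = "" deletes the embedded pairs instead of marking them with "~".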
Jim