On 5/16/19 3:53 PM, Michael Boulineau wrote:
OK. So, I named the object test and then checked the 6347th item

test <- readLines("hangouts-conversation.txt")
test[6347]
[1] "2016-10-21 10:56:37 <John Doe> Admit#8242"

Perhaps where it was getting screwed up is that, since this line ends in
a number (8242) and there's no space between that number and what ought
to be the next row, R didn't know where to draw the line. Sure enough,
it looks like this when I go to the original file and Ctrl-F "#8242":

2016-10-21 10:35:36 <Jane Doe> What's your login
2016-10-21 10:56:29 <John Doe> John_Doe
2016-10-21 10:56:37 <John Doe> Admit#8242


An octothorpe is treated as a comment character by `read.table` and related functions, so everything after it on a line is dropped. You can prevent that interpretation with a suitable choice of parameters to `read.table` or `read.csv` (e.g. `comment.char = ""`). I don't understand why that should cause any error or a failure to match that pattern.
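A minimal sketch of the difference (using `read.table`'s `text=` argument rather than a file, so it is self-contained):

```r
line <- "2016-10-21 10:56:37 <John Doe> Admit#8242"

# The default comment.char = "#" truncates the last field at the octothorpe:
d1 <- read.table(text = line, stringsAsFactors = FALSE)
d1$V5                                               # "Admit"

# comment.char = "" turns comment handling off and preserves the whole token:
d2 <- read.table(text = line, comment.char = "", stringsAsFactors = FALSE)
d2$V5                                               # "Admit#8242"
```

Note that `readLines` itself never treats "#" specially; the truncation only happens in the table readers.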

2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Again, it doesn't look like that in the file. Gmail automatically
formats it like that when I paste it in. More to the point, it looks
like

2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
<John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
11:00:13 <Jane Doe> Okay so you have a discussion

Notice Admit#82422016. So there's that.
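If the file really does run records together with no newlines, one hedged way to recover the rows (assuming a stamp of the exact shape YYYY-MM-DD HH:MM:SS never occurs inside a comment) is to insert a newline in front of every date-time stamp and then re-split:

```r
smooshed <- paste0("2016-10-21 10:56:37 <John Doe> Admit#8242",
                   "2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion")

# put a newline before each "YYYY-MM-DD HH:MM:SS" stamp, then split on it
marked <- gsub("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})", "\n\\1", smooshed)
rows   <- strsplit(sub("^\n", "", marked), "\n")[[1]]
rows
# [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
# [2] "2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion"
```

Even the "Admit#82422016-10-21" smoosh comes apart correctly here, because the regex requires the hyphens and colons in the right positions, not just a run of digits.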

Then I built object test2.

test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)

This worked for 84 lines, then this happened.

It may have done something, but as you later discovered, my first code for the pattern was incorrect. I had tested it (and pasted in the results of the test). The way to refer to a capture group is with backslashes before the numbers, not forward slashes. Try this:


> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> newvec
 [1] "2016-07-01,02:50:35,<john>,hey"
 [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
 [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
 [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
 [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
 [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
 [7] "2016-07-01,02:54:17,<john>,just know it's london"
 [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
 [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."


I made note of the fact that the 10th and 11th lines had no commas.


test2[84]
[1] "2016-06-28 21:12:43 *** John Doe ended a video chat"

That line didn't contain a "<", so it wasn't matched.


You could remove all non-matching lines for the pattern of

dates<space>times<space>"<"<name>">"<space><anything>


with:


chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
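Putting the filter and the corrected substitution together on a tiny sample (a sketch; `chrvec` stands in for the vector you got from `readLines`):

```r
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-06-28 21:12:43 *** John Doe ended a video chat",
            "2016-07-01 02:58:56 <jone>")

# keep only lines shaped like date<space>time<space><name><space>comment;
# this drops both the "***" line and the empty-comment lines
keep   <- grepl("^.{10} .{8} <.+> .+$", chrvec)
chrvec <- chrvec[keep]

newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
newvec
# [1] "2016-07-01,02:50:35,<john>,hey"
```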


Do read:

?read.csv

?regex


--

David


test2[85]
[1] "//1,//2,//3,//4"
test[85]
[1] "2016-07-01 02:50:35 <John Doe> hey"

Notice how I toggled back and forth between test and test2 there. So,
whatever happened with the regex, it happened in the switch from 84 to
85, I guess. It went on like

[990] "//1,//2,//3,//4"
  [991] "//1,//2,//3,//4"
  [992] "//1,//2,//3,//4"
  [993] "//1,//2,//3,//4"
  [994] "//1,//2,//3,//4"
  [995] "//1,//2,//3,//4"
  [996] "//1,//2,//3,//4"
  [997] "//1,//2,//3,//4"
  [998] "//1,//2,//3,//4"
  [999] "//1,//2,//3,//4"
[1000] "//1,//2,//3,//4"

up until line 1000, then I reached max.print.

Michael

On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:

On 5/16/19 12:30 PM, Michael Boulineau wrote:
Thanks for this tip on etiquette, David. I will be sure and not do that again.

I tried read.fwf (it turns out to live in the utils package, not foreign), with code like this:

   d <- read.fwf("hangouts-conversation.txt",
                  widths= c(10,10,20,40),
                  col.names=c("date","time","person","comment"),
                  strip.white=TRUE)

But it threw this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
    line 6347 did not have 4 elements

So what does line 6347 look like? (Use `readLines` and print it out.)

Interestingly, though, the error only happened when I increased the
width size. But I had to increase the size, or else I couldn't "see"
anything: the comment column was so narrow that nothing was being
captured by it, so to speak.

It seems like what's throwing me is that there's no comma that
demarcates the end of the text proper. For example:
Not sure why you thought there should be a comma. Lines usually end
with a <cr> and/or an <lf>.


Once you have the raw text in a character vector from `readLines` named,
say, 'chrvec', then you could selectively substitute commas for spaces
with regex. (Now that you no longer desire to remove the dates and times.)

sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)

This will not do any replacements when the pattern is not matched. See
this test:


  > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
  > newvec
   [1] "2016-07-01,02:50:35,<john>,hey"
   [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
   [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
   [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
   [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
   [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
   [7] "2016-07-01,02:54:17,<john>,just know it's london"
   [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
   [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."


You should probably remove the "empty comment" lines.


--

David.

2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
lots of Starbucks in my day2016-07-01 15:35:47

It was interesting, too, when I pasted the text into the email, it
self-formatted into the way I wanted it to look. I had to manually
make it look like it does above, since that's the way that it looks in
the txt file. I wonder if it's being organized by XML or something.

Anyway, there's always a space between the two sideways carets, just
like there is right now: <John Doe> See. Space. And there's always a
space between the date and time. Like this: 2016-07-01 15:34:30 See.
Space. But there's never a space between the end of the comment and
the next date. Like this: We were in a starbucks2016-07-01 15:35:02
See. starbucks and 2016 are smooshed together.

This code is also on the table right now too.

a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
                quote = "\"", comment.char = "", fill = TRUE)

h <- cbind(hangouts.conversation2[, 1:2],
           hangouts.conversation2[, 3:5],
           hangouts.conversation2[, 6:9])

aa <- gsub("[^[:digit:]]", "", h)
my.data.num <- as.numeric(str_extract(h, "[0-9]+"))  # str_extract needs library(stringr)

Those last lines are a work in progress. I wish I could import a
picture of what it looks like when it's translated into a data frame.
The fill=TRUE helped to get the data into a table that kind of sort of
works, but the comments keep bleeding into the date and time columns.
It's like

2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
over               there
2016-07-01 15:59:27 <Jane Doe> It confuses me :(

And then, maybe, the "seriously" will be in a column all to itself, as
will be the "I've" and the "never", etc.
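For what it's worth, here is a sketch that keeps the dates and times as their own columns: base R's `strcapture()` (in utils since R 3.3) applies a regex with capture groups and returns a data frame directly, so nothing bleeds across columns. The column names are my own choice, not anything from the file:

```r
chrvec <- c("2016-07-01 02:50:35 <John Doe> hey",
            "2016-07-01 02:51:26 <Jane Doe> waiting for plane to Edinburgh")

# one capture group per desired column; non-matching lines become NA rows
d <- strcapture("^(.{10}) (.{8}) <(.+)> (.+)$", chrvec,
                proto = data.frame(date    = character(),
                                   time    = character(),
                                   person  = character(),
                                   comment = character(),
                                   stringsAsFactors = FALSE))
d$person    # "John Doe" "Jane Doe"
d$comment   # "hey" "waiting for plane to Edinburgh"
```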

I will use a regular expression if I have to, but it would be nice to
keep the dates and times on there. Originally, I thought they were
meaningless, but I've since changed my mind on that count. The time of
day isn't so important. But, especially since Gmail itself knows how to
quickly recognize this format, I know it can be done. I know this data
has structure to it.

Michael



On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
On 5/15/19 4:07 PM, Michael Boulineau wrote:
I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <john> hey
2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
2016-07-01 02:51:45 <john> thinking about my boo
2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
2016-07-01 02:54:17 <john> just know it's london
2016-07-01 02:56:44 <jane> you are probably asleep
2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <jone>
2016-07-01 02:59:34 <jane>
2016-07-01 03:02:48 <john> British security is a little more rigorous...
Looks entirely not-"crazy". Typical log file format.

Two possibilities: 1) Use `read.fwf` (in the utils package); 2) Use
regex (i.e. the `sub` function) to strip everything up to the "<". Read
`?regex`. Since "<" is not a metacharacter, you could use the pattern
".+<" and replace with "".
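A small sketch of that second approach (with one adjustment: `"^[^<]*<"` stops at the first "<" on the line and the replacement puts the "<" back, so the speaker tag survives):

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

# strip everything before the first "<", keeping the "<" itself
sub("^[^<]*<", "<", x)
# [1] "<john> hey"
# [2] "<jane> waiting for plane to Edinburgh"
```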

And do read the Posting Guide. Cross-posting to StackOverflow and R-help,
at least within hours of each other, is considered poor manners.


--

David.

It goes on for a while. It's a big file. But I feel like it's going to
be difficult to annotate with the coreNLP library or package. I'm
doing natural language processing. In other words, I'm curious as to
how I would shave off the dates, that is, to make it look like:

<john> hey
<jane> waiting for plane to Edinburgh
<john> thinking about my boo
<jane> nothing crappy has happened, not really
<john> plane went by pretty fast, didn't sleep
<jane> no idea what time it is or where I am really
<john> just know it's london
<jane> you are probably asleep
<jane> I hope fish was fishy in a good eay
<jone>
<jane>
<john> British security is a little more rigorous...

To be clear, then, I'm trying to clean a large text file by writing a
regular expression, such that I create a new object with no numbers or
dates.

Michael

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.