It appears to have worked, although there were three little quirks: the close(con); rm(con) step didn't work for me; the first row of the data.frame was all NAs when all was said and done; and there were still three *** on the same line where the  was apparently deleted.
> a <- readLines("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
> head(c)
[1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
[3] "2016-01-27 09:15:20 <Jane Doe> Hey "
[4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
[5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
[6] "2016-01-27 21:26:57 <John Doe> ended a video chat"

You can see those three *** there. But no .

> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> e <- data.frame(date = character(),
+                 time = character(),
+                 name = character(),
+                 text = character(),
+                 stringsAsFactors = TRUE)
> f <- strcapture(d, c, e)
> f <- f[-c(1), ]

When I look at the data.frame, it looks like [1] above, the one with the three ***, was deleted--surely because of the second regex (the one where I made object d above).

Thanks for your help, everyone. I really learned a lot. The first thing I'm going to do is continue to study regular expressions, although I already have a much better sense of them than when I started. But before I do anything else, I'm going to study the regex in this particular code. For example, I'm still not sure why there has to be the second \\w+ in the (\\w+ \\w+). Little things like that.
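[On the question above about the second \\w+: \\w matches word characters but not spaces, so a two-word name needs two \\w+ runs with a literal space between them. A small sketch, using one of the lines above:]

```r
# A single \w+ stops at the space and captures only the first name;
# "\w+ \w+" captures both words of a two-word name.
x <- "2016-01-27 09:14:40 *** Jane Doe started a video chat"

sub("^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+) ", "\\1<\\2> ", x)
# "2016-01-27 09:14:40 <Jane> Doe started a video chat"

sub("^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) ", "\\1<\\2> ", x)
# "2016-01-27 09:14:40 <Jane Doe> started a video chat"
```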
Michael

On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> This works for me:
>
> # sample data
> c <- character()
> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
> c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/"
> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat"
> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat"
>
> # regex ^(year) (time) <(word word)>\\s*(string)$
> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> proto <- data.frame(date = character(),
>                     time = character(),
>                     name = character(),
>                     text = character(),
>                     stringsAsFactors = TRUE)
> d <- strcapture(patt, c, proto)
>
>         date     time     name                                text
> 1 2016-01-27 09:14:40 Jane Doe                started a video chat
> 2 2016-01-27 09:15:20 Jane Doe  https://lh3.googleusercontent.com/
> 3 2016-01-27 09:15:20 Jane Doe                                 Hey
> 4 2016-01-27 09:15:22 John Doe                  ended a video chat
> 5 2016-01-27 21:07:11 Jane Doe                started a video chat
> 6 2016-01-27 21:26:57 John Doe                  ended a video chat
>
> B.
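[A likely reason the first line kept its *** (and its ): a UTF-8 byte order mark at the front of the file means line 1 does not start with a digit, so the "^" anchor in the gsub pattern never matches there. A minimal sketch, with "\ufeff" standing in for the BOM as R sees it:]

```r
# The invisible BOM character at the start of line 1 defeats the "^" anchor.
bom  <- "\ufeff"
line <- paste0(bom, "2016-01-27 09:14:40 *** Jane Doe started a video chat")
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

grepl(patt, line)                      # FALSE: the BOM precedes the date
grepl(patt, sub("^\ufeff", "", line))  # TRUE once the BOM is stripped
```

Reading the file with encoding = "UTF-8-BOM", as Jeff suggests further down the thread, avoids the problem at the source.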
> >
> > On 2019-05-18, at 18:32, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >
> > Going back and thinking through what Boris and William were saying (also Ivan), I tried this:
> >
> > a <- readLines("hangouts-conversation-6.csv.txt")
> > b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> > c <- gsub(b, "\\1<\\2> ", a)
> >> head(c)
> > [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> > [2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> > [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> > [4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
> > [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> > [6] "2016-01-27 21:26:57 <John Doe> ended a video chat"
> >
> > The  is still there, since I forgot to do what Ivan had suggested, namely,
> >
> > a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8")); close(con); rm(con)
> >
> > But then the new code is still turning out only NAs when I apply strcapture(). This was what happened next:
> >
> >> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> > + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> > + c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> > + What=""))
> >> head(d)
> >   When  Who What
> > 1 <NA> <NA> <NA>
> > 2 <NA> <NA> <NA>
> > 3 <NA> <NA> <NA>
> > 4 <NA> <NA> <NA>
> > 5 <NA> <NA> <NA>
> > 6 <NA> <NA> <NA>
> >
> > I've been reading up on regular expressions, too, so this code seems spot on. What's going wrong?
> >
> > Michael
> >
> > On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
> >>
> >> Don't start putting in extra commas and then reading this as csv. That approach is broken.
> >> The correct approach is what Bill outlined: read everything with readLines(), and then use a proper regular expression with strcapture().
> >>
> >> You need to pre-process the object that readLines() gives you: replace the contents of the videochat lines, and make it conform to the format of the other lines before you process it into your data frame.
> >>
> >> Approximately something like
> >>
> >> # read the raw data
> >> tmp <- readLines("hangouts-conversation-6.csv.txt")
> >>
> >> # process all video chat lines
> >> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (year time )*** (word word)
> >> tmp <- gsub(patt, "\\1<\\2> ", tmp)
> >>
> >> # next, use strcapture()
> >>
> >> Note that this makes the assumption that your names are always exactly two words containing only letters. If that assumption is not true, more thought needs to go into the regex. But you can test that:
> >>
> >> patt <- " <\\w+ \\w+> "  # " <word word> "
> >> sum( ! grepl(patt, tmp))
> >>
> >> ... will give the number of lines that remain in your file that do not have a tag that can be interpreted as "Who".
> >>
> >> Once that is fine, use Bill's approach - or a regular expression of your own design - to create your data frame.
> >>
> >> Hope this helps,
> >> Boris
> >>
> >>
> >>> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >>>
> >>> Very interesting. I'm sure I'll be trying to get rid of the byte order mark eventually. But right now, I'm more worried about getting the character vector into either a csv file or a data.frame; that way, I can work with the data neatly tabulated into four columns: date, time, person, comment. I assume it's a write.csv function, but I don't know what arguments to put in it. header=FALSE? fill=T?
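[On the write.csv question quoted above: once strcapture() has produced the data frame, no special arguments are needed beyond row.names = FALSE; header= and fill= belong to the read functions. A sketch with a small stand-in data frame:]

```r
# A stand-in for the data.frame that strcapture() would produce.
f <- data.frame(date = c("2016-01-27", "2016-01-27"),
                time = c("09:15:20", "09:15:22"),
                name = c("Jane Doe", "John Doe"),
                text = c("Hey", "ended a video chat"),
                stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")
write.csv(f, out, row.names = FALSE)  # the header row is written automatically
g <- read.csv(out, stringsAsFactors = FALSE)
identical(g$name, f$name)             # the four columns round-trip
```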
> >>>
> >>> Michael
> >>>
> >>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >>>>
> >>>> If byte order mark is the issue then you can specify the file encoding as "UTF-8-BOM" and it won't show up in your data any more.
> >>>>
> >>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help@r-project.org> wrote:
> >>>>> The pattern I gave worked for the lines that you originally showed from the data file ('a'), before you put commas into them. If the name is either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to something like "(<[^>]*>|[*]{3})".
> >>>>>
> >>>>> The "" at the start of the imported data may come from the byte order mark that Windows apps like to put at the front of a text file in UTF-8 or UTF-16 format.
> >>>>>
> >>>>> Bill Dunlap
> >>>>> TIBCO Software
> >>>>> wdunlap tibco.com
> >>>>>
> >>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >>>>>
> >>>>>> This seemed to work:
> >>>>>>
> >>>>>>> a <- readLines("hangouts-conversation-6.csv.txt")
> >>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> >>>>>>> b[1:84]
> >>>>>>
> >>>>>> And the first 85 lines look like this:
> >>>>>>
> >>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> >>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>
> >>>>>> Then they transition to the commas:
> >>>>>>
> >>>>>>> b[84:100]
> >>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> >>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
> >>>>>>
> >>>>>> Even the strange bit on line 6347 was caught by this:
> >>>>>>
> >>>>>>> b[6346:6348]
> >>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> >>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> >>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
> >>>>>>
> >>>>>> Perhaps most awesomely, the code catches spaces that are interposed into the comment itself:
> >>>>>>
> >>>>>>> b[4]
> >>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> >>>>>>> b[85]
> >>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>
> >>>>>> Notice whether there is a space after the "hey" or not.
> >>>>>>
> >>>>>> These are the first two lines:
> >>>>>>
> >>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>>> [2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> >>>>>>
> >>>>>> So, who knows what happened with the  at the beginning of [1] directly above. But notice how there are no commas in [1] but they appear in [2]. I don't see why really long ones like [2] directly above would be a problem, were they to be translated into a csv or data frame column.
> >>>>>>
> >>>>>> Now, with the commas in there, couldn't we write this into a csv or a data.frame? Some of this data will end up being garbage, I imagine. Like in [2] directly above. Or with [83] and [84] at the top of this discussion post/email. Embarrassingly, I've been trying to convert this into a data.frame or csv but I can't manage to. I've been using the write.csv function, but I don't think I've been getting the arguments correct.
> >>>>>>
> >>>>>> At the end of the day, I would like a data.frame and/or csv with the following four columns: date, time, person, comment.
> >>>>>>
> >>>>>> I tried this, too:
> >>>>>>
> >>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>>> + What=""))
> >>>>>>
> >>>>>> But all I got was this:
> >>>>>>
> >>>>>>> c[1:100, ]
> >>>>>>   When  Who What
> >>>>>> 1 <NA> <NA> <NA>
> >>>>>> 2 <NA> <NA> <NA>
> >>>>>> 3 <NA> <NA> <NA>
> >>>>>> 4 <NA> <NA> <NA>
> >>>>>> 5 <NA> <NA> <NA>
> >>>>>> 6 <NA> <NA> <NA>
> >>>>>>
> >>>>>> It seems to have caught nothing.
> >>>>>>
> >>>>>>> unique(c)
> >>>>>>   When  Who What
> >>>>>> 1 <NA> <NA> <NA>
> >>>>>>
> >>>>>> But I like that it converted into columns. That's a really great format. With a little tweaking, it'd be great code for this data set.
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help <r-help@r-project.org> wrote:
> >>>>>>>
> >>>>>>> Consider using readLines() and strcapture() for reading such a file. E.g., suppose readLines(files) produced a character vector like
> >>>>>>>
> >>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
> >>>>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
> >>>>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
> >>>>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
> >>>>>>>
> >>>>>>> Then you can make a data.frame with columns When, Who, and What by supplying a pattern containing three parenthesized capture expressions:
> >>>>>>>
> >>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>>>> What=""))
> >>>>>>>> str(z)
> >>>>>>> 'data.frame': 4 obs. of 3 variables:
> >>>>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
> >>>>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
> >>>>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
> >>>>>>>
> >>>>>>> Lines that don't match the pattern result in NA's - you might make a second pass over the corresponding elements of x with a new pattern.
> >>>>>>>
> >>>>>>> You can convert the When column from character to time with as.POSIXct().
> >>>>>>>
> >>>>>>> Bill Dunlap
> >>>>>>> TIBCO Software
> >>>>>>> wdunlap tibco.com
> >>>>>>>
> >>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>
> >>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> >>>>>>>>> OK. So, I named the object test and then checked the 6347th item
> >>>>>>>>>
> >>>>>>>>>> test <- readLines("hangouts-conversation.txt")
> >>>>>>>>>> test[6347]
> >>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >>>>>>>>>
> >>>>>>>>> Perhaps where it was getting screwed up is, since the end of this is a number (8242), then, given that there's no space between the number and what ought to be the next row, R didn't know where to draw the line. Sure enough, it looks like this when I go to the original file and control-F "#8242":
> >>>>>>>>>
> >>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
> >>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
> >>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
> >>>>>>>>
> >>>>>>>> An octothorpe is an end of line signifier and is interpreted as allowing comments. You can prevent that interpretation with suitable choice of parameters to `read.table` or `read.csv`.
> >>>>>>>> I don't understand why that should cause an error or a failure to match that pattern.
> >>>>>>>>
> >>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>
> >>>>>>>>> Again, it doesn't look like that in the file. Gmail automatically formats it like that when I paste it in. More to the point, it looks like
> >>>>>>>>>
> >>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>
> >>>>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>>>
> >>>>>>>>> Then I built object test2.
> >>>>>>>>>
> >>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >>>>>>>>>
> >>>>>>>>> This worked for 84 lines, then this happened.
> >>>>>>>>
> >>>>>>>> It may have done something, but as you later discovered my first code for the pattern was incorrect. I had tested it (and pasted in the results of the test). The way to refer to a capture class is with back-slashes before the numbers, not forward-slashes. Try this:
> >>>>>>>>
> >>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>> newvec
> >>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>
> >>>>>>>> I made note of the fact that the 10th and 11th lines had no commas.
> >>>>>>>>
> >>>>>>>>>> test2[84]
> >>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>
> >>>>>>>> That line didn't have any "<" so wasn't matched.
> >>>>>>>>
> >>>>>>>> You could remove all non-matching lines for a pattern of
> >>>>>>>>
> >>>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
> >>>>>>>>
> >>>>>>>> with:
> >>>>>>>>
> >>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
> >>>>>>>>
> >>>>>>>> Do read:
> >>>>>>>>
> >>>>>>>> ?read.csv
> >>>>>>>>
> >>>>>>>> ?regex
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>>>> test2[85]
> >>>>>>>>> [1] "//1,//2,//3,//4"
> >>>>>>>>>> test[85]
> >>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
> >>>>>>>>>
> >>>>>>>>> Notice how I toggled back and forth between test and test2 there. So, whatever happened with the regex, it happened in the switch from 84 to 85, I guess. It went on like
> >>>>>>>>>
> >>>>>>>>>  [990] "//1,//2,//3,//4"
> >>>>>>>>>  [991] "//1,//2,//3,//4"
> >>>>>>>>>  [992] "//1,//2,//3,//4"
> >>>>>>>>>  [993] "//1,//2,//3,//4"
> >>>>>>>>>  [994] "//1,//2,//3,//4"
> >>>>>>>>>  [995] "//1,//2,//3,//4"
> >>>>>>>>>  [996] "//1,//2,//3,//4"
> >>>>>>>>>  [997] "//1,//2,//3,//4"
> >>>>>>>>>  [998] "//1,//2,//3,//4"
> >>>>>>>>>  [999] "//1,//2,//3,//4"
> >>>>>>>>> [1000] "//1,//2,//3,//4"
> >>>>>>>>>
> >>>>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>>>
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and not do that again.
> >>>>>>>>>>>
> >>>>>>>>>>> I tried the read.fwf from the foreign package, with code like this:
> >>>>>>>>>>>
> >>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
> >>>>>>>>>>>               widths = c(10,10,20,40),
> >>>>>>>>>>>               col.names = c("date","time","person","comment"),
> >>>>>>>>>>>               strip.white = TRUE)
> >>>>>>>>>>>
> >>>>>>>>>>> But it threw this error:
> >>>>>>>>>>>
> >>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >>>>>>>>>>>   line 6347 did not have 4 elements
> >>>>>>>>>>
> >>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
> >>>>>>>>>>
> >>>>>>>>>>> Interestingly, though, the error only happened when I increased the width size. But I had to increase the size, or else I couldn't "see" anything. The comment was so small that nothing was being captured by the size of the column, so to speak.
> >>>>>>>>>>>
> >>>>>>>>>>> It seems like what's throwing me is that there's no comma that demarcates the end of the text proper. For example:
> >>>>>>>>>>
> >>>>>>>>>> Not sure why you thought there should be a comma. Lines usually end with a <cr> and/or a <lf>.
> >>>>>>>>>>
> >>>>>>>>>> Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)
> >>>>>>>>>>
> >>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>>>>>>>>>
> >>>>>>>>>> This will not do any replacements when the pattern is not matched.
> >>>>>>>>>> See this test:
> >>>>>>>>>>
> >>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>>>> newvec
> >>>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>>>
> >>>>>>>>>> You should probably remove the "empty comment" lines.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> David.
> >>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47
> >>>>>>>>>>>
> >>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way that it looks in the txt file.
> >>>>>>>>>>> I wonder if it's being organized by XML or something.
> >>>>>>>>>>>
> >>>>>>>>>>> Anyways, there's always a space between the two sideways carrots, just like there is right now: <John Doe> See. Space. And there's always a space between the date and time. Like this: 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed together.
> >>>>>>>>>>>
> >>>>>>>>>>> This code is also on the table right now too.
> >>>>>>>>>>>
> >>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt", quote="\"", comment.char="", fill=TRUE)
> >>>>>>>>>>>
> >>>>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5], hangouts.conversation2[,6:9])
> >>>>>>>>>>>
> >>>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
> >>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>>>
> >>>>>>>>>>> Those last lines are a work in progress. I wish I could import a picture of what it looks like when it's translated into a data frame. The fill=TRUE helped to get the data in a table that kind of sort of works, but the comments keep bleeding into the date and time columns. It's like
> >>>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
> >>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>>>
> >>>>>>>>>>> And then, maybe, the "Seriously" will be in a column all to itself, as will be the "I've" and the "never" etc.
> >>>>>>>>>>>
> >>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count. The time of day isn't so important. But, especially since, say, Gmail itself knows how to quickly recognize what it is, I know it can be done. I know this data has structure to it.
> >>>>>>>>>>>
> >>>>>>>>>>> Michael
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) Use regex (i.e. the sub function) to strip everything up to the "<". Read `?regex`. Since that's not a metacharacter you could use a pattern ".+<" and replace with "".
> >>>>>>>>>>>>
> >>>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp, at least within hours of each other, is considered poor manners.
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>>
> >>>>>>>>>>>> David.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <john> hey
> >>>>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>> <jone>
> >>>>>>>>>>>>> <jane>
> >>>>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a regular expression, such that I create a new object with no numbers or dates.
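[One caution about the ".+<" suggestion quoted above: replacing ".+<" with "" also deletes the "<" itself. Capturing the "<" keeps the name tags intact. A small sketch on the sample lines:]

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

sub(".+<", "", x)       # "john> hey" ...  : the "<" is consumed too
sub(".+(<)", "\\1", x)  # "<john> hey" ... : the capture group preserves it
```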
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ______________________________________________
> >>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>> --
> >>>> Sent from my phone. Please excuse my brevity.