It appears to have worked, although there were three little quirks: the close(con); rm(con) step didn't work for me; the first row of the data.frame was all NAs when all was said and done; and there were still three *** on the same line where the  was apparently deleted.
> a <- readLines("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
> head(c)
[1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
[3] "2016-01-27 09:15:20 <Jane Doe> Hey "
[4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
[5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
[6] "2016-01-27 21:26:57 <John Doe> ended a video chat"

You can see those three *** there. But no .

> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> e <- data.frame(date = character(),
+                 time = character(),
+                 name = character(),
+                 text = character(),
+                 stringsAsFactors = TRUE)
> f <- strcapture(d, c, e)
> f <- f[-c(1), ]

When I look at the data.frame, it looks like [1] above, the one with the three ***, was deleted--surely because of the second regex (the one where I made object d above).

Thanks for your help, everyone. I really learned a lot. The first thing I'm going to do is continue to study regular expressions, although I already have a much better sense of them than when I started. But before I do anything else, I'm going to study the regex in this particular code. For example, I'm still not sure why there has to be the second \\w+ in the (\\w+ \\w+). Little things like that.
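[On the question above about the second \\w+: \\w matches word characters but not spaces, so a two-word name needs two \\w+ runs with a literal space between them. A small sketch, using one of the lines above:]

```r
# A single \w+ stops at the space and captures only the first name;
# "\w+ \w+" captures both words of a two-word name.
x <- "2016-01-27 09:14:40 *** Jane Doe started a video chat"

sub("^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+) ", "\\1<\\2> ", x)
# "2016-01-27 09:14:40 <Jane> Doe started a video chat"

sub("^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) ", "\\1<\\2> ", x)
# "2016-01-27 09:14:40 <Jane Doe> started a video chat"
```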
Michael

On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> This works for me:
>
> # sample data
> c <- character()
> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
> c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/"
> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat"
> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat"
>
> # regex ^(year) (time) <(word word)>\\s*(string)$
> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> proto <- data.frame(date = character(),
>                     time = character(),
>                     name = character(),
>                     text = character(),
>                     stringsAsFactors = TRUE)
> d <- strcapture(patt, c, proto)
>
>         date     time     name                                text
> 1 2016-01-27 09:14:40 Jane Doe                started a video chat
> 2 2016-01-27 09:15:20 Jane Doe  https://lh3.googleusercontent.com/
> 3 2016-01-27 09:15:20 Jane Doe                                 Hey
> 4 2016-01-27 09:15:22 John Doe                  ended a video chat
> 5 2016-01-27 21:07:11 Jane Doe                started a video chat
> 6 2016-01-27 21:26:57 John Doe                  ended a video chat
>
> B.
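[A likely reason the first line kept its *** (and its ): a UTF-8 byte order mark at the front of the file means line 1 does not start with a digit, so the "^" anchor in the gsub pattern never matches there. A minimal sketch, with "\ufeff" standing in for the BOM as R sees it:]

```r
# The invisible BOM character at the start of line 1 defeats the "^" anchor.
bom  <- "\ufeff"
line <- paste0(bom, "2016-01-27 09:14:40 *** Jane Doe started a video chat")
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

grepl(patt, line)                      # FALSE: the BOM precedes the date
grepl(patt, sub("^\ufeff", "", line))  # TRUE once the BOM is stripped
```

Reading the file with encoding = "UTF-8-BOM", as Jeff suggests further down the thread, avoids the problem at the source.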
> >
> > On 2019-05-18, at 18:32, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >
> > Going back and thinking through what Boris and William were saying (also Ivan), I tried this:
> >
> > a <- readLines("hangouts-conversation-6.csv.txt")
> > b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> > c <- gsub(b, "\\1<\\2> ", a)
> >> head(c)
> > [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> > [2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> > [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> > [4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
> > [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> > [6] "2016-01-27 21:26:57 <John Doe> ended a video chat"
> >
> > The  is still there, since I forgot to do what Ivan had suggested, namely,
> >
> > a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8")); close(con); rm(con)
> >
> > But then the new code is still turning out only NAs when I apply strcapture(). This was what happened next:
> >
> >> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> > + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> > + c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> > + What=""))
> >> head(d)
> >   When  Who What
> > 1 <NA> <NA> <NA>
> > 2 <NA> <NA> <NA>
> > 3 <NA> <NA> <NA>
> > 4 <NA> <NA> <NA>
> > 5 <NA> <NA> <NA>
> > 6 <NA> <NA> <NA>
> >
> > I've been reading up on regular expressions, too, so this code seems spot on. What's going wrong?
> >
> > Michael
> >
> > On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
> >>
> >> Don't start putting in extra commas and then reading this as csv. That approach is broken.
> >> The correct approach is what Bill outlined: read everything with readLines(), and then use a proper regular expression with strcapture().
> >>
> >> You need to pre-process the object that readLines() gives you: replace the contents of the videochat lines, and make it conform to the format of the other lines before you process it into your data frame.
> >>
> >> Approximately something like
> >>
> >> # read the raw data
> >> tmp <- readLines("hangouts-conversation-6.csv.txt")
> >>
> >> # process all video chat lines
> >> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (year time )*** (word word)
> >> tmp <- gsub(patt, "\\1<\\2> ", tmp)
> >>
> >> # next, use strcapture()
> >>
> >> Note that this makes the assumption that your names are always exactly two words containing only letters. If that assumption is not true, more thought needs to go into the regex. But you can test that:
> >>
> >> patt <- " <\\w+ \\w+> "  # " <word word> "
> >> sum( ! grepl(patt, tmp))
> >>
> >> ... will give the number of lines that remain in your file that do not have a tag that can be interpreted as "Who".
> >>
> >> Once that is fine, use Bill's approach - or a regular expression of your own design - to create your data frame.
> >>
> >> Hope this helps,
> >> Boris
> >>
> >>
> >>> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >>>
> >>> Very interesting. I'm sure I'll be trying to get rid of the byte order mark eventually. But right now, I'm more worried about getting the character vector into either a csv file or a data.frame; that way, I can work with the data neatly tabulated into four columns: date, time, person, comment. I assume it's a write.csv function, but I don't know what arguments to put in it. header=FALSE? fill=T?
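[On the write.csv question quoted above: once strcapture() has produced the data frame, no special arguments are needed beyond row.names = FALSE; header= and fill= belong to the read functions. A sketch with a small stand-in data frame:]

```r
# A stand-in for the data.frame that strcapture() would produce.
f <- data.frame(date = c("2016-01-27", "2016-01-27"),
                time = c("09:15:20", "09:15:22"),
                name = c("Jane Doe", "John Doe"),
                text = c("Hey", "ended a video chat"),
                stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")
write.csv(f, out, row.names = FALSE)  # the header row is written automatically
g <- read.csv(out, stringsAsFactors = FALSE)
identical(g$name, f$name)             # the four columns round-trip
```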
> >>>
> >>> Michael
> >>>
> >>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >>>>
> >>>> If byte order mark is the issue then you can specify the file encoding as "UTF-8-BOM" and it won't show up in your data any more.
> >>>>
> >>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help@r-project.org> wrote:
> >>>>> The pattern I gave worked for the lines that you originally showed from the data file ('a'), before you put commas into them. If the name is either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to something like "(<[^>]*>|[*]{3})".
> >>>>>
> >>>>> The "" at the start of the imported data may come from the byte order mark that Windows apps like to put at the front of a text file in UTF-8 or UTF-16 format.
> >>>>>
> >>>>> Bill Dunlap
> >>>>> TIBCO Software
> >>>>> wdunlap tibco.com
> >>>>>
> >>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >>>>>
> >>>>>> This seemed to work:
> >>>>>>
> >>>>>>> a <- readLines("hangouts-conversation-6.csv.txt")
> >>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> >>>>>>> b[1:84]
> >>>>>>
> >>>>>> And the first 85 lines look like this:
> >>>>>>
> >>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> >>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>
> >>>>>> Then they transition to the commas:
> >>>>>>
> >>>>>>> b[84:100]
> >>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> >>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
> >>>>>>
> >>>>>> Even the strange bit on line 6347 was caught by this:
> >>>>>>
> >>>>>>> b[6346:6348]
> >>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> >>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> >>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
> >>>>>>
> >>>>>> Perhaps most awesomely, the code catches spaces that are interposed into the comment itself:
> >>>>>>
> >>>>>>> b[4]
> >>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> >>>>>>> b[85]
> >>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>
> >>>>>> Notice whether there is a space after the "hey" or not.
> >>>>>>
> >>>>>> These are the first two lines:
> >>>>>>
> >>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>>> [2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> >>>>>>
> >>>>>> So, who knows what happened with the  at the beginning of [1] directly above. But notice how there are no commas in [1] but they appear in [2]. I don't see why really long ones like [2] directly above would be a problem, were they to be translated into a csv or data frame column.
> >>>>>>
> >>>>>> Now, with the commas in there, couldn't we write this into a csv or a data.frame? Some of this data will end up being garbage, I imagine. Like in [2] directly above. Or with [83] and [84] at the top of this discussion post/email. Embarrassingly, I've been trying to convert this into a data.frame or csv but I can't manage to. I've been using the write.csv function, but I don't think I've been getting the arguments correct.
> >>>>>>
> >>>>>> At the end of the day, I would like a data.frame and/or csv with the following four columns: date, time, person, comment.
> >>>>>>
> >>>>>> I tried this, too:
> >>>>>>
> >>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>>> + What=""))
> >>>>>>
> >>>>>> But all I got was this:
> >>>>>>
> >>>>>>> c[1:100, ]
> >>>>>>   When  Who What
> >>>>>> 1 <NA> <NA> <NA>
> >>>>>> 2 <NA> <NA> <NA>
> >>>>>> 3 <NA> <NA> <NA>
> >>>>>> 4 <NA> <NA> <NA>
> >>>>>> 5 <NA> <NA> <NA>
> >>>>>> 6 <NA> <NA> <NA>
> >>>>>>
> >>>>>> It seems to have caught nothing.
> >>>>>>
> >>>>>>> unique(c)
> >>>>>>   When  Who What
> >>>>>> 1 <NA> <NA> <NA>
> >>>>>>
> >>>>>> But I like that it converted into columns. That's a really great format. With a little tweaking, it'd be great code for this data set.
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help <r-help@r-project.org> wrote:
> >>>>>>>
> >>>>>>> Consider using readLines() and strcapture() for reading such a file. E.g., suppose readLines(files) produced a character vector like
> >>>>>>>
> >>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
> >>>>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
> >>>>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
> >>>>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
> >>>>>>>
> >>>>>>> Then you can make a data.frame with columns When, Who, and What by supplying a pattern containing three parenthesized capture expressions:
> >>>>>>>
> >>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>>>> What=""))
> >>>>>>>> str(z)
> >>>>>>> 'data.frame': 4 obs. of 3 variables:
> >>>>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
> >>>>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
> >>>>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
> >>>>>>>
> >>>>>>> Lines that don't match the pattern result in NA's - you might make a second pass over the corresponding elements of x with a new pattern.
> >>>>>>>
> >>>>>>> You can convert the When column from character to time with as.POSIXct().
> >>>>>>>
> >>>>>>> Bill Dunlap
> >>>>>>> TIBCO Software
> >>>>>>> wdunlap tibco.com
> >>>>>>>
> >>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>
> >>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> >>>>>>>>> OK. So, I named the object test and then checked the 6347th item
> >>>>>>>>>
> >>>>>>>>>> test <- readLines("hangouts-conversation.txt")
> >>>>>>>>>> test[6347]
> >>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >>>>>>>>>
> >>>>>>>>> Perhaps where it was getting screwed up is, since the end of this is a number (8242), then, given that there's no space between the number and what ought to be the next row, R didn't know where to draw the line. Sure enough, it looks like this when I go to the original file and control-F "#8242":
> >>>>>>>>>
> >>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
> >>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
> >>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
> >>>>>>>>
> >>>>>>>> An octothorpe is an end of line signifier and is interpreted as allowing comments. You can prevent that interpretation with suitable choice of parameters to `read.table` or `read.csv`.
> >>>>>>>> I don't understand why that should cause an error or a failure to match that pattern.
> >>>>>>>>
> >>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>
> >>>>>>>>> Again, it doesn't look like that in the file. Gmail automatically formats it like that when I paste it in. More to the point, it looks like
> >>>>>>>>>
> >>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>
> >>>>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>>>
> >>>>>>>>> Then I built object test2.
> >>>>>>>>>
> >>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >>>>>>>>>
> >>>>>>>>> This worked for 84 lines, then this happened.
> >>>>>>>>
> >>>>>>>> It may have done something, but as you later discovered my first code for the pattern was incorrect. I had tested it (and pasted in the results of the test). The way to refer to a capture class is with back-slashes before the numbers, not forward-slashes. Try this:
> >>>>>>>>
> >>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>> newvec
> >>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>
> >>>>>>>> I made note of the fact that the 10th and 11th lines had no commas.
> >>>>>>>>
> >>>>>>>>>> test2[84]
> >>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>
> >>>>>>>> That line didn't have any "<" so wasn't matched.
> >>>>>>>>
> >>>>>>>> You could remove all non-matching lines for a pattern of
> >>>>>>>>
> >>>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
> >>>>>>>>
> >>>>>>>> with:
> >>>>>>>>
> >>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
> >>>>>>>>
> >>>>>>>> Do read:
> >>>>>>>>
> >>>>>>>> ?read.csv
> >>>>>>>>
> >>>>>>>> ?regex
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>>>> test2[85]
> >>>>>>>>> [1] "//1,//2,//3,//4"
> >>>>>>>>>> test[85]
> >>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
> >>>>>>>>>
> >>>>>>>>> Notice how I toggled back and forth between test and test2 there. So, whatever happened with the regex, it happened in the switch from 84 to 85, I guess. It went on like
> >>>>>>>>>
> >>>>>>>>>  [990] "//1,//2,//3,//4"
> >>>>>>>>>  [991] "//1,//2,//3,//4"
> >>>>>>>>>  [992] "//1,//2,//3,//4"
> >>>>>>>>>  [993] "//1,//2,//3,//4"
> >>>>>>>>>  [994] "//1,//2,//3,//4"
> >>>>>>>>>  [995] "//1,//2,//3,//4"
> >>>>>>>>>  [996] "//1,//2,//3,//4"
> >>>>>>>>>  [997] "//1,//2,//3,//4"
> >>>>>>>>>  [998] "//1,//2,//3,//4"
> >>>>>>>>>  [999] "//1,//2,//3,//4"
> >>>>>>>>> [1000] "//1,//2,//3,//4"
> >>>>>>>>>
> >>>>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>>>
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and not do that again.
> >>>>>>>>>>>
> >>>>>>>>>>> I tried the read.fwf from the foreign package, with code like this:
> >>>>>>>>>>>
> >>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
> >>>>>>>>>>>               widths = c(10,10,20,40),
> >>>>>>>>>>>               col.names = c("date","time","person","comment"),
> >>>>>>>>>>>               strip.white = TRUE)
> >>>>>>>>>>>
> >>>>>>>>>>> But it threw this error:
> >>>>>>>>>>>
> >>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >>>>>>>>>>>   line 6347 did not have 4 elements
> >>>>>>>>>>
> >>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
> >>>>>>>>>>
> >>>>>>>>>>> Interestingly, though, the error only happened when I increased the width size. But I had to increase the size, or else I couldn't "see" anything. The comment was so small that nothing was being captured by the size of the column, so to speak.
> >>>>>>>>>>>
> >>>>>>>>>>> It seems like what's throwing me is that there's no comma that demarcates the end of the text proper. For example:
> >>>>>>>>>>
> >>>>>>>>>> Not sure why you thought there should be a comma. Lines usually end with a <cr> and/or a <lf>.
> >>>>>>>>>>
> >>>>>>>>>> Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)
> >>>>>>>>>>
> >>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>>>>>>>>>
> >>>>>>>>>> This will not do any replacements when the pattern is not matched.
> >>>>>>>>>> See this test:
> >>>>>>>>>>
> >>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>>>> newvec
> >>>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>>>
> >>>>>>>>>> You should probably remove the "empty comment" lines.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> David.
> >>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47
> >>>>>>>>>>>
> >>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way that it looks in the txt file.
> >>>>>>>>>>> I wonder if it's being organized by XML or something.
> >>>>>>>>>>>
> >>>>>>>>>>> Anyways, there's always a space between the two sideways carrots, just like there is right now: <John Doe> See. Space. And there's always a space between the date and time. Like this: 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed together.
> >>>>>>>>>>>
> >>>>>>>>>>> This code is also on the table right now too.
> >>>>>>>>>>>
> >>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt", quote="\"", comment.char="", fill=TRUE)
> >>>>>>>>>>>
> >>>>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5], hangouts.conversation2[,6:9])
> >>>>>>>>>>>
> >>>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
> >>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>>>
> >>>>>>>>>>> Those last lines are a work in progress. I wish I could import a picture of what it looks like when it's translated into a data frame. The fill=TRUE helped to get the data in a table that kind of sort of works, but the comments keep bleeding into the date and time columns. It's like
> >>>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
> >>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>>>
> >>>>>>>>>>> And then, maybe, the "Seriously" will be in a column all to itself, as will be the "I've" and the "never" etc.
> >>>>>>>>>>>
> >>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count. The time of day isn't so important. But, especially since, say, Gmail itself knows how to quickly recognize what it is, I know it can be done. I know this data has structure to it.
> >>>>>>>>>>>
> >>>>>>>>>>> Michael
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) Use regex (i.e. the sub function) to strip everything up to the "<". Read `?regex`. Since that's not a metacharacter you could use a pattern ".+<" and replace with "".
> >>>>>>>>>>>>
> >>>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp, at least within hours of each other, is considered poor manners.
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>>
> >>>>>>>>>>>> David.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <john> hey
> >>>>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>> <jone>
> >>>>>>>>>>>>> <jane>
> >>>>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a regular expression, such that I create a new object with no numbers or dates.
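[One caution about the ".+<" suggestion quoted above: replacing ".+<" with "" also deletes the "<" itself. Capturing the "<" keeps the name tags intact. A small sketch on the sample lines:]

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

sub(".+<", "", x)       # "john> hey" ...  : the "<" is consumed too
sub(".+(<)", "\\1", x)  # "<john> hey" ... : the capture group preserves it
```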
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ______________________________________________
> >>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>> --
> >>>> Sent from my phone. Please excuse my brevity.