This works for me:

# sample data
c <- character()
c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/"
c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat"
c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat"
c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat"
# regex: ^(date) (time) <(word word)>\s*(string)$
patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = TRUE)
d <- strcapture(patt, c, proto)

        date     time     name                               text
1 2016-01-27 09:14:40 Jane Doe               started a video chat
2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/
3 2016-01-27 09:15:20 Jane Doe                               Hey 
4 2016-01-27 09:15:22 John Doe                 ended a video chat
5 2016-01-27 21:07:11 Jane Doe               started a video chat
6 2016-01-27 21:26:57 John Doe                 ended a video chat

B.

> On 2019-05-18, at 18:32, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>
> Going back and thinking through what Boris and William were saying
> (also Ivan), I tried this:
>
> a <- readLines("hangouts-conversation-6.csv.txt")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
>
>> head(c)
> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> [2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat"
>
> The  is still there, since I forgot to do what Ivan had suggested, namely,
>
> a <- readLines(con <- file("hangouts-conversation-6.csv.txt",
>                            encoding = "UTF-8")); close(con); rm(con)
>
> But then the new code is still turning out only NAs when I apply
> strcapture().
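The all-NA result is consistent with a UTF-8 byte-order mark still glued to the front of the first line: an invisible BOM character defeats any `^`-anchored pattern. A minimal sketch of the failure and the fix, using made-up sample strings (the commented `readLines()` call uses the file name and the "UTF-8-BOM" encoding mentioned later in the thread):

```r
# Sketch: a leftover byte-order mark ("\ufeff") on line 1 defeats "^"-anchored
# patterns, which is one way strcapture() can return nothing but NAs.
x <- c("\ufeff2016-01-27 09:14:40 *** Jane Doe started a video chat",
       "2016-01-27 09:15:20 <Jane Doe> Hey ")
patt <- "^[0-9-]{10} [0-9:]{8} "
grepl(patt, x)               # FALSE TRUE: line 1 fails only because of the BOM

x <- sub("^\ufeff", "", x)   # strip it after the fact ...
grepl(patt, x)               # TRUE TRUE: ... and both lines match

# ... or avoid it at read time:
# a <- readLines(con <- file("hangouts-conversation-6.csv.txt",
#                            encoding = "UTF-8-BOM")); close(con)
```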
> This was what happened next:
>
>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> + c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> + What=""))
>> head(d)
>   When  Who What
> 1 <NA> <NA> <NA>
> 2 <NA> <NA> <NA>
> 3 <NA> <NA> <NA>
> 4 <NA> <NA> <NA>
> 5 <NA> <NA> <NA>
> 6 <NA> <NA> <NA>
>
> I've been reading up on regular expressions, too, so this code seems
> spot on. What's going wrong?
>
> Michael
>
> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>>
>> Don't start putting in extra commas and then reading this as csv. That
>> approach is broken. The correct approach is what Bill outlined: read
>> everything with readLines(), and then use a proper regular expression with
>> strcapture().
>>
>> You need to pre-process the object that readLines() gives you: replace the
>> contents of the video chat lines, and make them conform to the format of the
>> other lines before you process them into your data frame.
>>
>> Approximately something like
>>
>> # read the raw data
>> tmp <- readLines("hangouts-conversation-6.csv.txt")
>>
>> # process all video chat lines
>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (date time )*** (word word)
>> tmp <- gsub(patt, "\\1<\\2> ", tmp)
>>
>> # next, use strcapture()
>>
>> Note that this makes the assumption that your names are always exactly two
>> words containing only letters. If that assumption is not true, more thought
>> needs to go into the regex. But you can test that:
>>
>> patt <- " <\\w+ \\w+> "  # " <word word> "
>> sum( ! grepl(patt, tmp))
>>
>> ... will give the number of lines that remain in your file that do not have
>> a tag that can be interpreted as "Who".
>>
>> Once that is fine, use Bill's approach - or a regular expression of your own
>> design - to create your data frame.
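Boris's two steps can be run end to end on a couple of sample strings (made up here to stand in for the real file):

```r
# Sketch of the pre-processing Boris describes: rewrite "*** Name" video-chat
# lines into the "<Name>" form, then count lines still lacking a <Who> tag.
tmp <- c("2016-01-27 09:14:40 *** Jane Doe started a video chat",
         "2016-01-27 09:15:20 <Jane Doe> Hey ")
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "
tmp <- gsub(patt, "\\1<\\2> ", tmp)
tmp[1]                               # now "... <Jane Doe> started a video chat"
sum( ! grepl(" <\\w+ \\w+> ", tmp))  # 0: every line has an interpretable tag
```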
>> Hope this helps,
>> Boris
>>
>>> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>>>
>>> Very interesting. I'm sure I'll be trying to get rid of the byte order
>>> mark eventually. But right now, I'm more worried about getting the
>>> character vector into either a csv file or a data.frame; that way, I can
>>> work with the data neatly tabulated into four columns: date, time,
>>> person, comment. I assume it's a write.csv function, but I don't know
>>> what arguments to put in it. header=FALSE? fill=TRUE?
>>>
>>> Michael
>>>
>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>>>
>>>> If a byte order mark is the issue then you can specify the file encoding
>>>> as "UTF-8-BOM" and it won't show up in your data any more.
>>>>
>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help@r-project.org> wrote:
>>>>> The pattern I gave worked for the lines that you originally showed from
>>>>> the data file ('a'), before you put commas into them. If the name is
>>>>> either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be
>>>>> changed to something like "(<[^>]*>|[*]{3})".
>>>>>
>>>>> The " " at the start of the imported data may come from the byte order
>>>>> mark that Windows apps like to put at the front of a text file in UTF-8
>>>>> or UTF-16 format.
>>>>> Bill Dunlap
>>>>> TIBCO Software
>>>>> wdunlap tibco.com
>>>>>
>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>>>>>
>>>>>> This seemed to work:
>>>>>>
>>>>>>> a <- readLines("hangouts-conversation-6.csv.txt")
>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
>>>>>>> b[1:84]
>>>>>>
>>>>>> And the first 84 lines look like this:
>>>>>>
>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>
>>>>>> Then they transition to the commas:
>>>>>>
>>>>>>> b[84:100]
>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>>>>>>
>>>>>> Even the strange bit on line 6347 was caught by this:
>>>>>>
>>>>>>> b[6346:6348]
>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>>>>>>
>>>>>> Perhaps most awesomely, the code catches spaces that are interposed
>>>>>> into the comment itself:
>>>>>>
>>>>>>> b[4]
>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
>>>>>>> b[85]
>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>>
>>>>>> Notice whether there is a space after the "hey" or not.
>>>>>>
>>>>>> These are the first two lines:
>>>>>>
>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>>>>> [2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
>>>>>>
>>>>>> So, who knows what happened with the  at the beginning of [1]
>>>>>> directly above. But notice how there are no commas in [1] but they
>>>>>> appear in [2].
>>>>>> I don't see why really long ones like [2] directly above would be a
>>>>>> problem, were they to be translated into a csv or data frame column.
>>>>>>
>>>>>> Now, with the commas in there, couldn't we write this into a csv or a
>>>>>> data.frame? Some of this data will end up being garbage, I imagine.
>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of this
>>>>>> discussion post/email. Embarrassingly, I've been trying to convert
>>>>>> this into a data.frame or csv but I can't manage to. I've been using
>>>>>> the write.csv function, but I don't think I've been getting the
>>>>>> arguments correct.
>>>>>>
>>>>>> At the end of the day, I would like a data.frame and/or csv with the
>>>>>> following four columns: date, time, person, comment.
>>>>>>
>>>>>> I tried this, too:
>>>>>>
>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
>>>>>> + What=""))
>>>>>>
>>>>>> But all I got was this:
>>>>>>
>>>>>>> c[1:100, ]
>>>>>>   When  Who What
>>>>>> 1 <NA> <NA> <NA>
>>>>>> 2 <NA> <NA> <NA>
>>>>>> 3 <NA> <NA> <NA>
>>>>>> 4 <NA> <NA> <NA>
>>>>>> 5 <NA> <NA> <NA>
>>>>>> 6 <NA> <NA> <NA>
>>>>>>
>>>>>> It seems to have caught nothing.
>>>>>>
>>>>>>> unique(c)
>>>>>>   When  Who What
>>>>>> 1 <NA> <NA> <NA>
>>>>>>
>>>>>> But I like that it converted into columns. That's a really great
>>>>>> format. With a little tweaking, it'd be great code for this data set.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help <r-help@r-project.org> wrote:
>>>>>>>
>>>>>>> Consider using readLines() and strcapture() for reading such a file.
>>>>>>> E.g., suppose readLines(files) produced a character vector like
>>>>>>>
>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
>>>>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
>>>>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
>>>>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
>>>>>>>
>>>>>>> Then you can make a data.frame with columns When, Who, and What by
>>>>>>> supplying a pattern containing three parenthesized capture expressions:
>>>>>>>
>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>>>     x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="", What=""))
>>>>>>>> str(z)
>>>>>>> 'data.frame': 4 obs. of 3 variables:
>>>>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
>>>>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
>>>>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
>>>>>>>
>>>>>>> Lines that don't match the pattern result in NA's - you might make a
>>>>>>> second pass over the corresponding elements of x with a new pattern.
>>>>>>>
>>>>>>> You can convert the When column from character to time with as.POSIXct().
>>>>>>>
>>>>>>> Bill Dunlap
>>>>>>> TIBCO Software
>>>>>>> wdunlap tibco.com
>>>>>>>
>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>
>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
>>>>>>>>> OK.
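Bill's closing note above, converting the captured When column with as.POSIXct(), can be sketched with two of the values from his example (the format string and time zone are assumptions, not part of his post):

```r
# Sketch: once strcapture() has filled a character "When" column,
# as.POSIXct() turns it into a real date-time that supports arithmetic.
z <- data.frame(When = c("2016-10-21 10:35:36", "2016-10-21 10:56:29"),
                stringsAsFactors = FALSE)
z$When <- as.POSIXct(z$When, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(z$When)   # "POSIXct" "POSIXt"
diff(z$When)    # the gap between the two messages
```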
>>>>>>>>> So, I named the object test and then checked the 6347th item:
>>>>>>>>>
>>>>>>>>>> test <- readLines("hangouts-conversation.txt")
>>>>>>>>>> test[6347]
>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>>>>>>>>>
>>>>>>>>> Perhaps where it was getting screwed up is, since the end of this is a
>>>>>>>>> number (8242), then, given that there's no space between the number
>>>>>>>>> and what ought to be the next row, R didn't know where to draw the
>>>>>>>>> line. Sure enough, it looks like this when I go to the original file
>>>>>>>>> and Ctrl-F "#8242":
>>>>>>>>>
>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
>>>>>>>>
>>>>>>>> An octothorpe ("#") is interpreted as the start of a comment that runs
>>>>>>>> to the end of the line. You can prevent that interpretation with a
>>>>>>>> suitable choice of parameters to `read.table` or `read.csv`
>>>>>>>> (comment.char = ""). I don't understand why that should cause an error
>>>>>>>> or a failure to match that pattern.
>>>>>>>>
>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>
>>>>>>>>> Again, it doesn't look like that in the file. Gmail automatically
>>>>>>>>> formats it like that when I paste it in. More to the point, it looks
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>
>>>>>>>>> Notice Admit#82422016. So there's that.
>>>>>>>>>
>>>>>>>>> Then I built object test2.
>>>>>>>>>
>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>>>>>>>>>
>>>>>>>>> This worked for 84 lines, then this happened.
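The octothorpe point above is easy to demonstrate; a sketch using one line from the file shown earlier (`read.table(text = ...)` stands in for reading the real file):

```r
# Sketch: read.table()'s default comment.char = "#" truncates a field like
# "Admit#8242"; comment.char = "" keeps the "#" literal.
txt <- "2016-10-21 10:56:37 <John Doe> Admit#8242"
d1 <- read.table(text = txt, stringsAsFactors = FALSE)
d2 <- read.table(text = txt, comment.char = "", stringsAsFactors = FALSE)
d1$V5   # "Admit"      -- everything after "#" was dropped
d2$V5   # "Admit#8242" -- kept intact
```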
>>>>>>>> It may have done something, but as you later discovered my first code
>>>>>>>> for the pattern was incorrect. I had tested it (and pasted in the
>>>>>>>> results of the test). The way to refer to a capture group is with
>>>>>>>> back-slashes before the numbers, not forward-slashes. Try this:
>>>>>>>>
>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>> newvec
>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>
>>>>>>>> I made note of the fact that the 10th and 11th lines had no commas.
>>>>>>>>
>>>>>>>>>> test2[84]
>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>>>
>>>>>>>> That line didn't have any "<" so wasn't matched.
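The backslash-versus-forward-slash point above is worth seeing side by side; a sketch on one sample line:

```r
# Sketch: in a sub() replacement, "//1" is just literal text; capture groups
# are recalled with backslashes, written "\\1" in R source code.
s <- "2016-07-01 02:50:35 <John Doe> hey"
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", s)
# [1] "//1,//2,//3,//4"            -- the whole match replaced by literal text
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", s)
# [1] "2016-07-01,02:50:35,<John Doe>,hey"
```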
>>>>>>>> You could remove all non-matching lines for a pattern of
>>>>>>>>
>>>>>>>> date<space>time<space>"<"name">"<space>anything
>>>>>>>>
>>>>>>>> with:
>>>>>>>>
>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
>>>>>>>>
>>>>>>>> Do read:
>>>>>>>>
>>>>>>>> ?read.csv
>>>>>>>> ?regex
>>>>>>>>
>>>>>>>> --
>>>>>>>> David
>>>>>>>>
>>>>>>>>>> test2[85]
>>>>>>>>> [1] "//1,//2,//3,//4"
>>>>>>>>>> test[85]
>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
>>>>>>>>>
>>>>>>>>> Notice how I toggled back and forth between test and test2 there. So,
>>>>>>>>> whatever happened with the regex, it happened in the switch from 84 to
>>>>>>>>> 85, I guess. It went on like
>>>>>>>>>
>>>>>>>>>  [990] "//1,//2,//3,//4"
>>>>>>>>>  [991] "//1,//2,//3,//4"
>>>>>>>>>  [992] "//1,//2,//3,//4"
>>>>>>>>>  [993] "//1,//2,//3,//4"
>>>>>>>>>  [994] "//1,//2,//3,//4"
>>>>>>>>>  [995] "//1,//2,//3,//4"
>>>>>>>>>  [996] "//1,//2,//3,//4"
>>>>>>>>>  [997] "//1,//2,//3,//4"
>>>>>>>>>  [998] "//1,//2,//3,//4"
>>>>>>>>>  [999] "//1,//2,//3,//4"
>>>>>>>>> [1000] "//1,//2,//3,//4"
>>>>>>>>>
>>>>>>>>> up until line 1000, when I reached max.print.
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>>>
>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure not to do
>>>>>>>>>>> that again.
>>>>>>>>>>> I tried read.fwf (from the utils package), with code like this:
>>>>>>>>>>>
>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
>>>>>>>>>>>               widths = c(10,10,20,40),
>>>>>>>>>>>               col.names = c("date","time","person","comment"),
>>>>>>>>>>>               strip.white = TRUE)
>>>>>>>>>>>
>>>>>>>>>>> But it threw this error:
>>>>>>>>>>>
>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>>>>>>>>>>>   line 6347 did not have 4 elements
>>>>>>>>>>
>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
>>>>>>>>>>
>>>>>>>>>>> Interestingly, though, the error only happened when I increased the
>>>>>>>>>>> width size. But I had to increase the size, or else I couldn't "see"
>>>>>>>>>>> anything. The comment was so small that nothing was being captured by
>>>>>>>>>>> the width of the column, so to speak.
>>>>>>>>>>>
>>>>>>>>>>> It seems like what's throwing me is that there's no comma that
>>>>>>>>>>> demarcates the end of the text proper. For example:
>>>>>>>>>>
>>>>>>>>>> Not sure why you thought there should be a comma. Lines usually end
>>>>>>>>>> with a <cr> and/or a <lf>.
>>>>>>>>>>
>>>>>>>>>> Once you have the raw text in a character vector from `readLines` named,
>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas for spaces
>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates and times.)
>>>>>>>>>>
>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>>>>>>>>>>
>>>>>>>>>> This will not do any replacements when the pattern is not matched.
>>>>>>>>>> See this test:
>>>>>>>>>>
>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>>>> newvec
>>>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>>>
>>>>>>>>>> You should probably remove the "empty comment" lines.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> David.
>>>>>>>>>>
>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
>>>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>>>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>>>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>>>>>>
>>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it
>>>>>>>>>>> self-formatted into the way I wanted it to look. I had to manually
>>>>>>>>>>> make it look like it does above, since that's the way that it looks in
>>>>>>>>>>> the txt file. I wonder if it's being organized by XML or something.
>>>>>>>>>>> Anyways, there's always a space between the two sideways carets, just
>>>>>>>>>>> like there is right now: <John Doe> See. Space. And there's always a
>>>>>>>>>>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
>>>>>>>>>>> Space. But there's never a space between the end of the comment and
>>>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>>>>>>>>>>> See. starbucks and 2016 are smooshed together.
>>>>>>>>>>>
>>>>>>>>>>> This code is also on the table right now too.
>>>>>>>>>>>
>>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
>>>>>>>>>>>                 quote="\"", comment.char="", fill=TRUE)
>>>>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5],
>>>>>>>>>>>            hangouts.conversation2[,6:9])
>>>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>>>>>>>>>
>>>>>>>>>>> Those last lines are a work in progress. I wish I could import a
>>>>>>>>>>> picture of what it looks like when it's translated into a data frame.
>>>>>>>>>>> The fill=TRUE helped to get the data in a table that kind of sort of
>>>>>>>>>>> works, but the comments keep bleeding into the date and time columns.
>>>>>>>>>>> It's like
>>>>>>>>>>>
>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>>>>>>
>>>>>>>>>>> And then, maybe, the "seriously" will be in a column all to itself, as
>>>>>>>>>>> will be the "I've" and the "never" etc.
>>>>>>>>>>>
>>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to
>>>>>>>>>>> keep the dates and times on there. Originally, I thought they were
>>>>>>>>>>> meaningless, but I've since changed my mind on that count.
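The "smooshed" records described above can be pulled apart again by marking every date-time stamp with a newline and splitting on it. A sketch, assuming every record starts with "YYYY-MM-DD HH:MM:SS" (sample string made up from the thread's examples):

```r
# Sketch: re-split records glued together at "...starbucks2016-07-01 ..."
# by inserting a newline before every date-time stamp.
s <- "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
marked  <- gsub("([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})",
                "\n\\1", s)
records <- strsplit(marked, "\n")[[1]]
records <- records[nzchar(records)]   # drop the empty piece before record 1
records
# [1] "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks"
# [2] "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
```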
>>>>>>>>>>> The time of day isn't so important. But, especially since, say, Gmail
>>>>>>>>>>> itself knows how to quickly recognize what it is, I know it can be
>>>>>>>>>>> done. I know this data has structure to it.
>>>>>>>>>>>
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>>>>>>
>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Two possibilities: 1) use `read.fwf` (in the utils package); 2) use
>>>>>>>>>>>> regex (i.e. the sub function) to strip everything up to the "<". Read
>>>>>>>>>>>> `?regex`. Since "<" is not a metacharacter you could use the pattern
>>>>>>>>>>>> ".+<" and replace with "<".
>>>>>>>>>>>>
>>>>>>>>>>>> And do read the Posting Guide.
Cross-posting to >>>>> StackOverflow and >>>>>>>> Rhelp, >>>>>>>>>>>> at least within hours of each, is considered poor manners. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> David. >>>>>>>>>>>> >>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like >>>>> it's >>>>>> going >>>>>>>> to >>>>>>>>>>>>> be difficult to annotate with the coreNLP library or >>>>> package. I'm >>>>>>>>>>>>> doing natural language processing. In other words, I'm >>>>> curious >>>>>> as to >>>>>>>>>>>>> how I would shave off the dates, that is, to make it look >>>>> like: >>>>>>>>>>>>> >>>>>>>>>>>>> <john> hey >>>>>>>>>>>>> <jane> waiting for plane to Edinburgh >>>>>>>>>>>>> <john> thinking about my boo >>>>>>>>>>>>> <jane> nothing crappy has happened, not really >>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep >>>>>>>>>>>>> <jane> no idea what time it is or where I am really >>>>>>>>>>>>> <john> just know it's london >>>>>>>>>>>>> <jane> you are probably asleep >>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay >>>>>>>>>>>>> <jone> >>>>>>>>>>>>> <jane> >>>>>>>>>>>>> <john> British security is a little more rigorous... >>>>>>>>>>>>> >>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by >>>>>> writing a >>>>>>>>>>>>> regular expression? such that I create a new object with no >>>>>> numbers >>>>>>>> or >>>>>>>>>>>>> dates. >>>>>>>>>>>>> >>>>>>>>>>>>> Michael >>>>>>>>>>>>> >>>>>>>>>>>>> ______________________________________________ >>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and >>>>> more, >>>>>> see >>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>>>>>>>> PLEASE do read the posting guide >>>>>>>> http://www.R-project.org/posting-guide.html >>>>>>>>>>>>> and provide commented, minimal, self-contained, >>>>> reproducible >>>>>> code. 