This works for me:

# sample data
c <- character()
c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/"
c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat"
c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat"
c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat"
# regex: ^(date) (time) <(word word)>\s*(string)$
patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = TRUE)
d <- strcapture(patt, c, proto)

        date     time     name                               text
1 2016-01-27 09:14:40 Jane Doe               started a video chat
2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/
3 2016-01-27 09:15:20 Jane Doe                               Hey 
4 2016-01-27 09:15:22 John Doe                 ended a video chat
5 2016-01-27 21:07:11 Jane Doe               started a video chat
6 2016-01-27 21:26:57 John Doe                 ended a video chat

B.

> On 2019-05-18, at 18:32, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>
> Going back and thinking through what Boris and William were saying
> (also Ivan), I tried this:
>
> a <- readLines("hangouts-conversation-6.csv.txt")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
>
>> head(c)
> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> [2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat"
>
> The  is still there, since I forgot to do what Ivan had suggested, namely,
>
> a <- readLines(con <- file("hangouts-conversation-6.csv.txt",
>                            encoding = "UTF-8")); close(con); rm(con)
>
> But then the new code is still turning out only NAs when I apply
> strcapture().
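The all-NA result is consistent with a UTF-8 byte-order mark still glued to the front of the first line: an invisible BOM character defeats any `^`-anchored pattern. A minimal sketch of the failure and the fix, using made-up sample strings (the commented `readLines()` call uses the file name and the "UTF-8-BOM" encoding mentioned later in the thread):

```r
# Sketch: a leftover byte-order mark ("\ufeff") on line 1 defeats "^"-anchored
# patterns, which is one way strcapture() can return nothing but NAs.
x <- c("\ufeff2016-01-27 09:14:40 *** Jane Doe started a video chat",
       "2016-01-27 09:15:20 <Jane Doe> Hey ")
patt <- "^[0-9-]{10} [0-9:]{8} "
grepl(patt, x)               # FALSE TRUE: line 1 fails only because of the BOM

x <- sub("^\ufeff", "", x)   # strip it after the fact ...
grepl(patt, x)               # TRUE TRUE: ... and both lines match

# ... or avoid it at read time:
# a <- readLines(con <- file("hangouts-conversation-6.csv.txt",
#                            encoding = "UTF-8-BOM")); close(con)
```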
> This was what happened next:
>
>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> + c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> + What=""))
>> head(d)
>   When  Who What
> 1 <NA> <NA> <NA>
> 2 <NA> <NA> <NA>
> 3 <NA> <NA> <NA>
> 4 <NA> <NA> <NA>
> 5 <NA> <NA> <NA>
> 6 <NA> <NA> <NA>
>
> I've been reading up on regular expressions, too, so this code seems
> spot on. What's going wrong?
>
> Michael
>
> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>>
>> Don't start putting in extra commas and then reading this as csv. That
>> approach is broken. The correct approach is what Bill outlined: read
>> everything with readLines(), and then use a proper regular expression with
>> strcapture().
>>
>> You need to pre-process the object that readLines() gives you: replace the
>> contents of the video chat lines, and make them conform to the format of the
>> other lines before you process them into your data frame.
>>
>> Approximately something like
>>
>> # read the raw data
>> tmp <- readLines("hangouts-conversation-6.csv.txt")
>>
>> # process all video chat lines
>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (date time )*** (word word)
>> tmp <- gsub(patt, "\\1<\\2> ", tmp)
>>
>> # next, use strcapture()
>>
>> Note that this makes the assumption that your names are always exactly two
>> words containing only letters. If that assumption is not true, more thought
>> needs to go into the regex. But you can test that:
>>
>> patt <- " <\\w+ \\w+> "  # " <word word> "
>> sum( ! grepl(patt, tmp))
>>
>> ... will give the number of lines that remain in your file that do not have
>> a tag that can be interpreted as "Who".
>>
>> Once that is fine, use Bill's approach - or a regular expression of your own
>> design - to create your data frame.
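Boris's two steps can be run end to end on a couple of sample strings (made up here to stand in for the real file):

```r
# Sketch of the pre-processing Boris describes: rewrite "*** Name" video-chat
# lines into the "<Name>" form, then count lines still lacking a <Who> tag.
tmp <- c("2016-01-27 09:14:40 *** Jane Doe started a video chat",
         "2016-01-27 09:15:20 <Jane Doe> Hey ")
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "
tmp <- gsub(patt, "\\1<\\2> ", tmp)
tmp[1]                               # now "... <Jane Doe> started a video chat"
sum( ! grepl(" <\\w+ \\w+> ", tmp))  # 0: every line has an interpretable tag
```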
>> Hope this helps,
>> Boris
>>
>>> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>>>
>>> Very interesting. I'm sure I'll be trying to get rid of the byte order
>>> mark eventually. But right now, I'm more worried about getting the
>>> character vector into either a csv file or a data.frame; that way, I can
>>> work with the data neatly tabulated into four columns: date, time,
>>> person, comment. I assume it's a write.csv function, but I don't know
>>> what arguments to put in it. header=FALSE? fill=TRUE?
>>>
>>> Michael
>>>
>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>>>
>>>> If a byte order mark is the issue then you can specify the file encoding
>>>> as "UTF-8-BOM" and it won't show up in your data any more.
>>>>
>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help@r-project.org> wrote:
>>>>> The pattern I gave worked for the lines that you originally showed from
>>>>> the data file ('a'), before you put commas into them. If the name is
>>>>> either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be
>>>>> changed to something like "(<[^>]*>|[*]{3})".
>>>>>
>>>>> The " " at the start of the imported data may come from the byte order
>>>>> mark that Windows apps like to put at the front of a text file in UTF-8
>>>>> or UTF-16 format.
>>>>> Bill Dunlap
>>>>> TIBCO Software
>>>>> wdunlap tibco.com
>>>>>
>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
>>>>>
>>>>>> This seemed to work:
>>>>>>
>>>>>>> a <- readLines("hangouts-conversation-6.csv.txt")
>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
>>>>>>> b[1:84]
>>>>>>
>>>>>> And the first 84 lines look like this:
>>>>>>
>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>
>>>>>> Then they transition to the commas:
>>>>>>
>>>>>>> b[84:100]
>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>>>>>>
>>>>>> Even the strange bit on line 6347 was caught by this:
>>>>>>
>>>>>>> b[6346:6348]
>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>>>>>>
>>>>>> Perhaps most awesomely, the code catches spaces that are interposed
>>>>>> into the comment itself:
>>>>>>
>>>>>>> b[4]
>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
>>>>>>> b[85]
>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>>
>>>>>> Notice whether there is a space after the "hey" or not.
>>>>>>
>>>>>> These are the first two lines:
>>>>>>
>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>>>>> [2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
>>>>>>
>>>>>> So, who knows what happened with the  at the beginning of [1]
>>>>>> directly above. But notice how there are no commas in [1] but they
>>>>>> appear in [2].
>>>>>> I don't see why really long ones like [2] directly above would be a
>>>>>> problem, were they to be translated into a csv or data frame column.
>>>>>>
>>>>>> Now, with the commas in there, couldn't we write this into a csv or a
>>>>>> data.frame? Some of this data will end up being garbage, I imagine.
>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of this
>>>>>> discussion post/email. Embarrassingly, I've been trying to convert
>>>>>> this into a data.frame or csv but I can't manage to. I've been using
>>>>>> the write.csv function, but I don't think I've been getting the
>>>>>> arguments correct.
>>>>>>
>>>>>> At the end of the day, I would like a data.frame and/or csv with the
>>>>>> following four columns: date, time, person, comment.
>>>>>>
>>>>>> I tried this, too:
>>>>>>
>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
>>>>>> + What=""))
>>>>>>
>>>>>> But all I got was this:
>>>>>>
>>>>>>> c[1:100, ]
>>>>>>   When  Who What
>>>>>> 1 <NA> <NA> <NA>
>>>>>> 2 <NA> <NA> <NA>
>>>>>> 3 <NA> <NA> <NA>
>>>>>> 4 <NA> <NA> <NA>
>>>>>> 5 <NA> <NA> <NA>
>>>>>> 6 <NA> <NA> <NA>
>>>>>>
>>>>>> It seems to have caught nothing.
>>>>>>
>>>>>>> unique(c)
>>>>>>   When  Who What
>>>>>> 1 <NA> <NA> <NA>
>>>>>>
>>>>>> But I like that it converted into columns. That's a really great
>>>>>> format. With a little tweaking, it'd be great code for this data set.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help <r-help@r-project.org> wrote:
>>>>>>>
>>>>>>> Consider using readLines() and strcapture() for reading such a file.
>>>>>>> E.g., suppose readLines(files) produced a character vector like
>>>>>>>
>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
>>>>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
>>>>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
>>>>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
>>>>>>>
>>>>>>> Then you can make a data.frame with columns When, Who, and What by
>>>>>>> supplying a pattern containing three parenthesized capture expressions:
>>>>>>>
>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>>>     x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="", What=""))
>>>>>>>> str(z)
>>>>>>> 'data.frame': 4 obs. of 3 variables:
>>>>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
>>>>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
>>>>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
>>>>>>>
>>>>>>> Lines that don't match the pattern result in NA's - you might make a
>>>>>>> second pass over the corresponding elements of x with a new pattern.
>>>>>>>
>>>>>>> You can convert the When column from character to time with as.POSIXct().
>>>>>>>
>>>>>>> Bill Dunlap
>>>>>>> TIBCO Software
>>>>>>> wdunlap tibco.com
>>>>>>>
>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>
>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
>>>>>>>>> OK.
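Bill's closing note above, converting the captured When column with as.POSIXct(), can be sketched with two of the values from his example (the format string and time zone are assumptions, not part of his post):

```r
# Sketch: once strcapture() has filled a character "When" column,
# as.POSIXct() turns it into a real date-time that supports arithmetic.
z <- data.frame(When = c("2016-10-21 10:35:36", "2016-10-21 10:56:29"),
                stringsAsFactors = FALSE)
z$When <- as.POSIXct(z$When, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(z$When)   # "POSIXct" "POSIXt"
diff(z$When)    # the gap between the two messages
```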
>>>>>>>>> So, I named the object test and then checked the 6347th item:
>>>>>>>>>
>>>>>>>>>> test <- readLines("hangouts-conversation.txt")
>>>>>>>>>> test[6347]
>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>>>>>>>>>
>>>>>>>>> Perhaps where it was getting screwed up is, since the end of this is a
>>>>>>>>> number (8242), then, given that there's no space between the number
>>>>>>>>> and what ought to be the next row, R didn't know where to draw the
>>>>>>>>> line. Sure enough, it looks like this when I go to the original file
>>>>>>>>> and Ctrl-F "#8242":
>>>>>>>>>
>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
>>>>>>>>
>>>>>>>> An octothorpe ("#") is interpreted as the start of a comment that runs
>>>>>>>> to the end of the line. You can prevent that interpretation with a
>>>>>>>> suitable choice of parameters to `read.table` or `read.csv`
>>>>>>>> (comment.char = ""). I don't understand why that should cause an error
>>>>>>>> or a failure to match that pattern.
>>>>>>>>
>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>
>>>>>>>>> Again, it doesn't look like that in the file. Gmail automatically
>>>>>>>>> formats it like that when I paste it in. More to the point, it looks
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>
>>>>>>>>> Notice Admit#82422016. So there's that.
>>>>>>>>>
>>>>>>>>> Then I built object test2.
>>>>>>>>>
>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>>>>>>>>>
>>>>>>>>> This worked for 84 lines, then this happened.
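The octothorpe point above is easy to demonstrate; a sketch using one line from the file shown earlier (`read.table(text = ...)` stands in for reading the real file):

```r
# Sketch: read.table()'s default comment.char = "#" truncates a field like
# "Admit#8242"; comment.char = "" keeps the "#" literal.
txt <- "2016-10-21 10:56:37 <John Doe> Admit#8242"
d1 <- read.table(text = txt, stringsAsFactors = FALSE)
d2 <- read.table(text = txt, comment.char = "", stringsAsFactors = FALSE)
d1$V5   # "Admit"      -- everything after "#" was dropped
d2$V5   # "Admit#8242" -- kept intact
```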
>>>>>>>> It may have done something, but as you later discovered my first code
>>>>>>>> for the pattern was incorrect. I had tested it (and pasted in the
>>>>>>>> results of the test). The way to refer to a capture group is with
>>>>>>>> back-slashes before the numbers, not forward-slashes. Try this:
>>>>>>>>
>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>> newvec
>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>
>>>>>>>> I made note of the fact that the 10th and 11th lines had no commas.
>>>>>>>>
>>>>>>>>>> test2[84]
>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>>>
>>>>>>>> That line didn't have any "<" so wasn't matched.
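The backslash-versus-forward-slash point above is worth seeing side by side; a sketch on one sample line:

```r
# Sketch: in a sub() replacement, "//1" is just literal text; capture groups
# are recalled with backslashes, written "\\1" in R source code.
s <- "2016-07-01 02:50:35 <John Doe> hey"
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", s)
# [1] "//1,//2,//3,//4"            -- the whole match replaced by literal text
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", s)
# [1] "2016-07-01,02:50:35,<John Doe>,hey"
```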
>>>>>>>> You could remove all non-matching lines for a pattern of
>>>>>>>>
>>>>>>>> date<space>time<space>"<"name">"<space>anything
>>>>>>>>
>>>>>>>> with:
>>>>>>>>
>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
>>>>>>>>
>>>>>>>> Do read:
>>>>>>>>
>>>>>>>> ?read.csv
>>>>>>>> ?regex
>>>>>>>>
>>>>>>>> --
>>>>>>>> David
>>>>>>>>
>>>>>>>>>> test2[85]
>>>>>>>>> [1] "//1,//2,//3,//4"
>>>>>>>>>> test[85]
>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
>>>>>>>>>
>>>>>>>>> Notice how I toggled back and forth between test and test2 there. So,
>>>>>>>>> whatever happened with the regex, it happened in the switch from 84 to
>>>>>>>>> 85, I guess. It went on like
>>>>>>>>>
>>>>>>>>>  [990] "//1,//2,//3,//4"
>>>>>>>>>  [991] "//1,//2,//3,//4"
>>>>>>>>>  [992] "//1,//2,//3,//4"
>>>>>>>>>  [993] "//1,//2,//3,//4"
>>>>>>>>>  [994] "//1,//2,//3,//4"
>>>>>>>>>  [995] "//1,//2,//3,//4"
>>>>>>>>>  [996] "//1,//2,//3,//4"
>>>>>>>>>  [997] "//1,//2,//3,//4"
>>>>>>>>>  [998] "//1,//2,//3,//4"
>>>>>>>>>  [999] "//1,//2,//3,//4"
>>>>>>>>> [1000] "//1,//2,//3,//4"
>>>>>>>>>
>>>>>>>>> up until line 1000, when I reached max.print.
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>>>
>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure not to do
>>>>>>>>>>> that again.
>>>>>>>>>>> I tried read.fwf (from the utils package), with code like this:
>>>>>>>>>>>
>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
>>>>>>>>>>>               widths = c(10,10,20,40),
>>>>>>>>>>>               col.names = c("date","time","person","comment"),
>>>>>>>>>>>               strip.white = TRUE)
>>>>>>>>>>>
>>>>>>>>>>> But it threw this error:
>>>>>>>>>>>
>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>>>>>>>>>>>   line 6347 did not have 4 elements
>>>>>>>>>>
>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
>>>>>>>>>>
>>>>>>>>>>> Interestingly, though, the error only happened when I increased the
>>>>>>>>>>> width size. But I had to increase the size, or else I couldn't "see"
>>>>>>>>>>> anything. The comment was so small that nothing was being captured by
>>>>>>>>>>> the width of the column, so to speak.
>>>>>>>>>>>
>>>>>>>>>>> It seems like what's throwing me is that there's no comma that
>>>>>>>>>>> demarcates the end of the text proper. For example:
>>>>>>>>>>
>>>>>>>>>> Not sure why you thought there should be a comma. Lines usually end
>>>>>>>>>> with a <cr> and/or a <lf>.
>>>>>>>>>>
>>>>>>>>>> Once you have the raw text in a character vector from `readLines` named,
>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas for spaces
>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates and times.)
>>>>>>>>>>
>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>>>>>>>>>>
>>>>>>>>>> This will not do any replacements when the pattern is not matched.
>>>>>>>>>> See this test:
>>>>>>>>>>
>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>>>> newvec
>>>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>>>
>>>>>>>>>> You should probably remove the "empty comment" lines.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> David.
>>>>>>>>>>
>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
>>>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>>>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>>>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>>>>>>
>>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it
>>>>>>>>>>> self-formatted into the way I wanted it to look. I had to manually
>>>>>>>>>>> make it look like it does above, since that's the way that it looks in
>>>>>>>>>>> the txt file. I wonder if it's being organized by XML or something.
>>>>>>>>>>> Anyways, there's always a space between the two sideways carets, just
>>>>>>>>>>> like there is right now: <John Doe> See. Space. And there's always a
>>>>>>>>>>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
>>>>>>>>>>> Space. But there's never a space between the end of the comment and
>>>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>>>>>>>>>>> See. starbucks and 2016 are smooshed together.
>>>>>>>>>>>
>>>>>>>>>>> This code is also on the table right now too.
>>>>>>>>>>>
>>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
>>>>>>>>>>>                 quote="\"", comment.char="", fill=TRUE)
>>>>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5],
>>>>>>>>>>>            hangouts.conversation2[,6:9])
>>>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>>>>>>>>>
>>>>>>>>>>> Those last lines are a work in progress. I wish I could import a
>>>>>>>>>>> picture of what it looks like when it's translated into a data frame.
>>>>>>>>>>> The fill=TRUE helped to get the data in a table that kind of sort of
>>>>>>>>>>> works, but the comments keep bleeding into the date and time columns.
>>>>>>>>>>> It's like
>>>>>>>>>>>
>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>>>>>>
>>>>>>>>>>> And then, maybe, the "seriously" will be in a column all to itself, as
>>>>>>>>>>> will be the "I've" and the "never" etc.
>>>>>>>>>>>
>>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to
>>>>>>>>>>> keep the dates and times on there. Originally, I thought they were
>>>>>>>>>>> meaningless, but I've since changed my mind on that count.
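The "smooshed" records described above can be pulled apart again by marking every date-time stamp with a newline and splitting on it. A sketch, assuming every record starts with "YYYY-MM-DD HH:MM:SS" (sample string made up from the thread's examples):

```r
# Sketch: re-split records glued together at "...starbucks2016-07-01 ..."
# by inserting a newline before every date-time stamp.
s <- "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
marked  <- gsub("([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})",
                "\n\\1", s)
records <- strsplit(marked, "\n")[[1]]
records <- records[nzchar(records)]   # drop the empty piece before record 1
records
# [1] "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks"
# [2] "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
```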
>>>>>>>>>>> The time of day isn't so important. But, especially since, say, Gmail
>>>>>>>>>>> itself knows how to quickly recognize what it is, I know it can be
>>>>>>>>>>> done. I know this data has structure to it.
>>>>>>>>>>>
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>>>>>>
>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Two possibilities: 1) use `read.fwf` (in the utils package); 2) use
>>>>>>>>>>>> regex (i.e. the sub function) to strip everything up to the "<". Read
>>>>>>>>>>>> `?regex`. Since "<" is not a metacharacter you could use the pattern
>>>>>>>>>>>> ".+<" and replace with "<".
>>>>>>>>>>>>
>>>>>>>>>>>> And do read the Posting Guide.
Cross-posting to >>>>> StackOverflow and >>>>>>>> Rhelp, >>>>>>>>>>>> at least within hours of each, is considered poor manners. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> David. >>>>>>>>>>>> >>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like >>>>> it's >>>>>> going >>>>>>>> to >>>>>>>>>>>>> be difficult to annotate with the coreNLP library or >>>>> package. I'm >>>>>>>>>>>>> doing natural language processing. In other words, I'm >>>>> curious >>>>>> as to >>>>>>>>>>>>> how I would shave off the dates, that is, to make it look >>>>> like: >>>>>>>>>>>>> >>>>>>>>>>>>> <john> hey >>>>>>>>>>>>> <jane> waiting for plane to Edinburgh >>>>>>>>>>>>> <john> thinking about my boo >>>>>>>>>>>>> <jane> nothing crappy has happened, not really >>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep >>>>>>>>>>>>> <jane> no idea what time it is or where I am really >>>>>>>>>>>>> <john> just know it's london >>>>>>>>>>>>> <jane> you are probably asleep >>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay >>>>>>>>>>>>> <jone> >>>>>>>>>>>>> <jane> >>>>>>>>>>>>> <john> British security is a little more rigorous... >>>>>>>>>>>>> >>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by >>>>>> writing a >>>>>>>>>>>>> regular expression? such that I create a new object with no >>>>>> numbers >>>>>>>> or >>>>>>>>>>>>> dates. >>>>>>>>>>>>> >>>>>>>>>>>>> Michael >>>>>>>>>>>>> >>>>>>>>>>>>> ______________________________________________ >>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and >>>>> more, >>>>>> see >>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>>>>>>>> PLEASE do read the posting guide >>>>>>>> http://www.R-project.org/posting-guide.html >>>>>>>>>>>>> and provide commented, minimal, self-contained, >>>>> reproducible >>>>>> code. 