Re: [R] need help with excel data
The following is one way to parse your file using R (using R-3.1.2 on Windows in a US English locale). I downloaded it from Google Docs in tab-separated format. I could not get read.table() to do the job, but I don't completely understand the encoding/fileEncoding business there.

    > file <- "exampX.xlsx - examp.tsv" # the name Google Docs suggested
    > lines <- readLines(file, encoding="UTF-8")
    Warning message:
    In readLines(file, encoding = "UTF-8") :
      incomplete final line found on 'exampX.xlsx - examp.tsv'
    > fields <- strsplit(lines, "\t")
    > txt <- vapply(fields, function(x)x[2], "") # 2nd field of each line
    > nmbrs <- regmatches(txt, gregexpr("[[:digit:]]+(\\*[[:digit:]]+)*", txt))
    > lines[16:20]
    [1] "1.97\tл.а. 11 35*46 27*46" "1.61\tсамбо 9 31*36 29*45"
    [3] "1.17\tс.п. 4 37*29 39*30"  "1.54\tушу 9 31*39 30*38"
    [5] "1.73\tсамбо 6 32*39 29*39"
    > nmbrs[16:20]
    [[1]]
    [1] "11"    "35*46" "27*46"

    [[2]]
    [1] "9"     "31*36" "29*45"

    [[3]]
    [1] "4"     "37*29" "39*30"

    [[4]]
    [1] "9"     "31*39" "30*38"

    [[5]]
    [1] "6"     "32*39" "29*39"

If you want to split those x*y into x and y, you can use the pattern "[[:digit:]]+" instead of the one I used.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Jan 21, 2015 at 12:31 PM, Dr Polanski n.polyans...@gmail.com wrote:
> Hi all!
> [...]
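For anyone following the Python suggestion made elsewhere in this thread, here is a minimal sketch of the same two patterns using Python's re module. The sample string is the second field of line 16 in the output above. The variable names are made up for illustration, and \d is used in place of [[:digit:]].

```python
import re

# Second field of line 16 in the session above (Cyrillic label, then numbers).
txt = "л.а. 11 35*46 27*46"

# Same idea as the gregexpr() pattern: runs of digits optionally joined by "*".
# The group is non-capturing so that findall() returns whole matches.
grouped = re.findall(r"\d+(?:\*\d+)*", txt)

# The simpler pattern: every run of digits becomes its own match.
separate = re.findall(r"\d+", txt)

print(grouped)
print(separate)
```

The first pattern keeps each x*y pair together, while the second yields one entry per number, which is closer to the one-number-per-cell layout asked for in the original post.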
[R] need help with excel data
Hi all!

Sorry to bother you. I am trying to learn some R via Coursera courses and other internet sources, yet I haven't managed to get far. Now I need to do some (I hope) not too difficult things which I think R can do, yet I have no idea how to make it do them.

I have a big set of empirical data which was obtained by my colleagues and is stored in an inconvenient way: all of the data are in two cells of an Excel table. An example of the data is in the linked file:

https://drive.google.com/file/d/0B64YMbf_hh5BS2tzVE9WVmV3bFU/view?usp=sharing

The first column has a number, and the second has a whole vector (I guess that is what it is), which looks like «some words in Cyrillic (the length varies)» followed by a set of numbers «12*23 34*45». Another problem is that sometimes it is «12*23, 34*56». The number of rows is about 3000, so it is impossible to do this manually.

What I need at the end is to have everything separately in different Excel cells:

what is written in words | 12 | 23 | 34 | 45 |

Do you think it is possible to do this using R (or something else)?

Thank you very much in advance, and sorry for asking for help with such a stupid question. The problem is that I am trying, and yet I haven't even managed to install openSUSE onto my laptop, only Ubuntu! :)

Thank you very much!

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] need help with excel data
Dr. Polanski,

I would recommend something else. Given the messy nature of your data, I would suggest using a language like Python or Perl to extract it to an appropriate format. Python has good regular expression support and Unicode support. If you can save your data as a CSV file, or even as text line by line, then it would be possible to write some code to read the file, match the lines with a simple regular expression, and then spit them back out as a CSV file which you could read into R.

I realize that this means learning a new language or finding someone with the requisite skills, but I would recommend that over attempting to use R's text processing.

Collin.

On Wed, Jan 21, 2015 at 3:31 PM, Dr Polanski n.polyans...@gmail.com wrote:
> Hi all!
> [...]
Re: [R] need help with excel data
I think R is quite capable of doing this. You would have to learn a comparable number of fiddly bits to accomplish this in R, Python or Perl. That is not to say that learning Perl or Python is a bad idea... but in terms of shortest path I think they are of comparable complexity.

All three languages support regular expressions, which would be the key bit of knowledge to acquire regardless of which tool you use. Other fiddly bits might involve handling the Cyrillic strings as data, though you did not convey a desire to retain that information.

One way (not extracting Cyrillic text):

    library(XLConnect)
    DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
    # two x*y pairs; the lazy .*? keeps every digit of the first number,
    # and perl=TRUE is needed for .*? and for \\d inside [^...]
    pattern <- "^.*?(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$"
    # rows whose second column contains two such pairs
    idx <- grep( pattern, DF[[2]], perl=TRUE )
    # rewrite each matching cell as "x1,y1,x2,y2"
    dta <- sub( pattern, "\\1,\\2,\\3,\\4", DF[[2]][idx], perl=TRUE )
    # split on the commas and convert to a numeric matrix, one row per cell
    dtamatrix <- apply( do.call( rbind , strsplit( dta, "," ) ) , 2 , as.numeric )
    extracted <- data.frame( V1=DF[[1]][idx], dtamatrix )

On Wed, 21 Jan 2015, Collin Lynch wrote:
> Dr. Polanski, I would recommend something else.
> [...]

---
Jeff Newmiller
DCN: jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
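Since the point above is that the regular expression is the key piece whichever tool is used, here is a rough Python sketch of the same capture-group idea. The names PAIRS and extract_pairs are invented for this illustration, and the sample cells are modelled on the rows shown elsewhere in the thread, not taken from the full data set.

```python
import re

# Two "x*y" pairs; spaces around "*" are allowed, and anything that is not
# a digit (a space, or a comma plus a space) may sit between the pairs.
PAIRS = re.compile(r"(\d+) *\* *(\d+)[^\d]*(\d+) *\* *(\d+)")

def extract_pairs(cell):
    """Return the four numbers of the first two x*y pairs, or None."""
    match = PAIRS.search(cell)
    if match is None:
        return None
    return [int(group) for group in match.groups()]

print(extract_pairs("самбо 9 31*36 29*45"))
print(extract_pairs("ушу 12*23, 34*56"))
print(extract_pairs("no numbers here"))
```

Using search() rather than anchoring at the start of the string means the leading text never has to be described by the pattern, which sidesteps any question of how much a leading .* consumes.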
Re: [R] need help with excel data
Try ASAP Utilities (Home and Student edition), http://www.asap-utilities.com/index.php. Once installed, it appears as a menu in Excel. Select Columns & Rows and then #18.

If that is not helpful, then try DigDB, http://www.digdb.com/, but this one requires a subscription. It will also split columns.

You may have to do some 'cleaning' of individual cells, such as removing leading and/or trailing spaces. A lot of this can be done with the ASAP Utilities 'Text' pull-down menu.

Matthew

On 1/21/2015 3:31 PM, Dr Polanski wrote:
> Hi all!
> [...]
Re: [R] need help with excel data
Hi Dr Polanski,

I would recommend you do this in Excel, seeing as you know how to work with Excel. You can use Excel to put different parts of a cell into another cell. For example, if cell A1 is

    12*23 34*45

and you want 12 in a separate cell (say cell A2), go to cell A2 and type:

    =LEFT(A1,2)

This will extract the first 2 characters from the left. To extract 45 you would type:

    =RIGHT(A1,2)

To get 2 characters starting at position 4 you would type:

    =MID(A1,4,2)

which will give you 23.

Hope this helps.

Regards,

DR. SEAN PORTER
Scientist
South African Association for Marine Biological Research
Direct Tel: +27 (31) 328 8169
Fax: +27 (31) 328 8188
E-mail: spor...@ori.org.za
Web: www.saambr.org.za
1 King Shaka Avenue, Point, Durban 4001, KwaZulu-Natal, South Africa
PO Box 10712, Marine Parade 4056, KwaZulu-Natal, South Africa

-----Original Message-----
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Dr Polanski
Sent: 21 January 2015 10:32 PM
To: r-help@r-project.org
Subject: [R] need help with excel data

> Hi all!
> [...]
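To make the position arithmetic behind those three formulas concrete, here is a small Python sketch that mimics LEFT, MID and RIGHT with string slicing. The function names simply mirror the Excel ones and are written for this example only; the second test string is invented to show what happens when a number has a different width.

```python
def left(s, n):
    """Like Excel's LEFT(s, n): the first n characters."""
    return s[:n]

def mid(s, start, n):
    """Like Excel's MID(s, start, n); start is 1-based, as in Excel."""
    return s[start - 1:start - 1 + n]

def right(s, n):
    """Like Excel's RIGHT(s, n): the last n characters."""
    return s[-n:] if n > 0 else ""

a1 = "12*23 34*45"
print(left(a1, 2), mid(a1, 4, 2), right(a1, 2))

# With a three-digit first number the same positions no longer line up.
print(mid("123*45 67*89", 4, 2))
```

Fixed positions are the simplest option when every cell has exactly the same layout; the last line shows why they need adjusting as soon as the widths vary.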
Re: [R] need help with excel data
It is good to know R is up to the task, and I have to agree with Ista and Jeff that if you are more comfortable in R, use it. By way of comparison, the Python code would look something like what is below. You would need to tweak the regular expression (the re.match(...) call) to fit your needs, but if you are just learning Python then sticking with R might be a better choice.

Best,
Collin.

    import csv, re

    # Sheet.csv is the spreadsheet exported as CSV; Col1Name and Col2Name
    # stand for the headers of its two columns.
    In = open("Sheet.csv", "r", encoding="utf-8", newline="")
    Reader = csv.DictReader(In)
    Out = open("Out.csv", "w", encoding="utf-8", newline="")
    Writer = csv.DictWriter(Out, ["Val", "Text", "Numbers"])
    Writer.writeheader()
    for D in Reader:
        # one word of text, then a number and two x*y pairs
        Match = re.match(r"(?P<Text>\S+) (?P<Numbers>[0-9]+ [0-9]+\*[0-9]+,? [0-9]+\*[0-9]+)$", D["Col2Name"])
        if Match is None:  # skip rows the pattern does not cover
            continue
        NewDict = {}
        NewDict["Val"] = D["Col1Name"]
        NewDict["Text"] = Match.group("Text")
        NewDict["Numbers"] = Match.group("Numbers")
        Writer.writerow(NewDict)
    In.close()
    Out.close()

On Wed, Jan 21, 2015 at 9:58 PM, Ista Zahn istaz...@gmail.com wrote:
> I agree, R will be fine for this.
> [...]
Re: [R] need help with excel data
I agree, R will be fine for this. Not being as expert with regex as Jeff, I would tend to do this in a few steps, something like

    library(XLConnect)
    DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
    library(stringi)
    ## insert a marker between the text and the numbers
    txt <- stri_replace_all_regex(DF[[2]], "([^\\d]{2,})(\\d+ )", "$1|||$2")
    ## separate the text from the numbers
    stringNums <- stri_split_fixed(txt, "|||", 2, simplify = TRUE)
    ## split the numbers apart
    nums <- stri_split_regex(stringNums[, 2], "[^\\d]+", n = 5, simplify=TRUE)
    ## put it all back together
    extracted <- data.frame(DF[, 1], stringNums[, 1], apply(nums, 2, as.numeric))
    ## put the names back
    names(extracted) <- c(names(DF)[1], paste(names(DF)[2], 1:6, sep = "_"))

Best,
Ista

On Wed, Jan 21, 2015 at 8:02 PM, Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:
> I think R is quite capable of doing this.
> [...]
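The step-by-step idea (separate the text from the numbers, then split the numbers apart) can also be sketched in Python, for comparison with the Python route discussed in this thread. This is only an illustration under assumptions: split_cell is a made-up helper, the Cyrillic label is assumed to contain no digits, the first row is taken from the sample output posted in the thread, and the second row has a comma added to mimic the variant described in the original post.

```python
import csv
import io
import re

def split_cell(cell):
    """Step 1: cut the cell at the first digit. Step 2: pull out the numbers."""
    first_digit = re.search(r"\d", cell)
    if first_digit is None:            # no numbers: the whole cell is text
        return cell.strip(), []
    text = cell[:first_digit.start()].strip()
    numbers = re.findall(r"\d+", cell[first_digit.start():])
    return text, numbers

# Rows in the shape (value, messy cell); see the assumptions in the lead-in.
rows = [("1.97", "л.а. 11 35*46 27*46"), ("1.61", "самбо 9 31*36, 29*45")]

out = io.StringIO()                    # stands in for an output file
writer = csv.writer(out, lineterminator="\n")
for value, cell in rows:
    text, numbers = split_cell(cell)
    writer.writerow([value, text] + numbers)

print(out.getvalue())
```

Each output row then has the text in one field and every number in its own field, so opening the resulting CSV in Excel gives the one-value-per-cell layout that was asked for.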