[R] Text Pattern Recognition - Model
Hi,

I have a training dataset of around 70 rows with two columns:

1. PNRNo, whose values look like UT768G, CXKA, 4IOI59, 4BV7TW... (typical PNR number patterns).
2. IsPNR, a factor variable that I created myself; all of its values are 1 (true).

My first objective is to build a model on this training set that recognizes the text pattern. My second objective is to use that model to predict IsPNR for a new set of test values, e.g. Anshuk as 0 and 4EL58S as 1.

Which model would be best for recognizing this kind of pattern with decent accuracy? I tried naiveBayes, but I don't think it is doing a good job at all: it predicts all the test values as true. I suppose Bayes is not meant for this.

Regards, Anshuk Pal Chaudhuri

[[alternative HTML version deleted]]
__ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
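One thing worth noting about the setup above: with IsPNR equal to 1 in every training row, a classifier has no negative examples to learn from, which would be consistent with every test value being predicted as true. As a point of comparison, here is a minimal rule-based sketch. It assumes that a PNR is 4 to 6 characters long and uses only uppercase letters and digits; that rule is inferred from the sample values in the post, not from any airline specification, and is_pnr_like is a hypothetical helper name.

```r
# Assumed rule (inferred from the examples in the post, not an official format):
# a PNR-like value is 4 to 6 characters, each an uppercase letter or a digit.
is_pnr_like <- function(x) {
  # perl = TRUE so that [A-Z] means the ASCII range regardless of locale
  as.integer(grepl("^[A-Z0-9]{4,6}$", x, perl = TRUE))
}

test_values <- c("Anshuk", "4EL58S", "CXKA", "UT768G")
data.frame(PNRNo = test_values, IsPNR = is_pnr_like(test_values))
```

A rule like this would still accept an uppercase word such as ANSHUK, so a learned model would probably also need real non-PNR strings as negative training examples.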
[R] Doing PDF OCR with R
Hi All,

I have been trying to do OCR within R (reading PDF data that is stored as scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/ which is a very good post. Effectively there are 3 steps:

1. convert pdf to ppm (an image format)
2. convert ppm to tif ready for tesseract (using ImageMagick for convert)
3. convert tif to text file

The effective code for the above 3 steps, as per the linked post:

    lapply(myfiles, function(i){
      # convert pdf to ppm (an image format), just pages 1-10 of the PDF
      # but you can change that easily, just remove or edit the
      # -f 1 -l 10 bit in the line below
      shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
      # convert ppm to tif ready for tesseract
      shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
      # convert tif to text file
      shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
      # delete tif file
      file.remove(paste0(i, ".tif"))
    })

The first two steps work fine (although they take a good amount of time for 4 pages of a PDF; I will look into scalability later, once I know whether this works at all). While running the 3rd step, i.e.

    shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

    Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or Tesseract crashes. Any workaround or root cause analysis would be appreciated.

Regards, Anshuk Pal Chaudhuri
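One thing that may be worth trying for the three steps above is to pass each program and its arguments separately through system2(), rather than quoting a whole command line as a single string. The sketch below is not a confirmed fix for the recursion error. It reuses the executable paths and flags quoted in the post; build_ocr_steps and run_ocr are hypothetical helper names, and the sketch assumes that ImageMagick expands the *.ppm wildcard itself.

```r
# Executable paths as quoted in the post; adjust to the local installation.
pdftoppm  <- "F:/xpdf/bin64/pdftoppm.exe"
convert   <- "F:/ImageMagick-6.9.1-Q16/convert.exe"
tesseract <- "F:/Tesseract-OCR/tesseract.exe"

# Build the three steps for one PDF as (program, argument vector) pairs.
# Only file names are quoted, so paths with spaces survive intact.
build_ocr_steps <- function(pdf) {
  tif <- paste0(pdf, ".tif")
  list(
    list(cmd = pdftoppm,
         args = c("-f", "1", "-l", "10", "-r", "600", shQuote(pdf), "ocrbook")),
    list(cmd = convert,
         args = c("*.ppm", shQuote(tif))),
    list(cmd = tesseract,
         args = c(shQuote(tif), shQuote(pdf), "-l", "eng"))
  )
}

# Run the steps in order and stop at the first non-zero exit status,
# so a failing tool is reported by name instead of failing silently.
run_ocr <- function(pdf) {
  for (step in build_ocr_steps(pdf)) {
    status <- system2(step$cmd, step$args)
    if (status != 0) stop("step failed: ", step$cmd)
  }
  invisible(file.remove(paste0(pdf, ".tif")))
}
```

Calling lapply(myfiles, run_ocr) would then replace the original loop. Checking each exit status also helps to tell a Tesseract crash apart from an error raised inside R.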
Re: [R] Parsing all rows columns of a Dataframe into one column
Thanks. Another way of handling this, which I learned from another forum and thought I would share here as well:

    data.frame(Value=dat[!is.na(dat)])

Regards, Anshuk Pal Chaudhuri

-----Original Message-----
From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
Sent: 10 August 2015 12:21
To: Anshuk Pal Chaudhuri anshu...@motivitylabs.com; r-help@r-project.org
Subject: RE: Parsing all rows columns of a Dataframe into one column

Hi

Your HTML posting scrambled your question a bit. If I understand correctly, and as you want values of various types in one column, probably the easiest way would be:

    mat <- as.matrix(yourdata)
    dim(mat) <- NULL      # flatten to a vector first...
    mat <- na.omit(mat)   # ...so that na.omit drops single NA values, not whole rows
    newdata <- data.frame(mat, stringsAsFactors=FALSE)

Cheers
Petr

-----Original Message-----
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Anshuk Pal Chaudhuri
Sent: Monday, August 10, 2015 7:24 AM
To: r-help@r-project.org
Subject: [R] Parsing all rows columns of a Dataframe into one column

Hi All,

I am using R for reading certain values in a dataset. I have values in a data frame scattered across different columns and rows; some values might be NA as well. E.g. below are three columns V1, V2, V3 and their respective values.

    V1     V2     V3
    NA     NA     90
    abc    89.09  $50
    76799  NA     NA
    02:15  def    1

What I would like to do is parse this data frame and create a new data frame that omits all NA values. The new data frame would have one column, let's say a Value column. (The order in which the values appear is not an issue.)

New Data Frame (Output Required):

    Value
    abc
    76799
    02:15
    89.09
    def
    90
    $50
    1

Any help would be appreciated.

Regards, Anshuk Pal Chaudhuri
[R] Parsing all rows columns of a Dataframe into one column
Hi All,

I am using R for reading certain values in a dataset. I have values in a data frame scattered across different columns and rows; some values might be NA as well. E.g. below are three columns V1, V2, V3 and their respective values.

    V1     V2     V3
    NA     NA     90
    abc    89.09  $50
    76799  NA     NA
    02:15  def    1

What I would like to do is parse this data frame and create a new data frame that omits all NA values. The new data frame would have one column, let's say a Value column. (The order in which the values appear is not an issue.)

New Data Frame (Output Required):

    Value
    abc
    76799
    02:15
    89.09
    def
    90
    $50
    1

Any help would be appreciated.

Regards, Anshuk Pal Chaudhuri
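The request above can be sketched in a few lines of base R. The data frame below is retyped from the example in the post, with every column stored as character (an assumption, since the post does not give the column types) and the third column named V3.

```r
# Example data retyped from the post; all columns are assumed to be character.
dat <- data.frame(
  V1 = c(NA, "abc", "76799", "02:15"),
  V2 = c(NA, "89.09", NA, "def"),
  V3 = c("90", "$50", NA, "1"),
  stringsAsFactors = FALSE
)

# unlist() stacks the columns one after another into a single vector;
# the NA entries are then dropped before building the one-column data frame.
values  <- unlist(dat, use.names = FALSE)
newdata <- data.frame(Value = values[!is.na(values)], stringsAsFactors = FALSE)
newdata
```

Because the columns are stacked in order, the values come out column by column (V1, then V2, then V3), which is acceptable here since the post says the order does not matter.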
[R] Supervised Learning for Text Classification
Hi All,

The current process which I am following:

1. Reading unstructured data and understanding what each piece of text means, e.g. a phrase like JOHN SMITH is a customer name, or a phrase like CLAIM DURATION is a reason type field. Currently I do this manually (no issue here).
2. Creating a data frame with the fields Reason, Name, Category, Sub-Reason, Description. There is no boolean or numeric field (do I need to create any, based on the need in #5?).
3. Manually filling the data frame with the values read from the unstructured text. This would have approximately 50 rows, and this is my training set (no issue here).
4. The data frame from #3 is nothing but a supervised training data set.

#5 is what I need to achieve (and where I need help):

5. When I receive another file (test data), let's say to start with just a phrase **, can I predict from the trained model that it is a Reason-related phrase (based on the probability, let's say more than 70%), create a new data frame (PredictedDF), add a column (e.g. Reason) and add a row for this text under the Reason field? Then, when I receive one more text phrase which looks like a Name, add a Name column to PredictedDF and add this value under the Name column as the second row, and so on.

I was reading about RTextTools (http://www.rtexttools.com/), but in that case it has to be told which value goes with which text, and so on.

Any help would be appreciated.

Regards, Anshuk Pal Chaudhuri
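Step 5 above amounts to scoring a new phrase against each field and accepting the best field only when its probability clears a threshold. The sketch below illustrates this with a small hand-written multinomial naive Bayes in base R. It is one possible approach, not a recommendation from the thread. The training rows other than JOHN SMITH and CLAIM DURATION are invented for illustration, and tokenize, fit_nb, predict_nb and classify are hypothetical helper names.

```r
# Tiny stand-in for the hand-labelled training data frame (rows mostly invented).
train <- data.frame(
  text  = c("JOHN SMITH", "MARY JONES", "CLAIM DURATION", "CLAIM REJECTED", "PAYMENT DELAY"),
  label = c("Name", "Name", "Reason", "Reason", "Reason"),
  stringsAsFactors = FALSE
)

# Split a phrase into lower-case word tokens.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  toks[nzchar(toks)]
}

# Fit multinomial naive Bayes with add-one (Laplace) smoothing.
fit_nb <- function(text, label) {
  vocab   <- unique(unlist(lapply(text, tokenize)))
  classes <- unique(label)
  counts  <- matrix(0, length(vocab), length(classes), dimnames = list(vocab, classes))
  for (i in seq_along(text)) {
    for (tok in tokenize(text[i])) counts[tok, label[i]] <- counts[tok, label[i]] + 1
  }
  # Each column (class) is normalised by its own smoothed total.
  loglik <- log(t(t(counts + 1) / (colSums(counts) + length(vocab))))
  prior  <- as.numeric(table(factor(label, levels = classes))) / length(label)
  list(vocab = vocab, classes = classes, prior = prior, loglik = loglik)
}

# Posterior probability of each class for one phrase; unseen words are ignored.
predict_nb <- function(model, phrase) {
  toks  <- tokenize(phrase)
  toks  <- toks[toks %in% model$vocab]
  score <- log(model$prior) + colSums(model$loglik[toks, , drop = FALSE])
  post  <- exp(score - max(score))
  post  <- post / sum(post)
  names(post) <- model$classes
  post
}

# Return the best class if it reaches the threshold (70% as in the post), else NA.
classify <- function(model, phrase, threshold = 0.7) {
  post <- predict_nb(model, phrase)
  if (max(post) >= threshold) names(post)[which.max(post)] else NA_character_
}

model <- fit_nb(train$text, train$label)
classify(model, "CLAIM DURATION")
```

With only a handful of rows the probabilities are rough, so the 70% cut-off is best treated as a tuning knob. Phrases that return NA could be set aside for manual labelling and added to the training set.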
Re: [R] Itinerary Ticket Parser
Dear All,

I was trying to study something similar and found a post on Stack Overflow: http://stackoverflow.com/questions/8438903/open-source-projects-for-email-scrubbing-generating-structured-data-from-unstruc

I guess this has not really been answered yet, but I wanted to check with all the R enthusiasts whether something like this has been done or not.

Regards, Anshuk Pal Chaudhuri

-----Original Message-----
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Anshuk Pal Chaudhuri
Sent: 30 July 2015 21:16
To: r-help@r-project.org
Subject: [R] Itinerary Ticket Parser

Dear All,

I have been seeing a lot of APIs (like WorldMate, or a new product called Sift from EasilyDo) which are used for parsing emails and different itinerary tickets. Are there any packages in R which do that?

Regards, Anshuk Pal Chaudhuri
[R] Itinerary Ticket Parser
Dear All,

I have been seeing a lot of APIs (like WorldMate, or a new product called Sift from EasilyDo) which are used for parsing emails and different itinerary tickets. Are there any packages in R which do that?

Regards, Anshuk Pal Chaudhuri
[R] R Parse HTML tabular data and apply NLP
Hi All,

I have quite a few files which contain HTML tabular data. The files all have different formats, numerous nested tables and different information, and the table structure is completely different from file to file. The only common thing is that the data is in tables.

I was able to read the tables using the readHTMLTable function, e.g. one file has 23 tables and I was able to put all the data into one data frame. Obviously, the read function is not able to interpret the headers (which it is also not supposed to do), hence it creates variables like V1, V2...

Now that I have all the text in a data frame (the data is scattered across different columns), how do I use machine learning to train a model that this text (sentence, word...) means one thing and that text means another? Basically, I want automatic categorization of all the text in the data frame.

I was reading about RTextTools (http://www.rtexttools.com/), but in that case it has to be told which value goes with which text, and so on.

Any help would be appreciated.

Regards, Anshuk Pal Chaudhuri
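Any supervised categorization of the text described above needs a table of individual text cells to which labels can be attached. The sketch below shows one way to get from a list of tables to such a table in base R. The list is a hand-made stand-in for what readHTMLTable would return, its contents are invented for illustration, and the column names table, text and label are assumptions.

```r
# Stand-in for the list of data frames returned by readHTMLTable();
# the contents are invented for illustration.
tables <- list(
  data.frame(V1 = c("Name", "JOHN SMITH"), V2 = c("Reason", NA), stringsAsFactors = FALSE),
  data.frame(V1 = c("CLAIM DURATION", ""), stringsAsFactors = FALSE)
)

# Turn every non-empty, non-NA cell into one row: (table index, text).
cells <- do.call(rbind, lapply(seq_along(tables), function(k) {
  txt <- trimws(unlist(lapply(tables[[k]], as.character), use.names = FALSE))
  txt <- txt[!is.na(txt) & nzchar(txt)]
  data.frame(table = rep(k, length(txt)), text = txt, stringsAsFactors = FALSE)
}))

# Empty label column, to be filled in by hand for the rows used as training data.
cells$label <- NA_character_
cells
```

Keeping the table index alongside each cell makes it possible to trace a predicted category back to the table it came from.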