[R] Text Pattern Recognition - Model

2015-08-17 Thread Anshuk Pal Chaudhuri
Hi,

I have a training dataset with two columns and around 70 rows:

1.   PNRNo, with values like UT768G, CXKA, 4IOI59, 4BV7TW... (typical PNR 
number patterns)

2.   IsPNR, a factor variable I created myself; all of its values are 1 
(true)

My first objective is to build a model on this training set that recognizes 
the text pattern.

Second objective: the model would then be used to predict IsPNR (0 or 1) for 
a new set of test values such as Anshuk, 4EL58S...

Which model would be best for recognizing this kind of pattern with decent 
accuracy? I tried naiveBayes, but I don't think it is doing a good job at 
all: it predicts all the test values as true. I suppose Bayes is not meant 
for this.
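
A note on the symptom described above: every training row has IsPNR = 1, so 
a classifier fitted to this data only ever sees one class, which is a likely 
reason for everything being predicted as true. As a point of comparison, here 
is a minimal rule-based sketch. It assumes (this is not stated in the post) 
that a PNR is 4 to 6 upper-case letters or digits, which fits the sample 
values; the function name is_pnr_like is made up for illustration.

```r
# Assumption: a PNR is 4 to 6 characters, each an upper-case letter or digit.
# perl = TRUE keeps [A-Z] an ASCII range regardless of locale.
is_pnr_like <- function(x) {
  grepl("^[A-Z0-9]{4,6}$", x, perl = TRUE)
}

train <- c("UT768G", "CXKA", "4IOI59", "4BV7TW")  # sample values from the post
test  <- c("Anshuk", "4EL58S")                    # test values from the post

all(is_pnr_like(train))        # the rule should accept every training value
as.integer(is_pnr_like(test))  # 0/1 in the same coding as IsPNR
```

A rule like this needs no negative examples; a statistical model would 
need rows with IsPNR = 0 in the training set as well.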


Regards,
Anshuk Pal Chaudhuri


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Doing PDF OCR with R

2015-08-12 Thread Anshuk Pal Chaudhuri
Hi All,

I have been trying to do OCR within R (reading PDFs whose pages are scanned 
images). I have been reading about this at 
http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

This is a very good post.

Effectively there are 3 steps:

1.   convert pdf to ppm (an image format)
2.   convert ppm to tif ready for tesseract (using ImageMagick for convert)
3.   convert tif to text file

The code for these 3 steps, as per the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif"))
  })

The first two steps work fine (although they take a good amount of time for 
a 4-page PDF; I will look into scalability later, once I know whether this 
works at all).

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or else Tesseract crashes.

Any workaround or root cause analysis would be appreciated.
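
The post does not contain enough to pin down the root cause. One thing that 
might be worth trying, purely as a sketch, is to call the executable with 
system2() and an argument vector instead of pasting one long command string 
for shell(shQuote(...)). The helper names tesseract_args and run_tesseract 
are made up for illustration; the executable path is the one from the post, 
and page1.tif / page1 are placeholder file names.

```r
# Build the arguments for: tesseract <image> <outputbase> -l <lang>
tesseract_args <- function(tif, out_base, lang = "eng") {
  c(tif, out_base, "-l", lang)
}

# Hypothetical wrapper: run tesseract on one tif and return its exit status.
run_tesseract <- function(tif, out_base,
                          exe = "F:/Tesseract-OCR/tesseract.exe") {
  if (!file.exists(exe)) stop("tesseract not found at: ", exe)
  args <- tesseract_args(tif, out_base)
  # Quote only the two file paths, then pass everything as separate arguments.
  system2(exe, args = c(shQuote(args[1:2]), args[3:4]))
}

# Inspect the arguments without running anything:
tesseract_args("page1.tif", "page1")
```

Running one file by hand this way, outside lapply(), may also help to tell 
an R-side error apart from a crash inside Tesseract itself.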

Regards,
Anshuk Pal Chaudhuri




Re: [R] Parsing all rows columns of a Dataframe into one column

2015-08-12 Thread Anshuk Pal Chaudhuri
Thanks. Here is another way of handling this, which I picked up on another 
forum and thought I would share here as well:

data.frame(Value=dat[!is.na(dat)])
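
For reference, a small sketch of the same idea on the sample values from the 
original question. The data frame dat is rebuilt here by hand as character 
columns (an assumption about how the data was read in), and the intermediate 
name values is only for illustration.

```r
# Sample data from the original question, typed in as character columns.
dat <- data.frame(
  V1 = c(NA, "abc", "76799", "02:15"),
  V2 = c(NA, "89.09", NA, "def"),
  V3 = c("90", "$50", NA, "1"),
  stringsAsFactors = FALSE
)

# Stack all columns into one character vector (column by column) ...
values <- unlist(lapply(dat, as.character), use.names = FALSE)

# ... then keep only the non-missing entries.
result <- data.frame(Value = values[!is.na(values)], stringsAsFactors = FALSE)
result
```

Converting each column with as.character() before stacking keeps the values 
exactly as they were read, whatever type each column has.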

Regards,
Anshuk Pal Chaudhuri

-Original Message-
From: PIKAL Petr [mailto:petr.pi...@precheza.cz] 
Sent: 10 August 2015 12:21
To: Anshuk Pal Chaudhuri anshu...@motivitylabs.com; r-help@r-project.org
Subject: RE: Parsing all rows  columns of a Dataframe into one column

Hi

Your HTML posting scrambled your question a bit. If I understand correctly, 
and you want values of various types in one column, probably the easiest way 
would be:

mat <- as.matrix(yourdata)
dim(mat) <- NULL       # flatten to a vector first, so that ...
mat <- na.omit(mat)    # ... na.omit drops single NA values, not whole rows
newdata <- data.frame(mat, stringsAsFactors=FALSE)

Cheers
Petr




[R] Parsing all rows columns of a Dataframe into one column

2015-08-09 Thread Anshuk Pal Chaudhuri
Hi All,


I am using R to read certain values from a dataset.

I have values in a data frame scattered across different columns and rows; 
some values might be NA as well.

E.g. the three columns V1, V2, V3 below, with their respective values:

V1      V2      V3
NA      NA      90
abc     89.09   $50
76799   NA      NA
02:15   def     1

What I would like to do is parse this data frame and create a new data frame 
that omits all NA values. The new data frame would have one column, let's say 
a Value column. (The order in which the values appear is not an issue.)

New Data Frame (Output Required):

Value
abc
76799
02:15
89.09
def
90
$50
1

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri




[R] Supervised Learning for Text Classification

2015-08-06 Thread Anshuk Pal Chaudhuri
Hi All,

This is my current process:


1.   Read the unstructured data and work out which text means what, e.g. a 
phrase like JOHN SMITH is a customer name, or a phrase like CLAIM DURATION 
is a reason-type field. Currently I do this manually (no issue here).

2.   Create a data frame with the fields Reason, Name, Category, 
Sub-Reason, Description. There is no boolean or numeric field (do I need to 
create one, given what I need in #5?).

3.   Manually fill the data frame with values read from the unstructured 
text. This will have approximately 50 rows and is my training set (no 
issue here).

4.   The data frame from #3 is then simply a supervised training data set.



Step #5 below is what I need to achieve (and where I need help):


5.   Now when I receive another file (test data), let's say to start with 
just a phrase **, can I predict from the trained model that it is a 
Reason-related phrase (based on the probability, let's say more than 70%), 
create a new data frame (PredictedDF), add a column (e.g. Reason), and add a 
row for this text under the Reason field? Then, on receiving one more text 
phrase which looks like a Name, add a column Name to PredictedDF and put 
this value under the Name column as the second row... and so on.
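
To make step #5 concrete, here is a minimal base-R sketch of a word-based 
naive Bayes classifier with the 70% cut-off mentioned above. It is only an 
illustration under stated assumptions: the training rows other than JOHN 
SMITH and CLAIM DURATION are invented, only the Name and Reason fields are 
covered, and the function names (tokenize, train_nb, predict_nb, 
assign_field) are hypothetical, not from any package.

```r
# Hypothetical labelled phrases; only the first two appear in the post.
train <- data.frame(
  text  = c("JOHN SMITH", "CLAIM DURATION", "MARY JONES", "CLAIM REJECTED"),
  label = c("Name", "Reason", "Name", "Reason"),
  stringsAsFactors = FALSE
)

# Lower-case a phrase and split it into words.
tokenize <- function(x) {
  tok <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  tok[nzchar(tok)]
}

# Fit per-class word probabilities with add-one (Laplace) smoothing.
train_nb <- function(text, label) {
  tokens  <- lapply(text, tokenize)
  vocab   <- unique(unlist(tokens))
  classes <- unique(label)
  counts <- sapply(classes, function(cl) {
    as.numeric(table(factor(unlist(tokens[label == cl]), levels = vocab))) + 1
  })
  rownames(counts) <- vocab
  prior <- as.numeric(table(factor(label, levels = classes))) / length(label)
  list(classes = classes, vocab = vocab,
       log_lik = log(sweep(counts, 2, colSums(counts), "/")),
       log_prior = log(prior))
}

# Posterior probability of each class for one phrase (unknown words ignored).
predict_nb <- function(model, phrase) {
  tok <- tokenize(phrase)
  tok <- tok[tok %in% model$vocab]
  score <- model$log_prior + colSums(model$log_lik[tok, , drop = FALSE])
  prob <- exp(score - max(score))
  prob <- prob / sum(prob)
  names(prob) <- model$classes
  prob
}

# Return the predicted field, or NA when no class exceeds the threshold.
assign_field <- function(model, phrase, threshold = 0.7) {
  prob <- predict_nb(model, phrase)
  best <- which.max(prob)
  if (prob[best] > threshold) names(prob)[best] else NA_character_
}

model <- train_nb(train$text, train$label)
new_phrases <- c("CLAIM DURATION EXCEEDED", "JOHN SMITH")
PredictedDF <- data.frame(
  text  = new_phrases,
  field = vapply(new_phrases, function(p) assign_field(model, p), character(1)),
  stringsAsFactors = FALSE
)
PredictedDF
```

The sketch stores one row per phrase with its predicted field rather than 
one column per field; reshaping into Reason / Name columns can be done 
afterwards if that layout is needed. No boolean or numeric field is required 
for this approach, since the words themselves are the features.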

I was reading about RTextTools (http://www.rtexttools.com/), but in that case 
it has to be told that this value corresponds to this text, and so on...

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri




Re: [R] Itinerary Ticket Parser

2015-07-31 Thread Anshuk Pal Chaudhuri
Dear All, 

I was looking into something similar and found a post on Stack Overflow: 
http://stackoverflow.com/questions/8438903/open-source-projects-for-email-scrubbing-generating-structured-data-from-unstruc

I guess this has not really been answered yet, but I wanted to check with 
all R enthusiasts whether something like this has been done. 

Regards,
Anshuk Pal Chaudhuri




[R] Itinerary Ticket Parser

2015-07-30 Thread Anshuk Pal Chaudhuri
Dear All,

I have been seeing a lot of APIs (like worldmate, or a new product called 
sift from easilydo) that are used for parsing emails and different itinerary 
tickets. Are there any packages in R that do that?

Regards,
Anshuk Pal Chaudhuri




[R] R Parse HTML tabular data and apply NLP

2015-07-30 Thread Anshuk Pal Chaudhuri
Hi All,

I have quite a few files containing HTML tabular data. The files all have 
different formats, numerous nested tables and different information, and the 
table structure differs completely from file to file. The only thing these 
files have in common is that their content is in tables.

I was able to read the tables using the readHTMLTable function; e.g. one 
file has 23 tables, and I was able to put all the data into one data frame. 
The read function is not able to interpret the headers (which it is also 
not supposed to do), hence it creates variables like V1, V2...
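
As an illustration of the "all data into one data frame" step, here is a 
minimal base-R sketch. It assumes the tables are available as a list of data 
frames (readHTMLTable returns such a list when given a whole document); the 
two small tables and the function name flatten_tables are invented for the 
example and do not come from the actual files.

```r
# Stand-in for the list of data frames read from one HTML file (made-up data).
tables <- list(
  data.frame(V1 = c("Name", "JOHN SMITH"), V2 = c("Reason", NA),
             stringsAsFactors = FALSE),
  data.frame(V1 = "CLAIM DURATION", stringsAsFactors = FALSE)
)

# Put every non-empty cell of every table into one long data frame,
# remembering which table each piece of text came from.
flatten_tables <- function(tables) {
  pieces <- lapply(seq_along(tables), function(k) {
    cells <- unlist(lapply(tables[[k]], as.character), use.names = FALSE)
    cells <- trimws(cells[!is.na(cells)])
    cells <- cells[nzchar(cells)]
    data.frame(table = rep(k, length(cells)), text = cells,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

all_text <- flatten_tables(tables)
all_text
```

With one text cell per row, the layout no longer depends on how each 
table was structured, which is a convenient starting point for labelling and 
classifying the text.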

Now that I have all the text in a data frame (the data is scattered across 
different columns), how do I use machine learning to learn that this text 
(sentence, word...) means this, or that text means that? Basically, 
automatic categorization of all the text in the data frame.

I was reading about RTextTools (http://www.rtexttools.com/), but in that case 
it has to be told that this value corresponds to this text, and so on...

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri

