Re: [R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread James C Schopf
Thank you Bert and Ivan,

I was building the SVM model in hopes of applying it to future cases, and I hoped the model would be able to deal with new words it hadn't encountered during training.  I tried Bert's suggestion of converting all of the data to tokens, creating a DTM, transforming the whole thing with TF-IDF, and then splitting it 75%/25%.  But when I began to train the SVM on the training data, R said it needed 26GB for a vector and crashed.  I tried again, and it crashed again.  I don't know why this would happen; I had just trained 4 SVM models using my previous method without any memory trouble on my machine with 8GB of RAM.  I also tried, unsuccessfully, to remove the new words from the test data.  Should I keep trying that?  Is there a way to stop my system from crashing with the new method?

Thank you for any ideas.

Here is the code I used when I split the data after converting to tokens and applying TF-IDF:

## packages used below
library(tidyverse)    # read_csv()
library(tm)           # Corpus(), DocumentTermMatrix(), weightTfIdf()
library(tokenizers)   # tokenize_words(), tokenize_ngrams()
library(caret)        # createDataPartition(), trainControl(), train()
library(kernlab)      # svmRadial backend used by caret

url <- "D:/test/M2.csv"
data <- read_csv(url)

## corpus from the text column
text_corpus <- Corpus(VectorSource(data$Text))

## unigrams, bigrams and trigrams for one document, combined into one vector
tokenize_document <- function(doc) {
  doc_tokens   <- unlist(tokenize_words(doc))
  doc_bigrams  <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens   <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}

## tokenise all documents, build one DTM and TF-IDF weight it
all_tokens <- lapply(text_corpus, tokenize_document)
text_dtm   <- DocumentTermMatrix(Corpus(VectorSource(all_tokens)))
text_tfidf <- weightTfIdf(text_dtm)

## dense data frame with the label column, then a 75%/25% split
processed_data <- data.frame(M2 = data$M2, text_tfidf = as.matrix(text_tfidf))
indexes   <- createDataPartition(processed_data$M2, p = 0.75, list = FALSE)
trainData <- processed_data[indexes, ]
testData  <- processed_data[-indexes, ]

## radial-kernel SVM, 5-fold CV repeated twice
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE)
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
                         trControl = ctrl)
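
On the memory question: a likely culprit for the 26GB allocation is as.matrix() turning the sparse TF-IDF matrix, which now has one column per unigram, bigram and trigram over all 1287 documents, into a dense matrix.  A possible mitigation, purely illustrative and not something tried in this thread, is to prune rare n-grams with tm's removeSparseTerms() before going dense; the 0.99 threshold here is an arbitrary example value:

text_dtm_small <- removeSparseTerms(text_dtm, sparse = 0.99)  # drop terms absent from >99% of docs
dim(text_dtm_small)                                           # far fewer columns than text_dtm
text_tfidf <- weightTfIdf(text_dtm_small)
processed_data <- data.frame(M2 = data$M2, text_tfidf = as.matrix(text_tfidf))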

________
From: Ivan Krylov 
Sent: Saturday, August 12, 2023 12:49 AM
To: James C Schopf 
Cc: r-help@r-project.org 
Subject: Re: [R] Different TFIDF settings in test set prevent testing model

On Fri, 11 Aug 2023 10:20:27, James C Schopf wrote:

> > train_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))

> > test_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately
(e.g. in order to be able to work with texts that don't exist at the
time the model is trained), but since the model has no representation
for tokens it (well, the tokeniser) hasn't seen during the training
process, you have to ensure that test_text_dtm references exactly the
same tokens as train_text_dtm, with the columns in the same order; one
way to do that is sketched below.
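
For example (a minimal sketch of that idea, not code from this thread), tm's DocumentTermMatrix() accepts a 'dictionary' control entry that restricts the result to a fixed vocabulary, so the test DTM can be built against the training terms:

train_terms <- Terms(train_text_dtm)                 # training vocabulary, sorted
test_text_dtm <- DocumentTermMatrix(
    Corpus(VectorSource(all_test_tokens)),
    control = list(dictionary = train_terms))
stopifnot(identical(Terms(test_text_dtm), train_terms))  # check the columns really line up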

Also, it probably makes sense to reuse the document frequencies learned
on the training document set; otherwise you may be importance-weighting
different tokens than the ones your SVM has learned as important, if
your test set has a significantly different term distribution from the
training set.  A rough sketch follows.
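
That could look something like this (my own rough sketch, not a tm function): it mimics weightTfIdf(), i.e. row-normalised term frequency times log2(nDocs/df), but takes the document frequencies from the training DTM.  It goes through dense matrices, so it only suits a vocabulary that already fits in memory:

m_train <- as.matrix(train_text_dtm)      # raw counts over the training terms
m_test  <- as.matrix(test_text_dtm)       # same columns, built with the dictionary above
idf_train <- log2(nrow(m_train) / pmax(colSums(m_train > 0), 1))  # training IDF
tf_test   <- m_test / pmax(rowSums(m_test), 1)                    # guard against empty docs
test_tfidf_mat <- sweep(tf_test, 2, idf_train, `*`)               # training IDF applied to test tf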

Bert is probably right: with the API given by the tm package, it seems
easiest to tokenise and weight the document-term matrix first, then
split it into the train and test subsets.  It may be worth asking the
maintainer about applying previously "learned" transformations to new
corpora.

--
Best regards,
Ivan



[R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread James C Schopf
Hello, I'd be very grateful for your help.

I randomly split a .csv file with 1287 documents 75%/25% into two csv files, 
one for training an algorithm and the other for testing it.  I applied the 
same preprocessing, including a TF-IDF transformation, to both sets, but R 
won't let me make predictions on the test set because the two TF-IDF matrices 
end up with different columns.
I get the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem.
Thanks!


Here's my R code:

> library(tidyverse)
> library(tidytext)
> library(caret)
> library(kernlab)
> library(tokenizers)
> library(tm)
> library(e1071)

***LOAD TRAINING SET: 959 rows with text in column 1 and yes/no in column 2 (labelled M2)
> url <- "D:/test/M2_75.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> train_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+ doc_tokens <- unlist(tokenize_words(doc))
+ doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+ doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+ all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+ return(all_tokens)
+ }
***APPLY TOKENS TO DOCUMENTS
> all_train_tokens <- lapply(train_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> train_text_tfidf <- weightTfIdf(train_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> trainData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
***DEFINE THE ML MODEL
> ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2, classProbs = TRUE)
***TRAIN SVM
> model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial", trControl = ctrl)
***SAVE SVM
> saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code for my test set, which failed at the last step:

***LOAD TEST SET: 309 rows with text in column 1 and yes/no in column 2 (labelled M2)
> url <- "D:/test/M2_25.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> test_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+ doc_tokens <- unlist(tokenize_words(doc))
+ doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+ doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+ all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+ return(all_tokens)
+ }
***APPLY TOKENS TO DOCUMENTS
> all_test_tokens <- lapply(test_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> test_text_tfidf <- weightTfIdf(test_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL TEST DATA
> testData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> testData$text_tfidf <- I(as.matrix(test_text_tfidf))
***LOAD OLD MODEL
> model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
***MAKE PREDICTIONS
> predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

Please help.  Thanks!

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.