I know nothing about tf, etc., but can you not simply read the whole file into R and then split it randomly there? The training and test sets would then be defined by a single random sample of subscripts: each row is either in the sample or not.
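A rough sketch of what that could look like when the features are a document-term matrix rather than a plain vector. Everything in it is made up for illustration: the toy `docs` vector and the hand-rolled term counts are hypothetical stand-ins for the real documents and for the tm/TF-IDF steps, and only the 75%/25% ratio is taken from the question. The point is just that the matrix is built once, for all documents, and the split is done on its rows afterwards, so both pieces share the same columns. (The two numbers in the reported error, 67503 and 27118, look like the column counts of the two separately built matrices.)

```r
# Hypothetical toy documents standing in for the real text column.
docs <- c("red apple pie", "green apple", "red wine", "green tea",
          "apple tea", "red pie", "wine and tea", "green pie")

# Build ONE term-count matrix for all documents, so every row is
# described by the same vocabulary (the same columns).
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(counts) <- vocab

# One random sample of row subscripts defines the split (75% / 25%).
set.seed(1)
n    <- nrow(counts)
samp <- sort(sample(n, size = round(0.75 * n)))

train_x <- counts[samp, , drop = FALSE]   # rows chosen: training
test_x  <- counts[-samp, , drop = FALSE]  # rows not chosen: test

# Both pieces come from the same matrix, so their columns match.
print(dim(train_x))
print(dim(test_x))
print(identical(colnames(train_x), colnames(test_x)))
```

Stripped of the matrix, the same indexing works on a plain vector.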
e.g. (simplified example -- you would be subsetting the rows of your full
dataset):

> x <- 1:10
> samp <- sort(sample(x,5))
> x[samp] ## training
[1] 3 4 6 7 8
> x[-samp] ## test
[1]  1  2  5  9 10

Apologies if my ignorance means this can't work.

Cheers,
Bert

On Fri, Aug 11, 2023 at 7:17 AM James C Schopf <jcsch...@hotmail.com> wrote:

> Hello, I'd be very grateful for your help.
>
> I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv
> files, one for training an algorithm and the other for testing the
> algorithm. I applied similar preprocessing, including TFIDF
> transformation, to both sets, but R won't let me make predictions on the
> test set due to a different TFIDF matrix.
> I get the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> I'd greatly appreciate a suggestion to overcome this problem.
> Thanks!
>
>
> Here's my R codes:
>
> > library(tidyverse)
> > library(tidytext)
> > library(caret)
> > library(kernlab)
> > library(tokenizers)
> > library(tm)
> > library(e1071)
> ***LOAD TRAINING SET/959 rows with text in column1 and yes/no in column2
> (labelled M2)
> > url <- "D:/test/M2_75.csv"
> > d <- read_csv(url)
> ***CREATE TEXT CORPUS FROM TEXT COLUMN
> > train_text_corpus <- Corpus(VectorSource(d$Text))
> ***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> > tokenize_document <- function(doc) {
> +   doc_tokens <- unlist(tokenize_words(doc))
> +   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
> +   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
> +   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
> +   return(all_tokens)
> + }
> ***APPLY TOKENS TO DOCUMENTS
> > all_train_tokens <- lapply(train_text_corpus, tokenize_document)
> ***CREATE A DTM FROM THE TOKENS
> > train_text_dtm <-
> DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
> ***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> > train_text_tfidf <- weightTfIdf(train_text_dtm)
> ***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> > trainData <- data.frame(M2 = d$M2)
> ***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> > trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
> ***DEFINE THE ML MODEL
> > ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
> classProbs = TRUE)
> ***TRAIN SVM
> > model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
> trControl = ctrl)
> ***SAVE SVM
> > saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")
>
> R code on my test set, which didn't work at last step:
>
> ***LOAD TEST SET/ 309 rows with text in column1 and yes/no in column2
> (labelled M2)
> > url <- "D:/test/M2_25.csv"
> > d <- read_csv(url)
> ***CREATE TEXT CORPUS FROM TEXT COLUMN
> > test_text_corpus <- Corpus(VectorSource(d$Text))
> ***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> > tokenize_document <- function(doc) {
>   doc_tokens <- unlist(tokenize_words(doc))
>   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
>   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
>   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
>   return(all_tokens)
> }
> ***APPLY TOKEN TO DOCUMENTS
> > all_test_tokens <- lapply(test_text_corpus, tokenize_document)
> ***CREATE A DTM FROM THE TOKENS
> > test_text_dtm <-
> DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
> ***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> > test_text_tfidf <- weightTfIdf(test_text_dtm)
> ***CREATE A NEW DATA WITH M2 COLUMN FROM ORIGINAL TEST DATA
> > testData <- data.frame(M2 = d$M2)
> ***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> > testData$text_tfidf <- I(as.matrix(test_text_tfidf))
> ***LOAD OLD MODEL
> model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
> ***MAKE PREDICTIONS
> predictions <- predict(model_svmRadial, newdata = testData)
>
> This last line produces the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> Please help. Thanks!
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.