Re: [R] Error in Corpus() in tm package
On Saturday, 17 August 2013 at 11:16 -0700, Ajinkya Kale wrote:
> It contains all text files which were converted from doc, docx, ppt etc.
> using LibreOffice. Some of them are non-English text documents. Sorry, I
> cannot share the corpus, but if someone can shed light on what might cause
> this error then I can try to eliminate those documents if some specific
> docs are causing it.

I think you should go the other way round: try with only one document and see if it works, and make enough attempts to find out in which cases it works and in which cases it fails. If it always fails, try with the examples provided by tm, and then with parts of your documents.

I don't think it makes sense to try to use VectorSource(), as it would imply reimplementing DirSource().

Regards

> On Sat, Aug 17, 2013 at 9:55 AM, Milan Bouchet-Valat nalimi...@club.fr wrote:
> > On Friday, 16 August 2013 at 19:35 -0700, Ajinkya Kale wrote:
> > > I am trying to use the text mining package ... I keep getting this error:
> > >
> > >     rm(list=ls())
> > >     library(tm)
> > >     sourceDir <- "Z:\\projectk_viz\\docs_to_index"
> > >     ovid <- Corpus(DirSource(sourceDir), readerControl = list(language = "lat"))
> > >     Error in if (vectorized && (length <= 0)) stop("vectorized sources must have positive length") :
> > >       missing value where TRUE/FALSE needed
> > >
> > > I am not sure what it means.
> >
> > The posting guide asks for a reproducible example. If you cannot make
> > available to us the contents of sourceDir, at least you should tell us
> > what kind of files it contains.
> >
> > Have you tried with only some of the files the directory contains?
> >
> > Regards
> >
> > > --ajinkya
>
> --
> Sincerely,
> Ajinkya
> http://ajinkya.info

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
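The one-document-at-a-time approach suggested above can be automated. The following is only a sketch: `find_failing_files`, `check` and `check_with_tm` are hypothetical names, not part of tm, and the tm-based check assumes that copying a single file into an empty temporary directory reproduces the failure seen on the full directory, which the thread does not confirm. The demonstration at the end uses a stand-in check so that it runs without tm.

```r
# Run `check` on every file separately and collect the error messages.
# `find_failing_files` and `check` are hypothetical names for illustration.
find_failing_files <- function(files, check) {
  msgs <- vapply(files, function(f) {
    tryCatch({
      check(f)
      NA_character_                      # no error for this file
    }, error = function(e) conditionMessage(e))
  }, character(1))
  msgs[!is.na(msgs)]                     # named vector: path -> error message
}

# A tm-based check (assumption: one file per temporary directory is enough
# to trigger the problem). Only usable when tm is installed.
check_with_tm <- function(f) {
  tmp <- tempfile("onedoc")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))
  if (!file.copy(f, tmp)) stop("could not copy file")
  tm::Corpus(tm::DirSource(tmp), readerControl = list(language = "lat"))
  invisible(TRUE)
}

# Self-contained demonstration with a stand-in check (base R only):
# the check fails for any file that does not end in ".txt".
demo_dir <- tempfile("docs")
dir.create(demo_dir)
writeLines("some text", file.path(demo_dir, "a.txt"))
writeLines("junk", file.path(demo_dir, "b.bin"))
demo_files <- list.files(demo_dir, full.names = TRUE)

txt_only <- function(f) if (!grepl("\\.txt$", f)) stop("not a .txt file")
bad <- find_failing_files(demo_files, txt_only)
print(basename(names(bad)))
```

On the real directory one would call something like `find_failing_files(list.files(sourceDir, full.names = TRUE), check_with_tm)`; the names of the result are the paths of the files that raised an error.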
Re: [R] Error in Corpus() in tm package
I did exactly what you suggested: I tried subsets of these documents and found that some junk non-txt files were causing the issue. Everything worked fine with DirSource() once I deleted them from the directory.

But I feel these functions should also report which file they are failing on. I have ended up debugging with subsets of the input one too many times.

On Aug 18, 2013 9:01 AM, Milan Bouchet-Valat nalimi...@club.fr wrote:
> I think you should go the other way round: try with only one document and
> see if it works, and make enough attempts to find out in which cases it
> works and in which cases it fails. If it always fails, try with the
> examples provided by tm, and then with parts of your documents.
>
> I don't think it makes sense to try to use VectorSource(), as it would
> imply reimplementing DirSource().
>
> Regards

--
Sincerely,
Ajinkya
http://ajinkya.info
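Instead of deleting stray files by hand, the directory can be screened before the corpus is built. This is a hedged sketch, not the fix applied in the thread: the temporary directory and file names are made up, and only the `pattern` argument of `DirSource()` and base R's `list.files()` and `file.info()` are relied upon. The tm call is skipped when tm is not installed.

```r
# Throwaway directory mimicking the thread: text files plus one stray file.
# All file names below are made up for illustration.
docs <- tempfile("docs_to_index")
dir.create(docs)
writeLines("first converted document", file.path(docs, "doc1.txt"))
writeLines("second converted document", file.path(docs, "doc2.txt"))
writeBin(as.raw(c(0x00, 0xff, 0x10)), file.path(docs, "leftover.tmp"))

all_files <- list.files(docs, full.names = TRUE)
txt_files <- list.files(docs, pattern = "\\.txt$", ignore.case = TRUE,
                        full.names = TRUE)

# Files that DirSource(docs) would pick up although they are not .txt files.
stray <- setdiff(all_files, txt_files)
print(basename(stray))

# Files whose metadata cannot be read at all are worth reporting too.
info <- file.info(all_files)
unreadable <- all_files[is.na(info$isdir)]
print(unreadable)

# With tm installed, restrict the source to the .txt files via `pattern`.
if (requireNamespace("tm", quietly = TRUE)) {
  ovid <- tm::Corpus(tm::DirSource(docs, pattern = "\\.txt$"),
                     readerControl = list(language = "lat"))
  print(length(ovid))
}
```

Printing `stray` and `unreadable` before building the corpus gives the per-file report that the error message itself does not provide.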
Re: [R] Error in Corpus() in tm package
On Sunday, 18 August 2013 at 09:19 -0700, Ajinkya Kale wrote:
> I did exactly what you mentioned... tried subset of these documents and
> found out there were some junk non-txt files which were causing this
> issue. Everything worked fine with dirsource once I deleted them from the
> dir.
>
> But I feel these functions should also tell what file they are failing at.
> I have ended up debugging with sub sets of input one too many times.

Good. Could you send us (or maybe privately to me) at least an excerpt of the file that is enough to reproduce the bug? Indeed, it would be nice to get a more explicit error message from tm if possible.

Regards
Re: [R] Error in Corpus() in tm package
On Friday, 16 August 2013 at 19:35 -0700, Ajinkya Kale wrote:
> I am trying to use the text mining package ... I keep getting this error:
>
>     rm(list=ls())
>     library(tm)
>     sourceDir <- "Z:\\projectk_viz\\docs_to_index"
>     ovid <- Corpus(DirSource(sourceDir), readerControl = list(language = "lat"))
>     Error in if (vectorized && (length <= 0)) stop("vectorized sources must have positive length") :
>       missing value where TRUE/FALSE needed
>
> I am not sure what it means.
>
> --ajinkya

The posting guide asks for a reproducible example. If you cannot make the contents of sourceDir available to us, you should at least tell us what kind of files it contains.

Have you tried with only some of the files the directory contains?

Regards
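The message "missing value where TRUE/FALSE needed" is what R reports when the condition of an `if` evaluates to `NA`. The sketch below reproduces that in base R, assuming the failing check has the form `vectorized && (length <= 0)` (the operators look to have been lost in the archived message). The variable names are illustrative, and how the length becomes `NA` inside `DirSource()` is not established in this thread; summing a logical vector that contains `NA` is shown only as one possible route.

```r
# Stand-ins for the values inside the failing check (illustrative names).
vectorized <- TRUE
len <- NA_integer_

# TRUE && NA is NA, so the condition is neither TRUE nor FALSE.
cond <- vectorized && (len <= 0)
print(cond)

# `if` cannot branch on NA and raises an error before stop() is reached.
msg <- tryCatch({
  if (cond) stop("vectorized sources must have positive length")
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# One possible way to end up with an NA count (an assumption, not confirmed
# by the thread): counting entries of a logical vector that contains NA.
is_file <- c(TRUE, TRUE, NA)
print(sum(is_file))
```

The practical reading is that the error points at a missing value computed from the directory listing, not at the language setting or at `Corpus()` itself.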
Re: [R] Error in Corpus() in tm package
It contains all text files which were converted from doc, docx, ppt etc. using LibreOffice. Some of them are non-English text documents. Sorry, I cannot share the corpus, but if someone can shed light on what might cause this error then I can try to eliminate those documents if some specific docs are causing it.

On Sat, Aug 17, 2013 at 9:55 AM, Milan Bouchet-Valat nalimi...@club.fr wrote:
> The posting guide asks for a reproducible example. If you cannot make
> available to us the contents of sourceDir, at least you should tell us
> what kind of files it contains.
>
> Have you tried with only some of the files the directory contains?
>
> Regards

--
Sincerely,
Ajinkya
http://ajinkya.info
Re: [R] Error in Corpus() in tm package
Funny, it works fine if I use VectorSource:

    ovid <- Corpus(VectorSource(list.files(sourceDir)[1:1253]), readerControl = list(language = "lat"))

So I tried executing only DirSource(sourceDir), and that fails with the error I mentioned earlier. So it's not a problem with Corpus(), as I initially thought.

Also, I noticed that VectorSource works much faster than DirSource here. Any particular reason?

On Sat, Aug 17, 2013 at 11:16 AM, Ajinkya Kale kaleajin...@gmail.com wrote:
> It contains all text files which were converted from doc, docx, ppt etc.
> using libreoffice. Some of them are non-english text documents. Sorry I
> cannot share the corpus.. but if someone can shed light on what might
> cause this error then I can try to eliminate those documents if some
> specific docs are causing it.

--
Sincerely,
Ajinkya
http://ajinkya.info
Re: [R] Error in Corpus() in tm package
I think I know why it works faster: VectorSource in the code above takes only the file names as the corpus, not the contents of the files. Duh!

Any suggestions for creating a vector source out of the contents of the txt files?

On Sat, Aug 17, 2013 at 1:59 PM, Ajinkya Kale kaleajin...@gmail.com wrote:
> Funny, it works fine if I use VectorSource:
>
>     ovid <- Corpus(VectorSource(list.files(sourceDir)[1:1253]), readerControl = list(language = "lat"))
>
> So I tried only executing DirSource(sourceDir) and that fails with the
> error I mentioned earlier. So it's not a problem with Corpus() which I
> thought initially it was.
>
> Also, I noticed that VectorSource works way faster than having a
> DirSource there. Any particular reason?

--
Sincerely,
Ajinkya
http://ajinkya.info
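A vector source over file contents needs a character vector with one element per document. The sketch below reads each file with base R and collapses its lines into a single string; `read_texts` is a hypothetical helper, the temporary files are made up, and encoding is left at the platform default, which may not suit non-English documents. As noted earlier in the thread, this amounts to reimplementing part of `DirSource()`.

```r
# Read every file into one string; the result has one element per document.
# `read_texts` is a hypothetical helper name, not a tm function.
read_texts <- function(files) {
  vapply(files, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1))
}

# Made-up example files in a temporary directory.
docs <- tempfile("docs")
dir.create(docs)
writeLines(c("first line", "second line"), file.path(docs, "a.txt"))
writeLines("other document", file.path(docs, "b.txt"))

files <- list.files(docs, pattern = "\\.txt$", full.names = TRUE)
texts <- read_texts(files)
print(unname(texts))

# With tm installed, the contents (not the file names) become the corpus.
if (requireNamespace("tm", quietly = TRUE)) {
  ovid <- tm::Corpus(tm::VectorSource(texts),
                     readerControl = list(language = "lat"))
  print(length(ovid))
}
```

Passing `list.files(sourceDir)` straight to `VectorSource()` gives a corpus of file names, which is why that call appeared so much faster.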