Re: [R] Error in Corpus() in tm package

2013-08-18 Thread Milan Bouchet-Valat
On Saturday 17 August 2013 at 11:16 -0700, Ajinkya Kale wrote:
 It contains all text files which were converted from doc, docx, ppt
 etc. using LibreOffice.
 Some of them are non-English text documents.
 
 Sorry, I cannot share the corpus, but if someone can shed light on
 what might cause this error then I can try to eliminate those
 documents if some specific docs are causing it.
I think you should go the other way round: try with only one document
and see if it works, then make enough attempts to find out in which cases
it works and in which cases it fails. If it always fails, try with the
examples provided by tm, and then with parts of your documents.
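
The one-document-at-a-time approach can be automated. What follows is a minimal sketch in base R, not a tm feature: `find_failing_files()`, `load_one` and `strict_reader` are hypothetical names introduced here, and the demo directory is invented. With tm installed, one could pass a loader that builds a one-document corpus instead of the plain reader used below.

```r
# Hypothetical helper (not part of tm): try to load each file on its own
# and collect the error message for every file that fails.
find_failing_files <- function(paths,
                               load_one = function(p) readLines(p, warn = FALSE)) {
  failures <- character(0)
  for (p in paths) {
    msg <- tryCatch({
      load_one(p)
      NA_character_                      # NA marks "loaded without error"
    }, error = function(e) conditionMessage(e))
    if (!is.na(msg)) failures[p] <- msg  # named by the offending path
  }
  failures
}

# Demo on a throwaway directory: one normal text file, one empty file.
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
good <- file.path(demo_dir, "good.txt")
bad  <- file.path(demo_dir, "empty.txt")
writeLines("arma virumque cano", good)
file.create(bad)

# Illustrative loader: treat a file with no lines as a failure.
strict_reader <- function(p) {
  x <- readLines(p, warn = FALSE)
  if (length(x) == 0) stop("no lines read")
  x
}

failures <- find_failing_files(c(good, bad), strict_reader)
print(failures)
```

The returned vector is named by file path, so the offending files can be read off directly instead of being found by repeated subsetting.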

I don't think it makes sense to try to use VectorSource() as it would
imply reimplementing DirSource().


Regards


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error in Corpus() in tm package

2013-08-18 Thread Ajinkya Kale
I did exactly what you suggested: I tried subsets of these documents and
found that some junk non-txt files were causing this issue.
Everything worked fine with DirSource once I deleted them from the dir.
But I feel these functions should also report which file they are failing
on; I have ended up debugging with subsets of the input one too many times.
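
Rather than deleting stray files, one option is to restrict which files are picked up. The sketch below uses base R's `list.files()` with a `pattern`; `DirSource()` accepts a `pattern` argument as well. The directory and file names are made up for illustration, and the tm call only runs if tm is installed.

```r
# Throwaway directory with two text files and one stray non-txt file.
demo_dir <- file.path(tempdir(), "filter_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("first document",  file.path(demo_dir, "a.txt"))
writeLines("second document", file.path(demo_dir, "b.txt"))
writeBin(as.raw(c(0x00, 0xff, 0x10)), file.path(demo_dir, "junk.bin"))

# Keep only names ending in ".txt" (case-insensitive).
txt_files <- list.files(demo_dir, pattern = "\\.txt$", ignore.case = TRUE)
skipped   <- setdiff(list.files(demo_dir), txt_files)
print(txt_files)
print(skipped)

# Same filter applied through DirSource(), if tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::Corpus(tm::DirSource(demo_dir, pattern = "\\.txt$"))
  print(length(corpus))
}
```

Printing `skipped` gives a quick list of the files that were left out, which is easier to review than an error raised somewhere inside the source constructor.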


Re: [R] Error in Corpus() in tm package

2013-08-18 Thread Milan Bouchet-Valat
On Sunday 18 August 2013 at 09:19 -0700, Ajinkya Kale wrote:
 I did exactly what you mentioned... tried subset of these documents
 and found out there were some junk non-txt files which were causing
 this issue. Everything worked fine with DirSource once I deleted them
 from the dir.
 But I feel these functions should also tell what file they are failing
 at; I have ended up debugging with subsets of input one too many
 times.
Good. Could you send us (or maybe privately to me) at least an excerpt
of the file that is enough to reproduce the bug? Indeed it would be nice
to get a more explicit error message from tm if possible.


Regards

 


Re: [R] Error in Corpus() in tm package

2013-08-17 Thread Milan Bouchet-Valat
On Friday 16 August 2013 at 19:35 -0700, Ajinkya Kale wrote:
 I am trying to use the text mining package ... I keep getting this error:
 
 rm(list=ls())
 library(tm)
 sourceDir <- "Z:\\projectk_viz\\docs_to_index"
 ovid <- Corpus(DirSource(sourceDir), readerControl = list(language = "lat"))
 
 Error in if (vectorized && (length <= 0)) stop("vectorized sources must
 have positive length") : missing value where TRUE/FALSE needed
 
 I am not sure what it means.
The posting guide asks for a reproducible example. If you cannot make
the contents of sourceDir available to us, you should at least tell us
what kind of files it contains. Have you tried with only some of the
files the directory contains?
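
As background on the message itself: R raises "missing value where TRUE/FALSE needed" whenever the condition of an `if` evaluates to `NA`. The sketch below reproduces that in base R with a made-up `NA` length. Whether this is what happens inside `DirSource()` for this directory is an assumption, not something established in this thread. The `file.info()` check at the end is one way to look for directory entries whose metadata cannot be read; `check_dir` is a stand-in name.

```r
# A condition of the same shape as the one in the error message,
# evaluated with a missing length (hypothetical value).
vectorized <- TRUE
len <- NA_integer_

res <- tryCatch({
  if (vectorized && (len <= 0)) stop("vectorized sources must have positive length")
  "no error"
}, error = function(e) paste("error:", conditionMessage(e)))
print(res)

# Diagnostic: list directory entries for which file.info() cannot
# determine whether they are directories (isdir is NA).
check_dir <- tempdir()   # stand-in; replace with the directory of interest
entries <- list.files(check_dir, full.names = TRUE)
info <- file.info(entries)
unreadable <- entries[is.na(info$isdir)]
print(unreadable)
```

Here `TRUE && NA` is `NA`, so the error comes from the `if` itself and the `stop()` call is never reached. That is why the reported message differs from the text inside `stop()`.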


Regards



Re: [R] Error in Corpus() in tm package

2013-08-17 Thread Ajinkya Kale
It contains all text files which were converted from doc, docx, ppt etc.
using LibreOffice.
Some of them are non-English text documents.

Sorry, I cannot share the corpus, but if someone can shed light on what
might cause this error then I can try to eliminate those documents if some
specific docs are causing it.



-- 

Sincerely,
Ajinkya
http://ajinkya.info



Re: [R] Error in Corpus() in tm package

2013-08-17 Thread Ajinkya Kale
Funny, it works fine if I use VectorSource:
ovid <- Corpus(VectorSource(list.files(sourceDir)[1:1253]), readerControl =
list(language = "lat"))
So I tried executing only DirSource(sourceDir), and that fails with the
error I mentioned earlier. So it's not a problem with Corpus(), as I
initially thought.

Also, I noticed that VectorSource works much faster than having a
DirSource there.
Any particular reason?




Re: [R] Error in Corpus() in tm package

2013-08-17 Thread Ajinkya Kale
I think I know why it works faster: VectorSource in the above code only
takes the file names as a corpus, not the contents of the files :D duh!

Any suggestions for creating a vector source out of the contents of the txt
files?
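
One possible approach, sketched under assumptions: read each file into a single string with base R and hand the resulting character vector to `VectorSource()`. The demo directory and file names are invented for illustration, and the tm call is guarded so the rest runs without tm installed.

```r
# Throwaway directory standing in for sourceDir.
demo_dir <- file.path(tempdir(), "vector_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("line one", "line two"), file.path(demo_dir, "doc1.txt"))
writeLines("another document",        file.path(demo_dir, "doc2.txt"))

# Full paths of the .txt files, then one string per file.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
texts <- vapply(
  files,
  function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
  character(1)
)
names(texts) <- basename(files)
print(texts)

# Build the corpus from file contents rather than file names.
if (requireNamespace("tm", quietly = TRUE)) {
  ovid <- tm::Corpus(tm::VectorSource(texts),
                     readerControl = list(language = "lat"))
  print(length(ovid))
}
```

This repeats by hand part of what `DirSource()` already does, so it is mainly useful as a workaround while tracking down which files a directory source cannot handle.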

