Jatin, if you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it.
RJ

On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do not cache the input
> documents, but cache the output from HashingTF and IDF instead. We
> don't have a label indexer yet, so you need a label-to-index map to
> map the labels to double values, e.g., D1 -> 0.0, D2 -> 1.0, etc.
> Assuming that the input is an RDD[(label: String, doc: Seq[String])],
> the code should look like the following:
>
> val docTypeToLabel = Map("D1" -> 0.0, ...)
> val tf = new HashingTF()
> val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
> val idf = new IDF()
> val idfModel = idf.fit(freqs.values)
> val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
> val nbModel = NaiveBayes.train(vectors)
>
> IDF doesn't provide a filter on the minimum occurrence, but it would
> be nice to add that option. Please create a JIRA and someone may work
> on it.
>
> Best,
> Xiangrui
>
> On Thu, Sep 18, 2014 at 3:46 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> > Hi,
> >
> > I have been running into memory overflow issues while creating TF-IDF
> > vectors to be used in document classification using MLlib's Naive Bayes
> > classification implementation:
> >
> > http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
> >
> > Memory overflow and GC issues occur while collecting IDFs for all the
> > terms. To give an idea of scale, I am reading around 615,000 small
> > documents (around 4 GB of text data) from HBase and running the Spark
> > program with 8 cores and 6 GB of executor memory. I have tried
> > increasing the parallelism level and the shuffle memory fraction, but
> > to no avail.
> >
> > The new TF-IDF generation APIs caught my eye in the latest Spark
> > version 1.1.0.
> > The example given in the official documentation mentions creation of
> > TF-IDF vectors based on the hashing trick. I want to know if it will
> > solve the mentioned problem by benefiting from reduced memory
> > consumption.
> >
> > Also, the example does not state how to create labeled points for a
> > corpus of pre-classified document data. For example, my training input
> > looks something like this:
> >
> > DocumentType | Content
> > -----------------------------------------------------------------
> > D1           | This is Doc1 sample.
> > D1           | This also belongs to Doc1.
> > D1           | Yet another Doc1 sample.
> > D2           | Doc2 sample.
> > D2           | Sample content for Doc2.
> > D3           | The only sample for Doc3.
> > D4           | Doc4 sample looks like this.
> > D4           | This is Doc4 sample content.
> >
> > I want to create labeled points from this sample data for training.
> > And once the Naive Bayes model is created, I generate TF-IDFs for the
> > test documents and predict the document type.
> >
> > If the new API can solve my issue, how can I generate labeled points
> > using the new APIs? An example would be great.
> >
> > Also, I have a special requirement of ignoring terms that occur in
> > fewer than two documents. This has important implications for the
> > accuracy of my use case and needs to be accommodated while generating
> > the TF-IDFs.
> >
> > Thanks,
> > Jatin
> >
> > -----
> > Novice Big Data Programmer
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org

--
em rnowl...@gmail.com
c 954.496.2314
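The memory benefit Xiangrui describes comes from the hashing trick: every term is mapped into a fixed-size vector by hashing, so no global term-to-index dictionary ever has to be built or collected on the driver. A minimal plain-Scala sketch of that idea (this is an illustration, not Spark's actual HashingTF implementation; the object and method names are made up for the example):

```scala
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures) using its hash code.
  // The double-mod keeps negative hash codes in range.
  def termIndex(term: String, numFeatures: Int): Int =
    ((term.hashCode % numFeatures) + numFeatures) % numFeatures

  // Term-frequency vector as a sparse map from bucket index to count.
  // Memory is bounded by numFeatures, not by vocabulary size -- which is
  // why a small feature dimension sidesteps the overflow on a large corpus.
  def hashingTf(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
    doc.foldLeft(Map.empty[Int, Double]) { (acc, term) =>
      val i = termIndex(term, numFeatures)
      acc.updated(i, acc.getOrElse(i, 0.0) + 1.0)
    }
}
```

The trade-off is that distinct terms can collide into the same bucket; a larger `numFeatures` reduces collisions at the cost of wider vectors.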
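Since there is no label indexer in MLlib 1.1.0, the `docTypeToLabel` map in Xiangrui's snippet can be derived from the training data itself rather than written by hand. A hypothetical helper (`indexLabels` is not a Spark API, just a sketch of the idea):

```scala
object LabelIndexSketch {
  // Assign each distinct label a double index in sorted order,
  // e.g. D1 -> 0.0, D2 -> 1.0, since NaiveBayes expects double labels.
  def indexLabels(labels: Seq[String]): Map[String, Double] =
    labels.distinct.sorted.zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap
}
```

On an RDD the distinct labels would be gathered with `input.keys.distinct().collect()` first; the label set is small, so collecting it is cheap.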
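Until IDF gains a minimum-occurrence option (the filter Xiangrui suggests filing a JIRA for), the "ignore terms in fewer than two documents" requirement can be met by filtering the tokenized corpus before hashing. A local two-pass sketch, assuming an in-memory `Seq` for clarity (on an RDD the document frequencies would instead be computed with a `flatMap` over distinct terms per document and a `reduceByKey`):

```scala
object MinDocFreqSketch {
  // Count, for each term, the number of documents containing it,
  // then drop terms that appear in fewer than minDocFreq documents.
  def filterRareTerms(docs: Seq[Seq[String]], minDocFreq: Int): Seq[Seq[String]] = {
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct)          // one entry per term per document
        .groupBy(identity)
        .map { case (term, hits) => term -> hits.size }
    docs.map(_.filter(term => docFreq(term) >= minDocFreq))
  }
}
```

Note that filtering must happen on the raw terms: after HashingTF the original terms are gone, and a bucket's count may mix several colliding terms.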