Jatin,

If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
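
In the meantime, a rough (untested) workaround sketch: drop hashed term
indices that appear in fewer than two documents before fitting IDF. This
builds on the freqs RDD from Xiangrui's snippet below and assumes HashingTF
returns sparse vectors; the names filteredFreqs/keepBc are just for
illustration.

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Document frequency per hashed index (HashingTF outputs are sparse).
val docFreqs = freqs.values
  .flatMap { case v: SparseVector => v.indices }
  .map(i => (i, 1L))
  .reduceByKey(_ + _)

// Keep only indices that occur in at least two documents.
val keep = docFreqs.filter(_._2 >= 2L).keys.collect().toSet
val keepBc = freqs.context.broadcast(keep)

// Zero out the rare terms in each TF vector.
val filteredFreqs = freqs.mapValues { case v: SparseVector =>
  val (idx, vals) = v.indices.zip(v.values)
    .filter { case (i, _) => keepBc.value.contains(i) }
    .unzip
  Vectors.sparse(v.size, idx, vals)
}.cache()

Then fit the IDF model on filteredFreqs.values instead of freqs.values; the
rest of the pipeline stays the same. Note that with hashing the filter runs
on hash buckets rather than raw terms, so colliding terms share a count.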

RJ

On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do not cache the input
> documents; cache the output from HashingTF and IDF instead. We don't
> have a label indexer yet, so you need a label-to-index map to convert
> the labels to double values, e.g., D1 -> 0.0, D2 -> 1.0, etc. Assuming
> that the input is an RDD[(label: String, doc: Seq[String])], the code
> should look like the following:
>
> val docTypeToLabel = Map("D1" -> 0.0, ...)
> val tf = new HashingTF()
> val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
> val idf = new IDF()
> val idfModel = idf.fit(freqs.values)
> val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
> val nbModel = NaiveBayes.train(vectors)
>
> IDF doesn't provide a filter on minimum document frequency yet, but it
> would be nice to add that option. Please create a JIRA and someone may
> work on it.
>
> Best,
> Xiangrui
>
>
> On Thu, Sep 18, 2014 at 3:46 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> > Hi,
> >
> > I have been running into memory overflow issues while creating TFIDF
> > vectors to be used in document classification with MLlib's Naive Bayes
> > classification implementation.
> >
> > http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
> >
> > Memory overflow and GC issues occur while collecting IDFs for all the
> > terms. To give an idea of scale, I am reading around 615,000 (around 4GB
> > of text data) small-sized documents from HBase and running the Spark
> > program with 8 cores and 6GB of executor memory. I have tried increasing
> > the parallelism level and the shuffle memory fraction, but to no avail.
> >
> > The new TFIDF generation APIs caught my eye in the latest Spark version,
> > 1.1.0. The example in the official documentation shows how to create
> > TFIDF vectors based on the hashing trick. I want to know whether this
> > will solve the problem above by reducing memory consumption.
> >
> > Also, the example does not show how to create labeled points for a
> > corpus of pre-classified documents. For example, my training input looks
> > something like this:
> >
> > DocumentType  |  Content
> > -----------------------------------------------------------------
> > D1                   |  This is Doc1 sample.
> > D1                   |  This also belongs to Doc1.
> > D1                   |  Yet another Doc1 sample.
> > D2                   |  Doc2 sample.
> > D2                   |  Sample content for Doc2.
> > D3                   |  The only sample for Doc3.
> > D4                   |  Doc4 sample looks like this.
> > D4                   |  This is Doc4 sample content.
> >
> > I want to create labeled points from this sample data for training. Once
> > the Naive Bayes model is created, I generate TFIDFs for the test
> > documents and predict the document type.
> >
> > If the new API can solve my issue, how can I generate labeled points
> > using the new APIs? An example would be great.
> >
> > Also, I have a special requirement of ignoring terms that occur in fewer
> > than two documents. This has important implications for the accuracy of
> > my use case and needs to be accommodated while generating TFIDFs.
> >
> > Thanks,
> > Jatin
> >
> >
> >
> > -----
> > Novice Big Data Programmer
> >
>
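
P.S. For scoring the test documents once the model is trained, something
along these lines should work (sketch only; it reuses tf, idfModel, and
nbModel from Xiangrui's snippet, and predictDocTypes is just an illustrative
name):

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.feature.{HashingTF, IDFModel}
import org.apache.spark.rdd.RDD

// Returns the label doubles from docTypeToLabel (0.0 -> D1, 1.0 -> D2, ...).
def predictDocTypes(testDocs: RDD[Seq[String]],
                    tf: HashingTF,
                    idfModel: IDFModel,
                    nbModel: NaiveBayesModel): RDD[Double] = {
  val testTf = tf.transform(testDocs)        // hash terms into the same feature space
  val testTfidf = idfModel.transform(testTf) // reuse the IDF weights from training
  nbModel.predict(testTfidf)
}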


-- 
em rnowl...@gmail.com
c 954.496.2314
