On Jan 16, 2014, at 1:58am, Suresh M <suresh4mas...@gmail.com> wrote:
> Hi,
>
> Thanks for your reply.
> I have the table of contents, metadata, title, author, etc. for the books.
> Can you please tell me the next steps?
> I have read in the Mahout in Action book that a few tools are available for vectorization, e.g. Lucene analyzers and Mahout vector encoders.
> Can you please tell me which is better and how to use it?

I cover some of the issues and approaches to generating text-based features in these two blog posts:

http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/
http://www.scaleunlimited.com/2013/07/21/text-feature-selection-for-machine-learning-part-2/

-- Ken

> On 16 January 2014 14:49, Saeed Iqbal KhattaK <saeediqbalkhat...@gmail.com> wrote:
>
>> Dear Suresh,
>>
>> I am also working on classification of books.
>>
>> First I collect metadata from my e-books. After collecting the metadata, I start the second stage: pre-processing. In pre-processing, I extract information such as the *book title, chapter titles, sections, subsections, paragraphs, sub-paragraphs, and bold fonts*, and remove all other formatting.
>>
>> On Thu, Jan 16, 2014 at 2:09 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>>> You generally want to do linguistic pre-processing (finding phrases, synonymizing certain forms such as abbreviations, tokenizing, dropping stop words, removing boilerplate, removing tables) before doing vectorization. Altogether, these form pre-processing.
>>>
>>> To classify books, you need to recognize that many books are about many topics. You may want to segment your books down to the chapter, section, or even paragraph level.
>>>
>>> On Wed, Jan 15, 2014 at 10:25 PM, Suresh M <suresh4mas...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can you please tell me what that pre-processing means?
>>>> Is it vectorization (as explained in the Mahout in Action book)?
>>>> Can it be done using Java and the Mahout API?
>>>> And does "the model" mean a class?
>>>>
>>>> On 16 January 2014 11:38, KK R <kirubakumar...@gmail.com> wrote:
>>>>
>>>>> Hi Suresh,
>>>>>
>>>>> Apache Mahout has several classification algorithms you can use to do the classification.
>>>>>
>>>>> Step 1: Your data may require pre-processing. If so, it can be done using Hadoop / Hive / Mahout utilities.
>>>>>
>>>>> Step 2: Run a classification algorithm on your training data and build your model using Mahout's classification algorithms.
>>>>>
>>>>> Step 3: When the actual data arrives, classify it with the help of the trained model. This can be done sequentially in Java, or MapReduce can be used if the data is huge and scalability is a requirement.
>>>>>
>>>>> Thanks,
>>>>> Kirubakumaresh
>>>>> @http://www.linkedin.com/pub/kirubakumaresh-rajendran/66/411/305
>>>>>
>>>>> On Thu, Jan 16, 2014 at 11:28 AM, Suresh M <suresh4mas...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Our application will be receiving books from different users.
>>>>>> We have to classify them accordingly.
>>>>>> Can someone please tell me how to do that using Apache Mahout and Java?
>>>>>> Is Hadoop necessary for that?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Suresh
>>
>> --
>> *Saeed Iqbal KhattaK*
>> Lecturer (FoIT) -- University of Central Punjab, Lahore
>> Tel: +92-42-35880007 (ext 194)
>> MS CS, FAST-NUCES, Peshawar
>> BS IT (Hons), Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore.
>> http://saeedkhattak.wordpress.com
>> Cell No # +92-333-9533493

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
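[Editor's note] The pipeline the thread outlines (tokenize, drop stop words, then encode tokens into fixed-width vectors) can be illustrated with a minimal, stand-alone Java sketch. This uses only the JDK; the class and method names here are hypothetical, not the Mahout API. Mahout's own feature encoders (such as StaticWordValueEncoder, covered in Mahout in Action) perform the hashing step shown below, along with weighting and collision handling.

```java
import java.util.*;

// Hypothetical sketch of pre-processing + vectorization, JDK only.
public class BookTextSketch {

    // A tiny illustrative stop-word list; a real one would be much larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "is"));

    // Step 1: tokenize -- lower-case, split on non-letters, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: encode tokens into a fixed-width vector by hashing --
    // the same trick Mahout's feature encoders use, so the vector
    // width is independent of the vocabulary size.
    static double[] hashedTermFrequencies(List<String> tokens, int width) {
        double[] vector = new double[width];
        for (String t : tokens) {
            int slot = Math.floorMod(t.hashCode(), width);
            vector[slot] += 1.0;  // raw term frequency; TF-IDF weighting could follow
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The history of the Roman Empire");
        System.out.println(tokens);  // [history, roman, empire]
        System.out.println(Arrays.toString(hashedTermFrequencies(tokens, 16)));
    }
}
```

The resulting vectors are what a Mahout classifier trains on; segmenting a book into chapters or paragraphs first, as suggested above, just means running this encoding once per segment instead of once per book.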