MapReduce Training phase

2013-12-07 Thread unmesha sreeveni
I want to know more about how algorithms like SVM are made parallel:
whether the training phase, the prediction phase, or both are run with MapReduce.


Normally the training phase is a good fit for MapReduce because it takes a lot of time.
Do we need to run prediction with MapReduce as well?
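
As a rough illustration of why prediction parallelizes easily: once a model is trained, each record can be scored independently, so prediction fits a map-only job with no reduce step. The sketch below assumes a hypothetical linear model (the weights, bias and records are made-up values, not taken from any Mahout implementation) and simulates the input splits with plain Python lists.

```python
def predict(weights, bias, row):
    """Score one record with a linear model; returns +1 or -1."""
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if score >= 0 else -1


def map_partition(weights, bias, rows):
    """What one mapper would do: score only the rows in its own split."""
    return [predict(weights, bias, row) for row in rows]


# Hypothetical trained model and input records (illustrative values only).
weights, bias = [0.5, -1.0], 0.25
records = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.5], [0.1, 0.2]]

# Two "splits", as if the input were spread over two mappers.
split_a, split_b = records[:2], records[2:]
distributed = map_partition(weights, bias, split_a) + map_partition(weights, bias, split_b)

# Scoring everything in one place gives the same predictions.
sequential = map_partition(weights, bias, records)
```

Training is different: the model parameters depend on all records, so it usually needs a reduce step or several passes over the data.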


-- 
*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*


Re: SVM Implementation for mahout?

2013-12-07 Thread Fernando Santos
Thanks Manuel.

It seems that these two (https://issues.apache.org/jira/browse/MAHOUT-334
 and https://issues.apache.org/jira/browse/MAHOUT-232) patches might work,
although not in parallel.

Has anyone successfully used either of these two patches already and could
share some comments about them?

Thanks


2013/12/6 Manuel Blechschmidt manuel.blechschm...@gmx.de

 Hi Fernando,
 there are some patches and some discussions:

 SVM:
 https://issues.apache.org/jira/browse/MAHOUT-334
 https://issues.apache.org/jira/browse/MAHOUT-232
 https://issues.apache.org/jira/browse/MAHOUT-14
 https://issues.apache.org/jira/browse/MAHOUT-227

 /Manuel

 On 06.12.2013, at 19:14, Fernando Santos wrote:

  Hello,
 
  Is there any tested SVM implementation for Mahout?
 
  Mahout in Action says there is a sequential implementation, but that it is
  still experimental. I couldn't find this implementation.
 
  Thanks
 
  --
  Fernando Santos
  +55 61 8129 8505

 --
 Manuel Blechschmidt
 M.Sc. IT Systems Engineering
 Dortustr. 57
 14467 Potsdam
 Mobil: 0173/6322621
 Twitter: http://twitter.com/Manuel_B




-- 
Fernando Santos
+55 61 8129 8505


Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-07 Thread Fernando Santos
I figured out what the problem was.

First of all, the data was not big enough to split the job into more than one
task. The training file was 30MB and my block size was 64MB.

Besides that, I set the number of map tasks (mapred.map.tasks) and reduce
tasks (mapred.reduce.tasks) in Hadoop's mapred-site.xml file.

After that the algorithm started running in an acceptable time.
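
A back-of-the-envelope sketch of the first point, assuming the simplified rule that a file gets roughly one map task per input split and that the split size equals the block size. The real Hadoop split calculation has more knobs, and the function name here is made up for illustration.

```python
import math


def estimate_map_tasks(file_size_mb, split_size_mb):
    """Rough estimate: one map task per split, and at least one task."""
    return max(1, math.ceil(file_size_mb / split_size_mb))


training = estimate_map_tasks(30, 64)  # 30MB training file, 64MB blocks
testing = estimate_map_tasks(2, 64)    # 2MB test file, 64MB blocks
```

Under this simplified rule both files fit in a single split, so each job gets a single map task and only one box has work to do.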



2013/12/2 Fernando Santos fernandoleandro1...@gmail.com

 The train and test sets are each in a single file (part-r-0). The training
 file is 30MB and the testing file is 2MB.


 2013/12/2 Fernando Santos fernandoleandro1...@gmail.com

 Hello Ted,

 No, the training also ran on one machine. What happens sometimes is that
 the boxes execute jobs one at a time, but not together. For example, if
 there are 3 jobs to run, the first job runs on box 1, the next on box 2 and
 the next on box 1 again.

 The full dataset is a CSV of around 70MB. I turned it into a sequence file,
 applied seq2sparse, then split it and trained. The training task was quite
 fast, taking a few minutes to execute. But the test is really slow, as I
 said, and it also runs on one machine.



 2013/12/1 Ted Dunning ted.dunn...@gmail.com

 Did the training run use both machines?

 How large is the input for the test run?

 Is it contained in a single file?




 On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos 
 fernandoleandro1...@gmail.com wrote:

  Hello everyone,
 
  I'm trying to do a text classification task. My dataset is not that big: I
  have around 700,000 small comments.
 
  Following the 20newsgroups example, I created the vectors from the text,
  split them and trained the model. Now I'm trying to test it, but it is
  really slow and I also cannot make it run on the cluster. Whatever I do,
  it always runs on just one machine. And I think the testnb algorithm is
  supposed to run using MapReduce, right?
 
  I also tried this example
  (http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/),
  but again the other box in the cluster does not execute any task. In fact,
  when I run testnb or the MapReduceClassifier proposed in the tutorial
  above, I get one job executing one task, and this task runs really
  slowly (about 6 minutes to reach 0.13% of the task).
 
  I think I must be doing something wrong, so that the cluster is not working
  the way it is supposed to.
 
  I have a cluster of 2 boxes configured with Hadoop 0.20.205.0 and
  Mahout 0.8.
 
  I also tried versions 0.7 and 0.6 of Mahout, but nothing changed.
 
  Any help would be appreciated.
 
 
  The logs I have from this task:
 
 
  *stdout logs*
 
  Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
  /usr/local/hadoop/lib/libhadoop.so which might have disabled stack
  guard. The VM will try to fix the stack guard now.
  It's highly recommended that you fix the library with 'execstack -c
  libfile', or link it with '-z noexecstack'.
 
 
  *syslog logs*
 
  2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader:
  Unable to load native-hadoop library for your platform... using
  builtin-java classes where applicable
  2013-11-30 17:09:19,400 WARN
  org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
  already exists!
  2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree:
  setsid exited with exit code 0
  2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task:  Using
  ResourceCalculatorPlugin :
  org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963
  2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
  2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data
  buffer = 79691776/99614720
  2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record
  buffer = 262144/327680
 
 
 
 
 
  --
  Fernando Santos
  +55 61 8129 8505
 








-- 
Fernando Santos
+55 61 8129 8505


Re: SVM Implementation for mahout?

2013-12-07 Thread Suneel Marthi
Is there any specific reason you are looking for an SVM implementation only?
Are you sure that those patches are still relevant given the codebase today?






Re: SVM Implementation for mahout?

2013-12-07 Thread Fernando Santos
Hello Suneel,

I want to check whether SVM gives better performance.

I've been using Naive Bayes, but my data is quite unbalanced and therefore
I'm getting pretty bad results with it. I also tried Complementary
Naive Bayes, but got the same bad results. I read about a difference in
Naive Bayes performance between the Weka and Mahout implementations, and maybe
that's the cause (
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/%3ccabdaxxijtfv9nhqxxpyd72rrsv-h60ps13h0pund2injx70...@mail.gmail.com%3E
).

I also tried logistic regression and got around 77% accuracy. So maybe with
SVM it could be better.


-- 
Fernando Santos
+55 61 8129 8505


Re: SVM Implementation for mahout?

2013-12-07 Thread Lucas Fernandes Brunialti
Hello Fernando,

The Naive Bayes approach assumes that your features are independent given
the class. If your features are highly correlated, Naive Bayes won't be
a good choice.
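
A toy sketch of what goes wrong, using made-up likelihoods and two classes A and B with equal priors (none of these numbers come from the dataset discussed here). Because Naive Bayes multiplies per-feature likelihoods, a perfectly correlated copy of a feature gets counted as fresh evidence.

```python
def posterior_a(likelihoods_a, likelihoods_b, prior_a=0.5):
    """Naive Bayes posterior P(A | features) for two classes A and B.

    Multiplies the per-feature likelihoods, i.e. treats the features
    as independent given the class.
    """
    score_a = prior_a
    score_b = 1.0 - prior_a
    for p_a, p_b in zip(likelihoods_a, likelihoods_b):
        score_a *= p_a
        score_b *= p_b
    return score_a / (score_a + score_b)


# Made-up likelihoods: the feature is seen in 80% of class A and 40% of class B.
one_feature = posterior_a([0.8], [0.4])

# The same feature counted twice, as if a perfectly correlated copy were added.
duplicated = posterior_a([0.8, 0.8], [0.4, 0.4])
```

The duplicated feature adds no new information, yet the posterior for class A moves from about 0.67 to 0.80, so the classifier becomes overconfident.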

I would advise you to try neural networks (MLP); they can learn a better
decision surface than logistic regression.

Best.

Lucas.