RE: Algorithm implementations in Pig

2010-02-24 Thread Palleti, Pallavi
Thanks for the clarification, Ankur. Do you have any performance
comparison between Pig 0.6 and Hadoop? I would be interested in looking at
it. The last comparison I heard of was
http://osdir.com/ml/hive-user-hadoop-apache/2009-06/msg00078.html.
Pig 0.7.0 seems interesting. Thanks for sharing the information. I am
looking forward to experimenting with it.

Thanks
Pallavi 

-Original Message-
From: Ankur C. Goel [mailto:gan...@yahoo-inc.com] 
Sent: Wednesday, February 24, 2010 1:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

Pallavi,
  Thanks for your comments. Some clarifications w.r.t. Pig.

Pig does not generate any M/R code. What it generates is logical,
physical, and map-reduce plans, which are nothing but DAGs. The map-reduce
plan is then interpreted by Pig's own mappers/reducers. The plan
generation itself is done on the client side and takes a few seconds or
minutes (if you have a really big script).

As for performance tuning in Hadoop, all the M/R parameters can be
adjusted in Pig to have the same effect they'd have in Java M/R
programs. Pig 0.7 is moving towards using Hadoop's input/output formats
in its load/store functions, so your custom I/O formats can be easily
reused with little additional effort.
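
To make the reuse path concrete, here is a rough sketch of what plugging an
existing custom InputFormat into the Pig 0.7-style load function contract
could look like. The method names follow the 0.7 LoadFunc redesign as I
recall it, so treat the exact signatures as approximate; TextInputFormat
stands in for your own format here.

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyLoader extends LoadFunc {
  private RecordReader reader;
  private final TupleFactory tupleFactory = TupleFactory.getInstance();

  @Override
  public InputFormat getInputFormat() throws IOException {
    return new TextInputFormat(); // swap in your custom InputFormat here
  }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null; // end of this split
      }
      // Wrap whatever the underlying format produced into a one-field tuple.
      return tupleFactory.newTuple(reader.getCurrentValue().toString());
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}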

Pig also provides very nice features like MultiQuery optimization and
skewed and merge joins that are hard to implement in Java M/R every time
you need them.

With the latest Pig release, 0.6, the performance gap between Java M/R and
Pig has narrowed considerably.

Simple statistical measures that you would use to understand or
preprocess your data are very easy to compute with just a few lines of Pig
code, and a lot of utility UDFs are available for that.
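
For a flavor of how short this can be, here is a minimal sketch that embeds
a few lines of Pig in Java via PigServer to get basic column statistics.
The input path and schema are made up; AVG, MIN, and MAX are Pig builtins:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ColumnStats {
  public static void main(String[] args) throws Exception {
    // Local mode for a quick look; use ExecType.MAPREDUCE on a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("data = LOAD 'input.tsv' AS (id:chararray, val:double);");
    pig.registerQuery("grouped = GROUP data ALL;");
    pig.registerQuery("stats = FOREACH grouped GENERATE"
        + " AVG(data.val), MIN(data.val), MAX(data.val);");
    pig.store("stats", "stats-out"); // writes the single result tuple
  }
}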

Besides all the good things, I agree that there are compatibility issues
running pig-x on hadoop-y, but this also has to do with the new features of
Hadoop that Pig is able to exploit in its pipeline.

I also agree with the general opinion that for Pig's adoption in Mahout
land, it should play well with Mahout's vector formats.

At the moment I don't have the free time to look into this, but I will
surely get back to evaluating the feasibility of this integration in the
coming few weeks. Until then, any interested folks can file a JIRA for
this and work on it.


On 2/24/10 12:27 PM, Palleti, Pallavi pallavi.pall...@corp.aol.com
wrote:

I too have mixed opinions w.r.t. Pig. Pig would be a good choice for
quickly prototyping and testing. However, the following are the pitfalls I
have observed in Pig.

It is not easy to debug in Pig. Also, it has performance issues: it is a
layer on top of Hadoop, so there is the overhead of converting Pig into
map-reduce code. Also, when the code is written directly in Hadoop, it is
in the developer's/user's hands to improve the performance by tuning
various parameters, say, the number of mappers, different input formats,
etc. That is not the case with Pig. Also, there are some compatibility
issues between Pig and Hadoop: if I am using Pig version x on Hadoop
version y, there might be incompatibilities, and one needs to spend time
resolving them, as the errors are not easy to figure out.
I believe the main goal of Mahout is to provide scalable algorithms which
can be used to solve real-world problems. In that case, if Pig has gotten
rid of the above pitfalls, then it would be a good choice, as it would
greatly reduce development effort.

Thanks
Pallavi

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a pig program that counts
words.

BUT, it takes an input file name AND an input field name.
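
A minimal sketch of one way to meet that test case, embedding Pig via
PigServer so that both the file name and the field name arrive as ordinary
Java parameters (all names here are made up; TOKENIZE, FLATTEN, and COUNT
are Pig builtins):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ParameterizedWordCount {
  // Count words found in the given field of the given file.
  public static void run(String inputFile, String fieldName) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD '" + inputFile + "' AS ("
        + fieldName + ":chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE("
        + fieldName + ")) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group AS word,"
        + " COUNT(words) AS n;");
    pig.store("counts", inputFile + "-wordcounts");
  }
}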

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com
wrote:


 That isn't an issue here.  It is the invocation of pig programs and 
 passing useful information to them that is the problem.


 On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel
gan...@yahoo-inc.comwrote:

 Scripting ability, while still limited, has better streaming support, so
 you can have relations streamed into a custom script executing in
 either the map or reduce phase, depending upon where it is placed.




 --
 Ted Dunning, CTO
 DeepDyve




--
Ted Dunning, CTO
DeepDyve



RE: Algorithm implementations in Pig

2010-02-23 Thread Palleti, Pallavi
I too have mixed opinions w.r.t. Pig. Pig would be a good choice for
quickly prototyping and testing. However, the following are the pitfalls I
have observed in Pig.

It is not easy to debug in Pig. Also, it has performance issues: it is a
layer on top of Hadoop, so there is the overhead of converting Pig into
map-reduce code. Also, when the code is written directly in Hadoop, it is
in the developer's/user's hands to improve the performance by tuning
various parameters, say, the number of mappers, different input formats,
etc. That is not the case with Pig. Also, there are some compatibility
issues between Pig and Hadoop: if I am using Pig version x on Hadoop
version y, there might be incompatibilities, and one needs to spend time
resolving them, as the errors are not easy to figure out.
I believe the main goal of Mahout is to provide scalable algorithms which
can be used to solve real-world problems. In that case, if Pig has gotten
rid of the above pitfalls, then it would be a good choice, as it would
greatly reduce development effort.

Thanks
Pallavi

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a pig program that counts
words.

BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com
wrote:


 That isn't an issue here.  It is the invocation of pig programs and 
 passing useful information to them that is the problem.


 On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel
gan...@yahoo-inc.comwrote:

 Scripting ability, while still limited, has better streaming support, so
 you can have relations streamed into a custom script executing in
 either the map or reduce phase, depending upon where it is placed.




 --
 Ted Dunning, CTO
 DeepDyve




--
Ted Dunning, CTO
DeepDyve


RE: Fuzzy K Means

2010-02-17 Thread Palleti, Pallavi
How many iterations of FuzzyKMeans are you running? Here is my
observation: when I ran it for a few iterations, the cluster centroids were
far apart. However, when I ran it for more than 50 iterations or so, the
cluster points were still different, but they were very nearby, almost as
if they were the same. By the way, I am using m=3 in the membership function.
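
For reference, the standard fuzzy k-means (fuzzy c-means) membership of
point x_j in cluster i, with c clusters and fuzziness parameter m, is

    u_{ij} = 1 / \sum_{k=1}^{c} ( \lVert x_j - c_i \rVert / \lVert x_j - c_k \rVert )^{2/(m-1)}

As m grows, the memberships flatten toward uniform and the centroids get
pulled toward one another (in the limit they all collapse to the global
mean), so nearly coincident centroids at m=3 after many iterations would be
consistent with this formula.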

Thanks
Pallavi

-Original Message-
From: Robin Anil [mailto:robin.a...@gmail.com] 
Sent: Wednesday, February 17, 2010 8:10 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Fuzzy K Means

Tests are passing fine, but not when testing Reuters.

On Wed, Feb 17, 2010 at 8:07 PM, Pallavi Palleti 
pallavi.pall...@corp.aol.com wrote:

 If we just need to verify with some sample dataset, we already have
 the data in the TestFuzzyKMeansClustering code. Won't that suffice?
 Otherwise, I need to manually generate some sample dataset, as I don't
 have a small dataset at hand. I am actually running on MovieLens
 data, using movie ratings as the vector (movie as dimension, rating as
 coefficient) and user as the point.


 Thanks
 Pallavi

 Robin Anil wrote:

 I tracked the versions back to before the change to Writables was
 done. There is no significant change in the code.

 Can you give me a small dataset, 10 points with maybe 5 dimensions? I can
 verify the trunk in that case.

 Robin

 On Wed, Feb 17, 2010 at 7:49 PM, Pallavi Palleti  
 pallavi.pall...@corp.aol.com wrote:



 I have a local version which I submitted long back; I am
 using it on real data, and it is not giving the same point for all clusters.

 However, I haven't tried the latest Mahout code. I have kept my
 code outputting data as text so that it is easy for me to verify.
 However, the current Mahout code outputs it as binary data (as a
 SequenceFile), so it is difficult to verify.


 Thanks
 Pallavi

 Robin Anil wrote:



 Have you verified the trunk code on some real data? I am getting the
 same point for all clusters regardless of the distance measure.

 Robin



 On Wed, Feb 17, 2010 at 6:41 PM, Pallavi Palleti  
 pallavi.pall...@corp.aol.com wrote:

 Yes. It shouldn't be a problem. My point was that we inherit
 numPoints as part of ClusterBase, though we are not using it in
 SoftCluster. Other than that, I don't see any issue w.r.t.
 functionality.


 Thanks
 Pallavi

 Robin Anil wrote:

 In the impl of SoftCluster, on write(out) it calculates the
 centroid and writes it, and on read(in) it reads the centroid
 into the center.

 In ClusterDumper it reads into the ClusterBase and does
 value.getCenter(); it should work normally, right?

 Robin



 On Wed, Feb 17, 2010 at 6:02 PM, Pallavi Palleti  
 pallavi.pall...@corp.aol.com wrote:

 Yes, but not the total number of points. So the numPoints from
 ClusterBase will not be used in SoftCluster; numPoints is
 specific to k-means, similar to the weighted point total for
 fuzzy k-means.


 Robin Anil wrote:

 the center is still the averaged-out centroid, right?
 weightedTotalVector / totalProbWeight
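
 In formula form, with u_{ij} the membership of point x_j in cluster i and
 m the fuzziness exponent, that weighted centroid is

     c_i = \sum_j u_{ij}^m x_j / \sum_j u_{ij}^m

 i.e. weightedTotalVector divided by totalProbWeight.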



 On Wed, Feb 17, 2010 at 5:10 PM, Pallavi Palleti  
 pallavi.pall...@corp.aol.com wrote:

 I haven't yet gone through ClusterDumper. However, ClusterBase
 would have the number of points to average out
 (pointTotal/numPoints, as per k-means), whereas SoftCluster will
 have a weighted point total. So I am wondering how we can reuse
 ClusterBase here?


 Thanks
 Pallavi

 Robin Anil wrote:

 Yes, so that ClusterDumper can print it out.

 On Wed, Feb 17, 2010 at 5:02 PM, Pallavi Palleti  
 pallavi.pall...@corp.aol.com wrote:

 Hi Robin,

 When you say reusing ClusterBase, are you planning to
 extend ClusterBase in SoftCluster, i.e., SoftCluster
 extends ClusterBase?

 Thanks
 Pallavi


 Robin Anil wrote:

 I have been trying to convert FuzzyKMeans' SoftCluster (which
 should ideally be named FuzzyKMeansCluster) to use ClusterBase.

 I am getting *the same center* for all the clusters. To aid
 the conversion, all I did was remove the center vector from
 the SoftCluster class and reuse the one from ClusterBase. This
 essentially makes no change to the tests, which pass correctly.

 So I am questioning whether the implementation keeps the
 average center at all. Has anyone who has used FuzzyKMeans
 experienced this?


 Robin


RE: Yourkit License for all of you

2009-09-03 Thread Palleti, Pallavi
Hi Robin,

I would like to have the license if possible.

Thanks
Pallavi

-Original Message-
From: Robin Anil [mailto:robin.a...@gmail.com] 
Sent: Wednesday, September 02, 2009 2:03 PM
To: mahout-dev
Subject: Yourkit License for all of you

Dear Mahout Devs, the YourKit sales rep gave me my open-source
license. If anyone would like to get one, I can aggregate and send all the
requests to him.

If you would like to have an open-source license of the YourKit Profiler,
reply on this thread within 24 hours of this email. I will compile the list
and send it across.

Robin


RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Palleti, Pallavi
Yeah. But I am wondering how the test cases succeeded. I ran them using the
mvn clean install command.

Thanks
Pallavi

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The Synthetic Control kMeans job calls the Canopy job to build its initial
clusters, as is commonly done. If the kMeans record format was changed and
Canopy was not changed accordingly, then everything would still compile, but
there would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff


Richard Tomsett (JIRA) wrote:
 [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]

 Richard Tomsett commented on MAHOUT-99:
 ---

 Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get
 the same error on the Synthetic Control example. It seems to be because the
 new KMeans code uses a KeyValueLineRecordReader object to read the input
 cluster centres from the canopy clustering output, but the canopy clustering
 job outputs a SequenceFile (and the old KMeans code read in a SequenceFile
 for the cluster centres). Think that's the problem at least; I'll have a
 quick play.
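
 For context, a minimal sketch of reading the cluster centres the way the
 canopy output is actually written, i.e. as a SequenceFile rather than
 line-oriented text. This uses the stock era-appropriate Hadoop I/O API; the
 path and the Text key/value types are assumptions about that output:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;

 public class ReadCanopyCentres {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     Path part = new Path("output/canopies/part-00000"); // hypothetical path
     SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
     Text key = new Text();   // assumed key type
     Text value = new Text(); // assumed value type
     while (reader.next(key, value)) {
       System.out.println(key + " -> " + value); // one initial cluster per record
     }
     reader.close();
   }
 }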

   
 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
 MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper to
 reducer. Previously, the whole cluster info was being sent as a formatted string.
 Also removed the implicit assumption that the combiner runs only once; the
 code is modified accordingly so that it won't create a bug when the combiner
 runs zero or more than once.
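
 As an illustration of the combiner-safety point, here is a hedged sketch
 (names hypothetical, Hadoop plumbing omitted) of the standard pattern that
 stays correct whether the combiner runs zero, one, or many times: ship
 partial (sum, count) pairs keyed by the compact cluster ID and merge them
 associatively in both combiner and reducer.

 // Hypothetical sketch: a combiner-safe partial aggregate for k-means.
 public class ClusterPartial {
   double[] sum; // running vector sum of points assigned to the cluster
   long count;   // number of points folded into sum

   ClusterPartial(int dims) { sum = new double[dims]; }

   void addPoint(double[] point) {
     for (int i = 0; i < point.length; i++) sum[i] += point[i];
     count++;
   }

   // Associative and commutative, so zero or more combiner passes are safe.
   void merge(ClusterPartial other) {
     for (int i = 0; i < sum.length; i++) sum[i] += other.sum[i];
     count += other.count;
   }

   double[] centroid() {
     double[] c = new double[sum.length];
     for (int i = 0; i < sum.length; i++) c[i] = sum[i] / count;
     return c;
   }
 }
 // The mapper emits (clusterId, partial-with-one-point); combiner and reducer
 // both just call merge(), so only the compact cluster ID travels as the key.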
 

   



RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Palleti, Pallavi
It depends on the kind of output. If we are just outputting some numeric
values, then it is preferable to use a SequenceFile, as the data is written
as binary. If not, it is preferable to write it as simple text. A text file
is human-readable, whereas binary is not.

As we treat the data as text in the reducers of both Canopy and KMeans, I
don't see any performance improvement in using SequenceFile. So I used
TextInputFormat, which is read-friendly.
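
For concreteness, a small sketch with the old mapred API of that era showing
the two choices being discussed. The class names are the stock Hadoop ones;
the job specifics are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputFormatChoice {
  public static JobConf configure(boolean humanReadable) {
    JobConf conf = new JobConf(OutputFormatChoice.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    if (humanReadable) {
      conf.setOutputFormat(TextOutputFormat.class);         // easy to inspect
    } else {
      conf.setOutputFormat(SequenceFileOutputFormat.class); // compact binary
    }
    FileOutputFormat.setOutputPath(conf, new Path("output/clusters"));
    return conf;
  }
}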
 
Thanks
Pallavi

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also, why not consider just converting Canopy? Which reader is better?


Jeff Eastman wrote:
 * PGP Signed: 03/18/09 at 21:37:36

 Sure, why don't you go ahead and post a patch?


 Pallavi Palleti (JIRA) wrote:
 [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
 Pallavi Palleti commented on MAHOUT-99:
 ---

 I have used KeyValueLineRecordReader internally for my code and
 forgot to revert back to SequenceFileReader. Will it be sufficient
 to add another patch on the latest code, modifying only KMeansDriver
 to use SequenceFileReader? Kindly let me know.

 Thanks
 Pallavi

  
 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
 MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper
 to reducer. Previously, the whole cluster info was being sent as a
 formatted string.
 Also removed the implicit assumption that the combiner runs only once;
 the code is modified accordingly so that it won't create a bug when the
 combiner runs zero or more than once.
 

   


 * Jeff Eastman j...@windwardsolutions.com
 * 0x6BFF1277





RE: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

2008-10-18 Thread Palleti, Pallavi
Hi Grant,
 Let me know if you are still facing this issue.

Thanks
Pallavi

-Original Message-
From: Grant Ingersoll (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 17, 2008 11:48 PM
To: mahout-dev@lucene.apache.org
Subject: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by 
optimizing data transfer between map and reduce tasks


[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640608#action_12640608 ]

Grant Ingersoll commented on MAHOUT-79:
---

Pallavi,

I'm getting:
 [junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to 
process : 2
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to 
process : 2
[junit] 08/10/17 14:15:03 INFO mapred.JobClient: Running job: job_local_0002
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to 
process : 2
[junit] 08/10/17 14:15:03 INFO mapred.FileInputFormat: Total input paths to 
process : 2
[junit] 08/10/17 14:15:03 INFO mapred.MapTask: numReduceTasks: 0
[junit] 08/10/17 14:15:03 INFO fuzzykmeans.FuzzyKMeansMapper: In Mapper 
Configure:
[junit] 08/10/17 14:15:03 WARN mapred.LocalJobRunner: job_local_0002
[junit] java.lang.NullPointerException: Cluster is empty!!!
[junit] at 
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.configure(FuzzyKMeansMapper.java:76)
[junit] at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
[junit] at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
[junit] at 
org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
[junit] at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
[junit] at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
[junit] at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
[junit] at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
[junit] 08/10/17 14:15:04 WARN fuzzykmeans.FuzzyKMeansDriver: 
java.io.IOException: Job failed!
[junit] java.io.IOException: Job failed!
[junit] at 
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
[junit] at 
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runClustering(FuzzyKMeansDriver.java:207)
[junit] at 
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runJob(FuzzyKMeansDriver.java:116)
[junit] at 
org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering.testFuzzyKMeansMRJob(TestFuzzyKmeansClustering.java:248)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at junit.framework.TestCase.runTest(TestCase.java:164)
[junit] at junit.framework.TestCase.runBare(TestCase.java:130)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:120)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
[junit] at junit.framework.TestSuite.run(TestSuite.java:225)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
[junit] -  ---
[junit] Testcase: 
testFuzzyKMeansMRJob(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering):
 Caused an ERROR
[junit] output/points/part-0 (No such file or directory)
[junit] java.io.FileNotFoundException: output/points/part-0 (No such 
file or directory)
[junit] at java.io.FileInputStream.open(Native Method)
[junit] at java.io.FileInputStream.<init>(FileInputStream.java:106)
[junit] at java.io.FileInputStream.<init>(FileInputStream.java:66)
[junit] at 
org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering.testFuzzyKMeansMRJob(TestFuzzyKmeansClustering.java:257)
[junit] 
[junit] 
[junit] Testcase: 
testFuzzyKMeansReducer(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering):
   Caused an ERROR
[junit] For input string: 9.0, [s2, 0
[junit] java.lang.NumberFormatException: For input string: 9.0, [s2, 0

RE: Any one working on Cluto- Clustering Algorithm or similar to this?

2008-10-12 Thread Palleti, Pallavi
I am interested in http://glaros.dtc.umn.edu/gkhome/node/193. I have yet
to go through this publication; only after that will I be able to say
clearly. Right now, I am interested in the algorithms and in any pruning or
filtering that is done for high-dimensional sparse data sets in Cluto.

Thanks
Pallavi

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 08, 2008 4:57 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Any one working on Cluto- Clustering Algorithm or similar
to this?


On Oct 8, 2008, at 1:26 AM, Palleti, Pallavi wrote:

 Hi all,



 I have come across Cluto (http://glaros.dtc.umn.edu/gkhome/views/cluto),
 a clustering algorithm package. I would like to know if there is any work
 going on in Mahout in this regard. If yes, I am willing to use it. If
 not, I might be interested in working on a similar clustering algorithm.


I don't think anyone is working on it, but I am curious about what you
intend to do. The link you give points to a whole package of tools, AFAICT.
Is there a publication
(http://glaros.dtc.umn.edu/gkhome/cluto/cluto/publications)
or algorithm that you are particularly interested in?


Any one working on Cluto- Clustering Algorithm or similar to this?

2008-10-07 Thread Palleti, Pallavi
Hi all,

 

 I have come across Cluto (http://glaros.dtc.umn.edu/gkhome/views/cluto),
a clustering algorithm package. I would like to know if there is any work
going on in Mahout in this regard. If yes, I am willing to use it. If
not, I might be interested in working on a similar clustering algorithm.

 

Thanks

Pallavi



RE: OutOfMemory Error

2008-09-18 Thread Palleti, Pallavi
Yeah. That was the problem. And Hama can surely be useful for large-scale
matrix operations.

But for this problem, I have modified the code to just pass the ID
information and read the vector information only when it is needed. In this
case, it was needed only in the reducer phase. This way, it avoided the
out-of-memory error, and it is also faster now.
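
A hedged sketch of that shape of fix (all names hypothetical, the side
lookup simplified): only the small ID crosses the shuffle, and the heavy
vector is materialized on demand where it is actually needed.

import java.util.HashMap;
import java.util.Map;

public class LazyVectorLookup {
  // id -> vector, filled on demand in the reducer.
  private final Map<String, double[]> cache = new HashMap<String, double[]>();

  public double[] vectorFor(String id) {
    double[] v = cache.get(id);
    if (v == null) {
      v = loadFromSideData(id); // only IDs crossed the shuffle; load here
      cache.put(id, v);
    }
    return v;
  }

  private double[] loadFromSideData(String id) {
    // In the real job this would read the (id -> vector) side data,
    // e.g. from HDFS or the distributed cache.
    return new double[0]; // placeholder
  }
}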

Thanks
Pallavi
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Friday, September 19, 2008 10:35 AM
To: [EMAIL PROTECTED]; mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: OutOfMemory Error

 The key is of the form ID :DenseVector Representation in mahout with

I guess the vector size is too large, so it will need a distributed vector
architecture (or 2D partitioning strategies) for large-scale matrix
operations. The Hama team investigates these problem areas. So it will
be improved if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti [EMAIL PROTECTED] wrote:

 Hadoop Version - 17.1
 io.sort.factor =10
 The key is of the form ID:DenseVector representation in Mahout, with
 dimensionality = 160k.
 For example: C1:[0.0011, 3.002, ..., 1.001]
 So the typical size of a mapper output key can be 160K*6 bytes (assuming a
 double in string form takes 5 bytes plus a separator) + 5 bytes (for
 "C1:[]") + the overhead of storing the object as Text, i.e. roughly 1 MB
 per key.

 Thanks
 Pallavi



 Devaraj Das wrote:




 On 9/17/08 6:06 PM, Pallavi Palleti [EMAIL PROTECTED] wrote:


 Hi all,

   I am getting an out-of-memory error, as shown below, when I ran map-red
 on a huge amount of data:
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
 at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
 at
 org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
 at
 org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(Sequence
 File.java:3002)
 at
 org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:28
 02)
 at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124
 The above error comes almost at the end of the map job. I have set the heap
 size to 1 GB, but the problem persists. Can someone please help me figure
 out how to avoid this error?
 What is the typical size of your key? What is the value of io.sort.factor?
 Hadoop version?





 --
 View this message in context: 
 http://www.nabble.com/OutOfMemory-Error-tp19531174p19545298.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


RE: [jira] Updated: (MAHOUT-74) Fuzzy K-Means clustering

2008-08-21 Thread Palleti, Pallavi
Great. Thanks Grant for the modifications.

-Original Message-
From: Grant Ingersoll (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 21, 2008 7:09 PM
To: mahout-dev@lucene.apache.org
Subject: [jira] Updated: (MAHOUT-74) Fuzzy K-Means clustering


 [ 
https://issues.apache.org/jira/browse/MAHOUT-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-74:
--

Attachment: MAHOUT-74.patch

Looking pretty good, Pallavi.  I modified it slightly so that m is set just via 
the JobConf like the other values.  I think we are in pretty good shape and I 
will commit soon.  I also made m a float.  Looking at the wiki link you have 
there, I don't see any reason why m should be restricted to an int.

 Fuzzy K-Means clustering
 

 Key: MAHOUT-74
 URL: https://issues.apache.org/jira/browse/MAHOUT-74
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.1

 Attachments: MAHOUT-74.patch, MAHOUT-74.patch, mahout-74.patch, 
 mahout-74.patch


 The Fuzzy KMeans clustering algorithm is an extension of the traditional
 K-Means clustering algorithm and performs soft clustering.
 More details about fuzzy k-means can be found here:
 http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering
 I have implemented fuzzy K-Means prototype and tests in 
 org.apache.mahout.clustering.fuzzykmeans

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: asFormatString tests fail

2008-08-10 Thread Palleti, Pallavi
Along the same lines of optimization:
I added some optimizations for SparseVector, especially for operators like
minus, plus, and divide. Please look at the MAHOUT-67 and MAHOUT-66 patches
for the same.
The methods in AbstractVector were getting called for minus, plus, and
divide, and they iterated through all the keys irrespective of whether a
key held an empty value or not.
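
The gist of that optimization, as a small hedged sketch (a hypothetical
map-backed sparse vector, not the actual Mahout classes): touch only the
stored entries instead of every dimension.

import java.util.HashMap;
import java.util.Map;

public class SparsePlus {
  // Sparse vectors as index -> non-zero value maps (hypothetical layout).
  static Map<Integer, Double> plus(Map<Integer, Double> a, Map<Integer, Double> b) {
    Map<Integer, Double> result = new HashMap<Integer, Double>(a);
    // Iterate only b's stored entries, never the full dimensionality.
    for (Map.Entry<Integer, Double> e : b.entrySet()) {
      Double cur = result.get(e.getKey());
      result.put(e.getKey(), cur == null ? e.getValue() : cur + e.getValue());
    }
    return result;
  }
}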

Thanks
Pallavi 

-Original Message-
From: Sean Owen [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 09, 2008 12:28 AM
To: mahout-dev@lucene.apache.org
Subject: Re: asFormatString tests fail

Yeah I'm not really worried about the boxing/unboxing yet, since it is
buying some code simplicity, though I took the liberty of eliminating
boxing where it is redundant, like:

  int c = new Integer(someString)

versus

  int c = Integer.parseInt(someString)

I agree we can go to that trouble if it becomes clear it is
non-trivially slowing down the code. It may well.

I was more interested in more obvious wins by adjusting use of some
Collections API methods as in dot(). I'd be happy to hack away on this
but I am hesitating about doing anything but trivial changes to code
others are working on, while it might be looked at as premature. If
it's viewed as a good thing I can go for it.

How about changing asFormatString() to sort its output? Was that the
right solution, or is its output order not guaranteed? I could take
care of it.

On Fri, Aug 8, 2008 at 2:45 PM, Ted Dunning [EMAIL PROTECTED] wrote:
 Worrying about small effects like iterating over keys or entries should be
 made moot by just switching to a dedicated primitive-based hash table.
 Trove has a nice implementation, but I believe the license would prevent
 its use. Colt has another, not quite as nice, implementation that is fast,
 and I think it comes under a BSD license. It is also very easy to hack a
 special-purpose structure into place.
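
For instance, a sketch of a sparse dot product over Colt's primitive
int->double map; OpenIntDoubleHashMap and IntDoubleProcedure are the Colt
types as I remember them, so treat the exact method names as approximate:

import cern.colt.function.IntDoubleProcedure;
import cern.colt.map.OpenIntDoubleHashMap;

public class SparseDot {
  // Iterate only a's stored entries (pass the smaller map as 'a');
  // no Integer/Double boxing anywhere.
  static double dot(OpenIntDoubleHashMap a, final OpenIntDoubleHashMap b) {
    final double[] sum = {0.0};
    a.forEachPair(new IntDoubleProcedure() {
      public boolean apply(int index, double value) {
        sum[0] += value * b.get(index); // get() returns 0 for absent keys
        return true; // keep iterating
      }
    });
    return sum[0];
  }
}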