RE: How to speed up a Map/Reduce job?

2011-02-01 Thread Black, Michael (IS)
Try this rather small C++ program...it will more than likely be a LOT faster than anything you could do in Hadoop. Hadoop is not the hammer for every nail. Too many people think that any "cluster" solution will automagically scale their problem...tain't true. I'd appreciate hearing your resul

RE: EXTERNAL: Re: Import data from mysql

2011-01-10 Thread Black, Michael (IS)
Another thing...where do you get 2M from? 10,000 x 100 = 1,000,000. So you'd have an absolute max of 1M to do for each 100...2M total for your example...gender pref cuts that in half or so...plus other prefs cut it further... And 2M calculations is relatively nothing compared to 10,000^2 (
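
Making the truncated comparison explicit, with the numbers from the post:

    10,000 x 100 =   1,000,000   (each user scored against ~100 candidates)
    10,000^2     = 100,000,000   (all pairs)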

Re: Import data from mysql

2011-01-10 Thread Black, Michael (IS)
You need to stop looking at this as an all-or-nothing...and look at it more like real-time. You only need to do an absolute max of 1*10,000 at a time. And...you actually only need to do considerably less than that with age preference and other factors for the users...and doing the computat
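
A minimal Java sketch of the idea...score one user only against candidates that pass cheap preference filters, instead of all pairs. The User fields and score() here are assumptions, not code from the thread:

    import java.util.ArrayList;
    import java.util.List;

    public class MatchSketch {
        // Hypothetical user record; only the fields needed for filtering.
        static class User {
            String gender;        // this user's gender
            String wantsGender;   // preferred gender of a match
            int age, minAge, maxAge;
        }

        // Cheap filters first: most candidates drop out before any scoring.
        static boolean passesPrefs(User a, User b) {
            return b.gender.equals(a.wantsGender)
                && b.age >= a.minAge && b.age <= a.maxAge;
        }

        // Placeholder for the expensive scoring function.
        static double score(User a, User b) {
            return 0.0;
        }

        // Score u against the filtered subset only...far fewer than 10,000 calls.
        static List<Double> scoresFor(User u, List<User> everyone) {
            List<Double> out = new ArrayList<Double>();
            for (User cand : everyone) {
                if (cand != u && passesPrefs(u, cand)) {
                    out.add(score(u, cand));
                }
            }
            return out;
        }
    }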

Re: Import data from mysql

2011-01-10 Thread Black, Michael (IS)
I had no idea the kimono comment would be so applicable to your problem... Everything makes sense except the Bayesian computation. Your "score" can be computed on subsets...in particular you only need to do it on "new" and "changed" records. Most of which should be pretty static (age needs
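
A sketch of the "only new and changed records" point, assuming a per-record dirty flag (an assumption, not something described in the thread):

    import java.util.List;

    public class IncrementalScore {
        // Hypothetical record: dirty is set on insert/update, cleared after scoring.
        static class Record {
            boolean dirty;
            double score;
        }

        // Placeholder for the Bayesian computation mentioned in the post.
        static double computeScore(Record r) {
            return 0.0;
        }

        // Rescore only what changed; static records keep their old score.
        static void rescore(List<Record> records) {
            for (Record r : records) {
                if (r.dirty) {
                    r.score = computeScore(r);
                    r.dirty = false;
                }
            }
        }
    }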

Re: Import data from mysql

2011-01-09 Thread Black, Michael (IS)
All you're doing is delaying the inevitable by going to Hadoop. There's no magic to Hadoop. It doesn't run as fast as individual processes. There's just the ability to split jobs across a cluster, which works for some problems. You won't even get a linear improvement in speed. At least I as

RE: EXTERNAL: Re: Import data from mysql

2011-01-09 Thread Black, Michael (IS)
All you're doing is delaying the inevitable by going to Hadoop. There's no magic to Hadoop. It doesn't run as fast as individual processes. There's just the ability to split jobs across a cluster, which works for some problems. You won't even get a linear improvement in speed. At least I as

RE: Import data from mysql

2011-01-09 Thread Black, Michael (IS)
What kind of compare do you have to do? You should be able to compute a checksum or such for each row when you insert them and only have to look at the subset that matches if you're doing some sort of substring or such. Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop
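
A minimal Java sketch of the checksum idea...compute a per-row checksum at insert time and store it, so a later compare only has to examine the subset of rows whose checksums match. CRC32 and the column layout are assumptions, not from the post:

    import java.util.zip.CRC32;

    public class RowChecksum {
        // Checksum a row's columns; store the result alongside the row.
        public static long checksum(String... columns) {
            CRC32 crc = new CRC32();
            for (String col : columns) {
                crc.update(col.getBytes());
                crc.update(0); // field separator so ("ab","c") != ("a","bc")
            }
            return crc.getValue();
        }

        public static void main(String[] args) {
            // At compare time: fetch only rows WHERE checksum = ? and
            // verify the handful that match.
            System.out.println(checksum("alice", "1984-02-01", "NYC"));
        }
    }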

RE: Rngd

2011-01-04 Thread Black, Michael (IS)
http://sourceforge.net/projects/gkernel/files/rng-tools rngd is in there. Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop Grumman Information Systems From: Jon Lederman [mailto:jon2...@mac.com] Sent: Tue 1/4/2011 2:00 PM To: commo

Re: HDFS FS Commands Hanging System

2011-01-02 Thread Black, Michael (IS)
Did you set your config and format the namenode as per these instructions? http://hadoop.apache.org/common/docs/current/single_node_setup.html Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop Grumman Information Systems

RE: HDFS FS Commands Hanging System

2010-12-31 Thread Black, Michael (IS)
Try checking your dfs status hadoop dfsadmin -safemode get Probably says "ON" hadoop dfsadmin -safemode leave Somebody else can probably say how to make this happen every reboot Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop Grumman Information Systems

RE: ClassNotFoundException

2010-12-28 Thread Black, Michael (IS)
I'm using hadoop-0.20.2 and I see this for my map/reduce class com/ngc/asoc/recommend/Predict$Counter.class com/ngc/asoc/recommend/Predict$R.class com/ngc/asoc/recommend/Predict$M.class com/ngc/asoc/recommend/Predict.class I'm a Java idiot so I don't know why they appear but perhaps you have sim
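
Those Outer$Inner.class files come from nested classes. A hedged hadoop-0.20 sketch of the pattern that produces them (the names echo the listing above, but the bodies are invented)...setJarByClass is what ships the jar so the task JVMs can find the nested classes, and leaving them out of the job jar is a classic cause of ClassNotFoundException:

    package com.ngc.asoc.recommend;

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Predict {
        // Compiles to Predict$Counter.class.
        public static enum Counter { RECORDS }

        // Compiles to Predict$M.class; must end up inside the job jar.
        public static class M extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, ONE); // trivial body; the point is the class file
            }
        }

        // Compiles to Predict$R.class.
        public static class R extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "predict");
            job.setJarByClass(Predict.class); // ships the jar holding all Predict$*.class files
            job.setMapperClass(M.class);
            job.setReducerClass(R.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }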

Re: Custom input split

2010-12-26 Thread Black, Michael (IS)
You mean the file is "not trusted". I was using Outlook and my company automatically puts a digital certificate on all emails. I'm using webmail right now which doesn't. That certificate is installed by default on all company computers so it looks trusted to us without having to explicitly t

Custom input split

2010-12-24 Thread Black, Michael (IS)
Using hadoop-0.20 I'm doing custom input splits from a Lucene index. I want to split the document IDs across N mappers (I'm testing the scalability of the problem across 4 nodes and 8 cores). So the key is the document# and they are not sequential. At this point I'm using splits.add to add eac
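
A hedged sketch of one way to carry non-sequential document IDs in custom splits with the new (org.apache.hadoop.mapreduce) API...the DocIdSplit class and the round-robin dealing are assumptions, not the poster's code:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    // A split carrying an explicit slice of (non-sequential) Lucene doc IDs.
    public class DocIdSplit extends InputSplit implements Writable {
        private int[] docIds = new int[0];

        public DocIdSplit() { }                          // needed for deserialization
        public DocIdSplit(int[] docIds) { this.docIds = docIds; }

        public int[] getDocIds() { return docIds; }

        @Override public long getLength() { return docIds.length; }
        @Override public String[] getLocations() { return new String[0]; } // no locality hint

        @Override public void write(DataOutput out) throws IOException {
            out.writeInt(docIds.length);
            for (int id : docIds) out.writeInt(id);
        }

        @Override public void readFields(DataInput in) throws IOException {
            docIds = new int[in.readInt()];
            for (int i = 0; i < docIds.length; i++) docIds[i] = in.readInt();
        }

        // Deal the IDs round-robin into numSplits splits, as one might do
        // inside a custom InputFormat's getSplits() with splits.add(...).
        public static List<InputSplit> makeSplits(int[] allIds, int numSplits) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (int s = 0; s < numSplits; s++) {
                int count = (allIds.length - s + numSplits - 1) / numSplits;
                int[] slice = new int[count];
                for (int i = s, j = 0; i < allIds.length; i += numSplits, j++) {
                    slice[j] = allIds[i];
                }
                splits.add(new DocIdSplit(slice));
            }
            return splits;
        }
    }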

RE: EXTERNAL:Tasktracker failing and getting black listed

2010-12-24 Thread Black, Michael (IS)
#1 Check the CPU fan is working. A hot CPU can give flaky errors...especially during high CPU load. #2 Do a memtest on the machine. You might have a bad memory stick that is getting hit (though I would tend to think it would be a bit more random). I've used memtest86 before to find such problems. h

dictionary.csv

2010-12-23 Thread Black, Michael (IS)
Using hadoop-0.20.2+737 on Redhat's distribution. I'm trying to use a dictionary.csv file from a Lucene index inside a map function plus another comma-delimited file. It's just a simple loop of reading a line, splitting the line on commas, and adding the dictionary entry to a hash map. It's about an
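
A minimal Java sketch of that loop (the file layout, key in the first field and value in the second, is an assumption):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryLoader {
        // Read dictionary.csv once (e.g. in the mapper's setup) so that
        // lookups during map() are in-memory hash probes.
        public static Map<String, String> load(String path) throws IOException {
            Map<String, String> dict = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(",");
                    if (fields.length >= 2) {
                        dict.put(fields[0], fields[1]);
                    }
                }
            } finally {
                in.close();
            }
            return dict;
        }
    }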