A new way to merge up those small files!

2010-09-24 Thread Edward Capriolo
Many times a Hadoop job produces one file per reducer and the job has many reducers. Or a map-only job produces one output file per input file and you have many input files. Or you just have many small files from some external process. Hadoop has suboptimal handling of small files. There are some ways to handle…
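One common workaround (not necessarily the one this thread proposes) is to concatenate a directory of small files into a single HDFS file with FileUtil.copyMerge. A minimal sketch; the paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class SmallFileMerge {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate everything under the source directory into one file,
        // keeping the originals (pass true to delete them instead).
        FileUtil.copyMerge(fs, new Path("/data/small-files"),
                           fs, new Path("/data/merged/all.txt"),
                           false, conf, null);
      }
    }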

Re: do you need to call super in Mapper.Context.setup()?

2010-09-24 Thread Lance Norskog
Maybe. Maybe not. Maybe not this time, but next time yes. It's just bulletproofing, like checking for nulls everywhere. Mark Kerzner wrote: Hi, any need for this? protected void setup(Mapper.Context context) throws IOException, InterruptedException { super.setup(context); // TODO - d…
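In the current Mapper base class, setup() is a no-op, so the call is harmless; it only matters if you later extend a mapper that does real work in setup(). A sketch of the defensive pattern:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        super.setup(context); // a no-op in the base class, but cheap insurance
        // ... one-time per-task initialization goes here ...
      }
    }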

tasktracker takes long time to start?

2010-09-24 Thread jiang licht
I noticed a tasktracker (not a new node) only starts heartbeating with the jobtracker after a couple of minutes. What can cause this? Here's the log output when a tasktracker is restarted after a clean shutdown of the cluster but hangs for 9 minutes (the network connection was good when the TT started):…

Re: jdbc in Hadoop mapper

2010-09-24 Thread Shi Yu
You are right. I ran stop-all.sh and start-all.sh; now it works fine. Thanks! On 2010-9-24 15:30, Harsh J wrote: My guess: Either you haven't put the JAR on all the tasktracker machines, or you have not restarted your tasktrackers and jobtracker after doing so. On Sat, Sep 25, 2010 at 1:47 AM,…

Proper blocksize and io.sort.mb setting when using compressed LZO files

2010-09-24 Thread pig
Hello, We just recently switched to using LZO-compressed file input for our Hadoop cluster, using Kevin Weil's LZO library. The files are pretty uniform in size at around 200MB compressed. Our block size is 256MB. Decompressed, the average LZO input file is around 1.0GB. I noticed lots of our jobs…
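Worth noting: plain .lzo files are not splittable until they are indexed (the library ships an indexer for this), so each compressed file may go to a single mapper regardless of block size. The relevant knobs, sketched below with illustrative values only; tune them to your own cluster:

    import org.apache.hadoop.conf.Configuration;

    public class TuningSketch {
      public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                // map-side sort buffer, in MB
        conf.setFloat("io.sort.spill.percent", 0.80f); // spill when buffer is 80% full
        conf.setLong("dfs.block.size", 268435456L);    // 256MB block size, in bytes
        return conf;
      }
    }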

Re: jdbc in Hadoop mapper

2010-09-24 Thread Harsh J
My guess: Either you haven't put the JAR on all the tasktracker machines, or you have not restarted your tasktrackers and jobtracker after doing so. On Sat, Sep 25, 2010 at 1:47 AM, Shi Yu wrote: > Hi, > > I tried to combine an in-memory MySQL database with MapReduce to do some value > exchanges. In…
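An alternative to copying the JAR onto every node is to ship it per-job from HDFS via the distributed cache. A sketch; the HDFS path and connector version are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class ShipJar {
      public static void addDriver(Configuration conf) throws Exception {
        // Upload the connector JAR to HDFS first, then add it to every
        // task's classpath for this job.
        DistributedCache.addFileToClassPath(
            new Path("/lib/mysql-connector-java-5.1.13.jar"), conf);
      }
    }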

jdbc in Hadoop mapper

2010-09-24 Thread Shi Yu
Hi, I tried to combine an in-memory MySQL database with MapReduce to do some value exchanges. In the Mapper, I declare the MySQL driver like this: import com.mysql.jdbc.*; import java.sql.DriverManager; import java.sql.SQLException; String driver =…
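The usual shape of this, sketched below with placeholder connection details: register the driver and open the connection once per task in setup(), and close it in cleanup():

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JdbcMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Connection conn;

      @Override
      protected void setup(Context context) throws IOException {
        try {
          // The driver class must be on every task's classpath
          // (see the replies above about shipping the JAR).
          Class.forName("com.mysql.jdbc.Driver");
          conn = DriverManager.getConnection(
              "jdbc:mysql://dbhost:3306/mydb", "user", "pass");
        } catch (Exception e) {
          throw new IOException("Could not open JDBC connection", e);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        try {
          if (conn != null) conn.close();
        } catch (SQLException ignored) {
        }
      }
    }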

Re: Shuffle tasks getting killed

2010-09-24 Thread cliff palmer
I'm glad it helped, Aniket. I would recommend that you start working on performance improvement with your network infrastructure and the balance of data across your logical racks. Cliff On Fri, Sep 24, 2010 at 12:12 AM, aniket ray wrote: > Hi Cliff, > > Thanks, it did turn out to be speculative execution…
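For reference, speculative execution can be turned off per job with the old-API property names of this era:

    import org.apache.hadoop.conf.Configuration;

    public class NoSpeculation {
      public static void disable(Configuration conf) {
        // Stops the jobtracker from launching duplicate "backup" attempts,
        // which otherwise show up as killed shuffle/reduce tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
      }
    }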

Input splits not working correctly

2010-09-24 Thread Matthew John
Hi all, I am working on a sort function and it works perfectly fine with a single map task. When I give 2 map tasks, the entire data is replicated twice (sorted output). When giving 4 map tasks, it gives 4 times the sorted data, and so on. I modified Terasort for this. Major…
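That symptom usually means the record reader ignores its assigned split and reads the whole file from every task, so N maps produce N full copies. Each reader should honor its FileSplit's byte range; a sketch of a custom RecordReader's initialize, assuming a FileInputFormat-based job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitAwareReaderSketch {
      private long start, end;
      private FSDataInputStream in;

      public void initialize(InputSplit genericSplit, TaskAttemptContext context)
          throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        start = split.getStart();          // first byte this task owns
        end = start + split.getLength();   // read no records past this offset
        FileSystem fs = split.getPath().getFileSystem(conf);
        in = fs.open(split.getPath());
        in.seek(start);                    // otherwise every task re-reads from byte 0
        // ... emit records only while the current position is < end ...
      }
    }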

Help for Sqlserver querying with hadoop

2010-09-24 Thread Biju .B
Hi, I need urgent help using SQL Server with Hadoop. I am using the following code to connect to the database: DBConfiguration.configureDB(conf,"com.microsoft.sqlserver.jdbc.SQLServerDriver","jdbc:sqlserver://xxx.xxx.xxx.xxx;user=abc;password=abc;DatabaseName=dbname"); String [] fields = { "id", "url" }; St…
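A complete setup along these lines typically wires the same configureDB call to DBInputFormat. A sketch reusing the thread's placeholder connection string; the table name and the MyRecord class are hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class SqlServerInputSketch {

      // Hypothetical record type holding the two queried fields.
      public static class MyRecord implements Writable, DBWritable {
        long id;
        String url;

        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong("id");
          url = rs.getString("url");
        }
        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, id);
          ps.setString(2, url);
        }
        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          url = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeUTF(url);
        }
      }

      public static void configure(Job job) {
        DBConfiguration.configureDB(job.getConfiguration(),
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "jdbc:sqlserver://xxx.xxx.xxx.xxx;user=abc;password=abc;DatabaseName=dbname");
        String[] fields = { "id", "url" };
        DBInputFormat.setInput(job, MyRecord.class, "mytable" /* hypothetical */,
            null /* conditions */, "id" /* orderBy */, fields);
        job.setInputFormatClass(DBInputFormat.class);
      }
    }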