Re: Are SequenceFiles split? If so, how?

2009-04-20 Thread Jim Twensky
In addition to what Aaron mentioned, you can configure the minimum split size in hadoop-site.xml to have smaller or larger input splits depending on your application. -Jim On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball wrote: > Yes, there can be more than one InputSplit per SequenceFile. The f
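
For reference, the same setting can be applied per job through the 0.19-era JobConf API; a minimal sketch (the class name and the 128 MB value are illustrative, not from the thread):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder
    // mapred.min.split.size is in bytes; input splits will be at least this large
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);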

Re: Hadoop basic question

2009-04-16 Thread Jim Twensky
http://wiki.apache.org/hadoop/FAQ#7 On Thu, Apr 16, 2009 at 6:52 PM, Jae Joo wrote: > Will anyone guide me on how to avoid the single point of failure of the master > node? > This is what I know. If the master node is down for some reason, the hadoop > system is down and there is no way to have failov

Re: getting DiskErrorException during map

2009-04-16 Thread Jim Twensky
/tmp. > > Hope this helps! > > Alex > > On Wed, Apr 15, 2009 at 2:37 PM, Jim Twensky > wrote: > > > Alex, > > > > Yes, I bounced the Hadoop daemons after I changed the configuration > files. > > > > I also tried setting $HADOOP_CONF_DIR

Re: getting DiskErrorException during map

2009-04-15 Thread Jim Twensky
xml lives. For > whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried > to change) are probably not being loaded. $HADOOP_CONF_DIR should fix > this. > > Good luck! > > Alex > > On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky > wrote: > >

Re: Total number of records processed in mapper

2009-04-14 Thread Jim Twensky
Hi Andy, Take a look at this piece of code: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() This is for reduce input records but I believe there is also a counter for reduce output records. You should dig i
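
Putting that snippet in context, a sketch of reading the counter after the job finishes (old mapred API; the group and counter name strings are taken from the message above, and MyJob is a placeholder):

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    JobConf conf = new JobConf(MyJob.class);   // job setup omitted
    RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
    Counters counters = job.getCounters();
    long reduceInputRecords = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter();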

Re: How large is one file split?

2009-04-14 Thread Jim Twensky
Files are stored as blocks and the default block size is 64MB. You can change this by setting the dfs.block.size property. Map/Reduce interprets files in large chunks of bytes and these are called splits. Splits are not physical, think about them as being logical data structures that tell you the s
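
As a concrete illustration of per-file block sizes, HDFS lets you pick one at create time (the path, replication, and 128 MB size here are hypothetical example values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // overwrite = true, 4 KB write buffer, replication 3, 128 MB blocks
    FSDataOutputStream out = fs.create(
        new Path("/user/jim/data.seq"), true, 4096, (short) 3, 128L * 1024 * 1024);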

Re: Map-Reduce Slow Down

2009-04-13 Thread Jim Twensky
s, it looks > fine to me. > > Mithila > > > > > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky > wrote: > > > Mithila, > > > > You said all the slaves were being utilized in the 3 node cluster. Which > > application did you run to test that and what was

Re: Grouping Values for Reducer Input

2009-04-13 Thread Jim Twensky
Oh, I forgot to mention that you should change your partitioner to send all the keys of the form cat,* to the same reducer, but it seems like Jeremy has been much faster than me :) -Jim On Mon, Apr 13, 2009 at 5:24 PM, Jim Twensky wrote: > I'm not sure if this is exactly what you want

Re: Grouping Values for Reducer Input

2009-04-13 Thread Jim Twensky
I'm not sure if this is exactly what you want but, can you emit map records as: cat, doc5 -> 3 cat, doc1 -> 1 cat, doc5 -> 1 and so on.. This way, your reducers will get the intermediate key,value pairs as cat, doc5 -> 3 cat, doc5 -> 1 cat, doc1 -> 1 then you can split the keys (cat, doc*)
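
A sketch of the partitioner change mentioned in the follow-up above, assuming the old 0.19-era mapred API and the "cat,doc5"-style composite keys from this message:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) {}

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Hash only the part before the comma, so "cat,doc1" and
            // "cat,doc5" land on the same reducer.
            String word = key.toString().split(",", 2)[0];
            return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered on the job with conf.setPartitionerClass(WordPartitioner.class).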

Re: Map-Reduce Slow Down

2009-04-13 Thread Jim Twensky
Mithila, You said all the slaves were being utilized in the 3 node cluster. Which application did you run to test that and what was your input size? If you tried the word count application on a 516 MB input file on both cluster setups, then some of your nodes in the 15 node cluster may not be runn

Re: getting DiskErrorException during map

2009-04-13 Thread Jim Twensky
ieve some systems have quotas on /tmp. > > Hope this helps. > > Alex > > On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky wrote: > > > Hi, > > > > I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 > nodes, > > 8 > > of them

getting DiskErrorException during map

2009-04-07 Thread Jim Twensky
Hi, I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8 of them being task trackers. I'm getting the following error and my jobs keep failing when map processes start hitting 30%: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local direct

Re: Please help!

2009-03-31 Thread Jim Twensky
See the original Map Reduce paper by Google at http://labs.google.com/papers/mapreduce.html and please don't spam the list. -jim On Tue, Mar 31, 2009 at 6:15 PM, Hadooper wrote: > Dear developers, > > Is there any detailed example of how Hadoop processes input? > Article > http://hadoop.apache.o

Re: Coordination between Mapper tasks

2009-03-19 Thread Jim Twensky
Stuart, Why do you use RMI to load your dictionary file? I presume you have (key, value) pairs and each of your mappers does numerous lookups on those pairs. In that case, using memcached may be a simpler option and again, you don't have to allocate a separate 2 GB space for each of those 3 processe
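
A sketch of the memcached idea, assuming the spymemcached Java client (the library choice, host name, port, and class name are illustrative; the thread does not name a client):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class DictionaryLookup {
        private final MemcachedClient client;

        public DictionaryLookup() throws IOException {
            // One memcached instance serves every mapper on the node,
            // instead of each task JVM loading its own 2 GB copy.
            client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        }

        public String lookup(String word) {
            return (String) client.get(word);
        }

        public void close() {
            client.shutdown();
        }
    }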

Re: wordcount getting slower with more mappers and reducers?

2009-03-11 Thread Jim Twensky
Sandy, Correct me if I'm wrong, but if you have only two cores and you are running your jobs in pseudo-distributed mode, what is the point of having more than 2 mappers/reducers? Any number larger than 2 would force the mapper/reducer threads to serialize. That serialization would certainly be an over

Re: Using HDFS for common purpose

2009-01-27 Thread Jim Twensky
You may also want to have a look at this to reach a decision based on your needs: http://www.swaroopch.com/notes/Distributed_Storage_Systems Jim On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky wrote: > Rasit, > > What kind of data will you be storing on Hbase or directly on HDFS? Do you

Re: Using HDFS for common purpose

2009-01-27 Thread Jim Twensky
Rasit, What kind of data will you be storing on Hbase or directly on HDFS? Do you aim to use it as a data source to do some key/value lookups for small strings/numbers or do you want to store larger files labeled with some sort of a key and retrieve them during a map reduce run? Jim On Tue, Jan

Re: Suitable for Hadoop?

2009-01-21 Thread Jim Twensky
Ricky, Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also override the isSplitable method of an input form
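
For the record, the method is spelled isSplitable in the API. A minimal sketch of forcing one split per file by subclassing an existing input format (old 0.19-era mapred API assumed; the class name is illustrative):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;  // each file becomes exactly one split
        }
    }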

Re: Indexed Hashtables

2009-01-15 Thread Jim Twensky
Delip, Why do you think Hbase will be an overkill? I do something similar to what you're trying to do with Hbase and I haven't encountered any significant problems so far. Can you give some more info on the size of the data you have? Jim On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao wrote: > Hi,

Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Jim Twensky
Owen and Rasit, Thank you for the responses. I've figured out that mapred.reduce.tasks was set to 1 in my hadoop-default.xml and I didn't override it in my hadoop-site.xml configuration file. Jim On Wed, Jan 14, 2009 at 11:23 AM, Owen O'Malley wrote: > On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wr
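
For completeness, the per-job equivalent of that property (the class name is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);
    conf.setNumReduceTasks(1);  // same effect as mapred.reduce.tasks=1:
                                // all reduce output lands in a single part-00000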

Merging reducer outputs into a single part-00000 file

2009-01-10 Thread Jim Twensky
Hello, The original map-reduce paper states: "After successful completion, the output of the map-reduce execution is available in the R output files (one per reduce task, with file names as specified by the user)." However, when using Hadoop's TextOutputFormat, all the reducer outputs are combined in

Re: Combiner run specification and questions

2009-01-02 Thread Jim Twensky
Hello Saptarshi, >>E.g. if there are only 10 values corresponding >>to a key (as outputted by the mapper), will these 10 values go straight >>to the reducer or to the reducer via the combiner? It depends on whether or not you use the method JobConf.setCombinerClass(). If you don't, Hadoop do
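
A sketch of the call in question (WordCount and MyReducer are placeholders; note that the combiner must be a Reducer whose output types match the declared map output types):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(WordCount.class);
    // Without this call no combiner runs and map outputs go straight to the reducers.
    conf.setCombinerClass(MyReducer.class);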

Re: Shared thread safe variables?

2009-01-01 Thread Jim Twensky
nearly-flat parallelism as your > data set grows really large more than makes up for it in the long run. > - Aaron > > On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky > wrote: > > > Hello again, > > > > I think I found an answer to my question. If I write a new

Re: Shared thread safe variables?

2008-12-25 Thread Jim Twensky
at each combiner/reducer. Jim On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky wrote: > Hi Aaron, > > Thanks for the advice. I actually thought of using multiple combiners and a > single reducer but I was worried about the key sorting phase being a waste > for my purpose. If the

Re: Shared thread safe variables?

2008-12-24 Thread Jim Twensky
t; > > > Many other more complicated problems which seem to require shared state, > in > > truth, only require a second (or n+1'th) MapReduce pass. Adding multiple > > passes is a very valid technique for building more complex dataflows. > > > > Cheers, &

Shared thread safe variables?

2008-12-24 Thread Jim Twensky
Hello, I was wondering if Hadoop provides thread safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things, let's say that in the word count example, I want to know the word that has the highest frequency and how many times
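
As the replies above note, Hadoop has no general shared mutable variables; its closest built-in mechanism is counters, which only aggregate by summing, so a global maximum still needs the extra MapReduce pass suggested above. A sketch of a custom counter inside a 0.19-era mapper (group and counter names are illustrative):

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Counters are write-only from tasks; the framework sums them globally.
        reporter.incrCounter("WordCountStats", "TOTAL_WORDS", 1);
        // ... normal word count logic ...
    }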

Re: Predefined counters

2008-12-22 Thread Jim Twensky
u see in the web UI). I > opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back > to address the fact they are not public. Please consider voting for it > if you think it would be useful. > > Cheers, > Tom > > On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky >

Predefined counters

2008-12-21 Thread Jim Twensky
Hello, I need to collect some statistics using some of the counters defined by the Map/Reduce framework, such as "Reduce input records". I know I should use the getCounter method from Counters.Counter but I couldn't figure out how to use it. Can someone give me a two-line example of how to read the val

Re: debugging hadoop application!

2008-09-24 Thread Jim Twensky
As far as I know, there is a Hadoop plug-in for Eclipse but it is not possible to debug when running on a real cluster. If you want to add watches and expressions to trace your programs or profile your code, I'd suggest looking at the log files or using other tracing tools such as xtrace ( http://www

Re: Can hadoop sort by values rather than keys?

2008-09-24 Thread Jim Twensky
Sorting according to keys is a requirement for the map/reduce algorithm. I'd suggest running a second map/reduce phase on the output files of your application and use the values as keys in that second phase. I know that will increase the running time, but this is how I do it when I need to get my o
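
A minimal sketch of that second pass, assuming the first job writes a SequenceFile so its (Text, IntWritable) pairs come back typed (old mapred API; the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SwapMapper extends MapReduceBase
            implements Mapper<Text, IntWritable, IntWritable, Text> {
        public void map(Text key, IntWritable value,
                        OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {
            output.collect(value, key);  // the framework now sorts by the old value
        }
    }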

Re: installing hadoop on a OS X cluster

2008-09-10 Thread Jim Twensky
Apparently you have one node with 2 processors where each processor has 4 cores. What do you want to use Hadoop for? If you have a single disk drive and multiple cores on one node then pseudo distributed environment seems like the best approach to me as long as you are not dealing with large amount

Question on Streaming

2008-09-09 Thread Jim Twensky
Hello, I need to use Hadoop Streaming to run several instances of a single program on different files. Before doing it, I wrote a simple test application as the mapper, which basically outputs the standard input without doing anything useful. So it looks like the following: ---

Re: Hadoop Streaming and Multiline Input

2008-09-09 Thread Jim Twensky
If I understand your question correctly, you need to write your own FileInputFormat. Please see http://hadoop.apache.org/core/docs/r0.18.0/api/index.html for details. Regards, Tim On Sat, Sep 6, 2008 at 9:20 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote: > Is it possible to set a multiline text inp

Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
types, which conflict with the specified Mapper output types. If I'm correct, am I supposed to write a separate reducer for the local combiner in order to speed things up? Jim On Fri, Aug 29, 2008 at 6:30 PM, Jim Twensky <[EMAIL PROTECTED]> wrote: > Here is the relevan

Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Here is the relevant part of my mapper: (...) private final static IntWritable one = new IntWritable(1); private IntWritable bound = new IntWritable(); (...) while(...) { output.collect(bound,one); } so I'm not sure why my mapper tries to output a FloatWrita

Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Hello, I am working on a Hadoop application that produces different (key,value) types after the map and reduce phases so I'm aware that I need to use "JobConf.setMapOutputKeyClass" and "JobConf.setMapOutputValueClass". However, I still keep getting the following runtime error when I run my applicat
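
For context, a sketch of the two calls named above, with types matching the snippets earlier in this thread (IntWritable map outputs; the FloatWritable reduce output is inferred from the error and is illustrative):

    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);        // MyJob is a placeholder
    conf.setMapOutputKeyClass(IntWritable.class);   // map emits (IntWritable, IntWritable)
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(IntWritable.class);      // reduce emits (IntWritable, FloatWritable)
    conf.setOutputValueClass(FloatWritable.class);

As the follow-up above concludes, if the combiner is set to the reducer class, its output must also match the declared map output types, which is why a separate combiner is needed when the two type pairs differ.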