In addition to what Aaron mentioned, you can configure the minimum split
size in hadoop-site.xml to have smaller or larger input splits depending on
your application.
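For reference, an entry like this in hadoop-site.xml should do it (I'm assuming the pre-0.20 property name mapred.min.split.size here; check the hadoop-default.xml that ships with your version):

```xml
<!-- sketch: lower the minimum input split size to 32 MB (value is in bytes) -->
<property>
  <name>mapred.min.split.size</name>
  <value>33554432</value>
</property>
```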
-Jim
On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball wrote:
> Yes, there can be more than one InputSplit per SequenceFile. The f
http://wiki.apache.org/hadoop/FAQ#7
On Thu, Apr 16, 2009 at 6:52 PM, Jae Joo wrote:
> Will anyone guide me on how to avoid the single point of failure of the master
> node?
> This is what I know: if the master node is down for some reason, the Hadoop
> system is down and there is no way to have failover
/tmp.
>
> Hope this helps!
>
> Alex
>
> On Wed, Apr 15, 2009 at 2:37 PM, Jim Twensky
> wrote:
>
> > Alex,
> >
> > Yes, I bounced the Hadoop daemons after I changed the configuration
> files.
> >
> > I also tried setting $HADOOP_CONF_DIR
xml lives. For
> whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried
> to change) are probably not being loaded. $HADOOP_CONF_DIR should fix
> this.
>
> Good luck!
>
> Alex
>
> On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky
> wrote:
>
>
Hi Andy,
Take a look at this piece of code:
Counters counters = job.getCounters();
counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
"REDUCE_INPUT_RECORDS").getCounter()
This is for reduce input records but I believe there is also a counter for
reduce output records. You should dig into it.
Files are stored as blocks and the default block size is 64 MB. You can
change this by setting the dfs.block.size property. Map/Reduce interprets
files in large chunks of bytes and these are called splits. Splits are not
physical; think of them as logical data structures that tell you
the start offset and length of each chunk.
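If you want bigger blocks, a sketch of the hadoop-site.xml entry (the value is in bytes; 128 MB shown here, and note it only affects files written after the change):

```xml
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```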
it looks
> fine to me.
>
> Mithila
>
>
>
>
> On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky
> wrote:
>
> > Mithila,
> >
> > You said all the slaves were being utilized in the 3 node cluster. Which
> > application did you run to test that and what was your input size?
Oh, I forgot to mention that you should change your partitioner to send all
the keys of the form cat,* to the same reducer, but it seems like Jeremy has
been much faster than me :)
-Jim
On Mon, Apr 13, 2009 at 5:24 PM, Jim Twensky wrote:
> I'm not sure if this is exactly what you want
I'm not sure if this is exactly what you want but, can you emit map records
as:
cat, doc5 -> 3
cat, doc1 -> 1
cat, doc5 -> 1
and so on..
This way, your reducers will get the intermediate key,value pairs as
cat, doc5 -> 3
cat, doc5 -> 1
cat, doc1 -> 1
then you can split the keys (cat, doc*)
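To sketch the idea without pulling in the Hadoop classes (the class and method names below are made up; the real thing would implement org.apache.hadoop.mapred.Partitioner):

```java
public class WordPartitionerDemo {
    // Route a composite key like "cat,doc5" by the word part before the comma,
    // so every (word, doc) pair for the same word reaches the same reducer.
    static int getPartition(String compositeKey, int numPartitions) {
        String word = compositeKey.split(",", 2)[0];
        // Mask the sign bit so the modulo result is never negative
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Both "cat" keys land in the same partition regardless of the doc id
        System.out.println(getPartition("cat,doc1", 4) == getPartition("cat,doc5", 4)); // true
    }
}
```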
Mithila,
You said all the slaves were being utilized in the 3 node cluster. Which
application did you run to test that and what was your input size? If you
tried the word count application on a 516 MB input file on both cluster
setups, then some of your nodes in the 15 node cluster may not be running
I believe some systems have quotas on /tmp.
>
> Hope this helps.
>
> Alex
>
> On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky wrote:
>
> > Hi,
> >
> > I'm using Hadoop 0.19.1 and I have a very small test cluster with 9
> nodes,
> > 8
> > of them
Hi,
I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8
of them being task trackers. I'm getting the following error and my jobs
keep failing when map processes start hitting 30%:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory
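In case it helps anyone searching the archives: this error usually means the directories in mapred.local.dir are missing, unwritable, or out of space. Pointing the property at one or more directories with enough room often fixes it (the paths below are just placeholders):

```xml
<property>
  <name>mapred.local.dir</name>
  <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local</value>
</property>
```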
See the original Map Reduce paper by Google at
http://labs.google.com/papers/mapreduce.html and please don't spam the list.
-jim
On Tue, Mar 31, 2009 at 6:15 PM, Hadooper wrote:
> Dear developers,
>
> Is there any detailed example of how Hadoop processes input?
> Article
> http://hadoop.apache.o
Stuart,
Why do you use RMI to load your dictionary file? I presume you have (key,
value) pairs and each of your mappers does numerous lookups on those pairs. In
that case, using memcached may be a simpler option and, again, you don't have
to allocate a separate 2 GB space for each of those 3 processes
Sandy,
Correct me if I'm wrong, but if you have only two cores and you are running
your jobs in pseudo-distributed mode, what is the point of having more than
2 mappers/reducers? Any number larger than 2 would make the mapper/reducer
threads serialize. That serialization would certainly be an overhead
You may also want to have a look at this to reach a decision based on your
needs:
http://www.swaroopch.com/notes/Distributed_Storage_Systems
Jim
On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky wrote:
> Rasit,
>
> What kind of data will you be storing on Hbase or directly on HDFS? Do you
Rasit,
What kind of data will you be storing on Hbase or directly on HDFS? Do you
aim to use it as a data source to do some key/value lookups for small
strings/numbers or do you want to store larger files labeled with some sort
of a key and retrieve them during a map reduce run?
Jim
On Tue, Jan
Ricky,
Hadoop is primarily optimized for large files, usually files larger
than one input split. However, there is an input format called
MultiFileInputFormat which can be used to make Hadoop work efficiently
on smaller files. You can also override the isSplitable method of an input
format
Delip,
Why do you think Hbase will be overkill? I do something similar to what
you're trying to do with Hbase and I haven't encountered any significant
problems so far. Can you give some more info on the size of the data you
have?
Jim
On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao wrote:
> Hi,
Owen and Rasit,
Thank you for the responses. I've figured out that mapred.reduce.tasks was set
to 1 in my hadoop-default.xml and I didn't override it in my
hadoop-site.xml configuration file.
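For anyone hitting the same thing, the override in hadoop-site.xml looks like this (8 is just an arbitrary example value; pick something suited to your cluster):

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```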
Jim
On Wed, Jan 14, 2009 at 11:23 AM, Owen O'Malley wrote:
> On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wr
Hello,
The original map-reduce paper states: "After successful completion, the
output of the map-reduce execution is available in the R output files (one
per reduce task, with file names as specified by the user)." However, when
using Hadoop's TextOutputFormat, all the reducer outputs are combined in
Hello Saptarshi,
>>E.g. if there are only 10 values corresponding
>>to a key (as outputted by the mapper), will these 10 values go straight
>>to the reducer or to the reducer via the combiner?
It depends on whether or not you use the method JobConf.setCombinerClass().
If you don't, Hadoop does not run a combiner and the values go straight to the reducer.
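To illustrate what the combiner does when it is set: it is essentially a local reduce over a single mapper's output, run before the shuffle. A stdlib-only sketch (no Hadoop classes; the names here are made up):

```java
import java.util.*;

public class CombinerSketch {
    // A combiner acts like a local reduce over one mapper's output:
    // word-count pairs are summed per key before being "shuffled".
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            combined.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("cat", 1), Map.entry("dog", 1), Map.entry("cat", 1));
        System.out.println(combine(mapOutput)); // {cat=2, dog=1}
    }
}
```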
nearly-flat parallelism as your
> data set grows really large more than makes up for it in the long run.
> - Aaron
>
> On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky
> wrote:
>
> > Hello again,
> >
> > I think I found an answer to my question. If I write a new
> >
at each combiner/reducer.
Jim
On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky wrote:
> Hi Aaron,
>
> Thanks for the advice. I actually thought of using multiple combiners and a
> single reducer but I was worried about the key sorting phase being a waste
> for my purpose. If the
>
> > Many other more complicated problems which seem to require shared state,
> in
> > truth, only require a second (or n+1'th) MapReduce pass. Adding multiple
> > passes is a very valid technique for building more complex dataflows.
> >
> > Cheers,
Hello,
I was wondering if Hadoop provides thread safe shared variables that can be
accessed from individual mappers/reducers along with a proper locking
mechanism. To clarify things, let's say that in the word count example, I
want to know the word that has the highest frequency and how many times
u see in the web UI). I
> opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back
> to address the fact they are not public. Please consider voting for it
> if you think it would be useful.
>
> Cheers,
> Tom
>
> On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky
>
Hello,
I need to collect some statistics using some of the counters defined by the
Map/Reduce framework, such as "Reduce input records". I know I should use
the getCounter method from Counters.Counter but I couldn't figure out how to
use it. Can someone give me a two-line example of how to read the value?
As far as I know, there is a Hadoop plug-in for Eclipse but it is not
possible to debug when running on a real cluster. If you want to add watches
and expressions to trace your programs or profile your code, I'd suggest
looking at the log files or using other tracing tools such as xtrace (
http://www
Sorting by key is a requirement of the map/reduce algorithm. I'd
suggest running a second map/reduce phase on the output files of your
application and using the values as keys in that second phase. I know that
will increase the running time, but this is how I do it when I need to get
my output sorted by value.
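Here's a toy sketch of that second pass, with plain Java collections standing in for the framework's key sort (class and method names are made up):

```java
import java.util.*;

public class SortByValueSketch {
    // Second-pass "mapper": emit (count, word) so the framework's key sort
    // orders records by count. A TreeMap stands in for that sort here.
    // Caveat: duplicate counts would collide; real code would keep a list per key.
    static TreeMap<Integer, String> byCount(Map<String, Integer> counts) {
        TreeMap<Integer, String> swapped = new TreeMap<>();
        counts.forEach((word, n) -> swapped.put(n, word));
        return swapped;
    }

    public static void main(String[] args) {
        // Output of the first pass: word -> count
        Map<String, Integer> firstPassOutput = Map.of("cat", 3, "dog", 7, "ant", 1);
        System.out.println(byCount(firstPassOutput)); // {1=ant, 3=cat, 7=dog}
    }
}
```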
Apparently you have one node with 2 processors where each processor has 4
cores. What do you want to use Hadoop for? If you have a single disk drive
and multiple cores on one node, then a pseudo-distributed environment seems
like the best approach to me as long as you are not dealing with large
amounts of data.
Hello, I need to use Hadoop Streaming to run several instances of a single
program on different files. Before doing it, I wrote a simple test
application as the mapper, which basically outputs the standard input
without doing anything useful. So it looks like the following:
---
If I understand your question correctly, you need to write your own
FileInputFormat. Please see
http://hadoop.apache.org/core/docs/r0.18.0/api/index.html for details.
Regards,
Tim
On Sat, Sep 6, 2008 at 9:20 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Is is possible to set a multiline text inp
types, which contradict the
specified Mapper output types. If I'm correct, am I supposed to write a
separate reducer for the local combiner in order to speed things up?
Jim
On Fri, Aug 29, 2008 at 6:30 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
> Here is the relevan
Here is the relevant part of my mapper:
(...)
private final static IntWritable one = new IntWritable(1);
private IntWritable bound = new IntWritable();
(...)
while(...) {
output.collect(bound,one);
}
so I'm not sure why my mapper tries to output a FloatWritable.
Hello, I am working on a Hadoop application that produces different
(key,value) types after the map and reduce phases so I'm aware that I need
to use "JobConf.setMapOutputKeyClass" and "JobConf.setMapOutputValueClass".
However, I still keep getting the following runtime error when I run my
application: