Yes, you have to deal with the compression. Usually, you'll load the
compression codec in your RecordReader. You can see an example of how
TextInputFormat's LineRecordReader does it:
https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineReco
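For reference, here's a minimal sketch of that codec lookup (the
"file" and "context" variables are assumed to come from your
RecordReader's initialize() method, so treat this as illustrative):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Pick a codec based on the file extension; null means uncompressed.
Configuration conf = context.getConfiguration();
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(file);
FSDataInputStream fileIn = file.getFileSystem(conf).open(file);
InputStream in = (codec != null) ? codec.createInputStream(fileIn) : fileIn;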
JobTracker and TaskTracker. YARN is only in 0.23 and later releases. 1.0.x is
from the 0.20.x line of releases.
-Joey
On Mar 14, 2012, at 7:00, arindam choudhury wrote:
> Hi,
>
> Hadoop 1.0.1 uses hadoop YARN or the tasktracker, jobtracker model?
>
> Regards,
> Arindam
Masoud,
I know that the Puppet Labs website is confusing, but Puppet is open
source and has no node limit. You can download it from here:
http://puppetlabs.com/misc/download-options/
If you're using a Red Hat compatible linux distribution, you can get
RPMs from EPEL:
http://projects.puppetlabs.
HDFS has the notion of a working directory which defaults to
/user/<username>. Check out:
http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/fs/FileSystem.html#getWorkingDirectory()
and
http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/fs/FileSystem.html#setWorkingDirectory(
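As a quick illustration (hostnames and paths here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
System.out.println(fs.getWorkingDirectory()); // e.g. hdfs://nn:8020/user/joey
fs.setWorkingDirectory(new Path("/tmp/scratch"));
fs.open(new Path("part-00000")); // now resolves to /tmp/scratch/part-00000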
Small typo, try:
jar tf hadoop-core-1.0.1.jar | grep -i MultipleOutputs
;)
-Joey
On Mon, Mar 12, 2012 at 4:56 PM, W.P. McNeill wrote:
> I take that back. On my laptop I'm running Apache Hadoop 1.0.1, and I still
> don't see MultipleOutputs. I am building against hadoop-core-1.0.1.jar and
> the
Apache Bigtop also has Hadoop puppet modules. For the modules based on
Hadoop 0.20.205 you can look at them here:
https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/
I haven't seen any documentation on the modules.
-Joey
On Mon, Mar 12, 2012 at 1:43 PM, P
Something like Puppet is a good choice. There are example puppet
manifests available for most Hadoop-related projects in Apache BigTop,
for example:
https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/
-Joey
On Thu, Mar 8, 2012 at 9:42 PM, Masoud wrote:
If you're using -libjars, there's no reason to copy the jars into
$HADOOP lib. You may have to add the jars to the HADOOP_CLASSPATH if
you use them from your main() method:
export HADOOP_CLASSPATH=dependent-1.jar:dependent-2.jar
hadoop jar main.jar demo.MyJob -libjars
dependent-1.jar,dependent-2.j
I think you mean Writer.getLength(). It returns the current position
in the output stream in bytes (more or less the current size of the
file).
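Roughly like this (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("/tmp/demo.seq"), Text.class, Text.class);
writer.append(new Text("key"), new Text("value"));
long bytesSoFar = writer.getLength(); // position in the output stream
writer.close();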
-Joey
On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne wrote:
> hi,
>
> i am writing a little util class to recurse into a directory and add all
> *.txt files
I know this doesn't fix lzo, but have you considered Snappy for the
intermediate output compression? It gets similar compression ratios
and compress/decompress speed, but arguably has better Hadoop
integration.
-Joey
On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese wrote:
> I use to have 2.05 but
Not quite. Datanodes get the namenode host from fs.default.name in
core-site.xml. Task trackers find the job tracker from the mapred.job.tracker
setting in mapred-site.xml.
Sent from my iPhone
On Mar 1, 2012, at 18:49, Mohit Anchlia wrote:
> On Thu, Mar 1, 2012 at 4:46 PM, Joey Echever
You only have to refresh nodes if you're making use of an allow file.
Sent from my iPhone
On Mar 1, 2012, at 18:29, Mohit Anchlia wrote:
> Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:
>
> http://wiki.apache.org/hadoop/FAQ
>
> 1. Update conf/slave
> 2. on the s
Try 0.4.15. You can get it from here:
https://github.com/toddlipcon/hadoop-lzo
Sent from my iPhone
On Feb 28, 2012, at 6:49, Marc Sturlese wrote:
> I'm with 0.4.9 (think is the latest)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-ret
Which version of the Hadoop LZO library are you using? It looks like something
I'm pretty sure was fixed in a newer version.
-Joey
On Feb 28, 2012, at 4:58, Marc Sturlese wrote:
> Hey there,
> I've been running a cluster for over a year and was getting a lzo
> decompressing exception less t
dfs.block.size can be set per job.
mapred.tasktracker.map.tasks.maximum is per tasktracker.
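For example, to change the block size for a single job (the 128 MB
value is just an illustration):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Files written by this job will use 128 MB blocks.
conf.setLong("dfs.block.size", 128L * 1024 * 1024);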
-Joey
On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia wrote:
> Can someone please suggest if parameters like dfs.block.size,
> mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can
>
node
>> redundancy. Perhaps I don't fully understand.
>>
>> I'll check out Bigtop. I looked at it a while ago and forgot about it.
>>
>> Thanks
>> -jeremy
>>
>> On Feb 22, 2012, at 2:43 PM, Joey Echeverria wrote:
>>
>>
Check out the Apache Bigtop project. I believe they have 0.22 RPMs.
Out of curiosity, why are you interested in BackupNode?
-Joey
Sent from my iPhone
On Feb 22, 2012, at 14:56, Jeremy Hansen wrote:
> Any possibility of getting spec files to create packages for 0.22?
>
> Thanks
> -jeremy
>
HDFS supports POSIX style file and directory permissions (read, write, execute)
for the owner, group and world. You can change the permissions with hadoop fs
-chmod
-Joey
On Feb 22, 2012, at 5:32, wrote:
> Hi
>
>
>
>
>
> I want to implement security at file level in Hadoop, essentiall
I'd recommend making a SequenceFile[1] to store each XML file as a value.
-Joey
[1]
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html
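A minimal sketch of writing one (the filenames and the xmlContents
string are assumptions, not from the original thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Key = original filename, value = that file's XML contents.
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("/user/mohit/packed.seq"), Text.class, Text.class);
writer.append(new Text("doc1.xml"), new Text(xmlContents));
writer.close();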
On Tue, Feb 21, 2012 at 12:15 PM, Mohit Anchlia wrote:
> We have small xml files. Currently I am planning to append these sm
ven need to specify
> lib jars in the command line…should I be worried that it doesn't work that
> way?
>
> On Jan 31, 2012, at 4:09 PM, Joey Echeverria wrote:
>
>> You also need to add the jar to the classpath so it's available in
>> your main. You can do soem
You also need to add the jar to the classpath so it's available in
your main. You can do something like this:
HADOOP_CLASSPATH=/usr/local/mahout/math/target/mahout-math-0.6-SNAPSHOT.jar
hadoop jar ...
-Joey
On Tue, Jan 31, 2012 at 1:38 PM, Daniel Quach wrote:
> For Hadoop 0.20.203 (the latest s
> How much memory/JVM heap does NameNode use for each block?
I don't remember the exact number; it also depends on which version of
Hadoop you're using.
> http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?
It's 1 GB for every *million* objects (files, blocks, etc.). This is a
g
You can use FileInputFormat.setInputPaths(configuration,
job1-output). This will overwrite the old input path(s).
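For example (the output path is illustrative, and job2 is assumed to
be your second Job object):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Replaces any previously configured input paths on job2.
FileInputFormat.setInputPaths(job2, new Path("/user/wp/job1-output"));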
-Joey
On Mon, Jan 16, 2012 at 7:16 PM, W.P. McNeill wrote:
>
> It is possible to unset a configuration value? I think the answer is no,
> but I want to be sure.
>
> I know that you
client username
>> instead of the new one I had set. Do I need to add it somewhere else, or
>> add something else to the property name? I'm using CDH3 with my Hadoop
>> cluster currently setup with one node in pseudo-distributed mode, in case
>> that helps.
>>
>
om the FileInputFormat.getSplits() method. Is this possible?
>
> 2012/1/12 Joey Echeverria
>
>> It doesn't matter if the original comes from mapred-site.xml,
>> core-site.xml, or hdfs-site.xml. All that really matters is if it's a
>> client/job tunable or if it configure
Set the user.name property in your core-site.xml on your client nodes.
-Joey
On Thu, Jan 12, 2012 at 3:55 PM, Eli Finkelshteyn wrote:
> Hi,
> If I have one username on a hadoop cluster and would like to set myself up
> to use that same username from every client from which I access the cluster,
It doesn't matter if the original comes from mapred-site.xml,
core-site.xml, or hdfs-site.xml. All that really matters is if it's a
client/job tunable or if it configures one of the daemons.
Which parameter did you want to change?
On Thu, Jan 12, 2012 at 1:59 PM, Marcel Holle
wrote:
> I need a v
Yes. Hive doesn't format data when you load it. The only exception is if you do
an INSERT OVERWRITE ...
-Joey
On Jan 10, 2012, at 6:08, Tony Burton wrote:
> Thanks for this Bejoy, very helpful.
>
> So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW
> FORMAT and oth
What's the classpath of the java program submitting the job? It has to
have the configuration directory (e.g. /opt/hadoop/conf) in there or
it won't pick up the correct configs.
-Joey
On Sun, Jan 8, 2012 at 12:59 PM, Mark question wrote:
> mapred-site.xml:
>
>
> mapred.job.tracker
> loc
; Praveenesh
>
> On Thu, Dec 29, 2011 at 4:46 PM, Joey Echeverria wrote:
>
>> Hey Praveenesh,
>>
>> What do you mean by multiuser? Do you want to support multiple users
>> starting/stopping daemons?
>>
>> -Joey
>>
>>
>>
>> O
Hey Praveenesh,
What do you mean by multiuser? Do you want to support multiple users
starting/stopping daemons?
-Joey
On Dec 29, 2011, at 2:49, praveenesh kumar wrote:
> Guys,
>
> Did someone try this thing ?
>
> Thanks
>
> On Tue, Dec 27, 2011 at 4:36 PM, praveenesh kumar wrote:
>
>> H
Can you run the hostname command on both servers and send their output?
-Joey
On Tue, Dec 20, 2011 at 8:21 PM, MirrorX wrote:
>
> dear all
>
> i am trying for many days to get a simple hadoop cluster (with 2 nodes) to
> work but i have trouble configuring the network parameters. i have properly
You could run the flume collectors on other machines and write a source which
connects to the sockets on the data generators.
-Joey
On Dec 15, 2011, at 21:27, "Periya.Data" wrote:
> Sorry...misworded my statement. What I meant was that the sources are meant
> to be untouched and admins do
Hi Bai,
I'm moving this over to scm-us...@cloudera.org as that's a more
appropriate list. (common-user bcced).
I assume by "Cloudera Free" you mean Cloudera Manager Free Edition?
You should be able to run a job in the same way that you do on any other
Hadoop cluster. The only caveat is that you first
y
On Wed, Dec 7, 2011 at 12:37 PM, wrote:
> What happens then if the nfs server fails or isn't reachable? Does hdfs lock
> up? Does it gracefully ignore the nfs copy?
>
> Thanks,
> randy
>
> - Original Message -
> From: "Joey Echeverria"
> To:
You should also configure the Namenode to use an NFS mount for one of
its storage directories. That will give you the most up-to-date backup of
the metadata in case of total node failure.
-Joey
On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar wrote:
> This means still we are relying on Secondary Name
You're correct, currently HDFS only supports reading from closed files. You can
configure flume to write your data in small enough chunks so you can do
incremental processing.
-Joey
On Nov 22, 2011, at 2:01, Romeo Kienzler wrote:
> Hi,
>
> I'm planning to use Flume in order to stream data
If your file is bigger than a block size (typically 64 MB or 128 MB), then it
will be split into more than one block. The blocks may or may not be stored on
different datanodes. If you're using a default InputFormat, then the input will
be split between two tasks. Since you said you need the whole
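If you do need the whole file in a single map task, one common
approach (a sketch, not from the original message) is to make the
input format refuse to split files:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each file becomes exactly one split, hence one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}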
You can certainly run HBase on a single server, but I don't think
you'd want to. Very few projects ever reach a scale where a single
MySQL server can't handle it. In my opinion, you should start with the
easy solution (MySQL) and only bring HBase into the mix when your
scale really demands it. If y
on the speculative execution. I can't remember...I think so
> though.
>
> On Nov 11, 2011, at 5:53 AM, Joey Echeverria wrote:
>
>> Another thing to look at is the map outlier. The shuffle will start by
>> default when 5% of the maps are done, but won't finish
Another thing to look at is the map outlier. The shuffle will start by default
when 5% of the maps are done, but won't finish until after the last map is
done. Since one of your maps took 37 minutes, your shuffle will take at least that
long.
I would check the following:
Is the input skewed?
Does t
What is your setting for fs.default.name?
-Joey
On Nov 8, 2011, at 5:54, Paolo Di Tommaso wrote:
> Dear all,
>
> I'm trying to install Hadoop (0.20.2) in pseudo distributed mode to run
> some tests on a Linux machine (Fedora 8) .
>
> I have followed the installation steps in the guide availab
You need to create a log directory on your TaskTracker nodes:
/opt/ecip/BMC/hadoopTest/hadoop-0.20.203.0/logs/
Make sure the directory is writable by the mapred user, or whichever
user your TaskTrackers were started as.
-Joey
On Thu, Nov 3, 2011 at 11:11 PM, Li, Yonggang wrote:
>
> I have in
When you get the handle to the FileSystem object you can connect as a
different user:
http://hadoop.apache.org/common/docs/r0.20.203.0/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,
org.apache.hadoop.conf.Configuration, java.lang.String)
This should get any permissions you set enforced.
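For example (the URI and username are made up):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Operations through fs are performed as "etluser".
FileSystem fs = FileSystem.get(
    URI.create("hdfs://namenode:8020"), new Configuration(), "etluser");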
A new API was introduced with Hadoop 0.20. However, that API is not
feature complete. Despite the fact that the old API is marked as
deprecated, it's still the recommended, full-featured API. In fact, in
later versions of Hadoop the old API has been undeprecated to call more
attention to its stable nature.
100 compressed lines of text. So maybe that
> accounts for the progress report.
>
> Any idea what the huge time difference might be due to (2 minutes average
> vs. 20 hrs for the last 3 tasks)? Does that sound like swapping to you?
>
> Thanks,
>
> Brendan
>
> On Thu, N
Is your input data compressed? There have been some bugs in the past
with reporting progress when reading compressed data.
-Joey
On Thu, Nov 3, 2011 at 9:18 AM, Brendan W. wrote:
> Hi,
>
> Running 0.20.2:
>
> A job with about 4000 map tasks quickly blew through all but 3 in a couple
> of hours, w
What are the permissions on \tmp\hadoop-cyg_server\mapred\local\ttprivate?
Which user owns that directory?
Which user are you starting your TaskTracker as?
-Joey
On Wed, Nov 2, 2011 at 9:29 PM, Masoud wrote:
> Hi,
>
> Im running hadop 0.20.204 under cygwin 1.7 on Win7, java 1.6.22
> i got this
Hi Trang,
I'm moving the discussion to scm-us...@cloudera.org as it's not a Hadoop
common issue. I've bcced common-user@hadoop.apache.org and also put
you in the to: field in case you're not on scm-users.
As for your problem, the issue is that SCM doesn't support an
installation via sudo if sudo req
Try getting rid of the extra spaces and new lines.
-Joey
On Mon, Oct 31, 2011 at 1:49 PM, Mark wrote:
> I recently added the following to my core-site.xml
>
>
> io.compression.codecs
>
> org.apache.hadoop.io.compress.DefaultCodec,
> org.apache.hadoop.io.compress.GzipCodec,
> org.apache.hadoop
required the fix in one environment and did not in
> another -- but that may just show my lack of understanding about hadoop. :-)
>
> Jessica
>
> On Wed, Oct 5, 2011 at 4:27 PM, Jessica Owensby
> wrote:
>
>> Great. Thanks! Will give that a try.
>> Je
on that node?
>
> Joey,
> Yes, the lzo files are indexed. They are indexed using the following
> command:
>
> hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-20110217.jar
> com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/foo/bar.lzo
>
> Jessica
>
> On Wed, O
Are your LZO files indexed?
-Joey
On Wed, Oct 5, 2011 at 3:35 PM, Jessica Owensby
wrote:
> Hi Joey,
> Thanks. I forgot to say that; yes, the lzocodec class is listed in
> core-site.xml under the io.compression.codecs property:
>
>
> io.compression.codecs
> org.apache.hadoop.io.compress.GzipCo
Did you add the LZO codec configuration to core-site.xml?
-Joey
On Wed, Oct 5, 2011 at 2:31 PM, Jessica Owensby
wrote:
> Hello Everyone,
> I've been having an issue in a hadoop environment (running cdh3u1)
> where any table declared in hive
> with the "STORED AS INPUTFORMAT
> "com.hadoop.mapred.
The Job class copies the Configuration that you pass in. You either
need to do your conf.setInt("number", 12345) before you create the Job
object or you need call job.getConfiguration().setInt("number",
12345).
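To make the difference concrete (a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.setInt("number", 12345);    // visible: set before the Job copies conf
Job job = new Job(conf, "demo");
conf.setInt("number", 54321);    // NOT visible: the job already has its copy
job.getConfiguration().setInt("number", 54321); // visible: set on the copy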
-Joey
On Tue, Oct 4, 2011 at 12:28 PM, Ratner, Alan S (IS)
wrote:
> I have no problem
Raj,
I just tried this on my CDH3u1 VM, and the ramdisk worked the first
time. So, it's possible you've hit a bug in CDH3b3 that was later
fixed. Can you enable debug logging in log4j.properties and then
repost your task tracker log? I think there might be more details that
it will print that will
I would definitely check out Oozie for this use case.
-Joey
On Thu, Sep 29, 2011 at 12:51 PM, Aaron Baff wrote:
> I saw this, but wasn't sure if it was something that ran on the client and
> just submitted the Job's in sequence, or if that gave it all to the
> JobTracker, and the JobTracker too
Do you close your FileSystem instances at all? IIRC, the FileSystem
instance you use is a singleton and if you close it once, it's closed
for everybody. My guess is you close it in your cleanup method and you
have JVM reuse turned on.
-Joey
On Thu, Sep 29, 2011 at 12:49 PM, Mark question wrote:
HDFS blocks are stored as files in the underlying filesystem of your
datanodes. Those files do not take a fixed amount of space, so if you
store 10 MB in a file and you have 128 MB blocks, you still only use
10 MB (times 3 with default replication).
However, the namenode does incur additional over
FYI, I'm moving this to mapreduce-user@ and bccing common-user@.
It looks like your latest permission problem is on the local disk. What is your
setting for hadoop.tmp.dir? What are the permissions on that directory?
-Joey
On Sep 18, 2011, at 23:27, ArunKumar wrote:
> Hi guys !
>
> Commo
As hduser, create the /user/arun directory in HDFS. Then change the
ownership of /user/arun to arun.
-Joey
On Sep 18, 2011 8:07 AM, "ArunKumar" wrote:
> Hi Uma !
>
> I have deleted the data in /app/hadoop/tmp and formatted namenode and
> restarted cluster..
> I tried
> arun$ /home/hduser/hadoop
Losing the name node does not necessarily mean lost data. You should always
have your name node write its metadata to an NFS server to guard against it.
Also, while unavailability is a risk, it is not very common in practice.
-Joey
On Sep 17, 2011, at 19:38, Tom Deutsch wrote:
> I disagree
You might also want to look into MRUnit[1]. It lets you mock the
behavior of the framework to test your map and reduce classes in
isolation. Can't discover all bugs, but a useful tool and works nicely
with IDE debuggers.
-Joey
[1] http://incubator.apache.org/mrunit/
On Thu, Sep 15, 2011 at 3:51
Hi Naveen,
> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).
Word of warning, 0.21 is not a stable release. The recommended version
is in the 0.20.x range.
> Is there any efficient way of handling it in hadoop?
>
> I have heard that solution for that problem is using
That won't work with the replication level as that is entirely a
client side config. You can partially control it by setting the
maximum replication level.
-Joey
On Tue, Sep 13, 2011 at 10:56 AM, Edward Capriolo wrote:
> On Tue, Sep 13, 2011 at 5:53 AM, Steve Loughran wrote:
>
>> On 13/09/11 05
The sort is what's implementing the group by key function. You can't
have one without the other in Hadoop. Are you trying to disable the
sort because you think it's too slow?
-Joey
On Sun, Sep 11, 2011 at 2:43 AM, john smith wrote:
> Hi Arun,
>
> Suppose I am doing a simple wordcount and the map
Not that I know of.
-Joey
On Fri, Aug 19, 2011 at 1:16 PM, modemide wrote:
> Ha, what a silly mistake.
>
> Thank you Joey.
>
> Do you also happen to know of an easier way to tell which racks the
> jobtracker/namenode think each node is in?
>
>
>
> On 8/19/11, Joey
Did you restart the JobTracker?
-Joey
On Fri, Aug 19, 2011 at 12:45 PM, modemide wrote:
> Hi all,
> I've tried to make a rack topology script. I've written it in python
> and it works if I call it with the following arguments:
> 10.2.0.1 10.2.0.11 10.2.0.11 10.2.0.12 10.2.0.21 10.2.0.26 10.2.0
It means your HDFS client jars are using a different RPC version than
your namenode and datanodes. Are you sure that XXX has $HADOOP_HOME in
its classpath? It really looks like it's pointing to the wrong jars.
-Joey
On Thu, Aug 18, 2011 at 8:14 AM, Ratner, Alan S (IS)
wrote:
> We have a version
If you're talking about the org.apache.hadoop.mapreduce.* API, that
was introduced in 0.20.0. There should be no need to use the 0.21
version.
-Joey
On Tue, Aug 16, 2011 at 1:22 PM, W.P. McNeill wrote:
> Here is my specific problem:
>
> I have a sample word count Hadoop program up on github (
>
What are the types of key1 and key2? What does the readFields() method
look like?
-Joey
On Sun, Aug 14, 2011 at 10:07 PM, Stan Rosenberg
wrote:
> On Sun, Aug 14, 2011 at 9:33 PM, Joey Echeverria wrote:
>
>> Does your compareTo() method test object pointer equality? If so, you
Does your compareTo() method test object pointer equality? If so, you could
be getting burned by Hadoop reusing Writable objects.
-Joey
On Aug 14, 2011 9:20 PM, "Stan Rosenberg"
wrote:
> Hi Folks,
>
> After much poking around I am still unable to determine why I am seeing
> 'reduce' being called
You can configure the undocumented variable dfs.max-repl-streams to
increase the number of replications a data-node is allowed to handle
at one time. The default value is 2. [1]
-Joey
[1]
https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.s
You can use any kind of format for files in the distributed cache, so
yes you can use sequence files. They should be faster to parse than
most text formats.
-Joey
On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
wrote:
> Thank you for the reply!
> In each map(), I need to open-read-close these
You can set the keep.failed.task.files property on the job.
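For example, on the job's Configuration before you submit:

conf.setBoolean("keep.failed.task.files", true);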
-Joey
On Tue, Aug 9, 2011 at 9:39 PM, Saptarshi Guha wrote:
> Hello,
>
> If i have a failure during a job, is there a way I prevent the output
> folder
> from being deleted?
>
> Cheers
> Saptarshi
>
--
Joseph Echeverria
Cloudera, I
If you want to use a combiner, your map has to output the same types
as your combiner outputs. In your case, modify your map to look like
this:
public static class TokenizerMapper
    extends Mapper<Text, Text, Text, IntWritable> {
  public void map(Text key, Text value, Context context
                  ) throws IOException, InterruptedException {
How about having the slave write to a temp file first, then move it to the
location the master is monitoring once the file is closed?
-Joey
On Jul 27, 2011, at 22:51, Nitin Khandelwal
wrote:
> Hi All,
>
> How can I determine if a file is being written to (by any thread) in HDFS. I
> have a conti
You could either use a custom RecordReader or you could override the
run() method on your Mapper class to do the merging before calling the
map() method.
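A rough sketch of the run() approach (the types and the merge logic
are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MergingMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      // Accumulate records here and call map() once per merged record.
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}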
-Joey
On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez wrote:
>>
>>> 3. Another idea might be create separate seq files for chunk of
>>> records
> 1. Any reason not to use a sequence file for this? Perhaps a mapfile?
> Since I've sorted it, I don't need "random" accesses, but I do need
> to be aware of the keys, as I need to be sure that I get all of the
> relevant keys sent to a given mapper
MapFile *may* be better here (see my answer f
To add to what Bobby said, you can get block locations with
fs.getFileBlockLocations() if you want to open based on locality.
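For example (the path is illustrative; fs is an open FileSystem handle):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

FileStatus stat = fs.getFileStatus(new Path("/data/input.bin"));
BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
for (BlockLocation block : blocks) {
  String[] hosts = block.getHosts(); // datanodes holding this block
}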
-Joey
On Mon, Jul 25, 2011 at 3:00 PM, Robert Evans wrote:
> Sofia,
>
> You can access any HDFS file from a normal java application so long as your
> classpath and some
Your executable needs to read lines from standard in. Try setting your mapper
like this:
> -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -"
If that doesn't work, you may need to execute your C program from a shell
script. The "-" I added to the command line says read from STDIN.
-Joey
On Jul 2
Hi Issac,
I couldn't find anything specifically for the 0.20.203 release, but CDH3
uses basically the same security code. You could probably follow our
security guide with the 0.20.203 release:
https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide
-Joey
On Mon, Jul 18, 2011 at 12:15 PM, I
Facebook contributed some code to do something similar called HDFS RAID:
http://wiki.apache.org/hadoop/HDFS-RAID
-Joey
On Jul 18, 2011, at 3:41, Da Zheng wrote:
> Hello,
>
> It seems that data replication in HDFS is simply data copy among nodes. Has
> anyone considered to use a better encodi
Your map method is misnamed. It should be in all lower case.
-Joey
On Jul 12, 2011 2:46 AM, "Teng, James" wrote:
>
> hi, all.
> I am a new hadoop beginner, I try to construct a map and reduce task to
run, however encountered an exception while continue going further.
> Exception:
> java.io.IOExce
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.
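For example, on the job's Configuration:

conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);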
-Joey
On Fri, Jul 8, 2011 at 11:25 AM, Juan P. wrote:
> Here's another thought. I realized that
It looks like both datanodes are trying to serve data out of the same
directory. Is there any chance that both datanodes are using the same NFS mount
for the dfs.data.dir?
If not, what I would do is delete the data from ${dfs.data.dir} and then
re-format the namenode. You'll lose all of your da
Have you tried using a Combiner?
Here's an example of using one:
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0
-Joey
On Thu, Jul 7, 2011 at 4:29 PM, Juan P. wrote:
> Hi guys!
>
> I'd like some help fine tuning my cluster. I currently have 20 boxes
ArrayWritable doesn't serialize type information. You need to subclass it
(e.g. IntArrayWritable) and create a no-arg constructor which calls
super(IntWritable.class).
Use this instead of ArrayWritable directly. If you want to store more than
one type, look at the source for MapWritable to see how
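For the single-type case, the subclass is tiny (a sketch):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class); // tells readFields() the element type
  }
}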
Try replacing the hadoop jar from the pig lib directory with the one from your
cluster.
-Joey
On Jul 2, 2011, at 0:38, praveenesh kumar wrote:
> Hi guys..
>
>
>
> I am previously using hadoop and Hbase...
>
>
>
> So for Hbase to run perfectly fine we need Hadoop-0.20-append for Hbase
Yes, you can see a picture describing HAR files in this old blog post:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
-Joey
On Mon, Jun 27, 2011 at 4:36 PM, Rita wrote:
> So, it does an index of the file?
>
>
>
> On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria
The advantage of a Hadoop archive file is that it lets you access the
files stored in it directly. For example, if you archived three files
(a.txt, b.txt, c.txt) in an archive called foo.har. You could cat one
of the three files using the hadoop command line:
hadoop fs -cat har:///user/joey/out/foo.ha
Yes.
-Joey
On Jun 21, 2011 1:47 PM, "jagaran das" wrote:
> Hi All,
>
> Does CDH3 support Existing File Append ?
>
> Regards,
> Jagaran
>
>
>
>
> From: Eric Charles
> To: common-user@hadoop.apache.org
> Sent: Tue, 21 June, 2011 3:53:33 AM
> Subject: Re: Append to
the slaves file -
>>
>> Cheers -
>>
>> -Original Message-
>> From: Joey Echeverria [mailto:j...@cloudera.com]
>> Sent: Wednesday, June 15, 2011 12:01 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Datanode not created on hadoop-0.20.203.0
>&
I would try the following:
hadoop -libjars /home/ayon/jars/MultiOutput.jar jar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-libjars /home/ayon/jars/MultiOutput.jar -input
/user/ayon/streaming_test_input -output
/user/ayon/streaming_test_output -mapper /bin/cat -reduce
By any chance, are you running as root? If so, try running as a different user.
-Joey
On Wed, Jun 15, 2011 at 12:53 PM, rutesh wrote:
> Hi,
>
> I am new to hadoop (Just 1 month old). These are the steps I followed to
> install and run hadoop-0.20.203.0:
>
> 1) Downloaded tar file from
> http:/
This feature doesn't currently work. I don't remember the JIRA for it, but
there's a ticket which will allow a reader to read from an HDFS file before
it's closed. In that case, you implement a queue by having the producer write
to the end of the file and the reader read from the beginning of th
There are some good recommendations in this blog post:
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
It's a little dated, but the reasoning and basics are sound.
-Joey
On Thu, Jun 9, 2011 at 10:59 AM, Mark wrote:
> Can someone give some
Hey Andy,
You're correct that 0.20.203 doesn't have append. Your best bet is to
build a version of the append branch or switch to
CDH3u0.
-Joey
On Tue, Jun 7, 2011 at 6:31 PM, Zhong, Sheng wrote:
> Thanks! The issue has been resolved by removing some bad blks...
>
> But St.Ack,
>
> We do want a
Most of the network bandwidth used during a MapReduce job should come
from the shuffle/sort phase. This part doesn't use HDFS. The
TaskTrackers running reduce tasks will pull intermediate results from
TaskTrackers running map tasks over HTTP. In most cases, it's
difficult to get rack locality durin
Larger Hadoop installations are space dense, 20-40 nodes per rack.
When you get to that density with multiple racks, it becomes expensive
to buy a switch with enough capacity for all of the nodes in all of
the racks. The typical solution is to install a switch per rack with
uplinks to a core switch