Re: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Joey Echeverria
Yes, you have to deal with the compression. Usually, you'll load the compression codec in your RecordReader. You can see an example of how TextInputFormat's LineRecordReader does it: https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineReco
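
For illustration, a minimal sketch of that codec-loading pattern inside a custom RecordReader's initialize(), assuming the Hadoop 1.x mapreduce API (variable names are illustrative):

    Configuration conf = context.getConfiguration();
    Path file = ((FileSplit) split).getPath();
    // CompressionCodecFactory maps file extensions to codecs (.bz2 -> BZip2Codec)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(file);
    FSDataInputStream fileIn = file.getFileSystem(conf).open(file);
    // Wrap the raw stream in a decompressor only if the file is compressed
    InputStream in = (codec != null) ? codec.createInputStream(fileIn) : fileIn;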

Re: questions regarding hadoop version 1.0

2012-03-14 Thread Joey Echeverria
JobTracker and TaskTracker. YARN is only in 0.23 and later releases. 1.0.x is from the 0.20.x line of releases. -Joey On Mar 14, 2012, at 7:00, arindam choudhury wrote: > Hi, > > Hadoop 1.0.1 uses hadoop YARN or the tasktracker, jobtracker model? > > Regards, > Arindam

Re: setting up a large hadoop cluster

2012-03-12 Thread Joey Echeverria
Masoud, I know that the Puppet Labs website is confusing, but Puppet is open source and has no node limit. You can download it from here: http://puppetlabs.com/misc/download-options/ If you're using a Red Hat-compatible Linux distribution, you can get RPMs from EPEL: http://projects.puppetlabs.

Re: Is there a way to get an absolute HDFS path?

2012-03-12 Thread Joey Echeverria
HDFS has the notion of a working directory, which defaults to /user/<username>. Check out: http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/fs/FileSystem.html#getWorkingDirectory() and http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/fs/FileSystem.html#setWorkingDirectory(
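
A small sketch of how relative paths interact with the working directory (the URI and paths are placeholders):

    FileSystem fs = FileSystem.get(conf);
    // A relative path resolves against the working directory, /user/<username> by default
    Path abs = fs.makeQualified(new Path("data/input.txt"));
    // e.g. hdfs://namenode.example.com:8020/user/joey/data/input.txt
    fs.setWorkingDirectory(new Path("/tmp"));
    // now "data/input.txt" would resolve to /tmp/data/input.txt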

Re: What is currently the best way to write to multiple output locations in Hadoop?

2012-03-12 Thread Joey Echeverria
Small typo, try: jar tf hadoop-core-1.0.1.jar | grep -i MultipleOutputs ;) -Joey On Mon, Mar 12, 2012 at 4:56 PM, W.P. McNeill wrote: > I take that back. On my laptop I'm running Apache Hadoop 1.0.1, and I still > don't see MultipleOutputs. I am building against hadoop-core-1.0.1.jar and > the

Re: setting up a large hadoop cluster

2012-03-12 Thread Joey Echeverria
Apache Bigtop also has Hadoop puppet modules. For the modules based on Hadoop 0.20.205 you can look at them here: https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/ I haven't seen any documentation on the modules. -Joey On Mon, Mar 12, 2012 at 1:43 PM, P

Re: Best way for setting up a large cluster

2012-03-08 Thread Joey Echeverria
Something like puppet it is a good choice. There are example puppet manifests available for most Hadoop-related projects in Apache BigTop, for example: https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/ -Joey On Thu, Mar 8, 2012 at 9:42 PM, Masoud wrote:

Re: how to get rid of -libjars ?

2012-03-06 Thread Joey Echeverria
If you're using -libjars, there's no reason to copy the jars into $HADOOP_HOME/lib. You may have to add the jars to HADOOP_CLASSPATH if you use them from your main() method (note that HADOOP_CLASSPATH is colon-separated, while -libjars takes a comma-separated list): export HADOOP_CLASSPATH=dependent-1.jar:dependent-2.jar hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.j

Re: is there anyway to detect the file size as am i writing a sequence file?

2012-03-06 Thread Joey Echeverria
I think you mean Writer.getLength(). It returns the current position in the output stream in bytes (more or less the current size of the file). -Joey On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne wrote: > hi, > > i am writing a little util class to recurse into a directory and add all > *.txt files
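
For example, a sketch that rolls to a new file once the writer passes a size threshold (fs, conf, path, key, and value are assumed to be in scope; the 64 MB cutoff is arbitrary):

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
    writer.append(key, value);
    // getLength() returns the current position in the output stream, in bytes
    if (writer.getLength() >= 64L * 1024 * 1024) {
      writer.close();
      // ...open a new writer on the next path...
    }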

Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Joey Echeverria
I know this doesn't fix lzo, but have you considered Snappy for the intermediate output compression? It gets similar compression ratios and compress/decompress speed, but arguably has better Hadoop integration. -Joey On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese wrote: > I use to have 2.05 but
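
A sketch of the relevant MR1 settings in mapred-site.xml, assuming your build bundles SnappyCodec (CDH3 does; check your distribution):

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>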

Re: Adding nodes

2012-03-01 Thread Joey Echeverria
Not quite. Datanodes get the namenode host from fs.default.name in core-site.xml. TaskTrackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia wrote: > On Thu, Mar 1, 2012 at 4:46 PM, Joey Echever
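
For reference, the two settings look like this (hostnames and ports are placeholders):

    <!-- core-site.xml -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.com:8021</value>
    </property>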

Re: Adding nodes

2012-03-01 Thread Joey Echeverria
You only have to refresh nodes if you're making use of an allow file (dfs.hosts / mapred.hosts). Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia wrote: > Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: > > http://wiki.apache.org/hadoop/FAQ > > 1. Update conf/slave > 2. on the s

Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Try 0.4.15. You can get it from here: https://github.com/toddlipcon/hadoop-lzo Sent from my iPhone On Feb 28, 2012, at 6:49, Marc Sturlese wrote: > I'm with 0.4.9 (think is the latest) > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-ret

Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Which version of the Hadoop LZO library are you using? It looks like something I'm pretty sure was fixed in a newer version. -Joey On Feb 28, 2012, at 4:58, Marc Sturlese wrote: > Hey there, > I've been running a cluster for over a year and was getting a lzo > decompressing exception less t

Re: dfs.block.size

2012-02-27 Thread Joey Echeverria
dfs.block.size can be set per job. mapred.tasktracker.map.tasks.maximum is per tasktracker. -Joey On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia wrote: > Can someone please suggest if parameters like dfs.block.size, > mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can >
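
For example, a per-job override from the command line (jar and class names are placeholders; assumes the driver uses ToolRunner/GenericOptionsParser, and the value is in bytes):

    hadoop jar myjob.jar com.example.MyJob -D dfs.block.size=134217728 input output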

Re: Backupnode in 1.0.0?

2012-02-22 Thread Joey Echeverria
node >> redundancy.  Perhaps I don't fully understand. >> >> I'll check out Bigtop.  I looked at it a while ago and forgot about it. >> >> Thanks >> -jeremy >> >> On Feb 22, 2012, at 2:43 PM, Joey Echeverria wrote: >> >>

Re: Backupnode in 1.0.0?

2012-02-22 Thread Joey Echeverria
Check out the Apache Bigtop project. I believe they have 0.22 RPMs. Out of curiosity, why are you interested in BackupNode? -Joey Sent from my iPhone On Feb 22, 2012, at 14:56, Jeremy Hansen wrote: > Any possibility of getting spec files to create packages for 0.22? > > Thanks > -jeremy >

Re: Security at file level in Hadoop

2012-02-22 Thread Joey Echeverria
HDFS supports POSIX style file and directory permissions (read, write, execute) for the owner, group and world. You can change the permissions with hadoop fs -chmod -Joey On Feb 22, 2012, at 5:32, wrote: > Hi > > > > > > I want to implement security at file level in Hadoop, essentiall
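
A few illustrative commands (paths, users, and groups are made up):

    hadoop fs -chmod 640 /data/reports/q1.txt          # rw for owner, r for group, none for world
    hadoop fs -chown alice:analysts /data/reports/q1.txt
    hadoop fs -chmod -R 750 /data/reports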

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Joey Echeverria
I'd recommend making a SequenceFile[1] to store each XML file as a value. -Joey [1] http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html On Tue, Feb 21, 2012 at 12:15 PM, Mohit Anchlia wrote: > We have small xml files. Currently I am planning to append these sm
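
A minimal sketch of packing small XML files into one SequenceFile, keyed by filename (paths are placeholders; IOUtils here is org.apache.hadoop.io.IOUtils):

    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/xml.seq"), Text.class, BytesWritable.class);
    for (FileStatus stat : fs.listStatus(new Path("/data/xml-in"))) {
      byte[] buf = new byte[(int) stat.getLen()];
      FSDataInputStream in = fs.open(stat.getPath());
      IOUtils.readFully(in, buf, 0, buf.length);  // read the whole small file
      in.close();
      writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
    }
    writer.close();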

Re: Adding mahout math jar to hadoop mapreduce execution

2012-02-01 Thread Joey Echeverria
ven need to specify > lib jars in the command line…should I be worried that it doesn't work that > way? > > On Jan 31, 2012, at 4:09 PM, Joey Echeverria wrote: > >> You also need to add the jar to the classpath so it's available in >> your main. You can do soem

Re: Adding mahout math jar to hadoop mapreduce execution

2012-01-31 Thread Joey Echeverria
You also need to add the jar to the classpath so it's available in your main. You can do something like this: HADOOP_CLASSPATH=/usr/local/mahout/math/target/mahout-math-0.6-SNAPSHOT.jar hadoop jar ... -Joey On Tue, Jan 31, 2012 at 1:38 PM, Daniel Quach wrote: > For Hadoop 0.20.203 (the latest s

Re: NameNode per-block memory usage?

2012-01-17 Thread Joey Echeverria
> How much memory/JVM heap does NameNode use for each block? I don't remember the exact number; it also depends on which version of Hadoop you're using. > http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? It's 1 GB for every *million* objects (files, blocks, etc.). This is a g

Re: Can you unset a mapred.input.dir configuration value?

2012-01-16 Thread Joey Echeverria
You can use FileInputFormat.setInputPaths(configuration, job1-output). This will overwrite the old input path(s). -Joey On Mon, Jan 16, 2012 at 7:16 PM, W.P. McNeill wrote: > > It is possible to unset a configuration value? I think the answer is no, > but I want to be sure. > > I know that you

Re: Username on Hadoop 20.2

2012-01-16 Thread Joey Echeverria
client username >> instead of the new one I had set. Do I need to add it somewhere else, or >> add something else to the property name? I'm using CDH3 with my Hadoop >> cluster currently setup with one node in pseudo-distributed mode, in case >> that helps. >> >

Re: Access core-site.xml from FileInputFormat

2012-01-12 Thread Joey Echeverria
om the FileInputFormat.getSplits() method. Is this possible? > > 2012/1/12 Joey Echeverria > >> It doesn't matter if the original comes from mapred-site.xml, >> core-site.xml, or hdfs-site.xml. All that really matters is if it's a >> client/job tunable or if it configure

Re: Username on Hadoop 20.2

2012-01-12 Thread Joey Echeverria
Set the user.name property in your core-site.xml on your client nodes. -Joey On Thu, Jan 12, 2012 at 3:55 PM, Eli Finkelshteyn wrote: > Hi, > If I have one username on a hadoop cluster and would like to set myself up > to use that same username from every client from which I access the cluster,
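
A sketch of the suggested property (the value is a placeholder; this only applies to clusters without Kerberos security, where the client asserts its own identity):

    <property>
      <name>user.name</name>
      <value>shareduser</value>
    </property>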

Re: Access core-site.xml from FileInputFormat

2012-01-12 Thread Joey Echeverria
It doesn't matter if the original comes from mapred-site.xml, core-site.xml, or hdfs-site.xml. All that really matters is if it's a client/job tunable or if it configures one of the daemons. Which parameter did you want to change? On Thu, Jan 12, 2012 at 1:59 PM, Marcel Holle wrote: > I need a v

Re: has bzip2 compression been deprecated?

2012-01-10 Thread Joey Echeverria
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... . -Joey On Jan 10, 2012, at 6:08, Tony Burton wrote: > Thanks for this Bejoy, very helpful. > > So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW > FORMAT and oth

Re: Expected file://// error

2012-01-08 Thread Joey Echeverria
What's the classpath of the java program submitting the job? It has to have the configuration directory (e.g. /opt/hadoop/conf) in there or it won't pick up the correct configs. -Joey On Sun, Jan 8, 2012 at 12:59 PM, Mark question wrote: > mapred-site.xml: > >   >    mapred.job.tracker >    loc

Re: Multi user Hadoop 0.20.205 ?

2011-12-29 Thread Joey Echeverria
; Praveenesh > > On Thu, Dec 29, 2011 at 4:46 PM, Joey Echeverria wrote: > >> Hey Praveenesh, >> >> What do you mean by multiuser? Do you want to support multiple users >> starting/stopping daemons? >> >> -Joey >> >> >> >> O

Re: Multi user Hadoop 0.20.205 ?

2011-12-29 Thread Joey Echeverria
Hey Praveenesh, What do you mean by multiuser? Do you want to support multiple users starting/stopping daemons? -Joey On Dec 29, 2011, at 2:49, praveenesh kumar wrote: > Guys, > > Did someone try this thing ? > > Thanks > > On Tue, Dec 27, 2011 at 4:36 PM, praveenesh kumar wrote: > >> H

Re: network configuration (etc/hosts) ?

2011-12-21 Thread Joey Echeverria
Can you run the hostname command on both servers and send their output? -Joey On Tue, Dec 20, 2011 at 8:21 PM, MirrorX wrote: > > dear all > > i am trying for many days to get a simple hadoop cluster (with 2 nodes) to > work but i have trouble configuring the network parameters. i have properly

Re: streaming data ingest into HDFS

2011-12-15 Thread Joey Echeverria
You could run the flume collectors on other machines and write a source which connects to the sockets on the data generators. -Joey On Dec 15, 2011, at 21:27, "Periya.Data" wrote: > Sorry...misworded my statement. What I meant was that the sources are meant > to be untouched and admins do

Re: Cloudera Free

2011-12-08 Thread Joey Echeverria
Hi Bai, I'm moving this over to scm-us...@cloudera.org as that's a more appropriate list. (common-user bcced). I assume by "Cloudera Free" you mean Cloudera Manager Free Edition? You should be able to run a job in the same way that you do on any other Hadoop cluster. The only caveat is that you first

Re: HDFS Backup nodes

2011-12-07 Thread Joey Echeverria
y On Wed, Dec 7, 2011 at 12:37 PM, wrote: > What happens then if the nfs server fails or isn't reachable? Does hdfs lock > up? Does it gracefully ignore the nfs copy? > > Thanks, > randy > > - Original Message - > From: "Joey Echeverria" > To:

Re: HDFS Backup nodes

2011-12-07 Thread Joey Echeverria
You should also configure the Namenode to use an NFS mount for one of its storage directories. That will give you the most up-to-date backup of the metadata in case of total node failure. -Joey On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar wrote: > This means still we are relying on Secondary Name
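
A sketch of that NameNode setting with one local and one NFS directory (paths are placeholders):

    <!-- hdfs-site.xml on the NameNode -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/nn,/mnt/nfs-backup/dfs/nn</value>
    </property>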

Re: Incremental Mappers?

2011-11-22 Thread Joey Echeverria
You're correct, currently HDFS only supports reading from closed files. You can configure flume to write your data in small enough chunks so you can do incremental processing. -Joey On Nov 22, 2011, at 2:01, Romeo Kienzler wrote: > Hi, > > I'm planning to use Fume in order to stream data

Re: Regarding loading a big XML file to HDFS

2011-11-22 Thread Joey Echeverria
If your file is bigger than a block size (typically 64 MB or 128 MB), then it will be split into more than one block. The blocks may or may not be stored on different datanodes. If you're using a default InputFormat, then the input will be split between at least two tasks. Since you said you need the whole

Re: HBase Stack

2011-11-15 Thread Joey Echeverria
You can certainly run HBase on a single server, but I don't think you'd want to. Very few projects ever reach a scale where a single MySQL server can't handle it. In my opinion, you should start with the easy solution (MySQL) and only bring HBase into the mix when your scale really demands it. If y

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
on the speculative execution.  I can't remember...I think so > though. > > On Nov 11, 2011, at 5:53 AM, Joey Echeverria wrote: > >> Another thing to look at is the map outlier. The shuffle will start by >> default when 5% of the maps are done, but won't finish

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
Another thing to look at is the map outlier. The shuffle will start by default when 5% of the maps are done, but won't finish until after the last map is done. Since one of your maps took 37 minutes, your shuffle will take at least that long. I would check the following: Is the input skewed? Does t

Re: Hadoop PseudoDistributed configuration

2011-11-08 Thread Joey Echeverria
What is your setting for fs.default.name? -Joey On Nov 8, 2011, at 5:54, Paolo Di Tommaso wrote: > Dear all, > > I'm trying to install Hadoop (0.20.2) in pseudo distributed mode to run > some tests on a Linux machine (Fedora 8) . > > I have followed the installation steps in the guide availab

Re: someone know how to install hadoop0.20 on hp-ux?

2011-11-04 Thread Joey Echeverria
You need to create a log directory on your TaskTracker nodes: /opt/ecip/BMC/hadoopTest/hadoop-0.20.203.0/logs/ Make sure the directory is writable by the mapred user, or whichever user your TaskTrackers were started as. -Joey On Thu, Nov 3, 2011 at 11:11 PM, Li, Yonggang wrote: > > I have in
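
For example (adjust the user if your daemons run as someone else):

    mkdir -p /opt/ecip/BMC/hadoopTest/hadoop-0.20.203.0/logs
    chown -R mapred /opt/ecip/BMC/hadoopTest/hadoop-0.20.203.0/logs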

Re: Question about superuser and permissions

2011-11-03 Thread Joey Echeverria
When you get the handle to the FileSystem object you can connect as a different user: http://hadoop.apache.org/common/docs/r0.20.203.0/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI, org.apache.hadoop.conf.Configuration, java.lang.String) This should get any permissions you set enforce
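
A minimal sketch of that call (the URI and username are placeholders; this only controls identity on clusters without Kerberos):

    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode.example.com:8020"),
        new Configuration(),
        "etluser");
    fs.mkdirs(new Path("/user/etluser/output"));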

Re: Hadoop 0.20.2 and JobConf deprecation

2011-11-03 Thread Joey Echeverria
A new API was introduced with Hadoop 0.20. However, that API is not feature complete. Despite the fact that the old API is marked as deprecated, it's still the recommended, fully featured API. In fact, in future versions of Hadoop the API has been undeprecated to call more attention to its stable na

Re: map task attempt progress at 400%?

2011-11-03 Thread Joey Echeverria
100 compressed lines of text.  So maybe that > accounts for the progress report. > > Any idea what the huge time difference might be due to (2 minutes average > vs. 20 hrs for the last 3 tasks)?  Does that sound like swapping to you? > > Thanks, > > Brendan > > On Thu, N

Re: map task attempt progress at 400%?

2011-11-03 Thread Joey Echeverria
Is your input data compressed? There have been some bugs in the past with reporting progress when reading compressed data. -Joey On Thu, Nov 3, 2011 at 9:18 AM, Brendan W. wrote: > Hi, > > Running 0.20.2: > > A job with about 4000 map tasks quickly blew through all but 3 in a couple > of hours, w

Re: Hadoop + cygwin

2011-11-03 Thread Joey Echeverria
What are the permissions on \tmp\hadoop-cyg_server\mapred\local\ttprivate? Which user owns that directory? Which user are you starting your TaskTracker as? -Joey On Wed, Nov 2, 2011 at 9:29 PM, Masoud wrote: > Hi, > > Im running hadop 0.20.204 under cygwin 1.7 on Win7, java 1.6.22 > i got this

Re: Problem using SCM

2011-11-01 Thread Joey Echeverria
Hi Trang, I'm moving the discussion to scm-us...@cloudera.org as it's not a Hadoop common issue. I've bcced common-user@hadoop.apache.org and also put you in the to: field in case you're not on scm-users. As for your problem, the issue is that SCM doesn't support an installation via sudo if sudo req

Re: Default Compression

2011-10-31 Thread Joey Echeverria
Try getting rid of the extra spaces and new lines. -Joey On Mon, Oct 31, 2011 at 1:49 PM, Mark wrote: > I recently added the following to my core-site.xml > > > io.compression.codecs > >  org.apache.hadoop.io.compress.DefaultCodec, > org.apache.hadoop.io.compress.GzipCodec, > org.apache.hadoop
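
For comparison, a cleaned-up version of that property with the value on a single line and no embedded whitespace (the codec list itself is illustrative):

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>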

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-16 Thread Joey Echeverria
required the fix in one environment and did not in > another -- but that may just show my lack of understanding about hadoop. :-) > > Jessica > > On Wed, Oct 5, 2011 at 4:27 PM, Jessica Owensby > wrote: > >> Great.  Thanks!  Will give that a try. >> Je

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
on that node? > > Joey, > Yes, the lzo files are indexed.  They are indexed using the following > command: > > hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-20110217.jar > com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/foo/bar.lzo > > Jessica > > On Wed, O

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
Are your LZO files indexed? -Joey On Wed, Oct 5, 2011 at 3:35 PM, Jessica Owensby wrote: > Hi Joey, > Thanks. I forgot to say that; yes, the lzocodec class is listed in > core-site.xml under the io.compression.codecs property: > > >  io.compression.codecs >  org.apache.hadoop.io.compress.GzipCo

Re: cannot find DeprecatedLzoTextInputFormat

2011-10-05 Thread Joey Echeverria
Did you add the LZO codec configuration to core-site.xml? -Joey On Wed, Oct 5, 2011 at 2:31 PM, Jessica Owensby wrote: > Hello Everyone, > I've been having an issue in a hadoop environment (running cdh3u1) > where any table declared in hive > with the "STORED AS INPUTFORMAT > "com.hadoop.mapred.

Re: setInt & getInt

2011-10-04 Thread Joey Echeverria
The Job class copies the Configuration that you pass in. You either need to do your conf.setInt("number", 12345) before you create the Job object or you need to call job.getConfiguration().setInt("number", 12345). -Joey On Tue, Oct 4, 2011 at 12:28 PM, Ratner, Alan S (IS) wrote: > I have no problem
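
Spelled out as a short sketch of the two working options and the broken one:

    Configuration conf = new Configuration();
    conf.setInt("number", 12345);                    // option 1: set before creating the Job
    Job job = new Job(conf);
    job.getConfiguration().setInt("number", 12345);  // option 2: set on the Job's own copy
    conf.setInt("number", 99999);                    // too late: the Job already copied conf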

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Joey Echeverria
Raj, I just tried this on my CHD3u1 VM, and the ramdisk worked the first time. So, it's possible you've hit a bug in CDH3b3 that was later fixed. Can you enable debug logging in log4j.properties and then repost your task tracker log? I think there might be more details that it will print that will

Re: Running multiple MR Job's in sequence

2011-09-29 Thread Joey Echeverria
I would definitely checkout Oozie for this use case. -Joey On Thu, Sep 29, 2011 at 12:51 PM, Aaron Baff wrote: > I saw this, but wasn't sure if it was something that ran on the client and > just submitted the Job's in sequence, or if that gave it all to the > JobTracker, and the JobTracker too

Re: FileSystem closed

2011-09-29 Thread Joey Echeverria
Do you close your FileSystem instances at all? IIRC, the FileSystem instance you use is a singleton and if you close it once, it's closed for everybody. My guess is you close it in your cleanup method and you have JVM reuse turned on. -Joey On Thu, Sep 29, 2011 at 12:49 PM, Mark question wrote:

Re: block size

2011-09-20 Thread Joey Echeverria
HDFS blocks are stored as files in the underlying filesystem of your datanodes. Those files do not take a fixed amount of space, so if you store 10 MB in a file and you have 128 MB blocks, you still only use 10 MB (times 3 with default replication). However, the namenode does incur additional over

Re: Submitting Jobs from different user to a queue in capacity scheduler

2011-09-19 Thread Joey Echeverria
FYI, I'm moving this to mapreduce-user@ and bccing common-user@. It looks like your latest permission problem is on the local disk. What is your setting for hadoop.tmp.dir? What are the permissions on that directory? -Joey On Sep 18, 2011, at 23:27, ArunKumar wrote: > Hi guys ! > > Commo

Re: Submitting Jobs from different user to a queue in capacity scheduler

2011-09-18 Thread Joey Echeverria
As the HDFS superuser, create the /user/arun directory in HDFS. Then change the ownership of /user/arun to arun. -Joey On Sep 18, 2011 8:07 AM, "ArunKumar" wrote: > Hi Uma ! > > I have deleted the data in /app/hadoop/tmp and formatted namenode and > restarted cluster.. > I tried > arun$ /home/hduser/hadoop
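
The commands would look something like this (run as a user with HDFS superuser rights):

    hadoop fs -mkdir /user/arun
    hadoop fs -chown arun /user/arun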

Re: risks of using Hadoop

2011-09-17 Thread Joey Echeverria
Losing the name node does not necessarily mean lost data. You should always have your name node write its metadata to an NFS server to guard against losing it. Also, while unavailability is a risk, it is not very common in practice. -Joey On Sep 17, 2011, at 19:38, Tom Deutsch wrote: > I disagree

Re: Debugging mapper

2011-09-15 Thread Joey Echeverria
You might also want to look into MRUnit[1]. It lets you mock the behavior of the framework to test your map and reduce classes in isolation. It can't catch every bug, but it's a useful tool and works nicely with IDE debuggers. -Joey [1] http://incubator.apache.org/mrunit/ On Thu, Sep 15, 2011 at 3:51
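
A minimal sketch of an MRUnit map test (WordCountMapper is a placeholder for your own mapper; the fluent API details vary slightly across MRUnit versions):

    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        new MapDriver<LongWritable, Text, Text, IntWritable>();
    driver.withMapper(new WordCountMapper())
          .withInput(new LongWritable(0), new Text("hello hello"))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("hello"), new IntWritable(1))
          .runTest();  // fails if actual output differs from the expected pairs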

Re: Handling of small files in hadoop

2011-09-14 Thread Joey Echeverria
Hi Naveen, > I use hadoop-0.21.0 distribution. I have a large number of small files (KB). Word of warning, 0.21 is not a stable release. The recommended version is in the 0.20.x range. > Is there any efficient way of handling it in hadoop? > > I have heard that solution for that problem is using

Re: Hadoop doesnt use Replication Level of Namenode

2011-09-13 Thread Joey Echeverria
That won't work with the replication level, as that is entirely a client-side config. You can partially control it by setting the maximum replication level. -Joey On Tue, Sep 13, 2011 at 10:56 AM, Edward Capriolo wrote: > On Tue, Sep 13, 2011 at 5:53 AM, Steve Loughran wrote: > >> On 13/09/11 05

Re: Disable Sorting?

2011-09-11 Thread Joey Echeverria
The sort is what's implementing the group by key function. You can't have one without the other in Hadoop. Are you trying to disable the sort because you think it's too slow? -Joey On Sun, Sep 11, 2011 at 2:43 AM, john smith wrote: > Hi Arun, > > Suppose I am doing a simple wordcount and the map

Re: Help - Rack Topology Script - Hadoop 0.20 (CDH3u1)

2011-08-21 Thread Joey Echeverria
Not that I know of. -Joey On Fri, Aug 19, 2011 at 1:16 PM, modemide wrote: > Ha, what a silly mistake. > > Thank you Joey. > > Do you also happen to know of an easier way to tell which racks the > jobtracker/namenode think each node is in? > > > > On 8/19/11, Joey

Re: Help - Rack Topology Script - Hadoop 0.20 (CDH3u1)

2011-08-19 Thread Joey Echeverria
Did you restart the JobTracker? -Joey On Fri, Aug 19, 2011 at 12:45 PM, modemide wrote: > Hi all, > I've tried to make a rack topology script.  I've written it in python > and it works if I call it with the following arguments: > 10.2.0.1 10.2.0.11 10.2.0.11 10.2.0.12 10.2.0.21 10.2.0.26  10.2.0

Re: Version Mismatch

2011-08-18 Thread Joey Echeverria
It means your HDFS client jars are using a different RPC version than your namenode and datanodes. Are you sure that XXX has $HADOOP_HOME in its classpath? It really looks like it's pointing to the wrong jars. -Joey On Thu, Aug 18, 2011 at 8:14 AM, Ratner, Alan S (IS) wrote: > We have a version

Re: How do I add Hadoop dependency to a Maven project?

2011-08-16 Thread Joey Echeverria
If you're talking about the org.apache.hadoop.mapreduce.* API, that was introduced in 0.20.0. There should be no need to use the 0.21 version. -Joey On Tue, Aug 16, 2011 at 1:22 PM, W.P. McNeill wrote: > Here is my specific problem: > > I have a sample word count Hadoop program up on github ( >
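
The dependency would look something like this (pick the 1.0.x or 0.20.x version that matches your cluster; 1.0.1 here is illustrative):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.0.1</version>
    </dependency>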

Re: WritableComparable

2011-08-14 Thread Joey Echeverria
What are the types of key1 and key2? What does the readFields() method look like? -Joey On Sun, Aug 14, 2011 at 10:07 PM, Stan Rosenberg wrote: > On Sun, Aug 14, 2011 at 9:33 PM, Joey Echeverria wrote: > >> Does your compareTo() method test object pointer equality? If so, you

Re: WritableComparable

2011-08-14 Thread Joey Echeverria
Does your compareTo() method test object pointer equality? If so, you could be getting burned by Hadoop reusing Writable objects. -Joey On Aug 14, 2011 9:20 PM, "Stan Rosenberg" wrote: > Hi Folks, > > After much poking around I am still unable to determine why I am seeing > 'reduce' being called
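
For illustration, a compareTo() that compares field values rather than references (MyKey and its two int fields are placeholders):

    @Override
    public int compareTo(MyKey other) {
      // Never short-circuit on (this == other): Hadoop reuses Writable instances,
      // so reference equality says nothing about the deserialized values.
      if (key1 != other.key1) {
        return key1 < other.key1 ? -1 : 1;
      }
      return key2 < other.key2 ? -1 : (key2 == other.key2 ? 0 : 1);
    }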

Re: Speed up node under replicated block during decomission

2011-08-12 Thread Joey Echeverria
You can configure the undocumented variable dfs.max-repl-streams to increase the number of replications a data-node is allowed to handle at one time. The default value is 2. [1] -Joey [1] https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.s

Re: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Joey Echeverria
You can use any kind of format for files in the distributed cache, so yes you can use sequence files. They should be faster to parse than most text formats. -Joey On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki wrote: > Thank you for the reply! > In each map(), I need to open-read-close these

Re: Keep output folder despite a failed Job

2011-08-09 Thread Joey Echeverria
You can set the keep.failed.task.files property on the job. -Joey On Tue, Aug 9, 2011 at 9:39 PM, Saptarshi Guha wrote: > Hello, > > If  i have a failure during a job, is there a way I prevent the output > folder > from being deleted? > > Cheers > Saptarshi > -- Joseph Echeverria Cloudera, I
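
For example, per job from the command line (jar and class names are placeholders; assumes a ToolRunner-based driver):

    hadoop jar myjob.jar com.example.MyJob -D keep.failed.task.files=true input output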

Re: error:Type mismatch in value from map

2011-07-29 Thread Joey Echeverria
If you want to use a combiner, your map has to output the same types as your combiner outputs. In your case, modify your map to look like this: public static class TokenizerMapper extends Mapper { public void map(Text key, Text value, Context context) throws IOExce

Re: Hadoop Question

2011-07-28 Thread Joey Echeverria
How about having the slave write to a temp file first, then rename it to the file the master is monitoring once the slave closes it? -Joey On Jul 27, 2011, at 22:51, Nitin Khandelwal wrote: > Hi All, > > How can I determine if a file is being written to (by any thread) in HDFS. I > have a conti
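
A sketch of the write-then-rename pattern (paths and the payload variable are placeholders):

    FileSystem fs = FileSystem.get(conf);
    Path tmp = new Path("/incoming/_data.txt.tmp");
    Path done = new Path("/incoming/data.txt");  // the name the master watches for
    FSDataOutputStream out = fs.create(tmp);
    out.write(payload);                          // payload: the bytes to publish
    out.close();
    fs.rename(tmp, done);                        // only now does the master see the file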

Re: questions regarding data storage and inputformat

2011-07-27 Thread Joey Echeverria
You could either use a custom RecordReader or you could override the run() method on your Mapper class to do the merging before calling the map() method. -Joey On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez wrote: >> >>> 3. Another idea might be create separate seq files for chunk of >>> records
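
A sketch of the run() override; this mirrors the default implementation in the mapreduce API, with the merging logic left as a comment:

    @Override
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      while (context.nextKeyValue()) {
        // Buffer/merge consecutive records here before handing them to map()
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
      cleanup(context);
    }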

Re: questions regarding data storage and inputformat

2011-07-27 Thread Joey Echeverria
> 1. Any reason not to use a sequence file for this?  Perhaps a mapfile? >  Since I've sorted it, I don't need "random" accesses, but I do need > to be aware of the keys, as I need to be sure that I get all of the > relevant keys sent to a given mapper MapFile *may* be better here (see my answer f

Re: Running queries using index on HDFS

2011-07-25 Thread Joey Echeverria
To add to what Bobby said, you can get block locations with fs.getFileBlockLocations() if you want to open based on locality. -Joey On Mon, Jul 25, 2011 at 3:00 PM, Robert Evans wrote: > Sofia, > > You can access any HDFS file from a normal java application so long as your > classpath and some
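
A minimal sketch (the path is a placeholder; the surrounding method needs to throw IOException):

    FileSystem fs = FileSystem.get(conf);
    FileStatus stat = fs.getFileStatus(new Path("/data/index.bin"));
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      // Print each block's offset and the datanodes holding a replica
      System.out.println(b.getOffset() + " -> " + java.util.Arrays.toString(b.getHosts()));
    }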

Re: Hadoop-streaming with a c binary executable as a mapper

2011-07-22 Thread Joey Echeverria
Your executable needs to read lines from standard in. Try setting your mapper like this: > -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -" If that doesn't work, you may need to execute your C program from a shell script. The "-" I added to the command line says read from STDIN. -Joey On Jul 2

Re: Where to find best documentation for setting up kerberos authentication in 0.20.203.0rc1

2011-07-18 Thread Joey Echeverria
Hi Issac, I couldn't find anything specifically for the 0.20.203 release, but CDH3 uses basically the same security code. You could probably follow our security guide with the 0.20.203 release: https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide -Joey On Mon, Jul 18, 2011 at 12:15 PM, I

Re: replicate data in HDFS with smarter encoding

2011-07-18 Thread Joey Echeverria
Facebook contributed some code to do something similar called HDFS RAID: http://wiki.apache.org/hadoop/HDFS-RAID -Joey On Jul 18, 2011, at 3:41, Da Zheng wrote: > Hello, > > It seems that data replication in HDFS is simply data copy among nodes. Has > anyone considered to use a better encodi

Re: FW: type mismatch error

2011-07-12 Thread Joey Echeverria
Your map method is misnamed. It should be in all lower case. -Joey On Jul 12, 2011 2:46 AM, "Teng, James" wrote: > > hi, all. > I am a new hadoop beginner, I try to construct a map and reduce task to run, however encountered an exception while continue going further. > Exception: > java.io.IOExce
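
The fix, sketched (assuming the new mapreduce API; the key/value types are placeholders). Annotating with @Override makes the compiler reject a misnamed method:

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... your logic; a method named Map() is never called by the framework,
      // so the inherited default map() would run instead
    }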

Re: Cluster Tuning

2011-07-08 Thread Joey Echeverria
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. 1.0 means the maps have to completely finish before the reduce starts copying any data. I often run jobs with this set to 0.90 to 0.95. -Joey On Fri, Jul 8, 2011 at 11:25 AM, Juan P. wrote: > Here's another thought. I realized that
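
For reference, the setting in mapred-site.xml (or pass it per job with -D):

    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.95</value>
    </property>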

Re: HTTP Error

2011-07-08 Thread Joey Echeverria
It looks like both datanodes are trying to serve data out of the same directory. Is there any chance that both datanodes are using the same NFS mount for the dfs.data.dir? If not, what I would do is delete the data from ${dfs.data.dir} and then re-format the namenode. You'll lose all of your da

Re: Cluster Tuning

2011-07-07 Thread Joey Echeverria
Have you tried using a Combiner? Here's an example of using one: http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0 -Joey On Thu, Jul 7, 2011 at 4:29 PM, Juan P. wrote: > Hi guys! > > I'd like some help fine tuning my cluster. I currently have 20 boxes

Re: ArrayWritable usage

2011-07-04 Thread Joey Echeverria
ArrayWritable doesn't serialize type information. You need to subclass it (e.g. IntArrayWritable) and create a no arg constructor which calls super(IntWritable.class). Use this instead of ArrayWritable directly. If you want to store more than one type, look at the source for MapWritable to see how
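
The subclass described above, sketched:

    public static class IntArrayWritable extends ArrayWritable {
      public IntArrayWritable() {
        super(IntWritable.class);  // tells readFields() which element type to instantiate
      }
    }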

Re: Does hadoop-0.20-append compatible with PIG 0.8 ?

2011-07-02 Thread Joey Echeverria
Try replacing the hadoop jar from the pig lib directory with the one from your cluster. -Joey On Jul 2, 2011, at 0:38, praveenesh kumar wrote: > Hi guys.. > > > > I am previously using hadoop and Hbase... > > > > So for Hbase to run perfectly fine we need Hadoop-0.20-append for Hbase

Re: tar or hadoop archive

2011-06-27 Thread Joey Echeverria
Yes, you can see a picture describing HAR files in this old blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ -Joey On Mon, Jun 27, 2011 at 4:36 PM, Rita wrote: > So, it does an index of the file? > > > > On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria

Re: tar or hadoop archive

2011-06-27 Thread Joey Echeverria
The advantage of a hadoop archive files is it lets you access the files stored in it directly. For example, if you archived three files (a.txt, b.txt, c.txt) in an archive called foo.har. You could cat one of the three files using the hadoop command line: hadoop fs -cat har:///user/joey/out/foo.ha

Re: Append to Existing File

2011-06-21 Thread Joey Echeverria
Yes. -Joey On Jun 21, 2011 1:47 PM, "jagaran das" wrote: > Hi All, > > Does CDH3 support Existing File Append ? > > Regards, > Jagaran > > > > > From: Eric Charles > To: common-user@hadoop.apache.org > Sent: Tue, 21 June, 2011 3:53:33 AM > Subject: Re: Append to

Re: Datanode not created on hadoop-0.20.203.0

2011-06-16 Thread Joey Echeverria
the slaves file - >> >> Cheers - >> >> -Original Message- >> From: Joey Echeverria [mailto:j...@cloudera.com] >> Sent: Wednesday, June 15, 2011 12:01 PM >> To: common-user@hadoop.apache.org >> Subject: Re: Datanode not created on hadoop-0.20.203.0 >&

Re: problem with streaming and libjars

2011-06-16 Thread Joey Echeverria
I would try the following: hadoop -libjars /home/ayon/jars/MultiOutput.jar jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar -libjars /home/ayon/jars/MultiOutput.jar -input /user/ayon/streaming_test_input -output /user/ayon/streaming_test_output -mapper /bin/cat -reduce

Re: Datanode not created on hadoop-0.20.203.0

2011-06-15 Thread Joey Echeverria
By any chance, are you running as root? If so, try running as a different user. -Joey On Wed, Jun 15, 2011 at 12:53 PM, rutesh wrote: > Hi, > >   I am new to hadoop (Just 1 month old). These are the steps I followed to > install and run hadoop-0.20.203.0: > > 1) Downloaded tar file from > http:/

Re: a file can be used as a queue?

2011-06-13 Thread Joey Echeverria
This feature doesn't currently work. I don't remember the JIRA for it, but there's a ticket which will allow a reader to read from an HDFS file before it's closed. In that case, you implement a queue by having the producer write to the end of the file and the reader read from the beginning of th

Re: Hardware specs

2011-06-09 Thread Joey Echeverria
There are some good recommendations in this blog post: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/ It's a little dated, but the reasoning and basics are sound. -Joey On Thu, Jun 9, 2011 at 10:59 AM, Mark wrote: > Can someone give some

Re: Hbase startup error: NoNode for /hbase/master after running out of space

2011-06-08 Thread Joey Echeverria
Hey Andy, You're correct that 0.20.203 doesn't have append. Your best bet is to build a version of the append branch or switch to CDH3u0. -Joey On Tue, Jun 7, 2011 at 6:31 PM, Zhong, Sheng wrote: > Thanks! The issue has been resolved by removing some bad blks... > > But St.Ack, > > We do want a

Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Joey Echeverria
Most of the network bandwidth used during a MapReduce job should come from the shuffle/sort phase. This part doesn't use HDFS. The TaskTrackers running reduce tasks will pull intermediate results from TaskTrackers running map tasks over HTTP. In most cases, it's difficult to get rack locality durin

Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Joey Echeverria
Larger Hadoop installations are space dense, 20-40 nodes per rack. When you get to that density with multiple racks, it becomes expensive to buy a switch with enough capacity for all of the nodes in all of the racks. The typical solution is to install a switch per rack with uplinks to a core switch
