Re: Accessing HDFS files from a servlet

2012-04-13 Thread Edward Capriolo
http://www.edwardcapriolo.com/wiki/en/Tomcat_Hadoop Have all the hadoop jars and conf files in your classpath --or-- construct your own conf and URI programmatically: URI i = URI.create("hdfs://192.168.220.200:54310"); FileSystem fs = FileSystem.get(i,conf); On Fri, Apr 13, 2012 at 7:40 AM, Jessic
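A slightly fuller sketch of the programmatic option, e.g. from a servlet's init(); the namenode host/port is the one quoted above and the path is a placeholder:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // construct conf and URI programmatically instead of relying on conf files on the classpath
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://192.168.220.200:54310"); // namenode host:port
    FileSystem fs = FileSystem.get(uri, conf);
    FSDataInputStream in = fs.open(new Path("/some/file")); // stream this into the servlet response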

Re: Issue with loading the Snappy Codec

2012-04-15 Thread Edward Capriolo
You need three things. 1. Install snappy to a place the system can pick it up automatically, or add it to your java.library.path. Then add the full name of the codec to io.compression.codecs. hive> set io.compression.codecs; io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.h
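A minimal sketch of the configuration step, assuming the stock Hadoop codec class names (the list mirrors the hive output above):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // register Snappy by its full class name alongside the default codecs
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.SnappyCodec");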

Re: Hive Thrift help

2012-04-16 Thread Edward Capriolo
You can NOT connect to hive thrift to confirm its status. Thrift is thrift, not http. But you are right to say HiveServer does not produce any output by default. If netstat -nl | grep 1 shows status, it is up. On Mon, Apr 16, 2012 at 5:18 PM, Rahul Jain wrote: > I am assuming you read thru:
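A hedged sketch of the same liveness check in code, assuming HiveServer is on its default Thrift port 10000:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // succeeds only if something is listening on the Thrift port
    Socket s = new Socket();
    s.connect(new InetSocketAddress("localhost", 10000), 2000); // 2s timeout
    s.close();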

Re: Multiple data centre in Hadoop

2012-04-19 Thread Edward Capriolo
Hive is beginning to implement Region support where one metastore will manage multiple filesystems and jobtrackers. When a query creates a table it will then be copied to one or more datacenters. In addition the query planner will intelligently attempt to run queries in regions only where all the

Re: Feedback on real world production experience with Flume

2012-04-21 Thread Edward Capriolo
It seems pretty relevant. If you can directly log via NFS that is a viable alternative. On Sat, Apr 21, 2012 at 11:42 AM, alo alt wrote: > We decided NO product and vendor advertising on apache mailing lists! > I do not understand why you'll put that closed source stuff from your employe > in th

Re: Feedback on real world production experience with Flume

2012-04-22 Thread Edward Capriolo
, Alexander Lorenz wrote: > no. That is the Flume Open Source Mailinglist. Not a vendor list. > > NFS logging has nothing to do with decentralized collectors like Flume, JMS > or Scribe. > > sent via my mobile device > > On Apr 22, 2012, at 12:23 AM, Edward Capriolo wro

Re: hadoop.tmp.dir with multiple disks

2012-04-22 Thread Edward Capriolo
Since each hadoop task is isolated from others, having more tmp directories allows you to isolate that disk bandwidth as well. By listing the disks you give more firepower to shuffle-sorting and merging processes. Edward On Sun, Apr 22, 2012 at 10:02 AM, Jay Vyas wrote: > I don't understand why
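A minimal sketch of how that per-disk spreading is usually expressed, assuming 0.20-era property names and placeholder mount points:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // one local dir per physical disk lets shuffle/merge I/O fan out across spindles
    conf.set("mapred.local.dir",
        "/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local");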

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Edward Capriolo
Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to rf=2 'hadoop dfs -setrepl
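For the replication step, a minimal Java equivalent of the -setrep shell command (path is a placeholder; setReplication applies per file, so loop over a directory listing to apply it recursively):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // lower replication on an existing file to free space for the second cluster
    FileSystem fs = FileSystem.get(new Configuration());
    fs.setReplication(new Path("/user/data/part-00000"), (short) 2);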

Re: Splunk + Hadoop

2012-05-22 Thread Edward Capriolo
So a while back there was an article: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data I recently did my own take on full text searching your logs with solandra, though I have prototyped using solr inside datastax enterprise as well. http://www.edwardcap

Re: Problems with block compression using native codecs (Snappy, LZO) and MapFile.Reader.get()

2012-05-22 Thread Edward Capriolo
If you are getting a SIGSEGV it never hurts to try a more recent JVM. 21 has many bug fixes at this point. On Tue, May 22, 2012 at 11:45 AM, Jason B wrote: > JIRA entry created: > > https://issues.apache.org/jira/browse/HADOOP-8423 > > > On 5/21/12, Jason B wrote: >> Sorry about using attachment.

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

2012-05-31 Thread Edward Capriolo
We actually were in an Amazon/host-it-yourself debate with someone, which prompted us to do some calculations: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/myth_busters_ops_editition_is We calculated the cost for storage alone of 300 TB on ec2 as 585K a month! The cloud people hate

Re: Hadoop with Sharded MySql

2012-05-31 Thread Edward Capriolo
Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs. On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani wrote: > All, > > We are trying to implement sqoop in our environment which has 30 mysql > sharded databases and all t

Re: Ideal file size

2012-06-06 Thread Edward Capriolo
It does not matter much what the file size is, because each file is split into blocks, which are what the NN tracks. For larger deployments you can go with a large block size like 256MB or even 512MB. Generally the bigger the file the better; split calculation is very input format dependent, however.
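A minimal sketch of requesting a larger block size for a single file at create time; the path and sizes are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args: path, overwrite, buffer size, replication, block size (256MB here)
    FSDataOutputStream out = fs.create(new Path("/data/big-file"), true,
        conf.getInt("io.file.buffer.size", 4096), (short) 3, 256L * 1024 * 1024);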

Re: Setting number of mappers according to number of TextInput lines

2012-06-16 Thread Edward Capriolo
No. The number of lines is not known at planning time. All you know is the size of the blocks. You want to look at mapred.max.split.size. On Sat, Jun 16, 2012 at 5:31 AM, Ondřej Klimpera wrote: > I tried this approach, but the job is not distributed among 10 mapper nodes. > Seems Hadoop ignores
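A one-line sketch, assuming the 0.20-era property name mentioned above:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // cap the split size so even a small input is divided across several mappers
    conf.setLong("mapred.max.split.size", 16L * 1024 * 1024); // 16MB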

Re: Why not having mapred.tasktracker.tasks.maximum?

2010-06-11 Thread Edward Capriolo
On Fri, Jun 11, 2010 at 8:35 AM, Sébastien Rainville < sebastienrainvi...@gmail.com> wrote: > Hi, > > I'm playing around with the hadoop config to optimize the resources of our > cluster. I'm noticing that the cpu usage is sub-optimal. All the machines > in > the cluster have 1 quad core cpu. I lo

Re: Problems with HOD and HDFS

2010-06-14 Thread Edward Capriolo
On Mon, Jun 14, 2010 at 8:37 AM, Amr Awadallah wrote: > Dave, > > Yes, many others have the same situation, the recommended solution is > either to use the Fair Share Scheduler or the Capacity Scheduler. These > schedulers are much better than HOD since they take data locality into > considerati

Re: Task process exit with nonzero status of 1 - deleting userlogs helps

2010-06-14 Thread Edward Capriolo
On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann wrote: > Hi, > > i have running a 4-node cluster with hadoop-0.20.2. Now i suddenly run into > a situation where every task scheduled on 2 of the 4 nodes failed. > Seems like the child jvm crashes. There are no child logs under > logs/userlogs. T

Re: Using wget to download file from HDFS

2010-06-15 Thread Edward Capriolo
On Tue, Jun 15, 2010 at 12:30 PM, Jaydeep Ayachit < jaydeep_ayac...@persistent.co.in> wrote: > Thanks, data node may not be known. Is it possible to direct url to > namenode and namenode handling streaming by fetching data from various data > nodes? > > Regards > Jaydeep > > -Original Message-

Re: Problems with HOD and HDFS

2010-06-15 Thread Edward Capriolo
On Tue, Jun 15, 2010 at 3:10 PM, Jason Stowe wrote: > Hi David, > The original HOD project was integrated with Condor ( > http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters. > > A year or two ago, the Condor project in addition to being open-source w/o > costs for licensing,

Re: real world cluster configurations

2010-06-17 Thread Edward Capriolo
On Thu, Jun 17, 2010 at 7:51 PM, Corbin Hoenes wrote: > the documentation on > http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html#Configuration says > I should put io.sort.factor and io.sort.mb in core-site.xml but it > really should be in mapred-site.xml > > needs an update--eh? > R

Re: [JIRA] (COLL-117) seeing warnings "attempt to override final parameter: dfs.data.dir" in tasktracker logs

2010-06-21 Thread Edward Capriolo
On Mon, Jun 21, 2010 at 5:26 AM, Bikash Singhal wrote: > > Hi Hadoopers, > > I have received WARN in the hadoop cluster. Has anybody seen this? Any > solution? > > > > 2010-06-06 01:45:04,079 WARN org.apache.hadoop.conf.Configuration: > /var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/

Re: problem with rack-awareness

2010-07-02 Thread Edward Capriolo
On Fri, Jul 2, 2010 at 2:27 PM, Allen Wittenauer wrote: > > On Jul 1, 2010, at 7:50 PM, elton sky wrote: > >> hello, >> >> I am trying to separate my 6 nodes onto 2 different racks. >> For test purpose, I wrote a bash file which smply returns "rack0" all the >> time. And I add property "topology.s

Re: Does hadoop need to have ZooKeeper to work?

2010-07-05 Thread Edward Capriolo
On Mon, Jun 28, 2010 at 10:26 AM, Pierre ANCELOT wrote: > Hive depends on zookeeper though, if you plan to have it. > > > > On Mon, Jun 28, 2010 at 4:23 PM, Eason.Lee wrote: > >> No, they are separate projects! >> they don't depend on each other~~ >> >> 2010/6/28 legolas >> >> > >> > Hi, >> > >>

Re: Text files vs. SequenceFiles

2010-07-06 Thread Edward Capriolo
On Tue, Jul 6, 2010 at 10:56 AM, David Rosenstrauch wrote: > Thanks much for the helpful responses everyone. This very much helped > clarify our thinking on the code design. Sounds like all other things being > equal, sequence files are the way to go. Again, thanks again for the > advice, all.

Re: How do I remove "Non DFS Used"?

2010-07-07 Thread Edward Capriolo
On Wed, Jul 7, 2010 at 9:48 AM, Michael Segel wrote: > > Non DFS used tends to be logging or some other information on the disk. > > So you can't use hadoop commands to remove the files from the disk. > > > >> Date: Wed, 7 Jul 2010 17:11:38 +0900 >> Subject: How do I remove "Non DFS Used"? >> From

Re: rebalancing replication help

2010-07-07 Thread Edward Capriolo
On Wed, Jul 7, 2010 at 9:18 PM, Arun Ramakrishnan wrote: > Looks like there is not much activity in the hdfs-user list. So, am reposting > it in the general list. > > Hi guys. > I have a few related questions. I am going to layout the steps I have taken. > Please comment on what I can do better

HDFS moves and MapReduce jobs

2010-07-09 Thread Edward Capriolo
This is a question I should go and test out myself but was wondering if anyone has a quick answer. We have map/reduce jobs that produce lots of smaller files to a folder. We also have a hive external table pointed at this folder. We have a tool FileCrusher which is made to bunch up multiple small

Re: Setting different hadoop-env.sh for DataNode, TaskTracker

2010-07-13 Thread Edward Capriolo
On Tue, Jul 13, 2010 at 10:46 AM, Matt Pouttu-Clarke wrote: > Can anyone suggest a way to set different hadoop-env.sh values for DataNode > and TaskTracker without having to duplicate the whole Hadoop conf directory? > For example, to set a different HADOOP_NICENESS for DataNode and > TaskTracker.

Alternatives to start-all.sh stop-all.sh func

2010-07-16 Thread Edward Capriolo
I remember when I was first setting up a hadoop cluster wondering exactly what the SSH-KEYs did and why and if they were needed. start-all.sh and stop-all.sh are good for what they do but they are not very sophisticated. I wrote a blog about using func with hadoop to remotely manage your nodes. h

Re: Most Common ways to load data into Hadoop in production systems

2010-07-21 Thread Edward Capriolo
On Wed, Jul 21, 2010 at 12:42 PM, Xavier Stevens wrote: > Hi Urckle, > > A lot of the more "advanced" setups just record data directly to HDFS to > start with. You have to write some custom code using the HDFS API but > that way you don't need to import large masses of data. People also use > "

Re: Is it safe to set default/minimum replication to 2?

2010-07-21 Thread Edward Capriolo
On Wed, Jul 21, 2010 at 11:45 PM, Brian Bockelman wrote: > Hi Bobby, > > We keep 2 or so replicas here at Nebraska. We have about 800TB of raw space. > > As a rule of thumb, we: > 1) Increase the replication of extremely important files. We are a site for > the LHC, so a large part of our data

Re: mapred.userlog.retain.hours

2010-07-29 Thread Edward Capriolo
On Thu, Jul 29, 2010 at 1:30 PM, vishalsant wrote: > > I have changed on my namenode and the datanodes mapred-site.xml, to include > > > mapred.userlog.retain.hours > 2 > > > And yet my job xml retains 24. > > Am I doing anything wrong > -- > View this message in context: > http://ol

Re: Set variables in mapper

2010-08-02 Thread Edward Capriolo
On Mon, Aug 2, 2010 at 12:17 PM, Erik Test wrote: > Hi, > > I'm trying to set a variable in my mapper class by reading an argument from > the command line and then passing the entry to the mapper from main. Is this > possible? > >  public static void main(String[] args) throws Exception >  { >    
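For reference, the usual pattern for this (a minimal sketch; the property name "my.app.threshold" is arbitrary): set the value on the job's Configuration in main, then read it back in the mapper's setup():

    // in main(), before submitting the job
    conf.set("my.app.threshold", args[0]);

    // in the Mapper (new API)
    @Override
    protected void setup(Context context) {
      String threshold = context.getConfiguration().get("my.app.threshold");
    }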

Re: Combiner function

2010-08-02 Thread Edward Capriolo
On Mon, Aug 2, 2010 at 4:28 PM, Jackob Carlsson wrote: > Thanks Nick, but "in-memory" means a combiner can only be used over a single > mapper?right?! Is there a way we use it for several mappers as well? Also > what do you mean by "it may or may not run on a particular map attempt"? > > Br, > Jac

Re: Backing up HDFS

2010-08-03 Thread Edward Capriolo
On Tue, Aug 3, 2010 at 10:42 AM, Brian Bockelman wrote: > > On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote: > > > All of that said, what you're protecting against here is permanent loss of a > data center and human error. Disk, rack, and node level failures are already > handled by HDFS when prope

Re: Backing up HDFS

2010-08-03 Thread Edward Capriolo
On Tue, Aug 3, 2010 at 11:46 AM, Michael Segel wrote: > > > >> Date: Tue, 3 Aug 2010 11:02:48 -0400 >> Subject: Re: Backing up HDFS >> From: edlinuxg...@gmail.com >> To: common-user@hadoop.apache.org >> > >> Assuming you are taking the distcp approach you can mirror your >> cluster with some scrip

Re: Best practices - Large Hadoop Cluster

2010-08-10 Thread Edward Capriolo
On Tue, Aug 10, 2010 at 10:01 AM, Brian Bockelman wrote: > Hi Raj, > > I believe the best practice is to *not* start up Hadoop over SSH. Set it up > as a system service and let your configuration management software take care > of it. > > You probably want to look at ROCKS or one of its variant

Re: Scheduler recommendation

2010-08-11 Thread Edward Capriolo
Sorry I can not speak for the capacity scheduler. We use the fair share and I just modified the configuration, so I figured I would chime in. We have the same use case as you: production map/reduce jobs, production hive jobs, as well as ad hoc hive jobs. We broke our jobs into two classes: those tha

Re: namenode crash in centos. can anybody recommend jdk ?

2010-08-16 Thread Edward Capriolo
On Mon, Aug 16, 2010 at 10:51 AM, Michael Thomas wrote: > We recently discovered the same thing happen to our SNN as well.  Heap too > small == 100% cpu utilization and no checkpoints. > > --Mike > > On 08/16/2010 06:35 AM, Brian Bockelman wrote: >> >> By the way, >> >> Our experience is that if y

Re: Supersede a data node help: how to move all files out of a Hadoop data node?

2010-08-20 Thread Edward Capriolo
On Fri, Aug 20, 2010 at 4:31 PM, jiang licht wrote: > Requirement: I want to get rid of a data node machine. But it has useful data > that is still in use. So, I want to move all its files/blocks to other live > data nodes in the same cluster. > > Question: I understand that if a data node is do

Re: what will happen if a backup name node folder becomes unaccessible?

2010-08-20 Thread Edward Capriolo
On Fri, Aug 20, 2010 at 4:56 PM, jiang licht wrote: > Using nfs folder to back up dfs meta information as follows, > > >         dfs.name.dir >         /hadoop/dfs/name,/hadoop-backup/dfs/name >     > > where /hadoop-backup is on a backup machine and mounted on the master node. > > I have a ques

Re: Configure Secondary Namenode

2010-08-22 Thread Edward Capriolo
2010/8/18 xiujin yang : > > Hi Adarsh, > > Please check start-dfs.sh > > You will find > > "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt > "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode > $dataStartOpt > "$bin"/hadoop-daemons.sh --config $HADOOP_

Re: Hadoop startup problem - directory name required

2010-08-23 Thread Edward Capriolo
On Mon, Aug 23, 2010 at 12:09 PM, cliff palmer wrote: > The 3 *-site.xml files are in the /etc/hadoop-0.20/conf directory. I've > confirmed that these are the files being used. > Thanks again. > Cliff > > On Mon, Aug 23, 2010 at 10:26 AM, Harsh J wrote: > >> Can you confirm that this is the righ

Re: what will happen if a backup name node folder becomes unaccessible?

2010-08-23 Thread Edward Capriolo
On Mon, Aug 23, 2010 at 3:05 PM, Michael Segel wrote: > > Ok... > > Now you have me confused. > Everything we've seen says that writing to both a local disk and to an NFS > mounted disk would be the best way to prevent a problem. > > Now you and Harsh J say that this could actually be problematic

Re: what will happen if a backup name node folder becomes unaccessible?

2010-08-24 Thread Edward Capriolo
On Tue, Aug 24, 2010 at 1:38 PM, jiang licht wrote: > Sudhir, > > Look forward to your results, if possible with different CDH releases. > > Thanks, > > Michael > > --- On Tue, 8/24/10, Sudhir Vallamkondu > wrote: > > From: Sudhir Vallamkondu > Subject: Re: what will happen if a backup name nod

Re: how to revert from a new version to an older one (CDH3)?

2010-08-24 Thread Edward Capriolo
On Tue, Aug 24, 2010 at 1:36 PM, jiang licht wrote: > Thanks Sudhir and Michael. I want to replace a new release of CDH3 > (0.20.2+320) to a previous release of CDH3 (0.20.2+228). The problem is that > there is no installation package for previous release of CDH3 and no source > to rebuild from

Re: what will happen if a backup name node folder becomes unaccessible?

2010-08-27 Thread Edward Capriolo
On Tue, Aug 24, 2010 at 7:59 PM, Sudhir Vallamkondu wrote: > The cloudera distribution seems to be working fine when a dfs.name.dir > directory is inaccessible in midst of namenode running. > > See below > > had...@training-vm:~$ hadoop version > Hadoop 0.20.1+152 > Subversion -r c15291d10caa19c2

Re: what will happen if a backup name node folder becomes unaccessible?

2010-08-27 Thread Edward Capriolo
On Fri, Aug 27, 2010 at 8:30 PM, jiang licht wrote: > The same behavior is seen in CDH3 hadoop-0.20.2+228 if a mounted nfs folder > for dfs.name.dir is not available when a name node starts... > > Michael > > --- On Fri, 8/27/10, Edward Capriolo wrote: > > From: Edward

Re: accounts permission on hadoop

2010-08-31 Thread Edward Capriolo
On Tue, Aug 31, 2010 at 5:07 PM, Gang Luo wrote: > Hi all, > I am the administrator of a hadoop cluster. I want to know how to specify a > group a user belong to. Or hadoop just use the group/user information from the > linux system it runs on? For example, if a user 'smith' belongs to a group > '

DataDrivenInputFormat setInput with boundingQuery

2010-08-31 Thread Edward Capriolo
I am working with DataDrivenDBInputFormat from trunk. None of the unit tests seem to test the bounded queries. Configuration conf = new Configuration(); Job job = new Job(conf); job.setJarByClass(TestZ.class); job.setInputFormatClass(DataDrivenDBInput
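A hedged sketch of the bounded-query setup being tested, using the driver and queries quoted in the later thread; MyRecord is a hypothetical DBWritable class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

    Job job = new Job(new Configuration());
    job.setInputFormatClass(DataDrivenDBInputFormat.class);
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/test");
    // the bounding query returns MIN/MAX of the split column; $CONDITIONS is filled in per split
    DataDrivenDBInputFormat.setInput(job, MyRecord.class,
        "SELECT id, name FROM name WHERE $CONDITIONS",
        "SELECT MIN(id), MAX(id) FROM name");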

Re: DataDrivenInputFormat setInput with boundingQuery

2010-08-31 Thread Edward Capriolo
On Tue, Aug 31, 2010 at 10:32 PM, Edward Capriolo wrote: > I am working with DataDrivenOutputFormat from trunk. None of the unit > tests seem to test the bounded queries > > Configuration conf = new Configuration(); >                Job job = new Job(conf); >              

Why does Generic Options Parser only take the first -D option?

2010-09-02 Thread Edward Capriolo
This is 0.20.0. I have an eclipse run configuration passing these as arguments: -D hive2rdbms.jdbc.driver="com.mysql.jdbc.Driver" -D hive2rdbms.connection.url="jdbc:mysql://localhost:3306/test" -D hive2rdbms.data.query="SELECT id,name FROM name WHERE $CONDITIONS" -D hive2rdbms.bounding.query="SELECT
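A minimal sketch to check which -D options actually land in the conf (OptionCheck is a hypothetical class name; run it with the same arguments, and note generic options must come before any application arguments):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class OptionCheck extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // each -D key=value should be visible here if parsing worked
        System.out.println(conf.get("hive2rdbms.jdbc.driver"));
        System.out.println(conf.get("hive2rdbms.connection.url"));
        return 0;
      }
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new OptionCheck(), args); // ToolRunner applies GenericOptionsParser
      }
    }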

Re: Why does Generic Options Parser only take the first -D option?

2010-09-03 Thread Edward Capriolo
> *      for(String prop : property) { >        String[] keyval = prop.split("=", 2); >        if (keyval.length == 2) { >          conf.set(keyval[0], keyval[1]); >        } >      } >    } > You can add a log after the bold line to verify that all -D options are > returne

Re: Re: namenode consume quite a lot of memory with only serveral hundredsof files in it

2010-09-07 Thread Edward Capriolo
The fact that the memory is high is not necessarily a bad thing. Faster garbage collection implies more CPU usage. I had some success following the tuning advice here, to make my memory usage less spikey http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html Again, less spikes != better perf

Re: SequenceFile Header

2010-09-08 Thread Edward Capriolo
On Wed, Sep 8, 2010 at 1:06 PM, Matthew John wrote: > Hi guys, > > I m trying to run a sort on a metafile which had a record consisting of a > key<8 bytes> and a value<32 bytes>. Sort will be with respect to the key. > But my input file does not have a header. So inorder to avail the use of > Sequ

Re: How to disable secondary node

2010-09-09 Thread Edward Capriolo
It is a bad idea to permanently disable the 2NN. The edits file grows very, very large and will not be processed until the namenode restarts. We had a 12GB edit file that took 40 minutes of downtime to process. On Thu, Sep 9, 2010 at 3:08 AM, Jeff Zhang wrote: > then, do not start secondary namenode >

Re: Cannot run program "bash": java.io.IOException

2010-09-18 Thread Edward Capriolo
This happens because child processes try to allocate the same memory as the parent. One way to solve this is enabling memory overcommit on your linux system. On Sat, Sep 18, 2010 at 4:47 AM, Bradford Stephens wrote: > Hey guys, > > I'm running into issues when doing a moderate-size EMR job on 12 m

A new way to merge up those small files!

2010-09-24 Thread Edward Capriolo
Many times a hadoop job produces a file per reducer and the job has many reducers. Or a map-only job produces one output file per input file and you have many input files. Or you just have many small files from some external process. Hadoop has sub-optimal handling of small files. There are some ways to han
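One built-in option worth noting alongside the tool announced here: a sketch using FileUtil.copyMerge from 0.20-era Hadoop, which simply concatenates a directory of files into one (fine for plain text, but not for SequenceFiles, which the tool above handles; paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // concatenate every file under /in/small-files into one output file
    FileUtil.copyMerge(fs, new Path("/in/small-files"),
                       fs, new Path("/out/merged"), false, conf, null);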

Re: hd fs -head?

2010-09-27 Thread Edward Capriolo
On Mon, Sep 27, 2010 at 3:23 AM, Keith Wiley wrote: > Is there a particularly good reason for why the "hadoop fs" command supports > -cat and -tail, but not -head? > > > Keith Wiley kwi...@keithwiley.com keith
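A minimal sketch of a do-it-yourself head over HDFS, printing the first kilobyte of a file (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/logs/part-00000"));
    byte[] buf = new byte[1024];
    int n = in.read(buf); // read at most the first 1KB
    if (n > 0) System.out.write(buf, 0, n);
    in.close();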

Re: A new way to merge up those small files!

2010-09-27 Thread Edward Capriolo
's magic number so that it can attempt to *detect* the > type of the file. > > Cheers > > On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo > wrote: > >> Many times a hadoop job produces a file per reducer and the job has >> many reducers. Or a map only job one out

Re: hd fs -head?

2010-09-27 Thread Edward Capriolo
On Mon, Sep 27, 2010 at 11:13 AM, Keith Wiley wrote: > On 2010, Sep 27, at 7:02 AM, Edward Capriolo wrote: > >> On Mon, Sep 27, 2010 at 3:23 AM, Keith Wiley >> wrote: >>> >>> Is there a particularly good reason for why the "hadoop fs" command

Re: Why hadoop is written in java?

2010-10-12 Thread Edward Capriolo
On Tue, Oct 12, 2010 at 12:20 AM, Chris Dyer wrote: > The Java memory overhead is a quite serious problem, and a legitimate > and serious criticism of Hadoop. For MapReduce applications, it is > often (although not always) possible to improve performance by doing > more work in memory (e.g., using

Re: 0.21 found interface but class was expected

2010-11-13 Thread Edward Capriolo
On Sat, Nov 13, 2010 at 9:50 PM, Todd Lipcon wrote: > We do have policies against breaking APIs between consecutive major versions > except for very rare exceptions (eg UnixUserGroupInformation went away when > security was added). > > We do *not* have any current policies that existing code can w

Re: Caution using Hadoop 0.21

2010-11-13 Thread Edward Capriolo
On Sat, Nov 13, 2010 at 4:33 PM, Shi Yu wrote: > I agree with Steve. That's why I am still using 0.19.2 in my production. > > Shi > > On 2010-11-13 12:36, Steve Lewis wrote: >> >> Our group made a very poorly considered decision to build out cluster >> using >> Hadoop 0.21 >> We discovered that a

Re: small files and number of mappers

2010-11-30 Thread Edward Capriolo
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J wrote: > Hey, > > On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese > wrote: >> >> Hey there, >> I am doing some tests and wandering which are the best practices to deal >> with very small files which are continuously being generated(1Mb or even >> less). >

Re: HDFS and libhfds

2010-12-07 Thread Edward Capriolo
2010/12/7 Petrucci Andreas : > > hello there, im trying to compile libhdfs but there are some > problems. According to http://wiki.apache.org/hadoop/MountableHDFS i have > already installed fuse. With ant compile-c++-libhdfs -Dlibhdfs=1 the build is > successful. > > However when i tr

Re: Topology : Script Based Mapping

2010-12-29 Thread Edward Capriolo
On Tue, Dec 28, 2010 at 11:36 PM, Hemanth Yamijala wrote: > Hi, > > On Tue, Dec 28, 2010 at 6:03 PM, Rajgopal Vaithiyanathan > wrote: >> I wrote a script to map the IP's to a rack. The script is as follows. : >> >> for i in $* ; do >>        topo=`echo $i | cut -d"." -f1,2,3 | sed 's/\./-/g'` >>

Re: new mapreduce API and NLineInputFormat

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 5:05 PM, Attila Csordas wrote: > Hi, > > what other jars should be added to the build path from 0.21.0 > besides hadoop-common-0.21.0.jar in order to make 0.21.0 NLineInputFormat > work in 0.20.2 as suggested below? > > Generally can somebody provide me a working example co

Re: No locks available

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 8:13 AM, Adarsh Sharma wrote: > Harsh J wrote: >> >> Could you re-check your permissions on the $(dfs.data.dir)s for your >> failing DataNode versus the user that runs it? >> >> On Mon, Jan 17, 2011 at 6:33 PM, Adarsh Sharma >> wrote: >> >>> >>> Can i know why it occurs. >

Re: Why Hadoop is slow in Cloud

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran wrote: > On 17/01/11 04:11, Adarsh Sharma wrote: >> >> Dear all, >> >> Yesterday I performed a kind of testing between *Hadoop in Standalone >> Servers* & *Hadoop in Cloud. >> >> *I establish a Hadoop cluster of 4 nodes ( Standalone Machines ) in >>

Re: Why Hadoop is slow in Cloud

2011-01-19 Thread Edward Capriolo
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino wrote: > On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma > wrote: >> I want to know *AT WHAT COSTS* it comes. >> 10-15% is tolerable but at this rate, it needs some work. >> >> As Steve rightly suggests, I am in some CPU bound testing work to

Re: Hive rc location

2011-01-21 Thread Edward Capriolo
On Fri, Jan 21, 2011 at 9:56 AM, abhatna...@vantage.com wrote: > > Where is this file located? > > Also does anyone has a sample > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Hive-rc-tp2296028p2302262.html > Sent from the Hadoop lucene-users mailing list archive at Nab

Re: How to get metrics information?

2011-01-23 Thread Edward Capriolo
On Sat, Jan 22, 2011 at 9:59 PM, Ted Yu wrote: > In the test code, JobTracker is returned from: > >        mr = new MiniMRCluster(0, 0, 0, "file:///", 1, null, null, null, > conf); >        jobTracker = mr.getJobTrackerRunner().getJobTracker(); > > I guess it is not exposed in non-test code. > > O

Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Edward Capriolo
On Thu, Jan 27, 2011 at 5:42 AM, Steve Loughran wrote: > On 27/01/11 07:28, Manuel Meßner wrote: >> >> Hi, >> >> you may want to take a look into the streaming api, which allows users >> to write there map-reduce jobs with any language, which is capable of >> writing to stdout and reading from std

Re: recommendation on HDDs

2011-02-12 Thread Edward Capriolo
On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning wrote: > Bandwidth is definitely better with more active spindles. I would recommend > several larger disks. The cost is very nearly the same. > > On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi wrote: > >> Thanks for your inputs, Michael. We have 6

Re: Hadoop and image processing?

2011-03-03 Thread Edward Capriolo
On Thu, Mar 3, 2011 at 10:00 AM, Tom Deutsch wrote: > Along with Brian I'd also suggest it depends on what you are doing with > the images, but we used Hadoop specifically for this purpose in several > solutions we build to do advanced imaging processing. Both scale out > ability to large data vol

Re: Reason of Formatting Namenode

2011-03-10 Thread Edward Capriolo
On Thu, Mar 10, 2011 at 12:48 AM, Adarsh Sharma wrote: > Thanks Harsh, i.e. why if we again format namenode after loading some data > INCOMPATIBLE NAMESPACE ID's error occurs. > > > Best Regards, > > Adarsh Sharma > > > > > Harsh J wrote: >> >> Formatting the NameNode initializes the FSNameSystem in

Re: Anyone knows how to attach a figure on Hadoop Wiki page?

2011-03-14 Thread Edward Capriolo
On Mon, Mar 14, 2011 at 1:23 PM, He Chen wrote: > Hi all > > Any suggestions? > > Bests > > Chen > Images have been banned.

Re: check if a sequenceFile is corrupted

2011-03-17 Thread Edward Capriolo
On Thursday, March 17, 2011, Marc Sturlese wrote: > Is there any way to check if a seqfile is corrupted without iterating over all > its keys/values till it crashes? > I've seen that I can get an IOException when opening it or an IOException > reading the X key/value (depending on when it was corrup

Re: how to get rid of attempt_201101170925_****_m_**** directories safely?

2011-03-17 Thread Edward Capriolo
On Thu, Mar 17, 2011 at 1:20 PM, jigar shah wrote: > Hi, > >    we are running a 50 node hadoop cluster and have a problem with these > attempt directories piling up(for eg attempt_201101170925_126956_m_000232_0) > and taking a lot of space. when i restart the tasktracker daemon these > directorie

Re: Is anyone running Hadoop 0.21.0 on Solaris 10 X64?

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 10:43 AM, XiaoboGu wrote: > I have trouble browsing the file system via the namenode web interface, namenode > saying in the log file that the –G option is invalid to get the groups for the > user. > > I thought this was not the case any more but hadoop forks to the 'id' command t

How is hadoop going to handle the next generation disks?

2011-04-07 Thread Edward Capriolo
I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste tons of disk io doing a 'du -sk' of each data directory. Instead of 'du -sk' why not just do this with java.io.File? How is this going to work with 4TB, 8TB disks and up? It seems like calculating used and free disk space could
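A minimal sketch of the java.io.File alternative (Java 6+); note these calls report df-style partition totals rather than the per-directory usage du computes, which is the trade-off in question (the data dir path is a placeholder):

    java.io.File dataDir = new java.io.File("/data/1/dfs/dn");
    long total = dataDir.getTotalSpace();   // size of the partition, in bytes
    long usable = dataDir.getUsableSpace(); // free space available to this JVM
    System.out.println("approx used: " + (total - usable) + " bytes");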

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread Edward Capriolo
>> inodes/dentries are almost always cached so calculating the 'du -sk' on a >> host even with hundreds of thousands of files the du -sk generally uses high >> i/o for a couple of seconds. I am using 2TB disks too. >> Sridhar >> >> >> On Fri, Apr 8,

Re: Memory mapped resources

2011-04-11 Thread Edward Capriolo
On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen wrote: > Yes you can however it will require customization of HDFS.  Take a > look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch. >  I have been altering it for use with HBASE-3529.  Note that the patch > noted is for the -append

Hadoop and WikiLeaks

2011-05-18 Thread Edward Capriolo
http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the "innovator of the year" award from the UK's Guardian newspaper, where it was described as "had the potential as a greater c

Re: Hadoop and WikiLeaks

2011-05-19 Thread Edward Capriolo
ependent of Hadoop before the recent update. > > > > > > On Thu, May 19, 2011 at 4:18 AM, Steve Loughran > wrote: > > > > > On 18/05/11 18:05, javam...@cox.net wrote: > > > > > >> Yes! > > >> > > >> -Pete > >

Re: Using df instead of du to calculate datanode space

2011-05-21 Thread Edward Capriolo
Good job. I brought this up in another thread, but was told it was not a problem. Good thing I'm not crazy. On Sat, May 21, 2011 at 12:42 AM, Joe Stein wrote: > I came up with a nice little hack to trick hadoop into calculating disk > usage with df instead of du > > > http://allthingshadoop.com/2

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sat, May 21, 2011 at 4:13 PM, highpointe wrote: > >>> Does this copy text bother anyone else? Sure winning any award is great > >>> but > >>> does hadoop want to be associated with "innovation" like WikiLeaks? > >>> > > > > [Only] through the free distribution of information, the guaranteed >

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sun, May 22, 2011 at 7:29 PM, Todd Lipcon wrote: > C'mon guys -- while this is of course an interesting debate, can we > please keep it off common-user? > > -Todd > > On Sun, May 22, 2011 at 3:30 PM, Edward Capriolo > wrote: > > On Sat, May 21, 2011

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sun, May 22, 2011 at 7:48 PM, Konstantin Boudnik wrote: > On Sun, May 22, 2011 at 15:30, Edward Capriolo > wrote: > but for the > > reasons I outlined above I would not want to be associated with them at > all. > > "I give no damn about your opinion, but I will d

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sun, May 22, 2011 at 8:44 PM, Todd Lipcon wrote: > On Sun, May 22, 2011 at 5:10 PM, Edward Capriolo > wrote: > > > > Correct. But it is a place to discuss changing the content of > > http://hadoop.apache.org which is what I am advocating. > > > > Fair en

Re: Why don't my jobs get preempted?

2011-05-31 Thread Edward Capriolo
On Tue, May 31, 2011 at 2:50 PM, W.P. McNeill wrote: > I'm launching long-running tasks on a cluster running the Fair Scheduler. > As I understand it, the Fair Scheduler is preemptive. What I expect to see > is that my long-running jobs sometimes get killed to make room for other > people's jobs

Hadoop Filecrusher! V2 Released!

2011-06-01 Thread Edward Capriolo
All, You know the story: You have data files that are created every 5 minutes. You have hundreds of servers. You want to put those files in hadoop. Eventually: You get lots of files and blocks. Your namenode and secondary namenode need more memory (BTW JVMs have issues at large Xmx values). You

Re: Verbose screen logging on hadoop-0.20.203.0

2011-06-05 Thread Edward Capriolo
On Sun, Jun 5, 2011 at 1:04 PM, Shi Yu wrote: > We just upgraded from 0.20.2 to hadoop-0.20.203.0 > > Running the same code ends up with a massive amount of debug > information on the screen output. Normally this type of > information is written to the logs/userlogs directory. However, > nothing is writte

Re: NameNode heapsize

2011-06-10 Thread Edward Capriolo
On Fri, Jun 10, 2011 at 8:22 AM, Brian Bockelman wrote: > > On Jun 10, 2011, at 6:32 AM, si...@ugcv.com wrote: > > > Dear all, > > > > I'm looking for ways to improve the namenode heap size usage of a > 800-node 10PB testing Hadoop cluster that stores > > around 30 million files. > > > > Here's so
