Re: Getting job progress in java application

2012-04-29 Thread Bill Graham
Take a look at the JobClient API. You can use that to get the current progress of a running job. On Sunday, April 29, 2012, Ondřej Klimpera wrote: Hello, I'd like to ask you what the preferred way is of getting a running job's progress from the Java application that has executed it. I'm using
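A minimal sketch of the JobClient approach against the old org.apache.hadoop.mapred API; the job ID is passed in as an argument, and the JobConf is assumed to pick up the cluster's settings from the classpath configs:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobProgress {
        public static void main(String[] args) throws Exception {
            // JobConf reads mapred.job.tracker etc. from the config files
            JobClient client = new JobClient(new JobConf());
            // Look up the running job by its ID, e.g. job_201204290001_0001
            RunningJob job = client.getJob(JobID.forName(args[0]));
            if (job != null) {
                System.out.printf("map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            }
        }
    }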

Re: Feedback on real world production experience with Flume

2012-04-22 Thread Bill Graham
+1 on Edward's comment. The MapR comment was relevant and informative and the original poster never said he was only interested in open source options. On Sunday, April 22, 2012, Michael Segel wrote: Gee Edward, what about putting a link to a company website or your blog in your signature...

Re: [Blog Post]: Accumulo and Pig play together now

2012-03-02 Thread Bill Graham
- bcc: u...@nutch.apache.org common-user@hadoop.apache.org This is great, Jason. One thing to add, though, is this line in your Pig script: SET mapred.map.tasks.speculative.execution false Otherwise you're likely to get duplicate writes into Accumulo, since speculative execution launches duplicate attempts of the same map task and each attempt writes its output again. On Fri, Mar 2, 2012 at 5:48 AM, Jason

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Bill Graham
You might want to check out File Crusher: http://www.jointhegrid.com/hadoop_filecrush/index.jsp I've never used it, but it sounds like it could be helpful. On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit, AFAIK XMLLoader in Pig won't be suited for
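One common hand-rolled alternative to a tool like File Crusher is to pack the small files into a single SequenceFile keyed by filename. A rough sketch, assuming the old FileSystem/SequenceFile APIs (the paths are illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path(args[0]);   // directory of small files
            Path out = new Path(args[1]);  // consolidated sequence file
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
            try {
                for (FileStatus status : fs.listStatus(in)) {
                    if (status.isDir()) continue;
                    // Buffer each small file and append it keyed by name
                    InputStream is = fs.open(status.getPath());
                    ByteArrayOutputStream baos = new ByteArrayOutputStream();
                    IOUtils.copyBytes(is, baos, conf, true);
                    writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(baos.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }
    }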

Re: How to delete files older than X days in HDFS/Hadoop

2011-11-27 Thread Bill Graham
If you're able to put your data in directories named by date (e.g. MMdd), you can take advantage of the fact that the HDFS client will return directories in sort order of the name, which returns the most recent dirs last. You can then cron a bash script that deletes all but the last N
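The same idea sketched in Java rather than bash, leaning on the name-sorted order of listStatus described above; the root path and N are arguments, and the date format in the comment is only an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetainLastN {
        public static void main(String[] args) throws Exception {
            Path root = new Path(args[0]);        // parent of the date-named dirs
            int keep = Integer.parseInt(args[1]); // how many recent dirs to retain
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] dirs = fs.listStatus(root);
            if (dirs == null) return;             // root doesn't exist
            // Names like 20111127 sort lexicographically, oldest first
            for (int i = 0; i < dirs.length - keep; i++) {
                fs.delete(dirs[i].getPath(), true); // true = recursive
            }
        }
    }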

Re: Why hadoop should be built on JAVA?

2011-08-16 Thread Bill Graham
There was a fairly long discussion on this topic at the beginning of the year, FYI: http://search-hadoop.com/m/JvSQe2wNlY11 On Mon, Aug 15, 2011 at 9:00 PM, Chris Song sjh...@gmail.com wrote: Why should Hadoop be built in Java? For integrity and stability, it is good for Hadoop to be

Re: Distcp failure - Server returned HTTP response code: 500

2011-05-18 Thread Bill Graham
Are you able to distcp folders that don't have special characters? What are the versions of the two clusters, and are you running the copy on the destination cluster if there's a mismatch? If there is, you'll need to use hftp: http://hadoop.apache.org/common/docs/current/distcp.html#cpver On Wed, May 18,
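Per the linked docs, hftp is read-only, so in the mismatched-version case the copy runs on the destination cluster and reads from the source over hftp. An invocation would look roughly like this; hostnames and ports are placeholders (hftp goes through the namenode's HTTP port):

    hadoop distcp hftp://source-nn:50070/src/dir hdfs://dest-nn:8020/dest/dir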

Re: Including Additional Jars

2011-04-06 Thread Bill Graham
If you could share more specifics regarding just how it's not working (e.g., job specifics, stack traces, how you're invoking it, etc.), you might get more assistance in troubleshooting. On Wed, Apr 6, 2011 at 1:44 AM, Shuja Rehman shujamug...@gmail.com wrote: -libjars is not working nor

Re: Including Additional Jars

2011-04-06 Thread Bill Graham
param2 param3, but the program is still giving the error and does not find mylib.jar. Can you confirm the syntax of the command? Thanks. On Wed, Apr 6, 2011 at 8:29 PM, Bill Graham billgra...@gmail.com wrote: If you could share more specifics regarding just how it's not working (e.g., job specifics
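For what it's worth, a -libjars invocation generally takes the shape below (all names are placeholders). The flag is consumed by GenericOptionsParser, so the driver class must run through ToolRunner, and -libjars has to appear before the application's own arguments:

    hadoop jar myjob.jar com.example.MyDriver -libjars /local/path/mylib.jar param1 param2 param3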

Re: Including Additional Jars

2011-04-04 Thread Bill Graham
Shuja, I haven't tried this, but from what I've read it seems you could just add all the jars required by the Mapper and Reducer to HDFS and then add them to the classpath in your run() method like this: DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job); I think that's all
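Fleshed out, that suggestion might look like the sketch below (old mapred API; the HDFS path and job wiring are illustrative, and passing the JobConf works because it extends Configuration):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            JobConf job = new JobConf(getConf(), MyJob.class);
            // The jar must already sit in HDFS at this (illustrative) path
            DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
            // ... configure mapper, reducer, input and output paths here ...
            JobClient.runJob(job);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJob(), args));
        }
    }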

Re: Chukwa setup issues

2011-04-01 Thread Bill Graham
Unfortunately conf/collectors is used in two different ways in Chukwa, each with a different syntax. This should really be fixed. 1. The script that starts the collectors looks at it for a list of hostnames (no ports) to start collectors on. To start it just on one host, set it to localhost. 2.

Re: Chukwa - Lightweight agents

2011-03-20 Thread Bill Graham
Yes, we run lightweight Chukwa agents and collectors only, using Chukwa just as you describe. We've been doing so for over a year without many issues. The code is fairly easy to extend when needed. We rolled our own collector, agent and demux RPMs. The monitoring piece of Chukwa is

Re: Chukwa?

2011-03-20 Thread Bill Graham
Chukwa hasn't had a release since moving from Hadoop to the Incubator, so there are no releases in the /incubator repos. Follow the link on the Chukwa homepage to the download repos: http://incubator.apache.org/chukwa/ http://www.apache.org/dyn/closer.cgi/hadoop/chukwa/chukwa-0.4.0 On Sun, Mar 20,

Re: Hadoop exercises

2011-01-05 Thread Bill Graham
For the even lazier, you could give both the test data and the expected output data. That way they know for sure if they got it right. This also promotes a good testing best practice, which is to assert against an expected set of results. On Wed, Jan 5, 2011 at 9:19 AM, Mark Kerzner
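As a tiny illustration of that practice, an exercise could ship with a JUnit test like this, where runExercise is a hypothetical hook the student implements:

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import java.util.List;
    import org.junit.Test;

    public class ExerciseTest {
        @Test
        public void outputMatchesExpected() throws Exception {
            // Expected results shipped alongside the test input
            List<String> expected = Arrays.asList("hadoop\t2", "pig\t1");
            assertEquals(expected, runExercise("test-input.txt"));
        }

        // Student-implemented; stubbed so the class compiles as-is
        private List<String> runExercise(String inputFile) {
            throw new UnsupportedOperationException("implement me");
        }
    }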

Re: Real-time log processing in Hadoop

2010-09-06 Thread Bill Graham
We're using Chukwa to do steps a-d before writing summary data into MySQL. Data is written into new directories every 5 minutes. Our MR jobs and data load into MySQL take 5 minutes, so after a 5-minute window closes, we typically have summary data from that interval in MySQL about a few minutes

JIRA down

2010-08-25 Thread Bill Graham
JIRA seems to be down, FYI. Database errors are being returned: Cause: org.apache.commons.lang.exception.NestableRuntimeException: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a connection with the database. (FATAL:

Re: Reopen and append to SequenceFile

2010-08-23 Thread Bill Graham
Chukwa also has a JMSAdaptor that can listen to an AMQ queue and stream the messages to one or more collectors, to then be persisted as sequence files. On Fri, Aug 20, 2010 at 3:29 AM, cliff palmer palmercl...@gmail.com wrote: You may want to consider using something like the *nix tee command to save

Re: Changing hostnames of tasktracker/datanode nodes - any problems?

2010-08-10 Thread Bill Graham
Sorry to hijack the thread but I have a similar use case. In a few months we're going to be moving colos. The new cluster will be the same size as the current cluster and some downtime is acceptable. The hostnames will be different. From what I've read in this thread it seems like it would be

Re: Changing hostnames of tasktracker/datanode nodes - any problems?

2010-08-10 Thread Bill Graham
Ahh yes of course, distcp. Thanks! On Tue, Aug 10, 2010 at 11:01 AM, Allen Wittenauer awittena...@linkedin.com wrote: On Aug 10, 2010, at 10:54 AM, Bill Graham wrote: Is it correct to say that that would work fine? We have a replication factor of 2, so we'd be copying twice as much data

Re: Chukwa questions

2010-07-09 Thread Bill Graham
Your understanding of how Chukwa works is correct. Hadoop by itself is a system that contains both HDFS and MapReduce. The other projects you list are all built upon Hadoop, but you don't need them to run or to get value out of Hadoop by itself. To run the Chukwa agent

Re: How to write log4j output to HDFS?

2010-04-21 Thread Bill Graham
Hi, Check out Chukwa: http://hadoop.apache.org/chukwa/docs/r0.3.0/design.html#Introduction It allows you to run agents which tail log4j output and send the data to collectors, which write the data to HDFS. thanks, Bill On Wed, Apr 21, 2010 at 3:43 AM, Dhanya Aishwarya Palanisamy

Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
Hi Utku, We're using Chukwa to collect and aggregate data as you describe and so far it's working well. Typically Chukwa collectors are deployed to all data nodes, so there is actually no master write-bottleneck with this approach. There have been discussions lately on the Chukwa list regarding

Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
somehow need to connect to the namenode from the collectors. Isn't that the case when trying to reach HDFS, or are the Chukwa collectors writing to local drives instead of HDFS? Best, Utku On Mon, Mar 22, 2010 at 6:34 PM, Bill Graham billgra...@gmail.com wrote: Hi Utku, We're

Re: Is there a way to suppress the attempt logs?

2010-03-17 Thread Bill Graham
Not sure if what you're asking is possible or not, but you could experiment with these params to see if you could achieve a similar effect.

<property>
  <name>mapred.userlog.limit.kb</name>
  <value>0</value>
  <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description>
</property>

NN fails to start with LeaseManager errors

2010-02-02 Thread Bill Graham
Hi, This morning the namenode of my hadoop cluster shut itself down after the logs/ directory had filled itself with job configs, log files and all the other fun things Hadoop leaves there. It had been running for a few months. I deleted all of the job configs and attempt log directories and

Re: Google has obtained the patent over mapreduce

2010-01-20 Thread Bill Graham
Typically companies will patent their IP as a defensive measure to protect themselves from being sued, as has been pointed out already. Another typical reason is to exercise the patent against companies that present a challenge to their core business. I would bet that unless you're making a

Re: Tracking files deletions in HDFS

2009-11-19 Thread Bill Graham
I don't know about the auditing tools, but I have seen files get randomly deleted in dev setups when using Hadoop with the default hadoop.tmp.dir setting, which is /tmp/hadoop-${user.name}. On Thu, Nov 19, 2009 at 9:03 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. I have a strange case
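The usual remedy is to point hadoop.tmp.dir at a persistent location in core-site.xml (hadoop-site.xml on older releases); the path below is just an example:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop/tmp</value>
</property>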

Re: Apply a patch to hadoop

2009-07-27 Thread Bill Graham
http://wiki.apache.org/hadoop/HowToContribute Search for "Applying a patch" and you'll find this: patch -p0 < cool_patch.patch On Mon, Jul 27, 2009 at 2:33 PM, Gopal Gandhi gopal.gandhi2...@yahoo.com wrote: I am going to apply a patch to Hadoop (version 18.3). I searched online but could not

Re: Block not found

2009-07-02 Thread Bill Graham
I ran into the same issue when using the default setting for dfs.data.dir, which is under the /tmp directory. Files in this directory will be cleaned out periodically as needed by the OS, which will break HDFS. On Thu, Jul 2, 2009 at 7:01 AM, Gross, Danny danny.gr...@spansion.com wrote: Hello
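Same fix as with hadoop.tmp.dir above: give dfs.data.dir an explicit location outside /tmp, in hdfs-site.xml (hadoop-site.xml on older releases); the path is illustrative:

<property>
  <name>dfs.data.dir</name>
  <value>/var/lib/hadoop/dfs/data</value>
</property>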