intermediate files of killed tasks not purged

2009-04-28 Thread Sandhya E
Hi. Under /mapred/local there are directories like "attempt_200904262046_0026_m_02_0". Each of these directories contains files of the form intermediate.1, intermediate.2, intermediate.3, intermediate.4, intermediate.5. There are many directories in this format. All of these correspond to killed task

Shuffle error: reducers are not finding map input in a 2-node cluster...

2009-04-28 Thread Sid123
The problem is not going away, but I have a lead. Here is a detailed diagnostic of what goes wrong: 1) 2-node cluster setup on Ubuntu machines (datanode and tasktracker are running on both...) 2) The reducer tries to look for a non-existent file, as described below. The problem happens only on my mach

Re: Blocks replication in downtime even

2009-04-28 Thread Piotr
Hi. What happens when the node rejoins then? Does the replication level of several blocks increase? Are the old replicas removed in favor of the new ones, or the other way around? Regards, Piotr 2009/4/27 Stas Oskin > Thanks. > > 2009/4/27 Koji Noguchi > > > http://hadoop.apache.org/core/

Re: intermediate files of killed tasks not purged

2009-04-28 Thread Edward J. Yoon
Hi, it seems related to https://issues.apache.org/jira/browse/HADOOP-4654. On Tue, Apr 28, 2009 at 4:01 PM, Sandhya E wrote: > Hi > > Under /mapred/local there are directories like > "attempt_200904262046_0026_m_02_0" > Each of these directories contains files of format: intermediate.1 > i

Re: Database access in 0.18.3

2009-04-28 Thread Aaron Kimball
Cloudera's Distribution for Hadoop is based on Hadoop 0.18.3 but includes a backport of HADOOP-2536. You could switch to this distribution instead. Otherwise, download the 18-branch patch from issue HADOOP-2536 ( http://issues.apache.org/jira/browse/hadoop-2536) and apply it to your local copy and
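For orientation, a rough sketch of how a job might be wired up once the HADOOP-2536 backport is applied; the class and method names follow the org.apache.hadoop.mapred.lib.db API as introduced by that issue and should be checked against the patched tree, and the table, columns and record class are made-up examples.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // Hypothetical record class mapping one row of a "logs" table.
    public class LogRecord implements Writable, DBWritable {
      long id;
      String msg;

      public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong(1);
        msg = rs.getString(2);
      }
      public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, msg);
      }
      public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        msg = in.readUTF();
      }
      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(msg);
      }

      // Job wiring: read rows of "logs" as (key, LogRecord) map input.
      public static void configure(JobConf conf) {
        conf.setInputFormat(DBInputFormat.class);
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/mydb", "user", "password");
        DBInputFormat.setInput(conf, LogRecord.class, "logs",
            null /* conditions */, "id" /* orderBy */, "id", "msg");
      }
    }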

Re: Database access in 0.18.3

2009-04-28 Thread Aaron Kimball
Should also say, the link to CDH is http://www.cloudera.com/hadoop - Aaron On Tue, Apr 28, 2009 at 5:06 PM, Aaron Kimball wrote: > Cloudera's Distribution for Hadoop is based on Hadoop 0.18.3 but includes a > backport of HADOOP-2536. You could switch to this distribution instead. > Otherwise, do

Patching and building produces no librecordio or libhdfs

2009-04-28 Thread Sid123
Hi, I have applied a small patch for version 0.20 to my old 0.19.1... After I ran the ant tar target I found that three directories (libhdfs, librecordio and c++) were missing from the tarred build. Where do you get those from? I can't really use 0.20 because of massive library changes... So if someone can help me out

Re: intermediate files of killed tasks not purged

2009-04-28 Thread Amareshwari Sriramadasu
Hi Sandhya, which version of Hadoop are you using? Pre 0.17, there could be such directories in mapred/local; now there should not be any such directories. From version 0.17 onwards, the attempt directories are present only at mapred/local/taskTracker/jobCache//. If you are seeing the dire

Re: intermediate files of killed tasks not purged

2009-04-28 Thread Sandhya E
Hi Amareshwari, we are on version 0.18. I verified from the jobtracker web UI that not all killed tasks have leftovers in mapred/local. Also, some tasks that were successful have left their tmp folders in mapred/local. Can you please give some pointers on how to debug this further? Regards, S

Re: intermediate files of killed tasks not purged

2009-04-28 Thread Amareshwari Sriramadasu
Again, where are you seeing the attempt-id directories? Are they at mapred/local/ or at mapred/local/taskTracker/jobCache//? If you are seeing files at mapred/local/, then it is a bug. Please raise a JIRA and attach tasktracker logs if possible. If not, mapred/local/taskTracker/jobCache// directori

I need help

2009-04-28 Thread Razen Al Harbi
Hi all, I am writing an application in which I create a forked process to execute a specific Map/Reduce job. The problem is that when I try to read the output stream of the forked process I get nothing, whereas when I execute the same job manually it starts printing the output I am expecting. For cl
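A note for readers hitting the same issue: with Runtime.exec the child process can block once a stdout or stderr pipe fills up, and the Hadoop job client logs its progress through log4j, which may well land on stderr rather than stdout. Below is a minimal sketch that drains both streams; the command, jar and paths are placeholders, not anything from the original thread.

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public class ForkJob {
      // Continuously drain one stream of the child so the child cannot block.
      static Thread drain(final InputStream in, final String tag) {
        Thread t = new Thread() {
          public void run() {
            try {
              BufferedReader r = new BufferedReader(new InputStreamReader(in));
              String line;
              while ((line = r.readLine()) != null) {
                System.out.println(tag + " " + line);
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        };
        t.start();
        return t;
      }

      public static void main(String[] args) throws Exception {
        // Placeholder command; adjust the paths, jar and main class for your setup.
        Process proc = Runtime.getRuntime().exec(
            new String[] {"hadoop", "jar", "myjob.jar", "MyJob", "in", "out"});
        Thread out = drain(proc.getInputStream(), "[stdout]");
        Thread err = drain(proc.getErrorStream(), "[stderr]");
        int rc = proc.waitFor();
        out.join();
        err.join();
        System.out.println("exit code: " + rc);
      }
    }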

Appropriate for Hadoop?

2009-04-28 Thread Adam Retter
If I understand correctly, Hadoop forms a general-purpose cluster on which you can execute jobs? We have a Java data processing application here that follows the Producer -> Consumer pattern. It has been written with threading as a concern from the start, using java.util.concurrent.Callable. At

Re: I need help

2009-04-28 Thread Steve Loughran
Razen Al Harbi wrote: Hi all, I am writing an application in which I create a forked process to execute a specific Map/Reduce job. The problem is that when I try to read the output stream of the forked process I get nothing and when I execute the same job manually it starts printing the outpu

Re: I need help

2009-04-28 Thread Edward J. Yoon
Hi, Is that command available for all nodes? Did you try as below? ;) Process proc = rt.exec("/bin/hostname"); .. output.collect(hostname, disk usage); On Tue, Apr 28, 2009 at 6:13 PM, Razen Al Harbi wrote: > Hi all, > > I am writing an application in which I create a forked process to execute

Re: intermediate files of killed tasks not purged

2009-04-28 Thread Sandhya E
Attempt directories are in /mapred/local. I grep'd the tasktracker logs for one of the attempts that has a leftover in mapred/local: 09/04/27 21:07:19 INFO mapred.TaskTracker: LaunchTaskAction: attempt_200902120108_44218_r_00_0 09/04/27 21:07:29 INFO mapred.TaskTracker: attempt_200902120108_44218_

Hadoop / MySQL

2009-04-28 Thread Ankur Goel
hello hadoop users, Recently I had a chance to lead a team building a log-processing system that uses Hadoop and MySQL. The system's goal was to process the incoming information as quickly as possible (real time or near real time), and make it available for querying in MySQL. I thought it woul

Re: Storing data-node content to other machine

2009-04-28 Thread Steve Loughran
Vishal Ghawate wrote: Hi, I want to store the contents of all the client machines (datanodes) of the Hadoop cluster on a centralized machine with high storage capacity, so that the tasktracker will be on the client machine but the contents are stored on the centralized machine. Can anybody he

Re: Patching and building produces no librecordio or libhdfs

2009-04-28 Thread Tom White
Have a look at the instructions on http://wiki.apache.org/hadoop/HowToRelease under the "Building" section. It tells you which environment settings to set and which Ant targets to run. Tom On Tue, Apr 28, 2009 at 9:09 AM, Sid123 wrote: > > HI I have applied a small patch for version 0.20 to my old 0

Re: Hadoop / MySQL

2009-04-28 Thread Yi-Kai Tsai
Hi Ankur, nice share. BTW, what's your query behavior? I'm asking because if the query is simple or could be transformed/normalized, you could try outputting to HBase directly. Yi-Kai hello hadoop users, Recently I had a chance to lead a team building a log-processing system that uses Hadoop and M

Re: Blocks replication in downtime even

2009-04-28 Thread Stas Oskin
Hi. I think one needs to run balancer in order to clean out the redundant blocks. Can anyone confirm this? Regards. 2009/4/28 Piotr > Hi > > What happens when the node rejoins then ? > > - The replication level of several blocks increases ? > - The old replicas are removed in favor of new

Re: Getting free and used space

2009-04-28 Thread Stas Oskin
Hi. Any idea if the getDiskStatus() function requires superuser rights? Or can it work for any user? Thanks. 2009/4/9 Aaron Kimball > You can insert this propery into the jobconf, or specify it on the command > line e.g.: -D hadoop.job.ugi=username,group,group,group. > > - Aaron > > On We

Re: Hadoop / MySQL

2009-04-28 Thread Wang Zhong
Hi, that's really cool. It seems that Hadoop can work with SQL databases like MySQL in acceptable time. I thought that when inserting data into MySQL, the expense of communication was always a big problem; you have found a method to reduce that expense. Using distributed databases like HBase is another good choice

Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question

2009-04-28 Thread Steve Loughran
Aaron Kimball wrote: I'm not aware of any documentation about this particular use case for Hadoop. I think your best bet is to look into the JNI documentation about loading native libraries, and go from there. - Aaron You could also try 1. Starting the main processing app as a process on the m

Re: Appropriate for Hadoop?

2009-04-28 Thread Sharad Agarwal
Processing each document is independent and can be done in parallel, so that part could be done in a MapReduce job. Whether it suits this use case depends on the rate at which new URIs are discovered for processing and the acceptable delay in processing a document. The way I see it you can bat

Re: Appropriate for Hadoop?

2009-04-28 Thread Wang Zhong
Hi Adam, it seems that the producers and consumers work in parallel, so you can use Hadoop for your application. But the problem is the expense of communication with the DB. You can refer to Ankur's thread with the subject 'Hadoop / MySQL'. Regards, On Tue, Apr 28, 2009 at 6:05 PM, Adam Retter wrote:

RE: General purpose processing on Hadoop

2009-04-28 Thread Adam Retter
> Are you interested in building such a system? I would be interested in using such a system, but otherwise I am afraid that I do not have the time or resources available to be involved in such a project. Sorry.

RE: Appropriate for Hadoop?

2009-04-28 Thread Adam Retter
> Each document processing is independent and can be processed > parallelly, so that part could be done in a map reduce job. > Now whether it suits this use case depends on rate at which new > URI's are discovered for processing and acceptable delay in processing > of a document. The way I see it

Re: Hadoop / MySQL

2009-04-28 Thread tim robertson
Hi, [Ankur]: How can I make sure this happens? -- "show processlist" is how we spot it... literally it takes hours on our setup, so it is easy to find. So we ended up with 2 DBs: DB1, which we insert into, prepare and do batch processing on, and DB2, which serves the read-only web app. Periodically we dump DB1, point the

Re: Hadoop / MySQL

2009-04-28 Thread tim robertson
Sorry, that was not meant to be sent to the list... On Tue, Apr 28, 2009 at 3:27 PM, tim robertson wrote: > Hi, > > [Ankur]: How can make sure this happens? > -- show processlist is how we spot it... literally it takes hours on > our set up so easy to find. > > So we ended up with 2 DBs > - DB1

Re: Hadoop / MySQL

2009-04-28 Thread Joerg Rieger
I remember reading an article last year about something similar done by Rackspace. They went through various iterations of their logging system and encountered similar scaling issues with MySQL. In the end they started using Hadoop, Lucene and Solr: " How Rackspace Now Uses MapReduce and

Re: Hadoop / MySQL

2009-04-28 Thread Peter Skomoroch
Thanks for sharing; sounds like a nice system. I always advise people to avoid direct SQL inserts for batch jobs / large amounts of data and to use MySQL's optimized LOAD utility like you did. The same goes for Oracle... Nothing brings a DB server to its knees like a ton of individual inserts on indexed
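For illustration, a rough JDBC sketch of the bulk-load approach Peter mentions; the table, columns and file path are invented, MySQL Connector/J is assumed to be on the classpath, and both the server and the driver must allow LOCAL INFILE.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkLoad {
      public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        // allowLoadLocalInfile lets Connector/J send a local file to the server.
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/logs?allowLoadLocalInfile=true", "user", "password");
        Statement st = conn.createStatement();
        // One bulk statement instead of millions of single-row INSERTs.
        st.execute("LOAD DATA LOCAL INFILE '/tmp/part-00000' " +
                   "INTO TABLE hits FIELDS TERMINATED BY '\\t' " +
                   "(url, visits, bytes)");
        st.close();
        conn.close();
      }
    }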

Re: Hadoop / MySQL

2009-04-28 Thread Todd Lipcon
Warning: derailing a bit into MySQL discussion below, but I think enough people have similar use cases that it's worth discussing this even though it's gotten off-topic. 2009/4/28 tim robertson > > So we ended up with 2 DBs > - DB1 we insert to, prepare and do batch processing > - DB2 serving th

programming java ee and hadoop at the same time

2009-04-28 Thread George Pang
Hello users, I am trying to program a Hadoop-powered web application with Eclipse as the IDE. Now I have both the Hadoop perspective and the Java EE perspective. I wonder if I can have these two together, so that I can use the mappers and reducers in my servlet. Does anyone have experience with this? I will apprec

RE: programming java ee and hadoop at the same time

2009-04-28 Thread Bill Habermaas
George, I haven't used the Hadoop perspective in Eclipse so I can't help with that specifically but map/reduce is a batch process (and can be long running). In my experience, I've written servlets that write to HDFS and then have a background process perform the map/reduce. They can both run in b
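A minimal sketch of the pattern Bill describes, where the servlet only drops the uploaded data into HDFS and a separately scheduled background process runs the MapReduce job later; the namenode address and target directory are assumptions for illustration.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class UploadServlet extends HttpServlet {
      protected void doPost(HttpServletRequest req, HttpServletResponse resp)
          throws ServletException, IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed namenode address
        FileSystem fs = FileSystem.get(conf);
        // Drop each upload into an incoming directory; a background job picks it up later.
        Path dst = new Path("/incoming/" + System.currentTimeMillis());
        FSDataOutputStream out = fs.create(dst);
        IOUtils.copyBytes(req.getInputStream(), out, conf, true);  // true closes the streams
        resp.getWriter().println("stored " + dst);
      }
    }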

RE: Hadoop / MySQL

2009-04-28 Thread Bill Habermaas
Excellent discussion. Thank you Todd. You're forgiven for being off topic (at least by me). :) Bill -Original Message- From: Todd Lipcon [mailto:t...@cloudera.com] Sent: Tuesday, April 28, 2009 2:29 PM To: core-user Subject: Re: Hadoop / MySQL Warning: derailing a bit into MySQL discus

Re: Appropriate for Hadoop?

2009-04-28 Thread Chuck Lam
HDFS is designed with Hadoop in mind, so there are certain advantages (e.g. performance, reliability, and ease of use) to using HDFS for Hadoop. However, it's not required. For example, when you run Hadoop in standalone mode, it just uses the file system on your local machine. When you run it on Am

Re: I need help

2009-04-28 Thread Razen Alharbi
Thanks for the reply, -Steve: I know that I can use the JobClient to run or submit jobs; however, for the time being I need to exec the job as a separate process. -Edward: The forked job is not executed from within a map or reduce, so I don't need to do data collection. It seems for some reason t

(event) 5/19 How Hadoop Enables Big Data for Every Enterprise >> Christopher Bisciglia

2009-04-28 Thread Bonesata
Registration and more information: http://www.meetup.com/CIO-IT-Executives/calendar/10266376/ Limited seats - register now!

Master crashed

2009-04-28 Thread Mayuran Yogarajah
The master in my cluster crashed; the dfs/mapred Java processes are still running on the slaves. What should I do next? I brought the master back up and ran stop-mapred.sh and stop-dfs.sh, and it said this: slave1.test.com: no tasktracker to stop slave1.test.com: no datanode to stop Not sure wha

streaming but no sorting

2009-04-28 Thread Dmitry Pushkarev
Hi. I'm writing streaming-based tasks that involve running thousands of mappers. After that I want to put all these outputs into a small number (say 30) of output files, mainly so that disk space is used more efficiently. The way I'm doing it right now is using /bin/cat as the reducer and setting n

How to write large string to file in HDFS

2009-04-28 Thread nguyenhuynh.mr
Hi all! I have a large String and I want to write it to a file in HDFS. (The large string has >100,000 lines.) Currently, I use the copyBytes method of org.apache.hadoop.io.IOUtils, but copyBytes requires an InputStream for the content. Therefore, I have to convert the String to an InputStre

Re: I need help

2009-04-28 Thread Edward J. Yoon
Why not read the output after the job is done? And if you want to see the log4j log, you need to set the stdout option in log4j.properties. On Wed, Apr 29, 2009 at 4:35 AM, Razen Alharbi wrote: > > Thanks for the reply, > > -Steve: > I know that I can use the JobClient to run or submit jobs; howev

Re: streaming but no sorting

2009-04-28 Thread jason hadoop
It may be simpler to just have a post-processing step that uses something like multi-file input to aggregate the results. As a completely sideways-thinking solution: I suspect you have far more map tasks than you have physical machines; instead of writing your output via output.collect, your tasks c
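One way to realize that post-processing step is a plain identity job over the map outputs, reduced down to a handful of files. A rough sketch with the 0.18-era API, assuming the data is tab-separated key/value text; note that this pass still pays for a shuffle and sort, which is the cost of the consolidation.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Consolidate {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Consolidate.class);
        conf.setJobName("consolidate");
        // Lines are split at the first tab into (key, value) Text pairs.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(30);  // roughly 30 output files
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }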

Re: streaming but no sorting

2009-04-28 Thread jason hadoop
There has to be a simpler way :) On Tue, Apr 28, 2009 at 9:22 PM, jason hadoop wrote: > It may be simpler to just have a post processing step that uses something > like multi-file input to aggregate the results. > > As a complete sideways thinking solution, I suspect you have far more map > task

Re: How to write large string to file in HDFS

2009-04-28 Thread jason hadoop
How about new InputStreamReader( new StringReader( String ), "UTF-8" ) replace UTF-8 with an appropriate charset. On Tue, Apr 28, 2009 at 7:47 PM, nguyenhuynh.mr wrote: > Hi all! > > > I have the large String and I want to write it into the file in HDFS. > > (The large string has >100.000 lines.
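Since IOUtils.copyBytes consumes an InputStream rather than a Reader, the usual conversion is a ByteArrayInputStream over the string's bytes; alternatively the string can be written straight through the stream returned by FileSystem.create. A small sketch of both routes, with the target paths and the charset as assumptions.

    import java.io.ByteArrayInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class WriteString {
      public static void main(String[] args) throws Exception {
        String bigString = "line1\nline2\n";          // the large string
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Option 1: keep using IOUtils.copyBytes with an in-memory stream.
        ByteArrayInputStream in =
            new ByteArrayInputStream(bigString.getBytes("UTF-8"));
        FSDataOutputStream out = fs.create(new Path("/user/me/big.txt"));
        IOUtils.copyBytes(in, out, conf, true);       // true closes both streams

        // Option 2: write straight to the HDFS output stream.
        FSDataOutputStream out2 = fs.create(new Path("/user/me/big2.txt"));
        out2.write(bigString.getBytes("UTF-8"));
        out2.close();
      }
    }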

R on the cloudera hadoop ami?

2009-04-28 Thread Saptarshi Guha
Hello, Thank you to Cloudera for providing an AMI for Hadoop. Would it be possible to include R in the Cloudera yum repositories or better still can the i386 and x86-64 AMIs be updated to have R pre-installed? If yes (thanks!) I think yum -y install R installs R built with --shared-libs (which all

Re: R on the cloudera hadoop ami?

2009-04-28 Thread Jeff Hammerbacher
Thanks for the interest in the Cloudera AMIs, Saptarshi. We're trying to keep Cloudera-specific discussion out of the Apache forums to respect those who may not want to follow along. For Cloudera-specific requests, please use our Get Satisfaction forum at http://getsatisfaction.com/cloudera. Thank

Re: How to write large string to file in HDFS

2009-04-28 Thread Wang Zhong
Where did you get the large string? Can't you generate the string one line at a time and append it to local files, then upload to HDFS when finished? On Wed, Apr 29, 2009 at 10:47 AM, nguyenhuynh.mr wrote: > Hi all! > > > I have the large String and I want to write it into the file in HDFS. > > (T

Re: Appropriate for Hadoop?

2009-04-28 Thread Sharad Agarwal
Adam Retter wrote: > > So I don't have to use HDFS at all when using Hadoop? The input URI list has to be stored in HDFS. Each mapper will work on a sublist of URIs depending on the number of maps set in the job. - Sharad
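A rough sketch of what Sharad describes, using the 0.18-style API: the job input is a text file in HDFS with one URI per line, and each map call fetches one document and emits whatever the processing produces; the fetch-and-count body below is only a placeholder.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Each input record is (byte offset, one URI per line).
    public class UriProcessorMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String uri = value.toString().trim();
        if (uri.length() == 0) {
          return;
        }
        // Placeholder "processing": fetch the document and emit its size in characters.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(uri).openStream()));
        long chars = 0;
        String line;
        while ((line = in.readLine()) != null) {
          chars += line.length();
        }
        in.close();
        output.collect(new Text(uri), new Text(Long.toString(chars)));
      }
    }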

Re: How to write large string to file in HDFS

2009-04-28 Thread nguyenhuynh.mr
Wang Zhong wrote: > Where did you get the large string? Can't you generate the string one > line per time and append it to local files, then upload to HDFS when > finished? > > On Wed, Apr 29, 2009 at 10:47 AM, nguyenhuynh.mr > wrote: > >> Hi all! >> >> >> I have the large String and I want to

Re: How to write large string to file in HDFS

2009-04-28 Thread nguyenhuynh.mr
jason hadoop wrote: > How about new InputStreamReader( new StringReader( String ), "UTF-8" ) > replace UTF-8 with an appropriate charset. > > > On Tue, Apr 28, 2009 at 7:47 PM, nguyenhuynh.mr > wrote: > > >> Hi all! >> >> >> I have the large String and I want to write it into the file in HDFS.