Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Steve Loughran

On 27/01/11 07:28, Manuel Meßner wrote:

Hi,

you may want to take a look at the streaming API, which allows users
to write their map-reduce jobs in any language that is capable of
writing to stdout and reading from stdin.

http://hadoop.apache.org/mapreduce/docs/current/streaming.html

Furthermore, Pig and Hive are Hadoop-related projects that may be of
interest to non-Java people:

http://pig.apache.org/
http://hive.apache.org/

So finally my answer: no it isn't ;)


It helps if your ops team has some experience running Java app servers
or similar, as well as large Linux clusters.


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-27 Thread Renaud Delbru

Hi Koji,

thanks for sharing the information,
Is the 0.20-security branch planned to be an official release at some point?

Cheers
--
Renaud Delbru

On 27/01/11 01:50, Koji Noguchi wrote:

Hi Renaud,

Hopefully it'll be in the 0.20-security branch that Arun is trying to push.

Related (very abstract) Jira.
https://issues.apache.org/jira/browse/MAPREDUCE-1872

Koji



On 1/25/11 12:48 PM, "Renaud Delbru"  wrote:

As it seems that the capacity and fair schedulers in hadoop 0.20.2 do
not allow a hard upper limit on the number of concurrent tasks, does
anybody know of any other solution to achieve this?
--
Renaud Delbru

On 25/01/11 11:49, Renaud Delbru wrote:
> Hi,
>
> we would like to limit the maximum number of tasks per job on our
> hadoop 0.20.2 cluster.
> Will the Capacity Scheduler [1] allow us to do this? Is it working
> correctly on hadoop 0.20.2 (I remember a few months ago, we were
> looking at it, but it seemed incompatible with hadoop 0.20.2)?
>
> [1]
http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html
>
> Regards,






Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-27 Thread Steve Loughran

On 27/01/11 10:51, Renaud Delbru wrote:

Hi Koji,

thanks for sharing the information,
Is the 0.20-security branch planned to be an official release at some
point?

Cheers


If you can play with the beta, you can see whether it works for you and,
if not, get bugs fixed during the beta cycle:


http://people.apache.org/~acmurthy/hadoop-0.20.100-rc0/


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-27 Thread Renaud Delbru

Thanks, we will try to test it next week.
--
Renaud Delbru

On 27/01/11 11:31, Steve Loughran wrote:

On 27/01/11 10:51, Renaud Delbru wrote:

Hi Koji,

thanks for sharing the information,
Is the 0.20-security branch planned to be an official release at some
point?

Cheers


If you can play with the beta, you can see whether it works for you and,
if not, get bugs fixed during the beta cycle:


http://people.apache.org/~acmurthy/hadoop-0.20.100-rc0/




Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Edward Capriolo
On Thu, Jan 27, 2011 at 5:42 AM, Steve Loughran  wrote:
> On 27/01/11 07:28, Manuel Meßner wrote:
>>
>> Hi,
>>
>> you may want to take a look at the streaming API, which allows users
>> to write their map-reduce jobs in any language that is capable of
>> writing to stdout and reading from stdin.
>>
>> http://hadoop.apache.org/mapreduce/docs/current/streaming.html
>>
>> Furthermore, Pig and Hive are Hadoop-related projects that may be of
>> interest to non-Java people:
>>
>> http://pig.apache.org/
>> http://hive.apache.org/
>>
>> So finally my answer: no it isn't ;)
>
> It helps if your ops team has some experience running Java app servers or
> similar, as well as large Linux clusters.
>

IMHO Hadoop is not a technology you want to use unless you have people
with Java experience on your staff, or you are willing to learn those
skills. Hadoop does not have a standard interface such as SQL. Working
with it involves reading the API docs, reading through source code,
reading blogs, etc.

I would say the average Hadoop user is also somewhat of a Hadoop
developer/administrator, whereas the average MySQL user, for example,
has never delved into the source code.

In other words, if you work with Hadoop you are bound to see Java
exceptions and stack traces in common everyday usage.

This does not mean you have to know Java to use Hadoop, but to use it
very effectively I would suggest it.


Re: Cannot copy files to HDFS

2011-01-27 Thread rahul patodi
Hi,
Your DataNode is not up.
Please run the jps command to check that all required daemons are running.
You can refer to http://www.hadoop-tutorial.blogspot.com/


-- 
*Regards*,
Rahul Patodi
Software Engineer,
Impetus Infotech (India) Pvt Ltd,
www.impetus.com
Mob:09907074413


Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Phil Whelan
Hi Manoranjan,

While knowing Java will help you make better use of Hadoop's features
and process the data more efficiently, I have worked in a
situation where we used Hadoop without touching any Java code at all.
We needed to utilise our legacy Perl code in our Map-Reduce jobs and
simply used Hadoop Streaming. With tools like Whirr you can
treat Hadoop entirely as a black box if you need to.
http://www.slideshare.net/philwhln/map-reduce-using-perl
http://www.philwhln.com/map-reduce-with-ruby-using-hadoop

But, as a "data architect", I think knowing Java is important. Not just
for Hadoop, but for all the other Apache projects that focus on
managing data, are built with Java, and provide Java APIs (ActiveMQ,
Cassandra, Lucene / Solr, HBase, Hive, Mahout...).
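
(For anyone unfamiliar with the streaming contract: a mapper is simply a
program that reads records from stdin and writes tab-separated key/value
pairs to stdout. Below is a minimal, hypothetical sketch of that contract,
written in Java only to match the other examples in this digest; Perl,
Ruby, or Python work exactly the same way.)

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical streaming mapper: reads lines from stdin, emits "word<TAB>1" on stdout.
// Hadoop Streaming treats everything before the first tab character as the key.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}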

Thanks,
Phil

On Wed, Jan 26, 2011 at 7:43 AM, manoranjand  wrote:
>
> Hi- I have a basic question. Apologies for my ignorance, but is Hadoop a
> misfit for a data architect with zero Java knowledge?
> --
> View this message in context: 
> http://old.nabble.com/Hadoop-is-for-whom--Data-architect-or-Java-Architect-or-All-tp30765860p30765860.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
Cell : +1 (778) 233-4935
Twitter : http://www.twitter.com/philwhln
LinkedIn : http://ca.linkedin.com/in/philwhln
Blog : http://www.philwhln.com
Skype : philwhelan76


Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread sic slc
unsubscribe

On Wed, Jan 26, 2011 at 8:43 AM, manoranjand wrote:

>
> Hi- I have a basic question. Apologies for my ignorance, but is Hadoop a
> misfit for a data architect with zero Java knowledge?
> --
> View this message in context:
> http://old.nabble.com/Hadoop-is-for-whom--Data-architect-or-Java-Architect-or-All-tp30765860p30765860.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Hadoop Binary File

2011-01-27 Thread Keith Wiley
On Jan 25, 2011, at 21:47 , F.Ozgur Catak wrote:

> Can you give me a simple example/source code for this project.
> 
> On Tue, Jan 25, 2011 at 10:13 PM, Keith Wiley  wrote:
> 
>> I'm also doing binary image processing on Hadoop.  Where relevant, my Key
>> and Value types are a WritableComparable class of my own creation which
>> contains as members a BytesWritable object, obviously read from the file
>> itself directly into memory.  I also keep the path in my class so I know
>> where the file came from later.
>> 
>> On Jan 25, 2011, at 11:46 , F.Ozgur Catak wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to develop an image processing application with hadoop. All
>>> image files are in HDFS.  But I don't know how to read these files with a
>>> binary/byte stream. What is the correct declaration of the Mapper and
>>> Reducer classes?

Hmmm, "simple" you say...not even remotely.  Our system has grown into quite 
the behemoth.

Let's see...here's my Writable class that I use to pass images around Hadoop as 
keys and values:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class FileWritable implements WritableComparable<FileWritable> {

    private Path          filePath_ = null;
    private BytesWritable fileContents_;

    public FileWritable() {
        set(null, new BytesWritable());
    }

    public FileWritable(Path filePath, BytesWritable fileContents) {
        set(filePath, fileContents);
    }

    public void set(Path filePath, BytesWritable fileContents) {
        filePath_ = filePath;
        fileContents_ = fileContents;
    }

    public Path getPath() {
        return filePath_;
    }

    /**
     * The key is the filename, i.e., the last component of the full path
     * @return the filename as a Text key
     */
    public Text getKey() {
        return new Text(filePath_.getName());
    }

    public BytesWritable getContents() {
        return fileContents_;
    }

    public void write(DataOutput out) throws IOException {
        new Text(filePath_.getName()).write(out);
        fileContents_.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        Text filePath = new Text();
        filePath.readFields(in);
        filePath_ = new Path(filePath.toString());

        fileContents_.readFields(in);
    }

    // If we ever use this class as a key, might want to do this a little better.
    @Override
    public int hashCode() {
        return fileContents_.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof FileWritable) {
            FileWritable f = (FileWritable) o;
            // Is the second half of this comparison *really* necessary?!
            return filePath_.equals(f.filePath_)
                    && fileContents_.equals(f.fileContents_);
        }
        return false;
    }

    public int compareTo(FileWritable f) {
        // Is the second half of this comparison *really* necessary?!
        int cmp = filePath_.compareTo(f.filePath_);
        if (cmp != 0)
            return cmp;
        return fileContents_.compareTo(f.fileContents_);
    }
}

Does that help?
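
As for the original question about declaring the Mapper: a minimal sketch
(assuming the org.apache.hadoop.mapreduce API and a hypothetical custom
InputFormat, not shown, that emits Text filenames as keys and FileWritable
objects as values) might look like this:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper declaration: the key is the filename, the value is the file's bytes.
public class ImageMapper extends Mapper<Text, FileWritable, Text, FileWritable> {

    @Override
    protected void map(Text filename, FileWritable image, Context context)
            throws IOException, InterruptedException {
        // Run the image-processing code against image.getContents() here...
        context.write(filename, image);
    }
}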


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Luminous beings are we, not this crude matter."
  -- Yoda






Thread safety issues with JNI/native code from map tasks?

2011-01-27 Thread Keith Wiley
I am seeing very perplexing segfaults and standard allocation exceptions in my 
native code (.so files passed to the distributed cache), which is called via JNI 
from the map task.  This code runs perfectly fine (on the same data) outside 
Hadoop.  Even when run in Hadoop standalone mode (no cluster), it still 
segfaults.  The memory footprint is quite small and inspection at run time 
reveals there is plenty of memory left, yet I get segfaults and exceptions.

I'm starting to wonder if this is a thread issue.

The native code is not *specifically* thread safe (not compiled with pthreads 
or anything like that).

However, it is also not run in any concurrent fashion except w.r.t. the JVM 
itself.  For example, my map task doesn't make parallel calls through JNI to 
the native code on concurrent threads at the Java level, nor does the native 
code itself spawn any threads (like I said, it isn't even compiled with 
pthreads).

However, there are clearly other "threads" of execution.  For example, the JVM 
itself is running, including whatever supplemental threads the JVM involves 
(the garbage collector?).  In addition, my Java mapper is running two Java 
threads at the time of the native call.  One calls the native code and 
effectively blocks until the native code returns through JNI.  The other just 
spins and sends reports and statuses to the job tracker at regular intervals to 
prevent the task from being killed, but it doesn't do anything else 
particularly memory-related, and certainly no JNI/native calls; it's very basic, 
just sleep 'n report, sleep 'n report.
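
For concreteness, the structure just described is roughly the following
sketch (class and method names are hypothetical, and the native method is a
stand-in for the real JNI entry point):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Rough illustration of the two threads described above: the map thread blocks in a
// native call while a background thread reports progress so the task isn't killed.
public class NativeCallMapper extends Mapper<Text, Text, Text, Text> {

    // Stand-in for the real JNI entry point; the actual signature is application-specific.
    private native void processNatively(byte[] data);

    @Override
    protected void map(Text key, Text value, final Context context)
            throws IOException, InterruptedException {

        // The "sleep 'n report" thread: no JNI calls, no significant memory use.
        Thread reporter = new Thread(new Runnable() {
            public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        context.progress();   // keep the job tracker happy
                        Thread.sleep(60000L); // sleep 'n report
                    }
                } catch (InterruptedException ignored) {
                    // main thread finished; just exit
                }
            }
        });
        reporter.setDaemon(true);
        reporter.start();

        try {
            processNatively(value.getBytes()); // blocks until the native code returns
        } finally {
            reporter.interrupt();
        }
    }
}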

So, the question is, in the scenario I have described, is there any reason to 
suspect that the cause of my problems is some sort of thread trampling between 
the native code and something else in the surrounding environment (the JVM or 
something like that), especially in the context of the surrounding Hadoop 
infrastructure?  It doesn't really make any sense to me, but I'm running out of 
ideas.

I've experimented with "mapred.child.java.opts" and "mapred.child.ulimit" but 
nothing really seems to have any effect on the frequency of these errors.

I'm quite out of ideas.  These segfaults and standard allocation exceptions (in 
the face of plenty of free memory) have basically brought my work to a halt and 
I just don't know what to do anymore.

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
  -- Homer Simpson






Too small initial heap problem.

2011-01-27 Thread Jun Young Kim

Hi,

I have a 9-node cluster (1 master, 8 slaves) to run Hadoop.

When I executed my job on the master, I got the following errors.

11/01/28 10:58:01 INFO mapred.JobClient: Running job: job_201101271451_0011
11/01/28 10:58:02 INFO mapred.JobClient:  map 0% reduce 0%
11/01/28 10:58:08 INFO mapred.JobClient: Task Id : attempt_201101271451_0011_m_41_0, Status : FAILED

java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stdout
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stderr



After going to hatest03.server, I checked the directory named 
attempt_201101271451_0011_m_41_0.

There is an error msg in the stdout file.

Error occurred during initialization of VM
Too small initial heap


My configuration for the heap size is:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024</value>
</property>

and the physical memory size is ("free -m"):
$ free -m
             total       used       free     shared    buffers     cached
Mem:         12001       4711       7290          0        197       4056
-/+ buffers/cache:        457      11544
Swap:         2047          0       2047


how can I fix this problem?

--
Junyoung Kim (juneng...@gmail.com)



Re: Too small initial heap problem.

2011-01-27 Thread Koji Noguchi
> -Xmx1024
>
This would be a 1024-byte heap.

Maybe you want -Xmx1024m  ?
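
In other words, the property would become something like:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>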

Koji

On 1/27/11 6:04 PM, "Jun Young Kim"  wrote:

Hi,

I have a 9-node cluster (1 master, 8 slaves) to run Hadoop.

When I executed my job on the master, I got the following errors.

11/01/28 10:58:01 INFO mapred.JobClient: Running job: job_201101271451_0011
11/01/28 10:58:02 INFO mapred.JobClient:  map 0% reduce 0%
11/01/28 10:58:08 INFO mapred.JobClient: Task Id : attempt_201101271451_0011_m_41_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stdout
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stderr


After going to hatest03.server, I checked the directory named
attempt_201101271451_0011_m_41_0.
There is an error msg in the stdout file.

Error occurred during initialization of VM
Too small initial heap


My configuration for the heap size is:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024</value>
</property>

and the physical memory size is ("free -m"):
$ free -m
             total       used       free     shared    buffers     cached
Mem:         12001       4711       7290          0        197       4056
-/+ buffers/cache:        457      11544
Swap:         2047          0       2047


how can I fix this problem?

--
Junyoung Kim (juneng...@gmail.com)