RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Owen,

> > Could you perhaps elaborate on that 100 MB limit? Is that due to a
> > limit caused by the Java VM heap size? If so, could that, for
> > example, be increased to 512MB by setting mapred.child.java.opts
> > to '-Xmx512m'?
> 
> A couple of points:
> 1. The 100MB was just for ballpark calculations. Of course, if you
> have a large heap, you can fit larger values. When figuring out how
> big to make your heaps, don't forget that the framework allocates
> big chunks of the heap for its own buffers.
> 2. Having large keys is much harder than large values. When doing an
> N-way merge, the framework has N+1 keys and 1 value in memory at a
> time.

Ok, that makes sense. Thanks for the information!
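
As a rough illustration of that accounting, here is a minimal sketch using the
old JobConf API -- the numbers are only placeholders, not recommendations:

    import org.apache.hadoop.mapred.JobConf;

    public class HeapBudget {
        public static void main(String[] args) {
            JobConf conf = new JobConf(HeapBudget.class);
            // Give each map/reduce child JVM a 512 MB heap ...
            conf.set("mapred.child.java.opts", "-Xmx512m");
            // ... and remember that roughly io.sort.mb of that heap is claimed by
            // the framework's map-side sort buffer, so the room left for a single
            // large value is closer to 512 MB - 100 MB than to the full 512 MB.
            conf.setInt("io.sort.mb", 100);
        }
    }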


Regards,

Leon Mergen




RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Jason,

> In general if the values become very large, it becomes simpler to store
> them out of line in hdfs, and just pass the hdfs path for the item as the
> value in the map reduce task.
> This greatly reduces the amount of IO done, and doesn't blow up the sort
> space on the reducer.
> You lose the magic of data locality, but given the item size, you gain
> the IO back by not having to pass the full values to the reducer, or
> handle them when sorting the map outputs.

Ah, that actually sounds like a nice idea; instead of having the reducer emit 
the huge value, it can create a temporary file and emit the filename instead.

I wasn't really planning on having huge values anyway (values above 1MB will be 
the exception rather than the rule), but since it's theoretically possible for 
our software to generate them, it seemed like a good idea to investigate any 
real constraints that we might run into.

Your idea sounds like a good workaround for this. Thanks!
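
A rough sketch of what that could look like on the map side (the same trick
applies in a reducer), assuming the old org.apache.hadoop.mapred API with Text
keys/values -- the side directory and the 1 MB threshold are placeholders, not
part of Jason's suggestion:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LargeValueMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        private FileSystem fs;
        private Path sideDir;

        @Override
        public void configure(JobConf job) {
            try {
                fs = FileSystem.get(job);
                sideDir = new Path("/tmp/large-values");  // placeholder location
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            if (value.getLength() > 1024 * 1024) {              // "huge" value
                Path file = new Path(sideDir, key.toString());
                FSDataOutputStream out = fs.create(file);       // store it out of line
                out.write(value.getBytes(), 0, value.getLength());
                out.close();
                output.collect(key, new Text(file.toString())); // emit only the path
            } else {
                output.collect(key, value);                     // small values pass through
            }
        }
    }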


Regards,

Leon Mergen



RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Owen,

> Keys and values can be large. They are certainly capped above by
> Java's 2GB limit on byte arrays. More practically, you will have
> problems running out of memory with keys or values of 100 MB. There is
> no restriction that a key/value pair fits in a single hdfs block, but
> performance would suffer. (In particular, the FileInputFormats split
> at block-sized chunks, which means you will have maps that scan an
> entire block without processing anything.)

Thanks for the quick reply.

Could you perhaps elaborate on that 100 MB limit? Is that due to a limit caused 
by the Java VM heap size? If so, could that, for example, be increased to 512MB 
by setting mapred.child.java.opts to '-Xmx512m'?

Regards,

Leon Mergen



Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello,

I wasn't able to find this anywhere, so I'm sorry if this has been asked before.

I am wondering whether there is a practical limit on the number of bytes that 
an emitted Map/Reduce value can be. Other than the obvious drawbacks of 
emitting huge values, such as performance issues, I would like to know whether 
there are any hard constraints; I can imagine that a value can never be larger 
than dfs.block.size.

Does anyone have any idea, or can anyone provide me with some pointers on where to look?

Thanks in advance!

Regards,

Leon Mergen


Compression support for libhdfs

2009-05-06 Thread Leon Mergen
Hello,

After examining the libhdfs library, I cannot find any support for compression 
-- is this correct?

And, if this is the case, is it also correct that it would be almost trivial to 
implement in hdfsOpenFile() by making an additional call to one of the 
compression codecs' createInputStream() / createOutputStream() methods (and 
possibly another call before closing the file)?
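
For context, this is the Java-side pattern I have in mind -- only a hedged
sketch of the codec API, not an actual libhdfs patch; the path and the choice
of GzipCodec are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
            // createOutputStream() wraps the raw HDFS stream, so everything written
            // through 'out' is compressed on the way down; libhdfs would have to do
            // the JNI equivalent of this inside hdfsOpenFile().
            CompressionOutputStream out =
                codec.createOutputStream(fs.create(new Path("/tmp/example.gz")));
            out.write("hello".getBytes("UTF-8"));
            out.close();
        }
    }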

Regards,

Leon Mergen



RE: OutOfMemory Error

2008-09-17 Thread Leon Mergen
Hello,

What version of Hadoop are you using?

Regards,

Leon Mergen

> -Original Message-
> From: Pallavi Palleti [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 17, 2008 2:36 PM
> To: core-user@hadoop.apache.org
> Subject: OutOfMemory Error
>
>
> Hi all,
>
> I am getting an OutOfMemory error as shown below when I run map-reduce on a
> huge amount of data:
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>     at org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
>     at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
>     at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2802)
>     at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> The above error comes almost at the end of the map job. I have set the heap
> size to 1GB, but the problem is still persisting. Can someone please help me
> figure out how to avoid this error?
> --
> View this message in context:
> http://www.nabble.com/OutOfMemory-Error-tp19531174p19531174.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.



RE: OutOfMemoryError with map jobs

2008-09-07 Thread Leon Mergen
Hello Chris,

>  From the stack trace you provided, your OOM is probably due to
> HADOOP-3931, which is fixed in 0.17.2. It occurs when the deserialized
> key in an outputted record exactly fills the serialization buffer that
> collects map outputs, causing an allocation as large as the size of
> that buffer. It causes an extra spill, an OOM exception if the task
> JVM has a max heap size too small to mask the bug, and will miss the
> combiner if you've defined one, but it won't drop records.

Ok thanks for that information. I guess that means I will have to upgrade. :-)

> > However, I was wondering: are these hard architectural limits? Say
> > that I wanted to emit 25,000 maps for a single input record, would
> > that mean that I will require huge amounts of (virtual) memory? In
> > other words, what exactly is the reason that increasing the number
> > of emitted maps per input record causes an OutOfMemoryError ?
>
> Do you mean the number of output records per input record in the map?
> The memory allocated for collecting records out of the map is (mostly)
> fixed at the size defined in io.sort.mb. The ratio of input records to
> output records does not affect the collection and sort. The number of
> output records can sometimes influence the memory requirements, but
> not significantly. -C

Ok, so I should not have to worry about this too much! Thanks for the reply and 
information!

Regards,

Leon Mergen



RE: OutOfMemoryError with map jobs

2008-09-06 Thread Leon Mergen
Hello,

> I'm currently developing a map/reduce program that emits a fair number of
> maps per input record (around 50 - 100), and I'm getting OutOfMemory
> errors:

Sorry for the noise -- I found out I had to set the mapred.child.java.opts 
JobConf parameter to "-Xmx512m" to make 512MB of heap space available in the 
map processes.

However, I was wondering: are these hard architectural limits? Say that I 
wanted to emit 25,000 maps for a single input record, would that mean that I 
would require huge amounts of (virtual) memory? In other words, what exactly is 
the reason that increasing the number of emitted maps per input record causes 
an OutOfMemoryError?

Regards,

Leon Mergen


OutOfMemoryError with map jobs

2008-09-06 Thread Leon Mergen
Hello,

I'm currently developing a map/reduce program that emits a fair number of maps 
per input record (around 50 - 100), and I'm getting OutOfMemory errors:

2008-09-06 15:28:08,993 ERROR org.apache.hadoop.mapred.pipes.BinaryProtocol: 
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$BlockingBuffer.reset(MapTask.java:564)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:440)
    at org.apache.hadoop.mapred.pipes.OutputHandler.output(OutputHandler.java:55)
    at org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:117)


It is a reproducible error which occurs at the same progress percentage every 
time -- when I emit fewer maps per input record, the problem goes away.

Now, I have tried editing conf/hadoop-env.sh to increase the HADOOP_HEAPSIZE to 
2000MB and set `export HADOOP_TASKTRACKER_OPTS="-Xms32m -Xmx2048m"`, but the 
problem persists at the exact same place.

Now, my use case doesn't really look that spectacular; is this a common 
problem, and if so, what are the usual ways to get around this?

Thanks in advance for a response!

Regards,

Leon Mergen


Re: Missing lib/native/Linux-amd64-64 on hadoop-0.17.2.tar.gz

2008-08-20 Thread Leon Mergen
On Wed, Aug 20, 2008 at 9:31 AM, Yi-Kai Tsai <[EMAIL PROTECTED]> wrote:

> But we do have  lib/native/Linux-amd64-64 on  hadoop-0.17.1.tar.gz and
> hadoop-0.18.0.tar.gz ?


At least for 0.17.1, yes, there is.

Regards,

Leon Mergen


Re: Difference between Hadoop Streaming and "Normal" mode

2008-08-13 Thread Leon Mergen
On Wed, Aug 13, 2008 at 4:53 PM, Adam SI <[EMAIL PROTECTED]> wrote:

> Will coding computationally intensive algorithms in C/C++ and using them
> with streaming mode improve performance? Just curious.


That all depends on the efficiency of your C/C++ code. But assuming your C/C++
code is somewhat more efficient, it stands to reason that there will be a point
where it improves performance, once the computation becomes intensive enough.

For our project, everything was written in C++ already, so it made sense to
write the map/reduce applications in C++ as well, so they can use the same
common library -- I personally would be very reluctant to introduce a new
programming language into my codebase for the sake of a small performance gain
alone, though. But that's a matter of opinion.

Regards,

Leon Mergen


Hadoop Pipes Job submission and JobId

2008-08-08 Thread Leon Mergen
Hello,

I was wondering what the correct way to submit a Job to hadoop using the
Pipes API is -- currently, I invoke a command similar to this:

/usr/local/hadoop/bin/hadoop pipes -conf
/usr/local/mapreduce/reports/reports.xml -input
/store/requests/archive/*/*/* -output out

However, this way of invoking the job has a few problems: it is a shell
command, and thus a bit awkward to embed in a C++ program. Secondly, it would
be awkward to retrieve the JobId from this shell command, since all of its
output would have to be parsed. And last, it keeps running in a monitoring loop
for as long as the job is running, instead of going to the background.

Now, in an ideal case, I would have some kind of HTTP URL I can submit jobs
to, which in turn returns some data about the new job, including the JobId.

I need the JobId to be able to match my system's task ids with Hadoop
JobIds when the  is visited.

I was wondering whether these requirements can be met without having to write
a custom Java application, or whether the native Java API is the only way to
retrieve the JobId upon job submission.
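
In case the Java API does turn out to be the only route, here is a minimal
sketch of the submission side as I understand it (old org.apache.hadoop.mapred
API; the JobConf setup, including the pipes-specific settings, is elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SubmitAndGetJobId {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // ... input/output paths and the pipes executable would be configured here ...
            JobClient client = new JobClient(conf);
            RunningJob job = client.submitJob(conf);  // returns immediately instead of blocking
            System.out.println(job.getJobID());       // the id to correlate with our own task ids
        }
    }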

Thanks in advance!

Regards,

Leon Mergen


Re: extracting input to a task from a (streaming) job?

2008-08-07 Thread Leon Mergen
Hello John,

On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann <[EMAIL PROTECTED]> wrote:

>
> I have a large Hadoop streaming job that generally works fine,
> but a few (2-4) of the ~3000 maps and reduces have problems.
> To make matters worse, the problems are system-dependent (we run a
> cluster with machines of slightly different OS versions).
> I'd of course like to debug these problems, but they are embedded in a
> large job.
>
> Is there a way to extract the input given to a reducer from a job, given
> the task identity?  (This would also be helpful for mappers.)


I believe you should set "keep.failed.task.files" to true -- this way, given a
task id, you can see what input files it has in ~/taskTracker/${taskid}/work
(source:
http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner)
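
If it is easier to set from code than from the config files, a small sketch
(old JobConf API, property name as spelled in the 0.17 tutorial):

    import org.apache.hadoop.mapred.JobConf;

    public class KeepFailedTaskFiles {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Keep the per-task working files of failed tasks around for inspection.
            conf.setBoolean("keep.failed.task.files", true);
        }
    }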

On top of that, you can always use the debugging facilities:

http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#Debugging

"When map/reduce task fails, user can run script for doing post-processing
on task logs i.e task's stdout, stderr, syslog and jobconf. The stdout and
stderr of the user-provided debug script are printed on the diagnostics. "

I hope this helps.

Regards,

Leon Mergen


Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread Leon Mergen
Hello,

On Tue, Aug 5, 2008 at 8:11 PM, Mork0075 <[EMAIL PROTECTED]> wrote:

> So my question: is there a Hadoop scenario for "non computation heavy but
> heavy load" web applications?


I suggest you look into HBase, a subproject of hadoop:
http://hadoop.apache.org/hbase/ -- it is designed after Google's Bigtable
and works on top of Hadoop's DFS. It allows quick retrieval of small
portions of data, in a distributed fashion.

Regards,

Leon Mergen


libhdfs and multithreaded applications

2008-08-05 Thread Leon Mergen
Hello,

At the libhdfs wiki http://wiki.apache.org/hadoop/LibHDFS#Threading I read
this:

"libhdfs can be used in threaded applications using the Posix Threads.
However
to carefully interact with JNI's global/local references the user has to
explicitly call
the *hdfsConvertToGlobalRef* / *hdfsDeleteGlobalRef* apis."

I cannot seem to find any reference to these functions in the entire
hadoop-0.17.1
source base. Are these functions deprecated and multi-threaded applications
will
be supported out-of-the box, or is something different changed ?

Regards,

Leon Mergen


Re: Monthly Hadoop user group meetings

2008-05-06 Thread Leon Mergen
On Tue, May 6, 2008 at 6:59 PM, Cole Flournoy <[EMAIL PROTECTED]>
wrote:

> Is there any way we could set up some off-site web cam conferencing
> abilities?  I would love to attend, but I am on the east coast.


Seconded. I'm from Europe, and am pretty sure that I will watch any video
about a Hadoop conference I can get my hands on, including this. :-)

-- 
Leon Mergen
http://www.solatis.com


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello Ted,

On Thu, Apr 24, 2008 at 9:10 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Locality aware computing is definitely available in hadoop when you are
> processing native files.  The poster was referring to the situation with
> hbase data.


Ok, thank you for the information!

Regards,

Leon Mergen


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
On Thu, Apr 24, 2008 at 8:41 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:

> I think what you're saying is that you are mostly interested in data
> locality. I don't think it's done yet, but it would be pretty easy to make
> HBase provide start keys as well as region locations for splits for a
> MapReduce job. In theory, that would give you all the pieces you need to run
> locality-aware processing.


Aha, so that is the term: locality-aware processing. Yeah, that sounds
exactly like what I'm looking for.

So, as far as I understand it, it is (technically) possible, but it's just
not available within Hadoop yet?

-- 
Leon Mergen
http://www.solatis.com


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
On Thu, Apr 24, 2008 at 8:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Another option is to just store the data in Hadoop and use a tool like Pig,
> Jaql (or grool).  Using these tools to perform ETL for reporting purposes
> and then use these tools or ad hoc map-reduce programs for data-mining
> works
> well.


Yeah, I've read about Pig in some Hadoop Summit notes; it sounds like a
great development! However, in my case, there are only 2 or 3 different ways
we want to filter the data out of the servers, so it wouldn't be such a big
deal to write custom applications for them.



> It really depends on what you want to do.
>
> For reference, we push hundreds of millions to billions of log events per
> day (= many thousands per second) through our logging system, into hadoop
> and then into Oracle.  Hadoop barely sweats with this load and we are using
> a tiny cluster.


But I assume these log events are stored in raw format in HDFS, not HBase?

Regards,

Leon Mergen


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello Peeyush,

On Thu, Apr 24, 2008 at 8:12 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:

> Yes, you can very well store your data in tabular format in Hbase by
> applying the Map-Reduce job on access logs which have been stored on HDFS.
> So while you initially copy the data into HDFS, your data blocks will be
> created and stored on Datanodes. After processing, the data will be stored
> in an Hbase HRegion. So your unprocessed data on HDFS and processed data in
> Hbase will be distributed across machines.


Ah yes, I also understood this from reading the BigTable paper and the HBase
architecture docs; HBase uses regions of about 256MB in size, which are
stored on top of HDFS.

But now I am wondering: after that data has been stored inside HBase, is it
possible to process this data without moving it to a different machine? Say
that I want to run data mining on around 100TB of data; if all that data had to
be moved around the cluster before it could be processed, it would be rather
inefficient. Isn't it a good idea to just process those log files on the
servers they are physically stored on, and, perhaps, allow striping of
multiple MapReduce jobs over the same data by making use of the replication?

Or is this a bad idea? I've always understood that moving the processing to
the servers the data is stored on is cheaper than moving the data to the
servers it can be processed on.

Regards,

Leon Mergen


Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello,

I'm sorry if a question like this has been asked before, but I was unable to
find an answer to this anywhere on Google; if it is off-topic, I apologize
in advance.

I'm trying to look a bit into the future, and predict scalability problems
for the company I work for: we're using PostgreSQL, and processing many
writes/second (access logs, currently around 250, but this will only
increase significantly in the future). Furthermore, we perform data mining
on this data, and ideally, need to have this data stored in a structured
form (the data is searched in various ways). In other words: a very
interesting problem.

Now, I'm trying to understand a bit of the Hadoop/HBase architecture: as I
understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the
use case I was hoping for is not available; however, I'm still going to ask:


Is it possible to store this data in HBase, and thus have all access logs
distributed amongst many different servers, and start MapReduce jobs on those
actual servers, so that they process the data locally? In other words, so that
the data never leaves the actual servers?

If this isn't possible, is this because someone simply never took the time to
implement such a thing, or is it hard to fit into the design (for example,
because the JobTracker needs to be aware of the physical locations of all the
data, since you don't want to analyze the same (replicated) data twice)?

From what I understand from playing with Hadoop for the past few days, the
idea is that you fetch your MapReduce data from HDFS rather than BigTable,
or am I mistaken?

Thanks for your time!

Regards,

Leon Mergen