Memory problems with BytesWritable and huge binary files

2014-01-24 Thread Adam Retter
Hi there,

We have several diverse large datasets to process (one set may be as
much as 27 TB), and all of the files in these datasets are binary
files. We need to be able to pass each binary file to several tools
running in the MapReduce framework.
We already have a working pipeline of MapReduce tasks that receives
each binary file (as a BytesWritable) and processes it; so far we have
only tested it with very small datasets.

For any particular dataset, the size of the files involved varies
widely, with each file being anywhere between about 2 KB and 4 GB. With
that in mind we have tried to follow the advice to read the files into
a SequenceFile in HDFS. To create the SequenceFile we have a MapReduce
job that uses a SequenceFileOutputFormat<Text, BytesWritable>.
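
For illustration, the record layout we end up with is essentially what
this simplified, non-MapReduce sketch would produce (the class name and
paths here are just placeholders):

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a directory of binary files into one SequenceFile, keyed by
// filename. Note that each whole file is held in memory as a byte[],
// which is exactly what becomes a problem for the multi-GB inputs.
public class PackFilesIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}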

We cannot split these files into chunks; they must be processed by our
tools in our mappers and reducers as complete files. The problem we
have is that BytesWritable loads the entire content of a file into
memory, and now that we are trying to process our production-size
datasets, as soon as a couple of large files are in flight the JVM
throws the dreaded OutOfMemoryError.

What we need is some way to process these binary files by reading and
writing their contents as streams to and from the SequenceFile, or
really any other mechanism that does not involve loading the entire
file into RAM! Our own tools that we use in the mappers and reducers
in fact expect to work with java.io.InputStream. We have tried quite a
few things now, including writing some custom Writable
implementations, but we then end up buffering data in temporary files,
which is not exactly ideal when the data already exists in the
sequence files in HDFS.
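
To give a concrete (invented) picture of the contract our tools
already satisfy, and of the kind of call we would like a map task to
be able to make, something along these lines:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The contract our tools already have (names invented): they consume a
// plain InputStream and never need the whole file in memory.
interface BinaryFileTool {
  void process(String fileName, InputStream contents) throws IOException;
}

// Ideally a map task could hand a tool a stream opened straight from
// HDFS, rather than a fully materialised BytesWritable:
class StreamingToolRunner {
  static void run(BinaryFileTool tool, Configuration conf, Path file)
      throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    InputStream in = fs.open(file);
    try {
      tool.process(file.getName(), in);
    } finally {
      in.close();
    }
  }
}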

Is there any hope?


Thanks Adam.

-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


Re: Memory problems with BytesWritable and huge binary files

2014-01-24 Thread Adam Retter
 Is your data in any given file a bunch of key-value pairs?

No. The content of each file itself is the value we are interested in,
and I guess that its filename is the key.

 If that isn't the
 case, I'm wondering how writing a single large key-value into a sequence
 file helps. It won't. Maybe you can give an example of your input data?

Well, from the Hadoop O'Reilly book I rather got the impression that
HDFS does not like small files due to its 64 MB block size, and that it
is instead recommended to place small files into a SequenceFile. Is
that not the case?

Our input data varies across 130 different file types; it could be
Microsoft Office documents, video recordings, audio, CAD diagrams, etc.

 If indeed they are a bunch of smaller sized key-value pairs, you can write
 your own custom InputFormat that reads the data from your input files one
 k-v pair after another, and feed it to your MR job. There isn't any need for
 converting them to sequence-files at that point.

As I mentioned in my initial email, each file cannot be split up!

 Thanks
 +Vinod
 Hortonworks Inc.
 http://hortonworks.com/





-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


Re: Memory problems with BytesWritable and huge binary files

2014-01-24 Thread Adam Retter
So I am not sure I follow you, as we already have a custom InputFormat
and RecordReader and that does not seem to help.

The reason it does not seem to help is that it needs to return the
data as a Writable so that the Writable can then be used in the
following map operation. The map operation needs access to the entire
file.

The only way to do this in Hadoop by default is to use BytesWritable,
but that places everything in memory.
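
For reference, our reader is essentially the whole-file pattern from
the O'Reilly book; simplified, it boils down to something like this
(details trimmed):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// The entire file is buffered into a single BytesWritable before the
// mapper ever sees it, hence the OutOfMemoryError on the big files.
public class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {

  private FileSplit split;
  private Configuration conf;
  private final Text key = new Text();
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    Path file = split.getPath();
    byte[] contents = new byte[(int) split.getLength()]; // whole file in RAM
    FSDataInputStream in = file.getFileSystem(conf).open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    key.set(file.getName());
    value.set(contents, 0, contents.length);
    processed = true;
    return true;
  }

  public Text getCurrentKey() { return key; }
  public BytesWritable getCurrentValue() { return value; }
  public float getProgress() { return processed ? 1.0f : 0.0f; }
  public void close() { }
}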

What am I missing?

On 24 January 2014 22:42, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:
 Okay. Assuming you don't need a whole file (video) in memory for your 
 processing, you can simply write an InputFormat/RecordReader implementation 
 that streams through any given file and processes it.

 +Vinod




-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


RE: Appropriate for Hadoop?

2009-04-29 Thread Adam Retter
I was more concerned that our input comes from SQL databases and a
proprietary EMC document store, and that our output goes to a
different SQL database.

I don't want to use any sort of file system at all.

 
Adam Retter
Software Developer
Landmark Information Group
 
T: 01392 685403 (x5403) 
 
5-7 Abbey Court, Eagle Way, Sowton,
Exeter, Devon, EX2 7HY
 
www.landmark.co.uk
 
-----Original Message-----
From: Chuck Lam [mailto:chuck@gmail.com] 
Sent: 28 April 2009 20:25
To: core-user@hadoop.apache.org
Subject: Re: Appropriate for Hadoop?

HDFS is designed with Hadoop in mind, so there are certain advantages
(e.g. performance, reliability, and ease of use) to using HDFS for
Hadoop. However, it's not required. For example, when you run Hadoop
in standalone mode, it just uses the file system on your local
machine. When you run it on Amazon AWS, it can use S3 as a file
system.
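
For example (rough sketch, the bucket name and credentials are
placeholders), a job driver can point Hadoop at the local file system
or at S3 explicitly:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;

// Point Hadoop at something other than HDFS.
public class NoHdfsExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NoHdfsExample.class);

    // Standalone / local mode: use the machine's own file system.
    conf.set("fs.default.name", "file:///");

    // Or, on EC2, an S3 bucket can serve as the default file system
    // (credentials go in fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey):
    // conf.set("fs.default.name", "s3n://my-bucket");

    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default file system: " + fs.getUri());
  }
}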







Appropriate for Hadoop?

2009-04-28 Thread Adam Retter

If I understand correctly, Hadoop forms a general-purpose cluster on
which you can execute jobs?

We have a Java data processing application here that follows the
producer-consumer pattern. It has been written with threading as a
concern from the start, using java.util.concurrent.Callable.

At present the producer is a thread that retrieves a list of document
URIs from a SQL query against databaseA and adds them to a shared
(synchronised) queue.

Each consumer is a thread, of which there can be n, but we typically run
with 16 on the current hardware.
The consumer sits in a loop, processing the queue until it is empty. It
removes a document URI from the shared queue, retrieves the document and
performs a pipeline of transformations on the document, resulting in a
series of 600 to 16000 SQL insert statements which are then executed
against databaseB.

I have been reading about both Terracotta and Hadoop. Hadoop appears to
be the more general-purpose solution, one that we could use for many
applications; however, I am not sure how our application would map onto
Hadoop concepts. I have been studying Hadoop's Map/Reduce approach, but
our application does not produce any intermediate files that would be
the input/output of the Map/Reduce processes.

Any guidance would be appreciated; it may well be that our application
is not an appropriate use of Hadoop.


Thanks Adam.
 
Adam Retter
Software Developer
Landmark Information Group
 
T: 01392 685403 (x5403) 
 
5-7 Abbey Court, Eagle Way, Sowton,
Exeter, Devon, EX2 7HY
 
www.landmark.co.uk
 





RE: General purpose processing on Hadoop

2009-04-28 Thread Adam Retter

 Are you interested in building such a system?

I would be interested in using such a system, but otherwise I am afraid
that I do not have the time or resources available to be involved in
such a project. Sorry.




RE: Appropriate for Hadoop?

2009-04-28 Thread Adam Retter

 The processing of each document is independent and can be done in
 parallel, so that part could be done in a map reduce job.
 Now whether it suits this use case depends on the rate at which new
 URIs are discovered for processing and the acceptable delay in
 processing of a document. The way I see it, you can batch the URIs
 and input that to a mapreduce job. Each mapper can work on a sublist
 of URIs. You can choose to make the DB inserts from the mapper itself;
 in that case you can set the number of reducers to 0. Otherwise, if
 batching of the queries is an option, then you can consider making
 batch inserts in the reducer. It will help in reducing load on the DB.
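
For my own understanding, the shape of the job being suggested would
presumably be something like the following rough sketch, where
processDocument() and the JDBC wiring are placeholders for our real
pipeline:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

// Map-only job: the input is a text file of document URIs (one per line);
// each mapper processes its slice and writes straight to databaseB.
public class UriProcessingJob {

  public static class UriMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Connection databaseB;

    public void configure(JobConf job) {
      try {
        databaseB = DriverManager.getConnection(job.get("databaseB.jdbc.url"));
      } catch (SQLException e) {
        throw new RuntimeException(e);
      }
    }

    public void map(LongWritable offset, Text uri,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
      try {
        // fetch and transform the document, then insert the results
        for (String insertSql : processDocument(uri.toString())) {
          PreparedStatement stmt = databaseB.prepareStatement(insertSql);
          stmt.executeUpdate();
          stmt.close();
        }
      } catch (SQLException e) {
        throw new IOException(e.getMessage());
      }
    }

    public void close() throws IOException {
      try {
        databaseB.close();
      } catch (SQLException e) {
        throw new IOException(e.getMessage());
      }
    }

    private String[] processDocument(String uri) {
      return new String[0]; // stand-in for the real transformation pipeline
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(UriProcessingJob.class);
    conf.setJobName("uri-processing");
    conf.setMapperClass(UriMapper.class);
    conf.setNumReduceTasks(0);                    // no reduce phase needed
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class); // nothing written back to HDFS
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}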

So I don't have to use HDFS at all when using Hadoop?


