Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bhupesh Bansal
Great, Bradford!

Can you post some videos if you have any?

Best
Bhupesh



On 6/3/09 11:58 AM, "Bradford Stephens"  wrote:

> Hey everyone!
> I just wanted to give a BIG THANKS for everyone who came. We had over a
> dozen people, and a few got lost at UW :)  [I would have sent this update
> earlier, but I flew to Florida the day after the meeting].
> 
> If you didn't come, you missed quite a bit of learning, on topics such as:
> 
> -Building a Social Media Analysis company on the Apache Cloud Stack
> -Cancer detection in images using Hadoop
> -Real-time OLAP
> -Scalable Lucene using Katta and Hadoop
> -Video and Network Flow
> -Custom Ranking in Lucene
> 
> I'm going to update our wiki with the topics, and a few questions raised and
> the lessons we've learned.
> 
> The next meetup will be June 24th. Be there, or be... boring :)
> 
> Cheers,
> Bradford
> 
> On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens <
> bradfordsteph...@gmail.com> wrote:
> 
>> Greetings,
>> 
>> Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
>> with me in the Seattle area? I can donate some facilities, etc. -- I
>> also always have topics to speak about :)
>> 
>> Cheers,
>> Bradford
>> 



Re: Randomize input file?

2009-05-21 Thread Bhupesh Bansal
Hmm,

IMHO, running a mapper-only job will give you an output file
in the same order. You should write a custom map-reduce job
where the map emits (key: Integer.random(), value: line)
and the reducer outputs (key: NOTHING, value: line).

The sort on the random keys gives you a random ordering for your
input file.
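A minimal sketch of this idea using the 0.18-era mapred API (the class names
and the choice of IntWritable random keys are illustrative, not from the
thread):

import java.io.IOException;
import java.util.Iterator;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: tag every input line with a random key.
public class RandomizeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  private final Random random = new Random();
  private final IntWritable randomKey = new IntWritable();

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    randomKey.set(random.nextInt());
    out.collect(randomKey, line);   // the shuffle/sort orders records by this key
  }
}

// Reduce: drop the random key and emit the lines in their new (random) order.
class RandomizeReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, NullWritable, Text> {
  public void reduce(IntWritable key, Iterator<Text> lines,
                     OutputCollector<NullWritable, Text> out, Reporter reporter)
      throws IOException {
    while (lines.hasNext()) {
      out.collect(NullWritable.get(), lines.next());
    }
  }
}

With a single reducer this yields one shuffled output file; with several
reducers you would concatenate the part files afterwards.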

Best
Bhupesh


On 5/21/09 11:15 AM, "Alex Loddengaard"  wrote:

> Hi John,
> 
> I don't know of a built-in way to do this.  Depending on how well you want
> to randomize, you could just run a MapReduce job with at least one map (the
> more maps, the more random) and no reduces.  When you run a job with no
> reduces, the shuffle phase is skipped entirely, and the intermediate outputs
> from the mappers are stored directly to HDFS.  Though I think each mapper
> will create one HDFS file, so you'll have to concatenate all files into a
> single file.
> 
> The above isn't a very good way to randomize, but it's fairly easy to
> implement and should run pretty quickly.
> 
> Hope this helps.
> 
> Alex
> 
> On Thu, May 21, 2009 at 7:18 AM, John Clarke  wrote:
> 
>> Hi,
>> 
>> I have a need to randomize my input file before processing. I understand I
>> can chain Hadoop jobs together so the first could take the input file
>> randomize it and then the second could take the randomized file and do the
>> processing.
>> 
>> The input file has one entry per line and I want to mix up the lines before
>> the main processing.
>> 
>> Is there an inbuilt ability I have missed or will I have to try and write a
>> Hadoop program to shuffle my input file?
>> 
>> Cheers,
>> John
>> 



Re: What's the local heap size of Hadoop? How to increase it?

2009-04-29 Thread Bhupesh Bansal
Hey,

Try adding

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx800M -server</value>
  </property>

with the right JVM size to your hadoop-site.xml. You will have to copy this
to all mapred nodes and restart the cluster.
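Alternatively, since mapred.child.java.opts is a per-job property, it can
also be set programmatically on the JobConf of the job that needs the bigger
heap. A minimal sketch assuming the 0.18-era API (the class name is
illustrative):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithBiggerHeap {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitWithBiggerHeap.class);
    // Give each map/reduce child JVM an 800 MB heap for this job only.
    conf.set("mapred.child.java.opts", "-Xmx800M");
    // ... set mapper, reducer, input/output paths, etc. ...
    JobClient.runJob(conf);
  }
}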

Best
Bhupesh



On 4/29/09 2:03 PM, "Jasmine (Xuanjing) Huang"  wrote:

> Hi, there,
> 
> What's the local heap size of Hadoop? I have tried to load a local cache
> file which is composed of 500,000 short phrases, but the task failed. The
> output of Hadoop looks like this (com.aliasi.dict.ExactDictionaryChunker is a
> third-party jar package, and the records are organized as a trie structure):
> 
> java.lang.OutOfMemoryError: Java heap space
> at java.util.HashMap.addEntry(HashMap.java:753)
> at java.util.HashMap.put(HashMap.java:385)
> at 
> com.aliasi.dict.ExactDictionaryChunker$TrieNode.getOrCreateDaughter(Ex
> actDictionaryChunker.java:476)
> at 
> com.aliasi.dict.ExactDictionaryChunker$TrieNode.add(ExactDictionaryChu
> nker.java:484)
> 
> When I reduce the total record number to 30,000, my MapReduce job succeeds.
> So I have a question: what's the heap size of Hadoop's Java Virtual
> Machine, and how do I increase it?
> 
> Best,
> Jasmine 
> 



Re: Hadoop / MySQL

2009-04-29 Thread Bhupesh Bansal
Slightly off topic, as this is a non-MySQL solution:

We have the same problem: computing about 100G of data daily and serving it
online with minimum impact during the data refresh.

We are using our in-house clone of Amazon Dynamo, a key-value distributed
hash table store (Project Voldemort), for the serving side. Project Voldemort
supports a ReadOnlyStore which uses file-based data/index. The interesting
part is that we compute the new data/index on Hadoop and just hot-swap it on
the Voldemort nodes. Total swap time is roughly the scp/rsync time, with the
actual service impact (closing and opening file descriptors) being very,
very small.

Thanks a lot for the info on this thread; it has been very interesting.

Best
Bhupesh


On 4/29/09 11:48 AM, "Todd Lipcon"  wrote:

> On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski wrote:
> 
>> If you have trouble loading your data into mysql using INSERTs or LOAD
>> DATA, consider that MySQL supports CSV directly using the CSV storage
>> engine. The only thing you have to do is to copy your hadoop produced
>> csv file into the mysql data directory and issue a "flush tables"
>> command to have MySQL flush its caches and pick up the new file. It's
>> very simple and you have the full set of SQL commands available just
>> as with InnoDB or MyISAM. What you don't get with the CSV engine are
>> indexes and foreign keys. Can't have it all, can you?
>> 
> 
> The CSV storage engine is definitely an interesting option, but it has a
> couple downsides:
> 
> - Like you mentioned, you don't get indexes. This seems like a huge deal to
> me - the reason you want to load data into MySQL instead of just keeping it
> in Hadoop is so you can service real-time queries. Not having any indexing
> kind of defeats the purpose there. This is especially true since MySQL only
> supports nested-loop joins, and there's no way of attaching metadata to a
> CSV table to say "hey look, this table is already in sorted order so you can
> use a merge join".
> 
> - Since CSV is a text based format, it's likely to be a lot less compact
> than a proper table. For example, a unix timestamp is likely to be ~10
> characters vs 4 bytes in a packed table.
> 
> - I'm not aware of many people actually using CSV for anything except
> tutorials and training. Since it's not in heavy use by big mysql users, I
> wouldn't build a production system around it.
> 
> Here's a wacky idea that I might be interested in hacking up if anyone's
> interested:
> 
> What if there were a MyISAMTableOutputFormat in hadoop? You could use this
> as a reducer output and have it actually output .frm and .myd files onto
> HDFS, then simply hdfs -get them onto DB servers for realtime serving.
> Sounds like a fun hack I might be interested in if people would find it
> useful. Building the .myi indexes in Hadoop would be pretty killer as well,
> but potentially more difficult.
> 
> -Todd



Re: sub-optimal multiple disk usage in 0.18.3?

2009-04-23 Thread Bhupesh Bansal
What configuration are you using for the disks ??

Best configuration is just doing a JBOD.

http://www.nabble.com/RAID-vs.-JBOD-td21404366.html
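For reference, a JBOD setup just lists one data directory per physical disk
in dfs.data.dir. A minimal hadoop-site.xml sketch (the mount points are
illustrative):

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
  </property>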

Best
Bhupesh



On 4/23/09 12:54 PM, "Mike Andrews"  wrote:

> I have a bunch of datanodes with several disks each, and I noticed
> that sometimes dfs blocks don't get evenly distributed among them. For
> instance, one of my machines has 5 disks with 500 GB each, and 1 disk
> with 2 TB (6 total disks). The 5 smaller disks are each 98% full,
> whereas the larger one is only 12% full. It seems as though dfs should
> do better by putting more of the blocks on the larger disk first. And
> mapreduce jobs are failing on this machine with error
> "java.io.IOException: No space left on device".
> 
> Any thoughts or suggestions? Thanks in advance.



Lost TaskTracker Errors

2009-04-02 Thread Bhupesh Bansal
Hey Folks, 

For the last 2-3 days I have been seeing many of these errors popping up in
our hadoop cluster.

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

The JobTracker logs don't have any more info, and the tasktracker logs are
clean.

The failures occurred with these symptoms:
1. Datanodes will start timing out
2. HDFS will get extremely slow (an HDFS ls takes about 2 mins vs 1s in
normal mode)

The datanode logs on failing tasktracker nodes are filled up with
2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 48 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
at java.lang.Thread.run(Thread.java:619)


We are running a 10 Node cluster (hadoop-0.18.1) on Dual Quad core boxes (8G
RAM) with these properties
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 10240
6. dfs.datanode.max.xcievers = 512

The map jobs write a ton of data for each record. Will increasing
"dfs.datanode.handler.count" help in this case ??  What other
configuration changes can I try ??
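For reference, both datanode-side settings live in hadoop-site.xml on the
datanodes (a restart is needed to pick them up). A sketch with bumped values;
the numbers are illustrative, not a recommendation from this thread:

  <property>
    <name>dfs.datanode.handler.count</name>
    <value>20</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>1024</value>
  </property>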


Best
Bhupesh




RE: can't read the SequenceFile correctly

2009-02-06 Thread Bhupesh Bansal
Hey Tom, 

I also got burned by this. Why does BytesWritable.getBytes() return
non-valid bytes ?? Or should we add a BytesWritable.getValidBytes() kind of
function?


Best
Bhupesh 



-Original Message-
From: Tom White [mailto:t...@cloudera.com]
Sent: Fri 2/6/2009 2:25 AM
To: core-user@hadoop.apache.org
Subject: Re: can't read the SequenceFile correctly
 
Hi Mark,

Not all the bytes stored in a BytesWritable object are necessarily
valid. Use BytesWritable#getLength() to determine how much of the
buffer returned by BytesWritable#getBytes() to use.
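A minimal sketch of that pattern: only the first getLength() bytes of the
buffer are valid, so copy (or bound) them explicitly before use.

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class ValidBytes {
  // Return only the valid portion of a BytesWritable's backing buffer.
  public static byte[] validBytes(BytesWritable value) {
    return Arrays.copyOf(value.getBytes(), value.getLength());
  }
}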

Tom

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner  wrote:
> Hi,
>
> I have written binary files to a SequenceFile, seemingly successfully, but
> when I read them back with the code below, after the first few reads I get
> the same number of bytes for the different files. What could be wrong?
>
> Thank you,
> Mark
>
>  reader = new SequenceFile.Reader(fs, path, conf);
>Writable key = (Writable)
> ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>Writable value = (Writable)
> ReflectionUtils.newInstance(reader.getValueClass(), conf);
>long position = reader.getPosition();
>while (reader.next(key, value)) {
>String syncSeen = reader.syncSeen() ? "*" : "";
>byte [] fileBytes = ((BytesWritable) value).getBytes();
>System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen,
> key, fileBytes.length);
>position = reader.getPosition(); // beginning of next record
>}
>



Re: job management in Hadoop

2009-01-30 Thread Bhupesh Bansal
Bill, 

Currently you can kill the job from the UI.
You have to set this config (it defaults to false in hadoop-default.xml):

  webinterface.private.actions = true
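A minimal hadoop-site.xml sketch of that override:

  <property>
    <name>webinterface.private.actions</name>
    <value>true</value>
  </property>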

Best
Bhupesh


On 1/30/09 3:23 PM, "Bill Au"  wrote:

> Thanks.
> 
> Does anyone know if there is a plan to add this functionality to the web UI,
> the way job priority can be changed from both the command line and the web UI?
> 
> Bill
> 
> On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy  wrote:
> 
>> 
>> On Jan 30, 2009, at 2:41 PM, Bill Au wrote:
>> 
>>  Is there any way to cancel a job after it has been submitted?
>>> 
>>> 
>> bin/hadoop job -kill <job-id>
>> 
>> Arun
>> 



Distributed cache testing in local mode

2009-01-22 Thread Bhupesh Bansal
Hey folks, 

I am trying to use the distributed cache in Hadoop jobs to pass around
configuration files, external jars (job specific) and some archive data.

I want to test the job end-to-end in local mode, but I think the distributed
cache is only localized in the TaskTracker code, which is not called in local
mode through LocalJobRunner.

I can do some fairly simple workarounds for this (a sketch of one is below),
but was just wondering if folks have more ideas about it.
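A hedged sketch of one such workaround (not from the original thread): detect
LocalJobRunner via mapred.job.tracker and fall back to the original cache
file paths instead of the TaskTracker-localized copies.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheFiles {
  // Resolve cache files whether running on a real cluster or under LocalJobRunner.
  public static Path[] getCacheFiles(JobConf conf) throws IOException {
    if ("local".equals(conf.get("mapred.job.tracker"))) {
      // Local mode: nothing was localized, so read the original URIs directly.
      URI[] uris = DistributedCache.getCacheFiles(conf);
      if (uris == null) {
        return new Path[0];
      }
      Path[] paths = new Path[uris.length];
      for (int i = 0; i < uris.length; i++) {
        paths[i] = new Path(uris[i].getPath());
      }
      return paths;
    }
    // Cluster mode: use the copies the TaskTracker localized on this node.
    return DistributedCache.getLocalCacheFiles(conf);
  }
}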

Thanks
Bhupesh



Re: tasktracker startup Time

2008-11-18 Thread Bhupesh Bansal
Thanks Steve, 

I will try kill -QUIT and report back.

Best
Bhupesh


On 11/18/08 5:45 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:

> Bhupesh Bansal wrote:
>> Hey folks, 
>> 
>> I re-started my cluster after some node failures and saw a couple of
>> tasktrackers not coming up (they finally did, after about 20 mins).
>> In the logs below, note the jump in timestamps from 10:43:04 to 11:12:38.
>> 
>> I was just curious: what do we do while starting a tasktracker that could
>> take so much time ???
>> 
>> 
> 
>> 2008-11-17 10:43:04,757 INFO org.mortbay.util.Container: Started
>> [EMAIL PROTECTED]
>> 2008-11-17 11:12:38,373 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>> Initializing JVM Metrics with processName=TaskTracker, sessionId=
>> 2008-11-17 11:12:38,410 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
>> Initializing RPC Metrics with hostName=TaskTracker, port=47601
> 
> 
> Off the top of my head
> -DNS lookups can introduce delays if your network's DNS is wrong, but
> that shouldn't take so long
> -The task tracker depends on the job tracker and the filesystem being
> up. If the filesystem is recovering: no task trackers
> 
> Next time, get the process ID (via a jps -v call), then do kill -QUIT on
> the process. This will print out to the process's console the stack
> trace of all its threads; this could track down where it is hanging



tasktracker startup Time

2008-11-17 Thread Bhupesh Bansal
Hey folks, 

I re-started my cluster after some node failures and saw a couple of
tasktrackers not coming up (they finally did, after about 20 mins).
In the logs below, note the jump in timestamps from 10:43:04 to 11:12:38.

I was just curious: what do we do while starting a tasktracker that could
take so much time ???


Best
Bhupesh 


2008-11-17 10:43:04,094 INFO org.apache.hadoop.mapred.TaskTracker:
STARTUP_MSG: 
/
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = 
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.1
STARTUP_MSG:   build = ...
/
2008-11-17 10:43:04,292 INFO org.mortbay.util.Credential: Checking Resource
aliases
2008-11-17 10:43:04,400 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-11-17 10:43:04,401 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-11-17 10:43:04,401 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-11-17 10:43:04,713 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-11-17 10:43:04,753 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-11-17 10:43:04,757 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50060
2008-11-17 10:43:04,757 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-11-17 11:12:38,373 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=TaskTracker, sessionId=
2008-11-17 11:12:38,410 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=TaskTracker, port=47601
2008-11-17 11:12:38,487 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2008-11-17 11:12:38,488 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 47601: starting



Re: Mapper settings...

2008-11-06 Thread Bhupesh Bansal
In that case,

I will try to put up a patch for this if nobody else is working on it.

Best
Bhupesh
 





On 10/31/08 4:06 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:

> 
> On Oct 31, 2008, at 3:15 PM, Bhupesh Bansal wrote:
> 
>> Why do we need these setters in JobConf ??
>> 
>> jobConf.setMapOutputKeyClass(String.class);
>> 
>> jobConf.setMapOutputValueClass(LongWritable.class);
> 
> Just historical. The Mapper and Reducer interfaces didn't use to be
> generic. (Hadoop used to run on Java 1.4 too...)
> 
> It would be nice to remove the need to call them. There is an old bug
> open to check for consistency HADOOP-1683. It would be even better to
> make the setting of both the map and reduce output types optional if
> they are specified by the template parameters.
> 
> -- Owen



Mapper settings...

2008-10-31 Thread Bhupesh Bansal
Hey guys, 

Just curious, 

Why do we need these setters in JobConf ??

jobConf.setMapOutputKeyClass(String.class);

jobConf.setMapOutputValueClass(LongWritable.class);

We should be able to extract these from the
OutputCollector of the Mapper class ??

IMHO, they have to be consistent with the OutputCollector class, so why have an
extra point of failure?
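To make the consistency point concrete, a minimal sketch with the 0.18-era
API (Text/LongWritable and the class name are just illustrative): the map
output types already appear in the Mapper's generic signature and its
OutputCollector, and the JobConf setters have to repeat exactly the same
types.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordLengthMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    out.collect(line, new LongWritable(line.getLength()));
  }
}

// When configuring the job, the setters must agree with the generic parameters:
//   conf.setMapperClass(WordLengthMapper.class);
//   conf.setMapOutputKeyClass(Text.class);           // third type parameter
//   conf.setMapOutputValueClass(LongWritable.class); // fourth type parameter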


Best
Bhupesh



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Thanks Colin/ Owen

I will try some of the ideas here and report back.
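A minimal sketch of the MappedByteBuffer idea Owen mentions below, assuming
the graph file has already been localized on the node via the DistributedCache
(the class and method names are illustrative):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class GraphMmap {
  // Memory-map a region of a locally cached graph file. The pages live in the
  // OS page cache, so several task JVMs on the same machine share one physical
  // copy instead of each holding the graph on its own heap.
  public static MappedByteBuffer mapRegion(File localCacheFile, long offset, long length)
      throws IOException {
    RandomAccessFile raf = new RandomAccessFile(localCacheFile, "r");
    try {
      // A single mapping is limited to 2 GB, so a multi-GB graph is mapped
      // as several regions (offset/length chosen by the caller).
      return raf.getChannel().map(FileChannel.MapMode.READ_ONLY, offset, length);
    } finally {
      raf.close();  // the mapping stays valid after the file is closed
    }
  }
}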

Best
Bhupesh



On 10/16/08 4:05 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> The trick is to amortize your computation over the whole set.  So DFS
> for a single node will always be faster on an in-memory graph, but
> Hadoop is a good tool for computing all-pairs shortest paths in one shot
> if you re-frame the algorithm as a belief propagation and message
> passing algorithm.
> 
> A lot of the time, the computation still explodes into n^2 or worse, so
> you need to use a binning or blocking algorithm, like the one described
> here:  http://www.youtube.com/watch?v=1ZDybXl212Q
> 
> In the case of graphs, a blocking function would be to find overlapping
> strongly connected subgraphs where each subgraph fits in a reasonable
> amount of memory.  Then within each block, you do your computation and
> you pass a summary of that computation to adjacent blocks,which gets
> factored into the next computation.
> 
> When we hooked up a Very Big Graph to our Hadoop cluster, we found that
> there were a lot of scaling problems, which went away when we started
> optimizing for streaming performance.
> 
> -Colin
> 
> 
> 
> Bhupesh Bansal wrote:
>> Can you elaborate here ,
>> 
>> Lets say I want to implement a DFS in my graph. I am not able to picturise
>> implementing it with doing graph in pieces without putting a depth bound to
>> (3-4). Lets say we have 200M (4GB) edges to start with
>> 
>> Best
>> Bhupesh
>> 
>> 
>> 
>> On 10/16/08 3:01 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
>> 
>>   
>>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>> 
>>> 
>>>> We at Linkedin are trying to run some Large Graph Analysis problems on
>>>> Hadoop. The fastest way to run would be to keep a copy of whole
>>>> Graph in RAM
>>>> at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-
>>>> cores
>>>> machine with 8G on each.
>>>>   
>>> The best way to deal with it is *not* to load the entire graph in one
>>> process. In the WebMap at Yahoo, we have a graph of the web that has
>>> roughly 1 trillion links and 100 billion nodes. See
>>> http://tinyurl.com/4fgok6
>>>   . To invert the links, you process the graph in pieces and resort
>>> based on the target. You'll get much better performance and scale to
>>> almost any size.
>>> 
>>> 
>>>> Whats is the best way of doing that ?? Is there a way so that multiple
>>>> mappers on same machine can access a RAM cache ??  I read about hadoop
>>>> distributed cache looks like it's copies the file (hdfs / http)
>>>> locally on
>>>> the slaves but not necessrily in RAM ??
>>>>   
>>> You could mmap the file from distributed cache using MappedByteBuffer.
>>> Then there will be one copy between jvms...
>>> 
>>> -- Owen
>>> 
>> 
>>   
> 



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Can you elaborate here?

Let's say I want to implement a DFS (depth-first search) over my graph. I am
not able to picture implementing it with the graph processed in pieces without
putting a depth bound of (3-4) on it. Let's say we have 200M (4GB) edges to
start with.

Best
Bhupesh



On 10/16/08 3:01 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:

> 
> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
> 
>> We at Linkedin are trying to run some Large Graph Analysis problems on
>> Hadoop. The fastest way to run would be to keep a copy of whole
>> Graph in RAM
>> at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-
>> cores
>> machine with 8G on each.
> 
> The best way to deal with it is *not* to load the entire graph in one
> process. In the WebMap at Yahoo, we have a graph of the web that has
> roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6
>   . To invert the links, you process the graph in pieces and resort
> based on the target. You'll get much better performance and scale to
> almost any size.
> 
>> Whats is the best way of doing that ?? Is there a way so that multiple
>> mappers on same machine can access a RAM cache ??  I read about hadoop
>> distributed cache looks like it's copies the file (hdfs / http)
>> locally on
>> the slaves but not necessrily in RAM ??
> 
> You could mmap the file from distributed cache using MappedByteBuffer.
> Then there will be one copy between jvms...
> 
> -- Owen
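For concreteness, a hedged sketch of the "process the graph in pieces and
resort based on the target" idea Owen describes above: essentially an
edge-inversion job (0.18-era API; the input format and class names are
illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Input lines: "source<TAB>target". The map emits (target, source), so the
// shuffle groups all in-links of each node without ever holding the whole
// graph in RAM.
public class InvertEdgesMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] edge = line.toString().split("\t");
    if (edge.length == 2) {
      out.collect(new Text(edge[1]), new Text(edge[0]));
    }
  }
}

// Reduce: concatenate all sources pointing at a target into one output line.
class InvertEdgesReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text target, Iterator<Text> sources,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    StringBuilder inLinks = new StringBuilder();
    while (sources.hasNext()) {
      if (inLinks.length() > 0) {
        inLinks.append(',');
      }
      inLinks.append(sources.next().toString());
    }
    out.collect(target, new Text(inLinks.toString()));
  }
}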



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Minor correction: the graph size is about 6G, not 8G.


On 10/16/08 1:52 PM, "Bhupesh Bansal" <[EMAIL PROTECTED]> wrote:

> Hey guys, 
> 
> 
> We at Linkedin are trying to run some Large Graph Analysis problems on
> Hadoop. The fastest way to run would be to keep a copy of whole Graph in RAM
> at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-cores
> machine with 8G on each.
> 
> Whats is the best way of doing that ?? Is there a way so that multiple
> mappers on same machine can access a RAM cache ??  I read about hadoop
> distributed cache looks like it's copies the file (hdfs / http) locally on
> the slaves but not necessrily in RAM ??
> 
> Best
> Bhupesh
> 



Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Hey guys, 


We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run would be to keep a copy of the whole graph in
RAM at all mappers (graph size is about 8G in RAM); we have a cluster of
8-core machines with 8G on each.

What is the best way of doing that ?? Is there a way for multiple
mappers on the same machine to access a shared RAM cache ??  I read about the
Hadoop distributed cache; it looks like it copies the file (from HDFS / HTTP)
onto the slaves locally, but not necessarily into RAM ??

Best
Bhupesh



Re: Hadoop User Group (Bay Area) Oct 15th

2008-10-15 Thread Bhupesh Bansal
Hi , 

I didn't RSVP for this event. I would like to join with 2 of my colleagues.
Please let us know if we can ?

Best
Bhupesh




On 10/15/08 11:56 AM, "Steve Gao" <[EMAIL PROTECTED]> wrote:

> I am excited to see the slides. Would you send me a copy? Thanks.
> 
> --- On Wed, 10/15/08, Nishant Khurana <[EMAIL PROTECTED]> wrote:
> From: Nishant Khurana <[EMAIL PROTECTED]>
> Subject: Re: Hadoop User Group (Bay Area) Oct 15th
> To: core-user@hadoop.apache.org
> Date: Wednesday, October 15, 2008, 9:45 AM
> 
> I would love to see the slides too. I am specially interested in
> implementing database joins with Map Reduce.
> 
> On Wed, Oct 15, 2008 at 7:24 AM, Johan Oskarsson <[EMAIL PROTECTED]>
> wrote:
> 
>> Since I'm not based in the San Francisco I would love to see the
> slides
>> from this meetup uploaded somewhere. Especially the database join
>> techniques talk sounds very interesting to me.
>> 
>> /Johan
>> 
>> Ajay Anand wrote:
>>> The next Bay Area User Group meeting is scheduled for October 15th at
>>> Yahoo! 2821 Mission College Blvd, Santa Clara, Building 1, Training
>>> Rooms 3 & 4 from 6:00-7:30 pm.
>>> 
>>> Agenda:
>>> - Exploiting database join techniques for analytics with Hadoop: Jun
>>> Rao, IBM
>>> - Jaql Update: Kevin Beyer, IBM
>>> - Experiences moving a Petabyte Data Center: Sriram Rao, Quantcast
>>> 
>>> Look forward to seeing you there!
>>> Ajay
>> 
>> 
> 



Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi Guys, I need to restart discussion around 
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

 I saw the same OOM error in my map-reduce job in the map phase. 

1. I tried changing mapred.child.java.opts (bumped to 600M) 
2. io.sort.mb was kept at 100MB. 

I see the same errors still. 

I checked with a debugger the size of "keyValBuffer" in collect(); it is always
less than io.sort.mb and is spilled to disk properly.

I tried changing the map.task number to a very high number so that the input
is split into smaller chunks. It helped for a while, as the map job got a bit
further (56% vs 5%), but I still see the problem.

I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] value in the
opts to get the gc.log, but didn't get any log??

I tried using 'jmap -histo pid' to see the heap information; it didn't give
me any meaningful or obvious problem point.

What are the other possible memory hogs during the mapper phase ?? Is the
input file chunk kept fully in memory ??

Application:

My map-reduce job runs with about 2G of input. In the mapper phase I read
each line and output [5-500] (key, value) pairs, so the intermediate data
is really blown up. Will that be a problem?

The Error file is attached
http://www.nabble.com/file/p16628181/error.txt error.txt 
-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628181p16628181.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi Guys, 

I need to restart discussion around 
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

I saw the same OOM error in my map-reduce job in the map phase.

1. I tried changing mapred.child.java.opts (bumped to 600M)
2. io.sort.mb was kept at 100MB.
I see the same errors still.

I checked with a debugger the size of "keyValBuffer" in collect(); it is always
less than io.sort.mb and is spilled to disk properly.

I tried changing the map.task number to a very high number so that the input
is split into smaller chunks. It helped for a while, as the map job got a bit
further (56% vs 5%), but I still see the problem.

I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] value in the
opts to get the gc.log, but didn't get any log??

I tried using 'jmap -histo pid' to see the heap information; it didn't give
me any meaningful or obvious problem point.


What are the other possible memory hogs during the mapper phase ?? Is the
input file chunk kept fully in memory ??


task_200804110926_0004_m_000239_0: java.lang.OutOfMemoryError: Java heap space
task_200804110926_0004_m_000239_0:  at java.util.Arrays.copyOf(Arrays.java:2786)
task_200804110926_0004_m_000239_0:  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.write(DataOutputStream.java:90)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:384)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.DataObjects.SearchTrackingJoinValue.write(SearchTrackingJoinValue.java:117)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:350)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.readSearchJoinResultsObject(SearchClickJoinMapper.java:131)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.map(SearchClickJoinMapper.java:54)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.map(SearchClickJoinMapper.java:31)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)


-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628173p16628173.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.