RE: OutOfMemory Error

2008-09-18 Thread Palleti, Pallavi
Yeah, that was the problem. And Hama could certainly be useful for large-scale 
matrix operations.

For this problem, though, I have modified the code to pass only the ID information 
and to read the vector information when it is needed. In this case, it was 
needed only in the reduce phase. This avoids the out-of-memory error, and the 
job is also faster now.
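
For anyone hitting the same issue, here is a minimal sketch of the reducer side
of this pattern (not the actual code; the MapFile layout and the "vector.map"
job property are invented for illustration). The mapper emits only the small ID
key, and the reducer looks the full vector up on demand:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class VectorLookupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MapFile.Reader vectors;   // id -> dense vector, built by an earlier job

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // "vector.map" is a hypothetical property pointing at the MapFile directory
      vectors = new MapFile.Reader(fs, job.get("vector.map"), job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text id, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Text vector = new Text();
    vectors.get(id, vector);        // fetch the 160k-element vector only now
    output.collect(id, vector);     // the map output carried only the small ID
  }

  public void close() throws IOException {
    vectors.close();
  }
}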

Thanks
Pallavi
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Friday, September 19, 2008 10:35 AM
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: OutOfMemory Error

> The key is of the form "ID :DenseVector Representation in mahout with

I guess the vector size is too large, so it will need a distributed vector
architecture (or 2D partitioning strategies) for large-scale matrix
operations. The Hama team is investigating these problem areas, so this should
improve if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti <[EMAIL PROTECTED]> wrote:
>
> Hadoop Version - 17.1
> io.sort.factor =10
> The key is of the form "ID :DenseVector Representation in mahout with
> dimensionality size = 160k"
> For example: C1:[,0.0011, 3.002, .. 1.001,]
> So, typical size of the key  of the mapper output can be 160K*6 (assuming
> double in string is represented in 5 bytes)+ 5 (bytes for C1:[])  + size
> required to store that the object is of type Text
>
> Thanks
> Pallavi
>
>
>
> Devaraj Das wrote:
>>
>>
>>
>>
>> On 9/17/08 6:06 PM, "Pallavi Palleti" <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> Hi all,
>>>
>>>I am getting outofmemory error as shown below when I ran map-red on
>>> huge
>>> amount of data.:
>>> java.lang.OutOfMemoryError: Java heap space
>>> at
>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
>>> at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(Sequence
>>> File.java:3002)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:28
>>> 02)
>>> at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124
>>> The above error comes almost at the end of map job. I have set the heap
>>> size
>>> to 1GB. Still the problem is persisting.  Can someone please help me how
>>> to
>>> avoid this error?
>> What is the typical size of your key? What is the value of io.sort.factor?
>> Hadoop version?
>>
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/OutOfMemory-Error-tp19531174p19545298.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: scp to namenode faster than dfs put?

2008-09-18 Thread Prasad Pingali
Even if the writes happen in parallel from a single machine, wouldn't network 
congestion cause a slowdown due to packet collisions?

- Prasad.

On Thursday 18 September 2008 10:47:48 pm Raghu Angadi wrote:
> Steve Loughran wrote:
> > [EMAIL PROTECTED] wrote:
> >> thanks for the replies. So looks like replication might be the real
> >> overhead when compared to scp.
> >
> > Makes sense, but there's no reason why you couldn't have first node you
> > copy up the data to, continue and pass that data to the other nodes.
>
> Replication can not account for 50% slow down. When the data is written,
> the writes on replicas are pipelined. So essentially data is written to
> replicas in parallel.
>
> Raghu.
>
> > If
> > its in the same rack, you save on backbone bandwidth, and if it is in a
> > different rack, well, the client operation still finishes faster. A
> > feature for someone to implement, perhaps?
> >
> >>> Also dfs put copies multiple replicas unlike scp.
> >>>
> >>> Lohit
> >>>
> >>> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]>
> >>> wrote:
> >>>
> >>> Actually, No.
> >>> As you said, I understand that "dfs -put" breaks the data into
> >>> blocks and then copies them to datanodes,
> >>> but scp does not break the data into blocks; it just copies the
> >>> data to
> >>> the namenode!
> >>>
> >>>
> >>> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
> >>>
> >>> Hello,
> >>>  I observe that scp of data to the namenode is faster than actually
> >>> putting
> >>> into dfs (all nodes coming from same switch and have same ethernet
> >>> cards,
> >>> homogenous nodes)? I understand that "dfs -put" breaks the data into
> >>> blocks
> >>> and then copies to datanodes, but shouldn't that be at least as fast as
> >>> copying data to namenode from a single machine, if not faster?
> >>>
> >>> thanks and regards,
> >>> Prasad Pingali,
> >>> IIIT Hyderabad.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Sorry for my english!!  明
> >>> Please help me to correct my english expression and error in syntax






Re: Hadoop tracing

2008-09-18 Thread Edward J. Yoon
I once tried to measure/report them on
http://wiki.apache.org/hadoop/DataProcessingBenchmarks, but I decided to
stop because I just couldn't find the time. If you or anyone else has
experience with Hadoop, please report it on that page. :)

/Edward

On Thu, Sep 18, 2008 at 7:25 PM, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am looking for information in the area of Hadoop tracing, instrumentation,
> benchmarking and so forth.
> What utilities exist ? What's their maturity? Where can I get more info
> about them ?
>
> I am curious about statistics on Hadoop behavior (per a typical workload ?
> different workloads ?). I am thinking on various metrics such as -
> Percentage of  time a Hadoop job spends on the various phases (map, sort &
> shuffle, reduce), on I/O, network, framework execution time, user code
> execution time ...
> Known bottlenecks ?
> And whatever else interesting statistics.
>
> Has anyone already measured ? Any documented statistics out there ?
>
> I already encountered various stuff like the X-trace based tracing tool from
> Berkeley, Hadoop metrics API, Hadoop instrumentation API (HADOOP-3772),
> Hadoop Vaidya (HADOOP-4179), gridmix benchmark.
>
> Does anyone have an input on any of those ?
> Anything else I missed ?
>
> Thanks for any direction,
> Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: OutOfMemory Error

2008-09-18 Thread Edward J. Yoon
> The key is of the form "ID :DenseVector Representation in mahout with

I guess the vector size is too large, so it will need a distributed vector
architecture (or 2D partitioning strategies) for large-scale matrix
operations. The Hama team is investigating these problem areas, so this should
improve if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti <[EMAIL PROTECTED]> wrote:
>
> Hadoop Version - 17.1
> io.sort.factor =10
> The key is of the form "ID :DenseVector Representation in mahout with
> dimensionality size = 160k"
> For example: C1:[,0.0011, 3.002, .. 1.001,]
> So, typical size of the key  of the mapper output can be 160K*6 (assuming
> double in string is represented in 5 bytes)+ 5 (bytes for C1:[])  + size
> required to store that the object is of type Text
>
> Thanks
> Pallavi
>
>
>
> Devaraj Das wrote:
>>
>>
>>
>>
>> On 9/17/08 6:06 PM, "Pallavi Palleti" <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> Hi all,
>>>
>>>I am getting outofmemory error as shown below when I ran map-red on
>>> huge
>>> amount of data.:
>>> java.lang.OutOfMemoryError: Java heap space
>>> at
>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
>>> at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(Sequence
>>> File.java:3002)
>>> at
>>> org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:28
>>> 02)
>>> at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124
>>> The above error comes almost at the end of map job. I have set the heap
>>> size
>>> to 1GB. Still the problem is persisting.  Can someone please help me how
>>> to
>>> avoid this error?
>> What is the typical size of your key? What is the value of io.sort.factor?
>> Hadoop version?
>>
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/OutOfMemory-Error-tp19531174p19545298.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


how to get the filenames stored in dfs as the key

2008-09-18 Thread komagal meenakshi
Hi everybody,

Can anyone please help me: how do I get the input filename in DFS as the key in
the output?

Example: [ filename, value ]
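
A minimal sketch of one way to do this with the plain Java API, assuming I
remember the property name right (the framework sets "map.input.file" to the
path of the file backing the current split):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilenameKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Text filename = new Text();

  public void configure(JobConf job) {
    // map.input.file holds the DFS path of the file this map task is reading
    filename.set(job.get("map.input.file"));
  }

  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(filename, value);   // emits [filename, value] pairs
  }
}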

   

Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
This time, I set the task timeout to 10 minutes via

  -jobconf mapred.task.timeout=60
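
(If I read the configuration docs right, mapred.task.timeout is specified in
milliseconds, so a 10-minute timeout would presumably be

  -jobconf mapred.task.timeout=600000

rather than 60.)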

However, I still see this "hang" at the shuffle stage, and lots of
messages like the ones below appear in the log:

2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Need 6 map output(s)
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Got 6 known map output location(s);
scheduling...
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Scheduled 0 of 6 known outputs (6
slow hosts and 0 dup hosts)

When fetching map output from one problematic node (which actually has a dead disk),
the HTTP daemon returns a 500 internal server error.

It seems to me that the reducer is stuck in an endless retry loop... I'm wondering
whether this behavior is fixed in 0.18.x, or whether there are configuration
parameters I should tune.

Thanks,
Rong-En Fan

On Fri, Sep 19, 2008 at 9:42 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Reply to myself. I'm using streaming and the task timeout was set to 0,
> so that's why.
>
> On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm using 0.17.2.1 and see a reduce hang in shuffle phase due
>> to a unresponsive node. From the reduce log (sorry that I didn't
>> keep it around), it stuck in copying map output from a dead
>> node (I can not ssh to that one). At that point, all maps are already
>> finished. I'm wondering why this slowness does not trigger a reduce
>> task fail and the corresponding map failed (even if it is finished) then
>> redo the map task on  another node so that the reduce can work.
>>
>> Thanks,
>> Rong-En Fan
>>
>


Data corruption when using Lzo Codec

2008-09-18 Thread Alex Feinberg
Hello,

I am running a custom crawler (written internally) using Hadoop streaming. I am
attempting to compress the output using LZO, but instead I am getting corrupted
output that is neither in the format I am aiming for nor a valid compressed LZO
file. Is this a known issue? Is there anything I am doing inherently wrong?

Here is the command line I am using:

 ~/hadoop/bin/hadoop jar
/home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar
-inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
-mapper /home/hadoop/crawl_map -reducer NONE -jobconf
mapred.output.compress=true -jobconf
mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
-input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0

The input is in the form of URLs stored as a SequenceFile.

When running this without LZO compression, no such issue occurs.

Is there any way for me to recover the corrupted data so that I can process it
with other Hadoop jobs or offline?
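
Not an answer to the corruption itself, but here is a rough sketch of how one
might at least probe whether a part file is readable as LZO (the path is made
up; this assumes the native LZO libraries are installed on the machine):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.LzoCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class LzoProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // point this at one of the part files under the job output directory
    Path part = new Path(args.length > 0 ? args[0] : "crawl.lzo/part-00000");
    CompressionCodec codec =
        (CompressionCodec) ReflectionUtils.newInstance(LzoCodec.class, conf);
    BufferedReader in = new BufferedReader(
        new InputStreamReader(codec.createInputStream(fs.open(part))));
    // If the file really is LZO-compressed text, the first lines decode cleanly;
    // a corrupt or mislabelled file typically fails here with an IOException.
    for (int i = 0; i < 5; i++) {
      String line = in.readLine();
      if (line == null) break;
      System.out.println(line);
    }
    in.close();
  }
}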

Thanks,

-- 
Alex Feinberg
Platform Engineer, SocialMedia Networks


Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Replying to myself: I'm using streaming, and the task timeout was set to 0,
so that's why.

On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using 0.17.2.1 and see a reduce hang in shuffle phase due
> to a unresponsive node. From the reduce log (sorry that I didn't
> keep it around), it stuck in copying map output from a dead
> node (I can not ssh to that one). At that point, all maps are already
> finished. I'm wondering why this slowness does not trigger a reduce
> task fail and the corresponding map failed (even if it is finished) then
> redo the map task on  another node so that the reduce can work.
>
> Thanks,
> Rong-En Fan
>


Re: custom writable class

2008-09-18 Thread Shengkai Zhu
Here is the link
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html

On Thu, Sep 18, 2008 at 9:16 PM, chanel <[EMAIL PROTECTED]> wrote:

> Where can you find the "Hadoop Map-Reduce Tutorial"?
>
>
> Shengkai Zhu wrote:
>
>> You can refer to the Hadoop Map-Reduce Tutorial
>>
>> On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]>
>> wrote:
>>
>>
>>
>>> Your custom implementation of any interface from hadoop-core should be
>>> archived together with the application (i.e. in the same jar).
>>> And the jar will be added to the CLASSPATH of the task runner, then your
>>> "customwritable.java" could be found.
>>>
>>>
>>> On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]
>>> >wrote:
>>>
>>>
>>>
 Hi,

 I am new to hadoop. For my map/reduce task I want to write my on custom
 writable class. Could anyone please let me know where exactly to place
 the
 customwritable.java file?

 I found that in {hadoop-home}
 /hadoop-{version}/src/java/org/apache/hadoop/io/ all  type of writable
 class
 files are there.

 Then  in the main task, we just include "import
 org.apache.hadoop.io.{X}Writable;" But this is not working for me.
 Basically
 at the time of compilation compiler doesn't find my customwritable class
 which i have placed in the mentioned folder.

 plz help me in this endevor.

 Thanks
 deepak



>>>
>>> --
>>>
>>> 朱盛凯
>>>
>>> Jash Zhu
>>>
>>> 复旦大学软件学院
>>>
>>> Software School, Fudan University
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
>


-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: streaming question

2008-09-18 Thread Karl Anderson


On 16-Sep-08, at 1:25 AM, Christian Ulrik Søttrup wrote:

OK, I've tried what you suggested and all sorts of combinations with no luck.
Then I went through the source of the streaming lib. It looks like it checks for
the existence of the combiner while it is building the JobConf, i.e. before the
job is sent to the nodes. It calls Class.forName() on the combiner in
goodClassOrNull() from StreamUtil.java, called from setJobconf() in StreamJob.java.

Does anybody have an idea how I can use a custom combiner? Would I have to
package it into the streaming jar?


That's what the streaming docs say you have to do: make your own streaming jar
with the classes included. I tried the cache and jar arguments myself once, and
Hadoop wasn't able to find them for the framework hooks, even when my streaming
executables themselves were able to find them.
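
For what it's worth, the repacking itself can be as simple as something like
this (the class name and paths are made up):

  # compile the combiner against the Hadoop core jar
  javac -classpath ~/hadoop/hadoop-0.17.2.1-core.jar -d classes MyCombiner.java

  # add the class to a private copy of the streaming jar
  cp ~/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar my-streaming.jar
  jar uf my-streaming.jar -C classes com/example/MyCombiner.class

and then run the job with my-streaming.jar and -combiner com.example.MyCombiner.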


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Example code for map-side join

2008-09-18 Thread Stuart Sierra
Hello all,
Does anyone have some working example code for doing a map-side
(inner) join?  The documentation at
http://tinyurl.com/43j5pp is less than enlightening...
Thanks,
-Stuart
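
Not a tested example, but from the javadoc the setup in the 0.17/0.18 API looks
roughly like this (the paths and key/value types are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinSetup.class);
    // Both inputs must already be sorted and identically partitioned on the join key.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/left"), new Path("/data/right")));
    // The mapper then receives the join key plus a TupleWritable holding one
    // value from each input, e.g. Mapper<Text, TupleWritable, Text, Text>.
    // ... set the mapper, output path, etc., and submit with JobClient.runJob(conf).
  }
}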


slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Hi,

I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
to an unresponsive node. From the reduce log (sorry that I didn't
keep it around), it was stuck copying map output from a dead
node (I cannot ssh to that one). At that point, all maps had already
finished. I'm wondering why this slowness does not trigger a reduce
task failure and mark the corresponding map as failed (even though it finished),
so that the map task is redone on another node and the reduce can proceed.

Thanks,
Rong-En Fan


[ANNOUNCE] Hadoop release 0.18.1 available

2008-09-18 Thread Nigel Daley

Release 0.18.1 fixes 9 critical bugs in 0.18.0.

For Hadoop release details and downloads, visit:
http://hadoop.apache.org/core/releases.html

Hadoop 0.18.1 Release Notes are at
http://hadoop.apache.org/core/docs/r0.18.1/releasenotes.html

Thanks to all who contributed to this release!

Nigel


Re: scp to namenode faster than dfs put?

2008-09-18 Thread Raghu Angadi

James Moore wrote:

Isn't one of the features of replication a guarantee that when my
write finishes, I know there are N replicas written?


This is what happens normally, but it is not a guarantee. When there are 
errors, data might be written to fewer replicas.


Raghu.

Seems like if you want the quicker behavior, you write with
replication set to 1 for that file, then change the replication count
when you're finished.





Re: scp to namenode faster than dfs put?

2008-09-18 Thread Raghu Angadi

Steve Loughran wrote:

[EMAIL PROTECTED] wrote:

thanks for the replies. So looks like replication might be the real
overhead when compared to scp.


Makes sense, but there's no reason why you couldn't have first node you 
copy up the data to, continue and pass that data to the other nodes. 


Replication cannot account for a 50% slowdown. When the data is written, 
the writes to the replicas are pipelined, so essentially the data is written to 
the replicas in parallel.


Raghu.

If 
its in the same rack, you save on backbone bandwidth, and if it is in a 
different rack, well, the client operation still finishes faster. A 
feature for someone to implement, perhaps?





Also dfs put copies multiple replicas unlike scp.

Lohit

On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:

Actually, no.
As you said, I understand that "dfs -put" breaks the data into blocks and
then copies them to datanodes, but scp does not break the data into blocks;
it just copies the data to the namenode!


2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:

Hello,
 I observe that scp of data to the namenode is faster than actually
putting
into dfs (all nodes coming from same switch and have same ethernet 
cards,

homogenous nodes)? I understand that "dfs -put" breaks the data into
blocks
and then copies to datanodes, but shouldn't that be at least as fast as
copying data to namenode from a single machine, if not faster?

thanks and regards,
Prasad Pingali,
IIIT Hyderabad.





--
Sorry for my english!!  明
Please help me to correct my english expression and error in syntax














Re: scp to namenode faster than dfs put?

2008-09-18 Thread James Moore
Isn't one of the features of replication a guarantee that when my
write finishes, I know there are N replicas written?

Seems like if you want the quicker behavior, you write with
replication set to 1 for that file, then change the replication count
when you're finished.
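
If I'm not mistaken, something along these lines is already possible from the
shell (the paths are made up):

  # write the file with a single replica
  bin/hadoop dfs -D dfs.replication=1 -put bigfile /user/me/bigfile

  # then raise the replication once the upload has finished
  bin/hadoop dfs -setrep 3 /user/me/bigfile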

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: [Zookeeper-user] [ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-18 Thread Patrick Hunt
This is great to see, congratulations! If you have a few minutes please 
update the ZK "poweredby" page:

http://wiki.apache.org/hadoop/ZooKeeper/PoweredBy

BTW, we're in the process of moving from SourceForge to Apache. Our next 
release, 3.0, slated for Oct 22, will be on Apache.


Regards,

Patrick

Stefan Groschupf wrote:
After 5 months of work we are happy to announce the first developer 
preview release of Katta.
This release contains all the functionality needed to serve a large, sharded 
Lucene index on many servers.
Katta stands on the shoulders of the giants Lucene, Hadoop and ZooKeeper.


Main features:
+ Plays well with Hadoop
+ Apache Version 2 License.
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Pluggable network topologies (shard distribution and selection policies)

+ Node load balancing at client



Please give katta a test drive and give us some feedback!

Download:
http://sourceforge.net/project/platformdownload.php?group_id=225750

website:
http://katta.sourceforge.net/

Getting started in less than 3 min:
http://katta.wiki.sourceforge.net/Getting+started

Installation on a grid:
http://katta.wiki.sourceforge.net/Installation

Katta presentation today (09/17/08) at hadoop user, yahoo mission  
college:

http://upcoming.yahoo.com/event/1075456/
* slides will be available online later


Many thanks for the hard work:
Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec)

I apologize for the cross-posting.


Yours, the Katta Team.

~~~
101tec Inc., Menlo Park, California
http://www.101tec.com







Re: custom writable class

2008-09-18 Thread chanel

Where can you find the "Hadoop Map-Reduce Tutorial"?

Shengkai Zhu wrote:

You can refer to the Hadoop Map-Reduce Tutorial

On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:

  

Your custom implementation of any interface from hadoop-core should be
archived together with the application (i.e. in the same jar).
And the jar will be added to the CLASSPATH of the task runner, then your
"customwritable.java" could be found.


On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]>wrote:



Hi,

I am new to hadoop. For my map/reduce task I want to write my on custom
writable class. Could anyone please let me know where exactly to place the
customwritable.java file?

I found that in {hadoop-home}
/hadoop-{version}/src/java/org/apache/hadoop/io/ all  type of writable
class
files are there.

Then  in the main task, we just include "import
org.apache.hadoop.io.{X}Writable;" But this is not working for me.
Basically
at the time of compilation compiler doesn't find my customwritable class
which i have placed in the mentioned folder.

plz help me in this endevor.

Thanks
deepak

  


--

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University






  


Re: custom writable class

2008-09-18 Thread Shengkai Zhu
You can refer to the Hadoop Map-Reduce Tutorial

On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:

>
> Your custom implementation of any interface from hadoop-core should be
> archived together with the application (i.e. in the same jar).
> And the jar will be added to the CLASSPATH of the task runner, then your
> "customwritable.java" could be found.
>
>
> On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]>wrote:
>
>> Hi,
>>
>> I am new to hadoop. For my map/reduce task I want to write my on custom
>> writable class. Could anyone please let me know where exactly to place the
>> customwritable.java file?
>>
>> I found that in {hadoop-home}
>> /hadoop-{version}/src/java/org/apache/hadoop/io/ all  type of writable
>> class
>> files are there.
>>
>> Then  in the main task, we just include "import
>> org.apache.hadoop.io.{X}Writable;" But this is not working for me.
>> Basically
>> at the time of compilation compiler doesn't find my customwritable class
>> which i have placed in the mentioned folder.
>>
>> plz help me in this endevor.
>>
>> Thanks
>> deepak
>>
>
>
>
> --
>
> 朱盛凯
>
> Jash Zhu
>
> 复旦大学软件学院
>
> Software School, Fudan University
>



-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: custom writable class

2008-09-18 Thread Shengkai Zhu
Your custom implementation of any interface from hadoop-core should be
archived together with the application (i.e. in the same jar).
The jar will then be added to the CLASSPATH of the task runner, and your
custom writable class will be found.

On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am new to hadoop. For my map/reduce task I want to write my on custom
> writable class. Could anyone please let me know where exactly to place the
> customwritable.java file?
>
> I found that in {hadoop-home}
> /hadoop-{version}/src/java/org/apache/hadoop/io/ all  type of writable
> class
> files are there.
>
> Then  in the main task, we just include "import
> org.apache.hadoop.io.{X}Writable;" But this is not working for me.
> Basically
> at the time of compilation compiler doesn't find my customwritable class
> which i have placed in the mentioned folder.
>
> plz help me in this endevor.
>
> Thanks
> deepak
>



-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


custom writable class

2008-09-18 Thread Deepak Diwakar
Hi,

I am new to Hadoop. For my map/reduce task I want to write my own custom
writable class. Could anyone please let me know where exactly to place the
customwritable.java file?

I found that all the writable class files are under
{hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/.

Then, in the main task, we just include "import
org.apache.hadoop.io.{X}Writable;". But this is not working for me: at compile
time the compiler doesn't find my custom writable class, which I have placed in
the folder mentioned above.

Please help me with this.

Thanks
deepak
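
For reference, a bare-bones custom Writable looks roughly like this (the names
are made up); it lives in your own package inside your job jar, not under the
Hadoop source tree:

package com.example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private double x;
  private double y;

  public PointWritable() {}                        // Hadoop needs a no-arg constructor

  public PointWritable(double x, double y) {
    this.x = x;
    this.y = y;
  }

  public void write(DataOutput out) throws IOException {
    out.writeDouble(x);
    out.writeDouble(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readDouble();
    y = in.readDouble();
  }
}

Compile it against the Hadoop core jar, include the class in your job jar, and
import com.example.PointWritable in the job code; nothing has to be copied into
org.apache.hadoop.io.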


Re: scp to namenode faster than dfs put?

2008-09-18 Thread Prasad Pingali
On Thursday 18 September 2008 04:12:13 pm Steve Loughran wrote:
> [EMAIL PROTECTED] wrote:
> > thanks for the replies. So looks like replication might be the real
> > overhead when compared to scp.
>
> Makes sense, but there's no reason why you couldn't have first node you
> copy up the data to, continue and pass that data to the other nodes. If
> its in the same rack, you save on backbone bandwidth, and if it is in a
> different rack, well, the client operation still finishes faster. A
> feature for someone to implement, perhaps?

Yeah, I was also wondering what the implications of such a feature would be in 
terms of failures/block corruption at the first node. If that is a non-issue, 
this seems like something that could improve performance.

- Prasad.

>
> >> Also dfs put copies multiple replicas unlike scp.
> >>
> >> Lohit
> >>
> >> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:
> >>
> >> Actually, no.
> >> As you said, I understand that "dfs -put" breaks the data into blocks and
> >> then copies them to datanodes,
> >> but scp does not break the data into blocks; it just copies the data
> >> to the namenode!
> >>
> >>
> >> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
> >>
> >> Hello,
> >>  I observe that scp of data to the namenode is faster than actually
> >> putting
> >> into dfs (all nodes coming from same switch and have same ethernet
> >> cards, homogenous nodes)? I understand that "dfs -put" breaks the data
> >> into blocks
> >> and then copies to datanodes, but shouldn't that be at least as fast as
> >> copying data to namenode from a single machine, if not faster?
> >>
> >> thanks and regards,
> >> Prasad Pingali,
> >> IIIT Hyderabad.
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Sorry for my english!!  明
> >> Please help me to correct my english expression and error in syntax






Re: scp to namenode faster than dfs put?

2008-09-18 Thread Steve Loughran

[EMAIL PROTECTED] wrote:

thanks for the replies. So looks like replication might be the real
overhead when compared to scp.


Makes sense, but there's no reason why the first node you copy the data up to 
couldn't then pass that data on to the other nodes. If it's in the same rack, 
you save on backbone bandwidth, and if it is in a different rack, well, the 
client operation still finishes faster. A feature for someone to implement, 
perhaps?





Also dfs put copies multiple replicas unlike scp.

Lohit

On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:

Actually, no.
As you said, I understand that "dfs -put" breaks the data into blocks and
then copies them to datanodes,
but scp does not break the data into blocks; it just copies the data to
the namenode!


2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:

Hello,
 I observe that scp of data to the namenode is faster than actually
putting
into dfs (all nodes coming from same switch and have same ethernet cards,
homogenous nodes)? I understand that "dfs -put" breaks the data into
blocks
and then copies to datanodes, but shouldn't that be at least as fast as
copying data to namenode from a single machine, if not faster?

thanks and regards,
Prasad Pingali,
IIIT Hyderabad.





--
Sorry for my english!!  明
Please help me to correct my english expression and error in syntax










--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Hadoop tracing

2008-09-18 Thread Naama Kraus
Hi,

I am looking for information in the area of Hadoop tracing, instrumentation,
benchmarking and so forth.
What utilities exist? How mature are they? Where can I get more info
about them?

I am curious about statistics on Hadoop behavior (per typical workload?
different workloads?). I am thinking of various metrics such as:
the percentage of time a Hadoop job spends in the various phases (map, sort &
shuffle, reduce), on I/O, on the network, on framework execution, on user code
execution ...
Known bottlenecks?
And whatever other interesting statistics.

Has anyone already measured these? Are there any documented statistics out there?

I have already come across various things like the X-Trace-based tracing tool from
Berkeley, the Hadoop metrics API, the Hadoop instrumentation API (HADOOP-3772),
Hadoop Vaidya (HADOOP-4179), and the gridmix benchmark.

Does anyone have input on any of those?
Anything else I missed?

Thanks for any direction,
Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Re: Trouble with SequenceFileOutputFormat.getReaders

2008-09-18 Thread Barry Haddow
Hi Chris

I would guess that the IOException is because getReaders() is trying to treat 
_logs as a file, when it's actually a directory. I also see race conditions 
in getReaders() since it lists the files then tries to iterate through them, 
and they can disappear in between. You probably need to delete the _logs 
directory before you pass the output directory to the second map.

The _logs directory is also created by Hadoop 0.18.0.
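
A rough sketch of that workaround (untested; on 0.17 the delete call may need to
be the single-argument fs.delete(logs)):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CleanGetReaders {
  // Remove the _logs subdirectory, then open the part files as before.
  public static SequenceFile.Reader[] open(Configuration conf, Path outDir)
      throws IOException {
    FileSystem fs = outDir.getFileSystem(conf);
    Path logs = new Path(outDir, "_logs");
    if (fs.exists(logs)) {
      fs.delete(logs, true);          // recursive delete of the job history logs
    }
    return SequenceFileOutputFormat.getReaders(conf, outDir);
  }
}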

cheers
Barry

On Thursday 18 September 2008 05:49:30 Chris Dyer wrote:
> Hi all-
> I am having trouble with SequenceFileOutputFormat.getReaders on a
> hadoop 17.2 cluster.  I am trying to open a set of SequenceFiles that
> was created in one map process that has completed from within a second
> map process by passing in the job configuration for the running map
> process (not of the map process that created the set of sequence
> files) and the path to the output.  When I run locally, this works
> fine, but when I run remotely on the cluster (using HDFS on the
> cluster), I get the following IOException:
>
> java.io.IOException: Cannot open filename /user/redpony/Model1.data.0/_logs
>
> However, the following works:
>
> hadoop dfs -ls /user/redpony/Model1.data.0/_logs
> Found 1 items
> /user/redpony/Model1.data.0/_logs/history2008-09-18
> 00:43 rwxrwxrwx   redpony supergroup
>
> This is probably something dumb, and quite likely related to me not
> having my settings configured properly, but I'm completely at a loss
> for how to proceed.  Any ideas?
>
> Thanks!
> Chris