Re: Num map task?

2009-04-23 Thread jason hadoop
Unless the argument (args[0]) to your job is a comma-separated set of paths,
you are only adding a single input path. It may be that you want to pass all
of args rather than just args[0].
 FileInputFormat.setInputPaths(c, args[0]);
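
As a rough illustration, here is a minimal sketch of registering several input
paths with the old org.apache.hadoop.mapred API; the helper class and argument
layout are made up for the example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathSetup {
  // Adds every element of paths as its own input path. Alternatively, a single
  // comma-separated string can be handed to setInputPaths(conf, "dirA,dirB").
  public static void addAll(JobConf conf, String[] paths) {
    for (String p : paths) {
      FileInputFormat.addInputPath(conf, new Path(p));
    }
  }
}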

On Thu, Apr 23, 2009 at 7:10 PM, nguyenhuynh.mr wrote:

> Edward J. Yoon wrote:
>
> > As far as I know, FileInputFormat.getSplits() will returns the number
> > of splits automatically computed by the number of files, blocks. BTW,
> > What version of Hadoop/Hbase?
> >
> > I tried to test that code
> > (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop
> > 0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks
> > were 274.
> >
> > Below is my changed code for v0.19.0.
> > ---
> >   public JobConf createSubmittableJob(String[] args) {
> > JobConf c = new JobConf(getConf(), TestImport.class);
> > c.setJobName(NAME);
> > FileInputFormat.setInputPaths(c, args[0]);
> >
> > c.set("input.table", args[1]);
> > c.setMapperClass(InnerMap.class);
> > c.setNumReduceTasks(0);
> > c.setOutputFormat(NullOutputFormat.class);
> > return c;
> >   }
> >
> >
> >
> > On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr
> >  wrote:
> >
> >> Edward J. Yoon wrote:
> >>
> >>
> >>> How do you to add input paths?
> >>>
> >>> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr
> >>>  wrote:
> >>>
> >>>
>  Edward J. Yoon wrote:
> 
> 
> 
> > Hi,
> >
> > In that case, The atomic unit of split is a file. So, you need to
> > increase the number of files. or Use the TextInputFormat as below.
> >
> > jobConf.setInputFormat(TextInputFormat.class);
> >
> > On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr
> >  wrote:
> >
> >
> >
> >> Hi all!
> >>
> >>
> >> I have a MR job use to import contents into HBase.
> >>
> >> The content is text file in HDFS. I used the maps file to store
> local
> >> path of contents.
> >>
> >> Each content has the map file. ( the map is a text file in HDFS and
> >> contain 1 line info).
> >>
> >>
> >> I created the maps directory used to contain map files. And the
>  this
> >> maps directory used to input path for job.
> >>
> >> When i run job, the number map task is same number map files.
> >> Ex: I have 5 maps file -> 5 map tasks.
> >>
> >> Therefor, the map phase is slowly :(
> >>
> >> Why the map phase is slowly if the number map task large and the
> number
> >> map task is equal number of files?.
> >>
> >> * p/s: Run jobs with: 3 node: 1 server and 2 slaver
> >>
> >> Please help me!
> >> Thanks.
> >>
> >> Best,
> >> Nguyen.
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>  Current, I use TextInputformat to set InputFormat for map phase.
> 
> 
> 
> >>>
> >>> Thanks for your help!
> >>>
> >> I use FileInputFormat to add input paths.
> >> Some thing like:
> >>FileInputFormat.setInputPath(new Path("dir"));
> >>
> >> The "dir" is a directory contains input files.
> >>
> >> Best,
> >> Nguyen
> >>
> >>
> >>
> >>
> Thanks!
>
> I am using Hadoop version 0.18.2
>
> Cheer,
> Nguyen.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: When will hadoop 0.19.2 be released?

2009-04-23 Thread Zhou, Yunqing
But there are already 100 TB of data stored on DFS.
Is there a safe way to do such a downgrade?

On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop  wrote:
> You could try the cloudera release based on 18.3, with many backported
> features.
> http://www.cloudera.com/distribution
>
> On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing  wrote:
>
>> currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
>> and I found 0.19.1 is buggy and I have already applied some patches on
>> hadoop jira to solve problems.
>> But I'm looking forward to a more stable release of hadoop.
>> Do you know when will 0.19.2 be released?
>>
>> Thanks.
>>
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>


Re: When will hadoop 0.19.2 be released?

2009-04-23 Thread jason hadoop
You could try the cloudera release based on 18.3, with many backported
features.
http://www.cloudera.com/distribution

On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing  wrote:

> currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
> and I found 0.19.1 is buggy and I have already applied some patches on
> hadoop jira to solve problems.
> But I'm looking forward to a more stable release of hadoop.
> Do you know when will 0.19.2 be released?
>
> Thanks.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


When will hadoop 0.19.2 be released?

2009-04-23 Thread Zhou, Yunqing
Currently I'm managing a 64-node Hadoop 0.19.1 cluster with 100 TB of data,
and I have found 0.19.1 to be buggy; I have already applied some patches from
the Hadoop JIRA to solve problems.
But I'm looking forward to a more stable release of Hadoop.
Do you know when 0.19.2 will be released?

Thanks.


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-23 Thread tim robertson
If anyone is interested, I did finally get round to processing it all,
and due to the sparsity of the data we have, for all 23 zoom levels and
all species we have information on, the result was 807 million PNGs,
which is $8,000 to PUT to S3 - too much for me to pay.

So, like most things, I will probably go for a compromise: pre-process
10 zoom levels into S3, which will only come in at $457 (only
the PUT into S3), and then render the rest on the fly.  Only people
browsing beyond zoom 10 are then hitting the real-time rendering
servers, so I think this will work out OK performance-wise.

Cheers,

Tim


On Thu, Apr 23, 2009 at 5:45 PM, Stuart Sierra
 wrote:
> On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock  wrote:
>> 1 billion * ($0.01 / 1000) = 10,000
>
> Oh yeah, I was thinking $0.01 for a single PUT.  Silly me.
>
> -S
>


Re: Num map task?

2009-04-23 Thread nguyenhuynh.mr
Edward J. Yoon wrote:

> As far as I know, FileInputFormat.getSplits() will returns the number
> of splits automatically computed by the number of files, blocks. BTW,
> What version of Hadoop/Hbase?
>
> I tried to test that code
> (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop
> 0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks
> were 274.
>
> Below is my changed code for v0.19.0.
> ---
>   public JobConf createSubmittableJob(String[] args) {
> JobConf c = new JobConf(getConf(), TestImport.class);
> c.setJobName(NAME);
> FileInputFormat.setInputPaths(c, args[0]);
>
> c.set("input.table", args[1]);
> c.setMapperClass(InnerMap.class);
> c.setNumReduceTasks(0);
> c.setOutputFormat(NullOutputFormat.class);
> return c;
>   }
>
>
>
> On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr
>  wrote:
>   
>> Edward J. Yoon wrote:
>>
>> 
>>> How do you to add input paths?
>>>
>>> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr
>>>  wrote:
>>>
>>>   
 Edward J. Yoon wrote:


 
> Hi,
>
> In that case, The atomic unit of split is a file. So, you need to
> increase the number of files. or Use the TextInputFormat as below.
>
> jobConf.setInputFormat(TextInputFormat.class);
>
> On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr
>  wrote:
>
>
>   
>> Hi all!
>>
>>
>> I have a MR job use to import contents into HBase.
>>
>> The content is text file in HDFS. I used the maps file to store local
>> path of contents.
>>
>> Each content has the map file. ( the map is a text file in HDFS and
>> contain 1 line info).
>>
>>
>> I created the maps directory used to contain map files. And the  this
>> maps directory used to input path for job.
>>
>> When i run job, the number map task is same number map files.
>> Ex: I have 5 maps file -> 5 map tasks.
>>
>> Therefor, the map phase is slowly :(
>>
>> Why the map phase is slowly if the number map task large and the number
>> map task is equal number of files?.
>>
>> * p/s: Run jobs with: 3 node: 1 server and 2 slaver
>>
>> Please help me!
>> Thanks.
>>
>> Best,
>> Nguyen.
>>
>>
>>
>>
>>
>> 
>
>   
 Current, I use TextInputformat to set InputFormat for map phase.


 
>>>
>>> Thanks for your help!
>>>   
>> I use FileInputFormat to add input paths.
>> Some thing like:
>>FileInputFormat.setInputPath(new Path("dir"));
>>
>> The "dir" is a directory contains input files.
>>
>> Best,
>> Nguyen
>>
>>
>>
>> 
Thanks!

I am using Hadoop version 0.18.2

Cheer,
Nguyen.


Re: Num map task?

2009-04-23 Thread Edward J. Yoon
As far as I know, FileInputFormat.getSplits() returns the number of splits
automatically computed from the number of files and blocks. BTW, what
version of Hadoop/HBase?

I tried to test that code
(http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop
0.19.1 and HBase 0.19.0). The number of input paths was 2, and there were
274 map tasks.

Below is my changed code for v0.19.0.
---
  public JobConf createSubmittableJob(String[] args) {
JobConf c = new JobConf(getConf(), TestImport.class);
c.setJobName(NAME);
FileInputFormat.setInputPaths(c, args[0]);

c.set("input.table", args[1]);
c.setMapperClass(InnerMap.class);
c.setNumReduceTasks(0);
c.setOutputFormat(NullOutputFormat.class);
return c;
  }



On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr
 wrote:
> Edward J. Yoon wrote:
>
>> How do you to add input paths?
>>
>> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr
>>  wrote:
>>
>>> Edward J. Yoon wrote:
>>>
>>>
 Hi,

 In that case, The atomic unit of split is a file. So, you need to
 increase the number of files. or Use the TextInputFormat as below.

 jobConf.setInputFormat(TextInputFormat.class);

 On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr
  wrote:


> Hi all!
>
>
> I have a MR job use to import contents into HBase.
>
> The content is text file in HDFS. I used the maps file to store local
> path of contents.
>
> Each content has the map file. ( the map is a text file in HDFS and
> contain 1 line info).
>
>
> I created the maps directory used to contain map files. And the  this
> maps directory used to input path for job.
>
> When i run job, the number map task is same number map files.
> Ex: I have 5 maps file -> 5 map tasks.
>
> Therefor, the map phase is slowly :(
>
> Why the map phase is slowly if the number map task large and the number
> map task is equal number of files?.
>
> * p/s: Run jobs with: 3 node: 1 server and 2 slaver
>
> Please help me!
> Thanks.
>
> Best,
> Nguyen.
>
>
>
>
>



>>> Current, I use TextInputformat to set InputFormat for map phase.
>>>
>>>
>>
>>
>>
>> Thanks for your help!
> I use FileInputFormat to add input paths.
> Some thing like:
>    FileInputFormat.setInputPath(new Path("dir"));
>
> The "dir" is a directory contains input files.
>
> Best,
> Nguyen
>
>
>



-- 
Best Regards, Edward J. Yoon
edwardy...@apache.org
http://blog.udanax.org


Re: Datanode Setup

2009-04-23 Thread jpe30

Right now I'm just trying to get one node running.  Once it's running I'll
copy it over.



jason hadoop wrote:
> 
> Have you copied the updated hadoop-site.xml file to the conf directory on
> all of your slave nodes?
> 
> 
> On Thu, Apr 23, 2009 at 2:10 PM, jpe30  wrote:
> 
>>
>> Ok, I've done all of this.  Set up my hosts file in Linux, setup my
>> master
>> and slaves file in Hadoop and setup my hadoop-site.xml.  It still does
>> not
>> work.  The datanode still gives me this error...
>>
>> STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
>>
>> ...which makes me think its not reading the hadoop-site.xml file at all.
>> I've checked the permissions and the user has full permissions to all
>> files
>> within the Hadoop directory.  Any suggestions?
>>
>>
>>
>> Mithila Nagendra wrote:
>> >
>> > You should have conf/slaves file on the master node set to master,
>> node01,
>> > node02. so on and the masters file on master set to master. Also in
>> > the
>> > /etc/hosts file get rid of 'node6' in the line 127.0.0.1
>> > localhost.localdomain   localhost node6 on all your nodes. Ensure that
>> the
>> > /etc/hosts file contain the same information on all nodes. Also
>> > hadoop-site.xml files on all nodes should have master:portno for hdfs
>> and
>> > tasktracker.
>> > Once you do this restart hadoop.
>> >
>> > On Fri, Apr 17, 2009 at 10:04 AM, jpe30  wrote:
>> >
>> >>
>> >>
>> >>
>> >> Mithila Nagendra wrote:
>> >> >
>> >> > You have to make sure that you can ssh between the nodes. Also check
>> >> the
>> >> > file hosts in /etc folder. Both the master and the slave much have
>> each
>> >> > others machines defined in it. Refer to my previous mail
>> >> > Mithila
>> >> >
>> >> >
>> >>
>> >>
>> >> I have SSH setup correctly and here is the /etc/hosts file on node6 of
>> >> the
>> >> datanodes.
>> >>
>> >> #  
>> >> 127.0.0.1   localhost.localdomain   localhost node6
>> >> 192.168.1.10master
>> >> 192.168.1.1 node1
>> >> 192.168.1.2 node2
>> >> 192.168.1.3 node3
>> >> 192.168.1.4 node4
>> >> 192.168.1.5 node5
>> >> 192.168.1.6 node6
>> >>
>> >> I have the slaves file on each machine set as node1 to node6, and each
>> >> masters file set to master except for the master itself.  Still, I
>> keep
>> >> getting that same error in the datanodes...
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html
>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Datanode-Setup-tp23064660p23208349.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: The mechanism of choosing target datanodes

2009-04-23 Thread jason hadoop
I haven't checked the code for any special cases of replication = 1.
The block-write sequence is as follows (a minimal client-side sketch follows
the list):

   1. Get a list of datanodes from the namenode for the block replicas; the
   requesting host is the first datanode returned if the requesting host is a
   datanode.
   2. Send the block, along with the list of datanodes that should receive it,
   to the first datanode in the list.
   3. That datanode sends the block to the next one in the list.
   4. Step 3 repeats until the block is fully replicated.
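
As a rough client-side sketch (the path, buffer size and block size below are
illustrative only), the replication factor for a single file can be pinned
through the standard org.apache.hadoop.fs.FileSystem API; with replication = 1
the namenode hands back a single datanode, so steps 3 and 4 above never happen.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleReplicaWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path out = new Path("/tmp/one-replica.dat");  // illustrative path
    short replication = 1;                        // a single copy, no forwarding
    long blockSize = 64L * 1024 * 1024;           // 64 MB, the usual default

    // With one replica the write pipeline contains only the first datanode.
    FSDataOutputStream stream = fs.create(out, true, 4096, replication, blockSize);
    stream.writeBytes("hello");
    stream.close();
  }
}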



On Thu, Apr 23, 2009 at 2:08 PM, Jerome Banks  wrote:

> FYI, The pipe v2 results were created with
> com.quantcast.armor.jobs.pipev3.util.CountVG , inputing the results from
> com.quantcast.armor.jobs.pipev3.util.MyHarvestV2 (the mainline pipev2
> harvest).
>   The pipe v3 results were a one day run of BloomDaily for 04/12/2009.
>  The CSV files were generated with TopNFlow.
>
>
> On 4/23/09 1:56 PM, "Amr Awadallah"  wrote:
>
> yes, it will be split across many nodes, and if possible each block will
> get a different datanode.
>
> see following link for more details:
>
>
> http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization
>
> -- amr
>
> Alex Loddengaard wrote:
> > I believe the blocks will be distributed across data nodes and not local
> to
> > only one data node.  If this wasn't the case, then running a MR job on
> the
> > file would only be local to one task tracker.
> >
> > Alex
> >
> > On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao  wrote:
> >
> >
> >> If a cluster has many datanodes and I want to copy a large file into
> DFS.
> >> If the replication number is set to 1, does the namenode will put the
> file
> >> data on one datanode or several nodes? I wonder if the file will be
> split
> >> into blocks then different unique blocks are on different datanodes.
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >>
> >
> >
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Datanode Setup

2009-04-23 Thread jason hadoop
Have you copied the updated hadoop-site.xml file to the conf directory on
all of your slave nodes?


On Thu, Apr 23, 2009 at 2:10 PM, jpe30  wrote:

>
> Ok, I've done all of this.  Set up my hosts file in Linux, setup my master
> and slaves file in Hadoop and setup my hadoop-site.xml.  It still does not
> work.  The datanode still gives me this error...
>
> STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
>
> ...which makes me think its not reading the hadoop-site.xml file at all.
> I've checked the permissions and the user has full permissions to all files
> within the Hadoop directory.  Any suggestions?
>
>
>
> Mithila Nagendra wrote:
> >
> > You should have conf/slaves file on the master node set to master,
> node01,
> > node02. so on and the masters file on master set to master. Also in
> > the
> > /etc/hosts file get rid of 'node6' in the line 127.0.0.1
> > localhost.localdomain   localhost node6 on all your nodes. Ensure that
> the
> > /etc/hosts file contain the same information on all nodes. Also
> > hadoop-site.xml files on all nodes should have master:portno for hdfs and
> > tasktracker.
> > Once you do this restart hadoop.
> >
> > On Fri, Apr 17, 2009 at 10:04 AM, jpe30  wrote:
> >
> >>
> >>
> >>
> >> Mithila Nagendra wrote:
> >> >
> >> > You have to make sure that you can ssh between the nodes. Also check
> >> the
> >> > file hosts in /etc folder. Both the master and the slave much have
> each
> >> > others machines defined in it. Refer to my previous mail
> >> > Mithila
> >> >
> >> >
> >>
> >>
> >> I have SSH setup correctly and here is the /etc/hosts file on node6 of
> >> the
> >> datanodes.
> >>
> >> #  
> >> 127.0.0.1   localhost.localdomain   localhost node6
> >> 192.168.1.10master
> >> 192.168.1.1 node1
> >> 192.168.1.2 node2
> >> 192.168.1.3 node3
> >> 192.168.1.4 node4
> >> 192.168.1.5 node5
> >> 192.168.1.6 node6
> >>
> >> I have the slaves file on each machine set as node1 to node6, and each
> >> masters file set to master except for the master itself.  Still, I keep
> >> getting that same error in the datanodes...
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


core-user@hadoop.apache.org

2009-04-23 Thread He Yongqiang
///
Sorry for cross posting.
//
Hi all,
Hadoop in China Salon is a free discussion forum on Hadoop-related
technologies and ideas.
Five months ago, in Nov 2008, we successfully concluded the first Hadoop
salon in Beijing. More than sixty people attended that salon.
Hao Zheng (Yahoo! Inc), Zheng Shao (Facebook Inc) and Wang ShouYan (Baidu
Inc) from industry gave us very impressive talks on their recent progress
on Hadoop, and about thirty attendees (students/professors) from
universities and institutes also impressed us deeply. Thank you to all the
attendees again. Without you, it would never have succeeded.
In early May this year, we are going to host the second Hadoop in China
Salon. It is our great honor to invite you again to take part in the salon.
The meeting is scheduled for May 9 (the first weekend after the Labor Day
Holiday).  This time we are trying to hold it at IHEP (the Institute of High
Energy Physics, Chinese Academy of Sciences). The good part is that we can
visit the Beijing Electron Positron Collider (BEPC, the biggest EPC in
China). The bad part is that it may be a little far from Zhong GuanCun, as
it is on Yu Quan Road.
   We now welcome speakers for this salon with our greatest sincerity. Please
share with us the latest progress in your work or your team's work on
Hadoop. If you are interested in giving a talk, please drop me an email. We
also expect more Hadoop users and developers to join us this time. Please
feel free to come to the Hadoop discussion forum, and please drop me an
email if you would like to come, so we can prepare more free drinks and food.
  If you have any thoughts on this meeting, please let us know by dropping
me an email.
  BTW, we have collected several fantastic talks so far, including:
1) One by Zheng Shao on the recent progress on Hive.
2) Two or more talks organized by Yahoo! China R&D (Thanks Yahoo! India R&D,
Yahoo! China R&D and Hao Zheng ). One talk is research on machine learning.
The other talk is not finally settled; it will be about the Hadoop roadmap,
enhancements, the new scheduler, or Pig.
3) One talk from Baidu Inc. The talk will cover the hadoop scheduler used in
Baidu, data security and computing security.
4) Two talks from two teams in ICT (Institute of Computing Technology,
Chinese Academy of Science). One is about our research effort on data
organization and its effects. The other is about Hadoop On GIS.
Thanks to all the speakers.
Many thanks to Cheng Yaodong from ihep(ihep.ac.cn) for providing the venue
and infrastructure.

I will send more detailed schedule in next week.

Statement: This event is nonprofit.


Re: sub-optimal multiple disk usage in 0.18.3?

2009-04-23 Thread jason hadoop
In theory the block allocation strategy is round robin among the set of
storage locations that meet the minimum free-space requirements.

On Thu, Apr 23, 2009 at 12:55 PM, Bhupesh Bansal wrote:

> What configuration are you using for the disks ??
>
> Best configuration is just doing a JBOD.
>
> http://www.nabble.com/RAID-vs.-JBOD-td21404366.html
>
> Best
> Bhupesh
>
>
>
> On 4/23/09 12:54 PM, "Mike Andrews"  wrote:
>
> > i have a bunch of datanodes with several disks each, and i noticed
> > that sometimes dfs blocks don't get evenly distributed among them. for
> > instance, one of my machines has 5 disks with 500 gb each, and 1 disk
> > with 2 TB (6 total disks). the 5 smaller disks are each 98% full,
> > whereas the larger one is only 12% full. it seems as though dfs should
> > do better by putting more of the blocks on the larger disk first. and
> > mapreduce jobs are failing on this machine with error
> > "java.io.IOException: No space left on device".
> >
> > any thoughts or suggestions? thanks in advance.
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Writing a New Aggregate Function

2009-04-23 Thread jason hadoop
It really isn't documented anywhere. There is a small section about it in
ch08 of my book. It didn't make the alpha of ch08 that is currently up, though.
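
For what it's worth, here is a hedged sketch of what such an aggregator might
look like against the org.apache.hadoop.mapred.lib.aggregate.ValueAggregator
interface as I remember it for 0.18/0.19 (verify the method signatures against
your version). Wiring it into a streaming job additionally needs a
ValueAggregatorDescriptor registered with the job configuration, which is not
shown here.

import java.util.ArrayList;
import org.apache.hadoop.mapred.lib.aggregate.ValueAggregator;

// Keeps the first few values seen for a key; a real sampler might use
// reservoir sampling instead. The class name and cap are illustrative.
public class SampleValues implements ValueAggregator {

  private static final int MAX_SAMPLES = 100;
  private final ArrayList samples = new ArrayList();

  // Called once per value routed to this aggregator's key.
  public void addNextValue(Object val) {
    if (samples.size() < MAX_SAMPLES) {
      samples.add(val.toString());
    }
  }

  // Final reducer-side output for the key.
  public String getReport() {
    StringBuilder sb = new StringBuilder();
    for (Object s : samples) {
      if (sb.length() > 0) {
        sb.append('\t');
      }
      sb.append(s);
    }
    return sb.toString();
  }

  public void reset() {
    samples.clear();
  }

  // Intermediate values handed back as combiner output; the reducer-side
  // aggregator sees them again through addNextValue().
  public ArrayList getCombinerOutput() {
    return new ArrayList(samples);
  }
}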

On Thu, Apr 23, 2009 at 1:44 PM, Dan Milstein  wrote:

> Hello all,
>
> I've been using streaming + the aggregate package (available via -reducer
> aggregate), and have been very happy with what it gives me.
>
> I'm interested in writing my own new aggregate functions (in Java) which I
> could then access from my streaming code.
>
> Can anyone give me pointers towards how to make that happen?  I've read
> through the aggregate package source, but I'm not seeing how to define my
> own, and get access to it from streaming.
>
> To be specific, here's the sort of thing I'd like to be able to do:
>
>  - In Java, define a SampleValues aggregator, which chooses a sample of the
> input given to it
>
>  - From my streaming program, in say python, output:
>
> SampleValues:some_key \t some_value
>
>  - Have the aggregate framework somehow call my new aggregator for the
> combiner and reducer steps
>
> Thanks,
> -Dan Milstein
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-23 Thread Stuart Sierra
On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock  wrote:
> 1 billion * ($0.01 / 1000) = 10,000

Oh yeah, I was thinking $0.01 for a single PUT.  Silly me.

-S


Re: Datanode Setup

2009-04-23 Thread jpe30

Ok, I've done all of this.  I set up my hosts file in Linux, set up my master
and slaves files in Hadoop and set up my hadoop-site.xml.  It still does not
work.  The datanode still gives me this error...

STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost

...which makes me think it's not reading the hadoop-site.xml file at all.
I've checked the permissions and the user has full permissions to all files
within the Hadoop directory.  Any suggestions?



Mithila Nagendra wrote:
> 
> You should have conf/slaves file on the master node set to master, node01,
> node02. so on and the masters file on master set to master. Also in
> the
> /etc/hosts file get rid of 'node6' in the line 127.0.0.1
> localhost.localdomain   localhost node6 on all your nodes. Ensure that the
> /etc/hosts file contain the same information on all nodes. Also
> hadoop-site.xml files on all nodes should have master:portno for hdfs and
> tasktracker.
> Once you do this restart hadoop.
> 
> On Fri, Apr 17, 2009 at 10:04 AM, jpe30  wrote:
> 
>>
>>
>>
>> Mithila Nagendra wrote:
>> >
>> > You have to make sure that you can ssh between the nodes. Also check
>> the
>> > file hosts in /etc folder. Both the master and the slave much have each
>> > others machines defined in it. Refer to my previous mail
>> > Mithila
>> >
>> >
>>
>>
>> I have SSH setup correctly and here is the /etc/hosts file on node6 of
>> the
>> datanodes.
>>
>> #  
>> 127.0.0.1   localhost.localdomain   localhost node6
>> 192.168.1.10master
>> 192.168.1.1 node1
>> 192.168.1.2 node2
>> 192.168.1.3 node3
>> 192.168.1.4 node4
>> 192.168.1.5 node5
>> 192.168.1.6 node6
>>
>> I have the slaves file on each machine set as node1 to node6, and each
>> masters file set to master except for the master itself.  Still, I keep
>> getting that same error in the datanodes...
>> --
>> View this message in context:
>> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: The mechanism of choosing target datanodes

2009-04-23 Thread Jerome Banks
FYI, The pipe v2 results were created with 
com.quantcast.armor.jobs.pipev3.util.CountVG , inputing the results from 
com.quantcast.armor.jobs.pipev3.util.MyHarvestV2 (the mainline pipev2 harvest).
   The pipe v3 results were a one day run of BloomDaily for 04/12/2009.
  The CSV files were generated with TopNFlow.


On 4/23/09 1:56 PM, "Amr Awadallah"  wrote:

yes, it will be split across many nodes, and if possible each block will
get a different datanode.

see following link for more details:

http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization

-- amr

Alex Loddengaard wrote:
> I believe the blocks will be distributed across data nodes and not local to
> only one data node.  If this wasn't the case, then running a MR job on the
> file would only be local to one task tracker.
>
> Alex
>
> On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao  wrote:
>
>
>> If a cluster has many datanodes and I want to copy a large file into DFS.
>> If the replication number is set to 1, does the namenode will put the file
>> data on one datanode or several nodes? I wonder if the file will be split
>> into blocks then different unique blocks are on different datanodes.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>>
>
>



Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-23 Thread Andrew Hitchcock
How do you figure? Puts are one penny per thousand, so I think it'd
only cost $10,000. Here's the math I'm using:

1 billion * ($0.01 / 1000) = 10,000
Math courtesy of Google:
http://www.google.com/search?q=1+billion+*+(0.01+%2F+1000)

Still expensive, but not unreasonably so.

Andrew

On Thu, Apr 23, 2009 at 7:08 AM, Stuart Sierra
 wrote:
> On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson  wrote:
>> However, do the math on the costs for S3. We were doing something similar,
>> and found that we were spending a fortune on our put requests at $0.01 per
>> 1000, and next to nothing on storage.
>
> I made a similar discovery.  The cost of PUT adds up fast.  One
> billion PUTs will cost you $10 million!
>
> -Stuart Sierra
>


Re: The mechanism of choosing target datanodes

2009-04-23 Thread Amr Awadallah
yes, it will be split across many nodes, and if possible each block will 
get a different datanode.


see following link for more details:

http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization

-- amr

Alex Loddengaard wrote:

I believe the blocks will be distributed across data nodes and not local to
only one data node.  If this wasn't the case, then running a MR job on the
file would only be local to one task tracker.

Alex

On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao  wrote:

  

If a cluster has many datanodes and I want to copy a large file into DFS.
If the replication number is set to 1, does the namenode will put the file
data on one datanode or several nodes? I wonder if the file will be split
into blocks then different unique blocks are on different datanodes.

--
View this message in context:
http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





  


Writing a New Aggregate Function

2009-04-23 Thread Dan Milstein

Hello all,

I've been using streaming + the aggregate package (available via -reducer
aggregate), and have been very happy with what it gives me.


I'm interested in writing my own new aggregate functions (in Java)  
which I could then access from my streaming code.


Can anyone give me pointers towards how to make that happen?  I've  
read through the aggregate package source, but I'm not seeing how to  
define my own, and get access to it from streaming.


To be specific, here's the sort of thing I'd like to be able to do:

 - In Java, define a SampleValues aggregator, which chooses a sample  
of the input given to it


 - From my streaming program, in say python, output:

SampleValues:some_key \t some_value

 - Have the aggregate framework somehow call my new aggregator for  
the combiner and reducer steps


Thanks,
-Dan Milstein


Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-23 Thread Todd Lipcon
On Thu, Apr 23, 2009 at 12:00 PM, Koji Noguchi wrote:

> Owen,
>
> > Is it just the patches that have already been applied
> > to the 18 branch? Or are there more?
> >
> Former. Just the patches that have already been applied to 0.18 branch.
> I especially want HADOOP-5465 in for the 'stable' release.
> (This patch is also missing in 0.19.1)
>

Hey Koji,

FYI, HADOOP-5465 is one of the patches we're bundling in the Cloudera distro
for Hadoop, based on 0.18.3: http://cloudera.com/hadoop

Lacking an 0.18.4 release, you might want to take a look.

-Todd


Re: Hadoop and Matlab

2009-04-23 Thread nitesh bhatia
Hi
The simplest way for you to run Matlab would be to use the distributed toolkit
provided with Matlab. You just need to configure Matlab to discover the other
Matlab machines; that way you will not need to set up a Hadoop cluster.
However, if you want to use Hadoop as a backend framework for distributed
processing, I would suggest you go for Octave, an open-source toolkit much
like Matlab. It provides C/C++ interfaces. I think it would be easier to
configure with Hadoop than Matlab, which is not open source and is licensed.

--nitesh


On Wed, Apr 22, 2009 at 7:10 AM, Edward J. Yoon wrote:

> Hi,
> Where to store the images? How to retrieval the images?
>
> If you have a metadata for the images, the map task can receives a
> 'filename' of image as a key, and file properies (host, file path,
> ..,etc) as its value. Then, I guess you can handle the matlab process
> using runtime object on hadoop cluster.
>
> On Wed, Apr 22, 2009 at 9:30 AM, Sameer Tilak 
> wrote:
> > Hi Edward,
> > Yes, we're building this for handling hundreds of thousands images (at
> > least). We're thinking processing of individual images (or a set of
> images
> > together) will be done in Matlab itself. However, we can use Hadoop
> > framework to process the data in parallel fashion. One Matlab instance
> > handling few hundred images (as a mapper) and have hundreds of such
> > instances and then combine (reducer) the o/p of each instance.
> >
> > On Tue, Apr 21, 2009 at 5:06 PM, Edward J. Yoon  >wrote:
> >
> >> Hi, What is the input data?
> >>
> >> According to my understanding, you have a lot of images and want to
> >> process all images using your matlab script. Then, You should write
> >> some code yourself. I did similar thing for plotting graph with
> >> gnuplot. However, If you want to do large-scale linear algebra
> >> operations for large image processing, I would recommend investigating
> >> other solutions. Hadoop is not a general purpose clustering software,
> >> and it cannot run matlab.
> >>
> >> On Wed, Apr 22, 2009 at 2:55 AM, Sameer Tilak 
> >> wrote:
> >> > Hi there,
> >> >
> >> > We're working on an image analysis project. The image processing code
> is
> >> > written in Matlab. If I invoke that code from a shell script and then
> use
> >> > that shell script within Hadoop streaming, will that work? Has anyone
> >> done
> >> > something along these lines?
> >> >
> >> > Many thaks,
> >> > --ST.
> >> >
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> edwardy...@apache.org
> >> http://blog.udanax.org
> >>
> >
>
>
>
> --
> Best Regards, Edward J. Yoon
> edwardy...@apache.org
> http://blog.udanax.org
>



-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: sub-optimal multiple disk usage in 0.18.3?

2009-04-23 Thread Bhupesh Bansal
What configuration are you using for the disks ??

Best configuration is just doing a JBOD.

http://www.nabble.com/RAID-vs.-JBOD-td21404366.html

Best
Bhupesh



On 4/23/09 12:54 PM, "Mike Andrews"  wrote:

> i have a bunch of datanodes with several disks each, and i noticed
> that sometimes dfs blocks don't get evenly distributed among them. for
> instance, one of my machines has 5 disks with 500 gb each, and 1 disk
> with 2 TB (6 total disks). the 5 smaller disks are each 98% full,
> whereas the larger one is only 12% full. it seems as though dfs should
> do better by putting more of the blocks on the larger disk first. and
> mapreduce jobs are failing on this machine with error
> "java.io.IOException: No space left on device".
> 
> any thoughts or suggestions? thanks in advance.



sub-optimal multiple disk usage in 0.18.3?

2009-04-23 Thread Mike Andrews
I have a bunch of datanodes with several disks each, and I noticed
that sometimes DFS blocks don't get evenly distributed among them. For
instance, one of my machines has 5 disks with 500 GB each, and 1 disk
with 2 TB (6 total disks). The 5 smaller disks are each 98% full,
whereas the larger one is only 12% full. It seems as though DFS should
do better by putting more of the blocks on the larger disk first, and
MapReduce jobs are failing on this machine with the error
"java.io.IOException: No space left on device".

Any thoughts or suggestions? Thanks in advance.

-- 
permanent contact information at http://mikerandrews.com


5th Apache Hadoop Get Together @ Berlin

2009-04-23 Thread Isabel Drost
I would like to announce the fifth Apache Hadoop Get Together @ Berlin. It is 
scheduled to take place at:

newthinking store
Tucholskystr. 48
Berlin Mitte

on Thursday, 25th of June 2009, at 5:00 pm.

As always there will be slots of 20min each for talks. After each talk there 
will be time for discussion.

You can order drinks directly at the bar in the newthinking store. After the 
official part we will go to one of the restaurants close by - exactly which 
one will be announced at the beginning of the event.

Talks scheduled so far:

Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
Abstract: "MapReduce is great for processing great data sets. A distributed 
file system can be used to store huge amounts of data. But what if your data 
format needs to adapt to new requirements? This talk will cover a simple 
introduction to Thrift and Protocol Buffers and sprinkle in some rants and 
approaches to manage your big data sets."

We would like to invite you, the visitor to also tell your Hadoop story, if 
you like, you can bring slides - there will be a beamer. Talks on related 
projects (HBase, CouchDB, Cassandra, Hive, Pig, Lucene, Solr, nutch, katta, 
UIMA, Mahout, ...) are of course welcome as well.

A big Thank You goes to the newthinking store for providing a room in the 
center of Berlin for free.
  
Website: http://upcoming.yahoo.com/event/2488959/?ps=5 (Please keep an eye on 
the upcoming site in case the starting time needs to be shifted.)

Isabel


RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-23 Thread Koji Noguchi
Owen, 

> Is it just the patches that have already been applied 
> to the 18 branch? Or are there more?
>
Former. Just the patches that have already been applied to 0.18 branch.
I especially want HADOOP-5465 in for the 'stable' release.
(This patch is also missing in 0.19.1)

Koji


-Original Message-
From: Owen O'Malley [mailto:omal...@apache.org] 
Sent: Thursday, April 23, 2009 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: core-user Digest 23 Apr 2009 02:09:48 - Issue 887


On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote:

> Nigel,
>
> When you have time, could you release 0.18.4 that contains some of the
> patches that make our clusters 'stable'?

Is it just the patches that have already been applied to the 18  
branch? Or are there more?

-- Owen


Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-23 Thread Owen O'Malley


On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote:


Nigel,

When you have time, could you release 0.18.4 that contains some of the
patches that make our clusters 'stable'?


Is it just the patches that have already been applied to the 18  
branch? Or are there more?


-- Owen


Re: The mechanism of choosing target datanodes

2009-04-23 Thread Alex Loddengaard
I believe the blocks will be distributed across data nodes and not local to
only one data node.  If this wasn't the case, then running a MR job on the
file would only be local to one task tracker.

Alex

On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao  wrote:

>
> If a cluster has many datanodes and I want to copy a large file into DFS.
> If the replication number is set to 1, does the namenode will put the file
> data on one datanode or several nodes? I wonder if the file will be split
> into blocks then different unique blocks are on different datanodes.
>
> --
> View this message in context:
> http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Just to clarify one point - iptables was running on the 2nd DataNode, which
I didn't check, as I was sure the problem was on the NameNode/DataNode, and
also on the NameNode/DataNode itself.  But I can't understand what launched
them, or when, as I checked multiple times and nothing was running before.
Moreover, they were disabled on start-up, so they shouldn't have come up in
the first place.

Regards.

2009/4/23 Stas Oskin 

> Hi.
>
>
>> Also iptables -L for each machine as an afterthought - just for paranoia's
>> sake
>>
>
> Well, I started preparing all the information you requested, but when I got
> to this stage - I found out there were INDEED iptables running on 2 servers
> from 3.
>
> The strangest thing is that I don't recall enabling them at all. Perhaps
> some 3rd party software have enabled them?
>
> In any case, all seems to be working now.
>
> Thanks for everybody that helped - I will be sure to check iptables on all
> the cluster machines from now on :).
>
> Regards.
>


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Hi.


> Also iptables -L for each machine as an afterthought - just for paranoia's
> sake
>

Well, I started preparing all the information you requested, but when I got
to this stage I found out there were INDEED iptables rules running on 2 of
the 3 servers.

The strangest thing is that I don't recall enabling them at all. Perhaps
some 3rd-party software enabled them?

In any case, all seems to be working now.

Thanks to everybody who helped - I will be sure to check iptables on all
the cluster machines from now on :).

Regards.


Re: Are SequenceFiles split? If so, how?

2009-04-23 Thread Barnet Wagman

Aaron Kimball wrote:

Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. 
^ I'm not concerned about the variation in processing time of objects;
there isn't enough variation to worry about. I'm primarily concerned
with having enough map tasks to utilize all nodes (and cores).

In general, I would just dodge the problem by making sure your splits
relatively small compared to the size of your input data. 
^ This sounds like the right solution.  I'll still need to extend 
SequenceFileInputFormat, but it should be relatively simple to put a 
fixed number of objects into each split.
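
A rough sketch of that subclass, for the old org.apache.hadoop.mapred API: it
only raises the split-count hint handed to FileInputFormat.getSplits() (the
actual splits are still aligned to SequenceFile sync points and bounded by any
minimum split size), and the property name below is made up for the example.

import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SmallSplitSequenceFileInputFormat extends SequenceFileInputFormat {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // "my.desired.splits" is a made-up property name for this sketch.
    int desired = job.getInt("my.desired.splits", numSplits);
    return super.getSplits(job, Math.max(desired, numSplits));
  }
}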


thanks


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-23 Thread Stuart Sierra
On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson  wrote:
> However, do the math on the costs for S3. We were doing something similar,
> and found that we were spending a fortune on our put requests at $0.01 per
> 1000, and next to nothing on storage.

I made a similar discovery.  The cost of PUT adds up fast.  One
billion PUTs will cost you $10 million!

-Stuart Sierra


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread jason hadoop
Can you give us your network topology?
I see at least 3 IP addresses:
192.168.253.20, 192.168.253.32 and 192.168.253.21

In particular, please provide the fs.default.name (which you have given), the
hadoop-site.xml for each machine,
the slaves file (with IP address mappings if needed), a netstat -a -n -t
-p | grep java (hopefully you run Linux),
and the output of jps for each machine.

That should let us see which servers are binding to which ports on which
machines, and what your cluster thinks should be happening.

Also iptables -L for each machine as an afterthought - just for paranoia's
sake

On Thu, Apr 23, 2009 at 2:45 AM, Stas Oskin  wrote:

> Hi.
>
> Maybe, but there will still be at least one virtual network adapter on the
> > host. Try turning them off.
>
>
> Nope, still throws "No route to host" exceptions.
>
> I have another IP address defined on this machine - 192.168.253.21, for the
> same network adapter.
>
> Any idea if it has impact?
>
>
> >
> >
> >> The fs.default.name is:
> >> hdfs://192.168.253.20:8020
> >>
> >
> > what happens if you switch to hostnames over IP addresses?
>
>
> Actually, I never tried this, but point is that the HDFS worked just fine
> with this before.
>
> Regards.
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Are SequenceFiles split? If so, how?

2009-04-23 Thread Shevek
On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote:
> Explicitly controlling your splits will be very challenging. Taking the case
> where you have expensive (X) and cheap (C) objects to process, you may have
> a file where the records are lined up X C X C X C X X X X X C C C. In this
> case, you'll need to scan through the whole file and build splits such that
> the lengthy run of expensive objects is broken up into separate splits, but
> the run of cheap objects is consolidated. I'm suspicious that you can do
> this without scanning through the data (which is what often constitutes the
> bulk of a time in a mapreduce program).

I would also like the ability to stream the data and shuffle it into
buckets; when any bucket achieves a fixed cost (currently assessed as
byte size), it would be shipped as a task.

In practice, in the Hadoop architecture, this causes an extra level of
I/O, since all the data must be read into the shuffler and re-sorted.
Also, it breaks the ability to run map tasks on systems hosting the
data. However, it is a subject about which I am doing some thinking.

> But how much data are you using? I would imagine that if you're operating at
> the scale where Hadoop makes sense, then the high- and low-cost objects will
> -- on average -- balance out and tasks will be roughly evenly proportioned.

True, dat.

But it's still worth thinking about stream splitting, since the
theoretical complexity overhead is an increased constant on a linear
term.

Will get more into architecture first.

S.



Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Hi.

Maybe, but there will still be at least one virtual network adapter on the
> host. Try turning them off.


Nope, still throws "No route to host" exceptions.

I have another IP address defined on this machine - 192.168.253.21, for the
same network adapter.

Any idea if it has impact?


>
>
>> The fs.default.name is:
>> hdfs://192.168.253.20:8020
>>
>
> what happens if you switch to hostnames over IP addresses?


Actually, I never tried this, but the point is that HDFS worked just fine
with this before.

Regards.


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Steve Loughran

Stas Oskin wrote:

Hi.

2009/4/23 Matt Massie 


Just for clarity: are you using any type of virtualization (e.g. vmware,
xen) or just running the DataNode java process on the same machine?

What is "fs.default.name" set to in your hadoop-site.xml?




 This machine has OpenVZ installed indeed, but all the applications run
withing the host node, meaning all Java processes are running withing same
machine.


Maybe, but there will still be at least one virtual network adapter on 
the host. Try turning them off.




The fs.default.name is:
hdfs://192.168.253.20:8020


what happens if you switch to hostnames over IP addresses?


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Hi.

I have one question, is the ip address consistent, I think in one of the
> thread mails, it was stated that the ip address sometimes changes.
>

Same static IP's for all servers.

By the way, I have fs.default.name defined as an IP address - could it be
somehow related?

I read that there were some issues with this, but it ran fine for me - that
is, until the power crash.

Regards.


Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Hi.

Shouldn't you be testing connecting _from_ the datanode? The error you
> posted is while this DN is trying connect to another DN.



You might be into something here indeed:

1) Telnet to 192.168.253.20 8020 / 192.168.253.20 50010 works
2) Telnet to localhost 8020 / localhost 50010 doesn't work
3) Telnet to 127.0.0.1 8020 / 127.0.0.1 50010 doesn't work

In the 2 last cases I get:
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host: Connection refused

Could it be related?

Regards.


Re: Num map task?

2009-04-23 Thread nguyenhuynh.mr
Edward J. Yoon wrote:

> How do you to add input paths?
>
> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr
>  wrote:
>   
>> Edward J. Yoon wrote:
>>
>> 
>>> Hi,
>>>
>>> In that case, The atomic unit of split is a file. So, you need to
>>> increase the number of files. or Use the TextInputFormat as below.
>>>
>>> jobConf.setInputFormat(TextInputFormat.class);
>>>
>>> On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr
>>>  wrote:
>>>
>>>   
 Hi all!


 I have a MR job use to import contents into HBase.

 The content is text file in HDFS. I used the maps file to store local
 path of contents.

 Each content has the map file. ( the map is a text file in HDFS and
 contain 1 line info).


 I created the maps directory used to contain map files. And the  this
 maps directory used to input path for job.

 When i run job, the number map task is same number map files.
 Ex: I have 5 maps file -> 5 map tasks.

 Therefor, the map phase is slowly :(

 Why the map phase is slowly if the number map task large and the number
 map task is equal number of files?.

 * p/s: Run jobs with: 3 node: 1 server and 2 slaver

 Please help me!
 Thanks.

 Best,
 Nguyen.




 
>>>
>>>
>>>   
>> Current, I use TextInputformat to set InputFormat for map phase.
>>
>> 
>
>
>
> Thanks for your help!
I use FileInputFormat to add input paths.
Something like:
FileInputFormat.setInputPath(new Path("dir"));

The "dir" is a directory that contains the input files.

Best,
Nguyen




Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Stas Oskin
Hi.

2009/4/23 Matt Massie 

> Just for clarity: are you using any type of virtualization (e.g. vmware,
> xen) or just running the DataNode java process on the same machine?
>
> What is "fs.default.name" set to in your hadoop-site.xml?
>


 This machine has OpenVZ installed indeed, but all the applications run
within the host node, meaning all Java processes are running within the
same machine.

The fs.default.name is:
hdfs://192.168.253.20:8020

Thanks.


The mechanism of choosing target datanodes

2009-04-23 Thread Xie, Tao

If a cluster has many datanodes and I want to copy a large file into DFS
with the replication number set to 1, will the namenode put the file's data
on one datanode or on several nodes? I wonder whether the file will be split
into blocks, with the different blocks placed on different datanodes.

-- 
View this message in context: 
http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Are SequenceFiles split? If so, how?

2009-04-23 Thread Aaron Kimball
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. I'm suspicious that you can do
this without scanning through the data (which is what often constitutes the
bulk of the time in a mapreduce program).

But how much data are you using? I would imagine that if you're operating at
the scale where Hadoop makes sense, then the high- and low-cost objects will
-- on average -- balance out and tasks will be roughly evenly proportioned.

In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data. If you have 5
million objects to process, then make each split be roughly equal to say
20,000 of them. Then even if some splits take long to process and others
take a short time, then one CPU may dispatch with a dozen cheap splits in
the same time where one unlucky JVM had to process a single very expensive
split. Now you haven't had to manually balance anything, and you still get
to keep all your CPUs full.

- Aaron


On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman wrote:

> Thanks Aaron, that really helps.  I probably do need to control the number
> of splits.  My input 'data' consists of  Java objects and their size (in
> bytes) doesn't necessarily reflect the amount of time needed for each map
> operation.   I need to ensure that I have enough map tasks so that all cpus
> are utilized and the job gets done in a reasonable amount of time.
>  (Currently I'm creating multiple input files and making them unsplitable,
> but subclassing SequenceFileInputFormat to explicitly control then number of
> splits sounds like a better approach).
>
> Barnet
>
>
> Aaron Kimball wrote:
>
>> Yes, there can be more than one InputSplit per SequenceFile. The file will
>> be split more-or-less along 64 MB boundaries. (the actual "edges" of the
>> splits will be adjusted to hit the next block of key-value pairs, so it
>> might be a few kilobytes off.)
>>
>> The SequenceFileInputFormat regards mapred.map.tasks
>> (conf.setNumMapTasks())
>> as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
>> is always 100% user-controlled.) If you need exact control over the number
>> of map tasks, you'll need to subclass it and modify this behavior. That
>> having been said -- are you sure you actually need to precisely control
>> this
>> value? Or is it enough to know how many splits were created?
>>
>> - Aaron
>>
>> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman 
>> wrote:
>>
>>
>>
>>> Suppose a SequenceFile (containing keys and values that are
>>> BytesWritable)
>>> is used as input. Will it be divided into InputSplits?  If so, what's the
>>> criteria use for splitting?
>>>
>>> I'm interested in this because I need to control the number of map tasks
>>> used, which (if I understand it correctly), is equal to the number of
>>> InputSplits.
>>>
>>> thanks,
>>>
>>> bw
>>>
>>>
>>>
>>
>>
>>
>
>


Re: which is better Text or Custom Class

2009-04-23 Thread Aaron Kimball
In general, serializing to text and then parsing back into a different
format will always be slower than using a purpose-built class that can
serialize itself. The tradeoff, of course, is that going to text is often
more convenient from a developer-time perspective.

- Aaron
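
To make that concrete, here is a small hedged sketch (the class and field names
are made up) of a purpose-built Writable that carries a list of document ids
directly, instead of concatenating and re-splitting a space-separated Text
value:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

public class DocIdListWritable implements Writable {

  private final List<String> docIds = new ArrayList<String>();

  public List<String> getDocIds() {
    return docIds;
  }

  public void add(String docId) {
    docIds.add(docId);
  }

  // Binary serialization: a count followed by each id; no string parsing needed.
  public void write(DataOutput out) throws IOException {
    out.writeInt(docIds.size());
    for (String id : docIds) {
      out.writeUTF(id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    docIds.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      docIds.add(in.readUTF());
    }
  }
}

The same object can then be used as a map output value class, so later steps
read the list back without any Concat/Split work.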

On Mon, Apr 20, 2009 at 2:23 PM, chintan bhatt  wrote:

>
> Hi all,
> I want to ask you about the performance difference between using the Text
> class and using a custom Class which implements  Writable interface.
>
> Lets say in InvertedIndex problem when I emit token and a list of document
> Ids which contains it  , using Text we usually Concat the list of document
> ids with space as a separator  "d1 d2 d3 d4" etc..If I need the same values
> in a later step of map reduce, I need to split the value string to get the
> list of all document Ids. Is it not better to use Writable List instead??
>
> I need to ask it because I am using too many Concats and Splits in my
> project to use documents total tokens count, token frequency in a particular
> document etc..
>
>
> Thanks in advance,
> Chintan
>
>