Re: Num map task?
Unless the argument (args[0]) to your job is a comma separated set of paths, you are only adding a single input path. It may be you want to pass args and not args[0]. FileInputFormat.setInputPaths(c, args[0]); On Thu, Apr 23, 2009 at 7:10 PM, nguyenhuynh.mr wrote: > Edward J. Yoon wrote: > > > As far as I know, FileInputFormat.getSplits() will returns the number > > of splits automatically computed by the number of files, blocks. BTW, > > What version of Hadoop/Hbase? > > > > I tried to test that code > > (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop > > 0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks > > were 274. > > > > Below is my changed code for v0.19.0. > > --- > > public JobConf createSubmittableJob(String[] args) { > > JobConf c = new JobConf(getConf(), TestImport.class); > > c.setJobName(NAME); > > FileInputFormat.setInputPaths(c, args[0]); > > > > c.set("input.table", args[1]); > > c.setMapperClass(InnerMap.class); > > c.setNumReduceTasks(0); > > c.setOutputFormat(NullOutputFormat.class); > > return c; > > } > > > > > > > > On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr > > wrote: > > > >> Edward J. Yoon wrote: > >> > >> > >>> How do you to add input paths? > >>> > >>> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr > >>> wrote: > >>> > >>> > Edward J. Yoon wrote: > > > > > Hi, > > > > In that case, The atomic unit of split is a file. So, you need to > > increase the number of files. or Use the TextInputFormat as below. > > > > jobConf.setInputFormat(TextInputFormat.class); > > > > On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr > > wrote: > > > > > > > >> Hi all! > >> > >> > >> I have a MR job use to import contents into HBase. > >> > >> The content is text file in HDFS. I used the maps file to store > local > >> path of contents. > >> > >> Each content has the map file. ( the map is a text file in HDFS and > >> contain 1 line info). > >> > >> > >> I created the maps directory used to contain map files. And the > this > >> maps directory used to input path for job. > >> > >> When i run job, the number map task is same number map files. > >> Ex: I have 5 maps file -> 5 map tasks. > >> > >> Therefor, the map phase is slowly :( > >> > >> Why the map phase is slowly if the number map task large and the > number > >> map task is equal number of files?. > >> > >> * p/s: Run jobs with: 3 node: 1 server and 2 slaver > >> > >> Please help me! > >> Thanks. > >> > >> Best, > >> Nguyen. > >> > >> > >> > >> > >> > >> > > > > > Current, I use TextInputformat to set InputFormat for map phase. > > > > >>> > >>> Thanks for your help! > >>> > >> I use FileInputFormat to add input paths. > >> Some thing like: > >>FileInputFormat.setInputPath(new Path("dir")); > >> > >> The "dir" is a directory contains input files. > >> > >> Best, > >> Nguyen > >> > >> > >> > >> > Thanks! > > I am using Hadoop version 0.18.2 > > Cheer, > Nguyen. > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
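For readers following along, a minimal sketch of the point being made, using the 0.19-style static helpers on org.apache.hadoop.mapred.FileInputFormat. The class name and argument layout below are illustrative, not the poster's actual job:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// setInputPaths(conf, String) treats its argument as a comma-separated list,
// so passing only args[0] adds a single directory. To add every path the
// caller supplied, either pass a comma-separated string or loop over args.
public class InputPathSketch {
  static void addInputs(JobConf conf, String[] args) {
    // Option 1: one comma-separated argument, e.g. "/data/a,/data/b"
    // FileInputFormat.setInputPaths(conf, args[0]);

    // Option 2: every argument except the last (assumed here to be the table name)
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(conf, new Path(args[i]));
    }
    conf.set("input.table", args[args.length - 1]);
  }
}
```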
Re: When will hadoop 0.19.2 be released?
But there are already 100TB data stored on DFS. Is there a safe solution to do such a downgrade? On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop wrote: > You could try the cloudera release based on 18.3, with many backported > features. > http://www.cloudera.com/distribution > > On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing wrote: > >> currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data. >> and I found 0.19.1 is buggy and I have already applied some patches on >> hadoop jira to solve problems. >> But I'm looking forward to a more stable release of hadoop. >> Do you know when will 0.19.2 be released? >> >> Thanks. >> > > > > -- > Alpha Chapters of my book on Hadoop are available > http://www.apress.com/book/view/9781430219422 >
Re: When will hadoop 0.19.2 be released?
You could try the cloudera release based on 18.3, with many backported features. http://www.cloudera.com/distribution On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing wrote: > currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data. > and I found 0.19.1 is buggy and I have already applied some patches on > hadoop jira to solve problems. > But I'm looking forward to a more stable release of hadoop. > Do you know when will 0.19.2 be released? > > Thanks. > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
When will hadoop 0.19.2 be released?
Currently I'm managing a 64-node Hadoop 0.19.1 cluster with 100TB of data, and I have found 0.19.1 to be buggy; I have already applied some patches from the Hadoop JIRA to solve problems. But I'm looking forward to a more stable release of Hadoop. Do you know when 0.19.2 will be released? Thanks.
Re: Generating many small PNGs to Amazon S3 with MapReduce
If anyone is interested, I did finally get round to processing it all. Due to the sparsity of the data we have, for all 23 zoom levels and all species we have information on, the result was 807 million PNGs, which is $8,000 to PUT to S3 - too much for me to pay. So, like most things, I will probably go for a compromise: pre-process 10 zoom levels into S3, which will only come in at $457 (just the PUT into S3), and then render the rest on the fly. Only people browsing beyond zoom 10 are then hitting the real-time rendering servers, so I think this will work out OK performance-wise. Cheers, Tim On Thu, Apr 23, 2009 at 5:45 PM, Stuart Sierra wrote: > On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock wrote: >> 1 billion * ($0.01 / 1000) = 10,000 > > Oh yeah, I was thinking $0.01 for a single PUT. Silly me. > > -S >
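For reference, the rough arithmetic behind those figures at the $0.01-per-1,000-PUT price discussed later in this thread (ignoring storage and other request types; the second value is simply backed out from the quoted $457):

```latex
807{,}000{,}000 \times \frac{\$0.01}{1{,}000} \approx \$8{,}070
\qquad\text{and}\qquad
\$457 \div \frac{\$0.01}{1{,}000} \approx 45{,}700{,}000 \text{ PUTs}
```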
Re: Num map task?
Edward J. Yoon wrote: > As far as I know, FileInputFormat.getSplits() will returns the number > of splits automatically computed by the number of files, blocks. BTW, > What version of Hadoop/Hbase? > > I tried to test that code > (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop > 0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks > were 274. > > Below is my changed code for v0.19.0. > --- > public JobConf createSubmittableJob(String[] args) { > JobConf c = new JobConf(getConf(), TestImport.class); > c.setJobName(NAME); > FileInputFormat.setInputPaths(c, args[0]); > > c.set("input.table", args[1]); > c.setMapperClass(InnerMap.class); > c.setNumReduceTasks(0); > c.setOutputFormat(NullOutputFormat.class); > return c; > } > > > > On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr > wrote: > >> Edward J. Yoon wrote: >> >> >>> How do you to add input paths? >>> >>> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr >>> wrote: >>> >>> Edward J. Yoon wrote: > Hi, > > In that case, The atomic unit of split is a file. So, you need to > increase the number of files. or Use the TextInputFormat as below. > > jobConf.setInputFormat(TextInputFormat.class); > > On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr > wrote: > > > >> Hi all! >> >> >> I have a MR job use to import contents into HBase. >> >> The content is text file in HDFS. I used the maps file to store local >> path of contents. >> >> Each content has the map file. ( the map is a text file in HDFS and >> contain 1 line info). >> >> >> I created the maps directory used to contain map files. And the this >> maps directory used to input path for job. >> >> When i run job, the number map task is same number map files. >> Ex: I have 5 maps file -> 5 map tasks. >> >> Therefor, the map phase is slowly :( >> >> Why the map phase is slowly if the number map task large and the number >> map task is equal number of files?. >> >> * p/s: Run jobs with: 3 node: 1 server and 2 slaver >> >> Please help me! >> Thanks. >> >> Best, >> Nguyen. >> >> >> >> >> >> > > Current, I use TextInputformat to set InputFormat for map phase. >>> >>> Thanks for your help! >>> >> I use FileInputFormat to add input paths. >> Some thing like: >>FileInputFormat.setInputPath(new Path("dir")); >> >> The "dir" is a directory contains input files. >> >> Best, >> Nguyen >> >> >> >> Thanks! I am using Hadoop version 0.18.2 Cheer, Nguyen.
Re: Num map task?
As far as I know, FileInputFormat.getSplits() will returns the number of splits automatically computed by the number of files, blocks. BTW, What version of Hadoop/Hbase? I tried to test that code (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop 0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks were 274. Below is my changed code for v0.19.0. --- public JobConf createSubmittableJob(String[] args) { JobConf c = new JobConf(getConf(), TestImport.class); c.setJobName(NAME); FileInputFormat.setInputPaths(c, args[0]); c.set("input.table", args[1]); c.setMapperClass(InnerMap.class); c.setNumReduceTasks(0); c.setOutputFormat(NullOutputFormat.class); return c; } On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr wrote: > Edward J. Yoon wrote: > >> How do you to add input paths? >> >> On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr >> wrote: >> >>> Edward J. Yoon wrote: >>> >>> Hi, In that case, The atomic unit of split is a file. So, you need to increase the number of files. or Use the TextInputFormat as below. jobConf.setInputFormat(TextInputFormat.class); On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr wrote: > Hi all! > > > I have a MR job use to import contents into HBase. > > The content is text file in HDFS. I used the maps file to store local > path of contents. > > Each content has the map file. ( the map is a text file in HDFS and > contain 1 line info). > > > I created the maps directory used to contain map files. And the this > maps directory used to input path for job. > > When i run job, the number map task is same number map files. > Ex: I have 5 maps file -> 5 map tasks. > > Therefor, the map phase is slowly :( > > Why the map phase is slowly if the number map task large and the number > map task is equal number of files?. > > * p/s: Run jobs with: 3 node: 1 server and 2 slaver > > Please help me! > Thanks. > > Best, > Nguyen. > > > > > >>> Current, I use TextInputformat to set InputFormat for map phase. >>> >>> >> >> >> >> Thanks for your help! > I use FileInputFormat to add input paths. > Some thing like: > FileInputFormat.setInputPath(new Path("dir")); > > The "dir" is a directory contains input files. > > Best, > Nguyen > > > -- Best Regards, Edward J. Yoon edwardy...@apache.org http://blog.udanax.org
Re: Datanode Setup
Right now I'm just trying to get one node running. Once its running I'll copy it over. jason hadoop wrote: > > Have you copied the updated hadoop-site.xml file to the conf directory on > all of your slave nodes? > > > On Thu, Apr 23, 2009 at 2:10 PM, jpe30 wrote: > >> >> Ok, I've done all of this. Set up my hosts file in Linux, setup my >> master >> and slaves file in Hadoop and setup my hadoop-site.xml. It still does >> not >> work. The datanode still gives me this error... >> >> STARTUP_MSG: host = java.net.UnknownHostException: myhost: myhost >> >> ...which makes me think its not reading the hadoop-site.xml file at all. >> I've checked the permissions and the user has full permissions to all >> files >> within the Hadoop directory. Any suggestions? >> >> >> >> Mithila Nagendra wrote: >> > >> > You should have conf/slaves file on the master node set to master, >> node01, >> > node02. so on and the masters file on master set to master. Also in >> > the >> > /etc/hosts file get rid of 'node6' in the line 127.0.0.1 >> > localhost.localdomain localhost node6 on all your nodes. Ensure that >> the >> > /etc/hosts file contain the same information on all nodes. Also >> > hadoop-site.xml files on all nodes should have master:portno for hdfs >> and >> > tasktracker. >> > Once you do this restart hadoop. >> > >> > On Fri, Apr 17, 2009 at 10:04 AM, jpe30 wrote: >> > >> >> >> >> >> >> >> >> Mithila Nagendra wrote: >> >> > >> >> > You have to make sure that you can ssh between the nodes. Also check >> >> the >> >> > file hosts in /etc folder. Both the master and the slave much have >> each >> >> > others machines defined in it. Refer to my previous mail >> >> > Mithila >> >> > >> >> > >> >> >> >> >> >> I have SSH setup correctly and here is the /etc/hosts file on node6 of >> >> the >> >> datanodes. >> >> >> >> # >> >> 127.0.0.1 localhost.localdomain localhost node6 >> >> 192.168.1.10master >> >> 192.168.1.1 node1 >> >> 192.168.1.2 node2 >> >> 192.168.1.3 node3 >> >> 192.168.1.4 node4 >> >> 192.168.1.5 node5 >> >> 192.168.1.6 node6 >> >> >> >> I have the slaves file on each machine set as node1 to node6, and each >> >> masters file set to master except for the master itself. Still, I >> keep >> >> getting that same error in the datanodes... >> >> -- >> >> View this message in context: >> >> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html >> >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> > > > -- > Alpha Chapters of my book on Hadoop are available > http://www.apress.com/book/view/9781430219422 > > -- View this message in context: http://www.nabble.com/Datanode-Setup-tp23064660p23208349.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: The mechanism of choosing target datanodes
I haven't checked the code for any special cases of replication = 1. The write-a-block sequence is:
1. Get a list of datanodes from the namenode for the block replicas, the request host being the first datanode returned if the request host is a datanode.
2. Send the block, with the list of datanodes to receive it, to the first datanode in the list.
3. That datanode sends the block to the next datanode in the list.
4. Step 3 repeats until the block is fully replicated.

On Thu, Apr 23, 2009 at 2:08 PM, Jerome Banks wrote: > FYI, The pipe v2 results were created with > com.quantcast.armor.jobs.pipev3.util.CountVG , inputing the results from > com.quantcast.armor.jobs.pipev3.util.MyHarvestV2 (the mainline pipev2 > harvest). > The pipe v3 results were a one day run of BloomDaily for 04/12/2009. > The CSV files were generated with TopNFlow. > > > On 4/23/09 1:56 PM, "Amr Awadallah" wrote: > > yes, it will be split across many nodes, and if possible each block will > get a different datanode. > > see following link for more details: > > > http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization > > -- amr > > Alex Loddengaard wrote: > > I believe the blocks will be distributed across data nodes and not local > to > > only one data node. If this wasn't the case, then running a MR job on > the > > file would only be local to one task tracker. > > > > Alex > > > > On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao wrote: > > > > > >> If a cluster has many datanodes and I want to copy a large file into > DFS. > >> If the replication number is set to 1, does the namenode will put the > file > >> data on one datanode or several nodes? I wonder if the file will be > split > >> into blocks then different unique blocks are on different datanodes. > >> > >> -- > >> View this message in context: > >> > http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html > >> Sent from the Hadoop core-user mailing list archive at Nabble.com. > >> > >> > > > > > > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
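A minimal client-side sketch of the scenario in the original question, assuming the 0.18/0.19 FileSystem API (the path, buffer and block size below are placeholders): the replication factor can be passed per file, and with replication = 1 the pipeline described above contains only a single datanode per block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a file with an explicit replication factor of 1. Each block still
// goes through the namenode-assigned pipeline, but that pipeline has only
// one datanode in it.
public class SingleReplicaWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/single-replica-example");   // placeholder path
    FSDataOutputStream stream =
        fs.create(out, true /* overwrite */, 4096 /* buffer */,
                  (short) 1 /* replication */, 64L * 1024 * 1024 /* block size */);
    stream.writeBytes("example payload\n");
    stream.close();
  }
}
```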
Re: Datanode Setup
Have you copied the updated hadoop-site.xml file to the conf directory on all of your slave nodes? On Thu, Apr 23, 2009 at 2:10 PM, jpe30 wrote: > > Ok, I've done all of this. Set up my hosts file in Linux, setup my master > and slaves file in Hadoop and setup my hadoop-site.xml. It still does not > work. The datanode still gives me this error... > > STARTUP_MSG: host = java.net.UnknownHostException: myhost: myhost > > ...which makes me think its not reading the hadoop-site.xml file at all. > I've checked the permissions and the user has full permissions to all files > within the Hadoop directory. Any suggestions? > > > > Mithila Nagendra wrote: > > > > You should have conf/slaves file on the master node set to master, > node01, > > node02. so on and the masters file on master set to master. Also in > > the > > /etc/hosts file get rid of 'node6' in the line 127.0.0.1 > > localhost.localdomain localhost node6 on all your nodes. Ensure that > the > > /etc/hosts file contain the same information on all nodes. Also > > hadoop-site.xml files on all nodes should have master:portno for hdfs and > > tasktracker. > > Once you do this restart hadoop. > > > > On Fri, Apr 17, 2009 at 10:04 AM, jpe30 wrote: > > > >> > >> > >> > >> Mithila Nagendra wrote: > >> > > >> > You have to make sure that you can ssh between the nodes. Also check > >> the > >> > file hosts in /etc folder. Both the master and the slave much have > each > >> > others machines defined in it. Refer to my previous mail > >> > Mithila > >> > > >> > > >> > >> > >> I have SSH setup correctly and here is the /etc/hosts file on node6 of > >> the > >> datanodes. > >> > >> # > >> 127.0.0.1 localhost.localdomain localhost node6 > >> 192.168.1.10master > >> 192.168.1.1 node1 > >> 192.168.1.2 node2 > >> 192.168.1.3 node3 > >> 192.168.1.4 node4 > >> 192.168.1.5 node5 > >> 192.168.1.6 node6 > >> > >> I have the slaves file on each machine set as node1 to node6, and each > >> masters file set to master except for the master itself. Still, I keep > >> getting that same error in the datanodes... > >> -- > >> View this message in context: > >> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html > >> Sent from the Hadoop core-user mailing list archive at Nabble.com. > >> > >> > > > > > > -- > View this message in context: > http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
The Second Hadoop in China Salon
/// Sorry for cross-posting. //

Hi all,

The Hadoop in China Salon is a free discussion forum on Hadoop-related technologies and ideas. Five months ago, in Nov 2008, we successfully concluded the first Hadoop salon in Beijing. More than sixty people attended the previous salon. Hao Zheng (Yahoo! Inc), Zheng Shao (Facebook Inc) and Wang ShouYan (Baidu Inc) from industry gave us very impressive talks on their recent progress on Hadoop, and about thirty attendees (students/professors) from universities and institutes also impressed us deeply. Thank you to all the attendees again. Without you, it would never have succeeded.

In early May this year, we are going to host the second Hadoop in China Salon. It is our great honor to invite you again to take part in the salon. The meeting is scheduled for May 9 (the first weekend after the Labor Day holiday). This time we are trying to hold it at IHEP (Institute of High Energy Physics, Chinese Academy of Sciences). The good part is that we can visit the Beijing Electron Positron Collider (BEPC, the biggest EPC in China). The bad part is that it may be a little far from Zhong GuanCun; it is on Yu Quan Road.

We now welcome speakers for this salon with our greatest sincerity. Please share with us the latest progress in your work or your team's work on Hadoop. If you are interested in giving a talk, please drop me an email. And we expect more Hadoop users and developers to join us this time. Please feel free to come to the Hadoop discussion forum. Please drop me an email if you would like to come, so we can prepare more free drinks and food. If you have any thoughts on this meeting, please let us know by dropping me an email.

BTW, we have collected several fantastic talks so far, including:
1) One by Zheng Shao on the recent progress on Hive.
2) Two or more talks organized by Yahoo! China R&D (thanks to Yahoo! India R&D, Yahoo! China R&D and Hao Zheng). One talk is research on machine learning. The other talk is not finally settled; it would be about the Hadoop roadmap, enhancements, the new scheduler or Pig.
3) One talk from Baidu Inc. The talk will cover the Hadoop scheduler used at Baidu, data security and computing security.
4) Two talks from two teams at ICT (Institute of Computing Technology, Chinese Academy of Sciences). One is about our research effort on data organization and its influences. The other is about Hadoop on GIS.

Thanks to all the speakers. Many thanks to Cheng Yaodong from IHEP (ihep.ac.cn) for providing the venue and infrastructure. I will send a more detailed schedule next week.

Statement: This event is nonprofit.
Re: sub-optimal multiple disk usage in 0.18.3?
In theory the block allocation strategy is round-robin among the set of storage locations that meet the minimum free space requirements. On Thu, Apr 23, 2009 at 12:55 PM, Bhupesh Bansal wrote: > What configuration are you using for the disks ?? > > Best configuration is just doing a JBOD. > > http://www.nabble.com/RAID-vs.-JBOD-td21404366.html > > Best > Bhupesh > > > > On 4/23/09 12:54 PM, "Mike Andrews" wrote: > > i have a bunch of datanodes with several disks each, and i noticed > > that sometimes dfs blocks don't get evenly distributed among them. for > > instance, one of my machines has 5 disks with 500 gb each, and 1 disk > > with 2 TB (6 total disks). the 5 smaller disks are each 98% full, > > whereas the larger one is only 12% full. it seems as though dfs should > > do better by putting more of the blocks on the larger disk first. and > > mapreduce jobs are failing on this machine with error > > "java.io.IOException: No space left on device". > > > > any thoughts or suggestions? thanks in advance. > > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
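For anyone tuning this, a hedged sketch of the two knobs usually involved (shown programmatically here; in practice they live in hadoop-site.xml, and the paths and sizes below are placeholders): dfs.data.dir lists the volumes the round-robin allocator cycles through, and dfs.datanode.du.reserved is the per-volume free-space floor.

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: the datanode's volumes and the per-volume reserved space.
// These normally go in hadoop-site.xml; the values here are placeholders.
public class DataNodeDiskConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Volumes the block allocator round-robins over:
    conf.set("dfs.data.dir", "/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data");
    // Space (bytes) to keep free on each volume for non-DFS use:
    conf.setLong("dfs.datanode.du.reserved", 10L * 1024 * 1024 * 1024); // 10 GB
    System.out.println(conf.get("dfs.data.dir"));
  }
}
```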
Re: Writing a New Aggregate Function
It really isn't documented anywhere. There is a small section in my book in ch08 about it. It didn't make the alpha that is up of ch08 though. On Thu, Apr 23, 2009 at 1:44 PM, Dan Milstein wrote: > Hello all, > > I've been using streaming + the aggregate package (available via -reducer > aggregate), and have been very happy with what it gives me. > > I'm interested in writing my own new aggregate functions (in Java) which I > could then access from my streaming code. > > Can anyone give me pointers towards how to make that happen? I've read > through the aggregate package source, but I'm not seeing how to define my > own, and get access to it from streaming. > > To be specific, here's the sort of thing I'd like to be able to do: > > - In Java, define a SampleValues aggregator, which chooses a sample of the > input given to it > > - From my streaming program, in say python, output: > > SampleValues:some_key \t some_value > > - Have the aggregate framework somehow call my new aggregator for the > combiner and reducer steps > > Thanks, > -Dan Milstein > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: Generating many small PNGs to Amazon S3 with MapReduce
On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock wrote: > 1 billion * ($0.01 / 1000) = 10,000 Oh yeah, I was thinking $0.01 for a single PUT. Silly me. -S
Re: Datanode Setup
Ok, I've done all of this. Set up my hosts file in Linux, setup my master and slaves file in Hadoop and setup my hadoop-site.xml. It still does not work. The datanode still gives me this error... STARTUP_MSG: host = java.net.UnknownHostException: myhost: myhost ...which makes me think its not reading the hadoop-site.xml file at all. I've checked the permissions and the user has full permissions to all files within the Hadoop directory. Any suggestions? Mithila Nagendra wrote: > > You should have conf/slaves file on the master node set to master, node01, > node02. so on and the masters file on master set to master. Also in > the > /etc/hosts file get rid of 'node6' in the line 127.0.0.1 > localhost.localdomain localhost node6 on all your nodes. Ensure that the > /etc/hosts file contain the same information on all nodes. Also > hadoop-site.xml files on all nodes should have master:portno for hdfs and > tasktracker. > Once you do this restart hadoop. > > On Fri, Apr 17, 2009 at 10:04 AM, jpe30 wrote: > >> >> >> >> Mithila Nagendra wrote: >> > >> > You have to make sure that you can ssh between the nodes. Also check >> the >> > file hosts in /etc folder. Both the master and the slave much have each >> > others machines defined in it. Refer to my previous mail >> > Mithila >> > >> > >> >> >> I have SSH setup correctly and here is the /etc/hosts file on node6 of >> the >> datanodes. >> >> # >> 127.0.0.1 localhost.localdomain localhost node6 >> 192.168.1.10master >> 192.168.1.1 node1 >> 192.168.1.2 node2 >> 192.168.1.3 node3 >> 192.168.1.4 node4 >> 192.168.1.5 node5 >> 192.168.1.6 node6 >> >> I have the slaves file on each machine set as node1 to node6, and each >> masters file set to master except for the master itself. Still, I keep >> getting that same error in the datanodes... >> -- >> View this message in context: >> http://www.nabble.com/Datanode-Setup-tp23064660p23101738.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Datanode-Setup-tp23064660p23203293.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
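For reference, the two properties the quoted advice is about, shown programmatically with placeholder host and port values (in a real cluster they belong in hadoop-site.xml on every node, all pointing at the master):

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch of the settings being discussed; "master", 9000 and 9001
// are placeholders for the actual master hostname and ports.
public class ClusterAddressSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master:9000");   // NameNode address
    conf.set("mapred.job.tracker", "master:9001");       // JobTracker address
    System.out.println(conf.get("fs.default.name"));
  }
}
```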
Re: The mechanism of choosing target datanodes
FYI, The pipe v2 results were created with com.quantcast.armor.jobs.pipev3.util.CountVG , inputing the results from com.quantcast.armor.jobs.pipev3.util.MyHarvestV2 (the mainline pipev2 harvest). The pipe v3 results were a one day run of BloomDaily for 04/12/2009. The CSV files were generated with TopNFlow. On 4/23/09 1:56 PM, "Amr Awadallah" wrote: yes, it will be split across many nodes, and if possible each block will get a different datanode. see following link for more details: http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization -- amr Alex Loddengaard wrote: > I believe the blocks will be distributed across data nodes and not local to > only one data node. If this wasn't the case, then running a MR job on the > file would only be local to one task tracker. > > Alex > > On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao wrote: > > >> If a cluster has many datanodes and I want to copy a large file into DFS. >> If the replication number is set to 1, does the namenode will put the file >> data on one datanode or several nodes? I wonder if the file will be split >> into blocks then different unique blocks are on different datanodes. >> >> -- >> View this message in context: >> http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> >> > >
Re: Generating many small PNGs to Amazon S3 with MapReduce
How do you figure? Puts are one penny per thousand, so I think it'd only cost $10,000. Here's the math I'm using: 1 billion * ($0.01 / 1000) = 10,000 Math courtesy of Google: http://www.google.com/search?q=1+billion+*+(0.01+%2F+1000) Still expensive, but not unreasonably so. Andrew On Thu, Apr 23, 2009 at 7:08 AM, Stuart Sierra wrote: > On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson wrote: >> However, do the math on the costs for S3. We were doing something similar, >> and found that we were spending a fortune on our put requests at $0.01 per >> 1000, and next to nothing on storage. > > I made a similar discovery. The cost of PUT adds up fast. One > billion PUTs will cost you $10 million! > > -Stuart Sierra >
Re: The mechanism of choosing target datanodes
yes, it will be split across many nodes, and if possible each block will get a different datanode. see following link for more details: http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Organization -- amr Alex Loddengaard wrote: I believe the blocks will be distributed across data nodes and not local to only one data node. If this wasn't the case, then running a MR job on the file would only be local to one task tracker. Alex On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao wrote: If a cluster has many datanodes and I want to copy a large file into DFS. If the replication number is set to 1, does the namenode will put the file data on one datanode or several nodes? I wonder if the file will be split into blocks then different unique blocks are on different datanodes. -- View this message in context: http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Writing a New Aggregate Function
Hello all,

I've been using streaming + the aggregate package (available via -reducer aggregate), and have been very happy with what it gives me.

I'm interested in writing my own new aggregate functions (in Java) which I could then access from my streaming code. Can anyone give me pointers towards how to make that happen? I've read through the aggregate package source, but I'm not seeing how to define my own and get access to it from streaming.

To be specific, here's the sort of thing I'd like to be able to do:
- In Java, define a SampleValues aggregator, which chooses a sample of the input given to it
- From my streaming program, in say Python, output: SampleValues:some_key \t some_value
- Have the aggregate framework somehow call my new aggregator for the combiner and reducer steps

Thanks,
-Dan Milstein
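As a rough illustration of the shape such an aggregator might take, here is a sketch against my reading of the 0.18/0.19 aggregate package's ValueAggregator interface (addNextValue / getReport / getCombinerOutput / reset). The class name, sample size and sampling scheme are all assumptions, and wiring it into streaming still requires registering it through a ValueAggregatorDescriptor, which is the part the reply earlier in this digest notes is not really documented.

```java
import java.util.ArrayList;
import java.util.Random;
import org.apache.hadoop.mapred.lib.aggregate.ValueAggregator;

// Hypothetical "SampleValues" aggregator: keeps a bounded reservoir sample of
// the values fed to it. Names and the sample size are illustrative only.
public class SampleValues implements ValueAggregator {
  private static final int MAX_SAMPLE = 100;           // assumption
  private final ArrayList sample = new ArrayList();
  private final Random rand = new Random();
  private long seen = 0;

  public void addNextValue(Object val) {               // reservoir sampling
    seen++;
    if (sample.size() < MAX_SAMPLE) {
      sample.add(val.toString());
    } else {
      long idx = (long) (rand.nextDouble() * seen);
      if (idx < MAX_SAMPLE) {
        sample.set((int) idx, val.toString());
      }
    }
  }

  public String getReport() {                          // final reduce output value
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sample.size(); i++) {
      if (i > 0) {
        sb.append('\t');
      }
      sb.append(sample.get(i));
    }
    return sb.toString();
  }

  public ArrayList getCombinerOutput() {               // values the combiner re-emits
    return new ArrayList(sample);
  }

  public void reset() {
    sample.clear();
    seen = 0;
  }
}
```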
Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887
On Thu, Apr 23, 2009 at 12:00 PM, Koji Noguchi wrote: > Owen, > > > Is it just the patches that have already been applied > > to the 18 branch? Or are there more? > > > Former. Just the patches that have already been applied to 0.18 branch. > I especially want HADOOP-5465 in for the 'stable' release. > (This patch is also missing in 0.19.1) > Hey Koji, FYI, HADOOP-5465 is one of the patches we're bundling in the Cloudera distro for Hadoop, based on 0.18.3: http://cloudera.com/hadoop Lacking an 0.18.4 release, you might want to take a look. -Todd
Re: Hadoop and Matlab
Hi The simplest way for you to run Matlab would be to use distributed toolkit provided in matlab. You just need to configure matlab to discover other matlab-machines. In that way you will not require to setup a hadoop cluster. However if you want to use hadoop as a backend framework for distributed processing, I would suggest you to go for Octave which is open source toolkit just like matlab. It provides interfaces for c/c++. I think that would be more easy to configure it with hadoop than going for matlab which is not open source and licenced. --nitesh On Wed, Apr 22, 2009 at 7:10 AM, Edward J. Yoon wrote: > Hi, > Where to store the images? How to retrieval the images? > > If you have a metadata for the images, the map task can receives a > 'filename' of image as a key, and file properies (host, file path, > ..,etc) as its value. Then, I guess you can handle the matlab process > using runtime object on hadoop cluster. > > On Wed, Apr 22, 2009 at 9:30 AM, Sameer Tilak > wrote: > > Hi Edward, > > Yes, we're building this for handling hundreds of thousands images (at > > least). We're thinking processing of individual images (or a set of > images > > together) will be done in Matlab itself. However, we can use Hadoop > > framework to process the data in parallel fashion. One Matlab instance > > handling few hundred images (as a mapper) and have hundreds of such > > instances and then combine (reducer) the o/p of each instance. > > > > On Tue, Apr 21, 2009 at 5:06 PM, Edward J. Yoon >wrote: > > > >> Hi, What is the input data? > >> > >> According to my understanding, you have a lot of images and want to > >> process all images using your matlab script. Then, You should write > >> some code yourself. I did similar thing for plotting graph with > >> gnuplot. However, If you want to do large-scale linear algebra > >> operations for large image processing, I would recommend investigating > >> other solutions. Hadoop is not a general purpose clustering software, > >> and it cannot run matlab. > >> > >> On Wed, Apr 22, 2009 at 2:55 AM, Sameer Tilak > >> wrote: > >> > Hi there, > >> > > >> > We're working on an image analysis project. The image processing code > is > >> > written in Matlab. If I invoke that code from a shell script and then > use > >> > that shell script within Hadoop streaming, will that work? Has anyone > >> done > >> > something along these lines? > >> > > >> > Many thaks, > >> > --ST. > >> > > >> > >> > >> > >> -- > >> Best Regards, Edward J. Yoon > >> edwardy...@apache.org > >> http://blog.udanax.org > >> > > > > > > -- > Best Regards, Edward J. Yoon > edwardy...@apache.org > http://blog.udanax.org > -- Nitesh Bhatia Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar Gujarat "Life is never perfect. It just depends where you draw the line." visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
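Since the thread mentions driving an external process from a map task, here is a hedged sketch of that pattern with the 0.18/0.19 mapred API. The command line ("octave -q process_image.m <path>"), the one-image-path-per-input-line convention and all names are assumptions; a real job would also capture the tool's output rather than just its exit code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: each input line names an image; the mapper shells out
// to an external tool (Octave here, purely as an example) to process it.
public class ExternalProcessMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String imagePath = value.toString().trim();
    ProcessBuilder pb =
        new ProcessBuilder("octave", "-q", "process_image.m", imagePath);
    pb.redirectErrorStream(true);                 // merge stderr into stdout
    Process proc = pb.start();
    // Drain the tool's output so it cannot block on a full pipe.
    BufferedReader r =
        new BufferedReader(new InputStreamReader(proc.getInputStream()));
    while (r.readLine() != null) { /* discard */ }
    try {
      int exitCode = proc.waitFor();              // wait for the tool to finish
      output.collect(new Text(imagePath), new Text("exit=" + exitCode));
    } catch (InterruptedException e) {
      throw new IOException("interrupted while waiting for external process");
    }
  }
}
```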
Re: sub-optimal multiple disk usage in 0.18.3?
What configuration are you using for the disks ?? Best configuration is just doing a JBOD. http://www.nabble.com/RAID-vs.-JBOD-td21404366.html Best Bhupesh On 4/23/09 12:54 PM, "Mike Andrews" wrote: > i have a bunch of datanodes with several disks each, and i noticed > that sometimes dfs blocks don't get evenly distributed among them. for > instance, one of my machines has 5 disks with 500 gb each, and 1 disk > with 2 TB (6 total disks). the 5 smaller disks are each 98% full, > whereas the larger one is only 12% full. it seems as though dfs should > do better by putting more of the blocks on the larger disk first. and > mapreduce jobs are failing on this machine with error > "java.io.IOException: No space left on device". > > any thoughts or suggestions? thanks in advance.
sub-optimal multiple disk usage in 0.18.3?
i have a bunch of datanodes with several disks each, and i noticed that sometimes dfs blocks don't get evenly distributed among them. for instance, one of my machines has 5 disks with 500 gb each, and 1 disk with 2 TB (6 total disks). the 5 smaller disks are each 98% full, whereas the larger one is only 12% full. it seems as though dfs should do better by putting more of the blocks on the larger disk first. and mapreduce jobs are failing on this machine with error "java.io.IOException: No space left on device". any thoughts or suggestions? thanks in advance. -- permanent contact information at http://mikerandrews.com
5th Apache Hadoop Get Together @ Berlin
I would like to announce the fifth Apache Hadoop Get Together @ Berlin. It is scheduled to take place at:

newthinking store
Tucholskystr. 48
Berlin Mitte

on Thursday, 25th of June 2009 at 05:00pm.

As always, there will be slots of 20 min each for talks, and after each talk there will be time for discussion. You can order drinks directly at the bar in the newthinking store. After the official part we will go to one of the restaurants close by - exactly which one will be announced at the beginning of the event.

Talks scheduled so far:

Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
Abstract: "MapReduce is great for processing great data sets. A distributed file system can be used to store huge amounts of data. But what if your data format needs to adapt to new requirements? This talk will cover a simple introduction to Thrift and Protocol Buffers and sprinkle in some rants and approaches to manage your big data sets."

We would like to invite you, the visitor, to also tell your Hadoop story. If you like, you can bring slides - there will be a beamer. Talks on related projects (HBase, CouchDB, Cassandra, Hive, Pig, Lucene, Solr, nutch, katta, UIMA, Mahout, ...) are of course welcome as well.

A big Thank You goes to the newthinking store for providing a room in the center of Berlin for free.

Website: http://upcoming.yahoo.com/event/2488959/?ps=5 (Please keep an eye on the Upcoming site in case the starting time needs to be shifted.)

Isabel
RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887
Owen, > Is it just the patches that have already been applied > to the 18 branch? Or are there more? > Former. Just the patches that have already been applied to 0.18 branch. I especially want HADOOP-5465 in for the 'stable' release. (This patch is also missing in 0.19.1) Koji -----Original Message----- From: Owen O'Malley [mailto:omal...@apache.org] Sent: Thursday, April 23, 2009 11:54 AM To: core-user@hadoop.apache.org Subject: Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887 On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote: > Nigel, > > When you have time, could you release 0.18.4 that contains some of the > patches that make our clusters 'stable'? Is it just the patches that have already been applied to the 18 branch? Or are there more? -- Owen
Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887
On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote: Nigel, When you have time, could you release 0.18.4 that contains some of the patches that make our clusters 'stable'? Is it just the patches that have already been applied to the 18 branch? Or are there more? -- Owen
Re: The mechanism of choosing target datanodes
I believe the blocks will be distributed across data nodes and not local to only one data node. If this wasn't the case, then running a MR job on the file would only be local to one task tracker. Alex On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao wrote: > > If a cluster has many datanodes and I want to copy a large file into DFS. > If the replication number is set to 1, does the namenode will put the file > data on one datanode or several nodes? I wonder if the file will be split > into blocks then different unique blocks are on different datanodes. > > -- > View this message in context: > http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >
Re: No route to host prevents from storing files to HDFS
Just to clarify one point - the iptables were running on 2nd DataNode which I didn't check, as I was sure the problem is in the NameNode/DataNode, and on NameNode/DataNode. But I can't understand what and when launched them, as I checked multiple times and nothing was running before. Moreover, they were disabled on start-up, so they shouldn't come up in the first place. Regards. 2009/4/23 Stas Oskin > Hi. > > >> Also iptables -L for each machine as an afterthought - just for paranoia's >> sake >> > > Well, I started preparing all the information you requested, but when I got > to this stage - I found out there were INDEED iptables running on 2 servers > from 3. > > The strangest thing is that I don't recall enabling them at all. Perhaps > some 3rd party software have enabled them? > > In any case, all seems to be working now. > > Thanks for everybody that helped - I will be sure to check iptables on all > the cluster machines from now on :). > > Regards. >
Re: No route to host prevents from storing files to HDFS
Hi. > Also iptables -L for each machine as an afterthought - just for paranoia's > sake > Well, I started preparing all the information you requested, but when I got to this stage - I found out there were INDEED iptables running on 2 servers from 3. The strangest thing is that I don't recall enabling them at all. Perhaps some 3rd party software have enabled them? In any case, all seems to be working now. Thanks for everybody that helped - I will be sure to check iptables on all the cluster machines from now on :). Regards.
Re: Are SequenceFiles split? If so, how?
Aaron Kimball wrote: Explicitly controlling your splits will be very challenging. Taking the case where you have expensive (X) and cheap (C) objects to process, you may have a file where the records are lined up X C X C X C X X X X X C C C. In this case, you'll need to scan through the whole file and build splits such that the lengthy run of expensive objects is broken up into separate splits, but the run of cheap objects is consolidated. ^ I'm not concerned about the variation in processing time of objects; there isn't enough variation to worry about. I'm primarily concerned with having enough map tasks to utilized all nodes (and cores). In general, I would just dodge the problem by making sure your splits relatively small compared to the size of your input data. ^ This sounds like the right solution. I'll still need to extend SequenceFileInputFormat, but it should be relatively simple to put a fixed number of objects into each split. thanks
Re: Generating many small PNGs to Amazon S3 with MapReduce
On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson wrote: > However, do the math on the costs for S3. We were doing something similar, > and found that we were spending a fortune on our put requests at $0.01 per > 1000, and next to nothing on storage. I made a similar discovery. The cost of PUT adds up fast. One billion PUTs will cost you $10 million! -Stuart Sierra
Re: No route to host prevents from storing files to HDFS
Can you give us your network topology ? I see that at least 3 ip addresses 192.168.253.20, 192.168.253.32 and 192.168.253.21 In particular the fs.default.name which you have provided, the hadoop-site.xml for each machine, the slaves file, with ip address mappings if needed and a netstat -a -n -t -p | grep java (hopefully you run linux) and the output of jps for each machine That should let us see what servers are binding to what ports on what machines, and what you cluster things should be happening. Also iptables -L for each machine as an afterthought - just for paranoia's sake On Thu, Apr 23, 2009 at 2:45 AM, Stas Oskin wrote: > Hi. > > Maybe, but there will still be at least one virtual network adapter on the > > host. Try turning them off. > > > Nope, still throws "No route to host" exceptions. > > I have another IP address defined on this machine - 192.168.253.21, for the > same network adapter. > > Any idea if it has impact? > > > > > > > >> The fs.default.name is: > >> hdfs://192.168.253.20:8020 > >> > > > > what happens if you switch to hostnames over IP addresses? > > > Actually, I never tried this, but point is that the HDFS worked just fine > with this before. > > Regards. > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: Are SequenceFiles split? If so, how?
On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote: > Explicitly controlling your splits will be very challenging. Taking the case > where you have expensive (X) and cheap (C) objects to process, you may have > a file where the records are lined up X C X C X C X X X X X C C C. In this > case, you'll need to scan through the whole file and build splits such that > the lengthy run of expensive objects is broken up into separate splits, but > the run of cheap objects is consolidated. I'm suspicious that you can do > this without scanning through the data (which is what often constitutes the > bulk of a time in a mapreduce program). I would also like the ability to stream the data and shuffle it into buckets; when any bucket achieves a fixed cost (currently assessed as byte size), it would be shipped as a task. In practise, in the Hadoop architecture, this causes an extra level of I/O, since all the data must be read into the shuffler and re-sorted. Also, it breaks the ability to run map tasks on systems hosting the data. However, it is a subject about which I am doing some thinking. > But how much data are you using? I would imagine that if you're operating at > the scale where Hadoop makes sense, then the high- and low-cost objects will > -- on average -- balance out and tasks will be roughly evenly proportioned. True, dat. But it's still worth thinking about stream splitting, since the theoretical complexity overhead is an increased constant on a linear term. Will get more into architecture first. S.
Re: No route to host prevents from storing files to HDFS
Hi. Maybe, but there will still be at least one virtual network adapter on the > host. Try turning them off. Nope, still throws "No route to host" exceptions. I have another IP address defined on this machine - 192.168.253.21, for the same network adapter. Any idea if it has impact? > > >> The fs.default.name is: >> hdfs://192.168.253.20:8020 >> > > what happens if you switch to hostnames over IP addresses? Actually, I never tried this, but point is that the HDFS worked just fine with this before. Regards.
Re: No route to host prevents from storing files to HDFS
Stas Oskin wrote: Hi. 2009/4/23 Matt Massie Just for clarity: are you using any type of virtualization (e.g. vmware, xen) or just running the DataNode java process on the same machine? What is "fs.default.name" set to in your hadoop-site.xml? This machine has OpenVZ installed indeed, but all the applications run withing the host node, meaning all Java processes are running withing same machine. Maybe, but there will still be at least one virtual network adapter on the host. Try turning them off. The fs.default.name is: hdfs://192.168.253.20:8020 what happens if you switch to hostnames over IP addresses?
Re: No route to host prevents from storing files to HDFS
Hi. I have one question, is the ip address consistent, I think in one of the > thread mails, it was stated that the ip address sometimes changes. > Same static IP's for all servers. By the way, I have the fs.default.name defined in IP address could it be somehow related? I read that there were some issues with this, but it ran fine for me - that it, until the power crash. Regards.
Re: No route to host prevents from storing files to HDFS
Hi. Shouldn't you be testing connecting _from_ the datanode? The error you > posted is while this DN is trying connect to another DN. You might be into something here indeed: 1) Telnet to 192.168.253.20 8020 / 192.168.253.20 50010 works 2) Telnet to localhost 8020 / localhost 50010 doesn't work 3) Telnet to 127.0.0.1 8020 / 127.0.0.1 50010 doesn't work In the 2 last cases I get: Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused telnet: Unable to connect to remote host: Connection refused Could it be related? Regards.
Re: Num map task?
Edward J. Yoon wrote: > How do you to add input paths? > > On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr > wrote: > >> Edward J. Yoon wrote: >> >> >>> Hi, >>> >>> In that case, The atomic unit of split is a file. So, you need to >>> increase the number of files. or Use the TextInputFormat as below. >>> >>> jobConf.setInputFormat(TextInputFormat.class); >>> >>> On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr >>> wrote: >>> >>> Hi all! I have a MR job use to import contents into HBase. The content is text file in HDFS. I used the maps file to store local path of contents. Each content has the map file. ( the map is a text file in HDFS and contain 1 line info). I created the maps directory used to contain map files. And the this maps directory used to input path for job. When i run job, the number map task is same number map files. Ex: I have 5 maps file -> 5 map tasks. Therefor, the map phase is slowly :( Why the map phase is slowly if the number map task large and the number map task is equal number of files?. * p/s: Run jobs with: 3 node: 1 server and 2 slaver Please help me! Thanks. Best, Nguyen. >>> >>> >>> >> Current, I use TextInputformat to set InputFormat for map phase. >> >> > > > > Thanks for your help! I use FileInputFormat to add input paths. Some thing like: FileInputFormat.setInputPath(new Path("dir")); The "dir" is a directory contains input files. Best, Nguyen
Re: No route to host prevents from storing files to HDFS
Hi. 2009/4/23 Matt Massie > Just for clarity: are you using any type of virtualization (e.g. vmware, > xen) or just running the DataNode java process on the same machine? > > What is "fs.default.name" set to in your hadoop-site.xml? > This machine has OpenVZ installed indeed, but all the applications run withing the host node, meaning all Java processes are running withing same machine. The fs.default.name is: hdfs://192.168.253.20:8020 Thanks.
The mechanism of choosing target datanodes
If a cluster has many datanodes and I want to copy a large file into DFS. If the replication number is set to 1, does the namenode will put the file data on one datanode or several nodes? I wonder if the file will be split into blocks then different unique blocks are on different datanodes. -- View this message in context: http://www.nabble.com/The-mechanism-of-choosing-target-datanodes-tp23193235p23193235.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Are SequenceFiles split? If so, how?
Explicitly controlling your splits will be very challenging. Taking the case where you have expensive (X) and cheap (C) objects to process, you may have a file where the records are lined up X C X C X C X X X X X C C C. In this case, you'll need to scan through the whole file and build splits such that the lengthy run of expensive objects is broken up into separate splits, but the run of cheap objects is consolidated. I'm suspicious that you can do this without scanning through the data (which is what often constitutes the bulk of a time in a mapreduce program). But how much data are you using? I would imagine that if you're operating at the scale where Hadoop makes sense, then the high- and low-cost objects will -- on average -- balance out and tasks will be roughly evenly proportioned. In general, I would just dodge the problem by making sure your splits relatively small compared to the size of your input data. If you have 5 million objects to process, then make each split be roughly equal to say 20,000 of them. Then even if some splits take long to process and others take a short time, then one CPU may dispatch with a dozen cheap splits in the same time where one unlucky JVM had to process a single very expensive split. Now you haven't had to manually balance anything, and you still get to keep all your CPUs full. - Aaron On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman wrote: > Thanks Aaron, that really helps. I probably do need to control the number > of splits. My input 'data' consists of Java objects and their size (in > bytes) doesn't necessarily reflect the amount of time needed for each map > operation. I need to ensure that I have enough map tasks so that all cpus > are utilized and the job gets done in a reasonable amount of time. > (Currently I'm creating multiple input files and making them unsplitable, > but subclassing SequenceFileInputFormat to explicitly control then number of > splits sounds like a better approach). > > Barnet > > > Aaron Kimball wrote: > >> Yes, there can be more than one InputSplit per SequenceFile. The file will >> be split more-or-less along 64 MB boundaries. (the actual "edges" of the >> splits will be adjusted to hit the next block of key-value pairs, so it >> might be a few kilobytes off.) >> >> The SequenceFileInputFormat regards mapred.map.tasks >> (conf.setNumMapTasks()) >> as a hint, not a set-in-stone metric. (The number of reduce tasks, though, >> is always 100% user-controlled.) If you need exact control over the number >> of map tasks, you'll need to subclass it and modify this behavior. That >> having been said -- are you sure you actually need to precisely control >> this >> value? Or is it enough to know how many splits were created? >> >> - Aaron >> >> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman >> wrote: >> >> >> >>> Suppose a SequenceFile (containing keys and values that are >>> BytesWritable) >>> is used as input. Will it be divided into InputSplits? If so, what's the >>> criteria use for splitting? >>> >>> I'm interested in this because I need to control the number of map tasks >>> used, which (if I understand it correctly), is equal to the number of >>> InputSplits. >>> >>> thanks, >>> >>> bw >>> >>> >>> >> >> >> > >
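A small sketch of the "many small splits" approach described above, using the 0.18/0.19 JobConf knobs (the class name and numbers are placeholders): the requested number of map tasks is only a hint, but FileInputFormat-derived formats use it to shrink the goal split size. Note that SequenceFileInputFormat still rounds split boundaries to the nearest sync point, so per-split sizes remain approximate.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Placeholder job class; the point is only the two knobs below.
public class SmallSplitsSketch {
  static JobConf configure(Class jobClass) {
    JobConf conf = new JobConf(jobClass);
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setNumMapTasks(500);                     // hint: aim for ~500 splits
    conf.set("mapred.min.split.size", "1");       // don't let a floor override it
    return conf;
  }
}
```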
Re: which is better Text or Custom Class
In general, serializing to text and then parsing back into a different format will always be slower than using a purpose-built class that can serialize itself. The tradeoff, of course, is that going to text is often more convenient from a developer-time perspective. - Aaron On Mon, Apr 20, 2009 at 2:23 PM, chintan bhatt wrote: > > Hi all, > I want to ask you about the performance difference between using the Text > class and using a custom Class which implements Writable interface. > > Lets say in InvertedIndex problem when I emit token and a list of document > Ids which contains it , using Text we usually Concat the list of document > ids with space as a separator "d1 d2 d3 d4" etc..If I need the same values > in a later step of map reduce, I need to split the value string to get the > list of all document Ids. Is it not better to use Writable List instead?? > > I need to ask it because I am using too many Concats and Splits in my > project to use documents total tokens count, token frequency in a particular > document etc.. > > > Thanks in advance, > Chintan > > > _ > Windows Live Messenger. Multitasking at its finest. > http://www.microsoft.com/india/windows/windowslive/messenger.aspx
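To make the trade-off concrete, a hedged sketch of the custom-Writable side of the comparison for the inverted-index example above (names are placeholders): the document-id list is serialized directly instead of being joined into a space-separated Text and re-split downstream.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Illustrative custom value type: a list of document ids with its own
// binary serialization, avoiding concat/split round-trips through Text.
public class DocIdList implements Writable {
  private final List<String> docIds = new ArrayList<String>();

  public void add(String docId) { docIds.add(docId); }
  public List<String> get()     { return docIds; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(docIds.size());
    for (String id : docIds) {
      out.writeUTF(id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    docIds.clear();                 // overwrite any reused instance's state
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      docIds.add(in.readUTF());
    }
  }
}
```

One design note: readFields clears and repopulates the list because Hadoop reuses Writable instances across records, so stale state must not leak from one read to the next.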