Re: Setting number of mappers according to number of TextInput lines

2012-06-16 Thread Shi Yu
How did you try it? I had no problem with NLineInputFormat. It just works exactly as expected. Shi
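For reference, a minimal driver sketch using NLineInputFormat with the new API, assuming a release that ships org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; the paths, the identity mapper and the 1000-lines-per-split figure are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "nline-example");
    job.setJarByClass(NLineJobSetup.class);
    // Each map task receives at most 1000 input lines, regardless of block size.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // Identity mapper, no reducers: the job only demonstrates the split behaviour.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}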

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

2012-05-31 Thread Shi Yu
We once calculated the cost of using EC2 to train our machine learning model (assuming we did everything in one shot, which is almost impossible) using an EM algorithm. The cost for each model was 10,000 US dollars. The cost of each individual node per hour seems cheap, but when it scales

Random Sample in Map/Reduce

2012-05-14 Thread Shi Yu
Hi, Before I raise this question I searched relevant topics. There are suggestions online: Mappers: output all qualifying values, each with a random integer key. Single reducer: output the first N values, throwing away the keys. However, this scheme does not seem very efficient when the data

Re: Random Sample in Map/Reduce

2012-05-14 Thread Shi Yu
To answer my own question: I applied a non-repeating random number generator in the mapper. At the mapper setup stage I generate a pre-defined number of distinct random numbers, then I keep a counter in the mapper. When the counter is contained in the random number set, the mapper executes and outputs
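A minimal sketch of that mapper, assuming the sample size and an upper bound on records per mapper are passed through hypothetical configuration keys (sample.per.mapper, sample.max.records):

import java.io.IOException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RandomSampleMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Set<Long> chosen = new HashSet<Long>();
  private long counter = 0;

  @Override
  protected void setup(Context context) {
    // Hypothetical keys: how many records to sample per mapper, and an upper
    // bound on how many records a mapper will see (sampleSize << maxRecords).
    int sampleSize = context.getConfiguration().getInt("sample.per.mapper", 100);
    long maxRecords = context.getConfiguration().getLong("sample.max.records", 1000000L);
    Random rnd = new Random();
    while (chosen.size() < sampleSize) {
      // The Set guarantees the drawn indices are distinct ("non-repeating").
      chosen.add((long) (rnd.nextDouble() * maxRecords));
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the record only if its position was drawn in setup().
    if (chosen.contains(counter)) {
      context.write(NullWritable.get(), value);
    }
    counter++;
  }
}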

Re: transferring between HDFS which reside in different subnet

2012-05-11 Thread Shi Yu
If you could cross-access HDFS from both name nodes, then it should be transferable using the distcp command. Shi On 5/11/2012 8:45 AM, Arindam Choudhury wrote: Hi, I have a question for the hadoop experts: I have two HDFS clusters, in different subnets. HDFS1: 192.168.*.* HDFS2: 10.10.*.* the

Re: freeze a mapreduce job

2012-05-11 Thread Shi Yu
Is there any risk in suppressing a job for too long in FS? I guess there are some parameters to control the waiting time of a job (such as a timeout, etc.). For example, if a job is kept idle for more than 24 hours, is there a configuration that decides whether to kill or keep that job? Shi On 5/11/2012 6:52 AM,

Re: How to maintain record boundaries

2012-05-11 Thread Shi Yu
Here is some quick code for you (based on Tom's book). You could override the TextInputFormat isSplitable method to avoid splitting, which is pretty important and useful when processing sequence data. //Old API public class NonSplittableTextInputFormat extends TextInputFormat {
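For completeness, a sketch of the non-splittable input format in both APIs (two separate source files); the first class name is the one used in this thread, the new-API variant is illustrative:

// Old API (org.apache.hadoop.mapred), as referenced above.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // each input file becomes exactly one split, hence one mapper
  }
}

// New API (org.apache.hadoop.mapreduce), in its own source file.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormatNewApi extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}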

Re: transferring between HDFS which reside in different subnet

2012-05-11 Thread Shi Yu
It seems that in your case HDFS2 could access HDFS1, so you should be able to transfer HDFS1 data to HDFS2. If you want to cross-transfer, you don't need to run distcp on the cluster nodes; if any client node (not necessarily a namenode, datanode, secondary namenode, etc.) can access both HDFS clusters, then

Re: SQL analysis

2012-05-10 Thread Shi Yu
It depends on your use case, for example, whether you need query-only access or real-time inserts and updates. The solutions can be different. You might need to consider HBase, Cassandra, or tools like Flume.

RE: SQL analysis

2012-05-10 Thread Shi Yu
Flume might be suitable for your case. https://cwiki.apache.org/FLUME/ Shi

Re: SQL analysis

2012-05-10 Thread Shi Yu
it around might help. If you want further analysis like Business Intelligence, then you need to train various models. On 5/10/2012 8:30 AM, karanveer.si...@barclays.com wrote: I am more worried about the analysis assuming this data is in HDFS. -Original Message- From: Shi Yu

Re: Nested map reduce job

2012-05-05 Thread Shi Yu
A quick glance at your problem indicates that you might have a design problem in your code. In my opinion you should avoid nested Map/Reduce jobs. You could use chained Map/Reduce, but a nested or recursive structure is not recommended. I don't know how you implemented your nested M/R job,

Re: How to create an archive-file in Java to distribute a MapFile via Distributed Cache

2012-05-04 Thread Shi Yu
My humble experience: I would prefer specifying the files on the command line using the -files option, then handling them explicitly in the Mapper configure or setup function using File f1 = new File(file1name); File f2 = new File(file2name); because I am not 100% sure how the distributed cache
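A sketch of that pattern with the new API, assuming the job was launched through GenericOptionsParser with -files file1.txt so that the file shows up in the task's working directory; the file name and parsing logic are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException {
    // With "-files file1.txt" on the command line, the file is copied into the
    // task's working directory, so a plain relative FileReader can open it.
    BufferedReader reader = new BufferedReader(new FileReader("file1.txt"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        // ... load the side data into an in-memory structure ...
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text("placeholder"), value);  // placeholder map logic
  }
}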

High Availability Framework for HDFS Namenode in 2.0.0

2012-05-03 Thread Shi Yu
It sounds like an exciting feature. Has anyone tried this in practice? How does the hot standby namenode perform, and how reliable is the HDFS recovery? Is it now a good time to migrate to 2.0.0, in your opinion? Best, Shi

Re: High Availability Framework for HDFS Namenode in 2.0.0

2012-05-03 Thread Shi Yu
Hi Harsh J, It seems that the 20% performance loss is not that bad; at least some smart people are still working to improve it. I will keep an eye on this interesting trend. Shi

Re: High Availability Framework for HDFS Namenode in 2.0.0

2012-05-03 Thread Shi Yu
Hi Todd, Okay, that sounds really good (sorry didn't grab all the information in that long page). Shi

Mass SocketTimeoutException - 0.20.203

2012-04-29 Thread Shi Yu
Tons of errors are seen after Map 100% Reduce 50%, but the job still struggles to finish. What is the possible reason? Is this issue fixed in any of the versions? java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch :

Re: setNumTasks

2012-03-22 Thread Shi Yu
If you want to control the number of input splits at fine granularity, you could customize NLineInputFormat. You need to determine the number of lines per split. Thus, what you need to know beforehand is the number of lines in your input data, for instance, using hadoop -text /input/dir/* |
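A sketch of deriving the lines-per-split value from a pre-computed total line count with the old API; the configuration key shown is the old-API one as best I recall, so verify it against your release:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SplitSizing {
  // Given a total line count obtained beforehand (e.g. by counting the lines of
  // the decompressed input) and a desired number of map tasks, derive the
  // lines-per-split value for NLineInputFormat.
  public static void configure(JobConf conf, long totalLines, int desiredMappers) {
    int linesPerSplit = (int) Math.max(1, (totalLines + desiredMappers - 1) / desiredMappers);
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", linesPerSplit);
  }
}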

Re: how to get rid of -libjars ?

2012-03-06 Thread Shi Yu
1. Wrap all your jar files inside your artifact; they should be under the lib folder. Sometimes this can make your jar file quite big; if you want to save time uploading big jar files remotely, see 2. 2. Using -libjars with the full path or a relative path (w.r.t. your jar package) should work. On

Re: Task Killed but no errors

2012-02-27 Thread Shi Yu
On 2/27/2012 1:55 PM, Mohit Anchlia wrote: I submitted a map reduce job that had 9 tasks killed out of 139. But I don't see any errors in the admin page. The entire job however has SUCCEEDED. How can I track down the reason? Also, how do I determine if this is something to worry about? Hi,

Re: LZO with sequenceFile

2012-02-26 Thread Shi Yu
Hi, You could easily find lots of documents talking about this. Try kevinweil-hadoop-lzo in google. Shi

Re: LZO with sequenceFile

2012-02-25 Thread Shi Yu
Yes, it is supported by Hadoop sequence files, and they are splittable by default. If you have installed and specified LZO correctly, use these: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job,true);
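A fuller sketch of the output-side settings (new API), assuming the LzoCodec class from the third-party hadoop-lzo (kevinweil) package and Text keys/values as placeholders:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// LzoCodec comes from the hadoop-lzo (kevinweil) package, not from Hadoop itself.
import com.hadoop.compression.lzo.LzoCodec;

public class LzoSequenceFileOutput {
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
    // BLOCK compression keeps the sequence file splittable.
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
  }
}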

Re: How to read LZO compressed files?

2012-01-01 Thread Shi Yu
You could decompress the LZO file manually into plain text and then use TextInputFormat. Alternatively, you don't need to index the LZO compressed file: just use LZOInputFormat on the non-indexed files, and then the LZO file will not be split anymore.
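For the second option, a sketch of the job setup, assuming the hadoop-lzo (kevinweil) library's LzoTextInputFormat; the exact class name can differ between versions of that library:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// From the hadoop-lzo (kevinweil) library; there is also a
// DeprecatedLzoTextInputFormat for the old API.
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoReadSetup {
  public static void configure(Job job, Path lzoInput) {
    // On non-indexed .lzo files each file is read as a single, unsplit stream;
    // with an .index file present the same format produces multiple splits.
    job.setInputFormatClass(LzoTextInputFormat.class);
    FileInputFormat.addInputPath(job, lzoInput);
  }
}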

Writing task log files to HDFS

2011-12-19 Thread Shi Yu
Hi, Is there any working example of writing hadoop task log (stderr) files to HDFS? Currently we have a cluster whose data nodes are inaccessible to the users, so I am trying to find a way to redirect all task log files to HDFS. Thanks! Best, Shi

Re: DistributedCache in NewAPI on 0.20.X branch

2011-12-17 Thread Shi Yu
Thank you Bejoy! Following your code examples, it finally works. Actually I only changed two places in my original code. First, I added the @Override annotation. Second, I added a new exception catch, catch(FileNotFoundException e), and now it works! I appreciate your kind and precise help. Best, Shi

Re: DistributedCache in NewAPI on 0.20.X branch

2011-12-16 Thread Shi Yu
Following up on my previous question, I put the complete code below; I doubt whether there is any method to get this working on 0.20.X using the new API. The command I executed was: bin/hadoop jar myjar.jar FileTest -files textFile.txt /input/ /output/ The complete code: public class FileTest extends

DistributedCache in NewAPI on 0.20.X branch

2011-12-15 Thread Shi Yu
Hi, I am using the 0.20.X branch, but I need to use the new API because it has the cleanup(context) method in Mapper. However, I am confused about how to load the cached files in the mapper. I could load the DistributedCache files using the old API (JobConf), but with the new API it always returns
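One way to reach the cached files from the new API is still the DistributedCache static accessors, called from setup(context); a minimal sketch follows, with the caveat that whether getLocalCacheFiles returns the -files entries varies across 0.20.x releases, which is part of the trouble discussed in this thread:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException {
    // The new-API Mapper still exposes the job Configuration, so the old
    // DistributedCache static accessors can be used here.
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null) {
      for (Path p : localFiles) {
        BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            // ... parse the cached file ...
          }
        } finally {
          reader.close();
        }
      }
    }
  }
}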

Create a single output per each mapper

2011-12-12 Thread Shi Yu
Hi, Suppose I have two mappers, and each mapper is assigned 10 lines of data. I want to set a counter for each mapper, counting and accumulating, and then output the counter value to the reducer when the mapper finishes processing all the assigned lines. So I want the mapper to output values only when
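The cleanup(context) hook of the new-API Mapper does exactly this; a minimal sketch, with the constant output key chosen only for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long counter = 0;

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    counter++;  // accumulate only; nothing is emitted per record
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit a single record per mapper once all assigned lines are processed.
    // A constant key sends every per-mapper count to one reducer for summing.
    context.write(new Text("count"), new LongWritable(counter));
  }
}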

Re: risks of using Hadoop

2011-09-21 Thread Shi Yu
I saw the title of this discussion started a few days ago but didn't pay attention to it. This morning I came across some of these messages and, rofl, too much drama. According to my experience, there are some risks of using hadoop. 1) It is not for real-time or mission-critical work; you may consider

Re: Reducer to concatenate string values

2011-09-20 Thread Shi Yu
Hi, You probably need to use secondary sort (based on a TextPair key) and a string concatenation function (like StringBuffer) to do this. I once gave a talk at the Open Cloud Science workshop about this (also see my previous post in this newsgroup). Best, Shi On 9/20/2011 10:38 AM, Daniel
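A minimal sketch of the concatenation side (new API, StringBuilder instead of StringBuffer); note that without the secondary sort mentioned above, the order of the concatenated values is not guaranteed:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ConcatReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    for (Text value : values) {
      if (sb.length() > 0) {
        sb.append(",");  // delimiter between concatenated values
      }
      sb.append(value.toString());
    }
    context.write(key, new Text(sb.toString()));
  }
}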

old problem: mapper output as sequence file

2011-09-19 Thread Shi Yu
Hi, I am stuck again in a probably very simple problem. I couldn't generate the map output in sequence file format. I always get this error: java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not class org.apache.hadoop.io.LongWritable at
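This kind of "wrong key class" message usually means the declared output key/value classes do not match what the mapper actually writes; a sketch of declaring them for a map-only job with sequence-file output (not necessarily the exact fix applied later in this thread):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceMapOutputSetup {
  public static void configure(Job job) {
    // Map-only job writing (Text, Text) pairs straight into a sequence file.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // These must match what the mapper emits; otherwise the sequence file
    // writer throws "wrong key class ... is not class ...".
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
  }
}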

Re: old problem: mapper output as sequence file

2011-09-19 Thread Shi Yu
Oh that's brilliant. Thanks a lot Brock! On 9/19/2011 3:15 PM, Brock Noland wrote: Hi, On Mon, Sep 19, 2011 at 3:19 PM, Shi Yush...@uchicago.edu wrote: I am stuck again in a probably very simple problem. I couldn't generate the map output in sequence file format. I always get this error:

Re: Hadoop on Ec2

2011-09-07 Thread Shi Yu
Interested in this topic. We have experienced plenty of difficulties running hadoop in Eucalyptus-based virtual instance clusters. Typical issues include java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel

Hadoop in PBS + Lustre

2011-08-11 Thread Shi Yu
Hi, I found some materials about submitting hadoop jobs via PBS. Any idea how to interactively browse HDFS through PBS? Our supercomputer uses the Lustre storage system. I found a wiki page about using the Polyserve storage system instead of HDFS. Has anyone tried lustre + hadoop + PBS?

hadoop 0.20.203.0 Java Runtime Environment Error

2011-07-01 Thread Shi Yu
I had difficulty upgrading applications from Hadoop 0.20.2 to Hadoop 0.20.203.0. The standalone mode runs without problems. In real cluster mode, the program freezes at map 0% reduce 0% and there is only one attempt file in the log directory. The only information is contained in the stdout file:

Re: hadoop 0.20.203.0 Java Runtime Environment Error

2011-07-01 Thread Shi Yu
Thanks Edward! I upgraded to 1.6.0_26 and it worked. On 7/1/2011 6:42 PM, Edward Capriolo wrote: That looks like an ancient version of java. Get 1.6.0_u24 or 25 from oracle. Upgrade to a recent java and possibly update your c libs. Edward On Fri, Jul 1, 2011 at 7:24 PM, Shi

Split control in Lzo index

2011-06-23 Thread Shi Yu
Hi, My specific question is: is it possible to control the splitting of Lzo files by customizing the Lzo index files? The background of the problem is: I have a file with the following format: key1 value1, key1 value2, key2 value3, key2 value4, ... Its size in plain text before compression is 11

Re: large memory tasks

2011-06-16 Thread Shi Yu
There is no lookup. The process is done by shuffle and sort (secondary sort for multiple keys) in Map/Reduce. The key problem is to join your record file (K1 R1, K2 R2, ...) with the lookup table (K1 V1, K2 V2, ...), which gives
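A minimal sketch of the reduce side of such a join, assuming each mapper tags its values with an illustrative L:/R: prefix; buffering the records per key in memory is a simplification that the secondary-sort variant avoids:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce side of a reduce-side join: one mapper emits lookup rows as "L:" + V,
// the other emits record rows as "R:" + R, both under the shared key K.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String lookupValue = null;
    List<String> records = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("L:")) {
        lookupValue = s.substring(2);
      } else {
        records.add(s.substring(2));
      }
    }
    if (lookupValue != null) {
      for (String r : records) {
        // Each record is joined with the looked-up value for its key.
        context.write(key, new Text(r + "\t" + lookupValue));
      }
    }
  }
}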

Re: large memory tasks

2011-06-15 Thread Shi Yu
I had the same problem before: a big lookup table too large to load in memory. I tried and compared the following approaches: an in-memory MySQL DB, a dedicated central memcache server, a dedicated central MongoDB server, and a local DB model (each node has its own MongoDB server). The local DB

Re: large memory tasks

2011-06-15 Thread Shi Yu
Suppose you are looking up a value V for a key K, and V is required by an upcoming process. Suppose the data in the upcoming process has the form R1 K1 K2 K3, where R1 is the record number and K1 to K3 are the keys occurring in the record, which means in the lookup case you would query for

Configure Map Side join for multiple mappers in Hadoop Map/Reduce

2011-06-12 Thread Shi Yu
This is a re-post of the same message; I made it more specific and clear. I have been considering it for several days, so I would really appreciate any help. I have a question about configuring a map-side inner join for multiple mappers in Hadoop. Suppose I have two very large data sets A and B, and I use the
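For reference, the old-API composite join is usually wired up roughly like this, assuming both inputs are already sorted by key and identically partitioned; the KeyValueTextInputFormat component and the paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  // Both inputs must already be sorted by key and partitioned identically
  // (same number of partitions) for the composite join to work.
  public static void configure(JobConf conf, Path pathA, Path pathB) {
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            pathA, pathB));
    // The mapper then receives (Text key, TupleWritable value), where the tuple
    // holds the matching value from each of the two inputs.
  }
}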

Parallel map side join

2011-06-10 Thread Shi Yu
Hi, How do I configure a map side join across multiple mappers in parallel? Suppose I have data sets a1, a2, a3 and data sets b1, b2, b3. I want to let a1 join with b1, a2 join with b2, a3 join with b3, and have the joins done in parallel. I think it should be possible to configure this in mapper 1

Re: Automatic line number in reducer output

2011-06-09 Thread Shi Yu
Hi, Thanks for the reply. The line count in the new API works fine now; it was a bug in my code. In the new API, Iterator is changed to Iterable, but I didn't pay attention to that and was still using Iterator with the hasNext() and next() methods. Surprisingly, the wrong code still ran and produced output,
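For later readers, the corrected new-API shape with the line counter looks roughly like this; note that the numbering is only unique within a single reducer:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NumberedOutputReducer extends Reducer<Text, Text, LongWritable, Text> {

  private long lineCount = 0;  // self-incrementing "line number" for the output

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // New API: iterate the Iterable directly instead of Iterator.hasNext()/next().
    for (Text value : values) {
      lineCount++;
      context.write(new LongWritable(lineCount), value);
    }
  }
}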

one-to-many Map Side Join without reducer

2011-06-09 Thread Shi Yu
Hi, I have two datasets. Dataset 1 has the format: MasterKey1 SubKey1 SubKey2 SubKey3, MasterKey2 SubKey4 SubKey5 SubKey6. Dataset 2 has the format: SubKey1 Value1, SubKey2 Value2, ... I want to have a one-to-many join based on the SubKey, and the final goal is to

New API for TupleWritable / Mapside join

2011-06-08 Thread Shi Yu
Hi, I am trying to rewrite and improve some old code that uses map side join features such as TupleWritable, KeyValueTextInputFormat, etc. The reference materials I have are based on the old API (0.19.x). Since Hadoop is updating rapidly, I am wondering whether there are any new functions / APIs / frameworks for map

Automatic line number in reducer output

2011-06-07 Thread Shi Yu
Hi, I am wondering whether there is any built-in function to automatically add a self-incrementing line number to the reducer output (like a relational DB auto-key). I have this problem because in the 0.19.2 API, I used a linecount variable increasing in the reducer like: public static class Reduce extends

standalone ? mapred.LocalJobRunner

2011-06-06 Thread Shi Yu
Hi, I am stuck on a basic problem but can't figure it out. My previous verbose logging problem is the same as the one mentioned in this old post: http://mail-archives.apache.org/mod_mbox/nutch-user/200901.mbox/%3c0adbd67bd6811a4bb2144d805124714d03f754a...@kaex1.dom.rastatt.de%3E First question, if

Verbose screen logging on hadoop-0.20.203.0

2011-06-05 Thread Shi Yu
We just upgraded from 0.20.2 to hadoop-0.20.203.0. Running the same code produces a massive amount of debug information in the screen output. Normally this type of information is written to the logs/userlogs directory. However, nothing is written there now and it seems everything is output to

Re: Verbose screen logging on hadoop-0.20.203.0

2011-06-05 Thread Shi Yu
I still didn't get it. To make sure I am not using any old version, I downloaded two versions 0.20.2 and 0.20.203.0 again and had a fresh install separately on two independent clusters. I tried with a very simple toy program. I didn't change anything in the API so it probably calls the old

ERROR java.io.IOException: Spill failed debug

2011-06-03 Thread Shi Yu
A map/reduce process applied to 3T of input data halts for 1 hour at map 57% reduce 19% without any progress. The same error occurs millions of times in the huge syslog file. I also got a huge stderr file, where the logs are: Caused by:

Re: providing the same input to more than one Map task

2011-04-25 Thread Shi Yu
Then, what is the main difference between (1) storing the input in the cluster's shared directory and loading it in the configure stage of the mappers, and (2) using the distributed cache? Shi On 4/25/2011 8:17 AM, Kai Voigt wrote: Hi, I'd use the distributed cache to store the vector on every mapper

how to display a specific text line in dfs

2011-04-22 Thread Shi Yu
I use hadoop dfs -text path | head -n and hadoop dfs -text path | tail -n to browse the n-th line from the head or from the tail. But it is slow when the file is large. Is there any command that goes directly to a specific line in dfs? Shi

Re: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Shi Yu
Message- From: Shi Yu [mailto:sh...@uchicago.edu] Sent: Thursday, March 24, 2011 3:02 PM To: hadoop user Subject: Program freezes at Map 99% Reduce 33% I am running a hadoop program processing terabyte-size data. The code was tested successfully on a small sample (100G) and it worked. However

Re: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Shi Yu
of mappers and then compare that to your program run without hadoop. Kevin -Original Message- From: Shi Yu [mailto:sh...@uchicago.edu] Sent: Thursday, March 24, 2011 3:57 PM To: common-user@hadoop.apache.org Subject: Re: Program freezes at Map 99% Reduce 33% Hi Kevin, thanks for reply. I

Re: how to write outputs sequentially?

2011-03-22 Thread Shi Yu
I guess you need to define a Partitioner to send hashed keys to different reducers (sorry, I am still using the old API, so probably there is something new in the trunk release). Basically you try to segment the keys into different zones, 0-10, 11-20, ...; maybe check the hashCode() function
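A minimal old-API sketch of such a range ("zone") partitioner, with an illustrative zone width and IntWritable keys assumed:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Old-API partitioner that sends key ranges ("zones") to different reducers,
// so that consecutive key ranges end up in consecutive output files.
public class RangePartitioner implements Partitioner<IntWritable, Text> {

  private static final int ZONE_WIDTH = 10;  // illustrative zone size

  @Override
  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    int zone = key.get() / ZONE_WIDTH;
    return zone % numPartitions;  // keep the result inside [0, numPartitions)
  }
}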

Re: Test, please respond

2011-03-22 Thread Shi Yu
Actually you could see your own post using google after you posted it. Then you are sure it is not swallowed by the network ... :) Shi On 3/22/2011 12:23 PM, Aaron Baff wrote: Ok, thanks. Guess I'm just having no luck getting my posts replied to. Aaron Baff | Developer | Telescope, Inc.

battle of LZO on hadoop 0.19.2 and 0.20.2

2011-03-22 Thread Shi Yu
Hi. As mentioned in the previous post, I tried to extend some legacy programs built on hadoop 0.19.2 to apply Lzo compression. I had tons of problems (logical errors and troubles in implementation). After spending a whole week, I finally feel I am sorting things out; however, there are still

Re: can't find lzo headers when ant compile hadoop package

2011-03-21 Thread Shi Yu
Problem solved, two paths should be set: export C_INCLUDE_PATH=/path_of_lzo_output/include export LIBRARY_PATH=/path_of_lzo_output/lib and enable shared when configuring the lzo compile: ./configure -enable-shared -prefix=/path_of_lzo_output/ Shi On 3/19/2011 1:16 PM, Shi Yu wrote: Trying

lots of LZO errors in mapper and name of file split

2011-03-21 Thread Shi Yu
Hi, My hadoop distribution is 0.20.2. I had many errors when compressing output with LZO (see the stack trace at the end). I disabled the compression of mapper output. The native code seems to have been loaded correctly, but during the mapper stage lots of errors popped up. The program didn't break and

can't find lzo headers when ant compile hadoop package

2011-03-19 Thread Shi Yu
Trying to install LZO and compile the hadoop package following the instructions at http://sudhirvn.blogspot.com/2010/07/installing-hadoop-native-libraries.html I don't have root privileges, thus no sudo and no rpm installation is possible. So I built and installed the LZO source in my home folder. The

Re: java.io.IOException: Task process exit with nonzero status of 134

2011-03-09 Thread Shi Yu
like to know how should I solve this problem. Should I upgrade anything? I guess this problem is not new. Thanks for the information. Shi On 3/8/2011 4:04 PM, Shi Yu wrote: What is the true reason of causing this? I realized there are many reports on web, but couldn't find the exact solution

java.io.IOException: Task process exit with nonzero status of 134

2011-03-08 Thread Shi Yu
What is the true reason causing this? I realized there are many reports on the web, but I couldn't find the exact solution. I have this problem when using compressed sequence file output. SequenceFileOutputFormat.setCompressOutput(conf, true);

Reduce progress goes backward?

2011-02-01 Thread Shi Yu
Hi, I observe that sometimes the map/reduce progress is going backward. What does this mean? 11/02/01 12:57:51 INFO mapred.JobClient: map 100% reduce 99% 11/02/01 12:59:14 INFO mapred.JobClient: map 100% reduce 98% 11/02/01 12:59:45 INFO mapred.JobClient: map 100% reduce 99% 11/02/01

Re: Reduce function

2010-10-18 Thread Shi Yu
How many tags do you have? If you have several tags, you'd better create a Vector class to hold those tags, and define a sum function to increment the values of the tags. Then the value class should be your new Vector class. That's better and more decent than the TextPair approach. Shi On
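A sketch of such a vector value class as a custom Writable, assuming a fixed, illustrative number of tags; the class and method names are placeholders:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TagVectorWritable implements Writable {

  // The number of tags must be a fixed, agreed-upon constant so that
  // readFields() consumes exactly what write() produced.
  public static final int NUM_TAGS = 5;  // illustrative

  private final int[] counts = new int[NUM_TAGS];

  public void increment(int tagIndex) {
    counts[tagIndex]++;
  }

  // The "sum function": add another vector's counts into this one (used in the reducer).
  public void add(TagVectorWritable other) {
    for (int i = 0; i < NUM_TAGS; i++) {
      counts[i] += other.counts[i];
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    for (int c : counts) {
      out.writeInt(c);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    for (int i = 0; i < NUM_TAGS; i++) {
      counts[i] = in.readInt();
    }
  }
}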

Re: Multiple Input Data Processing using MapReduce

2010-10-14 Thread Shi Yu
Hi Matthew, I have the same problem here (see http://www.listware.net/201009/hadoop-common-user/81228-return-a-parameter-using-map-only.html). I was planning to use a join mapper (or a mapper chain) to handle two different inputs. The problem was that the mapper seems unable to return a value directly to

Re: load a serialized object in hadoop

2010-10-13 Thread Shi Yu
As a follow-up to my own question: I think invoking the JVM in Hadoop requires much more memory than an ordinary JVM. I found that instead of serializing the object, maybe I could create a MapFile as an index to permit lookups by key in Hadoop. I have also compared the performance of
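A minimal sketch of the MapFile lookup idea, assuming Text keys and values and a placeholder directory name:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// A MapFile is a sorted SequenceFile plus an index, so single keys can be
// looked up without deserializing one big in-memory object.
public class MapFileLookup {
  public static String lookup(Configuration conf, String mapFileDir, String key)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
    try {
      Text value = new Text();
      // get() returns null when the key is not present.
      if (reader.get(new Text(key), value) != null) {
        return value.toString();
      }
      return null;
    } finally {
      reader.close();
    }
  }
}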

Re: load a serialized object in hadoop

2010-10-13 Thread Shi Yu
Here is my code. There is no Map/Reduce in it. I could run this code using java -Xmx1000m; however, when using bin/hadoop -D mapred.child.java.opts=-Xmx3000M it gives an out-of-heap-space error. I have tried other programs in Hadoop with the same settings, so the memory is available in my

Re: load a serialized object in hadoop

2010-10-13 Thread Shi Yu
Hi, thanks for the advice. I tried with your settings, $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m but there is still no effect. Or is this a system variable? Should I export it? How do I configure it? Shi java -Xms3G -Xmx3G -classpath

Re: load a serialized object in hadoop

2010-10-13 Thread Shi Yu
Hi, I tried the following five ways: Approach 1: on the command line: HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest Approach 2: I added the following element to the hadoop-site.xml file. Each time I changed it, I stopped and restarted hadoop on all the nodes. ... property

Re: load a serialized object in hadoop

2010-10-13 Thread Shi Yu
Hi, I got it: it should be declared in hadoop-env.sh as export HADOOP_CLIENT_OPTS=-Xmx4000m Thanks! At the same time I see corrections coming in. Shi On 2010-10-13 18:18, Shi Yu wrote: Hi, I tried the following five ways: Approach 1: on the command line: HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop

Negative length is not supported error in Hadoop MapReduce

2010-10-06 Thread Shi Yu
Hi, The input in HDFS is a directory containing 890 files (the biggest one 23M, the smallest one 145K, average size 10M). It seems that I have reached some limit of HDFS because all the files after a certain number (594) could not be loaded. For example, the full run of my code produces the following error:

Re: conf.setCombinerClass in Map/Reduce

2010-10-05 Thread Shi Yu
Hi, thanks for the answer, Antonio. I have found one of the main problems. It was because I used MultipleOutputs in the Reduce class, so when I set the Combiner and the Reducer, the Combiner would not provide a normal data flow to the Reducer. Therefore, the program stalls at the Combiner and

Total input paths number and output

2010-10-02 Thread Shi Yu
Hi, I am running some code on a cluster with several nodes (ranging from 1 to 30) using hadoop-0.19.2. In a test, I only put a single file under the input folder; however, each time I find that the logged total input paths to process is 2 (not 1). INFO mapred.FileInputFormat: Total input paths

Re: Multiple masters in hadoop

2010-09-29 Thread Shi Yu
The Master that appears in the Masters and Slaves files is a machine name or IP address. If you have a single cluster, specifying multiple names in those files will cause errors because of connection failures. Shi On 2010-9-29 15:28, Bhushan Mahale wrote: Hi, The master files name in

jdbc in Hadoop mapper

2010-09-24 Thread Shi Yu
Hi, I tried to combine an in-memory mysql database with Mapreduce to do some value exchanges. In the Mapper, I declare the mysql driver like this: import com.mysql.jdbc.*; import java.sql.DriverManager; import java.sql.SQLException; String driver
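A sketch of opening the connection once per task in setup() and closing it in cleanup() (new API); the driver class name matches the import above, while the URL and credentials are placeholders, and the connector jar must be on the task classpath (e.g. via -libjars):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JdbcLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Connection connection;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Load the MySQL driver and open one connection per task attempt.
      Class.forName("com.mysql.jdbc.Driver");
      connection = DriverManager.getConnection(
          "jdbc:mysql://dbhost:3306/lookupdb", "user", "password");
    } catch (ClassNotFoundException e) {
      throw new IOException("MySQL driver not found on task classpath", e);
    } catch (SQLException e) {
      throw new IOException("Could not open JDBC connection", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... use 'connection' to look up values for this record, then emit ...
    context.write(new Text("placeholder"), value);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      if (connection != null) {
        connection.close();
      }
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}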

return a parameter using Map only

2010-09-22 Thread Shi Yu
Dear Hadoopers, I am stuck on a probably very simple problem but can't figure it out. In the Hadoop Map/Reduce framework, I want to search a huge file (which is generated by another Reduce task) for a unique line of record (a String, double value pair actually). That record is expected to be