Re: Which version to learn?

2014-04-22 Thread Fengyun RAO
try CDH5 2014-04-21 14:11 GMT+08:00 老赵 laozh...@sina.cn: Hello, I am new to Hadoop and am now learning hadoop-1.2.1, but the stable version is 2.2.0. I want to find a job working with Hadoop. Which one should I master? Thank you all.

recommended block replication for small cluster

2014-04-03 Thread Fengyun RAO
I know the default replication is 3, which ensures reliability even when 2 nodes crash at the same time. However, for a small cluster, e.g. 10~20 nodes, the probability that 2 nodes crash at the same time is very small. Can we simply set the replication to 2, or are there other drawbacks? any
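A back-of-envelope sketch of the trade-off in the question (this is not Hadoop code; the class and method names are ours, and the model assumes exactly k simultaneous, uniformly random node failures): a block is lost only if every one of its replicas sits on a failed node, so the per-block loss probability is C(k, r) / C(n, r).

```java
// Hypothetical sketch: probability that a specific block is lost when
// k of n DataNodes die at once, given replication factor r.
public class BlockLossOdds {
    // Exact binomial coefficient C(n, k); division is exact at each step.
    static long choose(int n, int k) {
        if (k < 0 || k > n) return 0;
        long c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    // Block lost iff all r replica nodes are among the k failed ones.
    static double pBlockLost(int nodes, int failed, int replication) {
        return (double) choose(failed, replication) / choose(nodes, replication);
    }

    public static void main(String[] args) {
        // 10-node cluster, 2 simultaneous failures:
        System.out.printf("r=2: %.4f%n", pBlockLost(10, 2, 2)); // ~0.0222 per block
        System.out.printf("r=3: %.4f%n", pBlockLost(10, 2, 3)); // 0 -- a third replica survives
    }
}
```

Note that ~2.2% per block compounds quickly: with thousands of blocks on a 10-node cluster, losing *some* block under a double failure becomes near-certain with replication 2.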

Re: recommended block replication for small cluster

2014-04-03 Thread Fengyun RAO
Peyman Mohajerian mohaj...@gmail.com: The reason for replication also has to do with data locality in a larger cluster for running a map-reduce jobs. You can reduce the replication, that's why it's a configurable parameter. On Thu, Apr 3, 2014 at 7:10 AM, Fengyun RAO raofeng...@gmail.com

Re: YarnException: Unauthorized request to start container. This token is expired.

2014-04-02 Thread Fengyun RAO
, at 17:37, Fengyun RAO raofeng...@gmail.com wrote: What does this exception mean? I googled a lot, and all the results tell me it's because the time is not synchronized between the datanode and namenode. However, I checked all the servers: the ntpd service is on, and the time differences are less

Re: YarnException: Unauthorized request to start container. This token is expired.

2014-04-02 Thread Fengyun RAO
I've found the jira page: https://issues.apache.org/jira/browse/YARN-180 though I don't quite understand container reservation. 2014-04-02 21:55 GMT+08:00 Fengyun RAO raofeng...@gmail.com: thank you, omkar, I'm new to Hadoop, and all the settings are default, so I guess the expiration

Re: Container state transition questions

2014-04-02 Thread Fengyun RAO
same for me: all mappers end with 143, and I've no idea what it means. 2014-04-03 8:45 GMT+08:00 Azuryy Yu azury...@gmail.com: Hi, Is it normal for each container to end with TERMINATED(143)? The whole MR job is successful, but all containers in the map phase end with 143. There are no any

Re: how to solve reducer memory problem?

2014-04-02 Thread Fengyun RAO
It doesn't need 20 GB of memory. The reducer doesn't load all data into memory at once; instead it uses the disk, since it does a merge sort. 2014-04-03 8:04 GMT+08:00 Li Li fancye...@gmail.com: I have a map reduce program that does some matrix operations. in the reducer, it will average many
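The point about the reducer using disk can be illustrated with the k-way merge it performs over sorted spill runs: only one head element per run needs to be in memory at a time, so the total data size never has to fit in RAM. This is a simplified sketch, not Hadoop's actual merge code (real segments live on disk; here the "runs" are just sorted lists):

```java
import java.util.*;

public class KWayMerge {
    // Merge k already-sorted runs using a min-heap of run heads.
    // Heap entry layout: {value, runIndex, positionInRun}.
    static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<int[]> heap =
                new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++)
            if (!runs.get(r).isEmpty()) heap.add(new int[]{runs.get(r).get(0), r, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]);                       // emit the smallest head
            int next = e[2] + 1;                 // advance that run by one
            if (next < runs.get(e[1]).size())
                heap.add(new int[]{runs.get(e[1]).get(next), e[1], next});
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
                Arrays.asList(1, 4, 9), Arrays.asList(2, 3, 8), Arrays.asList(5, 7))));
    }
}
```

Memory use is O(k) for k runs, independent of total input size, which is why averaging a 20 GB group does not require 20 GB of reducer heap.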

YarnException: Unauthorized request to start container. This token is expired.

2014-03-23 Thread Fengyun RAO
What does this exception mean? I googled a lot, and all the results tell me it's because the time is not synchronized between the datanode and namenode. However, I checked all the servers: the ntpd service is on, and the time differences are less than 1 second. What's more, the tasks are not always

Re: YarnException: Unauthorized request to start container. This token is expired.

2014-03-23 Thread Fengyun RAO
so you think it's related to the CDH version, not a common Hadoop problem? 2014-03-23 18:56 GMT+08:00 Azuryy azury...@gmail.com: Hi, Please send to the CDH mail list, you cannot get answer here. Sent from my iPhone5s On 2014年3月23日, at 17:37, Fengyun RAO raofeng...@gmail.com wrote

What's the best practice for managing Hadoop dependencies?

2014-03-09 Thread Fengyun RAO
First of all, I should mention that I use CDH5 beta and manage the project using Maven, and I have googled and read a lot, e.g. https://issues.apache.org/jira/browse/MAPREDUCE-1700 http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/ I believe the

MapReduce: How to output multiple Avro files?

2014-03-06 Thread Fengyun RAO
Our input is lines of text, each of which may be parsed into, e.g., an A or a B object. We want all A objects written to A.avro files and all B objects written to B.avro. I looked into the AvroMultipleOutputs class: http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html There

Re: MapReduce: How to output multiple Avro files?

2014-03-06 Thread Fengyun RAO
adding the Avro user mailing list 2014-03-06 16:09 GMT+08:00 Fengyun RAO raofeng...@gmail.com: Our input is lines of text, each of which may be parsed into, e.g., an A or a B object. We want all A objects written to A.avro files and all B objects written to B.avro. I looked into the AvroMultipleOutputs class: http

Re: MapReduce: How to output multiple Avro files?

2014-03-06 Thread Fengyun RAO
J ha...@cloudera.com: If you have a reducer involved, you'll likely need a common map output data type that both A and B can fit into. On Thu, Mar 6, 2014 at 12:09 AM, Fengyun RAO raofeng...@gmail.com wrote: our input is a line of text which may be parsed to e.g. A or B object. We want all

Re: MapReduce: How to output multiple Avro files?

2014-03-06 Thread Fengyun RAO
in the mappers, using the avro datum as the value. 3) Figure out what the avro schema is for each datum and write out the data in the reducer. Thanks, Alan *From:* Fengyun RAO [mailto:raofeng...@gmail.com] *Sent:* Thursday, March 06, 2014 2:14 AM *To:* user@hadoop.apache.org; u

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-02 Thread Fengyun RAO
flume agent to receive logs as and when they are generated, and you will be directly dumping to hdfs. If you want to remove unwanted logs you can write a custom sink before dumping to hdfs. I suppose this would be much easier On 2 Mar 2014 12:34, Fengyun RAO raofeng...@gmail.com wrote: Thanks

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
Thanks, but how do we set the reducer number to X? X depends on the input (run time), which is unknown at job configuration (compile time). 2014-03-01 17:44 GMT+08:00 AnilKumar B akumarb2...@gmail.com: Hi, Write a custom partitioner on the timestamp and, as you mentioned, set #reducers to X.

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
getmerge function to collate the results into one file. On Mar 1, 2014 4:59 AM, Fengyun RAO raofeng...@gmail.com wrote: Thanks, but how do we set the reducer number to X? X depends on the input (run time), which is unknown at job configuration (compile time). 2014-03-01 17:44 GMT+08:00 AnilKumar B

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
to, and it will be sorted by MR when it reaches the reducer Even with this, you can still use MultipleOutputs to customize the file name each reducer generates for better usability, i.e. instead of part-r-x have it generate mmddhh-r-0. -Simon On Sat, Mar 1, 2014 at 10:13 PM, Fengyun RAO
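The "custom partitioner on timestamp" idea from this thread can be sketched in plain Java (a hypothetical sketch, not a working job: the class and method names are ours, and in a real job this logic would live in a `Partitioner` subclass, with MultipleOutputs handling the per-hour file names):

```java
public class HourBucket {
    // Derive an hour bucket from a log timestamp:
    // "2014-02-28 13:45:01" -> "2014022813"
    static String hourOf(String timestamp) {
        return timestamp.substring(0, 13).replaceAll("[- :]", "");
    }

    // Stable, non-negative mapping of an hour bucket onto numReducers
    // partitions, so every record from the same hour reaches the same reducer.
    static int partition(String hourBucket, int numReducers) {
        return (hourBucket.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String h = hourOf("2014-02-28 13:45:01");
        System.out.println(h + " -> reducer " + partition(h, 24));
    }
}
```

With distinct hours as keys and one partition per hour, each reducer sees exactly one hour's records; the reducer can then name its output after the bucket (the mmddhh-style names mentioned above) instead of the default part-r-* convention.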

Re: What if file format is dependent upon first few lines?

2014-02-28 Thread Fengyun RAO
. On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO raofeng...@gmail.com wrote: thanks, Harsh. could you specify more detail, or give some links or an example where I can start? 2014-02-27 22:17 GMT+08:00 Harsh J ha...@cloudera.com: A mapper's record reader implementation need

Re: YARN - Running Client with third party jars

2014-02-28 Thread Fengyun RAO
read this: http://stackoverflow.com/questions/574594/how-can-i-create-an-executable-jar-with-dependencies-using-maven 2014-02-26 13:22 GMT+08:00 Anand Mundada anandmund...@ymail.com: Hi I want to use json jar in client code. I tried to create runnable jar which include all required jars. But

Map-Reduce: How to make MR output one file an hour?

2014-02-28 Thread Fengyun RAO
It's a common web log analysis situation. The original web log is saved every hour on multiple servers. Now we would like the parsed log results to be saved as one file per hour. How can we do that? In our MR job, the input is a directory with many files spanning many hours, let's say 4X files in X hours. if

What if file format is dependent upon first few lines?

2014-02-27 Thread Fengyun RAO
Below is a fake sample of Microsoft IIS log: #Software: Microsoft Internet Information Services 7.5 #Version: 1.0 #Date: 2013-07-04 20:00:00 #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken

Re: What if file format is dependent upon first few lines?

2014-02-27 Thread Fengyun RAO
- you can always seek(0), read the lines you need to prepare, then seek(offset) and continue reading. Apache Avro (http://avro.apache.org) has a similar format - header contains the schema a reader needs to work. On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO raofeng...@gmail.com wrote: Below
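The header-dependent parsing discussed in this thread comes down to reading the `#Fields:` directive first, then using it to index every data row. Here is a plain-Java sketch of just that step (not a Hadoop RecordReader; class and method names are ours — in MR, the reader would seek(0) to load this header before processing its own split, as the reply suggests):

```java
import java.util.*;

public class W3CFields {
    // Parse an IIS/W3C "#Fields:" directive into column positions:
    // "#Fields: date time c-ip" -> {date=0, time=1, c-ip=2}
    static Map<String, Integer> parseFields(String directive) {
        Map<String, Integer> idx = new LinkedHashMap<>();
        String[] names = directive.substring("#Fields:".length()).trim().split("\\s+");
        for (int i = 0; i < names.length; i++) idx.put(names[i], i);
        return idx;
    }

    // Look up one named field in a whitespace-delimited data row.
    static String field(String row, Map<String, Integer> idx, String name) {
        return row.split("\\s+")[idx.get(name)];
    }

    public static void main(String[] args) {
        Map<String, Integer> idx = parseFields("#Fields: date time c-ip sc-status");
        System.out.println(field("2013-07-04 20:00:01 10.0.0.1 200", idx, "sc-status"));
    }
}
```

Because the directive can reappear mid-file when IIS changes its field set, a real reader would rebuild this map every time it encounters a new `#Fields:` line.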

Re: any suggestions on IIS log storage and analysis?

2014-01-02 Thread Fengyun RAO
example, so the fact that you have file content being split doesn't impact your analysis if you have inter-dependencies. On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO raofeng...@gmail.com wrote: Thanks, I understand now, but I don't think this is what we need. The IIS log files are very big

any suggestions on IIS log storage and analysis?

2013-12-30 Thread Fengyun RAO
Hi, HDFS splits files into blocks, and MapReduce runs a map task for each block. However, fields can change within IIS log files, which means fields in one block may depend on another, making the files unsuitable for a MapReduce job. It seems there should be some preprocessing before storing and

Re: any suggestions on IIS log storage and analysis?

2013-12-30 Thread Fengyun RAO
sets into one data set. then analyze the joined dataset. On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO raofeng...@gmail.com wrote: Hi, HDFS splits files into blocks, and mapreduce runs a map task for each block. However, Fields could be changed in IIS log files, which means fields in one

Re: any suggestions on IIS log storage and analysis?

2013-12-30 Thread Fengyun RAO
Thanks, Yong! The dependency never crosses files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write an MR job. I don't quite understand what you mean by WholeFileInputFormat. Actually, I have no idea how to deal with dependencies across blocks. 2013/12/31

Re: Writing to remote HDFS using C# on Windows

2013-12-10 Thread Fengyun RAO
, 2013 at 1:58 AM, Fengyun RAO raofeng...@gmail.com wrote: Thanks! I tried WebHDFS, which also works well if I copy local files to HDFS, but I still can't find a way to open a filestream and write to it. 2013/12/6 Vinod Kumar Vavilapalli vino...@hortonworks.com You can try using WebHDFS

Writing to remote HDFS using C# on Windows

2013-12-05 Thread Fengyun RAO
Hi, all. Is there a way to write files into a remote HDFS on Linux using C# on Windows? We want to use HDFS as data storage. We know there is an HDFS Java API, but not a C# one. We tried SAMBA for file sharing and FUSE for mounting HDFS. It worked if we simply copied files to HDFS, but if we open a filestream

Re: Writing to remote HDFS using C# on Windows

2013-12-05 Thread Fengyun RAO
Thanks! I tried WebHDFS, which also works well if I copy local files to HDFS, but I still can't find a way to open a filestream and write to it. 2013/12/6 Vinod Kumar Vavilapalli vino...@hortonworks.com You can try using WebHDFS. Thanks, +Vinod On Thu, Dec 5, 2013 at 6:04 PM, Fengyun RAO
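For the streaming-write problem in this thread, WebHDFS does support it via a two-step HTTP write: an op=CREATE PUT to the NameNode answers with a 307 redirect to a DataNode, and the file bytes go in a second PUT to that redirect location. Since it is plain HTTP, any language with an HTTP client (including C# on Windows) can drive it. Below is a sketch of building the first request URL; the host, port, path, and user are made-up examples:

```java
public class WebHdfsUrl {
    // Build the step-1 CREATE request URL. WebHDFS paths are rooted at
    // /webhdfs/v1, with the operation and user passed as query parameters.
    static String createUrl(String host, int port, String path, String user) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=CREATE&user.name=%s",
                host, port, path, user);
    }

    public static void main(String[] args) {
        System.out.println(createUrl("namenode.example.com", 50070, "/logs/a.txt", "hdfs"));
        // Step 2 (not shown): PUT the file bytes to the Location header of
        // the 307 response, e.g. with HttpURLConnection in Java or
        // HttpWebRequest in C#.
    }
}
```

The same pattern with op=APPEND lets a client keep pushing data to an existing file, which is closer to the "open a filestream and write to it" behavior the thread is after.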