try CDH5
2014-04-21 14:11 GMT+08:00 老赵 laozh...@sina.cn:
Hello, I am new to Hadoop and am currently learning hadoop-1.2.1,
but the stable version is now 2.2.0. I want to find a job working with Hadoop.
Which one should I focus on mastering?
Thank you all.
I know the default replication is 3, which ensures reliability even when 2 nodes
crash at the same time.
However, for a small cluster, e.g. 10~20 nodes, the probability that 2
nodes crash at the same time is very small.
Can we simply set the replication to 2, or are there other drawbacks?
Peyman Mohajerian mohaj...@gmail.com:
The reason for replication also has to do with data locality in a larger
cluster when running map-reduce jobs. You can reduce the replication;
that's why it's a configurable parameter.
On Thu, Apr 3, 2014 at 7:10 AM, Fengyun RAO raofeng...@gmail.com
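For reference, the cluster-wide default can be lowered in hdfs-site.xml (a minimal sketch; the new value only applies to files written after the change):

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Existing files can be changed per path, e.g. with hdfs dfs -setrep 2 /some/path.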
, at 17:37, Fengyun RAO raofeng...@gmail.com wrote:
What does this exception mean? I googled a lot, and all the results tell me
it's because the time is not synchronized between the datanode and the namenode.
However, I checked all the servers: the ntpd service is on, and
the time differences are less than 1 second.
I've found the jira page: https://issues.apache.org/jira/browse/YARN-180
though I don't quite understand container reservation.
2014-04-02 21:55 GMT+08:00 Fengyun RAO raofeng...@gmail.com:
Thank you, Omkar.
I'm new to Hadoop, and all the settings are default, so I guess the
expiration
Same for me: all mappers end with 143.
I have no idea what it means.
2014-04-03 8:45 GMT+08:00 Azuryy Yu azury...@gmail.com:
Hi,
Is it normal for each container to end with TERMINATED(143)?
The whole MR job is successful, but all containers in the map phase end
with 143.
There aren't any
It doesn't need 20 GB of memory.
The reducer doesn't load all data into memory at once; instead it would use the
disk, since it does a merge sort.
2014-04-03 8:04 GMT+08:00 Li Li fancye...@gmail.com:
I have a map reduce program that does some matrix operations. In the
reducer, it will average many
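To make the streaming behavior concrete, here is a minimal sketch of a reducer that averages the values for each key while holding only a running sum in memory (key and value types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages all values for a key; the iterable is streamed by the framework,
// so only the running sum and count are ever held in memory.
public class AverageReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    long count = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
      count++;
    }
    context.write(key, new DoubleWritable(sum / count));
  }
}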
What does this exception mean? I googled a lot, and all the results tell me
it's because the time is not synchronized between the datanode and the namenode.
However, I checked all the servers: the ntpd service is on, and the
time differences are less than 1 second.
What's more, the tasks are not always
so you think it's related to the CDH version, not a common Hadoop problem?
2014-03-23 18:56 GMT+08:00 Azuryy azury...@gmail.com:
Hi,
Please send this to the CDH mailing list; you cannot get an answer here.
Sent from my iPhone5s
On Mar 23, 2014, at 17:37, Fengyun RAO raofeng...@gmail.com wrote:
First of all, I should mention that I am using the CDH5 beta and managing the
project with Maven, and I have googled and read a lot, e.g.
https://issues.apache.org/jira/browse/MAPREDUCE-1700
http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
I believe the
Our input is a line of text which may be parsed into, e.g., an A or a B object.
We want all A objects written to A.avro files, and all B objects
written to B.avro files.
I looked into AvroMultipleOutputs class:
http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html
There
Adding the Avro user mailing list.
2014-03-06 16:09 GMT+08:00 Fengyun RAO raofeng...@gmail.com:
Our input is a line of text which may be parsed into, e.g., an A or a B object.
We want all A objects written to A.avro files, and all B objects
written to B.avro files.
I looked into AvroMultipleOutputs class:
http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html
Harsh J ha...@cloudera.com:
If you have a reducer involved, you'll likely need a common map output
data type that both A and B can fit into.
On Thu, Mar 6, 2014 at 12:09 AM, Fengyun RAO raofeng...@gmail.com wrote:
our input is a line of text which may be parsed to e.g. A or B object.
We want all
in the mappers, using the avro datum as the
value.
3) Figure out what the avro schema is for each datum and write out
the data in the reducer.
Thanks,
Alan
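A reducer-side sketch of that last step, using AvroMultipleOutputs with one named output per schema (the parse and type-check helpers are hypothetical; wire in your own):

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Driver side, one call per schema (AvroKeyOutputFormat is in avro-mapred):
//   AvroMultipleOutputs.addNamedOutput(job, "A", AvroKeyOutputFormat.class, schemaA, null);
//   AvroMultipleOutputs.addNamedOutput(job, "B", AvroKeyOutputFormat.class, schemaB, null);

public class SplitByTypeReducer
    extends Reducer<Text, Text, AvroKey<GenericRecord>, NullWritable> {

  private AvroMultipleOutputs amos;

  @Override
  protected void setup(Context context) {
    amos = new AvroMultipleOutputs(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> lines, Context context)
      throws IOException, InterruptedException {
    for (Text line : lines) {
      GenericRecord record = parse(line);            // hypothetical parser
      String output = isTypeA(record) ? "A" : "B";   // hypothetical type check
      amos.write(output, new AvroKey<>(record), NullWritable.get());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    amos.close();  // flushes both named outputs
  }

  // Hypothetical helpers; real logic depends on the input format.
  private GenericRecord parse(Text line) { throw new UnsupportedOperationException(); }
  private boolean isTypeA(GenericRecord r) { return "A".equals(r.getSchema().getName()); }
}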
*From:* Fengyun RAO [mailto:raofeng...@gmail.com]
*Sent:* Thursday, March 06, 2014 2:14 AM
*To:* user@hadoop.apache.org; u
flume agent to receive logs as they are generated, dumping them directly to
hdfs.
If you want to remove unwanted logs you can write a custom sink before
dumping to hdfs.
I suppose this would be much easier.
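A minimal sketch of such an agent (all names and paths are illustrative; the custom sink or an interceptor would slot in where events are filtered):

# flume agent: spool a log directory straight into hourly HDFS paths
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/weblogs
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/logs/%Y%m%d%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true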
On 2 Mar 2014 12:34, Fengyun RAO raofeng...@gmail.com wrote:
Thanks
Thanks, but how do we set the reducer number to X? X depends on the input
(run time), which is unknown at job configuration (compile time).
2014-03-01 17:44 GMT+08:00 AnilKumar B akumarb2...@gmail.com:
Hi,
Write a custom partitioner on the timestamp and, as you mentioned, set
#reducers to X.
Use the getmerge function to collate the results into one file.
On Mar 1, 2014 4:59 AM, Fengyun RAO raofeng...@gmail.com wrote:
Thanks, but how do we set the reducer number to X? X depends on the input
(run time), which is unknown at job configuration (compile time).
2014-03-01 17:44 GMT+08:00 AnilKumar B
to, and it will be sorted by
MR when it reaches the reducer
Even with this, you can still use MultipleOutputs to customize the file
name each reducer generates for better usability, i.e. instead of
part-r-x have it generate mmddhh-r-0.
-Simon
On Sat, Mar 1, 2014 at 10:13 PM, Fengyun RAO
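Putting the two replies together: X can be computed by listing the input before the job is submitted (job configuration actually happens at run time, not compile time), and a custom partitioner then routes records by hour. A minimal sketch, assuming the map output key carries a sortable yyyyMMddHH prefix (the key layout is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to a reducer by its hour bucket.
public class HourPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String hour = key.toString().substring(0, 10);   // "yyyyMMddHH"
    return (hour.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Driver side, once X is known from listing the input:
//   job.setNumReduceTasks(X);
//   job.setPartitionerClass(HourPartitioner.class);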
On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO raofeng...@gmail.com wrote:
Thanks, Harsh.
Could you give more detail, or some links or an example where I
can start?
2014-02-27 22:17 GMT+08:00 Harsh J ha...@cloudera.com:
A mapper's record reader implementation need
read this:
http://stackoverflow.com/questions/574594/how-can-i-create-an-executable-jar-with-dependencies-using-maven
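For reference, the approach described in that link boils down to the maven-assembly-plugin's jar-with-dependencies descriptor; a minimal pom fragment (the mainClass value is illustrative):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.example.MyDriver</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>

Running mvn package then produces an extra *-jar-with-dependencies.jar containing all runtime dependencies.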
2014-02-26 13:22 GMT+08:00 Anand Mundada anandmund...@ymail.com:
Hi, I want to use a JSON jar in client code.
I tried to create a runnable jar which includes all the required jars.
But
It's a common web log analysis situation. The original web log is saved
every hour on multiple servers.
Now we would like the parsed log results to be saved as one file per hour. How
do we do that?
In our MR job, the input is a directory with many files spanning many hours,
let's say 4X files in X hours.
if
Below is a fake sample of Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
time-taken
-
you can always seek(0), read the header lines you need to prepare, then
seek(offset) and continue reading.
Apache Avro (http://avro.apache.org) has a similar format: the header
contains the schema a reader needs in order to interpret the data.
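A minimal sketch of that idea in a mapper's setup(), re-reading the directives at offset 0 of the file backing the current split (class and field names are illustrative; this assumes the relevant #Fields line sits at the top of the file):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IisLogMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String fieldsHeader;  // column layout for this file's records

  @Override
  protected void setup(Context context) throws IOException {
    // Re-open the file behind this split and read the directives at offset 0,
    // so a split that starts mid-file still knows the field layout.
    FileSplit split = (FileSplit) context.getInputSplit();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null && line.startsWith("#")) {
        if (line.startsWith("#Fields:")) {
          fieldsHeader = line;
        }
      }
    }
  }
  // map() would then split each record according to fieldsHeader.
}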
On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO raofeng...@gmail.com wrote:
Below
example, so the
fact that you have file content being split doesn't impact your analysis
unless you have inter-dependencies.
On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO raofeng...@gmail.com wrote:
Thanks, I understand now, but I don't think this is what we need. The IIS
log files are very big
Hi,
HDFS splits files into blocks, and mapreduce runs a map task for each
block. However, the fields can change within IIS log files, which means
fields in one block may depend on another block, and that makes the files
unsuitable for a mapreduce job as-is. It seems there should be some
preprocessing before storing, and
sets into one data set,
then analyze the joined dataset.
On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO raofeng...@gmail.com wrote:
Hi,
HDFS splits files into blocks, and mapreduce runs a map task for each
block. However, the fields can change within IIS log files, which means
fields in one
Thanks, Yong!
The dependence never crosses files, but since HDFS splits files into blocks,
it may cross blocks, which makes it difficult to write an MR job. I don't
quite understand what you mean by WholeFileInputFormat. Actually, I have
no idea how to deal with dependence across blocks.
2013/12/31
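For context, WholeFileInputFormat is not a stock Hadoop class but a common pattern: an input format whose isSplitable() returns false, so each file becomes a single split and one mapper reads it start to finish, which removes the cross-block dependence. A minimal sketch:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split per file: directives always precede the records they describe.
public class WholeFileInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split, however many blocks the file spans
  }
}

The trade-off is lost parallelism within a file and more remote block reads, which matters when the IIS log files are very big.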
Hi all,
Is there a way to write files into a remote HDFS cluster on Linux using C# on
Windows? We want to use HDFS as data storage.
We know there is an HDFS Java API, but not a C# one. We tried SAMBA for file
sharing and FUSE for mounting HDFS. Both worked if we simply copy files to
HDFS, but if we open a filestream
Thanks!
I tried WebHDFS, which also works well if I copy local files to HDFS, but I
still can't find a way to open a file stream and write to it.
2013/12/6 Vinod Kumar Vavilapalli vino...@hortonworks.com
You can try using WebHDFS.
Thanks,
+Vinod
On Thu, Dec 5, 2013 at 6:04 PM, Fengyun RAO
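For reference, WebHDFS does support streamed writes over plain HTTP: a PUT with op=CREATE to the namenode returns a 307 redirect to a datanode, and the file bytes are then PUT to the redirect location. A minimal Java sketch of the protocol (host, port, path, and user are illustrative; the same two-step flow can be reproduced from C# with HttpWebRequest):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebHdfsWrite {
  public static void main(String[] args) throws Exception {
    // Step 1: ask the namenode where to write; it answers with a 307 redirect.
    URL create = new URL("http://namenode.example.com:50070/webhdfs/v1/tmp/demo.txt"
        + "?op=CREATE&overwrite=true&user.name=hdfs");
    HttpURLConnection nn = (HttpURLConnection) create.openConnection();
    nn.setRequestMethod("PUT");
    nn.setInstanceFollowRedirects(false);  // read the Location header ourselves
    String dataNodeUrl = nn.getHeaderField("Location");
    nn.disconnect();

    // Step 2: stream the content to the datanode the redirect points at.
    HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
    dn.setRequestMethod("PUT");
    dn.setDoOutput(true);
    try (OutputStream out = dn.getOutputStream()) {
      out.write("hello from a remote client\n".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + dn.getResponseCode());  // 201 Created on success
  }
}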