RE: How to get the max number of reducers in Yarn

2014-10-05 Thread java8964
You should call setNumReduceTasks in your job; there is just no such max reducer count in YARN any more. Setting the reducer count is more of an art than a science. I think there is only one rule about it: don't set the reducer number larger than the reducer input group count. Set the reducer nu
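A minimal sketch of setting the count explicitly with the mapreduce API (the job name and the value 10 are placeholders, not a recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "reducer-count-example");
            // Pick a value no larger than the expected number of reducer input groups;
            // YARN no longer enforces a cluster-wide maximum.
            job.setNumReduceTasks(10);
        }
    }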

RE: Reduce phase of wordcount

2014-10-05 Thread java8964
Don't be confused by 6.03 MB/s. The relationship between mappers and reducers is an M-to-N relationship, which means a mapper could send its data to all reducers, and one reducer could receive its input from all mappers. There could be a lot of reasons why you think the reduce copying phase is too

RE: How to get the max number of reducers in Yarn

2014-10-03 Thread java8964
In MR1, the max reducer count is a static value set in mapred-site.xml. That is the value you get in the API. In YARN, there is no such static value any more, so you can set any value you like; it is up to the RM to decide at runtime how many reducer tasks are available or can be granted to y

RE: avro mapreduce output

2014-10-03 Thread java8964
Avro data should be stored in its own binary format. Why did you get something like JSON? What output format class do you use? Yong Date: Fri, 3 Oct 2014 17:34:35 +0800 Subject: avro mapreduce output From: delim123...@gmail.com To: user@hadoop.apache.org Hi, In mapreduce with reduce output format of
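For reference, a hedged sketch of configuring binary Avro output with the avro-mapred classes (the class name, job, and string schema are illustrative assumptions, not the poster's setup):

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.mapreduce.Job;

    public class AvroBinaryOutputExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            Schema schema = Schema.create(Schema.Type.STRING);   // placeholder output schema
            // AvroKeyOutputFormat writes Avro binary container files, not JSON text.
            job.setOutputFormatClass(AvroKeyOutputFormat.class);
            AvroJob.setOutputKeySchema(job, schema);
        }
    }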

RE: AW: Extremely amount of memory and DB connections by MR Job

2014-09-29 Thread java8964
so, you need to dig into the source data for that block, to think why it will cause OOM. I am not sure about this. Is there a hint in the logs to figure it out? 3) Did you give reasonable heap size for the mapper? What it is? 9 Gb (too small??) Best regards, Blanca Von: java8964 [mai

RE: Extremely amount of memory and DB connections by MR Job

2014-09-29 Thread java8964
I don't have any experience with MongoDB, but here are my 2 cents. Your code is not efficient: it uses "+=" on String, and you could have reused the Text object in your mapper, since it is a mutable class, instead of creating it again and again with "new Text()" in the mapper.
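A minimal sketch of the reuse pattern described above (the key, value layout, and the string-building logic are illustrative; a StringBuilder stands in for the "+=" on String):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReuseTextMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Reuse mutable Text instances instead of calling "new Text()" per record.
        private final Text outKey = new Text();
        private final Text outValue = new Text();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            buffer.setLength(0);                 // reset instead of building a new String with +=
            buffer.append(value).append("|extra");
            outKey.set("some-group");            // placeholder grouping key
            outValue.set(buffer.toString());
            context.write(outKey, outValue);
        }
    }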

RE: Bzip2 files as an input to MR job

2014-09-23 Thread java8964
Georgi: I think you misunderstand the original answer. If you already use the Avro format, then the file will be splittable. If you want to add compression on top of that, feel free to go ahead. If you read the Avro DataFileWriter API: http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file
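For the DataFileWriter route, a hedged sketch of adding a compression codec on top of Avro (the schema file and output path are placeholders):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroCompressedWriterExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(new File("record.avsc")); // placeholder schema file
            DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
            writer.setCodec(CodecFactory.deflateCodec(6));   // or CodecFactory.snappyCodec()
            writer.create(schema, new File("data.avro"));    // blocks remain splittable via sync markers
            writer.close();
        }
    }

The compression is applied per Avro block, which is why the file stays splittable even when compressed.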

RE: is the HDFS BlockReceiver.PacketResponder source code wrong ?

2014-09-23 Thread java8964
Why do you say so? Does it cause a bug in your case? If so, can you explain the problem you are facing? Yong From: qixiangm...@hotmail.com To: user@hadoop.apache.org Subject: is the HDFS BlockReceiver.PacketResponder source code wrong ? Date: Tue, 23 Sep 2014 07:37:41 + In org.apache.had

RE: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-09 Thread java8964
ed in at this moment is the folder (for the local filesystem) for the data node dir. I am thinking about doing some local reads, so it will be the very first step if I know where to read the data. Demai On Tue, Sep 9, 2014 at 11:13 AM, java8964 wrote: The configuration in fact depends on the xml files.

RE: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-09 Thread java8964
The configuration in fact depends on the xml files. I am not sure what kind of cluster configuration variables/values you are looking for. Remember, the cluster is made of a set of computers, and in Hadoop there are hdfs xml, mapred xml and even yarn xml files. The mapred and yarn xml files are job related. Without
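A small sketch of why the lookup can return null: a plain Configuration only sees the resources loaded on its classpath, so a value like dfs.data.dir shows up only if the matching hdfs-site.xml is loaded (the file path below is an assumption about where the cluster keeps it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfLookupExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Returns null unless an hdfs-site.xml that defines it is on the classpath.
            System.out.println(conf.get("dfs.data.dir"));

            // Explicitly load the cluster's hdfs-site.xml (location is an assumption).
            conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
            System.out.println(conf.get("dfs.data.dir"));
        }
    }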

RE: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread java8964
If you want to use NLineInputFormat, and also want each individual file to be processed in a map task that preferably runs on the same node as its data, you need to implement and control that kind of logic yourself. Extend NLineInputFormat, override the getSplits() method, read the l

RE: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-15 Thread java8964
> > block. > > > > If the FS in use has its advantages it's better to implement a proper > > interface to it making use of them, than to rely on the LFS by mounting it. > > This is what we do with HDFS. > > > > On Aug 15, 2014 8:52 PM, "java8964"

RE: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-15 Thread java8964
I believe Calvin mentioned before that this parallel file system is mounted into the local file system. In this case, will Hadoop just use java.io.File as the local file system, treat them as local files, and not split the files? I just want to know the logic in Hadoop for handling local files. One suggest

RE: issue about run MR job use system user

2014-07-24 Thread java8964
Are you sure user 'Alex' belongs to the 'hadoop' group? Why not run the command 'id alex' to prove it? And can it be confirmed on the namenode that 'Alex' belongs to the 'hadoop' group? Yong Date: Thu, 24 Jul 2014 17:11:06 +0800 Subject: issue about run MR job use system user From: justlo...@gmail.com To: user

RE: HDFS File Writes & Reads

2014-06-19 Thread java8964
Your understanding is almost correct, but not the part you highlighted. HDFS is not designed for write performance, but the client doesn't have to wait for the acknowledgment of previous packets before sending the next packets. This webpage describes it clearly, and I hope it is help

RE: Is it a bug from CombineFileInputFormat?

2014-05-16 Thread java8964
Why do you say so? What problem did you get from this code? Yong From: yu_l...@hotmail.com To: user@hadoop.apache.org Subject: Is it a bug from CombineFileInputFormat? Date: Mon, 12 May 2014 22:10:44 -0400 Hi, This is a private static inner class from CombineFileInputFormat.java When "locations' l

RE: spilled records

2014-05-16 Thread java8964
Your first understanding is not correct. Where did you get that interpretation from the book? About the #spilled records: every record of mapper output will be spilled at least once. So in the ideal scenario, these 2 numbers should be equal. If they are not, and the spilled number is much larger than

RE: how to solve reducer memory problem?

2014-04-03 Thread java8964
There are several issues that could come together; since you know your data, we can only guess here: 1) the mapred.child.java.opts=-Xmx2g setting only works IF you didn't set "mapred.map.child.java.opts" or "mapred.reduce.child.java.opts"; otherwise, the latter ones will override the "mapred.child.java.opt
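A hedged illustration of the override rule in 1): if the map- or reduce-specific opts are set, the generic setting is ignored for that task type (the values are examples only):

    import org.apache.hadoop.conf.Configuration;

    public class ChildOptsExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("mapred.child.java.opts", "-Xmx2g");           // generic setting for all tasks
            conf.set("mapred.reduce.child.java.opts", "-Xmx512m");  // overrides the 2g for reducers only
            // Reducers would launch with a 512m heap; mappers still fall back to -Xmx2g.
        }
    }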

RE: MapReduce: How to output multiplt Avro files?

2014-03-07 Thread java8964
You may consider "SpecificRecord" or "GenericRecord" of Avor. Yong Date: Fri, 7 Mar 2014 10:29:49 +0800 Subject: Re: MapReduce: How to output multiplt Avro files? From: raofeng...@gmail.com To: user@hadoop.apache.org; ha...@cloudera.com thanks, Harsh. any idea on how to build a common map output

RE: Benchmarking Hive Changes

2014-03-05 Thread java8964
ttas anth...@mattas.net On Wed, Mar 5, 2014 at 8:47 AM, java8964 wrote: Are you running on a standalone single box? How large are your test files and how long did the jobs of each type take? Yong > From: anth...@mattas.net > Subject: Benchmarking Hive Changes > Date: Tue, 4 Mar 2014 21:3

RE: Benchmarking Hive Changes

2014-03-05 Thread java8964
Are you running on a standalone single box? How large are your test files and how long did the jobs of each type take? Yong > From: anth...@mattas.net > Subject: Benchmarking Hive Changes > Date: Tue, 4 Mar 2014 21:31:42 -0500 > To: user@hadoop.apache.org > > I’ve been trying to benchmark some of the Hi

RE: Multiple inputs for different avro inputs

2014-02-27 Thread java8964
Using the union schema is correct; it should be able to support multi-schema input. One question: why do you call setInputKeySchema? Does your job load the Avro data as the key to the following Mapper? Yong Date: Thu, 27 Feb 2014 16:13:34 +0530 Subject: Multiple inputs for different avro inpu
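A hedged sketch of building a union of two input schemas and handing it to AvroJob (the schema files and job are placeholders; the last call is only needed if the Avro records really arrive as the mapper's input key):

    import java.io.File;
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.hadoop.mapreduce.Job;

    public class UnionSchemaInputExample {
        public static void main(String[] args) throws Exception {
            Schema schemaA = new Schema.Parser().parse(new File("typeA.avsc")); // placeholder
            Schema schemaB = new Schema.Parser().parse(new File("typeB.avsc")); // placeholder
            Schema union = Schema.createUnion(Arrays.asList(schemaA, schemaB));

            Job job = Job.getInstance();
            AvroJob.setInputKeySchema(job, union);   // reader schema covering both input types
        }
    }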

RE: What if file format is dependent upon first few lines?

2014-02-27 Thread java8964
If the file is big enough and you want to split it for parallel processing, then maybe one option could be that in your mapper, you can always get the full file path from the InputSplit, then open it (the file path, which means you can read from the beginning), read the first 4 lines, and
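A rough sketch of that idea, assuming the standard FileSplit and a text file (the "4 header lines" handling and the class name are illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class HeaderAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final String[] header = new String[4];

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Get the full path of the file backing this split, then re-open it
            // from the beginning just to read the first 4 lines.
            FileSplit split = (FileSplit) context.getInputSplit();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                for (int i = 0; i < 4; i++) {
                    header[i] = reader.readLine();
                }
            }
        }
    }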

RE: Reading a file in a customized way

2014-02-25 Thread java8964
See my reply to another email today for a similar question: "RE: Can the file storage in HDFS be customized?" Thanks Yong From: sugandha@gmail.com Date: Tue, 25 Feb 2014 11:40:13 +0530 Subject: Reading a file in a customized way To: user@hadoop.apache.org Hello, Irrespective of the file blocks

RE: Can the file storage in HDFS be customized?

2014-02-25 Thread java8964
Hi, Naolekar: The blocks in HDFS just store bytes. HDFS has no idea of, nor cares about, what kind of data or how many polygons are in a block. It just stores 128M of bytes (if your default block size is set to 128M). It is your InputFormat/RecordReader that reads these bytes in and deserializes them into key/value pairs.

RE: Question about the usage of Seekable within the LineRecordReader

2014-02-19 Thread java8964
Hi, Brian: I hope I understand your question correctly. Here is my view of what the Seekable interface provides. The Seekable interface also defines the "seek(long pos)" method, which allows the client to seek to a specified position in the underlying InputStream. In the RecordReader, it will ge

RE: Hadoop native and snappy library

2014-02-11 Thread java8964
Where did you compile your libhadoop.so.1.0.0? It is more likely that you compiled libhadoop.so.1.0.0 in an environment with glibc 2.14, but tried to use it in an environment that only has glibc 2.12. If you are using a Hadoop build you compiled yourself, then it is best to compile it in an environment matching wi

RE: HBase connection hangs

2014-02-10 Thread java8964
Hi, Ted: Our environment is using a distribution from a vendor, so it is not easy just to patch it myself. But I can look into whether the vendor is willing to patch it in the next release. Before I do that, I just want to make sure patching the code is the ONLY solution. I read the source c

RE: shifting sequenceFileOutput format to Avro format

2014-02-04 Thread java8964
t;values":"string"} }} ]} And I tried creating corresponding classes by using avro tool and with plugin, but there are few errors on generated java code. What could be the issue? 1) Error: The method deepCopy(Schema, List>) is undefined for the type GenericData

RE: DistCP : Is it gauranteed to work for any two uri schemes?

2014-02-04 Thread java8964
Just as Harsh pointed out, as long as the underlying DFS provides all the required DFS APIs for Hadoop, DistCp should work. One thing is that all the required libraries (including any conf files) need to be in the classpath, if they are not already available in the runtime cluster. Same as for the S3 file syste

RE: shifting sequenceFileOutput format to Avro format

2014-01-30 Thread java8964
In Avro, you need to think about a schema to match your data. Avro's schema is very flexible and should be able to store all kinds of data. If you have a JSON string, you have 2 options to generate the Avro schema for it: 1) Use "type: string" to store the whole JSON string into Avro. This will b
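For option 1), a hedged example of a minimal schema that just wraps the raw JSON document in a single Avro string field (record and field names are illustrative):

    import org.apache.avro.Schema;

    public class JsonAsStringSchemaExample {
        public static void main(String[] args) {
            // Keep the whole JSON document as one Avro string field.
            String schemaJson =
                "{\"type\":\"record\",\"name\":\"RawEvent\","
              + "\"fields\":[{\"name\":\"json\",\"type\":\"string\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);
            System.out.println(schema.toString(true));
        }
    }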

RE: Configuring hadoop 2.2.0

2014-01-29 Thread java8964
Hi, Ognen: I noticed you were asking this question before under a different subject line. I think you need to tell us where you mean by unbalanced space: is it on HDFS or the local disk? 1) HDFS is independent of MR. They are not related to each other. 2) Without MR1 or MR2 (YARN), HDFS should wo

RE: Force one mapper per machine (not core)?

2014-01-29 Thread java8964
Or you can implement your own InputSplit and InputFormat, with which you can control which node tasks are sent to, and how many per node. You can find some detailed examples in the book "Professional Hadoop Solutions", Chapter 4. Yong > Subject: Re: Force one mapper per machine (not core)? > From: kwi.

RE: Localization feature

2014-01-24 Thread java8964
You need to be more clear about how you process the files. I think the important question is what kind of InputFormat and OutputFormat you are using in your case. If you are using the default ones, on Linux, I believe TextInputFormat and TextOutputFormat will both convert the byte array to tex

RE: Streaming jobs getting poor locality

2014-01-23 Thread java8964
y on multiple nodes in your cluster? 2) If you don't use bzip2 files as input, do you have the same problem for other types of files, like plain text files? Yong From: ken.willi...@windlogics.com To: user@hadoop.apache.org Subject: RE: Streaming jobs getting poor locality Date: Thu, 23 Jan 2014 16:0

RE: Streaming jobs getting poor locality

2014-01-23 Thread java8964
I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec is supported by Hadoop as a Java implementation, which is also a SplittableCompressionCodec. So 5G of bzip2 files generating about 45 mappers is very reasonable, assuming 128M/block. The question is why ONLY one no

RE: Is perfect control over mapper num AND split distribution possible?

2014-01-21 Thread java8964
Can't you use Hadoop's "NLineInputFormat"? If you generate a 100-line text file, by default one line will trigger one mapper task. As long as you have 100 task slots available, you will get 100 mappers running concurrently. You want perfect control over the mapper number? NLineInputFormat is designed fo
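A minimal sketch of the NLineInputFormat setup implied here, one line per mapper (the input file, output directory, and job are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OneLinePerMapperExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);                 // 100 lines -> 100 map tasks
            NLineInputFormat.addInputPath(job, new Path("control.txt"));  // placeholder control file
            FileOutputFormat.setOutputPath(job, new Path("out"));         // placeholder output dir
        }
    }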

RE: How to configure multiple reduce jobs in hadoop 2.2.0

2014-01-17 Thread java8964
I read this blog and have the following questions: What is the relationship between "mapreduce.map.memory.mb" and "mapreduce.map.java.opts"? In the blog, it gives the following settings as an example: For our example cluster, we have the minimum RAM for a Container (yarn.scheduler.minimum-allocatio
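A hedged illustration of how the two settings relate: memory.mb is the YARN container size requested for the task, java.opts is the JVM heap launched inside it, so the heap should stay below the container size (the numbers are examples only):

    import org.apache.hadoop.conf.Configuration;

    public class MapMemoryExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Container size requested from YARN for each map task, in MB.
            conf.setInt("mapreduce.map.memory.mb", 1536);
            // JVM heap inside that container; keep it below the container size
            // to leave room for non-heap JVM overhead.
            conf.set("mapreduce.map.java.opts", "-Xmx1024m");
        }
    }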

RE: Reading multiple input files.

2014-01-10 Thread java8964
like this? hadoop MyJob -input /foo -output output Kim On Fri, Jan 10, 2014 at 8:04 AM, java8964 wrote: Yes. Hadoop is very flexible about the underlying storage system. It is in your control how to utilize the cluster's resources, including CPU, memory, IO and network bandwidth. C

RE: Reading multiple input files.

2014-01-10 Thread java8964
Yes. Hadoop is very flexible about the underlying storage system. It is in your control how to utilize the cluster's resources, including CPU, memory, IO and network bandwidth. Check out Hadoop's NLineInputFormat; it may be the right choice for your case. You can put all the metadata of your files (da

RE: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem

2014-01-10 Thread java8964
s On 12/20/2013 9:28 AM, java8964 wrote: I believe the "-fs local" should be removed too. The reason is that even you have a dedicated JobTracker after removing "-jt local", but with "-fs local"

RE: how to caculate a HDFS directory size ?

2014-01-09 Thread java8964
Or even easier like this: hadoop fs -dus /path > From: j...@hortonworks.com > Date: Wed, 8 Jan 2014 17:07:21 -0800 > Subject: Re: how to caculate a HDFS directory size ? > To: user@hadoop.apache.org > > You may want to check the fs shell command COUNT and DU. > > On Wed, Jan 8, 2014 at 4:57 PM,

RE: Setting up Snappy compression in Hadoop

2014-01-02 Thread java8964
If you really confirmed that libsnappy.so.1 is in the correct location, is being loaded into the Java library path, and works in your test program, but it still didn't work in MR, there is one other possibility which was puzzling me before. How did you get the libhadoop.so in your Hadoop environment? D

RE: any suggestions on IIS log storage and analysis?

2013-12-30 Thread java8964
ce never cross files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write MR job. I don't quite understand what you mean by "WholeFileInputFormat ". Actually, I have no idea how to deal with dependence across blocks. 2013/12/31 java89

RE: Unable to access the link

2013-12-30 Thread java8964
What's wrong with downloading it from the official Apache website? http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/ Yong Date: Mon, 30 Dec 2013 11:42:25 -0500 Subject: Unable to access the link From: navaz@gmail.com To: user@hadoop.apache.org Hi I am using below instruction set to set up h

RE: any suggestions on IIS log storage and analysis?

2013-12-30 Thread java8964
I don't know any examples of IIS log files. But from what you described, it looks like analyzing one line of log data depends on some previous lines' data. You should be more clear about what this dependence is and what you are trying to do. Just based on your questions, you still have different o

RE: Building Hadoop 1.2.1 in IntelliJ IDEA

2013-12-24 Thread java8964
The best way, I am thinking, is to try the following: 1) Use the ant command line to generate the Eclipse project files from the Hadoop 1.2.1 source folder with "ant eclipse". 2) After that, you can use "Import Project" in IntelliJ as an "Eclipse" project, which will handle all the paths correctly in IntelliJ for you

RE: Any method to get input splits by column?

2013-12-23 Thread java8964
You need to store your data in a "column-based" format; check out Hive's RCFile and its InputFormat option. Yong Date: Mon, 23 Dec 2013 21:37:23 +0800 Subject: Any method to get input splits by column? From: samliuhad...@gmail.com To: user@hadoop.apache.org Hi, By default, MR inputformat clas

RE: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem

2013-12-20 Thread java8964
I believe the "-fs local" should be removed too. The reason is that even you have a dedicated JobTracker after removing "-jt local", but with "-fs local", I believe that all the mappers will be run sequentially. "-fs local" will force the mapreducer run in "local" mode, which is really a test mo

RE: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

2013-12-20 Thread java8964
I don't think a file in HDFS can be written concurrently. Process B won't be able to write to the file (but can read it) until it is CLOSED by process A. Yong Date: Fri, 20 Dec 2013 15:55:00 +0800 Subject: Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is cal

RE: Yarn -- one of the daemons getting killed

2013-12-16 Thread java8964
If it is not killed by the OOM killer, maybe the JVM just did a core dump for whatever reason. Search for a core dump of the process in /var/log/messages, or a core dump file on your system. From: stuck...@umd.edu To: user@hadoop.apache.org; user@hadoop.apache.org Subject: Re: Yarn -- one of the daemo

RE: File size 0 bytes while open for write

2013-12-13 Thread java8964
If the thread is killed, I don't know of a way you can get the lease and close the file on behalf of the killed thread, unless your other threads hold a reference to the file writer and close it. I don't know if any command-line tool can do that. Yong From: xell...@outlook.com To: use

RE: File size 0 bytes while open for write

2013-12-13 Thread java8964
The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limi

RE: issue about Shuffled Maps in MR job summary

2013-12-12 Thread java8964
e.org one of important things is my input file is very small ,each file less than 10M,and i have a huge number of files On Thu, Dec 12, 2013 at 9:58 AM, java8964 wrote: Assume the block size is 128M, and your mapper each finishes within half minute, then there is not too much logic in your m

RE: issue about Shuffled Maps in MR job summary

2013-12-12 Thread java8964
gs is my input file is very small ,each file less than 10M,and i have a huge number of files On Thu, Dec 12, 2013 at 9:58 AM, java8964 wrote: Assume the block size is 128M, and your mapper each finishes within half minute, then there is not too much logic in your mapper, as it can f

RE: issue about Shuffled Maps in MR job summary

2013-12-11 Thread java8964
the job with all 15 reducers, and I do not know, if I increase the reducer number from 15 to 30, each reducer allocating 6G MEM, whether that will speed up the job or not. The job runs on my production env; it has run nearly 1 week and is still not finished On Wed, Dec 11, 2013 at 9:50 PM, java8964 wrote: The whole job

RE: issue about Shuffled Maps in MR job summary

2013-12-11 Thread java8964
The whole job completion time depends on a lot of factors. Are you sure the reducer part is the bottleneck? It also depends on how many reducer input groups your MR job has. If you only have 20 reducer groups, even if you jump your reducer count to 40, the epoch of the reducer part won

RE: Ant BuildException error building Hadoop 2.2.0

2013-12-05 Thread java8964
ntrun/build-main.xml May it be a missing dependency? Do you know how can I check the plugin actually exists using Maven? Thanks! On 4 December 2013 20:23, java8964 wrote: Can you try JDK 1.6? I just did a Hadoop 2.2.0 GA release build myself days ago. From my experience, JDK 1.7 not wor

RE: Ant BuildException error building Hadoop 2.2.0

2013-12-04 Thread java8964
I do: ~/hadoop-2.2.0-maven$ cmake --version cmake version 2.8.2 On 4 December 2013 19:51, java8964 wrote: Do you have 'cmake' in your environment? Yong Date: Wed, 4 Dec 2013 17:20:03 +0100 Subject: Ant BuildException error building Hadoop 2.2.0 From: silvi.ca...@gmail.com

RE: Ant BuildException error building Hadoop 2.2.0

2013-12-04 Thread java8964
Do you have 'cmake' in your environment? Yong Date: Wed, 4 Dec 2013 17:20:03 +0100 Subject: Ant BuildException error building Hadoop 2.2.0 From: silvi.ca...@gmail.com To: user@hadoop.apache.org Hello everyone, I've been having trouble to build Hadoop 2.2.0 using Maven 3.1.1, this is part of th

RE: Folder not created using Hadoop Mapreduce code

2013-11-14 Thread java8964 java8964
Maybe just a silly guess, did you close your Writer? Yong Date: Thu, 14 Nov 2013 12:47:13 +0530 Subject: Re: Folder not created using Hadoop Mapreduce code From: unmeshab...@gmail.com To: user@hadoop.apache.org @rab ra: ys using filesystem s mkdir() we can create folders and we can also create i

RE: Why the reducer's input group count is higher than my GroupComparator implementation

2013-10-30 Thread java8964 java8964
. Date: Tue, 29 Oct 2013 08:57:32 +0100 Subject: Re: Why the reducer's input group count is higher than my GroupComparator implementation From: drdwi...@gmail.com To: user@hadoop.apache.org Did you overwrite the partitioner as well? 2013/10/29 java8964 java8964 Hi, I have a stran

RE: Why the reducer's input group count is higher than my GroupComparator implementation

2013-10-29 Thread java8964 java8964
than 11. Date: Tue, 29 Oct 2013 08:57:32 +0100 Subject: Re: Why the reducer's input group count is higher than my GroupComparator implementation From: drdwi...@gmail.com To: user@hadoop.apache.org Did you overwrite the partitioner as well? 2013/10/29 java8964 java8964 Hi, I have

Why the reducer's input group count is higher than my GroupComparator implementation

2013-10-28 Thread java8964 java8964
Hi, I have a strange question related to my secondary sort implementation in an MR job. Currently I need to support a secondary sort in one of my MR jobs. I implemented my custom WritableComparable like the following: public class MyPartitionKey implements WritableComparable { String type; long id1;
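For reference, a generic hedged sketch of the three pieces a secondary sort usually needs (this is not the poster's MyPartitionKey; the field names "id" and "ts" are illustrative):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite key: "id" is the grouping part, "ts" is the secondary-sort part.
    public class CompositeKey implements WritableComparable<CompositeKey> {
        private long id;
        private long ts;

        public CompositeKey() {}
        public CompositeKey(long id, long ts) { this.id = id; this.ts = ts; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeLong(ts);
        }
        @Override public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            ts = in.readLong();
        }
        // Full ordering: grouping field first, then the secondary field.
        @Override public int compareTo(CompositeKey o) {
            int cmp = Long.compare(id, o.id);
            return cmp != 0 ? cmp : Long.compare(ts, o.ts);
        }

        // Partition only on the grouping part, so all records of one id meet in one reducer.
        public static class IdPartitioner extends Partitioner<CompositeKey, Text> {
            @Override public int getPartition(CompositeKey key, Text value, int numPartitions) {
                return (int) ((key.id & Long.MAX_VALUE) % numPartitions);
            }
        }

        // Group only on the grouping part, so one reduce() call sees all ts values of an id.
        public static class IdGroupingComparator extends WritableComparator {
            protected IdGroupingComparator() { super(CompositeKey.class, true); }
            @Override public int compare(WritableComparable a, WritableComparable b) {
                return Long.compare(((CompositeKey) a).id, ((CompositeKey) b).id);
            }
        }
    }

The job would then register these via job.setPartitionerClass(CompositeKey.IdPartitioner.class) and job.setGroupingComparatorClass(CompositeKey.IdGroupingComparator.class); if the partitioner or grouping comparator looks at the secondary field, the reducer input group count comes out higher than expected.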

RE: Mapreduce outputs to a different cluster?

2013-10-26 Thread java8964 java8964
has url "hdfs://machine.domain:8080" and data folder "/tmp/myfolder", what should I specify as the output path for MR job? Thanks On Thursday, October 24, 2013 5:31 PM, java8964 java8964 wrote: Just specify the output location using the URI to another cluster. As long as the

RE: Mapreduce outputs to a different cluster?

2013-10-24 Thread java8964 java8964
Just specify the output location using the URI to another cluster. As long as the network is accessible, you should be fine. Yong Date: Thu, 24 Oct 2013 15:28:27 -0700 From: myx...@yahoo.com Subject: Mapreduce outputs to a different cluster? To: user@hadoop.apache.org The scenario is: I run mapr
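A minimal sketch of pointing the job output at the other cluster by a fully qualified URI (the host, port, and path are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RemoteOutputExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Fully qualified URI so the output lands on the other cluster's namenode.
            FileOutputFormat.setOutputPath(job,
                    new Path("hdfs://other-namenode:8020/tmp/myfolder/output"));
        }
    }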

RE: enable snappy on hadoop 1.1.1

2013-10-07 Thread java8964 java8964
snappy on hadoop 1.1.1 whats the output of ldd on that lib? Does it link properly? You should compile natives for your platforms as the packaged ones may not link properly. On Sat, Oct 5, 2013 at 2:37 AM, java8964 java8964 wrote: I kind of read the hadoop 1.1.1 source code for this,

RE: enable snappy on hadoop 1.1.1

2013-10-04 Thread java8964 java8964
I kind of read the hadoop 1.1.1 source code for this, and it is very strange to me now. From the error, it looks like the runtime JVM cannot find the native method org/apache/hadoop/io/compress/snappy/SnappyCompressor.compressBytesDirect()I; that is my guess from the error message, but from the log,

enable snappy on hadoop 1.1.1

2013-10-04 Thread java8964 java8964
Hi, I am using hadoop 1.1.1. I want to test Snappy compression with Hadoop, but I have some problems making it work in my Linux environment. I am using openSUSE 12.3 x86_64. First, when I tried to enable snappy in hadoop 1.1.1 by: conf.setBoolean("mapred.compress.map.outp
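For reference, a sketch of the Hadoop 1.x properties involved in compressing intermediate map output with Snappy (the values are examples; the native libsnappy and libhadoop libraries still have to be on java.library.path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappyMapOutputExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hadoop 1.x (MR1) property names for compressing intermediate map output.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                    SnappyCodec.class, CompressionCodec.class);
        }
    }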

Will different files in HDFS trigger different mapper

2013-10-02 Thread java8964 java8964
Hi, I have a question related to how mappers are generated for input files from HDFS. I understand the split and block concepts in HDFS, but my original understanding is that one mapper will only process data from one file in HDFS, no matter how small that file is. Is that correct? T

RE: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC

2013-09-30 Thread java8964 java8964
I am also thinking about this for my current project, so here I share some of my thoughts, though maybe some of them are not correct. 1) In my previous projects years ago, we stored a lot of data as plain text, as at that time people thought that Big Data could store all the data, no need to worry abou

RE: All datanodes are bad IOException when trying to implement multithreading serialization

2013-09-30 Thread java8964 java8964
I don't know exactly what you are trying to do, but it seems like memory is your bottleneck, and you think you have enough CPU resources, so you want to use multi-threading to utilize the CPU? You can start multiple threads in your mapper if you think your mapper logic is very CPU intensive

RE: Extending DFSInputStream class

2013-09-26 Thread java8964 java8964
Just curious, any reason you don't want to use the DFSDataInputStream? Yong Date: Thu, 26 Sep 2013 16:46:00 +0200 Subject: Extending DFSInputStream class From: tmp5...@gmail.com To: user@hadoop.apache.org Hi I would like to wrap DFSInputStream by extension. However it seems that the DFSInputStr

Hadoop sequence file's benefits

2013-09-17 Thread java8964 java8964
Hi, I have a question related to sequence files. I wonder why and under what kind of circumstances I should use them. Let's say I have a CSV file; I can store that directly in HDFS. But if I know that the first 2 fields are some kind of key, and most MR jobs will query on that key, will it make

RE: MAP_INPUT_RECORDS counter in the reducer

2013-09-17 Thread java8964 java8964
Or you can do the calculation in the reducer close() method, even though I am not sure you can get the Mapper's count in the reducer. But even if you can't, here is what you can do: 1) Save the JobConf reference in your Mapper configure() method. 2) Store the MAP_INPUT_RECORDS counter in the configuration object as

Looking for some advice

2013-09-14 Thread java8964 java8964
Hi, I currently have a project to process data using MR. I have some thoughts about it, and am looking for some advice if anyone has any feedback. Currently in this project, I have a lot of event data related to email tracking coming into HDFS. So the events are the data for email trackin

RE: help!!!,what is happened with my project?

2013-09-11 Thread java8964 java8964
Did you do a hadoop version upgrade before this error happened? Yong Date: Wed, 11 Sep 2013 16:57:54 +0800 From: heya...@jiandan100.cn To: user@hadoop.apache.org CC: user-unsubscr...@hadoop.apache.org Subject: help!!!,what is happened with my project? Hi: Today when I

RE: distcp failed "Copy failed: ENOENT: No such file or directory"

2013-09-06 Thread java8964 java8964
The error doesn't mean the file doesn't exist in HDFS; it refers to the local disk. If you read the error stack trace: at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:581) it indicates the error happened on the local file system. If you try to copy data from an existing

RE: secondary sort - number of reducers

2013-08-30 Thread java8964 java8964
Well, the reducers normally take much longer than the mapper stage, because the copy/shuffle/sort all happen at that time, and they are the hard part. But before we simply say it is part of life, you need to dig more into your MR jobs to find out if you can make them faster. You are the

RE: secondary sort - number of reducers

2013-08-29 Thread java8964 java8964
The method getPartition() needs to return a non-negative number. Simply using the hashCode() method is not enough. See the Hadoop HashPartitioner implementation: return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; When I first read this code, I wondered why it doesn't use Math.abs. Is ( & I
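A small sketch of the difference: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so abs() can produce a negative partition, while masking with Integer.MAX_VALUE cannot (the key/value types are just examples):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class MaskedHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE, so abs() can still go negative;
            // clearing the sign bit with & Integer.MAX_VALUE always yields a non-negative index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }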

RE: copy files from hdfs to local fs

2013-08-29 Thread java8964 java8964
What's wrong with using an old Unix pipe? hadoop fs -cat /user/input/foo.txt | head -100 > local_file Date: Thu, 29 Aug 2013 13:50:37 -0700 Subject: Re: copy files from hdfs to local fs From: chengi.liu...@gmail.com To: user@hadoop.apache.org tail will work as well.. ??? but i want to extract just (sa

RE: Jar issue

2013-08-27 Thread java8964 java8964
I am not sure the original suggestion will work for your case. My understanding is that you want to use some API that only exists in slf4j version 1.6.4, but this library already exists with a different version in your Hadoop environment, which is quite possible. To change the Maven build of the appli

RE: Partitioner vs GroupComparator

2013-08-23 Thread java8964 java8964
As Harsh said, sometimes you want to do a secondary sort, but MR can only sort by key, not by value. A lot of the time, you want the reducer output sorted by a field, but only sorted within a group, kind of like a 'windowing sort' in relational DB SQL. For example, if you have data about

RE: running map tasks in remote node

2013-08-23 Thread java8964 java8964
lave nodes, it works fine. I am not able to figure out how to fix this and the reason for the error. I am not understand why it complains about the input directory is not present. As far as I know, slave nodes get a map and map method contains contents of the input file. This should be fine f

RE: running map tasks in remote node

2013-08-22 Thread java8964 java8964
If you don't plan to use HDFS, what kind of shared file system are you going to use between the cluster nodes? NFS? For what you want to do, even though it doesn't make too much sense, the first problem you need to solve is the shared file system. Second, if you want to process the files file by file, inste

java.io.IOException: Task process exit with nonzero status of -1

2013-08-15 Thread java8964 java8964
Hi, this is a 4-node hadoop cluster running on CentOS 6.3 with Oracle JDK (64bit) 1.6.0_43. Each node has 32G memory, with a max of 8 mapper tasks and 4 reducer tasks configured. The hadoop version is 1.0.4. This is set up on Datastax DES 3.0.2, which uses Cassandra CFS as the underlying DFS, instead o

RE: Encryption in HDFS

2013-02-26 Thread java8964 java8964
I am also interested in your research. Can you share some insight on the following questions? 1) When you use a CompressionCodec, can the encrypted file be split? From my understanding, there is no encryption scheme that lets the file be decrypted block by block, right? For example, if I have a 1G file

RE: Question related to Decompressor interface

2013-02-12 Thread java8964 java8964
Can someone share some idea of what the Hadoop source code of class org.apache.hadoop.io.compress.BlockDecompressorStream, method rawReadInt(), is trying to do here? There is a comment in the code that this method shouldn't return a negative number, but my testing file contains the following b

RE: Loader for small files

2013-02-12 Thread java8964 java8964
Hi, Davie: I am not sure I understand this suggestion. Why would a smaller block size help this performance issue? From what the original question describes, it looks like the performance problem is due to there being a lot of small files, and each file running in its own mapper. As hadoop nee

RE: number input files to mapreduce job

2013-02-12 Thread java8964 java8964
I don't think you can get a list of all input files in the mapper, but what you can get is the current file's information. Through the context object reference, you can get the InputSplit, which should give you all the information you want about the current input file. http://hadoop.apache.org/docs/r2.0

RE: Confused about splitting

2013-02-10 Thread java8964 java8964
Hi, Chris: Here is my understanding of file splits and data blocks. HDFS will store your file in multiple data blocks; each block will be 64M or 128M depending on your setting. Of course, the file could contain many records, so record boundaries won't match block boundaries (i

RE: Question related to Decompressor interface

2013-02-10 Thread java8964 java8964
e can convert any existing Writable into an encrypted form. Dave From: java8964 java8964 [mailto:java8...@hotmail.com] Sent: Sunday, February 10, 2013 3:50 AM To: user@hadoop.apache.org Subject: Question related to Decompressor interface HI, Currently I am researching about options of encry

RE: What to do/check/debug/root cause analysis when jobtracker hang

2013-02-06 Thread java8964 java8964
Our cluster on cdh3u4 has the same problem. I think it is caused by some bugs in the JobTracker. I believe Cloudera knows about this issue. After upgrading to cdh3u5, we haven't faced this issue yet, but I am not sure if it is confirmed to be fixed in cdh3u5. Yong > Date: Mon, 4 Feb 2013 15:21:18 -08

RE: Profiling the Mapper using hprof on Hadoop 0.20.205

2013-02-06 Thread java8964 java8964
What range did you give for mapred.task.profile.maps? And are you sure your mapper will invoke the methods you expect in the traces? Yong Date: Wed, 6 Feb 2013 23:50:08 +0200 Subject: Profiling the Mapper using hprof on Hadoop 0.20.205 From: yaron.go...@gmail.com To: user@hadoop.apache.org Hi,I wish

RE: Cumulative value using mapreduce

2012-10-05 Thread java8964 java8964
Ted comments on performance are spot on. Regards Bertrand On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 wrote: I did the cumulative sum in the HIVE UDF, as one o

RE: Cumulative value using mapreduce

2012-10-04 Thread java8964 java8964
I did a cumulative sum in a Hive UDF as one of the projects for my employer. 1) You need to decide the grouping elements for your cumulative sum, for example an account, a department, etc. In the mapper, combine this information as your emit key. 2) If you don't have any grouping requirement, yo
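A rough MR sketch of the grouping idea in 1): the mapper emits the grouping element as the key, and the reducer keeps a running total per group (the types and field layout are assumptions; with a secondary sort on date the running total would also come out in chronological order):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CumulativeSumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable runningTotal = new DoubleWritable();

        @Override
        protected void reduce(Text group, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable v : values) {
                sum += v.get();
                runningTotal.set(sum);
                context.write(group, runningTotal);   // one cumulative value per input record
            }
        }
    }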

why hadoop does not provide a round robin partitioner

2012-09-20 Thread java8964 java8964
Hi, during my development of ETLs on the hadoop platform, there is one question I want to ask: why doesn't hadoop provide a round-robin partitioner? From my experience, it is a very powerful option for the case of a small, limited set of distinct key values, and it balances ETL resources. Here is what I want to say: 1
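A hedged sketch of what such a round-robin partitioner could look like (not something Hadoop ships; the key/value types are placeholders). Note it deliberately ignores the key, so it only fits cases where records with the same key do not have to land on the same reducer:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class RoundRobinPartitioner extends Partitioner<Text, Text> {
        private int next = 0;

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Ignore the key entirely and deal records out evenly across reducers.
            int partition = next % numPartitions;
            next = (next + 1) % numPartitions;
            return partition;
        }
    }

Since each map task keeps its own counter, the balance is only approximate across the whole job, but for skewed keys it spreads reducer load far more evenly than hashing.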