Re: Hive update functionality for External tables

2015-06-16 Thread Yanbo Liang
I have also tried to use this functionality, but it did not work well for external tables. There are many restrictions on the underlying files of the table to be updated/deleted, such as supporting AcidOutputFormat, being bucketed, etc. Only ORC is supported as the file format so far, and the table show

Re: FSImage from uncompress to compress change

2015-06-16 Thread Yanbo Liang
As far as I know, HDFS gets the image compression information from the image file when loading the fsimage, so you can correctly load the fsimage file even if you set a different compression codec. I strongly recommend doing these operations with the same version and running hdfs dfsadmin -saveNamespace to save the new

Re: Set Replica Issue

2015-06-15 Thread Yanbo Liang
1, It means that you cannot use the native library for your platform, which is written in C/C++ and gives a performance benefit; it is replaced by the built-in Java classes instead. This is a warning log, not an error, so it doesn't matter. 2, You can check the replica count of this file in other ways, for example with hdfs fsck or the FileSystem API.
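A minimal sketch, assuming the path is passed on the command line, that reads a file's replication factor through the Hadoop FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println(args[0] + " replication = " + status.getReplication());
    }
}
```

On the command line, hdfs fsck <path> -files -blocks also reports how many replicas each block actually has.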

Re: Problems with the Fedarated name node configuration

2014-08-16 Thread Yanbo Liang
- Do you see anything wrong in the above configuration? It looks all right. - Where am I supposed to run this (on name nodes, data nodes or on every node)? Run it on all DataNodes; refresh all DataNodes so they pick up the newly added NameNode. - I suppose the default data

Re: Test read caching

2014-08-15 Thread Yanbo Liang
You can check the response of your command. For example, if you execute hdfs dfsadmin -report you will get a reply like the following and can verify that the cache used and remaining space is reasonable: Configured Cache Capacity: 64000 (62.50 KB) Cache Used: 4096 (4 KB) Cache Remaining: 59904

Re: Not able to place enough replicas

2014-07-14 Thread Yanbo Liang
Maybe the user 'test' has no write privilege. You can refer to the ERROR log, such as: org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:test (auth:SIMPLE) 2014-07-15 2:07 GMT+08:00 Bogdan Raducanu lrd...@gmail.com: I'm getting this error while writing many

Re: Read hflushed data without reopen file

2013-12-27 Thread Yanbo Liang
Hi Chao, As far as I know, if client B opens a file that is under construction, the DFSInputStream will get the LocatedBlocks object, which contains a member variable called underConstruction that marks the file as under construction. If the file is reopened, the client will get a different

Re: Split the File using mapreduce

2013-12-27 Thread Yanbo Liang
Did you install Hive on your Hadoop cluster? If yes, using Hive SQL may be simple and efficient. Otherwise, you can write a MapReduce program with org.apache.hadoop.mapred.lib.MultipleOutputFormat, so the output from the Reducer can be written to more than one file. 2013/12/27 Nitin Pawar
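With the newer mapreduce API, the same idea can be sketched with MultipleOutputs; the class name and the key-derived file naming below are placeholders, not something from the original thread:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Routes each record to an output file derived from its key, so one reducer
// can split its input into several files under the job output directory.
public class SplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The third argument is the base output path, relative to the job output dir.
            mos.write(key, value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```

The job's output format still needs to be a FileOutputFormat, and the base output path chosen per record must be a valid relative path under the output directory.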

Re: Request for a pointer to a MapReduce Program tutorial

2013-12-27 Thread Yanbo Liang
Maybe you can refer to Hadoop in Action. 2013/12/27 Sitaraman Vilayannur vrsitaramanietfli...@gmail.com Hi, Would much appreciate a pointer to a mapreduce tutorial which explains how i can run a simulated cluster of mapreduce nodes on a single PC and write a Java program with the

Re: building hadoop 2.x from source

2013-12-27 Thread Yanbo Liang
You can use Maven to compile and package Hadoop, deploy it to a cluster, and then run it with the scripts supplied by Hadoop. This guide is for your reference: http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt 2013/12/25 Karim Awara karim.aw...@kaust.edu.sa Hi, I managed to

Re: How to execute wordcount with compression?

2013-10-18 Thread Yanbo Liang
Compression is unrelated to YARN. If you want to store files with compression, you should compress the files when they are loaded to HDFS. The files on HDFS are compressed according to the codecs listed in the parameter io.compression.codecs, which is set in core-site.xml. If you want to specify a novel compression
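If the goal is compressed job output rather than compressing files at load time, here is a minimal driver sketch, assuming gzip is an acceptable codec (the job name and the omitted mapper/reducer setup are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount-compressed");
        // Compress the files the reducers write out.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... set mapper/reducer/input/output paths here before job.waitForCompletion(true)
    }
}
```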

Re: Parallel Load Data into Two partitions of a Hive Table

2013-05-03 Thread Yanbo Liang
Loading data into different partitions in parallel is OK, because it is equivalent to writing to different files on HDFS. 2013/5/3 selva selvai...@gmail.com Hi All, I need to load a month worth of processed data into a hive table. Table have 10 partitions. Each day have many files to load and each file is

Re: block over-replicated

2013-04-15 Thread Yanbo Liang
You can refer to this function; it removes excess replicas from the map: public void removeStoredBlock(Block block, DatanodeDescriptor node) 2013/4/12 lei liu liulei...@gmail.com I use hadoop-2.0.3. I find when on block is over-replicated, the replicas to be add to excessReplicateMap

Re: Finding mean and median python streaming

2013-04-06 Thread Yanbo Liang
? On Tue, Apr 2, 2013 at 2:14 AM, Yanbo Liang yanboha...@gmail.com wrote: How many Reducer did you start for this job? If you start many Reducers for this job, it will produce multiple output file which named as part-*. And each part is only the local mean and median value

Re: are we able to decommission multi nodes at one time?

2013-04-03 Thread Yanbo Liang
at a time on a replication average of 3 or 3+, and put it back in later without too much data movement impact. On Tue, Apr 2, 2013 at 1:06 PM, Yanbo Liang yanboha...@gmail.com wrote: It's reasonable to decommission 7 nodes at the same time. But may be it also takes long time to finish

Re: hadoop datanode kernel build and HDFS multiplier factor

2013-04-03 Thread Yanbo Liang
I have done a similar experiment for tuning Hadoop performance. Many factors influence the performance, such as the Hadoop configuration, the JVM, and the OS. For Linux kernel related factors, we have found two main points of attention: 1, Every read operation of the file system will trigger one disk write

Re: Provide context to map function

2013-04-02 Thread Yanbo Liang
protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException { context.write((KEYOUT) key, (VALUEOUT) value); } Context is a parameter that the execution environment passes to the map() function. You can just use it in the
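For illustration, a minimal custom Mapper (the class name and the line-length logic are hypothetical) that uses the Context handed in by the framework:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each input line with its length; the Context is supplied by the framework per task.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new IntWritable(value.getLength()));
    }
}
```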

Re: Finding mean and median python streaming

2013-04-02 Thread Yanbo Liang
How many Reducers did you start for this job? If you start many Reducers, the job will produce multiple output files named part-*, and each part only holds the local mean and median of that Reducer's partition. Two kinds of solutions: 1, Call the method of
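If a single global result is enough, one option (not necessarily what the thread settled on) is to run the job with one reducer. In the Java API that looks like the sketch below; for a streaming job the equivalent is typically passing -D mapreduce.job.reduces=1 (mapred.reduce.tasks in older releases) as a generic option.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mean-median");
        // One reducer sees every key, so the single part-00000 file holds the global result.
        job.setNumReduceTasks(1);
        // ... set mapper/reducer/input/output here before job.waitForCompletion(true)
    }
}
```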

Re: MultipleInputs.addInputPath compile error in eclipse(indigo)

2013-04-02 Thread Yanbo Liang
You set the wrong parameter: NodeReducer.class should be replaced with a subclass of Mapper rather than a Reducer. 2013/4/2 YouPeng Yang yypvsxf19870...@gmail.com HI GUYS I want to use the the org.apache.hadoop.mapreduce.lib.input.MultipleInputs; However it comes a compile error in my
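For reference, a hedged sketch of the expected call shape; NodeMapper here is a placeholder, not a class from the original thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsDriver {

    // A Mapper subclass -- this is what the fourth argument of addInputPath expects.
    public static class NodeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(value, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple-inputs");
        // Pass a Mapper class here, not a Reducer class.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, NodeMapper.class);
    }
}
```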

Re: are we able to decommission multi nodes at one time?

2013-04-01 Thread Yanbo Liang
It's allowed to decommission multiple nodes at the same time. Just write all the hostnames that will be decommissioned to the exclude file and run bin/hadoop dfsadmin -refreshNodes. However, you need to ensure that the decommissioned DataNodes are a minority of all the DataNodes in the cluster and

Re: are we able to decommission multi nodes at one time?

2013-04-01 Thread Yanbo Liang
How many nodes do you want to decommission? 2013/4/2 Henry JunYoung KIM henry.jy...@gmail.com 15 for datanodes and 3 for replication factor. 2013. 4. 1., 3:23 PM, varun kumar varun@gmail.com wrote: How many nodes do you have and replication factor for it.

Re:

2013-03-28 Thread Yanbo Liang
You can get detailed information from the Greenplum website: http://www.greenplum.com/products/pivotal-hd 2013/3/28 oualid ait wafli oualid.aitwa...@gmail.com Hi Sameone know samething about EMC distribution for Big Data which itegrate Hadoop and other tools ? Thanks

Re: DFSOutputStream.sync() method latency time

2013-03-28 Thread Yanbo Liang
First, when a client wants to write data to HDFS, it creates a DFSOutputStream. Then the client writes data to this output stream, and the stream transfers the data to all DataNodes in the constructed pipeline by means of packets whose size is 64KB. These two operations are concurrent, so the

Re: Inspect a context object and see whats in it

2013-03-28 Thread Yanbo Liang
You can try to add some probes to the source code and recompile it. If you want to know the keys and values added at each step, you can add print statements to the map() function of the Mapper class and the reduce() function of the Reducer class. The shortcoming is that you will produce a lot of log output, which may fill the

Re: DFSOutputStream.sync() method latency time

2013-03-28 Thread Yanbo Liang
of each datanode operation. 2013/3/28 Yanbo Liang yanboha...@gmail.com 1st when client wants to write data to HDFS, it should be create DFSOutputStream. Then the client write data to this output stream and this stream will transfer data to all DataNodes with the constructed pipeline

Re: Any answer ? Candidate application for map reduce

2013-03-25 Thread Yanbo Liang
From your description, "split the data into chunks, feed the chunks to the application, and merge the processed chunks to get A back" is exactly suited to the MapReduce paradigm. First, you can feed the split chunks to the Mapper and then merge the processed chunks in the Reducer. Why did you not use MapReduce

Re: Understand dfs.datanode.max.xcievers

2013-03-18 Thread Yanbo Liang
The dfs.datanode.max.xcievers value should be set across the cluster rather than on a particular DataNode. It is the upper bound on the number of files that a DataNode will serve at any one time. 2013/3/17 Dhanasekaran Anbalagan bugcy...@gmail.com Hi Guys, We are having few data nodes in an

Re: using test.org.apache.hadoop.fs.s3native.InMemoryNativeFileSystemStore class in hadoop

2013-03-18 Thread Yanbo Liang
These test classes are used for unit testing. You can run these cases to test a particular function of a class. But when we run these test cases, we need some additional classes and functions to simulate the underlying functions called by the test cases. InMemoryNativeFileSystemStore is

Re: using test.org.apache.hadoop.fs.s3native.InMemoryNativeFileSystemStore class in hadoop

2013-03-18 Thread Yanbo Liang
It is just a unit test, so you don't need to set any parameters in configuration files. 2013/3/18 Agarwal, Nikhil nikhil.agar...@netapp.com Hi, Thanks for the quick reply. In order to test the class TestInMemoryNativeS3FileSystemContract and its functions what should be the value

Re: How to Create file in HDFS using java Client with Permission

2013-03-15 Thread Yanbo Liang
You must switch to the user dasmohap to execute this client program; otherwise you cannot create a file under the directory /user/dasmohap. If you do not have a user called dasmohap on the client machine, create it or hack it with these steps
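With simple (non-Kerberos) authentication there is also the option of acting as that user from the client process via UserGroupInformation; a minimal sketch (the file name is a placeholder):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class CreateAsUser {
    public static void main(String[] args) throws Exception {
        // With simple auth, present ourselves as "dasmohap" without switching OS users.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("dasmohap");
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            @Override
            public Void run() throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                fs.create(new Path("/user/dasmohap/sample.txt")).close();
                return null;
            }
        });
    }
}
```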

Re: HDFS Cluster Summary DataNode usages

2013-03-14 Thread Yanbo Liang
It means: the minimum of used storage capacity / total storage capacity over the DataNodes; the median of used storage capacity / total storage capacity; the maximum of used storage capacity / total storage capacity; and the standard deviation of all

Re: Why hadoop is spawing two map over file size 1.5 KB ?

2013-03-14 Thread Yanbo Liang
I guess maybe one of them is speculative execution. You can check the parameter mapred.map.tasks.speculative.execution to see whether speculative execution is allowed. You can find out precisely whether it is a speculative map task from the tasktracker log. 2013/3/12 samir

Re: “hadoop namenode -format” formats wrong directory

2013-02-06 Thread Yanbo Liang
You can try to use the new parameter dfs.namenode.name.dir to specify the directory. 2013/2/6, Andrey V. Romanchev andrey.romanc...@gmail.com: Hello! I'm trying to install Hadoop 1.1.2.21 on CentOS 6.3. I've configured dfs.name.dir in /etc/hadoop/conf/hdfs-site.xml file

Re: distributed cache

2012-11-16 Thread Yanbo Liang
As far as I know, the local.cache.size parameter controls the size of the DistributedCache. By default, it's set to 10 GB. The parameter io.sort.mb is not used here; it sets the size of the circular memory buffer that each map task writes its output to. 2012/11/16 yingnan.ma

Re: Hadoop and Hbase site xml

2012-11-12 Thread Yanbo Liang
There are two candidates: 1) You can copy your Hadoop/HBase configuration files, such as core-site.xml, hdfs-site.xml, or hbase-site.xml, from the etc or conf subdirectory of the Hadoop/HBase installation directory into the Java project directory. Then the configuration of Hadoop/HBase will be auto
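If the site files are not on the classpath, one alternative (a sketch under that assumption; the paths are examples, not taken from the thread) is to add them to the Configuration explicitly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LoadSiteConfig {
    public static void main(String[] args) {
        // HBaseConfiguration.create() picks up hbase-site.xml from the classpath if present.
        Configuration conf = HBaseConfiguration.create();
        // Otherwise the site files can be added explicitly; these paths are just examples.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
        System.out.println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"));
    }
}
```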

Re: problem using s3 instead of hdfs

2012-10-16 Thread Yanbo Liang
Because you did not set defaultFS in the conf, you need to explicitly give the absolute path (including the scheme) of the file in S3 when you run an MR job. 2012/10/16 Rahul Patodi patodirahul.had...@gmail.com I think these blog posts will answer your question:
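A minimal driver sketch of what that looks like; the bucket name is a placeholder, and s3n:// is assumed as the native S3 scheme in use at the time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3PathsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "s3-job");
        // With no S3 defaultFS, spell out the scheme on every path; the bucket is a placeholder.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output"));
    }
}
```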

Re: Number of concurrent writer to HDFS

2012-08-06 Thread Yanbo Liang
You can use Scribe or Flume to collect log data and integrate it with Hadoop. 2012/8/4 Nguyen Manh Tien tien.nguyenm...@gmail.com Hi, I plan to streaming logs data HDFS using many writer, each writer write a stream of data to a HDFS file (may rotate) I wonder how many concurrent writer i

Re: problem configuring hadoop with s3 bucket

2012-07-26 Thread Yanbo Liang
namenode -format is saying succesfully formatted namenode dir S3://bucket/hadoop/namenode , when it is not even existing there! any suggestion? Thanks again. On Tue, Jul 24, 2012 at 4:11 PM, Yanbo Liang yanboha...@gmail.com wrote: I think you have made confusion about the integration

Re: problem configuring hadoop with s3 bucket

2012-07-25 Thread Yanbo Liang
succesfully formatted namenode dir S3://bucket/hadoop/namenode , when it is not even existing there! any suggestion? Thanks again. On Tue, Jul 24, 2012 at 4:11 PM, Yanbo Liang yanboha...@gmail.com wrote: I think you have made confusion about the integration of hadoop and S3. 1) If you set

Re: Hadoop 2.0 High Availability and Federation

2012-07-24 Thread Yanbo Liang
It's available in Hadoop 2.0. HDFS Federation supplies multiple namespaces for the whole storage pool. High Availability is specific to each namespace/NameNode, so you can configure HA for each NameNode in the federation. You can get some documentation from

Re: Disk on data node full

2012-03-24 Thread Yanbo Liang
I wonder why this imbalance occurs? 2012/3/17 Zizon Qiu zzd...@gmail.com if there are only dfs files under /data and /data2,it will be ok when filled up. unless some other files like mapreduce teme folder or even a namenode image,it may broken the cluster when disk was filled up(as namenode

Re: Getting NameNode instance from DistributedFileSystem

2012-02-13 Thread yanbo liang
There is a member variable called dfs in the DistributedFileSystem class; its type is DFSClient. All of the file system operations in the DistributedFileSystem class are delegated to the corresponding operations of dfs, and dfs communicates with the NameNode server by means