hadoop cluster with mixed servers (different memory, speed, etc.)

2015-09-17 Thread Demai Ni
hi, folks,

I am wondering how a hadoop cluster handles commodity hardware with different
speed and capacity.

This situation is already happening and will probably become very common soon: a
cluster starts with 100 machines, and a couple of years later another 100
machines are added. With Moore's law as an indicator, the new and old machines are
at least one generation apart. The situation gets even more complex if the
'new' 100 join the cluster gradually. How does hadoop handle this situation and
avoid the weakest-link problem?

thanks

Demai


Re: HDFS ShortCircuit Read on Mac?

2015-09-08 Thread Demai Ni
Chris, many thanks for the quick response. I will disable short-circuit reads
on my Mac for now. :-)  Demai

On Tue, Sep 8, 2015 at 4:57 PM, Chris Nauroth 
wrote:

> Hello Demai,
>
> HDFS short-circuit read currently does not work on Mac, due to some
> platform differences in handling of domain sockets.  The last time I
> checked, our Hadoop code was exceeding a maximum path length enforced on
> Mac for domain socket paths.  I haven't had availability to look at this in
> a while, but the prior work is tracked in JIRA issues HDFS-3296 and
> HADOOP-11957 if you want to see the current progress.
>
> --Chris Nauroth
>
> From: Demai Ni 
> Reply-To: "user@hadoop.apache.org" 
> Date: Tuesday, September 8, 2015 at 4:46 PM
> To: "user@hadoop.apache.org" 
> Subject: HDFS ShortCircuit Read on Mac?
>
> hi, folks,
>
> wondering whether anyone has set up HDFS short-circuit read on a Mac? I
> installed hadoop through Homebrew on my Mac. It is up and running, but I
> cannot configure "dfs.domain.socket.path" as instructed here:
>
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
> since there is no dn_socket on the Mac.
>
> any pointers are appreciated.
>
> Demai
>
>


HDFS ShortCircuit Read on Mac?

2015-09-08 Thread Demai Ni
hi, folks,

wondering whether anyone has set up HDFS short-circuit read on a Mac? I
installed hadoop through Homebrew on my Mac. It is up and running, but I
cannot configure "dfs.domain.socket.path" as instructed here:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
since there is no dn_socket on the Mac.

any pointers are appreciated.

Demai
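
For anyone trying the same setup on Linux (where the domain-socket path length
is not a problem), a minimal client-side sketch of the two relevant properties
is below; the socket path is an example value and has to match the
dfs.domain.socket.path configured on the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // enable short-circuit local reads and point at the DataNode's domain socket
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket"); // example path
    FileSystem fs = FileSystem.get(conf);
    System.out.println("connected to " + fs.getUri());
  }
}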


Re: hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
Ritesh,

many thanks for your response. I just read through the centralized cache
document. Thanks for the pointer. A couple of follow-up questions.

First, the centralized cache requires 'explicit' configuration, so by
default there is no HDFS-managed cache? Will caching then happen at the local
filesystem level, as in Linux?

The second question: the centralized cache lives on the DNs of HDFS. Let's say
the client is a stand-alone Linux box (not part of the cluster), which connects
to an HDFS cluster with the centralized cache configured, so on the HDFS
cluster the file is cached. In this scenario, the client has 10 processes
repeatedly reading the same HDFS file. Will the HDFS client API be able to
cache the file content on the client side? Or will every READ have to move the
whole file through the network, with no sharing between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> Let's assume that hdfs maintains 3 replicas of the 256MB block, then all
> of these 3 datanodes will have only one copy of the block in their
> respective mem cache and thus avoiding the repeated i/o reads. This goes
> with the centralized cache management policy of hdfs that also gives you an
> option to pin 2 of these 3 blocks in cache and save the remaining 256MB of
> cache space. Here's a link on the same.
>
> Hope that helps.
>
> Ritesh
>
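
As a concrete illustration of the pinning Ritesh describes, here is a rough
sketch using the java cache-directive API (DistributedFileSystem.addCachePool /
addCacheDirective, available in Hadoop 2.3+); the pool name and path are
made-up examples, and the HDFS centralized cache management documentation shows
the equivalent "hdfs cacheadmin" commands.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class PinFileInCache {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    // create a cache pool (example name) and pin one file into it; the
    // NameNode then asks the DataNodes holding the replicas to cache the blocks
    dfs.addCachePool(new CachePoolInfo("testPool"));
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/user/data/file1"))  // example path
            .setPool("testPool")
            .build());
    System.out.println("cache directive id: " + directiveId);
  }
}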


hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
hi, folks,

I have a quick question about how hdfs handles caching. In this lab
experiment, I have a 4-node hadoop cluster (2.x), and each node has fairly
large memory (96GB). I have a single hdfs file of 256MB, which also fits
in one HDFS block. The local filesystem is Linux.

Now, from one of the DataNodes, I start 10 hadoop client processes that
repeatedly read the above file. My assumption is that HDFS will cache the
256MB in memory, so (after the 1st read) READs will involve no disk I/O
anymore.

My question is: *how many COPIES of the 256MB will be in the memory of this
DataNode? 10 or 1?*

What if the 10 client processes are located on a 5th Linux box, independent
of the cluster? Will we have 10 copies of the 256MB or just 1?

Many thanks. Appreciate your help on this.

Demai


Re: a non-commercial distribution of hadoop ecosystem?

2015-06-01 Thread Demai Ni
Andrew,

great to hear that you are also using Bigtop. I will surely try it out to
replace my (slightly) old CDH cluster. :-)

cheers

Demai

On Mon, Jun 1, 2015 at 5:29 PM, Andrew Purtell 
wrote:

> Bigtop, in a nutshell, is a non-commercial multi-stakeholder Apache
> project that produces a build framework that takes as input source from
> Hadoop and related big data projects and produces as output OS native
> packages for installation and management - certainly, a distribution of the
> Hadoop ecosystem - coupled with a suite of integration tests for ensuring
> the distribution components are working well together, coupled with a suite
> of Puppet scripts for post-deploy configuration management. It's a rather
> large nutshell. (Smile)  Bigtop distribution packages are supported by
> Cask's Coopr (coopr.io) and I think to some extent by Ambari (haven't
> tried it).
>
> I've personally used Bigtop for years to produce several custom Hadoop
> distributions. For this purpose it is a great tool.
>
> Please mail u...@bigtop.apache.org if you would like to know more, we'd
> love to talk with you.
>
>
> On Jun 2, 2015, at 7:16 AM, Demai Ni  wrote:
>
> Chris and Roman,
>
> many thanks for the quick response.  I will take a look at bigtop.
> Actually, I heard about it, but thought it is a installation framework,
> instead of a hadoop distribution. Now I am looking at the BigTop 0.7.0
> hadoop instruction, which probably will work fine for my needs. Appreciate
> the pointer.
>
> Roman, I will ping you off list for ODP. I was hoping ODP will be the one
> for me. Well, in reality, it is owned by a few companies, at least not by
> ONE company. :-)  It is fine with me, as long as ODP is open to be used by
> others. I am just having trouble to find document/installation info of the
> ODP. maybe I should google harder? :-)
>
> Demai
>
>
> On Mon, Jun 1, 2015 at 1:46 PM, Roman Shaposhnik  wrote:
>
>> On Mon, Jun 1, 2015 at 1:37 PM, Demai Ni  wrote:
>> > My question is besides the commercial distributions: CDH(Cloudera)  ,
>> HDP
>> > (Horton work), and others like Mapr, IBM... Is there a distribution
>> that is
>> > NOT owned by a company?  I am looking for something simple for cluster
>> > configuration/installation for multiple components: hdfs, yarn,
>> zookeeper,
>> > hive, hbase, maybe Spark. Surely, for a well-experience person(not me),
>> > he/she can build the distribution from Apache releases. Well, I am more
>> > interested on building application on top of it, and hopefully to find
>> one
>> > packed them together.
>>
>> Apache Bigtop (CCed) aims at delivering a 100% open and
>> community-driven distribution of big data management technologies
>> around Apache Hadoop. Same as, for example, what Debian is trying
>> to do for Linux.
>>
>> > BTW, I don't need the latest releases like other commercial distribution
>> > offered.  I am also looking into the ODP(the open data platform), but
>> that
>> > project is kind of quiet after the initial Feb announcement.
>>
>> Feel free to ping me off list if you want more details on ODP.
>>
>> Thanks,
>> Roman.
>>
>
>


Re: a non-commercial distribution of hadoop ecosystem?

2015-06-01 Thread Demai Ni
Chris and Roman,

many thanks for the quick response.  I will take a look at Bigtop.
Actually, I had heard about it, but thought it was an installation framework
rather than a hadoop distribution. Now I am looking at the Bigtop 0.7.0
hadoop instructions, which will probably work fine for my needs. Appreciate
the pointer.

Roman, I will ping you off list about ODP. I was hoping ODP would be the one
for me. Well, in reality, it is owned by a few companies, at least not by
ONE company. :-)  That is fine with me, as long as ODP is open to be used by
others. I am just having trouble finding documentation/installation info for
ODP. Maybe I should google harder? :-)

Demai


On Mon, Jun 1, 2015 at 1:46 PM, Roman Shaposhnik  wrote:

> On Mon, Jun 1, 2015 at 1:37 PM, Demai Ni  wrote:
> > My question is besides the commercial distributions: CDH(Cloudera)  , HDP
> > (Horton work), and others like Mapr, IBM... Is there a distribution that
> is
> > NOT owned by a company?  I am looking for something simple for cluster
> > configuration/installation for multiple components: hdfs, yarn,
> zookeeper,
> > hive, hbase, maybe Spark. Surely, for a well-experience person(not me),
> > he/she can build the distribution from Apache releases. Well, I am more
> > interested on building application on top of it, and hopefully to find
> one
> > packed them together.
>
> Apache Bigtop (CCed) aims at delivering a 100% open and
> community-driven distribution of big data management technologies
> around Apache Hadoop. Same as, for example, what Debian is trying
> to do for Linux.
>
> > BTW, I don't need the latest releases like other commercial distribution
> > offered.  I am also looking into the ODP(the open data platform), but
> that
> > project is kind of quiet after the initial Feb announcement.
>
> Feel free to ping me off list if you want more details on ODP.
>
> Thanks,
> Roman.
>


a non-commercial distribution of hadoop ecosystem?

2015-06-01 Thread Demai Ni
hi, Guys,

I have been doing some research/POCs using the hadoop ecosystem. Normally, I
either use Homebrew on a Mac for a single-node installation, or use CDH
(Cloudera) for a small 3~4-node Linux cluster.

My question is: besides the commercial distributions, CDH (Cloudera), HDP
(Hortonworks), and others like MapR, IBM..., is there a distribution that is
NOT owned by a company?  I am looking for something simple for cluster
configuration/installation covering multiple components: hdfs, yarn, zookeeper,
hive, hbase, maybe Spark. Surely a well-experienced person (not me) can build
such a distribution from the Apache releases; I am more interested in building
applications on top of it, and hoping to find one that packs them together.

BTW, I don't need the latest releases that the commercial distributions offer.
I am also looking into ODP (the Open Data Platform), but that project has been
kind of quiet since the initial February announcement.

Thanks.

 Demai


Re: Connect c language with HDFS

2015-05-04 Thread Demai Ni
I would also suggest taking a look at
https://issues.apache.org/jira/browse/HDFS-6994. I have been using libhdfs3
for POCs in the past few months, and I highly recommend it.  The only drawback
is that libhdfs3 has not been formally committed into hadoop/hdfs yet.

If you only want to play with hdfs, using the existing libhdfs lib is fine.
But if you are looking at some serious development, libhdfs3 has a lot of
advantages.


On Mon, May 4, 2015 at 3:59 AM, unmesha sreeveni 
wrote:

> Thanks
> Did it.
>
> http://unmeshasreeveni.blogspot.in/2015/05/hadoop-word-count-using-c-hadoop.html
>
> On Mon, May 4, 2015 at 3:43 PM, Alexander Alten-Lorenz <
> wget.n...@gmail.com> wrote:
>
>> That depends on the installation source (rpm, tgz or parcels). Usually,
>> when you use parcels, libhdfs.so* should be within /opt/cloudera/parcels/
>> CDH/lib64/ (or similar). Or just use linux' "locate" (locate
>> libhdfs.so*) to find the library.
>>
>>
>>
>>
>> --
>> Alexander Alten-Lorenz
>> m: wget.n...@gmail.com
>> b: mapredit.blogspot.com
>>
>> On May 4, 2015, at 11:39 AM, unmesha sreeveni 
>> wrote:
>>
>> thanks alex
>>   I have gone through the same, but when I checked my cloudera
>> distribution I was not able to find those folders... That's why I posted
>> here. I don't know if I made any mistake.
>>
>> On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz <
>> wget.n...@gmail.com> wrote:
>>
>>> Google:
>>>
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html
>>>
>>> --
>>> Alexander Alten-Lorenz
>>> m: wget.n...@gmail.com
>>> b: mapredit.blogspot.com
>>>
>>> On May 4, 2015, at 10:57 AM, unmesha sreeveni 
>>> wrote:
>>>
>>> Hi
>>>   Can we connect c with HDFS using cloudera hadoop distribution.
>>>
>>> --
>>> *Thanks & Regards *
>>>
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>
>>
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


Re: Data locality

2015-03-02 Thread Demai Ni
hi, folks,

I have a similar question. Is there an easy way to tell (from a user
perspective) whether short-circuit read is enabled? Thanks

Demai

On Mon, Mar 2, 2015 at 11:46 AM, Fei Hu  wrote:

> Hi All,
>
> I developed a scheduler for data locality. Now I want to test the
> performance of the scheduler, so I need to monitor how much data is read
> remotely. Is there any tool for monitoring the volume of data moved around
> the cluster?
>
> Thanks,
> Fei
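
One way to check this from a user perspective, assuming the per-stream read
statistics exposed by HdfsDataInputStream in Hadoop 2.x (class and method names
worth verifying against your version), is to read a file and compare total,
local, and short-circuit byte counts, roughly as in this sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class ReadLocalityCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/tmp/test.txt")); // example file
    byte[] buf = new byte[4096];
    while (in.read(buf) != -1) { /* drain the file */ }
    // on HDFS the stream is an HdfsDataInputStream, whose statistics break
    // down the bytes read locally and via short-circuit for this stream;
    // total minus local approximates the data that moved over the network
    HdfsDataInputStream hin = (HdfsDataInputStream) in;
    System.out.println("total bytes:         " + hin.getReadStatistics().getTotalBytesRead());
    System.out.println("local bytes:         " + hin.getReadStatistics().getTotalLocalBytesRead());
    System.out.println("short-circuit bytes: " + hin.getReadStatistics().getTotalShortCircuitBytesRead());
    in.close();
  }
}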


[HDFS] result order of getFileBlockLocations() and listFiles()?

2014-10-29 Thread Demai Ni
hi, Guys,

I am trying to implement a simple program (experimental, not for production).
It invokes FileSystem.listFiles() to get a list of files under an hdfs folder,
and then uses FileSystem.getFileBlockLocations() to get the replica locations
of each file/block.

Since it is a controlled environment, I can make sure the files are static
and not worry about datanode crashes, fail-over, etc.

Assume that within a small time window (say, 1 minute), 100~1000s of clients
invoke the same program to look up the same folder. Will the above two APIs
guarantee the *same result in the same order* for all clients?

To elaborate a bit more, say there is a folder called /dfs/dn/user/data that
contains three files: file1, file2, and file3.  If client1 gets:
listFiles() : file1,file2,file3
getFileBlockLocation(file1) -> datanode1, datanode3, datanode6

Will all other clients get the same information (I think so) and in the same
order? Or do I have to sort on each client to guarantee the order?

Many thanks for your inputs

Demai
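
If you would rather not rely on server-side ordering (the replica list for a
block, in particular, can be ordered by the NameNode relative to the requesting
client), a defensive sketch that sorts on the client side could look like this;
the folder path is an example:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SortedBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    List<LocatedFileStatus> files = new ArrayList<LocatedFileStatus>();
    RemoteIterator<LocatedFileStatus> it =
        fs.listFiles(new Path("/dfs/dn/user/data"), false);
    while (it.hasNext()) {
      files.add(it.next());
    }
    // impose a deterministic file order on the client side
    Collections.sort(files, new Comparator<LocatedFileStatus>() {
      public int compare(LocatedFileStatus a, LocatedFileStatus b) {
        return a.getPath().toString().compareTo(b.getPath().toString());
      }
    });
    for (LocatedFileStatus f : files) {
      for (BlockLocation loc : f.getBlockLocations()) {
        String[] hosts = loc.getHosts(); // replica hostnames for this block
        Arrays.sort(hosts);              // deterministic replica order as well
        System.out.println(f.getPath() + " offset=" + loc.getOffset()
            + " hosts=" + Arrays.toString(hosts));
      }
    }
  }
}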


Re: read from a hdfs file on the same host as client

2014-10-13 Thread Demai Ni
Shivram,

many thanks for confirming the behavior. I will also turn on short-circuit
reads as you suggested. Appreciate the help.

Demai

On Mon, Oct 13, 2014 at 3:42 PM, Shivram Mani  wrote:

> Demai, you are right. HDFS's default BlockPlacementPolicyDefault makes
> sure one replica of your block is available on the writer's datanode.
> The replica selection for the read operation is also aimed at minimizing
> bandwidth/latency and will serve the block from the reader's local node.
> If you want to further optimize this, you can set 
> 'dfs.client.read.shortcircuit'
> to true. This would allow the client to bypass the datanode to read the
> file directly.
>
> On Mon, Oct 13, 2014 at 11:58 AM, Demai Ni  wrote:
>
>> hi, folks,
>>
>> a very simple question, looking forward a couple pointers.
>>
>> Let's say I have a hdfs file: testfile, which only have one block(256MB),
>> and the block has a replica on datanode: host1.hdfs.com (the whole hdfs
>> may have 100 nodes though, and the other 2 replica are available at other
>> datanode).
>>
>> If on host1.hdfs.com, I did a "hadoop fs -cat testfile" or a java client
>> to read the file. Should I assume there won't be any significant data
>> movement through network?  That is the namenode is smart enough to give me
>> the data on host1.hdfs.com directly?
>>
>> thanks
>>
>> Demai
>>
>
>
>
> --
> Thanks
> Shivram
>


read from a hdfs file on the same host as client

2014-10-13 Thread Demai Ni
hi, folks,

a very simple question; looking forward to a couple of pointers.

Let's say I have an hdfs file, testfile, which has only one block (256MB),
and the block has a replica on the datanode host1.hdfs.com (the whole hdfs may
have 100 nodes, though, and the other 2 replicas are available on other
datanodes).

If, on host1.hdfs.com, I do a "hadoop fs -cat testfile" or use a java client to
read the file, should I assume there won't be any significant data movement
through the network?  That is, is the namenode smart enough to give me the data
on host1.hdfs.com directly?

thanks

Demai
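
A small sketch of how a client could verify that assumption for itself, by
comparing a block's replica hosts with the local hostname (the path is an
example, and the hostname comparison is simplified and may need adjusting for
your DNS setup):

import java.net.InetAddress;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsReadLocal {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/demai/testfile")); // example path
    BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
    String localHost = InetAddress.getLocalHost().getCanonicalHostName();
    for (BlockLocation b : blocks) {
      boolean local = Arrays.asList(b.getHosts()).contains(localHost);
      System.out.println("block at offset " + b.getOffset()
          + " hosts=" + Arrays.toString(b.getHosts())
          + (local ? "  <-- a replica is on this machine" : ""));
    }
  }
}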


hdfs: a C API call to getFileSize() through libhdfs or libhdfs3?

2014-10-02 Thread Demai Ni
hi, folks,

To get the size of an hdfs file, the java API has
FileSystem#getFileStatus(PATH)#getLen();
now I am trying to use a C client to do the same thing.

For a file on the local file system, I can grab the info like this:
fseeko(file, 0, SEEK_END);
size = ftello(file);

But I can't find a SEEK_END equivalent or a getFileSize() call in the existing
libhdfs or the new libhdfs3.

Can someone point me in the right direction? Many thanks.

Demai
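
For reference, a minimal sketch of the java call named above, as a baseline the
C port can be compared against (the path is an example). On the libhdfs side,
it is worth checking hdfsGetPathInfo() declared in hdfs.h; its hdfsFileInfo
result should carry the file length in the mSize field.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // same call as the FileSystem#getFileStatus(PATH)#getLen() mentioned above
    long size = fs.getFileStatus(new Path("/tmp/test.txt")).getLen();
    System.out.println("size in bytes: " + size);
  }
}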


Re: Planning to propose Hadoop initiative to company. Need some inputs please.

2014-10-01 Thread Demai Ni
hi,

Glad to see another person moving from the mainframe world to the 'big data'
one. I was in the same boat a few years back, after working on mainframes for
10+ years.

Wilm got to the pointers already. I'd just like to chime in a bit from the
mainframe side.

The website-usage example is a very good one for big data compared to the
mainframe, as the mainframe is very expensive at providing reliability for
mission-critical workloads. One approach is to look at the applications
currently running on the mainframe, or the ones your team is considering
implementing on the mainframe. For a website-usage case, the cost to implement
and run would be only about 1/10 on hadoop/hbase compared to the mainframe. And
the mainframe is probably not able to scale up if the data grows into TBs.

Second, be careful that Hadoop is not for all your cases. I am pretty sure
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer payments, etc. Leave those workloads on the
mainframe, because 1) hbase/hadoop are not designed for such RDBMS workloads,
and 2) moving from one database to another is way too much risk unless the top
boss forces you to do so... :-)

Demai


On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher  wrote:

> Hi,
>
> first: I think hbase is what you are looking for. If I understand
> correctly you want to show the customer his or her data very fast and
> let them manipulate their data. So you need something like a data
> warehouse system. Thus, hbase is the method of choice for you (and I
> think for your kind of data, hbase is a better choice than cassandra or
> mongoDB). But of course you need a running hadoop system to run a hbase.
> So it's not an either/or ;)
>
> (my answers are for hbase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sry @all by
> writing about hbase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)
> well, if you have the option to ask your IT department you should do
> that, because everyone here would have to guess. You would have to
> explain very detailed what you have to do to let us guess. If you e.g.
> want to track the user on what he or she has clicked, perhaps to make
> personalized ads, than you have to save more data. So, you should ask
> the persons who have the data right away without guessing.
>
> > 3) How many clusters/nodes would I need to ​run a web usage analytics
> > system?
> in the book "hbase in action" there are some recommendations for some
> "case studies" (part IV "deploying hbase"). There are some thoughts on
> the number of nodes, and how to use them, depending on the size of your
> data
>
> > 4) What are the ways for me to use our data? (One use case I'm thinking
> > of is to analyze the error messages log for each page on quote process
> > to redesign the UI. Is this possible?)
> sure. And this should be very easy. I would pump the error log into a
> hbase table. By this method you could read the messages directly from
> the hbase shell (if they are few enough). Or you could use hive to query
> your log a little more "sql like" and make statistics very easy.
>
> > 5) How long would it take for me to set up and start such a system?
> for a novice who has to do it for the first time: for the stand-alone
> hbase system, perhaps 2 hours. For a complete distributed test cluster
> ... perhaps a day. For the real production system, with all security
> features ... a little longer ;).
>
> > I'm sorry if some/all of these questions are unanswerable. I just want
> > to discuss my thoughts, and get an idea of what things can I achieve by
> > going the way of Hadoop.
> well, I think (but I could err) that you think of hadoop (or hbase) in a
> way where you can just change the "database backend" from "SQL" to
> "hbase/hadoop" and everything will run right away. It will not be
> that easy. You would have to change the code of your web application in
> a very fundamental way. You have to rethink all the table designs etc.,
> so this could be more complicated than you think right now.
>
> However, hbase/hadoop has some advantages which are very interesting for
> you. First, it is distributed, which enables your company to grow
> almost without limit, or to collect more data about your customers so you
> can get more information (and sell more stuff). And map reduce is a
> wonderful tool for making really fancy "statistics", which is very
> interesting for an insurance company. Your mathematical economists will
> REALLY love it ;).
>
> Hope this helped.
>
> best wishes
>
> Wilm
>
>
>
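
As a rough illustration of the "pump the error log into an hbase table" idea,
assuming the pre-1.0 HTable/Put client API of that era; the table name, column
family, and row-key scheme are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ErrorLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "error_log"); // example table, created beforehand
    // example row key: page id plus reversed timestamp, so recent errors sort first
    String rowKey = "quote-page|" + (Long.MAX_VALUE - System.currentTimeMillis());
    Put put = new Put(Bytes.toBytes(rowKey));
    put.add(Bytes.toBytes("msg"), Bytes.toBytes("text"),
        Bytes.toBytes("example error message")); // example family/qualifier/value
    table.put(put);
    table.close();
  }
}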


Re: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-09 Thread Demai Ni
Yong,

good point that each node of the cluster could have different values in
the .xml files, which is probably true if the nodes have different roles or
hardware settings. So some of the configuration (like memory, heap) may not
make sense to the client at all.

Are some of the settings the same across the cluster? The one I am
interested in at this moment is the local-filesystem folder for the data
node dir. I am thinking about doing some local reads, so knowing where to
read the data is the very first step.

Demai

On Tue, Sep 9, 2014 at 11:13 AM, java8964  wrote:

> The configuration in fact depends on the xml file. Not sure what kind of
> cluster configuration variables/values you are looking for.
>
> Remember, the cluster is made of set of computers, and in hadoop, there
> are hdfs xml, mapred xml and even yarn xml.
>
> Mapred.xml and yarn.xml are job related. Without a concrete job, there is no
> detailed configuration that can be given.
>
> About the HDFS configuration, there is a set of computers in the cluster.
> In theory, there is nothing wrong with each computer having different
> configuration settings. Every computer could have different cpu cores,
> memory, disk counts, mount names, etc. When you ask for configuration
> variables/values, which one should be returned?
>
> Yong
>
> --
> Date: Tue, 9 Sep 2014 10:01:14 -0700
> Subject: Re: conf.get("dfs.data.dir") return null when hdfs-site.xml
> doesn't set it explicitly
> From: nid...@gmail.com
> To: user@hadoop.apache.org
>
>
> Susheel actually brought up a good point.
>
> once the client code connects to the cluster, is there way to get the real
> cluster configuration variables/values instead of relying on the .xml files
> on client side?
>
> Demai
>
> On Mon, Sep 8, 2014 at 10:12 PM, Susheel Kumar Gadalay <
> skgada...@gmail.com> wrote:
>
> One doubt on building Configuration object.
>
> I have a Hadoop remote client and Hadoop cluster.
> When a client submitted a MR job, the Configuration object is built
> from Hadoop cluster node xml files, basically the resource manager
> node core-site.xml and mapred-site.xml and yarn-site.xml.
> Am I correct?
>
> TIA
> Susheel Kumar
>
> On 9/9/14, Bhooshan Mogal  wrote:
> > Hi Demai,
> >
> > conf = new Configuration()
> >
> > will create a new Configuration object and only add the properties from
> > core-default.xml and core-site.xml in the conf object.
> >
> > This is basically a new configuration object, not the same that the
> daemons
> > in the hadoop cluster use.
> >
> >
> >
> > I think what you are trying to ask is if you can get the Configuration
> > object that a daemon in your live cluster (e.g. datanode) is using. I am
> > not sure if the datanode or any other daemon on a hadoop cluster exposes
> > such an API.
> >
> > I would in fact be tempted to get this information from the configuration
> > management daemon instead - in your case cloudera manager. But I am not
> > sure if CM exposes that API either. You could probably find out on the
> > Cloudera mailing list.
> >
> >
> > HTH,
> > Bhooshan
> >
> >
> > On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni  wrote:
> >
> >> hi, Bhooshan,
> >>
> >> thanks for your kind response.  I run the code on one of the data node
> of
> >> my cluster, with only one hadoop daemon running. I believe my java
> client
> >> code connect to the cluster correctly as I am able to retrieve
> >> fileStatus,
> >> and list files under a particular hdfs path, and similar things...
> >> However, you are right that the daemon process use the hdfs-site.xml
> >> under
> >> another folder for cloudera :
> >> /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.
> >>
> >> about " retrieving the info from a live cluster", I would like to get
> the
> >> information beyond the configuration files(that is beyond the .xml
> >> files).
> >> Since I am able to use :
> >> conf = new Configuration()
> >> to connect to hdfs and did other operations, shouldn't I be able to
> >> retrieve the configuration variables?
> >>
> >> Thanks
> >>
> >> Demai
> >>
> >>
> >> On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal <
> bhooshan.mo...@gmail.com>
> >> wrote:
> >>
> >>> Hi Demai,
> >>>
> >>> When you read a property from the conf object, it will only have a
> value
> >>> if the conf object contains that 

Re: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-09 Thread Demai Ni
Susheel actually brought up a good point.

once the client code connects to the cluster, is there a way to get the real
cluster configuration variables/values instead of relying on the .xml files
on the client side?

Demai

On Mon, Sep 8, 2014 at 10:12 PM, Susheel Kumar Gadalay 
wrote:

> One doubt on building Configuration object.
>
> I have a Hadoop remote client and Hadoop cluster.
> When a client submitted a MR job, the Configuration object is built
> from Hadoop cluster node xml files, basically the resource manager
> node core-site.xml and mapred-site.xml and yarn-site.xml.
> Am I correct?
>
> TIA
> Susheel Kumar
>
> On 9/9/14, Bhooshan Mogal  wrote:
> > Hi Demai,
> >
> > conf = new Configuration()
> >
> > will create a new Configuration object and only add the properties from
> > core-default.xml and core-site.xml in the conf object.
> >
> > This is basically a new configuration object, not the same that the
> daemons
> > in the hadoop cluster use.
> >
> >
> >
> > I think what you are trying to ask is if you can get the Configuration
> > object that a daemon in your live cluster (e.g. datanode) is using. I am
> > not sure if the datanode or any other daemon on a hadoop cluster exposes
> > such an API.
> >
> > I would in fact be tempted to get this information from the configuration
> > management daemon instead - in your case cloudera manager. But I am not
> > sure if CM exposes that API either. You could probably find out on the
> > Cloudera mailing list.
> >
> >
> > HTH,
> > Bhooshan
> >
> >
> > On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni  wrote:
> >
> >> hi, Bhooshan,
> >>
> >> thanks for your kind response.  I run the code on one of the data node
> of
> >> my cluster, with only one hadoop daemon running. I believe my java
> client
> >> code connect to the cluster correctly as I am able to retrieve
> >> fileStatus,
> >> and list files under a particular hdfs path, and similar things...
> >> However, you are right that the daemon process use the hdfs-site.xml
> >> under
> >> another folder for cloudera :
> >> /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.
> >>
> >> about " retrieving the info from a live cluster", I would like to get
> the
> >> information beyond the configuration files(that is beyond the .xml
> >> files).
> >> Since I am able to use :
> >> conf = new Configuration()
> >> to connect to hdfs and did other operations, shouldn't I be able to
> >> retrieve the configuration variables?
> >>
> >> Thanks
> >>
> >> Demai
> >>
> >>
> >> On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal <
> bhooshan.mo...@gmail.com>
> >> wrote:
> >>
> >>> Hi Demai,
> >>>
> >>> When you read a property from the conf object, it will only have a
> value
> >>> if the conf object contains that property.
> >>>
> >>> In your case, you created the conf object as new Configuration() --
> adds
> >>> core-default and core-site.xml.
> >>>
> >>> Then you added site.xmls (hdfs-site.xml and core-site.xml) from
> specific
> >>> locations. If none of these files have defined dfs.data.dir, then you
> >>> will
> >>> get NULL. This is expected behavior.
> >>>
> >>> What do you mean by retrieving the info from a live cluster? Even for
> >>> processes like datanode, namenode etc, the source of truth for these
> >>> properties is hdfs-site.xml. It is loaded from a specific location when
> >>> you
> >>> start these services.
> >>>
> >>> Question: Where are you running the above code? Is it on a node which
> >>> has
> >>> other hadoop daemons as well?
> >>>
> >>> My guess is that the path you are referring to (/etc/hadoop/conf.
> >>> cloudera.hdfs/core-site.xml) is not the right path where these config
> >>> properties are defined. Since this is a CDH cluster, you would probably
> >>> be
> >>> best served by asking on the CDH mailing list as to where the right
> path
> >>> to
> >>> these files is.
> >>>
> >>>
> >>> HTH,
> >>> Bhooshan
> >>>
> >>>
> >>> On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni  wrote:
> >>>
> >>>> hi, experts,
> >>&

Re: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-08 Thread Demai Ni
Bhooshan,

Many thanks. I appreciate the help. I will also try out the Cloudera mailing
list/community.

Demai

On Mon, Sep 8, 2014 at 4:58 PM, Bhooshan Mogal 
wrote:

> Hi Demai,
>
> conf = new Configuration()
>
> will create a new Configuration object and only add the properties from
> core-default.xml and core-site.xml in the conf object.
>
> This is basically a new configuration object, not the same that the
> daemons in the hadoop cluster use.
>
>
>
> I think what you are trying to ask is if you can get the Configuration
> object that a daemon in your live cluster (e.g. datanode) is using. I am
> not sure if the datanode or any other daemon on a hadoop cluster exposes
> such an API.
>
> I would in fact be tempted to get this information from the configuration
> management daemon instead - in your case cloudera manager. But I am not
> sure if CM exposes that API either. You could probably find out on the
> Cloudera mailing list.
>
>
> HTH,
> Bhooshan
>
>
> On Mon, Sep 8, 2014 at 3:52 PM, Demai Ni  wrote:
>
>> hi, Bhooshan,
>>
>> thanks for your kind response.  I run the code on one of the data node of
>> my cluster, with only one hadoop daemon running. I believe my java client
>> code connect to the cluster correctly as I am able to retrieve fileStatus,
>> and list files under a particular hdfs path, and similar things...
>> However, you are right that the daemon process use the hdfs-site.xml under
>> another folder for cloudera :
>> /var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.
>>
>> about " retrieving the info from a live cluster", I would like to get the
>> information beyond the configuration files(that is beyond the .xml files).
>> Since I am able to use :
>> conf = new Configuration()
>> to connect to hdfs and did other operations, shouldn't I be able to
>> retrieve the configuration variables?
>>
>> Thanks
>>
>> Demai
>>
>>
>> On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal 
>> wrote:
>>
>>> Hi Demai,
>>>
>>> When you read a property from the conf object, it will only have a value
>>> if the conf object contains that property.
>>>
>>> In your case, you created the conf object as new Configuration() -- adds
>>> core-default and core-site.xml.
>>>
>>> Then you added site.xmls (hdfs-site.xml and core-site.xml) from specific
>>> locations. If none of these files have defined dfs.data.dir, then you will
>>> get NULL. This is expected behavior.
>>>
>>> What do you mean by retrieving the info from a live cluster? Even for
>>> processes like datanode, namenode etc, the source of truth for these
>>> properties is hdfs-site.xml. It is loaded from a specific location when you
>>> start these services.
>>>
>>> Question: Where are you running the above code? Is it on a node which
>>> has other hadoop daemons as well?
>>>
>>> My guess is that the path you are referring to (/etc/hadoop/conf.
>>> cloudera.hdfs/core-site.xml) is not the right path where these config
>>> properties are defined. Since this is a CDH cluster, you would probably be
>>> best served by asking on the CDH mailing list as to where the right path to
>>> these files is.
>>>
>>>
>>> HTH,
>>> Bhooshan
>>>
>>>
>>> On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni  wrote:
>>>
>>>> hi, experts,
>>>>
>>>> I am trying to get the local filesystem directory of data node. My
>>>> cluster is using CDH5.x (hadoop 2.3) and the default configuration. So the
>>>> datanode is under file:///dfs/dn. I didn't specify the value in
>>>> hdfs-site.xml.
>>>>
>>>> My code is something like:
>>>>
>>>> conf = new Configuration()
>>>>
>>>> // test both with and without the following two lines
>>>> conf.addResource (new
>>>> Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
>>>> conf.addResource (new
>>>> Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));
>>>>
>>>> // I also tried get("dfs.datanode.data.dir"), which also return NULL
>>>> String dnDir = conf.get("dfs.data.dir");  // return NULL
>>>>
>>>> It looks like the get only look at the configuration file instead of
>>>> retrieving the info from the live cluster?
>>>>
>>>> Many thanks for your help in advance.
>>>>
>>>> Demai
>>>>
>>>
>>>
>>>
>>> --
>>> Bhooshan
>>>
>>
>>
>
>
> --
> Bhooshan
>


Re: conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-08 Thread Demai Ni
hi, Bhooshan,

thanks for your kind response.  I ran the code on one of the data nodes of
my cluster, with only one hadoop daemon running. I believe my java client
code connects to the cluster correctly, as I am able to retrieve fileStatus,
list files under a particular hdfs path, and do similar things... However,
you are right that the daemon process uses the hdfs-site.xml under another
folder for cloudera:
/var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml.

About "retrieving the info from a live cluster": I would like to get the
information beyond the configuration files (that is, beyond the .xml files).
Since I am able to use
conf = new Configuration()
to connect to hdfs and do other operations, shouldn't I be able to
retrieve the configuration variables?

Thanks

Demai


On Mon, Sep 8, 2014 at 2:40 PM, Bhooshan Mogal 
wrote:

> Hi Demai,
>
> When you read a property from the conf object, it will only have a value
> if the conf object contains that property.
>
> In your case, you created the conf object as new Configuration() -- adds
> core-default and core-site.xml.
>
> Then you added site.xmls (hdfs-site.xml and core-site.xml) from specific
> locations. If none of these files have defined dfs.data.dir, then you will
> get NULL. This is expected behavior.
>
> What do you mean by retrieving the info from a live cluster? Even for
> processes like datanode, namenode etc, the source of truth for these
> properties is hdfs-site.xml. It is loaded from a specific location when you
> start these services.
>
> Question: Where are you running the above code? Is it on a node which has
> other hadoop daemons as well?
>
> My guess is that the path you are referring to (/etc/hadoop/conf.
> cloudera.hdfs/core-site.xml) is not the right path where these config
> properties are defined. Since this is a CDH cluster, you would probably be
> best served by asking on the CDH mailing list as to where the right path to
> these files is.
>
>
> HTH,
> Bhooshan
>
>
> On Mon, Sep 8, 2014 at 11:47 AM, Demai Ni  wrote:
>
>> hi, experts,
>>
>> I am trying to get the local filesystem directory of data node. My
>> cluster is using CDH5.x (hadoop 2.3) and the default configuration. So the
>> datanode is under file:///dfs/dn. I didn't specify the value in
>> hdfs-site.xml.
>>
>> My code is something like:
>>
>> conf = new Configuration()
>>
>> // test both with and without the following two lines
>> conf.addResource (new
>> Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
>> conf.addResource (new
>> Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));
>>
>> // I also tried get("dfs.datanode.data.dir"), which also return NULL
>> String dnDir = conf.get("dfs.data.dir");  // return NULL
>>
>> It looks like the get only look at the configuration file instead of
>> retrieving the info from the live cluster?
>>
>> Many thanks for your help in advance.
>>
>> Demai
>>
>
>
>
> --
> Bhooshan
>


conf.get("dfs.data.dir") return null when hdfs-site.xml doesn't set it explicitly

2014-09-08 Thread Demai Ni
hi, experts,

I am trying to get the local filesystem directory of the data node. My
cluster is using CDH5.x (hadoop 2.3) and the default configuration, so the
datanode directory is under file:///dfs/dn. I didn't specify the value in
hdfs-site.xml.

My code is something like:

Configuration conf = new Configuration();

// tested both with and without the following two lines
conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf.cloudera.hdfs/core-site.xml"));

// I also tried get("dfs.datanode.data.dir"), which also returns NULL
String dnDir = conf.get("dfs.data.dir");  // returns NULL

It looks like the get only looks at the configuration files instead of
retrieving the info from the live cluster?

Many thanks for your help in advance.

Demai
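
A minimal sketch of the workaround discussed in this thread, assuming the CDH
paths mentioned here (adjust them for your install); it only illustrates that
conf.get() sees what the loaded .xml files contain plus an explicit fallback,
not the live cluster state:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class DataDirLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // point at the hdfs-site.xml the DataNode daemon actually loads
    // (path taken from this thread's CDH example)
    conf.addResource(new Path(
        "/var/run/cloudera-scm-agent/process/90-hdfs-DATANODE/hdfs-site.xml"));
    // conf.get() only returns what is in the loaded files, so supply the
    // cluster's known default (file:///dfs/dn on this CDH setup) as a fallback
    String dnDir = conf.get("dfs.datanode.data.dir", "file:///dfs/dn");
    System.out.println("datanode data dir: " + dnDir);
  }
}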


Re: question about matching java API with libHDFS

2014-09-04 Thread Demai Ni
hi, Yi A,

Thanks for your response. I took a look at hdfs.h and hdfs.c; it seems the
lib only exposes some of the APIs, as there are a lot of other public methods
that can be accessed through the java API/client but are not implemented in
libhdfs, such as the one I am using now:
DFSclient.getNamenode().getBlockLocations(...).

Is libhdfs designed to limit the access? Thanks

Demai


On Thu, Sep 4, 2014 at 2:36 AM, Liu, Yi A  wrote:

>  You can refer to the header file “src/main/native/libhdfs/hdfs.h” to see
> the APIs in detail.
>
>
>
> Regards,
>
> Yi Liu
>
>
>
> *From:* Demai Ni [mailto:nid...@gmail.com]
> *Sent:* Thursday, September 04, 2014 5:21 AM
> *To:* user@hadoop.apache.org
> *Subject:* question about matching java API with libHDFS
>
>
>
> hi, folks,
>
> I am currently using java to access HDFS. for example, I am using this API
> " DFSclient.getNamenode().getBlockLocations(...)..." to retrieve file
> block information.
>
> Now I need to move the same logic into C/C++. so I am looking at libHDFS,
> and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. And I am also
> using the hdfs_test.c for some reference. However, I couldn't find a way to
> easily figure out whether above Java API is exposed through libHDFS?
>
> Probably not, since I couldn't find it. Then, it lead to my next question.
> Is there an easy way to plug in the libHDFS framework, to include additonal
> API?
>
> thanks a lot for your suggestions
>
> Demai
>


question about matching java API with libHDFS

2014-09-03 Thread Demai Ni
hi, folks,

I am currently using java to access HDFS. For example, I am using the API
"DFSclient.getNamenode().getBlockLocations(...)" to retrieve file block
information.

Now I need to move the same logic into C/C++, so I am looking at libHDFS
and this wiki page: http://wiki.apache.org/hadoop/LibHDFS. I am also
using hdfs_test.c for some reference. However, I couldn't find a way to
easily figure out whether the above Java API is exposed through libHDFS.

Probably not, since I couldn't find it. That leads to my next question:
is there an easy way to plug into the libHDFS framework to include
additional APIs?

Thanks a lot for your suggestions.

Demai


Re: Local file system to access hdfs blocks

2014-08-29 Thread Demai Ni
Stanley, 

Thanks. 

Btw, I found this jira, HDFS-2246, which probably matches what I am looking for.

Demai on the run

On Aug 28, 2014, at 11:34 PM, Stanley Shi  wrote:

> BP-13-7914115-10.122.195.197-14909166276345 is the blockpool information
> blk_1073742025 is the block name;
> 
> these names are "private" to teh HDFS system and user should not use them, 
> right?
> But if you really want ot know this, you can check the fsck code to see 
> whether they are available;
> 
> 
> On Fri, Aug 29, 2014 at 8:13 AM, Demai Ni  wrote:
>> Stanley and all,
>> 
>> thanks. I will write a client application to explore this path. A quick 
>> question again. 
>> Using the fsck command, I can retrieve all the necessary info
>> $ hadoop fsck /tmp/list2.txt -files -blocks -racks
>> .
>>  BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025 len=8 repl=2
>> [/default/10.122.195.198:50010, /default/10.122.195.196:50010]
>> 
>> However, using getFileBlockLocations(), I can't get the block name/id info, 
>> such as  BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025
>> seem the BlockLocation don't provide the public info here. 
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html
>> 
>> is there another entry point? somethinig fsck is using? thanks 
>> 
>> Demai
>> 
>> 
>> 
>> 
>> On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi  wrote:
>>> As far as I know, there's no combination of hadoop API can do that.
>>> You can easily get the location of the block (on which DN), but there's no 
>>> way to get the local address of that block file.
>>> 
>>> 
>>> 
>>> On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni  wrote:
>>>> Yehia,
>>>> 
>>>> No problem at all. I really appreciate your willingness to help. Yeah. now 
>>>> I am able to get such information through two steps, and the first step 
>>>> will be either hadoop fsck or getFileBlockLocations(). and then search the 
>>>> local filesystem, my cluster is using the default from CDH, which is 
>>>> /dfs/dn
>>>> 
>>>> I would like to it programmatically, so wondering whether someone already 
>>>> done it? or maybe better a hadoop API call already implemented for this 
>>>> exact purpose
>>>> 
>>>> Demai
>>>> 
>>>> 
>>>> On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater  
>>>> wrote:
>>>>> Hi Demai,
>>>>> 
>>>>> Sorry, I missed that you are already tried this out. I think you can 
>>>>> construct the block location on the local file system if you have the 
>>>>> block pool id and the block id. If you are using cloudera distribution, 
>>>>> the default location is under /dfs/dn ( the value of dfs.data.dir, 
>>>>> dfs.datanode.data.dir configuration keys).
>>>>> 
>>>>> Thanks
>>>>> Yehia 
>>>>> 
>>>>> 
>>>>> On 27 August 2014 21:20, Yehia Elshater  wrote:
>>>>>> Hi Demai,
>>>>>> 
>>>>>> You can use fsck utility like the following:
>>>>>> 
>>>>>> hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks
>>>>>> 
>>>>>> This will display all the information you need about the blocks of your 
>>>>>> file.
>>>>>> 
>>>>>> Hope it helps.
>>>>>> Yehia
>>>>>> 
>>>>>> 
>>>>>> On 27 August 2014 20:18, Demai Ni  wrote:
>>>>>>> Hi, Stanley,
>>>>>>> 
>>>>>>> Many thanks. Your method works. For now, I can have two steps approach:
>>>>>>> 1) getFileBlockLocations to grab hdfs BlockLocation[]
>>>>>>> 2) use local file system call(like find command) to match the block to 
>>>>>>> files on local file system .
>>>>>>> 
>>>>>>> Maybe there is an existing Hadoop API to return such info in already?
>>>>>>> 
>>>>>>> Demai on the run
>>>>>>> 
>>>>>>> On Aug 26, 2014, at 9:14 PM, Stanley Shi  wrote:
>>>>>>> 
>>>>>>>> I am not sure this is what you want but you can try this shell command:
>>>>>>>> 
>>>>>>>> find [DATANODE_DIR] -name [blockname]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni  wrote:
>>>>>>>>> Hi, folks,
>>>>>>>>> 
>>>>>>>>> New in this area. Hopefully to get a couple pointers.
>>>>>>>>> 
>>>>>>>>> I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
>>>>>>>>> 
>>>>>>>>> I am wondering whether there is a interface to get each hdfs block 
>>>>>>>>> information in the term of local file system.
>>>>>>>>> 
>>>>>>>>> For example, I can use "Hadoop fsck /tmp/test.txt -files -blocks 
>>>>>>>>> -racks" to get blockID and its replica on the nodes, such as: repl 
>>>>>>>>> =3[ /rack/hdfs01, /rack/hdfs02...]
>>>>>>>>> 
>>>>>>>>>  With such info, is there a way to
>>>>>>>>> 1) login to hfds01, and read the block directly at local file system 
>>>>>>>>> level?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> Demai on the run
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Regards,
>>>>>>>> Stanley Shi,
>>>>>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Regards,
>>> Stanley Shi,
>>> 
> 
> 
> 
> -- 
> Regards,
> Stanley Shi,
> 


Re: Local file system to access hdfs blocks

2014-08-28 Thread Demai Ni
Stanley and all,

Thanks. I will write a client application to explore this path. A quick
question again:
using the fsck command, I can retrieve all the necessary info:
$ hadoop fsck /tmp/list2.txt -files -blocks -racks
.
 *BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025* len=8 repl=2
[/default/10.122.195.198:50010, /default/10.122.195.196:50010]

However, using getFileBlockLocations(), I can't get the block name/id info,
such as
*BP-13-7914115-10.122.195.197-14909166276345:blk_1073742025*; it seems
BlockLocation doesn't provide that info publicly:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/BlockLocation.html

Is there another entry point? Something fsck is using? Thanks

Demai




On Wed, Aug 27, 2014 at 11:09 PM, Stanley Shi  wrote:

> As far as I know, there's no combination of hadoop API can do that.
> You can easily get the location of the block (on which DN), but there's no
> way to get the local address of that block file.
>
>
>
> On Thu, Aug 28, 2014 at 11:54 AM, Demai Ni  wrote:
>
>> Yehia,
>>
>> No problem at all. I really appreciate your willingness to help. Yeah.
>> now I am able to get such information through two steps, and the first step
>> will be either hadoop fsck or getFileBlockLocations(). and then search
>> the local filesystem, my cluster is using the default from CDH, which is
>> /dfs/dn
>>
>> I would like to it programmatically, so wondering whether someone already
>> done it? or maybe better a hadoop API call already implemented for this
>> exact purpose
>>
>> Demai
>>
>>
>> On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater 
>> wrote:
>>
>>> Hi Demai,
>>>
>>> Sorry, I missed that you are already tried this out. I think you can
>>> construct the block location on the local file system if you have the block
>>> pool id and the block id. If you are using cloudera distribution, the
>>> default location is under /dfs/dn ( the value of dfs.data.dir,
>>> dfs.datanode.data.dir configuration keys).
>>>
>>> Thanks
>>> Yehia
>>>
>>>
>>> On 27 August 2014 21:20, Yehia Elshater  wrote:
>>>
>>>> Hi Demai,
>>>>
>>>> You can use fsck utility like the following:
>>>>
>>>> hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks
>>>>
>>>> This will display all the information you need about the blocks of your
>>>> file.
>>>>
>>>> Hope it helps.
>>>> Yehia
>>>>
>>>>
>>>> On 27 August 2014 20:18, Demai Ni  wrote:
>>>>
>>>>> Hi, Stanley,
>>>>>
>>>>> Many thanks. Your method works. For now, I can have two steps approach:
>>>>> 1) getFileBlockLocations to grab hdfs BlockLocation[]
>>>>> 2) use local file system call(like find command) to match the block to
>>>>> files on local file system .
>>>>>
>>>>> Maybe there is an existing Hadoop API to return such info in already?
>>>>>
>>>>> Demai on the run
>>>>>
>>>>> On Aug 26, 2014, at 9:14 PM, Stanley Shi  wrote:
>>>>>
>>>>> I am not sure this is what you want but you can try this shell command:
>>>>>
>>>>> find [DATANODE_DIR] -name [blockname]
>>>>>
>>>>>
>>>>> On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni  wrote:
>>>>>
>>>>>> Hi, folks,
>>>>>>
>>>>>> New in this area. Hopefully to get a couple pointers.
>>>>>>
>>>>>> I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
>>>>>>
>>>>>> I am wondering whether there is a interface to get each hdfs block
>>>>>> information in the term of local file system.
>>>>>>
>>>>>> For example, I can use "Hadoop fsck /tmp/test.txt -files -blocks
>>>>>> -racks" to get blockID and its replica on the nodes, such as: repl =3[
>>>>>> /rack/hdfs01, /rack/hdfs02...]
>>>>>>
>>>>>>  With such info, is there a way to
>>>>>> 1) login to hfds01, and read the block directly at local file system
>>>>>> level?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Demai on the run
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> *Stanley Shi,*
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Regards,
> *Stanley Shi,*
>
>


Re: Local file system to access hdfs blocks

2014-08-27 Thread Demai Ni
Yehia,

No problem at all. I really appreciate your willingness to help. Yeah, now
I am able to get such information in two steps: the first step is either
hadoop fsck or getFileBlockLocations(), and then I search the local
filesystem; my cluster is using the CDH default, which is /dfs/dn.

I would like to do it programmatically, so I am wondering whether someone has
already done it, or, even better, whether there is a hadoop API call already
implemented for this exact purpose.

Demai


On Wed, Aug 27, 2014 at 7:58 PM, Yehia Elshater 
wrote:

> Hi Demai,
>
> Sorry, I missed that you are already tried this out. I think you can
> construct the block location on the local file system if you have the block
> pool id and the block id. If you are using cloudera distribution, the
> default location is under /dfs/dn ( the value of dfs.data.dir,
> dfs.datanode.data.dir configuration keys).
>
> Thanks
> Yehia
>
>
> On 27 August 2014 21:20, Yehia Elshater  wrote:
>
>> Hi Demai,
>>
>> You can use fsck utility like the following:
>>
>> hadoop fsck /path/to/your/hdfs/file -files -blocks -locations -racks
>>
>> This will display all the information you need about the blocks of your
>> file.
>>
>> Hope it helps.
>> Yehia
>>
>>
>> On 27 August 2014 20:18, Demai Ni  wrote:
>>
>>> Hi, Stanley,
>>>
>>> Many thanks. Your method works. For now, I can have two steps approach:
>>> 1) getFileBlockLocations to grab hdfs BlockLocation[]
>>> 2) use local file system call(like find command) to match the block to
>>> files on local file system .
>>>
>>> Maybe there is an existing Hadoop API to return such info in already?
>>>
>>> Demai on the run
>>>
>>> On Aug 26, 2014, at 9:14 PM, Stanley Shi  wrote:
>>>
>>> I am not sure this is what you want but you can try this shell command:
>>>
>>> find [DATANODE_DIR] -name [blockname]
>>>
>>>
>>> On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni  wrote:
>>>
>>>> Hi, folks,
>>>>
>>>> New in this area. Hopefully to get a couple pointers.
>>>>
>>>> I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
>>>>
>>>> I am wondering whether there is a interface to get each hdfs block
>>>> information in the term of local file system.
>>>>
>>>> For example, I can use "Hadoop fsck /tmp/test.txt -files -blocks
>>>> -racks" to get blockID and its replica on the nodes, such as: repl =3[
>>>> /rack/hdfs01, /rack/hdfs02...]
>>>>
>>>>  With such info, is there a way to
>>>> 1) login to hfds01, and read the block directly at local file system
>>>> level?
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Demai on the run
>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> *Stanley Shi,*
>>>
>>>
>>
>


Re: Local file system to access hdfs blocks

2014-08-27 Thread Demai Ni
Hi, Stanley,

Many thanks. Your method works. For now, I can use a two-step approach:
1) getFileBlockLocations to grab the hdfs BlockLocation[]
2) use a local file system call (like the find command) to match the block to
files on the local file system.

Maybe there is an existing Hadoop API that returns such info already?

Demai on the run

On Aug 26, 2014, at 9:14 PM, Stanley Shi  wrote:

> I am not sure this is what you want but you can try this shell command:
> 
> find [DATANODE_DIR] -name [blockname]
> 
> 
> On Tue, Aug 26, 2014 at 6:42 AM, Demai Ni  wrote:
>> Hi, folks,
>> 
>> New in this area. Hopefully to get a couple pointers.
>> 
>> I am using Centos and have Hadoop set up using cdh5.1(Hadoop 2.3)
>> 
>> I am wondering whether there is a interface to get each hdfs block 
>> information in the term of local file system.
>> 
>> For example, I can use "Hadoop fsck /tmp/test.txt -files -blocks -racks" to 
>> get blockID and its replica on the nodes, such as: repl =3[ /rack/hdfs01, 
>> /rack/hdfs02...]
>> 
>>  With such info, is there a way to
>> 1) login to hfds01, and read the block directly at local file system level?
>> 
>> 
>> Thanks
>> 
>> Demai on the run
> 
> 
> 
> -- 
> Regards,
> Stanley Shi,
> 


Local file system to access hdfs blocks

2014-08-25 Thread Demai Ni
Hi, folks,

I am new in this area and hoping to get a couple of pointers.

I am using CentOS and have Hadoop set up using cdh5.1 (Hadoop 2.3).

I am wondering whether there is an interface to get each hdfs block's
information in terms of the local file system.

For example, I can use "hadoop fsck /tmp/test.txt -files -blocks -racks" to get
the blockID and its replicas on the nodes, such as: repl=3 [/rack/hdfs01,
/rack/hdfs02...]

With such info, is there a way to
1) log in to hdfs01, and read the block directly at the local file system level?


Thanks

Demai on the run
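
A sketch of step 1 of the two-step approach the replies above converge on; the
file path is an example, and step 2 stays manual, using the blk_* name reported
by fsck:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/tmp/test.txt")); // example file
    // step 1: which DataNodes hold each block of the file
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
    // step 2 (manual, as discussed in the replies): on one of those hosts, run
    // something like "find /dfs/dn -name 'blk_*'" and match against the blk_*
    // name shown by "hadoop fsck <path> -files -blocks -racks"
  }
}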