Re: Work on research project "Hadoop Security Design"
Please start a new thread for your questions, Poonam. If your data is in plain files then you can do: bin/hadoop dfs -cat file/path You can copy the file to local with: bin/hadoop fs -copyToLocal [-ignorecrc] [-crc] URI On Thu, Feb 28, 2013 at 12:57 PM, POONAM GHULI wrote: > > How to extract data from HDFS file system to our sytem? are there any > commands available? > >> >> > regards > poonam > > -- Nitin Pawar
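A quick end-to-end sketch of the commands Nitin mentions. The HDFS path and local destination are hypothetical examples, and these only do anything against a running cluster:

```shell
# Print a plain-text HDFS file to stdout
bin/hadoop fs -cat /user/poonam/data.txt

# Copy it out of HDFS to the local filesystem
bin/hadoop fs -copyToLocal /user/poonam/data.txt /tmp/data.txt

# 'fs -get' performs the same copy and is the more common spelling
bin/hadoop fs -get /user/poonam/data.txt /tmp/data.txt
```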
Re: Work on research project "Hadoop Security Design"
How do I extract data from the HDFS file system to our local system? Are there any commands available? > > regards poonam
Re: where reduce is copying?
Thanks Harsh, you're always the first to reply. Yeah, that really makes sense: the mappers' outputs are copied inbound to the running reduce task attempt. Still, a speed of 0.44 MB/s seems pretty low to me. I'm trying to work out whether that's because there isn't much data to copy yet (not all the mappers finish at the same time), or whether it's a problem with the network itself (I already checked that bond0 is 1 Gb). Thanks Patai On Wed, Feb 27, 2013 at 11:06 PM, Harsh J wrote: > The latter (from other machines, inbound to where the reduce is > running, onto the reduce's local disk, via mapred.local.dir). The > reduce will, obviously, copy outputs from all maps that may have > produced data for its assigned partition ID. > > On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum > wrote: >> Good evening Hadoopers! >> >> at the jobtracker page, click on a job, and click at running reduce >> task, I am going to see >> >> task_201302271736_0638_r_00 reduce > copy (136 of 261 at 0.44 MB/s) >> >> I am really curious where is the data is being copy. >> if i clicked at the task, it will show a host that is running the task >> attempt. >> >> question is "reduce > copy" is referring data copy outbound from host >> that is running task attempt, or >> referring to data is being copy from other machines inbound to this >> host (that's running task attempt) >> >> and in both cases how do i know what machines that host is copy data from/to? >> >> Regards, >> Patai > > > > -- > Harsh J
Re: where reduce is copying?
The latter (from other machines, inbound to where the reduce is running, onto the reduce's local disk, via mapred.local.dir). The reduce will, obviously, copy outputs from all maps that may have produced data for its assigned partition ID. On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum wrote: > Good evening Hadoopers! > > at the jobtracker page, click on a job, and click at running reduce > task, I am going to see > > task_201302271736_0638_r_00 reduce > copy (136 of 261 at 0.44 MB/s) > > I am really curious where is the data is being copy. > if i clicked at the task, it will show a host that is running the task > attempt. > > question is "reduce > copy" is referring data copy outbound from host > that is running task attempt, or > referring to data is being copy from other machines inbound to this > host (that's running task attempt) > > and in both cases how do i know what machines that host is copy data from/to? > > Regards, > Patai -- Harsh J
where reduce is copying?
Good evening Hadoopers! At the jobtracker page, if I click on a job and then on a running reduce task, I see task_201302271736_0638_r_00 reduce > copy (136 of 261 at 0.44 MB/s) I am really curious where the data is being copied. If I click on the task, it shows a host that is running the task attempt. My question is: does "reduce > copy" refer to data being copied outbound from the host that is running the task attempt, or to data being copied from other machines inbound to this host (the one running the task attempt)? And in both cases, how do I know which machines that host is copying data from/to? Regards, Patai
Re: How to find Replication factor for one particular folder in HDFS
It's "hdfs getconf", not "hdfs -getconf". The first sub-command is not an option arg, generally, when using the hadoop/hdfs/yarn/mapred scripts. On Wed, Feb 27, 2013 at 3:40 PM, Dhanasekaran Anbalagan wrote: > HI YouPeng Yang , > > Hi already configured dfs.replication factor=2 > >>> 1. To get the key from configuration : > /bin/hdfs -getconf -conKey dfs.replication > > hdfs@dvcliftonhera227:~$ hdfs -getconf -conKey dfs.replication > Unrecognized option: -getconf > Could not create the Java virtual machine. > > Please guide me. > > -Dhanasekaran > > > Did I learn something today? If not, I wasted it. > > > On Mon, Feb 25, 2013 at 7:31 PM, YouPeng Yang > wrote: >> >> Hi Dhanasekaran Anbalagan >> >> 1. To get the key from configuration : >> /bin/hdfs -getconf -conKey dfs.replication >> >> >> 2. Maybe you can add the attribute >> <final>true</final> to your dfs.replication : >> >> <property> >> <name>dfs.replication</name> >> <value>2</value> >> <final>true</final> >> </property> >> >> regards. >> >> >> >> 2013/2/26 Nitin Pawar >>> see if the link below helps you >>> >>> >>> http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/ >>> >>> >>> On Mon, Feb 25, 2013 at 10:36 PM, Dhanasekaran Anbalagan >>> wrote: Hi Guys, How to query a particular folder for which replication factor is configured. In my cluster some folders in HDFS are configured with 2 and some of them are configured as three. How to query. please guide me -Dhanasekaran Did I learn something today? If not, I wasted it. >>> >>> >>> >>> >>> -- >>> Nitin Pawar >> >> > -- Harsh J
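Harsh's correction can be sketched concretely. Note that the key-lookup flag appears to be spelled -confKey (the -conKey spelling in the quoted commands looks like a second typo), and the path below is a hypothetical example:

```shell
# Sub-command first, no leading dash
hdfs getconf -confKey dfs.replication

# The replication factor actually set on each file is also visible in a
# listing: the second column of 'hadoop fs -ls' output is the replica count
hadoop fs -ls /user/dhanasekaran
```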
Re: How to take Whole Database From RDBMS to HDFS Instead of Table by Table
Is it a good way to move a total of 5 PB of data through a Java/JDBC program? On Wed, Feb 27, 2013 at 5:56 PM, Michel Segel wrote: > I wouldn't use sqoop if you are taking everything. > Simpler to write your own java/jdbc program that writes its output to HDFS. > > Just saying... > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Feb 27, 2013, at 5:15 AM, samir das mohapatra > wrote: > > thanks all. > > > > On Wed, Feb 27, 2013 at 4:41 PM, Jagat Singh wrote: > >> You might want to read this >> >> >> http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal >> >> >> >> >> On Wed, Feb 27, 2013 at 10:09 PM, samir das mohapatra < >> samir.help...@gmail.com> wrote: >> >>> Hi All, >>> >>>Using sqoop how to take entire database table into HDFS insted of >>> Table by Table ?. >>> >>> How do you guys did it? >>> Is there some trick? >>> >>> Regards, >>> samir. >>> >> >> >
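For reference, the invocation that Jagat's link documents looks roughly like this. The connection string, credentials, and target directory are made-up placeholders:

```shell
# Import every table of one database into HDFS in a single command
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username dbuser -P \
  --warehouse-dir /user/samir/mydb \
  --num-mappers 8
```

For 5 PB, the bottleneck is usually the source database rather than the tool, so the mapper count mostly controls how hard Sqoop hits the RDBMS.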
Re: Work on research project "Hadoop Security Design"
Thank you everyone for your answers. They give me a lot of paths for reflection. Thanks Larry, I'll have to dig more into the "inter-cloud" system my uni is using. Thomas From: Charles Earl To: "user@hadoop.apache.org" Cc: "user@hadoop.apache.org" Sent: Wednesday, 27 February 2013, 16:09 Subject: Re: Work on research project "Hadoop Security Design" Thomas, Interesting distinctions articulated, thanks. I knew of an effort called GatewayFS that had inter-cluster secure federation as one goal; there are some similarities to Knox. C On Feb 27, 2013, at 10:04 AM, Larry McCay wrote: Hi Thomas - > > >I think that you need to articulate the problems that you want to solve for your >university environment. >The subject that you chose indicates "inter-cloud environment" - so depending >on the inter-cloud problems that currently exist for your environment there >may be interesting work from the Rhino effort or with Knox. > > >It seems that you are leaning toward data protection and encryption as a >solution to some problem within your stated problem subject. >I'd be interested in the use case that you are addressing with it that is >"inter-cloud". >Another family of issues that would be interesting in the inter-cloud space >would be various identity federation issues across clouds. > > >@Charles - by GatewayFS do you mean HttpFS and are you asking whether Knox is >related to it? >If so, Knox is not directly related to HttpFS though it will leverage lessons >learned and hopefully the experience of those involved. >The Knox gateway is more transparent and committed to serving REST APIs to >numerous Hadoop services rather than just HDFS. >The pluggable providers of Knox gateway will also facilitate easier >integration with customers' identity infrastructure in on-prem and cloud >provider environments. > > >Hope that helps to draw the distinction between Knox and HttpFS. 
> >thanks, > > >--larry > > >On Wed, Feb 27, 2013 at 9:40 AM, Charles Earl wrote: > >Is this in any way related to GatewayFS? >>I am also curious whether anyone knows of plans to incorporate homomorphic >>encryption or secure multiparty into the rhino effort. >>C >> >> >>On Feb 27, 2013, at 9:30 AM, Nitin Pawar wrote: >> >>I am not sure if you guys have heard it or not >>> >>> >>>HortonWorks is in the process of incubating a new Apache project called Knox for >>>hadoop security. >>>More on this you can look at >>> >>> >>>http://hortonworks.com/blog/introducing-knox-hadoop-security/ >>> >>> >>> >>>http://wiki.apache.org/incubator/knox >>> >>> >>> >>> >>>On Wed, Feb 27, 2013 at 7:51 PM, Thomas Nguy wrote: >>> >>>Thank you very much Panshul, I'll take a look. Thomas. From: Panshul Whisper To: user@hadoop.apache.org; Thomas Nguy Sent: Wednesday, 27 February 2013, 13:53 Subject: Re: Work on research project "Hadoop Security Design" Hello Thomas, you can look into this project. This is exactly what you are doing, but at a larger scale. https://github.com/intel-hadoop/project-rhino/ Hope this helps, Regards, Panshul On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: Hello developers ! > > >I'm a student at the French university "Ensimag" and currently doing my >master research on "Software security". Interested in cloud computing, I >chose for my subject: "Secure hadoop cluster inter-cloud environment". >My idea is to develop a framework in order to improve the security of the >Hadoop cluster running on the cloud of my uni. 
I have started by checking >the "Hadoop research projects" proposed on the Hadoop Wiki and the following >subject fits with mine: > > >"Hadoop Security Design: >An end-to-end proposal for how to support authentication and client side >data encryption/decryption, so that large data sets can be stored in a >public HDFS and only jobs launched by authenticated users can map-reduce >or browse the data" > > >I would like to know if there are already some developers on it so we can >discuss... To be honest, I'm kind of a "beginner" regarding Hadoop and cloud >computing, so it would be really great if you had some advice or hints for >my research. > > >Best regards. Thomas -- Regards, Ouch Whisper 010101010101 >>> >>> >>> >>>-- >>>Nitin Pawar >>> >> >
HDFS Benchmarking tools in Hadoop 2.0.3-alpha
I am unable to locate the TestDFSIO benchmarking jar in the downloaded tar volume. Has it been deprecated? Thanks, -Dheeren Bebortha
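As far as I can tell, TestDFSIO was not deprecated in 2.x; it moved out of the old standalone test jar and into the MapReduce jobclient tests jar. A hedged sketch, assuming the default tarball layout (the exact version suffix will match your download):

```shell
# The benchmark classes live in the *-tests.jar under share/hadoop/mapreduce
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.3-alpha-tests.jar \
  TestDFSIO -write -nrFiles 4 -fileSize 128
```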
Re: Datanodes shutdown and HBase's regionservers not working
Yes, we confirmed that inappropriate use of NFS was leading to high load and the lost heartbeats between cluster members. There was an NFS partition pointing to one virtual machine for some purpose, but the virtual machine shut down frequently. BTW, that NFS partition was not for the backup of NN metadata, just for another temporary purpose, and it has been removed now. The NFS partition (with autofs) for NN metadata backup has no problem. For more info, google "NFS high load"... On Wed, Feb 27, 2013 at 9:58 AM, Jean-Marc Spaggiari wrote: > Hi Davey, > > So were you able to find the issue? > > JM > > 2013/2/25 Davey Yan : >> Hi Nicolas, >> >> I think i found what led to shutdown of all of the datanodes, but i am >> not completely certain. >> I will return to this mail list when my cluster returns to be stable. >> >> On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon wrote: >>> Network error messages are not always friendly, especially if there is a >>> misconfiguration. >>> This said, "connection refused" says that the network connection was made, >>> but that the remote port was not opened on the remote box. I.e. the process >>> was dead. >>> It could be useful to pastebin the whole logs as well... >>> >>> >>> On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan wrote: But... there was no log like "network unreachable". On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon wrote: > I agree. > Then for HDFS, ... > The first thing to check is the network I would say. > > > > > On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan wrote: >> >> Thanks for reply, Nicolas. >> >> My question: What can lead to shutdown of all of the datanodes? >> I believe that the regionservers will be OK if the HDFS is OK. >> >> >> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon >> wrote: >> > Ok, what's your question? >> > When you say the datanode went down, was it the datanode processes or >> > the >> > machines, with both the datanodes and the regionservers? 
>> > >> > The NameNode pings its datanodes every 3 seconds. However it will >> > internally >> > mark the datanodes as dead after 10:30 minutes (even if in the gui >> > you >> > have >> > 'no answer for x minutes'). >> > HBase monitoring is done by ZooKeeper. By default, a regionserver is >> > considered as dead after 180s with no answer. Before, well, it's >> > considered >> > as live. >> > When you stop a regionserver, it tries to flush its data to the disk >> > (i.e. >> > hdfs, i.e. the datanodes). That's why if you have no datanodes, or if >> > a >> > high >> > ratio of your datanodes are dead, it can't shutdown. Connection >> > refused >> > & >> > socket timeouts come from the fact that before the 10:30 minutes hdfs >> > does >> > not declare the nodes as dead, so hbase tries to use them (and, >> > obviously, >> > fails). Note that there is now an intermediate state for hdfs >> > datanodes, >> > called "stale": an intermediary state where the datanode is used only >> > if >> > you >> > have to (i.e. it's the only datanode with a block replica you need). >> > It >> > will >> > be documented in HBase for the 0.96 release. But if all your >> > datanodes >> > are >> > down it won't change much. >> > >> > Cheers, >> > >> > Nicolas >> > >> > >> > >> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan >> > wrote: >> >> >> >> Hey guys, >> >> >> >> We have a cluster with 5 nodes(1 NN and 4 DNs) running for more than >> >> 1 >> >> year, and it works fine. >> >> But the datanodes got shutdown twice in the last month. >> >> >> >> When the datanodes got shutdown, all of them became "Dead Nodes" in >> >> the NN web admin UI(http://ip:50070/dfshealth.jsp), >> >> but regionservers of HBase were still live in the HBase web >> >> admin(http://ip:60010/master-status), of course, they were zombies. >> >> All of the processes of jvm were still running, including >> >> hmaster/namenode/regionserver/datanode. 
>> >> >> >> When the datanodes got shutdown, the load (using the "top" command) >> >> of >> >> slaves became very high, more than 10, higher than normal running. >> >> From the "top" command, we saw that the processes of datanode and >> >> regionserver were comsuming CPU. >> >> >> >> We could not stop the HBase or Hadoop cluster through normal >> >> commands(stop-*.sh/*-daemon.sh stop *). >> >> So we stopped datanodes and regionservers by kill -9 PID, then the >> >> load of slaves returned to normal level, and we start the cluster >> >> again. >> >> >> >> >> >> Log of NN at the shutdown point(All of the DNs were remo
Re: Encryption in HDFS
Excellent! On 02/25/2013 10:43 PM, Mathias Herberts wrote: Encryption without proper key management only addresses the 'stolen hard drive' problem. So far I have not found 100% satisfactory solutions to this hard problem. I've written OSS (Open Secret Server) partly to address this problem in Pig, i.e. accessing encrypted data without embedding key info into the job description file. Proper encrypted data handling implies strict code review though: in the case of Pig, databags are spillable, and you could end up with unencrypted data stored on disk without intent. OSS http://github.com/hbs/oss and the Pig specific code: https://github.com/hbs/oss/blob/master/src/main/java/com/geoxp/oss/pig/PigSecretStore.java On Tue, Feb 26, 2013 at 6:33 AM, Seonyeong Bak wrote: I didn't handle the key distribution problem because I thought that this problem is more difficult. I simply hardcode a key into the code. Challenges related to security are handled in HADOOP-9331, MAPREDUCE-5025, and so on.
Re: Correct way to unzip locally an archive in Yarn
I was using 0.23 and was adding files using the -libjars flag (wanted to upload some jars which were dependencies for my project) but for some reason I could never find it in the DistributedCache or would always keep on getting ClassNotFound on the other side. I took the snippet of code which does that work when you invoke the hadoop jar command and put it in my class and it all worked fine. I don't know if the problem was with my code or if it was a bug. Given that -files is so extensively used, I felt it could be an issue on my side. In the end I started using 1.1 hadoop and so completely forgot about diving more deeper but I can definitely revive it and dig in more. Hopefully you can try using -libjars too and see if that also is facing a similar issue since they both are command line switches which should have almost similar behavior. Thanks, Viral On Tue, Feb 19, 2013 at 10:33 AM, Robert Evans wrote: > Yes if you can trace this down I would be very interested. We are running > 0.23.6 without any issues, but that does not mean that there is not some > bug in the code that is causing this to happen in your situation. > > --Bobby > > From: Sebastiano Vigna > Reply-To: "user@hadoop.apache.org" > Date: Saturday, February 16, 2013 8:39 AM > To: "user@hadoop.apache.org" > Subject: Re: Correct way to unzip locally an archive in Yarn > > I will as soon as I can understand what happens on the cluster (no access > from home). DistributedCache.getLocalCacheFiles() returns in both cases a > local name for the zip file uploaded with -files, but locally my unzip code > works, on the cluster it throws a FileNotFoundException. > > > On 16 February 2013 15:22, Arun C Murthy wrote: > >> This could be a bug, mind opening a jira? Thanks. >> >> On Feb 16, 2013, at 2:34 AM, Sebastiano Vigna wrote: >> >> On 15 February 2013 16:57, Robert Evans wrote: >> >>> Are you trying to run a Map/Reduce job or are you writing a new YARN >>> application? 
If it is a MR job, then it should work mostly the same as >>> before (on 1.x). If you are writing a new YARN application then there is >>> a >>> separate Map in the ContainerLaunchContext that you need to fill in. >> >> >> It's a MapReduce job (0.23.6). After two days of useless trials, I'm >> uploading the zip with -files and I wrote a stub to unzip it manually. I >> was positively unable to get the archive unzipped *to a local directory* in >> any way. >> >> Unfortunately it works in local but not on the cluster. I have still to >> discover why. :( >> >> Ciao, >> >> >> >> >> -- >> Arun C. Murthy >> Hortonworks Inc. >> http://hortonworks.com/ >> >> >> >
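For anyone retrying this, one common pitfall is argument ordering: the generic options have to come after the main class but before the job's own arguments. A sketch with made-up jar, class, and path names:

```shell
# -libjars and -files are parsed by GenericOptionsParser, so they go
# after the main class and before the application's own arguments
hadoop jar myjob.jar com.example.MyDriver \
  -libjars dep1.jar,dep2.jar \
  -files archive.zip \
  /input /output
```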
Re: Custom output value for map function
That's right, the date needs to be written and read in the same order. On Wed, Feb 27, 2013 at 11:04 AM, Paul van Hoven < paul.van.ho...@googlemail.com> wrote: > Great! Thank you. > > I guess the order for writing and reading the data this way is > important. I mean, for > > out.writeUTF("blabla") > out.writeInt(12) > > the following would be correct > > text = in.readUTF(); > number = in.readInt(); > > and this would fail: > > number = in.readInt(); > text = in.readUTF(); > > ? > > 2013/2/27 Sandy Ryza : > > Hi Paul, > > > > To do this, you need to make your Dog class implement Hadoop's Writable > > interface, so that it can be serialized to and deserialized from bytes. > > > http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/io/Writable.html > > > > The methods you implement would look something like this: > > > > public void write(DataOutput out) { > > out.writeDouble(weight); > > out.writeUTF(name); > > out.writeLong(date.getTime()); > > } > > > > public void readFields(DataInput in) { > > weight = in.readDouble(); > > name = in.readUTF(); > > date = new Date(in.readLong()); > > } > > > > hope that helps, > > Sandy > > > > On Wed, Feb 27, 2013 at 10:34 AM, Paul van Hoven > > wrote: > >> > >> The output value in the map function is in most examples for hadoop > >> something like this: > >> > >> public static class Map extends Mapper<inputKey, inputValue, outputKey, outputValue> > >> > >> Normally outputValue is something like Text or IntWritable. > >> > >> I got a custom class with its own properties like > >> > >> public class Dog { > >> string name; > >> Date birthday; > >> double weight; > >> } > >> > >> Now how would I accomplish the following map function: > >> > >> public static class Map extends Mapper<inputKey, inputValue, outputKey, Dog> > >> > >> ? > > > > >
Re: Custom output value for map function
Great! Thank you. I guess the order for writing and reading the data this way is important. I mean, for out.writeUTF("blabla") out.writeInt(12) the following would be correct text = in.readUTF(); number = in.readInt(); and this would fail: number = in.readInt(); text = in.readUTF(); ? 2013/2/27 Sandy Ryza : > Hi Paul, > > To do this, you need to make your Dog class implement Hadoop's Writable > interface, so that it can be serialized to and deserialized from bytes. > http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/io/Writable.html > > The methods you implement would look something like this: > > public void write(DataOutput out) { > out.writeDouble(weight); > out.writeUTF(name); > out.writeLong(date.getTime()); > } > > public void readFields(DataInput in) { > weight = in.readDouble(); > name = in.readUTF(); > date = new Date(in.readLong()); > } > > hope that helps, > Sandy > > On Wed, Feb 27, 2013 at 10:34 AM, Paul van Hoven > wrote: >> >> The output value in the map function is in most examples for hadoop >> something like this: >> >> public static class Map extends Mapper<inputKey, inputValue, outputKey, outputValue> >> >> Normally outputValue is something like Text or IntWritable. >> >> I got a custom class with its own properties like >> >> public class Dog { >> string name; >> Date birthday; >> double weight; >> } >> >> Now how would I accomplish the following map function: >> >> public static class Map extends Mapper<inputKey, inputValue, outputKey, Dog> >> >> ? > >
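The ordering requirement is easy to demonstrate without Hadoop at all, since DataOutput/DataInput are plain java.io: the stream is just a flat byte sequence, so reads must mirror writes exactly. A minimal sketch (class name and values are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteOrderDemo {

    // Write a String then an int, and read them back in the same order.
    static String roundTrip() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("blabla"); // written first...
        out.writeInt(12);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        String text = in.readUTF(); // ...so it must be read first
        int number = in.readInt();
        return text + " " + number;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip()); // prints "blabla 12"

        // Reading in the wrong order misinterprets the bytes: readInt()
        // consumes the 2-byte UTF length header plus two characters, and
        // the following readUTF() then sees a bogus length and fails.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("blabla");
        out.writeInt(12);
        DataInputStream bad =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        try {
            bad.readInt();
            bad.readUTF();
        } catch (IOException e) {
            System.out.println("wrong order: " + e); // EOFException here
        }
    }
}
```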
Large static structures in M/R heap
We have a job that uses a large lookup structure that gets created as a static class during the map setup phase (and we have the JVM reused so this only takes place once). However of late this structure has grown drastically (due to items beyond our control) and we've seen a substantial increase in map time due to the lower available memory. Are there any easy solutions to this sort of problem? My first thought was to see if it was possible to have all tasks for a job execute in parallel within the same JVM, but I'm not seeing any setting that would allow that. Beyond that my only ideas are to move that data into an external one-per-node key-value store like memcached, but I'm worried the additional overhead of sending a query for each value being mapped would also kill the job performance. - Adam
Re: Custom output value for map function
Hi Paul, To do this, you need to make your Dog class implement Hadoop's Writable interface, so that it can be serialized to and deserialized from bytes. http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/io/Writable.html The methods you implement would look something like this: public void write(DataOutput out) { out.writeDouble(weight); out.writeUTF(name); out.writeLong(date.getTime()); } public void readFields(DataInput in) { weight = in.readDouble(); name = in.readUTF(); date = new Date(in.readLong()); } hope that helps, Sandy On Wed, Feb 27, 2013 at 10:34 AM, Paul van Hoven < paul.van.ho...@googlemail.com> wrote: > The output value in the map function is in most examples for hadoop > something like this: > > public static class Map extends Mapper<inputKey, inputValue, outputKey, outputValue> > > Normally outputValue is something like Text or IntWritable. > > I got a custom class with its own properties like > > public class Dog { > string name; > Date birthday; > double weight; > } > > Now how would I accomplish the following map function: > > public static class Map extends Mapper<inputKey, inputValue, outputKey, Dog> > > ? >
Custom output value for map function
The output value in the map function is in most examples for hadoop something like this: public static class Map extends Mapper<inputKey, inputValue, outputKey, outputValue> Normally outputValue is something like Text or IntWritable. I got a custom class with its own properties like public class Dog { string name; Date birthday; double weight; } Now how would I accomplish the following map function: public static class Map extends Mapper<inputKey, inputValue, outputKey, Dog> ?
Re: How to get Under-replicated blocks information [Location]
When you run FSCK, what options did you run it with? If you're using 'hadoop fsck /' then you will typically see a lot of dots being output and amongst those dots you should see missing/corrupt or under-replicated notices. Here's an excerpt containing under replicated blocks, for an example: . /user/blank/.staging/job_201302211222_0011/libjars/common.jar: Under replicated blk_-856998378388135111_3050830. Target Replicas is 10 but found 9 replica(s). ... .. /user/blank/.staging/job_201302211222_0733/job.jar: Under replicated blk_4441280820634984431_3150889. Target Replicas is 10 but found 9 replica(s). ... /user/blank/.staging/job_201302211222_0741/job.jar: Under replicated blk_6336347773734645693_3150957. Target Replicas is 10 but found 9 replica(s). .. /user/blank/.staging/job_201302211222_0755/job.jar: Under replicated blk_-1209937263563068132_3154083. Target Replicas is 10 but found 9 replica(s). . /user/blank/.staging/job_201302211222_0756/job.jar: Under replicated blk_-365467798112255961_3154084. Target Replicas is 10 but found 9 replica(s). [...] The paths you see in the example are the HDFS locations within the cluster. That should be the information you're looking for. Does this help? -Shawn On Wednesday, February 27, 2013 4:17:18 AM UTC-6, Dhanasekaran Anbalagan wrote: > > Hi Guys, > > I am running three machine cluster, with replication factor 2, I got > problem in replica i changed to 2. > after I ran fsck i got Under-replicated blocks: 71 (0.0034828386 %) > > Total size: 105829415143 B > Total dirs: 9704 > Total files: 2038873 (Files currently being written: 2) > Total blocks (validated): 2038567 (avg. 
block size 51913 B) (Total open > file blocks (not validated): 2) > Minimally replicated blocks: 2038567 (100.0 %) > Over-replicated blocks: 0 (0.0 %) > Under-replicated blocks: 71 (0.0034828386 %) > Mis-replicated blocks: 0 (0.0 %) > Default replication factor: 2 > Average block replication: 1.995 > Corrupt blocks: 0 > Missing replicas: 71 (0.0017414198 %) > Number of data-nodes: 2 > Number of racks: 1 > FSCK ended at Wed Feb 27 05:12:31 EST 2013 in 32647 milliseconds > > > Please guide me: what are the locations of those 71 under-replicated blocks in the HDFS file system? > > -Dhanasekaran. > Did I learn something today? If not, I wasted it. >
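To map those 71 counted blocks back to concrete files and datanodes, fsck can print per-file block detail; run it over whatever subtree you care about:

```shell
# List each file with its block IDs and the datanodes holding every replica
hadoop fsck / -files -blocks -locations

# Or just pull out the under-replicated entries, which include the HDFS path
hadoop fsck / | grep -i "under replicated"
```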
Re: Work on research project "Hadoop Security Design"
Thomas, Interesting distinctions articulated, thanks. I knew of an effort called GatewayFS that had inter-cluster secure federation as one goal; there are some similarities to Knox. C On Feb 27, 2013, at 10:04 AM, Larry McCay wrote: > Hi Thomas - > > I think that you need to articulate the problems that you want to solve for your > university environment. > The subject that you chose indicates "inter-cloud environment" - so depending > on the inter-cloud problems that currently exist for your environment there > may be interesting work from the Rhino effort or with Knox. > > It seems that you are leaning toward data protection and encryption as a > solution to some problem within your stated problem subject. > I'd be interested in the use case that you are addressing with it that is > "inter-cloud". > Another family of issues that would be interesting in the inter-cloud space > would be various identity federation issues across clouds. > > @Charles - by GatewayFS do you mean HttpFS and are you asking whether Knox is > related to it? > If so, Knox is not directly related to HttpFS though it will leverage lessons > learned and hopefully the experience of those involved. > The Knox gateway is more transparent and committed to serving REST APIs to > numerous Hadoop services rather than just HDFS. > The pluggable providers of Knox gateway will also facilitate easier > integration with customers' identity infrastructure in on-prem and cloud > provider environments. > > Hope that helps to draw the distinction between Knox and HttpFS. > > thanks, > > --larry > > On Wed, Feb 27, 2013 at 9:40 AM, Charles Earl wrote: >> Is this in any way related to GatewayFS? >> I am also curious whether anyone knows of plans to incorporate homomorphic >> encryption or secure multiparty into the rhino effort. 
>> C >> >> On Feb 27, 2013, at 9:30 AM, Nitin Pawar wrote: >> >>> I am not sure if you guys have heard it or not >>> >>> HortonWorks is in the process of incubating a new Apache project called Knox for >>> hadoop security. >>> More on this you can look at >>> >>> http://hortonworks.com/blog/introducing-knox-hadoop-security/ >>> >>> http://wiki.apache.org/incubator/knox >>> >>> >>> On Wed, Feb 27, 2013 at 7:51 PM, Thomas Nguy wrote: Thank you very much Panshul, I'll take a look. Thomas. From: Panshul Whisper To: user@hadoop.apache.org; Thomas Nguy Sent: Wednesday, 27 February 2013, 13:53 Subject: Re: Work on research project "Hadoop Security Design" Hello Thomas, you can look into this project. This is exactly what you are doing, but at a larger scale. https://github.com/intel-hadoop/project-rhino/ Hope this helps, Regards, Panshul On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: Hello developers ! I'm a student at the French university "Ensimag" and currently doing my master research on "Software security". Interested in cloud computing, I chose for my subject: "Secure hadoop cluster inter-cloud environment". My idea is to develop a framework in order to improve the security of the Hadoop cluster running on the cloud of my uni. I have started by checking the "Hadoop research projects" proposed on the Hadoop Wiki and the following subject fits with mine: "Hadoop Security Design: An end-to-end proposal for how to support authentication and client side data encryption/decryption, so that large data sets can be stored in a public HDFS and only jobs launched by authenticated users can map-reduce or browse the data" I would like to know if there are already some developers on it so we can discuss... To be honest, I'm kind of a "beginner" regarding Hadoop and cloud computing, so it would be really great if you had some advice or hints for my research. Best regards. Thomas -- Regards, Ouch Whisper 010101010101 >>> >>> >>> >>> -- >>> Nitin Pawar >
Re: Work on research project "Hadoop Security Design"
Hi Thomas - I think that you need to articulate the problems that you want to solve for your university environment. The subject that you chose indicates "inter-cloud environment" - so depending on the inter-cloud problems that currently exist for your environment there may be interesting work from the Rhino effort or with Knox. It seems that you are leaning toward data protection and encryption as a solution to some problem within your stated problem subject. I'd be interested in the use case that you are addressing with it that is "inter-cloud". Another family of issues that would be interesting in the inter-cloud space would be various identity federation issues across clouds. @Charles - by GatewayFS do you mean HttpFS and are you asking whether Knox is related to it? If so, Knox is not directly related to HttpFS though it will leverage lessons learned and hopefully the experience of those involved. The Knox gateway is more transparent and committed to serving REST APIs to numerous Hadoop services rather than just HDFS. The pluggable providers of Knox gateway will also facilitate easier integration with customers' identity infrastructure in on-prem and cloud provider environments. Hope that helps to draw the distinction between Knox and HttpFS. thanks, --larry On Wed, Feb 27, 2013 at 9:40 AM, Charles Earl wrote: > Is this in any way related to GatewayFS? > I am also curious whether anyone knows of plans to incorporate > homomorphic encryption or secure multiparty into the rhino effort. > C > > On Feb 27, 2013, at 9:30 AM, Nitin Pawar wrote: > > I am not sure if you guys have heard it or not > > HortonWorks is in process to incubate a new apache project called Knox for > hadoop security. > More on this you can look at > > http://hortonworks.com/blog/introducing-knox-hadoop-security/ > > http://wiki.apache.org/incubator/knox > > > On Wed, Feb 27, 2013 at 7:51 PM, Thomas Nguy wrote: > >> Thank you very much Panshul, I'll take a look. >> >> Thomas. 
>> >> -- >> *De :* Panshul Whisper >> *À :* user@hadoop.apache.org; Thomas Nguy >> *Envoyé le :* Mercredi 27 février 2013 13h53 >> *Objet :* Re: Work on research project "Hadoop Security Design" >> >> Hello Thomas, >> >> you can look into this project. This is exactly what you are doing, but >> at a larger scale. >> https://github.com/intel-hadoop/project-rhino/ >> >> Hope this helps, >> >> Regards, >> Panshul >> >> >> On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: >> >> Hello developers ! >> >> I'm a student at the french university "Ensimag" and currently doing my >> master research on "Software security". Interested by cloud computing, I >> chose for subject : "Secure hadoop cluster inter-cloud environment". >> My idea is to develop a framework in order to improve the security of the >> Hadoop cluster running on the cloud of my uni. I have started by >> checking the "Hadoop research projects" proposed on Hadoop Wiki and the >> following subject fits with mine: >> >> "Hadoop Security Design: >> An end-to-end proposal for how to support authentication and client >> side data encryption/decryption, so that large data sets can be stored in a >> public HDFS and only jobs launched by authenticated users can map-reduce or >> browse the data" >> >> I would like to know if there are already some developers on it so we can >> discuss... To be honest, I'm kinda a "beginner" regarding Hadoop and >> cloud cumputing so if would be really great if you had some advices or >> hints for my research. >> >> Best regards. >> Thomas >> >> >> >> >> -- >> Regards, >> Ouch Whisper >> 010101010101 >> >> >> > > > -- > Nitin Pawar > > >
Re: Work on research project "Hadoop Security Design"
Is this in any way related to GatewayFS? I am also curious whether any one knows of plans to incorporate homomorphic encryption or secure multiparty into the rhino effort. C On Feb 27, 2013, at 9:30 AM, Nitin Pawar wrote: > I am not sure if you guys have heard it or not > > HortonWorks is in process to incubate a new apache project called Knox for > hadoop security. > More on this you can look at > > http://hortonworks.com/blog/introducing-knox-hadoop-security/ > > http://wiki.apache.org/incubator/knox > > > On Wed, Feb 27, 2013 at 7:51 PM, Thomas Nguy wrote: > Thank you very much Panshul, I'll take a look. > > Thomas. > > De : Panshul Whisper > À : user@hadoop.apache.org; Thomas Nguy > Envoyé le : Mercredi 27 février 2013 13h53 > Objet : Re: Work on research project "Hadoop Security Design" > > Hello Thomas, > > you can look into this project. This is exactly what you are doing, but at a > larger scale. > https://github.com/intel-hadoop/project-rhino/ > > Hope this helps, > > Regards, > Panshul > > > On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: > Hello developers ! > > I'm a student at the french university "Ensimag" and currently doing my > master research on "Software security". Interested by cloud computing, I > chose for subject : "Secure hadoop cluster inter-cloud environment". > My idea is to develop a framework in order to improve the security of the > Hadoop cluster running on the cloud of my uni. I have started by checking the > "Hadoop research projects" proposed on Hadoop Wiki and the following subject > fits with mine: > > "Hadoop Security Design: > An end-to-end proposal for how to support authentication and client > side data encryption/decryption, so that large data sets can be stored in a > public HDFS and only jobs launched by authenticated users can map-reduce or > browse the data" > > I would like to know if there are already some developers on it so we can > discuss... 
To be honest, I'm kinda a "beginner" regarding Hadoop and cloud > cumputing so if would be really great if you had some advices or hints for my > research. > > Best regards. > Thomas > > > > -- > Regards, > Ouch Whisper > 010101010101 > > > > > > -- > Nitin Pawar
Re: Work on research project "Hadoop Security Design"
I am not sure if you guys have heard it or not HortonWorks is in process to incubate a new apache project called Knox for hadoop security. More on this you can look at http://hortonworks.com/blog/introducing-knox-hadoop-security/ http://wiki.apache.org/incubator/knox On Wed, Feb 27, 2013 at 7:51 PM, Thomas Nguy wrote: > Thank you very much Panshul, I'll take a look. > > Thomas. > > -- > *De :* Panshul Whisper > *À :* user@hadoop.apache.org; Thomas Nguy > *Envoyé le :* Mercredi 27 février 2013 13h53 > *Objet :* Re: Work on research project "Hadoop Security Design" > > Hello Thomas, > > you can look into this project. This is exactly what you are doing, but at > a larger scale. > https://github.com/intel-hadoop/project-rhino/ > > Hope this helps, > > Regards, > Panshul > > > On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: > > Hello developers ! > > I'm a student at the french university "Ensimag" and currently doing my > master research on "Software security". Interested by cloud computing, I > chose for subject : "Secure hadoop cluster inter-cloud environment". > My idea is to develop a framework in order to improve the security of the > Hadoop cluster running on the cloud of my uni. I have started by checking > the "Hadoop research projects" proposed on Hadoop Wiki and the following > subject fits with mine: > > "Hadoop Security Design: > An end-to-end proposal for how to support authentication and client side > data encryption/decryption, so that large data sets can be stored in a > public HDFS and only jobs launched by authenticated users can map-reduce or > browse the data" > > I would like to know if there are already some developers on it so we can > discuss... To be honest, I'm kinda a "beginner" regarding Hadoop and > cloud cumputing so if would be really great if you had some advices or > hints for my research. > > Best regards. > Thomas > > > > > -- > Regards, > Ouch Whisper > 010101010101 > > > -- Nitin Pawar
Re: Work on research project "Hadoop Security Design"
Thank you very much Panshul, I'll take a look. Thomas. De : Panshul Whisper À : user@hadoop.apache.org; Thomas Nguy Envoyé le : Mercredi 27 février 2013 13h53 Objet : Re: Work on research project "Hadoop Security Design" Hello Thomas, you can look into this project. This is exactly what you are doing, but at a larger scale. https://github.com/intel-hadoop/project-rhino/ Hope this helps, Regards, Panshul On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: Hello developers ! > > >I'm a student at the french university "Ensimag" and currently doing my master >research on "Software security". Interested by cloud computing, I chose for >subject : "Secure hadoop cluster inter-cloud environment". >My idea is to develop a framework in order to improve the security of the >Hadoop cluster running on the cloud of my uni. I have started by checking the >"Hadoop research projects" proposed on Hadoop Wiki and the following subject >fits with mine: > > >"Hadoop Security Design: >An end-to-end proposal for how to support authentication and client side data >encryption/decryption, so that large data sets can be stored in a public HDFS >and only jobs launched by authenticated users can map-reduce or browse the >data" > > >I would like to know if there are already some developers on it so we can >discuss... To be honest, I'm kinda a "beginner" regarding Hadoop and cloud >cumputing so if would be really great if you had some advices or hints for my >research. > > >Best regards.Thomas -- Regards,Ouch Whisper 010101010101
Re: Encryption in HDFS
You can encrypt the splits separately. The issue of key management is actually a layer above this. It looks like the research is on the encryption process with a known key; the layer above would handle key management, which can be done a couple of different ways... On Feb 26, 2013, at 1:52 PM, java8964 java8964 wrote: > I am also interested in your research. Can you share some insight on the > following questions? > > 1) When you use a CompressionCodec, can the encrypted file be split? From my > understanding, there is no encryption scheme that lets the file be decrypted > individually by block, right? For example, if I have a 1GB file encrypted using AES, how > can you decrypt the file block by block, instead of using a single > mapper to decrypt the whole file? > 2) In your CompressionCodec implementation, do you use DecompressorStream > or BlockDecompressorStream? If BlockDecompressorStream, can you share some > examples? Right now, I have problems using BlockDecompressorStream to > do exactly the same thing you did. > 3) Do you have any plan to share your code, especially if you used > BlockDecompressorStream and decrypted the encrypted file block by block > in a Hadoop MapReduce job? > > Thanks > > Yong > > From: render...@gmail.com > Date: Tue, 26 Feb 2013 14:10:08 +0900 > Subject: Encryption in HDFS > To: user@hadoop.apache.org > > Hello, I'm a university student. > > I implemented AES and Triple DES as a CompressionCodec using the Java Cryptography > Architecture (JCA). > The encryption is performed by a client node using the Hadoop API. > Map tasks read blocks from HDFS, and these blocks are decrypted by each map > task. > I tested my implementation against generic HDFS. > My cluster consists of 1 master node and 3 worker nodes, and each > machine has a quad-core processor (i7-2600) and 4GB memory. 
> The test input is 1TB of text, consisting of 32 text files (each > text file is 32GB). > > I expected the encryption to take much more time than generic HDFS, but > the performance does not differ significantly. > The decryption step takes about 5-7% longer than generic HDFS. > The encryption step takes about 20-30% longer than generic HDFS because it is > implemented with a single thread and executed by one client node. > So the encryption could be made faster. > > Might there be an error in my test? > > I know there are several implementations for encrypting files in HDFS. > Are these implementations enough to secure HDFS? > > best regards, > > seonpark > > * Sorry for my bad English
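[Editor's note] Yong's question 1 — whether an encrypted file can still be split and decrypted block by block — hinges on the cipher mode. The toy sketch below (not the poster's code) shows why a counter-style scheme permits this: the keystream for block i is derived from the key and i alone, so a map task holding only its split and the key can decrypt it without reading any other block. AES-CTR behaves the same way; SHA-256 merely stands in for the block cipher to keep the sketch dependency-free.

```python
import hashlib

BLOCK = 16  # toy block size; HDFS blocks would be 64-128 MB

def keystream(key: bytes, index: int) -> bytes:
    # Keystream depends only on (key, block index), not on other blocks.
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(key: bytes, data: bytes) -> bytes:
    out = b""
    for i in range(0, len(data), BLOCK):
        out += xor(data[i:i + BLOCK], keystream(key, i // BLOCK))
    return out

def decrypt_block(key: bytes, block_index: int, block: bytes) -> bytes:
    # Decrypts one block in isolation -- what each map task would do.
    return xor(block, keystream(key, block_index))

key = b"secret"
plain = b"hadoop blocks can be decrypted one at a time"
cipher = encrypt(key, plain)
# Decrypt only the second block, as a mapper assigned that split would:
second = decrypt_block(key, 1, cipher[BLOCK:2 * BLOCK])
print(second)  # b"n be decrypted o"
```

A chained mode like CBC, by contrast, needs the previous ciphertext block as an IV, which is why naive whole-file encryption forces a single mapper to decrypt everything.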
Re: Work on research project "Hadoop Security Design"
Hello Thomas, you can look into this project. This is exactly what you are doing, but at a larger scale. https://github.com/intel-hadoop/project-rhino/ Hope this helps, Regards, Panshul On Wed, Feb 27, 2013 at 1:49 PM, Thomas Nguy wrote: > Hello developers ! > > I'm a student at the french university "Ensimag" and currently doing my > master research on "Software security". Interested by cloud computing, I > chose for subject : "Secure hadoop cluster inter-cloud environment". > My idea is to develop a framework in order to improve the security of the > Hadoop cluster running on the cloud of my uni. I have started by checking > the "Hadoop research projects" proposed on Hadoop Wiki and the following > subject fits with mine: > > "Hadoop Security Design: > An end-to-end proposal for how to support authentication and client side > data encryption/decryption, so that large data sets can be stored in a > public HDFS and only jobs launched by authenticated users can map-reduce or > browse the data" > > I would like to know if there are already some developers on it so we can > discuss... To be honest, I'm kinda a "beginner" regarding Hadoop and > cloud cumputing so if would be really great if you had some advices or > hints for my research. > > Best regards. > Thomas > -- Regards, Ouch Whisper 010101010101
Work on research project "Hadoop Security Design"
Hello developers! I'm a student at the French university Ensimag, currently doing my master's research on software security. Interested in cloud computing, I chose as my subject "Secure Hadoop cluster in an inter-cloud environment". My idea is to develop a framework to improve the security of the Hadoop cluster running on my university's cloud. I started by checking the "Hadoop research projects" proposed on the Hadoop wiki, and the following subject fits mine: "Hadoop Security Design: An end-to-end proposal for how to support authentication and client side data encryption/decryption, so that large data sets can be stored in a public HDFS and only jobs launched by authenticated users can map-reduce or browse the data" I would like to know if there are already some developers on it so we can discuss... To be honest, I'm a beginner regarding Hadoop and cloud computing, so it would be really great if you had some advice or hints for my research. Best regards. Thomas
Re: How to take Whole Database From RDBMS to HDFS Instead of Table/Table
I wouldn't use sqoop if you are taking everything. Simpler to write your own java/jdbc program that writes its output to HDFS. Just saying... Sent from a remote device. Please excuse any typos... Mike Segel On Feb 27, 2013, at 5:15 AM, samir das mohapatra wrote: > thanks all. > > > > On Wed, Feb 27, 2013 at 4:41 PM, Jagat Singh wrote: >> You might want to read this >> >> http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal >> >> >> >> >> On Wed, Feb 27, 2013 at 10:09 PM, samir das mohapatra >> wrote: >>> Hi All, >>> >>>Using sqoop how to take entire database table into HDFS insted of Table >>> by Table ?. >>> >>> How do you guys did it? >>> Is there some trick? >>> >>> Regards, >>> samir. >
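[Editor's note] Segel's "write your own JDBC program" suggestion amounts to: enumerate every table from the database's metadata, then dump each one. A rough sketch of that shape, with stdlib sqlite3 standing in for the JDBC connection and local CSV files standing in for HDFS output (a real job would stream to HDFS, e.g. via `hadoop fs -put` or the WebHDFS REST API):

```python
import csv
import sqlite3
import tempfile

def dump_all_tables(conn: sqlite3.Connection, out_dir: str) -> list[str]:
    cur = conn.cursor()
    # sqlite_master lists user tables; on a real RDBMS you would query
    # JDBC DatabaseMetaData / information_schema instead.
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    written = []
    for table in tables:
        path = f"{out_dir}/{table}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            # Table names come from the catalog above, not user input.
            for row in cur.execute(f"SELECT * FROM {table}"):
                writer.writerow(row)
        written.append(path)
    return written

# Tiny demo with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("INSERT INTO users VALUES (1, 'samir')")
out = tempfile.mkdtemp()
print(dump_all_tables(conn, out))  # one CSV per table
```

Sqoop's `import-all-tables` (linked below in the thread) does essentially this, plus parallel split reads per table.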
Re: How to take Whole Database From RDBMS to HDFS Instead of Table/Table
thanks all. On Wed, Feb 27, 2013 at 4:41 PM, Jagat Singh wrote: > You might want to read this > > > http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal > > > > > On Wed, Feb 27, 2013 at 10:09 PM, samir das mohapatra < > samir.help...@gmail.com> wrote: > >> Hi All, >> >>Using sqoop how to take entire database table into HDFS insted of >> Table by Table ?. >> >> How do you guys did it? >> Is there some trick? >> >> Regards, >> samir. >> > >
Re: How to take Whole Database From RDBMS to HDFS Instead of Table/Table
You might want to read this http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal On Wed, Feb 27, 2013 at 10:09 PM, samir das mohapatra < samir.help...@gmail.com> wrote: > Hi All, > >Using sqoop how to take entire database table into HDFS insted of Table > by Table ?. > > How do you guys did it? > Is there some trick? > > Regards, > samir. >
Re: How to take Whole Database From RDBMS to HDFS Instead of Table/Table
http://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal is your friend Kai Am 27.02.2013 um 12:09 schrieb samir das mohapatra : > Hi All, > >Using sqoop how to take entire database table into HDFS insted of Table by > Table ?. > > How do you guys did it? > Is there some trick? > > Regards, > samir. -- Kai Voigt k...@123.org
How to take Whole Database From RDBMS to HDFS Instead of Table/Table
Hi All, Using Sqoop, how can I take an entire database into HDFS instead of table by table? How did you guys do it? Is there some trick? Regards, samir.
RE: How to get Under-replicated blocks information [Location]
You can get the block locations (hoping that you are asking about which node) in two ways: i) ./hadoop fsck / -files -blocks -locations (command line) ii) NameNode UI (go to the UI, click "Browse the filesystem", then click the files you want to check; the block locations are shown below). From: Dhanasekaran Anbalagan [bugcy...@gmail.com] Sent: Wednesday, February 27, 2013 6:17 PM To: cdh-user; user Subject: How to get Under-replicated blocks information [Location] Hi Guys, I am running a three-machine cluster. I had a problem with replicas, so I changed the replication factor to 2; after I ran fsck I got: Under-replicated blocks: 71 (0.0034828386 %) Total size: 105829415143 B Total dirs: 9704 Total files: 2038873 (Files currently being written: 2) Total blocks (validated): 2038567 (avg. block size 51913 B) (Total open file blocks (not validated): 2) Minimally replicated blocks: 2038567 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 71 (0.0034828386 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 1.995 Corrupt blocks: 0 Missing replicas: 71 (0.0017414198 %) Number of data-nodes: 2 Number of racks: 1 FSCK ended at Wed Feb 27 05:12:31 EST 2013 in 32647 milliseconds Please guide me to the locations of those 71 blocks in the HDFS file system. -Dhanasekaran. Did I learn something today? If not, I wasted it.
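[Editor's note] The per-file output of `hdfs fsck / -files -blocks -locations` can be filtered down to just the under-replicated blocks. A small sketch (not a stock Hadoop tool) of such a filter; the sample text only approximates the real fsck output format, and in practice you would feed the function the captured stdout of the fsck command:

```python
import re

def under_replicated(fsck_output: str) -> dict[str, list[str]]:
    """Map each file owning under-replicated blocks to its datanode locations."""
    result = {}
    current_file = None
    for line in fsck_output.splitlines():
        if line.startswith("/"):
            # File summary lines begin with the HDFS path.
            current_file = line.split()[0]
        if "Under replicated" in line and current_file:
            result.setdefault(current_file, [])
        # With -locations, block lines end in e.g. "repl=1 [10.0.0.5:50010]".
        m = re.search(r"repl=\d+ \[([^\]]+)\]", line)
        if m and current_file in result:
            result[current_file].append(m.group(1))
    return result

sample = """\
/user/a/part-0 1048576 bytes, 1 block(s):  Under replicated blk_1_1. Target Replicas is 2 but found 1 replica(s).
0. blk_1_1 len=1048576 repl=1 [10.0.0.5:50010]
/user/b/part-0 2048 bytes, 1 block(s):  OK
0. blk_2_1 len=2048 repl=2 [10.0.0.5:50010, 10.0.0.6:50010]
"""
print(under_replicated(sample))  # {'/user/a/part-0': ['10.0.0.5:50010']}
```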
Re: QJM HA and ClusterID
Patch available now. Anybody can take a look. Thanks. On Wed, Feb 27, 2013 at 10:46 AM, Azuryy Yu wrote: > Hi Suresh, > > Thanks for your reply. I filed a bug: > https://issues.apache.org/jira/browse/HDFS-4533 > > > > On Wed, Feb 27, 2013 at 9:30 AM, Suresh Srinivas > wrote: >> Looks like start-dfs.sh has a bug. It only takes the -upgrade option and ignores >> the clusterId. >> >> Consider running the command (which is what start-dfs.sh calls): >> bin/hdfs namenode -upgrade -clusterId >> >> Please file a bug, if you can, for start-dfs.sh ignoring the >> additional parameters. >> >> >> On Tue, Feb 26, 2013 at 4:50 PM, Azuryy Yu wrote: >>> Anybody here? Thanks! >>> >>> >>> On Tue, Feb 26, 2013 at 9:57 AM, Azuryy Yu wrote: Hi all, I've been stuck on this question for several days. I want to upgrade my cluster from hadoop-1.0.3 to hadoop-2.0.3-alpha, and I've configured QJM successfully. How can I customize the clusterID myself? It generates a random clusterID now. It doesn't work when I run: start-dfs.sh -upgrade -clusterId 12345-test Thanks! >>> >> >> >> -- >> http://hortonworks.com/download/ >> > >