Re: MapReduce shuffling phase

2022-09-01 Thread Ranadip Chatterjee
The reducer nodes handle the shuffle in legacy MapReduce in Hadoop: each
reduce task pulls its partition of the map outputs from the nodes that ran
the maps, then merges and sorts it locally. More modern frameworks have the
option of configuring an external shuffle service, in which case the serving
side of the shuffle can run elsewhere.
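For example, Spark on YARN can delegate shuffle serving to an auxiliary
service inside the NodeManagers instead of the executors. A minimal sketch
(assuming the Spark YARN shuffle jar is on the NodeManager classpath;
property names can vary across versions):

  <!-- yarn-site.xml on every NodeManager -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

  # spark-defaults.conf on the client
  spark.shuffle.service.enabled  true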

On Wed, 31 Aug 2022, 20:42 Pratyush Das,  wrote:

> Hi,
>
> Which node is MapReduce's "Shuffle" phase (the one that aggregates all
> values corresponding to a key) performed on?
>
> The Map phase happens on the datanode containing a block. I assume that
> the Reduce phase happens on some arbitrary free node. But which node is the
> shuffle phase performed on, since it aggregates values from all datanodes
> before passing them to the Reducer?
>
> Is the Shuffle phase performed on the client node?
>
> Thank you,
>
> --
> Pratyush Das
>


Re: Why hdfs don't have current working directory

2017-05-26 Thread Ranadip Chatterjee
You could also use the Pig shell (Grunt), which has always had basic
POSIX-like navigation commands such as pwd and cd.
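For instance, a quick Grunt session looks roughly like this (a sketch; the
exact prompt and output depend on your Pig version):

  $ pig
  grunt> pwd
  grunt> cd /data
  grunt> ls
  grunt> cat /data/somefile.txt
  grunt> quit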

On 26 May 2017 18:55, "Tanvir Rahman"  wrote:

> Hi Sidharth,
> You can check the below HDFS shell tool.
> https://github.com/avast/hdfs-shell
>
> *Feature highlights*
> - The HDFS DFS command starts a JVM for each command call; HDFS Shell does
> it only once, which means a great speed improvement when you need to work
> with HDFS more often
> - Commands can be used in a short way, e.g. *hdfs dfs -ls /* and *ls /*
> both work
> - *HDFS path completion using the TAB key*
> - You can easily add any other HDFS manipulation function
> - There is a command history persisted in a history log
> (~/.hdfs-shell/hdfs-shell.log)
> - Support for relative directories and the *cd* and *pwd* commands
> - It can also be launched as a daemon (using UNIX domain sockets)
> - 100% Java, and it's open source
>
> Thanks
> Tanvir
>
>
>
> On Fri, May 26, 2017 at 9:09 AM, Sidharth Kumar <
> sidharthkumar2...@gmail.com> wrote:
>
>> Thanks, I'll check it out.
>>
>>
>> Sidharth
>>
>> On 26-May-2017 4:10 PM, "Hariharan"  wrote:
>>
>>> The concept of working directory is only useful for processes, and HDFS
>>> does not have executables. I guess what you're looking for is absolute vs
>>> relative paths (so that you can do something like hdfs cat foo instead of
>>> hdfs cat /user/me/foo). HDFS does have this to a limited extent - if your
>>> path is not absolute, it is relative from your home directory (or root if
>>> there is no home directory for your user).
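>>> A quick illustration of that (a sketch, assuming your HDFS home directory
>>> is /user/me and already exists):
>>>
>>>   $ hdfs dfs -mkdir foo              # relative: creates /user/me/foo
>>>   $ hdfs dfs -put localfile.txt foo  # copies into /user/me/foo/
>>>   $ hdfs dfs -ls foo                 # same listing as: hdfs dfs -ls /user/me/foo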
>>>
>>> Thanks,
>>> Hariharan
>>>
>>> On Fri, May 26, 2017 at 3:44 PM, Sidharth Kumar <
>>> sidharthkumar2...@gmail.com> wrote:
>>>
 Hi,

 Can you kindly explain to me why HDFS doesn't have a current working
 directory concept? Why is Hadoop not implemented to use pwd? Why can
 commands like cd and pwd not be implemented in HDFS?

 Regards
 Sidharth
 Mob: +91 819799
 LinkedIn: www.linkedin.com/in/sidharthkumar2792

>>>
>>>
>
>
> --
> *Mohammad Tanvir Rahman*
> Teaching Assistant, University of Houston
> 713 628 3571 | tanvir9982...@gmail.com | http://www2.cs.uh.edu/~tanvir/
>
>


Re: Simple MapReduce logic using Java API

2015-04-01 Thread Ranadip Chatterjee
Swallowing the IOException in the mapper looks suspicious to me. That can
silently consume the input without producing any output. Also check the map
task's stdout logs for your console print output.

As an aside, since you are not doing anything in the reduce, try setting the
number of reduce tasks to 0. That will force the job to be map-only and make
it simpler.
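Something like this in the driver (a sketch using the standard Job API):

    job.setNumReduceTasks(0);  // map-only: mapper output goes straight to the output path, no shuffle/sort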

Regards,
Ranadip
On 31 Mar 2015 19:23, Shahab Yunus shahab.yu...@gmail.com wrote:

 What is the reason for using the queue?
 job.getConfiguration().set("mapred.job.queue.name", "exp_dsa");

 Are your mapper and reducer even being called?

 Try adding the @Override annotation to the map/reduce methods, as below:

 @Override
 public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
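 For reference, here is a sketch of what both signatures look like when they
 really do override the new-API (org.apache.hadoop.mapreduce) base classes;
 note that reduce() takes an Iterable of values, not a single Text:

     public class UUIDMapper extends Mapper<Object, Text, Text, Text> {
         @Override
         public void map(Object key, Text value, Context context)
                 throws IOException, InterruptedException {
             // ... your lookup logic ...
         }
     }

     public class UUIDReducer extends Reducer<Text, Text, Text, Text> {
         @Override
         public void reduce(Text key, Iterable<Text> values, Context context)
                 throws IOException, InterruptedException {
             // pass each value through unchanged
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }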

 Regards,
 Shahab

 On Tue, Mar 31, 2015 at 3:26 AM, bradford li bradfor...@gmail.com wrote:

 I'm not sure why my Mapper and Reducer produce no output. The logic behind
 my code is: given a file of UUIDs (newline separated), I want to use
 `globStatus` to list the paths of all potential files that each UUID might
 be in, then open and read each file. Each file contains 1-n lines of JSON.
 The UUID is in `event_header.event_id` in the JSON.

 Right now the MapReduce job runs without errors. However, something is
 wrong because I don't get any output. I'm also not sure how to debug
 MapReduce jobs. If someone could point me to a resource on that, it would be
 awesome!
 The expected output from this program should be

 UUID_1 1
 UUID_2 1
 UUID_3 1
 UUID_4 1
 ...
 ...
 UUID_n 1

 In my logic, the output file should list the UUIDs with a 1 next to them,
 because a 1 is written when the UUID is found and a 0 when it is not. They
 should all be 1's because I pulled the UUIDs from the source.

 My Reducer currently does not do anything; I just wanted to see if I could
 get some simple logic working. There are most likely bugs in my code, as I
 don't have an easy way to debug MapReduce jobs.

 Driver:

 public class SearchUUID {

     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = Job.getInstance(conf, "UUID Search");
         job.getConfiguration().set("mapred.job.queue.name", "exp_dsa");
         job.setJarByClass(SearchUUID.class);
         job.setMapperClass(UUIDMapper.class);
         job.setReducerClass(UUIDReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }


 UUIDMapper:

 public class UUIDMapper extends Mapper<Object, Text, Text, Text> {
     public void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {

         try {
             Text one = new Text("1");
             Text zero = new Text("0");

             FileSystem fs = FileSystem.get(new Configuration());
             FileStatus[] paths = fs.globStatus(new Path("/data/path/to/file/d_20150330-1650"));
             for (FileStatus path : paths) {
                 BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path.getPath())));
                 String json_string = br.readLine();
                 while (json_string != null) {
                     JsonElement jelement = new JsonParser().parse(json_string);
                     JsonObject jsonObject = jelement.getAsJsonObject();
                     jsonObject = jsonObject.getAsJsonObject("event_header");
                     jsonObject = jsonObject.getAsJsonObject("event_id");

                     if (value.toString().equals(jsonObject.getAsString())) {
                         System.out.println(value.toString() + "slkdjfksajflkjsfdkljsadfk;ljasklfjklasjfklsadl;sjdf");
                         context.write(value, one);
                     } else {
                         context.write(value, zero);
                     }

                     json_string = br.readLine();
                 }
             }
         } catch (IOException failed) {
         }
     }
 }


 Reducer:

 public class UUIDReducer extends Reducer<Text, Text, Text, Text> {

     public void reduce(Text key, Text value, Context context)
             throws IOException, InterruptedException {
         context.write(key, value);
     }
 }





Re: [External] Re: HDFS Block Bad Response Error

2015-03-23 Thread Ranadip Chatterjee
You could check which file that block belongs to by running:

$ hadoop fsck / -files -blocks | grep blk_1084609656_11045296 -B 2


On 20 March 2015 at 14:56, Shipper, Jay [USA] shipper_...@bah.com wrote:


  I just checked the input data and the output data (what the job managed
 to output before failing), and there are no bad blocks in either.

   From: Ranadip Chatterjee ranadi...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Thursday, March 19, 2015 3:51 AM
 To: user@hadoop.apache.org user@hadoop.apache.org
 Subject: [External] Re: HDFS Block Bad Response Error

   Have you tried hdfs fsck command to try and catch any inconsistencies
 with that block?
 On 16 Mar 2015 19:39, Shipper, Jay [USA] shipper_...@bah.com wrote:

  On a Hadoop 2.4.0 cluster, I have a job running that's encountering the
 following warnings in one of its map tasks (IPs changed, but otherwise,
 this is verbatim):

  ---
  2015-03-16 06:59:37,994 WARN [ResponseProcessor for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609656_11045296]
 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
 exception  for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609656_11045296
 java.io.EOFException: Premature EOF: no length prefix available
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1990)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
 2015-03-16 06:59:37,994 WARN [ResponseProcessor for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295]
 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
 exception  for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295
 java.io.IOException: Bad response ERROR for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295 from datanode
 10.0.0.1:1019
 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
  ---

  This job is launched from Hive 0.13.0, and it's consistently happening
 on the same split, which is on a sequence file.  After logging a few errors
 like the above, the map task seems to make no progress and eventually times
 out (with a mapreduce.task.timeout value greater than 5 hours).

  Any pointers on how to begin troubleshooting and resolving this issue?
 In searching around, it was suggested that this is indicative of a network
 issue, but as it happens on the same split consistently, that explanation
 seems unlikely.




-- 
Regards,
Ranadip Chatterjee


Re: HDFS Block Bad Response Error

2015-03-19 Thread Ranadip Chatterjee
Have you tried hdfs fsck command to try and catch any inconsistencies with
that block?
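For example (a sketch; the path here is just a placeholder for the file
behind the failing split):

    $ hdfs fsck /path/to/suspect/file -files -blocks -locations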
On 16 Mar 2015 19:39, Shipper, Jay [USA] shipper_...@bah.com wrote:

  On a Hadoop 2.4.0 cluster, I have a job running that's encountering the
 following warnings in one of its map tasks (IPs changed, but otherwise,
 this is verbatim):

  ---
  2015-03-16 06:59:37,994 WARN [ResponseProcessor for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609656_11045296]
 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
 exception  for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609656_11045296
 java.io.EOFException: Premature EOF: no length prefix available
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1990)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
 2015-03-16 06:59:37,994 WARN [ResponseProcessor for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295]
 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
 exception  for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295
 java.io.IOException: Bad response ERROR for block
 BP-437460642-10.0.0.1-1391018641114:blk_1084609655_11045295 from datanode
 10.0.0.1:1019
 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
  ---

  This job is launched from Hive 0.13.0, and it's consistently happening
 on the same split, which is on a sequence file.  After logging a few errors
 like the above, the map task seems to make no progress and eventually times
 out (with a mapreduce.task.timeout value greater than 5 hours).

  Any pointers on how to begin troubleshooting and resolving this issue?
 In searching around, it was suggested that this is indicative of a network
 issue, but as it happens on the same split consistently, that explanation
 seems unlikely.



Re: Can't find map or reduce logs when a job ends.

2015-03-16 Thread Ranadip Chatterjee
Is the job history server up and running on the right host and port? If so,
please check the job history server logs. A common cause is that the job
history server's owner does not have read permission on the aggregated logs,
or that the MapReduce process owners do not have write permission on the job
history log location. Either error should show up in the job history server
logs.
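As a cross-check, since log aggregation is enabled you can also pull the task
logs directly with the yarn CLI, bypassing the history UI; a sketch using one
of the application ids from your listing in [3]:

    $ yarn logs -applicationId application_1426267324367_0005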

Ranadip
On 13 Mar 2015 18:07, xeonmailinglist-gmail xeonmailingl...@gmail.com
wrote:

  Hi,

 With this configuration in MapReduce (see [1] and [2]), I can't see the
 map and reduce logs of the job when it ends. When I try to look at the
 history, I get the error "Not Found: job_1426267326549_0005". But if I list
 the log dir in HDFS (see [3]), I have some logs about the job, but no logs
 about the map or reduce tasks.

 Why can't I see the map and reduce logs? Am I missing some configuration?

 [1] configuration in mapred-site.xml

 <configuration>
   <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
   </property>
   <property>
     <name>mapreduce.jobhistory.address</name>
     <value>hadoop-coc-1:10020</value>
   </property>
   <property>
     <name>mapreduce.jobhistory.webapp.address</name>
     <value>hadoop-coc-1:19888</value>
   </property>
   <property>
     <name>mapreduce.jobhistory.max-age-ms</name>
     <value>180</value>
   </property>
 </configuration>

 [2] configuration in yarn-site.xml

 xubuntu@hadoop-coc-1:~/Programs/hadoop$ cat etc/hadoop/yarn-site.xml
 <?xml version="1.0"?>
 <configuration>

 <!-- Site specific YARN configuration properties -->
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
   <property>
     <name>yarn.log-aggregation-enable</name>
     <value>true</value>
   </property>
   <property>
     <name>yarn.nodemanager.remote-app-log-dir</name>
     <value>/app-logs</value>
   </property>

 [3] List logs in hadoop

 xubuntu@hadoop-coc-1:~/Programs/hadoop$ hdfs dfs -ls /app-logs/xeon/logs/
 Java HotSpot(TM) Client VM warning: You have loaded library 
 /home/xubuntu/Programs/hadoop-2.6.0/lib/native/libhadoop.so which might have 
 disabled stack guard. The VM will try to fix the stack guard now.
 It's highly recommended that you fix the library with 'execstack -c 
 libfile', or link it with '-z noexecstack'.
 15/03/13 13:58:06 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 Found 4 items
 drwxrwx---   - xeon supergroup  0 2015-03-13 13:35 
 /app-logs/xeon/logs/application_1426267324367_0002
 drwxrwx---   - xeon supergroup  0 2015-03-13 13:37 
 /app-logs/xeon/logs/application_1426267324367_0003
 drwxrwx---   - xeon supergroup  0 2015-03-13 13:44 
 /app-logs/xeon/logs/application_1426267324367_0004
 drwxrwx---   - xeon supergroup  0 2015-03-13 13:47 
 /app-logs/xeon/logs/application_1426267324367_0005

 ​





Re: Time taken by -copyFromLocalHost for transferring data

2015-02-21 Thread Ranadip Chatterjee
$ time hadoop fs -put <local file> <hdfs path>

For small files, I would expect the time to have a significant variance
between runs. For larger files, it should be more consistent (since the
throughput will be bound by the network bandwidth of the local machine).
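If you want a repeatable benchmark, you can generate a file of a known size
first; a rough sketch, assuming a Linux client (paths are just examples):

    $ dd if=/dev/zero of=/tmp/1g.bin bs=1M count=1024
    $ time hadoop fs -put /tmp/1g.bin /tmp/1g.bin

At around 1 GiB the elapsed time is usually dominated by network throughput
rather than JVM startup, so repeated runs should agree fairly well.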
On 21 Feb 2015 08:43, tesm...@gmail.com tesm...@gmail.com wrote:

 Hi,

 How can I measure the time taken by -copyFromLocalHost for transferring my
 data from local host to HDFS?

 Regards,
 Tariq



Re: Encryption At Rest Question

2015-02-20 Thread Ranadip Chatterjee
In the case of an SSL-enabled cluster, the DEK will be encrypted on the wire
by the SSL/TLS layer.

In the case of a non-SSL cluster, it is not. But an interceptor only gets the
DEK and not the encrypted data, so the data is still safe. Only if the
interceptor also manages to gain access to the encrypted data block and
associate it with the corresponding DEK is the data compromised. Given that
each HDFS file has a different DEK, an interceptor has to gain quite a bit of
access before the data is at risk.
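For context, the DEK/EDEK flow above only applies to files inside an
encryption zone, which is set up roughly like this (a sketch of the standard
key and crypto commands; the key name and zone path are just examples):

    $ hadoop key create myKey
    $ hdfs dfs -mkdir /secure
    $ hdfs crypto -createZone -keyName myKey -path /secure
    $ hdfs crypto -listZones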

On 18 February 2015 at 00:04, Plamen Jeliazkov 
plamen.jeliaz...@wandisco.com wrote:

 Hey guys,

 I had a question about the new file encryption work done primarily in
 HDFS-6134.

 I was just curious, how is the DEK protected on the wire?
 Particularly after the KMS decrypts the EDEK and returns it to the client.

 Thanks,
 -Plamen







-- 
Regards,
Ranadip Chatterjee


Re: Hadoop Security Community

2015-01-26 Thread Ranadip Chatterjee
Hi Adam,

I am interested in collaborating on this. I am currently working for a large
financial institution, where security is a bit of a pain in the neck, so this
is a major focus area for me at the moment.

Regards,
Ranadip

On 26 January 2015 at 18:32, mirko.kaempf mirko.kae...@gmail.com wrote:

 Dear Adam,

 I am interested in collaborating on this. I work with Cloudera and teach
 Hadoop courses, such as the Administrator course. I am learning about
 security implementations and think a common benchmark would be great for the
 community. What are the requirements for contributions? I volunteer as an
 editor and documentation writer.

 Best wishes,
 Mirko


 Sent from Samsung Mobile


  -------- Original message --------
 From: Adam Montville adam.montvi...@cisecurity.org
 Date: 26.01.2015 18:12 (GMT+00:00)
 To: user@hadoop.apache.org
 Cc:
 Subject: Hadoop Security Community

 All:

  The Center for Internet Security (CIS) has established a Community
 focused on defining a configuration benchmark for Hadoop.  We are in the
 early stages of benchmark development, and hope that you will consider
 joining the effort.  Over the course of the next several days a draft
 benchmark will be made available covering “basic” security configuration
 items pertaining primarily to HDFS.

  If you are interested in participating, as a contributor or editor, in
 this effort, please contact me.

  Kind regards,

  *Adam Montville, CISA, CISSP*
 *Technical Product Executive*
 *Security Controls and Automation*
 *Center for Internet Security*

 *31 Tech Valley Drive, Suite 2 East Greenbush, NY 12061*
 *www.cisecurity.org*
 *Follow us @CISecurity *






-- 
Regards,
Ranadip Chatterjee