How to set file permissions internally on hadoop

2012-01-22 Thread praveenesh kumar
Hey guys,

How can I configure HDFS so that I can set permissions on the data internally?
I know there is a parameter called dfs.permissions that needs to be true;
otherwise permissions won't work.

Actually, I had set it to true previously, so that any user could use the HDFS
data to run jobs on it.
Now my requirement is that I want to set permissions on the data, so that
whoever puts the data in is responsible for setting permissions on who can
use it.
I am trying to use the internal hadoop fs -chmod command to change the
permissions, but even after changing the permissions, other users can still
use that data and submit jobs on it.
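
For reference, a minimal sketch (not from the original mail) of the same
chmod/chown done programmatically through the FileSystem API; the path, mode,
owner and group are illustrative, and dfs.permissions must be true in
hdfs-site.xml for the namenode to enforce any of it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestrictDataset {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path data = new Path("/user/praveenesh/dataset");   // illustrative path
        // owner-only access, like: hadoop fs -chmod 700 /user/praveenesh/dataset
        fs.setPermission(data, new FsPermission((short) 0700));
        // make the data owner explicit (chown needs the HDFS superuser)
        fs.setOwner(data, "praveenesh", "datagroup");        // illustrative owner/group
        fs.close();
    }
}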

Thanks,
Praveenesh


Reducer NullPointerException

2012-01-22 Thread burakkk
Hello everyone,
I have 3 servers (1 master, 2 slaves) and I installed cdh3u2 on each
server. I ran the simple wordcount example, but the reducer threw a
NullPointerException. How can I solve this problem?

The error log is the following (the same stack trace was logged for each of the
four failed reduce attempts):

Error: java.lang.NullPointerException
   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2806)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2733)


Thanks
Best Regards


Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs

2012-01-22 Thread Alex Kozlov
Hi Steve, I think I was able to reproduce your problem over the weekend
(not sure though, it may be a different problem).  In my case it was that
the mappers were timing out during the merge phase.  I also think the
related tickets are
MAPREDUCE-2177 and MAPREDUCE-2187.  In
my case I oversubscribed the cluster a bit with respect to the # of
map/reduce slots.

In general, this is quite an unusual workload, as every byte in the original
dataset generates 100x of output very fast.  This workload requires special
tuning:

- undersubscribe the nodes with respect to the # of mappers/reducers (say, use
  only 1/2 of the # of spindles for each *mapred.tasktracker.map.tasks.maximum*
  and *mapred.tasktracker.reduce.tasks.maximum*)
- increase *mapred.reduce.slowstart.completed.maps* (set to ~0.95 so that
  reducers do not interfere with working mappers)
- reduce *mapred.merge.recordsBeforeProgress* (set to 100; the default is 10,000)
- reduce *mapred.combine.recordsBeforeProgress* (set to 100; the default is 10,000)
- decrease the *dfs.block.size* for the input file so that each mapper handles
  less data
- increase the # of reducers so that each reducer handles less data
- increase *io.sort.mb* and child memory to decrease the # of spills (a
  job-level sketch of these settings follows below)
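
A minimal sketch (not from Alex's mail) of how the job-level knobs above might
be applied with the old mapred API used in CDH3; every value here is
illustrative. The per-tasktracker slot counts and dfs.block.size are not
job-level settings, so they are only noted in a comment:

import org.apache.hadoop.mapred.JobConf;

public class TimeoutTuning {
    public static void apply(JobConf conf) {
        // hold reducers back until ~95% of the maps have finished
        conf.set("mapred.reduce.slowstart.completed.maps", "0.95");
        // report progress more often during merge and combine
        conf.set("mapred.merge.recordsBeforeProgress", "100");
        conf.set("mapred.combine.recordsBeforeProgress", "100");
        // larger sort buffer and child heap to cut down on spills
        conf.set("io.sort.mb", "512");
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        // more reducers so that each one handles less data
        conf.setNumReduceTasks(32);
        // Note: mapred.tasktracker.{map,reduce}.tasks.maximum are per-node
        // tasktracker settings (mapred-site.xml on each node), and
        // dfs.block.size only takes effect when the input file is written,
        // so neither can be changed from the job configuration.
    }
}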

Hope this helps.  Let me know.

--
Alex K


On Fri, Jan 20, 2012 at 2:23 PM, Steve Lewis  wrote:

> Interesting - I strongly suspect a disk IO or network problem, since my code
> is very simple and very fast.
> If you add lines to generateSubStrings to limit the String length to 100
> characters (I think it is always that, but this makes sure):
>
> public static String[] generateSubStrings(String inp, int minLength, int maxLength) {
>     // guarantee no more than 100 characters
>     if (inp.length() > 100)
>         inp = inp.substring(0, 100);
>     List<String> holder = new ArrayList<String>();
>     for (int start = 0; start < inp.length() - minLength; start++) {
>         for (int end = start + minLength; end < Math.min(inp.length(), start + maxLength); end++) {
>             try {
>                 holder.add(inp.substring(start, end));
>             }
>             catch (Exception e) {
>                 throw new RuntimeException(e);
>             }
>         }
>     }
>     return holder.toArray(new String[holder.size()]);
> }
>
> On Fri, Jan 20, 2012 at 12:41 PM, Alex Kozlov  wrote:
>
> > Hi Steve, I ran your job on our cluster and it does not time out.  I noticed
> > that each mapper runs for a long time: one way to avoid a timeout is to
> > update a user counter.  As long as this counter is updated within 10
> > minutes, the task should not time out (as MR knows that something is being
> > done).  Normally an output bytes counter would be updated, but if the job
> > is stuck somewhere doing something, it will time out.  I agree that there
> > might be a disk IO or network problem that causes a long wait, but without
> > detailed logs it's hard to tell.
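
For reference, a minimal sketch (not Steve's actual code) of the counter-update
approach Alex describes, written against the old mapred API of that era; the
class name, counter names and substring bounds are illustrative, and
generateSubStrings stands in for the method quoted earlier in this message:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SubstringMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String sub : generateSubStrings(value.toString(), 5, 32)) {
            outKey.set(sub);
            output.collect(outKey, ONE);
            // bump a user counter so the framework sees progress and does not
            // kill the task after mapred.task.timeout (600 s by default)
            reporter.incrCounter("SubstringCount", "substrings emitted", 1);
        }
        reporter.progress();   // explicit progress report as well
    }

    // stand-in for the generateSubStrings(...) method quoted above
    private static String[] generateSubStrings(String inp, int minLength, int maxLength) {
        return new String[0];
    }
}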
> >
> > On a side note, the SubstringCount class should extend Configured.
> >
> > --
> > Alex K
> > 
> >
> > On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel <michael_se...@hotmail.com> wrote:
> >
> > > Steve,
> > > If you want me to debug your code, I'll be glad to set up a billable
> > > contract... ;-)
> > >
> > > What I am willing to do is to help you to debug your code...
> > >
> > > Did you time how long it takes in the Mapper.map() method?
> > > The reason I asked this is to first confirm that you are failing
> within a
> > > map() method.
> > > It could be that you're just not updating your status...
> > >
> > > You said that you are writing many output records for a single input.
> > >
> > > So let's take a look at your code.
> > > Are all writes of the same length? Meaning that in each iteration of
> > > Mapper.map() you will always write K number of rows?
> > >
> > > If so, ask yourself why some iterations are taking longer and longer?
> > >
> > > Note: I'm assuming that the time for each iteration is taking longer
> than
> > > the previous...
> > >
> > > Or am I missing something?
> > >
> > > -Mike
> > >
> > > Sent from a remote device. Please excuse any typos...
> > >
> > > Mike Segel
> > >
> > > On Jan 20, 2012, at 11:16 AM, Steve Lewis wrote:
> > >
> > > > We have been having problems with mappers timing out after 600 sec when
> > > > the mapper writes many more, say thousands of records for every
> > > > input record - even when the code in the mapper is small and fast. I have
> > > > no idea what could cause the system to be so slow and am reluctant to
> > > > raise the 600 sec limit without understanding why there should be a
> > > > timeout when all MY code is very fast.
> > > > I am enclosing a small sample which illustrates the problem