Re: where are the old hadoop documentations for v0.22.0 and below ?

2014-07-30 Thread Harsh J
Jane,

The tarball includes generated release documentation pages as well.
Did you download and look inside?

~> tar tf hadoop-0.22.0.tar.gz | grep cluster_setup | grep html
hadoop-0.22.0/common/docs/cluster_setup.html

On Wed, Jul 30, 2014 at 11:24 PM, Jane Wayne  wrote:
> harsh, those are just javadocs. i'm talking about the full documentations
> (see original post).
>
>
> On Tue, Jul 29, 2014 at 2:17 PM, Harsh J  wrote:
>
>> Precompiled docs are available in the archived tarballs of these
>> releases, which you can find on:
>> https://archive.apache.org/dist/hadoop/common/
>>
>> On Tue, Jul 29, 2014 at 1:36 AM, Jane Wayne 
>> wrote:
>> > where can i get the old hadoop documentation (e.g. cluster setup, xml
>> > configuration params) for hadoop v0.22.0 and below? i downloaded the
>> source
>> > and binary files but could not find the documentations as a part of the
>> > archive file.
>> >
>> > on the home page at http://hadoop.apache.org/, i only see documentations
>> > for the following versions.
>> > - current, stable, 1.2.1, 2.2.0, 2.4.1, 0.23.11
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: where are the old hadoop documentations for v0.22.0 and below ?

2014-07-29 Thread Harsh J
Precompiled docs are available in the archived tarballs of these
releases, which you can find on:
https://archive.apache.org/dist/hadoop/common/

On Tue, Jul 29, 2014 at 1:36 AM, Jane Wayne  wrote:
> where can i get the old hadoop documentation (e.g. cluster setup, xml
> configuration params) for hadoop v0.22.0 and below? i downloaded the source
> and binary files but could not find the documentations as a part of the
> archive file.
>
> on the home page at http://hadoop.apache.org/, i only see documentations
> for the following versions.
> - current, stable, 1.2.1, 2.2.0, 2.4.1, 0.23.11



-- 
Harsh J


Re: HDFS mounting issue using Hadoop-Fuse on Fully Distributed Cluster?

2014-05-03 Thread Harsh J
Can you check your dmesg | tail output to see if there are any error
messages from the HDFS fuse client?

On Sat, May 3, 2014 at 11:44 PM, Preetham Kukillaya  wrote:
> Hi,
> I'm also getting the same error, i.e. ?- ? ? ? ?? hdfs
> after mounting the hadoop file system as root. Please can you advise how
> to fix this? This issue is happening irrespective of the CDH version.
>
>
>
>
> On Tuesday, 11 June 2013 23:10:50 UTC-7, Mohammad Reza Gerami wrote:
>>
>> Dear All
>> I have the same problem!
>> I have a small hadoop cluster (version 1.1.2) and I want to use an hdfs
>> folder like a general directory.
>> I installed fuse version 2.9.2,
>> but when I want to mount this directory, I have a problem:
>>
>> cat /etc/fstab
>> /usr/local/hadoop/build/contrib/fuse-dfs/fuse_dfs#dfs://hb:9000 /export/hdfs fuse usetrash,rw 0 0
>> [root@hb ~]# mount -a
>> port=9000,server=hb
>> fuse-dfs didn't recognize /export/hdfs,-2
>> fuse-dfs ignoring option dev
>> fuse-dfs ignoring option suid
>> fuse: bad mount point `/export/hdfs': Input/output error
>>
>> [root@hb ~]# ll /export/
>> total 0
>> ?- ? ? ? ?? hdfs
>>
>> appreciate your help
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "CDH Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cdh-user+unsubscr...@cloudera.org.
> For more options, visit https://groups.google.com/a/cloudera.org/d/optout.



-- 
Harsh J


Re: Changing default scheduler in hadoop

2014-04-13 Thread Harsh J
Hi,

On Sun, Apr 13, 2014 at 10:47 AM, Mahesh Khandewal
 wrote:
> Sir i am using Hadoop 1.1.2
> I don't know where the code for the default scheduler resides.

Doing a simple 'find' in the source checkout for name pattern
'Scheduler' should reveal pretty relevant hits. We do name our Java
classes seriously :)

https://github.com/apache/hadoop-common/blob/release-1.1.2/src/mapred/org/apache/hadoop/mapred/JobQueueTaskScheduler.java

> I want to change the default scheduler to fair how can i do this??

You can override the 'mapred.jobtracker.taskScheduler' property in
mapred-site.xml to specify a custom scheduler, or a supplied one such
as the Fair Scheduler
[http://hadoop.apache.org/docs/r1.1.2/fair_scheduler.html] or the
Capacity Scheduler
[http://hadoop.apache.org/docs/r1.1.2/capacity_scheduler.html].

> And if i want to get back to default scheduler how can i do this?

Remove the configuration override, and it will go back to the default
FIFO-based scheduler, the one whose source is linked above.

> I have been struggling for 4 months to get help on Apache Hadoop??

Are you unsure about this?

-- 
Harsh J


Re: hdfs permission is still being checked after being disabled

2014-03-07 Thread Harsh J
Did you restart your NameNode after making the configuration change that
turns permissions off?

On Thu, Mar 6, 2014 at 10:29 PM, Jane Wayne  wrote:
> i am using hadoop v2.3.0.
>
> in my hdfs-site.xml, i have the following property set.
>
>  <property>
>   <name>dfs.permissions.enabled</name>
>   <value>false</value>
>  </property>
>
> however, when i try to run a hadoop job, i see the following
> AccessControlException.
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
> Permission denied: user=hadoopuser, access=EXECUTE,
> inode="/tmp":root:supergroup:drwxrwx---
>
> to me, it seems that i have already disabled permission checking, so i
> shouldn't get that AccessControlException.
>
> any ideas?



-- 
Harsh J


Re: rack awarness unexpected behaviour

2013-08-22 Thread Harsh J
I'm not aware of a bug in 0.20.2 that would not honor rack awareness,
but have you done the two checks below as well?

1. Ensuring JT has the same rack awareness scripts and configuration
so it can use it for scheduling, and,
2. Checking if the map and reduce tasks are being evenly spread across
both racks.

On Thu, Aug 22, 2013 at 2:50 PM, Marc Sturlese  wrote:
> I'm on cdh3u4 (0.20.2), gonna try to read a bit on this bug
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/rack-awareness-unexpected-behaviour-tp4086029p4086049.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



-- 
Harsh J


Re: hadoop v0.23.9, namenode -format command results in Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode

2013-08-11 Thread Harsh J
I don't think you ought to be using HADOOP_HOME anymore.

Try "unset HADOOP_HOME" and then "export HADOOP_PREFIX=/opt/hadoop"
and retry the NN command.

On Sun, Aug 11, 2013 at 8:50 AM, Jane Wayne  wrote:
> hi,
>
> i have downloaded and untarred hadoop v0.23.9. i am trying to set up a
> single node instance to learn this version of hadoop. also, i am following,
> as best as i can, the instructions at
> http://hadoop.apache.org/docs/r0.23.9/hadoop-project-dist/hadoop-common/SingleCluster.html
> .
>
> when i attempt to run ${HADOOP_HOME}/bin/hdfs namenode -format, i get the
> following error.
>
> Error: Could not find or load main class
> org.apache.hadoop.hdfs.server.namenode.NameNode
>
> the instructions in the link above are incomplete. they jump right in and
> say, "assuming you have installed hadoop-common/hadoop-hdfs..." what does
> this assumption even mean? how do we install hadoop-common and hadoop-hdfs?
>
> right now, i am running on CentOS 6.4 x64 minimal. my steps are the
> following.
>
> 0. installed jdk 1.7 (Oracle)
> 1. tar xfz hadoop-0.23.9.tar.gz
> 2. mv hadoop-0.23.9 /opt
> 3. ln -s /opt/hadoop-0.23.9 /opt/hadoop
> 4. export HADOOP_HOME=/opt/hadoop
> 5. export JAVA_HOME=/opt/java
> 6. export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
>
> any help is appreciated.



-- 
Harsh J


Re: "KrbException: Could not load configuration from SCDynamicStore" in Eclipse on Mac

2013-06-16 Thread Harsh J
Anil,

Please try the options provided at
https://issues.apache.org/jira/browse/HADOOP-7489.

Essentially, pass JVM system properties (in Eclipse you'll edit the
Run Configuration for this) and add
"-Djava.security.krb5.realm=yourrealm
-Djava.security.krb5.kdc=yourkdc", and also ensure your Mac is
configured for the cluster's Kerberos (i.e. via a krb5.conf or similar).
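For reference, a minimal sketch of the same idea done programmatically; the realm and KDC values below are placeholders for your cluster's actual settings, and the properties must be set before any Kerberos-related class is first initialized, so the -D options in the Run Configuration remain the safer route:

public class KerberosPropsExample {
  public static void main(String[] args) throws Exception {
    // Equivalent to -Djava.security.krb5.realm=... -Djava.security.krb5.kdc=...
    // Placeholder values; substitute your cluster's realm and KDC host.
    System.setProperty("java.security.krb5.realm", "EXAMPLE.COM");
    System.setProperty("java.security.krb5.kdc", "kdc.example.com");

    // ... continue with the usual HBase/Hadoop client setup here ...
  }
}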

On Mon, Jun 17, 2013 at 9:56 AM, anil gupta  wrote:
> Hi All,
>
> I am trying to connect to a secure Hadoop/HBase cluster. I wrote a java
> class that connects to the secure cluster and creates a table with presplit
> regions. I am running this class from eclipse itself. But, i keep on
> getting the following exception:
> Caused by: KrbException: Could not load configuration from SCDynamicStore
> at
> sun.security.krb5.SCDynamicStoreConfig.getConfig(SCDynamicStoreConfig.java:64)
> at sun.security.krb5.Config.loadStanzaTable(Config.java:125)
> at sun.security.krb5.Config.<init>(Config.java:176)
> at sun.security.krb5.Config.getInstance(Config.java:79)
>
> I have googled this problem but i cannot find a solution for fixing the
> "SCDynamicStore" related problem in Eclipse on Mac. I want to run this code
> from eclipse. Please let me know if anyone knows the trick to resolve this
> problem on Mac. This is really annoying problem.
>
> --
> Thanks & Regards,
> Anil Gupta



-- 
Harsh J


Re: Changing the maximum tasks per node on a per job basis

2013-05-24 Thread Harsh J
Yes, you're correct that the end-result is not going to be as static
as you expect it to be. FWIW, per node limit configs have been
discussed before (and even implemented + removed):
https://issues.apache.org/jira/browse/HADOOP-5170

On Fri, May 24, 2013 at 1:47 PM, Steve Lewis  wrote:
> My reading on Capacity Scheduling is that it controls the number of jobs
> scheduled at the level of the cluster.
> My issue is not sharing at the level of the cluster - usually my job is the
> only one running but rather at the level of
> the individual machine.
>   Some of my jobs require more memory and do significant processing -
> especially in the reducer - While the cluster can schedule 8 smaller jobs
> on a node when, say, 8  of the larger ones are scheduled slaves run out of
> swap space and tend to crash.
>   It is not clear that limiting the number of jobs on the cluster will
> stop a scheduler from scheduling the maximum allowed jobs on any node.
>   Even requesting multiple slots for a job affects the number of jobs
> running on the cluster but not on any specific node.
>   Am I wrong here? If I want, say only three of my jobs running on one node
> does asking for enough slots to guarantee the total jobs is no more than 3
> times the number of nodes guarantee this?
>My read is that the total running jobs might be throttled but not the
> number per node.
>   Perhaps a clever use of queues might help but I am not quite sure about
> the details
>
>
> On Thu, May 23, 2013 at 4:37 PM, Harsh J  wrote:
>
>> Your problem seems to surround available memory and over-subscription. If
>> you're using a 0.20.x or 1.x version of Apache Hadoop, you probably want to
>> use the CapacityScheduler to address this for you.
>>
>> I once detailed how-to, on a similar question here:
>> http://search-hadoop.com/m/gnFs91yIg1e
>>
>>
>> On Wed, May 22, 2013 at 2:55 PM, Steve Lewis 
>> wrote:
>>
>> > I have a series of Hadoop jobs to run - one of my jobs requires larger
>> than
>> > standard memory
>> > I allow the task to use 2GB of memory. When I run some of these jobs the
>> > slave nodes are crashing because they run out of swap space. It is not that
>> > a slave could not run one, or even 4, of these jobs, but 8 stresses the
>> > limits.
>> >  I could cut the mapred.tasktracker.reduce.tasks.maximum for the entire
>> > cluster but this cripples the whole cluster for one of many jobs.
>> > It seems to be a very bad design
>> > a) to allow the job tracker to keep assigning tasks to a slave that is
>> > already getting low on memory
>> > b) to allow the user to run jobs capable of crashing nodes on the cluster
>> > c) not to allow the user to specify that some jobs need to be limited to a
>> > lower value without requiring this limit for every job.
>> >
>> > Are there plans to fix this??
>> >
>> > --
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com



--
Harsh J


Re: Changing the maximum tasks per node on a per job basis

2013-05-23 Thread Harsh J
Your problem seems to surround available memory and over-subscription. If
you're using a 0.20.x or 1.x version of Apache Hadoop, you probably want to
use the CapacityScheduler to address this for you.

I once detailed how-to, on a similar question here:
http://search-hadoop.com/m/gnFs91yIg1e


On Wed, May 22, 2013 at 2:55 PM, Steve Lewis  wrote:

> I have a series of Hadoop jobs to run - one of my jobs requires larger than
> standard memory
> I allow the task to use 2GB of memory. When I run some of these jobs the
> slave nodes are crashing because they run out of swap space. It is not that
> a slave could not run one, or even 4, of these jobs, but 8 stresses the
> limits.
>  I could cut the mapred.tasktracker.reduce.tasks.maximum for the entire
> cluster but this cripples the whole cluster for one of many jobs.
> It seems to be a very bad design
> a) to allow the job tracker to keep assigning tasks to a slave that is
> already getting low on memory
> b) to allow the user to run jobs capable of crashing nodes on the cluster
> c) not to allow the user to specify that some jobs need to be limited to a
> lower value without requiring this limit for every job.
>
> Are there plans to fix this??
>
> --
>



-- 
Harsh J


Re: How can I kill a file with bad permissions

2013-05-20 Thread Harsh J
You should be able to change permissions at will, as the owner of the
entry. There is certainly no bug there, as demonstrated below in HDFS
2.x:

➜  ~  hadoop fs -ls
Found 1 item
drwxr-xr-x   - harsh harsh          0 2013-04-10 07:03 bin
➜  ~  hadoop fs -chmod 411 bin
➜  ~  hadoop fs -ls bin
ls: Permission denied: user=harsh, access=READ_EXECUTE,
inode="/user/harsh/bin":harsh:harsh:dr----x--x
➜  ~  hadoop fs -chmod 755 bin
➜  ~  hadoop fs -ls bin
Found 2 items
-rw-r--r--   3 harsh harsh 693508 2013-04-10 07:03 bin/wordcount
-rw-r--r--   3 harsh harsh 873736 2013-04-10 06:56 bin/wordcount-simple
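If you prefer to repair such a directory through the same Java API that set it, here is a minimal sketch; it assumes the client configuration's default filesystem points at the cluster, and the path below is a hypothetical stand-in for the affected directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestorePermissions {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path standing in for the directory with broken permissions.
    Path dir = new Path("/user/steve/brokendir");
    // 0755 == rwxr-xr-x; as the owning user you can always set this back.
    fs.setPermission(dir, new FsPermission((short) 0755));
    fs.close();
  }
}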



On Mon, May 20, 2013 at 6:32 PM, Steve Lewis  wrote:

> In trying to set file permissions using the Java API I managed to set the
> permissions on a directory to
> drx--x
>
> Now I can neither change them nor get rid of the file
> I tried fs -rmr  but  I get permission issues
>
>
>
> --
>



-- 
Harsh J


Re: Running Hadoop client as a different user

2013-05-17 Thread Harsh J
 or disable webhdfs. Defaults to false
> </property>
> <property>
>   <name>dfs.support.append</name>
>   <value>true</value>
>   <description>Enable or disable append. Defaults to false</description>
> </property>
>
> Here is the relevant section of core-site.xml
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>simple</value>
>   <description>
>     Set the authentication for the cluster. Valid values are: simple or
>     kerberos.
>   </description>
> </property>
>
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>false</value>
>   <description>
>     Enable authorization for different protocols.
>   </description>
> </property>
>
> <property>
>   <name>hadoop.security.groups.cache.secs</name>
>   <value>14400</value>
> </property>
>
> <property>
>   <name>hadoop.kerberos.kinit.command</name>
>   <value>/usr/kerberos/bin/kinit</value>
> </property>
>
> <property>
>   <name>hadoop.http.filter.initializers</name>
>   <value>org.apache.hadoop.http.lib.StaticUserWebFilter</value>
> </property>
>
> On Mon, May 13, 2013 at 5:26 PM, Harsh J  wrote:
>
>> Hi Steve,
>>
>> A normally-written client program would work normally on both
>> permissions and no-permissions clusters. There is no concept of a
>> "password" for users in Apache Hadoop as of yet, unless you're dealing
>> with a specific cluster that has custom-implemented it.
>>
>> Setting a specific user is not the right way to go. In secure and
>> non-secure environments both, the user is automatically inferred by
>> the user actually running the JVM process - it's better to simply rely
>> on this.
>>
>> An AccessControlException occurs when a program tries to write or
>> alter a defined path where it lacks permission. To bypass this, the
>> HDFS administrator needs to grant you access to such defined paths,
>> rather than you having to work around that problem.
>>
>> On Mon, May 13, 2013 at 3:25 PM, Steve Lewis 
>> wrote:
>> > -- I have been running Hadoop on a cluster set to not check permissions.
>> I
>> > would run a java client on my local machine and would run as the local
>> user
>> > on the cluster.
>> >
>> > I say
>> >   String connectString = "hdfs://" + host + ":" + port + "/";
>> >   Configuration config = new Configuration();
>> >   config.set("fs.default.name", connectString);
>> >   FileSystem fs = FileSystem.get(config);
>> > The above code works
>> > I am trying to port to a cluster where permissions are checked - I have
>>  an
>> > account but need to set a user and password to avoid Access Exceptions
>> >
>> > How do I do this and If I can only access certain directories how do I do
>> > that?
>> >
>> > Also are there some directories my code MUST be able to access outside
>> > those for my user only?
>> >
>> > Steven M. Lewis PhD
>> > 4221 105th Ave NE
>> > Kirkland, WA 98033
>> > 206-384-1340 (cell)
>> > Skype lordjoe_com
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com



-- 
Harsh J


Re: Running Hadoop client as a different user

2013-05-13 Thread Harsh J
Hi Steve,

A normally-written client program would work normally on both
permissions and no-permissions clusters. There is no concept of a
"password" for users in Apache Hadoop as of yet, unless you're dealing
with a specific cluster that has custom-implemented it.

Setting a specific user is not the right way to go. In secure and
non-secure environments both, the user is automatically inferred by
the user actually running the JVM process - it's better to simply rely
on this.

An AccessControlException occurs when a program tries to write or
alter a defined path where it lacks permission. To bypass this, the
HDFS administrator needs to grant you access to such defined paths,
rather than you having to work around that problem.

On Mon, May 13, 2013 at 3:25 PM, Steve Lewis  wrote:
> -- I have been running Hadoop on a cluster set to not check permissions. I
> would run a java client on my local machine and would run as the local user
> on the cluster.
>
> I say
>   String connectString = "hdfs://" + host + ":" + port + "/";
>   Configuration config = new Configuration();
>   config.set("fs.default.name", connectString);
>   FileSystem fs = FileSystem.get(config);
> The above code works
> I am trying to port to a cluster where permissions are checked - I have  an
> account but need to set a user and password to avoid Access Exceptions
>
> How do I do this and If I can only access certain directories how do I do
> that?
>
> Also are there some directories my code MUST be able to access outside
> those for my user only?
>
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com



-- 
Harsh J


Re: Is disk use reported with replication?

2013-04-23 Thread Harsh J
Hi Keith,

The "fs -du" computes length of files, and would not report replicated
on-disk size. HDFS disk utilization OTOH, is the current, simple
report of used/free disk space, which would certainly include
replicated data.
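A small sketch that shows both numbers side by side from the Java API (the path is a placeholder, and the default filesystem is assumed to point at the cluster): getLength() is the logical size that "fs -du" adds up, while getSpaceConsumed() counts raw bytes across all replicas, which is closer to what the NameNode UI reflects.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuVersusRaw {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path; replace with the directory you are measuring.
    ContentSummary cs = fs.getContentSummary(new Path("/user/keith/data"));
    // Logical file lengths -- the figure "hadoop fs -du" sums up.
    System.out.println("length (bytes)         = " + cs.getLength());
    // Raw bytes on DataNode disks across all replicas -- roughly the subtree's
    // contribution to the "DFS Used" number on the NameNode web UI.
    System.out.println("space consumed (bytes) = " + cs.getSpaceConsumed());
    fs.close();
  }
}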

On Mon, Apr 22, 2013 at 10:49 PM, Keith Wiley  wrote:
> Simple question: When I issue a "hadoop fs -du" command and/or when I view 
> the namenode web UI to see HDFS disk utilization (which the namenode reports 
> both as bytes and percentage), should I expect to see disk use reported as 
> "true data size" or as replicated size (e.g. with 3X replication, should I 
> expect reported values to be three times higher than the actual underlying 
> data itself)?
>
> Thanks.
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "I used to be with it, but then they changed what it was.  Now, what I'm with
> isn't it, and what's it seems weird and scary to me."
>--  Abe (Grandpa) Simpson
> 
>



-- 
Harsh J


Re: How can I unsubscribe from this list?

2013-03-16 Thread Harsh J
From your email header:

List-Unsubscribe: <mailto:common-user-unsubscr...@hadoop.apache.org>

On Wed, Mar 13, 2013 at 10:42 AM, Alex Luya  wrote:
> can't find a way to unsubscribe from this list.



--
Harsh J


Re: Locks in HDFS

2013-02-22 Thread Harsh J
Hi Abhishek,

I fail to understand what you mean by that, but HDFS generally has no
client-exposed file locking on reads. There are leases for preventing
multiple writers to a single file, but nothing on the read side.

Replication of the blocks under a file is a different concept and is
completely unrelated to this.

This needs to be built at your application's/stack's access/control
levels, since HDFS does not provide this.

On Fri, Feb 22, 2013 at 9:33 PM, abhishek  wrote:
> Harsh,
>
> Can we load the file into HDFS with one replication and lock the file.
>
> Regards
> Abhishek
>
>
> On Feb 22, 2013, at 1:03 AM, Harsh J  wrote:
>
>> HDFS does not have such a client-side feature, but your applications
>> can use Apache ZooKeeper to coordinate and implement this on their own
>> - it can be used to achieve distributed locking. While at ZooKeeper,
>> also check out https://github.com/Netflix/curator which makes using it
>> for common needs very easy.
>>
>> On Fri, Feb 22, 2013 at 5:17 AM, abhishek  wrote:
>>>
>>>> Hello,
>>>
>>>> How can I impose read lock, for a file in HDFS
>>>>
>>>> So that only one user (or) one application , can access file in hdfs at 
>>>> any point of time.
>>>>
>>>> Regards
>>>> Abhi
>>>
>>> --
>>
>>
>>
>> --
>> Harsh J
>>
>> --
>>
>>
>>



--
Harsh J


Re: Locks in HDFS

2013-02-21 Thread Harsh J
HDFS does not have such a client-side feature, but your applications
can use Apache ZooKeeper to coordinate and implement this on their own
- it can be used to achieve distributed locking. While at ZooKeeper,
also check out https://github.com/Netflix/curator which makes using it
for common needs very easy.
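As a rough illustration only (this is a sketch built on ZooKeeper, not an HDFS feature): a cooperative lock using Curator's InterProcessMutex recipe. The connect string and lock path are placeholders; at the time of this thread the classes lived under the com.netflix.curator packages rather than the later org.apache.curator ones, and every application has to agree to acquire the same lock path for the exclusion to mean anything.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CooperativeHdfsLock {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper quorum; the lock lives in ZooKeeper, not in HDFS.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // All cooperating applications agree on this znode path per HDFS file.
    InterProcessMutex lock = new InterProcessMutex(client, "/locks/data/file1");
    lock.acquire();
    try {
      // Read (or write) the HDFS file here; other cooperating clients block
      // in acquire() until release() is called.
    } finally {
      lock.release();
      client.close();
    }
  }
}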

On Fri, Feb 22, 2013 at 5:17 AM, abhishek  wrote:
>
>> Hello,
>
>> How can I impose read lock, for a file in HDFS
>>
>> So that only one user (or) one application , can access file in hdfs at any 
>> point of time.
>>
>> Regards
>> Abhi
>
> --
>
>
>



--
Harsh J


Re: Running hadoop on directory structure

2013-02-15 Thread Harsh J
You should be able to use
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
to achieve this. It accepts subdirectory creation (under the main job
output directory). However, the special chars may be an issue (e.g. -,
=, etc.), for which you'll either need
https://issues.apache.org/jira/browse/MAPREDUCE-2293 or a custom hack
to bypass that inbuilt restriction.

Alternatively, also look at
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
(not sure about the subdir parts here, but worth checking out). Note that
this class isn't present in the newer MR API (having been replaced by
the aforementioned MultipleOutputs).
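A minimal reducer sketch using the newer mapreduce API's MultipleOutputs (an assumption on my part, since the links above are to the older mapred API): the key is taken to already carry the "dt=.../ts=..." prefix, and each write goes to a baseOutputPath below the job's output directory. Whether the '=' characters pass through untouched is exactly the restriction mentioned above, so verify against your version.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionedOutputReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Hypothetical convention: key.toString() is e.g. "dt=2013-02-15/ts=1360887451",
    // so every record lands under the matching subdirectory of the job output dir.
    for (Text value : values) {
      mos.write(key, value, key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}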

On Sat, Feb 16, 2013 at 12:11 AM, Max Lebedev  wrote:
> Hi, I am a CS undergraduate working with hadoop. I wrote a library to process
> logs, my input directory has the following structure:
>
> logs_hourly
> ├── dt=2013-02-15
> │   ├── ts=1360887451
> │   │   └── syslog-2013-02-15-1360887451.gz
> │   └── ts=1360891051
> │   └── syslog-2013-02-15-1360891051.gz
> ├── dt=2013-02-14
> │   ├── ts= 1360801050
> │   │   └── syslog-2013-02-14-1360801050.gz
> │   └── ts=1360804651
> │   └── syslog-2013-02-14-1360804651.gz
>
> Where dt is the day and ts is the hour when the log was created.
>
> Currently, the code takes an input directory (or a range of input
> directories) such as dt=2013-02-15 and goes through every file in every
> subdirectory sequentially with a loop. This process is slow and I think that
> running the code on the files in parallel would be more efficient. Is there
> any where that I could use Hadoop's MapReduce on a directory such as
> dt=2013-02-15 and receive the same directory structure as output?
>
> Thanks,
> Max Lebedev
>
>
>
> --
> View this message in context: 
> http://hadoop.6.n7.nabble.com/Running-hadoop-on-directory-structure-tp67904.html
> Sent from the common-user mailing list archive at Nabble.com.



--
Harsh J


Re: The method setMapperClass(Class) in the type Job is not applicable for the arguments

2013-02-09 Thread Harsh J
Your import line of Mapper has the issue. It is imported as below,
which can probably be removed as it is unnecessary:

import org.apache.hadoop.mapreduce.Mapper.Context;

But you also would need the line below for the actual class to get found:

import org.apache.hadoop.mapreduce.Mapper;

Adding that line to your list of imports should help resolve it.

I recall some IDEs being odd about these things sometimes, and not
reporting the real error correctly. Were you using one? Eclipse?
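A compile-only sketch of the import block and skeleton being described; the class and package layout is made up for illustration, and VectorWritable is assumed to be Mahout's org.apache.mahout.math.VectorWritable:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;        // the import that was missing
import org.apache.mahout.math.VectorWritable;     // assumption: Mahout's VectorWritable

public class CsatAnalysis {

  public static class MapClass
      extends Mapper<LongWritable, Text, Text, VectorWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... parsing and emit logic as before ...
    }
  }
}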

On Sat, Feb 9, 2013 at 11:47 PM, Ronan Lehane  wrote:
> Hi Harsh,
>
> Thanks for getting back so quickly.
>
> The full source code is attached as there's nothing sensitive in it.
> Coding wouldn't be my strong point so apologies in advance if it looks a
> mess.
>
> Thanks
>
>
> On Sat, Feb 9, 2013 at 6:09 PM, Harsh J  wrote:
>>
>> Whatever "csatAnalysis.MapClass" the compiler picked up, it appears to
>> not be extending the org.apache.hadoop.mapreduce.Mapper class. From
>> your snippets it appears that you have it all defined properly though.
>> A common issue here has also been that people accidentally import the
>> wrong API (mapred.*) but that doesn't seem to be the case either.
>>
>> Can you post your full compilable source somewhere? Remove any logic
>> you don't want to share - we'd mostly be interested in the framework
>> definition parts alone.
>>
>> On Sat, Feb 9, 2013 at 11:27 PM, Ronan Lehane 
>> wrote:
>> > Hi All,
>> >
>> > I hope this is the right forum for this type of question so my apologies
>> > if
>> > not.
>> >
>> > I'm looking to write a map reduce program which is giving me the
>> > following
>> > compilation error:
>> > The method setMapperClass(Class<? extends Mapper>) in the type Job is not
>> > applicable for the arguments (Class<csatAnalysis.MapClass>)
>> >
>> > The components involved are:
>> >
>> > 1. Setting the Mapper
>> > //Set the Mapper for the job. Calls MapClass.class
>> > job.setMapperClass(MapClass.class);
>> >
>> > 2. Setting the inputFormat to TextInputFormat
>> > //An InputFormat for plain text files. Files are broken into
>> > lines.
>> > //Either linefeed or carriage-return are used to signal end of
>> > line.
>> > //Keys are the position in the file, and values are the line of
>> > text..
>> > job.setInputFormatClass(TextInputFormat.class);
>> >
>> > 3. Taking the data into the mapper and processing it
>> > public static class MapClass extends Mapper<LongWritable, Text, Text,
>> > VectorWritable> {
>> > public void map (LongWritable key, Text value,Context context)
>> > throws IOException, InterruptedException {
>> >
>> > Would anyone have any clues as to what would be wrong with the
>> > arguments
>> > being passed to the Mapper?
>> >
>> > Any help would be appreciated,
>> >
>> > Thanks.
>>
>>
>>
>> --
>> Harsh J
>
>



--
Harsh J


Re: The method setMapperClass(Class) in the type Job is not applicable for the arguments

2013-02-09 Thread Harsh J
Whatever "csatAnalysis.MapClass" the compiler picked up, it appears to
not be extending the org.apache.hadoop.mapreduce.Mapper class. From
your snippets it appears that you have it all defined properly though.
A common issue here has also been that people accidentally import the
wrong API (mapred.*) but that doesn't seem to be the case either.

Can you post your full compilable source somewhere? Remove any logic
you don't want to share - we'd mostly be interested in the framework
definition parts alone.

On Sat, Feb 9, 2013 at 11:27 PM, Ronan Lehane  wrote:
> Hi All,
>
> I hope this is the right forum for this type of question so my apologies if
> not.
>
> I'm looking to write a map reduce program which is giving me the following
> compilation error:
> The method setMapperClass(Class<? extends Mapper>) in the type Job is not
> applicable for the arguments (Class<csatAnalysis.MapClass>)
>
> The components involved are:
>
> 1. Setting the Mapper
> //Set the Mapper for the job. Calls MapClass.class
> job.setMapperClass(MapClass.class);
>
> 2. Setting the inputFormat to TextInputFormat
> //An InputFormat for plain text files. Files are broken into lines.
> //Either linefeed or carriage-return are used to signal end of
> line.
> //Keys are the position in the file, and values are the line of
> text..
> job.setInputFormatClass(TextInputFormat.class);
>
> 3. Taking the data into the mapper and processing it
> public static class MapClass extends Mapper<LongWritable, Text, Text, VectorWritable> {
> public void map (LongWritable key, Text value,Context context)
> throws IOException, InterruptedException {
>
> Would anyone have any clues as to what would be wrong with the arguments
> being passed to the Mapper?
>
> Any help would be appreciated,
>
> Thanks.



--
Harsh J


Re: no jobtracker to stop,no namenode to stop

2013-01-21 Thread Harsh J
In the spirit of http://xkcd.com/979/, please also let us know what you felt
was the original issue and how you managed to solve it - for the benefit of
other people searching in the future?


On Mon, Jan 21, 2013 at 3:26 PM, Sigehere  wrote:

> Hey, Friends I have solved that error
> Thanks
>
>
>
>
> --
> View this message in context:
> http://hadoop-common.472056.n3.nabble.com/no-jobtracker-to-stop-no-namenode-to-stop-tp34874p4006830.html
> Sent from the Users mailing list archive at Nabble.com.
>



-- 
Harsh J


Re: Do we support contatenated/splittable bzip2 files in branch-1?

2012-12-03 Thread Harsh J
Hi Yu Li,

The JIRA HADOOP-7823 backported support for splitting Bzip2 files plus
MR support for it, into branch-1, and it is already available in the
1.1.x releases out currently.

Support for concatenated Bzip2 files, i.e. HADOOP-7386, is not implemented yet
(AFAIK), but Chris, over on HADOOP-6335, suggests that HADOOP-4012 may have
fixed it - so can you try it and report back?

On Mon, Dec 3, 2012 at 3:19 PM, Yu Li  wrote:
> Dear all,
>
> About splitting support for bzip2, I checked on the JIRA list and found
> HADOOP-7386 marked as "Won't fix"; I also found some work done in
> branch-0.21(also in trunk), say HADOOP-4012 and MAPREDUCE-830, but not
> integrated/migrated into branch-1, so I guess we don't support concatenated
> bzip2 in branch-1, correct? If so, is there any special reason? Many thanks!
>
> --
> Best Regards,
> Li Yu



-- 
Harsh J


Re: "JVM reuse": if I use this option, do setup(), cleanup() get called only once?

2012-11-13 Thread Harsh J
Hi,

Those API hooks are called once per task attempt, and regardless of
JVM reuse they will still be run once per task attempt. So yes,
setup+cleanup for every map split or reduce partition that runs
through the reused JVM.
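If the goal is to pay the loading cost only once per reused JVM rather than once per task attempt, the usual workaround is a static, lazily initialized field, as in this sketch (the mapper types, the loadBigFile() helper and its contents are all hypothetical). With mapred.job.reuse.jvm.num.tasks set to -1, every later attempt that lands in the same JVM then finds the data already loaded:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Static, so it survives across task attempts run inside the same reused JVM.
  private static Map<String, String> lookup;

  @Override
  protected void setup(Context context) throws IOException {
    synchronized (BigLookupMapper.class) {
      if (lookup == null) {
        // Runs only for the first attempt in a given JVM; later attempts reuse it.
        lookup = loadBigFile();
      }
    }
  }

  // Hypothetical loader; in practice this would read the side file from HDFS
  // or the distributed cache into memory.
  private static Map<String, String> loadBigFile() {
    return new HashMap<String, String>();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String replacement = lookup.get(value.toString());
    if (replacement != null) {
      context.write(value, new Text(replacement));
    }
  }
}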

On Tue, Nov 13, 2012 at 1:47 PM, edward choi  wrote:
> Hi,
>
> This question might sound stupid, but I couldn't find a definite answer on
> Google.
> My job loads a big file at setup() in Map tasks.
> So I would like to use the loaded file again and again.
>
> I came across "JVM reuse", but text book only says that this option enables
> multiple use of
> tasks on a single JVM. It does not say anything about setup() or cleanup().
>
> Even if I set "mapred.job.reuse.jvm.num.tasks" to -1, do setup() and
> cleanup() get called every single time a task is launched?
>
> Best,
> Ed



-- 
Harsh J


Re: Other file systems for hadoop

2012-10-30 Thread Harsh J
Jay,

We used to carry (and formerly, use) an InMemoryFileSystem
implementation (it's still around, deprecated, in the maintenance 1.x
release line: 
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/InMemoryFileSystem.html).

The interface for writing FileSystems is indeed decoupled from "any
real world filesystem" so let your imagination run wild.

On Tue, Oct 30, 2012 at 9:16 PM, Jay Vyas  wrote:
> what are the minimal requirements to implement the filesystem interface for
> hdfs?
>
> I was thinking it might be cool to directly implement a 100% memory
> filesystem, just for the hell of it - for fast unit tests, that simulated
> lookups and stuff.
>
> So - if the interface is abstract and decoupled enough from any real world
> filesystem, i think this could definitely work.
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com



-- 
Harsh J


Re: speculative execution before mappers finish

2012-10-12 Thread Harsh J
Think of it in partition terms. If you know that your map-splits X, Y
and Z won't emit any key of partition P, then the Pth reducer can jump
ahead and run without those X, Y and Z completing their processing.

Otherwise, a reducer can't run until all maps have completed, for fear
of losing a few keys that may have come out of the maps it has skipped
fetching from. To some this may be tolerable, or some would be OK to
receive them later - but that's going to add complexity when you could just
fetch continuously and wait.

Should be easy to take the MRv2 application [0] and add such a thing
in today, if you need it.

[0] - Given the confusion between what MRv2 and YARN mean individually
(they get mixed up too much), hope this blog post of mine helps:
http://www.cloudera.com/blog/2012/10/mr2-and-yarn-briefly-explained/

On Sat, Oct 13, 2012 at 7:46 AM, Jay Vyas  wrote:
> Is it possible for reducers to start (not just copying, but actually)
> "reducing" before all mappers are done, speculatively?
>
> In particular im asking this because Im curious about the internals of how
> the shuffle and sort might (or might not :)) be able to support this.



-- 
Harsh J


Re: concurrency

2012-10-12 Thread Harsh J
Joep,

You're right - I missed in my quick scan that he was actually
replacing those files there. Sorry for the confusion Koert!

On Fri, Oct 12, 2012 at 9:37 PM, J. Rottinghuis  wrote:
> Hi Harsh, Moge Koert,
>
> If Koerts problem is similar to what I have been thinking about where we
> want to consolidate and re-compress older datasets, then the _SUCCESS does
> not really help. _SUCCESS helps to tell if a new dataset is completely
> written.
> However, what is needed here is to replace an existing dataset.
>
> Naive approach:
> The new set can be generated in parallel. Old directory moved out of the
> way (rm and therefore moved to Trash) and then he new directory renamed
> into place.
> I think the problem Koert is describing is how to not mess up map-reduce
> jobs that have already started and may have read some, but not all of the
> files in the directory. If you're lucky, then you'll try to read a file
> that is no longer there, but if you're unlucky then you read a new file
> with the same name and you will never know that you have inconsistent
> results.
>
> Trying to be clever approach:
> Every query puts a "lock" file with the job-id in the directory they read.
> Only when there are no locks, replace the data-set as describe in the naive
> approach. This will reduce the odds for problems, but is rife with
> race-conditions. Also, if the data is read-heavy, you may never get to
> replace the directory. Now you need a write lock to prevent new reads from
> starting.
>
> Would hardlinks solve this problem?
> Simply create a set of (temporary) hardlinks to the files in the directory
> you want to read? Then if the old set is moved out of the way, the
> hardlinks should still point to them. The reading job reads from the
> hardlinks and cleans them up when done. If the hardlinks are placed in a
> directory with the reading job-id then garbage collection should be
> possible for crashed jobs if normal cleanup fails.
>
> Groetjes,
>
> Joep
>
> On Fri, Oct 12, 2012 at 8:35 AM, Harsh J  wrote:
>
>> Hey Koert,
>>
>> Yes the _SUCCESS (Created on successful commit-end of a job) file
>> existence may be checked before firing the new job with the chosen
>> input directory. This is consistent with what Oozie does as well.
>>
>> Since the listing of files happens post-submit() call, doing this will
>> "just work" :)
>>
>> On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers  wrote:
>> > We have a dataset that is heavily partitioned, like this
>> > /data
>> >   partition1/
>> >     _SUCCESS
>> >     part-0
>> >     part-1
>> >     ...
>> >   partition2/
>> >     _SUCCESS
>> >     part-0
>> >     part-1
>> >     ...
>> >   ...
>> >
>> > We have loaders that use map-red jobs to add new partitions to this data
>> > set at a regular interval (so they write to new sub-directories).
>> >
>> > We also have map-red queries that read from the entire dataset (/data/*).
>> > My worry here is concurrency. It will happen that a query job runs
>> > while a loader
>> > job is adding a new partition at the same time. Is there a risk that the
>> query
>> > could read incomplete or corrupt files? Is there a way to use the _SUCCESS
>> > files to prevent this from happening?
>> > Thanks for your time!
>> > Best,
>> > Koert
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: setJarByClass method semantics

2012-10-05 Thread Harsh J
It looks at the classpath, tries to find the resource URL behind the provided
class (looking specifically for jar resources), and if found, sets it as the
job jar.

Running with .class files doesn't work currently - and the job submits
without a jar carrying the classes. This is typically what happens
when you use an IDE like Eclipse and try to submit a job without
bundling a jar up first. The job triggers and fails with CNF
exceptions.
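A tiny driver sketch of the usual pattern (all class names here are hypothetical). setJarByClass() only resolves a jar if the named class was itself loaded from one, so when running bare .class files from an IDE no job jar gets set and the remote tasks typically die with ClassNotFoundException, as described above:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), ONE);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    // Finds the jar WordCountDriver was loaded from (if any) and ships it with
    // the job; a silent no-op when the class came from a plain .class file.
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}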

On Sat, Oct 6, 2012 at 3:30 AM, Jay Vyas  wrote:
> Hi guys:
>
> What does the setJarByClass method do, if for example ,a given class does
> not exist inside of a JAR file?  For example, in the case that we are
> running compiled java classes as-is.  I can't imagine what the "Default"
> behaviour for such a scenario would be, since the semantics of the method
> imply that all hadoop jobs run in the context of a precompiled jar file.
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: doubts reg Hive

2012-09-30 Thread Harsh J
Sudha,

On Mon, Oct 1, 2012 at 9:31 AM, sudha sadhasivam
 wrote:
> We are doing a project in Hive.
>Given a field / value  is it possible to find the corresponding 
> headers (meta data).
> For example if we have a table with id, user, work_place, 
> residence_place
>  given a value "New York" we need to display the headers where New York 
> appears ( for eg work_place, residence_place etc.
>
>Kindly intimate whether  it is possible. If so what is the command for the 
> same?

It is certainly possible and is a trivial requirement. All you're
doing is filtering here, across multiple columns (OR-wise). Perhaps
something like (pardon if naive/wrong): "SELECT * FROM people WHERE
residence_place LIKE '%New York%' OR work_place LIKE '%New York%';"

In any case, for Hive questions, you are better off asking the
u...@hive.apache.org lists than the Hadoop user lists here.

-- 
Harsh J


Re: Use of CombineFileInputFormat

2012-09-28 Thread Harsh J
It combines multiple input splits into one per mapper (a CombineFileSplit),
read serially. This reduces the number of mappers for inputs that carry
several (usually small) files/blocks.
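As a sketch of where it plugs in, assuming a Hadoop 2.x mapreduce API where the concrete CombineTextInputFormat subclass is available (on older releases you would subclass CombineFileInputFormat yourself and supply the RecordReader):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineInputExample {
  public static void configure(Job job) {
    // Pack many small files/blocks into each split, up to ~256 MB per mapper.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}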

On Fri, Sep 28, 2012 at 6:54 AM, Jay Vyas  wrote:
> Its not clear to me what the CombineInputFormat really is ?  Can somebody
> elaborate ?
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Harsh J
Also read: http://arxiv.org/abs/1209.2191 ;-)

On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux  wrote:
> I wouldn't be so surprised. It takes time, energy and money to solve problems
> and make solutions that would be prod-ready. A few people would consider
> that the namenode/secondary spof is a limit for Hadoop itself in order to
> go into a critical production environment. (I am only quoting it and
> don't want to start a discussion about it.)
>
> One paper that I heard about (but didn't have the time to read as of now)
> might be related to your problem space
> http://arxiv.org/abs/1110.4198
> But research paper does not mean prod ready for tomorrow.
>
> http://research.google.com/archive/mapreduce.html is from 2004.
> and http://research.google.com/pubs/pub36632.html (dremel) is from 2010.
>
> Regards
>
> Bertrand
>
> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne wrote:
>
>> jay,
>>
>> thanks. i just needed a sanity check. i hope and expect that one day,
>> hadoop will mature towards supporting a "shared-something" approach.
>> the web service call is not a bad idea at all. that way, we can
>> abstract what that ultimate data store really is.
>>
>> i'm just a little surprised that we are still in the same state with
>> hadoop in regards to this issue (there are probably higher priorities)
>> and that no research (that i know of) has come out of academia to
>> mitigate some of these limitations of hadoop (where's all the funding
>> to hadoop/mapreduce research gone to if this framework is the
>> fundamental building block of a vast amount of knowledge mining
>> activities?).
>>
>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas  wrote:
>> > The reason this is so rare is that the nature of map/reduce tasks is that
>> > they are orthogonal  i.e. the word count, batch image recognition, tera
>> > sort -- all the things hadoop is famous for are largely orthogonal tasks.
>> > Its much more rare (i think) to see people using hadoop to do traffic
>> > simulations or solve protein folding problems... Because those tasks
>> > require continuous signal integration.
>> >
>> > 1) First, try to consider rewriting it so that all communication is
>> replaced
>> > by state variables in a reducer, and choose your keys wisely, so that all
>> > "communication" between machines is obviated by the fact that a single
>> > reducer is receiving all the information relevant for it to do its task.
>> >
>> > 2) If a small amount of state needs to be preserved or cached in real
>> time
>> > to optimize the situation where two machines don't have to redo the
>> > same task (i.e. invoke a web service to get a piece of data, or some
>> other
>> > task that needs to be rate limited and not duplicated) then you can use a
>> > fast key value store (like you suggested) like the ones provided by
>> basho (
>> > http://basho.com/) or amazon (Dynamo).
>> >
>> > 3) If you really need a lot of message passing, then you might be
>> > better off using an inherently more integrated tool like GridGain... which
>> > allows for sophisticated message passing between asynchronously running
>> > processes, i.e.
>> >
>> http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
>> .
>> >
>> >
>> > It seems like there might not be a reliable way to implement a
>> > sophisticated message passing architecture in hadoop, because the system
>> is
>> > inherently so dynamic, and is built for rapid streaming reads/writes,
>> which
>> > would be stifled by significant communication overhead.
>>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Harsh J
Apache Giraph is a framework for graph processing, currently runs over
"MR" (but is getting its own coordination via YARN soon):
http://giraph.apache.org.

You may also check out the generic BSP system (Giraph uses BSP too, if
I am not wrong, but doesn't use Hama - it works over MR instead), Apache
Hama: http://hama.apache.org

On Wed, Sep 26, 2012 at 9:51 PM, Jane Wayne  wrote:
> i'll look for myself, but could you please let me know what is giraph?
> is it another layer on hadoop like hive/pig or an api like mahout?
>
>
>
> On Wed, Sep 26, 2012 at 12:09 PM, Jonathan Bishop  
> wrote:
>> Yes, Giraph seems like the best way to go - it is mainly a vertex
>> evaluation with message passing between vertices. Synchronization is
>> handled for you.
>>
>> On Wed, Sep 26, 2012 at 8:36 AM, Jane Wayne wrote:
>>
>>> hi,
>>>
>>> i know that some algorithms cannot be parallelized and adapted to the
>>> mapreduce paradigm. however, i have noticed that in most cases where i
>>> find myself struggling to express an algorithm in mapreduce, the
>>> problem is mainly due to no ability to cross-communicate between
>>> mappers or reducers.
>>>
>>> one naive approach i've seen mentioned here and elsewhere, is to use a
>>> database to store data for use by all the mappers. however, i have
>>> seen many arguments (that i agree with largely) against this approach.
>>>
>>> in general, my question is this: has anyone tried to implement an
>>> algorithm using mapreduce where mappers required cross-communications?
>>> how did you solve this limitation of mapreduce?
>>>
>>> thanks,
>>>
>>> jane.
>>>



-- 
Harsh J


Re: libhdfs install dep

2012-09-25 Thread Harsh J
I'd recommend using the packages for Apache Hadoop from Apache Bigtop
(https://cwiki.apache.org/confluence/display/BIGTOP). The ones
upstream (here) aren't maintained as much these days.

On Tue, Sep 25, 2012 at 6:27 PM, Pastrana, Rodrigo (RIS-BCT)
 wrote:
> Leo, yes I'm working with hadoop-1.0.1-1.amd64.rpm from Apache's download 
> site.
> The rpm installs libhdfs in /usr/lib64 so I'm not sure why I would need the 
> hadoop-<*>libhdfs* rpm.
>
> Any idea why the installed /usr/lib64/libhdfs.so is not detected by the 
> package managers?
>
> Thanks, Rodrigo.
>
> -Original Message-
> From: Leo Leung [mailto:lle...@ddn.com]
> Sent: Tuesday, September 25, 2012 2:11 AM
> To: common-user@hadoop.apache.org
> Subject: RE: libhdfs install dep
>
> Rodrigo,
>   Assuming you are asking for hadoop 1.x
>
>   You are missing the hadoop-<*>libhdfs* rpm.
>   Build it or get it from the vendor you got your hadoop from.
>
>
>
> -Original Message-
> From: Pastrana, Rodrigo (RIS-BCT) [mailto:rodrigo.pastr...@lexisnexis.com]
> Sent: Monday, September 24, 2012 8:20 PM
> To: 'core-u...@hadoop.apache.org'
> Subject: libhdfs install dep
>
> Anybody know why libhdfs.so is not found by package managers on CentOS 64 and 
> OpenSuse64?
>
> I have an rpm which declares Hadoop as a dependency, but the package managers
> (KPackageKit, zypper, etc) report libhdfs.so as a missing dependency
> even though Hadoop has been installed via rpm package, and libhdfs.so is
> installed as well.
>
> Thanks, Rodrigo.
>
>



-- 
Harsh J


Re: Python + hdfs written thrift sequence files: lots of moving parts!

2012-09-25 Thread Harsh J
Hi Jay,

This may be off-topic to you, but I feel it's related: use Avro
DataFiles. There's Python support already available, as well as
support for several other languages.
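To make that concrete, a small writer sketch with a made-up two-field schema; an Avro container file carries its schema and codec in a self-describing header, which is what lets the Python avro module (and readers in other languages) open it back up without any Hadoop-specific code:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema standing in for whatever the Thrift records carry.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogLine\",\"fields\":["
        + "{\"name\":\"ts\",\"type\":\"long\"},"
        + "{\"name\":\"line\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("loglines.avro"));

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("ts", 1348576000000L);
    rec.put("line", "hello avro");
    writer.append(rec);
    writer.close();
  }
}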

On Tue, Sep 25, 2012 at 10:57 PM, Jay Vyas  wrote:
> Hi guys!
>
> Im trying to read some hadoop outputted thrift files in plain old java
> (without using SequenceFile.Reader).  The reason for this is that I
>
> (1) want to understand the sequence file format better and
> (2) would like to be able to port this code to a language which doesnt have
> robust hadoop sequence file i/o / thrift support  (python). My code looks
> like this:
>
> So, before reading forward, if anyone has :
>
> 1) Some general hints on how to create a Sequence file with thrift encoded
> key values in python would be very useful.
> 2) Some tips on the generic approach for reading a sequencefile (the
> comments seem to be a bit underspecified in the SequenceFile header)
>
> I'd appreciate it!
>
> Now, here is my adventure into thrift/hdfs sequence file i/o :
>
> I've written a simple stub which , I think, should be the start of a
> sequence file reader (just tries to skip the header and get straight to the
> data).
>
> But it doesnt handle compression.
>
> http://pastebin.com/vyfgjML9
>
> So, this code ^^ appears to fail with cryptic errors : "don't know what
> type: 15".
>
> This error comes from a case statement, which attempts to determine what
> type of thrift record is being read in:
> "fail 127 don't know what type: 15"
>
>   private byte getTType(byte type) throws TProtocolException {
> switch ((byte)(type & 0x0f)) {
>   case TType.STOP:
> return TType.STOP;
>   case Types.BOOLEAN_FALSE:
>   case Types.BOOLEAN_TRUE:
> return TType.BOOL;
>  
>  case Types.STRUCT:
> return TType.STRUCT;
>   default:
> throw new TProtocolException("don't know what type: " +
> (byte)(type & 0x0f));
> }
>
> Upon further investigation, I have found that, in fact, the Configuration
> object is (of course) heavily utilized by the SequenceFile reader, in
> particular, to
> determine the Codec.  That corroborates my hypothesis that the data needs
> to be decompressed or decoded before it can be deserialized by thrift, I
> believe.
>
> So... I guess what Im assuming is missing here, is that I don't know how to
> manually reproduce the Codec/GZip, etc.. logic inside of
> SequenceFile.Reader in plain old java (i.e without cheating and using the
> SequenceFile.Reader class that is configured in our mapreduce soruce
> code).
>
> With my end goal being to read the file in python, I think it would be nice
> to be able to read the sequencefile in java, and use this as a template
> (since I know that my thrift objects and serialization are working
> correctly in my current java source codebase, when read in from
> SequenceFile.Reader api).
>
> Any suggestions on how I can distill the logic of the SequenceFile.Reader
> class into a simplified version which is specific to my data, so that I can
> start porting into a python script which is capable of scanning a few real
> sequencefiles off of HDFS would be much appreciated !!!
>
> In general... what are the core steps for doing i/o with sequence files
> that are compressed and or serialized in different formats?  Do we
> decompress first , and then deserialize?  Or do them both at the same time
> ?  Thanks!
>
> PS I've added an issue to github here
> https://github.com/matteobertozzi/Hadoop/issues/5, for a python
> SequenceFile reader.  If I get some helpful hints on this thread maybe I
> can directly implement an example on matteobertozzi's python hadoop trunk.
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-24 Thread Harsh J
Make sure to checkout the rootbeer compiler that makes life easy:
https://github.com/pcpratts/rootbeer1

On Mon, Sep 24, 2012 at 10:26 PM, Chen He  wrote:
> Hi Oleg
>
> I will answer your questions one by one.
>
> 1) file size
>
> There is no exactly number of file size that will definitely works well for
> GPGPU+Hadoop. You need to do your project POC to get the number.
>
> I think the GPU+Hadoop is very suitable for computation-intensive and
> data-intensive applications. However, be aware of the bottleneck between
> the GPU memory and CPU memory. I mean the benefit you obtained from using
> GPGPU should be larger than the performance that you sacrificed by shipping
> data between GPU memory and CPU memory.
>
> If you only have computation-intensive applications and can be parallelized
> by GPGPU, CUDA+Hadoop can also provide a parallel framework for you to
> distribute your work among the cluster nodes with fault-tolerance.
>
>
>  2) Is it a good idea to process data as locally as possible (I mean process
> data like one file per one map)
>
> Local Map tasks are shorter than non-local tasks in the Hadoop MapReduce
> framework.
>
> 3) During your project did you face any limitations or problems?
>
> During my project, the video card was not fancy, it only allowed one CUDA
> program using the card in anytime. Then, we only  configured one map slot
> and one reduce slot in a cluster node. Now, nvidia has some powerful
> products that support multiple program run on the same card simultaneously.
>
> 4)  By the way I didn't find a JCuda code example with Hadoop. :-)
>
> Your MapReduce code is written in Java, right? Integrate your Jcude code to
> either map() or reduce() method of your MapReduce code (you can also do
> this in the combiner, partitioner or whatever you need). Jcuda example only
> helps you know how Jcuda works.
>
> Chen
>
> On Mon, Sep 24, 2012 at 11:22 AM, Oleg Ruchovets wrote:
>
>> Great,
>>    Can you give some tips or best practices like:
>> 1) file size
>> 2) Is it a good idea to process data as locally as possible (I mean process
>> data like one file per one map)
>> 3) During your project did you face any limitations or problems?
>>
>>
>>    Can you point me to which hardware is better to use (I understand in
>> order to use a GPU I need NVIDIA).
>> I mean using a CPU-only architecture I have 8-12 cores per computer (for
>> example).
>>  What should I do in order to use a CPU+GPU architecture? What kind of NVIDIA
>> do I need for this?
>>
>> By the way I didn't find a JCuda code example with Hadoop. :-)
>>
>> Thanks in advance
>> Oleg.
>>
>> On Mon, Sep 24, 2012 at 6:07 PM, Chen He  wrote:
>>
>> > Please see the Jcuda example. I do refer from there. BTW, you can also
>> > compile your cuda code in advance and let your hadoop code call those
>> > compiled code through Jcuda. That is what I did in my program.
>> >
>> > On Mon, Sep 24, 2012 at 10:45 AM, Oleg Ruchovets > > >wrote:
>> >
>> > > Thank you very much. I saw this link!!! Do you have any code or example
>> > > shared on the network (github for example)?
>> > >
>> > > On Mon, Sep 24, 2012 at 5:33 PM, Chen He  wrote:
>> > >
>> > > > http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop
>> > > >
>> > > > On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets <
>> oruchov...@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > Hi
>> > > > >
>> > > > > I am going to process video analytics using hadoop
>> > > > > I am very interested in a CPU+GPU architecture, especially using CUDA
>> > > > > (http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
>> > > > > http://jcuda.org/)
>> > > > > Does using HADOOP and a CPU+GPU architecture bring a significant
>> > > > > performance improvement, and has someone succeeded in implementing it
>> > > > > in production quality?
>> > > > >
>> > > > > I didn't find any projects / examples that use such technology.
>> > > > > If someone could give me a link to best practices and an example using
>> > > > > CUDA/JCUDA + hadoop that would be great.
>> > > > > Thanks in advance
>> > > > > Oleg.
>> > > > >
>> > > >
>> > >
>> >
>>



-- 
Harsh J


Re: Relevance of dfs.safemode.extension?

2012-09-21 Thread Harsh J
For large clusters, it helps if the NN waits for N seconds after
the threshold percentage is satisfied (the minimum # of replicas of
each file's blocks being available) so that other DNs get some extra time
to report in their blocks as well and help ease the initial client
load the cluster receives. This is where the extension comes in useful
(it is certainly tunable to a more suitable value).

For small clusters (single rack or so) you can probably make it 0 to
shed off the extra wait.

However, if you're ever working with NN recovery stuff (one reason the
NN may be down), I recommend setting the threshold itself to >
1.1f to make sure the NN doesn't auto-exit safemode until you're sure
that the new inode/block counts are alright and you haven't made any
mistakes with the recovery process. You can then exit safemode
manually when sure. In safemode, the NN does not issue block
deletions, so data loss would not occur out of mistakes made (such as
starting with an old copy of fsimage accidentally, etc.)

On Fri, Sep 21, 2012 at 1:47 PM, Bertrand Dechoux  wrote:
> Hi,
>
> I would like to know the relevance of dfs.safemode.extension.
> Why would someone wait after leaving the safemode?
> Why is it recommended not to set it to 0 instead of 30000 (30 seconds)?
>
> Regards
>
> Bertrand



-- 
Harsh J


Re: fs.exists report wrong information

2012-09-19 Thread Harsh J
Hi Mike,

One nit: If you use "extends Configured implements Tool" then do not
do a "new Configuration();" anywhere. Instead just call the
"getConf()" method to get a pre-created config object.

When you do a path.getFileSystem(conf) and there's no scheme in the
path's URI, the default (fs.default.name) scheme is taken from your
configs (which may source it via core-site.xml).

Hence, a valid way of testing this for different filesystems is to
make sure to provide the FS Scheme prefix:

hadoop jar test.jar hdfs://localhost:8020/file
hadoop jar test.jar file:///tmp/file
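Putting both points together, a sketch of how the run() method could look (the argument position is kept from the original snippet; the rest is illustrative):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PathCheckTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Qualify the argument with a scheme (hdfs:// or file://) to be explicit;
    // otherwise the fs.default.name from the Tool-provided conf decides.
    Path path = new Path(args[3]);
    // Use getConf() instead of new Configuration(), so -conf/-fs/-D options
    // and the cluster's core-site.xml are honoured.
    FileSystem fs = path.getFileSystem(getConf());
    if (!fs.exists(path)) {
      System.err.println("No such path: " + path);
      return 1;
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new PathCheckTool(), args));
  }
}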

On Thu, Sep 20, 2012 at 6:12 AM, Mike S  wrote:
> I need to check on my jar file arguments validity when the run method
> is called by the ToolRunner
>
> public int run(String[] args) throws Exception
> {
>
> Path filePath = new Path(args[3]);
>
> FileSystem fs= path .getFileSystem(new Configuration());
>
> if(!fs.exists(path ))
> {
> // Do things
> }
>
> }
>
>
> The strange part is that the above code works fine in my Eclipse local
> run but when I move the code to the cluster, it works for the path in
> hdfs but if the path (arg[3]) is on local file system like
> /tmpFolder/myfile, then the above code report the file as not existing
> where the file is there for sure. What I am doing wrong?



-- 
Harsh J


Re: Job history logging

2012-09-14 Thread Harsh J
I guess the reason is that it assumes it can't write history files
after that point, and skips the rest of the work?

On Sat, Sep 15, 2012 at 3:07 AM, Prashant Kommireddi
 wrote:
> Hi All,
>
> I have a question about job history logging. Seems like history logging is
> disabled if file creation fails, is there a reason this is done?
> The following snippet is from JobHistory.JobInfo.logSubmitted()  -
> Hadoop 0.20.2
>
>
>   // Log the history meta info
>   JobHistory.MetaInfoManager.logMetaInfo(writers);
>
>   //add to writer as well
>   JobHistory.log(writers, RecordTypes.Job,
>  new Keys[]{Keys.JOBID, Keys.JOBNAME, Keys.USER,
> Keys.SUBMIT_TIME, Keys.JOBCONF },
>  new String[]{jobId.toString(), jobName, user,
>   String.valueOf(submitTime) ,
> jobConfPath}
> );
>
> }catch(IOException e){
>   LOG.error("Failed creating job history log file, disabling
> history", e);
>   *disableHistory = true; *
> }
>   }
>
>
> Thanks,



-- 
Harsh J


Re: how to clean {mapred.local.dir}/taskTracker

2012-09-09 Thread Harsh J
Hi Zhao,

The space should ideally be cleared up by itself. Can you inspect the
directories to see which tasks seem to have left data over (they may
have specifically done so via keep.failed.task.files=true) and if any
of them are still running?
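
Something along these lines helps spot the offenders (a sketch; adjust
the path to your own mapred.local.dir values, and the layout is assumed
to be the usual taskTracker/<user>/jobcache/<jobid>):

du -sm /path/to/mapred/local/taskTracker/*/jobcache/* | sort -rn | head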

On Mon, Sep 10, 2012 at 9:00 AM, Zhao Hong  wrote:
> Thanks for your immediately reply .  Are there some other ways that do not
> interrupt the running of TaskTracker?
>
> On Mon, Sep 10, 2012 at 11:25 AM, Harsh J  wrote:
>
>> You could just restart the specific tasktracker that has filled this
>> directory; there's startup parts in the TT that clean up directories
>> before beginning to serve out.
>>
>> On Mon, Sep 10, 2012 at 8:41 AM, Zhao Hong  wrote:
>> > Hi all
>> >
>> > How to clean   {mapred.local.dir}/taskTracker ? It's about 32 G and  I
>> must
>> > restart the cluser to clean it .
>> >
>> > Thanks & Best Regards
>> >
>> > hong
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: how to clean {mapred.local.dir}/taskTracker

2012-09-09 Thread Harsh J
You could just restart the specific tasktracker that has filled this
directory; there's startup parts in the TT that clean up directories
before beginning to serve out.
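
On the affected node that would be something like (assuming the stock
scripts from the tarball/package install):

hadoop-daemon.sh stop tasktracker
hadoop-daemon.sh start tasktracker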

On Mon, Sep 10, 2012 at 8:41 AM, Zhao Hong  wrote:
> Hi all
>
> How to clean   {mapred.local.dir}/taskTracker ? It's about 32 G and  I must
> restart the cluser to clean it .
>
> Thanks & Best Regards
>
> hong



-- 
Harsh J


Re: hadoop fs -tail

2012-09-08 Thread Harsh J
We should be able to extend the Tail command to allow a size
specifier. Could you file a JIRA for this?

Although, I don't think the tail prints newlines at every gap unless
there's a newline in the data itself, so your "listener" can perhaps
wait for that?
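
As a stop-gap you can also stream the file and trim it locally; note
this reads the whole file from HDFS, so it only makes sense for
smallish files:

hadoop fs -cat /path/to/file | tail -c 4096   # last 4 KB instead of 1 KB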

On Tue, Jul 17, 2012 at 5:19 AM, Sukhendu Chakraborty
 wrote:
> Hi,
>
> Is there a way to get around the 1KB limitation of the hadoop fs -tail
> command (http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#tail)?
> In my application some of the records can have length greater than 1KB
> before the newline and I would like to get the full records as part of
> 'tail' (not truncated versions).
>
> Thanks,
> -Sukhendu



-- 
Harsh J


Re: heap size for the job tracker

2012-08-28 Thread Harsh J
Hi Mike,

On Wed, Aug 29, 2012 at 7:40 AM, Mike S  wrote:
> 1> Can I change the heap size for the job tracker only if I am using
> version 1.0.2?

Yes.

> 2>  If so, would you please say what exact line I should put in the
> hadoop-evv.sh and where ? Should I set the value with a number or use
> the Xmx notion?
>
> I mean which one is the correct way
>
> export HADOOP_HEAPSIZE=2000
>
> or
>
> export HADOOP_HEAPSIZE="-Xmx2000m"

The above (first one is right syntax) changes the heap across _all_
daemons, not just JT specifically. So you don't want to do that.

You may instead find and change the below line in hadoop-env.sh to the
following:

export HADOOP_JOBTRACKER_OPTS="$HADOOP_JOBTRACKER_OPTS -Xmx2g"

> 3> Do I need to restart the job tracker node or call start-mapred.sh
> to make the heap size change to take in effect? Is there anything else
> I need to do to make the change to be applied?

You will need to restart the JobTracker JVM for the new heap limit to
get used. You can run "hadoop-daemon.sh stop jobtracker" followed by
"hadoop-daemon.sh start jobtracker" to restart just the JobTracker
daemon (run the command on the JT node).

-- 
Harsh J


Re: doubt about reduce tasks and block writes

2012-08-25 Thread Harsh J
Raj's almost right. In times of high load or space fillup on a local
DN, the NameNode may decide to instead pick a non-local DN for
replica-writing. In this way, the Node A may get a "copy 0" of a
replica from a task. This is per the default block placement policy.

P.s. Note that HDFS hardly makes any distinction between replicas;
there is no hard concept of a "copy 0" or "copy 1" block. At the NN
level, all DNs in the pipeline are treated equally, and so are the replicas.

On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan  wrote:
> But since node A has no TT running, it will not run map or reduce tasks. When 
> the reducer node writes the output file, the fist block will be written on 
> the local node and never on node A.
>
> So, to answer the question, Node A will contain copies of blocks of all 
> output files. It wont contain the copy 0 of any output file.
>
>
> I am reasonably sure about this , but there could be corner cases in case of 
> node failure and such like! I need to look into the code.
>
>
> Raj
>>
>> From: Marc Sturlese 
>>To: hadoop-u...@lucene.apache.org
>>Sent: Friday, August 24, 2012 1:09 PM
>>Subject: doubt about reduce tasks and block writes
>>
>>Hey there,
>>I have a doubt about reduce tasks and block writes. Do a reduce task always
>>first write to hdfs in the node where they it is placed? (and then these
>>blocks would be replicated to other nodes)
>>In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and one
>>(node A) just run DN, when running MR jobs, map tasks would never read from
>>node A? This would be because maps have data locality and if the reduce
>>tasks write first to the node where they live, one replica of the block
>>would always be in a node that has a TT. Node A would just contain blocks
>>created from replication by the framework as no reduce task would write
>>there directly. Is this correct?
>>Thanks in advance
>>
>>
>>
>>--
>>View this message in context: 
>>http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
>>Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>>
>>
>>



-- 
Harsh J


Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Harsh J
Daniel,

Perhaps you want your OutputFormat set as NullOutputFormat. That does
not carry any checks for output directory pre-existence.
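
Roughly, with the new API (sketch; assumes all your real output goes
through MultipleOutputs or your own writers):

// No output directory is required, and none is checked for pre-existence.
job.setOutputFormatClass(
    org.apache.hadoop.mapreduce.lib.output.NullOutputFormat.class);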

On Thu, Aug 23, 2012 at 9:47 PM, Daniel Hoffman
 wrote:
> Well, I'm using the MultipleOutputs capability to create a directory
> Structure with Dates.
> So I'm managing this myself.
>
> What I've found, and I could be doing this wrong... is that I still have to
> tell the Tool that I want to use a:
> TextOutputFormat or a FileOutputFormat, and then, have to tell the
> respective formats that I want to use some directory.
>
> IE:
> TextOutputFormat.setOutputDirectory.setOutputDirectory(job,/foo/bar/);
>
> As a work around, I just made a temp directory at /tmp/datetimestamp.
>
> It doesn't make much sense though, sense the reducer uses mulitple output
> formats to make an entirely different directory structure..  Of course, I'm
> probably either not following the M/R Paradigm - or just doing it wrong.
>
> The FilealreadyExistsException was applicable to my "/foo/bar" directory
> which had very little to do with my "genuine" output.
>
>
> Dan
>
> On Thu, Aug 23, 2012 at 9:40 AM, Harsh J  wrote:
>
>> I think this specific behavior irritates a lot of new users. We may as
>> well provide a Generic Option to overwrite the output directory if
>> set. That way, we at least help avoid typing a whole delete command.
>> If you agree, please file an improvement request against MAPREDUCE
>> project on the ASF JIRA.
>>
>> On Thu, Aug 23, 2012 at 6:58 PM, Bertrand Dechoux 
>> wrote:
>> > I don't think so. The client is responsible for deleting the resource
>> > before, if it might exist.
>> > Correct me if I am wrong.
>> >
>> > Higher solution (such as Cascading) usually provides a way to define a
>> > strategy to handle it : KEEP, REPLACE, UPDATE ...
>> >
>> http://docs.cascading.org/cascading/2.0/javadoc/cascading/tap/SinkMode.html
>> >
>> > Regards
>> >
>> > Bertrand
>> >
>> > On Thu, Aug 23, 2012 at 3:15 PM, Daniel Hoffman <
>> hoffmandani...@gmail.com>wrote:
>> >
>> >> With respect to the FileAlreadyExistsException which occurrs when a
>> >> duplicate directory is discovered by an OutputFormat,
>> >> Is there a hadoop  property that is accessible by the client to disable
>> >> this behavior?
>> >>
>> >> IE,  disable.file.already.exists.behaviour=true
>> >>
>> >> Thank You
>> >> Daniel G. Hoffman
>> >>
>> >
>> >
>> >
>> > --
>> > Bertrand Dechoux
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Harsh J
I think this specific behavior irritates a lot of new users. We may as
well provide a Generic Option to overwrite the output directory if
set. That way, we at least help avoid typing a whole delete command.
If you agree, please file an improvement request against MAPREDUCE
project on the ASF JIRA.
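
Until then, the usual client-side idiom is a couple of lines in the
driver (sketch; only do this if you genuinely want the old output gone):

Path out = new Path(args[1]);
FileSystem fs = out.getFileSystem(getConf());
if (fs.exists(out)) {
  fs.delete(out, true);   // recursive delete of the stale output dir
}
FileOutputFormat.setOutputPath(job, out);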

On Thu, Aug 23, 2012 at 6:58 PM, Bertrand Dechoux  wrote:
> I don't think so. The client is responsible for deleting the resource
> before, if it might exist.
> Correct me if I am wrong.
>
> Higher solution (such as Cascading) usually provides a way to define a
> strategy to handle it : KEEP, REPLACE, UPDATE ...
> http://docs.cascading.org/cascading/2.0/javadoc/cascading/tap/SinkMode.html
>
> Regards
>
> Bertrand
>
> On Thu, Aug 23, 2012 at 3:15 PM, Daniel Hoffman 
> wrote:
>
>> With respect to the FileAlreadyExistsException which occurrs when a
>> duplicate directory is discovered by an OutputFormat,
>> Is there a hadoop  property that is accessible by the client to disable
>> this behavior?
>>
>> IE,  disable.file.already.exists.behaviour=true
>>
>> Thank You
>> Daniel G. Hoffman
>>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J


Re: SQOOP SECURITY ISSUES

2012-08-20 Thread Harsh J
Abhishek,

Moving this to the u...@sqoop.apache.org lists.

On Mon, Aug 20, 2012 at 9:03 PM, sudeep tokala  wrote:
> hi all,
>
> what are all security concerns with sqoop.
>
> Regards
> Abhishek



-- 
Harsh J


Re: how i will take data from my Facebook community ?

2012-08-20 Thread Harsh J
Hey Sujit,

1. Please do not cross-post to several lists, and cause inbox flood to
many others. Thats not a valid way to seek faster answers - forming a
proper question after research is the key. See
http://www.catb.org/~esr/faqs/smart-questions.html for a good read on
this.

2. Your question needs to go to the Facebook Developer Support
community, at http://developers.facebook.com. Perhaps you may even be
looking for this API here, that applies to Facebook Groups:
http://developers.facebook.com/docs/reference/api/group/

On Mon, Aug 20, 2012 at 12:21 PM, Sujit Dhamale
 wrote:
> Hi ,
> below is my case study ...
>
>
> i need to do analytic on my Facebook community.
> is any one know,  how i will take data from my Facebook community ? is any
> Facebook plugin available ? or is any way to store data from Facebook ?
>
> i want to store this data in HDFS and implement further .
>
>
>
> Kind Regard
> Sujit Dhamale



-- 
Harsh J


Re: resetting conf/ parameters in a life cluster.

2012-08-18 Thread Harsh J
No, you will need to restart the TaskTracker to have it in effect.

On Sat, Aug 18, 2012 at 8:46 PM, Jay Vyas  wrote:
> hmmm I wonder if there is a way to push conf/*xml parameters out to all
> the slaves, maybe at runtime ?
>
> On Sat, Aug 18, 2012 at 4:06 PM, Harsh J  wrote:
>
>> Jay,
>>
>> Oddly, the counters limit changes (increases, anyway) needs to be
>> applied at the JT, TT and *also* at the client - to take real effect.
>>
>> On Sat, Aug 18, 2012 at 8:31 PM, Jay Vyas  wrote:
>> > Hi guys:
>> >
>> > I've reset my max counters as follows :
>> >
>> > ./hadoop-site.xml:
>> >
>> > <property>
>> >   <name>mapreduce.job.counters.limit</name><value>15000</value>
>> > </property>
>> >
>> > However, a job is failing (after reducers get to 100%!) at the very end,
>> > due to exceeded counter limit.  I've confirmed in my
>> > code that indeed the correct counter parameter is being set.
>> >
>> > My hypothesis: Somehow, the name node counters parameter is effectively
>> > being transferred to slaves... BUT the name node *itself* hasn't updated
>> its
>> > maximum counter allowance, so it throws an exception at the end of the
>> job,
>> > that is, they dying message from hadoop is
>> >
>> > " max counter limit 120 exceeded "
>> >
>> > I've confirmed in my job that the counter parameter is correct, when the
>> > job starts... However... somehow the "120 limit exceeded" exception is
>> > still thrown.
>> >
>> > This is in elastic map reduce, hadoop .20.205
>> >
>> > --
>> > Jay Vyas
>> > MMSB/UCHC
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: resetting conf/ parameters in a life cluster.

2012-08-18 Thread Harsh J
Jay,

Oddly, the counters limit changes (increases, anyway) needs to be
applied at the JT, TT and *also* at the client - to take real effect.
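
In other words, the same snippet has to land in mapred-site.xml on the
JT node, on every TT node, and on the submitting client (daemons
restarted; the client simply picks it up at the next submit):

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>15000</value>
</property>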

On Sat, Aug 18, 2012 at 8:31 PM, Jay Vyas  wrote:
> Hi guys:
>
> I've reset my max counters as follows :
>
> ./hadoop-site.xml:
> <property>
>   <name>mapreduce.job.counters.limit</name><value>15000</value>
> </property>
>
> However, a job is failing (after reducers get to 100%!) at the very end,
> due to exceeded counter limit.  I've confirmed in my
> code that indeed the correct counter parameter is being set.
>
> My hypothesis: Somehow, the name node counters parameter is effectively
> being transferred to slaves... BUT the name node *itself* hasn't updated its
> maximum counter allowance, so it throws an exception at the end of the job,
> that is, they dying message from hadoop is
>
> " max counter limit 120 exceeded "
>
> I've confirmed in my job that the counter parameter is correct, when the
> job starts... However... somehow the "120 limit exceeded" exception is
> still thrown.
>
> This is in elastic map reduce, hadoop .20.205
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: hi, I need to help: Hadoop

2012-08-14 Thread Harsh J
Hi Huong,

See http://developer.yahoo.com/hadoop/tutorial/module2.html#programmatically
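
The short version, as a sketch (the paths are just examples, and
fs.default.name in your core-site.xml must point at the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);        // HDFS, per fs.default.name
Path src = new Path("/user/huong/data.txt"); // HDFS source (example)
Path dst = new Path("/tmp/data.txt");        // local destination (example)
fs.copyToLocalFile(src, dst);                // copies HDFS -> local disk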

On Wed, Aug 15, 2012 at 10:24 AM, huong hoang minh
 wrote:
> I am researching Hadoop  technology. And I don't know how to access
> and copy data from HDFS to the local machine by Java. Can you help me
> , step by step?
>  Thank you very much.
> --
> Hoàng Minh Hương
> GOYOH VIETNAM 44-Trần Cung-Hà Nội-Viet NAM
> Tel: 0915318789



-- 
Harsh J


Re: Avro vs Json

2012-08-12 Thread Harsh J
Moving this to the user@avro lists. Please use the right lists for the
best answers and the right people.

I'd pick Avro out of the two - it is very well designed for typed data
and has a very good implementation of the serializer/deserializer,
aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
dump Avro binary format out as JSON structures, which would be of help
if you seek readability and/or integration with apps/systems that
already depend on JSON.
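
For example, with the avro-tools jar (the jar name/version and file name
here are placeholders):

java -jar avro-tools-1.7.1.jar tojson part-00000.avro > part-00000.json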

On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia  wrote:
> We get data in Json format. I was initially thinking of simply storing Json
> in hdfs for processing. I see there is Avro that does the similar thing but
> most likely stores it in more optimized format. I wanted to get users
> opinion on which one is better.



-- 
Harsh J


Re: FLUME AVRO

2012-08-12 Thread Harsh J
Abhishek,

Moving this to user@flume lists, as it is Flume specific.

P.s. Please do not cross post to multiple lists, it does not guarantee
you a faster response nor is mailing to a *-dev list relevant to your
question here. Help avoid additional inbox noise! :)

On Thu, Aug 9, 2012 at 10:43 PM, abhiTowson cal
 wrote:
> hi all,
>
> can log data be converted into avro,when data is sent from source to sink.
>
> Regards
> Abhishek



-- 
Harsh J


Re: Mechanism of hadoop -jar

2012-08-11 Thread Harsh J
Hey,

On Sun, Aug 12, 2012 at 3:39 AM, Jay Vyas  wrote:
> Hi guys:  I'm trying to find documentation on how "hadoop jar" actually
> works i.e. how it copies/runs the jar file across the cluster, in order to
> debug a jar issue.
>
> 1) Where can I get a good explanation of how the hadoop commands (i.e.
> -jar) are implemented ?

The "jar" sub-command executes the org.apache.hadoop.util.RunJar class.

> 2) Specifically, Im trying to access a bundled text file from a jar :
>
> class.getResource("myfile.txt")
>
> from inside a mapreduce job Is it okay to do this ?  Or does a classes
> ability to aquire local resources change  in the mapper/reducer JVMs ?

I believe this should work.
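
i.e. inside your mapper/reducer something like this should locate a file
bundled at the root of the job jar ("myfile.txt" being whatever you
packaged):

// The job jar is on each task's classpath, so classloader lookups work
// in the task JVMs as well.
InputStream in = getClass().getResourceAsStream("/myfile.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(in));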

-- 
Harsh J


Re: How can I get the intermediate output file from mapper class?

2012-08-09 Thread Harsh J
Hi,

You need the "file.out" and "file.out.index" files when wanting the
map->intermediate->reduce files. So try a pattern that matches these
and you should have it.

The "X" kind of files are what MR produces on HDFS as regular
outputs - these aren't intermediate.
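
For example, in the job's configuration (or the submitting client's
mapred-site.xml): keep.task.files.pattern is matched against task
attempt names, and keeping an attempt keeps its whole working directory
including file.out and file.out.index; tighten the regex to the specific
attempts you care about:

<property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_.*</value>
</property>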

On Fri, Aug 10, 2012 at 8:52 AM, Liu, Raymond  wrote:
> Hi
>
> I am trying to access the intermediate file save to the local 
> filesystem from mapreduce's mapper output.
>
> I have googled this one : 
> http://stackoverflow.com/questions/7867608/hadoop-mapreduce-intermediate-output
>
> I am using hadoop 1.0.3 , and I did set following property in 
> mapred-site.xml
>
> 
>   keep.task.files.pattern
>   .*_m_0*
> 
>
> Then after restart hadoop and run some jobss, I did see tasks in my local dir 
> like:
>
> /mnt/DP_disk1/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
>
> But I still cannot find any output dir there.
>
> I have four disks mount for local dir, and only jars,work dir are find as 
> following:
>
> 
> mapred.local.dir
> /mnt/DP_disk1/raymond/hdfs/mapred,/mnt/DP_disk2/raymond/hdfs/mapred,/mnt/DP_disk3/raymond/hdfs/mapred,/mnt/DP_disk4/raymond/hdfs/mapred
> 
>
> Then I search though them:
>
> raymond@sr173:~$ ls 
> /mnt/DP_disk1/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
> jars  job.xml
> raymond@sr173:~$ ls 
> /mnt/DP_disk2/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
> raymond@sr173:~$ ls 
> /mnt/DP_disk3/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
> jobToken  work
> raymond@sr173:~$ ls 
> /mnt/DP_disk4/raymond/hdfs/mapred/taskTracker/raymond/jobcache/job_201208101040_0003/
>
> And I also search the ttprivate dir, no luck there :
>
> raymond@sr173:~$ ls 
> /mnt/DP_disk4/raymond/hdfs/mapred/ttprivate/taskTracker/raymond/jobcache/job_201208101040_0003/attempt_201208101040_0003_m_21_0/taskjvm.sh
> /mnt/DP_disk4/raymond/hdfs/mapred/ttprivate/taskTracker/raymond/jobcache/job_201208101040_0003/attempt_201208101040_0003_m_21_0/taskjvm.sh
>
> So, Is there anything I am still missing?
>
>
> Best Regards,
> Raymond Liu
>



-- 
Harsh J


Re: Local jobtracker in test env?

2012-08-07 Thread Harsh J
Yes, singular JVM (The test JVM itself) and the latter approach (no
TT/JT daemons).

On Wed, Aug 8, 2012 at 4:50 AM, Mohit Anchlia  wrote:
> On Tue, Aug 7, 2012 at 2:08 PM, Harsh J  wrote:
>
>> It used the local mode of operation:
>> org.apache.hadoop.mapred.LocalJobRunner
>>
>>
> In localmode everything is done inside the same JVM i.e.
> tasktracker,jobtracker etc. all run in the same JVM. Or does it mean that
> none of those processes run everything is pipelined in the same process on
> the local file system.
>
>
>> A JobTracker (via MiniMRCluster) is only required for simulating
>> distributed tests.
>>
>> On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia 
>> wrote:
>> > I just wrote a test where fs.default.name is file:/// and
>> > mapred.job.tracker is set to local. The test ran fine, I also see mapper
>> > and reducer were invoked but what I am trying to understand is that how
>> did
>> > this run without specifying the job tracker port and which port task
>> > tracker connected with job tracker. It's not clear from the output:
>> >
>> > Also what's the difference between this and bringing up miniDFS cluster?
>> >
>> > INFO  org.apache.hadoop.mapred.FileInputFormat [main]: Total input paths
>> to
>> > proc
>> > ess : 1
>> > INFO  org.apache.hadoop.mapred.JobClient [main]: Running job:
>> job_local_0001
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using
>> > ResourceCalculatorPlugin
>> >  : null
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: numReduceTasks: 1
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: io.sort.mb = 100
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: data buffer =
>> > 79691776/99614
>> > 720
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: record buffer =
>> > 262144/32768
>> > 0
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: z
>> > ip 92127
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: z
>> > ip 1
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: z
>> > ip 92127
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: z
>> > ip 1
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Starting flush of map
>> > output
>> > INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Finished spill 0
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]:
>> > Task:attempt_local_0001_m_0
>> > 0_0 is done. And is in the process of commiting
>> > INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
>> > file:/c:/upb/dp/manch
>> > lia-dp/depot/services/data-platform/trunk/analytics/geoinput/geo.dat:0+18
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task
>> > 'attempt_local_0001_m_
>> > 00_0' done.
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using
>> > ResourceCalculatorPlugin
>> >  : null
>> > INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
>> > INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Merging 1 sorted
>> segments
>> > INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Down to the last
>> > merge-pass,
>> > with 1 segments left of total size: 26 bytes
>> > INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: I
>> > nside reduce
>> > INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup
>> [Thread-11]: O
>> > utside reduce
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]:
>> > Task:attempt_local_0001_r_0
>> > 0_0 is done. And is in the process of commiting
>> > INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task
>> > attempt_local_0001_r_0
>> > 0_0 is allowed to commit now
>> > INFO  org.apache.hadoop.mapred.FileOutputCommitter [Thread-11]: Saved
>> > output of
>> > task 'attempt_local_0001_r_00_0' to
>> > file:/c:/upb/dp/manchlia-dp/depot/servic
>> > es/data-platform/trunk/analytics/geooutput
>> > INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: reduce >
>> reduce
>> > INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task
>> > 'attempt_local_0001_r_
>> > 00_0' done.

Re: Local jobtracker in test env?

2012-08-07 Thread Harsh J
cords=2
> INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce input groups=1
> INFO  org.apache.hadoop.mapred.JobClient [main]: Combine output
> records=0
> INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce output records=1
> INFO  org.apache.hadoop.mapred.JobClient [main]: Map output records=2
> INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Inside
>  reduce
> INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Outsid
> e reduce
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.547 sec
> Results :
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0



-- 
Harsh J


Re: Setting Configuration for local file:///

2012-08-07 Thread Harsh J
If you instantiate the JobConf with your existing conf object, then
you needn't have that fear.
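
i.e. inside GeoLookupConfigRunner's run(), roughly:

public int run(String[] args) throws Exception {
  // Built from the conf handed in via setConf()/getConf(), so the test's
  // fs.default.name=file:/// and mapred.job.tracker=local win over any
  // *-site.xml found on the classpath.
  JobConf job = new JobConf(getConf(), GeoLookupConfigRunner.class);
  // ... set formats, paths, mapper/reducer as before ...
  JobClient.runJob(job);
  return 0;
}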

On Wed, Aug 8, 2012 at 1:40 AM, Mohit Anchlia  wrote:
> On Tue, Aug 7, 2012 at 12:50 PM, Harsh J  wrote:
>
>> What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
>> object within it?
>
>
> Thanks for the pointer I wasn't setting my JobConf object with the conf
> that I passed. Just one more related question, if I use JobConf conf = new
> JobConf(getConf()) and I don't pass in any configuration then does the data
> from xml files in the path used? I want this to work for all the scenarios.
>
>
>>
>> On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia 
>> wrote:
>> > I am trying to write a test on local file system but this test keeps
>> taking
>> > xml files in the path even though I am setting a different Configuration
>> > object. Is there a way for me to override it? I thought the way I am
>> doing
>> > overwrites the configuration but doesn't seem to be working:
>> >
>> >  @Test
>> >  public void testOnLocalFS() throws Exception{
>> >   Configuration conf = new Configuration();
>> >   conf.set("fs.default.name", "file:///");
>> >   conf.set("mapred.job.tracker", "local");
>> >   Path input = new Path("geoinput/geo.dat");
>> >   Path output = new Path("geooutput/");
>> >   FileSystem fs = FileSystem.getLocal(conf);
>> >   fs.delete(output, true);
>> >
>> >   log.info("Here");
>> >   GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
>> >   configRunner.setConf(conf);
>> >   int exitCode = configRunner.run(new String[]{input.toString(),
>> > output.toString()});
>> >   Assert.assertEquals(exitCode, 0);
>> >  }
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: Setting Configuration for local file:///

2012-08-07 Thread Harsh J
What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
object within it?

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia  wrote:
> I am trying to write a test on local file system but this test keeps taking
> xml files in the path even though I am setting a different Configuration
> object. Is there a way for me to override it? I thought the way I am doing
> overwrites the configuration but doesn't seem to be working:
>
>  @Test
>  public void testOnLocalFS() throws Exception{
>   Configuration conf = new Configuration();
>   conf.set("fs.default.name", "file:///");
>   conf.set("mapred.job.tracker", "local");
>   Path input = new Path("geoinput/geo.dat");
>   Path output = new Path("geooutput/");
>   FileSystem fs = FileSystem.getLocal(conf);
>   fs.delete(output, true);
>
>   log.info("Here");
>   GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
>   configRunner.setConf(conf);
>   int exitCode = configRunner.run(new String[]{input.toString(),
> output.toString()});
>   Assert.assertEquals(exitCode, 0);
>  }



-- 
Harsh J


Re: Basic Question

2012-08-07 Thread Harsh J
Each write call registers (writes) a KV pair to the output. The output
collector does not look for similarities nor does it try to de-dupe
it, and even if the object is the same, its value is copied so that
doesn't matter.

So you will get two KV pairs in your output - since duplication is
allowed and is normal in several MR cases. Think of wordcount, where a
map() call may emit lots of ("is", 1) pairs if there are multiple "is"
in the line it processes, and can use set() calls to its benefit to
avoid too many object creation.
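
A wordcount-flavoured sketch of that reuse pattern (new API):

private final Text word = new Text();
private final IntWritable one = new IntWritable(1);

@Override
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  for (String token : value.toString().split("\\s+")) {
    word.set(token);          // safe to reuse: write() serializes a copy
    context.write(word, one); // duplicate pairs like ("is", 1) are expected
  }
}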

On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia  wrote:
> In Mapper I often use a Global Text object and througout the map processing
> I just call "set" on it. My question is, what happens if collector receives
> similar byte array value. Does the last one overwrite the value in
> collector? So if I did
>
> Text zip = new Text();
> zip.set("9099");
> collector.write(zip,value);
> zip.set("9099");
> collector.write(zip,value1);
>
> Should I expect to receive both values in reducer or just one?



-- 
Harsh J


Re: Hello Hadoop

2012-08-03 Thread Harsh J
Welcome! We look forward to learn from you too! :)

On Fri, Aug 3, 2012 at 10:58 PM, Harit Himanshu <
harit.subscripti...@gmail.com> wrote:

> first message - I have just joined this group looking forward to learn from
> everyone
>



-- 
Harsh J


Re: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text

2012-08-02 Thread Harsh J
This was answered at http://search-hadoop.com/m/j1M3R1Mjjx31
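
In short: that particular mismatch usually means the map output key
class was left at its default (it falls back to the job output key
class, i.e. LongWritable here), so something along these lines in run()
should clear it up:

job.setMapOutputKeyClass(Text.class);          // map emits (Text, LongWritable)
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);             // reduce emits (Text, Text)
job.setOutputValueClass(Text.class);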

On Fri, Aug 3, 2012 at 3:52 AM, Harit Himanshu  wrote:
> Hi,
>
> I face another issue, now, Here is my program
>
>
> public static class MapClass extends Mapper<LongWritable, Text, Text, LongWritable> {
>
> public void map(LongWritable key, Text value, Context context)
> throws IOException, InterruptedException {
> // your map code goes here
> String[] fields = value.toString().split(",");
> Text yearInText = new Text();
> LongWritable out = new LongWritable();
> String year = fields[1];
> String claims = fields[8];
>
> if (claims.length() > 0 && (!claims.startsWith("\""))) {
> yearInText.set(year.toString());
> out.set(Long.parseLong(claims));
> context.write(yearInText, out);
> }
> }
> }
>
>
> public static class Reduce extends Reducer<Text, LongWritable, Text, Text> {
>
> public void reduce(Text key, Iterable<LongWritable> values, Context
> context) throws IOException, InterruptedException {
> // your reduce function goes here
> Text value = new Text();
> value.set(values.toString());
> context.write(key, value);
> }
> }
>
> public int run(String args[]) throws Exception {
> Job job = new Job();
> job.setJarByClass(TopKRecord.class);
>
> job.setMapperClass(MapClass.class);
> job.setReducerClass(Reduce.class);
>
> FileInputFormat.setInputPaths(job, new Path(args[0]));
> FileOutputFormat.setOutputPath(job, new Path(args[1]));
>
> job.setMapOutputValueClass(LongWritable.class);
> job.setJobName("TopKRecord");
>
> //job.setNumReduceTasks(0);
> boolean success = job.waitForCompletion(true);
> return success ? 0 : 1;
> }
>
> public static void main(String args[]) throws Exception {
> int ret = ToolRunner.run(new TopKRecord(), args);
> System.exit(ret);
> }
> }
>
> When I run this in hadoop, I see the following error
>
> 12/08/02 15:12:59 INFO mapred.JobClient: Task Id :
> attempt_201208021025_0011_m_01_0, Status : FAILED
> java.io.IOException: Type mismatch in key from map: expected
> org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1014)
> at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at com.hadoop.programs.TopKRecord$MapClass.map(TopKRecord.java:39)
> at com.hadoop.programs.TopKRecord$MapClass.map(TopKRecord.java:26)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I asked this question on SO, and got response that I need to
> setMapOutputValue class and I tried that too (see in code above) by
> following
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html#setMapOutputKeyClass%28java.lang.Class%29
>
> How do I fix this?
>
> Thank you
> + Harit



-- 
Harsh J


Re: Disable retries

2012-08-02 Thread Harsh J
You may use the APIs directly:
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)
and 
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)
to avoid config strings pain.
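
e.g. a sketch in the driver (the JobConf construction is just an example):

JobConf conf = new JobConf(getConf(), MyJob.class);
conf.setMaxMapAttempts(1);                // no map retries
conf.setMaxReduceAttempts(1);             // no reduce retries
conf.setMapSpeculativeExecution(false);   // and no speculative duplicates
conf.setReduceSpeculativeExecution(false);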

On Fri, Aug 3, 2012 at 5:59 AM, Marco Gallotta  wrote:
> Great, thanks!
>
> --
> Marco Gallotta | Mountain View, California
> Software Engineer, Infrastructure | Loki Studios
> fb.me/marco.gallotta | twitter.com/marcog
> ma...@gallotta.co.za | +1 (650) 417-3313
>
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Thursday 02 August 2012 at 5:02 PM, Bejoy KS wrote:
>
>> Hi Marco
>>
>> You can disable retries by setting
>> mapred.map.max.attempts and mapred.reduce.max.attempts to 1.
>>
>> Also if you need to disable speculative execution. You can disable it by 
>> setting
>> mapred.map.tasks.speculative.execution and 
>> mapred.reduce.tasks.speculative.execution to false.
>>
>> With these two steps you can ensure that a task is attempted only once.
>>
>> These properties to be set in mapred-site.xml or at job level.
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>>
>> -Original Message-
>> From: Marco Gallotta mailto:ma...@gallotta.co.za)>
>> Date: Thu, 2 Aug 2012 16:52:00
>> To: mailto:common-user@hadoop.apache.org)>
>> Reply-To: common-user@hadoop.apache.org 
>> (mailto:common-user@hadoop.apache.org)
>> Subject: Disable retries
>>
>> Hi there
>>
>> Is there a way to disable retries when a mapper/reducer fails? I'm writing 
>> data in my mapper and I'd rather catch the failure, recover from a backup 
>> (fairly lightweight in this case, as the output tables aren't big) and 
>> restart.
>>
>>
>>
>> --
>> Marco Gallotta | Mountain View, California
>> Software Engineer, Infrastructure | Loki Studios
>> fb.me/marco.gallotta (http://fb.me/marco.gallotta) | twitter.com/marcog 
>> (http://twitter.com/marcog)
>> ma...@gallotta.co.za (mailto:ma...@gallotta.co.za) | +1 (650) 417-3313
>>
>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>



-- 
Harsh J


Re: Hadoop with S3 instead of local storage

2012-08-02 Thread Harsh J
With S3 you do not need a NameNode. NameNode is part of HDFS.

On Thu, Aug 2, 2012 at 12:44 PM, Alok Kumar  wrote:
> Hi,
>
> Followed instructions from this link for setup
> http://wiki.apache.org/hadoop/AmazonS3.
>
> my "core-site.xml " contains only these 3 properties :
> 
>   fs.default.name
>   s3://BUCKET
> 
>
> 
>   fs.s3.awsAccessKeyId
>   ID
> 
>
> 
>   fs.s3.awsSecretAccessKey
>   SECRET
> 
>
> hdfs-site.xml is empty!
>
> Namenode log says, its trying to connect to local HDFS not S3.
> Am i missing anything?
>
> Regards,
> Alok



-- 
Harsh J


Re: how to increment counters inside of InputFormat/RecordReader in mapreduce api?

2012-07-30 Thread Harsh J
Jim,

This is fixed in 2.x releases already via
https://issues.apache.org/jira/browse/MAPREDUCE-1905 (incompat
change). In 1.x this is a known limitation. Perhaps we can come up
with a different non-breaking fix for 1.x?
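
On 2.x the TaskAttemptContext handed to the reader carries the counters,
so a RecordReader can do roughly this (sketch; the group/counter names
are made up for illustration):

private TaskAttemptContext context;

@Override
public void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException {
  this.context = context;  // keep it around for counter updates
}

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  // ... on hitting a malformed record, bump a counter and move on:
  context.getCounter("MyInputFormat", "BAD_RECORDS").increment(1);
  // ... return true while records remain, false at end of split ...
  return false;
}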

On Mon, Jul 30, 2012 at 11:41 AM, Jim Donofrio  wrote:
> In the mapred api getRecordReader was passed a Reporter which could then get
> passed to the RecordReader to allow a RecordReader to increment counters for
> different types of records, bad records, etc. In the new mapreduce api
> createRecordReader only gets the InputSplit and TaskAttemptContext, both
> have no access to counters.
>
> Is there really no way to increment counters inside of a RecordReader or
> InputFormat in the mapreduce api?



-- 
Harsh J


Re: IOException: too many length or distance symbols

2012-07-29 Thread Harsh J
Good to know Prashant, thanks for getting back!

On Mon, Jul 30, 2012 at 12:54 AM, Prashant Kommireddi
 wrote:
> Thanks Harsh.
>
> On digging some more it appears there was a data corruption issue with
> the file that caused the exception. After having regenerated the gzip
> file from source I no longer see the issue.
>
>
> On Jul 20, 2012, at 8:48 PM, Harsh J  wrote:
>
>> Prashant,
>>
>> Can you add in some context on how these files were written, etc.?
>> Perhaps open a JIRA with a sample file and test-case to reproduce
>> this? Other env stuff with info on version of hadoop, etc. would help
>> too.
>>
>> On Sat, Jul 21, 2012 at 2:05 AM, Prashant Kommireddi
>>  wrote:
>>> I am seeing these exceptions, anyone know what they might be caused due to?
>>> Case of corrupt file?
>>>
>>> java.io.IOException: too many length or distance symbols
>>>at 
>>> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
>>> Method)
>>>at 
>>> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
>>>at 
>>> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
>>>at 
>>> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>>>at java.io.InputStream.read(InputStream.java:85)
>>>at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>>>at 
>>> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>>>at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
>>>at 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
>>>at 
>>> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>>at 
>>> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>>at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>>at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>>
>>> Thanks,
>>> Prashant
>>
>>
>>
>> --
>> Harsh J



-- 
Harsh J


Re:

2012-07-29 Thread Harsh J
For a job to get submitted to a cluster, you will need proper client
configurations. Have you configured your mapred-site.xml and
yarn-site.xml properly inside /etc/hadoop/conf/mapred-site.xml and
/etc/hadoop/conf/yarn-site.xml at the client node?
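
At minimum the client needs mapreduce.framework.name set to "yarn" (or
it falls back to the LocalJobRunner) plus the RM's address; a sketch,
with the hostname/port as placeholders to be matched to your RM:

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host:8032</value>
</property>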

On Mon, Jul 30, 2012 at 12:00 AM, abhiTowson cal
 wrote:
> Hi All,
>
> I am getting problem that job is running in localrunner rather than
> the cluster enviormnent.
> And when am running the job i would not be able to see the job id in
> the resource manager UI
>
> Can you please go through the issues and let me know ASAP.
>
> sudo -u hdfs hadoop jar
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen
> 100 /benchmark/teragen/input
> 12/07/29 13:35:59 WARN conf.Configuration: session.id is deprecated.
> Instead, use dfs.metrics.session-id
> 12/07/29 13:35:59 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 12/07/29 13:35:59 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/07/29 13:35:59 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the
> same.
> Generating 100 using 1 maps with step of 100
> 12/07/29 13:35:59 INFO mapred.JobClient: Running job: job_local_0001
> 12/07/29 13:35:59 INFO mapred.LocalJobRunner: OutputCommitter set in config 
> null
> 12/07/29 13:35:59 INFO mapred.LocalJobRunner: OutputCommitter is
> org.apache.hadoop.mapred.FileOutputCommitter
> 12/07/29 13:35:59 WARN mapreduce.Counters: Group
> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
> org.apache.hadoop.mapreduce.TaskCounter instead
> 12/07/29 13:35:59 INFO util.ProcessTree: setsid exited with exit code 0
> 12/07/29 13:35:59 INFO mapred.Task:  Using ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@47c297a3
> 12/07/29 13:36:00 WARN mapreduce.Counters: Counter name
> MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group
> name and  BYTES_READ as counter name instead
> 12/07/29 13:36:00 INFO mapred.MapTask: numReduceTasks: 0
> 12/07/29 13:36:00 INFO mapred.JobClient:  map 0% reduce 0%
> 12/07/29 13:36:01 INFO mapred.Task: Task:attempt_local_0001_m_00_0
> is done. And is in the process of commiting
> 12/07/29 13:36:01 INFO mapred.LocalJobRunner:
> 12/07/29 13:36:01 INFO mapred.Task: Task attempt_local_0001_m_00_0
> is allowed to commit now
> 12/07/29 13:36:01 INFO mapred.FileOutputCommitter: Saved output of
> task 'attempt_local_0001_m_00_0' to
> hdfs://hadoop-master-1/benchmark/teragen/input
> 12/07/29 13:36:01 INFO mapred.LocalJobRunner:
> 12/07/29 13:36:01 INFO mapred.Task: Task 'attempt_local_0001_m_00_0' done.
> 12/07/29 13:36:02 INFO mapred.JobClient:  map 100% reduce 0%
> 12/07/29 13:36:02 INFO mapred.JobClient: Job complete: job_local_0001
> 12/07/29 13:36:02 INFO mapred.JobClient: Counters: 19
> 12/07/29 13:36:02 INFO mapred.JobClient:   File System Counters
> 12/07/29 13:36:02 INFO mapred.JobClient: FILE: Number of bytes read=142686
> 12/07/29 13:36:02 INFO mapred.JobClient: FILE: Number of bytes
> written=220956
> 12/07/29 13:36:02 INFO mapred.JobClient: FILE: Number of read operations=0
> 12/07/29 13:36:02 INFO mapred.JobClient: FILE: Number of large
> read operations=0
> 12/07/29 13:36:02 INFO mapred.JobClient: FILE: Number of write 
> operations=0
> 12/07/29 13:36:02 INFO mapred.JobClient: HDFS: Number of bytes read=0
> 12/07/29 13:36:02 INFO mapred.JobClient: HDFS: Number of bytes
> written=1
> 12/07/29 13:36:02 INFO mapred.JobClient: HDFS: Number of read operations=1
> 12/07/29 13:36:02 INFO mapred.JobClient: HDFS: Number of large
> read operations=0
> 12/07/29 13:36:02 INFO mapred.JobClient: HDFS: Number of write 
> operations=2
> 12/07/29 13:36:02 INFO mapred.JobClient:   Map-Reduce Framework
> 12/07/29 13:36:02 INFO mapred.JobClient: Map input records=100
> 12/07/29 13:36:02 INFO mapred.JobClient: Map output records=100
> 12/07/29 13:36:02 INFO mapred.JobClient: Input split bytes=82
> 12/07/29 13:36:02 INFO mapred.JobClient: Spilled Records=0
> 12/07/29 13:36:02 INFO mapred.JobClient: CPU time spent (ms)=0
> 12/07/29 13:36:02 INFO mapred.JobClient: Physical memory (bytes) 
> snapshot=0
> 12/07/29 13:36:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
> 12/07/29 13:36:02 INFO mapred.JobClient: Total committed heap
> usage (bytes)=124715008
> 12/07/29 13:36:02 INFO mapred.JobClient:
> org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
>
> Regards
> Abhishek



-- 
Harsh J


Re: NODE MANAGER DOES NOT START

2012-07-28 Thread Harsh J
Good to know! Can you share the solution for benefit of others? Just
in spirit of this: http://xkcd.com/979/ :)

On Sat, Jul 28, 2012 at 7:27 PM, Abhishek  wrote:
> Hi harsh,
>
> Thanks for the reply. I got it resolved it was some problem with /etc/hosts 
> file
>
> Regards
> Abhishek
>
> Sent from my iPhone
>
> On Jul 28, 2012, at 12:57 AM, Harsh J  wrote:
>
>> Hi Abhishek,
>>
>> Easy on the caps mate. Can you pastebin.com-paste your NM logs and RM logs?
>>
>> On Sat, Jul 28, 2012 at 8:45 AM, abhiTowson cal
>>  wrote:
>>> HI all,
>>>
>>> Iam trying to start nodemanager but it does not start. i have
>>> installed CDH4 AND YARN
>>>
>>> all datanodes are running
>>>
>>> Resource manager is also running
>>>
>>> WHEN I CHECK LOG FILES,IT SAYS CONNECTION REFUSED ERROR
>>>
>>> Regards
>>> abhishek
>>
>>
>>
>> --
>> Harsh J



-- 
Harsh J


Re: NODE MANAGER DOES NOT START

2012-07-27 Thread Harsh J
Hi Abhishek,

Easy on the caps mate. Can you pastebin.com-paste your NM logs and RM logs?

On Sat, Jul 28, 2012 at 8:45 AM, abhiTowson cal
 wrote:
> HI all,
>
> Iam trying to start nodemanager but it does not start. i have
> installed CDH4 AND YARN
>
> all datanodes are running
>
> Resource manager is also running
>
> WHEN I CHECK LOG FILES,IT SAYS CONNECTION REFUSED ERROR
>
> Regards
> abhishek



-- 
Harsh J


Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-27 Thread Harsh J
I think it's alright to fail the app if it requests what is
impossible, rather than log or wait for an admin to come along and fix
it at runtime. Please do file a JIRA.

The max allocation value can perhaps also be dynamically set to the
maximum offered RAM value across the NMs that are live, or a fraction
of it? That is what caused this hang in the first place (by letting it
go in as a valid request, since default max alloc is about 10 GB).

On Sat, Jul 28, 2012 at 4:52 AM, anil gupta  wrote:
> Hi Harsh,
>
> Thanks a lot for your response. I am going to try your suggestions and let
> you know the outcome.
> I am running the cluster on VMWare hypervisor. I have 3 physical machines
> with 16GB of RAM, and 4TB( 2 HD of 2TB each). On every machine i am running
> 4 VM's. Each VM is having 3.2 GB of memory. I built this cluster for trying
> out HA(NN, ZK, HMaster) since we are little reluctant to deploy anything
> without HA in prod.
> This cluster is supposed to be used as HBase cluster and MR is going to be
> used only for Bulk Loading. Also, my data dump is around 10 GB(which is
> pretty small for Hadoop). I am going to load this data in 4 different
> schema which will be roughly 150 million records for HBase.
> So, i think i will lower down the memory requirement of Yarn for my use
> case rather than reducing the number of data nodes to increase the memory
> of remaining Data Nodes. Do you think this will be the right approach for
> my cluster environment?
> Also, on a side note, shouldn't the NodeManager throw an error on this kind
> of memory problem? Should i file a JIRA for this? It just sat quietly over
> there.
>
> Thanks a lot,
> Anil Gupta
>
> On Fri, Jul 27, 2012 at 3:36 PM, Harsh J  wrote:
>
>> Hi,
>>
>> The 'root' doesn't matter. You may run jobs as any username on an
>> unsecured cluster, should be just the same.
>>
>> The config yarn.nodemanager.resource.memory-mb = 1200 is your issue.
>> By default, the tasks will execute with a resource demand of 1 GB, and
>> the AM itself demands, by default, 1.5 GB to run. None of your nodes
>> are hence able to start your AM (demand=1500mb) and hence if the AM
>> doesn't start, your job won't initiate either.
>>
>> You can do a few things:
>>
>> 1. Raise yarn.nodemanager.resource.memory-mb to a value close to 4 GB
>> perhaps, if you have the RAM? Think of it as the new 'slots' divider.
>> The larger the offering (close to total RAM you can offer for
>> containers from the machine), the more the tasks that may run on it
>> (depending on their own demand, of course). Reboot the NM's one by one
>> and this app will begin to execute.
>> 2. Lower the AM's requirement, i.e. lower
>> yarn.app.mapreduce.am.resource.mb in your client's mapred-site.xml or
>> job config from 1500 to 1000 or less, so it fits in the NM's offering.
>> Likewise, control the map and reduce's requests via
>> mapreduce.map.memory.mb and mapreduce.reduce.memory.mb as needed.
>> Resubmit the job with these lowered requirements and things should now
>> work.
>>
>> Optionally, you may also cap the max/min possible requests via
>> "yarn.scheduler.minimum-allocation-mb" and
>> "yarn.scheduler.maximum-allocation-mb", such that no app/job ends up
>> demanding more than a certain limit and hence run into the
>> 'forever-waiting' state as in your case.
>>
>> Hope this helps! For some communication diagrams on how an app (such
>> as MR2, etc.) may work on YARN and how the resource negotiation works,
>> you can check out this post from Ahmed at
>> http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
>>
>> On Sat, Jul 28, 2012 at 3:35 AM, anil gupta  wrote:
>> > Hi Harsh,
>> >
>> > I have set the *yarn.nodemanager.resource.memory-mb *to 1200 mb. Also,
>> does
>> > it matters if i run the jobs as "root" while the RM service and NM
>> service
>> > are running as "yarn" user? However, i have created the /user/root
>> > directory for root user in hdfs.
>> >
>> > Here is the yarn-site.xml:
>> > 
>> >   
>> > yarn.nodemanager.aux-services
>> > mapreduce.shuffle
>> >   
>> >
>> >   
>> > yarn.nodemanager.aux-services.mapreduce.shuffle.class
>> > org.apache.hadoop.mapred.ShuffleHandler
>> >   
>> >
>> >   
>> > yarn.log-aggregation-enable
>> > true
>> >   
>> >
>> >   
>> > List of directories t

Re: Multinode cluster only recognizes 1 node

2012-07-27 Thread Harsh J
Sean,

Most of the time I've found this to be related to two issues:

1. NN and JT have been bound to localhost or a 127.0.0.1-resolving
hostname, leading to the other nodes never being able to connect to
its ports, as it never listens over the true network interface.
2. Failure in turning off, or in reconfiguring the firewall.
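
For the first point, a couple of quick checks on the master usually tell
the story (the ports shown are common defaults; adjust to whatever your
fs.default.name / mapred.job.tracker use):

# Is the NN/JT listening on a real interface, or only on 127.0.0.1?
netstat -ltn | egrep ':(9000|9001|50030|50070)'
# Do the master/slave hostnames resolve to LAN IPs, not loopback?
grep -v '^#' /etc/hosts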

On Fri, Jul 27, 2012 at 4:50 AM, Barry, Sean F  wrote:
> Hi,
>
> I just set up a 2 node POC cluster and I am currently having an issue with 
> it. I ran a wordcount MR test on my cluster to see if it was working and 
> noticed that the Web ui at localhost:50030 showed that I only have 1 live 
> node. I followed the tutorial step by step and I cannot seem to figure out my 
> problem. When I ran start-all.sh all of the daemons on my master node and my 
> slave node start up perfectly fine. If you have any suggestions please let me 
> know.
>
> -Sean



-- 
Harsh J


Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-27 Thread Harsh J
Hi,

The 'root' doesn't matter. You may run jobs as any username on an
unsecured cluster, should be just the same.

The config yarn.nodemanager.resource.memory-mb = 1200 is your issue.
By default, the tasks will execute with a resource demand of 1 GB, and
the AM itself demands, by default, 1.5 GB to run. None of your nodes
are hence able to start your AM (demand=1500mb) and hence if the AM
doesn't start, your job won't initiate either.

You can do a few things:

1. Raise yarn.nodemanager.resource.memory-mb to a value close to 4 GB
perhaps, if you have the RAM? Think of it as the new 'slots' divider.
The larger the offering (close to total RAM you can offer for
containers from the machine), the more the tasks that may run on it
(depending on their own demand, of course). Reboot the NM's one by one
and this app will begin to execute.
2. Lower the AM's requirement, i.e. lower
yarn.app.mapreduce.am.resource.mb in your client's mapred-site.xml or
job config from 1500 to 1000 or less, so it fits in the NM's offering.
Likewise, control the map and reduce's requests via
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb as needed.
Resubmit the job with these lowered requirements and things should now
work.

Optionally, you may also cap the max/min possible requests via
"yarn.scheduler.minimum-allocation-mb" and
"yarn.scheduler.maximum-allocation-mb", such that no app/job ends up
demanding more than a certain limit and hence run into the
'forever-waiting' state as in your case.

Hope this helps! For some communication diagrams on how an app (such
as MR2, etc.) may work on YARN and how the resource negotiation works,
you can check out this post from Ahmed at
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
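
Put together, the knobs from (1) and (2) above look roughly like this
(the values are illustrative only):

<!-- yarn-site.xml, per NodeManager -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

<!-- mapred-site.xml at the client, or per-job -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>512</value>
</property>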

On Sat, Jul 28, 2012 at 3:35 AM, anil gupta  wrote:
> Hi Harsh,
>
> I have set the *yarn.nodemanager.resource.memory-mb *to 1200 mb. Also, does
> it matters if i run the jobs as "root" while the RM service and NM service
> are running as "yarn" user? However, i have created the /user/root
> directory for root user in hdfs.
>
> Here is the yarn-site.xml:
> 
>   
> yarn.nodemanager.aux-services
> mapreduce.shuffle
>   
>
>   
> yarn.nodemanager.aux-services.mapreduce.shuffle.class
> org.apache.hadoop.mapred.ShuffleHandler
>   
>
>   
> yarn.log-aggregation-enable
> true
>   
>
>   
> List of directories to store localized files
> in.
> yarn.nodemanager.local-dirs
> /disk/yarn/local
>   
>
>   
> Where to store container logs.
> yarn.nodemanager.log-dirs
> /disk/yarn/logs
>   
>
>   
> Where to aggregate logs to.
> yarn.nodemanager.remote-app-log-dir
> /var/log/hadoop-yarn/apps
>   
>
>   
> Classpath for typical applications.
>  yarn.application.classpath
>  
> $HADOOP_CONF_DIR,
> $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
> $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
> $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
> $YARN_HOME/*,$YARN_HOME/lib/*
>  
>   
> 
> yarn.resourcemanager.resource-tracker.address
> ihub-an-l1:8025
> 
> 
> yarn.resourcemanager.address
> ihub-an-l1:8040
> 
> 
> yarn.resourcemanager.scheduler.address
> ihub-an-l1:8030
> 
> 
> yarn.resourcemanager.admin.address
> ihub-an-l1:8141
> 
> 
> yarn.resourcemanager.webapp.address
> ihub-an-l1:8088
> 
> 
> mapreduce.jobhistory.intermediate-done-dir
> /disk/mapred/jobhistory/intermediate/done
> 
> 
> mapreduce.jobhistory.done-dir
> /disk/mapred/jobhistory/done
> 
>
> 
> yarn.web-proxy.address
> ihub-an-l1:
> 
> 
> yarn.app.mapreduce.am.staging-dir
> /user
> 
>
> *
> Amount of physical memory, in MB, that can be allocated
>   for containers.
>yarn.nodemanager.resource.memory-mb
> 1200
> *
>
> 
>
>
>
>
> On Fri, Jul 27, 2012 at 2:23 PM, Harsh J  wrote:
>
>> Can you share your yarn-site.xml contents? Have you tweaked memory
>> sizes in there?
>>
>> On Fri, Jul 27, 2012 at 11:53 PM, anil gupta 
>> wrote:
>> > Hi All,
>> >
>> > I have a Hadoop 2.0 alpha(cdh4)  hadoop/hbase cluster runnning on
>> > CentOS6.0. The cluster has 4 admin nodes and 8 data nodes. I have the RM
>> > and History server running on one machine. RM web interface shows that 8
>> > Nodes are connected to it. I installed this cluster with HA capability
>> and
>> > I have already tested HA for N

Re: Reducer MapFileOutpuFormat

2012-07-27 Thread Harsh J
Hi Bertrand,

I believe he is talking about MapFile's index files, explained here:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html

On Fri, Jul 27, 2012 at 11:24 AM, Bertrand Dechoux  wrote:
> Your use of 'index' is indeed not clear. Are you talking about Hive or
> HBase?
>
> I can confirm that you will have one result file per reducer. Of course,
> for efficiency reasons, you need to limit the number of files. But if you
> are using multiple reducers it should mean that one reducer isn't fast
> enough, so it could be assumed that the output for each reducer is big
> enough. If that not the case, you can limit the number of reducer to one.
>
> In general, the 'fragmentation' of the results is dealt by the next job.
> You should provide more information about your real problem and its context.
>
> Bertrand
>
> On Fri, Jul 27, 2012 at 3:15 AM, syed kather  wrote:
>
>> Mike ,
>> Can you please give more details . Context is not clear . Can you share ur
>> use case if possible
>> On Jul 24, 2012 1:40 AM, "Mike S"  wrote:
>>
>> > If I set my reducer output to map file output format and the job would
>> > say have 100 reducers, will the output generate 100 different index
>> > file (one for each reducer) or one index file for all the reducers
>> > (basically one index file per job)?
>> >
>> > If it is one index file per reducer, can rely on HDFS append to change
>> > the index write behavior and build one index file from all the
>> > reducers by basically making all the parallel reducers to append to
>> > one index file? Data files do not matter.
>> >
>>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J


Re: Reducer MapFileOutpuFormat

2012-07-27 Thread Harsh J
Hey Mike,

Inline.

On Tue, Jul 24, 2012 at 1:39 AM, Mike S  wrote:
> If I set my reducer output to map file output format and the job would
> say have 100 reducers, will the output generate 100 different index
> file (one for each reducer) or one index file for all the reducers
> (basically one index file per job)?

Each MapFile gets its own index file, so yes 100 index files for 100
map files, given each Reducer creating one map file as its output.

> If it is one index file per reducer, can rely on HDFS append to change
> the index write behavior and build one index file from all the
> reducers by basically making all the parallel reducers to append to
> one index file? Data files do not matter.

I don't think MapFiles are append-able yet. For one, the index
complicates things as it keeps offsets based from 0 to length (I
think). Work for easy-appending sequence files has been ongoing
though: https://issues.apache.org/jira/browse/HADOOP-7139. Maybe you
can take a look and help enable MapFiles do the same somehow?

-- 
Harsh J


Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-27 Thread Harsh J
duce.JobSubmitter: number of splits:10
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.jar is deprecated.
> Instead, use mapreduce.job.jar
> 12/07/27 09:38:27 WARN conf.Configuration:
> mapred.map.tasks.speculative.execution is deprecated. Instead, use
> mapreduce.map.speculative
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.reduce.tasks is
> deprecated. Instead, use mapreduce.job.reduces
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.value.class is
> deprecated. Instead, use mapreduce.job.output.value.class
> 12/07/27 09:38:27 WARN conf.Configuration:
> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
> mapreduce.reduce.speculative
> 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.map.class is
> deprecated. Instead, use mapreduce.job.map.class
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.job.name is deprecated.
> Instead, use mapreduce.job.name
> 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.reduce.class is
> deprecated. Instead, use mapreduce.job.reduce.class
> 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.inputformat.class is
> deprecated. Instead, use mapreduce.job.inputformat.class
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.input.dir is deprecated.
> Instead, use mapreduce.input.fileinputformat.inputdir
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.dir is deprecated.
> Instead, use mapreduce.output.fileoutputformat.outputdir
> 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.outputformat.class is
> deprecated. Instead, use mapreduce.job.outputformat.class
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.map.tasks is deprecated.
> Instead, use mapreduce.job.maps
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.key.class is
> deprecated. Instead, use mapreduce.job.output.key.class
> 12/07/27 09:38:27 WARN conf.Configuration: mapred.working.dir is
> deprecated. Instead, use mapreduce.job.working.dir
> 12/07/27 09:38:27 INFO mapred.ResourceMgrDelegate: Submitted application
> application_1343365114818_0002 to ResourceManager at ihub-an-l1/
> 172.31.192.151:8040
> 12/07/27 09:38:27 INFO mapreduce.Job: The url to track the job:
> http://ihub-an-l1:/proxy/application_1343365114818_0002/
> 12/07/27 09:38:27 INFO mapreduce.Job: Running job: job_1343365114818_0002
>
> No Map-Reduce task are started by the cluster. I dont see any errors
> anywhere in the application. Please help me in resolving this problem.
>
> Thanks,
> Anil Gupta



-- 
Harsh J


Re: Hadoop Multithread MapReduce

2012-07-25 Thread Harsh J
Hi,

We do have a Multithreaded Mapper implementation available for use.
Check out: 
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html
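
As a rough, untested sketch (new "mapreduce" API), wiring it up in your
job driver looks something like the below; "MyMapper" is a placeholder
for your own map implementation, which must be thread-safe since
several threads will invoke it concurrently:

import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

// In the job driver, given an org.apache.hadoop.mapreduce.Job named "job":
// the job's mapper becomes the multithreaded wrapper...
job.setMapperClass(MultithreadedMapper.class);
// ...which runs your real mapper inside a pool of threads.
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);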

On Thu, Jul 26, 2012 at 8:26 AM, kenyh  wrote:
>
> Does anyone know about the feature about using multiple thread in map task or
> reduce task?
> Is it a good way to use multithread in map task?
> --
> View this message in context: 
> http://old.nabble.com/Hadoop-Multithread-MapReduce-tp34213534p34213534.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>



-- 
Harsh J


Re: Datanode error

2012-07-23 Thread Harsh J
p.hdfs.server.datanode.DataNode: 
> PacketResponder 0 for block blk_3941134611454287401_14080990 Interrupted.
> 2012-07-20 00:12:34,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> PacketResponder 0 for block blk_3941134611454287401_14080990 terminating
> 2012-07-20 00:12:34,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> writeBlock blk_3941134611454287401_14080990 received exception 
> java.io.EOFException: while trying to read 65557 bytes
> 2012-07-20 00:12:34,271 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(DN01:50010, 
> storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, 
> ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 65557 bytes
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:494)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
>



-- 
Harsh J


Re: IOException: too many length or distance symbols

2012-07-20 Thread Harsh J
Prashant,

Can you add some context on how these files were written? Perhaps open
a JIRA with a sample file and a test case to reproduce this?
Environment details, such as the Hadoop version in use, would help too.

On Sat, Jul 21, 2012 at 2:05 AM, Prashant Kommireddi
 wrote:
> I am seeing these exceptions, anyone know what they might be caused due to?
> Case of corrupt file?
>
> java.io.IOException: too many length or distance symbols
> at 
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
> Method)
> at 
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
> at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
> at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> at java.io.InputStream.read(InputStream.java:85)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
> at 
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
>
> Thanks,
> Prashant



-- 
Harsh J


Re: fail and kill all tasks without killing job.

2012-07-20 Thread Harsh J
Hi Jay,

Fail a single task four times (default), and the job will be marked as
failed. Is that what you're looking for?

Or, if you want your job to succeed even when not all tasks succeed,
tweak the "mapred.max.map/reduce.failures.percent" property in your job
(by default it expects 0% failures, so set a percentage between 0 and
100 that is acceptable for you).

To then avoid having to do it four times for a single task, lower
"mapred.map/reduce.max.attempts" down from its default of 4.

Does this answer your question?
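
For instance, in the driver (old "mapred" API), something along these
lines should do; "MyJob" and the numbers are just placeholders:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
// Tolerate up to 10% failed map/reduce tasks without failing the whole job.
conf.setMaxMapTaskFailuresPercent(10);
conf.setMaxReduceTaskFailuresPercent(10);
// Give up on a task after 2 attempts instead of the default 4.
conf.setMaxMapAttempts(2);
conf.setMaxReduceAttempts(2);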

On Sat, Jul 21, 2012 at 2:47 AM, jay vyas  wrote:
> Hi guys : I want my tasks to end/fail, but I don't want to kill my entire
> hadoop job.
>
> I have a hadoop job that runs 5 hadoop jobs in a row.
> Im on the last of those sub-jobs, and want to fail all tasks so that the
> task tracker stops delegating them,
> and the hadoop main job can naturally come to a close.
>
> However, when I run "hadoop job kill-attempt / fail-attempt ", the
> jobtracker seems to simply relaunch
> the same tasks with new ids.
>
> How can I tell the jobtracker to give up on redelegating?



-- 
Harsh J


Re: Datanode error

2012-07-20 Thread Harsh J
: /DN03:50345 dest: 
> /DN01:50010
> 2012-07-20 00:12:34,270 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception in receiveBlock for block blk_3941134611454287401_14080990 
> java.io.EOFException: while trying to read 65557 bytes
> 2012-07-20 00:12:34,270 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> PacketResponder 0 for block blk_3941134611454287401_14080990 Interrupted.
> 2012-07-20 00:12:34,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> PacketResponder 0 for block blk_3941134611454287401_14080990 terminating
> 2012-07-20 00:12:34,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> writeBlock blk_3941134611454287401_14080990 received exception 
> java.io.EOFException: while trying to read 65557 bytes
> 2012-07-20 00:12:34,271 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(DN01:50010, 
> storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, 
> ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 65557 bytes
>     at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:494)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)



-- 
Harsh J


Re: Avro vs Protocol Buffer

2012-07-19 Thread Harsh J
+1 to what Bruno pointed you at. I personally like Avro for its data
files (the schema is stored in the file, and it is a good, splittable
container for typed data records). I think serde speed is on par with
Thrift, if not faster today. Thrift offers no optimized data container
format, AFAIK.

On Thu, Jul 19, 2012 at 1:57 PM, Bruno Freudensprung
 wrote:
> Once new results will be available, you might be interested in:
> https://github.com/eishay/jvm-serializers/wiki/
> https://github.com/eishay/jvm-serializers/wiki/Staging-Results
>
> My2cts,
>
> Bruno.
>
> Le 16/07/2012 22:49, Mike S a écrit :
>
>> Strictly from speed and performance perspective, is Avro as fast as
>> protocol buffer?
>>
>



-- 
Harsh J


Re: Specifying user from Hadoop Client?

2012-07-18 Thread Harsh J
Here's a good write up Jonathan Natkins once did:
http://www.cloudera.com/blog/2012/03/authorization-and-authentication-in-hadoop/

On Thu, Jul 19, 2012 at 2:37 AM, Corbett Martin  wrote:

> Yes we could implement that, although I'd prefer not to force clients to
> add users and grant sudo just to interact with our hadoop cluster.  I
> suppose I need to read up on user authentication and authorization in
> hadoop before doing something like that.
>
> Thanks
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Wednesday, July 18, 2012 12:52 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Specifying user from Hadoop Client?
>
> Corbett,
>
> Unfortunately I do not know of a way to do that without writing wrapper
> code. I do not think it is possible with the secure implementation of
> MR/HDFS, regardless of security being turned on/off.
>
> Can your client machine not have a user named as the one that is allowed
> to do things on HDFS, if thats how you're architecting your usage? Then
> users may do "sudo -u ", given sudo grants for that, and create files
> via sudo -u user hadoop fs -foo bar commands?
>
> On Wed, Jul 18, 2012 at 11:05 PM, Corbett Martin 
> wrote:
>
> > Thanks for the quick response.
> >
> > I came across Secure Impersonation earlier today but it didn't seem to
> > do what I'm looking for.
> >
> > Correct me if I'm wrong but Secure Impersonation would require writing
> > code to operate on HDFS (mkdir, rm...etc), that code would then need to
> > be executed from a client?  I suppose this would do the trick but I
> > was hoping we could just issue hadoop fs commands against our cluster
> > directly from a remote client yet override the username thats being sent
> to the cluster.
> >
> > Thanks
> >
> > On Jul 18, 2012, at 11:54 AM, Harsh J wrote:
> >
> > > Hey Corbett,
> > >
> > > We prevent overriding user.name. We instead provide secure
> > > impersonation (does not require kerberos, don't be fooled by its
> > > name), which is documented at
> > > http://hadoop.apache.org/common/docs/stable/Secure_Impersonation.html.
> > > This should let you do what you're attempting to, in a more
> > > controlled fashion.
> > >
> > > On Wed, Jul 18, 2012 at 10:22 PM, Corbett Martin 
> > wrote:
> > >> Hello
> > >>
> > >> I'm new to Hadoop and I'm trying to do something I *think* should
> > >> be
> > easy but having some trouble.  Here's the details.
> > >>
> > >> 1. I'm running Hadoop version 1.0.2 2. I have a 2 Node Hadoop
> > >> Cluster up and running, with no security
> > enabled
> > >>
> > >> I'm having trouble overriding the username from the client so that
> > >> the
> > files/directories created are owned by the user I specify from the
> client.
> > >>
> > >> For example I'm trying to run:
> > >>
> > >>hadoop fs -Duser.name=someUserName -conf hadoop-cluster.xml
> > -mkdir /user/someOtherUserName/test
> > >>
> > >> And have the directory "test" created in hdfs and owned by
> > "someUserName".  Instead it is creating the directory and giving it
> > the owner of the user (whoami) from the client.  I'd like to override
> > or control that...can someone tell me how?
> > >>
> > >> My hadoop-cluster.xml file on the client looks like this:
> > >>
> > >> 
> > >> 
> > >>
> > >>  
> > >>fs.default.name
> > >>hdfs://server1:54310
> > >>  
> > >>
> > >>  
> > >>mapred.job.tracker
> > >>server1:54311
> > >>  
> > >>
> > >> 
> > >>
> > >> Thanks for the help
> > >>
> > >> This message and its contents (to include attachments) are the
> > >> property
> > of National Health Systems, Inc. and may contain confidential and
> > proprietary information. This email and any files transmitted with it
> > are intended solely for the use of the individual or entity to whom
> > they are addressed. You are hereby notified that any unauthorized
> > disclosure, copying, or distribution of this message, or the taking of
> > any unauthorized action based on information contained herein is
> strictly prohibited.
> > Unauthorized use of information contained her

Re: Specifying user from Hadoop Client?

2012-07-18 Thread Harsh J
Corbett,

Unfortunately I do not know of a way to do that without writing wrapper
code. I do not think it is possible with the secure implementation of
MR/HDFS, regardless of security being turned on/off.

Can your client machine not have a user named as the one that is allowed to
do things on HDFS, if that's how you're architecting your usage? Then users
may do "sudo -u ", given sudo grants for that, and create files via
sudo -u user hadoop fs -foo bar commands?

On Wed, Jul 18, 2012 at 11:05 PM, Corbett Martin  wrote:

> Thanks for the quick response.
>
> I came across Secure Impersonation earlier today but it didn't seem to do
> what I'm looking for.
>
> Correct me if I'm wrong but Secure Impersonation would require writing
> code to operate on HDFS (mkdir, rm…etc), that code would then need to be
> executed from a client?  I suppose this would do the trick but I was hoping
> we could just issue hadoop fs commands against our cluster directly from a
> remote client yet override the username thats being sent to the cluster.
>
> Thanks
>
> On Jul 18, 2012, at 11:54 AM, Harsh J wrote:
>
> > Hey Corbett,
> >
> > We prevent overriding user.name. We instead provide secure
> > impersonation (does not require kerberos, don't be fooled by its
> > name), which is documented at
> > http://hadoop.apache.org/common/docs/stable/Secure_Impersonation.html.
> > This should let you do what you're attempting to, in a more controlled
> > fashion.
> >
> > On Wed, Jul 18, 2012 at 10:22 PM, Corbett Martin 
> wrote:
> >> Hello
> >>
> >> I'm new to Hadoop and I'm trying to do something I *think* should be
> easy but having some trouble.  Here's the details.
> >>
> >> 1. I'm running Hadoop version 1.0.2
> >> 2. I have a 2 Node Hadoop Cluster up and running, with no security
> enabled
> >>
> >> I'm having trouble overriding the username from the client so that the
> files/directories created are owned by the user I specify from the client.
> >>
> >> For example I'm trying to run:
> >>
> >>hadoop fs -Duser.name=someUserName -conf hadoop-cluster.xml
> -mkdir /user/someOtherUserName/test
> >>
> >> And have the directory "test" created in hdfs and owned by
> "someUserName".  Instead it is creating the directory and giving it the
> owner of the user (whoami) from the client.  I'd like to override or
> control that…can someone tell me how?
> >>
> >> My hadoop-cluster.xml file on the client looks like this:
> >>
> >> 
> >> 
> >>
> >>  
> >>fs.default.name
> >>hdfs://server1:54310
> >>  
> >>
> >>  
> >>mapred.job.tracker
> >>server1:54311
> >>  
> >>
> >> 
> >>
> >> Thanks for the help
> >>
> >> This message and its contents (to include attachments) are the property
> of National Health Systems, Inc. and may contain confidential and
> proprietary information. This email and any files transmitted with it are
> intended solely for the use of the individual or entity to whom they are
> addressed. You are hereby notified that any unauthorized disclosure,
> copying, or distribution of this message, or the taking of any unauthorized
> action based on information contained herein is strictly prohibited.
> Unauthorized use of information contained herein may subject you to civil
> and criminal prosecution and penalties. If you are not the intended
> recipient, you should delete this message immediately and notify the sender
> immediately by telephone or by replying to this transmission.
> >
> >
> >
> > --
> > Harsh J
>
>
> This message and its contents (to include attachments) are the property of
> National Health Systems, Inc. and may contain confidential and proprietary
> information. This email and any files transmitted with it are intended
> solely for the use of the individual or entity to whom they are addressed.
> You are hereby notified that any unauthorized disclosure, copying, or
> distribution of this message, or the taking of any unauthorized action
> based on information contained herein is strictly prohibited. Unauthorized
> use of information contained herein may subject you to civil and criminal
> prosecution and penalties. If you are not the intended recipient, you
> should delete this message immediately and notify the sender immediately by
> telephone or by replying to this transmission.
>



-- 
Harsh J


Re: Specifying user from Hadoop Client?

2012-07-18 Thread Harsh J
Hey Corbett,

We prevent overriding user.name. We instead provide secure
impersonation (does not require kerberos, don't be fooled by its
name), which is documented at
http://hadoop.apache.org/common/docs/stable/Secure_Impersonation.html.
This should let you do what you're attempting to, in a more controlled
fashion.
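
As an illustration only (an untested sketch), the client-side "doAs"
part looks roughly like this; "someUserName" is a placeholder, and it
assumes the cluster's core-site.xml already whitelists your connecting
super-user via the hadoop.proxyuser.* properties described on that page:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class MkdirAsUser {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    // Act as "someUserName" while authenticating as the logged-in super-user.
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        "someUserName", UserGroupInformation.getLoginUser());
    proxy.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(conf);
        // The created directory ends up owned by "someUserName".
        fs.mkdirs(new Path("/user/someUserName/test"));
        return null;
      }
    });
  }
}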

On Wed, Jul 18, 2012 at 10:22 PM, Corbett Martin  wrote:
> Hello
>
> I'm new to Hadoop and I'm trying to do something I *think* should be easy but 
> having some trouble.  Here's the details.
>
> 1. I'm running Hadoop version 1.0.2
> 2. I have a 2 Node Hadoop Cluster up and running, with no security enabled
>
> I'm having trouble overriding the username from the client so that the 
> files/directories created are owned by the user I specify from the client.
>
> For example I'm trying to run:
>
> hadoop fs -Duser.name=someUserName -conf hadoop-cluster.xml -mkdir 
> /user/someOtherUserName/test
>
> And have the directory "test" created in hdfs and owned by "someUserName".  
> Instead it is creating the directory and giving it the owner of the user 
> (whoami) from the client.  I'd like to override or control that…can someone 
> tell me how?
>
> My hadoop-cluster.xml file on the client looks like this:
>
> 
> 
>
>   
> fs.default.name
> hdfs://server1:54310
>   
>
>   
> mapred.job.tracker
> server1:54311
>   
>
> 
>
> Thanks for the help
>
> This message and its contents (to include attachments) are the property of 
> National Health Systems, Inc. and may contain confidential and proprietary 
> information. This email and any files transmitted with it are intended solely 
> for the use of the individual or entity to whom they are addressed. You are 
> hereby notified that any unauthorized disclosure, copying, or distribution of 
> this message, or the taking of any unauthorized action based on information 
> contained herein is strictly prohibited. Unauthorized use of information 
> contained herein may subject you to civil and criminal prosecution and 
> penalties. If you are not the intended recipient, you should delete this 
> message immediately and notify the sender immediately by telephone or by 
> replying to this transmission.



-- 
Harsh J


Re: Concurrency control

2012-07-18 Thread Harsh J
Hi,

Do note that many users out there haven't used Teradata and may not
directly pick up what you mean here.

Since you're speaking of Tables, I am going to assume you mean HBase.
If what you're looking for is atomicity, HBase does offer it already.
If you want to order requests differently, depending on a condition,
the HBase coprocessors (new from Apache HBase 0.92 onwards) provide
you an ability to do that too. If your question is indeed specific to
HBase, please ask it in a more clarified form on the
u...@hbase.apache.org lists.

If not HBase, do you mean read/write concurrency over HDFS files?
HDFS files do not allow concurrent writers (there is one active lease
per file), AFAICT.

On Wed, Jul 18, 2012 at 9:09 PM, saubhagya dey  wrote:
> how do i manage concurrency in hadoop like we do in teradata.
> We need to have a read and write lock when simultaneous the same table is
> being hit with a read query and write query



-- 
Harsh J


Re: Data Nodes not seeing NameNode / Task Trackers not seeing JobTracker

2012-07-16 Thread Harsh J
Ronan,

A couple of simple things to ensure first:

1. Make sure the firewall isn't the one at fault here. Best to disable
firewall if you do not need it, or carefully configure the rules to
allow in/out traffic over chosen ports.
2. Ensure that the hostnames fs.default.name and mapred.job.tracker
bind to are externally resolvable hostnames, not localhost
(loopback-bound) addresses.

On Tue, Jul 17, 2012 at 12:05 AM, Ronan Lehane  wrote:
> Hi All,
>
> I was wondering if anyone could help me figure out what's going wrong in my
> five node Hadoop cluster, please?
>
> It consists of:
> 1. NameNode
> hduser@namenode:/usr/local/hadoop$ jps
> 13049 DataNode
> 13387 Jps
> 12740 NameNode
> 13316 SecondaryNameNode
>
> 2. JobTracker
> hduser@jobtracker:/usr/local/hadoop$ jps
> 21817 TaskTracker
> 21448 DataNode
> 21542 JobTracker
> 21862 Jps
>
> 3. Slave1
> hduser@slave1:/usr/local/hadoop$ jps
> 21226 DataNode
> 21514 Jps
> 21463 TaskTracker
>
> 4. Slave2
> hduser@slave2:/usr/local/hadoop$ jps
> 20938 Jps
> 20650 DataNode
> 20887 TaskTracker
>
> 5. Slave3
> hduser@slave3:/usr/local/hadoop$ jps
> 22145 Jps
> 21854 DataNode
> 22091 TaskTracker
>
> All DataNodes have been kicked off by running start-dfs.sh on the NameNode
> All TaskTrackers have been kicked off by running start-mapred.sh on the
> JobTracker
>
> When I try to execute a simple wordcount job from the NameNode I receive
> the following error:
> 12/07/16 19:25:22 ERROR security.UserGroupInformation:
> PriviledgedActionException as:hduser cause:java.net.ConnectException: Call
> to jobtracker/10.21.68.218:54311 failed on connection exception:
> java.net.ConnectException: Connection refused
>
> If I check the jobtracker:
> 1. I can ping in both directions by both IP and Hostname
> 2. I can see that the jobtracker is listening on port 54311
> tcp0  0 127.0.0.1:54311 0.0.0.0:*
> LISTEN  1001   425093  21542/java
> 3. Telnet to this port from the NameNode fails with "Connection Refused"
> telnet: Unable to connect to remote host: Connection refused
>
> This issue can be worked around by moving the JobTracker functionality to
> the NameNode, but when this is done the job is executed on the NameNode
> rather than distributed across the cluster.
> Checking the log files on the slaves nodes, I see Server Not Available
> messages referenced at the below wiki.
> http://wiki.apache.org/hadoop/ServerNotAvailable
> The Data Nodes not seeing the NameNode and the Task Trackers not seeing
> JobTracker.
> Checking the JobTracker web interface, it always states there is only 1
> node available.
>
> I've checked the 5 troubleshooting steps provided but it all looks to be ok
> in my environment.
>
> Would anyone have any idea's of what could be causing this?
> Any help would be appreciated.
>
> Cheers,
> Ronan



-- 
Harsh J


Re: Simply reading small a hadoop text file.

2012-07-13 Thread Harsh J
You want the KeyValueTextInputFormat instead of TextInputFormat. Its
default separator is the tab character, so you do not need to configure
the delimiter.

However, in case you do have to change the delimiter byte, use the
config: "mapreduce.input.keyvaluelinerecordreader.key.value.separator"

For more, see 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
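
A minimal sketch for the driver, assuming a release that ships the
new-API class (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat);
on the older "mapred" API the equivalent property is, if I recall
right, "key.value.separator.in.input.line":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

Configuration conf = new Configuration();
// Only needed when the separator is not a tab; "," here is just an example.
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = new Job(conf, "kv-read");
job.setInputFormatClass(KeyValueTextInputFormat.class);
// Each map() call then receives the text before the separator as its Text
// key and the rest of the line as its Text value.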

On Sat, Jul 14, 2012 at 6:00 AM, Jay Vyas  wrote:
> Hi guys : Whats the idiomatic way to iterate through the k/v pairs in a
> text file ? been playing with almost everything everything with
> SequenceFiles and almost forgot :)
>
> my text output actually has tabs in it... So, im not sure what the default
> separator is, and wehter or not there is a smart way to find the value.
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: Too Many Open Files

2012-07-12 Thread Harsh J
Mike,

Understood. Then you may need to use
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
instead of MultipleTextOutputFormat.
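
In other words, open and close the extra files yourself from within the
reduce() method. A rough, untested fragment (old "mapred" API, with
"someKey" and "someValue" as placeholders) of the idea:

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;

// Inside the reducer, with the JobConf ("conf") saved from configure():
Path workDir = FileOutputFormat.getWorkOutputPath(conf); // task-private dir
Path sideFile = new Path(workDir, "out-" + someKey);
FSDataOutputStream out = workDir.getFileSystem(conf).create(sideFile);
out.writeBytes(someValue + "\n");
// Close each file as soon as its key group is done, so only a handful of
// file descriptors are ever open at once.
out.close();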

On Thu, Jul 12, 2012 at 11:29 AM, Mike S  wrote:
> 100% sure I have done and again the problem is not becuase my
> configuration is kicking it. The problem is that my application uses
> MultipleTextOutputFormat that may create 500 000 files and linux does
> allow that many open files for whatever reason. If I set the limit too
> high, it will ignore it.
>
> On Wed, Jul 11, 2012 at 10:12 PM, Harsh J  wrote:
>> Are you sure you've raised the limits for your user, and have
>> re-logged in to the machine?
>>
>> Logged in as the user you run eclipse as, what do you get as the
>> output if you run "ulimit -n"?
>>
>> On Thu, Jul 12, 2012 at 3:03 AM, Mike S  wrote:
>>> To debug an specific file, I need to run hadoop in eclipse and eclipse
>>> keep throwing the Too Many Open File Ecxception. I followed the post
>>> out there to increase the number of open file per process in
>>> /etc/security/limits.conf to as high as I my machine accept and still
>>> I am getting the too many open file exception from java io.
>>>
>>> I think the main reason is that I am using a MultipleTextOutputFormat
>>> and my reducer could create many output files based on the my Muti
>>> Output logic. Is there a way to make Hadoop not to open so many open
>>> files. If not, can I control when the reduce to close a file?
>>
>>
>>
>> --
>> Harsh J



-- 
Harsh J


Re: Configuring Hadoop clusters with multiple PCs, each of which has 2 hard disks (Sata+SSD)

2012-07-12 Thread Harsh J
Hi Ivangelion,

Replied inline.

On Thu, Jul 12, 2012 at 2:02 PM, Ivangelion  wrote:
> Hi,

Install all hadoop libs on the SATA disk.

> - 1 PC: pure namenode

Configure dfs.name.dir to write to both places, one under the SATA disk
and the other under the SSD, for redundancy (failure tolerance). This
is in hdfs-site.xml.

SATA/dfs/name,SSD/dfs/name

> - Other 5 PCs: datanodes (1 of which also serves as secondary namenode)

Configure dfs.data.dir to write to a location on the SATA disk
(SATA/dfs/data). This is in hdfs-site.xml.
Configure mapred.local.dir to write to a location on the SSD disk
(SSD/mapred/local). This is in mapred-site.xml.

> - Sata disk with bigger size: common HDFS data storage
> - SSD disk with smaller size but faster: temporary data storage when
> processing map reduce jobs or doing data analyzing.

If you limit your MR to use only the SSD space, that is all the space
it gets to write to per mapper. So if a mapper tries to write, or a
reducer tries to read, over 200 GB of data, it may run into space
unavailability issues. If that becomes a problem, configure
mapred.local.dir to use SATA/mapred/local as well.

> Is there anything that needs to be modified?

Yes, configure fs.checkpoint.dir to SSD/dfs/namesecondary for the SNN
to use. This too goes in hdfs-site.xml.

After configuring these, you may ignore hadoop.tmp.dir, as it
shouldn't be used for anything else.

-- 
Harsh J


Re: protoc: command not found

2012-07-11 Thread Harsh J
Hi Subin,

The dependencies have been documented at
http://wiki.apache.org/hadoop/HowToContribute as well. Ensure you have
done all the steps there before building.

On Thu, Jul 12, 2012 at 12:08 AM, Modeel, Subin  wrote:
> Hi
> I am unable to build the code from TRUNK on my RHEL6.2
> I get the below error. After googling I found that I need to have some 
> protocolbuffer rpm.
> But I unable to find which to install.
> Can I have the  instructions on the setup needed for building the  TRUNK?
>
>
>
> INFO] Executed tasks
> [INFO]
> [INFO] --- build-helper-maven-plugin:1.5:add-source (add-source) @ 
> hadoop-common ---
> [INFO] Source directory: 
> /data/Development_Space/hdpTRUNK/hadoop-common-project/hadoop-common/target/generated-sources/java
>  added.
> [INFO]
> [INFO] --- build-helper-maven-plugin:1.5:add-test-source (add-test-source) @ 
> hadoop-common ---
> [INFO] Test Source directory: 
> /data/Development_Space/hdpTRUNK/hadoop-common-project/hadoop-common/target/generated-test-sources/java
>  added.
> [INFO]
> [INFO] --- maven-antrun-plugin:1.6:run (compile-proto) @ hadoop-common ---
> [INFO] Executing tasks
>
> main:
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
>  [exec] target/compile-proto.sh: line 17: protoc: command not found
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache Hadoop Main  SUCCESS [3.147s]
> [INFO] Apache Hadoop Project POM . SUCCESS [2.072s]
> [INFO] Apache Hadoop Annotations . SUCCESS [1.430s]
> [INFO] Apache Hadoop Project Dist POM  SUCCESS [1.243s]
> [INFO] Apache Hadoop Assemblies .. SUCCESS [0.619s]
> [INFO] Apache Hadoop Auth  SUCCESS [7.017s]
> [INFO] Apache Hadoop Auth Examples ... SUCCESS [0.974s]
> [INFO] Apache Hadoop Common ...... FAILURE [1.548s]
> [INFO] Apache Hadoop Common Project .. SKIPPED
> Thanks,
> Su



-- 
Harsh J


Re: can't disable speculative execution?

2012-07-11 Thread Harsh J
Er, sorry I meant mapred.map.tasks = 1

On Thu, Jul 12, 2012 at 10:44 AM, Harsh J  wrote:
> Try passing mapred.map.tasks = 0 or set a higher min-split size?
>
> On Thu, Jul 12, 2012 at 10:36 AM, Yang  wrote:
>> Thanks Harsh
>>
>> I see
>>
>> then there seems to be some small problems with the Splitter / InputFormat.
>>
>> I'm just reading a 1-line text file through pig:
>>
>> A = LOAD 'myinput.txt' ;
>>
>> supposedly it should generate at most 1 mapper.
>>
>> but in reality , it seems that pig generated 3 mappers, and basically fed
>> empty input to 2 of the mappers
>>
>>
>> Thanks
>> Yang
>>
>> On Wed, Jul 11, 2012 at 10:00 PM, Harsh J  wrote:
>>
>>> Yang,
>>>
>>> No, those three are individual task attempts.
>>>
>>> This is how you may generally dissect an attempt ID when reading it:
>>>
>>> attempt_201207111710_0024_m_00_0
>>>
>>> 1. "attempt" - indicates its an attempt ID you'll be reading
>>> 2. "201207111710" - The job tracker timestamp ID, indicating which
>>> instance of JT ran this job
>>> 3. "0024" - The Job ID for which this was a task attempt
>>> 4. "m" - Indicating this is a mapper (reducers are "r")
>>> 5. "00" - The task ID of the mapper (0 is the first mapper,
>>> 1 is the second, etc.)
>>> 6. "0" - The attempt # for the task ID. 0 means it is the first
>>> attempt, 1 indicates the second attempt, etc.
>>>
>>> On Thu, Jul 12, 2012 at 9:16 AM, Yang  wrote:
>>> > I set the following params to be false in my pig script (0.10.0)
>>> >
>>> > SET mapred.map.tasks.speculative.execution false;
>>> > SET mapred.reduce.tasks.speculative.execution false;
>>> >
>>> >
>>> > I also verified in the jobtracker UI in the job.xml that they are indeed
>>> > set correctly.
>>> >
>>> > when the job finished, jobtracker UI shows that there is only one attempt
>>> > for each task (in fact I have only 1 task too).
>>> >
>>> > but when I went to the tasktracker node, looked under the
>>> > /var/log/hadoop/userlogs/job_id_here/
>>> > dir , there are 3 attempts dir ,
>>> >  job_201207111710_0024 # ls
>>> > attempt_201207111710_0024_m_00_0
>>>  attempt_201207111710_0024_m_01_0
>>> >  attempt_201207111710_0024_m_02_0  job-acls.xml
>>> >
>>> > so 3 attempts were indeed fired ??
>>> >
>>> > I have to get this controlled correctly because I'm trying to debug the
>>> > mappers through eclipse,
>>> > but if more than 1 mapper process is fired, they all try to connect to
>>> the
>>> > same debugger port, and the end result is that nobody is able to
>>> > hook to the debugger.
>>> >
>>> >
>>> > Thanks
>>> > Yang
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>
>
>
> --
> Harsh J



-- 
Harsh J


Re: can't disable speculative execution?

2012-07-11 Thread Harsh J
Try passing mapred.map.tasks = 0 or set a higher min-split size?

On Thu, Jul 12, 2012 at 10:36 AM, Yang  wrote:
> Thanks Harsh
>
> I see
>
> then there seems to be some small problems with the Splitter / InputFormat.
>
> I'm just reading a 1-line text file through pig:
>
> A = LOAD 'myinput.txt' ;
>
> supposedly it should generate at most 1 mapper.
>
> but in reality , it seems that pig generated 3 mappers, and basically fed
> empty input to 2 of the mappers
>
>
> Thanks
> Yang
>
> On Wed, Jul 11, 2012 at 10:00 PM, Harsh J  wrote:
>
>> Yang,
>>
>> No, those three are individual task attempts.
>>
>> This is how you may generally dissect an attempt ID when reading it:
>>
>> attempt_201207111710_0024_m_00_0
>>
>> 1. "attempt" - indicates its an attempt ID you'll be reading
>> 2. "201207111710" - The job tracker timestamp ID, indicating which
>> instance of JT ran this job
>> 3. "0024" - The Job ID for which this was a task attempt
>> 4. "m" - Indicating this is a mapper (reducers are "r")
>> 5. "00" - The task ID of the mapper (0 is the first mapper,
>> 1 is the second, etc.)
>> 6. "0" - The attempt # for the task ID. 0 means it is the first
>> attempt, 1 indicates the second attempt, etc.
>>
>> On Thu, Jul 12, 2012 at 9:16 AM, Yang  wrote:
>> > I set the following params to be false in my pig script (0.10.0)
>> >
>> > SET mapred.map.tasks.speculative.execution false;
>> > SET mapred.reduce.tasks.speculative.execution false;
>> >
>> >
>> > I also verified in the jobtracker UI in the job.xml that they are indeed
>> > set correctly.
>> >
>> > when the job finished, jobtracker UI shows that there is only one attempt
>> > for each task (in fact I have only 1 task too).
>> >
>> > but when I went to the tasktracker node, looked under the
>> > /var/log/hadoop/userlogs/job_id_here/
>> > dir , there are 3 attempts dir ,
>> >  job_201207111710_0024 # ls
>> > attempt_201207111710_0024_m_00_0
>>  attempt_201207111710_0024_m_01_0
>> >  attempt_201207111710_0024_m_02_0  job-acls.xml
>> >
>> > so 3 attempts were indeed fired ??
>> >
>> > I have to get this controlled correctly because I'm trying to debug the
>> > mappers through eclipse,
>> > but if more than 1 mapper process is fired, they all try to connect to
>> the
>> > same debugger port, and the end result is that nobody is able to
>> > hook to the debugger.
>> >
>> >
>> > Thanks
>> > Yang
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: Too Many Open Files

2012-07-11 Thread Harsh J
Are you sure you've raised the limits for your user, and have
re-logged in to the machine?

Logged in as the user you run eclipse as, what do you get as the
output if you run "ulimit -n"?

On Thu, Jul 12, 2012 at 3:03 AM, Mike S  wrote:
> To debug an specific file, I need to run hadoop in eclipse and eclipse
> keep throwing the Too Many Open File Ecxception. I followed the post
> out there to increase the number of open file per process in
> /etc/security/limits.conf to as high as I my machine accept and still
> I am getting the too many open file exception from java io.
>
> I think the main reason is that I am using a MultipleTextOutputFormat
> and my reducer could create many output files based on the my Muti
> Output logic. Is there a way to make Hadoop not to open so many open
> files. If not, can I control when the reduce to close a file?



-- 
Harsh J


Re: can't disable speculative execution?

2012-07-11 Thread Harsh J
Your problem is more from the fact that you are running > 1 map slot
per TT, and multiple mappers are getting run at the same time, all
trying to bind to the same port. Limit your TT's max map tasks to 1
when you're relying on such techniques to debug, or use the
LocalJobRunner/Apache MRUnit instead.
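
For the LocalJobRunner route, a minimal driver tweak (a hedged sketch,
not specific to your Pig setup) is enough; the whole job then runs
inside one JVM, so a single debugger port suffices:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Run all map and reduce tasks in-process via the LocalJobRunner.
conf.set("mapred.job.tracker", "local");
// Optional: also read/write the local filesystem instead of HDFS.
conf.set("fs.default.name", "file:///");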

On Thu, Jul 12, 2012 at 9:16 AM, Yang  wrote:
> I set the following params to be false in my pig script (0.10.0)
>
> SET mapred.map.tasks.speculative.execution false;
> SET mapred.reduce.tasks.speculative.execution false;
>
>
> I also verified in the jobtracker UI in the job.xml that they are indeed
> set correctly.
>
> when the job finished, jobtracker UI shows that there is only one attempt
> for each task (in fact I have only 1 task too).
>
> but when I went to the tasktracker node, looked under the
> /var/log/hadoop/userlogs/job_id_here/
> dir , there are 3 attempts dir ,
>  job_201207111710_0024 # ls
> attempt_201207111710_0024_m_00_0  attempt_201207111710_0024_m_01_0
>  attempt_201207111710_0024_m_02_0  job-acls.xml
>
> so 3 attempts were indeed fired ??
>
> I have to get this controlled correctly because I'm trying to debug the
> mappers through eclipse,
> but if more than 1 mapper process is fired, they all try to connect to the
> same debugger port, and the end result is that nobody is able to
> hook to the debugger.
>
>
> Thanks
> Yang



-- 
Harsh J


Re: can't disable speculative execution?

2012-07-11 Thread Harsh J
Yang,

No, those three are individual task attempts.

This is how you may generally dissect an attempt ID when reading it:

attempt_201207111710_0024_m_00_0

1. "attempt" - indicates its an attempt ID you'll be reading
2. "201207111710" - The job tracker timestamp ID, indicating which
instance of JT ran this job
3. "0024" - The Job ID for which this was a task attempt
4. "m" - Indicating this is a mapper (reducers are "r")
5. "00" - The task ID of the mapper (0 is the first mapper,
1 is the second, etc.)
6. "0" - The attempt # for the task ID. 0 means it is the first
attempt, 1 indicates the second attempt, etc.

On Thu, Jul 12, 2012 at 9:16 AM, Yang  wrote:
> I set the following params to be false in my pig script (0.10.0)
>
> SET mapred.map.tasks.speculative.execution false;
> SET mapred.reduce.tasks.speculative.execution false;
>
>
> I also verified in the jobtracker UI in the job.xml that they are indeed
> set correctly.
>
> when the job finished, jobtracker UI shows that there is only one attempt
> for each task (in fact I have only 1 task too).
>
> but when I went to the tasktracker node, looked under the
> /var/log/hadoop/userlogs/job_id_here/
> dir , there are 3 attempts dir ,
>  job_201207111710_0024 # ls
> attempt_201207111710_0024_m_00_0  attempt_201207111710_0024_m_01_0
>  attempt_201207111710_0024_m_02_0  job-acls.xml
>
> so 3 attempts were indeed fired ??
>
> I have to get this controlled correctly because I'm trying to debug the
> mappers through eclipse,
> but if more than 1 mapper process is fired, they all try to connect to the
> same debugger port, and the end result is that nobody is able to
> hook to the debugger.
>
>
> Thanks
> Yang



-- 
Harsh J


Re: FileSystem Closed.

2012-07-10 Thread Harsh J
This appears to be a Hive issue (something probably called FS.close()
too early?). Redirecting to the Hive user lists as they can help
better with this.

On Tue, Jul 10, 2012 at 9:59 PM, 안의건  wrote:
> Hello. I have a problem with the filesystem closing.
>
> The filesystem was closed when the hive query is running.
> It is 'select' query and the data size is about 1TB.
> I'm using hadoop-0.20.2 and hive-0.7.1.
>
> The error log is telling that tmp file is not deleted, or the tmp path
> exception is occurred.
>
> Is there any hadoop configuration I'm missing?
>
> Thank you
>
> [stderr logs]
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException:
> Filesystem closed
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:454)
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:636)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
> at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:617)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
> at org.apache.hadoop.fs.FileSystem.deleteOnExit(FileSystem.java:615)
> at
> org.apache.hadoop.hive.shims.Hadoop20Shims.fileSystemDeleteOnExit(Hadoop20Shims.java:68)
> at
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:451)
> ... 12 more



-- 
Harsh J


Re: fixing the java / unixPrincipal hadoop error... Ubuntu.

2012-07-08 Thread Harsh J
Hi,

That specific class is not something Hadoop provides but does use it.

A couple of things you need to try first:
# Keep only a single Java version installed, and remove gcj, etc. It
helps keep things sane.
# Have you tried a simple, explicit "export JAVA_HOME=/path/to/jdk6u33"
before you run your ant build?

On Mon, Jul 9, 2012 at 3:18 AM, Jay Vyas  wrote:
> Hi guys : I run into the following roadblock in my VM - and Im not sure
> what the right way to install sun java is.  Any suggestions?
> In particular, the question is best described here:
>
> http://stackoverflow.com/questions/11288964/sun-java-not-loading-unixprincipal-ubuntu-12#comment14859324_11288964
>
> PS I posted this here because, mainly, this is a hadoop issue more than a
> pure java one, since the missing class "UnixPrincipal" Exception (i.e. you
> google for it), is mostly exclusive to the hadoop community.
>
> --
> Jay Vyas
> MMSB/UCHC



-- 
Harsh J


Re: fs.trash.interval

2012-07-08 Thread Harsh J
Hi,

I'm not sure why you're asking how to stop. Can you not ^C (Ctrl-C)
the running 'hadoop fs -rm' command and start over?

^C
hadoop fs -rmr -skipTrash /path
hadoop fs -rmr -skipTrash .Trash

Also, please send user queries to the common-user@ group, not the
common-dev@ group, which is for project development.

On Mon, Jul 9, 2012 at 2:58 AM, abhiTowson cal
 wrote:
> Hi,
> We have very large sample dataset to delete from HDFS. But we dont
> need this data to be in trash (trash interval is enabled).
> Unfortunately we started deleting data without skip trash option. It's
> taking very long time to move data into trash. Can you please help me
> how to stop this process of deleting and restart process with skip
> trash??



-- 
Harsh J


Re: Versions

2012-07-07 Thread Harsh J
The Apache Bigtop project was started for this very purpose (building
stable, well-interoperating version stacks). Take a read at
http://incubator.apache.org/bigtop/ and for 1.x Bigtop packages, see
https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop

To specifically answer your question though, your list appears fine to
me. They 'should work', but I am not suggesting that I have tested
this stack completely myself.

On Sat, Jul 7, 2012 at 11:57 PM, prabhu K  wrote:
> Hi users list,
>
> I am planing to install following tools.
>
> Hadoop 1.0.3
> hive 0.9.0
> flume 1.2.0
> Hbase 0.92.1
> sqoop 1.4.1
>
> my questions are.
>
> 1. the above tools are compatible with all the versions.
>
> 2. any tool need to change the version
>
> 3. list out all the tools with compatible versions.
>
> Please suggest on this?



-- 
Harsh J


Re: set up Hadoop cluster on mixed OS

2012-07-06 Thread Harsh J
You can set up a minimally working cluster, but note that artifacts
such as the native code for compression codecs and the
LinuxTaskController for security may not work out of the box across
all the platforms. So if you run jobs with compression that demands
native codecs, they may pass only on your Linux boxes, not the OS X
ones.

On Fri, Jul 6, 2012 at 3:01 PM, Senthil Kumar
 wrote:
> You can setup hadoop cluster on mixed environment. We have a cluster with
> Mac, Linux and Solaris.
>
> Regards
> Senthil
>
> On Fri, Jul 6, 2012 at 1:50 PM, Yongwei Xing  wrote:
>
>> I have one MBP with 10.7.4 and one laptop with Ubuntu 12.04. Is it possible
>> to set up a hadoop cluster by such mixed environment?
>>
>> Best Regards,
>>
>> --
>> Welcome to my ET Blog http://www.jdxyw.com
>>



-- 
Harsh J


Re: Binary Files With No Record Begin and End

2012-07-05 Thread Harsh J
I am assuming you've already implemented a custom record reader for
your file and are now only thinking of how to handle record
boundaries. If so, please read the map section here:
http://wiki.apache.org/hadoop/HadoopMapReduce, which explains how MR
does it for Text files, which are read until a \n point. In your case,
you ought to divide the file, based on its length, into splits whose
sizes are multiples of 180 bytes. Then, knowing each split's offset and
length, you can determine the start and end points of every record;
that is what Kai was getting at.

For example, let's assume a block size of 64 MB (67108864 bytes).
Let's also assume each record, right from the start of your file, is
always exactly 180 bytes. Then, if we make every split's length a
multiple of 180, your problem immediately goes away.

So your InputFormat#getSplits() can perhaps do the following:

1. Split the file by block size first. Let's assume we have two full
64 MB cuts and one tail cut of 52 MB (totaling a file size of 180 MB,
assuming we have 1024*1024 records in it). Then we can tweak the
FileSplits to instead end at proper multiples of 180 (use the modulo
operator to find good boundaries):

First FileSplit - starts at 0 and ends at 67109040, so that it holds
372828 full records. This is 176 bytes past the first HDFS block
boundary.
Second FileSplit - starts, obviously, at 67109040 and ends at
134217900, which is 172 bytes past the second block's 64 MB boundary.
This again contains exactly 372827 records.
Last FileSplit - starts at 134217900 and runs to EOF, automatically
consisting of all the remaining whole records.

Does this make sense?
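
If it helps, here is a minimal, untested sketch (new "mapreduce" API)
of the idea; the class and constant names are made up, the null hosts
argument drops locality hints (a real implementation would fill them
in), and createRecordReader() is left to the 180-byte record reader you
already have:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public abstract class FixedRecordInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  private static final long RECORD_LEN = 180L;

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
      Path path = file.getPath();
      // Round the block size up to the next multiple of the record length,
      // so no record ever straddles two splits.
      long splitSize =
          ((file.getBlockSize() + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
      long remaining = file.getLen();
      long offset = 0;
      while (remaining > splitSize) {
        splits.add(new FileSplit(path, offset, splitSize, null));
        offset += splitSize;
        remaining -= splitSize;
      }
      if (remaining > 0) {
        splits.add(new FileSplit(path, offset, remaining, null));
      }
    }
    return splits;
  }
}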

On Fri, Jul 6, 2012 at 1:31 AM, MJ Sam  wrote:
> By Block Size, you mean the HDFS block size or split size or my record
> size? The problem is that given a split to my mapper, how do make my
> record reader to find where my record start in the given split stream
> to the mapper when there is no record start tag? Would you please
> explain more with what you mean?
>
> On Thu, Jul 5, 2012 at 11:57 AM, Kai Voigt  wrote:
>> Hi,
>>
>> if you know the block size, you can calculate the offsets for your records. 
>> And write a custom record reader class to seek into your records.
>>
>> Kai
>>
>> Am 05.07.2012 um 22:54 schrieb MJ Sam:
>>
>>> Hi,
>>>
>>> The input of my map reduce is a binary file with no record begin and
>>> end marker. The only thing is that each record is a fixed 180bytes
>>> size in the binary file. How do I make Hadoop to properly find the
>>> record in the splits when a record overlap two splits. I was thinking
>>> to make the splits size to be a multiple of 180 but was wondering if
>>> there is anything else that I can do?  Please note that my files are
>>> not sequence file and just a custom binary file.
>>>
>>
>> --
>> Kai Voigt
>> k...@123.org
>>
>>
>>
>>



-- 
Harsh J


Re: Compression and Decompression

2012-07-05 Thread Harsh J
Mohit,

HDFS as a service/server does not yet do compression/decompression. It
is indeed the client that does it each time.
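
For example, a bare-bones client that does roughly what "hadoop fs
-text" does for a compressed file (the codec is picked from the file
extension; the path is a placeholder, and this is an untested sketch):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CatCompressed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("/data/file.gz");              // placeholder path
    FileSystem fs = p.getFileSystem(conf);
    // The codec is resolved on the client from the file extension...
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
    // ...and the bytes are decompressed locally as they are read.
    InputStream in = codec.createInputStream(fs.open(p));
    IOUtils.copyBytes(in, System.out, conf, true);
  }
}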

On Fri, Jul 6, 2012 at 1:53 AM, Mohit Anchlia  wrote:
> Is the compression done on the client side or on the server side? If I run
> hadoop fs -text then is this client decompressing the file for me?



-- 
Harsh J


  1   2   3   4   5   6   7   8   9   10   >