Hadoop Streaming - how to specify mapper scripts hosting on HDFS

2013-03-28 Thread praveenesh kumar
Hi,

I am trying to run a Hadoop Streaming job where I want to specify a mapper
script that resides on HDFS. Currently it only tries to locate the script
on the local FS. Is there an option available through which I can tell
Hadoop Streaming to look for the mapper script on HDFS instead of the
local FS?
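For what it's worth, a minimal sketch of one way to do this, assuming your
Hadoop version's generic -files option accepts hdfs:// URIs (the streaming
jar location, namenode address, script name and data paths below are all
placeholders):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -files hdfs://namenode:8020/user/praveenesh/scripts/mapper.py#mapper.py \
        -mapper mapper.py \
        -reducer /bin/cat \
        -input /data/input \
        -output /data/output

The generic -files option places the file in the distributed cache and
symlinks it into each task's working directory under the name given after
the # fragment, whereas the streaming-specific -file option only ships
scripts from the local FS.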

Regards
Praveenesh


Oozie dynamic action

2013-09-17 Thread praveenesh kumar
Hi,

I have a scenario in which I want to trigger a Hive upload script every
day. A set of folders is created for a set of customer ids every day. My
Hive script reads the customer id from the path, checks whether the table
for that customer id exists, creates the table if it does not, and creates
a date-based partition for a "set of unknown customer_ids".

I can get the set of unique customer_ids from a shell action. It can be
passed as a list or string.

My problem is how to achieve this dynamic checking/creation of Hive
tables and partitions from Oozie.

Currently I am doing everything from the shell script and calling it as a
shell action in Oozie, but I was wondering whether these kinds of checks,
or some kind of for-loop action, can be done in Oozie itself.

Any thoughts/suggestions on the best way to tackle the above scenario
using Oozie would be highly helpful.
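For reference, a rough sketch of how the shell action's output could be
captured and handed to a Hive action in an Oozie workflow. The action
names, script names and the customer_ids property key are placeholders,
and the shell script is assumed to print a line such as
customer_ids=id1,id2,... to stdout:

    <action name="get-ids">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>get_customer_ids.sh</exec>
            <file>get_customer_ids.sh</file>
            <capture-output/>
        </shell>
        <ok to="load-hive"/>
        <error to="fail"/>
    </action>

    <action name="load-hive">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>create_and_load.hql</script>
            <param>CUSTOMER_IDS=${wf:actionData('get-ids')['customer_ids']}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

Oozie has no native for-loop, so the per-customer iteration still has to
live inside the shell or Hive script (or a sub-workflow); capture-output
is mainly a way to pass the computed id list between actions.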

Regards
Praveenesh


Re: MRBench Maps strange behaviour

2012-08-29 Thread praveenesh kumar
Then the question arises how MRBench uses the parameters. According to the
mail he sent, he is running MRBench with the following parameters:

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10 -reduces 10

I guess he is assuming that MRBench will launch 10 mappers and 10 reducers.
But he is getting different results, which are visible in the counters,
and we can use all our map and input-split logic to justify the counter
outputs.

The question here is: how should we use MRBench, and what does it provide?
How can we control it with different parameters to do some benchmarking?
Can someone explain how to use MRBench and what it exactly does?

Regards,
Praveenesh

On Wed, Aug 29, 2012 at 3:31 AM, Hemanth Yamijala wrote:

> I assume you are asking about the exact number of maps launched.
> If yes, then the output of the MRBench run is printing the counter
> "Launched map tasks". That is the exact value of maps launched.
>
> Thanks
> Hemanth
>
> On Wed, Aug 29, 2012 at 1:14 PM, Gaurav Dasgupta 
> wrote:
> > Hi Hemanth,
> >
> > Thanks for the reply.
> > Can you tell me how can I calculate or ensure from the counters what
> should
> > be the exact number of Maps?
> > Thanks,
> > Gaurav Dasgupta
> > On Wed, Aug 29, 2012 at 11:26 AM, Hemanth Yamijala 
> > wrote:
> >>
> >> Hi,
> >>
> >> The number of maps specified to any map reduce program (including
> >> those part of MRBench) is generally only a hint, and the actual number
> >> of maps will be influenced in typical cases by the amount of data
> >> being processed. You can take a look at this wiki link to understand
> >> more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> >>
> >> In the examples below, since the data you've generated is different,
> >> the number of mappers are different. To be able to judge your
> >> benchmark results, you'd need to benchmark against the same data (or
> >> at least the same type of data - i.e. size and type).
> >>
> >> The number of maps printed at the end is straight from the input
> >> specified and doesn't reflect what the job actually ran with. The
> >> information from the counters is the right one.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Tue, Aug 28, 2012 at 4:02 PM, Gaurav Dasgupta 
> >> wrote:
> >> > Hi All,
> >> >
> >> > I executed the "MRBench" program from "hadoop-test.jar" in my 12 node
> >> > CDH3
> >> > cluster. After executing, I had some strange observations regarding
> the
> >> > number of Maps it ran.
> >> >
> >> > First I ran the command:
> >> > hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 3
> -maps
> >> > 200
> >> > -reduces 200 -inputLines 1024 -inputType random
> >> > And I could see that the actual number of Maps it ran was 201 (for all
> >> > the 3
> >> > runs) instead of 200 (Though the end report displays the launched to
> be
> >> > 200). Here is the console report:
> >> >
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Job complete:
> >> > job_201208230144_0035
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Counters: 28
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient:   Job Counters
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Launched reduce tasks=200
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=617209
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all
> >> > reduces
> >> > waiting after reserving slots (ms)=0
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all
> >> > maps
> >> > waiting after reserving slots (ms)=0
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Rack-local map tasks=137
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Launched map tasks=201
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient: Data-local map tasks=64
> >> >
> >> > 12/08/28 04:34:35 INFO mapred.JobClient:
> >> > SLOTS_MILLIS_REDUCES=1756882
> >> >
> >> >
> >> >
> >> > Again, I ran the MRBench for just 10 Maps and 10 Reduces:
> >> >
> >> > hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10
> >> > -reduces 10
> >> >
> >> >
> >> >
> >> > This time the actual number of Maps was only 2, and again the end
> >> > report displays Maps Launched to be 10. The console output:
> >> >
> >> >
> >> >
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Job complete:
> >> > job_201208230144_0040
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Counters: 27
> >> > 12/08/28 05:05:35 INFO mapred.JobClient:   Job Counters
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Launched reduce tasks=20
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6648
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all
> >> > reduces
> >> > waiting after reserving slots (ms)=0
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all
> >> > maps
> >> > waiting after reserving slots (ms)=0
> >> > 12/08/28 05:05:35 INFO mapred.JobClient: Launched map tasks=2
> >>

Re: reducer not starting

2012-11-21 Thread praveenesh kumar
Sometimes it's a network issue: the reducers are not able to resolve the
hostnames or IPs of the other machines. Make sure your /etc/hosts entries
and hostnames are correct.

Regards,
Praveenesh

On Tue, Nov 20, 2012 at 10:46 PM, Harsh J  wrote:

> Your mappers are failing (possibly a user-side error or an
> environmental one) and are being reattempted by the framework (default
> behavior, attempts 4 times to avoid transient failure scenario).
>
> Visit your job's logs in the JobTracker web UI, to find more
> information on why your tasks fail.
>
> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha 
> wrote:
> >
> >
> >
> > I am not sure whats happening, but I wrote a simple mapper and reducer
> > script.
> >
> >
> >
> > And I am testing it against a small dataset (like few lines long).
> >
> >
> >
> > For some reason reducer is just not starting.. and mapper is executing
> again
> > and again?
> >
> >
> >
> > 12/11/20 09:21:18 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:22:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:22:10 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> > 12/11/20 09:32:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:32:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:32:31 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> > 12/11/20 09:42:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:42:31 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:42:32 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:42:50 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> >
> >
> >
> >
> > Let me know if you want the code also.
> >
> > Any clues of where I am going wrong?
> >
> > Thanks
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Harsh J
>


Hadoop on Tomcat issue ??

2012-11-23 Thread praveenesh kumar
Hello users,

I am trying to create a Hadoop WAR file which will be deployed on Tomcat.
My code runs perfectly fine as a JAR, but when I deploy it as a WAR on
Tomcat I get the following issue. From the logs, it says it couldn't find
my Map and Reduce classes, but I already made a JAR of my code and put it
inside the WEB-INF/lib folder. All the Hadoop JARs are also present in
WEB-INF/lib. Can someone let me know what the issue could be?

java.lang.RuntimeException: Error in configuring object

   at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)

   at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)

   at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)

   at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)

   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

   at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

   at java.security.AccessController.doPrivileged(Native Method)

   at javax.security.auth.Subject.doAs(Subject.java:396)

   at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

   at org.apache.hadoop.mapred.Child.main(Child.java:264)

Caused by: java.lang.reflect.InvocationTargetException

   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)

   at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)

   ... 9 more

Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: com.ta.wh.Map

   at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1028)

   at
org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:968)

   at
org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)

   ... 14 more

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
com.ta.wh.Map

   at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)

   at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1020)

   ... 16 more

Caused by: java.lang.ClassNotFoundException: com.ta.wh.Map

   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)

   at java.security.AccessController.doPrivileged(Native Method)

   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

   at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

   at java.lang.Class.forName0(Native Method)

   at java.lang.Class.forName(Class.java:247)

   at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:943)

   at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:994)

   ... 17 more
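One common cause of this kind of ClassNotFoundException when submitting
from a web container is that the job JAR containing the mapper/reducer
classes is never shipped to the TaskTrackers, because Hadoop cannot infer
it from the servlet classpath. A minimal sketch of pointing the JobConf at
the right JAR explicitly; the class name and paths are placeholders, not
the actual code from this thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WarJobDriver {
        public static void submit() throws Exception {
            JobConf conf = new JobConf(WarJobDriver.class);
            // Ship the JAR that actually contains the Map/Reduce classes.
            // Either infer it from a class that lives inside that JAR ...
            conf.setJarByClass(WarJobDriver.class);
            // ... or point at it explicitly inside the exploded WAR (hypothetical path):
            // conf.setJar("/opt/tomcat/webapps/myapp/WEB-INF/lib/myjob.jar");
            conf.setJobName("war-submitted-job");
            FileInputFormat.setInputPaths(conf, new Path("/data/in"));
            FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
            JobClient.runJob(conf);
        }
    }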


Re: DistributedCache deprecated

2014-01-29 Thread praveenesh kumar
I think you can use the Job class.
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html

Regards
Prav


On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael <
michael.giord...@vistronix.com> wrote:

>  I noticed that in Hadoop 2.2.0
> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>
>
>
> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>
>
>
> Is there a class that provides equivalent functionality? My application
> relies heavily on DistributedCache.
>
>
>
> Thanks,
>
> Mike G.
>
> This communication, along with its attachments, is considered confidential
> and proprietary to Vistronix.  It is intended only for the use of the
> person(s) named above.  Note that unauthorized disclosure or distribution
> of information not generally known to the public is strictly
> prohibited.  If you are not the intended recipient, please notify the
> sender immediately.
>


Re: DistributedCache deprecated

2014-01-29 Thread praveenesh kumar
@Jay - I don't know exactly how the Job class replaces the DistributedCache
class, but I remember trying distributed cache functions like:

   void addArchiveToClassPath(Path archive)  -  Add an archive path to the current set of classpath entries.
   void addCacheArchive(URI uri)  -  Add an archive to be localized.
   void addCacheFile(URI uri)  -  Add a file to be localized.

(see http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html)

and they work fine, the same way you were using DC before. Well, I am not
sure what the best answer would be, but if you are trying to use DC, I was
able to do it with the Job class itself.
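For example, a minimal sketch of wiring files into the cache through the
Job class (the HDFS paths and symlink name are placeholders):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheSetup {
        public static Job build(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf);
            // File to be localized on each task node (symlinked as "lookup").
            job.addCacheFile(new URI("hdfs:///user/prav/ref/lookup.txt#lookup"));
            // Archive whose contents should also go on the task classpath.
            job.addArchiveToClassPath(new Path("/user/prav/libs/extra-libs.jar"));
            return job;
        }
    }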

Regards
Prav


On Wed, Jan 29, 2014 at 9:27 PM, Jay Vyas  wrote:

> Thanks for asking this : Im not sure and didnt realize this until you
> mentioned it!
>
> 1) Prav:  You are implying that we would use the "Job" Class... but how
> could it replace the DC?
>
> 2) The point of the DC is to replicate a file so that its present and
> local on ALL nodes.   I didnt know it was deprecated, but there must be
> some replacement for it - or maybe it just got renamed and moved?
>
> SO ... what is the future of the DistributedCache for mapreduce jobs?
>
>
> On Wed, Jan 29, 2014 at 4:22 PM, praveenesh kumar wrote:
>
>> I think you can use the Job class.
>>
>> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>>
>> Regards
>> Prav
>>
>>
>> On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael <
>> michael.giord...@vistronix.com> wrote:
>>
>>>  I noticed that in Hadoop 2.2.0
>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>
>>>
>>>
>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>
>>>
>>>
>>> Is there a class that provides equivalent functionality? My application
>>> relies heavily on DistributedCache.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mike G.
>>>
>>> This communication, along with its attachments, is considered
>>> confidential and proprietary to Vistronix.  It is intended only for the use
>>> of the person(s) named above.  Note that unauthorized disclosure or
>>> distribution of information not generally known to the public is strictly
>>> prohibited.  If you are not the intended recipient, please notify the
>>> sender immediately.
>>>
>>
>>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>


Re: DistributedCache deprecated

2014-01-29 Thread praveenesh kumar
@Jay - Also, if you look at the DistributedCache class, you will see its
methods have been added to the Job class. I am guessing they have kept the
functionality the same and just merged the DistributedCache class into the
Job class itself, giving developers the same methods with fewer classes to
worry about, thus simplifying the API. I hope that makes sense.

Regards
Prav


On Wed, Jan 29, 2014 at 9:41 PM, praveenesh kumar wrote:

> @Jay - I don't know how Job class is replacing the DistributedCache class
> , but I remember trying distributed cache functions like
>
>void *addArchiveToClassPath
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addArchiveToClassPath%28org.apache.hadoop.fs.Path%29>*
> (Path<http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/Path.html>
>  archive)
>   Add an archive path to the current set of classpath entries.
>  void *addCacheArchive
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addCacheArchive%28java.net.URI%29>*
> (URI<http://download.oracle.com/javase/6/docs/api/java/net/URI.html?is-external=true>
>  uri)
>   Add a archives to be localized   void *addCacheFile
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29>*
> (URI<http://download.oracle.com/javase/6/docs/api/java/net/URI.html?is-external=true>
>  uri)
>   Add a file to be localized
>
> and it works fine. The same way you were using DC before.. Well I am not
> sure what would be the best answer, but if you are trying to use DC , I was
> able to do it with Job class itself.
>
> Regards
> Prav
>
>
> On Wed, Jan 29, 2014 at 9:27 PM, Jay Vyas  wrote:
>
>> Thanks for asking this : Im not sure and didnt realize this until you
>> mentioned it!
>>
>> 1) Prav:  You are implying that we would use the "Job" Class... but how
>> could it replace the DC?
>>
>> 2) The point of the DC is to replicate a file so that its present and
>> local on ALL nodes.   I didnt know it was deprecated, but there must be
>> some replacement for it - or maybe it just got renamed and moved?
>>
>> SO ... what is the future of the DistributedCache for mapreduce jobs?
>>
>>
>> On Wed, Jan 29, 2014 at 4:22 PM, praveenesh kumar 
>> wrote:
>>
>>> I think you can use the Job class.
>>>
>>> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>>>
>>> Regards
>>> Prav
>>>
>>>
>>> On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael <
>>> michael.giord...@vistronix.com> wrote:
>>>
>>>>  I noticed that in Hadoop 2.2.0
>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>>
>>>>
>>>>
>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>
>>>>
>>>>
>>>> Is there a class that provides equivalent functionality? My application
>>>> relies heavily on DistributedCache.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mike G.
>>>>
>>>> This communication, along with its attachments, is considered
>>>> confidential and proprietary to Vistronix.  It is intended only for the use
>>>> of the person(s) named above.  Note that unauthorized disclosure or
>>>> distribution of information not generally known to the public is strictly
>>>> prohibited.  If you are not the intended recipient, please notify the
>>>> sender immediately.
>>>>
>>>
>>>
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>
>


Re: DistributedCache deprecated

2014-01-29 Thread praveenesh kumar
Hi Mike,

I tried the getInstance() method of the Job class and it worked for me. I
guess they have made it a factory now. Sorry, I have also just been
experimenting with this, so I don't have the exact answers.

   static Job getInstance()  -  Creates a new Job with no particular Cluster.

(see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html)
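A minimal sketch of creating a job through the factory method in place of
the deprecated constructors (the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobFactoryExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Factory method replacing the deprecated Job(...) constructors.
            Job job = Job.getInstance(conf, "my-sample-job");
            System.out.println("Created job: " + job.getJobName());
        }
    }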


Regards
Prav


On Wed, Jan 29, 2014 at 10:53 PM, Giordano, Michael <
michael.giord...@vistronix.com> wrote:

>  Prav,
>
>
>
> Thank you for the prompt answer. I see the methods on the job class and
> this does make sense.
>
>
>
> Unfortunately something else has me confused. It seems as though all of
> the Job() constructors have also been marked deprecated.
>
>
>
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html
>
>
>
> How do you create a new Job instance? Is there a factory class?
>
>
>
> Thanks,
>
> Mike G.
>
>  --
> *From:* praveenesh kumar 
> *Sent:* Wednesday, January 29, 2014 4:41 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: DistributedCache deprecated
>
>@Jay - I don't know how Job class is replacing the DistributedCache
> class , but I remember trying distributed cache functions like
>
>void *addArchiveToClassPath
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addArchiveToClassPath%28org.apache.hadoop.fs.Path%29>*
> (Path<http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/Path.html>
>  archive)
>   Add an archive path to the current set of classpath entries.
>  void *addCacheArchive
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addCacheArchive%28java.net.URI%29>*
> (URI<http://download.oracle.com/javase/6/docs/api/java/net/URI.html?is-external=true>
>  uri)
>   Add a archives to be localized   void *addCacheFile
> <http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29>*
> (URI<http://download.oracle.com/javase/6/docs/api/java/net/URI.html?is-external=true>
>  uri)
>   Add a file to be localized
>
>  and it works fine. The same way you were using DC before.. Well I am not
> sure what would be the best answer, but if you are trying to use DC , I was
> able to do it with Job class itself.
>
>  Regards
>  Prav
>
>
> On Wed, Jan 29, 2014 at 9:27 PM, Jay Vyas  wrote:
>
>>  Thanks for asking this : Im not sure and didnt realize this until you
>> mentioned it!
>>
>> 1) Prav:  You are implying that we would use the "Job" Class... but how
>> could it replace the DC?
>>
>> 2) The point of the DC is to replicate a file so that its present and
>> local on ALL nodes.   I didnt know it was deprecated, but there must be
>> some replacement for it - or maybe it just got renamed and moved?
>>
>>  SO ... what is the future of the DistributedCache for mapreduce jobs?
>>
>>
>> On Wed, Jan 29, 2014 at 4:22 PM, praveenesh kumar 
>> wrote:
>>
>>>  I think you can use the Job class.
>>>
>>> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>>>
>>>  Regards
>>>  Prav
>>>
>>>
>>> On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael <
>>> michael.giord...@vistronix.com> wrote:
>>>
>>>>  I noticed that in Hadoop 2.2.0
>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>>
>>>>
>>>>
>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>
>>>>
>>>>
>>>> Is there a class that provides equivalent functionality? My application
>>>> relies heavily on DistributedCache.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mike G.
>>>>
>>>> This communication, along with its attachments, is considered
>>>> confidential and proprietary to Vistronix.  It is intended only for the use
>>>> of the person(s) named above.  Note that unauthorized disclosure or
>>>> distribution of information not generally known to the public is strictly
>>>> prohibited.  If you are not the intended recipient, please notify the
>>>> sender immediately.
>>>>
>>>
>>>
>>
>>
>>  --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>
>


Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit,

I am not sure how they are linked with DistributedCache. Job configuration
does not upload any data into memory. As far as I am aware of how
DistributedCache works, nothing gets loaded into memory: the distributed
cache just copies the files onto the slave nodes so that they are
accessible to the mappers/reducers. Usually the location is
${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
distribution to distribution). You always have to read the files in your
mapper or reducer whenever you want to use them.
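For illustration, a minimal sketch of reading such a cached file in a
new-API mapper, assuming the file was added with
job.addCacheFile(new URI(".../lookup.txt#lookup")) so that it is symlinked
as "lookup" in the task's working directory (all names and the tab-separated
file format are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // "lookup" is the symlink created for the cached file in the task's cwd.
            BufferedReader reader = new BufferedReader(new FileReader("lookup"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String match = lookup.get(value.toString().trim());
            if (match != null) {
                context.write(value, new Text(match));
            }
        }
    }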

What has happened is that the methods of the DistributedCache class have
now been added to the Job class, and I am assuming they have not changed
how the distributed cache methods work, otherwise there would have been
some nice articles on that, and I don't see any reason for changing that
either. So everything still works the same way; it's just that you use the
new Job class to access the distributed cache features.

I am not sure which entries you are pointing to exactly. Am I missing
anything here?


Regards
Prav


On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal  wrote:

> Hi Mike & Prav,
>
> Although I am new to Hadoop, but would like to add my 2 cents if that
> helps.
> We are having 2 ways for distribution of shared data, one is using Job
> configuration and other is DistributedCache.
> As job configuration is read by the JT, TT and child JVMs, and each time
> the configuration is read, all of its entries are read in memory, even if
> they are not used. So using job configuration is not advised if the data is
> more than few kilobytes. So it is not alternative to DistributedCache
> unless some modifications are done in Job configuration to address this
> limitation.
> So I am also curious to know the alternatative to DistributedCache class.
>
> Thanks
> Amit
>
>
>
> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
> michael.giord...@vistronix.com> wrote:
>
>>  I noticed that in Hadoop 2.2.0
>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>
>>
>>
>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>
>>
>>
>> Is there a class that provides equivalent functionality? My application
>> relies heavily on DistributedCache.
>>
>>
>>
>> Thanks,
>>
>> Mike G.
>>
>> This communication, along with its attachments, is considered
>> confidential and proprietary to Vistronix.  It is intended only for the use
>> of the person(s) named above.  Note that unauthorized disclosure or
>> distribution of information not generally known to the public is strictly
>> prohibited.  If you are not the intended recipient, please notify the
>> sender immediately.
>>
>
>


Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit,

Side data distribution is a different concept altogether. It is when you
set custom (key, value) pairs on the job configuration, through the Job
object, so that you can use them in your mappers/reducers. It is good for
passing some small piece of information to your mappers/reducers, like
extra command-line arguments that they require. We were not discussing
side data distribution at all.

The question was: now that DistributedCache is deprecated, where can we
find the equivalent methods that DistributedCache used to deliver?
If you look at the DistributedCache class in MR v1 -
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html

and compare it with the Job class in MR v2 -
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html

you will see that the methods of the DistributedCache class have been
added to the Job class. Since DistributedCache is deprecated, my guess was
that we can use the Job class for the distributed cache, through the same
methods which DistributedCache used to provide.
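As a minimal side-by-side sketch (the HDFS path is a placeholder):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheApiComparison {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            URI uri = new URI("hdfs:///user/prav/ref/lookup.txt");

            // MR v1 style, now deprecated:
            DistributedCache.addCacheFile(uri, conf);

            // MR v2 style, same effect, through the Job class:
            Job job = Job.getInstance(conf);
            job.addCacheFile(uri);
        }
    }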

Everything else is the same; it's just that you use the Job class to set
your files for the distributed cache inside your job configuration. I am
sorry, I don't have any nice article on this: as I said, I did this as
part of my own experiments and I was able to use it without any issues,
so that's why I suggested it.

Since most developers are still using MRv1 on Hadoop 2.0, these changes
have not come into the spotlight so far. I am hoping some new
documentation on how to use MRv2 will come soon, but if you understand
MRv1, I don't see any reason why you can't just move around the API a bit
and find the relevant classes you want to use by yourself. Again, as I
said, I don't have any authoritative references for what I am saying;
these are just the results of my own experiments, which you are most
welcome to conduct and play with. Happy coding!

Regards
Prav




On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal  wrote:

> Hi Prav,
>
> Yes, you are correct that DistributedCache does not upload file into
> memory. Also using job configuration and DistributedCache are 2 different
> approaches. I am referring based on "Hadoop: The definitive guide"
> Chapter:8 > Side Data Distribution (Page 288-295).
> As you are saying that now methods of DistributedCache moved to Job, I
> request if you please share some article or document on that for my better
> understanding, it will be great help.
>
> Thanks
> Amit
>
>
> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar wrote:
>
>> Hi Amit,
>>
>> I am not sure how are they linked with DistributedCache.. Job
>> configuration is not uploading any data in memory.. As far as I am aware of
>> how DistributedCache works, nothing get loaded in memory. Distributed cache
>> just copies the files into slave nodes, so that they are accessible to
>> mappers/reducers. Usually the location is
>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (depends from
>> distribution to distribution) You always have to read the files in your
>> mapper or reducer when ever you want to use them.
>>
>> What has happened is the method of DistributedCache class has now been
>> added to Job class, and I am assuming they won't change the functionality
>> of how distributed cache methods used to work, otherwise there would have
>> been some nice articles on that, plus I don't see any reason of changing
>> that as well too..  so everything works still the same way.. Its just that
>> you use the new Job class to use distributed cache features.
>>
>> I am not sure what entries you are exactly pointing to. Am I missing
>> anything here ?
>>
>>
>> Regards
>> Prav
>>
>>
>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal wrote:
>>
>>> Hi Mike & Prav,
>>>
>>> Although I am new to Hadoop, but would like to add my 2 cents if that
>>> helps.
>>> We are having 2 ways for distribution of shared data, one is using Job
>>> configuration and other is DistributedCache.
>>> As job configuration is read by the JT, TT and child JVMs, and each time
>>> the configuration is read, all of its entries are read in memory, even if
>>> they are not used. So using job configuration is not advised if the data is
>>> more than few kilobytes. So it is not alternative to DistributedCache
>>> unless some modifications are done in Job configuration to address this
>>> limitation.
>>> So I am also curious to know the alternatative to DistributedCache class.
>>>
>>> Thanks
>>> Amit
>>>
>>>
>>>
>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>> michae

Re: HDFS copyToLocal and get crc option

2014-01-31 Thread praveenesh kumar
Hi Tom,

My hint is that your block size should be a multiple of the CRC chunk
size. Check your dfs.block.size property, convert it into bytes, then
divide it by the checksum chunk size that is set; usually the
dfs.bytes-per-checksum property (io.bytes.per.checksum in older releases)
holds this value, or you can get it from the error message you are seeing.
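For example, with the common defaults dfs.block.size = 67108864 (64 MB)
and bytes-per-checksum = 512, the division 67108864 / 512 = 131072 leaves
no remainder, so that block size is acceptable; a block size such as
67108000 would not be, because it is not an exact multiple of 512.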

HDFS uses this checksum to make sure the data doesn't get corrupted in
transit (due to lost bytes etc.).

I hope setting your block size to a multiple of your CRC chunk size will
solve your problem.

Regards
Prav


On Fri, Jan 31, 2014 at 4:30 PM, Tom Brown  wrote:

> What is the right way to use the "-crc" option with hadoop dfs
> -copyToLocal?
>
> Is this the wrong list?
>
> --Tom
>
>
> On Tue, Jan 28, 2014 at 11:53 AM, Tom Brown  wrote:
>
>> I am archiving a large amount of data out of my HDFS file system to a
>> separate shared storage solution (There is not much HDFS space left in my
>> cluster, and upgrading it is not an option right now).
>>
>> I understand that HDFS internally manages checksums and won't succeed if
>> the data doesn't match the CRC, so I'm not worried about corruption when
>> reading from HDFS.
>>
>> However, I want to store the HDFS crc calculations alongside the data
>> files after exporting them. I thought the "hadoop dfs -copyToLocal -crc
>>  " command would work, but it always gives me the
>> error "-crc option is not valid when source file system does not have crc
>> files"
>>
>> Can someone explain what exactly that option does, and when (if ever) it
>> should be used?
>>
>> Thanks in advance!
>>
>> --Tom
>>
>
>


Re: HDFS multi-tenancy and federation

2014-02-05 Thread praveenesh kumar
Hi Shani,

I haven't done any implementation work with HDFS federation, but as far as
I know, one namenode can handle only one namespace at this time. I hope
that helps.

Regards
Prav


On Wed, Feb 5, 2014 at 8:05 AM, Shani Ranasinghe wrote:

> Hi,
>
> Any help on this please?
>
>
>
> On Mon, Feb 3, 2014 at 12:14 PM, Shani Ranasinghe wrote:
>
>>
>> Hi,
>> I would like to know the following.
>>
>> 1) Can there be multiple namespaces in a single namenode? is it
>> recommended?  (I'm having a multi-tenant environment in mind)
>>
>> 2) Let's say I have a federated namespace/namenodes. There are two
>> namenodes A /namespace A1 and namenode B/namespace B1, and have 3
>> datanodes. Can someone from namespace A1,  access the datanode's data in
>> anyway (hacking) belonging to namespace B1. If not how is it handled?
>>
>> After going through a lot  of reference, my understanding on HDFS
>> multi-tenancy and federation is that for multi-tenancy what we could do is
>> use file/folder permissions (u,g,o) and ACL's. Or we could dedicate a
>> namespace per tenant. The issue here is that a namenode (active namenode,
>> passive namenode and secondary namenode) has to be assigned per tenant.  Is
>> there any other way that multi tenancy can be achieved?
>>
>> On federation, let's say I have a namenode for /marketing and another for
>> /finance. Lets say that marketing bears the most load. How can we load
>> balance this? is it possible?
>>
>> Appreciate any help on this.
>>
>> Regards,
>> Shani.
>>
>>
>>
>>
>


java.lang.OutOfMemoryError: Java heap space

2014-02-06 Thread praveenesh kumar
Hi all,

I am running a Pig script which runs fine on small data, but when I scale
the data up I get the following error at my map stage. Please refer to the
map logs below.

My Pig script does a group-by first, followed by a join on the grouped
data.

Any clues on where I should look or how I should deal with this situation?
I don't want to just keep increasing the heap space. My map JVM heap space
is already 3 GB with io.sort.mb = 768 MB.

2014-02-06 19:15:12,243 WARN org.apache.hadoop.util.NativeCodeLoader:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable 2014-02-06 19:15:15,025 INFO
org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2014-02-06 19:15:15,123 INFO org.apache.hadoop.mapred.Task: Using
ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2bd9e282 2014-02-06
19:15:15,546 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 768
2014-02-06 19:15:19,846 INFO org.apache.hadoop.mapred.MapTask: data buffer
= 612032832/644245088 2014-02-06 19:15:19,846 INFO
org.apache.hadoop.mapred.MapTask: record buffer = 9563013/10066330
2014-02-06 19:15:20,037 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor 2014-02-06 19:15:21,083 INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
Created input record counter: Input records from _1_tmp1327641329
2014-02-06 19:15:52,894 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: buffer full= true 2014-02-06 19:15:52,895 INFO
org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 611949600; bufvoid
= 644245088 2014-02-06 19:15:52,895 INFO org.apache.hadoop.mapred.MapTask:
kvstart = 0; kvend = 576; length = 10066330 2014-02-06 19:16:06,182 INFO
org.apache.hadoop.mapred.MapTask: Finished spill 0 2014-02-06 19:16:16,169
INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
call - Collection threshold init = 328728576(321024K) used =
1175055104(1147514K) committed = 1770848256(1729344K) max =
2097152000(2048000K) 2014-02-06 19:16:20,446 INFO
org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
308540402 bytes from 1 objects. init = 328728576(321024K) used =
1175055104(1147514K) committed = 1770848256(1729344K) max =
2097152000(2048000K) 2014-02-06 19:17:22,246 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
Usage threshold init = 328728576(321024K) used = 1768466512(1727018K)
committed = 1770848256(1729344K) max = 2097152000(2048000K) 2014-02-06
19:17:35,597 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled
an estimate of 1073462600 bytes from 1 objects. init = 328728576(321024K)
used = 1768466512(1727018K) committed = 1770848256(1729344K) max =
2097152000(2048000K) 2014-02-06 19:18:01,276 INFO
org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2014-02-06 19:18:01,288 INFO org.apache.hadoop.mapred.MapTask: bufstart =
611949600; bufend = 52332788; bufvoid = 644245088 2014-02-06 19:18:01,288
INFO org.apache.hadoop.mapred.MapTask: kvstart = 576; kvend = 777; length =
10066330 2014-02-06 19:18:03,377 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 1 2014-02-06 19:18:05,494 INFO
org.apache.hadoop.mapred.MapTask: Record too large for in-memory buffer:
644246693 bytes 2014-02-06 19:18:36,008 INFO
org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
306271368 bytes from 1 objects. init = 328728576(321024K) used =
1449267128(1415299K) committed = 2097152000(2048000K) max =
2097152000(2048000K) 2014-02-06 19:18:44,448 INFO
org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
with mapRetainSize=-1 and reduceRetainSize=-1 2014-02-06 19:18:44,780 FATAL
org.apache.hadoop.mapred.Child: Error running child :
java.lang.OutOfMemoryError: Java heap space at
java.util.Arrays.copyOf(Arrays.java:2786) at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at
java.io.DataOutputStream.write(DataOutputStream.java:90) at
java.io.DataOutputStream.writeUTF(DataOutputStream.java:384) at
java.io.DataOutputStream.writeUTF(DataOutputStream.java:306) at
org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:454) at
org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542) at
org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:523) at
org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:361) at
org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:542) at
org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357) at
org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:57) at
org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)

Re: java.lang.OutOfMemoryError: Java heap space

2014-02-06 Thread praveenesh kumar
It's a normal join. I can't use a replicated join, as the data is very large.

Regards
Prav


On Thu, Feb 6, 2014 at 7:52 PM, abhishek  wrote:

> Hi Praveenesh,
>
> Did you use "replicated join" in your pig script or is it a regular join ??
>
> Regards
> Abhishek
>
> Sent from my iPhone
>
> > On Feb 6, 2014, at 11:25 AM, praveenesh kumar 
> wrote:
> >
> > Hi all,
> >
> > I am running a Pig Script which is running fine for small data. But when
> I
> > scale the data, I am getting the following error at my map stage.
> > Please refer to the map logs as below.
> >
> > My Pig script is doing a group by first, followed by a join on the
> grouped
> > data.
> >
> >
> > Any clues to understand where I should look at or how shall I deal with
> > this situation. I don't want to just go by just increasing the heap
> space.
> > My map jvm heap space is already 3 GB with io.sort.mb = 768 MB.
> >
> > 2014-02-06 19:15:12,243 WARN org.apache.hadoop.util.NativeCodeLoader:
> > Unable to load native-hadoop library for your platform... using
> > builtin-java classes where applicable 2014-02-06 19:15:15,025 INFO
> > org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
> > 2014-02-06 19:15:15,123 INFO org.apache.hadoop.mapred.Task: Using
> > ResourceCalculatorPlugin :
> > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2bd9e282 2014-02-06
> > 19:15:15,546 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 768
> > 2014-02-06 19:15:19,846 INFO org.apache.hadoop.mapred.MapTask: data
> buffer
> > = 612032832/644245088 2014-02-06 19:15:19,846 INFO
> > org.apache.hadoop.mapred.MapTask: record buffer = 9563013/10066330
> > 2014-02-06 19:15:20,037 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > brand-new decompressor 2014-02-06 19:15:21,083 INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > Created input record counter: Input records from _1_tmp1327641329
> > 2014-02-06 19:15:52,894 INFO org.apache.hadoop.mapred.MapTask: Spilling
> map
> > output: buffer full= true 2014-02-06 19:15:52,895 INFO
> > org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 611949600;
> bufvoid
> > = 644245088 2014-02-06 19:15:52,895 INFO
> org.apache.hadoop.mapred.MapTask:
> > kvstart = 0; kvend = 576; length = 10066330 2014-02-06 19:16:06,182 INFO
> > org.apache.hadoop.mapred.MapTask: Finished spill 0 2014-02-06
> 19:16:16,169
> > INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory
> handler
> > call - Collection threshold init = 328728576(321024K) used =
> > 1175055104(1147514K) committed = 1770848256(1729344K) max =
> > 2097152000(2048000K) 2014-02-06 19:16:20,446 INFO
> > org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
> > 308540402 bytes from 1 objects. init = 328728576(321024K) used =
> > 1175055104(1147514K) committed = 1770848256(1729344K) max =
> > 2097152000(2048000K) 2014-02-06 19:17:22,246 INFO
> > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> call-
> > Usage threshold init = 328728576(321024K) used = 1768466512(1727018K)
> > committed = 1770848256(1729344K) max = 2097152000(2048000K) 2014-02-06
> > 19:17:35,597 INFO org.apache.pig.impl.util.SpillableMemoryManager:
> Spilled
> > an estimate of 1073462600 bytes from 1 objects. init = 328728576(321024K)
> > used = 1768466512(1727018K) committed = 1770848256(1729344K) max =
> > 2097152000(2048000K) 2014-02-06 19:18:01,276 INFO
> > org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> > 2014-02-06 19:18:01,288 INFO org.apache.hadoop.mapred.MapTask: bufstart =
> > 611949600; bufend = 52332788; bufvoid = 644245088 2014-02-06 19:18:01,288
> > INFO org.apache.hadoop.mapred.MapTask: kvstart = 576; kvend = 777;
> length =
> > 10066330 2014-02-06 19:18:03,377 INFO org.apache.hadoop.mapred.MapTask:
> > Finished spill 1 2014-02-06 19:18:05,494 INFO
> > org.apache.hadoop.mapred.MapTask: Record too large for in-memory buffer:
> > 644246693 bytes 2014-02-06 19:18:36,008 INFO
> > org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
> > 306271368 bytes from 1 objects. init = 328728576(321024K) used =
> > 1449267128(1415299K) committed = 2097152000(2048000K) max =
> > 2097152000(2048000K) 2014-02-06 19:18:44,448 INFO
> > org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
> > with mapRetainSize=-1 and reduceRetainSize=-1 2014-02-06 19:18:44,780
> FATAL
> > org.apache.hadoop.mapred.Child: Error running child :
> > java.lang.Ou

Re: java.lang.OutOfMemoryError: Java heap space

2014-02-07 Thread praveenesh kumar
Thanks Park for sharing the above configs.

But I am wondering whether the above config changes would make any big
difference in my case. As per my logs, I am most worried about this line -

 INFO org.apache.hadoop.mapred.MapTask: Record too large for in-memory
buffer: 644245358 bytes

If I understand it properly, one of my records is too large to fit in
memory, and that is what is causing the issue. In that case none of the
above changes would have much impact; please correct me if I am getting
this totally wrong.

 - Adding the hadoop user group here as well, to gather some valuable
input on the above question.

Since I am doing a join on a grouped bag, do you think that might be the
cause?

But if that is the issue, then as far as I understand bags in Pig are
spillable, so it shouldn't have caused this problem.

I can't easily get rid of the group-by; grouping first should ideally
improve my join. But if it is the root cause, and if I am understanding it
correctly, do you think I should get rid of the group-by?

My question in that case would be: what happens if I do the group-by after
the join instead? It would result in a much bigger bag (because there
would be more records after the join).

Am I thinking about this correctly?

Regards

Prav



On Fri, Feb 7, 2014 at 3:11 AM, Cheolsoo Park  wrote:

> Looks like you're running out of space in MapOutputBuffer. Two suggestions-
>
> 1)
> You said that io.sort.mb is already set to 768 MB, but did you try to lower
> io.sort.spill.percent in order to spill earlier and more often?
>
> Page 12-
>
> http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
>
> 2)
> Can't you increase the parallelism of mappers so that each mapper has to
> handle a smaller size of data? Pig determines the number of mappers by
> total input size / pig.maxCombinedSplitSize (128MB by default). So you can
> try to lower pig.maxCombinedSplitSize.
>
> But I admit Pig internal data types are not memory-efficient, and that is
> an optimization opportunity. Contribute!
>
>
>
> On Thu, Feb 6, 2014 at 2:54 PM, praveenesh kumar  >wrote:
>
> > Its a normal join. I can't use replicated join, as the data is very
> large.
> >
> > Regards
> > Prav
> >
> >
> > On Thu, Feb 6, 2014 at 7:52 PM, abhishek 
> > wrote:
> >
> > > Hi Praveenesh,
> > >
> > > Did you use "replicated join" in your pig script or is it a regular
> join
> > ??
> > >
> > > Regards
> > > Abhishek
> > >
> > > Sent from my iPhone
> > >
> > > > On Feb 6, 2014, at 11:25 AM, praveenesh kumar 
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I am running a Pig Script which is running fine for small data. But
> > when
> > > I
> > > > scale the data, I am getting the following error at my map stage.
> > > > Please refer to the map logs as below.
> > > >
> > > > My Pig script is doing a group by first, followed by a join on the
> > > grouped
> > > > data.
> > > >
> > > >
> > > > Any clues to understand where I should look at or how shall I deal
> with
> > > > this situation. I don't want to just go by just increasing the heap
> > > space.
> > > > My map jvm heap space is already 3 GB with io.sort.mb = 768 MB.
> > > >
> > > > 2014-02-06 19:15:12,243 WARN org.apache.hadoop.util.NativeCodeLoader:
> > > > Unable to load native-hadoop library for your platform... using
> > > > builtin-java classes where applicable 2014-02-06 19:15:15,025 INFO
> > > > org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
> > > > 2014-02-06 19:15:15,123 INFO org.apache.hadoop.mapred.Task: Using
> > > > ResourceCalculatorPlugin :
> > > >
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2bd9e2822014-02-06
> > > > 19:15:15,546 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 768
> > > > 2014-02-06 19:15:19,846 INFO org.apache.hadoop.mapred.MapTask: data
> > > buffer
> > > > = 612032832/644245088 2014-02-06 19:15:19,846 INFO
> > > > org.apache.hadoop.mapred.MapTask: record buffer = 9563013/10066330
> > > > 2014-02-06 19:15:20,037 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got
> > > > brand-new decompressor 2014-02-06 19:15:21,083 INFO
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > > > Created input record counter: Input records from _1_tmp1327641329
> > > > 2014-02-06 19:15:52,894 INFO org.apa

Re: java.lang.OutOfMemoryError: Java heap space

2014-02-07 Thread praveenesh kumar
Hi Park,

Your explanation makes perfect sense in my case. Thanks for explaining what
is happening behind the scenes. I am wondering you used normal java
compression/decompression or is there a UDF already available to do this
stuff or some kind of property that we need to enable to say to PIG that
compress bags before spilling.
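For reference, a rough sketch of the kind of UDF Cheolsoo describes:
compressing a grouped bag into a bytearray after the group-by. This is
only an illustration under the assumption that the bag can be serialized
through its Writable interface; the class name and the choice of GZIP are
arbitrary, not an existing Pig built-in:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;

    public class CompressBag extends EvalFunc<DataByteArray> {
        @Override
        public DataByteArray exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            DataBag bag = (DataBag) input.get(0);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes));
            // DataBag is Writable, so it can serialize itself into the stream.
            bag.write(out);
            out.close();
            return new DataByteArray(bytes.toByteArray());
        }
    }

In the script it might be used as something like
compressed = FOREACH grouped GENERATE group, CompressBag(A);
(after REGISTERing the UDF jar), with a matching decompression UDF applied
when the bag contents are needed again after the join.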

Regards
Prav


On Fri, Feb 7, 2014 at 4:37 PM, Cheolsoo Park  wrote:

> Hi Prav,
>
> You're thinking correctly, and it's true that Pig bags are spillable.
>
> However, spilling is no magic, meaning you can still run into OOM with huge
> bags like you have here. Pig runs Spillable Memory Manager (SMM) in a
> separate thread. When spilling is triggered, SMM locks bags that it's
> trying to spill to disk. After the spilling is finished, GC frees up
> memory. The problem is that it's possible that more bags are loaded into
> memory while the spilling is in progress. Now JVM triggers GC, but GC
> cannot free up memory because SMM is locking the bags, resulting in OOM
> error. This happens quite often.
>
> Sounds like you do group-by to reduce the number of rows before join and
> don't immediately run any aggregation function on the grouped bags. If
> that's the case, can you compress those bags? For eg, you could add a
> foreach after group-by and run a UDF that compresses a bag and returns it
> as bytearray. From there, you're moving around small blobs rather than big
> bags. Of course, you will need to decompress them when you restore data out
> of those bags at some point. This trick saved me several times in the past
> particularly when I dealt with bags of large chararrays.
>
> Just a thought. Hope this is helpful.
>
> Thanks,
> Cheolsoo
>
>
> On Fri, Feb 7, 2014 at 7:37 AM, praveenesh kumar  >wrote:
>
> > Thanks Park for sharing the above configs
> >
> > But I am wondering if the above config changes would make any huge
> > difference in my case.
> > As per my logs, I am very worried about this line -
> >
> >  INFO org.apache.hadoop.mapred.MapTask: Record too large for in-memory
> buffer: 644245358 bytes
> >
> > If I am understanding it properly, my 1 record is very large to fit into
> the memory, which is causing the issue.
> >
> > Any of the above changes wouldn't make any huge impact, please correct
> me if I am taking it totally wrong.
> >
> >  - Adding hadoop user group here as well, to throw some valuable inputs
> to understand the above question.
> >
> >
> > Since I am doing a join on a grouped bag, do you think that might be the
> case ?
> >
> > But if that is the issue, as far as I understand Bags in Pig are
> spillable, it shouldn't have given this issue.
> >
> > I can't get rid of group by, Grouping by first should idealing improve
> my join. But if this is the root cause, if I am understanding it correctly,
> >
> > do you think I should get rid of group-by.
> >
> > But my question in that case would be what would happen if I do group by
> later after join, if will result in much bigger bag (because it would have
> more records after join)
> >
> > Am I thinking here correctly ?
> >
> > Regards
> >
> > Prav
> >
> >
> >
> > On Fri, Feb 7, 2014 at 3:11 AM, Cheolsoo Park  >wrote:
> >
> >> Looks like you're running out of space in MapOutputBuffer. Two
> >> suggestions-
> >>
> >> 1)
> >> You said that io.sort.mb is already set to 768 MB, but did you try to
> >> lower
> >> io.sort.spill.percent in order to spill earlier and more often?
> >>
> >> Page 12-
> >>
> >>
> http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
> >>
> >> 2)
> >> Can't you increase the parallelism of mappers so that each mapper has to
> >> handle a smaller size of data? Pig determines the number of mappers by
> >> total input size / pig.maxCombinedSplitSize (128MB by default). So you
> can
> >> try to lower pig.maxCombinedSplitSize.
> >>
> >> But I admit Pig internal data types are not memory-efficient, and that
> is
> >> an optimization opportunity. Contribute!
> >>
> >>
> >>
> >> On Thu, Feb 6, 2014 at 2:54 PM, praveenesh kumar  >> >wrote:
> >>
> >> > Its a normal join. I can't use replicated join, as the data is very
> >> large.
> >> >
> >> > Regards
> >> > Prav
> >> >
> >> >
> >> > On Thu, Feb 6, 2014 at 7:52 PM, abhishek 
> &

Re: I am about to lose all my data please help

2014-03-19 Thread praveenesh kumar
Is this property correct ?


<property>
  <name>fs.default.name</name>
  <value>-BLANKED</value>
</property>

Regards
Prav


On Wed, Mar 19, 2014 at 12:58 PM, Fatih Haltas  wrote:

> Thanks for your help, but I still could not solve my problem.
>
>
> On Tue, Mar 18, 2014 at 10:13 AM, Stanley Shi  wrote:
>
>> Ah yes, I overlooked this. Then please check the file are there or not:
>> "ls /home/hadoop/project/hadoop-data/dfs/name"?
>>
>> Regards,
>> *Stanley Shi,*
>>
>>
>>
>> On Tue, Mar 18, 2014 at 2:06 PM, Azuryy Yu  wrote:
>>
>>> I don't think this is the case, because there is;
>>>   
>>> hadoop.tmp.dir
>>> /home/hadoop/project/hadoop-data
>>>   
>>>
>>>
>>> On Tue, Mar 18, 2014 at 1:55 PM, Stanley Shi  wrote:
>>>
 one possible reason is that you didn't set the namenode working
 directory, by default it's in "/tmp" folder; and the "/tmp" folder might
 get deleted by the OS without any notification. If this is the case, I am
 afraid you have lost all your namenode data.

 *
   dfs.name.dir
   ${hadoop.tmp.dir}/dfs/name
   Determines where on the local filesystem the DFS name node
   should store the name table(fsimage).  If this is a comma-delimited 
 list
   of directories then the name table is replicated in all of the
   directories, for redundancy. 
 *


 Regards,
 *Stanley Shi,*



 On Sun, Mar 16, 2014 at 5:29 PM, Mirko Kämpf wrote:

> Hi,
>
> what is the location of the namenodes fsimage and editlogs?
> And how much memory has the NameNode.
>
> Did you work with a Secondary NameNode or a Standby NameNode for
> checkpointing?
>
> Where are your HDFS blocks located, are those still safe?
>
> With this information at hand, one might be able to fix your setup,
> but do not format the old namenode before
> all is working with a fresh one.
>
> Grab a copy of the maintainance guide:
> http://shop.oreilly.com/product/0636920025085.do?sortby=publicationDate
> which helps solving such type of problems as well.
>
> Best wishes
> Mirko
>
>
> 2014-03-16 9:07 GMT+00:00 Fatih Haltas :
>
> Dear All,
>>
>> I have just restarted machines of my hadoop clusters. Now, I am
>> trying to restart hadoop clusters again, but getting error on namenode
>> restart. I am afraid of loosing my data as it was properly running for 
>> more
>> than 3 months. Currently, I believe if I do namenode formatting, it will
>> work again, however, data will be lost. Is there anyway to solve this
>> without losing the data.
>>
>> I will really appreciate any help.
>>
>> Thanks.
>>
>>
>> =
>> Here is the logs;
>> 
>> 2014-02-26 16:02:39,698 INFO
>> org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
>> /
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG:   host = ADUAE042-LAP-V/127.0.0.1
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 1.0.4
>> STARTUP_MSG:   build =
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0-r 
>> 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012
>> /
>> 2014-02-26 16:02:40,005 INFO
>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>> hadoop-metrics2.properties
>> 2014-02-26 16:02:40,019 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> MetricsSystem,sub=Stats registered.
>> 2014-02-26 16:02:40,021 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>> period at 10 second(s).
>> 2014-02-26 16:02:40,021 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics 
>> system
>> started
>> 2014-02-26 16:02:40,169 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source 
>> ugi
>> registered.
>> 2014-02-26 16:02:40,193 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source 
>> jvm
>> registered.
>> 2014-02-26 16:02:40,194 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> NameNode registered.
>> 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: VM
>> type   = 64-bit
>> 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
>> memory = 17.77875 MB
>> 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet:
>> capacity  = 2^21 = 2097152 entries
>> 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet:
>> recommended=2097152, actual=2097152
>> 2014-02-26 16:02:40,273 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop
>> 2014-02-26 16:02:40,273 INFO
>> org.apache.h

Re: Doubt

2014-03-19 Thread praveenesh kumar
Why not? It's just a matter of installing two different packages.
Depending on what you want to use it for, there are a few things to take
care of, but as far as installation is concerned, it should be easily doable.

Regards
Prav


On Wed, Mar 19, 2014 at 3:41 PM, sri harsha  wrote:

> Hi all,
> is it possible to install Mongodb on the same VM which consists hadoop?
>
> --
> amiable harsha
>


Re: I am about to lose all my data please help

2014-03-24 Thread praveenesh kumar
Can you also make sure your hostname and IP address are still mapped
correctly? What I am guessing is that when you restarted your machine,
your /etc/hosts entries got reset (it happens in some distributions,
depending on how you installed it). So when you try to restart your
namenode, it might be pointing to a different IP/machine (typically
localhost).

I can't think of any other way this could happen just by restarting the
machine.
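
If you want to double-check, something along these lines (paths are just
examples, adjust to your install) will show whether the host configured in
fs.default.name still resolves to the address the DataNodes registered with:

  hostname -f                                                # what this box thinks it is called
  cat /etc/hosts                                             # is that name mapped to the right IP, not 127.0.0.1?
  grep -A 1 fs.default.name $HADOOP_HOME/conf/core-site.xml  # which host clients and DataNodes connect to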


On Mon, Mar 24, 2014 at 5:42 AM, Stanley Shi  wrote:

> Can you confirm that your namenode image and edit log are still there? If
> not, then your data IS lost.
>
> Regards,
> *Stanley Shi,*
>
>
>
> On Sun, Mar 23, 2014 at 6:24 PM, Fatih Haltas wrote:
>
>> No, not ofcourse I blinded it.
>>
>>
>> On Wed, Mar 19, 2014 at 5:09 PM, praveenesh kumar 
>> wrote:
>>
>>> Is this property correct ?
>>>
>>>
>>> <property>
>>>   <name>fs.default.name</name>
>>>   <value>-BLANKED</value>
>>> </property>
>>>
>>> Regards
>>> Prav
>>>
>>>
>>> On Wed, Mar 19, 2014 at 12:58 PM, Fatih Haltas wrote:
>>>
>>>> Thanks for you helps, but still could not solve my problem.
>>>>
>>>>
>>>> On Tue, Mar 18, 2014 at 10:13 AM, Stanley Shi wrote:
>>>>
>>>>> Ah yes, I overlooked this. Then please check whether the files are there
>>>>> or not: "ls /home/hadoop/project/hadoop-data/dfs/name"?
>>>>>
>>>>> Regards,
>>>>> *Stanley Shi,*
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 18, 2014 at 2:06 PM, Azuryy Yu  wrote:
>>>>>
>>>>>> I don't think this is the case, because there is;
>>>>>>   <property>
>>>>>>     <name>hadoop.tmp.dir</name>
>>>>>>     <value>/home/hadoop/project/hadoop-data</value>
>>>>>>   </property>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 18, 2014 at 1:55 PM, Stanley Shi wrote:
>>>>>>
>>>>>>> one possible reason is that you didn't set the namenode working
>>>>>>> directory, by default it's in "/tmp" folder; and the "/tmp" folder might
>>>>>>> get deleted by the OS without any notification. If this is the case, I 
>>>>>>> am
>>>>>>> afraid you have lost all your namenode data.
>>>>>>>
>>>>>>> *<property>
>>>>>>>   <name>dfs.name.dir</name>
>>>>>>>   <value>${hadoop.tmp.dir}/dfs/name</value>
>>>>>>>   <description>Determines where on the local filesystem the DFS name node
>>>>>>>   should store the name table(fsimage).  If this is a comma-delimited list
>>>>>>>   of directories then the name table is replicated in all of the
>>>>>>>   directories, for redundancy.</description>
>>>>>>> </property>*
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> *Stanley Shi,*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 16, 2014 at 5:29 PM, Mirko Kämpf >>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> what is the location of the namenodes fsimage and editlogs?
>>>>>>>> And how much memory has the NameNode.
>>>>>>>>
>>>>>>>> Did you work with a Secondary NameNode or a Standby NameNode for
>>>>>>>> checkpointing?
>>>>>>>>
>>>>>>>> Where are your HDFS blocks located, are those still safe?
>>>>>>>>
>>>>>>>> With this information at hand, one might be able to fix your setup,
>>>>>>>> but do not format the old namenode before
>>>>>>>> all is working with a fresh one.
>>>>>>>>
>>>>>>>> Grab a copy of the maintainance guide:
>>>>>>>> http://shop.oreilly.com/product/0636920025085.do?sortby=publicationDate
>>>>>>>> which helps solving such type of problems as well.
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Mirko
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-03-16 9:07 GMT+00:00 Fatih Hal

Re: Hadoop Takes 6GB Memory to run one mapper

2014-03-25 Thread praveenesh kumar
Can you try storing your file as bytes instead of Strings? I can't think of
any reason why this would require a 6 GB heap.
Can you explain your use case? That might help in suggesting some
alternatives, if you are interested.
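
Just to illustrate what I mean by bytes - this is only a rough sketch, assuming
tab-separated key/value lines in UTF-8 (class name, capacity and details are
illustrative, not your actual code):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

public class ByteBackedLookup {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Keys stay Strings (needed for hashing); values are kept as raw UTF-8 bytes,
    // which drops the second String object + char[] per record. The initial
    // capacity is sized for roughly 11M entries to avoid rehashing - adjust it.
    private final Map<String, byte[]> mapData = new HashMap<String, byte[]>(16000000);

    public void load(InputStream gzippedStream) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzippedStream), UTF8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip malformed lines
                }
                mapData.put(line.substring(0, tab),
                            line.substring(tab + 1).getBytes(UTF8));
            }
        } finally {
            reader.close();
        }
    }

    public String lookup(String key) {
        byte[] value = mapData.get(key);
        return value == null ? null : new String(value, UTF8);
    }
}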

Regards
Prav


On Tue, Mar 25, 2014 at 7:31 AM, Nivrutti Shinde
wrote:

> Yes, it is in the setup method. I am just reading the file, which is stored
> on HDFS.
>
>
> On Tuesday, 25 March 2014 12:01:08 UTC+5:30, Praveenesh Kumar wrote:
>
>> And I am guessing you are not doing this inside the map() method, right -
>> it's in the setup() method?
>>
>>
>> On Tue, Mar 25, 2014 at 6:05 AM, Nivrutti Shinde wrote:
>>
>>> private Map<String, String> mapData = new ConcurrentHashMap<String, String>(1100);
>>> FileInputStream fis = new FileInputStream(file);
>>> GZIPInputStream gzipInputStream = new GZIPInputStream(fis);
>>> BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
>>> gzipInputStream));
>>> String line = null;
>>> while ((line = bufferedReader.readLine()) != null)
>>> {
>>> String[] data = line.split("\t");
>>> mapData.put(data[0], data[1]);
>>> }
>>>
>>> On Monday, 24 March 2014 19:17:12 UTC+5:30, Praveenesh Kumar wrote:
>>>
>>>> Can you please share your code snippet. Just want to see how are you
>>>> loading your file into mapper ?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 24, 2014 at 1:15 PM, Nivrutti Shinde 
>>>> wrote:
>>>>
>>>>> Thanks For your reply,
>>>>>
>>>>> Harsh,
>>>>>
>>>>> I tried THashMap but land up in same issue.
>>>>>
>>>>> David,
>>>>>
>>>>> I tried map side join and cascading approach, but time taken by them
>>>>> is lot.
>>>>>
>>>>> On Friday, 21 March 2014 12:03:28 UTC+5:30, Nivrutti Shinde wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a use case where I am loading a 200 MB file with 11 million
>>>>>> records (each record is of length 12) into a map, so that while running
>>>>>> the hadoop job I can quickly look up the value for the key of each input
>>>>>> record in the mapper.
>>>>>>
>>>>>> It is such a small file, but to load the data into the map I have to
>>>>>> allocate a 6 GB heap. When I run a small standalone application to load
>>>>>> this file, it requires 2 GB of memory.
>>>>>>
>>>>>> I don't understand why hadoop requires 6 GB to load the data into
>>>>>> memory. The Hadoop job runs fine after that, but the number of mappers I
>>>>>> can run is 2. I need to get this done in 2-3 GB only, so I can run at
>>>>>> least 8-9 mappers per node.
>>>>>>
>>>>>> I have created a gzip file (which is now only 17 MB). I have kept the
>>>>>> file on HDFS and am using the HDFS API to read the file and load the
>>>>>> data into the map. Block size is 128 MB. Cloudera Hadoop.
>>>>>>
>>>>>> Any help or alternate approaches to load the data into memory with a
>>>>>> minimum heap size, so I can run many mappers with 2-3 GB memory
>>>>>> allocated to each?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>  --
>>>>>
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "CDH Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to cdh-user+u...@cloudera.org.
>>>>>
>>>>> For more options, visit https://groups.google.com/a/cl
>>>>> oudera.org/d/optout.
>>>>>
>>>>
>>>>  --
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "CDH Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to cdh-user+u...@cloudera.org.
>>> For more options, visit https://groups.google.com/a/
>>> cloudera.org/d/optout.
>>>
>>
>>  --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "CDH Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cdh-user+unsubscr...@cloudera.org.
> For more options, visit https://groups.google.com/a/cloudera.org/d/optout.
>


Re: HDFS multi-tenancy and federation

2014-07-15 Thread praveenesh kumar
Federation is just a namespace management capability for the NameNode. It is
designed to partition namespace management across NameNodes and to provide
NameNode scalability. I don't think it imposes any security restrictions on
accessing the HDFS filesystem. I guess this link would help with your question 2 -
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/ViewFs.html
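
For your /marketing and /finance example, the client-side mount table from that
page would look roughly like this in core-site.xml (the cluster name and
namenode hosts below are made up):

<!-- client-side core-site.xml; cluster name and hosts are illustrative -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./marketing</name>
  <value>hdfs://nn-a.example.com:8020/marketing</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./finance</name>
  <value>hdfs://nn-b.example.com:8020/finance</value>
</property>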





On Tue, Jul 15, 2014 at 10:06 AM, Shani Ranasinghe 
wrote:

> Hi,
>
> Thanks for the information.
>
> Can I have an answer for the question 2  please? Appreciate any help.
>
>
> On Wed, Feb 5, 2014 at 1:41 PM, praveenesh kumar 
> wrote:
>
>> Hi Shani,
>>
>> I haven't done any implementation on HDFS federation, but as far as I
>> know, 1 namenode can handle only 1 namespace at this time. I hope that
>> helps.
>>
>> Regards
>> Prav
>>
>>
>> On Wed, Feb 5, 2014 at 8:05 AM, Shani Ranasinghe 
>> wrote:
>>
>>> Hi,
>>>
>>> Any help on this please?
>>>
>>>
>>>
>>> On Mon, Feb 3, 2014 at 12:14 PM, Shani Ranasinghe 
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>> I would like to know the following.
>>>>
>>>> 1) Can there be multiple namespaces in a single namenode? is it
>>>> recommended?  (I'm having a multi-tenant environment in mind)
>>>>
>>>> 2) Let's say I have federated namespaces/namenodes. There are two
>>>> namenodes, A with namespace A1 and B with namespace B1, and 3
>>>> datanodes. Can someone from namespace A1 access, in any way (hacking),
>>>> the datanodes' data belonging to namespace B1? If not, how is it handled?
>>>>
>>>> After going through a lot of references, my understanding of HDFS
>>>> multi-tenancy and federation is that for multi-tenancy what we could do is
>>>> use file/folder permissions (u, g, o) and ACLs, or we could dedicate a
>>>> namespace per tenant. The issue here is that a namenode (active namenode,
>>>> passive namenode and secondary namenode) has to be assigned per tenant. Is
>>>> there any other way that multi-tenancy can be achieved?
>>>>
>>>> On federation, let's say I have a namenode for /marketing and another
>>>> for /finance. Lets say that marketing bears the most load. How can we load
>>>> balance this? is it possible?
>>>>
>>>> Appreciate any help on this.
>>>>
>>>> Regards,
>>>> Shani.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


Delete a folder name containing *

2014-08-20 Thread praveenesh kumar
Hi team

I am in a weird situation where I have the following HDFS sample folders:

/data/folder/
/data/folder*
/data/folder_day
/data/folder_day/monday
/data/folder/1
/data/folder/2

I want to delete the entry literally named /data/folder* without deleting the
other folders or their sub_folders. If I do hadoop fs -rmr /data/folder*, the
glob will match and delete everything, which I want to avoid. I tried the
escape character \ but the HDFS FS shell is not taking it. Any hints/tricks?


Regards
Praveenesh


Re: Delete a folder name containing *

2014-08-20 Thread praveenesh kumar
With renaming - you would use the mv command "hadoop fs -mv /data/folder*
/data/new_folder". Won't that move all the sub_dirs along with it?


On Wed, Aug 20, 2014 at 12:00 PM, dileep kumar  wrote:

> Just Rename the folder.
>
>
> On Wed, Aug 20, 2014 at 6:53 AM, praveenesh kumar 
> wrote:
>
>> Hi team
>>
>> I am in weird situation where I have  following HDFS sample folders
>>
>> /data/folder/
>> /data/folder*
>> /data/folder_day
>> /data/folder_day/monday
>> /data/folder/1
>> /data/folder/2
>>
>> I want to delete /data/folder* without deleting its sub_folders. If I do
>> hadoop fs -rmr /data/folder* it will delete everything which I want to
>> avoid. I tried with escape character \ but HDFS FS shell is not taking it.
>> Any hints/tricks ?
>>
>>
>> Regards
>> Praveenesh
>>
>
>


Re: Delete a folder name containing *

2014-08-20 Thread praveenesh kumar
No, I have tried all the usual things like single quotes, double quotes and the
escape character, but it is not working. I wonder what the escape character is
for the Hadoop FS utility.




On Wed, Aug 20, 2014 at 1:26 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> try putting the name in quotes
>
>
> On Wed, Aug 20, 2014 at 4:35 PM, praveenesh kumar 
> wrote:
>
>> With renaming - you would use the mv command "hadoop fs -mv /data/folder*
>> /data/new_folder". Won't it move all the sub_dirs along with that ?
>>
>>
>> On Wed, Aug 20, 2014 at 12:00 PM, dileep kumar 
>> wrote:
>>
>>> Just Rename the folder.
>>>
>>>
>>> On Wed, Aug 20, 2014 at 6:53 AM, praveenesh kumar 
>>> wrote:
>>>
>>>> Hi team
>>>>
>>>> I am in weird situation where I have  following HDFS sample folders
>>>>
>>>> /data/folder/
>>>> /data/folder*
>>>> /data/folder_day
>>>> /data/folder_day/monday
>>>> /data/folder/1
>>>> /data/folder/2
>>>>
>>>> I want to delete /data/folder* without deleting its sub_folders. If I
>>>> do hadoop fs -rmr /data/folder* it will delete everything which I want to
>>>> avoid. I tried with escape character \ but HDFS FS shell is not taking it.
>>>> Any hints/tricks ?
>>>>
>>>>
>>>> Regards
>>>> Praveenesh
>>>>
>>>
>>>
>>
>


Re: Delete a folder name containing *

2014-08-20 Thread praveenesh kumar
Not working for me, strange :(


On Wed, Aug 20, 2014 at 3:00 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> use this:
>
> hadoop fs -rmr "/path-to-folder/"folder\*"
>
> just tried it out :)
>
>
> On Wed, Aug 20, 2014 at 7:07 PM, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> interesting ... although escape character is still the forward slash and
>> its proven to work with other special characters. Here's a link:
>> Deleting directory with special character
>> <http://stackoverflow.com/questions/13529114/how-to-delete-a-directory-from-hadoop-cluster-which-is-having-comma-in-its-na>
>>
>>
>>
>>
>> On Wed, Aug 20, 2014 at 6:22 PM, praveenesh kumar 
>> wrote:
>>
>>> No, I have tried all usual things like single quotes, double quotes,
>>> escape character.. but it is not working. I wonder what is escape char with
>>> Hadoop FS utility.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 20, 2014 at 1:26 PM, Ritesh Kumar Singh <
>>> riteshoneinamill...@gmail.com> wrote:
>>>
>>>> try putting the name in quotes
>>>>
>>>>
>>>> On Wed, Aug 20, 2014 at 4:35 PM, praveenesh kumar >>> > wrote:
>>>>
>>>>> With renaming - you would use the mv command "hadoop fs -mv
>>>>> /data/folder* /data/new_folder". Won't it move all the sub_dirs along with
>>>>> that ?
>>>>>
>>>>>
>>>>> On Wed, Aug 20, 2014 at 12:00 PM, dileep kumar 
>>>>> wrote:
>>>>>
>>>>>> Just Rename the folder.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 20, 2014 at 6:53 AM, praveenesh kumar <
>>>>>> praveen...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi team
>>>>>>>
>>>>>>> I am in weird situation where I have  following HDFS sample folders
>>>>>>>
>>>>>>> /data/folder/
>>>>>>> /data/folder*
>>>>>>> /data/folder_day
>>>>>>> /data/folder_day/monday
>>>>>>> /data/folder/1
>>>>>>> /data/folder/2
>>>>>>>
>>>>>>> I want to delete /data/folder* without deleting its sub_folders. If
>>>>>>> I do hadoop fs -rmr /data/folder* it will delete everything which I 
>>>>>>> want to
>>>>>>> avoid. I tried with escape character \ but HDFS FS shell is not taking 
>>>>>>> it.
>>>>>>> Any hints/tricks ?
>>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Praveenesh
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Delete a folder name containing *

2014-08-20 Thread praveenesh kumar
Commands used -

1. hadoop fs -rmr /data/folder\*
2. hadoop fs -rmr 'data/folder\*'
3. hadoop fs -rmr "/data/folder\*"

None of them gave any output. Hadoop version - 1.0.2
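
One more thing that might be worth trying (a sketch only - I haven't run it
against 1.0.2): go through the FileSystem Java API instead of the shell, since
the API takes the Path literally and does no glob expansion:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteLiteralPath {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        // The API does not expand "*", so only the entry literally named
        // "/data/folder*" is removed; /data/folder and its children are untouched.
        boolean deleted = fs.delete(new Path("/data/folder*"), true);
        System.out.println("deleted: " + deleted);
    }
}

Compile it into a jar and launch it with the hadoop script so the cluster
configuration is on the classpath.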


On Wed, Aug 20, 2014 at 7:26 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> what's the exact command you are using(with escape and quotes)?
> And what's the output after execution?
>
>
> On Wed, Aug 20, 2014 at 9:33 PM, hadoop hive  wrote:
>
>> Move it to some tmp directory and delete parent directory.
>> On Aug 20, 2014 4:23 PM, "praveenesh kumar"  wrote:
>>
>>> Hi team
>>>
>>> I am in weird situation where I have  following HDFS sample folders
>>>
>>> /data/folder/
>>> /data/folder*
>>> /data/folder_day
>>> /data/folder_day/monday
>>> /data/folder/1
>>> /data/folder/2
>>>
>>> I want to delete /data/folder* without deleting its sub_folders. If I do
>>> hadoop fs -rmr /data/folder* it will delete everything which I want to
>>> avoid. I tried with escape character \ but HDFS FS shell is not taking it.
>>> Any hints/tricks ?
>>>
>>>
>>> Regards
>>> Praveenesh
>>>
>>
>


Falcon usecases

2015-12-02 Thread praveenesh kumar
Hello hadoopers

Just curious to understand the current state of Falcon. How much is it
currently being adopted in the industry? Is anyone even using it other
than the creators?

There is not much information on the internet about Falcon examples and use
cases, yet it ships with the HDP distribution. Hence this question about the
best engineering/deployment principles around it.

I personally tried the GUI and it doesn't seem to be working properly on the
HDP sandbox 2.3.2, but that is another question to dig into later. Before that I
wanted to understand the current adoption of Falcon across the big data
industry.

Anyone with any insights, please share..!!

Regards
Prav


Re: Falcon usecases

2015-12-04 Thread praveenesh kumar
Thanks Chris for pointing me to the mailing list and the HDP support forums.
However, my question is more general, which is why I thought of putting it
here. All I am trying to understand, from anyone in the hadoop community who
has encountered Falcon before, is how the community is responding to it. Is
anyone using it or trying to use it? I understand that the Falcon project
currently doesn't have a user mailing list, which is why I thought of putting
this question here rather than subscribing to one more mailing list.

@Chris - What is the reason HDP is backing it and delivering it in the HDP
distribution? Do you see any future/current client use cases that highlight
its necessity?

FYI - I have been working with Falcon for the past 2 weeks, trying to
understand it much better from an industry point of view, hence asking this
to understand whether I am on the right path or whether it is still a long
way before Falcon can be used as a production tool.


On Wed, Dec 2, 2015 at 5:26 PM, Chris Nauroth 
wrote:

> Hello Prav,
>
> You might have better luck getting a response to this question by directly
> asking the Falcon community.  I don't see a user@ mailing list for
> Falcon, but I do see a dev@ list.  More details are here:
>
> http://falcon.apache.org/mail-lists.html
>
> For questions related specifically to HDP Sandbox, you'll likely get more
> help from Hortonworks support forums.  (This is generally true for any
> vendor product that differentiates from the Apache distro.)
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: praveenesh kumar 
> Date: Wednesday, December 2, 2015 at 10:01 AM
> To: "user@hadoop.apache.org" 
> Subject: Falcon usecases
>
> Hello hadoopers
>
> Just curious to understand what is the current state of falcon.. How much
> it is currently being adopted in the industry.. Anyone even using it other
> than the creators?
>
> There is not much information on the internet about falcon examples and
> use cases but then it is coming along in HDP distribution. Hence this
> question on understanding what are the best engineering/deployment
> principles around it ?
>
> I personally tried the GUI and it doesn't seems to be working properly on
> HDP sandbox 2.3.2, but that is another question to dig later. Before that I
> wanted to understand the current adoption of Falcon around big data
> industry.
>
> Anyone with any insights, please share..!!
>
> Regards
> Prav
>


Best way to pass project resources in Java MR

2015-12-16 Thread praveenesh kumar
Hello folks

Basic Java MR question.

What is the right practice for passing program-specific files (sitting in
src/main/resources in your project folder) to your mapper?

I tried the following and it didn't work:

1. Inside mapper setup() ->

ClassLoader.getClassLoader().getResource("myfile_in_jar")

ClassLoader.getSystemClassLoader().getResource("myfile_in_jar");

2. Inside Driver run() method


Re: FileUtil.fullyDelete does ?

2016-07-26 Thread praveenesh kumar
https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/fs/FileUtil.html#fullyDelete(java.io.File)

On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot 
wrote:

> Resending to right list
> -- Forwarded message --
> From: "Divya Gehlot" 
> Date: Jul 26, 2016 6:51 PM
> Subject: FileUtil.fullyDelete does ?
> To: "user @spark" 
> Cc:
>
> Hi,
> When I am using the FileUtil.copyMerge function:
>
> val file = "/tmp/primaryTypes.csv"
>
> FileUtil.fullyDelete(new File(file))
>
>  val destinationFile= "/tmp/singlePrimaryTypes.csv"
>
> FileUtil.fullyDelete(new File(destinationFile))
>
>  val counts = partitions.
>
> reduceByKey {case (x,y) => x + y}.
>
> sortBy {case (key, value) => -value}.
>
> map { case (key, value) => Array(key, value).mkString(",") }
>
>  counts.saveAsTextFile(file)
>
>  merge(file, destinationFile)
>
>
> I am wondering here: what does FileUtil.fullyDelete(new
> File(destinationFile)) do?
>
>   Does it delete the merged file? If yes, then how will we access the
> merged file?
>
>
> Confused here ...
>
>
>
> Thanks,
>
> Divya
>
>
>


Re: Teradata into hadoop Migration

2016-08-05 Thread praveenesh kumar
From a TD perspective, have a look at this - https://youtu.be/NTTQdAfZMJA They
are planning to open-source it. Perhaps you can get in touch with the team.
Let me know if you are interested. If you have TD contacts, ask them about this;
they should be able to point you to the right people.

Again, this is not a sales pitch. The tool looks like what you are looking
for and will be open source soon. Let me know if you want to get in touch
with the folks working on this.

Regards
Prav

On Fri, Aug 5, 2016 at 4:29 PM, Wei-Chiu Chuang 
wrote:

> Hi,
>
> I think Cloudera Navigator Optimizer is the tool you are looking for. It
> allows you to transform SQL queries (TD) into Impala and Hive.
> http://blog.cloudera.com/blog/2015/11/introducing-cloudera-
> navigator-optimizer-for-optimal-sql-workload-efficiency-on-apache-hadoop/
> Hope this doesn’t sound like a sales pitch. If you’re a Cloudera paid
> customer you should reach out to the account/support team for more
> information.
>
> *disclaimer: I work for Cloudera
>
> Wei-Chiu Chuang
> A very happy Clouderan
>
> On Aug 4, 2016, at 10:50 PM, Rakesh Radhakrishnan 
> wrote:
>
> Sorry, I don't have much insight on this apart from basic Sqoop. AFAIK,
> it is more vendor specific; you may need to dig further along that line.
>
> Thanks,
> Rakesh
>
> On Mon, Aug 1, 2016 at 11:38 PM, Bhagaban Khatai  > wrote:
>
>> Thanks Rakesh for the useful information. We are using Sqoop for the data
>> transfer, but all the TD logic we are implementing through Hive.
>> It's taking time to implement the same logic from the mapping provided by
>> the TD team.
>>
>> What I want is some tool or ready-made framework so that the development
>> effort would be less.
>>
>> Thanks in advance for your help.
>>
>> Bhagaban
>>
>> On Mon, Aug 1, 2016 at 6:07 PM, Rakesh Radhakrishnan 
>> wrote:
>>
>>> Hi Bhagaban,
>>>
>>> Perhaps, you can try "Apache Sqoop" to transfer data to Hadoop from
>>> Teradata. Apache Sqoop provides an efficient approach for transferring
>>> large data between Hadoop related systems and structured data stores. It
>>> allows support for a data store to be added as a so-called connector and
>>> can connect to various databases including Oracle etc.
>>>
>>> I hope the below links will be helpful to you,
>>> http://sqoop.apache.org/
>>> http://blog.cloudera.com/blog/2012/01/cloudera-connector-for
>>> -teradata-1-0-0/
>>> http://hortonworks.com/blog/round-trip-data-enrichment-teradata-hadoop/
>>> http://dataconomy.com/wp-content/uploads/2014/06/Syncsort-A-
>>> 123ApproachtoTeradataOffloadwithHadoop.pdf
>>>
>>> Below are few data ingestion tools, probably you can dig more into it,
>>> https://www.datatorrent.com/product/datatorrent-ingestion/
>>> https://www.datatorrent.com/dtingest-unified-streaming-batch
>>> -data-ingestion-hadoop/
>>>
>>> Thanks,
>>> Rakesh
>>>
>>> On Mon, Aug 1, 2016 at 4:54 PM, Bhagaban Khatai <
>>> email.bhaga...@gmail.com> wrote:
>>>
 Hi Guys-

 I need quick help: has anybody done a migration project from TD into
 hadoop?
 We have a very tight deadline and I am trying to find any tool (online or
 paid) for quick development.

 Please help us here and guide me if any other way is available to do
 the development fast.

 Bhagaban

>>>
>>>
>>
>
>


Anyone has tried accessing TDE using HDFS Java APIs

2018-01-25 Thread praveenesh kumar
Hi

We are trying to access TDE files using the HDFS Java API. The user that is
running the job has access to the TDE zone, and we have successfully accessed
the file in the Hadoop FS command shell.

If we pass the same file in spark using the same user, it also gets read
properly.

It's just when we try to use the vanilla HDFS APIs that it's not able to pick
up the file. And when it does pick it up, it's not able to decipher the text -
the data is not getting decrypted. My understanding is that when you pass
hdfs-site.xml, core-site.xml and kms-site.xml to the Configuration object,
it should be able to handle the keys automatically.
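
For illustration, the bare-bones read being attempted looks roughly like this
(the config paths, file path and KMS URI below are placeholders, not our actual
values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromEncryptionZone {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // client configs copied from the cluster; paths are illustrative
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        // hdfs-site.xml must carry dfs.encryption.key.provider.uri,
        // e.g. kms://http@kms-host:16000/kms (host and port are placeholders)

        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/secure_zone/sample.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();
        }
    }
}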

Not sure if we need to do anything extra in the Java API. Any pointers? There
isn't a single example in the documentation on reading TDE files via the HDFS
Java API.

Any suggestions would be much appreciated.

Regards
Prav


Re: Anyone has tried accessing TDE using HDFS Java APIs

2018-01-26 Thread praveenesh kumar
Hi Ajay

We are using HDP 2.5.5 with HDFS 2.7.1.2.5

Thanks
Prav

On Thu, Jan 25, 2018 at 5:47 PM, Ajay Kumar 
wrote:

> Hi Praveenesh,
>
>
>
> What version of Hadoop you are using?
>
>
>
> Thanks,
>
> Ajay
>
>
>
> *From: *praveenesh kumar 
> *Date: *Thursday, January 25, 2018 at 8:22 AM
> *To: *"user@hadoop.apache.org" 
> *Subject: *Anyone has tried accessing TDE using HDFS Java APIs
>
>
>
> Hi
>
>
>
> We are trying to access TDE files using HDFS JAVA API. The user which is
> running the job has access to the TDE zone. We have tried accessing the
> file successfully in Hadoop FS Command shell.
>
>
>
> If we pass the same file in spark using the same user, it also gets read
> properly.
>
>
>
> Its just when we are trying to use the vanila HDFS APIs, its not able to
> pick the file. And when it picks, its not able to decipher the text. The
> data is not getting decrypted. My understanding is when you pass
> hdfs-site.xml, core-site.xml and kms-site.xml to the configuration object,
> it should be able to handle keys automatically.
>
>
>
> Not sure if we need to do anything extra in JAVA API. Any pointers. There
> aren't a single example in the documentation to get the TDE files via HDFS
> Java API.
>
>
>
> Any suggestions would be much appreciated.
>
>
>
> Regards
>
> Prav
>


Re: Anyone has tried accessing TDE using HDFS Java APIs

2018-01-29 Thread praveenesh kumar
Hi Ajay

Did you get a chance to look into this? Thanks.

Regards
Prav

On Fri, Jan 26, 2018 at 8:48 AM, praveenesh kumar 
wrote:

> Hi Ajay
>
> We are using HDP 2.5.5 with HDFS 2.7.1.2.5
>
> Thanks
> Prav
>
> On Thu, Jan 25, 2018 at 5:47 PM, Ajay Kumar 
> wrote:
>
>> Hi Praveenesh,
>>
>>
>>
>> What version of Hadoop you are using?
>>
>>
>>
>> Thanks,
>>
>> Ajay
>>
>>
>>
>> *From: *praveenesh kumar 
>> *Date: *Thursday, January 25, 2018 at 8:22 AM
>> *To: *"user@hadoop.apache.org" 
>> *Subject: *Anyone has tried accessing TDE using HDFS Java APIs
>>
>>
>>
>> Hi
>>
>>
>>
>> We are trying to access TDE files using HDFS JAVA API. The user which is
>> running the job has access to the TDE zone. We have tried accessing the
>> file successfully in Hadoop FS Command shell.
>>
>>
>>
>> If we pass the same file in spark using the same user, it also gets read
>> properly.
>>
>>
>>
>> Its just when we are trying to use the vanila HDFS APIs, its not able to
>> pick the file. And when it picks, its not able to decipher the text. The
>> data is not getting decrypted. My understanding is when you pass
>> hdfs-site.xml, core-site.xml and kms-site.xml to the configuration object,
>> it should be able to handle keys automatically.
>>
>>
>>
>> Not sure if we need to do anything extra in JAVA API. Any pointers. There
>> aren't a single example in the documentation to get the TDE files via HDFS
>> Java API.
>>
>>
>>
>> Any suggestions would be much appreciated.
>>
>>
>>
>> Regards
>>
>> Prav
>>
>
>