Re: Passing data from Client to AM

2014-01-30 Thread Hitesh Shah
Adding values to a Configuration object does not really work unless you 
serialize the config into a file and send it over to the AM and containers as a 
local resource. The application code would then need to load in this file using 
Configuration::addResource(). MapReduce does this by taking in all user 
configured values and serializing them in the form of job.xml.
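
For reference, a rough sketch of that pattern (client stages the serialized config in
HDFS, ships it as a LocalResource, and the AM reads it back). The paths and link names
below are made up, and amContainer is assumed to be the ContainerLaunchContext you are
already building for the AM:

  // Client side: serialize the Configuration and stage it in HDFS.
  Configuration conf = new Configuration();
  conf.set("my.custom.key", "some value");            // data to pass to the AM
  Path remote = new Path("/tmp/myapp/app-conf.xml");  // illustrative staging path
  FileSystem fs = FileSystem.get(conf);
  try (FSDataOutputStream out = fs.create(remote, true)) {
    conf.writeXml(out);
  }

  // Register the staged file as a LocalResource for the AM container.
  FileStatus stat = fs.getFileStatus(remote);
  LocalResource res = LocalResource.newInstance(
      ConverterUtils.getYarnUrlFromPath(remote),
      LocalResourceType.FILE, LocalResourceVisibility.APPLICATION,
      stat.getLen(), stat.getModificationTime());
  Map<String, LocalResource> localResources = new HashMap<String, LocalResource>();
  localResources.put("app-conf.xml", res);            // link name inside the container
  amContainer.setLocalResources(localResources);

  // AM / container side: the file is localized into the working directory.
  Configuration appConf = new Configuration(false);
  appConf.addResource(new Path("app-conf.xml"));
  String value = appConf.get("my.custom.key");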

-- Hitesh

On Jan 29, 2014, at 2:42 PM, Jay Vyas wrote:

> While you're at it, what about adding values to the Configuration() object? 
> Does that still work as a hack for passing information?
> 
> 
> On Wed, Jan 29, 2014 at 5:25 PM, Arun C Murthy  wrote:
> Command line arguments & env variables are the most direct options.
> 
> A more onerous option is to write some data to a file in HDFS, use 
> LocalResource to ship it to the container on each node and get application 
> code to read that file locally. (In MRv1 parlance that is "Distributed 
> Cache").
> 
> hth,
> Arun
> 
> On Jan 29, 2014, at 12:59 PM, Brian C. Huffman  
> wrote:
> 
>> I'm looking at Distributed Shell as an example for writing a YARN 
>> application.
>> 
>> My question is why are the script path and associated metadata saved as 
>> environment variables?  Are there any other ways besides environment 
>> variables or command line arguments for passing data from the Client to the 
>> ApplicationMaster?
>> 
>> Thanks,
>> Brian
>> 
>> 
>> 
> 
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
> 
> -- 
> Jay Vyas
> http://jayunit100.blogspot.com



Re: shifting sequenceFileOutput format to Avro format

2014-01-30 Thread AnilKumar B
Thanks Yong.

Thanks & Regards,
B Anil Kumar.


On Fri, Jan 31, 2014 at 12:44 AM, java8964  wrote:

> In Avro, you need to think about a schema to match your data. Avro's
> schema is very flexible and should be able to store all kinds of data.
>
> If you have a JSON string, you have 2 options for generating the Avro schema
> for it:
>
> 1) Use "type": "string" to store the whole JSON string in Avro. This is the
> easiest, but you have to parse the data later when you use it.
> 2) Use an Avro schema that matches your JSON data, using the corresponding
> Avro structures such as record, array, and map.
>
> Yong
>
> --
> Date: Fri, 31 Jan 2014 00:13:59 +0530
> Subject: shifting sequenceFileOutput format to Avro format
> From: akumarb2...@gmail.com
> To: user@hadoop.apache.org
>
>
> Hi,
>
> As of now in my jobs, I am using SequenceFileOutputFormat and I am
> emitting custom Java objects as MR output.
>
> Now I am planning to emit the output in Avro format. I went through a few
> blogs but still have the following doubts.
>
> 1) My current custom Writable objects have nested JSON format as their
> toString(). So when I shift to Avro format, should I just emit the JSON
> string in Avro format, instead of the custom Writable object?
>
> 2) If so, how can I create the schema? My JSON string is nested and will have
> random key/value pairs.
>
> 3) Or can I still emit them as custom objects?
>
>
>
> Thanks & Regards,
> B Anil Kumar.
>


RE: shifting sequenceFileOutput format to Avro format

2014-01-30 Thread java8964
In Avro, you need to think about a schema to match your data. Avro's schema is 
very flexible and should be able to store all kinds of data.

If you have a JSON string, you have 2 options for generating the Avro schema 
for it:

1) Use "type": "string" to store the whole JSON string in Avro. This is the 
easiest, but you have to parse the data later when you use it.
2) Use an Avro schema that matches your JSON data, using the corresponding Avro 
structures such as record, array, and map.
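
As a rough illustration of the two options in Java (the schema and field names below 
are made up):

  // Option 1: wrap the whole JSON string in a single string field.
  Schema wrapped = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"JsonWrapper\",\"fields\":["
      + "{\"name\":\"json\",\"type\":\"string\"}]}");
  GenericRecord r1 = new GenericData.Record(wrapped);
  r1.put("json", "{\"user\":\"anil\",\"counts\":[1,2,3]}");

  // Option 2: model the structure explicitly with record/array/map types.
  Schema typed = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"user\",\"type\":\"string\"},"
      + "{\"name\":\"counts\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},"
      + "{\"name\":\"attrs\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");
  GenericRecord r2 = new GenericData.Record(typed);
  r2.put("user", "anil");
  r2.put("counts", Arrays.asList(1, 2, 3));
  r2.put("attrs", Collections.singletonMap("source", "mr"));

If the keys really are random, option 1 (or a top-level map type) is usually the 
practical choice, since a record schema needs the field names up front.
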
Yong

Date: Fri, 31 Jan 2014 00:13:59 +0530
Subject: shifting sequenceFileOutput format to Avro format
From: akumarb2...@gmail.com
To: user@hadoop.apache.org

Hi,
As of now in my jobs, I am using SequenceFileOutputFormat and I am emitting 
custom Java objects as MR output.

Now I am planning to emit the output in Avro format. I went through a few blogs 
but still have the following doubts.

1) My current custom Writable objects have nested JSON format as their 
toString(). So when I shift to Avro format, should I just emit the JSON string 
in Avro format, instead of the custom Writable object?

2) If so, how can I create the schema? My JSON string is nested and will have 
random key/value pairs.

3) Or can I still emit them as custom objects?


Thanks & Regards,
B Anil Kumar.

  

shifting sequenceFileOutput format to Avro format

2014-01-30 Thread AnilKumar B
Hi,

As of now in my jobs, I am using SequenceFileOutputFormat and I am emitting
custom Java objects as MR output.

Now I am planning to emit the output in Avro format. I went through a few blogs
but still have the following doubts.

1) My current custom Writable objects have nested JSON format as their
toString(). So when I shift to Avro format, should I just emit the JSON string
in Avro format, instead of the custom Writable object?

2) If so, how can I create the schema? My JSON string is nested and will have
random key/value pairs.

3) Or can I still emit them as custom objects?



Thanks & Regards,
B Anil Kumar.


Re: Capture Directory Context in Hadoop Mapper

2014-01-30 Thread Felix Chern
MultipleInputs is nice. Most of the time, I use it for reduce-side join.
It's great; however, you'll need to specify a different Mapper class per input 
directory.
In our case, we want the Mapper itself to capture the directory 
information, because these directories might contain
data across months, and the file structures may differ a bit over time.
Finally, this is the solution I came up with, and it's fun to hack on 
lower-level APIs. :D
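
For reference, the simplest variant of this (grabbing the path straight off the split 
inside the Mapper) looks roughly like the sketch below, assuming ordinary 
FileInputFormat splits; the class and field names are made up:

  public class DirAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text sourceDir = new Text();

    @Override
    protected void setup(Context context) {
      // Works when the split is a FileSplit, i.e. the usual FileInputFormat case
      // (CombineFileInputFormat and friends wrap their splits differently).
      Path file = ((FileSplit) context.getInputSplit()).getPath();
      sourceDir.set(file.getParent().toString());  // directory this split came from
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Branch on sourceDir here to apply directory-specific logic.
      context.write(sourceDir, value);
    }
  }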

Anyway, thanks for the suggestion!

Felix

On Jan 29, 2014, at 10:15 PM, Harsh J  wrote:

> Hi,
> 
> These posts are nicely written - thanks for sharing! Have you also
> taken a look at the MultipleInputs feature, which gives you a cleaner
> approach? 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
> 
> On Thu, Jan 30, 2014 at 4:16 AM, Felix Chern  wrote:
>> Hi all,
>> 
>> I wrote a tutorial on how to retrieve path information in the Mapper class. It's
>> useful in our Hadoop use case, where we need to apply different logic to
>> different input source directories. Enjoy!
>> 
>> http://www.idryman.org/blog/2014/01/26/capture-directory-context-in-hadoop-mapper/
>> http://www.idryman.org/blog/2014/01/27/capture-path-info-in-hadoop-inputformat-class/
>> 
>> Felix
> 
> 
> 
> -- 
> Harsh J



Re: DistributedCache deprecated

2014-01-30 Thread Amit Mittal
Hi Prav,

You are correct, thanks for the explanation. As per the link below, I can see
that Job's methods internally call DistributedCache itself (
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapreduce/Job.java#Job.addCacheFile%28java.net.URI%29)
after ensuring the job state; I think that might be the reason. Here is one of
the methods:

  public void addCacheFile(URI uri) {
    ensureState(JobState.DEFINE);
    DistributedCache.addCacheFile(uri, conf);
  }


Thanks
Amit


On Thu, Jan 30, 2014 at 6:19 PM, praveenesh kumar wrote:

> Hi Amit,
>
> Side data distribution is altogether a different concept. It's when
> you set custom (key, value) pairs using the Job object, so that
> you can use them in your mappers/reducers. It is good when you want to pass
> some small information to your mappers/reducers, like extra command-line
> arguments that they require.
> We were not discussing side data distribution at all.
>
> The question was that DistributedCache is deprecated, and where we can find the
> equivalent methods that DistributedCache delivers.
> If you see the DistributedCache class in MR v1 -
>
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
>
> and compare it with Job class in MR v2 -
>
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>
> You will see that the methods of the DistributedCache class have been added to
> the Job class. Since DistributedCache is deprecated, my guess was that we can use
> the Job class to use the distributed cache through the same methods which
> DistributedCache used to provide.
>
> Everything else is the same; it's just that you use the Job class to set your files
> for the distributed cache inside your job configuration. Well, I am sorry, I
> don't have any nice article; as I said, I also did this as part of my own
> experiments and I was able to use it without any issues, so that's why I
> suggested it.
>

Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit,

Side data distribution is altogether a different concept. It's when
you set custom (key, value) pairs using the Job object, so that
you can use them in your mappers/reducers. It is good when you want to pass
some small information to your mappers/reducers, like extra command-line
arguments that they require.
We were not discussing side data distribution at all.
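
For completeness, a minimal sketch of that side-data style (the key name is made up):

  // Driver: stash a small value in the job configuration.
  Configuration conf = new Configuration();
  conf.set("myapp.threshold", "42");
  Job job = Job.getInstance(conf, "example");

  // Mapper/Reducer: read it back from the task-side configuration.
  int threshold = context.getConfiguration().getInt("myapp.threshold", 0);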

The question was that DistributedCache is deprecated, and where we can find the
equivalent methods that DistributedCache delivers.
If you see the DistributedCache class in MR v1 -
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html

and compare it with Job class in MR v2 -
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html

You will see that the methods of the DistributedCache class have been added to
the Job class. Since DistributedCache is deprecated, my guess was that we can use
the Job class to use the distributed cache through the same methods which
DistributedCache used to provide.

Everything else is the same; it's just that you use the Job class to set your files
for the distributed cache inside your job configuration. Well, I am sorry, I
don't have any nice article; as I said, I also did this as part of my own
experiments and I was able to use it without any issues, so that's why I
suggested it.
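
A small sketch of what that looks like with the Job methods (paths and link names are 
illustrative):

  // Driver side: the MRv2 counterparts of DistributedCache.addCacheFile/addCacheArchive.
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "cache example");
  job.addCacheFile(new URI("/apps/lookup/cities.txt#cities"));  // '#cities' sets the link name
  job.addCacheArchive(new URI("/apps/lookup/geo.zip#geo"));
  // Inside a task, context.getCacheFiles() / context.getCacheArchives() return the same URIs.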

Since most developers are still using MRv1 on Hadoop 2.0, these changes have
not come into the spotlight so far. I am hoping new documentation on how to use
MRv2 will come soon, but if you understand MRv1, I don't see any reason why you
can't just move around a bit in the API and find the relevant classes you want
to use by yourself. Again, as I said, I don't have any authoritative statements
for what I am saying; they are just the results of my own experiments, which
you are most welcome to conduct and play with. Happy coding!

Regards
Prav




On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal  wrote:

> Hi Prav,
>
> Yes, you are correct that DistributedCache does not upload the file into
> memory. Also, using the job configuration and DistributedCache are 2 different
> approaches. I am referring to "Hadoop: The Definitive Guide",
> Chapter 8 > Side Data Distribution (pages 288-295).
> Since you are saying that the methods of DistributedCache have now moved to Job,
> could you please share an article or document on that for my better
> understanding? It would be a great help.
>
> Thanks
> Amit
>
>
> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar wrote:
>
>> Hi Amit,
>>
>> I am not sure how they are linked with DistributedCache. Job
>> configuration does not upload any data into memory. As far as I am aware of
>> how DistributedCache works, nothing gets loaded into memory. The distributed cache
>> just copies the files to the slave nodes, so that they are accessible to
>> mappers/reducers. Usually the location is
>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
>> distribution to distribution). You always have to read the files in your
>> mapper or reducer whenever you want to use them.
>>
>> What has happened is that the methods of the DistributedCache class have now
>> been added to the Job class, and I am assuming they won't change the functionality
>> of how the distributed cache methods used to work, otherwise there would have
>> been some nice articles on that; plus I don't see any reason for changing
>> that either, so everything still works the same way. It's just that
>> you use the new Job class to use the distributed cache features.
>>
>> I am not sure what entries you are exactly pointing to. Am I missing
>> anything here?
>>
>>
>> Regards
>> Prav
>>
>>
>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal wrote:
>>
>>> Hi Mike & Prav,
>>>
>>> Although I am new to Hadoop, I would like to add my 2 cents if that
>>> helps.
>>> We have 2 ways of distributing shared data: one is using the job
>>> configuration and the other is DistributedCache.
>>> The job configuration is read by the JT, TT and child JVMs, and each time
>>> the configuration is read, all of its entries are read into memory, even if
>>> they are not used. So using the job configuration is not advised if the data is
>>> more than a few kilobytes, and it is not an alternative to DistributedCache
>>> unless some modifications are made in the job configuration to address this
>>> limitation.
>>> So I am also curious to know the alternative to the DistributedCache class.
>>>
>>> Thanks
>>> Amit
>>>
>>>
>>>
>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>> michael.giord...@vistronix.com> wrote:
>>>
  I noticed that in Hadoop 2.2.0
 org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.



 (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)



 Is there a class that provides equivalent functionality? My application
 relies heavily on DistributedCache.



 Thanks,

 Mike G.


Re: DistributedCache deprecated

2014-01-30 Thread Amit Mittal
Hi Prav,

Yes, you are correct that DistributedCache does not upload the file into
memory. Also, using the job configuration and DistributedCache are 2 different
approaches. I am referring to "Hadoop: The Definitive Guide",
Chapter 8 > Side Data Distribution (pages 288-295).
Since you are saying that the methods of DistributedCache have now moved to Job,
could you please share an article or document on that for my better
understanding? It would be a great help.

Thanks
Amit


On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar wrote:

> Hi Amit,
>
> I am not sure how they are linked with DistributedCache. Job
> configuration does not upload any data into memory. As far as I am aware of
> how DistributedCache works, nothing gets loaded into memory. The distributed cache
> just copies the files to the slave nodes, so that they are accessible to
> mappers/reducers. Usually the location is
> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
> distribution to distribution). You always have to read the files in your
> mapper or reducer whenever you want to use them.
>
> What has happened is that the methods of the DistributedCache class have now
> been added to the Job class, and I am assuming they won't change the functionality
> of how the distributed cache methods used to work, otherwise there would have
> been some nice articles on that; plus I don't see any reason for changing
> that either, so everything still works the same way. It's just that
> you use the new Job class to use the distributed cache features.
>
> I am not sure what entries you are exactly pointing to. Am I missing
> anything here?
>
>
> Regards
> Prav
>
>
> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal wrote:
>
>> Hi Mike & Prav,
>>
>> Although I am new to Hadoop, I would like to add my 2 cents if that
>> helps.
>> We have 2 ways of distributing shared data: one is using the job
>> configuration and the other is DistributedCache.
>> The job configuration is read by the JT, TT and child JVMs, and each time
>> the configuration is read, all of its entries are read into memory, even if
>> they are not used. So using the job configuration is not advised if the data is
>> more than a few kilobytes, and it is not an alternative to DistributedCache
>> unless some modifications are made in the job configuration to address this
>> limitation.
>> So I am also curious to know the alternative to the DistributedCache class.
>>
>> Thanks
>> Amit
>>
>>
>>
>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>> michael.giord...@vistronix.com> wrote:
>>
>>>  I noticed that in Hadoop 2.2.0
>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>
>>>
>>>
>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>
>>>
>>>
>>> Is there a class that provides equivalent functionality? My application
>>> relies heavily on DistributedCache.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mike G.
>>>
>>>
>>
>>
>


Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit,

I am not sure how they are linked with DistributedCache. Job configuration
does not upload any data into memory. As far as I am aware of how
DistributedCache works, nothing gets loaded into memory. The distributed cache
just copies the files to the slave nodes, so that they are accessible to
mappers/reducers. Usually the location is
${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
distribution to distribution). You always have to read the files in your
mapper or reducer whenever you want to use them.
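
A rough sketch of that reading side with the MRv2 API, assuming a file was registered 
on the driver with job.addCacheFile() using a "#cities" link-name fragment, so that it 
is symlinked as "cities" in the task's working directory:

  public class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> lookup = new HashMap<String, Integer>();

    @Override
    protected void setup(Context context) throws IOException {
      // The cached file has been localized on this node and symlinked into the
      // task's working directory under its link name (a tab-separated file here).
      try (BufferedReader in = new BufferedReader(new FileReader("cities"))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t");
          lookup.put(parts[0], Integer.parseInt(parts[1]));
        }
      }
      // map() can then use the in-memory lookup table.
    }
  }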

What has happened is that the methods of the DistributedCache class have now been
added to the Job class, and I am assuming they won't change the functionality
of how the distributed cache methods used to work, otherwise there would have
been some nice articles on that; plus I don't see any reason for changing
that either, so everything still works the same way. It's just that
you use the new Job class to use the distributed cache features.

I am not sure what entries you are exactly pointing to. Am I missing
anything here?


Regards
Prav


On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal  wrote:

> Hi Mike & Prav,
>
> Although I am new to Hadoop, I would like to add my 2 cents if that
> helps.
> We have 2 ways of distributing shared data: one is using the job
> configuration and the other is DistributedCache.
> The job configuration is read by the JT, TT and child JVMs, and each time
> the configuration is read, all of its entries are read into memory, even if
> they are not used. So using the job configuration is not advised if the data is
> more than a few kilobytes, and it is not an alternative to DistributedCache
> unless some modifications are made in the job configuration to address this
> limitation.
> So I am also curious to know the alternative to the DistributedCache class.
>
> Thanks
> Amit
>
>
>
> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
> michael.giord...@vistronix.com> wrote:
>
>>  I noticed that in Hadoop 2.2.0
>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>
>>
>>
>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>
>>
>>
>> Is there a class that provides equivalent functionality? My application
>> relies heavily on DistributedCache.
>>
>>
>>
>> Thanks,
>>
>> Mike G.
>>
>>
>
>