Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Deepak Nettem
Hi,

Is it possible to get the 'id' of the currently executing split or block
from within the mapper? Using this block Id / split id, I want to be able
to query the namenode to get the names of hosts having that block / split,
and the actual path to the data.

I need this for some analytics that I'm doing. Is there a client API that
allows doing this?  If not, what's the best way to do this?

Best,
Deepak Nettem


Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Mohit Anchlia
I think if you called getInputFormat on JobConf and then called getSplits,
you would at least get the locations.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html
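For reference, a minimal driver-side sketch of that approach with the old
org.apache.hadoop.mapred API (the class name and the command-line input path
are hypothetical, not from this thread):

import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class PrintSplitLocations {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Hypothetical input path passed on the command line.
    FileInputFormat.addInputPath(conf, new Path(args[0]));

    // Defaults to TextInputFormat unless the job sets something else.
    InputFormat<?, ?> inputFormat = conf.getInputFormat();

    // The second argument is only a hint for the number of splits.
    for (InputSplit split : inputFormat.getSplits(conf, 1)) {
      System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
    }
  }
}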

On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem deepaknet...@gmail.com wrote:

 Hi,

 Is it possible to get the 'id' of the currently executing split or block
 from within the mapper? Using this block Id / split id, I want to be able
 to query the namenode to get the names of hosts having that block / split,
 and the actual path to the data.

 I need this for some analytics that I'm doing. Is there a client API that
 allows doing this?  If not, what's the best way to do this?

 Best,
 Deepak Nettem



Re: Creating and working with temporary file in a map() function

2012-04-08 Thread Ondřej Klimpera
Thanks for your advice, File.createTempFile() works great, at least in
pseudo-distributed mode; I hope the cluster setup will behave the same way. You
saved me hours of trying...



On 04/07/2012 11:29 PM, Harsh J wrote:

MapReduce sets mapred.child.tmp for all tasks to be the Task
Attempt's WorkingDir/tmp automatically. This also sets the
-Djava.io.tmpdir prop for each task at JVM boot.

Hence you may use the regular Java API to create a temporary file:
http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

These files would also be automatically deleted away after the task
attempt is done.
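As a minimal, illustrative mapper sketch along those lines (class name and
key/value types are assumptions, not from this thread):

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TempFileMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Lands under java.io.tmpdir, which MapReduce points at the task
    // attempt's working directory via mapred.child.tmp.
    File tmp = File.createTempFile("scratch-", ".tmp");
    PrintWriter out = new PrintWriter(tmp, "UTF-8");
    try {
      out.println(value.toString());
    } finally {
      out.close();
    }
    // ... read the file back and do whatever per-record work is needed ...
    // The file is cleaned up along with the attempt's working directory.
    context.write(new Text(key.toString()), new Text(tmp.getAbsolutePath()));
  }
}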

On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:

Hello,

I would like to ask you if it is possible to create and work with a
temporary file while in a map function.

I suppose that the map function is running on a single node in the Hadoop cluster.
So what is a safe way to create a temporary file and read from it within one
map() run? If it is possible, is there a size limit for the file?

The file cannot be created before the Hadoop job is created. I need to create
and process the file inside map().

Thanks for your answer.

Ondrej Klimpera.







Job, JobConf, and Configuration.

2012-04-08 Thread JAX
Hi guys. Just a theoretical question here: I notice in chapter 1 of the
Hadoop O'Reilly book that the new API example has *no* Configuration object.

Why is that? 

I thought the new API still uses / needs a Configuration class when running 
jobs.



Jay Vyas 
MMSB
UCHC

On Apr 7, 2012, at 4:29 PM, Harsh J ha...@cloudera.com wrote:

 MapReduce sets mapred.child.tmp for all tasks to be the Task
 Attempt's WorkingDir/tmp automatically. This also sets the
 -Djava.io.tmpdir prop for each task at JVM boot.
 
 Hence you may use the regular Java API to create a temporary file:
 http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)
 
 These files would also be automatically deleted away after the task
 attempt is done.
 
 On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:
 Hello,
 
 I would like to ask you if it is possible to create and work with a
 temporary file while in a map function.
 
 I suppose that the map function is running on a single node in the Hadoop cluster.
 So what is a safe way to create a temporary file and read from it within one
 map() run? If it is possible, is there a size limit for the file?

 The file cannot be created before the Hadoop job is created. I need to create
 and process the file inside map().
 
 Thanks for your answer.
 
 Ondrej Klimpera.
 
 
 
 -- 
 Harsh J


Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Harsh J
Deepak

On Sun, Apr 8, 2012 at 9:46 PM, Deepak Nettem deepaknet...@gmail.com wrote:
 Hi,

 Is it possible to get the 'id' of the currently executing split or block
 from within the mapper? Using this block Id / split id, I want to be able
 to query the namenode to get the names of hosts having that block / split,
 and the actual path to the data.

You can get the list of host locations for the current Mapper's split
item via: https://gist.github.com/2339170 (or generally from a
FileSystem object via https://gist.github.com/2339181)

You can't get block IDs via any publicly supported APIs, so getting
the local block file path isn't an available option either.
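A rough sketch of the same idea (the gists themselves aren't reproduced here;
the class name and the FileInputFormat-based job are assumptions) using the new
API's context.getInputSplit() plus FileSystem.getFileBlockLocations():

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LocationAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Hosts that hold the data backing this mapper's split.
    FileSplit split = (FileSplit) context.getInputSplit();
    System.out.println("Split hosts: " + Arrays.toString(split.getLocations()));

    // The same information queried from a FileSystem object, block by block.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FileStatus status = fs.getFileStatus(split.getPath());
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, split.getStart(), split.getLength());
    for (BlockLocation block : blocks) {
      System.out.println(block.getOffset() + "+" + block.getLength()
          + " -> " + Arrays.toString(block.getHosts()));
    }
  }
}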

 I need this for some analytics that I'm doing. Is there a client API that
 allows doing this?  If not, what's the best way to do this?

There are some ways to go about it (I certainly wouldn't consider it
impossible), but I'm curious what your 'analytics' is and how it
correlates with needing block IDs and actual block file paths -
because your problem may also be solvable by other, readily
available means.

-- 
Harsh J


Re: Job, JobConf, and Configuration.

2012-04-08 Thread Harsh J
The Job class encapsulates the Configuration object and manages it for
you. You can also get its reference out via Job.getConfiguration() -
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/JobContext.html#getConfiguration()

Hence, when you do Job job = new Job();, the internal
Configuration object is auto-created for you. This is how the
underlying constructor looks:

public Job() throws IOException {
  this(new Configuration());
}
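A short illustrative sketch of both ways of wiring up the Configuration
(names are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobConfigExample {
  public static void main(String[] args) throws Exception {
    // No-arg constructor: the Configuration is created internally ...
    Job job = new Job();
    // ... and remains reachable through the Job whenever you need it.
    job.getConfiguration().set("mapred.child.tmp", "./tmp");

    // Or build the Configuration first and hand it to the Job yourself.
    Configuration conf = new Configuration();
    Job namedJob = new Job(conf, "word count");
  }
}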

On Mon, Apr 9, 2012 at 12:24 AM, JAX jayunit...@gmail.com wrote:
 Hi guys. Just a theoretical question here: I notice in chapter 1 of the
 Hadoop O'Reilly book that the new API example has *no* Configuration object.

 Why is that?

 I thought the new API still uses / needs a Configuration class when running 
 jobs.



 Jay Vyas
 MMSB
 UCHC

 On Apr 7, 2012, at 4:29 PM, Harsh J ha...@cloudera.com wrote:

 MapReduce sets mapred.child.tmp for all tasks to be the Task
 Attempt's WorkingDir/tmp automatically. This also sets the
 -Djava.io.tmpdir prop for each task at JVM boot.

 Hence you may use the regular Java API to create a temporary file:
 http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

 These files would also be automatically deleted away after the task
 attempt is done.

 On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:
 Hello,

 I would like to ask you if it is possible to create and work with a
 temporary file while in a map function.

 I suppose that the map function is running on a single node in the Hadoop cluster.
 So what is a safe way to create a temporary file and read from it within one
 map() run? If it is possible, is there a size limit for the file?

 The file cannot be created before the Hadoop job is created. I need to create
 and process the file inside map().

 Thanks for your answer.

 Ondrej Klimpera.



 --
 Harsh J



-- 
Harsh J


Re: Creating and working with temporary file in a map() function

2012-04-08 Thread Harsh J
It will work. Pseudo-distributed mode shouldn't be all that different
from a fully distributed mode. Do let us know if it does not work as
intended.

On Sun, Apr 8, 2012 at 11:40 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:
 Thanks for your advice, File.createTempFile() works great, at least in
 pseudo-distributed mode; I hope the cluster setup will behave the same way. You
 saved me hours of trying...



 On 04/07/2012 11:29 PM, Harsh J wrote:

 MapReduce sets mapred.child.tmp for all tasks to be the Task
 Attempt's WorkingDir/tmp automatically. This also sets the
 -Djava.io.tmpdir prop for each task at JVM boot.

 Hence you may use the regular Java API to create a temporary file:

 http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

 These files would also be automatically deleted away after the task
 attempt is done.

 On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:

 Hello,

 I would like to ask you if it is possible to create and work with a
 temporary file while in a map function.

 I suppose that the map function is running on a single node in the Hadoop
 cluster. So what is a safe way to create a temporary file and read from it
 within one map() run? If it is possible, is there a size limit for the file?

 The file cannot be created before the Hadoop job is created. I need to
 create and process the file inside map().

 Thanks for your answer.

 Ondrej Klimpera.







-- 
Harsh J


Re: Creating and working with temporary file in a map() function

2012-04-08 Thread Ondřej Klimpera
I will, but deploying the application on a cluster is still far off; I'm just
finishing the raw implementation. Cluster tuning is planned for the end of
this month.


Thanks.

On 04/08/2012 09:06 PM, Harsh J wrote:

It will work. Pseudo-distributed mode shouldn't be all that different
from a fully distributed mode. Do let us know if it does not work as
intended.

On Sun, Apr 8, 2012 at 11:40 PM, Ondřej Klimpera klimp...@fit.cvut.cz  wrote:

Thanks for your advice, File.createTempFile() works great, at least in
pseudo-distributed mode; I hope the cluster setup will behave the same way. You
saved me hours of trying...



On 04/07/2012 11:29 PM, Harsh J wrote:

MapReduce sets mapred.child.tmp for all tasks to be the Task
Attempt's WorkingDir/tmp automatically. This also sets the
-Djava.io.tmpdir prop for each task at JVM boot.

Hence you may use the regular Java API to create a temporary file:

http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

These files would also be automatically deleted away after the task
attempt is done.

On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote:

Hello,

I would like to ask you if it is possible to create and work with a
temporary file while in a map function.

I suppose that the map function is running on a single node in the Hadoop
cluster. So what is a safe way to create a temporary file and read from it
within one map() run? If it is possible, is there a size limit for the file?

The file cannot be created before the Hadoop job is created. I need to
create and process the file inside map().

Thanks for your answer.

Ondrej Klimpera.










Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread JAX
I have a related question about blocks related to this... Normally, a reduce
job outputs several files, all in the same directory.

But why? Since we know that Hadoop is abstracting our file for us, shouldn't
the part-r-* outputs ultimately be thought of as a single file?

What is the correspondence between the

part-r-00000
part-r-00001
...

outputs from a reducer, and the native blocks stored by HDFS (if any)?

Jay Vyas 
MMSB
UCHC

On Apr 8, 2012, at 2:00 PM, Harsh J ha...@cloudera.com wrote:

 Deepak
 
 On Sun, Apr 8, 2012 at 9:46 PM, Deepak Nettem deepaknet...@gmail.com wrote:
 Hi,
 
 Is it possible to get the 'id' of the currently executing split or block
 from within the mapper? Using this block Id / split id, I want to be able
 to query the namenode to get the names of hosts having that block / split,
 and the actual path to the data.
 
 You can get the list of host locations for the current Mapper's split
 item via: https://gist.github.com/2339170 (or generally from a
 FileSystem object via https://gist.github.com/2339181)
 
 You can't get block IDs via any publicly supported APIs, so getting
 the local block file path isn't an available option either.
 
 I need this for some analytics that I'm doing. Is there a client API that
 allows doing this?  If not, what's the best way to do this?
 
 There are some ways to go about it (I certainly wouldn't consider it
 impossible), but I'm curious what your 'analytics' is and how it
 correlates with needing block IDs and actual block file paths -
 because your problem may also be solvable by other, readily
 available means.
 
 -- 
 Harsh J


Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Harsh J
Hi,

The "part" in the default filename stands for "partition". In some
cases I agree you would not mind viewing them as a single file
instead of having to read directories - but there are also use cases
where you would want each partition file to be unique, because you
partitioned and processed them that way.

In any case, because HDFS lists files in sorted order, you can use
fs -getmerge to get them out as one file if that suits your
application. There is also an on-HDFS concat feature in Hadoop 2.x
(formerly 0.23.x).
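For reference, FileUtil.copyMerge() is roughly the programmatic counterpart of
fs -getmerge; a small sketch with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Concatenates the plain files under the output directory (the
    // part-r-* files, in sorted order) into one file; both paths here
    // are made up for illustration.
    FileUtil.copyMerge(fs, new Path("/user/jay/job-output"),
        fs, new Path("/user/jay/job-output-merged"),
        false /* keep the source files */, conf, null);
  }
}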

On Mon, Apr 9, 2012 at 2:04 AM, JAX jayunit...@gmail.com wrote:
 I have a related question about blocks related to this... Normally, a reduce
 job outputs several files, all in the same directory.

 But why? Since we know that Hadoop is abstracting our file for us, shouldn't
 the part-r-* outputs ultimately be thought of as a single file?

 What is the correspondence between the

 part-r-00000
 part-r-00001
 ...

 outputs from a reducer, and the native blocks stored by HDFS (if any)?

 Jay Vyas
 MMSB
 UCHC

 On Apr 8, 2012, at 2:00 PM, Harsh J ha...@cloudera.com wrote:

 Deepak

 On Sun, Apr 8, 2012 at 9:46 PM, Deepak Nettem deepaknet...@gmail.com wrote:
 Hi,

 Is it possible to get the 'id' of the currently executing split or block
 from within the mapper? Using this block Id / split id, I want to be able
 to query the namenode to get the names of hosts having that block / split,
 and the actual path to the data.

 You can get the list of host locations for the current Mapper's split
 item via: https://gist.github.com/2339170 (or generally from a
 FileSystem object via https://gist.github.com/2339181)

 You can't get block IDs via any publicly supported APIs, so getting
 the local block file path isn't an available option either.

 I need this for some analytics that I'm doing. Is there a client API that
 allows doing this?  If not, what's the best way to do this?

 There are some ways to go about it (I certainly wouldn't consider it
 impossible), but I'm curious what your 'analytics' is and how it
 correlates with needing block IDs and actual block file paths -
 because your problem may also be solvable by other, readily
 available means.

 --
 Harsh J



-- 
Harsh J


How do I include the newer version of Commons-lang in my jar?

2012-04-08 Thread Sky

Hi.

I am new to Hadoop and I am working on a project on AWS Elastic MapReduce.

The problem I am facing is:
* org.apache.commons.lang.time.DateUtils: parseDate() works OK but
parseDateStrictly() fails.
I think parseDateStrictly() might be new in commons-lang 2.5. I thought I included
all the dependencies. However, for some reason, at runtime, my app is not
picking up the newer commons-lang.


Would love some help.

Thx
- sky
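A quick way to check which commons-lang actually wins at runtime is to print
where the class was loaded from; a small diagnostic sketch (the class name is
made up):

import org.apache.commons.lang.time.DateUtils;

public class WhichCommonsLang {
  public static void main(String[] args) {
    // Shows the jar DateUtils was actually loaded from at runtime.
    System.out.println(DateUtils.class.getProtectionDomain()
        .getCodeSource().getLocation());
    // parseDateStrictly() only exists from commons-lang 2.5 onward, so an
    // older jar earlier on the classpath would explain the failure above.
  }
}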