Creating MapFile.Reader instance in reducer setup

2012-06-19 Thread Ondřej Klimpera
Hello, I'm trying to use a MapFile (stored on HDFS) in my reduce task, 
which processes some text data.


When I try to initialize a MapFile.Reader in the reducer's configure() method, 
the app throws a NullPointerException, but when the same approach is used in 
each reduce() call with the same parameters, everything works fine.


But creating a Reader instance for every reduce() call slows things down 
considerably.


Do you have any idea what I am doing wrong?

Thanks
Ondrej Klimpera




Re: Creating MapFile.Reader instance in reducer setup

2012-06-19 Thread Ondřej Klimpera

Hello,

Sorry, my mistake. Problem solved.

On 06/19/2012 03:40 PM, Devaraj k wrote:

Can you share the exception stack trace and the piece of code where you are 
trying to create it?


Thanks
Devaraj


From: Ondřej Klimpera [klimp...@fit.cvut.cz]
Sent: Tuesday, June 19, 2012 6:03 PM
To: common-user@hadoop.apache.org
Subject: Creating MapFile.Reader instance in reducer setup

Hello, I'm trying to use a MapFile (stored on HDFS) in my reduce task,
which processes some text data.

When I try to initialize a MapFile.Reader in the reducer's configure() method,
the app throws a NullPointerException, but when the same approach is used in
each reduce() call with the same parameters, everything works fine.

But creating a Reader instance for every reduce() call slows things down
considerably.

Do you have any idea what I am doing wrong?

Thanks
Ondrej Klimpera
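
A minimal sketch of the approach discussed in this thread: open the MapFile.Reader once in the reducer's configure() and reuse it across reduce() calls. This assumes the old API (0.20.x); the HDFS path and value types are illustrative, not from the original mails.

// Sketch only: "/lookup/my-mapfile" is an illustrative MapFile directory.
private MapFile.Reader reader;

public void configure(JobConf job) {
    try {
        FileSystem fs = FileSystem.get(job);
        // the directory holding the MapFile's "data" and "index" parts
        reader = new MapFile.Reader(fs, "/lookup/my-mapfile", job);
    } catch (IOException e) {
        throw new RuntimeException("Cannot open MapFile", e);
    }
}

public void close() throws IOException {
    if (reader != null) {
        reader.close();
    }
}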




Re: Setting number of mappers according to number of TextInput lines

2012-06-17 Thread Ondřej Klimpera
Hi, I made some progress: a combination of NLineInputFormat and 
mapred.max.split.size seems to work, but it is hard to set the exact 
byte value. Input lines are roughly 64 to 1024 bytes each.


What I need is as many mappers as possible (to use the full potential 
of the cluster), where each one receives N input lines.



On 06/17/2012 05:02 AM, Harsh J wrote:

Ondřej,

While NLineInputFormat will indeed give you N lines per task, it does
not guarantee that the N map tasks that come out for a file from it
will all be sent to different nodes. Which one is your need exactly -
Simply having N lines per map task, or N wider distributed maps?

On Sat, Jun 16, 2012 at 3:01 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

I tried this approach, but the job is not distributed among 10 mapper nodes.
It seems Hadoop ignores this property :(

My first thought is that the small file size is the problem and Hadoop
doesn't split it properly.

Thanks for any ideas.



On 06/16/2012 11:27 AM, Bejoy KS wrote:

Hi Ondrej

You can use NLineInputFormat with n set to 10.

--Original Message--
From: Ondřej Klimpera
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Setting number of mappers according to number of TextInput lines
Sent: Jun 16, 2012 14:31

Hello,

I have a very small input size (a few kB), but producing the output
takes several minutes. Is there a way to say: the file has 100 lines, I
need 10 mappers, and each mapper node has to process 10 lines of the
input file?

Thanks for advice.
Ondrej Klimpera


Regards
Bejoy KS

Sent from handheld, please excuse typos.
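
A minimal sketch of the NLineInputFormat setup being discussed (old API, 0.20.x/1.x). The driver class and input path are illustrative assumptions; classes come from org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib.

// Sketch only: MyDriver and the input path are placeholders.
JobConf conf = new JobConf(MyDriver.class);
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// N input lines per map task
conf.setInt("mapred.line.input.format.linespermap", 10);
FileInputFormat.setInputPaths(conf, new Path("/input/small-file.txt"));

Note that N lines per task does not by itself guarantee that the 10 tasks land on 10 different nodes; as Harsh points out earlier in the thread, placement is up to the scheduler.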








Setting number of mappers according to number of TextInput lines

2012-06-16 Thread Ondřej Klimpera

Hello,

I have a very small input size (a few kB), but producing the output 
takes several minutes. Is there a way to say: the file has 100 lines, I 
need 10 mappers, and each mapper node has to process 10 lines of the 
input file?


Thanks for advice.
Ondrej Klimpera


Re: Setting number of mappers according to number of TextInput lines

2012-06-16 Thread Ondřej Klimpera
I tried this approach, but the job is not distributed among 10 mapper 
nodes. It seems Hadoop ignores this property :(


My first thought is that the small file size is the problem and Hadoop 
doesn't split it properly.


Thanks for any ideas.


On 06/16/2012 11:27 AM, Bejoy KS wrote:

Hi Ondrej

You can use NLineInputFormat with n set to 10.

--Original Message--
From: Ondřej Klimpera
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Setting number of mappers according to number of TextInput lines
Sent: Jun 16, 2012 14:31

Hello,

I have a very small input size (a few kB), but producing the output
takes several minutes. Is there a way to say: the file has 100 lines, I
need 10 mappers, and each mapper node has to process 10 lines of the
input file?

Thanks for advice.
Ondrej Klimpera


Regards
Bejoy KS

Sent from handheld, please excuse typos.





Dealing with low space cluster

2012-06-14 Thread Ondřej Klimpera

Hello,

we're testing an application on 8 nodes, where each node has 20 GB of local 
storage available. What we are trying to achieve is to process more than 
20 GB on this cluster.


Is there a way to distribute the data across the cluster?

There is also one shared NFS storage disk with 1 TB of available space, 
which is currently unused.


Thanks for your reply.

Ondrej Klimpera


Re: HADOOP_HOME deprecated

2012-06-14 Thread Ondřej Klimpera
Thanks for your reply. It would be great to mention this in the tutorial 
on your web site. Is the name of the HADOOP_PREFIX/HOME/INSTALL variable 
crucial to Hadoop, or is setting it just for the user's benefit?


Thanks for the reply.

On 06/14/2012 07:46 AM, Harsh J wrote:

Hi Ondřej,

Due to a new packaging format, Apache Hadoop 1.x has deprecated
the HADOOP_HOME env-var in favor of a new env-var called
'HADOOP_PREFIX'. You can set HADOOP_PREFIX, or set
HADOOP_HOME_WARN_SUPPRESS in your environment to a non-empty value to
suppress the warning.

On Thu, Jun 14, 2012 at 11:11 AM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello, why is there always a warning when running Hadoop that the HADOOP_HOME
shell variable is deprecated? How should I set the installation directory on
cluster nodes, and which variable is correct?

Thanks

Ondrej Klimpera




Re: Dealing with low space cluster

2012-06-14 Thread Ondřej Klimpera

Hello,

you're right. That's exactly what I meant, and your answer is exactly 
what I thought. I was just wondering if Hadoop can distribute the data 
to other nodes' local storage when a node's own local space is full.


Thanks

On 06/14/2012 03:38 PM, Harsh J wrote:

Ondřej,

If by processing you mean trying to write out (map outputs) > 20 GB of
data per map task, that may not be possible, as the outputs need to be
materialized and the disk space is the constraint there.

Or did I not understand you correctly (in thinking you are asking
about MapReduce)? Cause you otherwise have ~50 GB space available for
HDFS consumption (assuming replication = 3 for proper reliability).

On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

we're testing an application on 8 nodes, where each node has 20 GB of local
storage available. What we are trying to achieve is to process more than 20 GB
on this cluster.

Is there a way to distribute the data across the cluster?

There is also one shared NFS storage disk with 1 TB of available space, which
is currently unused.

Thanks for your reply.

Ondrej Klimpera







Re: Dealing with low space cluster

2012-06-14 Thread Ondřej Klimpera

Thanks, I'll try.

One more question: I've got a few more nodes that can be added to the 
cluster. How do I do that?


If I understand it correctly (according to Hadoop's wiki pages):

1. On the master node, edit the slaves file and add the IP addresses of the 
new nodes (this part is clear).

2. Log in to each newly added node and run (this is clear to me too):

$ hadoop-daemon.sh start datanode
$ hadoop-daemon.sh start tasktracker

3. Now I'm not sure: since I'm not using dfs.include/mapred.include, do I 
have to run these commands?


$ hadoop dfsadmin -refreshNodes
$ hadoop mradmin -refreshNodes

If yes, must they be run on the master node or on the new slave nodes?

Ondrej



On 06/14/2012 04:03 PM, Harsh J wrote:

Ondřej,

That isn't currently possible with local storage FS. Your 1 TB NFS
point can help but I suspect it may act as a slow-down point if nodes
use it in parallel. Perhaps mount it only on 3-4 machines (or less),
instead of all, to avoid that?

On Thu, Jun 14, 2012 at 7:28 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

you're right. That's exactly what I meant, and your answer is exactly what I
thought. I was just wondering if Hadoop can distribute the data to other
nodes' local storage when a node's own local space is full.

Thanks


On 06/14/2012 03:38 PM, Harsh J wrote:

Ondřej,

If by processing you mean trying to write out (map outputs) > 20 GB of
data per map task, that may not be possible, as the outputs need to be
materialized and the disk space is the constraint there.

Or did I not understand you correctly (in thinking you are asking
about MapReduce)? Cause you otherwise have ~50 GB space available for
HDFS consumption (assuming replication = 3 for proper reliability).

On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

we're testing an application on 8 nodes, where each node has 20 GB of local
storage available. What we are trying to achieve is to process more than
20 GB on this cluster.

Is there a way how to distribute the data on the cluster?

There is also one shared NFS storage disk with 1 TB of available space,
which is currently unused.

Thanks for your reply.

Ondrej Klimpera










How Hadoop splits TextInput?

2012-06-13 Thread Ondřej Klimpera

Hello,

I'd like to ask how Hadoop splits text input if its size is smaller than 
the HDFS block size.


I'm testing an application that produces large outputs from small inputs.

When using the NInputSplits input format and setting the number of splits in 
mapred-conf.xml, some results are lost while the output is being written.


When the app runs with the default TextInputFormat, everything is OK.

Do you have any idea where the problem might be?

Thanks for your answer.
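
For reference, a sketch of the split-size arithmetic the old-API FileInputFormat (0.20.x) uses, which explains why a file smaller than one HDFS block normally becomes a single split unless the map-count hint or minimum split size pushes it otherwise. The class name and the sample values are illustrative, and exception handling is omitted.

// Sketch of FileInputFormat's split sizing in the old API (0.20.x).
public final class SplitSizeSketch {
    // numSplits is the mapred.map.tasks hint, minSize is mapred.min.split.size
    static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, numSplits);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        // A 100 KB file, default 64 MB block, no map-count hint: one split of 102400 bytes.
        System.out.println(splitSize(100L * 1024, 1, 1, 64L * 1024 * 1024));
    }
}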


HADOOP_HOME deprecated

2012-06-13 Thread Ondřej Klimpera
Hello, why is there always a warning when running Hadoop that the HADOOP_HOME 
shell variable is deprecated? How should I set the installation directory on 
cluster nodes, and which variable is correct?


Thanks

Ondrej Klimpera


Re: Getting job progress in java application

2012-04-30 Thread Ondřej Klimpera

Thanks a lot, I checked the docs and the submitJob() method did the job.

Two more questions, please :)

[1] My app is running on Hadoop 0.20.203. If I upgrade the libraries to 
1.0.x, will the old API still work, or is it necessary to rewrite the map() 
and reduce() functions for the new API?


[2] Does the new API support MultipleOutputs?

Thanks again.



On 04/30/2012 12:32 AM, Bill Graham wrote:

Take a look at the JobClient API. You can use that to get the current
progress of a running job.

On Sunday, April 29, 2012, Ondřej Klimpera wrote:


Hello, I'd like to ask what the preferred way is of getting running jobs'
progress from the Java application that has executed them.

I'm using Hadoop 0.20.203. I tried the job.end.notification.url property,
which works well, but as the property name says, it only sends job-end
notifications.

What I need is to get updates on map() and reduce() progress.

Please advise how to do this.

Thanks.
Ondrej Klimpera
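
A minimal sketch of the JobClient-based polling approach that resolved this thread (old API, 0.20.x): submitJob() returns control immediately, so the caller can poll the RunningJob handle. The polling interval is an illustrative choice and exception handling is omitted.

// Sketch only: conf is the fully configured JobConf for the job.
JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);   // returns immediately, unlike runJob()
while (!running.isComplete()) {
    System.out.printf("map %.0f%%, reduce %.0f%%%n",
            running.mapProgress() * 100, running.reduceProgress() * 100);
    Thread.sleep(5000);                        // poll every 5 seconds
}
System.out.println(running.isSuccessful() ? "done" : "failed");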






Getting job progress in java application

2012-04-29 Thread Ondřej Klimpera
Hello, I'd like to ask what the preferred way is of getting running jobs' 
progress from the Java application that has executed them.


I'm using Hadoop 0.20.203. I tried the job.end.notification.url property, 
which works well, but as the property name says, it only sends job-end 
notifications.


What I need is to get updates on map() and reduce() progress.

Please advise how to do this.

Thanks.
Ondrej Klimpera



Setting a timeout for one Map() input processing

2012-04-18 Thread Ondřej Klimpera
Hello, I'd like to ask whether it is possible to set a timeout for 
processing a single line of text input in the mapper function.


The idea is that if processing one line takes too long, Hadoop will cut 
this processing short and continue with the next input line.


Thank you for your answer.

Ondrej Klimpera


Re: Setting a timeout for one Map() input processing

2012-04-18 Thread Ondřej Klimpera

Thanks, I'll try to implement it and let you know if it worked.

On 04/18/2012 04:07 PM, Harsh J wrote:

Since you're looking for per-line (and not per-task/file) monitoring,
this is best done by your own application code (a monitoring thread,
etc.).

On Wed, Apr 18, 2012 at 6:09 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello, I'd like to ask whether it is possible to set a timeout for processing
a single line of text input in the mapper function.

The idea is that if processing one line takes too long, Hadoop will cut
this processing short and continue with the next input line.

Thank you for your answer.

Ondrej Klimpera
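
A sketch of the application-level watchdog Harsh suggests: run the per-line work in a worker thread and abandon it after a timeout. The processLine() helper, the 60-second limit, the counter names and the key/value types are illustrative assumptions, and the worker code must respond to interruption for the cancel to take effect.

// Sketch only (old API). Fields of the Mapper implementation:
private final ExecutorService worker = Executors.newSingleThreadExecutor();

public void map(LongWritable key, final Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    Future<String> result = worker.submit(new Callable<String>() {
        public String call() throws Exception {
            return processLine(value.toString());   // the expensive per-line work (placeholder)
        }
    });
    try {
        output.collect(new Text(value), new Text(result.get(60, TimeUnit.SECONDS)));
    } catch (TimeoutException e) {
        result.cancel(true);                         // give up on this line and move on
        reporter.incrCounter("map", "timed-out-lines", 1);
    } catch (Exception e) {
        throw new IOException(e);
    }
}

public void close() throws IOException {
    worker.shutdownNow();
}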







Re: Creating and working with temporary file in a map() function

2012-04-08 Thread Ondřej Klimpera
Thanks for your advice, File.createTempFile() works great, at least in 
pseudo-distributed mode; I hope a cluster deployment will work the same way. 
You saved me hours of trying...



On 04/07/2012 11:29 PM, Harsh J wrote:

MapReduce sets mapred.child.tmp for all tasks to be the Task
Attempt's WorkingDir/tmp automatically. This also sets the
-Djava.io.tmpdir prop for each task at JVM boot.

Hence you may use the regular Java API to create a temporary file:
http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

These files would also be automatically deleted away after the task
attempt is done.

On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

I would like to ask whether it is possible to create and work with a
temporary file inside a map function.

I suppose that a map function runs on a single node in the Hadoop cluster.
So what is a safe way to create a temporary file and read from it within one
map() run? If it is possible, is there a size limit for the file?

The file cannot be created before the Hadoop job is created; I need to create
and process the file inside map().

Thanks for your answer.

Ondrej Klimpera.
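
A minimal sketch of the java.io approach described above: because mapred.child.tmp points each task's java.io.tmpdir at the attempt's working directory, an ordinary temp file lands in task-local space and is removed with the attempt. The prefix/suffix and the write/read pattern are illustrative.

// Sketch only, inside map(): write a scratch file, use it, let the framework clean up.
File tmp = File.createTempFile("scratch", ".dat");       // lands under the task's tmp dir
BufferedWriter writer = new BufferedWriter(new FileWriter(tmp));
try {
    writer.write(value.toString());                      // 'value' is the map input value
} finally {
    writer.close();
}

BufferedReader reader = new BufferedReader(new FileReader(tmp));
try {
    String line = reader.readLine();                     // read it back within the same map() call
} finally {
    reader.close();
}
tmp.delete();   // optional; the attempt directory is deleted anyway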







Re: Creating and working with temporary file in a map() function

2012-04-08 Thread Ondřej Klimpera
I will, but deploying the application on a cluster is still a way off; I'm 
just finishing the raw implementation. Cluster tuning is planned for the end 
of this month.


Thanks.

On 04/08/2012 09:06 PM, Harsh J wrote:

It will work. Pseudo-distributed mode shouldn't be all that different
from a fully distributed mode. Do let us know if it does not work as
intended.

On Sun, Apr 8, 2012 at 11:40 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Thanks for your advice, File.createTempFile() works great, at least in
pseudo-distributed mode; I hope a cluster deployment will work the same way.
You saved me hours of trying...



On 04/07/2012 11:29 PM, Harsh J wrote:

MapReduce sets mapred.child.tmp for all tasks to be the Task
Attempt's WorkingDir/tmp automatically. This also sets the
-Djava.io.tmpdir prop for each task at JVM boot.

Hence you may use the regular Java API to create a temporary file:

http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String)

These files would also be automatically deleted away after the task
attempt is done.

On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

I would like to ask whether it is possible to create and work with a
temporary file inside a map function.

I suppose that a map function runs on a single node in the Hadoop cluster.
So what is a safe way to create a temporary file and read from it within one
map() run? If it is possible, is there a size limit for the file?

The file cannot be created before the Hadoop job is created; I need to
create and process the file inside map().

Thanks for your answer.

Ondrej Klimpera.










Re: Working with MapFiles

2012-04-02 Thread Ondřej Klimpera

Ok, thanks.

I missed the setup() method because I'm using an older version of Hadoop, so 
I assume the configure() method does the same thing in Hadoop 0.20.203.


Now I'm able to load a MapFile into a MapFile.Reader instance (held as a 
private class variable) inside the configure() method, and everything works 
fine. I'm just wondering whether the MapFile is replicated on HDFS and its 
data is read locally, or whether reading from it will increase network 
traffic because its data is fetched from another node in the Hadoop cluster.


Hopefully the last question to bother you with: is reading files from the 
DistributedCache (a normal text file) limited to a particular job?
Before running a job I add a file to the DistributedCache. When getting the 
file in the Reducer implementation, can it access DistributedCache files from 
other jobs?

In other words, what will this code list:

// Reducer impl.
public void configure(JobConf job) {
    URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
}

Will the distCacheFileUris variable contain only the URIs for this job, or 
those for any job running on the Hadoop cluster?


Hope it's understandable.
Thanks.

On 04/02/2012 11:34 AM, Ioan Eugen Stan wrote:

Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:

And one more question: is it even possible to add a MapFile (as it
consists of an index and a data file) to the Distributed cache?
Thanks


Should be no problem, they are just two files.


On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:

Hello,

I'm not sure what you mean by using the MapReduce setup() method.

If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

Can you please explain a little bit more?



Check the javadocs[1]: setup is called once per task so you can read 
the file from HDFS then or perform other initializations.


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html 



Reading 20 MB into RAM should not be a problem and is preferred if you 
need to make many requests against that data. It really depends on 
your use case, so think carefully or just go ahead and test it.




Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of a MapReduce job, and what I need to
do is:

1. If MapReduce produced multiple splits as output, merge them into a
single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering whether it is even possible to merge MapFiles, given their
nature, and use them as a Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted
keys) and a small index for that file. The MapFile does a version of
binary search to find your key and performs seek() to go to the byte
offset in the file.


What I'm trying to achieve is repeated fast lookups in this file during
another MapReduce job.
If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

The distributed cache will also use HDFS [2] and I don't think it
will provide you with any benefits.


Thanks for your reply:)

Ondrej Klimpera


[1]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html 



[2]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html 
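
On the DistributedCache scope question above: getCacheFiles(conf) reads the cache entries registered in the job's own configuration, so it lists only the files added for that job, not files cached by other jobs. A minimal sketch of reading them back in the old API; how you then open the localized files is up to the application.

// Sketch only: in the reducer, list what this job registered in its own conf.
public void configure(JobConf job) {
    try {
        URI[] uris = DistributedCache.getCacheFiles(job);        // only this job's entries
        Path[] local = DistributedCache.getLocalCacheFiles(job); // task-local copies of the same files
        // open local[i] with java.io or a local FileSystem as needed
    } catch (IOException e) {
        throw new RuntimeException("Cannot read distributed cache settings", e);
    }
}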














Re: Working with MapFiles

2012-03-30 Thread Ondřej Klimpera
Hello, I've got one more question: how is the seek() (or get()) method 
implemented in MapFile.Reader? Does it use hashCode(), compareTo(), or 
another mechanism to find a match in the MapFile's index?


Thanks for your reply.
Ondrej Klimpera

On 03/29/2012 08:26 PM, Ondřej Klimpera wrote:

Thanks for your fast reply, I'll try this approach :)

On 03/29/2012 05:43 PM, Deniz Demir wrote:
Not sure if this helps in your use case, but you can put all output 
files into the distributed cache and then access them in the subsequent 
map-reduce job (in the driver code):


// previous MR job's output
String pstr = "hdfs://output_path/";
FileStatus[] files = fs.listStatus(new Path(pstr));  // fs = FileSystem.get(job.getConfiguration())
for (FileStatus f : files) {
    if (!f.isDir()) {
        DistributedCache.addCacheFile(f.getPath().toUri(),
                job.getConfiguration());
    }
}

I think you can also copy these files to a different location in DFS 
and then put them into the distributed cache.



Deniz


On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote:


Hello,

I have a MapFile as a product of a MapReduce job, and what I need to 
do is:

1. If MapReduce produced multiple splits as output, merge them into a 
single file.

2. Copy this merged MapFile to another HDFS location and use it as a 
Distributed cache file for another MapReduce job.

I'm wondering whether it is even possible to merge MapFiles, given their 
nature, and use them as a Distributed cache file.

What I'm trying to achieve is repeated fast lookups in this file 
during another MapReduce job.

If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.

Thanks for your reply:)

Ondrej Klimpera
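
On the get()/seek() question above: MapFile.Reader keeps the sparse index in memory and binary-searches it with the key's comparator (compareTo()/WritableComparator), then seeks into the data SequenceFile and scans forward to the exact key; hashCode() is not involved, which is why keys must be written in sorted order. A lookup looks roughly like this; the key type and value are illustrative assumptions.

// Sketch only: 'reader' is a MapFile.Reader opened in configure().
Text key = new Text("some-key");
Text value = new Text();
if (reader.get(key, value) != null) {     // comparator-based lookup, returns null if absent
    // use 'value'
}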








Re: Working with MapFiles

2012-03-30 Thread Ondřej Klimpera

Hello,

I'm not sure what you mean by using the MapReduce setup() method.

If the file is that small you could load it all in memory to avoid 
network IO. Do that in the setup() method of the map reduce job.


Can you please explain a little bit more?

Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of a MapReduce job, and what I need to do 
is:

1. If MapReduce produced multiple splits as output, merge them into a 
single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering whether it is even possible to merge MapFiles, given their
nature, and use them as a Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted 
keys) and a small index for that file. The MapFile does a version of 
binary search to find your key and performs seek() to go to the byte 
offset in the file.



What I'm trying to achieve is repeated fast lookups in this file during
another MapReduce job.
If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid 
network IO. Do that in the setup() method of the map reduce job.


The distributed cache will also use HDFS [2] and I don't think it will 
provide you with any benefits.



Thanks for your reply:)

Ondrej Klimpera


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html




Re: Working with MapFiles

2012-03-30 Thread Ondřej Klimpera
And one more question: is it even possible to add a MapFile (as it 
consists of an index and a data file) to the Distributed cache?

Thanks

On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:

Hello,

I'm not sure what you mean by using the MapReduce setup() method.

If the file is that small you could load it all in memory to avoid 
network IO. Do that in the setup() method of the map reduce job.


Can you please explain a little bit more?

Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of a MapReduce job, and what I need to 
do is:

1. If MapReduce produced multiple splits as output, merge them into a 
single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering whether it is even possible to merge MapFiles, given their
nature, and use them as a Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted 
keys) and a small index for that file. The MapFile does a version of 
binary search to find your key and performs seek() to go to the byte 
offset in the file.


What I'm trying to achieve is repeated fast lookups in this file during
another MapReduce job.
If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid 
network IO. Do that in the setup() method of the map reduce job.


The distributed cache will also use HDFS [2] and I don't think it 
will provide you with any benefits.



Thanks for your reply:)

Ondrej Klimpera


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html






Working with MapFiles

2012-03-29 Thread Ondřej Klimpera

Hello,

I have a MapFile as a product of a MapReduce job, and what I need to do is:

1. If MapReduce produced multiple splits as output, merge them into a single file.

2. Copy this merged MapFile to another HDFS location and use it as a 
Distributed cache file for another MapReduce job.

I'm wondering whether it is even possible to merge MapFiles, given their 
nature, and use them as a Distributed cache file.


What I'm trying to achieve is repeated fast lookups in this file during 
another MapReduce job.

If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.

Thanks for your reply:)

Ondrej Klimpera


Re: Working with MapFiles

2012-03-29 Thread Ondřej Klimpera

Thanks for your fast reply, I'll try this approach :)

On 03/29/2012 05:43 PM, Deniz Demir wrote:

Not sure if this helps in your use case, but you can put all output files into 
the distributed cache and then access them in the subsequent map-reduce job (in 
the driver code):

// previous MR job's output
String pstr = "hdfs://output_path/";
FileStatus[] files = fs.listStatus(new Path(pstr));  // fs = FileSystem.get(job.getConfiguration())
for (FileStatus f : files) {
    if (!f.isDir()) {
        DistributedCache.addCacheFile(f.getPath().toUri(),
                job.getConfiguration());
    }
}

I think you can also copy these files to a different location in DFS and then 
put them into the distributed cache.


Deniz


On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote:


Hello,

I have a MapFile as a product of a MapReduce job, and what I need to do is:

1. If MapReduce produced multiple splits as output, merge them into a single file.

2. Copy this merged MapFile to another HDFS location and use it as a 
Distributed cache file for another MapReduce job.

I'm wondering whether it is even possible to merge MapFiles, given their 
nature, and use them as a Distributed cache file.

What I'm trying to achieve is repeated fast lookups in this file during 
another MapReduce job.
If my idea is completely wrong, can you give me a tip on how to do it?

The file is supposed to be about 20 MB.
I'm using Hadoop 0.20.203.

Thanks for your reply:)

Ondrej Klimpera






Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera

Hello,

I'm trying to develop an application where the Reducer has to produce 
multiple outputs.


Specifically, I need the Reducer to produce two types of files, each 
with different output.


I found in Hadoop: The Definitive Guide that the new API uses only 
MultipleOutputs, but working with MultipleOutputs requires a JobConf 
instance, which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job 
instance to handle job configuration).


So I'm wondering how to get MultipleOutputs working.

Can you please provide a short example or explanation?

Thanks for your reply.

Regards

Ondrej Klimpera
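
A minimal sketch of the old (stable) API MultipleOutputs, which is the route the replies in this thread recommend for 0.20.x/1.x. The driver class, the named output "stats" and the key/value types are illustrative assumptions.

// Driver: declare a second, named output alongside the default one.
JobConf conf = new JobConf(MyDriver.class);           // MyDriver is a placeholder
MultipleOutputs.addNamedOutput(conf, "stats",
        TextOutputFormat.class, Text.class, Text.class);

// Reducer (org.apache.hadoop.mapred.Reducer implementation):
private MultipleOutputs mos;

public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
}

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    output.collect(key, values.next());                                  // default output file
    mos.getCollector("stats", reporter).collect(key, new Text("..."));   // second output file
}

public void close() throws IOException {
    mos.close();   // required so the named outputs are flushed
}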


Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera
I'm using the 1.0.0 beta; I suppose it was the wrong decision to use a beta 
version. So do you recommend using 0.20.203.x and sticking to the approaches 
from Hadoop: The Definitive Guide?


Thanks for your reply

On 01/25/2012 01:41 PM, Harsh J wrote:

Oh and btw, do not fear the @deprecated 'Old' API. We have
undeprecated it in the recent stable releases, and will continue to
support it for a long time. I'd recommend using the older API, as that
is more feature complete and test covered in the version you use.

On Wed, Jan 25, 2012 at 6:09 PM, Harsh J <ha...@cloudera.com> wrote:

What version/release/distro of Hadoop are you using? Apache releases
got the new (unstable) API MultipleOutputs only in 0.21+, and was only
very recently backported to branch-1.

That said, the next release in 1.x (1.1.0, out soon) will carry the
new API MultipleOutputs, but presently no release in 0.20.xxx/1.x has
it.

I'd still recommend sticking to stable API if you are using a
0.20.x/1.x stable Apache release.

On Wed, Jan 25, 2012 at 5:13 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

I'm trying to develop an application where the Reducer has to produce multiple
outputs.

Specifically, I need the Reducer to produce two types of files, each with
different output.

I found in Hadoop: The Definitive Guide that the new API uses only
MultipleOutputs, but working with MultipleOutputs requires a JobConf instance,
which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job instance to
handle job configuration).

So I'm wondering how to get MultipleOutputs working.

Can you please provide a short example or explanation?

Thanks for your reply.

Regards

Ondrej Klimpera



--
Harsh J
Customer Ops. Engineer, Cloudera







Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera
One more question. I just downloaded Hadoop 0.20.203.0, which is considered 
the latest stable release. What about the JobConf vs. Configuration classes? 
Which should I use to avoid wrong approaches, given that JobConf seems to be 
deprecated?
Sorry for bothering you with these questions; I'm just not used to having 
deprecated things in my projects.


Thanks.


On 01/25/2012 01:46 PM, Ondřej Klimpera wrote:
I'm using the 1.0.0 beta; I suppose it was the wrong decision to use a beta 
version. So do you recommend using 0.20.203.x and sticking to the approaches 
from Hadoop: The Definitive Guide?


Thanks for your reply

On 01/25/2012 01:41 PM, Harsh J wrote:

Oh and btw, do not fear the @deprecated 'Old' API. We have
undeprecated it in the recent stable releases, and will continue to
support it for a long time. I'd recommend using the older API, as that
is more feature complete and test covered in the version you use.

On Wed, Jan 25, 2012 at 6:09 PM, Harsh J <ha...@cloudera.com> wrote:

What version/release/distro of Hadoop are you using? Apache releases
got the new (unstable) API MultipleOutputs only in 0.21+, and was only
very recently backported to branch-1.

That said, the next release in 1.x (1.1.0, out soon) will carry the
new API MultipleOutputs, but presently no release in 0.20.xxx/1.x has
it.

I'd still recommend sticking to stable API if you are using a
0.20.x/1.x stable Apache release.

On Wed, Jan 25, 2012 at 5:13 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:

Hello,

I'm trying to develop an application where the Reducer has to produce 
multiple outputs.

Specifically, I need the Reducer to produce two types of files, each 
with different output.

I found in Hadoop: The Definitive Guide that the new API uses only
MultipleOutputs, but working with MultipleOutputs requires a JobConf
instance, which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job
instance to handle job configuration).

So I'm wondering how to get MultipleOutputs working.

Can you please provide a short example or explanation?

Thanks for your reply.

Regards

Ondrej Klimpera



--
Harsh J
Customer Ops. Engineer, Cloudera