Can we split a big gzipped file on HDFS?

2011-06-22 Thread Mapred Learn
Hi, If I have a big gzipped text file (~60 GB) in HDFS, can I split it into smaller chunks (~1 GB) so that I can run a map-red job on those files and finish faster than running the job on one big file? Thanks, -JJ
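
A single gzip file is not splittable, so one mapper ends up reading the whole ~60 GB. A minimal sketch of one workaround, assuming a line-oriented text payload and placeholder command-line paths: decompress the file once and rewrite it as smaller chunks rolled at line boundaries, which a job can then process in parallel. (Recompressing each chunk, or converting to block-compressed SequenceFiles, are variations on the same idea.)

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class GzipRechunk {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);               // the big .gz file on HDFS
        Path outDir = new Path(args[1]);           // directory for the ~1 GB chunks
        long chunkBytes = 1024L * 1024L * 1024L;   // ~1 GB of uncompressed text per chunk

        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(in);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
            codec.createInputStream(fs.open(in)), "UTF-8"));
        int part = 0;
        long written = 0;
        BufferedWriter writer = null;
        String line;
        while ((line = reader.readLine()) != null) {
          if (writer == null || written >= chunkBytes) {
            if (writer != null) writer.close();    // roll to a new chunk at a line boundary
            writer = new BufferedWriter(new OutputStreamWriter(
                fs.create(new Path(outDir, "part-" + part++)), "UTF-8"));
            written = 0;
          }
          writer.write(line);
          writer.newLine();
          written += line.length() + 1;            // rough byte count, good enough for chunk sizing
        }
        if (writer != null) writer.close();
        reader.close();
      }
    }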

Re: controlling no. of mapper tasks

2011-06-22 Thread Sudharsan Sampath
Hi Allen, The number of map tasks is driven by the number of splits of the input provided. The configuration for 'number of map tasks' is only a hint and will be honored only if the value is more than the number of input splits. If it's less, then the latter takes higher precedence. But as a hack/w
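
A minimal sketch of what that means in code (class name is mine): the hint can raise the map count above the natural split count, but it cannot lower it; to actually get fewer maps, the split size is the lever to change (see the min-split-size sketch later in this digest).

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
      public static void apply(JobConf conf) {
        // Equivalent to setting "mapred.map.tasks": honored only when it would produce
        // more maps than the input splits already do.
        conf.setNumMapTasks(200);
      }
    }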

Fwd: How to load a sequence file with decimal data to Hive?

2011-06-22 Thread Mapred Learn
In case anybody has any input: Sent from my iPhone Begin forwarded message: > From: Mapred Learn > Date: June 22, 2011 6:21:03 PM PDT > To: "u...@hive.apache.org" > Subject: How to load a sequence file with decimal data to Hive? > > Hi, > I have a sequence file where I have delimited da

Re: Algorithm for cross product

2011-06-22 Thread Lance Norskog
If you have scaling problems, check out the Mahout project. They are all about distributed scalable linear algebra & more. http://mahout.apache.org Lance On Wed, Jun 22, 2011 at 5:13 PM, Jason wrote: > I remember I had a similar problem. > The way I approached it was by partitioning one of the d

Re: Algorithm for cross product

2011-06-22 Thread Jason
I remember I had a similar problem. The way I approached it was by partitioning one of the data sets. At a high level, these are the steps: Suppose you decide to partition set A. Each partition represents a subset/range of the A keys and must be small enough to fit its records in memory. Each partit
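
A rough reduce-side sketch of this idea (not the poster's exact code; the "A\t"/"B\t" source tags and the class name are mine): mappers tag each value with its source under the shared key, and the reducer pairs every A value with every B value for that key.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CrossProductReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        // Both sides are buffered here only for brevity; with a value-to-key secondary
        // sort, only the (small, partitioned) A side needs to stay in memory.
        List<String> aSide = new ArrayList<String>();
        List<String> bSide = new ArrayList<String>();
        while (values.hasNext()) {
          String v = values.next().toString();
          if (v.startsWith("A\t")) {
            aSide.add(v.substring(2));
          } else {
            bSide.add(v.substring(2));
          }
        }
        for (String a : aSide) {
          for (String b : bSide) {
            out.collect(key, new Text(a + "\t" + b));  // one output record per (a, b) pair for this key
          }
        }
      }
    }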

Re: Is there any way for the reducer to determine the total number of reduce tasks?

2011-06-22 Thread Matei Zaharia
You can implement the configure() method of the Reducer interface and look at the properties in the JobConf. In particular, "mapred.reduce.tasks" is the number of reduce tasks and "mapred.job.tracker" will be set to "local" when running in local mode. Matei On Jun 22, 2011, at 3:12 PM, Steve L
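
A minimal sketch of that, assuming the old (mapred) API; the class and field names are mine:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class ClusterAwareBase extends MapReduceBase {
      protected int numReduceTasks;
      protected boolean localMode;

      @Override
      public void configure(JobConf job) {
        numReduceTasks = job.getNumReduceTasks();                   // backed by "mapred.reduce.tasks"
        localMode = "local".equals(job.get("mapred.job.tracker"));  // the local runner sets this to "local"
      }
    }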

Algorithm for cross product

2011-06-22 Thread Steve Lewis
Assume I have two data sources, A and B, and assume I have an input format that can generate key-value pairs for both A and B. I want an algorithm that will generate the cross product of all values in A having the key K and all values in B having the key K. Currently I use a mapper to generate key values for A
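
A mapper-side companion to the reduce-side sketch under Jason's reply above (the key<TAB>value record layout and the class name are placeholder assumptions): tag each A record with its source so the reducer can separate the two inputs; a second, analogous mapper would tag B records with "B".

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TagAMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        String[] fields = record.toString().split("\t", 2);     // placeholder: key<TAB>value records
        out.collect(new Text(fields[0]), new Text("A\t" + fields[1]));  // prefix marks the source
      }
    }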

Is there any way for the reducer to determine the total number of reduce tasks?

2011-06-22 Thread Steve Lewis
Also, is there a good way in code to determine whether a job is running on a cluster or in local mode? I want certain debugging information to log only in local mode -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com

Re: hdfs reformat confirmation message

2011-06-22 Thread Joey Echeverria
You could pipe 'yes' to the hadoop command: yes | hadoop namenode -format -Joey On Wed, Jun 22, 2011 at 4:46 PM, Virajith Jalaparti wrote: > Hi, > > When I try to reformat HDFS (I have to do this multiple times for some experiment I > need to run), it asks for a confirmation Y/N. Is there a way to disa

Re: hdfs reformat confirmation message

2011-06-22 Thread Harsh J
Simply do a "yes Y | hadoop namenode -format". On Thu, Jun 23, 2011 at 2:16 AM, Virajith Jalaparti wrote: > Hi, > > When I try to reformat HDFS (I have to do this multiple times for some experiment I > need to run), it asks for a confirmation Y/N. Is there a way to disable this > in HDFS/hadoop? I am try

Re: tasktracker maximum map tasks for a certain job

2011-06-22 Thread Jonathan Zukerman
Allen & Matt - After reading this link (which redirects me to http://wiki.apache.org/hadoop/LimitingTaskSlotUsage), and also http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html and http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html, it seems that all I need to do is w

Re: hdfs reformat confirmation message

2011-06-22 Thread Marcos Ortiz
Well, I think that would be a nice feature to have: hadoop namenode -reformat -y What do you think? Can you add it to the HDFS's Jira? Regards On 6/22/2011 4:46 PM, Virajith Jalaparti wrote: Hi, When I try to reformat HDFS (I have to do this multiple times for some experiment I need to run), it ask

hdfs reformat confirmation message

2011-06-22 Thread Virajith Jalaparti
Hi, When I try to reformat HDFS (I have to do this multiple times for some experiment I need to run), it asks for a confirmation Y/N. Is there a way to disable this in HDFS/Hadoop? I am trying to automate my process, and pressing Y every time I do this is just a lot of manual work. Thanks, Virajith

Re: how to get output files of fixed size in map-reduce job output

2011-06-22 Thread Mapred Learn
The problem with the first option is that even if the file is uploaded as 1 GB, the output is still not 1 GB (it would depend on compression). So, some runs need to be done to estimate what size the input file should be uploaded as to get 1 GB output. For block size, I got your point. I think I said the same thing

Re: how to get output files of fixed size in map-reduce job output

2011-06-22 Thread Harsh J
CombineFileInputFormat should help with getting some locality, but it would not be as good as having the file loaded into HDFS itself with a 1 GB block size (block sizes are per-file properties, not global ones). You may consider that as an alternative approach. I do not get (ii). I meant by my
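
A small sketch of the per-file block size approach (paths come from the command line and the class name is mine; this assumes the classic dfs.block.size client property): setting the property in the client Configuration applies to files written by that client, without changing the cluster default or existing files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadWithBigBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 1024L * 1024L * 1024L);  // 1 GB blocks for files this client creates
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));  // local source -> HDFS destination
      }
    }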

Re: how to get output files of fixed size in map-reduce job output

2011-06-22 Thread Mapred Learn
Hi Harsh, Thanks! i) I am currently doing it by extending CombineFileInputFormat and specifying -Dmapred.max.split.size, but this increases job finish time by about 3 times. ii) Since you said this output file size is going to be greater than the block size in this case, what happens in case when peop

Re: how to get output files of fixed size in map-reduce job output

2011-06-22 Thread Harsh J
Mapred, This should be doable if you are using TextInputFormat (or other FileInputFormat derivatives that do not override getSplits() behaviors). Try this: jobConf.setLong("mapred.min.split.size", ); This would get you splits worth the size you mention, 1 GB or else, and you should have outputs
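
The value in the setLong call above is elided in the archive; a sketch that plugs in the 1 GB target from this thread (the exact number is mine, taken from the stated use case, not from Harsh's original mail):

    import org.apache.hadoop.mapred.JobConf;

    public class OneGbSplits {
      public static void configure(JobConf jobConf) {
        // ~1 GB per split, so each map reads (and, roughly, writes) about 1 GB,
        // before any output compression.
        jobConf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);
      }
    }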

how to get output files of fixed size in map-reduce job output

2011-06-22 Thread Mapred Learn
I have a use case where I want to process data and generate seq file output of fixed size, say 1 GB, i.e. each map-reduce job output should be 1 GB. Does anybody know of any -D option or any other way to achieve this? -Thanks JJ

Re: Large startup time in remote MapReduce job

2011-06-22 Thread Allen Wittenauer
On Jun 22, 2011, at 10:08 AM, Allen Wittenauer wrote: > > On Jun 21, 2011, at 2:02 PM, Harsh J wrote: If your jar does not contain code changes that need to get transmitted every time, you can consider placing them on the JT/TT classpaths >>> >>> ... which means you get to

Re: Large startup time in remote MapReduce job

2011-06-22 Thread Allen Wittenauer
On Jun 21, 2011, at 2:02 PM, Harsh J wrote: >>> >>> If your jar does not contain code changes that need to get transmitted >>> every time, you can consider placing them on the JT/TT classpaths >> >>... which means you get to bounce your system every time you change >> code. > > Its ugl

Re: controlling no. of mapper tasks

2011-06-22 Thread Allen Wittenauer
On Jun 20, 2011, at 12:24 PM, wrote: > Hi there, > I know the client can set "mapred.reduce.tasks" to specify the no. of reduce tasks > and Hadoop honours it, but "mapred.map.tasks" is not honoured by Hadoop. Is > there any way to control the number of map tasks? What I noticed is that Hadoop > is choo

Re: tasktracker maximum map tasks for a certain job

2011-06-22 Thread Allen Wittenauer
On Jun 21, 2011, at 9:52 AM, Jonathan Zukerman wrote: > Hi, > > Is there a way to set the maximum map tasks for all tasktrackers in my > cluster for a certain job? > Most of my tasktrackers are configured to handle 4 maps concurrently, and > most of my jobs don't care where the map function

Re: Parallelize a workflow using mapReduce

2011-06-22 Thread Hassen Riahi
Thanks for the reply! On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote: Hi all, I'm looking to parallelize a workflow using mapReduce. The workflow can be summarized as following: 1- Specify the list of paths of binary files to process in a configuration file (let's call this config

Re: Parallelize a workflow using mapReduce

2011-06-22 Thread Hassen Riahi
Thanks Bobby for the reply! Please find comments inline. If your input file is a list of paths, each one with \n at the end, then a TextInputFormat would split them for you. I would write it something like the following Mapper { Void map(Long offset, String path, collector) { Path p = n

Re: Parallelize a workflow using mapReduce

2011-06-22 Thread Bibek Paudel
On Wed, Jun 22, 2011 at 5:00 PM, Bibek Paudel wrote: > On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote: >> Hi all, >> >> I'm looking to parallelize a workflow using mapReduce. The workflow can be >> summarized as following: >> >> 1- Specify the list of paths of binary files to process in a co

Re: Parallelize a workflow using mapReduce

2011-06-22 Thread Bibek Paudel
On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote: > Hi all, > > I'm looking to parallelize a workflow using mapReduce. The workflow can be > summarized as following: > > 1- Specify the list of paths of binary files to process in a configuration > file (let's call this configuration file CONFIG)

Re: Parallelize a workflow using mapReduce

2011-06-22 Thread Robert Evans
If your input file is a list of paths, each one with \n at the end, then a TextInputFormat would split them for you. I would write it something like the following: Mapper { Void map(Long offset, String path, collector) { Path p = new Path(path); FileSystem fs = p.getFileSystem(getConf());
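
A filled-out version of that sketch, for reference (the old mapred API is assumed; the per-file processing step, the "processed" output record, and the class name are placeholders, not Robert's actual code):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PathListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private JobConf conf;

      @Override
      public void configure(JobConf job) {
        this.conf = job;   // old-API stand-in for the getConf() call in the sketch above
      }

      public void map(LongWritable offset, Text pathLine,
                      OutputCollector<Text, Text> collector, Reporter reporter) throws IOException {
        Path p = new Path(pathLine.toString().trim());   // each input line is one HDFS path
        FileSystem fs = p.getFileSystem(conf);
        InputStream in = fs.open(p);
        try {
          // ... read the binary file and run the per-file processing the workflow needs ...
          collector.collect(pathLine, new Text("processed"));   // placeholder output record
        } finally {
          in.close();
        }
      }
    }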

Re: mapreduce and python

2011-06-22 Thread Hassen Riahi
I'm trying these solutions... Thanks for the suggestions. I'd like to +1 to using Dumbo for all things Python and Hadoop MapReduce. It's one of the better ways to do things. Do look at the initial conversation here: http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.htm

Parallelize a workflow using mapReduce

2011-06-22 Thread Hassen Riahi
Hi all, I'm looking to parallelize a workflow using mapReduce. The workflow can be summarized as follows: 1- Specify the list of paths of the binary files to process in a configuration file (let's call this configuration file CONFIG). These binary files are stored in HDFS. This list of path

Re: Large startup time in remote MapReduce job

2011-06-22 Thread John Armstrong
On Wed, 22 Jun 2011 00:15:56 +0200, Gabor Makrai wrote: > Fortunately, DistributedCache solved my problem! I put a jar file into > HDFS which contains the necessary classes for the job, and I used this: > *DistributedCache.addFileToClassPath(new Path("/myjar/myjar.jar"), conf);* Can I ask which ver

Re: Map job hangs indefinitely

2011-06-22 Thread Sudharsan Sampath
Hi Devraj, I attached the files so that it is easier for anyone to run it and simulate the issue. There are no other files required. Following are the logs from the jobtracker and the tasktracker *JobTracker* 2011-06-23 12:46:48,781 DEBUG org.apache.hadoop.mapred.JobTracker: Per-Task memory con

HDFS insertion time

2011-06-22 Thread Ondřej Nevělík
Hello, is there any way to determine the time needed to upload a file into HDFS? Thanks.
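
One simple client-side way to measure it (a sketch, not the only option; paths come from the command line and the class name is mine) is to time the copy call itself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TimedUpload {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long start = System.nanoTime();
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));  // local source -> HDFS destination
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println("Upload took " + elapsedMs + " ms");
      }
    }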