Hi,
If I have a big gzipped text file (~60 GB) in HDFS, can I split it into
smaller chunks (~1 GB) so that I can run a map-red job on those files
and finish faster than running the job on one big file?
Thanks,
-JJ
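(A sketch, not from the thread.) Since gzip is not a splittable format, one simple, serial way to pre-split such a file is to stream-decompress it and rewrite it as ~1 GB plain-text chunks; the class name, paths and chunk size below are only illustrative assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipSplitter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long chunkBytes = 1024L * 1024 * 1024;            // target ~1 GB per chunk
    Path in = new Path(args[0]);                      // the big .gz file in HDFS
    Path outDir = new Path(args[1]);                  // directory for the chunks
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(fs.open(in)), "UTF-8"));
    int part = 0;
    long written = 0;
    FSDataOutputStream out = fs.create(new Path(outDir, "part-" + part));
    String line;
    while ((line = reader.readLine()) != null) {
      byte[] bytes = (line + "\n").getBytes("UTF-8");
      if (written + bytes.length > chunkBytes) {      // roll over to the next chunk
        out.close();
        part++;
        written = 0;
        out = fs.create(new Path(outDir, "part-" + part));
      }
      out.write(bytes);
      written += bytes.length;
    }
    out.close();
    reader.close();
  }
}

The resulting chunks are uncompressed, so each one can be processed by its own map task.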
Hi Allen,
The number of map tasks is driven by the number of splits of the input
provided. The configuration for 'number of map tasks' is only a hint and
will be honored only if the value is more than the number of input splits.
If it's less, then the latter takes higher precedence.
But as a hack/w
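A minimal sketch of this in code (the class name is illustrative and the 1 GB figure is just an example):

import org.apache.hadoop.mapred.JobConf;

public class MapCountHint {
  static void configure(JobConf jobConf) {
    // Only a hint: the framework may still run more maps if the input has more splits.
    jobConf.setNumMapTasks(100);
    // To actually lower the number of maps, raise the minimum split size instead:
    jobConf.setLong("mapred.min.split.size", 1024L * 1024 * 1024); // ~1 GB splits
  }
}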
In case anybody has some inputs:
Sent from my iPhone
Begin forwarded message:
> From: Mapred Learn
> Date: June 22, 2011 6:21:03 PM PDT
> To: "u...@hive.apache.org"
> Subject: How to load a sequence file with decimal data to hive ?
>
> Hi,
> I have a sequence file where I have delimited da
If you have scaling problems, check out the Mahout project. They are
all about distributed scalable linear algebra & more.
http://mahout.apache.org
Lance
On Wed, Jun 22, 2011 at 5:13 PM, Jason wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the d
I remember I had a similar problem.
The way I approached it was by partitioning one of the data sets. At a high level,
these are the steps:
Suppose you decide to partition set A.
Each partition represents a subset/range of the A keys and must be small enough
for its records to fit in memory.
Each partit
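A rough sketch of what one partition's job might look like (the class name, the "cross.a.partition" property and the tab-separated record format are assumptions, not Jason's actual code): load the A partition into memory in configure(), stream B through the mappers, and emit the cross product for matching keys.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrossProductMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // key of A -> all A values in the current partition (fits in memory by construction)
  private final Map<String, List<String>> aPartition = new HashMap<String, List<String>>();

  @Override
  public void configure(JobConf conf) {
    try {
      // "cross.a.partition" is an assumed property naming this job's A partition file
      Path p = new Path(conf.get("cross.a.partition"));
      FileSystem fs = p.getFileSystem(conf);
      BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(p)));
      String line;
      while ((line = r.readLine()) != null) {
        String[] kv = line.split("\t", 2);          // assume "key<TAB>value" records
        if (kv.length < 2) continue;
        List<String> vals = aPartition.get(kv[0]);
        if (vals == null) {
          vals = new ArrayList<String>();
          aPartition.put(kv[0], vals);
        }
        vals.add(kv[1]);
      }
      r.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // Input records are from B, also "key<TAB>value"; emit one pair per matching A value.
  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] kv = value.toString().split("\t", 2);
    if (kv.length < 2) return;
    List<String> aVals = aPartition.get(kv[0]);
    if (aVals == null) return;
    for (String aVal : aVals) {
      out.collect(new Text(kv[0]), new Text(aVal + "\t" + kv[1]));
    }
  }
}

Running one such job per A partition and concatenating the outputs would give the full per-key cross product.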
You can implement the configure() method of the Reducer interface and look at
the properties in the JobConf. In particular, "mapred.reduce.tasks" is the
number of reduce tasks and "mapred.job.tracker" will be set to "local" when
running in local mode.
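For example, a minimal sketch (the class and field names are just illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;

public abstract class LocalModeAwareReducer<K1, V1, K2, V2>
    extends MapReduceBase implements Reducer<K1, V1, K2, V2> {

  protected int numReduceTasks;
  protected boolean localMode;

  @Override
  public void configure(JobConf conf) {
    numReduceTasks = conf.getInt("mapred.reduce.tasks", 1);
    // The LocalJobRunner sets mapred.job.tracker to "local"
    localMode = "local".equals(conf.get("mapred.job.tracker"));
  }
}

Subclasses can then guard their debug logging with the localMode flag.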
Matei
On Jun 22, 2011, at 3:12 PM, Steve L
Assume I have two data sources A and B
Assume I have an input format and can generate key values for both A and B
I want an algorithm which will generate the cross product of all values in A
having the key K and all values in B having the
key K.
Currently I use a mapper to generate key values for A
Also, is there a good way in code to determine whether a job is running on a
cluster or in local mode?
I want certain debugging information to be logged only in local mode.
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
You could pipe 'yes' to the hadoop command:
yes | hadoop namenode -format
-Joey
On Wed, Jun 22, 2011 at 4:46 PM, Virajith Jalaparti
wrote:
> Hi,
>
> When I try to reformat HDFS (I have to do this multiple times for some experiments I
> need to run), it asks for a confirmation Y/N. Is there a way to disa
Simply do a "yes Y | hadoop namenode -format".
On Thu, Jun 23, 2011 at 2:16 AM, Virajith Jalaparti
wrote:
> Hi,
>
> When I try to reformat HDFS (I have to do this multiple times for some experiments I
> need to run), it asks for a confirmation Y/N. Is there a way to disable this
> in HDFS/hadoop? I am try
Allen & Matt - After reading this link (which redirects me to
http://wiki.apache.org/hadoop/LimitingTaskSlotUsage), and also
http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html and
http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html, it
seems that all I need to do is w
Well, I think that would be a nice feature to have:
hadoop namenode -reformat -y
What do you think?
Can you add it to the HDFS JIRA?
Regards
On 6/22/2011 4:46 PM, Virajith Jalaparti wrote:
Hi,
When I try to reformat HDFS (I have to do this multiple times for some
experiment I need to run), it ask
Hi,
When I try to reformat HDFS (I have to do this multiple times for some
experiments I need to run), it asks for a confirmation Y/N. Is there a
way to disable this in HDFS/hadoop? I am trying to automate my process,
and pressing Y every time I do this is just a lot of manual work.
Thanks,
Virajith
The problem with the first option is that even if the file is uploaded as 1 GB,
the output is still not 1 GB (it would depend on the compression). So, some runs
need to be done to estimate what size the input file should be uploaded at to
get 1 GB output.
For block size, I got your point. I think I said the same thing
CombineFileInputFormat should help with doing some locality, but it
would not be as perfect as having the file loaded to the HDFS itself
with a 1 GB block size (block sizes are per file properties, not
global ones). You may consider that as an alternative approach.
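For illustration, a per-file block size can be given when the file is created; a small sketch (the path handling and sizes are just examples, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OneGbBlockUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 1024L * 1024 * 1024;                 // 1 GB blocks for this file only
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    FSDataOutputStream out = fs.create(new Path(args[0]),
        true, bufferSize, fs.getDefaultReplication(), blockSize);
    // ... write the data here ...
    out.close();
  }
}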
I do not get (ii). I meant by my
Hi Harsh,
Thanks !
i) I am currently doing this by extending CombineFileInputFormat and
specifying -Dmapred.max.split.size, but this increases job finish time by
about 3 times.
ii) Since you said the file output size is going to be greater than the block
size in this case, what happens in the case when peop
Mapred,
This should be doable if you are using TextInputFormat (or other
FileInputFormat derivatives that do not override getSplits()
behaviors).
Try this:
jobConf.setLong("mapred.min.split.size", 1073741824L); // ~1 GB, the size you mention
This would get you splits of the size you mention (1 GB or otherwise),
and you should have outputs
I have a use case where I want to process data and generate sequence file output
of a fixed size, say 1 GB, i.e. each map-reduce job's output should be 1 GB.
Does anybody know of any -D option or any other way to achieve this ?
-Thanks JJ
On Jun 22, 2011, at 10:08 AM, Allen Wittenauer wrote:
>
> On Jun 21, 2011, at 2:02 PM, Harsh J wrote:
>>>> If your jar does not contain code changes that need to get transmitted
>>>> every time, you can consider placing them on the JT/TT classpaths
>>>
>>> ... which means you get to
On Jun 21, 2011, at 2:02 PM, Harsh J wrote:
>>>
>>> If your jar does not contain code changes that need to get transmitted
>>> every time, you can consider placing them on the JT/TT classpaths
>>
>>... which means you get to bounce your system every time you change
>> code.
>
> Its ugl
On Jun 20, 2011, at 12:24 PM,
wrote:
> Hi there,
> I know the client can send "mapred.reduce.tasks" to specify the no. of reduce tasks
> and Hadoop honours it, but "mapred.map.tasks" is not honoured by Hadoop. Is
> there any way to control the number of map tasks? What I noticed is that Hadoop
> is choo
On Jun 21, 2011, at 9:52 AM, Jonathan Zukerman wrote:
> Hi,
>
> Is there a way to set the maximum map tasks for all tasktrackers in my
> cluster for a certain job?
> Most of my tasktrackers are configured to handle 4 maps concurrently, and
> most of my jobs don't care where the map function
Thanks for the reply!
On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi
wrote:
Hi all,
I'm looking to parallelize a workflow using mapReduce. The workflow can be
summarized as follows:
1- Specify the list of paths of binary files to process in a configuration
file (let's call this config
Thanks Bobby for the reply! Please find comments inline.
If your input file is a list of paths, each one with \n at the end, then a
TextInputFormat would split them for you.
I would write something like the following:
Mapper {
Void map(Long offset, String path, collector) {
Path p = n
On Wed, Jun 22, 2011 at 5:00 PM, Bibek Paudel wrote:
> On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote:
>> Hi all,
>>
>> I'm looking to parallelize a workflow using mapReduce. The workflow can be
>> summarized as follows:
>>
>> 1- Specify the list of paths of binary files to process in a co
On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote:
> Hi all,
>
> I'm looking to parallelize a workflow using mapReduce. The workflow can be
> summarized as follows:
>
> 1- Specify the list of paths of binary files to process in a configuration
> file (let's call this configuration file CONFIG)
If your input file is a list of paths, each one with \n at the end, then a
TextInputFormat would split them for you.
I would write something like the following:
Mapper {
Void map(Long offset, String path, collector) {
Path p = new Path(path);
FileSystem fs = p.getFileSystem(getConf());
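A more complete version of that sketch might look like the following (the class name, output types and the byte-counting "processing" are placeholders; the getConf() above is taken here from configure()):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BinaryPathMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private JobConf conf;

  @Override
  public void configure(JobConf job) {
    this.conf = job;
  }

  // Each input value is one line of the path list; open the binary file it
  // names and process it (here we just count its bytes as a stand-in).
  public void map(LongWritable offset, Text pathLine,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    Path p = new Path(pathLine.toString().trim());
    FileSystem fs = p.getFileSystem(conf);
    FSDataInputStream in = fs.open(p);
    long bytes = 0;
    byte[] buf = new byte[64 * 1024];
    int n;
    try {
      while ((n = in.read(buf)) != -1) {
        bytes += n;                 // real processing of the binary data goes here
        reporter.progress();        // keep the task alive on large files
      }
    } finally {
      in.close();
    }
    out.collect(pathLine, new LongWritable(bytes));
  }
}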
I'm trying these solutions... Thanks for the suggestions.
I'd like to +1 to using Dumbo for all things Python and Hadoop
MapReduce. It's one of the better ways to do things.
Do look at the initial conversation here:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.htm
Hi all,
I'm looking to parallelize a workflow using mapReduce. The workflow
can be summarized as follows:
1- Specify the list of paths of binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. This list of path
On Wed, 22 Jun 2011 00:15:56 +0200, Gabor Makrai
wrote:
> Fortunately, DistributedCache solved my problem! I put a jar file to
> HDFS, which contains the necessary classes for the job, and I used this:
> DistributedCache.addFileToClassPath(new Path("/myjar/myjar.jar"), conf);
Can I ask which ver
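For reference, a minimal sketch of the quoted approach (the standalone class and the local jar name are assumptions; only the HDFS path comes from the quote):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AddJarToClasspath {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path jarOnHdfs = new Path("/myjar/myjar.jar");
    // Copy the jar to HDFS once (the local file name here is assumed)...
    fs.copyFromLocalFile(new Path("myjar.jar"), jarOnHdfs);
    // ...then add it to the task classpath at job-submission time.
    DistributedCache.addFileToClassPath(jarOnHdfs, conf);
    // submit the job with this conf
  }
}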
Hi Devraj,
I attached the files so that it is easier for anyone to run it and simulate
the issue. There are no other files required.
following are the logs from the jobtracker and the tasktracker
*JobTracker*
2011-06-23 12:46:48,781 DEBUG org.apache.hadoop.mapred.JobTracker: Per-Task
memory con
Hello,
is there any way to determine the time needed to upload a file into HDFS?
Thanks.
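One simple approach (just a sketch, not an official metric) is to time the copy yourself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimedUpload {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long start = System.currentTimeMillis();
    fs.copyFromLocalFile(new Path(args[0]), new Path(args[1])); // local src, HDFS dst
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.println("Upload took " + elapsedMs + " ms");
  }
}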