How did you try it? I had no problem with NLineInputFormat. It
just works exactly as expected.
Shi
We once calculated the cost of using EC2 to train our machine
learning model (assuming we did everything in one shot, which is
almost impossible) using the EM algorithm. The cost per model
was 10,000 US dollars. The cost of each individual node per
hour seems cheap, but when it scales
Hi,
Before raising this question I searched relevant topics. There
are suggestions online:
Mappers: output all qualifying values, each with a random
integer key.
Single reducer: output the first N values, throwing away the
keys.
However, this scheme seems not very efficient when the data
To answer my own question: I applied a non-repeating random
number generator in the mapper. At the mapper setup stage I generate
a predefined number of random numbers, then I keep a counter
as the mapper runs. When the counter value is contained in the random
number set, the mapper executes and outputs
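Roughly, it looks like this (a sketch; the class name and numbers are
illustrative, not my original code, assuming ~10,000 lines per split
and 100 samples per mapper):

import java.io.IOException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private final Set<Long> picks = new HashSet<Long>();
  private long counter = 0;

  @Override
  protected void setup(Context context) {
    // pre-draw 100 distinct line indices out of an assumed 10,000 lines per split
    Random random = new Random();
    while (picks.size() < 100) {
      picks.add((long) random.nextInt(10000));
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // emit only the lines whose running index was pre-selected
    if (picks.contains(counter++)) {
      context.write(key, value);
    }
  }
}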
If you could cross-access HDFS from both name nodes, then it should be
transferable using the distcp command.
Shi
On 5/11/2012 8:45 AM, Arindam Choudhury wrote:
Hi,
I have a question for the Hadoop experts:
I have two HDFS clusters, in different subnets.
HDFS1 : 192.168.*.*
HDFS2: 10.10.*.*
the
Is there any risk in suppressing a job for too long in the FS? I guess there are
some parameters to control the waiting time of a job (such as a timeout,
etc.). For example, if a job is kept idle for more than 24 hours, is
there a configuration deciding whether to kill or keep that job?
Shi
On 5/11/2012 6:52 AM,
Here is some quick code for you (based on Tom's book). You could
override the TextInputFormat isSplitable method to avoid splitting,
which is pretty important and useful when processing sequence data.
// Old API: returning false keeps each input file in a single split
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}
It seems in your case HDFS2 could access HDFS1, so you should be able to
transfer data from HDFS1 to HDFS2.
If you want to cross-transfer, you don't need to run distcp on cluster
nodes; if any client node (not necessarily a namenode, datanode,
secondary namenode, etc.) can access both HDFSs, then
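For example, from such a client node (a sketch; hostnames, ports, and
paths are placeholders, adjust to your clusters):

hadoop distcp hdfs://namenode1:8020/src/path hdfs://namenode2:8020/dst/path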
It depends on your use case, for example, whether you need query only
or have a requirement for real-time insert and update. The solutions
can be different.
You might need to consider HBase, Cassandra, or tools like Flume.
Flume might be suitable for your case.
https://cwiki.apache.org/FLUME/
Shi
it around might help.
If you want further analysis like Business Intelligence, then you need
to train various models.
On 5/10/2012 8:30 AM, karanveer.si...@barclays.com wrote:
I am more worried about the analysis assuming this data is in HDFS.
-Original Message-
From: Shi Yu
A quick glance at your problem indicates that you might have a design
problem in your code. In my opinion you should avoid nested Map/Reduce
jobs. You could use chained Map/Reduce, but a nested or recursive
structure is not recommended. I don't know how you implemented your
nested M/R job,
My humble experience: I would prefer specifying the files on the
command line using the -files option, then handling them explicitly in
the Mapper configure or setup function using
File f1 = new File(file1name);
File f2 = new File(file2name);
because I am not 100% sure how the distributed cache
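For example, a file shipped with -files ends up (symlinked) in the
task's working directory under its base name, so the setup function
could look roughly like this (a sketch; the file name is a placeholder):

// Sketch (new API): reading a side file shipped via "-files lookup.txt"
@Override
protected void setup(Context context) throws IOException {
  BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
  try {
    String line;
    while ((line = reader.readLine()) != null) {
      // parse the line and fill an in-memory structure here
    }
  } finally {
    reader.close();
  }
}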
It sounds like an exciting feature. Has anyone tried this in practice?
How does the hot standby namenode perform, and how reliable is the HDFS
recovery? Is it now a good time to migrate to 2.0.0, in your opinion?
Best,
Shi
Hi Harsh J,
It seems that the 20% performance loss is not that bad; at least some smart
people are still working to improve it. I will keep an eye on this interesting
trend.
Shi
Hi Todd,
Okay, that sounds really good (sorry, I didn't grab all the
information on that long page).
Shi
Tons of errors are seen after Map 100% Reduce 50%, but the job
still struggles to finish. What is the possible reason? Is
this issue fixed in any of the versions?
java.net.SocketTimeoutException: 69000 millis timeout while
waiting for channel to be ready for read. ch :
If you want to control the number of input splits at fine granularity,
you could customize the NLineInputFormat. You need to determine the
number of lines for each split. Thus you need to know beforehand the
number of lines in your input data, for instance using
hadoop dfs -text /input/dir/* | wc -l
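Then the lines-per-split can be set like this (a sketch with the old
API; the class name and number are illustrative):

// Old API: NLineInputFormat reads this property for lines per split
JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 5000);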
1. Wrap all your jar files inside your artifact; they should be under
the lib folder. Sometimes this can make your jar file quite big; if you
want to save time uploading big jar files remotely, see 2.
2. Using -libjars with a full path or a relative path (w.r.t. your jar
package) should work, as in the example below.
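For example (a sketch; jar names and paths are placeholders, and note
-libjars only takes effect if your main class goes through
ToolRunner/GenericOptionsParser):

bin/hadoop jar MyJob.jar MyMainClass -libjars dep1.jar,dep2.jar /input /output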
On
On 2/27/2012 1:55 PM, Mohit Anchlia wrote:
I submitted a map reduce job that had 9 tasks killed out of 139. But I
don't see any errors in the admin page. The entire job however has
SUCCEEDED. How can I track down the reason?
Also, how do I determine if this is something to worry about?
Hi,
Hi,
You could easily find lots of documents talking about this. Try
searching for kevinweil-hadoop-lzo on Google.
Shi
Yes, it is supported by Hadoop sequence files, which are splittable
by default. If you have installed and specified LZO correctly,
use these:
org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);
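and, presumably, the companion calls that pick the codec and block
compression (a sketch, assuming the LzoCodec class from the
hadoop-lzo package):

SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);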
You could decompress the LZO file manually into plain text and then
use TextInputFormat.
Alternatively, you don't need to index the LZO compressed file;
just use LZOInputFormat on non-indexed files, and then the LZO
file will not be split anymore.
Hi,
Is there any working example of writing Hadoop task log (stderr) files to
HDFS? Currently we have a cluster whose data nodes are inaccessible to
the users, so I am trying to find a way to redirect all task log files
to HDFS. Thanks!
Best,
Shi
Thank you Bejoy!
Following your code examples, it finally works.
Actually I only changed two places in my original code. First,
I added the @Override annotation. Second, I added a new exception
catch, catch(FileNotFoundException e), and now it works!
I appreciate your kind and precise help.
Best,
Shi
Following up on my previous question, I put the complete code
below. I wonder whether there is any method to get this working on
0.20.X using the new API.
The command I executed was:
bin/hadoop jar myjar.jar FileTest -files textFile.txt /input/
/output/
The complete code:
public class FileTest extends
Hi,
I am using the 0.20.X branch. I need to use the new API because it
has the cleanup(context) method in Mapper. However, I am confused about
how to load the cached files in the mapper. I could load the
DistributedCache files using the old API (JobConf), but in the new API it
always returns
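For reference, this is the pattern I was trying (a sketch; in the new
API the cache is reached through the Context's Configuration):

// Sketch (new API): fetching the local paths of cached files in setup()
@Override
protected void setup(Context context) throws IOException {
  Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
  // cached may be null if nothing was added to the cache at submit time
}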
Hi,
Suppose I have two mappers, each assigned 10 lines of
data. I want to set a counter for each mapper, counting and
accumulating, and then output the counter value to the reducer when
the mapper finishes processing all the assigned lines. So I
want the mapper to output values only when
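In other words, something like this is what I am trying to achieve
(a sketch; the names are illustrative):

// Sketch (new API): accumulate in map(), emit a single value from cleanup()
public static class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private long count = 0;

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    count++; // accumulate over all lines assigned to this mapper
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit once, after all assigned lines have been processed
    context.write(new Text("mapperCount"), new LongWritable(count));
  }
}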
I saw the title of this discussion started a few days ago but didn't pay
attention to it. This morning I came across some of these messages
and, rofl, too much drama. In my experience, there are some
risks in using Hadoop:
1) it is not for real-time or mission-critical work; you may consider
Hi,
You probably need to use a secondary sort (based on a TextPair key) and
a string concatenation function (like StringBuffer) to do this. I once
gave a talk at the Open Cloud Science workshop about this (also see my
previous post in this newsgroup).
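Roughly, the old-API job wiring looks like this (a sketch; TextPair and
its nested comparators are the ones from Tom White's book, and
FirstPartitioner stands for a hypothetical partitioner that hashes only
the first field of the pair):

// Sketch (old API): secondary-sort wiring around a TextPair key
conf.setMapOutputKeyClass(TextPair.class);
conf.setPartitionerClass(FirstPartitioner.class);                      // partition on first field
conf.setOutputKeyComparatorClass(TextPair.Comparator.class);           // sort on both fields
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class); // group on first field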
Best,
Shi
On 9/20/2011 10:38 AM, Daniel
Hi,
I am stuck again on a probably very simple problem. I couldn't generate
the map output in sequence file format. I always get this error:
java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not
class org.apache.hadoop.io.LongWritable
at
Oh that's brilliant. Thanks a lot Brock!
On 9/19/2011 3:15 PM, Brock Noland wrote:
Hi,
On Mon, Sep 19, 2011 at 3:19 PM, Shi Yush...@uchicago.edu wrote:
I am stuck again on a probably very simple problem. I couldn't generate the
map output in sequence file format. I always get this error:
Interested in this topic. We have experienced plenty of difficulties
running Hadoop in Eucalyptus-based virtual instance clusters. Typical
issues like
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel
Hi,
I found some materials about submitting Hadoop jobs via PBS. Any idea
how to interactively browse HDFS through PBS? Our supercomputer uses
the Lustre storage system. I found a wiki talking about using the PolyServe
storage system but not HDFS. Has anyone tried Lustre +
Hadoop + PBS?
I had difficulty upgrading applications from Hadoop 0.20.2 to Hadoop
0.20.203.0.
The standalone mode runs without problems. In real cluster mode, the
program freezes at map 0% reduce 0% and there is only one attempt file
in the log directory. The only information is contained in the stdout file:
Thanks Edward! I upgraded to 1.6.0_26 and it worked.
On 7/1/2011 6:42 PM, Edward Capriolo wrote:
That looks like an ancient version of Java. Get 1.6.0_u24 or 25 from Oracle.
Upgrade to a recent Java and possibly update your C libs.
Edward
On Fri, Jul 1, 2011 at 7:24 PM, Shi
Hi,
My specific question is: is it possible to control the splitting of LZO
files by customizing the LZO index files?
The background of the problem is:
I have a file which has the following format
key1 value1
key1 value2
key2 value3
key2 value4
...
Its size in plain text before compression is 11
There is no lookup. The process is done by shuffle and sort (secondary
sort for multiple keys) in Map/Reduce.
The key problem is to join your record files with lookup tables:
K1 R1    K1 V1
K2 R2    K2 V2
...
which gives
I had the same problem before: a big lookup table too large to load
into memory.
I tried and compared the following approaches: an in-memory MySQL DB, a
dedicated central memcache server, a dedicated central MongoDB server,
and a local DB model (each node has its own MongoDB server).
The local DB
Suppose you are looking up a value V for a key K, and V is required for
an upcoming process. Suppose the data in the upcoming process has the form
R1 K1 K2 K3,
where R1 is the record number and K1 to K3 are the keys occurring in the
record, which means in the lookup case you would query for
This is a re-post of the same message. I made it more specific
and clear. I have been considering it for several days, so I would
really appreciate any help.
I have a question about configuring a map-side inner join for
multiple mappers in Hadoop. Suppose I have two very large data
sets A and B; I use the
Hi,
How do I configure a map-side join across multiple mappers in parallel?
Suppose I have data sets a1, a2, a3 and data sets b1, b2, b3.
I want to let a1 join with b1, a2 join with b2, a3 join with b3, and
have the joins done in parallel. I think it should be possible to configure
this in mapper 1
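For each pair, the wiring I have in mind looks roughly like this
(a sketch with the old API's CompositeInputFormat; paths are
placeholders, and both inputs must be sorted and identically
partitioned):

// Sketch (old API): inner map-side join of two sorted inputs;
// the mapper then receives TupleWritable values.
JobConf conf = new JobConf(MyJoinJob.class);  // MyJoinJob is a placeholder
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/a1"), new Path("/data/b1")));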
Hi,
Thanks for the reply. The line count in the new API works fine now; it was a
bug in my code. In the new API,
Iterator is changed to Iterable,
but I didn't pay attention to that and was still using Iterator and its
hasNext() and next() methods. Surprisingly, the wrong code still ran and produced output,
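For anyone else hitting this, the correct new-API pattern is roughly
(a sketch):

// New API: values is an Iterable, typically consumed with a for-each loop
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  for (LongWritable v : values) {
    sum += v.get();
  }
  context.write(key, new LongWritable(sum));
}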
Hi,
I have two datasets. Dataset 1 has the format:
MasterKey1  SubKey1  SubKey2  SubKey3
MasterKey2  SubKey4  SubKey5  SubKey6
Dataset 2 has the format:
SubKey1  Value1
SubKey2  Value2
...
I want to have a one-to-many join based on the SubKey, and the final goal
is to
Hi,
I am trying to rewrite and improve some old code that uses a map-side join,
involving TupleWritable, KeyValueTextInputFormat, etc. The reference
materials I have are based on the old API (0.19.x). Since Hadoop is
updating rapidly, I am wondering whether there is any new function / API /
framework for Map
Hi,
I am wondering whether there is any built-in function to automatically add a
self-incrementing line number to the reducer output (like a relational
DB auto-increment key).
I have this problem because with the 0.19.2 API, I used a linecount
variable increasing in the reducer, along the lines of the sketch below.
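(A sketch, old API; note the numbering is only globally unique if there
is a single reducer:)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, LongWritable, Text> {
  private long linecount = 0;

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      // number each output line with the instance counter
      output.collect(new LongWritable(++linecount), values.next());
    }
  }
}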
Hi, I am stuck on a basic problem but can't figure it out. My previous
verbose logging problem is the same as the one mentioned in this old post:
http://mail-archives.apache.org/mod_mbox/nutch-user/200901.mbox/%3c0adbd67bd6811a4bb2144d805124714d03f754a...@kaex1.dom.rastatt.de%3E
First question, if
We just upgraded from 0.20.2 to hadoop-0.20.203.0.
Running the same code ends up with a massive amount of debug
information in the screen output. Normally this type of
information is written to the logs/userlogs directory. However,
nothing is written there now, and it seems everything is output
to
I still didn't get it.
To make sure I am not using any old version, I downloaded the two
versions 0.20.2 and 0.20.203.0 again and did a fresh install
separately on two independent clusters. I tried a very
simple toy program. I didn't change anything in the API, so it
probably calls the old
A map/reduce process applied to 3 TB of input data halts for 1 hour at map
57% reduce 19% without any progress.
The same error occurs millions of times in the huge syslog file. I
also got a huge stderr file, where the logs are:
Caused by:
Then, what is the main difference between (1) storing the input in the
cluster's shared directory and loading it in the configure stage of the
mappers, and (2) using the distributed cache?
Shi
On 4/25/2011 8:17 AM, Kai Voigt wrote:
Hi,
I'd use the distributed cache to store the vector on every mapper
I use
hadoop dfs -text path | head -n
hadoop dfs -text path | tail -n
to browse the n-th line from the head or from the tail. But it is slow
when the file is large. Is there any command that goes directly to a
specific line in DFS?
Shi
Message-
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 3:02 PM
To: hadoop user
Subject: Program freezes at Map 99% Reduce 33%
I am running a Hadoop program processing terabyte-size data. The code
was tested successfully on a small sample (100G) and it worked. However
of mappers and then compare that to your program run without
hadoop.
Kevin
-Original Message-
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 3:57 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%
Hi Kevin,
thanks for the reply. I
I guess you need to define a Partitioner to send hashed keys to different
reducers (sorry, I am still using the old API, so probably there is
something new in the trunk release). Basically you try to segment the
keys into different zones: 0-10, 11-20, ...
Maybe check the hashCode() function.
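A sketch of such a partitioner (old API; the class name and zoning
scheme are illustrative, and non-negative keys are assumed):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<IntWritable, Text> {
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // zone 0-9 -> reducer 0, 10-19 -> reducer 1, ... (wraps around)
    return (key.get() / 10) % numPartitions;
  }
  public void configure(JobConf job) { }
}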
Actually, you can see your own post on Google after you post it.
Then you are sure it is not swallowed by the network ... :)
Shi
On 3/22/2011 12:23 PM, Aaron Baff wrote:
Ok, thanks. Guess I'm just having no luck getting my posts replied to.
Aaron Baff | Developer | Telescope, Inc.
Hi. As mentioned in the previous post, I tried to extend some legacy
programs built on Hadoop 0.19.2 to apply LZO compression. I had tons of
problems (logical errors and implementation troubles). After
spending a whole week, I finally feel I am sorting things out; however, there
are still
Problem solved, two paths should be set:
export C_INCLUDE_PATH=/path_of_lzo_output/include
export LIBRARY_PATH=/path_of_lzo_output/lib
and enable shared libraries when configuring the LZO compile:
./configure --enable-shared --prefix=/path_of_lzo_output/
Shi
On 3/19/2011 1:16 PM, Shi Yu wrote:
Trying
Hi. My Hadoop distribution is 0.20.2. I had many errors when compressing
output with LZO (see the stack trace at the end). I disabled the compression
of mapper output. The native code seems to have been loaded correctly, but
during the mapper stage lots of errors popped up. The program didn't break
and
Trying to install LZO and compile the Hadoop package following the
instructions at
http://sudhirvn.blogspot.com/2010/07/installing-hadoop-native-libraries.html
I don't have root privileges, thus no sudo; no rpm installation is
possible. So I built and installed LZO from source in my home folder. The
like to know how I should solve this problem. Should I
upgrade anything? I guess this problem is not new. Thanks for the
information.
Shi
On 3/8/2011 4:04 PM, Shi Yu wrote:
What is the true cause of this? I realize there are many
reports on the web, but couldn't find the exact solution
What is the true cause of this? I realize there are many
reports on the web, but couldn't find the exact solution. I have this
problem when using compressed sequence file output.
SequenceFileOutputFormat.setCompressOutput(conf, true);
Hi,
I observe that sometimes the map/reduce progress goes backward. What
does this mean?
11/02/01 12:57:51 INFO mapred.JobClient: map 100% reduce 99%
11/02/01 12:59:14 INFO mapred.JobClient: map 100% reduce 98%
11/02/01 12:59:45 INFO mapred.JobClient: map 100% reduce 99%
11/02/01
How many tags do you have? If you have more than a few tags, you'd better
create a Vector class to hold those tags, and define a sum function to
increment the values of the tags. Then the value class should be your new
Vector class. That's better and cleaner than the TextPair approach.
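A sketch of such a vector class (the name is illustrative, and a fixed,
known number of tags is assumed):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TagVector implements Writable {
  private long[] counts;

  public TagVector() { this(0); }
  public TagVector(int numTags) { counts = new long[numTags]; }

  public void increment(int tag) { counts[tag]++; }

  // element-wise sum, as described above
  public void sum(TagVector other) {
    for (int i = 0; i < counts.length; i++) counts[i] += other.counts[i];
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(counts.length);
    for (long c : counts) out.writeLong(c);
  }

  public void readFields(DataInput in) throws IOException {
    counts = new long[in.readInt()];
    for (int i = 0; i < counts.length; i++) counts[i] = in.readLong();
  }
}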
Shi
On
Hi Matthew,
I have the same problem here (see
http://www.listware.net/201009/hadoop-common-user/81228-return-a-parameter-using-map-only.html).
I was planning to use a join mapper (or mapper chain) to handle two
different inputs. The problem was that the mapper seemingly cannot return
a value directly to
As a follow-up to my own question: I think invoking the JVM in
Hadoop requires much more memory than an ordinary JVM. I found that
instead of serializing the object, maybe I could create a MapFile as
an index to permit lookups by key in Hadoop. I have also compared the
performance of
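The MapFile idea, roughly (a sketch with the 0.20-era API; the path and
key/value types are placeholders, and keys must be appended in sorted
order):

// Sketch: build a MapFile once, then do random lookups by key
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

MapFile.Writer writer =
    new MapFile.Writer(conf, fs, "lookup.map", Text.class, Text.class);
writer.append(new Text("key1"), new Text("value1")); // keys in sorted order
writer.close();

MapFile.Reader reader = new MapFile.Reader(fs, "lookup.map", conf);
Text value = new Text();
reader.get(new Text("key1"), value); // binary search on the in-memory index
reader.close();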
Here is my code. There is no Map/Reduce in it. I can run this code
using java -Xmx1000m; however, when using bin/hadoop -D
mapred.child.java.opts=-Xmx3000M it hits a not-enough-heap-space error.
I have tried other programs in Hadoop with the same settings, so the
memory is available in my
Hi, thanks for the advice. I tried with your settings,
$ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m
but still no effect. Or is this a system variable? Should I export it? How
do I configure it?
Shi
java -Xms3G -Xmx3G -classpath
Hi, I tried the following five ways:
Approach 1: in command line
HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest
Approach 2: I added the following element to the hadoop-site.xml file.
Each time I changed it, I stopped and restarted Hadoop on all the nodes.
...
<property>
Hi, I got it; it should be declared in
hadoop-env.sh:
export HADOOP_CLIENT_OPTS=-Xmx4000m
Thanks! At the same time I see corrections coming in.
Shi
On 2010-10-13 18:18, Shi Yu wrote:
Hi, I tried the following five ways:
Approach 1: in command line
HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop
Hi,
The input in HDFS is a directory containing 890 files (biggest one 23M,
smallest one 145K, average size 10M). It seems that I am reaching some limit
of HDFS, because all the files after a certain number (594) could not be
loaded. For example, a full run of my code produces the following error:
Hi, thanks for the answer, Antonio.
I have found one of the main problems. It was because I used
MultipleOutputs in the Reduce class, so when I set both the Combiner and the
Reducer, the Combiner did not provide a normal data flow to the Reducer.
Therefore, the program stops at the Combiner and
Hi,
I am running some code on a cluster with several nodes (ranging from 1
to 30) using hadoop-0.19.2. In a test, I put only a single file under
the input folder; however, each time I find the logged total input
paths to process is 2 (not 1).
INFO mapred.FileInputFormat: Total input paths
The master entry appearing in the masters and slaves files is the machine
name or IP address. If you have a single cluster, specifying multiple
names in those files will cause errors because of connection failures.
Shi
On 2010-9-29 15:28, Bhushan Mahale wrote:
Hi,
The master files name in
Hi,
I tried to combine an in-memory MySQL database with MapReduce to do some
value exchanges. In the Mapper, I declare the MySQL driver like this:
import com.mysql.jdbc.*;
import java.sql.DriverManager;
import java.sql.SQLException;
String driver
Dear Hadoopers,
I am stuck on a probably very simple problem but can't figure it out. In
the Hadoop Map/Reduce framework, I want to search a huge file (which is
generated by another Reduce task) for a unique record line (a
String, double pair actually). That record is expected to be