Re: Hadoop job using multiple input files

2009-02-06 Thread Billy Pearson

If it were me, I would prefix the map output values with a: and n:,
a: for address and n: for name.
Then in the reduce you can test each value with an if statement to see whether it is the
address or the name. There is no need to worry about which one comes first; just
make sure both have been set before emitting output in the reduce.
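
(A minimal sketch of that tagging approach, not from the original thread, written against the old 0.19 mapred API; the class name and the tab-separated output format are assumptions.)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String name = null;
    String address = null;
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("n:")) {          // value tagged as a name by the mapper
        name = v.substring(2);
      } else if (v.startsWith("a:")) {   // value tagged as an address
        address = v.substring(2);
      }
    }
    // Emit only once both sides have been seen, regardless of iteration order.
    if (name != null && address != null) {
      output.collect(key, new Text(name + "\t" + address));
    }
  }
}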


Billy

"Amandeep Khurana"  wrote in 
message news:35a22e220902061646m941a545o554b189ed5bdb...@mail.gmail.com...

Ok. I was able to get this to run but have a slight problem.

*File 1*
1   10
2   20
3   30
3   35
4   40
4   45
4   49
5   50

*File 2*

a   10   123
b   20   21321
c   45   2131
d   40   213

I want to join the above two based on the second column of file 1. Here's
what I am getting as the output.

*Output*
1   a   123
b   21321   2
3
3
4   d   213
c   2131   4
4
5

The ones in red are in the format I want. The ones in blue have their
order reversed. How can I get them to be in the correct order too?
Basically, the order in which the iterator iterates over the values is not
consistent. How can I get this to be consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana 
 wrote:



Ok. Got it.

Now, how would my reducer know whether the name is coming first or the
address? Is it going to be in the same order in the iterator as the files
are read (alphabetically) in the mapper?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher 
wrote:



You put the files into a common directory, and use that as your input to
the
MapReduce job. You write a single Mapper class that has an "if" 
statement
examining the map.input.file property, outputting "number" as the key 
for

both files, but "address" for one and "name" for the other. By using a
common key ("number"), you'll ensure that the name and address make it
to
the same reducer after the shuffle. In the reducer, you'll then have the
relevant information (in the values) you need to create the name, 
address

pair.

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana 


wrote:

> Thanks Jeff...
> I am not 100% clear about the first solution you have given. How do I
get
> the multiple files to be read and then feed into a single reducer? I
should
> have multiple mappers in the same class and have different job configs
for
> them, run two separate jobs with one outputting the key as (name,number) and
> the other outputting the value as (number, address) into the reducer?
> Not clear what I'll be doing with the map.input.file here...
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher 
> 
> >wrote:
>
> > Hey Amandeep,
> >
> > You can get the file name for a task via the "map.input.file"
property.
> For
> > the join you're doing, you could inspect this property and output
(number,
> > name) and (number, address) as your (key, value) pairs, depending on
the
> > file you're working with. Then you can do the combination in your
> reducer.
> >
> > You could also check out the join package in contrib/utils (
> >
> >
>
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
> > ),
> > but I'd say your job is simple enough that you'll get it done faster
with
> > the above method.
> >
> > This task would be a simple join in Hive, so you could consider 
> > using

> Hive
> > to manage the data and perform the join.
> >
> > Later,
> > Jeff
> >
> > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana 
> > 

> wrote:
> >
> > > Is it possible to write a map reduce job using multiple input 
> > > files?

> > >
> > > For example:
> > > File 1 has data like - Name, Number
> > > File 2 has data like - Number, Address
> > >
> > > Using these, I want to create a third file which has something 
> > > like

-
> > Name,
> > > Address
> > >
> > > How can a map reduce job be written to do this?
> > >
> > > Amandeep
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> >
>











Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. I was able to get this to run but have a slight problem.

*File 1*
1   10
2   20
3   30
3   35
4   40
4   45
4   49
5   50

*File 2*

a   10   123
b   20   21321
c   45   2131
d   40   213

I want to join the above two based on the second column of file 1. Here's
what I am getting as the output.

*Output*
1   a   123
b   21321   2
3
3
4   d   213
c   2131   4
4
5

The ones in red are in the format I want. The ones in blue have their
order reversed. How can I get them to be in the correct order too?
Basically, the order in which the iterator iterates over the values is not
consistent. How can I get this to be consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana  wrote:

> Ok. Got it.
>
> Now, how would my reducer know whether the name is coming first or the
> address? Is it going to be in the same order in the iterator as the files
> are read (alphabetically) in the mapper?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote:
>
>> You put the files into a common directory, and use that as your input to
>> the
>> MapReduce job. You write a single Mapper class that has an "if" statement
>> examining the map.input.file property, outputting "number" as the key for
>> both files, but "address" for one and "name" for the other. By using a
>> common key ("number"), you'll ensure that the name and address make it
>> to
>> the same reducer after the shuffle. In the reducer, you'll then have the
>> relevant information (in the values) you need to create the name, address
>> pair.
>>
>> On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana 
>> wrote:
>>
>> > Thanks Jeff...
>> > I am not 100% clear about the first solution you have given. How do I
>> get
>> > the multiple files to be read and then feed into a single reducer? I
>> should
>> > have multiple mappers in the same class and have different job configs
>> for
>> > them, run two separate jobs with one outputting the key as (name,number)
>> and
>> > the other outputting the value as (number, address) into the reducer?
>> > Not clear what I'll be doing with the map.input.file here...
>> >
>> > Amandeep
>> >
>> >
>> > Amandeep Khurana
>> > Computer Science Graduate Student
>> > University of California, Santa Cruz
>> >
>> >
>> > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher > > >wrote:
>> >
>> > > Hey Amandeep,
>> > >
>> > > You can get the file name for a task via the "map.input.file"
>> property.
>> > For
>> > > the join you're doing, you could inspect this property and output
>> (number,
>> > > name) and (number, address) as your (key, value) pairs, depending on
>> the
>> > > file you're working with. Then you can do the combination in your
>> > reducer.
>> > >
>> > > You could also check out the join package in contrib/utils (
>> > >
>> > >
>> >
>> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
>> > > ),
>> > > but I'd say your job is simple enough that you'll get it done faster
>> with
>> > > the above method.
>> > >
>> > > This task would be a simple join in Hive, so you could consider using
>> > Hive
>> > > to manage the data and perform the join.
>> > >
>> > > Later,
>> > > Jeff
>> > >
>> > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana 
>> > wrote:
>> > >
>> > > > Is it possible to write a map reduce job using multiple input files?
>> > > >
>> > > > For example:
>> > > > File 1 has data like - Name, Number
>> > > > File 2 has data like - Number, Address
>> > > >
>> > > > Using these, I want to create a third file which has something like
>> -
>> > > Name,
>> > > > Address
>> > > >
>> > > > How can a map reduce job be written to do this?
>> > > >
>> > > > Amandeep
>> > > >
>> > > >
>> > > >
>> > > > Amandeep Khurana
>> > > > Computer Science Graduate Student
>> > > > University of California, Santa Cruz
>> > > >
>> > >
>> >
>>
>
>


Heap size error

2009-02-06 Thread Amandeep Khurana
I'm getting the following error while running my hadoop job:

09/02/06 15:33:03 INFO mapred.JobClient: Task Id :
attempt_200902061333_0004_r_00_1, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
at java.lang.AbstractStringBuilder.append(Unknown Source)
at java.lang.StringBuffer.append(Unknown Source)
at TableJoin$Reduce.reduce(TableJoin.java:61)
at TableJoin$Reduce.reduce(TableJoin.java:1)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

Any inputs?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
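
(A common first step for this kind of reduce-side OutOfMemoryError, offered here as a sketch rather than a diagnosis of TableJoin: raise the per-task JVM heap via mapred.child.java.opts, and check whether the StringBuffer being appended to at TableJoin.java:61 grows without bound for large keys.)

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(HeapConfigSketch.class);
    // The 0.19 default child heap is fairly small (-Xmx200m in hadoop-default.xml);
    // bumping it is often enough, but an unbounded StringBuffer still needs rethinking.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
  }
}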


Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. Got it.

Now, how would my reducer know whether the name is coming first or the
address? Is it going to be in the same order in the iterator as the files
are read (alphabetically) in the mapper?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote:

> You put the files into a common directory, and use that as your input to
> the
> MapReduce job. You write a single Mapper class that has an "if" statement
> examining the map.input.file property, outputting "number" as the key for
> both files, but "address" for one and "name" for the other. By using a
> common key ("number"), you'll ensure that the name and address make it to
> the same reducer after the shuffle. In the reducer, you'll then have the
> relevant information (in the values) you need to create the name, address
> pair.
>
> On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana  wrote:
>
> > Thanks Jeff...
> > I am not 100% clear about the first solution you have given. How do I get
> > the multiple files to be read and then feed into a single reducer? I
> should
> > have multiple mappers in the same class and have different job configs
> for
> > them, run two separate jobs with one outputting the key as (name,number)
> and
> > the other outputting the value as (number, address) into the reducer?
> > Not clear what I'll be doing with the map.input.file here...
> >
> > Amandeep
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher  > >wrote:
> >
> > > Hey Amandeep,
> > >
> > > You can get the file name for a task via the "map.input.file" property.
> > For
> > > the join you're doing, you could inspect this property and output
> (number,
> > > name) and (number, address) as your (key, value) pairs, depending on
> the
> > > file you're working with. Then you can do the combination in your
> > reducer.
> > >
> > > You could also check out the join package in contrib/utils (
> > >
> > >
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
> > > ),
> > > but I'd say your job is simple enough that you'll get it done faster
> with
> > > the above method.
> > >
> > > This task would be a simple join in Hive, so you could consider using
> > Hive
> > > to manage the data and perform the join.
> > >
> > > Later,
> > > Jeff
> > >
> > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana 
> > wrote:
> > >
> > > > Is it possible to write a map reduce job using multiple input files?
> > > >
> > > > For example:
> > > > File 1 has data like - Name, Number
> > > > File 2 has data like - Number, Address
> > > >
> > > > Using these, I want to create a third file which has something like -
> > > Name,
> > > > Address
> > > >
> > > > How can a map reduce job be written to do this?
> > > >
> > > > Amandeep
> > > >
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > >
> >
>


Re: How to use DBInputFormat?

2009-02-06 Thread Fredrik Hedberg
Well, that's also implicit by design, and cannot really be solved in a  
generic way. As with any system, it's not foolproof; unless you fully  
understand what you're doing, you won't reliably get the result you're  
seeking.


As I said before, the JDBC interface for Hadoop solves a specific  
problem, whereas HBase and HDFS are really the answer to the kind of
problem you're hinting at.



Fredrik


On Feb 6, 2009, at 4:06 PM, Stefan Podkowinski wrote:

On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg   
wrote:
Well, that obviously depends on the RDBMS' implementation. And
although the
case is not as bad as you describe (otherwise you better ask your  
RDBMS
vendor for your money back), your point is valid. But then again, a  
RDBMS is

not designed for that kind of work.


Right. Clash of design paradigms. Hey MySQL, I want my money back!!  
Oh, wait..

Another scenario I just recognized: what about current/"realtime"
data? E.g. 'select * from logs where date = today()'. Working with
'offset' may turn out to return different results after the table has
been updated and tasks are still pending. Pretty ugly to trace down
this condition, after you found out that sometimes your results are
just not right..


What do you mean by "creating splits/map tasks on the fly  
dynamically"?



Fredrik


On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote:


As far as I understand the main problem is that you need to create
splits from streaming data with an unknown number of records and
offsets. It's just the same problem as with externally compressed
data
(.gz). You need to go through the complete stream (or do a table  
scan)

to create logical splits. Afterwards each map task needs to seek to
the appropriate offset on a new stream over again. Very expensive.
As
with compressed files, no wonder only one map task is started for  
each

.gz file and will consume the complete file. IMHO the DBInputFormat
should follow this behavior and just create 1 split whatsoever.
Maybe a future version of hadoop will allow to create splits/map  
tasks

on the fly dynamically?

Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg 
wrote:


Indeed sir.

The implementation was designed like you describe for two  
reasons. First

and
foremost to make it as simple as possible for the user to use a
JDBC
database as input and output for Hadoop. Secondly because of the  
specific

requirements the MapReduce framework brings to the table (split
distribution, split reproducibility etc).

This design will, as you note, never handle the same amount of  
data as

HBase
(or HDFS), and was never intended to. That being said, there are  
a couple

of
ways that the current design could be augmented to perform better  
(and,

as
in its current form, tweaked, depending on your data and
computational

requirements). Shard awareness is one way, which would let each
database/tasktracker-node execute mappers on data where each  
split is a

single database server for example.

If you have any ideas on how the current design can be improved,  
please

do
share.


Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

The 0.19 DBInputFormat class implementation is IMHO only  
suitable for

very simple queries working on only a few datasets. That's due to the
fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query  
(huge

performance impact on large tables)
2) creating splits by issuing an individual query for each split  
with

a "limit" and "offset" parameter appended to the input sql query

Effectively your input query "select * from orders" would become
"select * from orders limit  offset " and
executed until count has been reached. I guess this is not  
working sql

syntax for oracle.

Stefan


2009/2/4 Amandeep Khurana :


Adding a semicolon gives me the error "ORA-00911: Invalid  
character"


Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS  


wrote:


Amandeep,
"SQL command not properly ended"
I get this error whenever I forget the semicolon at the end.
I know, it doesn't make sense, but I recommend giving it a try

Rasit

2009/2/4 Amandeep Khurana :


The same query is working if I write a simple JDBC client and  
query

the
database. So, I'm probably doing something wrong in the  
connection


settings.


But the error looks to be on the query side more than the  
connection


side.


Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana >


wrote:



Thanks Kevin

I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM  
Metrics with

processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job:  
job_local_0001

09/02/03 19:21:21 INFO mapred.JobClient:  map 0% redu

Re: Re: Re: Regarding "Hadoop multi cluster" set-up

2009-02-06 Thread Amandeep Khurana
I had to change the master on my running cluster and ended up with the same
problem. Were you able to fix it at your end?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
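
(A quick sanity check that is sometimes useful alongside the firewall discussion below, sketched here as an aside: confirm from the failing slave that the NameNode RPC endpoint is reachable at all. The hostname and port are the ones from the datanode log quoted further down.)

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
  public static void main(String[] args) throws Exception {
    Socket s = new Socket();
    // A "No route to host" or timeout here points at networking/firewall
    // rather than the Hadoop configuration itself.
    s.connect(new InetSocketAddress("master", 54310), 5000);
    System.out.println("master:54310 is reachable");
    s.close();
  }
}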


On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar wrote:

> Hi,
>
> I do not think that the firewall is blocking the port because it has been
> turned off on both the computers! And also since it is a random port number
> I do not think it should create a problem.
>
> I do not understand what is going wrong!
>
> Shefali
>
> On Wed, 04 Feb 2009 23:23:04 +0530  wrote
> >I'm not certain that the firewall is your problem but if that port is
> >blocked on your master you should open it to let communication through.
> Here
> >is one website that might be relevant:
> >
> >
> http://stackoverflow.com/questions/255077/open-ports-under-fedora-core-8-for-vmware-server
> >
> >but again, this may not be your problem.
> >
> >John
> >
> >On Wed, Feb 4, 2009 at 12:46 PM, shefali pawar wrote:
> >
> >> Hi,
> >>
> >> I will have to check. I can do that tomorrow in college. But if that is
> the
> >> case what should i do?
> >>
> >> Should i change the port number and try again?
> >>
> >> Shefali
> >>
> >> On Wed, 04 Feb 2009 S D wrote :
> >>
> >> >Shefali,
> >> >
> >> >Is your firewall blocking port 54310 on the master?
> >> >
> >> >John
> >> >
> >> >On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar > >wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I am trying to set-up a two node cluster using Hadoop0.19.0, with 1
> >> > > master(which should also work as a slave) and 1 slave node.
> >> > >
> >> > > But while running bin/start-dfs.sh the datanode is not starting on
> the
> >> > > slave. I had read the previous mails on the list, but nothing seems
> to
> >> be
> >> > > working in this case. I am getting the following error in the
> >> > > hadoop-root-datanode-slave log file while running the command
> >> > > bin/start-dfs.sh =>
> >> > >
> >> > > 2009-02-03 13:00:27,516 INFO
> >> > > org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
> >> > > /
> >> > > STARTUP_MSG: Starting DataNode
> >> > > STARTUP_MSG:  host = slave/172.16.0.32
> >> > > STARTUP_MSG:  args = []
> >> > > STARTUP_MSG:  version = 0.19.0
> >> > > STARTUP_MSG:  build =
> >> > > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19-r
> >> > > 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
> >> > > /
> >> > > 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 0 time(s).
> >> > > 2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 1 time(s).
> >> > > 2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 2 time(s).
> >> > > 2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 3 time(s).
> >> > > 2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 4 time(s).
> >> > > 2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 5 time(s).
> >> > > 2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 6 time(s).
> >> > > 2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 7 time(s).
> >> > > 2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 8 time(s).
> >> > > 2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > > to server: master/172.16.0.46:54310. Already tried 9 time(s).
> >> > > 2009-02-03 13:00:37,738 ERROR
> >> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> java.io.IOException:
> >> Call
> >> > > to master/172.16.0.46:54310 failed on local exception: No route to
> >> host
> >> > >at org.apache.hadoop.ipc.Client.call(Client.java:699)
> >> > >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> >> > >at $Proxy4.getProtocolVersion(Unknown Source)
> >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
> >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
> >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
> >> > >at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
> >> > >at
> >> > >
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258)
> >> > >at
> >>

Completed jobs not finishing, errors in jobtracker logs

2009-02-06 Thread Bryan Duxbury
I'm seeing some strange behavior on my cluster. Jobs will be done  
(that is, all tasks completed), but the job will still be "running".  
This state seems to persist for minutes, and is really killing my  
throughput.


I'm seeing errors (warnings) in the jobtracker log that look like this:

2009-02-06 12:37:08,425 WARN /: /taskgraph? 
type=reduce&jobid=job_200902061117_0012:

java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.hadoop.mapred.StatusHttpServer 
$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
at org.apache.hadoop.mapred.StatusHttpServer 
$TaskGraphServlet.doGet(StatusHttpServer.java:159)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle 
(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch 
(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle 
(ServletHandler.java:567)

at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle 
(WebApplicationContext.java:635)

at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service 
(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext 
(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle 
(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection 
(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle 
(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run 
(ThreadPool.java:534)



I'm running hadoop-0.19.0. Any ideas?

-Bryan


Cannot copy from local file system to DFS

2009-02-06 Thread Mithila Nagendra
Hey all
I was trying to run the word count example on one of the hadoop systems I
installed, but when I try to copy the text files from the local file system
to the DFS, it throws up the following exception:

[mith...@node02 hadoop]$ jps
8711 JobTracker
8805 TaskTracker
8901 Jps
8419 NameNode
8642 SecondaryNameNode
[mith...@node02 hadoop]$ cd ..
[mith...@node02 mithila]$ ls
hadoop  hadoop-0.17.2.1.tar  hadoop-datastore  test
[mith...@node02 mithila]$ hadoop/bin/hadoop dfs -copyFromLocal test test
09/02/06 11:26:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: File /user/mithila/test/20417.txt could only be
replicated to 0 nodes, instead of 1
at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)

09/02/06 11:26:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping
/user/mithila/test/20417.txt retries left 4
09/02/06 11:26:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: File /user/mithila/test/20417.txt could only be
replicated to 0 nodes, instead of 1
at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)

09/02/06 11:26:27 WARN dfs.DFSClient: NotReplicatedYetException sleeping
/user/mithila/test/20417.txt retries left 3
09/02/06 11:26:28 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: File /user/mithila/test/20417.txt could only be
replicated to 0 nodes, instead of 1
at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNod

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-02-06 Thread Brian Bockelman


On Feb 6, 2009, at 11:00 AM, TCK wrote:



How well does the read throughput from HDFS scale with the number of  
data nodes ?
For example, if I had a large file (say 10GB) on a 10 data node  
cluster, would the time taken to read this whole file in parallel  
(ie, with multiple reader client processes requesting different  
parts of the file in parallel) be halved if I had the same file on a  
20 data node cluster ?


Possibly.  (I don't give a firm answer because the answer depends on  
the number of chunks and the number of replicas).


If there are enough replicas and enough separate reading processes  
with enough network bandwidth, then yes, your read bandwidth could  
double.



Is this not possible because HDFS doesn't support random seeks?


It does for reads.  It does not for writes.

Trust me, our physicists have what can best be described as "the most  
god-awful random read patterns you've seen in your life" and they do  
fine on HDFS.


What about if the file was split up into multiple smaller files  
before placing in the HDFS ?


Then things would be less efficient and you'd be less likely to scale.

Brian



Thanks for your input.
-TCK




--- On Wed, 2/4/09, Brian Bockelman  wrote:
From: Brian Bockelman 
Subject: Re: Batch processing with Hadoop -- does HDFS scale for  
parallel reads?

To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:50 PM

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the  
tasktrackers on

the fast CPUs?  No messy cross-cluster transferring.

Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:




Thanks, Brian. This sounds encouraging for us.

What are the advantages/disadvantages of keeping a persistent storage

(HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ?

The advantage I can think of is that a permanent storage cluster has
different requirements from a map-reduce processing cluster -- the  
permanent
storage cluster would need faster, bigger hard disks, and would need  
to grow as
the total volume of all collected logs grows, whereas the processing  
cluster
would need fast CPUs and would only need to grow with the rate of  
incoming data.
So it seems to make sense to me to copy a piece of data from the  
permanent
storage cluster to the processing cluster only when it needs to be  
processed. Is
my line of thinking reasonable? How would this compare to running  
the map-reduce
processing on same cluster as the data is stored in? Which approach  
is used by

most people?


Best Regards,
TCK



--- On Wed, 2/4/09, Brian Bockelman  wrote:
From: Brian Bockelman 
Subject: Re: Batch processing with Hadoop -- does HDFS scale for  
parallel

reads?

To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:06 PM

Hey TCK,

We use HDFS+FUSE solely as a storage solution for an application which
doesn't understand MapReduce.  We've scaled this solution to

around

80Gbps.  For 300 processes reading from the same file, we get about

20Gbps.


Do consider your data retention policies -- I would say that Hadoop  
as a
storage system is thus far about 99% reliable for storage and is  
not a

backup
solution.  If you're scared of getting more than 1% of your logs  
lost,

have
a good backup solution.  I would also add that when you are  
learning your

operational staff's abilities, expect even more data loss.  As you

gain

experience, data loss goes down.

I don't believe we've lost a single block in the last month, but

it

took us 2-3 months of 1%-level losses to get here.

Brian

On Feb 4, 2009, at 11:51 AM, TCK wrote:



Hey guys,

We have been using Hadoop to do batch processing of logs. The logs  
get

written and stored on a NAS. Our Hadoop cluster periodically copies a

batch of
new logs from the NAS, via NFS into Hadoop's HDFS, processes them,  
and
copies the output back to the NAS. The HDFS is cleaned up at the  
end of

each

batch (ie, everything in it is deleted).


The problem is that reads off the NAS via NFS don't scale even if

we

try to scale the copying process by adding more threads to read in

parallel.


If we instead stored the log files on an HDFS cluster (instead of

NAS), it
seems like the reads would scale since the data can be read from  
multiple

data
nodes at the same time without any contention (except network IO,  
which

shouldn't be a problem).


I would appreciate if anyone could share any similar experience they

have

had with doing parallel reads from a storage HDFS.


Also is it a good idea to have a separate HDFS for storage vs for

doing

the batch processing ?


Best Regards,
TCK


















Re: How to use DBInputFormat?

2009-02-06 Thread Mike Olson

On Feb 6, 2009, at 7:06 AM, Stefan Podkowinski wrote:


Another scenario I just recognized: what about current/"realtime"
data? E.g. 'select * from logs where date = today()'. Working with
'offset' may turn out to return different results after the table has
been updated and tasks are still pending. Pretty ugly to trace down
this condition, after you found out that sometimes your results are
just not right..


In fairness, this isn't an issue unique to streaming data into Hadoop  
from an RDBMS. The "today()" routine is not a pure function -- it returns a
different answer from the same arguments across multiple calls.  
Whenever you put that behavior into your computing infrastructure, you  
make it impossible to reproduce task behavior. This is a big issue in  
a system that uses speculative execution...


Simple answer is to avoid side effects. More complicated answer is to  
understand what they are, and think about them when you design your  
data processing flow.

mike


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-02-06 Thread TCK

How well does the read throughput from HDFS scale with the number of data nodes 
?
For example, if I had a large file (say 10GB) on a 10 data node cluster, would 
the time taken to read this whole file in parallel (ie, with multiple reader 
client processes requesting different parts of the file in parallel) be halved 
if I had the same file on a 20 data node cluster ? Is this not possible because 
HDFS doesn't support random seeks? What about if the file was split up into 
multiple smaller files before placing in the HDFS ?
Thanks for your input.
-TCK




--- On Wed, 2/4/09, Brian Bockelman  wrote:
From: Brian Bockelman 
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:50 PM

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on
the fast CPUs?  No messy cross-cluster transferring.

Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:

> 
> 
> Thanks, Brian. This sounds encouraging for us.
> 
> What are the advantages/disadvantages of keeping a persistent storage
(HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ?
> The advantage I can think of is that a permanent storage cluster has
different requirements from a map-reduce processing cluster -- the permanent
storage cluster would need faster, bigger hard disks, and would need to grow as
the total volume of all collected logs grows, whereas the processing cluster
would need fast CPUs and would only need to grow with the rate of incoming data.
So it seems to make sense to me to copy a piece of data from the permanent
storage cluster to the processing cluster only when it needs to be processed. Is
my line of thinking reasonable? How would this compare to running the map-reduce
processing on same cluster as the data is stored in? Which approach is used by
most people?
> 
> Best Regards,
> TCK
> 
> 
> 
> --- On Wed, 2/4/09, Brian Bockelman  wrote:
> From: Brian Bockelman 
> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel
reads?
> To: core-user@hadoop.apache.org
> Date: Wednesday, February 4, 2009, 1:06 PM
> 
> Hey TCK,
> 
> We use HDFS+FUSE solely as a storage solution for an application which
> doesn't understand MapReduce.  We've scaled this solution to
around
> 80Gbps.  For 300 processes reading from the same file, we get about
20Gbps.
> 
> Do consider your data retention policies -- I would say that Hadoop as a
> storage system is thus far about 99% reliable for storage and is not a
backup
> solution.  If you're scared of getting more than 1% of your logs lost,
have
> a good backup solution.  I would also add that when you are learning your
> operational staff's abilities, expect even more data loss.  As you
gain
> experience, data loss goes down.
> 
> I don't believe we've lost a single block in the last month, but
it
> took us 2-3 months of 1%-level losses to get here.
> 
> Brian
> 
> On Feb 4, 2009, at 11:51 AM, TCK wrote:
> 
>> 
>> Hey guys,
>> 
>> We have been using Hadoop to do batch processing of logs. The logs get
> written and stored on a NAS. Our Hadoop cluster periodically copies a
batch of
> new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and
> copies the output back to the NAS. The HDFS is cleaned up at the end of
each
> batch (ie, everything in it is deleted).
>> 
>> The problem is that reads off the NAS via NFS don't scale even if
we
> try to scale the copying process by adding more threads to read in
parallel.
>> 
>> If we instead stored the log files on an HDFS cluster (instead of
NAS), it
> seems like the reads would scale since the data can be read from multiple
data
> nodes at the same time without any contention (except network IO, which
> shouldn't be a problem).
>> 
>> I would appreciate if anyone could share any similar experience they
have
> had with doing parallel reads from a storage HDFS.
>> 
>> Also is it a good idea to have a separate HDFS for storage vs for
doing
> the batch processing ?
>> 
>> Best Regards,
>> TCK
>> 
>> 
>> 
>> 
> 
> 
> 
> 




  

RE: can't read the SequenceFile correctly

2009-02-06 Thread Bhupesh Bansal
Hey Tom, 

I also got burned by this. Why does BytesWritable.getBytes() return
non-valid bytes? Or should we add a BytesWritable.getValidBytes() kind of
function?


Best
Bhupesh 



-Original Message-
From: Tom White [mailto:t...@cloudera.com]
Sent: Fri 2/6/2009 2:25 AM
To: core-user@hadoop.apache.org
Subject: Re: can't read the SequenceFile correctly
 
Hi Mark,

Not all the bytes stored in a BytesWritable object are necessarily
valid. Use BytesWritable#getLength() to determine how much of the
buffer returned by BytesWritable#getBytes() to use.

Tom
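
(Applied to the loop quoted below, the fix amounts to copying only the first getLength() bytes of the buffer. A self-contained sketch, assuming Text keys and BytesWritable values:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpValueLengths {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        // Only the first getLength() bytes of the backing buffer are valid.
        byte[] fileBytes = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, fileBytes, 0, value.getLength());
        System.out.printf("%s\t%d%n", key, fileBytes.length);
      }
    } finally {
      reader.close();
    }
  }
}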

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner  wrote:
> Hi,
>
> I have written binary files to a SequenceFile, seemingly successfully, but
> when I read them back with the code below, after a first few reads I get the
> same number of bytes for the different files. What could go wrong?
>
> Thank you,
> Mark
>
>  reader = new SequenceFile.Reader(fs, path, conf);
>Writable key = (Writable)
> ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>Writable value = (Writable)
> ReflectionUtils.newInstance(reader.getValueClass(), conf);
>long position = reader.getPosition();
>while (reader.next(key, value)) {
>String syncSeen = reader.syncSeen() ? "*" : "";
>byte [] fileBytes = ((BytesWritable) value).getBytes();
>System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen,
> key, fileBytes.length);
>position = reader.getPosition(); // beginning of next record
>}
>



Tests stalling in my config

2009-02-06 Thread Michael Tolan
Hello,

I recently checked out revision 741606, and am attempting to run the
'test' ant task.
I'm new to building hadoop from source, so my problem is most likely
somewhere in my own configuration, but I'm at a bit of a loss as to
how to trace it.

The only environment variable that I've set for this is:
JAVA_HOME=/home/mtolan/java/jdk1.6.0_10  (Downloaded from Sun)

On running 'ant clean test', I get normal output which ends at

test-core:
[mkdir] Created dir: /home/mtolan/hadoop/trunk/build/test/data
[mkdir] Created dir: /home/mtolan/hadoop/trunk/build/test/logs
 [copy] Copying 1 file to /home/mtolan/hadoop/trunk/build/test/extraconf
[junit] Running org.apache.hadoop.cli.TestCLI

This runs for hours, consuming no resources, so I'm not convinced that
it's working as intended.

What follows are the relevant processes in 'ps', in case there's some
detail I'm missing in the way the commands are being formed.

PS output:

mtolan   25590 32.8  4.7 227392 97712 pts/1Sl+  09:52   0:17
/home/mtolan/java/jdk1.6.0_10/jre/bin/java -classpath
/usr/share/ant/lib/ant-launcher.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/xercesImpl.jar
-Dant.home=/usr/share/ant -Dant.library.dir=/usr/share/ant/lib
org.apache.tools.ant.launch.Launcher -cp  test
mtolan   25640 17.0  2.0 701876 43356 pts/1Sl+  09:52   0:05
/home/mtolan/java/jdk1.6.0_10/jre/bin/java -Xmx512m
-Dtest.build.data=/home/mtolan/hadoop/trunk/build/test/data
-Dtest.cache.data=/home/mtolan/hadoop/trunk/build/test/cache
-Dtest.debug.data=/home/mtolan/hadoop/trunk/build/test/debug
-Dhadoop.log.dir=/home/mtolan/hadoop/trunk/build/test/logs
-Dtest.src.dir=/home/mtolan/hadoop/trunk/src/test
-Dtest.build.extraconf=/home/mtolan/hadoop/trunk/build/test/extraconf
-Dhadoop.policy.file=hadoop-policy.xml
-Djava.library.path=/home/mtolan/hadoop/trunk/build/native/Linux-i386-32/lib:/home/mtolan/hadoop/trunk/lib/native/Linux-i386-32
-Dinstall.c++.examples=/home/mtolan/hadoop/trunk/build/c++-examples/Linux-i386-32
-classpath 
/home/mtolan/hadoop/trunk/build/test/extraconf:/home/mtolan/hadoop/trunk/build/test/classes:/home/mtolan/hadoop/trunk/src/test:/home/mtolan/hadoop/trunk/build:/home/mtolan/hadoop/trunk/build/examples:/home/mtolan/hadoop/trunk/build/tools:/home/mtolan/hadoop/trunk/src/test/lib/ftplet-api-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/ftpserver-core-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/ftpserver-server-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/mina-core-2.0.0-M2-20080407.124109-12.jar:/home/mtolan/hadoop/trunk/build/classes:/home/mtolan/hadoop/trunk/lib/commons-cli-2.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/lib/hsqldb-1.8.0.10.jar:/home/mtolan/hadoop/trunk/lib/jsp-2.1/jsp-2.1.jar:/home/mtolan/hadoop/trunk/lib/jsp-2.1/jsp-api-2.1.jar:/home/mtolan/hadoop/trunk/lib/kfs-0.2.2.jar:/home/mtolan/hadoop/trunk/conf:/home/mtolan/.ivy2/cache/commons-logging/commons-logging/jars/commons-logging-1.0.4.jar:/home/mtolan/.ivy2/cache/log4j/log4j/jars/log4j-1.2.15.jar:/home/mtolan/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.0.1.jar:/home/mtolan/.ivy2/cache/commons-codec/commons-codec/jars/commons-codec-1.3.jar:/home/mtolan/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/home/mtolan/.ivy2/cache/net.java.dev.jets3t/jets3t/jars/jets3t-0.6.1.jar:/home/mtolan/.ivy2/cache/commons-net/commons-net/jars/commons-net-1.4.1.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/servlet-api-2.5/jars/servlet-api-2.5-6.1.14.jar:/home/mtolan/.ivy2/cache/oro/oro/jars/oro-2.0.8.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/jetty/jars/jetty-6.1.14.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/jetty-util/jars/jetty-util-6.1.14.jar:/home/mtolan/.ivy2/cache/tomcat/jasper-runtime/jars/jasper-runtime-5.5.12.jar:/home/mtolan/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.12.jar:/home/mtolan/.ivy2/cache/commons-el/commons-el/jars/commons-el-1.0.jar:/home/mtolan/.ivy2/cache/junit/junit/jars/junit-3.8.1.jar:/home/mtolan/.ivy2/cache/commons-logging/commons-logging-api/jars/commons-logging-api-1.0.4.jar:/home/mtolan/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.4.3.jar:/home/mtolan/.ivy2/cache/org.eclipse.jdt/core/jars/core-3.1.1.jar:/home/mtolan/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.4.3.jar:/usr/share/ant/lib/junit.jar:/usr/share/ant/lib/ant-launcher.jar:/usr/share/ant/lib/ant.jar:/usr/share/ant/lib/ant-junit.jar
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner
org.apache.hadoop.cli.TestCLI filtertrace=true haltOnError=false
haltOnFailure=false
formatter=org.apache.tools.ant.taskdefs.optional.junit.SummaryJUnitResultFormatter
showoutput=false outputtoformatters=true logtestlistenerevents=true
formatter=org.apache.tools.ant.taskdefs.optional.junit.PlainJUnitResultFormatter,/home/mtolan/hadoop/trunk/build/test/TEST-org.apache.hadoop.cli.TestCLI.txt
crashfile=/home/mtolan/hadoop/trunk/junitvmwatcher1200620086.properties
propsfile=/home/mtolan/hadoop/trunk/juni

Re: re : How to use MapFile in C++ program

2009-02-06 Thread Enis Soztutar
There is currently no way to read MapFiles in any language other than 
Java. You can write a JNI wrapper similar to libhdfs.
Alternatively, you could also write the complete stack from scratch;
however, this might prove very difficult or impossible. You might want to
check the ObjectFile/TFile specifications, for which binary-compatible
readers/writers can be developed in any language:


https://issues.apache.org/jira/browse/HADOOP-3315

Enis

Anh Vũ Nguyễn wrote:

Hi, everybody. I am writing a project in C++ and want to use the power of
the MapFile class (which belongs to org.apache.hadoop.io) in Hadoop. Can you
please tell me how I can write code in C++ using MapFile, or is there no way
to use the org.apache.hadoop.io API in C++ (libhdfs only helps with
org.apache.hadoop.fs)?
Thanks in advance!

  


Re: How to use DBInputFormat?

2009-02-06 Thread Stefan Podkowinski
On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg  wrote:
> Well, that obviously depends on the RDBMS' implementation. And although the
> case is not as bad as you describe (otherwise you better ask your RDBMS
> vendor for your money back), your point is valid. But then again, a RDBMS is
> not designed for that kind of work.

Right. Clash of design paradigms. Hey MySQL, I want my money back!! Oh, wait..
Another scenario I just recognized: what about current/"realtime"
data? E.g. 'select * from logs where date = today()'. Working with
'offset' may turn out to return different results after the table has
been updated and tasks are still pending. Pretty ugly to trace down
this condition, after you found out that sometimes your results are
just not right..
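
(For readers following the thread, a bare-bones sketch of how DBInputFormat is wired up in 0.19, which is where the count query and the per-split limit/offset queries discussed here come from. The table, columns, JDBC URL, and class names are all invented.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class OrderRecord implements Writable, DBWritable {
  long id;
  String customer;

  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong(1);
    customer = rs.getString(2);
  }
  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setString(2, customer);
  }
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    customer = in.readUTF();
  }
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(customer);
  }

  // Driver-side setup: the second query is the count query used to size the splits.
  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/shop", "user", "password");
    DBInputFormat.setInput(job, OrderRecord.class,
        "SELECT id, customer FROM orders",
        "SELECT COUNT(*) FROM orders");
  }
}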


> What do you mean by "creating splits/map tasks on the fly dynamically"?
>
>
> Fredrik
>
>
> On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote:
>
>> As far as I understand the main problem is that you need to create
>> splits from streaming data with an unknown number of records and
>> offsets. It's just the same problem as with externally compressed data
>> (.gz). You need to go through the complete stream (or do a table scan)
>> to create logical splits. Afterwards each map task needs to seek to
>> the appropriate offset on a new stream over again. Very expensive. As
>> with compressed files, no wonder only one map task is started for each
>> .gz file and will consume the complete file. IMHO the DBInputFormat
>> should follow this behavior and just create 1 split whatsoever.
>> Maybe a future version of hadoop will allow to create splits/map tasks
>> on the fly dynamically?
>>
>> Stefan
>>
>> On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg 
>> wrote:
>>>
>>> Indeed sir.
>>>
>>> The implementation was designed like you describe for two reasons. First
>>> and
>>> foremost to make it as simple as possible for the user to use a JDBC
>>> database as input and output for Hadoop. Secondly because of the specific
>>> requirements the MapReduce framework brings to the table (split
>>> distribution, split reproducibility etc).
>>>
>>> This design will, as you note, never handle the same amount of data as
>>> HBase
>>> (or HDFS), and was never intended to. That being said, there are a couple
>>> of
>>> ways that the current design could be augmented to perform better (and,
>>> as
>>> in its current form, tweaked, depending on your data and computational
>>> requirements). Shard awareness is one way, which would let each
>>> database/tasktracker-node execute mappers on data where each split is a
>>> single database server for example.
>>>
>>> If you have any ideas on how the current design can be improved, please
>>> do
>>> share.
>>>
>>>
>>> Fredrik
>>>
>>> On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:
>>>
 The 0.19 DBInputFormat class implementation is IMHO only suitable for
 very simple queries working on only a few datasets. That's due to the
 fact that it tries to create splits from the query by
 1) getting a count of all rows using the specified count query (huge
 performance impact on large tables)
 2) creating splits by issuing an individual query for each split with
 a "limit" and "offset" parameter appended to the input sql query

 Effectively your input query "select * from orders" would become
 "select * from orders limit  offset " and
 executed until count has been reached. I guess this is not working sql
 syntax for oracle.

 Stefan


 2009/2/4 Amandeep Khurana :
>
> Adding a semicolon gives me the error "ORA-00911: Invalid character"
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS 
> wrote:
>
>> Amandeep,
>> "SQL command not properly ended"
>> I get this error whenever I forget the semicolon at the end.
>> I know, it doesn't make sense, but I recommend giving it a try
>>
>> Rasit
>>
>> 2009/2/4 Amandeep Khurana :
>>>
>>> The same query is working if I write a simple JDBC client and query
>>> the
>>> database. So, I'm probably doing something wrong in the connection
>>
>> settings.
>>>
>>> But the error looks to be on the query side more than the connection
>>
>> side.
>>>
>>> Amandeep
>>>
>>>
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>>
>>>
>>> On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana 
>>
>> wrote:
>>>
 Thanks Kevin

 I couldn't get it to work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job: j

Re: can't read the SequenceFile correctly

2009-02-06 Thread Mark Kerzner
Indeed, this was the answer!

Thank you,
Mark

On Fri, Feb 6, 2009 at 4:25 AM, Tom White  wrote:

> Hi Mark,
>
> Not all the bytes stored in a BytesWritable object are necessarily
> valid. Use BytesWritable#getLength() to determine how much of the
> buffer returned by BytesWritable#getBytes() to use.
>
> Tom
>
> On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner 
> wrote:
> > Hi,
> >
> > I have written binary files to a SequenceFile, seemingly successfully,
> but
> > when I read them back with the code below, after a first few reads I get
> the
> > same number of bytes for the different files. What could go wrong?
> >
> > Thank you,
> > Mark
> >
> >  reader = new SequenceFile.Reader(fs, path, conf);
> >Writable key = (Writable)
> > ReflectionUtils.newInstance(reader.getKeyClass(), conf);
> >Writable value = (Writable)
> > ReflectionUtils.newInstance(reader.getValueClass(), conf);
> >long position = reader.getPosition();
> >while (reader.next(key, value)) {
> >String syncSeen = reader.syncSeen() ? "*" : "";
> >byte [] fileBytes = ((BytesWritable) value).getBytes();
> >System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen,
> > key, fileBytes.length);
> >position = reader.getPosition(); // beginning of next
> record
> >}
> >
>


Re: How to use DBInputFormat?

2009-02-06 Thread Fredrik Hedberg
Well, that obviously depends on the RDBMS' implementation. And although
the case is not as bad as you describe (otherwise you better ask your  
RDBMS vendor for your money back), your point is valid. But then  
again, a RDBMS is not designed for that kind of work.


What do you mean by "creating splits/map tasks on the fly dynamically"?


Fredrik


On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote:


As far as I understand the main problem is that you need to create
splits from streaming data with an unknown number of records and
offsets. It's just the same problem as with externally compressed data
(.gz). You need to go through the complete stream (or do a table scan)
to create logical splits. Afterwards each map task needs to seek to
the appropriate offset on a new stream over again. Very expensive. As
with compressed files, no wonder only one map task is started for each
.gz file and will consume the complete file. IMHO the DBInputFormat
should follow this behavior and just create 1 split whatsoever.
Maybe a future version of hadoop will allow to create splits/map tasks
on the fly dynamically?

Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg   
wrote:

Indeed sir.

The implementation was designed like you describe for two reasons.  
First and

foremost to make it as simple as possible for the user to use a JDBC
database as input and output for Hadoop. Secondly because of the  
specific

requirements the MapReduce framework brings to the table (split
distribution, split reproducibility etc).

This design will, as you note, never handle the same amount of data  
as HBase
(or HDFS), and was never intended to. That being said, there are a  
couple of
ways that the current design could be augmented to perform better  
(and, as

in its current form, tweaked, depending on your data and computational
requirements). Shard awareness is one way, which would let each
database/tasktracker-node execute mappers on data where each split  
is a

single database server for example.

If you have any ideas on how the current design can be improved,  
please do

share.


Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

The 0.19 DBInputFormat class implementation is IMHO only suitable  
for

very simple queries working on only a few datasets. That's due to the
fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge
performance impact on large tables)
2) creating splits by issuing an individual query for each split  
with

a "limit" and "offset" parameter appended to the input sql query

Effectively your input query "select * from orders" would become
"select * from orders limit  offset " and
executed until count has been reached. I guess this is not working  
sql

syntax for oracle.

Stefan


2009/2/4 Amandeep Khurana :


Adding a semicolon gives me the error "ORA-00911: Invalid  
character"


Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS  
 wrote:



Amandeep,
"SQL command not properly ended"
I get this error whenever I forget the semicolon at the end.
I know, it doesn't make sense, but I recommend giving it a try

Rasit

2009/2/4 Amandeep Khurana :


The same query is working if I write a simple JDBC client and  
query the
database. So, I'm probably doing something wrong in the  
connection


settings.


But the error looks to be on the query side more than the  
connection


side.


Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana >


wrote:



Thanks Kevin

I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
        at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at LoadTable1.run(LoadTable1.java:130)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at LoadTable1.main(LoadTable1.java:107)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

Re: Hadoop job using multiple input files

2009-02-06 Thread Ian Soboroff
Amandeep Khurana  writes:

> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name,
> Address
>
> How can a map reduce job be written to do this?

Have one map job read both files in sequence, and map them to (number,
name or address).  Then reduce on number.

Ian



Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
You put the files into a common directory, and use that as your input to the
MapReduce job. You write a single Mapper class that has an "if" statement
examining the map.input.file property, outputting "number" as the key for
both files, but "address" for one and "name" for the other. By using a
common key ("number"), you'll ensure that the name and address make it to
the same reducer after the shuffle. In the reducer, you'll then have the
relevant information (in the values) you need to create the name, address
pair.
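
For reference, a minimal sketch of that approach against the old
org.apache.hadoop.mapred API (the tab delimiter, the column order, and the
assumption that File 1's path contains "names" are guesses about the data,
not anything stated in the thread; a normal JobConf driver would still be
needed to wire it up):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinByNumber {

      // Tags each record as a name ("n:") or an address ("a:") depending on
      // which input file it came from, keyed by the shared number column.
      public static class TagMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private String inputFile;

        public void configure(JobConf job) {
          inputFile = job.get("map.input.file");
        }

        public void map(LongWritable offset, Text line,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          String[] cols = line.toString().split("\t");
          if (inputFile.contains("names")) {           // File 1: name <tab> number
            out.collect(new Text(cols[1]), new Text("n:" + cols[0]));
          } else {                                     // File 2: number <tab> address
            out.collect(new Text(cols[0]), new Text("a:" + cols[1]));
          }
        }
      }

      // Sees all tagged values for one number and emits the (name, address) pair.
      public static class JoinReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text number, Iterator<Text> values,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          String name = null, address = null;
          while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("n:")) name = v.substring(2);
            else address = v.substring(2);
          }
          if (name != null && address != null) {
            out.collect(new Text(name), new Text(address));
          }
        }
      }
    }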

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana  wrote:

> Thanks Jeff...
> I am not 100% clear about the first solution you have given. How do I get
> the multiple files to be read and then fed into a single reducer? Should I
> have multiple mappers in the same class and have different job configs for
> them, run two separate jobs with one outputting the key as (name, number) and
> the other outputting the value as (number, address) into the reducer?
> Not clear what I'll be doing with the map.input.file here...
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher wrote:
>
> > Hey Amandeep,
> >
> > You can get the file name for a task via the "map.input.file" property.
> For
> > the join you're doing, you could inspect this property and output (number,
> > name) and (number, address) as your (key, value) pairs, depending on the
> > file you're working with. Then you can do the combination in your
> reducer.
> >
> > You could also check out the join package in contrib/utils (
> >
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
> > ),
> > but I'd say your job is simple enough that you'll get it done faster with
> > the above method.
> >
> > This task would be a simple join in Hive, so you could consider using
> Hive
> > to manage the data and perform the join.
> >
> > Later,
> > Jeff
> >
> > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana wrote:
> >
> > > Is it possible to write a map reduce job using multiple input files?
> > >
> > > For example:
> > > File 1 has data like - Name, Number
> > > File 2 has data like - Number, Address
> > >
> > > Using these, I want to create a third file which has something like -
> > Name,
> > > Address
> > >
> > > How can a map reduce job be written to do this?
> > >
> > > Amandeep
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> >
>


Re: can't read the SequenceFile correctly

2009-02-06 Thread Tom White
Hi Mark,

Not all the bytes stored in a BytesWritable object are necessarily
valid. Use BytesWritable#getLength() to determine how much of the
buffer returned by BytesWritable#getBytes() to use.

Tom

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner  wrote:
> Hi,
>
> I have written binary files to a SequenceFile, seemingly successfully, but
> when I read them back with the code below, after a first few reads I get the
> same number of bytes for the different files. What could go wrong?
>
> Thank you,
> Mark
>
> reader = new SequenceFile.Reader(fs, path, conf);
> Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
> Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
> long position = reader.getPosition();
> while (reader.next(key, value)) {
>     String syncSeen = reader.syncSeen() ? "*" : "";
>     byte[] fileBytes = ((BytesWritable) value).getBytes();
>     System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, fileBytes.length);
>     position = reader.getPosition(); // beginning of next record
> }
>
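
To illustrate the point, a corrected sketch of the quoted loop (assuming the
values really are BytesWritable; only the length handling changes):

    long position = reader.getPosition();
    while (reader.next(key, value)) {
        BytesWritable bytes = (BytesWritable) value;
        // Only the first getLength() bytes of the buffer are valid data;
        // getBytes() may return a larger, reused backing array.
        byte[] fileBytes = new byte[bytes.getLength()];
        System.arraycopy(bytes.getBytes(), 0, fileBytes, 0, bytes.getLength());
        System.out.printf("[%s]\t%s\t%s\n", position, key, fileBytes.length);
        position = reader.getPosition(); // beginning of next record
    }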


Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Thanks Jeff...
I am not 100% clear about the first solution you have given. How do I get
the multiple files to be read and then fed into a single reducer? Should I
have multiple mappers in the same class and have different job configs for
them, run two separate jobs with one outputting the key as (name, number) and
the other outputting the value as (number, address) into the reducer?
Not clear what I'll be doing with the map.input.file here...

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher wrote:

> Hey Amandeep,
>
> You can get the file name for a task via the "map.input.file" property. For
> the join you're doing, you could inspect this property and output (number,
> name) and (number, address) as your (key, value) pairs, depending on the
> file you're working with. Then you can do the combination in your reducer.
>
> You could also check out the join package in contrib/utils (
>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
> ),
> but I'd say your job is simple enough that you'll get it done faster with
> the above method.
>
> This task would be a simple join in Hive, so you could consider using Hive
> to manage the data and perform the join.
>
> Later,
> Jeff
>
> On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana  wrote:
>
> > Is it possible to write a map reduce job using multiple input files?
> >
> > For example:
> > File 1 has data like - Name, Number
> > File 2 has data like - Number, Address
> >
> > Using these, I want to create a third file which has something like -
> Name,
> > Address
> >
> > How can a map reduce job be written to do this?
> >
> > Amandeep
> >
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
>


Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
Hey Amandeep,

You can get the file name for a task via the "map.input.file" property. For
the join you're doing, you could inspect this property and output (number,
name) and (number, address) as your (key, value) pairs, depending on the
file you're working with. Then you can do the combination in your reducer.

You could also check out the join package in contrib/utils (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html),
but I'd say your job is simple enough that you'll get it done faster with
the above method.

This task would be a simple join in Hive, so you could consider using Hive
to manage the data and perform the join.

Later,
Jeff

On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana  wrote:

> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name,
> Address
>
> How can a map reduce job be written to do this?
>
> Amandeep
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>


Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Is it possible to write a map reduce job using multiple input files?

For example:
File 1 has data like - Name, Number
File 2 has data like - Number, Address

Using these, I want to create a third file which has something like - Name,
Address

How can a map reduce job be written to do this?

Amandeep



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


re : How to use MapFile in C++ program

2009-02-06 Thread Anh Vũ Nguyễn
Hi, everybody. I am writing a project in C++ and want to use the power of
the MapFile class (which belongs to org.apache.hadoop.io) of Hadoop. Can you
please tell me how I can write code in C++ using MapFile, or is there no way
to use the org.apache.hadoop.io API in C++ (libhdfs only helps with
org.apache.hadoop.fs)?
Thanks in advance!


How to use MapFile in C++

2009-02-06 Thread Anh Vũ Nguyễn
Hi, everybody. I am writing a project in C++ and want to use the features of
the MapFile class (which belongs to org.apache.hadoop.io) of Hadoop. Can you
please tell me how I can write code in C++ using MapFile, or is there no way
to use the org.apache.hadoop.io API in C++ (libhdfs only helps with
org.apache.hadoop.fs)?
Thanks in advance!