Re: Hadoop job using multiple input files
If it were me, I would prefix the map output values with "a:" and "n:": "a:" for address and "n:" for name. Then in the reduce you can test each value with if statements to see whether it is the address or the name. There is no need to worry about which one comes first; just make sure both have been set before emitting output from the reduce.

Billy

"Amandeep Khurana" wrote in message news:35a22e220902061646m941a545o554b189ed5bdb...@mail.gmail.com...

Ok. I was able to get this to run but have a slight problem. *File 1* 1 10 2 20 3 30 3 35 4 40 4 45 4 49 5 50 *File 2* a 10 123 b 20 21321 c 45 2131 d 40 213 I want to join the above two based on the second column of file 1. Here's what I am getting as the output. *Output* 1 a 123 b 21321 2 3 3 4 d 213 c 2131 4 4 5 The ones in red are in the format I want. The ones in blue have their order reversed. How can I get them to be in the correct order too? Basically, the order in which the iterator iterates over the values is not consistent. How can I get this to be consistent? Amandeep

Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana wrote: Ok. Got it. Now, how would my reducer know whether the name is coming first or the address? Is it going to be in the same order in the iterator as the files are read (alphabetically) in the mapper? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote: You put the files into a common directory, and use that as your input to the MapReduce job. You write a single Mapper class that has an "if" statement examining the map.input.file property, outputting the number as the key for both files, with the address as the value for one and the name for the other. By using a common key (the number), you'll ensure that the name and address make it to the same reducer after the shuffle. In the reducer, you'll then have the relevant information (in the values) you need to create the name, address pair.

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana wrote: > Thanks Jeff... > I am not 100% clear about the first solution you have given. How do I get > the multiple files to be read and then fed into a single reducer? Should I > have multiple mappers in the same class and have different job configs for > them, and run two separate jobs with one outputting the key as > (name, number) and > the other outputting the value as (number, address) into the reducer? > Not clear what I'll be doing with the map.input.file here... > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher > > >wrote: > > > Hey Amandeep, > > > > You can get the file name for a task via the "map.input.file" property. > For > > the join you're doing, you could inspect this property and output (number, > > name) and (number, address) as your (key, value) pairs, depending on the > > file you're working with. Then you can do the combination in your > reducer. > > > > You could also check out the join package in contrib/utils ( > > > > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html > > ), > > but I'd say your job is simple enough that you'll get it done faster with > > the above method. > > > > This task would be a simple join in Hive, so you could consider > > using > Hive > > to manage the data and perform the join.
> > > > Later, > > Jeff > > > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana > > > wrote: > > > > > Is it possible to write a map reduce job using multiple input > > > files? > > > > > > For example: > > > File 1 has data like - Name, Number > > > File 2 has data like - Number, Address > > > > > > Using these, I want to create a third file which has something > > > like - > > Name, > > > Address > > > > > > How can a map reduce job be written to do this? > > > > > > Amandeep > > > > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > >
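A minimal sketch of the reducer Billy describes, written against the old (0.19-era) mapred API. The class name and the "n:"/"a:" tags are illustrative assumptions rather than code from the thread; it only assumes the mapper emits Text values prefixed with one of the two tags.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Joins on the shared number key; values arrive tagged "n:" (name) or
    // "a:" (address), so the order the iterator returns them in does not matter.
    public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String name = null;
        String address = null;
        while (values.hasNext()) {
          String v = values.next().toString();
          if (v.startsWith("n:")) {
            name = v.substring(2);
          } else if (v.startsWith("a:")) {
            address = v.substring(2);
          }
        }
        // Emit only when both sides of the join are present for this number.
        if (name != null && address != null) {
          output.collect(new Text(name), new Text(address));
        }
      }
    }

If a single number can carry several names or addresses, you would instead collect each side into a list and emit the cross product.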
Re: Hadoop job using multiple input files
Ok. I was able to get this to run but have a slight problem. *File 1* 1 10 2 20 3 30 3 35 4 40 4 45 4 49 5 50 *File 2* a 10 123 b 20 21321 c 45 2131 d 40 213 I want to join the above two based on the second column of file 1. Here's what I am getting as the output. *Output* 1 a 123 b 21321 2 3 3 4 d 213 c 2131 4 4 5 The ones in red are in the format I want it. The ones in blue have their order reversed. How can I get them to be in the correct order too? Basically, the order in which the iterator iterates over the values is not consistent. How can I get this to be consistent? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana wrote: > Ok. Got it. > > Now, how would my reducer know whether the name is coming first or the > address? Is it going to be in the same order in the iterator as the files > are read (alphabetically) in the mapper? > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote: > >> You put the files into a common directory, and use that as your input to >> the >> MapReduce job. You write a single Mapper class that has an "if" statement >> examining the map.input.file property, outputting "number" as the key for >> both files, but "address" for one and "name" for the other. By using a >> commone key ("number"), you'll ensure that the name and address make it >> to >> the same reducer after the shuffle. In the reducer, you'll then have the >> relevant information (in the values) you need to create the name, address >> pair. >> >> On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana >> wrote: >> >> > Thanks Jeff... >> > I am not 100% clear about the first solution you have given. How do I >> get >> > the multiple files to be read and then feed into a single reducer? I >> should >> > have multiple mappers in the same class and have different job configs >> for >> > them, run two separate jobs with one outputing the key as (name,number) >> and >> > the other outputing the value as (number, address) into the reducer? >> > Not clear what I'll be doing with the map.intput.file here... >> > >> > Amandeep >> > >> > >> > Amandeep Khurana >> > Computer Science Graduate Student >> > University of California, Santa Cruz >> > >> > >> > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher > > >wrote: >> > >> > > Hey Amandeep, >> > > >> > > You can get the file name for a task via the "map.input.file" >> property. >> > For >> > > the join you're doing, you could inspect this property and ouput >> (number, >> > > name) and (number, address) as your (key, value) pairs, depending on >> the >> > > file you're working with. Then you can do the combination in your >> > reducer. >> > > >> > > You could also check out the join package in contrib/utils ( >> > > >> > > >> > >> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html >> > > ), >> > > but I'd say your job is simple enough that you'll get it done faster >> with >> > > the above method. >> > > >> > > This task would be a simple join in Hive, so you could consider using >> > Hive >> > > to manage the data and perform the join. >> > > >> > > Later, >> > > Jeff >> > > >> > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana >> > wrote: >> > > >> > > > Is it possible to write a map reduce job using multiple input files? 
>> > > > >> > > > For example: >> > > > File 1 has data like - Name, Number >> > > > File 2 has data like - Number, Address >> > > > >> > > > Using these, I want to create a third file which has something like >> - >> > > Name, >> > > > Address >> > > > >> > > > How can a map reduce job be written to do this? >> > > > >> > > > Amandeep >> > > > >> > > > >> > > > >> > > > Amandeep Khurana >> > > > Computer Science Graduate Student >> > > > University of California, Santa Cruz >> > > > >> > > >> > >> > >
Heap size error
I'm getting the following error while running my hadoop job: 09/02/06 15:33:03 INFO mapred.JobClient: Task Id : attempt_200902061333_0004_r_00_1, Status : FAILED java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Unknown Source) at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source) at java.lang.AbstractStringBuilder.append(Unknown Source) at java.lang.StringBuffer.append(Unknown Source) at TableJoin$Reduce.reduce(TableJoin.java:61) at TableJoin$Reduce.reduce(TableJoin.java:1) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430) at org.apache.hadoop.mapred.Child.main(Child.java:155) Any inputs? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
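The trace shows the reducer growing a StringBuffer inside TableJoin$Reduce.reduce(), so the usual suspects are either concatenating every value for a key into one string (emit records incrementally instead) or simply running out of per-task heap. A sketch of raising the child-task heap in the job driver; the driver class name is illustrative, and the property and its small default are standard for this generation of Hadoop.

    import org.apache.hadoop.mapred.JobConf;

    public class TableJoinDriver {
      public static JobConf createConf() {
        JobConf conf = new JobConf(TableJoinDriver.class);
        // Each map/reduce task runs in a child JVM whose size is set here; the
        // default (-Xmx200m) is easy to exhaust when a reducer buffers all the
        // values for a key in memory.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
      }
    }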
Re: Hadoop job using multiple input files
Ok. Got it. Now, how would my reducer know whether the name is coming first or the address? Is it going to be in the same order in the iterator as the files are read (alphabetically) in the mapper? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote: > You put the files into a common directory, and use that as your input to > the > MapReduce job. You write a single Mapper class that has an "if" statement > examining the map.input.file property, outputting "number" as the key for > both files, but "address" for one and "name" for the other. By using a > commone key ("number"), you'll ensure that the name and address make it to > the same reducer after the shuffle. In the reducer, you'll then have the > relevant information (in the values) you need to create the name, address > pair. > > On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana wrote: > > > Thanks Jeff... > > I am not 100% clear about the first solution you have given. How do I get > > the multiple files to be read and then feed into a single reducer? I > should > > have multiple mappers in the same class and have different job configs > for > > them, run two separate jobs with one outputing the key as (name,number) > and > > the other outputing the value as (number, address) into the reducer? > > Not clear what I'll be doing with the map.intput.file here... > > > > Amandeep > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher > >wrote: > > > > > Hey Amandeep, > > > > > > You can get the file name for a task via the "map.input.file" property. > > For > > > the join you're doing, you could inspect this property and ouput > (number, > > > name) and (number, address) as your (key, value) pairs, depending on > the > > > file you're working with. Then you can do the combination in your > > reducer. > > > > > > You could also check out the join package in contrib/utils ( > > > > > > > > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html > > > ), > > > but I'd say your job is simple enough that you'll get it done faster > with > > > the above method. > > > > > > This task would be a simple join in Hive, so you could consider using > > Hive > > > to manage the data and perform the join. > > > > > > Later, > > > Jeff > > > > > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana > > wrote: > > > > > > > Is it possible to write a map reduce job using multiple input files? > > > > > > > > For example: > > > > File 1 has data like - Name, Number > > > > File 2 has data like - Number, Address > > > > > > > > Using these, I want to create a third file which has something like - > > > Name, > > > > Address > > > > > > > > How can a map reduce job be written to do this? > > > > > > > > Amandeep > > > > > > > > > > > > > > > > Amandeep Khurana > > > > Computer Science Graduate Student > > > > University of California, Santa Cruz > > > > > > > > > >
Re: How to use DBInputFormat?
Well, that's also implicit by design, and cannot really be solved in a generic way. As with any system, it's not foolproof; unless you fully understand what you're doing, you won't reliably get the result you're seeking. As I said before, the JDBC interface for Hadoop solves a specific problem, whereas HBase and HDFS are really the answer to the kind of problem you're hinting at.

Fredrik

On Feb 6, 2009, at 4:06 PM, Stefan Podkowinski wrote:

On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg wrote: Well, that obviously depends on the RDBMS' implementation. And although the case is not as bad as you describe (otherwise you had better ask your RDBMS vendor for your money back), your point is valid. But then again, an RDBMS is not designed for that kind of work.

Right. Clash of design paradigms. Hey MySQL, I want my money back!! Oh, wait..

Another scenario I just recognized: what about current/"realtime" data? E.g. 'select * from logs where date = today()'. Working with 'offset' may turn out to return different results after the table has been updated while tasks are still pending. Pretty ugly to trace down this condition after you find out that sometimes your results are just not right..

What do you mean by "creating splits/map tasks on the fly dynamically"? Fredrik

On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote: As far as I understand, the main problem is that you need to create splits from streaming data with an unknown number of records and offsets. It's just the same problem as with externally compressed data (.gz). You need to go through the complete stream (or do a table scan) to create logical splits. Afterwards each map task needs to seek to the appropriate offset on a new stream all over again. Very expensive. As with compressed files, no wonder only one map task is started for each .gz file and will consume the complete file. IMHO the DBInputFormat should follow this behavior and just create one split whatsoever. Maybe a future version of Hadoop will allow creating splits/map tasks on the fly dynamically? Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg wrote: Indeed sir. The implementation was designed like you describe for two reasons. First and foremost to make it as simple as possible for the user to use a JDBC database as input and output for Hadoop. Secondly because of the specific requirements the MapReduce framework brings to the table (split distribution, split reproducibility etc). This design will, as you note, never handle the same amount of data as HBase (or HDFS), and was never intended to. That being said, there are a couple of ways that the current design could be augmented to perform better (and, as in its current form, tweaked, depending on your data and computational requirements). Shard awareness is one way, which would let each database/tasktracker-node execute mappers on data where each split is a single database server, for example. If you have any ideas on how the current design can be improved, please do share. Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only a few datasets.
That's due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (a huge performance impact on large tables), and 2) creating splits by issuing an individual query for each split with a "limit" and "offset" parameter appended to the input SQL query. Effectively your input query "select * from orders" would become "select * from orders limit <limit> offset <offset>" and be executed repeatedly until the count has been reached. I guess this is not valid SQL syntax for Oracle. Stefan

2009/2/4 Amandeep Khurana : Adding a semicolon gives me the error "ORA-00911: Invalid character" Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS wrote: Amandeep, "SQL command not properly ended" I get this error whenever I forget the semicolon at the end. I know it doesn't make sense, but I recommend giving it a try. Rasit

2009/2/4 Amandeep Khurana : The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana > wrote: Thanks Kevin. I couldn't get it to work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% redu
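For reference, a sketch of how DBInputFormat gets wired up in 0.19, which is where the count query and the per-split limit/offset queries described above come from. The class name, connection details, and the OrderRecord DBWritable are placeholders; the limit/offset style works for databases such as MySQL but, as noted, not for Oracle.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;

    public class OrdersJobSetup {
      public static void configure(JobConf job) {
        // Register the JDBC driver and connection details (placeholders).
        DBConfiguration.configureDB(job,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/shop",
            "user", "password");

        // OrderRecord is a hypothetical DBWritable mapping one row of "orders".
        // The framework runs the count query once, then appends limit/offset
        // clauses to the input query to build each split's query.
        DBInputFormat.setInput(job, OrderRecord.class,
            "SELECT id, total FROM orders",
            "SELECT COUNT(*) FROM orders");
      }
    }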
Re: Re: Re: Regarding "Hadoop multi cluster" set-up
I had to change the master on my running cluster and ended up with the same problem. Were you able to fix it at your end? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar wrote: > Hi, > > I do not think that the firewall is blocking the port because it has been > turned off on both the computers! And also since it is a random port number > I do not think it should create a problem. > > I do not understand what is going wrong! > > Shefali > > On Wed, 04 Feb 2009 23:23:04 +0530 wrote > >I'm not certain that the firewall is your problem but if that port is > >blocked on your master you should open it to let communication through. > Here > >is one website that might be relevant: > > > > > http://stackoverflow.com/questions/255077/open-ports-under-fedora-core-8-for-vmware-server > > > >but again, this may not be your problem. > > > >John > > > >On Wed, Feb 4, 2009 at 12:46 PM, shefali pawar wrote: > > > >> Hi, > >> > >> I will have to check. I can do that tomorrow in college. But if that is > the > >> case what should i do? > >> > >> Should i change the port number and try again? > >> > >> Shefali > >> > >> On Wed, 04 Feb 2009 S D wrote : > >> > >> >Shefali, > >> > > >> >Is your firewall blocking port 54310 on the master? > >> > > >> >John > >> > > >> >On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar > >wrote: > >> > > >> > > Hi, > >> > > > >> > > I am trying to set-up a two node cluster using Hadoop0.19.0, with 1 > >> > > master(which should also work as a slave) and 1 slave node. > >> > > > >> > > But while running bin/start-dfs.sh the datanode is not starting on > the > >> > > slave. I had read the previous mails on the list, but nothing seems > to > >> be > >> > > working in this case. I am getting the following error in the > >> > > hadoop-root-datanode-slave log file while running the command > >> > > bin/start-dfs.sh => > >> > > > >> > > 2009-02-03 13:00:27,516 INFO > >> > > org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: > >> > > / > >> > > STARTUP_MSG: Starting DataNode > >> > > STARTUP_MSG: host = slave/172.16.0.32 > >> > > STARTUP_MSG: args = [] > >> > > STARTUP_MSG: version = 0.19.0 > >> > > STARTUP_MSG: build = > >> > > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19-r > >> > > 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 > >> > > / > >> > > 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 0 time(s). > >> > > 2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 1 time(s). > >> > > 2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 2 time(s). > >> > > 2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 3 time(s). > >> > > 2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 4 time(s). > >> > > 2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 5 time(s). > >> > > 2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 6 time(s). 
> >> > > 2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 7 time(s). > >> > > 2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 8 time(s). > >> > > 2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying > >> connect > >> > > to server: master/172.16.0.46:54310. Already tried 9 time(s). > >> > > 2009-02-03 13:00:37,738 ERROR > >> > > org.apache.hadoop.hdfs.server.datanode.DataNode: > java.io.IOException: > >> Call > >> > > to master/172.16.0.46:54310 failed on local exception: No route to > >> host > >> > >at org.apache.hadoop.ipc.Client.call(Client.java:699) > >> > >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) > >> > >at $Proxy4.getProtocolVersion(Unknown Source) > >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) > >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) > >> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) > >> > >at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288) > >> > >at > >> > > > >> > org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258) > >> > >at > >>
Completed jobs not finishing, errors in jobtracker logs
I'm seeing some strange behavior on my cluster. Jobs will be done (that is, all tasks completed), but the job will still be "running". This state seems to persist for minutes, and is really killing my throughput. I'm seeing errors (warnings) in the jobtracker log that look like this: 2009-02-06 12:37:08,425 WARN /: /taskgraph? type=reduce&jobid=job_200902061117_0012: java.lang.ArrayIndexOutOfBoundsException: 3 at org.apache.hadoop.mapred.StatusHttpServer $TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228) at org.apache.hadoop.mapred.StatusHttpServer $TaskGraphServlet.doGet(StatusHttpServer.java:159) at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle (ServletHolder.java:427) at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch (WebApplicationHandler.java:475) at org.mortbay.jetty.servlet.ServletHandler.handle (ServletHandler.java:567) at org.mortbay.http.HttpContext.handle(HttpContext.java:1565) at org.mortbay.jetty.servlet.WebApplicationContext.handle (WebApplicationContext.java:635) at org.mortbay.http.HttpContext.handle(HttpContext.java:1517) at org.mortbay.http.HttpServer.service(HttpServer.java:954) at org.mortbay.http.HttpConnection.service (HttpConnection.java:814) at org.mortbay.http.HttpConnection.handleNext (HttpConnection.java:981) at org.mortbay.http.HttpConnection.handle (HttpConnection.java:831) at org.mortbay.http.SocketListener.handleConnection (SocketListener.java:244) at org.mortbay.util.ThreadedServer.handle (ThreadedServer.java:357) at org.mortbay.util.ThreadPool$PoolThread.run (ThreadPool.java:534) I'm running hadoop-0.19.0. Any ideas? -Bryan
Cannot copy from local file system to DFS
Hey all I was trying to run the word count example on one of the hadoop systems I installed, but when i try to copy the text files from the local file system to the DFS, it throws up the following exception: [mith...@node02 hadoop]$ jps 8711 JobTracker 8805 TaskTracker 8901 Jps 8419 NameNode 8642 SecondaryNameNode [mith...@node02 hadoop]$ cd .. [mith...@node02 mithila]$ ls hadoop hadoop-0.17.2.1.tar hadoop-datastore test [mith...@node02 mithila]$ hadoop/bin/hadoop dfs -copyFromLocal test test 09/02/06 11:26:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) at org.apache.hadoop.ipc.Client.call(Client.java:557) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842) 09/02/06 11:26:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/mithila/test/20417.txt retries left 4 09/02/06 11:26:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) at org.apache.hadoop.ipc.Client.call(Client.java:557) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at 
java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842) 09/02/06 11:26:27 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/mithila/test/20417.txt retries left 3 09/02/06 11:26:28 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145) at org.apache.hadoop.dfs.NameNode.addBlock(NameNod
Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
On Feb 6, 2009, at 11:00 AM, TCK wrote: How well does the read throughput from HDFS scale with the number of data nodes ? For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (ie, with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20 data node cluster ? Possibly. (I don't give a firm answer because the answer depends on the number of chunks and the number of replicas). If there are enough replicas and enough separate reading processes with enough network bandwidth, then yes, your read bandwidth could double. Is this not possible because HDFS doesn't support random seeks? It does for reads. It does not for writes. Trust me, our physicists have what can best be described as "the most god-awful random read patterns you've seen in your life" and they do fine on HDFS. What about if the file was split up into multiple smaller files before placing in the HDFS ? Then things would be less efficient and you'd be less likely to scale. Brian Thanks for your input. -TCK --- On Wed, 2/4/09, Brian Bockelman wrote: From: Brian Bockelman Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? To: core-user@hadoop.apache.org Date: Wednesday, February 4, 2009, 1:50 PM Sounds overly complicated. Complicated usually leads to mistakes :) What about just having a single cluster and only running the tasktrackers on the fast CPUs? No messy cross-cluster transferring. Brian On Feb 4, 2009, at 12:46 PM, TCK wrote: Thanks, Brian. This sounds encouraging for us. What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ? The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster -- the permanent storage cluster would need faster, bigger hard disks, and would need to grow as the total volume of all collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed. Is my line of thinking reasonable? How would this compare to running the map-reduce processing on same cluster as the data is stored in? Which approach is used by most people? Best Regards, TCK --- On Wed, 2/4/09, Brian Bockelman wrote: From: Brian Bockelman Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? To: core-user@hadoop.apache.org Date: Wednesday, February 4, 2009, 1:06 PM Hey TCK, We use HDFS+FUSE solely as a storage solution for a application which doesn't understand MapReduce. We've scaled this solution to around 80Gbps. For 300 processes reading from the same file, we get about 20Gbps. Do consider your data retention policies -- I would say that Hadoop as a storage system is thus far about 99% reliable for storage and is not a backup solution. If you're scared of getting more than 1% of your logs lost, have a good backup solution. I would also add that when you are learning your operational staff's abilities, expect even more data loss. As you gain experience, data loss goes down. I don't believe we've lost a single block in the last month, but it took us 2-3 months of 1%-level losses to get here. 
Brian On Feb 4, 2009, at 11:51 AM, TCK wrote: Hey guys, We have been using Hadoop to do batch processing of logs. The logs get written and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and copies the output back to the NAS. The HDFS is cleaned up at the end of each batch (ie, everything in it is deleted). The problem is that reads off the NAS via NFS don't scale even if we try to scale the copying process by adding more threads to read in parallel. If we instead stored the log files on an HDFS cluster (instead of NAS), it seems like the reads would scale since the data can be read from multiple data nodes at the same time without any contention (except network IO, which shouldn't be a problem). I would appreciate if anyone could share any similar experience they have had with doing parallel reads from a storage HDFS. Also is it a good idea to have a separate HDFS for storage vs for doing the batch processing ? Best Regards, TCK
Re: How to use DBInputFormat?
On Feb 6, 2009, at 7:06 AM, Stefan Podkowinski wrote: Another scenario I just recognized: what about current/"realtime" data? E.g. 'select * from logs where date = today()'. Working with 'offset' may turn out to return different results after the table has been updated and tasks are still pending. Pretty ugly to trace down this condition, after you found out that sometimes your results are just not right.. In fairness, this isn't an issue unique to streaming data into Hadoop from an RDBMS. The "today()" routine is non-functional -- it returns a different answer from the same arguments across multiple calls. Whenever you put that behavior into your computing infrastructure, you make it impossible to reproduce task behavior. This is a big issue in a system that uses speculative execution... Simple answer is to avoid side effects. More complicated answer is to understand what they are, and think about them when you design your data processing flow. mike
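A small illustration of the point: resolve the date once on the submitting client and bake it into the query, so every task, including speculative re-executions, sees the same snapshot. The class, the LogRecord DBWritable, and the column names are hypothetical.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;

    public class LogsJobSetup {
      public static void configure(JobConf job, String snapshotDate) {
        // snapshotDate is computed once by the client, e.g. "2009-02-06",
        // instead of calling today() inside the query on every task.
        DBInputFormat.setInput(job, LogRecord.class,
            "SELECT host, msg FROM logs WHERE log_date = '" + snapshotDate + "'",
            "SELECT COUNT(*) FROM logs WHERE log_date = '" + snapshotDate + "'");
      }
    }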
Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
How well does the read throughput from HDFS scale with the number of data nodes ? For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (ie, with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20 data node cluster ? Is this not possible because HDFS doesn't support random seeks? What about if the file was split up into multiple smaller files before placing in the HDFS ? Thanks for your input. -TCK --- On Wed, 2/4/09, Brian Bockelman wrote: From: Brian Bockelman Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? To: core-user@hadoop.apache.org Date: Wednesday, February 4, 2009, 1:50 PM Sounds overly complicated. Complicated usually leads to mistakes :) What about just having a single cluster and only running the tasktrackers on the fast CPUs? No messy cross-cluster transferring. Brian On Feb 4, 2009, at 12:46 PM, TCK wrote: > > > Thanks, Brian. This sounds encouraging for us. > > What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ? > The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster -- the permanent storage cluster would need faster, bigger hard disks, and would need to grow as the total volume of all collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed. Is my line of thinking reasonable? How would this compare to running the map-reduce processing on same cluster as the data is stored in? Which approach is used by most people? > > Best Regards, > TCK > > > > --- On Wed, 2/4/09, Brian Bockelman wrote: > From: Brian Bockelman > Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? > To: core-user@hadoop.apache.org > Date: Wednesday, February 4, 2009, 1:06 PM > > Hey TCK, > > We use HDFS+FUSE solely as a storage solution for a application which > doesn't understand MapReduce. We've scaled this solution to around > 80Gbps. For 300 processes reading from the same file, we get about 20Gbps. > > Do consider your data retention policies -- I would say that Hadoop as a > storage system is thus far about 99% reliable for storage and is not a backup > solution. If you're scared of getting more than 1% of your logs lost, have > a good backup solution. I would also add that when you are learning your > operational staff's abilities, expect even more data loss. As you gain > experience, data loss goes down. > > I don't believe we've lost a single block in the last month, but it > took us 2-3 months of 1%-level losses to get here. > > Brian > > On Feb 4, 2009, at 11:51 AM, TCK wrote: > >> >> Hey guys, >> >> We have been using Hadoop to do batch processing of logs. The logs get > written and stored on a NAS. Our Hadoop cluster periodically copies a batch of > new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and > copies the output back to the NAS. The HDFS is cleaned up at the end of each > batch (ie, everything in it is deleted). >> >> The problem is that reads off the NAS via NFS don't scale even if we > try to scale the copying process by adding more threads to read in parallel. 
>> >> If we instead stored the log files on an HDFS cluster (instead of NAS), it > seems like the reads would scale since the data can be read from multiple data > nodes at the same time without any contention (except network IO, which > shouldn't be a problem). >> >> I would appreciate if anyone could share any similar experience they have > had with doing parallel reads from a storage HDFS. >> >> Also is it a good idea to have a separate HDFS for storage vs for doing > the batch processing ? >> >> Best Regards, >> TCK >> >> >> >> > > > >
RE: can't read the SequenceFile correctly
Hey Tom, I also got burned by this. Why does BytesWritable.getBytes() return non-valid bytes? Or should we add a BytesWritable.getValidBytes() kind of function? Best Bhupesh

-----Original Message----- From: Tom White [mailto:t...@cloudera.com] Sent: Fri 2/6/2009 2:25 AM To: core-user@hadoop.apache.org Subject: Re: can't read the SequenceFile correctly

Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by BytesWritable#getBytes() to use. Tom

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner wrote: > Hi, > > I have written binary files to a SequenceFile, seemingly successfully, but > when I read them back with the code below, after the first few reads I get the > same number of bytes for the different files. What could go wrong? > > Thank you, > Mark > > reader = new SequenceFile.Reader(fs, path, conf); >Writable key = (Writable) > ReflectionUtils.newInstance(reader.getKeyClass(), conf); >Writable value = (Writable) > ReflectionUtils.newInstance(reader.getValueClass(), conf); >long position = reader.getPosition(); >while (reader.next(key, value)) { >String syncSeen = reader.syncSeen() ? "*" : ""; >byte [] fileBytes = ((BytesWritable) value).getBytes(); >System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, > key, fileBytes.length); >position = reader.getPosition(); // beginning of next record >} >
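A getValidBytes()-style helper like the one Bhupesh asks for is a one-liner; the utility class name here is arbitrary.

    import java.util.Arrays;

    import org.apache.hadoop.io.BytesWritable;

    public final class BytesWritableUtil {
      // Only the first getLength() bytes of the backing array are valid; the
      // rest is leftover buffer content from earlier, possibly larger, values.
      public static byte[] getValidBytes(BytesWritable value) {
        return Arrays.copyOf(value.getBytes(), value.getLength());
      }
    }

In the read loop above, printing value.getLength() instead of getBytes().length would report the real record size.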
Tests stalling in my config
Hello, I recently checked out revision 741606, and am attempting to run the 'test' ant task. I'm new to building hadoop from source, so my problem is most likely somewhere in my own configuration, but I'm at a bit of a loss as to how to trace it. The only environment variable that I've set for this is: JAVA_HOME=/home/mtolan/java/jdk1.6.0_10 (Downloaded from Sun) On running 'ant clean test', I get normal output which ends at test-core: [mkdir] Created dir: /home/mtolan/hadoop/trunk/build/test/data [mkdir] Created dir: /home/mtolan/hadoop/trunk/build/test/logs [copy] Copying 1 file to /home/mtolan/hadoop/trunk/build/test/extraconf [junit] Running org.apache.hadoop.cli.TestCLI This runs for hours, consuming no resources, so I'm not convinced that it's working as intended. What follows are the relevant processes in 'ps', in case there's some detail I'm missing in the way the commands are being formed. PS output: mtolan 25590 32.8 4.7 227392 97712 pts/1Sl+ 09:52 0:17 /home/mtolan/java/jdk1.6.0_10/jre/bin/java -classpath /usr/share/ant/lib/ant-launcher.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/xercesImpl.jar -Dant.home=/usr/share/ant -Dant.library.dir=/usr/share/ant/lib org.apache.tools.ant.launch.Launcher -cp test mtolan 25640 17.0 2.0 701876 43356 pts/1Sl+ 09:52 0:05 /home/mtolan/java/jdk1.6.0_10/jre/bin/java -Xmx512m -Dtest.build.data=/home/mtolan/hadoop/trunk/build/test/data -Dtest.cache.data=/home/mtolan/hadoop/trunk/build/test/cache -Dtest.debug.data=/home/mtolan/hadoop/trunk/build/test/debug -Dhadoop.log.dir=/home/mtolan/hadoop/trunk/build/test/logs -Dtest.src.dir=/home/mtolan/hadoop/trunk/src/test -Dtest.build.extraconf=/home/mtolan/hadoop/trunk/build/test/extraconf -Dhadoop.policy.file=hadoop-policy.xml -Djava.library.path=/home/mtolan/hadoop/trunk/build/native/Linux-i386-32/lib:/home/mtolan/hadoop/trunk/lib/native/Linux-i386-32 -Dinstall.c++.examples=/home/mtolan/hadoop/trunk/build/c++-examples/Linux-i386-32 -classpath 
/home/mtolan/hadoop/trunk/build/test/extraconf:/home/mtolan/hadoop/trunk/build/test/classes:/home/mtolan/hadoop/trunk/src/test:/home/mtolan/hadoop/trunk/build:/home/mtolan/hadoop/trunk/build/examples:/home/mtolan/hadoop/trunk/build/tools:/home/mtolan/hadoop/trunk/src/test/lib/ftplet-api-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/ftpserver-core-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/ftpserver-server-1.0.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/src/test/lib/mina-core-2.0.0-M2-20080407.124109-12.jar:/home/mtolan/hadoop/trunk/build/classes:/home/mtolan/hadoop/trunk/lib/commons-cli-2.0-SNAPSHOT.jar:/home/mtolan/hadoop/trunk/lib/hsqldb-1.8.0.10.jar:/home/mtolan/hadoop/trunk/lib/jsp-2.1/jsp-2.1.jar:/home/mtolan/hadoop/trunk/lib/jsp-2.1/jsp-api-2.1.jar:/home/mtolan/hadoop/trunk/lib/kfs-0.2.2.jar:/home/mtolan/hadoop/trunk/conf:/home/mtolan/.ivy2/cache/commons-logging/commons-logging/jars/commons-logging-1.0.4.jar:/home/mtolan/.ivy2/cache/log4j/log4j/jars/log4j-1.2.15.jar:/home/mtolan/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.0.1.jar:/home/mtolan/.ivy2/cache/commons-codec/commons-codec/jars/commons-codec-1.3.jar:/home/mtolan/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/home/mtolan/.ivy2/cache/net.java.dev.jets3t/jets3t/jars/jets3t-0.6.1.jar:/home/mtolan/.ivy2/cache/commons-net/commons-net/jars/commons-net-1.4.1.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/servlet-api-2.5/jars/servlet-api-2.5-6.1.14.jar:/home/mtolan/.ivy2/cache/oro/oro/jars/oro-2.0.8.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/jetty/jars/jetty-6.1.14.jar:/home/mtolan/.ivy2/cache/org.mortbay.jetty/jetty-util/jars/jetty-util-6.1.14.jar:/home/mtolan/.ivy2/cache/tomcat/jasper-runtime/jars/jasper-runtime-5.5.12.jar:/home/mtolan/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.12.jar:/home/mtolan/.ivy2/cache/commons-el/commons-el/jars/commons-el-1.0.jar:/home/mtolan/.ivy2/cache/junit/junit/jars/junit-3.8.1.jar:/home/mtolan/.ivy2/cache/commons-logging/commons-logging-api/jars/commons-logging-api-1.0.4.jar:/home/mtolan/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.4.3.jar:/home/mtolan/.ivy2/cache/org.eclipse.jdt/core/jars/core-3.1.1.jar:/home/mtolan/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.4.3.jar:/usr/share/ant/lib/junit.jar:/usr/share/ant/lib/ant-launcher.jar:/usr/share/ant/lib/ant.jar:/usr/share/ant/lib/ant-junit.jar org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner org.apache.hadoop.cli.TestCLI filtertrace=true haltOnError=false haltOnFailure=false formatter=org.apache.tools.ant.taskdefs.optional.junit.SummaryJUnitResultFormatter showoutput=false outputtoformatters=true logtestlistenerevents=true formatter=org.apache.tools.ant.taskdefs.optional.junit.PlainJUnitResultFormatter,/home/mtolan/hadoop/trunk/build/test/TEST-org.apache.hadoop.cli.TestCLI.txt crashfile=/home/mtolan/hadoop/trunk/junitvmwatcher1200620086.properties propsfile=/home/mtolan/hadoop/trunk/juni
Re: re : How to use MapFile in C++ program
There is currently no way to read MapFiles in any language other than Java. You can write a JNI wrapper similar to libhdfs. Alternatively, you could write the complete stack from scratch, though this might prove very difficult or impossible. You might want to check the ObjectFile/TFile specifications, for which binary-compatible readers/writers can be developed in any language: https://issues.apache.org/jira/browse/HADOOP-3315 Enis

Anh Vũ Nguyễn wrote: Hi, everybody. I am writing a project in C++ and want to use the power of Hadoop's MapFile class (which belongs to org.apache.hadoop.io). Can you please tell me how I can write C++ code using MapFile, or is there no way to use the org.apache.hadoop.io API from C++ (libhdfs only helps with org.apache.hadoop.fs)? Thanks in advance!
Re: How to use DBInputFormat?
On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg wrote: > Well, that obviously depend on the RDBMS' implementation. And although the > case is not as bad as you describe (otherwise you better ask your RDBMS > vendor for your money back), your point is valid. But then again, a RDBMS is > not designed for that kind of work. Right. Clash of design paradigms. Hey MySQL, I want my money back!! Oh, wait.. Another scenario I just recognized: what about current/"realtime" data? E.g. 'select * from logs where date = today()'. Working with 'offset' may turn out to return different results after the table has been updated and tasks are still pending. Pretty ugly to trace down this condition, after you found out that sometimes your results are just not right.. > What do you mean by "creating splits/map tasks on the fly dynamically"? > > > Fredrik > > > On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote: > >> As far as i understand the main problem is that you need to create >> splits from streaming data with an unknown number of records and >> offsets. Its just the same problem as with externally compressed data >> (.gz). You need to go through the complete stream (or do a table scan) >> to create logical splits. Afterwards each map task needs to seek to >> the appropriate offset on a new stream over again. Very expansive. As >> with compressed files, no wonder only one map task is started for each >> .gz file and will consume the complete file. IMHO the DBInputFormat >> should follow this behavior and just create 1 split whatsoever. >> Maybe a future version of hadoop will allow to create splits/map tasks >> on the fly dynamically? >> >> Stefan >> >> On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg >> wrote: >>> >>> Indeed sir. >>> >>> The implementation was designed like you describe for two reasons. First >>> and >>> foremost to make is as simple as possible for the user to use a JDBC >>> database as input and output for Hadoop. Secondly because of the specific >>> requirements the MapReduce framework brings to the table (split >>> distribution, split reproducibility etc). >>> >>> This design will, as you note, never handle the same amount of data as >>> HBase >>> (or HDFS), and was never intended to. That being said, there are a couple >>> of >>> ways that the current design could be augmented to perform better (and, >>> as >>> in its current form, tweaked, depending on you data and computational >>> requirements). Shard awareness is one way, which would let each >>> database/tasktracker-node execute mappers on data where each split is a >>> single database server for example. >>> >>> If you have any ideas on how the current design can be improved, please >>> do >>> share. >>> >>> >>> Fredrik >>> >>> On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: >>> The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. Thats due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) 2) creating splits by issuing an individual query for each split with a "limit" and "offset" parameter appended to the input sql query Effectively your input query "select * from orders" would become "select * from orders limit offset " and executed until count has been reached. I guess this is not working sql syntax for oracle. 
Stefan 2009/2/4 Amandeep Khurana : > > Adding a semicolon gives me the error "ORA-00911: Invalid character" > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS > wrote: > >> Amandeep, >> "SQL command not properly ended" >> I get this error whenever I forget the semicolon at the end. >> I know, it doesn't make sense, but I recommend giving it a try >> >> Rasit >> >> 2009/2/4 Amandeep Khurana : >>> >>> The same query is working if I write a simple JDBC client and query >>> the >>> database. So, I'm probably doing something wrong in the connection >> >> settings. >>> >>> But the error looks to be on the query side more than the connection >> >> side. >>> >>> Amandeep >>> >>> >>> Amandeep Khurana >>> Computer Science Graduate Student >>> University of California, Santa Cruz >>> >>> >>> On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana >> >> wrote: >>> Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: j
Re: can't read the SequenceFile correctly
Indeed, this was the answer! Thank you, Mark On Fri, Feb 6, 2009 at 4:25 AM, Tom White wrote: > Hi Mark, > > Not all the bytes stored in a BytesWritable object are necessarily > valid. Use BytesWritable#getLength() to determine how much of the > buffer returned by BytesWritable#getBytes() to use. > > Tom > > On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner > wrote: > > Hi, > > > > I have written binary files to a SequenceFile, seemeingly successfully, > but > > when I read them back with the code below, after a first few reads I get > the > > same number of bytes for the different files. What could go wrong? > > > > Thank you, > > Mark > > > > reader = new SequenceFile.Reader(fs, path, conf); > >Writable key = (Writable) > > ReflectionUtils.newInstance(reader.getKeyClass(), conf); > >Writable value = (Writable) > > ReflectionUtils.newInstance(reader.getValueClass(), conf); > >long position = reader.getPosition(); > >while (reader.next(key, value)) { > >String syncSeen = reader.syncSeen() ? "*" : ""; > >byte [] fileBytes = ((BytesWritable) value).getBytes(); > >System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, > > key, fileBytes.length); > >position = reader.getPosition(); // beginning of next > record > >} > > >
Re: How to use DBInputFormat?
Well, that obviously depend on the RDBMS' implementation. And although the case is not as bad as you describe (otherwise you better ask your RDBMS vendor for your money back), your point is valid. But then again, a RDBMS is not designed for that kind of work. What do you mean by "creating splits/map tasks on the fly dynamically"? Fredrik On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote: As far as i understand the main problem is that you need to create splits from streaming data with an unknown number of records and offsets. Its just the same problem as with externally compressed data (.gz). You need to go through the complete stream (or do a table scan) to create logical splits. Afterwards each map task needs to seek to the appropriate offset on a new stream over again. Very expansive. As with compressed files, no wonder only one map task is started for each .gz file and will consume the complete file. IMHO the DBInputFormat should follow this behavior and just create 1 split whatsoever. Maybe a future version of hadoop will allow to create splits/map tasks on the fly dynamically? Stefan On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg wrote: Indeed sir. The implementation was designed like you describe for two reasons. First and foremost to make is as simple as possible for the user to use a JDBC database as input and output for Hadoop. Secondly because of the specific requirements the MapReduce framework brings to the table (split distribution, split reproducibility etc). This design will, as you note, never handle the same amount of data as HBase (or HDFS), and was never intended to. That being said, there are a couple of ways that the current design could be augmented to perform better (and, as in its current form, tweaked, depending on you data and computational requirements). Shard awareness is one way, which would let each database/tasktracker-node execute mappers on data where each split is a single database server for example. If you have any ideas on how the current design can be improved, please do share. Fredrik On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. Thats due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) 2) creating splits by issuing an individual query for each split with a "limit" and "offset" parameter appended to the input sql query Effectively your input query "select * from orders" would become "select * from orders limit offset " and executed until count has been reached. I guess this is not working sql syntax for oracle. Stefan 2009/2/4 Amandeep Khurana : Adding a semicolon gives me the error "ORA-00911: Invalid character" Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS wrote: Amandeep, "SQL command not properly ended" I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try Rasit 2009/2/4 Amandeep Khurana : The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. 
Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana > wrote: Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org .apache .hadoop .mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java: 289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner $Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java: 65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java: 79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown S
Re: Hadoop job using multiple input files
Amandeep Khurana writes: > Is it possible to write a map reduce job using multiple input files? > > For example: > File 1 has data like - Name, Number > File 2 has data like - Number, Address > > Using these, I want to create a third file which has something like - Name, > Address > > How can a map reduce job be written to do this? Have one map job read both files in sequence, and map them to (number, name or address). Then reduce on number. Ian
Re: Hadoop job using multiple input files
You put the files into a common directory, and use that as your input to the MapReduce job. You write a single Mapper class that has an "if" statement examining the map.input.file property, outputting the number as the key for both files, with the address as the value for one and the name for the other. By using a common key (the number), you'll ensure that the name and address make it to the same reducer after the shuffle. In the reducer, you'll then have the relevant information (in the values) you need to create the name, address pair.

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana wrote: > Thanks Jeff... > I am not 100% clear about the first solution you have given. How do I get > the multiple files to be read and then fed into a single reducer? Should I > have multiple mappers in the same class and have different job configs for > them, and run two separate jobs with one outputting the key as (name, number) > and the other outputting the value as (number, address) into the reducer? > Not clear what I'll be doing with the map.input.file here... > > Amandeep > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz > > > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher >wrote: > > > Hey Amandeep, > > > > You can get the file name for a task via the "map.input.file" property. > For > > the join you're doing, you could inspect this property and output (number, > > name) and (number, address) as your (key, value) pairs, depending on the > > file you're working with. Then you can do the combination in your > reducer. > > > > You could also check out the join package in contrib/utils ( > > > > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html > > ), > > but I'd say your job is simple enough that you'll get it done faster with > > the above method. > > > > This task would be a simple join in Hive, so you could consider using > Hive > > to manage the data and perform the join. > > > > Later, > > Jeff > > > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana > wrote: > > > > > Is it possible to write a map reduce job using multiple input files? > > > > > > For example: > > > File 1 has data like - Name, Number > > > File 2 has data like - Number, Address > > > > > > Using these, I want to create a third file which has something like - > > Name, > > > Address > > > > > > How can a map reduce job be written to do this? > > > > > > Amandeep > > > > > > > > > > > > Amandeep Khurana > > > Computer Science Graduate Student > > > University of California, Santa Cruz > > > > > >
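A sketch of the mapper Jeff describes, again on the 0.19-era API. The file-name test and the "n:"/"a:" tags are assumptions for illustration; it expects the two comma-separated files from the original question and pairs with a reducer that strips the tags, like the one sketched earlier in this digest.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private boolean isNameFile;

      public void configure(JobConf job) {
        // map.input.file holds the path of the split this task is reading.
        String path = job.get("map.input.file");
        isNameFile = path.contains("names");  // assumed file-naming convention
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = line.toString().split(",");
        if (isNameFile) {
          // File 1: Name, Number  ->  key = number, value tagged as a name
          output.collect(new Text(fields[1].trim()), new Text("n:" + fields[0].trim()));
        } else {
          // File 2: Number, Address  ->  key = number, value tagged as an address
          output.collect(new Text(fields[0].trim()), new Text("a:" + fields[1].trim()));
        }
      }
    }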
Re: can't read the SequenceFile correctly
Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by BytesWritable#getBytes() to use. Tom On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner wrote: > Hi, > > I have written binary files to a SequenceFile, seemingly successfully, but > when I read them back with the code below, after the first few reads I get the > same number of bytes for the different files. What could go wrong? > > Thank you, > Mark >
> reader = new SequenceFile.Reader(fs, path, conf);
> Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
> Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
> long position = reader.getPosition();
> while (reader.next(key, value)) {
>     String syncSeen = reader.syncSeen() ? "*" : "";
>     byte[] fileBytes = ((BytesWritable) value).getBytes();
>     System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, fileBytes.length);
>     position = reader.getPosition(); // beginning of next record
> }
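A minimal sketch of the fix Tom is pointing at: copy out only the valid prefix of the backing array before using it. The helper class and method names below are made up for illustration:

    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;

    public class ValidBytes {
        // getBytes() returns the whole backing buffer, which is reused between records
        // and may be larger than the current record; only the first getLength() bytes
        // belong to the record that was just read.
        public static byte[] of(BytesWritable value) {
            return Arrays.copyOf(value.getBytes(), value.getLength());
        }
    }

Inside the read loop, byte[] fileBytes = ValidBytes.of((BytesWritable) value); would then report the true size of each record instead of the capacity of the reused buffer, which is why the later reads appear to return the same length.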
Re: Hadoop job using multiple input files
Thanks Jeff... I am not 100% clear about the first solution you have given. How do I get the multiple files to be read and then fed into a single reducer? Should I have multiple mappers in the same class and have different job configs for them, run two separate jobs with one outputting the key as (name,number) and the other outputting the value as (number, address) into the reducer? Not clear what I'll be doing with the map.input.file here... Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher wrote: > Hey Amandeep, > > You can get the file name for a task via the "map.input.file" property. For > the join you're doing, you could inspect this property and output (number, > name) and (number, address) as your (key, value) pairs, depending on the > file you're working with. Then you can do the combination in your reducer. > > You could also check out the join package in contrib/utils ( > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html > ), > but I'd say your job is simple enough that you'll get it done faster with > the above method. > > This task would be a simple join in Hive, so you could consider using Hive > to manage the data and perform the join. > > Later, > Jeff > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana wrote: > > > Is it possible to write a map reduce job using multiple input files? > > > > For example: > > File 1 has data like - Name, Number > > File 2 has data like - Number, Address > > > > Using these, I want to create a third file which has something like - > Name, > > Address > > > > How can a map reduce job be written to do this? > > > > Amandeep > > > > > > > > Amandeep Khurana > > Computer Science Graduate Student > > University of California, Santa Cruz > > >
Re: Hadoop job using multiple input files
Hey Amandeep, You can get the file name for a task via the "map.input.file" property. For the join you're doing, you could inspect this property and output (number, name) and (number, address) as your (key, value) pairs, depending on the file you're working with. Then you can do the combination in your reducer. You could also check out the join package in contrib/utils ( http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html), but I'd say your job is simple enough that you'll get it done faster with the above method. This task would be a simple join in Hive, so you could consider using Hive to manage the data and perform the join. Later, Jeff On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana wrote: > Is it possible to write a map reduce job using multiple input files? > > For example: > File 1 has data like - Name, Number > File 2 has data like - Number, Address > > Using these, I want to create a third file which has something like - Name, > Address > > How can a map reduce job be written to do this? > > Amandeep > > > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz >
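If it helps, here is a hedged sketch of the driver that would tie the mapper and reducer sketched earlier in this thread to a single input directory, again with the old mapred API; the class name and command-line argument layout are made up for illustration:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JoinDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JoinDriver.class);
            conf.setJobName("name-address-join");

            // Both input files live under one directory (args[0]); output goes to args[1].
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            conf.setMapperClass(JoinOnNumber.JoinMapper.class);
            conf.setReducerClass(JoinOnNumber.JoinReducer.class);
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(Text.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            JobClient.runJob(conf);
        }
    }

Because both files sit under the same input directory, the framework assigns each split to the one mapper class, and the map.input.file check decides how each record is tagged.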
Hadoop job using multiple input files
Is it possible to write a map reduce job using multiple input files? For example: File 1 has data like - Name, Number File 2 has data like - Number, Address Using these, I want to create a third file which has something like - Name, Address How can a map reduce job be written to do this? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: How to use MapFile in C++ program
Hi, everybody. I am writing a project in C++ and want to use the power of Hadoop's MapFile class (which belongs to org.apache.hadoop.io). Can you please tell me how I can write code in C++ that uses MapFile, or is there no way to use the org.apache.hadoop.io API from C++ (libhdfs only helps with org.apache.hadoop.fs)? Thanks in advance!
How to use MapFile in C++
Hi, everybody. I am writing a project in C++ and want to use the features of Hadoop's MapFile class (which belongs to org.apache.hadoop.io). Can you please tell me how I can write code in C++ that uses MapFile, or is there no way to use the org.apache.hadoop.io API from C++ (libhdfs only helps with org.apache.hadoop.fs)? Thanks in advance!