Since the default split size is 64MB, Hadoop should run this on 352 mappers. I presume that the files are simple text files...
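The mapper estimate above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes "64MB" means 64 MiB and that the file is plain, splittable text. The function and constant names are hypothetical; the file size is the one shown in the `hadoop dfs -ls` output quoted in this thread. It estimates splits only and does not explain why the quoted job status reports a single launched map task.

```python
import math

# File size in bytes, from the `hadoop dfs -ls` output quoted in this thread.
FILE_SIZE_BYTES = 22_661_980_380

# Assumption: a 64 MB split means 64 MiB (64 * 1024 * 1024 bytes).
SPLIT_SIZE_BYTES = 64 * 1024 * 1024


def estimate_splits(file_size: int, split_size: int) -> int:
    """Rough split count: one split per full or partial split-sized chunk."""
    return math.ceil(file_size / split_size)


# Estimate from the exact byte count.
exact_estimate = estimate_splits(FILE_SIZE_BYTES, SPLIT_SIZE_BYTES)

# Estimate from the rounded "22GB" figure, read here as 22 GiB.
rounded_estimate = estimate_splits(22 * 1024**3, SPLIT_SIZE_BYTES)

print(exact_estimate, rounded_estimate)
```

Under these assumptions the rounded 22 GiB figure gives 352, matching the number in the message, while the exact byte count gives a slightly smaller estimate. Either way the expectation is several hundred map tasks, not one.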
Ashish

-----Original Message-----
From: Touretsky, Gregory [mailto:[email protected]]
Sent: Monday, October 12, 2009 11:23 AM
To: Ashish Thusoo
Cc: [email protected]; [email protected]
Subject: RE: Hive and MapReduce

It's 22GB file with ~60M+ records:

itstl0016> $HADOOP_HOME/bin/hadoop dfs -ls /user/hive/warehouse/start/stodhuge.out
Found 1 items
-rw-r--r--   3 gtouret supergroup 22661980380 2009-10-08 19:48 /user/hive/warehouse/start/stodhuge.out

And the cluster consists of 16 dual-core nodes.

hive> INSERT OVERWRITE TABLE start_oct30
    > select start.* from start
    > where start.SampleTime >= '2009-10-29' AND start.SampleTime <= '2009-11-01';
Total MapReduce jobs = 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_200910121100_0002, Tracking URL = http://itstl0016.iil.intel.com:50030/jobdetails.jsp?jobid=job_200910121100_0002
Kill Command = /nfs/iil/disks/rep_tests_gtouret01/hadoop/bin/hadoop job -Dmapred.job.tracker=itstl0016.iil.intel.com:9001 -kill job_200910121100_0002
2009-10-12 11:49:40,865 map = 0%, reduce = 0%
2009-10-12 11:50:12,065 map = 1%, reduce = 0%
[ SKIPPED MANY LINES ]
2009-10-12 12:08:37,648 map = 99%, reduce = 0%
2009-10-12 12:08:52,782 map = 100%, reduce = 0%
2009-10-12 12:08:58,811 map = 100%, reduce = 100%
Ended Job = job_200910121100_0002
Moving data to: hdfs://itstl0016.iil.intel.com:9000/tmp/hive-gtouret/329345984/10000
Loading data to table start_oct30
0 Rows loaded to start_oct30
OK
Time taken: 1160.775 seconds
hive>

itstl0016> $HADOOP_HOME/bin/hadoop job -status job_200910121100_0002

Job: job_200910121100_0002
file: hdfs://itstl0016.iil.intel.com:9000/tmp/hadoop-gtouret/mapred/system/job_200910121100_0002/job.xml
tracking URL: http://itstl0016.iil.intel.com:50030/jobdetails.jsp?jobid=job_200910121100_0002
map() completion: 1.0
reduce() completion: 1.0
Counters: 11
    Job Counters
        Launched map tasks=1
        Data-local map tasks=1
    org.apache.hadoop.hive.ql.exec.FilterOperator$Counter
        PASSED=0
        FILTERED=63916590
    org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum
        TABLE_ID_1_ROWCOUNT=0
    FileSystemCounters
        HDFS_BYTES_READ=22663352877
    org.apache.hadoop.hive.ql.exec.MapOperator$Counter
        DESERIALIZE_ERRORS=0
    Map-Reduce Framework
        Map input records=63916590
        Spilled Records=0
        Map input bytes=22615687168
        Map output records=0

-- Gregory

-----Original Message-----
From: Ashish Thusoo [mailto:[email protected]]
Sent: Monday, October 12, 2009 8:04 PM
To: Touretsky, Gregory
Cc: [email protected]; [email protected]
Subject: RE: Hive and MapReduce

adding hive-user and hive-dev lists. And removing the common mailing list..

Can you elaborate a bit on the datasize. By default Hive should just be relying on hadoop to give you the number of mappers depending on the number of splits you have in your data.

Ashish

-----Original Message-----
From: Touretsky, Gregory [mailto:[email protected]]
Sent: Monday, October 12, 2009 3:02 AM
To: Touretsky, Gregory; [email protected]
Subject: RE: Hive and MapReduce

Ok, the patch below actually works. Re-built Hadoop cluster and everything works now.
Now I have to understand how to force Hive to run >1 mapper for complicated query on the large table...

From: Touretsky, Gregory
Sent: Sunday, October 11, 2009 4:39 PM
To: [email protected]
Cc: Touretsky, Gregory
Subject: Hive and MapReduce

Hi,

I'm running Hadoop 0.20.1 and Hive (checked out revision 824063).
Direct MapReduce task succeeds, but Map task created by Hive fails:

hive> select * from pokes where foo>100;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_200910111626_0001, Tracking URL = http://itstl0016.iil.intel.com:50030/jobdetails.jsp?jobid=job_200910111626_0001
Kill Command = /nfs/iil/disks/rep_tests_gtouret01/hadoop/bin/hadoop job -Dmapred.job.tracker=itstl0016.iil.intel.com:9001 -kill job_200910111626_0001
2009-10-11 04:26:57,844 map = 100%, reduce = 100%
Ended Job = job_200910111626_0001 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

From the logs/hadoop-UUUU-jobtracker-XXXX.iil.intel.com.log:

2009-10-11 16:26:56,829 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_200910111626_0001
2009-10-11 16:26:57,091 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_200910111626_0001 = 13. Number of splits = 1
2009-10-11 16:26:57,225 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.IllegalArgumentException: Network location name contains /: /IDC1-DC201/WE/34
(I've had the same issue with the /default_rack)
        at org.apache.hadoop.net.NodeBase.set(NodeBase.java:75)
        at org.apache.hadoop.net.NodeBase.<init>(NodeBase.java:57)
        at org.apache.hadoop.mapred.JobTracker.addHostToNodeMapping(JobTracker.java:2390)
        at org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:2384)
        at org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:349)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:450)
        at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3147)
        at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2009-10-11 16:26:57,225 INFO org.apache.hadoop.mapred.JobTracker: Failing job job_200910111626_0001
2009-10-11 16:26:57,866 INFO org.apache.hadoop.mapred.JobTracker: Killing job job_200910111626_0001

Any suggestion?
I saw patches in https://issues.apache.org/jira/browse/HADOOP-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712524#action_12712524, but I can't apply all of them cleanly to my Hadoop sources...

Thanks,
Gregory

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
