Re: Are hadoop fs commands serial or parallel
Thanks Harsh! That means basically both the APIs as well as the hadoop client commands allow only serial writes. I was wondering what other ways there are to write data in parallel to HDFS, other than using multiple parallel threads. Thanks, JJ Sent from my iPhone On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote: Hello, Adding to Joey's response, copyFromLocal's current implementation is serial given a list of files. On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Joey! I will try to find out about copyFromLocal. Looks like the Hadoop APIs write serially, as you pointed out. Thanks, -JJ On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote: The sequence file writer definitely does it serially, as you can only ever write to the end of a file in Hadoop. Doing copyFromLocal could write multiple files in parallel (I'm not sure if it does or not), but a single file would be written serially. -Joey On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, My question is: when I run a command from an HDFS client, e.g. hadoop fs -copyFromLocal, or create a sequence file writer in Java code and append key/values to it through the Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel? Thanks in advance, -JJ -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Harsh J
Re: Error in starting tasktracker
Even though the log says the folder file:/usr/lib/hadoop-0.20/logs/history/done is created, I cannot see the folder in the directory. Is that the root cause of the error . Any thoughts ? 2011-05-18 09:18:23,177 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030 2011-05-18 09:18:23,177 INFO org.mortbay.log: jetty-6.1.26 2011-05-18 09:18:24,574 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50030 2011-05-18 09:18:24,575 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030 2011-05-18 09:18:25,970 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s). 2011-05-18 09:18:27,013 INFO org.apache.hadoop.mapred.JobTracker: Cleaning up the system directory 2011-05-18 09:18:27,355 INFO org.apache.hadoop.mapred.JobHistory: Creating DONE folder at file:/usr/lib/hadoop-0.20/logs/history/done 2011-05-18 09:18:27,450 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2011-05-18 09:18:27,680 WARN org.apache.hadoop.mapred.JobTracker: Error starting tracker: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/var/log/hadoop-0 .20/history/done': No such file or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) On 17 May 2011 18:32, Subhramanian, Deepak deepak.subhraman...@newsint.co.uk wrote: I reinstalled everything and am able to start everything other than the jobtracker. Jobtracker still gives the port in use even though I verified that the port is not running using netstat. 
ipedited:/usr/lib/hadoop-0.20/logs/history # /usr/java/jdk1.6.0_25/bin/jps 7435 SecondaryNameNode 7517 TaskTracker 7361 NameNode 7632 Jps 1872 Bootstrap 7221 DataNode 2011-05-17 17:25:10,277 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: / STARTUP_MSG: Starting JobTracker STARTUP_MSG: host = ipedited STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2-cdh3u0 STARTUP_MSG: build = -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14; compiled by 'hudson' on Fri Mar 25 20:19:33 PDT 2011 / 2011-05-17 17:25:10,895 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2011-05-17 17:25:10,897 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMem ForReduceTasks) (-1, -1, -1, -1) 2011-05-17 17:25:10,899 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list 2011-05-17 17:25:10,961 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScan Interval=60 min(s) 2011-05-17 17:25:10,961 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2011-05-17 17:25:11,106 INFO org.apache.hadoop.mapred.JobTracker: Starting jobtracker with owner as mapred 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8021 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=8021 2011-05-17 17:25:11,215 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=JobTracker, port=8021 2011-05-17 17:25:11,275 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2011-05-17 17:25:11,407 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter) 2011-05-17 17:25:11,439 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50030 2011-05-17 17:25:11,440 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50030 webServer.getConnectors()[0].getLocalPort() returned 50030 2011-05-17 17:25:11,440 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030 2011-05-17 17:25:11,440 INFO org.mortbay.log: jetty-6.1.26 2011-05-17 17:25:12,504 INFO org.mortbay.log:
metrics2 ganglia monitoring
We have a 2-rack hadoop cluster with ganglia 3.0 monitoring on all stations, both on the native OS and within hadoop. We want to upgrade to hadoop 0.20.203, but with the migration to metrics2 we need help configuring the metrics to continue ganglia monitoring. All tasktrackers/datanodes push unicast UDP upstream to a central gmond daemon on their rack that is then polled by a single gmetad daemon for the cluster. The current metrics file includes entries similar to the following for all contexts: # Configuration of the dfs context for ganglia dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=25 dfs.servers=xsrv1.cs.roosevelt.edu:8670 We have reviewed the package documentation for metrics2 but the examples and explanation are not helpful. Any assistance in the proper configuration of the hadoop-metrics2.properties file to support our current ganglia configuration would be appreciated. Eric
Re: How do you run HPROF locally?
Not familiar with this setup, but I assume this is using the local runner, which simply launches the job in the same process as your program. Therefore no new JVMs are spun up, so the hprof settings in the configuration never apply. The simplest way to fix this is probably to just set the -agentlib:... directly on the JVM of your local program, which will include the Map/Reduce processing in that process. On 5/17/11 6:51 PM, Mark question markq2...@gmail.com wrote: or conf.setBoolean("mapred.task.profile", true); Mark On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com wrote: I usually do this setting inside my Java program (in the run function) as follows: JobConf conf = new JobConf(this.getConf(), My.class); conf.set("mapred.task.profile", "true"); then I'll see some output files in that same working directory. Hope that helps, Mark On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote: I am running a Hadoop Java program in local single-JVM mode via an IDE (IntelliJ). I want to do performance profiling of it. Following the instructions in chapter 5 of *Hadoop: The Definitive Guide*, I added the following properties to my job configuration file. <property> <name>mapred.task.profile</name> <value>true</value> </property> <property> <name>mapred.task.profile.params</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> <property> <name>mapred.task.profile.maps</name> <value>0-</value> </property> <property> <name>mapred.task.profile.reduces</name> <value>0-</value> </property> With these properties, the job runs as before, but I don't see any profiler output. I also tried simply setting <property> <name>mapred.child.java.opts</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> Again, no profiler output. I know I have HPROF installed because running java -agentlib:hprof=help at the command prompt produces a result. Is it possible to run HPROF on a local Hadoop job? Am I doing something wrong?
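As a concrete illustration of Ryan's suggestion, the agent flag goes straight onto the JVM that runs the driver (for example in the IDE run configuration's VM options), not into the job configuration. A rough sketch; the jar, class, and output file names here are only placeholders:

    java -agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=prof.output \
         -cp my-job.jar com.example.MyLocalDriver

With the local runner, the map and reduce code executes inside this same JVM, so the profile written to prof.output covers the MapReduce work as well.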
Re: Error in starting tasktracker
Hello Deepak, Since your problems appear to be more related to Cloudera's distribution including Apache Hadoop, I'm moving the mail discussion to the cdh-u...@cloudera.org list. I've bcc'd common-user@hadoop.apache.org. Please continue the discussion on cdh-u...@cloudera.org for this. The folder issue might just be the reason the JT fails to start with such a status (I believe I've seen it happen for a TT once). Ensure that the permissions are set right for the logs folder (I think it ought to be rwx-rwx-r-x). Is this a tarball or a package install? On Wed, May 18, 2011 at 3:14 PM, Subhramanian, Deepak deepak.subhraman...@newsint.co.uk wrote: Even though the log says the folder file:/usr/lib/hadoop-0.20/logs/history/done is created, I cannot see the folder in the directory. Is that the root cause of the error . Any thoughts ? 2011-05-18 09:18:23,177 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030 2011-05-18 09:18:23,177 INFO org.mortbay.log: jetty-6.1.26 2011-05-18 09:18:24,574 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50030 2011-05-18 09:18:24,575 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030 2011-05-18 09:18:25,970 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s). 2011-05-18 09:18:27,013 INFO org.apache.hadoop.mapred.JobTracker: Cleaning up the system directory 2011-05-18 09:18:27,355 INFO org.apache.hadoop.mapred.JobHistory: Creating DONE folder at file:/usr/lib/hadoop-0.20/logs/history/done 2011-05-18 09:18:27,450 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2011-05-18 09:18:27,680 WARN org.apache.hadoop.mapred.JobTracker: Error starting tracker: org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/var/log/hadoop-0 .20/history/done': No such file or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) On 17 May 2011 18:32, Subhramanian, Deepak deepak.subhraman...@newsint.co.uk wrote: I reinstalled everything and am able to start everything other than the jobtracker. Jobtracker still gives the port in use even though I verified that the port is not running using netstat. 
ipedited:/usr/lib/hadoop-0.20/logs/history # /usr/java/jdk1.6.0_25/bin/jps 7435 SecondaryNameNode 7517 TaskTracker 7361 NameNode 7632 Jps 1872 Bootstrap 7221 DataNode 2011-05-17 17:25:10,277 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: / STARTUP_MSG: Starting JobTracker STARTUP_MSG: host = ipedited STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2-cdh3u0 STARTUP_MSG: build = -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14; compiled by 'hudson' on Fri Mar 25 20:19:33 PDT 2011 / 2011-05-17 17:25:10,895 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2011-05-17 17:25:10,897 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMem ForReduceTasks) (-1, -1, -1, -1) 2011-05-17 17:25:10,899 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list 2011-05-17 17:25:10,961 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScan Interval=60 min(s) 2011-05-17 17:25:10,961 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2011-05-17 17:25:11,106 INFO org.apache.hadoop.mapred.JobTracker: Starting jobtracker with owner as mapred 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8021 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=8021 2011-05-17 17:25:11,215 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=JobTracker, port=8021 2011-05-17 17:25:11,275 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2011-05-17
Re: Exception in thread AWT-EventQueue-0 java.lang.NullPointerException
@Steve Loughran: That's the way I repaired my program. @Robert Evans @Harsh J: Thanks for your reply. I'm so happy. Thanks so much.
Re: Are hadoop fs commands serial or parallel
Kinda clunky, but you could do this via shell: for FILE in $LIST_OF_FILES ; do hadoop fs -copyFromLocal $FILE $DEST_PATH ; done If doing this via the Java API, then yes, you will have to use multiple threads. On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Harsh! That means basically both the APIs as well as the hadoop client commands allow only serial writes. I was wondering what other ways there are to write data in parallel to HDFS, other than using multiple parallel threads. Thanks, JJ Sent from my iPhone On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote: Hello, Adding to Joey's response, copyFromLocal's current implementation is serial given a list of files. On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Joey! I will try to find out about copyFromLocal. Looks like the Hadoop APIs write serially, as you pointed out. Thanks, -JJ On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote: The sequence file writer definitely does it serially, as you can only ever write to the end of a file in Hadoop. Doing copyFromLocal could write multiple files in parallel (I'm not sure if it does or not), but a single file would be written serially. -Joey On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, My question is: when I run a command from an HDFS client, e.g. hadoop fs -copyFromLocal, or create a sequence file writer in Java code and append key/values to it through the Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel? Thanks in advance, -JJ -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Harsh J
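Patrick's loop copies the files one after another. To actually overlap the uploads from the shell, each copy can be backgrounded and then waited on; a rough sketch, reusing the same hypothetical LIST_OF_FILES and DEST_PATH variables (each backgrounded client streams its own file, though any single file is still written serially, as Joey noted):

    for FILE in $LIST_OF_FILES ; do
      hadoop fs -copyFromLocal "$FILE" "$DEST_PATH" &   # one client process per file
    done
    wait   # block until all background copies have finished

In practice you would cap the number of concurrent copies (for example with xargs -P) so a long file list does not launch one JVM per file at once.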
Hadoop and WikiLeaks
http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the "innovator of the year" award from the UK's Guardian newspaper, where it was described as having the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad. Does this copy text bother anyone else? Sure, winning any award is great, but does Hadoop want to be associated with innovation like WikiLeaks? Edward
Re: Are hadoop fs commands serial or parallel
Thanks Patrick ! This would work if directory is to be uploaded but for streaming, I guess, this would not work. Sent from my iPhone On May 18, 2011, at 9:39 AM, Patrick Angeles patr...@cloudera.com wrote: kinda clunky but you could do this via shell: for $FILE in $LIST_OF_FILES ; do hadoop fs -copyFromLocal $FILE $DEST_PATH done If doing this via the Java API, then, yes you will have to use multiple threads. On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.comwrote: Thanks harsh ! That means basically both APIs as well as hadoop client commands allow only serial writes. I was wondering what could be other ways to write data in parallel to HDFS other than using multiple parallel threads. Thanks, JJ Sent from my iPhone On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote: Hello, Adding to Joey's response, copyFromLocal's current implementation is serial given a list of files. On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Joey ! I will try to find out abt copyFromLocal. Looks like Hadoop Apis write serially as you pointed out. Thanks, -JJ On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote: The sequence file writer definitely does it serially as you can only ever write to the end of a file in Hadoop. Doing copyFromLocal could write multiple files in parallel (I'm not sure if it does or not), but a single file would be written serially. -Joey On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, My question is when I run a command from hdfs client, for eg. hadoop fs -copyFromLocal or create a sequence file writer in java code and append key/values to it through Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel ? Thanks in advance, -JJ -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Harsh J
Re: Hadoop and WikiLeaks
Yes! -Pete Edward Capriolo edlinuxg...@gmail.com wrote: http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the "innovator of the year" award from the UK's Guardian newspaper, where it was described as having the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad. Does this copy text bother anyone else? Sure, winning any award is great, but does Hadoop want to be associated with innovation like WikiLeaks? Edward -- 1. If a man is standing in the middle of the forest talking, and there is no woman around to hear him, is he still wrong? 2. Behind every great woman... Is a man checking out her ass 3. I am not a member of any organized political party. I am a Democrat.* 4. Diplomacy is the art of saying Nice doggie until you can find a rock.* 5. A process is what you need when all your good people have left. *Will Rogers
Re: Hadoop and WikiLeaks
You are, perhaps, aware that now your name will be associated with WikiLeaks too because this mailing list is archived and publicly searchable? I think you are a hero, man! -- Take care, Konstantin (Cos) Boudnik 2CAC 8312 4870 D885 8616 6115 220F 6980 1F27 E622 Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any company the author might be affiliated with at the moment of writing. On Wed, May 18, 2011 at 09:53, Edward Capriolo edlinuxg...@gmail.com wrote: http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the innovator of the yearaward from the UK's Guardian newspaper, where it was described as had the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad. Does this copy text bother anyone else? Sure winning any award is great but does hadoop want to be associated with innovation like WikiLeaks? Edward
Re: How do you run HPROF locally?
Ryan Brush had the right answer. If I add the following VM parameter -agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=prof.output a profiling file called prof.output gets created in my working directory. Because I'm running locally, both the mapred.task.profile* options and mapred.child.java.opts are ignored. This makes sense now. Thanks. On Wed, May 18, 2011 at 6:34 AM, Brush,Ryan rbr...@cerner.com wrote: Not familiar with this setup, but I assume this is using the local runner, which simply launches the job in the same process as your program. Therefore no new JVMs are spun up, so the hprof settings in the configuration never apply. The simplest way to fix this is probably to just set the -agentlib:... directly on the JVM of your local program, which will include the Map/Reduce processing in that process. On 5/17/11 6:51 PM, Mark question markq2...@gmail.com wrote: or conf.setBoolean("mapred.task.profile", true); Mark On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com wrote: I usually do this setting inside my Java program (in the run function) as follows: JobConf conf = new JobConf(this.getConf(), My.class); conf.set("mapred.task.profile", "true"); then I'll see some output files in that same working directory. Hope that helps, Mark On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote: I am running a Hadoop Java program in local single-JVM mode via an IDE (IntelliJ). I want to do performance profiling of it. Following the instructions in chapter 5 of *Hadoop: The Definitive Guide*, I added the following properties to my job configuration file. <property> <name>mapred.task.profile</name> <value>true</value> </property> <property> <name>mapred.task.profile.params</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> <property> <name>mapred.task.profile.maps</name> <value>0-</value> </property> <property> <name>mapred.task.profile.reduces</name> <value>0-</value> </property> With these properties, the job runs as before, but I don't see any profiler output. I also tried simply setting <property> <name>mapred.child.java.opts</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> Again, no profiler output. I know I have HPROF installed because running java -agentlib:hprof=help at the command prompt produces a result. Is it possible to run HPROF on a local Hadoop job? Am I doing something wrong?
current line number as key?
Hello, I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of a file with 10^10 lines). I need an InputFormat which gives the Mapper the line number as the key. I tried to implement a RecordReader, but I can't get line information from the InputSplit. Any solution to this??? Thanks in advance!!! -- View this message in context: http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: current line number as key?
Hi, It is hard to pick up certain lines of a text file - globally, I mean. Remember that the file is split according to its size (byte boundaries), not lines. So, it is possible to keep track of the lines inside a split, but globally for the whole file, assuming it is split among map tasks... I don't think it is possible. I am new to hadoop, but that is my take on it. Alexandra On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote: Hello, I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of a file with 10^10 lines). I need an InputFormat which gives the Mapper the line number as the key. I tried to implement a RecordReader, but I can't get line information from the InputSplit. Any solution to this??? Thanks in advance!!! -- View this message in context: http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: metrics2 ganglia monitoring
The Ganglia plugin is not yet ported to metrics v2 (because Y! don't use Ganglia; see also the discussion links on HADOOP-6728). It shouldn't be hard to do a port though, as the new sink interface is actually simpler. On Wed, May 18, 2011 at 4:07 AM, Eric Berkowitz eberkow...@roosevelt.edu wrote: We have a 2-rack hadoop cluster with ganglia 3.0 monitoring on all stations, both on the native OS and within hadoop. We want to upgrade to hadoop 0.20.203, but with the migration to metrics2 we need help configuring the metrics to continue ganglia monitoring. All tasktrackers/datanodes push unicast UDP upstream to a central gmond daemon on their rack that is then polled by a single gmetad daemon for the cluster. The current metrics file includes entries similar to the following for all contexts: # Configuration of the dfs context for ganglia dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=25 dfs.servers=xsrv1.cs.roosevelt.edu:8670 We have reviewed the package documentation for metrics2 but the examples and explanation are not helpful. Any assistance in the proper configuration of the hadoop-metrics2.properties file to support our current ganglia configuration would be appreciated. Eric
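For reference while a Ganglia sink is still missing, the metrics2 wiring in hadoop-metrics2.properties uses a prefix.sink.instance.option pattern instead of the old per-context classes. A rough sketch using the FileSink bundled with metrics2 (the file names and the 25-second period are illustrative; a ported Ganglia sink would be wired the same way, with its own class and a servers option):

    # default sampling period, in seconds
    *.period=25
    # send every daemon's metrics to a local file via the bundled FileSink
    *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
    namenode.sink.file.filename=namenode-metrics.out
    datanode.sink.file.filename=datanode-metrics.out
    tasktracker.sink.file.filename=tasktracker-metrics.out

This at least keeps metrics flowing until a Ganglia port lands.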
Re: current line number as key?
To the best of my knowledge, the only way to do this is if you have fixed-width columns. Think about it this way: as Alexandra mentioned, you only get byte offsets... if you split 1 file among 50 mappers, they have the offset, but they have no idea what that offset means with respect to the other splits, as they do not know how many lines came before. Finding lines inherently involves a full scan, unless a) the width is fixed or b) you do a job beforehand to explicitly put the line number in the document. I would think about what you want to do, and whether or not it is possible to avoid making it line dependent, or if you can make each row a fixed number of bytes... 2011/5/18 Alexandra Anghelescu axanghele...@gmail.com Hi, It is hard to pick up certain lines of a text file - globally, I mean. Remember that the file is split according to its size (byte boundaries), not lines. So, it is possible to keep track of the lines inside a split, but globally for the whole file, assuming it is split among map tasks... I don't think it is possible. I am new to hadoop, but that is my take on it. Alexandra On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote: Hello, I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of a file with 10^10 lines). I need an InputFormat which gives the Mapper the line number as the key. I tried to implement a RecordReader, but I can't get line information from the InputSplit. Any solution to this??? Thanks in advance!!! -- View this message in context: http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: current line number as key?
You are correct that there is no easy and efficient way to do this. You could create a new InputFormat that derives from FileInputFormat and makes it so the files do not split, and then have a RecordReader that keeps track of line numbers. But then each file is read by only one mapper. Alternatively, you could assume that the split is going to be done deterministically and do two passes: one where you count the number of lines in each partition, and a second that then assigns the lines based off of the output from the first. But that requires two map passes. --Bobby Evans On 5/18/11 1:53 PM, Alexandra Anghelescu axanghele...@gmail.com wrote: Hi, It is hard to pick up certain lines of a text file - globally, I mean. Remember that the file is split according to its size (byte boundaries), not lines. So, it is possible to keep track of the lines inside a split, but globally for the whole file, assuming it is split among map tasks... I don't think it is possible. I am new to hadoop, but that is my take on it. Alexandra On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote: Hello, I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of a file with 10^10 lines). I need an InputFormat which gives the Mapper the line number as the key. I tried to implement a RecordReader, but I can't get line information from the InputSplit. Any solution to this??? Thanks in advance!!! -- View this message in context: http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
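Bobby's first approach can be sketched roughly as follows (an untested illustration against the new mapreduce API; the class name and the choice to wrap LineRecordReader are mine, not code from Hadoop or from the poster). Because isSplitable() returns false, a single mapper reads the whole file in order, so a simple counter yields a global, 1-based line number as the key:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Non-splittable input format whose keys are line numbers instead of byte offsets.
    public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;  // whole file goes to one mapper, so line numbers are global per file
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
          private final LineRecordReader reader = new LineRecordReader();
          private long lineNumber = 0;

          @Override
          public void initialize(InputSplit s, TaskAttemptContext c)
              throws IOException, InterruptedException {
            reader.initialize(s, c);
          }

          @Override
          public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!reader.nextKeyValue()) return false;
            lineNumber++;  // count lines ourselves instead of using the byte offset
            return true;
          }

          @Override
          public LongWritable getCurrentKey() { return new LongWritable(lineNumber); }

          @Override
          public Text getCurrentValue() throws IOException, InterruptedException {
            return reader.getCurrentValue();
          }

          @Override
          public float getProgress() throws IOException, InterruptedException {
            return reader.getProgress();
          }

          @Override
          public void close() throws IOException { reader.close(); }
        };
      }
    }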
Distcp failure - Server returned HTTP response code: 500
Hi Guys, I am trying to copy hadoop data from one cluster to another but I keep getting this error: Server returned HTTP response code: 500 for URL. My distcp command is: scripts/hadoop.sh distcp hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17 hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot In here I have day=2011-05-17 in my file path. I found this online: https://issues.apache.org/jira/browse/HDFS-31 Does this issue still exist? Could it be the reason for my job failure? Job Error log: 2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0 2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17 : java.io.IOException: Server returned HTTP response code: 500 for URL: http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Copied: 0 Skipped: 5 Failed: 1 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task Any help is appreciated. Thanks, Sonia
Reducer granularity and starvation
I'm working on a cluster with 360 reducer slots. I've got a big job, so when I launch it I follow the recommendations in the Hadoop documentation and set mapred.reduce.tasks=350, i.e. slightly less than the available number of slots. The problem is that my reducers can still take a long time (2-4 hours) to run. So I end up grabbing a big slab of reducers and starving everybody else out. I've got my priority set to VERY_LOW and mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done everything I can do on the job parameters front. Currently there isn't a way to make the individual reducers run faster, so I'm trying to figure out the best way to run my job so that it plays nice with other users of the cluster. My rule of thumb has always been to not try and do any scheduling myself, but let Hadoop handle it for me, but I don't think that works in this scenario. Questions: 1. Am I correct in thinking that long reducer times just mess up Hadoop's scheduling granularity to a degree that it can't handle? Is 4-hour reducer outside the normal operating range of Hadoop? 2. Is there any way to stagger task launches? (Aside from manually.) 3. What if I set mapred.reduce.tasks to be some value much, much larger than the number of available reducer slots, like 100,000. Will that make the amount of work sent to each reducer smaller (hence increasing the scheduler granularity) or will it have no effect? 4. In this scenario, do I just have to reconcile myself to the fact that my job is going to squat on a block of reducers no matter what and set mapred.reduce.tasks to something much less than the available number of slots?
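For reference, the settings described above can be supplied per job on the command line rather than in the site configuration; a sketch of how such a run might be launched (the jar and class names are placeholders, and the values are simply the ones quoted in the question):

    hadoop jar my-job.jar com.example.MyBigJob \
        -D mapred.reduce.tasks=350 \
        -D mapred.job.priority=VERY_LOW \
        -D mapred.reduce.slowstart.completed.maps=0.90 \
        input/ output/

The -D form is picked up only if the driver goes through ToolRunner/GenericOptionsParser; otherwise the same properties can be set on the JobConf in code.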
Re: Reducer granularity and starvation
W.P., Hard to help out without knowing more about the characteristics of your data? How many keys are you expecting? How many values per key? Cheers James. On 2011-05-18, at 3:25 PM, W.P. McNeill wrote: I'm working on a cluster with 360 reducer slots. I've got a big job, so when I launch it I follow the recommendations in the Hadoop documentation and set mapred.reduce.tasks=350, i.e. slightly less than the available number of slots. The problem is that my reducers can still take a long time (2-4 hours) to run. So I end up grabbing a big slab of reducers and starving everybody else out. I've got my priority set to VERY_LOW and mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done everything I can do on the job parameters front. Currently there isn't a way to make the individual reducers run faster, so I'm trying to figure out the best way to run my job so that it plays nice with other users of the cluster. My rule of thumb has always been to not try and do any scheduling myself, but let Hadoop handle it for me, but I don't think that works in this scenario. Questions: 1. Am I correct in thinking that long reducer times just mess up Hadoop's scheduling granularity to a degree that it can't handle? Is 4-hour reducer outside the normal operating range of Hadoop? 2. Is there any way to stagger task launches? (Aside from manually.) 3. What if I set mapred.reduce.tasks to be some value much, much larger than the number of available reducer slots, like 100,000. Will that make the amount of work sent to each reducer smaller (hence increasing the scheduler granularity) or will it have no effect? 4. In this scenario, do I just have to reconcile myself to the fact that my job is going to squat on a block of reducers no matter what and set mapred.reduce.tasks to something much less than the available number of slots?
Re: Distcp failure - Server returned HTTP response code: 500
Are you able to distcp folders that don't have special characters? What are the versions of the two clusters and are you running on the destination cluster if there's a mis-match? If there is you'll need to use hftp: http://hadoop.apache.org/common/docs/current/distcp.html#cpver On Wed, May 18, 2011 at 12:44 PM, sonia gehlot sonia.geh...@gmail.comwrote: Hi Guys I am trying to copy hadoop data from one cluster to another but I am keep on getting this error *Server returned HTTP response code: 500 for URL* * * My distcp command is: scripts/hadoop.sh distcp hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/ *day=2011-05-17* hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot In here I have *day=2011-05-17* in my file path I found this online: https://issues.apache.org/jira/browse/HDFS-31 Is this issue is still exists? Is this could be the reason of my job failure? Job Error log: 2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0 2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17 : java.io.IOException: *Server returned HTTP response code: 500 for URL*: http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Copied: 0 Skipped: 5 Failed: 1 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task Any help is appreciated. Thanks, Sonia
Re: Reducer granularity and starvation
Altogether my reducers are handling about 10^8 keys. The number of values per key varies, but ranges from 1 to 100. I'd guess the mean and mode are around 10, but I'm not sure.
Re: Reducer granularity and starvation
Also, the values are much larger (maybe a factor of 10^3) than the keys. I get the impression that this is unusual for Hadoop apps. (It certainly isn't true of word count in any event.)
Re: Reducer granularity and starvation
W.P., Sounds like you are going to be taking a long time no matter what. With a keyspace of about 10^7, that means that either hadoop is going to eventually allocate 10^7 reducers (if you set your reducer count to 10^7) or is going to re-use the ones you have 10^6 / (number of reducers you allocate) times. It is probably just a big job :) Look into fairscheduler, or specify fewer reducers for this job and suffer a slight slowdown, but allow other jobs to get reducers when they need them. You *might* get some efficiencies if you can reduce the number of keys, or ensure that very few keys are getting big lists of data (anti-parallel). Make sure you are using a combiner if there is an opportunity to reduce the amount of data that goes through the shuffle. That is always a good thing; IO = slow. Also, see if you can break your job up into smaller pieces so the more expensive operations are happening on less data volume. Good luck! Cheers James. On 2011-05-18, at 3:42 PM, W.P. McNeill wrote: Altogether my reducers are handling about 10^8 keys. The number of values per key varies, but ranges from 1 to 100. I'd guess the mean and mode are around 10, but I'm not sure.
Re: Reducer granularity and starvation
I'm using fair scheduler and JVM reuse. It is just plain a big job. I'm not using a combiner right now, but that's something to look at. What about bumping the mapred.reduce.tasks up to some huge number? I think that shouldn't make a difference, but I'm hearing conflicting information on this.
Re: Reducer granularity and starvation
W.P, Upping the reduce.tasks to a huge number just means that it will eventually spawn reducers = to (that huge number). You still only have slots for 360 so there is no real advantage, UNLESS you are running into OOM errors, which we’ve seen with higher re-use on the smaller number of reducers. Anyhoo, someone else can chime in and correct me if I am off base. Does that make sense? Cheers James. On 2011-05-18, at 4:04 PM, W.P. McNeill wrote: I'm using fair scheduler and JVM reuse. It is just plain a big job. I'm not using a combiner right now, but that's something to look at. What about bumping the mapred.reduce.tasks up to some huge number? I think that shouldn't make a difference, but I'm hearing conflicting information on this.
some guidance needed
Hello everybody, I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS [2]. I am quite new to Hadoop and associates and I am looking for some hints to get started on the right track. I have installed a single-node Hadoop instance on my machine and played around with it (ran some examples), but I am interested in what you (more experienced people) think is the best way to approach my problem. I am a little puzzled by the fact that I read Hadoop is best used for large files, and emails aren't that large from what I know. Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) maildir and mailbox storage formats? (I think this question is more suited to the James mailing list, but if you have some ideas please speak your mind.) Also, any development resources to get me started are welcomed. [1] http://james.apache.org/mailbox/ [2] https://issues.apache.org/jira/browse/MAILBOX-44 Regards, -- Ioan Eugen Stan
Re: some guidance needed
Hi Ioan, I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit. -Todd On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote: Hello everybody, I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS. I am quite new to Hadoop and associates and I am looking for some hints as to get started on the right track. I have installed a single node Hadoop instance on my machine and played around with it (ran some examples) but I am interested into what you (more experienced people) think it's the best way to approach my problem. I am a little puzzled about the fact that that I read hadoop is best used for large files and email aren't that large from what I know. Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) maildir and mailbox storage formats? (I think this question is more suited on the James mailing list, but if you have some ideas please speak your mind). Also, any development resources to get me started are welcomed. [1] http://james.apache.org/mailbox/ [2] https://issues.apache.org/jira/browse/MAILBOX-44 Regards, -- Ioan Eugen Stan -- Todd Lipcon Software Engineer, Cloudera
Re: some guidance needed
Ioan, I second what Todd said; even with FuseHDFS, mounting HDFS as a regular file system, it won't give you the immediate response about the file status that you need. I believe Google implemented Gmail with HBase. Here is an example of implementing a mail store with Cassandra: http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf Mark On Wed, May 18, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote: Hi Ioan, I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit. -Todd On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote: Hello everybody, I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS. I am quite new to Hadoop and associates and I am looking for some hints to get started on the right track. I have installed a single-node Hadoop instance on my machine and played around with it (ran some examples) but I am interested in what you (more experienced people) think is the best way to approach my problem. I am a little puzzled by the fact that I read Hadoop is best used for large files and emails aren't that large from what I know. Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) maildir and mailbox storage formats? (I think this question is more suited to the James mailing list, but if you have some ideas please speak your mind.) Also, any development resources to get me started are welcomed. [1] http://james.apache.org/mailbox/ [2] https://issues.apache.org/jira/browse/MAILBOX-44 Regards, -- Ioan Eugen Stan -- Todd Lipcon Software Engineer, Cloudera
Can I use InputSampler.RandomSampler on data with non-Text keys?
I want to do a total sort on some data whose key type is Writable but not Text. I wrote an InputSampler.RandomSampler object following the example in the Total Sort section of *Hadoop: The Definitive Guide*. When I call InputSampler.writePartitionFile() I get a class cast exception because my key type cannot be cast to Text. Specifically, the issue seems to be the following section of InputSampler.getSample(): K key = reader.getCurrentKey(); Text keyCopy = WritableUtils.<Text>clone((Text) key, job.getConfiguration()); From this source it does appear that you can only use a RandomSampler on data with Text keys. However, I'm confused because I don't see this mentioned in any documentation, and I assume this wouldn't be the case because InputSampler takes <Key, Value> generic specifications. 1. Does InputSampler.RandomSampler only work on data with Text key values? 2. If so, what is the easiest way to generate a random sample for data with non-Text key values? Is there example code anywhere?
Re: Reducer granularity and starvation
The one advantage you would get with a large number of reducers is that the scheduler will be able to give open reduce slots to other jobs without having to be preemptive. This will reduce the risk of you losing a reducer 3 hours into a 4 hour run. -Joey On Wed, May 18, 2011 at 3:08 PM, James Seigel ja...@tynt.com wrote: W.P, Upping the reduce.tasks to a huge number just means that it will eventually spawn reducers = to (that huge number). You still only have slots for 360 so there is no real advantage, UNLESS you are running into OOM errors, which we’ve seen with higher re-use on the smaller number of reducers. Anyhoo, someone else can chime in and correct me if I am off base. Does that make sense? Cheers James. On 2011-05-18, at 4:04 PM, W.P. McNeill wrote: I'm using fair scheduler and JVM reuse. It is just plain a big job. I'm not using a combiner right now, but that's something to look at. What about bumping the mapred.reduce.tasks up to some huge number? I think that shouldn't make a difference, but I'm hearing conflicting information on this. -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: Reducer granularity and starvation
Here's a consequence that I see of having the values be much larger than the keys: there's not much point in me adding a combiner. My mapper emits pairs of the form <Key, Value>, where the size of Value is much greater than the size of Key. The reducer then processes input of the form <Key, Iterator<Value>>. The reducer then looks at the set of values corresponding to a Key and separates it into one of two bins. I don't think this is particularly CPU-intensive; however, the reducer needs access to the entire set of Values. The set can't be boiled down into some smaller sufficient statistic the way, say, in a word count program we can combine the counts for a word from different documents into a single number. As a result, the only combiner strategy I can see is to have the mapper emit each Value as a single-item list, <Key, [Value]>, have a combiner combine the lists, <Key, [Value, Value...]>, and then the reducer would work on lists of lists, <Key, Iterator<[Value, Value...]>>. This would save on redundant Key IO, but since Values are so much bigger than Keys I don't think this would matter.
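To make that trade-off concrete, here is a rough sketch of the strategy just described, under the assumption that keys and values are both Text and the "list" is simply a newline-joined blob (the class name and delimiter are my own illustration, not the poster's code). As noted above, it only removes repeated keys from the shuffle, so with values around 10^3 times larger than keys the savings are negligible:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Combiner that concatenates all of a key's values into one record.
    public class ValueListCombiner extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        for (Text value : values) {
          if (joined.length() > 0) {
            joined.append('\n');        // delimiter between combined values
          }
          joined.append(value.toString());
        }
        // One record per key leaves the combiner; the value payload is unchanged in size.
        context.write(key, new Text(joined.toString()));
      }
    }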
Re: Can I use InputSampler.RandomSampler on data with non-Text keys?
That sounds like a bug to me. I think the easiest way would be to modify InputSampler to handle non-Text keys. -Joey On Wed, May 18, 2011 at 4:24 PM, W.P. McNeill bill...@gmail.com wrote: I want to do a total sort on some data whose key type is Writable but not Text. I wrote an InputSampler.RandomSampler object following the example in the Total Sort section of *Hadoop: The Definitive Guide*. When I call InputSampler.writePartitionFile() I get a class cast exception because my key type cannot be cast to Text. Specifically, the issue seems to be the following section of InputSampler.getSample(): K key = reader.getCurrentKey(); Text keyCopy = WritableUtils.<Text>clone((Text) key, job.getConfiguration()); From this source it does appear that you can only use a RandomSampler on data with Text keys. However, I'm confused because I don't see this mentioned in any documentation, and I assume this wouldn't be the case because InputSampler takes <Key, Value> generic specifications. 1. Does InputSampler.RandomSampler only work on data with Text key values? 2. If so, what is the easiest way to generate a random sample for data with non-Text key values? Is there example code anywhere? -- Joseph Echeverria Cloudera, Inc. 443.305.9434
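For anyone hitting the same wall before InputSampler itself is fixed, the modification Joey describes amounts to cloning the sampled key through its own Writable serialization instead of casting it to Text. A small illustrative helper (my own class, not code from InputSampler or from any patch); inside getSample() the Text-specific line would then become something like K keyCopy = SampleKeyCloner.cloneKey(key, job.getConfiguration()), with the samples list typed over K rather than Text:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableUtils;

    // Illustrative helper: copy a sampled key without assuming it is a Text.
    public final class SampleKeyCloner {
      private SampleKeyCloner() {}

      // Round-trips the key through its own serialization, so any Writable key works.
      public static <K extends Writable> K cloneKey(K key, Configuration conf) {
        return WritableUtils.clone(key, conf);
      }
    }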