Re: Submit Hadoop Job Remotely (without creating a jar)
Some tools ship a CLI that doesn't require building a jar. For example, you can use Pig's interactive mode if you'd like to use Pig: http://pig.apache.org/docs/r0.12.1/start.html#interactive-mode The Hive CLI is another: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

Thanks,
- Tsuyoshi

On Thu, Jun 26, 2014 at 8:22 PM, Ramil Malikov <vivalala...@gmail.com> wrote:
> Is it possible to submit a job to a Hadoop cluster (2.2.0) from a remote machine without creating a jar, like PigServer.submit(pigScript)?
> Thank you.
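To make that concrete: from Pig's grunt shell on a remote machine whose client configuration points at the cluster, a script runs with no user jar at all. The file names and the filter below are made-up placeholders, not anything from the original question:

```
grunt> raw     = LOAD 'input/records.txt' AS (line:chararray);
grunt> matched = FILTER raw BY line MATCHES '.*error.*';
grunt> STORE matched INTO 'output/matched';
```

Pig compiles these statements into MapReduce jobs and submits them to the cluster for you, so no user-built jar is involved.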
Re: Partitioning and setup errors
The new Configuration() is suspicious. Are you setting configuration information manually?

Chris

On Jun 27, 2014 5:16 AM, Chris MacKenzie <stu...@chrismackenziephotography.co.uk> wrote:
> Hi,
>
> I realise my previous question may have been a bit naïve, and I also realise I am asking an awful lot here; any advice would be greatly appreciated.
>
> - I have been using Hadoop 2.4 in local mode and am sticking to the mapreduce.* side of the track.
> - I am using a custom line reader to read each sequence into a Map.
> - I have a partitioner class which tests the key from the map class.
> - I've tried debugging in Eclipse with a breakpoint in the partitioner class, but getPartition(LongWritable mapKey, Text sequenceString, int numReduceTasks) is never called. Could there be any reason for that?
>
> Because my map and reduce code works in local mode within Eclipse, I wondered whether the partitioner might work if I changed to pseudo-distributed mode, exporting a runnable jar from Eclipse (Kepler). I now have several faults, both in pseudo-distributed mode on my own computer and in the pseudo-distributed mode I set up on the university cluster. I've googled and read extensively but am not seeing a solution to any of these issues.
>
> I have this line:
>
> 14/06/27 11:45:27 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
>
> My driver code is:
>
>     private void doParallelConcordance() throws Exception {
>         Path inDir = new Path("input_sequences/10_sequences.txt");
>         Path outDir = new Path("demo_output");
>         Job job = Job.getInstance(new Configuration());
>         job.setJarByClass(ParallelGeneticAlignment.class);
>         job.setOutputKeyClass(Text.class);
>         job.setOutputValueClass(IntWritable.class);
>         job.setInputFormatClass(CustomFileInputFormat.class);
>         job.setMapperClass(ConcordanceMapper.class);
>         job.setPartitionerClass(ConcordanceSequencePartitioner.class);
>         job.setReducerClass(ConcordanceReducer.class);
>         FileInputFormat.addInputPath(job, inDir);
>         FileOutputFormat.setOutputPath(job, outDir);
>         job.waitForCompletion(true);
>     }
>
> On the university server I am getting this error:
>
> 14/06/27 11:45:40 INFO mapreduce.Job: Task Id : attempt_1403860966764_0003_m_00_0, Status : FAILED
> Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class par.gene.align.concordance.ConcordanceMapper not found
>
> On my machine the error is:
>
> 14/06/27 12:58:03 INFO mapreduce.Job: Task Id : attempt_1403864060032_0004_r_00_2, Status : FAILED
> Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class par.gene.align.concordance.ConcordanceReducer not found
>
> On the university server I get:
>
> 14/06/27 11:45:27 INFO input.FileInputFormat: Total input paths to process : 1
> 14/06/27 11:45:28 INFO mapreduce.JobSubmitter: number of splits:1
>
> On my machine I get:
>
> 14/06/27 12:57:09 INFO input.FileInputFormat: Total input paths to process : 0
> 14/06/27 12:57:36 INFO mapreduce.JobSubmitter: number of splits:0
>
> Being new to this community, I thought it polite to introduce myself. I'm planning to return to software development via an MSc at Heriot-Watt University in Edinburgh. My MSc project is based on Foster's genetic sequence alignment. I have written a sequential version; my goal is now to port it to Hadoop.
>
> Thanks in advance,
> Regards,
> Chris MacKenzie
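On the partitioner never being hit: if I remember the MapTask code correctly, a job with a single reduce task bypasses the user partitioner entirely (every record goes to partition 0), so in local mode with one reducer getPartition may simply never be invoked. For reference, the getPartition contract is just a stable mapping from key to the range [0, numReduceTasks). Below is a minimal hash-based sketch with Hadoop's LongWritable/Text swapped for plain Strings so it runs standalone; the real test that ConcordanceSequencePartitioner applies to the key isn't shown in the thread, so this is only illustrative:

```java
public class SequencePartitioner {
    // Mirrors the contract of org.apache.hadoop.mapreduce.Partitioner#getPartition:
    // must return a stable value in [0, numReduceTasks) for every key.
    public int getPartition(String mapKey, String sequence, int numReduceTasks) {
        // Mask off the sign bit so a negative hashCode cannot produce a negative partition.
        return (mapKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        SequencePartitioner p = new SequencePartitioner();
        for (String key : new String[]{"ACGT", "TTGA", "GGCC"}) {
            System.out.println(key + " -> reducer " + p.getPartition(key, "", 4));
        }
    }
}
```

A breakpoint inside a partitioner like this will only trigger once the job actually runs with more than one reduce task (e.g. job.setNumReduceTasks(4) in the driver).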
Re: group similar items using pairwise similar items
Since you say "mutually similar", are you not really looking for maximal cliques rather than connected components?

> Hi,
>
> I have a set of items and a list of pairwise-similar items, and I want to group together items that are mutually similar. For example, if A B C D E F G are the items and the pairwise-similar items are:
>
> A B
> A C
> B C
> D E
> C G
> E F
>
> I want the output to be:
>
> A B C G
> D E F
>
> Can someone suggest how to do the above? If the problem is cast as a graph problem where every item is a vertex, then finding connected components, or a breadth-first search from each node, should solve it. Can anyone suggest some pointers to those algorithms?
>
> Thanks,
> Parnab
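If connected components is what you want, the pairs can be grouped with a simple union-find (disjoint-set) structure; no graph library is needed. A minimal sketch in plain Java, using the exact items and pairs from the question:

```java
import java.util.*;

public class SimilarityGroups {
    // Union-find with path compression; parent maps each item to its representative.
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x)) {
            parent.put(x, find(parent.get(x)));  // path compression
        }
        return parent.get(x);
    }

    static void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        String[][] pairs = {{"A","B"},{"A","C"},{"B","C"},{"D","E"},{"C","G"},{"E","F"}};
        for (String[] p : pairs) union(p[0], p[1]);

        // Group items by their root representative (iterate over a copy of the keys,
        // since find() mutates the map while compressing paths).
        Map<String, TreeSet<String>> groups = new TreeMap<>();
        for (String item : new TreeSet<>(parent.keySet())) {
            groups.computeIfAbsent(find(item), k -> new TreeSet<>()).add(item);
        }
        for (Set<String> g : groups.values()) System.out.println(g);
        // prints [D, E, F] then [A, B, C, G]
    }
}
```

Note that {A, B, C, G} is a connected component but not a clique (G is only paired with C), which is exactly why the clique-vs-component distinction above matters for your definition of "mutually similar".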
RE: persistent services in Hadoop
Thanks Arun! I do think we are on the bleeding edge of YARN, because everyone else in our application space generates MapReduce (Pig, Hive), or has overlaid a legacy server grid on Hadoop. I will explore both resources you mentioned to see where the development community is headed.

Cheers,
John

From: Arun Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, June 25, 2014 11:50 PM
To: user@hadoop.apache.org
Subject: Re: persistent services in Hadoop

John,

We are excited to see ISVs like you get value from YARN, and appreciate the patience you've already shown in working through the teething issues of YARN in hadoop-2.x.

W.r.t. long-running services, the most straightforward option is to go through Apache Slider (http://slider.incubator.apache.org/). Slider has already made good progress in supporting various long-running services such as Apache HBase, Apache Accumulo, and Apache Storm. I'm sure the Slider community would be very welcoming of your use cases and suggestions, particularly as they are gearing up to support various applications on top, and would love your feedback.

Furthermore, there is work going on in YARN itself to better support your use case: https://issues.apache.org/jira/browse/YARN-896. Again, your feedback there is very welcome. You might also be interested in https://issues.apache.org/jira/browse/YARN-1530, which provides a generic framework for collecting application metrics for YARN applications.

Hope that helps.

thanks,
Arun

On Wed, Jun 25, 2014 at 1:48 PM, John Lilley <john.lil...@redpoint.net> wrote:
> We are an ISV that currently ships a data-quality/integration suite running as a native YARN application. We are finding several use cases that would benefit from being able to manage a per-node persistent service. MapReduce has its "shuffle" auxiliary service, but it isn't straightforward to add auxiliary services because they cannot be loaded from HDFS, so we'd have to manage the distribution of JARs across nodes (please tell me if I'm wrong here...). Given that, is there a preferred method for managing persistent services on a Hadoop cluster? We could have an AM that creates a set of YARN tasks, waits until YARN gives it a task on each node, and restarts any failed tasks, but that doesn't really fit the AM/container structure very well. I've also read about Slider, which looks interesting. Other ideas?
> --john

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Partitioning and setup errors
Hi Chris,

Thanks for your response. I deeply appreciate it.

I don't know what you mean by that question. I use configuration:

* In the driver: Job job = Job.getInstance(new Configuration());
* In the CustomLineRecordReader: Configuration job = context.getConfiguration();

One of the biggest issues I have had is staying true to the mapreduce.* format.

Best wishes,
Chris MacKenzie

From: Chris Mawata <chris.maw...@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Friday, 27 June 2014 14:11
To: user@hadoop.apache.org
Subject: Re: Partitioning and setup errors

> The new Configuration() is suspicious. Are you setting configuration information manually?
> Chris
Re: Partitioning and setup errors
Probably my fault. I was looking for the "extends Configured implements Tool" part. I will double-check when I get home rather than send you on a wild goose chase.

Cheers,
Chris

On Jun 27, 2014 8:16 AM, Chris MacKenzie <stu...@chrismackenziephotography.co.uk> wrote:
> Hi, I realise my previous question may have been a bit naïve and I also realise I am asking an awful lot here; any advice would be greatly appreciated. [...]
How to see total pending containers?
Hi,

Is there a way to see the total number of pending containers in a cluster, so that we know how far behind we are with our ETL? There is a pending-containers field on the scheduler page, under the "dr.who" table, but it is always zero.

--
Thanks,
Ashwin
Re: Configuring Hadoop Client: Where is fail-over configured.
Hi Charley,

In hdfs-site.xml you can find the dfs.ha.namenodes.* property; with it set, every client knows which NameNodes are eligible to become active. Nothing else is required on the client.

Regards.

2014-06-26 21:30 GMT+02:00 Charley Newtonne <cnewto...@gmail.com>:
> I have Hadoop 2.4 installed in HA mode using QJM. I have verified that cluster failover works as expected. The Java clients are configured to connect to the active NN by specifying hdfs://nn1.company.com:8020. If nn1 is down, how does the client know the location of the standby NN? Where is client failover configured? I have seen references saying this is configured in the client's core-site.xml, but that file only specifies the defaultFS (which has the value of the logical cluster name) and the ZK quorum nodes. Neither appears to be related to client-side failover.
> Thanks in advance
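For completeness, client-side failover usually needs the logical nameservice and the failover proxy provider in the client's hdfs-site.xml, alongside the NameNode list. A sketch, where the nameservice name "mycluster" and the nn2 hostname are assumptions to substitute with your own:

```xml
<!-- hdfs-site.xml on the client -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.company.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.company.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, clients address the filesystem as hdfs://mycluster (rather than a concrete host), and the ConfiguredFailoverProxyProvider tries each listed NameNode until it reaches the active one.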
How to speed up replication
Hi,

My cluster has 2 HA NameNodes and 8 DataNodes. 506,803 blocks are currently under-replicated, and only about 1,000 blocks are re-replicated every 10 minutes, generating roughly 600 MB of traffic per server. At this rate it will take a very long time to complete. How can I increase the replication rate?
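Re-replication of under-replicated blocks is throttled by the NameNode. The properties below are, to the best of my knowledge, the relevant HDFS knobs in Hadoop 2.x; the values are only examples to be tuned for your cluster, not recommendations. Set them in hdfs-site.xml on the NameNode and restart it:

```xml
<!-- How many blocks of replication work the NameNode schedules
     per DataNode per heartbeat interval (default 2). -->
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>
</property>
<!-- Max concurrent replication streams per DataNode (default 2). -->
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>20</value>
</property>
<!-- Hard cap on streams, including highest-priority work (default 4). -->
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>40</value>
</property>
```

Raising these trades recovery time for network and disk load on the DataNodes, so increase them gradually while watching the 600 MB/s-per-server figure you are already tracking.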