Re: Submit Hadoop Job Remotely (without creating a jar)

2014-06-27 Thread Tsuyoshi OZAWA
Some tools provide CLI interfaces which don't require creating a jar. For
example, you can use Pig's interactive mode if you'd like to use Pig:
http://pig.apache.org/docs/r0.12.1/start.html#interactive-mode

Hive CLI is one of them:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

Thanks,
- Tsuyoshi

On Thu, Jun 26, 2014 at 8:22 PM, Ramil Malikov vivalala...@gmail.com wrote:
 Is it possible to submit a job to a Hadoop cluster (2.2.0) from a remote
 machine without creating a jar?
 Like PigServer.submit(pigScript).

 Thank you.



-- 
- Tsuyoshi


Re: Partitioning and setup errors

2014-06-27 Thread Chris Mawata
The new Configuration() is suspicious. Are you setting configuration
information manually?
Chris
On Jun 27, 2014 5:16 AM, Chris MacKenzie 
stu...@chrismackenziephotography.co.uk wrote:

 Hi,

 I realise my previous question may have been a bit naïve and I also
 realise I am asking an awful lot here, any advice would be greatly
 appreciated.

- I have been using Hadoop 2.4 in local mode and am sticking to the
  mapreduce.* side of the track.
- I am using a custom line reader to read each sequence into a Map.
- I have a partitioner class which is testing the key from the map class.
- I've tried debugging in Eclipse with a breakpoint in the partitioner
  class, but getPartition(LongWritable mapKey, Text sequenceString, int
  numReduceTasks) is not being called.

 Could there be any reason for that ?
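One possible explanation, offered as an assumption since the job configuration isn't shown: the MapReduce framework only consults a custom partitioner when the job runs with more than one reduce task, and the default (and the usual case in local mode) is a single reducer, in which case a breakpoint in getPartition() never fires. The sketch below mirrors the default HashPartitioner arithmetic in plain Java (the class name is illustrative, and no Hadoop classes are used so it runs standalone):

```java
// Plain-Java sketch of Hadoop's default HashPartitioner logic.
// With numReduceTasks == 1 every record goes to partition 0, and the
// framework can skip calling the user's partitioner entirely.
public class PartitionSketch {

    // Mask off the sign bit, then take the remainder modulo the reducer
    // count -- the same arithmetic HashPartitioner#getPartition uses.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With one reducer there is only one possible partition.
        System.out.println(getPartition("ACGT", 1)); // prints 0
        // With several reducers the result falls in [0, numReduceTasks).
        System.out.println(getPartition("ACGT", 4));
    }
}
```

Calling job.setNumReduceTasks(2) (or more) before submission is one way to check whether the breakpoint then starts firing.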

 Because my map and reduce code works in local mode within Eclipse, I wondered
 if I might get the partitioner to work if I changed to Pseudo Distributed
 Mode, exporting a runnable jar from Eclipse (Kepler).

 I have several faults in Pseudo Distributed Mode, both on my own computer and
 on the university cluster's setup, which I configured myself. I've googled
 and read extensively but am not seeing a solution to any of these issues.

 I have this line:
 14/06/27 11:45:27 WARN mapreduce.JobSubmitter: No job jar file set.  User
 classes may not be found. See Job or Job#setJar(String).
 My driver code is:

 private void doParallelConcordance() throws Exception {

     Path inDir = new Path("input_sequences/10_sequences.txt");
     Path outDir = new Path("demo_output");

     Job job = Job.getInstance(new Configuration());
     job.setJarByClass(ParallelGeneticAlignment.class);

     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);

     job.setInputFormatClass(CustomFileInputFormat.class);
     job.setMapperClass(ConcordanceMapper.class);
     job.setPartitionerClass(ConcordanceSequencePartitioner.class);
     job.setReducerClass(ConcordanceReducer.class);

     FileInputFormat.addInputPath(job, inDir);
     FileOutputFormat.setOutputPath(job, outDir);

     job.waitForCompletion(true);
 }

 On the university server I am getting this error:
 14/06/27 11:45:40 INFO mapreduce.Job: Task Id :
 attempt_1403860966764_0003_m_00_0, Status : FAILED
 Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
 par.gene.align.concordance.ConcordanceMapper not found

 On my machine the error is:
 14/06/27 12:58:03 INFO mapreduce.Job: Task Id :
 attempt_1403864060032_0004_r_00_2, Status : FAILED
 Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
 par.gene.align.concordance.ConcordanceReducer not found

 On the university server I get total paths to process:
 14/06/27 11:45:27 INFO input.FileInputFormat: Total input paths to process
 : 1
 14/06/27 11:45:28 INFO mapreduce.JobSubmitter: number of splits:1

 On my machine I get total paths to process:
 14/06/27 12:57:09 INFO input.FileInputFormat: Total input paths to process
 : 0
 14/06/27 12:57:36 INFO mapreduce.JobSubmitter: number of splits:0

 Being new to this community, I thought it polite to introduce myself. I'm
 planning to return to software development via an MSc at Heriot-Watt
 University in Edinburgh. My MSc project is based on Foster's Genetic
 Sequence Alignment. I have written a sequential version; my goal is now to
 port it to Hadoop.

 Thanks in advance,
 Regards,

 Chris MacKenzie



Re: group similar items using pairwise similar items

2014-06-27 Thread Chris Mawata
Since you say "mutually similar", are you really not looking for maximal
cliques rather than connected components?
Hi,

I have a set of items and a pairwise similar items. I want to group
together  items that are mutually similar.

For ex : if *A B C D E  F G* are the items
I have the following pairwise similar items

*A B*
*A C*
*B C *
*D E *
*C G*
*E F*

I want the output as

*A B C G*
*D E F*

Can someone suggest how to do the above?

If the above problem is cast as a graph problem where every item is a
vertex, then finding connected components, or running a breadth-first search
from each node, should solve the problem.

Can anyone suggest some pointers to those algorithms?
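The connected-components approach can be sketched with a union-find (disjoint-set) structure; the class below is a self-contained illustration (names are ours, not from any poster's code). Note, per the reply above, that this groups transitively similar items -- connected components -- not maximal cliques:

```java
import java.util.*;

// Union-find (disjoint-set) sketch that groups items which are
// transitively similar, i.e. the connected components of the
// similarity graph.
public class SimilarityGroups {
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x)) {
            parent.put(x, find(parent.get(x))); // path compression
        }
        return parent.get(x);
    }

    public void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public Collection<Set<String>> groups() {
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String x : parent.keySet()) {
            byRoot.computeIfAbsent(find(x), k -> new TreeSet<>()).add(x);
        }
        return byRoot.values();
    }

    public static void main(String[] args) {
        SimilarityGroups g = new SimilarityGroups();
        String[][] pairs = {{"A","B"},{"A","C"},{"B","C"},
                            {"D","E"},{"C","G"},{"E","F"}};
        for (String[] p : pairs) g.union(p[0], p[1]);
        // The two groups from the example, in no particular order:
        System.out.println(g.groups()); // prints [[D, E, F], [A, B, C, G]]
    }
}
```

For very large item sets this same grouping can be done iteratively in MapReduce, but for data that fits in memory a single pass like this is enough.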

Thanks,
Parnab ..


RE: persisent services in Hadoop

2014-06-27 Thread John Lilley
Thanks Arun!
I do think we are on the bleeding edge of YARN, because everyone else in our 
application space generates MapReduce (Pig, Hive), or they have overlaid their 
legacy server-grid on Hadoop.
I will explore both resources you mentioned to see where the development 
community is headed.
Cheers,
john


From: Arun Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, June 25, 2014 11:50 PM
To: user@hadoop.apache.org
Subject: Re: persisent services in Hadoop

John,

 We are excited to see ISVs like you get value from YARN, and appreciate the 
patience you've already shown in the past to work through the teething issues 
of YARN & hadoop-2.x.

 W.r.t. long-running services, the most straightforward option is to go through 
Apache Slider (http://slider.incubator.apache.org/). Slider has already made 
good progress in supporting various long-running services such as Apache HBase, 
Apache Accumulo & Apache Storm. I'm very sure the Slider community would be 
very welcoming of your use-cases, suggestions etc. - particularly as they are 
gearing up to support various applications atop it; and would love your feedback.

 Furthermore, there is work going on in YARN itself to better support your use 
case: https://issues.apache.org/jira/browse/YARN-896.
 Again, your feedback there is very, very welcome.

 Also, you might be interested in 
https://issues.apache.org/jira/browse/YARN-1530 which provides a generic 
framework for collecting application metrics for YARN applications.

 Hope that helps.

thanks,
Arun

On Wed, Jun 25, 2014 at 1:48 PM, John Lilley 
john.lil...@redpoint.netmailto:john.lil...@redpoint.net wrote:
We are an ISV that currently ships a data-quality/integration suite running as 
a native YARN application.  We are finding several use cases that would benefit 
from being able to manage a per-node persistent service.  MapReduce has its 
“shuffle auxiliary service”, but it isn’t straightforward to add auxiliary 
services because they cannot be loaded from HDFS, so we’d have to manage the 
distribution of JARs across nodes (please tell me if I’m wrong here…).  Given 
that, is there a preferred method for managing persistent services on a Hadoop 
cluster?  We could have an AM that creates a set of YARN tasks and just waits 
until YARN gives a task on each node, and restart any failed tasks, but it 
doesn’t really fit the AM/container structure very well.  I’ve also read about 
Slider, which looks interesting.  Other ideas?
--john



--

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader of 
this message is not the intended recipient, you are hereby notified that any 
printing, copying, dissemination, distribution, disclosure or forwarding of 
this communication is strictly prohibited. If you have received this 
communication in error, please contact the sender immediately and delete it 
from your system. Thank You.


Re: Partitioning and setup errors

2014-06-27 Thread Chris MacKenzie
Hi Chris,

Thanks for your response. I deeply appreciate it.

I don't know what you mean by that question. I use configuration in two places:
* In the driver: Job job = Job.getInstance(new Configuration());
* In the CustomLineRecordReader: Configuration job =
context.getConfiguration();
One of the biggest issues I have had is staying true to the mapreduce.*
format.

Best wishes,

Chris MacKenzie

From:  Chris Mawata chris.maw...@gmail.com
Reply-To:  user@hadoop.apache.org
Date:  Friday, 27 June 2014 14:11
To:  user@hadoop.apache.org
Subject:  Re: Partitioning and setup errors


The new Configuration() is suspicious. Are you setting configuration
information manually?
Chris

On Jun 27, 2014 5:16 AM, Chris MacKenzie
stu...@chrismackenziephotography.co.uk wrote:
 [snip - original message quoted in full above]




Re: Partitioning and setup errors

2014-06-27 Thread Chris Mawata
Probably my fault. I was looking for the
"extends Configured implements Tool"
part. I will double-check when I get home rather than send you on a wild
goose chase.
Cheers
Chris
On Jun 27, 2014 8:16 AM, Chris MacKenzie 
stu...@chrismackenziephotography.co.uk wrote:

 [snip - original message quoted in full above]



How to see total pending containers ?

2014-06-27 Thread Ashwin Shankar
Hi,
Is there a way to see the total number of pending containers in a cluster, so
that we know how far behind we are with ETL?

There is a pending containers field on the scheduler page, under the "dr.who"
table, but that is always zero.

-- 
Thanks,
Ashwin


Re: Configuring Hadoop Client: Where is fail-over configured.

2014-06-27 Thread Juan Carlos
Hi Charley,

in hdfs-site.xml you can find the property dfs.ha.namenodes.<nameservice ID>;
by setting this property, every client will know which NameNodes are eligible
to be active. Nothing else is required on the client.
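For reference, a typical client-side HA configuration in hdfs-site.xml looks roughly like the following (the nameservice ID "mycluster" and the host names are illustrative assumptions, not taken from this thread):

```xml
<!-- Logical name for the HA cluster; used in fs.defaultFS as hdfs://mycluster -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- The NameNode IDs that are eligible to be active -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.company.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.company.com:8020</value>
</property>
<!-- The class clients use to locate the active NameNode and fail over -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, clients address the cluster by its logical name (hdfs://mycluster) rather than a specific NameNode host.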

Regards.


2014-06-26 21:30 GMT+02:00 Charley Newtonne cnewto...@gmail.com:

 I have hadoop 2.4 installed in HA mode using QJM. I have verified that
 cluster failover works as expected. The Java clients are configured to
 connect to the active NN by specifying hdfs://nn1.company.com:8020. If
 nn1 is down, how does the client know the location of the standby NN?
 Where is the client-side failover configured?

 I have seen some references that this is configured in the client's
 core-site.xml, but this file only specifies the defaultFS (which has the
 value of the logical cluster name) and the ZK quorum nodes. None of these
 appear to be related to client-side failover.

 Thanks in advance



how to replication speed up

2014-06-27 Thread 조주일
Hi,

My cluster is:
2 HA NameNodes
8 DataNodes

506,803 under-replicated blocks have occurred.

About 1,000 blocks are replicated every 10 minutes.

This generates about 600 megabytes of replication traffic per server.

At this rate it will take much longer to complete.

How can I increase the replication rate?
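Assuming Hadoop 2.x, two NameNode-side settings in hdfs-site.xml commonly gate re-replication throughput; both default to 2, and the values below are illustrative starting points rather than recommendations:

```xml
<!-- Blocks the NameNode schedules for replication per heartbeat
     interval, as a multiple of the number of live DataNodes. -->
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>
</property>
<!-- Maximum concurrent replication streams per DataNode. -->
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>20</value>
</property>
```

Raising these lets each DataNode work on more replication streams at once, at the cost of more network and disk load alongside regular traffic.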