du.reserved can run into problems with disk capacity reserved by tune2fs

2013-02-12 Thread Alexander Fahlke
Hi!

I'm using hadoop-0.20.2 on Debian Squeeze and ran into the same confusion
as many others with the parameter dfs.datanode.du.reserved.
One day some data nodes hit out-of-disk errors although there was space
left on the disks.

The following values are rounded to make the problem clearer:

- the disk for the DFS data has 1000GB and only one partition (ext3) for
DFS data
- you plan to set dfs.datanode.du.reserved to 20GB
- the reserved-blocks-percentage set by tune2fs is 5% (the default)

That gives all users except root 5% less capacity that they can use,
although the system reports the full 1000GB as usable for all users via
df.
The Hadoop daemons are not running as root.


If I read it right, Hadoop gets the free capacity via df.

Starting in /src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
on line 350:
 return usage.getCapacity()-reserved;

going to /src/core/org/apache/hadoop/fs/DF.java which says:
 Filesystem disk space usage statistics. Uses the unix 'df' program


When you have 5% reserved by tune2fs (in our case 50GB) and you give
dfs.datanode.du.reserved only 20GB, then you can run into out-of-disk
errors that Hadoop can't handle.

In this case you must add the capacity reserved by tune2fs to the planned
20GB of du.reserved. This results in (at least) 70GB
for dfs.datanode.du.reserved in my case.
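
A minimal sketch of the arithmetic (my own variable names, values in GB,
not actual Hadoop code):

    public class ReservedCapacityExample {
        public static void main(String[] args) {
            long dfCapacity  = 1000; // partition size as reported by df
            long duReserved  = 20;   // dfs.datanode.du.reserved
            long tune2fsHold = 50;   // 5% reserved-blocks-percentage, writable by root only

            long hadoopThinksUsable = dfCapacity - duReserved;  // 980 GB
            long nonRootCanUse      = dfCapacity - tune2fsHold; // 950 GB

            // 980 > 950: the datanode still believes 30 GB are free when the
            // filesystem already refuses writes for non-root users, hence the
            // out-of-disk errors. Reserving 20 + 50 = 70 GB closes the gap.
            System.out.println((hadoopThinksUsable - nonRootCanUse) + " GB gap");
        }
    }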


Two ideas:

1. The documentation should be clearer on this point to avoid the problem.
2. Hadoop could check for the space reserved by tune2fs (or other tools) and
add that value to the dfs.datanode.du.reserved parameter.


-- 
BR

Alexander Fahlke
Software Development
www.fahlke.org


Fwd: Delivery Status Notification (Failure)

2013-02-12 Thread samir das mohapatra
Hi All,
   I wanted to know how to connect Hive (hadoop-cdh4 distribution) with
MicroStrategy.
   Any help is very helpful.

  Waiting for your response.

Note: It is a little bit urgent; does anyone have experience with that?
Thanks,
samir


Re: number input files to mapreduce job

2013-02-12 Thread Mahesh Balija
Hi Vikas,

 You can get the FileSystem instance by calling
FileSystem.get(Configuration);
 Once you have the FileSystem instance you can use
FileSystem.listStatus(inputPath) to get the FileStatus instances for the job's input directory.
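
 A short sketch along those lines (the path and variable names are only
illustrative; needs the usual org.apache.hadoop.fs imports):

    // Count the files under the job's input directory via listStatus.
    // Assumes "conf" is the job Configuration and the path is your input dir.
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] statuses = fs.listStatus(new Path("/user/vikas/input")); // illustrative path
    int fileCount = 0;
    for (FileStatus status : statuses) {
        if (!status.isDir()) { // isDir() on hadoop-0.20/1.x; isDirectory() on newer APIs
            fileCount++;
            System.out.println(status.getPath());
        }
    }
    System.out.println("number of input files: " + fileCount);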

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Feb 12, 2013 at 12:35 PM, Vikas Jadhav vikascjadha...@gmail.com wrote:

 Hi all,
 How do I get the number of input files (and their paths) for a particular
 MapReduce job in a Java MapReduce program?

 --
 Thanks and Regards,
 Vikas Jadhav



NullPointerException in Spring Data Hadoop with CDH4

2013-02-12 Thread Christian Schneider
Hi,
I'm trying to use Spring Data Hadoop with CDH4 to write a MapReduce job.

On startup, I get the following exception:

Exception in thread "SimpleAsyncTaskExecutor-1" java.lang.ExceptionInInitializerError
	at org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:183)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
	at org.springframework.util.ReflectionUtils.makeAccessible(ReflectionUtils.java:405)
	at org.springframework.data.hadoop.mapreduce.JobUtils.<clinit>(JobUtils.java:123)
	... 2 more

I guess there is a problem with my Hadoop-related dependencies. I couldn't find 
any reference showing how to configure Spring Data Hadoop together with CDH4, but 
Costin showed that he is able to configure it: 
https://build.springsource.org/browse/SPRINGDATAHADOOP-CDH4-JOB1


**Maven Setup**

<properties>
    <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
    <hadoop.version>2.0.0-cdh4.1.3</hadoop.version>
</properties>

<dependencies>
    ...
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-streaming</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-test</artifactId>
        <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-tools</artifactId>
        <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>
    ...
</dependencies>
...
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>

    <repository>
        <id>spring-snapshot</id>
        <name>Spring Maven SNAPSHOT Repository</name>
        <url>http://repo.springframework.org/snapshot</url>
    </repository>
</repositories>

**Application Context**

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hadoop="http://www.springframework.org/schema/hadoop"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/hadoop
        http://www.springframework.org/schema/hadoop/spring-hadoop.xsd
        http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/integration
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context-3.1.xsd">

    <context:property-placeholder location="classpath:hadoop.properties" />

    <hdp:configuration id="hadoopConfiguration">
        fs.defaultFS=${hd.fs}
    </hdp:configuration>

    <hdp:job id="wordCountJob" input-path="${input.path}"
        output-path="${output.path}" mapper="com.example.WordMapper"
        reducer="com.example.WordReducer" />

    <hdp:job-runner job-ref="wordCountJob" run-at-startup="true"
        wait-for-completion="true" />

</beans>

**Cluster version**

Hadoop 2.0.0-cdh4.1.3


**Note:**

This small Unittest is running fine with the current configuration:

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:/applicationContext.xml" })
public class Starter {

    @Autowired
    private Configuration configuration;

    @Test
    public void shellOps() {
        Assert.assertNotNull(this.configuration);
        FsShell fsShell = new FsShell(this.configuration);
        final Collection<FileStatus> coll = fsShell.ls("/user");
        System.out.println(coll);
    }
}


It would be nice if someone can give me an example configuration.

Best Regards,
Christian.


Error for Pseudo-distributed Mode

2013-02-12 Thread yeyu1899
Hi all,
I installed a redhat_enterprise-linux-x86 VM in VMware Workstation and gave the 
virtual machine 1GB of memory. 


Then I followed the steps in Installing CDH4 on a Single Linux Node in 
Pseudo-distributed Mode: 
https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode.


At last, I ran an example Hadoop job with the command $ hadoop jar 
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 
'dfs[a-z.]+'


then the screen showed the output below, reporting 
AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 
secs, and I wonder: is that because my virtual machine's memory is too little?


[hadoop@localhost hadoop-mapreduce]$ hadoop jar 
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 
'dfs[a-z]+' 
  
13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set.  User 
classes may not be found. See Job or Job#setJar(String).

13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process : 4  
 
13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4   
 
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is 
deprecated. Instead, use mapreduce.job.output.value.class   
  
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is 
deprecated. Instead, use mapreduce.job.combine.class

13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is deprecated. 
Instead, use mapreduce.job.map.class

13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated. 
Instead, use mapreduce.job.name
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is 
deprecated. Instead, use mapreduce.job.reduce.class 
 
13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated. 
Instead, use mapreduce.input.fileinputformat.inputdir   
   
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated. 
Instead, use mapreduce.output.fileoutputformat.outputdir
  
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is 
deprecated. Instead, use mapreduce.job.outputformat.class   
   
13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated. 
Instead, use mapreduce.job.maps   
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is 
deprecated. Instead, use mapreduce.job.output.key.class 

13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir  
 
13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding 
any jar to the list of resources.   
13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application 
application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032  
 
13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job: 
http://localhost.localdomain:8088/proxy/application_1360528029309_0001/ 

  
13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001   
 
13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in 
uber mode : false
13/02/11 04:31:01 INFO mapreduce.Job:  map 0% reduce 0% 
 
13/02/11 04:47:22 INFO mapreduce.Job: Task Id : 
attempt_1360528029309_0001_r_00_0, Status : FAILED   
AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs
 
cleanup failed for container container_1360528029309_0001_01_06 : 
java.lang.reflect.UndeclaredThrowableException  
  

Re: NullPointerException in Spring Data Hadoop with CDH4

2013-02-12 Thread Costin Leau

Hi,

For Spring Data Hadoop problems, it's best to use the designated forum 
[1]. That being said, I've tried to reproduce your error but I can't - 
I've upgraded the build to CDH 4.1.3, which runs fine against the VM on 
the CI (4.1.1).

Maybe you have some other libraries on the client classpath?

From the stacktrace, it looks like the org.apache.hadoop.mapreduce.Job 
class has no 'state' or 'info' fields...
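
For context, a minimal sketch (my reconstruction, not the actual JobUtils
source) of how a missing field turns into that NPE:

    import java.lang.reflect.Field;
    import org.springframework.util.ReflectionUtils;

    public class JobFieldProbe {
        public static void main(String[] args) {
            // findField() returns null when the field does not exist on this
            // Hadoop version; makeAccessible(null) then throws the NPE above.
            Field state = ReflectionUtils.findField(org.apache.hadoop.mapreduce.Job.class, "state");
            System.out.println("state field: " + state);
            ReflectionUtils.makeAccessible(state);
        }
    }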


Anyway, let's continue the discussion on the forum.

Cheers,
[1] http://forum.springsource.org/forumdisplay.php?87-Hadoop


RE: Error for Pseudo-distributed Mode

2013-02-12 Thread Vijay Thakorlal
Hi,

 

Could you first try running the example:

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
grep input output 'dfs[a-z.]+'

 

Do you receive the same error?

 

Not sure if it's related to a lack of RAM, but as the stack trace shows
errors with network timeout (I realise that you're running in
pseudo-distributed mode):

 

Caused by: com.google.protobuf.ServiceException:
java.net.SocketTimeoutException: Call From localhost.localdomain/127.0.0.1
to localhost.localdomain:54113 failed on socket timeout exception:
java.net.SocketTimeoutException: 6 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/127.0.0.1:60976 remote=localhost.localdomain/127.0.0.1:54113]; For
more details see:  http://wiki.apache.org/hadoop/SocketTimeout


 

Your best bet is probably to start by checking the items mentioned in the
wiki page linked to above. While the default firewall rules (on CentOS)
usually allow pretty much all traffic on the lo interface, it might be worth
temporarily turning off iptables (assuming it is on).

 

Vijay

 

 

 

From: yeyu1899 [mailto:yeyu1...@163.com] 
Sent: 12 February 2013 12:58
To: user@hadoop.apache.org
Subject: Error for Pseudo-distributed Mode

 

Hi all,

I installed a redhat_enterprise-linux-x86 in VMware Workstation, and set the
virtual machine 1G memory. 

 

Then I followed steps guided by Installing CDH4 on a Single Linux Node in
Pseudo-distributed Mode --
https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+N
ode+in+Pseudo-distributed+Mode.

 

When at last, I ran an example Hadoop job with the command $ hadoop jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23
'dfs[a-z.]+'

 

then the screen showed as follows, 

depending AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after
600 secs and I wonder is that because my virtual machine's memory too
little~~??

 

[hadoop@localhost hadoop-mapreduce]$ hadoop jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23
'dfs[a-z]+'


13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set.  User
classes may not be found. See Job or Job#setJar(String).


13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process :
4   

13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4


13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is
deprecated. Instead, use mapreduce.job.output.value.class


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is
deprecated. Instead, use mapreduce.job.combine.class


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is
deprecated. Instead, use mapreduce.job.map.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated.
Instead, use mapreduce.job.name

13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is
deprecated. Instead, use mapreduce.job.reduce.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated.
Instead, use mapreduce.input.fileinputformat.inputdir


13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated.
Instead, use mapreduce.output.fileoutputformat.outputdir


13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is
deprecated. Instead, use mapreduce.job.outputformat.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated.
Instead, use mapreduce.job.maps   

13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is
deprecated. Instead, use mapreduce.job.output.key.class


13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated.
Instead, use mapreduce.job.working.dir


13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding
any jar to the list of resources.   

13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application
application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032


13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job:
http://localhost.localdomain:8088/proxy/application_1360528029309_0001/


13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001


13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in
uber mode : false

13/02/11 04:31:01 INFO mapreduce.Job:  map 0% reduce 0%


13/02/11 04:47:22 INFO mapreduce.Job: Task Id :
attempt_1360528029309_0001_r_00_0, Status : FAILED   

AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs


cleanup failed for container container_1360528029309_0001_01_06 :
java.lang.reflect.UndeclaredThrowableException


at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAn
dThrowException(YarnRemoteExceptionPBImpl.java:135)


at
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.stopC

RE: number input files to mapreduce job

2013-02-12 Thread java8964 java8964

I don't think you can get a list of all input files in the mapper, but what you 
can get is the current file's information.
From the context object reference you can get the InputSplit, which should 
give you all the information you want about the current input file.
http://hadoop.apache.org/docs/r2.0.2-alpha/api/org/apache/hadoop/mapred/FileSplit.html
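
For example, inside your Mapper subclass something along these lines should
work (a sketch using the new-API org.apache.hadoop.mapreduce.lib.input.FileSplit,
the counterpart of the mapred.FileSplit linked above):

    // Sketch: read the current file's path and size from this mapper's split.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.println("current input file: " + fileSplit.getPath()
                    + ", length: " + fileSplit.getLength());
        }
    }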
Date: Tue, 12 Feb 2013 12:35:16 +0530
Subject: number input files to mapreduce job
From: vikascjadha...@gmail.com
To: user@hadoop.apache.org

Hi all,How to get number of Input files and thier to particular mapreduce job 
in java MapReduce  program.
-- 



Thanx and Regards Vikas Jadhav

RE: Loader for small files

2013-02-12 Thread java8964 java8964

 Hi, David:
I am not sure I understand this suggestion. Why would a smaller block size help 
this performance issue?
From what the original question described, it looks like the performance problem 
is that there are a lot of small files, and each file will run in its 
own mapper.
Hadoop needs to start a lot of mappers (and creating a mapper also takes 
time and resources), but each mapper only takes a small amount of data (maybe 
hundreds of KB or several MB, much less than the block size), so most of the 
time is wasted creating task instances for mappers, while each mapper finishes 
very quickly.
That is the reason for the performance problem, right? Do I understand the problem 
wrong?
If so, reducing the block size won't help in this case, right? To fix it, we 
need to merge multiple files into one mapper, so that one mapper has enough data to 
process. 
Unless my understanding is totally wrong, I don't know how reducing the block size 
will help in this case.
Thanks
Yong

 Subject: Re: Loader for small files
 From: davidlabarb...@localresponse.com
 Date: Mon, 11 Feb 2013 15:38:54 -0500
 CC: user@hadoop.apache.org
 To: u...@pig.apache.org
 
 What process creates the data in HDFS? You should be able to set the block 
 size there and avoid the copy.
 
 I would test the dfs.block.size on the copy and see if you get the mapper 
 split you want before worrying about optimizing.
 
 David
 
 On Feb 11, 2013, at 2:10 PM, Something Something mailinglist...@gmail.com 
 wrote:
 
  David:  Your suggestion would add an additional step of copying data from
  one place to another.  Not bad, but not ideal.  Is there no way to avoid
  copying of data?
  
  BTW, we have tried changing the following options to no avail :(
  
  set pig.splitCombination false;
  
   a few other 'dfs' options given below:
  
  mapreduce.min.split.size
  mapreduce.max.split.size
  
  Thanks.
  
  On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera 
  davidlabarb...@localresponse.com wrote:
  
  You could store your data in smaller block sizes. Do something like
  hadoop fs HADOOP_OPTS=-Ddfs.block.size=1048576
  -Dfs.local.block.size=1048576 -cp /org-input /small-block-input
  You might only need one of those parameters. You can verify the block size
  with
  hadoop fsck /small-block-input
  
  In your pig script, you'll probably need to set
  pig.maxCombinedSplitSize
  to something around the block size
  
  David
  
  On Feb 11, 2013, at 1:24 PM, Something Something mailinglist...@gmail.com
  wrote:
  
  Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
  HBase.  Adding 'hadoop' user group.
  
  On Mon, Feb 11, 2013 at 10:22 AM, Something Something 
  mailinglist...@gmail.com wrote:
  
  Hello,
  
  We are running into performance issues with Pig/Hadoop because our input
  files are small.  Everything goes to only 1 Mapper.  To get around
  this, we
  are trying to use our own Loader like this:
  
  1)  Extend PigStorage:
  
  public class SmallFileStorage extends PigStorage {
  
public SmallFileStorage(String delimiter) {
super(delimiter);
}
  
@Override
public InputFormat getInputFormat() {
return new NLineInputFormat();
}
  }
  
  
  
  2)  Add command line argument to the Pig command as follows:
  
  -Dmapreduce.input.lineinputformat.linespermap=50
  
  
  
  3)  Use SmallFileStorage in the Pig script as follows:
  
  USING com.xxx.yyy.SmallFileStorage ('\t')
  
  
  But this doesn't seem to work.  We still see that everything is going to
  one mapper.  Before we spend any more time on this, I am wondering if
  this
  is a good approach – OR – if there's a better approach?  Please let me
  know.  Thanks.
  
  
  
  
  
 
  

Re: NullPointerException in Spring Data Hadoop with CDH4

2013-02-12 Thread Christian Schneider
With the help of Costin I got a running Maven configuration.

Thank you :).

This is a pom.xml for Spring Data Hadoop and CDH4:

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
    http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>com.example.main</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <java-version>1.7</java-version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spring.version>3.2.0.RELEASE</spring.version>
        <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
        <hadoop.version.generic>2.0.0-cdh4.1.3</hadoop.version.generic>
        <hadoop.version.mr1>2.0.0-mr1-cdh4.1.3</hadoop.version.mr1>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${spring.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>commons-logging</groupId>
                    <artifactId>commons-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${spring.version}</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.data</groupId>
            <artifactId>spring-data-hadoop</artifactId>
            <version>${spring.hadoop.version}</version>

            <exclusions>
                <!-- Excluded the Hadoop dependencies to be
                     sure that they are not mixed with them provided by cloudera. -->
                <exclusion>
                    <artifactId>hadoop-streaming</artifactId>
                    <groupId>org.apache.hadoop</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>hadoop-tools</artifactId>
                    <groupId>org.apache.hadoop</groupId>
                </exclusion>
            </exclusions>

        </dependency>

        <!-- Hadoop Cloudera Dependencies -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version.generic}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version.generic}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-tools</artifactId>
            <version>2.0.0-mr1-cdh4.1.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-streaming</artifactId>
            <version>2.0.0-mr1-cdh4.1.3</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>${java-version}</source>
                    <target>${java-version}</target>
                </configuration>
            </plugin>

        </plugins>
    </build>

    <repositories>
        <repository>
            <id>spring-milestones</id>
            <url>http://repo.springsource.org/libs-milestone</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>

        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>

        <repository>
            <id>spring-snapshot</id>
            <name>Spring Maven SNAPSHOT

Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread Robert Molina
Hi Dhanasekaran,
I believe you are trying to ask whether it is recommended to use the
decommissioning feature to remove datanodes from your cluster; the answer
is yes.  As far as how to do it, there is some information
here http://wiki.apache.org/hadoop/FAQ that should help.

Regards,
Robert

On Tue, Feb 12, 2013 at 3:50 AM, Dhanasekaran Anbalagan
bugcy...@gmail.com wrote:

 Hi Guys,

 Is it recommended to remove one of the datanodes in a production cluster
 by decommissioning that particular datanode? Please guide me.

 -Dhanasekaran,

 Did I learn something today? If not, I wasted it.



Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread Benjamin Kim
Hi,

I would like to add another scenario: what are the steps for removing a 
dead node when the server had a hard failure that is unrecoverable?

Thanks,
Ben

On Tuesday, February 12, 2013 7:30:57 AM UTC-8, sudhakara st wrote:

 The decommissioning process is controlled by an exclude file, which for 
 HDFS is set by the* dfs.hosts.exclude* property, and for MapReduce by 
 the*mapred.hosts.exclude
 * property. In most cases, there is one shared file,referred to as the 
 exclude file.This  exclude file name should be specified as a configuration 
 parameter *dfs.hosts.exclude *in the name node start up.


 To remove nodes from the cluster:
 1. Add the network addresses of the nodes to be decommissioned to the 
 exclude file.

 2. Restart the MapReduce cluster to stop the tasktrackers on the nodes 
 being
 decommissioned.
 3. Update the namenode with the new set of permitted datanodes, with this
 command:
 % hadoop dfsadmin -refreshNodes
 4. Go to the web UI and check whether the admin state has changed to 
 “Decommission
 In Progress” for the datanodes being decommissioned. They will start 
 copying
 their blocks to other datanodes in the cluster.

 5. When all the datanodes report their state as “Decommissioned,” then all 
 the blocks
 have been replicated. Shut down the decommissioned nodes.
 6. Remove the nodes from the include file, and run:
 % hadoop dfsadmin -refreshNodes
 7. Remove the nodes from the slaves file.

  Decommission data nodes in small percentage(less than 2%) at time don't 
 cause any effect on cluster. But it better to pause MR-Jobs before you 
 triggering Decommission to ensure  no task running in decommissioning 
 subjected nodes.
  If very small percentage of task running in the decommissioning node it 
 can submit to other task tracker, but percentage queued jobs  larger then 
 threshold  then there is chance of job failure. Once triggering the 'hadoop 
 dfsadmin -refreshNodes' command and decommission started, you can resume 
 the MR jobs.

 *Source : The Definitive Guide [Tom White]*



 On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan 
 wrote:

 Hi Guys,

 It's recommenced do with removing one the datanode in production cluster.
 via Decommission the particular datanode. please guide me.
  
 -Dhanasekaran,

 Did I learn something today? If not, I wasted it.
  


Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread sudhakara st
The decommissioning process is controlled by an exclude file, which for 
HDFS is set by the dfs.hosts.exclude property, and for MapReduce by the 
mapred.hosts.exclude property. In most cases there is one shared file, referred to as the 
exclude file. This exclude file name should be specified as the configuration 
parameter dfs.hosts.exclude at namenode startup.


To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the 
exclude file.

2. Restart the MapReduce cluster to stop the tasktrackers on the nodes being
decommissioned.
3. Update the namenode with the new set of permitted datanodes, with this
command:
% hadoop dfsadmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to 
“Decommission
In Progress” for the datanodes being decommissioned. They will start copying
their blocks to other datanodes in the cluster.

5. When all the datanodes report their state as “Decommissioned,” then all 
the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
% hadoop dfsadmin -refreshNodes
7. Remove the nodes from the slaves file.

 Decommissioning data nodes in small percentages (less than 2% at a time) doesn't 
have any effect on the cluster. But it is better to pause MR jobs before 
triggering decommissioning, to ensure no tasks are running on the nodes being 
decommissioned.
 If only a very small percentage of tasks is running on the decommissioning node, they 
can be submitted to other tasktrackers, but if the percentage of queued jobs is larger than 
the threshold, then there is a chance of job failure. Once you have triggered the 'hadoop 
dfsadmin -refreshNodes' command and decommissioning has started, you can resume 
the MR jobs.

*Source : The Definitive Guide [Tom White]*



On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan 
wrote:

 Hi Guys,

 It's recommenced do with removing one the datanode in production cluster.
 via Decommission the particular datanode. please guide me.
  
 -Dhanasekaran,

 Did I learn something today? If not, I wasted it.
  


Re: Decommissioning Nodes in Production Cluster.

2013-02-12 Thread shashwat shriparv
On Tue, Feb 12, 2013 at 11:43 PM, Robert Molina rmol...@hortonworks.com wrote:

 to do it, there should be some information he


This is the best way to remove a data node from a cluster; you have done the
right thing.



∞
Shashwat Shriparv


Re: Loader for small files

2013-02-12 Thread Something Something
No, Yong, I believe you misunderstood. David's explanation makes sense.  As
pointed out in my original email, everything is going to 1 mapper.  It's
not creating multiple mappers.

BTW, the code given in my original email does indeed work as expected.  It
does trigger multiple mappers, but it doesn't really improve the
performance.

We believe the problem is that there's data skew.  We are looking into
creating a Partitioner to solve it.  Thanks.


On Tue, Feb 12, 2013 at 7:15 AM, java8964 java8964 java8...@hotmail.com wrote:

   Hi, Davie:

 I am not sure I understand this suggestion. Why smaller block size will
 help this performance issue?

 From what the original question about, it looks like the performance
 problem is due to that there are a lot of small files, and each file will
 run in its own mapper.

 As hadoop needs to start a lot of mappers (I think creating a mapper also
 takes time and resource), but each mapper only take small amount of data
 (maybe hundreds K or several M of data, much less than the block size),
 most of the time is wasting on creating task instance for mapper, but each
 mapper finishes very quickly.

 This is the reason of performance problem, right? Do I understand the
 problem wrong?

 If so, reducing the block size won't help in this case, right? To fix it,
 we need to merge multi-files into one mapper, so let one mapper has enough
 data to process.

 Unless my understanding is total wrong, I don't know how reducing block
 size will help in this case.

 Thanks

 Yong

  Subject: Re: Loader for small files
  From: davidlabarb...@localresponse.com
  Date: Mon, 11 Feb 2013 15:38:54 -0500
  CC: user@hadoop.apache.org
  To: u...@pig.apache.org

 
  What process creates the data in HDFS? You should be able to set the
 block size there and avoid the copy.
 
  I would test the dfs.block.size on the copy and see if you get the
 mapper split you want before worrying about optimizing.
 
  David
 
  On Feb 11, 2013, at 2:10 PM, Something Something 
 mailinglist...@gmail.com wrote:
 
   David: Your suggestion would add an additional step of copying data
 from
   one place to another. Not bad, but not ideal. Is there no way to avoid
   copying of data?
  
   BTW, we have tried changing the following options to no avail :(
  
   set pig.splitCombination false;
  
a few other 'dfs' options given below:
  
   mapreduce.min.split.size
   mapreduce.max.split.size
  
   Thanks.
  
   On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera 
   davidlabarb...@localresponse.com wrote:
  
   You could store your data in smaller block sizes. Do something like
   hadoop fs HADOOP_OPTS=-Ddfs.block.size=1048576
   -Dfs.local.block.size=1048576 -cp /org-input /small-block-input
   You might only need one of those parameters. You can verify the block
 size
   with
   hadoop fsck /small-block-input
  
   In your pig script, you'll probably need to set
   pig.maxCombinedSplitSize
   to something around the block size
  
   David
  
   On Feb 11, 2013, at 1:24 PM, Something Something 
 mailinglist...@gmail.com
   wrote:
  
   Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not
 related to
   HBase. Adding 'hadoop' user group.
  
   On Mon, Feb 11, 2013 at 10:22 AM, Something Something 
   mailinglist...@gmail.com wrote:
  
   Hello,
  
   We are running into performance issues with Pig/Hadoop because our
 input
   files are small. Everything goes to only 1 Mapper. To get around
   this, we
   are trying to use our own Loader like this:
  
   1) Extend PigStorage:
  
   public class SmallFileStorage extends PigStorage {
  
   public SmallFileStorage(String delimiter) {
   super(delimiter);
   }
  
   @Override
   public InputFormat getInputFormat() {
   return new NLineInputFormat();
   }
   }
  
  
  
   2) Add command line argument to the Pig command as follows:
  
   -Dmapreduce.input.lineinputformat.linespermap=50
  
  
  
   3) Use SmallFileStorage in the Pig script as follows:
  
   USING com.xxx.yyy.SmallFileStorage ('\t')
  
  
   But this doesn't seem to work. We still see that everything is
 going to
   one mapper. Before we spend any more time on this, I am wondering if
   this
   is a good approach – OR – if there's a better approach? Please let
 me
   know. Thanks.
  
  
  
  
  
 



Java submit job to remote server

2013-02-12 Thread Alex Thieme
I apologize for asking what seems to be such a basic question, but I could use 
some help with submitting a job to a remote server.

I have downloaded and installed hadoop locally in pseudo-distributed mode. I 
have written some Java code to submit a job. 

Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper 
I've written.

If I enable the conf.set("mapred.job.tracker", "localhost:9001") line, then I 
get the exception included below.

If that line is disabled, then the job is completed. However, in reviewing the 
hadoop server administration page (http://localhost:50030/jobtracker.jsp) I 
don't see the job as processed by the server. Instead, I wonder if my Java code 
is simply running the necessary mapper Java code, bypassing the locally 
installed server.

Thanks in advance.

Alex

public class OfflineDataTool extends Configured implements Tool {

public int run(final String[] args) throws Exception {
final Configuration conf = getConf();
//conf.set("mapred.job.tracker", "localhost:9001");

final Job job = new Job(conf);
job.setJarByClass(getClass());
job.setJobName(getClass().getName());

job.setMapperClass(OfflineDataMapper.class);

job.setInputFormatClass(TextInputFormat.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new 
org.apache.hadoop.fs.Path(args[0]));

final org.apache.hadoop.fs.Path output = new 
org.apache.hadoop.fs.Path(args[1]);
FileSystem.get(conf).delete(output, true);
FileOutputFormat.setOutputPath(job, output);

return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(final String[] args) {
try {
int result = ToolRunner.run(new Configuration(), new 
OfflineDataTool(), new String[]{"offline/input", "offline/output"});
log.error("result = {}", result);
} catch (final Exception e) {
throw new RuntimeException(e);
}
}
}

public class OfflineDataMapper extends Mapper<LongWritable, Text, Text, Text> {

public OfflineDataMapper() {
super();
}

@Override
protected void map(final LongWritable key, final Text value, final Context 
context) throws IOException, InterruptedException {
final String inputString = value.toString();
OfflineDataMapper.log.error("inputString = {}", inputString);
}
}



Re: Java submit job to remote server

2013-02-12 Thread Nitin Pawar
conf.set("mapred.job.tracker", "localhost:9001");

This means that your jobtracker is on port 9001 on localhost.

If you change it to the remote host, and that's the port it's running on, then
it should work as expected.

What's the exception you are getting?


On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote:

 I apologize for asking what seems to be such a basic question, but I would
 use some help with submitting a job to a remote server.

 I have downloaded and installed hadoop locally in pseudo-distributed mode.
 I have written some Java code to submit a job.

 Here's the org.apache.hadoop.util.Tool
 and org.apache.hadoop.mapreduce.Mapper I've written.

 If I enable the conf.set(mapred.job.tracker, localhost:9001) line,
 then I get the exception included below.

 If that line is disabled, then the job is completed. However, in reviewing
 the hadoop server administration page (
 http://localhost:50030/jobtracker.jsp) I don't see the job as processed
 by the server. Instead, I wonder if my Java code is simply running the
 necessary mapper Java code, bypassing the locally installed server.

 Thanks in advance.

 Alex

 public class OfflineDataTool extends Configured implements Tool {

 public int run(final String[] args) throws Exception {
 final Configuration conf = getConf();
 //conf.set(mapred.job.tracker, localhost:9001);

 final Job job = new Job(conf);
 job.setJarByClass(getClass());
 job.setJobName(getClass().getName());

 job.setMapperClass(OfflineDataMapper.class);

 job.setInputFormatClass(TextInputFormat.class);

 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);

 FileInputFormat.addInputPath(job, new
 org.apache.hadoop.fs.Path(args[0]));

 final org.apache.hadoop.fs.Path output = new
 org.apache.hadoop.fs.Path(args[1]);
 FileSystem.get(conf).delete(output, true);
 FileOutputFormat.setOutputPath(job, output);

 return job.waitForCompletion(true) ? 0 : 1;
 }

 public static void main(final String[] args) {
 try {
 int result = ToolRunner.run(new Configuration(), new
 OfflineDataTool(), new String[]{offline/input, offline/output});
 log.error(result = {}, result);
 } catch (final Exception e) {
 throw new RuntimeException(e);
 }
 }
 }

 public class OfflineDataMapper extends MapperLongWritable, Text, Text,
 Text {

 public OfflineDataMapper() {
 super();
 }

 @Override
 protected void map(final LongWritable key, final Text value, final
 Context context) throws IOException, InterruptedException {
 final String inputString = value.toString();
 OfflineDataMapper.log.error(inputString = {}, inputString);
 }
 }




-- 
Nitin Pawar


Re: Java submit job to remote server

2013-02-12 Thread Alex Thieme
Thanks for the prompt reply and I'm sorry I forgot to include the exception. My 
bad. I've included it below. There certainly appears to be a server running on 
localhost:9001. At least, I was able to telnet to that address. While in 
development, I'm treating the server on localhost as the remote server. Moving 
to production, there'd obviously be a different remote server address 
configured.

Root Exception stack trace:
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
+ 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for 
everything)


On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 conf.set(mapred.job.tracker, localhost:9001);
 
 this means that your jobtracker is on port 9001 on localhost 
 
 if you change it to the remote host and thats the port its running on then it 
 should work as expected 
 
 whats the exception you are getting? 
 
 
 On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote:
 I apologize for asking what seems to be such a basic question, but I would 
 use some help with submitting a job to a remote server.
 
 I have downloaded and installed hadoop locally in pseudo-distributed mode. I 
 have written some Java code to submit a job. 
 
 Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper 
 I've written.
 
 If I enable the conf.set(mapred.job.tracker, localhost:9001) line, then I 
 get the exception included below.
 
 If that line is disabled, then the job is completed. However, in reviewing 
 the hadoop server administration page (http://localhost:50030/jobtracker.jsp) 
 I don't see the job as processed by the server. Instead, I wonder if my Java 
 code is simply running the necessary mapper Java code, bypassing the locally 
 installed server.
 
 Thanks in advance.
 
 Alex
 
 public class OfflineDataTool extends Configured implements Tool {
 
 public int run(final String[] args) throws Exception {
 final Configuration conf = getConf();
 //conf.set(mapred.job.tracker, localhost:9001);
 
 final Job job = new Job(conf);
 job.setJarByClass(getClass());
 job.setJobName(getClass().getName());
 
 job.setMapperClass(OfflineDataMapper.class);
 
 job.setInputFormatClass(TextInputFormat.class);
 
 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);
 
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 
 FileInputFormat.addInputPath(job, new 
 org.apache.hadoop.fs.Path(args[0]));
 
 final org.apache.hadoop.fs.Path output = new 
 org.apache.hadoop.fs.Path(args[1]);
 FileSystem.get(conf).delete(output, true);
 FileOutputFormat.setOutputPath(job, output);
 
 return job.waitForCompletion(true) ? 0 : 1;
 }
 
 public static void main(final String[] args) {
 try {
 int result = ToolRunner.run(new Configuration(), new 
 OfflineDataTool(), new String[]{offline/input, offline/output});
 log.error(result = {}, result);
 } catch (final Exception e) {
 throw new RuntimeException(e);
 }
 }
 }
 
 public class OfflineDataMapper extends MapperLongWritable, Text, Text, Text 
 {
 
 public OfflineDataMapper() {
 super();
 }
 
 @Override
 protected void map(final LongWritable key, final Text value, final 
 Context context) throws IOException, InterruptedException {
 final String inputString = value.toString();
 OfflineDataMapper.log.error(inputString = {}, inputString);
 }
 }
 
 
 
 
 -- 
 Nitin Pawar



Re: Question related to Decompressor interface

2013-02-12 Thread George Datskos

Hello

Can someone share some idea what the Hadoop source code of class 
org.apache.hadoop.io.compress.BlockDecompressorStream, method 
rawReadInt() is trying to do here?


The BlockDecompressorStream class is used for block-based decompression 
(e.g. snappy).  Each chunk has a header indicating how many bytes it is. 
That header is obtained by the rawReadInt method so it is expected to 
return a non-negative value (since you can't have a negative length).
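
Roughly, the header is assembled from 4 bytes, big-endian, along the lines of
this simplified sketch (not the exact Hadoop source):

    // Read the 4-byte length header of the next compressed chunk.
    private int rawReadInt(InputStream in) throws IOException {
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        int b4 = in.read();
        if ((b1 | b2 | b3 | b4) < 0) {
            throw new EOFException(); // ran out of input mid-header
        }
        return (b1 << 24) + (b2 << 16) + (b3 << 8) + b4;
    }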



George


Re: Java submit job to remote server

2013-02-12 Thread Hemanth Yamijala
Can you please include the complete stack trace and not just the root cause?
Also, have you set fs.default.name to an HDFS location like
hdfs://localhost:9000?
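
With the old-style (MR1) client configuration that usually means something
like this sketch (adjust hosts and ports to wherever your NameNode and
JobTracker actually run):

    // Point the client at HDFS and the JobTracker explicitly.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:9000");
    conf.set("mapred.job.tracker", "localhost:9001");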

Thanks
Hemanth

On Wednesday, February 13, 2013, Alex Thieme wrote:

 Thanks for the prompt reply and I'm sorry I forgot to include the
 exception. My bad. I've included it below. There certainly appears to be a
 server running on localhost:9001. At least, I was able to telnet to that
 address. While in development, I'm treating the server on localhost as the
 remote server. Moving to production, there'd obviously be a different
 remote server address configured.

 Root Exception stack trace:
 java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true'
 for everything)

 

 On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 conf.set(mapred.job.tracker, localhost:9001);

 this means that your jobtracker is on port 9001 on localhost

 if you change it to the remote host and thats the port its running on then
 it should work as expected

 whats the exception you are getting?


 On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote:

 I apologize for asking what seems to be such a basic question, but I would
 use some help with submitting a job to a remote server.

 I have downloaded and installed hadoop locally in pseudo-distributed mode.
 I have written some Java code to submit a job.

 Here's the org.apache.hadoop.util.Tool
 and org.apache.hadoop.mapreduce.Mapper I've written.

 If I enable the conf.set(mapred.job.tracker, localhost:9001) line,
 then I get the exception included below.

 If that line is disabled, then the job is completed. However, in reviewing
 the hadoop server administration page (
 http://localhost:50030/jobtracker.jsp) I don't see the job as processed
 by the server. Instead, I wonder if my Java code is simply running the
 necessary mapper Java code, bypassing the locally installed server.

 Thanks in advance.

 Alex

 public class OfflineDataTool extends Configured implements Tool {

 public int run(final String[] args) throws Exception {
 final Configuration conf = getConf();
 //conf.set(mapred.job.tracker, localhost:9001);

 final Job job = new Job(conf);
 job.setJarByClass(getClass());
 job.setJobName(getClass().getName());

 job.setMapperClass(OfflineDataMapper.class);

 job.setInputFormatClass(TextInputFormat.class);

 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);

 FileInputFormat.addInputPath(job, new
 org.apache.hadoop.fs.Path(args[0]));

 final org.apache.hadoop.fs.Path output = new org.a




Re: Delivery Status Notification (Failure)

2013-02-12 Thread Arun C Murthy
Pls don't cross-post; this belongs only on the cdh lists.

On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:

 
 
 Hi All,
I wanted to know how to connect Hive(hadoop-cdh4 distribution) with
 MircoStrategy
Any help is very helpfull.
 
   Witing for you response
 
 Note: It is little bit urgent do any one have exprience in that
 Thanks,
 samir
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Delivery Status Notification (Failure)

2013-02-12 Thread Raj Vishwanathan
Arun

I don't understand your reply!  Had you redirected this person to the Hive 
mailing list I would have understood.

My philosophy on any mailing list has always been: if I know the answer to a 
question, I reply; else I humbly walk away. 


I got a lot of help from this group for my (mostly stupid) questions, and 
people helped me. I would like to return the favor when (and if) I can.

My humble $0.01. :-)

And for the record, I don't know the answer to the question on MicroStrategy :-)


Raj






 From: Arun C Murthy a...@hortonworks.com
To: user@hadoop.apache.org 
Sent: Tuesday, February 12, 2013 6:42 PM
Subject: Re: Delivery Status Notification (Failure)
 

Pls don't cross-post, this belong only to cdh lists.


On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:




Hi All,
   I wanted to know how to connect Hive(hadoop-cdh4 distribution) with
MircoStrategy
   Any help is very helpfull.

  Witing for you response

Note: It is little bit urgent do any one have exprience in that
Thanks,
samir



--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

 




Re: Delivery Status Notification (Failure)

2013-02-12 Thread Konstantin Boudnik
With all due respect, sir, these mailing lists have certain rules, which
evidently don't coincide with your philosophy. 

Cos

On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote:
 Arun
 
 I don't understand your reply! Had you redirected this person to the hive
 mailing list I would have understood..
 
 My philosophy any mailing list has always been ; If I know the answer to a
 question, I reply.. Else I humbly walk away!.
 
 I got a lot of help from this group for my (mostly stupid) questions - and
 people helped me. I would like to return the favor when ( and if) I can.
 
 My humble $0.01.:-)
 
 And for the record- I don't know the answer to the question on microstrategy 
 :-)
 
 
 Raj
 
 
 
 
 
 
  From: Arun C Murthy a...@hortonworks.com
 To: user@hadoop.apache.org 
 Sent: Tuesday, February 12, 2013 6:42 PM
 Subject: Re: Delivery Status Notification (Failure)
  
 
 Pls don't cross-post, this belong only to cdh lists.
 
 
 On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:
 
 
 
 
 Hi All,
    I wanted to know how to connect Hive(hadoop-cdh4 distribution) with
  MircoStrategy
    Any help is very helpfull.
  
   Witing for you response
 
 Note: It is little bit urgent do any one have exprience in that
 Thanks,
 samir
 
 
 
 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/
 
  
 
 
 




Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread Harsh J
Please do not use the general@ lists for any user-oriented questions.
Please redirect them to user@hadoop.apache.org lists, which is where
the user community and questions lie.

I've moved your post there and have added you on CC in case you
haven't subscribed there. Please reply back only to the user@
addresses. The general@ list is for Apache Hadoop project-level
management and release oriented discussions alone.

On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com wrote:
 Hi All,
 I am trying to figure out a good solution for such a scenario as following.

 1. I have a 2T file (let's call it A), filled by key/value pairs,
 which is stored in the HDFS with the default 64M block size. In A,
 each key is less than 1K and each value is about 20M.

 2. Occasionally, I will run analysis by using a different type of data
 (usually less than 10G, and let's call it B) and do look-up table
 alike operations by using the values in A. B resides in HDFS as well.

 3. This analysis would require loading only a small number of values
 from A (usually less than 1000 of them) into the memory for fast
 look-up against the data in B. The way B finds the few values in A is
 by looking up for the key in A.

 Is there an efficient way to do this?

 I was thinking if I could identify the locality of the block that
 contains the few values, I might be able to push the B into the few
 nodes that contains the few values in A?  Since I only need to do this
 occasionally, maintaining a distributed database such as HBase cant be
 justified.

 Many thanks.


 Cao



--
Harsh J


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread Harsh J
My reply to your questions is inline.

On Wed, Feb 13, 2013 at 10:59 AM, Harsh J ha...@cloudera.com wrote:
 Please do not use the general@ lists for any user-oriented questions.
 Please redirect them to user@hadoop.apache.org lists, which is where
 the user community and questions lie.

 I've moved your post there and have added you on CC in case you
 haven't subscribed there. Please reply back only to the user@
 addresses. The general@ list is for Apache Hadoop project-level
 management and release oriented discussions alone.

 On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com wrote:
 Hi All,
 I am trying to figure out a good solution for such a scenario as following.

 1. I have a 2T file (let's call it A), filled by key/value pairs,
 which is stored in the HDFS with the default 64M block size. In A,
 each key is less than 1K and each value is about 20M.

 2. Occasionally, I will run analysis by using a different type of data
 (usually less than 10G, and let's call it B) and do look-up table
 alike operations by using the values in A. B resides in HDFS as well.

 3. This analysis would require loading only a small number of values
 from A (usually less than 1000 of them) into the memory for fast
 look-up against the data in B. The way B finds the few values in A is
 by looking up for the key in A.

About 1000 such rows would equal a memory expense of near 20 GB, given
the value size of A you've noted above. The solution may need to be
considered with this in mind, if the whole lookup table is to be
eventually generated into the memory and never discarded until the end
of processing.

 Is there an efficient way to do this?

Since HBase may be too much for your simple needs, have you instead
considered using MapFiles, which allow fast key lookups at a file
level over HDFS/MR? You can have these files either highly replicated
(if their size is large), or distributed via the distributed cache in
the lookup jobs (if they are infrequently used and small sized), and
be able to use the  MapFile reader API to perform lookups of keys and
read values only when you want them.
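
A minimal lookup sketch with the MapFile.Reader API (assumes the map file
was written sorted, with Text keys and values; the path is illustrative):

    // Look up a single key in a MapFile stored on HDFS.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/A.map", conf);
    try {
        Text value = new Text();
        Writable found = reader.get(new Text("some-key"), value); // null if the key is absent
        if (found != null) {
            // use 'value' for the lookup against B
        }
    } finally {
        reader.close();
    }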

 I was thinking if I could identify the locality of the block that
 contains the few values, I might be able to push the B into the few
 nodes that contains the few values in A?  Since I only need to do this
 occasionally, maintaining a distributed database such as HBase cant be
 justified.

I agree that HBase may not be wholly suited to be run just for this
purpose (unless A's also gonna be scaling over time).

Maintaining the value-to-locality mapping would need to be done by you. FS
APIs provide locality info calls, and your files may be
key-partitioned enough to identify each one's range, so you can
combine the knowledge of these two to do something along these lines.
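
The locality call in question is FileSystem.getFileBlockLocations; a rough
sketch (the file path is illustrative):

    // Ask the NameNode which hosts hold the blocks of one of A's files.
    FileStatus status = fs.getFileStatus(new Path("/data/A/part-00042"));
    BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation location : locations) {
        System.out.println("offset " + location.getOffset() + " on hosts "
                + java.util.Arrays.toString(location.getHosts()));
    }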

Using HBase may also turn out to be easier, but thats upto you. You
can also choose to tear it down (i.e. the services) when not needed,
btw.

 Many thanks.


 Cao



 --
 Harsh J



--
Harsh J


Re: Anyway to load certain Key/Value pair fast?

2013-02-12 Thread William Kang
Hi Harsh,
Thanks a lot for your reply and great suggestions.

In the practical cases, the values usually do not reside in the same
data node. Instead, they are mostly distributed by the key range
itself. So, it does require 20G of memory, but distributed in
different nodes.

The MapFile solution is very intriguing. I am not very familiar with
it, though. I assume it kind of resembles the basic idea of HBase? I will
certainly try it out and follow up if there are questions.

I agree that using HBase would be much easier. But the value size
makes me worry that it would push HBase to the edge. If I do this more
often, I will definitely consider using HBase.

Many thanks for the great reply.


William


On Wed, Feb 13, 2013 at 12:38 AM, Harsh J ha...@cloudera.com wrote:
 My reply to your questions is inline.

 On Wed, Feb 13, 2013 at 10:59 AM, Harsh J ha...@cloudera.com wrote:
 Please do not use the general@ lists for any user-oriented questions.
 Please redirect them to user@hadoop.apache.org lists, which is where
 the user community and questions lie.

 I've moved your post there and have added you on CC in case you
 haven't subscribed there. Please reply back only to the user@
 addresses. The general@ list is for Apache Hadoop project-level
 management and release oriented discussions alone.

 On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com 
 wrote:
 Hi All,
 I am trying to figure out a good solution for such a scenario as following.

 1. I have a 2T file (let's call it A), filled by key/value pairs,
 which is stored in the HDFS with the default 64M block size. In A,
 each key is less than 1K and each value is about 20M.

 2. Occasionally, I will run analysis by using a different type of data
 (usually less than 10G, and let's call it B) and do look-up table
 alike operations by using the values in A. B resides in HDFS as well.

 3. This analysis would require loading only a small number of values
 from A (usually less than 1000 of them) into the memory for fast
 look-up against the data in B. The way B finds the few values in A is
 by looking up for the key in A.

 About 1000 such rows would equal a memory expense of near 20 GB, given
 the value size of A you've noted above. The solution may need to be
 considered with this in mind, if the whole lookup table is to be
 eventually generated into the memory and never discarded until the end
 of processing.

 Is there an efficient way to do this?

 Since HBase may be too much for your simple needs, have you instead
 considered using MapFiles, which allow fast key lookups at a file
 level over HDFS/MR? You can have these files either highly replicated
 (if their size is large), or distributed via the distributed cache in
 the lookup jobs (if they are infrequently used and small sized), and
 be able to use the  MapFile reader API to perform lookups of keys and
 read values only when you want them.

 I was thinking if I could identify the locality of the block that
 contains the few values, I might be able to push the B into the few
 nodes that contains the few values in A?  Since I only need to do this
 occasionally, maintaining a distributed database such as HBase cant be
 justified.

 I agree that HBase may not be wholly suited to be run just for this
 purpose (unless A's also gonna be scaling over time).

 Maintaining value - locality mapping would need to be done by you. FS
 APIs provide locality info calls, and your files may be
 key-partitioned enough to identify each one's range, and you can
 combine the knowledge of these two to do something along these lines.

 Using HBase may also turn out to be easier, but thats upto you. You
 can also choose to tear it down (i.e. the services) when not needed,
 btw.

 Many thanks.


 Cao



 --
 Harsh J



 --
 Harsh J


Re: Delivery Status Notification (Failure)

2013-02-12 Thread Raj Vishwanathan
Cos,

I understand that there are rules. What are these rules? Is it Hive vs. Hadoop 
(this I understand) or Apache Hadoop vs. a specific distribution? (This I am 
not clear about.)


Sent from my iPad
Please excuse the typos. 

On Feb 12, 2013, at 8:56 PM, Konstantin Boudnik c...@apache.org wrote:

 With all due respect, sir, these mailing lists have certain rules, that aren't
 evidently coincide with your philosophy. 
 
 Cos
 
 On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote:
 Arun
 
 I don't understand your reply! Had you redirected this person to the hive
 mailing list I would have understood..
 
 My philosophy any mailing list has always been ; If I know the answer to a
 question, I reply.. Else I humbly walk away!.
 
 I got a lot of help from this group for my (mostly stupid) questions - and
 people helped me. I would like to return the favor when ( and if) I can.
 
 My humble $0.01.:-)
 
 And for the record- I don't know the answer to the question on microstrategy 
 :-)
 
 
 Raj
 
 
 
 
 
 
 From: Arun C Murthy a...@hortonworks.com
 To: user@hadoop.apache.org 
 Sent: Tuesday, February 12, 2013 6:42 PM
 Subject: Re: Delivery Status Notification (Failure)
 
 
 Pls don't cross-post, this belong only to cdh lists.
 
 
 On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote:
 
 
 
 
 Hi All,
    I wanted to know how to connect Hive(hadoop-cdh4 distribution) with
 MircoStrategy
    Any help is very helpfull.
 
   Witing for you response
 
 Note: It is little bit urgent do any one have exprience in that
 Thanks,
 samir
 
 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/