Cannot start yarn daemons

2012-01-09 Thread raghavendhra rahul
Hi,
  I am trying to install Hadoop 0.23.1-SNAPSHOT. While starting yarn-daemons.sh it shows
the following error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager. Program will
exit.


Re: Container Launching

2012-01-09 Thread raghavendhra rahul
Any solutions?

On Fri, Jan 6, 2012 at 5:15 PM, raghavendhra rahul 
raghavendhrara...@gmail.com wrote:

 Hi all,
 I am trying to write an application master. Is there a way to specify
 node1: 10 containers
 node2: 10 containers
 Can we specify this kind of list using the application master?

 Also I set request.setHostName(client), where client is the hostname of
 a node.
 I checked the log to find the following error:
 java.io.FileNotFoundException: File
 /local1/yarn/.yarn/local/usercache/rahul_2011/appcache/
 application_1325760852770_0001 does not exist
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:431)
 at
 org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:815)
 at
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
 at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:700)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:697)
 at
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
 at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:697)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:122)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:237)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:67)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)

 i.e., containers are only launched on the host where the application
 master runs, while the other nodes always remain free.
 Any ideas?
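
For reference, one way to express "node1: 10 containers / node2: 10 containers" from an
application master is to submit one ResourceRequest per host, together with an any-host ("*")
request so the scheduler can still make progress. This is only a hedged sketch against the
0.23-era records API used above; the memory capability and priority values are illustrative
and not taken from the original post.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceRequest;
    import org.apache.hadoop.yarn.util.Records;

    // Illustrative capability and priority; real values depend on the cluster.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);

    // One host-level request per node the containers should land on.
    ResourceRequest onNode1 = Records.newRecord(ResourceRequest.class);
    onNode1.setHostName("node1");      // must match the hostname the NodeManager registers with
    onNode1.setNumContainers(10);
    onNode1.setCapability(capability);
    onNode1.setPriority(priority);

    ResourceRequest onNode2 = Records.newRecord(ResourceRequest.class);
    onNode2.setHostName("node2");
    onNode2.setNumContainers(10);
    onNode2.setCapability(capability);
    onNode2.setPriority(priority);

    // Schedulers normally expect a matching any-host ("*") request alongside the
    // host-level ones; all of these are then passed to the RM in the allocate() call.
    ResourceRequest anyHost = Records.newRecord(ResourceRequest.class);
    anyHost.setHostName("*");
    anyHost.setNumContainers(20);
    anyHost.setCapability(capability);
    anyHost.setPriority(priority);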



how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-09 Thread hao.wang
Hi, all
Our hadoop cluster has 22 nodes including one namenode, one jobtracker and 
20 datanodes.
Each node has 2 * 12 cores with 32G RAM
Could anyone tell me how to configure the following parameters:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

regards!
2012-01-09 



hao.wang 


Using Java Remote Method Invocation to make a UI for Hadoop

2012-01-09 Thread Tom Wilcox
Hi,

We are trying to make a UI for our HBase + Hadoop applications.

Basically, what we want is a web front end that can be hosted on a different 
server and present data from HBase, as well as allow us to launch Map-Reduce 
jobs from the web browser.

Our current approach is as follows:

We have a 6-node development cluster with each node running Scientific Linux in 
a VM with the following configuration:

Node - Hostname - Daemons

Node 1 - namenode - namenode, secondarynamenode, regionserver, hbase-master, 
zookeeper
Node 2 - jobtracker - jobtracker, zookeeper
Node 3 - slave0 - zookeeper, datanode, tasktracker
Node 4 - slave1 - datanode, tasktracker
Node 5 - slave2 - datanode, tasktracker
Node 6 - slave3 - datanode, tasktracker

We have a 7th Scientific Linux VM running a Grails web application called
BillyWeb.

We have created several Map-Reduce applications that run on the namenode to 
process HBase data and populate a results table. Currently, these MR apps are 
run from the command line on namenode.

We use the HBase REST interface to query data from the results table and 
present it in an AJAX-enabled web page. That is the reading part of the problem 
solved :)

Now we need to have an HTML page with a button that, when clicked, will execute 
an MR app on the namenode. This forms the writing part of the problem.

Currently, we are attempting to do this using Java's Remote Method Invocation 
(RMI) feature. I have successfully created a Java client-server application 
where the server program runs on namenode and the client runs on BillyWeb. The 
client program calls a remote object method to trigger the MR job on namenode 
and gets the result.
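
As an illustration of the remote-object contract being described, here is a minimal, hedged
sketch of what such an RMI interface could look like; the names are illustrative and are not
taken from the actual BillyWeb code.

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote interface: the Grails client looks this up in the RMI registry on
    // the namenode and calls runJob() to kick off a parameterised MR job, getting back a
    // simple status/result string.
    public interface JobLauncher extends Remote {
        String runJob(String jobName, String[] args) throws RemoteException;
    }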

We are now in the process of integrating it into the Grails webapp, by adding 
the source code and calling it using groovy taglibs. The only remaining issue 
we are wrestling with is the need to modify the security policy of the web app 
to allow access to the client portions of the source code to the remote server.

Eventually we hope to have an internal web site that you can visit in any web 
browser to view custom-made visualisations of the data and execute 
parameterised MR jobs to process data stored in HBase (amongst other places).

We would like to know your thoughts on this approach. In particular, this feels 
like a potentially convoluted approach to building a front-end that may feature 
several redundant steps (are we reinventing too many wheels?). 

Is there an alternative approach that you can think of that might be more 
sensible? 

Can you think of any problems with the approach we are currently taking?

Thanks,
Tom

has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Hi,

I'm trying to work out which compression algorithm I should be using in my 
MapReduce jobs.  It seems to me that the best solution is a compromise between 
speed, efficiency and splittability. The only compression algorithm to handle 
file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is 
bzip2, at the expense of compression speed.

However, I see from the documentation at 
http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see 
http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however 
the bzip2 Codec is still in the API at 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.

Has bzip2 support been removed from Hadoop, or will it be removed soon?

Thanks,

Tony





Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-09 Thread Harsh J
Hi,

Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html 
to learn how to configure Hadoop using the various *-site.xml configuration 
files, and then follow 
http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve 
optimal configs for your cluster.

On 09-Jan-2012, at 5:50 PM, hao.wang wrote:

 Hi, all
Our hadoop cluster has 22 nodes including one namenode, one jobtracker and 
 20 datanodes.
Each node has 2 * 12 cores with 32G RAM
Could anyone tell me how to configure the following parameters:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
 
 regards!
 2012-01-09 
 
 
 
 hao.wang 



Re: has bzip2 compression been deprecated?

2012-01-09 Thread Harsh J
Bzip2 is pretty slow. You probably do not want to use it, even if it does file 
splits (a feature not available in the stable line of 0.20.x/1.x, but available 
in 0.22+).

To answer your question though, bzip2 was removed from that document because it 
isn't a native library (it's pure Java). I think bzip2 was added earlier due to 
an oversight, as even 0.20 did not have a native bzip2 library. This change in 
docs does not mean that BZip2 is deprecated -- it is still fully supported and 
available in the trunk as well. See 
https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
that led to this.

The best way would be to use either:

(a) Hadoop sequence files with any compression codec of choice (best would be 
lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is 
splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
hadoop-lzo-packager for packages). This requires you to run indexing operations 
before the .lzo can be made splittable, but works great with this extra step 
added.
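
As a concrete illustration of option (a), here is a hedged sketch of producing block-compressed
SequenceFile output with the new (org.apache.hadoop.mapreduce) API, assuming a Job object named
"job" has already been set up; GzipCodec is just one choice of codec, and the output stays
splittable because SequenceFiles compress per block inside the file.

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Write the job's output as a splittable, block-compressed SequenceFile.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);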

On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:

 Hi,
 
 I'm trying to work out which compression algorithm I should be using in my 
 MapReduce jobs.  It seems to me that the best solution is a compromise 
 between speed, efficiency and splittability. The only compression algorithm 
 to handle file splits (according to Hadoop: The Definitive Guide 2nd edition 
 p78 etc) is bzip2, at the expense of compression speed.
 
 However, I see from the documentation at 
 http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
 bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, 
 see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - 
 however the bzip2 Codec is still in the API at 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
 
 Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
 Thanks,
 
 Tony
 
 
 



RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Thanks for the quick reply and the clarification about the documentation.

Regarding sequence files: am I right in thinking that they're a good choice for 
intermediate steps in chained MR jobs, or for file transfer between the Map and 
the Reduce phases of a job; but they shouldn't be used for human-readable files 
at the end of one or more MapReduce jobs? How about if the only use of a job's 
output is analysis via Hive - can Hive create tables from sequence files? 

Tony



-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: 09 January 2012 15:34
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Bzip2 is pretty slow. You probably do not want to use it, even if it does file 
splits (a feature not available in the stable line of 0.20.x/1.x, but available 
in 0.22+).

To answer your question though, bzip2 was removed from that document cause it 
isn't a native library (its pure Java). I think bzip2 was added earlier due to 
an oversight, as even 0.20 did not have a native bzip2 library. This change in 
docs does not mean that BZip2 is deprecated -- it is still fully supported and 
available in the trunk as well. See 
https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
that led to this.

The best way would be to use either:

(a) Hadoop sequence files with any compression codec of choice (best would be 
lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is 
splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
hadoop-lzo-packager for packages). This requires you to run indexing operations 
before the .lzo can be made splittable, but works great with this extra step 
added.

On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:

 Hi,
 
 I'm trying to work out which compression algorithm I should be using in my 
 MapReduce jobs.  It seems to me that the best solution is a compromise 
 between speed, efficiency and splittability. The only compression algorithm 
 to handle file splits (according to Hadoop: The Definitive Guide 2nd edition 
 p78 etc) is bzip2, at the expense of compression speed.
 
 However, I see from the documentation at 
 http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
 bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, 
 see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - 
 however the bzip2 Codec is still in the API at 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
 
 Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
 Thanks,
 
 Tony
 
 
 

increase number of map tasks

2012-01-09 Thread sset

Hello,

In HDFS we have set the block size to 40 bytes. The input data set is as below,
terminated with line feeds.

data1   (5*8=40 bytes)
data2
..
...
data10
 
 
But we still see only 2 map tasks spawned; there should have been at least 10 map
tasks. Each mapper performs a complex mathematical computation, and I am not sure
how the splitting works internally. Line feeds do not work as split boundaries.
Even with the settings below, the number of map tasks never goes beyond 2. Is there
any way to make this spawn 10 tasks? Basically it should behave like a compute
grid, with computation in parallel.
 
<property>
  <name>io.bytes.per.checksum</name>
  <value>30</value>
  <description>The number of bytes per checksum. Must not be larger than
  io.file.buffer.size.</description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>30</value>
  <description>The default block size for new files.</description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

This is a single node with a high configuration - 8 CPUs and 8 GB memory - hence we
are taking an example of 10 data items separated by line feeds. We want to utilize
the full power of the machine, so we want at least 10 map tasks; each task needs to
perform a highly complex mathematical simulation. At present it looks like the split
size (in bytes) is the only way to control the number of map tasks, but I would
prefer some criterion like a line feed or similar.

How do we get 10 map tasks from the above configuration? Please help.

thanks
 
-- 
View this message in context: 
http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33107775.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



TaskTracker: exception thrown when changing perms fails - hadoop on Windows

2012-01-09 Thread Michael Freeman
To Whom It May Concern:

I am running:

 Hadoop version 0.20.203.0 (downloaded from http://hadoop.apache.org/common/
)
Java version 1.7.0_02
Cygwin and ssh

on a Windows XP OS.

Overall the problems I have seen in the log files involve unexpected
directory permission settings of: rwxrwxrwxt when rwxr-xr-x was expected.
In all but the TaskTracker, I was able to solve this problem manually
simply by executing chmod -R 755. In the case of the TaskTracker, although
I manually changed the permissions, when I run start-mapred.sh, it
apparently is recreating this part of the directory tree with the
rwxrwxrwxt perms, then programmatically attempting to change the permissions
to 755. An exception is thrown (please see the log file contents below)
when this attempt is made.


I'm at a loss currently on how to rectify this. Note, I'm just beginning
with Hadoop and do not really have any background. I'm working my way
through Tom White's Hadoop: The Definitive Guide.

Thanks,
Mike Freeman

P.S. I should mention that my other log files are now error-free (after my
manually changing permissions) and I am able to run hadoop commands from
the cygwin command line.



2012-01-07 14:42:15,046 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2012-01-07 14:42:15,203 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-01-07 14:42:15,218 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-01-07 14:42:15,218 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics
system started
2012-01-07 14:42:15,656 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
registered.
2012-01-07 14:42:15,656 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
exists!
2012-01-07 14:42:15,859 INFO org.mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
2012-01-07 14:42:15,984 INFO org.apache.hadoop.http.HttpServer: Added
global filtersafety
(class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2012-01-07 14:42:16,031 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-01-07 14:42:16,031 INFO org.apache.hadoop.mapred.TaskTracker: Starting
tasktracker with owner as SYSTEM
2012-01-07 14:42:16,046 ERROR org.apache.hadoop.mapred.TaskTracker: Can not
start task tracker because java.io.IOException: Failed to set permissions
of path: /tmp/hadoop-SYSTEM/mapred/local/taskTracker to 0755
at
org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:507)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:630)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1328)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3430)

2012-01-07 14:42:16,046 INFO org.apache.hadoop.mapred.TaskTracker:
SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down TaskTracker
/


Re: has bzip2 compression been deprecated?

2012-01-09 Thread Harsh J
Tony,

* Yeah, SequenceFiles aren't human-readable, but fs -text can read it out 
(instead of a plain fs -cat). But if you are gonna export your files into a 
system you do not have much control over, probably best to have the resultant 
files not be in SequenceFile/Avro-DataFile format.
* Intermediate (M-to-R) files use a custom IFile format these days, which is 
built purely for that purpose.
* Hive can use SequenceFiles very well. There is also documented info on this 
in the Hive's wiki pages (Check the DDL pages, IIRC).
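
To expand on the first point above, a SequenceFile can also be read back programmatically when
needed; a small hedged sketch using the classic SequenceFile.Reader API (the path handling and
class name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqFileDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(args[0]);   // e.g. a part-r-00000 written by the job

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    // Print key/value pairs, same idea as 'hadoop fs -text'.
                    System.out.println(key + "\t" + value);
                }
            } finally {
                reader.close();
            }
        }
    }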

On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:

 Thanks for the quick reply and the clarification about the documentation.
 
 Regarding sequence files: am I right in thinking that they're a good choice 
 for intermediate steps in chained MR jobs, or for file transfer between the 
 Map and the Reduce phases of a job; but they shouldn't be used for 
 human-readable files at the end of one or more MapReduce jobs? How about if 
 the only use a job's output is analysis via Hive - can Hive create tables 
 from sequence files? 
 
 Tony
 
 
 
 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com] 
 Sent: 09 January 2012 15:34
 To: common-user@hadoop.apache.org
 Subject: Re: has bzip2 compression been deprecated?
 
 Bzip2 is pretty slow. You probably do not want to use it, even if it does 
 file splits (a feature not available in the stable line of 0.20.x/1.x, but 
 available in 0.22+).
 
 To answer your question though, bzip2 was removed from that document cause it 
 isn't a native library (its pure Java). I think bzip2 was added earlier due 
 to an oversight, as even 0.20 did not have a native bzip2 library. This 
 change in docs does not mean that BZip2 is deprecated -- it is still fully 
 supported and available in the trunk as well. See 
 https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
 that led to this.
 
 The best way would be to use either:
 
 (a) Hadoop sequence files with any compression codec of choice (best would be 
 lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is 
 splittable. Another choice would be Avro DataFiles from the Apache Avro 
 project.
 (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
 hadoop-lzo-packager for packages). This requires you to run indexing 
 operations before the .lzo can be made splittable, but works great with this 
 extra step added.
 
 On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
 
 Hi,
 
 I'm trying to work out which compression algorithm I should be using in my 
 MapReduce jobs.  It seems to me that the best solution is a compromise 
 between speed, efficiency and splittability. The only compression algorithm 
 to handle file splits (according to Hadoop: The Definitive Guide 2nd edition 
 p78 etc) is bzip2, at the expense of compression speed.
 
 However, I see from the documentation at 
 http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
 bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, 
 see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - 
 however the bzip2 Codec is still in the API at 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
 
 Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
 Thanks,
 
 Tony
 
 
 


Re: dual power for hadoop in datacenter?

2012-01-09 Thread Robert Evans
Be aware that if half of your cluster goes down, depending on the version and 
configuration of Hadoop, there may be a replication storm, as Hadoop tries to 
bring it all back up to the proper number of replications.  Your cluster may 
still be unusable in this case.

--Bobby Evans

On 1/7/12 2:55 PM, Alexander Lorenz wget.n...@googlemail.com wrote:

The NN, SN and JT must have separate power adapters; for the entire cluster, dual 
adapters are recommended. For HBase and ZooKeeper servers / regionservers, dual 
adapters with separate power lines are also recommended.

- Alex

sent via my mobile device

On Jan 7, 2012, at 11:23 AM, Koert Kuipers ko...@tresata.com wrote:

 what are the thoughts on running a hadoop cluster in a datacenter with
 respect to power? should all the boxes have redundant power supplies and be
 on dual power? or just dual power for the namenode, secondary namenode, and
 hbase master, and then perhaps switch the power source per rack for the
 slaves to provide resilience to a power failure? or even just run
 everything on single power and accept the risk that everything can go down
 at once?



Re: has bzip2 compression been deprecated?

2012-01-09 Thread alo.alt
Tony,

snappy is also available:
http://code.google.com/p/hadoop-snappy/

best,
 Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

 Tony,
 
 * Yeah, SequenceFiles aren't human-readable, but fs -text can read it out 
 (instead of a plain fs -cat). But if you are gonna export your files into a 
 system you do not have much control over, probably best to have the resultant 
 files not be in SequenceFile/Avro-DataFile format.
 * Intermediate (M-to-R) files use a custom IFile format these days, which is 
 built purely for that purpose.
 * Hive can use SequenceFiles very well. There is also documented info on this 
 in the Hive's wiki pages (Check the DDL pages, IIRC).
 
 On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
 Thanks for the quick reply and the clarification about the documentation.
 
 Regarding sequence files: am I right in thinking that they're a good choice 
 for intermediate steps in chained MR jobs, or for file transfer between the 
 Map and the Reduce phases of a job; but they shouldn't be used for 
 human-readable files at the end of one or more MapReduce jobs? How about if 
 the only use a job's output is analysis via Hive - can Hive create tables 
 from sequence files? 
 
 Tony
 
 
 
 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com] 
 Sent: 09 January 2012 15:34
 To: common-user@hadoop.apache.org
 Subject: Re: has bzip2 compression been deprecated?
 
 Bzip2 is pretty slow. You probably do not want to use it, even if it does 
 file splits (a feature not available in the stable line of 0.20.x/1.x, but 
 available in 0.22+).
 
 To answer your question though, bzip2 was removed from that document cause 
 it isn't a native library (its pure Java). I think bzip2 was added earlier 
 due to an oversight, as even 0.20 did not have a native bzip2 library. This 
 change in docs does not mean that BZip2 is deprecated -- it is still fully 
 supported and available in the trunk as well. See 
 https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
 that led to this.
 
 The best way would be to use either:
 
 (a) Hadoop sequence files with any compression codec of choice (best would 
 be lzo, gz, maybe even snappy). This file format is built for HDFS and MR 
 and is splittable. Another choice would be Avro DataFiles from the Apache 
 Avro project.
 (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
 hadoop-lzo-packager for packages). This requires you to run indexing 
 operations before the .lzo can be made splittable, but works great with this 
 extra step added.
 
 On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
 
 Hi,
 
 I'm trying to work out which compression algorithm I should be using in my 
 MapReduce jobs.  It seems to me that the best solution is a compromise 
 between speed, efficiency and splittability. The only compression algorithm 
 to handle file splits (according to Hadoop: The Definitive Guide 2nd 
 edition p78 etc) is bzip2, at the expense of compression speed.
 
 However, I see from the documentation at 
 http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
 bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, 
 see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - 
 however the bzip2 Codec is still in the API at 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
 
 Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
 Thanks,
 
 Tony
 
 
 

RE: Using Java Remote Method Invocation to make a UI for Hadoop

2012-01-09 Thread Tom Wilcox
Thanks Harsh J,

I've had a crack at setting up HUE and I now remember why we didn't go for it.

It appears that HUE wants to work with Hadoop 0.20 and we are using Hadoop 1.0 
and HBase 0.92...

I don't suppose there are any other efforts like HUE out there that might be 
more compatible with our setup? 

Or can anyone see anything wrong with using Grails and RMI to talk to our 
namenode?

Cheers,
Tom

From: Harsh J [ha...@cloudera.com]
Sent: 09 January 2012 14:39
To: Tom Wilcox
Subject: Re: Using Java Remote Method Invocation to make a UI for Hadoop

Hey Tom,

Just wondering, would http://github.com/cloudera/hue have not helped you at all?

On 09-Jan-2012, at 6:59 PM, Tom Wilcox wrote:

 Hi,

 We are trying to make a UI for our HBase + Hadoop applications.

 Basically, what we want is a web front end that can be hosted on a different 
 server and present data from HBase, as well as allow us to launch Map-Reduce 
 jobs from the web browser.

 Our current approach is as follows:

 We have a 6-node development cluster with each node running Scientific Linux 
 in a VM with the following configuration:

 Node - Hostname - Daemons
 
 Node 1 - namenode - namenode, secondarynamenode, regionserver, 
 hbase-master, zookeeper
 Node 2 - jobtracker - jobtracker, zookeeper
 Node 3 - slave0 - zookeeper, datanode, tasktracker
 Node 4 - slave1 - datanode, tasktracker
 Node 5 - slave2 - datanode, tasktracker
 Node 6 - slave3 - datanode, tasktracker

 We have a 7th scientific linux VM running a Grails web application called 
 BillyWeb.

 We have created several Map-Reduce applications that run on the namenode to 
 process HBase data and populate a results table. Currently, these MR apps are 
 run from the command line on namenode.

 We use the HBase REST interface to query data from the results table and 
 present it in an AJAX-enabled web page. That is the reading part of the 
 problem solved :)

 Now we need to have a HTML page with a button, that when clicked will execute 
 an MR app on the namenode. This forms the writing part of the problem.

 Currently, we are attempting to do this using Java's Remote Method Invocation 
 (RMI) feature. I have successfully created a Java client-server application 
 where the server program runs on namenode and the client runs on BillyWeb 
 successfully. The client program calls a remote object method to trigger the 
 MR job on namenode and gets the result.

 We are now in the process of integrating it into the Grails webapp, by adding 
 the source code and calling it using groovy taglibs. The only remaining issue 
 we are wrestling with is the need to modify the security policy of the web 
 app to allow access to the client portions of the source code to the remote 
 server.

 Eventually we hope to have an internal web site that you can visit in any web 
 browser to view custom-made visualisations of the data and execute 
 parameterised MR jobs to process data stored in HBase (amongst other places).

 We would like to know your thoughts on this approach. In particular, this 
 feels like a potentially convoluted approach to building a front-end that may 
 feature several redundant steps (are we reinventing too many

 Is there an alternative approach that you can think of that might be more 
 sensible?

 Can you think of any problems with the approach we are currently taking?

 Thanks,
 Tom



can't run a simple mapred job

2012-01-09 Thread T Vinod Gupta
Hi,
I have an HBase/Hadoop setup on my instance in AWS. I am able to run the
simple wordcount MapReduce example but not a custom one that I wrote. Here
is the error that I get:

[ec2-user@ip-10-68-145-124 bin]$ hadoop jar HBaseTest.jar
com.akanksh.information.hbasetest.HBaseSweeper
12/01/09 11:27:27 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
12/01/09 11:27:27 INFO mapred.JobClient: Cleaning up the staging area
hdfs://ip-10-68-145-124.ec2.internal:9100/media/ephemeral1/hadoop/mapred/staging/ec2-user/.staging/job_201112151554_0006
Exception in thread "main" java.lang.RuntimeException:
java.lang.InstantiationException
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:869)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
at
com.akanksh.information.hbasetest.HBaseSweeper.main(HBaseSweeper.java:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.lang.InstantiationException
at
sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
... 14 more

Does this sound like a standard problem?

Here is my main method - nothing much in it:

    public static void main(String args[]) throws Exception {
        Configuration config = HBaseConfiguration.create();

        Job job = new Job(config, "HBaseSweeper");
        job.setJarByClass(HBaseSweeper.class);
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, SweeperMapper.class,
                ImmutableBytesWritable.class, Delete.class, job);
        job.setOutputFormatClass(FileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);
        boolean b = job.waitForCompletion(true);
        if (!b) {
            throw new IOException("error with job!");
        }
    }
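
The InstantiationException above is thrown while JobClient reflectively constructs the configured
output format: FileOutputFormat is abstract, so it cannot be instantiated. Note also that the
setInputFormatClass(TextInputFormat.class) call replaces the TableInputFormat that
initTableMapperJob has just configured. A hedged sketch of how those lines would likely change,
with NullOutputFormat as one concrete placeholder (TableOutputFormat via initTableReducerJob would
be another choice if the Deletes are meant to be applied to an HBase table):

    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, SweeperMapper.class,
            ImmutableBytesWritable.class, Delete.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);   // concrete; discards map output
    // no setInputFormatClass() call: keep the TableInputFormat set by initTableMapperJob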

Can someone help? I will really appreciate it.

thanks
vinod


Re: has bzip2 compression been deprecated?

2012-01-09 Thread Bejoy Ks
Hi Tony
   Adding on to Harsh's comments: if you want the generated sequence
files to be utilized by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

 Tony,

 snappy is also available:
 http://code.google.com/p/hadoop-snappy/

 best,
  Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

  Tony,
 
  * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
 out (instead of a plain fs -cat). But if you are gonna export your files
 into a system you do not have much control over, probably best to have the
 resultant files not be in SequenceFile/Avro-DataFile format.
  * Intermediate (M-to-R) files use a custom IFile format these days,
 which is built purely for that purpose.
  * Hive can use SequenceFiles very well. There is also documented info on
 this in the Hive's wiki pages (Check the DDL pages, IIRC).
 
  On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
  Thanks for the quick reply and the clarification about the
 documentation.
 
  Regarding sequence files: am I right in thinking that they're a good
 choice for intermediate steps in chained MR jobs, or for file transfer
 between the Map and the Reduce phases of a job; but they shouldn't be used
 for human-readable files at the end of one or more MapReduce jobs? How
 about if the only use a job's output is analysis via Hive - can Hive create
 tables from sequence files?
 
  Tony
 
 
 
  -Original Message-
  From: Harsh J [mailto:ha...@cloudera.com]
  Sent: 09 January 2012 15:34
  To: common-user@hadoop.apache.org
  Subject: Re: has bzip2 compression been deprecated?
 
  Bzip2 is pretty slow. You probably do not want to use it, even if it
 does file splits (a feature not available in the stable line of 0.20.x/1.x,
 but available in 0.22+).
 
  To answer your question though, bzip2 was removed from that document
 cause it isn't a native library (its pure Java). I think bzip2 was added
 earlier due to an oversight, as even 0.20 did not have a native bzip2
 library. This change in docs does not mean that BZip2 is deprecated -- it
 is still fully supported and available in the trunk as well. See
 https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
 changes that led to this.
 
  The best way would be to use either:
 
  (a) Hadoop sequence files with any compression codec of choice (best
 would be lzo, gz, maybe even snappy). This file format is built for HDFS
 and MR and is splittable. Another choice would be Avro DataFiles from the
 Apache Avro project.
  (b) LZO codecs for Hadoop, via 
  https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for 
  packages). This requires you to run indexing
 operations before the .lzo can be made splittable, but works great with
 this extra step added.
 
  On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
 
  Hi,
 
  I'm trying to work out which compression algorithm I should be using
 in my MapReduce jobs.  It seems to me that the best solution is a
 compromise between speed, efficiency and splittability. The only
 compression algorithm to handle file splits (according to Hadoop: The
 Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
 compression speed.
 
  However, I see from the documentation at
 http://hadoop.apache.org/common/docs/current/native_libraries.html that
 the bzip2 library is no longer mentioned, and hasn't been since version
 0.20.0, see
 http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
 however the bzip2 Codec is still in the API at
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
 .
 
  Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
  Thanks,
 
  Tony
 
 
 

Re: increase number of map tasks

2012-01-09 Thread Bejoy Ks
Hi Satish
  What is your value for mapred.max.split.size? Try setting these
values as well
mapred.min.split.size=0 (it is the default value)
mapred.max.split.size=40

Try executing your job once you apply these changes on top of others you
did.
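
For a job driver, the split-size settings can also be applied programmatically, and since the goal
is really one map task per input line, NLineInputFormat is another option worth trying. The sketch
below is hedged: the property names are the 0.20/1.x ones named above, the NLineInputFormat
mentioned is the old org.apache.hadoop.mapred.lib class, and the tasktracker slot maximums still
belong in mapred-site.xml rather than in job code.

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();

    // Bejoy's suggestion applied per job: cap each split at roughly one 40-byte record.
    conf.setLong("mapred.min.split.size", 0);
    conf.setLong("mapred.max.split.size", 40);

    // Alternative: org.apache.hadoop.mapred.lib.NLineInputFormat hands each map task exactly
    // this many input lines, which matches the "one line feed per task" idea without tuning
    // block or split sizes at all.
    conf.setInt("mapred.line.input.format.linespermap", 1);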

Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:16 PM, sset satish.se...@hcl.com wrote:


 Hello,

 In hdfs we have set block size - 40bytes . Input Data set is as below
 terminated with line feed.

 data1   (5*8=40 bytes)
 data2
 ..
 ...
 data10


 But still we see only 2 map tasks spawned, should have been atleast 10 map
 tasks. Each mapper performs complex mathematical computation. Not sure how
 works internally. Line feed does not work. Even with below settings map
 tasks never goes beyound 2, any way to make this spawn 10 tasks. Basically
 it should look like compute grid - computation in parallel.

 <property>
   <name>io.bytes.per.checksum</name>
   <value>30</value>
   <description>The number of bytes per checksum. Must not be larger than
   io.file.buffer.size.</description>
 </property>

 <property>
   <name>dfs.block.size</name>
   <value>30</value>
   <description>The default block size for new files.</description>
 </property>

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>10</value>
   <description>The maximum number of map tasks that will be run
   simultaneously by a task tracker.
   </description>
 </property>

 single node with high configuration - 8 cpus and 8gb memory. Hence taking
 an example of 10 data items with line feeds. We want to utilize full power
 of machine - hence want at least 10 map tasks - each task needs to perform
 highly complex mathematical simulation.  At present it looks like file data
 is the only way to specify number of map tasks via splitsize (in bytes) -
 but I prefer some criteria like line feed or whatever.

 How do we get 10 map tasks from above configuration - pls help.

 thanks

 --
 View this message in context:
 http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33107775.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the 
impression that the STORED AS part of a CREATE TABLE in Hive refers to how the 
data in the table will be stored once the table is created, rather than the 
compression format of the data used to populate the table. Can you clarify 
which is the correct interpretation? If it's the latter, how would I read a 
sequence file into a Hive table?

Thanks,

Tony




-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com] 
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   Adding on to Harsh's comments. If you want the generated sequence
files to be utilized by a hive table. Define your hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

 Tony,

 snappy is also available:
 http://code.google.com/p/hadoop-snappy/

 best,
  Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

  Tony,
 
  * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
 out (instead of a plain fs -cat). But if you are gonna export your files
 into a system you do not have much control over, probably best to have the
 resultant files not be in SequenceFile/Avro-DataFile format.
  * Intermediate (M-to-R) files use a custom IFile format these days,
 which is built purely for that purpose.
  * Hive can use SequenceFiles very well. There is also documented info on
 this in the Hive's wiki pages (Check the DDL pages, IIRC).
 
  On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
  Thanks for the quick reply and the clarification about the
 documentation.
 
  Regarding sequence files: am I right in thinking that they're a good
 choice for intermediate steps in chained MR jobs, or for file transfer
 between the Map and the Reduce phases of a job; but they shouldn't be used
 for human-readable files at the end of one or more MapReduce jobs? How
 about if the only use a job's output is analysis via Hive - can Hive create
 tables from sequence files?
 
  Tony
 
 
 
  -Original Message-
  From: Harsh J [mailto:ha...@cloudera.com]
  Sent: 09 January 2012 15:34
  To: common-user@hadoop.apache.org
  Subject: Re: has bzip2 compression been deprecated?
 
  Bzip2 is pretty slow. You probably do not want to use it, even if it
 does file splits (a feature not available in the stable line of 0.20.x/1.x,
 but available in 0.22+).
 
  To answer your question though, bzip2 was removed from that document
 cause it isn't a native library (its pure Java). I think bzip2 was added
 earlier due to an oversight, as even 0.20 did not have a native bzip2
 library. This change in docs does not mean that BZip2 is deprecated -- it
 is still fully supported and available in the trunk as well. See
 https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
 changes that led to this.
 
  The best way would be to use either:
 
  (a) Hadoop sequence files with any compression codec of choice (best
 would be lzo, gz, maybe even snappy). This file format is built for HDFS
 and MR and is splittable. Another choice would be Avro DataFiles from the
 Apache Avro project.
  (b) LZO codecs for Hadoop, via 
  https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for 
  packages). This requires you to run indexing
 operations before the .lzo can be made splittable, but works great with
 this extra step added.
 
  On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
 
  Hi,
 
  I'm trying to work out which compression algorithm I should be using
 in my MapReduce jobs.  It seems to me that the best solution is a
 compromise between speed, efficiency and splittability. The only
 compression algorithm to handle file splits (according to Hadoop: The
 Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
 compression speed.
 
  However, I see from the documentation at
 http://hadoop.apache.org/common/docs/current/native_libraries.html that
 the bzip2 library is no longer mentioned, and hasn't been since version
 0.20.0, see
 http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
 however the bzip2 Codec is still in the API at
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
 .
 
  Has bzip2 support been removed from Hadoop, or will it be removed soon?
 
  Thanks,
 
  Tony
 
 
 

connection between slaves and master

2012-01-09 Thread Mark question
Hello guys,

 I'm requesting a number of machines from a PBS scheduler to run Hadoop,
and even though all Hadoop daemons start normally on the master and slaves,
the slaves don't have worker tasks in them. Digging into that, there seems
to be some blocking between the nodes (?). I don't know how to describe it
except that on a slave, if I "telnet master-node" it should be able to
connect, but I get this error:

[mark@node67 ~]$ telnet node77

Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: Connection refused
telnet: Unable to connect to remote host: Connection refused

The log at the slave nodes shows the same thing, even though the datanode
and tasktracker were started from the master (?):

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/
127.0.0.1:12123 not available yet, Z...

 Any suggestions of what I can do?

Thanks,
Mark


Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn

Hi,
I've been googling, but haven't been able to find an answer. I'm 
currently trying to setup Hadoop in pseudo-distributed mode as a first 
step. I'm using the Cloudera distro and installed everything through YUM 
on CentOS 5.7. I can run everything just fine from my one node itself 
(hadoop fs -ls /, test map-red jobs, etc...), but can't get a remote 
client to be able to connect to it. I'm pretty sure the cause of that 
has to do with the fact that port 8020 and port 8021 do not seem to be 
listening (when I do a netstat -a, they don't show up-- all the other 
Hadoop related ports like 50030 and 50070 do show up). I verified that 
the firewall allows connections over 8020 and 8021 for tcp, and can 
connect through my web browser to 50030 and 50070.


Looking at the namenode log, I see the following error which looks 
suspicious and related to me:


   2012-01-09 12:03:38,000 INFO org.apache.hadoop.ipc.Server: IPC
   Server listener on 8020: starting
   2012-01-09 12:03:38,009 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 0 on 8020: starting
   2012-01-09 12:03:39,187 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 2 on 8020: starting
   2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 3 on 8020: starting
   2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 4 on 8020: starting
   2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 5 on 8020: starting
   2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 6 on 8020: starting
   2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 1 on 8020: starting
   2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 7 on 8020: starting
   2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 8 on 8020: starting
   2012-01-09 12:03:39,246 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 9 on 8020: starting
   2012-01-09 12:03:39,258 WARN
   org.apache.hadoop.util.PluginDispatcher: Unable to load
   dfs.namenode.plugins plugins
   2012-01-09 12:03:40,254 INFO org.apache.hadoop.ipc.Server: IPC
   Server handler 8 on 8020, call
   addBlock(/var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info,
   DFSClient_-1779116177, null) from 127.0.0.1:39785: error:
   java.io.IOException: File
   /var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info
   could only be replicated to 0 nodes, instead of 1

Anyone have any idea what my problem might be?

Cheers,
Eli


Adding a soft-linked archive file to the distributed cache doesn't work as advertised

2012-01-09 Thread W.P. McNeill
I am trying to add a zip file to the distributed cache and have it unzipped
on the task nodes with a softlink to the unzipped directory placed in the
working directory of my mapper process. I think I'm doing everything the
way the documentation tells me to, but it's not working.

On the client in the run() function while I'm creating the job I first call:

fs.copyFromLocalFile("gate-app.zip", "/tmp/gate-app.zip");

As expected, this copies the archive file gate-app.zip to the HDFS
directory /tmp.

Then I call

DistributedCache.addCacheArchive("/tmp/gate-app.zip#gate-app",
configuration);

I expect this to add /tmp/gate-app.zip to the distributed cache and put a
softlink to it called gate-app in the working directory of each task.
However, when I call job.waitForCompletion(), I see the following error:

Exception in thread "main" java.io.FileNotFoundException: File does not
exist: /tmp/gate-app.zip#gate-app.

It appears that the distributed cache mechanism is interpreting the entire
URI as the literal name of the file, instead of treating the fragment as
the name of the softlink.

As far as I can tell, I'm doing this correctly according to the API
documentation:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
.

The full project in which I'm doing this is up on github:
https://github.com/wpm/Hadoop-GATE.

Can someone tell me what I'm doing wrong?


Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn

More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying 
connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s).


Why would things just not load on port 8020? I feel like all the errors 
I'm seeing are caused by this, but I can't see any errors about why this 
occurred in the first place.



On 1/9/12 1:14 PM, Eli Finkelshteyn wrote:

Hi,
I've been googling, but haven't been able to find an answer. I'm 
currently trying to setup Hadoop in pseudo-distributed mode as a first 
step. I'm using the Cloudera distro and installed everything through 
YUM on CentOS 5.7. I can run everything just fine from my one node 
itself (hadoop fs -ls /, test map-red jobs, etc...), but can't get a 
remote client to be able to connect to it. I'm pretty sure the cause 
of that has to do with the fact that port 8020 and port 8021 do not 
seem to be listening (when I do a netstat -a, they don't show up-- all 
the other Hadoop related ports like 50030 and 50070 do show up). I 
verified that the firewall allows connections over 8020 and 8021 for 
tcp, and can connect through my web browser to 50030 and 50070.


Looking at the namenode log, I see the following error which looks 
suspicious and related to me:


2012-01-09 12:03:38,000 INFO org.apache.hadoop.ipc.Server: IPC
Server listener on 8020: starting
2012-01-09 12:03:38,009 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 0 on 8020: starting
2012-01-09 12:03:39,187 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 2 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 3 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 4 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 5 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 6 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 1 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 7 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 8 on 8020: starting
2012-01-09 12:03:39,246 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 9 on 8020: starting
2012-01-09 12:03:39,258 WARN
org.apache.hadoop.util.PluginDispatcher: Unable to load
dfs.namenode.plugins plugins
2012-01-09 12:03:40,254 INFO org.apache.hadoop.ipc.Server: IPC
Server handler 8 on 8020, call
addBlock(/var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info,
DFSClient_-1779116177, null) from 127.0.0.1:39785: error:
java.io.IOException: File
/var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info
could only be replicated to 0 nodes, instead of 1

Anyone have any idea what my problem might be?

Cheers,
Eli




Re: has bzip2 compression been deprecated?

2012-01-09 Thread Bejoy Ks
Hi Tony
   As I understand your requirement, your MapReduce job produces a
Sequence File as output and you need to use this file as input to a Hive
table.
When you CREATE an EXTERNAL TABLE in Hive you specify a location
where your data is stored and also what the format of that data is (the
field delimiter, row delimiter, file type etc.). You are not actually
loading data anywhere when you create a Hive external table (issue the
DDL); you are just specifying where the data lies in the file system, and
in fact there is not even any validation performed at that time to check
the data quality. When you query/retrieve your data through Hive QL, the
parameters specified along with CREATE TABLE, such as ROW FORMAT, FIELDS
TERMINATED BY and STORED AS, are used to execute the right map reduce
job(s).

 In short, STORED AS refers to the type of files that a table's data
directory holds.
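
For illustration, a rough hedged sketch (not code from this thread; the
paths, key/value types and codec are assumptions) of a driver that writes
block-compressed SequenceFile output into a directory that an external
table declared with STORED AS SEQUENCEFILE can use as its LOCATION. Hive's
SequenceFile reader takes each value as a row and ignores the key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileForHive {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "seqfile-for-hive");
    job.setJarByClass(SeqFileForHive.class);
    // Defaults give identity map/reduce; with TextInputFormat the keys are
    // byte offsets (LongWritable) and the values are the input lines (Text).
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // Block compression keeps the output splittable and compresses well.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // args[1] would be the directory the external table's LOCATION names.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}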

For details
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Hope it helps!..

Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton tbur...@sportingindex.comwrote:

 Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was
 under the impression that the STORED AS part of a CREATE TABLE in Hive
 refers to how the data in the table will be stored once the table is
 created, rather than the compression format of the data used to populate
 the table. Can you clarify which is the correct interpretation? If it's the
 latter, how would I read a sequence file into a Hive table?

 Thanks,

 Tony




 -Original Message-
 From: Bejoy Ks [mailto:bejoy.had...@gmail.com]
 Sent: 09 January 2012 17:33
 To: common-user@hadoop.apache.org
 Subject: Re: has bzip2 compression been deprecated?

 Hi Tony
   Adding on to Harsh's comments. If you want the generated sequence
 files to be utilized by a hive table. Define your hive table as

 CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING)
 ...
 ...
 
 STORED AS SEQUENCEFILE;


 Regards
 Bejoy.K.S

 On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

  Tony,
 
  snappy is also available:
  http://code.google.com/p/hadoop-snappy/
 
  best,
   Alex
 
  --
  Alexander Lorenz
  http://mapredit.blogspot.com
 
  On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
 
   Tony,
  
   * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
  out (instead of a plain fs -cat). But if you are gonna export your
 files
  into a system you do not have much control over, probably best to have
 the
  resultant files not be in SequenceFile/Avro-DataFile format.
   * Intermediate (M-to-R) files use a custom IFile format these days,
  which is built purely for that purpose.
   * Hive can use SequenceFiles very well. There is also documented info
 on
  this in the Hive's wiki pages (Check the DDL pages, IIRC).
  
   On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
  
   Thanks for the quick reply and the clarification about the
  documentation.
  
   Regarding sequence files: am I right in thinking that they're a good
  choice for intermediate steps in chained MR jobs, or for file transfer
  between the Map and the Reduce phases of a job; but they shouldn't be
 used
  for human-readable files at the end of one or more MapReduce jobs? How
  about if the only use a job's output is analysis via Hive - can Hive
 create
  tables from sequence files?
  
   Tony
  
  
  
   -Original Message-
   From: Harsh J [mailto:ha...@cloudera.com]
   Sent: 09 January 2012 15:34
   To: common-user@hadoop.apache.org
   Subject: Re: has bzip2 compression been deprecated?
  
   Bzip2 is pretty slow. You probably do not want to use it, even if it
  does file splits (a feature not available in the stable line of
 0.20.x/1.x,
  but available in 0.22+).
  
   To answer your question though, bzip2 was removed from that document
  cause it isn't a native library (its pure Java). I think bzip2 was added
  earlier due to an oversight, as even 0.20 did not have a native bzip2
  library. This change in docs does not mean that BZip2 is deprecated -- it
  is still fully supported and available in the trunk as well. See
  https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
  changes that led to this.
  
   The best way would be to use either:
  
   (a) Hadoop sequence files with any compression codec of choice (best
  would be lzo, gz, maybe even snappy). This file format is built for HDFS
  and MR and is splittable. Another choice would be Avro DataFiles from the
  Apache Avro project.
   (b) LZO codecs for Hadoop, via
 https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for
 packages). This requires you to run indexing
  operations before the .lzo can be made splittable, but works great with
  this extra step added.
  
   On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
  
   Hi,
  
   I'm trying to work out which compression algorithm I should be using
  in my MapReduce jobs.  It seems to me that the best 

Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised

2012-01-09 Thread Alejandro Abdelnur
Bill,

In addition you must call DistributedCache.createSymlink(configuration),
that should do.

Thxs.

Alejandro

On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote:

 I am trying to add a zip file to the distributed cache and have it unzipped
 on the task nodes with a softlink to the unzipped directory placed in the
 working directory of my mapper process. I think I'm doing everything the
 way the documentation tells me to, but it's not working.

 On the client in the run() function while I'm creating the job I first
 call:

 fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip);

 As expected, this copies the archive file gate-app.zip to the HDFS
 directory /tmp.

 Then I call

 DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app,
 configuration);

 I expect this to add /tmp/gate-app.zip to the distributed cache and put a
 softlink to it called gate-app in the working directory of each task.
 However, when I call job.waitForCompletion(), I see the following error:

 Exception in thread main java.io.FileNotFoundException: File does not
 exist: /tmp/gate-app.zip#gate-app.

 It appears that the distributed cache mechanism is interpreting the entire
 URI as the literal name of the file, instead of treating the fragment as
 the name of the softlink.

 As far as I can tell, I'm doing this correctly according to the API
 documentation:

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
 .

 The full project in which I'm doing this is up on github:
 https://github.com/wpm/Hadoop-GATE.

 Can someone tell me what I'm doing wrong?
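
For reference, a minimal hedged sketch of the client-side sequence Alejandro
describes, assuming the 0.20-era DistributedCache API (the archive name and
paths are only illustrative, not taken from the original code):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheArchiveSetup {
  // Stage the archive in HDFS and register it before the job is submitted.
  public static void addGateApp(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // Copy the local zip into HDFS so the cache can pull it from there.
    fs.copyFromLocalFile(new Path("gate-app.zip"), new Path("/tmp/gate-app.zip"));
    // The URI fragment (#gate-app) names the symlink each task should see in
    // its working directory; createSymlink() asks the framework to create it.
    DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"), conf);
    DistributedCache.createSymlink(conf);
  }
}

Called with job.getConfiguration() before job.waitForCompletion(), the archive
should then be unpacked on each task node with a gate-app symlink appearing in
the task working directory.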



Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn
Positive. Like I said before, netstat -a | grep 8020 gives me nothing. 
Even if the firewall was the problem, that should still give me output 
that the port is listening, but I'd just be unable to hit it from an 
outside box (I tested this by blocking port 50070, at which point it 
still showed up in netstat -a, but was inaccessible through http from a 
remote machine). This problem is something else.


On 1/9/12 2:31 PM, zGreenfelder wrote:

On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefin...@gmail.com  wrote:

More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

Why would things just not load on port 8020? I feel like all the errors I'm
seeing are caused by this, but I can't see any errors about why this
occurred in the first place.


are you sure there isn't a firewall in place blocking port 8020?
e.g. iptables on the local machines?   if you do
telnet localhost 8020
do you make a connection? if you use lsof and/or netstat can you see
the port open?
if you have root access you can try turning off the firewall with
iptables -F to see if things work without firewall rules.




Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised

2012-01-09 Thread W.P. McNeill
I added a DistributedCache.createSymlink(configuration) call right after
the addCacheArchive() call, but see the same error.

On Mon, Jan 9, 2012 at 11:05 AM, Alejandro Abdelnur t...@cloudera.comwrote:

 Bill,

 In addition you must call DistributedCached.createSymlink(configuration),
 that should do.

 Thxs.

 Alejandro

 On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote:

  I am trying to add a zip file to the distributed cache and have it
 unzipped
  on the task nodes with a softlink to the unzipped directory placed in the
  working directory of my mapper process. I think I'm doing everything the
  way the documentation tells me to, but it's not working.
 
  On the client in the run() function while I'm creating the job I first
  call:
 
  fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip);
 
  As expected, this copies the archive file gate-app.zip to the HDFS
  directory /tmp.
 
  Then I call
 
  DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app,
  configuration);
 
  I expect this to add /tmp/gate-app.zip to the distributed cache and
 put a
  softlink to it called gate-app in the working directory of each task.
  However, when I call job.waitForCompletion(), I see the following error:
 
  Exception in thread main java.io.FileNotFoundException: File does not
  exist: /tmp/gate-app.zip#gate-app.
 
  It appears that the distributed cache mechanism is interpreting the
 entire
  URI as the literal name of the file, instead of treating the fragment as
  the name of the softlink.
 
  As far as I can tell, I'm doing this correctly according to the API
  documentation:
 
 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
  .
 
  The full project in which I'm doing this is up on github:
  https://github.com/wpm/Hadoop-GATE.
 
  Can someone tell me what I'm doing wrong?
 



Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Harsh J
Eli,

What is your fs.default.name set to, in core-site.xml?

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn iefin...@gmail.com wrote:
 Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even
 if the firewall was the problem, that should still give me output that the
 port is listening, but I'd just be unable to hit it from an outside box (I
 tested this by blocking port 50070, at which point it still showed up in
 netstat -a, but was inaccessible through http from a remote machine). This
 problem is something else.


 On 1/9/12 2:31 PM, zGreenfelder wrote:

 On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefin...@gmail.com
  wrote:

 More info:

 In the DataNode log, I'm also seeing:

 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

 Why would things just not load on port 8020? I feel like all the errors
 I'm
 seeing are caused by this, but I can't see any errors about why this
 occurred in the first place.

 are you sure there isn't a firewall in place blocking port 8020?
 e.g. iptables on the local machines?   if you do
 telnet localhost 8020
 do you make a connection? if you use lsof and/or netstat can you see
 the port open?
 if you have root access you can try turning off the firewall with
 iptables -F to see if things work without firewall rules.





-- 
Harsh J


Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Idris Ali
Hi,

Looks like a problem in starting DFS and MR; can you run 'jps' and see if NN,
DN, SNN, JT and TT are running?

also make sure for pseudo-distributed mode, the following entries are
present:

1. In core-site.xml
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>SOME TMP dir with Read/Write access, not system temp</value>
  </property>

2. In hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>Local dir with Read/Write access</value>
  </property>

3. In mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn iefin...@gmail.comwrote:

 Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
 Even if the firewall was the problem, that should still give me output that
 the port is listening, but I'd just be unable to hit it from an outside box
 (I tested this by blocking port 50070, at which point it still showed up in
 netstat -a, but was inaccessible through http from a remote machine). This
 problem is something else.


 On 1/9/12 2:31 PM, zGreenfelder wrote:

 On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:

 More info:

 In the DataNode log, I'm also seeing:

 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

 Why would things just not load on port 8020? I feel like all the errors
 I'm
 seeing are caused by this, but I can't see any errors about why this
 occurred in the first place.

  are you sure there isn't a firewall in place blocking port 8020?
 e.g. iptables on the local machines?   if you do
 telnet localhost 8020
 do you make a connection? if you use lsof and/or netstat can you see
 the port open?
 if you have root access you can try turning off the firewall with
 iptables -F to see if things work without firewall rules.





RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tim Broberg
Out of curiosity, when Hive records are compressed, how large is a typical 
compressed record?

Do you have issues where the block size is too small to be compressed 
efficiently?

More generally, I wonder what the smallest desirable compressed record size is 
in the hadoop universe.

- Tim.


From: Tony Burton [tbur...@sportingindex.com]
Sent: Monday, January 09, 2012 10:02 AM
To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the 
impression that the STORED AS part of a CREATE TABLE in Hive refers to how the 
data in the table will be stored once the table is created, rather than the 
compression format of the data used to populate the table. Can you clarify 
which is the correct interpretation? If it's the latter, how would I read a 
sequence file into a Hive table?

Thanks,

Tony




-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com]
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   Adding on to Harsh's comments. If you want the generated sequence
files to be utilized by a hive table. Define your hive table as

CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

 Tony,

 snappy is also available:
 http://code.google.com/p/hadoop-snappy/

 best,
  Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

  Tony,
 
  * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
 out (instead of a plain fs -cat). But if you are gonna export your files
 into a system you do not have much control over, probably best to have the
 resultant files not be in SequenceFile/Avro-DataFile format.
  * Intermediate (M-to-R) files use a custom IFile format these days,
 which is built purely for that purpose.
  * Hive can use SequenceFiles very well. There is also documented info on
 this in the Hive's wiki pages (Check the DDL pages, IIRC).
 
  On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
  Thanks for the quick reply and the clarification about the
 documentation.
 
  Regarding sequence files: am I right in thinking that they're a good
 choice for intermediate steps in chained MR jobs, or for file transfer
 between the Map and the Reduce phases of a job; but they shouldn't be used
 for human-readable files at the end of one or more MapReduce jobs? How
 about if the only use a job's output is analysis via Hive - can Hive create
 tables from sequence files?
 
  Tony
 
 
 
  -Original Message-
  From: Harsh J [mailto:ha...@cloudera.com]
  Sent: 09 January 2012 15:34
  To: common-user@hadoop.apache.org
  Subject: Re: has bzip2 compression been deprecated?
 
  Bzip2 is pretty slow. You probably do not want to use it, even if it
 does file splits (a feature not available in the stable line of 0.20.x/1.x,
 but available in 0.22+).
 
  To answer your question though, bzip2 was removed from that document
 cause it isn't a native library (its pure Java). I think bzip2 was added
 earlier due to an oversight, as even 0.20 did not have a native bzip2
 library. This change in docs does not mean that BZip2 is deprecated -- it
 is still fully supported and available in the trunk as well. See
 https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
 changes that led to this.
 
  The best way would be to use either:
 
  (a) Hadoop sequence files with any compression codec of choice (best
 would be lzo, gz, maybe even snappy). This file format is built for HDFS
 and MR and is splittable. Another choice would be Avro DataFiles from the
 Apache Avro project.
  (b) LZO codecs for Hadoop, via 
  https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for 
  packages). This requires you to run indexing
 operations before the .lzo can be made splittable, but works great with
 this extra step added.
 
  On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
 
  Hi,
 
  I'm trying to work out which compression algorithm I should be using
 in my MapReduce jobs.  It seems to me that the best solution is a
 compromise between speed, efficiency and splittability. The only
 compression algorithm to handle file splits (according to Hadoop: The
 Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
 compression speed.
 
  However, I see from the documentation at
 http://hadoop.apache.org/common/docs/current/native_libraries.html that
 the bzip2 library is no longer mentioned, and hasn't been since version
 0.20.0, see
 http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
 however the bzip2 Codec is still in the API at
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
 .
 
  Has bzip2 support been removed from 

Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn
Thanks for the help, Idris. I checked all the confs you mentioned, and 
all is as it should be. jps gives me:


24226 Jps
24073 TaskTracker
23854 JobTracker
23780 DataNode
23921 NameNode
23995 SecondaryNameNode

So that looks good. A majority of this stuff is default as set by 
Cloudera. Any other ideas?


Eli

On 1/9/12 3:22 PM, Idris Ali wrote:

Hi,

Looks like problem in starting DFS and MR, can you run 'jps' and see if NN,
DN, SNN, JT and TT are running,

also make sure for pseudo-distributed mode, the following entries are
present:

1. In core-site.xml
  property
 namefs.default.name/name
 valuehdfs://localhost:8020/value
   /property

   property
  namehadoop.tmp.dir/name
  valueSOME TMP dir with Read/Write acces not system temp/value
   /property
   property

2.  In hdfs-site.xml
property
 namedfs.replication/name
 value1/value
   /property
   property
  namedfs.permissions/name
  valuefalse/value
   /property
   property
  !-- specify this so that running 'hadoop namenode -format' formats
the right dir --
  namedfs.name.dir/name
  valueLocal dir with Read/Write access/value
   /property

3. In mapred-stie.xml
   property
 namemapred.job.tracker/name
 valuelocalhost:8021/value
   /property

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote:


Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
Even if the firewall was the problem, that should still give me output that
the port is listening, but I'd just be unable to hit it from an outside box
(I tested this by blocking port 50070, at which point it still showed up in
netstat -a, but was inaccessible through http from a remote machine). This
problem is something else.


On 1/9/12 2:31 PM, zGreenfelder wrote:


On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:


More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
connect
to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

Why would things just not load on port 8020? I feel like all the errors
I'm
seeing are caused by this, but I can't see any errors about why this
occurred in the first place.

  are you sure there isn't a firewall in place blocking port 8020?

e.g. iptables on the local machines?   if you do
telnet localhost 8020
do you make a connection? if you use lsof and/or netstat can you see
the port open?
if you have root access you can try turning off the firewall with
iptables -F to see if things work without firewall rules.







RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tim Broberg
I thought it was optional whether hive stored blocks (up to 1MB?) or records. 
If records, it's not storing individual records?

Am I misunderstanding?

Maybe I should get off my lazy butt and just check the source code...  ;^)

- Tim.


From: bejoy.had...@gmail.com [bejoy.had...@gmail.com]
Sent: Monday, January 09, 2012 1:22 PM
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tim
   When you say that in Hive a table's data is compressed using LZO or the
like, it means the files/blocks that contain the records are compressed using
LZO. The size would be the same as the size of the files/blocks in HDFS; it is
not as if records are stored as individual blocks in Hive. Hive is just a query
parser that parses SQL-like queries into MR jobs and runs them on data that
lies in HDFS.
When you have larger chained jobs generated from multiple QLs you may end up
with a larger number of small files. There you may enable merging in Hive to
get sufficiently large files by combining the smaller files as the final
output of your queries. This is better both for subsequent MR jobs that
operate on the output and for storage.

Hope it helps!..

Regards
Bejoy K S

-Original Message-
From: Tim Broberg tim.brob...@exar.com
Date: Mon, 9 Jan 2012 12:27:47
To: common-user@hadoop.apache.orgcommon-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Out of curiousity, when hive records are compressed, how large is a typical 
compressed record?

Do you have issues where the block size is too small to be compressed 
efficiently?

More generally, I wonder what the smallest desirable compressed record size is 
in the hadoop universe.

- Tim.


From: Tony Burton [tbur...@sportingindex.com]
Sent: Monday, January 09, 2012 10:02 AM
To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the 
impression that the STORED AS part of a CREATE TABLE in Hive refers to how the 
data in the table will be stored once the table is created, rather than the 
compression format of the data used to populate the table. Can you clarify 
which is the correct interpretation? If it's the latter, how would I read a 
sequence file into a Hive table?

Thanks,

Tony




-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com]
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   Adding on to Harsh's comments. If you want the generated sequence
files to be utilized by a hive table. Define your hive table as

CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

 Tony,

 snappy is also available:
 http://code.google.com/p/hadoop-snappy/

 best,
  Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

  Tony,
 
  * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
 out (instead of a plain fs -cat). But if you are gonna export your files
 into a system you do not have much control over, probably best to have the
 resultant files not be in SequenceFile/Avro-DataFile format.
  * Intermediate (M-to-R) files use a custom IFile format these days,
 which is built purely for that purpose.
  * Hive can use SequenceFiles very well. There is also documented info on
 this in the Hive's wiki pages (Check the DDL pages, IIRC).
 
  On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
  Thanks for the quick reply and the clarification about the
 documentation.
 
  Regarding sequence files: am I right in thinking that they're a good
 choice for intermediate steps in chained MR jobs, or for file transfer
 between the Map and the Reduce phases of a job; but they shouldn't be used
 for human-readable files at the end of one or more MapReduce jobs? How
 about if the only use a job's output is analysis via Hive - can Hive create
 tables from sequence files?
 
  Tony
 
 
 
  -Original Message-
  From: Harsh J [mailto:ha...@cloudera.com]
  Sent: 09 January 2012 15:34
  To: common-user@hadoop.apache.org
  Subject: Re: has bzip2 compression been deprecated?
 
  Bzip2 is pretty slow. You probably do not want to use it, even if it
 does file splits (a feature not available in the stable line of 0.20.x/1.x,
 but available in 0.22+).
 
  To answer your question though, bzip2 was removed from that document
 cause it isn't a native library (its pure Java). I think bzip2 was added
 earlier due to an oversight, as even 0.20 did not have a native bzip2
 library. This change in docs does not mean that BZip2 is deprecated -- it
 is still fully supported and available in the trunk as well. 

Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn

A bit more info:

When I start up only the namenode by itself, I'm not seeing any errors, 
but what I am seeing that's really odd is:


   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020

That's despite the fact that doing netstat -a | grep 8020 still returns 
nothing.  To me, that makes absolutely no sense. I feel like I should be 
getting an error telling me Namenode did not in fact go up on 8020, but 
I'm not getting that at all.


Eli

On 1/9/12 3:22 PM, Idris Ali wrote:

Hi,

Looks like problem in starting DFS and MR, can you run 'jps' and see if NN,
DN, SNN, JT and TT are running,

also make sure for pseudo-distributed mode, the following entries are
present:

1. In core-site.xml
  property
 namefs.default.name/name
 valuehdfs://localhost:8020/value
   /property

   property
  namehadoop.tmp.dir/name
  valueSOME TMP dir with Read/Write acces not system temp/value
   /property
   property

2.  In hdfs-site.xml
property
 namedfs.replication/name
 value1/value
   /property
   property
  namedfs.permissions/name
  valuefalse/value
   /property
   property
  !-- specify this so that running 'hadoop namenode -format' formats
the right dir --
  namedfs.name.dir/name
  valueLocal dir with Read/Write access/value
   /property

3. In mapred-stie.xml
   property
 namemapred.job.tracker/name
 valuelocalhost:8021/value
   /property

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote:


Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
Even if the firewall was the problem, that should still give me output that
the port is listening, but I'd just be unable to hit it from an outside box
(I tested this by blocking port 50070, at which point it still showed up in
netstat -a, but was inaccessible through http from a remote machine). This
problem is something else.


On 1/9/12 2:31 PM, zGreenfelder wrote:


On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:


More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
connect
to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

Why would things just not load on port 8020? I feel like all the errors
I'm
seeing are caused by this, but I can't see any errors about why this
occurred in the first place.

  are you sure there isn't a firewall in place blocking port 8020?

e.g. iptables on the local machines?   if you do
telnet localhost 8020
do you make a connection? if you use lsof and/or netstat can you see
the port open?
if you have root access you can try turning off the firewall with
iptables -F to see if things work without firewall rules.







Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread alo.alt
What happens when you try a telnet localhost 8020?
netstat -anl would also be useful.

best,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:

 A bit more info:
 
 When I start up only the namenode by itself, I'm not seeing any errors, but 
 what I am seeing that's really odd is:
 
   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020
 
 That's despite the fact that doing netstat -a | grep 8020 still returns 
 nothing.  To me, that makes absolutely no sense. I feel like I should be 
 getting an error telling me Namenode did not in fact go up on 8020, but I'm 
 not getting that at all.
 
 Eli
 
 On 1/9/12 3:22 PM, Idris Ali wrote:
 Hi,
 
 Looks like problem in starting DFS and MR, can you run 'jps' and see if NN,
 DN, SNN, JT and TT are running,
 
 also make sure for pseudo-distributed mode, the following entries are
 present:
 
 1. In core-site.xml
  property
 namefs.default.name/name
 valuehdfs://localhost:8020/value
   /property
 
   property
  namehadoop.tmp.dir/name
  valueSOME TMP dir with Read/Write acces not system temp/value
   /property
   property
 
 2.  In hdfs-site.xml
 property
 namedfs.replication/name
 value1/value
   /property
   property
  namedfs.permissions/name
  valuefalse/value
   /property
   property
  !-- specify this so that running 'hadoop namenode -format' formats
 the right dir --
  namedfs.name.dir/name
  valueLocal dir with Read/Write access/value
   /property
 
 3. In mapred-stie.xml
   property
 namemapred.job.tracker/name
 valuelocalhost:8021/value
   /property
 
 Thanks,
 -Idris
 
 On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote:
 
 Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
 Even if the firewall was the problem, that should still give me output that
 the port is listening, but I'd just be unable to hit it from an outside box
 (I tested this by blocking port 50070, at which point it still showed up in
 netstat -a, but was inaccessible through http from a remote machine). This
 problem is something else.
 
 
 On 1/9/12 2:31 PM, zGreenfelder wrote:
 
 On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:
 
 More info:
 
 In the DataNode log, I'm also seeing:
 
 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:8020. Already tried 9 time(s).
 
 Why would things just not load on port 8020? I feel like all the errors
 I'm
 seeing are caused by this, but I can't see any errors about why this
 occurred in the first place.
 
  are you sure there isn't a firewall in place blocking port 8020?
 e.g. iptables on the local machines?   if you do
 telnet localhost 8020
 do you make a connection? if you use lsof and/or netstat can you see
 the port open?
 if you have root access you can try turning off the firewall with
 iptables -F to see if things work without firewall rules.
 
 
 



Re: datanode failing to start

2012-01-09 Thread Dave Kelsey

Gave up and installed version 1.
It installed correctly and worked, though the instructions for setup 
and the location of scripts and configs are now out of date.


D

On 1/5/2012 10:25 AM, Dave Kelsey wrote:


java version 1.6.0_29
hadoop: 0.20.203.0

I'm attempting to setup the pseudo-distributed config on a mac 10.6.8.
I followed the steps from the QuickStart 
(http://wiki.apache.org./hadoop/QuickStart) and succeeded with Stage 
1: Standalone Operation.

I followed the steps for Stage 2: Pseudo-distributed Configuration.
I set the JAVA_HOME variable in conf/hadoop-env.sh and I changed 
tools.jar to the location of classes.jar (a mac version of tools.jar)

I've modified the three .xml files as described in the QuickStart.
ssh'ing to localhost has been configured and works with passwordless 
authentication.
I formatted the namenode with bin/hadoop namenode -format as the 
instructions say


This is what I see when I run bin/start-all.sh

root# bin/start-all.sh
starting namenode, logging to 
/Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-namenode-Hoot-2.local.out
localhost: starting datanode, logging to 
/Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-datanode-Hoot-2.local.out
localhost: Exception in thread main java.lang.NoClassDefFoundError: 
server

localhost: Caused by: java.lang.ClassNotFoundException: server
localhost: at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
localhost: at java.security.AccessController.doPrivileged(Native 
Method)
localhost: at 
java.net.URLClassLoader.findClass(URLClassLoader.java:190)

localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
localhost: at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
localhost: starting secondarynamenode, logging to 
/Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-secondarynamenode-Hoot-2.local.out
starting jobtracker, logging to 
/Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-jobtracker-Hoot-2.local.out
localhost: starting tasktracker, logging to 
/Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-tasktracker-Hoot-2.local.out


There are 4 processes running:
ps -fax | grep hadoop | grep -v grep | wc -l
  4

They are:
SecondaryNameNode
TaskTracker
NameNode
JobTracker


I've searched to see if anyone else has encountered this and not found 
anything


d

p.s. I've also posted this to core-u...@hadoop.apache.org which I've 
yet to find how to subscribe to.




Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn

Good call! netstat -anl gives me:
tcp0  0 :::127.0.0.1:8020   
:::*LISTEN


Now it just looks like nothing is running on 8021. And now I'm really 
confused about why I get no communication over 8020 from the datanode.


Just to reiterate, this definitely is not the firewall, running iptables 
-nvL gives:


...
0 0 ACCEPT tcp  --  *  *   0.0.0.0/0
0.0.0.0/0   state NEW tcp dpt:50070
164 ACCEPT tcp  --  *  *   0.0.0.0/0
0.0.0.0/0   state NEW tcp dpt:50030
0 0 ACCEPT tcp  --  *  *   0.0.0.0/0
0.0.0.0/0   state NEW tcp dpt:8021
164 ACCEPT tcp  --  *  *   0.0.0.0/0
0.0.0.0/0   state NEW tcp dpt:8020

...

On 1/9/12 5:08 PM, alo.alt wrote:

What happen when you try a telnet localhost 8020?
netstat -anl would also useful.

best,
  Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:


A bit more info:

When I start up only the namenode by itself, I'm not seeing any errors, but 
what I am seeing that's really odd is:

   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020

That's despite the fact that doing netstat -a | grep 8020 still returns 
nothing.  To me, that makes absolutely no sense. I feel like I should be 
getting an error telling me Namenode did not in fact go up on 8020, but I'm not 
getting that at all.

Eli

On 1/9/12 3:22 PM, Idris Ali wrote:

Hi,

Looks like problem in starting DFS and MR, can you run 'jps' and see if NN,
DN, SNN, JT and TT are running,

also make sure for pseudo-distributed mode, the following entries are
present:

1. In core-site.xml
  property
 namefs.default.name/name
 valuehdfs://localhost:8020/value
   /property

   property
  namehadoop.tmp.dir/name
  valueSOME TMP dir with Read/Write acces not system temp/value
   /property
   property

2.  In hdfs-site.xml
property
 namedfs.replication/name
 value1/value
   /property
   property
  namedfs.permissions/name
  valuefalse/value
   /property
   property
  !-- specify this so that running 'hadoop namenode -format' formats
the right dir --
  namedfs.name.dir/name
  valueLocal dir with Read/Write access/value
   /property

3. In mapred-stie.xml
   property
 namemapred.job.tracker/name
 valuelocalhost:8021/value
   /property

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote:


Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
Even if the firewall was the problem, that should still give me output that
the port is listening, but I'd just be unable to hit it from an outside box
(I tested this by blocking port 50070, at which point it still showed up in
netstat -a, but was inaccessible through http from a remote machine). This
problem is something else.


On 1/9/12 2:31 PM, zGreenfelder wrote:


On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:


More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
connect
to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

Why would things just not load on port 8020? I feel like all the errors
I'm
seeing are caused by this, but I can't see any errors about why this
occurred in the first place.

  are you sure there isn't a firewall in place blocking port 8020?

e.g. iptables on the local machines?   if you do
telnet localhost 8020
do you make a connection? if you use lsof and/or netstat can you see
the port open?
if you have root access you can try turning off the firewall with
iptables -F to see if things work without firewall rules.





RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tim Broberg
Based on this, it seems like the best approach is just to pick block 
compression rather than record compression, presumably for this very reason.

https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation

Perhaps record compression is the default to prioritize speed...

- Tim.
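
As a hedged illustration of the BLOCK/RECORD distinction (the path, codec and
key/value types below are arbitrary): RECORD compresses each value on its own,
however small it is, while BLOCK buffers many key/value pairs and compresses
them together, which usually gives the codec enough data to be effective.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // BLOCK buffers records and compresses them together; RECORD would
    // compress each value by itself, however small it happens to be.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/example.seq"),
        Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    try {
      for (int i = 0; i < 1000; i++) {
        writer.append(new Text("key-" + i), new IntWritable(i));
      }
    } finally {
      writer.close();
    }
  }
}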

From: Tim Broberg [tim.brob...@exar.com]
Sent: Monday, January 09, 2012 1:42 PM
To: common-user@hadoop.apache.org; bejoy.had...@gmail.com
Subject: RE: has bzip2 compression been deprecated?

I thought it was optional whether hive stored blocks (up to 1MB?) or records. 
If records, it's not storing individual records?

Am I misunderstanding?

Maybe I should get off my lazy butt and just check the source code...  ;^)

- Tim.


From: bejoy.had...@gmail.com [bejoy.had...@gmail.com]
Sent: Monday, January 09, 2012 1:22 PM
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tim
   When you say in hive a table data is  compressed  by using LZO or so. It 
means the file/blocks that contains the records/data are compressed using LZO. 
The size would be same as the size of file/blocks in hdfs. It is not like 
records are stored as individual blocks in hive. Hive is just a query parser 
that parse SQL like queries into MR jobs and run the same on data that lies in 
HDFS.
When you a have larger chained jobs generated with multiple QLs you may end up 
in more number of small files. There you may go in for enabling merge in hive 
to get sufficiently larger files by merging thE smaller files as the final 
output from your queries. This would be better for subsequent MR jobs that 
operate on the output as well as optimal storage.

Hope it helps!..

Regards
Bejoy K S

-Original Message-
From: Tim Broberg tim.brob...@exar.com
Date: Mon, 9 Jan 2012 12:27:47
To: common-user@hadoop.apache.orgcommon-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Out of curiousity, when hive records are compressed, how large is a typical 
compressed record?

Do you have issues where the block size is too small to be compressed 
efficiently?

More generally, I wonder what the smallest desirable compressed record size is 
in the hadoop universe.

- Tim.


From: Tony Burton [tbur...@sportingindex.com]
Sent: Monday, January 09, 2012 10:02 AM
To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the 
impression that the STORED AS part of a CREATE TABLE in Hive refers to how the 
data in the table will be stored once the table is created, rather than the 
compression format of the data used to populate the table. Can you clarify 
which is the correct interpretation? If it's the latter, how would I read a 
sequence file into a Hive table?

Thanks,

Tony




-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com]
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   Adding on to Harsh's comments. If you want the generated sequence
files to be utilized by a hive table. Define your hive table as

CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote:

 Tony,

 snappy is also available:
 http://code.google.com/p/hadoop-snappy/

 best,
  Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

  Tony,
 
  * Yeah, SequenceFiles aren't human-readable, but fs -text can read it
 out (instead of a plain fs -cat). But if you are gonna export your files
 into a system you do not have much control over, probably best to have the
 resultant files not be in SequenceFile/Avro-DataFile format.
  * Intermediate (M-to-R) files use a custom IFile format these days,
 which is built purely for that purpose.
  * Hive can use SequenceFiles very well. There is also documented info on
 this in the Hive's wiki pages (Check the DDL pages, IIRC).
 
  On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
 
  Thanks for the quick reply and the clarification about the
 documentation.
 
  Regarding sequence files: am I right in thinking that they're a good
 choice for intermediate steps in chained MR jobs, or for file transfer
 between the Map and the Reduce phases of a job; but they shouldn't be used
 for human-readable files at the end of one or more MapReduce jobs? How
 about if the only use a job's output is analysis via Hive - can Hive create
 tables from sequence files?
 
  Tony
 
 
 
  -Original Message-
  From: Harsh J [mailto:ha...@cloudera.com]
  Sent: 09 January 2012 15:34
  To: common-user@hadoop.apache.org
  Subject: Re: has bzip2 compression been deprecated?
 
  Bzip2 is 

Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread alo.alt
Firewall online?
and be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost. Nothing 
like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug)

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote:

 Good call! netstat -anl gives me:
 tcp0  0 :::127.0.0.1:8020   :::*
 LISTEN
 
 Now it just looks like nothing is running on 8021. And now I'm really 
 confused about why I get no communication over 8020 from the datanode.
 
 Just to reiterate, this definitely is not the firewall, running iptables -nvL 
 gives:
 
 ...
0 0 ACCEPT tcp  --  *  *   0.0.0.0/00.0.0.0/0  
  state NEW tcp dpt:50070
164 ACCEPT tcp  --  *  *   0.0.0.0/00.0.0.0/0  
  state NEW tcp dpt:50030
0 0 ACCEPT tcp  --  *  *   0.0.0.0/00.0.0.0/0  
  state NEW tcp dpt:8021
164 ACCEPT tcp  --  *  *   0.0.0.0/00.0.0.0/0  
  state NEW tcp dpt:8020
 ...
 
 On 1/9/12 5:08 PM, alo.alt wrote:
 What happen when you try a telnet localhost 8020?
 netstat -anl would also useful.
 
 best,
  Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:
 
 A bit more info:
 
 When I start up only the namenode by itself, I'm not seeing any errors, but 
 what I am seeing that's really odd is:
 
   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020
 
 That's despite the fact that doing netstat -a | grep 8020 still returns 
 nothing.  To me, that makes absolutely no sense. I feel like I should be 
 getting an error telling me Namenode did not in fact go up on 8020, but I'm 
 not getting that at all.
 
 Eli
 
 On 1/9/12 3:22 PM, Idris Ali wrote:
 Hi,
 
 Looks like problem in starting DFS and MR, can you run 'jps' and see if NN,
 DN, SNN, JT and TT are running,
 
 also make sure for pseudo-distributed mode, the following entries are
 present:
 
 1. In core-site.xml
  property
 namefs.default.name/name
 valuehdfs://localhost:8020/value
   /property
 
   property
  namehadoop.tmp.dir/name
  valueSOME TMP dir with Read/Write acces not system temp/value
   /property
   property
 
 2.  In hdfs-site.xml
 property
 namedfs.replication/name
 value1/value
   /property
   property
  namedfs.permissions/name
  valuefalse/value
   /property
   property
  !-- specify this so that running 'hadoop namenode -format' formats
 the right dir --
  namedfs.name.dir/name
  valueLocal dir with Read/Write access/value
   /property
 
 3. In mapred-stie.xml
   property
 namemapred.job.tracker/name
 valuelocalhost:8021/value
   /property
 
 Thanks,
 -Idris
 
 On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote:
 
 Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
 Even if the firewall was the problem, that should still give me output 
 that
 the port is listening, but I'd just be unable to hit it from an outside 
 box
 (I tested this by blocking port 50070, at which point it still showed up 
 in
 netstat -a, but was inaccessible through http from a remote machine). This
 problem is something else.
 
 
 On 1/9/12 2:31 PM, zGreenfelder wrote:
 
 On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com
  wrote:
 
 More info:
 
 In the DataNode log, I'm also seeing:
 
 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:8020. Already tried 9 time(s).
 
 Why would things just not load on port 8020? I feel like all the errors
 I'm
 seeing are caused by this, but I can't see any errors about why this
 occurred in the first place.
 
  are you sure there isn't a firewall in place blocking port 8020?
 e.g. iptables on the local machines?   if you do
 telnet localhost 8020
 do you make a connection? if you use lsof and/or netstat can you see
 the port open?
 if you have root access you can try turning off the firewall with
 iptables -F to see if things work without firewall rules.
 
 



Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn
OK, not sure what I did (restarting the firewall, perhaps?), but I now 
have ports 8020 and 8021 listening and no more errors in my logs. Wooo! 
Only problem is I still can't get any hadoop stuff to work from a remote 
client:


hadoop fs -ls /
2012-01-09 17:53:53.559 java[13396:1903] Unable to load realm info from 
SCDynamicStore
12/01/09 17:53:55 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 0 time(s).
12/01/09 17:53:56 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 1 time(s).
12/01/09 17:53:57 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 2 time(s).

...

I feel like I'm almost there. Might this have to do with the fact that 
core-site.xml and mapred-site.xml specify localhost for ports 8020 and 
8021 (thus not listening to any attempted outside connections?)
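If that is the cause, one likely fix (a sketch; the hostname below is a placeholder for the NameNode's externally resolvable name) is to point fs.default.name and mapred.job.tracker at that hostname instead of localhost and restart the daemons:

  <!-- core-site.xml -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.job.tracker</name>
    <value>namenode.example.com:8021</value>
  </property>

With localhost there, the RPC servers bind only to 127.0.0.1, so remote clients can never reach them even with the firewall wide open.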


Thanks for all the help so far, everyone!

Eli

On 1/9/12 5:43 PM, alo.alt wrote:

Firewall online?
and be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost. Nothing 
like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug)
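In other words, a minimal /etc/hosts along those lines might look like this (the IP and hostname are placeholders):

  127.0.0.1       localhost localhost.localdomain
  192.168.1.10    namenode.example.com namenode

i.e. the machine's real hostname maps to its LAN IP rather than to 127.0.0.1, so the NameNode does not end up bound to the loopback interface only.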

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote:


Good call! netstat -anl gives me:
tcp        0      0 :::127.0.0.1:8020           :::*                    LISTEN

Now it just looks like nothing is running on 8021. And now I'm really confused 
about why I get no communication over 8020 from the datanode.

Just to reiterate, this definitely is not the firewall, running iptables -nvL 
gives:

...
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50070
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50030
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8021
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8020
...

On 1/9/12 5:08 PM, alo.alt wrote:

What happens when you try telnet localhost 8020?
netstat -anl would also be useful.

best,
  Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:


A bit more info:

When I start up only the namenode by itself, I'm not seeing any errors, but 
what I am seeing that's really odd is:

   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020

That's despite the fact that doing netstat -a | grep 8020 still returns 
nothing.  To me, that makes absolutely no sense. I feel like I should be 
getting an error telling me Namenode did not in fact go up on 8020, but I'm not 
getting that at all.

Eli

On 1/9/12 3:22 PM, Idris Ali wrote:

Hi,

Looks like a problem in starting DFS and MR. Can you run 'jps' and see if NN,
DN, SNN, JT and TT are running?

Also make sure that for pseudo-distributed mode the following entries are
present:

1. In core-site.xml
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>SOME TMP dir with read/write access, not the system temp</value>
  </property>

2. In hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>Local dir with read/write access</value>
  </property>

3. In mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn <iefin...@gmail.com> wrote:


Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
Even if the firewall was the problem, that should still give me output that
the port is listening, but I'd just be unable to hit it from an outside box
(I tested this by blocking port 50070, at which point it still showed up in
netstat -a, but was inaccessible through http from a remote machine). This
problem is something else.


On 1/9/12 2:31 PM, zGreenfelder wrote:


On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn <iefin...@gmail.com> wrote:


More info:

In the DataNode log, I'm also seeing:

2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
connect
to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

Why would things just 

RE: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Vivek Shrivastava
One option for troubleshooting is to open your URL in lynx on the command
line; that way it will not be affected by any firewall and will be purely
local access...
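For example, against the NameNode web UI port that already appears in the iptables rules earlier in this thread (the URL is an assumption about which page you want to check):

  lynx http://localhost:50070/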

Regards,

Vivek

-Original Message-
From: Eli Finkelshteyn [mailto:iefin...@gmail.com] 
Sent: Monday, January 09, 2012 2:39 PM
To: common-user@hadoop.apache.org
Subject: Re: Netstat Shows Port 8020 Doesn't Seem to Listen

Good call! netstat -anl gives me:
tcp        0      0 :::127.0.0.1:8020           :::*                    LISTEN

Now it just looks like nothing is running on 8021. And now I'm really 
confused about why I get no communication over 8020 from the datanode.

Just to reiterate, this definitely is not the firewall, running iptables 
-nvL gives:

...
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50070
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50030
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8021
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8020
...

On 1/9/12 5:08 PM, alo.alt wrote:
 What happens when you try telnet localhost 8020?
 netstat -anl would also be useful.

 best,
   Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:

 A bit more info:

 When I start up only the namenode by itself, I'm not seeing any errors, but 
 what I am seeing that's really odd is:

2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
Socket Reader #1 for port 8020
2012-01-09 16:48:45,531 INFO
org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
with hostName=NameNode, port=8020
2012-01-09 16:48:45,532 INFO
org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
Metrics with hostName=NameNode, port=8020
2012-01-09 16:48:45,541 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
localhost.localdomain/127.0.0.1:8020

 That's despite the fact that doing netstat -a | grep 8020 still returns 
 nothing.  To me, that makes absolutely no sense. I feel like I should be 
 getting an error telling me Namenode did not in fact go up on 8020, but I'm 
 not getting that at all.

 Eli

 On 1/9/12 3:22 PM, Idris Ali wrote:
 Hi,

 Looks like a problem in starting DFS and MR. Can you run 'jps' and see if NN,
 DN, SNN, JT and TT are running?

 Also make sure that for pseudo-distributed mode the following entries are
 present:

 1. In core-site.xml
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:8020</value>
   </property>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>SOME TMP dir with read/write access, not the system temp</value>
   </property>

 2. In hdfs-site.xml
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
   <property>
     <name>dfs.permissions</name>
     <value>false</value>
   </property>
   <property>
     <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
     <name>dfs.name.dir</name>
     <value>Local dir with read/write access</value>
   </property>

 3. In mapred-site.xml
   <property>
     <name>mapred.job.tracker</name>
     <value>localhost:8021</value>
   </property>

 Thanks,
 -Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn <iefin...@gmail.com> wrote:

 Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
 Even if the firewall was the problem, that should still give me output that
 the port is listening, but I'd just be unable to hit it from an outside box
 (I tested this by blocking port 50070, at which point it still showed up in
 netstat -a, but was inaccessible through http from a remote machine). This
 problem is something else.


 On 1/9/12 2:31 PM, zGreenfelder wrote:

On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn <iefin...@gmail.com> wrote:

 More info:

 In the DataNode log, I'm also seeing:

 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:8020. Already tried 9 time(s).

 Why would things just not load on port 8020? I feel like all the errors
 I'm
 seeing are caused by this, but I can't see any errors about why this
 occurred in the first place.

   are you sure there isn't a firewall in place blocking port 8020?
 e.g. iptables on the local machines?   if you do
 telnet localhost 8020
 do you make a connection? if you use lsof and/or netstat can you see
 the port open?
 if you have root access you can try turning off the firewall with
 iptables -F to see if things work without firewall rules.





Re: Netstat Shows Port 8020 Doesn't Seem to Listen

2012-01-09 Thread Eli Finkelshteyn
OK, switched over to my site's dns name and I'm golden. Everything works 
both locally and remotely. My suspicion is that the problem all along 
was tied to the firewall not being restarted, and then my exacerbating 
the problem through trying to fix it and corrupting the hdfs file 
system. Anyway, all works now. Thanks for the help, everyone!


Eli

On 1/9/12 6:11 PM, Eli Finkelshteyn wrote:
OK, not sure what I did (restarting the firewall, perhaps?), but I now 
have ports 8020 and 8021 listening and no more errors in my logs. 
Wooo! Only problem is I still can't get any hadoop stuff to work from 
a remote client:


hadoop fs -ls /
2012-01-09 17:53:53.559 java[13396:1903] Unable to load realm info 
from SCDynamicStore
12/01/09 17:53:55 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 0 time(s).
12/01/09 17:53:56 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 1 time(s).
12/01/09 17:53:57 INFO ipc.Client: Retrying connect to server: 
*my_server/my_ip*:8020. Already tried 2 time(s).

...

I feel like I'm almost there. Might this have to do with the fact that 
core-site.xml and mapred-site.xml specify localhost for ports 8020 and 
8021 (thus not listening to any attempted outside connections?)


Thanks for all the help so far, everyone!

Eli

On 1/9/12 5:43 PM, alo.alt wrote:

Firewall online?
and be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost. Nothing 
like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug)

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote:


Good call! netstat -anl gives me:
tcp        0      0 :::127.0.0.1:8020           :::*                    LISTEN

Now it just looks like nothing is running on 8021. And now I'm really confused 
about why I get no communication over 8020 from the datanode.

Just to reiterate, this definitely is not the firewall, running iptables -nvL 
gives:

...
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50070
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:50030
    0   0  ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8021
  164      ACCEPT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  state NEW tcp dpt:8020
...

On 1/9/12 5:08 PM, alo.alt wrote:

What happens when you try telnet localhost 8020?
netstat -anl would also be useful.

best,
  Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote:


A bit more info:

When I start up only the namenode by itself, I'm not seeing any errors, but 
what I am seeing that's really odd is:

   2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting
   Socket Reader #1 for port 8020
   2012-01-09 16:48:45,531 INFO
   org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
   with hostName=NameNode, port=8020
   2012-01-09 16:48:45,532 INFO
   org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC
   Metrics with hostName=NameNode, port=8020
   2012-01-09 16:48:45,541 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
   localhost.localdomain/127.0.0.1:8020

That's despite the fact that doing netstat -a | grep 8020 still returns 
nothing.  To me, that makes absolutely no sense. I feel like I should be 
getting an error telling me Namenode did not in fact go up on 8020, but I'm not 
getting that at all.

Eli

On 1/9/12 3:22 PM, Idris Ali wrote:

Hi,

Looks like a problem in starting DFS and MR. Can you run 'jps' and see if NN,
DN, SNN, JT and TT are running?

Also make sure that for pseudo-distributed mode the following entries are
present:

1. In core-site.xml
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>SOME TMP dir with read/write access, not the system temp</value>
  </property>

2. In hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>Local dir with read/write access</value>
  </property>

3. In mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

Thanks,
-Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn <iefin...@gmail.com> wrote:


Positive. Like I said before, netstat -a | grep 8020 gives me nothing.
Even if the firewall was the problem, that should still give me output that
the port is listening, but I'd just be unable to hit it from an outside box
(I tested this by blocking port 50070, at which point it still showed up in
netstat -a, but was inaccessible through http from a remote 

Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-09 Thread hao.wang
Hi,
Thanks for your reply!
I had already read those pages before; could you give me some more specific
suggestions about how to choose the values of
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum according to our cluster configuration
if possible?

regards!


2012-01-10 



hao.wang 



From: Harsh J
Sent: 2012-01-09 23:19:21
To: common-user
Cc:
Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum
 
Hi,
Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html 
to learn how to configure Hadoop using the various *-site.xml configuration 
files, and then follow 
http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve 
optimal configs for your cluster.
On 09-Jan-2012, at 5:50 PM, hao.wang wrote:
 Hi all,
Our hadoop cluster has 22 nodes, including one namenode, one jobtracker and
20 datanodes.
Each node has 2 * 12 cores with 32 GB RAM.
Could anyone tell me how to configure the following parameters:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
 
 regards!
 2012-01-09 
 
 
 
 hao.wang 


Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-09 Thread Harsh J
Hello again,

Try a 4:3 ratio between maps and reduces, against the total # of available CPUs 
per node (minus one or two, for DN and HBase if you run those). Then tweak it 
as you go (more map-only loads or more map-reduce loads, that depends on your 
usage, and you can tweak the ratio accordingly over time -- changing those 
props doesn't need a JobTracker restart, just the TaskTrackers).
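Taking "# of available CPUs" to mean cores, the nodes described here have 2 x 12 = 24; keeping a couple aside for the DataNode and TaskTracker daemons and splitting the rest roughly 4:3 gives about 12 map and 9 reduce slots per node. A sketch for mapred-site.xml on each TaskTracker; the exact numbers are assumptions to tune, not fixed values:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>9</value>
  </property>

After editing these, only the TaskTrackers need to be restarted for the new slot counts to take effect.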

On 10-Jan-2012, at 8:17 AM, hao.wang wrote:

 Hi,
Thanks for your reply!
 I had already read those pages before; could you give me some more specific
 suggestions about how to choose the values of
 mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum according to our cluster
 configuration if possible?
 
 regards!
 
 
 2012-01-10 
 
 
 
 hao.wang 
 
 
 
 From: Harsh J
 Sent: 2012-01-09 23:19:21
 To: common-user
 Cc:
 Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum
 
 Hi,
 Please read 
 http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn 
 how to configure Hadoop using the various *-site.xml configuration files, and 
 then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html 
 to achieve optimal configs for your cluster.
 On 09-Jan-2012, at 5:50 PM, hao.wang wrote:
 Hi all,
   Our hadoop cluster has 22 nodes, including one namenode, one jobtracker and
 20 datanodes.
   Each node has 2 * 12 cores with 32 GB RAM.
   Could anyone tell me how to configure the following parameters:
   mapred.tasktracker.map.tasks.maximum
   mapred.tasktracker.reduce.tasks.maximum
 
 regards!
 2012-01-09 
 
 
 
 hao.wang 



Re: datanode failing to start

2012-01-09 Thread Suresh Srinivas
Can you please send your notes on what info is out of date, or, better still,
create a JIRA so that it can be addressed?

On Fri, Jan 6, 2012 at 3:11 PM, Dave Kelsey da...@gamehouse.com wrote:

 Gave up and installed version 1.
 It installed correctly and worked, though the instructions for setup and
 the location of scripts and configs are now out of date.

 D

 On 1/5/2012 10:25 AM, Dave Kelsey wrote:


 java version 1.6.0_29
 hadoop: 0.20.203.0

  I'm attempting to set up the pseudo-distributed config on a Mac 10.6.8.
  I followed the steps from the QuickStart
  (http://wiki.apache.org/hadoop/QuickStart) and
 succeeded with Stage 1: Standalone Operation.
 I followed the steps for Stage 2: Pseudo-distributed Configuration.
 I set the JAVA_HOME variable in conf/hadoop-env.sh and I changed
 tools.jar to the location of classes.jar (a mac version of tools.jar)
 I've modified the three .xml files as described in the QuickStart.
 ssh'ing to localhost has been configured and works with passwordless
 authentication.
 I formatted the namenode with bin/hadoop namenode -format as the
 instructions say

 This is what I see when I run bin/start-all.sh

 root# bin/start-all.sh
  starting namenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-namenode-Hoot-2.local.out
  localhost: starting datanode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-datanode-Hoot-2.local.out
  localhost: Exception in thread "main" java.lang.NoClassDefFoundError: server
  localhost: Caused by: java.lang.ClassNotFoundException: server
  localhost: at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
  localhost: at java.security.AccessController.doPrivileged(Native Method)
  localhost: at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  localhost: at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  localhost: starting secondarynamenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-secondarynamenode-Hoot-2.local.out
  starting jobtracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-jobtracker-Hoot-2.local.out
  localhost: starting tasktracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-tasktracker-Hoot-2.local.out

 There are 4 processes running:
 ps -fax | grep hadoop | grep -v grep | wc -l
  4

 They are:
 SecondaryNameNode
 TaskTracker
 NameNode
 JobTracker


 I've searched to see if anyone else has encountered this and not found
 anything

 d

 p.s. I've also posted this to core-u...@hadoop.apache.org which I've yet
 to find how to subscribe to.




Container launch from appmaster

2012-01-09 Thread raghavendhra rahul
Hi all,
I am trying to write an application master. Is there a way to specify
node1: 10 containers
node2: 10 containers
Can we specify this kind of list using the application master?
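There is no single call in the 0.23 API that takes a literal node-to-count map, but the application master can approximate it by building one ResourceRequest per host and adding them all to the ask list it sends with its allocate() call. A rough sketch against the 0.23-era records API (hostnames, memory size and priority are placeholder assumptions; verify the method names against your build, and note the scheduler treats host-level requests as locality preferences, not guarantees):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.api.records.ResourceRequest;
  import org.apache.hadoop.yarn.util.Records;

  public class PerNodeAsks {
    /** Builds one ResourceRequest per host, e.g. {"node1"=10, "node2"=10}. */
    public static List<ResourceRequest> build(Map<String, Integer> plan) {
      List<ResourceRequest> asks = new ArrayList<ResourceRequest>();
      for (Map.Entry<String, Integer> e : plan.entrySet()) {
        ResourceRequest req = Records.newRecord(ResourceRequest.class);
        req.setHostName(e.getKey());         // e.g. "node1"
        req.setNumContainers(e.getValue());  // e.g. 10
        Priority pri = Records.newRecord(Priority.class);
        pri.setPriority(0);                  // placeholder priority
        req.setPriority(pri);
        Resource cap = Records.newRecord(Resource.class);
        cap.setMemory(1024);                 // container memory in MB, placeholder
        req.setCapability(cap);
        asks.add(req);
      }
      return asks;  // add these to the AllocateRequest ask list in the AM heartbeat
    }
  }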


Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-09 Thread hao.wang
Hi , 

Thanks for your reply!
According to your suggestion, maybe I can't apply it to our hadoop cluster,
because each server in our hadoop cluster contains just 2 CPUs.
So, I think maybe you mean the core # and not the CPU # in each server?
I am looking forward to your reply.

regards!


2012-01-10 



hao.wang 



From: Harsh J
Sent: 2012-01-10 11:33:38
To: common-user
Cc:
Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum
 
Hello again,
Try a 4:3 ratio between maps and reduces, against a total # of available CPUs 
per node (minus one or two, for DN and HBase if you run those). Then tweak it 
as you go (more map-only loads or more map-reduce loads, that depends on your 
usage, and you can tweak the ratio accordingly over time -- changing those 
props do not need JobTracker restarts, just TaskTracker).
On 10-Jan-2012, at 8:17 AM, hao.wang wrote:
 Hi,
Thanks for your reply!
I had already read those pages before; could you give me some more specific
 suggestions about how to choose the values of
 mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum according to our cluster
 configuration if possible?
 
 regards!
 
 
 2012-01-10 
 
 
 
 hao.wang 
 
 
 
 From: Harsh J
 Sent: 2012-01-09 23:19:21
 To: common-user
 Cc:
 Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum
 
 Hi,
 Please read 
 http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn 
 how to configure Hadoop using the various *-site.xml configuration files, and 
 then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html 
 to achieve optimal configs for your cluster.
 On 09-Jan-2012, at 5:50 PM, hao.wang wrote:
 Hi all,
   Our hadoop cluster has 22 nodes, including one namenode, one jobtracker and
 20 datanodes.
   Each node has 2 * 12 cores with 32 GB RAM.
   Could anyone tell me how to configure the following parameters:
   mapred.tasktracker.map.tasks.maximum
   mapred.tasktracker.reduce.tasks.maximum
 
 regards!
 2012-01-09 
 
 
 
 hao.wang