Re: 1 file per record

2008-10-02 Thread chandravadana


Hi all,

I have a doubt: if we don't specify numSplits in getSplits(), then what is the
default number of splits taken?


-- 
Best Regards
S.Chandravadana 




Re: Is there a way to pause a running hadoop job?

2008-10-02 Thread Klaas Bosteels
On Thu, Oct 2, 2008 at 2:44 AM, Steve Gao <[EMAIL PROTECTED]> wrote:
> I have 5 running jobs, each with 2 reducers. Because I set the max number of
> reducers to 10, any incoming job will be held until some of the 5 jobs
> finish and release reducer quota.
>
> Now the problem is that an incoming job has a higher priority, so I want to
> pause some of the 5 jobs, let the new job finish, and resume the old ones.
>
> Is this doable in Hadoop? Thanks!

You could use the patch attached to this JIRA to do this:

https://issues.apache.org/jira/browse/HADOOP-3687

Since paused tasks are kept in memory, there is a limit on how much you
can pause with this patch, but it can nevertheless be very useful in
practice.

-Klaas


Re: architecture diagram

2008-10-02 Thread Terrence A. Pietrondi
I am sorry for the confusion. I meant distributed data. 

So help me out here. For example, if I am reducing to a single file, would my
main transformation logic be in my mapping step, since I am reducing away
from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi


--- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Wednesday, October 1, 2008, 7:44 PM
> I'm not sure what you mean by "disconnected parts
> of data," but Hadoop is
> implemented to try and perform map tasks on machines that
> have input data.
> This is to lower the amount of network traffic, hence
> making the entire job
> run faster.  Hadoop does all this for you under the hood. 
> From a user's
> point of view, all you need to do is store data in HDFS
> (the distributed
> filesystem), and run MapReduce jobs on that data.  Take a
> look here:
> 
> 
> 
> Alex
> 
> On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > So to be "distributed" in a sense, you would
> want to do your computation on
> > the disconnected parts of data in the map phase I
> would guess?
> >
> > Terrence A. Pietrondi
> > http://del.icio.us/tepietrondi
> >
> >
> > --- On Wed, 10/1/08, Arun C Murthy
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > On Oct 1, 2008, at 10:17 AM, Terrence A.
> Pietrondi wrote:
> > >
> > > > I am trying to plan out my map-reduce
> implementation
> > > and I have some
> > > > questions of where computation should be
> split in
> > > order to take
> > > > advantage of the distributed nodes.
> > > >
> > > > Looking at the architecture diagram
> > >
> (http://hadoop.apache.org/core/images/architecture.gif
> > > > ), are the map boxes the major computation
> areas or is
> > > the reduce
> > > > the major computation area?
> > > >
> > >
> > > Usually the maps perform the 'embarrassingly
> > > parallel' computational
> > > steps where-in each map works independently on a
> > > 'split' on your input
> > > and the reduces perform the 'aggregate'
> > > computations.
> > >
> > >  From http://hadoop.apache.org/core/ :
> > >
> > > Hadoop implements MapReduce, using the Hadoop
> Distributed
> > > File System
> > > (HDFS). MapReduce divides applications into many
> small
> > > blocks of work.
> > > HDFS creates multiple replicas of data blocks for
> > > reliability, placing
> > > them on compute nodes around the cluster.
> MapReduce can
> > > then process
> > > the data where it is located.
> > >
> > > The Hadoop Map-Reduce framework is quite good at
> scheduling
> > > your
> > > 'maps' on the actual data-nodes where the
> > > input-blocks are present,
> > > leading to i/o efficiencies...
> > >
> > > Arun
> > >
> > > > Thanks.
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > >
> > > >
> >
> >
> >
> >


  


Re: too many open files error

2008-10-02 Thread Johannes Zillmann

Having a similar problem.
After upgrading from Hadoop 0.16.4 to 0.17.2.1 we're facing
"java.io.IOException: java.io.IOException: Too many open files" after
a few jobs.

For example:
Error message from task (reduce) tip_200810020918_0014_r_31: Error
initializing task_200810020918_0014_r_31_1:

java.io.IOException: java.io.IOException: Too many open files
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:646)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1271)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:912)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1307)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2266)



Once a job has failed because of this exception, all subsequent jobs
fail for the same reason.

After a cluster restart it works fine for a few jobs again.

Johannes

On Sep 27, 2008, at 1:59 AM, Karl Anderson wrote:



On 26-Sep-08, at 3:09 PM, Eric Zhang wrote:


Hi,
I encountered the following FileNotFoundException, resulting from a "too
many open files" error, when I tried to run a job. The job had been run
several times before without problems. I am confused by the exception
because my code closes all the files, and even if it didn't, the job has
only 10-20 small input/output files. The limit on open files on my box is
1024. Besides, the error seemed to happen even before the task was executed.
I am using version 0.17. I'd appreciate it if somebody could shed some light
on this issue. BTW, the job ran OK after I restarted Hadoop. Yes, the
hadoop-site.xml did exist in that directory.


I had the same errors, including the bash one.  Running one
particular job would cause all subsequent jobs of any kind to fail,
even after all running jobs had completed or failed out.  This was
confusing because the failing jobs themselves often had no
relationship to the cause; they were just in a bad environment.


If you can't successfully run a dummy job (with the identity mapper  
and reducer, or a streaming job with cat) once you start getting  
failures, then you are probably in the same situation.
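
A minimal sketch of such an identity sanity-check job, written against the
0.17/0.18-era mapred API; the class name and the input/output paths are
placeholders rather than anything from Karl's setup:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class DummyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DummyJob.class);
    conf.setJobName("identity-sanity-check");
    // Default TextInputFormat hands the mapper (LongWritable, Text) pairs,
    // which the identity mapper/reducer pass straight through.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path("/tmp/sanity-in"));
    FileOutputFormat.setOutputPath(conf, new Path("/tmp/sanity-out"));
    JobClient.runJob(conf);   // throws if the job fails, i.e. the cluster is sick
  }
}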


I believe that the problem was caused by increasing the timeout, but  
I never pinned it down enough to submit a Jira issue.  It might have  
been the XML reader or something else.  I was using streaming,  
hadoop-ec2, and either 0.17.0 or 0.18.0.  It would happen just as  
rapidly after I made an ec2 image with a higher open file limit.


Eventually I figured it out by running each job in my pipeline 5 or  
so times before trying the next one, which let me see which job was  
causing the problem (because it would eventually fail itself, rather  
than hosing a later job).


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra






~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com



Hadoop Camp next month

2008-10-02 Thread Owen O'Malley

Hi all,
  I'd like to remind everyone that the Hadoop Camp & ApacheCon US is  
coming up in New Orleans next month. http://tinyurl.com/hadoop-camp


It will be the largest gathering of Hadoop developers outside of California.
We'll have:

Core: Doug Cutting, Dhruba Borthakur, Arun Murthy, Owen O'Malley, Sameer Paranjpye, Sanjay Radia, Tom White
ZooKeeper: Ben Reed

There will also be a training session on Practical Problem Solving  
with Hadoop by Milind Bhandarkar on Monday.


So if you'd like to meet the developers or find out more about Hadoop,  
come join us!


-- Owen


Re: 1 file per record

2008-10-02 Thread Owen O'Malley

On Oct 2, 2008, at 1:50 AM, chandravadana wrote:


If we don't specify numSplits in getSplits(), then what is the default
number of splits taken?


getSplits() is either library or user code, so it depends on which class
you are using as your InputFormat. The FileInputFormats (TextInputFormat
and SequenceFileInputFormat) basically divide input files by blocks,
unless the requested number of mappers is really high.
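
As a rough sketch of that sizing logic (simplified, not the exact Hadoop
source; the formula and defaults are as I recall them for this era), the
requested number of maps only sets a per-split goal size, which is then
clamped between the minimum split size and the block size:

public class SplitSizeSketch {

  // goalSize = total input / requested maps; each split is then clamped
  // between the configured minimum split size and the HDFS block size.
  static long computeSplitSize(long totalInputBytes, int numSplits,
                               long minSplitSize, long blockSize) {
    long goalSize = totalInputBytes / Math.max(numSplits, 1);
    return Math.max(minSplitSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    // e.g. 1 GB of input, 2 requested maps, default min size 1, 64 MB blocks:
    long split = computeSplitSize(1L << 30, 2, 1, 64L << 20);
    // prints 67108864 (64 MB): roughly one split per block, ~16 splits total
    System.out.println("split size = " + split + " bytes");
  }
}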


-- Owen


Re: Maps running after reducers complete successfully?

2008-10-02 Thread Owen O'Malley
It isn't optimal, but it is the expected behavior. In general, when we
lose a TaskTracker, we want the map outputs regenerated so that they are
available to any reduces that need to re-run (including speculative
execution). We could handle it as a special case if:

  1. We didn't lose any running reduces.
  2. All of the reduces (including speculative tasks) are done with shuffling.
  3. We don't plan on launching any more speculative reduces.

If all 3 hold, we don't need to re-run the map tasks. Actually doing
so would be a pretty involved patch to the JobTracker/Schedulers.


-- Owen


Question about dfs.datanode.du.reserved and dfs.datanode.du.pct in 0.18.1

2008-10-02 Thread Jason Venner

In our environment we have hdfs nodes that are also used as compute nodes.

Our disk environment is heterogeneous. We have a couple of machines with
much smaller disk capacity than others. Another minor issue is that our IT
staff sets up one filesystem backed by a hardware RAID of all of the
physical disks in the machine.


We have been trying to work with dfs.datanode.du.reserved and 
dfs.datanode.du.pct, but we are still filling up our small machines.


On reading through the code, it appears to me that these two values are 
only examined for determining the host of a replica block.


Questions:

  1. Is the 'first block' /always/ written to the local host if it is also
     an HDFS node for the filesystem, ignoring any dfs.datanode.du limits?
  2. Is there any attempt to ensure that multiple blocks are not allocated
     such that the dfs.datanode.du limits may be exceeded?

Thanks all.
--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: architecture diagram

2008-10-02 Thread Alex Loddengaard
I think it really depends on the job as to where the logic goes.  Sometimes your
reduce step is as simple as an identity function, and sometimes it can be
more complex than your map step.  It all depends on your data and the
operation(s) you're trying to perform.

Perhaps we should step out of the abstract.  Do you have a specific problem
you're trying to solve?  Can you describe it?

Alex

On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> I am sorry for the confusion. I meant distributed data.
>
> So help me out here. For example, if I am reducing to a single file, then
> my main transformation logic would be in my mapping step since I am reducing
> away from the data?
>
> Terrence A. Pietrondi
> http://del.icio.us/tepietrondi
>


Re: architecture diagram

2008-10-02 Thread Terrence A. Pietrondi
I am trying to write a map reduce implementation to do the following:

1) read tabular data delimited in some fashion
2) pivot that data, so the rows are columns and the columns are rows
3) shuffle the rows (that were the columns) to randomize the data
4) pivot the data back 

For example.

A|B|C
D|E|G

pivots to...

A|D
B|E
C|G

Then for each row, shuffle the contents around randomly...

D|A
B|E
G|C

Then pivot the data back...

D|B|G
A|E|C

You can reference my progress so far...

http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
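
A minimal sketch of one way to express the pivot step with the old
org.apache.hadoop.mapred API, assuming "|"-delimited input: the mapper emits
(columnIndex, rowOffset:value) and the reducer reassembles each column as an
output row. Note that without a secondary sort the values for a column arrive
in no particular order, so recovering the original row order would still need
the tagged offsets; the class names here are illustrative and not taken from
the csvdatamix code.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Pivot {

  // Emits (columnIndex, rowOffset:value) for every field of every input row.
  public static class PivotMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable offset, Text row,
                    OutputCollector<LongWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] fields = row.toString().split("\\|");
      for (int col = 0; col < fields.length; col++) {
        // Tag each value with its original row offset so the original row
        // order can still be recovered on the reduce side if needed.
        out.collect(new LongWritable(col),
                    new Text(offset.get() + ":" + fields[col]));
      }
    }
  }

  // Reassembles one output row (the former column) per key; the column index
  // is written as the output key.
  public static class PivotReducer extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {
    public void reduce(LongWritable col, Iterator<Text> values,
                       OutputCollector<LongWritable, Text> out, Reporter reporter)
        throws IOException {
      StringBuilder row = new StringBuilder();
      while (values.hasNext()) {
        String[] tagged = values.next().toString().split(":", 2);
        if (row.length() > 0) row.append('|');
        row.append(tagged[1]);   // tagged[0] is the row offset, unused here
      }
      out.collect(col, new Text(row.toString()));
    }
  }
}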

Terrence A. Pietrondi


--- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Thursday, October 2, 2008, 1:36 PM
> I think it really depends on the job as to where logic goes.
>  Sometimes your
> reduce step is as simple as an identity function, and
> sometimes it can be
> more complex than your map step.  It all depends on your
> data and the
> operation(s) you're trying to perform.
> 
> Perhaps we should step out of the abstract.  Do you have a
> specific problem
> you're trying to solve?  Can you describe it?
> 
> Alex


  


How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Steve Gao
Suppose I have 3 files in Hadoop that I want to "cat" into a single file. I
know it can be done by "hadoop dfs -cat" to a local file and uploading it back
to Hadoop, but that's very expensive for large files. Is there an internal way
to do this in Hadoop itself? Thanks
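
There is no purely server-side concatenation in this version as far as I know,
but a small client program against the FileSystem API at least avoids the
local round-trip: the bytes still stream through the client once, yet nothing
is written to local disk. A minimal sketch, with the target and input paths
taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsConcat {
  // Usage: HdfsConcat <target> <input1> <input2> ...
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));
    for (int i = 1; i < args.length; i++) {
      FSDataInputStream in = fs.open(new Path(args[i]));
      IOUtils.copyBytes(in, out, 4096, false);   // false: keep 'out' open
      in.close();
    }
    out.close();
  }
}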



  

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I am doing a lot of testing with Hive; I will be sure to add this
information to the wiki once I get it going.

Thus far I have downloaded the same version of Derby that Hive uses. I have
verified that the connection is up and running.

ij version 10.4
ij> connect 'jdbc:derby://nyhadoop1:1527/metastore_db;create=true';
ij> show tables
TABLE_SCHEM |TABLE_NAME|REMARKS

SYS |SYSALIASES|
SYS |SYSCHECKS |
...

vi hive-default.conf
...

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remote metastore server
  or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db</value>
  <description>Comma separated list of URIs of metastore servers. The
  first server that can be connected to will be used.</description>
</property>

...

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://nyhadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=
javax.jdo.option.ConnectionPassword=

hive> show tables;
08/10/02 15:17:12 INFO hive.metastore: Trying to connect to metastore
with URI jdbc:derby://nyhadoop1:1527/metastore_db
FAILED: Error in semantic analysis: java.lang.NullPointerException
08/10/02 15:17:12 ERROR ql.Driver: FAILED: Error in semantic analysis:
java.lang.NullPointerException

I must have a setting wrong. Any ideas?


Re: Hive questions about the meta db

2008-10-02 Thread Prasad Chakka

The property below is not needed; keep it at the default value. (Also, you can
create a hive-site.xml and leave hive-default.xml as it is.)

<property>
  <name>hive.metastore.uris</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db</value>
  <description>Comma separated list of URIs of metastore servers. The
  first server that can be connected to will be used.</description>
</property>

Set hive.metastore.local to true:

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to remote metastore server
  or open a new metastore server in Hive Client JVM</description>
</property>

If you are still getting the error, check the logs (/tmp/${USER}/hive.log). In
the conf directory there is a hive-log4j.properties where you can control the
logging level.

Prasad



From: Edward Capriolo <[EMAIL PROTECTED]>
Reply-To: 
Date: Thu, 2 Oct 2008 12:33:20 -0700
To: 
Subject: Re: Hive questions about the meta db

I am doing a lot of testing with Hive, I will be sure to add this
information to the wiki once I get it going.

Thus far I downloaded the same version of derby that hive uses. I have
verified that the connections is up and running.

ij version 10.4
ij> connect 'jdbc:derby://nyhadoop1:1527/metastore_db;create=true';
ij> show tables
TABLE_SCHEM |TABLE_NAME|REMARKS

SYS |SYSALIASES|
SYS |SYSCHECKS |
...

vi hive-default.conf
...

  hive.metastore.local
  false
  controls whether to connect to remove metastore server
or open a new metastore server in Hive Client JVM



  javax.jdo.option.ConnectionURL
  jdbc:derby://nyhadoop1:1527/metastore_db;create=true
  JDBC connect string for a JDBC metastore



  javax.jdo.option.ConnectionDriverName
  org.apache.derby.jdbc.ClientDriver
  Driver class name for a JDBC metastore



  hive.metastore.uris
  jdbc:derby://nyhadoop1:1527/metastore_db
  Comma separated list of URIs of metastore servers. The
first server that can be connected to will be used.

...

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://nyhadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=
javax.jdo.option.ConnectionPassword=

hive> show tables;
08/10/02 15:17:12 INFO hive.metastore: Trying to connect to metastore
with URI jdbc:derby://nyhadoop1:1527/metastore_db
FAILED: Error in semantic analysis: java.lang.NullPointerException
08/10/02 15:17:12 ERROR ql.Driver: FAILED: Error in semantic analysis:
java.lang.NullPointerException

I must have a setting wrong. Any ideas?




Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
>>  hive.metastore.local
>> true

Why would I set this property to true? My goal is to store the metadata
in an external database. If I set this to true, the metastore is created
in the working directory.


RE: Hive questions about the meta db

2008-10-02 Thread Ashish Thusoo
Hi Edward,

Can you send us the contents of /tmp//hive.log? Also, let's open a JIRA
for this and carry out the discussion there - even if this is not a bug (which
it may turn out to be), NullPointerException is not the most useful
user-visible message, so at least that we should fix...

Thanks,
Ashish

-Original Message-
From: Edward Capriolo [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 02, 2008 1:28 PM
To: core-user@hadoop.apache.org
Subject: Re: Hive questions about the meta db

>>  hive.metastore.local
>> true

Why would I set this property to true? My goal is to store the meta data in an 
external database. It i set this to true the metabase is created in the working 
directory.


Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I determined the problem once I set the log4j properties to debug.
derbyclient.jar and derbytools.jar do not ship with Hive. As a result,
when you try to load org.apache.derby.jdbc.ClientDriver you get an
InvocationTargetException. The solution was to download Derby and place
those files in hive/lib.

It is working now. Thanks!


Lazily deserializing Writables

2008-10-02 Thread Jimmy Lin
Hi everyone,

I'm wondering if it's possible to lazily deserialize a Writable.  That is,
when my custom Writable is handed a DataInput from readFields, can I
simply hang on to the reference and read from it later?  This would be
useful if the Writable is a complex data structure that may be expensive
to deserialize, so I'd only want to do it on-demand.  Or does the runtime
mutate the underlying stream, leaving the Writable with a reference to
something completely different later?

I'm wondering about both present behavior, and the implicit contract
provided by the Hadoop API.

Thanks!

-Jimmy




Re: How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Steve Gao
Does anybody know? Thanks a lot.

--- On Thu, 10/2/08, Steve Gao <[EMAIL PROTECTED]> wrote:
From: Steve Gao <[EMAIL PROTECTED]>
Subject: How to concatenate hadoop files to a single hadoop file
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, October 2, 2008, 3:17 PM

Suppose I have 3 files in Hadoop that I want to "cat" them to a single
file. I know it can be done by "hadoop dfs -cat" to a local file and
updating it to Hadoop. But it's very expensive for large files. Is there an
internal way to do this in Hadoop itself? Thanks



  


  

Re: How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Michael Andrews

You might be able to use HARs (Hadoop archives):

http://hadoop.apache.org/core/docs/current/hadoop_archives.html

On 10/2/08 2:51 PM, "Steve Gao" <[EMAIL PROTECTED]> wrote:

Anybody knows? Thanks a lot.

--- On Thu, 10/2/08, Steve Gao <[EMAIL PROTECTED]> wrote:
From: Steve Gao <[EMAIL PROTECTED]>
Subject: How to concatenate hadoop files to a single hadoop file
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, October 2, 2008, 3:17 PM

Suppose I have 3 files in Hadoop that I want to "cat" them to a single
file. I know it can be done by "hadoop dfs -cat" to a local file and
updating it to Hadoop. But it's very expensive for large files. Is there an
internal way to do this in Hadoop itself? Thanks









Re: Lazily deserializing Writables

2008-10-02 Thread Bryan Duxbury
We do this with some of our Thrift-serialized types. We account for this
behavior explicitly in the ThriftWritable class and make it so that we can
read the serialized version off the wire completely by prepending the size.
Then, we can read in the raw bytes and hang on to them for later as we see
fit. I would think that leaving the bytes on the DataInput would break
things in a very impressive way.
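
A minimal sketch of that length-prefixed idea; the abstract encode/decode
hooks are placeholders rather than a real Hadoop or Thrift API. readFields()
consumes exactly the record's bytes, which keeps the DataInput contract
intact, while the expensive decoding only happens if get() is ever called:

import java.io.*;
import org.apache.hadoop.io.Writable;

public abstract class LazyWritable<T> implements Writable {
  private byte[] raw;   // serialized form, captured in readFields()
  private T value;      // decoded lazily

  protected abstract T decode(DataInput in) throws IOException;
  protected abstract void encode(T value, DataOutput out) throws IOException;

  public void readFields(DataInput in) throws IOException {
    int length = in.readInt();     // size prefix written by write()
    raw = new byte[length];
    in.readFully(raw);             // consume exactly our bytes, nothing more
    value = null;                  // decode later, only if asked
  }

  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    encode(get(), new DataOutputStream(buf));
    byte[] bytes = buf.toByteArray();
    out.writeInt(bytes.length);    // prefix with the size
    out.write(bytes);
  }

  public T get() throws IOException {
    if (value == null && raw != null) {
      value = decode(new DataInputStream(new ByteArrayInputStream(raw)));
    }
    return value;
  }
}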


-Bryan

On Oct 2, 2008, at 2:48 PM, Jimmy Lin wrote:


Hi everyone,

I'm wondering if it's possible to lazily deserialize a Writable.   
That is,

when my custom Writable is handed a DataInput from readFields, can I
simply hang on to the reference and read from it later?  This would be
useful if the Writable is a complex data structure that may be  
expensive
to deserialize, so I'd only want to do it on-demand.  Or does the  
runtime

mutate the underlying stream, leaving the Writable with a reference to
something completely different later?

I'm wondering about both present behavior, and the implicit contract
provided by the Hadoop API.

Thanks!

-Jimmy






Re: Lazily deserializing Writables

2008-10-02 Thread Andrzej Bialecki

Jimmy Lin wrote:

Hi everyone,

I'm wondering if it's possible to lazily deserialize a Writable.  That is,
when my custom Writable is handed a DataInput from readFields, can I
simply hang on to the reference and read from it later?  This would be
useful if the Writable is a complex data structure that may be expensive
to deserialize, so I'd only want to do it on-demand.  Or does the runtime
mutate the underlying stream, leaving the Writable with a reference to
something completely different later?

I'm wondering about both present behavior, and the implicit contract
provided by the Hadoop API.


The implicit contract is that you consume all bytes from the input in 
readFields() that you'll ever consume from this DataInput. The same 
DataInput is then passed to other Writables so that they can read their 
fields. If you don't advance the DataInput sufficiently to consume all 
bytes related to your Writable, then the next record won't read in 
properly, and things will start crashing ..



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lazily deserializing Writables

2008-10-02 Thread Jimmy Lin
Hi Bryan,

Thanks, this answers my question!  So at the very least you'll have to
read in the raw bytes and hang on to them.

-Jimmy

> We do this with some of our Thrift-serialized types. We account for
> this behavior explicitly in the ThriftWritable class and make it so
> that we can read the serialized version off the wire completely by
> prepending the size. Then, we can read in the raw bytes and hang on
> to them for later as we see fit. I would think that leaving the bytes
> on the DataInput would break things in a very impressive way.
>
> -Bryan
>
> On Oct 2, 2008, at 2:48 PM, Jimmy Lin wrote:
>
>> Hi everyone,
>>
>> I'm wondering if it's possible to lazily deserialize a Writable.
>> That is,
>> when my custom Writable is handed a DataInput from readFields, can I
>> simply hang on to the reference and read from it later?  This would be
>> useful if the Writable is a complex data structure that may be
>> expensive
>> to deserialize, so I'd only want to do it on-demand.  Or does the
>> runtime
>> mutate the underlying stream, leaving the Writable with a reference to
>> something completely different later?
>>
>> I'm wondering about both present behavior, and the implicit contract
>> provided by the Hadoop API.
>>
>> Thanks!
>>
>> -Jimmy
>>
>>
>
>
>




Re: Merging of the local FS files threw an exception

2008-10-02 Thread Per Jacobsson
Quick FYI: I've run the same job twice more without seeing the error.
/ Per

On Wed, Oct 1, 2008 at 11:07 AM, Per Jacobsson <[EMAIL PROTECTED]> wrote:

> Hi everyone,
> (apologies if this gets posted on the list twice for some reason, my first
> attempt was denied as "suspected spam")
>
> I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small
> AMI. The job was producing gzipped output, otherwise I haven't changed the
> configuration.
>
> The final reduce steps failed with this error that I haven't seem before:
>
> 2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_200809301822_0005_r_01_0 Merging of the local FS files threw an
> exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative
> value-length: -96
> at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:331)
> at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
> at
> org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
> at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
> at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2021)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2025)
>
> 2008-10-01 05:02:44,131 WARN org.apache.hadoop.mapred.TaskTracker: Error
> running child
> java.io.IOException: attempt_200809301822_0005_r_01_0The reduce copier
> failed
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
> at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>
> When I try to download the data from HDFS I get a "Found checksum error"
> warning message.
>
> Any ideas what could be the cause? Would upgrading to 0.18.1 solve it?
> Thanks,
> / Per
>
>


Sharing an object across mappers

2008-10-02 Thread Devajyoti Sarkar
I think each mapper/reducer runs in its own JVM, which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speed across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess RMI and other such methods will just be too slow.

Thanks,
Dev
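
A minimal sketch of the usual workaround for read-only data, against the old
mapred API: load the object once per task JVM as a static singleton in
configure(). Tasks normally run in separate JVMs (and, as far as I know, there
is no cross-task JVM sharing in this version), so this is not true
cross-mapper sharing; it just keeps a local, in-memory copy for the duration
of each task, typically fed from a file shipped via DistributedCache. The
"shared.data.path" property and the tab-separated file format are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SharedObjectMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // One instance per task JVM, shared by whatever runs in that JVM.
  private static Map<String, String> shared;

  public void configure(JobConf job) {
    synchronized (SharedObjectMapper.class) {
      if (shared == null) {
        shared = new HashMap<String, String>();
        try {
          // e.g. a local file shipped to every node, one "key<TAB>value" per line
          BufferedReader in =
              new BufferedReader(new FileReader(job.get("shared.data.path")));
          String line;
          while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            shared.put(kv[0], kv.length > 1 ? kv[1] : "");
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("could not load shared data", e);
        }
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String hit = shared.get(value.toString());   // in-memory lookup
    out.collect(value, new Text(hit == null ? "" : hit));
  }
}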


is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-02 Thread Frank Singleton
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Did a test on recursive chown on a Fedora 9 box here (2x quad core, 16 GB RAM).
It took about 12.5 minutes to complete for 45000 files (approx 60 files/sec).

This was the namenode that I executed the command on.

Q1. Is this rate (60 files/sec) typical of what other folks are seeing?
Q2. Are there any dfs/jvm parameters I should look at to see if I can improve this?

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank 
/home/frank/proj100

real12m38.631s
user1m54.662s
sys 0m33.124s

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
 22045891 3965996260 
hdfs://namenode:9000/home/frank/proj100

real0m1.579s
user0m0.686s
sys 0m0.129s


cheers / frank
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkjln0MACgkQpZzN+MMic6dqgQCdEtto3qEhKIc50ICMf058w8ar
o4QAoILcDRDYmUUuxPwSFh7LNTQdKodn
=xuZE
-END PGP SIGNATURE-


Re: Sharing an object across mappers

2008-10-02 Thread Alan Ho
It really depends on what type of data you are sharing, how you are looking up
the data, whether the data is read-write, and whether you care about
consistency. If you don't care about consistency, I suggest that you shove the
data into a BDB store (for key-value lookup) or a Lucene store, and copy the
data to all the nodes. That way all data access will be in-process, with no GC
problems, and you will get very fast results. BDB and Lucene both have easy
replication strategies.

If the data is read-write and you need consistency, you should probably forget
about MapReduce and just run everything on big iron.

Regards,
Alan Ho




- Original Message 
From: Devajyoti Sarkar <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess JMI and others such methods will be just too slow.

Thanks,
Dev





streaming silently failing when executing binaries with unresolved dependencies

2008-10-02 Thread Chris Dyer
Hi all-
I am using streaming with some c++ mappers and reducers.  One of the
binaries I attempted to run this evening had a dependency on a shared
library that did not exist on my cluster, so it failed during
execution.  However, the streaming framework didn't appear to
recognize this failure, and the job tracker indicated that the mapper
returned success, but did not produce any results.  Has anyone else
encountered this issue?  Should I open a JIRA issue about this?  I'm
using Hadoop 0.17.2.
Thanks-
Chris


Re: streaming silently failing when executing binaries with unresolved dependencies

2008-10-02 Thread Amareshwari Sriramadasu
This is because the non-zero exit status of the streaming process was not
treated as a failure until 0.17. In 0.17, you can set the configuration
property "stream.non.zero.exit.is.failure" to "true" to have a non-zero exit
treated as a failure. From 0.18, the default value for
'stream.non.zero.exit.is.failure' is true.
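
A minimal sketch of setting that flag from a Java driver; with the streaming
command-line tool the equivalent would be passing the same property via
-jobconf (my recollection of the 0.17 streaming options):

import org.apache.hadoop.mapred.JobConf;

public class StreamingExitCheck {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // On 0.17 this must be set explicitly; from 0.18 it defaults to true.
    conf.setBoolean("stream.non.zero.exit.is.failure", true);
    System.out.println(conf.get("stream.non.zero.exit.is.failure")); // "true"
  }
}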


Thanks
Amareshwari
Chris Dyer wrote:

Hi all-
I am using streaming with some c++ mappers and reducers.  One of the
binaries I attempted to run this evening had a dependency on a shared
library that did not exist on my cluster, so it failed during
execution.  However, the streaming framework didn't appear to
recognize this failure, and the job tracker indicated that the mapper
returned success, but did not produce any results.  Has anyone else
encountered this issue?  Should I open a JIRA issue about this?  I'm
using Hadoop-17.2
Thanks-
Chris
  




Re: Hadoop + Elastic Block Stores

2008-10-02 Thread Alan Ho
Does anybody have performance statistics on running DFS on EBS instead of local
disk? I think one of the interesting questions would be what the sustained
throughput of EBS is.

Some general questions on DFS - is the DFS data replicated to more than one
node? Has anybody tried running DFS entirely in memory?

Regards,
Alan Ho



- Original Message 
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Monday, September 8, 2008 10:25:24 AM
Subject: Re: Hadoop + Elastic Block Stores

Ryan LeCompte wrote:
> I'd really love to one day 
> see some scripts under src/contrib/ec2/bin that can setup/mount the EBS 
> volumes automatically. :-)

The fastest way might be to write & contribute such scripts!

Doug





Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-02 Thread Raghu Angadi


This is mostly disk-bound on the NameNode. I think this ends up being one
fsync for each file. If you have multiple directories, you could start
multiple commands in parallel; because of the way the NameNode syncs,
having multiple clients helps.
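
A minimal sketch of that idea against the 0.18 FileSystem API: chown each
top-level subdirectory from its own thread so several clients' worth of RPCs
are in flight at once. The argument handling and error handling are
simplified; this is not a drop-in replacement for 'dfs -chown -R'.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelChown {

  // Walks one sub-tree, issuing a setOwner call for every file and directory.
  static void chownRecursive(FileSystem fs, FileStatus stat,
                             String user, String group) throws Exception {
    fs.setOwner(stat.getPath(), user, group);
    if (stat.isDir()) {
      for (FileStatus child : fs.listStatus(stat.getPath())) {
        chownRecursive(fs, child, user, group);
      }
    }
  }

  // Usage: ParallelChown <dir> <user> <group>
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    final String user = args[1];
    final String group = args[2];
    Path root = new Path(args[0]);
    fs.setOwner(root, user, group);

    FileStatus[] top = fs.listStatus(root);
    Thread[] workers = new Thread[top.length];
    for (int i = 0; i < top.length; i++) {
      final FileStatus entry = top[i];
      workers[i] = new Thread() {
        public void run() {
          try {
            chownRecursive(fs, entry, user, group);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      };
      workers[i].start();   // one client thread per top-level entry
    }
    for (Thread t : workers) {
      t.join();             // wait for all sub-trees to finish
    }
  }
}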


Raghu.

Frank Singleton wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Did a test on recursive chown on a fedora 9 box here (2xquad core,16Gram)
Took about 12.5 minutes to complete for 45000 files. (hmm approx 60 files/sec)

This was the namenode that I executed the command on

Q1. Is this rate (60 files/sec) typical of what other folks are seeing ?
Q2. Are there any dfs/jvm parameters I should look at to see if I can improve 
this

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank 
/home/frank/proj100

real12m38.631s
user1m54.662s
sys 0m33.124s

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
 22045891 3965996260 
hdfs://namenode:9000/home/frank/proj100

real0m1.579s
user0m0.686s
sys 0m0.129s


cheers / frank
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkjln0MACgkQpZzN+MMic6dqgQCdEtto3qEhKIc50ICMf058w8ar
o4QAoILcDRDYmUUuxPwSFh7LNTQdKodn
=xuZE
-END PGP SIGNATURE-




Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-02 Thread Frank Singleton
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Frank Singleton wrote:
> Hi,
> 
> Did a test on recursive chown on a fedora 9 box here (2xquad core,16Gram)
> Took about 12.5 minutes to complete for 45000 files. (hmm approx 60 files/sec)
> 
> This was the namenode that I executed the command on
> 
> Q1. Is this rate (60 files/sec) typical of what other folks are seeing ?
> Q2. Are there any dfs/jvm parameters I should look at to see if I can improve 
> this
> 
> time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank 
> /home/frank/proj100
> 
> real  12m38.631s
> user  1m54.662s
> sys   0m33.124s
> 
> time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
>  22045891 3965996260 
> hdfs://namenode:9000/home/frank/proj100
> 
> real  0m1.579s
> user  0m0.686s
> sys   0m0.129s
> 
> 
> cheers / frank

Just to clarify, this is for when the chown will modify all files' owner
attributes, e.g. toggling everything from frank:frank to hadoop:hadoop (see below).

For chown -R from frank:frank to frank:frank, it takes only 5 or 6 seconds.

At this point, all files under /home/frank/proj100 are frank:frank, and the
command executes in 6 seconds or so.

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R 
frank:frank /home/frank/proj100

real0m5.624s
user0m6.744s
sys 0m0.402s

#now lets change all to hadoop:hadoop

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R 
hadoop:hadoop /home/frank/proj100

real12m43.732s
user0m53.781s
sys 0m10.655s


# now toggle back to frank:frank

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R 
frank:frank /home/frank/proj100

real12m40.700s
user0m45.757s
sys 0m8.173s

# now frank:frank to frank:frank

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R 
frank:frank /home/frank/proj100

real0m5.648s
user0m6.734s
sys 0m0.593s
[EMAIL PROTECTED] ~]$


cheers / frank

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkjlvKwACgkQpZzN+MMic6eO4ACfVYEJ3DqWXo1Mg/4StUhG2Vii
r2AAn2YpDmDi2l2a4Bn/1CHAHQtLDgrg
=Dq3d
-END PGP SIGNATURE-