Re: I am tired, Error, Not a host:port pair: local

2008-06-23 Thread Michaela Buergle
> Another question I have is regarding hadoop-site.xml, where I have set
> hadoop.tmp.dir to be /home/hadoop/hadoop-datastore.  But hadoop is creating
> dfs in  /tmp/hadoop-hadoop/dfs?

Are you sure that the hadoop-site.xml you modified in your config directory
is actually the one being used, and not some default?
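For reference, the override in conf/hadoop-site.xml should look roughly like
the stanza below (the path is just the example value from your mail):

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-datastore</value>
  </property>
</configuration>

and the daemons need to be restarted after changing it.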

micha


Re: Too many fetch failures AND Shuffle error

2008-06-23 Thread Allen Wittenauer



On 6/21/08 1:53 AM, "Sayali Kulkarni" <[EMAIL PROTECTED]> wrote:
> One question still,
> I currently have just 5-6 nodes. But when Hadoop is deployed on a larger
> cluster, say of 1000+ nodes, is it expected that every time a new machine is
> added to the cluster, you add an entry in the /etc/hosts of all the (1000+)
> machines in the cluster?

No.

Any competent system administrator will say that the installation should
be using DNS or perhaps some other distributed naming service by then.

Heck, even at five nodes I would have deployed DNS. :)

To get an idea of how Yahoo! runs its large installations, take a look
at my presentation on the Hadoop wiki (
http://wiki.apache.org/hadoop/HadoopPresentations ).



Re: how to write a file in HDFS

2008-06-23 Thread Owen O'Malley


On Jun 22, 2008, at 9:23 PM, Samiran Bag wrote:

 Can you tell me how writing to a side file in the output directory can
be done using a C program?


By using the libhdfs library, which uses JNI to provide a C interface
to HDFS.

http://wiki.apache.org/hadoop/LibHDFS

-- Owen

Meet Hadoop presentation: the math from page 5

2008-06-23 Thread Stefan Groschupf

Hi,
I tried to better understand slide 5 of "meet hadoop":
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf
The slide says:
given:
–10MB/s transfer
–10ms/seek
–100B/entry (10B entries)
–10kB/page (1B pages)

updating 1% of entries (100M) takes:
–1000 days with random B-Tree updates
–100 days with batched B-Tree updates
–1 day with sort & merge

I wonder how exactly to calculate the 1000 days and 100 days.
time for seeking = 100,000,000 * lg(1,000,000,000) * 10 ms = 346.03 days
time to read all pages = 100,000,000 * lg(1,000,000,000) * (10 kB / 10 MB/s) = 33.79 days
Since we might need to write all pages again, we can add another ~33 days,
but the result is still nowhere near 1000 days, so I must be doing something
fundamentally wrong. :o
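Written as a formula, what I am computing is one seek and one full page
transfer per B-tree level, for each updated entry:

  T_seek     = N * log2(P) * t_seek
  T_transfer = N * log2(P) * (S_page / r)

with N = 10^8 updated entries, P = 10^9 pages, t_seek = 10 ms, S_page = 10 kB
and r = 10 MB/s, which gives the ~346-day and ~34-day figures above.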


Thanks for any help...

Stefan



Working with XML / XQuery in hadoop

2008-06-23 Thread Kayla Jay
Hi

Just wondering if anyone out there works with, manipulates, and stores XML
data using Hadoop?  I've seen some threads about XML RecordReaders and people
who use the StreamXmlRecordReader to do splits.  But has anyone
implemented a query framework that uses the Hadoop layer to query
the XML in their map/reduce jobs?

I want to know if anyone has executed an XQuery or XPath query within a Hadoop
job to find something within the XML stored in Hadoop.

I can't find any samples or anyone else out there who uses XML data vs.
traditional log text data.

Are there any use cases of using Hadoop to work with XML and then querying
the XML in a distributed manner?

Thanks.



  

trouble setting up hadoop

2008-06-23 Thread Sandy
I apologize for how basic this error is, but I am in the process
of getting Hadoop set up. I have been following the instructions in the
Hadoop quickstart, and I have confirmed that bin/hadoop will give me usage
and help information.

I am now at the standalone-operation stage.

I typed in:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

at which point I get:
Exception in thread "main" java.lang.ClassNotFoundException:
java.lang.Iterable not found in
gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/sjm/Desktop/hado
op-0.16.4/bin/../conf/,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar,file:/home/sjm/Desktop/ha
doop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar,file:/home/sjm/Desktop/hadoo
p-0.16.4/bin/../lib/commons-codec-1.3.jar,file:/home/sjm/Desktop/hadoop-0.16.4/b
in/../lib/commons-httpclient-3.0.1.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/
../lib/commons-logging-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib
/commons-logging-api-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/j
ets3t-0.5.0.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-5.1.4.jar,
file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/junit-3.8.1.jar,file:/home/sjm/D
esktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar,file:/home/sjm/Desktop/hadoop-0.16.4
/bin/../lib/log4j-1.2.13.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ser
vlet-api.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/xmlenc-0.52.jar,fil
e:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons-el.jar,file:/home
/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-compiler.jar,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-runtime.jar,file:/home/sjm/
Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar],
parent=gnu.gcj.runtime. ExtensionClassLoader{urls=[], parent=null}}
   at java.net.URLClassLoader.findClass (libgcj.so.7)
   at java.lang.ClassLoader.loadClass (libgcj.so.7)
   at java.lang.ClassLoader.loadClass (libgcj.so.7)
   at java.lang.VMClassLoader.defineClass (libgcj.so.7)
   at java.lang.ClassLoader.defineClass (libgcj.so.7)
   at java.security.SecureClassLoader.defineClass (libgcj.so.7)
   at java.net.URLClassLoader.findClass (libgcj.so.7)
   at java.lang.ClassLoader.loadClass (libgcj.so.7)
   at java.lang.ClassLoader.loadClass (libgcj.so.7)
   at org.apache.hadoop.util.RunJar.main (RunJar.java:107)

I suspect the issue is path related, though I am not certain. Could someone
please point me in the right direction?

Much thanks,

SM


Re: Working with XML / XQuery in hadoop

2008-06-23 Thread Brian Vargas


Kayla,

When I first started playing with Hadoop, I created an InputFormat and
RecordReader that, given an XML file, created a series of key-value
pairs where the XPath of the node in the document was the key and the
value of the node (if it had one) was the value.  At the time, it seemed
like a good idea, but turned out to be horribly slow, due to the insane
number of keys that were created.  It also sucked to code against.

It turned out to be way faster, and way easier to code, to just pass in
the names of the files to be loaded and run them through your favorite
parsing implementation within the Map implementation.  Alternatively, if
the files are small enough, you could load the XML bytes into a sequence
file, and then just read them out as BytesWritable - again, into your
favorite parser.  (In fact, if you're dealing with XML files below the
block size of HDFS, that's probably the better way to do it.)
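For example, here is a bare-bones sketch of that second approach (the class
name and XPath expression are made up for illustration; it assumes each
record of a SequenceFile carries the file name as a Text key and the raw XML
of one document as a BytesWritable value, and it uses the old mapred API):

import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.w3c.dom.Document;

/** Runs a fixed XPath expression over each XML document and emits the result. */
public class XPathMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  private static final String QUERY = "/doc/title/text()";   // example expression

  public void map(Text fileName, BytesWritable xmlBytes,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    try {
      // Parse the raw bytes of one document into a DOM.
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xmlBytes.get(), 0, xmlBytes.getSize()));

      // evaluate() returns the string value of the first match; emit filename -> match.
      XPath xpath = XPathFactory.newInstance().newXPath();
      String result = xpath.evaluate(QUERY, doc);
      output.collect(fileName, new Text(result));
    } catch (Exception e) {
      // Skip documents that fail to parse rather than killing the task.
      reporter.setStatus("skipping malformed document: " + fileName);
    }
  }
}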

Brian

Kayla Jay wrote:
| Hi
|
| Just wondering if anyone out there works with and manipulates and
| stores XML data using Hadoop?  I've seen some threads about XML
| RecordReaders and people who use that XML StreamXmlRecordReader to do
| splits.  But, has anyone implemented a query framework that will use
| the hadoop layer to query against the XML in their map/reduce jobs?
|
| I want to know if anyone has done an XQuery or XPath executed within
| a haoop job to find something within the XML stored in hadoop?
|
| I can't find any samples or anyone else out there who uses XML data
| vs. traditional log text data.
|
| Are there any use cases of using hadoop to work with XML and then do
| queries against XML in a distributed manner using hadoop?
|
| Thanks.
|
|
|
|


Re: Working with XML / XQuery in hadoop

2008-06-23 Thread Stefan Groschupf

Yep, we do.
We have an XML Writable that uses XUM behind the scenes. It has a
getDom() and a getNode(xquery) method. In readFields we read the byte
array and create the XUM DOM object from it; write simply triggers
BinaryCodec.serialize and writes the bytes out.
The same approach would work if you de/serialize the XML as text, though
we found that to be slower than XUM. XUM has worked pretty stably for us,
although it has some other issues (you need to use BinaryCodec as a JVM
singleton, etc.).

In general this works pretty well.
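For illustration, here is a stripped-down sketch of the idea - not our exact
XUM-based class, just a Writable that stores the raw XML bytes and parses
them lazily with the stock JDK DOM parser:

import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.hadoop.io.Writable;
import org.w3c.dom.Document;

/** Carries an XML document as raw bytes and builds a DOM only when asked. */
public class XmlWritable implements Writable {

  private byte[] bytes = new byte[0];
  private transient Document dom;          // cached DOM, rebuilt after readFields()

  public void set(byte[] xmlBytes) {
    this.bytes = xmlBytes;
    this.dom = null;                       // invalidate the cached DOM
  }

  /** Lazily parse the stored bytes into a DOM. */
  public Document getDom() throws IOException {
    if (dom == null) {
      try {
        dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(bytes));
      } catch (Exception e) {
        throw new IOException("cannot parse XML: " + e.getMessage());
      }
    }
    return dom;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(bytes.length);            // length-prefixed byte array
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    bytes = new byte[in.readInt()];
    in.readFully(bytes);
    dom = null;                            // force a re-parse on the next getDom()
  }
}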
Stefan



On Jun 23, 2008, at 9:38 PM, Kayla Jay wrote:


Hi

Just wondering if anyone out there works with and manipulates and  
stores XML data using Hadoop?  I've seen some threads about XML  
RecordReaders and people who use that XML StreamXmlRecordReader to  
do splits.  But, has anyone implemented a query framework that will  
use the hadoop layer to query against the XML in their map/reduce  
jobs?


I want to know if anyone has done an XQuery or XPath executed within  
a haoop job to find something within the XML stored in hadoop?


I can't find any samples or anyone else out there who uses XML data  
vs. traditional log text data.


Are there any use cases of using hadoop to work with XML and then do  
queries against XML in a distributed manner using hadoop?


Thanks.





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Lev Givon
I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with the
following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.



Re: trouble setting up hadoop

2008-06-23 Thread Stefan Groschupf

Looks like you do not have a proper Java installed.
Make sure you have a Sun Java installed on your nodes, that java is in
your PATH, and that JAVA_HOME is set.
gnu.gcj is the GNU Java compiler, not a Java runtime you can use to run
Hadoop.

Check this on the command line:
$ java -version
You should see something like this:
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

HTH


On Jun 23, 2008, at 9:40 PM, Sandy wrote:

I apologize for the severe basicness of this error, but I am in the  
process
of getting  hadoop set up. I have been following the instructions in  
the
Hadoop quickstart. I have confirmed that bin/hadoop will give me  
help usage

information.

I am now in the stage of standalone operation.

I typed in:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

at which point I get:
Exception in thread "main" java.lang.ClassNotFoundException:
java.lang.Iterable not found in
gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/sjm/Desktop/hado
op-0.16.4/bin/../conf/,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar,file:/home/ 
sjm/Desktop/ha
doop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar,file:/home/sjm/ 
Desktop/hadoo
p-0.16.4/bin/../lib/commons-codec-1.3.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/b
in/../lib/commons-httpclient-3.0.1.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/bin/
../lib/commons-logging-1.0.4.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/bin/../lib
/commons-logging-api-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../lib/j
ets3t-0.5.0.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
jetty-5.1.4.jar,
file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
junit-3.8.1.jar,file:/home/sjm/D
esktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4
/bin/../lib/log4j-1.2.13.jar,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../lib/ser
vlet-api.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
xmlenc-0.52.jar,fil
e:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons- 
el.jar,file:/home
/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper- 
compiler.jar,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper- 
runtime.jar,file:/home/sjm/

Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar],
parent=gnu.gcj.runtime. ExtensionClassLoader{urls=[], parent=null}}
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.VMClassLoader.defineClass (libgcj.so.7)
  at java.lang.ClassLoader.defineClass (libgcj.so.7)
  at java.security.SecureClassLoader.defineClass (libgcj.so.7)
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at org.apache.hadoop.util.RunJar.main (RunJar.java:107)

I suspect the issue is path related, though I am not certain. Could  
someone

please point me in the right direction?

Much thanks,

SM


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Which user runs Hadoop? It should be the same one you trigger the job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with the
following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Lev Givon
Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT:
> Which user runs the hadoop? It should be the same you trigger the job with.
>
> On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:
>
>> I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
>> MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
>> configured hadoop to use the stock Java 1.5.0_13 installation in
>> /Library/Java/Home. When I attempted to run a simple map/reduce job
>> off of the dfs after starting up the daemons, the job failed with the
>> following Java error (501 is the ID of the user used to start the
>> hadoop daemons and run the map/reduce job):
>>
>> javax.security.auth.login.LoginException: Login failed:
>> /sw/bin/whoami: cannot find name for user ID 501
>>
>> What might be causing this to occur? Manually running /sw/bin/whoami
>> as the user in question returns the corresponding username.
>>
>>  L.G.
>>
>>
>
> ~~~
> 101tec Inc.
> Menlo Park, California, USA
> http://www.101tec.com
>
>


Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf

The Fink part and /sw confuse me.
When I do a which on my OS X box I get:
$ which whoami
/usr/bin/whoami
Are you using the same whoami on your console as Hadoop does?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT:
Which user runs the hadoop? It should be the same you trigger the  
job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in / 
sw. I

configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with  
the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Lev Givon
Yes; I have my PATH configured to list /sw/bin before
/usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I
set PATH to 

/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin:/usr/local/bin

before starting the daemons and attempting to run the job.

  L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT:
> The fink part and /sw confuses me.  When I do a which on my os x I
> get: $ which whoami /usr/bin/whoami Are you using the same whoami on
> your console as hadoop?
>
> On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:
>
>> Both the daemons and the job were started using the same user.
>>
>>  L.G.
>>
>> Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT:
>>> Which user runs the hadoop? It should be the same you trigger the job 
>>> with.
>>>
>>> On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:
>>>
 I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
 MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
 configured hadoop to use the stock Java 1.5.0_13 installation in
 /Library/Java/Home. When I attempted to run a simple map/reduce job
 off of the dfs after starting up the daemons, the job failed with the
 following Java error (501 is the ID of the user used to start the
 hadoop daemons and run the map/reduce job):

 javax.security.auth.login.LoginException: Login failed:
 /sw/bin/whoami: cannot find name for user ID 501

 What might be causing this to occur? Manually running /sw/bin/whoami
 as the user in question returns the corresponding username.

L.G.


>>>
>>> ~~~
>>> 101tec Inc.
>>> Menlo Park, California, USA
>>> http://www.101tec.com
>>>
>>>
>>
>
> ~~~
> 101tec Inc.
> Menlo Park, California, USA
> http://www.101tec.com
>
>


realtime hadoop

2008-06-23 Thread Vadim Zaliva
Hi!

I am considering using Hadoop for (almost) realtime data processing. I
have data coming in every second and I would like to use a Hadoop cluster
to process it as fast as possible. I need to be able to maintain some
guaranteed max processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in this manner? I would
appreciate it if you could share your experience or give me pointers
to some articles or pages on the subject.

Vadim


Re: realtime hadoop

2008-06-23 Thread Stefan Groschupf

Hadoop might be the wrong technology for you.
Map/Reduce is a batch-processing mechanism. HDFS might also be
problematic, since to access your data you need to close the file first -
which means you can end up with many small files, a situation where HDFS
is not very strong (the namespace is held in memory).
HBase might be an interesting tool for you, and also ZooKeeper if you
want to do something home-grown...




On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:


Hi!

I am considering using Hadoop for (almost) realime data processing. I
have data coming every second and I would like to use hadoop cluster
to process
it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Sorry, I'm not a Unix expert, but the problem is clearly related to
whoami, since that is what throws the error.

I run Hadoop in all kinds of configurations, super smoothly, on my OS X boxes.
Maybe rename or move /sw/bin/whoami for a test.
Also make sure you restart the OS X terminal, since changes
in .bash_profile are only picked up when you log back in to the command
line.

Sorry, that is all I know and can guess... :(

On Jun 23, 2008, at 10:56 PM, Lev Givon wrote:


Yes; I have my PATH configured to list /sw/bin before
/usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I
set PATH to

/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/ 
bin:/usr/local/bin


before starting the daemons and attempting to run the job.

  L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT:

The fink part and /sw confuses me.  When I do a which on my os x I
get: $ which whoami /usr/bin/whoami Are you using the same whoami on
your console as hadoop?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM  
EDT:
Which user runs the hadoop? It should be the same you trigger the  
job

with.

On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in / 
sw. I

configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce  
job
off of the dfs after starting up the daemons, the job failed  
with the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/ 
whoami

as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: trouble setting up hadoop

2008-06-23 Thread Sandy
Hi Stefan,

I think that did it. When I type in
java -version
I now get:
java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Client VM (build 10.0-b22, mixed mode, sharing)

And, when I run:
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z]+'

I get:
08/06/23 17:03:12 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
08/06/23 17:03:13 INFO mapred.FileInputFormat: Total input paths to process
: 2
08/06/23 17:03:13 INFO mapred.JobClient: Running job: job_local_1
08/06/23 17:03:13 INFO mapred.MapTask: numReduceTasks: 1
08/06/23 17:03:13 INFO mapred.LocalJobRunner:
file:/home/sjm/Desktop/hadoop-0.16.4/input/hadoop-site.xml:0+178
08/06/23 17:03:13 INFO mapred.TaskRunner: Task 'job_local_1_map_' done.
08/06/23 17:03:13 INFO mapred.TaskRunner: Saved output of task
'job_local_1_map_' to
file:/home/sjm/Desktop/hadoop-0.16.4/grep-temp-1561747821
08/06/23 17:03:13 INFO mapred.MapTask: numReduceTasks: 1
08/06/23 17:03:13 INFO mapred.LocalJobRunner:
file:/home/sjm/Desktop/hadoop-0.16.4/input/hadoop-default.xml:0+34064
08/06/23 17:03:13 INFO mapred.TaskRunner: Task 'job_local_1_map_0001' done.
08/06/23 17:03:13 INFO mapred.TaskRunner: Saved output of task
'job_local_1_map_0001' to
file:/home/sjm/Desktop/hadoop-0.16.4/grep-temp-1561747821
08/06/23 17:03:13 INFO mapred.LocalJobRunner: reduce > reduce
08/06/23 17:03:13 INFO mapred.TaskRunner: Task 'reduce_ov0kiq' done.
08/06/23 17:03:13 INFO mapred.TaskRunner: Saved output of task
'reduce_ov0kiq' to file:/home/sjm/Desktop/hadoop-0.16.4/grep-temp-1561747821
08/06/23 17:03:14 INFO mapred.JobClient: Job complete: job_local_1
08/06/23 17:03:14 INFO mapred.JobClient: Counters: 9
08/06/23 17:03:14 INFO mapred.JobClient:   Map-Reduce Framework
08/06/23 17:03:14 INFO mapred.JobClient: Map input records=1125
08/06/23 17:03:14 INFO mapred.JobClient: Map output records=0
08/06/23 17:03:14 INFO mapred.JobClient: Map input bytes=34242
08/06/23 17:03:14 INFO mapred.JobClient: Map output bytes=0
08/06/23 17:03:14 INFO mapred.JobClient: Combine input records=0
08/06/23 17:03:14 INFO mapred.JobClient: Combine output records=0
08/06/23 17:03:14 INFO mapred.JobClient: Reduce input groups=0
08/06/23 17:03:14 INFO mapred.JobClient: Reduce input records=0
08/06/23 17:03:14 INFO mapred.JobClient: Reduce output records=0
08/06/23 17:03:14 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
08/06/23 17:03:14 INFO mapred.FileInputFormat: Total input paths to process
: 1
08/06/23 17:03:14 INFO mapred.JobClient: Running job: job_local_2
08/06/23 17:03:14 INFO mapred.MapTask: numReduceTasks: 1
08/06/23 17:03:14 INFO mapred.LocalJobRunner:
file:/home/sjm/Desktop/hadoop-0.16.4/grep-temp-1561747821/part-0:0+86
08/06/23 17:03:14 INFO mapred.TaskRunner: Task 'job_local_2_map_' done.
08/06/23 17:03:14 INFO mapred.TaskRunner: Saved output of task
'job_local_2_map_' to file:/home/sjm/Desktop/hadoop-0.16.4/output
08/06/23 17:03:14 INFO mapred.LocalJobRunner: reduce > reduce
08/06/23 17:03:14 INFO mapred.TaskRunner: Task 'reduce_448bva' done.
08/06/23 17:03:14 INFO mapred.TaskRunner: Saved output of task
'reduce_448bva' to file:/home/sjm/Desktop/hadoop-0.16.4/output
08/06/23 17:03:15 INFO mapred.JobClient: Job complete: job_local_2
08/06/23 17:03:15 INFO mapred.JobClient: Counters: 9
08/06/23 17:03:15 INFO mapred.JobClient:   Map-Reduce Framework
08/06/23 17:03:15 INFO mapred.JobClient: Map input records=0
08/06/23 17:03:15 INFO mapred.JobClient: Map output records=0
08/06/23 17:03:15 INFO mapred.JobClient: Map input bytes=0
08/06/23 17:03:15 INFO mapred.JobClient: Map output bytes=0
08/06/23 17:03:15 INFO mapred.JobClient: Combine input records=0
08/06/23 17:03:15 INFO mapred.JobClient: Combine output records=0
08/06/23 17:03:15 INFO mapred.JobClient: Reduce input groups=0
08/06/23 17:03:15 INFO mapred.JobClient: Reduce input records=0
08/06/23 17:03:15 INFO mapred.JobClient: Reduce output records=0

Does this all look correct? If so, thank you so much. I really appreciate
all the help!

-SM

On Mon, Jun 23, 2008 at 4:32 PM, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Looks like you have not install a correct java.
> Make sure you have a sun java installed on your nodes and java is in your
> path as well JAVA_HOME should be set.
> I think gnu.gcj is the gnu java compiler but not a java you need to run
> hadoop.
> Check on command line this:
> $ java -version
> you should see something like this:
> java version "1.5.0_13"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
> Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)
>
> HTH
>
>
>
> On Jun 23, 2008, at 9:40 PM, Sandy wrote:
>
>  I apologize for the severe basicness of this error, but I am in the
>> process
>> of getting  hadoop set up. I have been following the instru

Re: realtime hadoop

2008-06-23 Thread Chris Anderson
Vadim,

Depending on the nature of your data, CouchDB (http://couchdb.org)
might be worth looking into. It speaks JSON natively, and has
real-time map/reduce support. The 0.8.0 release is imminent (don't
bother with 0.7.2), and the community is active. We're using it for
something similar to what you describe, and it's working well.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: realtime hadoop

2008-06-23 Thread Konstantin Shvachko

> Also HDFS might be critical since to access your data you need to close the file

Not anymore. Since 0.16, files are readable while being written to.

>> it as fast as possible. I need to be able to maintain some guaranteed
>> max. processing time, for example under 3 minutes.

It looks like you do not need very strict guarantees.
I think you can use HDFS as your data storage.
I don't know what kind of data processing you do, but I agree with Stefan
that map/reduce is designed for batch tasks rather than for real-time
processing.



Stefan Groschupf wrote:

Hadoop might be the wrong technology for you.
Map Reduce is a batch processing mechanism. Also HDFS might be critical 
since to access your data you need to close the file - means you might 
have many small file, a situation where hdfs is not very strong 
(namespace is hold in memory).
Hbase might be an interesting tool for you, also zookeeper if you want 
to do something home grown...




On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:


Hi!

I am considering using Hadoop for (almost) realime data processing. I
have data coming every second and I would like to use hadoop cluster
to process
it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





Re: realtime hadoop

2008-06-23 Thread Ian Holsman (Lists)

Interesting.
We are planning on using Hadoop to provide 'near' real-time log
analysis. We plan on having files close every 5 minutes (1 per log
machine, so 80 files every 5 minutes) and then have an M/R job merge them
into a single file that will get processed by other jobs later on.

Do you think the namespace will explode?

I wasn't thinking of CouchDB... it might be an interesting alternative
once it is a bit more stable.


regards
Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you.
Map Reduce is a batch processing mechanism. Also HDFS might be critical
since to access your data you need to close the file - means you might
have many small file, a situation where hdfs is not very strong
(namespace is hold in memory).
Hbase might be an interesting tool for you, also zookeeper if you want
to do something home grown...



On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:


Hi!

I am considering using Hadoop for (almost) realime data processing. I
have data coming every second and I would like to use hadoop cluster
to process
it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






Re: realtime hadoop

2008-06-23 Thread Matt Kent
We use Hadoop in a similar manner, to process batches of data in
real-time every few minutes. However, we do substantial amounts of
processing on that data, so we use Hadoop to distribute our computation.
Unless you have a significant amount of work to be done, I wouldn't
recommend using Hadoop because it's not worth the overhead of launching
the jobs and moving the data around.

Matt

On Tue, 2008-06-24 at 13:34 +1000, Ian Holsman (Lists) wrote:
> Interesting.
> we are planning on using hadoop to provide 'near' real time log 
> analysis. we plan on having files close every 5 minutes (1 per log 
> machine, so 80 files every 5 minutes) and then have a m/r to merge it 
> into a single file that will get processed by other jobs later on.
> 
> do you think this will namespace will explode?
> 
> I wasn't thinking of clouddb.. it might be an interesting alternative 
> once it is a bit more stable.
> 
> regards
> Ian
> 
> Stefan Groschupf wrote:
> > Hadoop might be the wrong technology for you.
> > Map Reduce is a batch processing mechanism. Also HDFS might be critical
> > since to access your data you need to close the file - means you might
> > have many small file, a situation where hdfs is not very strong
> > (namespace is hold in memory).
> > Hbase might be an interesting tool for you, also zookeeper if you want
> > to do something home grown...
> >
> >
> >
> > On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:
> >
> >> Hi!
> >>
> >> I am considering using Hadoop for (almost) realime data processing. I
> >> have data coming every second and I would like to use hadoop cluster
> >> to process
> >> it as fast as possible. I need to be able to maintain some guaranteed
> >> max. processing time, for example under 3 minutes.
> >>
> >> Does anybody have experience with using Hadoop in such manner? I will
> >> appreciate if you can share your experience or give me pointers
> >> to some articles or pages on the subject.
> >>
> >> Vadim
> >>
> >
> > ~~~
> > 101tec Inc.
> > Menlo Park, California, USA
> > http://www.101tec.com
> >
> >
> 



Re: realtime hadoop

2008-06-23 Thread Fernando Padilla
One use case I have a question about is using Hadoop to power a web
search or other interactive query, where the full job should be done in
under a second, from start to finish.

You know, you have a huge datastore and you have to run a query against
it, implemented as an MR job.  Is there a way to optimize that use
case, where the code doesn't change, only the input parameters of
the job?  Then an MR job could reuse the Java code, and even the same JVM,
to avoid all of the startup costs.

I bet Hadoop isn't built for that yet (and there are enough reasons not to
support it yet)... but maybe it's a use case that shouldn't be totally
ignored.

And if you think about it, this is similar to what HBase is doing, at
least for the query-execution part: a dedicated MR daemon running on top
of the Hadoop infrastructure, so you don't incur the cost of distributing
and starting fresh MR/JVM processes across the cluster.  Maybe someone
would want to refine this line of thought a little bit...




Matt Kent wrote:

We use Hadoop in a similar manner, to process batches of data in
real-time every few minutes. However, we do substantial amounts of
processing on that data, so we use Hadoop to distribute our computation.
Unless you have a significant amount of work to be done, I wouldn't
recommend using Hadoop because it's not worth the overhead of launching
the jobs and moving the data around.

Matt

On Tue, 2008-06-24 at 13:34 +1000, Ian Holsman (Lists) wrote:

Interesting.
we are planning on using hadoop to provide 'near' real time log 
analysis. we plan on having files close every 5 minutes (1 per log 
machine, so 80 files every 5 minutes) and then have a m/r to merge it 
into a single file that will get processed by other jobs later on.


do you think this will namespace will explode?

I wasn't thinking of clouddb.. it might be an interesting alternative 
once it is a bit more stable.


regards
Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you.
Map Reduce is a batch processing mechanism. Also HDFS might be critical
since to access your data you need to close the file - means you might
have many small file, a situation where hdfs is not very strong
(namespace is hold in memory).
Hbase might be an interesting tool for you, also zookeeper if you want
to do something home grown...



On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:


Hi!

I am considering using Hadoop for (almost) realime data processing. I
have data coming every second and I would like to use hadoop cluster
to process
it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com