Re: Indexing on top of Hadoop

2009-06-10 Thread Stefan Groschupf

Hi,
you might find some code at katta.sourceforge.net very helpful.
Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 10, 2009, at 5:49 AM, kartik saxena wrote:


Hi,

I have a huge LDIF file, on the order of GBs, spanning some million user records.
I am running the example Grep job on that file. The search results have not really been up to expectations because it is a basic per-line, brute-force scan.

I was thinking of building some indexes inside HDFS for that file, so that the search results could improve. What could I try to achieve this?



Secura




Re: Distributed Lucene Questions

2009-06-02 Thread Stefan Groschupf

Hi,
you might want to checkout:
http://katta.sourceforge.net/

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote:


Hi All,

I am trying to build a distributed system to build and serve lucene  
indexes.

I came across the Distributed Lucene project-
http://wiki.apache.org/hadoop/DistributedLucene
https://issues.apache.org/jira/browse/HADOOP-3394

and have a couple of questions. It would be really helpful if someone could provide some insights.

1) Is this code production ready?
2) Does someone have performance data for this project?
3) It allows searches and updates/deletes to be performed at the same time.

How well will the system perform if there are frequent updates to the system? Will it handle the search and update load easily, or will it be better to rebuild or update the indexes on different machines and then deploy the indexes back to the machines that are serving the indexes?

Basically I am trying to choose between the 2 approaches-

1) Use Hadoop to build and/or update Lucene indexes and then deploy them on a separate cluster that will take care of load balancing, fault tolerance etc. There is a package in Hadoop contrib that does this, so I can use that code.

2) Use and/or modify the Distributed Lucene code.

I am expecting daily updates to our index, so I am not sure if the Distributed Lucene code (which allows searches and updates on the same indexes) will be able to handle the search and update load efficiently.

Any suggestions?

Thanks,
Tarandeep




ScaleCamp: get together the night before Hadoop Summit

2009-05-13 Thread Stefan Groschupf

Hi All,

We are planning a community event the night before the Hadoop Summit.
This BarCamp (http://en.wikipedia.org/wiki/BarCamp) event will be  
held at the same venue as the Summit (Santa Clara Marriott).

Refreshments will be served to encourage socializing.

To initialize conversations for the social part of the evening, we are offering people the opportunity to present an experience report on their project (within a 15 min presentation).
We have at most 12 slots, in 3 parallel tracks. The focus should be on projects leveraging technologies from the Hadoop eco-system.


Please join us and mingle with the rest of the Hadoop community.

To find out more about this event and signup please visit :
http://www.scaleunlimited.com/events/scale_camp

Please submit your presentation here:
http://www.scaleunlimited.com/about-us/contact


Stefan
P.S. Please spread the word!
P.P.S Apologies for the cross posting.


[ANNOUNCE] Katta 0.5 released

2009-04-09 Thread Stefan Groschupf

(...apologies for the cross posting...)

Release 0.5 of Katta is now available.
Katta - Lucene in the cloud.
http://katta.sourceforge.net


This release fixes bugs from 0.4, including one that sorted results incorrectly under load.
0.5 also upgrades Zookeeper to version 3.1, Lucene to version 2.4.1, and Hadoop to 0.19.0.


The new API supports Lucene Query objects instead of just Strings; the release also adds support for Amazon EC2, switches to Ant and Ivy as the build system, and brings some more minor improvements.
Also, we improved our online documentation and added sample code that illustrates how to create a sharded Lucene index with Hadoop.
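
To give a rough idea of the pattern (this is only a simplified sketch, not the shipped sample code; the class name, field names and paths are made up), each reducer can build one Lucene shard on local disk and copy it into HDFS afterwards:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: one Lucene shard per reduce partition, written to local disk
// and copied into HDFS in close(). Paths and field names are examples only.
public class ShardWriterReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;
  private IndexWriter writer;
  private String localShard;

  public void configure(JobConf conf) {
    this.conf = conf;
    localShard = "/tmp/shard-" + conf.get("mapred.task.partition", "0");
    try {
      writer = new IndexWriter(localShard, new StandardAnalyzer(), true);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("id", key.toString(), Field.Store.YES,
          Field.Index.NOT_ANALYZED));
      doc.add(new Field("content", values.next().toString(), Field.Store.NO,
          Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
  }

  public void close() throws IOException {
    writer.optimize();
    writer.close();
    // copy the finished shard into HDFS next to the job output
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localShard),
        new Path(conf.get("mapred.output.dir"), "shard"));
  }
}

The resulting shard directories are then what katta distributes and serves.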


See changes at
http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel

Binary distribution is available at
https://sourceforge.net/projects/katta/

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





ec2 ganglia fixing missing graphs

2009-01-09 Thread Stefan Groschupf

Hi,
for the mail archive...
I'm using the hadoop ec2 scripts and noticed that ganglia actually does not show any graphs.
I was able to fix this by adding dejavu-fonts to the packages that are installed via yum in create-hadoop-image-remote.sh.

The line now looks like this:
yum -y install rsync lynx screen ganglia-gmetad ganglia-gmond ganglia-web dejavu-fonts httpd php

Since this affects the hadoop image, it might be interesting to fix this and create a new public AMI.

Stefan


~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





contrib/ec2 USER_DATA not used

2008-12-18 Thread Stefan Groschupf

Hi,

can someone tell me what the variable USER_DATA in launch-hadoop-master is all about?

I can't see that it is reused in that script or any other script.
Isn't USER_DATA_FILE the way those parameters are passed to the nodes?

The line is:
USER_DATA=MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS

Any hints?
Thanks,
Stefan
~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com






Re: [video] visualization of the hadoop code history

2008-12-17 Thread Stefan Groschupf
Very cool stuff, but I don't see a reference anywhere to the author  
of the
visualization, which seems like poor form for a marketing video. I  
apologize

if I missed a reference somewhere.


Jeff, you missed it!
It is the first text screen at the end of the video.
It is actually a cool open source project with quite some contributors.

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





Re: [video] visualization of the hadoop code history

2008-12-17 Thread Stefan Groschupf

Owen O'Malley wrote:
It is interesting, but it would be more interesting to track the  
authors of the patch rather than the committer. The two are rarely  
the same.


Indeed.  There was a period of over a year where I wrote hardly  
anything but committed almost everything.  So I am vastly  
overrepresented in commits.



Thanks for the feedback.

The video was rendered from the svn log file (text version). If someone has a script that cleans this file up and replaces the committer name with the real patch author, we are happy to render the video again.



Cheers,
Stefan
~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com






New Orleans - drinks tonight

2008-11-05 Thread Stefan Groschupf
Hey, anyone else early for the hadoop bootcamp as well? How about meeting for drinks tonight? Send me a mail offlist...

Stefan




Re: Hadoop Profiling!

2008-10-08 Thread Stefan Groschupf
Just run your map reduce job locally and connect your profiler. I use yourkit.

Works great!
You can profile your map reduce job running in local mode just like any other java app.
However we also profiled on a grid. You just need to install the yourkit agent into the jvm of the node you want to profile, and then you connect to the node while the job runs.
However you need to time things well, since the task jvm is shut down as soon as your job is done.
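
Roughly, the wiring looks like this (just a sketch, not our exact setup; the class name and the agent path are examples, so point the agent path at wherever the YourKit library is installed on your nodes):

import org.apache.hadoop.mapred.JobConf;

public class ProfiledJobConfig {
  // Sketch only: pass the YourKit agent to the task JVMs of a single job.
  public static JobConf create() {
    JobConf conf = new JobConf();
    conf.set("mapred.child.java.opts",
        "-Xmx512m -agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so");
    return conf;
  }
}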

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote:


Hi!

I've developed a Map/Reduce algorithm to analyze some logs from web
application.

So basically, we are ready to start QA test phase, so now, I would  
like to

now how efficient is my application
from performance point of view.

So is there any procedure I could use to do some profiling?


Basically I need basi data, like time excecution or code bottlenecks.


Thanks in advance.

-- Gerardo Velez




Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Stefan Groschupf

try jmx. There should also be a jmx-to-snmp bridge available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp
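
As an illustration only (host, port and the checked attribute are assumptions; you first need to enable remote JMX on the datanode JVM, e.g. via the standard com.sun.management.jmxremote.* system properties), a Nagios-style check could connect like this:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxHeapCheck {
  public static void main(String[] args) throws Exception {
    // Connect to a JVM that has remote JMX enabled (host/port are examples).
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://datanode-host:8004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbeans = connector.getMBeanServerConnection();
      // java.lang:type=Memory exists in every JVM; Hadoop registers its own
      // MBeans as well, which can be read the same way.
      Object heap = mbeans.getAttribute(
          new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
      System.out.println("HeapMemoryUsage: " + heap);
    } finally {
      connector.close();
    }
  }
}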

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:


Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

Some of you may have some experience here - do you have any approach or advice I could use?

At this time I've only been playing with the jsp files that hadoop has integrated into it, so I'm not sure if it would be a good idea for nagios to request monitoring info from these jsps?


Thanks in advance!


-- Gerardo




Re: Searching Lucene Index built using Hadoop

2008-10-06 Thread Stefan Groschupf

Hi,
you might find http://katta.wiki.sourceforge.net/ interesting. If you have any katta related questions please use the katta mailing list.

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:26 AM, Saranath wrote:



I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().
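
Roughly, that merge step looked like the sketch below (simplified; the analyzer and the paths are just examples, and the shards were first copied out of HDFS to local disk):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
  public static void main(String[] args) throws Exception {
    // Target index that will hold the merged shards (example path).
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/merged-index"),
        new StandardAnalyzer(), true);
    // Shard indexes previously copied to local disk (example paths).
    Directory[] shards = new Directory[] {
        FSDirectory.getDirectory("/tmp/shard-00000"),
        FSDirectory.getDirectory("/tmp/shard-00001")
    };
    writer.addIndexesNoOptimize(shards);
    writer.optimize();
    writer.close();
  }
}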

I would like to hear your input on the best way to index+search large datasets.


Thanks,
Saranath
--
View this message in context: 
http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.






Re: Katta presentation slides

2008-09-23 Thread Stefan Groschupf

Hi All,

thanks a lot for your interest.
Both my katta and the hadoop survey slides can be found here:

http://find23.net/2008/09/23/hadoop-user-group-slides/

If you have a chance please give katta a test drive and give us some  
feedback.


Thanks,
Stefan


On Sep 23, 2008, at 6:20 PM, Rafael Turk wrote:


+1

On Tue, Sep 23, 2008 at 5:14 AM, Naama Kraus [EMAIL PROTECTED]  
wrote:



I'd be interested too. Naama

On Mon, Sep 22, 2008 at 11:32 PM, Deepika Khera [EMAIL PROTECTED]

wrote:



Hi Stefan,



Are the slides from the Katta presentation up somewhere? If not then
could you please post them?



Thanks,
Deepika





--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00  
oo 00 oo

00 oo 00 oo
If you want your children to be intelligent, read them fairy  
tales. If you
want them to be more intelligent, read them more fairy  
tales. (Albert

Einstein)





[ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-17 Thread Stefan Groschupf
After 5 months of work we are happy to announce the first developer preview release of katta.
This release contains all the functionality needed to serve a large, sharded lucene index on many servers.
Katta is standing on the shoulders of the giants lucene, hadoop and zookeeper.


Main features:
+ Plays well with Hadoop
+ Apache Version 2 License.
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Plug-able network topologies (shard distribution and selection policies)

+ Node load balancing at client



Please give katta a test drive and give us some feedback!

Download:
http://sourceforge.net/project/platformdownload.php?group_id=225750

website:
http://katta.sourceforge.net/

Getting started in less than 3 min:
http://katta.wiki.sourceforge.net/Getting+started

Installation on a grid:
http://katta.wiki.sourceforge.net/Installation

Katta presentation today (09/17/08) at the Hadoop user group meeting, Yahoo Mission College:

http://upcoming.yahoo.com/event/1075456/
* slides will be available online later


Many thanks for the hard work:
Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec)

I apologize for the cross posting.


Yours, the Katta Team.

~~~
101tec Inc., Menlo Park, California
http://www.101tec.com






how to LZO

2008-07-29 Thread Stefan Groschupf

Hi,
I would love to use the lzo codec. However for some reason I always only get ...
INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library


My hadoop-site looks like:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>

I also think I have lzo installed on all my nodes:
yum list | grep lzo
lzo.x86_64 2.02-3.fc8 installed
lzo.i386 2.02-3.fc8 installed
lzo-devel.i386 2.02-3.fc8 fedora
lzo-devel.x86_64 2.02-3.fc8 fedora
lzop.x86_64 1.02-0.5.rc1.fc8 fedora
Anything I might be missing that you can think of?
Thanks for any hints!

Stefan



Meet Hadoop presentation: the math from page 5

2008-06-23 Thread Stefan Groschupf

Hi,
I tried to better understand slide 5 of meet hadoop:
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf
The slide says:
given:
–10MB/s transfer
–10ms/seek
–100B/entry (10B entries)
–10kB/page (1B pages)

updating 1% of entries (100M) takes:
–1000 days with random B-Tree updates
–100 days with batched B-Tree updates
–1 day with sort & merge

I wonder how exactly to calculate the 1000 days and 100 days.
time for seeking = 100 000 000 * lg(1 000 000 000) * 10 ms = 346.034177 days
time to read all pages = 100 000 000 * lg(1 000 000 000) * (10 kB / 10 MB/s) = 33.7924001 days
Since we might need to write all pages again we can add another 33 days, but the result is still not 1000 days, so I am doing something fundamentally wrong. :o


Thanks for any help...

Stefan



Re: trouble setting up hadoop

2008-06-23 Thread Stefan Groschupf

Looks like you have not installed a correct Java.
Make sure you have a Sun Java installed on your nodes, that java is in your path, and that JAVA_HOME is set.
I think gnu.gcj is the GNU java compiler, but it is not a Java you can use to run hadoop.

Check this on the command line:
$ java -version
You should see something like this:
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

HTH


On Jun 23, 2008, at 9:40 PM, Sandy wrote:

I apologize for the severe basicness of this error, but I am in the process of getting hadoop set up. I have been following the instructions in the Hadoop quickstart. I have confirmed that bin/hadoop will give me help usage information.

I am now at the stage of standalone operation.

I typed in:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

at which point I get:
Exception in thread "main" java.lang.ClassNotFoundException: java.lang.Iterable not found in
gnu.gcj.runtime.SystemClassLoader{urls=[
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../conf/,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-codec-1.3.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-httpclient-3.0.1.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-1.0.4.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-api-1.0.4.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jets3t-0.5.0.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-5.1.4.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/junit-3.8.1.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/log4j-1.2.13.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/servlet-api.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/xmlenc-0.52.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons-el.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-compiler.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-runtime.jar,
  file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar],
parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.VMClassLoader.defineClass (libgcj.so.7)
  at java.lang.ClassLoader.defineClass (libgcj.so.7)
  at java.security.SecureClassLoader.defineClass (libgcj.so.7)
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at org.apache.hadoop.util.RunJar.main (RunJar.java:107)

I suspect the issue is path related, though I am not certain. Could  
someone

please point me in the right direction?

Much thanks,

SM


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Which user runs hadoop? It should be the same one you trigger the job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with the
following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf

The fink part and /sw confuses me.
When I do a which on my os x I get:
$ which whoami
/usr/bin/whoami
Are you using the same whoami on your console as hadoop?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT:
Which user runs the hadoop? It should be the same you trigger the  
job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in / 
sw. I

configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with  
the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Sorry, I'm not a unix expert, however the problem is clearly related to whoami since this throws an error.

I run hadoop in all kinds of configurations super smoothly on my os x boxes.
Maybe rename or move /sw/whoami for a test.
Also make sure you restart the os x console, since changes in .bash_profile are only picked up when you log back into the command line.

Sorry, that is all I know and could guess.. :(

On Jun 23, 2008, at 10:56 PM, Lev Givon wrote:


Yes; I have my PATH configured to list /sw/bin before
/usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I
set PATH to

/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin:/usr/local/bin


before starting the daemons and attempting to run the job.

  L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT:

The fink part and /sw confuses me.  When I do a which on my os x I
get: $ which whoami /usr/bin/whoami Are you using the same whoami on
your console as hadoop?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM  
EDT:
Which user runs the hadoop? It should be the same you trigger the  
job

with.

On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in / 
sw. I

configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce  
job
off of the dfs after starting up the daemons, the job failed  
with the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/ 
whoami

as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: [some bugs] Re: file permission problem

2008-03-15 Thread Stefan Groschupf

Great - it is even already fixed in 0.16.1!
Thanks for the hint!
Stefan

On Mar 14, 2008, at 2:49 PM, Andy Li wrote:


I think this is the same problem related to this mail thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html

A JIRA has been filed, please see HADOOP-2915.

On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf [EMAIL PROTECTED]  
wrote:



Hi,
any magic we can do with hadoop.dfs.umask? Or is there any other off
switch for the file security?
Thanks.
Stefan
On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:


Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my
point of view.
So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
namesystem.startFile(src,
  new PermissionStatus(Server.getUserInfo().getUserName(),
null, masked),
  clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
// This is to support local calls (as opposed to rpc ones) to the
name-node.
  // Currently it is name-node specific and should be placed
somewhere else.
  try {
return UnixUserGroupInformation.login();
The login javaDoc says:
/**
 * Get current user's name and the names of all its groups from  
Unix.

 * It's assumed that there is only one UGI per user. If this user
already
 * has a UGI in the ugi map, return the ugi in the map.
 * Otherwise get the current user's information from Unix, store it
 * in the map, and return it.
 */

Beside of that I had some interesting observations.
If I have permissions to write to a folder A I can delete folder A
and file B that is inside of folder A even if I do have no
permissions for B.

Also I noticed following in my dfs
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL   dir   2008-03-13 16:00   rwxr-xr-x   hadoop   supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0   r 3   27311   2008-03-13 16:00   rw-r--r--   joa23   supergroup

Do I miss something or was I able to write as user joa23 into a
folder owned by hadoop where I should have no permissions. :-O.
Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.

I assume you mean the 0.16.0 release (

http://hadoop.apache.org/core/releases.html

) without any additional patch.

I just have tried it but cannot reproduce the problem you
described.  I did the following:
1) start a cluster with tsz
2) run a job with nicholas

The output directory and files are owned by nicholas.  Am I doing
the same thing you did?  Could you try again?

Nicholas



- Original Message 
From: Johannes Zillmann [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a
remote hadoop cluster.
After the job finished i do some file operations on the generated
output.
The cluster-user is different to the laptop-user. As output i
specify a directory inside the users home. This output directory,
created through the map-reduce job has cluster-user permissions,
so
this does not allow me to move or delete the output folder with my
laptop-user.

So it looks as follow:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting  
the

output folder):
- jobConf.setUser(hadoop) on the client side
- System.setProperty(user.name,hadoop) before jobConf
instantiation
on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermision(777) on the home folder on the client side (does
not work
recursiv)
- setPermision(777) on the output folder on the client side
(permission
denied)
- create the output folder before running the job (Output  
directory

already exists exception)

None of the things i tried worked. Is there a way to achieve what
i want ?
Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




[memory leak?] Re: MapReduce failure

2008-03-15 Thread Stefan Groschupf

Hi there,

we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem.
We run our own Mapper and our mapred.child.java.opts is -Xmx3048m; I think that should be more than enough.

Also I changed io.sort.mb to 10, which also had no impact.
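
For reference, this is roughly where those two settings get applied per job (just a sketch, not our actual code; they can also go into hadoop-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class JobMemorySettings {
  // Sketch of the settings discussed above, applied per job via JobConf.
  public static JobConf apply(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx3048m"); // heap for the task JVMs
    conf.setInt("io.sort.mb", 10);                   // map-side sort buffer size
    return conf;
  }
}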

Any ideas what might cause the OutOfMemoryError ?
Thanks.
Stefan




On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote:

What is the heap size you are using for your tasks? Check  
'mapred.child.java.opts' in your hadoop-default.xml. Try increasing  
it. This will happen if you try running the random-writer + sort  
examples with default parameters. The maps are not able to spill the  
data to the disk. Btw what version of HADOOP are you using?

Amar
On Mon, 10 Mar 2008, Ved Prakash wrote:


Hi friends,

I have made a cluster of 3 machines, one of them is the master and the other 2 are slaves. I executed a mapreduce job on the master, but after Map the execution terminates and Reduce doesn't happen. I have checked dfs and no output folder gets created.

this is the error I see

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.write(Text.java:243)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
        at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
        at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

08/03/10 10:35:22 INFO mapred.JobClient:  map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient:  map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient:  map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.write(Text.java:243)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
        at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
        at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening?

Thanks





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




[some bugs] Re: file permission problem

2008-03-14 Thread Stefan Groschupf

Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my point  
of view.

So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
 namesystem.startFile(src,
new PermissionStatus(Server.getUserInfo().getUserName(),  
null, masked),

clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
 // This is to support local calls (as opposed to rpc ones) to the  
name-node.
// Currently it is name-node specific and should be placed  
somewhere else.

try {
  return UnixUserGroupInformation.login();
The login javaDoc says:
 /**
   * Get current user's name and the names of all its groups from Unix.
   * It's assumed that there is only one UGI per user. If this user  
already

   * has a UGI in the ugi map, return the ugi in the map.
   * Otherwise get the current user's information from Unix, store it
   * in the map, and return it.
   */

Besides that, I had some interesting observations.
If I have permission to write to a folder A, I can delete folder A and a file B that is inside of folder A, even if I have no permissions for B.


Also I noticed following in my dfs
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL	dir		2008-03-13 16:00	 
rwxr-xr-x	hadoop	supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598/VOICE_CALL

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0	r 3	27311	 
2008-03-13 16:00	rw-r--r--	joa23	supergroup


Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O

Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.
I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html 
) without any additional patch.


I just have tried it but cannot reproduce the problem you  
described.  I did the following:

1) start a cluster with tsz
2) run a job with nicholas

The output directory and files are owned by nicholas.  Am I doing  
the same thing you did?  Could you try again?


Nicholas



- Original Message 
From: Johannes Zillmann [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a
remote hadoop cluster.
After the job finished i do some file operations on the generated  
output.

The cluster-user is different to the laptop-user. As output i
specify a directory inside the users home. This output directory,
created through the map-reduce job has cluster-user permissions, so
this does not allow me to move or delete the output folder with my
laptop-user.

So it looks as follow:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting the
output folder):
- jobConf.setUser(hadoop) on the client side
- System.setProperty(user.name,hadoop) before jobConf  
instantiation

on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermision(777) on the home folder on the client side (does not  
work

recursiv)
- setPermision(777) on the output folder on the client side  
(permission

denied)
- create the output folder before running the job (Output directory
already exists exception)

None of the things i tried worked. Is there a way to achieve what i  
want ?

Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: [some bugs] Re: file permission problem

2008-03-14 Thread Stefan Groschupf

Hi,
any magic we can do with hadoop.dfs.umask? Or is there any other off  
switch for the file security?

Thanks.
Stefan
On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:


Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my  
point of view.

So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
namesystem.startFile(src,
   new PermissionStatus(Server.getUserInfo().getUserName(),  
null, masked),

   clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
// This is to support local calls (as opposed to rpc ones) to the  
name-node.
   // Currently it is name-node specific and should be placed  
somewhere else.

   try {
 return UnixUserGroupInformation.login();
The login javaDoc says:
/**
  * Get current user's name and the names of all its groups from Unix.
  * It's assumed that there is only one UGI per user. If this user  
already

  * has a UGI in the ugi map, return the ugi in the map.
  * Otherwise get the current user's information from Unix, store it
  * in the map, and return it.
  */

Beside of that I had some interesting observations.
If I have permissions to write to a folder A I can delete folder A  
and file B that is inside of folder A even if I do have no  
permissions for B.


Also I noticed following in my dfs
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL	dir		2008-03-13 16:00	 
rwxr-xr-x	hadoop	supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598/VOICE_CALL

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0	r 3	27311	 
2008-03-13 16:00	rw-r--r--	joa23	supergroup


Do I miss something or was I able to write as user joa23 into a  
folder owned by hadoop where I should have no permissions. :-O.

Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.
I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html 
) without any additional patch.


I just have tried it but cannot reproduce the problem you  
described.  I did the following:

1) start a cluster with tsz
2) run a job with nicholas

The output directory and files are owned by nicholas.  Am I doing  
the same thing you did?  Could you try again?


Nicholas



- Original Message 
From: Johannes Zillmann [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a
remote hadoop cluster.
After the job finished i do some file operations on the generated  
output.

The cluster-user is different to the laptop-user. As output i
specify a directory inside the users home. This output directory,
created through the map-reduce job has cluster-user permissions,  
so

this does not allow me to move or delete the output folder with my
laptop-user.

So it looks as follow:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting the
output folder):
- jobConf.setUser(hadoop) on the client side
- System.setProperty(user.name,hadoop) before jobConf  
instantiation

on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermision(777) on the home folder on the client side (does  
not work

recursiv)
- setPermision(777) on the output folder on the client side  
(permission

denied)
- create the output folder before running the job (Output directory
already exists exception)

None of the things i tried worked. Is there a way to achieve what  
i want ?

Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: Hadoop summit / workshop at Yahoo!

2008-02-22 Thread Stefan Groschupf

Puhh, 2 days and it is full?
Does Yahoo have no bigger rooms than one for just 100 people?



On Feb 20, 2008, at 12:10 PM, Ajay Anand wrote:


The registration page for the Hadoop summit is now up:
http://developer.yahoo.com/hadoop/summit/

Space is limited, so please sign up early if you are interested in
attending.

About the summit:
Yahoo! is hosting the first summit on Apache Hadoop on March 25th in
Sunnyvale. The summit is sponsored by the Computing Community  
Consortium

(CCC) and brings together leaders from the Hadoop developer and user
communities. The speakers will cover topics in the areas of extensions
being developed for Hadoop, case studies of applications being built  
and

deployed on Hadoop, and a discussion on future directions for the
platform.

Agenda:
8:30-8:55 Breakfast
8:55-9:00 Welcome to Yahoo! & Logistics - Ajay Anand, Yahoo!
9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 Pig - Chris Olston, Yahoo!
10:00-10:30 JAQL - Kevin Beyer, IBM
10:30-10:45 Break
10:45-11:15 DryadLINQ - Michael Isard, Microsoft
11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei
Zaharia, UC Berkeley
11:45-12:15 Zookeeper - Ben Reed, Yahoo!
12:15-1:15 Lunch
1:15-1:45 Hbase - Michael Stack, Powerset
1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
2:45-3:00 Break
3:00-3:20 Building Ground Models of Southern California - Steve
Schossler, David O'Hallaron, Intel / CMU
3:20-3:40 Online search for engineering design content - Mike Haley,
Autodesk
3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland /
Christophe Bisciglia, Google
4:30-4:45 Break
4:45-5:30 Panel on future directions
5:30-7:00 Happy hour

Look forward to seeing you there!
Ajay

-Original Message-
From: Bradford Stephens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 20, 2008 9:17 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop summit / workshop at Yahoo!

Hrm yes, I'd like to make a visit as well :)

On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:

 Hey All:

 Is this going forward?  I'd like to make plans to attend and the

sooner I can get plane tickets the happier the bean counters will be
:-).


 Thx,
 C G


Ajay Anand wrote:


Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are seeing

from

developers in a broad range of organizations, this seems like a

good

time to get together and brief each other on the progress that is
being
made.



We would like to cover topics in the areas of extensions being
developed
for Hadoop, innovative applications being built and deployed on
Hadoop,
and future extensions to the platform. Some of the speakers who

have

already committed to present are from organizations such as IBM,
Intel,
Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and
we are
actively recruiting other leaders in the space.



If you have an innovative application you would like to talk about,
please let us know. Although there are limitations on the amount of
time
we have, we would love to hear from you. You can contact me at
[EMAIL PROTECTED]



Thanks and looking forward to hearing about your cool apps,

Ajay







--
View this message in context:

http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15393386.html

Sent from the Hadoop lucene-users mailing list archive at

Nabble.com.









-
Be a better friend, newshound, and know-it-all with Yahoo! Mobile.

Try it now.



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




broadcasting: pig user meeting, Friday, February 8, 2008

2008-02-07 Thread Stefan Groschupf

Hi there,
sorry for cross posting.
If everything works out we will video broadcast the event here:
http://ustream.tv/channel/apache-pig-user-meeting
But no guarantee - sorry.
Also we will try to set up a telephone call-in number - please write me a private email if you are interested and I will send out a number.


See you tomorrow.
Stefan


On Feb 6, 2008, at 3:54 PM, Andrzej Bialecki wrote:


Otis Gospodnetic wrote:
Sorry about the word-wrapping (original email) - Yahoo Mail  
problem :(
Is anyone going to be capturing the Piglet meeting on video for the  
those of us living in other corners of the planet?



Please do! It's too far from Poland to just casually drop by .. ;)


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: pig user meeting, Friday, February 8, 2008

2008-02-06 Thread Stefan Groschupf

Hi Otis,
can you suggest a technology for how we could do that? Skype? iChat? Something that is free?
I'm happy to set up a video conference, however there are no big presentations planned.
I was thinking I could give an overview of how we use pig for our current project, just to reflect our use cases.

But besides that I guess it is just pizza and beer.

Cheers,
Stefan





On Feb 6, 2008, at 11:40 AM, Otis Gospodnetic wrote:


Sorry about the word-wrapping (original email) - Yahoo Mail problem :(

Is anyone going to be capturing the Piglet meeting on video for the  
those of us living in other corners of the planet?


Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 

From: Stefan Groschupf [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, January 31, 2008 7:09:53 PM
Subject: pig user meeting, Friday, February 8, 2008

Hi there,

a couple of people plan to meet and talk about apache pig next Friday in the Mountain View area. (Event location is not yet sure).
If you are interested please RSVP asap, so we can plan what kind of location size we are looking for.

http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com









~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com