Re: Indexing on top of Hadoop

2009-06-10 Thread Stefan Groschupf

Hi,
you might find the code at katta.sourceforge.net very helpful.
Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 10, 2009, at 5:49 AM, kartik saxena wrote:


Hi,

I have a huge LDIF file, on the order of GBs, spanning some million user records.
I am running the example "Grep" job on that file. The search results have not really been up to expectations, because it is a basic per-line, brute-force scan.

I was thinking of building some indexes inside HDFS for that file, so that the search results could improve. What could I possibly try in order to achieve this?



Secura




Re: Distributed Lucene Questions

2009-06-02 Thread Stefan Groschupf

Hi,
you might want to checkout:
http://katta.sourceforge.net/

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote:


Hi All,

I am trying to build a distributed system to build and serve lucene  
indexes.

I came across the Distributed Lucene project-
http://wiki.apache.org/hadoop/DistributedLucene
https://issues.apache.org/jira/browse/HADOOP-3394

and have a couple of questions. It would be really helpful if someone could provide some insights.

1) Is this code production ready?
2) Does anyone have performance data for this project?
3) It allows searches and updates/deletes to be performed at the same time.

How well will the system perform if there are frequent updates? Will it handle the search and update load easily, or would it be better to rebuild or update the indexes on different machines and then deploy the indexes back to the machines that are serving them?

Basically I am trying to choose between the 2 approaches-

1) Use Hadoop to build and/or update Lucene indexes and then deploy them on a separate cluster that takes care of load balancing, fault tolerance, etc. There is a package in Hadoop contrib that does this, so I can use that code.

2) Use and/or modify the Distributed Lucene code.

I am expecting daily updates to our index, so I am not sure if the Distributed Lucene code (which allows searches and updates on the same indexes) will be able to handle the search and update load efficiently.

Any suggestions ?

Thanks,
Tarandeep




ScaleCamp: get together the night before Hadoop Summit

2009-05-13 Thread Stefan Groschupf

Hi All,

We are planning a community event the night before the Hadoop Summit.
This "BarCamp" (http://en.wikipedia.org/wiki/BarCamp) event will be held at the same venue as the Summit (Santa Clara Marriott).

Refreshments will be served to encourage socializing.

To kick off conversations for the social part of the evening, we are offering people the opportunity to present an experience report on their project (within a 15-minute presentation).
We have at most 12 slots in 3 parallel tracks. The focus should be on projects leveraging technologies from the Hadoop ecosystem.


Please join us and mingle with the rest of the Hadoop community.

To find out more about this event and to sign up, please visit:
http://www.scaleunlimited.com/events/scale_camp

Please submit your presentation here:
http://www.scaleunlimited.com/about-us/contact


Stefan
P.S. Please spread the word!
P.P.S. Apologies for the cross-posting.


[ANNOUNCE] Katta 0.5 released

2009-04-09 Thread Stefan Groschupf

(...apologies for the cross posting...)

Release 0.5 of Katta is now available.
Katta - Lucene in the cloud.
http://katta.sourceforge.net


This release fixes bugs from 0.4, including one that sorted the results incorrectly under load.
0.5 also upgrades Zookeeper to version 3.1, Lucene to version 2.4.1, and hadoop to 0.19.0.


The new API supports Lucene Query objects instead of just Strings; the release also adds support for Amazon EC2, switches the build system to Ant and Ivy, and brings some more minor improvements.
We also improved our online documentation and added sample code that illustrates how to create a sharded Lucene index with Hadoop.


See changes at
http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel

Binary distribution is available at
https://sourceforge.net/projects/katta/

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





ec2 ganglia fixing missing graphs

2009-01-09 Thread Stefan Groschupf

Hi,
for the mail archive...
I'm using the hadoop ec2 scripts and noticed that ganglia does not actually show any graphs.
I was able to fix this by adding dejavu-fonts to the packages that are installed via yum in create-hadoop-image-remote.sh.

The line now looks like this:
yum -y install rsync lynx screen ganglia-gmetad ganglia-gmond ganglia-web dejavu-fonts httpd php

Since this affects the hadoop image, it might be interesting to fix this and create a new public AMI.

Stefan


~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





contrib/ec2 USER_DATA not used

2008-12-18 Thread Stefan Groschupf

Hi,

Can someone tell me what the variable USER_DATA in launch-hadoop-master is all about?

I can't see that it is used again in that script or in any other script.
Isn't USER_DATA_FILE the way those parameters are passed to the nodes?

The line is:
USER_DATA="MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS"

Any hints?
Thanks,
Stefan
~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com






Re: [video] visualization of the hadoop code history

2008-12-17 Thread Stefan Groschupf
Very cool stuff, but I don't see a reference anywhere to the author of the visualization, which seems like poor form for a marketing video. I apologize if I missed a reference somewhere.


Jeff, you missed it!
It is the first text screen at the end of the video.
It is actually a cool open source project with quite some contributors.

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





Re: [video] visualization of the hadoop code history

2008-12-17 Thread Stefan Groschupf

Owen O'Malley wrote:
It is interesting, but it would be more interesting to track the  
authors of the patch rather than the committer. The two are rarely  
the same.


Indeed.  There was a period of over a year where I wrote hardly  
anything but committed almost everything.  So I am vastly  
overrepresented in commits.



Thanks for the feedback.

The video was rendered from the svn log file (text version). If someone has a script that cleans this file up and replaces the committer name with the real patch author, we are happy to render the video again.



Cheers,
Stefan
~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com






[video] visualization of the hadoop code history

2008-12-16 Thread Stefan Groschupf

Hi friends of Hadoop,

we at ScaleUnlimited.com put together a video that visualizes the code commit history of the Hadoop core project.
It is a neat way of seeing who is behind the Hadoop source code and how the project code base grew over the years.


Check it out here:
http://www.scaleunlimited.com/hadoop-resources.html

Best,
Stefan


~~~
Hadoop training and consulting
http://www.scaleunlimited.com




mbox archive files for hadoop mailing lists.

2008-11-26 Thread Stefan Groschupf

Hi,
Where can I find the mbox archive files for the hadoop user mailing lists?

Thanks,
Stefan



New Orleans - drinks tonight

2008-11-05 Thread Stefan Groschupf
Hey, is anyone else here early for the hadoop bootcamp as well? How about meeting for drinks tonight? Send me a mail off-list...

Stefan




Re: Hadoop Profiling!

2008-10-08 Thread Stefan Groschupf
Just run your map/reduce job locally and connect your profiler. I use YourKit.
Works great!
You can profile your map/reduce job by running it in local mode, just like any other Java app.
However, we have also profiled on a grid. You just need to install the YourKit agent into the JVM of the node you want to profile and then connect to the node while the job runs.
You need to time things well, though, since the task JVM is shut down as soon as your job is done.
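For reference, a minimal sketch of that grid setup, passing the agent to the task JVMs through mapred.child.java.opts (the agent path below is just a placeholder, not an exact recipe):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ProfiledJobLauncher {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ProfiledJobLauncher.class);
    conf.setJobName("profiled-job");
    // Extra JVM options for the map/reduce child processes; adjust the
    // -agentpath to wherever the profiler agent library actually lives.
    conf.set("mapred.child.java.opts",
             "-Xmx512m -agentpath:/opt/yourkit/libyjpagent.so");
    // ... set mapper, reducer, input and output paths as usual ...
    JobClient.runJob(conf);
  }
}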

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote:


Hi!

I've developed a Map/Reduce algorithm to analyze some logs from a web application.

We are basically ready to start the QA test phase, so now I would like to know how efficient my application is from a performance point of view.

Is there any procedure I could use to do some profiling?

Basically I need basic data, like execution time or code bottlenecks.


Thanks in advance.

-- Gerardo Velez




Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Stefan Groschupf

Try JMX. There should also be a JMX-to-SNMP bridge available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp
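As a rough illustration only, a minimal JMX poll that a Nagios plugin could wrap. It assumes remote JMX was enabled on the datanode JVM (the standard -Dcom.sun.management.jmxremote.* flags) on a hypothetical port 8004; the connection URL is an assumption, not a Hadoop default:

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPing {
  public static void main(String[] args) throws Exception {
    // Hypothetical host/port; point this at a datanode with remote JMX enabled.
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://datanode1:8004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbsc = connector.getMBeanServerConnection();
      // Dump everything that is registered; a real check would read one
      // specific attribute and compare it against a threshold.
      for (Object name : mbsc.queryNames(null, null)) {
        System.out.println(name);
      }
    } finally {
      connector.close();
    }
  }
}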

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:


Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

Do any of you have experience here? Is there any approach or advice I could use?

So far I've only been playing with the JSP files that hadoop has integrated into it, so I'm not sure whether it would be a good idea for the Nagios monitor to request info from these JSPs.


Thanks in advance!


-- Gerardo




Re: Searching Lucene Index built using Hadoop

2008-10-06 Thread Stefan Groschupf

Hi,
you might find http://katta.wiki.sourceforge.net/ interesting. If you have any katta-related questions, please use the katta mailing list.

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:26 AM, Saranath wrote:



I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().
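For concreteness, a minimal sketch of that merge-then-search step (Lucene 2.x era APIs; the paths and field name are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeAndSearch {
  public static void main(String[] args) throws Exception {
    // Merge the shard indexes (already copied out of HDFS) into one local index.
    Directory merged = FSDirectory.getDirectory("/local/merged-index");
    IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
    writer.addIndexesNoOptimize(new Directory[] {
        FSDirectory.getDirectory("/local/shard-0"),
        FSDirectory.getDirectory("/local/shard-1")
    });
    writer.optimize();
    writer.close();

    // Search the merged index; "content" is a placeholder field name.
    IndexSearcher searcher = new IndexSearcher(merged);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Hits hits = searcher.search(parser.parse("hadoop"));
    System.out.println("total hits: " + hits.length());
    searcher.close();
  }
}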

I would like to hear your input on the best way to index and search large datasets.

Thanks,
Saranath






Re: Katta presentation slides

2008-09-23 Thread Stefan Groschupf

Hi All,

thanks a lot for your interest.
Both my katta and the hadoop survey slides can be found here:

http://find23.net/2008/09/23/hadoop-user-group-slides/

If you have a chance please give katta a test drive and give us some  
feedback.


Thanks,
Stefan


On Sep 23, 2008, at 6:20 PM, Rafael Turk wrote:


+1

On Tue, Sep 23, 2008 at 5:14 AM, Naama Kraus <[EMAIL PROTECTED]>  
wrote:



I'd be interested too. Naama

On Mon, Sep 22, 2008 at 11:32 PM, Deepika Khera <[EMAIL PROTECTED]

wrote:



Hi Stefan,



Are the slides from the Katta presentation up somewhere? If not then
could you please post them?



Thanks,
Deepika





--
"If you want your children to be intelligent, read them fairy  
tales. If you
want them to be more intelligent, read them more fairy  
tales." (Albert

Einstein)





[ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-17 Thread Stefan Groschupf
After 5 months of work we are happy to announce the first developer preview release of katta.
This release contains all the functionality needed to serve a large, sharded lucene index on many servers.
Katta stands on the shoulders of giants: lucene, hadoop and zookeeper.


Main features:
+ Plays well with Hadoop
+ Apache Version 2 License.
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Pluggable network topologies (shard distribution and selection policies)

+ Node load balancing at client



Please give katta a test drive and give us some feedback!

Download:
http://sourceforge.net/project/platformdownload.php?group_id=225750

website:
http://katta.sourceforge.net/

Getting started in less than 3 min:
http://katta.wiki.sourceforge.net/Getting+started

Installation on a grid:
http://katta.wiki.sourceforge.net/Installation

Katta presentation today (09/17/08) at the Hadoop user group, Yahoo! Mission College:

http://upcoming.yahoo.com/event/1075456/
* slides will be available online later


Many thanks for the hard work:
Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec)

I apologize for the cross-posting.


Yours, the Katta Team.

~~~
101tec Inc., Menlo Park, California
http://www.101tec.com






how to LZO

2008-07-29 Thread Stefan Groschupf

Hi,
I would love to use the lzo codec. However, for some reason I always only get...
"INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library"


My hadoop-site looks like:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
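For context, a minimal sketch of how one would then ask a job to use the codec for map output (assuming the JobConf methods of this Hadoop generation; this only tells the job to use the codec, it does not make the native LZO library load):

import org.apache.hadoop.io.compress.LzoCodec;
import org.apache.hadoop.mapred.JobConf;

public class LzoJobSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(LzoJobSetup.class);
    // Compress the intermediate map output with LZO; this still requires the
    // native LZO bindings to be loadable on every node.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(LzoCodec.class);
    return conf;
  }
}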

I also think I have lzo installed on all my nodes:
yum list | grep lzo
lzo.x86_64 2.02-3.fc8 installed
lzo.i386 2.02-3.fc8 installed
lzo-devel.i386 2.02-3.fc8 fedora
lzo-devel.x86_64 2.02-3.fc8 fedora
 lzop.x86_64 1.02-0.5.rc1.fc8 fedora
Is there anything I might have missed that you could think of?
Thanks for any hints!

Stefan



Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Sorry, I'm not a unix expert, but the problem is clearly related to whoami, since that is what throws the error.

I run hadoop in all kinds of configurations super smoothly on my os x boxes.
Maybe rename or move /sw/bin/whoami for a test.
Also make sure you restart the os x terminal, since changes in .bash_profile are only picked up when you "relogin" into the command line.

Sorry, that is all I know and could guess.. :(

On Jun 23, 2008, at 10:56 PM, Lev Givon wrote:


Yes; I have my PATH configured to list /sw/bin before
/usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I
set PATH to

/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin:/usr/local/bin


before starting the daemons and attempting to run the job.

  L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT:

The fink part and /sw confuses me.  When I do a which on my os x I
get: $ which whoami /usr/bin/whoami Are you using the same whoami on
your console as hadoop?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM  
EDT:
Which user runs the hadoop? It should be the same you trigger the  
job

with.

On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce  
job
off of the dfs after starting up the daemons, the job failed  
with the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: realtime hadoop

2008-06-23 Thread Stefan Groschupf

Hadoop might be the wrong technology for you.
Map Reduce is a batch processing mechanism. HDFS might also be critical, since to access your data you need to close the file first - meaning you might end up with many small files, a situation where hdfs is not very strong (the namespace is held in memory).
Hbase might be an interesting tool for you, and so might zookeeper if you want to do something home-grown...




On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:


Hi!

I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second and I would like to use a hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf

The fink part and /sw confuses me.
When I do a which on my os x I get:
$ which whoami
/usr/bin/whoami
Are you using the same whoami on your console as hadoop?

On Jun 23, 2008, at 10:37 PM, Lev Givon wrote:


Both the daemons and the job were started using the same user.

L.G.

Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT:
Which user runs the hadoop? It should be the same you trigger the  
job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with  
the

following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: login error while running hadoop on MacOSX 10.5.*

2008-06-23 Thread Stefan Groschupf
Which user runs hadoop? It should be the same one you trigger the job with.


On Jun 23, 2008, at 10:29 PM, Lev Givon wrote:


I recently installed hadoop 0.17.0 in pseudo-distributed mode on a
MacOSX 10.5.3 system with software managed by Fink installed in /sw. I
configured hadoop to use the stock Java 1.5.0_13 installation in
/Library/Java/Home. When I attempted to run a simple map/reduce job
off of the dfs after starting up the daemons, the job failed with the
following Java error (501 is the ID of the user used to start the
hadoop daemons and run the map/reduce job):

javax.security.auth.login.LoginException: Login failed:
/sw/bin/whoami: cannot find name for user ID 501

What might be causing this to occur? Manually running /sw/bin/whoami
as the user in question returns the corresponding username.

L.G.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: trouble setting up hadoop

2008-06-23 Thread Stefan Groschupf

Looks like you have not installed a correct java.
Make sure you have a Sun java installed on your nodes and that java is in your path; JAVA_HOME should be set as well.
I think gnu.gcj is the GNU java compiler, not a java you can use to run hadoop.

Check this on the command line:
$ java -version
You should see something like this:
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

HTH


On Jun 23, 2008, at 9:40 PM, Sandy wrote:

I apologize for how basic this error is, but I am in the process of getting hadoop set up. I have been following the instructions in the Hadoop quickstart. I have confirmed that bin/hadoop will give me help usage information.

I am now in the stage of standalone operation.

I typed in:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

at which point I get:
Exception in thread "main" java.lang.ClassNotFoundException:
java.lang.Iterable not found in
gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/sjm/Desktop/hado
op-0.16.4/bin/../conf/,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar,file:/home/ 
sjm/Desktop/ha
doop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar,file:/home/sjm/ 
Desktop/hadoo
p-0.16.4/bin/../lib/commons-codec-1.3.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/b
in/../lib/commons-httpclient-3.0.1.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/bin/
../lib/commons-logging-1.0.4.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4/bin/../lib
/commons-logging-api-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../lib/j
ets3t-0.5.0.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
jetty-5.1.4.jar,
file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
junit-3.8.1.jar,file:/home/sjm/D
esktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar,file:/home/sjm/Desktop/ 
hadoop-0.16.4
/bin/../lib/log4j-1.2.13.jar,file:/home/sjm/Desktop/hadoop-0.16.4/ 
bin/../lib/ser
vlet-api.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/ 
xmlenc-0.52.jar,fil
e:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons- 
el.jar,file:/home
/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper- 
compiler.jar,file:/home/s
jm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper- 
runtime.jar,file:/home/sjm/

Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar],
parent=gnu.gcj.runtime. ExtensionClassLoader{urls=[], parent=null}}
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.VMClassLoader.defineClass (libgcj.so.7)
  at java.lang.ClassLoader.defineClass (libgcj.so.7)
  at java.security.SecureClassLoader.defineClass (libgcj.so.7)
  at java.net.URLClassLoader.findClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at java.lang.ClassLoader.loadClass (libgcj.so.7)
  at org.apache.hadoop.util.RunJar.main (RunJar.java:107)

I suspect the issue is path related, though I am not certain. Could  
someone

please point me in the right direction?

Much thanks,

SM


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: Working with XML / XQuery in hadoop

2008-06-23 Thread Stefan Groschupf

Yep, we do.
We have an xml Writable that uses XUM behind the scenes. It has a getDom and a getNode(xquery) method. In readFields we read the byte array and create the XUM DOM object from it.
write simply triggers BinaryCodec.serialize and we write the bytes out.
The same would also work if you de/serialize the xml as text; we found that to be slower than XUM, but it works pretty stably, since XUM has its own issues (you need to use the BinaryCodec as a JVM singleton, etc.).

In general, though, this works pretty well.
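As a rough sketch of the plain-text variant only (XPath instead of XQuery, no binary codec; just to show the Writable shape, not the setup described above):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Minimal sketch of an XML Writable that serializes the document as UTF-8 text.
public class XmlWritable implements Writable {
  private final Text xml = new Text();

  public void set(String xmlString) { xml.set(xmlString); }

  public Document getDom() throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml.toString())));
  }

  // Evaluates an XPath expression against the document and returns the result
  // as a string; XQuery support is not shown here.
  public String getNode(String xpath) throws Exception {
    return XPathFactory.newInstance().newXPath().evaluate(xpath, getDom());
  }

  public void write(DataOutput out) throws IOException { xml.write(out); }

  public void readFields(DataInput in) throws IOException { xml.readFields(in); }
}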
Stefan



On Jun 23, 2008, at 9:38 PM, Kayla Jay wrote:


Hi

Just wondering if anyone out there works with and manipulates and  
stores XML data using Hadoop?  I've seen some threads about XML  
RecordReaders and people who use that XML StreamXmlRecordReader to  
do splits.  But, has anyone implemented a query framework that will  
use the hadoop layer to query against the XML in their map/reduce  
jobs?


I want to know if anyone has executed an XQuery or XPath within a hadoop job to find something within the XML stored in hadoop.


I can't find any samples or anyone else out there who uses XML data  
vs. traditional log text data.


Are there any use cases of using hadoop to work with XML and then do  
queries against XML in a distributed manner using hadoop?


Thanks.





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Meet Hadoop presentation: the math from page 5

2008-06-23 Thread Stefan Groschupf

Hi,
I tried to better understand slide 5 of "meet hadoop":
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf
The slide says:
given:
–10MB/s transfer
–10ms/seek
–100B/entry (10B entries)
–10kB/page (1B pages)

updating 1% of entries (100M) takes:
–1000 days with random B-Tree updates
–100 days with batched B-Tree updates
–1 day with sort & merge

I wonder how exactly to calculate the 1000 days and the 100 days.
time for seeking = 100 000 000 * lg(1 000 000 000) * 10 ms = 346.034177 days
time to read all pages = 100 000 000 * lg(1 000 000 000) * (10 kB / 10 MB/s) = 33.7924001 days
Since we might need to write all pages again we can add another 33 days, but the result is still not 1000 days, so I must be doing something fundamentally wrong. :o


Thanks for any help...

Stefan



Re: [memory leak?] Re: MapReduce failure

2008-03-16 Thread Stefan Groschupf
Oops, sorry, I forgot to mention that I use 0.16.0. I will try to update to 0.16.1 tomorrow and see if this helps, but I couldn't find a closed issue in jira that might be related.

On Mar 15, 2008, at 8:37 PM, Stefan Groschupf wrote:


Hi there,

we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem.
We run our own Mapper and our mapred.child.java.opts is -Xmx3048m, which I think should be more than enough.

I also changed io.sort.mb to 10, which had no impact either.

Any ideas what might cause the OutOfMemoryError?
Thanks.
Stefan




On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote:

What is the heap size you are using for your tasks? Check  
'mapred.child.java.opts' in your hadoop-default.xml. Try increasing  
it. This will happen if you try running the random-writer + sort  
examples with default parameters. The maps are not able to spill  
the data to the disk. Btw what version of HADOOP are you using?

Amar
On Mon, 10 Mar 2008, Ved Prakash wrote:


Hi friends,

I have made a cluster of 3 machines, one of them is master, and  
other 2
slaves. I executed a mapreduce job on master but after Map, the  
execution
terminates and Reduce doesn't happen. I have checked dfs and no  
output

folder gets created.

this is the error I see

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

08/03/10 10:35:22 INFO mapred.JobClient:  map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient:  map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient:  map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening?

Thanks





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: [memory leak?] Re: MapReduce failure

2008-03-16 Thread Stefan Groschupf
I do not instantiate 3 GB of objects, that is for sure. The wordcount example does not run anymore either, so I don't think this is something related to my code; besides the wordcount example, many other users report the same problem:

See:
http://markmail.org/search/?q=org.apache.hadoop.mapred.MapTask%24MapOutputBuffer.collect+order%3Adate-backward
Thanks for your help!

Stefan


On Mar 15, 2008, at 11:02 PM, Devaraj Das wrote:

It might have something to do with your application itself. By any  
chance

are you doing a lot of huge object allocation (directly or indirectly)
within the map method? Which version of hadoop are you on?


-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Sunday, March 16, 2008 9:07 AM
To: core-user@hadoop.apache.org
Subject: [memory leak?] Re: MapReduce failure

Hi there,

we see the same situation and browsing the posts there are
quite a lot of people running into this OOM problem.
We run a own Mapper and our mapred.child.java.opts is
-Xmx3048m, I think that should be more then enough.
Also I changed io.sort.mb to 10, which had also no impact.

Any ideas what might cause the OutOfMemoryError ?
Thanks.
Stefan




On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote:


What is the heap size you are using for your tasks? Check
'mapred.child.java.opts' in your hadoop-default.xml. Try increasing
it. This will happen if you try running the random-writer + sort
examples with default parameters. The maps are not able to

spill the

data to the disk. Btw what version of HADOOP are you using?
Amar
On Mon, 10 Mar 2008, Ved Prakash wrote:


Hi friends,

I have made a cluster of 3 machines, one of them is

master, and other

2 slaves. I executed a mapreduce job on master but after Map, the
execution terminates and Reduce doesn't happen. I have checked dfs
and no output folder gets created.

this is the error I see

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

08/03/10 10:35:22 INFO mapred.JobClient:  map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient:  map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient:  map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening?

Thanks





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com








~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




[memory leak?] Re: MapReduce failure

2008-03-15 Thread Stefan Groschupf

Hi there,

we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem.
We run our own Mapper and our mapred.child.java.opts is -Xmx3048m, which I think should be more than enough.

I also changed io.sort.mb to 10, which had no impact either.
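For reference, the two knobs mentioned above can also be set per job, roughly like this (sketch only; the values are the ones from this mail):

import org.apache.hadoop.mapred.JobConf;

public class OomTuning {
  public static JobConf configure() {
    JobConf conf = new JobConf(OomTuning.class);
    // Heap for each map/reduce child JVM.
    conf.set("mapred.child.java.opts", "-Xmx3048m");
    // Buffer size (in MB) for the map-side sort.
    conf.setInt("io.sort.mb", 10);
    return conf;
  }
}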

Any ideas what might cause the OutOfMemoryError?
Thanks.
Stefan




On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote:

What is the heap size you are using for your tasks? Check  
'mapred.child.java.opts' in your hadoop-default.xml. Try increasing  
it. This will happen if you try running the random-writer + sort  
examples with default parameters. The maps are not able to spill the  
data to the disk. Btw what version of HADOOP are you using?

Amar
On Mon, 10 Mar 2008, Ved Prakash wrote:


Hi friends,

I have made a cluster of 3 machines, one of them is master, and  
other 2
slaves. I executed a mapreduce job on master but after Map, the  
execution
terminates and Reduce doesn't happen. I have checked dfs and no  
output

folder gets created.

this is the error I see

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

08/03/10 10:35:22 INFO mapred.JobClient:  map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient:  map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient:  map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient:  map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at org.apache.hadoop.io.Text.write(Text.java:243)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
  at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening?

Thanks





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: [some bugs] Re: file permission problem

2008-03-15 Thread Stefan Groschupf

Great - it is even already fixed in 0.16.1!
Thanks for the hint!
Stefan

On Mar 14, 2008, at 2:49 PM, Andy Li wrote:


I think this is the same problem related to this mail thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html

A JIRA has been filed, please see HADOOP-2915.

On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf <[EMAIL PROTECTED]>  
wrote:



Hi,
any magic we can do with hadoop.dfs.umask? Or is there any other off
switch for the file security?
Thanks.
Stefan
On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:


Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my
point of view.
So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
namesystem.startFile(src,
  new PermissionStatus(Server.getUserInfo().getUserName(),
null, masked),
  clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
// This is to support local calls (as opposed to rpc ones) to the
name-node.
  // Currently it is name-node specific and should be placed
somewhere else.
  try {
return UnixUserGroupInformation.login();
The login javaDoc says:
/**
 * Get current user's name and the names of all its groups from  
Unix.

 * It's assumed that there is only one UGI per user. If this user
already
 * has a UGI in the ugi map, return the ugi in the map.
 * Otherwise get the current user's information from Unix, store it
 * in the map, and return it.
 */

Besides that, I had some interesting observations.
If I have permission to write to a folder A, I can delete folder A and a file B that is inside folder A even if I have no permissions on B.

Also, I noticed the following in my dfs:
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL   2008-03-13 16:00   rwxr-xr-x   hadoop   supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0   27311   2008-03-13 16:00   rw-r--r--   joa23   supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O
Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.

I assume you mean the 0.16.0 release (

http://hadoop.apache.org/core/releases.html

) without any additional patch.

I just have tried it but cannot reproduce the problem you
described.  I did the following:
1) start a cluster with "tsz"
2) run a job with "nicholas"

The output directory and files are owned by "nicholas".  Am I doing
the same thing you did?  Could you try again?

Nicholas



- Original Message 
From: Johannes Zillmann <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a
remote hadoop cluster.
After the job finished i do some file operations on the generated
output.
The "cluster-user" is different to the "laptop-user". As output i
specify a directory inside the users home. This output directory,
created through the map-reduce job has "cluster-user" permissions,
so
this does not allow me to move or delete the output folder with my
"laptop-user".

So it looks as follow:
/user/jz/  rwxrwxrwx jzsupergroup
/user/jz/output   rwxr-xr-xhadoopsupergroup

I tried different things to achieve what i want (moving/deleting  
the

output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf
instantiation
on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermision(777) on the home folder on the client side (does
not work
recursiv)
- setPermision(777) on the output folder on the client side
(permission
denied)
- create the output folder before running the job (Output  
directory

already exists exception)

None of the things i tried worked. Is there a way to achieve what
i want ?
Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: [some bugs] Re: file permission problem

2008-03-14 Thread Stefan Groschupf

Hi,
Is there any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security?

Thanks.
Stefan
On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:


Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my  
point of view.

So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
namesystem.startFile(src,
   new PermissionStatus(Server.getUserInfo().getUserName(),  
null, masked),

   clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
// This is to support local calls (as opposed to rpc ones) to the  
name-node.
   // Currently it is name-node specific and should be placed  
somewhere else.

   try {
 return UnixUserGroupInformation.login();
The login javaDoc says:
/**
  * Get current user's name and the names of all its groups from Unix.
  * It's assumed that there is only one UGI per user. If this user  
already

  * has a UGI in the ugi map, return the ugi in the map.
  * Otherwise get the current user's information from Unix, store it
  * in the map, and return it.
  */

Beside of that I had some interesting observations.
If I have permissions to write to a folder A I can delete folder A  
and file B that is inside of folder A even if I do have no  
permissions for B.


Also I noticed following in my dfs
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL			2008-03-13 16:00	 
rwxr-xr-x	hadoop	supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/ 
myApp-1205474968598/VOICE_CALL

Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0		27311	 
2008-03-13 16:00	rw-r--r--	joa23	supergroup


Do I miss something or was I able to write as user joa23 into a  
folder owned by hadoop where I should have no permissions. :-O.

Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.
I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html 
) without any additional patch.


I just have tried it but cannot reproduce the problem you  
described.  I did the following:

1) start a cluster with "tsz"
2) run a job with "nicholas"

The output directory and files are owned by "nicholas".  Am I doing  
the same thing you did?  Could you try again?


Nicholas



- Original Message 
From: Johannes Zillmann <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a
remote hadoop cluster.
After the job finished i do some file operations on the generated  
output.

The "cluster-user" is different to the "laptop-user". As output i
specify a directory inside the users home. This output directory,
created through the map-reduce job has "cluster-user" permissions,  
so

this does not allow me to move or delete the output folder with my
"laptop-user".

So it looks as follow:
/user/jz/  rwxrwxrwx jzsupergroup
/user/jz/output   rwxr-xr-xhadoopsupergroup

I tried different things to achieve what i want (moving/deleting the
output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf  
instantiation

on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermision(777) on the home folder on the client side (does  
not work

recursiv)
- setPermision(777) on the output folder on the client side  
(permission

denied)
- create the output folder before running the job (Output directory
already exists exception)

None of the things i tried worked. Is there a way to achieve what  
i want ?

Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com





~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




[some bugs] Re: file permission problem

2008-03-13 Thread Stefan Groschupf

Hi Nicholas, Hi All,

I definitely can reproduce the problem Johannes describes.
Also from debugging through the code it is clearly a bug from my point  
of view.

So this is the call stack:
SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create
In NameNode I found this:
  namesystem.startFile(src,
      new PermissionStatus(Server.getUserInfo().getUserName(), null, masked),
      clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo is this comment:
  // This is to support local calls (as opposed to rpc ones) to the name-node.
  // Currently it is name-node specific and should be placed somewhere else.
  try {
    return UnixUserGroupInformation.login();
The login javaDoc says:
  /**
   * Get current user's name and the names of all its groups from Unix.
   * It's assumed that there is only one UGI per user. If this user already
   * has a UGI in the ugi map, return the ugi in the map.
   * Otherwise get the current user's information from Unix, store it
   * in the map, and return it.
   */

Besides that, I had some interesting observations.
If I have permission to write to a folder A, I can delete folder A and a file B that is inside folder A even if I have no permissions on B.

Also, I noticed the following in my dfs:
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL   2008-03-13 16:00   rwxr-xr-x   hadoop   supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0   27311   2008-03-13 16:00   rw-r--r--   joa23   supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O

Should I open some jira issues?

Stefan





On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:


Hi Johannes,


i'm using the 0.16.0 distribution.
I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html 
) without any additional patch.


I just have tried it but cannot reproduce the problem you  
described.  I did the following:

1) start a cluster with "tsz"
2) run a job with "nicholas"

The output directory and files are owned by "nicholas".  Am I doing  
the same thing you did?  Could you try again?


Nicholas



- Original Message 
From: Johannes Zillmann <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

I have a question regarding the file permissions.
I have a kind of workflow where I submit a job from my laptop to a remote hadoop cluster.
After the job finishes I do some file operations on the generated output.
The "cluster-user" is different from the "laptop-user". As output I specify a directory inside the user's home. This output directory, created through the map-reduce job, has "cluster-user" permissions, so it does not allow me to move or delete the output folder with my "laptop-user".

So it looks as follows:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what I want (moving/deleting the output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation on the client side
- adding a user.name node in the hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work recursively)
- setPermission(777) on the output folder on the client side (permission denied)
- creating the output folder before running the job ("Output directory already exists" exception)

None of the things I tried worked. Is there a way to achieve what I want?

Any ideas appreciated!

cheers
Johannes






--
~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: Hadoop summit / workshop at Yahoo!

2008-02-22 Thread Stefan Groschupf

Phew, 2 days and it is already full?
Does Yahoo have no rooms bigger than one for just 100 people?



On Feb 20, 2008, at 12:10 PM, Ajay Anand wrote:


The registration page for the Hadoop summit is now up:
http://developer.yahoo.com/hadoop/summit/

Space is limited, so please sign up early if you are interested in
attending.

About the summit:
Yahoo! is hosting the first summit on Apache Hadoop on March 25th in
Sunnyvale. The summit is sponsored by the Computing Community  
Consortium

(CCC) and brings together leaders from the Hadoop developer and user
communities. The speakers will cover topics in the areas of extensions
being developed for Hadoop, case studies of applications being built  
and

deployed on Hadoop, and a discussion on future directions for the
platform.

Agenda:
8:30-8:55 Breakfast
8:55-9:00 Welcome to Yahoo! & Logistics - Ajay Anand, Yahoo!
9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 Pig - Chris Olston, Yahoo!
10:00-10:30 JAQL - Kevin Beyer, IBM
10:30-10:45 Break
10:45-11:15 DryadLINQ - Michael Isard, Microsoft
11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei
Zaharia, UC Berkeley
11:45-12:15 Zookeeper - Ben Reed, Yahoo!
12:15-1:15 Lunch
1:15-1:45 Hbase - Michael Stack, Powerset
1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
2:45-3:00 Break
3:00-3:20 Building Ground Models of Southern California - Steve
Schossler, David O'Hallaron, Intel / CMU
3:20-3:40 Online search for engineering design content - Mike Haley,
Autodesk
3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland /
Christophe Bisciglia, Google
4:30-4:45 Break
4:45-5:30 Panel on future directions
5:30-7:00 Happy hour

Look forward to seeing you there!
Ajay

-Original Message-
From: Bradford Stephens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 20, 2008 9:17 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop summit / workshop at Yahoo!

Hrm yes, I'd like to make a visit as well :)

On Feb 20, 2008 8:05 AM, C G <[EMAIL PROTECTED]> wrote:

 Hey All:

 Is this going forward?  I'd like to make plans to attend and the

sooner I can get plane tickets the happier the bean counters will be
:-).


 Thx,
 C G


Ajay Anand wrote:


Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are seeing

from

developers in a broad range of organizations, this seems like a

good

time to get together and brief each other on the progress that is
being
made.



We would like to cover topics in the areas of extensions being
developed
for Hadoop, innovative applications being built and deployed on
Hadoop,
and future extensions to the platform. Some of the speakers who

have

already committed to present are from organizations such as IBM,
Intel,
Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and
we are
actively recruiting other leaders in the space.



If you have an innovative application you would like to talk about,
please let us know. Although there are limitations on the amount of
time
we have, we would love to hear from you. You can contact me at
[EMAIL PROTECTED]



Thanks and looking forward to hearing about your cool apps,

Ajay



















~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




broadcasting: pig user meeting, Friday, February 8, 2008

2008-02-07 Thread Stefan Groschupf

Hi there,
sorry for the cross-posting.
If everything works out we will video broadcast the event here:
http://ustream.tv/channel/apache-pig-user-meeting
But no guarantee - sorry.
We are also trying to set up a telephone dial-in number - please write me a private email if you are interested and I will send out the number.


See you tomorrow.
Stefan


On Feb 6, 2008, at 3:54 PM, Andrzej Bialecki wrote:


Otis Gospodnetic wrote:
Sorry about the word-wrapping (original email) - Yahoo Mail  
problem :(
Is anyone going to be capturing the Piglet meeting on video for the  
those of us living in other corners of the planet?



Please do! It's too far from Poland to just casually drop by .. ;)


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: pig user meeting, Friday, February 8, 2008

2008-02-06 Thread Stefan Groschupf

Hi Otis,
can you suggest a technology for how we could do that? Skype? iChat? Something that is free?
I'm happy to set up a video conference; however, there are no big presentations planned.
I was thinking I could give an overview of how we use pig for our current project, just to reflect our use cases.

But besides that, I guess it is just pizza and beer.

Cheers,
Stefan





On Feb 6, 2008, at 11:40 AM, Otis Gospodnetic wrote:


Sorry about the word-wrapping (original email) - Yahoo Mail problem :(

Is anyone going to be capturing the Piglet meeting on video for the  
those of us living in other corners of the planet?


Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 

From: Stefan Groschupf <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, January 31, 2008 7:09:53 PM
Subject: pig user meeting, Friday, February 8, 2008

Hi there,

a couple of people plan to meet and talk about apache pig next Friday in the Mountain View area.
(The event location is not yet certain.)
If you are interested, please RSVP asap, so we can plan what size of location we are looking for.

http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com









~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




pig user meeting, Friday, February 8, 2008

2008-01-31 Thread Stefan Groschupf

Hi there,

a couple of people plan to meet and talk about apache pig next Friday in the Mountain View area.
(The event location is not yet certain.)
If you are interested, please RSVP asap, so we can plan what size of location we are looking for.


http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: Reduce hangs 2

2008-01-22 Thread Stefan Groschupf

Hi,
not sure if this is the same source of the problem, but I also ran into problems with a hanging reduce.
It is reproducible for me, though I have not found the source of the problem yet.
I run a series of jobs, and in my last job the last reduce task hangs for about 15 to 20 minutes doing nothing, but then resumes. I am running hadoop 15.1.

Below are the log entries during the hang. So I think it is not the copy problem mentioned before. I also checked that our dfs is healthy.


2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Need 2 map output(s)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Got 2 known map output location(s); scheduling...
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Scheduled 2 of 2 known outputs (0 slow hosts and 0 dup hosts)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:09,328 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,243 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,610 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:11,611 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying of all map outputs complete. Initiating the last merge on the remaining files in ramfs://mapoutput169937755
2008-01-22 21:22:11,635 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Merge of the 1 files in InMemoryFileSystem complete. Local file is /home/hadoop/data/hadoop-hadoop/mapred/local/task_200801221313_0003_r_46_1/map_34.out


Any ideas? Thanks!
Stefan 


setting # of maps for a job

2008-01-22 Thread Stefan Groschupf

Hi,
I have trouble setting the number of maps for a job with version 15.1.
As far as I understand, I can configure the number of maps that a job will run in the hadoop-site.xml on the box where I submit the job (which is not the jobtracker box).
However, my configuration is always ignored. Changing the value in the hadoop-site.xml on the jobtracker box and restarting the nodes does not help either.

I also do not set the number via the API.
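For comparison, a minimal sketch of setting it through the API (0.15-era JobConf; as far as I know the map count is only a hint to the framework, the input splits ultimately decide):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapCountExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapCountExample.class);
    // Request 20 map tasks; the actual number is driven by the input splits,
    // so this is a hint rather than a hard setting.
    conf.setNumMapTasks(20);
    conf.setNumReduceTasks(4);
    // ... set mapper, reducer, input and output paths as usual ...
    JobClient.runJob(conf);
  }
}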
Any ideas where I might be overlooking something?
Thanks for any hints,
Stefan


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com