Re: How I should use hadoop to analyze my logs?

2008-08-14 Thread Rafael Turk
Juho,

Try using pig first: http://incubator.apache.org/pig/

--Rafael


On Thu, Aug 14, 2008 at 6:53 AM, Juho Mäkinen <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I'm looking at how Hadoop could solve our data mining applications and I've
> come up with a few questions to which I haven't found any answers yet.
> Our setup contains multiple diskless webserver frontends which
> generate log data. Each webserver hit generates a UDP packet which
> contains basically the same info as a normal Apache access log line
> (url, return code, client ip, timestamp etc). The UDP packet is
> received by a log server. I would like to run map/reduce processes
> on the log data at the same time as the servers are generating new
> data. I was planning that each day would have its own file in HDFS
> which contains all log entries for that day.
>
> How should I use Hadoop and HDFS to write each log entry to a file? I
> was planning to create a class which contains the request
> attributes (url, return code, client ip etc) and use this as the
> value. I did not find any info on how this could be done with HDFS. The
> API seems to support arbitrary objects as both key and value, but there
> was no example of how to do this.
>
> How will Hadoop handle the concurrency between the writes and the reads?
> The servers will generate log entries around the clock. I also want to
> analyse the log entries at the same time as the servers are
> generating new data. How can I do this? The HDFS architecture page
> says that the client writes the data first into a local file and, once
> the file has reached the block size, the file will be transferred to
> the HDFS storage nodes and the client writes the following data to
> another local file. Is it possible to read the blocks already
> transferred to HDFS using map/reduce processes and write new
> blocks to the same file at the same time?
>
> Thanks in advance,
>
>  - Juho Mäkinen
>


Re: lucene/nutch question...

2008-08-14 Thread Otis Gospodnetic
Bruce, you may want to ask on [EMAIL PROTECTED] or [EMAIL PROTECTED] lists, or 
even [EMAIL PROTECTED]


Yes, it sounds like either Lucene or Solr might be the right tools to use.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: bruce <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, August 14, 2008 4:16:28 PM
> Subject: lucene/nutch question...
> 
> Hi.
> 
> Got a very basic lucene/nutch question.
> 
> Assume I have a page that has a form. Within the form are a number of
> select/drop-down boxes, etc. In this case, each object would comprise a
> variable which would form part of the query string as defined in the form
> action. Is there a way for lucene/nutch to go through the process of
> building up the actions based on the query string vars, so that lucene/nutch
> can actually search through each possible combination of URLs?
> 
> Also, is nutch/lucene the right/correct app to use in this scenario? Is
> there a better app to handle this kind of potential application/process?
> 
> Thanks
> 
> -bruce




RE: Un-Blacklist Node

2008-08-14 Thread Xavier Stevens
I don't think this is the per-job blacklist.  The datanode is still
running on the slave machine, but the tasktracker isn't.  Can I just
run "start-mapred" on that machine?

-Xavier
 

-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 14, 2008 1:43 PM
To: core-user@hadoop.apache.org
Subject: Re: Un-Blacklist Node

Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:

> Is there a way to un-blacklist a node without restarting hadoop?
>

Which blacklist are you talking about? Per-job blacklist of
TaskTrackers? Hadoop Daemons?

Arun





Re: Un-Blacklist Node

2008-08-14 Thread Arun C Murthy

Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:


Is there a way to un-blacklist a node without restarting hadoop?



Which blacklist are you talking about? Per-job blacklist of  
TaskTrackers? Hadoop Daemons?


Arun



Re: Un-Blacklist Node

2008-08-14 Thread lohit
One way I could think of is to just restart mapred daemons.
./bin/stop-mapred.sh
./bin/start-mapred.sh
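
If only the one tasktracker is down, a lighter-weight option might be to
start just that daemon on the slave itself (a sketch, assuming the standard
scripts are installed there; not tested here):

./bin/hadoop-daemon.sh start tasktracker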


Thanks,
Lohit

- Original Message 
From: Xavier Stevens <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, August 14, 2008 12:18:18 PM
Subject: Un-Blacklist Node

Is there a way to un-blacklist a node without restarting hadoop?

Thanks,

-Xavier

lucene/nutch question...

2008-08-14 Thread bruce
Hi.

Got a very basic lucene/nutch question.

Assume I have a page that has a form. Within the form are a number of
select/drop-down boxes, etc. In this case, each object would comprise a
variable which would form part of the query string as defined in the form
action. Is there a way for lucene/nutch to go through the process of
building up the actions based on the query string vars, so that lucene/nutch
can actually search through each possible combination of URLs?

Also, is nutch/lucene the right/correct app to use in this scenario? Is
there a better app to handle this kind of potential application/process?

Thanks

-bruce







RandomWriter not responding to parameter changes

2008-08-14 Thread James Graham (Greywolf)

I have altered the values described in randomwriter, but they don't seem
to have any effect on the amount of data generated.

I am specifying the configuration file as the last parameter; it seems
to have no effect whatsoever.

Go figure.  What am I doing wrong?
--
James Graham (Greywolf)   |
650.930.1138|925.768.4053 *
[EMAIL PROTECTED] |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa


Un-Blacklist Node

2008-08-14 Thread Xavier Stevens
Is there a way to un-blacklist a node without restarting hadoop?

Thanks,

-Xavier



Re: When will hadoop version 0.18 be released?

2008-08-14 Thread Konstantin Shvachko

I don't think HADOOP-3781 will be fixed.

Here is the complete list of what is going to be fixed in 0.18
https://issues.apache.org/jira/secure/IssueNavigator.jspa?fixfor=12312972

--Konstantin

Thibaut_ wrote:
> Will this bug (https://issues.apache.org/jira/browse/HADOOP-3781) also be
> fixed, which makes it impossible to use the distributed jar file with any
> external application? (Works only with a local recompile)
>
> Thibaut
>
>
> Konstantin Shvachko wrote:
>> But you won't get append in 0.18. It was committed for 0.19.
>> --konstantin
>>
>> Arun C Murthy wrote:
>>> On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote:
>>>
>>>> Hi colleagues,
>>>>    As you know, the append writer will be available in version 0.18.
>>>> We are here waiting for the feature and want to know the rough time
>>>> of release.
>>>
>>> It's currently under vote, it should be released by the end of the week
>>> if it passes.
>>>
>>> Arun







Datanodes that start and then disappear

2008-08-14 Thread Bernard Butler
Hi,

I'm new to Hadoop - so hope you can help with this problem.

I'm trying to set up a small (2-zone) hadoop cluster on Solaris.
start-dfs.sh runs without error; it prints the following to the
screen:

master: starting datanode, logging to ..
slave: starting datanode, logging to ..

All looks well.  However, when I check the DFS admin web page on the
master (on port 50070) it says


Cluster Summary
1 files and directories, 0 blocks = 1 total. Heap Size is 18.5 MB / 448 MB
(4%)
Capacity:   0 KB
DFS Remaining   :   0 KB
DFS Used:   0 KB
DFS Used%   :   0 %
Live Nodes  :   0
Dead Nodes  :   0

There are no datanodes in the cluster


I had a look in the datanode logs and they were empty on both master and
slave.

Running netstat -an on the master shows that it is listening on ports
50070, 50090 and 54310 (I changed fs.default.name to avoid a port
conflict).  The slave has no hadoop-related ports active although there is
a single com.sun.management.jmxremote process running.

FYI a single-node pseudo-distributed installation worked fine on the
master.  I'm running hadoop-0.17.1.  I did not run start-mapred.sh.

Advice/suggestions would be very welcome.

Thanks,
B Butler
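
Two quick checks that might help narrow this down (just a sketch, assuming
the stock scripts under $HADOOP_HOME on each box):

# from the master: ask the namenode which datanodes have actually registered
./bin/hadoop dfsadmin -report

# on the slave: start the datanode in the foreground so any startup error
# (wrong fs.default.name, unresolvable hostname, etc.) is printed to the
# terminal instead of disappearing into an empty log
./bin/hadoop datanode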




Re: Distributed Lucene - from hadoop contrib

2008-08-14 Thread Anoop Bhatti
Hi,

I was able to make a distributed Lucene index using the
hadoop.contrib.index code, and then search over that index while it is
still in HDFS.  I never used Distributed Lucene or Katta.

The key is to use the org.apache.hadoop.dfs.DistributedFileSystem
class for Lucene (see the code below).

I tested this on a Lucene index in a clustered environment, with
pieces of the index residing on different machines, and it does query
successfully.  The search time is fast (although the index is only
262MB).

I'd like to know if I'm heading down the right path, so my questions are:
* Has anyone tried searching a distributed Lucene index using a method
like this before?  It seems too easy.  Are there any "gotchas" that I
should look out for as I scale up to more nodes and a larger index?

* Do you think that going ahead with this approach, which consists of
1) creating a Lucene index using the  hadoop.contrib.index code
(thanks, Ning!) and 2) leaving that index "in-place" on hdfs and
searching over it using the client code below, is a good approach?

* What is the status of the bailey project?  It seems to be working on
the same type of problem. Should I wait until that project comes out
with code?


Here's my code:


import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.index.lucene.FileSystemDirectory;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class LuceneQuery {
    public static void main(String[] args) throws Exception {

        FileSystem fs = new DistributedFileSystem();
        Configuration conf = new Configuration();

        // master that has the name node (fs.default.name)
        fs.initialize(new URI("hdfs://master:54310"), conf);

        // path to the lucene index directory on the master
        Path path = new Path("/indexlocation/0");
        Directory dir = new FileSystemDirectory(fs, path, false, conf);

        IndexSearcher is = new IndexSearcher(dir);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("searchTerm");
        Hits hits = is.search(query);

        // print out the "id" field of the results
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("id"));
        }
        is.close();
    }
}



Thanks,

Anoop Bhatti
--
Committed to open source technology.

On Tue, Aug 12, 2008 at 7:19 PM, Deepika Khera <[EMAIL PROTECTED]> wrote:
> Thank you for your response.
>
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index
> documents ii) providing search in a distributed fashion, to be all in
> one box.
>
> So basically, hadoop.contrib.index is used to create lucene indexes in
> a distributed fashion (by creating shards-each shard being a lucene
> instance). And then I can use Katta or any other Distributed Lucene
> application to serve lucene indexes distributed over many servers.
>
> Deepika
>
>
> -Original Message-
> From: Ning Li [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 08, 2008 7:08 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Distributed Lucene - from hadoop contrib
>
>> 1) Katta n Distributed Lucene are different projects though, right?
> Both
>> being based on kind of the same paradigm (Distributed Index)?
>
> The design of Katta and that of Distributed Lucene are quite different
> last time I checked. I pointed out the Katta project because you can
> find the code for Distributed Lucene there.
>
>> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
>> Though, it would be much better if it is integrated with "Distributed
>> Lucene" or the "Katta project" as these are designed keeping the
>> structure and behavior of indexes in mind. Right?
>
> As described in the README file, hadoop.contrib.index uses map/reduce
> to build Lucene instances. It does not contain a component that serves
> queries. If that's not sufficient for you, you can check out the
> designs of Katta and Distributed Index and see which one suits your
> use better.
>
> Ning
>


RE: HDFS -rmr permissions

2008-08-14 Thread Koji Noguchi
Hi Brian, 

I believe dfs -rmr does check the permissions for each file.
What's allowing you to delete other users' data is the trash feature.
Each user's Trash is expunged by the namenode process, which is a
superuser.
More discussion at
http://issues.apache.org/jira/browse/HADOOP-2514

My guess is that what we really need is a 'sticky bit' that won't allow dfs
-mv for files/directories under a dir with 777 permissions.  I couldn't
find a Jira, so I opened a new one:
https://issues.apache.org/jira/browse/HADOOP-3953

Koji

===
(userB)> hadoop dfs -ls / | grep ' /tmp'
drwxrwxrwx   - knoguchi supergroup  0 2008-08-14 16:47 /tmp

(userB)> hadoop dfs -Dfs.trash.interval=0 -ls /tmp
Found 1 items
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
(userB)> hadoop dfs -Dfs.trash.interval=0 -lsr /tmp
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir/foo1
-rw-r--r--   1 userA users 13 2008-08-14 16:45 /tmp/userA-dir/foo1/a
-rw-r--r--   1 userA users 15 2008-08-14 16:45 /tmp/userA-dir/foo1/b
-rw-r--r--   1 userA users 25 2008-08-14 16:45 /tmp/userA-dir/foo1/c

(userB)> hadoop dfs -Dfs.trash.interval=0 -rmr /tmp/userA-dir
rmr: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=userB, access=ALL, inode="userA-dir":userA:users:rwxr-xr-x

(userB)> hadoop dfs -Dfs.trash.interval=1 -rmr /tmp/userA-dir
Moved to trash: hdfs://ucdev13.inktomisearch.com:47522/tmp/userA-dir
===


-Original Message-
From: Brian Karlak [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 07, 2008 11:27 AM
To: core-user@hadoop.apache.org
Cc: Colin Evans
Subject: HDFS -rmr permissions

Hello --

As far as I can tell, "hadoop dfs -rmr" only checks the permissions of
the directory to be deleted and its parent.  Unlike Unix, however, it
does not seem to check the permissions of the directories / files
contained within the directory to be deleted.

Is this by design?  It seems dangerous.  For instance, we have a  
directory where we want to allow people to deposit common resources  
for a project.  Its permissions need to be 777, otherwise only one  
person can write to it.  But with 777 permissions, any fool can  
accidentally wipe it.

(Of course, if we have /trash set up, accidental deletes are not as big
a deal, but still ...)

Thoughts / comments?  Is there a way to make -rmr check the  
permissions of the files within the directories it's deleting, just as  
unix does?  If not, is this a legit feature request?  (I checked JIRA,  
but I didn't find anything on this ...)

Thanks,
Brian


Re: how to config secondary namenode

2008-08-14 Thread lohit
You could use the same config you use for the namenode (bcn151).
In addition you might want to change these
fs.checkpoint.dir
fs.checkpoint.period (default is one hour)
dfs.secondary.http.address (if you do not want the default)
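
For example, a minimal sketch of those overrides in hadoop-site.xml on bcn152
(the checkpoint path and values below are only illustrative):

<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/hadoop/namesecondary</value>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>bcn152:50090</value>
</property>

If I remember right, you also list bcn152 in conf/masters on bcn151 so that
start-dfs.sh starts the secondary namenode on that machine.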

Thanks,
Lohit



- Original Message 
From: 志远 <[EMAIL PROTECTED]>
To: core-user 
Sent: Thursday, August 14, 2008 3:02:15 AM
Subject: how to config secondary namenode

  
How to config the secondary namenode with another machine?

Namenode: bcn151
Secondary namenode: bcn152
Datanodes: hdp1 hdp2
Thanks!

Re: Dynamically adding datanodes

2008-08-14 Thread lohit
If the config is right, then this is the procedure to add a new datanode.
Do you see any exceptions logged in your datanode log?
Run it as daemon so it logs everything into a file under HADOOP_LOG_DIR
./bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode 
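
Once the datanode starts cleanly and registers, it should show up as a live
node on the namenode web UI (port 50070 by default) and in the output of
(assuming the same scripts/paths as above):

./bin/hadoop dfsadmin -report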

Thanks,
Lohit

- Original Message 
From: Kai Mosebach <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, August 14, 2008 1:48:02 AM
Subject: Dynamically adding datanodes

Hi,

How can I add a datanode dynamically to a hadoop cluster without restarting the
whole cluster?
I was trying to run "hadoop datanode" on the new datanode with the appropriate
config (pointing to my correct namenode) but it does not show up there.

Is there a way?

Thanks Kai



Re: When will hadoop version 0.18 be released?

2008-08-14 Thread Thibaut_

Will this bug (https://issues.apache.org/jira/browse/HADOOP-3781) also be
fixed, which makes it impossible to use the distributed jar file with any
external application? (Works only with a local recompile)

Thibaut


Konstantin Shvachko wrote:
> 
> But you won't get append in 0.18. It was committed for 0.19.
> --konstantin
> 
> Arun C Murthy wrote:
>> 
>> On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote:
>> 
>>> Hi colleagues,
>>>As you know, the append writer will be available in version 0.18. 
>>> We are
>>> here waiting for the feature and want to know the rough time of release.
>> 
>> It's currently under vote, it should be released by the end of the week 
>> if it passes.
>> 
>> Arun
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/When-will-hadoop-version-0.18-be-released--tp18957890p18982483.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



How I should use hadoop to analyze my logs?

2008-08-14 Thread Juho Mäkinen
Hello,

I'm looking at how Hadoop could solve our data mining applications and I've
come up with a few questions to which I haven't found any answers yet.
Our setup contains multiple diskless webserver frontends which
generate log data. Each webserver hit generates a UDP packet which
contains basically the same info as a normal Apache access log line
(url, return code, client ip, timestamp etc). The UDP packet is
received by a log server. I would like to run map/reduce processes
on the log data at the same time as the servers are generating new
data. I was planning that each day would have its own file in HDFS
which contains all log entries for that day.

How should I use Hadoop and HDFS to write each log entry to a file? I
was planning to create a class which contains the request
attributes (url, return code, client ip etc) and use this as the
value. I did not find any info on how this could be done with HDFS. The
API seems to support arbitrary objects as both key and value, but there
was no example of how to do this.
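
A rough sketch of how that could look, with a custom Writable value written to
one SequenceFile per day (the class, field names and path below are only
illustrative and untested against a real cluster):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// One log record: whatever the UDP packet carries (url, return code, client ip, ...).
// Implementing Writable is what lets Hadoop use the class as a map/reduce value.
public class LogEntry implements Writable {
    private Text url = new Text();
    private int returnCode;
    private Text clientIp = new Text();

    public LogEntry() {}   // no-arg constructor is required for deserialization

    public LogEntry(String url, int returnCode, String clientIp) {
        this.url.set(url);
        this.returnCode = returnCode;
        this.clientIp.set(clientIp);
    }

    public void write(DataOutput out) throws IOException {
        url.write(out);
        out.writeInt(returnCode);
        clientIp.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        returnCode = in.readInt();
        clientIp.readFields(in);
    }

    // Log-server side: append each entry to the current day's SequenceFile in
    // HDFS, keyed by timestamp.  Finished days can then be read by a map/reduce
    // job with SequenceFileInputFormat.
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/logs/2008-08-14"),
                LongWritable.class, LogEntry.class);
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new LogEntry("/index.html", 200, "10.0.0.1"));
        writer.close();
    }
}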

How will Hadoop handle the concurrency between the writes and the reads?
The servers will generate log entries around the clock. I also want to
analyse the log entries at the same time as the servers are
generating new data. How can I do this? The HDFS architecture page
says that the client writes the data first into a local file and, once
the file has reached the block size, the file will be transferred to
the HDFS storage nodes and the client writes the following data to
another local file. Is it possible to read the blocks already
transferred to HDFS using map/reduce processes and write new
blocks to the same file at the same time?

Thanks in advance,

 - Juho Mäkinen


Dynamically adding datanodes

2008-08-14 Thread Kai Mosebach
Hi,

How can I add a datanode dynamically to a hadoop cluster without restarting the
whole cluster?
I was trying to run "hadoop datanode" on the new datanode with the appropriate
config (pointing to my correct namenode) but it does not show up there.

Is there a way?

Thanks Kai