Re: How I should use hadoop to analyze my logs?
Juho,

Try using Pig first: http://incubator.apache.org/pig/

--Rafael

On Thu, Aug 14, 2008 at 6:53 AM, Juho Mäkinen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm looking at how Hadoop could support our data-mining applications, and I've
> come up with a few questions I haven't found answers to yet.
> Our setup contains multiple diskless webserver frontends which generate log
> data. Each webserver hit generates a UDP packet which contains basically the
> same info as a normal Apache access-log line (url, return code, client ip,
> timestamp etc). The UDP packets are received by a log server. I want to run
> map/reduce processes on the log data while the servers are generating new
> data. I was planning for each day to have its own file in HDFS containing all
> log entries for that day.
>
> How should I use Hadoop and HDFS to write each log entry to a file? I was
> planning to create a class which contains the request attributes (url, return
> code, client ip etc) and use this as the value. I did not find any info on how
> this could be done with HDFS. The API seems to support arbitrary objects as
> both key and value, but there was no example of how to do this.
>
> How will Hadoop handle concurrency between the writes and the reads? The
> servers will generate log entries around the clock, and I also want to analyse
> the log entries while the servers are generating new data. How can I do this?
> The HDFS architecture page says that the client writes the data first into a
> local file, and once the file reaches the block size, the block is transferred
> to the HDFS storage nodes while the client writes the following data to
> another local file. Is it possible for map/reduce processes to read the blocks
> already transferred to HDFS while new blocks are being written to the same
> file?
>
> Thanks in advance,
>
> - Juho Mäkinen
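For the "arbitrary objects as both key and value" part of Juho's question: the usual approach is to implement Hadoop's Writable interface on the log-entry class and store the records in a SequenceFile. Below is a minimal sketch of such a class. The field names are made up for illustration, and to keep the example self-contained here it uses only java.io; in a real job the class would additionally declare `implements org.apache.hadoop.io.Writable`, whose contract is exactly the two methods shown.

```java
import java.io.*;

// Sketch of a log-entry value type. In a real Hadoop job this class
// would declare "implements Writable"; write() and readFields() are
// precisely the two methods that interface requires.
class LogEntry {
    private String url;
    private int returnCode;
    private String clientIp;
    private long timestamp;

    // Writable types need a no-arg constructor so the framework
    // can instantiate them before calling readFields().
    public LogEntry() {}

    public LogEntry(String url, int returnCode, String clientIp, long timestamp) {
        this.url = url;
        this.returnCode = returnCode;
        this.clientIp = clientIp;
        this.timestamp = timestamp;
    }

    // Serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(returnCode);
        out.writeUTF(clientIp);
        out.writeLong(timestamp);
    }

    // ...and deserialize them back in the same order.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        returnCode = in.readInt();
        clientIp = in.readUTF();
        timestamp = in.readLong();
    }

    public String getUrl() { return url; }
    public int getReturnCode() { return returnCode; }
    public String getClientIp() { return clientIp; }
    public long getTimestamp() { return timestamp; }
}
```

Records like this would then typically be appended to the per-day file with SequenceFile.createWriter(), using the class as the value type. Note that reading a file while it is still being written is limited in 0.17/0.18; as the other thread in this digest mentions, append support was committed for 0.19.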
Re: lucene/nutch question...
Bruce, you may want to ask on the [EMAIL PROTECTED] or [EMAIL PROTECTED] lists, or even [EMAIL PROTECTED]. Yes, it sounds like either Lucene or Solr might be the right tool to use.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: bruce <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, August 14, 2008 4:16:28 PM
> Subject: lucene/nutch question...
>
> Hi.
>
> Got a very basic lucene/nutch question.
>
> Assume I have a page that has a form. Within the form are a number of
> select/drop-down boxes etc. In this case, each object would supply a
> variable which would form part of the query string as defined in the form
> action. Is there a way for lucene/nutch to go through the process of
> building up the actions based on the querystring vars, so that lucene/nutch
> can actually search through each possible combination of urls?
>
> Also, is nutch/lucene the right app to use in this scenario? Is
> there a better app to handle this kind of potential application/process?
>
> Thanks
>
> -bruce
RE: Un-Blacklist Node
I don't think this is the per-job blacklist. The datanode is still running, but the tasktracker isn't, on the slave machine. Can I just "start-mapred" on that machine?

-Xavier

-----Original Message-----
From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 14, 2008 1:43 PM
To: core-user@hadoop.apache.org
Subject: Re: Un-Blacklist Node

Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:
> Is there a way to un-blacklist a node without restarting hadoop?

Which blacklist are you talking about? Per-job blacklist of TaskTrackers? Hadoop Daemons?

Arun
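On the "can I just start-mapred on that machine?" question: start-mapred.sh acts cluster-wide, so when only one tasktracker has died the lighter-weight option is usually to restart just that daemon on the affected slave. A sketch of the commands, assuming a standard layout with paths relative to the Hadoop install directory on the slave:

```shell
# On the affected slave machine only: restart the local tasktracker.
# hadoop-daemon.sh (singular) manages daemons on the local host,
# unlike hadoop-daemons.sh / start-mapred.sh, which act on every
# host listed in conf/slaves.
bin/hadoop-daemon.sh stop tasktracker
bin/hadoop-daemon.sh start tasktracker
```

Once the tasktracker re-registers with the jobtracker, the node should come back into rotation without touching HDFS or the rest of the cluster.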
Re: Un-Blacklist Node
Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:
> Is there a way to un-blacklist a node without restarting hadoop?

Which blacklist are you talking about? Per-job blacklist of TaskTrackers? Hadoop Daemons?

Arun
Re: Un-Blacklist Node
One way I could think of is to just restart the mapred daemons:

./bin/stop-mapred.sh
./bin/start-mapred.sh

Thanks,
Lohit

----- Original Message -----
From: Xavier Stevens <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, August 14, 2008 12:18:18 PM
Subject: Un-Blacklist Node

Is there a way to un-blacklist a node without restarting hadoop?

Thanks,
-Xavier
lucene/nutch question...
Hi.

Got a very basic lucene/nutch question.

Assume I have a page that has a form. Within the form are a number of select/drop-down boxes etc. In this case, each object would supply a variable which would form part of the query string as defined in the form action. Is there a way for lucene/nutch to go through the process of building up the actions based on the querystring vars, so that lucene/nutch can actually search through each possible combination of urls?

Also, is nutch/lucene the right app to use in this scenario? Is there a better app to handle this kind of potential application/process?

Thanks

-bruce
RandomWriter not responding to parameter changes
I have altered the values described in RandomWriter, but they don't seem to have any effect on the amount of data generated. I am specifying the configuration file as the last parameter; it seems to have no effect whatsoever. Go figure. What am I doing wrong?

--
James Graham (Greywolf) | 650.930.1138 | 925.768.4053
[EMAIL PROTECTED]
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa
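One possible cause, offered here as an assumption to verify against the RandomWriter source for your release: the data volume is controlled by test.randomwrite.* properties, and a config file passed after the output directory may simply be ignored, whereas properties set in a file picked up as part of the job configuration do take effect. A hypothetical override file (the values are illustrative; double-check the property names against the example's source, since the prefixes have varied between versions):

```xml
<?xml version="1.0"?>
<!-- Hypothetical randomwriter overrides; verify property names
     against the RandomWriter source for your Hadoop version. -->
<configuration>
  <property>
    <name>test.randomwrite.bytes_per_map</name>
    <value>1073741824</value> <!-- bytes written per map task -->
  </property>
  <property>
    <name>test.randomwrite.total_bytes</name>
    <value>10737418240</value> <!-- total bytes across all maps -->
  </property>
</configuration>
```

If the properties are correct for your version and still have no effect, the job's web UI configuration page is a quick way to confirm whether your overrides actually reached the job.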
Un-Blacklist Node
Is there a way to un-blacklist a node without restarting hadoop? Thanks, -Xavier
Re: When will hadoop version 0.18 be released?
I don't think HADOOP-3781 will be fixed. Here is the complete list of what is going to be fixed in 0.18:
https://issues.apache.org/jira/secure/IssueNavigator.jspa?fixfor=12312972

--Konstantin

Thibaut_ wrote:
> Will this bug (https://issues.apache.org/jira/browse/HADOOP-3781) also be fixed? It makes it impossible to use the distributed jar file with any external application (it works only with a local recompile).
>
> Thibaut
>
> Konstantin Shvachko wrote:
>> But you won't get append in 0.18. It was committed for 0.19.
>> --konstantin
>>
>> Arun C Murthy wrote:
>>> On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote:
>>>> Hi colleagues,
>>>> As you know, the append writer will be available in version 0.18. We are here waiting for the feature and want to know the rough time of release.
>>>
>>> It's currently under vote; it should be released by the end of the week if it passes.
>>>
>>> Arun
Datanodes that start and then disappear
Hi,

I'm new to Hadoop, so I hope you can help with this problem. I'm trying to set up a small (2-zone) Hadoop cluster on Solaris.

start-dfs.sh runs without error, e.g., it prints the following to the screen:

master: starting datanode, logging to ..
slave: starting datanode, logging to ..

All looks well. However, when I check the DFS admin web page on the master (on port 50070) it says:

Cluster Summary
1 files and directories, 0 blocks = 1 total. Heap Size is 18.5 MB / 448 MB (4%)
Capacity: 0 KB
DFS Remaining: 0 KB
DFS Used: 0 KB
DFS Used%: 0 %
Live Nodes: 0
Dead Nodes: 0

There are no datanodes in the cluster.

I had a look in the datanode logs and they were empty on both master and slave. Running netstat -an on the master shows that it is listening on ports 50070, 50090 and 54310 (I changed fs.default.name to avoid a port conflict). The slave has no hadoop-related ports active, although there is a single com.sun.management.jmxremote process running.

FYI, a single-node pseudo-distributed installation worked fine on the master. I'm running hadoop-0.17.1. I did not run start-mapred.sh.

Advice/suggestions would be very welcome.

Thanks,
B Butler
Re: Distributed Lucene - from hadoop contrib
Hi,

I was able to make a distributed Lucene index using the hadoop.contrib.index code, and then search over that index while it is still in HDFS. I never used Distributed Lucene or Katta. The key is to use the org.apache.hadoop.dfs.DistributedFileSystem class for Lucene (see the code below). I tested this on a Lucene index in a clustered environment, with pieces of the index residing on different machines, and it does query successfully. The search time is fast (although the index is only 262 MB).

I'd like to know if I'm heading down the right path, so my questions are:

* Has anyone tried searching a distributed Lucene index using a method like this before? It seems too easy. Are there any "gotchas" that I should look out for as I scale up to more nodes and a larger index?

* Do you think that going ahead with this approach, which consists of 1) creating a Lucene index using the hadoop.contrib.index code (thanks, Ning!) and 2) leaving that index "in-place" on HDFS and searching over it using the client code below, is a good approach?

* What is the status of the bailey project? It seems to be working on the same type of problem. Should I wait until that project comes out with code?
Here's my code:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.index.lucene.FileSystemDirectory;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class LuceneQuery {
    public static void main(String[] args) throws Exception {
        FileSystem fs = new DistributedFileSystem();
        Configuration conf = new Configuration();

        // master that has the name node (fs.default.name)
        fs.initialize(new URI("hdfs://master:54310"), conf);

        // path to the lucene index directory on the master
        Path path = new Path("/indexlocation/0");
        Directory dir = new FileSystemDirectory(fs, path, false, conf);

        IndexSearcher is = new IndexSearcher(dir);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("searchTerm");
        Hits hits = is.search(query);

        // print out the "id" field of the results
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("id"));
        }
        is.close();
    }
}

Thanks,

Anoop Bhatti
--
Committed to open source technology.

On Tue, Aug 12, 2008 at 7:19 PM, Deepika Khera <[EMAIL PROTECTED]> wrote:
> Thank you for your response.
>
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index
> documents and ii) providing search in a distributed fashion, to be all in
> one box.
>
> So basically, hadoop.contrib.index is used to create lucene indexes in
> a distributed fashion (by creating shards, each shard being a lucene
> instance).
> And then I can use Katta or any other Distributed Lucene
> application to serve lucene indexes distributed over many servers.
>
> Deepika
>
> -----Original Message-----
> From: Ning Li [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 08, 2008 7:08 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Distributed Lucene - from hadoop contrib
>
>> 1) Katta and Distributed Lucene are different projects though, right?
>> Both being based on kind of the same paradigm (Distributed Index)?
>
> The design of Katta and that of Distributed Lucene are quite different
> last time I checked. I pointed out the Katta project because you can
> find the code for Distributed Lucene there.
>
>> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
>> Though, it would be much better if it is integrated with "Distributed
>> Lucene" or the "Katta project" as these are designed keeping the
>> structure and behavior of indexes in mind. Right?
>
> As described in the README file, hadoop.contrib.index uses map/reduce
> to build Lucene instances. It does not contain a component that serves
> queries. If that's not sufficient for you, you can check out the
> designs of Katta and Distributed Index and see which one suits your
> use better.
>
> Ning
RE: HDFS -rmr permissions
Hi Brian,

I believe dfs -rmr does check the permission for each file. What's allowing you to delete other users' data is the trash feature: each user's Trash is expunged by the namenode process, which is a superuser. There is more discussion in http://issues.apache.org/jira/browse/HADOOP-2514

My guess is that what we really need is a 'sticky bit' that won't allow dfs -mv for files/directories under a dir with 777 permission. I couldn't find a Jira, so I opened a new one: https://issues.apache.org/jira/browse/HADOOP-3953

Koji

===
(userB)> hadoop dfs -ls / | grep ' /tmp'
drwxrwxrwx   - knoguchi supergroup  0 2008-08-14 16:47 /tmp
(userB)> hadoop dfs -Dfs.trash.interval=0 -ls /tmp
Found 1 items
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
(userB)> hadoop dfs -Dfs.trash.interval=0 -lsr /tmp
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir/foo1
-rw-r--r--   1 userA users 13 2008-08-14 16:45 /tmp/userA-dir/foo1/a
-rw-r--r--   1 userA users 15 2008-08-14 16:45 /tmp/userA-dir/foo1/b
-rw-r--r--   1 userA users 25 2008-08-14 16:45 /tmp/userA-dir/foo1/c
(userB)> hadoop dfs -Dfs.trash.interval=0 -rmr /tmp/userA-dir
rmr: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=userB, access=ALL, inode="userA-dir":userA:users:rwxr-xr-x
(userB)> hadoop dfs -Dfs.trash.interval=1 -rmr /tmp/userA-dir
Moved to trash: hdfs://ucdev13.inktomisearch.com:47522/tmp/userA-dir
===

-----Original Message-----
From: Brian Karlak [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 07, 2008 11:27 AM
To: core-user@hadoop.apache.org
Cc: Colin Evans
Subject: HDFS -rmr permissions

Hello --

As far as I can tell, "hadoop dfs -rmr" only checks the permissions of the directory to be deleted and its parent. Unlike Unix, however, it does not seem to check the permissions of the directories/files contained within the directory to be deleted. Is this by design? It seems dangerous.
For instance, we have a directory where we want to allow people to deposit common resources for a project. Its permissions need to be 777; otherwise only one person can write to it. But with 777 permissions, any fool can accidentally wipe it. (Of course, if we have /trash set up, accidental deletes are not as big a deal, but still ...)

Thoughts/comments? Is there a way to make -rmr check the permissions of the files within the directories it's deleting, just as Unix does? If not, is this a legit feature request? (I checked JIRA, but I didn't find anything on this ...)

Thanks,
Brian
Re: how to config secondary namenode
You could use the same config you use for the namenode (bcn151). In addition you might want to change these:

fs.checkpoint.dir
fs.checkpoint.period (default is one hour)
dfs.secondary.http.address (if you do not want the default)

Thanks,
Lohit

----- Original Message -----
From: 志远 <[EMAIL PROTECTED]>
To: core-user
Sent: Thursday, August 14, 2008 3:02:15 AM
Subject: how to config secondary namenode

How do I configure a secondary namenode on another machine?

Namenode: bcn151
Secondary namenode: bcn152
Datanodes: hdp1 hdp2

Thanks!
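Putting Lohit's list together, a sketch of what the overrides in hadoop-site.xml on the secondary namenode machine might look like. The directory path and period here are example values, not recommendations; the property names are the ones listed above:

```xml
<!-- Example overrides for the secondary namenode (bcn152). -->
<!-- The path and period values below are illustrative only. -->
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary</value>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds; the default is one hour -->
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>0.0.0.0:50090</value>
</property>
```

In addition, listing bcn152 in conf/masters on the node where you run start-dfs.sh should make the scripts start the secondary namenode there rather than alongside the namenode.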
Re: Dynamically adding datanodes
If the config is right, then this is the procedure to add a new datanode. Do you see any exceptions logged in your datanode log? Run it as a daemon so it logs everything into a file under HADOOP_LOG_DIR:

./bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode

Thanks,
Lohit

----- Original Message -----
From: Kai Mosebach <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, August 14, 2008 1:48:02 AM
Subject: Dynamically adding datanodes

Hi,

how can I add a datanode dynamically to a hadoop cluster without restarting the whole cluster? I was trying to run "hadoop datanode" on the new datanode with the appropriate config (pointing to my correct namenode), but it does not show up there. Is there a way?

Thanks
Kai
Re: When will hadoop version 0.18 be released?
Will this bug (https://issues.apache.org/jira/browse/HADOOP-3781) also be fixed? It makes it impossible to use the distributed jar file with any external application (it works only with a local recompile).

Thibaut

Konstantin Shvachko wrote:
>
> But you won't get append in 0.18. It was committed for 0.19.
> --konstantin
>
> Arun C Murthy wrote:
>>
>> On Aug 12, 2008, at 11:51 PM, 11 Nov. wrote:
>>
>>> Hi colleagues,
>>> As you know, the append writer will be available in version 0.18. We are
>>> here waiting for the feature and want to know the rough time of release.
>>
>> It's currently under vote; it should be released by the end of the week
>> if it passes.
>>
>> Arun
>

--
View this message in context: http://www.nabble.com/When-will-hadoop-version-0.18-be-released--tp18957890p18982483.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
How I should use hadoop to analyze my logs?
Hello,

I'm looking at how Hadoop could support our data-mining applications, and I've come up with a few questions I haven't found answers to yet. Our setup contains multiple diskless webserver frontends which generate log data. Each webserver hit generates a UDP packet which contains basically the same info as a normal Apache access-log line (url, return code, client ip, timestamp etc). The UDP packets are received by a log server. I want to run map/reduce processes on the log data while the servers are generating new data. I was planning for each day to have its own file in HDFS containing all log entries for that day.

How should I use Hadoop and HDFS to write each log entry to a file? I was planning to create a class which contains the request attributes (url, return code, client ip etc) and use this as the value. I did not find any info on how this could be done with HDFS. The API seems to support arbitrary objects as both key and value, but there was no example of how to do this.

How will Hadoop handle concurrency between the writes and the reads? The servers will generate log entries around the clock, and I also want to analyse the log entries while the servers are generating new data. How can I do this? The HDFS architecture page says that the client writes the data first into a local file, and once the file reaches the block size, the block is transferred to the HDFS storage nodes while the client writes the following data to another local file. Is it possible for map/reduce processes to read the blocks already transferred to HDFS while new blocks are being written to the same file?

Thanks in advance,

- Juho Mäkinen
Dynamically adding datanodes
Hi,

how can I add a datanode dynamically to a hadoop cluster without restarting the whole cluster? I was trying to run "hadoop datanode" on the new datanode with the appropriate config (pointing to my correct namenode), but it does not show up there. Is there a way?

Thanks
Kai