Re: graphical tool for hadoop mapreduce

2009-06-26 Thread Mark Kerzner
Tom, this is so much right on time! Bravo, Karmasphere. I installed the plugins, and nothing crashed - in fact, I get the same screens as the manual promises. It is worth reading this group - they released the plugin two days ago. Mark On Fri, Jun 26, 2009 at 10:13 AM, Tom Wheeler wrote: > Alt

Pregel

2009-06-25 Thread Mark Kerzner
Hi all, my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. In other words, Pregel is the next natural step for massively scalable computations after Hadoop. And, as with MapReduce, Google will talk about the technology but not give out the code im

Put computation in Map or in Reduce

2009-04-20 Thread Mark Kerzner
Hi, in an MR step, I need to extract text from various files (using Tika). I have put text extraction into reduce(), because I am writing the extracted text to the output on HDFS. But now it occurs to me that I might as well have put it into map() and have default reduce() which will write every m
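
If the extraction moves into map(), the job can skip reduce entirely and map output goes straight to HDFS. A minimal sketch against the 0.18-era API; TikaExtractor is a hypothetical stand-in for the Tika call, and the input is assumed to be a SequenceFile of (path, contents) pairs:

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ExtractTextMapper extends MapReduceBase
            implements Mapper<Text, BytesWritable, Text, Text> {
        public void map(Text path, BytesWritable contents,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // TikaExtractor is hypothetical; only the first getLength() bytes are valid
            String text = TikaExtractor.extract(contents.getBytes(), contents.getLength());
            output.collect(path, new Text(text));
        }
    }

    // Driver: with zero reducers, map output is written directly to the output path.
    JobConf job = new JobConf(ExtractTextMapper.class);
    job.setNumReduceTasks(0);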

Re: Performance question

2009-04-20 Thread Mark Kerzner
k you for the link - I wish I were at the conference! Anyway, at this level I have to get my hands dirty, re-read both Hadoop books, and other articles. Cheers, Mark On Mon, Apr 20, 2009 at 10:24 AM, Arun C Murthy wrote: > > On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote: > >

Re: Performance question

2009-04-20 Thread Mark Kerzner
you, Mark On Mon, Apr 20, 2009 at 7:42 AM, Jean-Daniel Cryans wrote: > Mark, > > There is a setup price when using Hadoop, for each task a new JVM must > be spawned. On such a small scale, you won't see any good using MR. > > J-D > > On Mon, Apr 20, 2009 at 12:26 AM,

Performance question

2009-04-19 Thread Mark Kerzner
Hi, I ran a Hadoop MapReduce task in the local mode, reading and writing from HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took 1/2 minute. Is this to be expected? It seemed that the system spent most of the time in the MapReduce operat

Re: Broder or other near-duplicate algorithms?

2009-03-24 Thread Mark Kerzner
Yi-Kai, that's good to know - and I have read this article - but is your code available? Thank you, Mark On Tue, Mar 24, 2009 at 9:51 AM, Yi-Kai Tsai wrote: > hi Mark > > we had done something on top of hadoop/hbase (mapreduce for evaluation , > hbase for online serving ) > by reference http:/

Broder or other near-duplicate algorithms?

2009-03-23 Thread Mark Kerzner
Hi, does anybody know of an open-source implementation of the Broder algorithm in Hadoop? Monika Henzinger reports having done so in MapReduce, and I wonder if somebody has repeated her wor

Re: Will Hadoop help for my application?

2009-03-19 Thread Mark Kerzner
My feeling is that JavaSpaces could be a good choice. Here is my plan: - Have one machine running JavaSpaces (using GigaSpaces free community version), put the data in there, with a small object to keep the starting point; - Each worker machine reads the Space (all workers can read at t

Re: Cloudera's Distribution for Hadoop

2009-03-16 Thread Mark Kerzner
Christophe, if you do .deb, I will be the first one to try. As it is, I am second :) Mark On Mon, Mar 16, 2009 at 7:42 PM, Christophe Bisciglia < christo...@cloudera.com> wrote: > Hey Hadoop Fans, > > It's been a crazy week here at Cloudera. Today we launched our > Distribution for Hadoop. This

Temporary files for mappers and reducers

2009-03-15 Thread Mark Kerzner
Hi, what would be the best place to put temporary files for a reducer? I believe that since each reducer works on its own machine, at its own time, one can do anything, but I would like a confirmation from the experts. Thanks, Mark
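
For what it's worth, each task attempt runs in its own local working directory (under mapred.local.dir), and the framework removes it when the task finishes, so ordinary java.io temp files are safe there. A minimal sketch, assuming plain scratch space inside reduce():

    import java.io.File;

    // "." is the task attempt's private working directory;
    // Hadoop deletes it when the attempt completes.
    File scratch = File.createTempFile("reduce-scratch", ".tmp", new File("."));
    scratch.deleteOnExit();  // redundant with the framework cleanup, but cheap insurance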

Creating Lucene index in Hadoop

2009-03-12 Thread Mark Kerzner
Hi, How do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
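
HDFS allows only a single writer per file, so nodes cannot share one index; the usual pattern (the one Nutch and Hadoop's contrib/index follow) is one index shard per reducer, built on local disk and copied into HDFS afterwards. A rough sketch; buildLuceneIndex and shardId are hypothetical placeholders:

    import java.io.File;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // In each reducer: build the shard locally, then publish it to HDFS.
    File localShard = new File("index-shard");     // task-local scratch directory
    buildLuceneIndex(localShard, values);          // hypothetical: your IndexWriter logic
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localShard.getPath()),
                         new Path("/indexes/shard-" + shardId));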

Re: OT: How to search mailing list archives?

2009-03-08 Thread Mark Kerzner
Yes, that is definitely the coolest of them all On Sun, Mar 8, 2009 at 5:11 PM, Jeff Hammerbacher wrote: > I like MarkMail's excellent service: http://hadoop.markmail.org. > > On Sun, Mar 8, 2009 at 2:54 PM, Iman wrote: > > > You might also want to try the mail archive: > > http://www.mail-archi

Re: Avoiding Ganglia NPE on EC2

2009-03-05 Thread Mark Kerzner
News from ScaleUnlimited bootcamp - where I am now - use hadoop-0.17.2.1 On Thu, Mar 5, 2009 at 3:53 PM, Stuart Sierra wrote: > Hi all, > > I'm getting this NPE on Hadoop 0.18.3, using the EC2 contrib scripts: > >Exception in thread "Timer thread for monitoring dfs" > java.lang.NullPointerEx

Re: Thanks to Christophe for Hadoop Featured Pod Cast

2009-03-01 Thread Mark Kerzner
Thank you for pointing this out! Mark On Sun, Mar 1, 2009 at 9:40 PM, Brock Palen wrote: > Just want to thank Christophe Bisciglia for taking some time out to speak > with us about Hadoop on our podcast Research Computing and Engineering ( > www.rce-cast.com) > > You can find the Hadoop episode

Re: How does NVidia GPU compare to Hadoop/MapReduce

2009-02-27 Thread Mark Kerzner
It does not handle co-ordination of multiple computers, e.g., the flow > of data in and out of a distributed filesystem, distributed reliability, > global computations, etc. > > So you might use CUDA within mapreduce to more efficiently run > compute-intensive tasks over petabytes of da

Re: hdfs disappears

2009-02-23 Thread Mark Kerzner
Exactly the same thing happened to me, and Brian gave the same answer. What if the default is changed to the user's home directory somewhere? On Mon, Feb 23, 2009 at 10:05 PM, Brian Bockelman wrote: > Hello, > > Where are you saving your data? If it's being written into /tmp, it will > be delete

Re: Can never restart HDFS after a day or two

2009-02-17 Thread Mark Kerzner
; > Computer Science Graduate Student > > University of California, Santa Cruz > > > > > > On Mon, Feb 16, 2009 at 8:11 PM, Mark Kerzner > > wrote: > > > > > Hi all, > > > > > > I consistently have this problem that I can run HDFS and

Can never restart HDFS after a day or two

2009-02-16 Thread Mark Kerzner
Hi all, I consistently have this problem that I can run HDFS and restart it after short breaks of a few hours, but the next day I always have to reformat HDFS before the daemons begin to work. Is that normal? Maybe this is treated as temporary data, and the results need to be copied out of HDFS a
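
The usual culprit (the same one Brian points to in the "hdfs disappears" thread above) is that dfs.name.dir and dfs.data.dir default to directories under /tmp, which the OS may clear between sessions. A hedged hadoop-site.xml sketch; the paths are illustrative:

    <property>
      <name>dfs.name.dir</name>
      <value>/home/mark/hdfs/name</value>   <!-- anywhere outside /tmp -->
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/mark/hdfs/data</value>
    </property>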

Re: Namenode not listening for remote connections to port 9000

2009-02-13 Thread Mark Kerzner
I had a problem that it listened only on 8020, even though I told it to use 9000 On Fri, Feb 13, 2009 at 7:50 AM, Norbert Burger wrote: > On Fri, Feb 13, 2009 at 8:37 AM, Steve Loughran wrote: > > > Michael Lynch wrote: > > > >> Hi, > >> > >> As far as I can tell I've followed the setup instruct

Re: "Too many open files" in 0.18.3

2009-02-12 Thread Mark Kerzner
I once had "too many open files" when I was opening too many sockets and not closing them... On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp wrote: > Hi all, > I'm continually running into the "Too many open files" error on 18.3: > > DataXceiveServer: java.io.IOException: Too many open files > > >

Re: what's going on :( ?

2009-02-12 Thread Mark Kerzner
ht want to see if you have any Hadoop processes > > running and terminate them (bin/stop-all.sh should help) and then > > restart your cluster with the new configuration to see if that helps. > > > > Later, > > Jeff > > > > On Mon, Feb 9, 2009 at 9:48 PM, Amar Ka

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
I say, that's very interesting and useful. On Tue, Feb 10, 2009 at 11:37 PM, Brian Bockelman wrote: > Just to toss out some numbers (and because our users are making > interesting numbers right now) > > Here's our external network router: > http://mrtg.unl.edu/~cricket/?target=%2Frouter-inter

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
at 11:09 PM, Mark Kerzner wrote: > > Brian, large files using command-line hadoop go fast, so it is something >> about my computer or network. I won't worry about this now, especially in >> light of Amit reporting fast writes and reads. >> > > You're creating fi

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
Brian, large files using command-line hadoop go fast, so it is something about my computer or network. I won't worry about this now, especially in light of Amit reporting fast writes and reads. Mark On Tue, Feb 10, 2009 at 5:00 PM, Brian Bockelman wrote: > > On Feb 10, 2009, at 4:

could this be an error in hadoop documentation or a bug

2009-02-10 Thread Mark Kerzner
Hi, the Quick Start has this sample configuration fs.default.name hdfs://localhost:9000 but it does not seem to work: even though the daemons do listen to 9000, the following command always uses 8020 hadoop fs -ls hdfs://localho
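
(The likely explanation, for what it's worth: 8020 is the NameNode's compiled-in default port, used whenever the URI on the command line omits an explicit port, so a bare hdfs://localhost dials 8020 no matter what fs.default.name says. Spelling out the port, hdfs://localhost:9000/, should line up with the daemons.)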

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
Brian, I have a similar question: why does transfer from a local filesystem to SequenceFile take so long (about 1 second per Meg)? Thank you, Mark On Tue, Feb 10, 2009 at 4:46 PM, Brian Bockelman wrote: > > On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote: > > Hi, >> Could someone help me to fin

what's going on :( ?

2009-02-09 Thread Mark Kerzner
Hi, why is hadoop suddenly telling me Retrying connect to server: localhost/127.0.0.1:8020 with this configuration fs.default.name hdfs://localhost:9000 mapred.job.tracker localhost:9001 dfs.replication 1 and both this http://localhost:50070/dfs
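
The flattened settings above presumably came from a hadoop-site.xml along these lines; if the client still retries 8020 (the compiled-in default), the usual suspect is that this file is not in the conf directory the hadoop script actually reads:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>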

Re: using HDFS for a distributed storage system

2009-02-09 Thread Mark Kerzner
It is a good and useful overview, thank you. It also mentions Stuart Sierra's post, where Stuart notes that the process is slow. Does anybody know why? I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by the To

Re: can't read the SequenceFile correctly

2009-02-06 Thread Mark Kerzner
e#getBytes() to use. > > Tom > > On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner > wrote: > > Hi, > > > > I have written binary files to a SequenceFile, seemingly successfully, > but > > when I read them back with the code below, after the first few reads I g

can't read the SequenceFile correctly

2009-02-05 Thread Mark Kerzner
Hi, I have written binary files to a SequenceFile, seemingly successfully, but when I read them back with the code below, after the first few reads I get the same number of bytes for the different files. What could go wrong? Thank you, Mark reader = new SequenceFile.Reader(fs, path, con
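
The classic trap here, and what Tom's reply in the entry above is pointing at, is that BytesWritable#getBytes() returns the whole backing buffer: only the first getLength() bytes belong to the current record, and since the reader reuses the value object, every file appears to have the same size. A minimal read-loop sketch, assuming Text keys and BytesWritable values:

    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
        // getBytes() is the reused buffer; trim it to the record's true length
        byte[] fileBytes = Arrays.copyOf(value.getBytes(), value.getLength());
        // ... process fileBytes ...
    }
    reader.close();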

slow writes to HDFS

2009-02-05 Thread Mark Kerzner
Hi all, I am writing to HDFS with this simple code:

    File[] files = new File(fileDir).listFiles();
    for (File file : files) {
        key.set(file.getPath());
        byte[] bytes = new FileUtil().readCompleteFile(file);
        System.out.println(file

copying binary files to a SequenceFile

2009-02-04 Thread Mark Kerzner
Hi all, I am copying regular binary files to a SequenceFile, and I am using BytesWritable, to which I am giving all the byte[] content of the file. However, once it hits a file larger than my computer's memory, it may have problems. Is there a better way? Thank you, Mark
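
One way around the memory ceiling is to stream each large file into the SequenceFile as fixed-size chunks instead of one giant value. A sketch, assuming a key convention of path plus chunk number so the pieces can be reassembled later:

    import java.io.FileInputStream;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;

    byte[] buffer = new byte[64 * 1024 * 1024];   // 64 MB chunks; tune to taste
    BytesWritable value = new BytesWritable();
    FileInputStream in = new FileInputStream(file);
    try {
        int n, chunk = 0;
        while ((n = in.read(buffer)) > 0) {
            value.set(buffer, 0, n);              // only the bytes actually read
            writer.append(new Text(file.getPath() + "#" + chunk++), value);
        }
    } finally {
        in.close();
    }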

Book: Hadoop-The Definitive Guide

2009-02-02 Thread Mark Kerzner
Hi, I am going through examples in this book (which I have obtained as an early draft from Safari), and they all work, with occasional fixes. However, the SequenceFileWriteDemo, even though it works without an error, does not show the created file when I use this command hadoop fs -ls / I remember r

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
local program to write several block compressed SequenceFiles > in parallel (to HDFS), each containing a portion of the files on your > PC. > > Tom > > On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner > wrote: > > Truly, I do not see any advantage to doing this, as opposed
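
Tom's suggestion above amounts to sharding the upload on the client side. A sketch with a plain thread pool, each thread owning its own block-compressed SequenceFile; filesFor and writeAllTo are hypothetical helpers for splitting the file list and running the copy loop, and a final FileSystem fs and Configuration conf are assumed in scope:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    int shards = 4;   // roughly one per disk/core feeding the network
    ExecutorService pool = Executors.newFixedThreadPool(shards);
    for (int i = 0; i < shards; i++) {
        final int shard = i;
        pool.execute(new Runnable() {
            public void run() {
                try {
                    SequenceFile.Writer w = SequenceFile.createWriter(fs, conf,
                            new Path("/upload/part-" + shard),
                            Text.class, BytesWritable.class,
                            SequenceFile.CompressionType.BLOCK);
                    writeAllTo(w, filesFor(shard));   // hypothetical per-shard copy loop
                    w.close();
                } catch (IOException e) {
                    e.printStackTrace();              // a real client would retry or abort
                }
            }
        });
    }
    pool.shutdown();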

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
so you can clear out the sprawl > > flip > > On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner > wrote: > > > Hi, > > > > I am writing an application to copy all files from a regular PC to a > > SequenceFile. I can surely do this by simply recursing all director

best way to copy all files from a file system to hdfs

2009-02-01 Thread Mark Kerzner
Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's book seems to imply that it will have to be a custom a

HDFS formatting

2009-02-01 Thread Mark Kerzner
Hi, every time I start the HDFS daemons, I need to format it first with hadoop namenode -format Why is this? I would expect to have to format it just once. Thank you, Mark

Re: settin JAVA_HOME...

2009-01-30 Thread Mark Kerzner
ult-java/bin/java: No such file or > directory > bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or > directory > bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot > execute: No such file or directory > a...@node0:~/Hadoop/hadoop-0.19

Re: settin JAVA_HOME...

2009-01-30 Thread Mark Kerzner
You set it in the conf/hadoop-env.sh file, with an entry like this export JAVA_HOME=/usr/lib/jvm/default-java Mark On Fri, Jan 30, 2009 at 3:49 PM, zander1013 wrote: > > hi, > > i am new to hadoop. i am trying to set it up for the first time as a single > node cluster. at present the snag is th

Re: Finding longest path in a graph

2009-01-29 Thread Mark Kerzner
Oh, hail to the creator of Luke! Mark On Thu, Jan 29, 2009 at 11:20 AM, Andrzej Bialecki wrote: > Hi, > > I'm looking for advice. I need to process a directed graph encoded as a > list of pairs. The goal is to compute a list of longest paths in > the graph. There is no guarantee that the grap

Re: Finding longest path in a graph

2009-01-29 Thread Mark Kerzner
Andrzej, without a deeper understanding of exactly what you are doing, I have a gut feeling that a different distributed system might be a better fit for this specific task. I assume you are dealing with very large graphs if you are using Hadoop, and you want grid processing. But the linear nature o

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
LASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME* > > > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner > wrote: > > > Thank you, Doug, then all is clear in my head. > > Mark > > > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting > wrote:

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Thank you, Doug, then all is clear in my head. Mark On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting wrote: > Mark Kerzner wrote: > >> Okay, I am convinced. I only noticed that Doug, the originator, was not >> happy about it - but in open source one has to give up control sometim

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
> so > you don't have to compress data in your code. > > Most of the time, compression not only saves disk space but improves > performance because there's less data to write. > > Andy > > On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner >wrote: > > > Doug, &g

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Doug, SequenceFile looks like a perfect candidate to use in my project, but are you saying that I better use uncompressed data if I am not interested in saving disk space? Thank you, Mark On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting wrote: > Philip (flip) Kromer wrote: > >> Heretrix
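
As the replies in this thread note, compression usually pays for itself even when disk space is not the goal, because less data crosses the disk and the wire; block compression is the common choice for SequenceFiles of binary records. A minimal writer sketch:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    // BLOCK compresses batches of records together (usually the best ratio);
    // RECORD compresses each value alone; NONE stores raw bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/files.seq"),
            Text.class, BytesWritable.class, CompressionType.BLOCK);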

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
ile format which someone referenced previously on > this list > > ? > > Brian > > > On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote: > > Thank you, Jason, this is awesome information. I am going to use a >> balanced >> directory tree structure, and I am goin

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
re > out a delimiter is ok, and you really cannot have some delimiters? > Like "X"? And in the worst case, or if performance is not > really a matter, maybe just encode all binary to and from ascii? > > On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner > wrote: &

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
r with 2 million blocks on a datanode, under XFS centos (i686) > 5.1 stock kernels would take 21 minutes with noatime, on a 6 disk raid 5 > array. 8way 2.5ghz xeons 8gig ram. Raid controller was a PERC and the > machine basically served hdfs. > > > On Sun, Jan 25, 2009 at 1:49 PM,

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
of > small files? > > On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner > wrote: > > > > Hi, > > > > there is a performance penalty in Windows (pardon the expression) if you > put > > too many files in the same directory. The OS becomes very slow, stops >

Re: HDFS - millions of files in one directory?

2009-01-24 Thread Mark Kerzner
, one > per > line, into a flat file. > > A distributed database is probably the correct answer, but this is working > quite well for now and even has some advantages. (No-cost replication from > work to home or offline by rsync or thumb drive, for example.) > > flip > >

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
les in the directory, you might notice CPU > penalty (for many loads, higher CPU on NN is not an issue). This is mainly > because HDFS does a binary search on files in a directory each time it > inserts a new file. > > If the directory is relatively idle, then there is no penalty.

HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance

Re: How-to in MapReduce

2009-01-23 Thread Mark Kerzner
ers, > > Tim > > > On Fri, Jan 23, 2009 at 10:11 PM, Mark Kerzner > wrote: > > Hi, esteemed group, > > how would I form Maps in MapReduce to recursively look at every file in a > > directory, and do something to this file, such as produce a PDF or > compute &

Re: hadoop consulting?

2009-01-23 Thread Mark Kerzner - SHMSoft
Christophe, I am writing my first Hadoop project now, and I have 20 years of consulting, and I am in Houston. Here is my resume, http://markkerzner.googlepages.com. I have used EC2. Sincerely, Mark On Fri, Jan 23, 2009 at 4:04 PM, Christophe Bisciglia < christo...@cloudera.com> wrote: > Hey al

How-to in MapReduce

2009-01-23 Thread Mark Kerzner
Hi, esteemed group, how would I form Maps in MapReduce to recursively look at every file in a directory, and do something to this file, such as produce a PDF or compute its hash? For that matter, Google builds its index using MapReduce, or so the papers say. First the crawlers store all the files.
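
For the recursion half of the question, a small helper that walks a directory tree through the FileSystem API (HDFS or, via LocalFileSystem, a local disk) and collects every file; the map step can then take one path per record. A sketch against the 0.18-era API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Recursively collect every file under dir into out. */
    static void listFiles(FileSystem fs, Path dir, List<Path> out) throws IOException {
        for (FileStatus stat : fs.listStatus(dir)) {
            if (stat.isDir()) {
                listFiles(fs, stat.getPath(), out);   // descend into subdirectories
            } else {
                out.add(stat.getPath());              // a regular file: record it
            }
        }
    }

    // Usage: gather everything under /input, then feed the list to the job.
    List<Path> all = new ArrayList<Path>();
    listFiles(FileSystem.get(conf), new Path("/input"), all);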

Archive?

2009-01-22 Thread Mark Kerzner
Hi, is there an archive of the messages? I am a newcomer, granted, but Google Groups has all the discussion capabilities, and it has a searchable archive. It is strange to have just a mailing list. Am I missing something? Thank you, Mark

Re: Hadoop with many input/output files?

2009-01-22 Thread Mark Kerzner
I have a very similar question: how do I recursively list all files in a given directory, to the end that all files are processed by MapReduce? If I just copy them to the output, let's say, is there any problem dropping them all in the same output directory in HDFS? To use a bad example, Windows ch