how to run uima-as in hadoop

2009-04-20 Thread ykj
Hello everyone, I am new to Hadoop. There are few topics about running UIMA-AS in Hadoop. On the UIMA website there is one article talking about this, but it is very general. I would appreciate it very much if anyone with experience running UIMA-AS in Hadoop could illustrate it with

Re: UIMA scale-out using Hadoop, number of map tasks

2009-04-20 Thread ykj
Can you tell me how to deploy UIMA on Hadoop? Thanks in advance, Jack -- View this message in context: http://www.nabble.com/UIMA-scale-out-using-Hadoop%2C-number-of-map-tasks-tp19010118p23131414.html Sent from the Hadoop core-user mailing list archive at Nabble.com.

Re: Are SequenceFiles split? If so, how?

2009-04-20 Thread Jim Twensky
In addition to what Aaron mentioned, you can configure the minimum split size in hadoop-site.xml to have smaller or larger input splits depending on your application. -Jim On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball aa...@cloudera.com wrote: Yes, there can be more than one InputSplit per
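The split-size knob Jim mentions maps, in 0.19-era Hadoop, to the `mapred.min.split.size` property. A sketch of the hadoop-site.xml entry (the 128 MB value is illustrative, not a recommendation):

```xml
<property>
  <name>mapred.min.split.size</name>
  <!-- Minimum split size in bytes; 134217728 = 128 MB (illustrative) -->
  <value>134217728</value>
</property>
```

Raising this yields fewer, larger splits and therefore fewer map tasks; lowering it has the opposite effect, bounded below by the file's block boundaries.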

Re: max value for a dataset

2009-04-20 Thread Shevek
On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote: The traditional approach would be a Mapper class that maintained a member variable in which you kept the max-value record, and in the close method of your mapper you output a single record containing that value. Perhaps you can forgive the

Re: max value for a dataset

2009-04-20 Thread jason hadoop
The Hadoop Framework requires that a Map Phase be run before the Reduce Phase. By doing the initial 'reduce' in the map, a much smaller volume of data has to flow across the network to the reduce tasks. But yes, this could simply be done by using an IdentityMapper and then have all of the work
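The map-side pre-aggregation Jason describes can be illustrated in plain Java, with the Hadoop API stripped away (class and method names here are illustrative, not Hadoop's): each "mapper" keeps a running max over its split and emits a single value at close time, and the lone "reducer" then takes the max of the per-split maxima.

```java
// Plain-Java sketch of map-side max aggregation (no Hadoop API;
// names are illustrative). Each split contributes one value to the reduce.
public class MaxExample {

    // "Mapper": scan one input split, keep only its maximum
    // (the single record emitted from close()).
    static long splitMax(long[] split) {
        long max = Long.MIN_VALUE;
        for (long v : split) {
            if (v > max) max = v;
        }
        return max;
    }

    // "Reducer": a single task combining the per-split maxima.
    static long globalMax(long[][] splits) {
        long max = Long.MIN_VALUE;
        for (long[] split : splits) {
            max = Math.max(max, splitMax(split));
        }
        return max;
    }

    public static void main(String[] args) {
        long[][] splits = { {3, 17, 5}, {42, 8}, {11, 29} };
        System.out.println(globalMax(splits)); // prints 42
    }
}
```

Only one long per split crosses the "network" boundary, which is the point of doing the first reduce in the map.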

Re: Performance question

2009-04-20 Thread Jean-Daniel Cryans
Mark, There is a setup price when using Hadoop: for each task a new JVM must be spawned. On such a small scale, you won't see any benefit using MR. J-D On Mon, Apr 20, 2009 at 12:26 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I ran a Hadoop MapReduce task in the local mode, reading and

Not able to subscribe to pig user/dev mailing list

2009-04-20 Thread Pallavi Palleti
Hi all, I am not able to subscribe to the pig mailing list (both dev and user). Here is the error message that I got when I tried to confirm the subscription. Your message did not reach some or all of the intended recipients. Subject:

Re: Performance question

2009-04-20 Thread Mark Kerzner
Jean-Daniel, I realize that, and my question was, is this the normal setup/finishup time, about 2 minutes? If it is, then fine. I would expect that on tasks taking 10-15 minutes, 2 minutes would be totally justified, and I think that this is the guideline - each task should take minutes. Thank

RE: Not able to subscribe to pig user/dev mailing list

2009-04-20 Thread Palleti, Pallavi
Hi all, Sorry for posting in the wrong group. When I clicked on nabble in the PIG mailing list page (http://hadoop.apache.org/pig/mailing_lists.html), it redirected me to this mailing list. Unaware of this, I posted in the redirected group. Thanks Pallavi -Original Message- From: Pallavi

Re: Performance question

2009-04-20 Thread Jean-Daniel Cryans
Mark, Oh sorry, yes you should expect that kind of delay. A tip to optimize that on big jobs with lots of tasks is to use the JobConf.setNumTasksToExecutePerJvm(int numTasks) which sets how many times a JVM can be reused (instead of spawning new ones). Happy Hadooping! J-D On Mon, Apr 20, 2009
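The programmatic call J-D mentions has a declarative counterpart, the `mapred.job.reuse.jvm.num.tasks` property. A hedged hadoop-site.xml sketch (values per 0.19-era Hadoop):

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1: reuse a JVM for any number of tasks of the same job;
       the default of 1 spawns a fresh JVM per task -->
  <value>-1</value>
</property>
```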

What's the difference of RawLocalFileSystem and LocalFileSystem?

2009-04-20 Thread Xie, Tao
I am new to Hadoop and am now beginning to look into the code. I want to know the difference between RawLocalFileSystem and LocalFileSystem. I know the latter one has the capability to do checksums. Is that all? Thanks.

Re: Are SequenceFiles split? If so, how?

2009-04-20 Thread Barnet Wagman
Thanks Aaron, that really helps. I probably do need to control the number of splits. My input 'data' consists of Java objects and their size (in bytes) doesn't necessarily reflect the amount of time needed for each map operation. I need to ensure that I have enough map tasks so that all

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Matthew Hall
Same here, sadly there isn't much call for Lucene user groups in Maine. It would be nice though ^^ Matt Amin Mohammed-Coleman wrote: I would love to come but I'm afraid I'm stuck in rainy old England :( Amin On 18 Apr 2009, at 01:08, Bradford Stephens bradfordsteph...@gmail.com wrote:

Re: What's the difference of RawLocalFileSystem and LocalFileSystem?

2009-04-20 Thread Arun C Murthy
On Apr 20, 2009, at 7:49 PM, Xie, Tao wrote: I am new to hadoop and now begin to look into the code. I want to know the difference between RawLocalFileSystem and LocalFileSystem. I know the latter one has the capability to do checksum. Is that all? Pretty much. Arun
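For context, the checksum layer that LocalFileSystem adds on top of RawLocalFileSystem is, conceptually, a CRC32 computed per fixed-size chunk of file data and stored in a hidden `.crc` side file. A minimal plain-Java illustration of the per-chunk idea (this is a sketch, not Hadoop's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of per-chunk checksumming, the idea behind LocalFileSystem's
// .crc side files (not Hadoop's actual code; chunk size is illustrative).
public class ChecksumSketch {

    // Compute a CRC32 for each fixed-size chunk of the data.
    static long[] chunkChecksums(byte[] data, int bytesPerChecksum) {
        int n = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xCBF43926 is the well-known CRC32 check value for "123456789".
        System.out.println(Long.toHexString(chunkChecksums(data, 512)[0])); // prints cbf43926
    }
}
```

On read-back, recomputing the chunk CRC and comparing it against the stored one is what lets LocalFileSystem detect on-disk corruption that RawLocalFileSystem would silently pass through.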

Re: max value for a dataset

2009-04-20 Thread Edward Capriolo
Yes, I considered Shevek's tactic as well, but as Jason pointed out, emitting the entire data set just to find the maximum value would be wasteful. You do not want to sort the dataset; you just want to break it into parts, find the max value of each part, then bring it into one part and perform

Re: Performance question

2009-04-20 Thread Arun C Murthy
On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote: Hi, I ran a Hadoop MapReduce task in the local mode, reading and writing from HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took 1/2 minute. Is this to be expected? Hmm...

Re: Performance question

2009-04-20 Thread Mark Kerzner
Arun, thank you very much for the answer. I will turn off the combiner. I am debugging intermediate MR steps now, so I am mostly interested in performance for this, and real tuning will be later, in a cluster. I am running 18.3, but general pointers should be good enough at this stage. I am

Re: ebook resources - including lucene in action

2009-04-20 Thread Grant Ingersoll
Lest you think silence equals acceptance... This is not appropriate use of these lists. -Grant On Apr 19, 2009, at 11:58 PM, wu fuheng wrote: welcome to download http://www.ultraie.com/admin/flist.php

Multiple outputs and getmerge?

2009-04-20 Thread Stuart White
I've written a MR job with multiple outputs. The normal output goes to files named part-X and my secondary output records go to files I've chosen to name ExceptionDocuments (and therefore are named ExceptionDocuments-m-X). I'd like to pull merged copies of these files to my local
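Worth noting: `hadoop fs -getmerge <hdfs-dir> <local-file>` concatenates every file under a directory, so isolating only the ExceptionDocuments-m-* files may require copying them to their own directory first. The merge step itself is plain concatenation, as this local simulation shows (paths and file contents are illustrative):

```shell
# Simulate what getmerge does, locally: concatenate part files in name order.
mkdir -p /tmp/mo_demo
printf 'doc1\n' > /tmp/mo_demo/ExceptionDocuments-m-00000
printf 'doc2\n' > /tmp/mo_demo/ExceptionDocuments-m-00001
cat /tmp/mo_demo/ExceptionDocuments-m-* > /tmp/mo_demo/merged.txt
cat /tmp/mo_demo/merged.txt
```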

Re: max value for a dataset

2009-04-20 Thread Brian Bockelman
Hey Jason, Wouldn't this be avoided if you used a combiner to also perform the max() operation? A minimal amount of data would be written over the network. I can't remember if the map output gets written to disk first, then combine applied or if the combine is applied and then the data

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Bradford Stephens
Thanks for the responses, everyone. Where shall we host? My company can offer space in our building in Factoria, but it's not exactly a 'cool' or 'fun' place. I can also reserve a room at a local library. I can bring some beer and light refreshments. On Mon, Apr 20, 2009 at 7:22 AM, Matthew Hall

FileSystem.listStatus() doesn't return list of files in hdfs directory

2009-04-20 Thread Praveen Patnala
Hi, I have a single-node hadoop cluster. The hadoop version - [patn...@ac4-dev-ims-211]~/dev/hadoop/hadoop-0.19.1% hadoop version Hadoop 0.19.1 Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977 Compiled by ndaley on Fri Feb 20 00:16:34 UTC 2009 Following

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Ben Hardy
I might be in Seattle in the near future (currently in Los Angeles). When were you thinking of having this? On Mon, Apr 20, 2009 at 4:28 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Thanks for the responses, everyone. Where shall we host? My company can offer space in our building

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Lauren Cooney
If you guys are interested in space over in Redmond, I can see if MSFT can host. Let me know... Lauren On Mon, Apr 20, 2009 at 4:28 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Thanks for the responses, everyone. Where shall we host? My company can offer space in our building in

Put computation in Map or in Reduce

2009-04-20 Thread Mark Kerzner
Hi, in an MR step, I need to extract text from various files (using Tika). I have put text extraction into reduce(), because I am writing the extracted text to the output on HDFS. But now it occurs to me that I might as well have put it into map() and have default reduce() which will write every

Re: Put computation in Map or in Reduce

2009-04-20 Thread Stuart White
Unless you need the hashing/sorting provided by the reduce phase, I'd recommend placing your logic in your mapper and, when setting up your job, calling JobConf#setNumReduceTasks(0), so that the reduce phase won't be executed. In that case, any records emitted by your mapper will be written to
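The JobConf#setNumReduceTasks(0) call Stuart mentions can also be expressed in the job's configuration; a hedged sketch of the equivalent property (0.19-era name):

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <!-- 0 reducers: map output is written directly to the output path,
       skipping the sort/shuffle phase entirely -->
  <value>0</value>
</property>
```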

Copying files from HDFS to remote database

2009-04-20 Thread Parul Kudtarkar
Our application is using hadoop to parallelize jobs across an ec2 cluster. HDFS is used to store output files. How would you ideally copy output files from HDFS to remote databases? Thanks, Parul V. Kudtarkar

Re: hadoop-a small doubt

2009-04-20 Thread Parul Kudtarkar
What is the exact purpose for which you want a system outside the Hadoop cluster to access the namenode or datanode? If it is simply to write data to HDFS from the local system and then copy data back from HDFS to the local system, simply use the Hadoop file system's shell commands. Hope this helps! deepya wrote: