Re: HBase and MapReduce data locality

2012-08-28 Thread Robert Dyer
Ok, but does that imply that only 1 of your compute nodes is guaranteed to have all of the data for any given row? The blocks will replicate, but they don't necessarily all replicate to the same nodes, right? So if I have, say, 2 column families (cf1, cf2) and there are 2 physical files on the HDFS for

Re: MRBench Maps strange behaviour

2012-08-28 Thread Hemanth Yamijala
Hi, The number of maps specified for any MapReduce program (including those that are part of MRBench) is generally only a hint, and the actual number of maps will in typical cases be determined by the amount of data being processed. You can take a look at this wiki link to understand more: http://wiki.apac
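
For context, in the old (mapred) API the hint is given as below; the actual count still comes from the InputFormat's computed input splits. A minimal sketch, with MyJob as a placeholder class:

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class); // MyJob is illustrative
    conf.setNumMapTasks(10); // only a hint; actual maps = number of input splits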

Re: HBase and MapReduce data locality

2012-08-28 Thread N Keywal
Hi, Locations are per block (a file is a set of blocks, and a block is replicated on multiple HDFS datanodes). We have locality in HBase because HDFS datanodes are deployed on the same boxes as the HBase regionservers, and HDFS writes one replica of each block on the datanode on the same machine as the clien

Re: How to reduce total shuffle time

2012-08-28 Thread Gaurav Dasgupta
Hi, Thanks for your replies. I will try working on the recommended suggestions and provide feedback. Abhi, In the JobTracker Web UI -> Job Tracker History, go to the specific job, then to the Reduce Task List, and enter the first reduce task attempt. There you can see the start time. It is the time when

HBase and MapReduce data locality

2012-08-28 Thread Robert Dyer
I have been reading up on HBase and my understanding is that the physical files on HDFS are split first by region and then by column family. Thus each column family has its own physical file (on a per-region basis). If I run a MapReduce task that uses HBase as input, wouldn't this imply

RE: unsubscribe

2012-08-28 Thread sathyavageeswaran
HADOOP policy has changed. Any user wanting to unsubscribe needs to donate USD 100/- for Obama’s campaign before the request is accepted. From: Georgi Georgiev [mailto:g.d.georg...@gmail.com] Sent: 29 August 2012 03:31 To: user@hadoop.apache.org Cc: Hennig, Ryan Subject: Re: unsubscr

Re: example usage of s3 file system

2012-08-28 Thread Chris Collins
So, looking at the source and single-stepping through a simple test case where I make a directory using FileSystem.mkdirs(), I see that I am getting a 404 (nothing new there). However, the code that is throwing this looks like the below. Note the comment about being "brittle". Seems it's looki

Re: Hadoop and MainFrame integration

2012-08-28 Thread Artem Ervits
Can you read the data off backup tapes and dump it to flat files? Artem Ervits Data Analyst New York Presbyterian Hospital From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Tuesday, August 28, 2012 06:51 PM To: user@hadoop.apache.org Cc: Siddharth Tiwari Subject: Re: Hadoop and MainFrame integr

copy Configuration into another

2012-08-28 Thread Radim Kolar
Is it possible to copy an org.apache.hadoop.conf.Configuration into another Configuration object without creating a new instance? I am seeking something like new Configuration(Configuration), but without creating a new destination object (it's managed by Spring)
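
A minimal sketch of one way to do this, relying on the fact that Configuration implements Iterable over its entries; the helper name is illustrative:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    // Copy every entry from src into an existing dest, without new Configuration(src).
    public static void copyInto(Configuration src, Configuration dest) {
        for (Map.Entry<String, String> entry : src) {
            dest.set(entry.getKey(), entry.getValue());
        }
    }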

unsubscribe

2012-08-28 Thread tlyxy228
unsubscribe -- Regards, 唐李洋

Re: Hadoop and MainFrame integration

2012-08-28 Thread Marcos Ortiz
The problem is that Hadoop depends on HDFS, which stores data in blocks of 64/128 MB (or whatever size you configure; 64 MB is the default), and then runs the calculations on those blocks. So, you need to move all your data to an HDFS cluster to use the data in MapReduce jobs if you want to mak
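
For reference, the block size is set per cluster (or per file) in hdfs-site.xml; the property name below is the Hadoop 1.x one, and the value is just an example:

    <property>
      <name>dfs.block.size</name>
      <value>134217728</value> <!-- 128 MB, in bytes -->
    </property>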

Re: unsubscribe

2012-08-28 Thread Georgi Georgiev
I even got out-of-office auto-replies from people by sending the email below - that's crazy! g On Wed, Aug 29, 2012 at 12:56 AM, Georgi Georgiev wrote: > guys - what's going wrong with these requests - can't you just teach people > to act appropriately - send regular mails to un-sub-subscribe - really a lot

Re: unsubscribe

2012-08-28 Thread Georgi Georgiev
guys - what's going wrong with these requests - can't you just teach people to act appropriately - send regular mails to un-sub-subscribe - really a lot of spam in my in-mail. cheers, g On Wed, Aug 29, 2012 at 12:08 AM, Fabio Pitzolu wrote: > Epic Ryan!!! > > Sent from my Windows Phone > -

R: unsubscribe

2012-08-28 Thread Fabio Pitzolu
Epic Ryan!!! Sent from my Windows Phone -- From: Hennig, Ryan Sent: 28/08/2012 21:14 To: user@hadoop.apache.org Subject: RE: unsubscribe Error: unsubscribe request failed. Please retry again during a full moon. *From:* Alberto Andreotti [mailto:albertoandreo...

Re: example usage of s3 file system

2012-08-28 Thread Chris Collins
Thanks, I should have been clearer. I am not attempting to perform a MapReduce job. I was literally trying to use the FileSystem abstraction (rather than using the jets3t library directly) to access S3. I was assuming it handled the mocking of directories in s3 (as it is not a native feature of

Re: best way to join?

2012-08-28 Thread Ted Dunning
I don't mean that. I mean that a k-means clustering with pretty large clusters is a useful auxiliary data structure for finding nearest neighbors. The basic outline is that you find the nearest clusters and search those for near neighbors. The first riff is that you use a clever data structure f

Re: How to reduce total shuffle time

2012-08-28 Thread abhiTowson cal
hi Gaurav, Can you tell me how you calculated the total shuffle time? Apart from combiners and compression, you can also use some shuffle-sort parameters that might increase the performance; I am not sure exactly which parameters to tweak. Please share if you come across some other techniques, I am
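
For reference, a few of the MRv1 shuffle/sort knobs usually meant here; the values below are illustrative, not recommendations:

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf();
    conf.setInt("mapred.reduce.parallel.copies", 10); // fetch threads per reducer (default 5)
    conf.setInt("io.sort.mb", 200);                   // map-side sort buffer, in MB
    conf.setInt("io.sort.factor", 50);                // streams merged at once during sorts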

RE: Sqoop installation

2012-08-28 Thread Hennig, Ryan
Rahul, Surely there is some way that you could share the text output of a command, without sending a giant screenshot of your entire desktop? - Ryan

RE: unsubscribe

2012-08-28 Thread Hennig, Ryan
Error: unsubscribe request failed. Please retry again during a full moon. From: Alberto Andreotti [mailto:albertoandreo...@gmail.com] Sent: Thursday, August 23, 2012 9:00 AM To: user@hadoop.apache.org Subject: unsubscribe unsubscribe -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 5

is there sort after mapper before combiner?

2012-08-28 Thread Yue Guan
Hi there, I'm wondering if there is a sort after the mapper, before the combiner, like there is on the reducer side? I know I'm probably wrong, but my guess is that in the mapper the pairs are in memory and a hash of lists would do the work. Thank you very much!

Re: Hadoop and MainFrame integration

2012-08-28 Thread Steve Loughran
On 28 August 2012 09:24, Siddharth Tiwari wrote: > Hi Users. > > We have flat files on mainframes with around a billion records. We need to > sort them and then use them with different jobs on mainframe for report > generation. I was wondering was there any way I could integrate the > mainframe

Re: datanode has no storageID

2012-08-28 Thread Arpit Gupta
> 2. how/where does the namenode store the datanodes' storageids? When the datanode connects to the namenode for the first time, it will register with the namenode, and during the registration of the datanode the storage id for the datanode is generated. > > 4. can I format/reset the na

Re: Hadoop and MainFrame integration

2012-08-28 Thread Mathias Herberts
build a custom transfer mechanism in Java and use a zAAP so you won't consume MIPS On Aug 28, 2012 6:24 PM, "Siddharth Tiwari" wrote: > Hi Users. > > We have flat files on mainframes with around a billion records. We need to > sort them and then use them with different jobs on mainframe for repo

Hadoop Streaming question

2012-08-28 Thread Periya.Data
Hi all, I am using Python on CDH3u3 for streaming. I do not know how to provide command-line arguments. My Python mapper takes 3 arguments - 2 input files and one placeholder for an output file. I am doing something like this, but it fails. Where am I going wrong? What other options do I have? A
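
One common pattern, sketched below with illustrative paths: quote the -mapper value so the arguments travel with the script, and ship the script itself with -file. Note that a streaming mapper reads its records from stdin, not from argv input files:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper "mapper.py arg1 arg2" \
      -file mapper.py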

Re: example usage of s3 file system

2012-08-28 Thread Manoj Babu
Hi, Here is an example that might help you: http://muhammadkhojaye.blogspot.in/2012/04/how-to-run-amazon-elastic-mapreduce-job.html Cheers! Manoj. On Tue, Aug 28, 2012 at 12:55 PM, Chris Collins wrote: > > > > Hi I am trying to use the Hadoop filesystem abstraction with S3 but in my > tinkering

example usage of s3 file system

2012-08-28 Thread Chris Collins
Hi, I am trying to use the Hadoop filesystem abstraction with S3, but in my tinkering I am not having a great deal of success. I am particularly interested in the ability to mimic a directory structure (since s3 native doesn't do it). Can anyone point me to some good example usage of Hadoop FileS
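
A minimal sketch of going through the FileSystem abstraction with the s3n scheme; the bucket name and credentials are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");     // placeholder
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY"); // placeholder
    FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
    fs.mkdirs(new Path("s3n://my-bucket/some/dir")); // s3n mimics directories with marker objects
    for (FileStatus st : fs.listStatus(new Path("s3n://my-bucket/"))) {
        System.out.println(st.getPath());
    }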

Re: Hadoop and MainFrame integration

2012-08-28 Thread Ankam Venkateshwarlu
Can you please explain how to automate the process of sending files back and forth from the mainframe? Is it done using NDM? I have a requirement to migrate from Mainframe to Hadoop. I am looking for more information in this area about the economics, process, and tools that support migration etc. Any inf

Re: one reducer is hanged in "reduce-> copy" phase

2012-08-28 Thread Bejoy KS
Hi Abhay, The map outputs are deleted only after the reducer runs to completion. >Is it possible to run the same attempt again? Does killing the child java >process or tasktracker on the node help? (since hadoop may schedule a reduce >attempt on another node). Yes, it is possible to re-attempt
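
For reference, the MRv1 CLI can fail (or kill) a specific attempt so the framework reschedules it, possibly on another node; the attempt id below is a made-up example:

    hadoop job -fail-task attempt_201208280001_0042_r_000003_0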

Re: Hadoop and MainFrame integration

2012-08-28 Thread modemide
At some point in the work flow you're going to have to transfer the file from the mainframe to the Hadoop cluster for processing, and then send it back for storage on the mainframe. You should be able to automate the process of sending the files back and forth. It's been my experience that it's o

Hadoop and MainFrame integration

2012-08-28 Thread Siddharth Tiwari
Hi Users. We have flat files on mainframes with around a billion records. We need to sort them and then use them with different jobs on the mainframe for report generation. I was wondering whether there was any way I could integrate the mainframe with Hadoop, do the sorting, and keep the file on the server i

datanode has no storageID

2012-08-28 Thread boazya
Hi, hope it's not a newbie question... I installed several versions of hadoop for testing (0.20.203, 0.21.0, and 1.0.3) on various machines. Now I am using 1.0.3 on all the machines, and I face a problem that on some of the machines the datanode gets no storageID from the namenode. Where it works, t

Re: How to reduce total shuffle time

2012-08-28 Thread Minh Duc Nguyen
Without knowing your exact workload, using a Combiner (if possible) as Tsuyoshi recommended should decrease your total shuffle time. You can also try compressing the map output so that there's less disk and network IO. Here's an example configuration using Snappy: conf.set("mapred.compress.map.output", "true"); conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

Re: best way to join?

2012-08-28 Thread dexter morgan
Right, but if I understood your suggestion, you look at the end goal, which is: 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]] for example, and you say: here we see a cluster; basically, that cluster is represented by the point [40.123,-50.432]. Which points does this cluster contain?

Re: distcp error.

2012-08-28 Thread Marcos Ortiz
Hi, Tao. Is this problem only with 2.0.1, or with both versions? Have you tried to use distcp from 1.0.3 to 1.0.3? On 28/08/2012 11:36, Tao wrote: > > Hi, all > > I use distcp to copy data from hadoop 1.0.3 to hadoop 2.0.1. > > When the file path (or file name) contains Chinese characters, an > e

distcp error.

2012-08-28 Thread Tao
Hi, all. I use distcp to copy data from hadoop 1.0.3 to hadoop 2.0.1. When the file path (or file name) contains Chinese characters, an exception is thrown, like below. I need some help with this. Thanks. [hdfs@host ~]$ hadoop distcp -i -prbugp -m 14
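
As an aside, when copying between RPC-incompatible versions such as 1.0.3 and 2.0.1, distcp is usually run on the destination cluster, reading the source over the read-only hftp interface; hostnames and paths below are placeholders:

    hadoop distcp hftp://nn1.example.com:50070/src/path hdfs://nn2.example.com:8020/dst/path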

Re: Hadoop or HBase

2012-08-28 Thread Marcos Ortiz
Regards to all the list. Well, you should ask the Tumblr folks; they use a combination of MySQL and HBase for their blogging platform. They talked about this topic at the last HBaseCon. Here is the link: http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/ Blake Mathen

Re: best way to join?

2012-08-28 Thread Ted Dunning
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan wrote: > > I understand your solution ( i think) , didn't think of that, in that > particular way. > I think that lets say i have 1M data-points, and running knn , that the > k=1M and n=10 (each point is a cluster that requires up to 10 points) > is a

one reducer is hanged in "reduce-> copy" phase

2012-08-28 Thread Abhay Ratnaparkhi
Hello, I have an MR job which has 4 reducers running. One of the reduce attempts has been pending for a long time in the reduce->copy phase. The job is not able to complete because of this. I have seen that the child java process on the tasktracker is running. Is it possible to run the same attempt again? Does

Re: Job does not run with EOFException

2012-08-28 Thread Caetano Sauer
The host on top of the stack trace contains the host and port I defined in mapred.job.tracker in mapred-site.xml. Other than that, I don't know how to verify what you asked me. Any tips? On Tue, Aug 28, 2012 at 3:47 PM, Harsh J wrote: > Are you sure you're reaching the right port for your JobTrc
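
For reference, that property lives in mapred-site.xml, and its port is the JobTracker RPC port the client must reach; the value below is illustrative (8021 is a common default):

    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:8021</value>
    </property>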

Re: best way to join?

2012-08-28 Thread dexter morgan
Dear Ted, I understand your solution (I think); didn't think of that in that particular way. I think that, let's say I have 1M data-points and am running knn, then k=1M and n=10 (each point is a cluster that requires up to 10 points) is an overkill. How can I achieve the same result WITHOUT u

Re: Job does not run with EOFException

2012-08-28 Thread Harsh J
Are you sure you're reaching the right port for your JobTracker? On Tue, Aug 28, 2012 at 7:15 PM, Caetano Sauer wrote: > Hello, > > I am getting the following error when trying to execute a hadoop job on a > 5-node cluster: > > Caused by: java.io.IOException: Call to *** failed on local exception:

RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Tony Burton
Hi Harsh, Thanks for the reply - my understanding is that with MultipleOutputs I can write differently named files into the same target directory. With MultipleTextOutputFormat I was able to override the target directory name to perform the segmentation, by overriding generateFileNameForKeyValue
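
For what it's worth, a sketch of directory-style segmentation with the new-API MultipleOutputs, assuming org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is present in your 1.0.3 build; a '/' in the baseOutputPath argument produces per-key subdirectories under the job output directory (class name illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SegmentingReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                mos.write(key, value, key.toString() + "/part"); // per-key directory
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close(); // flush and close all underlying writers
        }
    }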

Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Mohammad Tariq
Apologies for the spelling Harsh. Regards, Mohammad Tariq On Tue, Aug 28, 2012 at 6:22 PM, Mohammad Tariq wrote: > Hello Hars, > >Sorry for the intrusion. I just wanted to ask the same for > WholeFileInputFormat. I am not able to find it in 1.0.3. > > Regards, > Mohammad Tariq

Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Mohammad Tariq
Hello Hars, Sorry for the intrusion. I just wanted to ask the same for WholeFileInputFormat. I am not able to find it in 1.0.3. Regards, Mohammad Tariq On Tue, Aug 28, 2012 at 6:14 PM, Harsh J wrote: > The Multiple*OutputFormat have been deprecated in favor of the generic > Multip

Re: error in shuffle in InMemoryMerger

2012-08-28 Thread Abhay Ratnaparkhi
Checked the "mapred.tmp.local" directory on the node which is running the reducer attempt and seems that there is available space around 1G(though it's less). On Tue, Aug 28, 2012 at 3:55 PM, Joshi, Rekha wrote: > Hi Abhay, > > Ideally the error line - "Caused by: > org.apache.hadoop.util.DiskC

Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Harsh J
The Multiple*OutputFormat have been deprecated in favor of the generic MultipleOutputs API. Would using that instead work for you? On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton wrote: > Hi, > > I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good > for writing results into

hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Tony Burton
Hi, I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good for writing results into (for example) different directories created on the fly. However, now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see that the new API no longer supports MultipleTextOutputFormat.

Fw: pending clean up step

2012-08-28 Thread listenbruder
I'm facing the exact same issue on 0.20.2-cdh3u0. Does anybody have an idea? Tnx. Best, Christoph -- From: Matthias Zengler, Subject: Re: pending clean up step, Date: Tue, 14 Feb 2012 13:17:04 GMT Hello, I have a question regarding a problem with hadoop. At fir

Re: error in shuffle in InMemoryMerger

2012-08-28 Thread Joshi, Rekha
Hi Abhay, Ideally the error line - "Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_128.out" - suggests that you either do not have permissions on the output folder or the disk is full. Also, 5 is not a big number for thread spawning, (

Re: Why cannot I start namenode or localhost:50070 ?

2012-08-28 Thread Charles AI
hi Mohammad, Thank you for the reminder. I have checked the two directories and set them to /home/hadoopfs/data and /home/hadoopfs/name, not under /tmp. So far, my problem has been solved. Thank you. On Mon, Aug 27, 2012 at 4:31 PM, Mohammad Tariq wrote: > Hello Charles, > >Have you ad
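
For reference, those two locations correspond to the Hadoop 1.x hdfs-site.xml properties below; the paths mirror the ones mentioned above:

    <property>
      <name>dfs.name.dir</name>
      <value>/home/hadoopfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/hadoopfs/data</value>
    </property>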

Re: How to reduce total shuffle time

2012-08-28 Thread Tsuyoshi OZAWA
It depends on the workload. Could you tell us more about your job? For the general case where reducers are the bottleneck, there are some tuning techniques, as follows: 1. Allocate more memory to reducers. It decreases the disk IO of reducers when merging and running reduce functions. 2. Use combine fu
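
To illustrate point 2 concretely, a combiner is set on the job as below, assuming the reduce logic is associative and commutative; the class names are the stock WordCount ones, purely illustrative:

    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job(conf, "wordcount");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // runs the reduce logic locally on map output
    job.setReducerClass(IntSumReducer.class);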

How to reduce total shuffle time

2012-08-28 Thread Gaurav Dasgupta
Hi, I have run some large and small jobs and calculated the Total Shuffle Time for the jobs. I can see that the Total Shuffle Time is almost half the Total Time taken by the full job to complete. My question here is: how can we decrease the Total Shuffle Time? And in doing so, what w

Suggestions/Info required regarding Hadoop Benchmarking

2012-08-28 Thread Gaurav Dasgupta
Hi Users, I have a 12-node CDH3 cluster where I am planning to run some benchmark tests. My main intention is to run the benchmarks first with the default Hadoop configuration, then analyze the outcomes and tune the Hadoop configuration accordingly to increase the performance of my cluster. Can some