HBase Python recommended interface
Hi, does anyone have recommendations for a Python interface to HBase? Thrift is one possibility, but is there a library like https://github.com/pycassa/pycassa ?

--
Håvard Wahl Kongsgård
NTNU
http://havard.security-review.net/
Re: Partition classes, how to pass in background information
If your class implements the Configurable interface, Hadoop will call the setConf method after creating the instance. Look in the source code for ReflectionUtils.newInstance for more info.

On Mar 14, 2012 2:31 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

I am using the new org.apache.hadoop.mapreduce.Partitioner class; however, I need to pass it some background information. How can I do this? In the old org.apache.hadoop.mapred.Partitioner class (now deprecated), the class extends JobConfigurable, and the hook to pass in any background data is the JobConfigurable.configure(JobConf job) method. I thought that if I subclassed org.apache.hadoop.mapreduce.Partitioner I could pass in the background information, but the org.apache.hadoop.mapreduce.Job class only has a setPartitionerClass(Class<? extends Partitioner>) method. All my development has been against the new mapreduce package, and I would definitely like to stick with the new API. Any help is appreciated.
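A minimal sketch of what Chris describes, assuming a hypothetical property name my.partitioner.salt. Because the class implements Configurable, ReflectionUtils.newInstance calls setConf right after instantiation:

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class BackgroundInfoPartitioner
        extends Partitioner<Text, IntWritable> implements Configurable {

      private Configuration conf;
      private int salt; // example of "background information"

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // "my.partitioner.salt" is a made-up property for illustration.
        this.salt = conf.getInt("my.partitioner.salt", 0);
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        return ((key.hashCode() + salt) & Integer.MAX_VALUE) % numPartitions;
      }
    }

The property is set on the job's Configuration before submission, and each Partitioner instance picks it up in setConf.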
Re: questions regarding hadoop version 1.0
JobTracker and TaskTracker. YARN is only in 0.23 and later releases; 1.0.x is from the 0.20.x line of releases.

-Joey

On Mar 14, 2012, at 7:00, arindam choudhury arindam732...@gmail.com wrote:

Hi, does Hadoop 1.0.1 use YARN or the TaskTracker/JobTracker model? Regards, Arindam
RE: decompressing bzip2 data with a custom InputFormat
Hi - sorry to bump this, but I'm having trouble resolving it. Essentially the question is: if I create my own InputFormat by subclassing TextInputFormat, does the subclass have to handle its own streaming of compressed data? If so, can anyone point me at an example where this is done? Thanks!

Tony

-----Original Message-----
From: Tony Burton [mailto:tbur...@sportingindex.com]
Sent: 12 March 2012 18:05
To: common-user@hadoop.apache.org
Subject: decompressing bzip2 data with a custom InputFormat

Hi, I'm setting up a map-only job that reads large bzip2-compressed data files, parses the XML, and writes out the same data in plain text format. My XML InputFormat extends TextInputFormat and has a RecordReader based upon the one at http://xmlandhadoop.blogspot.com/ (my version works great for uncompressed XML input data). For compressed data, I've added io.compression.codecs to my core-site.xml and set it to o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2. Have I forgotten something basic when running a Hadoop job to read compressed data? Or, given that I've written my own InputFormat, should I be using an InputStream that can carry out the decompression itself? Thanks, Tony
Re: decompressing bzip2 data with a custom InputFormat
Yes, you have to deal with the compression. Usually you'll load the compression codec in your RecordReader. You can see an example of how TextInputFormat's LineRecordReader does it: https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java

On Wed, Mar 14, 2012 at 11:08 AM, Tony Burton tbur...@sportingindex.com wrote:

Essentially the question is: if I create my own InputFormat by subclassing TextInputFormat, does the subclass have to handle its own streaming of compressed data? If so, can anyone point me at an example where this is done?
--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
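For reference, a minimal sketch of the codec handling in a custom RecordReader's initialize(), following the LineRecordReader pattern Joey points to; the XML parsing itself is elided, and the method is pulled out into a sketch class for readability:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class XmlRecordReaderSketch {
      // Would normally live in your RecordReader's initialize() method.
      InputStream openPossiblyCompressed(InputSplit genericSplit,
          TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        Path file = split.getPath();

        // Pick a codec by file extension (.bz2 -> BZip2Codec, etc.).
        CompressionCodec codec =
            new CompressionCodecFactory(conf).getCodec(file);

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(file);
        if (codec != null) {
          // Records must be parsed from the decompressed stream.
          return codec.createInputStream(fileIn);
        }
        fileIn.seek(split.getStart()); // uncompressed: honor the split
        return fileIn;
      }
    }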
Capacity Scheduler APIs
Hi all, are there any Capacity Scheduler APIs that I can use? E.g. adding and removing queues, tuning properties on the fly, and so on. Any help is appreciated. Thanks, Harshad
Re: Using a combiner
It is a function of the number of spills on the map side, and I believe the default is 3: once data has been spilled at least 3 times, the combiner is run when the spill files are merged. This threshold is configurable (min.num.spills.for.combine in 0.20/1.x).

Sent from my iPhone

On Mar 14, 2012, at 3:26 PM, Gayatri Rao rgayat...@gmail.com wrote:

Hi all, I have a quick query on using a combiner in an MR job. Is it true that the framework decides whether or not the combiner gets called? Can anyone please give more information on how this is done? Thanks, Gayatri
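A short sketch of wiring this up with the 0.20-era API. The key point is that the framework treats the combiner as an optional optimization and may run it zero or more times, so it must not change the job's semantics:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerSetup {
      // Minimal sum reducer, reused as the combiner: summing partial
      // sums gives the same answer as summing the raw values.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context ctx) throws java.io.IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 0.20/1.x property name: how many spill files must exist before
        // the combiner also runs during the merge phase (default 3).
        conf.setInt("min.num.spills.for.combine", 3);
        Job job = new Job(conf, "combiner-example");
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
      }
    }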
Re: Partition classes, how to pass in background information
Thanks Chris! That worked!

On Wed, Mar 14, 2012 at 6:06 AM, Chris White chriswhite...@gmail.com wrote:

If your class implements the Configurable interface, Hadoop will call the setConf method after creating the instance. Look in the source code for ReflectionUtils.newInstance for more info.
Re: does hadoop always respect setNumReduceTasks?
Thanks Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog goks...@gmail.com wrote:

Instead of String.hashCode() you can use an MD5-based hash. This has not created a duplicate in the wild (MD5 has been broken cryptographically, but that's not relevant here). http://snippets.dzone.com/posts/show/3686 I think the Partitioner class guarantees that you will have multiple reducers.

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne jane.wayne2...@gmail.com wrote:

I am wondering if Hadoop always respects Job.setNumReduceTasks(int). As I emit items from the mapper, I expect/desire only one reducer to get them, because I want to assign each key of the key-value input a unique integer id. With one reducer, I can just keep a local counter (with respect to the reducer instance) and increment it.

On my local Hadoop cluster, I noticed that most, if not all, of my jobs have only 1 reducer, regardless of whether or not I set Job.setNumReduceTasks(int). However, as soon as I moved the code onto Amazon's Elastic MapReduce (EMR), I noticed that there are multiple reducers. If I set the number of reduce tasks to 1, is this always guaranteed? I ask because I don't know if there is a gotcha as with the combiner (which may or may not run at all). Also, it looks like having just 1 reducer might not be a good idea (it won't scale). It is most likely better to have more than one reducer, but in that case I lose the ability to assign unique numbers to the incoming key-value pairs. Is there a design pattern out there that addresses this issue?

My mapper/reducer key-value pair signatures look something like the following:

    mapper(Text, Text, Text, IntWritable)
    reducer(Text, IntWritable, IntWritable, Text)

The mapper reads a sequence file whose key-value pairs are of type Text and Text. I then emit Text (say, a word) and IntWritable (say, the frequency of the word). The reducer gets the word and its frequencies, and then assigns the word an integer id. It emits IntWritable (the id) and Text (the word).

I remember seeing code in Mahout's API where they assign integer ids to items. The items were already given an id of type long, and the conversion they make is as follows:

    public static int idToIndex(long id) {
      return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
    }

Is there something equivalent for Text or a word? I was thinking about simply taking the hash value of the string/word, but of course different strings can map to the same hash value.

--
Lance Norskog
goks...@gmail.com
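A sketch of Lance's MD5 suggestion for deriving an int id from a word; the helper is hypothetical, and collisions remain possible in a 31-bit space, just less pattern-prone than String.hashCode():

    import java.security.MessageDigest;

    public class StringIds {
      // Derive a non-negative int id from the MD5 digest of a word,
      // instead of using String.hashCode() directly.
      public static int idToIndex(String word) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] d = md5.digest(word.getBytes("UTF-8"));
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        return 0x7FFFFFFF & h; // mirrors Mahout's idToIndex masking
      }
    }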
dynamic mapper?
Suppose I want to generate a mapper class at run time and use that class in my MapReduce job. What is the best way to do this? Would I just have an extra scripted step to pre-compile it and distribute it with -libjars? Or, if I compiled it dynamically with, for example, JavaCompiler, is there some elegant way to distribute the class at run time?
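For the JavaCompiler route mentioned above, a minimal sketch; MyMapper.java is a hypothetical source file, and this only produces the .class file locally, leaving the distribution question open:

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    public class CompileMapper {
      public static void main(String[] args) {
        // Requires a JDK at runtime; returns null on a plain JRE.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
          throw new IllegalStateException("no system compiler available");
        }
        // Compiles MyMapper.java in the working directory to
        // MyMapper.class; a nonzero return code means failure.
        int result = compiler.run(null, null, null, "MyMapper.java");
        if (result != 0) {
          throw new RuntimeException("compilation failed");
        }
        // The generated .class still has to reach the task JVMs, e.g. by
        // packaging it into a jar passed via -libjars.
      }
    }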
SequenceFile split question
I have a client program that creates a SequenceFile, which essentially merges small files into one big file. I was wondering how the sequence file data is split across nodes. When I start, the sequence file is empty. Does it get split when it reaches dfs.block.size? If so, does that mean I am always writing to just one node at any given point in time? If I start a new client writing a new sequence file, is there a way to select a different data node?
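For context, a minimal sketch of the kind of client described above; the paths and class name are hypothetical. It packs small files into one SequenceFile keyed by filename:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths; one writer appends everything sequentially.
        Path in = new Path("/user/me/small/");
        Path out = new Path("/user/me/packed.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (FileStatus st : fs.listStatus(in)) {
            byte[] buf = new byte[(int) st.getLen()];
            FSDataInputStream src = fs.open(st.getPath());
            try {
              src.readFully(0, buf);
            } finally {
              src.close();
            }
            // Key: original filename; value: raw file contents.
            writer.append(new Text(st.getPath().getName()),
                new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }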