Re: No FileSystem for scheme: file

2014-11-06 Thread Tim Robertson
NativeS3FileSystem, org.apache.hadoop.fs.ftp.FTPFileSystem, org.apache.hadoop.fs.HarFileSystem, org.apache.hadoop.hdfs.DistributedFileSystem, org.apache.hadoop.hdfs.HftpFileSystem, org.apache.hadoop.hdfs.HsftpFileSystem, org.apache.hadoop.hdfs.web.WebHdfsFileSystem
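Note that org.apache.hadoop.fs.LocalFileSystem is absent from the registered implementations listed above, which is consistent with the error. The usual root cause is a fat-jar build in which one Hadoop artifact's META-INF/services/org.apache.hadoop.fs.FileSystem file overwrites another's, so the file:// scheme never gets registered. A sketch of the common maven-shade-plugin fix, assuming a Maven build (the thread does not confirm the build tool):

```xml
<!-- Merge, rather than overwrite, the FileSystem service registrations
     from hadoop-common and hadoop-hdfs when building a shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```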

Re: No FileSystem for scheme: file

2014-11-05 Thread Tim Robertson
/CDH configuration deployment issue and not > something specific to HBase. In the future please consider sending these > kinds of vendor-specific questions to the community support mechanisms of > said vendor. In Cloudera's case, that's http://community.cloudera.com/ > > -S

Re: No FileSystem for scheme: file

2014-11-05 Thread Tim Robertson
s.jar org.gbif.metrics.cube.occurrence.backfill.Backfill Thanks, Tim On Wed, Nov 5, 2014 at 4:30 PM, Sean Busbey wrote: > How are you submitting the job? > > How are your cluster configuration files deployed (i.e. are you using CM)? > > On Wed, Nov 5, 2014 at 8:50 AM, Tim Robertson > wrote: > > >

No FileSystem for scheme: file

2014-11-05 Thread Tim Robertson
Hi all, I'm seeing the following java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java

Re: hbase-server required on CP for running MR jobs?

2014-11-05 Thread Tim Robertson
> There isn't hbase-mapreduce module yet. For now, you need to include > hbase-server module. > > Cheers > > On Nov 5, 2014, at 1:27 AM, Tim Robertson > wrote: > > > Hey folks, > > > > I'm upgrading an application from CDH4.3 to CDH5.2 so jumpin

hbase-server required on CP for running MR jobs?

2014-11-05 Thread Tim Robertson
Hey folks, I'm upgrading an application from CDH4.3 to CDH5.2 so jumping from 0.94 to 0.98 and wanted to just ask for confirmation on the dependencies now hbase has split into hbase-client and hbase-server etc. If I am submitting MR jobs (to Yarn) that use things like TableMapReduceUtil it seems

Re: Shared Cluster between HBase and MapReduce

2012-06-06 Thread Tim Robertson
Like Amandeep says, it really depends on the access patterns and jobs running on the cluster. We are using a single cluster for HBase and MR, with each node running DN, TT and RS. We have tried mixed clusters with only some running RS but you start to suffer from data locality issues during scans.

Re: MR not seeing data locality - IP versus Host name

2012-05-28 Thread Tim Robertson
Cheers, Tim On Mon, May 28, 2012 at 3:54 PM, Tim Robertson wrote: > Thanks Stack. We're looking into this a lot. > > As far as we can tell DNS is correct, machine host names are correct etc. > In .META. it uses fully qualified names (c4n5.gbif.org) so I guess I'll

Re: MR not seeing data locality - IP versus Host name

2012-05-28 Thread Tim Robertson
.226.238.185", " c4n5.gbif.org"); regionLocation = regionLocation.replaceAll("130.226.238.186", " c4n6.gbif.org"); More when we know more. Tim On Mon, May 28, 2012 at 12:32 AM, Stack wrote: > On Sun, May 27, 2012 at 1:05 PM, Tim Robertson > wrote:

MR not seeing data locality - IP versus Host name

2012-05-27 Thread Tim Robertson
Hi all, When I run MR jobs, I don't see data locality because the TT sees /default-rack/c4n1.gbif.org but the TableInputFormat is giving /default-rack/130.226.238.181 (the same machine) when it determines the splits for the job. Clearly we have set something up wrong - has anyone seen this? I've

Re: HBase Performance Improvements?

2012-05-09 Thread Tim Robertson
Hey Something, We can share everything, and even our ganglia is public [1] . We are just setting up a new cluster with Puppet and the HBase master just came up. HBase RS will be up probably tomorrow, where the first task will be a bulk load of 400M records - we're just finishing our working day

Re: multiple puts in reducer?

2012-02-28 Thread Tim Robertson
Hi, You can call context.write() multiple times in the Reduce(), to emit more than one row. If you are creating the Puts in the Map function then you need to setMapSpeculativeExecution(false) on the job conf, or else Hadoop *might* spawn more than 1 attempt for a given task, meaning you'll get du
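The speculative-execution point above can also be expressed as a job configuration property rather than an API call. A sketch using the pre-YARN property name backing setMapSpeculativeExecution; verify the exact key against the Hadoop version in use:

```xml
<!-- Prevent Hadoop from launching duplicate (speculative) map attempts,
     which would each emit their own Puts for the same input split. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```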

Re: multiple puts in reducer?

2012-02-28 Thread Tim Robertson
Hi, Assuming you use TableOutputFormat [1] you can emit as many PUTs as you want from a reducer. You will need to handle the row key as you create the PUT to emit. HTH, Tim [1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html On Tue, Feb 28, 2012 at 3:3

Copenhagen / Scandinavian HUG Meetup - any interest?

2012-02-13 Thread Tim Robertson
Hi all, (cross posted to a few Hadoop mailing lists - apologies for the SPAM) Are there any users around the Copenhagen area that would like a HUG meetup? Just reply with +1 and I'll gauge interest. We could probably host a 1/2 or full day if people were coming from Sweden... We are using Hadoo

Re: HFileInputFormat for MapReduce

2012-02-10 Thread Tim Robertson
> Is HIVE involved?  Or is it just raw scan compared to TFIF? No Hive > Is this a MR scan or just a shell serial scan (or is it still PE?)? We are using PE scan to try and "standardize" as much as possible. > You want to get this scan speed up only?  You are not interested in figuring > how >

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
Hey Stack, We see the difference between a scan and TextFileInputFormat of the same data as csv being 10x slower. This is what prompted me to look at MR using an HFIF just out of curiosity. Cheers, Tim On Thu, Feb 9, 2012 at 7:32 PM, Stack wrote: > On Thu, Feb 9, 2012 at 12:55 AM,

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
e HBase at all? > > (I'm not trying to shoo you away from HBase. Just curious what you are > trying to accomplish) > > Amandeep > > On Feb 9, 2012, at 12:19 AM, Tim Robertson wrote: > >> Hi all, >> >> Can anyone elaborate on the pitfalls or implication

HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
Hi all, Can anyone elaborate on the pitfalls or implications of running MapReduce using an HFileInputFormat extending FileInputFormat? I'm sure scanning goes through the RS for good reasons (guessing handling splits, locking, RS monitoring etc) but can it ever be "safe" to run MR over HFiles dire

Re: PerformanceEvaluation results

2012-02-08 Thread Tim Robertson
Hey Stack, Because we run a couple clusters now, we're using templating for the *.site.xml etc. You'll find them in: http://code.google.com/p/gbif-common-resources/source/browse/cluster-puppet/modules/hadoop/templates/ The values for the HBase 3 node cluster come from: http://code.google.c

Re: PerformanceEvaluation results

2012-02-02 Thread Tim Robertson
ing and fine tuning a cluster is something you > have to do on your own. I guess I could say your numbers look fine to me for > that config... But honestly, it would be a swag. > > > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Feb 1, 2012,

Re: PerformanceEvaluation results

2012-02-01 Thread Tim Robertson
ing did you do? > Why such a small cluster? > > Sorry, but when you start off with a bad hardware configuration, you can get > Hadoop/HBase to work, but performance will always be sub-optimal. > > > > Sent from my iPhone > > On Feb 1, 2012, at 6:52 AM, "Tim Rober

PerformanceEvaluation results

2012-02-01 Thread Tim Robertson
Hi all, We have a 3 node cluster (CDH3u2) with the following hardware: RegionServers (+DN + TT) CPU: 2x Intel(R) Xeon(R) CPU E5630 @ 2.53GHz (quad) Disks: 6x250G SATA 5.4K Memory: 24GB Master (+ZK, JT, NN) CPU: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz, 2x6MB (quad) Disks: 2x500G SATA 7.2K

Re: Faster Bulkload from Oracle to HBase

2012-01-31 Thread Tim Robertson
Hi Laxman, We use both #1 and #3 from MySQL which also has hi speed exports. For our 300G and 340M rows, #1 takes us around 3 hours, with Sqoop it is closer to 8 hrs to our 3 node cluster. We are having issues with delimiters though (since we have \r, \t and \n in the database), and now using Avr

Re: Speeding up Scans

2012-01-26 Thread Tim Robertson
Hey Peter, I am trying to benchmark our 3 node cluster now and trying to optimize for scanning. Using the PerformanceEvaluation tool I did a random write to populate 5M rows (I believe they are 1k each but whatever the tool does by default). I am seeing 33k records per second (which I believe to

Re: PerformanceEvaluation scan - how to read the results?

2012-01-26 Thread Tim Robertson
Hey stack >> This gave me 32 regions across 2 of our 3 region servers (we have HDFS >> across 17 nodes but only 3 machines running RS). >> > > The balancer ran?  I'd think it'd balance the regions across the three > servers.  Something stuck in transition stopping the balancer running > (See maste

PerformanceEvaluation scan - how to read the results?

2012-01-25 Thread Tim Robertson
Hi all, I am trying to sanitize our setup, and using the PerformanceEvaluation as a basis to check. To do this, I ran the following to load it up: $HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 5 This gave me 32 regions across 2 of our 3 region servers (we have

Re: MR - Input from Hbase output to HDFS

2011-11-09 Thread Tim Robertson
Hi Stuti, I would have thought it was something like: conf.setOutputFormat(TextOutputFormat.class); FileOutputFormat.setOutputPath(conf, new Path()); Cheers, Tim On Thu, Nov 10, 2011 at 8:31 AM, Stuti Awasthi wrote: > Hi > Currently I am understading Hbase MapReduce support. I followed

Re: Hbase Hardware requirement

2011-06-07 Thread Tim Robertson
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/ "4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration 2 quad core CPUs, running at least 2-2.5GHz 16-24GBs of RAM (24-32GBs if you’re considering HBase) Gigabit Ethernet" HTH, Tim

Re: decommissioning nodes

2010-11-25 Thread Tim Robertson
regions over and shut down the RS properly. > > Lars > > On Nov  25, 2010, at 18:24, Tim Robertson wrote: > >> Hi all, >> >> Please forgive this rather naive question - I have a cluster and want >> to decommission nodes (including the RS that hold the -ROOT-

decommissioning nodes

2010-11-25 Thread Tim Robertson
Hi all, Please forgive this rather naive question - I have a cluster and want to decommission nodes (including the RS that hold the -ROOT- and .META). Could someone please advise me the best way to do this gracefully? Can I force HBase to move regions onto the region servers I will keep up? The
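HBase ships a helper script for exactly this kind of graceful decommission; a sketch, assuming $HBASE_HOME points at the installation and that the release in use actually includes the script:

```shell
# Drain this RegionServer's regions onto the remaining servers, then stop it.
# Run once per node being decommissioned; the hostname is a placeholder.
$HBASE_HOME/bin/graceful_stop.sh rs-node-to-retire.example.org
```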

Re: MR to load HBase running slowly in reduce

2010-11-24 Thread Tim Robertson
>> I'm using >> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html > > Did you set it up with TableMapReduceUtil? > >> Not explicitly set be me > > If you use TableMapReduceUtil, then it's set to 2MB by default, but > looking at the RS logs the wri

Re: MR to load HBase running slowly in reduce

2010-11-24 Thread Tim Robertson
op.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 8 on 60020' on region I guess this is bad, but could benefit from some guidance... > Are you monitoring the GCs? > If so, do you see some pauses longer than a second? What's the best way to do this p

MR to load HBase running slowly in reduce

2010-11-24 Thread Tim Robertson
Hi all, I am running an MR job that is loading an HBase table in the reduce, and I am seeing hopeless performance - 10 million records of <1Kb in 2 hours so far. Please bear in mind I am a software guy, so go easy ;) but here is what I know so far: (http://code.google.com/p/gbif-occurrencestore/wi

Re: Newbie question

2010-11-14 Thread Tim Robertson
>So updating is okay but Handling deletes is not possible in the current version > of the data unless a new version of the data is written down. Not quite. You can delete a record and it will not show up in scans and gets etc, but physically it will still take up space on the disk until HBase cle
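Tombstoned cells are physically dropped when the store files are rewritten by a major compaction, which can be triggered from the HBase shell (the table name here is a placeholder):

```
hbase> major_compact 'mytable'
```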

Re: Where do you get your hardware?

2010-11-03 Thread Tim Robertson
We just set up a cluster with Dells, and have a pretty fine relationship with a local Dell supplier. Tim On Wed, Nov 3, 2010 at 2:21 PM, Jason Lotz wrote: > We are in the process of analyzing our options for the future purchases of > our Hadoop/HBase DN/RS servers.  Currently, we purchase Dell

Re: Setting the heap size

2010-10-29 Thread Tim Robertson
> > Sean > > On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson > wrote: > >> Hi all, >> >> We are setting up a small Hadoop 13 node cluster running 1 HDFS >> master, 9 region severs for HBase and 3 map reduce nodes, and are just >> installing zookeeper to

Re: Advice sought for mixed hardware installation

2010-10-14 Thread Tim Robertson
Thanks again. One of the things we struggle with currently on the RDBMS is the organisation of 250 million records to complex taxonomies, and also point-in-polygon intersections. Having such memory available to the MR jobs allows us to consider loading taxonomies / polygons / RTree indexes into memo

Re: Advice sought for mixed hardware installation

2010-10-14 Thread Tim Robertson
of them (this means you > will only have 9 RS). You don't really need an ensemble, unless you're > planning to share that ZK setup with other apps. > > In any case, you should test all setups. > > J-D > > On Thu, Oct 14, 2010 at 4:51 AM, Tim Robertson > wrot

Advice sought for mixed hardware installation

2010-10-14 Thread Tim Robertson
Hi all, We are about to setup a new installation using the following machines, and CDH3 beta 3: - 10 nodes of single quad core, 8GB memory, 2x500GB SATA - 3 nodes of dual quad core, 24GB memory, 6x250GB SATA We are finding our feet, and will blog tests, metrics etc as we go but our initial usage

Re: MapReduce for random row key subset

2010-09-19 Thread Tim Robertson
Just in case this was misinterpreted - my proposal was not to use mapreduce at all, but to take the Set keys submitted and simply iterate over them, calling the getByKey(key) to populate an H2 DB (or simply an in memory structure) to do the final analytics. My understanding is that HBase is design

Re: MapReduce for random row key subset

2010-09-19 Thread Tim Robertson
Or build a temporary H2 database and then issue SQL for the final group by type counts? H2 is hugely fast (20,000 record inserts per second) when you run it in the same JVM. Tim On Sun, Sep 19, 2010 at 7:30 PM, Tim Robertson wrote: > If you are only doing 1k-100k record set analytics, wo
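The pattern being suggested — pull a modest keyed subset out of HBase, load it into an embedded SQL engine, and run the group-bys there — looks roughly like this. H2 is a Java engine; Python's stdlib sqlite3 stands in purely as an illustration, and the table and rows are invented:

```python
import sqlite3

# In-memory database playing the role of the temporary H2 instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (taxon TEXT, country TEXT)")

# Rows fetched by individual gets against HBase would be inserted here.
rows = [("Puma concolor", "US"), ("Puma concolor", "MX"), ("Lynx lynx", "DK")]
conn.executemany("INSERT INTO occurrence VALUES (?, ?)", rows)

# The final "group by type counts" from the thread.
counts = dict(conn.execute(
    "SELECT taxon, COUNT(*) FROM occurrence GROUP BY taxon"))
```

Because the engine runs in the same process, there is no network round trip per insert, which is what makes the 20,000 inserts/second figure plausible for this kind of throwaway analytics.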

Re: MapReduce for random row key subset

2010-09-19 Thread Tim Robertson
If you are only doing 1k-100k record set analytics, would it be feasible to use the HBase client directly, perform a filtered scan and do the analytics in memory using Java Collections? It depends on the number of dimensions you need but 100k rows of a few Integers is not absurd to hold in memory,

Developers sought for web crawling project

2010-07-20 Thread Tim Robertson
Hi all, Disclosure: I have been an active member of the Hadoop / HBase / Hive mailing lists for some time. I am not a recruiter, but looking to increase a development team that I lead. I sincerely apologize if this message is against mailing list etiquette; I have not seen any guidelines forbidd

Re: GEO GIS support?

2010-06-26 Thread Tim Robertson
Hi, To my knowledge, there is nothing built in so you would have to build and maintain the spatial index yourself. If you are only doing a distance query, you might consider keeping a column containing something like a geohash (http://en.wikipedia.org/wiki/Geohash) and then build a secondary inde
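As a concrete illustration of the geohash idea, here is a sketch of the standard encoding algorithm described on the Wikipedia page (not code from the thread). Nearby points share a common key prefix, so rows keyed this way can be range-scanned by prefix to approximate a bounding-box query:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lat, lon, precision=11):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # even bit positions encode longitude, odd encode latitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # value in upper half: raise the lower bound
        else:
            bits.append(0)
            rng[1] = mid  # value in lower half: lower the upper bound
        even = not even
    # Pack each group of 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

# A point in Denmark hashes to a key starting with "u4"; any point nearby
# shares that prefix, which is what makes prefix scans useful.
key = geohash(57.64911, 10.40744)
```

The longer two hashes' common prefix, the closer the points generally are, though points straddling a cell boundary can be close yet share little prefix, so a real distance query still needs a post-filter.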

Re: Big machines or (relatively) small machines?

2010-06-08 Thread Tim Robertson
> - Do you plan to serve data out of HBase or will you just use it for > MapReduce? Or will it be a mix (not recommended)? I am also curious what would be the recommended deployment when you have this need (e.g. building multiple Lucene indexes which hold only the Row ID, so building is MR intens

Re: elastic search or other Lucene for HBase?

2010-06-03 Thread Tim Robertson
Lucene - Nutch > Hadoop ecosystem search :: http://search-hadoop.com/ > > > > - Original Message >> From: Tim Robertson >> To: hbase-u...@hadoop.apache.org >> Sent: Sat, March 27, 2010 2:46:00 PM >> Subject: elastic search or other Lucene for HBase? >>

Re: HBase support

2010-05-30 Thread Tim Robertson
Hi Alex, Is there a publicly visible roadmap somewhere for CDH3 please? http://archive.cloudera.com/docs/cdh3-top.html doesn't yet mention HBase but I gather it is actually in there. I am curious what Hive and Sqoop integration you might have setup out of the box. I could imagine a CDH3 installat