Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughranste...@apache.org wrote:
 John Martyniak wrote:

 When I run either of those on either of the two machines, it is trying to
 resolve against the DNS servers configured for the external addresses for
 the box.

 Here is the result
 Server:        xxx.xxx.xxx.69
 Address:    xxx.xxx.xxx.69#53

 OK. in an ideal world, each NIC has a different hostname. Now, that confuses
 code that assumes a host has exactly one hostname, not zero or two, and I'm
 not sure how well Hadoop handles the 2+ situation (I know it doesn't like 0,
 but hey, its a distributed application). With separate hostnames, you set
 hadoop up to work on the inner addresses, and give out the inner hostnames
 of the jobtracker and namenode. As a result, all traffic to the master nodes
 should be routed on the internal network


Also, a subtle issue that I ran into is full versus partial host names.
Even though all my configuration files reference fully qualified host
names such as server1.domain.com, the NameNode web interface will
redirect people to http://server1, probably because that is the system
hostname. In my case it is not a big deal, but it is something to
consider during setup.


Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
Also, if you are using a topology rack map, make sure your script
responds correctly to every possible hostname or IP address as well.

On Tue, Jun 9, 2009 at 1:19 PM, John Martyniakj...@avum.com wrote:
 It seems that this is the issue, as there several posts related to same
 topic but with no resolution.

 I guess the thing of it is that it shouldn't use the hostname of the machine
 at all.  If I tell it the master is x and it has an IP Address of x.x.x.102
 that should be good enough.

 And if that isn't the case then I should be able to specify which network
 adaptor to use as the ip address that it is going to lookup against, whether
 it is by DNS or by /etc/hosts.

 Because I suspect the problem is that I have named the machine as
 duey..com but have told hadoop that machine is called duey-direct.

 Is there work around in 0.19.1?  I am using this with Nutch so don't have an
 option to upgrade at this time.

 -John


 On Jun 9, 2009, at 11:59 AM, Steve Loughran wrote:

 John Martyniak wrote:

 When I run either of those on either of the two machines, it is trying to
 resolve against the DNS servers configured for the external addresses for
 the box.
 Here is the result
 Server:        xxx.xxx.xxx.69
 Address:    xxx.xxx.xxx.69#53

 OK. in an ideal world, each NIC has a different hostname. Now, that
 confuses code that assumes a host has exactly one hostname, not zero or two,
 and I'm not sure how well Hadoop handles the 2+ situation (I know it doesn't
 like 0, but hey, its a distributed application). With separate hostnames,
 you set hadoop up to work on the inner addresses, and give out the inner
 hostnames of the jobtracker and namenode. As a result, all traffic to the
 master nodes should be routed on the internal network




Re: Monitoring hadoop?

2009-06-05 Thread Edward Capriolo
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelmanbbock...@cse.unl.edu wrote:
 Hey Anthony,

 Look into hooking your Hadoop system into Ganglia; this produces about 20
 real-time statistics per node.

 Hadoop also does JMX, which hooks into more enterprise-y monitoring
 systems.

 Brian

 On Jun 5, 2009, at 8:55 AM, Anthony McCulley wrote:

 Hey all,
 I'm currently tasked to come up with a web/flex-based
 visualization/monitoring system for a cloud system using hadoop as part of
 a
 university research project.  I was wondering if I could elicit some
 feedback from all of you with regards to:


  - If you were an engineer of a cloud system running hadoop, what
  information would you be interested in capturing, viewing, monitoring,
 etc?
  - Is there any sort of real-time stats or monitoring currently available
  for hadoop?  if so, is in a web-friendly format?

 Thanks in advance,

 - Anthony


I have an ever-growing set of Cacti templates for Hadoop that use JMX
as the back end.

http://www.jointhegrid.com/hadoop/


Re: Blocks amount is stuck in statistics

2009-05-25 Thread Edward Capriolo
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin stas.os...@gmail.com wrote:
 Hi.

 Ok, was too eager to report :).

 It got sorted out after some time.

 Regards.

 2009/5/25 Stas Oskin stas.os...@gmail.com

 Hi.

 I just did an erase of large test folder with about 20,000 blocks, and
 created a new one. I copied about 128 blocks, and fsck reflects it
 correctly, but NN statistics still shows the old number. It does shows the
 currently used space correctly.

 Any idea if this a known issue and was fixed? Also, does it has any
 influence over the operations?

 I'm using Hadoop 0.18.3.

 Thanks.



This is something I am dealing with in my Cacti graphs. Shameless
plug: http://www.jointhegrid.com/hadoop

Some Hadoop JMX attributes are like traditional SNMP gauges: a JMX
request reads the current value directly and returns it.
FSDatasetStatus's Remaining attribute is implemented like that.

Other variables are implemented like SNMP counters; DataNodeStatistics'
BytesRead is like that: the number keeps increasing.

Other values are sampled. Since I monitor at 5-minute intervals, I am
set up like this:

dfs.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
dfs.period=300

DataNodeStatistics' BlocksWritten is a gauge that is sampled, so if
your sample period is 5 minutes it will be 5 minutes before the stats
show up, and after another 5 minutes they are replaced.

Using NullContextWithUpdateThread with a combination of sampled and
non-sampled variables is not exactly what I want, as some data lags
5 minutes behind real time, but NullContextWithUpdateThread is working
well for me.


Re: ssh issues

2009-05-22 Thread Edward Capriolo
Pankil,

I used to be very confused by hadoop and SSH keys. SSH is NOT
required. Each component can be started by hand. This gem of knowledge
is hidden away in the hundreds of DIGG style articles entitled 'HOW TO
RUN A HADOOP MULTI-MASTER CLUSTER!'

The SSH keys are only required by the shell scripts that are contained
with Hadoop like start-all. They are wrappers to kick off other
scripts on a list of nodes. I PERSONALLY dislike using SSH keys as a
software component and believe they should only be used by
administrators.

We chose the cloudera distribution.
http://www.cloudera.com/distribution. A big factor behind this was the
simple init.d scripts they provided. Each hadoop component has its own
start scripts hadoop-namenode, hadoop-datanode, etc.

My suggestion is taking a look at the Cloudera startup scripts. Even
if you decide not to use the distribution you can take a look at their
start up scripts and fit them to your needs.

On Fri, May 22, 2009 at 10:34 AM,  hmar...@umbc.edu wrote:
 Steve,

 Security through obscurity is always a good practice from a development
 standpoint and one of the reasons why tricking you out is an easy task.
 Please, keep hiding relevant details from people in order to keep everyone
 smiling.

 Hal

 Pankil Doshi wrote:
 Well i made ssh with passphares. as the system in which i need to login
 requires ssh with pass phrases and those systems have to be part of my
 cluster. and so I need a way where I can specify -i path/to key/ and
 passphrase to hadoop in before hand.

 Pankil


 Well, are trying to manage a system whose security policy is
 incompatible with hadoop's current shell scripts. If you push out the
 configs and manage the lifecycle using other tools, this becomes a
 non-issue. Dont raise the topic of HDFS security to your ops team
 though, as they will probably be unhappy about what is currently on offer.

 -steve







Re: Optimal Filesystem (and Settings) for HDFS

2009-05-18 Thread Edward Capriolo
Do not forget 'tune2fs -m 2'. By default the reserved block percentage
is set to 5%. With 1 TB disks we got 33 GB more usable space. Talk
about instant savings!

On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard a...@cloudera.com wrote:
 I believe Yahoo! uses ext3, though I know other people have said that XFS
 has performed better in various benchmarks.  We use ext3, though we haven't
 done any benchmarks to prove its worth.

 This question has come up a lot, so I think it'd be worth doing a benchmark
 and writing up the results.  I haven't been able to find a detailed analysis
 / benchmark writeup comparing various filesystems, unfortunately.

 Hope this helps,

 Alex

 On Mon, May 18, 2009 at 8:54 AM, Bob Schulze b.schu...@ecircle.com wrote:

 We are currently rebuilding our cluster - has anybody recommendations on
 the underlaying file system? Just standard Ext3?

 I could imagine that the block size could be larger than its default...

 Thx for any tips,

        Bob





Re: Linking against Hive in Hadoop development tree

2009-05-15 Thread Edward Capriolo
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball aa...@cloudera.com wrote:
 Hi all,

 For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
 to uploading data into HDFS and using MapReduce to load/transform the data,
 I'd like to integrate more closely with Hive. Specifically, to run the
 CREATE TABLE statements needed to automatically inject table defintions into
 Hive's metastore for the data files that sqoop loads into HDFS. Doing this
 requires linking against Hive in some way (either directly by using one of
 their API libraries, or loosely by piping commands into a Hive instance).

 In either case, there's a dependency there. I was hoping someone on this
 list with more Ivy experience than I knows what's the best way to make this
 happen. Hive isn't in the maven2 repository that Hadoop pulls most of its
 dependencies from. It might be necessary for sqoop to have access to a full
 build of Hive. It doesn't seem like a good idea to check that binary
 distribution into Hadoop svn, but I'm not sure what's the most expedient
 alternative. Is it acceptable to just require that developers who wish to
 compile/test/run sqoop have a separate standalone Hive deployment and a
 proper HIVE_HOME variable? This would keep our source repo clean. The
 downside here is that it makes it difficult to test Hive-specific
 integration functionality with Hudson and requires extra leg-work of
 developers.

 Thanks,
 - Aaron Kimball


Aaron,

I have a similar situation. I am using the GPL geo-ip library as a
Hive UDF. Due to Apache/GPL licensing issues the code would not be
compatible.

Currently my build process references all of the Hive lib/*.jar files.
It does not really need all of them, but since I am not exactly sure
which ones I need, I reference all of them.

I was thinking one option is to run a Git repository; this way I can
integrate my patch into my forked Hive.

I see your problem, though: you have a few Hive entry points:
1) JDBC
2) Hive Thrift server
3) scripting
4) Java API

The JDBC and Thrift options should be the lightest, in that a few jar
files would make up the entry point rather than the entire Hive
distribution.

Although, now that Hive has had two releases, maybe Hive should be in
Maven. With that, Hive could be an optional or a mandatory Ant target
for Sqoop.


Re: sub 60 second performance

2009-05-11 Thread Edward Capriolo
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon t...@cloudera.com wrote:
 In addition to Jason's suggestion, you could also see about setting some of
 Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
 it should be easy to re-load it onto the cluster if it's lost, so even
 putting dfs.data.dir in /dev/shm might be worth trying.
 You'll probably also want mapred.local.dir in /dev/shm

 Note that if in fact you don't have enough RAM to do this, you'll start
 swapping and your performance will suck like crazy :)

 That said, you may find that even with all storage in RAM your jobs are
 still too slow. Hadoop isn't optimized for this kind of small-job
 performance quite yet. You may find that task setup time dominates the job.
 I think it's entirely reasonable to shoot for sub-60-second jobs down the
 road, and I'd find it interesting to hear what the results are now. Hope you
 report back!

 -Todd

 On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer 
 mattbowy...@googlemail.comwrote:

 Hi,

 I am trying to do 'on demand map reduce' - something which will return in
 reasonable time (a few seconds).

 My dataset is relatively small and can fit into my datanode's memory. Is it
 possible to keep a block in the datanode's memory so on the next job the
 response will be much quicker? The majority of the time spent during the
 job
 run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
 tried
 using the setNumTasksToExecutePerJvm but the block still seems to be
 cleared
 from memory after the job.

 thanks!



Also, if your data set is small, you can reduce overhead (and
parallelism) by lowering the number of mappers and reducers.

-Dmapred.map.tasks=11
-Dmapred.reduce.tasks=3

Or maybe even go as low as:

-Dmapred.map.tasks=1
-Dmapred.reduce.tasks=1

I use this tactic on jobs with small data sets where the processing
time is much less than the overhead of starting multiple mappers/
reducers and shuffling data.
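
The same thing can also be set from the job driver; here is a minimal
sketch with the old org.apache.hadoop.mapred API (the class and job
names are only illustrative, and the mapper, reducer, and paths still
have to be configured before it will actually run):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SmallDatasetDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmallDatasetDriver.class);
    conf.setJobName("small-dataset-job");
    // Equivalent to -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1.
    // The map count is only a hint to the framework; the reduce count is exact.
    conf.setNumMapTasks(1);
    conf.setNumReduceTasks(1);
    // ... set mapper, reducer, input and output paths as usual ...
    JobClient.runJob(conf);
  }
}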


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
 Hey,

 You can read more about why small files are difficult for HDFS at
 http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

 Regards,
 Jeff

 2009/5/7 Piotr Praczyk piotr.prac...@gmail.com

 If You want to use many small files, they are probably having the same
 purpose and struc?
 Why not use HBase instead of a raw HDFS ? Many small files would be packed
 together and the problem would disappear.

 cheers
 Piotr

 2009/5/7 Jonathan Cao jonath...@rockyou.com

  There are at least two design choices in Hadoop that have implications
 for
  your scenario.
  1. All the HDFS meta data is stored in name node memory -- the memory
 size
  is one limitation on how many small files you can have
 
  2. The efficiency of map/reduce paradigm dictates that each
 mapper/reducer
  job has enough work to offset the overhead of spawning the job.  It
 relies
  on each task reading contiguous chuck of data (typically 64MB), your
 small
  file situation will change those efficient sequential reads to larger
  number
  of inefficient random reads.
 
  Of course, small is a relative term?
 
  Jonathan
 
  2009/5/6 陈桂芬 chenguifen...@163.com
 
   Hi:
  
   In my application, there are many small files. But the hadoop is
 designed
   to deal with many large files.
  
   I want to know why hadoop doesn't support small files very well and
 where
   is the bottleneck. And what can I do to improve the Hadoop's capability
  of
   dealing with small files.
  
   Thanks.
  
  
 


When the small-file problem comes up, most of the talk centers around
the inode table being in memory. The Cloudera blog points out
something more:

Furthermore, HDFS is not geared up to efficiently accessing small
files: it is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots of
hopping from datanode to datanode to retrieve each small file, all of
which is an inefficient data access pattern.

My application attempted to load 9,000 6 KB files using a single-
threaded program and FSDataOutputStream objects to write directly
to Hadoop files. My plan was to have Hadoop merge these files in the
next step. I had to abandon this plan because the process was taking
hours. I knew HDFS had a small-file problem, but I never realized
that I could not approach the problem the 'old fashioned way'. I merged
the files locally instead, and uploading the few resulting files gave
great throughput. Small files are not just a permanent-storage issue;
they are a serious optimization concern as well.
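
For what it is worth, the write pattern that was so slow looked roughly
like the sketch below (simplified, not my exact code; the path, the
file count, and the payload are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One create()/close() round trip to the namenode per tiny file --
    // this is the part that made loading thousands of 6 KB files so slow.
    for (int i = 0; i < 9000; i++) {
      Path p = new Path("/staging/part-" + i);
      FSDataOutputStream out = fs.create(p);
      out.write(loadLocalFile(i));  // ~6 KB payload per file
      out.close();
    }
    fs.close();
  }

  // Stand-in for reading the local 6 KB file; details omitted.
  private static byte[] loadLocalFile(int i) {
    return new byte[6 * 1024];
  }
}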


Cacti Templates for Hadoop

2009-05-06 Thread Edward Capriolo
For those of you that would like to graph the hadoop JMX variables
with cacti I have created cacti templates and data input scripts.
Currently the package gathers and graphs the following information
from the NameNode:

Blocks Total
Files Total
Capacity Used/Capacity Free
Live Data Nodes/Dead Data Nodes

Obligatory Screen shot
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/doc/cacti-hadoop.png

Setup instructions here:
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/doc/INSTALL.txt

Project Trunk:
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/

My next steps are creating graphs for other hadoop components
(DataNode RPCstats). If you want to hurry this process along drop me a
line for extra motivation :)

Edward


Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Edward Capriolo
'Cloud computing' is a hot term. According to the definition provided
by Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing),
Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well.

Hadoop is scalable; with HOD it is dynamically scalable.

I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
'utility computing', as managing the stack and getting started is
quite a complex process.

Also, this stack runs best on a LAN with high-speed interconnects.
Historically the cloud is composed of WAN links. An implication of
cloud computing is that different services would be running in
different geographic locations, which is not how Hadoop is normally
deployed.

I believe 'Apache Grid Stack' would be more fitting.

http://en.wikipedia.org/wiki/Grid_computing

Grid computing (or the use of computational grids) is the application
of several computers to a single problem at the same time — usually to
a scientific or technical problem that requires a great number of
computer processing cycles or access to large amounts of data.

Grid computing, per the Wikipedia definition, describes exactly what
Hadoop does. Without Amazon S3 and EC2, Hadoop does not fit well into
'cloud computing', IMHO.


Re: Getting free and used space

2009-05-02 Thread Edward Capriolo
You can also pull these variables from the NameNode and DataNode with
JMX. I am doing this to graph them with Cacti. Both the readwrite and
readonly JMX users can access these variables.
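
If anyone wants to try this, the plumbing is just the standard
javax.management remote API, roughly as in the sketch below (the host,
port, and MBean name pattern are examples; check jconsole for the exact
object and attribute names your version exposes):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NameNodeJmxProbe {
  public static void main(String[] args) throws Exception {
    // Assumes the namenode JVM was started with com.sun.management.jmxremote.port=10001
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://namenode1:10001/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    MBeanServerConnection mbsc = connector.getMBeanServerConnection();

    // List every Hadoop MBean; the exact object names differ between
    // releases, so query with a wildcard pattern first.
    Set<ObjectName> names = mbsc.queryNames(new ObjectName("hadoop*:*"), null);
    for (ObjectName name : names) {
      System.out.println(name);
      // Individual values can then be read with
      // mbsc.getAttribute(name, "SomeAttributeName");
    }
    connector.close();
  }
}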

On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin stas.os...@gmail.com wrote:
 Hi.

 Any idea if the getDiskStatus() function requires superuser rights?

 Or it can work for any user?

 Thanks.

 2009/4/9 Aaron Kimball aa...@cloudera.com

 You can insert this propery into the jobconf, or specify it on the command
 line e.g.: -D hadoop.job.ugi=username,group,group,group.

 - Aaron

 On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman bbock...@cse.unl.edu
 wrote:

  Hey Stas,
 
  What we do locally is apply the latest patch for this issue:
  https://issues.apache.org/jira/browse/HADOOP-4368
 
  This makes getUsed (actually, it switches to FileSystem.getStatus) not a
  privileged action.
 
  As far as specifying the user ... gee, I can't think of it off the top of
  my head.  It's a variable you can insert into the JobConf, but I'd have
 to
  poke around google or the code to remember which one (I try to not
 override
  it if possible).
 
  Brian
 
 
  On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:
 
   Hi.
 
  Thanks for the explanation.
 
  Now for the easier part - how do I specify the user when connecting? :)
 
  Is it a config file level, or run-time level setting?
 
  Regards.
 
  2009/4/8 Brian Bockelman bbock...@cse.unl.edu
 
   Hey Stas,
 
  Did you try this as a privileged user?  There might be some permission
  errors... in most of the released versions, getUsed() is only available
  to
  the Hadoop superuser.  It may be that the exception isn't propagating
  correctly.
 
  Brian
 
 
  On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:
 
  Hi.
 
 
  I'm trying to use the API to get the overall used and free spaces.
 
  I tried this function getUsed(), but it always returns 0.
 
  Any idea?
 
  Thanks.
 
 
 
 
 




Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski spo...@gmail.com wrote:
 If you have trouble loading your data into mysql using INSERTs or LOAD
 DATA, consider that MySQL supports CSV directly using the CSV storage
 engine. The only thing you have to do is to copy your hadoop produced
 csv file into the mysql data directory and issue a flush tables
 command to have mysql flush its caches and pickup the new file. Its
 very simple and you have the full set of sql commands available just
 as with innodb or myisam. What you don't get with the csv engine are
 indexes and foreign keys. Can't have it all, can you?

 Stefan


 On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas b...@habermaas.us wrote:
 Excellent discussion. Thank you Todd.
 You're forgiven for being off topic (at least by me).
 :)
 Bill

 -Original Message-
 From: Todd Lipcon [mailto:t...@cloudera.com]
 Sent: Tuesday, April 28, 2009 2:29 PM
 To: core-user
 Subject: Re: Hadoop / MySQL

 Warning: derailing a bit into MySQL discussion below, but I think enough
 people have similar use cases that it's worth discussing this even though
 it's gotten off-topic.

 2009/4/28 tim robertson timrobertson...@gmail.com


 So we ended up with 2 DBs
 - DB1 we insert to, prepare and do batch processing
 - DB2 serving the read only web app


 This is a pretty reasonable and common architecture. Depending on your
 specific setup, instead of flip-flopping between DB1 and DB2, you could
 actually pull snapshots of MyISAM tables off DB1 and load them onto other
 machines. As long as you've flushed the tables with a read lock, MyISAM
 tables are transferrable between machines (eg via rsync). Obviously this can
 get a bit hairy, but it's a nice trick to consider for this kind of
 workflow.


 Why did we end up with this?  Because of locking on writes that kill
 reads as you say... basically you can't insert when a read is
 happening on myisam as it locks the whole table.


 This is only true if you have binary logging enabled. Otherwise, myisam
 supports concurrent inserts with reads. That said, binary logging is
 necessary if you have any slaves. If you're loading bulk data from the
 result of a mapreduce job, you might be better off not using replication and
 simply loading the bulk data to each of the serving replicas individually.
 Turning off the binary logging will also double your write speed (LOAD DATA
 writes the entirety of the data to the binary log as well as to the table)


  InnoDB has row level
 locking to get around this but in our experience (at the time we had
 130million records) it just didn't work either.


 You're quite likely to be hitting the InnoDB autoincrement lock if you have
 an autoincrement primary key here. There are fixes for this in MySQL 5.1.
 The best solution is to avoid autoincrement primary keys and use LOAD DATA
 for these kind of bulk loads, as others have suggested.


  We spent €10,000 for
 the supposed european expert on mysql from their professional
 services and were unfortunately very disappointed.  Seems such large
 tables are just problematic with mysql.  We are now very much looking
 into Lucene technologies for search and Hadoop for reporting and
 datamining type operations. SOLR does a lot of what our DB does for
 us.


 Yep - oftentimes MySQL is not the correct solution, but other times it can
 be just what you need. If you already have competencies with MySQL and a
 good access layer from your serving tier, it's often easier to stick with
 MySQL than add a new technology into the mix.



 So with myisam... here is what we learnt:

 Only very latest mysql versions (beta still I think) support more than
 4G memory for indexes (you really really need the index in memory, and
 where possible the FK for joins in the index too).


 As far as I know, any 64-bit mysql instance will use more than 4G without
 trouble.


  Mysql has
 differing join strategies between innoDB and myisam, so be aware.


 I don't think this is true. Joining happens at the MySQL execution layer,
 which is above the storage engine API. The same join strategies are
 available for both. For a particular query, InnoDB and MyISAM tables may end
 up providing a different query plan based on the statistics that are
 collected, but given properly analyzed tables, the strategies will be the
 same. This is how MySQL allows inter-storage-engine joins. If one engine is
 providing a better query plan, you can use query hints to enforce that plan
 (see STRAIGHT_JOIN and FORCE INDEX for example)


 An undocumented feature of myisam is you can create memory buffers for
 single indexes:
 In the my.cnf:
     taxon_concept_cache.key_buffer_size=3990M    -- for some reason
 you have to drop a little under 4G

 then in the DB run:
    cache index taxon_concept in taxon_concept_cache;
    load index into cache taxon_concept;

 This allows for making sure an index gets into memory for sure.


 But for most use cases and a properly configured machine you're 

Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon t...@cloudera.com wrote:
 On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski spo...@gmail.comwrote:

 If you have trouble loading your data into mysql using INSERTs or LOAD
 DATA, consider that MySQL supports CSV directly using the CSV storage
 engine. The only thing you have to do is to copy your hadoop produced
 csv file into the mysql data directory and issue a flush tables
 command to have mysql flush its caches and pickup the new file. Its
 very simple and you have the full set of sql commands available just
 as with innodb or myisam. What you don't get with the csv engine are
 indexes and foreign keys. Can't have it all, can you?


 The CSV storage engine is definitely an interesting option, but it has a
 couple downsides:

 - Like you mentioned, you don't get indexes. This seems like a huge deal to
 me - the reason you want to load data into MySQL instead of just keeping it
 in Hadoop is so you can service real-time queries. Not having any indexing
 kind of defeats the purpose there. This is especially true since MySQL only
 supports nested-loop joins, and there's no way of attaching metadata to a
 CSV table to say hey look, this table is already in sorted order so you can
 use a merge join.

 - Since CSV is a text based format, it's likely to be a lot less compact
 than a proper table. For example, a unix timestamp is likely to be ~10
 characters vs 4 bytes in a packed table.

 - I'm not aware of many people actually using CSV for anything except
 tutorials and training. Since it's not in heavy use by big mysql users, I
 wouldn't build a production system around it.

 Here's a wacky idea that I might be interested in hacking up if anyone's
 interested:

 What if there were a MyISAMTableOutputFormat in hadoop? You could use this
 as a reducer output and have it actually output .frm and .myd files onto
 HDFS, then simply hdfs -get them onto DB servers for realtime serving.
 Sounds like a fun hack I might be interested in if people would find it
 useful. Building the .myi indexes in Hadoop would be pretty killer as well,
 but potentially more difficult.

 -Todd


The .frm and .myd files are binary, platform-dependent files. You
cannot even move them from 32-bit to 64-bit systems. Generating them
without the native tools would be difficult. Moving them around with
HDFS might have merit, although rsync could accomplish the same thing.

Derby might be a better candidate for something like this, since its
underlying database files are cross-platform.


Re: max value for a dataset

2009-04-21 Thread Edward Capriolo
On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
 Hey Jason,

 Wouldn't this be avoided if you used a combiner to also perform the max()
 operation?  A minimal amount of data would be written over the network.

 I can't remember if the map output gets written to disk first, then combine
 applied or if the combine is applied and then the data is written to disk.
  I suspect the latter, but it'd be a big difference.

 However, the original poster mentioned he was using hbase/pig -- certainly,
 there's some better way to perform max() in hbase/pig?  This list probably
 isn't the right place to ask if you are using those technologies; I'd
 suspect they do something more clever (certainly, you're performing a
 SQL-like operation in MapReduce; not always the best way to approach this
 type of problem).

 Brian

 On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:

 The Hadoop Framework requires that a Map Phase be run before the Reduce
 Phase.
 By doing the initial 'reduce' in the map, a much smaller volume of data
 has
 to flow across the network to the reduce tasks.
 But yes, this could simply be done by using an IdentityMapper and then
 have
 all of the work done in the reduce.


 On Mon, Apr 20, 2009 at 4:26 AM, Shevek had...@anarres.org wrote:

 On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:

 The traditional approach would be a Mapper class that maintained a
 member
 variable that you kept the max value record, and in the close method of

 your

 mapper you output a single record containing that value.

 Perhaps you can forgive the question from a heathen, but why is this
 first mapper not also a reducer? It seems to me that it is performing a
 reduce operation, and that maps should (philosophically speaking) not
 maintain data from one input to the next, since the order (and location)
 of inputs is not well defined. The program to compute a maximum should
 then be a tree of reduction operations, with no maps at all.

 Of course in this instance, what you propose works, but it does seem
 puzzling. Perhaps the answer is simple architectural limitation?

 S.

 The map method of course compares the current record against the max and
 stores current in max when current is larger than max.

 Then each map output is a single record and the reduce behaves very
 similarly, in that the close method outputs the final max record. A

 single

 reduce would be the simplest.

 On your question a Mapper and Reducer defines 3 entry points, configure,
 called once on on task start, the map/reduce called once for each
 record,
 and close, called once after the last call to map/reduce.
 at least through 0.19, the close is not provided with the output

 collector

 or the reporter, so you need to save them in the map/reduce method.

 On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain russ...@gmail.com

 wrote:

 How do you identify that map task is ending within the map method? Is

 it

 possible to know which is the last call to map method?

 On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo 

 edlinuxg...@gmail.com

 wrote:

 I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
 support the ability to max(). I am writing my own max() over a simple
 one column dataset.

 The best solution I came up with was using MapRunner. With maprunner

 I

 can store the highest value in a private member variable. I can read
 through the entire data set and only have to emit one value per

 mapper

 upon completion of the map data. Then I can specify one reducer and
 carry out the same operation.

 Does anyone have a better tactic. I thought a counter could do this
 but are they atomic?









 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422



I took a look at the description of the book,
http://www.apress.com/book/view/9781430219422. Hopefully it, and other
endeavors like it, can fill a need I have and see quite often: I am
quite interested in practical Hadoop algorithms. Most of my searching
finds repeated WordCount examples and depictions of the shuffle-sort.

The most practical lessons I took from my Fortran programming were how
to sum(), min(), max(), and average() a data set. If Hadoop had a
cookbook of sorts for algorithm design, I think many people would
benefit.


Re: max value for a dataset

2009-04-20 Thread Edward Capriolo
Yes, I considered Shevek's tactic as well, but as Jason pointed out,
emitting the entire data set just to find the maximum value would be
wasteful. You do not want to sort the dataset; you just want to break
it into parts, find the max value of each part, then bring the results
into one place and perform the operation again.

The way I look at it, the 'best' Hadoop algorithms are the ones that
emit fewer key/value pairs. What Jason suggested and the MapRunner
concept I was looking at would both emit about the same number of
pairs.

I am curious to see whether the MapRunner implementation would run
faster due to fewer calls to the map function. After all, MapRunner is
only iterating over the data set.
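
For anyone curious, the MapRunner-style version I have in mind looks
roughly like the sketch below (written against the 0.19-era mapred API;
it assumes one numeric value per input line, and the class and type
choices are only illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, NullWritable, LongWritable> {

  public void configure(JobConf job) {
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<NullWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    long max = Long.MIN_VALUE;
    boolean seen = false;
    // Iterate the whole split ourselves, keeping only the running maximum.
    while (input.next(key, value)) {
      long v = Long.parseLong(value.toString().trim());
      if (!seen || v > max) {
        max = v;
        seen = true;
      }
    }
    // One emitted record per mapper; a single reducer repeats the comparison.
    if (seen) {
      output.collect(NullWritable.get(), new LongWritable(max));
    }
  }
}

It would be wired in with conf.setMapRunnerClass(MaxMapRunner.class)
and a single reducer doing the same comparison over the per-mapper
maximums.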

On Mon, Apr 20, 2009 at 8:25 AM, jason hadoop jason.had...@gmail.com wrote:
 The Hadoop Framework requires that a Map Phase be run before the Reduce
 Phase.
 By doing the initial 'reduce' in the map, a much smaller volume of data has
 to flow across the network to the reduce tasks.
 But yes, this could simply be done by using an IdentityMapper and then have
 all of the work done in the reduce.


 On Mon, Apr 20, 2009 at 4:26 AM, Shevek had...@anarres.org wrote:

 On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
  The traditional approach would be a Mapper class that maintained a member
  variable that you kept the max value record, and in the close method of
 your
  mapper you output a single record containing that value.

 Perhaps you can forgive the question from a heathen, but why is this
 first mapper not also a reducer? It seems to me that it is performing a
 reduce operation, and that maps should (philosophically speaking) not
 maintain data from one input to the next, since the order (and location)
 of inputs is not well defined. The program to compute a maximum should
 then be a tree of reduction operations, with no maps at all.

 Of course in this instance, what you propose works, but it does seem
 puzzling. Perhaps the answer is simple architectural limitation?

 S.

  The map method of course compares the current record against the max and
  stores current in max when current is larger than max.
 
  Then each map output is a single record and the reduce behaves very
  similarly, in that the close method outputs the final max record. A
 single
  reduce would be the simplest.
 
  On your question a Mapper and Reducer defines 3 entry points, configure,
  called once on on task start, the map/reduce called once for each record,
  and close, called once after the last call to map/reduce.
  at least through 0.19, the close is not provided with the output
 collector
  or the reporter, so you need to save them in the map/reduce method.
 
  On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain russ...@gmail.com
 wrote:
 
   How do you identify that map task is ending within the map method? Is
 it
   possible to know which is the last call to map method?
  
   On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo 
 edlinuxg...@gmail.com
   wrote:
  
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
support the ability to max(). I am writing my own max() over a simple
one column dataset.
   
The best solution I came up with was using MapRunner. With maprunner
 I
can store the highest value in a private member variable. I can read
through the entire data set and only have to emit one value per
 mapper
upon completion of the map data. Then I can specify one reducer and
carry out the same operation.
   
Does anyone have a better tactic. I thought a counter could do this
but are they atomic?
   
  
 
 
 




 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422



max value for a dataset

2009-04-18 Thread Edward Capriolo
I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
support max(). I am writing my own max() over a simple one-column
dataset.

The best solution I came up with was using MapRunner. With MapRunner I
can store the highest value in a private member variable, read through
the entire data set, and only have to emit one value per mapper upon
completion of the map input. Then I can specify one reducer and carry
out the same operation.

Does anyone have a better tactic? I thought a counter could do this,
but are counters atomic?


Re: Using HDFS to serve www requests

2009-03-27 Thread Edward Capriolo
but does Sun's Lustre follow in the steps of Gluster then

Yes. IMHO GlusterFS advertises benchmarks vs. Lustre.

The main difference is that GlusterFS is a FUSE (userspace) filesystem,
while Lustre has to be patched into the kernel or loaded as a module.


Re: virtualization with hadoop

2009-03-26 Thread Edward Capriolo
I use linux-vserver http://linux-vserver.org/

The Linux-VServer technology is a soft partitioning concept based on
Security Contexts which permits the creation of many independent
Virtual Private Servers (VPS) that run simultaneously on a single
physical server at full speed, efficiently sharing hardware resources.

Usually whenever people talk about virtual machines, I hear about
VMware, Xen, and QEMU. For my purposes Linux-VServer is far superior
to all of them, and it is very helpful for the Hadoop work I do (I only
want Linux guests).

No emulation overhead: I installed VMware Server on my laptop and was
able to get 3 Linux instances running before the system was unusable,
and the instances were not even doing anything.

With VServer my system is not wasting cycles emulating devices. VMs
securely share a kernel and memory, so you can effectively run many
more VMs at once. This leaves the processor for user processes
(Hadoop), not emulation overhead.

A minimal installation is 50 MB. I do not need a multi GB Linux
install just to test a version of hadoop. This allows me to recklessly
make VMs for whatever I want and not have to worry about GB chunks of
my hard drive going with each VM.

I can tar up a VM and use it as a template to install another VM. Thus
I can deploy a new system in under 30 seconds. The HTTP RPM install
takes about 2 minutes.

The guest is chroot'ed, so I can easily copy files into the guest using
normal copy commands. Think ant deploy -DTARGETDIR=/path/to/guest.

But it is horribly slow if you do not have enough RAM and multiple
disks, since all I/O operations go to the same disk.

VServer will not solve this problem, but at least you won't be losing
I/O to 'emulation'.

If you are working with hadoop and you need to be able to have
multiple versions running, with different configurations, take a look
at VServer.


Re: Using HDFS to serve www requests

2009-03-26 Thread Edward Capriolo
It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.

I have modified this code to run inside Tomcat:
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.
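
The gist of what I run inside Tomcat is not much more than the sketch
below (a trimmed-down example; the servlet name, the fs.default.name
URI, and the request-parameter handling are placeholders, not my exact
code):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServlet extends HttpServlet {
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode1:9000");  // placeholder URI
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(req.getParameter("path"));        // e.g. ?path=/logs/x
    InputStream in = fs.open(path);
    OutputStream out = resp.getOutputStream();
    // Stream the HDFS file straight to the HTTP response and close both ends.
    IOUtils.copyBytes(in, out, conf, true);
  }
}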

GlusterFS is more like a traditional POSIX filesystem. It supports
locking and appends, and you can do things like put the MySQL data
directory on it.

GlusterFS is geared toward storing data to be accessed with low
latency. Nodes (bricks) are normally connected via gigabit Ethernet or
InfiniBand. The GlusterFS volume is mounted directly on a Unix system.

Hadoop is a userspace file system. The latency is higher. Nodes are
connected by gigabit Ethernet. It is closely coupled with MapReduce.

You can use the API or the FUSE module to mount Hadoop, but that is not
a primary goal of Hadoop. Hope that helps.


Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin
greycat.na@gmail.com wrote:
 Hi,

 Is anyone using Hadoop as more of a near/almost real-time processing
 of log data for their systems to aggregate stats, etc?

 We do, although near realtime is pretty relative subject and your
 mileage may vary. For example, startups / shutdowns of Hadoop jobs are
 pretty expensive and it could take anything from 5-10 seconds up to
 several minutes to get the job started and almost same thing goes for
 job finalization. Generally, if your near realtime would tolerate
 3-4-5 minutes lag, it's possible to use Hadoop.

 --
 WBR, Mikhail Yakshin


I was thinking about this. Assuming your datasets are small, would
running a local jobtracker, or even running the MiniMR cluster from the
test cases, be an interesting way to run small jobs confined to one
CPU?


Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
Yeah, but what's the point of using Hadoop then? i.e. we lost all the
parallelism?

Some jobs do not need it. For example, I am working with the Hive
subproject. If I have a table that is smaller than my block size,
having a large number of mappers or reducers is counterproductive:
Hadoop will start up mappers that never get any data. Setting the job
tracker to 'local', or setting map tasks and reduce tasks to 1, makes
the job finish faster: 20 seconds vs. 10 seconds.

If you have a small data set and a system with 8 cores, the MiniMR
cluster can possibly be used as an embedded Hadoop. For some jobs the
most efficient parallelism might be 1.

WordCount of "1 2 3 4 5 6" on the MiniMRCluster test case takes less
than two seconds.

It may not be the common case, but it may be feasible to use hadoop in
that manner.
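
Setting the job tracker to 'local', as mentioned above, is just a
configuration property; from a driver it could look like this rough
sketch (old mapred API; the class name is illustrative, and the job
still needs its mapper, reducer, and paths configured):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalModeDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalModeDriver.class);
    // Run the whole job inside the client JVM instead of submitting it to
    // the cluster; for tiny inputs this skips most of the startup overhead.
    conf.set("mapred.job.tracker", "local");
    conf.setNumReduceTasks(1);
    // ... set mapper, reducer, input and output paths as usual ...
    JobClient.runJob(conf);
  }
}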


Re: Batching key/value pairs to map

2009-02-23 Thread Edward Capriolo
We have a MR program that collects once for each token on a line. What
types of applications can benefit from batch mapping?


Hadoop JMX

2009-02-20 Thread Edward Capriolo
I am working to graph the hadoop JMX variables.
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html
I have two nodes, one running 0.17 and the other running 0.19.

The NameNode JMX objects and attributes seem to be working well. I am
graphing Capacity, NumberOfBlocks, and NumberOfFiles, as well as the
operations numLiveDataNodes() and numDeadDataNodes().

It seems like the DataNode JMX objects are mostly 0 or -1. I do not
have heavy load on these systems, so it is tricky to tell whether a
counter is implemented.

My questions:
1) If a JMX attribute is added, is it generally added as a placeholder
to be implemented later, or is it implemented when added?
2) Is there a target version to have all these attributes implemented,
or are these all being handled via separate Jira issues?
3) Can I set up TaskTrackers to be monitored as I can for DataNodes and
NameNodes?
4) Any tips, tricks, or gotchas?

Thank you


Re: Pluggable JDBC schemas [Was: How to use DBInputFormat?]

2009-02-13 Thread Edward Capriolo
One thing to mention: LIMIT is not standard SQL. Microsoft SQL Server
uses SELECT TOP 100 * FROM table. Some RDBMSs may not support any such
syntax. To be more SQL-compliant you should use some column, like an
auto-increment ID or a DATE column, as an offset. It is tricky to write
anything truly database-agnostic, though.

On Fri, Feb 13, 2009 at 8:18 AM, Fredrik Hedberg fred...@avafan.com wrote:
 Hi,

 Please let us know how this works out. Also, it would be nice if people with
 experience with other RDMBS than MySQL and Oracle could comment on the
 syntax and performance of their respective RDBMS with regard to Hadoop. Even
 if the syntax of the current SQL queries are valid for other systems than
 MySQL, some users would surely benefit performance-wise from having
 pluggable schemas in the JDBC interface for Hadoop.


 Fredrik

 On Feb 12, 2009, at 6:05 PM, Brian MacKay wrote:

 Amandeep,

 I spoke w/ one of our Oracle DBA's and he suggested changing the query
 statement as follows:

 MySql Stmt:
 select * from TABLE  limit splitlength offset splitstart
 ---
 Oracle Stmt:
 select *
   from (select a.*, rownum rno
           from (your_query_here must contain order by) a
          where rownum <= splitstart + splitlength)
  where rno > splitstart;

 This can be put into a function, but would require a type as well.
 -

 If you edit org.apache.hadoop.mapred.lib.db.DBInputFormat, getSelectQuery,
 it should work in Oracle

 protected String getSelectQuery() {

... edit to include check for driver and create Oracle Stmt

 return query.toString();
   }


 Brian

 ==

 On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:


 The 0.19 DBInputFormat class implementation is IMHO only suitable for
 very simple queries working on only few datasets. Thats due to the
 fact that it tries to create splits from the query by
 1) getting a count of all rows using the specified count query (huge
 performance impact on large tables)
 2) creating splits by issuing an individual query for each split with
 a limit and offset parameter appended to the input sql query

 Effectively your input query select * from orders would become
 select * from orders limit splitlength offset splitstart and
 executed until count has been reached. I guess this is not working sql
 syntax for oracle.

 Stefan


 2009/2/4 Amandeep Khurana ama...@gmail.com:

 Adding a semicolon gives me the error ORA-00911: Invalid character

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com
 wrote:


 Amandeep,
 SQL command not properly ended
 I get this error whenever I forget the semicolon at the end.
 I know, it doesn't make sense, but I recommend giving it a try

 Rasit

 2009/2/4 Amandeep Khurana ama...@gmail.com:

 The same query is working if I write a simple JDBC client and query
 the
 database. So, I'm probably doing something wrong in the connection

 settings.

 But the error looks to be on the query side more than the connection

 side.

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com

 wrote:

 Thanks Kevin

 I couldnt get it work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics
 with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job:
 job_local_0001
 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
 java.io.IOException: ORA-00933: SQL command not properly ended

 at



 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at


 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 java.io.IOException: Job failed!
 at

 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)

 at LoadTable1.run(LoadTable1.java:130)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at LoadTable1.main(LoadTable1.java:107)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown

 Source)

 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 

Using HOD to manage a production cluster

2009-01-31 Thread Edward Capriolo
I am looking at using HOD (Hadoop On Demand) to manage a production
cluster. After reading the documentation, it seems that HOD is missing
some things that would need to be carefully set in a production
cluster.

Rack Locality:
HOD uses the -N 5 option and starts a cluster of N nodes. There seems
to be no way to pass specific options to them individually. How can I
make sure the set of servers selected will end up in different racks?

Datanode blacklist/whitelist:
These are listed in a file; can that file be generated?

hadoop-env:
Can I set these options from HOD, or do I have to build them into the
Hadoop tar?

JMX settings:
Can I set these options from HOD, or do I have to build them into the
Hadoop tar?

Upgrade with non-symmetric configurations:

Old servers  /mnt/drive1  /mnt/drive2
New Servers  /mnt/drive1  /mnt/drive2  /mnt/drive3

Can HOD ship out different configuration files to different nodes? As
new nodes join the cluster for an upgrade, they may have different
configurations than the old ones.

From reading the docs, it seems like HOD is great for building
on-demand clusters, but it may not be ideal for managing a single
permanent, long-term cluster.

Accidental cluster destruction:
It sounds silly, but might the wrong command take out a cluster in one
swipe? Possibly this should be blocked.

Any thoughts?


Re: Zeroconf for hadoop

2009-01-26 Thread Edward Capriolo
Zeroconf is more focused on simplicity than security. One of the
original problems, which may have been fixed by now, is that any
program can announce any service; e.g., my laptop can announce that it
is the DNS server for google.com.

I want to mention a related topic to the list. People are approaching
auto-discovery in a number of ways in Jira. There are a few ways I can
think of to discover Hadoop. A very simple way might be to publish the
configuration over a web interface. I use a network storage system
called GlusterFS, which can be configured so the server holds the
configuration for each client. If the Hadoop namenode held the entire
configuration for all the nodes, each node would only need to be aware
of the namenode and could retrieve its configuration from it.

Having central configuration management or a discovery system would be
very useful. HOD is, I think, the closest thing, though it is more of a
top-down deployment system.


Re: Netbeans/Eclipse plugin

2009-01-25 Thread Edward Capriolo
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar vinaykat...@gmail.com wrote:
 Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I want
 to make plugin for netbeans

 http://vinayakkatkar.wordpress.com
 --
 Vinayak Katkar
 Sun Campus Ambassador
 Sun Microsytems,India
 COEP


There is an Eclipse plugin: http://www.alphaworks.ibm.com/tech/mapreducetools

It seems like some work is being done on NetBeans:
https://nbhadoop.dev.java.net/

The world needs more NetBeans love.


Re: Why does Hadoop need ssh access to master and slaves?

2009-01-23 Thread Edward Capriolo
I am looking to create some RA scripts and experiment with starting
Hadoop via the Linux-HA cluster manager. Linux-HA would handle
restarting downed nodes and eliminate the SSH key dependency.


Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-10 Thread Edward Capriolo
Also be careful when you do this. If you are running map/reduce on a
large file, the map and reduce methods will be called many times and
you can end up with a lot of output. Use log4j instead.
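
In practice that just means grabbing a logger via commons-logging
inside the mapper, something like this minimal sketch (class and type
choices are illustrative):

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Goes to the task's syslog file (and respects log levels) instead of stdout.
    LOG.debug("processing record at offset " + key);
    output.collect(value, new LongWritable(1));
  }
}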


Re: File loss at Nebraska

2008-12-09 Thread Edward Capriolo
Also, it might be useful to word hadoop-default.xml more strongly, as
many people might not know a downside exists for using 2 rather than 3
as the replication factor. Before reading this thread I would have
thought 2 to be sufficient.


JDBC input/output format

2008-12-08 Thread Edward Capriolo
Is anyone working on a JDBC RecordReader/InputFormat. I was thinking
this would be very useful for sending data into mappers. Writing data
to a relational database might be more application dependent but still
possible.


Hadoop IP Tables configuration

2008-12-03 Thread Edward Capriolo
All,

I always run iptables on my systems. Most of the Hadoop setup guides I
have found skip iptables/firewall configuration. My namenode and task
tracker are the same node. My current configuration is not working:
when I submit jobs from the namenode, jobs are kicked off on the slave
nodes but they fail. I suspect I do not have the right ports open, but
it might be more complex, such as an RPC issue.

I uploaded my iptables file to my pastebin. This way I do not have to inline it.

http://paste.jointhegrid.com/10

I want to avoid writing a catch-all rule, or shutting iptables down
entirely. I would like to end up with 1-3 configurations for each
Hadoop component. If anyone helps me out, I will contribute this to the
Hadoop wiki.


Re: What do you do with task logs?

2008-11-18 Thread Edward Capriolo
We just set up a log4j server. This takes the logs off the cluster,
plus you get all the benefits of log4j:

http://timarcher.com/?q=node/10


Re: nagios to monitor hadoop datanodes!

2008-10-29 Thread Edward Capriolo
All I have to say is wow! I never tried jconsole before. I have Hadoop
trunk checked out and JMX has all kinds of great information. I am
going to look at how I can get JMX, Cacti, and Hadoop working together.

Just as an FYI, there are separate environment variables for each
daemon now. If you override HADOOP_OPTS you get a port conflict. It
should be like this:

export HADOOP_NAMENODE_OPTS=-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=10001

Thanks Brian.


Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Edward Capriolo
I came up with my line of thinking after reading this article:

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

As a guy who was intrigued by the Java coffee cup in '95 and now lives
as a data center/NOC jock/Unix guy, let's say I look at a log
management process from a data center perspective. I know:

Syslog is a familiar model (human readable: UDP text)
INETD/XINETD is a familiar model (programs that do amazing things with
stdin/stdout)
A variety of hardware and software is in play

I may be supporting an older Solaris 8, Windows, or FreeBSD 5 box, for example.

I want to be able to pipe an Apache custom log to HDFS, or forward
syslog. That is where LHadoop (or something like it) would come into
play.

I am even thinking of accepting raw streams and having the server side
use the source host or a regex to determine which file the data should
go to.

I want to stay light on the client side. An application that tails log
files and transmits new data is another component to develop and
manage. Has anyone had experience with moving this type of data?
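
To give an idea of what I mean by accepting raw streams, the server
side could be as simple as the rough sketch below (no error handling or
threading; the listen port and the host-to-path mapping are made-up
examples):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawStreamReceiver {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the classpath configuration files.
    FileSystem fs = FileSystem.get(new Configuration());
    ServerSocket server = new ServerSocket(5140);   // made-up listen port
    while (true) {
      Socket client = server.accept();
      // Derive the HDFS file from the sender's address; a regex on the
      // first line of input would work the same way.
      String host = client.getInetAddress().getHostName();
      Path target = new Path("/logs/" + host + "/" + System.currentTimeMillis());
      FSDataOutputStream out = fs.create(target);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(client.getInputStream()));
      String line;
      while ((line = in.readLine()) != null) {
        out.write((line + "\n").getBytes());
      }
      out.close();
      client.close();
    }
  }
}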


Re: LHadoop Server simple Hadoop input and output

2008-10-23 Thread Edward Capriolo
I downloaded Thrift and ran the example applications after the Hive
meetup. It is very cool stuff. The thriftfs interface is more elegant
than what I was trying to do, and that implementation is more complete.

Still, someone might be interested in what I did if they want a
super-light API :)

I will link to http://wiki.apache.org/hadoop/HDFS-APIs from my page so
people know the options.


Hive Web-UI

2008-10-10 Thread Edward Capriolo
I was checking out this slide show:
http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/
In the diagram a Web UI exists. This was the first I had heard of it.
Is this part of, or planned to be part of, contrib/hive? I think a web
interface for showing table schemas and executing jobs would be very
interesting. Is anyone working on something like this? If not, I have a
few ideas.


Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
The simple way would be to use NRPE and check_proc. I have never tested
it, but a command like 'ps -ef | grep java | grep NameNode' would be a
fairly decent check. That is not very robust, but it should let you
know if the process is alive.

You could also monitor the web interfaces associated with the
different servers remotely.

check_tcp!hadoop1:56070

Both of the methods I suggested are quick hacks. I am going to
investigate the JMX options as well and work them into Cacti.


Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
That all sounds good. By 'quick hack' I meant that 'check_tcp' was not
good enough, because an open TCP socket does not prove much. However,
if the page returns useful attributes that show the cluster is alive,
that is great and easy.

Come to think of it, you can navigate the dfshealth page and get useful
information from it.


Re: Hadoop and security.

2008-10-06 Thread Edward Capriolo
You bring up some valid points. This would be a great topic for a
white paper. The first line of defense should be to apply inbound and
outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true for
the web access: only a range of source IPs should be allowed to reach
the web interfaces, or that access can be tunneled over SSH.

Preventing exec commands can be handled with the security manager and
the sandbox. I was thinking of only allowing the execution of signed
jars myself, but I never implemented it.
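As a rough illustration, assuming the NameNode RPC port is 9000, the
web UI is on 50070, and the cluster lives on 10.10.10.0/24 (all of
those are assumptions about your layout), the iptables side could look
like:

# allow the cluster subnet, drop everyone else (ports/subnet assumed)
iptables -A INPUT -s 10.10.10.0/24 -p tcp --dport 9000 -j ACCEPT
iptables -A INPUT -s 10.10.10.0/24 -p tcp --dport 50070 -j ACCEPT
iptables -A INPUT -p tcp --dport 9000 -j DROP
iptables -A INPUT -p tcp --dport 50070 -j DROP

Admins who need the web UI from outside can then do something like
ssh -L 50070:namenode:50070 gateway instead of opening the port.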


Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I am doing a lot of testing with Hive; I will be sure to add this
information to the wiki once I get it going.

Thus far I have downloaded the same version of Derby that Hive uses
and verified that the connection is up and running.

ij version 10.4
ij> connect 'jdbc:derby://nyhadoop1:1527/metastore_db;create=true';
ij> show tables;
TABLE_SCHEM |TABLE_NAME|REMARKS

SYS |SYSALIASES|
SYS |SYSCHECKS |
...

vi hive-default.conf
...
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remove metastore server
or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db</value>
  <description>Comma separated list of URIs of metastore servers. The
first server that can be connected to will be used.</description>
</property>
...

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://nyhadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=
javax.jdo.option.ConnectionPassword=

hive> show tables;
08/10/02 15:17:12 INFO hive.metastore: Trying to connect to metastore
with URI jdbc:derby://nyhadoop1:1527/metastore_db
FAILED: Error in semantic analysis: java.lang.NullPointerException
08/10/02 15:17:12 ERROR ql.Driver: FAILED: Error in semantic analysis:
java.lang.NullPointerException

I must have a setting wrong. Any ideas?


Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
<name>hive.metastore.local</name>
<value>true</value>

Why would I set this property to true? My goal is to store the
metadata in an external database. If I set this to true, the metastore
db is created in the working directory.


Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I found the problem once I set the log4j properties to debug:
derbyclient.jar and derbytools.jar do not ship with Hive. As a result,
when you try to load org.apache.derby.jdbc.ClientDriver you get an
InvocationTargetException. The solution was to download Derby and
place those files in hive/lib.

It is working now. Thanks!
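For anyone else who hits this, the fix boiled down to something like
the following (the Derby version and paths are from my setup, so treat
them as assumptions):

# copy the Derby client jars into Hive's lib directory (version/paths assumed)
cp db-derby-10.4.2.0-bin/lib/derbyclient.jar db-derby-10.4.2.0-bin/lib/derbytools.jar $HIVE_HOME/lib/

Just make sure the Derby you download matches the version the
metastore server is running.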


Re: text extraction from html based on uniqueness metric

2008-06-10 Thread Edward Capriolo
I have never tried this method. The concept came from a research paper
I ran into. The goal was to detect the language of a piece of text by
looking at several factors: average word length, average sentence
length, average number of vowels per word, and so on. The author used
these to score an article, and it worked well in determining the
language of the text.

This is a fairly basic program of the kind you might see in Artificial
Intelligence: you compute a score and try to determine whether a block
of text is the kind you are looking for. The answer is not going to be
perfect, and I cannot imagine many out-of-the-box solutions will do
exactly what you need. (Just a guess.) A toy sketch of the scoring
idea is below.
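To make the idea concrete, a throwaway awk sketch for one of those
features (average word length; the filename is a placeholder) might
be:

# crude average-word-length score for a block of text (article.txt is a placeholder)
awk '{ for (i = 1; i <= NF; i++) { chars += length($i); words++ } } END { if (words) print "avg word length:", chars / words }' article.txt

You would combine a handful of scores like this and threshold them,
which is all the paper was really doing.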

The one plus about this is that you can take HTML right out of the
equation. I believe the Java HTML parsers have some quick
'toText'-style method that will dump the text of a web page.

Also, you would think most online newspapers carry a NewsML XML
version or an RSS version of their paper.


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Edward Capriolo
The bash builtin 'wait' and 'sleep' are not what you are looking for;
commands in a script already run one after another. You can use
'nohup' to run the whole batch in the background and have its output
piped to a file. A rough sketch is below.
On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao [EMAIL PROTECTED] wrote:
 I'm interested in the same thing -- is there a recommended way to batch
 Hadoop jobs together?

 On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang [EMAIL PROTECTED]
 wrote:

 Hello folks:
 I am running several hadoop applications on hdfs. To save the efforts in
 issuing the set of commands every time, I am trying to use bash script to
 run the several applications sequentially. To let the job finishes before
 it
 is proceeding to the next job, I am using wait in the script like below.

 sh bin/start-all.sh
 wait
 echo cluster start
 (bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
 test.randomwrite.bytes_per_map=107374182 rand)
 wait
 bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter  -D
 test.randomtextwrite.total_bytes=107374182 rand-text
 bin/stop-all.sh
 echo finished hdfs randomwriter experiment


 However, it always give the error like below. Does anyone have better idea
 on how to run the multiple sequential jobs with bash script?

 HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell

 org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker
 still
 initializing
at
 org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
at
 org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at $Proxy1.getNewJobId(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.getNewJobId(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at
 org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at
 org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at

 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at
 org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at
 org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)




 --
 hustlin, hustlin, everyday I'm hustlin



Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Edward Capriolo
I once asked a wise man in charge of a rather large multi-datacenter
service, "Have you ever considered virtualization?" He replied, "All
the CPUs here are pegged at 100%."

There may be applications for this type of processing. I have thought
about systems like this from time to time, and the thinking goes in
circles: Hadoop is designed for storing and processing across separate
hardware, while virtualization splits one system into sub-systems.

Virtualization is great for proof of concept. For example, I have
deployed this: I installed VMware with two Linux guests on my Windows
host and followed a Hadoop multi-system tutorial running on the two
VMware nodes. I was able to get the word count application working,
and I confirmed that blocks were indeed being stored on both virtual
systems and that processing was being shared via map/reduce.

The processing, however, was slow, and that is the fault of VMware,
which has a very high emulation overhead. Xen has less overhead, and
Linux-VServer and OpenVZ use OS-level virtualization (they have very
little, almost no, overhead). Regardless of how much, overhead is
overhead. Personally, I find that VMware falls short of its promises