Re: Difference between clustering and classification in hadoop

2013-11-22 Thread Devin Suiter RDX
They are both for machine learning. Classification is known as "supervised learning", where you feed the engine data with known patterns and tell it which nodes are key. Clustering is "unsupervised learning", where you allow the algorithm to "guess" at what is significant in the correlations pic
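To make the supervised/unsupervised distinction above concrete, here is a minimal sketch in plain Python (not Hadoop or Mahout code). The data, the labels, and the helper names `train_centroids`, `classify`, `assign`, and `kmeans_1d` are all made up for illustration; the classifier is a simple nearest-centroid rule and the clustering is a toy 1-D k-means.

```python
# Toy 1-D data; values and labels are invented for illustration only.
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (9.0, "large"), (9.5, "large"), (10.1, "large")]


def train_centroids(samples):
    """Supervised step: average the points of each *known* label."""
    sums = {}
    for value, label in samples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def classify(centroids, value):
    """Return the label whose centroid is nearest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for value in values:
        nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - value))
        groups[nearest].append(value)
    return groups


def kmeans_1d(values, k=2, iterations=10):
    """Unsupervised step: group *unlabeled* values around k centers."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted values (deterministic).
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centers)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return assign(values, centers)


centroids = train_centroids(labeled)
print(classify(centroids, 2.0))               # label learned from known examples
print(kmeans_1d([v for v, _ in labeled]))     # groups found without any labels
```

The classifier can only answer with labels it was given; the clustering step never sees a label and just reports which values sit together.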

Re: Estimating the time of my hadoop jobs

2013-12-17 Thread Devin Suiter RDX
Nikhil, One of the problems you run into with Hadoop in virtual machine environments is poor performance when all the VMs are running on the same physical host. With a VM, even though you are giving each one 4 GB of RAM, and a virtual CPU and disk, if the virtual machines are sharing physical component

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

2013-12-19 Thread Devin Suiter RDX
Hello, In my experience with Flume, watching the HDFS Sink verbose output, I know that a file that has been flushed but is still open reads as a 0-byte file, even if it actually contains data. An HDFS "file" is a meta-location that can accept streaming input for as long as

Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem

2013-12-20 Thread Devin Suiter RDX
I think most of your problem is coming from the options you are setting: "hadoop jar /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount *-fs local -jt local* /hduser/mount_point/ /results" You appear to be directing your namenode to run jobs in the *LOCAL* job ru

Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem

2013-12-20 Thread Devin Suiter RDX
Yes, there will be the issue of bottlenecking too. There are lots of newer distributed filesystem formats that work well with Hadoop, if you don't want to do HDFS. If you are using a traditional filesystem, you aren't getting any parallel work done - there's only one file to work on, in one piece.

Re: Unable to access the link

2013-12-30 Thread Devin Suiter RDX
The (509) error is telling you what the problem is: "This account's public links are generating too much traffic and have been temporarily disabled!" Which seems to mean, since it is Dropbox, there has been too much traffic directed towards the file or to other public links owned by the hoster's

MapReduce MIME Input type?

2013-12-30 Thread Devin Suiter RDX
Hi, I am trying to puzzle this out, and am hoping for some insight - I have an IMAP inbox dump that I am analyzing - I need to track how many times a given item is referred to in the inbox, i.e. how many emails came in about that thing and over what time. I can load it into MapReduce as TextInputF

Re: MapReduce MIME Input type?

2014-01-02 Thread Devin Suiter RDX
> Are you perhaps looking for http://james.apache.org/mime4j/? You may have > to adapt it for MR but I don't imagine that would be too difficult to do. > > On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX wrote: > >> Hi, >> >> I am trying to puzzle this out, a

Re: Hadoop permissions issue

2014-01-06 Thread Devin Suiter RDX
Based on the Exception type, it looks like something in your job is looking for a valid value, and not finding it. You will probably need to share the job code for people to help with this - to my eyes, this doesn't appear to be a Hadoop configuration issue, or any kind of problem with how the sys

Re: Newbie: How to set up HDFS file system

2014-01-07 Thread Devin Suiter RDX
Installing Hadoop will install HDFS, and you will need to declare storage directories on the host nodes, etc. There is also the question of what setup you want to use; there is what is called "pseudo-distributed" mode, where all the Hadoop daemons are running on one host. Are you a student looking

Re: manipulating key in combine phase

2014-01-13 Thread Devin Suiter RDX
Amit, Have you explored the ChainMapper class? *Devin Suiter* Jr. Data Solutions Software Engineer 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212 Google Voice: 412-256-8556 | www.rdx.com On Sun, Jan 12, 2014 at 7:28 PM, John Lilley wrote: > Isn’t this is what you’d normally do in the Mapp

Re: manipulating key in combine phase

2014-01-13 Thread Devin Suiter RDX
On Mon, Jan 13, 2014 at 12:39 PM, Amit Sela wrote: > More than a solution, I'd like to know if a combiner is allowed to change > the key ? will it interfere with the mappers sort/merge ? > > > On Mon, Jan 13, 2014 at 3:06 PM, Devin Suiter RDX wrote: > >>

Federated Namespaces - VM

2014-01-14 Thread Devin Suiter RDX
Hi, I just want to throw out a discussion topic on federation. Reading *The Definitive Guide* on HDFS, it sounds like when federating, every distinct namespace needs a distinct namenode machine instance. This means if a company wanted three namespaces, say retail, commercial, government, they wo

Re: A hadoop command to determine the replication factor of a hdfs file ?

2014-02-08 Thread Devin Suiter RDX
Also Raj - if you're using pseudo-distributed mode, the replication factor will be 1. This is part of pseudo-distributed configuration. So if you're working on a Cloudera preconfigured machine image, for example, it's not wrong that 'hadoop fs -ls' shows all files as replicated 1 time. *Devin Sui
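As a rough illustration of reading that value: in the usual 'hadoop fs -ls' listing, the second column is the replication factor for files, and directories show '-' there. The sketch below is plain Python that parses one such line; the sample lines, paths, and the helper name `replication_factor` are made up, and the column layout is an assumption to check against your own Hadoop version's output.

```python
def replication_factor(ls_line):
    """Return the replication factor from one 'hadoop fs -ls' line, or None.

    Assumes the common layout:
    permissions, replication, owner, group, size, date, time, path
    """
    fields = ls_line.split()
    if len(fields) < 8:
        return None                      # not a file/directory entry line
    if fields[0].startswith("d") or fields[1] == "-":
        return None                      # directories have no replication factor
    return int(fields[1])


# Hypothetical sample lines, as a pseudo-distributed setup might print them.
file_line = "-rw-r--r--   1 hduser supergroup       1366 2014-02-08 10:00 /user/hduser/file.txt"
dir_line = "drwxr-xr-x   - hduser supergroup          0 2014-02-08 10:00 /user/hduser/dir"

print(replication_factor(file_line))
print(replication_factor(dir_line))
```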

Re: How to keep data consistency?

2014-02-19 Thread Devin Suiter RDX
Edward, It doesn't seem like your "hadoop -put ..." command will even complete - the master isn't receiving the file at any point. The master instructs node1 to connect to the client, after asking node1 whether it is in a state where it can receive data to be written, which depends on several other daem

Re: Questions from a newbie to Hadoop

2014-02-21 Thread Devin Suiter RDX
You should also clarify for the group: Do you want to make a virtual machine to run a pseudo-distributed Hadoop cluster on? Or do you want to install Hadoop directly onto the Vista machine and run it there? If the former, you should be able to set up a VM just fine with a Linux version of your

Re: Questions from a newbie to Hadoop

2014-02-22 Thread Devin Suiter RDX
bit > > still looking for a 32 bit version of Hortonworks sand box 2.0 > > Hortonworks seems to be very stable and good so far, very easy to set up > > what does this mean; set up P-D mode on it? > > :O > > thanks for the reply > > *From:* Devin Suiter RDX

Re: Performance

2014-02-25 Thread Devin Suiter RDX
http://sortbenchmark.org/ Doesn't just cover Hadoop, but maybe the methodology will give you an idea of what you're looking for. There are too many variables to pin down a "general" average. Every job will run differently on every cluster, given the machines can be heterogeneous builds, with heterog

Re: Logic of isSplittable() of class FileInputFormat

2014-02-26 Thread Devin Suiter RDX
Or, as another example, I'm writing a program to analyze a large email dump. The emails are more than one line. TextInputFormat will split them up by line, in addition to deserializing them to text. I'm going to need to customize RecordReader to split based on the MIME metadata length of the emails
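To show why line-oriented splitting is a poor fit for multi-line emails, here is a small plain-Python sketch (not a Hadoop RecordReader). It swaps in a simpler technique than the length-based one mentioned above: it starts a new record at every line beginning with an mbox-style "From " separator. The separator, the sample dump, and the function names are assumptions made for illustration.

```python
def line_records(text):
    """Line-oriented reading: every line becomes its own record."""
    return text.splitlines()


def message_records(text, separator="From "):
    """Message-oriented reading: start a new record at each separator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(separator) and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


# Hypothetical two-message dump.
dump = (
    "From alice@example.com Mon Dec 30 2013\n"
    "Subject: item 42\n"
    "\n"
    "First body line\n"
    "From bob@example.com Tue Dec 31 2013\n"
    "Subject: item 42 again\n"
    "\n"
    "Second body line\n"
)

print(len(line_records(dump)))      # one record per line
print(len(message_records(dump)))   # one record per email
```

With line records, a mapper sees a body line with no idea which message it belongs to; with message records, each map call gets a whole email. In real Hadoop code the same idea would also need the file to be treated as non-splittable or the reader to resynchronize at split boundaries.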

Hadoop FileCrush

2014-02-27 Thread Devin Suiter RDX
Hi, Has anyone used Hadoop Filecrush? http://www.jointhegrid.com/hadoop_filecrush/ I was just curious about the reliability and integrity of it. It seems like a nice concept. But, if it is a nice concept and trustworthy, it should be looked at for incorporating under BigTop, one would think. Th

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Devin Suiter RDX
If you only want one file, then you need to set the number of reducers to 1. If the size of the data makes a single reducer impractical for the original MR job, you can run a second job on the output of the first, with the default mapper and reducer, which are the Identity- ones, and set that numRe
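The two-job idea can be sketched with a tiny in-memory stand-in for MapReduce in plain Python. Nothing here is Hadoop API: `run_job`, `sum_reduce`, `identity_reduce`, and the byte-sum partitioner are hypothetical helpers, and each inner list stands in for one reducer's output file.

```python
from collections import defaultdict


def run_job(records, num_reducers, reduce_fn):
    """Partition (key, value) pairs, reduce each partition, return one list per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        # Deterministic toy partitioner for string keys (an assumption).
        partitions[sum(key.encode()) % num_reducers][key].append(value)
    outputs = []
    for part in partitions:
        out = []
        for key in sorted(part):          # reducers see keys in sorted order
            out.extend(reduce_fn(key, part[key]))
        outputs.append(out)
    return outputs


def sum_reduce(key, values):
    """The 'real' work of the first job: add up the values per key."""
    yield key, sum(values)


def identity_reduce(key, values):
    """Pass every value through unchanged."""
    for value in values:
        yield key, value


records = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# First job: several reducers, so several output "files".
first = run_job(records, 3, sum_reduce)

# Second job: identity pass over the first job's output with one reducer.
merged_input = [pair for part in first for pair in part]
second = run_job(merged_input, 1, identity_reduce)

print(len(first), len(second))
print(second[0])
```

The first job keeps its parallelism; only the cheap identity pass is funneled through a single reducer, which is what yields the single output.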

Re: unsubsrcibe

2014-03-12 Thread Devin Suiter RDX
You are: 1) Not unsubscribing correctly. From the welcome email you got when you subscribed - 'To remove your address from the list, send a message to: ' 2) Spelling 'unsubscribe' incorrectly. When you send the 'unsubscribe' request to the address above, spell it correctl