Ceph and GlusterFS are NOT centralized file systems. GlusterFS can be
used with Hadoop MapReduce, but it requires a special plug-in, and HDFS 2
can be HA, so it's probably not worth switching. YMMV.
On Dec 31, 2013 4:01 PM, "Jiayu Ji" wrote:
> I am not very familiar with Ceph and GlusterFS, b
Ignorant question: Did this just devolve into a java discussion?
On 12/30/13, unmesha sreeveni wrote:
> but i need to convert it back to object of the same class.
> If i am converting it to string will it be possible?
>
>
> On Mon, Dec 30, 2013 at 11:16 AM, Harsh J wrote:
>
>> If you can store
Maybe I'm just grouchy tonight... it seems all of these questions can be
answered by RTFM.
http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
What's the balance between encouraging learning by new-to-Hadoop users and
OMG!?
On Fri, Dec 27, 2013 at 8:58 PM,
In big data terms, 500G isn't big. But moving that much data around
every night is not trivial either. I'm going to guess at a lot here,
but at a very high level:
1. Sqoop the data required to build the summary tables into Hadoop.
2. Crunch the summaries into new tables (really just files on Ha
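The two steps above might look something like this as a nightly shell sketch; all connection strings, paths, and table names below are hypothetical, for illustration only:

```shell
#!/bin/sh
# Hypothetical nightly batch: pull source tables into HDFS, then build summaries.
# Connection string, credentials, paths, and table names are all made up.

# 1. Sqoop the data required to build the summary tables into Hadoop.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y%m%d) || exit 1

# 2. Crunch the summaries into new tables (really just files on HDFS),
#    e.g. with a Hive query over the freshly imported data.
hive -e "INSERT OVERWRITE TABLE order_summary
         SELECT order_date, SUM(amount) FROM orders GROUP BY order_date"
```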
Hello Hadoopers. I thought I'd share a couple of sqoop bugs we found
recently.
1) If, for some reason, sqoop fails to move a file/directory to its
--target-dir because the file is no longer available, it will issue a
WARNing and not an error. This is very significant in batch operations.
Effec
Unless something has recently changed, Ambari cannot work on an
existing cluster. One of the several reasons we chose to eschew it.
On 12/5/13, Jilal Oussama wrote:
> Hello all,
>
> Pardon me to ask this question here instead of the Ambari mailing list (I
> am not subscribed to it).
>
> I would
LMGTFY:
http://pydoop.sourceforge.net/docs/pydoop_script.html#pydoop-script-guide
On Wed, Sep 18, 2013 at 6:01 PM, jamal sasha wrote:
> Hi,
> How do I implement (say ) in wordcount a combiner functionality if i am
> using python hadoop streaming?
> Thanks
>
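For what it's worth, Hadoop Streaming accepts any executable as a combiner via the -combiner flag; a sketch (the streaming jar path and script names are assumptions):

```shell
# Hypothetical streaming wordcount with a combiner.
# Jar path and script names are assumptions for illustration.
# Since a combiner must be associative and commutative, the wordcount
# reducer script can usually double as the combiner.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/text -output /data/wc-out \
  -mapper mapper.py \
  -combiner reducer.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py
```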
Our evaluation was similar, except we did not consider the "management"
tools any vendor provided, as that's just as much lock-in as any proprietary
tool. What if I want to trade vendors? I'd have to re-tool to use their mgmt?
Nope, wrote our own.
Being in a large enterprise, we went with the "perceiv
The only problem is around the degeneration of the discussion. See
years long threads around vi vs. emacs, Windows vs. Linux, Java vs.
C/Python/Perl/Ruby.
On 9/13/13, Chris Mattmann wrote:
> Errr, what's wrong with discussing these types of issues on list?
>
> Nothing public here, and as long a
Hadoop (HDFS and MapReduce) gets group membership, etc. from the OS. The
only "exception" is that you define the HDFS Superuser Group in the XML.
It still must exist at the OS level, but grants privileges at the Hadoop level.
At least in HDP 1.x
On Wed, Sep 11, 2013 at 9:38 PM, Raj Hadoop wrote:
>
Did you try ganglia forums/lists?
On 9/11/13, orahad bigdata wrote:
> Hi All,
>
> Can somebody help me please?
>
> Thanks
> On 9/11/13, orahad bigdata wrote:
>> Hi All,
>>
>> I'm facing an issue while showing Hadoop metrics in ganglia, Though I
>> have installed ganglia on my master/slaves node
This sounds entirely like an OS-level problem and is slightly outside the
scope of this list. However, I'd suggest you look at your
/etc/nsswitch.conf file and ensure that the hosts: line says
hosts: files dns
This will ensure that names are resolved first by /etc/hosts, then by DNS.
Please al
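A quick way to sanity-check the lookup order described above (assumes getent is available, as on most Linux systems; "namenode1" is a hypothetical hostname):

```shell
# Show the hosts lookup order the resolver will use.
grep '^hosts:' /etc/nsswitch.conf

# Verify a specific node resolves the way you expect.
getent hosts namenode1
```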
I think you just went backwards. More replicas (generally speaking) are
better.
I'd take 60 cheap 1U servers over 20 "highly fault tolerant" ones for
almost every problem. I'd get them for the same or less $, too.
On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati <
kambh...@cse.ohio-stat
As we always say in Technology... it depends!
What country are you in? That makes a difference.
How much buying power do you have? I work for a Fortune 100 Company and we
-- absurdly -- pay about 60% off retail when we buy servers.
Are you buying a bunch at once?
Your best bet is to contact 3 or
Steps to Hadoop 2.x documentation.
1. Realize reality,
2. Smoke 2-3 long joints, depending on tolerance levels
3. Review the code...
4. Allow the THC to take effect and view the code in a new light
5. Understand what the developers have said
6. Code mind beautiful patches to base code
7. crash
8.
My foundation is more Linux than Hadoop, so I'll support Harsh (like he
needs it) in asking, "What's the problem?" If you can't df -h this is
probably a "lower than Hadoop" issue, and while most Hadoop folks are
willing to help (see the fact that Harsh responded) this is 99.9% likely to
be an EXT4
Just for clarity, DNS as a service is NOT required. Name resolution is.
I use /etc/hosts files to identify all nodes in my clusters.
One of the reasons for using names over IPs is ease of use. I would much
rather use a hostname in my XML to identify NN, JT, etc. vs. some random
string of numb
Hey Hadoop smart folks
I have a tendency to seek optimum performance given my understanding, and
that led me to a "brilliant" decision. We settled on EXT4 for our underlying
FS for HDFS. Greedy for speed I thought, let's turn the journal off and
gain the speed benefits. After all, I have 3 co
Ha -- I just decommissioned some nodes today.
Add the nodes you'd like to decommission to the excludes file (search for
its name in hdfs-site.xml), usually dfs.exclude. Log in to your NN
and issue hadoop dfsadmin -refreshNodes
Watch the NN Web interface until the Decommissioning Nodes are complete.
T
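The steps above, as a sketch; the excludes-file path and hostname are assumptions, so use whatever your hdfs-site.xml actually points at:

```shell
# Hypothetical decommission run, executed on the NameNode host.
# /etc/hadoop/conf/dfs.exclude and the hostname are assumed values.
echo "datanode07.example.com" >> /etc/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes

# Watch progress; nodes move from "Decommission In Progress" to "Decommissioned".
hadoop dfsadmin -report
```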
This is not a Hadoop question (IMHO).
2 words: Version Control
Did the advent of Hadoop somehow circumvent all IT convention?
Sorry folks, it's been a rough day.
On 6/12/13, Michael Segel wrote:
> Where was the pig script? On HDFS?
>
> How often does your cluster clean up the trash?
>
> (Dele
Yes, NTPD is your best option.
On 6/4/13, Ben Kim wrote:
> Hi,
> This is very basic & fundamental question.
>
> Is time among all nodes needs to be synced?
>
> I've never even thought of timing in hadoop cluster but recently
> experienced my servers going out of sync with time. I know hbase requi
I'll be chastised and have mean things said about me for this.
Get some experience in IT before you start looking at Hadoop. My reasoning
is this: If you don't know how to develop real applications in a
Non-Hadoop world, you'll struggle a lot to develop with Hadoop.
Asking what "things you need
I'll take a swing at this one.
Low latency data access: I hit the enter key (or submit button) and I
expect results within seconds at most. My database query time should be
sub-second.
High throughput of data: I want to scan millions of rows of data and count
or sum some subset. I expect this
It's not a good idea for anything more than Proof of Concept or Sandbox
clusters.
On Tue, May 14, 2013 at 3:10 AM, Leonid Fedotov wrote:
> No, it is not called "pseudo distributed" mode. It called "as you wish"
> mode...
> It is absolutely normal configuration.
> You can distribute your nodes as
> <3m.mustaq...@gmail.com> wrote:
> > @chris, I have test it outside. It is working fine.
> >
> >
> > On Wed, May 8, 2013 at 7:48 PM, Leonid Fedotov
> wrote:
> > Error in script.
> >
> >
> > On Wed, May 8, 2013 at 7:11 AM, Chris Em
Your script has an error in it. Please test your script using both IP
addresses and names, outside of Hadoop.
On Wed, May 8, 2013 at 10:01 AM, Mohammad Mustaqeem
<3m.mustaq...@gmail.com>wrote:
> I have done this and found following error in log -
>
> 2013-05-08 18:53:45,221 WARN org.apache.hado
Finally, one I can answer. :) That should be in core-site.xml (unless it's
moved from ver 1.x). It needs to be in the configuration for NameNode(s)
and JobTracker (Yarn).
In 1.x you need to restart NN and JT services for the script to take effect.
On Wed, May 8, 2013 at 9:43 AM, Mohammad Musta
Glad you got this working... can you explain your use case a little? I'm
trying to understand why you might want to do that.
On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao wrote:
> I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
> Everything looks fine now.
>
> Seems d
I assume you're talking about the I/O scheduler. Based on normal advice,
only change this if you have a "smart" device between the OS and the
drives. A SATA controller usually qualifies. I have our DataNodes set to
NOOP to reduce the number of layers.
As always your mileage may vary and you should
Make sure you have the topology script available on the JobTracker server
as well. This also requires a jobtracker stop/start to take effect.
Also, make sure $HADOOP_CONF resolves properly as the mapred user.
On Tue, Mar 26, 2013 at 1:19 AM, preethi ganeshan <
preethiganesha...@gmail.com> wrote:
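For reference, a minimal topology script sketch along the lines discussed above, written here as a shell function: Hadoop invokes the script with one or more IPs/hostnames and expects one rack path per argument on stdout. The subnet-to-rack mappings are hypothetical.

```shell
# Minimal rack-awareness topology sketch (hypothetical rack mappings).
# Emits one rack path per argument, defaulting to /default-rack.
rack_for() {
  for node in "$@"; do
    case "$node" in
      10.0.1.*) echo "/rack1" ;;
      10.0.2.*) echo "/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done
}

rack_for 10.0.1.15 10.0.2.7 somehost
# prints:
# /rack1
# /rack2
# /default-rack
```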
Hey John,
Make sure your /etc/hosts ( or DNS) is up to date and any topology scripts
are updated. Unfortunately, NN is pretty dumb about IP's vs. Hostnames.
BTW, NN devs. Seriously? You rely on IP addr instead of hostname?
Someone should probably be shot or at least be responsible for fixing
Aww.. You could've used lmgtfy.com :)
On Tue, Mar 12, 2013 at 4:57 PM, varun kumar wrote:
> http://hadoopblogfromvarun.wordpress.com/
>
>
> On Wed, Mar 13, 2013 at 2:16 AM, Mohit Anchlia wrote:
>
>> Is it possible to set replication factor to a different value than the
>> default at the directo
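For the record, replication for existing files under a path can be changed with hadoop fs -setrep (the path below is hypothetical); note this does not change the default used for files written later:

```shell
# Set replication to 2 for everything under a directory, recursively.
# /data/scratch is a hypothetical path.
hadoop fs -setrep -R 2 /data/scratch
```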
Jokingly I want to say the problem is that you selected Ubuntu (or any
other Debian based Linux) as your platform.
On a more serious note, if you are new to both Linux and Hadoop, you might
be much better off to select CentOS for your Linux as that is the base
development platform for most contrib
I've checked all of the documentation, books, and Google searches I can
think of.
I have a working topology script. I have dynamic IPs. I have an
automated process to update the rack data when a datanode changes IP.
What I don't have is any clue as to when the NN reads this script. If I
exe
> Paul
>
>
> On 18 Feb 2013, at 18:09, Chris Embree wrote:
>
> I'm doing that currently. No problems to report so far.
>
> The only pitfall I've found is around NFS stability. If your NAS is 100%
> solid no problems. I've seen mtab get messed up and re
I'm doing that currently. No problems to report so far.
The only pitfall I've found is around NFS stability. If your NAS is 100%
solid, no problems. I've seen mtab get messed up and refuse to remount if
NFS has any hiccups.
If you want to get really crazy, consider NFS for your datanode root fs. S
Check your HDFS config file for the groupname you used as HDFS supergroup.
We used hdfs as the group name in our case.
Then just groupadd hdfs (see man groupadd for additional options)
Then, when you create users, add them to that group.
useradd -G hdfs newuser
This is more Linux admin than Ha
You need to configure ssh to use your .pem files; by default it looks for
id_dsa or id_rsa files. Look at man ssh_config.
On Wed, Feb 13, 2013 at 6:46 AM, Pedro Sá da Costa wrote:
> I'm trying to configure ssh for the Hadoop mapreduce, but my nodes only
> communicate with each others using RSA keys in pem
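A minimal ~/.ssh/config fragment along those lines (the host pattern, user, and key path are assumptions):

```
# ~/.ssh/config -- point ssh at a .pem identity for the cluster nodes.
# Host pattern, user, and key path are hypothetical.
Host node*.cluster.example.com
    User hadoop
    IdentityFile ~/.ssh/cluster-key.pem
    IdentitiesOnly yes
```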
Interesting question. You'd probably need to benchmark to prove it out.
I'm not sure of the exact details of how HDFS stripes data, but it should
compare pretty well to hardware RAID.
Conceptually, HDFS should be able to out perform a RAID solution, since
HDFS "knows" more about the data being written.
Just to maintain some balance on the list, Hortonworks has similar training
videos and a sandbox appliance.
http://hortonworks.com/community/
Enjoy.
On Sat, Feb 2, 2013 at 10:02 AM, YouPeng Yang wrote:
> Hi akram khalil
>
>if you want to take some courses .Recommend you to take the Cloudera
You should probably think about this in a more clustered fashion. A single
node with a PB of data is probably not a good CPU-to-disk
ratio. In addition, you need enough RAM on your NameNode to keep track of
all of your blocks. A few nodes with a PB each would quickly drive up NN
RAM
What type of FS are you using under HDFS? XFS, ext3, ext4? The type and
configuration of the underlying FS will impact performance.
Most notably, ext3 has a lock-up effect when flushing disk cache.
On Thu, Jan 24, 2013 at 2:54 AM, Xibin Liu wrote:
> Thanks, http://search-hadoop.com/m/LLBgUiH0
Hi List,
This should be a simple question, I think. Disclosure, I am not a java
developer. ;)
We're getting ready to build our Dev and Prod clusters. I'm pretty
comfortable with HDFS and how it sits atop several local file systems on
multiple servers. I'm fairly comfortable with the concept of
Can you instead copy input1 and input2 together?
Or process both files on the second pass?
Otherwise, you'll have to read in output file, load the values and start
your map/red job.
Probably someone else will have a better answer. :)
On Wed, Jan 16, 2013 at 9:07 PM, jamal sasha wrote:
> Hi,
Good point Harsh. As a Linux Admin, I prefer the behavior of 2.x. It
allows me to see if I've made a mistake in my planned placement of files
instead of blindly writing.
On Wed, Jan 16, 2013 at 12:05 PM, Harsh J wrote:
> On 1.x, -put does mkdir the parent directories if they are non existent
>
Ha, you joke, but we're planning on running with no local OS. If it works
as planned I'll post a nice summary of our approach. :)
On Wed, Jan 16, 2013 at 2:53 AM, Harsh J wrote:
> Wipe your OS out.
>
> Please read: http://search-hadoop.com/m/9Qwi9UgMOe
>
>
> On Wed, Jan 16, 2013 at 1:16 PM, V
Harsh,
Is that a change from 1.0 code? Hortonworks explains it a little
differently.
Thanks for the details and pointer to the code.
Chris (another one)
On Dec 18, 2012 5:14 PM, "Harsh J" wrote:
> Hi,
>
> The received write packet is directly socket-written to the next
> node's receiver (asyn
Hi Jay,
We'll need "real developer" expertise on this, but my understanding of the
documentation is:
Client talks to Name node to get Node/Block assignments, client then talks
to node 1: write, fwd, node 2: write, fwd, node 3: write, ack node 2, node
2: ack node1, node 1: ack Client and Name Nod
I think the rule of thumb (Hortonworks at least) is 2x cores for map
tasks and 1x cores for reducers. I don't have my notes here so I'm not
100% sure. It's just a guideline in any event. :)
TEST, TEST, TEST. :)
On Tue, Dec 18, 2012 at 1:08 AM, wrote:
> Hello,
>
> I was unable to find any informa
Just to be a picker of nits... this topic is more concisely Hadoop
Development 101. I only mention this because I am a newbie hadoop admin
and this was over my head. ;) Admins don't worry as much about Key Value
Pairs and parsing as we do about where is the script that starts the
NameNode. ;)
O
Hi Mohammed,
The amount of RAM on the NN is related to the number of blocks... so let's
do some math. :) 1 GB of RAM to 1M blocks seems to be the general rule.
I'll probably mess this up so someone check my math:
9 PB ~ 9,216 TB ~ 9,437,184 GB of data. Let's put that in 128MB blocks:
according
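Checking that math with shell arithmetic, using the rough rule above of 1 GB of NameNode heap per 1M blocks:

```shell
# 9 PB of data in 128 MB blocks, ~1 GB of NameNode heap per 1M blocks.
PB=9
BLOCK_MB=128
DATA_MB=$((PB * 1024 * 1024 * 1024))   # 9 PB expressed in MB
BLOCKS=$((DATA_MB / BLOCK_MB))         # number of 128 MB blocks
NN_HEAP_GB=$((BLOCKS / 1000000))       # rough NameNode heap need in GB
echo "$BLOCKS blocks -> ~${NN_HEAP_GB} GB NameNode heap"
```

So roughly 75 million blocks, or on the order of 75 GB of NameNode heap before you account for replication metadata and overhead.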