Re: log

2013-04-18 Thread Nitin Pawar
For this particular file you have set the replication factor to 10, but only one replica is available. Did your datanodes go down? Did you retire datanodes? Or is your number of datanodes less than the replication factor? On Apr 19, 2013 12:09 PM, "Mohit Vadhera" wrote: > Can anybody let me
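A minimal sketch of lowering that file's replication target so fsck stops flagging it, assuming the org.apache.hadoop.fs.FileSystem Java API (the shell equivalent would be "hadoop fs -setrep"); the path argument and target value below are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask the namenode to track only one replica for the reported file,
    // e.g. the .staging/job.split path from the fsck output.
    fs.setReplication(new Path(args[0]), (short) 1);
  }
}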

Re: Uploading file to HDFS

2013-04-18 Thread Harsh J
Can you not simply do a fs -put from the location where the 2 TB file currently resides? HDFS should be able to consume it just fine, as the client chunks them into fixed size blocks. On Fri, Apr 19, 2013 at 10:05 AM, 超级塞亚人 wrote: > I have a problem. Our cluster has 32 nodes. Each disk is 1TB. I
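A small sketch of the programmatic equivalent of fs -put, assuming the org.apache.hadoop.fs.FileSystem Java API; the paths are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutBigFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The client streams the local file and HDFS scatters its fixed-size
    // blocks (and their replicas) across the datanodes, so no single 1 TB
    // disk ever has to hold the whole 2 TB file.
    fs.copyFromLocalFile(new Path("/local/big-2tb.file"),
                         new Path("/user/data/big-2tb.file"));
  }
}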

log

2013-04-18 Thread Mohit Vadhera
Can anybody let me know the meaning of the log below, please: " Target Replicas is 10 but found 1 replica(s)." ? /var/lib/hadoop-hdfs/cache/mapred/mapred/staging/test_user/.staging/job_201302180313_0623/job.split: Under replicated BP-2091347308-172.20.3.119-1356632249303:blk_6297333561560198850_70720.

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Roman Shaposhnik
On Thu, Apr 18, 2013 at 9:23 PM, Mark Kerzner wrote: > Hi, > > my clusters are on EC2, and they disappear after the cluster's instances are > destroyed. What is the best practice to collect the logs for later storage? > > EC2 does exactly that with their EMR, how do they do it? Apache Flume could

Uploading file to HDFS

2013-04-18 Thread 超级塞亚人
I have a problem. Our cluster has 32 nodes. Each disk is 1TB. I want to upload a 2TB file to HDFS. How can I put the file on the namenode and upload it to HDFS?

jobtracker is stopping because of permissions

2013-04-18 Thread Mohit Vadhera
Can anybody help me start the jobtracker service? It is urgent. It looks like a permission issue. What permissions should I give on which directory? I am pasting the log for the same. The service starts and stops. 2013-04-19 02:21:06,388 FATAL org.apache.hadoop.mapred.JobTracker: org.apache.hadoop.security.AccessCont

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Marcos Luis Ortiz Valmaseda
Actually the problem is not simple. Based on these problems, there are three companies working on them: - Loggly: http://loggly.com/ Loggly is a part of the Amazon Marketplace: https://aws.amazon.com/solution-providers/isv/loggly - Papertrail: https://papertrailapp.com/ How to do it: ht

Re: setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread 周帅锋
I set bandwidthPerSec = 104857600, but when I add a new data node, and run hadoop balancer, the bandwidth is only 1MB/s, and the datanode log shows that: org.apache.hadoop.hdfs.server. datanode.DataNode: Balancing bandwith is 1048576 bytes/s My hadoop core version is 1.0.3 Thanks 2013/4/19 Than

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Mark Kerzner
So you are saying the problem is very simple: just before you destroy the cluster, simply collect the logs to S3. Anyway, I only need them after I have completed a specific computation, so I don't have any special requirements. In regular permanent clusters, is there something that allows yo

Re: Best way to collect Hadoop logs across cluster

2013-04-18 Thread Marcos Luis Ortiz Valmaseda
When you destroy an EC2 instance, the correct behavior is to erase all data. Why don't you create a service to collect the logs directly to an S3 bucket in real time or in batches of 5 minutes? 2013/4/18 Mark Kerzner > Hi, > > my clusters are on EC2, and they disappear after the cluster's instances

Best way to collect Hadoop logs across cluster

2013-04-18 Thread Mark Kerzner
Hi, my clusters are on EC2, and they disappear after the cluster's instances are destroyed. What is the best practice to collect the logs for later storage? EC2 does exactly that with their EMR, how do they do it? Thank you, Mark

Re: Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Not really, federation provides separate namespaces, but I want it to look like one namespace. My basic idea is to maintain a map from files to namenodes; it receives RPC calls from the client and forwards them to the specific namenode in charge of the file. It's challenging for me but I'll figure out whe

Re: Run multiple HDFS instances

2013-04-18 Thread Hemanth Yamijala
Are you trying to implement something like namespace federation, which is part of Hadoop 2.0 - http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao wrote: > Actually I'm trying to do something like combining mult

Re: How to configure mapreduce archive size?

2013-04-18 Thread Hemanth Yamijala
Well, since the DistributedCache is used by the tasktracker, you need to update the log4j configuration file used by the tasktracker daemon. And you need to get the tasktracker log file - from the machine where you see the distributed cache problem. On Fri, Apr 19, 2013 at 6:27 AM, wrote: > Hi

Re: why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
Thanks guys for the update! Yeah, I read in the thread that the Checkpoint/BackupNode may get deprecated. The SNN is the way to go then. I just wonder whether, if we use multiple CheckpointNodes, we might run into the situation where a checkpoint is on-going but the first CheckpointNode is slow, then the secon

Re: How to process only input files containing 100% valid rows

2013-04-18 Thread Steve Lewis
With files that small it is much better to write a custom input format which checks the entire file and only passes records from good files. If you need Hadoop you are probably processing a large number of these files and an input format could easily read the entire file and handle it if it as as s
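A rough sketch of that idea, assuming the new org.apache.hadoop.mapreduce API and a hypothetical per-row isValid() check; it keeps each file in one split, reads it fully up front, and emits nothing if any row is invalid:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ValidFileInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // one split per file, so a single reader sees every row
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new ValidatingRecordReader();
  }

  private static class ValidatingRecordReader extends RecordReader<LongWritable, Text> {
    private final List<Text> rows = new ArrayList<Text>();
    private int pos = -1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      // Read the whole (small) file up front; drop all rows if any row is invalid.
      LineRecordReader reader = new LineRecordReader();
      reader.initialize(split, context);
      boolean allValid = true;
      while (reader.nextKeyValue()) {
        Text line = new Text(reader.getCurrentValue());
        if (!isValid(line)) {   // hypothetical per-row validation
          allValid = false;
          break;
        }
        rows.add(line);
      }
      reader.close();
      if (!allValid) {
        rows.clear();           // the mapper never sees a record from this file
      }
    }

    private boolean isValid(Text line) {
      return line.getLength() > 0;   // placeholder rule; substitute real checks
    }

    @Override public boolean nextKeyValue() { pos++; return pos < rows.size(); }
    @Override public LongWritable getCurrentKey() { return new LongWritable(pos); }
    @Override public Text getCurrentValue() { return rows.get(pos); }
    @Override public float getProgress() { return rows.isEmpty() ? 1.0f : (float) pos / rows.size(); }
    @Override public void close() { }
  }
}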

Re: why multiple checkpoint nodes?

2013-04-18 Thread Mohammad Tariq
Hello Thanh, Just to keep you updated, the checkpoint node might get deprecated. So, it's always better to use the secondary namenode. More on this can be found here: https://issues.apache.org/jira/browse/HDFS-2397 https://issues.apache.org/jira/browse/HDFS-4114 Warm Regards, Tariq https://mtar

Re: why multiple checkpoint nodes?

2013-04-18 Thread Bertrand Dechoux
For more information: https://issues.apache.org/jira/browse/HADOOP-7297 It has been corrected, but the stable documentation is still 1.0.4 (prior to the correction). See * http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html * http://hadoop.apache.org/docs/r1.1.1/hdfs_user_guide.html * ht

Re: Problem: org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out

2013-04-18 Thread Som Satpathy
Never mind, got it fixed. Thanks, Som On Tue, Apr 16, 2013 at 6:18 PM, Som Satpathy wrote: > Hi All, > > I have just set up a CDH cluster on EC2 using cloudera manager 4.5. I have > been trying to run a couple of mapreduce jobs as part of an oozie workflow > but have been blocked by the followi

Re: why multiple checkpoint nodes?

2013-04-18 Thread Bertrand Dechoux
It would be important to point to the document (which I believe is http://hadoop.apache.org/docs/stable/hdfs_user_guide.html) and the version of Hadoop you are interested in. At one time, the documentation was misleading. The 1.x version didn't have checkpoint/backup nodes, only the secondary namenode.

How to process only input files containing 100% valid rows

2013-04-18 Thread Matthias Scherer
Hi all, In my mapreduce job, I would like to process only whole input files containing only valid rows. If one map task processing an input split of a file detects an invalid row, the whole file should be "marked" as invalid and not processed at all. This input file will then be cleansed by ano

Re: Cartesian product in hadoop

2013-04-18 Thread Ted Dunning
It is rarely practical to do exhaustive comparisons on datasets of this size. The method used is to heuristically prune the cartesian product set and only examine pairs that have a high likelihood of being near. This can be done in many ways. Your suggestion of doing a map-side join is a reasona

Re: why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
so reliability (to prevent metadata loss) is the main motivation for multiple checkpoint nodes? Does anybody use multiple checkpoint nodes in real life? Thanks On Thu, Apr 18, 2013 at 12:07 PM, shashwat shriparv < dwivedishash...@gmail.com> wrote: > more checkpoint nodes means more backup of t

Re: why multiple checkpoint nodes?

2013-04-18 Thread shashwat shriparv
more checkpoint nodes means more backup of the metadata :) *Thanks & Regards* ∞ Shashwat Shriparv On Thu, Apr 18, 2013 at 9:35 PM, Thanh Do wrote: > Hi all, > > The document says "Multiple checkpoint nodes may be specified in the > cluster configuration file". > > Can some one clarify me

Re: Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Actually I'm trying to do something like combining multiple namenodes so that they present themselves to clients as a single namespace, implementing basic namenode functionalities. On Thursday, April 18, 2013, Chris Embree wrote: > Glad you got this working... can you explain your use case a little? I'm > tryi

Re: setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread Thanh Do
What do you mean by "doesn't work"? On Thu, Apr 18, 2013 at 10:01 AM, zhoushuaifeng wrote: > ** > Hi, > I set the hdfs balance bandwidth from 1048576 to 104857600, but it doesn't > work, what's wrong? > Does anyone encounter the same problem? > Thanks a lot. > > > dfs.balance.bandwidthPerSec

why multiple checkpoint nodes?

2013-04-18 Thread Thanh Do
Hi all, The document says "Multiple checkpoint nodes may be specified in the cluster configuration file". Can someone clarify why we really need to run multiple checkpoint nodes anyway? Is it possible that while checkpoint node A is doing a checkpoint, checkpoint node B kicks in and d

Re: Run multiple HDFS instances

2013-04-18 Thread Chris Embree
Glad you got this working... can you explain your use case a little? I'm trying to understand why you might want to do that. On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao wrote: > I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! > Everything looks fine now. > > Seems d

Re: will an application with two maps but no reduce be suitable for hadoop?

2013-04-18 Thread Roman Shaposhnik
On Thu, Apr 18, 2013 at 4:49 AM, Hadoop Explorer wrote: > I have an application that evaluate a graph using this algorithm: > > - use a parallel for loop to evaluate all nodes in a graph (to evaluate a > node, an image is read, and then result of this node is calculated) > > - use a second paralle

Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! Everything looks fine now. Seems the direct command "hdfs namenode" gives a better sense of control :) Thanks a lot. On Thursday, April 18, 2013, Harsh J wrote: > Yes you can but if you want the scripts to work, you should have them > use a

Re: Cartesian product in hadoop

2013-04-18 Thread Ajay Srivastava
The approach which I proposed will have m+n I/O for reading the datasets, not (m + n + m*n), but the further I/O due to spills and the reducer reading the mapper output will be higher, as the number of tuples coming out of the mapper is (m + m*n). Regards, Ajay Srivastava On 18-Apr-2013, at 5:40 PM, zheyi

setting hdfs balancer bandwidth doesn't work

2013-04-18 Thread zhoushuaifeng
Hi, I set the hdfs balance bandwidth from 1048576 to 104857600, but it doesn't work, what's wrong? Has anyone encountered the same problem? Thanks a lot. dfs.balance.bandwidthPerSec 104857600 Specifies the maximum amount of bandwidth that each datanode can utilize for the b

Re: Physically moving HDFS cluster to new

2013-04-18 Thread MARCOS MEDRADO RUBINELLI
Here's a rough guideline: Moving a cluster isn't all that different from upgrading it. The initial steps are the same: - stop your mapreduce services - switch your namenode to safe mode - generate a final image with -saveNamespace - stop your hdfs services - back up your metadata - as long as you

Re: Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Thank you, now I get your point. But I wonder whether this approach would be slower than implementing a custom InputFormat which, each time, provides a pair of lines to the mappers, then doing the product in the mappers, since your approach would need (m + n + m*n) I/O on the mapper side, and (2*m*n) I/O in

will an application with two maps but no reduce be suitable for hadoop?

2013-04-18 Thread Hadoop Explorer
I have an application that evaluates a graph using this algorithm: - use a parallel for loop to evaluate all nodes in a graph (to evaluate a node, an image is read, and then the result of this node is calculated) - use a second parallel for loop to evaluate all edges in the graph. The function woul

Re: Cartesian product in hadoop

2013-04-18 Thread Ajay Srivastava
Yes, that's a crucial part. Write a class which extends WritableComparator and override the compare method. You need to set this class in the job client as job.setGroupingComparatorClass(grouping comparator class). This will make sure that records having the same Ki are grouped together and will go t
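A minimal sketch of such a grouping comparator, assuming the map output keys are Text of the form "Ki#datasetId"; the separator and key layout are illustrative only:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstPartGroupingComparator extends WritableComparator {

  public FirstPartGroupingComparator() {
    super(Text.class, true);   // true => deserialize keys so compare() gets real objects
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Group only on the part of the key before the separator, i.e. Ki.
    String left  = ((Text) a).toString().split("#", 2)[0];
    String right = ((Text) b).toString().split("#", 2)[0];
    return left.compareTo(right);
  }
}

In the driver this would then be registered with job.setGroupingComparatorClass(FirstPartGroupingComparator.class).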

Re: Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Hi Ajay Srivastava, Thank you for your reply. Could you please explain a little bit more about "Write a grouping comparator which groups records on the first part of the key, i.e. Ki"? I guess it is a crucial part, which could filter some pairs before passing them to the reducer. Regards, Zheyi Rong O

Configuration clone constructor not cloning classloader

2013-04-18 Thread Amit Sela
Hi all, I was wondering if there is a good reason why the public Configuration(Configuration other) constructor in Hadoop 1.0.4 doesn't clone the classloader of "other" into the new Configuration? Is this a bug? I'm asking because I'm trying to run a Hadoop client in an OSGI environment and I need to pa
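A small workaround sketch (an assumption about how to cope, not a statement of intended behavior): copy the Configuration and then carry the classloader over explicitly with the standard getClassLoader()/setClassLoader() accessors:

import org.apache.hadoop.conf.Configuration;

public class ConfCopy {
  public static Configuration copyWithClassLoader(Configuration other) {
    Configuration copy = new Configuration(other);   // copies the properties only
    copy.setClassLoader(other.getClassLoader());     // re-attach e.g. the OSGi bundle classloader
    return copy;
  }
}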

Re: Cartesian product in hadoop

2013-04-18 Thread Ajay Srivastava
Hi Rong, You can use the following simple method. Let's say dataset1 has m records, and when you emit these records from the mapper, the keys are K1, K2, ..., Km for each respective record. Also add an identifier to identify the dataset from which the record is being emitted. So if R1 is a record in dataset1, the mapp

Re: Cartesian product in hadoop

2013-04-18 Thread Azuryy Yu
This is not suitable for his large dataset. --Send from my Sony mobile. On Apr 18, 2013 5:58 PM, "Jagat Singh" wrote: > Hi, > > Can you have a look at > > http://pig.apache.org/docs/r0.11.1/basic.html#cross > > Thanks > > > On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong wrote: > >> Dear all, >> >>

Re: Cartesian product in hadoop

2013-04-18 Thread Jagat Singh
Hi, Can you have a look at http://pig.apache.org/docs/r0.11.1/basic.html#cross Thanks On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong wrote: > Dear all, > > I am writing to kindly ask for ideas of doing cartesian product in hadoop. > Specifically, now I have two datasets, each of which contains

Oracle big data appliance

2013-04-18 Thread oualid ait wafli
Hi, has someone already worked with the Oracle Big Data Appliance? Thanks

Cartesian product in hadoop

2013-04-18 Thread zheyi rong
Dear all, I am writing to kindly ask for ideas on doing a cartesian product in hadoop. Specifically, I now have two datasets, each of which contains 20 million lines. I want to do a cartesian product on these two datasets, comparing lines pairwise. The output of each comparison can be mostly filtere

Re: Run multiple HDFS instances

2013-04-18 Thread Harsh J
Yes you can, but if you want the scripts to work, you should have them use a different PID directory (I think it's called HADOOP_PID_DIR) every time you invoke them. I instead prefer to start the daemons up via their direct command such as "hdfs namenode" and so on, and move them to the background, with

Run multiple HDFS instances

2013-04-18 Thread Lixiang Ao
Hi all, Can I run multiple HDFS instances, that is, n separate namenodes and n datanodes, on a single machine? I've modified core-site.xml and hdfs-site.xml to avoid port and file conflicts between the HDFS instances, but when I started the second HDFS, I got these errors: Starting namenodes on [localhost] l

Re: Kerberos documentation

2013-04-18 Thread Christopher Vasanth John
Hi, I have a Kerberos KDC running and also have apache Hadoop 1.0.4 running on a cluster. Is there some kind of documentation I can use to link the two? Basically, I'm trying to make my hadoop cluster secure. Thanks, Chris On Wed, Apr 17, 2013 at 3:30 PM, Aaron T. Myers wrote: > Hi Chr

Re: Hadoop fs -getmerge

2013-04-18 Thread Fabio Pitzolu
Hi Hemanth, I guess that the only solution is to delete the crc files after the export. Does any one of you know whether someone filed a JIRA to implement a parameter for -getmerge to delete the crc files afterwards? *Fabio Pitzolu* Consultant - BI & Infrastructure Mob. +39 3356033776 Telefono 02 871