Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
It depends on the input format you use. You probably want to look at using NLineInputFormat. From: Devi Kumarappan kpala...@att.net Reply-To: mapreduce-u...@hadoop.apache.org
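
For illustration, a minimal sketch (not from the original thread) of wiring NLineInputFormat into a Java job driver, assuming the newer org.apache.hadoop.mapreduce API; the class name, input path argument, and the 100-lines-per-split value are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "nline-example");
        job.setJarByClass(NLineDriver.class);
        // Give each map task N lines of input rather than a whole HDFS block.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100); // 100 is arbitrary
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper, reducer, and output types as usual, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }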

Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
...@hadoop.apache.org Subject: Re: Issue with Hadoop Streaming My mapper is a Perl script and it is not in Java. So how do I specify the NLineInputFormat? From: Robert Evans ev...@yahoo-inc.com To: mapreduce-u
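
A hedged sketch of how a streaming job can name an input format for a non-Java mapper; the jar path, input/output paths, script name, and lines-per-map value are illustrative:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.line.input.format.linespermap=100 \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -input /user/example/in \
      -output /user/example/out \
      -mapper mapper.pl \
      -file mapper.pl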

Re: Hadoop : java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

2012-08-02 Thread Robert Evans
The default text input format has a LongWritable key that is the byte offset of the line in the file. The value is the full line. On 8/2/12 2:59 PM, Harit Himanshu harit.himan...@gmail.com wrote: StackOverflow link - http://stackoverflow.com/questions/11784729/hadoop-java-lang-classcastexce
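
The usual fix is to declare the mapper's input key type to match; a minimal sketch (illustrative class, not from the thread) with the new API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // offset is the byte offset of this line in the file; line is the line text.
        context.write(line, ONE);
      }
    }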

Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-27 Thread Robert Evans
26, 2012 at 11:59 PM, Robert Evans ev...@yahoo-inc.com wrote: OK I think I understand it now. You probably have ACLs enabled, but no web filter on the RM to let you sign in as a given user. As such the default filter is making you be Dr. Who, or whoever else it is, but the ACL check

Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-26 Thread Robert Evans
it as I just moved on to further coding :) Thanks, Prajakta On Thu, Jul 26, 2012 at 1:40 AM, Robert Evans ev...@yahoo-inc.com wrote: Hmm, that is very odd. It only checks the user if security is enabled to warn the user about potentially accessing something unsafe. I am not sure why

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-25 Thread Robert Evans
). It supports RESTful APIs as I am able to retrieve JSON objects for RM (cluster/nodes info) + Historyserver. The only issue is with AppMaster REST API. Regards, Prajakta On Fri, Jul 6, 2012 at 10:55 PM, Robert Evans ev...@yahoo-inc.com wrote: What version of hadoop are you using? It could

Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-25 Thread Robert Evans
. Regards, Prajakta On Fri, Jul 6, 2012 at 10:22 PM, Robert Evans ev...@yahoo-inc.com wrote: Sorry I did not respond sooner. The default behavior is to have the proxy server run as part of the RM. I am not really sure why it is not doing this in your case. If you set

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-06 Thread Robert Evans
in advance. Regards, Prajakta On Fri, Jun 29, 2012 at 8:55 PM, Robert Evans ev...@yahoo-inc.com wrote: Please don't file that JIRA. The proxy server is intended to front the web server for all calls to the AM. This is so you only have to go to a single location to get to any AM's web

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-06 Thread Robert Evans
:22 PM, Robert Evans ev...@yahoo-inc.com wrote: Sorry I did not respond sooner. The default behavior is to have the proxy server run as part of the RM. I am not really sure why it is not doing this in your case. If you set the config yourself to be a URI that is different from that of the RM

Re: bug in streaming?

2012-07-03 Thread Robert Evans
Yang, I would not call it a bug, I would call it a potential optimization. The default input format for streaming will try to create one mapper per block, but if there is only one block it will create two mappers for it. You can override the streaming input format to get a different behavior.

Re: Which hadoop version should I install in a production environment

2012-07-03 Thread Robert Evans
0.23 is also somewhat Alpha/Beta quality and I would not want to run it in production just yet. 0.23 was renamed 2.0 and development work has continued on both lines. New features have been going into 2.0 and 0.23 has been left only for stabilization. Hopefully we will have 0.23.3 to the

Re: Dealing with changing file format

2012-07-02 Thread Robert Evans
There are several different ways. One of the ways is to use something like HCatalog to track the format and location of the dataset. This may be overkill for your problem, but it will grow with you. Another is to store the schema with the data when it is written out. Your code may need to the

Re: Using REST to get ApplicationMaster info

2012-06-29 Thread Robert Evans
Please don't file that JIRA. The proxy server is intended to front the web server for all calls to the AM. This is so you only have to go to a single location to get to any AM's web service. The proxy server is a very simple proxy and just forwards the extra part of the path on to the AM. If

Re: Is it possible to implement transpose with PigLatin/any other MR language?

2012-06-22 Thread Robert Evans
is not preserved, for other operations again data may be in the wrong order in the row. To me it seems like it is not possible to do this in MR. On Fri, Jun 22, 2012 at 12:56 AM, Robert Evans ev...@yahoo-inc.com wrote: That may be true, I have not read through the code very closely, if you have multiple

Re: Is it possible to implement transpose with PigLatin/any other MR language?

2012-06-21 Thread Robert Evans
That may be true if you have multiple reduces; I have not read through the code very closely. You can run it with a single reduce, or you can write a custom partitioner to do it. You only need to know the length of the column, and then you can divide them up appropriately, kind of like how

Re: Yahoo Hadoop Tutorial with new APIs?

2012-06-04 Thread Robert Evans
, because I have some ideas behind this, for example: to release a Spanish version of the tutorial. Regards and best wishes On 04/04/2012 05:29 PM, Robert Evans wrote: Re: Yahoo Hadoop Tutorial with new APIs? I am dropping the cross posts and leaving this on common-user with the others BCCed

Re: Pragmatic cluster backup strategies?

2012-05-30 Thread Robert Evans
cluster? Do you backup specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have

Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Robert Evans
Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to

Re: How to mapreduce in the scenario

2012-05-29 Thread Robert Evans
Yes you can do it. In pig you would write something like:

    A = LOAD 'a.txt' AS (id, name, age, ...);
    B = LOAD 'b.txt' AS (id, address, ...);
    C = JOIN A BY id, B BY id;
    STORE C INTO 'c.txt';

Hive can do it similarly too. Or you could write your own directly in map/reduce or using the data_join jar.

Re: Transfer archives (or any file) from Mapper to Reducer?

2012-05-21 Thread Robert Evans
Be careful putting them in HDFS. It does not scale very well, as the number of file opens will be on the order of Number of Mappers * Number of Reducers. You can quickly do a denial of service on the namenode if you have a lot of mappers and reducers. --Bobby Evans On 5/21/12 4:02 AM, Harsh

Re: Stream data processing

2012-05-21 Thread Robert Evans
Zhiwei, How quickly do you have to get the result out once the new data is added? How far back in time do you have to look for from the occurrence of ? Do you have to do this for all combinations of values or is it just a small subset of values? --Bobby Evans On 5/21/12 3:01 PM,

Re: freeze a mapreduce job

2012-05-11 Thread Robert Evans
There is an idle timeout for map/reduce tasks. If a task makes no progress for 10 min (default) the AM will kill it on 2.0 and the JT will kill it on 1.0. But I don't know of anything associated with a Job, other than in 0.23: if the AM does not heartbeat back in for too long, I believe that
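
For reference, the per-task knob behind that 10 minute default, as a driver-side fragment (mapred.task.timeout is the 1.x-era key name; the 20 minute value shown is arbitrary):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Per-task idle timeout in milliseconds; the default is 600000 (10 minutes).
    conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);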

Re: Need to improve documentation for v 0.23.x ( v 2.x)

2012-05-09 Thread Robert Evans
which is good in documentation (wiki) is Apache Mahout; we can learn from them, lots of extensive references, presentations, tutorials all at one place at the wiki to refer to. On Mon, May 7, 2012 at 9:19 PM, Robert Evans ev...@yahoo-inc.com wrote: I agree that better documentation is almost

Re: cannot use a map side join to merge the output of multiple map side joins

2012-05-07 Thread Robert Evans
I believe that you are correct about the split processing. It orders the splits by size so that the largest splits are processed first. This allows for the smaller splits to potentially fill in the gaps. As far as a fix is concerned I think overriding the file name in the file output

Re: Need to improve documentation for v 0.23.x ( v 2.x)

2012-05-07 Thread Robert Evans
I agree that better documentation is almost always needed. The problem is in finding the time to really make this happen. If you or anyone else here wants to help out with this effort please feel free to file JIRAs and submit patches to improve the documentation. Even if all the patch is, is

Re: hadoop streaming using a java program as mapper

2012-05-02 Thread Robert Evans
Do you have the error message from running java? You can use myMapper.sh to help you debug what is happening and log it. Stderr of myMapper.sh is logged and you can get to it. You can run shell commands like find, ls, and you can probably look at any error messages that java produced

Re: KMeans clustering on Hadoop infrastructure

2012-04-30 Thread Robert Evans
You are likely going to get more help from talking to the Mahout mailing list. https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives --Bobby Evans On 4/28/12 7:45 AM, Lukáš Kryške lu...@hotmail.cz wrote: Hello, I am successfully running K-Means clustering

Re: Node-wide Combiner

2012-04-30 Thread Robert Evans
Do you mean that when multiple map tasks run on the same node, there is a combiner that will run across all of their output? There is nothing for that right now. It seems like it could be somewhat difficult to get right given the current architecture. --Bobby Evans On 4/27/12 11:13 PM,

Re: Text Analysis

2012-04-25 Thread Robert Evans
Hadoop itself is the core Map/Reduce and HDFS functionality. The higher level algorithms like sentiment analysis are often done by others. Cloudera has a video from HadoopWorld 2010 about it http://www.cloudera.com/resource/hw10_video_sentiment_analysis_powered_by_hadoop/ And there are

Re: isSplitable() problem

2012-04-24 Thread Robert Evans
The current code guarantees that they will be received in order. There are some patches that are likely to go in soon that would allow for the JVM itself to be reused. In those cases I believe that the mapper class would be recreated, so the only thing you would have to worry about would be

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-20 Thread Robert Evans
-Original Message- From: Robert Evans Sent: Thursday, April 19, 2012 2:08 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation From what I can see your implementation seems OK, especially from a performance

Re: remote job submission

2012-04-20 Thread Robert Evans
You can use Oozie to do it. On 4/20/12 8:45 AM, Arindam Choudhury arindamchoudhu...@gmail.com wrote: Sorry. But can you give me an example? On Fri, Apr 20, 2012 at 3:08 PM, Harsh J ha...@cloudera.com wrote: Arindam, If your machine can access the clusters' NN/JT/DN ports, then you can

Re: Accessing global Counters

2012-04-20 Thread Robert Evans
There was a discussion about this several months ago http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=zqe-bagdfz6r3tob0aof9viozgtzeqgkp...@mail.gmail.com%3E The conclusion is that if you want to read them from the reducer you are going to have to do

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Robert Evans
From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the hadoop computations. Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to

Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote: Hi Abhishek, Manu is correct about High Availability within a single colo. I realize that in some cases you have to have fail over between colos. I am not aware of any turn key solution for things like that, but generally what you

Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
this... And yeah it's something one can't talk about ;-) On Apr 19, 2012, at 4:28 PM, Robert Evans wrote: Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details of it. I can guess about what it would take

Re: Multiple data centre in Hadoop

2012-04-16 Thread Robert Evans
/Rack_aware_HDFS_proposal.pdf On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh manu.i...@gmail.comwrote: Thanks Robert. Is there a best practice or design than can address the High Availability to certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does

Re: Multiple data centre in Hadoop

2012-04-11 Thread Robert Evans
No it does not. Sorry On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, Just wanted to know if hadoop supports more than one data centre. This is basically for DR purposes and High Availability where if one centre goes down the other can be brought up. Regards, Abhishek

Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming
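
A minimal streaming invocation to make the fork/exec model concrete (illustrative jar path and input/output paths; cat and wc stand in for real mapper and reducer executables):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /user/example/in \
      -output /user/example/out \
      -mapper /bin/cat \
      -reducer /usr/bin/wc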

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Robert Evans
I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate

Re: Temporal query

2012-03-29 Thread Robert Evans
I am not aware of anyone that does this for you directly, but it should not be too difficult for you to write what you want using pig or hive. I am not as familiar with Jaql but I assume that you can do it there too. Although it might be simpler to write it using Map/Reduce because we can

Re: question about processing large zip

2012-03-21 Thread Robert Evans
How are you splitting the zip right now? Do you have multiple mappers and each mapper starts at the beginning of the zip and goes to the point it cares about, or do you just have one mapper? If you are doing it the first way you may want to increase your replication factor. Alternatively you

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
I can see a use for it, but I have two concerns about it. My biggest concern is maintainability. We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot. I am not saying that it will happen with this, but if

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
DataDriverDBInputFormat is sexy for sure but does not need to be part of core. I could see hadoop as just coming with TextInputFormat and SequenceInputFormat and everything else is aftermarket from github. On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans ev...@yahoo-inc.com wrote: I can see a use

Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-03 Thread Robert Evans
(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:186) at com.amd.wrapper.main.ParserWrapper.main(ParserWrapper.java:31) Thanks, Abees On 2 February 2012 23:02, Robert Evans ev...@yahoo-inc.com wrote: What happens

Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-02 Thread Robert Evans
What happens? Is there an exception, does nothing happen? I am curious. Also how did you launch your other job that is trying to run this one? The hadoop script sets up a lot of environment variables, classpath, etc. to make hadoop work properly, and some of that may not be set up correctly to

Re: Why $HADOOP_PREFIX ?

2012-02-01 Thread Robert Evans
I think it comes down to a long history of splitting and then remerging the hadoop project. I could be wrong about a lot of this so take it with a grain of salt. Hadoop originally was, and still is on 1.0, a single project. HDFS, mapreduce and common are all compiled together into a single jar

Re: When to use a combiner?

2012-01-25 Thread Robert Evans
You can use a combiner for average. You just have to write a separate combiner from your reducer. Something like:

    class MyCombiner {
      // The value is sum/count pairs
      void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
          sum += value.getFirst();
          count += value.getSecond();
        }
        // Emit the partial sum and count; the reducer computes sum/count at the end.
        context.write(key, new Pair<Long, Long>(sum, count));
      }
    }

Re: increase number of map tasks

2012-01-10 Thread Robert Evans
Similarly there is the NLineInputFormat that does this automatically. If your input is small it will read in the input and make a split for every N lines of input. Then you don't have to reformat your data files. --Bobby Evans On 1/10/12 8:09 AM, GorGo gylf...@ru.is wrote: Hi. I am no

Re: Hadoop PIPES job using C++ and binary data results in data locality problem.

2012-01-10 Thread Robert Evans
I think what you want to try and do is to use JNI rather than pipes or streaming. PIPES has known issues and it is my understanding that its use is now discouraged. The ideal way to do this is to use JNI to send your data to the C code. Be aware that moving large amounts of data through JNI
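
A sketch of the Java side of that pattern (illustrative names; the matching C implementation would be compiled separately into a shared library such as libnativeparser.so):

    public class NativeParser {
      static {
        // Loads libnativeparser.so (or .dylib/.dll) from java.library.path.
        System.loadLibrary("nativeparser");
      }

      // Implemented in C; the mapper calls this once per record.
      public static native byte[] process(byte[] record);
    }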

Re: dual power for hadoop in datacenter?

2012-01-09 Thread Robert Evans
Be aware that if half of your cluster goes down, depending on the version and configuration of Hadoop, there may be a replication storm, as hadoop tries to bring it all back up to the proper number of replications. Your cluster may still be unusable in this case. --Bobby Evans On 1/7/12 2:55

Re: Appmaster error

2012-01-05 Thread Robert Evans
Please don't cross post. Common is BCCed. Each container has a vmem limit that is enforced, but not in local mode. If this is for the app master then you can increase this amount so that when you launch your AM you can set this amount through SubmitApplicationRequest req; ...

Re: Best ways to look-up information?

2011-12-12 Thread Robert Evans
Mark, Are all of the tables used by all of the processes? Are all of the tables used all of the time or are some used infrequently? Does the data in these lookup tables change a lot or is it very stable? What is the actual size of the data, yes 1 million entries, but is this 1 million 1kB,

Re: Passing data files via the distributed cache

2011-11-28 Thread Robert Evans
There is currently no way to delete the data from the cache when you are done. It is garbage collected when the cache starts to fill up (in LRU order if you are on a newer release). The DistributedCache.addCacheFile is modifying the JobConf behind the scenes for you. If you want to dig into

Re: mapred.map.tasks getting set, but not sure where

2011-11-07 Thread Robert Evans
It seems logical too that launching 4000 map tasks on a 20 node cluster is going to have a lot of overhead with it. 20 does not seem like the ideal number, but I don't really know the internals of Cassandra that well. You might want to post this question on the Cassandra list to see if they

Re: mapred.map.tasks getting set, but not sure where

2011-11-04 Thread Robert Evans
What versions of Hadoop were you running with previously, and what version are you running with now? --Bobby Evans On 11/4/11 9:33 AM, Brendan W. bw8...@gmail.com wrote: Hi, In the jobs running on my cluster of 20 machines, I used to run jobs (via hadoop jar ...) that would spawn around 4000

Re: mapred.map.tasks getting set, but not sure where

2011-11-04 Thread Robert Evans
4, 2011 at 11:04 AM, Robert Evans ev...@yahoo-inc.com wrote: What versions of Hadoop were you running with previously, and what version are you running with now? --Bobby Evans On 11/4/11 9:33 AM, Brendan W. bw8...@gmail.com wrote: Hi, In the jobs running on my cluster of 20 machines, I

Re: Is it possible to run multiple MapReduce against the same HDFS?

2011-10-11 Thread Robert Evans
with different accounts, can MapReduce cluster be able to access HDFS directories and files (if authentication in HDFS is enabled)? Thanks! Gerald On Mon, Oct 10, 2011 at 12:36 PM, Robert Evans ev...@yahoo-inc.com wrote: It should be possible to use multiple map/reduce clusters sharing the same HDFS

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during. On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans ev...@yahoo

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
wrote: Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans ev...@yahoo-inc.com wrote: The problem is the step 4 in the breaking sequence. Currently

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
...@gmail.com wrote: From that interpretation, it then seems like it would be safe to delete the files between completed runs? How could it distinguish between the files having been deleted and their not having been downloaded from a previous run? On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans ev...@yahoo

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
deleted and their not having been downloaded from a previous run of the job? Is it state in memory that the taskTracker maintains? On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans ev...@yahoo-inc.com wrote: If you are never ever going to use that file again for any map/reduce task in the future then yes

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
, you either repopulate them (as well as their crc checksums) or you restart the TaskTracker? On Tue, Sep 27, 2011 at 3:03 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes, all of the state for the task tracker is in memory. It never looks at the disk to see what is there, it only maintains

Re: many killed tasks, long execution time

2011-09-23 Thread Robert Evans
Can you include the complete stack trace of the IOException you are seeing? --Bobby Evans On 9/23/11 2:15 AM, Sofia Georgiakaki geosofie_...@yahoo.com wrote: Good morning! I would be grateful if anyone could help me about a serious problem that I'm facing. I try to run a hadoop job on a

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Robert Evans
Meng Mao, The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once they are still there and in the proper shape. It might be good to file a JIRA to add in some

Re: many killed tasks, long execution time

2011-09-23 Thread Robert Evans
of reducers are in the range 2-12, and then if I increase the reducers further, the performance gets worse and worse... Any ideas would be helpful! Thank you! From: Robert Evans ev...@yahoo-inc.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org

Re: Can we run job on some datanodes ?

2011-09-21 Thread Robert Evans
Praveen, If you are doing performance measurements be aware that having more datanodes than tasktrackers will impact the performance as well (Don't really know for sure how). It will not be the same performance as running on a cluster with just fewer nodes over all. Also if you do shut off

Re: How to get hadoop job information effectively?

2011-09-21 Thread Robert Evans
Not that I know of. We scrape web pages which is a horrible thing to do. There is a JIRA to add in some web service APIs to expose this type of information, but it is not going to be available for a while. --Bobby Evans On 9/21/11 1:01 PM, Benyi Wang bewang.t...@gmail.com wrote: I'm working

Re: Is Hadoop the right platform for my HPC application?

2011-09-13 Thread Robert Evans
Another option to think about is that there is a Hamster project ( MAPREDUCE-2911 https://issues.apache.org/jira/browse/MAPREDUCE-2911 ) that will allow OpenMPI to run on a Hadoop Cluster. It is still very preliminary and will probably not be ready until Hadoop 0.23 or 0.24. There are other

Re: Is Hadoop the right platform for my HPC application?

2011-09-12 Thread Robert Evans
Parker, The hadoop command itself is just a shell script that sets up your classpath and some environment variables for a JVM. Hadoop provides a java API and you should be able to use it to write your application, without dealing with the command line. That being said there is no Map/Reduce

Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Robert Evans
Dmitry, It sounds like an interesting idea, but I have not really heard of anyone doing it before. It would make for a good feature to have tiered file systems all mapped into the same namespace, but that would be a lot of work and complexity. The quick solution would be to know what data you

Re: Hadoop JVM Size (Not MapReduce)

2011-08-19 Thread Robert Evans
The hadoop command is just a shell script that sets up the class path before calling java. I think if you set the ENV HADOOP_JAVA_OPTS then they will show up on the command line, but you can look at the top of the hadoop shell script to be sure. It has all the env vars it supports listed there

Re: next gen map reduce

2011-07-28 Thread Robert Evans
It has not been introduced yet, if you are referring to MRV2. It is targeted to go into the 0.23 release of Hadoop, but is currently on the MR-279 branch, which should hopefully be merged to trunk in about a week. --Bobby On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com wrote:

Re: Hadoop-streaming using binary executable c program

2011-07-28 Thread Robert Evans
I am not completely sure what you are getting at. It looks like the output of your c program is (and this is just a guess): NOTE: \t stands for the tab character, and in streaming it is used to separate the key from the value; \n stands for the newline character and is used to separate individual

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom, That assumes that you will never write to the same file from two different mappers or processes. HDFS currently does not support writing to a single file from multiple processes. --Bobby On 7/25/11 3:25 PM, Tom Melendez t...@supertom.com wrote: Hi Folks, Just doing a sanity check

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
on the namenode if you are going to create a lot of small files using this method. --Bobby On 7/25/11 3:30 PM, Robert Evans ev...@yahoo-inc.com wrote: Tom, That assumes that you will never write to the same file from two different mappers or processes. HDFS currently does not support writing to a single

Re: Hadoop-streaming using binary executable c program

2011-07-25 Thread Robert Evans
This is likely to be slow and it is not ideal. The ideal would be to modify pknotsRG to be able to read from stdin, but that may not be possible. The shell script would probably look something like the following:

    #!/bin/sh
    rm -f temp.txt
    while read line
    do
      echo "$line" >> temp.txt
    done
    exec

Re: Running queries using index on HDFS

2011-07-25 Thread Robert Evans
Sofia, You can access any HDFS file from a normal java application so long as your classpath and some configuration is set up correctly. That is all that the hadoop jar command does. It is a shell script that sets up the environment for java to work with Hadoop. Look at the example for the

Re: Problem with Hadoop Streaming -file option for Java class files

2011-07-22 Thread Robert Evans
From a practical standpoint if you just leave off the -mapper you will get an IdentityMapper being run in streaming. I don't believe that -mapper will understand something.class as a class file that should be loaded and used as the mapper. I think you need to specify the class, including the

Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
It looks like it tried to run your program and the program exited with a 1 not a 0. What are the stderr logs like for the mappers that were launched? You should be able to access them through the Web GUI. You might want to add in some stderr log messages to your c program too. To be able to

Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
I would suggest that you do the following to help you debug. hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG - This is simulating what hadoop streaming is doing. Here we are taking the first 2

Re: Which release to use?

2011-07-15 Thread Robert Evans
Adarsh, Yahoo! no longer has its own distribution of Hadoop. It has been merged into the 0.20.2XX line so 0.20.203 is what Yahoo is running internally right now, and we are moving towards 0.20.204 which should be out soon. I am not an expert on Cloudera so I cannot really map its releases to

Re: Giving filename as key to mapper ?

2011-07-15 Thread Robert Evans
To add to that, if you really want the file name to be the key, instead of just calling a different API in your map to get it, you will probably need to write your own input format to do it. It should be fairly simple and you can base it off of an existing input format to do it. --Bobby On
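
For reference, the "different API" route is a one-liner in the mapper; a sketch with the new API (illustrative mapper class, real Hadoop calls):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Read the current file's name off the input split.
        String name = ((FileSplit) context.getInputSplit()).getPath().getName();
        context.write(new Text(name), value);
      }
    }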

Re: Issue with MR code not scaling correctly with data sizes

2011-07-15 Thread Robert Evans
Please don't cross post. I put common-user in BCC. I really don't know for sure what is happening especially without the code or more to go on and debugging something remotely over e-mail is extremely difficult. You are essentially doing a cross which is going to be very expensive no matter

Re: large data and hbase

2011-07-11 Thread Robert Evans
Rita, My understanding is that you do not need to setup map/reduce to use Hbase, but I am not an expert on it. Contacting the Hbase mailing list would probably be the best option to get your questions answered. u...@hbase.apache.org Their setup page might be able to help you out too

Re: Automatic line number in reducer output

2011-06-09 Thread Robert Evans
What exactly is linecount being output as in the new APIs? --Bobby On 6/7/11 11:21 AM, Shi Yu sh...@uchicago.edu wrote: Hi, I am wondering is there any built-in function to automatically add a self-increment line number in reducer output (like the relation DB auto-key). I have this problem

Re: DistributedCache

2011-06-09 Thread Robert Evans
I think the issue you are seeing is because the distributed cache is not set up by default to create symlinks to the files it pulls over. If you want to access them through symlinks in the local directory call DistributedCache.createSymlink(conf) before submitting your job, otherwise you can
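
A driver-side fragment showing the pattern (illustrative HDFS path and link name; addCacheFile and createSymlink are the real 1.x-era calls):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    Configuration conf = new Configuration();
    // "#lookup" names the symlink created in each task's working directory.
    DistributedCache.addCacheFile(new URI("/user/example/lookup.dat#lookup"), conf);
    DistributedCache.createSymlink(conf);
    // Tasks can then open the file simply as "lookup" in their working directory.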

Re: Linear scalability question

2011-06-09 Thread Robert Evans
Shantian, You are correct. The other big factor in this is the cost of connections between the Mappers and the Reducers. With N mappers and M reducers you will make M*N connections between them. This can be a very large cost as well. The basic tricks you can play are to filter data before

Re: Hadoop project - help needed

2011-05-31 Thread Robert Evans
Parismav, So you are more or less trying to scrape some data in a distributed way. Well there are several things that you could do, just be careful: I am not sure of the terms of service for the flickr APIs, so make sure that you are not violating them by downloading too much data. You probably

Re: Sorting ...

2011-05-26 Thread Robert Evans
Also if you want something that is fairly fast and a lot less dev work to get going you might want to look at pig. They can do a distributed order by that is fairly good. --Bobby Evans On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 22:15:50 Mark question wrote: I'm

Re: Applications creates bigger output than input?

2011-05-19 Thread Robert Evans
I'm not sure if this has been mentioned or not, but in Machine Learning with text based documents the first stage is often a glorified word count action, except much of the time they will do N-Grams. So Map Input: Hello this is a test Map Output: Hello, this, is, a, test, Hello this, this is, is a, a

Re: current line number as key?

2011-05-18 Thread Robert Evans
You are correct that there is no easy and efficient way to do this. You could create a new InputFormat that derives from FileInputFormat that makes it so the files do not split, and then have a RecordReader that keeps track of line numbers. But then each file is read by only one mapper.
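
A sketch of the non-splitting half (real Hadoop classes; the line-numbering RecordReader would be layered on top of this):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // With splitting disabled, a single record reader sees every line of a file
    // in order, so it could count line numbers as it goes.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }
    }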

Re: Exception in thread AWT-EventQueue-0 java.lang.NullPointerException

2011-05-16 Thread Robert Evans
What version of hadoop are you using? On 5/14/11 9:37 AM, Lạc Trung trungnb3...@gmail.com wrote: Hello everybody ! This exception was thrown when I tried to copy a file from local file to HDFS. This is my program : *** import

Re: FileSystem API - Moving files in HDFS

2011-05-16 Thread Robert Evans
If there are lots of large files, and you need to copy them quickly, i.e. not have all the data go through a single machine, you can use hadoop distcp too. --Bobby On 5/14/11 12:49 AM, Mahadev Konar maha...@apache.org wrote: Jim, you can use FileUtil.copy() methods to copy files. Hope that
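
Typical distcp usage, with illustrative cluster URIs (distcp runs the copy as a distributed map-only job):

    hadoop distcp hdfs://nn1:8020/user/example/src hdfs://nn2:8020/user/example/dst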

Re: Map Result Caching

2011-04-18 Thread Robert Evans
DoomUs, To me it seems like it should be something at the application level and less at the Hadoop level. I would think if there really is very little delta between the runs then the application would save the output of a map only job, and the next time would do a union of that and the output

Re: Setting input paths

2011-04-06 Thread Robert Evans
I believe that opening a directory as a file will result in a file not found. You probably need to set it to a glob that points to the actual files. Something like /user/root/logs/2011/*/*/* for all entries in 2011, or /user/root/logs/2011/01/*/* if you want to restrict it to just
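
As a fragment (assuming an already-configured Job named job; the glob is expanded to the matching files when input splits are computed):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Globs are resolved at submission time; the directory itself is never opened as a file.
    FileInputFormat.addInputPath(job, new Path("/user/root/logs/2011/*/*/*"));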