Re: [Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Sonal Goyal
Looks like a classpath issue - Caused by: java.lang.ClassNotFoundException:
com.amazonaws.services.s3.AmazonS3

Are you using S3 somewhere? Are the required jars in place?
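
For example, the missing classes can be passed along at submit time. A rough sketch
(jar names, versions and paths below are placeholders; match them to your Hadoop build):

spark-submit --properties-file spark.properties \
  --jars /path/to/hadoop-aws-2.7.2.jar,/path/to/aws-java-sdk-1.7.4.jar \
  --class com.example.MyApp myapp.jar

Alternatively, the jars can go on spark.driver.extraClassPath and
spark.executor.extraClassPath inside the properties file itself.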

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Tue, Sep 6, 2016 at 4:45 PM, Divya Gehlot 
wrote:

> Hi,
> I am getting the below error if I try to use the properties file parameter in
> spark-submit
>
> Exception in thread "main" java.util.ServiceConfigurationError:
> org.apache.hadoop.fs.FileSystem: Provider 
> org.apache.hadoop.fs.s3a.S3AFileSystem
> could not be instantiated
> at java.util.ServiceLoader.fail(ServiceLoader.java:224)
> at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
> at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2673)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
> FileSystem.java:2684)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2701)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2737)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2719)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:375)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
> at org.apache.spark.deploy.yarn.ApplicationMaster.run(
> ApplicationMaster.scala:142)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$
> main$1.apply$mcV$sp(ApplicationMaster.scala:653)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(
> SparkHadoopUtil.scala:69)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(
> SparkHadoopUtil.scala:68)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
> at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(
> SparkHadoopUtil.scala:68)
> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(
> ApplicationMaster.scala:651)
> at org.apache.spark.deploy.yarn.ApplicationMaster.main(
> ApplicationMaster.scala)
> Caused by: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/
> AmazonS3
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
> at java.lang.Class.getConstructor0(Class.java:2895)
> at java.lang.Class.newInstance(Class.java:354)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> ... 19 more
> Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.
> AmazonS3
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 24 more
> End of LogType:stderr
>
> If I remove the --properties-file parameter
> the error is gone
>
> Would really appreciate the help.
>
>
>
> Thanks,
> Divya
>


Re: Mapper input as argument

2013-11-06 Thread Sonal Goyal
Hi Unmesha,

What is the computation you are trying to do? If you are interested in
computing over multiple lines instead of a single line, have a look at
NLineInputFormat.
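
As a rough sketch against the new mapreduce API (the line count of 10 is arbitrary),
the driver would do:

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// each map task now receives 10 input lines instead of one block-sized split
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 10);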

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Thu, Nov 7, 2013 at 11:35 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

 One more doubt: how do I copy each input split entering the mapper into a
 file for computation?


 On Thu, Nov 7, 2013 at 10:35 AM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 My driver code is
 FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job,new Path(args[1]));

 and My mapper is
 public void map(Object key, Text value, Context context)
 throws IOException, InterruptedException {
 where value.toString() contains my input data.

 Is it better to copy all the incoming data into a file and do the
 computations, or to read each line and do the calculation?

 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*






 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*

 *Amrita Center For Cyber Security *


 * Amritapuri.www.amrita.edu/cyber/ http://www.amrita.edu/cyber/*



Re: Mapper input as argument

2013-11-06 Thread Sonal Goyal
If you don't need line by line but you want to get a number of lines
together, use NLineInputFormat. If you don't want to split at all, override
isSplitable in FileInputFormat. Or you can use FileInputFormat, get each
line as key/value and compute over it, saving the results and emitting only
as necessary.
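
For the no-split case, a minimal sketch (assuming TextInputFormat as the base; the
class name is just an example) would be:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
  // returning false keeps each file in a single split, so one mapper
  // sees every line of that file
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

and then job.setInputFormatClass(WholeFileTextInputFormat.class) in the driver.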

I am not sure what your use case is, but I hope the above helps.

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Thu, Nov 7, 2013 at 11:44 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

 Am I able to get the entire split's data in the mapper? I don't need it line by
 line.

 My input is of, say, 50 lines, so the file can be split across different
 mappers, right? How do I get each split's data? Are we able to get that data?


 On Thu, Nov 7, 2013 at 11:39 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Unmesha,

 What is the computation you are trying to do? If you are interested in
 computing over multiple lines instead of a single line, have a look at
 NLineInputFormat.

 Best Regards,
 Sonal
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal




 On Thu, Nov 7, 2013 at 11:35 AM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 One more doubt: how do I copy each input split entering the mapper into
 a file for computation?


 On Thu, Nov 7, 2013 at 10:35 AM, unmesha sreeveni unmeshab...@gmail.com
  wrote:

 My driver code is
 FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job,new Path(args[1]));

 and My mapper is
 public void map(Object key, Text value, Context context)
 throws IOException, InterruptedException {
 where value.toString() contains my input data.

 Is it better to copy all the incoming data into a file and do the
 computations, or to read each line and do the calculation?

 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*






 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*

 *Amrita Center For Cyber Security *


 * Amritapuri.www.amrita.edu/cyber/ http://www.amrita.edu/cyber/*





 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*

 *Amrita Center For Cyber Security *


 * Amritapuri.www.amrita.edu/cyber/ http://www.amrita.edu/cyber/*



Re: Writing to multiple directories in hadoop

2013-10-12 Thread Sonal Goyal
Hi Jamal,

If I remember correctly, you can use the write(key, value, basePath) method
 of MultipleOutputs in your reducer to get different directories.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,
VALUEOUT, java.lang.String)

Here is what the API says

Use MultipleOutputs.write(KEYOUT key, VALUEOUT value, String baseOutputPath) to
write key and value to a path specified by baseOutputPath, with no need to
specify a named output:

 private MultipleOutputs<Text, Text> out;

 public void setup(Context context) {
   out = new MultipleOutputs<Text, Text>(context);
   ...
 }

 public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
   for (Text t : values) {
     out.write(key, t, generateFileName(<parameter list...>));
   }
 }

 protected void cleanup(Context context) throws IOException,
InterruptedException {
   out.close();
 }


Use your own code in generateFileName() to create a custom path to your
results. '/' characters in baseOutputPath will be translated into directory
levels in your file system. Also, append your custom-generated path with
"part" or similar, otherwise your output files will just be named -r-00000,
-r-00001 etc. No call to context.write() is necessary. See example
generateFileName() code below.

 private String generateFileName(Text k) {
   // expect Text k in format "Surname|Forename"
   String[] kStr = k.toString().split("\\|");

   String sName = kStr[0];
   String fName = kStr[1];

   // example for k = Smith|John
   // output written to /user/hadoop/path/to/output/Smith/John-r-00000 (etc)
   return sName + "/" + fName;
 }
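
Applied to the named outputs in the question quoted below, the reducer could pass a
base path ending in "part" so the files land under per-name directories; this is a
sketch only (the named outputs and value objects come from the quoted code):

 mos.write("foo", NullWritable.get(), new Text(jsn.toString()), "foo/part");
 mos.write("bar", key, NullWritable.get(), "bar/part");
 mos.write("foobar", key, NullWritable.get(), "foobar/part");

This uses the write(namedOutput, key, value, baseOutputPath) overload and should give
output/foo/part-r-00000, output/bar/part-r-00000 and so on.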


Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Sat, Oct 12, 2013 at 3:49 AM, jamal sasha jamalsha...@gmail.com wrote:

 Hi,

 I am trying to separate my output from reducer to different folders..

 My dirver has the following code:
  FileOutputFormat.setOutputPath(job, new Path(output));
 //MultipleOutputs.addNamedOutput(job, namedOutput,
 outputFormatClass, keyClass, valueClass)
 //MultipleOutputs.addNamedOutput(job, namedOutput,
 outputFormatClass, keyClass, valueClass)
 MultipleOutputs.addNamedOutput(job, "foo",
 TextOutputFormat.class, NullWritable.class, Text.class);
 MultipleOutputs.addNamedOutput(job, "bar",
 TextOutputFormat.class, Text.class, NullWritable.class);
 MultipleOutputs.addNamedOutput(job, "foobar",
 TextOutputFormat.class, Text.class, NullWritable.class);

 And then my reducer has the following code:
 mos.write("foo", NullWritable.get(), new Text(jsn.toString()));
 mos.write("bar", key, NullWritable.get());
 mos.write("foobar", key, NullWritable.get());

 But in the output, I see:

 output/foo-r-0001
 output/foo-r-0002
 output/foobar-r-0001
 output/bar-r-0001


 But what I am trying is :

 output/foo/part-r-0001
 output/foo/part-r-0002
 output/bar/part-r-0001
 output/foobar/part-r-0001

 How do I do this?
 Thanks



Re: All datanodes are bad IOException when trying to implement multithreading serialization

2013-09-29 Thread Sonal Goyal
Wouldn't you rather just change your split size so that you can have more 
mappers work on your input? What else are you doing in the mappers?
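
For example (a sketch using the new API; the 32 MB figure is arbitrary), the driver can
cap the split size so that more map tasks are created for the same input:

// in the driver, after creating the Job
FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024); // 32 MB splits -> more mappers

The same effect can be had by lowering the max split size property in the job
configuration (mapred.max.split.size or mapreduce.input.fileinputformat.split.maxsize,
depending on the version).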
Sent from my iPad

On Sep 30, 2013, at 2:22 AM, yunming zhang zhangyunming1...@gmail.com wrote:

 Hi, 
 
 I was playing with Hadoop code trying to have a single Mapper support reading 
 an input split using multiple threads. I am getting an "All datanodes are bad" 
 IOException, and I am not sure what the issue is. 
 
 The reason for this work is that I suspect my computation was slow because it 
 takes too long to create the Text() objects from inputsplit using a single 
 thread. I tried to modify the LineRecordReader (since I am mostly using 
 TextInputFormat) to provide additional methods to retrieve lines from the 
 input split  getCurrentKey2(), getCurrentValue2(), nextKeyValue2(). I created 
 a second FSDataInputStream, and second LineReader object for 
 getCurrentKey2(), getCurrentValue2() to read from. Essentially I am trying to 
 open the input split twice with different start points (one in the very 
 beginning, the other in the middle of the split) to read from input split in 
 parallel using two threads.  
 
 In the org.apache.hadoop.mapreduce.mapper.run() method, I modified it to read 
 simultaneously using getCurrentKey() and getCurrentKey2() in Thread 1 and 
 Thread 2 (both threads running at the same time):
   Thread 1:
while(context.nextKeyValue()){
   map(context.getCurrentKey(), context.getCurrentValue(), 
 context);
 }
 
   Thread 2:
 while(context.nextKeyValue2()){
 map(context.getCurrentKey2(), context.getCurrentValue2(), 
 context);
 //System.out.println(two iter);
 }
 
 However, this causes me to see the All Datanodes are bad exception. I think I 
 made sure that I closed the second file. I have attached a copy of my 
 LineRecordReader file to show what I changed trying to enable two 
 simultaneous read to the input split. 
 
 I have modified other files(org.apache.hadoop.mapreduce.RecordReader.java, 
 mapred.MapTask.java )  just to enable Mapper.run to call 
 LinRecordReader.getCurrentKey2() and other access methods for the second 
 file. 
 
 
 I would really appreciate it if anyone could give me a bit advice or just 
 point me to a direction as to where the problem might be, 
 
 Thanks
 
 Yunming 
 
 LineRecordReader.java


Re: Input Split vs Task vs attempt vs computation

2013-09-27 Thread Sonal Goyal
Inline

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Fri, Sep 27, 2013 at 10:42 AM, Sai Sai saigr...@yahoo.in wrote:

 Hi
 I have a few questions i am trying to understand:

 1. Is each input split same as a record, (a rec can be a single line or
 multiple lines).


An InputSplit is a chunk of input that is handled by a map task. It will
generally contain multiple records. The RecordReader provides the key
values to the map task. Check
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/InputSplit.html


 2. Is each Task a collection of few computations or attempts.

 For ex: if i have a small file with 5 lines.
 By default there will be 1 line on which each map computation is performed.
 So totally 5 computations r done on 1 node.

 This means JT will spawn 1 JVM for 1 Tasktracker on a node
 and another JVM for map task which will instantiate 5 map objects 1 for
 each line.

 I am not sure what you mean by 5 map objects. But yes, the mapper will be
invoked 5 times, once for each line.


 The MT JVM is called the task, which will have 5 attempts, one for each line.
 This means an attempt is the same as a computation.

 Please let me know if anything is incorrect.
 Thanks
 Sai




Re: Retrieve and compute input splits

2013-09-27 Thread Sonal Goyal
The input splits are not copied, only the information on the location of
the splits is copied to the jobtracker so that it can assign tasktrackers
which are local to the split.

Check the Job Initialization section at
http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/

To create the list of tasks to run, the job scheduler first retrieves the
input splits computed by the JobClient from the shared filesystem (step 6).
It then creates one map task for each split. The number of reduce tasks to
create is determined by the mapred.reduce.tasks property in the JobConf,
which is set by the setNumReduceTasks() method, and the scheduler simply
creates this number of reduce tasks to be run. Tasks are given IDs at this
point.

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai saigr...@yahoo.in wrote:

 Hi
 I have attached the anatomy of MR from definitive guide.

 In step 6 it says JT/Scheduler  retrieve  input splits computed by the
 client from hdfs.

 In the above line it refers to as the client computes input splits.

 1. Why does the JT/Scheduler retrieve the input splits and what does it do.
 If it is retrieving the input split does this mean it goes to the block
 and reads each record
 and gets the record back to JT. If so this is a lot of data movement for
 large files.
 which does not preserve data locality, so I am getting confused.

 2. How does the client know how to calculate the input splits.

 Any help please.
 Thanks
 Sai



Re: hadoop download path missing

2012-08-24 Thread Sonal Goyal
I just tried and could go to
http://apache.techartifact.com/mirror/hadoop/common/hadoop-2.0.1-alpha/

Is this still happening for you?

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Fri, Aug 24, 2012 at 8:59 PM, Steven Willis swil...@compete.com wrote:

 All the links at: http://www.apache.org/dyn/closer.cgi/hadoop/common/ are
 returning 404s, even the backup site at:
 http://www.us.apache.org/dist/hadoop/common/. However, the eu site:
 http://www.eu.apache.org/dist/hadoop/common/ does work.

 -Steven Willis



Re: Sending data to all reducers

2012-08-23 Thread Sonal Goyal
Hamid,

I would recommend taking a relook at your current algorithm and making sure
you are utilizing the MR framework to its strengths. You can evaluate
having multiple passes for your map reduce program, or doing a map side
join. You mention runtime is important for your system, so make sure you
preserve data locality in the generated tasks.

HTH.

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Thu, Aug 23, 2012 at 6:50 PM, Hamid Oliaei oli...@gmail.com wrote:

 Hi,

 I'll take a look at that; hope it can be useful for my purpose.

 Thank you so much.

 Hamid





Re: About many user accounts in hadoop platform

2012-08-23 Thread Sonal Goyal
Hi,

Do your users want different versions of Hadoop? Or can they share the same
hadoop cluster and schedule their jobs? If the latter, Hadoop can be
configured to run for multiple users, and each user can submit their data
and jobs to the same cluster. Hence you can maintain a single cluster and
utilize your resources more efficiently. You can read more here:

http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html

http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
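
For example, on an MR1 cluster the fair scheduler can be switched on with something
like the following in mapred-site.xml (a sketch only; the fairscheduler contrib jar
must be on the JobTracker classpath, and details vary by distribution):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>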

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Fri, Aug 24, 2012 at 9:13 AM, Li Shengmei lisheng...@ict.ac.cn wrote:

 Hi, all

   There are many users on the hadoop platform. Can they install their
  own hadoop version on the same cluster platform? 

  I tried to do this but failed. There existed a user account and that
  user installed his hadoop. I created another account and installed hadoop for it. The
 logs display “ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
 java.net.BindException: Problem binding to hadoop01/10.3.1.91:9000 :
 Address already in use”. So I change the port no. to 8000, but still
 failed. 

 When I “start-all.sh”, the namenode can’t start, the logs display “ERROR
 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
 as:lismhadoop cause:java.net.BindException: Address already in use”

 Can anyone give some suggestions? 

 Thanks,

 May



Re: Allow setting of end-of-record delimiter for TextInputFormat

2012-06-18 Thread Sonal Goyal
Hi,

The record delimiter is not to be specified while copying the file, but
when you run the map reduce job. Just copy the file and specify the
delimiter at the time of the job run.
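
For example (a sketch; the job name is a placeholder), set it on the job configuration
before submitting, per MAPREDUCE-2254:

Configuration conf = new Configuration();
// TextInputFormat will now break records on \r\n instead of \n
conf.set("textinputformat.record.delimiter", "\r\n");
Job job = new Job(conf, "delimiter-test");

A -D on the job command line also works if the driver goes through
ToolRunner/GenericOptionsParser, but it has no effect on fs -copyFromLocal.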

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Mon, Jun 18, 2012 at 6:31 PM, 안의건 ahneui...@gmail.com wrote:

 Hello. I'm trying to test the new patch, 'Allow setting of end-of-record
 delimiter for TextInputFormat'.

 TextInputFormat may now split lines with delimiters other than newline,
 by specifying a configuration parameter textinputformat.record.delimiter

[MAPREDUCE-2254]

 Now I'm using the following command in hadoop-0.23.0, but it makes same
 result when it was done in hadoop-0.20.0.

 hadoop fs -Dtextinputformat.record.delimiter=\r\n -copyFromLocal
 /tmp/test.txt /tmp

  Is there any configuration I'm missing? Why isn't it working?


 Thank you



Re: hadoop ecosystem

2012-01-28 Thread Sonal Goyal
Crux reporting for hbase can also be included.

Sonal


Sent from my iPad

On 28-Jan-2012, at 11:40 PM, Chris K Wensel ch...@wensel.net wrote:

 PyCascading
 Scalding
 Cascading.JRuby
 Bixo
 
 Strictly speaking, those plus Cascalog (below) are on top of Cascading, which 
 is of course on top of Hadoop, but all of which have independent developer 
 teams (@ twitter, Scale Unlimited, Etsy, etc).
 
 On Jan 28, 2012, at 7:59 AM, Ayad Al-Qershi wrote:
 
 I'm compiling a list of all Hadoop ecosystem/sub projects ordered 
 alphabetically and I need your help if I missed something.
 
 Ambari
 Avro
 Cascading
 Cascalog
 Cassandra
 Chukwa
 Elastic Map Reduce
 Flume
 Hadoop common
 Hama
 Hbase
 Hcatalog
 HDFS
 hiho
 Hive
 Hoop
 Hue
 Jaql
 Mahout
 MapReduce
 Nutch
 Oozie
 Pig
 Sqoop
 Zookeeper
 Your help is highly appreciated.
 
 Thanks,
 
 Iyad
 
 
 --
 Chris K Wensel
 ch...@concurrentinc.com
 http://concurrentinc.com
 


Re: Hbase + mapreduce -- operational design question

2011-09-10 Thread Sonal Goyal
Chinmay, how are you configuring your job? Have you checked using setScan
and selecting the keys you care to run MR over? See

http://ofps.oreilly.com/titles/9781449396107/mapreduce.html
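
For instance, a driver-side sketch (the table name, timestamp variable and mapper types
are placeholders) that narrows what reaches the mappers:

Scan scan = new Scan();
scan.setCaching(500);
// only rows written since the last run are scanned
scan.setTimeRange(lastRunTimestamp, System.currentTimeMillis());
TableMapReduceUtil.initTableMapperJob("events", scan,
    MyMapper.class, Text.class, IntWritable.class, job);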

As a shameless plug - For your reports, see if you want to leverage Crux:
https://github.com/sonalgoyal/crux

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Sat, Sep 10, 2011 at 2:53 PM, Eugene Kirpichov ekirpic...@gmail.com wrote:

 I believe HBase has some kind of TTL (timeout-based expiry) for
 records and it can clean them up on its own.

 On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay
 chinm...@qualcomm.com wrote:
  Hello,
  I have a setup where a bunch of clients store 'events' in an Hbase table
 . Also, periodically(once a day), I run a mapreduce job that goes over the
 table and computes some reports.
 
   Now my issue is that the next time I don't want the mapreduce job to process
  the 'events' that it has already processed previously. I know that I can
  mark processed events in the hbase table and the mapper can filter them
 out during the next run. But what I would really like/want is that
 previously processed events don't even hit the mapper.
 
  One solution I can think of is to backup the hbase table after running
 the job and then clear the table. But this has lot of problems..
  1) Clients may have inserted events while the job was running.
  2) I could disable and drop the table and then create it again...but then
 the clients would complain about this short window of unavailability.
 
 
  What do people using Hbase (live) + mapreduce typically do. ?
 
  Thanks!
  Chinmay
 
 



 --
 Eugene Kirpichov
 Principal Engineer, Mirantis Inc. http://www.mirantis.com/
 Editor, http://fprog.ru/



Re: No Mapper but Reducer

2011-09-07 Thread Sonal Goyal
I don't think that is possible. Can you explain in what scenario you want to
have no mappers, only reducers?

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Sep 7, 2011 at 1:18 PM, Bejoy KS bejoy.had...@gmail.com wrote:

 Hi
    I'm having a query here. Is it possible to have no mappers but
  reducers alone? AFAIK, if we need to avoid the triggering of reducers we can
  set numReduceTasks to zero, but such a setting on the mapper won't work. So how
 can it be achieved if possible?

 Thank You

 Regards
 Bejoy.K.S



Re: Too many maps?

2011-09-06 Thread Sonal Goyal
Mark,

Having a large number of emitted key values from the mapper should not be a
problem. Just make sure that you have enough reducers to handle the data so
that the reduce stage does not become a bottleneck.
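
For example, in the driver (the number is only illustrative):

// spread the shuffle of the many emitted key/values across more reducers
job.setNumReduceTasks(20);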

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Sep 7, 2011 at 8:44 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Harsh,

 I read one PST file, which contains many emails. But then I emit many maps,
 like this

MapWritable mapWritable = createMapWritable(metadata, fileName);
// use MD5 of the input file as Hadoop key
FileInputStream fileInputStream = new FileInputStream(fileName);
MD5Hash key = MD5Hash.digest(fileInputStream);
fileInputStream.close();
// emit map
context.write(key, mapWritable);

  and it is these context.write calls that I have a great number of. Is that a
 problem?

 Mark

 On Tue, Sep 6, 2011 at 10:06 PM, Harsh J ha...@cloudera.com wrote:

  You can use an input format that lets you read multiple files per map
  (like say, all local files. See CombineFileInputFormat for one
  implementation that does this). This way you get reduced map #s and
  you don't really have to clump your files. One record reader would be
  initialized per file, so I believe you should be free to generate
  unique identities per file/email with this approach (whenever a new
  record reader is initialized)?
 
  On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner markkerz...@gmail.com
  wrote:
   Hi,
  
   I am testing my Hadoop-based FreeEed http://frd.org/, an open
  source
   tool for eDiscovery, and I am using the Enron data
   set
 http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2
  for
   that. In my processing, each email with its attachments becomes a map,
   and it is later collected by a reducer and written to the output. With
  the
   (PST) mailboxes of around 2-5 Gigs, I begin to the see the numbers of
  emails
   of about 50,000. I remember in Yahoo best practices that the number of
  maps
   should not exceed 75,000, and I can see that I can break this barrier
  soon.
  
   I could, potentially, combine a few emails into one map, but I would be
   doing it only to circumvent the size problem, not because my processing
   requires it. Besides, my keys are the MD5 hashes of the files, and I
 use
   them to find duplicates. If I combine a few emails into a map, I cannot
  use
   the hashes as keys in a meaningful way anymore.
  
   So my question is, can't I have millions of maps, if that's how many
   artifacts I need to process, and why not?
  
   Thank you. Sincerely,
   Mark
  
 
 
 
  --
  Harsh J
 



Re: I keep getting multiple values for unique reduce keys

2011-09-05 Thread Sonal Goyal
Could you share your mapper code and the container code? When your mapper
emits the keys and values, do you print them out to see that they are
correct, that is, the container only has data specific to that id?

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Tue, Sep 6, 2011 at 10:41 AM, Rick Ross r...@semanticresearch.com wrote:

 I'm still poking around on this and I was wondering if there is a way to
 see the intermediate files that the mapper writes and the ones that the
  reducer reads. I might get some clues in there.

 Thanks

 R

 On Sep 4, 2011, at 10:14 PM, Rick Ross wrote:

 Thanks, but unless I misread you, that didn't do it. Naturally the
 object that I am creating just has a couple of ArrayLists to gather up Name
 and Type objects.

 I suspect I need to extend ArrayWritable instead.   I'll try that next.

 Cheers.

 R

 On Sep 4, 2011, at 9:37 PM, Sudharsan Sampath wrote:

 Hi,

 I suspect it's something to do with your custom Writable. Do you have a
 clear method on your container? If so, that should be used before the obj is
 initialized every time to avoid retaining previous values due to object
 reuse during ser-de process.
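
In code, a minimal sketch of that clear-before-read pattern (the field is invented for
illustration) looks like:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

public class Container implements Writable {
  private List<String> names = new ArrayList<String>();

  public void readFields(DataInput in) throws IOException {
    names.clear();                 // reset state: Hadoop reuses Writable instances
    int count = in.readInt();
    for (int i = 0; i < count; i++) {
      names.add(in.readUTF());
    }
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(names.size());
    for (String name : names) {
      out.writeUTF(name);
    }
  }
}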

 Thanks
 Sudhan S



  On Mon, Sep 5, 2011 at 6:11 AM, Rick Ross r...@semanticresearch.com wrote:

 Hi all,

  I have ensured that my mapper produces a unique key for every value it
  writes and furthermore that each map() call only writes one value. I
  note here that the value is a custom type for which I handle the Writable
 interface methods.

 I realize that it isn't very real world to have (well, want) no combining
 done prior to reducing, but I'm still getting my feet wet.

 When the reducer runs, I expected to see one reduce() call for every map()
  call, and I do. However, the value I get is the composite of all the
 reduce() calls that came before it.

 So, for example, the mapper gets data like this :

   ID, Name,  Type,  Other stuff...
   A000,   Cream, Group, ...
   B231,   Led Zeppelin,  Group, ...
   A044,   Liberace,  Individual,...


 ID is the external key from the source data and is guaranteed to be
 unique.

 When I map it, I create a container for the row data and output that
 container with all the data from that row only and use the ID field as a
 key.

 Since the key is always unique I expected the sort/shuffle step to never
  coalesce any two values. So I expected my reduce() method to be called
 once per mapped input row, and it is.

 The problem is, as each row is processed, the reducer sees a set of
 cumulative value data instead of a container with a row of data in it.  So
 the 'value' parameter to reduce always has the information from previous
 reduce steps.

 For example, given the data above :

 1st Reducer Call :
   Key = A000
   Value =
   Container :
  (object 1) : Name = Cream, Type = Group, MBID = A000, ...

 2nd Reducer Call :
   Key = B231
   Value =
   Container :
  (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
  (object 2) : Name = Cream, Type = Group, MBID = A000, ...

 So the second reduce call has data in it from the first reduce call.
 Very strange!   At a guess I would say the reducer is re-using the object
 when it reads the objects back from the mapping step.  I dunno..

 If anyone has any ideas, I'm open to suggestions.  0.20.2-cdh3u0

 Thanks!

 R









Re: Hadoop in process?

2011-08-26 Thread Sonal Goyal
Hi Frank,

You can use the ClusterMapReduceTestCase class from org.apache.hadoop.mapred.

Here is an example of adapting it to Junit4 and running test dfs and
cluster.

https://github.com/sonalgoyal/hiho/blob/master/test/co/nubetech/hiho/common/HihoTestCase.java

And here is a blog post that discusses this in detail:
http://nubetech.co/testing-hadoop-map-reduce-jobs
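
If you prefer to drive it directly from a main() without JUnit, a bare-bones sketch with
MiniDFSCluster (from the hadoop test jar; the constructor shown is the 0.20.x one) could
look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class HdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // one in-process datanode, freshly formatted, default racks
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
    try {
      FileSystem fs = cluster.getFileSystem();
      fs.create(new Path("/tmp/hello.txt")).close();
      System.out.println("exists: " + fs.exists(new Path("/tmp/hello.txt")));
    } finally {
      cluster.shutdown();
    }
  }
}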

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Sat, Aug 27, 2011 at 12:00 AM, Frank Astier fast...@yahoo-inc.comwrote:

 Hi -

 Is there a way I can start HDFS (the namenode) from a Java main and run
 unit tests against that? I need to integrate my Java/HDFS program into unit
 tests, and the unit test machine might not have Hadoop installed. I’m
 currently running the unit tests by hand with hadoop jar ... My unit tests
 create a bunch of (small) files in HDFS and manipulate them. I use the fs
 API for that. I don’t have map/reduce jobs (yet!).

 Thanks!

 Frank



Re: Configuration settings

2011-06-21 Thread Sonal Goyal
Hi Mark,

You can take a look at
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
and
http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
to configure your cluster. Along with the maximum map and reduce tasks per node, you
can change the child JVM heap size, dfs.datanode.max.xcievers, etc. A good practice
is to understand what kind of map reduce programming you will be doing (are your
tasks CPU bound or memory bound?) and change your base cluster settings accordingly.
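
As a starting point (the numbers are only illustrative for 12GB quad-core nodes, not a
recommendation), the relevant knobs live in mapred-site.xml:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>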

Best Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Jun 22, 2011 at 6:16 AM, Mark static.void@gmail.com wrote:

 We have a small 4 node clusters that have 12GB of ram and the cpus are Quad
 Core Xeons.

 I'm assuming the defaults aren't that generous so what are some
 configuration changes I should make to take advantage of this hardware? Max
 map task? Max reduce tasks? Anything else?

 Thanks



Re: Retrying connect error while configuring hadoop

2011-04-12 Thread Sonal Goyal
Are your datanode and namenode machines able to see each other (ping etc.)?
Is /etc/hosts configured correctly? Is the namenode process (seen through
jps on the master) up?

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Tue, Apr 12, 2011 at 11:19 AM, prasunb prasun.bhattachar...@tcs.com wrote:


 Hello,

 I am trying to configure Hadoop in fully distributed mode on three virtual
  Fedora machines. During configuration I am not getting any errors. Even when I
  am executing the script start-dfs.sh, there aren't any errors.

  But practically the namenode isn't able to connect to the datanodes. These are
  the error snippets from the hadoop-root-datanode-hadoop2.log files of
 both datanodes

 ==

 2011-04-08 15:33:03,549 INFO
 org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
 set up for Hadoop, not re-installing.
 2011-04-08 15:33:03,691 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to hadoop1/192.168.161.198:8020 failed on local exception:
 java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:342)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:317)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:297)
at

 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:338)
at
 org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:280)
at

 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1527)
at

 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
at

 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1485)
at

 org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1610)
at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1620)
 Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at

 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at

 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at

 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:375)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:812)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:720)

 2011-04-08 15:33:03,692 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at hadoop2/127.0.0.1
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = hadoop2/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.2-CDH3B4
 STARTUP_MSG:   build =  -r 3aa7c91592ea1c53f3a913a581dbfcdfebe98bfe;
 compiled by 'root' on Mon Feb 21 17:31:12 EST 2011
 /
 2011-04-08 15:47:46,738 INFO
 org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already
 set up for Hadoop, not re-installing.
 2011-04-08 15:47:47,839 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: hadoop1/192.168.161.198:8020. Already tried 0 time(s).
 2011-04-08 15:47:48,849 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: hadoop1/192.168.161.198:8020. Already tried 1 time(s).
 2011-04-08 15:47:49,859 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: hadoop1/192.168.161.198:8020. Already tried 2 time(s).
 2011-04-08 15:47:50,869 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: 

Re: Hadoop cluster couldn't run map reduce job

2011-03-13 Thread Sonal Goyal
Can you check your /etc/hosts to see that all master and slave entries are
correct? If you up the logs to DEBUG, you will see where this is failing.
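
(For instance, you can restart the daemons with HADOOP_ROOT_LOGGER=DEBUG,console, or raise
the level of a running daemon with hadoop daemonlog -setlevel <host:port> <classname> DEBUG;
the exact invocation can vary between versions.)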

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Mon, Mar 14, 2011 at 9:41 AM, Yorgo Sun yorgo...@gmail.com wrote:

 Hi all

 I have a hadoop cluster with a namenode and 3 datanodes, I've installed it
 by normal process. everything's fine, but it couldn't run wordcount map
 reduce job. Follow are output logs


 [hadoop@namenode hadoop-0.20.2]$ hadoop jar hadoop-0.20.2-examples.jar
 wordcount /user/root/text.log /user/output1
 11/03/14 12:07:13 INFO input.FileInputFormat: Total input paths to process
 : 1
 11/03/14 12:07:13 INFO mapred.JobClient: Running job: job_201103141205_0001
 11/03/14 12:07:14 INFO mapred.JobClient:  map 0% reduce 0%
 11/03/14 12:07:22 INFO mapred.JobClient:  map 100% reduce 0%
 11/03/14 12:07:27 INFO mapred.JobClient: Task Id :
 attempt_201103141205_0001_r_00_0, Status : FAILED
 Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:796)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)
 at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)

 11/03/14 12:07:33 INFO mapred.JobClient: Task Id :
 attempt_201103141205_0001_r_00_1, Status : FAILED
 Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:796)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)
 at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)

 11/03/14 12:07:40 INFO mapred.JobClient: Task Id :
 attempt_201103141205_0001_r_00_2, Status : FAILED
 Error: java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:796)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)
 at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)

 11/03/14 12:07:49 INFO mapred.JobClient: Job complete:
 job_201103141205_0001
 11/03/14 12:07:49 INFO mapred.JobClient: Counters: 12
 11/03/14 12:07:49 INFO mapred.JobClient:   Job Counters
 11/03/14 12:07:49 INFO mapred.JobClient: Launched reduce tasks=4
 11/03/14 12:07:49 INFO mapred.JobClient: Launched map tasks=1
 11/03/14 12:07:49 INFO mapred.JobClient: Data-local map tasks=1
 11/03/14 12:07:49 INFO mapred.JobClient: Failed reduce tasks=1
 11/03/14 12:07:49 INFO mapred.JobClient:   FileSystemCounters
 11/03/14 12:07:49 INFO mapred.JobClient: HDFS_BYTES_READ=1366
 11/03/14 12:07:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1868
 11/03/14 12:07:49 INFO mapred.JobClient:   Map-Reduce Framework
 11/03/14 12:07:49 INFO mapred.JobClient: Combine output records=131
 11/03/14 12:07:49 INFO mapred.JobClient: Map input records=31
 11/03/14 12:07:49 INFO mapred.JobClient: Spilled Records=131
 11/03/14 12:07:49 INFO mapred.JobClient: Map output bytes=2055
 11/03/14 12:07:49 INFO mapred.JobClient: Combine input records=179
 11/03/14 12:07:49 INFO mapred.JobClient: Map output records=179

 Is there anyone have this problem too? please help me. thanks a lot.

 --
 孙绍轩 Yorgo Sun




Re: Hadoop EC2 setup

2011-03-13 Thread Sonal Goyal
Please make sure that the AWS EC2 command line tools are installed and the
environment variables EC2_HOME, EC2_CERT, EC2_PRIVATE_KEY and PATH are set.
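
A typical setup looks roughly like this (paths are placeholders for wherever the API tools
and key files were unpacked):

export EC2_HOME=/path/to/ec2-api-tools
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=/path/to/pk-XXXX.pem
export EC2_CERT=/path/to/cert-XXXX.pem

After that, ec2-describe-instances should work from the shell before the Hadoop scripts
try to call it.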

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Sun, Mar 13, 2011 at 9:21 PM, Jason Trost jason.tr...@gmail.com wrote:

 You may also want to try Apache Whirr.  I found this to be very straight
 forward and easy to quickly deploy a fully functional Hadoop cluster.
 http://incubator.apache.org/whirr/quick-start-guide.html

 http://incubator.apache.org/whirr/quick-start-guide.html--Jason

 On Sat, Mar 12, 2011 at 6:19 PM, Chris K Wensel ch...@wensel.net wrote:

  Unless you have a specific need to run a specific Hadoop distro, you
 might
  consider just using Amazon Elastic MapReduce. You can always come back to
  rolling your own, at that time you might look at Whirr.
 
  http://aws.amazon.com/elasticmapreduce/
  http://incubator.apache.org/whirr/
 
  ckw
 
  On Mar 11, 2011, at 5:04 PM, JJ siung wrote:
 
   Hi,
  
   I am following a setup guide here:
  http://wiki.apache.org/hadoop/AmazonEC2 but
   runs into problems when I tried to launch a cluster.
   An error message said
   hadoop-0.21.0/common/src/contrib/ec2/bin/launch-hadoop-master: line
 40:
   ec2-describe-instances: command not found
   I am not even sure if I edited the hadoop-ec2-env.sh correctly. Is
 there
  any
   newer tutorial for setting this up?
  
   Thanks!
 
  --
  Chris K Wensel
  ch...@concurrentinc.com
  http://www.concurrentinc.com
 
  -- Concurrent, Inc. offers mentoring, and support for Cascading
 
 



Dataset comparison and ranking - views

2011-03-07 Thread Sonal Goyal
Hi,

I am working on a problem to compare two different datasets, and rank each
record of the first with respect to the other, in terms of how similar they
are. The records are dimensional, but do not have a lot of dimensions. Some
of the fields will be compared for exact matches, some for similar sound,
some with closest match etc. One of the datasets is large, and the other is
much smaller.  The final goal is to compute a rank between each record of
first dataset with each record of the second. The rank is based on weighted
scores of each dimension comparison.

I was wondering if people in the community have any advice/suggested
patterns/thoughts about cross joining two datasets in map reduce. Do let me
know if you have any suggestions.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal


Re: Dataset comparison and ranking - views

2011-03-07 Thread Sonal Goyal
Hi Marcos,

Thanks for replying. I think I was not very clear in my last post. Let me
describe my use case in detail.

I have two datasets coming from different sources, lets call them dataset1
and dataset2. Both of them contain records for entities, say Person. A
single record looks like:

First Name Last Name,  Street, City, State,Zip

We want to compare each record of dataset1 with each record of dataset2, in
effect a cross join.

We know that the way the data is collected, names will not match exactly, but we
want to find close-enough matches. So we have a rule which says create bigrams and
find the matching bigrams. If 0 to 5 match, give a score of 10; if 5-15
match, give a score of 20; and so on.
For Zip, we have a rule saying that an exact match, or being within 5 kms of each
other (through a lookup), gives a score of 50, and so on.

Once we have each person of dataset1 compared with that of dataset2, we find
the overall rank, which is a weighted average of the name, address, etc.
comparison scores.

One approach is to use the DistributedCache for the smaller dataset and do a
nested loop join in the mapper. The second approach is to use multiple  MR
flows, and compare the fields and reduce/collate the results.
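
As a rough sketch of the first approach (file name, delimiter and the score() weighting
are placeholders; the file would be added in the driver with
DistributedCache.addCacheFile(new URI("/user/hadoop/dataset2.csv"), job.getConfiguration())),
the smaller dataset is shipped with the job and held in memory per mapper:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrossJoinMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private List<String[]> smallSet = new ArrayList<String[]>();

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = reader.readLine()) != null) {
      smallSet.add(line.split(","));      // keep the small dataset in memory
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] rec1 = value.toString().split(",");
    for (String[] rec2 : smallSet) {
      // score() stands in for the weighted bigram/zip/etc comparison
      context.write(new Text(rec1[0] + "|" + rec2[0]),
                    new DoubleWritable(score(rec1, rec2)));
    }
  }

  private double score(String[] rec1, String[] rec2) {
    return 0.0;   // placeholder for the actual weighted scoring rules
  }
}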

I am curious to know if people have other approaches they have implemented,
what are the efficiencies they have built up etc.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz mlor...@uci.cu wrote:

 On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
  Hi,
 
  I am working on a problem to compare two different datasets, and rank
  each record of the first with respect to the other, in terms of how
  similar they are. The records are dimensional, but do not have a lot
  of dimensions. Some of the fields will be compared for exact matches,
  some for similar sound, some with closest match etc. One of the
  datasets is large, and the other is much smaller.  The final goal is
  to compute a rank between each record of first dataset with each
  record of the second. The rank is based on weighted scores of each
  dimension comparison.
 
  I was wondering if people in the community have any advice/suggested
  patterns/thoughts about cross joining two datasets in map reduce. Do
  let me know if you have any suggestions.
 
  Thanks and Regards,
  Sonal
  Hadoop ETL and Data Integration
  Nube Technologies

 Regards, Sonal. Can you give us more information about a basic workflow
 of your idea?

 Some questions:
 - How do you know that two records are identical? By id?
  - Can you give an example of the ranking that you want to achieve with a
  match of each case:
  - two records that are identical
  - two records that are similar
 - two records with the closest match

  For MapReduce algorithm design, I recommend this excellent post from
  Ricky Ho:

 http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

 For the join of the two datasets, you can use Pig for this. Here you
 have a basic Pig example from Milind Bhandarkar
 (mili...@yahoo-inc.com)'s talk Practical Problem Solving with Hadoop
 and Pig:
  Users = load 'users' as (name, age);
  Filtered = filter Users by age >= 18 and age <= 25;
  Pages = load 'pages' as (user, url);
  Joined = join Filtered by name, Pages by user;
  Grouped = group Joined by url;
  Summed = foreach Grouped generate group,
 COUNT(Joined) as clicks;
  Sorted = order Summed by clicks desc;
  Top5 = limit Sorted 5;
  store Top5 into 'top5sites';


 --
  Marcos Luís Ortíz Valmaseda
  Software Engineer
  Centro de Tecnologías de Gestión de Datos (DATEC)
  Universidad de las Ciencias Informáticas
  http://uncubanitolinuxero.blogspot.com
  http://www.linkedin.com/in/marcosluis2186





Re: Setting java.library.path for map-reduce job

2011-02-28 Thread Sonal Goyal
Hi Adarsh,

I think your mapred.cache.files property has an extra space at the end. Try
removing that and let us know how it goes.
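
In other words, the value should read something like
<value>hdfs://192.168.0.131:54310/jcuda.jar#jcuda.jar</value>, with no whitespace
anywhere inside the value element.
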
Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Mon, Feb 28, 2011 at 5:06 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Thanks Sanjay, it seems I found the root cause.

 But I end up with the following error:

 [hadoop@ws37-mah-lin hadoop-0.20.2]$ bin/hadoop jar wordcount1.jar
 org.myorg.WordCount /user/hadoop/gutenberg /user/hadoop/output1
 Exception in specified URI's java.net.URISyntaxException: Illegal character
 in path at index 36: hdfs://192.168.0.131:54310/jcuda.jar
   at java.net.URI$Parser.fail(URI.java:2809)
   at java.net.URI$Parser.checkChars(URI.java:2982)
   at java.net.URI$Parser.parseHierarchical(URI.java:3066)
   at java.net.URI$Parser.parse(URI.java:3014)
   at java.net.URI.init(URI.java:578)
   at
 org.apache.hadoop.util.StringUtils.stringToURI(StringUtils.java:204)
   at
 org.apache.hadoop.filecache.DistributedCache.getCacheFiles(DistributedCache.java:593)
   at
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:638)
   at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
   at org.myorg.WordCount.main(WordCount.java:59)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 Exception in thread main java.lang.NullPointerException
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:176)
   at
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:506)
   at
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:640)
   at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
   at org.myorg.WordCount.main(WordCount.java:59)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 Please check my attached mapred-site.xml


  Thanks & best regards,

 Adarsh Sharma



 Kaluskar, Sanjay wrote:

 You will probably have to use distcache to distribute your jar to all
 the nodes too. Read the distcache documentation; Then on each node you
 can add the new jar to the java.library.path through
 mapred.child.java.opts.

 You need to do something like the following in mapred-site.xml, where
 fs-uri is the URI of the file system (something like
 host.mycompany.com:54310).

  <property>
   <name>mapred.cache.files</name>
   <value>hdfs://fs-uri/jcuda/jcuda.jar#jcuda.jar </value>
  </property>
  <property>
   <name>mapred.create.symlink</name>
   <value>yes</value>
  </property>
  <property>
   <name>mapred.child.java.opts</name>
   <value>-Djava.library.path=jcuda.jar</value>
  </property>


 -Original Message-
 From: Adarsh Sharma [mailto:adarsh.sha...@orkash.com] Sent: 28 February
 2011 16:03
 To: common-user@hadoop.apache.org
 Subject: Setting java.library.path for map-reduce job

 Dear all,

  I want to set some extra jars in java.library.path, used while running a
  map-reduce program in the Hadoop cluster.

  I got an exception entitled "no jcuda in java.library.path" in each map
 task.

 I run my map-reduce code by below commands :

 javac -classpath
 /home/hadoop/project/hadoop-0.20.2/hadoop-0.20.2-core.jar://home/hadoop/
 project/hadoop-0.20.2/jcuda_1.1_linux64/jcuda.jar:/home/hadoop/project/h
 adoop-0.20.2/lib/commons-cli-1.2.jar
 -d wordcount_classes1/ WordCount.java

 jar -cvf wordcount1.jar -C wordcount_classes1/ .

 bin/hadoop jar wordcount1.jar org.myorg.WordCount /user/hadoop/gutenberg
 /user/hadoop/output1


 Please guide how to achieve this.



  Thanks & best Regards,

 Adarsh Sharma






Re: Setting java.library.path for map-reduce job

2011-02-28 Thread Sonal Goyal
Adarsh,

Are you trying to distribute both the native library and the jcuda.jar?
Could you please explain your job's dependencies?
Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Mon, Feb 28, 2011 at 6:54 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Sonal Goyal wrote:

 Hi Adarsh,

 I think your mapred.cache.files property has an extra space at the end.
 Try
 removing that and let us know how it goes.
 Thanks and Regards,
 Sonal
 Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





  Thanks a lot Sonal, but it doesn't succeed.
  Please, if possible, tell me the proper steps that need to be followed
  after configuring the Hadoop cluster.

 I don't believe that a simple commands succeeded as

 [root@cuda1 hadoop-0.20.2]# javac EnumDevices.java
 [root@cuda1 hadoop-0.20.2]# java EnumDevices
 Total number of devices: 1
 Name: Tesla C1060
 Version: 1.3
 Clock rate: 1296000 MHz
 Threads per block: 512


 but in Map-reduce job it fails :

 11/02/28 18:42:47 INFO mapred.JobClient: Task Id :
 attempt_201102281834_0001_m_01_2, Status : FAILED
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
   at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:569)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
   at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
   at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
   at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
   ... 3 more
 Caused by: java.lang.UnsatisfiedLinkError: no jcuda in java.library.path
   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
   at java.lang.Runtime.loadLibrary0(Runtime.java:823)
   at java.lang.System.loadLibrary(System.java:1028)
   at jcuda.driver.CUDADriver.clinit(CUDADriver.java:909)
   at jcuda.CUDA.init(CUDA.java:62)
   at jcuda.CUDA.init(CUDA.java:42)




  Thanks & best Regards,

 Adarsh Sharma



 On Mon, Feb 28, 2011 at 5:06 PM, Adarsh Sharma adarsh.sha...@orkash.com
 wrote:



 Thanks Sanjay, it seems i found the root cause.

 But I result in following error:

 [hadoop@ws37-mah-lin hadoop-0.20.2]$ bin/hadoop jar wordcount1.jar
 org.myorg.WordCount /user/hadoop/gutenberg /user/hadoop/output1
 Exception in specified URI's java.net.URISyntaxException: Illegal
 character
 in path at index 36: hdfs://192.168.0.131:54310/jcuda.jar
  at java.net.URI$Parser.fail(URI.java:2809)
  at java.net.URI$Parser.checkChars(URI.java:2982)
  at java.net.URI$Parser.parseHierarchical(URI.java:3066)
  at java.net.URI$Parser.parse(URI.java:3014)
  at java.net.URI.init(URI.java:578)
  at
 org.apache.hadoop.util.StringUtils.stringToURI(StringUtils.java:204)
  at

 org.apache.hadoop.filecache.DistributedCache.getCacheFiles(DistributedCache.java:593)
  at

 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:638)
  at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
  at org.myorg.WordCount.main(WordCount.java:59)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 Exception in thread main java.lang.NullPointerException
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:176)
  at

 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:506)
  at

 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:640)
  at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
  at org.myorg.WordCount.main(WordCount.java:59)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method

Re: Setting java.library.path for map-reduce job

2011-02-28 Thread Sonal Goyal
Hi Adarsh,

Have you placed jcuda.jar in HDFS? Your configuration says

hdfs://192.168.0.131:54310/jcuda.jar
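
(If the jar is not there yet, something along the lines of hadoop fs -put jcuda.jar /jcuda.jar
would place it at the path the configuration points to.)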


Thanks and Regards,
Sonal
Hadoop ETL and Data Integration https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Tue, Mar 1, 2011 at 9:34 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Sonal Goyal wrote:

 Adarsh,

 Are you trying to distribute both the native library and the jcuda.jar?
 Could you please explain your job's dependencies?



  Yes, of course. I am trying to run a JCuda program in the Hadoop cluster; I am
  able to run it through simple javac & java commands on a standalone
  machine by setting the PATH & LD_LIBRARY_PATH variables to the */usr/local/cuda/lib*
   & */home/hadoop/project/jcuda_1.1_linux* folders.

  I listed the contents & jars in these directories:

 [hadoop@cuda1 lib]$ pwd
 /usr/local/cuda/lib
 [hadoop@cuda1 lib]$ ls -ls
 total 158036
      4 lrwxrwxrwx 1 root root       14 Feb 23 19:37 libcublas.so -> libcublas.so.3
      4 lrwxrwxrwx 1 root root       19 Feb 23 19:37 libcublas.so.3 -> libcublas.so.3.2.16
  81848 -rwxrwxrwx 1 root root 83720712 Feb 23 19:37 libcublas.so.3.2.16
      4 lrwxrwxrwx 1 root root       14 Feb 23 19:37 libcudart.so -> libcudart.so.3
      4 lrwxrwxrwx 1 root root       19 Feb 23 19:37 libcudart.so.3 -> libcudart.so.3.2.16
    424 -rwxrwxrwx 1 root root   423660 Feb 23 19:37 libcudart.so.3.2.16
      4 lrwxrwxrwx 1 root root       13 Feb 23 19:37 libcufft.so -> libcufft.so.3
      4 lrwxrwxrwx 1 root root       18 Feb 23 19:37 libcufft.so.3 -> libcufft.so.3.2.16
  27724 -rwxrwxrwx 1 root root 28351780 Feb 23 19:37 libcufft.so.3.2.16
      4 lrwxrwxrwx 1 root root       14 Feb 23 19:37 libcurand.so -> libcurand.so.3
      4 lrwxrwxrwx 1 root root       19 Feb 23 19:37 libcurand.so.3 -> libcurand.so.3.2.16
   4120 -rwxrwxrwx 1 root root  4209384 Feb 23 19:37 libcurand.so.3.2.16
      4 lrwxrwxrwx 1 root root       16 Feb 23 19:37 libcusparse.so -> libcusparse.so.3
      4 lrwxrwxrwx 1 root root       21 Feb 23 19:37 libcusparse.so.3 -> libcusparse.so.3.2.16
  43048 -rwxrwxrwx 1 root root 44024836 Feb 23 19:37 libcusparse.so.3.2.16
    172 -rwxrwxrwx 1 root root   166379 Nov 25 11:29 libJCublas-linux-x86_64.so
    152 -rwxrwxrwx 1 root root   144179 Nov 25 11:29 libJCudaDriver-linux-x86_64.so
     16 -rwxrwxrwx 1 root root     8474 Mar 31  2009 libjcudafft.so
    136 -rwxrwxrwx 1 root root   128672 Nov 25 11:29 libJCudaRuntime-linux-x86_64.so
     80 -rwxrwxrwx 1 root root    70381 Mar 31  2009 libjcuda.so
     44 -rwxrwxrwx 1 root root    38039 Nov 25 11:29 libJCudpp-linux-x86_64.so
     44 -rwxrwxrwx 1 root root    38383 Nov 25 11:29 libJCufft-linux-x86_64.so
     48 -rwxrwxrwx 1 root root    43706 Nov 25 11:29 libJCurand-linux-x86_64.so
    140 -rwxrwxrwx 1 root root   133280 Nov 25 11:29 libJCusparse-linux-x86_64.so

 And the second folder as :

 [hadoop@cuda1 jcuda_1.1_linux64]$ pwd

 /home/hadoop/project/hadoop-0.20.2/jcuda_1.1_linux64
 [hadoop@cuda1 jcuda_1.1_linux64]$ ls -ls
 total 200
 8 drwxrwxrwx 6 hadoop hadoop  4096 Feb 24 01:44 doc
 8 drwxrwxrwx 3 hadoop hadoop  4096 Feb 24 01:43 examples
 32 -rwxrwxr-x 1 hadoop hadoop 28484 Feb 24 01:43 jcuda.jar
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcublas.so.3
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcublas.so.3.2.16
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcudart.so.3
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcudart.so.3.2.16
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcufft.so.3
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcufft.so.3.2.16
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcurand.so.3
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcurand.so.3.2.16
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcusparse.so.3
 4 -rw-rw-r-- 1 hadoop hadoop 0 Mar  1 21:27 libcusparse.so.3.2.16
 16 -rwxr-xr-x 1 hadoop hadoop  8474 Mar  1 04:12 libjcudafft.so
 80 -rwxr-xr-x 1 hadoop hadoop 70381 Mar  1 04:11 libjcuda.so
 8 -rwxrwxr-x 1 hadoop hadoop   811 Feb 24 01:43 README.txt
 8 drwxrwxrwx 2 hadoop hadoop  4096 Feb 24 01:43 resources
 [hadoop@cuda1 jcuda_1.1_linux64]$

 I think Hadoop would not able to recognize *jcuda.jar* in Tasktracker
 process. Please guide me how to make it available in it.


 Thanks  best Regards,
 Adrash Sharma


  Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoHadoop ETL and Data
 Integrationhttps://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





 On Mon, Feb 28, 2011 at 6:54 PM, Adarsh Sharma adarsh.sha...@orkash.com
 wrote:



 Sonal Goyal wrote:



 Hi Adarsh,

 I think your mapred.cache.files property has an extra space at the end.
 Try
 removing that and let us know how it goes.
 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoHadoop ETL and Data
 Integrationhttps://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com

Re: easiest way to install hadoop

2011-02-22 Thread Sonal Goyal
You can also check Apache Whirr.

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Feb 23, 2011 at 11:48 AM, Pavan yarapa...@gmail.com wrote:


 Yes, Cloudera distribution helps you the most if you are not quite familiar
 with hadoop and its ecosystem. Good documentation. CDH3 Beta 4 was released
 recently. Having said this, if you are looking for evaluation/testing
 purposes, I suggest you try the virtual appliance from cloudera first,
 before you make any final decisions:
 http://cloudera-vm.s3.amazonaws.com/cloudera-demo-0.3.5.tar.bz2?downloads

 We are using Red Hat (and thus I assume Cent OS) in PRODUCTION and are
 quite happy. I personally use cloudera in my ubuntu laptop and equally
 happy.

 *Pavan Yara*
 ***@yarapavan
 *


 On Wed, Feb 23, 2011 at 9:20 AM, Nick Jones darel...@gmail.com wrote:

 I found Cloudera's distribution easy to use, but it's the only thing I
 tried.

 Nick



 On Tue, Feb 22, 2011 at 9:42 PM, real great..
 greatness.hardn...@gmail.com wrote:
  Hi,
  Very trivial question.
  Which is the easiest way to install hadoop?
  i mean which distribution should i go for?? apache or cloudera?
  n which is the easiest os for hadoop?
 
  --
  Regards,
  R.V.
 





Re: Best practice for batch file conversions

2011-02-08 Thread Sonal Goyal
You can check out MultipleOutputFormat for this.
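
A rough sketch of that approach (old mapred API). It assumes the mapper emits each
file's relative input path, e.g. dir1/file1, as the key; for binary data the same
override works on MultipleSequenceFileOutputFormat or a custom MultipleOutputFormat
subclass:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MirroredOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {
    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        // e.g. key "dir1/file1" becomes dir1/file1.done under the job output path
        return key.toString() + ".done";
    }
}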
Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Feb 9, 2011 at 5:22 AM, felix gao gre1...@gmail.com wrote:

 I am stuck again. The binary files are stored in hdfs under some
 pre-defined structure like
 root/
 |-- dir1
 |   |-- file1
 |   |-- file2
 |   `-- file3
 |-- dir2
 |   |-- file1
 |   `-- file3
 `-- dir3
 |-- file2
 `-- file3

 after I processed them somehow using Non-splittable InputFormat in my
 mapper, I would like to store the files back into HDFS like
 processed/
 |-- dir1
 |   |-- file1.done
 |   |-- file2.done
 |   `-- file3.done
 |-- dir2
 |   |-- file1.done
 |   `-- file3.done
 `-- dir3
 |-- file2.done
 `-- file3.done

 can someone please show me how to do this?

 thanks,

 Felix

 On Tue, Feb 8, 2011 at 9:43 AM, felix gao gre1...@gmail.com wrote:

 thanks a lot for the pointer. I will play around with it.


 On Mon, Feb 7, 2011 at 10:55 PM, Sonal Goyal sonalgoy...@gmail.comwrote:

 Hi,

 You can use FileStreamInputFormat which returns the file stream as the
 value.


 https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

 You need to remember that you lose data locality by trying to manipulate
 the file as a whole, but in your case, the requirement probably demands it.

 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
 Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal






 On Tue, Feb 8, 2011 at 8:59 AM, Harsh J qwertyman...@gmail.com wrote:

 Extend FileInputFormat, and write your own binary-format based
 implementation of it, and make it non-splittable (isSplitable should
 return false). This way, a Mapper would get a whole file, and you
 shouldn't have block-splitting issues.

 On Tue, Feb 8, 2011 at 6:37 AM, felix gao gre1...@gmail.com wrote:
  Hello users of hadoop,
  I have a task to convert large binary files from one format to
 another.  I
  am wondering what is the best practice to do this.  Basically, I am
 trying
  to get one mapper to work on each binary file and i am not sure how to
 do
  that in hadoop properly.
  thanks,
  Felix



 --
 Harsh J
 www.harshj.com







Re: Multiple queues question

2011-02-07 Thread Sonal Goyal
I think the CapacityScheduler is the one to use with multiple queues, see
http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html
Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Sat, Feb 5, 2011 at 7:21 PM, Robert Grandl rgra...@student.ethz.chwrote:

 Hi all,

 I am trying to submit jobs to different queues in hadoop-0.20.2

 I configured conf/mapred-site.xml
 <property>
   <name>mapred.queue.names</name>
   <value>queue1, queue2</value>
 </property>

 Then I was starting wordcount job with following configuration:

 <configuration>
   <property>
     <name>mapred.job.queue.name</name>
     <value>queue1</value>
   </property>
 </configuration>

 I am trying to run Hadoop with default FIFO scheduler.

 However the wordcount job appears on 
 http://localhost:50030/jobtracker.jspthat is was submitted
 to both queue1 and queue2.

 Did I forgot to configure something ?

 Thank you very much,
 Robert





Re: Best practice for batch file conversions

2011-02-07 Thread Sonal Goyal
Hi,

You can use FileStreamInputFormat which returns the file stream as the
value.

https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

You need to remember that you lose data locality by trying to manipulate the
file as a whole, but in your case, the requirement probably demands it.
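
For reference, a minimal whole-file input format along those lines (plain new-API
Hadoop, nothing hiho-specific; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Each map task receives exactly one whole file, delivered as a single record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split, so a file is never broken across map tasks
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                fileSplit = (FileSplit) s;
                conf = ctx.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FSDataInputStream in = file.getFileSystem(conf).open(file);
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}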

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Tue, Feb 8, 2011 at 8:59 AM, Harsh J qwertyman...@gmail.com wrote:

 Extend FileInputFormat, and write your own binary-format based
 implementation of it, and make it non-splittable (isSplitable should
 return false). This way, a Mapper would get a whole file, and you
 shouldn't have block-splitting issues.

 On Tue, Feb 8, 2011 at 6:37 AM, felix gao gre1...@gmail.com wrote:
  Hello users of hadoop,
  I have a task to convert large binary files from one format to another.
  I
  am wondering what is the best practice to do this.  Basically, I am
 trying
  to get one mapper to work on each binary file and i am not sure how to do
  that in hadoop properly.
  thanks,
  Felix



 --
 Harsh J
 www.harshj.com



Re: Hadoop XML Error

2011-02-07 Thread Sonal Goyal
Mike,

This error is not related to malformed XML files you are trying to copy; it means
that, for some reason, the source or destination listing cannot be retrieved or
parsed. Are you trying to copy between different versions of clusters? As far as
I know, the destination should be writable (hftp is read-only), and distcp should
be run from the destination cluster. See more here:
http://hadoop.apache.org/common/docs/r0.20.2/distcp.html

Let us know how it goes.

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Mon, Feb 7, 2011 at 9:21 PM, Korb, Michael [USA] korb_mich...@bah.comwrote:

 I am running two instances of Hadoop on a cluster and want to copy all the
 data from hadoop1 to the updated hadoop2. From hadoop2, I am running the
 command hadoop distcp -update hftp://mc1:50070/ hftp://mc0:50070/;
 where mc1 is the namenode of hadoop1 and mc0 is the namenode of
 hadoop2. I get the following error:

 11/02/07 10:12:31 INFO tools.DistCp: srcPaths=[hftp://mc1:50070/]
 11/02/07 10:12:31 INFO tools.DistCp: destPath=hftp://mc0:50070/
 [Fatal Error] :1:215: XML document structures must start and end within the
 same entity.
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: java.io.IOException: invalid xml directory content
at
 org.apache.hadoop.hdfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:350)
at
 org.apache.hadoop.hdfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:355)
at
 org.apache.hadoop.hdfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:384)
at org.apache.hadoop.tools.DistCp.sameFile(DistCp.java:1227)
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1120)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
 Caused by: org.xml.sax.SAXParseException: XML document structures must
 start and end within the same entity.
at
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
at
 org.apache.hadoop.hdfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:344)
... 9 more

 I am fairly certain that none of the XML files are malformed or corrupted.
 This thread (
 http://www.mail-archive.com/core-dev@hadoop.apache.org/msg18064.html)
 discusses a similar problem caused by file permissions but doesn't seem to
 offer a solution. Any help would be appreciated.

 Thanks,
 Mike



Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-20 Thread Sonal Goyal
Moving this offline from the list.

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Thu, Jan 20, 2011 at 2:18 PM, Joan joan.monp...@gmail.com wrote:

 Hi Sonal,

 I've downloaded hiho project and I can see that hiho's a
 DBInputAvroMapper.java very interesting.

 I want to read from DB using this Mapper and its Reducer can write
 serialize object too. How can I do?

 After I want create other job that its Mapper reads the output (serialize
 object) from previous Reducer. How can I do?

 Thanks Sonal,


 Joan


 2011/1/20 Sonal Goyal sonalgoy...@gmail.com

 Which hadoop version are you on?

 You can alternatively try using hiho from
 https://github.com/sonalgoyal/hiho  to get your data from the db. Please
 write to me directly if you need any help there.


 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
 Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





 On Thu, Jan 20, 2011 at 1:03 PM, Joan joan.monp...@gmail.com wrote:

 Hi Sonal,

 I put both configurations:

 job.getConfiguration().set("mapreduce.job.maps", "4");
 job.getConfiguration().set("mapreduce.map.tasks", "4");

 But both configurations don't run. I also try to set mapred.map.task
 but It neither run.

 Joan

 2011/1/20 Sonal Goyal sonalgoy...@gmail.com

 Joan,

 You should be able to set the mapred.map.tasks property to the maximum
 number of mappers you want. This can control parallelism.

 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
 Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal






 On Wed, Jan 19, 2011 at 9:32 PM, Joan joan.monp...@gmail.com wrote:

 Hi,

 I want to reduce number of splits because I think that I get many
 splits and I want to reduce these splits.
 While my job is running I can see:

 *INFO mapreduce.Job:  map ∞% reduce 0%*

 I'm using DataDrivenDBInputFormat:
 setInput

 public static void setInput(Job job,
     Class<? extends DBWritable> inputClass,
     String tableName,
     String conditions,
     String splitBy,
     String... fieldNames)

 Note that the orderBy column is called the splitBy in this version. We reuse
 the same field, but it's not strictly ordering it -- just partitioning the
 results.

 So I get all data from myTable and I try to split by date column. I
 obtain milions rows and I supose that DataDrivenDBInputFormat generates 
 many
 splits and i don't know how to reduce this splits or how to indicates to
 DataDrivenDBInputFormat splits by my date column (corresponds to splitBy).

 The main goal's improve performance, so I want to my Map's faster.


 Can someone help me?

 Thanks

 Joan










Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-19 Thread Sonal Goyal
Joan,

You should be able to set the mapred.map.tasks property to the maximum
number of mappers you want. This can control parallelism.
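
For instance (the property was renamed between releases, so both spellings are set
here; whether the input format honours it also depends on the version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitCap {
    public static Job newCappedJob(int maxMaps) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.map.tasks", maxMaps);    // 0.20.x property name
        conf.setInt("mapreduce.job.maps", maxMaps);  // 0.21+ property name
        return new Job(conf, "db-import");
    }
}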

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Jan 19, 2011 at 9:32 PM, Joan joan.monp...@gmail.com wrote:

 Hi,

 I want to reduce number of splits because I think that I get many splits
 and I want to reduce these splits.
 While my job is running I can see:

 *INFO mapreduce.Job:  map ∞% reduce 0%*

 I'm using DataDrivenDBInputFormat:
 setInput

 public static void setInput(Job job,
     Class<? extends DBWritable> inputClass,
     String tableName,
     String conditions,
     String splitBy,
     String... fieldNames)

 Note that the orderBy column is called the splitBy in this version. We reuse
 the same field, but it's not strictly ordering it -- just partitioning the
 results.

 So I get all data from myTable and I try to split by date column. I obtain
 milions rows and I supose that DataDrivenDBInputFormat generates many splits
 and i don't know how to reduce this splits or how to indicates to
 DataDrivenDBInputFormat splits by my date column (corresponds to splitBy).

 The main goal's improve performance, so I want to my Map's faster.


 Can someone help me?

 Thanks

 Joan






Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-19 Thread Sonal Goyal
Which hadoop version are you on?

You can alternatively try using hiho from https://github.com/sonalgoyal/hiho
to get your data from the db. Please write to me directly if you need any
help there.

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Thu, Jan 20, 2011 at 1:03 PM, Joan joan.monp...@gmail.com wrote:

 Hi Sonal,

 I put both configurations:

 job.getConfiguration().set("mapreduce.job.maps", "4");
 job.getConfiguration().set("mapreduce.map.tasks", "4");

 But both configurations don't run. I also try to set mapred.map.task but
 It neither run.

 Joan

 2011/1/20 Sonal Goyal sonalgoy...@gmail.com

 Joan,

 You should be able to set the mapred.map.tasks property to the maximum
 number of mappers you want. This can control parallelism.

 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
 Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal






 On Wed, Jan 19, 2011 at 9:32 PM, Joan joan.monp...@gmail.com wrote:

 Hi,

 I want to reduce number of splits because I think that I get many splits
 and I want to reduce these splits.
 While my job is running I can see:

 *INFO mapreduce.Job:  map ∞% reduce 0%*

 I'm using DataDrivenDBInputFormat:
 setInput

 public static void setInput(Job job,
     Class<? extends DBWritable> inputClass,
     String tableName,
     String conditions,
     String splitBy,
     String... fieldNames)

 Note that the orderBy column is called the splitBy in this version. We reuse
 the same field, but it's not strictly ordering it -- just partitioning the
 results.

 So I get all data from myTable and I try to split by date column. I
 obtain milions rows and I supose that DataDrivenDBInputFormat generates many
 splits and i don't know how to reduce this splits or how to indicates to
 DataDrivenDBInputFormat splits by my date column (corresponds to splitBy).

 The main goal's improve performance, so I want to my Map's faster.


 Can someone help me?

 Thanks

 Joan








Re: Import data from mysql

2011-01-08 Thread Sonal Goyal
Hi Brian,

You can check HIHO at https://github.com/sonalgoyal/hiho which can help you
load data from any JDBC database to the Hadoop file system. If your table
has a date or id field, or any indicator for modified/newly added rows, you
can import only the altered rows every day. Please let me know if you need
help.
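
As an illustration of the incremental idea with the stock DataDrivenDBInputFormat
(this is not hiho's own API; the table, columns and connection details below are
made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class IncrementalImportSetup {

    // Minimal row class for a hypothetical "orders" table.
    public static class OrderRow implements Writable, DBWritable {
        long id;
        String customer;
        String modifiedAt;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            customer = rs.getString("customer");
            modifiedAt = rs.getString("modified_at");
        }
        public void write(PreparedStatement ps) throws SQLException {
            // not needed for an import-only job
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            customer = in.readUTF();
            modifiedAt = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(customer);
            out.writeUTF(modifiedAt);
        }
    }

    public static void configure(Job job, String lastRunTime) {
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb", "user", "password");
        DataDrivenDBInputFormat.setInput(job, OrderRow.class,
                "orders",                                 // table
                "modified_at > '" + lastRunTime + "'",    // conditions: only the altered rows
                "id",                                     // split column
                "id", "customer", "modified_at");         // columns to pull
        job.setInputFormatClass(DataDrivenDBInputFormat.class);
    }
}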

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Sun, Jan 9, 2011 at 5:03 AM, Brian McSweeney
brian.mcswee...@gmail.comwrote:

 Hi folks,

 I'm a TOTAL newbie on hadoop. I have an existing webapp that has a growing
 number of rows in a mysql database that I have to compare against one
 another once a day from a batch job. This is an exponential problem as
 every
 row must be compared against every other row. I was thinking of
 parallelizing this computation via hadoop. As such, I was thinking that
 perhaps the first thing to look at is how to bring info from a database to
 a
 hadoop job and vise versa. I have seen the following relevant info

 https://issues.apache.org/jira/browse/HADOOP-2536

 and also

 http://architects.dzone.com/articles/tools-moving-sql-database

 any advice on what approach to use?

 cheers,
 Brian



Re: How to manage large record in MapReduce

2011-01-07 Thread Sonal Goyal
Jerome,

You can take a look at FileStreamInputFormat at
https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

This provides an input stream per file. In our case, we are using the input
stream to load data into the database directly. Maybe you can use this or a
similar approach for working with your videos.
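
The rough shape of such a reader with the stock APIs (this is not hiho's actual
FileStreamInputFormat, only an illustration of handing the mapper an open stream;
it would be returned by a non-splittable FileInputFormat, one file per split):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class StreamRecordReader extends RecordReader<Text, FSDataInputStream> {

    private FileSplit split;
    private Configuration conf;
    private FSDataInputStream stream;
    private boolean processed = false;

    @Override
    public void initialize(InputSplit s, TaskAttemptContext context) {
        split = (FileSplit) s;
        conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        Path file = split.getPath();
        stream = file.getFileSystem(conf).open(file);  // opened, never buffered in memory
        processed = true;
        return true;
    }

    @Override
    public Text getCurrentKey() {
        return new Text(split.getPath().toString());   // key: the file path
    }

    @Override
    public FSDataInputStream getCurrentValue() {
        return stream;                                 // value: the open stream
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        if (stream != null) {
            stream.close();
        }
    }
}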

HTH

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre jthie...@gmail.com wrote:

 Hi,

 we are currently using Hadoop (version 0.20.2) to manage some web archiving
 processes like fulltext indexing, and it works very well with small records
 that contains html.
 Now, we would like to work with other type of web data like videos. These
 kind of data could be really large and of course these records doesn't fit
 in memory.

 Is it possible to manage record which content doesn't reside in memory but
 on disk.
 A possibility would be to implements a Writable that read its content from
 a
 DataInput but doesn't load it in memory, instead it would copy that content
 to a temporary file in the local file system and allows to stream its
 content using an InputStream (an InputStreamWritable).

 Has somebody tested a similar approach, and if not do you think some big
 problems could happen (that impacts performance) with this method ?

 Thanks,

 Jérôme Thièvre



Re: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2011-01-07 Thread Sonal Goyal
Which Hadoop versions are you compiling against and running on? This error usually
means the job was compiled against a release where TaskAttemptContext is a class
(0.20.x) but run on one where it is an interface (0.21 and later).

Thanks and Regards,
Sonal
https://github.com/sonalgoyal/hihoConnect Hadoop with databases,
Salesforce, FTP servers and others https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Jan 5, 2011 at 3:20 PM, Cavus,M.,Fa. Post Direkt 
m.ca...@postdirekt.de wrote:

 Hi,
 I get this, did anyone know why I get an Error?:


 11/01/05 10:46:55 WARN conf.Configuration: fs.checkpoint.period is
 deprecated. Instead, use dfs.namenode.checkpoint.period
 11/01/05 10:46:55 WARN conf.Configuration: mapred.map.tasks is
 deprecated. Instead, use mapreduce.job.maps
 11/01/05 10:46:55 INFO mapreduce.JobSubmitter: number of splits:1
 11/01/05 10:46:55 INFO mapreduce.JobSubmitter: adding the following
 namenodes' delegation tokens:null
 11/01/05 10:46:56 INFO mapreduce.Job: Running job: job_201101051016_0008
 11/01/05 10:46:57 INFO mapreduce.Job:  map 0% reduce 0%
 11/01/05 10:47:04 INFO mapreduce.Job:  map 100% reduce 0%
 11/01/05 10:47:13 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_0, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:23 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_1, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:34 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_2, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:47 INFO mapreduce.Job: Job complete:
 job_201101051016_0008
 11/01/05 10:47:47 INFO mapreduce.Job: Counters: 19
FileSystemCounters
FILE_BYTES_WRITTEN=38
HDFS_BYTES_READ=69
Job Counters
Data-local map tasks=1
Total time spent by all maps waiting after reserving
 slots (ms)=0
Total time spent by all reduces waiting after reserving
 slots (ms)=0
Failed reduce tasks=1
SLOTS_MILLIS_MAPS=5781
SLOTS_MILLIS_REDUCES=6379
Launched map tasks=1
Launched reduce tasks=4
Map-Reduce Framework
Combine input records=0
Failed Shuffles=0
GC time elapsed (ms)=97
Map input records=0
Map output bytes=0
Map output records=0
Merged Map outputs=0
Spilled Records=0
SPLIT_RAW_BYTES=69
 11/01/05 10:47:47 INFO zookeeper.ZooKeeper: Session: 0x12d555a4ed80018
 closed




Re: Dumping Cassandra into Hadoop

2010-10-19 Thread Sonal Goyal
Have you checked https://issues.apache.org/jira/browse/CASSANDRA-913 ?
Thanks and Regards,
Sonal

Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





On Tue, Oct 19, 2010 at 8:31 PM, Mark static.void@gmail.com wrote:

  As the subject implies I am trying to dump Cassandra rows into Hadoop.
 What is the easiest way for me to accomplish this? Thanks.

 Should I be looking into pig for something like this?



Re: Help for Sqlserver querying with hadoop

2010-09-25 Thread Sonal Goyal
Biju,

Have you tried using DataDrivenDBInputFormat?
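
For reference, a rough configuration sketch (new mapreduce API, shipped with 0.21+
and some 0.20 distributions; it reuses the MyRecord class from your snippet).
DataDrivenDBInputFormat builds bounding WHERE clauses on the split column instead
of LIMIT/OFFSET, so the generated SQL also runs on SQL Server:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class SqlServerInputSetup {
    public static void configure(Job job) {
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                "jdbc:sqlserver://xxx.xxx.xxx.xxx;user=abc;password=abc;DatabaseName=dbname");
        DataDrivenDBInputFormat.setInput(job, MyRecord.class,
                "urls",        // table
                null,          // conditions
                "id",          // split column used for the bounding queries
                "id", "url");  // fields
        job.setInputFormatClass(DataDrivenDBInputFormat.class);
    }
}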

Thanks and Regards,
Sonal

Sonal Goyal | Founder and CEO | Nube Technologies LLP
Ph: +91-8800541717 | so...@nubetech.co | Skype: sonal.goyal
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





On Fri, Sep 24, 2010 at 2:06 PM, Biju .B bijub...@gmail.com wrote:

 Hi

 Need urgent help on using sql server with hadoop

 am using following code to connect to database


 DBConfiguration.configureDB(conf, "com.microsoft.sqlserver.jdbc.SQLServerDriver",
     "jdbc:sqlserver://xxx.xxx.xxx.xxx;user=abc;password=abc;DatabaseName=dbname");
 String[] fields = { "id", "url" };
 DBInputFormat.setInput(conf, MyRecord.class, "urls", null, "id", fields);

 Am getting following error

 10/09/24 13:26:42 INFO mapred.JobClient: Task Id :
 attempt_201009231924_0008_m_01_2, Status : FAILED
 java.io.IOException: Incorrect syntax near 'LIMIT'.
at

 org.apache.hadoop.mapreduce.lib.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
at

 org.apache.hadoop.mapreduce.lib.db.DBRecordReader.next(DBRecordReader.java:204)
at

 org.apache.hadoop.mapred.lib.db.DBInputFormat$DBRecordReaderWrapper.next(DBInputFormat.java:118)
at

 org.apache.hadoop.mapred.lib.db.DBInputFormat$DBRecordReaderWrapper.next(DBInputFormat.java:87)
at

 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



 Found that the error is due to query that each task tries to execute

 SELECT id, url FROM urls AS urls ORDER BY id LIMIT 13228 OFFSET 13228


 the LIMIT and OFFSET are not valid in Sqlserver and it returns error

 Pls tell me how to solve this problem

 Regards
 Biju



Re: Hadoop 0.21.0 release Maven repo

2010-09-12 Thread Sonal Goyal
Thanks Tom.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


On Sat, Sep 11, 2010 at 4:20 AM, Tom White t...@cloudera.com wrote:

 Hi Sonal,

 The 0.21.0 jars are not available in Maven yet, since the process for
 publishing them post split has changed.
 See HDFS-1292 and MAPREDUCE-1929.

 Cheers,
 Tom

 On Fri, Sep 10, 2010 at 1:33 PM, Sonal Goyal sonalgoy...@gmail.com
 wrote:
  Hi,
 
  Can someone please point me to the Maven repo for 0.21 release? Thanks.
 
  Thanks and Regards,
  Sonal
  www.meghsoft.com
  http://in.linkedin.com/in/sonalgoyal
 



Hadoop 0.21.0 release Maven repo

2010-09-10 Thread Sonal Goyal
Hi,

Can someone please point me to the Maven repo for 0.21 release? Thanks.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


Re: How to mount/proxy a db table in hive

2010-08-02 Thread Sonal Goyal
Hi Amit,

Hive needs data to be stored in its own namespace. Can you please explain
why you want to call the database through Hive ?

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


On Mon, Aug 2, 2010 at 11:56 AM, amit jaiswal amit_...@yahoo.com wrote:

 Hi,

 I have a database and am looking for a way to 'mount' the db table in hive
 in
 such a way that the select query in hive gets translated to sql query for
 database. I saw DBInputFormat and sqoop, but nothing that can create a
 proxy
 table in hive which internally makes db calls.

 I also tried to use custom variant of DBInputFormat as the input format for
 the
 database table.

 create table employee (id int, name string) stored as INPUTFORMAT
 'mycustominputformat' OUTPUTFORMAT
 'org.apache.hadoop.mapred.SequenceFileOutputFormat';

 select id from employee;
 This fails while running hadoop job because HiveInputFormat only supports
 FileSplits.

 HiveInputFormat:
public long getStart() {
  if (inputSplit instanceof FileSplit) {
return ((FileSplit)inputSplit).getStart();
  }
  return 0;
}

 Any suggestions as if there are any InputFormat implementation that can be
 used?

 -amit



Re: mapreduce for proxy log file analysis

2010-08-01 Thread Sonal Goyal
Hi,

Have you checked Hive? Seems to fit your needs perfectly.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


On Sun, Aug 1, 2010 at 1:40 AM, Bright D L brigh...@gmail.com wrote:

 Hi all,
I am doing a simple project to analyze http proxy server logs by
 hadoop mapreduce approach (in Java). The log file contains logs for a week
 or some times more than that.
I  have following requirements:
1) Find the top 50 bandwidth consumers (IPs) for each day
2) Find the hour of the day where there is maximum bandwidth
 utilization
Please help me out with some directions. Sample code is highly
 appreciated.
 Thank you all,
 Bright


Re: Hive JDBC Connection Timeout

2010-06-17 Thread Sonal Goyal
See if this works:

DriverManager.setLoginTimeout(...);


Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Thu, Jun 17, 2010 at 10:20 PM, T2thenike t2then...@gmail.com wrote:

 I am working with complex Hive queries and moderate amounts of data.  I am
 running into a problem where my JDBC connection is timing out before the
 Hive answer is returned.  The timeout seems to occur at about 3 minutes, but
 the query takes at least 5.  I'm running Hadoop 0.20.1, Hive 0.4.0, and I'd
 like to stick with these versions.  Is there a way to increase the
 connection timeout?

 I am setup using Hive and plain text files for inputs.  I'm getting the JDBC
 running using hive --service hiveserver, which enables the JDBC/Thrift
 interface.  The actual exception is a SQLException caused by a Thrift
 exception where the Connection timed out.

 I have already tried setting the SQL statement.setQueryTimeout(), which seem
 to have no effect.  Maybe there's a setting to increase the Thrift timeout,
 but I haven't been able to find it. Any suggestions?
 --
 View this message in context: 
 http://old.nabble.com/Hive-JDBC-Connection-Timeout-tp28916939p28916939.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Load data from xml using Mapper.py in hive

2010-06-10 Thread Sonal Goyal
Can you try changing your logging level to debug and see the exact
error message in hive.log?

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Thu, Jun 10, 2010 at 5:07 PM, Shuja Rehman shujamug...@gmail.com wrote:
 Hi
 I have try to do as you described. Let me explain in steps.

 1- create table test (xmlFile String);
 --

 2-LOAD DATA LOCAL INPATH '1.xml'
 OVERWRITE INTO TABLE test;
 --

 3-CREATE TABLE test_new (
     b STRING,
     c STRING
   )
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t';

 --
 4-add FILE sampleMapper.groovy;
 --
 5- INSERT OVERWRITE TABLE test_new
 SELECT
   TRANSFORM (xmlfile)
   USING 'sampleMapper.groovy'
   AS (b,c)
 FROM test;
 --
 XML FILE:
 xml file has only one row for testing purpose which is

 <xy><a><b>Hello</b><c>world</c></a></xy>
 --
 MAPPER
 and i have write the mapper in groovy to parse it. the mapper is

 def xmlData = ""
 System.in.withReader {
     xmlData = xmlData + it.readLine()
 }

 def xy = new XmlParser().parseText(xmlData)
 def b = xy.a.b.text()
 def c = xy.a.c.text()
 println([b, c].join('\t'))
 --
 Now step 1-4 are fine but when i perform step 5 which will load the data
 from test table to new table using mapper, it throws the error. The error on
 console is

 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.ExecDriver

 I am facing hard time. Any suggestions
 Thanks

 On Thu, Jun 10, 2010 at 3:05 AM, Ashish Thusoo athu...@facebook.com wrote:

 You could load this whole xml file into a table with a single row and a
 single column. The default record delimiter is \n but you can create a table
 where the record delimiter is \001. Once you do that you can follow the
 approach that you described below. Will this solve your problem?

 Ashish
 
 From: Shuja Rehman [mailto:shujamug...@gmail.com]
 Sent: Wednesday, June 09, 2010 3:07 PM
 To: hive-user@hadoop.apache.org
 Subject: Load data from xml using Mapper.py in hive

 Hi
 I have created a table in hive (Suppose table1 with two columns, col1 and
 col2 )

 now i have an xml file for which i have write a python script which read
 the xml file and transform it in single row with tab seperated
 e.g the output of python script can be

 row 1 = val1 val2
 row2 =  val3 val4

 so the output of file has straight rows with the help of python script.
 now i want to load this into created table. I have seen the example of in
 which the data is first loaded in u_data table then transform it using
 python script in u_data_new but in m scenario. it does not fit as i have xml
 file as source.


 Kindly let me know can I achieve this??
 Thanks

 --

 --
 Regards
 Baig




Re: How to apply RDBMS table updates and deletes into Hadoop

2010-06-09 Thread Sonal Goyal
Hi Atreju,

You have a very valid use case here. Data changes in your database,
and you want to pull in only the changes to Hadoop. Have you
considered query based data retrieval from the RDBMS to Hadoop?  As
you already have a date field in your tables which marks the changed
rows, you can query on that field and get only the changed records to
Hadoop.

I have been working on an open source framework for incremental
updates and fetching such records to Hadoop. You can check

http://code.google.com/p/hiho/
http://code.google.com/p/hiho/wiki/DatabaseImportFAQ

If you have any questions or need any changes, please send me an
offline mail.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Wed, Jun 9, 2010 at 4:39 AM, Yongqiang He heyongqiang...@gmail.com wrote:
 Hi,

 I think hive’s join + transform could be helpful here.

 Thanks
 Yongqiang
 On 6/8/10 3:58 PM, Aaron Kimball aa...@cloudera.com wrote:

 I think that this might be the way to go. In general, folding updates and
 deletes into datasets is a difficult problem due to the append-only nature
 of datasets.

 Something that might help you here is to partition your tables in Hive based
 on some well-distributed key. Then if you have a relatively small number of
 partitions affected by an incremental import (perhaps more recently-imported
 records are more likely to be updated? in this case, partition the tables by
 the month/week you imported them?) you can only perform the fold-in of the
 new deltas on the affected partitions. This should be much faster than a
 full table scan.

 Have you seen the Sqoop tool? It handles imports and exports between HDFS
 (and Hive) and RDBMS systems --  but currently can only import new records
 (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
 available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
 0.21/trunk.

 This sort of capability is something I'm really interested in adding to
 Sqoop. If you've got a well-run process for doing this, I'd really
 appreciate your help adding this feature :) Send me an email off-list if
 you're interested. At the very least, I'd urge you to try out the tool.

 Cheers,
 - Aaron Kimball

 On Tue, Jun 8, 2010 at 8:54 PM, atreju n.atr...@gmail.com wrote:

 To generate smart output from base data we need to copy some base tables
 from relational database into Hadoop. Some of them are big. To dump the
 entire table into Hadoop everyday is not an option since there are like 30+
 tables and each would take several hours.

 The methodology that we approached is to get the entire table dump first.
 Then each day or every 4-6 hours get only insert/update/delete since the
 last copy from RDBMS (based on a date field in the table). Using Hive do
 outer join + union the new data with existing data and write into a new
 file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
 inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
 the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
 and write into a new file. The other applications like Pig or Hive will pick
 the most recent file to use when selecting/loading data from those base
 table data files.

 This logic is working fine in lower environments for small size tables. With
 production data, for about 30GB size table, the incremental re-generation of
 the file in Hadoop is still taking several hours. I tried using zipped
 version and it took even longer time. I am not convinced that this is the
 best we can do to handle updates and deletes since we had to re-write 29GB
 unchanged data of the 30GB file again into a new file. ...and this is not
 the biggest table.

 I am thinking that this should be problem for many companies. What are the
 other approaches to apply updates and deletes on base tables to the
 Hadoop data files?

 We have 4 data nodes and using version 20.3.

 Thanks!






Re: Problem with DBOutputFormat

2010-06-08 Thread Sonal Goyal
Hi Giridhar,

Which version of Hadoop are you using?

If you want, you can also load data to MySQL using the hiho framework at
http://code.google.com/p/hiho/

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Tue, Jun 8, 2010 at 3:02 PM, Giridhar Addepalli
giridhar.addepa...@komli.com wrote:
 Hi,



 I am trying to write output to MYSQL DB,

 I am getting following error



 java.io.IOException

     at
 org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:180)

     at
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)

     at
 org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)

     at org.apache.hadoop.mapred.Child.main(Child.java:170)



 I have mysql-connector-java-5.0.8-bin.jar  in lib/ directory inside hadoop
 home directory



 Please help,

 Giridhar.


Fwd: metastore set up with Oracle backend?

2010-05-25 Thread Sonal Goyal
Hi Pradeep,

You can check the config values for MySQL as a metastore for Hive at
the following link:

http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal


On Tue, May 25, 2010 at 10:00 PM, Pradeep Kamath prade...@yahoo-inc.com wrote:

 Hi Aaron,

   Can you share some details along the lines of what is described in 
 http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin#Local_Metastore 
 for MySQL? (Essentially what changes are needed in properties to get this 
 working)



 Thanks,

 Pradeep



 

 From: Aaron McCurry [mailto:amccu...@gmail.com]
 Sent: Monday, May 24, 2010 1:42 PM
 To: hive-user@hadoop.apache.org
 Subject: Re: metastore set up with Oracle backend?



 I have done it, everything seemed to work just fine.



 Aaron





 On Mon, May 24, 2010 at 4:37 PM, Pradeep Kamath prade...@yahoo-inc.com 
 wrote:

 Hi,

   Can hive metastore be setup with Oracle as backend without any code 
 changes? Has anyone tried this? Any pointers would be much appreciated.



 Thanks,

 Pradeep




Re: Need Working example for DBOutputFormat

2010-05-19 Thread Sonal Goyal
Hi Nishant,

If MySQL is your target database, you can check open source
http://code.google.com/p/hiho/ which uses load data infile for a faster
upload to the db.

Let me know if you need any help.

Thanks and Regards,
Sonal
www.meghsoft.com


On Wed, May 19, 2010 at 1:06 PM, Nishant Sonar nisha...@synechron.comwrote:

 Hello,

 Does any body has a working example of DBOutputformat. That connects to the
 DB Server (MYSQL) and then writes a record to the table.

 I tried by following the instruction on 
 http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/; as
 below but was getting an IOException.

 It will be great if anyone can send me example for hadoop 0.20.2 . The one
 below is for an earlier version.

 <!-- Runner Class -->

 public class EmployeeDBRunner {
     public static void main(String[] args) {
         Configuration configuration = new Configuration();
         JobConf jobConf = new JobConf(configuration, EmployeeDBRunner.class);
         DBConfiguration.configureDB(jobConf, "com.mysql.jdbc.Driver",
                 "jdbc:mysql://localhost/mydatabase", "myuser", "mypass");
         String[] fields = { "employee_id", "name" };
         DBOutputFormat.setOutput(jobConf, "employees", fields);

         JobConf conf = new JobConf(EmployeeDBRunner.class);
         conf.setJobName("Employee");
         FileInputFormat.addInputPath(conf, new Path(args[0])); // set input as file
         conf.setMapperClass(TokenMapper.class);
         conf.setReducerClass(DBReducer.class);
         conf.setOutputFormat(DBOutputFormat.class); // set output as DBOutputFormat to write to a table

         // Text, IntWritable
         conf.setMapOutputKeyClass(Text.class);
         conf.setMapOutputValueClass(IntWritable.class);

         // MyRecord, NullWritable
         conf.setOutputKeyClass(MyRecord.class);
         conf.setOutputValueClass(NullWritable.class);
         try {
             JobClient.runJob(conf);
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

 <!-- Mapper -->
 public class TokenMapper extends MapReduceBase implements
         Mapper<Object, Text, Text, IntWritable> {
     IntWritable single = new IntWritable(1);

     public void map(Object arg0, Text line,
             OutputCollector<Text, IntWritable> collector, Reporter arg3)
             throws IOException {
         StringTokenizer stk = new StringTokenizer(line.toString());
         while (stk.hasMoreTokens()) {
             Text token = new Text(stk.nextToken());
             collector.collect(token, single);
         }
     }
 }

 <!-- Reducer class -->
 public class DBReducer extends MapReduceBase implements
         org.apache.hadoop.mapred.Reducer<Text, IntWritable, MyRecord, NullWritable> {
     NullWritable n = NullWritable.get();

     public void reduce(Text key, Iterator<IntWritable> values,
             OutputCollector<MyRecord, NullWritable> output, Reporter reporter)
             throws IOException {
         long sum = 0;
         while (values.hasNext()) {
             values.next();
             sum++;
         }
         MyRecord mRecord = new MyRecord(sum, key.toString());
         System.out.println(mRecord.getName());
         output.collect(mRecord, n);
     }
 }





Re: How to add external jar file while running a hadoop program

2010-05-07 Thread Sonal Goyal
Akhil,

For the rejar to work, the to be included jar has to be in the lib folder of
the main jar.
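
In other words, a layout along these lines (illustrative):

ep.jar
|-- WordCount.class   (and any other job classes)
`-- lib/
    `-- stanford-parser.jar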

Thanks and Regards,
Sonal
www.meghsoft.com


On Fri, May 7, 2010 at 3:31 PM, akhil1988 akhilan...@gmail.com wrote:


 You need to jar the stanford-parser with your ep.jar
 For this you canunjar the stanford-parser.jar using

 jar -xvf stan...jar

 jar -cvf ep.jar stanford/directory ep/






 harshira wrote:
 
  am new to hadoop.
 
  I have a file Wordcount.java which refers hadoop.jar and
  stanford-parser.jar
 
  I am running the following commnad
 
  javac -classpath .:hadoop-0.20.1-core.jar:stanford-parser.jar -d ep
  WordCount.java
 
  jar cvf ep.jar -C ep .
 
  bin/hadoop jar ep.jar WordCount gutenburg gutenburg1
 
  After executing i am getting the following error:
 
  lang.ClassNotFoundException:
  edu.stanford.nlp.parser.lexparser.LexicalizedParser
 
  The class is in stanford-parser.jar ...
  I guess that different processes doesnt access this jar file . so how can
  this be acheived.
 
  Thanks
  Harshit
 

 --
 View this message in context:
 http://old.nabble.com/How-to-add-external-jar-file-while-running-a-hadoop-program-tp28481933p28484219.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: having a directory as input split

2010-05-04 Thread Sonal Goyal
One way to do this will be:

Create a DirectoryInputFormat which accepts the list of directories as
inputs and emits each directory path in one split. Your custom RecordReader
can then read this split and generate appropriate input for your mapper.
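
A rough sketch of such an input format (new mapreduce API; the property name and the
one-record-per-directory reader are only illustrative -- a real reader would iterate
over the files inside the directory):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DirectoryInputFormat extends InputFormat<Text, NullWritable> {

    // comma-separated list of directories, set by the job driver (hypothetical property)
    public static final String DIRS_PROPERTY = "directoryinputformat.dirs";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (String dir : context.getConfiguration().get(DIRS_PROPERTY).split(",")) {
            // FileSplit is reused here only as a carrier for the directory path
            splits.add(new FileSplit(new Path(dir.trim()), 0, 0, new String[0]));
        }
        return splits;
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<Text, NullWritable>() {
            private Text dir;
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                dir = new Text(((FileSplit) s).getPath().toString());
            }
            @Override
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;  // a real reader would emit one record per file in the directory
                return true;
            }
            @Override public Text getCurrentKey() { return dir; }
            @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}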

Thanks and Regards,
Sonal
www.meghsoft.com


On Fri, Apr 30, 2010 at 11:48 AM, akhil1988 akhilan...@gmail.com wrote:


 How can I make a directory as a InputSplit rather than a file. I want that
 the input split available to a map task should be a directory and not a
 file. And I will implement my own record reader which will read appropriate
 data from the directory and thus give the records to the map tasks.

 To explain in other words,
 I have a list of directories distributed over hdfs and I know that each of
 these directories is small enough to be present on a single node. I want
 that one directory to be given  to each map task rather than the files
 present in it. How to do this?

 Thanks,
  Akhil
 --
 View this message in context:
 http://old.nabble.com/having-a-directory-as-input-split-tp28408886p28408886.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Hbase Hive

2010-04-30 Thread Sonal Goyal
If you are looking for an ORM layer for HBase, there is one at

http://github.com/enis/gora

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, May 1, 2010 at 4:13 AM, Nick Dimiduk ndimi...@gmail.com wrote:

 If by efficiently, you mean low latency then no, you will not get
 ms-response time for your hive queries over hbase as the hive query planner
 still results in m/r jobs being run over the cluster.

 Hope that helps.

 Cheers,
 -Nick

 On Fri, Apr 30, 2010 at 9:55 AM, Jean-Daniel Cryans jdcry...@apache.org
 wrote:

  Inline (and added hbase-user to the recipients).
 
  J-D
 
  On Thu, Apr 29, 2010 at 9:23 PM, Amit Kumar amkumar@gmail.com
 wrote:
   Hi Everyone,
  
   I want to ask about Hbase and Hive.
  
   Q1 Is there any dialect available which can be used with Hibernate to
   create persistence with Hbase. Has somebody written one. I came across
  HBql
   at
 www.hbql.com. Can this be used to create a dialect for Hbase?
 
  HBQL queries HBase directly, but it's not SQL-compliant and doesn't
  feature relational keywords (since HBase doesn't support them, JOINs
  don't scale). I don't know if anybody tried integrating HBQL in
  Hibernate... it's still a very young project.
 
  
   Q2  Once the data is in there in Hbase. In this link I found that it
 can
  be
   used with Hive ( https://issues.apache.org/jira/browse/HIVE-705 ). So
  the
   question is is it safe enough to use the below architecture for
  application
   Hibernate -- Dialect for Hbase -- Hbase -- query from Hbase using
 Hive
  to
   use MapReduce effectively.
 
  Hive goes on top of HBase, so you can use its query language to mine
  HBase tables. Be aware that a MapReduce job isn't meant for live
  queries, so issuing them from Hibernate doesn't make much sense...
  unless you meant something else and this which case please do give
  more details.
 
  
   Thanks  Regards
   Amit Kumar
  
 



Re: counting pairs of items across item types

2010-04-25 Thread Sonal Goyal
Hi Sebastian.

With HIHO, you can supply a sql query which joins tables in the database and
get the results to Hadoop. Say, you want to get the following data from your
table to Hadoop:

select table1.col1, table2.col2 from table1, table2 where table1.id =
table2.addressId

If you check DBInputFormat, it is table driven, whereas HIHO is query
driven. Though I have tested against MySQL, import from other JDBC complaint
databases should work. Currently, export works only for MySQL.

I have updated the documentation to include a project how to. There are also
details on the configuration and implementing. If you need further help,
please let me know.

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 24, 2010 at 12:42 AM, Robin Anil robin.a...@gmail.com wrote:

 Check out PIG. You can do SQL like Map/Reduces using it. Thats the best
 answer I have


 On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher se...@yahoo.com wrote:

 Hi Robin,

 Thanks for your answer. Yes, I do understand that FPGrowth gives you the
 most frequent co-occurrences and some of the more interesting ones are not
 pairs (not to say that pairs are not interesting). However this is not what
 I want in this case. I need all the pairs for a given active item that
 co-occur with the active item for a number of times greater than threshold.
 FPGrowth gives me that but also much more so I'm trying to find an easier
 algorithm that simply generates the pairs. I do need to process billions of
 data points so performance and scalability are important. I'm also trying to
 understand the technologies involved so please bare with me :)

 Currently, I can run a simple (DB2) SQL query on the data set I've
 mentioned earlier and get the occurrence count.

 SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, count(*) as COUNT FROM
 SPACE1, SPACE2 where space1.session=space2.session group by SPACE1.ITEM,
 SPACE2.ITEM;

 ACT REC COUNT
 1 2 1
 1 3 1
 2 2 2
 2 3 1
 2 4 1
 3 2 1
 3 3 1
 4 2 2
 4 3 1
 4 4 1
 6 2 1
 6 4 1

 This would give me the right occurrence count. I was able to run this
 types of queries successfully on a few million data point batches and merge
 the results pretty fast. I want to understand how to implement the
 equivalent in Hadoop. Hopefully this makes more sense.
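
 (A rough MapReduce equivalent of that join-and-count is a session-keyed job followed
 by a plain word-count style sum job; a sketch is below, assuming input lines laid out
 as "SPACE1<tab>session<tab>item" / "SPACE2<tab>session<tab>item" -- the layout and
 names are only illustrative. The occurrence-threshold filter would go into the second
 job's reducer.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SessionPairJob {

    public static class SessionMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text session = new Text();
        private final Text taggedItem = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");  // [space, session, item]
            session.set(f[1]);
            taggedItem.set(f[0] + ":" + f[2]);
            ctx.write(session, taggedItem);            // group everything seen in one session
        }
    }

    public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void reduce(Text session, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> acts = new ArrayList<String>();
            List<String> recs = new ArrayList<String>();
            for (Text v : values) {
                String[] f = v.toString().split(":", 2);
                if ("SPACE1".equals(f[0])) { acts.add(f[1]); } else { recs.add(f[1]); }
            }
            // cross product within the session, exactly like the SQL join on session
            for (String act : acts) {
                for (String rec : recs) {
                    ctx.write(new Text(act + "\t" + rec), ONE);
                }
            }
        }
    }
}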

 Sebastian

 --
 *From:* Robin Anil robin.a...@gmail.com
 *To:* mapreduce-user@hadoop.apache.org
 *Sent:* Fri, April 23, 2010 11:16:59 AM
 *Subject:* Re: counting pairs of items across item types

 Hi Sebastian, Let me get your use case right, You cant to do a pair
 counting like a join. you might need to use PIG or something similar to do
 this easily. Mahout's PFPGrowth counts the co-occurring, frequent n-items
  not just co-occurrence of two items. There you just need either one of the
 viewed or bought transaction table to generate these patterns.

 Robin

 On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher se...@yahoo.com wrote:

 There's a DBConfiguration and a DBInputFormat but couldn't find much
 details on these. Also I need to access both table in order to generate the
 pairs and count them.
 Next, when generating the pairs, I'd like to store the final outcome
 containing all the pairs whose count is greater than a specified threshold
 back into the database.







Re: import multiple jar

2010-04-20 Thread Sonal Goyal
Hi,

You can add your dependencies in the lib folder of your main jar. Hadoop
will automatically distribute them to the cluster.

You can also explore using DistributedCache or -libjars options.
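
Note that -libjars is handled by GenericOptionsParser, so the driver has to go
through ToolRunner for it to take effect; a minimal driver sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// invoked as: bin/hadoop jar myExecutable.jar MyDriver -libjars package.jar <in> <out>
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "my job");  // getConf() already carries the -libjars entries
        job.setJarByClass(MyDriver.class);
        // ... the usual mapper/reducer/input/output setup goes here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}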
Thanks and Regards,
Sonal
www.meghsoft.com


On Mon, Apr 19, 2010 at 7:54 PM, Gang Luo lgpub...@yahoo.com.cn wrote:

 Hi all,
 this is kind of a java problem. I was using a package. In an example
 program, I import the package by -classpath when compiling it and pack it
 into a jar. When I execute my jar file, I need to also import the original
 package like this java -classpath package.jar:myExecutable.jar myClass.
 Otherwise, it will report classnotfound exception. However, when run a
 program in hadoop, I cannot import more than one jar files (bin/hadoop jar
 myExecutable.jar myClass). How to impart that package.jar? I try export
 CLASSPATH=..., it doesn't help.

 Thanks,
 -Gang






Re: Trying to figure out possible causes of this exception

2010-04-07 Thread Sonal Goyal
hi Kris,

Seems your program can not find the input file. Have you done a hadoop fs
-ls to verify that the file exists? Also, the path URL should be
hdfs://..


Thanks and Regards,
Sonal
www.meghsoft.com


On Wed, Apr 7, 2010 at 1:16 AM, Kris Nuttycombe
kris.nuttyco...@gmail.comwrote:

 Exception in thread main java.io.FileNotFoundException: File does
 not exist: hdfs:///test-batchEventLog/metrics/data
at
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at
 org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at reporting.HDFSMapReduceQuery.execute(HDFSMetricsQuery.scala:60)

 My job config contains the following:

println(using input path:  + inPath)
println(using output path:  + outPath)
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath)

 with input & output paths printed out as:

 using input path: hdfs:/test-batchEventLog
 using output path:
 hdfs:/test-batchEventLog/out/03d24392-9bd9-4b23-8240-aceb54b3473c

 Any ideas why this would be occurring?

 Thanks,

 Kris



Re: Does Hadoop compress files?

2010-04-03 Thread Sonal Goyal
Hi,

Please check
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Data+Compression
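
HDFS itself stores the bytes you give it; compression is something your jobs
(or your loading process) apply. As a minimal sketch, a driver that writes
gzip-compressed output with the new API (class name and path arguments are
illustrative):

// Minimal driver sketch showing output compression (identity map/reduce).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output as well.
    conf.setBoolean("mapred.compress.map.output", true);
    Job job = new Job(conf, "compressed-copy");
    job.setJarByClass(CompressedCopy.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Gzip the final job output.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For hundreds of terabytes of logs you would more likely load pre-compressed
files (e.g. gzip) or use SequenceFiles with block compression, but the
job-side switches above are the usual starting point.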

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel u235senti...@gmail.comwrote:

 I'm starting to evaluate Hadoop.  We are currently running Sensage and
 store a lot of log files in our current environment.  I've been looking at
 the Hadoop forums and googling (of course) but haven't learned if Hadoop
 HDFS does any compression to files we store.

 On the average we're storing about 600 gigs a week in log files (more or
 less).  Generally we need to store about 1 1/2 - 2 years of logs.  With
 Sensage compression we can store about 200+ Tb of logs in our current
 environment.

 As I said, we're starting to evaluate whether Hadoop would be a good
 replacement for our Sensage environment (or at least augment it).

 Thanks a bunch!!



Re: Manually splitting files in blocks

2010-03-24 Thread Sonal Goyal
Hi Yuri,

You can also check the source code of FileInputFormat and create your own
RecordReader implementation.
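
A minimal sketch of a related approach (simpler than a full custom
RecordReader): subclass TextInputFormat and mark files non-splittable, so each
file, and every [BEGIN]/[END DATAROW] region inside it, is read end to end by
a single mapper (new mapreduce API assumed):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One map split per file, regardless of how HDFS has laid out the blocks.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
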
Thanks and Regards,
Sonal
www.meghsoft.com


On Wed, Mar 24, 2010 at 9:08 PM, Patrick Angeles patr...@cloudera.comwrote:

 Yuri,

 Probably the easiest thing is to actually create distinct files and
 configure the block size per file such that HDFS doesn't split it into
 smaller blocks for you.
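
  For example, a per-file block size can be set when the file is written (the
  value below is only an example; hadoop fs accepts the generic -D option):

  # write the file with a 512 MB block size so a BEGIN/END DATAROW region is
  # far less likely to straddle a block boundary
  hadoop fs -D dfs.block.size=536870912 -put large-database-file.txt /data/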

 - P

 On Wed, Mar 24, 2010 at 11:23 AM, Yuri K. mr_greens...@hotmail.com
 wrote:

 
  Dear Hadoopers,
 
  i'm trying to find out how and where hadoop splits a file into blocks and
  decides to send them to the datanodes.
 
  My specific problem:
  i have two types of data files.
  One large file is used as a database-file where information is sorted
 like
  this:
  [BEGIN DATAROW]
  ... lots of data 1
  [END DATAROW]
 
  [BEGIN DATAROW]
  ... lots of data 2
  [END DATAROW]
  and so on.
 
  and the other smaller files contain raw data and are to be compared to a
  datarow in the large file.
 
  so my question is: is it possible to manually set how hadoop splits the
  large data file into blocks?
  obviously i want the begin-end section to be in one block to optimize
  performance. thus i can replicate the smaller files on each node and so
  those can work independently from the other.
 
  thanks, yk
  --
  View this message in context:
 
 http://old.nabble.com/Manually-splitting-files-in-blocks-tp28015936p28015936.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: Sqoop Installation on Apache Hadop 0.20.2

2010-03-18 Thread Sonal Goyal
Hi Utku,

If MySQL is your target database, you may check Meghsoft's hiho:

http://code.google.com/p/hiho/

The current release supports transferring data from Hadoop to the MySQL
database. We will be releasing the functionality of transfer from MySQL to
Hadoop soon, sometime next week.

Thanks and Regards,
Sonal
www.meghsoft.com


On Thu, Mar 18, 2010 at 5:31 AM, Aaron Kimball aa...@cloudera.com wrote:

 Hi Utku,

 Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of the
 DataDrivenDBInputFormat (among other APIs) which are not shipped with
 Apache's 0.20 release. In order to get Sqoop working on 20, you'd need to
 apply a lengthy list of patches from the project source repository to your
 copy of Hadoop and recompile. Or you could just download it all from
 Cloudera, where we've done that work for you :)

 So as it stands, Sqoop won't be able to run on 0.20 unless you choose to
 use
 Cloudera's distribution.  Do note that your use of the term fork is a bit
 strong here; with the exception of (minor) modifications to make it
 interact
 in a more compatible manner with the external Linux environment, our
 distribution only includes code that's available to the project at large.
 But some of that code has not been rolled into a binary release from Apache
 yet. If you choose to go with Cloudera's distribution, it just means that
 you get publicly-available features (like Sqoop, MRUnit, etc.) a year or so
 ahead of what Apache has formally released, but our codebase isn't
 radically
 diverging; CDH is just somewhere ahead of the Apache 0.20 release, but
 behind Apache's svn trunk. (All of Sqoop, MRUnit, etc. are available in the
 Hadoop source repository on the trunk branch.)

 If you install our distribution, then Sqoop will be installed in
 /usr/lib/hadoop-0.20/contrib/sqoop and /usr/bin/sqoop for you. There isn't
 a
 separate package to install Sqoop independent of the rest of CDH; thus no
 extra download link on our site.

 I hope this helps!

 Good luck,
 - Aaron


 On Wed, Mar 17, 2010 at 4:30 AM, Reik Schatz reik.sch...@bwin.org wrote:

   At least for MRUnit, I was not able to find it outside of the Cloudera
   distribution (CDH). What I did: installed CDH locally using apt
  (Ubuntu),
   searched for and copied the mrunit library into my local Maven
  repository,
   and removed CDH afterwards. I guess the same is somehow possible for Sqoop.
 
  /Reik
 
 
  Utku Can Topçu wrote:
 
  Dear All,
 
  I'm trying to run tests using MySQL as some kind of a datasource, so I
  thought Cloudera's Sqoop would be a nice project to have in
  production.
  However, I'm not using Cloudera's Hadoop distribution right now, and
  actually I'm not thinking of switching from the main project to a fork.
 
  I read the documentation on sqoop at
  http://www.cloudera.com/developers/downloads/sqoop/ but there are
  actually
  no links for downloading the sqoop itself.
 
   Has anyone here tried to use Sqoop with the latest Apache
   Hadoop?
   If so, can you give me some tips and tricks on it?
 
  Best Regards,
  Utku
 
 
 
  --
 
  *Reik Schatz*
  Technical Lead, Platform
  P: +46 8 562 470 00
  M: +46 76 25 29 872
  F: +46 8 562 470 01
  E: reik.sch...@bwin.org mailto:reik.sch...@bwin.org
  */bwin/* Games AB
  Klarabergsviadukten 82,
  111 64 Stockholm, Sweden
 
  [This e-mail may contain confidential and/or privileged information. If
 you
  are not the intended recipient (or have received this e-mail in error)
  please notify the sender immediately and destroy this e-mail. Any
  unauthorised copying, disclosure or distribution of the material in this
  e-mail is strictly forbidden.]
 
 



Re: Expanding comma separated values in a column

2010-03-16 Thread Sonal Goyal
Hi Tim,

You can use the explode UDTF. More here:

http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
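
A minimal sketch (assuming the wide table is called wide_table, Col2 is stored
as a comma-separated string, and your Hive build has LATERAL VIEW):

SELECT col1, item
FROM wide_table
LATERAL VIEW explode(split(col2, ',')) exploded AS item;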

HTH
Thanks and Regards,
Sonal


On Tue, Mar 16, 2010 at 3:32 PM, Tim Robertson timrobertson...@gmail.comwrote:

 Hi all,

 I have a table of 2 columns of strings, with example row as:

 Col1  Col2
 123  23,34,45,67... up to around 1 million

 I'd like to expand the comma separated values to a new taller KVP table:

 Col1  Col2
 123   23
 123   34
 123   45
 123   67
 123   ...   (potentially 1,000,000 rows generated)

 Can someone please point me in the right direction?

 Thanks
 Tim





Re: Cloudera AMIs

2010-03-16 Thread Sonal Goyal
Thanks Tom. I actually wanted to install Hadoop 0.20, or the Cloudera
version which supports that. My application is written using the latest
APIs, which are not backward compatible with the 0.18 versions. I tried
using:

hadoop-ec2 launch-cluster --env REPO=Testing --env HADOOP_VERSION=0.20 but
the EC2 instance does not have hadoop installed. I checked the master, and
there is no hadoop user or any install of hadoop.

Is there a way I can set up the 0.20 cluster, other than following all the
steps?

Thanks and Regards,
Sonal


On Tue, Mar 16, 2010 at 10:51 AM, Tom White t...@cloudera.com wrote:

 Hi Sonal,

 You should use the one with the later date. The Cloudera AMIs don't
 actually have Hadoop installed on them, just Java and some other base
 packages. Hadoop is installed at start up time; you can find more
 information at http://archive.cloudera.com/docs/ec2.html.

 Cheers,
 Tom

 P.S. For Cloudera-specific questions please consider using the
 Cloudera forum at http://getsatisfaction.com/cloudera

 On Sun, Mar 14, 2010 at 7:03 AM, Sonal Goyal sonalgoy...@gmail.com
 wrote:
  Hi,
 
  I want to know which Cloudera AMI supports which Hadoop version. For
  example,
 
 
 ami-2932d440:cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090602-i386.manifest.xml
 
 
  ami-ed59bf84:
 
 cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-i386.manifest.xml
 
  Whats the difference between the two? Which Hadoop version do they
 support?
  I need to use the 0.20+ release.
 
 
  Thanks and Regards,
  Sonal
 



Re: WritableName can't load class in hive

2010-03-16 Thread Sonal Goyal
For some custom functions, I put the jar on the local path accessible to the
CLI. Have you tried that?
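
For example (paths are placeholders, assuming a Hive build that supports
--auxpath / HIVE_AUX_JARS_PATH): put the jar on the auxiliary path when the
CLI starts, so the launched MapReduce tasks get it on their classpath as well:

hive --auxpath /local/path/custom-writables.jar

# or, before starting the CLI:
export HIVE_AUX_JARS_PATH=/local/path/custom-writables.jar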

Thanks and Regards,
Sonal


On Tue, Mar 16, 2010 at 3:49 PM, Oded Rotem oded.rotem...@gmail.com wrote:

 We have a bunch of sequence files containing keys & values of custom
 Writable classes that we wrote, in an HDFS directory.

 We manage to view them using hadoop fs -text. For further ad-hoc analysis,
 we tried using Hive. We managed to load them as external tables in Hive,
 however running a simple select count() against the table fails with
 "WritableName can't load class" in the job output log.

 Executing
add jar path
 does not solve it.

 Where do we need to place the jar containing the definition of the writable
 classes?




Cloudera AMIs

2010-03-14 Thread Sonal Goyal
Hi,

I want to know which Cloudera AMI supports which Hadoop version. For
example,

ami-2932d440:cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090602-i386.manifest.xml


ami-ed59bf84:
cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-i386.manifest.xml

What's the difference between the two? Which Hadoop version do they support?
I need to use the 0.20+ release.


Thanks and Regards,
Sonal


Re: error in semantic analysis of join statement

2010-03-09 Thread Sonal Goyal
Hi,

Let me explain this through an example:

Lets assume your table looks like:

aux1.objname  aux1.no_null
------------  ------------
AA            1
AA            2
AB            1
AB            3

When you do a select aux1.objname, aux1.no_null group by objname, you are
grouping the AAs and the ABs together. However, you need an aggregate
function over no_null so that you can get the value of no_null corresponding
to the groups. When you use a group by over a column, other columns that you
select either need to be grouped, or aggregated in some form.

This is what is missing in your query.
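
For example, one possible rewrite (a sketch assuming a SUM aggregate is what
is wanted; MAX or AVG would work the same way):

SELECT t1.objname,
       SUM(t1.no_null)              AS sum1,
       SUM(t2.no_null)              AS sum2,
       SUM(t1.no_null + t2.no_null) AS null_sum
FROM aux1 t1
JOIN aux2 t2 ON (t1.objname = t2.objname)
GROUP BY t1.objname
SORT BY null_sum
LIMIT 30;
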
HTH.

Thanks and Regards,
Sonal


On Tue, Mar 9, 2010 at 4:20 PM, Jan Stöcker jan.stoec...@q2web.de wrote:

  Hi,



 I am stuck with what is probably a beginner’s mistake, but I simply don’t
 know

 what’s wrong. I have two tables aux1 and aux2, with each two columns
 objname

 (STRING) and no_null (INT).

 I want to find all entries of objname appearing in both tables and gave
 hive the

 following statement:



 SELECT t1.objname, t1.no_null, t2.no_null, (t1.no_null + t2.no_null) AS
 null_sum FROM aux1 t1

 JOIN aux2 t2 ON (t1.objname = t2.objname) GROUP BY t1.objname SORT BY
 null_sum LIMIT 30;



 But I got the error message “Error in semantic analysis: line 1:19
 Expression Not In Group By Key t1”.

 I don’t really understand what that means. Anyone can help me?



 Regards,

 Jan







Re: where does jobtracker get the IP and port of namenode?

2010-03-09 Thread Sonal Goyal
Can you turn the logging level to DEBUG to see what the logs say?
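
For example, a minimal way to get more detail out of the tasktracker (a
sketch; adjust to your own log4j setup): in conf/log4j.properties on the
affected box, then restart the daemon:

log4j.logger.org.apache.hadoop.ipc=DEBUG
log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG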

Thanks and Regards,
Sonal


On Tue, Mar 9, 2010 at 1:08 PM, jiang licht licht_ji...@yahoo.com wrote:

 I guess my confusion is this:

 I point fs.default.name to hdfs://A:50001 in core-site.xml (A is the IP
 address). I assume when tasktracker starts, it should use A:50001 to contact
 namenode. But actually, tasktracker log shows that it uses B which is IP
 address of another network interface of the  namenode box and because the
 tasktracker box cannot reach address B, the tasktracker simply retries
 connection and finally fails to start. I read some source code in
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize and it seems to me
 the namenode address is passed in earlier from what is specified in 
 fs.default.name. Is this correct that the namenode address used here by
 tasktracker comes from fs.default.name in core-site.xml or somehow there
 is another step in which this value is changed? Could someone elaborate on
 how the tasktracker resolves the namenode address and contacts it? Thanks!

 Thanks,

 Michael

 --- On Tue, 3/9/10, jiang licht licht_ji...@yahoo.com wrote:

 From: jiang licht licht_ji...@yahoo.com
 Subject: Re: where does jobtracker get the IP and port of namenode?
 To: common-user@hadoop.apache.org
 Date: Tuesday, March 9, 2010, 12:20 AM

 Sorry, that was a typo in my first post. I did use 'fs.default.name' in
 core-site.xml.

 BTW, the following is the list of error message when tasktracker was
 started and shows that tasktracker failed to connect to namenode A:50001.

 /
 STARTUP_MSG: Starting TaskTracker
 STARTUP_MSG:   host = HOSTNAME/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.1+169.56
 STARTUP_MSG:   build =  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3;
 compiled by 'root' on Tue Feb  9 13:40:08 EST 2010
 /
 2010-03-09 00:08:50,199 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2010-03-09 00:08:50,341 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open() is -1.
 Opening the listener on 50060
 2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50060
 webServer.getConnectors()[0].getLocalPort() returned 50060
 2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: Jetty bound
 to port 50060
 2010-03-09 00:08:50,350 INFO org.mortbay.log: jetty-6.1.14
 2010-03-09 00:08:50,707 INFO org.mortbay.log: Started
 selectchannelconnec...@0.0.0.0:50060
 2010-03-09 00:08:50,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=TaskTracker, sessionId=
 2010-03-09 00:08:50,749 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=TaskTracker, port=52550
 2010-03-09 00:08:50,799 INFO org.apache.hadoop.ipc.Server: IPC Server
 Responder: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
 listener on 52550: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 0 on 52550: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 1 on 52550: starting
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 2 on 52550: starting
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker:
 TaskTracker up at: HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker: Starting
 tracker tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,802 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 3 on 52550: starting
 2010-03-09 00:08:50,854 INFO org.apache.hadoop.mapred.TaskTracker:  Using
 MemoryCalculatorPlugin :
 org.apache.hadoop.util.linuxmemorycalculatorplu...@27b4c1d7
 2010-03-09 00:08:50,856 INFO org.apache.hadoop.mapred.TaskTracker: Starting
 thread: Map-events fetcher for all reduce tasks on
 tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,858 WARN org.apache.hadoop.mapred.TaskTracker:
 TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is
 disabled.
 2010-03-09 00:08:50,859 INFO org.apache.hadoop.mapred.IndexCache:
 IndexCache created with max memory = 10485760
 2010-03-09 00:09:11,970 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /A:50001. Already tried 0 time(s).
 2010-03-09 00:09:32,972 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /A:50001. Already tried 1 time(s).
 ...

 Thanks,

 Michael

 --- On Mon, 3/8/10, Arun C Murthy a...@yahoo-inc.com wrote:

 From: Arun C Murthy a...@yahoo-inc.com
 Subject: Re: where does jobtracker get the IP and port of namenode?
 To: common-user@hadoop.apache.org
 Date: Monday, March 8, 2010, 10:26 PM

  Here's what is set in core-site.xml
 
  dfs.default.name=hdfs://B:50001

Re: where does jobtracker get the IP and port of namenode?

2010-03-09 Thread Sonal Goyal
Hi Michale,

Please check:
http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html#Logging

Then see your master and slave logs. The current logs in your emails, as far
as I could deduce show that the connection is failing, but it is unclear
what is causing the connection to fail.
Thanks and Regards,
Sonal


On Tue, Mar 9, 2010 at 3:53 PM, jiang licht licht_ji...@yahoo.com wrote:

 Thanks Sonal. How to set that debug mode? Actually I set
 dfs.namenode.logging.level to all. Please see my first and previous
 posts for error messages.

 Thanks,

 Michael

 --- On Tue, 3/9/10, Sonal Goyal sonalgoy...@gmail.com wrote:

 From: Sonal Goyal sonalgoy...@gmail.com
 Subject: Re: where does jobtracker get the IP and port of namenode?
 To: common-user@hadoop.apache.org
 Date: Tuesday, March 9, 2010, 4:01 AM

 Can you turn logging level to debug to see what the logs say?

 Thanks and Regards,
 Sonal


 On Tue, Mar 9, 2010 at 1:08 PM, jiang licht licht_ji...@yahoo.com wrote:

  I guess my confusion is this:
 
  I point fs.default.name to hdfs:A:50001 in core-site.xml (A is IP
  address). I assume when tasktracker starts, it should use A:50001 to
 contact
  namenode. But actually, tasktracker log shows that it uses B which is IP
  address of another network interface of the  namenode box and because the
  tasktracker box cannot reach address B, the tasktracker simply retries
  connection and finally fails to start. I read some source code in
  org.apache.hadoop.hdfs.DistributedFileSystem.initialize and it seems to
 me
  the namenode address is passed in earlier from what is specified in 
  fs.default.name. Is this correct that the namenode address used here by
  tasktracker comes from fs.default.name in core-site.xml or somehow
 there
  is another step in which this value is changed? Could someone elaborate
 this
  process how tasktracker resolves namenode and contacts it? Thanks!
 
  Thanks,
 
  Michael
 
  --- On Tue, 3/9/10, jiang licht licht_ji...@yahoo.com wrote:
 
  From: jiang licht licht_ji...@yahoo.com
  Subject: Re: where does jobtracker get the IP and port of namenode?
  To: common-user@hadoop.apache.org
  Date: Tuesday, March 9, 2010, 12:20 AM
 
  Sorry, that was a typo in my first post. I did use 'fs.default.name' in
  core-site.xml.
 
  BTW, the following is the list of error message when tasktracker was
  started and shows that tasktracker failed to connect to namenode A:50001.
 
  /
  STARTUP_MSG: Starting TaskTracker
  STARTUP_MSG:   host = HOSTNAME/127.0.0.1
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.1+169.56
  STARTUP_MSG:   build =  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3;
  compiled by 'root' on Tue Feb  9 13:40:08 EST 2010
  /
  2010-03-09 00:08:50,199 INFO org.mortbay.log: Logging to
  org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
  org.mortbay.log.Slf4jLog
  2010-03-09 00:08:50,341 INFO org.apache.hadoop.http.HttpServer: Port
  returned by webServer.getConnectors()[0].getLocalPort() before open() is
 -1.
  Opening the listener on 50060
  2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer:
  listener.getLocalPort() returned 50060
  webServer.getConnectors()[0].getLocalPort() returned 50060
  2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: Jetty
 bound
  to port 50060
  2010-03-09 00:08:50,350 INFO org.mortbay.log: jetty-6.1.14
  2010-03-09 00:08:50,707 INFO org.mortbay.log: Started
  selectchannelconnec...@0.0.0.0:50060
  2010-03-09 00:08:50,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
  Initializing JVM Metrics with processName=TaskTracker, sessionId=
  2010-03-09 00:08:50,749 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
  Initializing RPC Metrics with hostName=TaskTracker, port=52550
  2010-03-09 00:08:50,799 INFO org.apache.hadoop.ipc.Server: IPC Server
  Responder: starting
  2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
  listener on 52550: starting
  2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 0 on 52550: starting
  2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 1 on 52550: starting
  2010-03-09 00:08:50,801 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 2 on 52550: starting
  2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker:
  TaskTracker up at: HOSTNAME/127.0.0.1:52550
  2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker:
 Starting
  tracker tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
  2010-03-09 00:08:50,802 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 3 on 52550: starting
  2010-03-09 00:08:50,854 INFO org.apache.hadoop.mapred.TaskTracker:  Using
  MemoryCalculatorPlugin :
  org.apache.hadoop.util.linuxmemorycalculatorplu...@27b4c1d7
  2010-03-09 00:08:50,856 INFO org.apache.hadoop.mapred.TaskTracker:
 Starting
  thread: Map-events fetcher for all reduce

Re: Relational operator in hive

2010-03-08 Thread Sonal Goyal
Shouldn't there be a single = in the query? ac_log.month = 1..?
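
i.e., something like:

SELECT ip, dt, month, year FROM ac_log
WHERE ac_log.month = 1 AND ac_log.year = 2010;
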
Thanks and Regards,
Sonal


On Mon, Mar 8, 2010 at 3:02 PM, prakash sejwani prakashsejw...@gmail.comwrote:

 Hi all,
I have a query below
   SELECT ip,dt,month,year FROM ac_log WHERE ac_log.month == 1 AND
 ac_log.year == 2010;
when i run this query i get empty result out of it. Basically i want to
 extract the data from a table whose month is 1 and year is 2010.

 Can anybody help me with this query

 thanks,
 prakash




Re: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

2010-02-17 Thread Sonal Goyal
Hi,

What do your Hive logs say? You can also check the Hadoop mapper and reduce
job logs.

Thanks and Regards,
Sonal


On Wed, Feb 17, 2010 at 4:18 PM, prasenjit mukherjee
prasen@gmail.comwrote:


 Here is my std-error :
 hive> insert overwrite local directory '/tmp/mystuff' select transform(*)
 using  'my.py' FROM myhivetable;
 Total MapReduce jobs = 1
 Number of reduce tasks is set to 0 since there's no reduce operator
 Starting Job = job_201002160457_0033, Tracking URL =
 http://ec2-204-236-205-98.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201002160457_0033
 Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=
 ec2-204-236-205-98.compute-1.amazonaws.com:8021 -kill
 job_201002160457_0033
 2010-02-17 05:40:28,380 map = 0%,  reduce =0%
 2010-02-17 05:41:12,469 map = 100%,  reduce =100%
 Ended Job = job_201002160457_0033 with errors
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.ExecDriver


 I am trying to use the following command :

 hive ql :

 add file /root/my.py
 insert overwrite local directory '/tmp/mystuff' select transform(*) using
 'my.py' FROM myhivetable;

 and following is my my.py:
 #!/usr/bin/python
 import sys
 for line in sys.stdin:
   line = line.strip()
   flds = line.split('\t')
   (cl_id,cook_id)=flds[:2]
   sub_id=cl_id
   if cl_id.startswith('foo'): sub_id=cook_id;
   print ','.join([sub_id,flds[2],flds[3]])

 This works fine, as I tested it in commandline using :  echo -e
 'aa\tbb\tcc\tdd' |  /root/my.py

 Any pointers ?



Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-13 Thread Sonal Goyal
Yes, thanks Todd. I am looking to upgrade to 0.20.2.

Thanks and Regards,
Sonal


On Sat, Feb 13, 2010 at 11:07 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Sonal,

 Why are you using Hadoop 0.20.0? It's fairly old and there are lots of
 fixes in 0.20.1, and more in 0.20.2 which should be released any
 minute now.

 In particular, you're missing this change:
 https://issues.apache.org/jira/browse/HADOOP-5921

 which makes the JobTracker stubbornly wait for DFS to appear.

 I'd recommend using either (a) Apache 0.20.1, (b) Owen's rc of 0.20.2,
 or (c) Cloudera's 0.20.1 based build at
 http://archive.cloudera.com/cdh/2/hadoop-0.20.1+169.56.tar.gz which is
 0.20.1 plus 225 extra patches (incl most of what's in 0.20.2).

 -Todd

 On Sat, Feb 13, 2010 at 8:35 AM, Sonal Goyal sonalgoy...@gmail.com
 wrote:
  Hi Aaron,
 
  I am on Hadoop 0.20.0 on Ubuntu, pseudo distributed mode. If I remove the
  sleep time from my start-all.sh script, my jobtracker comes up
 momentarily
  and then dies.
 
  Here is a capture of my commands:
 
  sgo...@desktop:~/software/hadoop-0.20.0$ bin/hadoop namenode -format
  10/02/13 21:54:19 INFO namenode.NameNode: STARTUP_MSG:
  /
  STARTUP_MSG: Starting NameNode
  STARTUP_MSG:   host = desktop/127.0.1.1
  STARTUP_MSG:   args = [-format]
  STARTUP_MSG:   version = 0.20.0
  STARTUP_MSG:   build =
  https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r
 763504;
  compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
  /
  10/02/13 21:54:19 DEBUG conf.Configuration: java.io.IOException: config()
 at org.apache.hadoop.conf.Configuration.init(Configuration.java:210)
 at org.apache.hadoop.conf.Configuration.init(Configuration.java:197)
 at
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:937)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:964)
 
  Re-format filesystem in /tmp/hadoop-sgoyal/dfs/name ? (Y or N) Y
  10/02/13 21:54:22 DEBUG security.UserGroupInformation: Unix Login:
 
 sgoyal,sgoyal,adm,dialout,cdrom,audio,plugdev,fuse,lpadmin,admin,sambashare,mysql,cvsgroup
  10/02/13 21:54:22 INFO namenode.FSNamesystem:
 
 fsOwner=sgoyal,sgoyal,adm,dialout,cdrom,audio,plugdev,fuse,lpadmin,admin,sambashare,mysql,cvsgroup
  10/02/13 21:54:22 INFO namenode.FSNamesystem: supergroup=supergroup
  10/02/13 21:54:22 INFO namenode.FSNamesystem: isPermissionEnabled=true
  10/02/13 21:54:22 INFO common.Storage: Image file of size 96 saved in 0
  seconds.
  10/02/13 21:54:22 DEBUG namenode.FSNamesystem: Preallocating Edit log,
  current size 0
  10/02/13 21:54:22 DEBUG namenode.FSNamesystem: Edit log size is now
 1049088
  written 512 bytes  at offset 1048576
  10/02/13 21:54:22 INFO common.Storage: Storage directory
  /tmp/hadoop-sgoyal/dfs/name has been successfully formatted.
  10/02/13 21:54:22 INFO namenode.NameNode: SHUTDOWN_MSG:
  /
  SHUTDOWN_MSG: Shutting down NameNode at desktop/127.0.1.1
  /
 
 
  sgo...@desktop:~/software/hadoop-0.20.0$ bin/start-all.sh
  starting namenode, logging to
 
 /home/sgoyal/software/hadoop-0.20.0/bin/../logs/hadoop-sgoyal-namenode-desktop.out
  localhost: starting datanode, logging to
 
 /home/sgoyal/software/hadoop-0.20.0/bin/../logs/hadoop-sgoyal-datanode-desktop.out
  localhost: starting secondarynamenode, logging to
 
 /home/sgoyal/software/hadoop-0.20.0/bin/../logs/hadoop-sgoyal-secondarynamenode-desktop.out
  starting jobtracker, logging to
 
 /home/sgoyal/software/hadoop-0.20.0/bin/../logs/hadoop-sgoyal-jobtracker-desktop.out
  localhost: starting tasktracker, logging to
 
 /home/sgoyal/software/hadoop-0.20.0/bin/../logs/hadoop-sgoyal-tasktracker-desktop.out
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26171 Jps
  26037 JobTracker
  25966 SecondaryNameNode
  25778 NameNode
  26130 TaskTracker
  25863 DataNode
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26037 JobTracker
  25966 SecondaryNameNode
  26203 Jps
  25778 NameNode
  26130 TaskTracker
  25863 -- process information unavailable
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26239 Jps
  26037 JobTracker
  25966 SecondaryNameNode
  25778 NameNode
  26130 TaskTracker
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26037 JobTracker
  25966 SecondaryNameNode
  25778 NameNode
  26130 TaskTracker
  26252 Jps
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26288 Jps
  25966 SecondaryNameNode
  25778 NameNode
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  25966 SecondaryNameNode
  25778 NameNode
  26298 Jps
 
  sgo...@desktop:~/software/hadoop-0.20.0$ jps
  26308 Jps
  25966 SecondaryNameNode
  25778 NameNode
 
  My jobtracker logs show:
 
  2010-02-13 21:54:40,660 INFO org.apache.hadoop.mapred.JobTracker:
  STARTUP_MSG

Re: Error While building Hive

2010-02-03 Thread Sonal Goyal
Hi,

I think the default is ${user.home}/.ant/cache. It should have a hadoop/core
folder with the versions you were able to download. Try adding the missing tarball there.
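
A rough sketch of that workaround (the exact cache layout below is a guess;
mirror whatever layout you already see for artifacts that did resolve):

# fetch the tarball on a machine with outbound access
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

# then place it in the Ivy cache used by the Hive build, e.g. roughly:
mkdir -p ~/.ant/cache/hadoop/core/sources
cp hadoop-0.20.0.tar.gz ~/.ant/cache/hadoop/core/sources/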

Thanks and Regards,
Sonal


On Wed, Feb 3, 2010 at 12:12 PM, Vidyasagar Venkata Nallapati 
vidyasagar.nallap...@onmobile.com wrote:

  Hi,

 I am able to connect to these URLs, but it's getting timed out.



 Can I download the hadoop source files from these URLs externally, copy them
 to some location, and build Hive?



 Regards

 Vidyasagar N V



 *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
 *Sent:* Wednesday, February 03, 2010 11:49 AM
 *To:* hive-user@hadoop.apache.org
 *Subject:* Re: Error While building Hive



 Hi,

 Are you able to access the Hadoop Core URL from your machine? Try from your
 machine the following:

 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 It seems that Ivy is unable to retrieve the hadoop core source files. If
 you are able to, try your build again. It may be a temporary retrieval
 issue.

 Thanks and Regards,
 Sonal

  On Wed, Feb 3, 2010 at 11:10 AM, Vidyasagar Venkata Nallapati 
 vidyasagar.nallap...@onmobile.com wrote:

 Hi,



 While building Hive I am getting an error; please help me with the
 changes I need to make to build it.

 I was getting the same error with Hadoop 0.20.1.



 ivy-retrieve-hadoop-source:

 [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 ::
 http://ant.apache.org/ivy/ ::

 :: loading settings :: file = /master/hive/ivy/ivysettings.xml

 [ivy:retrieve] :: resolving dependencies ::
 org.apache.hadoop.hive#shims;work...@ph1

 [ivy:retrieve]  confs: [default]

 [ivy:retrieve] :: resolution report :: resolve 949119ms :: artifacts dl 0ms


 -

 |  |modules||   artifacts
 |

 |   conf   | number| search|dwnlded|evicted||
 number|dwnlded|


 -

 |  default |   1   |   0   |   0   |   0   ||   0   |   0
 |


 -

 [ivy:retrieve]

 [ivy:retrieve] :: problems summary ::

 [ivy:retrieve]  WARNINGS

 [ivy:retrieve]  module not found: hadoop#core;0.20.0

 [ivy:retrieve]   hadoop-source: tried

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]   apache-snapshot: tried

 [ivy:retrieve]
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]   maven2: tried

 [ivy:retrieve]
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz

 [ivy:retrieve]  ::

 [ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::

 [ivy:retrieve]  ::

 [ivy:retrieve]  :: hadoop#core;0.20.0: not found

 [ivy:retrieve]  ::

 [ivy:retrieve]  ERRORS

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]  Server access Error: Connection timed out url=
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]  Server access Error: Connection timed out url=
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz

 [ivy:retrieve]

 [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS



 BUILD FAILED

 /master/hive/build.xml:148: The following error occurred while executing
 this line:

 /master/hive/build.xml:93: The following error occurred while executing
 this line:

 /master/hive/shims/build.xml:64: The following error occurred while
 executing this line:

 /master/hive/build-common.xml:172: impossible to resolve dependencies:



 Regards

 Vidyasagar N V


  --

 DISCLAIMER: The information in this message is confidential and may be
 legally privileged

Resolvers for UDAFs

2010-02-03 Thread Sonal Goyal
Hi,

I am writing a UDAF which takes in 4 parameters. I have 2 cases - one where
all the parameters are ints, and a second where the last parameter is a double. I
wrote two evaluators for this, with iterate as

public boolean iterate(int max, int groupBy, int attribute, int count)

and

public boolean iterate(int max, int groupBy, int attribute, double count)

However, when I run a query, I get the exception:
org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous method
for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int]
at
org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
at
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57)
at
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317)
at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

One option for me is to write a resolver, which I will do. But I just
wanted to know if this is a bug in Hive whereby it is not able to get the
right evaluator, or if this is a gap in my understanding.

I look forward to hearing your views on this.

Thanks and Regards,
Sonal


Re: Error While building Hive

2010-02-03 Thread Sonal Goyal
Hi,

Were you able to get this working?

Thanks and Regards,
Sonal


On Wed, Feb 3, 2010 at 4:08 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi,

 I think the default is user.home/.ant/cache. It should have the hadoop/core
 folder with the versions you were able to download. Try adding to this.

 Thanks and Regards,
 Sonal



 On Wed, Feb 3, 2010 at 12:12 PM, Vidyasagar Venkata Nallapati 
 vidyasagar.nallap...@onmobile.com wrote:

  Hi,

 I am able to connect through the these urls, but its getting timed out.



 Can I download the hadoop source files from these urls externally and copy
 to some locations and build hive?



 Regards

 Vidyasagar N V



 *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
 *Sent:* Wednesday, February 03, 2010 11:49 AM
 *To:* hive-user@hadoop.apache.org
 *Subject:* Re: Error While building Hive



 Hi,

 Are you able to access the Hadoop Core URL from your machine? Try from
 your machine the following:

 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 It seems that Ivy is unable to retrieve the hadoop core source files. If
 you are able to, try your build again. It may be a temporary retrieval
 issue.

 Thanks and Regards,
 Sonal

  On Wed, Feb 3, 2010 at 11:10 AM, Vidyasagar Venkata Nallapati 
 vidyasagar.nallap...@onmobile.com wrote:

 Hi,



 While building the Hive I am getting an error, please help me with the
 changes I need to do to build it.

 I was getting the same with hadoop 20.1.



 ivy-retrieve-hadoop-source:

 [ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 ::
 http://ant.apache.org/ivy/ ::

 :: loading settings :: file = /master/hive/ivy/ivysettings.xml

 [ivy:retrieve] :: resolving dependencies ::
 org.apache.hadoop.hive#shims;work...@ph1

 [ivy:retrieve]  confs: [default]

 [ivy:retrieve] :: resolution report :: resolve 949119ms :: artifacts dl
 0ms


 -

 |  |modules||
 artifacts   |

 |   conf   | number| search|dwnlded|evicted||
 number|dwnlded|


 -

 |  default |   1   |   0   |   0   |   0   ||   0   |
 0   |


 -

 [ivy:retrieve]

 [ivy:retrieve] :: problems summary ::

 [ivy:retrieve]  WARNINGS

 [ivy:retrieve]  module not found: hadoop#core;0.20.0

 [ivy:retrieve]   hadoop-source: tried

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]   apache-snapshot: tried

 [ivy:retrieve]
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]   maven2: tried

 [ivy:retrieve]
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]-- artifact hadoop#core;0.20.0!hadoop.tar.gz(source):

 [ivy:retrieve]
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz

 [ivy:retrieve]  ::

 [ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::

 [ivy:retrieve]  ::

 [ivy:retrieve]  :: hadoop#core;0.20.0: not found

 [ivy:retrieve]  ::

 [ivy:retrieve]  ERRORS

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]  Server access Error: Connection timed out url=
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]  Server access Error: Connection timed out url=
 https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.0/hadoop-0.20.0.tar.gz

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.pom

 [ivy:retrieve]  Server access Error: Connection timed out url=
 http://repo1.maven.org/maven2/hadoop/core/0.20.0/core-0.20.0.tar.gz

 [ivy:retrieve]

 [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS



 BUILD FAILED

 /master/hive/build.xml:148: The following error occurred while executing
 this line:

 /master/hive/build.xml:93: The following error occurred while executing
 this line:

 /master/hive/shims/build.xml:64: The following error occurred while
 executing this line:

 /master/hive/build-common.xml:172: impossible to resolve dependencies:



 Regards

Re: Resolvers for UDAFs

2010-02-03 Thread Sonal Goyal
Hi Zheng,

Wouldn't the query you mentioned need a group by clause? I need the top x
customers per product id. Sorry, can you please explain.

Thanks and Regards,
Sonal


On Thu, Feb 4, 2010 at 12:07 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Zheng,

 Thanks for your email and your feedback. I will try to change the code as
 suggested by you.

 Here is the output of describe:

 hive> describe products_bought;
 OK

 product_id       int
 customer_id      int
 product_count    int


 My function was working fine earlier with this table and iterate(int,
 int, int, int). Once I introduced the other iterate, it stopped working.


 Thanks and Regards,
 Sonal



 On Thu, Feb 4, 2010 at 11:37 AM, Zheng Shao zsh...@gmail.com wrote:

 Hi Sonal,

 1. We usually move the group_by column out of the UDAF - just like we
 do SELECT key, sum(value) FROM table.

 I think you should write:

 SELECT customer_id, topx(2, product_id, product_count)
 FROM products_bought

 and in topx:
 public boolean iterate(int max, int attribute, int count).


 2. Can you run describe products_bought? Does product_count column
 have type int?

 You might want to try removing the other interate function to see
 whether that solves the problem.


 Zheng


 On Wed, Feb 3, 2010 at 9:58 PM, Sonal Goyal sonalgoy...@gmail.com
 wrote:
  Hi Zheng,
 
  My query is:
 
  select a.myTable.key, a.myTable.attribute, a.myTable.count from (select
  explode (t.pc) as myTable from (select topx(2, product_id, customer_id,
  product_count) as pc from (select product_id, customer_id, product_count
  from products_bought order by product_id, product_count desc) r ) t )a;
 
  My overloaded iterators are:
 
  public boolean iterate(int max, int groupBy, int attribute, int count)
 
  public boolean iterate(int max, int groupBy, int attribute, double
 count)
 
  Before overloading, my query was running fine. My table products_bought
 is:
  product_id int, customer_id int, product_count int
 
  And I get:
  FAILED: Error in semantic analysis: Ambiguous method for class
  org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int]
 
  The hive logs say:
  2010-02-03 11:18:15,721 ERROR processors.DeleteResourceProcessor
  (SessionState.java:printError(255)) - Usage: delete [FILE|JAR|ARCHIVE]
  value [value]*
  2010-02-03 11:22:14,663 ERROR ql.Driver
 (SessionState.java:printError(255))
  - FAILED: Error in semantic analysis: Ambiguous method for class
  org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int, int]
  org.apache.hadoop.hive.ql.exec.AmbiguousMethodException: Ambiguous
 method
  for class org.apache.hadoop.hive.udaf.TopXPerGroup with [int, int, int,
 int]
  at
 
 org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
  at
 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:57)
  at
 
 org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:594)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:1882)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2270)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:2821)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:4543)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5058)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:4999)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:5020)
  at
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:5587)
  at
 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:114)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317)
  at org.apache.hadoop.hive.ql.Driver.runCommand(Driver.java:370)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:362)
  at
  org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:140)
  at
  org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:200)
  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:311)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597

Help writing UDAF with custom object

2010-01-31 Thread Sonal Goyal
Hi,

I am writing a UDAF which returns the top x results per key. Lets say my
input is

key  attribute  count
1    1          6
1    2          5
1    3          4
2    1          8
2    2          4
2    3          1

I want the top 2 results per key, which will be:

key  attribute  count
1    1          6
1    2          5
2    1          8
2    2          4

I have written a UDAF for this in the attached file. However, when I run the
code, I get the exception:
FAILED: Unknown exception :
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector
cannot be cast to
org.apache.hadoop.hive.serde2.objectinspector.primitive.SettableIntObjectInspector


Can anyone please let me know what I could be doing wrong?
Thanks and Regards,
Sonal
package org.apache.hadoop.hive.udaf;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class TopXPerGroup extends UDAF {
  // Holds one (key, attribute, count) triple.
  public static class Count implements Comparable<Object> {

    private Integer key;
    private Integer attribute;
    private Integer count;

    public Count(Integer key, Integer attribute, Integer count) {
      System.out.println("Creating count with " + key + " " + attribute + " " + count);
      this.key = key;
      this.count = count;
      this.attribute = attribute;
    }

    public Integer getKey() {
      return key;
    }

    public void setKey(Integer key) {
      this.key = key;
    }

    public Integer getCount() {
      return count;
    }

    public void setCount(Integer count) {
      this.count = count;
    }

    public Integer getAttribute() {
      return attribute;
    }

    public void setAttribute(Integer attribute) {
      this.attribute = attribute;
    }

    @Override
    public int compareTo(Object to) {
      System.out.println("Comparing with " + to);
      if ((to == null) || (to.getClass() != getClass())) {
        return 1;
      }
      Count compare = (Count) to;
      // Descending order by count, so the largest counts sort first.
      return compare.count.compareTo(count);
    }

    public String toString() {
      return key + "," + attribute + "," + count;
    }
  }

  private HashMap<Integer, ArrayList<Count>> countPerGroup;

  public class TopXPerGroupIntEvaluator implements UDAFEvaluator {
    private Integer max;

    public TopXPerGroupIntEvaluator() {
      super();
      init();
    }

    public void init() {
      countPerGroup = new HashMap<Integer, ArrayList<Count>>();
    }

    public boolean iterate(Integer max, Integer groupBy, Integer attribute, Integer count) {
      System.out.println("Iterating for top");
      this.max = max;
      ArrayList<Count> counts = countPerGroup.get(groupBy);
      if (counts == null) {
        counts = new ArrayList<Count>();
      }
      if (counts.size() < max) {
        Count counter = new Count(groupBy, attribute, count);
        counts.add(counter);
        countPerGroup.put(groupBy, counts);
      }
      System.out.println("End iterating for top");
      return true;
    }

    public Collection<ArrayList<Count>> terminatePartial() {
      return countPerGroup.values();
    }

    public boolean merge(HashMap<Integer, ArrayList<Count>> merge) {
      // This will get complex.
      System.out.println("Merging");
      if ((countPerGroup == null) || (countPerGroup.size() == 0)) {
        countPerGroup = merge;
      } else {
        // Iterate through the partial result, merging each group's list into ours.
        Iterator<Integer> iter = merge.keySet().iterator();
        while (iter.hasNext()) {
          Integer mergeKey = iter.next();
          ArrayList<Count> fromMerge = merge.get(mergeKey);
          ArrayList<Count> fromThis = countPerGroup.get(mergeKey);
          if ((fromThis == null) || (fromThis.size() == 0)) {
            countPerGroup.put(mergeKey, fromMerge);
          } else {
            countPerGroup.put(mergeKey, merge(fromMerge, fromThis, max));
          }
        } // while
      }
      return true;
    }

    private ArrayList<Count> merge(ArrayList<Count> from, ArrayList<Count> to, int max) {
      to.addAll(from);
      Collections.sort(to);
      // subList returns a view rather than an ArrayList, so copy it;
      // keep the first max entries (the largest counts).
      return new ArrayList<Count>(to.subList(0, Math.min(max, to.size())));
    }

    public Collection<ArrayList<Count>> terminate() {
      return countPerGroup.values();
    }

  } // class
}


DefaultStringifier throws NullPointer

2009-12-09 Thread Sonal Goyal
Hi,

I need to store an object in the configuration. I am trying to use
DefaultStringifier's load and store methods, but I get the following
exception while storing:

java.lang.NullPointerException
at
org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at
org.apache.hadoop.io.DefaultStringifier.init(DefaultStringifier.java:59)
at
org.apache.hadoop.io.DefaultStringifier.store(DefaultStringifier.java:110)

I have explicitly set the JavaSerializer in the config:

conf.set(io.serializations,
org.apache.hadoop.io.serializer.JavaSerialization);

and my object implements the Serializable interface. It's not a very large
object, so I don't want to implement Writable.
Can anyone please shed some light on how I can avoid this error?
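
For what it's worth, here is a minimal sketch of the setup described above
(class and key names are illustrative; it keeps WritableSerialization in the
list as well, in case anything else in the job still needs it):

import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DefaultStringifier;

public class StringifierExample {
  // Small Serializable settings object to stash in the job Configuration.
  public static class MySettings implements Serializable {
    public int threshold = 5;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.JavaSerialization",
        "org.apache.hadoop.io.serializer.WritableSerialization");

    DefaultStringifier.store(conf, new MySettings(), "my.settings");
    MySettings restored =
        DefaultStringifier.load(conf, "my.settings", MySettings.class);
    System.out.println("restored threshold = " + restored.threshold);
  }
}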

Thanks and Regards,
Sonal


Re: return in map

2009-12-06 Thread Sonal Goyal
Hi,

Maybe you could post your code/logic for doing this. One way would be to set
a flag once your criterion is met and emit keys based on the flag.
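
For reference, a minimal filtering mapper of the kind described (new mapreduce
API; the check on "foo" is only a stand-in for whatever constraint the real
code applies):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the lines that satisfy the constraint; every other line is
// skipped by returning from map() without writing anything.
public class FilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (!line.toString().contains("foo")) {
      return; // filtered out
    }
    context.write(line, NullWritable.get());
  }
}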

Thanks and Regards,
Sonal


2009/12/5 Gang Luo lgpub...@yahoo.com.cn

 Hi all,
 I've got a tricky problem. I input a small file manually to do some filtering
 work on each line in the map function. I check whether the line satisfies the
 constraint; if it does I output it, otherwise I return without doing any
 other work below.

 Since the map function will be called on each line, I think the logic is
 correct. But it doesn't work like this. If there are 5 lines for a map task,
 and only the 2nd line satisfies the constraint, then the output will be lines
 2, 3, 4, and 5. If the 3rd line satisfies it, then the output will be lines 3,
 4, and 5. It seems that once a map task meets the first satisfying line, the
 filter doesn't work for the following lines.

 It is interesting problem. I am checking it now. I also hope someone could
 give me some ideas on this. Thanks.


 -Gang

