Re: Sending data to all reducers

2012-08-23 Thread Sonal Goyal
Hamid, I would recommend taking another look at your current algorithm and making sure you are utilizing the MR framework to its strengths. You can evaluate having multiple passes for your map reduce program, or doing a map-side join. You mention runtime is important for your system, so make sure you

Re: hadoop ecosystem

2012-01-28 Thread Sonal Goyal
Crux, reporting for HBase, can also be included. Sonal Sent from my iPad On 28-Jan-2012, at 11:40 PM, Chris K Wensel wrote: > PyCascading > Scalding > Cascading.JRuby > Bixo > > Strictly speaking, those plus Cascalog (below) are on top of Cascading, which > is of course on top of Hadoop, but

Re: No Mapper but Reducer

2011-09-07 Thread Sonal Goyal
I don't think that is possible. Can you explain in what scenario you want to have no mappers, only reducers? Best Regards, Sonal Crux: Reporting for HBase Nube Technologies On Wed, Sep 7, 2011

Re: I keep getting multiple values for unique reduce keys

2011-09-05 Thread Sonal Goyal
Could you share your mapper code and the container code? When your mapper emits the keys and values, do you print them out to see that they are correct, that is, that the container only has data specific to that id? Best Regards, Sonal Crux: Reporting for HBase Nube

Re: Hadoop cluster couldn't run map reduce job

2011-03-13 Thread Sonal Goyal
Can you check your /etc/hosts to see that all master and slave entries are correct? If you raise the log level to DEBUG, you will see where this is failing. Thanks and Regards, Sonal Hadoop ETL and Data Integration Nube Technologies

Re: Dataset comparison and ranking - views

2011-03-07 Thread Sonal Goyal
Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalgoyal> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz wrote: > On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote: > > Hi, > > > > I am working on a problem to compare two different datasets, and rank >

Dataset comparison and ranking - views

2011-03-07 Thread Sonal Goyal
Hi, I am working on a problem to compare two different datasets, and rank each record of the first with respect to the other, in terms of how similar they are. The records are dimensional, but do not have a lot of dimensions. Some of the fields will be compared for exact matches, some for similar
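The comparison described above can be sketched as a per-field scoring function. This is a toy illustration, not Sonal's actual implementation: field layout, the choice of edit distance for the fuzzy fields, and the equal weighting are all assumptions.

```java
public class RecordSimilarity {

    // Score two records field by field: exact-match fields contribute 0 or 1,
    // fuzzy fields contribute a normalized edit-distance similarity in [0, 1].
    static double score(String[] a, String[] b, boolean[] exact) {
        double total = 0;
        for (int i = 0; i < a.length; i++) {
            if (exact[i]) {
                total += a[i].equals(b[i]) ? 1.0 : 0.0;
            } else {
                int d = editDistance(a[i], b[i]);
                int max = Math.max(a[i].length(), b[i].length());
                total += max == 0 ? 1.0 : 1.0 - (double) d / max;
            }
        }
        return total / a.length; // average across all dimensions
    }

    // Classic Levenshtein distance via dynamic programming.
    static int editDistance(String s, String t) {
        int[][] dp = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= t.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return dp[s.length()][t.length()];
    }

    public static void main(String[] args) {
        String[] r1 = {"IN", "Sonal"};
        String[] r2 = {"IN", "Sonel"};
        // field 0 compared exactly, field 1 fuzzily
        System.out.println(score(r1, r2, new boolean[]{true, false}));
    }
}
```

In an MR setting, a scorer like this would run in the map or reduce phase against candidate pairs; ranking the first dataset's records then reduces to sorting by this score per record.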

Re: easiest way to install hadoop

2011-02-22 Thread Sonal Goyal
You can also check Apache Whirr. Thanks and Regards, Sonal Connect Hadoop with databases, Salesforce, FTP servers and others Nube Technologies On Wed, Feb

Re: Best practice for batch file conversions

2011-02-09 Thread Sonal Goyal
> On Wed, Feb 9, 2011 at 4:26 PM, felix gao wrote: > >> Sonal, >> >> can you tell me how to use the MultipleOutputFormat in my Mapper? I want >> to read a line of text and convert it to some other format and then write it >> back to HDFS using MultipleOutput

Re: Best practice for batch file conversions

2011-02-08 Thread Sonal Goyal
> | `-- file3.done > |-- dir2 > | |-- file1.done > | `-- file3.done > `-- dir3 > |-- file2.done > `-- file3.done > > can someone please show me how to do this? > > thanks, > > Felix > > On Tue, Feb 8, 2011 at 9:43 AM, felix gao wrote: >

Re: Best practice for batch file conversions

2011-02-07 Thread Sonal Goyal
Hi, You can use FileStreamInputFormat, which returns the file stream as the value. https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input You need to remember that you lose data locality by trying to manipulate the file as a whole, but in your case, the re


Re: Multiple queues question

2011-02-07 Thread Sonal Goyal
I think the CapacityScheduler is the one to use with multiple queues, see http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html Thanks and Regards, Sonal Connect Hadoop with databases, Salesforce, FTP servers and others
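Following the r0.19.2 docs linked above, a minimal configuration sketch looks like the following. The queue names here are invented examples, and property names changed in later Hadoop versions (e.g. `guaranteed-capacity` later became `capacity`), so check the docs for your release:

```xml
<!-- mapred-site.xml: enable the scheduler and declare the queues -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>research,production</value>
</property>

<!-- capacity-scheduler.xml: guaranteed share per queue, in percent -->
<property>
  <name>mapred.capacity-scheduler.queue.research.guaranteed-capacity</name>
  <value>40</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.production.guaranteed-capacity</name>
  <value>60</value>
</property>
```

Jobs then pick a queue via the `mapred.job.queue.name` property at submission time.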

Re: elastic mapreduce - custom outputformat?

2011-02-03 Thread Sonal Goyal
Yeah, you can use any OutputFormat in your EMR job. If it is file based, it will write to the given file output path; otherwise to Cassandra, a database, or whatever you specify. Thanks and Regards, Sonal Connect Hadoop with databases, Salesforce, FTP servers and others

Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-20 Thread Sonal Goyal
ct too. How can I do? > > After I want create other job that its Mapper reads the output (serialize > object) from previous Reducer. How can I do? > > Thanks Sonal, > > > Joan > > > 2011/1/20 Sonal Goyal > >> Which hadoop version are you on? >> >>

Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-19 Thread Sonal Goyal
> job.getConfiguration().set("mapreduce.job.maps","4"); > job.getConfiguration().set("mapreduce.map.tasks","4"); > > But both configurations don't run. I also try to set "mapred.map.task" but > It neither run. > > Joan &

Re: How to reduce number of splits in DataDrivenDBInputFormat?

2011-01-19 Thread Sonal Goyal
Joan, You should be able to set the mapred.map.tasks property to the maximum number of mappers you want. This can control parallelism. Thanks and Regards, Sonal Connect Hadoop with databases, Salesforce, FTP servers and others
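As a driver-side sketch of the suggestion above (the `job` variable and new-API `Job` class are assumed from Joan's existing code; for file-based input this value is only a hint, but DB-style input formats generally create one split per requested map task):

```java
// Ask the framework for at most 4 map tasks. With DBInputFormat-style
// formats, the number of splits follows this setting directly.
org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
conf.setInt("mapred.map.tasks", 4);
```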

Re: How to split DBInputFormat?

2011-01-04 Thread Sonal Goyal
Hi Hari, I don't think DataDrivenDBInputFormat is available in 0.20.x; it's only available in the 0.21 versions. You can check the hihoApache0.20 branch at https://github.com/sonalgoyal/hiho/ which backports the relevant db formats for Apache Hadoop 0.20 versions. Thanks and Regards, Sonal

Re: How to split DBInputFormat?

2011-01-03 Thread Sonal Goyal
Hi Joan, To get data from the database, you can check the open source framework HIHO at https://github.com/sonalgoyal/hiho/ By providing details of your database and table to import as the configuration values, the split will happen automatically for you. Please feel free to write to me directly

Re: Hadoop 0.21.0 release Maven repo

2010-09-12 Thread Sonal Goyal
e HDFS-1292 and MAPREDUCE-1929. > > Cheers, > Tom > > On Fri, Sep 10, 2010 at 1:33 PM, Sonal Goyal > wrote: > > Hi, > > > > Can someone please point me to the Maven repo for 0.21 release? Thanks. > > > > Thanks and Regards, > > Sonal > > www.meghsoft.com > > http://in.linkedin.com/in/sonalgoyal > > >

Hadoop 0.21.0 release Maven repo

2010-09-10 Thread Sonal Goyal
Hi, Can someone please point me to the Maven repo for 0.21 release? Thanks. Thanks and Regards, Sonal www.meghsoft.com http://in.linkedin.com/in/sonalgoyal

Re: mapreduce for proxy log file analysis

2010-08-01 Thread Sonal Goyal
Hi, Have you checked Hive? Seems to fit your needs perfectly. Thanks and Regards, Sonal www.meghsoft.com http://in.linkedin.com/in/sonalgoyal On Sun, Aug 1, 2010 at 1:40 AM, Bright D L wrote: > Hi all, >I am doing a simple project to analyze http proxy server logs by > hadoop mapreduc
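A minimal sketch of the Hive approach for proxy logs (the table name, columns, delimiter, and HDFS location are all invented for illustration; the real log format would dictate the schema and SerDe):

```sql
-- External table over the raw logs already sitting in HDFS
CREATE EXTERNAL TABLE proxy_logs (ts STRING, client STRING, url STRING, bytes INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/proxy';

-- A typical rollup: requests and traffic per client
SELECT client, COUNT(*) AS requests, SUM(bytes) AS total_bytes
FROM proxy_logs
GROUP BY client;
```

Hive compiles such queries into MapReduce jobs, which is why it fits log-analysis workloads without hand-written mappers and reducers.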

Re: Developping MapReduce functions

2010-06-29 Thread Sonal Goyal
Hi Khaled, Please check: http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html http://hadoop.apache.org/common/docs/r0.20.1/streaming.html Thanks and Rega

Re: Using a custom FileSplitter?

2010-06-23 Thread Sonal Goyal
Hi Steve, Please check FileInputFormat.setInputPathFilter() to choose which file patterns you want to select for your job. If you want to pass a whole file as an input to your mapper, you can create your own InputFormat by subclassing FileInputFormat and override the isSplitable() method. Thanks
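The whole-file case described above can be sketched like this against the 0.20-era `org.apache.hadoop.mapreduce` API. The class name and key/value types are illustrative choices, and the RecordReader body is elided:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Each input file becomes exactly one split, so a single mapper sees
// the whole file (at the cost of data locality for large files).
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // A RecordReader that reads the entire split into one
        // BytesWritable value would be implemented here (sketch only).
        throw new UnsupportedOperationException("sketch only");
    }
}
```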

Re: Problem with DBOutputFormat

2010-06-08 Thread Sonal Goyal
Hi Giridhar, Which version of Hadoop are you using? If you want, you can also load data to MySQL using the hiho framework at http://code.google.com/p/hiho/ Thanks and Regards, Sonal www.meghsoft.com http://in.linkedin.com/in/sonalgoyal On Tue, Jun 8, 2010 at 3:02 PM, Giridhar Addepalli wrote

Re: Need Working example for DBOutputFormat

2010-05-19 Thread Sonal Goyal
Hi Nishant, If MySQL is your target database, you can check open source http://code.google.com/p/hiho/ which uses load data infile for a faster upload to the db. Let me know if you need any help. Thanks and Regards, Sonal www.meghsoft.com On Wed, May 19, 2010 at 1:06 PM, Nishant Sonar wrote:

Re: MultipleOutputs or Partitioner

2010-05-10 Thread Sonal Goyal
Hi Alan, You can use MultipleOutputFormat. You can override the generateFileName...methods to get the functionality you want. A partitioner controls how data moves from the mapper to the reducer, so if you take that approach, you will have to specify the number of reducers as the number of files
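The override mentioned above can be sketched with the old-API `MultipleTextOutputFormat` (package `org.apache.hadoop.mapred.lib`); the Text key/value types and the per-key directory layout are illustrative:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each record to a file path derived from its key,
// e.g. output/<key>/part-00000.
public class KeyBasedOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
        // "name" is the default leaf name (e.g. part-00000);
        // prefix it with the key to fan records out by key
        return key.toString() + "/" + name;
    }
}
```

This keeps the mapper-to-reducer flow unchanged, unlike the Partitioner approach, which would tie the number of output files to the number of reducers.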

Re: counting pairs of items across item types

2010-04-25 Thread Sonal Goyal
Hi Sebastian, With HIHO, you can supply a sql query which joins tables in the database and get the results to Hadoop. Say, you want to get the following data from your table to Hadoop: select table1.col1, table2.col2 from table1, table2 where table1.id = table2.addressId If you check DBInputForm

Re: counting pairs of items across item types

2010-04-23 Thread Sonal Goyal
Hi Sebastian, You could use the HIHO framework for querying and extracting data from the database and getting it to Hadoop. It supports table joins. More here: http://code.google.com/p/hiho/ If you need any help, please feel free to contact me directly. Thanks and Regards, Sonal www.meghsoft.co

Re: Trying to figure out possible causes of this exception

2010-04-07 Thread Sonal Goyal
Hi Kris, It seems your program cannot find the input file. Have you done a hadoop fs -ls to verify that the file exists? Also, the path URL should be hdfs://.. Thanks and Regards, Sonal www.meghsoft.com On Wed, Apr 7, 2010 at 1:16 AM, Kris Nuttycombe wrote: > Exception in thread "main" java
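The two checks suggested above look like this on the command line (paths, jar name, and namenode host/port are placeholders for illustration):

```shell
# 1. Verify the input actually exists in HDFS
hadoop fs -ls /user/kris/input

# 2. Use a full hdfs:// URI for the input path when running the job
hadoop jar myjob.jar MyJob hdfs://namenode:9000/user/kris/input /user/kris/output
```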