Hadoop Presentation at Ankara / Turkey

2009-04-16 Thread Enis Soztutar

Hi all,

I will be giving a presentation on Hadoop at "1. Ulusal Yüksek Başarım 
ve Grid Konferansı" tomorrow (Apr 17, 13:10). The conference is located 
at KKM, ODTU/Ankara/Turkey. The presentation will be in Turkish. All 
Hadoop users and wanna-be users in the area are welcome to attend.


More info can be found at : http://basarim09.ceng.metu.edu.tr/

Cheers,
Enis Söztutar



Re: what change to be done in OutputCollector to print custom writable object

2009-04-01 Thread Enis Soztutar

Deepak Diwakar wrote:

Hi,

I am learning how to get a custom Writable working, so I have implemented a
simple MyWritable class.

I can work with MyWritable objects inside the map and reduce functions. But
suppose the reduce values are MyWritable objects and I put them into the
OutputCollector to get the final output. Since the value is a custom object,
what I get in the output file is only an object reference.

What changes or additions do I have to make so that the code that prints to
the output file handles the custom Writable object?

Thanks & regards,
  

Just implement toString() in your MyWritable class. TextOutputFormat calls 
toString() on the keys and values when writing the final output, so that is 
what ends up in the file.
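For illustration, a minimal sketch of such a Writable. The two int fields and 
their names are made up for the example; only the toString() part is the point:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
  private int first;
  private int second;

  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  // TextOutputFormat calls toString() when writing the final output,
  // so this controls what ends up in the part-xxxxx files.
  public String toString() {
    return first + "\t" + second;
  }
}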


Re: Small Test Data Sets

2009-03-25 Thread Enis Soztutar

Patterson, Josh wrote:

I want to confirm something with the list that I'm seeing:

I needed to confirm that my Reader was reading our file format
correctly, so I created a MR job that simply output each K/V pair to the
reducer, which then just wrote out each one to the output file. This
allows me to check by hand that all K/V points of data from our file
format are getting pulled out of the file correctly. I have set up our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.

While running some basic tests on a small (1 MB) single file I noticed
something odd --- I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.

I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output.
Is it safe to assume when testing on a file that is smaller than the block
size that I should always use -m 1 in order to get the proper block->mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved you will get duplicate values? I may have set
something wrong, I just wanted to check. Thanks
 
Josh Patterson

TVA
 

  
If you have developed your own InputFormat, then the problem might be there.
The job of the InputFormat is to create the input splits and the record
readers. For one file and two mappers, the input format should return two
splits, each representing half of the file. In your case, I assume you are
returning two splits that each cover the whole file.

Is this the case?
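For reference, a minimal sketch of what the two splits over a single file 
should look like (this mirrors what FileInputFormat does; fs, path and the 
variable names are just illustrative):

// For one file and numSplits == 2, getSplits() should return two
// non-overlapping halves (fs and path are assumed to be in scope):
long len = fs.getFileStatus(path).getLen();
long half = len / 2;
InputSplit[] splits = new InputSplit[] {
    new FileSplit(path, 0, half, (String[]) null),          // first half
    new FileSplit(path, half, len - half, (String[]) null)  // second half
};
// Returning FileSplit(path, 0, len, ...) twice instead would make both
// mappers read the whole file and produce the duplicated records described above.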

Enis


Re: merging files

2009-03-18 Thread Enis Soztutar
Use MultipleInputs with two different mappers for the two inputs. Mapper 1 
should be the IdentityMapper; mapper 2 should output key/value pairs where 
the value is a pseudo marker value (the same for all keys), which marks that 
the value is null/empty. In the reducer, output only the key/value pairs 
whose set of values does not include the marker value.


in your example suppose that we use -1 as a marker value, then in 
mapper2, the output will be

4, -1
2, -1

and the reducer will get :

2, {1,3,5,-1}
3, {1,2}
4, {7,9,-1}
6, {3}

then reducer will output :

3, 1
3, 2
6, 3
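
For illustration, a rough sketch of how this could be wired up with the old 
mapred API (MultipleInputs ships with 0.19+; class names and paths are made 
up, and the mappers here parse the comma-separated lines themselves rather 
than using the IdentityMapper so that keys and values line up for the reducer):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class ExcludeJob {

  // file1 lines look like "key,value"; pass them through as (key, value).
  public static class File1Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] parts = line.toString().split(",", 2);
      out.collect(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // file2 lines are just "key"; emit the marker value -1 for each key.
  public static class File2Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(new Text(line.toString().trim()), new Text("-1"));
    }
  }

  // Buffer the values for a key; if the marker shows up, drop the key entirely.
  public static class ExcludeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      List<Text> buffered = new ArrayList<Text>();
      boolean marked = false;
      while (values.hasNext()) {
        Text v = values.next();
        if ("-1".equals(v.toString())) {
          marked = true;
        } else {
          buffered.add(new Text(v));
        }
      }
      if (!marked) {
        for (Text v : buffered) {
          out.collect(key, v);
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ExcludeJob.class);
    MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, File1Mapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, File2Mapper.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    conf.setReducerClass(ExcludeReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    JobClient.runJob(conf);
  }
}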



Nir Zohar wrote:

Hi,

I would like your help with the question below.

I have 2 files: file1 (key, value) and file2 (key only), and I need to
exclude from file1 all records whose keys appear in file2.

1. The output format is key-value, not only keys.

2. The key is not a primary key, hence a simple join at the end is not
possible.

Can you assist?

Thanks,

Nir.

Example:

file1:

2,1
2,3
2,5
3,1
3,2
4,7
4,9
6,3

file2:

4
2

Output:

3,1
3,2
6,3




Re: how to optimize mapreduce procedure??

2009-03-13 Thread Enis Soztutar

ZhiHong Fu wrote:

Hello,

   I'm writing a program that performs Lucene searches over about 12 index
directories, all of which are stored in HDFS. It works like this:
1. We build about 12 index directories through Lucene's indexing
functionality, each of about 100 MB in size.
2. We store these 12 index directories on HDFS; the Hadoop cluster is made
up of one namenode and five datanodes, 6 machines in total.
3. We then run Lucene searches over these 12 index directories. The
mapreduce steps are as follows:
Map: the 12 index directories are split into numOfMapTasks groups; for
example, if numOfMapTasks=3, each map gets 4 index directories and stores
them in an intermediate result.
Combine: for each intermediate result, we run the actual Lucene search
against the index directories it contains and store the hits in the
intermediate result.
Reduce: merge the intermediate results' hits to produce the final search
result.

But with this implementation I have a performance problem. Whatever values I
set numOfMapTasks and numOfReduceTasks to, for example numOfMapTasks=12 and
numOfReduceTasks=5, a simple search takes about 28 seconds, which is clearly
unacceptable.
So I'm not sure whether I designed the map-reduce procedure wrongly or set
the wrong number of map or reduce tasks, and more generally where the
overhead of the mapreduce procedure comes from. Any suggestion will be
appreciated.
Thanks.
  

Keeping the indexes on HDFS is not the best choice. Moreover, mapreduce does
not fit the problem of distributed search over several nodes: the overhead of
starting a new job for every search is not acceptable.
You can use Nutch distributed search or Katta (not sure about the name)
for this.


Re: HADOOP-2536 supports Oracle too?

2009-02-17 Thread Enis Soztutar
There is nothing special about the JDBC driver library. I guess that you 
have added the jar from the IDE (NetBeans), but did not include the 
necessary libraries (the JDBC driver in this case) in TableAccess.jar.

The standard way is to include the dependent jars in the project's jar 
under the lib directory. For example:


example.jar
 -> META-INF
 -> com/...
 -> lib/postgres.jar
 -> lib/abc.jar

If your classpath is correct, check whether you call 
DBConfiguration.configureDB() with the correct driver class and url.
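
Once the driver jar is on the classpath, the call would look roughly like this 
(old mapred API; the job class, connection URL and credentials below are 
placeholders):

JobConf conf = new JobConf(TableAccessJob.class);
// Driver class and connection URL must match the database you actually use.
DBConfiguration.configureDB(conf,
    "org.postgresql.Driver",
    "jdbc:postgresql://dbhost:5432/mydb",
    "dbuser", "dbpassword");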


sandhiya wrote:

Hi,
I'm using PostgreSQL and the driver is not getting detected. How do you run
it in the first place? I just typed

bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar

at the terminal, without quotes. I didn't copy any files from my local
drives into the Hadoop file system. I get an error like this:

java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.postgresql.Driver

and then the complete stack trace.

Am I doing something wrong?
I downloaded a jar file for PostgreSQL JDBC support and included it in my
Libraries folder (I'm using NetBeans).
Please help

Fredrik Hedberg-3 wrote:
  

Hi,

Although it's not MySQL; this might be of use:

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/DBCountPageView.java


Fredrik

On Feb 16, 2009, at 8:33 AM, sandhiya wrote:



@Amandeep
Hi,
I'm new to Hadoop and am trying to run a simple database connectivity
program on it. Could you please tell me how u went about it?? my  
mail id is
"sandys_cr...@yahoo.com" . A copy of your code that successfully  
connected

to MySQL will also be helpful.
Thanks,
Sandhiya

Enis Soztutar-2 wrote:
  

From the exception:

java.io.IOException: ORA-00933: SQL command not properly ended

I would broadly guess that the Oracle JDBC driver might be complaining
that the statement does not end with ";", or something similar. You can:
1. download the latest source code of hadoop
2. add a print statement printing the query (probably in
DBInputFormat:119)
3. build the hadoop jar
4. use the new hadoop jar to see the actual SQL query
5. run the query on Oracle to see if it gives an error.

Enis


Amandeep Khurana wrote:


Ok. I created the same database in a MySQL database and ran the same hadoop
job against it. It worked. So, that means there is some Oracle specific
issue. It can't be an issue with the JDBC drivers since I am using the same
drivers in a simple JDBC client.

What could it be?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana 
wrote:


  
Ok. I'm not sure if I got it correct. Are you saying I should test the
statement that hadoop creates directly with the database?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar 
wrote:



Hadoop-2536 connects to the db via JDBC, so in theory it should work
with proper jdbc drivers.
It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle.

To answer your earlier question, the actual SQL statements might not be
recognized by Oracle, so I suggest the best way to test this is to insert
print statements, and run the actual SQL statements against Oracle to see
if the syntax is accepted.

We would appreciate if you publish your results.

Enis


Amandeep Khurana wrote:


  
Does the patch HADOOP-2536 support connecting to Oracle databases as well?
Or is it just limited to MySQL?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz





  


--
View this message in context:
http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22032715.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

  





  


Re: re : How to use MapFile in C++ program

2009-02-06 Thread Enis Soztutar
There is currently no way to read MapFiles in any language other than 
Java. You can write a JNI wrapper similar to libhdfs.
Alternatively, you can also write the complete stack from scratch, 
however this might prove very difficult or impossible. You might want to 
check the ObjectFile/TFile specifications for which binary compatible 
reader/writers can be developed in any language :


https://issues.apache.org/jira/browse/HADOOP-3315

Enis

Anh Vũ Nguyễn wrote:

Hi, everybody. I am writing a project in C++ and want to use the power of
the MapFile class (which belongs to org.apache.hadoop.io) of hadoop. Can you
please tell me how I can write code in C++ using MapFile, or is there no way
to use the org.apache.hadoop.io API in C++ (libhdfs only helps with
org.apache.hadoop.fs)?
Thanks in advance!

  


Re: How to use DBInputFormat?

2009-02-05 Thread Enis Soztutar

Please see below,

Stefan Podkowinski wrote:

As far as I understand, the main problem is that you need to create
splits from streaming data with an unknown number of records and
offsets. It's just the same problem as with externally compressed data
(.gz). You need to go through the complete stream (or do a table scan)
to create logical splits. Afterwards each map task needs to seek to
the appropriate offset on a new stream all over again. Very expensive. As
with compressed files, no wonder only one map task is started for each
.gz file and will consume the complete file.

I cannot see an easy way to split the JDBC stream and pass them to nodes.

IMHO the DBInputFormat
should follow this behavior and just create 1 split whatsoever.

Why would we want to limit ourselves to one split, which effectively amounts
to sequential computation?

Maybe a future version of hadoop will allow to create splits/map tasks
on the fly dynamically?
  
It is obvious that input residing in one database is not optimal for 
hadoop, and in any case (even with sharding)
DB I/O would be the bottleneck. I guess the DBInput/Output formats should be 
used when the data is small but the computation is costly.



Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg  wrote:
  

Indeed sir.

The implementation was designed like you describe for two reasons. First and
foremost to make it as simple as possible for the user to use a JDBC
database as input and output for Hadoop. Secondly because of the specific
requirements the MapReduce framework brings to the table (split
distribution, split reproducibility etc).

This design will, as you note, never handle the same amount of data as HBase
(or HDFS), and was never intended to. That being said, there are a couple of
ways that the current design could be augmented to perform better (and, in
its current form, tweaked, depending on your data and computational
requirements). Shard awareness is one way, which would let each
database/tasktracker node execute mappers on data where each split is a
single database server, for example.

If you have any ideas on how the current design can be improved, please do
share.


Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:



The 0.19 DBInputFormat class implementation is IMHO only suitable for
very simple queries working on only a few datasets. That's due to the
fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge
performance impact on large tables)
2) creating splits by issuing an individual query for each split with
a "limit" and "offset" parameter appended to the input sql query

Effectively your input query "select * from orders" would become
"select * from orders limit  offset " and would be executed until the count
has been reached. I guess this is not valid sql syntax for oracle.

Stefan


2009/2/4 Amandeep Khurana :
  

Adding a semicolon gives me the error "ORA-00911: Invalid character"

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS  wrote:



Amandeep,
"SQL command not properly ended"
I get this error whenever I forget the semicolon at the end.
I know, it doesn't make sense, but I recommend giving it a try

Rasit

2009/2/4 Amandeep Khurana :
  

The same query is working if I write a simple JDBC client and query the
database. So, I'm probably doing something wrong in the connection
settings.
But the error looks to be on the query side more than the connection
side.
  

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana 


wrote:
  

Thanks Kevin

I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
  at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LoadTable1.run(LoadTable1.java:130)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at LoadTable1.main(LoadTable1.java:107)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Nativ

Re: HADOOP-2536 supports Oracle too?

2009-02-05 Thread Enis Soztutar

From the exception :

java.io.IOException: ORA-00933: SQL command not properly ended

I would broadly guess that the Oracle JDBC driver might be complaining that 
the statement does not end with ";", or something similar. You can:

1. download the latest source code of hadoop
2. add a print statement printing the query (probably in DBInputFormat:119)
3. build the hadoop jar
4. use the new hadoop jar to see the actual SQL query
5. run the query on Oracle to see if it gives an error.

Enis


Amandeep Khurana wrote:

Ok. I created the same database in a MySQL database and ran the same hadoop
job against it. It worked. So, that means there is some Oracle specific
issue. It can't be an issue with the JDBC drivers since I am using the same
drivers in a simple JDBC client.

What could it be?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana  wrote:

  

Ok. I'm not sure if I got it correct. Are you saying, I should test the
statement that hadoop creates directly with the database?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar  wrote:



Hadoop-2536 connects to the db via JDBC, so in theory it should work with
proper jdbc drivers.
It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle.

To answer your earlier question, the actual SQL statements might not be
recognized by Oracle, so I suggest the best way to test this is to insert
print statements, and run the actual SQL statements against Oracle to see if
the syntax is accepted.

We would appreciate if you publish your results.

Enis


Amandeep Khurana wrote:

  

Does the patch HADOOP-2536 support connecting to Oracle databases as
well?
Or is it just limited to MySQL?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz






  


Re: HADOOP-2536 supports Oracle too?

2009-02-04 Thread Enis Soztutar
Hadoop-2536 connects to the db via JDBC, so in theory it should work 
with proper jdbc drivers.

It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle.

To answer your earlier question, the actual SQL statements might not be 
recognized by Oracle, so I suggest the best way to test this is to 
insert print statements, and run the actual SQL statements against 
Oracle to see if the syntax is accepted.


We would appreciate if you publish your results.

Enis

Amandeep Khurana wrote:

Does the patch HADOOP-2536 support connecting to Oracle databases as well?
Or is it just limited to MySQL?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

  


Re: how to pass an object to mapper

2008-12-23 Thread Enis Soztutar
There are several ways you can pass static information to tasks in 
Hadoop. The first is to store it in the conf via DefaultStringifier, which 
needs the object to be serializable either through the Writable or 
Serializable interfaces. A second way would be to save/serialize the data 
to a file and ship it via the DistributedCache. Another way would be to put 
the file in the job jar and read it from there.
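
A rough sketch of the DistributedCache route (org.apache.hadoop.filecache.DistributedCache, 
old mapred API); the cache file path and the MyLookup class with its readFrom() helper are 
made up for illustration:

// Submit side: ship a file that holds the serialized object.
//   DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), jobConf);

// Task side: read the local copy back in configure().
public static class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MyLookup lookup;

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      lookup = MyLookup.readFrom(cached[0]);   // deserialize however it was written
    } catch (IOException e) {
      throw new RuntimeException("could not load cached lookup data", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    out.collect(new Text(lookup.translate(value.toString())), value);
  }
}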


forbbs forbbs wrote:

It seems that JobConf doesn't help. Do I have to write the object into DFS?

  


Re: Anyone have a Lucene index InputFormat for Hadoop?

2008-11-12 Thread Enis Soztutar
I recommend you check Nutch's source, which includes classes for index 
input/output from mapred.


Anthony Urso wrote:

Anyone have a Lucene index InputFormat already implemented?  Failing
that, how about a Writable for the Lucene Document class?

Cheers,
Anthony

  




Re: Hadoop with image processing

2008-10-15 Thread Enis Soztutar

From my understanding of the problem, you can
- keep the image binary data in sequence files
- copy the image whose similar images will be searched for to dfs with high 
replication
- in each map, calculate the similarity to that image
- output only the similar images from the map
- skip the reduce step, since none is needed

I am not sure whether splitting the image into 4 parts and analyzing the parts 
individually will make
any difference, since the above algorithm already distributes the computation to 
all nodes.
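
A very rough sketch of the map side under these assumptions: the images live in 
a SequenceFile of (Text filename, BytesWritable image bytes), the reference 
image is available on every node (e.g. via the DistributedCache or a highly 
replicated dfs path), and similarity() stands in for whatever 4-part comparison 
the application actually uses:

public static class SimilarityMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, FloatWritable> {

  private byte[] referenceImage;                 // the query image, loaded once per task
  private static final float THRESHOLD = 0.9f;   // made-up cutoff

  public void configure(JobConf job) {
    // Load the reference image here, e.g. from a file shipped via the
    // DistributedCache or from a highly replicated dfs path (omitted).
  }

  public void map(Text fileName, BytesWritable imageData,
      OutputCollector<Text, FloatWritable> out, Reporter reporter) throws IOException {
    float score = similarity(referenceImage, imageData.getBytes());
    if (score >= THRESHOLD) {                    // emit only the similar images
      out.collect(fileName, new FloatWritable(score));
    }
  }

  // Placeholder for the application's 4-part comparison.
  private float similarity(byte[] reference, byte[] candidate) {
    return 0f;
  }
}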


Raşit Özdaş wrote:
Hi to all, I'm a new subscriber of the group; I have started to work on a hadoop-based project.
In our application there are a huge number of images with a regular pattern, differing in 4 parts/blocks.

The system takes an image as input and looks for similar images, considering whether
all these 4 parts match.
(The system finds all the matches, not just the first one.)
Each of these parts is independent; the result for each part is computed separately, the results are
printed on the screen, and then an average matching percentage is calculated from them.

(I can write more detailed information if needed.)

Could you suggest a structure, or any ideas to get a better result?

Images can be divided into 4 parts, I see that. But the folder structure of the images
is important and
I have no idea about that. Images are kept in a DB (this can be changed if a folder
structure is better).
Would two stages of map-reduce operations be better? First, one map-reduce for each image,
then a second map-reduce for every part of one image.

But as far as I know, the slowest computation slows down the whole operation.

This is where I am now.

Thanks in advance..

  




Re: 1 file per record

2008-09-24 Thread Enis Soztutar
Nope, not right now. But this has come up before. Perhaps you will 
contribute one?



chandravadana wrote:

thanks

is there any built in record reader which performs this function..



Enis Soztutar wrote:
  

Yes, you can use MultiFileInputFormat.

You can extend the MultiFileInputFormat to return a RecordReader, which 
reads a record for each file in the MultiFileSplit.


Enis

chandra wrote:


hi..

By setting isSplitable to false, we can have 1 file with n records go to 1 mapper.

Is there any way to set 1 complete file per record?

Thanks in advance
Chandravadana S






  
  





  


Re: Questions about Hadoop

2008-09-24 Thread Enis Soztutar



Arijit Mukherjee wrote:

Thanx Enis.

By workflow, I was trying to mean something like a chain of MapReduce
jobs - the first one will extract a certain amount of data from the
original set and do some computation resulting in a smaller summary,
which will then be the input to a further MR job, and so on...somewhat
similar to a workflow as in the SOA world.

  
Yes, you can always chain jobs together to form a final summary. 
o.a.h.mapred.jobcontrol.JobControl might be interesting for you.
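
A sketch of chaining two jobs with o.a.h.mapred.jobcontrol (the job 
configuration details and exception handling are trimmed, and 
ExtractJob/SummaryJob are placeholder names):

JobConf extractConf = new JobConf(ExtractJob.class);   // first pass over the raw data
JobConf summaryConf = new JobConf(SummaryJob.class);   // second pass over the first pass' output

Job extract = new Job(extractConf);
Job summary = new Job(summaryConf);
summary.addDependingJob(extract);            // summary only starts after extract succeeds

JobControl control = new JobControl("telecom-workflow");
control.addJob(extract);
control.addJob(summary);

new Thread(control).start();                 // JobControl implements Runnable
while (!control.allFinished()) {
  Thread.sleep(5000);                        // poll until both jobs are done
}
control.stop();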

Is it possible to use statistical analysis tools such as R (or say PL/R)
within MapReduce on Hadoop? As far as I've heard, Greenplum is working
on a custom MapReduce engine over their Greenplum database which will
also support PL/R procedures.
  
Using R on Hadoop might include some level of custom coding. If you are 
looking for an ad-hoc tool for data mining, then check Pig and Hive.


Enis

Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 2:57 PM

To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop


Hi,

Arijit Mukherjee wrote:
  

Hi

We've been thinking of using Hadoop for a decision making system which
will analyze telecom-related data from various sources to take certain
decisions. The data can be huge, of the order of terabytes, and can be
stored as CSV files, which I understand will fit into Hadoop as Tom
White mentions in the Rough Cut Guide that Hadoop is well suited for
records. The question I want to ask is whether it is possible to
perform statistical analysis on the data using Hadoop and MapReduce.
If anyone has done such a thing, we'd be very interested to know about
it. Is it also possible to create a workflow like functionality with
MapReduce?


Hadoop can handle TB data sizes, and statistical data analysis is one of the
perfect things that fit into the mapreduce computation model. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by
Hadoop/mapreduce is
to read the Mapreduce paper by Dean et al. Also you might be interested
in checking out
Mahout, which is a subproject of Lucene. They are doing ML on top of
Hadoop.

Hadoop is mostly suitable for batch jobs, however these jobs can be
chained together to
form a workflow. I will try to be more helpful if you could expand on what
you mean by workflow.

Enis Soztutar

  

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com


  





  


Re: 1 file per record

2008-09-24 Thread Enis Soztutar

Yes, you can use MultiFileInputFormat.

You can extend the MultiFileInputFormat to return a RecordReader, which 
reads a record for each file in the MultiFileSplit.


Enis

chandra wrote:

hi..

By setting isSplitable to false, we can have 1 file with n records go to 1 mapper.

Is there any way to set 1 complete file per record?

Thanks in advance
Chandravadana S






  




Re: Questions about Hadoop

2008-09-24 Thread Enis Soztutar

Hi,

Arijit Mukherjee wrote:

Hi

We've been thinking of using Hadoop for a decision making system which
will analyze telecom-related data from various sources to take certain
decisions. The data can be huge, of the order of terabytes, and can be
stored as CSV files, which I understand will fit into Hadoop as Tom
White mentions in the Rough Cut Guide that Hadoop is well suited for
records. The question I want to ask is whether it is possible to perform
statistical analysis on the data using Hadoop and MapReduce. If anyone
has done such a thing, we'd be very interested to know about it. Is it
also possible to create a workflow like functionality with MapReduce?
  
Hadoop can handle TB data sizes, and statistical data analysis is one of 
the perfect things that fit into the mapreduce computation model. You can 
check what people are doing with Hadoop at 
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by 
Hadoop/mapreduce is to read the Mapreduce paper by Dean et al. Also you might 
be interested in checking out Mahout, which is a subproject of Lucene. They 
are doing ML on top of Hadoop.

Hadoop is mostly suitable for batch jobs, however these jobs can be 
chained together to form a workflow. I will try to be more helpful if you 
could expand on what you mean by workflow.


Enis Soztutar


Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


  




Re: Any InputFormat class implementation for Database records

2008-08-26 Thread Enis Soztutar

There is a patch you can try it out and share your xp :
https://issues.apache.org/jira/browse/HADOOP-2536

ruchir wrote:

Hi,

 


I want to know whether there is any implementation of the InputFormat class in
Hadoop which can read data from a database instead of from HDFS while
processing a hadoop job. We have an application which gets data from users,
stores it in HDFS and runs a hadoop job on that data. Now we have users
whose data is already stored in a database. Is there any such implementation
which we can use to read data from the database directly?

I am aware of one such implementation, HBase, but heard that it is not
production ready.

 


Thanks in advance,

Ruchir


  




Re: How I should use hadoop to analyze my logs?

2008-08-15 Thread Enis Soztutar
You can use chukwa, which is a contrib in the trunk for collecting log 
entries from web servers. You can run adaptors in the web servers, and a 
collector in the log server. The log entries may not be analyzed in real 
time, but it should be close to real time.

I suggest you use pig, for log data analysis.

Juho Mäkinen wrote:

Hello,

I'm looking at how Hadoop could support our data mining applications and I've
come up with a few questions to which I haven't found any answers yet.
Our setup contains multiple diskless webserver frontends which
generate log data. Each webserver hit generates a UDP packet which
contains basically the same info as a normal apache access log line
(url, return code, client ip, timestamp etc). The udp packet is
received by a log server. I want to run map/reduce processes
on the log data at the same time as the servers are generating new
data. I was planning that each day would have its own file in HDFS
which contains all log entries for that day.

How should I use hadoop and HDFS to write each log entry to a file? I
was planning to create a class which contains the request
attributes (url, return code, client ip etc) and use this as the
value. I did not find any info on how this could be done with HDFS. The
api seems to support arbitrary objects as both key and value, but there
was no example of how to do this.

How will Hadoop handle the concurrency of the writes and the reads?
The servers will generate log entries around the clock. I also want to
analyse the log entries at the same time as the servers are
generating new data. How can I do this? The HDFS architecture page
says that the client writes the data first into a local file and once
the file has reached the block size, the file is transferred to
the HDFS storage nodes and the client writes the following data to
another local file. Is it possible to read the blocks already
transferred to HDFS using the map/reduce processes and write new
blocks to the same file at the same time?

Thanks in advance,

 - Juho Mäkinen

  




Re: MultiFileInputFormat and gzipped files

2008-08-05 Thread Enis Soztutar
MultiFileWordCount uses its own RecordReader, namely 
MultiFileLineRecordReader. This is different from LineRecordReader, 
which automatically detects the file's codec and decodes it.

You can write a custom RecordReader similar to LineRecordReader and 
MultiFileLineRecordReader, or add codec support to MultiFileLineRecordReader.
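
A fragment of what the codec handling could look like inside such a reader 
(conf, fs, split and count are assumed to be fields of the RecordReader; 
this mirrors what LineRecordReader does for single files):

// Open each file of the MultiFileSplit through its codec (if any),
// so that .gz inputs are decompressed transparently.
CompressionCodecFactory codecFactory = new CompressionCodecFactory(conf);

Path file = split.getPath(count);
FSDataInputStream raw = fs.open(file);
CompressionCodec codec = codecFactory.getCodec(file);          // null for plain files
InputStream in = (codec != null) ? codec.createInputStream(raw) : raw;
BufferedReader reader = new BufferedReader(new InputStreamReader(in));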



Michele Catasta wrote:

Hi all,

I'm writing some Hadoop jobs that should run on a collection of
gzipped files. Everything is already working correctly with
MultiFileInputFormat and an initial step of gunzip extraction.
Considering that Hadoop recognizes and handles .gz files correctly (at
least with a single file input), I was wondering if it's able to do
the same with file collections, so that I can avoid the overhead of
sequential file extraction.
I tried to run the multi file WordCount example with a bunch of
gzipped text files (0.17.1 installation), and I get wrong output
(neither correct nor empty). With my own InputFormat (not really
different from the one in multiflewc), I got no output at all (map
input record counter = 0).

Is this the desired behavior? Are there some technical reasons why it's
not working in a multi file scenario?
Thanks in advance for the help.


Regards,
  Michele Catasta

  




Re: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Enis Soztutar
Yes, please open a jira for this. We should ensure that 
avgLengthPerSplit in MultiFileInputFormat does not exceed the default file 
block size. However, unlike FileInputFormat, the files in a split may each 
come from a different block.



Goel, Ankur wrote:

In this case I have to compute the number of map tasks in the
application - (totalSize / blockSize), which is what I am doing as a
work-around.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for the same ?

-Ankur


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 11, 2008 7:21 PM

To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers

MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
jobConf#setNumMapTasks() or via command line arg -D
mapred.map.tasks=


Goel, Ankur wrote:
  

Hi Folks,
          I am using hadoop to process some temporal data which is
split into a lot of small files (~ 3 - 4 MB). Using TextInputFormat
resulted in too many mappers (1 per file) creating a lot of overhead,
so I switched to MultiFileInputFormat -
(MultiFileWordCount.MyInputFormat) which resulted in just 1 mapper.

I was hoping to set the no of mappers to 1 so that hadoop
automatically takes care of generating the right number of map tasks.

Looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers, or am I missing
something? Please advise.

Thanks

-Ankur

  



  


Re: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Enis Soztutar
MultiFileSplit currently does not support automatic map task count 
computation. You can manually
set the number of maps via jobConf#setNumMapTasks() or via command line 
arg -D mapred.map.tasks=
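
A small sketch of doing this programmatically, along the lines of the 
totalSize / blockSize work-around Ankur describes in the follow-up (fs, 
inputPaths and jobConf are assumed to be in scope):

// Compute the map count from the total input size, since
// MultiFileInputFormat will not do it automatically.
long totalSize = 0;
for (Path p : inputPaths) {                       // the small input files
  totalSize += fs.getFileStatus(p).getLen();
}
long blockSize = fs.getDefaultBlockSize();
int numMaps = Math.max(1, (int) (totalSize / blockSize));
jobConf.setNumMapTasks(numMaps);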



Goel, Ankur wrote:

Hi Folks,
  I am using hadoop to process some temporal data which is
split in lot of small files (~ 3 - 4 MB)
Using TextInputFormat resulted in too many mappers (1 per file) creating
a lot of overhead so I switched to
MultiFileInputFormat - (MutiFileWordCount.MyInputFormat) which resulted
in just 1 mapper.
 
I was hoping to set the no of mappers to 1 so that hadoop automatically

takes care of generating the right
number of map tasks.
 
Looks like when using MultiFileInputFormat one has to rely on the

application to specify the right number of mappers
or am I missing something ? Please advise.
 
Thanks

-Ankur

  


Re: edge count question

2008-06-27 Thread Enis Soztutar



Cam Bazz wrote:

Hello,

When I use a custom input format, as in the nutch project, do I have to
keep my index in DFS, or in the regular file system?


You have to ensure that your indexes are accessible by the map/reduce 
tasks, i.e. by using hdfs, s3, nfs, kfs, etc.

By the way, are there any alternatives to nutch?


Yes, of course. There are all sorts of open source crawlers / indexers.


Best Regards


-C.B.

On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

  

Cam Bazz wrote:



hello,

I have a lucene index storing documents which holds src and dst words.
word
pairs may repeat. (it is a multigraph).

I want to use hadoop to count how many of the same word pairs there are. I
have looked at the aggregateword count example, and I understand that if I
make a txt file
such as

src1>dst2
src2>dst2
src1>dst2

..

and use something similar to the aggregate word count example, I will get
the result desired.

Now questions. how can I hookup my lucene index to hadoop. is there a
better
way than dumping the index to a text file with >'s, copying this to dfs
and
getting the results back?


  

Yes, you can implement an InputFormat to read from the lucene index. You
can use the implementation in the nutch project, the classes
DeleteDuplicates$InputFormat, DeleteDuplicates$DDRecordReader.



how can I make incremental runs? (once the index processed and I got the
results, how can I dump more data onto it so it does not start from
beginning)


  

As far as I know, there is no easy way to do this. Why do you keep your data
as a lucene index?



Best regards,

-C.B.



  


  


Re: edge count question

2008-06-27 Thread Enis Soztutar

Cam Bazz wrote:

hello,

I have a lucene index storing documents which holds src and dst words. word
pairs may repeat. (it is a multigraph).

I want to use hadoop to count how many of the same word pairs there are. I
have looked at the aggregateword count example, and I understand that if I
make a txt file
such as

src1>dst2
src2>dst2
src1>dst2

..

and use something similar to the aggregate word count example, I will get
the result desired.

Now questions. how can I hookup my lucene index to hadoop. is there a better
way than dumping the index to a text file with >'s, copying this to dfs and
getting the results back?
  
Yes, you can implement an InputFormat to read from the lucene index. You 
can use the implementation in the nutch project, the classes 
DeleteDuplicates$InputFormat, DeleteDuplicates$DDRecordReader.

how can I make incremental runs? (once the index processed and I got the
results, how can I dump more data onto it so it does not start from
beginning)
  
As far as I know, there is no easy way to do this. Why do you keep your 
data as a lucene index?

Best regards,

-C.B.

  


Re: Hadoop supports RDBMS?

2008-06-25 Thread Enis Soztutar
Yes, there is a way to use a DBMS over JDBC. The feature is not released 
yet, but you can try it out and give valuable feedback to us.
You can find the patch and the jira issue at : 
https://issues.apache.org/jira/browse/HADOOP-2536
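
A rough sketch of what using it looks like, with the API as it appears in 
the HADOOP-2536 patch (DBInputFormat, DBConfiguration and DBWritable in 
org.apache.hadoop.mapred.lib.db); the driver, URL, table and column names 
are placeholders:

// A DBWritable mapping one row of a hypothetical "orders" table.
public class OrderRecord implements Writable, DBWritable {
  long id;
  String customer;

  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong(1);
    customer = rs.getString(2);
  }
  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setString(2, customer);
  }
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    customer = Text.readString(in);
  }
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    Text.writeString(out, customer);
  }
}

// Job setup: point the job at the database instead of HDFS input paths.
JobConf conf = new JobConf(MyDBJob.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost/mydb", "user", "password");
DBInputFormat.setInput(conf, OrderRecord.class, "orders",
    null /* conditions */, "id" /* orderBy */, "id", "customer");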



Lakshmi Narayanan wrote:

Has anyone tried using any RDBMS with hadoop? If the data is stored in
a database, is there any way we can use mapreduce with the database
instead of HDFS?

  


Re: How long is Hadoop full unit test suit expected to run?

2008-06-17 Thread Enis Soztutar
Just a standard box like yours (ubuntu). I suspect something must be 
wrong. Could you determine which tests take a long time? On hudson, it 
seems that the tests take 1:30 on average. 
http://hudson.zones.apache.org/hudson/view/Hadoop/



Lukas Vlcek wrote:

Hi,

I am aware of the HowToContribute wiki page, but in my case [ant test] takes more
than one hour. I cannot tell you how much time it takes because I always
stopped it after 4-5 hours...

I was running these tests on a Dell notebook, dual core, 1GB of RAM, Windows XP.
I haven't tried it now after switching to Ubuntu, but I think this should not
make a big difference. What kind of HW are you using for Hadoop testing? I
would definitely appreciate it if [ant test] ran under an hour.

Regards,
Lukas

On Tue, Jun 17, 2008 at 10:26 AM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

  

Lukas Vlcek wrote:



Hi,

How long is Hadoop full unit test suit expected to run?
How do you go about running Hadoop tests? I found that it can take hours
for
[ant test] target to run which does not seem to be very efficient for
development.
Is there anything I can do to speed up tests (like running Hadoop in a
real
cluster)?


  

Hi,

Yes, ant test can take up to an hour. On my machine it completes in less
than an hour. Previously work has been done to reduce the time that tests
take, but some of the tests take a long time by nature such as testing dfs
balance, etc.
There is an issue to implement ant test-core as a mapred job so that it can
be submitted to a cluster. That would help a lot.



Say I would like to fix a bug in Hadoop in ABC.java. Is it OK if I execute
just ABCTest.java (if available) for the development phase before the
patch
is attached to JIRA ticket? I don't expect the answer to this question is
positive but I can not think of better workaround for now...


  

This very much depends on the patch. If you are "convinced" by running
TestABC, that the patch would be OK, then you can go ahead and submit patch
to hudson for QA testing. However, since the resources at Hudson are limited,
please do not use it for "regular" tests. As a side note, please run ant
test-patch to check your patch.

http://wiki.apache.org/hadoop/HowToContribute
http://wiki.apache.org/hadoop/CodeReviewChecklist

Enis



Regards,
Lukas



  



  


Re: How long is Hadoop full unit test suit expected to run?

2008-06-17 Thread Enis Soztutar


Lukas Vlcek wrote:

Hi,

How long is Hadoop full unit test suit expected to run?
How do you go about running Hadoop tests? I found that it can take hours for
[ant test] target to run which does not seem to be very efficient for
development.
Is there anything I can do to speed up tests (like running Hadoop in a real
cluster)?
  

Hi,

Yes, ant test can take up to an hour. On my machine it completes in less 
than an hour. Previously work has been done to reduce the time that 
tests take, but some of the tests take a long time by nature such as 
testing dfs balance, etc.
There is an issue to implement ant test-core as a mapred job so that it 
can be submitted to a cluster. That would help a lot.

Say I would like to fix a bug in Hadoop in ABC.java. Is it OK if I execute
just ABCTest.java (if available) for the development phase before the patch
is attached to JIRA ticket? I don't expect the answer to this question is
positive but I can not think of better workaround for now...
  
This very much depends on the patch. If you are "convinced" by running 
TestABC that the patch would be OK, then you can go ahead and submit the 
patch to hudson for QA testing. However, since the resources at Hudson 
are limited, please do not use it for "regular" tests. As a side note, 
please run ant test-patch to check your patch.


http://wiki.apache.org/hadoop/HowToContribute
http://wiki.apache.org/hadoop/CodeReviewChecklist

Enis

Regards,
Lukas

  


Re: non-static map or reduce classes?

2008-05-21 Thread Enis Soztutar

Hi,

Static inner classes and static fields are different things in 
java. Hadoop needs to instantiate the Mapper and Reducer classes from 
their class names, so if they are defined as inner classes, they need to 
be static. You can either declare the inner classes static and 
use them normally (accessing their own non-static data), or refactor the 
Mapper and Reducer classes into their own files.
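
A sketch of the first option, loosely matching the hadoop.examples.Simple$MyMapper 
name from the stack trace below (the configuration key and the map logic are 
made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Simple {

  // Declared static so Hadoop can instantiate it reflectively by class name.
  // Per-task, non-static data lives in instance fields and is set up in configure().
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private String someSetting;   // non-static data, one copy per task

    public void configure(JobConf job) {
      someSetting = job.get("simple.some.setting", "default");
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      out.collect(new Text(someSetting), new IntWritable(1));
    }
  }
}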


Deyaa Adranale wrote:

hello.
why do we have to set the map and reduce classes as static?
I need to access some data inside them which is not static. What should I 
do?


non static map or reduce classes generates the following exception:

java.lang.RuntimeException: java.lang.NoSuchMethodException: 
hadoop.examples.Simple$MyMapper.<init>()
   at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:45) 


   at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:32)
   at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:53) 


   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
   at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1210)
Caused by: java.lang.NoSuchMethodException: 
hadoop.examples.Simple$MyMapper.<init>()

   at java.lang.Class.getConstructor0(Class.java:2705)
   at java.lang.Class.getDeclaredConstructor(Class.java:1984)
   at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:41) 


   ... 4 more




any ideas?

thanks in advance,

Deyaa



Re: Why ComparableWritable does not take a template?

2008-05-12 Thread Enis Soztutar

Hi,

WritableComparable uses generics in trunk, but if you use 0.16.x you 
cannot use that version. WritableComparable is not generified there yet for 
legacy reasons, but the work is in progress. The problem in your code 
arises from WritableComparator.newKey(): it seems your object cannot 
be created by keyClass.newInstance() (no default constructor, I guess). 
You can also implement your own WritableComparator implementation.
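
A minimal illustration with a made-up key class; the point is the 
no-argument constructor, which WritableComparator.newKey() needs, and the 
Object-typed compareTo() that 0.16 expects:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class SummaryKey implements WritableComparable {
  private long id;

  // No-argument constructor: WritableComparator.newKey() creates keys via
  // keyClass.newInstance(), which fails without it (the InstantiationException below).
  public SummaryKey() { }

  public SummaryKey(long id) { this.id = id; }

  public void write(DataOutput out) throws IOException { out.writeLong(id); }

  public void readFields(DataInput in) throws IOException { id = in.readLong(); }

  // In 0.16 WritableComparable is not generified, so compareTo takes Object.
  public int compareTo(Object o) {
    SummaryKey that = (SummaryKey) o;
    return (id < that.id) ? -1 : ((id == that.id) ? 0 : 1);
  }
}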


steph wrote:


The example in the javadoc shows that the compareTo() method uses the 
type of the class instead of the
Object type. However the WritableComparable class does not take any 
type parameter and therefore it cannot
set the type parameter for the Comparable interface. Is that a mistake?

* public class MyWritableComparable implements WritableComparable {
 *   // Some data
 *   private int counter;
 *   private long timestamp;
 *
 *   public void write(DataOutput out) throws IOException {
 * out.writeInt(counter);
 * out.writeLong(timestamp);
 *   }
 *
 *   public void readFields(DataInput in) throws IOException {
 * counter = in.readInt();
 * timestamp = in.readLong();
 *   }
 *
 *   public int compareTo(MyWritableComparable w) {
 * int thisValue = this.value;
 * int thatValue = ((IntWritable)o).value;
 * return (thisValue < thatValue ? -1 : 
(thisValue==thatValue ? 0 : 1));

 *   }
 * }

If I try to do what the example shows, it does not compile:

  [javac] 
/Users/steph/Work/Rinera/TRUNK/vma/hadoop/parser-hadoop/apps/src/com/rinera/hadoop/weblogs/SummarySQLKey.java:13: 
com.rinera.hadoop.weblogs.SummarySQLKey is not abstract and does not 
override abstract method compareTo(java.lang.Object) in 
java.lang.Comparable

[javac] public class SummarySQLKey


If I don't use the type but instead use the Object type for compareTo(),
I get a RuntimeException:

java.lang.RuntimeException: java.lang.InstantiationException: 
com.rinera.hadoop.weblogs.SummarySQLKey
at 
org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:75) 

at 
org.apache.hadoop.io.WritableComparator.<init>(WritableComparator.java:63) 

at 
org.apache.hadoop.io.WritableComparator.get(WritableComparator.java:42)
at 
org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:645)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:313)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:174)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
Caused by: java.lang.InstantiationException: 
com.rinera.hadoop.weblogs.SummarySQLKey

at java.lang.Class.newInstance0(Class.java:335)
at java.lang.Class.newInstance(Class.java:303)
at 
org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:73) 


... 6 more






Re: JobConf: How to pass List/Map

2008-05-02 Thread Enis Soztutar

It is exactly what DefaultStringifier does, ugly but useful *smile*.

Jason Venner wrote:
We have been serializing to a bytearrayoutput stream then base64 
encoding the underlying byte array and passing that string in the conf.

It is ugly but it works well until 0.17

Enis Soztutar wrote:
Yes Stringifier was committed in 0.17. What you can do in 0.16 is to 
simulate DefaultStringifier. The key feature of the Stringifier is 
that it can convert/restore any object to string using base64 
encoding on the binary form of the object. If your objects can be 
easily converted to and from strings, then you can directly store 
them in conf. The other obvious alternative would be to switch to 
0.17, once it is out.


Tarandeep Singh wrote:
On Wed, Apr 30, 2008 at 5:11 AM, Enis Soztutar 
<[EMAIL PROTECTED]> wrote:
 

Hi,

 There are many ways which you can pass objects using configuration.
Possibly the easiest way would be to use Stringifier interface.

 you can for example :

 DefaultStringifier.store(conf, variable ,"mykey");

 variable = DefaultStringifier.load(conf, "mykey", variableClass );



thanks... but I am using Hadoop-0.16 and Stringifier is a fix for 
0.17 version -

https://issues.apache.org/jira/browse/HADOOP-3048

Any thoughts on how to do this in 0.16 version ?

thanks,
Taran

 
 you should take into account that the variable you pass to 
configuration

should be serializable by the framework. That means it must implement
Writable or Serializable interfaces. In your particular case, you 
might want

to look at ArrayWritable and MapWritable classes.

 That said, you should however not pass large objects via 
configuration,
since it can seriously affect job overhead. If the data you want to 
pass is
large, then you should use other alternatives(such as 
DistributedCache,

HDFS, etc).



 Tarandeep Singh wrote:

  

Hi,

How can I set a list or map to JobConf that I can access in
Mapper/Reducer class ?
The get/setObject method from Configuration has been deprecated and
the documentation says -
"A side map of Configuration to Object should be used instead."
I could not follow this :(

Can someone please explain to me how to do this ?

Thanks,
Taran



  


  




Re: JobConf: How to pass List/Map

2008-04-30 Thread Enis Soztutar
Yes, Stringifier was committed in 0.17. What you can do in 0.16 is to 
simulate DefaultStringifier. The key feature of the Stringifier is that 
it can convert/restore any object to a string using base64 encoding of the 
binary form of the object. If your objects can be easily converted to 
and from strings, then you can directly store them in the conf. The other 
obvious alternative would be to switch to 0.17, once it is out.
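
A sketch of what "simulating DefaultStringifier" on 0.16 could look like, 
using Hadoop's buffer classes plus a Base64 codec (this assumes 
commons-codec is available on the classpath; the helper class name is made up):

import java.io.IOException;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Writable;

public class WritableConfHelper {

  // Serialize the Writable, base64-encode the bytes, and put the string in the conf.
  public static void store(Configuration conf, String key, Writable w) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    w.write(out);
    byte[] raw = new byte[out.getLength()];
    System.arraycopy(out.getData(), 0, raw, 0, out.getLength());
    conf.set(key, new String(Base64.encodeBase64(raw), "UTF-8"));
  }

  // Decode the string from the conf and re-populate a fresh instance.
  public static <T extends Writable> T load(Configuration conf, String key, T instance)
      throws IOException {
    byte[] decoded = Base64.decodeBase64(conf.get(key).getBytes("UTF-8"));
    DataInputBuffer in = new DataInputBuffer();
    in.reset(decoded, decoded.length);
    instance.readFields(in);
    return instance;
  }
}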


Tarandeep Singh wrote:

On Wed, Apr 30, 2008 at 5:11 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
  

Hi,

 There are many ways which you can pass objects using configuration.
Possibly the easiest way would be to use Stringifier interface.

 you can for example :

 DefaultStringifier.store(conf, variable ,"mykey");

 variable = DefaultStringifier.load(conf, "mykey", variableClass );



thanks... but I am using Hadoop-0.16 and Stringifier is a fix for 0.17 version -
https://issues.apache.org/jira/browse/HADOOP-3048

Any thoughts on how to do this in 0.16 version ?

thanks,
Taran

  

 you should take into account that the variable you pass to configuration
should be serializable by the framework. That means it must implement
Writable or Serializable interfaces. In your particular case, you might want
to look at ArrayWritable and MapWritable classes.

 That said, you should however not pass large objects via configuration,
since it can seriously affect job overhead. If the data you want to pass is
large, then you should use other alternatives(such as DistributedCache,
HDFS, etc).



 Tarandeep Singh wrote:



Hi,

How can I set a list or map to JobConf that I can access in
Mapper/Reducer class ?
The get/setObject method from Configuration has been deprecated and
the documentation says -
"A side map of Configuration to Object should be used instead."
I could not follow this :(

Can someone please explain to me how to do this ?

Thanks,
Taran



  


  


Re: JobConf: How to pass List/Map

2008-04-30 Thread Enis Soztutar

Hi,

There are many ways in which you can pass objects using the configuration. 
Possibly the easiest way would be to use the Stringifier interface.

You can, for example:

DefaultStringifier.store(conf, variable, "mykey");

variable = DefaultStringifier.load(conf, "mykey", variableClass);

You should take into account that the variable you pass to the configuration 
should be serializable by the framework. That means it must implement the 
Writable or Serializable interfaces. In your particular case, you might 
want to look at the ArrayWritable and MapWritable classes.

That said, you should not pass large objects via the configuration, 
since it can seriously affect job overhead. If the data you want to pass 
is large, then you should use other alternatives (such as the 
DistributedCache, HDFS, etc).
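
As a concrete example of the MapWritable suggestion (the key name and 
contents are illustrative, and exception handling is omitted):

// Store a small map of settings in the job configuration before submitting:
MapWritable settings = new MapWritable();
settings.put(new Text("threshold"), new IntWritable(10));
settings.put(new Text("mode"), new Text("strict"));
DefaultStringifier.store(conf, settings, "myjob.settings");

// ...and read it back, e.g. inside Mapper.configure(JobConf job):
MapWritable restored = DefaultStringifier.load(job, "myjob.settings", MapWritable.class);
int threshold = ((IntWritable) restored.get(new Text("threshold"))).get();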


Tarandeep Singh wrote:

Hi,

How can I set a list or map to JobConf that I can access in
Mapper/Reducer class ?
The get/setObject method from Configuration has been deprecated and
the documentation says -
"A side map of Configuration to Object should be used instead."
I could not follow this :(

Can someone please explain to me how to do this ?

Thanks,
Taran

  


Re: Best practices for handling many small files

2008-04-24 Thread Enis Soztutar

A shameless attempt to defend MultiFileInputFormat :

A concrete implementation of MultiFileInputFormat is not needed, since 
every InputFormat relying on MultiFileInputFormat is expected to have 
its custom RecordReader implementation, thus they need to override 
getRecordReader(). An implementation which returns (sort of) 
LineRecordReader  is under src/examples/.../MultiFileWordCount. However 
we may include it if any generic (for example returning 
SequenceFileRecordReader) implementation pops up.


An InputFormat returns numSplits splits from getSplits(JobConf 
job, int numSplits); numSplits is the number of maps, not the number of 
machines in the cluster.


Last of all, MultiFileSplit class implements getLocations() method, 
which returns the files' locations. Thus it's the JT's job to assign 
tasks to leverage local processing.


Coming to the original question, I think #2 is better, if the 
construction of the sequence file is not a bottleneck. You may, for 
example, create several sequence files in parallel and use all of them 
as input w/o merging.



Joydeep Sen Sarma wrote:

A million map processes are horrible. Aside from the overhead - don't do it if you 
share the cluster with other jobs (all other jobs will get killed whenever the 
million map job is finished - see 
https://issues.apache.org/jira/browse/HADOOP-2393).

Well - even for #2 - it begs the question of how the packing itself will be 
parallelized ..

There's a MultiFileInputFormat that can be extended - that allows processing of 
multiple files in a single map job. It needs improvement. For one - it's an 
abstract class - and a concrete implementation for (at least) text files would 
help. Also - the splitting logic is not very smart (from what I last saw). 
Ideally - it should take the million files and form them into N groups (say N is 
the size of your cluster) where each group has files local to the Nth machine and 
then process them on that machine. Currently it doesn't do this (the groups are 
arbitrary). But it's still the way to go ..


-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
 
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org


  


Re: I need your help sincerely!

2008-04-22 Thread Enis Soztutar

Hi,

The number of map tasks is supposed to be greater than the number of 
machines, so in your configuration, 6 map tasks is ok. However there 
should be another problem. Have you changed the code for word count?


Please ensure that the example code is unchanged and your configuration 
is right. Also you may want to try the latest stable release, which is 
0.16.3 at this point.


wangxiaowei wrote:

hello, I am a Chinese user. I am using hadoop-0.15.3 now. The problem is that I 
installed hadoop with three nodes. One is taken as NameNode and JobTracker; the 
three are all taken as slaves. Dfs runs normally. I use your wordcount example 
with the command: bin/hadoop jar hadoop-0.15.3-examples.jar 
wordcount -m 6 -r 2 inputfile outputdir, and then I submit it. When it runs, the map 
tasks all complete, but the reduce task does not run at all! I checked the 
log of the JobTracker, finding the error:
at java.util.Hashtable.get(Hashtable.java:334)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutput(ReduceTask.java:966)
at org.apache.hadoop.mapred.ReduceTask.run
at org.apache.hadoop.mapred.TaskTracker$Child.main()
The machine just stops there; I have to ctrl+c to stop the program.
But when I use this command: bin/hadoop jar hadoop-0.15.3-examples.jar 
wordcount -m 2 -r 1 inputfile outputdir
   it runs successfully!
   I do not know why! I think the number of machines should be greater than 
the number of mapTasks, so I set it to 6; my number of machines is 3.
   Looking forward to your reply!
  4.22,2008

  xiaowei
  


Re: small sized files - how to use MultiInputFileFormat

2008-04-01 Thread Enis Soztutar

Hi,

An example extracting one record per file would be :

public class FooInputFormat extends MultiFileInputFormat {

  @Override
  public RecordReader getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new FooRecordReader(job, (MultiFileSplit) split);
  }
}

public static class FooRecordReader implements RecordReader {

  private MultiFileSplit split;
  private long offset;
  private long totLength;
  private FileSystem fs;
  private int count = 0;
  private Path[] paths;

  public FooRecordReader(Configuration conf, MultiFileSplit split)
      throws IOException {
    this.split = split;
    fs = FileSystem.get(conf);
    this.paths = split.getPaths();
    this.totLength = split.getLength();
    this.offset = 0;
  }

  public WritableComparable createKey() {
    ..
  }

  public Writable createValue() {
    ..
  }

  public void close() throws IOException { }

  public long getPos() throws IOException {
    return offset;
  }

  public float getProgress() throws IOException {
    return ((float) offset) / split.getLength();
  }

  public boolean next(Writable key, Writable value) throws IOException {
    if (offset >= totLength)
      return false;
    if (count >= split.numPaths())
      return false;

    Path file = paths[count];

    FSDataInputStream stream = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));

    Scanner scanner = new Scanner(reader.readLine());
    // read from the file, fill in key and value
    reader.close();
    stream.close();
    offset += split.getLength(count);
    count++;
    return true;
  }
}


I guess I should add example code to the mapred tutorial and the 
examples directory.
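Until then, a hedged usage sketch of wiring the input format above into a job (the FileInputFormat/FileOutputFormat helpers assume a 0.17-style mapred API; older releases used conf.setInputPath/setOutputPath, and the identity map/reduce classes are only there to make the example self-contained):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class FooJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FooJob.class);
    conf.setJobName("foo-multifile");

    // Each MultiFileSplit groups several small files; FooRecordReader walks them in turn.
    conf.setInputFormat(FooInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Identity map/reduce just to exercise the input format; plug in real classes here.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);    // adjust to whatever createKey() returns
    conf.setOutputValueClass(Text.class);  // adjust to whatever createValue() returns

    JobClient.runJob(conf);
  }
}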


Jason Curtes wrote:

Hello,

I have been trying to run Hadoop on a set of small text files, not larger
than 10k each. The total input size is 15MB. If I try to run the example
word count application, it takes about 2000 seconds, more than half an hour
to complete. However, if I merge all the files into one large file, it takes
much less than a minute. I think using MultiInputFileFormat can be helpful
at this point. However, the API documentation is not really helpful. I
wonder if MultiInputFileFormat can really solve my problem, and if so, can
you suggest me a reference on how to use it, or a few lines to be added to
the word count example to make things more clear?

Thanks in advance.

Regards,

Jason Curtes

  


Re: Hadoop summit video capture?

2008-03-26 Thread Enis Soztutar

+1

Otis Gospodnetic wrote:

Hi,

Wasn't there going to be a live stream from the Hadoop summit?  I couldn't find 
any references on the event site/page, and searches on veoh, youtube and google 
video yielded nothing.

Is an archived version of the video (going to be) available?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



  


Re: [core-user] Processing binary files Howto??

2008-03-18 Thread Enis Soztutar

Hi, please see below,

Ted Dunning wrote:

This sounds very different from your earlier questions.

If you have a moderate (10's to 1000's) number of binary files, then it is
very easy to write a special purpose InputFormat that tells hadoop that the
file is not splittable.  

@ Ted,
   actually we have MultiFileInputFormat and MultiFileSplit for exactly 
this :)


@ Alfonso,
   The core of Hadoop does not care about the source of the data (such as 
files, a database, etc). The map and reduce functions operate on records, 
which are just key/value pairs. The job of the 
InputFormat/InputSplit/RecordReader interfaces is to map the actual data 
source to records.

So, if a file contains a few records, no record is split across two files, 
and the total number of files is on the order of ten thousand, you can extend 
MultiFileInputFormat to return a RecordReader which extracts records from 
these binary files.

If the above does not apply, you can concatenate  all the files into a 
smaller number of files, then use FileInputFormat.
Then your RecordReader implementation is responsible for finding the 
record boundaries and extracting the records.


In both options, storing the files in DFS and using map-red is a wise 
choice, since mapred over dfs already has locality optimizations. But if 
you must, you can distribute the files to the nodes manually and 
implement an ad-hoc Partitioner which ensures the map task is executed 
on the node that has the relevant files.
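For the first case Ted describes (a moderate number of files, each processed whole by one map task), a minimal sketch of a non-splittable input format would look roughly like this; it assumes the generified 0.17-style mapred API, the inline record reader is an illustration rather than code from this thread, and it returns the file name as the key and the raw bytes as the value:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // hand each binary file to a single map task, unsplit
  }

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    final FileSystem fs = fileSplit.getPath().getFileSystem(job);

    return new RecordReader<Text, BytesWritable>() {
      private boolean done = false;

      public Text createKey() { return new Text(); }
      public BytesWritable createValue() { return new BytesWritable(); }

      public boolean next(Text key, BytesWritable value) throws IOException {
        if (done) return false;
        byte[] contents = new byte[(int) fileSplit.getLength()];
        FSDataInputStream in = fs.open(fileSplit.getPath());
        try {
          in.readFully(0, contents);              // the whole file is one record
        } finally {
          in.close();
        }
        key.set(fileSplit.getPath().getName());   // key: file name
        value.set(contents, 0, contents.length);  // value: raw bytes
        done = true;
        return true;
      }

      public long getPos() { return done ? fileSplit.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    };
  }
}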



This allows you to add all of the files as inputs
to the map step and you will get the locality that you want.  The files
should be large enough so that you take at least 10 seconds or more
processing them to get good performance relative to startup costs.  If they
are not, then you may want to package them up in a form that can be read
sequentially.  This need not be splittable, but it would be nice if it were.

If you are producing a single file per hour, then this style works pretty
well.  In my own work, we have a few compressed and encrypted files each
hour that are map-reduced into a more congenial and splittable form each
hour.  Then subsequent steps are used to aggregate or process the data as
needed.

This gives you all of the locality that you were looking for.


On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

  

Hi there.

After reading a bit of the hadoop framework and trying the WordCount
example, I have several doubts about how to use map/reduce with
binary files.

In my case binary files are generated on a timeline basis, let's say
1 file per hour. The size of each file is different (briefly, we are
getting pictures from space and the star density differs between
observations). The mappers, rather than receiving the file content,
have to receive the file name. I read that if the input files
are big (several blocks), they are split among several tasks on the
same/different node/s (block sizes?). But we want each map task
to process a file rather than a block (or a line of a file as in the
WordCount sample).

In a previous post I made to this forum, I was recommended to use an
input file with all the file names, so the mappers would receive the
file name. But there is a drawback related to data location (this was
also mentioned), because data then has to be moved from one node
to another. Data is not going to be replicated to all the nodes. So
a task taskA that has to process fileB on nodeN has to be executed
on nodeN. How can we achieve that? What if a task requires a file
that is on another node? Does the framework move the logic to that
node? We need to define a URI file map in each node
(hostname/path/filename) for all the files. Tasks would access the
local URI file map in order to process the files.

Another approach we have thought of is to use the distributed file system
to load balance the data among the nodes, and have our processes
running on every node (without using the map/reduce framework). Then
each process has to access the local node to process the data,
using the dfs API (or checking the local URI file map). This approach
would be more flexible for us, because depending on the machine
(quadcore, dualcore) we know how many Java threads we can run in order
to get the maximum performance out of the machine. Using the framework we
can only set a number of tasks to be executed on every node, but all
the nodes have to be the same.

URI file map.
Once the files are copied to the distributed file system, then we need
to create this table map. Or is there a way to access a directory at
the data node and retrieve the files it handles, rather than getting
all the files in all the nodes in that directory? i.e.

NodeA  /tmp/.../mytask/input/fileA-1
/tmp/.../mytask/input/fileA-2

NodeB /tmp/.../mytask/input/fileB

A process at nodeB listing the /tmp/.../input directory, would get only fileB

Any ideas?
Thanks
Alfonso.




  


Re: displaying intermediate results of map/reduce

2008-03-06 Thread Enis Soztutar
You can also run the job in local mode with zero reducers, so that the 
map results are the results of the job.
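For reference, a small sketch of the relevant settings (the configuration key names are the ones used in releases of that era; treat them as assumptions for your version):

import org.apache.hadoop.mapred.JobConf;

public class LocalDebugConf {
  public static JobConf create() {
    JobConf conf = new JobConf();
    conf.set("mapred.job.tracker", "local");   // run the job in-process (LocalJobRunner)
    conf.set("fs.default.name", "file:///");   // use the local file system
    conf.setNumReduceTasks(0);                 // with zero reducers, map output becomes the job output
    return conf;
  }
}

With these settings the map output files land directly in the job's output directory, and System.out.println calls in the map show up on the local console since everything runs in one JVM.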


Prasan Ary wrote:

Hi All,
  I am using eclipse to write a map/reduce java application that connects to 
hadoop on remote cluster.
  Is there a way I can display intermediate results of map ( or reduce) much 
the same way as I would use System.out.println( variable_name) if I were 
running any application on a single machine?
   
  thx,

  Prasan.

   
  


Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Naama Kraus wrote:

OK. Let me try an example:

Say my map maps a person name to his child's name. If a person "Dan"
has more than 1 child, a bunch of <Dan, child> pairs will be produced, right?
Now say I have two different information needs:
1. Get a list of all children names for each person.
2. Get the number of children of each person.

I could run two different MapReduce jobs, with the same map but different
reducers:
1. emits <p, lc> pairs where p is the person, lc is a concatenation of his
children's names.
2. emits <p, n> pairs where p is the person, n is the number of children.
  
No, you cannot have more than one type of reducer in one job. But yes, you 
can write more than one file as the result of the reduce phase, which is what 
I wanted to explain by pointing to ParseOutputFormat, which writes ParseText 
and ParseData to different MapFiles at the end of the reduce step. So this is 
done by implementing OutputFormat + RecordWriter (given a resulting record 
from the reduce, write separate parts of it to different files).

Does that make any sense by now ?
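To illustrate, a rough sketch of such an OutputFormat (written against the 0.17-style mapred API; the class names, the output-path handling via mapred.output.dir, and the record layout are assumptions for this example, not code taken from ParseOutputFormat):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

/** Writes the children list to one file and the child count to another. */
public class ChildrenOutputFormat implements OutputFormat<Text, Text> {

  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    Path dir = new Path(job.get("mapred.output.dir"));
    FileSystem fs = dir.getFileSystem(job);
    final FSDataOutputStream names  = fs.create(new Path(dir, name + "-names"));
    final FSDataOutputStream counts = fs.create(new Path(dir, name + "-counts"));

    return new RecordWriter<Text, Text>() {
      public void write(Text person, Text childList) throws IOException {
        // the same reduce record is split across the two files
        names.writeBytes(person + "\t" + childList + "\n");
        int n = childList.toString().split(",").length;
        counts.writeBytes(person + "\t" + n + "\n");
      }
      public void close(Reporter reporter) throws IOException {
        names.close();
        counts.close();
      }
    };
  }

  public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
    // empty in this sketch; a real implementation would validate the output directory
  }
}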

Now, my question is whether I can avoid the two jobs and have only a single one
which emits both types of pairs - <p, lc> and <p, n> - in separate
files probably. This way I gain one pass over the input files instead of two
(or more, if I had more output types ...).
  
Actually, for this scenario you do not even need two different files with 
<p, lc> and <p, n>. You can just compute <p, list<c>>, which also contains 
the number of the children (the value is a list, for example an 
ArrayWritable, containing the children's names).
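A minimal sketch of such a reducer (old mapred API; the Text-based types are chosen purely for illustration), where the emitted value carries the children list and, alongside it, the count:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ChildrenReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text person, Iterator<Text> children,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    StringBuilder list = new StringBuilder();
    int count = 0;
    while (children.hasNext()) {
      if (count++ > 0) {
        list.append(",");
      }
      list.append(children.next().toString());
    }
    // one output record per person: the count plus the comma-separated children
    output.collect(person, new Text(count + "\t" + list));
  }
}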



If not, that's also fine, I was just curious :-)

Naama



On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

  

Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in <k2, v2>* 
pairs. So at the end of the map stage, the output will be of the form 
<k2, v2> pairs. The reduce takes <k2, list<v2>> pairs and emits <k3, v3>* 
pairs, where k1, k2, k3, v1, v2, v3 are all types.

I cannot understand what you meant by "if a MapReduce job could output 
multiple files each holding different <key, value> pairs".

The resulting segment directories after a crawl contain
subdirectories (like crawl_generate, content, etc), but these are
generated one-by-one in several jobs running sequentially (and sometimes
by the same job, see ParseOutputFormat in nutch). You can refer further
to the OutputFormat and RecordWriter interfaces for specific needs.

For each split in the reduce phase a different output file will be
generated, but all the records in the files have the same type. However,
in some cases, using GenericWritable or ObjectWritable you can wrap
different types of keys and values.

Hope it helps,
Enis

Naama Kraus wrote:


Well, I was not actually thinking to use Nutch.
To be concrete, I was interested in whether a MapReduce job could output
multiple files, each holding different <key, value> pairs. I got the
impression this is done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was mis-understanding.
Is it Nutch specific or achievable using the Hadoop API? Would multiple
different reducers do the trick?

Thanks for offering to help, I might have more concrete details of what
I am trying to implement later on, now I am basically learning.

Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:


  

Hi,

Currently nutch is a fairly complex application that *uses* hadoop as a
base for distributed computing and storage. In this regard there is no
part in nutch that "extends" hadoop. The core of the mapreduce indeed
does work with <key, value> pairs, and nutch uses specific <key, value>
pairs such as <url, CrawlDatum>, etc.

So long story short, it depends on what you want to build. If you are
working on something that is not related to nutch, you do not need it.
You can give further info about your project if you want extended help.

best wishes.
Enis

Naama Kraus wrote:

Hi,

I've seen in
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12)
that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15)
that MapReduce outputs two files, each holding different <key, value>
pairs. I'd be curious to know if I can achieve that using the standard API.

Thanks, Naama



  



  




  


Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in <k2, v2>* 
pairs. So at the end of the map stage, the output will be of the form 
<k2, v2> pairs. The reduce takes <k2, list<v2>> pairs and emits <k3, v3>* 
pairs, where k1, k2, k3, v1, v2, v3 are all types.


I cannot understand what you meant by "if a MapReduce job could output 
multiple files each holding different <key, value> pairs".

The resulting segment directories after a crawl contain 
subdirectories (like crawl_generate, content, etc), but these are 
generated one-by-one in several jobs running sequentially (and sometimes 
by the same job, see ParseOutputFormat in nutch). You can refer further 
to the OutputFormat and RecordWriter interfaces for specific needs.

For each split in the reduce phase a different output file will be 
generated, but all the records in the files have the same type. However, 
in some cases, using GenericWritable or ObjectWritable you can wrap 
different types of keys and values.


Hope it helps,
Enis

Naama Kraus wrote:

Well, I was not actually thinking to use Nutch.
To be concrete, I was interested in whether a MapReduce job could output multiple
files, each holding different <key, value> pairs. I got the impression this is
done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was mis-understanding.
Is it Nutch specific or achievable using Hadoop API ? Would multiple
different reducers do the trick ?

Thanks for offering to help, I might have more concrete details of what I am
trying to implement later on, now I am basically learning.

Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

  

Hi,

Currently nutch is a fairly complex application that *uses* hadoop as a
base for distributed computing and storage. In this regard there is no
part in nutch that "extends" hadoop. The core of the mapreduce indeed
does work with <key, value> pairs, and nutch uses specific <key, value> 
pairs such as <url, CrawlDatum>, etc.

So long story short, it depends on what you want to build. If you are
working on something that is not related to nutch, you do not need it.
You can give further info about your project if you want extended help.

best wishes.
Enis

Naama Kraus wrote:


Hi,

I've seen in

  

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12)
that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in

  

http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15)
that MapReduce outputs two files, each holding different <key, value>
pairs. I'd be curious to know if I can achieve that using the standard API.


Thanks, Naama


  




  


Re: Difference between local mode and distributed mode

2008-03-06 Thread Enis Soztutar

Hi,

LocalJobRunner uses just 0 or 1 reduces. This is because running in local 
mode is only supported for testing purposes.
You can, however, simulate distributed mode locally by using 
MiniMRCluster and MiniDFSCluster under src/test.
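For example, a test can bring the mini clusters up roughly like this (the constructor signatures and package names are from memory and differ between releases, so treat this strictly as a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.MiniDFSCluster;   // moved to org.apache.hadoop.hdfs in later releases
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniClusterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);       // 2 datanodes, format the namespace
    FileSystem fs = dfs.getFileSystem();
    MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1); // 2 tasktrackers

    JobConf job = mr.createJobConf();
    // ... configure and run a job against the mini clusters here ...

    mr.shutdown();
    dfs.shutdown();
  }
}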


Best wishes
Enis

Naama Kraus wrote:

Hi,

I ran a simple MapReduce job which defines 3 reducers:

*conf.setNumReduceTasks(3);*

When running on top of HDFS (distributed mode), I got 3 out files as I
expected.
When running on top of a local files system (local mode), I got 1 file and
not 3.

My question is whether the behavior in the local is the expected one, is a
bug, or maybe something needs to be configured in the local mode in order to
get the 3 files as well ?

Thanks for an insight, Naama

  


Re: Nutch Extensions to MapReduce

2008-03-06 Thread Enis Soztutar

Hi,

Currently nutch is a fairly complex application that *uses* hadoop as a 
base for distributed computing and storage. In this regard there is no 
part in nutch that "extends" hadoop. The core of the mapreduce indeed 
does work with <key, value> pairs, and nutch uses specific <key, value> 
pairs such as <url, CrawlDatum>, etc.


So long story short, it depends on what you want to build. If you are 
working on something that is not related to nutch, you do not need it. 
You can give further info about your project if you want extended help.


best wishes.
Enis

Naama Kraus wrote:

Hi,

I've seen in
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12)
that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15)
that MapReduce outputs two files, each holding different <key, value>
pairs. I'd be curious to know if I can achieve that using the standard API.

Thanks, Naama

  


Re: Why no DoubleWritable?

2008-02-05 Thread Enis Soztutar

Hi,

The reason may be that nobody has needed the extra precision brought 
by double enough to compensate for the extra space, compared to FloatWritable. If 
you really need DoubleWritable you may write the class, which will be 
straightforward, and then attach it to a JIRA issue so that we can add 
it to the core. That's the way open source works, after all. *smile*
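For reference, a minimal sketch of what such a class could look like, modeled after FloatWritable (an untested sketch, not an official implementation):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/** A WritableComparable for doubles, mirroring FloatWritable. */
public class DoubleWritable implements WritableComparable {
  private double value;

  public DoubleWritable() {}

  public DoubleWritable(double value) { this.value = value; }

  public void set(double value) { this.value = value; }

  public double get() { return value; }

  public void readFields(DataInput in) throws IOException {
    value = in.readDouble();            // 8 bytes on the wire, vs 4 for FloatWritable
  }

  public void write(DataOutput out) throws IOException {
    out.writeDouble(value);
  }

  public int compareTo(Object o) {
    double other = ((DoubleWritable) o).value;
    return value < other ? -1 : (value == other ? 0 : 1);
  }

  public boolean equals(Object o) {
    return (o instanceof DoubleWritable) && ((DoubleWritable) o).value == value;
  }

  public int hashCode() {
    long bits = Double.doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
  }

  public String toString() {
    return Double.toString(value);
  }
}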


Jimmy Lin wrote:

Hi guys,

What's the design decision for not implementing a DoubleWritable type
that implements WritableComparable? I noticed that there are classes
corresponding to all Java primitives except for double.

Thanks in advance,
Jimmy




  


Re: hadoop file system browser

2008-01-24 Thread Enis Soztutar
Yes, you can solve the bottleneck by starting a webdav server on each 
client. But this would add the burden of managing the servers, etc., and 
it may not be the intended use case for webdav. We can discuss the 
architecture further in the relevant issue.


Alban Chevignard wrote:

Thanks for the clarification. I agree that running a single WebDAV
server for all clients would make it a bottleneck. But I can't see
anything in the current WebDAV server implementation that precludes
running an instance of it on each client. It seems to me that would
solve any bottleneck issue.

-Alban

On Jan 23, 2008 2:53 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
  

As you know, the dfs client connects to the individual datanodes to
read/write data and has minimal interaction with the Namenode, which
improves the io rate linearly (theoretically 1:1). However, the current
implementation of the webdav interface is just a server working on a single
machine, which translates the webdav requests to the namenode. Thus the
whole traffic passes through this webdav server, which makes it a
bottleneck. I was planning to integrate the webdav server with the
namenode/datanode and forward the requests to the other datanodes, so
that we can do io in parallel, but my focus on webdav has faded for now.




Alban Chevignard wrote:


What are the scalability issues associated with the current WebDAV interface?

Thanks,
-Alban

On Jan 22, 2008 7:27 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:

  

The webdav interface for hadoop works as it is, but it needs a major
redesign to be scalable; it is still useful, however. It has even been
used with Windows Explorer by defining the webdav server as a remote service.


Ted Dunning wrote:



There has been significant work on building a web-DAV interface for HDFS.  I
haven't heard any news for some time, however.


On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote:



  

The Eclipse plug-in also features a DFS browser.


  

Yep. That's all true, I don't mean to self-promote, because there really isn't
that much to advertise ;) I was just quite attached to file manager-like user
interface; the mucommander clone I posted served me as a browser, but also for
rudimentary file operations (copying to/from, deleting folders etc.). In my
experience it's been quite handy.

It would be probably a good idea to implement a commons-vfs plugin for Hadoop
so
that HDFS filesystem is transparent to use for other apps.

Dawid



  
  


  


Re: hadoop file system browser

2008-01-22 Thread Enis Soztutar
As you know, the dfs client connects to the individual datanodes to 
read/write data and has minimal interaction with the Namenode, which 
improves the io rate linearly (theoretically 1:1). However, the current 
implementation of the webdav interface is just a server working on a single 
machine, which translates the webdav requests to the namenode. Thus the 
whole traffic passes through this webdav server, which makes it a 
bottleneck. I was planning to integrate the webdav server with the 
namenode/datanode and forward the requests to the other datanodes, so 
that we can do io in parallel, but my focus on webdav has faded for now.




Alban Chevignard wrote:

What are the scalability issues associated with the current WebDAV interface?

Thanks,
-Alban

On Jan 22, 2008 7:27 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
  

The webdav interface for hadoop works as it is, but it needs a major
redesign to be scalable; it is still useful, however. It has even been
used with Windows Explorer by defining the webdav server as a remote service.


Ted Dunning wrote:


There has been significant work on building a web-DAV interface for HDFS.  I
haven't heard any news for some time, however.


On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote:


  

The Eclipse plug-in also features a DFS browser.

  

Yep. That's all true, I don't mean to self-promote, because there really isn't
that much to advertise ;) I was just quite attached to file manager-like user
interface; the mucommander clone I posted served me as a browser, but also for
rudimentary file operations (copying to/from, deleting folders etc.). In my
experience it's been quite handy.

It would be probably a good idea to implement a commons-vfs plugin for Hadoop
so
that HDFS filesystem is transparent to use for other apps.

Dawid




  


  


Re: hadoop file system browser

2008-01-22 Thread Enis Soztutar
The webdav interface for hadoop works as it is, but it needs a major 
redesign to be scalable; it is still useful, however. It has even been 
used with Windows Explorer by defining the webdav server as a remote service.


Ted Dunning wrote:

There has been significant work on building a web-DAV interface for HDFS.  I
haven't heard any news for some time, however.


On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote:

  

The Eclipse plug-in also features a DFS browser.
  

Yep. That's all true, I don't mean to self-promote, because there really isn't
that much to advertise ;) I was just quite attached to file manager-like user
interface; the mucommander clone I posted served me as a browser, but also for
rudimentary file operations (copying to/from, deleting folders etc.). In my
experience it's been quite handy.

It would be probably a good idea to implement a commons-vfs plugin for Hadoop
so 
that HDFS filesystem is transparent to use for other apps.


Dawid