Re: Working with MapFiles

2012-04-02 Thread Ioan Eugen Stan

Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:

Ok, thanks.

I missed the setup() method because I'm using an older version of Hadoop, so I
suppose the configure() method does the same thing in Hadoop 0.20.203.


Aha, if it's possible, try upgrading. I don't know how good support is for 
versions older than the Hadoop 0.20 branch.



Now I'm able to load a map file inside the configure() method into a
MapFile.Reader instance kept as a private class variable, and all works fine.
I'm just wondering whether the MapFile is replicated on HDFS and the data is
read locally, or whether reading from this file will increase network
bandwidth because its data is fetched from another node in the Hadoop cluster.



You could use a method variable instead of a private field if you load 
the file. If the MapFile is written to HDFS then yes, it is replicated, and 
you can configure the replication factor at file creation (and maybe later). 
If you use DistributedCache then the files are not written to HDFS, but to 
the mapred.local.dir [1] folder on every node.
The folder size is configurable, so it's possible that the data will still be 
available there for the next MR job, but don't rely on this.
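
As a rough sketch only (the HDFS path is hypothetical and error handling is kept 
to a minimum), keeping the reader as a field on the old API could look like this; 
whether the reads are local depends on where HDFS placed the blocks, not on the 
reader itself:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupBase extends MapReduceBase {

    private MapFile.Reader reader;   // opened once per task

    @Override
    public void configure(JobConf job) {
        try {
            FileSystem fs = FileSystem.get(job);
            // hypothetical HDFS location of the MapFile directory
            reader = new MapFile.Reader(fs, "/data/lookup.map", job);
        } catch (IOException e) {
            throw new RuntimeException("could not open MapFile", e);
        }
    }

    protected Text lookup(Text key) throws IOException {
        Text value = new Text();
        // get() returns null when the key is not present
        return (Text) reader.get(key, value);
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}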


Please read the docs, I may get things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538


Hopefully the last question to bother you with: is reading files from the
DistributedCache (a normal text file) limited to a particular job?
Before running a job I add a file to the DistributedCache. When getting the
file in the Reducer implementation, can it access DistributedCache files from
other jobs? In other words, what will this call list:

//Reducer impl.
public void configure(JobConf job) {
    URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
}

Will the distCacheFileUris variable contain only URIs for this job, or
URIs for any job running on the Hadoop cluster?

Hope it's understandable.
Thanks.




--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: Working with MapFiles

2012-04-02 Thread Ioan Eugen Stan

Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:

And one more question: is it even possible to add a MapFile (as it
consists of an index and a data file) to the Distributed cache?
Thanks


Should be no problem, they are just two files.


On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:

Hello,

I'm not sure what you mean by using map reduce setup()?

"If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job."

Can you please explain a little bit more?



Check the javadocs [1]: setup() is called once per task, so you can read the 
file from HDFS there or perform other initializations.


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html 



Reading 20 MB into RAM should not be a problem and is preferred if you 
need to make many requests against that data. It really depends on your 
use case, so think carefully or just go ahead and test it.
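
Purely as an illustration (the path and the tab-separated record format are 
invented), a new-API Mapper that loads such a small lookup file from HDFS in 
setup() and keeps it in a HashMap could look roughly like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // hypothetical HDFS path to the ~20 MB lookup file, one "key<TAB>value" pair per line
        Path file = new Path("/data/lookup.txt");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit only the records that have a match in the in-memory table
        String hit = lookup.get(value.toString());
        if (hit != null) {
            context.write(value, new Text(hit));
        }
    }
}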




Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of MapReduce job, and what I need to
do is:

1. If MapReduce produced more than one split as output, merge them into a
single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering whether it is even possible to merge MapFiles, given their
nature, and use them as a Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted
keys) and a small index for that file. The map file does a version of
binary search to find your key and performs seek() to go to the byte
offset in the file.


What I'm trying to achieve is repeated fast searches in this file
during another MapReduce job.
If my idea is completely wrong, can you give me any tip on how to do it?

The file is expected to be about 20 MB.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

The distributed cache will also use HDFS [2] and I don't think it
will provide you with any benefits.


Thanks for your reply:)

Ondrej Klimpera


[1]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html

[2]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html








--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: Working with MapFiles

2012-03-30 Thread Ioan Eugen Stan

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of MapReduce job, and what I need to do is:

1. If MapReduce produced more than one split as output, merge them into a single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering whether it is even possible to merge MapFiles, given their
nature, and use them as a Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted keys) 
and a small index for that file. The map file does a version of binary 
search to find your key and performs seek() to go to the byte offset in 
the file.
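
To make that concrete, here is a small sketch (directory name and records 
invented) of writing and then looking up a MapFile; note that append() 
requires the keys in sorted order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // a MapFile is a directory holding a "data" and an "index" file
        String dir = "/tmp/example.map";

        // write: keys must be appended in sorted order or append() will throw
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("apple"), new Text("1"));
        writer.append(new Text("banana"), new Text("2"));
        writer.close();

        // read: get() uses the in-memory index plus a seek() into the data file
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Text value = new Text();
        reader.get(new Text("banana"), value);   // value now holds "2"
        reader.close();
    }
}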



What I'm trying to achieve is repeated fast searches in this file during
another MapReduce job.
If my idea is completely wrong, can you give me any tip on how to do it?

The file is expected to be about 20 MB.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid 
network IO. Do that in the setup() method of the map reduce job.


The distributed cache will also use HDFS [2] and I don't think it will 
provide you with any benefits.



Thanks for your reply:)

Ondrej Klimpera


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: how to get rid of -libjars ?

2012-03-06 Thread Ioan Eugen Stan

On 06.03.2012 17:37, Jane Wayne wrote:

Currently, I have my main jar and then 2 dependent jars. What I do is:
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

Then, when I need to run my job, MyJob inside main.jar, I do the following:

hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

What I want to do is NOT copy the dependent jars to $HADOOP/lib and always
specify -libjars. Is there any way around this multi-step procedure? I
really do not want to clutter $HADOOP/lib or specify a comma-delimited list
of jars for -libjars.

Any help is appreciated.



Hello,

Did you specify the full path to the jars in -libjars? My experience with 
-libjars is that it didn't work as advertised.


Search for an older post on the list about this issue (-libjars not 
working). I tried adding a lot of jars and some got onto the job classpath 
(2), but most of them didn't.


I got over this by including all the jars in a lib directory inside the 
main jar.


Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: ClassNotFoundException: -libjars not working?

2012-02-28 Thread Ioan Eugen Stan

On 28.02.2012 10:58, madhu phatak wrote:

Hi,
  -libjars doesn't always work. A better way is to create a runnable jar with
all the dependencies (if the number of dependencies is small), or you have to
keep the jars in the lib folder of Hadoop on all machines.



Thanks for the reply Madhu,

I adopted the second solution, as explained in [1]. From what I found 
browsing the net, it seems that -libjars is broken in Hadoop versions > 
0.18. I didn't get time to check the code yet. The Cloudera-released Hadoop 
sources are packaged a bit oddly and NetBeans doesn't seem to play well 
with that, which really affects my will to try to fix the problem.


"-libjars" is a nice feature that permits the use of skinny jars and 
would help system admins do better packaging. It also allows better 
control over the classpath. Too bad it didn't work.



[1] 
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/


Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: LZO with sequenceFile

2012-02-26 Thread Ioan Eugen Stan
2012/2/26 Mohit Anchlia :
> Thanks. Does it mean LZO is not installed by default? How can I install LZO?

The LZO library is released under the GPL and I believe it can't be
included in most distributions of Hadoop because of this (you can't mix
GPL with non-GPL code). It should be easily available, though.

> On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu  wrote:
>
>> Yes, it is supported by Hadoop sequence file. It is splittable
>> by default. If you have installed and specified LZO correctly,
>> use these:
>>
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
>>
>> job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);
>>
>>
>> Shi
>>
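
For reference, a minimal driver sketch showing where those calls would go, 
assuming the hadoop-lzo codec (com.hadoop.compression.lzo.LzoCodec) is 
installed on all nodes; the identity mapper/reducer and the paths are just 
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LzoSequenceFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lzo-sequencefile-output");
        job.setJarByClass(LzoSequenceFileDriver.class);

        // no mapper/reducer set: the identity classes keep the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // write a block-compressed SequenceFile using the LZO codec
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}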



-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/


ClassNotFoundException: -libjars not working?

2012-02-22 Thread Ioan Eugen Stan
are/mailbox-convertor/lib/servlet-api-2.5-6.1.14.jar,/usr/share/mailbox-convertor/lib/antisamy-1.4.4.jar,/usr/share/mailbox-convertor/lib/antisamy-sample-configs-1.4.4.jar,/usr/share/mailbox-convertor/lib/jcl-over-slf4j-1.6.1.jar,/usr/share/mailbox-convertor/lib/jul-to-slf4j-1.6.1.jar,/usr/share/mailbox-convertor/lib/slf4j-api-1.6.1.jar,/usr/share/mailbox-convertor/lib/slf4j-log4j12-1.6.1.jar,/usr/share/mailbox-convertor/lib/spring-aop-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-asm-3.1.0.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-beans-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-context-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-core-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-expression-3.1.0.RELEASE.jar,/usr/share/mailbox-convertor/lib/uncommon
s-maths-1.2.2.jar,/usr/share/mailbox-convertor/lib/watchmaker-framework-0.6.2.jar,/usr/share/mailbox-convertor/lib/snappy-java-1.0.3.2.jar,/usr/share/mailbox-convertor/lib/snakeyaml-1.6.jar,/usr/share/mailbox-convertor/lib/oro-2.0.8.jar,/usr/share/mailbox-convertor/lib/stax-api-1.0.1.jar,/usr/share/mailbox-convertor/lib/jasper-compiler-5.5.23.jar,/usr/share/mailbox-convertor/lib/jasper-runtime-5.5.23.jar,/usr/share/mailbox-convertor/lib/wstx-asl-3.2.7.jar,/usr/share/mailbox-convertor/lib/xml-apis-1.3.04.jar,/usr/share/mailbox-convertor/lib/xml-apis-ext-1.3.04.jar,/usr/share/mailbox-convertor/lib/xmlenc-0.52.jar,/usr/share/mailbox-convertor/lib/xpp3_min-1.1.3.4.O.jar 
-t lucehbase-emails -m 13294028075653




The tmpjars property from the generated job conf looks like this (taken from 
the web interface):


tmpjars 
file:/usr/share/mailbox-convertor/lib/zookeeper-3.4.2.jar,file:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar,file:/usr/share/mailbox-convertor/lib/hbase-0.92.0-1and1.jar


and mapred.job.classpath.files is:

/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/zookeeper-3.4.2.jar:/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/hadoop-core-0.20.2-cdh3u1.jar:/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/hbase-0.92.0-1and1.jar

and I get:

Error: java.lang.ClassNotFoundException: org.apache.commons.lang.ArrayUtils
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at 
com.unitedinternet.portal.emailorganizer.TableToDictionaryMapper.map(TableToDictionaryMapper.java:31)
at 
com.unitedinternet.portal.emailorganizer.TableToDictionaryMapper.map(TableToDictionaryMapper.java:23)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)

at org.apache.hadoop.mapred.Child.main(Child.java:264)

Regards,

--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: Best Linux Operating system used for Hadoop

2012-01-30 Thread Ioan Eugen Stan

On 27.01.2012 11:15, Sujit Dhamale wrote:

Hi All,
I am new to Hadoop.
Can anyone tell me which is the best Linux operating system for
installing & running Hadoop?
Nowadays I am using Ubuntu 11.04 and installed Hadoop on it, but it
crashes a number of times.

Can someone please help me out???


Kind regards
Sujit Dhamale



I think the most important thing you have to keep in mind is who is going 
to administer your cluster. It's important that the administrator is 
comfortable/experienced with the distribution you are going to use.


As for which distribution to use, you can safely choose one that 
has very good support (community and/or vendor). In no particular order:


- Debian
- Ubuntu LTS
- RedHat / CentOS (company / community support)

There are a few Hadoop distributions available from companies that you 
can check out:


- MapR
- Cloudera's Hadoop distribution
- Hortonworks (not sure about them providing a Hadoop distribution)

They offer installation instructions on different platforms for their 
products. Maybe you can check them out to see if they are good for you.


Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: missing job history and strange MR job output

2012-01-16 Thread Ioan Eugen Stan

On 13.01.2012 06:00, Harsh J wrote:

Perhaps you aren't writing it properly? It's hard to tell what your
problem may be without looking at some code snippets
(sensitive/irrelevant parts may be cut out, or even typed-up pseudocode
is fine), etc.



Hello Harsh and others,

It's fixed. After resolving a childish bug on my part (with building the 
Scan object) I still had problems with the setup. It ran everything up 
until waitForCompletion(), where it hung. I checked the logs and it 
barely showed any output from the MapReduce mini cluster, just a few 
lines announcing the start of TaskTrackers and JobTrackers, etc.


Removing the local maven repository finally solved the issue and now I 
can happily continue with coding.


It seems that periodically cleaning the Maven repo is a must these days.

Thanks for the support,

--
Ioan Eugen Stan
http://ieugen.blogspot.com


missing job history and strange MR job output

2012-01-12 Thread Ioan Eugen Stan
Hello,

I've been struggling for two days now to figure out what's wrong with my
map-reduce job, without success. I'm trying to write a map reduce job
that reads data from an HBase table and outputs to a sequence file. I'm
using the HBaseTestingUtility with the mini clusters. All went
well, but after a big refactoring of my code I no longer get the job
history, and the output of my map reduce job is a SequenceFile that just
contains the header with the class names.

I've hit a brick wall and don't really know where to go right now.

Cheers,

P.S. If this is better suited for the HBase mailing list please let me know,
but the code that reads the HBase table seems OK. I'm doing a small scan
beforehand to count some stuff to pass to the job.

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/


Re: some guidance needed

2011-05-19 Thread Ioan Eugen Stan
I have forwarded this discussion to my mentors so they are informed
and I hope they will provide better input regarding email storage.

> I second what Todd said, even with FuseHDFS, mounting HDFS as a regular file
> system, it won't give you the immediate response about the file status that
> you need. I believe Google implemented Gmail with HBase. Here is an example
> of implementing a mail store with Cassandra:
> http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
>
> Mark

Thanks Mark, I will look into that. I am currently watching the Cloudera
Hadoop Training [1] to get a better view of how things work.

I have one question: what is the defining difference between Cassandra
and HBase? Also, Eric, one of my mentors, suggested I use Gora for
this, and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra which would allow me to switch between them. The downside
is that Gora is still incubating, so a piece of advice about
using it or not is welcome. I will also ask on the Gora mailing list
to see how things are there.

>> I would encourage you to look at a system like HBase for your mail
>> backend. HDFS doesn't work well with lots of little files, and also
>> doesn't support random update, so existing formats like Maildir
>> wouldn't be a good fit.

I don't think I understand correctly what you mean by random updates.
E-mails are immutable, so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory and that this can be a problem,
then I get it. This would also mean that the mailbox format (all emails
in one file) would be even less appropriate than Maildir. But since
e-mails are immutable and adding a mail to the mailbox means appending
a small piece of data to the file, this should not be a problem if
Hadoop has append.

The presentation on Vimeo stated that HDFS 0.19 did not have append;
I don't know yet what the status on that is, but things look a little
brighter. You could have a mailbox file that grows to a very
large size. This would put all of a user's emails into one big file
that is easy to manage; the only thing missing is fetching the emails.
Since emails are appended to the file (inbox) as they come, and you are
usually interested in the latest emails received, you could just read
the tail of the file and do some indexing based on that. Should I post
this on the HDFS mailing list as well?

I'm talking without real experience with Hadoop so shut me up if I'm wrong.

>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,

-- 
Ioan-Eugen Stan


some guidance needed

2011-05-18 Thread Ioan Eugen Stan
Hello everybody,

I'm a GSoC student this year and I will be working on James [1].
My project is to implement email storage over HDFS. I am quite new to
Hadoop and its related projects, and I am looking for some hints to get
started on the right track.

I have installed a single-node Hadoop instance on my machine and
played around with it (ran some examples), but I am interested in
what you (more experienced people) think is the best way to approach
my problem.

I am a little puzzled by the fact that I have read Hadoop is best
used for large files, and emails aren't that large from what I know.
Another thing that crossed my mind is that since HDFS is a file
system, wouldn't it be possible to set it as a back-end for the
(existing) Maildir and mailbox storage formats? (I think this question
is more suited to the James mailing list, but if you have some ideas
please speak your mind.)

Also, any development resources to get me started are welcome.


[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,
-- 
Ioan Eugen Stan