Re: New to Search and Blur

Aaron McCurry Wed, 20 Feb 2013 11:23:19 -0800

Welcome Paul!  I will try to answer your questions below:

On Wed, Feb 20, 2013 at 1:41 PM, Paul O'Donoghue <[email protected]> wrote:

> Hi,
>
> First up I would like to say I’m really excited by the Blur project, it
> seems to fit the need of a potential project perfectly. I’m hoping that I
> can someday contribute back to this project in some way as it seems that it
> will be of enormous help to me.
>
> Now, on to the meat of the issue. I’m a complete search newbie. I am coming
> from a Spring/Application development background but have to get involved
> in the Search/Big data field for a current client. Since the new year I
> have been looking at Hadoop and have set up a small cluster using Cloudera’s
> excellent tools. I’ve been downloading datasets, running MR jobs, etc. and
> think I have gleaned a very basic level of knowledge which is enough for me
> to learn more when I need it. This week I have started looking at Blur, and
> at present I have cloned the src to the hadoop namenode where I have built
> and started the blur servers. But now I am stuck, and don’t know where to
> go. So I will ask the following:
>
> 1 - /apache-blur-0.2.0-SNAPSHOT/conf/servers. At present I just have my
> namenode defined in here. Do I need to add my datanodes as well?
>

You don't have to, but the normal configuration is to run Blur alongside
the datanodes.  That means copying the SNAPSHOT directory to every datanode
and adding each datanode to the servers file.  If you want to start simple,
you can run Blur on a single node, and the namenode would work for that.
To be clear, I would not recommend running Blur on the same machine as your
namenode in a production environment, but for testing it should be fine.
I would, however, put the name of your server in the servers file and
remove localhost.

>
> 2 - blur> create repl-table hdfs://localhost:9000/blur/repl-table 1
> java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused.
>

> I’m confused here. Is 9000 the correct port? Is there some sort of user
> auth issue?
>

I would change the command to:

create repl-table hdfs://<namenode>/blur/repl-table 1

The <namenode> part should match the host (and port, if one is given) in
the fs.default.name value in core-site.xml in your hadoop/conf directory.
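
As a rough illustration of where the table URI comes from, here is a small
Python sketch (Python only because it is short; nothing in Blur requires it)
that reads fs.default.name out of a core-site.xml and builds the create
command.  The host name, port, and helper names are made up for the example,
and the /blur base directory is simply taken from the command above.

```python
import xml.etree.ElementTree as ET

# Sample core-site.xml content. The host name and port are hypothetical;
# use whatever your own hadoop/conf/core-site.xml contains.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
"""


def default_fs(core_site_xml):
    """Return the fs.default.name value from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            return prop.findtext("value", "").strip()
    return None


def table_uri(core_site_xml, table, base_dir="/blur"):
    """Build the HDFS URI for a table under base_dir (an assumed layout)."""
    fs = default_fs(core_site_xml)
    if fs is None:
        raise ValueError("fs.default.name not found in core-site.xml")
    return fs.rstrip("/") + base_dir.rstrip("/") + "/" + table


if __name__ == "__main__":
    # Print the command to type at the blur> prompt.
    print("create repl-table " + table_uri(SAMPLE_CORE_SITE, "repl-table") + " 1")
```

If the value in your file has no port, the resulting URI will not have one
either, which is the point: the URI should mirror your configuration rather
than assume localhost:9000.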

>
> 3 - Assuming I create a table on the hdfs, when I want to import my data
> into it I use a MR job yes? What is the best way to package this job? Do I
> have to include all the Blur jars or do I install Blur on the datanodes and
> set a classpath? Is it possible to link to an example MR job in a maven
> project? Or am I on completely the wrong track?
>

You are on the right track; however, you won't need to distribute the jar
files across the cluster yourself.  We haven't built a nice automated way
to run MapReduce jobs yet, but here is what you need to do.

Take a look at:

https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=src/blur-mapred/src/main/java/org/apache/blur/example/BlurExampleIndexWriter.java;h=9d6eb546e565303f328556fea29d1345344e8065;hb=0.2-dev

This is a writing example in the new Blur code (0.2.x); there is also a
reading example in the same package.  This example pushes the updates
through the Thrift API; the bulk importer that writes indexes directly to
HDFS has not been rewritten for 0.2 yet.

As for the Blur libraries, you can use the simple approach of putting all
the jars in a lib folder and creating a single jar that includes your
classes and the lib folder (jars inside the jar).  Hadoop adds the jars in
the job jar's lib folder to the classpath of the running tasks, so the
libraries are distributed to the Hadoop MR cluster automatically.
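
To make the "jars inside the jar" layout concrete, here is a minimal Python
sketch that assembles such an archive with the standard zipfile module (a
jar is a zip file).  It is only meant to show the layout; in practice you
would more likely use the jar tool or a Maven assembly.  The class and jar
names below are hypothetical placeholders, not the real Blur artifact names,
and the sketch does not write a manifest.

```python
import os
import tempfile
import zipfile


def build_job_jar(jar_path, classes_dir, dependency_jars):
    """Write a job jar: compiled classes at the root, dependencies under lib/."""
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        # Your own compiled classes keep their package directory structure.
        for dirpath, _dirnames, filenames in os.walk(classes_dir):
            for filename in sorted(filenames):
                full = os.path.join(dirpath, filename)
                arcname = os.path.relpath(full, classes_dir).replace(os.sep, "/")
                jar.write(full, arcname)
        # Every dependency jar is stored, unexpanded, under lib/.
        for dep in dependency_jars:
            jar.write(dep, "lib/" + os.path.basename(dep))
    return jar_path


def demo():
    """Build a jar from placeholder files and return its sorted entry names."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = os.path.join(tmp, "classes", "com", "example")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, "MyIndexer.class"), "wb") as f:
            f.write(b"placeholder")  # stands in for a compiled class
        deps = []
        for name in ("blur-core.jar", "blur-thrift.jar"):  # hypothetical names
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(b"placeholder")  # stands in for a real dependency jar
            deps.append(path)
        jar_path = build_job_jar(
            os.path.join(tmp, "my-job.jar"), os.path.join(tmp, "classes"), deps
        )
        with zipfile.ZipFile(jar_path) as jar:
            return sorted(jar.namelist())


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

The resulting archive has your classes at the top level and each dependency
as a separate file under lib/, which is the layout described above.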

Let us know if you have more questions and how we can help.  Thanks!

Aaron

>
> Thanks for your help,
>
> Paul.
>
