You might consider doing the de-duplication at index time instead:
https://cwiki.apache.org/confluence/display/solr/De-Duplication
That way the MapReduce job wouldn't even be necessary.
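
For what it's worth, the index-time route is mostly a solrconfig.xml change:
you add a SignatureUpdateProcessorFactory to an update processor chain (plus
a signature field in your schema). A rough sketch - the field list below is
just a guess at your schema, adjust it to whatever you want fingerprinted:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <bool name="overwriteDupes">false</bool>
        <!-- placeholder field list; use the fields you want fingerprinted -->
        <str name="fields">name,address,birthdate,phone,email</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

With overwriteDupes set to true, documents that hash to the same signature
overwrite each other; TextProfileSignature is the fuzzier alternative if
exact signatures are too strict for your data.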

When it comes to the MapReduce job, you would need to be more specific about
*what* you are doing for people to be able to help: are you issuing a query
for every one of the 40 million rows, and how many mapper tasks are you
running? Right off the bat, though, I see you are using Java's
HttpURLConnection; you should really use SolrJ for querying:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
You won't need to deal with XML parsing yourself, and it uses Apache's
HttpClient with much more reasonable defaults.
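
A minimal SolrJ sketch - the base URL, core name and field names below are
just placeholders for your setup:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DedupeQueryExample {
        public static void main(String[] args) throws Exception {
            // Create the client once and reuse it for every query; it pools
            // connections via Apache HttpClient. In a mapper, build it in
            // setup() and close it in cleanup(), never once per record.
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/customers");

            SolrQuery query = new SolrQuery();
            // Placeholder query: exact match on email and birth date fields.
            query.setQuery("email:\"someone@example.com\" AND birthdate:\"1980-01-01\"");
            query.setFields("id", "name", "address");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
            solr.close();
        }
    }

Even with SolrJ, 40 million one-document-at-a-time queries will be slow;
batching several lookups into one request (an OR over a handful of keys)
tends to help a lot before you reach for anything fancier.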

-Steve

On Thu, Dec 24, 2015 at 11:28 PM, Dino Chopins <dino.chop...@gmail.com>
wrote:

> Hi Erick,
>
> Thank you for your response and pointer. What I mean by running Lucene/SOLR
> on Hadoop is having a Lucene/SOLR index that can be queried from a MapReduce
> job, or whatever best practice is recommended.
>
> I need to have this mechanism to do large scale row deduplication. Let me
> elaborate why I need this:
>
>    1. I have two data sources with 35 and 40 million records of customer
>    profiles - the data come from two systems (SAP and MS CRM).
>    2. I need to index and compare the two data sources row by row using the
>    name, address, birth date, phone and email fields. Birth date and email
>    use exact comparison, while the other fields use probabilistic
>    comparison. By the way, the data has been normalized before it is
>    indexed.
>    3. Each match will be grouped under the same person and will be
>    deduplicated automatically or with user intervention, depending on the
>    score.
>
> I usually do this with a Lucene index on the local filesystem using term
> vectors, but since this will be a repeated task, and management has
> challenged me to do it on top of the Hadoop cluster, I need a framework or
> best practice for it.
>
> I understand that putting a Lucene index directly on HDFS is not very
> appropriate, since HDFS is designed for large block operations. With that in
> mind, I use SOLR and query it with an HTTP call from the MapReduce job. The
> code snippet is below.
>
>             url = new URL(SOLR_QUERY_URL);
>
>             HttpURLConnection connection =
>                 (HttpURLConnection) url.openConnection();
>             connection.setRequestMethod("GET");
>
> The latter approach turns out to perform very badly. A simple MapReduce job
> that only reads the data sources and writes to HDFS takes 15 minutes, but
> once I add the HTTP requests it has been running for three hours and is
> still going.
>
> What went wrong? And what would be the solution to my problem?
>
> Thanks,
>
> Dino
>
> On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > First, what do you mean "run Lucene/Solr on Hadoop"?
> >
> > You can use the HdfsDirectoryFactory to store Solr/Lucene
> > indexes on Hadoop. At that point the actual filesystem
> > that holds the index is transparent to the end user; you just
> > use Solr as you would if it were using indexes on the local
> > file system. See:
> > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
> >
> > If you want to use Map-Reduce to _build_ indexes, see the
> > MapReduceIndexerTool in the Solr contrib area.
> >
> > Best,
> > Erick
> >
>
> --
> Regards,
>
> Dino
>
