Solr for large volume data processing with minimal full-text search

souravm Fri, 07 Nov 2008 09:24:05 -0800

Hi Shalin,

Thanks for your input.


Yes, I agree that my application is not much about full-text search.

A combination of Hive, Chukwa and Pig running on Hadoop can be a good bet, but
they fall short when it comes to online querying of very large data volumes.

I am specifically talking about Pig here, whose benchmark figures are on the
order of 3-10 minutes with 11 nodes for around 4 GB of data (200 M records),
whereas with Solr I see processing times under a second on 1 node (with more
memory) for around 1 GB of data (0.5 M records).
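
To put the two sets of figures on a common scale, here is a rough sketch that normalizes them to records per second per node. It uses only the numbers quoted above; the function name is illustrative, and "under a second" is treated as a 1-second upper bound (an assumption). The two systems do different work per query, so this is only a rough normalization, not a benchmark.

```python
def records_per_second_per_node(records, seconds, nodes):
    """Average number of records handled per second by each node."""
    return records / seconds / nodes

# Pig figures quoted above: 200 M records, 3-10 minutes, 11 nodes.
pig_best = records_per_second_per_node(200_000_000, 3 * 60, 11)
pig_worst = records_per_second_per_node(200_000_000, 10 * 60, 11)

# Solr figures quoted above: 0.5 M records, under a second, 1 node.
# Assumption: 1 second is used as the upper bound, so this is a lower
# bound on the rate.
solr_floor = records_per_second_per_node(500_000, 1, 1)

print(f"Pig:  {pig_worst:,.0f} to {pig_best:,.0f} records/s per node")
print(f"Solr: at least {solr_floor:,.0f} records/s on one node")
```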

Since online query performance is one of the key requirements for my
application (I think that, irrespective of the type of application, no user
would like to wait at the screen for more than a minute), I'm in a dilemma.
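
For the aggregation-type statistics in question (min, max, average, sd, count), one possible shape of an online query is a request to Solr's StatsComponent, sketched below. The `stats`, `stats.field` and `rows` parameters belong to that component; the host, the core name `applogs` and the field name `response_time_ms` are hypothetical placeholders, and the component may not be available in every Solr release.

```python
from urllib.parse import urlencode

def build_stats_url(base_url, field, query="*:*"):
    """Build a Solr select URL asking for statistics on one numeric field."""
    params = [
        ("q", query),            # match-all unless a filter query is given
        ("rows", 0),             # only the statistics are wanted, no documents
        ("stats", "true"),       # enable the StatsComponent
        ("stats.field", field),  # field to aggregate over
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical core and field names, for illustration only.
url = build_stats_url("http://localhost:8983/solr/applogs", "response_time_ms")
print(url)
```

The URL is only constructed here, not sent, so the sketch runs without a Solr instance.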

Regards,
Sourav



-----Original Message-----
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: Friday, November 07, 2008 7:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Multicore ...

From what I can understand, you have little full-text search involved here.
You should probably look at Hadoop and its contrib and sub-projects such as
Pig, Hive and Chukwa.

http://wiki.apache.org/hadoop/
http://wiki.apache.org/hadoop/Hive
http://wiki.apache.org/hadoop/Chukwa
http://incubator.apache.org/pig/

On Fri, Nov 7, 2008 at 9:03 PM, souravm <[EMAIL PROTECTED]> wrote:

> Hi Guys,
>
> I'm struggling to decide whether Solr would be a fitting solution
> for me. Highly appreciate you
>
> The key requirements can be summarized as below -
>
> 1. Need to process a very high volume of data online from log files of
> various applications - hundreds of millions of records, with the total
> size varying within a range of 30-40 GB.
>
> 2. Flexibility - Log file formats would differ between applications, and
> can also vary for the same application. However, the log files would be
> in XML, and if a new type has to be supported then its schema would be
> known beforehand.
>
> 3. The type of queries to be supported -
> a) Mostly aggregation type statistics (min, max, average, sd, count etc.)
> of response times, sales numbers etc.
> b) Ability to support ad hoc queries relating multiple fields in a given
> log file, and joining similar fields across multiple log files
>
> 4. Expected performance would be around 10 to 20 sec for the majority of
> queries. For the rest it may be a bit higher.
>
> I'm planning to use Solr with its multicore and distributed search features.
> However, I'm also considering Hadoop with HBase, as that looks to be a
> natural solution for supporting multiple file formats and ad hoc queries.
>
> I would like to have your views on this: given the key requirements above,
> is Solr the right choice, or would Hadoop+HBase (or any other open source
> product) be better?
>
> Thanks in advance.
>
> Regards,
> Sourav



--
Regards,
Shalin Shekhar Mangar.
