Disclaimer: I'm an employee of Elastic (the company behind Elasticsearch) and the lead of the Elasticsearch Hadoop integration.

Some things to clarify on the Elasticsearch side:

1. Elasticsearch is a distributed, real-time search and analytics engine. Search is just one aspect of it, and it can work with any type of data (whether it's text, image encodings, etc.): GitHub, Wikipedia, and Stack Overflow are well-known websites powered by Elasticsearch. In fact, you can find plenty of use cases and information about this on the website [1].

2. Elasticsearch is stand-alone and can run on the same machines as other services or on separate ones. In fact, on the _same_ machine, one can run _multiple_ Elasticsearch nodes (and thus clusters). For best performance, dedicated hardware (as Nick suggested) is the way to go.

3. The Elasticsearch Spark integration has been available for over a year through Map/Reduce, and through the native (Scala and Java) API since Q3 last year. There are plenty of features available, fully documented here [2]. Better yet, there's a talk by yours truly from Spark Summit East [3] that is focused on exactly this topic.

4. elasticsearch-hadoop is certified by Databricks, Cloudera, Hortonworks and MapR, and supports both Spark core and Spark SQL 1.0-1.3. There are binaries for Scala 2.10 and 2.11. And for what it's worth, it provided one of the first (if not the first) implementations of the DataSource API outside Databricks, which means not only using Elasticsearch in a declarative fashion but also having push-down support for operators.
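To make points 3 and 4 concrete, here's a minimal sketch of the native Scala API from elasticsearch-spark. The index name, document fields, and node address are made up for illustration, and a running Elasticsearch node is assumed:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs / esRDD to RDDs and SparkContext

object EsSparkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-spark-sketch")
      .set("es.nodes", "localhost:9200")  // assumption: a local ES node
    val sc = new SparkContext(conf)

    // Write an RDD of maps to the (hypothetical) "companies/doc" index/type.
    val companies = sc.makeRDD(Seq(
      Map("name" -> "Acme Corp", "country" -> "US"),
      Map("name" -> "Globex",    "country" -> "DE")))
    companies.saveToEs("companies/doc")

    // Read it back; the query string is executed by Elasticsearch itself,
    // so only matching documents travel back to Spark.
    val hits = sc.esRDD("companies/doc", "?q=name:acme")
    hits.collect().foreach(println)

    sc.stop()
  }
}
```

With Spark SQL, `import org.elasticsearch.spark.sql._` similarly exposes `sqlContext.esDF("companies/doc")`, which goes through the DataSource API so that filters in the resulting DataFrame can be pushed down to Elasticsearch.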

Hopefully these materials will get you started with Spark and Elasticsearch and also clarify some of the misconceptions about Elasticsearch.

Cheers,

[1] https://www.elastic.co/products/elasticsearch
[2] http://www.elastic.co/guide/en/elasticsearch/hadoop/master/reference.html
[3] http://spark-summit.org/east/2015/talk/using-spark-and-elasticsearch-for-real-time-data-analysis


On 4/28/15 8:16 PM, Nick Pentreath wrote:
Depends on your use case and search volume. Typically you'd have a dedicated ES cluster if your app is doing a lot of real-time indexing and search.

If it's only for Spark integration then you could colocate ES and Spark.



On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:

    Thanks for the reply.

    Will the Elasticsearch index be within my cluster, or do I need to host Elasticsearch separately?


    On 28 April 2015 at 22:03, Nick Pentreath <nick.pentre...@gmail.com> wrote:

        I haven't used Solr for a long time, and haven't used Solr in Spark.

        However, why do you say "Elasticsearch is not a good option ..."? ES
        absolutely supports full-text search, and not just filtering and grouping
        (in fact its original purpose was, and still is, text search, though
        filtering, grouping and aggregation are heavily used).
        http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html



        On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:

            Has anyone tried using Solr inside Spark?
            Below is the project describing it:
            https://github.com/LucidWorks/spark-solr

            I have a requirement in which I want to index 20 million company names and then search as and when new data comes in. The output should be a list of companies matching the query.

            Spark has a built-in Elasticsearch integration, but for this purpose Elasticsearch is not a good option since this is totally a text-search problem?

            Elasticsearch is good for filtering and grouping.

            Has anybody used Solr inside Spark?

            Regards
            jeetendra





--
Costin


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
