# Disclaimer: I'm an employee of Elastic (the company behind Elasticsearch) and
the lead of the Elasticsearch-Hadoop integration.
Some things to clarify on the Elasticsearch side:
1. Elasticsearch is a distributed, real-time search and analytics engine. Search is just one aspect of it, and it can
work with any type of data (text, image encodings, etc.): GitHub, Wikipedia and Stack Overflow are popular
examples of well-known websites powered by Elasticsearch. In fact, you can find plenty of use cases and information
about this on the website [1].
2. Elasticsearch is stand-alone and can run on the same machines as other services or on separate ones. In fact, on the
_same_ machine one can run _multiple_ Elasticsearch nodes (and thus clusters). For best performance, though, dedicated
hardware (as Nick suggested) is the way to go.
3. The Elasticsearch Spark integration has been available for over a year through Map/Reduce, and through the native
(Scala and Java) API since Q3 of last year. There are plenty of features available, all fully documented here [2]. Better
yet, there's a talk by yours truly from Spark Summit East [3] that focuses entirely on this topic.
4. elasticsearch-hadoop is certified by Databricks, Cloudera, Hortonworks and MapR, and supports both Spark Core and
Spark SQL 1.0-1.3. There are binaries for Scala 2.10 and 2.11. And for what it's worth, it provided one of the first (if
not the first) implementations of the DataSource API outside Databricks, which means Elasticsearch can be used not only
in a declarative fashion but also with push-down support for operators.
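To make point 4 concrete, here is a minimal sketch of the declarative usage through the connector's Scala API. The index name (`companies/company`), field name and node address are illustrative placeholders; it assumes the elasticsearch-spark artifact is on the classpath and a cluster is reachable at the configured address.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._  // adds esDF to SQLContext

// Tell the connector where the cluster lives (placeholder address)
val conf = new SparkConf()
  .setAppName("es-spark-sql-sketch")
  .set("es.nodes", "localhost:9200")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load an index/type as a DataFrame; the filter below is translated
// into an Elasticsearch query and pushed down to the cluster, so only
// matching documents travel back to Spark
val companies = sqlContext.esDF("companies/company")
companies.filter(companies("name").equalTo("Acme")).show()
```

The push-down is the interesting part: rather than pulling the whole index into Spark and filtering there, the DataSource implementation hands the predicate to Elasticsearch itself.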
Hopefully these materials will get you started with Spark and Elasticsearch and also clarify some of the misconceptions
about Elasticsearch.
Cheers,
[1] https://www.elastic.co/products/elasticsearch
[2] http://www.elastic.co/guide/en/elasticsearch/hadoop/master/reference.html
[3]
http://spark-summit.org/east/2015/talk/using-spark-and-elasticsearch-for-real-time-data-analysis
On 4/28/15 8:16 PM, Nick Pentreath wrote:
Depends on your use case and search volume. Typically you'd have a dedicated ES
cluster if your app is doing a lot of
real-time indexing and search.
If it's only for Spark integration, then you could co-locate ES and Spark.
On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra Gangele <gangele...@gmail.com
<mailto:gangele...@gmail.com>> wrote:
Thanks for the reply.
Will the Elasticsearch index live within my cluster, or do I need to host
Elasticsearch separately?
On 28 April 2015 at 22:03, Nick Pentreath <nick.pentre...@gmail.com
<mailto:nick.pentre...@gmail.com>> wrote:
I haven't used Solr for a long time, and haven't used Solr in Spark.
However, why do you say "Elasticsearch is not a good option ..."? ES
absolutely supports full-text search and
not just filtering and grouping (in fact, its original purpose was and
still is text search, though filtering,
grouping and aggregation are heavily used).
http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html
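For instance, a full-text lookup over company names is a single match query in the Query DSL. The index and field names below are illustrative, not from the thread:

```json
POST /companies/_search
{
  "query": {
    "match": {
      "name": "acme holdings"
    }
  }
}
```

The `match` query analyzes the input text and scores documents by relevance, which is exactly the "totally text search" behavior being asked about; filters and aggregations are separate, additional capabilities.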
On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele <gangele...@gmail.com
<mailto:gangele...@gmail.com>> wrote:
Has anyone tried using Solr inside Spark?
Below is the project describing it:
https://github.com/LucidWorks/spark-solr
I have a requirement in which I want to index 20 million company
names and then search as and when new
data comes in. The output should be a list of companies matching the
query.
Spark has built-in Elasticsearch support, but for this purpose
Elasticsearch is not a good option, since this is
purely a text-search problem?
Elasticsearch is good for filtering and grouping.
Has anybody used Solr inside Spark?
Regards
jeetendra
--
Costin
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org