Re: Experiences about NoSQL databases with Spark

2015-12-06 Thread ayan guha
Hi

I have a general question. I want to do real-time aggregation using
Spark. I have Kinesis as the source and plan to use ES as the data store.
There may be close to 2000 distinct events, and I want to keep a running
count of how many times each event occurs.

Currently, upon receiving an event, I look it up in the backend by its event
code (which is used as the document id, so the lookup is fast) and add 1 to
the current value.

I am worried because this process is not idempotent. To avoid that, I could
instead write every event and let ES aggregate at query time, but that seems
wasteful. Am I correct in this assumption?

I know about updateStateByKey and the new trackStateByKey functions, but I was
wondering what the general approach to this problem is. Any pointer would be
very helpful.
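
For context, here is a minimal sketch of the updateStateByKey approach I am
referring to. The socket source (standing in for Kinesis), the checkpoint
directory, and print() (standing in for writing the totals to ES) are all
placeholders, not a working pipeline:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningEventCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningEventCounts").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/event-count-checkpoint")  // required by updateStateByKey

    // Placeholder source: one event code per line (stand-in for the Kinesis stream).
    val events = ssc.socketTextStream("localhost", 9999)

    // Add this batch's occurrences to the previous running total per event code.
    val updateCount: (Seq[Int], Option[Long]) => Option[Long] =
      (batch, state) => Some(state.getOrElse(0L) + batch.sum)

    val counts = events
      .map(code => (code, 1))        // one record per occurrence
      .updateStateByKey(updateCount) // running total per event code

    counts.print()  // here one would instead index the latest totals into ES

    ssc.start()
    ssc.awaitTermination()
  }
}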

Best
Ayan

On Sun, Dec 6, 2015 at 6:17 PM, Nick Pentreath 
wrote:

> I've had great success using Elasticsearch with Spark - the integration
> works well (both ways - reading and indexing) and ES + Kibana makes a
> powerful event / time-series storage, aggregation and data visualization
> stack.
>
>
> —
> Sent from Mailbox 
>
>
> On Sun, Dec 6, 2015 at 9:07 AM, manasdebashiskar 
> wrote:
>
>> Depends on your need.
>> Have you looked at Elasticsearch, or Accumulo, or Cassandra?
>> If post-processing of your data is not your goal and you just want to
>> retrieve the data later, Greenplum (based on PostgreSQL) can be an
>> alternative.
>>
>> In short, there are many NoSQL databases out there, each with different
>> project maturity and feature sets.


-- 
Best Regards,
Ayan Guha


Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread Nick Pentreath
I've had great success using Elasticsearch with Spark - the integration works 
well (both ways - reading and indexing) and ES + Kibana makes a powerful event 
/ time-series storage, aggregation and data visualization stack.
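
For anyone reading this later, a minimal sketch of both directions with the
elasticsearch-hadoop (elasticsearch-spark) connector - the index name
"events/hits", the fields, and the local ES node are made-up placeholders, not
from a real setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // brings saveToEs / esRDD into scope

val conf = new SparkConf()
  .setAppName("es-spark-sketch")
  .setMaster("local[2]")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Indexing: write a small RDD of documents (Maps) into ES.
val docs = sc.makeRDD(Seq(
  Map("event" -> "login",  "count" -> 1),
  Map("event" -> "logout", "count" -> 1)))
docs.saveToEs("events/hits")

// Reading: load the documents back as (id, fields) pairs for further processing.
val fromEs = sc.esRDD("events/hits")
fromEs.take(5).foreach(println)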






—
Sent from Mailbox

On Sun, Dec 6, 2015 at 9:07 AM, manasdebashiskar 
wrote:

> Depends on your need.
> Have you looked at Elasticsearch, or Accumulo, or Cassandra?
> If post-processing of your data is not your goal and you just want to
> retrieve the data later, Greenplum (based on PostgreSQL) can be an
> alternative.
> In short, there are many NoSQL databases out there, each with different
> project maturity and feature sets.

Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread manasdebashiskar
Depends on your need.
Have you looked at Elasticsearch, or Accumulo, or Cassandra?
If post-processing of your data is not your goal and you just want to
retrieve the data later, Greenplum (based on PostgreSQL) can be an
alternative.

In short, there are many NoSQL databases out there, each with different
project maturity and feature sets.







Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Jörn Franke
I would not use MongoDB because it does not fit well into the Spark or Hadoop
architecture. You can use it if your data volume is very small and already
pre-aggregated, but that is a very limited use case. For interactive queries
you can use HBase, or future versions of Hive (if they use Tez > 0.8).
HBase with Phoenix and Hive offer standard SQL interfaces and can easily be
integrated with a web interface.
With Hive you can already use the ORC and Parquet formats on HDFS today. They
support storage indexes and bloom filters to accelerate your queries. You could
also just use HDFS with these storage formats directly.
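
As a rough illustration of the ORC point (the table, columns, and the
bloom-filter column below are invented for the example, and the exact
TBLPROPERTIES may vary with your Hive version):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("orc-sketch").setMaster("local[2]"))
val hiveContext = new HiveContext(sc)

// ORC table with a storage index and a bloom filter on the lookup column.
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS events_orc (event_code STRING, event_count BIGINT)
    |STORED AS ORC
    |TBLPROPERTIES ("orc.create.index" = "true",
    |               "orc.bloom.filter.columns" = "event_code")""".stripMargin)

// Append some processed output (here a tiny in-memory DataFrame) into the table.
val processed = hiveContext
  .createDataFrame(Seq(("login", 42L), ("logout", 17L)))
  .toDF("event_code", "event_count")
processed.write.mode("append").insertInto("events_orc")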

Maybe you can elaborate more on data volumes and queries you want to do on the 
processed part? Is the processed data updated?

Depending on your use case/data, other options for interactive queries are
Solr/Elasticsearch for text analytics and TitanDB for interactive graph
queries (the latter supports, among others, HBase as the storage layer). Of
course there are more (including commercial ones). Both offer REST interfaces
and would be easy to integrate with a web application using JSON/D3.js. In
some cases a relational database can also make sense.



> On 24 Nov 2015, at 13:46, sparkuser2345  wrote:
> 
> I'm interested in knowing which NoSQL databases you use with Spark and what
> are your experiences. 
> 
> On a general level, I would like to use Spark streaming to process incoming
> data, fetch relevant aggregated data from the database, and update the
> aggregates in the DB based on the incoming records. The data in the DB
> should be indexed to be able to fetch the relevant data fast and to allow
> fast interactive visualization of the data. 
> 
> I've been reading about MongoDB+Spark and I've got the impression that there
> are some challenges in fetching data by indices and in updating documents,
> but things are moving so fast, so I don't know if these are relevant
> anymore. Do you find any benefit from using HBase with Spark as HBase is
> built on top of HDFS? 




Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
BTW, if you decide to try MongoDB, please use the 3.0+ version with the
WiredTiger storage engine.

On Sat, Nov 28, 2015 at 11:30 PM, Yu Zhang  wrote:

> If you need to construct multiple indexes, HBase will perform better; with
> many indexes, MongoDB's write speed is slow and the memory cost is huge!
>
> On the other hand, with MongoDB you can easily work with JavaScript and
> visualization tools like D3.js, so that part of the work becomes a breeze.
>
> Could you provide additional details on the data size and the number of
> operations you need in your program? I believe this is quite a general
> question and I hope to hear other comments and thoughts.
>
> On Tue, Nov 24, 2015 at 9:50 AM, Ted Yu  wrote:
>
>> You should consider using HBase as the NoSQL database.
>> w.r.t. 'The data in the DB should be indexed', you need to design the
>> schema in HBase carefully so that the retrieval is fast.
>>
>> Disclaimer: I work on HBase.
>>
>> On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 
>> wrote:
>>
>>> I'm interested in knowing which NoSQL databases you use with Spark and
>>> what
>>> are your experiences.
>>>
>>> On a general level, I would like to use Spark streaming to process
>>> incoming
>>> data, fetch relevant aggregated data from the database, and update the
>>> aggregates in the DB based on the incoming records. The data in the DB
>>> should be indexed to be able to fetch the relevant data fast and to allow
>>> fast interactive visualization of the data.
>>>
>>> I've been reading about MongoDB+Spark and I've got the impression that
>>> there
>>> are some challenges in fetching data by indices and in updating
>>> documents,
>>> but things are moving so fast, so I don't know if these are relevant
>>> anymore. Do you find any benefit from using HBase with Spark as HBase is
>>> built on top of HDFS?


Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
If you need to construct multiple indexes, HBase will perform better; with
many indexes, MongoDB's write speed is slow and the memory cost is huge!

On the other hand, with MongoDB you can easily work with JavaScript and
visualization tools like D3.js, so that part of the work becomes a breeze.

Could you provide additional details on the data size and the number of
operations you need in your program? I believe this is quite a general
question and I hope to hear other comments and thoughts.

On Tue, Nov 24, 2015 at 9:50 AM, Ted Yu  wrote:

> You should consider using HBase as the NoSQL database.
> w.r.t. 'The data in the DB should be indexed', you need to design the
> schema in HBase carefully so that the retrieval is fast.
>
> Disclaimer: I work on HBase.
>
> On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 
> wrote:
>
>> I'm interested in knowing which NoSQL databases you use with Spark and
>> what
>> are your experiences.
>>
>> On a general level, I would like to use Spark streaming to process
>> incoming
>> data, fetch relevant aggregated data from the database, and update the
>> aggregates in the DB based on the incoming records. The data in the DB
>> should be indexed to be able to fetch the relevant data fast and to allow
>> fast interactive visualization of the data.
>>
>> I've been reading about MongoDB+Spark and I've got the impression that
>> there
>> are some challenges in fetching data by indices and in updating documents,
>> but things are moving so fast, so I don't know if these are relevant
>> anymore. Do you find any benefit from using HBase with Spark as HBase is
>> built on top of HDFS?


Re: Experiences about NoSQL databases with Spark

2015-11-24 Thread Ted Yu
You should consider using HBase as the NoSQL database.
w.r.t. 'The data in the DB should be indexed', you need to design the
schema in HBase carefully so that the retrieval is fast.

Disclaimer: I work on HBase.
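
To make the row-key point a bit more concrete, here is a small sketch of one
possible design (table name, column family, and key layout are only an
illustration, not a recommendation): if the common read is "latest rows for an
event code", a composite key of event code plus reversed timestamp keeps those
rows adjacent and newest-first.

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("events"))

// Composite row key: event code + reversed timestamp, so the newest row for an
// event sorts first and a prefix scan on the event code finds it quickly.
val eventCode  = "login"
val reversedTs = Long.MaxValue - System.currentTimeMillis()
val rowKey     = Bytes.add(Bytes.toBytes(eventCode), Bytes.toBytes(reversedTs))

val put = new Put(rowKey)
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(1L))
table.put(put)

// Retrieval: prefix scan over all rows for this event code, newest first.
val scanner = table.getScanner(new Scan().setRowPrefixFilter(Bytes.toBytes(eventCode)))
scanner.asScala.take(5).foreach(r => println(Bytes.toString(r.getRow)))

scanner.close()
table.close()
conn.close()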

On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 
wrote:

> I'm interested in knowing which NoSQL databases you use with Spark and what
> are your experiences.
>
> On a general level, I would like to use Spark streaming to process incoming
> data, fetch relevant aggregated data from the database, and update the
> aggregates in the DB based on the incoming records. The data in the DB
> should be indexed to be able to fetch the relevant data fast and to allow
> fast interactive visualization of the data.
>
> I've been reading about MongoDB+Spark and I've got the impression that
> there
> are some challenges in fetching data by indices and in updating documents,
> but things are moving so fast, so I don't know if these are relevant
> anymore. Do you find any benefit from using HBase with Spark as HBase is
> built on top of HDFS?