Re: Data from PostgreSQL to Spark

2015-08-03 Thread Jeetendra Gangele
Here is the solution; this looks perfect for me.
Thanks for all your help.

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
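
For later readers, a minimal sketch of consuming such a change stream from Kafka with Spark Streaming is below. It assumes a Spark 1.x setup with the spark-streaming-kafka artifact on the classpath; the broker address and topic name are placeholders, and Bottled Water publishes Avro-encoded values, so a real job would add Avro decoding where this sketch only counts records.

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PostgresChangeStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("postgres-change-stream"), Seconds(1))

    // Broker list and topic name are placeholders; Bottled Water publishes
    // one Kafka topic per Postgres table.
    val kafkaParams = Map("metadata.broker.list" -> "kafka-broker:9092")
    val topics = Set("customers")

    // Direct (receiver-less) Kafka stream: keys as strings, values as raw
    // Avro bytes that a real job would decode with the table's schema.
    val changes = KafkaUtils.createDirectStream[
      String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topics)

    changes.count().print()   // replace with real processing / HBase upserts

    ssc.start()
    ssc.awaitTermination()
  }
}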

On 28 July 2015 at 23:27, Jörn Franke jornfra...@gmail.com wrote:

 Can you put some transparent cache in front of the database? Or some JDBC
 proxy?



Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Hi Ayan, thanks for the reply.
It's around 5 GB across 10 tables. This data changes very frequently: a few
updates every minute.
It's difficult to hold this data in Spark; if updates happen on the main
tables, how can I refresh the data in Spark?





On 28 July 2015 at 02:11, ayan guha guha.a...@gmail.com wrote:

 You can open the DB connection once per partition. Please have a look at the
 design patterns for foreachRDD in the documentation.
 How big is your data in the DB? How often does it change? You would be better
 off if the data were in Spark already.


Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
I am trying to do that, but there will always be a data mismatch, since by the
time Sqoop finishes fetching, the main database will have received many more
updates. There is an incremental data fetch in Sqoop, but it hits the database
rather than reading the WAL.



On 28 July 2015 at 02:52, santosh...@gmail.com wrote:

 Why can't you bulk pre-fetch the data to HDFS (e.g. using Sqoop) instead
 of hitting Postgres multiple times?



Re: Data from PostgreSQL to Spark

2015-07-28 Thread santoshv98
Sqoop's incremental data fetch will reduce the data size you need to pull from
the source, but by the time that incremental fetch is complete, isn't the data
stale again if its velocity is high?

Maybe you can put a trigger in Postgres to send the changes to the big data
cluster as soon as they are made. Or, as I was saying in another email, can the
source write to Kafka/Flume/HBase in addition to Postgres?

Sent from Windows Mail

From: Jeetendra Gangele
Sent: Tuesday, July 28, 2015 5:43 AM
To: santosh...@gmail.com
Cc: ayan guha, felixcheun...@hotmail.com, user@spark.apache.org

I am trying to do that, but there will always be a data mismatch, since by the
time Sqoop finishes fetching, the main database will have received many more
updates. There is an incremental data fetch in Sqoop, but it hits the database
rather than reading the WAL.

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it
can't, because there are many applications producing this PostgreSQL data and I
can't really ask all of those teams to start writing to another destination as
well.

The velocity of the application is too high.






On 28 July 2015 at 21:50, santosh...@gmail.com wrote:

 Sqoop's incremental data fetch will reduce the data size you need to pull
 from the source, but by the time that incremental fetch is complete, isn't the
 data stale again if its velocity is high?

 Maybe you can put a trigger in Postgres to send the changes to the big data
 cluster as soon as they are made. Or, as I was saying in another email, can
 the source write to Kafka/Flume/HBase in addition to Postgres?



Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jörn Franke
Can you put some transparent cache in front of the database? Or some JDBC
proxy?

On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com wrote:

 Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it
 can't, because there are many applications producing this PostgreSQL data and
 I can't really ask all of those teams to start writing to another destination
 as well.

 The velocity of the application is too high.


Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Hi All,

I have a use case where I am consuming events from RabbitMQ using Spark
Streaming. Each event has some fields on which I want to query PostgreSQL,
bring back the matching data, join the event data with the PostgreSQL data, and
put the aggregated result into HDFS, so that I can run analytics queries over
it using Spark SQL.

My question: the PostgreSQL data is production data, so I don't want to hit it
too many times.

In any given second I may have 3000 events, which means I would need to fire
3000 parallel queries against PostgreSQL, and this volume keeps growing, so my
database will go down.

I can't migrate this PostgreSQL data since lots of systems use it, but I could
copy it into some NoSQL store like HBase and query HBase instead; the issue
there is, how can I make sure HBase has up-to-date data?

Can anyone suggest the best approach/method to handle this case?


Regards
Jeetendra
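
For reference, a rough sketch of the pipeline described above, joining each
micro-batch of events against a reference table loaded over JDBC and appending
the result to HDFS, might look like the following (Spark 1.4-era API; the event
fields, connection details, table and paths are made-up placeholders, and the
RabbitMQ source is stubbed out with an empty queue stream):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical event shape; the real fields come from the RabbitMQ messages.
case class Event(customerId: Long, amount: Double)

object EnrichEvents {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("enrich-events"), Seconds(1))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.implicits._

    // Reference data loaded once over JDBC and cached. Placeholder connection
    // details; the PostgreSQL JDBC driver must be on the classpath.
    val customers = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://pg-host:5432/mydb?user=spark&password=secret",
      "dbtable" -> "customers")).load().cache()

    // Stand-in for the RabbitMQ stream; a real job would use a receiver here.
    val events = ssc.queueStream(mutable.Queue.empty[RDD[Event]])

    events.foreachRDD { rdd =>
      // Join the micro-batch against the cached reference table and append to
      // HDFS so Spark SQL can query the enriched data later.
      val enriched = rdd.toDF().join(customers, "customerId")
      enriched.write.mode("append").parquet("hdfs:///data/enriched_events")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The catch, as the rest of the thread points out, is that the cached reference
table goes stale while Postgres keeps changing, which is what eventually pushes
the discussion towards change capture.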


Re: Data from PostgreSQL to Spark

2015-07-27 Thread felixcheung_m
You can have Spark read from PostgreSQL through the data source API. Do you
have any concerns with that approach, since you mention copying that data into
HBase?
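
Concretely, a sketch of such a JDBC read with the 1.4-era API is below; the
URL, credentials, table and partitioning bounds are placeholders, and the
partitioning options merely split the read into several range scans instead of
one large query:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object ReadPostgresTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-postgres"))
    val sqlContext = new SQLContext(sc)

    // Placeholder URL, table and bounds; the PostgreSQL JDBC driver must be
    // on the executor classpath.
    val customers = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://pg-host:5432/mydb?user=spark&password=secret",
      "dbtable"         -> "customers",
      "partitionColumn" -> "id",
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "8")).load()

    customers.registerTempTable("customers")
    sqlContext.sql("SELECT count(*) FROM customers").show()
  }
}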



From: Jeetendra Gangele
Sent: Monday, July 27, 6:00 AM
Subject: Data from PostgreSQL to Spark
To: user

Re: Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Thanks for your reply.

In parallel I will be hitting PostgreSQL with around 6000 calls, which is not
good; my database will die.
These calls to the database will keep on increasing.
Handling millions of requests is not an issue with HBase/NoSQL.

Any other alternative?




On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark read from PostgreSQL through the data source API. Do you
 have any concerns with that approach, since you mention copying that data
 into HBase?



Re: Data from PostgreSQL to Spark

2015-07-27 Thread ayan guha
You can open the DB connection once per partition. Please have a look at the
design patterns for foreachRDD in the documentation.
How big is your data in the DB? How often does it change? You would be better
off if the data were in Spark already.
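
A sketch of what one connection per partition can look like for the lookup case
in this thread is below; the table, column and connection details are invented
for illustration, and it still issues one query per record, just over a single
connection per partition:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

object PartitionLookup {
  // Enrich an RDD of ids with one JDBC connection per partition instead of
  // one connection per record. All names below are placeholders.
  def enrich(ids: RDD[Long]): RDD[(Long, String)] = {
    ids.mapPartitions { part =>
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://pg-host:5432/mydb", "spark", "secret")
      val stmt = conn.prepareStatement("SELECT name FROM customers WHERE id = ?")
      // Materialise the partition's results before closing the connection,
      // since the iterator is otherwise evaluated lazily.
      val result = part.map { id =>
        stmt.setLong(1, id)
        val rs = stmt.executeQuery()
        val name = if (rs.next()) rs.getString(1) else ""
        rs.close()
        (id, name)
      }.toList
      stmt.close()
      conn.close()
      result.iterator
    }
  }
}

For heavier use, the design-patterns section of the Spark Streaming guide goes
one step further and suggests a lazily created, per-executor connection pool.
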
On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:

 Thanks for your reply.

 In parallel I will be hitting PostgreSQL with around 6000 calls, which is not
 good; my database will die.
 These calls to the database will keep on increasing.
 Handling millions of requests is not an issue with HBase/NoSQL.

 Any other alternative?


Re: Data from PostgreSQL to Spark

2015-07-27 Thread santoshv98

I can't migrate this PostgreSQL data since lots of systems use it, but I could
copy it into some NoSQL store like HBase and query HBase instead; the issue
there is, how can I make sure HBase has up-to-date data?

Is the velocity in Postgres such that your data would become stale as soon as
it reaches the big data cluster? If your concern is that the datastore (HBase
etc.) in the big data cluster is not current, can the source write to other
stores (like Kafka/HBase/Flume) as well when it writes to Postgres?

Sent from Windows Mail

From: santosh...@gmail.com
Sent: Monday, July 27, 2015 5:22 PM
To: ayan guha, Jeetendra Gangele
Cc: felixcheun...@hotmail.com, user@spark.apache.org

Why can't you bulk pre-fetch the data to HDFS (e.g. using Sqoop) instead of
hitting Postgres multiple times?