Re: Data from PostgreSQL to Spark

2015-07-28 Thread santoshv98
Sqoop’s incremental data fetch will reduce the data size you need to pull from 
the source, but won’t the data be stale again by the time that incremental fetch 
completes, if the velocity of the data is high?




Maybe you can put a trigger in Postgres to send data to the big data cluster 
as soon as changes are made. Or, as I was saying in another email, can the 
source write to Kafka/Flume/HBase in addition to Postgres?
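The dual-write idea can be sketched in plain Python. Everything below is a stand-in: `pg_insert` represents e.g. a psycopg2 INSERT and `stream_publish` a Kafka producer send; neither real client is used.

```python
# Sketch of the dual-write pattern: the source application commits each change
# to Postgres and also publishes it to a stream so the big-data side stays fresh.
# Both sinks are injected callables (stand-ins for real psycopg2/Kafka clients).

def write_event(event, pg_insert, stream_publish, retry_queue):
    """Commit to the primary store first, then publish the change downstream."""
    pg_insert(event)               # primary write; let failures here propagate
    try:
        stream_publish(event)      # best-effort publish to Kafka/Flume/HBase
    except Exception:
        retry_queue.append(event)  # park for later replay so the cluster catches up
```

The catch is consistency: if the publish fails after the Postgres commit, something (here a retry queue) must replay it, which is part of why WAL-based change capture is attractive instead.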





Sent from Windows Mail





From: Jeetendra Gangele
Sent: Tuesday, July 28, 2015 5:43 AM
To: santosh...@gmail.com
Cc: ayan guha, felixcheun...@hotmail.com, user@spark.apache.org





I am trying to do that, but there will always be a data mismatch, since by the time 
Sqoop is fetching, the main database will have received many updates. There is 
something called incremental data fetch in Sqoop, but that hits the database rather 
than reading the WAL.
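For reference, a timestamp-watermark incremental fetch (roughly what Sqoop's `--incremental lastmodified` mode does) can be sketched in plain Python; the "table" here is just a list of dicts and the column name is hypothetical.

```python
# Sketch of timestamp-based incremental fetch. In SQL terms this is roughly:
#   SELECT * FROM t WHERE updated_at > :watermark
# i.e. it queries the table itself rather than tailing the WAL, so rows updated
# while the fetch runs are only picked up on the next pass.

def incremental_fetch(table, watermark):
    """Return rows modified after `watermark`, plus the new watermark."""
    fresh = [row for row in table if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in fresh), default=watermark)
    return fresh, new_watermark
```

The staleness window described above is visible here: any update committed after the SELECT starts is missing downstream until the next run.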







On 28 July 2015 at 02:52,  wrote:




Why can’t you bulk pre-fetch the data to HDFS (e.g. using Sqoop) instead of 
hitting Postgres multiple times?






From: ayan guha
Sent: Monday, July 27, 2015 4:41 PM
To: Jeetendra Gangele
Cc: felixcheun...@hotmail.com, user@spark.apache.org







You can open the DB connection once per partition. Please have a look at the design 
patterns for the foreach construct in the documentation. 
How big is your data in the DB? How often does that data change? You would be better 
off if the data were in Spark already.
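A rough plain-Python sketch of the once-per-partition pattern ayan describes; `connect` is a stand-in for something like `psycopg2.connect`, and the SQL and table are hypothetical. In a real job the function would be handed to `rdd.foreachPartition`.

```python
# Sketch of the "one connection per partition" pattern: open a single DB
# connection for the whole partition and batch the writes, instead of
# connecting once per record.

def write_partition(rows, connect, batch_size=500):
    """Open ONE connection for the partition, insert rows in batches."""
    conn = connect()                      # once per partition, not per row
    try:
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO t VALUES (%s, %s)", batch)
                batch = []
        if batch:                         # flush the final partial batch
            conn.executemany("INSERT INTO t VALUES (%s, %s)", batch)
        conn.commit()
    finally:
        conn.close()

# In a Spark job (hypothetical):
# df.rdd.foreachPartition(lambda rows: write_partition(rows, make_pg_connection))
```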

On 28 Jul 2015 04:48, "Jeetendra Gangele"  wrote:


Thanks for your reply.



In parallel I will be hitting PostgreSQL with around 6000 calls, which is not good; my 
database will die.

These calls to the database will keep on increasing.

Handling millions of requests is not an issue with HBase/NoSQL.


Any other alternative?

On 27 July 2015 at 23:18,  wrote:


You can have Spark read from PostgreSQL through the data access API. Do you 
have any concerns with that approach, since you mention copying that data into 
HBase?
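For concreteness, these are the standard options Spark's JDBC data source accepts for a parallel read; the URL, table, and bounds below are made-up placeholders.

```python
# Sketch: option map for a partitioned JDBC read of PostgreSQL via Spark's
# built-in data source. partitionColumn/lowerBound/upperBound/numPartitions
# control how Spark splits the table into parallel queries.

def jdbc_read_options(url, table, column, lower, upper, num_partitions):
    """Options for spark.read.format('jdbc').options(**opts).load()."""
    return {
        "url": url,                        # e.g. jdbc:postgresql://host:5432/db
        "dbtable": table,
        "partitionColumn": column,         # must be a numeric/date column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

# In a real job (requires a SparkSession and the Postgres JDBC driver):
# df = (spark.read.format("jdbc")
#       .options(**jdbc_read_options("jdbc:postgresql://host:5432/proddb",
#                                    "reference_table", "id", 1, 1_000_000, 8))
#       .load())
```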



From: Jeetendra Gangele
Sent: Monday, July 27, 6:00 AM
Subject: Data from PostgreSQL to Spark
To: user




Hi All 


I have a use case where I am consuming events from RabbitMQ using Spark 
Streaming. Each event has some fields on which I want to query PostgreSQL, bring 
back the data, join the event data with the PostgreSQL data, and put the 
aggregated data into HDFS, so that I can run analytics queries over it using 
Spark SQL. 


My concern is that this PostgreSQL data is production data, so I don't want to hit it 
so many times. 


At any given second I may have 3000 events, which means I need to fire 
3000 parallel queries at my PostgreSQL, and this volume keeps growing, so my 
database will go down. 
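One common way to avoid a query per event is to batch the lookup keys from each micro-batch into a single round trip. A hedged plain-Python sketch; `lookup_many` stands in for one batched `SELECT ... WHERE key = ANY(...)` query, and the field names are hypothetical.

```python
# Sketch: enrich a micro-batch of events with ONE database round trip instead
# of one query per event. lookup_many(keys) -> {key: reference_row} is a
# stand-in for a single batched SELECT.

def enrich_batch(events, key_field, lookup_many):
    keys = {e[key_field] for e in events}   # distinct keys in this batch
    ref = lookup_many(keys)                 # single round trip for all of them
    return [{**e, **ref.get(e[key_field], {})} for e in events]
```

With 3000 events per second this turns 3000 queries into one (or a few, if the key set is chunked), at the cost of the batch latency.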

  

I can't migrate this PostgreSQL data since lots of systems use it, but I can 
copy the data to some NoSQL store such as HBase and query HBase instead. The issue 
here is: how can I make sure that HBase has up-to-date data? 


Can anyone suggest the best approach/method to handle this case? 



Regards 

Jeetendra

Re: Data from PostgreSQL to Spark

2015-07-27 Thread santoshv98

I can't migrate this PostgreSQL data since lots of systems use it, but I can 
copy the data to some NoSQL store such as HBase and query HBase instead. The issue 
here is: how can I make sure that HBase has up-to-date data? 

Is the velocity in Postgres such that your data would become stale as soon as 
it reaches the big data cluster? If your concern is that the datastore (HBase etc.) 
in the big data cluster is not current, can the source write to other stores 
(Kafka/HBase/Flume etc.) as well when it writes to Postgres? 






Re: Data from PostgreSQL to Spark

2015-07-27 Thread santoshv98
Why can’t you bulk pre-fetch the data to HDFS (e.g. using Sqoop) instead of 
hitting Postgres multiple times?






Re: Spark performance

2015-07-12 Thread santoshv98
Ravi


Spark (or, for that matter, Big Data solutions like Hive) is suited to large 
analytical loads, where “scaling up” starts to pale in comparison to “scaling out” 
with regard to performance, versatility (types of data), and cost. Without going 
into the details of MSSQL architecture, there is an inflection point in terms of 
cost (licensing), performance, and maintainability where an open-source commodity 
platform starts to become viable, albeit sometimes at the expense of slower 
performance. With 1 million records, I am not sure you are reaching that point to 
justify a Spark cluster. So why are you planning to move away from MSSQL and on to 
Spark as the destination platform?


You said “Spark performance” is slow compared to MSSQL. What kind of load are 
you running, and what kind of queries are you performing? There may be startup 
costs associated with the map side of the query.


If your aim is to test and understand Spark, can you post what you are currently 
doing (queries, table structures, compression and storage optimizations)? That way 
we could suggest optimizations; again, not to compare with MSSQL, but to improve 
the Spark side of things.


Again, to quote someone who answered earlier in the thread: what is your ‘use 
case’? 


-Santosh







From: Jörn Franke
Sent: Saturday, July 11, 2015 8:20 PM
To: Mohammed Guller, Ravisankar Mani, user@spark.apache.org





Honestly, you are addressing this the wrong way: you do not seem to have a business 
case for changing, so why do you want to switch?




On Sat, 11 Jul 2015 at 03:28, Mohammed Guller wrote:





Hi Ravi,

First, neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Second, they are designed for 
processing large distributed datasets. If you have only 100,000 records, or even 
a million records, you don’t need Spark. An RDBMS will perform much better at 
that volume of data.

 

Mohammed

 

From: Ravisankar Mani [mailto:rrav...@gmail.com] 
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance



 





Hi everyone,


I have planned to move from MSSQL Server to Spark. I am using around 50,000 to 
1 lakh (100,000) records.


 Spark’s performance is slow when compared to MSSQL Server.


 

Which is the better place (Spark or SQL Server) to store and retrieve around 
50,000 to 1 lakh records?

Regards,

Ravi