Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it
can't: there are many applications producing this PostgreSQL data, and I can't
really ask all of those teams to start writing to some other sink as well.


The velocity of the application's data is too high.

On 28 July 2015 at 21:50, <santosh...@gmail.com> wrote:

>  Sqoop’s incremental data fetch will reduce the amount of data you need to
> pull from the source, but by the time that incremental fetch completes, won’t
> the data be out of date again if its velocity is high?
>
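
For reference, Sqoop's incremental fetch mentioned above is driven by
command-line flags; a sketch of the invocation (the connection string, table,
and column names are placeholders):

```
sqoop import \
  --connect jdbc:postgresql://dbhost/mydb \
  --table events \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2015-07-28 00:00:00" \
  --target-dir /data/events
```

Each run imports only rows whose `updated_at` is newer than `--last-value`,
which is exactly why the result lags the source when write velocity is high.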
> Maybe you can put a trigger in Postgres to send data to the big data
> cluster as soon as changes are made. Or, as I was saying in another email,
> can the source write to Kafka/Flume/HBase in addition to Postgres?
>
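
The trigger idea above can be sketched in PostgreSQL DDL; the table name
`events` and the channel name are hypothetical, and a separate listener
process would have to forward the payloads to the cluster:

```sql
-- Hypothetical table; fires a NOTIFY with the row as JSON on every change.
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('events_changed', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_change_trigger
AFTER INSERT OR UPDATE ON events
FOR EACH ROW EXECUTE PROCEDURE notify_change();
```

A client that issues `LISTEN events_changed;` then receives each changed row
as JSON and can push it into Kafka or HBase.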
> Sent from Windows Mail
>
> *From:* Jeetendra Gangele <gangele...@gmail.com>
> *Sent:* ‎Tuesday‎, ‎July‎ ‎28‎, ‎2015 ‎5‎:‎43‎ ‎AM
> *To:* santosh...@gmail.com
> *Cc:* ayan guha <guha.a...@gmail.com>, felixcheun...@hotmail.com,
> user@spark.apache.org
>
> I am trying to do that, but there will always be a data mismatch, since by
> the time Sqoop finishes fetching, the main database will have received many
> more updates. There is an incremental data fetch in Sqoop, but it hits the
> database directly rather than reading the WAL.
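
Reading changes from the WAL, as mentioned above, is possible on
PostgreSQL 9.4+ via logical decoding; a minimal sketch using the built-in
`test_decoding` output plugin (the slot name is a placeholder, and the server
must run with `wal_level = logical`):

```sql
-- Create a replication slot that captures changes from the WAL.
SELECT * FROM pg_create_logical_replication_slot('spark_feed', 'test_decoding');

-- Consume the changes accumulated since the last call.
SELECT * FROM pg_logical_slot_get_changes('spark_feed', NULL, NULL);
```

Unlike Sqoop's incremental import, this reads committed changes from the
transaction log instead of re-querying the tables.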
>
>
>
> On 28 July 2015 at 02:52, <santosh...@gmail.com> wrote:
>
>>  Why cant you bulk pre-fetch the data to HDFS (like using Sqoop) instead
>> of hitting Postgres multiple times?
>>
>> Sent from Windows Mail
>>
>> *From:* ayan guha <guha.a...@gmail.com>
>> *Sent:* ‎Monday‎, ‎July‎ ‎27‎, ‎2015 ‎4‎:‎41‎ ‎PM
>> *To:* Jeetendra Gangele <gangele...@gmail.com>
>> *Cc:* felixcheun...@hotmail.com, user@spark.apache.org
>>
>> You can open a DB connection once per partition. Please have a look at the
>> design patterns for the foreach construct in the Spark Streaming
>> documentation.
>> How big is your data in the DB? How often does that data change? You would
>> be better off if the data were already in Spark.
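
The connection-per-partition pattern described above can be sketched outside
Spark; a minimal Python sketch, with sqlite3 standing in for PostgreSQL and
plain lists standing in for RDD partitions, showing one connection per
partition rather than one per record:

```python
import os
import sqlite3
import tempfile

# Stand-in for an RDD split into partitions of event keys.
partitions = [[1, 2], [3, 4]]

def setup_db(path):
    """Create a tiny lookup table (stands in for the PostgreSQL table)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE ref (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO ref VALUES (?, ?)",
                     [(i, "v%d" % i) for i in range(1, 5)])
    conn.commit()
    conn.close()

def process_partition(partition, db_path):
    """One connection per partition, reused for every record in it."""
    conn = sqlite3.connect(db_path)  # opened once per partition
    out = []
    for key in partition:
        row = conn.execute("SELECT val FROM ref WHERE id = ?",
                           (key,)).fetchone()
        out.append((key, row[0]))
    conn.close()                     # closed once per partition
    return out

db_path = os.path.join(tempfile.mkdtemp(), "ref.db")  # a JDBC URL in a real job
setup_db(db_path)
results = [pair for p in partitions for pair in process_partition(p, db_path)]
print(results)  # [(1, 'v1'), (2, 'v2'), (3, 'v3'), (4, 'v4')]
```

In Spark Streaming this corresponds to `rdd.foreachPartition(...)` inside
`foreachRDD`, so each executor opens a handful of connections per batch
instead of one per event.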
>> On 28 Jul 2015 04:48, "Jeetendra Gangele" <gangele...@gmail.com> wrote:
>>
>>> Thanks for your reply.
>>>
>>> In parallel I will be hitting around 6,000 calls to PostgreSQL, which is
>>> not good; my database will die. And these calls to the database will keep
>>> on increasing. Handling millions of requests is not an issue with
>>> HBase/NoSQL.
>>>
>>> Is there any other alternative?
>>>
>>>
>>>
>>>
>>> On 27 July 2015 at 23:18, <felixcheun...@hotmail.com> wrote:
>>>
>>>> You can have Spark read from PostgreSQL through the data access API. Do
>>>> you have any concerns with that approach, since you mention copying that
>>>> data into HBase?
>>>>
>>>> From: Jeetendra Gangele
>>>> Sent: Monday, July 27, 6:00 AM
>>>> Subject: Data from PostgreSQL to Spark
>>>> To: user
>>>>
>>>> Hi All
>>>>
>>>> I have a use case where I am consuming events from RabbitMQ using Spark
>>>> Streaming. Each event has some fields on which I want to query PostgreSQL,
>>>> bring back the matching data, join it with the event data, and put the
>>>> aggregated result into HDFS, so that I can run analytics queries over it
>>>> using Spark SQL.
>>>>
>>>> My question: the PostgreSQL data is production data, so I don't want to
>>>> hit it too many times.
>>>>
>>>> At any given second I may have 3,000 events, which means I would need to
>>>> fire 3,000 parallel queries against PostgreSQL, and this number keeps on
>>>> growing, so my database will go down.
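
One way to avoid thousands of individual queries per second is to batch the
lookup keys of each micro-batch into a single query. A minimal sketch, again
with sqlite3 standing in for PostgreSQL (the `ref` table and its columns are
hypothetical):

```python
import sqlite3

def batch_lookup(conn, keys):
    """Fetch all reference rows for a micro-batch in ONE query,
    instead of one query per event."""
    placeholders = ",".join("?" * len(keys))
    sql = "SELECT id, val FROM ref WHERE id IN (%s)" % placeholders
    return dict(conn.execute(sql, keys).fetchall())

# Tiny in-memory stand-in for the production PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO ref VALUES (?, ?)",
                 [(i, "v%d" % i) for i in range(10)])

# A micro-batch of event keys (3,000 in production, 4 here).
events = [3, 1, 7, 3]
ref = batch_lookup(conn, sorted(set(events)))  # one round trip to the DB
joined = [(e, ref[e]) for e in events]         # join done in memory
print(joined)  # [(3, 'v3'), (1, 'v1'), (7, 'v7'), (3, 'v3')]
```

Applied per partition in Spark Streaming, this turns thousands of queries per
batch into a handful: collect the partition's keys, issue one `IN (...)`
query, and join locally.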
>>>>
>>>>
>>>>
>>>> I can't migrate this PostgreSQL data, since lots of systems use it, but I
>>>> can copy the data to a NoSQL store like HBase and query HBase instead. The
>>>> issue there is: how can I make sure HBase has up-to-date data?
>>>>
>>>> Can anyone suggest the best approach/method to handle this case?
>>>>
>>>> Regards
>>>>
>>>> Jeetendra
>>>>
>>>>
>
>
>
>
