Hi Mich,

Thank you.
Regards,
Rajesh

On Mon, Aug 15, 2016 at 6:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok Rajesh
>
> This is standalone. In that case it ought to be at least 4 connections, as one executor will use one worker.
>
> I am hesitant here, hence the "at least", as with standalone mode you may end up with more executors on each worker.
>
> But try it and see whether "numPartitions" -> "4" is good, or you can change this to something higher.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 15 August 2016 at 12:19, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for the detailed explanation. One more question:
>>
>> In my cluster I have one master and 4 workers. In this case, will 4 connections be opened to Oracle?
>>
>> Regards,
>> Rajesh
>>
>> On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> It happens that the number of parallel processes opened from Spark to the RDBMS is determined by the number of executors.
>>>
>>> I just tested this. With Yarn client using two executors I see two connections to the RDBMS:
>>>
>>> EXECUTIONS USERNAME   SID SERIAL# USERS_EXECUTING SQL_TEXT
>>> ---------- ---------- --- ------- --------------- --------
>>>          1 SCRATCHPAD 443   62565               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) WHERE ID >= 23000001 AND ID < 24000001
>>>          1 SCRATCHPAD 406   46793               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) WHERE ID >= 24000001 AND ID < 25000001
>>>
>>> So it sounds like (can someone else independently confirm this?) that regardless of what one specifies in "numPartitions", one ends up with one connection per Spark executor to the RDBMS.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
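For reference, a minimal sketch of reproducing the test above (assumptions: _ORACLEserver, _username and _password are the same placeholders used elsewhere in this thread, and scratchpad.dummy has a numeric primary key ID):

val df = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> "scratchpad.dummy",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",
      "numPartitions"   -> "4",   // one range query per worker, per the advice above
      "user"            -> _username,
      "password"        -> _password)).load

// 4 partitions means 4 range queries are generated, but how many run at
// once is capped by the number of free executor task slots.
println(df.rdd.partitions.length)

// Force a full scan so the JDBC connections actually open, and watch the
// session count on the Oracle side while this runs.
df.count()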
>>>
>>> On 15 August 2016 at 09:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a very good question. I did some tests on this.
>>>>
>>>> If you are joining two tables, then you are creating a result set based on some conditions. In this case what I normally do is to specify an ID column from either table and base my partitioning on that ID column. This is pretty straightforward: bring back your ID column and base your lower and upper bounds on that ID value:
>>>>
>>>> "partitionColumn" -> "ID",
>>>> "lowerBound" -> "1",
>>>> "upperBound" -> "100000000",
>>>> "numPartitions" -> "100",
>>>>
>>>> Also I have noticed that regardless of the number of partitions you specify, the number of parallel connections on the RDBMS side will be limited, and the result set will be partitioned accordingly. For example, with numPartitions=100 I see only 8 connections in Oracle coming from the Spark connection.
>>>>
>>>> scala> val s = HiveContext.read.format("jdbc").options(
>>>>      | Map("url" -> _ORACLEserver,
>>>>      | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>>      | "partitionColumn" -> "ID",
>>>>      | "lowerBound" -> "1",
>>>>      | "upperBound" -> "100000000",
>>>>      | "numPartitions" -> "100",
>>>>      | "user" -> _username,
>>>>      | "password" -> _password)).load
>>>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 more fields]
>>>>
>>>> scala> s.toJavaRDD.partitions.size()
>>>> res1: Int = 100
>>>>
>>>> This also seems to set the number of partitions. I still think that the emphasis has to be on getting data out of the RDBMS as quickly as possible. The partitioning does work: below, the login scratchpad has multiple connections to Oracle and does the range selection OK.
>>>>
>>>>          1 SCRATCHPAD  45   43048               1 SELECT "SMALL_VC","CLUSTERED","PADDING","RANDOM_STRING","ID","SCATTERED","RANDOMISED" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) *WHERE ID >= 16000001 AND ID < 17000001*
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
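One caveat on the hardcoded "lowerBound"/"upperBound" above: rows with IDs outside the stated range still land in the first and last partitions, but the split is only even when the bounds match the data. A hedged sketch of deriving the bounds from the table itself before the partitioned read (same placeholders as in the thread; ID is assumed to be an Oracle NUMBER, which Spark reads back as a decimal):

// Single-row query to fetch the real ID range from Oracle.
val bounds = HiveContext.read.format("jdbc").options(
  Map("url"      -> _ORACLEserver,
      "dbtable"  -> "(SELECT MIN(ID) AS LO, MAX(ID) AS HI FROM scratchpad.dummy)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

val lo = bounds.getDecimal(0).longValue   // MIN(ID)
val hi = bounds.getDecimal(1).longValue   // MAX(ID)

val s = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> "scratchpad.dummy",
      "partitionColumn" -> "ID",
      "lowerBound"      -> lo.toString,
      "upperBound"      -> hi.toString,
      "numPartitions"   -> "100",
      "user"            -> _username,
      "password"        -> _password)).load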
>>>>
>>>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I would suggest you look at Sqoop as well. Essentially, you can provide a split-by column, using which the data will be distributed among your stated number of mappers.
>>>>>
>>>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> I have a question below.
>>>>>>
>>>>>> I want to join two tables and return the result based on an input value. In this case, how do we specify the lower bound and upper bound values?
>>>>>>
>>>>>> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where t1.transactionid=*11111* and t1.id = t2.id
>>>>>>
>>>>>> *11111 => dynamic input value.*
>>>>>>
>>>>>> Regards,
>>>>>> Rajesh
>>>>>>
>>>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> If you have your RDBMS table partitioned, then you need to consider how much data you want to extract, in other words the result set returned by the JDBC call.
>>>>>>>
>>>>>>> If you want all the data, then the number of partitions specified in the JDBC call should be equal to the number of partitions in your RDBMS table.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>>>
>>>>>>>> Thank you very much sir.
>>>>>>>>
>>>>>>>> I forgot to mention that two of these Oracle tables are range partitioned. In that case what would be the optimum number of partitions, if you can share?
>>>>>>>>
>>>>>>>> Warmest
>>>>>>>>
>>>>>>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> If you have primary keys on these tables then you can parallelise the process of reading the data.
>>>>>>>>
>>>>>>>> You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB.
>>>>>>>>
>>>>>>>> Assuming that your underlying table has primary key ID, this will create 20 parallel processes to the Oracle DB:
>>>>>>>>
>>>>>>>> val d = HiveContext.read.format("jdbc").options(
>>>>>>>> Map("url" -> _ORACLEserver,
>>>>>>>> "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
>>>>>>>> "partitionColumn" -> "ID",
>>>>>>>> "lowerBound" -> "1",
>>>>>>>> "upperBound" -> "maxID",
>>>>>>>> "numPartitions" -> "20",
>>>>>>>> "user" -> _username,
>>>>>>>> "password" -> _password)).load
>>>>>>>>
>>>>>>>> assuming your upper bound on ID is maxID.
>>>>>>>>
>>>>>>>> This will open multiple connections to the RDBMS, each getting a subset of the data that you want.
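The same template can also cover the join-with-a-dynamic-filter question further up the thread: push the whole join down to Oracle as an inline view and partition on the join key. A hedged sketch, using the t1/t2 tables and columns from Rajesh's example (the bounds here are placeholders; in practice they would come from the real range of t1.id, for instance as sketched earlier):

val transactionId = 11111   // the dynamic input value
val pushedDownJoin =
  s"""(SELECT t1.id, t1.name, t2.course, t2.qualification
     |   FROM t1, t2
     |  WHERE t1.transactionid = $transactionId
     |    AND t1.id = t2.id)""".stripMargin

val joined = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> pushedDownJoin,   // Spark appends its id-range predicates to this inline view
      "partitionColumn" -> "id",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",      // placeholder; use the real MAX(t1.id)
      "numPartitions"   -> "20",
      "user"            -> _username,
      "password"        -> _password)).load

This keeps both the filter and the join on the Oracle side, so each of the 20 connections only pulls its slice of the already-joined result set.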
>>>>>>>>
>>>>>>>> You need to test it to ensure that you get the optimum numPartitions and you don't overload any component.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> There are 4 tables ranging from 10 million to 100 million rows, but they all have primary keys.
>>>>>>>>
>>>>>>>> The network is fine, but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only).
>>>>>>>>
>>>>>>>> We have a limited time window of a few hours to get the required data out.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> How big are your tables, and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the problem?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> Hi Gurus,
>>>>>>>>
>>>>>>>> I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC.
>>>>>>>>
>>>>>>>> What is the quickest way of getting data into a Spark DataFrame, say with multiple connections from Spark?
>>>>>>>>
>>>>>>>> thanking you
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha