Hi Mich,

Thank you for the detailed explanation. One more question:
In my cluster, I have one master and 4 workers. In this case, will 4 connections be opened to Oracle?

Regards,
Rajesh

On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> It happens that the number of parallel processes open from Spark to the
> RDBMS is determined by the number of executors.
>
> I just tested this. With YARN client mode using two executors, I see two
> connections to the RDBMS:
>
>  EXECUTIONS USERNAME   SID SERIAL# USERS_EXECUTING SQL_TEXT
>  ---------- ---------- --- ------- --------------- --------
>           1 SCRATCHPAD 443   62565               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>                                                    FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>                                                          to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>                                                          RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>                                                    WHERE ID >= 23000001 AND ID < 24000001
>           1 SCRATCHPAD 406   46793               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>                                                    FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>                                                          to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>                                                          RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>                                                    WHERE ID >= 24000001 AND ID < 25000001
>
> So it sounds like (can someone else independently confirm this?) that,
> regardless of what one specifies in "numPartitions", one ends up with one
> connection per Spark executor to the RDBMS.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
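One way to check both numbers under discussion from a spark-shell session -- a
minimal sketch only, reusing the _ORACLEserver, _username, _password and
scratchpad.dummy placeholders from the thread, with HiveContext as defined
there:

    val s = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> "(SELECT to_char(ID) AS ID, RANDOM_STRING, SMALL_VC FROM scratchpad.dummy)",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",
      "numPartitions"   -> "100",
      "user"            -> _username,
      "password"        -> _password)).load

    // How the key range was split: one partition (and, eventually, one JDBC
    // connection) per slice of the ID range.
    println(s.rdd.getNumPartitions)      // 100

    // How many of those slices can be read at the same time: roughly the total
    // number of executor task slots, which is what caps the number of Oracle
    // sessions visible at any instant.
    println(sc.defaultParallelism)

With numPartitions=100 but only a handful of task slots, seeing only a few
Oracle sessions at a time would be the expected behaviour rather than a limit
imposed by the JDBC source itself.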
> On 15 August 2016 at 09:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi.
>>
>> This is a very good question. I did some tests on this.
>>
>> If you are joining two tables, then you are creating a result set based
>> on some conditions. In that case what I normally do is specify an ID
>> column from either table and base my partitioning on that column. This
>> is pretty straightforward: bring back your ID column and base your lower
>> and upper limits on that ID value:
>>
>>   "partitionColumn" -> "ID",
>>   "lowerBound"      -> "1",
>>   "upperBound"      -> "100000000",
>>   "numPartitions"   -> "100",
>>
>> Also I have noticed that, regardless of the number of partitions you
>> specify in the JDBC call, the number of parallel connections will be
>> limited and the result set will be partitioned accordingly. For example,
>> with numPartitions=100 I see only 8 connections in Oracle coming from
>> Spark.
>>
>> scala> val s = HiveContext.read.format("jdbc").options(
>>      |   Map("url" -> _ORACLEserver,
>>      |   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>      |   "partitionColumn" -> "ID",
>>      |   "lowerBound" -> "1",
>>      |   "upperBound" -> "100000000",
>>      |   "numPartitions" -> "100",
>>      |   "user" -> _username,
>>      |   "password" -> _password)).load
>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 more fields]
>>
>> scala> s.toJavaRDD.partitions.size()
>> res1: Int = 100
>>
>> This also seems to set the number of partitions. I still think that the
>> emphasis has to be on getting the data out of the RDBMS as quickly as
>> possible. The partitioning does work: below, the login SCRATCHPAD has
>> multiple connections to Oracle and each does its range selection OK.
>>
>>           1 SCRATCHPAD  45   43048               1 SELECT "SMALL_VC","CLUSTERED","PADDING","RANDOM_STRING","ID","SCATTERED","RANDOMISED"
>>                                                     FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>>                                                           to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>>                                                           RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>>                                                     *WHERE ID >= 16000001 AND ID < 17000001*
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I would suggest you look at Sqoop as well. Essentially, you can provide
>>> a split-by/partition column, using which the data will be distributed
>>> among your stated number of mappers.
>>>
>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> I have a question below.
>>>>
>>>> I want to join two tables and return the result based on an input
>>>> value. In this case, how do we specify the lower bound and upper bound
>>>> values?
>>>>
>>>>   select t1.id, t1.name, t2.course, t2.qualification
>>>>   from t1, t2
>>>>   where t1.transactionid = *11111* and t1.id = t2.id
>>>>
>>>> *11111 => dynamic input value.*
>>>>
>>>> Regards,
>>>> Rajesh
>>>>
>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> If you have your RDBMS table partitioned, then you need to consider
>>>>> how much data you want to extract, in other words the result set
>>>>> returned by the JDBC call.
>>>>>
>>>>> If you want all the data, then the number of partitions specified in
>>>>> the JDBC call should be equal to the number of partitions in your
>>>>> RDBMS table.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
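For Rajesh's join with a dynamic transaction id, one possible pattern -- a
sketch only, with the t1/t2 table and column names taken from his query and
everything else (bounds, numPartitions, placeholders) purely illustrative --
is to push the filter and the join into the dbtable subquery, then derive the
lowerBound/upperBound for the ID partition column from the filtered key range:

    // The dynamic input value.
    val txnId = 11111

    // 1) A small, unpartitioned query to find the ID range of the filtered
    //    rows (assumes the filter matches at least one row).
    val bounds = HiveContext.read.format("jdbc").options(Map(
      "url"      -> _ORACLEserver,
      "dbtable"  -> s"(SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM t1 WHERE transactionid = $txnId)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

    // Oracle NUMBER columns typically arrive as decimals.
    val lower = bounds.getDecimal(0).longValue
    val upper = bounds.getDecimal(1).longValue

    // 2) The partitioned read of the join result, split on t1.id within that range.
    val joined = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> s"(SELECT t1.id, t1.name, t2.course, t2.qualification FROM t1, t2 WHERE t1.transactionid = $txnId AND t1.id = t2.id)",
      "partitionColumn" -> "id",
      "lowerBound"      -> lower.toString,
      "upperBound"      -> upper.toString,
      "numPartitions"   -> "10",
      "user"            -> _username,
      "password"        -> _password)).load

The extra round trip for the bounds is cheap compared with the main read, and
it keeps the partition splits inside the range that the dynamic filter
actually selects.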
>>>>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>
>>>>>> Thank you very much, sir.
>>>>>>
>>>>>> I forgot to mention that two of these Oracle tables are range
>>>>>> partitioned. In that case, what would be the optimum number of
>>>>>> partitions, if you can share?
>>>>>>
>>>>>> Warmest
>>>>>>
>>>>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> If you have primary keys on these tables, then you can parallelise
>>>>>> the process of reading the data.
>>>>>>
>>>>>> You have to be careful not to set the number of partitions too high.
>>>>>> There is certainly a balance between the number of partitions
>>>>>> supplied to JDBC and the load on the network and the source DB.
>>>>>>
>>>>>> Assuming that your underlying table has primary key ID, this will
>>>>>> create 20 parallel processes against the Oracle DB:
>>>>>>
>>>>>>   val d = HiveContext.read.format("jdbc").options(
>>>>>>     Map("url" -> _ORACLEserver,
>>>>>>     "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
>>>>>>     "partitionColumn" -> "ID",
>>>>>>     "lowerBound" -> "1",
>>>>>>     "upperBound" -> "maxID",
>>>>>>     "numPartitions" -> "20",
>>>>>>     "user" -> _username,
>>>>>>     "password" -> _password)).load
>>>>>>
>>>>>> assuming your upper bound on ID is maxID.
>>>>>>
>>>>>> This will open multiple connections to the RDBMS, each getting a
>>>>>> subset of the data that you want. You need to test it to ensure that
>>>>>> numPartitions is optimal and you don't overload any component.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> There are 4 tables ranging from 10 million to 100 million rows, but
>>>>>> they all have primary keys.
>>>>>>
>>>>>> The network is fine, but our Oracle is RAC and we can only connect to
>>>>>> a designated Oracle node (where we have a DQ account only).
>>>>>>
>>>>>> We have a limited time window of a few hours to get the required data
>>>>>> out.
>>>>>>
>>>>>> Thanks
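The "maxID" upper bound in the example above has to be supplied as an actual
number. A small sketch of one way to obtain it at run time (the <TABLE> and
column placeholders are the same ones used in that example and must be
substituted before running):

    // One cheap query to pick up the current maximum of the primary key.
    // Oracle NUMBER columns typically arrive as decimals.
    val maxRow = HiveContext.read.format("jdbc").options(Map(
      "url"      -> _ORACLEserver,
      "dbtable"  -> "(SELECT MAX(ID) AS MAX_ID FROM <TABLE>)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

    val maxID = maxRow.getDecimal(0).longValue

    // Then plug it into the partitioned read, as in the example above.
    val d = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> "(SELECT <COL1>, <COL2> FROM <TABLE>)",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> maxID.toString,
      "numPartitions"   -> "20",
      "user"            -> _username,
      "password"        -> _password)).load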
>>>>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> How big are your tables, and is there any issue with the network
>>>>>> between your Spark nodes and your Oracle DB that adds to the problem?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> Hi Gurus,
>>>>>>
>>>>>> I have a few large tables in an RDBMS (ours is Oracle). We want to
>>>>>> access these tables through Spark JDBC.
>>>>>>
>>>>>> What is the quickest way of getting the data into a Spark DataFrame,
>>>>>> say with multiple connections from Spark?
>>>>>>
>>>>>> Thanking you
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
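Since two of the Oracle tables are range partitioned, one further option worth
noting -- a sketch only; the schema/table name and the ID ranges below are
purely illustrative -- is the predicates form of the JDBC reader, where each
WHERE-clause fragment becomes one Spark partition (and one connection), so the
split can be lined up with the Oracle range partitions:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", _username)
    props.setProperty("password", _password)

    // One WHERE-clause fragment per Oracle range partition (illustrative bounds).
    val predicates = Array(
      "ID >=        1 AND ID < 25000001",
      "ID >= 25000001 AND ID < 50000001",
      "ID >= 50000001 AND ID < 75000001",
      "ID >= 75000001")

    val df = HiveContext.read.jdbc(_ORACLEserver, "<SCHEMA>.<TABLE>", predicates, props)
    // df ends up with predicates.length partitions, each read over its own
    // JDBC connection.

This form does not need a numeric partition column, and if the tables are
range partitioned on ID, each generated query can be pruned by Oracle to its
matching partition.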