Hi Mich,

Thank you.
Regards,
Rajesh

On Mon, Aug 15, 2016 at 6:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok Rajesh
>
> This is standalone. In that case it ought to be at least 4 connections, as one executor will use one worker.
>
> I am hesitant here, hence the "at least", as with standalone mode you may end up with more executors on each worker.
>
> But try it and see whether "numPartitions" -> "4" is good, or you can change this to something higher.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 15 August 2016 at 12:19, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for the detailed explanation. One more question:
>>
>> In my cluster I have one master and 4 workers. In this case, will 4 connections be opened to Oracle?
>>
>> Regards,
>> Rajesh
>>
>> On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> It happens that the number of parallel processes opened from Spark to the RDBMS is determined by the number of executors.
>>>
>>> I just tested this. With Yarn client using two executors I see two connections to the RDBMS:
>>>
>>> EXECUTIONS USERNAME   SID SERIAL# USERS_EXECUTING SQL_TEXT
>>> ---------- ---------- --- ------- --------------- --------
>>>          1 SCRATCHPAD 443   62565               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) WHERE ID >= 23000001 AND ID < 24000001
>>>          1 SCRATCHPAD 406   46793               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) WHERE ID >= 24000001 AND ID < 25000001
>>>
>>> So it sounds like (can someone else independently confirm this?) that regardless of what one specifies in "numPartitions", one ends up with one connection per Spark executor to the RDBMS.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
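For reference, a minimal sketch of reproducing the test above (assumptions: _ORACLEserver, _username and _password are the same placeholders used elsewhere in this thread, and scratchpad.dummy has a numeric primary key ID):

val df = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> "scratchpad.dummy",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",
      "numPartitions"   -> "4",   // one range query per worker, per the advice above
      "user"            -> _username,
      "password"        -> _password)).load

// 4 partitions means 4 range queries are generated, but how many run at
// once is capped by the number of free executor task slots.
println(df.rdd.partitions.length)

// Force a full scan so the JDBC connections actually open, and watch the
// session count on the Oracle side while this runs.
df.count()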
>>>
>>> On 15 August 2016 at 09:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a very good question. I did some tests on this.
>>>>
>>>> If you are joining two tables, then you are creating a result set based on some conditions. In this case what I normally do is to specify an ID column from either table and base my partitioning on that ID column. This is pretty straightforward: bring back your ID column and base your lower and upper bounds on that ID value:
>>>>
>>>> "partitionColumn" -> "ID",
>>>> "lowerBound" -> "1",
>>>> "upperBound" -> "100000000",
>>>> "numPartitions" -> "100",
>>>>
>>>> Also I have noticed that regardless of the number of partitions you specify, the number of parallel connections on the RDBMS side will be limited, and the result set will be partitioned accordingly. For example, with numPartitions=100 I see only 8 connections in Oracle coming from the Spark connection.
>>>>
>>>> scala> val s = HiveContext.read.format("jdbc").options(
>>>>      | Map("url" -> _ORACLEserver,
>>>>      | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>>      | "partitionColumn" -> "ID",
>>>>      | "lowerBound" -> "1",
>>>>      | "upperBound" -> "100000000",
>>>>      | "numPartitions" -> "100",
>>>>      | "user" -> _username,
>>>>      | "password" -> _password)).load
>>>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 more fields]
>>>>
>>>> scala> s.toJavaRDD.partitions.size()
>>>> res1: Int = 100
>>>>
>>>> This also seems to set the number of partitions. I still think that the emphasis has to be on getting data out of the RDBMS as quickly as possible. The partitioning does work: below, the login scratchpad has multiple connections to Oracle and does the range selection OK.
>>>>
>>>>          1 SCRATCHPAD  45   43048               1 SELECT "SMALL_VC","CLUSTERED","PADDING","RANDOM_STRING","ID","SCATTERED","RANDOMISED" FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy) *WHERE ID >= 16000001 AND ID < 17000001*
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
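One caveat on the hardcoded "lowerBound"/"upperBound" above: rows with IDs outside the stated range still land in the first and last partitions, but the split is only even when the bounds match the data. A hedged sketch of deriving the bounds from the table itself before the partitioned read (same placeholders as in the thread; ID is assumed to be an Oracle NUMBER, which Spark reads back as a decimal):

// Single-row query to fetch the real ID range from Oracle.
val bounds = HiveContext.read.format("jdbc").options(
  Map("url"      -> _ORACLEserver,
      "dbtable"  -> "(SELECT MIN(ID) AS LO, MAX(ID) AS HI FROM scratchpad.dummy)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

val lo = bounds.getDecimal(0).longValue   // MIN(ID)
val hi = bounds.getDecimal(1).longValue   // MAX(ID)

val s = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> "scratchpad.dummy",
      "partitionColumn" -> "ID",
      "lowerBound"      -> lo.toString,
      "upperBound"      -> hi.toString,
      "numPartitions"   -> "100",
      "user"            -> _username,
      "password"        -> _password)).load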
>>>>
>>>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I would suggest you look at Sqoop as well. Essentially, you can provide a split-by column, using which the data will be distributed among your stated number of mappers.
>>>>>
>>>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> I have a question below.
>>>>>>
>>>>>> I want to join two tables and return the result based on an input value. In this case, how do we specify the lower bound and upper bound values?
>>>>>>
>>>>>> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where t1.transactionid=*11111* and t1.id = t2.id
>>>>>>
>>>>>> *11111 => dynamic input value.*
>>>>>>
>>>>>> Regards,
>>>>>> Rajesh
>>>>>>
>>>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> If you have your RDBMS table partitioned, then you need to consider how much data you want to extract, in other words the result set returned by the JDBC call.
>>>>>>>
>>>>>>> If you want all the data, then the number of partitions specified in the JDBC call should be equal to the number of partitions in your RDBMS table.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>>>
>>>>>>>> Thank you very much sir.
>>>>>>>>
>>>>>>>> I forgot to mention that two of these Oracle tables are range partitioned. In that case what would be the optimum number of partitions, if you can share?
>>>>>>>>
>>>>>>>> Warmest
>>>>>>>>
>>>>>>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> If you have primary keys on these tables then you can parallelise the process of reading the data.
>>>>>>>>
>>>>>>>> You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB.
>>>>>>>>
>>>>>>>> Assuming that your underlying table has primary key ID, this will create 20 parallel processes to the Oracle DB:
>>>>>>>>
>>>>>>>> val d = HiveContext.read.format("jdbc").options(
>>>>>>>> Map("url" -> _ORACLEserver,
>>>>>>>> "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
>>>>>>>> "partitionColumn" -> "ID",
>>>>>>>> "lowerBound" -> "1",
>>>>>>>> "upperBound" -> "maxID",
>>>>>>>> "numPartitions" -> "20",
>>>>>>>> "user" -> _username,
>>>>>>>> "password" -> _password)).load
>>>>>>>>
>>>>>>>> assuming your upper bound on ID is maxID.
>>>>>>>>
>>>>>>>> This will open multiple connections to the RDBMS, each getting a subset of the data that you want.
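The same template can also cover the join-with-a-dynamic-filter question further up the thread: push the whole join down to Oracle as an inline view and partition on the join key. A hedged sketch, using the t1/t2 tables and columns from Rajesh's example (the bounds here are placeholders; in practice they would come from the real range of t1.id, for instance as sketched earlier):

val transactionId = 11111   // the dynamic input value
val pushedDownJoin =
  s"""(SELECT t1.id, t1.name, t2.course, t2.qualification
     |   FROM t1, t2
     |  WHERE t1.transactionid = $transactionId
     |    AND t1.id = t2.id)""".stripMargin

val joined = HiveContext.read.format("jdbc").options(
  Map("url"             -> _ORACLEserver,
      "dbtable"         -> pushedDownJoin,   // Spark appends its id-range predicates to this inline view
      "partitionColumn" -> "id",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",      // placeholder; use the real MAX(t1.id)
      "numPartitions"   -> "20",
      "user"            -> _username,
      "password"        -> _password)).load

This keeps both the filter and the join on the Oracle side, so each of the 20 connections only pulls its slice of the already-joined result set.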
>>>>>>>>
>>>>>>>> You need to test it to ensure that you get the optimum numPartitions and you don't overload any component.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> There are 4 tables ranging from 10 million to 100 million rows, but they all have primary keys.
>>>>>>>>
>>>>>>>> The network is fine, but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only).
>>>>>>>>
>>>>>>>> We have a limited time window of a few hours to get the required data out.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> How big are your tables, and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the problem?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> Hi Gurus,
>>>>>>>>
>>>>>>>> I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC.
>>>>>>>>
>>>>>>>> What is the quickest way of getting data into a Spark DataFrame, say with multiple connections from Spark?
>>>>>>>>
>>>>>>>> thanking you
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha