Thanks, Mike, for the clarification. I think there is another option to get data out of an RDBMS: some form of SELECT of all columns, tab-separated or otherwise delimited, written to a flat file or files. Then scp that file from the RDBMS host to a private directory on the HDFS system and push it into HDFS. That bypasses the JDBC reliability problem, and I guess in this case one has more control.
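A minimal sketch of the Spark side of that idea, assuming the tab-separated extract has already been pushed to HDFS (the scp and hdfs put steps happen outside Spark). The path, columns and table names here are hypothetical, and sc is the SparkContext you get for free in spark-shell:

    // Load the tab-separated extract that was pushed to HDFS
    val raw = sc.textFile("hdfs:///staging/private/emp_extract.tsv")  // hypothetical path

    // Split on tabs; the field order must match the SELECT used for the export
    case class Emp(id: Long, name: String, dept: String)
    val emps = raw.map(_.split("\t", -1))
                  .map(f => Emp(f(0).toLong, f(1), f(2)))

    // Expose it to Spark SQL for further processing (Spark 1.x API)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    emps.toDF().registerTempTable("emp_staging")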
I do concur that there are security issues with this approach. For example, that local file system may have to be encrypted etc., which will make it tedious. I believe Jörn mentioned this somewhere.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 22 June 2016 at 15:59, Michael Segel <msegel_had...@hotmail.com> wrote:

> Hi,
>
> Just to clear a few things up…
>
> First, I know it's hard to describe some problems because they deal with client-confidential information.
> (Also some basic 'dead hooker' thought problems to work through before facing them at a client.)
> The questions I pose here are very general and deal with some basic design issues/considerations.
>
> W.r.t. JDBC / Beeline:
>
> There are many use cases where you don't want to migrate some or all data to HDFS. This is why tools like Apache Drill exist. At the same time, there are different cluster design patterns. One such pattern is a storage/compute model, where you have multiple clusters, some acting as compute clusters which pull data from storage clusters. An example would be spinning up an EMR cluster and running an M/R job where you read from S3 and output to S3. Or, within your enterprise, you have your Data Lake (Data Sewer) and then a compute cluster for analytics.
>
> In addition, you have some very nasty design issues to deal with, like security. Yes, that's the very dirty word nobody wants to deal with, and in most of these tools security is an afterthought.
> So you may not have direct access to the cluster or an edge node. You may only have access to a single port on a single machine through the firewall, which is running beeline, so you can pull data from your storage cluster.
>
> It's very possible that you have to pull data from the cluster through beeline to store the data within a Spark job running on the cluster. (Oh the irony! ;-)
>
> It's important to understand that, due to design constraints, options like sqoop or running a query directly against Hive may not be possible, and these use cases do exist when dealing with PII information.
>
> It's also important to realize that you may have to pull data from multiple data sources for some not-so-obvious but important reasons…
> So your Spark app has to be able to generate an RDD from data in an RDBMS (Oracle, DB2, etc.), persist it for local lookups, then pull data from the cluster, and then output it either back to the cluster, another cluster, or someplace else. All of these design issues occur when you're dealing with large enterprises.
>
> But I digress…
>
> The reason I started this thread was to get a better handle on where things run when we make decisions to run in either client or cluster mode on YARN.
> Issues like starting/stopping long-running apps are a problem in production, as is tying up cluster resources while your application remains dormant, waiting for the next batch of events to come through.
> I'm trying to understand the gotchas of each design consideration so that we can weigh the pros and cons of a decision when designing a solution….
>
> Spark is relatively new, and it's a bit disruptive because it forces you to think in terms of storage/compute models or Jim Scott's holistic 'Zeta Architecture'. In addition, it forces you to rethink your cluster hardware design too.
>
> HTH to clarify the questions, and of course thanks for the replies.
>
> -Mike
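A minimal sketch of the multi-source pattern Mike describes (a lookup table built from an RDBMS over JDBC and persisted locally in Spark), assuming Spark 1.x with the Oracle JDBC driver on the classpath; the connection details and table names are hypothetical:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.storage.StorageLevel

    val sqlContext = new SQLContext(sc)   // sc: existing SparkContext

    // Pull a reference table from Oracle over JDBC into a DataFrame
    val lookups = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // hypothetical host/service
      .option("dbtable", "ref.country_codes")                 // hypothetical table
      .option("user", "app_reader")
      .option("password", "********")
      .load()

    // Persist so repeated joins/lookups don't keep hitting the database
    lookups.persist(StorageLevel.MEMORY_AND_DISK)
    lookups.registerTempTable("country_codes")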
> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> If you are going to get data out of an RDBMS like Oracle, then the correct procedure is:
>
> 1. Use Hive on the Spark execution engine. That improves Hive performance.
> 2. You can use JDBC through Spark itself. No issue there. It will use the JDBC support provided by HiveContext.
> 3. JDBC is fine. Every vendor has a utility to migrate a full database from one to another using JDBC. For example, SAP relies on JDBC to migrate a whole Oracle schema to SAP ASE.
> 4. I have imported an Oracle table of 1 billion rows through Spark into a Hive ORC table. It works fine. Actually I use Spark to do the job for these types of imports. Register a tempTable from the DF and use it to put data into the Hive table. You can create the Hive table explicitly in Spark and do an INSERT/SELECT into it rather than save etc. (a sketch follows below).
> 5. You can access Hive tables through HiveContext. Beeline is a client tool that connects to the Hive Thrift server. I don't think it comes into the equation here.
> 6. Finally, one experiment is worth multiples of these speculations. Try it for yourself and find out.
> 7. If you want to use JDBC for an RDBMS table then you will need to download the relevant JAR file. For example, for Oracle it is ojdbc6.jar, etc.
>
> Like anything else, your mileage varies and you need to experiment with it. Otherwise these are all opinions.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 June 2016 at 06:46, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I would import the data via sqoop and put it on HDFS. It has some mechanisms to handle the lack of reliability of JDBC.
>>
>> Then you can process the data via Spark. You could also use the JDBC RDD, but I do not recommend using it, because you do not want to pull the data out of the database every time you execute your application. Furthermore, you have to handle connection interruptions and the multiple serialization/deserialization efforts, and if one executor crashes you have to re-pull some or all of the data from the database, etc.
>>
>> Within the cluster it does not make sense to me to pull data via JDBC from Hive. All the benefits, such as data locality, reliability etc., would be gone.
>>
>> Hive supports different execution engines (Tez, Spark), formats (ORC, Parquet) and further optimizations to make the analysis fast. It always depends on your use case.
>>
>> On 22 Jun 2016, at 05:47, Michael Segel <msegel_had...@hotmail.com> wrote:
>>
>> Sorry, I think you misunderstood.
>> Spark can read from JDBC sources, so saying that using beeline as a way to access data is "not a Spark application" isn't really true. Would you say the same if you were pulling data into Spark from Oracle or DB2?
>> There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thrift/REST service. Think also of compute/storage cluster implementations.
>>
>> W.r.t. #2, that's not exactly what I meant by exposing the data… and there are limitations to the Thrift service…
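A minimal sketch of the import path Mich outlines in step 4 above (JDBC read, temp table, explicit Hive ORC table, then INSERT/SELECT), assuming Spark 1.x built with Hive support; the connection details and table names are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

    // Read the source table from Oracle over JDBC
    val oracleDF = hiveContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // hypothetical
      .option("dbtable", "scott.big_table")                   // hypothetical
      .option("user", "app_reader")
      .option("password", "********")
      .load()

    // Register a temp table from the DataFrame
    oracleDF.registerTempTable("tmp_big_table")

    // Create the Hive ORC table explicitly (WHERE 1 = 0 clones the schema
    // without copying rows), then do an INSERT/SELECT into it
    hiveContext.sql(
      """CREATE TABLE analytics.big_table_orc
        |STORED AS ORC
        |AS SELECT * FROM tmp_big_table WHERE 1 = 0""".stripMargin)
    hiveContext.sql("INSERT INTO TABLE analytics.big_table_orc SELECT * FROM tmp_big_table")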
>> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>> 1. Yes, in the sense that you control the number of executors from the Spark application config.
>>
>> 2. Any I/O will be done from the executors (never on the driver, unless you explicitly call collect()). For example, a connection to a DB happens once for each worker (and is used by its local executors). Also, if you run a reduceByKey job and write to HDFS, you will find that a bunch of files were written from various executors (see the sketch after this message).
>>
>> "What happens when you want to expose the data to the world?" - Spark Thrift Server (STS), which is a long-running Spark application (i.e. a Spark context) that can serve data from RDDs.
>>
>> "Suppose I have a data source… like a couple of Hive tables and I access the tables via beeline. (JDBC)" - This is NOT a Spark application, and there is no RDD created. Beeline is just a JDBC client tool. You use beeline to connect to HS2 or STS.
>>
>> "In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built." - This is never true. When you connect to Hive from Spark, Spark actually reads the Hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, which is what makes Spark faster than Hive.
>>
>> HTH....
>>
>> Ayan
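A minimal sketch of Ayan's point 2, assuming spark-shell (sc predefined) and hypothetical HDFS paths; each partition is written by whichever executor holds it, so the output directory ends up with one part-file per partition:

    // Word-count style job: the shuffle and the writes happen on the executors
    val counts = sc.textFile("hdfs:///data/events.tsv")    // hypothetical input
      .map(line => (line.split("\t")(0), 1L))              // key on the first column
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///out/event_counts")      // hypothetical output
    // Result: hdfs:///out/event_counts/part-00000, part-00001, ...
    // each written by an executor, not by the driver.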
>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
>>
>>> Ok, it's at the end of the day and I'm trying to make sure I understand the locale of where things are running.
>>>
>>> I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join the RDDs against some other lookup tables.
>>>
>>> YARN has two modes… client and cluster.
>>>
>>> I get it that in cluster mode… everything is running on the cluster. But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>>>
>>> When I run a Spark SQL command that generates a new RDD, does the result set live on the cluster with the workers and get referenced by the driver, or does the result set get migrated to the driver running on the client? (I'm pretty sure I know the answer, but it's never safe to assume anything…)
>>>
>>> The follow-up questions:
>>>
>>> 1) If I kill the app running the driver on the edge node… will that cause YARN to free up the cluster's resources? (In cluster mode… that doesn't happen.) What happens, and how quickly?
>>>
>>> 1a) If using client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor, any portion of the RDDs associated with that executor is gone; however, the Spark context is still alive on the edge node? [Again assuming that the Spark context lives with the driver.])
>>>
>>> 2) Any I/O between my Spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located? (This may seem kinda silly, but what happens when you want to expose the result set to the world…?)
>>>
>>> Now for something slightly different…
>>>
>>> Suppose I have a data source… like a couple of Hive tables, and I access the tables via beeline. (JDBC) In this case… Hive generates a map/reduce job and then would stream the result set back to the client node, where the RDD result set would be built. I realize that I could run Hive on top of Spark, but that's a separate issue. Here the RDD will reside on the client only. (That is, I could in theory run this as a single Spark instance.) If I were to run this on the cluster… then the result set would stream through the beeline gateway and would reside back on the cluster, sitting in RDDs within each executor?
>>>
>>> I realize that these are silly questions, but I need to make sure that I know the flow of the data and where it ultimately resides. There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>>>
>>> TIA,
>>>
>>> -Mike
>>
>> --
>> Best Regards,
>> Ayan Guha