Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did, meaning that internally a TempTable and a TempView are the same. Thanks, buddy!
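For reference, a minimal PySpark sketch of that equivalence (assuming the Spark 2.x API, where registerTempTable() is a deprecated alias of createOrReplaceTempView(); the DataFrame, data, and table name here are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("temp-view-demo").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    df.registerTempTable("my_table")        # deprecated alias of createOrReplaceTempView()
    spark.sql("SELECT * FROM my_table").show()

    spark.catalog.dropTempView("my_table")  # drops the same session catalog entry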

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
The question is: while registering with registerTempTable() and while dropping with dropTempView(), would it hit the same TempTable internally, or would it search for a registered view? Not sure. Any idea?

Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all, I'm trying to use dropTempTable() after the respective temporary table's use is over (to free up memory for the next calculations). The newer SparkSession doesn't need a sqlContext, so it is confusing me as to how to use the function. 1) Tried the same DF which I used to register a temp table to

Re: Quick but probably silly question...

2017-01-17 Thread Jörn Franke
You run compaction, i.e. save the modified/deleted records in a dedicated file. Every now and then you compare the original and delta files and create a new version. When querying before compaction, you need to check both the original and the delta file. I don't think ORC needs Tez for it, but it
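A rough PySpark sketch of that merge-on-read scheme; the paths, the "id" key column, the "op" delete flag, and the assumption that the delta carries the base schema plus "op" are all placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    base = spark.read.parquet("/data/base")    # original immutable file
    delta = spark.read.parquet("/data/delta")  # modified/deleted records, flagged in "op"

    # Querying before compaction: delta rows supersede base rows with the same key.
    current = (base.join(delta.select("id"), "id", "left_anti")     # drop superseded rows
                   .union(delta.where("op != 'DELETE'").drop("op")))

    # Compaction: every now and then, persist the merged view as the new base version.
    current.write.parquet("/data/base_v2")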

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, While Parquet files are immutable and the data sets are immutable, how does Spark SQL handle updates or deletes? I mean, if I read in a file using SQL into an RDD, mutate it, e.g. delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates
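To make the duplicate concern concrete, a hedged illustration (the paths and "id" column are placeholders): since the Parquet files are immutable, the mutated copy lands next to the original, and a read that picks up both returns their union:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/tbl")    # original file(s)
    mutated = df.where("id != 42")          # "delete" one row
    mutated.write.parquet("/data/tbl_v2")   # persisted as a second copy on disk

    both = spark.read.parquet("/data/tbl", "/data/tbl_v2")
    both.count()  # original + mutated rows: every surviving row now appears twice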

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
The only documentation on this… in terms of direction… (that I could find): if your client is not close to the cluster (e.g. your PC), then you definitely want to go cluster mode to improve performance. If your client is close to the cluster (e.g. an edge node), then you could go either client or

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Marcelo Vanzin
On Wed, Jun 22, 2016 at 1:32 PM, Mich Talebzadeh wrote: > Does it also depend on the number of Spark nodes involved in choosing which way to go? Not really. -- Marcelo

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
Thanks Marcelo. Sounds like cluster mode is more resilient than client mode. Does it also depend on the number of Spark nodes involved in choosing which way to go? Cheers, Dr Mich Talebzadeh

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Marcelo Vanzin
Trying to keep the answer short and simple... On Wed, Jun 22, 2016 at 1:19 PM, Michael Segel wrote: > But this gets to the question… what are the real differences between client and cluster modes? > What are the pros/cons and use cases where one has advantages over

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
LOL… I hate YARN, but unfortunately I don't get to make the call on which tools we're going to use; I just get paid to make stuff work on the tools provided. ;-) Testing is somewhat problematic. You have to really test at some large enough fraction of scale. Fortunately for this issue (YARN

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
This is exactly the sort of topic that distinguishes lab work from enterprise practice :) The question is YARN client versus YARN cluster mode. I am not sure how much of an impact it is going to make in real life if I choose one over the other. These days I tell developers that it is perfectly valid

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
JDBC reliability problem? Ok… a bit more explanation… Usually when you have to go back to a legacy system, it's because the data set is metadata and is relatively small. It's not the sort of data that gets ingested into a data lake unless you're also ingesting the metadata and are

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
Thanks Mike for the clarification. I think there is another option to get data out of an RDBMS: run some form of SELECT of all columns, tab-separated or otherwise, and put the result in a flat file or files. Then scp that file from the RDBMS host to a private directory on the HDFS system and push it into HDFS. That

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
Hi, Just to clear a few things up… First, I know it's hard to describe some problems because they deal with client-confidential information. (Also some basic 'dead hooker' thought problems to work through before facing them at a client.) The questions I pose here are very general and deal

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
If you are going to get data out of an RDBMS like Oracle, then the correct procedure is: 1. Use the Hive on Spark execution engine; that improves Hive performance. 2. You can use JDBC through Spark itself; no issue there. It will use the JDBC provided by HiveContext. 3. JDBC is fine. Every
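For point 2, a sketch of one common way to do it with Spark's built-in DataFrame JDBC source; the URL, table, credentials, and partition bounds are placeholders to adapt, and the Oracle JDBC driver must be on the classpath (e.g. via --jars):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "SCOTT.EMP")
          .option("user", "scott")
          .option("password", "tiger")
          .option("fetchsize", "10000")        # rows per round trip
          .option("partitionColumn", "EMPNO")  # split the read across executors
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load())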

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Jörn Franke
I would import the data via Sqoop and put it on HDFS. Sqoop has some mechanisms to handle the lack of reliability of JDBC. Then you can process the data via Spark. You could also use a JDBC RDD, but I do not recommend it, because you do not want to pull data out of the database all the time when

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
I may be wrong here, but beeline is basically a client library, so you "connect" to STS and/or HS2 using beeline. Spark connecting over JDBC is a different discussion and in no way related to beeline. When you read data from a DB (Oracle, DB2, etc.), you do not use beeline but a JDBC connection to

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
Sorry, I think you misunderstood. Spark can read from JDBC sources, so saying that using beeline as a way to access data is not a Spark application isn't really true. Would you say the same if you were pulling data into Spark from Oracle or DB2? There are a couple of different design patterns and

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
1. Yes, in the sense that you control the number of executors from the Spark application config. 2. Any IO will be done from the executors (never on the driver, unless you explicitly call collect()). For example, a connection to a DB happens once for each worker (and is used by the local executors). Also, if you run a
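A small runnable sketch of the locality point in (2), using sqlite3 purely as a stand-in for a real database connection; the point is that the connection is opened inside foreachPartition, on the executor that owns the partition, never on the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    def save_partition(rows):
        # This function runs on an executor; the connection below is opened
        # there, not on the driver.
        import sqlite3  # imported on the executor; stand-in for a real DB driver
        conn = sqlite3.connect("/tmp/partition.db")  # one connection per partition
        conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(r.id,) for r in rows])
        conn.commit()
        conn.close()

    df.foreachPartition(save_partition)  # runs on executors; nothing is collect()ed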

Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
Ok, it's at the end of the day and I'm trying to make sure I understand the locale of where things are running. I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join the RDDs and some other lookup tables. YARN has two modes… client and

Re: Silly Question on my part...

2016-05-17 Thread Gene Pang
Hi Michael, Yes, you can use Alluxio to share Spark RDDs. Here is a blog post about getting started with Spark and Alluxio ( http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/), and some documentation ( http://alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html).
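A minimal sketch of the sharing pattern from those docs; the Alluxio master host, port, and path are placeholders, and the Alluxio client jar must be on the Spark classpath:

    from pyspark import SparkContext

    sc = SparkContext(appName="alluxio-share")
    rdd = sc.parallelize(range(100))
    rdd.saveAsTextFile("alluxio://alluxio-master:19998/shared/numbers")

    # A second, independent Spark application can then read the same data back:
    shared = sc.textFile("alluxio://alluxio-master:19998/shared/numbers")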

Re: Silly Question on my part...

2016-05-17 Thread Dood
On 5/16/2016 12:12 PM, Michael Segel wrote: For one use case.. we were considering using the thrift server as a way to allow multiple clients access shared RDDs. Within the Thrift Context, we create an RDD and expose it as a hive table. The question is… where does the RDD exist. On the

Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
Thanks for the response. That's what I thought, but I didn't want to assume anything. (You know what happens when you ass u me … :-) Not sure about Tachyon, though. It's a thought, but I'm very conservative when it comes to design choices. > On May 16, 2016, at 5:21 PM, John Trengrove

Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you are wanting to share RDDs it might be a good idea to check out Tachyon / Alluxio. For the Thrift server, I believe the datasets are located in your Spark cluster as RDDs and you just communicate with it via the Thrift JDBC Distributed Query Engine connector. 2016-05-17 5:12 GMT+10:00
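A simplified sketch of the setup being discussed, assuming the Spark 2.x API and placeholder names, with the Thrift server started so it shares this session's context; the cached blocks are materialized on the cluster's executors, not on the Thrift service node, and JDBC clients such as beeline query the table by name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.read.parquet("/data/shared")
    df.createOrReplaceTempView("shared_data")
    spark.sql("CACHE TABLE shared_data")  # materialized across the executors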

Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case… we were considering using the Thrift server as a way to allow multiple clients to access shared RDDs. Within the Thrift context, we create an RDD and expose it as a Hive table. The question is… where does the RDD exist? On the Thrift service node itself, or is that just a

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
On 12 April 2016 at 23:42, Michael Segel <msegel_had...@hotmai

Re: Silly question...

2016-04-13 Thread Michael Segel
On 12 April 2016 at 23:42, Michael Segel <msegel_had..

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
On 12 April 2016 at 23:42, Michael Segel <msegel_had...@hotmail.com> wrote: > Hi, > This is probably a silly question on my part… > I'm looking at the latest (spark 1.6.1 relea

Silly question...

2016-04-12 Thread Michael Segel
Hi, This is probably a silly question on my part… I'm looking at the latest release (Spark 1.6.1) and would like to do a build with Hive and JDBC support. From the documentation, I see two things that make me scratch my head. 1) Scala 2.11: "Spark does not yet support its JDBC comp

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
have sent this to user… However… it looks like the docs page may need some editing? Thx -Mike Begin forwarded message: From: Michael Segel msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Dean Wampler
page may need some editing? Thx -Mike Begin forwarded message: From: Michael Segel msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26:40 PM MST To: d...@spark.apache.org Hi, I'm looking at the online docs for building

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Ted Yu
msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26:40 PM MST To: d...@spark.apache.org Hi, I'm looking at the online docs for building Spark 1.4.1 … http://spark.apache.org/docs/latest/building-spark.html I was interested

Fwd: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Sorry, should have sent this to user… However… it looks like the docs page may need some editing? Thx -Mike Begin forwarded message: From: Michael Segel msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26:40 PM MST To: d