Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, I think you’re asking the right question, however you’re making an assumption that he’s on the cloud and he never talked about the size of the file. It could be that he’s got a lot of small-ish data sets. 1GB is kinda small in relative terms. Again YMMV. Personally if you’re going

Re: schema change for structured spark streaming using jsonl files

2018-04-25 Thread Michael Segel
Hi, This is going to sound complicated. Taken as an individual JSON document, because its a self contained schema doc, its structured. However there isn’t a persisting schema that has to be consistent across multiple documents. So you can consider it semi structured. If you’re parsing the

Re: Reading Hive RCFiles?

2018-01-29 Thread Michael Segel
s similarly with PairRDD and RCFileOutputFormat On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel <msegel_had...@hotmail.com> wrote: No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel

Re: Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > > Hi, > > I’m trying to find out if there’s a simple way for Spark to be able to read > an RCFile. > > I kno

Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
Hi, I’m trying to find out if there’s a simple way for Spark to be able to read an RCFile. I know I can create a table in Hive, then drop the files in to that directory and use a sql context to read the file from Hive, however I wanted to read the file directly. Not a lot of details to go

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi, Just came across this while looking at the docs on how to use Spark’s Kmeans clustering. Note: This appears to be true in both 2.1 and 2.2 documentation. The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means Here the example contains the following line:

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Michael Segel
Why couldn’t you use the Spark Thrift Server? On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca wrote: answer for Gourav Sengupta I want to use the same spark application because I want it to work as a FIFO scheduler. My problem is that I have

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, While the parquet file is immutable and the data sets are immutable, how does sparkSQL handle updates or deletes? I mean if I read in a file using SQL in to an RDD, mutate it, eg delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Michael Segel
Hi, Lexicographically speaking, Min/Max should work because String(s) support a comparator operator. So anything which supports a comparison test (<, >, <=, >=, == …) can also support min and max functions as well. I guess the question is if Spark does support this, and if not, why? Yes,

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Michael Segel
Oozie, a product only a mad Russian would love. ;-) Just say no to hive. Go from Flat to Parquet. (This sounds easy, but there’s some work that has to occur…) Sorry for being cryptic, Mich’s question is pretty much generic for anyone building a data lake so it ends up overlapping with some work

Re: Save a spark RDD to disk

2016-11-09 Thread Michael Segel
Can you increase the number of partitions and also increase the number of executors? (This should improve the parallelization, but you may become disk I/O bound) On Nov 8, 2016, at 4:08 PM, Elf Of Lothlorein wrote: Hi, I am trying to save an RDD

Re: sanboxing spark executors

2016-11-08 Thread Michael Segel
Not that easy of a problem to solve… Can you impersonate the user who provided the code? I mean if Joe provides the lambda function, then it runs as Joe so it has Joe’s permissions. Steve is right, you’d have to get down to your cluster’s security and authenticate the user before accepting

Re: Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Michael Segel
Spark inherits its security from the underlying mechanisms in either YARN or MESOS (whichever environment you are launching your cluster/jobs) That said… there is limited support from Ranger. There are three parts to this… 1) Ranger being called when the job is launched… 2) Ranger being

How sensitive is Spark to Swap?

2016-11-07 Thread Michael Segel
This may seem like a silly question, but it really isn’t. In terms of Map/Reduce, its possible to over subscribe the cluster because there is a lack of sensitivity if the servers swap memory to disk. In terms of HBase, which is very sensitive, swap doesn’t just kill performance, but also can

Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
re.sub(r"\s+", "", x, flags=re.UNICODE)) // convert the rdd to dataframe. If you have your own schema, this is where you should add it. df = spark.read.json(js) Assaf. From: Michael Segel [mailto:msegel_had...@hotmail.com] Sent: Wednesday, November 02, 2016 9:39 PM To: Daniel

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output. So then the entire record which defines the schema has to be a single line? Really? On Nov 2, 2016, at 1:50 PM, Michael Segel <msegel_had...@hotmail.com> wrote: This ma

Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
This may be a silly mistake on my part… Doing an example using Chicago’s Crime data.. (There’s a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime data.csv for ingestion into a data frame, then I want to output to a Parquet file. (Pretty
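The thread’s resolution (see the replies above) is that Spark’s default JSON source expects one record per line. A minimal pure-Python sketch of flattening a pretty-printed JSON document into a single JSON Lines record before handing it to Spark — the field names here are invented for illustration, not taken from the actual crime data:

```python
import json

# A pretty-printed JSON record spans several lines, so a
# line-oriented reader sees broken fragments instead of a record.
# (Field names are invented for illustration.)
pretty = """{
  "case_number": "HZ250496",
  "primary_type": "THEFT",
  "arrest": false
}"""

# Collapse it to a single line (JSON Lines / NDJSON), one record
# per line, which is what Spark's default JSON reader expects.
record = json.loads(pretty)
one_line = json.dumps(record, separators=(",", ":"))

assert "\n" not in one_line
```

Newer Spark versions (2.2+) also accept pretty-printed input directly via the `multiLine` read option, which sidesteps the flattening step entirely.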

Re: spark with kerberos

2016-10-18 Thread Michael Segel
Loughran <ste...@hortonworks.com> wrote: On 17 Oct 2016, at 22:11, Michael Segel <michael_se...@hotmail.com> wrote: @Steve you are going to have to explain what you mean by ‘turn Kerberos on’. Taken one w

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
c. they can fit indeed to some use cases, to some others less. On 17 Oct 2016, at 23:02, Michael Segel <msegel_had...@hotmail.com> wrote: You really don’t want to do OLTP on a distributed NoSQL engine. Remember Big Data isn’t relational, it’s more of a h

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Phoenix? I haven't used it myself so curious about pros and cons about the use of it. On 18 Oct 2016 03:17, "Michael Segel" <msegel_had...@hotmail.com> wrote: Guys, Sorry for jumping in late to the game… If memory serves (which may no

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
eers, Jayesh From: vincent gromakowski <vincent.gromakow...@gmail.com> Date: Monday, October 17, 2016 at 1:53 PM To: Benjamin Kim <bbuil...@gmail.com> Cc: Michael Segel <msegel_had...@hotmail.com>

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
ling cache to periodically update the spark in memory tables with persistent store... It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job. 2016-10-17 18:17

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
mory tables with persistent store... It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job. 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch, Short answer… no, it doesn’t scale. Longer answer… You are using a UUID as the row key? Why? (My guess is that you want to avoid hot spotting) So you’re going to have to pull in all of the data… meaning a full table scan… and then perform a sort order transformation, dropping the

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi, Apologies if I’ve asked this question before but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake… Posting this to both user and dev because I think the question / topic jumps in to both camps.

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Guys, Sorry for jumping in late to the game… If memory serves (which may not be a good thing…) : You can use HiveServer2 as a connection point to HBase. While this doesn’t perform well, its probably the cleanest solution. I’m not keen on Phoenix… wouldn’t recommend it…. The issue is that

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
equirements this is kind of pointless. > > On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel > <msegel_had...@hotmail.com> wrote: >> Spark standalone is not Yarn… or secure for that matter… ;-) >> >>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org>

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
re can get out of sync, leading to lost / > duplicate data. > > Regarding long running spark jobs, I have streaming jobs in the > standalone manager that have been running for 6 months or more. > > On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel > <msegel_had...@hotmail.com>

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treadting NaN fields in Spark)

2016-09-29 Thread Michael Segel
Hi, Hate to be a pain… but could someone remove this email address (see below) from the spark mailing list(s) It seems that ‘Elvis’ has left the building and forgot to change his mail subscriptions… Begin forwarded message: From: Yahoo! No Reply

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources. (How well does

Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel
http://talebzadehmich.wordpress.com/ Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable

Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi, Just a few thoughts so take it for what its worth… Databases have static schemas and will reject a row’s column on insert. In your case… you have one data set where you have a column which is supposed to be a number but you have it as a string. You want to convert this to a double in your
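The conversion described above — a column that is supposed to be numeric but arrives as strings, cast to double with malformed values becoming NaN — can be sketched in plain Python (no Spark required; the helper name `to_double` is mine, for illustration):

```python
import math

def to_double(s):
    # Permissive cast: malformed or missing values become NaN,
    # mirroring what a tolerant numeric cast does to a string
    # column that is supposed to hold doubles.
    try:
        return float(s)
    except (TypeError, ValueError):
        return float("nan")

values = ["3.14", "not-a-number", None, "42"]
casted = [to_double(v) for v in values]

# NaN is never equal to itself, so detect it with math.isnan.
bad = [v for v in casted if math.isnan(v)]
```

The same idea carries over to a dataframe cast: decide up front whether malformed rows should become NaN, become null, or reject the row, because downstream aggregates treat each differently.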

Re: Spark Hive Rejection

2016-09-29 Thread Michael Segel
Correct me if I’m wrong, but isn’t Hive schema on read and not on write? So you shouldn’t fail on write. On Sep 29, 2016, at 1:25 AM, Mostafa Alaa Mohamed wrote: Dears, I want to ask • What will happen if there are

Re: building runnable distribution from source

2016-09-29 Thread Michael Segel
You may want to replace the 2.4 with a later release. On Sep 29, 2016, at 3:08 AM, AssafMendelson wrote: Hi, I am trying to compile the latest branch of spark in order to try out some code I wanted to contribute. I was looking at the

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
r. You'd rather processes > fail than grind everything to a halt. You'd buy more memory or > optimize memory before trading it for I/O. > > On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel > <msegel_had...@hotmail.com> wrote: >> Ok… gotcha… wasn’t sure that YARN just looked a

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel <msegel_had...@hotmail.com> wrote: >> Thanks for the response Sean. >> >> But how does YARN know about the off-heap memory usage? >> That’s the piece that I’m missing. >> >> Thx again, >>

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
be limited via the xms/xmx parameter of the JVM. This can >> be configured via spark options for yarn (be aware that they are different >> in cluster and client mode), but i recommend to use the spark options for >> the off heap maximum. >> >> https://spar

Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Michael Segel
I’ve asked this question a couple of times from a friend who didn’t know the answer… so I thought I would try here. Suppose we launch a job on a cluster (YARN) and we have set up the containers to be 3GB in size. What does that 3GB represent? I mean what happens if we end up using 2-3GB

Re: Spark Thrift Server performance

2016-07-13 Thread Michael Segel
Hey, silly question? If you’re running a load balancer, are you trying to reuse the RDDs between jobs? TIA -Mike > On Jul 13, 2016, at 9:08 AM, ayan guha wrote: > > My 2 cents: > > Yes, we are running multiple STS (we are running on

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Michael Segel
I believe that there is one JVM for the Thrift Service and that there is only one context for the service. This would allow you to share RDDs across multiple jobs, however… not so great for security. HTH… > On Jul 10, 2016, at 10:05 PM, Takeshi Yamamuro

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. Tez is ‘vendor’ independent. ;-) Yeah… I know… Anyone can support it. Only Hortonworks has stacked the deck in their favor. Drill could be in the same boat, although there now more committers who are not working for MapR. I’m not sure who outside of HW is

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. If memory serves, Tez w LLAP is going to be running a separate engine that is constantly running, no? Spark? That runs under hive… Unless you’re suggesting that the spark context is constantly running as part of the hiveserver2? > On May

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
advice. I have to retrieve the basic data from the DB2 tables > but afterwards I'm pretty free to transform the data as needed. > > > > On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel <msegel_had...@hotmail.com> > wrote: >> I think you need to learn the basics of ho

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ first. The short answer is yes. The longer answer is that you need to think more about translating a relational model in to a hierarchical model, something that I seriously doubt has been taught in schools in a

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a stand alone cluster of Spark, or on Yarn? > On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh wrote: > > Hi Jakub, > > Any reason why you are running in standalone mode, given that you are > familiar with YARN? > > In theory your

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, I’m not sure I understand your initial question… Depending on the compression algo, you may or may not be able to split the file. So if it’s not splittable, you have a single long running thread. My guess is that you end up with a very long single partition. If so, if you repartition,

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, There are a lot of moving parts and a lot of unknowns from your description. Besides the version stuff. How many executors, how many cores? How much memory? Are you persisting (memory and disk) or just caching (memory) During the execution… same tables… are you seeing a lot of

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
The only documentation on this… in terms of direction … (that I could find) If your client is not close to the cluster (e.g. your PC) then you definitely want to go cluster to improve performance. If your client is close to the cluster (e.g. an edge node) then you could go either client or

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 22 June 2016 at 19:04, Michael

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 22 June 2016 at 15:59, M

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
reliability etc would be gone. > > Hive supports different execution engines (TEZ, Spark), formats (Orc, > parquet) and further optimizations to make the analysis fast. It always > depends on your use case. > > On 22 Jun 2016, at 05:47, Michael Segel <msegel_had...@hotm

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
e from spark, spark actually reads > hive metastore and streams data directly from HDFS. Hive MR jobs do not play > any role here, making spark faster than hive. > > HTH > > Ayan > > On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com>

Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition I think you mean coalesce() where you would get one parquet file per partition? And this would be a new immutable copy so that you would want to write this new RDD to a different HDFS directory? -Mike > On Jun 21, 2016, at 8:06 AM, Eugene Morozov

Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running. I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables. Yarn has two modes… client and

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
> On Jun 8, 2016, at 3:35 PM, Eugene Koifman wrote: > > if you split “create table test.dummy as select * from oraclehadoop.dummy;” > into create table statement, followed by insert into test.dummy as select… > you should see the behavior you expect with Hive. > Drop

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
you try a select * from foo; and in another shell try dropping foo? and if you want to simulate a m/r job add something like an order by 1 clause. HTH -Mike > On Jun 8, 2016, at 2:36 PM, Michael Segel <mse...@segel.com> wrote: > > Hi, > > Lets take a step back… >

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com>

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, Most people use vendor releases because they need to have the support. Hortonworks is the vendor who has the most skin in the game when it comes to Tez. If memory serves, Tez isn’t going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez? HTH

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Michael Segel
Going from memory… Derby is/was Cloudscape which IBM acquired from Informix who bought the company way back when. (Since IBM released it under Apache licensing, Sun Microsystems took it and created JavaDB…) I believe that there is a networking function so that you can either bring it up in

Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since its a bit of a philosophical question in terms of design and vision for spark. If we look at SparkSQL and performance… where does Secondary indexing fit in? The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary,

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Michael Segel
Yes, but he’s using Phoenix, which may not work cleanly with your HBase spark module. The key issue here may be Phoenix, which is separate from HBase. > On May 18, 2016, at 5:36 AM, Ted Yu wrote: > > Please see HBASE-14150 > > The hbase-spark module would be available

Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
the Thrift JDBC > Distributed Query Engine connector. > > 2016-05-17 5:12 GMT+10:00 Michael Segel <msegel_had...@hotmail.com>: > For one use case… we were considering using the thrift server as a way to > allow multiple clients a

Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case… we were considering using the thrift server as a way to allow multiple clients access to shared RDDs. Within the Thrift Context, we create an RDD and expose it as a hive table. The question is… where does the RDD exist? On the Thrift service node itself, or is that just a

Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, Another silly question… Don’t you want to use the header line to help create a schema for the RDD? Thx -Mike > On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote: > > This only works if the files are "unsplittable". For example gzip files, each > partition is

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
Silly question? If you change the predicate to ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) AND (d.date >= ‘2016-01-03’ OR d.date IS NULL) What do you get? Sorry if the syntax isn’t 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn’t
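The suggestion above — adding an `IS NULL` branch so the WHERE clause doesn’t silently drop the unmatched side of an outer join — can be demonstrated with SQLite from the Python standard library. Table and column names are invented to mirror the thread, not taken from the poster’s schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s (id INTEGER, date TEXT);
    CREATE TABLE d (id INTEGER, date TEXT);
    INSERT INTO s VALUES (1, '2016-01-04'), (2, '2016-01-04');
    INSERT INTO d VALUES (1, '2016-01-05');   -- id 2 has no match in d
""")

# Plain predicate: the unmatched row (where d.date IS NULL) is
# silently dropped, because NULL >= '2016-01-03' is not true.
strict = con.execute("""
    SELECT s.id FROM s LEFT OUTER JOIN d ON s.id = d.id
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
    ORDER BY s.id
""").fetchall()

# Null-tolerant predicate keeps the unmatched row.
tolerant = con.execute("""
    SELECT s.id FROM s LEFT OUTER JOIN d ON s.id = d.id
    WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
      AND (d.date >= '2016-01-03' OR d.date IS NULL)
    ORDER BY s.id
""").fetchall()
```

The behavior is standard SQL three-valued logic, so the same rewrite applies in Spark SQL: any non-null-tolerant predicate on the nullable side of an outer join effectively turns it back into an inner join.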

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Michael Segel
analyzing and very big data). This solution is suitable for very complex > (targeted) analysis. It can be too slow and memory-consuming, but well done > pre-processing of log data can help a lot. > > --- > Esa Heikkinen > > 28.4.2016, 14:44, Michael Segel wrote:

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
he single digit microseconds > level can sometimes matter in financial trading but rarely. > HTH > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
nen <esa.heikki...@student.tut.fi> wrote: > > Do you know any good examples of how to use Spark streaming in tracking public > transportation systems? > Or Storm or some other tool example? > > Regards > Esa Heikkinen > > 28.4.2016, 3:16, Michael

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Doh! Wrong email account again! > Begin forwarded message: > > From: Michael Segel <michael_se...@hotmail.com> > Subject: Re: Spark support for Complex Event Processing (CEP) > Date: April 27, 2016 at 7:16:55 PM CDT > To: Mich Talebzadeh <mich.talebza...@gma

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Sorry sent from wrong email address. > Begin forwarded message: > > From: Michael Segel <michael_se...@hotmail.com> > Subject: Re: Spark support for Complex Event Processing (CEP) > Date: April 27, 2016 at 7:51:14 AM CDT > To: Mich Talebzadeh <mich.talebza...@gma

Re: Spark SQL Transaction

2016-04-21 Thread Michael Segel
Hi, Sometimes terms get muddled over time. If you’re not using transactions, then each database statement is atomic and is itself a transaction. So unless you have some explicit ‘Begin Work’ at the start…. your statements should be atomic and there will be no ‘redo’ or ‘commit’ or
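The point about implicit transactions — without an explicit ‘Begin Work’, each statement is atomic on its own and there is nothing separate to commit or roll back — can be illustrated with SQLite from the Python standard library (the table here is invented for the demo):

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode:
# each statement is its own atomic transaction, applied as soon
# as it runs, with no separate COMMIT step.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (1)")   # takes effect immediately

# With an explicit 'Begin Work', the statements only take effect
# together at COMMIT, and can be undone wholesale by ROLLBACK.
con.execute("BEGIN")
con.execute("INSERT INTO t VALUES (2)")
con.execute("ROLLBACK")

rows = con.execute("SELECT x FROM t").fetchall()
# only the autocommitted row survives the rollback
```

The same distinction is what matters when a Spark JDBC write goes statement-by-statement: each insert commits on its own unless the driver wraps the batch in an explicit transaction.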

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Michael Segel
How many partitions are in your data set? Per the Spark DataFrameWriter Java Doc: “Saves the content of the DataFrame to an external database table via JDBC. In the case the table already exists in the external

Re: inter spark application communication

2016-04-18 Thread Michael Segel
wrote: Yes, I want to chain spark applications. On Mon, Apr 18, 2016 at 4:46 PM Michael Segel <msegel_had...@hotmail.com> wrote: Yes, but I’m confused. Are you chaining your spark jobs? So you run job one

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Michael Segel
Perhaps this is a silly question on my part…. Why do you want to start up HDFS on a single node? You only mention one windows machine in your description of your cluster. If this is a learning experience, why not run Hadoop in a VM (MapR and I think the other vendors make linux images that

Re: inter spark application communication

2016-04-18 Thread Michael Segel
Have you thought about Akka? What are you trying to send? Why do you want them to talk to one another? > On Apr 18, 2016, at 12:04 PM, Soumitra Johri wrote: > > Hi, > > I have two applications: App1 and App2. > On a single cluster I have to spawn 5

Re: Silly question...

2016-04-13 Thread Michael Segel
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 12 April 2016 at 23:42, Michael Segel <msegel_had..

Fwd: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Segel
Sorry for duplicate(s), I forgot to switch my email address. > Begin forwarded message: > > From: Michael Segel <mse...@segel.com> > Subject: Re: Can i have a hive context and sql context in the same app ? > Date: April 12, 2016 at 4:05:26 PM MST > To: Michael Armbrust

Silly question...

2016-04-12 Thread Michael Segel
Hi, This is probably a silly question on my part… I’m looking at the latest (spark 1.6.1 release) and would like to do a build w Hive and JDBC support. From the documentation, I see two things that make me scratch my head. 1) Scala 2.11 "Spark does not yet support its JDBC component for

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
>> HTH >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
ql databases (eg MongoDB) for Spark is generally a > bad idea - no data locality, it cannot handle real big data volumes for > compute and you may potentially overload an operational database. > And if your job fails for whatever reason (eg scheduling ) then you have to >

Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think its necessarily a bad idea. Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) Depending on what you want to do… your data may not be persisted on HDFS.

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Michael Segel
It sounds like when you start up spark, its using 0.0.0.0 which means it will listen on all interfaces. You should be able to limit which interface to use. The weird thing is that if you are specifying the IP Address and Port, Spark shouldn’t be listening on all of the interfaces for that
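The interface point above — 0.0.0.0 means “listen on every interface,” while binding a specific address restricts the listener — can be shown with a plain socket, independent of Spark (ports are chosen by the OS here):

```python
import socket

# Binding to 127.0.0.1 restricts the listener to the loopback
# interface; binding to 0.0.0.0 accepts connections on every
# interface, which is what a UI bound to 0.0.0.0 is doing.
loopback_only = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
loopback_only.bind(("127.0.0.1", 0))   # port 0: let the OS pick one
all_interfaces = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
all_interfaces.bind(("0.0.0.0", 0))

local_addr = loopback_only.getsockname()
wide_addr = all_interfaces.getsockname()
loopback_only.close()
all_interfaces.close()
```

So limiting the UI to localhost comes down to making the process bind 127.0.0.1 explicitly rather than the wildcard address.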

Fwd: Spark and N-tier architecture

2016-03-29 Thread Michael Segel
> Begin forwarded message: > > From: Michael Segel <mse...@segel.com> > Subject: Re: Spark and N-tier architecture > Date: March 29, 2016 at 4:16:44 PM MST > To: Alexander Pivovarov <apivova...@gmail.com> > Cc: Mich Talebzadeh <mich.talebza...@gmail.com>

Re: Spark SQL Json Parse

2016-03-03 Thread Michael Segel
Why do you want to write out NULL if the column has no data? Just insert the fields that you have. > On Mar 3, 2016, at 9:10 AM, barisak wrote: > > Hi, > > I have a problem with Json Parser. I am using spark streaming with > hiveContext for keeping json format

Re: temporary tables created by registerTempTable()

2016-02-15 Thread Michael Segel
I was just looking at that… Out of curiosity… if you make it a Hive Temp Table… who has access to the data? Just your app, or anyone with access to the same database? (Would you be able to share data across different JVMs? ) (E.G - I have a reader who reads from source A that needs to

Re: Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Michael Segel
It’s been a while... but this isn’t a spark issue. A spark job on YARN runs as a regular job. What happens when you run a regular M/R job as that user? I don’t think we did anything special... > On Jan 4, 2016, at 12:22 PM, Mike Wright wrote: >

Re: stopping a process usgin an RDD

2016-01-04 Thread Michael Segel
Not really a good idea. It breaks the paradigm. If I understand the OP’s idea… they want to halt processing the RDD, but not the entire job. So when it hits a certain condition, it will stop that task yet continue on to the next RDD. (Assuming you have more RDDs or partitions than you have

Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark’s link, the presentation talks about the TPC-DS benchmark. Here’s my question… what benchmark results? If you go over to the TPC website (http://tpc.org/) they have no TPC-DS benchmarks listed. (Either audited or unaudited) So

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
/0636920033073.do (O'Reilly) Typesafe http://typesafe.com/ @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com/ On Mon, Jul 20, 2015 at 2:55 PM, Michael Segel msegel_had...@hotmail.com wrote: Sorry, Should

Fwd: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Sorry, Should have sent this to user… However… it looks like the docs page may need some editing? Thx -Mike Begin forwarded message: From: Michael Segel msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26:40 PM MST To: d

Re: Research ideas using spark

2015-07-16 Thread Michael Segel
, at 12:40 PM, vaquar khan vaquar.k...@gmail.com wrote: I would suggest study spark ,flink,strom and based on your understanding and finding prepare your research paper. May be you will invented new spark ☺ Regards, Vaquar khan On 16 Jul 2015 00:47, Michael Segel msegel_had

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… Lets set aside spark, and look at the overall ingestion pattern. Its really an ingestion pattern where your input in to the system is from a queue. Are the events discrete or continuous? (This is kinda important.) If the events are continuous then more

Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… When thinking about a PhD thesis… do you want to tie it to a specific technology or do you want to investigate an idea but then use a specific technology. Or is this an outdated way of thinking? I am doing my PHD thesis on large scale machine learning e.g Online learning,

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse. Spark Streaming is really a micro batch system where the smallest window is 1/2 a second (500ms). So for CEP, its not really a good idea. So in terms of options…. spark streaming, storm, samza, akka and others… Storm is probably the easiest to pick up, spark streaming

Re: HW imbalance

2015-01-30 Thread Michael Segel
homogeneously sized executors won't be able to take advantage of the extra memory on the bigger boxes. Cloudera Manager can certainly configure YARN with different resource profiles for different nodes if that's what you're wondering. -Sandy On Thu, Jan 29, 2015 at 11:03 PM, Michael Segel

Re: HW imbalance

2015-01-29 Thread Michael Segel
the same amount of memory. It's possibly to configure YARN with different amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), so other apps might be able to take advantage of the extra memory. -Sandy On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel msegel_had

Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Michael Segel
Sorry, but the accumulator is still going to require you to walk through the RDD to get an accurate count, right? It’s not being persisted? On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Alternative to doing a naive toArray is to declare an accumulator per