Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Apparently, there is another way to do it. You can try creating a PartitionPruningRDD and passing a partition filter function to it. This RDD will do the same thing that I suggested in my mail, and you will not have to create a new RDD. Hemant Bhanawat
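A minimal sketch of that suggestion (the input RDD `base` and the kept partition indices are placeholders; PartitionPruningRDD.create is a @DeveloperApi helper):

    import org.apache.spark.rdd.PartitionPruningRDD

    // Keep only the listed partitions; tasks are scheduled only for them.
    val wanted = Set(0, 3)
    val pruned = PartitionPruningRDD.create(base, idx => wanted.contains(idx))
    pruned.count()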

partition an empty RDD

2016-04-06 Thread Tenghuan He
Hi all, I want to create an empty RDD and partition it: val buffer: RDD[(K, (V, Int))] = base.context.emptyRDD[(K, (V, Int))].partitionBy(new HashPartitioner(5)) but got "Error: No ClassTag available for K". Scala needs information about K at runtime, but how do I solve this? Thanks in
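A sketch of the usual fix, assuming K and V are type parameters of an enclosing method or class: add ClassTag context bounds so the runtime type evidence is available where emptyRDD/partitionBy need it:

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Context bounds supply the ClassTag evidence the compiler was missing.
    def emptyPartitioned[K: ClassTag, V: ClassTag](base: RDD[(K, V)]): RDD[(K, (V, Int))] =
      base.context.emptyRDD[(K, (V, Int))].partitionBy(new HashPartitioner(5))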

Re: Executor shutdown hooks?

2016-04-06 Thread Hemant Bhanawat
As part of PR https://github.com/apache/spark/pull/11723, I have added a killAllTasks function that can be used to kill (rather interrupt) individual tasks before an executor exits. If this PR is accepted, for doing task level cleanups, we can add a call to this function before executor exits. The

java.lang.RuntimeException: Executor is not registered - Spark 1.

2016-04-06 Thread Rodrick Brown
I'm running Spark on Mesos in coarse-grained mode and experiencing some serious issues when trying to run an application on Spark 1.6.1; we have no issues running the same app on Spark 1.5.1 (we're trying to migrate to 1.6.1). I'm running the mesos-external-shuffle service on all my slaves.

Re: Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Soam Acharya
Hi Josh, Appreciate the response! Also, Steve - we meet again :) At any rate, here's the output (a lot of it anyway) of running spark-sql with the verbose option so that you can get a sense of the settings and the classpath. Does anything stand out? Using properties file:

Running Spark on Yarn-Client/Cluster mode

2016-04-06 Thread ashesh_28
Hi, I am new to the world of Hadoop and this is my first post here. Recently I set up a multi-node Hadoop cluster (3 nodes) with the HA feature for the NameNode & ResourceManager using a ZooKeeper server. *Daemons running in NN1 (ptfhadoop01v):* 2945 JournalNode 3137

Re: Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Josh Rosen
Spark is compiled against a custom fork of Hive 1.2.1 which added shading of Protobuf and removed shading of Kryo. What I think is happening here is that stock Hive 1.2.1 is taking precedence, so the Kryo instance that it's returning is an instance of the shaded/relocated Hive version rather

Re: Amazon S3 Access Error

2016-04-06 Thread Nezih Yigitbasi
Did you take a look at this jira? On Wed, Apr 6, 2016 at 6:44 PM Joice Joy wrote: > I am facing an S3 access error when using Spark 1.6.1 pre-built for Hadoop > 2.6 or later. > But if I use Spark 1.6.1 pre-built for

Amazon S3 Access Error

2016-04-06 Thread Joice Joy
I am facing an S3 access error when using Spark 1.6.1 pre-built for Hadoop 2.6 or later, but if I use Spark 1.6.1 pre-built for Hadoop 2.4 or later, it works. Am I missing something that needs to be configured with Hadoop 2.6? PFB the error: java.io.IOException: No FileSystem for scheme: s3n
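One commonly suggested direction, shown here only as a hedged sketch: in the Hadoop 2.6+ builds the s3n classes live in the separate hadoop-aws module, so that jar (and its jets3t/aws-java-sdk dependencies) has to be added to the driver and executor classpath (e.g. via --jars), after which the filesystem can be wired up explicitly:

    // Assumes hadoop-aws and its dependencies are already on the classpath.
    sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))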

Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Soam Acharya
Hi folks, I have a build of Spark 1.6.1 on which spark sql seems to be functional outside of windowing functions. For example, I can create a simple external table via Hive: CREATE EXTERNAL TABLE PSTable (pid int, tty string, time string, cmd string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Re: Executor shutdown hooks?

2016-04-06 Thread Reynold Xin
On Wed, Apr 6, 2016 at 4:39 PM, Sung Hwan Chung wrote: > My option so far seems to be using JVM's shutdown hook, but I was > wondering if Spark itself had an API for tasks. > Spark would be using that under the hood anyway, so you might as well just use the jvm
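A rough sketch of the JVM shutdown-hook route mentioned above (the input `rdd` and the cleanup body are placeholders; the object ensures the hook is registered at most once per executor JVM):

    object ExecutorCleanup {
      // Evaluated lazily, so the hook is registered once per executor JVM.
      lazy val register: Unit = Runtime.getRuntime.addShutdownHook(new Thread {
        override def run(): Unit = {
          // hypothetical cleanup: flush buffers, close connections, etc.
        }
      })
    }

    val withHook = rdd.mapPartitions { iter =>
      ExecutorCleanup.register   // force the one-time registration on this JVM
      iter                       // then do the real per-record work here
    }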

Re: Data frames or Spark sql for partitioned tables

2016-04-06 Thread Mich Talebzadeh
Are you referring to the data frame after the map? You can name the columns with .toDF("name1", "name2") after the map. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
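For example (a sketch only; the input RDD, the parsing logic, and the SQLContext name are placeholders):

    import sqlContext.implicits._   // assuming a SQLContext named sqlContext

    val df = rdd
      .map { line => val cols = line.split(","); (cols(0), cols(1)) }
      .toDF("name1", "name2")       // column names assigned after the map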

Re: Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
What I meant is 'application', i.e., when we manually terminate an application that was submitted via spark-submit. When we manually kill an application, it seems that individual tasks do not receive the InterruptedException. That InterruptedException seems to work iff we cancel the job through

Data frames or Spark sql for partitioned tables

2016-04-06 Thread mdkhajaasmath
Hi, I am new to Spark and trying to implement the solution without using Hive. We are migrating to a new environment where Hive is not present; instead I need to use Spark to output files. I looked at case classes, and the maximum number of columns I can use is 22, but I have 180 columns. In this scenario
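A sketch of the usual workaround for the 22-field case-class limit in Scala 2.10: build the schema programmatically with StructType and Row (the column names, the all-String types, the input RDD, and the output path below are placeholders):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // 180 columns, here all strings for illustration.
    val schema = StructType((1 to 180).map(i => StructField(s"col$i", StringType, nullable = true)))
    val rowRdd = rdd.map(line => Row.fromSeq(line.split(",", -1).toSeq))
    val df = sqlContext.createDataFrame(rowRdd, schema)
    df.write.format("parquet").save("/output/path")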

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
Good to know that. That is why Sqoop has this "direct" mode, to utilize the vendor specific feature. But for MPP, I still think it makes sense that vendor provide some kind of InputFormat, or data source in Spark, so Hadoop eco-system can integrate with them more natively. Yong Date: Wed, 6

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > Ordering for a struct goes in order of the fields. So the max struct is > the one with the highest TotalValue (and then the highest category > if there are multiple entries with the same hour and total value). > > Is this due to "InterpretedOrdering" in StructType? > That is one

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
It is using a JDBC driver; I know that's the case for Teradata: http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;

Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-06 Thread jamborta
You are probably better off defining your own data structure. LabeledPoint can store a label and a vector, but your case is more like a label plus two vectors. I'd probably use tuples with Breeze sparse arrays: RDD[(label:Int, vector1:SparseArray[Double], vector2:SparseArray[Double])] -- View this

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread DB Tsai
+1 for renaming the jar file. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly wrote: > perhaps renaming to Spark ML would actually clear up code and

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-06 Thread jamborta
You can declare your class Serializable, as Spark will want to serialize the whole class. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672p26689.html Sent from the Apache Spark User List

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
SAP Sybase IQ does that and I believe SAP Hana as well. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

RE: ordering over structs

2016-04-06 Thread Yong Zhang
1) Is a struct in Spark like a struct in C++? Kinda. It's an ordered collection of data with known names/types. 2) What is an alias in this context? It is assigning a name to the column, similar to doing AS in SQL. 3) How does this code even work? Ordering

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
For some MPP relational stores (not operational) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP and Hadoop. I would guess (have no idea) someone like IBM is already doing that

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Sorry, are you referring to Hive as a relational data warehouse in this scenario? The assumption here is that the data is coming from a relational database (Oracle), so IMO the best storage for it in the Big Data world is another DW adaptive to SQL. Spark is a powerful query tool and together with Hive as

Re: How to convert a Vector[Vector] to a RDD[Vector]?

2016-04-06 Thread jamborta
If you are asking about Scala vectors, it is as simple as this: val vec = Vector(Vector(1,2), Vector(1,2), Vector(1,2)) val vecrdd = sc.parallelize(vec) where vecrdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Vector[Int] -- View this message in context:

Re: Sqoop on Spark

2016-04-06 Thread Jörn Franke
Well, I am not sure, but using a database as storage, such as relational databases or certain NoSQL databases (e.g. MongoDB), for Spark is generally a bad idea - no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
I just created an example of how to use JDBC to get Oracle data into Hive table using Sqoop. Please see thread below How to use Spark JDBC to read from RDBMS table, create Hive ORC table and save RDBMS data in it HTH Dr Mich Talebzadeh LinkedIn *

How to use Spark JDBC to read from RDBMS table, create Hive ORC table and save RDBMS data in it

2016-04-06 Thread Mich Talebzadeh
Hi, There was a question on the merits of using Sqoop to ingest data from an Oracle table to Hive. The issue is that Sqoop reverts to MapReduce when getting data into Hive, which is not that great. One can IMO do better by using a JDBC connection (which is identical to what Sqoop does anyway but
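A rough sketch of that flow (connection details, credentials, and table names are placeholders, and it assumes sqlContext is a HiveContext so the table lands in the Hive metastore):

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:oracle:thin:@//dbhost:1521/mydb",
      "dbtable"  -> "scott.mytable",
      "user"     -> "scott",
      "password" -> "tiger",
      "driver"   -> "oracle.jdbc.OracleDriver"
    )).load()

    // Write out as an ORC-backed Hive table.
    jdbcDF.write.format("orc").mode("overwrite").saveAsTable("mytable_orc")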

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > 1) Is a struct in Spark like a struct in C++? > Kinda. It's an ordered collection of data with known names/types. > 2) What is an alias in this context? > It is assigning a name to the column, similar to doing AS in SQL. > 3) How does this code even work? > Ordering for a struct
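A sketch of the pattern under discussion, using the column names (Hour, Category, TotalValue) from the linked StackOverflow example:

    import org.apache.spark.sql.functions.{col, max, struct}

    // The struct is ordered field by field, so max() picks the row with the
    // highest TotalValue per Hour (ties broken by Category).
    val firstPerHour = df
      .groupBy(col("Hour"))
      .agg(max(struct(col("TotalValue"), col("Category"))).alias("max"))
      .select(col("Hour"), col("max.TotalValue"), col("max.Category"))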

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
This is interesting. See below notebook. it is in 1.6. https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5228339421202847/186877441366454/2805387300416006/latest.html You create the 2 data-frame from partitioned parquet file. Persist the files and

Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
I know of projects that have done this but have never seen any advantage of "using spark to do what sqoop does" - at least in a yarn cluster. Both frameworks will have similar overheads of getting the containers allocated by yarn and creating new jvms to do the work. Probably spark will have a

ordering over structs

2016-04-06 Thread Imran Akbar
I have a use case similar to this: http://stackoverflow.com/questions/33878370/spark-dataframe-select-the-first-row-of-each-group and I'm trying to understand the solution titled "ordering over structs": 1) Is a struct in Spark like a struct in C++? 2) What is an alias in this context? 3) How

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
Got it. In the old MapReduce/Hive world, mapJoin means broadcast join in Spark, so I thought you were looking for a broadcast join in this case. What you describe is exactly the hash join. The most correct way is a hash join, provided that the DFs you are joining are already partitioned in the same way on

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
Thanks for the information. When I mentioned map-side join, I meant that each partition from DF 1 joins with the partition with the same key of DF 2 on the worker node, without shuffling the data. In other words, do as much work as possible within the worker node before shuffling the data. Thanks Darshan Singh On Wed,

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
I think I gave you one piece of misleading information. If you have 2 already-partitioned (K, V) RDDs and join them by K, then the correct plan you should see is HashJoin instead of SortMerge. My guess is that when you see the SortMerge join in DF, then Spark doesn't use the most efficient way of

Re: how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Thanks Ted: That is the kind of answer I was looking for. Best, Cesar flores On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu wrote: > Have you looked at SparkListener ? > > /** >* Called when the driver registers a new executor. >*/ > def

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
Thanks a lot for this. I was thinking of using cogrouped RDDs. We will try to move to 1.6 as there are other issues as well in 1.5.2. The same code is much faster in 1.6.1, but plan-wise I do not see much difference. Why is it still partitioning, then sorting, then joining? I expect it to sort

Re: how to query the number of running executors?

2016-04-06 Thread Ted Yu
Have you looked at SparkListener ? /** * Called when the driver registers a new executor. */ def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit /** * Called when the driver removes an executor. */ def onExecutorRemoved(executorRemoved:
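A sketch of how that listener could be wired up (the class and counter names are made up for illustration):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    // Keeps a running count of registered executors.
    class ExecutorCountListener extends SparkListener {
      val count = new AtomicInteger(0)
      override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit =
        count.incrementAndGet()
      override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit =
        count.decrementAndGet()
    }

    val listener = new ExecutorCountListener
    sc.addSparkListener(listener)   // read listener.count.get() while the job runs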

Re: Problem with pyspark on Docker talking to YARN cluster

2016-04-06 Thread John Omernik
Were there any other creative solutions for this? I am running into the same issue submitting to YARN from a Docker container, and the solutions provided don't work. (1. the host doesn't work, even if I use the hostname of the physical node, because when Spark tries to bind to the hostname

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
What you are looking for is https://issues.apache.org/jira/browse/SPARK-4849 This feature is available in Spark 1.6.0, so the DataFrame can reuse the partitioned data in the join. For your case in 1.5.x, you have to use the RDD way to tell Spark that the join should utilize the presorted data.
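A sketch of the RDD-level approach (the two input pair RDDs and the partition count are placeholders):

    import org.apache.spark.HashPartitioner

    // Co-partition both sides with the same partitioner and cache them; the
    // subsequent join then reuses that partitioning instead of re-shuffling.
    val part   = new HashPartitioner(100)
    val left   = leftRdd.partitionBy(part).persist()
    val right  = rightRdd.partitionBy(part).persist()
    val joined = left.join(right)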

Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shut down when the Job is terminated? Executors are bound to Applications, not Jobs. Furthermore, unless spark.job.interruptOnCancel is set to true, canceling the Job at the Application and DAGScheduler level won't actually interrupt the Tasks running on the Executors. If

Re: Scala: Perform Unit Testing in spark

2016-04-06 Thread Shishir Anshuman
I placed the *tests* jars in the *lib* folder; now it's working. On Wed, Apr 6, 2016 at 7:34 PM, Lars Albertsson wrote: > Hi, > > I wrote a longish mail on Spark testing strategy last month, which you > may find useful: >

Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
Hi, I'm looking for ways to add shutdown hooks to executors: i.e., when a Job is forcefully terminated before it finishes. The scenario goes like this: executors are running a long-running job within a 'map' function. The user decides to terminate the job; then the mappers should perform some

Re: Using an Option[some primitive type] in Spark Dataset API

2016-04-06 Thread Michael Armbrust
> We only define implicits for a subset of the types we support in > SQLImplicits. > We should probably consider adding Option[T] for common T as the internal > infrastructure

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-04-06 Thread Michael Armbrust
> > Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text] Use toDS() and you can skip the .as[Text] > Sure! It works with map, but not with select. Wonder if it's by design > or...will soon be fixed? Thanks again for your help. This is by design. select is relational and works with column

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
I used 1.5.2. I have used movies data to reproduce the issue. Below is the physical plan. I am not sure why it is hash partitioning the data and then sorting and then joining. I expect the data to be joined first and then sent for further processing. I sort of expect a common partitioner which will work

how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Hello: I wonder if there is a way to query the number of running executors (not the number of requested executors) inside a Spark job? Thanks -- Cesar Flores

Re: Partition pruning in spark 1.5.2

2016-04-06 Thread Darshan Singh
It worked fine and I was looking for this only, as I do not want to cache the dataframe since the data in some of the partitions will change. However, I have a much larger number of partitions (the column is not just country but something whose values can be in the hundreds of thousands). Now the metadata is much bigger than

MLLIB LDA throws NullPointerException

2016-04-06 Thread jamborta
Hi all, I came across a really weird error on spark 1.6 (calling LDA from pyspark) //data is [index, DenseVector] data1 = corpusZippedDataFiltered.repartition(100).sample(False, 0.1, 100) data2 = sc.parallelize(data1.collect()).repartition(100) ldaModel1 = LDA.train(data1, k=10,

Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think it’s necessarily a bad idea. Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) Depending on what you want to do… your data may not be persisted on HDFS.

Agenda Announced for Spark Summit 2016 in San Francisco

2016-04-06 Thread Scott walent
Spark Summit 2016 (www.spark-summit.org/2016) will be held from June 6-8 at the Union Square Hilton in San Francisco, and the recently released agenda features a stellar lineup of community talks led by top engineers, architects, data scientists, researchers, entrepreneurs and analysts from UC

Re: lost executor due to large shuffle spill memory

2016-04-06 Thread Michael Slavitch
Shuffle will always spill the local dataset to disk. Changing memory settings does nothing to alter this, so you need to set spark.local.dir appropriately to a fast disk. > On Apr 6, 2016, at 12:32 PM, Lishu Liu wrote: > > Thanks Michael. I use 5 m3.2xlarge nodes.
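For example (a sketch; the directory paths are placeholders, and on YARN the local directories are taken from yarn.nodemanager.local-dirs instead):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-spill-example")
      .set("spark.local.dir", "/mnt/fast1/spark,/mnt/fast2/spark")  // fast local disks
    val sc = new SparkContext(conf)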

Re: lost executor due to large shuffle spill memory

2016-04-06 Thread Lishu Liu
Thanks Michael. I use 5 m3.2xlarge nodes. Should I increase spark.storage.memoryFraction? Also I'm thinking maybe I should repartition all_pairs so that each partition will be small enough to be handled. On Tue, Apr 5, 2016 at 8:03 PM, Michael Slavitch wrote: > Do you have

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
It doesn't matter - just an example. Imagine a YARN cluster with 100GB of RAM and I submit a lot of jobs simultaneously in a loop. Thanks, Peter Rudenko On 4/6/16 7:22 PM, Ted Yu wrote: Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Ted Yu
Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too low for your use case ? Cheers On Wed, Apr 6, 2016 at 9:19 AM, Peter Rudenko wrote: > Hi i have a situation, say i have a yarn cluster with 2GB RAM. I'm >

[Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
Hi, I have a situation: say I have a YARN cluster with 2GB RAM. I'm submitting 2 Spark jobs with "--driver-memory 1GB --num-executors 2 --executor-memory 1GB". So I see 2 Spark AMs running, but they are unable to allocate worker containers and start the actual job, and they hang for a while.

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread Andy Davidson
+1 From: Matei Zaharia Date: Tuesday, April 5, 2016 at 4:58 PM To: Xiangrui Meng Cc: Shivaram Venkataraman , Sean Owen , Xiangrui Meng , dev , "user @spark"

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Yes, JDBC is another option. You need to be aware of some conversion issues, e.g. with how Spark handles CHAR types. Your best bet is to do the conversion when fetching data from Oracle itself. var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb" var _username : String = "sh" var _password :

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-04-06 Thread Sebastian YEPES FERNANDEZ
Hello Jacek, I was just facing the same issue and have found a possible solution: scala> ds.select('id.as[Int], 'text.as[String]).show +---+-+ | _1| _2| +---+-+ | 0|hello| | 1|world| +---+-+ The only thing is that the resulting DS loses the field names ;-( Regards,

Re: Scala: Perform Unit Testing in spark

2016-04-06 Thread Lars Albertsson
Hi, I wrote a longish mail on Spark testing strategy last month, which you may find useful: http://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser Let me know if you have follow up questions or want assistance. Regards, Lars Albertsson Data engineering consultant

Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Float)] is actually represented as Seq[Row] (seq of struct type) internally. So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r => r.getInt(0) } On Wed, 6 Apr 2016 at 13:35 Nick Pentreath wrote: > Hi there, > > In

Re: RDD Partitions not distributed evenly to executors

2016-04-06 Thread Mike Hynes
Hello All (and Devs in particular), Thank you again for your further responses. Please find a detailed email below which identifies the cause (I believe) of the partition imbalance problem, which occurs in spark 1.5, 1.6, and a 2.0-SNAPSHOT. This is followed by follow-up questions for the dev

Re: Unsubscribe

2016-04-06 Thread R. Revert
unsubscribe

Big Data Interview FAQ

2016-04-06 Thread Chaturvedi Chola
Hello Team The below is a very good book on Big Data for interview preparation. https://notionpress.com/read/big-data-interview-faqs Thanks, Chaturvedi.

Unsubscribe

2016-04-06 Thread Brian London
unsubscribe

RE: How to process one partition at a time?

2016-04-06 Thread Sun, Rui
Maybe you can try SparkContext.submitJob: def submitJob[T, U, R](rdd: RDD[T], processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int], resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R):
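A sketch of how submitJob could be used to run just one partition (the input is assumed to be an RDD[String]; the per-partition function and result handling are illustrative):

    import scala.collection.mutable.ArrayBuffer

    val results = ArrayBuffer[Int]()
    val future = sc.submitJob(
      rdd,                                            // assumed RDD[String]
      (it: Iterator[String]) => it.size,              // processPartition
      Seq(0),                                         // run only partition 0
      (index: Int, res: Int) => results.append(res),  // resultHandler
      results.toSeq                                   // resultFunc, evaluated at the end
    )
    // `future` is a FutureAction; block on it or register a callback as needed.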

Fwd: Using an Option[some primitive type] in Spark Dataset API

2016-04-06 Thread Razvan Panda
Copy paste from SO question: http://stackoverflow.com/q/36449368/750216 "Is it possible to use an Option[_] member in a case class used with the Dataset API? e.g. Option[Int]. I tried to find an example but could not find any yet. This can probably be done with a custom encoder (mapping?) but I
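A small hedged sketch of what to try first (the case class and field names are invented; it assumes the Spark 1.6 Dataset encoders handle Option fields inside a case class, which the reply above suggests the internal infrastructure supports):

    case class Record(id: Int, score: Option[Int])

    import sqlContext.implicits._   // assuming a SQLContext in scope
    val ds = Seq(Record(1, Some(10)), Record(2, None)).toDS()
    ds.map(r => r.score.getOrElse(0)).collect()   // None surfaces as a null column value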

Re: dataframe sorting and find the index of the maximum element

2016-04-06 Thread Angel Angel
Hi, thanks for the help. Actually I didn't declare the function. Can you give me the best way to solve this problem? Here I am explaining my problem: I have 3 tables and I have to find the maximum element in each table. Table1 Table2 Table3 21 31 41 55 1 12 32 19 96 2 14 28 Then

ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there, In writing some tests for a PR I'm working on, with a more complex array type in a DF, I ran into this issue (running off latest master). Any thoughts? *// create DF with a column of Array[(Int, Double)]* val df = sc.parallelize(Seq( (0, Array((1, 6.0), (1, 4.0))), (1, Array((1, 3.0),

Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Instead of doing it in compute, you could rather override getPartitions method of your RDD and return only the target partitions. This way tasks for only target partitions will be created. Currently in your case, tasks for all the partitions are getting created. I hope it helps. I would like to

Re: How to process one partition at a time?

2016-04-06 Thread Andrei
I'm writing a kind of sampler which in most cases will require only 1 partition, sometimes 2, and very rarely more. So it doesn't make sense to process all partitions in parallel. What is the easiest way to limit computations to one partition only? So far the best idea I have come up with is to create a

Spark Sampling

2016-04-06 Thread saurabh guru
I have a steady stream of data coming in from my Kafka agent to my Spark system. How do I sample this data in Spark so that it doesn't get heavily loaded? I have an object of type JavaDStream lines. How do I achieve 1% sampling on the above? I was doing rdd.sample(false, 1.0) before inserting the
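A sketch of one way to do it (shown in Scala; `lines` stands in for the stream coming from Kafka). Note that a 1% sample is a fraction of 0.01, not 1.0:

    // Sample roughly 1% of each micro-batch before any further processing.
    val sampled = lines.transform(rdd => rdd.sample(withReplacement = false, fraction = 0.01))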

Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan, there was a talk at Spark Summit: https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/ Apparently they had a lot of problems and the project seems abandoned. If you just have to do simple ingestion of a full table or a simple query, just use Sqoop as suggested by Mich,

Re: SQL query results: what is being cached?

2016-04-06 Thread Mich Talebzadeh
Well, this is expected behaviour, much like what you see in an RDBMS with a cold cache. The underlying file systems have a file cache that caches some of the result set once a physical read is done, so the data is obtained from the file cache for subsequent reads. In RDBMS terminology this is Logical IO or

SQL query results: what is being cached?

2016-04-06 Thread Mohamed Nadjib MAMI
I noticed that with most SQL queries (sqlContext.sql(query)) I ran on Parquet tables, some results are returned faster after the first and second run of the query. Is this variation normal, i.e. can two executions of the same job take different times? Or are there some intermediate results

call stored procedure from spark

2016-04-06 Thread divyanshumarwah
Hi, how can I call a stored procedure in MySQL from Spark using Python? Actually, I want to get the data into Spark by calling a procedure in MySQL using Python and use it for further operations. Please suggest a solution. -- View this message in context: