Spark DataFrame sum of multiple columns

2016-04-21 Thread Naveen Kumar Pokala
Hi, do we have any way to perform row-level operations in Spark DataFrames? For example, I have a dataframe with columns A, B, C, ..., Z. I want to add one more column, New Column, with the sum of all the column values. A B C D . . . Z New Column 1 2 4 3 26 351 Can somebody
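A minimal sketch of one way to do this (assuming a DataFrame named df whose columns are all numeric; the names are illustrative):

```scala
// Fold all existing columns into one sum expression; assumes every column is numeric.
import org.apache.spark.sql.functions.col

val withTotal = df.withColumn("New Column", df.columns.map(col).reduce(_ + _))
withTotal.show()
```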

Something wrong with sortBy

2016-04-21 Thread tuan3w
I'm working on implementing LSH on Spark. I started with an implementation provided by SoundCloud: https://github.com/soundcloud/cosine-lsh-join-spark/blob/master/src/main/scala/com/soundcloud/lsh/Lsh.scala When I check the WebUI, I see that after calling sortBy, the number of partitions of the RDD decreases
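A hedged aside for readers hitting the same thing: sortBy takes an explicit numPartitions argument, so the partition count after the sort can be pinned rather than left to the default. Variable names below are illustrative:

```scala
// sortBy(f, ascending, numPartitions): pin the post-sort partition count explicitly.
val sorted = vectors.sortBy(v => v.hashCode, ascending = true,
  numPartitions = vectors.getNumPartitions)
```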

[Ask :]Best Practices - Application logging in Spark 1.5.2 + Scala 2.10

2016-04-21 Thread Divya Gehlot
Hi, I am using Spark with a Hadoop 2.7 cluster. I need to print all my print statements and/or any errors to a file, for instance some info if a certain level is reached, or some error if something is missing in my Spark Scala script. Can somebody help me or point me to a tutorial, blog, or book? What's the best way to
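A minimal sketch of one common approach, assuming the log4j bundled with Spark and a conf/log4j.properties that routes a named logger to a file; all names are illustrative:

```scala
// Plain log4j logging from driver code; executors log to their own stderr/files.
import org.apache.log4j.Logger

object MyJob {
  @transient lazy val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    log.info("job started")
    // ... Spark work ...
    if (args.isEmpty) log.error("something missing in the input arguments")
  }
}
```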

java.io.NotSerializableException: org.apache.spark.sql.types.LongType

2016-04-21 Thread Andy Davidson
I started using http://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth in python. It was really easy to get the frequent item sets. Unfortunately associations is not implemented in python. Here is my python code It works great rawJsonRDD = jsonToPythonDictionaries(sc,

Re: Why Spark having OutOfMemory Exception?

2016-04-21 Thread Zhan Zhang
The data may not be large, but the driver needs to do a lot of bookkeeping. In your case, it is possible the driver control plane takes too much memory. I think you can find a Java developer to look at the core dump. Otherwise, it is hard to tell exactly which part is using all the memory.

RE: Create tab separated file from a dataframe spark 1.4 with Java

2016-04-21 Thread Mohammed Guller
It should be straightforward to do this using the spark-csv package. Assuming “myDF” is your DataFrame, you can use the following code to save data in a TSV file. myDF.write .format("com.databricks.spark.csv") .option("delimiter", "\t") .save("data.tsv") Mohammed From: Mail.com
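A slightly fuller, hedged version of the snippet above (assuming the spark-csv package is on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.4.0):

```scala
myDF.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "true")   // optional: write a header row
  .mode("overwrite")          // optional: replace an existing output directory
  .save("data.tsv")
```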

Spark SQL insert overwrite table not showing all the partition.

2016-04-21 Thread Bijay Kumar Pathak
Hi, I have a job which writes to a Hive table with dynamic partitions. Inside the job, I am writing into the table two times, but I am only seeing the partition from the last write, although I can see in the Spark UI that it is processing data for both partitions. Below is the query I am using to write

Re: Spark 2.0 forthcoming features

2016-04-21 Thread Jules Damji
Thanks Michael, we're doing a Spark 2.0 webinar. Register, and if you can't make it, you can always watch the recording. Cheers Jules Sent from my iPhone Pardon the dumb thumb typos :) > On Apr 20, 2016, at 10:15 AM, Michael Malak > wrote: > >

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Mich Talebzadeh
I would be surprised if Oracle cannot handle million row calculations, unless you are also using other data in Spark. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
hm still struggling with those two above scala> import org.apache.spark.internal.Logging :57: error: object internal is not a member of package org.apache.spark import org.apache.spark.internal.Logging ^ scala> import

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I think I now understand what the problem is and it is, in some ways, to do with partitions and, in other ways, to do with memory. I now think that the database write was not the source of the problem (the problem being end-to-end performance). The application reads rows from a database, does

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
For you, it should be spark-core_2.10-1.5.1.jar Please replace version of Spark in my example with the version you use. On Thu, Apr 21, 2016 at 1:23 PM, Mich Talebzadeh wrote: > Hi Ted > > I cannot see spark-core_2.11-2.0.0-SNAPSHOT.jar under > >

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
Hi Ted I cannot see spark-core_2.11-2.0.0-SNAPSHOT.jar under https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/ Sorry where are these artefacts please? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Tracing Spark DataFrame Execution

2016-04-21 Thread Andrés Ivaldi
Hello, is it possible to trace DataFrame execution? I'd like to report the progress of a DataFrame execution. I looked at SparkListeners, but nested dataframes produce several Jobs, and I don't know how to relate these Jobs; also I'm reusing the SparkContext. Regards. -- Ing. Ivaldi Andres
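One hedged way to relate the jobs a DataFrame action spawns is to tag the action with a job group and match that group in a SparkListener; a sketch (the group name is illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // the job group set on the calling thread is propagated in the job properties
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id")).orNull
    if (group == "trace-my-dataframe") println(s"job ${jobStart.jobId} started for traced action")
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
})

sc.setJobGroup("trace-my-dataframe", "tracing one DataFrame action")
myDF.count()        // every job triggered here carries the group id above
sc.clearJobGroup()
```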

bisecting kmeans model tree

2016-04-21 Thread roni
Hi, I want to get the bisecting k-means tree structure to show a dendrogram on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks -Roni
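For reference, a hedged training sketch with MLlib's BisectingKMeans (Spark 1.6); as far as the public API goes in that version, the internal cluster tree is not exposed, only the leaf cluster centers:

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy data; replace with the real feature vectors.
val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)))

val model = new BisectingKMeans().setK(2).run(data)
model.clusterCenters.foreach(println)   // leaf centers; the dendrogram itself stays internal
```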

Create tab separated file from a dataframe spark 1.4 with Java

2016-04-21 Thread Mail.com
Hi, I have a dataframe and need to write it to a tab-separated file using Spark 1.4 and Java. Can someone please suggest? Thanks, Pradeep

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Plug in 1.5.1 for your jars: $ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT.jar | grep Logging ... 1781 Thu Apr 21 08:19:34 PDT 2016 org/apache/spark/internal/Logging$.class jar tvf external/kafka/target/spark-streaming-kafka_2.11-2.0.0-SNAPSHOT.jar | grep LeaderOffset ... 3310 Thu

Can't access sqlite db from Spark

2016-04-21 Thread sturm
Hi, I have the following code: val conf = new SparkConf().setAppName("Spark Test") val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val data = sqlContext.read.format("jdbc").options( Map( "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
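A hedged, completed version of that read (the driver class is the Xerial SQLite JDBC driver, which must be on the driver and executor classpath, e.g. via --jars; the table name is a placeholder):

```scala
val data = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
  "driver"  -> "org.sqlite.JDBC",
  "dbtable" -> "pricing_table"   // hypothetical table name
)).load()

data.show()
```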

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Bryan Jeffrey
Here is what we're doing: import java.util.Properties import kafka.producer.{KeyedMessage, Producer, ProducerConfig} import net.liftweb.json.Extraction._ import net.liftweb.json._ import org.apache.spark.streaming.dstream.DStream class KafkaWriter(brokers: Array[String], topic: String,

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Todd Nist
Have you looked at these: http://allegro.tech/2015/08/spark-kafka-integration.html http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/ Full example here: https://github.com/mkuthan/example-spark-kafka HTH. -Todd On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
These two are giving me grief: scala> import org.apache.spark.internal.Logging :26: error: object internal is not a member of package org.apache.spark import org.apache.spark.internal.Logging scala> import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset :29: error: object

Does Spark Streaming support event time window now?

2016-04-21 Thread Yifei Li
I read from the following article: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html which says that Spark Streaming has a future direction for "Event time and out-of-order". I am wondering if it is supported now. In my scenario, I have three streams,

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Alexander Gallego
Thanks Ted. KafkaWordCount (producer) does not operate on a DStream[T] ```scala object KafkaWordCountProducer { def main(args: Array[String]) { if (args.length < 4) { System.err.println("Usage: KafkaWordCountProducer " + " ") System.exit(1) } val

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
Thanks jar tvf spark-core_2.10-1.5.1-tests.jar | grep SparkFunSuite 1787 Wed Sep 23 23:34:26 BST 2015 org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class 1780 Wed Sep 23 23:34:26 BST 2015 org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class 3982 Wed Sep 23 23:34:26 BST 2015

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Please replace version number for the release you are using : spark-core_2.10-1.5.1-tests.jar On Thu, Apr 21, 2016 at 10:18 AM, Mich Talebzadeh wrote: > I don't seem to be able to locate spark-core_2.11-2.0.0-SNAPSHOT-tests.jar > file :( > > Dr Mich Talebzadeh > > >

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
I don't seem to be able to locate spark-core_2.11-2.0.0-SNAPSHOT-tests.jar file :( Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Ted Yu
In KafkaWordCount, the String is sent back and producer.send() is called. I guess if you don't find a viable solution in your current design, you can consider the above. On Thu, Apr 21, 2016 at 10:04 AM, Alexander Gallego wrote: > Hello, > > I understand that you cannot

Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Alexander Gallego
Hello, I understand that you cannot serialize Kafka Producer. So I've tried: (as suggested here https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html ) - Make the class Serializable - not possible - Declare the instance only within the lambda
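A common workaround (a sketch, not the poster's code): never ship the producer at all, and instead construct it inside foreachPartition so it is created on the executor; broker and topic names below are placeholders:

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

dstream.foreachRDD { rdd =>              // dstream: a DStream[String], for illustration
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    records.foreach(r => producer.send(new KeyedMessage[String, String]("my-topic", r)))
    producer.close()
  }
}
```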

Re: Spark support for Complex Event Processing (CEP)

2016-04-21 Thread Mich Talebzadeh
Hi Mario, I sorted that one out with Ted's help thanks scalatest_2.11-2.2.6.jar Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Spark support for Complex Event Processing (CEP)

2016-04-21 Thread Mario Ds Briggs
Googling 'java error is not a member of package' and then even its related searches seemed to suggest it is not a missing jar problem, though I couldn't put a finger on what exactly it is in your case; some specifically in spark-shell as well -

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
It is in core-XX-tests jar: $ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT-tests.jar | grep SparkFunSuite 1830 Thu Apr 21 08:19:14 PDT 2016 org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class 1823 Thu Apr 21 08:19:14 PDT 2016

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
like war of attrition :) now I get with sbt object SparkFunSuite is not a member of package org.apache.spark Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

EMR Spark Custom Metrics

2016-04-21 Thread Mark Kelly
Hi, so I would like some custom metrics. The environment we use is AWS EMR 4.5.0 with Spark 1.6.1 and Ganglia. The code snippet below shows how we register custom metrics (this worked in EMR 4.2.0 with Spark 1.5.2) package org.apache.spark.metrics.source import com.codahale.metrics._ import
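For readers following along, a hedged sketch of what such a custom source looks like (the package placement mirrors the post, since the Source trait is package-private to Spark; all names are illustrative):

```scala
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

class MyCustomSource extends Source {
  override val sourceName: String = "my.custom.source"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // Example metric; increment it from job code holding a reference to this source.
  val requestCounter: Counter = metricRegistry.counter(MetricRegistry.name("requests"))
}

// Registration is typically done from code in the same org.apache.spark package,
// e.g. SparkEnv.get.metricsSystem.registerSource(new MyCustomSource)
```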

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Have you tried the following ? libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" On Thu, Apr 21, 2016 at 9:19 AM, Mich Talebzadeh wrote: > Unfortunately this sbt dependency is not working > > libraryDependencies += "org.apache.spark" %% "spark-core" %

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
Unfortunately this sbt dependency is not working libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1" % "provided" libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" % "provided"

Re: Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
Hi Ted, Can this module be used with an older version of HBase, such as 1.0 or 1.1? Where can I get the module from? Thanks, Ben > On Apr 21, 2016, at 6:56 AM, Ted Yu wrote: > > The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do > this. > >

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
Thanks Ted. It was a typo in my alias and it is sorted now slong='rlwrap spark-shell --master spark://50.140.197.217:7077 --jars

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
I tried on refreshed copy of master branch: $ bin/spark-shell --jars /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar ... scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll} import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll} BTW I noticed

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Mich: $ jar tvf /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar | grep BeforeAndAfter 4257 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter$class.class 2602 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter.class 1998 Sat Dec 26

Issue with Spark shell and scalatest

2016-04-21 Thread Mich Talebzadeh
to Mario, Alonso, Luciano, user Hi, Following the example in https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532 Does anyone know which jar file this belongs to? I use scalatest_2.11-2.2.6.jar in my

Re: spark on yarn

2016-04-21 Thread Steve Loughran
If there isn't enough space in your cluster for all the executors you asked for to be created, Spark will only get the ones which can be allocated. It will start work without waiting for the others to arrive. Make sure you ask for enough memory: YARN is a lot more unforgiving about memory use

Spark 1.6.1 already maximum pages

2016-04-21 Thread nihed mbarek
Hi, I just got an issue with my execution on Spark 1.6.1. I'm trying to join two dataframes, one with 5 partitions and a second small one with 2 partitions. Spark SQL shuffle partitions is set to 256000. Any idea?? java.lang.IllegalStateException: Have already allocated a maximum of 8192 pages
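A hedged first step that usually helps here: 256000 shuffle partitions is far beyond what a join of this size needs and inflates the page count, so lower spark.sql.shuffle.partitions before the join (the join key below is hypothetical):

```scala
sqlContext.setConf("spark.sql.shuffle.partitions", "200")   // Spark's default value
val joined = bigDF.join(smallDF, Seq("id"))
```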

Re: Spark support for Complex Event Processing (CEP)

2016-04-21 Thread Mich Talebzadeh
Hi, Following the example in https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532 Does anyone know which jar file this belongs to? I use scalatest_2.11-2.2.6.jar in my spark-shell spark-shell --master

Re: Pls assist: which conf file do I need to modify if I want spark-shell to include external packages?

2016-04-21 Thread Mich Talebzadeh
Try this using the shell parameter SPARK_CLASSPATH in $SPARK_HOME/conf: cp spark-env.sh.template spark-env.sh Then edit that file and set export SPARK_CLASSPATH= Connect to spark-shell and see if it finds it HTH Dr Mich Talebzadeh LinkedIn *

Re: How to change akka.remote.startup-timeout in spark

2016-04-21 Thread Todd Nist
I believe you can adjust it by setting the following: spark.akka.timeout 100s Communication timeout between Spark nodes. HTH. -Todd On Thu, Apr 21, 2016 at 9:49 AM, yuemeng (A) wrote: > When I run a spark application,sometimes I get follow ERROR: > > 16/04/21 09:26:45

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I tried increasing the batch size (1,000 to 10,000 to 100,000) but it didn't appear to make any appreciable difference in my test case. In addition, I had read in the Oracle JDBC documentation that batches should be set between 10 and 100 and anything out of that range was not advisable. However, I
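For context, a hedged sketch of the kind of write being tuned in this thread (Spark 1.6 DataFrameWriter): coalesce bounds the number of concurrent JDBC connections and the batchsize connection property sets rows per batch; URL, table, and credentials are placeholders:

```scala
val props = new java.util.Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "secret")
props.setProperty("batchsize", "1000")

df.coalesce(8)               // fewer partitions => fewer concurrent connections
  .write
  .mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "TARGET_TABLE", props)
```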

Re: Pls assist: which conf file do I need to modify if I want spark-shell to include external packages?

2016-04-21 Thread Marco Mistroni
Thanks Mich, but I seem to remember modifying a config file so that I don't need to specify the --packages option every time I start the shell. Kr On 21 Apr 2016 3:20 pm, "Mich Talebzadeh" wrote: > on spark-shell this will work > > $SPARK_HOME/bin/spark-shell *--packages

Re: Pls assist: which conf file do I need to modify if I want spark-shell to include external packages?

2016-04-21 Thread Mich Talebzadeh
on spark-shell this will work $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Pls assist: which conf file do I need to modify if I want spark-shell to include external packages?

2016-04-21 Thread Marco Mistroni
Hi all, I need to use spark-csv in my Spark instance, and I want to avoid launching spark-shell by passing the package name every time. I seem to remember that I need to amend a file in the /conf directory to include e.g. spark.packages com.databricks:spark-csv_2.11:1.4.0 but I cannot find
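For what it's worth, the documented key in Spark 1.x is spark.jars.packages rather than spark.packages, set in conf/spark-defaults.conf; a minimal sketch:

```
# conf/spark-defaults.conf
spark.jars.packages   com.databricks:spark-csv_2.11:1.4.0
```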

Re: Spark SQL Transaction

2016-04-21 Thread Mich Talebzadeh
This statement ."..each database statement is atomic and is itself a transaction.. your statements should be atomic and there will be no ‘redo’ or ‘commit’ or ‘rollback’." MSSQL compiles with ACIDITY which requires that each transaction be "all or nothing": if one part of the transaction fails,

How to change akka.remote.startup-timeout in spark

2016-04-21 Thread yuemeng (A)
When I run a Spark application, sometimes I get the following ERROR: 16/04/21 09:26:45 ERROR SparkContext: Error initializing SparkContext. java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at

Re: Save DataFrame to HBase

2016-04-21 Thread Ted Yu
The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do this. On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim wrote: > Has anyone found an easy way to save a DataFrame into HBase? > > Thanks, > Ben > > >

Re: RDD generated from Dataframes

2016-04-21 Thread Sean Owen
I don't think that's generally true, but it is true to the extent that you can push down the work of higher-level logical operators like select and groupBy, on common types, that can be understood and optimized. Your arbitrary user code is opaque and can't be optimized. So DataFrame.groupBy.max is
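A small illustration of the point above (the column names are hypothetical): the first form is expressed in operators Catalyst can see and optimize, while the second is an opaque Scala closure Spark can only execute as written:

```scala
// Visible to the optimizer: a logical plan built from relational operators.
val viaOptimizer = df.groupBy("key").max("value")

// Opaque to the optimizer: arbitrary user code over Rows.
val viaClosure = df.rdd
  .map(row => (row.getAs[String]("key"), row.getAs[Long]("value")))
  .reduceByKey((a, b) => math.max(a, b))
```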

Re: RDD generated from Dataframes

2016-04-21 Thread Ted Yu
In upcoming 2.0 release, the signature for map() has become: def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan { Note: DataFrame and DataSet are unified in 2.0 FYI On Thu, Apr 21, 2016 at 6:49 AM, Apurva Nandan wrote: > Hello everyone, > > Generally

Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
Has anyone found an easy way to save a DataFrame into HBase? Thanks, Ben

how to change akka.remote.startup-timeout value in spark

2016-04-21 Thread yuemeng (A)
When I run a Spark application, sometimes I get the following error: 16/04/21 09:26:45 ERROR SparkContext: Error initializing SparkContext. java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at

RDD generated from Dataframes

2016-04-21 Thread Apurva Nandan
Hello everyone, Generally speaking, I guess it's well known that dataframes are much faster than RDDs when it comes to performance. My question is how you go about it when it comes to transforming a dataframe using map. I mean the dataframe then gets converted into an RDD, hence now do you again
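A hedged sketch of the round trip being asked about (Spark 1.x, using the spark-shell sqlContext; column names are illustrative): map on a DataFrame drops down to an RDD, and toDF brings the result back:

```scala
import sqlContext.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("name", "age")

// map over Rows -> a plain RDD of tuples (no Catalyst optimization from here on)
val bumped = df.map(row => (row.getAs[String]("name"), row.getAs[Int]("age") + 1))

// back to a DataFrame for further relational-style work
val df2 = bumped.toDF("name", "age")
```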

Re: StructField Translation Error with Spark SQL

2016-04-21 Thread Ted Yu
You meant for fields which are nullable. Can you pastebin the complete stack trace ? Try 1.6.1 when you have a chance. Thanks On Wed, Apr 20, 2016 at 10:20 PM, Charles Nnamdi Akalugwu < cprenzb...@gmail.com> wrote: > I get the same error for fields which are not null unfortunately. > > Can't

Re: Spark SQL Transaction

2016-04-21 Thread Michael Segel
Hi, Sometimes terms get muddled over time. If you’re not using transactions, then each database statement is atomic and is itself a transaction. So unless you have some explicit ‘Begin Work’ at the start…. your statements should be atomic and there will be no ‘redo’ or ‘commit’ or

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Michael Segel
How many partitions are in your data set? Per the Spark DataFrameWriter Java Doc: "Saves the content of the DataFrame to an external database table via JDBC. In the case the table already exists in the external

Word2VecModel limitted to spark.akka.frameSize ?!

2016-04-21 Thread Stefan Falk
Hello! I am experiencing an issue [1] with Word2VecModel#save. It appears to exceed spark.akka.frameSize (see stack trace [3]). Setting the frameSize is not really an option because that would just limit me to 2GB so I wonder if there is anything I can do to make this work even if the model

Re: How to know whether I'm in the first batch of spark streaming

2016-04-21 Thread Praveen Devarao
Thanks Yu for sharing the use case. >>If our system have some problem, such as hdfs issue, and the "first batch" and "second batch" were both queued. When the issue gone, these two batch will start together. Then, will onBatchStarted be called concurrently for these two batches?<< Not

Re: Impala can't read partitioned Parquet files saved from DF.partitionBy

2016-04-21 Thread Petr Novak
I have to ask my colleague if there is any specific error but I think it just doesn't see files. Petr On Thu, Apr 21, 2016 at 11:54 AM, Petr Novak wrote: > Hello, > Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved > from DF.partitionBy (using

Impala can't read partitioned Parquet files saved from DF.partitionBy

2016-04-21 Thread Petr Novak
Hello, Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved from DF.partitionBy (using Python). Is there any known reason, some config? Or it should generally work hence it is likely to be something wrong solely on our side? Many thanks, Petr

Long(20+ seconds) startup delay for jobs when running Spark on YARN

2016-04-21 Thread Akmal Abbasov
Hi, I'm running Spark(1.6.1) on YARN(2.5.1), cluster mode. It's taking 20+ seconds for application to move from ACCEPTED to RUNNING state, here's logs 16/04/21 09:06:56 INFO impl.YarnClientImpl: Submitted application application_1461229289298_0001 16/04/21 09:06:57 INFO yarn.Client: Application

Re: How to know whether I'm in the first batch of spark streaming

2016-04-21 Thread Yu Xie
Thank you Praveen in our spark streaming, we write down the data to a HDFS directory, and use the MMDDHHHmm00 format of batch time as the directory name. So, when we stop the streaming and start the streaming again (we do not use checkpoint), in the init of the first batch, we will write

Re: How to know whether I'm in the first batch of spark streaming

2016-04-21 Thread Praveen Devarao
Hi Yu, Could you provide more details on what and how you are trying to initialize. Are you having this initialization as part of the code block in the action of the DStream? Say, if the second batch finishes before the first batch, wouldn't your results be affected, as init would have not

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Mich Talebzadeh
What is the end database? Have you checked the performance of your query at the target? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Choosing an Algorithm in Spark MLib

2016-04-21 Thread Prashant Sharma
As far as I can understand, your requirements are pretty straightforward and doable with just simple SQL queries. Take a look at Spark SQL in the Spark documentation. Prashant Sharma On Tue, Apr 12, 2016 at 8:13 PM, Joe San wrote: >