Re: slow SQL query with cached dataset

2016-04-25 Thread Jörn Franke
I do not know your data, but it looks like you have too many partitions for such a small data set. > On 26 Apr 2016, at 00:47, Imran Akbar wrote: > > Hi, > > I'm running a simple query like this through Spark SQL: > > sqlContext.sql("SELECT MIN(age) FROM data WHERE
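A minimal sketch (not from the thread, and assuming a spark-shell style sqlContext) of compacting a small cached table before querying it; the table and column names echo the thread, and the partition count of 8 is a placeholder:

    // Compact a small dataset into a few partitions before caching it, so the
    // MIN(age) query is not dominated by per-task scheduling overhead.
    val compacted = sqlContext.table("data").coalesce(8)
    compacted.registerTempTable("data_compact")
    sqlContext.cacheTable("data_compact")
    sqlContext.sql("SELECT MIN(age) FROM data_compact WHERE dt_year = 2016").show()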

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Sumedh Wale
On Monday 25 April 2016 11:28 PM, Weiping Qu wrote: Dear Ted, You are right. ReduceByKey is a transformation. My fault. I would rephrase my question using the following code snippet. object ScalaApp {   def main(args:

Re: Can't join same dataframe twice?

2016-04-25 Thread Ted Yu
Can you show us the structure of df2 and df3 ? Thanks On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot wrote: > Hi, > I am using Spark 1.5.2 . > I have a use case where I need to join the same dataframe twice on two > different columns. > I am getting error missing

Can't join same dataframe twice?

2016-04-25 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. I have a use case where I need to join the same dataframe twice on two different columns. I am getting an error about missing columns. For instance, val df1 = df2.join(df3,"Column1") The line below throws the missing-columns error: val df4 = df1.join(df3,"Column2") Is this a bug or valid
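For reference, a sketch (not the poster's code, assuming a spark-shell sqlContext; schemas are made up) of one way this double join is commonly written in Spark 1.5/1.6: give the re-used dataframe distinct aliases so the second join can resolve its columns.

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._

    val df2 = Seq((1, 10, "x")).toDF("Column1", "Column2", "payload")
    val df3 = Seq((1, 10, "y")).toDF("Column1", "Column2", "other")

    val left  = df3.as("l")                 // two aliases for the same dataframe
    val right = df3.as("r")
    val joined = df2.join(left,  df2("Column1") === col("l.Column1"))
                    .join(right, df2("Column2") === col("r.Column2"))
    joined.show()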

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Praveen Devarao
Cool!! Thanks for the clarification Mike. Thanking You - Praveen Devarao Spark Technology Centre IBM India Software Labs - "Courage

Issues with Long Running Streaming Application

2016-04-25 Thread Bryan Jeffrey
Hello. I have a long running streaming application. It is consuming a large amount of data from Kafka (on the order of 25K messages/second, in two-minute batches). The job reads the data, makes some decisions on what to save, and writes the selected data into Cassandra. The job is very stable -

Re: XML Data Source for Spark

2016-04-25 Thread Hyukjin Kwon
Hi Janan, Sorry, I was sleeping. I guess you sent an email to me first and then asked on the mailing list because I was not answering. I just tested this to double-check and could reproduce the same exception below: java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-25 Thread Mich Talebzadeh
thanks I sorted this out. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 25 April 2016 at 15:20,

Re: slow SQL query with cached dataset

2016-04-25 Thread Mich Talebzadeh
Are you sure it is not spilling to disk? How many rows are cached in your result set -> sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)") HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Hive Metastore issues with Spark 1.6.1

2016-04-25 Thread Mich Talebzadeh
Hi, How is it doing that, running the script against your metastore? HiveContext hooks into native Hive code, which allows Spark to use the Hive SQL dialect. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: XML Data Source for Spark

2016-04-25 Thread Michael Armbrust
You are using a version of the library that was compiled for a different version of Scala than the version of Spark that you are using. Make sure that they match up. On Mon, Apr 25, 2016 at 5:19 PM, Mohamed ismail wrote: > here is an example with code. >
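As an illustration only (coordinates and versions below are assumptions, not taken from the thread), the sbt %% operator keeps the Scala suffix of the spark-xml artifact in step with the Scala version of the Spark build:

    // build.sbt sketch: %% appends the Scala binary version (_2.10 or _2.11),
    // so spark-xml is resolved for the same Scala version as Spark itself.
    scalaVersion := "2.10.5"   // must match the Scala version of the Spark 1.6.1 build in use

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
      "com.databricks"   %% "spark-xml" % "0.3.3"   // version is illustrative
    )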

Re: XML Data Source for Spark

2016-04-25 Thread Mohamed ismail
Here is an example with code: http://stackoverflow.com/questions/33078221/xml-processing-in-spark I haven't tried it. On Monday, April 25, 2016 1:06 PM, Jinan Alhajjaj wrote: Hi All, I am trying to use the XML data source that is used for parsing and querying XML

Hive Metastore issues with Spark 1.6.1

2016-04-25 Thread Pradeepkumar Konda
I have installed Spark 1.6.1 and am trying to connect to a Hive metastore version 0.14.0. This was working fine on Spark 1.4.1. I am pointing to the same metastore from 1.6.1 and then getting connectivity issues. I read over some online threads and added the 2 lines below to the default spark conf xml

XML Data Source for Spark

2016-04-25 Thread Jinan Alhajjaj
Hi All, I am trying to use the XML data source that is used for parsing and querying XML data with Apache Spark, for Spark SQL and data frames. I am using Apache Spark version 1.6.1 and I am using Java as the programming language. I wrote this sample code: SparkConf conf = new
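The full program is not in the archive; as a rough sketch of how spark-xml is typically driven (shown in Scala rather than the poster's Java, with a made-up file name and rowTag, and assuming a spark-shell sqlContext):

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")          // element treated as one row (assumption)
      .load("books.xml")
    df.printSchema()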

JoinWithCassandraTable over individual queries

2016-04-25 Thread vaibhavrtk
Hi, I have an RDD with elements as tuple ((key1,key2),value) where (key1,key2) is the partitioning key in my Cassandra table. Now for each such element I have to do a read from the Cassandra table. My Cassandra table and Spark cluster are on different nodes and can't be co-located. Right now I am doing
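A sketch of the joinWithCassandraTable pattern being asked about, assuming the DataStax spark-cassandra-connector is on the classpath and a spark-shell sc; keyspace, table and key types are placeholders:

    import com.datastax.spark.connector._

    // stands in for the poster's RDD[((String, String), Long)]
    val rdd = sc.parallelize(Seq((("k1", "k2"), 1L)))

    val joined = rdd
      .map { case ((key1, key2), _) => (key1, key2) }      // keep only the lookup keys
      .joinWithCassandraTable("my_keyspace", "my_table")   // batched lookups per Spark partition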

Re: Call Spark package API from R

2016-04-25 Thread Jörn Franke
You can call any Java/scala library from R using the package rJava > On 25 Apr 2016, at 19:16, ankur.jain wrote: > > Hello Team, > > Is there any way to call spark code (scala/python) from R? > I want to use Cloudera spark-ts api with SparkR, if anyone had used that >

DataFrame group and agg

2016-04-25 Thread Andrés Ivaldi
Hello, Anyone know if this is on purpose or if it's a bug? In the https://github.com/apache/spark/blob/2f1d0320c97f064556fa1cf98d4e30d2ab2fe661/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala class, the def agg has many implementations; the next two of them: Line 136: def
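For context, a sketch (assuming a spark-shell sqlContext; dataframe and column names are made up) of the two agg styles those overloads back:

    import org.apache.spark.sql.functions.{avg, max}
    import sqlContext.implicits._

    val df = Seq(("eng", 50000, 30), ("ops", 60000, 40)).toDF("department", "salary", "age")
    val grouped = df.groupBy("department")
    grouped.agg(Map("salary" -> "max", "age" -> "avg")).show()   // Map[String, String] overload
    grouped.agg(max("salary"), avg("age")).show()                // Column-expression overload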

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Dear Ted, You are right. ReduceByKey is a transformation. My fault. I would rephrase my question using the following code snippet. object ScalaApp { def main(args: Array[String]): Unit ={ val conf = new SparkConf().setAppName("ScalaApp").setMaster("local") val sc = new SparkContext(conf)
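For reference, a minimal self-contained sketch (not the poster's application) of the documented behaviour: reduceByKey only builds the lineage, and the action at the end is what launches a job.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LazyCheck").setMaster("local"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val reduced = pairs.reduceByKey(_ + _)    // transformation: nothing runs yet
        println(reduced.collect().toSeq)          // action: this triggers the job
        sc.stop()
      }
    }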

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Mich Talebzadeh
BTW, is this documented? It seems to be a potential issue. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Mich Talebzadeh
cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 25 April 2016 at 18:35, Michael Armbrust

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Ashok Kumar
Thanks Michael; as I gathered, for now it is a feature. On Monday, 25 April 2016, 18:36, Michael Armbrust wrote: When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method.  Spark doesn't have access to this scope, so

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Michael Armbrust
When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method. Spark doesn't have access to this scope, so this makes it hard (impossible?) for us to construct new instances of that class. So, define your classes that you plan to use with Spark at the
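A sketch of the layout being recommended, with the case class at the top level instead of inside main (names and fields are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Accounts(AccountName: String, Balance: Double)   // defined outside main

    object MyApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        sc.parallelize(Seq(Accounts("a", 1.0), Accounts("b", 2.0))).toDF().show()
        sc.stop()
      }
    }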

Re: Dataset aggregateByKey equivalent

2016-04-25 Thread Lee Becker
On Sat, Apr 23, 2016 at 8:56 AM, Michael Armbrust wrote: > Have you looked at aggregators? > > > https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html > Thanks for the pointer to aggregators. I wasn't yet aware of them. However,

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so has never needed to eagerly calculate the range boundaries (since Spark 1.0). On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao wrote: > Thanks Reynold for the reason as to why sortBykey invokes a Job > >
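A small sketch of the RDD-side behaviour being contrasted here, assuming a spark-shell sc: sortByKey alone shows up as a job in the UI because the RangePartitioner samples the data, even though no action has been called yet.

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    val sorted = pairs.sortByKey()   // a sampling job runs here to compute range boundaries
    // no collect()/count() yet: the sort itself has not executed, only the boundary sampling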

Call Spark package API from R

2016-04-25 Thread ankur.jain
Hello Team, Is there any way to call Spark code (Scala/Python) from R? I want to use the Cloudera spark-ts API with SparkR; if anyone has used that please let me know. Thanks, Ankur

issues with spark metrics

2016-04-25 Thread Gábor Fehér
Hi All, I am trying to set up monitoring to better understand the performance bottlenecks of my Spark application. I have some questions: 1. BlockManager.disk.diskSpaceUsed_MB is always zero when I go to http://localhost:4040/metrics/json/, even though I know that the BlockManager is using a lot of

Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-25 Thread Mich Talebzadeh
Hi, This JDBC connection was working fine in Spark 1.5.2 val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val sqlContext = new HiveContext(sc) println ("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ").collect.foreach(println) //
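Not the resolution of this report, but for reference a sketch of one common workaround: name the driver class explicitly in the JDBC options and make sure the ojdbc jar is on both the driver and executor classpaths (the URL, table and credentials below are placeholders).

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:oracle:thin:@//dbhost:1521/service",   // placeholder connection URL
      "dbtable"  -> "scott.emp",                                  // placeholder table
      "user"     -> "scott",
      "password" -> "tiger",
      "driver"   -> "oracle.jdbc.OracleDriver"                    // register the driver class explicitly
    )).load()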

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Ted Yu
Can you show a snippet of your code which demonstrates what you observed? Thanks On Mon, Apr 25, 2016 at 8:38 AM, Weiping Qu wrote: > Thanks. > I read that from the specification. > I thought the way people distinguish actions and transformations depends > on whether

unsubscribe

2016-04-25 Thread Kartik Veerepalli
Unsubscribe Kartik Veerepalli Software Developer 560 Herndon Parkway, Suite 240 Herndon, VA 20170 (w) 703-437-0100 (f) 703-940-6001 www.syntasa.com

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Thanks. I read that from the specification. I thought the way people distinguish actions and transformations depends on whether they are lazily executed or not. As far as I saw from my code, the reduceByKey will be executed without any operations from the Action category. Please correct me if I

GroubByKey Iterable[T] - Memory usage

2016-04-25 Thread Nirav Patel
Hi, Is the Iterable coming out of GroupByKey loaded fully into the memory of the reducer task, or can it also be on disk? Also, is there a way to evict it from memory once the reducer is done iterating it and wants to use the memory for something else. Thanks
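Where the per-key work can be expressed as a fold, the usual way to sidestep the question is aggregateByKey, which merges values as they arrive instead of materializing the whole Iterable; a minimal sketch assuming a spark-shell sc:

    val pairs = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "z")))
    val counts = pairs.aggregateByKey(0L)(
      (acc, _) => acc + 1L,   // fold one value into the per-partition accumulator
      (a, b)   => a + b       // merge accumulators across partitions
    )
    counts.collect()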

Re: Spark 2.0 forthcoming features

2016-04-25 Thread Sourav Mazumder
Thanks a lot Michael and Jules. Regards, Sourav On Thu, Apr 21, 2016 at 3:08 PM, Jules Damji wrote: > Thanks Michael, we're doing a Spark 2.0 webinar. Register and if you can't > make it; you can always watch the recording. > > Cheers > Jules > > Sent from my iPhone >

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-25 Thread nguyen duc tuan
Maybe the problem is the data itself. For example, the first dataframe might have common keys in only one part of the second dataframe. I think you can verify whether you are in this situation by repartitioning one dataframe and joining it. If this is the true reason, you might see the result distributed more
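A sketch of that check in Spark 1.5.2 terms, assuming a spark-shell sqlContext; the schemas, the partition count of 200 and the output path are placeholders:

    import sqlContext.implicits._

    val df1 = Seq((1, "a"), (1, "b"), (2, "c")).toDF("joinKey", "v1")
    val df2 = Seq((1, "x"), (2, "y")).toDF("joinKey", "v2")

    val joined = df1.repartition(200).join(df2, "joinKey")   // spread one side before joining
    joined.write.parquet("/tmp/join_check")                  // then compare the sizes of the part files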

reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Hi, I'd just like to verify whether reduceByKey is a transformation or an action. As written in the RDD paper, a Spark flow will only be triggered once an action is reached. I tried it and saw that my flow was executed once there was a reduceByKey, even though it is categorized as a transformation

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-25 Thread Cody Koeninger
Show the full relevant code including imports. On Fri, Apr 22, 2016 at 4:46 PM, Mich Talebzadeh wrote: > Hi Cody, > > This is my first attempt on using offset ranges (this may not mean much in > my context at the moment) > > val ssc = new StreamingContext(conf,

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-25 Thread Cody Koeninger
I would suggest reading the documentation first. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.OffsetRange$ The OffsetRange class is not private. The instance constructor is private. You obtain instances by using the apply method on the companion
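In other words, instances come from the companion object rather than from new; a minimal sketch (topic, partition and offsets are placeholders):

    import org.apache.spark.streaming.kafka.OffsetRange

    val range = OffsetRange("mytopic", 0, 100L, 200L)   // topic, partition, fromOffset, untilOffset
    // equivalently: OffsetRange.create("mytopic", 0, 100L, 200L)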

Re: next on empty iterator though i used hasNext

2016-04-25 Thread Ted Yu
Can you show more of your code inside the while loop ? Which version of Spark / Kinesis do you use ? Thanks On Mon, Apr 25, 2016 at 4:04 AM, Selvam Raman wrote: > I am reading a data from Kinesis stream (merging shard values with union > stream) to spark streaming. then

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
Sorry, I didn't notice that you are using PySpark, so ignore my reply, as I only use the Scala version. Yong From: java8...@hotmail.com To: webe...@aim.com; user@spark.apache.org Subject: RE: Java exception when showing join Date: Mon, 25 Apr 2016 09:41:18 -0400 dispute_df.join(comments_df,

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
dispute_df.join(comments_df, $"dispute_df.COMMENTID" === $"comments_df.COMMENTID").first() If you are using the DataFrame API, some of it is tricky for first-time users; my suggestion is to always refer to the unit tests. That is in fact the way I tried to find out how to do it for lots of

Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Mich Talebzadeh
Hi, I notice that, building with sbt, if I define my case class *outside of the main method* like below, it works: case class Accounts( TransactionDate: String, TransactionType: String, Description: String, Value: Double, Balance: Double, AccountName: String, AccountNumber : String) object

New executor pops up on every node when any executor dies

2016-04-25 Thread Vaclav Vymazal
Hi, I have a Spark streaming application that is pumping data from Kafka into HDFS and Elasticsearch. The application is running on a Spark Standalone cluster (in client mode). Whenever one of the executors fails (or is killed), new executor for the application is spawned on every node in the

Re: Java exception when showing join

2016-04-25 Thread Brent S. Elmer Ph.D.
I get an invalid syntax error when I do that. On Fri, 2016-04-22 at 20:06 -0400, Yong Zhang wrote: > use "dispute_df.join(comments_df, dispute_df.COMMENTID === > comments_df.COMMENTID).first()" instead. > > Yong > > Date: Fri, 22 Apr 2016 17:42:26 -0400 > From: webe...@aim.com > To:

next on empty iterator though i used hasNext

2016-04-25 Thread Selvam Raman
I am reading data from a Kinesis stream (merging shard values with a union stream) into Spark Streaming, then using the following code to push the data to the DB. splitCSV.foreachRDD(new VoidFunction2,Time>() { private static final long serialVersionUID = 1L; public void

GraphFrames and IPython notebook issue - No module named graphframes

2016-04-25 Thread Camelia Elena Ciolac
Hello, I work locally on my laptop, not using DataBricks Community edition. I downloaded graphframes-0.1.0-spark1.6.jar from http://spark-packages.org/package/graphframes/graphframes and placed it in a folder named spark_extra_jars where I have other jars too. After executing in a

Re: Spark Streaming, Broadcast variables, java.lang.ClassCastException

2016-04-25 Thread mwol
I forgot the streamingContext.start() and streamingContext.awaitTermination() in my example code, but the error stays the same...

Spark Streaming, Broadcast variables, java.lang.ClassCastException

2016-04-25 Thread mwol
Hi, I am trying to read data from a static text file stored in HDFS and store its content in an ArrayBuffer, which in turn should be broadcast via sparkContext.broadcast as a broadcast variable. I am using Cloudera's Spark, version 1.6.0-cdh5.7.0, and spark-streaming_2.10. I start the application on yarn
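The pattern described reads roughly like the sketch below (the path is a placeholder, sc is a spark-shell style SparkContext; the ClassCastException itself is not reproduced here):

    import scala.collection.mutable.ArrayBuffer

    val lines = sc.textFile("hdfs:///path/to/static.txt").collect()   // small file, collected on the driver
    val broadcastVar = sc.broadcast(ArrayBuffer(lines: _*))
    // executors then read broadcastVar.value instead of re-reading the file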

Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-25 Thread أنس الليثي
I am using the latest Spark version 1.6. I have increased the maximum number of open files using this command: *sysctl -w fs.file-max=3275782* I also increased the limit for the user who runs the Spark job by updating the /etc/security/limits.conf file. The soft limit is 1024 and the hard limit is 65536.

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Praveen Devarao
Thanks Reynold for the reason as to why sortByKey invokes a Job. When you say "DataFrame/Dataset does not have this issue", is it right to assume you are referring to Spark 2.0, or does the Spark 1.6 DataFrame already have this built in? Thanking You

Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Praveen Devarao
Hi, I have a streaming program with the block as below [ref: https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala ] 1 val lines = messages.map(_._2) 2 val hashTags = lines.flatMap(status => status.split(" "