Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Thanks. So: 1) For joins (stream-batch), are all join types supported, i.e. inner, left outer, etc., or only specific ones? Also, what is the timeline for complete support, meaning stream-stream joins? 2) So outputMode is now exposed via DataFrameWriter, but it will work only in specific cases as you

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
I accidentally deleted the original post, so I am just pasting the response from Tathagata Das: Join is supported, but only stream-batch joins. Output modes were added late last week; currently append mode is supported for non-aggregation queries and complete mode for aggregation
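
A minimal sketch of what that looks like, assuming the Structured Streaming API as it shipped in Spark 2.0 (paths, schema, and names below are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("stream-batch-join").getOrCreate()

    // Static (batch) side of the join: a hypothetical lookup table.
    val lookup = spark.read.parquet("/data/lookup")

    // Streaming side: file streams require an explicit schema.
    val schema = new StructType().add("key", StringType).add("value", StringType)
    val events = spark.readStream.schema(schema).json("/data/incoming")

    // Stream-batch join; stream-stream joins were not yet supported.
    val joined = events.join(lookup, "key")

    // Append mode, valid here because the query has no aggregation.
    val query = joined.writeStream
      .outputMode("append")
      .format("parquet")
      .option("checkpointLocation", "/tmp/chk")
      .start("/data/out")
    query.awaitTermination()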

Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Hi, I am Ravi, computer scientist @ Adobe Systems. We have been actively using Spark for our internal projects. Recently we had a need for ETL on streaming data, so we were exploring Spark 2.0 for that. But as I could see, streaming DataFrames do not support basic operations like joins,

Performance of Spark/MapReduce

2016-06-05 Thread Deepak Goel
Hey Namaskara~Nalama~Guten Tag~Bonjour Sorry about that (the question might still be general, as I am new to Spark). My question is: Spark claims to be 10x faster on disk and 100x faster in memory compared to MapReduce. Is there any benchmark paper for this which sketches

RE: GraphX Java API

2016-06-05 Thread Santoshakhilesh
OK, thanks for letting me know. Yes, since Java and Scala programs ultimately run on the JVM, APIs written in one language can be called from the other. When I used GraphX (around the beginning of 2015), native Java APIs were not available for GraphX, so I chose to develop my application in

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Indeed! I wasn't able to get this to work in cluster mode yet, but increasing driver and executor stack sizes in client mode (still running on a YARN EMR cluster) got it to work! I'll fiddle more. FWIW, I used spark-submit --deploy-mode client --conf

Specify node where driver should run

2016-06-05 Thread Saiph Kappa
Hi, in yarn-cluster mode, is there any way to specify which node I want the driver to run on? Thanks.
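
There is no direct setting for pinning the driver to a particular host, but if the cluster runs YARN 2.6+ with node labels configured, the application master's node label expression is the closest lever; in yarn-cluster mode the driver runs inside the AM container. A hedged sketch (the label name is hypothetical, and this assumes node labels are already set up on the YARN side):

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.am.nodeLabelExpression=driver_nodes \
      --class com.example.MyApp myapp.jar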

Performance of Spark/MapReduce

2016-06-05 Thread Deepak Goel
Hello, sorry, I am new to Spark. Spark claims it can do all that MapReduce can do (and more!) but 10X faster on disk and 100X faster in memory. Why, then, would I use MapReduce at all? Thanks, Deepak

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Eugene Morozov
Marco, I'd say yes, because it uses an implementation of Hadoop's InputFormat interface underneath. What kind of proof would you like to see? -- Be well! Jean Morozov On Sun, Jun 5, 2016 at 12:50 PM, Marco Capuccini < marco.capucc...@farmbio.uu.se> wrote: > Dear all, > > Does Spark uses
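
One way to check this empirically: an RDD read from HDFS exposes its preferred hosts, which come from the block locations reported by the InputFormat's splits, independent of the cluster manager. A small sketch (the path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))

    // textFile builds a HadoopRDD; its preferred locations come straight
    // from the HDFS block locations in the InputFormat's splits.
    val rdd = sc.textFile("hdfs:///data/big.txt")

    // Print the datanode hostnames each partition prefers.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }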

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Thank you. I added this as a dependency: libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "1.0.0" That number at the end I chose arbitrarily; is that correct? Also, in my TwitterAnalyzer.scala I added this line: import com.databricks.apps.twitter_classifier._ Now I am getting this

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Eugene Morozov
Everett, try to increase the thread stack size. To do that, run your application with the following options (my app is a web application, so you might adjust something): -XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920" The number 81920 is memory in KB. You could
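
For a spark-submit launch, the equivalent sketch would be the following (class and jar names are placeholders; the driver-side flag must be passed at launch because the JVM is already running by the time application code executes):

    spark-submit --deploy-mode client \
      --driver-java-options "-XX:ThreadStackSize=81920" \
      --conf spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920 \
      --class com.example.MyApp myapp.jar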

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Jacek Laskowski
On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar wrote: > Now I have added this > > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > > However, I am getting an error > > > error: No implicit for Append.Value[Seq[sbt.ModuleID], >

Re: Akka with Hadoop/Spark

2016-06-05 Thread Jacek Laskowski
Hi, "I am supposed to work with akka and Hadoop in building apps on top of the data available in hadoop" <-- that's outside the topics covered in this mailing list (unless you're going to use Spark, too). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark

StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Hi! I have a fairly simple Spark (1.6.1) Java RDD-based program that's scanning through the lines of about 1000 large text files of records and computing some metrics about each line (record type, line length, etc.). Most lines are identical, so I'm calling distinct(). In the loop over the list of files,
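
Besides raising the stack size (the fix that worked in this thread), a sketch of one way to keep the lineage shallow in this pattern (file list and per-line metric are made up): build one RDD per file, union them in a single call instead of folding union() in a loop, and checkpoint before further work.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("union-many-files"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val files = (1 to 1000).map(i => s"hdfs:///data/part-$i.txt")

    // One union over all RDDs: a loop of pairwise union() calls nests the
    // lineage 1000 levels deep, which is what can blow the stack at job
    // submission time.
    val metrics = sc.union(files.map(f => sc.textFile(f).map(line => (f, line.length))))

    val distinctMetrics = metrics.distinct()
    distinctMetrics.checkpoint()  // truncate the lineage before reuse
    println(distinctMetrics.count())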

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-05 Thread Daniel Darabos
If you fill up the cache, 1.6.0+ will suffer performance degradation from GC thrashing. You can set spark.memory.useLegacyMode to true, or spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to -XX:NewRatio=3 to avoid this issue. I think my colleague filed a ticket for this issue,
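
The same mitigations expressed in code, as a sketch (they can equally go in spark-defaults.conf or on spark-submit via --conf; pick one of the three):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cache-heavy-job")
      .set("spark.memory.useLegacyMode", "true")  // revert to the pre-1.6 memory manager
      // .set("spark.memory.fraction", "0.66")    // or: shrink the unified memory region
      // .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")  // or: enlarge the old generation
    val sc = new SparkContext(conf)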

Akka with Hadoop/Spark

2016-06-05 Thread KhajaAsmath Mohammed
Hi Everyone, I have done a lot of examples in Spark and have a good overview of how it works. I am going to join a new project where I am supposed to work with Akka and Hadoop in building apps on top of the data available in Hadoop. Does anyone have any use case of how this works, or any tutorials? I

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hello. For #1, I read the doc as: libraryDependencies += groupID % artifactID % revision. jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep CheckpointDirectory com/databricks/apps/twitter_classifier/getCheckpointDirectory.class getCheckpointDirectory.class Now I have added this libraryDependencies

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? Or, better, paste your code and the exception here if applicable; then other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 AlexModestov

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
For #1, please find examples on the net e.g. http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html For #2, import . getCheckpointDirectory Cheers On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar wrote: > Thank you sir. > > At compile time can I do something similar to

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Thank you, sir. At compile time, can I do something similar to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"? I have these: name := "scala" version := "1.0" scalaVersion := "2.10.4" And if I look at the jar file I have: jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep Check  1180

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
At compile time, you need to declare the dependency on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: > Hi all, > > Appreciate any advice
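
Since the assembly jar is not published to any repository, a libraryDependencies coordinate (as attempted elsewhere in this thread) cannot resolve it; the unmanaged route is simpler. A sketch, assuming the jar is copied into the project's lib/ directory, which sbt puts on the compile classpath automatically:

    // build.sbt (sketch): no libraryDependencies entry is needed for a
    // local, unpublished jar; dropping it into lib/ makes it an unmanaged
    // dependency available at compile time.
    name := "TwitterAnalyzer"
    version := "1.0"
    scalaVersion := "2.10.4"

At runtime, the same jar is then shipped to the cluster with spark-submit --jars utilities-assembly-0.1-SNAPSHOT.jar, as suggested above.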

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
Actually it is an interesting question. Spark standalone uses a simple cluster manager that is included with Spark. However, I am not sure that this simple cluster manager can work out the whereabouts of the datanodes in a Hadoop cluster. I start YARN together with HDFS, so I don't have this concern. HTH Dr Mich

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
I use YARN as I run Hive on the Spark engine in yarn-cluster mode, plus other stuff. If I turn off YARN, half of my applications won't work. I don't see a great concern in supporting YARN; however, you may have other reasons. Dr Mich Talebzadeh LinkedIn

Caching table partition after join

2016-06-05 Thread Zalzberg, Idan (Agoda)
Hi, I have a complicated scenario where I can't seem to explain to Spark how to handle the query in the best way. I am using Spark from the Thrift server, so SQL only. To explain the scenario, let's assume: Table A: Key: String, Value: String. Table B: Key: String, Value2: String, Part: String
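
One SQL-only approach that sometimes fits this kind of scenario is to materialize and cache just the needed partition before the join, so the join reads the cached subset rather than rescanning Table B. A sketch with hypothetical names and a hypothetical partition value:

    CACHE TABLE b_part AS SELECT Key, Value2 FROM B WHERE Part = 'p1';

    SELECT a.Key, a.Value, b.Value2
    FROM A a JOIN b_part b ON a.Key = b.Key;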

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
I meant when running in standalone cluster mode, where the Hadoop datanodes run on the same nodes as the Spark workers. I don’t want to support YARN as well in my infrastructure, and since I already set up a standalone Spark cluster, I was wondering if running only HDFS in the same cluster

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
Well, in standalone mode you are running your Spark code on one physical node, so the assumption would be that there is an HDFS node running on the same host. When you are running Spark in yarn-client mode, YARN is part of Hadoop core and YARN will know about the datanodes from

Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and to make references to these classes and methods in my other apps: class getCheckpointDirectory {  def
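
For context, a minimal sketch of what such a Utilities.scala might contain; the package name matches the jar listing quoted earlier in this thread, but the method body here is an assumption since the original is cut off:

    package com.databricks.apps.twitter_classifier

    class getCheckpointDirectory {
      // Hypothetical body: derive a per-app checkpoint path.
      def getCheckpointDirectory(prefix: String, appName: String): String =
        s"$prefix/$appName/checkpoint"
    }

Other apps can then reach it with import com.databricks.apps.twitter_classifier._ once the jar is on the classpath.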

Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
Dear all, Does Spark use data locality information from HDFS when running in standalone mode? Or is running on YARN mandatory for this purpose? I can't find this information in the docs, and on Google I am only finding contrasting opinions on this. Regards Marco Capuccini

Re: Using data frames to join separate RDDs in spark streaming

2016-06-05 Thread Cyril Scetbon
Problem solved by creating only one RDD. > On Jun 1, 2016, at 14:05, Cyril Scetbon wrote: > > It seems that to join a DStream with an RDD I can use: > > mgs.transform(rdd => rdd.join(rdd1)) > > or > > mgs.foreachRDD(rdd => rdd.join(rdd1)) > > But, I can't see why
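
A self-contained sketch of the transform-based variant quoted above (source, port, and data are made up). transform is usually the one you want when the joined result should keep flowing through the stream, since it returns a DStream, whereas foreachRDD is an output action:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("dstream-join"), Seconds(10))

    // Static RDD to join against.
    val rdd1 = ssc.sparkContext.parallelize(Seq(("k1", "static1"), ("k2", "static2")))

    // Keyed stream: first comma-separated field is the join key.
    val mgs = ssc.socketTextStream("localhost", 9999).map(l => (l.split(",")(0), l))

    // Joins each batch's RDD with the static RDD; returns a DStream.
    val joined = mgs.transform(rdd => rdd.join(rdd1))
    joined.print()

    ssc.start()
    ssc.awaitTermination()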

Re: Spark SQL Nested Array of JSON with empty field

2016-06-05 Thread Ewan Leith
The Spark JSON read is unforgiving of things like missing elements in some JSON records, or mixed types. If you want to pass invalid JSON files through Spark, you're best off doing an initial parse through the Jackson APIs using a defined schema first; then you can set types like Option[String]
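
A sketch of that pre-parse step, assuming the jackson-module-scala dependency is on the classpath; the Record fields are invented for illustration:

    import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    // Expected schema: Option fields tolerate elements missing from some records.
    case class Record(id: Option[String], tags: Option[Seq[String]])

    object JsonParse {
      // Built lazily inside an object so each executor constructs its own
      // mapper (ObjectMapper is not serializable).
      lazy val mapper: ObjectMapper = {
        val m = new ObjectMapper()
        m.registerModule(DefaultScalaModule)
        m.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
        m
      }

      def parse(line: String): Option[Record] =
        try Some(mapper.readValue(line, classOf[Record]))
        catch { case _: Exception => None }  // drop records that fail to parse
    }

    // Usage: sc.textFile("hdfs:///data/raw.json").flatMap(JsonParse.parse)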