Re: Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread Chris Miller
Why does the order matter? Coalesce runs in parallel and if it's just writing to the file, then I imagine it would do it in whatever order it happens to be executed in each thread. If you want to sort the resulting data, I imagine you'd need to save it to some sort of data structure instead of

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
#3 If your code depends on other projects, you will need to package everything together in order to distribute it over a Spark cluster. In your example below I don’t see much of an advantage to building a package. > On Mar 5, 2016, at 12:32 AM, Ted Yu wrote: > > Answers

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
That is because an instance of org.apache.spark.sql.SQLContext doesn’t exist in the current context and is required before you can use any of its implicit methods. As Ted mentioned, importing org.apache.spark.sql.functions._ will take care of the error below. > On Mar 4, 2016, at 11:35 PM,

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Please import: import org.apache.spark.sql.functions._ On Fri, Mar 4, 2016 at 3:35 PM, Mich Talebzadeh wrote: > thanks. It is like a war of attrition. I always thought that you add the import > before the class itself, not within the class? What is the reason for it >

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
thanks. It is like a war of attrition. I always thought that you add the import before the class itself, not within the class? What is the reason for it, please? This is my code: import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
This is what you need: val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ > On Mar 4, 2016, at 11:03 PM, Mich Talebzadeh > wrote: > > Hi Ted, > > This is my code > > import
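For reference, a minimal self-contained sketch of this wiring (the object and app name follow the thread; the toy DataFrame is purely illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object Sequence {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("Sequence")
        val sc = new SparkContext(sparkConf)

        // The implicits (toDF, $"col", etc.) hang off a concrete SQLContext
        // instance, which is why the import sits inside the method, after the
        // instance exists, rather than at the top of the file.
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Illustrative data only -- not from the thread.
        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
        df.show()

        sc.stop()
      }
    }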

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
After: val sqlContext = new org.apache.spark.sql.SQLContext(sc) Please add: import sqlContext.implicits._ On Fri, Mar 4, 2016 at 3:03 PM, Mich Talebzadeh wrote: > Hi Ted, > > This is my code > > import org.apache.spark.SparkConf > import

Re: Using netlib-java in Spark 1.6 on linux

2016-03-04 Thread Chris Fregly
I have all of this pre-wired up and Docker-ized for your instant enjoyment here: https://github.com/fluxcapacitor/pipeline/wiki You can review the Dockerfile for the details (Ubuntu 14.04-based). This is easy BREEZEy. Also,

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi Ted, This is my code import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.types._ import org.apache.spark.sql.SQLContext // object Sequence { def main(args: Array[String]) { val conf = new

RE: Error building a self contained Spark app

2016-03-04 Thread Jelez Raditchkov
Ok this is what I have: object SQLHiveContextSingleton { @transient private var instance: HiveContext = _ def getInstance(sparkContext: SparkContext): HiveContext = { synchronized { if (instance == null || sparkContext.isStopped) { instance = new
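Filled out, that singleton would look something like the following sketch (based on the snippet above; the stopped-context check mirrors the original):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    object SQLHiveContextSingleton {
      @transient private var instance: HiveContext = _

      // Lazily create a single HiveContext per JVM, rebuilding it if the
      // SparkContext it was tied to has been stopped.
      def getInstance(sparkContext: SparkContext): HiveContext = synchronized {
        if (instance == null || sparkContext.isStopped) {
          instance = new HiveContext(sparkContext)
        }
        instance
      }
    }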

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Can you show your code snippet ? Here is an example: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ On Fri, Mar 4, 2016 at 1:55 PM, Mich Talebzadeh wrote: > Hi Ted, > > I am getting the following error after adding that import > >

FW: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Jelez Raditchkov
From: je...@hotmail.com To: yuzhih...@gmail.com Subject: RE: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏ Date: Fri, 4 Mar 2016 14:09:20 -0800 The code below is from the sources; is this what you were asking for? class

Re: Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi Ted, I am getting the following error after adding that import [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:5: not found: object sqlContext [error] import sqlContext.implicits._ [error] ^ [error]

Re: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Ted Yu
bq. However the method does not seem inherited to HiveContext. Can you clarify the above observation? HiveContext extends SQLContext. On Fri, Mar 4, 2016 at 1:23 PM, jelez wrote: > What is the best approach to use getOrCreate for streaming job with > HiveContext. > It

Use cases for kafka direct stream messageHandler

2016-03-04 Thread Cody Koeninger
Wanted to survey what people are using the direct stream messageHandler for, besides just extracting key / value / offset. Would your use case still work if that argument was removed, and the stream just contained ConsumerRecord objects

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Can you add the following into your code ? import sqlContext.implicits._ On Fri, Mar 4, 2016 at 1:14 PM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala program as below > > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import

spark driver in docker

2016-03-04 Thread yanlin wang
We would like to run multiple Spark drivers in Docker containers. Any suggestion for the port exposure and network settings for Docker so the driver is reachable by the worker nodes? --net="host" is the last thing we want to do. Thx Yanlin

Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread jelez
My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the number of files and make them larger. However, coalesce shuffles data and my job processing time ends up higher than
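A sketch of that pattern (bucket and path are illustrative; the partition count is the knob that trades parallelism for fewer, larger objects):

    // Inside the streaming job: shrink the partition count just before writing,
    // so each batch produces a few larger S3 objects instead of many tiny ones.
    dstream.foreachRDD { (rdd, batchTime) =>
      if (!rdd.isEmpty()) {
        rdd.coalesce(4)
           .saveAsTextFile(s"s3n://my-bucket/output/batch=${batchTime.milliseconds}")
      }
    }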

How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread jelez
What is the best approach to use getOrCreate for streaming job with HiveContext. It seems for SQLContext the recommended approach is to use getOrCreate: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations val sqlContext =
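The getOrCreate pattern from that section of the guide, sketched for a text DStream (names are illustrative; HiveContext itself does not appear to provide a getOrCreate in 1.6, hence the singleton approach discussed elsewhere in this thread):

    import org.apache.spark.sql.SQLContext

    // Assuming `lines` is a DStream[String].
    lines.foreachRDD { rdd =>
      // Reuse one SQLContext per JVM rather than creating one per batch.
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      val df = rdd.toDF("line")
      df.registerTempTable("lines")
      sqlContext.sql("SELECT count(*) AS n FROM lines").show()
    }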

SSL support for Spark Thrift Server

2016-03-04 Thread Sourav Mazumder
Hi All, While starting the Spark Thrift Server I don't see any option to start it with SSL support. Is that support currently there ? Regards, Sourav

Error building a self contained Spark app

2016-03-04 Thread Mich Talebzadeh
Hi, I have a simple Scala program as below import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.sql.SQLContext object Sequence { def main(args: Array[String]) { val conf = new SparkConf().setAppName("Sequence")

Spark reduce serialization question

2016-03-04 Thread James Jia
I'm running a distributed KMeans algorithm with 4 executors. I have an RDD[Data]. I use mapPartitions to run a learner on each data partition, and then call reduce with my custom model-reduce function to combine the resulting models before starting a new iteration. The model size is around ~330 MB. I

Best way to merge files from streaming jobs

2016-03-04 Thread Jelez Raditchkov
My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the number of files and make them larger. However, coalesce shuffles data and my job processing time ends up higher than

Re: Spark 1.5.2 : change datatype in programaticallly generated schema

2016-03-04 Thread Michael Armbrust
Change the type of a subset of the columns using withColumn, after you have loaded the DataFrame. Here is an example. On Thu, Mar
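A sketch of that approach (column names and target types are illustrative):

    import org.apache.spark.sql.functions.col

    // Load with the programmatically generated (e.g. all-StringType) schema first,
    // then cast only the columns whose types need to change.
    val fixed = df
      .withColumn("amount", col("amount").cast("double"))
      .withColumn("event_time", col("event_time").cast("timestamp"))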

Re: Spark SQL - udf with entire row as parameter

2016-03-04 Thread Michael Armbrust
You have to use SQL to call it (but you will be able to do it with dataframes in Spark 2.0 due to a better parser). You need to construct a struct(*) and then pass that to your function since a function must have a fixed number of arguments. Here is an example
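A sketch of that approach in Scala (the UDF body and names are illustrative; the key piece is struct(*), which bundles every column into a single struct argument):

    import org.apache.spark.sql.Row

    // The UDF receives the struct as a Row, so it can see every column of the record.
    sqlContext.udf.register("rowAsString", (r: Row) => r.mkString("|"))

    df.registerTempTable("people")
    sqlContext.sql("SELECT rowAsString(struct(*)) AS row_str FROM people").show()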

How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

2016-03-04 Thread Jelez Raditchkov
What is the best approach to use getOrCreate for streaming job with HiveContext. It seems for SQLContext the recommended approach is to use getOrCreate: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations val sqlContext =

S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-04 Thread Jelez Raditchkov
Working on a streaming job with DirectParquetOutputCommitter to S3. I need to use PartitionBy and hence SaveMode.Append. Apparently when using SaveMode.Append, Spark automatically defaults to the default parquet output committer and ignores DirectParquetOutputCommitter. My problems are: 1. the

Re: Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Michael Armbrust
Read the docs at the link that you pasted: http://spark.apache.org/docs/latest/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore Spark will always compile against the same version of Hive (1.2.1), but it can dynamically load jars to speak to other versions. On Fri,
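Concretely, the version Spark speaks to is controlled by the metastore settings, along these lines (jar paths are illustrative):

    spark-submit \
      --conf spark.sql.hive.metastore.version=0.12.0 \
      --conf spark.sql.hive.metastore.jars=/path/to/hive-0.12/lib/*:/path/to/hadoop/lib/* \
      ...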

Re: Installing Spark on Mac

2016-03-04 Thread Vishnu Viswanath
Installing Spark on a Mac is similar to how you install it on Linux. I use a Mac and have written a blog post on how to install Spark; here is the link: http://vishnuviswanath.com/spark_start.html Hope this helps. On Fri, Mar 4, 2016 at 2:29 PM, Simon Hafner wrote: > I'd try

Re: Installing Spark on Mac

2016-03-04 Thread Eduardo Costa Alfaia
Hi Aida, Run only "build/mvn -DskipTests clean package". BR Eduardo Costa Alfaia Ph.D. Student in Telecommunications Engineering Università degli Studi di Brescia Tel: +39 3209333018 On 3/4/16, 16:18, "Aida" wrote: >Hi all, > >I am a complete novice and was

Re: Installing Spark on Mac

2016-03-04 Thread Simon Hafner
I'd try `brew install spark` or `apache-spark` and see where that gets you. https://github.com/Homebrew/homebrew 2016-03-04 21:18 GMT+01:00 Aida : > Hi all, > > I am a complete novice and was wondering whether anyone would be willing to > provide me with a step by step

Installing Spark on Mac

2016-03-04 Thread Aida
Hi all, I am a complete novice and was wondering whether anyone would be willing to provide me with a step-by-step guide on how to install Spark on a Mac, in standalone mode btw. I downloaded a prebuilt version, the second version from the top. However, I have not installed Hadoop and am not

Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Yong Zhang
When I tried to compile Spark 1.5.2 with -Phive-0.12.0, Maven gave me back an error that the profile doesn't exist any more. But when I read the Spark SQL programming guide here: http://spark.apache.org/docs/1.5.2/sql-programming-guide.html It keeps mentioning that Spark 1.5.2 can still work with

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread Ryan Blue
Hi Ashok, The schema for your data comes from the data frame you're using in Spark and resolved with a Hive table schema if you are writing to one. For encodings, you don't need to configure them because they are selected for your data automatically. For example, Parquet will try
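In other words, the Parquet schema is taken from the DataFrame itself, and writing needs no encoding setup (path is illustrative):

    // Column names/types in the Parquet footer come straight from df.schema;
    // Parquet chooses encodings (dictionary, RLE, ...) per column automatically.
    df.write.mode("overwrite").parquet("hdfs:///tmp/output.parquet")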

1.6.0 spark.sql datetime conversion problem

2016-03-04 Thread Michal Vince
Hi guys, I'm using Spark 1.6.0 and I'm not sure if I found a bug or I'm doing something wrong. I'm playing with dataframes and converting ISO 8601 timestamps with millis to my timezone - which is Europe/Bratislava - with the from_utc_timestamp function from spark.sql.functions. The problem is that
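For reference, the conversion being described looks roughly like this (column name is illustrative):

    import org.apache.spark.sql.functions._

    // "ts" holds ISO-8601 UTC timestamps with millis, e.g. "2016-03-04T12:00:00.123"
    val withLocal = df.withColumn(
      "local_ts",
      from_utc_timestamp(col("ts"), "Europe/Bratislava"))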

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
Thanks Luciano. This seems to work now, albeit with some spurious errors! *spark-submit --packages amplab:succinct:0.1.5 --class "SimpleApp" --master local target/scala-2.10/simple-project_2.10-1.0.jar* Ivy Default Cache set to: /home/hduser/.ivy2/cache The jars for the packages stored in:

Re: Spark Streaming

2016-03-04 Thread anbucheeralan
Hi, Were you able to solve this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp24058p26396.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
Maybe leave a comment on http://spark-packages.org/package/amplab/succinct ? On Fri, Mar 4, 2016 at 7:22 AM, Mich Talebzadeh wrote: > here you are > > sbt package > [info] Set current project to Simple Project (in build > file:/home/hduser/dba/bin/scala/) > [success]

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
here you are sbt package [info] Set current project to Simple Project (in build file:/home/hduser/dba/bin/scala/) [success] Total time: 1 s, completed Mar 4, 2016 2:50:16 PM hduser@rhes564::/home/hduser/dba/bin/scala> spark-submit --class "SimpleApp" --master local

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
Can you show the complete stack trace? It isn't clear which class's definition was not found. On Fri, Mar 4, 2016 at 6:46 AM, Mich Talebzadeh wrote: > Hi, > > I have some simple Scala code that I want to use in an sbt project. > > It is pretty simple but imports

Issue with sbt failing on amplab succinct

2016-03-04 Thread Mich Talebzadeh
Hi, I have some simple Scala code that I want to use in an sbt project. It is pretty simple but imports the following: // Import SuccinctRDD import edu.berkeley.cs.succinct._ name := "Simple Project" version := "1.0" scalaVersion := "2.10.5" libraryDependencies += "org.apache.spark" %%
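A build.sbt along those lines might look as follows (the spark-packages resolver URL and the Spark version are assumptions; alternatively, spark-submit --packages amplab:succinct:0.1.5, as used later in this thread, avoids declaring the dependency in sbt at all):

    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.10.5"

    // Spark itself is provided on the cluster at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

    // Succinct is published on spark-packages rather than Maven Central (resolver URL assumed).
    resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"

    libraryDependencies += "amplab" % "succinct" % "0.1.5"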

Spark SQL - udf with entire row as parameter

2016-03-04 Thread Nisrina Luthfiyati
Hi all, I'm using Spark SQL in Python and want to write a UDF that takes an entire Row as the argument. I tried something like: def functionName(row): ... return a_string udfFunctionName = udf(functionName, StringType()) df.withColumn('columnName', udfFunctionName('*')) but this gives an

Re: Spark 1.5 on Mesos

2016-03-04 Thread Ashish Soni
It did not help; same error. Is this the issue I am running into: https://issues.apache.org/jira/browse/SPARK-11638 *Warning: Local jar /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar does not exist, skipping.* java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi On

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton wrote: > >

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread Mich Talebzadeh
Spark SQL has both FLOOR and CEILING functions: spark-sql> select FLOOR(11.95), CEILING(11.95); 11.0 12.0 Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice article about Parquet *with* Avro : - https://dzone.com/articles/understanding-how-parquet - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ Nice video from the good folks of Cloudera for the *differences* between "Avrow" and Parquet -

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread Ajay Chander
Hi Ashok, Try using HiveContext instead of SQLContext. I suspect SQLContext does not have that functionality. Let me know if it works. Thanks, Ajay On Friday, March 4, 2016, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi Ayan, > > Thanks for the response. I am using SQL

Re: Mapper side join with DataFrames API

2016-03-04 Thread Deepak Gopalakrishnan
Have added this to SO, can you guys share any thoughts ? http://stackoverflow.com/questions/35795518/spark-1-6-spills-to-disk-even-when-there-is-enough-memory

Spark 1.5.2 -Better way to create custom schema

2016-03-04 Thread Divya Gehlot
Hi, I have a data set in HDFS. Is there any better way to define a custom schema for a data set having 100+ fields of different data types? Thanks, Divya

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ashokkumar rajendran
Hi Ayan, Thanks for the response. I am using a SQL query (not the DataFrame API). Could you please explain how I should import this sql function for it? Simply importing this class in my driver code does not help here. Many of the functions that I need are already there in sql.functions, so I do not want to

Re: Sorting the dataframe

2016-03-04 Thread Gourav Sengupta
Hi, I completely agree with the use of DataFrames for most operations in Spark, unless you are using a custom algorithm, or algorithms that need RDDs. Databricks have taken a cue from Apache Flink (I think) and rewritten Tungsten as the base engine that drives DataFrames, so there is performance

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-04 Thread ayan guha
Hi, I doubt if that is a correct use of HBase. In case you have an analytics use case, you would probably be better off using Hive. On Fri, Mar 4, 2016 at 3:09 AM, Nirav Patel wrote: > It's a write-once table, mainly used for read/query-intensive applications. > We in fact

Re: How to display the web ui when running Spark on YARN?

2016-03-04 Thread Steve Loughran
On 3 Mar 2016, at 09:17, Shady Xu > wrote: Hi all, I am running Spark in yarn-client mode, but every time I access the web ui, the browser redirect me to one of the worker nodes and shows nothing. The url looks like

Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread ayan guha
Most likely you are missing an import of org.apache.spark.sql.functions._. In any case, you can write your own function for floor and use it as a UDF. On Fri, Mar 4, 2016 at 7:34 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi, > > I load a json file that has a timestamp (as long
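A sketch of the UDF route for the 5-minute bucketing case from the original question (function, column and table names are illustrative):

    // Bucket an epoch-millisecond timestamp down to the start of its 5-minute window.
    sqlContext.udf.register("floor5m", (ts: Long) => ts - (ts % (5 * 60 * 1000L)))

    sqlContext.sql(
      """SELECT floor5m(event_ts) AS bucket, count(*) AS cnt
        |FROM events
        |GROUP BY floor5m(event_ts)""".stripMargin).show()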

Number Of Jobs In Spark Streaming

2016-03-04 Thread Sandip Mehta
Hi All, Is it fair to say that, number of jobs in a given spark streaming application is equal to number of actions in an application? Regards Sandeep - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

[Kinesis] multiple KinesisRecordProcessor threads.

2016-03-04 Thread Li Ming Tsai
Hi, @chris @tdas Referring to the latest integration documentation, it states the following: "A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads." But looking at the API and the example, each time we call

Facing issue with floor function in spark SQL query

2016-03-04 Thread ashokkumar rajendran
Hi, I load a json file that has a timestamp (as a long in milliseconds) and several other attributes. I would like to group the records by 5 minutes and store each group as a separate file. I am facing a couple of problems here: 1. Using the FLOOR function in the select clause (to bucket by 5 mins) gives me an error saying

Re: Sorting the dataframe

2016-03-04 Thread Mich Talebzadeh
Try this example, similar to yours; a DataFrame should be sufficient. val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16)) // Sort option 1 using a tempTable val b = a.toDF("Name","score").registerTempTable("tmp") sql("select Name, score from tmp order by score desc").show // Sort option 2
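The direct DataFrame form of the same sort, for comparison (a sketch continuing the example above):

    import org.apache.spark.sql.functions.desc

    val df = a.toDF("Name", "score")
    df.orderBy(desc("score")).show()   // same result without the temp table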

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread ashokkumar rajendran
Thanks for the clarification Xinh. On Fri, Mar 4, 2016 at 12:30 PM, Xinh Huynh wrote: > Hi Ashok, > > On the Spark SQL side, when you create a dataframe, it will have a schema > (each column has a type such as Int or String). Then when you save that > dataframe as