Spark metrics when running with YARN?

2016-08-29 Thread Otis Gospodnetić
Hi, When Spark is run on top of YARN, where/how can one get Spark metrics? Thanks, Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/

How to use custom class in DataSet

2016-08-29 Thread canan chen
e.g. I have a custom class A (not a case class), and I'd like to use it as Dataset[A]. I guess I need to implement an Encoder for this, but I didn't find any example of that; is there any documentation for that? Thanks
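For reference, a minimal sketch (assuming Spark 2.0) of the usual workaround for non-case classes: a generic Kryo encoder via Encoders.kryo. The class A and its fields below are made up for illustration; if the class follows JavaBean conventions, Encoders.bean(classOf[A]) is another option that keeps a columnar layout.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical non-case class standing in for the "A" in the question.
class A(val id: Int, val name: String) extends Serializable

object CustomClassDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Dataset with a custom class")
      .master("local[*]")
      .getOrCreate()

    // For an arbitrary (non-case) class the simplest option is a generic Kryo encoder.
    // The objects are stored as opaque binary blobs, so Catalyst cannot see individual fields.
    implicit val aEncoder: Encoder[A] = Encoders.kryo[A]

    val ds = spark.createDataset(Seq(new A(1, "foo"), new A(2, "bar")))
    ds.collect().foreach(a => println(s"${a.id} -> ${a.name}"))

    spark.stop()
  }
}
```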

Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-29 Thread Aseem Bansal
Hi, What all access is needed to save a model to S3? Initially I thought it should be only write. Then I found it also needs delete, to remove temporary files. Now that they have given me DELETE access, I am getting the error Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:

RE: After calling persist, why the size in sparkui is not matching with the actual file size

2016-08-29 Thread Rohit Kumar Prusty
Thanks, Denis, for your quick response. My original file is not compressed; it is just a text log file. Regards Rohit Kumar Prusty +91-9884070075 From: Denis Bolshakov [mailto:bolshakov.de...@gmail.com] Sent: Monday, August 29, 2016 9:03 PM To: Rohit Kumar Prusty Cc:

Re: Design patterns involving Spark

2016-08-29 Thread Chanh Le
Hi everyone, It seems a lot of people are using Druid for realtime dashboards. I'm just wondering about using Druid as the main storage engine, because Druid can store the raw data and can also integrate with Spark (theoretically). In that case do we need two separate storage layers: Druid (storing segments in HDFS)

Re: Spark launcher handle and listener not giving state

2016-08-29 Thread ckanth99
That must be it then :( We are using the Cloudera distribution, which has Spark 1.5.1. Thanks a lot Marcelo On Mon, 29 Aug 2016 16:39:34 -0700 Marcelo Vanzin van...@cloudera.com wrote You haven't said which version of Spark you are using. The state API only works if the underlying Spark

How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

2016-08-29 Thread kant kodali
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet.

2016-08-29 Thread kant kodali
java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting some computation in the receiver before the Receiver.onStart() has been called.
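For context, a minimal custom-receiver sketch following the pattern in the Spark Streaming programming guide: the ReceiverSupervisor is attached by Spark when ssc.receiverStream() starts the receiver, so store() must only be called from onStart() (or from threads launched there), never from the constructor. The DummySource class, the generated data, and the 5-second batch interval are made up for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver: all work starts in onStart(), never in the constructor.
class DummySource extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // The supervisor is attached by Spark before onStart() is called,
    // so store() is safe to use from here (or from threads launched here).
    new Thread("Dummy Receiver") {
      override def run(): Unit = {
        while (!isStopped) {
          store("tick " + System.currentTimeMillis())
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // The thread above checks isStopped, so nothing extra to clean up here.
  }
}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomReceiverApp").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // receiverStream() is what hands the receiver to a supervisor on an executor.
    val lines = ssc.receiverStream(new DummySource())
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```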

Re: Spark launcher handle and listener not giving state

2016-08-29 Thread Marcelo Vanzin
You haven't said which version of Spark you are using. The state API only works if the underlying Spark version is also 1.6 or later. On Mon, Aug 29, 2016 at 4:36 PM, ckanth99 wrote: > Hi All, > > I have a web application which will submit spark jobs on Cloudera spark >

Spark launcher handle and listener not giving state

2016-08-29 Thread ckanth99
Hi All, I have a web application which submits Spark jobs to a Cloudera Spark cluster using the Spark launcher library. It is successfully submitting the Spark job to the cluster. However, it is not calling back the listener class methods, and the getState() on the returned SparkAppHandle never
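As the reply above notes, the state callbacks require Spark 1.6 or later on the cluster. For reference, a sketch of how the launcher, handle, and listener are typically wired together; the jar path and main class below are hypothetical.

```scala
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch a job and observe its state transitions through SparkAppHandle.
object LauncherWithListener {
  def main(args: Array[String]): Unit = {
    val done = new CountDownLatch(1)

    val listener = new SparkAppHandle.Listener {
      override def stateChanged(handle: SparkAppHandle): Unit = {
        println(s"state changed: ${handle.getState}")
        if (handle.getState.isFinal) done.countDown()
      }
      override def infoChanged(handle: SparkAppHandle): Unit = {
        println(s"app id: ${handle.getAppId}")
      }
    }

    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-spark-job.jar")   // hypothetical jar
      .setMainClass("com.example.MyJob")              // hypothetical main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication(listener)                     // instead of launch()

    done.await()
    println(s"final state: ${handle.getState}")
  }
}
```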

Cleanup after Spark SQL job with window aggregation takes a long time

2016-08-29 Thread Jestin Ma
After a Spark SQL job appending a few columns using window aggregation functions, and performing a join and some data massaging, I find that the cleanup after the job finishes saving the result data to disk takes as long as, if not longer than, the job itself. I currently am performing window aggregation on
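For context, a self-contained sketch of the kind of window aggregation plus write being described; the column names, sample data, and output path are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object WindowAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("window-agg").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: (user, timestamp, amount)
    val df = Seq(
      ("a", 1L, 10.0), ("a", 2L, 20.0),
      ("b", 1L, 5.0),  ("b", 3L, 7.0)
    ).toDF("user", "ts", "amount")

    // Append columns computed over a per-user window ordered by time.
    val w = Window.partitionBy("user").orderBy("ts")
    val enriched = df
      .withColumn("running_total", sum($"amount").over(w))
      .withColumn("row_num", row_number().over(w))

    // Writing out the result; temporary-file cleanup happens after this step.
    enriched.write.mode("overwrite").parquet("/tmp/window-agg-output")

    spark.stop()
  }
}
```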

Re: Automating lengthy command to pyspark with configuration?

2016-08-29 Thread Russell Jurney
I've got most of it working through spark.jars On Sunday, August 28, 2016, ayan guha wrote: > Best to create an alias and place it in your bashrc > On 29 Aug 2016 08:30, "Russell Jurney" >

Re: S3A + EMR failure when writing Parquet?

2016-08-29 Thread Everett Anderson
Okay, I don't think it's really just an S3A issue anymore. I can run the job using fs.s3.impl/spark.hadoop.fs.s3.impl set to the S3A impl as a --conf param from the EMR console successfully, as well. The problem seems related to the fact that we're trying to spark-submit jobs to a YARN cluster from

Great performance improvement of Spark 1.6.2 on our production cluster

2016-08-29 Thread Yong Zhang
Today I deployed Spark 1.6.2 on our production cluster. There is one huge daily job we run every day using Spark SQL, and it is the biggest Spark job on our cluster running daily. I was impressed by the speed improvement. Here are the historical statistics of this daily job: 1) 11 to 12 hours

Re: Coding in the Spark ml "ecosystem" why is everything private?!

2016-08-29 Thread Thunder Stumpges
I understand all that and can respect a team's desire not to have to support many little internal details of a system; but at the same time, I am talking about valuable aspects of a platform that make coding components that work in an ecosystem viable. Let's put aside the OpenHashMap as I see that

Re: Coding in the Spark ml "ecosystem" why is everything private?!

2016-08-29 Thread Sean Owen
If something isn't public, then it could change across even maintenance releases. Although you can indeed still access it in some cases by writing code in the same package, you're taking some risk that it will stop working across releases. If it's not public, the message is that you should build

Coding in the Spark ml "ecosystem" why is everything private?!

2016-08-29 Thread Thunder Stumpges
Hi all, I'm not sure if this belongs here in users or over in dev as I guess it's somewhere in between. We have been starting to implement some machine learning pipelines, and it seemed from the documentation that Spark had a fairly well thought-out platform (see:

Exception during creation of ActorReceiver when running ActorWordCount on CDH 5.5.2

2016-08-29 Thread Ricky Pritchett
I get the following exception on the worker nodes when running the ActorWordCount Example.  Caused by: java.lang.IllegalArgumentException: constructor public akka.actor.CreatorFunctionConsumer(scala.Function0) is incompatible with arguments [class java.lang.Class, class

Re: After calling persist, why the size in sparkui is not matching with the actual file size

2016-08-29 Thread Denis Bolshakov
Hello, Spark uses snappy by default; is your original file compressed? Also, it keeps data in its own representation format (column-based), and it's not the same as text. Best regards, Denis On 29 August 2016 at 16:52, Rohit Kumar Prusty wrote: > Hi Team, > > I am new to
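A small sketch that makes the difference visible, using the Spark 2.x SparkSession API (the thread itself may be on 1.x): cache the same file once as a plain text RDD and once through the DataFrame reader, then compare both entries in the Storage tab of the UI. The path is the one from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSizeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-size").getOrCreate()

    val path = "/user/rohit_prusty/application2.log"  // path from the question

    // Plain text RDD cached as deserialized Java objects (MEMORY_ONLY):
    // per-String overhead usually makes this *larger* than the file on disk.
    val rdd = spark.sparkContext.textFile(path).persist(StorageLevel.MEMORY_ONLY)
    rdd.count()

    // The same data cached through the DataFrame reader uses an in-memory
    // columnar format with compression, which is often *smaller* than the file.
    val df = spark.read.text(path).persist(StorageLevel.MEMORY_ONLY)
    df.count()

    // Compare both cached entries with the 39 KB on-disk size in the "Storage"
    // tab of the Spark UI (http://<driver>:4040) while the app is still running.
    Thread.sleep(60000)
    spark.stop()
  }
}
```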

Spark Streaming batch sequence number

2016-08-29 Thread Matt Smith
Is it possible to get a sequence number for the current batch (i.e. the first batch is 0, the second is 1, etc.)?
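Spark Streaming identifies batches by their Time rather than by a sequence number, so a common workaround is sketched below (the socket source and batch interval are just for illustration): either bump a driver-side counter in foreachRDD, or derive a number from the batch Time itself.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object BatchSequenceNumber {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-seq").setMaster("local[2]")
    val batchInterval = Seconds(10)
    val ssc = new StreamingContext(conf, batchInterval)

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source

    // Option 1: a driver-side counter incremented once per batch
    // (foreachRDD's body runs on the driver, so this is safe).
    val batchCounter = new AtomicLong(-1L)

    lines.foreachRDD { (rdd, time: Time) =>
      val seq = batchCounter.incrementAndGet()
      // Option 2: derive a stable number from the batch time itself,
      // which survives driver restarts better than an in-memory counter.
      val fromTime = time.milliseconds / batchInterval.milliseconds
      println(s"batch #$seq (time-derived id $fromTime) has ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```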

After calling persist, why the size in sparkui is not matching with the actual file size

2016-08-29 Thread Rohit Kumar Prusty
Hi Team, I am new to Spark and have this basic question. After calling persist, why is the size in the Spark UI not matching the actual file size? Actual file size for "/user/rohit_prusty/application2.log" - 39 KB Code snippet: === logData =

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
Not really, a grouped DataFrame only provides SQL-like functions like sum and avg (at least in 1.5). > On 29.08.2016, at 14:56, ayan guha wrote: > > If you are confused because of the name of two APIs. I think DF API name > groupBy came from SQL, but it works similarly

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread ayan guha
You may be confused because of the names of the two APIs. I think the DF API name groupBy came from SQL, but it works similarly to reduceByKey. On 29 Aug 2016 20:57, "Marius Soutier" wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct? > > On 11.08.2016,
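To make the comparison concrete, a sketch of the three equivalents side by side (assuming Spark 2.0; on 1.5/1.6 only the RDD and groupBy/agg variants apply). DataFrame groupBy + agg does a partial (map-side) aggregation, so it behaves like reduceByKey rather than like RDD groupByKey.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ReduceByKeyEquivalents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reduce-by-key").master("local[*]").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("a", 1), ("a", 2), ("b", 5))

    // RDD API: the classic reduceByKey.
    val rddResult = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _).collect()

    // DataFrame API: groupBy + agg, restricted to SQL-style aggregate functions.
    val dfResult = pairs.toDF("key", "value").groupBy("key").agg(sum("value")).collect()

    // Dataset API (Spark 2.0): groupByKey + reduceGroups, closest in spirit to reduceByKey.
    val dsResult = pairs.toDS().groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 + b._2))
      .collect()

    println(rddResult.mkString(", "))
    println(dfResult.mkString(", "))
    println(dsResult.mkString(", "))

    spark.stop()
  }
}
```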

Re: How to access the WrappedArray

2016-08-29 Thread Bedrytski Aliaksandr
Hi, It depends on how you see "elements from the WrappedArray" represented. Is it a List[Any] or you need a special case class for each line? Or you want to create a DataFrame that will hold the type for each column? Will the json file always be < 100mb so that you can pre-treat it with a *sed*

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
In DataFrames (and thus in 1.5 in general) this is not possible, correct? > On 11.08.2016, at 05:42, Holden Karau wrote: > > Hi Luis, > > You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you > can do groupBy followed by a reduce on the

Re: How to access the WrappedArray

2016-08-29 Thread Denis Bolshakov
Hello, Not sure that it will help, but I would do the following: 1. Create a case class which matches your JSON schema. 2. Change the following line: old: Dataset rows_salaries = spark.read().json("/Users/sreeharsha/Downloads/rows_salaries.json"); new: Dataset rows_salaries =
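A sketch of that suggestion in Scala (the original snippet is Java): read the JSON into a typed Dataset via a case class, or pull the WrappedArray out of a Row with getAs. The case classes and the "data" column name are hypothetical and must be adapted to the real schema of rows_salaries.json.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes; the real ones must mirror the JSON schema of rows_salaries.json.
case class Salary(name: String, amount: Double)
case class Record(data: Seq[Salary])

object WrappedArrayAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapped-array").master("local[*]").getOrCreate()
    import spark.implicits._

    // Typed read: the array column becomes an ordinary Seq on the case class.
    val records = spark.read.json("/Users/sreeharsha/Downloads/rows_salaries.json").as[Record]
    records.flatMap(_.data).show()

    // Untyped alternative: pull the WrappedArray out of each Row with getAs.
    val df = spark.read.json("/Users/sreeharsha/Downloads/rows_salaries.json")
    df.collect().foreach { row =>
      val items = row.getAs[Seq[AnyRef]]("data")   // a WrappedArray at runtime
      println(items.mkString(", "))
    }

    spark.stop()
  }
}
```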

How to access the WrappedArray

2016-08-29 Thread Sreeharsha
Here is the snippet of code: //The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder(): SparkSession spark = SparkSession.builder().appName("Java Spark SQL Example").master("local").getOrCreate(); //With a

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Mich Talebzadeh
Can someone correct me on this: 1. Jobs run and finish independently of each other. There is no correlation between job 1 and job 2. 2. If job 2 depends on job 1's output, then persistent storage like a Parquet file on HDFS can be used to save the outcome of job 1, and job 2 can start

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
If you mean to persist data in an RDD, then you should do just that -- persist the RDD to durable storage so it can be read later by any other app. Checkpointing is not a way to store RDDs, but a specific way to recover the same application in some cases. Parquet has been supported for a long
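A sketch of the handoff Sean describes, using the Spark 2.x SparkSession API (on 1.6 the same is done through SQLContext); the HDFS path is made up. The two objects are completely separate applications that only share the Parquet directory.

```scala
import org.apache.spark.sql.SparkSession

// Job 1: compute something and persist it as Parquet on shared storage.
object JobOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("job-one").getOrCreate()
    import spark.implicits._

    val result = spark.sparkContext.parallelize(1 to 1000)
      .map(i => (i, i * i))
      .toDF("n", "n_squared")

    result.write.mode("overwrite").parquet("hdfs:///shared/job-one-output")  // hypothetical path
    spark.stop()
  }
}

// Job 2: a separate application that re-reads the same data later.
object JobTwo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("job-two").getOrCreate()

    val previous = spark.read.parquet("hdfs:///shared/job-one-output")
    println(previous.count())
    spark.stop()
  }
}
```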

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sachin Mittal
I understood the approach. Does Spark 1.6 support the Parquet format, I mean saving to and loading from a Parquet file? Also, if I use checkpoint, what I understand is that the RDD's location on the filesystem is not removed when the job is over, so I can read that RDD in the next job. Is that one of the use cases of

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
You just save the data in the RDD in whatever form you want to whatever persistent storage you want, and then re-read it from another job. This could be Parquet format on HDFS for example. Parquet is just a common file format. There is no need to keep the job running just to keep an RDD alive. On

Re: Equivalent of "predict" function from LogisticRegressionWithLBFGS in OneVsRest with LogisticRegression classifier (Spark 2.0)

2016-08-29 Thread Nick Pentreath
Try this: val df = spark.createDataFrame(Seq(Vectors.dense(Array(10.0, 590.0, 190.0, 700.0))).map(Tuple1.apply)).toDF("features") On Sun, 28 Aug 2016 at 11:06 yaroslav wrote: > Hi, > > We use such kind of logic for training our model > > val model = new
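Expanding that snippet into a runnable sketch: fit a OneVsRest model and call transform() on a single-row DataFrame, which is the spark.ml equivalent of the old predict(). The training rows and labels below are invented for illustration.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object OneVsRestPredict {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ovr-predict").master("local[*]").getOrCreate()

    // Hypothetical multi-class training data: (label, features).
    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(10, 590, 190, 700)),
      (1.0, Vectors.dense(20, 100, 30, 50)),
      (2.0, Vectors.dense(5, 5, 5, 5))
    )).toDF("label", "features")

    val ovr = new OneVsRest().setClassifier(new LogisticRegression())
    val model = ovr.fit(training)

    // The "predict" equivalent: build a single-row DataFrame and call transform().
    val df = spark.createDataFrame(
      Seq(Vectors.dense(10, 590, 190, 700)).map(Tuple1.apply)
    ).toDF("features")
    model.transform(df).select("prediction").show()

    spark.stop()
  }
}
```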

Re: Best practises to storing data in Parquet files

2016-08-29 Thread Mich Talebzadeh
Hi Kevin. When you say Kafka is interacting with an Oracle database (if I understand you correctly), are you using GoldenGate with the Kafka interface to push data from Oracle to Kafka? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

can I use cassandra for checkpointing during a spark streaming job

2016-08-29 Thread kant kodali
I understand that I cannot use Spark Streaming window operations without checkpointing to HDFS, but without window operations I don't think we can do much with Spark Streaming. So, since it is very essential, can I use Cassandra as the distributed storage? If so, can I see an example of how I can tell

json with millisecond timestamp in spark 2

2016-08-29 Thread filousen
Hi, I see a behaviour change after testing Spark 2. Timestamps with milliseconds (from Java, for instance) can no longer be processed using TimestampType while reading JSON. I guess this is due to the following code:
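One common workaround, sketched under the assumption that the timestamps arrive as epoch milliseconds in a numeric JSON field (field names invented): declare the field as LongType in the read schema and convert it to a timestamp explicitly, rather than asking the JSON reader to parse it as TimestampType.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object MillisTimestampJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("millis-ts").master("local[*]").getOrCreate()

    // Hypothetical record with the event time as epoch milliseconds (a JSON number).
    val jsonLines = spark.sparkContext.parallelize(Seq(
      """{"id": 1, "event_time": 1472469780123}"""
    ))

    // Read the field as a long, then convert: dividing by 1000 before the cast
    // to TimestampType keeps the millisecond fraction.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("event_time", LongType)
    ))

    val df = spark.read.schema(schema).json(jsonLines)
      .withColumn("event_ts", (col("event_time") / 1000).cast(TimestampType))

    df.printSchema()
    df.show(false)

    spark.stop()
  }
}
```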

Re: Insert non-null values from dataframe

2016-08-29 Thread Selvam Raman
Thanks for the update. We are using version 2.0, so we are planning to write our own custom logic to remove the null values. Thanks, Selvam R On Fri, Aug 26, 2016 at 9:08 PM, Russell Spitzer wrote: > Cassandra does not differentiate between null and empty, so when reading >

Re: Issues with Spark On Hbase Connector and versions

2016-08-29 Thread Sachin Jain
There is a connection leak problem with the Hortonworks HBase connector if you use HBase 1.2.0. I tried to use Hortonworks' connector and ran into the same problem. Have a look at this HBase issue, HBASE-16017 [0]. The fix was backported to 1.3.0, 1.4.0, and 2.0.0. I have raised a ticket on