Re: HBase / Spark Kerberos problem

2016-05-19 Thread John Trengrove
Have you had a look at this issue? https://issues.apache.org/jira/browse/SPARK-12279 There is a comment by Y Bodnar on how they successfully got Kerberos and HBase working. 2016-05-18 18:13 GMT+10:00 : > Hi all, > > I have been puzzling over a Kerberos

Re: How to get the batch information from Streaming UI

2016-05-16 Thread John Trengrove
You would want to add a listener to your Spark Streaming context. Have a look at the StatsReportListener [1]. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StatsReportListener 2016-05-17 7:18 GMT+10:00 Samuel Zhou : > Hi, >
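A minimal sketch of the listener approach suggested above, assuming an existing StreamingContext. The names BatchInfoLogger and BatchListeners are hypothetical and not from the thread; StatsReportListener, StreamingListener and addStreamingListener are part of the Spark Streaming scheduler API.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StatsReportListener, StreamingListener, StreamingListenerBatchCompleted}

// Hypothetical listener: prints one line per completed batch.
class BatchInfoLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // Delay fields are Option[Long] in milliseconds; -1 marks a missing value.
    val scheduling = info.schedulingDelay.getOrElse(-1L)
    val processing = info.processingDelay.getOrElse(-1L)
    println(s"batch=${info.batchTime} records=${info.numRecords} " +
      s"schedulingDelayMs=$scheduling processingDelayMs=$processing")
  }
}

object BatchListeners {
  // Call this before ssc.start() so that no batches are missed.
  def register(ssc: StreamingContext): Unit = {
    // Built-in listener that periodically logs delay statistics.
    ssc.addStreamingListener(new StatsReportListener())
    // Custom listener for per-batch details.
    ssc.addStreamingListener(new BatchInfoLogger)
  }
}
```

Calling BatchListeners.register(ssc) once, before starting the context, should be enough; the per-batch lines then appear in the driver output.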

Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you want to share RDDs, it might be a good idea to check out Tachyon / Alluxio. For the Thrift server, I believe the datasets are held in your Spark cluster as RDDs and you just communicate with it via the Thrift JDBC Distributed Query Engine connector. 2016-05-17 5:12 GMT+10:00

Re: How to use the spark submit script / capability

2016-05-15 Thread John Trengrove
Assuming you are referring to running SparkSubmit.main programmatically; otherwise, read this [1]. I can't find any scaladocs for org.apache.spark.deploy.* but Oozie's [2] example of using SparkSubmit is pretty comprehensive. [1] http://spark.apache.org/docs/latest/submitting-applications.html [2]

Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values by using the DataFrame.na functions in a preprocessing step like DataFrame.na.fill(). For reference: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions John On 21 April 2016 at 03:41, Andres Perez
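A minimal sketch of the preprocessing step described above, assuming numeric input columns and that a constant is an acceptable replacement for nulls. The object name NullSafeAssembler, the "features" output column and the default fill value are illustrative assumptions, not from the thread.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fill nulls first, then assemble the feature vector.
object NullSafeAssembler {
  def assemble(df: DataFrame,
               inputCols: Array[String],
               fillValue: Double = 0.0): DataFrame = {
    // na.fill with a Double only touches numeric columns among inputCols.
    val filled = df.na.fill(fillValue, inputCols)

    new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("features") // assumed output column name
      .transform(filled)
  }
}
```

If dropping incomplete rows suits the data better than imputing a constant, df.na.drop(inputCols) could replace the fill call.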