Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Nkechi Achara
I do not use rename; the files are written to, and then moved into, a directory on HDFS in gz format.

On 22 October 2016 at 15:14, Steve Loughran <ste...@hortonworks.com> wrote:
> On 21 Oct 2016, at 15:53, Nkechi Achara <nkach...@googlemail.com> wrote:

Issues with reading gz files with Spark Streaming

2016-10-21 Thread Nkechi Achara
Hi, I am using Spark 1.5.0 to read gz files with textFileStream, but new files dropped into the specified directory are not picked up. I know this is only the case with gz files, as when I extract a file into the specified directory it is read and processed on the next window. My code is here:
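A minimal sketch of this kind of textFileStream setup (app name, batch interval, and input directory are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("gz-stream")
val ssc = new StreamingContext(conf, Seconds(30))

// textFileStream only picks up files whose modification time falls inside the
// current batch window, so files must be moved into the directory atomically
// after they are fully written
val lines = ssc.textFileStream("hdfs:///path/to/incoming")
lines.count().print()

ssc.start()
ssc.awaitTermination()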

Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Nkechi Achara
Hi All, I am trying to find some good examples of how to integrate Spark and Solr, and I am coming up blank. Basically, the spark-solr implementation does not seem to work correctly with CDH 5.5.2 (the 1.5.x branch), where I am receiving various issues relating to dependencies, which I have not been fully able to resolve.
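For reference, a minimal sketch of the documented spark-solr read path through the DataFrame API; the zkhost and collection values are placeholders, and the spark-solr version has to match the Spark line shipped with the CDH release:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("solr-read"))
val sqlContext = new SQLContext(sc)

// read a SolrCloud collection as a DataFrame via the spark-solr data source
val df = sqlContext.read.format("solr")
  .option("zkhost", "zkhost1:2181/solr")   // ZooKeeper ensemble for SolrCloud
  .option("collection", "my_collection")   // collection to query
  .load()
df.show()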

Spark, Kryo Serialization Issue with ProtoBuf field

2016-07-13 Thread Nkechi Achara
Hi, I am seeing an error when running my Spark job, relating to serialization of a protobuf field when transforming an RDD:

com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace: otherAuthors_
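This UnsupportedOperationException is the classic symptom of Kryo's default FieldSerializer hitting the unmodifiable collections inside a generated protobuf message. A common workaround is to register the protobuf class with the ProtobufSerializer from chill-protobuf; MyProtoMessage below is a stand-in for the actual generated class:

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.protobuf.ProtobufSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ProtoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // route the generated protobuf class through a protobuf-aware serializer
    kryo.register(classOf[MyProtoMessage], new ProtobufSerializer)
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ProtoRegistrator].getName)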

Having issues of passing properties to Spark in 1.5 in comparison to 1.2

2016-07-05 Thread Nkechi Achara
After using Spark 1.2 for quite a long time, I have realised that you can no longer pass Spark configuration to the driver via --conf on the command line (or, in my case, a shell script). I am thinking about using system properties and picking the config up using the following bit of code: def
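Worth noting: the no-argument SparkConf constructor already loads every JVM system property whose name starts with "spark.", so a helper along these lines is largely a safety net. A minimal sketch:

import org.apache.spark.SparkConf

// new SparkConf() defaults to loadDefaults = true, which copies all
// "spark."-prefixed system properties into the conf automatically
val conf = new SparkConf()

// explicit equivalent, for properties set after the conf was created
sys.props.foreach { case (k, v) =>
  if (k.startsWith("spark.")) conf.set(k, v)
}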

Using saveAsNewAPIHadoopDataset for Saving custom classes to Hbase

2016-04-22 Thread Nkechi Achara
Hi All, I am having a few issues saving my data to HBase. I have created a pair RDD for my custom class using the following:

val rdd1 = rdd.map { it => (getRowKey(it), it) }
val job = Job.getInstance(hConf)
val jobConf = job.getConfiguration
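A sketch of how the rest of the wiring typically looks; TableOutputFormat expects (ImmutableBytesWritable, Put) pairs, so the custom class has to be converted into a Put. The table name and column family/qualifier are placeholders, and getRowKey is assumed to return a String:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes

job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val puts = rdd1.map { case (rowKey, value) =>
  val put = new Put(Bytes.toBytes(rowKey))
  // serialize the custom class into a cell however is appropriate
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
    Bytes.toBytes(value.toString))
  (new ImmutableBytesWritable, put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)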

Using Spark to retrieve a HDFS file protected by Kerberos

2016-03-23 Thread Nkechi Achara
I am having issues setting up my Spark environment to read from a Kerberized HDFS file location. At the moment I have tried to do the following:

def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T) = ugi match {
  case None => code
  case Some(u) => u.doAs(new
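The helper is cut off above; the standard shape of that pattern wraps the thunk in a PrivilegedExceptionAction, with the UGI obtained from a keytab login (principal and keytab path below are placeholders):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T): T = ugi match {
  case None => code
  case Some(u) => u.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = code
  })
}

// log in from a keytab and run the HDFS read inside doAs
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "user@EXAMPLE.COM", "/path/to/user.keytab")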

Fwd: Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
On Tue, Jan 26 at 22:51, Ted Yu <yuzhih...@gmail.com> wrote:
> bq. (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L &&
>
> Have you included the above condition into consideration when inspecting the timestamps of the results?

Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
I am having an issue with the subtraction of a Long within an RDD to filter out items in the RDD that are within a certain time range. So my code filters an
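A hypothetical reconstruction of the filter under discussion, using the condition quoted in Ted Yu's reply above; the Auction case class and the non-negativity check are assumptions added for illustration:

case class Auction(id: String, timestampNanos: Long)

// keep auctions that occurred at most 1000 ns before the successful one;
// without the second check, later auctions produce negative differences
// that also satisfy the < 1000L condition
def withinWindow(successfulAuction: Auction, auction: Auction): Boolean =
  (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L &&
    successfulAuction.timestampNanos >= auction.timestampNanos

// usage: auctions.filter(a => withinWindow(successful, a))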

Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Nkechi Achara
Hi all, I am trying to get some summary statistics to retrieve the moving average for several devices that each have an array of latencies in seconds, in this kind of format: deviceLatencyMap = [K: String, Iterable[V: Double]]. I understand that there is a MultivariateSummary, but as this is a trait,
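Since MultivariateSummary is only a trait and the MLlib summarizers produce whole-column statistics rather than a windowed average, a plain Scala sliding window over each key's values may be the simplest route. A minimal sketch with made-up device data:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("moving-avg").setMaster("local[*]"))

val deviceLatencyMap = sc.parallelize(Seq(
  ("deviceA", Seq(1.2, 0.9, 1.5, 2.0, 1.1)),
  ("deviceB", Seq(0.4, 0.6, 0.5, 0.7))
))

val windowSize = 3
// sliding yields overlapping windows; average each one
val movingAverages = deviceLatencyMap.mapValues { latencies =>
  latencies.sliding(windowSize).map(w => w.sum / w.size).toList
}
movingAverages.collect().foreach(println)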