Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Nkechi Achara
I do not use rename; the files are written and then moved to a directory on
HDFS in gz format.

On 22 October 2016 at 15:14, Steve Loughran <ste...@hortonworks.com> wrote:

>
> > On 21 Oct 2016, at 15:53, Nkechi Achara <nkach...@googlemail.com> wrote:
> >
> > Hi,
> >
> > I am using Spark 1.5.0 to read gz files with textFileStream, but nothing is
> > picked up when new gz files are dropped into the specified directory. I know
> > this is specific to gz files, because when I extract a file into the same
> > directory its contents are read and processed on the next window.
> >
> > My code is here:
> >
> > val comments = ssc.fileStream[LongWritable, Text, TextInputFormat](
> >     "file:///tmp/", (f: Path) => true, newFilesOnly = false)
> >   .map(pair => pair._2.toString)
> > comments.foreachRDD(i => i.foreach(m => println(m)))
> >
> > Any idea why the gz files are not being recognized?
> >
> > Thanks in advance,
> >
> > K
>
> Are the files being written directly into the directory, or renamed in? You
> should be using rename() against a filesystem (not an object store) to make
> sure that a file isn't picked up before it is complete.
>
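
For reference, a minimal sketch of the write-then-rename pattern being
described (paths are placeholders): the .gz file is written and closed in a
staging location first, then renamed into the watched directory, so the
streaming job never sees a half-written file.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

val staging = new Path("/data/staging/comments-0001.gz")   // written and closed here first
val watched = new Path("/data/incoming/comments-0001.gz")  // directory the fileStream watches

// rename is atomic on HDFS, so the file only appears in the watched
// directory once it is complete
fs.rename(staging, watched)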


Issues with reading gz files with Spark Streaming

2016-10-21 Thread Nkechi Achara
Hi,

I am using Spark 1.5.0 to read gz files with textFileStream, but nothing is
picked up when new gz files are dropped into the specified directory. I know
this is specific to gz files, because when I extract a file into the same
directory its contents are read and processed on the next window.

My code is here:

val comments = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "file:///tmp/", (f: Path) => true, newFilesOnly = false)
  .map(pair => pair._2.toString)
comments.foreachRDD(i => i.foreach(m => println(m)))
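
One variant I have been considering is tightening the path filter so that only
completed .gz files are picked up; a sketch (whether this helps depends on how
the files are staged, and the temporary-suffix convention in the comment is an
assumption):

val comments = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "file:///tmp/",
    // only pick up completed .gz files; in-progress copies (e.g. the
    // ._COPYING_ suffix left by hdfs dfs -put) do not match and are skipped
    (f: Path) => f.getName.endsWith(".gz"),
    newFilesOnly = false)
  .map(pair => pair._2.toString)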

Any idea why the gz files are not being recognized?

Thanks in advance,

K


Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Nkechi Achara
Hi All,

I am trying to find some good examples of how to integrate Spark and Solr and
am coming up blank. Basically, the spark-solr implementation does not seem to
work correctly with CDH 5.5.2 (the Spark 1.5.x branch), where I am hitting
various dependency issues that I have not been fully able to unravel.

I have also attempted to implement a custom solution, where I copy the token
and JAAS config to each executor and set the necessary auth properties, but
this is still prone to failure due to serialization and Kerberos auth issues.
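
For illustration, the rough shape of that distribution step as a minimal
sketch (file names, paths and the Solr-side login config are placeholders;
spark.yarn.dist.files is the programmatic counterpart of spark-submit's
--files on YARN):

import org.apache.spark.SparkConf

// ship the JAAS config and keytab into every executor container and point the
// executor JVMs at the JAAS config via a system property
val conf = new SparkConf()
  .setAppName("solr-query")
  .set("spark.yarn.dist.files", "/local/path/jaas.conf,/local/path/solr-user.keytab")
  .set("spark.executor.extraJavaOptions",
    "-Djava.security.auth.login.config=jaas.conf")

// jaas.conf then refers to the keytab by its name in the container working
// directory (./solr-user.keytab) for the SolrJ / ZooKeeper login modules.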

Has anyone got an example of querying Solr in a distributed way where Kerberos
is involved?

Thanks,

K


Spark, Kryo Serialization Issue with ProtoBuf field

2016-07-13 Thread Nkechi Achara
Hi,

I am seeing an error when running my spark job relating to Serialization of
a protobuf field when transforming an RDD.

com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
otherAuthors_ (com.thomsonreuters.kraken.medusa.dbor.proto.Book$DBBooks)

The error seems to be created at this point:

val booksPerTier: Iterable[(TimeTier, RDD[DBBooks])] = allTiers.map { tier =>
  (tier,
    books
      .filter(b => isInTier(endOfInterval, tier, b) && !isBookPublished(b))
      .mapPartitions(it =>
        it.map(ord =>
          (ord.getAuthor, ord.getPublisherName, getGenre(ord.getSourceCountry)))))
}

val averagesPerAuthor = booksPerTier.flatMap { case (tier, opt) =>
  opt.map(o => (tier, o._1, PublisherCompanyComparison, o._3)).countByValue()
}

val averagesPerPublisher = booksPerTier.flatMap { case (tier, opt) =>
  opt.map(o => (tier, o._1, PublisherComparison(o._2), o._3)).countByValue()
}

The field is a list, specified in the protobuf-generated code as below:

otherAuthors_ = java.util.Collections.emptyList()

As you can see, the code does not actually use that field from the Book
protobuf, although it is still being transmitted over the network.
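
One workaround I am looking at (a sketch, and it assumes the
com.twitter:chill-protobuf artifact is on the classpath) is to register the
generated protobuf class with a protobuf-aware Kryo serializer, so Kryo
serialises it through protobuf's own wire format instead of trying to walk its
unmodifiable internal collections field by field:

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.protobuf.ProtobufSerializer
import org.apache.spark.serializer.KryoRegistrator

class ProtoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // serialise the generated message via its own toByteArray / parseFrom
    kryo.register(
      classOf[com.thomsonreuters.kraken.medusa.dbor.proto.Book.DBBooks],
      new ProtobufSerializer())
  }
}

// and on the SparkConf:
// .set("spark.kryo.registrator", classOf[ProtoRegistrator].getName)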

Has anyone got any advice on this?

Thanks,

K


Having issues passing properties to Spark in 1.5 in comparison to 1.2

2016-07-05 Thread Nkechi Achara
After using Spark 1.2 for quite a long time, I have realised that you can no
longer pass Spark configuration to the driver via --conf on the command line
(or, in my case, in a shell script).

I am thinking about using system properties and picking the config up using
the following bit of code:

def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf.getOption(name) orElse sys.props.get(name)

How do I pass a config.file option and a string version of the date (used as a
start time) to a spark-submit command?

I have attempted using the following in my start-up shell script:

... other config options ... \
--conf "spark.executor.extraJavaOptions=-Dconfig.file=../conf/mifid.conf -DstartTime=2016-06-04 00:00:00" \

but this fails because the space in the date splits the command up.
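
The only workaround I can think of so far is to avoid the space altogether,
e.g. passing the date in ISO-8601 form with a 'T' separator
(-DstartTime=2016-06-04T00:00:00) and parsing it inside the job; a sketch,
assuming Java 8's java.time is available:

import java.time.LocalDateTime

// reads the system property set via -DstartTime=2016-06-04T00:00:00
// (no space, so there is nothing for the shell or spark-submit to split on)
val startTime: Option[LocalDateTime] =
  sys.props.get("startTime").map(LocalDateTime.parse)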

Any idea how to do this successfully, or has anyone got any advice on this
one?

Thanks

K


Using saveAsNewAPIHadoopDataset for saving custom classes to HBase

2016-04-22 Thread Nkechi Achara
Hi All,

I am having a few issues saving my data to HBase.

I have created a pairRDD for my custom class using the following:

val rdd1 = rdd.map { it =>
  (getRowKey(it), it)
}

val job = Job.getInstance(hConf)
val jobConf = job.getConfiguration
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
job.setOutputFormatClass(classOf[TableOutputFormat[CustomClass]])

rdd1.saveAsNewAPIHadoopDataset(jobConf)

When I run it, I receive the error:

java.lang.ClassCastException: com.test.CustomClass cannot be cast to org.apache.hadoop.hbase.client.Mutation
  at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1036)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
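
From the stack trace, my guess is that TableOutputFormat expects the value side
of each pair to be an HBase Mutation (e.g. a Put) rather than the CustomClass
itself, so a sketch of the conversion I have in mind looks like this (untested;
the column family, qualifier and the serialisation of the class are
placeholders, and getRowKey is assumed to return a String):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// turn each CustomClass into a Put keyed by its row key; how the fields map
// to columns is up to the schema (a single serialised column here)
val puts = rdd.map { it =>
  val rowKey = Bytes.toBytes(getRowKey(it))
  val put = new Put(rowKey)
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes(it.toString))
  (new ImmutableBytesWritable(rowKey), put)
}

puts.saveAsNewAPIHadoopDataset(jobConf)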

Has anyone got a concrete example of how to use this function?
Also, does anyone know what it will actually save to HBase? Will it just be a
single column for the CustomClass?

Thanks,

Keech


Using Spark to retrieve a HDFS file protected by Kerberos

2016-03-23 Thread Nkechi Achara
I am having issues setting up my spark environment to read from a
kerberized HDFS file location.

At the moment I have tried to do the following:

def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T): T = ugi match {
  case None => code
  case Some(u) => u.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = code
  })
}

val sparkConf = defaultSparkConf.setAppName("file-test").setMaster("yarn-client")

val sc = ugiDoAs(ugi) { new SparkContext(sparkConf) }

val file = sc.textFile("path")

It fails at the point of creating the Spark Context, with the following
error:

Exception in thread "main" org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
  at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
  at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
  at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:155)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)


Has anyone got a simple example of how to allow Spark to connect to a
kerberized HDFS location?

I know that Spark needs to be in YARN mode to make this work, but the login
method does not seem to be working in this respect, although I know that the
UserGroupInformation (ugi) object is valid, as I have used it to connect to ZK
and HBase from the same code.
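
For reference, a minimal sketch of the keytab-based login I would expect to
work (principal and keytab path are placeholders), with the login performed
before the SparkContext is created; on YARN, spark-submit's --principal and
--keytab options are the other route I am aware of:

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

// log in from a keytab before touching YARN or HDFS
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)

val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "user@EXAMPLE.COM", "/path/to/user.keytab")

val sc = ugi.doAs(new PrivilegedExceptionAction[SparkContext] {
  override def run(): SparkContext =
    new SparkContext(new SparkConf().setAppName("file-test").setMaster("yarn-client"))
})

println(sc.textFile("hdfs:///path/to/file").count())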


Thanks!


Fwd: Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
Hi,

Yep, strangely I get values where the successful auction has a smaller
timestamp than the other relevant auctions.
I have also attempted to reverse the comparison, and I still receive auctions
that are greater than the successful auction, and of a greater value as well.

On 26 January 2016 at 22:51, Ted Yu <yuzhih...@gmail.com> wrote:

> bq.  (successfulAuction.timestampNanos - auction.timestampNanos) <
> 1000L &&
>
> Have you included the above condition into consideration when inspecting
> timestamps of the results ?
>
> On Tue, Jan 26, 2016 at 1:10 PM, Nkechi Achara <nkach...@googlemail.com>
> wrote:
>
>>
>>
>> (Also posted on Stack Overflow:
>> <http://stackoverflow.com/questions/35009560/anyone-had-issues-with-subtraction-of-a-long-within-an-rdd?noredirect=1#>)
>>
>> I am having an issue with the subtraction of a long within an RDD to
>> filter out items in the RDD that are within a certain time range.
>>
>> So my code filters an RDD of case class auctions, with an object of
>> successfulAuctions(Long, Int, String):
>>
>> auctions.filter(it => relevantAuctions(it, successfulAuctions))
>>
>> The successfulAuctions object is made up of a timestamp: Long, an itemID:
>> Int, and a direction: String (BUY/SELL).
>>
>> The relevantAuctions function basically uses tail recursion to find the
>> auctions in a time range for the exact item and direction.
>>
>> @tailrec
>> def relevantAuctions(auction: Auction, successfulAuctions: List[(Long, String, String)]): Boolean =
>>   successfulAuctions match {
>>     case sample :: xs => if (isRelevantAuction(auction, sample)) true else relevantAuctions(auction, xs)
>>     case Nil => false
>>   }
>>
>> This then feeds into another method in the if statement that checks the
>> timestamp in the sample is within a 10ms range, and the item ID is the
>> same, as is the direction.
>>
>> def isRelevantAuction(auction: Auction, successfulAuction: (Long, String, String)): Boolean = {
>>   (successfulAuction.timestampNanos - auction.timestampNanos) >= 0 &&
>>     (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L &&
>>     auction.itemID == successfulAuction.itemID &&
>>     auction.direction == successfulAuction.direction
>> }
>>
>> I am having issues where the range check is not entirely working: the
>> timestamps I am receiving back are not within the required range, although
>> the item ID and direction matching seem to be working correctly.
>>
>> The results I am getting are as follows: when I have a timestamp of
>> 1431651108749267459 for the successful auction, I am receiving other
>> auctions with timestamps GREATER than this, where they should be less.
>>
>> The auctions I am receiving have the timestamps of:
>>
>> 143165110874932660314316511087493307321431651108749537901
>>
>> Has anyone experienced this phenomenon?
>>
>> Thanks!
>>
>
>


Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Nkechi Achara
I am having an issue with the subtraction of a long within an RDD to filter
out items in the RDD that are within a certain time range.

So my code filters an RDD of case class auctions, with an object of
successfulAuctions(Long, Int, String):

auctions.filter(it => relevantAuctions(it, successfulAuctions))

The successfulAuctions object is made up of a timestamp: Long, an itemID:
Int, and a direction: String (BUY/SELL).

The relevantAuctions function basically uses tail recursion to find the
auctions in a time range for the exact item and direction.

@tailrec
def relevantAuctions(auction: Auction, successfulAuctions: List[(Long, String, String)]): Boolean =
  successfulAuctions match {
    case sample :: xs => if (isRelevantAuction(auction, sample)) true else relevantAuctions(auction, xs)
    case Nil => false
  }

This then feeds into another method in the if statement that checks the
timestamp in the sample is within a 10ms range, and the item ID is the
same, as is the direction.

def isRelevantAuction(auction: Auction, successfulAuction: (Long, String, String)): Boolean = {
  (successfulAuction.timestampNanos - auction.timestampNanos) >= 0 &&
    (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L &&
    auction.itemID == successfulAuction.itemID &&
    auction.direction == successfulAuction.direction
}

I am having issues where the range check is not entirely working: the
timestamps I am receiving back are not within the required range, although the
item ID and direction matching seem to be working correctly.

The results I am getting are as follows: when I have a timestamp of
1431651108749267459 for the successful auction, I am receiving other auctions
with timestamps GREATER than this, where they should be less.

The auctions I am receiving have the timestamps of:

1431651108749326603
1431651108749330732
1431651108749537901
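
For concreteness, subtracting the successful auction's timestamp from each of
those values (all three come out positive, i.e. each returned auction is later
than the successful one):

val successful = 1431651108749267459L
val returned   = Seq(1431651108749326603L, 1431651108749330732L, 1431651108749537901L)

// Seq(59144, 63273, 270442) -- the differences in nanoseconds
returned.map(_ - successful)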

Has anyone experienced this phenomenon?

Thanks!


Any beginner samples for using ML / MLlib to produce a moving average of a (K, Iterable[V])

2015-07-15 Thread Nkechi Achara
Hi all,

I am trying to get some summary statistics to retrieve the moving average for
several devices that each have an array of latencies (in seconds) in this kind
of format:

deviceLatencyMap = [K: String, Iterable[V: Double]]

I understand that there is a MultivariateSummary, but as this is a trait I
can't work out what to use in its stead.
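
For what it is worth, a plain-RDD sketch of a per-device moving average (no
MLlib involved), assuming deviceLatencyMap is an RDD[(String, Iterable[Double])]
and taking a window size of 3 as an arbitrary example:

import org.apache.spark.rdd.RDD

val windowSize = 3  // arbitrary example window

// for each device, average every sliding window of windowSize latencies
val movingAverages: RDD[(String, Seq[Double])] =
  deviceLatencyMap.mapValues { latencies =>
    latencies.toSeq
      .sliding(windowSize)
      .map(window => window.sum / window.size)
      .toSeq
  }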

If you need any more code, please let me know.

Thanks All.

K