Re: Fast Unit Tests

2018-05-01 Thread Geoff Von Allmen
I am pretty new to spark/scala myself, but I just recently implemented unit tests to test my transformations/aggregations and such myself. I’m using the mrpowers spark-fast-tests and spark-daria libraries. I

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Geoff Von Allmen
> To set any runtime properties of the local instance. Note that it is > possible (and I am more convinced of this as time goes on) that alluxio > simply does not work in spark local mode as described above. > > > On Apr 13, 2018, at 11:09 AM, Geoff Von Allmen <ge...@ibleducation.co

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Geoff Von Allmen
I fought with a ClassNotFoundException for quite some time, but it was for kafka. The final configuration that got everything working was running spark-submit with the following options: --jars "/path/to/.ivy2/jars/package.jar" \ --driver-class-path "/path/to/.ivy2/jars/package.jar" \ --conf

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Geoff Von Allmen
axRate - spark.streaming.kafka.maxRatePerPartition On Mon, Mar 19, 2018 at 5:27 PM, kant kodali <kanth...@gmail.com> wrote: > Yes it indeed makes sense! Is there a way to get incremental counts when I > start from 0 and go through 10M records? perhaps count for every micro > batch or something? > > On Mon, Mar 19,

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread Geoff Von Allmen
Trigger does not mean report the current solution every 'trigger seconds'. It means it will attempt to fetch new data and process it no faster than trigger seconds intervals. If you're reading from the beginning and you've got 10M entries in kafka, it's likely pulling everything down then

Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Geoff Von Allmen
I see in the documentation that the distinct operation is not supported in Structured Streaming. That being said, I have noticed that you are able to successfully call distinct() on a data

Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-27 Thread Geoff Von Allmen
und > (note that those must be accessible from each node in the cluster) > i suggest to go over the manual > <https://spark.apache.org/docs/latest/submitting-applications.html> > > Eyal > > > On Wed, Dec 27, 2017 at 1:08 AM, Geoff Von Allmen <ge...@ibleducation.com> >

Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-26 Thread Geoff Von Allmen
I am trying to deploy a standalone cluster but running into ClassNotFound errors. I have tried a myriad of approaches, ranging from packaging all dependencies into a single JAR to using the --packages and --driver-class-path options. I’ve got a master node started, a slave node