Re: Spark based Data Warehouse

2017-11-17 Thread lucas.g...@gmail.com
We are using Spark on Kubernetes on AWS (it's a long story) but it does work. It's still on the raw side, but we've been pretty successful. We configured our cluster primarily with Kube-AWS and auto-scaling groups. There are gotchas there, but so far we've been quite successful. Gary Lucas

Re: Spark based Data Warehouse

2017-11-17 Thread ashish rawat
Thanks, everyone, for your suggestions. Do any of you take care of auto scale-up and scale-down of your underlying Spark clusters on AWS? On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" wrote: Hi Ashish, bear in mind that EMR has some additional tooling available that smoothes

SpecificColumnarIterator has grown past JVM limit of 0xFFFF

2017-11-17 Thread Md. Rezaul Karim
Dear All, I was training a RandomForest with an input dataset having 20,000 columns and 12,000 rows. But when I start the training, it throws an exception: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator has grown past JVM limit of
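One commonly suggested mitigation for very wide inputs, sketched here rather than taken from the thread, is to assemble the feature columns into a single vector column before training, so the generated code iterates over one column instead of 20,000. The "label" column name and the use of RandomForestClassifier (rather than the regressor) are assumptions:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Assemble all feature columns into a single vector column
// ("label" as the target column name is an assumption).
val featureCols = df.columns.filter(_ != "label")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val assembled = assembler.transform(df).select("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = rf.fit(assembled)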

History server and non-HDFS filesystems

2017-11-17 Thread Paul Mackles
Hi - I had originally posted this as a bug (SPARK-22528) but given my uncertainty, it was suggested that I send it to the mailing list instead... We are using Azure Data Lake (ADL) to store our event logs. This worked fine in 2.1.x, but in 2.2.0 the underlying files are no longer visible to the
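For reference, a minimal sketch of the event-log settings involved, with a hypothetical ADL store name and path; the history server side must point spark.history.fs.logDirectory at the same URI:

import org.apache.spark.sql.SparkSession

// Write event logs to ADL (store name and path are hypothetical).
val spark = SparkSession.builder
  .appName("app-with-event-logs")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "adl://myaccount.azuredatalakestore.net/spark-events")
  .getOrCreate()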

Spark Streaming in Wait mode

2017-11-17 Thread KhajaAsmath Mohammed
Hi, I am running a Spark Streaming job and it is not picking up the next batches, but the job still shows as running on YARN. Is this expected behavior if there is no data, or is it waiting for data to pick up? I am almost 4 hours behind on batches (30-minute interval).

Union of streaming dataframes

2017-11-17 Thread Lalwani, Jayesh
Is a union of 2 Structured Streaming dataframes from different sources supported in 2.2? We have done a union of 2 streaming dataframes that are from the same source. I wanted to know if multiple streams are supported, or going to be supported in the future.
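A minimal sketch of unioning two independent streaming sources (the paths and schema are hypothetical); union matches columns by position, so the two schemas must line up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("UnionStreams").getOrCreate()

// Two independent file-based streaming sources with matching schemas
val schema = new StructType().add("id", "long").add("event", "string")
val streamA = spark.readStream.schema(schema).json("/data/source-a")
val streamB = spark.readStream.schema(schema).json("/data/source-b")

val combined = streamA.union(streamB)

val query = combined.writeStream
  .format("console")
  .start()
query.awaitTermination()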

Re: Struct Type

2017-11-17 Thread Jacek Laskowski
Hi, Use the explode function, the filter operator, and the collect_list function. Or the "heavier" flatMap. Regards, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow
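A minimal sketch of that approach, assuming the tags column is an array of key/value structs as described in the question below, and a hypothetical id column to group by:

import org.apache.spark.sql.functions.{col, collect_list, explode}

// Explode the array so each struct becomes its own row
// ("id" as a grouping column is an assumption).
val exploded = df.select(col("id"), explode(col("tags")).as("tag"))

// Keep only the MaxSpeed entries and gather their values per id
val maxSpeeds = exploded
  .filter(col("tag.key") === "MaxSpeed")
  .groupBy("id")
  .agg(collect_list(col("tag.value")).as("maxSpeedValues"))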

Re: Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-17 Thread Cody Koeninger
I don't see anything obvious; you'd need to do more troubleshooting. You could also try creating a single RDD for a known range of offsets: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-an-rdd On Wed, Nov 15, 2017 at 9:33 PM, jagadish kagitala
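A hedged sketch of that suggestion using the 0-10 integration; the broker address, group id, topic, partition, and offsets are placeholders, and an existing SparkContext sc is assumed:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

// Same consumer settings as the streaming job (values are placeholders)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "debug-offsets"
).asJava

// topic, partition, fromOffset (inclusive), untilOffset (exclusive)
val offsetRanges = Array(OffsetRange("mytopic", 0, 100L, 200L))

val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

rdd.map(r => (r.key, r.value)).take(10).foreach(println)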

Struct Type

2017-11-17 Thread KhajaAsmath Mohammed
Hi, I have the following schema in a dataframe, and I want to extract the key that matches MaxSpeed from the array, along with its corresponding value.

 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |--

Re: Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
Note that I have 1+ TB. If I didn't mind things being slow, I wouldn't be using Spark. On 17 November 2017 at 11:06, Sebastian Piu wrote: > If you don't want to recalculate, you need to hold the results somewhere, > or if you need to save it, why don't you do that

Re: Multiple transformations without recalculating or caching

2017-11-17 Thread Sebastian Piu
If you don't want to recalculate, you need to hold the results somewhere; or if you need to save it anyway, why don't you do that and then read it again and get your stats? On Fri, 17 Nov 2017, 10:03 Fernando Pereira, wrote: > Dear Spark users > > Is it possible to take the output of
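A minimal sketch of that suggestion, assuming a hypothetical expensiveDf, output path, and "category" column:

// Materialize the expensive intermediate result once
val path = "/data/intermediate.parquet"
expensiveDf.write.mode("overwrite").parquet(path)

// Read it back; downstream jobs reuse the saved data instead of recomputing
val materialized = spark.read.parquet(path)

// Two independent computations, neither of which re-runs the upstream work
val stats = materialized.describe()
val counts = materialized.groupBy("category").count()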

Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
Dear Spark users, Is it possible to take the output of a transformation (RDD/DataFrame) and feed it to two independent transformations, without recalculating the first transformation and without caching the whole dataset? Consider the case of a very large dataset (1+ TB) which suffered several