[Spark SQL] Catalyst ScalaReflection/ExpressionEncoder fail with relocated (shaded) classes

2018-09-14 Thread johkelly
Hello, I'm trying to compile Google's timestamp.proto protobuf to a Scala case class and use it as a field in another proto-derived case class as part of a larger dataset schema. (Although the SQL date type might be preferred in a schema, I encountered this problem when I attempted to use

[SparkSQL] Count Distinct issue

2018-09-14 Thread Daniele Foroni
Hi all, I am having trouble doing a count distinct over multiple columns. This is an example of my data:
+----+----+----+---+
|a   |b   |c   |d  |
+----+----+----+---+
|null|null|null|1  |
|null|null|null|2  |
|null|null|null|3  |
|null|null|null|4  |
|null|null|null|5  |
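
One thing that can trip up a multi-column count distinct on data like this: Spark SQL's count(DISTINCT a, b, ...) only counts rows in which every listed expression is non-null. The plain-Python sketch below emulates that rule on the five rows shown. It is not Spark code, the function names are made up for illustration, and whether this is the cause of the problem in the thread is an assumption, since the message is cut off.

```python
# Rows from the example above: columns a, b, c are null, d runs from 1 to 5.
rows = [(None, None, None, d) for d in range(1, 6)]


def count_distinct_sql(rows):
    """Emulate SQL COUNT(DISTINCT c1, c2, ...): rows with any NULL are skipped."""
    return len({r for r in rows if all(v is not None for v in r)})


def count_distinct_null_safe(rows):
    """Count distinct tuples, treating NULL as an ordinary value."""
    return len(set(rows))


print(count_distinct_sql(rows))        # 0
print(count_distinct_null_safe(rows))  # 5
```

In Spark itself, selecting the columns and calling distinct().count() is one way to get the second, null-safe number.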

Re: StackOverflow Error when run ALS with 100 iterations

2018-09-14 Thread LeoB
Just wanted to add a comment to the Jira ticket but I don't think I have permission to do so, so answering here instead. I am encountering the same issue with a stackOverflow Exception. I would like to point out that there is a localCheckpoint

Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included. An approach like this solved the last problem I had that seemed like this -

What is the best way for Spark to read HDF5@scale?

2018-09-14 Thread kathleen li
Hi, Is there any Spark connector for HDF5? The following link does not work anymore: https://www.hdfgroup.org/downloads/spark-connector/ Thanks, Kathleen

Re: Unsubscribe

2018-09-14 Thread Mohan Palavancha
On Thu, Sep 13, 2018 at 7:47 PM Pekka Lehtonen wrote: > >

Spark2 DynamicAllocation doesn't release executors that used cache

2018-09-14 Thread Sergejs Andrejevs
Hi, We're starting to use Spark2 with use cases for Dynamic Allocation. However, we noticed that it doesn't work as expected when a dataset is cached (persisted). The cluster runs with: CDH 5.15.0, Spark 2.3.0, Oracle Java 8.131. The following configs are passed to Spark (as well as set up at the cluster level): #
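
For context, Spark has a separate idle timeout for executors that hold cached blocks, spark.dynamicAllocation.cachedExecutorIdleTimeout, and its documented default is infinity, so such executors are not released for idleness unless it is set. The sketch below only shows how such settings could be rendered as spark-submit arguments. The values and the helper name to_submit_args are hypothetical, and the poster's actual configuration is not visible because the message is cut off.

```python
# Hypothetical values; only the property names are real Spark settings.
conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    # Idle timeout for executors without cached data.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Idle timeout for executors holding cached blocks (default: infinity).
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "600s",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(conf)))
```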

DAGScheduler in SparkStreaming

2018-09-14 Thread Guillermo Ortiz
A question: if you use Spark Streaming, is the DAG calculated for each microbatch? Is it possible to calculate it only the first time?

Is there any open source framework that converts Cypher to SparkSQL?

2018-09-14 Thread kant kodali
Hi All, Is there any open source framework that converts Cypher to SparkSQL? Thanks!

Re: Local vs Cluster

2018-09-14 Thread Apostolos N. Papadopoulos
Hi Aakash, in the cluster you need to consider the total number of executors you are using. Please take a look at the following link for an introduction. https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html regards, Apostolos
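
The point about counting executors can be sketched as simple arithmetic: local[N] gives N concurrent task slots in one JVM, while on a cluster the comparable figure is the number of executors times the cores per executor (set, for example, with --num-executors and --executor-cores on YARN, or capped with --total-executor-cores in standalone mode). The function names and numbers below are illustrative assumptions, not taken from the thread.

```python
def local_slots(n):
    """local[N]: one JVM running N worker threads, so N tasks at a time."""
    return n


def cluster_slots(num_executors, executor_cores):
    """Cluster: each executor can run executor_cores tasks concurrently."""
    return num_executors * executor_cores


print(local_slots(4))        # 4
print(cluster_slots(2, 2))   # 4
print(cluster_slots(10, 4))  # 40
```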

Re: Local vs Cluster

2018-09-14 Thread Mich Talebzadeh
Local: only one JVM, runs on the host from which you submitted the job
${SPARK_HOME}/bin/spark-submit \
    --master local[N] \
Standalone: meaning using Spark's own scheduler
${SPARK_HOME}/bin/spark-submit \
    --master spark:// \
Where IP_ADDRESS is the host your Spark master

Local vs Cluster

2018-09-14 Thread Aakash Basu
Hi, What is the Spark cluster equivalent of standalone's local[N]? I mean, which parameter takes the value N, which we pass to local, in cluster mode? Thanks, Aakash.