Re: Spark Shell issue on HDInsight

2017-05-11 Thread ayan guha
Works for me too. You are a life-saver :) But the question: should we report this to the Azure team, and if so, how? On Fri, May 12, 2017 at 10:32 AM, Denny Lee wrote: > I was able to repro your issue when I had downloaded the jars via blob but > when I downloaded them as raw, I was

Best Practice for Enum in Spark SQL

2017-05-11 Thread Mike Wheeler
Hi Spark Users, I want to store Enum type (such as Vehicle Type: Car, SUV, Wagon) in my data. My storage format will be parquet and I need to access the data from Spark-shell, Spark SQL CLI, and hive. My questions: 1) Should I store my Enum type as String or store it as numeric encoding (aka

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
I was able to repro your issue when I had downloaded the jars via blob but when I downloaded them as raw, I was able to get everything up and running. For example: wget https://github.com/Azure/azure-documentdb-spark/*blob*
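The blob-versus-raw distinction above can be sketched in code. On github.com, a `/blob/` URL serves the HTML page that renders a file, while the matching `/raw/` URL serves the file's bytes, so fetching a jar through a `/blob/` URL stores an HTML page under a `.jar` name. The helper below is only an illustration: the function name `github_blob_to_raw` is made up here, and the example path (`master/lib/example.jar`) is hypothetical, since the URL in the message is truncated.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com '/blob/' page URL into its '/raw/' download URL."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path shape: /<owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[3] != "blob":
        raise ValueError(f"not a github.com blob URL: {url}")
    segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


# Hypothetical file path, used only to show the rewrite.
example = "https://github.com/Azure/azure-documentdb-spark/blob/master/lib/example.jar"
raw_url = github_blob_to_raw(example)
print(raw_url)
```

The rewritten URL is the one to hand to wget; a quick sanity check after downloading is that the file is a real archive rather than a small HTML document.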

Matrix multiplication and cluster / partition / blocks configuration

2017-05-11 Thread John Compitello
Hey all, I’ve found myself in a position where I need to do a relatively large matrix multiply (at least, compared to what I normally have to do). I’m looking to multiply a 100k by 500k dense matrix by its transpose to yield 100k by 100k matrix. I’m trying to do this on Google Cloud, so I
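For a sense of scale, the sizes of the matrices in this question can be worked out with simple arithmetic. The sketch below assumes 8-byte double-precision elements and ignores any storage overhead; both are assumptions, since the message does not state the element type.

```python
BYTES_PER_DOUBLE = 8  # assumption: elements are 64-bit floating-point values


def dense_matrix_bytes(rows: int, cols: int, bytes_per_element: int = BYTES_PER_DOUBLE) -> int:
    """Raw size of a dense rows x cols matrix, ignoring any container overhead."""
    return rows * cols * bytes_per_element


# Dimensions taken from the question: a 100k x 500k input and a 100k x 100k product.
input_bytes = dense_matrix_bytes(100_000, 500_000)
output_bytes = dense_matrix_bytes(100_000, 100_000)

print(f"input:  {input_bytes / 1e9:.0f} GB")
print(f"output: {output_bytes / 1e9:.0f} GB")
```

Under these assumptions the input alone is about 400 GB and the product about 80 GB, so neither fits on a single typical worker and the partition or block layout determines how much data each executor must hold at once.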

RE: Spark consumes more memory

2017-05-11 Thread Anantharaman, Srinatha (Contractor)
Rick, Thank you for the input. The space issue is now resolved: yarn.nodemanager.local.dirs and yarn.nodemanager.log.dirs were filling up. Why should 5 GB of data take 10 minutes to load with 7-8 executors with 2 cores? I also see that the executors' memory is up to 7-20 GB. If 5 GB of data takes

Re: Spark <--> S3 flakiness

2017-05-11 Thread Vadim Semenov
Use the official mailing list archive http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
You might want to try gzip as opposed to parquet. The only way I ever reliably got parquet to work on S3 was by using Alluxio as a buffer, but it's a decent amount of work. On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Also, and this is unrelated to the actual question... Why don't these messages show up in the archive? http://apache-spark-user-list.1001560.n3.nabble.com/ Ideally I'd want to post a link to our internal wiki for these questions, but can't find them in the archive. On 11 May 2017 at 07:16,

Re: Spark consumes more memory

2017-05-11 Thread Rick Moritz
I would try to track down the "no space left on device" error and find out where it originates, since you should be able to allocate 10 executors with 4 cores and 15GB RAM each quite easily. In that case, you may want to increase overhead, so yarn doesn't kill your executors. Check that no local

Spark consumes more memory

2017-05-11 Thread Anantharaman, Srinatha (Contractor)
Hi, I am reading a Hive ORC table into memory. The StorageLevel is set to StorageLevel.MEMORY_AND_DISK_SER. The total size of the Hive table is 5GB. I started the spark-shell as below: spark-shell --master yarn --deploy-mode client --num-executors 8 --driver-memory 5G --executor-memory 7G

BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
I realized that in Spark ML, BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC. Why is that? What if I need other metrics such as F-score or accuracy? I tried to use MulticlassClassificationEvaluator to evaluate other metrics such as Accuracy for a binary classification
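As a point of reference for the metrics named in this question, accuracy and F-score for a binary classifier follow directly from the confusion-matrix counts. The sketch below is plain Python, not Spark code; the function name `binary_metrics` and the sample labels are made up for illustration, and the positive class is assumed to be encoded as 1.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall and F1, treating label 1 as positive."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up sample data: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(labels, predictions)
print(metrics)
```

Unlike the area-under-curve metrics, these values depend on the threshold used to turn scores into 0/1 predictions, so they describe one operating point rather than the whole curve.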

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Looks like this isn't viable in spark 2.0.0 (and greater I presume). I'm pretty sure I came across this blog and ignored it due to that. Any other thoughts? The linked tickets in: https://issues.apache.org/jira/browse/SPARK-10063 https://issues.apache.org/jira/browse/HADOOP-13786

Re: running spark program on intellij connecting to remote master for cluster

2017-05-11 Thread s t
Hello David, Let me make it clearer: * There is no Spark installed on the Windows laptop, just IntelliJ and the related dependencies. * SparkLauncher is a good starting point for submitting a job programmatically, but I am not sure if my current problem is related to job

Re: Spark Shell issue on HDInsight

2017-05-11 Thread ayan guha
Hi, thanks for the reply, but unfortunately it did not work. I am getting the same error. sshuser@ed0-svochd:~/azure-spark-docdb-test$ spark-shell --jars azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.jar SPARK_MAJOR_VERSION is set to 2, using Spark2 Setting default log level to "WARN".

Reading Avro messages from Kafka using Structured Streaming in Spark 2.1

2017-05-11 Thread Revin Chalil
I am trying to convert avro records with field type = bytes to json string using Structured Streaming in Spark 2.1. Please see below. object AvroConvert { case class KafkaMessage( payload: String ) val schemaString ="""{