Re: CrossValidation and hypothetical failure

2017-05-11 Thread Jörn Franke
Use several jobs and orchestrate them, e.g. via Oozie. These jobs can then save intermediate results to disk and load them from there. Alternatively (or additionally!) you may use persist (to memory and disk), but I am not sure this is suitable for such long-running applications. > On 12. May 2
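The multi-job approach above can be sketched in plain stdlib Python (the stage functions here are hypothetical stand-ins for real Spark jobs, and the checkpoint directory name is an assumption): each stage writes its result to disk, so an orchestrator such as Oozie can simply rerun the whole pipeline after a failure and only unfinished stages execute again.

```python
import json
import os

def run_stage(name, func, out_dir="checkpoints"):
    """Run one pipeline stage, skipping it if its output already exists.

    An orchestrator (Oozie, cron, ...) can rerun the whole script after a
    failure; completed stages are reloaded from disk instead of recomputed.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}.json")
    if os.path.exists(path):                 # stage already done: reload
        with open(path) as f:
            return json.load(f)
    result = func()                          # do the actual work
    with open(path, "w") as f:               # persist before moving on
        json.dump(result, f)
    return result

# hypothetical stages standing in for real Spark jobs
raw = run_stage("extract", lambda: [1, 2, 3, 4])
doubled = run_stage("transform", lambda: [x * 2 for x in raw])
```

Rerunning the script is idempotent: a stage whose JSON file exists is loaded rather than recomputed.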

CrossValidation and hypothetical failure

2017-05-11 Thread issues solution
Hi, we often perform a grid search with cross-validation under PySpark to find the best parameters, but sometimes an error occurs that is related not to the computation but to the network or something else. How can we save intermediate results, particularly when a job runs for 3 or 4 days
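One way to avoid losing days of work is to checkpoint each parameter combination's score as it finishes, then skip already-evaluated combinations on restart. A minimal stdlib-Python sketch of that pattern (the `evaluate` callback is a hypothetical stand-in for one CrossValidator fit; the checkpoint filename is an assumption):

```python
import itertools
import json
import os

def checkpointed_grid_search(param_grid, evaluate, ckpt="cv_results.json"):
    """Evaluate every parameter combination, persisting each score as it
    finishes so a network or node failure loses at most one combination."""
    results = {}
    if os.path.exists(ckpt):                 # resume from a prior run
        with open(ckpt) as f:
            results = json.load(f)
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        tag = json.dumps(params, sort_keys=True)
        if tag in results:                   # already evaluated: skip
            continue
        results[tag] = evaluate(params)      # e.g. one CV fit in Spark
        with open(ckpt, "w") as f:           # checkpoint after each combo
            json.dump(results, f)
    best = max(results, key=results.get)
    return json.loads(best), results[best]

# toy evaluate standing in for an actual cross-validated model fit
grid = {"regParam": [0.01, 0.1], "maxIter": [10, 20]}
best_params, best_score = checkpointed_grid_search(
    grid, lambda p: p["maxIter"] - p["regParam"])
```

After a crash, rerunning resumes from `cv_results.json` and only evaluates the remaining combinations.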

Re: Spark Shell issue on HDInsight

2017-05-11 Thread ayan guha
Works for me too, you are a life-saver :) But the question: should/how do we report this to the Azure team? On Fri, May 12, 2017 at 10:32 AM, Denny Lee wrote: > I was able to repro your issue when I had downloaded the jars via blob but > when I downloaded them as raw, I was able to get everything up

Best Practice for Enum in Spark SQL

2017-05-11 Thread Mike Wheeler
Hi Spark Users, I want to store Enum type (such as Vehicle Type: Car, SUV, Wagon) in my data. My storage format will be parquet and I need to access the data from Spark-shell, Spark SQL CLI, and hive. My questions: 1) Should I store my Enum type as String or store it as numeric encoding (aka 1=C
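For the string-vs-numeric question, both encodings can be sketched with a pair of lookup tables (the vehicle names and 1-based codes here just follow the example in the question). Worth noting that Parquet dictionary-encodes repeated strings anyway, so the storage savings of integer codes are often smaller than expected, while strings stay readable from Spark SQL and Hive without a join back to a code table.

```python
# hypothetical vehicle-type enum and its two candidate storage encodings
VEHICLE_TYPES = ["Car", "SUV", "Wagon"]
CODE = {name: i + 1 for i, name in enumerate(VEHICLE_TYPES)}  # 1=Car, 2=SUV, ...
NAME = {code: name for name, code in CODE.items()}

def encode(vehicle_type):
    """Store as a compact numeric code."""
    return CODE[vehicle_type]

def decode(code):
    """Recover the readable string for ad-hoc queries in Spark SQL / Hive."""
    return NAME[code]
```

Whichever encoding is chosen, keeping the mapping in one place like this avoids the codes drifting between writers and readers.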

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
I was able to repro your issue when I had downloaded the jars via blob but when I downloaded them as raw, I was able to get everything up and running. For example: wget https://github.com/Azure/azure-documentdb-spark/*blob* /master/releases/azure-documentdb-spark-0.0.3_2.0.2_2.11/azure-documentdb
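The blob-vs-raw distinction above (a GitHub `blob` URL serves an HTML page, not the jar bytes) can be handled with a small URL rewrite before calling wget. A sketch, assuming the standard GitHub URL layout:

```python
def github_blob_to_raw(url):
    """Turn a GitHub 'blob' page URL (which serves HTML) into the
    raw-content URL that wget needs for the actual file bytes."""
    return (url
            .replace("github.com", "raw.githubusercontent.com", 1)
            .replace("/blob/", "/", 1))

raw_url = github_blob_to_raw(
    "https://github.com/Azure/azure-documentdb-spark/blob/master/releases/x.jar")
```

Downloading `raw_url` instead of the blob URL yields the jar itself rather than an HTML wrapper around it.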

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Interesting, the links here: http://spark.apache.org/community.html point to: http://apache-spark-user-list.1001560.n3.nabble.com/ On 11 May 2017 at 12:35, Vadim Semenov wrote: > Use the official mailing list archive > > http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/% > 3ccaj

Matrix multiplication and cluster / partition / blocks configuration

2017-05-11 Thread John Compitello
Hey all, I’ve found myself in a position where I need to do a relatively large matrix multiply (at least, compared to what I normally have to do). I’m looking to multiply a 100k by 500k dense matrix by its transpose to yield a 100k by 100k matrix. I’m trying to do this on Google Cloud, so I don’
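A product like A·Aᵀ parallelizes naturally by block-rows, which is essentially how a distributed block-matrix multiply partitions the work: each worker computes one block-row of the output from its slice of A, so the full 100k x 100k result never has to materialize in one place. A toy stdlib-Python sketch of that decomposition (list-of-lists matrices; a real job would use Spark's distributed block matrices instead):

```python
def matmul(A, B):
    """Plain dense multiply of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gram_by_row_blocks(A, block):
    """Compute A @ A^T one block-row at a time, mirroring how a
    distributed block-matrix multiply would partition the work."""
    At = [list(col) for col in zip(*A)]          # transpose once
    out = []
    for i in range(0, len(A), block):
        out.extend(matmul(A[i:i + block], At))   # one block-row of A·Aᵀ
    return out

A = [[1, 2], [3, 4], [5, 6]]                     # toy 3x2 stand-in
G = gram_by_row_blocks(A, block=2)               # 3x3 Gram matrix
```

The block size trades per-task memory against scheduling overhead; the result is identical for any block size.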

RE: Spark consumes more memory

2017-05-11 Thread Anantharaman, Srinatha (Contractor)
Rick, thank you for the input. The space issue is now resolved: yarn.nodemanager.local.dirs and yarn.nodemanager.log.dirs were filling up. But for 5 GB of data, why should it take 10 minutes to load with 7-8 executors with 2 cores each? I also see all the executors' memory is up to 7-20 GB. If 5 GB of data takes

Re: Spark <--> S3 flakiness

2017-05-11 Thread Vadim Semenov
Use the official mailing list archive http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the actual question... Why

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
Might want to try using gzip as opposed to parquet. The only way I ever reliably got parquet to work on S3 is by using Alluxio as a buffer, but it's a decent amount of work. On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the actual question... Why

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Also, and this is unrelated to the actual question... Why don't these messages show up in the archive? http://apache-spark-user-list.1001560.n3.nabble.com/ Ideally I'd want to post a link to our internal wiki for these questions, but can't find them in the archive. On 11 May 2017 at 07:16, lucas

Re: Spark consumes more memory

2017-05-11 Thread Rick Moritz
I would try to track down the "no space left on device" - find out where that originates from, since you should be able to allocate 10 executors with 4 cores and 15GB RAM each quite easily. In that case, you may want to increase overhead, so yarn doesn't kill your executors. Check that no local driv
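A hedged sketch of what "increase overhead" could look like on the command line, using the sizes suggested above (the property name follows Spark 2.x on YARN, and the 2048 MB value is an illustrative assumption, not a recommendation):

```shell
# Resubmit with explicit off-heap overhead so YARN does not kill
# executors that exceed their container limit (Spark 2.x property name):
spark-shell \
  --master yarn --deploy-mode client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 15G \
  --conf spark.yarn.executor.memoryOverhead=2048   # MB on top of the heap
```

YARN sizes the container as executor memory plus this overhead, so raising it gives off-heap allocations headroom without enlarging the JVM heap.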

Spark consumes more memory

2017-05-11 Thread Anantharaman, Srinatha (Contractor)
Hi, I am reading a Hive Orc table into memory, StorageLevel is set to (StorageLevel.MEMORY_AND_DISK_SER) Total size of the Hive table is 5GB Started the spark-shell as below spark-shell --master yarn --deploy-mode client --num-executors 8 --driver-memory 5G --executor-memory 7G --executor-cores

BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
I realized that in Spark ML, BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC. Why is that? What if I need other metrics such as F-score or accuracy? I tried to use MulticlassClassificationEvaluator to evaluate other metrics such as Accuracy for a binary classification pro
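Until a built-in evaluator covers them, metrics like accuracy and F-score can be derived directly from (prediction, label) pairs via the confusion-matrix counts. A stdlib-Python sketch of that computation (in a real job the pairs would be collected or aggregated from the predictions, which this example does not show):

```python
def binary_metrics(pairs):
    """Compute accuracy, precision, recall and F1 from (prediction, label)
    pairs using the four confusion-matrix counts."""
    tp = sum(1 for p, l in pairs if p == 1 and l == 1)  # true positives
    tn = sum(1 for p, l in pairs if p == 0 and l == 0)  # true negatives
    fp = sum(1 for p, l in pairs if p == 1 and l == 0)  # false positives
    fn = sum(1 for p, l in pairs if p == 0 and l == 1)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = binary_metrics([(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)])
```

The guards against zero denominators matter in practice: a model that never predicts the positive class would otherwise divide by zero when computing precision.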

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Looks like this isn't viable in spark 2.0.0 (and greater I presume). I'm pretty sure I came across this blog and ignored it due to that. Any other thoughts? The linked tickets in: https://issues.apache.org/jira/browse/SPARK-10063 https://issues.apache.org/jira/browse/HADOOP-13786 https://issues.