Exclude certain data from Training Data - MLlib

2016-11-15 Thread Bhaarat Sharma
I have my data in two sets: colors and excluded_colors. colors contains all colors; excluded_colors contains some colors that I wish to exclude from my training set. I am trying to split the data into a training and testing set and ensure that the colors in excluded_colors are not in my training set but
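The message is cut off here, but a minimal PySpark sketch of one way to do this, assuming colors and excluded_colors are DataFrames with a single color column (the column name, the split ratio, and the idea of keeping excluded colors for testing are all assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exclude-colors").getOrCreate()

# Hypothetical inputs: both DataFrames have one "color" column.
colors = spark.createDataFrame([("red",), ("green",), ("blue",), ("teal",)], ["color"])
excluded_colors = spark.createDataFrame([("teal",)], ["color"])

# Drop the excluded colors before splitting (left anti join keeps only rows
# whose color does NOT appear in excluded_colors), then split the rest.
eligible = colors.join(excluded_colors, on="color", how="left_anti")
train, test = eligible.randomSplit([0.7, 0.3], seed=42)

# If the excluded colors should still be evaluated, append them to the test
# set instead of discarding them (a guess, since the question is truncated).
test = test.union(colors.join(excluded_colors, on="color", how="left_semi"))
```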

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-15 Thread Bhaarat Sharma
> Check Scala API docs for some details: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator > On Mon, 14 Nov 2016 at 20:02 Bhaarat Sharma <bhaara...@gmail.com> wrote: > > Can you please
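For quick reference, a minimal sketch of the DataFrame-based evaluator linked above. The thread itself is in Scala; this is the PySpark equivalent, kept in Python to match the other sketches in this digest, with the column names assumed:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assumes testResults is a DataFrame produced by a fitted spark.ml classifier,
# with "rawPrediction" and "label" columns.
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction",
    labelCol="label",
    metricName="areaUnderROC",
)
auc = evaluator.evaluate(testResults)
print(auc)
```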

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
Nick Pentreath <nick.pentre...@gmail.com> wrote: > DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. > But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator
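A sketch of the first suggestion, again in PySpark: map the Row objects to plain (score, label) doubles before handing the RDD to the RDD-based metrics class (the column names "score" and "label" are assumptions):

```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# DataFrame.rdd yields Row objects; BinaryClassificationMetrics expects an
# RDD of (score, label) pairs of doubles, so extract them explicitly.
score_and_labels = testResults.select("score", "label").rdd.map(
    lambda row: (float(row[0]), float(row[1]))
)
metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)
```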

scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
I am getting scala.MatchError in the code below. I'm not able to see why this would be happening. I am using Spark 2.0.1. scala> testResults.columns res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score,

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
HDFS file? On Sat, Jul 30, 2016 at 10:19 PM, ayan guha <guha.a...@gmail.com> wrote: > This sounds like a bad idea, given HDFS does not work well with small files. > On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaara...@gmail.com> wrote: >> I am reading a bunch
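A sketch of the alternative that this advice points toward: write all the per-file byte counts as one output instead of one tiny HDFS file per image (the paths, the tab-separated format, and the existing SparkContext sc are assumptions):

```python
# One record per image, all written to a single output directory, which is
# much friendlier to HDFS than thousands of small files.
sizes = sc.binaryFiles("hdfs:///myimages") \
          .map(lambda kv: "%s\t%d" % (kv[0], len(kv[1])))
sizes.coalesce(1).saveAsTextFile("hdfs:///myimages_sizes")
```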

How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name. Example: if the directory /myimages has one.jpg, two.jpg, and three.jpg then I want three files: one-success.jpg,
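The message is truncated, but a naive sketch of what is being asked for could look like the following: collect the sizes to the driver and write each one to HDFS with the command-line client (the output directory, the -success naming, the hdfs CLI on the driver, and an existing SparkContext sc are all assumptions; see the reply above about why many small files are problematic):

```python
import os
import subprocess
import tempfile

# (path, size-in-bytes) for every file under /myimages
sizes = sc.binaryFiles("hdfs:///myimages") \
          .map(lambda kv: (kv[0], len(kv[1]))) \
          .collect()

for path, nbytes in sizes:
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. "one" from one.jpg
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.write(str(nbytes))
        local = tmp.name
    # One HDFS file per image, e.g. /myimages_sizes/one-success.jpg
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local,
                           "/myimages_sizes/%s-success.jpg" % name])
    os.remove(local)
```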

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread Bhaarat Sharma
submit so that it gets shipped to all executors. On Sat, Jul 30, 2016 at 3:24 PM, Bhaarat Sharma <bhaara...@gmail.com> wrote: >> I am using PySpark 1.6.1. In my Python program I'm using ctypes and trying to load the liblept library via the liblept.so.4.0.2
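A sketch of that suggestion: ship the shared library with --files and resolve it on the executors through SparkFiles (the library name comes from the thread; the job script name and the per-partition structure are assumptions):

```python
# Submit with the library shipped to every executor, e.g.:
#   spark-submit --files /path/to/liblept.so.4.0.2 my_job.py
from ctypes import cdll
from pyspark import SparkFiles

def process_partition(rows):
    # Resolve the shipped file on the executor and load it there,
    # not on the driver.
    lib = cdll.LoadLibrary(SparkFiles.get("liblept.so.4.0.2"))
    for row in rows:
        yield row  # call into lib as needed

result = rdd.mapPartitions(process_partition)
```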

PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread Bhaarat Sharma
I am using PySpark 1.6.1. In my Python program I'm using ctypes and trying to load the liblept library via the liblept.so.4.0.2 file on my system. While trying to load the library via cdll.LoadLibrary("liblept.so.4.0.2") I get an error: 'builtin_function_or_method' object has no attribute '__code__'
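This error often shows up when something wrapping C-level callables (such as a ctypes library handle) ends up in a closure that PySpark then tries to pickle for the executors; treating that as the cause here is an assumption, since the thread is truncated. A sketch of the pattern that avoids it: load the library lazily inside the function that runs on the executors rather than at module level on the driver (some_call is hypothetical):

```python
from ctypes import cdll

# Problematic pattern: the handle is created on the driver and captured in
# the lambda's closure, so PySpark has to pickle it.
# lib = cdll.LoadLibrary("liblept.so.4.0.2")
# rdd.map(lambda x: lib.some_call(x))

def use_lib(x):
    # Load inside the function: nothing un-picklable is captured, and each
    # executor loads its own handle.
    lib = cdll.LoadLibrary("liblept.so.4.0.2")
    return lib.some_call(x)

result = rdd.map(use_lib)
```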