RE: Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
Thanks, it got solved ☺ From: Artem Aliev [mailto:artem.al...@gmail.com] Sent: Thursday, February 11, 2016 3:19 PM To: Sonal Goyal Cc: Chandan Verma; user@spark.apache.org Subject: Re: Getting prediction values in spark mllib It depends on the algorithm you use; NaiveBayesModel has

Re: Getting prediction values in spark mllib

2016-02-11 Thread Artem Aliev
It depends on the algorithm you use. NaiveBayesModel has a predictProbabilities method to work directly with probabilities; for LogisticRegressionModel and SVMModel, clearThreshold() will make the predict method return probabilities, as mentioned above. On Thu, Feb 11, 2016 at 11:17 AM, Sonal Goyal
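
A minimal Scala sketch of the two approaches described above, assuming an already-trained model and an RDD of feature vectors (the names here are illustrative):

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, NaiveBayesModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // NaiveBayes exposes class-conditional probabilities directly
    def nbProbabilities(model: NaiveBayesModel, features: RDD[Vector]): RDD[Vector] =
      model.predictProbabilities(features)

    // For logistic regression (and SVM), clearing the threshold makes
    // predict() return the raw score instead of a 0/1 label
    def lrRawScores(model: LogisticRegressionModel, features: RDD[Vector]): RDD[Double] = {
      model.clearThreshold()
      model.predict(features)
    }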

Re: Getting prediction values in spark mllib

2016-02-11 Thread Sonal Goyal
Looks like you are doing binary classification and you are getting the label out. If you clear the model threshold, you should be able to get the raw score. On Feb 11, 2016 1:32 PM, "Chandan Verma" wrote: > Following is the code Snippet

[OT] Apache Spark Jobs in Kochi, India

2016-02-11 Thread Andrew Holway
Hello, I'm not sure how appropriate job postings are to a user group. We're getting deep into Spark and are looking for some talent in our Kochi office. http://bit.ly/Spark-Eng - Apache Spark Engineer / Architect - Kochi http://bit.ly/Spark-Dev - Lead Apache Spark Developer - Kochi Sorry for

Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
Following is the code snippet: JavaRDD<Tuple2<Object, Object>> predictionAndLabels = data.map(new Function<LabeledPoint, Tuple2<Object, Object>>() { public Tuple2<Object, Object> call(LabeledPoint p) { Double

spark shell ini file

2016-02-11 Thread Mich Talebzadeh
Hi, in Hive one can use the -i parameter to preload certain settings into beeline etc. Is there an equivalent parameter for spark-shell as well? For example, with spark-shell --master spark://50.140.197.217:7077, can I pass a parameter file? Thanks -- Mich Talebzadeh LinkedIn

Re: retrieving all the rows with collect()

2016-02-11 Thread Mich Talebzadeh
Thanks Jacob much appreciated Mich On 11/02/2016 00:01, Jakob Odersky wrote: > Exactly! > As a final note, `foreach` is also defined on RDDs. This means that > you don't need to `collect()` the results into an array (which could > give you an OutOfMemoryError in case the RDD is really

Re: SparkSQL parallelism

2016-02-11 Thread Rishi Mishra
I am not sure why all 3 nodes should query. If you have not specified any partitioning, there should be only one partition of the JDBCRDD, where the whole dataset resides. On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a spark cluster with One
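
For reference, a sketch of a partitioned JDBC read in Spark 1.6; the table, column and bounds below are made-up values. Without partitionColumn/lowerBound/upperBound/numPartitions the whole result set lands in a single partition on one executor, which matches the behaviour described above.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val employees = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // placeholder URL
      "dbtable"         -> "EMPLOYEES",
      "partitionColumn" -> "EMPLOYEE_ID",   // must be a numeric column
      "lowerBound"      -> "1",
      "upperBound"      -> "100000",
      "numPartitions"   -> "3"              // roughly one partition per worker
    )).load()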

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Sebastian Piu
Have you tried using the fair scheduler and queues? On 12 Feb 2016 4:24 a.m., "p pathiyil" wrote: > With this setting, I can see that the next job is being executed before > the previous one is finished. However, the processing of the 'hot' > partition eventually hogs all the
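
A rough configuration sketch of that suggestion; the pool name is illustrative, and spark.streaming.concurrentJobs is an undocumented setting, so treat this as an experiment rather than a recommendation.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")            // fair scheduling between concurrent jobs
      .set("spark.streaming.concurrentJobs", "2")     // let batch jobs overlap (undocumented)
    val sc = new SparkContext(conf)

    // work submitted after this call runs in the named pool
    sc.setLocalProperty("spark.scheduler.pool", "hot_partition_pool")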

mllib: Survival Analysis: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-11 Thread Stuti Awasthi
Hi All, I wanted to try survival analysis on Spark 1.6. I am successfully able to run the AFT example provided. Now I tried to train the model with the Ovarian data, which is a standard dataset that ships with the survival library in R. Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps. Here are the

Re: off-heap certain operations

2016-02-11 Thread Sea
spark.memory.offHeap.enabled defaults to false; the description in the Spark docs is wrong. Spark 1.6 does not recommend using off-heap memory. From: "Ovidiu-Cristian MARCU" Date: 2016-02-12 5:51 To:

Re: Passing a dataframe to where clause + Spark SQL

2016-02-11 Thread Rishabh Wadhawan
Hi Divya, Considering you are able to successfully load both tables, testCond and test, as data frames, now taking your case: when you do val condval = testCond.select("Cond") // where Cond is a column name, condval here is a DataFrame. Even if it has one row, it is still a data frame; if you want
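
A minimal sketch of where that is usually heading, assuming testCond holds exactly one row whose Cond column is a string and test has a column with the same name (the DataFrame and column names are taken from the thread, the rest is assumption):

    import org.apache.spark.sql.functions.col

    // pull the single value out of the one-row DataFrame ...
    val condValue = testCond.select("Cond").first().getString(0)

    // ... and use it as a literal in the other DataFrame's filter
    val filtered = test.filter(col("Cond") === condValue)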

Spark workers disconnecting on 1.5.2

2016-02-11 Thread Andy Max
I launched a 4 node Spark 1.5.2 cluster. No activity for a day or so. Now noticed that few of the workers are disconnected. Don't see this issue on Spark 1.4 or Spark 1.3. Would appreciate any pointers. Thx

Re: Scala types to StructType

2016-02-11 Thread Rishabh Wadhawan
I had the same issue. I resolved it in Java, but I am pretty sure it would work with Scala too. It's kind of a gross hack, but what I did is: say I had a table in MySQL with 1000 columns; I issued a JDBC query to extract the schema of the table. I stored that schema and wrote

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Cody Koeninger
That's what the kafkaParams argument is for. Not all of the Kafka configuration parameters will be relevant, though. On Thu, Feb 11, 2016 at 12:07 PM, Nipun Arora wrote: > Hi , > > Thanks for the explanation and the example link. Got it working. > A follow up
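
A Scala sketch of how such properties map onto kafkaParams for the direct stream; the broker list, group id and topic are placeholders, and note that zookeeper.connect is one of the settings the direct API does not use, since it talks to the brokers directly.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092",  // placeholder brokers
      "group.id"             -> "my-consumer-group",
      "auto.offset.reset"    -> "smallest"
    )
    val topics = Set("my-topic")

    // ssc is an existing StreamingContext
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)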

best practices? spark streaming writing output detecting disk full error

2016-02-11 Thread Andy Davidson
We recently started a Spark/Spark Streaming POC. We wrote a simple streaming app in Java to collect tweets. We chose Twitter because we knew we would get a lot of data and probably lots of bursts. Good for stress testing. We spun up a couple of small clusters using the spark-ec2 script. In one cluster

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie, How do you access the files currently? Have you considered using hdfs? It's designed to be distributed across a cluster and Spark has built-in support. Best, --Jakob On Feb 11, 2016 9:33 AM, "Junjie Qian" wrote: > Hi all, > > I am working with Spark 1.6,
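
A sketch of the usual pattern once the files are on HDFS (the path and partition hint are illustrative): a glob reads every file in the directory and the resulting RDD is split across the cluster, so each worker reads its own slice.

    // one record per line, spread across the cluster; minPartitions is only a hint
    val lines = sc.textFile("hdfs://namenode:8020/data/mydir/*", minPartitions = 48)

    // or, when whole small files are the unit of work
    val files = sc.wholeTextFiles("hdfs://namenode:8020/data/mydir")  // (path, content) pairs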

Inserting column to DataFrame

2016-02-11 Thread Zsolt Tóth
Hi, I'd like to append a column of a dataframe to another DF (using Spark 1.5.2): DataFrame outputDF = unlabelledDF.withColumn("predicted_label", predictedDF.col("predicted")); I get the following exception: java.lang.IllegalArgumentException: requirement failed: DataFrame must have the same
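
withColumn can only reference columns of the DataFrame it is called on, which is what the exception is pointing at. A hedged sketch of the usual workaround: join the two frames on a shared key. The generated ids below only line up if both frames have identical partitioning and ordering, so a real key column is safer; every name here other than "predicted" is an assumption.

    import org.apache.spark.sql.functions.monotonicallyIncreasingId

    // give both frames a join key if they don't already have one
    val left  = unlabelledDF.withColumn("row_id", monotonicallyIncreasingId())
    val right = predictedDF.withColumn("row_id", monotonicallyIncreasingId())
                           .select("row_id", "predicted")

    // the joined frame carries the prediction alongside the original columns
    val outputDF = left.join(right, "row_id")
                       .withColumnRenamed("predicted", "predicted_label")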

Re: spark shell ini file

2016-02-11 Thread Ted Yu
Please see: [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i file` On Thu, Feb 11, 2016 at 1:45 AM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Hi, > > > > in Hive one can use -I parameter to preload certain setting into the > beeline
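
Once that change is available, usage is roughly the following; the file name and the settings inside it are illustrative.

    // init.scala, preloaded with:
    //   spark-shell --master spark://50.140.197.217:7077 -i init.scala
    sc.setLogLevel("WARN")
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")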

Re: Spark Certification

2016-02-11 Thread Timothy Spann
I was wondering that as well. Also, is it fully updated for 1.6? Tim http://airisdata.com/ http://sparkdeveloper.com/ From: naga sharathrayapati > Date: Wednesday, February 10, 2016 at 11:36 PM To:

Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
Hi, I am looking at a way to isolate the processing of messages from each Kafka partition within the same driver. Scenario: A DStream is created with the createDirectStream call by passing in a few partitions. Let us say that the streaming context is defined to have a time duration of 2 seconds.

Dataframes

2016-02-11 Thread Gaurav Agarwal
Hi, Can we load 5 data frames for 5 tables in one Spark context? I am asking because we have to give Map<String, String> options = new HashMap<>(); options.put("driver", ""); options.put("url", ""); options.put("dbtable", ""); I can give only one table query at a time in the dbtable option. How will I register multiple queries

RE: Dataframes

2016-02-11 Thread Prashant Verma
Hi Gaurav, You can try something like this. SparkConf conf = new SparkConf(); JavaSparkContext sc = new JavaSparkContext(conf); SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); Class.forName("com.mysql.jdbc.Driver"); String url="url"; Properties prop = new
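
For completeness, a Scala sketch of the same idea: each table (or pushdown query) gets its own options map and its own DataFrame, and all of them can be registered against the single SQLContext. The URL, table names, columns and query are placeholders.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    def loadTable(dbtable: String) = sqlContext.read.format("jdbc").options(Map(
      "driver"  -> "com.mysql.jdbc.Driver",
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=u&password=p",  // placeholder URL
      "dbtable" -> dbtable
    )).load()

    val ordersDF    = loadTable("orders")
    val customersDF = loadTable("(select id, name from customers) c")  // a query works too

    ordersDF.registerTempTable("orders")
    customersDF.registerTempTable("customers")
    val joined = sqlContext.sql(
      "select c.name, o.total from orders o join customers c on o.customer_id = c.id")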

Scala types to StructType

2016-02-11 Thread Fabian Böhnlein
Hi all, is there a way to create a Spark SQL Row schema based on Scala data types without creating a manual mapping? That's the only example I can find which doesn't already require spark.sql.types.DataType as input, but it requires defining them as strings. * val struct = (new
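
For reference, a small sketch of the programmatic route without string-typed schemas: build the StructType from StructFields, mapping Scala runtime types onto Catalyst DataTypes. The field list and type mapping below are only illustrative, not a complete converter.

    import org.apache.spark.sql.types._

    // map a handful of Scala values to Spark SQL DataTypes
    def toDataType(v: Any): DataType = v match {
      case _: String  => StringType
      case _: Int     => IntegerType
      case _: Long    => LongType
      case _: Double  => DoubleType
      case _: Boolean => BooleanType
    }

    val sample = Seq("name" -> "alice", "age" -> 42, "score" -> 3.14)
    val struct = StructType(sample.map { case (name, value) =>
      StructField(name, toDataType(value), nullable = true)
    })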

Question on Spark architecture and DAG

2016-02-11 Thread Mich Talebzadeh
Hi, I have used Hive on the Spark engine (and of course Hive tables) and it is pretty impressive compared with Hive on the MR engine. Let us assume that I use the Spark shell. The Spark shell is a client that connects to the Spark master running on a host and port, like below: spark-shell --master

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I'm using the Kafka direct stream API but I can have a look at extending it to have this behaviour. Thanks! On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" wrote: > Are you using a custom input dstream? If so, you can make the `compute` > method return None to skip a

newbie unable to write to S3 403 forbidden error

2016-02-11 Thread Andy Davidson
I am using Spark 1.6.0 in a cluster created using the spark-ec2 script. I am using the standalone cluster manager. My Java streaming app is not able to write to S3. It appears to be some form of permission problem. Any idea what the problem might be? I tried using the IAM simulator to test the

Testing email please ignore

2016-02-11 Thread Mich Talebzadeh
-- Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Thanks for the info below. I have one more question. I have my own framework where the SQL query is already built, so I am thinking that instead of using DataFrame filter criteria I could use DataFrame d = sqlContext.sql("<append query here>"); d.printSchema(); List<Row> rows = d.collectAsList(); Here when
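
On the "when is it loaded" part of the earlier question: the read is lazy, so nothing is pulled into memory until an action runs, and results are only kept in memory if caching is requested. A small sketch (the query text is a placeholder):

    val d = sqlContext.sql("select * from my_table where region = 'EU'")  // placeholder query

    d.printSchema()               // metadata only, no data is read yet
    d.cache()                     // marks the result for caching, still nothing read
    val rows = d.collectAsList()  // first action: data is fetched now and cached
    val cnt  = d.count()          // served from the cache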

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Are you using a custom input dstream? If so, you can make the `compute` method return None to skip a batch. On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu wrote: > I was wondering if there is there any way to skip batches with zero events > when streaming? > By skip I

Re: Computing hamming distance over large data set

2016-02-11 Thread Brian Morton
Karl, This is tremendously useful. Thanks very much for your insight. Brian On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley wrote: > Hi, > > It sounds like you're trying to solve the approximate nearest neighbor > (ANN) problem. With a large dataset, parallelizing a brute

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Yeah, DirectKafkaInputDStream always returns an RDD even if it's empty. Feel free to send a PR to improve it. On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu wrote: > I'm using the Kafka direct stream API but I can have a look at extending > it to have this behaviour >

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Yes, and as far as I recall it also has partitions (empty) which screws up the isEmpty call if the rdd has been transformed down the line. I will have a look tomorrow at the office and see if I can collaborate On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" wrote: > Yeah,

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Sebastian Piu
Looks like mapWithState could help you? On 11 Feb 2016 8:40 p.m., "Abhishek Anand" wrote: > Hi All, > > I have an use case like follows in my production environment where I am > listening from kafka with slideInterval of 1 min and windowLength of 2 > hours. > > I have a
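
A rough Spark 1.6 Scala sketch of mapWithState for a keyed stream, assuming the values are counts to be accumulated per key; the types and update logic are illustrative, and pairStream stands in for an existing DStream[(String, Int)].

    import org.apache.spark.streaming.{State, StateSpec}

    // (key, new value, running state) => emitted record
    def updateState(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val total = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(total)
      (key, total)
    }

    val stateful = pairStream.mapWithState(StateSpec.function(updateState _))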

Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Abhishek Anand
Hi All, I have a use case as follows in my production environment, where I am listening to Kafka with a slideInterval of 1 min and a windowLength of 2 hours. I have a JavaPairDStream where for each key I am getting the same key but with different values, which might appear in the same batch or

Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I was wondering if there is any way to skip batches with zero events when streaming? By skip I mean avoiding the empty RDD being created at all.

Re: Skip empty batches - spark streaming

2016-02-11 Thread Cody Koeninger
Please don't change the behavior of DirectKafkaInputDStream. Returning an empty rdd is (imho) the semantically correct thing to do, and some existing jobs depend on that behavior. If it's really an issue for you, you can either override directkafkainputdstream, or just check isEmpty as the first
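
The isEmpty guard is a one-liner at the top of the output action; a minimal sketch, where stream is an existing DStream and the output path is a placeholder:

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {   // skip the work for batches with no records
        rdd.saveAsTextFile(s"/output/batch-${System.currentTimeMillis}")  // placeholder sink
      }
    }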

Re: Skip empty batches - spark streaming

2016-02-11 Thread Andy Davidson
You can always call rdd.isEmpty() Andy private static void save(JavaDStream<String> jsonTweets, String outputURI) { jsonTweets.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() { private static final long serialVersionUID = 1L; @Override public void

off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi, Reading through the latest documentation for memory management I can see that the parameter spark.memory.offHeap.enabled (true by default) is described with ‘If true, Spark will attempt to use off-heap memory for certain operations’ [1]. Can you please describe the certain operations you
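
For reference, the two related settings in Spark 1.6 look roughly like this; the size is an illustrative value, and as noted elsewhere in this thread the flag actually defaults to false.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.offHeap.enabled", "true")   // off by default despite what the docs say
      .set("spark.memory.offHeap.size", "2g")        // must be positive when off-heap is enabled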

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Thanks for clarifying Cody. I will extend the current behaviour for my use case. If there is anything worth sharing I'll run it through the list Cheers On 11 Feb 2016 9:47 p.m., "Cody Koeninger" wrote: > Please don't change the behavior of DirectKafkaInputDStream. >

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
I think SPARK_CLASSPATH is deprecated. Can you show the command line launching your Spark job? Which Spark release do you use? Thanks On Thu, Feb 11, 2016 at 5:38 PM, Charlie Wright wrote: > built and installed hadoop with: > mvn package -Pdist -DskipTests -Dtar >

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
The Spark driver does not run on the YARN cluster in client mode, only the Spark executors do. Can you check YARN logs for the failed job to see if there was more clue ? Does the YARN cluster run the customized hadoop or stock hadoop ? Cheers On Thu, Feb 11, 2016 at 5:44 PM, Charlie Wright

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
With this setting, I can see that the next job is being executed before the previous one is finished. However, the processing of the 'hot' partition eventually hogs all the concurrent jobs. If there was a way to restrict jobs to be one per partition, then this setting would provide the

SparkSQL parallelism

2016-02-11 Thread Madabhattula Rajesh Kumar
Hi, I have a spark cluster with one master and 3 worker nodes. I have written the below code to fetch records from Oracle using Spark SQL: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val employees = sqlContext.read.format("jdbc").options( Map("url" ->

Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
CatatlystTypeConverters.scala has all types of utility methods to convert from Scala to row and vice a versa. On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan wrote: > I had the same issue. I resolved it in Java, but I am pretty sure it would > work with scala too. Its

Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
Right, Thanks Ted. On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu wrote: > Minor correction: the class is CatalystTypeConverters.scala > > On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan > wrote: > >> CatatlystTypeConverters.scala has all types of utility

Re: Scala types to StructType

2016-02-11 Thread Ted Yu
Minor correction: the class is CatalystTypeConverters.scala On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan wrote: > CatatlystTypeConverters.scala has all types of utility methods to convert > from Scala to row and vice a versa. > > > On Fri, Feb 12, 2016 at 12:21 AM,

Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley
Hi, It sounds like you're trying to solve the approximate nearest neighbor (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) approach isn't likely to help all that much, because the number of pairwise comparisons grows quickly as the size of the dataset increases. I'd look

spark thrift server transport protocol

2016-02-11 Thread Sanjeev Verma
I am running the Spark thrift server to run queries over Hive. How can I know which transport protocol my client and server are currently using? Thanks

Re: Dataframes

2016-02-11 Thread Rishabh Wadhawan
Hi Gaurav, I am not sure what you are trying to do here, as you are naming two data frames with the same name, which would be a compilation error in Java. However, after trying to see what you are asking, as far as I understand your question, you can do something like this: > SqlContext

Re: Spark execuotr Memory profiling

2016-02-11 Thread Rishabh Wadhawan
Hi All, Please check this JIRA ticket regarding the issue. I was having the same issue with shuffling. It seems the maximum shuffle block size is 2 GB. https://issues.apache.org/jira/browse/SPARK-5928 > On Feb 11, 2016, at 9:08 AM,

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Nipun Arora
Hi , Thanks for the explanation and the example link. Got it working. A follow up question. In Kafka one can define properties as follows: Properties props = new Properties(); props.put("zookeeper.connect", zookeeper); props.put("group.id", groupId); props.put("zookeeper.session.timeout.ms",

Re: spark thrift server transport protocol

2016-02-11 Thread Ted Yu
From the head of HiveThriftServer2: * The main entry point for the Spark SQL port of HiveServer2. Starts up a `SparkSQLContext` and a * `HiveThriftServer2` thrift server. Looking at HiveServer2.java from Hive, looks like it uses the Thrift protocol. FYI On Thu, Feb 11, 2016 at 9:34 AM,

cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Hi, When will the DataFrame load the table into memory when it reads from Hive/Phoenix or from any database? There are two points where I need some info: will tables be loaded into memory or cached at point 1 or at point 2 below? 1. DataFrame df = sContext.load("jdbc","(select * from

Re: Dataframes

2016-02-11 Thread Ted Yu
bq. Whether sContext(SQlCOntext) will help to query in both the dataframes and will it decide on which dataframe to query for . Can you clarify what you were asking ? The queries would be carried out on respective DataFrame's as shown in your snippet. On Thu, Feb 11, 2016 at 8:47 AM, Gaurav

How to parallel read files in a directory

2016-02-11 Thread Junjie Qian
Hi all, I am working with Spark 1.6 and Scala, and have a big dataset divided into several small files. My question is: right now the read operation takes a really long time and often emits RDD warnings. Is there a way I can read the files in parallel, so that all nodes or workers read the file at the

ApacheCon NA 2016 - Important Dates!!!

2016-02-11 Thread Melissa Warnkin
Hello everyone! I hope this email finds you well.  I hope everyone is as excited about ApacheCon as I am! I'd like to remind you all of a couple of important dates, as well as ask for your assistance in spreading the word! Please use your social media platform(s) to get the word out! The more

Re: LogisticRegressionModel not able to load serialized model from S3

2016-02-11 Thread Utkarsh Sengar
The problem turned out to be corrupt parquet data, the error was a bit misleading by spark though. On Mon, Feb 8, 2016 at 3:41 PM, Utkarsh Sengar wrote: > I am storing a model in s3 in this path: > "bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the >

Re: Spark Certification

2016-02-11 Thread Prem Sure
I did recently. It includes MLlib & GraphX too, and I felt the exam content covered all topics up to 1.3 and not the post-1.3 versions of Spark. On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri wrote: > I am planning to do that with databricks >