Re: Create a Column expression from a String

2016-11-19 Thread Luciano Resende
Are you looking for UDFs? https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html On Sun, Nov 20, 2016 at 3:12 AM Stuart White wrote: > I'd like to allow for runtime-configured Column expressions in my > Spark SQL application. For
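For the runtime-configuration angle of the question, a registered UDF can be referenced by name from a SQL string built at runtime. A minimal sketch (the `shortZip` function and `addresses` view are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("udf-example").getOrCreate()

// Register a UDF so it can be referenced by name inside a SQL string
spark.udf.register("shortZip", (zip: String) => zip.take(5))

// The SQL text itself can now come from configuration at runtime
val result = spark.sql("SELECT shortZip(zipCode) AS zip5 FROM addresses")
```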

Create a Column expression from a String

2016-11-19 Thread Stuart White
I'd like to allow for runtime-configured Column expressions in my Spark SQL application. For example, if my application needs a 5-digit zip code, but the file I'm processing contains a 9-digit zip code, I'd like to be able to configure my application with the expression "substring('zipCode, 0,
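One way to turn such a configured string into a `Column` is `functions.expr`, which parses a SQL expression. A hedged sketch, assuming a DataFrame `df` with a string `zipCode` column (note that SQL `substring` is 1-based, so the 5-digit prefix starts at position 1, not 0):

```scala
import org.apache.spark.sql.functions.expr

// Expression string read from configuration at runtime
val zipExpr = "substring(zipCode, 1, 5)"

// expr() parses the string into a Column usable anywhere a Column is
val trimmed = df.withColumn("zip5", expr(zipExpr))
```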

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
I share both the concerns that you have expressed. And as I mentioned in my earlier mail, offline (batch) training is an option if I get a dataset without outliers. In that case I can train and have a model. I find the model parameters, which will be the mean distance to the centroid. Note in

Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While "apparently" saturating the N available workers using your proposed N partitions - the "actual" distribution of workers to tasks is controlled by the scheduler. If my past experience were of service - you can *not* trust the default Fair Scheduler to ensure the round-robin scheduling of the

HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Adam Smith
Dear community, I have an RDD with N rows and N partitions. I want to ensure that the partitions all run at the same time, by setting the number of vcores (spark-yarn) to N. The partitions need to talk to each other with some socket-based sync, which is why I need them to run more or less

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
Here are 2 concerns I would have with the design (this discussion is mostly to validate my own understanding): 1. if you have outliers "before" running k-means, won't your centroids get skewed? In other words, won't outliers by themselves bias the cluster evaluation? 2. Typically

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
Looking for alternative suggestions in case where we have 1 continuous stream of data. Offline training and online prediction can be one option if we can have an alternate set of data to train. But if it's one single stream you don't have separate sets for training or cross validation. So

Re: Run spark with hadoop snapshot

2016-11-19 Thread Luke Miner
Thanks! Should I do it from the spark build environment? On Sat, Nov 19, 2016 at 4:48 AM, Steve Loughran wrote: > I'd recommend you build a fill spark release with the new hadoop version; > you should have built that locally earlier the same day (so that ivy/maven > pick

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
Curious - why do you want to train your models every 3 secs? On 20 Nov 2016 06:25, "Debasish Ghosh" wrote: > Thanks a lot for the response. > > Regarding the sampling part - yeah that's what I need to do if there's no > way of titrating the number of clusters online. > >

Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Thank you, it was the escape character, option("escape", "\"") Regards On Sat, Nov 19, 2016 at 11:10 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > I triied .option("quote", "\""), which I believe is the default, still the > same error. This is the offending record. > > Primo
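The fix reported here hinges on a CSV-reader subtlety: in Spark 2.x the `quote` option defaults to `"`, but `escape` defaults to backslash, so files that escape embedded quotes by doubling them need `escape` set to `"` as well. A sketch assuming a reviews file like the one in the thread (the filename is illustrative):

```scala
// quote defaults to `"`; escape must also be set to `"` when the file
// represents a literal quote inside a quoted field as ""
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("reviews.csv")
```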

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
Thanks a lot for the response. Regarding the sampling part - yeah that's what I need to do if there's no way of titrating the number of clusters online. I am using something like dstream.foreachRDD { rdd => if (rdd.count() > 0) { //.. logic } } Feels a little odd but if that's the idiom
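A small refinement on the guard shown in the message: `rdd.isEmpty` stops at the first non-empty partition, so it is typically cheaper than `rdd.count() > 0`, which scans the whole RDD. A sketch of the same idiom:

```scala
dstream.foreachRDD { rdd =>
  // isEmpty short-circuits; count() materializes every partition
  if (!rdd.isEmpty) {
    // .. logic
  }
}
```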

Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
I tried .option("quote", "\""), which I believe is the default, still the same error. This is the offending record. Primo 4-In-1 Soft Seat Toilet Trainer and Step Stool White with Pastel Blue Seat,"I chose this potty for my son because of the good reviews. I do not like it. I'm honestly baffled

Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Digging through, it looks like an issue with reading CSV. Some of the data have embedded commas in them; these fields are rightly quoted. However, the CSV reader seems to be getting into a pickle when the records contain quoted and unquoted data. Fields are only quoted when there are commas within

Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Hello, I have the following code that trains a mapping of review text to ratings. I use a tokenizer to get all the words from the review, and use a count vectorizer to get all the words. However, when I train the classifier I get a match error. Any pointers will be very helpful. The code is

Re: Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Hster Geguri
Hi Cody, Our test producer has been vetted for producing evenly into each partition. We use kafka-manager to track this. $ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list ' > 10.102.22.11:9092' --topic simple_logtest --time -2 > simple_logtest:2:0 > simple_logtest:4:0 >

Re: Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
This is running locally on my mac, but it's still a standalone spark master with multiple separate executor jvms (i.e. using --master not --local[2]), so it should be the same code paths. I can't speak to yarn one way or the other, but you said you tried it with the standalone scheduler. At the

Re: using StreamingKMeans

2016-11-19 Thread Cody Koeninger
So I haven't played around with streaming k means at all, but given that no one responded to your message a couple of days ago, I'll say what I can. 1. Can you not sample out some % of the stream for training? 2. Can you run multiple streams at the same time with different values for k and
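Suggestion 1 (sampling a percentage of the stream for training) can be done per micro-batch with `transform` and `RDD.sample`. A minimal sketch; the 10% fraction is illustrative:

```scala
// Route roughly 10% of each micro-batch to a training path,
// sampling without replacement
val trainingStream = dstream.transform { rdd =>
  rdd.sample(withReplacement = false, fraction = 0.1)
}
```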

convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-19 Thread vr spark
Hi, I am looking for Scala or Python code samples to convert a local tsv file to an orc file and store it on distributed cloud storage (openstack). So, need these 3 samples. Please suggest. 1. read tsv 2. convert to orc 3. store on distributed cloud storage thanks VR
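The three steps collapse to a read and a write in Spark 2.x. A hedged sketch: the paths and the `swift://` URL are illustrative, writing to Swift assumes the `hadoop-openstack` connector is configured, and ORC output in Spark 2.0 assumes a build with Hive support:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("tsv-to-orc")
  .enableHiveSupport()   // ORC output in Spark 2.0 relies on the Hive build
  .getOrCreate()

// 1. read the local TSV (the CSV reader with a tab separator)
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("file:///data/input.tsv")

// 2 & 3. convert to ORC and write to the object store in one step
df.write.orc("swift://container.provider/output/")
```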

Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Hster Geguri
Hi Cody, Thank you for testing this on a Saturday morning! I failed to mention that when our data engineer runs our drivers (even complex ones) locally on his Mac, the drivers work fine. However when we launch it into the cluster (4 machines either for a YARN cluster or spark standalone) we get

Re: Kafka direct approach,App UI shows wrong input rate

2016-11-19 Thread Cody Koeninger
There have definitely been issues with UI reporting for the direct stream in the past, but I'm not able to reproduce this with 2.0.2 and 0.8. See below: https://i.imgsafe.org/086019ae57.png On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel wrote: > Hello, > > I use

Usage of mllib api in ml

2016-11-19 Thread janardhan shetty
Hi, I am trying to use the evaluation metrics offered by mllib MulticlassMetrics in an ml dataframe setting. Are there any examples of how to use it?
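The usual bridge is to pull the prediction/label columns out of the ml DataFrame as an RDD of pairs. A sketch, assuming `predictions` is the DataFrame produced by an ml model's `transform()` with the default `prediction` and `label` column names (the ml package's `MulticlassClassificationEvaluator` is an alternative when only a single summary metric is needed):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row

// Convert the DataFrame columns to the RDD[(Double, Double)] that
// mllib's MulticlassMetrics expects
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map { case Row(p: Double, l: Double) => (p, l) }

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)
println(metrics.weightedFMeasure)
```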

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-19 Thread Yanbo Liang
Hi Russell, Do you want to use RowMatrix.columnSimilarities to calculate cosine similarities? If so, you should use the following steps: val dataset: DataFrame // Convert the type of features column from ml.linalg.Vector type to mllib.linalg.Vector val oldDataset: DataFrame =
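The truncated steps can be sketched end to end as follows, assuming `dataset` has an ml-typed `features` column (names as in the reply):

```scala
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

// Convert the ml.linalg.Vector column to the old mllib.linalg.Vector type
val oldDataset = MLUtils.convertVectorColumnsFromML(dataset, "features")

// Build a RowMatrix from the converted column and compute similarities;
// the result is an upper-triangular CoordinateMatrix of cosine similarities
val mat = new RowMatrix(
  oldDataset.select("features").rdd.map { case Row(v: OldVector) => v }
)
val similarities = mat.columnSimilarities()
```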

Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
I ran your example using the versions of kafka and spark you are using, against a standalone cluster. This is what I observed: (in kafka working directory) bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 'localhost:9092' --topic simple_logtest --time -2

using StreamingKMeans

2016-11-19 Thread debasishg
Hello - I am trying to implement an outlier detection application on streaming data. I am a newbie to Spark and hence would like some advice on the confusions that I have. I am thinking of using StreamingKMeans - is this a good choice? I have one stream of data and I need an online algorithm.
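For reference, the StreamingKMeans setup discussed across this thread looks roughly like the sketch below. The value of k, the decay factor, and the distance-based outlier flagging are all illustrative choices, not settled answers from the thread:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

val model = new StreamingKMeans()
  .setK(5)                               // number of clusters - illustrative
  .setDecayFactor(0.5)                   // gradually forget old batches
  .setRandomCenters(dim = 10, weight = 0.0)

// trainingStream and testStream are DStream[Vector]
model.trainOn(trainingStream)

// Flag points far from their nearest centroid as outlier candidates
val distances = testStream.map { v =>
  val m = model.latestModel()
  val center = m.clusterCenters(m.predict(v))
  (v, math.sqrt(Vectors.sqdist(v, center)))
}
```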

Re: VectorUDT and ml.Vector

2016-11-19 Thread Yanbo Liang
The reason behind this error can be inferred from the error log: *MLUtils.convertMatrixColumnsFromML *was used to convert ml.linalg.Matrix to mllib.linalg.Matrix, but it looks like the column type is ml.linalg.Vector in your case. Could you check the type of column "features" in your dataframe
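In other words, a Vector-typed column needs the vector variant of the converter, not the matrix one. A one-line sketch, assuming the column is named "features" as in the thread:

```scala
import org.apache.spark.mllib.util.MLUtils

// For ml.linalg.Vector columns, use the *Vector* converter
val converted = MLUtils.convertVectorColumnsFromML(df, "features")
```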

Re: Kafka segmentation

2016-11-19 Thread Cody Koeninger
I mean I don't understand exactly what the issue is. Can you fill in these blanks? My settings are: My code is: I expected to see: Instead, I saw: On Thu, Nov 17, 2016 at 12:53 PM, Hoang Bao Thien wrote: > I am sorry I don't understand your idea. What do you mean

Re: why is method predict protected in PredictionModel

2016-11-19 Thread Yanbo Liang
This function is used internally currently, we will expose it as public to support make prediction on single instance. See discussion at https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo On Thu, Nov 17, 2016 at 1:24 AM, wobu wrote: > Hi, > > we were using

Re: DataFrame select non-existing column

2016-11-19 Thread Kristoffer Sjögren
Thanks. Here's my code example [1] and the printSchema() output [2]. This code still fails with the following message: "No such struct field mobile in auction, geo" By looking at the schema, it seems that pass.mobile did not get nested, which is the way it needs to be for my use case. Is nested
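A likely explanation for the "pass.mobile did not get nested" symptom: `withColumn("pass.mobile", ...)` creates a *top-level* column literally named "pass.mobile"; adding a field inside an existing struct means rebuilding the struct. A hedged sketch, assuming a schema along the lines of `pass: struct<auction: ..., geo: ...>` (inferred from the error message, not shown in the thread):

```scala
import org.apache.spark.sql.functions.{col, lit, struct}

// Rebuild the struct with the existing fields plus the new one
val nested = df.withColumn(
  "pass",
  struct(col("pass.auction"), col("pass.geo"), lit(0L).as("mobile"))
)
```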

Stateful aggregations with Structured Streaming

2016-11-19 Thread Yuval.Itzchakov
I've been using `DStream.mapWithState` and was looking forward to trying out Structured Streaming. The thing I can't understand is: does Structured Streaming in its current state support stateful aggregations? Looking at the StateStore design document
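For context: at the time (Spark 2.0.x), Structured Streaming maintained state implicitly for streaming aggregations, while an arbitrary `mapWithState`-style API (`mapGroupsWithState`) did not arrive until later releases. A minimal sketch of the implicit-state form; `events` and the `userId` column are illustrative:

```scala
// State for the running counts is kept by the engine in the StateStore
val counts = events.groupBy($"userId").count()

counts.writeStream
  .outputMode("complete")   // emit the full updated result each trigger
  .format("console")
  .start()
```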

Re: Run spark with hadoop snapshot

2016-11-19 Thread Steve Loughran
I'd recommend you build a full Spark release with the new Hadoop version; you should have built that locally earlier the same day (so that ivy/maven pick up the snapshot) dev/make-distribution.sh -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.9.0-SNAPSHOT; > On 18 Nov 2016, at 19:31, lminer

Re: Reading LZO files with Spark

2016-11-19 Thread Sean Owen
Are you missing the hadoop-lzo package? It's not part of Hadoop/Spark. On Sat, Nov 19, 2016 at 4:20 AM learning_spark < dibyendu.chakraba...@gmail.com> wrote: > Hi Users, I am not sure about the latest status of this issue: > https://issues.apache.org/jira/browse/SPARK-2394 However, I have seen