Re: [Spark Streaming] How to join two messages in Spark Streaming (probably the messages are in different RDDs)?

2016-12-06 Thread Tathagata Das
This sounds like something you can solve with a stateful operator; check out mapWithState. If both messages can be keyed with a common key, then you can define a keyed state. The state will have a field for the first message. When you see the first message for a key, fill the first field with
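A minimal sketch of the keyed-state idea described above, assuming the two message streams can be unioned into one DStream keyed by a common id (Msg, keyedStream, and the field names are illustrative):

import org.apache.spark.streaming.{State, StateSpec}

case class Msg(id: String, payload: String)   // hypothetical message type

// Keep the first message for a key in state until its partner arrives, then emit the pair.
val pairUp = (id: String, msg: Option[Msg], state: State[Msg]) => {
  (state.getOption, msg) match {
    case (Some(first), Some(second)) =>
      state.remove()                // both messages seen: clear the state and emit
      Some((first, second))
    case (None, Some(first)) =>
      state.update(first)           // remember the first message for this key
      None
    case _ => None
  }
}

// keyedStream: DStream[(String, Msg)] built by unioning the two input streams
val joined = keyedStream.mapWithState(StateSpec.function(pairUp))

A timeout (StateSpec.function(pairUp).timeout(...)) is usually worth adding so unmatched messages do not sit in state forever.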

Adding a new nested column into dataframe

2016-12-06 Thread AShaitarov
Hello Spark experts! I need to add one more nested column to the existing ones. For example, the initial DF schema looks like this: |-- Id: string (nullable = true) |-- Type: string (nullable = true) |-- Uri: string (nullable = true) |-- attributes: struct (nullable = false) | |-- CountryGroupID: array
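One way to do this (a sketch; the new field name and value are illustrative, and the existing attribute fields besides CountryGroupID are stood in by a comment) is to rebuild the attributes struct with the existing fields plus the new one:

import org.apache.spark.sql.functions.{col, lit, struct}

val withNested = df.withColumn(
  "attributes",
  struct(
    col("attributes.CountryGroupID"),
    // ... repeat for the other existing attribute fields ...
    lit("defaultValue").as("NewNestedField")   // hypothetical new nested field
  )
)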

Re: Terminate job without killing

2016-12-06 Thread Leonid Blokhin
Hi, Bruno! You can send a message to an MQTT topic when the job finishes. This can be done with the help of the Mist service (https://github.com/Hydrospheredata/mist), or in a similar way. Regards, Leonid. On Dec 7, 2016 at 6:03, "Bruno Faria" wrote: I have a

Dynamically applying schema in spark.

2016-12-06 Thread Satwinder Singh
Hi Spark Team, we are working on a use case where we need to parse an Avro JSON schema (the JSON is a nested schema) using Apache Spark 1.6 and Scala 2.10. As of now we are able to deflate the JSON from Avro, read it, and bring it into a Hive table. Our requirement is that the Avro JSON can add new fields and we

Terminate job without killing

2016-12-06 Thread Bruno Faria
I have a Python Spark job that runs successfully but never ends (never releases the prompt). I get messages like "releasing accumulator" but never the expected shutdown message or the prompt release. To handle this I used sys.exit(0); now it works, but the tasks always appear as KILLED

Re: [Spark Streaming] How to join two messages in Spark Streaming (probably the messages are in different RDDs)?

2016-12-06 Thread sancheng
Any valuable feedback is appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-do-join-two-messages-in-spark-streaming-Probabaly-messasges-are-in-differnet--tp28161p28163.html

OutOfMemoryError while running job...

2016-12-06 Thread Kevin Burton
I am trying to run a Spark job which reads from Elasticsearch and should write its output back to a separate Elasticsearch index. Unfortunately I keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've tried running it with: --conf spark.memory.offHeap.enabled=true --conf
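For reference, a sketch of the off-heap settings being tried (the sizes are illustrative). Note that spark.memory.offHeap.enabled only has an effect when spark.memory.offHeap.size is also set to a positive value, and a heap OutOfMemoryError usually still calls for more executor heap or smaller partitions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "4g")   // must be > 0 for off-heap memory to be used
  .set("spark.executor.memory", "8g")       // illustrative; tune to the actual workload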

credentials are not hiding on a jdbc query

2016-12-06 Thread Cesar
Related to https://issues.apache.org/jira/browse/SPARK-12504?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22credentials%20jdbc%22 Is there a way to override the explain command on that class without updating Spark? My Spark version is 1.6 and it is very hard for me to upgrade. Therefore I

[Spark SQL]: Dataset Encoder equivalent for pre 1.6.0 releases?

2016-12-06 Thread Denis Papathanasiou
I have a case class named "Computed", and I'd like to be able to encode all the Row objects in the DataFrame like this: def myEncoder (df: DataFrame): Dataset[Computed] = df.as(Encoders.bean(classOf[Computed])) This works just fine with the latest version of Spark, but I'm forced to use
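The Dataset API (and with it Encoders) only appeared in Spark 1.6.0, so there is no direct pre-1.6 equivalent; earlier releases have to keep the data as DataFrame/RDD[Row]. For Spark 2.x, a sketch of the usual case-class route (the field names here are illustrative):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Computed(id: String, value: Double)   // illustrative fields

def myEncoder(spark: SparkSession, df: DataFrame): Dataset[Computed] = {
  import spark.implicits._   // brings the product encoder for the case class into scope
  df.as[Computed]
}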

[no subject]

2016-12-06 Thread ayan guha
Hi, we are generating some big model objects:
> hdfs dfs -du -h /myfolder
325      975      /myfolder/__ORCHMETA__
1.7 M    5.0 M    /myfolder/model
185.3 K  555.9 K  /myfolder/predict
The issue I am facing while loading is: Error in .jcall("com/oracle/obx/df/OBXSerializer", returnSig =

MLlib to Compute boundaries of a rectangle given random points on its Surface

2016-12-06 Thread Pradeep Gaddam
Hello, can someone please let me know if it is possible to construct a surface (for example, a rectangle) given random points on its surface using Spark MLlib? Thanks, Pradeep Gaddam

Re: Livy with Spark

2016-12-06 Thread Mich Talebzadeh
Thanks Richard. I saw your question in the above blog: How does Livy proxy the user? Per task? Do you know how quotas are assigned to users, i.e. how do you stop one Livy user from using all of the resources available to the executors? My points are: 1. I still don't understand how quotas

Re: [GraphX] Extreme scheduler delay

2016-12-06 Thread Sean Owen
(For what it is worth, I happened to look into this with Anton earlier and am also pretty convinced it's related to GraphX rather than the app. It's somewhat difficult to debug what gets sent in the closure AFAICT.) On Tue, Dec 6, 2016 at 7:49 PM AntonIpp wrote: > Hi

Re: Writing DataFrame filter results to separate files

2016-12-06 Thread Everett Anderson
On Mon, Dec 5, 2016 at 5:33 PM, Michael Armbrust wrote: > 1. In my case, I'd need to first explode my data by ~12x to assign each >> record to multiple 12-month rolling output windows. I'm not sure Spark SQL >> would be able to optimize this away, combining it with the

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-06 Thread Michael Armbrust
.where("xxx IS NOT NULL") will give you the rows that couldn't be parsed. On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein < yeh...@veracity-group.com> wrote: > Hi all > > > > I’m trying to parse json using existing schema and got rows with NULL’s > > //get schema > > val df_schema =

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Cody Koeninger
You do not need recent versions of Spark, Kafka, or Structured Streaming in order to do this. Normal DStreams are sufficient. You can parallelize your static data from the database to an RDD, and there's a join method available on RDDs. Transforming a single given timestamp line into multiple
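A sketch of that DStream approach (idStream, staticRDD, and the key/value types are illustrative):

// staticRDD: RDD[(Int, Int)] built once from the database rows (id -> value)
// idStream:  DStream[(Int, String)] keyed by id, parsed from the Kafka JSON messages
val joined = idStream.transform(rdd => rdd.join(staticRDD))
// joined: DStream[(Int, (String, Int))] pairing each streamed record with its static value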

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-06 Thread Chawla,Sumit
Any pointers on this? Regards, Sumit Chawla. On Mon, Dec 5, 2016 at 8:30 PM, Chawla, Sumit wrote: > An example implementation I found is https://github.com/groupon/spark-metrics > Does anyone have experience using this? I am more interested in something > for

Re: driver in queued state and not started

2016-12-06 Thread Michael Gummelt
Client mode or cluster mode? On Mon, Dec 5, 2016 at 10:05 PM, Yu Wei wrote: > Hi guys, > > I tried to run Spark on a Mesos cluster. > However, when I tried to submit jobs via spark-submit, the driver stays in > "Queued" state and is not started. > What should I check? >

efficient filtering on a dataframe

2016-12-06 Thread Koert Kuipers
I have a dataframe on which I need to run many queries that start with a filter on a column x. Currently I write the dataframe out to a parquet datasource partitioned by field x, after which I repeatedly read the datasource back in from parquet. The queries are efficient because the filter gets
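A sketch of the approach described here, for reference (the path and values are illustrative); the filter on x is served cheaply because each value of x lives in its own partition directory and pruning skips the rest:

import org.apache.spark.sql.functions.col

// Write once, partitioned by the filter column
df.write.partitionBy("x").parquet("/tmp/df_by_x")

// Read back; queries filtering on x only touch the matching directories
val byX = spark.read.parquet("/tmp/df_by_x")
val filtered = byX.filter(col("x") === "someValue")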

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Burak Yavuz
Hi Daniela, This is trivial with Structured Streaming. If your Kafka cluster is 0.10.0 or above, you may use Spark 2.0.2 to create a Streaming DataFrame from Kafka, and then also create a DataFrame using the JDBC connection, and you may join those. In Spark 2.1, there's support for a function
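A sketch of the approach described here, assuming Spark 2.0.2+ and Kafka 0.10+ (the broker address, topic, JDBC URL, and table name are illustrative):

// Streaming DataFrame from Kafka
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

// Static DataFrame from the database
val static = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/mydb")
  .option("dbtable", "id_values")
  .load()

// After extracting an "id" column from the JSON, a stream-static join is an ordinary join:
// val joined = parsedStream.join(static, "id")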

get corrupted rows using columnNameOfCorruptRecord

2016-12-06 Thread Yehuda Finkelstein
Hi all, I'm trying to parse JSON using an existing schema and got rows with NULLs.
//get schema
val df_schema = spark.sqlContext.sql("select c1,c2,…cn t1 limit 1")
//read json file
val f = sc.textFile("/tmp/x")
//load json into data frame using schema
var df =
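One way to make the unparseable rows visible rather than silently NULL (a sketch; it assumes the corrupt-record column is added to the schema by hand, and the column name is illustrative):

import org.apache.spark.sql.types.StringType

// Reuse the existing schema and add a column to capture records that fail to parse
val schema = df_schema.schema.add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/x")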

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
There is no such thing as multiclass regression. These metrics are for classification problems and don't have meaning for regression. On Tue, Dec 6, 2016 at 7:55 PM Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Sean, > > According to Spark documentation, precision, recall, F1,

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
Hi Sean, according to the Spark documentation, precision, recall, F1, true positive rate, false positive rate, etc. can also be calculated using the multiclass metrics evaluator for multiclass classifiers, for example in a *Random Forest*-based classifier or regressor: // Get evaluation metrics.
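For reference, a sketch of those metrics with MulticlassMetrics, assuming Spark 2.x and an RDD of (prediction, label) pairs produced by a classifier; as the other replies in this thread note, they are only meaningful for classification, not regression:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabels: RDD[(Double, Double)] of (prediction, label) from a classifier
val metrics = new MulticlassMetrics(predictionAndLabels)

println(s"Accuracy  = ${metrics.accuracy}")
println(s"Precision = ${metrics.weightedPrecision}")
println(s"Recall    = ${metrics.weightedRecall}")
println(s"F1        = ${metrics.weightedFMeasure}")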

[GraphX] Extreme scheduler delay

2016-12-06 Thread AntonIpp
Hi everyone, I have a small Scala test project which uses GraphX and for some reason has extreme scheduler delay when executed on the cluster. The problem is not related to the cluster configuration, as other GraphX applications run without any issue. I have attached the source code (

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
Precision, recall and F1 are metrics for binary classifiers, not regression models. Can you clarify what you intend to do? On Tue, Dec 6, 2016, 19:14 Md. Rezaul Karim wrote: > Hi Folks, > > I have the following code snippet in Java that can calculate the

How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
Hi Folks, I have the following code snippet in Java that can calculate the precision in a Linear Regression-based model. Dataset predictions = model.transform(testData); long count = 0; for (Row r : predictions.select("features", "label", "prediction").collectAsList()) { count++;

Spark Streaming - join streaming and static data

2016-12-06 Thread Daniela S
Hi, I have some questions regarding Spark Streaming. I receive a stream of JSON messages from Kafka. The messages consist of a timestamp and an ID:
timestamp            ID
2016-12-06 13:00     1
2016-12-06 13:40     5
...
In a database I have values for each ID:
ID

Re: Re: Re: how to add column to dataframe

2016-12-06 Thread lk_spark
I now know the right way to do it:
val df = spark.read.parquet("/parquetdata/weixin/page/month=201607")
val df2 = df.withColumn("pa_bid", when(isnull($"url"), "".split("#")(0)).otherwise(split(split(col("url"), "_biz=")(1), "")(1)))
scala> df2.select("pa_bid","url").show

Re: Re: how to add column to dataframe

2016-12-06 Thread lk_spark
Thanks for the reply. I will look into how to use na.fill, but I don't know how to get the value of the column and do an operation like substr or split on it. 2016-12-06 lk_spark. From: Pankaj Wahane Sent: 2016-12-06 17:39 Subject: Re: how to add column to dataframe

Re: how to add column to dataframe

2016-12-06 Thread Pankaj Wahane
You may want to try using df2.na.fill(…) From: lk_spark Date: Tuesday, 6 December 2016 at 3:05 PM To: "user.spark" Subject: how to add column to dataframe Hi all: my Spark version is 2.0. I have a parquet file with one column named url whose type is
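A short sketch of that suggestion, assuming the derived column is the pa_bid string column from this thread:

// Replace nulls in the derived column with an empty string
val df3 = df2.na.fill("", Seq("pa_bid"))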

how to add column to dataframe

2016-12-06 Thread lk_spark
Hi all: my Spark version is 2.0. I have a parquet file with one column named url whose type is string. I want to get a substring from the url and add it to the dataframe: val df = spark.read.parquet("/parquetdata/weixin/page/month=201607") val df2 =
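One way to derive such a column (a sketch; the "_biz=" delimiter is taken from later messages in this thread, and the null handling is illustrative):

import org.apache.spark.sql.functions.{col, isnull, split, when}

val df2 = df.withColumn(
  "pa_bid",
  when(isnull(col("url")), "")                        // avoid NPE-style failures on null urls
    .otherwise(split(col("url"), "_biz=").getItem(1)) // take the part after the delimiter
)
df2.select("pa_bid", "url").show(5, false)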

doing streaming efficiently

2016-12-06 Thread Mendelson, Assaf
Hi, I have a system which does streaming analysis over a long period of time, for example a sliding window of 24 hours every 15 minutes. I have a batch process that I need to convert to streaming, and I am wondering how to do so efficiently. I am currently building the streaming process so I
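With the DStream API, one efficient way to keep a 24-hour window sliding every 15 minutes is an incremental window reduce with an inverse function, so each slide only adds the new batches and subtracts the expired ones instead of recomputing the whole window. A sketch, assuming ssc is the StreamingContext and keyedCounts and the checkpoint path are illustrative (checkpointing is required for this form):

import org.apache.spark.streaming.Minutes

ssc.checkpoint("/tmp/checkpoints")   // required for the inverse-reduce window

// keyedCounts: DStream[(String, Long)]
val windowed = keyedCounts.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // fold in batches entering the window
  (a: Long, b: Long) => a - b,   // remove batches leaving the window
  Minutes(24 * 60),              // window length: 24 hours
  Minutes(15)                    // slide interval: 15 minutes
)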