Re: spark streaming with kinesis

2016-11-20 Thread Takeshi Yamamuro
"1 userid data" is ambiguous though (user-input data? stream? shard?). Since a Kinesis worker fetches data from the shards it owns, IIUC the user-input data in a shard are transferred to its assigned worker as long as there is no failure. // maropu On Mon, Nov 21, 2016 at 1:59

Re: Re: Multiple streaming aggregations in structured streaming

2016-11-20 Thread Reynold Xin
Can you use the approximate count distinct? On Sun, Nov 20, 2016 at 11:51 PM, Xinyu Zhang wrote: > > MapWithState is also very useful. > I want to calculate UV in real time, but "distinct count" and "multiple > streaming aggregations" are not supported. > Is there any method to
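A minimal sketch of what Reynold suggests, assuming a Kafka source and a `userid` column (both assumptions, not in the thread): `approxCountDistinct` (renamed `approx_count_distinct` in later releases) is the streaming-friendly alternative to an exact distinct count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approxCountDistinct

val spark = SparkSession.builder.appName("uv-sketch").getOrCreate()
import spark.implicits._

// Hypothetical source: a Kafka topic whose messages carry a user id.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder address
  .option("subscribe", "events")                      // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS userid")

// Approximate distinct users (HyperLogLog++), which streaming aggregation supports.
val uv = events.agg(approxCountDistinct($"userid").as("uv"))

val query = uv.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```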

Re: Re: Multiple streaming aggregations in structured streaming

2016-11-20 Thread Xinyu Zhang
MapWithState is also very useful. I want to calculate UV in real time, but "distinct count" and "multiple streaming aggregations" are not supported. Is there any method to calculate real-time UV in the current version? At 2016-11-19 06:01:45, "Michael Armbrust"

RE: DataFrame select non-existing column

2016-11-20 Thread Mendelson, Assaf
Nested columns are in fact syntactic sugar. You basically have a column called pass whose type is a struct with a field called mobile. After you read the parquet file you can check the schema (df.schema) and look at what it has. Basically loop through the types and
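The truncated suggestion ("loop through the types") might look like the recursive check below; this is a sketch, and the `hasField` helper and its path-walking are mine, not from the thread.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Walk a (possibly nested) path like Seq("pass", "mobile") through a schema.
def hasField(schema: StructType, path: Seq[String]): Boolean =
  if (path.isEmpty) true
  else schema.fields.find(_.name == path.head) match {
    // The field exists and is itself a struct: recurse into it.
    case Some(StructField(_, nested: StructType, _, _)) => hasField(nested, path.tail)
    // The field exists but is a leaf: only valid if the path ends here.
    case Some(_) => path.tail.isEmpty
    case None    => false
  }

// Usage sketch: guard the select on files whose schema may lack the column.
// if (hasField(df.schema, Seq("pass", "mobile"))) df.select("pass.mobile") else ...
```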

RE: Join Query

2016-11-20 Thread Shreya Agarwal
Replication join = broadcast join. Look for that term on Google; many examples. A semi join can be done on DataFrames/Datasets by passing "leftsemi" as the third parameter of the join/joinWith function. Not sure about the other two. Sent from my Windows 10 phone From: Aakash
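A hedged sketch of both joins mentioned above (tiny inline tables are for illustration only); note that the join-type string Spark accepts is "leftsemi" ("left_semi" also works on newer versions).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.master("local[*]").appName("joins").getOrCreate()
import spark.implicits._

val big   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "v")
val small = Seq((1, "x"), (3, "y")).toDF("key", "w")

// Replication join = broadcast join: the small side is shipped to every executor.
val replicated = big.join(broadcast(small), Seq("key"))

// Semi join: keep rows of `big` that have a match in `small`,
// without bringing in any columns from `small`.
val semi = big.join(small, Seq("key"), "leftsemi")
```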

RE: HDPCD SPARK Certification Queries

2016-11-20 Thread Shreya Agarwal
Replication join = broadcast join. Look for that term on Google; many examples. A semi join can be done on DataFrames/Datasets by passing "leftsemi" as the third parameter of the join/joinWith function. Not sure about the other two. Sent from my Windows 10 phone From: Aakash

Re: spark streaming with kinesis

2016-11-20 Thread Shushant Arora
Hi, thanks. I have a doubt about the Spark streaming Kinesis consumer. Say I have a batch time of 500 ms and the Kinesis stream is partitioned on userid (uniformly distributed). But since IdleTimeBetweenReadsInMillis is set to 1000 ms, the Spark receiver nodes will fetch the data at an interval of 1 second and store

Re: Spark driver not reusing HConnection

2016-11-20 Thread Mukesh Jha
Any ideas folks? On Fri, Nov 18, 2016 at 3:37 PM, Mukesh Jha wrote: > Hi > > I'm accessing multiple regions (~5k) of an HBase table using spark's > newAPIHadoopRDD. But the driver is trying to calculate the region size of > all the regions. > It is not even reusing the

Re: dataframe data visualization

2016-11-20 Thread ayan guha
Zeppelin with Spark thrift server? On Mon, Nov 21, 2016 at 1:47 PM, Saisai Shao wrote: > You might take a look at this project (https://github.com/vegas-viz/Vegas), > it has Spark integration. > > Thanks > Saisai > > On Mon, Nov 21, 2016 at 10:23 AM,

Re: Linear regression + Janino Exception

2016-11-20 Thread janardhan shetty
Seems like this is associated with: https://issues.apache.org/jira/browse/SPARK-16845 On Sun, Nov 20, 2016 at 6:09 PM, janardhan shetty wrote: > Hi, > > I am trying to execute the linear regression algorithm on Spark 2.0.2 and > hitting the below error when I am fitting my

Re: dataframe data visualization

2016-11-20 Thread Saisai Shao
You might take a look at this project (https://github.com/vegas-viz/Vegas), it has Spark integration. Thanks Saisai On Mon, Nov 21, 2016 at 10:23 AM, wenli.o...@alibaba-inc.com < wenli.o...@alibaba-inc.com> wrote: > Hi anyone, > > is there any easy way for me to do data visualization in spark

dataframe data visualization

2016-11-20 Thread wenli.o...@alibaba-inc.com
Hi anyone, is there any easy way for me to do data visualization in Spark using Scala when the data is in DataFrame format? Thanks. Wayne Ouyang

Linear regression + Janino Exception

2016-11-20 Thread janardhan shetty
Hi, I am trying to execute the linear regression algorithm on Spark 2.0.2 and hitting the below error when I am fitting my training set: val lrModel = lr.fit(train) It happened on 2.0.0 as well. Any resolution steps are appreciated. Error snippet: 16/11/20 18:03:45 ERROR CodeGenerator: failed

Fwd: Yarn resource utilization with Spark pipe()

2016-11-20 Thread Sameer Choudhary
Hi, I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that uses RDD's pipe method to process my data. I start a lightweight daemon process that starts processes for each task via pipes. This is to ensure that I don't run into https://issues.apache.org/jira/browse/SPARK-671.

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread shyla deshpande
Thanks Jon, great learning resource. Thanks Pandees, addresses[0].city would work, but I want all the cities, not just the one from addresses[0]. Finally, I wrote the following function to get the cities. def getCities(addresses: Seq[Address]): String = { var cities: String = "" if
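The function above is truncated in the archive; a compact version of what it presumably does (assuming `Address` has a `city: String` field, per the schema in the thread) could look like:

```scala
case class Address(street: String, city: String)

// Comma-separated list of cities, skipping null/empty entries.
def getCities(addresses: Seq[Address]): String =
  addresses.iterator
    .map(_.city)
    .filter(c => c != null && c.nonEmpty)
    .mkString(",")
```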

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread Jon Gregg
In these cases it might help to just flatten the DataFrame. Here's a helper function from the tutorial (scroll down to the "Flattening" header):

Re: Flume integration

2016-11-20 Thread ayan guha
Hi. While I am following this discussion with interest, I am trying to comprehend any architectural benefit of a Spark sink. Is there any feature in Flume that makes it more suitable to ingest stream data than Spark streaming, so that we should chain them? For example, does it help durability or

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread pandees waran
Have you tried using the "." access method? e.g: ds1.select("name","addresses[0].element.city") On Sun, Nov 20, 2016 at 9:59 AM, shyla deshpande wrote: > The following my dataframe schema > > root > |-- name: string (nullable = true) > |-- addresses: array

RE: Error in running twitter streaming job

2016-11-20 Thread Marco Mistroni
Hi. Start by running it locally and see how it compares; debug, then move to the cluster. Debugging stuff running on a cluster is a pain, as there can be tons of reasons; isolate the problem locally. On 20 Nov 2016 5:04 pm, "Kappaganthu, Sivaram (ES)" < sivaram.kappagan...@adp.com> wrote: > Thank You for

How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread shyla deshpande
The following is my dataframe schema:

root
 |-- name: string (nullable = true)
 |-- addresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: string (nullable = true)
 |    |    |-- city: string (nullable = true)

I want to
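One way to get every city rather than just addresses[0], sketched with a hypothetical `Person`/`Address` model matching the schema above: `explode` the array so there is one row per address, then use dot syntax on the resulting struct column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

case class Address(street: String, city: String)
case class Person(name: String, addresses: Seq[Address])

val spark = SparkSession.builder.master("local[*]").appName("nested").getOrCreate()
import spark.implicits._

val ds1 = Seq(
  Person("ann", Seq(Address("1 Main", "NYC"), Address("2 Oak", "LA")))
).toDS()

// One output row per array element; dot syntax then reaches into the struct.
val cities = ds1
  .select($"name", explode($"addresses").as("addr"))
  .select($"name", $"addr.city".as("city"))
```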

Re: Flume integration

2016-11-20 Thread Mich Talebzadeh
Thanks Ian. Was your Flume source IBM MQ by any chance? Dr Mich Talebzadeh, LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: convert local tsv file to orc file on distributed cloud storage (openstack)

2016-11-20 Thread Steve Loughran
On 19 Nov 2016, at 17:21, vr spark wrote: Hi, I am looking for Scala or Python code samples to convert a local tsv file to an orc file and store it on distributed cloud storage (openstack). So, I need these 3 samples. Please suggest. 1. read tsv 2.

Re: Flume integration

2016-11-20 Thread Ian Brooks
Hi Mich, yes, I managed to resolve this one. The issue was that the way described in the docs doesn't work properly: in order for the Flume part to be notified, you need to set the storageLevel on the PollingStream, like JavaReceiverInputDStream flumeStream =
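A sketch of the polling-receiver setup Ian describes, in Scala (host name and port are placeholders); the explicit StorageLevel argument is the piece the docs gloss over.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("flume-poll")
val ssc  = new StreamingContext(conf, Seconds(5))

// Poll the Flume agent's SparkSink; passing the StorageLevel explicitly
// is what makes the Flume side get notified correctly.
val flumeStream = FlumeUtils.createPollingStream(
  ssc, "flume-agent-host", 9988, StorageLevel.MEMORY_AND_DISK_SER_2)

flumeStream.map(ev => new String(ev.event.getBody.array())).print()
ssc.start()
ssc.awaitTermination()
```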

Re: Usage of mllib api in ml

2016-11-20 Thread Marco Mistroni
Hi. If it is RDD-based, can't you use dataframe.rdd? (Though I don't know if you will have an RDD of vectors; you might need to convert each row to a vector yourself.) HTH On 20 Nov 2016 4:29 pm, "janardhan shetty" wrote: > Hi Marco and Yanbo, > > It is not the usage of

Re: Usage of mllib api in ml

2016-11-20 Thread janardhan shetty
Hi Marco and Yanbo, it is not about the usage of MulticlassClassificationEvaluator. Probably I was not clear; let me explain: I am trying to use confusionMatrix, which is not present in the ml version's MulticlassClassificationEvaluator, whereas it is present in MulticlassMetrics of mllib. How to leverage
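A common bridge for this, assuming `predictions` is the DataFrame returned by `model.transform(test)` with Double-typed `prediction` and `label` columns (assumptions, not from the thread): convert it to an RDD of (prediction, label) pairs and feed that to mllib's MulticlassMetrics.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Bridge from the ml (DataFrame) API to the mllib (RDD) evaluator.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)
```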

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-20 Thread Yong Zhang
If you have 2 different RDDs (2 different references and RDD IDs, as shown in your example), then YES, Spark will cache 2 exactly identical copies in memory. There is no way for Spark to compare them and know that they hold the same content. You defined them as 2 RDDs, so they are different RDDs, and

Re: DataFrame select non-existing column

2016-11-20 Thread Kristoffer Sjögren
The problem is that I do not know which data frames have the pass.mobile column. I just list an HDFS directory which contains the parquet files; some files have the column and some don't. I really don't want to have conditional logic that inspects the schema. But maybe that's the only option? Maybe

Re: Usage of mllib api in ml

2016-11-20 Thread Marco Mistroni
Hi you can also have a look at this example, https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220 kr marco On Sun, Nov 20, 2016 at 9:09 AM, Yanbo Liang wrote: > You can refer this

Re: Flume integration

2016-11-20 Thread Mich Talebzadeh
Hi Ian, has this been resolved? How about data to Flume, then Kafka, and Kafka streaming into Spark? Thanks

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-20 Thread Taotao.Li
hi, you can check my stackoverflow question : http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812 On Sat, Nov 19, 2016 at 3:16 AM, Rabin Banerjee < dev.rabin.baner...@gmail.com> wrote: > Hi Yong, > > But every time val tabdf =

Using Flume as Input Stream to Spark

2016-11-20 Thread Mich Talebzadeh
Hi, for streaming data I have used both Kafka and Twitter, as Spark has receivers for both and the procedures for Kafka streaming are well established. However, I have not used Flume as a feed to Spark streaming, and to the best of my knowledge I have not seen any discussion of this in this forum. I

RE: DataFrame select non-existing column

2016-11-20 Thread Mendelson, Assaf
The issue is that you already have a struct called pass. What you did was add a new top-level column called "pass.mobile" instead of adding the element to pass; the schema of the pass element is the same as before. When you do select pass.mobile, it finds the pass structure and checks for mobile in it.
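A sketch of the fix implied here: rebuild the `pass` struct from its existing fields plus the new one, so that `mobile` lives inside `pass` rather than as a flat column named "pass.mobile". The existing field names (`imei`, `sim`) are hypothetical, for illustration only.

```scala
import org.apache.spark.sql.functions.{lit, struct}

// `df` has a struct column `pass`; rebuild it so select("pass.mobile") resolves.
val fixed = df.withColumn(
  "pass",
  struct(
    df("pass.imei").as("imei"),            // hypothetical existing field
    df("pass.sim").as("sim"),              // hypothetical existing field
    lit(null).cast("string").as("mobile")  // the new nested field
  ))
```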

Re: Usage of mllib api in ml

2016-11-20 Thread Yanbo Liang
You can refer to this example ( http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation), which uses BinaryClassificationEvaluator; it should be very straightforward to switch to MulticlassClassificationEvaluator. Thanks Yanbo On Sat, Nov 19, 2016 at 9:03