Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Many thanks, Yong. Your solution rocks. If you could paste your answer on Stack Overflow, I can mark it as the correct answer. Also, can you tell me how to achieve the same using a companion object? Cheers Pari On 29 March 2017 at 21:37, Yong Zhang wrote: > The error message

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-29 Thread Noorul Islam Kamal Malmiyoda
I think a better place would be an in-memory cache for real time. Regards, Noorul On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 wrote: > I am getting streaming data and want to show them onto dashboards in real > time? > May I know how best we can handle these streaming

How best we can store streaming data on dashboards for real time user experience?

2017-03-29 Thread Gaurav1809
I am getting streaming data and want to show it on dashboards in real time. May I know how best we can handle this streaming data? Where should it be stored? (DB or HDFS or ???) I want to give users a real-time analytics experience. Please suggest possible ways. Thanks. -- View this message in

Re: Why VectorUDT private?

2017-03-29 Thread Ryan
Spark version 2.1.0; the vector is from the ml package. The Vector in mllib has a public VectorUDT type. On Thu, Mar 30, 2017 at 10:57 AM, Ryan wrote: > I'm writing a transformer and the input column is vector type(which is the > output column from other transformer). But as the
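For the ml package specifically, the public `SQLDataTypes.VectorType` (available since Spark 2.0) can stand in for the private VectorUDT when validating a schema. A minimal sketch, assuming a hypothetical `inputCol` name:

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructType

// Check that a column holds ml vectors without referencing the private
// VectorUDT class; SQLDataTypes.VectorType is its public DataType handle.
def validateVectorColumn(schema: StructType, inputCol: String): Unit = {
  require(schema(inputCol).dataType == SQLDataTypes.VectorType,
    s"Column $inputCol must be of vector type, found ${schema(inputCol).dataType}")
}
```

A custom transformer's `transformSchema` could call this before appending its output column.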

Why VectorUDT private?

2017-03-29 Thread Ryan
I'm writing a transformer, and the input column is vector type (which is the output column from another transformer). But as the VectorUDT is private, how can I check/transform the schema for the vector column?

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello Yong, First of all, thank you for your attention. Note that the values of elements which have values in RDD/DF 1 and are in the same list will always be the same. Therefore, the "1" and "3", which are from RDD/DF 1, will always have the same value, which is "a". The goal here is assigning the same value to elements

Re: Spark streaming + kafka error with json library

2017-03-29 Thread Tathagata Das
Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) On Wed, Mar 29, 2017 at 9:59 AM, Srikanth wrote: > Hello, > > I'm trying to use "org.json4s" % "json4s-native" library in a spark > streaming + kafka direct app. > When I use the latest version of the

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread vaquar khan
Hi, I found the following two links helpful; sharing with you. http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join http://spark.apache.org/docs/latest/configuration.html Regards, Vaquar khan On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet

Re: KMean clustering resulting Skewed Issue

2017-03-29 Thread Asher Krim
As I said in my previous reply, I don't think k-means is the right tool to start with. Try LDA with k (the number of latent topics) set to 3 and go up to, say, 20. The problem likely lies in the feature vectors, on which you provided almost no information. Text is not taken from a continuous space, so

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Vidya Sujeet
In repartition, every element in the partition is moved to a new partition, doing a full shuffle, compared to the shuffles done by reduceBy clauses. With this in mind, repartition would increase your query performance. ReduceByKey will also shuffle, based on the aggregation. The best way to design is

Re: Collaborative filtering steps in spark

2017-03-29 Thread chris snow
Thanks Nick, that helps me with my understanding of ALS. On Wed, 29 Mar 2017 at 14:41, Nick Pentreath wrote: > No, it does a random initialization. It does use a slightly different > approach from pure normal random - it chooses non-negative draws which > results in

Alternatives for dataframe collectAsList()

2017-03-29 Thread szep.laszlo.it
Hi, after I created a dataset Dataset df = sqlContext.sql("query"); I need the result values, so I call the method collectAsList(): List list = df.collectAsList(); But it's very slow if I work with large datasets (20-30 million records). I know that the result isn't presented in driver
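One common alternative is to stream rows to the driver one partition at a time instead of materializing everything at once. A hedged sketch using the real `toLocalIterator` API; `df` is the Dataset from the post and `processRow` is a hypothetical handler:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

// Instead of collectAsList(), which pulls all 20-30M rows into driver
// memory at once, toLocalIterator() fetches one partition at a time.
def streamToDriver(df: org.apache.spark.sql.Dataset[Row])(processRow: Row => Unit): Unit = {
  df.toLocalIterator().asScala.foreach { row =>
    processRow(row) // only one partition is resident on the driver at a time
  }
}
```

For result sets this large, writing the Dataset out (e.g. with `df.write.parquet(...)`) and consuming it downstream usually beats bringing it to the driver at all.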

Returning DataFrame for text file

2017-03-29 Thread George Obama
Hi, I saw that the API, in either R or Scala, returns a DataFrame for sparkSession.read.text(). What's the rationale behind this? Regards, George

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Yong Zhang
You don't need to repartition your data just for the join. But if either side of the join is already partitioned, Spark will use this advantage as part of join optimization. Whether you should reduceByKey before the join really depends on your join logic. ReduceByKey will shuffle, and following

Spark streaming + kafka error with json library

2017-03-29 Thread Srikanth
Hello, I'm trying to use the "org.json4s" % "json4s-native" library in a Spark streaming + Kafka direct app. When I use the latest version of the lib I get an error similar to this. The workaround suggested there is to use version 3.2.10. As Spark has a
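One way to apply that workaround in an sbt build is to force the json4s version to match what Spark itself ships, so the app and Spark agree at runtime. A sketch of a build.sbt fragment, assuming sbt and the 3.2.10 version mentioned in the post:

```scala
// build.sbt sketch: pin json4s to the version Spark's classpath provides,
// overriding whatever newer version other dependencies pull in.
dependencyOverrides += "org.json4s" %% "json4s-native" % "3.2.10"
```

Shading json4s into the application jar is the heavier alternative when a newer json4s is genuinely required.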

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
The error message indeed is not very clear. What you did wrong is that repartitionAndSortWithinPartitions requires not only a PairRDD, but also an OrderedRDD. Your case class used as the key is NOT Ordered. Either extend it from Ordered, or provide a companion object to do the implicit Ordering.
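The companion-object approach can be sketched as follows. The DeviceKey fields are assumptions based on the truncated post, and the sort order shown (both fields ascending) is illustrative:

```scala
// Hypothetical key modeled on the truncated definition in the post;
// the second field's name and type are assumptions.
case class DeviceKey(serialNum: String, eventDate: String)

object DeviceKey {
  // Companion object supplying the implicit Ordering that
  // repartitionAndSortWithinPartitions resolves for the key type:
  // sort by serialNum, then eventDate.
  implicit val ordering: Ordering[DeviceKey] =
    Ordering.by((k: DeviceKey) => (k.serialNum, k.eventDate))
}
```

With this implicit in scope, `pairRdd.repartitionAndSortWithinPartitions(partitioner)` compiles, because an `Ordering[DeviceKey]` can now be found.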

httpclient conflict in spark

2017-03-29 Thread Arvind Kandaswamy
Hello, I am getting the following error when trying to use AWS S3. This appears to be a conflict with httpclient. The AWS S3 SDK comes with httpclient-4.5.2.jar. I am not sure how to force Spark to use this version. I have tried spark.driver.userClassPathFirst = true,
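One detail worth checking (an assumption, not a verified fix): `userClassPathFirst` must be set for the executors as well as the driver, since S3 reads and writes run on executors. A spark-defaults.conf sketch:

```
# Apply user-first classloading on both driver and executors, so the
# application's httpclient 4.5.2 wins over the older one Spark bundles.
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```

If that still conflicts, the more robust route is relocating (shading) `org.apache.http` inside the application jar so the two versions can coexist.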

Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different approach from pure normal random: it chooses non-negative draws, which results in very slightly better results empirically. In practice I'm not sure if the average-rating approach would make a big difference (it's been a long

Issues with partitionBy method on data frame writer SPARK 2.0.2

2017-03-29 Thread Luke Swift
Hello, I am trying to write parquet files from a data frame. I am able to use partitionBy("year", "month", "day"), and Spark correctly physically partitions the data in the directory structure I expect. The issue is that when the partitions themselves are anything non-trivial in size, then the memory
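A common mitigation for memory pressure with partitionBy (a hedged sketch, not a confirmed fix for this case) is to cluster the data by the partition columns first, so each task writes to only a few partition directories instead of holding many open Parquet writers and their buffers at once:

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so that each task holds data for few (year, month, day)
// combinations, keeping the number of concurrent Parquet writers per
// task small. The output path is hypothetical.
df.repartition(col("year"), col("month"), col("day"))
  .write
  .partitionBy("year", "month", "day")
  .parquet("/path/to/output")
```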

Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Hi, I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to implement secondary sort in my Spark job. I have defined my key case class as case class DeviceKey(serialNum: String, eventDate:

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Yong Zhang
What is the desired result for

RDD/DF 1
1, a
3, c
5, b

RDD/DF 2
[1, 2, 3]
[4, 5]

Yong From: Mungeol Heo Sent: Wednesday, March 29, 2017 5:37 AM To: user@spark.apache.org Subject: Need help for RDD/DF transformation. Hello, Suppose,

Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello, Suppose I have two RDDs or data frames like those addressed below.

RDD/DF 1
1, a
3, a
5, b

RDD/DF 2
[1, 2, 3]
[4, 5]

I need to create a new RDD/DF like the one below from RDD/DF 1 and 2.

1, a
2, a
3, a
4, b
5, b

Is there an efficient way to do this? Any help will be great. Thank you.
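The transformation logic can be sketched in plain Scala (a Spark version would explode the lists of RDD/DF 2 and join on the element); the names `known` and `groups` are illustrative:

```scala
// known: the (element -> value) pairs of RDD/DF 1
// groups: the element lists of RDD/DF 2
val known  = Map(1 -> "a", 3 -> "a", 5 -> "b")
val groups = List(List(1, 2, 3), List(4, 5))

val result = groups.flatMap { group =>
  // Members of one list that appear in RDD/DF 1 all share the same value,
  // so any single match determines the value for the whole group.
  val value = group.collectFirst { case k if known.contains(k) => known(k) }
  group.map(k => (k, value.getOrElse("?"))) // "?" guards a group with no match
}.sortBy(_._1)
// result: List((1,"a"), (2,"a"), (3,"a"), (4,"b"), (5,"b"))
```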

Re: Upgrade the scala code using the most updated Spark version

2017-03-29 Thread Anahita Talebi
Hi, Thanks everybody for helping me solve my problem :) As Zhu said, I had to use mapPartitionsWithIndex in my code. Thanks, have a nice day, Anahita On Wed, Mar 29, 2017 at 2:51 AM, Shixiong(Ryan) Zhu wrote: > mapPartitionsWithSplit was removed in Spark 2.0.0. You
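The replacement is a straight rename: mapPartitionsWithIndex has the same shape as the removed mapPartitionsWithSplit. A minimal sketch, where `sc` is an assumed SparkContext:

```scala
// Tag each element with the index of the partition it lives in;
// the function receives (partitionIndex, iterator) exactly as
// mapPartitionsWithSplit did before its removal in 2.0.0.
val rdd = sc.parallelize(1 to 10, numSlices = 2)
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x))
}
```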

Re: dataframe join questions. Appreciate your input.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD