Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Many thanks, Yong. Your solution rocks. If you could paste your answer on Stack Overflow, I can mark it as the correct answer. Also, can you tell me how to achieve the same using a companion object? Cheers Pari On 29 March 2017 at 21:37, Yong Zhang wrote: > The error message

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-29 Thread Noorul Islam Kamal Malmiyoda
I think a better place would be an in-memory cache for real time. Regards, Noorul On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 wrote: > I am getting streaming data and want to show them onto dashboards in real > time? > May I know how best we can handle these streaming

How best we can store streaming data on dashboards for real time user experience?

2017-03-29 Thread Gaurav1809
I am getting streaming data and want to show it on dashboards in real time. May I know how best we can handle this streaming data? Where should it be stored? (DB or HDFS or ???) I want to give users a real-time analytics experience. Please suggest possible ways. Thanks. -- View this message in

Re: Why VectorUDT private?

2017-03-29 Thread Ryan
Spark version 2.1.0; the vector is from the ml package. The Vector in mllib has a public VectorUDT type. On Thu, Mar 30, 2017 at 10:57 AM, Ryan wrote: > I'm writing a transformer and the input column is vector type(which is the > output column from other transformer). But as the
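For the ml package specifically, the public `SQLDataTypes.VectorType` (available since Spark 2.0) can stand in for the private VectorUDT when validating a schema. A minimal sketch, assuming a hypothetical `inputCol` name:

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructType

// Check that a column holds ml vectors without referencing the private
// VectorUDT class; SQLDataTypes.VectorType is its public DataType handle.
def validateVectorColumn(schema: StructType, inputCol: String): Unit = {
  require(schema(inputCol).dataType == SQLDataTypes.VectorType,
    s"Column $inputCol must be of vector type, found ${schema(inputCol).dataType}")
}
```

A custom transformer's `transformSchema` could call this before appending its output column.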

Why VectorUDT private?

2017-03-29 Thread Ryan
I'm writing a transformer, and the input column is vector type (which is the output column from another transformer). But as the VectorUDT is private, how can I check/transform the schema for the vector column?

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello Yong, First of all, thank you for your attention. Note that the values of elements which have values in RDD/DF 1 and are in the same list will always be the same. Therefore, the "1" and "3", which are from RDD/DF 1, will always have the same value, which is "a". The goal here is assigning the same value to elements

Re: Spark streaming + kafka error with json library

2017-03-29 Thread Tathagata Das
Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) On Wed, Mar 29, 2017 at 9:59 AM, Srikanth wrote: > Hello, > > I'm trying to use "org.json4s" % "json4s-native" library in a spark > streaming + kafka direct app. > When I use the latest version of the

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread vaquar khan
Hi, I found the following two links helpful; sharing with you. http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join http://spark.apache.org/docs/latest/configuration.html Regards, Vaquar khan On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet

Re: KMean clustering resulting Skewed Issue

2017-03-29 Thread Asher Krim
As I said in my previous reply, I don't think k-means is the right tool to start with. Try LDA with k (the number of latent topics) set to 3 and go up to, say, 20. The problem likely lies in the feature vectors, on which you provided almost no information. Text is not taken from a continuous space, so

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Vidya Sujeet
In repartition, every element in the partition is moved to a new partition, doing a full shuffle, compared to the shuffles done by reduceBy clauses. With this in mind, repartition would increase your query performance. ReduceByKey will also shuffle, based on the aggregation. The best way to design is

Re: Collaborative filtering steps in spark

2017-03-29 Thread chris snow
Thanks Nick, that helps me with my understanding of ALS. On Wed, 29 Mar 2017 at 14:41, Nick Pentreath wrote: > No, it does a random initialization. It does use a slightly different > approach from pure normal random - it chooses non-negative draws which > results in

Alternatives for dataframe collectAsList()

2017-03-29 Thread szep.laszlo.it
Hi, after I created a dataset Dataset df = sqlContext.sql("query"); I need the result values, so I call the method collectAsList(): List list = df.collectAsList(); But it's very slow if I work with large datasets (20-30 million records). I know that the result isn't presented in driver
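One common alternative is to stream rows to the driver one partition at a time instead of materializing everything at once. A hedged sketch using the real `toLocalIterator` API; `df` is the Dataset from the post and `processRow` is a hypothetical handler:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

// Instead of collectAsList(), which pulls all 20-30M rows into driver
// memory at once, toLocalIterator() fetches one partition at a time.
def streamToDriver(df: org.apache.spark.sql.Dataset[Row])(processRow: Row => Unit): Unit = {
  df.toLocalIterator().asScala.foreach { row =>
    processRow(row) // only one partition is resident on the driver at a time
  }
}
```

For result sets this large, writing the Dataset out (e.g. with `df.write.parquet(...)`) and consuming it downstream usually beats bringing it to the driver at all.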

Returning DataFrame for text file

2017-03-29 Thread George Obama
Hi, I saw that the API, in either R or Scala, returns a DataFrame for sparkSession.read.text(). What's the rationale behind this? Regards, George

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Yong Zhang
You don't need to repartition your data just for the join. But if either side of the join is already partitioned, Spark will use this advantage as part of join optimization. Whether you should reduceByKey before the join really depends on your join logic. ReduceByKey will shuffle, and following

Spark streaming + kafka error with json library

2017-03-29 Thread Srikanth
Hello, I'm trying to use the "org.json4s" % "json4s-native" library in a Spark streaming + Kafka direct app. When I use the latest version of the lib I get an error similar to this. The workaround suggested there is to use version 3.2.10. As Spark has a
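One way to apply that workaround in an sbt build is to force the json4s version to match what Spark itself ships, so the app and Spark agree at runtime. A sketch of a build.sbt fragment, assuming sbt and the 3.2.10 version mentioned in the post:

```scala
// build.sbt sketch: pin json4s to the version Spark's classpath provides,
// overriding whatever newer version other dependencies pull in.
dependencyOverrides += "org.json4s" %% "json4s-native" % "3.2.10"
```

Shading json4s into the application jar is the heavier alternative when a newer json4s is genuinely required.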

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
The error message indeed is not very clear. What you did wrong is that repartitionAndSortWithinPartitions requires not only a PairRDD, but also an OrderedRDD. Your case class used as the key is NOT Ordered. Either extend it from Ordered, or provide a companion object to do the implicit Ordering.
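The companion-object approach can be sketched as follows. The DeviceKey fields are assumptions based on the truncated post, and the sort order shown (both fields ascending) is illustrative:

```scala
// Hypothetical key modeled on the truncated definition in the post;
// the second field's name and type are assumptions.
case class DeviceKey(serialNum: String, eventDate: String)

object DeviceKey {
  // Companion object supplying the implicit Ordering that
  // repartitionAndSortWithinPartitions resolves for the key type:
  // sort by serialNum, then eventDate.
  implicit val ordering: Ordering[DeviceKey] =
    Ordering.by((k: DeviceKey) => (k.serialNum, k.eventDate))
}
```

With this implicit in scope, `pairRdd.repartitionAndSortWithinPartitions(partitioner)` compiles, because an `Ordering[DeviceKey]` can now be found.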

httpclient conflict in spark

2017-03-29 Thread Arvind Kandaswamy
Hello, I am getting the following error when trying to use AWS S3. This appears to be a conflict with httpclient. The AWS S3 SDK comes with httpclient-4.5.2.jar. I am not sure how to force Spark to use this version. I have tried spark.driver.userClassPathFirst = true,
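One detail worth checking (an assumption, not a verified fix): `userClassPathFirst` must be set for the executors as well as the driver, since S3 reads and writes run on executors. A spark-defaults.conf sketch:

```
# Apply user-first classloading on both driver and executors, so the
# application's httpclient 4.5.2 wins over the older one Spark bundles.
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```

If that still conflicts, the more robust route is relocating (shading) `org.apache.http` inside the application jar so the two versions can coexist.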

Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different approach from pure normal random: it chooses non-negative draws, which results in very slightly better results empirically. In practice I'm not sure if the average-rating approach would make a big difference (it's been a long

Issues with partitionBy method on data frame writer SPARK 2.0.2

2017-03-29 Thread Luke Swift
Hello, I am trying to write parquet files from a data frame. I am able to use partitionBy("year", "month", "day"), and Spark correctly physically partitions the data in the directory structure I expect. The issue is that when the partitions themselves are anything non-trivial in size, then the memory
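A common mitigation for memory pressure with partitionBy (a hedged sketch, not a confirmed fix for this case) is to cluster the data by the partition columns first, so each task writes to only a few partition directories instead of holding many open Parquet writers and their buffers at once:

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so that each task holds data for few (year, month, day)
// combinations, keeping the number of concurrent Parquet writers per
// task small. The output path is hypothetical.
df.repartition(col("year"), col("month"), col("day"))
  .write
  .partitionBy("year", "month", "day")
  .parquet("/path/to/output")
```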

Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Hi, I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to implement secondary sort in my Spark job. I have defined my key case class as case class DeviceKey(serialNum: String, eventDate:

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Yong Zhang
What is the desired result for

RDD/DF 1
1, a
3, c
5, b

RDD/DF 2
[1, 2, 3]
[4, 5]

Yong From: Mungeol Heo Sent: Wednesday, March 29, 2017 5:37 AM To: user@spark.apache.org Subject: Need help for RDD/DF transformation. Hello, Suppose,

Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello, Suppose I have two RDDs or data frames like those addressed below.

RDD/DF 1
1, a
3, a
5, b

RDD/DF 2
[1, 2, 3]
[4, 5]

I need to create a new RDD/DF like the one below from RDD/DF 1 and 2.

1, a
2, a
3, a
4, b
5, b

Is there an efficient way to do this? Any help will be great. Thank you.
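The transformation logic can be sketched in plain Scala (a Spark version would explode the lists of RDD/DF 2 and join on the element); the names `known` and `groups` are illustrative:

```scala
// known: the (element -> value) pairs of RDD/DF 1
// groups: the element lists of RDD/DF 2
val known  = Map(1 -> "a", 3 -> "a", 5 -> "b")
val groups = List(List(1, 2, 3), List(4, 5))

val result = groups.flatMap { group =>
  // Members of one list that appear in RDD/DF 1 all share the same value,
  // so any single match determines the value for the whole group.
  val value = group.collectFirst { case k if known.contains(k) => known(k) }
  group.map(k => (k, value.getOrElse("?"))) // "?" guards a group with no match
}.sortBy(_._1)
// result: List((1,"a"), (2,"a"), (3,"a"), (4,"b"), (5,"b"))
```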

Re: Upgrade the scala code using the most updated Spark version

2017-03-29 Thread Anahita Talebi
Hi, Thanks everybody for helping me solve my problem :) As Zhu said, I had to use mapPartitionsWithIndex in my code. Thanks, have a nice day, Anahita On Wed, Mar 29, 2017 at 2:51 AM, Shixiong(Ryan) Zhu wrote: > mapPartitionsWithSplit was removed in Spark 2.0.0. You
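The replacement is a straight rename: mapPartitionsWithIndex has the same shape as the removed mapPartitionsWithSplit. A minimal sketch, where `sc` is an assumed SparkContext:

```scala
// Tag each element with the index of the partition it lives in;
// the function receives (partitionIndex, iterator) exactly as
// mapPartitionsWithSplit did before its removal in 2.0.0.
val rdd = sc.parallelize(1 to 10, numSlices = 2)
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x))
}
```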

Re: dataframe join questions. Appreciate your input.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD