FP growth - Items in a transaction must be unique

2017-02-01 Thread Devi P.V
Hi all, I am trying to run the FP-growth algorithm using Spark and Scala. A sample input dataframe is the following: +---+ |productName
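That error usually means a single transaction contains duplicate items. A minimal sketch of deduplicating before calling FPGrowth (the column name "productName" and the RDD-based mllib API are assumptions based on the preview above; df stands for the poster's dataframe):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Assumed shape: df has an array<string> column "productName", one row per transaction.
// distinct removes duplicate items inside each transaction, avoiding
// "Items in a transaction must be unique".
val transactions: RDD[Array[String]] = df
  .select("productName")
  .rdd
  .map(_.getSeq[String](0).distinct.toArray)

val model = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.collect().foreach { fi =>
  println(fi.items.mkString("[", ",", "]") + ", " + fi.freq)
}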

Re: increasing cross join speed

2017-02-01 Thread Takeshi Yamamuro
Hi, I'm not sure how to improve this kind of query on vanilla Spark alone, but you can write custom physical plans for top-k queries. You can check the link below as a reference; benchmark: https://github.com/apache/incubator-hivemall/pull/33 manual:

Re: pivot over non numerical data

2017-02-01 Thread Kevin Mellott
This should work for non-numerical data as well - can you please elaborate on the error you are getting and provide a code sample? As a preliminary hint, you can "aggregate" text values using *max*. df.groupBy("someCol") .pivot("anotherCol") .agg(max($"textCol")) Thanks, Kevin On Wed, Feb
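A self-contained sketch of that hint, with made-up data and the column names from the reply:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("pivot-text").getOrCreate()
import spark.implicits._

// Made-up data: one (someCol, anotherCol, textCol) row per attribute.
val df = Seq(
  ("p1", "color", "red"),
  ("p1", "size",  "large"),
  ("p2", "color", "blue")
).toDF("someCol", "anotherCol", "textCol")

// pivot itself is type-agnostic; the aggregate just has to accept strings,
// and max on a string column picks the lexicographically largest value.
df.groupBy("someCol")
  .pivot("anotherCol")
  .agg(max($"textCol"))
  .show()
// someCol | color | size
// p1      | red   | large
// p2      | blue  | null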

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Don Drake
I imported that as my first command in my previous email. I'm using a spark-shell. scala> import org.apache.spark.sql.Encoder import org.apache.spark.sql.Encoder scala> Any comments regarding importing implicits in an application? Thanks. -Don On Wed, Feb 1, 2017 at 6:10 PM, Michael

Re: JavaRDD text metadata (file name) findings

2017-02-01 Thread neil90
You can use https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles(java.lang.String), but it will return an RDD of (filename, content) pairs.
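A Scala sketch of the same idea (the Java variant goes through JavaSparkContext as linked above; the input path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("file-name-metadata").getOrCreate()
val sc = spark.sparkContext

// wholeTextFiles yields (fileName, fileContent) pairs, so the file name
// can be carried along with every record derived from its content.
val files = sc.wholeTextFiles("hdfs:///data/input/*.txt")   // hypothetical path

val linesWithSource = files.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (fileName, line))
}
linesWithSource.take(5).foreach(println)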

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
This is the error; you are missing an import: ":13: error: not found: type Encoder" on abstract class RawTable[A : Encoder](inDir: String) { ... It works for me in a REPL.

Re: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Hollin Wilkins
Hey Aseem, If you are looking for a full-featured library to execute Spark ML pipelines outside of Spark, take a look at MLeap: https://github.com/combust/mleap Not only does it support transforming single instances of a feature vector, but you can execute your entire ML pipeline including

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Jerry Lam
Hi Koert, Thank you for your help! GOT IT! Best Regards, Jerry On Wed, Feb 1, 2017 at 6:24 PM, Koert Kuipers wrote: > you can still use it as Dataset[Set[X]]. all transformations should work > correctly. > > however dataset.schema will show binary type, and dataset.show

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Don Drake
Thanks for the reply. I did give that syntax ([A : Encoder]) a try yesterday, but I kept getting this exception in both the spark-shell and the Zeppelin browser. scala> import org.apache.spark.sql.Encoder import org.apache.spark.sql.Encoder scala> scala> case class RawTemp(f1: String, f2: String, temp:

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Koert Kuipers
You can still use it as Dataset[Set[X]]; all transformations should work correctly. However, dataset.schema will show a binary type, and dataset.show will show bytes (unfortunately). For example: scala> implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]] setEncoder: [X]=>
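A runnable sketch of that suggestion (the sample data is made up):

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

val spark = SparkSession.builder.appName("set-kryo-encoder").getOrCreate()
import spark.implicits._

// No built-in encoder exists for Set, so fall back to Kryo.
implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]]

val ds: Dataset[Set[(Long, Long)]] = Seq(Set((1L, 2L), (3L, 4L))).toDS()

// Transformations behave normally, but the schema is a single binary column
// and show() prints serialized bytes rather than readable values.
ds.map(_.filter(_._1 > 1L)).show()
ds.printSchema()   // root |-- value: binary (nullable = true)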

RE: Jars directory in Spark 2.0

2017-02-01 Thread Sidney Feiner
Ok, good to know ☺ Shading every Spark app it is then… Thanks! Sidney Feiner
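For reference, a minimal shading sketch with sbt-assembly (the Guava relocation is only an example of a conflicting dependency; Maven users would use the shade plugin's relocation settings equivalently):

// build.sbt -- relocate the app's own copy of a conflicting library so it
// cannot clash with the copy shipped in Spark's jars/ directory.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)

// Spark itself comes from the cluster, so keep it out of the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"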

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
You need to enforce that an Encoder is available for the type A using a context bound. import org.apache.spark.sql.Encoder abstract class RawTable[A : Encoder](inDir: String) { ... } On Tue, Jan 31, 2017 at 8:12 PM, Don Drake
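A minimal spark-shell sketch of that context bound (the case class, path, and parquet source are stand-ins, not the poster's actual code):

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// [A : Encoder] makes the compiler resolve an implicit Encoder[A] at the
// point where a concrete RawTable is constructed.
abstract class RawTable[A : Encoder](inDir: String) {
  def read(): Dataset[A] =
    SparkSession.builder.getOrCreate().read.parquet(inDir).as[A]
}

case class RawTemp(f1: String, f2: String, temp: Double)

// In the spark-shell, spark.implicits._ supplies the Encoder for the case class.
import spark.implicits._
val temps = new RawTable[RawTemp]("/tmp/temps") {}   // hypothetical path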

Re: using withWatermark on Dataset

2017-02-01 Thread Michael Armbrust
Can you give the full stack trace? Also which version of Spark are you running? On Wed, Feb 1, 2017 at 10:38 AM, Jerry Lam wrote: > Hi everyone, > > Anyone knows how to use withWatermark on Dataset? > > I have tried the following but hit this exception: > > dataset

pivot over non numerical data

2017-02-01 Thread Darshan Pandya
Hello, I am trying to transpose some data using groupBy/pivot/agg as mentioned in this blog: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html But this works only for numerical data. Any hints for doing the same thing for non-numerical data? -- Sincerely,

using withWatermark on Dataset

2017-02-01 Thread Jerry Lam
Hi everyone, Does anyone know how to use withWatermark on a Dataset? I have tried the following but hit this exception: dataset org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to "MyType" The code looks like the following: dataset .withWatermark("timestamp", "5
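For comparison, a typical usage sketch of withWatermark before a windowed aggregation (it assumes Spark 2.2+'s built-in rate source just to stay self-contained, and does not reproduce the cast exception above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
import spark.implicits._

// The rate source emits (timestamp, value) rows, so "timestamp" can serve
// as the event-time column for the watermark.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Rows arriving more than 5 minutes behind the watermark are dropped; the
// watermark is set on the untyped DataFrame before the windowed aggregation.
val counts = events
  .withWatermark("timestamp", "5 minutes")
  .groupBy(window($"timestamp", "10 minutes"))
  .count()

counts.writeStream
  .outputMode("append")
  .format("console")
  .start()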

Re: Jars directory in Spark 2.0

2017-02-01 Thread Marcelo Vanzin
Spark has never shaded dependencies (in the sense of renaming the classes), with a couple of exceptions (Guava and Jetty). So that behavior is nothing new. Spark's dependencies themselves have a lot of other dependencies, so doing that would have limited benefits anyway. On Tue, Jan 31, 2017 at

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Jerry Lam
Hi Koert, Thanks for the tips. I tried to do that but the column's type is now Binary. Do I get the Set[X] back in the Dataset? Best Regards, Jerry On Tue, Jan 31, 2017 at 8:04 PM, Koert Kuipers wrote: > set is currently not supported. you can use kryo encoder. there is

Re: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Seth Hendrickson
In Spark ML, the coefficients are not "pivoted", meaning that one of the coefficient sets is not set equal to zero. You can read more about it here: https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions You can translate your set of
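A small sketch of that translation (the numbers are made up, and class 2 is arbitrarily chosen as the reference class):

// Hypothetical output of ml.classification.LogisticRegressionModel for
// 3 classes and 3 features: coefficientMatrix is 3 x 3, interceptVector has length 3.
val coef = Array(
  Array( 0.5, -1.2,  0.3),  // class 0
  Array( 0.1,  0.4, -0.7),  // class 1
  Array(-0.6,  0.8,  0.4))  // class 2 (reference class for pivoting)
val intercepts = Array(0.2, -0.1, -0.1)

// Softmax probabilities do not change when the same vector is subtracted from
// every class, so pivoting just subtracts the reference class everywhere;
// class 2 ends up with all-zero coefficients and a zero intercept, matching the
// independent-binary-regressions form described in the Wikipedia link.
val pivotedCoef = coef.map(row => row.zip(coef(2)).map { case (c, ref) => c - ref })
val pivotedIntercepts = intercepts.map(_ - intercepts(2))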

union of compatible types

2017-02-01 Thread Koert Kuipers
Spark's union/merging of compatible types seems kind of weak. It works on basic types in the top-level record, but it fails for nested records, maps, arrays, etc. Are there any known workarounds or plans to improve this? For example, I get errors like this: org.apache.spark.sql.AnalysisException:
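One workaround that seems to help is rebuilding both sides to an identical schema before the union, since the analyzer will not widen types inside structs. A sketch with made-up frames where a nested field differs only in numeric width:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder.appName("union-align").getOrCreate()
import spark.implicits._

// Made-up frames: the nested field _1 is int on one side and long on the other.
val a = Seq((1L, (1, "x"))).toDF("id", "payload")
val b = Seq((2L, (2L, "y"))).toDF("id", "payload")

// Rebuild a's struct with the wider type so both schemas match exactly.
val aAligned = a.select(
  $"id",
  struct($"payload._1".cast("long").as("_1"), $"payload._2".as("_2")).as("payload"))

aAligned.union(b).printSchema()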

Re: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0)

2017-02-01 Thread Aseem Bansal
Can an admin of the mailing list please remove this address? I get this email every time I send an email to the mailing list. On Wed, Feb 1, 2017 at 5:12 PM, Yahoo! No Reply wrote: > > This is an automatically generated message. > > tylerchap...@yahoo-inc.com is no longer

Re: Hive Java UDF running on spark-sql issue

2017-02-01 Thread Alex
Yes... It's taking values from a record, which is JSON, and converting it into multiple columns after typecasting... On Wed, Feb 1, 2017 at 4:07 PM, Marco Mistroni wrote: > Hi > What is the UDF supposed to do? Are you trying to write a generic > function to convert values

Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Aseem Bansal
*What I want to do* I have trained a ml.classification.LogisticRegressionModel using the spark ml package. It has 3 features and 3 classes, so the generated model has coefficients in a (3, 3) matrix and intercepts in a Vector of length 3, as expected. Now, I want to take these coefficients and

Re: Hive Java UDF running on spark-sql issue

2017-02-01 Thread Marco Mistroni
Hi, what is the UDF supposed to do? Are you trying to write a generic function to convert values to another type depending on the type of the original value? Kr On 1 Feb 2017 5:56 am, "Alex" wrote: Hi, we have Java Hive UDFs which are working perfectly fine in

A question about inconsistency during dataframe creation with RDD/dict in PySpark

2017-02-01 Thread Han-Cheol Cho
Dear spark user ml members, I have quite messy input data, so it is difficult to load it as a dataframe object directly. What I did is to load it as an RDD of strings first, convert it to an RDD of pyspark.sql.Row objects, then use the toDF method as below. mydf = myrdd.map(parse).toDF() I