java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread Rajan, Naveen
Dear All, My code works fine with JSON input data. When I tried the Parquet data format, it worked for English data. For Japanese text, I am getting the stack-trace below. Please help! Caused by: java.lang.ClassCastException: optional binary element (UTF8) is not a group at org.apache.parque

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.
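
For reference, a minimal sketch of the current RankingMetrics API, which only supports binary relevance (the data and the SparkContext sc here are illustrative):

    import org.apache.spark.mllib.evaluation.RankingMetrics

    // Each pair is (predicted ranking, set of relevant documents).
    val predictionAndLabels = sc.parallelize(Seq(
      (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
      (Array(4, 1, 5, 6, 2), Array(1, 2, 3))))

    val metrics = new RankingMetrics(predictionAndLabels)
    // NDCG@5 with 0/1 relevance only; graded relevance scores are not supported.
    metrics.ndcgAt(5)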

spark sql thrift server: driver OOM

2016-09-20 Thread Young
Hi, all: I'm using the Spark SQL thrift server under Spark 1.3.1 to run Hive SQL queries. I started the thrift server like ./sbin/start-thriftserver.sh --master yarn-client --num-executors 12 --executor-memory 5g --driver-memory 5g, then sent continuous Hive SQL queries to the thrift server. However

Re: is there any bug for the configuration of spark 2.0 cassandra spark connector 2.0 and cassandra 3.0.8

2016-09-20 Thread Todd Nist
These types of questions would be better asked on the user mailing list for the Spark Cassandra connector: http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user Version compatibility can be found here: https://github.com/datastax/spark-cassandra-connector#version-compa

Convert RDD to JSON Rdd and append more information

2016-09-20 Thread sujeet jog
Hi, I have an RDD of n rows; I want to transform this to a JSON RDD and also add some more information. Any idea how to accomplish this? Ex: I have an RDD with n rows with data like below: 16.9527493170273,20.1989561393151,15.7065424947394 17.9527493170273,21.1989561393151,15.70654249

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDDs first with more information and then map to a case class, if you are using Scala. You can then use the Play API classes (play.api.libs.json.Writes/play.api.libs.json.Json) to convert the mapped case class to JSON. Thanks Deepak On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog wro
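
A minimal sketch of this approach; the case class fields and the input RDD[(Double, Double, Double)] named rdd are assumed for illustration:

    import play.api.libs.json.{Json, Writes}

    // Hypothetical case class matching one enriched row.
    case class Measurement(v1: Double, v2: Double, v3: Double, label: String)
    implicit val measurementWrites: Writes[Measurement] = Json.writes[Measurement]

    // Enrich each row, then render it as a JSON string.
    val enriched = rdd.map { case (a, b, c) => Measurement(a, b, c, "extra-info") }
    val jsonRdd  = enriched.map(m => Json.toJson(m).toString)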

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Have you checked to see if any files already exist at /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete them before attempting to save your DataFrame to that location. Alternatively, you may be able to specify the "mode" setting of the df.write operation to "overwrite", depend
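
For comparison, a minimal Scala sketch of the same idea, assuming Spark 2.0's built-in CSV writer and a DataFrame named df:

    // "overwrite" replaces any existing output instead of failing on a
    // non-empty directory, which is what the default "error" mode does.
    df.write
      .mode("overwrite")
      .option("header", "true")
      .csv("/nfspartition/sankar/banking_l1_v2.csv")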

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Hey Kevin, It is an empty directory. It is able to write part files to the directory, but while merging those part files we are getting the above error. Regards On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott wrote: > Have you checked to see if any files already exist at > /nfspartition/sankar/banki

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Is this a bug? On Sep 19, 2016 10:10 PM, "janardhan shetty" wrote: > Hi, > > I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835 > . > > Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is > appreciated? > > Note: > Pipeline has Ngram before word2Vec. >

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
Ah, I think that this was supposed to be changed with SPARK-9062. Let me see about reopening 10835 and addressing it. On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty wrote: > Is this a bug? > > On Sep 19, 2016 10:10 PM, "janardhan shetty" wrote: >> >> Hi, >> >> I am hitting this issue. >> http

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Thanks Sean. On Sep 20, 2016 7:45 AM, "Sean Owen" wrote: > Ah, I think that this was supposed to be changed with SPARK-9062. Let > me see about reopening 10835 and addressing it. > > On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty > wrote: > > Is this a bug? > > > > On Sep 19, 2016 10:10 PM, "

Re: Continuous warning while consuming using new kafka-spark010 API

2016-09-20 Thread Cody Koeninger
-dev +user That warning pretty much means what it says - the consumer tried to get records for the given partition / offset, and couldn't do so after polling the kafka broker for X amount of time. If that only happens when you put additional load on Kafka via producing, the first thing I'd do is
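
The timeout the warning refers to is configurable; a minimal sketch (the value here is illustrative):

    import org.apache.spark.SparkConf

    // Give the cached Kafka consumer longer to fetch records before the
    // "Failed to get records ... after polling" warning fires.
    val conf = new SparkConf()
      .set("spark.streaming.kafka.consumer.poll.ms", "10000")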

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
You can probably just do an identity transformation on the column to make its type a nullable String array -- ArrayType(StringType, true). Of course, I'm not sure why Word2Vec must reject a non-null array type when it can of course handle nullable, but the previous discussion indicated that this ha
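
A minimal sketch of that identity transformation, assuming the DataFrame is named df and the NGram output column is named "ngrams"; a Scala UDF over Seq[String] infers its result type as the nullable ArrayType(StringType, true):

    import org.apache.spark.sql.functions.udf
    import spark.implicits._  // assuming a SparkSession named spark

    // Identity UDF; the inferred return type is ArrayType(StringType, true),
    // which satisfies Word2Vec's schema check.
    val asNullable = udf((xs: Seq[String]) => xs)
    val fixed = df.withColumn("ngrams", asNullable($"ngrams"))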

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Hi Sean, Any suggestions for workaround as of now? On Sep 20, 2016 7:46 AM, "janardhan shetty" wrote: > Thanks Sean. > On Sep 20, 2016 7:45 AM, "Sean Owen" wrote: > >> Ah, I think that this was supposed to be changed with SPARK-9062. Let >> me see about reopening 10835 and addressing it. >> >>

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
Hi Yan, I agree, it IS really confusing. Here is the technique for transforming a column. It is very general because you can make "myConvert" do whatever you want. import org.apache.spark.mllib.linalg.Vectors val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF df.show() // The columns were named
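
A fuller version of that sketch, with a hypothetical "myConvert" that parses the bracketed string into an MLlib Vector:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.udf
    import spark.implicits._  // assuming a SparkSession named spark

    val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF("id", "vec")

    // Hypothetical converter: strip the brackets, split on commas,
    // and build a dense Vector from the parsed doubles.
    val myConvert = udf((s: String) =>
      Vectors.dense(s.stripPrefix("[").stripSuffix("]").split(',').map(_.toDouble)))

    df.withColumn("vec", myConvert($"vec")).show()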

Options for method createExternalTable

2016-09-20 Thread CalumAtTheGuardian
Hi, I am trying to create an external table WITH partitions using Spark. Currently I am using catalog.createExternalTable(cleanTableName, "ORC", schema, Map("path" -> s"s3://$location/")) Does createExternalTable have options that can create a table with partitions? I assume it would be a key-

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Can you please post the line of code that is doing the df.write command? On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Hey Kevin, > > It is a empty directory, It is able to write part files to the directory > but while merging those part files we

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Please find the code below. sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json") I tried these two commands. write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true") saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true") On Tue, Sep 20, 2016 a

Spark Tasks Taking Increasingly Longer

2016-09-20 Thread Chris Jansen
Hi, I've been struggling with the reliability of what should, on the face of it, be a fairly simple job: given a number of events, group them by user, event type and week in which they occurred, and aggregate their counts. The input data is fairly skewed, so I do a repartition and add a salt to the k
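
For context, a minimal sketch of the salting pattern described above (the DataFrame and column names are assumed), aggregating in two stages so no single hot key lands on one task:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // assuming a SparkSession named spark

    // Stage 1: spread each skewed key across 16 salted sub-keys.
    val partial = events
      .withColumn("salt", (rand() * 16).cast("int"))
      .groupBy($"user", $"eventType", $"week", $"salt")
      .agg(count("*").as("cnt"))

    // Stage 2: drop the salt and combine the partial counts.
    val totals = partial
      .groupBy($"user", $"eventType", $"week")
      .agg(sum($"cnt").as("cnt"))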

RE: LDA and Maximum Iterations

2016-09-20 Thread Yang, Yuhao
Hi Frank, Which version of Spark are you using? Also, can you share more information about the exception? If it's not confidential, you can send the data sample to me (yuhao.y...@intel.com) and I can try to investigate. Regards, Yuhao From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID] S

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Instead of mode="append", try mode="overwrite". On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Please find the code below. > > sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json") > > I tried these two commands. > write.df(sankar2

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
I used that one also On Sep 20, 2016 10:44 PM, "Kevin Mellott" wrote: > Instead of *mode="append"*, try *mode="overwrite"* > > On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally creditvidya.com> wrote: > >> Please find the code below. >> >> sankar2 <- read.df("/nfspartition/sankar/test/2016/08

Re: Similar Items

2016-09-20 Thread Nick Pentreath
A few options include: https://github.com/marufaytekin/lsh-spark - I've used this a bit and it seems quite scalable too from what I've looked at. https://github.com/soundcloud/cosine-lsh-join-spark - not used this but looks like it should do exactly what you need. https://github.com/mrsqueeze/spa

Re: LDA and Maximum Iterations

2016-09-20 Thread Frank Zhang
Hi Yuhao, Thank you so much for your great contribution to the LDA and other Spark modules! I use both Spark 1.6.2 and 2.0.0. The data I used originally is very large, with tens of millions of documents. But for test purposes, the data set I mentioned earlier ("/data/mllib/sample_lda_d
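
For reference, a minimal sketch of running MLlib's LDA on that bundled sample file, close to the example in the Spark documentation (sc is an existing SparkContext):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line of word counts into a Vector, keyed by document id.
    val data   = sc.textFile("data/mllib/sample_lda_data.txt")
    val parsed = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    val corpus = parsed.zipWithIndex.map(_.swap).cache()

    // setMaxIterations controls how many EM iterations run before returning.
    val model = new LDA().setK(3).setMaxIterations(100).run(corpus)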

Re: Sending extraJavaOptions for Spark 1.6.1 on mesos 0.28.2 in cluster mode

2016-09-20 Thread Michael Gummelt
Probably this: https://issues.apache.org/jira/browse/SPARK-13258 As described in the JIRA, workaround is to use SPARK_JAVA_OPTS On Mon, Sep 19, 2016 at 5:07 PM, sagarcasual . wrote: > Hello, > I have my Spark application running in cluster mode in CDH with > extraJavaOptions. > However when I a

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Thanks Nick - those examples will help a ton!! On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath wrote: > A few options include: > > https://github.com/marufaytekin/lsh-spark - I've used this a bit and it > seems quite scalable too from what I've looked at. > https://github.com/soundcloud/cosine-

Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-20 Thread McBeath, Darin W (ELS-STL)
I'm using Spark 2.0. I've created a dataset from a parquet file, repartitioned on one of the columns (docId), and persisted the repartitioned dataset. val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK) When I try to confirm the partitioner with om.rdd.partitioner I get Op
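
As far as I know this is expected: Dataset.repartition hash-partitions at the SQL exchange level, and that partitioning is not exposed as an RDD Partitioner. A minimal sketch of what is observable:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._  // assuming a SparkSession named spark

    val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)

    // None: the SQL-level hash partitioning is not surfaced on the RDD.
    om.rdd.partitioner

    // The shuffle partition count (spark.sql.shuffle.partitions) is still visible.
    om.rdd.getNumPartitions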

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Are you able to manually delete the folder below? I'm wondering if there is some sort of non-Spark factor involved (permissions, etc). /nfspartition/sankar/banking_l1_v2.csv On Tue, Sep 20, 2016 at 12:19 PM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > I used that one also > >

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help! On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott wrote: > Thanks Nick - those examples will help a ton!! > > On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott wrote: > Hi a
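
MLlib's distributed matrices can do this; a minimal sketch with BlockMatrix (the data and the SparkContext sc are illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val rows = sc.parallelize(Seq(
      IndexedRow(0, Vectors.dense(1.0, 2.0)),
      IndexedRow(1, Vectors.dense(3.0, 4.0))))

    // v %*% t(v) as a distributed block-matrix product.
    val v    = new IndexedRowMatrix(rows).toBlockMatrix()
    val gram = v.multiply(v.transpose)
    gram.toLocalMatrix()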

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Yeah I can do all operations on that folder On Sep 21, 2016 12:15 AM, "Kevin Mellott" wrote: > Are you able to manually delete the folder below? I'm wondering if there > is some sort of non-Spark factor involved (permissions, etc). > > /nfspartition/sankar/banking_l1_v2.csv > > On Tue, Sep 20, 2

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Divya Gehlot
Spark version please? On 21 September 2016 at 09:46, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Yeah I can do all operations on that folder > > On Sep 21, 2016 12:15 AM, "Kevin Mellott" > wrote: > >> Are you able to manually delete the folder below? I'm wondering if there >> i

java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread naveen.ra...@sony.com
My code works fine with JSON input format (Spark 1.6 on Amazon EMR, emr-5.0.0). I tried the Parquet format; it works fine for English data. But with some Japanese-language text, I am getting this weird stack-trace: Caused by: java.lang.ClassCastException: optional binary

Israel Spark Meetup

2016-09-20 Thread Romi Kuntsman
Hello, Please add a link on the Spark Community page (https://spark.apache.org/community.html) to the Israel Spark Meetup (https://www.meetup.com/israel-spark-users/). We're an active meetup group, unifying the local Spark user community and holding regular meetups. Thanks! Romi K.