handle data skew problem when calculating word count and word dependency

2016-11-13 Thread ruan.answer
I am planning to calculate word count and two-word dependency via Spark, but the data is skewed. How can I solve this problem? And do you have any suggestions about two-level data slicing? I have some topics, and each topic corresponds to a lot of text, so I have an RDD structured like this:
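A minimal sketch of one common fix, key salting: prepend a random salt so a hot word is spread over several reducers, then strip the salt and merge the partial counts. The RDD[String] input and bucket count are illustrative assumptions, not from the original post:

    import org.apache.spark.rdd.RDD
    import scala.util.Random

    // Salted word count: spread each (possibly hot) word over `buckets` keys,
    // aggregate partial counts, then merge them back per word.
    def saltedWordCount(words: RDD[String], buckets: Int = 16): RDD[(String, Long)] = {
      words
        .map(w => ((Random.nextInt(buckets), w), 1L)) // salt the key
        .reduceByKey(_ + _)                           // partial counts per (salt, word)
        .map { case ((_, w), c) => (w, c) }           // drop the salt
        .reduceByKey(_ + _)                           // merge: far fewer values per key
    }

The same trick applies to word-pair counts by salting the (word1, word2) key.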

Re: Convert SparseVector column to Densevector column

2016-11-13 Thread Takeshi Yamamuro
Hi, How about this? import org.apache.spark.ml.linalg._ val toSV = udf((v: Vector) => v.toDense) val df = Seq((0.1, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3))), (0.2, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3)))).toDF("a", "b") df.select(toSV($"b")) // maropu On Mon, Nov 14, 2016

Convert SparseVector column to Densevector column

2016-11-13 Thread janardhan shetty
Hi, Is there any easy way of converting a dataframe column from SparseVector to DenseVector using the org.apache.spark.ml.linalg.DenseVector API? Spark ML 2.0

Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh, Could you please open an issue at https://github.com/databricks/spark-xml with some code so that reviewers can reproduce the issue you met? Thanks! 2016-11-14 0:20 GMT+09:00 rakesh sharma : > Hi > > I'm trying to convert an XML file to data frame using

Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread Nicholas Sharkey
Amen > On Nov 13, 2016, at 7:55 PM, janardhan shetty wrote: > > These JIRAs are still unresolved: > https://issues.apache.org/jira/browse/SPARK-11215 > > Also there is https://issues.apache.org/jira/browse/SPARK-8418 > >> On Wed, Aug 17, 2016 at 11:15 AM, Nisha

Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread janardhan shetty
These JIRAs are still unresolved: https://issues.apache.org/jira/browse/SPARK-11215 Also there is https://issues.apache.org/jira/browse/SPARK-8418 On Wed, Aug 17, 2016 at 11:15 AM, Nisha Muktewar wrote: > > The OneHotEncoder does *not* accept multiple columns. > > You can
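Since the encoder takes a single column, one hedged sketch is to generate a StringIndexer/OneHotEncoder pair per column and chain them all in one Pipeline (the column names here are hypothetical):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val cols = Seq("city", "device", "browser")
    // One indexer + encoder per input column, all fitted in a single pass.
    val stages: Array[PipelineStage] = cols.flatMap { c =>
      Seq(
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
        new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
      )
    }.toArray
    val encoded = new Pipeline().setStages(stages).fit(df).transform(df)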

Re: sbt shenanigans for a Spark-based project

2016-11-13 Thread Don Drake
I would upgrade your Scala version to 2.11.8 as Spark 2.0 uses Scala 2.11 by default. On Sun, Nov 13, 2016 at 3:01 PM, Marco Mistroni wrote: > HI all > i have a small Spark-based project which at the moment depends on jar > from Spark 1.6.0 > The project has few Spark
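A minimal build.sbt sketch of that upgrade, with the Spark 2.0 artifacts marked provided and the Flume connector pulled in separately (versions are illustrative):

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-sql"             % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-flume" % "2.0.1"
    )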

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Cody Koeninger
Preferred locations are only advisory; you can still get tasks scheduled on other executors. You can try bumping up the size of the cache to see if that is causing the issue you're seeing. On Nov 13, 2016 12:47, "Ivan von Nagy" wrote: > As the code iterates through the parallel
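For reference, the consumer cache in the 0.10 integration is sized by spark.streaming.kafka.consumer.cache.maxCapacity (default 64); a sketch of raising it, assuming that key is present in your Spark version:

    import org.apache.spark.SparkConf

    // Raise the cached-consumer cap well above the 32 consumers in play.
    val conf = new SparkConf()
      .setAppName("kafka-streaming")
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")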

sbt shenanigans for a Spark-based project

2016-11-13 Thread Marco Mistroni
Hi all, I have a small Spark-based project which at the moment depends on jars from Spark 1.6.0. The project has a few Spark examples plus one which depends on Flume libraries. I am attempting to move to Spark 2.0, but I am having issues with my dependencies. The setup below works fine when compiled

Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
That was a bit of a brute-force search, so I changed the code to use a UDF to compute the dot product between the two IDF vectors and sort on the new column. package com.ss.ml.clustering import org.apache.spark.sql.{DataFrame, SparkSession} import org.apache.spark.sql.functions._ import
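A hedged sketch of that approach, here using full cosine similarity rather than just the dot product (the "tfidf" column, query vector, and helper names are illustrative):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Cosine similarity between two ml vectors.
    def cosine(a: Vector, b: Vector): Double = {
      val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
      val norms = math.sqrt(a.toArray.map(x => x * x).sum) *
                  math.sqrt(b.toArray.map(x => x * x).sum)
      if (norms == 0) 0.0 else dot / norms
    }

    // Score every row against the query vector and keep the top k.
    def neighbours(df: DataFrame, query: Vector, k: Int): DataFrame = {
      val sim = udf((v: Vector) => cosine(v, query))
      df.withColumn("sim", sim(col("tfidf"))).orderBy(desc("sim")).limit(k)
    }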

receiver based spark streaming doubts

2016-11-13 Thread Shushant Arora
Hi, In receiver-based Spark Streaming, when the receiver gets data and stores it in blocks for workers to process, how many blocks does the receiver give to a worker? Say I have a streaming app with a 30 sec batch interval, what will happen: 1. For the first batch (first 30 sec) there will not be any data for
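For context: with receivers, blocks are cut every spark.streaming.blockInterval (default 200 ms), so a 30 sec batch yields roughly 30000 / 200 = 150 blocks per receiver, each becoming one task. A sketch of tuning it (the value is illustrative):

    import org.apache.spark.SparkConf

    // Fewer, larger blocks per batch: 30 s / 500 ms = 60 blocks per receiver.
    val conf = new SparkConf()
      .setAppName("receiver-streaming")
      .set("spark.streaming.blockInterval", "500ms")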

Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
This is what I have done; is there a better way of doing it? val df = spark.read.option("header", "false").csv("data") val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words") val tf = new HashingTF().setInputCol("words").setOutputCol("tf") val idf = new

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Ivan von Nagy
As the code iterates through the parallel list, it is processing up to 8 KafkaRDDs at a time. Each has its own unique topic and consumer group now. Every topic has 4 partitions, so technically there should never be more than 32 CachedKafkaConsumers. However, this seems not to be the case, as we are

[ANNOUNCE] Apache SystemML 0.11.0-incubating released

2016-11-13 Thread Luciano Resende
The Apache SystemML team is pleased to announce the release of Apache SystemML version 0.11.0-incubating. Apache SystemML provides declarative large-scale machine learning (ML) that aims at a flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from

Re: Strongly Connected Components

2016-11-13 Thread Nicholas Chammas
FYI: There is a new connected components implementation coming in GraphFrames 0.3. See: https://github.com/graphframes/graphframes/pull/119 Implementation is based on: https://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf Nick On Sat, Nov 12, 2016 at 3:01 PM Koert Kuipers

Re: Joining to a large, pre-sorted file

2016-11-13 Thread Silvio Fiorito
Hi Stuart, Yes, that's the query plan, but if you take a look at my screenshot, it skips the first stage since the datasets are co-partitioned. Thanks, Silvio From: Stuart White Sent: Saturday, November 12, 2016 11:20:28 AM To: Silvio
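One way to get that co-partitioning is to bucket and sort both tables on the join key at write time, so later joins can skip the shuffle; a hedged sketch (table and column names are illustrative):

    // Bucket both sides identically; bucketBy requires saveAsTable.
    largeDf.write.bucketBy(200, "id").sortBy("id").saveAsTable("large_b")
    otherDf.write.bucketBy(200, "id").sortBy("id").saveAsTable("other_b")

    // The join key matches the bucketing key, so neither side is shuffled.
    val joined = spark.table("large_b").join(spark.table("other_b"), "id")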

Spark SQL shell hangs

2016-11-13 Thread rakesh sharma
Hi, I'm trying to convert an XML file to a data frame using Databricks spark-xml, but the shell hangs when I do a select operation on the table. I believe it's a memory issue. How can I debug this? The XML file is 86 MB. Thanks in advance, Rakesh

Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
Hello, I have a dataset containing TF-IDF vectors for a corpus of documents. How do I perform a nearest neighbour search on the dataset, using cosine similarity? val df = spark.read.option("header", "false").csv("data") val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
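A completed sketch of that feature pipeline, assuming a SparkSession named spark (the IDF stage below fills in the truncated line and is an assumption):

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val df = spark.read.option("header", "false").csv("data")
    val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("tf")
    val idf = new IDF().setInputCol("tf").setOutputCol("tfidf")

    val terms = tf.transform(tk.transform(df))  // tokenize, then hash term counts
    val tfidf = idf.fit(terms).transform(terms) // fit IDF weights, apply them

From there the search itself can be done with a similarity UDF, as in the reply above.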

Re: toDebugString is clipped

2016-11-13 Thread Sean Owen
I believe it's the shell (Scala shell) that's cropping the output. See http://blog.ssanj.net/posts/2016-10-16-output-in-scala-repl-is-truncated.html On Sun, Nov 13, 2016 at 1:56 AM Anirudh Perugu < anirudh.per...@stonybrook.edu> wrote: > Hello all, > > I am trying to understand how graphx
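A simple workaround, since it is the REPL's result rendering (not Spark) doing the cropping: print the string yourself instead of letting the shell echo the value.

    // The REPL truncates echoed values, but println writes the whole string.
    println(rdd.toDebugString)  // `rdd` stands in for your RDD or graph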

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Kelum Perera
Thanks Marco, Sea, & Oshadha. I changed the permissions on the files in the Spark directory using "chmod", and now it works. Thank you very much for the help. Kelum On Sun, Nov 13, 2016 at 5:31 PM, Marco Mistroni wrote: > Hi > not a Linux expert but how did you install

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Marco Mistroni
Hi, not a Linux expert, but how did you install Spark? As a root user? The error above seems to indicate you don't have permission to access that directory. If you have full control of the host, you can try to do a chmod 777 on the directory where you installed Spark and its subdirs

Re: Spark joins using row id

2016-11-13 Thread Yan Facai
A pairRDD can use (hash) partition information to do some optimizations when joined, while I am not sure whether a Dataset can. On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma wrote: > For datasets structured as > > ds1 > rowN col1 > 1 A > 2 B > 3 C > 4
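A sketch of that RDD-side optimization: pre-partition both pair RDDs with the same partitioner so join() can avoid a full shuffle (the RDD names are illustrative):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(100)
    // After partitionBy, both sides are hash-partitioned identically;
    // join() then only needs a narrow, co-partitioned dependency.
    val left  = ds1Rdd.partitionBy(p).cache()  // ds1Rdd: RDD[(Int, String)]
    val right = ds2Rdd.partitionBy(p).cache()
    val joined = left.join(right)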

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Kelum Perera
Thanks Oshadha & Sean. Now, when I enter "spark-shell", this error pops up: bash: /root/spark/bin/pyspark: Permission denied. The same error comes for "pyspark" too. Any help on this? Thanks for your help. Kelum On Sun, Nov 13, 2016 at 2:14 PM, Oshadha Gunawardena < oshadha.ro...@gmail.com>

Re: Spark stalling during shuffle (maybe a memory issue)

2016-11-13 Thread bogdanbaraila
The issue was fixed for me by allocating just one core per executor. If I have executors with more than 1 core, the issue appears again. I haven't yet understood why this happens, but those hitting a similar issue can try this.
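A sketch of that workaround in configuration form (values are illustrative; spark.executor.instances applies on YARN):

    import org.apache.spark.SparkConf

    // One core per executor, with more executors to keep total parallelism.
    val conf = new SparkConf()
      .set("spark.executor.cores", "1")
      .set("spark.executor.instances", "8")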

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Sean Owen
You set SCALA_HOME twice and didn't set SPARK_HOME. On Sun, Nov 13, 2016, 04:50 Kelum Perera wrote: > Dear Users, > > I'm a newbie trying to get spark-shell running on Kali Linux, but getting > the error "spark-shell: command not found" > > I'm running Kali Linux 2 (64bit)