RE: TF-IDF Question

2015-06-04 Thread Somnath Pandeya
Hi, org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233]) is the sparse-vector representation of the terms: the first value (1048576) is the length of the vector, [35587,884670] are the indices of the words, and [3.458767233,3.458767233] are the TF-IDF values of those terms. Thanks
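
A minimal sketch of how that printed form maps onto MLlib's sparse-vector constructor (my own illustration, not code from the thread; the variable name is an assumption):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    // Rebuild the vector from its printed form: (size, [indices], [values]).
    Vector tfIdf = Vectors.sparse(
        1048576,                                  // length of the hashed feature space
        new int[]    {35587, 884670},             // indices of the two terms
        new double[] {3.458767233, 3.458767233}); // TF-IDF weight of each term

    System.out.println(tfIdf.apply(35587));       // prints 3.458767233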

RE: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Somnath Pandeya
Try the Scala Eclipse plugin to eclipsify the Spark project, then import Spark as an Eclipse project. -Somnath -Original Message- From: Nan Xiao [mailto:xiaonan830...@gmail.com] Sent: Thursday, May 28, 2015 12:32 PM To: user@spark.apache.org Subject: How to use Eclipse on Windows to build Spark

RE: save as text file throwing null pointer error.

2015-04-14 Thread Somnath Pandeya
Hi Akhil, I am running my program standalone. I get a null pointer exception when I run the Spark program locally and try to save my RDD as a text file. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, April 14, 2015 12:41 PM To: Somnath Pandeya Cc: user
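
For reference, a minimal working save looks like the sketch below (my own example, not the poster's code; it assumes a local master and an output directory that does not yet exist). One common source of a NullPointerException in this situation is capturing the SparkContext, or another RDD, inside a transformation:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SaveDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SaveDemo").setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
            lines.saveAsTextFile("out/lines"); // fails if out/lines already exists
            sc.stop();
        }
    }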

save as text file throwing null pointer error.

2015-04-09 Thread Somnath Pandeya
JavaRDD<String> lineswithoutStopWords = nonEmptylines .map(new Function<String, String>() { /** * */ private static final long
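
The archive cuts the snippet off above; a hedged reconstruction of the pattern (the stop-word set and the tokenization are my assumptions, and nonEmptylines is assumed to be a JavaRDD<String> built earlier in the program):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Strip a fixed set of stop words from every line.
    final Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the"));
    JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public String call(String line) {
                StringBuilder kept = new StringBuilder();
                for (String word : line.split("\\s+")) {
                    if (!stopWords.contains(word.toLowerCase())) {
                        kept.append(word).append(' ');
                    }
                }
                return kept.toString().trim();
            }
        });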

how to find near duplicate items from given dataset using spark

2015-04-02 Thread Somnath Pandeya
Hi All, I want to find near-duplicate items in a given dataset. For example, consider the data set 1. Cricket,bat,ball,stumps 2. Cricket,bowler,ball,stumps, 3. Football,goalie,midfielder,goal 4. Football,refree,midfielder,goal, Here 1 and 2 are near duplicates (only field 2 is
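
One way to make "near duplicate" concrete (my own sketch, not an answer from the thread) is Jaccard similarity over the comma-separated fields; the file path, the 0.5 threshold, and sc are assumptions. For records 1 and 2 above, 3 shared fields out of 5 distinct ones gives a similarity of 0.6:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> records = sc.textFile("items.txt");

    // Compare every unordered pair of distinct records exactly once.
    JavaPairRDD<String, String> pairs = records.cartesian(records)
        .filter(p -> p._1().compareTo(p._2()) < 0);

    // Keep pairs whose field sets overlap heavily (Jaccard similarity > 0.5).
    JavaPairRDD<String, String> nearDuplicates = pairs.filter(p -> {
        Set<String> a = new HashSet<>(Arrays.asList(p._1().split(",")));
        Set<String> b = new HashSet<>(Arrays.asList(p._2().split(",")));
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return (double) common.size() / all.size() > 0.5;
    });

The cartesian product is quadratic in the number of records, so this brute-force version only suits small datasets; MinHash/LSH-style banding is the usual next step at scale.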

RE: used cores are less than total no. of cores

2015-02-25 Thread Somnath Pandeya
Thanks Akhil, it was a simple fix, as you said; I had missed it. ☺ From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, February 25, 2015 12:48 PM To: Somnath Pandeya Cc: user@spark.apache.org Subject: Re: used cores are less than total no. of cores You can set the following

used cores are less than total no. of cores

2015-02-24 Thread Somnath Pandeya
Hi All, I am running the simple word-count example on Spark (standalone cluster). In the UI, each worker shows 32 cores available, but while running jobs only 5 cores are used. What should I do to increase the number of cores used, or is it chosen per job? Thanks
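
The reply quoted in the RE: entry above is cut off at "You can set the following"; a hedged guess at the kind of setting involved (spark.cores.max is a real standalone-mode property, but whether it was the actual fix here is an assumption):

    import org.apache.spark.SparkConf;

    // spark.cores.max caps the total number of cores a standalone-mode
    // application may claim across the whole cluster; raising it lets the
    // job use more of the 32 cores each worker advertises.
    SparkConf conf = new SparkConf()
        .setAppName("WordCount")
        .set("spark.cores.max", "32");

The same property can also be set in conf/spark-defaults.conf so every job picks it up.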

RE: skipping header from each file

2015-01-09 Thread Somnath Pandeya
Maybe you can use the wholeTextFiles method, which returns the filename and the content of each file as a PairRDD; then you can remove the first line from each file. -Original Message- From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] Sent: Friday, January 09, 2015 11:48 AM To:
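
A sketch of that suggestion (the path and the surrounding setup are assumptions; flatMap returning Iterable matches the Spark 1.x Java API):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import scala.Tuple2;

    // wholeTextFiles yields (filename, fileContent) pairs; emit every line
    // of each file except the first, which is the header.
    JavaRDD<String> noHeader = sc.wholeTextFiles("data/*.csv")
        .flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
            @Override
            public Iterable<String> call(Tuple2<String, String> file) {
                String[] lines = file._2().split("\n");
                return Arrays.asList(lines).subList(1, lines.length);
            }
        });

Because wholeTextFiles materializes each file's content as a single string, this fits many small files better than a few very large ones.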

RE: Spark with Hive cluster dependencies

2015-01-07 Thread Somnath Pandeya
You can also follow the link below; it works on a standalone Spark cluster. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started thanks Somnath From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, January 08, 2015 2:21 AM To: jamborta Cc: user

spark worker nodes getting disassociated while running hive on spark

2015-01-04 Thread Somnath Pandeya
Hi, I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark by following the link below. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started I got the latest build of Hive on Spark from git and was trying to run a few queries. Queries are

RE: unable to do group by with 1st column

2014-12-25 Thread Somnath Pandeya
Hi, you can also try reduceByKey, something like this: JavaPairRDD<String, String> ones = lines .mapToPair(new PairFunction<String, String, String>() { @Override public Tuple2<String, String> call(String s)
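
A fuller, hedged version of that suggestion (the comma separator and the way values per key are combined are my assumptions; lines is assumed to be a JavaRDD<String>): key each line by its first column, then merge rows per key with reduceByKey.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Key each line by its first comma-separated column.
    JavaPairRDD<String, String> byFirstColumn = lines
        .mapToPair(new PairFunction<String, String, String>() {
            @Override
            public Tuple2<String, String> call(String s) {
                String[] parts = s.split(",", 2);
                return new Tuple2<>(parts[0], parts.length > 1 ? parts[1] : "");
            }
        });

    // Combine all rows that share a key; unlike groupByKey, reduceByKey
    // merges values map-side before the shuffle.
    JavaPairRDD<String, String> grouped = byFirstColumn.reduceByKey(
        new Function2<String, String, String>() {
            @Override
            public String call(String a, String b) {
                return a + "|" + b;
            }
        });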