RE: TF-IDF Question

2015-06-04 Thread Somnath Pandeya
Hi,

org.apache.spark.mllib.linalg.Vector = 
(1048576,[35587,884670],[3.458767233,3.458767233])
It is the sparse vector representation of the terms:
the first value (1048576) is the length of the vector,
[35587,884670] are the indices of the words, and
[3.458767233,3.458767233] are the TF-IDF values of those terms.
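
The indices themselves come from feature hashing, so the vector does not store the original terms. If the TF vectors were built with HashingTF (as in the MLlib example), you can recompute the index that any known term hashes to and compare it against the indices in the vector. A minimal Scala sketch, assuming the default HashingTF with 2^20 features:

import org.apache.spark.mllib.feature.HashingTF

val hashingTF = new HashingTF()    // default numFeatures = 2^20 = 1048576
hashingTF.indexOf("Hola")          // index this term hashes to
hashingTF.indexOf("spark")         // compare against [35587, 884670]

Note that hashing is one-way: different terms can collide in the same bucket, so there is no guaranteed exact inverse mapping from an index back to a term.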

Thanks
Somnath



From: franco barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, June 04, 2015 11:17 PM
To: user@spark.apache.org
Subject: TF-IDF Question

Hi all,

I have a .txt file where each row is a collection of terms of a document 
separated by spaces. For example:

1 Hola spark
2 ..

I followed this example from the Spark site 
(https://spark.apache.org/docs/latest/mllib-feature-extraction.html) and I get 
something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = 
(1048576,[35587,884670],[3.458767233,3.458767233])

This is what I think:


  1.  The first parameter, 1048576: I don't know what it is, but it is always 
the same number (maybe the number of terms).
  2.  The second parameter, [35587,884670]: I think these are the terms of the 
first line in my .txt file.
  3.  The third parameter, [3.458767233,3.458767233]: I think these are the 
TF-IDF values for my terms.
Does anyone know the exact interpretation of this? And regarding the second 
point, if those values are the terms, how can I match them with the original 
term values ([35587=Hola, 884670=spark])?

Regards and thanks in advance.

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com


RE: How to use Eclipse on Windows to build Spark environment?

2015-05-28 Thread Somnath Pandeya
Try the Scala Eclipse plugin to eclipsify the Spark project and import Spark as 
an Eclipse project.

-Somnath

-Original Message-
From: Nan Xiao [mailto:xiaonan830...@gmail.com] 
Sent: Thursday, May 28, 2015 12:32 PM
To: user@spark.apache.org
Subject: How to use Eclipse on Windows to build Spark environment?

Hi all,

I want to use Eclipse on Windows to build a Spark environment, but I find that 
the reference 
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup)
doesn't contain any guide for Eclipse.

Could anyone give tutorials or links on how to use Eclipse on Windows to build 
a Spark environment? Thanks in advance!

Best Regards
Nan Xiao




RE: save as text file throwing null pointer error.

2015-04-14 Thread Somnath Pandeya
Hi Akhil,

I am running my program standalone. I am getting a NullPointerException when I 
run the Spark program locally and try to save my RDD as a text file.

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, April 14, 2015 12:41 PM
To: Somnath Pandeya
Cc: user@spark.apache.org
Subject: Re: save as text file throwing null pointer error.

Where exactly is it throwing the NullPointerException? Are you starting your 
program from another program or something? It looks like you are invoking 
ProcessBuilder etc.

Thanks
Best Regards

On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya 
somnath_pand...@infosys.com wrote:

JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public String call(String line) throws Exception {
                // strip stop words from each non-empty line
                return removeStopWords(line, stopwords);
            }
        });

lineswithoutStopWords.saveAsTextFile("output/testop.txt");



Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
15/04/09 18:44:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
localhost): java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run

save as text file throwing null pointer error.

2015-04-09 Thread Somnath Pandeya

JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public String call(String line) throws Exception {
                // strip stop words from each non-empty line
                return removeStopWords(line, stopwords);
            }
        });

lineswithoutStopWords.saveAsTextFile("output/testop.txt");



Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
15/04/09 18:44:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
localhost): java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
   at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

15/04/09 18:44:36 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
aborting job
15/04/09 18:44:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have 
all completed, from pool
15/04/09 18:44:36 INFO TaskSchedulerImpl: Cancelling stage 1
15/04/09 18:44:36 INFO DAGScheduler: Job 1 failed: saveAsTextFile at 
TextPreProcessing.java:49, took 0.172959 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
   at 

how to find near duplicate items from given dataset using spark

2015-04-02 Thread Somnath Pandeya
Hi All,

I want to find near-duplicate items in a given dataset.
For example, consider a dataset:

1.   Cricket,bat,ball,stumps

2.   Cricket,bowler,ball,stumps,

3.   Football,goalie,midfielder,goal

4.   Football,refree,midfielder,goal,
Here 1 and 2 are near duplicates (only field 2 is different) and 3 and 4 are 
near duplicates (only field 2 is different).

This is what I did:
I created an Article class and implemented the equals and hashCode methods (my 
hashCode method returns a constant (1) for all objects).
In Spark I am using Article as the key and doing a groupBy on the articles.
Is this approach correct, or is there a better approach?

This is what my code looks like.

Article Class
public class Article implements Serializable {

    private static final long serialVersionUID = 1L;
    private String first;
    private String second;
    private String third;
    private String fourth;

    public Article() {
        set("", "", "", "");
    }

    public Article(String first, String second, String third, String fourth) {
        set(first, second, third, fourth);
    }

    @Override
    public int hashCode() {
        // constant hash code: every Article lands in the same hash bucket
        int result = 1;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Article other = (Article) obj;
        // "near duplicate": equal if any one of the four fields matches
        if ((first.equals(other.first) || second.equals(other.second)
                || third.equals(other.third) || fourth.equals(other.fourth))) {
            return true;
        } else {
            return false;
        }
    }

    private void set(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }
}


Spark Code
    public static void main(String[] args) throws Exception {

        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount")
                .setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile("data1/*");

        JavaRDD<Article> articles = lines.map(new Function<String, Article>() {

            private static final long serialVersionUID = 1L;

            public Article call(String line) throws Exception {
                String[] words = line.split(",");
                return new Article(words[0], words[1], words[2], words[3]);
            }
        });

        JavaPairRDD<Article, String> articlePair = lines
                .mapToPair(new PairFunction<String, Article, String>() {

                    public Tuple2<Article, String> call(String line)
                            throws Exception {
                        String[] words = line.split(",");
                        Article article = new Article(words[0], words[1],
                                words[2], words[3]);
                        return new Tuple2<Article, String>(article, line);
                    }
                });

        // group the original lines by Article key
        JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair
                .groupByKey();

        Map<Article, Iterable<String>> dupArticles = articlePairs
                .collectAsMap();

        System.out.println("size {} " + dupArticles.size());

        Set<Article> uniqueArticle = dupArticles.keySet();

        for (Article article : uniqueArticle) {
            Iterable<String> temps = dupArticles.get(article);
            System.out.println("keys " + article);
            for (String string : temps) {
                System.out.println(string);
            }
            System.out.println("==");
        }
        ctx.close();
        ctx.stop();
    }
}



RE: used cores are less then total no. of core

2015-02-25 Thread Somnath Pandeya
Thanks Akhil, it was a simple fix that you pointed out... I missed it. ☺

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, February 25, 2015 12:48 PM
To: Somnath Pandeya
Cc: user@spark.apache.org
Subject: Re: used cores are less then total no. of core

You can set the following in the conf while creating the SparkContext (if you 
are not using spark-submit):

.set("spark.cores.max", "32")
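
For context, a minimal Scala sketch of doing this when constructing the context yourself (the app name and master URL below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")                 // placeholder app name
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .set("spark.cores.max", "32")            // cap on the total cores the application may use
val sc = new SparkContext(conf)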



Thanks
Best Regards

On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya 
somnath_pand...@infosys.com wrote:
Hi All,

I am running a simple word count example on Spark (standalone cluster). In the 
UI it shows that 32 cores are available for each worker, but while running the 
jobs only 5 cores are being used.

What should I do to increase the number of cores used, or is it selected based 
on the jobs?

Thanks
Somnath





used cores are less then total no. of core

2015-02-24 Thread Somnath Pandeya
Hi All,

I am running a simple word count example on Spark (standalone cluster). In the 
UI it shows that 32 cores are available for each worker, but while running the 
jobs only 5 cores are being used.

What should I do to increase the number of cores used, or is it selected based 
on the jobs?

Thanks
Somnath



RE: skipping header from each file

2015-01-09 Thread Somnath Pandeya
Maybe you can use the wholeTextFiles method, which returns the file name and 
the content of the file as a PairRDD, and then you can remove the first line 
from each file (see the sketch below).
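
A minimal Scala sketch of that idea (the file paths are placeholders):

val rdd = sc.wholeTextFiles("file1,file2,file3")   // pairs of (fileName, fileContent)
  .flatMap { case (_, content) =>
    content.split("\n").drop(1)                    // drop the first (header) line of each file
  }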



-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com]
Sent: Friday, January 09, 2015 11:48 AM
To: user@spark.apache.org
Subject: skipping header from each file

Suppose I give three file paths to the Spark context to read, and each file has 
its schema in the first row. How can we skip the schema lines from the headers?


val rdd = sc.textFile("file1,file2,file3")







RE: Spark with Hive cluster dependencies

2015-01-07 Thread Somnath Pandeya
You can also follow the link below. It works on a standalone Spark cluster.

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started


thanks
Somnath
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, January 08, 2015 2:21 AM
To: jamborta
Cc: user
Subject: Re: Spark with Hive cluster dependencies

Have you looked at Spark SQL 
(http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)?
It supports HiveQL, can read from the Hive metastore, and does not require 
Hadoop.
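
For reference, a minimal Spark 1.x Scala sketch of that approach (the table name and query are hypothetical; table metadata comes from the metastore configured in hive-site.xml, or a local metastore is created if none is configured):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveQLExample"))
val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src").collect().foreach(println)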

On Wed, Jan 7, 2015 at 8:27 AM, jamborta 
jambo...@gmail.com wrote:
Hi all,

We have been building a system where we rely heavily on Hive queries
executed through Spark to load and manipulate data, running on CDH and YARN.
I have been trying to explore lighter setups where we would not have to
maintain a Hadoop cluster, and just run the system on Spark only.

Is it possible to run Spark standalone and set up Hive alongside, without
the Hadoop cluster? If not, any suggestion on how we can replicate the
convenience of Hive tables (and Hive SQL) without Hive?

thanks,





spark worker nodes getting disassociated while running hive on spark

2015-01-04 Thread Somnath Pandeya
Hi,

I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark 
by following the link below.

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

I got the latest build of Hive on Spark from git and was trying to run a few 
queries. Queries run fine for some time, and after that I am getting the 
following errors.

Error on master node
15/01/05 12:16:59 INFO actor.LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40xx.xx.xx.xx%3A34823-1#1101564287]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
15/01/05 12:16:59 INFO master.Master: akka.tcp://sparkWorker@machinename:58392 
got disassociated, removing it.
15/01/05 12:16:59 INFO master.Master: Removing worker 
worker-20150105120340-machine-58392 on 
indhyhdppocap03.infosys-platforms.com:58392

Error on slave node

15/01/05 12:20:21 INFO transport.ProtocolStateActor: No response from remote. 
Handshake timed out or transport failure detector triggered.
15/01/05 12:20:21 INFO actor.LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40machineName%3A7077-1#-1301148631]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
15/01/05 12:20:21 INFO worker.Worker: Disassociated 
[akka.tcp://sparkWorker@machineName:58392] - 
[akka.tcp://sparkMaster@machineName:7077] Disassociated !
15/01/05 12:20:21 ERROR worker.Worker: Connection to master failed! Waiting for 
master to reconnect...
15/01/05 12:20:21 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://sparkMaster@machineName:7077] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].


Please Help

-Somnath



RE: unable to do group by with 1st column

2014-12-25 Thread Somnath Pandeya
Hi,
You can also try reduceByKey,
something like this:
JavaPairRDD<String, String> ones = lines
        .mapToPair(new PairFunction<String, String, String>() {
            @Override
            public Tuple2<String, String> call(String s) {
                // key each line by its first column
                String[] temp = s.split(",");
                return new Tuple2<String, String>(temp[0], temp[1]);
            }
        });

JavaPairRDD<String, String> counts = ones
        .reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String i1, String i2) {
                // concatenate the values that share the same key
                return i1 + "," + i2;
            }
        });
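
For example, with input lines such as "1,a", "1,b", and "2,c", this keys each 
line by its first column and produces ("1", "a,b") and ("2", "c").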

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, December 26, 2014 6:35 AM
To: Amit Behera
Cc: u...@spark.incubator.apache.org
Subject: Re: unable to do group by with 1st column

Hi,

On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera 
amit.bd...@gmail.com wrote:
How can I do it? Please help me to do.

Have you considered using groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html#transformations

Tobias
