RE: TF-IDF Question
Hi Franco,

org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

This is the sparse-vector representation of the document's terms:

1. The first value (1048576) is the length of the vector, i.e. the number of features. HashingTF uses 2^20 = 1,048,576 features by default, which is why it is the same number for every document.
2. [35587,884670] are the indices of the terms that occur in the document.
3. [3.458767233,3.458767233] are the TF-IDF values of those terms.

Thanks
Somnath

From: franco barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, June 04, 2015 11:17 PM
To: user@spark.apache.org
Subject: TF-IDF Question

Hi all,

I have a .txt file where each row is a collection of terms of a document separated by spaces. For example:

1 Hola spark
2 ..

I followed this example from the Spark site: https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

My reading of this is:

1. I don't know what the first parameter (1048576) is, but it is always the same number (maybe the number of terms).
2. The second parameter [35587,884670] - I think these are the terms of the first line in my .txt file.
3. The third parameter [3.458767233,3.458767233] - I think these are the TF-IDF values for my terms.

Does anyone know the exact interpretation of this? And, on the second point, if these values are the terms, how can I match them back to the original term values ([35587=Hola, 884670=spark])?

Regards and thanks in advance,

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes. Santiago, Chile.
(+562)-29699649 (+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
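On the follow-up question of mapping indices back to terms: HashingTF does not keep an index-to-term dictionary (each index is just the hash of a term), but a lookup can be rebuilt by re-hashing every distinct term with the same HashingTF. A minimal sketch, assuming the tokenized corpus is the documents: RDD[Seq[String]] built as in the linked feature-extraction example:

    import org.apache.spark.mllib.feature.HashingTF

    val hashingTF = new HashingTF()            // default numFeatures = 2^20 = 1048576
    val indexToTerms = documents
      .flatMap(identity)                       // every term occurrence in the corpus
      .distinct()
      .map(term => (hashingTF.indexOf(term), term))
      .groupByKey()                            // different terms can hash to the same index
    indexToTerms.lookup(35587)                 // should contain "Hola" if that is where it hashed

Because the mapping is a hash, two different terms can collide on one index, so the lookup returns a collection of candidate terms rather than a single word.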
RE: How to use Eclipse on Windows to build Spark environment?
You could try the Scala Eclipse plugin to "eclipsify" the Spark project and then import Spark as an Eclipse project.

-Somnath

-----Original Message-----
From: Nan Xiao [mailto:xiaonan830...@gmail.com]
Sent: Thursday, May 28, 2015 12:32 PM
To: user@spark.apache.org
Subject: How to use Eclipse on Windows to build Spark environment?

Hi all,

I want to use Eclipse on Windows to build a Spark development environment, but the reference page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup) doesn't contain any guidance about Eclipse. Could anyone point to tutorials or links on how to use Eclipse on Windows to set up a Spark build environment? Thanks in advance!

Best Regards
Nan Xiao
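For the "eclipsify" step suggested above, one possible route, sketched here as a suggestion rather than an official procedure (it assumes an sbt-based checkout and the sbteclipse plugin; the plugin version is illustrative):

    // project/plugins.sbt - enables the `eclipse` task that generates .project/.classpath files
    addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")

With the plugin added, running `sbt eclipse` in the Spark source root generates Eclipse project metadata, and the modules can then be imported via File > Import > Existing Projects into Workspace (with the Scala IDE plugin installed so the Scala sources compile).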
RE: save as text file throwing null pointer error.
Hi Akhil,

I am running my program standalone. I am getting the null pointer exception when I run the Spark program locally and try to save my RDD as a text file.

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, April 14, 2015 12:41 PM
To: Somnath Pandeya
Cc: user@spark.apache.org
Subject: Re: save as text file throwing null pointer error.

Where exactly is it throwing the null pointer exception? Are you starting your program from another program or something? It looks like you are invoking ProcessBuilder etc.

Thanks
Best Regards

On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya <somnath_pand...@infosys.com> wrote:

JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String call(String line) throws Exception {
                // strip stop words from the line
                return removeStopWords(line, stopwords);
            }
        });

lineswithoutStopWords.saveAsTextFile("output/testop.txt");

Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/04/09 18:44:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
save as text file throwing null pointer error.
JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String call(String line) throws Exception {
                // strip stop words from the line
                return removeStopWords(line, stopwords);
            }
        });

lineswithoutStopWords.saveAsTextFile("output/testop.txt");

Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/04/09 18:44:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/04/09 18:44:36 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/04/09 18:44:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/04/09 18:44:36 INFO TaskSchedulerImpl: Cancelling stage 1
15/04/09 18:44:36 INFO DAGScheduler: Job 1 failed: saveAsTextFile at TextPreProcessing.java:49, took 0.172959 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
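The frames above (RawLocalFileSystem.setPermission shelling out through org.apache.hadoop.util.Shell into ProcessBuilder) match the pattern commonly reported when a local job writes through Hadoop's local filesystem on Windows without the winutils.exe helper available. The platform is not stated in the thread, so this is only a hedged guess, but one thing worth ruling out is a missing Hadoop home. A minimal sketch of the usual workaround, with an illustrative path:

    // Hedged sketch: point Hadoop at a directory containing bin\winutils.exe
    // before the SparkContext is created. "C:\\hadoop" is an illustrative path.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

If the job runs on Linux, this does not apply and the permissions of the local output directory would be the next thing to check.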
how to find near duplicate items from given dataset using spark
Hi All,

I want to find near-duplicate items in a given dataset. For example, consider this data set:

1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps
3. Football,goalie,midfielder,goal
4. Football,refree,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 is different) and 3 and 4 are near duplicates (only field 2 is different).

This is what I did: I created an Article class and implemented the equals and hashCode methods (my hashCode method returns a constant (1) for all objects), and in Spark I use Article as a key and do a groupBy on the article. Is this approach correct, or is there a better approach? This is how my code looks.

Article class:

public class Article implements Serializable {
    private static final long serialVersionUID = 1L;
    private String first;
    private String second;
    private String third;
    private String fourth;

    public Article() {
        set("", "", "", "");
    }

    public Article(String first, String second, String third, String fourth) {
        set(first, second, third, fourth);
    }

    @Override
    public int hashCode() {
        int result = 1;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Article other = (Article) obj;
        if ((first.equals(other.first) || second.equals(other.second)
                || third.equals(other.third) || fourth.equals(other.fourth))) {
            return true;
        } else {
            return false;
        }
    }

    private void set(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }
}

Spark code:

public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile("data1/*");

    JavaRDD<Article> articles = lines.map(new Function<String, Article>() {
        private static final long serialVersionUID = 1L;

        public Article call(String line) throws Exception {
            String[] words = line.split(",");
            Article article = new Article(words[0], words[1], words[2], words[3]);
            return article;
        }
    });

    JavaPairRDD<Article, String> articlePair = lines
            .mapToPair(new PairFunction<String, Article, String>() {
                public Tuple2<Article, String> call(String line) throws Exception {
                    String[] words = line.split(",");
                    Article article = new Article(words[0], words[1], words[2], words[3]);
                    return new Tuple2<Article, String>(article, line);
                }
            });

    JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair.groupByKey();
    Map<Article, Iterable<String>> dupArticles = articlePairs.collectAsMap();
    System.out.println("size {} " + dupArticles.size());
    Set<Article> uniqueArticle = dupArticles.keySet();
    for (Article article : uniqueArticle) {
        Iterable<String> temps = dupArticles.get(article);
        System.out.println("keys " + article);
        for (String string : temps) {
            System.out.println(string);
        }
        System.out.println("==");
    }
    ctx.close();
    ctx.stop();
}
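On the "is there a better approach" question: one possible concern with a constant hashCode plus an OR-based equals is that every record lands in the same hash bucket and the equality relation is not transitive, so groupByKey results can depend on the order in which records are compared. If that turns out to be a problem, one alternative worth considering, sketched here in Scala purely as a suggestion (the input path is illustrative), is to emit one blocking key per "leave one field out" combination and group on those keys; two 4-field records that differ in exactly one field then share at least one key:

    // Hedged sketch: near-duplicate blocking by "leave one field out" keys.
    val lines = sc.textFile("data1/*")                       // illustrative input path
    val nearDuplicates = lines
      .map(_.split(","))
      .flatMap { fields =>
        fields.indices.map { i =>
          val key = fields.patch(i, Nil, 1).mkString(",")    // record with field i removed
          (key, fields.mkString(","))
        }
      }
      .groupByKey()
      .filter(_._2.size > 1)                                 // keep only groups containing near-duplicates
    nearDuplicates.collect().foreach(println)

This also groups exact duplicates, and it only catches records differing in a single field; it is not a substitute for a full similarity measure such as MinHash/LSH if fuzzier matching is needed.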
RE: used cores are less then total no. of core
Thanks Akhil, it was a simple fix, as you said. I had missed it. ☺

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, February 25, 2015 12:48 PM
To: Somnath Pandeya
Cc: user@spark.apache.org
Subject: Re: used cores are less then total no. of core

You can set the following in the conf while creating the SparkContext (if you are not using spark-submit):

.set("spark.cores.max", "32")

Thanks
Best Regards

On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya <somnath_pand...@infosys.com> wrote:

Hi All,

I am running a simple word count example on Spark (standalone cluster). In the UI it shows that each worker has 32 cores available, but while running the job only 5 cores are being used. What should I do to increase the number of cores used, or is that selected per job?

Thanks
Somnath
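For reference, a minimal sketch of where that setting goes when the application builds its own context (the app name and master URL below are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the total cores the application may claim on a standalone cluster.
    val conf = new SparkConf()
      .setAppName("WordCount")                     // illustrative app name
      .setMaster("spark://master-host:7077")       // illustrative standalone master URL
      .set("spark.cores.max", "32")
    val sc = new SparkContext(conf)

When launching with spark-submit instead, the equivalent is the --total-executor-cores flag or the same property in spark-defaults.conf.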
used cores are less then total no. of core
Hi All,

I am running a simple word count example on Spark (standalone cluster). In the UI it shows that each worker has 32 cores available, but while running the job only 5 cores are being used. What should I do to increase the number of cores used, or is that selected per job?

Thanks
Somnath
RE: skipping header from each file
Maybe you can use the wholeTextFiles method, which returns the file name and the content of each file as a pair RDD; you can then remove the first line from each file.

-Somnath

-----Original Message-----
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com]
Sent: Friday, January 09, 2015 11:48 AM
To: user@spark.apache.org
Subject: skipping header from each file

Suppose I give three file paths to the Spark context to read, and each file has its schema in the first row. How can we skip the schema lines from the headers?

val rdd = sc.textFile("file1,file2,file3")
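A minimal sketch of that wholeTextFiles suggestion (the file paths are illustrative, and this assumes each file is small enough to be read as a single string):

    // Read (path, content) pairs, then drop the first (header) line of every file.
    val rdd = sc.wholeTextFiles("file1,file2,file3")
      .flatMap { case (path, content) => content.split("\n").drop(1) }

For large files, an alternative is to read each file separately, drop its header with mapPartitionsWithIndex, and union the results, but the version above is the direct translation of the suggestion.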
RE: Spark with Hive cluster dependencies
You can also follow the link below; it works on a standalone Spark cluster.

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks
Somnath

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, January 08, 2015 2:21 AM
To: jamborta
Cc: user
Subject: Re: Spark with Hive cluster dependencies

Have you looked at Spark SQL (http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)? It supports HiveQL, can read from the Hive metastore, and does not require Hadoop.

On Wed, Jan 7, 2015 at 8:27 AM, jamborta <jambo...@gmail.com> wrote:

Hi all,

We have been building a system where we rely heavily on Hive queries executed through Spark to load and manipulate data, running on CDH and YARN. I have been trying to explore lighter setups where we would not have to maintain a Hadoop cluster and could run the system on Spark only.

Is it possible to run Spark standalone and set up Hive alongside, without the Hadoop cluster? If not, any suggestions on how we can replicate the convenience of Hive tables (and Hive SQL) without Hive?

thanks,
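For the Spark SQL route Michael mentions, the entry point in Spark 1.2 is HiveContext. A sketch, assuming a Spark build that includes Hive support (the table name and query are illustrative and follow the linked programming guide; without a hive-site.xml, a local metastore and warehouse directory are created automatically):

    import org.apache.spark.sql.hive.HiveContext

    // HiveQL over Spark SQL without a full Hadoop/Hive cluster.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("SELECT key, value FROM src").collect().foreach(println)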
spark worker nodes getting disassociated while running hive on spark
Hi,

I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark by following the link below.

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

I got the latest build of Hive on Spark from git and was trying to run a few queries. Queries run fine for some time, and after that I am getting the following errors.

Error on the master node:

15/01/05 12:16:59 INFO actor.LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40xx.xx.xx.xx%3A34823-1#1101564287] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/01/05 12:16:59 INFO master.Master: akka.tcp://sparkWorker@machinename:58392 got disassociated, removing it.
15/01/05 12:16:59 INFO master.Master: Removing worker worker-20150105120340-machine-58392 on indhyhdppocap03.infosys-platforms.com:58392

Error on the slave node:

15/01/05 12:20:21 INFO transport.ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
15/01/05 12:20:21 INFO actor.LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40machineName%3A7077-1#-1301148631] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/01/05 12:20:21 INFO worker.Worker: Disassociated [akka.tcp://sparkWorker@machineName:58392] -> [akka.tcp://sparkMaster@machineName:7077] Disassociated !
15/01/05 12:20:21 ERROR worker.Worker: Connection to master failed! Waiting for master to reconnect...
15/01/05 12:20:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@machineName:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Please help.

-Somnath
RE: unable to do group by with 1st column
Hi,

You can try reduceByKey also, something like this:

JavaPairRDD<String, String> ones = lines
        .mapToPair(new PairFunction<String, String, String>() {
            @Override
            public Tuple2<String, String> call(String s) {
                String[] temp = s.split(",");
                return new Tuple2<String, String>(temp[0], temp[1]);
            }
        });

JavaPairRDD<String, String> counts = ones
        .reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String i1, String i2) {
                return i1 + "," + i2;
            }
        });

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, December 26, 2014 6:35 AM
To: Amit Behera
Cc: u...@spark.incubator.apache.org
Subject: Re: unable to do group by with 1st column

Hi,

On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera <amit.bd...@gmail.com> wrote:

How can I do it? Please help me to do.

Have you considered using groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html#transformations

Tobias