How does predict() work in LogisticRegressionModel?
Hi all, Can somebody point me to the implementation of predict() in the LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()? Thanks & Regards, Meethu M
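[Editor's note: in the Scala API, predict() is inherited from the parent class GeneralizedLinearModel, which maps predictPoint() over the input RDD; the PySpark model applies the same per-record logic directly. A minimal sketch of that logic for the binary case — a sketch, not the actual MLlib source; the default threshold of 0.5 is assumed:]

{code}
import math

def predict_point(features, weights, intercept, threshold=0.5):
    # margin = w . x + b, squashed through the logistic function
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1 if probability > threshold else 0
{code}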
Re: Please reply if you use Mesos fine grained mode
Hi, We are using Mesos fine-grained mode because multiple instances of Spark can share machines and each application gets resources allocated dynamically. Thanks & Regards, Meethu M On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote: If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.
Re: Best way to merge final output part files created by Spark job
Try coalesce(1) before writing. Thanks & Regards, Meethu M On Tuesday, 15 September 2015 6:49 AM, java8964 wrote: For a text file this merge works fine, but for binary formats like ORC, Parquet or Avro, I am not sure this will work. These formats are in fact not appendable, as they write their detailed metadata either in the header or the tail of the file. You have to use the format-specific API to merge the data. Yong Date: Mon, 14 Sep 2015 09:10:33 +0200 Subject: Re: Best way to merge final output part files created by Spark job From: gmu...@stratio.com To: umesh.ka...@gmail.com CC: user@spark.apache.org Hi, check out the FileUtil.copyMerge function in the Hadoop API. It's simple: - Get the Hadoop configuration from the Spark context: FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration()); - Create new Paths with the destination and source directory. - Call copyMerge: FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null); 2015-09-13 23:25 GMT+02:00 unk1102: Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories, so I need to merge these small part files. I am using spark.sql.shuffle.partitions = 500 and my final small files are ORC files. Is there a way to merge ORC files in Spark? If not, please suggest the best way to merge files created by a Spark job in HDFS. Thanks much. -- Gaspar Muñoz @gmunozsoria
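[Editor's note: a sketch of driving FileUtil.copyMerge from PySpark via the internal py4j gateway that PySpark exposes on the SparkContext. The paths are hypothetical, sc._jsc/sc._jvm are internal APIs, and this assumes a Hadoop version that still ships copyMerge:]

{code}
from pyspark import SparkContext

sc = SparkContext(appName="merge-parts")
hconf = sc._jsc.hadoopConfiguration()            # internal API, Spark 1.x
Path = sc._jvm.org.apache.hadoop.fs.Path
FileUtil = sc._jvm.org.apache.hadoop.fs.FileUtil
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hconf)

# args: (src fs, src dir, dst fs, dst file, delete source, conf, header string)
FileUtil.copyMerge(fs, Path("/out/parts"), fs, Path("/out/merged.txt"),
                   True, hconf, None)
{code}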
Re: make-distribution.sh failing at spark/R/lib/sparkr.zip
Hi, It worked after removing that line. Thank you for the response and the fix. Thanks & Regards, Meethu M On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote: For the record: https://github.com/apache/spark/pull/8147 https://issues.apache.org/jira/browse/SPARK-9916 On Wed, Aug 12, 2015 at 3:08 PM, Burak Yavuz brk...@gmail.com wrote: Are you running from master? Could you delete line 222 of make-distribution.sh? We updated how we build sparkr.zip. I'll submit a fix for it for 1.5 and master. Burak On Wed, Aug 12, 2015 at 3:31 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am trying to create a package using the make-distribution.sh script from the GitHub master branch, but it's not completing successfully. The last statement printed is + cp /home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip /home/meethu/git/FlytxtRnD/spark/dist/R/lib cp: cannot stat `/home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip': No such file or directory My build succeeds, and I am executing the following command: ./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive Please help. Thanks & Regards, Meethu M
Re: Combining Spark Files with saveAsTextFile
Hi, Try using coalesce(1) before calling saveAsTextFile(). Thanks & Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White bwwintheho...@gmail.com wrote: What is the best way to make saveAsTextFile save as only a single file?
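[Editor's note: a minimal sketch (output path hypothetical). Note that coalescing to one partition funnels all the data through a single task, so it only makes sense when the result comfortably fits on one worker:]

{code}
rdd.coalesce(1).saveAsTextFile("hdfs:///out/single")
{code}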
RE: Building scaladoc using build/sbt unidoc failure
Hi, I am getting the same assertion error when trying to run build/sbt unidoc as you described in "Building scaladoc using build/sbt unidoc failure" ("Hello, I am trying to build scaladoc from the 1.4 branch"). Could you tell me how you got it working? Thanks & Regards, Meethu M
Re: How to create fewer output files for Spark job ?
Try using coalesce. Thanks & Regards, Meethu M On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am running a series of Spark functions with 9000 executors and it's resulting in 9000+ files, which is exceeding the namespace file-count quota. How can Spark be configured to use CombinedOutputFormat?
{code}
protected def writeOutputRecords(
    detailRecords: RDD[(AvroKey[DetailOutputRecord], NullWritable)],
    outputDir: String) {
  val writeJob = new Job()
  val schema = SchemaUtil.outputSchema(_detail)
  AvroJob.setOutputKeySchema(writeJob, schema)
  detailRecords.saveAsNewAPIHadoopFile(outputDir,
    classOf[AvroKey[GenericRecord]],
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    writeJob.getConfiguration)
}
{code}
-- Deepak
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Davies, Thank you for pointing me to Spark Streaming. I am confused about how to return the result after running a function via a thread. I tried using a Queue to add the results to and printed it at the end, but there I can only see the results after all threads are finished. How can I get the result of a function once its thread is finished, rather than waiting for all the other threads to finish? Thanks & Regards, Meethu M On Tuesday, 19 May 2015 2:43 AM, Davies Liu dav...@databricks.com wrote: SparkContext can be used in multiple threads (Spark Streaming works with multiple threads), for example:
{code}
import threading
import time

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
{code}
On Mon, May 18, 2015 at 12:34 AM, ayan guha guha.a...@gmail.com wrote: Hi So to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the Python thread module can't be used? Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, The Python wrapper for Spark Job Server did not help me. I actually need a PySpark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the Spark Job Server? Someone wrote a Python wrapper around it; give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SparkContext using separate threads in PySpark? I found some examples in Scala and Java, but couldn't find Python code. Can anyone help me with a PySpark example? Thanks & Regards, Meethu M
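[Editor's note: a sketch of one way to get each job's result as soon as its thread finishes — have every worker thread push its result onto a shared Queue and consume the queue in the main thread; Queue.get() blocks until the next finished job publishes. Assumes Python 2, as in the thread above, and an existing SparkContext sc:]

{code}
import threading
from Queue import Queue

results = Queue()

def job(n):
    total = sc.parallelize(range(n)).sum()
    results.put((n, total))              # published the moment this job ends

threads = [threading.Thread(target=job, args=(n,)) for n in (100, 200, 300)]
for t in threads:
    t.start()
for _ in threads:
    print results.get()                  # yields results in completion order
{code}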
Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi Akhil, The Python wrapper for Spark Job Server did not help me. I actually need a PySpark code sample which shows how I can call a function from 2 threads and execute it simultaneously. Thanks & Regards, Meethu M On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at the Spark Job Server? Someone wrote a Python wrapper around it; give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, Quote: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SparkContext using separate threads in PySpark? I found some examples in Scala and Java, but couldn't find Python code. Can anyone help me with a PySpark example? Thanks & Regards, Meethu M
Re: Restricting the number of iterations in Mllib Kmeans
Hi, I think you can't supply an initial set of centroids to KMeans. Thanks & Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar suman.somasun...@oracle.com wrote: Hi, I want to run a definite number of iterations in KMeans. There is a command-line argument to set maxIterations, but even if I set it to a number, KMeans runs until the centroids converge. Is there a specific way to specify it on the command line? Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random? Thanks, Suman.
How to run multiple jobs in one sparkcontext from separate threads in pyspark?
Hi all, Quote: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads." How to run multiple jobs in one SparkContext using separate threads in PySpark? I found some examples in Scala and Java, but couldn't find Python code. Can anyone help me with a PySpark example? Thanks & Regards, Meethu M
Spark-1.3.0 UI shows 0 cores in completed applications tab
Hi all, I started spark-shell in Spark 1.3.0 and performed some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application moved to the completed applications tab and the number of cores was 0. When I instead exited the spark-shell using sc.stop(), it correctly showed 8 cores under the completed applications tab. Why does it show 0 cores when I don't use sc.stop()? Does anyone else face this issue? Thanks & Regards, Meethu M
How to build Spark and run examples using IntelliJ?
Hi, I am trying to run the examples of Spark (master branch from git) from IntelliJ (14.0.2) but am facing errors. These are the steps I followed:
1. git clone the master branch of Apache Spark.
2. Build it using mvn -DskipTests clean install.
3. In IntelliJ, select Import Project and choose the POM.xml of the Spark root folder (Auto Import enabled).
4. Then I tried to run the SparkPi program, but I get the following errors:
Information: 9/3/15 3:46 PM - Compilation completed with 44 errors and 0 warnings in 5 sec
/usr/local/spark-1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
Error:(314, 109) polymorphic expression cannot be instantiated to expected type;
found: [T(in method apply)] org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
I am able to run the examples of this built version of Spark from the terminal using the ./bin/run-example script. Could someone please help me with this issue? Thanks & Regards, Meethu M
How to read from hdfs using spark-shell in Intel hadoop?
Hi, I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version 1.0.3) from spark-shell (Spark version 1.2.1). I built Spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read an HDFS file using sc.textFile(); the exception is: WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to deadNodes and continue java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264 remote=/10.88.6.133:50010] The same problem is asked in this mail: "RE: Spark is unable to read from HDFS". As suggested in that mail: "In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name spark-core resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from Apache instead of Intel. You can set up your own local or remote repository that you specify." Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can anybody please elaborate on how to specify that sbt should fetch hadoop-core from Intel, which is in our internal repository? Thanks & Regards, Meethu M
Re: Mllib Error
Hi, Try this: change spark-mllib to spark-mllib_2.10:
{code}
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10"  % "1.1.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.1"
)
{code}
Thanks & Regards, Meethu M On Friday, 12 December 2014 12:22 PM, amin mohebbi aminn_...@yahoo.com.INVALID wrote: I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: "Object Mllib is not a member of package org.apache.spark". Then I realized that I have to add MLlib as a dependency as follows: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.1.0", "org.apache.spark" %% "spark-mllib" % "1.1.0" ) But here I got an error that says: unresolved dependency spark-core_2.10.4;1.1.1: not found. So I had to modify it to "org.apache.spark" % "spark-core_2.10" % "1.1.1". But there is still an error that says: unresolved dependency spark-mllib;1.1.1: not found. Anyone knows how to add the MLlib dependency in the .sbt file? Best Regards, Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia
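[Editor's note: for what it's worth, sbt's %% operator appends the Scala binary version to the artifact name automatically, so the following is equivalent to the explicit _2.10 form above, assuming scalaVersion is set to a 2.10.x release:]

{code}
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1",
  "org.apache.spark" %% "spark-mllib" % "1.1.1"
)
{code}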
Re: How to incrementally compile spark examples using mvn
Hi all, I made some code changes in the mllib project and, as mentioned in the previous mails, I did mvn install -pl mllib. Now when I run a program in examples using run-example, the new code is not executing; the previous code itself is running. But if I do an mvn install on the entire Spark project, I can see the new code running. Installing the entire Spark project takes a lot of time, so it's difficult to do this each time I make a change. Can someone tell me how to compile mllib alone and get the changes working? Thanks & Regards, Meethu M On Friday, 28 November 2014 2:39 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I have a similar problem. I modified the code in mllib and examples. I did mvn install -pl mllib and mvn install -pl examples. But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project? Thanks & Regards, Meethu M On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Thank you, Marcelo and Sean, mvn install is a good answer for my demands. -----Original Message----- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: 21 November 2014 1:47 To: yiming zhang Cc: Sean Owen; user@spark.apache.org Subject: Re: How to incrementally compile spark examples using mvn Hi Yiming, On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the spark project as a whole, how should I configure it so that I can incrementally build (only) the spark-examples subproject without the need of downloading or installation? As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it. You just need to remember to rebuild the assembly/ project when modifying Spark code (or the examples/ project when modifying examples). -- Marcelo
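[Editor's note: following Marcelo's point above, the likely missing step is rebuilding the assembly jar that run-example actually puts on the classpath; a sketch of the incremental sequence, with module names as in the 1.x source tree:]

{code}
mvn install -pl mllib       # rebuild just the changed module
mvn install -pl assembly    # rebuild the assembly jar that run-example uses
{code}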
Re: How to incrementally compile spark examples using mvn
Hi, I have a similar problem. I modified the code in mllib and examples. I did mvn install -pl mllib and mvn install -pl examples. But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project? Thanks & Regards, Meethu M On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Thank you, Marcelo and Sean, mvn install is a good answer for my demands. -----Original Message----- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: 21 November 2014 1:47 To: yiming zhang Cc: Sean Owen; user@spark.apache.org Subject: Re: How to incrementally compile spark examples using mvn Hi Yiming, On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the spark project as a whole, how should I configure it so that I can incrementally build (only) the spark-examples subproject without the need of downloading or installation? As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it. You just need to remember to rebuild the assembly/ project when modifying Spark code (or the examples/ project when modifying examples). -- Marcelo
Re: ISpark class not found
Hi, I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error: ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb How did you start the notebook? Thanks & Regards, Meethu M On Wednesday, 12 November 2014 6:50 AM, Laird, Benjamin benjamin.la...@capitalone.com wrote: I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark). Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFound exception. This does work correctly in spark-shell. I was curious if anyone has used ISpark and has encountered this issue. Thanks! Simple example: In [1]: case class Circle(rad: Float) In [2]: val rdd = sc.parallelize(1 to 1).map(i => Circle(i.toFloat)).take(10) 14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: Unable to find class: [L$line5.$read$$iwC$$iwC$Circle; Full trace in my gist: https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14
Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?
Hi, This question was asked earlier and I did it in the way specified, but I am getting java.lang.ClassNotFoundException. Can somebody explain all the steps required to build a Spark app using IntelliJ (latest version), starting from creating the project to running it? I searched a lot but couldn't find appropriate documentation. The earlier answer ("Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?") said: Don't try to use spark-core as an archetype. Instead just create a plain Scala project (no archetype) and add a Maven dependency on spark-core. That should be all you need. Thanks & Regards, Meethu M
Re: Relation between worker memory and executor memory in standalone mode
Try setting --total-executor-cores to limit how many total cores it can use. Thanks & Regards, Meethu M On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote: I guess one way to do so would be to run 1 worker per node; say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then you get 4 executors with 2 cores each. On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas vcsub...@gmail.com wrote: I have not found a way to control the cores yet. This effectively limits the cluster to a single application at a time. A subsequent application shows in the WAITING state on the dashboard. On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya aara...@gmail.com wrote: On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote: On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote: 1. Worker memory caps executor memory. 2. With the default config, every job gets one executor per worker. This executor runs with all cores available to the worker. By "the job" do you mean one SparkContext or one stage execution within a program? Does that also mean that two concurrent jobs will get one executor each at the same time? Experimenting with this some more, I figured out that an executor takes away spark.executor.memory amount of memory from the configured worker memory. It also takes up all the cores, so even if there is still some memory left, there are no cores left for starting another executor. Is my assessment correct? Is there no way to configure the number of cores that an executor can use? On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya aara...@gmail.com wrote: Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently, or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?
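[Editor's note: a sketch of the configuration equivalent of that flag — in standalone mode, spark.cores.max caps the total cores an application may take across the cluster, which lets several applications share it (value illustrative):]

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("capped-app")
        .set("spark.cores.max", "8"))   # same effect as --total-executor-cores 8
sc = SparkContext(conf=conf)
{code}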
Same code --works in spark 1.0.2-- but not in spark 1.1.0
Hi all, My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0 it's throwing exceptions and tasks are failing. The code contains some map and filter transformations followed by groupByKey (reduceByKey in another version of the code). What I could find out is that the code works fine until groupByKey or reduceByKey in both versions, but after that the following errors show up in Spark 1.1.0:
java.io.FileNotFoundException: /tmp/spark-local-20141006173014-4178/35/shuffle_6_0_5161 (Too many open files)
    java.io.FileOutputStream.openAppend(Native Method)
    java.io.FileOutputStream.<init>(FileOutputStream.java:210)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:701)
I cleaned my /tmp directory and changed my local directory to another folder, but nothing helped. Can anyone say what the reason could be? Thanks & Regards, Meethu M
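[Editor's note: "Too many open files" during a shuffle usually means the OS file-descriptor limit was hit by the hash shuffle, which opens roughly one file per map task per reducer. Two hedged mitigations for the 1.1 era — both settings existed at the time, but verify against your version:]

{code}
from pyspark import SparkConf

conf = (SparkConf()
        # sort-based shuffle (new in 1.1) opens far fewer files per task
        .set("spark.shuffle.manager", "sort")
        # alternatively, keep the hash shuffle but consolidate its files
        .set("spark.shuffle.consolidateFiles", "true"))
{code}

Raising the OS limit on the worker machines (e.g. ulimit -n) also helps.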
Python version of kmeans
Hi all, I need the k-means code written against PySpark for some testing purposes. Can somebody tell me the difference between these two files? spark-1.0.1/examples/src/main/python/kmeans.py and spark-1.0.1/python/pyspark/mllib/clustering.py Thanks & Regards, Meethu M
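[Editor's note: for context, examples/src/main/python/kmeans.py is a naive, self-contained demo meant to illustrate the API, while pyspark/mllib/clustering.py is the MLlib entry point that delegates to the optimized Scala implementation. A minimal usage sketch of the MLlib version — data values made up, attribute names as of the 1.0 API, and an existing SparkContext sc assumed:]

{code}
from numpy import array
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([array([0.0, 0.0]), array([1.0, 1.0]),
                       array([9.0, 8.0]), array([8.0, 9.0])])
model = KMeans.train(data, 2, maxIterations=10)
print(model.centers)
{code}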
Re: how to specify columns in groupby
Thank you Yanbo for the reply. I have another query related to cogroup: I want to iterate over the results of a cogroup operation. My code is:
{code}
grp = RDD1.cogroup(RDD2)
map((lambda (x, y): (x, list(y[0]), list(y[1]))), list(grp))
{code}
My result looks like: [((u'764', u'20140826'), [0.70146274566650391], []), ((u'863', u'20140826'), [0.368011474609375], []), ((u'9571520', u'20140826'), [0.0046129226684570312], [0.60009])] When I do one more cogroup operation like grp1 = grp.cogroup(RDD3), I am not able to see the results. All my RDDs are of the form ((x, y), z). Can somebody help me to solve this? Thanks & Regards, Meethu M On Thursday, 28 August 2014 5:59 PM, Yanbo Liang yanboha...@gmail.com wrote: For your reference:
{code}
val d1 = textFile.map(line => {
  val fields = line.split(",")
  ((fields(0), fields(1)), fields(2).toDouble)
})
val d2 = d1.reduceByKey(_ + _)
d2.foreach(println)
{code}
2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have an RDD which has values in the format "id,date,cost". I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks & Regards, Meethu M
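[Editor's note: a PySpark sketch of the same aggregation, plus one way to inspect cogroup output — the grouped values come back as iterables (lists in older releases), so convert them with mapValues before printing; lines and other are assumed RDDs:]

{code}
# sum cost per (id, date) key
pairs = lines.map(lambda line: line.split(",")) \
             .map(lambda f: ((f[0], f[1]), float(f[2])))
sums = pairs.reduceByKey(lambda a, b: a + b)

# cogroup yields (key, (iterable, iterable)); materialise before inspecting
grp = sums.cogroup(other).mapValues(lambda v: (list(v[0]), list(v[1])))
print(grp.take(3))
{code}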
how to specify columns in groupby
Hi all, I have an RDD which has values in the format "id,date,cost". I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks & Regards, Meethu M
Re: Losing Executors on cluster with RDDs of 100GB
Hi, Please try changing the worker memory so that worker memory > executor memory. Thanks & Regards, Meethu M On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi all, I have a Spark cluster of 30 machines, 16GB / 8 cores each, running in standalone mode. Previously my application was working well (several RDDs, the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G), my app is losing executors. I'm currently just loading them from a database, repartitioning and persisting to disk (with replication x2). I have spark.executor.memory=9G, memoryFraction=0.5, spark.worker.timeout=120, spark.akka.askTimeout=30, spark.storage.blockManagerHeartBeatMs=3. I haven't changed the default worker memory, so it's at 512m (should this be larger?). I've been getting the following messages from my app: [error] o.a.s.s.TaskSchedulerImpl - Lost executor 3 on myserver1: worker lost [error] o.a.s.s.TaskSchedulerImpl - Lost executor 13 on myserver2: Unknown executor exit code (137) (died from signal 9?) [error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] - [akka.tcp://sparkExecutor@myserver2:32955]: Error [Association failed with [akka.tcp://sparkExecutor@myserver2:32955]] [akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@myserver2.com:32955] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver2/198.18.102.160:32955] [error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] - [akka.tcp://spark@myserver1:53855]: Error [Association failed with [akka.tcp://spark@myserver1:53855]] [akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@myserver1:53855] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver1/198.18.102.160:53855] The worker logs and executor logs do not contain errors. Any ideas what the problem is? Yadid
Re: OutOfMemory Error
Hi, How do I increase the heap size? What is the difference between Spark executor memory and heap size? Thanks & Regards, Meethu M On Monday, 18 August 2014 12:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I believe spark.shuffle.memoryFraction is the one you are looking for. spark.shuffle.memoryFraction: Fraction of Java heap to use for aggregation and cogroups during shuffles, if spark.shuffle.spill is true. At any given time, the collective size of all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will begin to spill to disk. If spills are often, consider increasing this value at the expense of spark.storage.memoryFraction. You can give it a try. Thanks Best Regards On Mon, Aug 18, 2014 at 12:21 PM, Ghousia ghousia.ath...@gmail.com wrote: Thanks for the answer Akhil. We are right now getting rid of this issue by increasing the number of partitions, and we are persisting RDDs to DISK_ONLY. But the issue is with heavy computations within an RDD. It would be better if we had the option of spilling the intermediate transformation results to local disk (only in case memory consumption is high). Do we have any such option available with Spark? If increasing the partitions is the only way, then one might end up with OutOfMemory errors when working with certain algorithms where the intermediate result is huge. On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Ghousia, You can try the following: 1. Increase the heap size. 2. Increase the number of partitions. 3. You could try persisting the RDD to use DISK_ONLY. Thanks Best Regards On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com wrote: Hi, I am trying to implement machine learning algorithms on Spark. I am working on a 3-node cluster, with each node having 5GB of memory. Whenever I am working with a slightly larger number of records, I end up with an OutOfMemory error. The problem is, even if the number of records is only slightly higher, the intermediate result from a transformation is huge and this results in an OutOfMemory error. To overcome this, we are partitioning the data such that each partition has only a few records. Is there any better way to fix this issue, something like spilling the intermediate data to local disk? Thanks, Ghousia.
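[Editor's note: to the question above — spark.executor.memory is what sets each executor JVM's heap (its -Xmx), so increasing the heap means raising that setting. A sketch, values illustrative:]

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")           # executor heap, i.e. -Xmx4g
        .set("spark.shuffle.memoryFraction", "0.4"))  # shuffle share of that heap
sc = SparkContext(conf=conf)
{code}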
Use of SPARK_DAEMON_JAVA_OPTS
Hi all, Sorry for taking up this topic again; I am still confused about it. I set SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops -Xmx8g". When I run my application, I get the following line in the logs: Spark Command: java -cp ::/usr/local/spark-1.0.1/conf:/usr/local/spark-1.0.1/assembly/target/scala-2.10/spark-assembly-1.0.1-hadoop1.2.1.jar -XX:MaxPermSize=128m -XX:+UseCompressedOops -Xmx8g -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 -Xmx is set twice: once from SPARK_DAEMON_JAVA_OPTS, and a second time from bin/spark-class (from SPARK_DAEMON_MEMORY or DEFAULT_MEM). I believe the second value is the one taken during execution, i.e. the one passed as SPARK_DAEMON_MEMORY or DEFAULT_MEM. So I would like to know what the purpose of SPARK_DAEMON_JAVA_OPTS is and how it differs from SPARK_DAEMON_MEMORY. Thanks & Regards, Meethu M
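[Editor's note: a conf/spark-env.sh sketch of the split the two variables are intended for — SPARK_DAEMON_MEMORY sizes the master/worker daemons' heap, while SPARK_DAEMON_JAVA_OPTS carries other JVM flags; when -Xmx appears twice on a HotSpot command line, the later one (here, the one from spark-class) wins:]

{code}
# conf/spark-env.sh
export SPARK_DAEMON_MEMORY=8g                            # daemon heap (-Xmx8g)
export SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops"   # non-heap JVM flags only
{code}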
Re: Error with spark-submit (formatting corrected)
Hi, Instead of spark://10.1.3.7:7077 use spark://vmsparkwin1:7077, i.e. try this: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://vmsparkwin1:7077 --executor-memory 1G --total-executor-cores 2 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10 Thanks & Regards, Meethu M On Friday, 18 July 2014 7:51 AM, Jay Vyas jayunit100.apa...@gmail.com wrote: I think I know what is happening to you. I've looked some into this just this week, so it's fresh in my brain :) hope this helps. When no workers are known to the master, IIRC, you get this message. I think this is how it works: 1) You start your master. 2) You start a slave, and give it the master URL as an argument. 3) The slave then binds to a random port. 4) The slave then does a handshake with the master, which you can see in the slave logs (it says something like "successfully connected to master at ..."). Actually, I think the master also logs that it is now aware of a slave running on ip:port. So in your case, I suspect, none of the slaves have connected to the master, and so the job sits idle. This is similar to the YARN scenario of submitting a job to a resource manager with no node managers running. On Jul 17, 2014, at 6:57 PM, ranjanp piyush_ran...@hotmail.com wrote: Hi, I am new to Spark and trying out a standalone, 3-node (1 master, 2 workers) cluster. From the Web UI at the master, I see that the workers are registered. But when I try running the SparkPi example from the master node, I get the following message and then an exception: 14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master spark://10.1.3.7:7077... 14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory I searched a bit for the above warning and found that others have encountered this problem before, but did not see a clear resolution except for this link: http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-that-woy-tt8247.html#a8444 Based on the suggestion there I tried supplying the --executor-memory option to spark-submit, but that did not help. Any suggestions? Here are the details of my setup: - 3 nodes (each with 4 CPU cores and 7 GB memory) - 1 node configured as master, and the other two configured as workers - Firewall is disabled on all nodes, and network communication between the nodes is not a problem - Edited conf/spark-env.sh on all nodes to set the following: SPARK_WORKER_CORES=3, SPARK_WORKER_MEMORY=5G - The Web UI as well as logs on the master show that the workers registered correctly.
Also, the Web UI correctly shows the aggregate available memory and CPU cores on the workers: URL: spark://vmsparkwin1:7077, Workers: 2, Cores: 6 Total, 0 Used, Memory: 10.0 GB Total, 0.0 B Used, Applications: 0 Running, 0 Completed, Drivers: 0 Running, 0 Completed, Status: ALIVE I tried running the SparkPi example first using run-example (which was failing) and later directly using spark-submit as shown below: $ export MASTER=spark://vmsparkwin1:7077 $ echo $MASTER spark://vmsparkwin1:7077 azureuser@vmsparkwin1 /cygdrive/c/opt/spark-1.0.0 $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.1.3.7:7077 --executor-memory 1G --total-executor-cores 2 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10 The following is the full screen output:
14/07/17 01:20:13 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/17 01:20:13 INFO SecurityManager: Changing view acls to: azureuser
14/07/17 01:20:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azureuser)
14/07/17 01:20:14 INFO Slf4jLogger: Slf4jLogger started
14/07/17 01:20:14 INFO Remoting: Starting remoting
14/07/17 01:20:14 INFO Remoting: Remoting started; listening on addresses: [akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
14/07/17 01:20:14 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
14/07/17 01:20:14 INFO SparkEnv: Registering MapOutputTracker
14/07/17 01:20:14 INFO SparkEnv: Registering BlockManagerMaster
14/07/17 01:20:14 INFO DiskBlockManager: Created local directory at C:\cygwin\tmp\spark-local-20140717012014-b606
14/07/17 01:20:14 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/17 01:20:14 INFO ConnectionManager: Bound socket to port 49842 with id = ConnectionManagerId(vmsparkwin1.cssparkwin.b1.internal.cloudapp.net,49842)
14/07/17 01:20:14 INFO BlockManagerMaster: Trying to register BlockManager
14/07/17 01:20:14 INFO BlockManagerInfo: Registering block manager
Re: PySpark shell not listed in the web UI while running
Hi Akhil, That fixed the problem. Thanks! Thanks & Regards, Meethu M On Thursday, 17 July 2014 2:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Meethu, Your application is running in local mode, and that's the reason why you are not seeing the driver app in the 8080 web UI. You can pass the master IP to your pyspark and get it running in cluster mode, e.g.: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark --master spark://master:7077 Replace master:7077 with the Spark URI that you see in the top left of the 8080 web UI. Thanks Best Regards On Thu, Jul 17, 2014 at 1:35 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I just upgraded to Spark 1.0.1. In Spark 1.0.0, when I started an IPython notebook using the following command, it used to appear in the running applications tab in the master:8080 web UI: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark But now when I run it, it's not listed under running applications (or completed applications once it's closed). I am, however, able to see the Spark stages at master:4040 while it's running. Anyone have any idea why this is? Thanks & Regards, Meethu M
Difference between collect() and take(n)
Hi all, I want to know how collect() works and how it differs from take(). I am just reading a file of 330MB which has 4.3 million (43 lakh) rows with 13 columns, and calling take(430) to save to a variable works, but the same does not work with collect(). So is there any difference in the operation of the two? Again, I wanted to set the Java heap size for my Spark program. I set it using spark.executor.extraJavaOptions in spark-defaults.conf. Now I want to set the same for the worker. Can I do that with SPARK_DAEMON_JAVA_OPTS? Is the following syntax correct? SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops -Xmx3g" Thanks & Regards, Meethu M
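[Editor's note: collect() ships the entire RDD back to the driver, so the driver heap must hold the whole dataset, while take(n) only pulls enough partitions to produce n rows — which is why take(430) succeeds where collect() on 4.3 million rows exhausts driver memory. A sketch, file path hypothetical:]

{code}
rows = sc.textFile("data.csv")
sample = rows.take(430)        # scans only a few partitions
everything = rows.collect()    # materialises ALL rows in the driver's heap
{code}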
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
The problem is resolved. I have set SPARK_LOCAL_IP on both slaves as well; when I changed this, my slaves started working. Thank you all for your suggestions. Thanks & Regards, Meethu M On Wednesday, 2 July 2014 10:22 AM, Aaron Davidson ilike...@gmail.com wrote: In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of that kind? This error suggests the worker is trying to bind a server on the master's IP, which clearly doesn't make sense On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077 LISTEN (after starting the master). I tried to execute the following script from the slaves manually, but it ends up with the same exception and log. This script internally executes the java command: /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077 In this case netstat is not showing any connection established to master:7077. When we manually execute the java command, the connection to the master is established. Thanks & Regards, Meethu M On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you sure you have this IP 192.168.125.174 bound for that machine? (netstat -na | grep 192.168.125.174) Thanks Best Regards On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I reinstalled Spark and rebooted the system, but still I am not able to start the workers. It's throwing the following exception: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I suspect the problem is with 192.168.125.174:0; even though the command contains master:7077, why does the log show 0? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Can somebody tell me a solution? Thanks & Regards, Meethu M On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, yes, I tried setting another port too, but the same problem. master is set in /etc/hosts. Thanks & Regards, Meethu M On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote: that's strange; did you try setting the master port to something else (use SPARK_MASTER_PORT)? Also you said you are able to start it from the java command line: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 What is the master IP specified here? Is it like you have an entry for master in /etc/hosts? Thanks Best Regards On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, I am running it in a LAN itself. The IP of the master is given correctly. Thanks & Regards, Meethu M On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: why is it binding to port 0? 192.168.125.174:0 :/ Check the IP address of that master machine (ifconfig); it looks like the IP address has been changed (hoping you are running these machines on a LAN) Thanks Best Regards On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, My Spark (standalone mode) was running fine till yesterday, but now I am getting the following exception when I run start-slaves.sh or start-all.sh: slave3: failed to launch org.apache.spark.deploy.worker.Worker: slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) slave3: at java.lang.Thread.run(Thread.java:662) The log files have the following lines: 14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser 14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser) 14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started 14/06/27 11:06:30 INFO Remoting: Starting remoting Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address ... I saw the same error reported before and have tried the following solutions: set the variable SPARK_LOCAL_IP, changed SPARK_MASTER_PORT to a different number, but nothing is working. When I try to start the worker from the respective machines using the following java command, it runs without any exception: java -cp ::/usr/local/spark
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
Hi, I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077 LISTEN (after starting the master). I tried to execute the following script from the slaves manually, but it ends up with the same exception and log. This script internally executes the java command: /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077 In this case netstat is not showing any connection established to master:7077. When we manually execute the java command, the connection to the master is established. Thanks & Regards, Meethu M On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you sure you have this IP 192.168.125.174 bound for that machine? (netstat -na | grep 192.168.125.174) Thanks Best Regards On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I reinstalled Spark and rebooted the system, but still I am not able to start the workers. It's throwing the following exception: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I suspect the problem is with 192.168.125.174:0; even though the command contains master:7077, why does the log show 0? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Can somebody tell me a solution? Thanks & Regards, Meethu M On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, yes, I tried setting another port too, but the same problem. master is set in /etc/hosts. Thanks & Regards, Meethu M On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote: that's strange; did you try setting the master port to something else (use SPARK_MASTER_PORT)? Also you said you are able to start it from the java command line: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 What is the master IP specified here? Is it like you have an entry for master in /etc/hosts? Thanks Best Regards On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, I am running it in a LAN itself. The IP of the master is given correctly. Thanks & Regards, Meethu M On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: why is it binding to port 0? 192.168.125.174:0 :/ Check the IP address of that master machine (ifconfig); it looks like the IP address has been changed (hoping you are running these machines on a LAN) Thanks Best Regards On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, My Spark (standalone mode) was running fine till yesterday, but now I am getting the following exception when I run start-slaves.sh or start-all.sh: slave3: failed to launch org.apache.spark.deploy.worker.Worker: slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) slave3: at java.lang.Thread.run(Thread.java:662) The log files have the following lines:
14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address ... I saw the same error reported before and have tried the following solutions: set the variable SPARK_LOCAL_IP, changed SPARK_MASTER_PORT to a different number, but nothing is working. When I try to start the worker from the respective machines using the following java command, it runs without any exception: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Somebody please give a solution. Thanks & Regards, Meethu M
Failed to launch Worker
Hi, I am using Spark standalone mode with one master and 2 slaves. I am not able to start the workers and connect them to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command, it runs without any exception: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Thanks & Regards, Meethu M
Re: Failed to launch Worker
Yes. Thanks & Regards, Meethu M On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this command working? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 Thanks Best Regards On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am using Spark standalone mode with one master and 2 slaves. I am not able to start the workers and connect them to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command, it runs without any exception: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Thanks & Regards, Meethu M
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
Hi all, I reinstalled Spark and rebooted the system, but still I am not able to start the workers. It's throwing the following exception: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I suspect the problem is with 192.168.125.174:0; even though the command contains master:7077, why does the log show 0? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Can somebody tell me a solution? Thanks & Regards, Meethu M On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, yes, I tried setting another port too, but the same problem. master is set in /etc/hosts. Thanks & Regards, Meethu M On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote: that's strange; did you try setting the master port to something else (use SPARK_MASTER_PORT)? Also you said you are able to start it from the java command line: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 What is the master IP specified here? Is it like you have an entry for master in /etc/hosts? Thanks Best Regards On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, I am running it in a LAN itself. The IP of the master is given correctly. Thanks & Regards, Meethu M On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: why is it binding to port 0? 192.168.125.174:0 :/ Check the IP address of that master machine (ifconfig); it looks like the IP address has been changed (hoping you are running these machines on a LAN) Thanks Best Regards On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, My Spark (standalone mode) was running fine till yesterday, but now I am getting the following exception when I run start-slaves.sh or start-all.sh: slave3: failed to launch org.apache.spark.deploy.worker.Worker: slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) slave3: at java.lang.Thread.run(Thread.java:662) The log files have the following lines: 14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser 14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser) 14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started 14/06/27 11:06:30 INFO Remoting: Starting remoting Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address ... I saw the same error reported before and have tried the following solutions: set the variable SPARK_LOCAL_IP, changed SPARK_MASTER_PORT to a different number, but nothing is working.
When I try to start the worker from the respective machines using the following java command, it runs without any exception: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Somebody please give a solution. Thanks & Regards, Meethu M
Re: How to control a spark application(executor) using memory amount per node?
Hi, Try setting driver-java-options with spark-submit, or set spark.executor.extraJavaOptions in spark-defaults.conf. Thanks & Regards, Meethu M On Monday, 30 June 2014 1:28 PM, hansen han...@neusoft.com wrote: Hi, When I send the following statements in spark-shell:
{code}
val file = sc.textFile("hdfs://nameservice1/user/study/spark/data/soc-LiveJournal1.txt")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
println(count.count())
{code}
it throws an exception:
14/06/30 15:50:53 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
    at java.io.ObjectOutputStream$HandleTable.growEntries(ObjectOutputStream.java:2346)
    at java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2275)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:176)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.foreach(ExternalAppendOnlyMap.scala:239)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Then I set the following configuration in spark-env.sh: export SPARK_EXECUTOR_MEMORY=1G It's still not OK. I found that when I start spark-shell, the console also prints the log: SparkDeploySchedulerBackend: Granted executor ID app-20140630144110-0002/0 on hostPort dlx8:7078 with 8 cores, 512.0 MB RAM. How do I increase the 512.0 MB RAM to more memory?
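[Editor's note: a sketch of the properties-file route suggested above — conf/spark-defaults.conf, values illustrative. For the quoted 512 MB executors, spark.executor.memory is the setting that actually sizes the executor heap:]

{code}
spark.executor.memory            2g
spark.executor.extraJavaOptions  -XX:+UseCompressedOops
{code}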
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
Hi Akhil, The IP is correct, and it is able to start the workers when we start them as a java command. It becomes 192.168.125.174:0 when we call it from the scripts. Thanks & Regards, Meethu M
On Friday, 27 June 2014 1:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
Why is it binding to port 0? 192.168.125.174:0 :/ Check the IP address of that master machine (ifconfig); it looks like the IP address has been changed (hoping you are running these machines on a LAN). Thanks Best Regards
On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi all, My Spark (standalone mode) was running fine till yesterday. But now I am getting the following exception when I run start-slaves.sh or start-all.sh:
slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3: at java.lang.Thread.run(Thread.java:662)
The log files have the following lines:
14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions: setting the variable SPARK_LOCAL_IP and changing SPARK_MASTER_PORT to a different number. But nothing is working.
When I try to start the worker from the respective machines using the following java command, it runs without any exception:
java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077
Somebody please give a solution. Thanks & Regards, Meethu M
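For anyone hitting the same bind error, a sketch of the conf/spark-env.sh entries discussed in this thread (the address is the one from the log and is only a placeholder; these are standard Spark environment variables, not a confirmed fix for this particular cluster):
# on the master and each worker
export SPARK_LOCAL_IP=192.168.125.174   # interface to bind to
export SPARK_MASTER_IP=192.168.125.174  # address the workers connect to
export SPARK_MASTER_PORT=7077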
Re: join operation is taking too much time
Hi, Thanks Andrew and Daniel for the response. Setting spark.shuffle.spill to false didn't make any difference: 5 days completed in 6 min, and 10 days was stuck after around 1 hr. Daniel, in my current use case I can't read all the files into a single RDD. But I have another use case where I did it that way, i.e. I read all the files into a single RDD and joined it with the RDD of 9 million rows, and it worked fine and took only 3 minutes. Thanks & Regards, Meethu M
On Wednesday, 18 June 2014 12:11 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote:
I've been wondering about this. Is there a difference in performance between these two?
val rdd1 = sc.textFile(files.mkString(","))
val rdd2 = sc.union(files.map(sc.textFile(_)))
I don't know about your use-case, Meethu, but it may be worth trying to see if reading all the files into one RDD (like rdd1) would perform better in the join. (If this is possible in your situation.)
On Tue, Jun 17, 2014 at 6:45 PM, Andrew Or and...@databricks.com wrote:
How long does it get stuck for? This is a common sign of the OS thrashing due to out-of-memory exceptions. If you keep it running longer, does it throw an error? Depending on how large your other RDD is (and your join operation), memory pressure may or may not be the problem at all. It could be that spilling your shuffles to disk is slowing you down (but probably shouldn't hang your application). For the 5 RDDs case, what happens if you set spark.shuffle.spill to false?
2014-06-17 5:59 GMT-07:00 MEETHU MATHEW meethu2...@yahoo.co.in:
Hi all, I want to do a recursive leftOuterJoin between an RDD (created from a file) with 9 million rows (the file is 100 MB) and 30 other RDDs (created from 30 different files in each iteration of a loop) varying from 1 to 6 million rows. When I run it for 5 RDDs, it runs successfully in 5 minutes. But when I increase it to 10 or 30 RDDs, it gradually slows down and finally gets stuck without showing any warning or error. I am running in standalone mode with 2 workers of 4 GB each and a total of 16 cores. Are any of you facing similar problems with JOIN, or is it a problem with my configuration? Thanks & Regards, Meethu M
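To make the single-RDD approach from the thread concrete, a minimal PySpark sketch (the paths and the comma-separated join key are hypothetical; sc.textFile accepts a comma-separated list of paths):
files = ["hdfs:///data/part_%d" % i for i in range(30)]  # hypothetical input files
# one RDD over all 30 files, keyed for the join,
# instead of 30 separate RDDs joined one by one in a loop
combined = sc.textFile(",".join(files)).map(lambda l: (l.split(",")[0], l))
big = sc.textFile("hdfs:///data/big_file").map(lambda l: (l.split(",")[0], l))
result = big.leftOuterJoin(combined)  # a single shuffle instead of 30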
options set in spark-env.sh is not reflecting on actual execution
Hi all, I have a doubt regarding the options in spark-env.sh. I set the following values in the file on the master and 2 workers:
SPARK_WORKER_MEMORY=7g
SPARK_EXECUTOR_MEMORY=6g
SPARK_DAEMON_JAVA_OPTS+="-Dspark.akka.timeout=30 -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80 -Dspark.shuffle.spill=false"
But SPARK_EXECUTOR_MEMORY is showing 4g in the web UI. Do I need to change it anywhere else to make it 4g and to reflect it in the web UI? A warning is coming that blockManagerHeartBeatMs is exceeding 45 while executing a process, even though I set it to 80. So I doubt whether it should be set as SPARK_MASTER_OPTS or SPARK_WORKER_OPTS; see the sketch below. Thanks & Regards, Meethu M
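A hedged sketch of the usual split in Spark 1.x (file locations and variable names are the standard ones from the Spark docs; values copied from the message above and purely illustrative). Application-level spark.* properties such as spark.shuffle.spill are read by the driver, so they generally belong in spark-defaults.conf or SparkConf rather than in the daemon options, while SPARK_MASTER_OPTS and SPARK_WORKER_OPTS configure only the master and worker daemons:
# conf/spark-env.sh -- options for the daemons themselves
export SPARK_WORKER_MEMORY=7g
export SPARK_WORKER_OPTS="-Dspark.akka.timeout=30"
# conf/spark-defaults.conf -- per-application properties, read at submit time
spark.executor.memory 6g
spark.shuffle.spill false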
join operation is taking too much time
Hi all, I want to do a recursive leftOuterJoin between an RDD (created from a file) with 9 million rows (the file is 100 MB) and 30 other RDDs (created from 30 different files in each iteration of a loop) varying from 1 to 6 million rows. When I run it for 5 RDDs, it runs successfully in 5 minutes. But when I increase it to 10 or 30 RDDs, it gradually slows down and finally gets stuck without showing any warning or error. I am running in standalone mode with 2 workers of 4 GB each and a total of 16 cores. Are any of you facing similar problems with JOIN, or is it a problem with my configuration? Thanks & Regards, Meethu M
Re: Wildcard support in input path
Hi Jianshi, I have used wildcard characters (*) in my program and it worked. My code was like this:
b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*")
Thanks & Regards, Meethu M
On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
It would be convenient if Spark's textFile, parquetFile, etc. could support paths with wildcards, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
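For the record, a couple of hedged PySpark examples of glob patterns that the underlying Hadoop FileSystem accepts (the paths are hypothetical):
rdd1 = sc.textFile("hdfs:///logs/table/month=2014*")         # '*' matches within a path segment
rdd2 = sc.textFile("hdfs:///logs/data_file_2013SEP0[1-5]*")  # character ranges also work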
ArrayIndexOutOfBoundsException when reading bzip2 files
Hi, I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in HDFS. I have come across the same issue in JIRA at https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved. I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it's still showing the error. What may be the possible reason that I am getting the same error again? I am using Spark 1.0.0 with Hadoop 1.2.1.
java.lang.ArrayIndexOutOfBoundsException: 90
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
Thanks & Regards, Meethu M
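For completeness, the workaround referenced above as it would appear in conf/spark-env.sh on each worker (a sketch; the cause tracked in SPARK-1861 was that Hadoop's CBZip2InputStream is not safe for concurrent use, so limiting each worker to one core avoids concurrent decompression of the same stream):
export SPARK_WORKER_CORES=1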
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Hi Akhil, Please find the code below.
from operator import add  # needed for reduce(add)
x = sc.textFile("hdfs:///**")
x = x.filter(lambda z: z.split(",")[0] != ' ')
x = x.filter(lambda z: z.split(",")[3] != ' ')
z = x.reduce(add)
Thanks & Regards, Meethu M
On Monday, 9 June 2014 5:52 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you paste the piece of code!? Thanks Best Regards
On Mon, Jun 9, 2014 at 5:24 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi, I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in HDFS. I have come across the same issue in JIRA at https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved. I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it's still showing the error. What may be the possible reason that I am getting the same error again? I am using Spark 1.0.0 with Hadoop 1.2.1.
java.lang.ArrayIndexOutOfBoundsException: 90
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
Thanks & Regards, Meethu M
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Hi Sean, Thank you for the fast response. Thanks & Regards, Meethu M
On Monday, 9 June 2014 6:04 PM, Sean Owen so...@cloudera.com wrote:
Have a search online / at the Spark JIRA. This was a known upstream bug in Hadoop. https://issues.apache.org/jira/browse/SPARK-1861
On Mon, Jun 9, 2014 at 7:54 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi, I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in HDFS. I have come across the same issue in JIRA at https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved. I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it's still showing the error. What may be the possible reason that I am getting the same error again? I am using Spark 1.0.0 with Hadoop 1.2.1.
java.lang.ArrayIndexOutOfBoundsException: 90
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
Thanks & Regards, Meethu M
How to stop a running SparkContext in the proper way?
Hi, I want to know how I can stop a running SparkContext in a proper way, so that the next time I start a new SparkContext the web UI can be launched on the same port, 4040. Now when I quit the job using Ctrl+Z, the new SparkContexts are launched on new ports. I have the same problem with the IPython notebook. It is launched on a different port when I start the notebook a second time after closing the first one. I am starting IPython using the command
IPYTHON_OPTS="notebook --ip --pylab inline" ./bin/pyspark
Thanks & Regards, Meethu M
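For reference, a minimal sketch of the clean shutdown path in PySpark (sc.stop() is the standard SparkContext API; Ctrl+Z only suspends the shell process, so the old JVM keeps port 4040 bound):
sc.stop()  # releases executors and frees the web UI port
exit()     # then leave the pyspark shell normally instead of suspending it
After a clean stop, the next SparkContext should be able to claim port 4040 again.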