How is the predict() working in LogisticRegressionModel?

2015-11-13 Thread MEETHU MATHEW
Hi all,
Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()?
Thanks & Regards,
Meethu M

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread MEETHU MATHEW
Hi,
We are using Mesos fine-grained mode because we can have multiple instances of Spark sharing machines, and each application gets resources allocated dynamically.
Thanks & Regards,
Meethu M


 On Wednesday, 4 November 2015 5:24 AM, Reynold Xin  
wrote:
   

 If you are using Spark with Mesos fine grained mode, can you please respond to 
this email explaining why you use it over the coarse grained mode?
Thanks.


  

Re: Best way to merge final output part files created by Spark job

2015-09-17 Thread MEETHU MATHEW
Try coalesce(1) before writing.
Thanks & Regards,
Meethu M
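For illustration, a minimal PySpark sketch of that suggestion (it assumes a live SparkContext sc, and the paths are placeholders):

# Collapse the data into a single partition before writing, so the job
# emits one part file instead of many (all records flow through one task).
rdd = sc.textFile("hdfs:///input/dir")                          # placeholder input path
rdd.coalesce(1).saveAsTextFile("hdfs:///output/single-part")    # placeholder output path

Coalescing to one partition funnels the whole dataset through a single task, so it only suits outputs that comfortably fit on one executor.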


 On Tuesday, 15 September 2015 6:49 AM, java8964  
wrote:
   

For text files this merge works fine, but for binary formats like ORC, Parquet or Avro I am not sure it will work.
These kinds of formats are in fact not appendable, as they write their metadata either at the head or at the tail of the file.
You have to use the format-specific API to merge the data.
Yong

Date: Mon, 14 Sep 2015 09:10:33 +0200
Subject: Re: Best way to merge final output part files created by Spark job
From: gmu...@stratio.com
To: umesh.ka...@gmail.com
CC: user@spark.apache.org

Hi, check out the FileUtil.copyMerge function in the Hadoop API.
It's simple:
   - Get the Hadoop configuration from the Spark context: FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
   - Create new Path objects for the destination and the source directory.
   - Call copyMerge: FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);
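The same call can be made from PySpark through the JVM gateway; a rough sketch, assuming a live SparkContext sc, Hadoop 1.x/2.x (where FileUtil.copyMerge still exists), and placeholder paths:

# Merge all part files under one directory into a single HDFS file.
hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)
src = hadoop.fs.Path("hdfs:///output/parts")         # directory of part-* files (placeholder)
dst = hadoop.fs.Path("hdfs:///output/merged.txt")    # single merged file (placeholder)
# copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
hadoop.fs.FileUtil.copyMerge(fs, src, fs, dst, True, conf, None)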

2015-09-13 23:25 GMT+02:00 unk1102 :

Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories, so I need to merge these small 500 part files. I am using spark.sql.shuffle.partitions = 500 and my final small files are ORC files. Is there a way to merge ORC files in Spark? If not, please suggest the best way to merge files created by a Spark job in HDFS. Thanks much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






-- 

Gaspar Muñoz 
@gmunozsoria

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid. Tel: +34 91 352 59 42 // @stratiobd

  

Re: make-distribution.sh failing at spark/R/lib/sparkr.zip

2015-08-13 Thread MEETHU MATHEW
Hi,
It worked after removing that line. Thank you for the response and the fix.
Thanks & Regards, Meethu M


 On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote:
   

 For the record: https://github.com/apache/spark/pull/8147
https://issues.apache.org/jira/browse/SPARK-9916

On Wed, Aug 12, 2015 at 3:08 PM, Burak Yavuz brk...@gmail.com wrote:

Are you running from master? Could you delete line 222 of make-distribution.sh? We updated when we build sparkr.zip. I'll submit a fix for it for 1.5 and master.
Burak
On Wed, Aug 12, 2015 at 3:31 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi, I am trying to create a package using the make-distribution.sh script from the GitHub master branch, but it is not completing successfully. The last statements printed are
+ cp /home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip /home/meethu/git/FlytxtRnD/spark/dist/R/lib
cp: cannot stat `/home/meethu/git/FlytxtRnD/spark/R/lib/sparkr.zip': No such file or directory
My build is successful and I am trying to execute the following command:
./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Please help.
Thanks  Regards, Meethu M





  

Re: Combining Spark Files with saveAsTextFile

2015-08-06 Thread MEETHU MATHEW
Hi,
Try using coalesce(1) before calling saveAsTextFile().
Thanks & Regards,
Meethu M


 On Wednesday, 5 August 2015 7:53 AM, Brandon White 
bwwintheho...@gmail.com wrote:
   

 What is the best way to make saveAsTextFile save as only a single file?

  

RE:Building scaladoc using build/sbt unidoc failure

2015-07-10 Thread MEETHU MATHEW
Hi,
I am getting the same assertion error when trying to run build/sbt unidoc that you described in "Building scaladoc using build/sbt unidoc failure". Could you tell me how you got it working?

Quoted thread ("Building scaladoc using build/sbt unidoc failure", on mail-archives.apache.org): "Hello, I am trying to build scala doc from the 1.4 branch."


 Thanks  Regards,
Meethu M

Re: How to create fewer output files for Spark job ?

2015-06-04 Thread MEETHU MATHEW
Try using coalesce.
Thanks & Regards,
Meethu M


 On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com 
wrote:
   

 I am running a series of Spark functions with 9000 executors and it is resulting in 9000+ files, which is exceeding the namespace file-count quota.
How can Spark be configured to use CombinedOutputFormat?
{code}
protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord], NullWritable)], outputDir: String) {
    val writeJob = new Job()
    val schema = SchemaUtil.outputSchema(_detail)
    AvroJob.setOutputKeySchema(writeJob, schema)
    detailRecords.saveAsNewAPIHadoopFile(outputDir,
      classOf[AvroKey[GenericRecord]],
      classOf[org.apache.hadoop.io.NullWritable],
      classOf[AvroKeyOutputFormat[GenericRecord]],
      writeJob.getConfiguration)
}
{code}

-- 
Deepak


  

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread MEETHU MATHEW
Hi Davies,
Thank you for pointing to Spark Streaming. I am confused about how to return the result after running a function via a thread. I tried using a Queue to add the results and print them at the end, but then I can only see the results after all the threads are finished. How can I get the result of a function as soon as its thread finishes, rather than waiting for all the other threads to finish?
Thanks & Regards,
Meethu M
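For illustration, a minimal sketch of one way to do this with concurrent.futures (standard library in Python 3, available as the futures backport on Python 2); it assumes a live SparkContext sc and made-up job sizes, and as_completed yields each result as soon as its thread finishes:

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_job(n):
    # Each call submits an independent Spark job from its own thread.
    return sc.parallelize(range(n)).map(lambda x: x * x).sum()

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_job, n) for n in (100, 1000, 10000)]
    for fut in as_completed(futures):   # results arrive per finished thread
        print(fut.result())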


 On Tuesday, 19 May 2015 2:43 AM, Davies Liu dav...@databricks.com wrote:
   

 SparkContext can be used in multiple threads (Spark streaming works
with multiple threads), for example:

import threading
import time

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()


On Mon, May 18, 2015 at 12:34 AM, ayan guha guha.a...@gmail.com wrote:
 Hi

 So to be clear, do you want to run one operation in multiple threads within
 a function, or do you want to run multiple jobs using multiple threads? I am
 wondering why the Python thread module can't be used. Or have you already given
 it a try?

 On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

 Hi Akhil,

 The python wrapper for Spark Job Server did not help me. I actually need
 the pyspark code sample  which shows how  I can call a function from 2
 threads and execute it simultaneously.

 Thanks  Regards,
 Meethu M



 On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:


 Did you happen to have a look at the Spark Job Server? Someone wrote a
 Python wrapper around it, give it a try.

 Thanks
 Best Regards

 On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:

 Hi all,

  Quote
  Inside a given Spark application (SparkContext instance), multiple
 parallel jobs can run simultaneously if they were submitted from separate
 threads. 

 How to run multiple jobs in one SPARKCONTEXT using separate threads in
 pyspark? I found some examples in scala and java, but couldn't find python
 code. Can anyone help me with a pyspark example?

 Thanks  Regards,
 Meethu M









  

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread MEETHU MATHEW
Hi Akhil, the Python wrapper for Spark Job Server did not help me. I actually need the pyspark code sample which shows how I can call a function from 2 threads and execute it simultaneously.
Thanks & Regards,
Meethu M


 On Thursday, 14 May 2015 12:38 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
   

 Did you happen to have a look at the Spark Job Server? Someone wrote a Python wrapper around it, give it a try.
Thanks
Best Regards
On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,
 Quote Inside a given Spark application (SparkContext instance), multiple 
parallel jobs can run simultaneously if they were submitted from separate 
threads.  
How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? 
I found some examples in scala and java, but couldn't find python code. Can 
anyone help me with a pyspark example? 
Thanks  Regards,
Meethu M



  

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread MEETHU MATHEW
Hi,
I think you can't supply an initial set of centroids to k-means.
Thanks & Regards,
Meethu M


 On Friday, 15 May 2015 12:37 AM, Suman Somasundar 
suman.somasun...@oracle.com wrote:
   

Hi,

I want to run a definite number of iterations in KMeans. There is a command-line argument to set maxIterations, but even if I set it to a number, KMeans runs until the centroids converge. Is there a specific way to specify it on the command line?
Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random?
Thanks,
Suman.
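For the maxIterations part, a minimal PySpark sketch for reference (it assumes a live SparkContext sc and toy data; MLlib's KMeans.train exposes a maxIterations argument, though training may still stop earlier if the centroids converge):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
# maxIterations caps the number of Lloyd iterations for the run.
model = KMeans.train(points, k=2, maxIterations=5)
print(model.clusterCenters)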

  

How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-13 Thread MEETHU MATHEW
Hi all,
 Quote Inside a given Spark application (SparkContext instance), multiple 
parallel jobs can run simultaneously if they were submitted from separate 
threads.  
How to run multiple jobs in one SPARKCONTEXT using separate threads in pyspark? 
I found some examples in scala and java, but couldn't find python code. Can 
anyone help me with a pyspark example? 
Thanks  Regards,
Meethu M

Spark-1.3.0 UI shows 0 cores in completed applications tab

2015-03-26 Thread MEETHU MATHEW
Hi all,
I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application was moved to the completed applications tab and the number of cores was 0. When I instead exited the spark-shell using sc.stop(), it correctly showed 8 cores under the completed applications tab. Why does it show 0 cores when I don't use sc.stop()? Does anyone else face this issue?
Thanks & Regards,
Meethu M

How to build Spark and run examples using Intellij ?

2015-03-09 Thread MEETHU MATHEW
Hi,
I am trying to run the examples of Spark (master branch from git) from IntelliJ (14.0.2) but I am facing errors. These are the steps I followed:
1. git clone the master branch of Apache Spark.
2. Build it using mvn -DskipTests clean install
3. In IntelliJ select Import Project and choose the pom.xml of the Spark root folder (Auto Import enabled).
4. Then I tried to run the SparkPi program but got the following errors:

Information: 9/3/15 3:46 PM - Compilation completed with 44 errors and 0 warnings in 5 sec
/usr/local/spark-1.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
Error:(314, 109) polymorphic expression cannot be instantiated to expected type;
 found   : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
 required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
  implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)

I am able to run the examples of this built version of Spark from the terminal using the ./bin/run-example script.
Could someone please help me with this issue?
Thanks  Regards,
Meethu M

How to read from hdfs using spark-shell in Intel hadoop?

2015-02-26 Thread MEETHU MATHEW
Hi,
I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version 1.0.3) from spark-shell (Spark version 1.2.1). I built Spark using the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and read an HDFS file using sc.textFile(); the exception is:

WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264 remote=/10.88.6.133:50010]

The same problem was asked in this mail:
RE: Spark is unable to read from HDFS ("Hi, Thanks for the reply. I've tried the below.", on mail-archives.us.apache.org)

As suggested in the above mail: "In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name spark-core resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from Apache instead of Intel. You can set up your own local or remote repository that you specify."
Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can anybody please elaborate on how to specify that SBT should fetch hadoop-core from Intel, which is in our internal repository?
Thanks  Regards,
Meethu M

Re: Mllib Error

2014-12-11 Thread MEETHU MATHEW
Hi,
Try this: change spark-mllib to spark-mllib_2.10.
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.1"
)
Thanks & Regards,
Meethu M

 On Friday, 12 December 2014 12:22 PM, amin mohebbi 
aminn_...@yahoo.com.INVALID wrote:
   

 I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: "Object Mllib is not a member of package org.apache.spark". Then I realized that I have to add MLlib as a dependency, as follows:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)
But here I got an error that says: unresolved dependency spark-core_2.10.4;1.1.1 : not found
So I had to modify it to "org.apache.spark" % "spark-core_2.10" % "1.1.1", but there is still an error that says: unresolved dependency spark-mllib;1.1.1 : not found
Does anyone know how to add the MLlib dependency in the .sbt file?
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

   

Re: How to incrementally compile spark examples using mvn

2014-12-04 Thread MEETHU MATHEW
Hi all,
I made some code changes in the mllib project and, as mentioned in the previous mails, I did
mvn install -pl mllib
Now when I run a program in examples using run-example, the new code is not executed; instead the previous code itself runs.
But if I do an mvn install on the entire Spark project, I can see the new code running. Installing the entire Spark project takes a lot of time, though, so it is difficult to do this each time I make some changes.
Can someone tell me how to compile mllib alone and get the changes working?
Thanks & Regards,
Meethu M

 On Friday, 28 November 2014 2:39 PM, MEETHU MATHEW 
meethu2...@yahoo.co.in wrote:
   

 Hi, I have a similar problem. I modified the code in mllib and examples. I did
mvn install -pl mllib
mvn install -pl examples
But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project?
Thanks & Regards,
Meethu M

 On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang 
sdi...@gmail.com wrote:
   

 Thank you, Marcelo and Sean, mvn install is a good answer for my demands. 

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: 21 November 2014 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

Hi Yiming,

On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
 Thank you for your reply. I was wondering whether there is a method of 
 reusing locally-built components without installing them? That is, if I have 
 successfully built the spark project as a whole, how should I configure it so 
 that I can incrementally build (only) the spark-examples sub project 
 without the need of downloading or installation?

As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it.

You just need to remember to rebuild the assembly/ project when modifying Spark 
code (or the examples/ project when modifying examples).


--
Marcelo






   

Re: How to incrementally compile spark examples using mvn

2014-11-28 Thread MEETHU MATHEW
Hi, I have a similar problem. I modified the code in mllib and examples. I did
mvn install -pl mllib
mvn install -pl examples
But when I run the program in examples using run-example, the older version of mllib (before the changes were made) is getting executed. How do I get the changes made in mllib while calling it from the examples project?
Thanks & Regards,
Meethu M

 On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang 
sdi...@gmail.com wrote:
   

 Thank you, Marcelo and Sean, mvn install is a good answer for my demands. 

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: 21 November 2014 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

Hi Yiming,

On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
 Thank you for your reply. I was wondering whether there is a method of 
 reusing locally-built components without installing them? That is, if I have 
 successfully built the spark project as a whole, how should I configure it so 
 that I can incrementally build (only) the spark-examples sub project 
 without the need of downloading or installation?

As Sean suggests, you shouldn't need to install anything. After mvn install, your local repo is a working Spark installation, and you can use spark-submit and other tools directly within it.

You just need to remember to rebuild the assembly/ project when modifying Spark 
code (or the examples/ project when modifying examples).


--
Marcelo




   

Re: ISpark class not found

2014-11-11 Thread MEETHU MATHEW
Hi,
I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error:
ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb
How did you start the notebook?
Thanks & Regards,
Meethu M

 On Wednesday, 12 November 2014 6:50 AM, Laird, Benjamin 
benjamin.la...@capitalone.com wrote:
   

 I've been experimenting with the ISpark extension to IScala 
(https://github.com/tribbloid/ISpark)
Objects created in the REPL are not being loaded correctly on worker nodes, 
leading to a ClassNotFound exception. This does work correctly in spark-shell.
I was curious if anyone has used ISpark and has encountered this issue. Thanks!

Simple example:
In [1]: case class Circle(rad:Float)
In [2]: val rdd = sc.parallelize(1 to 1).map(i => Circle(i.toFloat)).take(10)
14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: Unable to find class: [L$line5.$read$$iwC$$iwC$Circle;

Full trace in my gist: 
https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14


   

Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-11-10 Thread MEETHU MATHEW
Hi,
This question was asked earlier and I did it in the way specified, but I am getting java.lang.ClassNotFoundException.
Can somebody explain all the steps required to build a Spark app using IntelliJ (latest version), starting from creating the project to running it? I searched a lot but couldn't find appropriate documentation.

Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA? ("Don't try to use spark-core as an archetype. Instead just create a plain Scala project (no archetype) and add a Maven dependency on spark-core. That should be all you need.", on mail-archives.apache.org)

Thanks & Regards,
Meethu M

Re: Relation between worker memory and executor memory in standalone mode

2014-10-07 Thread MEETHU MATHEW
Try setting --total-executor-cores to limit how many total cores it can use.

Thanks  Regards, 
Meethu M
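For illustration, a minimal PySpark sketch of capping an application's resources in standalone mode (the app name and values are placeholders; spark.cores.max is the configuration counterpart of the --total-executor-cores flag):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("capped-app")               # placeholder name
        .set("spark.cores.max", "2")            # total cores this application may take
        .set("spark.executor.memory", "2g"))    # per-executor heap, must fit within worker memory
sc = SparkContext(conf=conf)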


On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote:
 


I guess one way to do so would be to run more than one worker per node. Say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then you get 4 executors with 2 cores each.



On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas vcsub...@gmail.com wrote:

I have not found a way to control the cores yet. This effectively limits the 
cluster to a single application at a time. A subsequent application shows in 
the 'WAITING' State on the dashboard. 



On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya aara...@gmail.com wrote:





On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote:





On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote:

1. worker memory caps executor. 
2. With default config, every job gets one executor per worker. This 
executor runs with all cores available to the worker.


By "job" do you mean one SparkContext or one stage execution within a program? Does that also mean that two concurrent jobs will get one executor each at the same time?



Experimenting with this some more, I figured out that an executor takes away 
spark.executor.memory amount of memory from the configured worker memory.  
It also takes up all the cores, so even if there is still some memory left, 
there are no cores left for starting another executor.  Is my assessment 
correct? Is there no way to configure the number of cores that an executor 
can use?


 



On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya aara...@gmail.com wrote:

Hi,

What's the relationship between Spark worker and executor memory settings 
in standalone mode?  Do they work independently or does the worker cap 
executor memory?

Also, is the number of concurrent executors per worker capped by the 
number of CPU cores configured for the worker?






Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread MEETHU MATHEW
Hi all,

My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0 it is throwing exceptions and tasks are failing.

The code contains some map and filter transformations followed by groupByKey (reduceByKey in another version of the code). What I could find out is that the code works fine until groupByKey or reduceByKey in both versions, but after that the following errors show up in Spark 1.1.0:
 
java.io.FileNotFoundException: 
/tmp/spark-local-20141006173014-4178/35/shuffle_6_0_5161 (Too many open files)
java.io.FileOutputStream.openAppend(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:210)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)

org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)

org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:701)

I cleaned my /tmp directory and changed my local directory to another folder, but nothing helped.

Can anyone say what could be the reason?

Thanks  Regards, 
Meethu M

Python version of kmeans

2014-09-17 Thread MEETHU MATHEW
Hi all,

I need the kmeans code written against PySpark for some testing purposes.
Can somebody tell me the difference between these two files?

 spark-1.0.1/examples/src/main/python/kmeans.py   and 

 spark-1.0.1/python/pyspark/mllib/clustering.py


Thanks  Regards, 
Meethu M

Re: how to specify columns in groupby

2014-08-29 Thread MEETHU MATHEW
Thank you Yanbo for the reply.

I have another query related to cogroup: I want to iterate over the results of a cogroup operation.

My code is
* grp = RDD1.cogroup(RDD2)
* map((lambda (x, y): (x, list(y[0]), list(y[1]))), list(grp))
My result looks like:

[((u'764', u'20140826'), [0.70146274566650391], [ ]),
 ((u'863', u'20140826'), [0.368011474609375], [ ]),
 ((u'9571520', u'20140826'), [0.0046129226684570312], [0.60009])]
 
When I do one more cogroup operation, like

grp1 = grp.cogroup(RDD3)

I am not able to see the results. All my RDDs are of the form ((x,y),z). Can somebody help me to solve this?
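For illustration, a rough sketch of one way to inspect nested cogroup output (it assumes a live SparkContext sc and made-up sample data; cogroup values are iterables, so they have to be converted to lists before printing):

rdd1 = sc.parallelize([((u'764', u'20140826'), 0.701), ((u'863', u'20140826'), 0.368)])
rdd2 = sc.parallelize([((u'764', u'20140826'), 0.6)])
rdd3 = sc.parallelize([((u'863', u'20140826'), 1.5)])

grp = rdd1.cogroup(rdd2)
# Materialise the iterable values so their contents are visible.
flat = grp.mapValues(lambda v: (list(v[0]), list(v[1])))

grp1 = flat.cogroup(rdd3)
for key, vals in grp1.collect():
    print(key, [list(v) for v in vals])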

Thanks  Regards, 
Meethu M


On Thursday, 28 August 2014 5:59 PM, Yanbo Liang yanboha...@gmail.com wrote:
 


For your reference:

val d1 = textFile.map(line => {
  val fileds = line.split(",")
  ((fileds(0), fileds(1)), fileds(2).toDouble)
})

val d2 = d1.reduceByKey(_ + _)
d2.foreach(println)
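Since the rest of the thread is working in PySpark, a rough equivalent sketch (it assumes lines of the form id,date,cost and a placeholder input path):

lines = sc.textFile("hdfs:///data/costs.csv")                    # placeholder path
pairs = (lines.map(lambda line: line.split(","))
              .map(lambda f: ((f[0], f[1]), float(f[2]))))       # key by (id, date)
totals = pairs.reduceByKey(lambda a, b: a + b)                   # sum the cost per (id, date)
for key, total in totals.collect():
    print(key, total)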




2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in:

Hi all,


I have an RDD  which has values in the  format id,date,cost.


I want to group the elements based on the id and date columns and get the sum 
of the cost  for each group.


Can somebody tell me how to do this?


 
Thanks  Regards, 
Meethu M

how to specify columns in groupby

2014-08-28 Thread MEETHU MATHEW
Hi all,

I have an RDD  which has values in the  format id,date,cost.

I want to group the elements based on the id and date columns and get the sum 
of the cost  for each group.

Can somebody tell me how to do this?

 
Thanks  Regards, 
Meethu M

Re: Losing Executors on cluster with RDDs of 100GB

2014-08-26 Thread MEETHU MATHEW
Hi,

Please give it a try by changing the worker memory such that worker memory > executor memory.
 
Thanks  Regards, 
Meethu M


On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg ya...@media.mit.edu wrote:
 


Hi all,

I have a spark cluster of 30 machines, 16GB / 8 cores on each running in 
standalone mode. Previously my application was working well ( several 
RDDs the largest being around 50G).
When I started processing larger amounts of data (RDDs of 100G) my app 
is losing executors. Im currently just loading them from a database, 
rePartitioning and persisting to disk (with replication x2)
I have spark.executor.memory= 9G, memoryFraction = 0.5, 
spark.worker.timeout =120, spark.akka.askTimeout=30, 
spark.storage.blockManagerHeartBeatMs=3.
I haven't changed the default of my worker memory so it's at 512m (should this be larger)?

I've been getting the following messages from my app:

  [error] o.a.s.s.TaskSchedulerImpl - Lost executor 3 on myserver1: 
worker lost
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 13 on myserver2: 
Unknown executor exit code (137) (died from signal 9?)
[error] a.r.EndpointWriter - AssociationError 
[akka.tcp://spark@master:59406] - 
[akka.tcp://sparkExecutor@myserver2:32955]: Error [Association failed 
with [akka.tcp://sparkExecutor@myserver2:32955]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkexecu...@myserver2.com:32955]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: myserver2/198.18.102.160:32955
]
[error] a.r.EndpointWriter - AssociationError 
[akka.tcp://spark@master:59406] - [akka.tcp://spark@myserver1:53855]: 
Error [Association failed with [akka.tcp://spark@myserver1:53855]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://spark@myserver1:53855]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: myserver1/198.18.102.160:53855
]

The worker logs and executor logs do not contain errors. Any ideas what 
the problem is ?

Yadid



Re: OutOfMemory Error

2014-08-20 Thread MEETHU MATHEW


 Hi ,

How to increase the heap size?

What is the difference between spark executor memory and heap size?

Thanks  Regards, 
Meethu M


On Monday, 18 August 2014 12:35 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
 


I believe spark.shuffle.memoryFraction is the one you are looking for.

spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation and 
cogroups during shuffles, if spark.shuffle.spill is true. At any given time, 
the collective size of all in-memory maps used for shuffles is bounded by this 
limit, beyond which the contents will begin to spill to disk. If spills are 
often, consider increasing this value at the expense of 
spark.storage.memoryFraction.


You can give it a try.
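For illustration, a minimal sketch of setting that property when building the context (the app name and values are just examples for Spark 1.x):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-tuning-example")          # placeholder name
        .set("spark.shuffle.memoryFraction", "0.4")    # example value, raised from the 1.x default
        .set("spark.storage.memoryFraction", "0.5"))   # traded off against storage, per the quoted doc
sc = SparkContext(conf=conf)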



Thanks
Best Regards


On Mon, Aug 18, 2014 at 12:21 PM, Ghousia ghousia.ath...@gmail.com wrote:

Thanks for the answer Akhil. We are right now getting rid of this issue by increasing the number of partitions, and we are persisting RDDs to DISK_ONLY. But the issue is with heavy computations within an RDD. It would be better if we had the option of spilling the intermediate transformation results to local disk (only in case memory consumption is high). Do we have any such option available with Spark? If increasing the partitions is the only way, then one might end up with OutOfMemory errors when working with certain algorithms where the intermediate result is huge.




On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Hi Ghousia,


You can try the following:


1. Increase the heap size
2. Increase the number of partitions
3. You could try persisting the RDD to use DISK_ONLY
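As an illustration of point 3, a minimal PySpark sketch (it assumes a live SparkContext sc; DISK_ONLY keeps the materialised partitions out of executor memory at the cost of re-reading them from disk):

from pyspark import StorageLevel

data = (sc.parallelize(range(1000000))
          .map(lambda x: (x % 100, x)))       # some intermediate transformation
data.persist(StorageLevel.DISK_ONLY)          # store the materialised partitions on local disk only
print(data.countByKey()[0])                   # the first action materialises and persists the RDD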




Thanks
Best Regards



On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com 
wrote:

Hi,

I am trying to implement machine learning algorithms on Spark. I am working
on a 3 node cluster, with each node having 5GB of memory. Whenever I am
working with slightly more number of records, I end up with OutOfMemory
Error. Problem is, even if number of records is slightly high, the
intermediate result from a transformation is huge and this results in
OutOfMemory Error. To overcome this, we are partitioning the data such that
each partition has only a few records.

Is there any better way to fix this issue. Some thing like spilling the
intermediate data to local disk?

Thanks,
Ghousia.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread MEETHU MATHEW


 Hi all,

Sorry for taking up this topic again, but I am still confused about it.

I set SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops -Xmx8g"

When I run my application, I get the following line in the logs:

Spark Command: java -cp ::/usr/local/spark-1.0.1/conf:/usr/local/spark-1.0.1/assembly/target/scala-2.10/spark-assembly-1.0.1-hadoop1.2.1.jar -XX:MaxPermSize=128m -XX:+UseCompressedOops -Xmx8g -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077

-Xmx is set twice: once from SPARK_DAEMON_JAVA_OPTS, and a second time from bin/spark-class (from SPARK_DAEMON_MEMORY or DEFAULT_MEM).

I believe that the second value is the one taken at execution, i.e. the one passed as SPARK_DAEMON_MEMORY or DEFAULT_MEM.

So I would like to know what the purpose of SPARK_DAEMON_JAVA_OPTS is and how it is different from SPARK_DAEMON_MEMORY.


Thanks  Regards, 
Meethu M

Re: Error with spark-submit (formatting corrected)

2014-07-18 Thread MEETHU MATHEW
Hi,
Instead of spark://10.1.3.7:7077, use spark://vmsparkwin1:7077. Try this:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 spark://vmsparkwin1:7077 --executor-memory 1G --total-executor-cores 2
 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10

 
Thanks  Regards, 
Meethu M


On Friday, 18 July 2014 7:51 AM, Jay Vyas jayunit100.apa...@gmail.com wrote:
 


I think I know what is happening to you. I've looked into this some just this week, so it's fresh in my brain :) hope this helps.

When no workers are known to the master, iirc, you get this message.

I think this is how it works:

1) You start your master
2) You start a slave, and give it the master URL as an argument.
3) The slave then binds to a random port
4) The slave then does a handshake with the master, which you can see in the slave logs (it says something like "successfully connected to master at …"). Actually, I think the master also logs that it is now aware of a slave running on ip:port…

So in your case, I suspect, none of the slaves have connected to the master, so the job sits idle.

This is similar to the YARN scenario of submitting a job to a resource manager with no node managers running.



On Jul 17, 2014, at 6:57 PM, ranjanp piyush_ran...@hotmail.com wrote:

 Hi, 
 I am new to Spark and trying out with a stand-alone, 3-node (1 master, 2
 workers) cluster. 
 
 From the Web UI at the master, I see that the workers are registered. But
 when I try running the SparkPi example from the master node, I get the
 following message and then an exception. 
 
 14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
 spark://10.1.3.7:7077... 
 14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
 resources; check your cluster UI to ensure that workers are registered and
 have sufficient memory 
 
 I searched a bit for the above warning, and found that others have
 encountered this problem before, but did not see a clear resolution except
 for this link:
 http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-that-woy-tt8247.html#a8444
 
 Based on the suggestion there I tried supplying --executor-memory option to
 spark-submit but that did not help. 
 
 Any suggestions. Here are the details of my set up. 
 - 3 nodes (each with 4 CPU cores and 7 GB memory) 
 - 1 node configured as Master, and the other two configured as workers 
 - Firewall is disabled on all nodes, and network communication between the
 nodes is not a problem 
 - Edited the conf/spark-env.sh on all nodes to set the following: 
  SPARK_WORKER_CORES=3 
  SPARK_WORKER_MEMORY=5G 
 - The Web UI as well as logs on master show that Workers were able to
 register correctly. Also the Web UI correctly shows the aggregate available
 memory and CPU cores on the workers: 
 
 URL: spark://vmsparkwin1:7077
 Workers: 2
 Cores: 6 Total, 0 Used
 Memory: 10.0 GB Total, 0.0 B Used
 Applications: 0 Running, 0 Completed
 Drivers: 0 Running, 0 Completed
 Status: ALIVE
 
 I tried running the SparkPi example first using run-example (which was
 failing) and later directly using the spark-submit as shown below: 
 
 $ export MASTER=spark://vmsparkwin1:7077
 
 $ echo $MASTER
 spark://vmsparkwin1:7077
 
 azureuser@vmsparkwin1 /cygdrive/c/opt/spark-1.0.0
 $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 spark://10.1.3.7:7077 --executor-memory 1G --total-executor-cores 2
 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10
 
 
 The following is the full screen output:
 
 14/07/17 01:20:13 INFO SecurityManager: Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/07/17 01:20:13 INFO SecurityManager: Changing view acls to: azureuser
 14/07/17 01:20:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(azureuser)
 14/07/17 01:20:14 INFO Slf4jLogger: Slf4jLogger started
 14/07/17 01:20:14 INFO Remoting: Starting remoting
 14/07/17 01:20:14 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
 14/07/17 01:20:14 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
 14/07/17 01:20:14 INFO SparkEnv: Registering MapOutputTracker
 14/07/17 01:20:14 INFO SparkEnv: Registering BlockManagerMaster
 14/07/17 01:20:14 INFO DiskBlockManager: Created local directory at
 C:\cygwin\tmp\spark-local-20140717012014-b606
 14/07/17 01:20:14 INFO MemoryStore: MemoryStore started with capacity 294.9
 MB.
 14/07/17 01:20:14 INFO ConnectionManager: Bound socket to port 49842 with id
 = ConnectionManagerId(vmsparkwin1.cssparkwin.b1.internal.cloudapp.net,49842)
 14/07/17 01:20:14 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/17 01:20:14 INFO BlockManagerInfo: Registering block manager
 

Re: Pysparkshell are not listing in the web UI while running

2014-07-17 Thread MEETHU MATHEW
Hi Akhil,

That fixed the problem...Thanks

 
Thanks  Regards, 
Meethu M


On Thursday, 17 July 2014 2:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Hi Neethu,

Your application is running in local mode and that's the reason why you are not seeing the driver app in the 8080 web UI. You can pass the Master IP to your pyspark and get it running in cluster mode.

eg: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark --master spark://master:7077

Replace master:7077 with the Spark URI that you are seeing in the top left of the 8080 web UI.



Thanks
Best Regards


On Thu, Jul 17, 2014 at 1:35 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:



 Hi all,

I just upgraded to Spark 1.0.1. In Spark 1.0.0, when I started the IPython notebook using the following command, it used to appear in the running applications tab of the master:8080 web UI.

IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark

But now when I run it, it is not getting listed under running applications/completed applications (once it is closed). But I am able to see the Spark stages at master:4040 while it is running.

Does anyone have any idea why this is?






Thanks  Regards, 
Meethu M

Difference between collect() and take(n)

2014-07-10 Thread MEETHU MATHEW
Hi all,

I want to know how collect() works and how it is different from take(). I am just reading a file of 330 MB which has 43 lakh rows with 13 columns, and calling take(430) to save the result to a variable. But the same is not working with collect(). So is there any difference in the operation of the two?
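For illustration, a minimal sketch of the behavioural difference (assuming a live SparkContext sc): take(n) brings only the first n elements to the driver, scanning as few partitions as it can, while collect() brings every element to the driver and can therefore exhaust driver memory on a large dataset.

rdd = sc.parallelize(range(1000000))
first_rows = rdd.take(430)     # fetches just 430 elements
all_rows = rdd.collect()       # fetches the entire RDD into the driver
print(len(first_rows), len(all_rows))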


Again, I wanted to set the Java heap size for my Spark program. I set it using spark.executor.extraJavaOptions in spark-default-conf.sh. Now I want to set the same for the worker. Can I do that with SPARK_DAEMON_JAVA_OPTS? Is the following syntax correct?

SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops -Xmx3g"


Thanks  Regards, 
Meethu M

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-02 Thread MEETHU MATHEW
The problem is resolved. I have added SPARK_LOCAL_IP=master in both slaves also. When I changed this, my slaves started working.
Thank you all for your suggestions.
 
Thanks  Regards, 
Meethu M


On Wednesday, 2 July 2014 10:22 AM, Aaron Davidson ilike...@gmail.com wrote:
 


In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of that kind? This error suggests the worker is trying to bind a server on the master's IP, which clearly doesn't make sense.





On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi,


I did netstat -na | grep 192.168.125.174 and its showing 192.168.125.174:7077 
LISTEN(after starting master)


I tried to execute the following script from the slaves manually but it ends 
up with the same exception and log.This script is internally executing the 
java command.
 /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
In this case netstat is showing any connection established to master:7077.


When we manually execute the java command,the connection is getting 
established to master.


Thanks  Regards, 
Meethu M



On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Are you sure you have this ip 192.168.125.174 bind for that machine? (netstat 
-na | grep 192.168.125.174)


Thanks
Best Regards


On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,


I reinstalled spark,reboot the system,but still I am not able to start the 
workers.Its throwing the following exception:


Exception in thread main org.jboss.netty.channel.ChannelException: Failed 
to bind to: master/192.168.125.174:0


I doubt the problem is with 192.168.125.174:0. Eventhough the command 
contains master:7077,why its showing 0 in the log.


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://master:7077


Can somebody tell me  a solution.
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 


Hi,
ya I tried setting another PORT also,but the same problem..
master is set in etc/hosts
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


that's strange, did you try setting the master port to something else (use 
SPARK_MASTER_PORT).


Also you said you are able to start it from the java commandline


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



What is the master ip specified here? is it like you have entry for master in 
the /etc/hosts? 


Thanks
Best Regards


On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi Akhil,


I am running it in a LAN itself..The IP of the master is given correctly.
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
 


why is it binding to port 0? 192.168.125.174:0 :/


Check the ip address of that master machine (ifconfig) looks like the ip 
address has been changed (hoping you are running this machines on a LAN)


Thanks
Best Regards


On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in 
wrote:

Hi all,


My Spark(Standalone mode) was running fine till yesterday.But now I am 
getting  the following exeception when I am running start-slaves.sh or 
start-all.sh


slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3:   at java.lang.Thread.run(Thread.java:662)


The log files has the following lines.


14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j 
profile: org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed 
to bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions.


Set the variable SPARK_LOCAL_IP ,Changed the SPARK_MASTER_PORT to a 
different number..But nothing is working.


When I try to start the worker from the respective machines using the 
following java command,its running without any exception


java -cp 
::/usr/local/spark

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread MEETHU MATHEW
Hi,

I did netstat -na | grep 192.168.125.174 and it is showing 192.168.125.174:7077 LISTEN (after starting the master).

I tried to execute the following script from the slaves manually, but it ends up with the same exception and log. This script internally executes the java command.
 /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
In this case netstat is not showing any connection established to master:7077.

When we manually execute the java command, the connection to the master is established.

Thanks  Regards, 
Meethu M


On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Are you sure you have this ip 192.168.125.174 bind for that machine? (netstat 
-na | grep 192.168.125.174)


Thanks
Best Regards


On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,


I reinstalled spark,reboot the system,but still I am not able to start the 
workers.Its throwing the following exception:


Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/192.168.125.174:0


I doubt the problem is with 192.168.125.174:0. Eventhough the command contains 
master:7077,why its showing 0 in the log.


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://master:7077


Can somebody tell me  a solution.
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 


Hi,
ya I tried setting another PORT also,but the same problem..
master is set in etc/hosts
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


that's strange, did you try setting the master port to something else (use 
SPARK_MASTER_PORT).


Also you said you are able to start it from the java commandline


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



What is the master ip specified here? is it like you have entry for master in 
the /etc/hosts? 


Thanks
Best Regards


On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi Akhil,


I am running it in a LAN itself..The IP of the master is given correctly.
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


why is it binding to port 0? 192.168.125.174:0 :/


Check the ip address of that master machine (ifconfig) looks like the ip 
address has been changed (hoping you are running this machines on a LAN)


Thanks
Best Regards


On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in 
wrote:

Hi all,


My Spark(Standalone mode) was running fine till yesterday.But now I am 
getting  the following exeception when I am running start-slaves.sh or 
start-all.sh


slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3:   at java.lang.Thread.run(Thread.java:662)


The log files has the following lines.


14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed 
to bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions.


Set the variable SPARK_LOCAL_IP ,Changed the SPARK_MASTER_PORT to a 
different number..But nothing is working.


When I try to start the worker from the respective machines using the 
following java command,its running without any exception


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



Somebody please give a solution
 
Thanks  Regards, 
Meethu M









Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW


 Hi ,

I am using Spark Standalone mode with one master and 2 slaves. I am not able to start the workers and connect them to the master using


./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

The log says

Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/x.x.x.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address

When I try to start the worker from the slaves using the following java 
command,its running without any exception

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077





Thanks  Regards, 
Meethu M

Re: Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
Yes.
 
Thanks  Regards, 
Meethu M


On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Is this command working??

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077



Thanks
Best Regards


On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:



 Hi ,


I am using Spark Standalone mode with one master and 2 slaves.I am not  able 
to start the workers and connect it to the master using 


./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077


The log says


Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/x.x.x.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address


When I try to start the worker from the slaves using the following java 
command,its running without any exception


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077









Thanks  Regards, 
Meethu M

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-06-30 Thread MEETHU MATHEW
Hi all,

I reinstalled Spark and rebooted the system, but I am still not able to start the workers. It is throwing the following exception:

Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0

I suspect the problem is with 192.168.125.174:0. Even though the command contains master:7077, why is it showing 0 in the log?

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://master:7077

Can somebody tell me  a solution.
 
Thanks  Regards, 
Meethu M


On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 


Hi,
ya I tried setting another PORT also,but the same problem..
master is set in etc/hosts
 
Thanks  Regards, 
Meethu M


On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


that's strange, did you try setting the master port to something else (use 
SPARK_MASTER_PORT).

Also you said you are able to start it from the java commandline

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077


What is the master ip specified here? is it like you have entry for master in 
the /etc/hosts? 


Thanks
Best Regards


On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi Akhil,


I am running it in a LAN itself..The IP of the master is given correctly.
 
Thanks  Regards, 
Meethu M



On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


why is it binding to port 0? 192.168.125.174:0 :/


Check the ip address of that master machine (ifconfig) looks like the ip 
address has been changed (hoping you are running this machines on a LAN)


Thanks
Best Regards


On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,


My Spark (Standalone mode) was running fine till yesterday, but now I am getting the following exception when I run start-slaves.sh or start-all.sh:


slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3:   at java.lang.Thread.run(Thread.java:662)


The log files have the following lines:


14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed 
to bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions.


I set the variable SPARK_LOCAL_IP and changed SPARK_MASTER_PORT to a different
number, but nothing is working.


When I try to start the worker from the respective machines using the
following java command, it runs without any exception:


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



Could somebody please suggest a solution?
 
Thanks  Regards, 
Meethu M




Re: How to control a spark application(executor) using memory amount per node?

2014-06-30 Thread MEETHU MATHEW
Hi,

Try setting --driver-java-options with spark-submit, or set
spark.executor.extraJavaOptions in spark-defaults.conf.
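
For illustration, here is a minimal sketch of setting the executor memory per
application from code instead (standard Spark 1.x SparkConf API; the master URL,
app name and values are placeholders, not taken from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // spark.executor.memory sets the executor heap size; heap settings cannot be
  // passed through spark.executor.extraJavaOptions, which is for other JVM flags.
  val conf = new SparkConf()
    .setMaster("spark://master:7077")
    .setAppName("memory-example")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")
  val sc = new SparkContext(conf)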
 
Thanks  Regards, 
Meethu M


On Monday, 30 June 2014 1:28 PM, hansen han...@neusoft.com wrote:
 


Hi,

When i send the following statements in spark-shell:
    val file =
sc.textFile(hdfs://nameservice1/user/study/spark/data/soc-LiveJournal1.txt)
    val count = file.flatMap(line = line.split( )).map(word = (word,
1)).reduceByKey(_+_)
    println(count.count())

and it throws an exception:
..
14/06/30 15:50:53 WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
    at
java.io.ObjectOutputStream$HandleTable.growEntries(ObjectOutputStream.java:2346)
    at
java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2275)
    at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
    at
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:176)
    at
org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
    at
org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.foreach(ExternalAppendOnlyMap.scala:239)
    at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Then, I set the following configuration in spark-env.sh:
    export SPARK_EXECUTOR_MEMORY=1G

It still doesn't work.

spark.png
http://apache-spark-user-list.1001560.n3.nabble.com/file/n8521/spark.png  

I found that when I start spark-shell, the console also prints the following log:
    SparkDeploySchedulerBackend: Granted executor ID
app-20140630144110-0002/0 on hostPort dlx8:7078 with 8 cores, *512.0 MB RAM*

How can I increase the 512.0 MB of RAM to more memory?

Please help!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-a-spark-application-executor-using-memory-amount-per-node-tp8521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-06-27 Thread MEETHU MATHEW
Hi Akhil,

The IP is correct, and the workers start when we launch them as a java command.
It becomes 192.168.125.174:0 when we call it from the scripts.


 
Thanks  Regards, 
Meethu M


On Friday, 27 June 2014 1:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


why is it binding to port 0? 192.168.125.174:0 :/

Check the IP address of that master machine (ifconfig); it looks like the IP
address has been changed (assuming you are running these machines on a LAN).


Thanks
Best Regards


On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,


My Spark (Standalone mode) was running fine until yesterday. But now I am
getting the following exception when I run start-slaves.sh or start-all.sh:


slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3:   at java.lang.Thread.run(Thread.java:662)


The log files have the following lines:


14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions.


I set the variable SPARK_LOCAL_IP and changed SPARK_MASTER_PORT to a different
number, but nothing is working.


When I try to start the worker from the respective machines using the
following java command, it runs without any exception:


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



Could somebody please suggest a solution?
 
Thanks  Regards, 
Meethu M

Re: join operation is taking too much time

2014-06-18 Thread MEETHU MATHEW
Hi,
Thanks Andrew and Daniel for the response.

Setting spark.shuffle.spill to false didn't make any difference. The 5-day case
completed in 6 minutes, and the 10-day case was stuck after around 1 hour.


Daniel, in my current use case I can't read all the files into a single RDD. But I
have another use case where I did it that way, i.e. I read all the files into a
single RDD and joined it with the RDD of 9 million rows, and it worked fine
and took only 3 minutes.
 
Thanks  Regards, 
Meethu M


On Wednesday, 18 June 2014 12:11 AM, Daniel Darabos 
daniel.dara...@lynxanalytics.com wrote:
 


I've been wondering about this. Is there a difference in performance between 
these two?

val rdd1 = sc.textFile(files.mkString(","))
val rdd2 = sc.union(files.map(sc.textFile(_)))

I don't know about your use-case, Meethu, but it may be worth trying to see if 
reading all the files into one RDD (like rdd1) would perform better in the 
join. (If this is possible in your situation.)
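
As an aside, a hedged sketch of the two shapes being compared (assuming a
spark-shell session where sc and the pair-RDD implicits are in scope; file names,
key extraction and column handling are hypothetical, only the structure matters):

  // Iterative approach: one leftOuterJoin per file, reassigning the result each time.
  val files = (1 to 30).map(i => s"hdfs:///data/part_$i.txt")
  val base = sc.textFile("hdfs:///data/base.txt").map(line => (line.split(",")(0), line))
  var result = base
  for (f <- files) {
    val other = sc.textFile(f).map(line => (line.split(",")(0), line))
    result = result.leftOuterJoin(other)
                   .mapValues { case (left, right) => left + "," + right.getOrElse("") }
  }

  // Single-RDD approach: read every file into one RDD and join once. Note the
  // semantics differ: this is one join against the union of all files rather
  // than one join per file.
  val all = sc.textFile(files.mkString(",")).map(line => (line.split(",")(0), line))
  val joinedOnce = base.leftOuterJoin(all)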




On Tue, Jun 17, 2014 at 6:45 PM, Andrew Or and...@databricks.com wrote:

How long does it get stuck for? This is a common sign of the OS thrashing due
to running out of memory. If you keep it running longer, does it throw an
error?


Depending on how large your other RDD is (and your join operation), memory 
pressure may or may not be the problem at all. It could be that spilling your 
shuffles
to disk is slowing you down (but probably shouldn't hang your application). 
For the 5 RDDs case, what happens if you set spark.shuffle.spill to false?



2014-06-17 5:59 GMT-07:00 MEETHU MATHEW meethu2...@yahoo.co.in:




 Hi all,


I want to do a recursive leftOuterJoin between an RDD (created from a file)
with 9 million rows (the size of the file is 100 MB) and 30 other RDDs (created from
30 different files, one in each iteration of a loop) varying from 1 to 6 million rows.
When I run it for 5 RDDs, it runs successfully in 5 minutes. But when I
increase it to 10 or 30 RDDs, it gradually slows down and finally gets
stuck without showing any warning or error.


I am running in standalone mode with 2 workers of 4 GB each and a total of 16
cores.


Are any of you facing similar problems with join, or is it a problem with my
configuration?


Thanks  Regards, 
Meethu M


options set in spark-env.sh is not reflecting on actual execution

2014-06-18 Thread MEETHU MATHEW
Hi all,

I have a question regarding the options in spark-env.sh. I set the following
values in the file on the master and the 2 workers:

SPARK_WORKER_MEMORY=7g
SPARK_EXECUTOR_MEMORY=6g
SPARK_DAEMON_JAVA_OPTS+="-Dspark.akka.timeout=30 -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80 -Dspark.shuffle.spill=false"

But SPARK_EXECUTOR_MEMORY is showing as 4g in the web UI. Do I need to change it
anywhere else to make it 6g and have that reflected in the web UI?

A warning is appearing that blockManagerHeartBeatMs is exceeding 45 while
executing a process, even though I set it to 80.

So I am unsure whether it should be set as SPARK_MASTER_OPTS or SPARK_WORKER_OPTS.
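
As a side note: the spark.* entries above are application properties rather than
daemon JVM options, and SPARK_DAEMON_JAVA_OPTS is documented as applying to the
standalone master and worker daemons rather than to the executors of a submitted
job. A hedged sketch of setting the same properties on the application itself
(values taken from the mail above; spark.executor.memory is the property
equivalent of SPARK_EXECUTOR_MEMORY, and the rest is a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // Application-level configuration, picked up when this job's executors are launched.
  val conf = new SparkConf()
    .set("spark.executor.memory", "6g")
    .set("spark.akka.timeout", "30")
    .set("spark.shuffle.spill", "false")
  val sc = new SparkContext(conf)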
 
Thanks  Regards, 
Meethu M

join operation is taking too much time

2014-06-17 Thread MEETHU MATHEW


 Hi all,

I want to do a recursive leftOuterJoin between an RDD (created from a file)
with 9 million rows (the size of the file is 100 MB) and 30 other RDDs (created from
30 different files, one in each iteration of a loop) varying from 1 to 6 million rows.
When I run it for 5 RDDs, it runs successfully in 5 minutes. But when I
increase it to 10 or 30 RDDs, it gradually slows down and finally gets
stuck without showing any warning or error.

I am running in standalone mode with 2 workers of 4 GB each and a total of 16
cores.

Are any of you facing similar problems with join, or is it a problem with my
configuration?

Thanks  Regards, 
Meethu M

Re: Wildcard support in input path

2014-06-17 Thread MEETHU MATHEW
Hi Jianshi,

I have used wildcard characters (*) in my program and it worked.
My code was like this:
b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*")
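
For what it's worth, a Scala equivalent should behave the same way, since the
glob in the path is expanded by the underlying Hadoop FileInputFormat rather
than by Spark itself (the path below is a placeholder):

  // The * glob is resolved by Hadoop when the input paths are listed.
  val rdd = sc.textFile("hdfs:///path/to/data/data_file_2013SEP01*")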

 
Thanks  Regards, 
Meethu M


On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang jianshi.hu...@gmail.com 
wrote:
 


It would be convenient if Spark's textFile, parquetFile, etc. could support paths
with wildcards, such as:

  hdfs://domain/user/jianshuang/data/parquet/table/month=2014*

Or is there already a way to do it now?

Jianshi

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/

ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi,
I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in
HDFS. I have come across the same issue in JIRA at
https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved.
I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it is still
showing the error. What may be the possible reason that I am getting the same error
again?
I am using Spark 1.0.0 with Hadoop 1.2.1.
java.lang.ArrayIndexOutOfBoundsException: 90
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 

Thanks  Regards, 
Meethu M

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi Akhil,
Please find the code below.
 from operator import add  # needed for the reduce below

 x = sc.textFile("hdfs:///**")
 x = x.filter(lambda z: z.split(",")[0] != ' ')
 x = x.filter(lambda z: z.split(",")[3] != ' ')
 z = x.reduce(add)
 
Thanks  Regards, 
Meethu M


On Monday, 9 June 2014 5:52 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Can you paste the piece of code!?


Thanks
Best Regards


On Mon, Jun 9, 2014 at 5:24 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi,
I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in
HDFS. I have come across the same issue in JIRA at
https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved.
I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it is still
showing the error. What may be the possible reason that I am getting the same error
again?
I am using Spark 1.0.0 with Hadoop 1.2.1.
java.lang.ArrayIndexOutOfBoundsException: 90
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 

Thanks  Regards, 
Meethu M

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi Sean,

Thank you for the fast response.
 
Thanks  Regards, 
Meethu M


On Monday, 9 June 2014 6:04 PM, Sean Owen so...@cloudera.com wrote:
 


Have a search online / at the Spark JIRA. This was a known upstream
bug in Hadoop.

https://issues.apache.org/jira/browse/SPARK-1861


On Mon, Jun 9, 2014 at 7:54 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 Hi,
 I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in
 HDFS. I have come across the same issue in JIRA at
 https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be
 resolved. I have tried the workaround suggested (SPARK_WORKER_CORES=1), but
 it is still showing the error. What may be the possible reason that I am getting
 the same error again?
 I am using Spark 1.0.0 with Hadoop 1.2.1.
 java.lang.ArrayIndexOutOfBoundsException: 90
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
 at
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
 at
 org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
 at java.io.InputStream.read(InputStream.java:101)
 at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
 at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
 at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)

 Thanks  Regards,
 Meethu M

How to stop a running SparkContext in the proper way?

2014-06-03 Thread MEETHU MATHEW
Hi,

I want to know how I can stop a running SparkContext in a proper way so that
the next time I start a new SparkContext, the web UI can be launched on the
same port 4040. Now, when I quit the job using Ctrl+Z, the new SparkContexts are
launched on new ports.
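
A minimal sketch of the clean way to do this (assuming an interactive shell
where sc is already defined; the same call works from pyspark): stop the context
explicitly instead of suspending the shell with Ctrl+Z, which leaves the old JVM
running and port 4040 still bound.

  // Stops the SparkContext cleanly, shutting down its executors and releasing
  // the web UI port so the next context can reuse 4040.
  sc.stop()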

I have the same problem with the IPython notebook. It is launched on a different
port when I start the notebook a second time after closing the first one. I am
starting IPython using the command:

IPYTHON_OPTS=notebook --ip  --pylab inline ./bin/pyspark

Thanks  Regards, 
Meethu M