Re: mesos or kubernetes ?

2016-08-14 Thread Gurvinder Singh
On 08/13/2016 08:24 PM, guyoh wrote:
> My company is trying to decide whether to use kubernetes or mesos. Since we
> are planning to use Spark in the near future, I was wondering what the
> best choice for us would be.
> Thanks, 
> Guy
> 
Both Kubernetes and Mesos enable you to share your infrastructure with
other workloads. On K8s, Spark runs mainly in standalone mode; to provide
failure recovery you can use the options mentioned here:
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
I have tested file-system recovery with K8s and it works fine, as K8s
restarts the master within a few seconds. On K8s you can also dynamically
scale your cluster, since K8s supports horizontal pod autoscaling
http://blog.kubernetes.io/2016/03/using-Spark-and-Zeppelin-to-process-Big-Data-on-Kubernetes.html
This lets you spin down your Spark cluster when it is not in use, run other
workloads in Docker/rkt containers on the same resources, and scale the
Spark cluster back up when needed, given you have free resources. You don't
need static partitioning for your Spark cluster when running on K8s, as
Spark runs in containers and shares the same underlying resources with
everything else.
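
For reference, the file-system recovery I tested is just the standalone HA
option from the page above, enabled on the master roughly like this (the
recovery directory is only an example; it should live on storage that
survives a master restart, e.g. a persistent volume mounted into the master
pod):

  SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
    -Dspark.deploy.recoveryDirectory=/spark-recovery"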

Soonish (https://github.com/apache/spark/pull/13950) you will be able to use
authentication based on GitHub, Google, Facebook, etc. to access the Spark
master UI when deployed standalone, e.g. on K8s, and expose only the master
UI as the entry point to the other UIs (worker logs, app UI).

As usual the choice is yours; I just wanted to give you the info from the K8s side.

- Gurvinder





spark master ui to proxy app and worker ui

2016-03-03 Thread Gurvinder Singh
Hi,

I am wondering if it is possible for the Spark standalone master UI to
proxy the app/driver UI and the worker UI. The reason for this is that
currently, if you want to access the UI of a driver or worker to see logs,
you need access to its IP:port, which makes it harder to open up from a
networking point of view. Operationally it would make life easier if the
master could simply proxy those connections, so that both the app and
worker UI details are accessible from the master UI itself.

The master would not need to have content streamed to it all the time; only
when a user wants to access content from the other UIs would it proxy the
request/response for that duration. Thus the master would not incur extra
load all the time.
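
Roughly what I have in mind is a single switch on the master, along these
lines (the property names below are only a sketch of the idea, not existing
settings):

  spark.ui.reverseProxy     true
  spark.ui.reverseProxyUrl  http://spark-master.example.com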

Thanks,
Gurvinder




Re: Tungsten and Spark Streaming

2015-09-10 Thread Gurvinder Singh
On 09/10/2015 07:42 AM, Tathagata Das wrote:
> Rewriting is necessary. You will have to convert RDD/DStream operations
> to DataFrame operations. So get the RDDs in DStream, using
> transform/foreachRDD, convert to DataFrames and then do DataFrame
> operations.

Are there any plans for 1.6 or later to add Tungsten support to
RDD/DStream directly, or is the intention that users switch to DataFrames
rather than operating at the RDD/DStream level?
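
For anyone following along, a minimal sketch of the transform/foreachRDD
conversion Tathagata describes (PySpark; the SparkContext sc and a DStream
of (word, count) pairs named counts are assumptions, not from this thread):

  from pyspark.sql import SQLContext

  sqlContext = SQLContext(sc)

  def process(rdd):
      if rdd.isEmpty():
          return
      # Convert the micro-batch RDD of (word, count) tuples into a DataFrame
      # so the aggregation below goes through the DataFrame/Tungsten path.
      df = sqlContext.createDataFrame(rdd, ["word", "count"])
      df.groupBy("word").sum("count").show()

  counts.foreachRDD(process)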

> 
> On Wed, Sep 9, 2015 at 9:23 PM, N B wrote:
> 
> Hello,
> 
> How can we start taking advantage of the performance gains made
> under Project Tungsten in Spark 1.5 for a Spark Streaming program? 
> 
> From what I understand, this is available by default for Dataframes.
> But for a program written using Spark Streaming, would we see any
> potential gains "out of the box" in 1.5 or will we have to rewrite
> some portions of the application code to realize that benefit?
> 
> Any insight/documentation links etc in this regard will be appreciated.
> 
> Thanks
> Nikunj
> 
> 





Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Gurvinder Singh
On 09/05/2015 11:22 AM, Reynold Xin wrote:
> Try increase the shuffle memory fraction (by default it is only 16%).
> Again, if you run Spark 1.5, this will probably run a lot faster,
> especially if you increase the shuffle memory fraction ...
Hi Reynold,

Does 1.5 have better join/cogroup performance for the RDD case too, or
only for SQL?
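
For reference, the fraction Reynold mentions maps to
spark.shuffle.memoryFraction (0.2 by default, i.e. roughly 16% of the heap
once the safety fraction is applied); the values below are only an example
of raising it at the expense of the storage fraction:

  spark.shuffle.memoryFraction   0.4
  spark.storage.memoryFraction   0.4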

- Gurvinder
> 
> On Tue, Sep 1, 2015 at 8:13 AM, Thomas Dudziak wrote:
> 
> While it works with sort-merge-join, it takes about 12h to finish
> (with 1 shuffle partitions). My hunch is that the reason for
> that is this:
> 
> INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB
> to disk (62 times so far)
> 
> (and lots more where this comes from).
> 
> On Sat, Aug 29, 2015 at 7:17 PM, Reynold Xin wrote:
> 
> Can you try 1.5? This should work much, much better in 1.5 out
> of the box.
> 
> For 1.4, I think you'd want to turn on sort-merge-join, which is
> off by default. However, the sort-merge join in 1.4 can still
> trigger a lot of garbage, making it slower. SMJ performance is
> probably 5x - 1000x better in 1.5 for your case.
> 
> 
> On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak
> wrote:
> 
> I'm getting errors like "Removing executor with no recent
> heartbeats" & "Missing an output location for shuffle"
> errors for a large SparkSql join (1bn rows/2.5TB joined with
> 1bn rows/30GB) and I'm not sure how to configure the job to
> avoid them.
> 
> The initial stage completes fine with some 30k tasks on a
> cluster with 70 machines/10TB memory, generating about 6.5TB
> of shuffle writes, but then the shuffle stage first waits
> 30min in the scheduling phase according to the UI, and then
> dies with the mentioned errors.
> 
> I can see in the GC logs that the executors reach their
> memory limits (32g per executor, 2 workers per machine) and
> can't allocate any more stuff in the heap. Fwiw, the top 10
> in the memory use histogram are:
> 
> num #instances #bytes  class name
> --
>1: 249139595 11958700560
>  scala.collection.immutable.HashMap$HashMap1
>2: 251085327 8034730464 
>  scala.Tuple2
>3: 243694737 5848673688  java.lang.Float
>4: 231198778 5548770672  java.lang.Integer
>5:  72191585 4298521576
>  [Lscala.collection.immutable.HashMap;
>6:  72191582 2310130624
>  scala.collection.immutable.HashMap$HashTrieMap
>7:  74114058 1778737392  java.lang.Long
>8:   6059103  779203840  [Ljava.lang.Object;
>9:   5461096  174755072
>  scala.collection.mutable.ArrayBuffer
>   10: 34749   70122104  [B
> 
> Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):
> 
> spark.core.connection.ack.wait.timeout 600
> spark.executor.heartbeatInterval   60s
> spark.executor.memory  32g
> spark.mesos.coarse false
> spark.network.timeout  600s
> spark.shuffle.blockTransferService netty
> spark.shuffle.consolidateFiles true
> spark.shuffle.file.buffer  1m
> spark.shuffle.io.maxRetries6
> spark.shuffle.manager  sort
> 
> The join is currently configured with
> spark.sql.shuffle.partitions=1000 but that doesn't seem to
> help. Would increasing the partitions help ? Is there a
> formula to determine an approximate partitions number value
> for a join ?
> Any help with this job would be appreciated !
> 
> cheers,
> Tom
> 
> 
> 
> 





Re: spark and mesos issue

2014-09-16 Thread Gurvinder Singh
It might not be related only to the memory issue. The memory issue is also
there, as you mentioned, and I have seen that one too. The fine-grained mode
issue is mainly that Spark thinks it got two different block manager
registrations for the same ID, whereas if I search for the ID on the Mesos
slaves, it exists only on one slave, not on several of them. This might be
due to the size of the ID, as Spark prints the error as

14/09/16 08:04:29 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140822-112818-711206558-5050-25951-0

whereas in the Mesos slave logs I see

I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container
'3aab2237-d32f-470d-a206-7bada454ad3f' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0053'

I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container
'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0050'


You can see that the last 3 digits of the ID are missing in the Spark log,
whereas they differ between the Mesos slaves.

- Gurvinder
On 09/15/2014 11:13 PM, Brenden Matthews wrote:
 I started hitting a similar problem, and it seems to be related to 
 memory overhead and tasks getting OOM killed.  I filed a ticket
 here:
 
 https://issues.apache.org/jira/browse/SPARK-3535
 
 On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez
 rayrod2...@gmail.com wrote:
 
 I'll set some time aside today to gather and post some logs and 
 details about this issue from our end.
 
 
 On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone vinodk...@gmail.com wrote:
 
 
 On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone vi...@twitter.com wrote:
 
 
 On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh
 gurvinder.si...@uninett.no wrote:
 
 ERROR storage.BlockManagerMasterActor: Got two different block
 manager registrations on 201407031041-1227224054-5050-24004-0
 
 Googling about it seems that mesos is starting slaves at the same
 time and giving them the same id. So may bug in mesos ?
 
 
 Has this issue been resolved? We need more information to triage
 this. Maybe some logs that show the lifecycle of the duplicate
 instances?
 
 
 @vinodkone
 
 
 
 





Re: hdfs read performance issue

2014-08-20 Thread Gurvinder Singh
I got some time to look into it. It appears that Spark (latest git) is
doing this operation much more often compared to the Aug 1 version. Here
is the log from the operation I am referring to:

14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO rdd.HadoopRDD: Input split:
hdfs://test/test_flows/test-2014-05-06.csv:9529458688+134217728
14/08/19 12:37:41 INFO python.PythonRDD: Times: total = 16312, boot = 8,
init = 134, finish = 16170
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@6374d682,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@baf23d1,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@17587455,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[stime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@303d846c,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[endtime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@16c0e732,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@528a8f49,
ratio: 1.0
14/08/19 12:37:41 INFO storage.MemoryStore: ensureFreeSpace(64834432)
called with curMem=1556288334, maxMem=944671

With the Aug 1 version, the log file from processing the same amount of
data is approx 136 KB, whereas with the latest git it is 23 MB. The only
message that makes the log file grow is the one mentioned above. It appears
the latest git version has an issue when reading data and converting it to
columnar format. This conversion happens when Spark is creating an RDD,
once for each RDD; the latest git version might simply be doing it for each
record in the RDD. That is what slows the read from disk, as the time goes
into this operation.

Any suggestion/help in this regard will be helpful.

- Gurvinder
On 08/14/2014 10:27 AM, Gurvinder Singh wrote:
 Hi,
 
 I am running Spark from git directly. I recently compiled the newer Aug 13
 version and it has a 2-3x performance drop in reads from HDFS compared to
 the Aug 1 git version. So I am wondering which commit could have caused
 such a regression in read performance. The performance is almost the same
 once data is cached in memory, but reading from HDFS is much slower
 compared to the Aug 1 version.
 
 - Gurvinder
 





read performance issue

2014-08-14 Thread Gurvinder Singh
Hi,

I am running Spark from git directly. I recently compiled the newer Aug 13
version and it has a 2-3x performance drop in reads from HDFS compared to
the Aug 1 git version. So I am wondering which commit could have caused
such a regression in read performance. The performance is almost the same
once data is cached in memory, but reading from HDFS is much slower
compared to the Aug 1 version.

- Gurvinder




Re: SQLCtx cacheTable

2014-08-05 Thread Gurvinder Singh
On 08/04/2014 10:57 PM, Michael Armbrust wrote:
 If mesos is allocating a container that is exactly the same as the max
 heap size then that is leaving no buffer space for non-heap JVM memory,
 which seems wrong to me.
 
That could be the cause. I am now wondering how Mesos picks up the size and
sets the -Xmx parameter.
 The problem here is that cacheTable is more aggressive about grabbing
 large ByteBuffers during caching (which it later releases when it knows
 the exact size of the data)  There is a discussion here about trying to
 improve this: https://issues.apache.org/jira/browse/SPARK-2650
 
I am not sure this issue is the one causing the problem for us, as we have
approx 60 GB of cached data, whereas each executor has 17 GB of memory and
there are 15 of them, so 255 GB in total, which is way more than the 60 GB
of cached data.

Any suggestions on where to look for changing the Mesos settings in this
case would be appreciated.
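
For reference, the two caching paths being compared in this thread look
roughly like this in the Python API of that era (the column names and
parsing are made up for illustration; method names are from the 1.0/1.1-era
SQLContext API):

  from pyspark.sql import SQLContext

  sqlContext = SQLContext(sc)
  rows = sc.textFile("hdfs://test/test_flows/") \
           .map(lambda line: line.split(",")) \
           .map(lambda p: {"srcip": p[0], "dstip": p[1]})
  flows = sqlContext.inferSchema(rows)
  flows.registerAsTable("flows")
  sqlContext.cacheTable("flows")   # the columnar cache discussed above
  # versus caching the SchemaRDD directly:
  # flows.cache()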

- Gurvinder
 
 On Sun, Aug 3, 2014 at 11:35 PM, Gurvinder Singh
  gurvinder.si...@uninett.no wrote:
 
 On 08/03/2014 02:33 AM, Michael Armbrust wrote:
  I am not a mesos expert... but it sounds like there is some mismatch
  between the size that mesos is giving you and the maximum heap size of
  the executors (-Xmx).
 
  It seems that Mesos is giving the correct size to the Java process; it has
  the exact size set in the -Xms/-Xmx params. Do you know if I can somehow
  find out which class or thread inside the Spark JVM process is using how
  much memory, to see what makes it hit the memory limit in the cacheTable
  case but not in the cached-RDD case.
 
 - Gurvinder
 
  On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh
   gurvinder.si...@uninett.no wrote:
 
  It is not getting out of memory exception. I am using Mesos as
 cluster
  manager and it says when I use cacheTable that the container
 has used
  all of its allocated memory and thus kill it. I can see it in
 the logs
  on mesos-slave where executor runs. But on the web UI of spark
  application, it shows that is still have 4-5GB space left for
  caching/storing. So I am wondering how the memory is handled in
  cacheTable case. Does it reserve the memory storage and other
 parts run
  out of their memory. I also tries to change the
  spark.storage.memoryFraction but that did not help.
 
  - Gurvinder
  On 08/01/2014 08:42 AM, Michael Armbrust wrote:
   Are you getting OutOfMemoryExceptions with cacheTable? or
 what do you
   mean when you say you have to specify larger executor
 memory?  You
  might
   be running into SPARK-2650
   https://issues.apache.org/jira/browse/SPARK-2650.
  
   Is there something else you are trying to accomplish by
 setting the
   persistence level?  If you are looking for something like
  DISK_ONLY you
   can simulate that now using saveAsParquetFile and parquetFile.
  
   It is possible long term that we will automatically map the
  standard RDD
   persistence levels to these more efficient implementations
 in the
  future.
  
  
   On Thu, Jul 31, 2014 at 11:26 PM, Gurvinder Singh
    gurvinder.si...@uninett.no wrote:
  
    Thanks Michael for the explanation. Actually I tried
 caching the
  RDD and
   making table on it. But the performance for cacheTable
 was 3X
  better
   than caching RDD. Now I know why it is better. But is it
  possible to
   add the support for persistence level into cacheTable itself
  like RDD.
   May be it is not related, but on the same size of data set,
  when I use
   cacheTable I have to specify larger executor memory than
 I need in
   case of caching RDD. Although in the storage tab on
 status web
  UI, the
   memory footprint is almost same 58.3 GB in cacheTable and
  59.7GB in
   cache RDD. Is it possible that there is some memory leak or
  cacheTable
   works differently and thus require higher memory. The
  difference is
   5GB per executor for the dataset of size 122 GB.
  
   Thanks,
   Gurvinder
   On 08/01/2014 04:42 AM, Michael Armbrust wrote:
cacheTable uses a special columnar

Re: reading compress lzo files

2014-07-06 Thread Gurvinder Singh
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
 On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
 gurvinder.si...@uninett.no wrote:
 
 csv = sc.newAPIHadoopFile(opts.input,
     "com.hadoop.mapreduce.LzoTextInputFormat",
     "org.apache.hadoop.io.LongWritable",
     "org.apache.hadoop.io.Text").count()
 
 Does anyone know what the rough equivalent of this would be in the Scala
 API?
 
I am not sure; I haven't tested it using Scala. The
com.hadoop.mapreduce.LzoTextInputFormat class comes from this package:
https://github.com/twitter/hadoop-lzo

I have installed it from the Cloudera hadoop-lzo package, along with the
liblzo2-2 Debian package, on all of my workers. Make sure you have
hadoop-lzo.jar on the classpath for Spark.
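
One way to do that is via the extraClassPath settings in spark-defaults.conf
(the jar location below is just an example; point it at wherever your
hadoop-lzo.jar actually lives):

  spark.executor.extraClassPath  /usr/lib/hadoop/lib/hadoop-lzo.jar
  spark.driver.extraClassPath    /usr/lib/hadoop/lib/hadoop-lzo.jar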

- Gurvinder

 I am trying the following, but the first import yields an error on my
 |spark-ec2| cluster:
 
 |import com.hadoop.mapreduce.LzoTextInputFormat
 import org.apache.hadoop.io.LongWritable
 import org.apache.hadoop.io.Text
 
 sc.newAPIHadoopFile(s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data,
  LzoTextInputFormat, LongWritable, Text)
 |
 
 |scala> import com.hadoop.mapreduce.LzoTextInputFormat
 <console>:12: error: object hadoop is not a member of package com
import com.hadoop.mapreduce.LzoTextInputFormat
 |
 
 Nick
 
 ​




spark and mesos issue

2014-07-04 Thread Gurvinder Singh
We are getting this issue when running jobs with close to 1000 workers.
Spark is the github version and Mesos is 0.19.0.

ERROR storage.BlockManagerMasterActor: Got two different block manager
registrations on 201407031041-1227224054-5050-24004-0

Googling about it suggests that Mesos is starting slaves at the same time
and giving them the same ID. So maybe a bug in Mesos?

Thanks,
Gurvinder
On 07/04/2014 01:03 AM, Vinod Kone wrote:
 correct url:
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
 
 
 On Thu, Jul 3, 2014 at 1:40 PM, Vinod Kone vinodk...@gmail.com wrote:
 
 Hi,
 
 We are planning to release 0.19.1 (likely next week) which will be a
 bug fix release. Specifically, these are the fixes that we are
 planning to cherry pick.
 
 
 https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
 
 If there are other critical fixes that need to be backported to
 0.19.1 please reply here as soon as possible.
 
 Thanks,
 
 



Re: issue with running example code

2014-07-04 Thread Gurvinder Singh
In the end it turned out that the issue was caused by a config setting
in spark-defaults.conf. After removing this setting

spark.files.userClassPathFirst   true

things are back to normal. Just reporting in case someone else runs into
the same issue.

- Gurvinder
On 07/03/2014 06:49 PM, Gurvinder Singh wrote:
 Just to provide more information on this issue. It seems that SPARK_HOME
 environment variable is causing the issue. If I unset the variable in
 spark-class script and run in the local mode my code runs fine without
 the exception. But if I run with SPARK_HOME, I get the exception
 mentioned below. I could run without setting SPARK_HOME but it is not
 possible to run in the cluster settings, as this tells where is spark on
 worker nodes. E.g. we are using Mesos as cluster manager, thus when set
 master to mesos we get the exception as SPARK_HOME is not set.
 
 Just to mention again the pyspark works fine as well as spark-shell,
 only when we are running compiled jar it seems SPARK_HOME causes some
 java run time issues that we get class cast exception.
 
 Thanks,
 Gurvinder
 On 07/01/2014 09:28 AM, Gurvinder Singh wrote:
 Hi,

 I am having issue in running scala example code. I have tested and able
 to run successfully python example code, but when I run the scala code I
 get this error

 java.lang.ClassCastException: cannot assign instance of
 org.apache.spark.examples.SparkPi$$anonfun$1 to field
 org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
 org.apache.spark.rdd.MappedRDD

 I have compiled spark from the github directly and running with the
 command as

 spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
 --class org.apache.spark.examples.SparkPi 5 --jars
 /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar

 Any suggestions will be helpful.

 Thanks,
 Gurvinder

 



Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: now Spark is able to read the lzo file, recognizes
that it has an index, and starts multiple map tasks. You need to use the
following function instead of textFile:

csv = sc.newAPIHadoopFile(opts.input,
    "com.hadoop.mapreduce.LzoTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text").count()
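
The index itself comes from the indexer that ships with hadoop-lzo; the
invocation looks roughly like this (the jar path is just an example):

  hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
      com.hadoop.compression.lzo.DistributedLzoIndexer hdfs://test/test_flows/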

- Gurvinder
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
 Hi all,
 
 I am trying to read the lzo files. It seems spark recognizes that the
 input file is compressed and got the decompressor as
 
 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded  initialized
 native-lzo library [hadoop-lzo rev
 ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
 deprecated. Instead, use io.native.lib.available
 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
 [.lzo]
 
 But it has two issues
 
 1. It just stuck here without doing anything waited for 15 min for a
 small files.
 2. I used the hadoop-lzo to create the index so that spark can split
 the input to multiple maps but spark creates only one mapper.
 
 I am using python with reading using sc.textFile(). Spark version is
 of the git master.
 
 Regards,
 Gurvinder
 



Re: issue with running example code

2014-07-03 Thread Gurvinder Singh
Just to provide more information on this issue: it seems that the SPARK_HOME
environment variable is causing it. If I unset the variable in the
spark-class script and run in local mode, my code runs fine without the
exception. But if I run with SPARK_HOME set, I get the exception mentioned
below. I could run without setting SPARK_HOME, but that is not possible in a
cluster setting, as this variable tells the workers where Spark is located.
E.g. we are using Mesos as the cluster manager, so when the master is set to
mesos we get an exception if SPARK_HOME is not set.

Just to mention again: pyspark works fine, as does spark-shell; only when we
run the compiled jar does SPARK_HOME seem to cause some Java runtime issue
that results in the class cast exception.

Thanks,
Gurvinder
On 07/01/2014 09:28 AM, Gurvinder Singh wrote:
 Hi,
 
 I am having issue in running scala example code. I have tested and able
 to run successfully python example code, but when I run the scala code I
 get this error
 
 java.lang.ClassCastException: cannot assign instance of
 org.apache.spark.examples.SparkPi$$anonfun$1 to field
 org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
 org.apache.spark.rdd.MappedRDD
 
 I have compiled spark from the github directly and running with the
 command as
 
 spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
 --class org.apache.spark.examples.SparkPi 5 --jars
 /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
 
 Any suggestions will be helpful.
 
 Thanks,
 Gurvinder
 



issue with running example code

2014-07-01 Thread Gurvinder Singh
Hi,

I am having an issue running the Scala example code. I have tested and am
able to run the Python example code successfully, but when I run the Scala
code I get this error:

java.lang.ClassCastException: cannot assign instance of
org.apache.spark.examples.SparkPi$$anonfun$1 to field
org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
org.apache.spark.rdd.MappedRDD

I have compiled Spark directly from github and am running it with the
following command:

spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
--class org.apache.spark.examples.SparkPi 5 --jars
/usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
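
For reference, spark-submit normally expects its options before the
application jar and the application arguments after it, so the canonical
form of the above would be roughly:

  spark-submit --class org.apache.spark.examples.SparkPi \
      --jars /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar \
      /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar 5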

Any suggestions will be helpful.

Thanks,
Gurvinder