Memory allocation

2020-04-17 Thread Pat Ferrel
I have used Spark for several years and realize from recent chatter on this list that I don’t really understand how it uses memory. Specifically: are spark.executor.memory and spark.driver.memory taken from the JVM heap? When does Spark take memory from the JVM heap, and when does it use memory outside the JVM heap?
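A minimal sketch of where those knobs live (not from the thread; property names assume Spark 2.x): spark.driver.memory and spark.executor.memory size the driver and executor JVM heaps, while off-heap memory is separate and must be enabled explicitly.

    import org.apache.spark.sql.SparkSession

    // Sketch only: heap vs. off-heap settings.
    val spark = SparkSession.builder()
      .appName("memory-config-sketch")
      .config("spark.driver.memory", "4g")             // driver JVM heap; only effective before the driver JVM starts
      .config("spark.executor.memory", "8g")           // executor JVM heap
      .config("spark.memory.offHeap.enabled", "true")  // opt in to off-heap execution/storage memory
      .config("spark.memory.offHeap.size", "2g")       // allocated outside the JVM heap
      .getOrCreate()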

Re: IDE suitable for Spark

2020-04-07 Thread Pat Ferrel
IntelliJ Scala works well when debugging master=local. Has anyone used it for remote/cluster debugging? I’ve heard it is possible... From: Luiz Camargo Reply: Luiz Camargo Date: April 7, 2020 at 10:26:35 AM To: Dennis Suhari Cc: yeikel valdes , zahidr1...@gmail.com , user@spark.apache.org S
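One commonly mentioned approach (an assumption here, not confirmed in the thread) is to open a JDWP port on the executors and attach IntelliJ's "Remote JVM Debug" run configuration to it; port 5005 is an arbitrary choice.

    import org.apache.spark.sql.SparkSession

    // Sketch only: let a remote debugger attach to each executor JVM.
    // suspend=n means executors start without waiting for the debugger.
    val spark = SparkSession.builder()
      .appName("remote-debug-sketch")
      .config("spark.executor.extraJavaOptions",
        "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
      .getOrCreate()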

Re: k8s orchestrating Spark service

2019-07-03 Thread Pat Ferrel
todays question. From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 5:14:05 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service > We’d like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really do

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
run our Driver and Executors considering that the Driver is part of the Server process? Maybe we are talking past each other with some mistaken assumptions (on my part perhaps). From: Pat Ferrel Reply: Pat Ferrel Date: July 1, 2019 at 4:57:20 PM To: user@spark.apache.org , Matt Cheah

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
anyone have something they like? From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 4:45:55 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service Sorry, I don’t quite follow – why use the Spark standalone cluster as an in-between layer when one can just

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
services including Spark. The rest work, we are asking if anyone has seen a good starting point for adding Spark as a k8s managed service. From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 3:26:20 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service

k8s orchestrating Spark service

2019-06-30 Thread Pat Ferrel
We're trying to set up a system that includes Spark. The rest of the services have good Docker containers and Helm charts to start from. Spark on the other hand is proving difficult. We forked a container and have tried to create our own chart but are having several problems with this. So back to

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster. The danger increases with the semver change and this one is not just a build #. In other words, 2.4 is considered to be a fairly major change from 2.3. Not much else can be said. From: Nicolas Paris Reply: user@spark.apach

Fwd: Spark Architecture, Drivers, & Executors

2019-05-17 Thread Pat Ferrel
In order to create an application that executes code on Spark we have a long-lived process. It periodically runs jobs programmatically on a Spark cluster, meaning it does not use spark-submit. The jobs it executes have varying memory requirements, so we want to have the Spark Driver run in the c
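A rough sketch of that pattern (class names and the master URL are placeholders, not from the thread): a long-lived process builds a session per job and can vary executor memory per job, but the heap of the already-running driver JVM is fixed, which is one motivation for wanting the driver launched out in the cluster.

    import org.apache.spark.sql.SparkSession

    // Sketch: one job submission from a long-lived server process.
    def runJob(executorMem: String): Unit = {
      val spark = SparkSession.builder()
        .appName(s"programmatic-job-$executorMem")
        .master("spark://master-address:7077")        // placeholder master URL
        .config("spark.executor.memory", executorMem) // chosen per job
        .getOrCreate()
      try {
        spark.range(0L, 1000000L).selectExpr("sum(id)").show()
      } finally {
        spark.stop()
      }
    }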

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Pat Ferrel
Streams have no end until watermarked or closed. Joins need bounded datasets, et voila. Something tells me you should consider the streaming nature of your data and whether your joins need to use increments/snippets of infinite streams or to re-join the entire contents of the streams accumulated at
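For the nested-attribute part of the question, the workaround usually suggested (an assumption here; field names are invented) is to promote the nested timestamp to a top-level column before declaring the watermark, since withWatermark takes a plain column name.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Sketch: a rate source stands in for the real stream; in real code the
    // withColumn line would reference col("payload.eventTime") or similar.
    val spark = SparkSession.builder().appName("watermark-sketch").getOrCreate()
    val events = spark.readStream.format("rate").load()
    val withTime = events
      .withColumn("eventTime", col("timestamp"))
      .withWatermark("eventTime", "10 minutes")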

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Pat Ferrel
@Riccardo Spark does not do the DL learning part of the pipeline (afaik) so it is limited to data ingestion and transforms (ETL). It therefore is optional and other ETL options might be better for you. Most of the technologies @Gourav mentions have their own scaling based on their own compute eng

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
Thanks, are you referring to https://github.com/spark-jobserver/spark-jobserver or the undocumented REST job server included in Spark? From: Jason Nerothin Reply: Jason Nerothin Date: March 28, 2019 at 2:53:05 PM To: Pat Ferrel Cc: Felix Cheung , Marcelo Vanzin , user Subject: Re

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
;-) Great idea. Can you suggest a project? Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only launches trivially in test apps since most uses are as a lib. From: Felix Cheung Reply: Felix Cheung Date: March 28, 2019 at 9:42:31 AM To: Pat Ferrel , Marcelo Vanzin Cc

Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
f, but for standalone mode you might be able to use this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala Lastly, you can always check where Spark processes run by executing ps on the machine, i.e. `ps aux | grep java`. Best, Jianneng *Fro

Re: spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
Reply: Marcelo Vanzin Date: March 26, 2019 at 1:59:36 PM To: Pat Ferrel Cc: user Subject: Re: spark.submit.deployMode: cluster If you're not using spark-submit, then that option does nothing. If by "context creation API" you mean "new SparkContext()" or a

spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
I have a server that starts a Spark job using the context creation API. It DOES NOT use spark-submit. I set spark.submit.deployMode = “cluster”. In the GUI I see 2 workers with 2 executors. The link for running application “name” goes back to my server, the machine that launched the job. This is
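When a server process wants cluster deploy mode without shelling out to the spark-submit script by hand, one option is the SparkLauncher API; the sketch below uses placeholder paths and class names. Setting spark.submit.deployMode on a context created with new SparkContext() has no effect, as noted elsewhere in the thread.

    import org.apache.spark.launcher.SparkLauncher

    // Sketch: programmatic launch in cluster mode.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/assembly.jar")   // placeholder
      .setMainClass("com.example.MyJob")         // placeholder
      .setMaster("spark://master-address:7077")
      .setDeployMode("cluster")
      .setConf("spark.executor.memory", "8g")
      .startApplication()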

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
:07 AM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. All the docs I see appear to always describe needing to use spark-submit for cluster mode -- it's not even

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
only guessing at that). Further; if we don’t use spark-submit we can’t use deployMode = cluster ??? From: Akhil Das Reply: Akhil Das Date: March 24, 2019 at 7:45:07 PM To: Pat Ferrel Cc: user Subject: Re: Where does the Driver run? There's also a driver ui (usually available on port

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
60g BTW I would expect this to create one Executor, one Driver, and the Master on 2 Workers. From: Andrew Melo Reply: Andrew Melo Date: March 24, 2019 at 12:46:35 PM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
60g From: Andrew Melo Reply: Andrew Melo Date: March 24, 2019 at 12:46:35 PM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel wrote: > Thanks, I have seen this many times in my research. Paraphrasing do

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
Reply: Akhil Das Date: March 23, 2019 at 9:26:50 PM To: Pat Ferrel Cc: user Subject: Re: Where does the Driver run? If you are starting your "my-app" on your local machine, that's where the driver is running. [image: image.png] Hope this helps. <https://spark.apache.

Where does the Driver run?

2019-03-23 Thread Pat Ferrel
I have researched this for a significant amount of time and find answers that seem to be for a slightly different question than mine. The Spark 2.3.3 cluster is running fine. I see the GUI on “http://master-address:8080”, there are 2 idle workers, as configured. I have a Scala application that

Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
a:163) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.Def

Spark with Kubernetes connecting to pod id, not address

2019-02-12 Thread Pat Ferrel
From: Pat Ferrel Reply: Pat Ferrel Date: February 12, 2019 at 5:40:41 PM To: user@spark.apache.org Subject:  Spark with Kubernetes connecting to pod id, not address We have a k8s deployment of several services including Apache Spark. All services seem to be operational. Our application

Give a task more resources

2017-01-11 Thread Pat Ferrel
I have a task that would benefit from more cores but the standalone scheduler launches it when only a subset are available. I’d rather use all cluster cores on this task. Is there a way to tell the scheduler to finish everything before allocating resources to a task? Like "finish everything els
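Not a direct answer to the scheduling-order question above, but the knob that reserves more cores for a single task is spark.task.cpus (the value here is illustrative):

    import org.apache.spark.sql.SparkSession

    // Sketch: each task claims 4 cores, so fewer tasks run concurrently
    // on an executor and a heavy task gets more of the machine.
    val spark = SparkSession.builder()
      .appName("task-cpus-sketch")
      .config("spark.task.cpus", "4")
      .getOrCreate()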

Memory allocation for Broadcast values

2015-12-20 Thread Pat Ferrel
I have a large Map that is assembled in the driver and broadcast to each node. My question is how best to allocate memory for this. The Driver has to have enough memory for the Maps, but only one copy is serialized to each node. What type of memory should I size to match the Maps? Is the broadc
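A minimal broadcast sketch (the Map here is a stand-in for the large one): the original lives in the driver heap, so spark.driver.memory must cover it, and each executor holds one deserialized copy in its heap rather than one copy per task.

    import org.apache.spark.sql.SparkSession

    // Sketch: broadcast a driver-side Map and read it in tasks.
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    val bigMap: Map[String, String] = Map("a" -> "1", "b" -> "2") // stand-in
    val bcast = sc.broadcast(bigMap)

    val looked = sc.parallelize(Seq("a", "b", "c"))
      .map(k => bcast.value.getOrElse(k, "missing"))
      .collect()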

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
> On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote: >> The “Any” is required by the code it is being passed to, which is the >> Elasticsearch Spark index writing code. The values are actually RDD[(String, >> Map[String, String])] > > (Is it frequently a big big map by a

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
the driver is trivial. Or maybe adapt some version of the implementation of take() to be an optimized, smarter isEmpty(). Neither seemed worth the overhead at the time, but this could be a case against that, if it turns out somehow to be serialization time. On Wed, Dec 9, 2015 at 5:55 PM

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
's isEmpty? and not an upstream stage? isEmpty is definitely the action here. It doesn't make sense that take(1) is so much faster since it's the "same thing". On Wed, Dec 9, 2015 at 5:11 PM, Pat Ferrel wrote: > Sure, I thought this might be a known issue. > > I

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but, do you have any more info? how slow is slow and what is slow? On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel wrote: > I’m ge

RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I’m getting *huge* execution times on a moderate sized dataset during the RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation. I’m using Spark 1.5.1 and from researching I would expect this calculation to be linearly proportional to the number of partitions as a
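For reference, a sketch of what the check amounts to: isEmpty() has to materialize at least one element of the first non-empty partition, so when it looks slow it is usually paying for the upstream stages that produce that element.

    import org.apache.spark.sql.SparkSession

    // Sketch: the two checks below do essentially the same work.
    val spark = SparkSession.builder().appName("isempty-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10)

    val viaIsEmpty = rdd.isEmpty()
    val viaTake    = rdd.take(1).isEmpty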

rdd.saveAsSequenceFile(path)

2015-06-27 Thread Pat Ferrel
Our project is having a hard time following what we are supposed to do to migrate this function from Spark 1.2 to 1.3. /** * Dump matrix as computed Mahout's DRM into specified (HD)FS path * @param path */ def dfsWrite(path: String) = { val ktag = implicitly[ClassTag[K]] //va

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Argh, I looked and there really isn’t that much data yet. There will be thousands but starting small. I bet this is just a total data size not requiring all workers thing—sorry, nevermind. On Apr 23, 2015, at 10:30 AM, Pat Ferrel wrote: They are in HDFS so available on all workers On Apr

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
They are in HDFS so available on all workers On Apr 23, 2015, at 10:29 AM, Pat Ferrel wrote: Physically? Not sure, they were written using the nano-batch rdds in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that effect all derived rdds? If so is there

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
, 2015, at 10:23 AM, Sean Owen wrote: Where are the file splits? meaning is it possible they were also (only) available on one node and that was also your driver? On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel wrote: > Sure > >var columns = mc.textFile(source).map { line => line.spl

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
tokens => tokens(rowIDColumn) -> tokens(columnIDPosition) } interactions.cache() On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele wrote: Will you be able to paste code here? On 23 April 2015 at 22:21, Pat Ferrel mailto:p...@occamsmachete.com>> wrote: Using Spark streaming to create

Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Using Spark streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-x’ files. When reading the nano-batch files and doing a distributed calculation my tasks run only on the machine where it was launched. I’m launching in “yarn-client” mode. The r
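One way to force a small input to spread across the cluster (paths and counts are placeholders) is to ask for more partitions up front or repartition afterwards:

    import org.apache.spark.sql.SparkSession

    // Sketch: widen a small input across executors.
    val spark = SparkSession.builder().appName("spread-sketch").getOrCreate()
    val sc = spark.sparkContext

    val lines  = sc.textFile("hdfs:///data/part-*", minPartitions = 64)
    val spread = lines.repartition(64)   // shuffle to distribute the work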

TaskResultLost

2015-04-14 Thread Pat Ferrel
Running on Spark 1.1.1 Hadoop 2.4 with Yarn AWS dedicated cluster (non-EMR) Is this in our code or config? I’ve never run into a TaskResultLost, not sure what can cause that. TaskResultLost (result lost from block manager) nivea.m [11:01 AM] collect at AtA.

Re: Need Advice about reading lots of text files

2015-03-17 Thread Pat Ferrel
o something like: sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... :: Nil).flatMap(new ReadLinesSafe(_)) You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 <https://gist.github.com/marmbru

Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
" ... :: Nil).flatMap(new ReadLinesSafe(_)) You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 <https://gist.github.com/marmbrus/15e72f7bc22337cf6653> Michael On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel mailto:p...@occam

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
a jvm per task). On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote: Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote: We have many text files that we nee

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel wrote: We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large

Need Advice about reading lots of text files

2015-03-13 Thread Pat Ferrel
We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on hdfs. The question is: what is the most performant way to read them? Should they be bro
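A sketch of the comma-delimited approach described above (file names are placeholders): textFile accepts a single comma-separated string of paths, each of which may itself be a directory or glob.

    import org.apache.spark.sql.SparkSession

    // Sketch: join the file list into one string for textFile.
    val spark = SparkSession.builder().appName("filelist-sketch").getOrCreate()
    val sc = spark.sparkContext

    val files = Seq("hdfs:///in/part-00000", "hdfs:///in/part-00001", "hdfs:///in/part-00002")
    val rdd   = sc.textFile(files.mkString(","))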

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Re: Upgrade to Spark 1.2.1 using Guava

2015-03-02 Thread Pat Ferrel
this problem with mahout snapshot and spark 1.2.0 not really trying to figure out why that was a problem, since there were already too many moving parts in my app. Obviously there is a classpath issue somewhere. /Erlend On 27 Feb 2015 22:30, "Pat Ferrel" mailto:p...@occamsmachete.c

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-28 Thread Pat Ferrel
where the serializer is and where the user class is. At the top you said Pat that you didn't try this, but why not? On Fri, Feb 27, 2015 at 10:11 PM, Pat Ferrel wrote: > I’ll try to find a Jira for it. I hope a fix is in 1.3 > > > On Feb 27, 2015, at 1:59 PM, Pat Ferrel wrote: >

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I’ll try to find a Jira for it. I hope a fix is in 1.3 On Feb 27, 2015, at 1:59 PM, Pat Ferrel wrote: Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel wrote: I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel wrote: I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin wrote: On Fri, Feb 27

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin wrote: On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel wrote: > I changed in the spark mas

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I changed in the spark master conf, which is also the only worker. I added a path to the jar that has guava in it. Still can’t find the class. Trying Erland’s idea next. On Feb 27, 2015, at 1:35 PM, Marcelo Vanzin wrote: On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel wrote: > @Marcelo do

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
wrote: Hi. I have had a simliar issue. I had to pull the JavaSerializer source into my own project, just so I got the classloading of this class under control. This must be a class loader issue with spark. -E On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel mailto:p...@occamsmachete.com>> wr

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
s use Guava). On Wed, Feb 25, 2015 at 5:17 PM, Pat Ferrel wrote: > The root Spark pom has guava set at a certain version number. It’s very hard > to read the shading xml. Someone suggested that I try using > userClassPathFirst but that sounds too heavy handed since I don’t really > car

Upgrade to Spark 1.2.1 using Guava

2015-02-25 Thread Pat Ferrel
there a recommended way to use it in a job? On Feb 25, 2015, at 3:50 PM, Pat Ferrel wrote: I pass in my own dependencies jar with the class in it when creating the context. I’ve verified that the jar is in the list and checked in the jar to find guava. This should work, right so I must have

Re: upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
caused by Spark using shaded Guava jar ? Cheers On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote: Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With Spark 1

upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With Spark 1.1.0 I simply registered the class and its serializer with kryo like this: kryo.register(classOf[com.google.common.collect.HashBiMap
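For context, a sketch of Kryo registration via a registrator class (package names are illustrative; the thread's code also supplies a custom serializer, which is truncated above). The shading question is really about which Guava the registered class name resolves to at runtime.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Sketch: register the class through a KryoRegistrator.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[com.google.common.collect.HashBiMap[_, _]])
      }
    }

    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator") // fully-qualified name of the class above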

Re: Row similarities

2015-01-18 Thread Pat Ferrel
algebra first? On Jan 17, 2015, at 6:27 PM, Reza Zadeh wrote: We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434 <https://issues.apache.org/jira/browse/SPARK-3434> On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferr

Re: Row similarities

2015-01-17 Thread Pat Ferrel
t and use rowSimilarities. > > > On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <mailto:p...@occamsmachete.com>> wrote: > BTW it looks like row and column similarities (cosine based) are coming to > MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the

Re: Row similarities

2015-01-17 Thread Pat Ferrel
ld get these together. On Jan 17, 2015, at 9:37 AM, Andrew Musselman wrote: Excellent, thanks Pat. On Jan 17, 2015, at 9:27 AM, Pat Ferrel mailto:p...@occamsmachete.com>> wrote: > Mahout’s Spark implementation of rowsimilarity is in the Scala > SimilarityAnalysis class. It actual

Re: Row similarities

2015-01-17 Thread Pat Ferrel
Mahout’s Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [A’A] for columns or [AA’] for rows first then calculates the distance (LLR) for non-zero elements. This is a major

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Pat Ferrel
Actually the spark-itemsimilarity job and related code in the Spark module of Mahout creates all-pairs similarity too. It’s designed to use with a search engine, which provides the query part of the recommender. Integrate the two and you have a near realtime scalable item-based/cooccurrence coll

Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of pathnames to treat as one input. The files are part-x files and they are fairly small. The job seems to take a long time to complete considering the size of the total data (150m) and only runs on the master machine. T
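For many small files, one commonly used sketch (paths are placeholders) is wholeTextFiles, which yields one (path, contents) record per file, followed by a coalesce so a small total input does not fan out into thousands of tiny tasks:

    import org.apache.spark.sql.SparkSession

    // Sketch: read many small files, then shrink the partition count.
    val spark = SparkSession.builder().appName("small-files-sketch").getOrCreate()
    val sc = spark.sparkContext

    val perFile = sc.wholeTextFiles("hdfs:///in/part-*")   // RDD[(path, fileContents)]
    val lines   = perFile.flatMap { case (_, text) => text.split("\n") }.coalesce(8)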

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
OK hacking the start-slave.sh did it On Nov 18, 2014, at 4:12 PM, Pat Ferrel wrote: This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel wrote: Looks like I can do this by not using start

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel wrote: Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores n' to the

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores n' to the master? No config/env way? On Nov 18, 2014, at 3:14 PM, Pat Ferrel wrote: I see the default and max cores settings but these seem to control total cores per cl

Cores on Master

2014-11-18 Thread Pat Ferrel
I see the default and max cores settings but these seem to control total cores per cluster. My cobbled together home cluster needs the Master to not use all its cores or it may lock up (it does other things). Is there a way to control max cores used for a particular cluster machine in standalon

Re: Class not found

2014-10-21 Thread Pat Ferrel
maven cache is laid out differently but it does work on Linux and BSD/mac. Still looks like a hack to me. On Oct 21, 2014, at 1:28 PM, Pat Ferrel wrote: Doesn’t this seem like a dangerous error prone hack? It will build different bits on different machines. It doesn’t even work on my linux

Re: Class not found

2014-10-21 Thread Pat Ferrel
different artifacts to support any option that changes the linkage info/class naming? On Oct 21, 2014, at 12:16 PM, Pat Ferrel wrote: Not sure if this has been clearly explained here but since I took a day to track it down… Several people have experienced a class not found error on Spark when the

Class not found

2014-10-21 Thread Pat Ferrel
Not sure if this has been clearly explained here but since I took a day to track it down… Several people have experienced a class not found error on Spark when the class referenced is supposed to be in the Spark jars. One thing that can cause this is if you are building Spark for your cluster

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Pat Ferrel
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the problem but anyway... I get a NoClassDefFoundError for RandomGenerator when running a driver from the CLI. But only when using a named master, even a standalone master. If I run using master = local[4] the job execute

Re: local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
on of Apache Spark on each node of the cluster? And I am not only asking about current project version (1.0.0, 1.1.0 etc.) but also about package type (hadoop 1.x, hadoop 2.x). On Fri, Oct 17, 2014 at 12:35 AM, Pat Ferrel wrote: I’ve read several discussions of the error here and so have wiped all cl

local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
I’ve read several discussions of the error here and so have wiped all cluster machines and copied the master’s spark build to the rest of the cluster. I’ve built my job on the master using the correct Spark version as a dependency and even build that version of Spark. I still get the incompatibl

Re: Running new code on a Spark Cluster

2014-06-26 Thread Pat Ferrel
serialized. -Original Message- From: Pat Ferrel [mailto:p...@occamsmachete.com] Sent: Thursday, June 26, 2014 12:13 PM To: user@spark.apache.org Subject: Running new code on a Spark Cluster I've created a CLI driver for a Spark version of a Mahout job called "item similarity"

Running new code on a Spark Cluster

2014-06-26 Thread Pat Ferrel
I’ve created a CLI driver for a Spark version of a Mahout job called "item similarity" with several tests that all work fine on local[4] Spark standalone. The code even reads and writes to clustered HDFS. But switching to clustered Spark has a problem that seems tied to a broadcast and/or serial

Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
ses) > you'll see it mentioned there in the docs. > > http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html > > But that's not obvious. > > Nick > > On Monday, April 28, 2014, Pat Ferrel wrote: > Perfect. > > BTW just so I

Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
ere in the docs. > > http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html > > But that's not obvious. > > Nick > > On Monday, April 28, 2014, Pat Ferrel wrote: > Perfect. > > BTW just so I know where to look next time, was that

Re: File list read into single RDD

2014-04-28 Thread Pat Ferrel
th/to/file1,/path/to/file2') So once you have your list of files, concatenate their paths like that and pass the single string to textFile(). Nick On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel wrote: sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard.

File list read into single RDD

2014-04-28 Thread Pat Ferrel
sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is
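A sketch of the walk-then-read approach (root path and regex are placeholders): list the tree with the Hadoop FileSystem client, keep the paths that match, and hand textFile one comma-separated string so everything lands in a single RDD.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession
    import scala.collection.mutable.ArrayBuffer

    // Sketch: recursive listing + regex filter + single textFile call.
    val spark = SparkSession.builder().appName("dir-walk-sketch").getOrCreate()
    val sc = spark.sparkContext

    val fs      = FileSystem.get(sc.hadoopConfiguration)
    val it      = fs.listFiles(new Path("hdfs:///data/root"), true) // recursive
    val matched = ArrayBuffer[String]()
    while (it.hasNext) {
      val p = it.next().getPath.toString
      if (p.matches(""".*part-\d+.*""")) matched += p
    }

    val rdd = sc.textFile(matched.mkString(","))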

Read from list of files in parallel

2014-04-28 Thread Pat Ferrel
Warning noob question: The sc.textFile(URI) method seems to support reading from files in parallel but you have to supply some wildcard URI, which greatly limits how the storage is structured. Is there a simple way to pass in a URI list or is it an exercise left for the student?