Re: writing to local files on a worker

2018-11-15 Thread Steve Lewis
Exception{ LineNumberReader rdr = new LineNumberReader(new FileReader(f)); StringBuilder sb = new StringBuilder(); String line = rdr.readLine(); while(line != null) { sb.append(line); sb.append("\n"); line = rdr.readLine();
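For reference, a completed version of that snippet as a sketch; the wrapping class, the method signature, and the finally block are my additions, the loop body is the one quoted above.

    import java.io.*;

    public class LocalFileUtil {
        // read the worker-local output file produced by the external program into a String
        public static String readLocalFile(File f) throws IOException {
            LineNumberReader rdr = new LineNumberReader(new FileReader(f));
            try {
                StringBuilder sb = new StringBuilder();
                String line = rdr.readLine();
                while (line != null) {
                    sb.append(line);
                    sb.append("\n");
                    line = rdr.readLine();
                }
                return sb.toString();
            } finally {
                rdr.close();
            }
        }
    }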

Re: writing to local files on a worker

2018-11-12 Thread Steve Lewis
are within a map step. > > Generally you should not call external applications from Spark. > > > On 11.11.2018 at 23:13, Steve Lewis wrote: > > > > I have a problem where a critical step needs to be performed by a third > party c++ application. I can send or install

writing to local files on a worker

2018-11-11 Thread Steve Lewis
I have a problem where a critical step needs to be performed by a third party c++ application. I can send or install this program on the worker nodes. I can construct a function holding all the data this program needs to process. The problem is that the program is designed to read and write from
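A sketch of the pattern a reply would likely suggest, under assumptions (the binary path /usr/local/bin/thirdparty and the one-record-per-file protocol are hypothetical): write each record to a temp file on the worker's local disk, run the program with ProcessBuilder, then read its output file back.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import java.io.File;
    import java.nio.file.Files;

    public class ExternalStep {
        // hypothetical path - the c++ binary must already be installed on every worker
        private static final String BINARY = "/usr/local/bin/thirdparty";

        public static JavaRDD<String> runExternal(JavaRDD<String> input) {
            return input.map(new Function<String, String>() {
                public String call(String record) throws Exception {
                    File in = File.createTempFile("step", ".in");    // worker-local scratch files
                    File out = File.createTempFile("step", ".out");
                    try {
                        Files.write(in.toPath(), record.getBytes("UTF-8"));
                        Process p = new ProcessBuilder(BINARY, in.getAbsolutePath(), out.getAbsolutePath())
                                .redirectErrorStream(true)
                                .start();
                        if (p.waitFor() != 0)
                            throw new IllegalStateException("external program failed for one record");
                        return new String(Files.readAllBytes(out.toPath()), "UTF-8");
                    } finally {
                        in.delete();
                        out.delete();
                    }
                }
            });
        }
    }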

No space left on device

2018-08-20 Thread Steve Lewis
We are trying to run a job that has previously run on Spark 1.3 on a different cluster. The job was converted to Spark 2.3 and this is a new cluster. The job dies after completing about a half dozen stages with java.io.IOException: No space left on device. It appears that the nodes are

How do I get a spark instance to use my log4j properties

2016-04-12 Thread Steve Lewis
Ok I am stymied. I have tried everything I can think of to get spark to use my own version of log4j.properties In the launcher code - I launch a local instance from a Java application I say -Dlog4j.configuration=conf/log4j.properties where conf/log4j.properties is user.dir - no luck Spark
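For what it is worth, the two settings that usually matter here, as a hedged sketch (the path is a placeholder): log4j.configuration wants a file: URL with an absolute path, and the executors need it passed separately via spark.executor.extraJavaOptions.

    import org.apache.spark.SparkConf;

    public class LauncherLogging {
        // placeholder path - must be absolute and, for the executor side, present on every worker
        private static final String LOG4J = "file:/full/path/to/conf/log4j.properties";

        public static SparkConf withLogging(SparkConf conf) {
            // driver/launcher side: log4j wants a URL and must see it before it initializes
            System.setProperty("log4j.configuration", LOG4J);
            // executor side: passed as a JVM option to each executor
            return conf.set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=" + LOG4J);
        }
    }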

I need some help making datasets with known columns from a JavaBean

2016-03-01 Thread Steve Lewis
I asked a similar question a day or so ago but this is a much more concrete example showing the difficulty I am running into I am trying to use DataSets. I have an object which I want to encode with its fields as columns. The object is a well behaved Java Bean. However one field is an object (or
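A minimal sketch of the bean-encoder route, assuming Spark 1.6+ and that the class (and any nested field types) follow JavaBean conventions; MyBean below is a stand-in for the real class.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SQLContext;
    import java.io.Serializable;
    import java.util.List;

    public class BeanDatasetExample {
        // a stand-in for the thread's bean; nested bean-typed fields are also supported
        public static class MyBean implements Serializable {
            private String name;
            private int score;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public int getScore() { return score; }
            public void setScore(int score) { this.score = score; }
        }

        public static Dataset<MyBean> asDataset(SQLContext sqlCtx, List<MyBean> beans) {
            // Encoders.bean maps each getter to a column, unlike Encoders.kryo
            // which stores the whole object in a single binary column
            Encoder<MyBean> enc = Encoders.bean(MyBean.class);
            Dataset<MyBean> ds = sqlCtx.createDataset(beans, enc);
            ds.printSchema();   // should show one column per bean property
            return ds;
        }
    }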

DataSet Evidence

2016-02-29 Thread Steve Lewis
I have a relatively complex Java object that I would like to use in a dataset if I say Encoder<MyType> evidence = Encoders.kryo(MyType.class); JavaRDD<MyType> rddMyType= generateRDD(); // some code Dataset<MyType> datasetMyType= sqlCtx.createDataset( rddMyType.rdd(), evidence); I get one column - the whole object

Datasets and columns

2016-01-25 Thread Steve Lewis
assume I have the following code SparkConf sparkConf = new SparkConf(); JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf); JavaRDD<MyType> rddMyType= generateRDD(); // some code Encoder<MyType> evidence = Encoders.kryo(MyType.class); Dataset<MyType> datasetMyType= sqlCtx.createDataset( rddMyType.rdd(),

Re: Datasets and columns

2016-01-25 Thread Steve Lewis
columns. Try running: datasetMyType.printSchema() > > On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis <lordjoe2...@gmail.com> > wrote: > >> assume I have the following code >> >> SparkConf sparkConf = new SparkConf(); >> >> JavaSparkContext sqlCtx= new JavaSp

Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
ou can also do this using lambda functions if you want though: > > ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2) > => > ... > } > > > On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com> > wrote: > >> We

I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
We have been working a large search problem which we have been solving in the following ways. We have two sets of objects, say children and schools. The object is to find the closest school to each child. There is a distance measure but it is relatively expensive and would be very costly to apply
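A sketch of the coarse-binning idea that the replies discuss, assuming Java 8 lambdas; Child, School, the region key, and distance() are hypothetical stand-ins. The point is that the expensive measure only runs on pairs that share a region.

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;
    import java.io.Serializable;

    public class ClosestSchool {
        // hypothetical stand-ins for the thread's objects
        public static class Child implements Serializable { public String id; public String region; public double x, y; }
        public static class School implements Serializable { public String id; public String region; public double x, y; }

        // placeholder for the expensive distance measure
        static double distance(Child c, School s) { return Math.hypot(c.x - s.x, c.y - s.y); }

        // join children and schools only within the same region, then keep the closest
        // school per child id - the expensive measure never runs across regions
        public static JavaPairRDD<String, Tuple2<School, Double>> closest(
                JavaPairRDD<String, Child> childrenByRegion,
                JavaPairRDD<String, School> schoolsByRegion) {
            return childrenByRegion.join(schoolsByRegion)     // (region, (child, school))
                    .mapToPair(t -> {
                        Child c = t._2()._1();
                        School s = t._2()._2();
                        return new Tuple2<String, Tuple2<School, Double>>(
                                c.id, new Tuple2<School, Double>(s, distance(c, s)));
                    })
                    .reduceByKey((a, b) -> a._2() < b._2() ? a : b);   // keep the minimum score per child
        }
    }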

Strange Set of errors

2015-12-14 Thread Steve Lewis
I am running on a spark 1.5.1 cluster managed by Mesos - I have an application that handles a chemistry problem whose size can be increased by increasing the number of atoms - which increases the number of Spark stages. I do a repartition at each stage - Stage 9 is the last stage. At each stage the size and

Has the format of a spark jar file changed in 1.5

2015-12-12 Thread Steve Lewis
I have been using my own code to build the jar file I use for spark submit. In 1.4 I could simply add all class and resource files I find in the class path to the jar and add all jars in the classpath into a directory called lib in the jar file. In 1.5 I see that resources and classes in jars in

Looking for a few Spark Benchmarks

2015-07-19 Thread Steve Lewis
I was in a discussion with someone who works for a cloud provider which offers Spark/Hadoop services. We got into a discussion of performance and the bewildering array of machine types and the problem of selecting a cluster with 20 Large instances VS 10 Jumbo instances or the trade offs between

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread Steve Lewis
once more. Tasks typically perform both operations several hundred thousand times. Why can it not be done in a distributed way? On Thu, May 7, 2015 at 3:16 PM, Steve Lewis lordjoe2...@gmail.com wrote: I am performing a job where I perform a number of steps in succession. One step is a map

How can I force operations to complete and spool to disk

2015-05-06 Thread Steve Lewis
I am performing a job where I perform a number of steps in succession. One step is a map on a JavaRDD which generates objects taking up significant memory. This is followed by a join and an aggregateByKey. The problem is that the system is getting OutOfMemoryErrors - Most tasks work

Re: Can a map function return null

2015-04-19 Thread Steve Lewis
null Sent from Samsung Mobile Original message From: Olivier Girardot Date:2015/04/18 22:04 (GMT+00:00) To: Steve Lewis ,user@spark.apache.org Subject: Re: Can a map function return null You can return an RDD with null values inside, and afterwards filter on item

Can a map function return null

2015-04-18 Thread Steve Lewis
I find a number of cases where I have a JavaRDD and I wish to transform the data and, depending on a test, return 0 or one item (don't suggest a filter - the real case is more complex). So I currently do something like the following - perform a flatmap returning a list with 0 or 1 entry depending
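That flatMap-with-zero-or-one-results pattern, as a minimal sketch against the Spark 1.x Java API (where FlatMapFunction returns an Iterable); the test is a placeholder.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import java.util.Collections;

    public class MaybeMap {
        public static JavaRDD<String> keepSome(JavaRDD<String> in) {
            return in.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) {
                    // returning an empty Iterable instead of null simply drops the record
                    if (s == null || s.isEmpty())          // placeholder for the real test
                        return Collections.emptyList();
                    return Collections.singletonList(s.toUpperCase());
                }
            });
        }
    }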

Fwd: Numbering RDD members Sequentially

2015-03-11 Thread Steve Lewis
-- Forwarded message -- From: Steve Lewis lordjoe2...@gmail.com Date: Wed, Mar 11, 2015 at 9:13 AM Subject: Re: Numbering RDD members Sequentially To: Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com perfect - exactly what I was looking for, not quite sure why it is called

Numbering RDD members Sequentially

2015-03-10 Thread Steve Lewis
I have a Hadoop InputFormat which reads records and produces JavaPairRDD<String,String> locatedData where _1() is a formatted version of the file location - like 12690,, 24386 .27523 ... _2() is the data to be processed. For historical reasons I want to convert _1() into an integer
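The usual answer here is zipWithIndex; a sketch assuming Java 8 lambdas and that the index fits in an int.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    public class NumberRecords {
        // replace the formatted-location key with a sequential integer index
        public static JavaPairRDD<Integer, String> number(JavaPairRDD<String, String> locatedData) {
            JavaRDD<String> values = locatedData.values();
            return values.zipWithIndex()        // (value, index as Long, in partition order)
                    .mapToPair(t -> new Tuple2<Integer, String>(t._2().intValue(), t._1()));
        }
    }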

Is there a way to read a parquet database without generating an RDD

2015-01-06 Thread Steve Lewis
I have an application where a function needs access to the results of a select from a parquet database. Creating a JavaSQLContext and from it a JavaSchemaRDD as shown below works but the parallelism is not needed - a simple JDBC call would work - Are there alternative non-parallel ways to

Is there a way (in Java) to turn a Java Iterable into a JavaRDD?

2014-12-19 Thread Steve Lewis
I notice new methods such as JavaSparkContext makeRDD (with few useful examples) - It takes a Seq but while there are ways to turn a list into a Seq I see nothing that uses an Iterable
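As far as I know there is still nothing that takes an Iterable directly; a sketch of the obvious workaround, which copies into a List first and therefore only works when the data fits in driver memory.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.ArrayList;
    import java.util.List;

    public class RDDFromIterable {
        public static <T> JavaRDD<T> fromIterable(JavaSparkContext sc, Iterable<T> it, int slices) {
            List<T> asList = new ArrayList<T>();
            for (T t : it)                // materializes everything on the driver
                asList.add(t);
            return sc.parallelize(asList, slices);
        }
    }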

Who is using Spark and related technologies for bioinformatics applications?

2014-12-17 Thread Steve Lewis
I am aware of the ADAM project in Berkeley and I am working on Proteomic searches - anyone else working in this space

how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
I have an RDD which is potentially too large to store in memory with collect. I want a single task to write the contents as a file to hdfs. Time is not a large issue but memory is. I say the following converting my RDD (scans) to a local Iterator. This works but hasNext shows up as a separate task
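A sketch of the single-writer approach on the driver, assuming the Hadoop FileSystem client is already configured; toLocalIterator pulls one partition at a time rather than the whole RDD.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;
    import java.io.PrintWriter;
    import java.util.Iterator;

    public class WriteSingleFile {
        public static void write(JavaRDD<String> scans, String hdfsPath) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            PrintWriter writer = new PrintWriter(fs.create(new Path(hdfsPath)));
            try {
                Iterator<String> it = scans.toLocalIterator();  // streams one partition at a time
                while (it.hasNext())
                    writer.println(it.next());
            } finally {
                writer.close();
            }
        }
    }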

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
output file. - SF On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com wrote: I have an RDD which is potentially too large to store in memory with collect. I want a single task to write the contents as a file to hdfs. Time is not a large issue but memory is. I say the following

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
temp dirs if it has to. On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis lordjoe2...@gmail.com wrote: The objective is to let the Spark application generate a file in a format which can be consumed by other programs - as I said I am willing to give up parallelism at this stage (all the expensive

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
thread. On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis lordjoe2...@gmail.com wrote: I have a function which generates a Java object and I want to explore failures which only happen when processing large numbers of these object. the real code is reading a many gigabyte file but in the test code I

In Java how can I create an RDD with a large number of elements

2014-12-08 Thread Steve Lewis
assume I don't care about values which may be created in a later map - in scala I can say val rdd = sc.parallelize(1 to 10, numSlices = 1000) but in Java JavaSparkContext can only parallelize a List - limited to Integer.MAX_VALUE elements and required to exist in memory - the best I can
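One workaround, sketched below for the Spark 1.x Java API, is to parallelize a small list of seed values and expand each seed on the executors, so the driver never holds the full collection.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import java.util.ArrayList;
    import java.util.List;

    public class BigRange {
        // builds an RDD of roughly numSeeds * perSeed longs without a driver-side list of that size
        public static JavaRDD<Long> range(JavaSparkContext sc, int numSeeds, final long perSeed) {
            List<Integer> seeds = new ArrayList<Integer>();
            for (int i = 0; i < numSeeds; i++)
                seeds.add(i);
            return sc.parallelize(seeds, numSeeds).flatMap(new FlatMapFunction<Integer, Long>() {
                public Iterable<Long> call(Integer seed) {
                    List<Long> out = new ArrayList<Long>();   // bounded by perSeed per task
                    for (long j = 0; j < perSeed; j++)
                        out.add(seed * perSeed + j);
                    return out;
                }
            });
        }
    }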

I am having problems reading files in the 4GB range

2014-12-05 Thread Steve Lewis
I am using a custom hadoop input format which works well on smaller files but fails with a file at about 4GB size - the format is generating about 800 splits and all variables in my code are longs - Any suggestions? Is anyone reading files of this size? Exception in thread main

Problems creating and reading a large test file

2014-12-05 Thread Steve Lewis
I am trying to look at problems reading a data file over 4G. In my testing I am trying to create such a file. My plan is to create a fasta file (a simple format used in biology) looking like 1 TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG 2

Failed to read chunk exception

2014-12-04 Thread Steve Lewis
I am running a large job using 4000 partitions - after running for four hours on a 16 node cluster it fails with the following message. The errors are in spark code and seem to address unreliability at the level of the disk - Has anyone seen this and does anyone know what is going on and how to fix it? Exception

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Steve Lewis
. In particular, if the RDDs are PairRDDs, partitions are assigned based on the hash of the key, so an even distribution of values among keys is required for even split of data across partitions. On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com) wrote: 1) I can go

How can a function get a TaskContext

2014-12-04 Thread Steve Lewis
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java has a Java implementation of TaskContext with a very useful method /** * Return the currently active TaskContext. This can be called inside of * user functions to access contextual information about
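A sketch of calling it from inside a user function, assuming a Spark version where TaskContext.get() and these accessors exist (roughly 1.3+).

    import org.apache.spark.TaskContext;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    public class TaggedMap {
        public static JavaRDD<String> tag(JavaRDD<String> in) {
            return in.map(new Function<String, String>() {
                public String call(String s) {
                    TaskContext tc = TaskContext.get();  // the current task's context on the executor
                    return "partition=" + tc.partitionId()
                            + " attempt=" + tc.taskAttemptId() + "\t" + s;
                }
            });
        }
    }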

How can a function running on a slave access the Executor

2014-12-03 Thread Steve Lewis
I have been working on balancing work across a number of partitions and find it would be useful to access information about the current execution environment much of which (like Executor ID) are available if there was a way to get the current executor or the Hadoop TaskAttempt context - does any

Re: Any ideas why a few tasks would stall

2014-12-02 Thread Steve Lewis
period, you can identify 'hot spots' or expensive sections in the user code. On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis lordjoe2...@gmail.com wrote: I am working on a problem which will eventually involve many millions of function calls. A have a small sample with several thousand calls working

How can a function access Executor ID, Function ID and other parameters

2014-11-30 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by Mac Address but would prefer to use data known to the Spark environment - Executor ID, and Function ID show up in the Spark UI and Task ID and

How can a function access Executor ID, Function ID and other parameters known to the Spark Environment

2014-11-26 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by Mac Address but would prefer to use data known to the Spark environment - Executor ID, and Function ID show up in the Spark UI and Task ID and
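For the executor ID specifically, one route I have seen used is SparkEnv, which is internal and not a stable API; a sketch with that caveat.

    import org.apache.spark.SparkEnv;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    public class ExecutorTag {
        public static JavaRDD<String> tagWithExecutor(JavaRDD<String> in) {
            return in.map(new Function<String, String>() {
                public String call(String s) {
                    // SparkEnv is internal to Spark; in local mode this returns the driver's id
                    return SparkEnv.get().executorId() + "\t" + s;
                }
            });
        }
    }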

Why is this operation so expensive

2014-11-25 Thread Steve Lewis
I have a JavaPairRDD<KeyType,Tuple2<Type1,Type2>> originalPairs. There are on the order of 100 million elements. I call a function to rearrange the tuples: JavaPairRDD<String,Tuple2<Type1,Type2>> newPairs = originalPairs.values().mapToPair(new PairFunction<Tuple2<Type1,Type2>, String, Tuple2<Type1,Type2>>

Re: Why is this operation so expensive

2014-11-25 Thread Steve Lewis
. An alternative is to make the .equals() and .hashCode() of KeyType delegate to the .getId() method you use in the anonymous function. Cheers, Andrew On Tue, Nov 25, 2014 at 10:06 AM, Steve Lewis lordjoe2...@gmail.com wrote: I have a JavaPairRDD<KeyType,Tuple2<Type1,Type2>> originalPairs

How do I get the executor ID from running Java code

2014-11-17 Thread Steve Lewis
The spark UI lists a number of Executor IDS on the cluster. I would like to access both executor ID and Task/Attempt IDs from the code inside a function running on a slave machine. Currently my motivation is to examine parallelism and locality but in Hadoop this aids in allowing code to write

How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
I have instrumented word count to track how many machines the code runs on. I use an accumulator to maintain a Set of MacAddresses. I find that everything is done on a single machine. This is probably optimal for word count but not the larger problems I am working on. How do I force processing to

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
to force it to have more partitions, you can call RDD.repartition(numPartitions). Note that this will introduce a shuffle you wouldn't otherwise have. Also make sure your job is allocated more than one core in your cluster (you can see this on the web UI). On Fri, Nov 14, 2014 at 2:18 PM, Steve

How can my java code executing on a slave find the task id?

2014-11-12 Thread Steve Lewis
I am trying to determine how effective partitioning is at parallelizing my tasks. So far I suspect it that all work is done in one task. My plan is to create a number of accumulators - one for each task and have functions increment the accumulator for the appropriate task (or slave) the values

How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext currentContext = ...; Accumulator<Integer> accumulator = currentContext.accumulator(0, "MyAccumulator"); will create an Accumulator of Integers. For many large Data problems Integer is too small and Long is a better type. I see a call like the following

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext has helper methods for int and double but not long. You can just make your own little implementation of AccumulatorParam<Long> right? ... which would be nice to add to JavaSparkContext. On Wed, Nov 12, 2014 at 11:05 PM, Steve Lewis lordjoe2...@gmail.com wrote
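A sketch of that little implementation; all three methods are spelled out because a Java class implementing the Scala trait cannot rely on the trait's default addAccumulator.

    import org.apache.spark.Accumulator;
    import org.apache.spark.AccumulatorParam;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LongAccumulatorParam implements AccumulatorParam<Long> {
        public Long zero(Long initialValue) { return 0L; }
        public Long addInPlace(Long r1, Long r2) { return r1 + r2; }
        public Long addAccumulator(Long r, Long t) { return r + t; }

        // usage: a Long accumulator on the Java context
        public static Accumulator<Long> create(JavaSparkContext sc) {
            return sc.accumulator(0L, new LongAccumulatorParam());
        }
    }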

Is there a way to clone a JavaRDD without persisting it

2014-11-11 Thread Steve Lewis
In my problem I have a number of intermediate JavaRDDs and would like to be able to look at their sizes without destroying the RDD for subsequent processing. persist will do this but these are big and persist seems expensive and I am unsure of which StorageLevel is needed. Is there a way to

How do I kill a job submitted with spark-submit

2014-11-02 Thread Steve Lewis
I see the job in the web interface but don't know how to kill it

Re: A Spark Design Problem

2014-11-01 Thread Steve Lewis
,key> join with JavaPairRDD<bin,lock>. If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2

A Spark Design Problem

2014-10-31 Thread Steve Lewis
The original problem is in biology but the following captures the CS issues, Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks and a key that fits a lock will have a high score. There may be many keys fitting one lock and a key
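A sketch of the binning idea from the replies, assuming Java 8 lambdas; Key, Lock, the bin field and score() are hypothetical stand-ins. Scoring is only applied to (key, lock) pairs that share a bin.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;
    import java.io.Serializable;

    public class LockKeyScoring {
        // hypothetical stand-ins for the thread's objects
        public static class Key implements Serializable { public int bin; public String id; }
        public static class Lock implements Serializable { public int bin; public String id; }

        static double score(Key k, Lock l) { return 0.0; }   // placeholder scoring function

        // key both sides by bin so the scoring function only sees pairs that share a bin
        public static JavaPairRDD<Tuple2<String, String>, Double> scoreAll(
                JavaRDD<Key> keys, JavaRDD<Lock> locks) {
            JavaPairRDD<Integer, Key> keysByBin = keys.mapToPair(k -> new Tuple2<Integer, Key>(k.bin, k));
            JavaPairRDD<Integer, Lock> locksByBin = locks.mapToPair(l -> new Tuple2<Integer, Lock>(l.bin, l));
            return keysByBin.join(locksByBin)                 // all (key, lock) pairs within a bin
                    .mapToPair(t -> new Tuple2<Tuple2<String, String>, Double>(
                            new Tuple2<String, String>(t._2()._1().id, t._2()._2().id),
                            score(t._2()._1(), t._2()._2())));
        }
    }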

Questions about serialization and SparkConf

2014-10-29 Thread Steve Lewis
Assume in my executor I say SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.kryo.registrator", "com.lordjoe.distributed.hydra.HydraKryoSerializer"); sparkConf.set("mysparc.data", "Some user Data"); sparkConf.setAppName("Some App"); Now 1) Are there default
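A sketch of the registrator side of that configuration, reusing the class name from the post; setting spark.kryo.registrationRequired is also worth trying, since it turns the opaque "unregistered class ID" failure into one that names the class.

    import com.esotericsoftware.kryo.Kryo;
    import org.apache.spark.SparkConf;
    import org.apache.spark.serializer.KryoRegistrator;

    public class HydraKryoSerializer implements KryoRegistrator {
        public void registerClasses(Kryo kryo) {
            // register every class that crosses a shuffle or broadcast boundary
            kryo.register(scala.Tuple2.class);
            kryo.register(java.util.ArrayList.class);
            // kryo.register(MyDomainClass.class); ... and so on for the several dozen classes
        }

        public static SparkConf configure(SparkConf conf) {
            // the registrator string must be the fully qualified name of the class above
            return conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .set("spark.kryo.registrator", "com.lordjoe.distributed.hydra.HydraKryoSerializer")
                    .set("spark.kryo.registrationRequired", "true"); // fail with a class name, not a numeric ID
        }
    }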

com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
A cluster I am running on keeps getting KryoException. Unlike the Java serializer the Kryo Exception gives no clue as to what class is giving the error. The application runs properly locally but not on the cluster, and I have my own custom KryoRegistrator and register several dozen classes -

Re: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
objects you're trying to serialize and see if those work. -Original Message- *From: *Steve Lewis [lordjoe2...@gmail.com] *Sent: *Tuesday, October 28, 2014 10:46 PM Eastern Standard Time *To: *user@spark.apache.org *Subject: *com.esotericsoftware.kryo.KryoException: Encountered

Re: How do you write a JavaRDD into a single file

2014-10-21 Thread Steve Lewis
Collect will store the entire output in a List in memory. This solution is acceptable for Little Data problems although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data

How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
At the end of a set of computations I have a JavaRDD<String>. I want a single file where each string is printed in order. The data is small enough that it is acceptable to handle the printout on a single processor. It may be large enough that using collect to generate a list might be unacceptable.

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
expose a method to iterate over the data, called toLocalIterator. It does not require that the RDD fit entirely in memory. On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis lordjoe2...@gmail.com wrote: At the end of a set of computation I have a JavaRDDString . I want a single file where each

How do I get at a SparkContext or better a JavaSparkContext from the middle of a function

2014-10-14 Thread Steve Lewis
I am running a couple of functions on an RDD which require access to data on the file system known to the context. If I create a class with a context as a member variable I get a serialization error. So I am running my function on some slave and I want to read in data from a Path defined by a

Broadcast Torrent fail - then the job dies

2014-10-08 Thread Steve Lewis
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 - I repeatedly see the following in my logs. I believe this happens in combineByKey 14/10/08 09:36:30 INFO executor.Executor: Running task 3.0 in stage 0.0 (TID 3) 14/10/08 09:36:30 INFO broadcast.TorrentBroadcast:

Re: Broadcast Torrent fail - then the job dies

2014-10-08 Thread Steve Lewis
spark.broadcast.factory to org.apache.spark.broadcast.HttpBroadcastFactory in spark conf. Thanks, Liquan On Wed, Oct 8, 2014 at 11:21 AM, Steve Lewis lordjoe2...@gmail.com wrote: I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 - I repeatedly see the following in my

anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a Hadoop Custom InputFormat - I can give you some samples - While I have not tried this, an input split has only a length (which could be ignored if the format treats it as non-splittable) and a String for a location. If the location is a URL into wikipedia the whole thing should work. Hadoop

What can be done if a FlatMapFunction generates more data than can be held in memory

2014-10-01 Thread Steve Lewis
A number of the problems I want to work with generate datasets which are too large to hold in memory. This becomes an issue when building a FlatMapFunction and also when the data used in combineByKey cannot be held in memory. The following is a simple, if a little silly, example of a
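One way to keep the flatMap output out of memory in the 1.x API (where FlatMapFunction returns an Iterable) is to return a lazy Iterable that produces items on demand; a sketch.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import java.util.Iterator;

    public class LazyFlatMap {
        // expands each line into a number of variants without ever materializing a List
        public static JavaRDD<String> expand(JavaRDD<String> lines, final int copies) {
            return lines.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(final String line) {
                    return new Iterable<String>() {
                        public Iterator<String> iterator() {
                            return new Iterator<String>() {
                                int i = 0;
                                public boolean hasNext() { return i < copies; }
                                public String next() { return line + "-" + i++; }  // generated on demand
                                public void remove() { throw new UnsupportedOperationException(); }
                            };
                        }
                    };
                }
            });
        }
    }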

A sample for generating big data - and some design questions

2014-09-30 Thread Steve Lewis
This sample below is essentially word count modified to be big data by turning lines into groups of upper case letters and then generating all case variants - it is modeled after some real problems in biology. The issue is I know how to do this in Hadoop, but in Spark the use of a List in my flatmap

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
, Liquan On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis lordjoe2...@gmail.com wrote: Well I had one and tried that - my message tells what I found: 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V> not org.apache.hadoop.mapreduce.InputFormat<K,V> 2) Hadoop expects K and V

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable - I think that is the only real issue - my code uses vanilla Text

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode but I got an java.io.NotSerializableException: org.apache.hadoop.io.Text at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at

Has anyone seen java.nio.ByteBuffer.wrap(ByteBuffer.java:392)

2014-09-24 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at

Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
When I experimented with using an InputFormat I had used in Hadoop for a long time, I found 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat) 2) initialize needs to be called in the constructor 3)
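For what it is worth, the new-API formats can also be read via newAPIHadoopFile; a sketch (using a Java 8 lambda) that converts Text to String immediately so nothing non-serializable survives past the first transformation, with the InputFormat class passed in as a parameter.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ReadWithNewApi {
        public static <F extends FileInputFormat<Text, Text>> JavaPairRDD<String, String> read(
                JavaSparkContext sc, String path, Class<F> formatClass) {
            JavaPairRDD<Text, Text> raw = sc.newAPIHadoopFile(
                    path, formatClass, Text.class, Text.class, new Configuration());
            // convert Text to String right away - Text is Writable but not java.io.Serializable
            return raw.mapToPair(t -> new Tuple2<String, String>(t._1().toString(), t._2().toString()));
        }
    }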

Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
The only way I find is to turn it into a list - in effect holding everything in memory (see code below). Surely Spark has a better way. Also what about unterminated iterables like a Fibonacci series - (useful only if limited in some other way ) /** * make an RDD from an iterable *

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Steve Lewis
, reduce, merge) reducedUnsorted.sortByKey() On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis lordjoe2...@gmail.com wrote: I am struggling to reproduce the functionality of a Hadoop reducer on Spark (in Java). In Hadoop I have a function public void doReduce(K key, Iterator<V> values) in Hadoop

Looking for a good sample of Using Spark to do things Hadoop can do

2014-09-12 Thread Steve Lewis
Assume I have a large book with many Chapters and many lines of text. Assume I have a function that tells me the similarity of two lines of text. The objective is to find the most similar line in the same chapter within 200 lines of the line found. The real problem involves biology and is beyond

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Steve Lewis
In modern projects there are a bazillion dependencies - when I use Hadoop I just put them in a lib directory in the jar - If I have a project that depends on 50 jars I need a way to deliver them to Spark - maybe wordcount can be written without dependencies but real projects need to deliver

Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-08 Thread Steve Lewis
In a Hadoop jar there is a directory called lib and all non-provided third party jars go there and are included in the class path of the code. Do jars for Spark have the same structure - another way to ask the question is if I have code to execute Spark and a jar build for Hadoop can I simply use

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Steve Lewis
worrying about this. Matei On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote: When programming in Hadoop it is possible to guarantee 1) All keys sent to a specific partition will be handled by the same machine (thread) 2) All keys received by a specific machine

I am looking for a Java sample of a Partitioner

2014-09-02 Thread Steve Lewis
Assume, say, JavaWordCount. I call the equivalent of a Mapper: JavaPairRDD<String, Integer> ones = words.mapToPair(,,, Now right here I want to guarantee that each word starting with a particular letter is processed in a specific partition - (Don't tell me this is a dumb idea - I know that but in a
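A sketch of a Java Partitioner that does exactly that first-letter routing (26 partitions, with non-letters all landing in partition 0).

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    public class FirstLetterPartitioner extends Partitioner {
        public int numPartitions() { return 26; }

        public int getPartition(Object key) {
            String s = key == null ? "" : key.toString();
            if (s.isEmpty())
                return 0;
            char c = Character.toLowerCase(s.charAt(0));
            return (c >= 'a' && c <= 'z') ? c - 'a' : 0;   // non-letters all land in partition 0
        }

        // usage on the word-count pairs
        public static JavaPairRDD<String, Integer> route(JavaPairRDD<String, Integer> ones) {
            return ones.partitionBy(new FirstLetterPartitioner());
        }
    }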

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Steve Lewis
, Steve Lewis (lordjoe2...@gmail.com) wrote: When programming in Hadoop it is possible to guarantee 1) All keys sent to a specific partition will be handled by the same machine (thread) 2) All keys received by a specific machine (thread) will be received in sorted order 3) These conditions

Mapping Hadoop Reduce to Spark

2014-08-30 Thread Steve Lewis
When programming in Hadoop it is possible to guarantee 1) All keys sent to a specific partition will be handled by the same machine (thread) 2) All keys received by a specific machine (thread) will be received in sorted order 3) These conditions will hold even if the values associated with a
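A sketch of how guarantees 1) and 2) map onto the Java API, assuming Spark 1.2+: a HashPartitioner plays the role of the Hadoop partitioner and repartitionAndSortWithinPartitions sorts by the (Comparable) key within each partition, so a later mapPartitions pass can walk the keys the way a Hadoop reducer would.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    public class HadoopStyleShuffle {
        // 1) same key -> same partition (HashPartitioner); 2) keys arrive sorted within each
        // partition, so a mapPartitions pass over the result can walk them like a Hadoop reducer
        public static JavaPairRDD<String, String> shuffleLikeHadoop(
                JavaPairRDD<String, String> pairs, int numReducers) {
            return pairs.repartitionAndSortWithinPartitions(new HashPartitioner(numReducers));
        }
    }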

What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Steve Lewis
In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - The reducer might also take a single key and emit multiple values. I don't think that functions like flatMap and reduceByKey will work - or are there tricks I am not aware of
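The pair-emitting analogue is flatMapToPair with a PairFlatMapFunction; a minimal sketch against the 1.x signature (which returns an Iterable).

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.PairFlatMapFunction;
    import scala.Tuple2;
    import java.util.ArrayList;
    import java.util.List;

    public class MultiEmit {
        // one input line can emit zero, one, or many (word, 1) pairs
        public static JavaPairRDD<String, Integer> emitWords(JavaRDD<String> lines) {
            return lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
                public Iterable<Tuple2<String, Integer>> call(String line) {
                    List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
                    for (String word : line.split("\\s+"))
                        if (!word.isEmpty())
                            out.add(new Tuple2<String, Integer>(word, 1));
                    return out;
                }
            });
        }
    }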

How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
I was able to get JavaWordCount running with a local instance under IntelliJ. In order to do so I needed to use maven to package my code and call String[] jars = { "/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" }; sparkConf.setJars(jars); After that the sample ran properly and

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
invoke an action, like count(), on the words RDD. On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis lordjoe2...@gmail.com wrote: I was able to get JavaWordCount running with a local instance under IntelliJ. In order to do so I needed to use maven to package my code and call String[] jars

I am struggling to run Spark Examples on my local machine

2014-08-21 Thread Steve Lewis
I download the binaries for spark-1.0.2-hadoop1 and unpack it on my Windows 8 box. I can execute spark-shell.cmd and get a command window which does the proper things. I open a browser to http://localhost:4040 and a window comes up describing the spark-master. Then using IntelliJ I create a project

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-20 Thread Steve Lewis
notes - http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html on how to build from source and run examples in spark shell. Regards, Manu On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis lordjoe2...@gmail.com wrote: I want to look at porting a Hadoop problem

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-18 Thread Steve Lewis
-spark-on-windows-and-cloudera.html on how to build from source and run examples in spark shell. Regards, Manu On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis lordjoe2...@gmail.com wrote: I want to look at porting a Hadoop problem to Spark - eventually I want to run on a Hadoop 2.0 cluster

Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Steve Lewis
I want to look at porting a Hadoop problem to Spark - eventually I want to run on a Hadoop 2.0 cluster but while I am learning and porting I want to run small problems in my windows box. I installed scala and sbt. I download Spark and in the spark directory can say mvn -Phadoop-0.23