Re: Broadcasting Non Serializable Objects

2016-10-18 Thread Daniel Imberman
Hi Pedro, Can you please post your code? Daniel On Tue, Oct 18, 2016 at 12:27 PM pedroT wrote: > Hi guys. > > I know this is a well-known topic, but having read about it (a lot) I'm not sure > about the answer. > > I need to broadcast a complex structure with a lot of objects
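
A pattern that often comes up in threads like this one, sketched here with illustrative types rather than Pedro's actual structure (and assuming a SparkContext sc, spark-shell style): broadcast only the serializable raw data and rebuild the non-serializable object lazily on each executor behind a @transient lazy val.

    // The holder is Serializable; the rebuilt object is @transient so it is never shipped.
    class IndexHolder(val rawEntries: Array[(String, Int)]) extends Serializable {
      @transient lazy val index: java.util.TreeMap[String, Int] = {
        val m = new java.util.TreeMap[String, Int]() // stands in for the non-serializable type
        rawEntries.foreach { case (k, v) => m.put(k, v) }
        m
      }
    }

    val holder = sc.broadcast(new IndexHolder(Array("a" -> 1, "b" -> 2)))
    val rdd = sc.parallelize(Seq("a", "b", "a"))
    // the index is reconstructed at most once per executor JVM, on first access
    rdd.map(key => holder.value.index.get(key)).collect().foreach(println)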

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Daniel Imberman
There's no real way of doing nested for-loops with RDDs because the whole idea is that you could have so much data in the RDD that it would be impractical to store it all on one worker. There are, however, ways to handle what you're asking about. I would personally use something like cogroup or
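
For what it's worth, a minimal sketch of the cogroup route (illustrative data, assuming a SparkContext sc as in the shell): key both RDDs, and the "inner loop" then runs per key, on whichever worker holds that key's groups.

    import org.apache.spark.rdd.RDD

    val left: RDD[(Int, String)] = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right: RDD[(Int, Double)] = sc.parallelize(Seq(1 -> 0.5, 1 -> 1.5, 2 -> 2.0))

    val pairedUp = left.cogroup(right).flatMap { case (key, (ls, rs)) =>
      // the nested iteration happens here, but only over the values sharing this key
      for (l <- ls; r <- rs) yield (key, l, r)
    }

    pairedUp.collect().foreach(println)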

Arrays in Datasets (1.6.1)

2016-06-27 Thread Daniel Imberman
Hi all, So I've been attempting to reformat a project I'm working on to use the Dataset API and have been having some issues with encoding errors. From what I've read, I think that I should be able to store Arrays of primitive values in a dataset. However, the following class gives me encoding
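
For context, a minimal sketch of the pattern being attempted on 1.6.x (class name and values are illustrative, assuming a SparkContext sc). If the Array-typed field trips the encoder, switching the field to Seq is a common workaround:

    case class Doc(id: Int, tokens: Seq[Int]) // Seq tends to encode more smoothly than Array on 1.6.x

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val ds = sqlContext.createDataset(Seq(Doc(1, Seq(1, 2, 3)), Doc(2, Seq(4, 5))))
    ds.map(d => d.tokens.sum).collect()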

Re: Creating a Python port for a Scala Spark Project

2016-06-22 Thread Daniel Imberman
ark-examples > ). Good luck! :) > > On Wed, Jun 22, 2016 at 7:07 PM, Daniel Imberman < > daniel.imber...@gmail.com> wrote: > >> Hi All, >> >> I've developed a spark module in scala that I would like to add a python >> port for. I want to be able to al

Creating a Python port for a Scala Spark Project

2016-06-22 Thread Daniel Imberman
Hi All, I've developed a Spark module in Scala that I would like to add a Python port for. I want to allow users to create a PySpark RDD and send it to my system. I've been looking into the PySpark source code as well as Py4J and was wondering if there has been anything like this

Re: Forcing data from disk to memory

2016-03-24 Thread Daniel Imberman
from disk to memory > To: Daniel Imberman <daniel.imber...@gmail.com> > > > Hi, > > We have no direct approach; we need to unpersist cached data, then > re-cache data as MEMORY_ONLY. > > // maropu > > On Thu, Mar 24, 2016 at 8:22 AM, Daniel Imberman < > d
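
A sketch of that recipe (assuming a SparkContext sc): drop the MEMORY_AND_DISK copy, re-persist as MEMORY_ONLY, and let the next action rebuild the cache entirely in memory.

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // materialize the cache, possibly spilling some partitions to disk

    rdd.unpersist(blocking = true)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // recomputed from lineage and cached in memory only this time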

Forcing data from disk to memory

2016-03-23 Thread Daniel Imberman
Hi all, So I have a question about persistence. Let's say I have an RDD that's persisted MEMORY_AND_DISK, and I know that I now have enough memory space cleared up that I can force the data on disk into memory. Is it possible to tell spark to re-evaluate the open RDD memory and move that

Reading an RDD from a checkpoint.

2016-03-14 Thread Daniel Imberman
So I'm attempting to pre-compute my data such that I can pull an RDD from a checkpoint. However, I'm finding that upon running the same job twice the system is simply recreating the RDD from scratch. Here is the code I'm implementing to create the checkpoint: def
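
One note for readers of this thread: RDD.checkpoint() is mainly for truncating lineage within a single application, and a later run will not pick the checkpoint files up automatically. A sketch of an explicit alternative (the path is illustrative, assuming a SparkContext sc): save the computed RDD once and reload it if it already exists.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = "hdfs:///precomputed/myRdd" // hypothetical location
    val fs = FileSystem.get(sc.hadoopConfiguration)

    val rdd =
      if (fs.exists(new Path(path))) {
        sc.objectFile[(Int, String)](path) // reuse the previous run's output
      } else {
        val computed = sc.parallelize(Seq(1 -> "a", 2 -> "b")) // stands in for the expensive job
        computed.saveAsObjectFile(path)
        computed
      }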

Attempting to aggregate multiple values

2016-02-26 Thread Daniel Imberman
Hi all, So over the past few days I've been attempting to create a function that takes an RDD[U] and creates three MMaps. I've been attempting to aggregate these values but I'm running into a major issue. When I initially tried to use separate aggregators for each map, I noticed a significant
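
One way to build all three maps in a single pass is one aggregate call whose accumulator is a tuple of three mutable maps. The sketch below uses illustrative value logic, assuming a SparkContext sc:

    import scala.collection.mutable.{Map => MMap}

    type Acc = (MMap[Int, Int], MMap[Int, Int], MMap[Int, Int])

    def merge(dst: MMap[Int, Int], src: MMap[Int, Int]): MMap[Int, Int] = {
      src.foreach { case (k, v) => dst(k) = dst.getOrElse(k, 0) + v }
      dst
    }

    val rdd = sc.parallelize(Seq(1, 2, 2, 3))

    val (m1, m2, m3) = rdd.aggregate[Acc]((MMap.empty, MMap.empty, MMap.empty))(
      (acc, x) => { // fold one element into all three maps at once
        acc._1(x) = acc._1.getOrElse(x, 0) + 1
        acc._2(x * 2) = acc._2.getOrElse(x * 2, 0) + 1
        acc._3(x * 3) = acc._3.getOrElse(x * 3, 0) + 1
        acc
      },
      (a, b) => (merge(a._1, b._1), merge(a._2, b._2), merge(a._3, b._3))
    )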

Performing multiple aggregations over the same data

2016-02-23 Thread Daniel Imberman
Hi guys, So I'm running into a speed issue where I have a dataset that needs to be aggregated multiple times. Initially my team had set up three accumulators and were running a single foreach loop over the data. Something along the lines of val accum1:Accumulable[a] val accum2: Accumulable[b]

Re: Running multiple foreach loops

2016-02-17 Thread Daniel Imberman
Thank you Ted! On Wed, Feb 17, 2016 at 2:12 PM Ted Yu <yuzhih...@gmail.com> wrote: > If the Accumulators are updated at the same time, calling foreach() once > seems to have better performance. > > > On Feb 17, 2016, at 4:30 PM, Daniel Imberman <daniel.imber...@gmail.com&

Running multiple foreach loops

2016-02-17 Thread Daniel Imberman
Hi all, So I'm currently figuring out how to update three separate accumulators: val a:Accumulator val b:Accumulator val c:Accumulator I have an r:RDD[thing] and the code currently reads r.foreach{ thing => a += thing b += thing c += thing }
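
A runnable version of that snippet in the 1.x accumulator API (illustrative data, assuming a SparkContext sc); per the reply above, all three accumulators are updated in a single pass over the data:

    val a = sc.accumulator(0L, "a")
    val b = sc.accumulator(0L, "b")
    val c = sc.accumulator(0L, "c")

    val r = sc.parallelize(1L to 100L)
    r.foreach { thing =>
      a += thing // one foreach updates all three accumulators
      b += thing * 2
      c += thing * 3
    }

    println((a.value, b.value, c.value))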

Terminating Spark Steps in AWS

2016-01-26 Thread Daniel Imberman
Hi all, I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs running. I don't want to terminate the

Re: Terminating Spark Steps in AWS

2016-01-26 Thread Daniel Imberman
which is only for Hadoop MapReduce jobs. For Spark jobs, which run on YARN, > you instead want "yarn application -list". > > Hope this helps, > Jonathan (from the EMR team) > > On Tue, Jan 26, 2016 at 10:05 AM Daniel Imberman < > daniel.imber...@gmail.com> wrote:

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
Hi Richard, If I understand the question correctly, it sounds like you could probably do this using mapValues (I'm assuming that you want two pieces of information out of each row: the states as individual items, and the number of states in the row) val separatedInputStrings = input:RDD[(Int,

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
edit: Mistake in the second code example val numColumns = separatedInputStrings.filter{ case(id, (stateList, numStates)) => numStates}.reduce(math.max) On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com> wrote: > Hi Richard, > > If I understand the

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
edit 2: filter should be map val numColumns = separatedInputStrings.map{ case(id, (stateList, numStates)) => numStates}.reduce(math.max) On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman <daniel.imber...@gmail.com> wrote: > edit: Mistake in the second code example > &
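
Putting the original suggestion and the two corrections together, a consolidated sketch (illustrative input, assuming a SparkContext sc):

    import org.apache.spark.rdd.RDD

    val input: RDD[(Int, String)] = sc.parallelize(Seq(1 -> "CA,NY", 2 -> "CA,NY,WA"))

    // mapValues keeps the key and emits both the split states and their count
    val separatedInputStrings = input.mapValues { states =>
      val stateList = states.split(",").toList
      (stateList, stateList.length)
    }

    // the widest row determines how many columns are needed
    val numColumns = separatedInputStrings
      .map { case (id, (stateList, numStates)) => numStates }
      .reduce(math.max(_, _))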

Re: Sending large objects to specific RDDs

2016-01-17 Thread Daniel Imberman
le > is reduced. > > On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman < > daniel.imber...@gmail.com> wrote: > >> Hi Ted, >> >> I think I might have figured something out!(Though I haven't tested it at >> scale yet) >> >> My current thought i

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
On Sat, Jan 16, 2016 at 9:38 AM Koert Kuipers <ko...@tresata.com> wrote: > Just doing a join is not an option? If you carefully manage your > partitioning then this can be pretty efficient (meaning no extra shuffle, > basically map-side join) > On Jan 13, 2016 2:30 PM

Re: Sending large objects to specific RDDs

2016-01-14 Thread Daniel Imberman
AWS but you have to do your own virtualization scripts). Do you have any other thoughts on how I could go about dealing with this purely using spark and HDFS? Thank you On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <daniel.imber...@gmail.com> wrote: > Thank you Ted! That sounds like

Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD in a mapPartitions call. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large. My hope is to do a mapPartitions within the
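
One sketch of the "keep the big object next to its partition" idea (types and data are illustrative, assuming a SparkContext sc): partition both RDDs with the same partitioner so a shard's vectors and its inverted index land on the same partition, then walk them together with zipPartitions.

    import org.apache.spark.HashPartitioner

    case class InvertedIndex(postings: Map[String, Seq[Int]]) // stand-in for the large object

    val partitioner = new HashPartitioner(4)

    val vectors = sc.parallelize(Seq(0 -> Seq("a", "b"), 1 -> Seq("c"))).partitionBy(partitioner)
    val indexes = sc.parallelize(Seq(
      0 -> InvertedIndex(Map("a" -> Seq(1), "b" -> Seq(2))),
      1 -> InvertedIndex(Map("c" -> Seq(3)))
    )).partitionBy(partitioner)

    // same partitioner and partition count, so zipPartitions lines the shards up
    val results = vectors.zipPartitions(indexes) { (vecIter, idxIter) =>
      val idxByKey = idxIter.toMap // usually one (or a few) large index objects per partition
      vecIter.map { case (k, terms) =>
        (k, terms.flatMap(term => idxByKey(k).postings.getOrElse(term, Nil)))
      }
    }

    results.collect().foreach(println)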

Re: Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
Looking up the object should be very fast. > > Cheers > > On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman < > daniel.imber...@gmail.com> wrote: > >> I'm looking for a way to send structures to pre-determined partitions so >> that >> they can be used by another R

Re: Read Accumulator value while running

2016-01-13 Thread Daniel Imberman
Hi Kira, I'm having some trouble understanding your question. Could you please give a code example? From what I think you're asking, there are two issues with what you're looking to do. (Please keep in mind I could be totally wrong on both of these assumptions, but this is what I've been led

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
is before. That's pretty much > what I'm doing already. Was just thinking there might be a better way. > > Darin. > ------ > *From:* Daniel Imberman <daniel.imber...@gmail.com> > *To:* Darin McBeath <ddmcbe...@yahoo.com>; User <user@spark.apache.or

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
Hi Darin, You should read this article. TextFile is very inefficient in S3. http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 Cheers On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath wrote: > I'm looking for some suggestions based on
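
The gist of that article, sketched here (bucket and keys are illustrative; this assumes the AWS Java SDK is on the classpath, credentials come from the default provider chain, and a SparkContext sc is available): parallelize the key listing and fetch objects inside mapPartitions, creating one S3 client per partition rather than per key.

    import com.amazonaws.services.s3.AmazonS3Client
    import scala.io.Source

    val bucket = "my-bucket" // hypothetical
    val keys: Seq[String] = Seq("docs/a.xml", "docs/b.xml") // in practice, listed beforehand

    val contents = sc.parallelize(keys, 64).mapPartitions { part =>
      val s3 = new AmazonS3Client() // one client per partition, reused for every key in it
      part.map { key =>
        val obj = s3.getObject(bucket, key)
        val body = try Source.fromInputStream(obj.getObjectContent).mkString finally obj.close()
        (key, body)
      }
    }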

[no subject]

2016-01-11 Thread Daniel Imberman
Hi all, I'm looking for a way to efficiently partition an RDD, but allow the same data to exist on multiple partitions. Let's say I have a key-value RDD with keys {1,2,3,4}. I want to be able to repartition the RDD so that the partitions look like p1 = {1,2} p2 = {2,3} p3 = {3,4}
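
A sketch of one way to get that layout (the bucket assignment below is illustrative, assuming a SparkContext sc): duplicate each record into every bucket that should hold it with flatMap, then partition by the bucket id.

    import org.apache.spark.HashPartitioner

    val kv = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))

    // key 1 -> bucket 1; key 2 -> buckets 1,2; key 3 -> buckets 2,3; key 4 -> bucket 3
    def buckets(k: Int): Seq[Int] = Seq(k - 1, k).filter(b => b >= 1 && b <= 3)

    val duplicated = kv.flatMap { case (k, v) => buckets(k).map(b => (b, (k, v))) }
    val partitioned = duplicated.partitionBy(new HashPartitioner(3))

    // prints {3,4}, {1,2}, {2,3} -- one overlapping key set per partition
    partitioned.glom().collect().foreach(p => println(p.map(_._2._1).mkString("{", ",", "}")))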

Re: partitioning RDD

2016-01-11 Thread Daniel Imberman
> > Cheers > > On Mon, Jan 11, 2016 at 10:51 AM, Daniel Imberman < > daniel.imber...@gmail.com> wrote: > >> Hi all, >> >> I'm looking for a way to efficiently partition an RDD, but allow the same >> data to exists on multiple partitions. >&g

Comparing Subsets of an RDD

2016-01-04 Thread Daniel Imberman
Hi, I’m looking for a way to compare subsets of an RDD intelligently. Let's say I have an RDD with key/value pairs of type (Int->T). I eventually need to say “compare all values of key 1 with all values of key 2 and compare values of key 3 to the values of key 5 and key 7”, how would I go about
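
One sketch (the pair list is illustrative, assuming a SparkContext sc): re-key each record by every comparison it participates in and join the two sides, so each comparison only ever touches the two groups involved.

    val data = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 5 -> "e", 7 -> "g"))

    val comparisons = Seq((1, 2), (3, 5), (3, 7)) // "compare key 1 with key 2", etc.

    val left = data.flatMap { case (k, v) => comparisons.filter(_._1 == k).map(pair => (pair, v)) }
    val right = data.flatMap { case (k, v) => comparisons.filter(_._2 == k).map(pair => (pair, v)) }

    // ((1,2), (valueFromKey1, valueFromKey2)), ((3,5), ...), ((3,7), ...)
    val pairsToCompare = left.join(right)
    pairsToCompare.collect().foreach(println)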

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you please post the associated code and output? On Mon, Jan 4, 2016 at 3:55 PM Arun Luthra wrote: > I tried groupByKey and noticed that it did not group all values into the > same group. > > In my test dataset (a Pair rdd) I have 16 records, where there are only 4 >

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you try simplifying the key and seeing if that makes any difference? Make it just a string or an int so we can rule out any issues in object equality. On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote: > Spark 1.5.0 > > data: > >
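
To make the suggestion concrete, a sketch of what the simplification is probing (illustrative key classes, assuming a SparkContext sc): groupByKey depends entirely on the key's equals/hashCode, so a case class key groups correctly, while a plain class key (or a key containing an Array) falls back to reference equality and does not.

    case class GoodKey(site: String, zone: Int) // structural equality for free

    class BadKey(val site: String, val zone: Int) extends Serializable // reference equality only

    val rows = Seq(("lo1", 1, "x"), ("lo1", 1, "y"))

    val goodGroups = sc.parallelize(rows.map { case (s, z, v) => (GoodKey(s, z), v) }).groupByKey()
    val badGroups = sc.parallelize(rows.map { case (s, z, v) => (new BadKey(s, z), v) }).groupByKey()

    println(goodGroups.count()) // 1 group
    println(badGroups.count())  // 2 groups: the two BadKey instances are never "equal"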

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
On Mon, Jan 4, 2016 at 5:05 PM Arun Luthra <arun.lut...@gmail.com> wrote: > If I simplify the key to a String column with values lo1, lo2, lo3, lo4, it > works correctly. > > On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman <daniel.imber...@gmail.com > > wrote: > >> Could yo