Re: Write and Read file through map reduce

2015-01-05 Thread Corey Nolet
Hitarth, I don't know how much direction you are looking for with regards to the formats of the times but you can certainly read both files into the third mapreduce job using the FileInputFormat by comma-separating the paths to the files. The blocks for both files will essentially be unioned toget
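
A minimal sketch of what the reply above describes, assuming the Hadoop 2.x "new" MapReduce API; the job name and paths are hypothetical placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object ThirdJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "third-job")
    job.setInputFormatClass(classOf[TextInputFormat])

    // Comma-separated paths: the splits of both files are unioned and
    // fed to the mappers of this single job.
    FileInputFormat.addInputPaths(job, "/output/job1,/output/job2")
    FileOutputFormat.setOutputPath(job, new Path("/output/job3"))

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```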

Re: Jr. to Mid Level Big Data jobs in Bay Area

2015-05-17 Thread Corey Nolet
Agreed. Apache user lists archive questions and answers specifically for the purpose of helping the larger community navigate its projects. It is not a place for classifieds and employment information. On Sun, May 17, 2015 at 9:24 PM, Billy Watson wrote: > Uh, it's not about being tolerant. It'

Re: Heterogeneous Hadoop Cluster

2015-09-24 Thread Corey Nolet
If the hardware is drastically different, I would think a multi-volume HDFS instance would be a good idea (put like-hardware in the same volumes). On Mon, Sep 21, 2015 at 3:29 PM, Tushar Kapila wrote: > Would only matter if OS specific communication was being used between > nodes. I assume they

Re: Heterogeneous Hadoop Cluster

2015-09-25 Thread Corey Nolet
adoop-hdfs/Federation.html On Fri, Sep 25, 2015 at 12:42 AM, Ashish Kumar9 wrote: > This is interesting . Can you share any blog/document that talks > multi-volume HDFS instances . > > Thanks and Regards, > Ashish Kumar > > > From:Corey Nolet > To:user@hadoop.apach

Re: unsubscribe

2016-03-19 Thread Corey Nolet
Gerald, In order to unsubscribe from this list, you need to send an email to user-unsubscr...@hadoop.apache.org. On Wed, Mar 16, 2016 at 4:39 AM, Gerald-G wrote: > >

Re: Apache Flink

2016-04-17 Thread Corey Nolet
One thing I've noticed about Flink in my following of the project has been that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel improv

Re: Apache Flink

2016-04-17 Thread Corey Nolet
> Dr Mich Talebzadeh > > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > > > On

Re: Apache Flink

2016-04-17 Thread Corey Nolet
you don't care what each individual > event/tuple does, e.g. if you push different event types to separate kafka > topics and all you care is to do a count, what is the need for single event > processing. > > On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet wrote: > >> i ha

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David, Thank you very much for announcing this! It looks like it could be very useful. Would you mind providing a link to the github? On Tue, Jan 12, 2016 at 10:03 AM, David wrote: > Hi all, > > I'd like to share news of the recent release of a new Spark package, ROSE. > > > ROSE is a Scala lib

Shuffle memory woes

2016-02-05 Thread Corey Nolet
I just recently had a discovery that my jobs were taking several hours to complete because of excess shuffle spills. What I found was that when I hit the high point where I didn't have enough memory for the shuffles to store all of their file consolidations at once, it could spill so many times t

Re: Shuffle memory woes

2016-02-06 Thread Corey Nolet
o have more partitions > play with shuffle memory fraction > > in spark 1.6 cache vs shuffle memory fractions are adjusted automatically > > On 5 February 2016 at 23:07, Corey Nolet wrote: > >> I just recently had a discovery that my jobs were taking several hours to >> compl

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Corey Nolet
The whole purpose of Apache mailing lists is that the messages get indexed all over the web so that discussions and questions/solutions can be searched easily by google and other engines. For this reason, and the messages being sent via email as Steve pointed out, it's just not possible to retract

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
n > if map side is ok, and you just reducing by key or something it should be > ok, so some detail is missing...skewed data? aggregate by key? > > On 6 February 2016 at 20:13, Corey Nolet wrote: > >> Igor, >> >> Thank you for the response but unfortunately, the pro

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
ot of children and doesn't even run concurrently with any other stages so I ruled out the concurrency of the stages as a culprit for the shuffling problem we're seeing. On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet wrote: > Igor, > > I don't think the question is "why can

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
: >>"The dataset is 100gb at most, the spills can up to 10T-100T", Are >> your input files lzo format, and you use sc.text() ? If memory is not >> enough, spark will spill 3-4x of input data to disk. >> >> >> -- 原始邮件

Re: Shuffle memory woes

2016-02-08 Thread Corey Nolet
ople will say. > Corey do you have presentation available online? > > On 8 February 2016 at 05:16, Corey Nolet wrote: > >> Charles, >> >> Thank you for chiming in and I'm glad someone else is experiencing this >> too and not just me. I know very well how the Spark

Shuffle guarantees

2016-03-01 Thread Corey Nolet
So if I'm using reduceByKey() with a HashPartitioner, I understand that the hashCode() of my key is used to create the underlying shuffle files. Is anything other than hashCode() used in the shuffle files when the data is pulled into the reducers and run through the reduce function? The reason I'm

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
How can this be assumed if the object used for the key, for instance, in the case where a HashPartitioner is used, cannot assume ordering and therefore cannot assume a comparator can be used? On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet wrote: > So if I'm using reduceByKey() with a HashPa

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
Nevermind, a look @ the ExternalSorter class tells me that the iterator for each key that's only partially ordered ends up being merge sorted by equality after the fact. Wanted to post my finding on here for others who may have the same questions. On Tue, Mar 1, 2016 at 3:05 PM, Corey

MapType vs StructType

2015-07-17 Thread Corey Nolet
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for some reason, the "inferSchema" tools in Spark SQL extracts the schema of nested JSON objects as StructTypes. This makes it really confusing when trying to rectify the object hierarchy when I have maps because the Catalyst c

Re: MapType vs StructType

2015-07-17 Thread Corey Nolet
we don't support union types). JSON doesn't have >> differentiated data structures so we go with the one that gives you more >> information when doing inference by default. If you pass in a schema to >> JSON however, you can override this and have a JSON object parsed as a map. &g
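
A hedged sketch of the override being described (Spark 1.4-era DataFrame reader; the field names are made up): supplying an explicit schema makes the nested object come back as a MapType instead of the inferred StructType.

```scala
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Assumes an existing SQLContext named sqlContext.
val schema = StructType(Seq(
  StructField("id", StringType),
  // Parse the nested "attributes" object as a map, not a struct.
  StructField("attributes", MapType(StringType, StringType))
))

val df = sqlContext.read.schema(schema).json("/path/to/events.json")
df.printSchema()
```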

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Corey Nolet
Elkhan, What does the ResourceManager say about the final status of the job? Spark jobs that run as Yarn applications can fail but still successfully clean up their resources and give them back to the Yarn cluster. Because of this, there's a difference between your code throwing an exception in a

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Corey Nolet
I can only give you a general overview of how the Yarn integration works from the Scala point of view. Hope this helps. > Yarn related logs can be found in RM ,NM, DN, NN log files in detail. > > Thanks again. > > On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet wrote: > >>

SparkConf "ignoring" keys

2015-08-05 Thread Corey Nolet
I've been using SparkConf on my project for quite some time now to store configuration information for its various components. This has worked very well thus far in situations where I have control over the creation of the SparkContext & the SparkConf. I have run into a bit of a problem trying to i

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned around the workers in an all-to-all fashion. 2) Multi-stage jobs that would normally require several map reduce jobs, thus causing data to be dumped to disk between the jobs can be cached in memory.
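
A minimal sketch of point 2, assuming an existing SparkContext named sc and placeholder paths: the parsed dataset is cached once and reused by two downstream jobs instead of being re-read from disk between them.

```scala
val parsed = sc.textFile("/data/logs")
  .map(_.split("\t"))
  .cache()                                  // kept in memory for both jobs below

val byUser = parsed.map(f => (f(0), 1L)).reduceByKey(_ + _)   // first shuffle stage
val byPage = parsed.map(f => (f(1), 1L)).reduceByKey(_ + _)   // reuses the cached data

byUser.saveAsTextFile("/out/by-user")
byPage.saveAsTextFile("/out/by-page")
```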

Re: What is the reason for ExecutorLostFailure?

2015-08-18 Thread Corey Nolet
Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has occurred for one reason or another on an executor. It's possible your memory settings are too small per executor or the concurrent number of tasks you ar

Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client API so the problem with trying to schedule Spark tasks against it is that the tasks themselves cannot be scheduled locally on nodes containing query results- which means you can only assume most results will be sent over th

Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
I'm trying to use the MatrixFactorizationModel to, for instance, determine the latent factors of a user or item that were not used in the training data of the model. I'm not as concerned about the rating as I am with the latent factors for the user/item. Thanks!

Re: Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
tions until the other user gets worked into the model. On Mon, Nov 27, 2017 at 3:08 PM, Corey Nolet wrote: > I'm trying to use the MatrixFactorizationModel to, for instance, determine > the latent factors of a user or item that were not used in the training > data of the model. I

PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already. I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1TB of r

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
ark-on-a-single-node-machine.html > > regars, > > 2018-05-23 22:30 GMT+02:00 Corey Nolet : > >> Please forgive me if this question has been asked already. >> >> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if >> anyone know

Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD call. After the output format runs, I want to do some directory cleanup and I want to block while I'm doing that. Is that something I can do inside of this function? If not, where would I accomplish this on every micr

Re: Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
It does look like the function that's executed is in the driver, so doing an Await.result() on a thread AFTER I've executed an action should work. Just updating this here in case anyone has this question in the future. Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD cal

yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm noticing the jvm that fires up to allocate the resources, etc... is not going away after the application master and executors have been allocated. Instead, it just sits there printing 1 second status updates to the console. I

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
yza wrote: > Hi Corey, > > As of this PR https://github.com/apache/spark/pull/5297/files, this can > be controlled with spark.yarn.submit.waitAppCompletion. > > -Sandy > > On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote: > >> I am submitting jobs to my yar

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
b.com/apache/spark/pull/5403 > > > > On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote: > >> Is it possible to configure Spark to do all of its shuffling FULLY in >> memory (given that I have enough memory to store all the data)? >> >> >> >> >

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
fer cache and > not ever touch spinning disk if it is a size that is less than memory > on the machine. > > - Patrick > > On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > > So with this... to help my understanding of Spark under the hood- > > > > Is this sta

Using spark.hadoop.* to set Hadoop properties

2015-06-17 Thread Corey Nolet
I've become accustomed to being able to use system properties to override properties in the Hadoop Configuration objects. I just recently noticed that when Spark creates the Hadoop Configuration in the SparkContext, it cycles through any properties prefixed with spark.hadoop. and adds those properti
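
A small sketch of the behavior being described: any SparkConf key prefixed with spark.hadoop. is copied, with the prefix stripped, into the Hadoop Configuration the SparkContext builds. The property chosen here is just an example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hadoop-props-example")
  .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "134217728")

val sc = new SparkContext(conf)

// The property shows up, without the prefix, in the Hadoop Configuration:
println(sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.maxsize"))
```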

Executor memory allocations

2015-06-17 Thread Corey Nolet
So I've seen in the documentation that (after the overhead memory is subtracted), the memory allocations of each executor are as follows (assume default settings): 60% for cache 40% for tasks to process data Reading about how Spark implements shuffling, I've also seen it say "20% of executor mem
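
Back-of-the-envelope arithmetic for the pre-1.6 (legacy) memory regions the question refers to, using the default fractions; the heap size is illustrative and the per-region safety fractions are ignored for simplicity.

```scala
val executorHeapGb  = 10.0
val storageFraction = 0.6   // spark.storage.memoryFraction (RDD cache)
val shuffleFraction = 0.2   // spark.shuffle.memoryFraction (shuffle buffers)

val cacheGb   = executorHeapGb * storageFraction      // 6.0 GB for cached blocks
val shuffleGb = executorHeapGb * shuffleFraction      // 2.0 GB for shuffle aggregation before spilling
val taskGb    = executorHeapGb - cacheGb - shuffleGb  // ~2.0 GB left for task objects

println(f"cache=$cacheGb%.1f GB, shuffle=$shuffleGb%.1f GB, tasks=$taskGb%.1f GB")
```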

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Corey Nolet
An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov wrote: > Hi all, > > Is there any way running Spark job in programmatic way on Yarn cluster > without using

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
aster("local") > >.setConf(SparkLauncher.DRIVER_MEMORY, "2g") > > .launch(); > > spark.waitFor(); > >} > > } > > } > > > > On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet wrote: > >> An example of being able to do thi

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: > I got the same problem with rdd.repartition() in my streaming app, which > generated a few huge partitions and many tiny partitions. The resulting > high data skew makes the processing time of a batch unpre

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341 On Thu, Jun 18, 2015 at 7:55 PM, Du Li wrote: > repartition() means coalesce(shuffle=false) > > > > On Thursday, June 18, 2015 4:07 PM, Corey Nolet > wrote: > > > Doesn't repart
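
A short sketch of the distinction the linked source settles, assuming an existing SparkContext named sc; sizes are illustrative.

```scala
val skewed = sc.parallelize(1 to 1000000, 200)

// repartition(n) is defined in RDD.scala as coalesce(n, shuffle = true):
// a full shuffle that spreads records evenly over the new partitions.
val even = skewed.repartition(50)

// coalesce(n, shuffle = false) only merges existing partitions locally,
// which is where the huge-partition skew in this thread can come from.
val merged = skewed.coalesce(50, shuffle = false)

println(even.partitions.length, merged.partitions.length)
```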

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data pass

Re: Grouping elements in a RDD

2015-06-20 Thread Corey Nolet
If you use rdd.mapPartitions(), you'll be able to get a hold of the iterators for each partition. Then you should be able to do iterator.grouped(size) on each of the partitions. I think it may mean you have 1 group at the end of each partition that may have less than "size" elements. If that's oka
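
A minimal sketch of that suggestion, assuming an existing RDD named rdd and a hypothetical processBatch helper:

```scala
val size = 30
val results = rdd.mapPartitions { iter =>
  // grouped() is lazy: it yields Seqs of up to `size` elements; the final
  // group in each partition may be smaller.
  iter.grouped(size).map(group => processBatch(group))
}
```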

Reducer memory usage

2015-06-21 Thread Corey Nolet
I've seen a few places where it's been mentioned that after a shuffle each reducer needs to pull its partition into memory in its entirety. Is this true? I'd assume the merge sort that needs to be done (in the cases where sortByKey() is not used) wouldn't need to pull all of the data into memory at

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time, each record is being pulled 1 at a time. That's the beauty of exposing Iterators & Iterables in an API rather than collections- the

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
e chunking of the data in the partition (fetching more than 1 record @ a time). On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet wrote: > I don't know exactly what's going on under the hood but I would not assume > that just because a whole partition is not being pulled into memory @
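
A tiny sketch of the streaming behavior being described, assuming an RDD of strings: returning a lazy Iterator from mapPartitions means records flow through as the consumer pulls them rather than the whole partition being materialized.

```scala
val lengths = rdd.mapPartitions { iter =>
  // iter.map is lazy: each record is read, transformed, and handed downstream
  // on demand; nothing here buffers the full partition.
  iter.map(record => record.length)
}
```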

Spark eating exceptions in multi-threaded local mode

2014-12-16 Thread Corey Nolet
I've been running a job in local mode using --master local[*] and I've noticed that, for some reason, exceptions appear to get eaten- as in, I don't see them. If i debug in my IDE, I'll see that an exception was thrown if I step through the code but if I just run the application, it appears everyth

Re: When will spark 1.2 released?

2014-12-19 Thread Corey Nolet
> The dates of the jars were still of Dec 10th. I figured that was because the jars were staged in Nexus on that date (before the vote). On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu wrote: > > Looking at: > http://search.maven.org/#browse%7C717101892 > > The dates of the jars were still of Dec 10th.

Using "SparkSubmit.main()" to submit SparkContext in web application

2014-12-20 Thread Corey Nolet
I am looking to run a SparkContext in a web application that is outside of my Spark cluster. I understand that I can use the "client" deployment mode and use the spark-submit script that ships with Spark but I'm really interested in running this inside of a SpringWeb application that can be starte

Submit spark jobs inside web application

2014-12-29 Thread Corey Nolet
I want to have a SparkContext inside of a web application running in Jetty that I can use to submit jobs to a cluster of Spark executors. I am running YARN. Ultimately, I would love it if I could just use something like SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp

How to tell if RDD no longer has any children

2014-12-29 Thread Corey Nolet
Let's say I have an RDD which gets cached and has two children which do something with it: val rdd1 = ...cache() rdd1.saveAsSequenceFile() rdd1.groupBy()..saveAsSequenceFile() If I were to submit both calls to saveAsSequenceFile() in threads to take advantage of concurrency (where possi
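
A hedged sketch of submitting both actions concurrently so they can share the cached parent. parse and keyOf are hypothetical functions, saveAsTextFile stands in for the sequence-file calls, and the paths are placeholders.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.textFile("/data/in").map(parse).cache()

// Each action is submitted from its own thread; the scheduler runs the jobs
// concurrently and both reuse rdd1's cached partitions once materialized.
val f1 = Future { rdd1.saveAsTextFile("/out/raw") }
val f2 = Future { rdd1.groupBy(keyOf).saveAsTextFile("/out/grouped") }

Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)
```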

Cached RDD

2014-12-29 Thread Corey Nolet
If I have 2 RDDs which depend on the same RDD like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()... val rdd3 = rdd1.groupBy()... If I don't cache rdd1, will its lineage be calculated twice (once for rdd2 and once for rdd3)?

Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spa

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet wrote: > I'm trying to get a SparkContext going

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet wrote: > Looking a little closer @ the launch_container.sh file, it appears to be > adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in > the directory pointed

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
my findings. On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet wrote: > .. and looking even further, it looks like the actual command that's > executed starting up the JVM to run the > org.apache.spark.deploy.yarn.ExecutorLauncher is passing in "--class > 'notused'

Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Corey Nolet
lient driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet wrote: > So looking @ the actual code- I see where it looks like --class 'notused' > --jar null is set on the ClientBase.scala

Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed ha

Re: Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
lineages. What's strange is that this bug only surfaced when I updated Spark. On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet wrote: > We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework > that we've been developing that connects various different RDDs together

What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
> Looks like the number of skipped stages couldn't be formatted. > > Cheers > > On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet wrote: > >> We just upgraded to Spark 1.2.0 and we're seeing this in the UI. >> > >

Re: Web Service + Spark

2015-01-09 Thread Corey Nolet
Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...) and if you are using a cluster manager like Yarn or Mesos vs. just firing up your own executors and master. I recently worked on an example for deploying Spark services insid

Submitting SparkContext and seeing driverPropsFetcher exception

2015-01-09 Thread Corey Nolet
I'm seeing this exception when creating a new SparkContext in YARN: [ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] <- [akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [ akka.remote.ShutDo

Custom JSON schema inference

2015-01-14 Thread Corey Nolet
I'm working with RDD[Map[String,Any]] objects all over my codebase. These objects were all originally parsed from JSON. The processing I do on RDDs consists of parsing json -> grouping/transforming dataset into a feasible report -> outputting data to a file. I've been wanting to infer the schemas

Re: Accumulators

2015-01-14 Thread Corey Nolet
Just noticed an error in my wording. Should be " I'm assuming it's not immediately aggregating on the driver each time I call the += on the Accumulator." On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet wrote: > What are the limitations of using Accumulators to get a unio

Accumulators

2015-01-14 Thread Corey Nolet
What are the limitations of using Accumulators to get a union of a bunch of small sets? Let's say I have an RDD[Map{String,Any} and i want to do: rdd.map(accumulator += Set(_.get("entityType").get)) What implication does this have on performance? I'm assuming it's not immediately aggregating ea
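
Two hedged sketches of collecting a small union of values with Spark 1.x APIs, assuming rdd is the RDD[Map[String, Any]] from the question; only the "entityType" field name comes from the thread.

```scala
import scala.collection.mutable

// (a) Accumulable backed by a mutable Set: updates made in tasks are merged
// on the driver when each task finishes, not on every `+=` call.
val types = sc.accumulableCollection(mutable.Set[String]())
rdd.foreach(m => types += m("entityType").toString)
println(types.value)

// (b) Aggregation alternative: build the sets on the executors and ship only
// the merged result back to the driver.
val types2 = rdd.aggregate(Set.empty[String])(
  (acc, m) => acc + m("entityType").toString,
  _ ++ _)
```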

Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Corey Nolet
I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates do

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Corey Nolet
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh wrote: > The question is about the ways to create a Windows desktop-based and/or > web-based application

Re: Spark SQL Custom Predicate Pushdown

2015-01-16 Thread Corey Nolet
dicate Push Down: > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala > > > > Examples also can be found in the unit test: > > > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/o

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
ty. I have an example [1] of what I'm trying to accomplish. [1] https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49 On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet wrote: > Hao, > >

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause but rather just the columns explicitly included in the WHERE clause. Examples from my testing: SELECT * FROM myTable --> The required columns are
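
A hedged sketch of the relation shape being discussed, written against the 1.3+ trait form of the data-sources API (in 1.2 these were abstract classes); the Accumulo-specific scan logic is left as a placeholder.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType

class EventStoreRelation(override val sqlContext: SQLContext,
                         override val schema: StructType)
    extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Filters are advisory: push down the ones the store can use and ignore
    // the rest (Spark re-evaluates them after the scan anyway).
    filters.foreach {
      case EqualTo(attr, value)     => // e.g. seek an index for attr == value
      case GreaterThan(attr, value) => // e.g. restrict the scan range
      case _                        => // unsupported: leave for Spark
    }
    sqlContext.sparkContext.emptyRDD[Row]   // placeholder scan result
  }
}
```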

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
, Jan 17, 2015 at 4:29 PM, Michael Armbrust wrote: > How are you running your test here? Are you perhaps doing a .count()? > > On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet wrote: > >> Michael, >> >> What I'm seeing (in Spark 1.2.0) is that the required

[SQL] Conflicts in inferred Json Schemas

2015-01-21 Thread Corey Nolet
Let's say I have 2 formats for JSON objects in the same file: schema1 = { "location": "12345 My Lane" } schema2 = { "location":{"houseAddress":"1234 My Lane"} } From my tests, it looks like the current inferSchema() function will end up with only StructField("location", StringType). What would be

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property "spark.shuffle.service.enabled" in the spark-defaults.conf. The code mentions t
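
A hedged sketch of the Spark-side settings for the external shuffle service; the NodeManager side also needs the spark_shuffle aux-service (class org.apache.spark.network.yarn.YarnShuffleService) registered in yarn-site.xml, with the Spark YARN shuffle jar on its classpath.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")      // use the NodeManager-hosted shuffle service
  .set("spark.dynamicAllocation.enabled", "true")    // the main reason to enable it on YARN
```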

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
51 AM, Corey Nolet wrote: > I need to be able to take an input RDD[Map[String,Any]] and split it into > several different RDDs based on some partitionable piece of the key > (groups) and then send each partition to a separate set of files in > different folders in HDFS. > > 1

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
y-spark-one-spark-job > > On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet wrote: > >> I need to be able to take an input RDD[Map[String,Any]] and split it into >> several different RDDs based on some partitionable piece of the key >> (groups) and then send each partition to

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote: > In all of the solutions I've found thus far, sorting h

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
e/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet wrote: > I'm looking @ the ShuffledRDD code and it looks like there is a method > setKeyOrdering()- is this guaranteed to order everything in the partition? > I'm on S

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached datasets, do small groups and transforms, and then call saveAsSequenceFile() on them. Each call to save as a sequence file appears to have done its work, the task says it completed in "xxx.x seconds" but then it pauses

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in Yarn and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my R

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
.org/jira/browse/SPARK-2996 - only works for YARN). >> Also thread at >> http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html >> . >> >> HTH, >> Markus >> >> On 02/03/2015 11:20 PM, Corey Nolet wrot

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
ith Spark 1.1 and earlier you'd get >> Guava 14 from Spark, so still a problem for you). >> >> Right now, the option Markus mentioned >> (spark.yarn.user.classpath.first) can be a workaround for you, since >> it will place your app's jars before Yarn's on the classpath. >> >
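
A small sketch of the workaround mentioned in the quoted reply, for a Spark-on-YARN 1.2-era deployment: put the application's jars ahead of the cluster's so the app's Guava version wins.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.user.classpath.first", "true")
// Spark 1.3+ offers the more general equivalents:
//   spark.driver.userClassPathFirst / spark.executor.userClassPathFirst
```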

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake Marcelo, I was looking at the wrong message. That reply was meant for Bo Yang. On Feb 4, 2015 4:02 PM, "Marcelo Vanzin" wrote: > Hi Corey, > > On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet wrote: > >> Another suggestion is to build Spark by yourself. >

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke wrote: > If you want to design something like Spark shell have a look at: > > http://zeppelin-project.

Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am using an input format to load data from Accumulo [1] in to a Spark RDD. It looks like something is happening in the serialization of my output writable between the time it is emitted from the InputFormat and the time it reaches its destination on the driver. What's happening is that the resul

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet wrote: > I am using an input format to load data from Accumulo [1] in to a Spark > RD

Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word "partition" here is a tad different than the term "partition" that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, If I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra wrote: > rdd.mapPartitions { iter => > val grouped = iter.grouped(batchSize) > for (group <- grouped) { ... } > } > > On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the "spark.kryo.registrator" property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet wrote: > I'm trying to register a custom class that extends Kryo's Serializer > interface. I can
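
A minimal sketch of that approach; MyEvent and MyEventSerializer are hypothetical stand-ins for the custom class and its Kryo Serializer.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Pair the class with its custom Serializer implementation.
    kryo.register(classOf[MyEvent], new MyEventSerializer)
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
```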

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
group should need to fit. > > On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet wrote: > >> Doesn't iter still need to fit entirely into memory? >> >> On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra >> wrote: >> >>> rdd.mapPartitions { iter =

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
ng all the data to a single partition (no matter what window I set) and it seems to lock up my jobs. I waited for 15 minutes for a stage that usually takes about 15 seconds and I finally just killed the job in yarn. On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet wrote: > So I tried this: > >

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current implementation

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
tDate).toDate > }.getOrElse() > val end = filters.find { > case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate > }.getOrElse() > > ... > > Filters are advisory, so you can ignore ones that aren't start/end. > > Michael > > On

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Ok. I just verified that this is the case with a little test: WHERE (a = 'v1' and b = 'v2') -> PrunedFilteredScan passes down 2 filters; WHERE (a = 'v1' and b = 'v2') or (a = 'v3') -> PrunedFilteredScan passes down 0 filters. On Fri, Feb 13, 2015

SparkSQL doesn't seem to like "$"'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St'" It's telling me $ is invalid. Is this a bug?
