at 2:19 PM, Corey Nolet cjno...@gmail.com wrote:
Is it possible to select if, say, there was an addresses field that had a
json array?
You can get the Nth item by address.getItem(0). If you want to walk
through the whole array, look at LATERAL VIEW EXPLODE in HiveQL.
).collect()
res0: Array[org.apache.spark.sql.Row] = Array([John])
This will show duplicate rows for people who have more than one matching address.
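For what it's worth, a minimal sketch of both approaches, assuming a registered table people whose addresses field is an array of structs with a number field (all names illustrative):

// indexing the array directly
hctx.sql("SELECT name, addresses[0].number FROM people").collect()

// walking the whole array; each element becomes its own row,
// which is where the duplicates come from
hctx.sql("""
  SELECT name, addr.number
  FROM people
  LATERAL VIEW explode(addresses) a AS addr
  WHERE addr.number = 2300
""").collect()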
On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote:
So it wouldn't be possible to have a json string like this:
{ "name": "John", "age": 53, "locations
Am I able to do a join on an exploded field?
Like if I have another object:
{ "streetNumber": 2300, "locationName": "The Big Building" } and I want to
join with the previous json by the locations[].number field - is that
possible?
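A hedged sketch of such a join, assuming both JSON datasets are registered as tables (people with a locations array, buildings with streetNumber and locationName; all names illustrative). Flattening the array in a subquery first keeps the join condition simple:

hctx.sql("""
  SELECT p.name, b.locationName
  FROM (
    SELECT name, loc.number AS locNumber
    FROM people
    LATERAL VIEW explode(locations) l AS loc
  ) p
  JOIN buildings b ON p.locNumber = b.streetNumber
""").collect()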
On Tue, Oct 28, 2014 at 9:31 PM, Corey Nolet cjno...@gmail.com wrote
$QueryExecution.sparkPlan(SQLContext.scala:400)
On Tue, Oct 28, 2014 at 10:48 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you println the .queryExecution of the SchemaRDD?
On Tue, Oct 28, 2014 at 7:43 PM, Corey Nolet cjno...@gmail.com wrote:
So this appears to work just fine:
hctx.sql(SELECT
Hongbin,
Please send an email to user-unsubscr...@spark.apache.org in order to
unsubscribe from the user list.
On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote:
Apology for having to send to all.
I am highly interested in spark, would like to stay in this mailing
Michael,
I should probably look closer myself @ the design of 1.2 vs 1.1, but I've
been curious why Spark's in-memory data uses the heap instead of putting it
off-heap. Was this the optimization that was done in 1.2 to alleviate GC?
On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari
I'm fairly new to spark and I'm trying to kick the tires with a few
InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf
instead of a MapReduce Job object. Is there future planned support for the
mapreduce packaging?
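For what it's worth, sc.newAPIHadoopRDD() already covers the mapreduce package; a sketch that sidesteps the Job-based static helpers by setting the underlying keys on a plain Configuration (paths illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()
// equivalent of FileInputFormat.addInputPath(job, ...) without a Job object
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/in")
val rdd = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])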
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD.
Creating the new RDD works fine but setting up the configuration file via
the static methods on input formats that require a Hadoop Job object is
proving to be difficult.
Trying to new up my own Job object with the
, Corey Nolet cjno...@gmail.com wrote:
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD.
Creating the new RDD works fine but setting up the configuration file via
the static methods on input formats that require a Hadoop Job object is
proving to be difficult.
Trying
place that there is a problem
is 'ln.streetnumber, which prevents the rest of the query from resolving.
If you look at the subquery ln, it is only producing two columns:
locationName and locationNumber. So streetnumber is not valid.
On Tue, Oct 28, 2014 at 8:02 PM, Corey Nolet cjno...@gmail.com
I'm loading sequence files containing json blobs in the value, transforming
them into RDD[String] and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice- once when I define the jsonRDD() and then
again when I actually make my call to hiveContext.sql().
Looking @ the
Abdul,
Please send an email to user-unsubscr...@spark.apache.org
On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote:
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional
assigning the object to a
temporary variable.
Matei
On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote:
The closer I look @ the stack trace in the Scala shell, the more it appears
to be the call to toString() that is causing the construction of the Job
object to fail. Is there a way
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
propagating these down to the tasks.
I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached
Reading the documentation a little more closely, I'm using the wrong
terminology. I'm using stages to refer to what spark is calling a job. I
guess application (more than one spark context) is what I'm asking about
On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote:
I've read
I've been running a job in local mode using --master local[*] and I've
noticed that, for some reason, exceptions appear to get eaten- as in, I
don't see them. If I debug in my IDE, I'll see that an exception was thrown
if I step through the code but if I just run the application, it appears
The dates of the jars were still of Dec 10th.
I figured that was because the jars were staged in Nexus on that date
(before the vote).
On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu yuzhih...@gmail.com wrote:
Looking at:
http://search.maven.org/#browse%7C717101892
The dates of the jars were
I want to have a SparkContext inside of a web application running in Jetty
that I can use to submit jobs to a cluster of Spark executors. I am running
YARN.
Ultimately, I would love it if I could just use something like
SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp
Let's say I have an RDD which gets cached and has two children which do
something with it:
val rdd1 = ... .cache()
rdd1.saveAsSequenceFile()
rdd1.groupBy(...).saveAsSequenceFile()
If I were to submit both calls to saveAsSequenceFile() in thread to take
advantage of concurrency (where
If I have 2 RDDs which depend on the same RDD like the following:
val rdd1 = ...
val rdd2 = rdd1.groupBy()...
val rdd3 = rdd1.groupBy()...
If I don't cache rdd1, will its lineage be calculated twice (once for rdd2
and once for rdd3)?
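A minimal sketch of the cached alternative (transformations illustrative): without the cache(), each groupBy recomputes rdd1's lineage from the source.

val rdd1 = sc.textFile("hdfs:///data/in").map(_.split("\t")).cache()
val rdd2 = rdd1.groupBy(_.head)   // reuses the cached blocks
val rdd3 = rdd1.groupBy(_.last)   // ditto, no second pass over the source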
Looking a little closer @ the launch_container.sh file, it appears to be
adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
the directory pointed to by PWD. Any ideas?
On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:
I'm trying to get a SparkContext
I'm trying to get a SparkContext going in a web container which is being
submitted through yarn-client. I'm trying two different approaches and both
seem to be resulting in the same error from the yarn nodemanagers:
1) I'm newing up a spark context directly, manually adding all the lib jars
from
2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote:
.. and looking even further, it looks like the actual command that's
executed starting up the JVM to run the
org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
'notused' --jar null.
I would assume this isn't expected
they aren't making it through.
On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:
Looking a little closer @ the launch_container.sh file, it appears to be
adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
the directory pointed to by PWD. Any ideas?
On Fri, Jan
driver application.
Here's the example code on github:
https://github.com/cjnolet/spark-jetty-server
On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote:
So looking @ the actual code- I see where it looks like --class 'notused'
--jar null is set on the ClientBase.scala when yarn
I'm having a really bad dependency conflict right now with Guava versions
between my Spark application in Yarn and (I believe) Hadoop's version.
The problem is, my driver has the version of Guava which my application is
expecting (15.0) while it appears the Spark executors that are working on
my
My mistake Marcelo, I was looking at the wrong message. That reply was
meant for Bo Yang.
On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Corey,
On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote:
Another suggestion is to build Spark by yourself
works for YARN).
Also thread at
http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html
.
HTH,
Markus
On 02/03/2015 11:20 PM, Corey Nolet wrote:
I'm having a really bad dependency conflict right now with Guava versions
between my Spark
Here's another lightweight example of running a SparkContext in a common
java servlet container: https://github.com/calrissian/spark-jetty-server
On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com
wrote:
If you want to design something like the Spark shell, have a look at:
I'm working with RDD[Map[String,Any]] objects all over my codebase. These
objects were all originally parsed from JSON. The processing I do on RDDs
consists of parsing json -> grouping/transforming the dataset into a feasible
report -> outputting data to a file.
I've been wanting to infer the schemas
, Jan 17, 2015 at 4:29 PM, Michael Armbrust mich...@databricks.com
wrote:
How are you running your test here? Are you perhaps doing a .count()?
On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote:
Michael,
What I'm seeing (in Spark 1.2.0) is that the required columns being
Michael,
What I'm seeing (in Spark 1.2.0) is that the required columns being pushed
down to the DataRelation are not the product of the SELECT clause but
rather just the columns explicitly included in the WHERE clause.
Examples from my testing:
SELECT * FROM myTable -- The required columns are
I have document storage services in Accumulo that I'd like to expose to
Spark SQL. I am able to push down predicate logic to Accumulo to have it
perform only the seeks necessary on each tablet server to grab the results
being asked for.
I'm interested in using Spark SQL to push those predicates
There's also an example of running a SparkContext in a java servlet
container from Calrissian: https://github.com/calrissian/spark-jetty-server
On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote:
The question is about the ways to create a Windows desktop-based and/or
Down:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
Examples also can be found in the unit test:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources
From: Corey Nolet
an example [1] of what I'm trying to accomplish.
[1]
https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49
On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet cjno...@gmail.com wrote:
Hao,
Thanks so much
Just noticed an error in my wording.
Should be: "I'm assuming it's not immediately aggregating on the driver
each time I call += on the Accumulator."
On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet cjno...@gmail.com wrote:
What are the limitations of using Accumulators to get a union of a bunch
What are the limitations of using Accumulators to get a union of a bunch of
small sets?
Let's say I have an RDD[Map[String,Any]] and I want to do:
rdd.map(accumulator += Set(_.get("entityType").get))
What implication does this have on performance? I'm assuming it's not
immediately aggregating
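A sketch of the set-union accumulator in question, using Spark 1.x's AccumulableParam (names illustrative). Updates are folded per task on the executors and merged on the driver as each task completes, so the aggregation is indeed not immediate:

import org.apache.spark.AccumulableParam

object SetParam extends AccumulableParam[Set[String], String] {
  // folded into the per-task set on the executors
  def addAccumulator(s: Set[String], v: String): Set[String] = s + v
  // merges per-task sets on the driver as tasks complete
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
  def zero(init: Set[String]): Set[String] = Set.empty[String]
}

val entityTypes = sc.accumulable(Set.empty[String])(SetParam)
rdd.foreach(m => entityTypes += m("entityType").toString)
// entityTypes.value is only meaningful on the driver, after the action has run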
I think the word "partition" here is a tad different than the term
"partition" that we use in Spark. Basically, I want something similar to
Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
want to run an algorithm that can be optimized by working on 30 people at a
time, I'd
Doesn't iter still need to fit entirely into memory?
On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com
wrote:
rdd.mapPartitions { iter =>
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) { ... }
}
On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno
We've been using commons configuration to pull our properties out of
properties files and system properties (prioritizing system properties over
others) and we add those properties to our spark conf explicitly and we use
ArgoPartser to get the command line argument for which property file to
load.
I was able to get this working by extending KryoRegistrator and setting the
spark.kryo.registrator property.
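A minimal sketch of that workaround (MyClass and MyClassSerializer are hypothetical placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // pair the class with its custom Kryo Serializer
    kryo.register(classOf[MyClass], new MyClassSerializer)
  }
}

// and on the SparkConf:
// conf.set("spark.kryo.registrator", classOf[MyRegistrator].getName)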
On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote:
I'm trying to register a custom class that extends Kryo's Serializer
interface. I can't tell exactly what Class
group should need to fit.
On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't iter still need to fit entirely into memory?
On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com
wrote:
rdd.mapPartitions { iter =>
val grouped = iter.grouped(batchSize
the
data to a single partition (no matter what window I set) and it seems to
lock up my jobs. I waited for 15 minutes for a stage that usually takes
about 15 seconds and I finally just killed the job in yarn.
On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet cjno...@gmail.com wrote:
So I tried
I have a temporal data set in which I'd like to be able to query using
Spark SQL. The dataset is actually in Accumulo and I've already written a
CatalystScan implementation and RelationProvider[1] to register with the
SQLContext so that I can apply my SQL statements.
With my current
I've read that this is supposed to be a rather significant optimization to
the shuffle system in 1.1.0 but I'm not seeing much documentation on
enabling this in Yarn. I see github classes for it in 1.2.0 and a property
spark.shuffle.service.enabled in the spark-defaults.conf.
The code mentions
I'm looking @ the ShuffledRDD code and it looks like there is a method
setKeyOrdering()- is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0
On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote:
In all of the soutions I've found thus far, sorting has been
, Corey Nolet cjno...@gmail.com wrote:
I need to be able to take an input RDD[Map[String,Any]] and split it into
several different RDDs based on some partitionable piece of the key
(groups) and then send each partition to a separate set of files in
different folders in HDFS.
1) Would running
I need to be able to take an input RDD[Map[String,Any]] and split it into
several different RDDs based on some partitionable piece of the key
(groups) and then send each partition to a separate set of files in
different folders in HDFS.
1) Would running the RDD through a custom partitioner be the
/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala
On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet cjno...@gmail.com wrote:
I'm looking @ the ShuffledRDD code and it looks like there is a method
setKeyOrdering()- is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0
On Wed
We have a series of spark jobs which run in succession over various cached
datasets, do small groups and transforms, and then call
saveAsSequenceFile() on them.
Each call to saveAsSequenceFile() appears to have done its work; the
task says it completed in xxx.x seconds, but then it pauses
Cui Lin,
The solution largely depends on how you want your services deployed (Java
web container, Spray framework, etc...) and if you are using a cluster
manager like Yarn or Mesos vs. just firing up your own executors and master.
I recently worked on an example for deploying Spark services
I'm seeing this exception when creating a new SparkContext in YARN:
[ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] -
[akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down
address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [
I'm trying to register a custom class that extends Kryo's Serializer
interface. I can't tell exactly what Class the registerKryoClasses()
function on the SparkConf is looking for.
How do I register the Serializer class?
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE
locations_$homeAddress = '123 Elm St'".
It's telling me $ is invalid. Is this a bug?
This doesn't seem to have helped.
On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com
wrote:
Try using `backticks` to escape non-standard characters.
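i.e., something along these lines against the query above (hedged sketch):

sqlContext.sql("SELECT * FROM myTable WHERE `locations_$homeAddress` = '123 Elm St'")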
On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote:
I don't remember Oracle ever enforcing that I
Nevermind- I think I may have had a schema-related issue (sometimes
booleans were represented as strings and sometimes as raw booleans, but when
I populated the schema one or the other was chosen).
On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote:
Here are the results
Here are the results of a few different SQL strings (let's assume the
schemas are valid for the data types used):
SELECT * from myTable where key1 = true - no filters are pushed to my
PrunedFilteredScan
SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is
pushed to my
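For context, a rough sketch of the relation shape in play here, against the 1.2-era data sources API (the schema and store access are illustrative placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

class MyRelation(val sqlContext: SQLContext) extends PrunedFilteredScan {
  override def schema = StructType(
    StructField("key1", BooleanType, true) :: StructField("key2", IntegerType, true) :: Nil)
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // a real implementation would translate `filters` into store-side seeks
    // and fetch only `requiredColumns`; placeholder empty result here
    sqlContext.sparkContext.parallelize(Seq.empty[Row])
  }
}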
I am able to get around the problem by doing a map and getting the Event
out of the EventWritable before I do my collect. I think I'll do that for
now.
On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote:
I am using an input format to load data from Accumulo [1] in to a Spark
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
that we've been developing that connects various different RDDs together
based on some predefined business cases. After updating to 1.2.0, some of
the concurrency expectations about how the stages within jobs are executed
:
Looks like the number of skipped stages couldn't be formatted.
Cheers
On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
lineages.
What's strange is that this bug only surfaced when I updated Spark.
On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet cjno...@gmail.com wrote:
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
that we've been developing that connects various different RDDs together
Given the following scenario:
dstream.map(...).filter(...).window(...).foreachRDD()
When would the onBatchCompleted fire?
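My understanding (hedged) is that it fires once all output operations for the batch, including the foreachRDD above, have finished. A sketch of hooking it:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    println(s"batch ${batch.batchInfo.batchTime} completed")
  }
})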
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched style of the
DStreams API.
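For what it's worth, a plain moving average can be phrased over a sliding window; a sketch assuming a DStream[Double] named metrics (window sizes illustrative):

import org.apache.spark.streaming.Seconds

val avg = metrics.map(v => (v, 1L))
  .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Seconds(300), Seconds(10))
  .map { case (sum, count) => sum / count }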
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno
be listening to a
partition.
Yes, my understanding is that multiple receivers in one group are the
way to consume a topic's partitions in parallel.
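A sketch of that pattern with the Spark 1.x receiver-based API (hosts, group, and topic names illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// one receiver per stream, all in the same consumer group, unioned afterwards
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
}
val unified = ssc.union(streams)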
On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet cjno...@gmail.com wrote:
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in
order
Thanks for taking this on Ted!
On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote:
I have created SPARK-6085 with pull request:
https://github.com/apache/spark/pull/4836
Cheers
On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:
+1 to a better default
Zhang zzh...@hortonworks.com
wrote:
Currently in spark, it looks like there is no easy way to know the
dependencies. It is solved at run time.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:
Ted. That one I know. It was the dependency part I
+1 to a better default as well.
We were working fine until we ran against a real dataset which was much
larger than the test dataset we were using locally. It took me a couple
days and digging through many logs to figure out this value was what was
causing the problem.
On Sat, Feb 28, 2015 at
if there was an
automatic partition reconfiguration function that automagically did that...
On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet cjno...@gmail.com wrote:
I *think* this may have been related to the default memory overhead
setting being too low. I raised the value to 1G and tried my job again
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in
order to parallelize data received from Kafka over multiple nodes. I notice
in [2], however, that one of the createConsumer() functions takes a
groupId. So am I understanding correctly that creating multiple DStreams
with the
?
spark.shuffle.service.enabled = true
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out how
they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB
:
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled = true
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out how
they are dying. I'm
-
but I have a suspicion this may have been the cause of the executors being
killed by the application master.
On Feb 23, 2015 5:25 PM, Corey Nolet cjno...@gmail.com wrote:
I've got the opposite problem with regards to partitioning. I've got over
6000 partitions for some of these RDDs which
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1 = sparkContext.sequenceFile().cache()
val rdd2 = rdd1.map()
How would I tell programmatically without being the one who built rdd1 and
rdd2 whether or not rdd2
I see the rdd.dependencies() function, does that include ALL the
dependencies of an RDD? Is it safe to assume I can say
rdd2.dependencies.contains(rdd1)?
On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm given 2 RDDs and told to store them in a sequence file
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a
the execution
if there are no shuffle dependencies in between RDDs.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1
be the behavior that myself and all my coworkers
expected.
On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote:
I should probably mention that my example case is much oversimplified-
Let's say I've got a tree, a fairly complex one where I begin a series of
jobs at the root which
in almost all cases. That much, I
don't know how hard it is to implement. But I speculate that it's
easier to deal with it at that level than as a function of the
dependency graph.
On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com wrote:
I'm trying to do the scheduling myself now
future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }
In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls
will happen concurrently.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote:
What confused me
:
/**
* Return information about what RDDs are cached, if they are in mem or
* on disk, how much space they take, etc.
*/
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo] = {
Cheers
On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote:
Zhan,
This is exactly what I'm
Spark uses a SerializableWritable [1] to java serialize writable objects.
I've noticed (at least in Spark 1.2.1) that it breaks down with some
objects when Kryo is used instead of regular java serialization. Though it
is wrapping the actual AccumuloInputFormat (another example of something
you
I would do a sum of squares. This would allow you to keep an ongoing value
as an associative operation (in an aggregator) and then calculate the
variance / std deviation after the fact.
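A sketch of that approach over an RDD[Double], using the identity Var(x) = E[x^2] - E[x]^2 (associative, so it parallelizes; less numerically stable than streaming algorithms like Welford's):

val (n, sum, sumSq) = data.aggregate((0L, 0.0, 0.0))(
  (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val mean = sum / n
val variance = sumSq / n - mean * mean
val stddev = math.sqrt(variance)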
On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang hw...@qilinsoft.com wrote:
Hi,
I have a DataFrame object and I
I want to use ARIMA for a predictive model so that I can take time series
data (metrics) and perform a light anomaly detection. The time series data
is going to be bucketed to different time units (several minutes within
several hours, several hours within several days, several days within
several
How hard would it be to expose this in some way? I ask because the current
textFile and objectFile functions are obviously at some point calling out
to a FileInputFormat and configuring it.
Could we get a way to configure any arbitrary inputformat / outputformat?
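In the meantime, the pair-RDD save methods do accept a pre-built Configuration, which gets most of the way there; a sketch (pairRdd and the config key are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.output.fileoutputformat.compress", "true")
pairRdd.saveAsNewAPIHadoopFile("hdfs:///data/out",
  classOf[NullWritable], classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]], conf)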
for ARIMA models?
On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched
If you return an Iterable, you are not tying the API to a CompactBuffer.
Someday, the data could be fetched lazily and the API would not have to
change.
On Apr 23, 2015 6:59 PM, Dean Wampler deanwamp...@gmail.com wrote:
I wasn't involved in this decision (I just make the fries), but
Giovanni,
The DAG can be walked by calling the dependencies() function on any RDD.
It returns a Seq containing the parent RDDs. If you start at the leaves
and walk through the parents until dependencies() returns an empty Seq, you
ultimately have your DAG.
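A sketch of that walk (a shared parent can appear more than once; de-dupe by id if that matters):

import org.apache.spark.rdd.RDD

def ancestors(rdd: RDD[_]): Seq[RDD[_]] =
  rdd.dependencies.map(_.rdd).flatMap(parent => parent +: ancestors(parent))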
On Sat, Apr 25, 2015 at 1:28 PM, Akhil
A tad off topic, but could still be relevant.
Accumulo's design is a tad different in the realm of being able to shard
and perform set intersections/unions server-side (through seeks). I've got
an adapter for Spark SQL on top of a document store implementation in
Accumulo that accepts the
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD call. After the output format runs, I want to do some directory
cleanup and I want to block while I'm doing that. Is that something I can
do inside of this function? If not, where would I accomplish this on every
It does look like the function that's executed is in the driver, so doing an
Await.result() on a thread AFTER I've executed an action should work. Just
updating this here in case anyone has this question in the future.
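A sketch of that shape (the output path and the cleanup body are illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

dstream.foreachRDD { rdd =>
  rdd.saveAsTextFile(s"/out/batch-${System.currentTimeMillis}")
  // this closure runs on the driver after the action returns, so blocking is safe
  Await.result(Future { /* directory cleanup via Hadoop FileSystem */ }, 1.minute)
}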
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD
tried this?
Within a window you would probably take the first x% as training and
the rest as test. I don't think there's a question of looking across
windows.
On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote:
Surprised I haven't gotten any responses about this. Has
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
I've seen a few places where it's been mentioned that after a shuffle each
reducer needs to pull its partition into memory in its entirety. Is this
true? I'd assume the merge sort that needs to be done (in the cases where
sortByKey() is not used) wouldn't need to pull all of the data into memory
of the data in the partition (fetching more than 1 record @ a time).
On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet cjno...@gmail.com wrote:
I don't know exactly what's going on under the hood but I would not assume
that just because a whole partition is not being pulled into memory @ one
time
I don't know exactly what's going on under the hood, but I would not assume
that just because a whole partition is not being pulled into memory @ one
time, that each record is being pulled 1 at a time. That's the
beauty of exposing Iterators/Iterables in an API rather than collections-
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341
On Thu, Jun 18, 2015 at 7:55 PM, Du Li l...@yahoo-inc.com.invalid wrote:
repartition() means coalesce(shuffle=true)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com
wrote:
Doesn't
I'm confused about this. The comment on the function seems to indicate
that there is absolutely no shuffle or network IO but it also states that
it assigns an even number of parent partitions to each final partition
group. I'm having trouble seeing how this can be guaranteed without some
data
at 5:51 PM, Corey Nolet cjno...@gmail.com wrote:
An example of being able to do this is provided in the Spark Jetty Server
project [1]
[1] https://github.com/calrissian/spark-jetty-server
On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi all,
Is there any way