>>> Hi Jonathan,
>>>
>>> you might be interested in
>>> https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and
>>> https://github.com/tresata/spark-sorted (not part of Spark, but it is
>>> available right now). Hopefully that's
Threads
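For concreteness, here's a minimal sketch of what "Threads" means here (app and object names are made up): kick off each action in its own thread via a Future, and the Spark scheduler will run the two jobs concurrently against the shared, cached RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ConcurrentActions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("concurrent-actions").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000).cache()

    // Each Future submits its action from a separate thread; the Spark
    // scheduler interleaves the two jobs over the same cached RDD.
    val count = Future { rdd.count() }
    val sum   = Future { rdd.map(_.toLong).reduce(_ + _) }

    println(Await.result(count, Duration.Inf)) // 1000
    println(Await.result(sum, Duration.Inf))   // 500500
    sc.stop()
  }
}
```

Setting spark.scheduler.mode=FAIR can help concurrent jobs share executors more evenly.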
On Friday, January 15, 2016, Kira wrote:
> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> Thank you,
> Regards
>
>
>
cture your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>
>> Threads
>>
>>
>> El viern
So I have the following...
broadcast some stuff
cache an RDD
do a bunch of stuff, eventually calling actions which reduce it to an
acceptable size
I'm getting an OOM on the driver (well, GC is getting out of control),
largely because I have a lot of partitions and it looks like the job
history
Reading the code, is there any reason why setting
spark.cleaner.ttl.MAP_OUTPUT_TRACKER directly won't get picked up?
2015-11-17 14:45 GMT-05:00 Jonathan Coveney <jcove...@gmail.com>:
> so I have the following...
>
> broadcast some stuff
> cache an rdd
> do a bunch of stuf
Additionally, I'm curious if there are any JIRAs around making DataFrames
support ordering better. There are a lot of operations that can be
optimized if you know that you have a total ordering on your data... are
there any plans, or at least JIRAs, around having the Catalyst optimizer
handle this?
Caused by: java.lang.ClassNotFoundException: scala.Some
indicates that you don't have the Scala libs present. How are you executing
this? My guess is the issue is a conflict between Scala 2.11.6 in your
build and 2.11.7. Not sure... try setting your Scala version to 2.11.7?
But really, first it'd be good
> you suggested. I am unclear as to why it works with 2.11.7 and not 2.11.6.
>
> Thanks,
> Babar
>
> On Mon, Nov 2, 2015 at 2:10 PM Jonathan Coveney <jcove...@gmail.com> wrote:
>
>> Caused by: java.lang.ClassNotFou
Do you have JAVA_HOME set to a Java 7 JDK?
2015-10-23 7:12 GMT-04:00 emlyn :
> xjlin0 wrote
> > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> > or without Hadoop or home compiled with ant or maven). There was no
> error
> > message in v1.4.x,
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly,
Nobody is saying not to use immutable data structures, only that Guava's
aren't natively supported.
Scala's default collections library is all immutable: List, Vector, Map.
This is what people generally use, especially in Scala code!
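A quick illustration of the point (plain Scala, nothing Spark-specific):

```scala
// Scala's default collections are immutable: operations return new
// collections and never modify the original.
val xs = List(1, 2, 3)
val ys = 0 :: xs          // new List(0, 1, 2, 3); xs is unchanged
val m  = Map("a" -> 1)
val m2 = m + ("b" -> 2)   // new Map with both entries; m is unchanged
println(xs) // List(1, 2, 3)
println(ys) // List(0, 1, 2, 3)
println(m2) // Map(a -> 1, b -> 2)
```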
On Tuesday, October 6, 2015, Jakub Dubovsky <
LZO files are not splittable by default, but there are projects with input
and output formats to make splittable LZO files. Check out Twitter's
elephant-bird on GitHub.
On Wednesday, October 7, 2015, Mohammed Guller
wrote:
> It is not uncommon to process datasets
You can put a class in the org.apache.spark namespace to access anything
that is private[spark]. You can then make enrichments there to access
whatever you need. Just beware upgrade pain :)
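A minimal sketch of that package trick. The member accessed here is illustrative (what is private[spark] varies by Spark version); the point is only that a file compiled into the org.apache.spark package can see private[spark] members.

```scala
// Compiled as part of your own project, but declared in Spark's package so
// that private[spark] members are visible from here.
package org.apache.spark

object MySparkInternals {
  // Visible here only because this object lives in the org.apache.spark
  // package; from any other package this would not compile.
  def env: SparkEnv = SparkEnv.get
}
```

The obvious caveat, as noted above: nothing private[spark] is a stable API, so expect breakage on upgrades.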
On Tuesday, October 6, 2015, Erwan ALLAIN
wrote:
> Hello,
>
> I'm
It's entirely conceivable to beat Spark in performance on tiny data sets
like this. That's not really what it has been optimized for.
On Tuesday, September 22, 2015, juljoin
wrote:
> Hello,
>
> I am trying to figure Spark out and I still have some
Having a file per record is pretty inefficient on almost any file system.
On Tuesday, September 22, 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> We are trying to load around 10k avro files (each file holds only one
> record) using spark-avro but it takes over 15
It worked for Twitter!
Seriously though: Scala is much, much more pleasant. And Scala has a great
story for using Java libs. And since Spark is kind of framework-y (use its
scripts to submit, start up the repl, etc.) the projects tend to be lead
projects, so even in a big company that uses Java the
Try adding the following to your build.sbt
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"
I believe that Spark shades the Scala library, and this is a library
that it looks like you need in an unshaded way.
2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
> scalaVersion := "2.11.6"
>
> libraryDependencies ++= Seq(
> "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
> "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
> "
You can make a Hadoop input format which passes through the name of the
file. I generally find it easier to just hit Hadoop, get the file names,
and construct the RDDs though
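A sketch of that second approach, assuming the standard Hadoop FileSystem API (the function name is made up): list the files yourself, then build one RDD per file so each record can be tagged with the file it came from.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One RDD per file under `dir`, with every line paired with its file name.
def perFileRdds(sc: SparkContext, dir: String): Seq[RDD[(String, String)]] = {
  val fs = FileSystem.get(new Configuration())
  fs.listStatus(new Path(dir))
    .filter(_.isFile)
    .map { status =>
      val name = status.getPath.toString
      sc.textFile(name).map(line => (name, line)) // tag each line with its file
    }
    .toSeq
}
```

You can then union the per-file RDDs, or keep them separate for per-file partitioning.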
On Tuesday, September 1, 2015, Matt K wrote:
> Just want to add - I'm looking to partition
Array[String] doesn't pretty-print by default. Use .mkString(",") for
example.
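For example:

```scala
// Array inherits Java's toString, so printing one shows a reference hash,
// not the contents. mkString renders the elements instead.
val arr = Array("a", "b", "c")
println(arr.mkString(","))            // a,b,c
println(arr.mkString("[", ", ", "]")) // [a, b, c]
```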
On Thursday, August 27, 2015, Arun Luthra <arun.lut...@gmail.com>
wrote:
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an RDD[Array[String]], but when I tried to read back the
I've used the instructions and it worked fine.
Can you post exactly what you're doing, and what it fails with? Or are you
just trying to understand how it works?
2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com:
Hello,
The instructions for building spark against scala-2.11
Put a log4j.properties file in conf/. You can copy
log4j.properties.template as a good base
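Concretely, from the root of the Spark distribution (the rootCategory line shown is just one common tweak):

```shell
# Copy the shipped template into place...
cp conf/log4j.properties.template conf/log4j.properties
# ...then edit conf/log4j.properties and adjust the level, e.g.:
#   log4j.rootCategory=WARN, console
```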
On Wednesday, July 29, 2015, canan chen <ccn...@gmail.com> wrote:
Anyone know how to set log level in spark-submit ? Thanks
That's great! Thanks
On Tuesday, July 28, 2015, Ted Yu <yuzhih...@gmail.com> wrote:
If I understand correctly, there would be one value in the executor.
Cheers
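A small sketch of the setup being asked about (names and data are illustrative). The broadcast value is shipped once per executor and deserialized into a single copy that all tasks in that JVM read:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("bcast-demo").setMaster("local[8]"))
    // Materialized once per executor JVM, not once per task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // All 8 task slots in this executor read the same lookup.value copy.
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value.getOrElse(k, 0))
      .sum()

    println(total) // 4.0
    sc.stop()
  }
}
```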
On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
I am running in coarse-grained mode, let's say with 8 cores per executor.
If I use a broadcast variable, will all of the tasks in that executor share
the same value? Or will each task broadcast its own value ie in this case,
would there be one value in the executor shared by the 8 tasks, or would
Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos
0.19.0)...
Regardless, I'm running into a really weird situation where when I pass
--jars to bin/spark-shell I can't reference those classes on the repl. Is
this expected? The logs even tell me that my jars have been added, and
A helpful example of how to convert:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
As far as performance, that depends on your data. If you have a lot of
columns and use all of them, parquet deserialization is expensive. If you
have a column and only need a few
Can you check your local and remote logs?
2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com:
This problem happen in Spark 1.3.1. It happen when two jobs are running
simultaneously each in its own Spark Context.
I don’t remember seeing this bug in Spark
As per my understanding, storing 5-minute files means we could not create
RDDs more granular than 5 minutes.
This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
about a workload where that's
relevant though, before going that route. Maybe if people are using
SSD's that would make sense.
- Patrick
On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com
wrote:
I'm surprised that I haven't been able to find this via google, but I
haven't
a filter on each RDD first? We do
not do this using Pig on M/R. Is it required in Spark world?
On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney jcove...@gmail.com
wrote:
My guess would be data skew. Do you know if there is some item id that is
a catch all? can it be null? item id 0? lots
My guess would be data skew. Do you know if there is some item id that is a
catch-all? Can it be null? Item id 0? Lots of data sets have this sort of
value, and it always kills joins.
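A sketch of the kind of pre-join filtering this suggests, assuming 0L is the catch-all id (all names and data here are made up): drop the hot key on both sides before joining, so one key can't pin all of its rows to a single task.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkewFilterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("skew-demo").setMaster("local[4]"))

    val views    = sc.parallelize(Seq((0L, "junk"), (1L, "view1"), (2L, "view2")))
    val listings = sc.parallelize(Seq((0L, "junk"), (1L, "listing1")))

    // Filter the catch-all key out of both sides before the shuffle.
    val joined = views.filter(_._1 != 0L)
      .join(listings.filter(_._1 != 0L))

    joined.collect().foreach(println) // (1,(view1,listing1))
    sc.stop()
  }
}
```

If the catch-all rows are still needed, they can be handled separately (e.g. with a map-side broadcast join) instead of going through the shuffle.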
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
Code:
val viEventsWithListings: RDD[(Long,
I'm surprised that I haven't been able to find this via google, but I
haven't...
What is the setting that requests some amount of disk space for the
executors? Maybe I'm misunderstanding how this is configured...
Thanks for any help!
I need to have my own scheduler to point to a proprietary remote execution
framework.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152
I'm looking at where it decides on the backend and it doesn't look like
there is a hook. Of course I can
I believe if you do the following:
sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
(8) MapPartitionsRDD[34] at reduceByKey at console:23 []
| MapPartitionsRDD[33] at mapValues at console:23 []
| ShuffledRDD[32] at
at 2:49 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
I believe if you do the following:
sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
(8) MapPartitionsRDD[34] at reduceByKey at console:23
Hello all,
I am wondering if Spark already has support for optimizations on sorted
data and/or if such support could be added (I am comfortable dropping to a
lower level if necessary to implement this, but I'm not sure if it is
possible at all).
Context: we have a number of data sets which are