if your error is on the executors you need to check the executor logs for the
full stacktrace
On Tue, Aug 18, 2015 at 10:01 PM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi All,
Please let me know what arguments need to be passed on the CLI to retrieve the
FULL STACK TRACE in Apache Spark.
I am stuck in
starting is easy, just use a lazy val. stopping is harder. i do not think
executors have a cleanup hook currently...
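a minimal sketch of the lazy val approach (MyService and rdd are hypothetical):
object ExecutorSideService {
  lazy val instance: MyService = {
    val s = new MyService() // hypothetical service class
    s.start()
    s
  }
}
// touching the lazy val inside a task initializes it once per executor JVM:
rdd.foreachPartition { _ => ExecutorSideService.instance }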
On Sun, Aug 9, 2015 at 5:29 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'd like to start a service with each Spark Executor upon initialization
and have the
has anyone tried to create a HiveContext only if the class is available?
i tried this:
implicit lazy val sqlc: SQLContext = try {
Class.forName("org.apache.spark.sql.hive.HiveContext", true,
Thread.currentThread.getContextClassLoader)
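a hedged completion of that idea, assuming a SparkContext named sc is in scope
(falls back to a plain SQLContext if the hive classes are absent):
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

implicit lazy val sqlc: SQLContext = try {
  Class.forName("org.apache.spark.sql.hive.HiveContext", true,
    Thread.currentThread.getContextClassLoader)
    .getConstructor(classOf[SparkContext])
    .newInstance(sc)
    .asInstanceOf[SQLContext]
} catch {
  case _: ClassNotFoundException => new SQLContext(sc)
}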
i am using scala 2.11
spark jars are not in my assembly jar (they are provided), since i launch
with spark-submit
On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers ko...@tresata.com wrote:
spark 1.4.0
spark-csv is a normal dependency of my project and in the assembly jar
that i use
but i
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
What is the version of Spark you are using? How did you add the spark-csv
jar?
On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers ko...@tresata.com wrote:
has anyone tried to make
that solved it, thanks!
On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers ko...@tresata.com wrote:
thanks i will try 1.4.1
On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai yh...@databricks.com wrote:
Hi Koert,
For the classloader issue, you probably hit
https://issues.apache.org/jira/browse/SPARK
?
Thanks,
Yin
On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers ko...@tresata.com wrote:
i am using scala 2.11
spark jars are not in my assembly jar (they are provided), since i
launch with spark-submit
On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers ko...@tresata.com wrote:
spark 1.4.0
https://issues.apache.org/jira/browse/SPARK-8817
On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote:
i see the relaxation to allow duplicate field names was done on purpose,
since some data sources can have dupes due to case insensitive resolution.
apparently the issue
, Akhil Das ak...@sigmoidanalytics.com wrote:
I think you can open up a jira, not sure if this PR
https://github.com/apache/spark/pull/2209/files (SPARK-2890
https://issues.apache.org/jira/browse/SPARK-2890) broke the validation
piece.
Thanks
Best Regards
On Fri, Jul 3, 2015 at 4:29 AM, Koert
i am surprised this is allowed...
scala> sqlContext.sql("select name as boo, score as boo from
candidates").schema
res7: org.apache.spark.sql.types.StructType =
StructType(StructField(boo,StringType,true),
StructField(boo,IntegerType,true))
should StructType check for duplicate field names?
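a hedged sketch of the kind of check one could imagine (not what spark does today):
val names = schema.fieldNames
require(names.distinct.length == names.length,
  "StructType has duplicate field names: " + names.mkString(", "))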
import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
On Wed, Jul 1, 2015 at 11:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
How do i persist an RDD using StorageLevel.MEMORY_AND_DISK_SER ?
--
Deepak
see also:
https://github.com/apache/spark/pull/6848
On Mon, Jun 29, 2015 at 12:48 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")
sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith(_)).get
+ "/*",
you need 1) to publish to an in-house maven repo, so your application can depend
on your version, and 2) to use the spark distribution you compiled to launch your
job (assuming you run with yarn, so you can launch multiple versions of
spark on the same cluster)
On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
-1.4.0/dist/lib/
cp:
/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar:
No such file or directory
LM-SJL-00877532:spark-1.4.0 dvasthimal$ ./make-distribution.sh --tgz
-Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver
On Sun, Jun 28, 2015 at 1:41 PM, Koert
spark is partitioner aware, so it can exploit a situation where 2 datasets
are partitioned the same way (for example by doing a map-side join on
them). map-red does not expose this.
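a minimal sketch of exploiting co-partitioning (rddA and rddB are hypothetical pair RDDs):
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(64)
val a = rddA.partitionBy(p).cache()
val b = rddB.partitionBy(p).cache()
val joined = a.join(b) // same partitioner on both sides: the join itself causes no shuffle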
On Sun, Jun 28, 2015 at 12:13 PM, YaoPau jonrgr...@gmail.com wrote:
I've heard Spark is not just MapReduce
-Phive-thriftserver
On Sun, Jun 28, 2015 at 1:41 PM, Koert Kuipers ko...@tresata.com
wrote:
you need 1) to publish to an in-house maven repo, so your application can depend
on your version, and 2) to use the spark distribution you compiled to launch
your job (assuming you run with yarn, so you can launch
-guide.html#which-storage-level-to-choose
When do i choose this setting? (Attached is my code for reference)
On Sun, Jun 28, 2015 at 2:57 PM, Koert Kuipers ko...@tresata.com wrote:
a blockJoin spreads out one side while replicating the other. i would
suggest replicating the smaller side
.
And are my assumptions on replication levels correct?
Did you get a chance to look at my processing?
On Sun, Jun 28, 2015 at 3:31 PM, Koert Kuipers ko...@tresata.com wrote:
regarding your calculation of executors... RAM in executor is not really
comparable to size on disk.
if you read from
we went through a similar process, switching from scalding (where
everything just works on large datasets) to spark (where it does not).
spark can be made to work on very large datasets, it just requires a little
more effort. pay attention to your storage levels (should be
memory-and-disk or
i noticed in DataFrame that to get the rdd out of it some conversions are
done:
val converter = CatalystTypeConverters.createToScalaConverter(schema)
rows.map(converter(_).asInstanceOf[Row])
does this mean DataFrame internally does not use the standard scala types?
why not?
just a heads up, i was doing some basic coding using DataFrame, Row,
StructType, etc. and i ended up with deadlocks in my sbt tests due to the
usage of
ScalaReflectionLock.synchronized in the spark sql code.
the issue went away when i changed my tests to run consecutively...
could it be composed maybe? a general version and then a sql version that
exploits the additional info/abilities available there and uses the general
version internally...
i assume the sql version can benefit from the logical phase optimization to
pick join details. or is there more?
On Tue, Jun
a skew join (where the dominant key is spread across multiple executors) is
pretty standard in other frameworks, see for example in scalding:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala
this would be a great addition to
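a hedged sketch of the usual trick behind such skew joins (plain salting, not an
existing spark API; left and right are hypothetical pair RDDs): scatter the skewed
side over R buckets and replicate the other side into every bucket:
import scala.util.Random

val R = 10 // replication factor for the dominant key(s)
val saltedLeft = left.map { case (k, v) => ((k, Random.nextInt(R)), v) }
val saltedRight = right.flatMap { case (k, w) => (0 until R).map(r => ((k, r), w)) }
val joined = saltedLeft.join(saltedRight).map { case ((k, _), vw) => (k, vw) }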
we are still running into issues with spark-shell not working on 2.11, but
we are running on a somewhat older master so maybe that has been resolved
already.
On Tue, May 26, 2015 at 11:48 AM, Dean Wampler deanwamp...@gmail.com
wrote:
Most of the 2.11 issues are being resolved in Spark 1.4. For a
i searched the jiras but couldn't find any recent mention of this. let me
try with 1.4.0 branch and see if it goes away...
On Wed, May 6, 2015 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote:
hello all,
i built spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and
scala 2.11. i am
i am trying to launch the spark 1.3.1 history server on a secure cluster.
i can see in the logs that it successfully logs into kerberos, and it is
replaying all the logs, but i never see the log message that indicates the
web server is started (i should see something like Successfully started
:17 PM, Koert Kuipers ko...@tresata.com wrote:
good idea i will take a look. it does seem to be spinning one cpu at
100%...
On Thu, May 7, 2015 at 2:03 PM, Marcelo Vanzin van...@cloudera.com
wrote:
Can you get a jstack for the process? Maybe it's stuck somewhere.
On Thu, May 7, 2015 at 11
got it. thanks!
On Thu, May 7, 2015 at 2:52 PM, Marcelo Vanzin van...@cloudera.com wrote:
Ah, sorry, that's definitely what Shixiong mentioned. The patch I
mentioned did not make it into 1.3...
On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote:
seems i got one thread
i am having no luck using the 1.4 branch with scala 2.11
$ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package
[error]
/home/koert/src/opensource/spark/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78:
in object RDDOperationScope, multiple overloaded
hello all,
i built spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and scala
2.11. i am running on a secure cluster. the deployment configs are
identical.
i can launch jobs just fine on both the scala 2.10 and scala 2.11 versions.
spark-shell works on the scala 2.10 version, but not on
shoot me an email if you need any help with spark-sorted. it does not
(yet?) have a java api, so you will have to work in scala
On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote:
I think this Spark Package may be what you're looking for!
our experience is that unless you can benefit from spark features such as
co-partitioning, which allow for more efficient execution, spark is
slightly slower for disk-to-disk.
On Apr 27, 2015 10:34 PM, bit1...@163.com bit1...@163.com wrote:
Hi,
I am frequently asked why spark is also much
because CompactBuffer is considered an implementation detail. It is also
not public for the same reason.
On Thu, Apr 23, 2015 at 6:46 PM, Hao Ren inv...@gmail.com wrote:
Should I repost this to dev list ?
--
View this message in context:
Use KafkaRDD directly. It is in spark-streaming-kafka package
On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Hi
I want to consume messages from a kafka queue using a spark batch program, not
spark streaming. Is there any way to achieve this, other than using low
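a minimal sketch using KafkaUtils.createRDD from the spark-streaming-kafka package
(spark 1.3+, 0.8-style kafka; broker, topic and offsets are made up):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val ranges = Array(OffsetRange("mytopic", 0, fromOffset = 0L, untilOffset = 1000L))
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)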
better idea :)
On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers ko...@tresata.com wrote:
Use KafkaRDD directly. It is in spark-streaming-kafka package
On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora
shushantaror...@gmail.com wrote:
Hi
I want to consume messages from a kafka queue using spark
and streaming is just
as good an approach. Not sure...
On Sat, Apr 18, 2015 at 3:13 PM, Koert Kuipers ko...@tresata.com wrote:
Yeah I think I would pick the second approach because it is simpler
operationally in case of any failures. But of course the smaller the window
gets the more attractive
i believe it is a generalization of some classes inside graphx, where there
was/is a need to keep stuff indexed for random access within each rdd
partition
On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Can somebody from Databricks shed more light on this Indexed RDD
is a
little clunky, and this should get rolled into the other changes you are
proposing to hadoop RDD friends -- but I'll go into more discussion on
that thread.
On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:
there is a way to reinstate the partitioner, but that requires
is it safe to access SparkEnv.get inside say mapPartitions?
i need to get a Serializer (so SparkEnv.get.serializer)
thanks
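a minimal sketch of that usage (strings is a hypothetical RDD[String]):
import org.apache.spark.SparkEnv

strings.mapPartitions { iter =>
  val ser = SparkEnv.get.serializer.newInstance() // one instance per task
  iter.map(s => ser.serialize(s).remaining())     // e.g. serialized size in bytes
}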
in the comments on SparkContext.objectFile it says:
It will also be pretty slow if you use the default serializer (Java
serialization)
this suggests the spark.serializer is used, which means i can switch to the
much faster kryo serializer. however when i look at the code it uses
currently it's pretty hard to control the Hadoop Input/Output formats used
in Spark. The convention seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
settings on the
i just realized the major limitation is that i lose partitioning info...
On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote:
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote:
so finally i can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile
there is a way to reinstate the partitioner, but that requires
sc.objectFile to read exactly what i wrote, which means sc.objectFile
should never split files on reading (a feature of hadoop file inputformat
that gets in the way here).
On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko
i would like to use spark for some algorithms where i make no attempt to
work in memory, so read from hdfs and write to hdfs for every step.
of course i would like every step to only be evaluated once. and i have no
need for spark's RDD lineage info, since i persist to reliable storage.
the
i added it
On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote:
Hi Koert,
Would you like to register this on spark-packages.org?
Burak
On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:
currently spark provides many excellent algorithms for operations
currently spark provides many excellent algorithms for operations per key
as long as the data sent to the reducers per key fits in memory. operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data reduce-side is small. and groupByKey simply
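a minimal illustration of the difference (pairs is a hypothetical RDD[(String, Int)]):
val counts = pairs.reduceByKey(_ + _) // combines map-side, so reduce-side data stays small
val groups = pairs.groupByKey()       // ships every value per key; they must fit in memory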
problems as it kinda gets converted back to a row
oriented format.
@Koert - that looks really exciting. Do you have any statistics on memory
and scan performance?
On Saturday, February 14, 2015, Koert Kuipers ko...@tresata.com wrote:
i wrote a proof of concept to automatically store any RDD
hey,
running my first map-red like (meaning disk-to-disk, avoiding in memory
RDDs) computation in spark on yarn i immediately got bitten by a too low
spark.yarn.executor.memoryOverhead. however it took me about an hour to
find out this was the cause. at first i observed failing shuffles leading
to
does anyone have the right maven invocation for cdh5 with yarn?
i tried:
$ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean
package
$ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn test
it builds and passes tests just fine, but when i deploy on the cluster and
try to
thanks! my bad
On Wed, Feb 18, 2015 at 2:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Koert,
You should be using -Phadoop-2.3 instead of -Phadoop2.3.
-Sandy
On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote:
does anyone have the right maven invocation
i wrote a proof of concept to automatically store any RDD of tuples or case
classes in columnar format using arrays (and strongly typed, so you get the
benefit of primitive arrays). see:
https://github.com/tresata/spark-columnar
On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust
the whole spark.files.userClassPathFirst never really worked for me in
standalone mode, since jars were added dynamically which means they had
different classloaders leading to a real classloader hell if you tried to
add a newer version of jar that spark already used. see:
applications/situations. never thought i would say that.
best
On Wed, Feb 4, 2015 at 4:01 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Koert,
On Wed, Feb 4, 2015 at 11:35 AM, Koert Kuipers ko...@tresata.com wrote:
do i understand it correctly that on yarn the custom jars are truly
placed
package for the 1.2.0 release. It shouldn't be
causing conflicts.
[1] https://issues.apache.org/jira/browse/SPARK-2848
On Wed, Feb 4, 2015 at 2:35 PM, Koert Kuipers ko...@tresata.com wrote:
the whole spark.files.userClassPathFirst never really worked for me in
standalone mode, since jars were
anyhow i am ranting... sorry
On Wed, Feb 4, 2015 at 5:54 PM, Koert Kuipers ko...@tresata.com wrote:
yeah i think we have been lucky so far. but i don't really see how i have a
choice. it would be fine if say hadoop exposed a very small set of
libraries as part of the classpath. but if i look
for example? or avro? it just makes my life harder. and i don't
really see who benefits.
the yarn classpath is insane too.
On Wed, Feb 4, 2015 at 4:26 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Wed, Feb 4, 2015 at 1:12 PM, Koert Kuipers ko...@tresata.com wrote:
about putting stuff on classpath
yes jobs run as the user that launched them.
if you want to run jobs on a secure cluster then use yarn. spark
standalone does not support secure hadoop.
On Mon, Feb 2, 2015 at 5:37 PM, Jim Green openkbi...@gmail.com wrote:
Hi Team,
Does spark support impersonation?
For example, when spark
i have a simple spark app that i run with spark-submit on yarn. it runs
fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.
however in the nodemanager logs i see this:
2015-01-31 18:30:48,195 INFO
clue there?
Can you pastebin part of the RM log around the time your job ran?
What hadoop version are you using?
Thanks
On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers ko...@tresata.com wrote:
i have a simple spark app that i run with spark-submit on yarn. it runs
fine and shows up
operation such as this one?
This use case reminds me of FIR filtering in DSP. It seems that RDDs could
use something that serves the same purpose as
scala.collection.Iterator.sliding.
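a hedged pointer: mllib already ships a sliding() for RDDs (a DeveloperApi) that
serves this purpose (timeseries is a hypothetical RDD):
import org.apache.spark.mllib.rdd.RDDFunctions._

val windows = timeseries.sliding(3) // RDD of length-3 arrays over consecutive elements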
assuming the data can be partitioned then you have many timeseries for
which you want to detect potential gaps. also assuming the resulting gaps
info per timeseries is much smaller data then the timeseries data itself,
then this is a classical example to me of a sorted (streaming) foldLeft,
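a hedged sketch of that pattern, assuming events keyed by (seriesId, timestamp)
and a partitioner that routes on the series id only, so each series arrives sorted
within a single partition (repartitionAndSortWithinPartitions needs spark 1.2+):
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class SeriesPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (series: String, _) => math.abs(series.hashCode) % numPartitions
  }
}

def gaps(events: RDD[((String, Long), Unit)], maxGap: Long): RDD[(String, (Long, Long))] =
  events
    .repartitionAndSortWithinPartitions(new SeriesPartitioner(events.partitions.length))
    .keys
    .mapPartitions { iter =>
      var prev: Option[(String, Long)] = None
      iter.flatMap { case (series, t) =>
        val gap = prev.collect { case (ps, pt) if ps == series && t - pt > maxGap => (series, (pt, t)) }
        prev = Some((series, t))
        gap
      }
    }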
spark 1.2.0, and
it indeed does not run on CDH5.3.0. i get class incompatibility errors.
On Tue, Jan 6, 2015 at 10:29 AM, Koert Kuipers ko...@tresata.com wrote:
if the classes are in the original location then i think it's safe to say
that this makes it impossible for us to build one app that can
it also changes some transitive dependencies which also have
compatibility issues (e.g. the typesafe config library). But I believe
it's needed to support Scala 2.11...
On Mon, Jan 5, 2015 at 8:27 AM, Koert Kuipers ko...@tresata.com wrote:
since spark shaded akka i wonder if it would work, but i
, Jan 3, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote:
hey Ted,
i am aware of the upgrade efforts for akka. however if spark 1.2 forces
me to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force
me to use akka 2.2.x then we cannot build one application that runs on all
that's great. i tried this once and gave up after a few hours.
On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote:
Took me just about all night (it's 3am here in EST) but I finally figured
out how to get this working. I pushed up my example code for others who may
be
, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote:
i noticed spark 1.2.0 bumps the akka version. since spark uses its own
akka version, does this mean it can co-exist with another akka version in
the same JVM? has anyone tried this?
we have some spark apps that also use akka
i noticed spark 1.2.0 bumps the akka version. since spark uses its own
akka version, does this mean it can co-exist with another akka version in
the same JVM? has anyone tried this?
we have some spark apps that also use akka (2.2.3) and spray. if different
akka versions cause conflicts then
sc.textFile uses a hadoop input format. hadoop input formats by default
create one task per file, and they are not very suitable for many very
small files. can you turn your 1000 files into one larger text file?
otherwise maybe try:
val data = sc.textFile("/user/foo/myfiles/*").coalesce(100)
On
at 2:41 PM, Koert Kuipers ko...@tresata.com wrote:
hello all,
we at tresata wrote a library to provide for batch integration between
spark and kafka (distributed write of rdd to kafka, distributed read of rdd
from kafka). our main use cases are (in lambda architecture jargon):
* periodic appends
hello all,
we at tresata wrote a library to provide for batch integration between
spark and kafka (distributed write of rdd to kafka, distributed read of rdd
from kafka). our main use cases are (in lambda architecture jargon):
* periodic appends to the immutable master dataset on hdfs from kafka
spark can do efficient joins if both RDDs have the same partitioner. so in
case of a self join I would recommend creating an rdd that has an explicit
partitioner and has been cached.
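a minimal sketch (rdd is a hypothetical pair RDD):
import org.apache.spark.HashPartitioner

val partitioned = rdd.partitionBy(new HashPartitioner(rdd.partitions.length)).cache()
val selfJoined = partitioned.join(partitioned) // co-partitioned and cached: no extra shuffle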
On Dec 8, 2014 8:52 AM, Theodore Vasiloudis
theodoros.vasilou...@gmail.com wrote:
Hello all,
I am working on a
.
}
??? // Return something.
}
On Mon, Dec 8, 2014 at 3:28 PM, Koert Kuipers ko...@tresata.com
wrote:
spark can do efficient joins if both RDDs have the same partitioner.
so in case of a self join I would recommend creating an rdd that has
an explicit partitioner and has been cached.
On Dec 8, 2014 8
at java.lang.Class.forName(Class.java:270)
BTW I didn't find JavaAPISuite in test output either.
Cheers
On Sat, Dec 6, 2014 at 9:12 PM, Koert Kuipers ko...@tresata.com wrote:
Ted,
i mean
core/src/test/java/org/apache/spark/JavaAPISuite.java
On Sat, Dec 6, 2014 at 9:27 PM, Ted Yu yuzhih...@gmail.com
</skipTests>
</configuration>
</plugin>
<plugin>
I was able to run JavaAPISuite using:
mvn test -pl core -Dtest=JavaAPISuite
But it takes a long time ...
Cheers
On Sun, Dec 7, 2014 at 8:56 AM, Koert Kuipers ko...@tresata.com wrote:
hey guys,
i was able to run the test
https://issues.apache.org/jira/browse/SPARK-661
I got bit by this too recently and meant to look into it.
On Sun, Dec 7, 2014 at 4:50 PM, Koert Kuipers ko...@tresata.com wrote:
so as part of the official build the java api does not get tested then?
i am sure there is a good reason for it, but that's surprising
when i run mvn test -pl core, i don't see JavaAPISuite being run. or if it
is, it's being very very quiet about it. is this by design?
. For usage example, see test case
JavaAPISuite.testJavaJdbcRDD.
./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
FYI
On Sat, Dec 6, 2014 at 5:43 PM, Koert Kuipers ko...@tresata.com wrote:
when i run mvn test -pl core, i don't see JavaAPISuite being run. or if
it is, it's being very very
i suddenly also ran into the issue that maven is trying to download
snapshots that don't exist for other sub projects.
did something change in the maven build?
does maven not have the capability to smartly compile the other sub-projects
that a sub-project depends on?
i'd rather avoid mvn install
i think what changed is that core now has dependencies on other sub
projects. ok... so i am forced to install stuff because maven cannot
compile what is needed. i will install
On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers ko...@tresata.com wrote:
i suddenly also ran into the issue that maven
do these requirements boil down to a need for foldLeftByKey with sorting
of the values?
https://issues.apache.org/jira/browse/SPARK-3655
On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu ben...@gmail.com wrote:
I have a similar requirement: take top N by key. right now I use
groupByKey, but one key
since spark holds data structures on heap (and by default tries to work
with all data in memory) and it's written in Scala, seeing lots of scala
Tuple2 is not unexpected. how do these numbers relate to your data size?
On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi,
I wanted
looks like a missing class issue? what makes you think it's serialization?
shapeless does indeed have a lot of helper classes that get sucked in and
are not serializable. see here:
https://groups.google.com/forum/#!topic/shapeless-dev/05_DXnoVnI4
and for a project that uses shapeless in spark
spark can definitely very quickly answer queries like give me all
transactions with property x. and you can put a http query server in front
of it and run queries concurrently.
but spark does not support inserts, updates, or fast random access lookups.
this is because RDDs are immutable and
doing cleanup in an iterator like that assumes the iterator always gets
fully read, which is not necessarily the case (for example RDD.take does not).
instead i would use mapPartitionsWithContext, in which case you can write a
function of the form:
f: (TaskContext, Iterator[T]) => Iterator[U]
now
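a hedged sketch of that (openResource and process are hypothetical; the callback
name varies by spark version, addOnCompleteCallback in 1.0/1.1):
rdd.mapPartitionsWithContext { (context, iter) =>
  val resource = openResource()                         // per-partition setup
  context.addOnCompleteCallback(() => resource.close()) // runs at task end regardless
  iter.map(process)
}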
this requires evaluation of the rdd to do the count.
val x: RDD[X] = ...
val y: RDD[X] = ...
x.cache
val z = if (x.count > thres) x.union(y) else x
On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote:
Hi,
How could I combine rdds? I would like to combine two RDDs if the count in
an RDD is
you ran out of kryo buffer. are you using spark 1.1 (which supports buffer
resizing) or spark 1.0 (which has a fixed size buffer)?
On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am running a simple rdd filter command. What does it mean?
Here is the full stack trace (and code
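a hedged example of the settings involved (spark 1.1-era names; renamed to
spark.kryoserializer.buffer / .max in spark 1.4):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.mb", "8")        // initial buffer size
  .set("spark.kryoserializer.buffer.max.mb", "512")  // ceiling for resizing in spark 1.1+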
well, sort of! we make input/output formats (cascading taps, scalding
sources) available in spark, and we ported the scalding fields api to
spark. so it's for those of us that have a serious investment in
cascading/scalding and want to leverage that in spark.
blog is here:
thanks
On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Pretty cool, thanks for sharing this! I've added a link to it on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects
.
Matei
On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko
apologies for asking yet again about spark memory assumptions, but i can't
seem to keep it in my head.
if i use PairRDDFunctions.cogroup, it returns for every key 2 iterables. do
the contents of these iterables have to fit in memory? or is the data
streamed?
.
On Mon, Sep 15, 2014 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote:
in spark 1.1.0 i get this error:
2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both
spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
i checked my application. i do not set
now that spark has a sort based shuffle, can we expect a secondary sort
soon? there are some use cases where getting a sorted iterator of values
per key is helpful.
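a hedged sketch of what later surfaced as an API (spark 1.2's
repartitionAndSortWithinPartitions; pairs is a hypothetical pair RDD):
import org.apache.spark.HashPartitioner

val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(64))
// keys arrive sorted within each partition; with a composite (key, value) key and a
// partitioner that routes on the key part only, this yields sorted values per key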
we build our own adjacency lists as well. the main motivation for us was
that graphx has some assumptions about everything fitting in memory (it has
.cache statements all over the place). however if my understanding is wrong and
graphx can handle graphs that do not fit in memory i would be interested
in spark 1.1.0 i get this error:
2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both
spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
i checked my application. i do not set spark.driver.extraClassPath or
SPARK_CLASSPATH.
SPARK_CLASSPATH is set in spark-env.sh
hey mark,
do you think this is on purpose, or is it an omission? thanks, koert
On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover m...@apache.org wrote:
Hi Koert,
I work on Bigtop and CDH packaging and you are right, based on my quick
glance, it doesn't seem to be used.
Mark
From: Koert
a grep for SPARK_MASTER_IP shows that sbin/start-master.sh and
sbin/start-slaves.sh are the only ones that use it.
yet for example in CDH5 the spark-master is started from
/etc/init.d/spark-master by running bin/spark-class. does that mean
SPARK_MASTER_IP is simply ignored? it looks like that to
matei,
it is good to hear that the restriction that keys need to fit in memory no
longer applies to combineByKey. however join requiring keys to fit in
memory is still a big deal to me. does it apply to both sides of the join,
or only one (while the other side is streaming)?
On Sat, Aug 30,
i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?
i was just looking at ALS
(mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala)
is there any need for all the variables to be vars and to have all these
setters around? it just leads to so much clutter.
if you really want them to be vars it is safe in scala to make them public
spark-submit doesn't handle being a symlink currently:
$ spark-submit
/usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such
file or directory
/usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class:
cannot execute: No such file or directory
to fix i changed the