to get
around it, other than to stop passing around anonymous functions?
Thanks,
- Nathan Kronenfeld
- Uncharted Software
It's a bit of a pain, but you could just use an outer join (assuming there
are no duplicates in the input datasets, of course):
import org.apache.spark.sql.test.SharedSparkSession
import org.scalatest.FunSpec
class QuestionSpec extends FunSpec with SharedSparkSession {
describe("spark list
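For reference, a standalone sketch of the outer-join comparison (the datasets, column names, and mismatch condition here are assumptions, since the snippet above is truncated):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq(("a", 1), ("b", 2)).toDF("key", "value")
val right = Seq(("a", 1), ("c", 3)).toDF("key", "value")

// A full outer join on the key surfaces rows missing from either side
// (they appear with nulls) as well as rows whose values disagree.
val mismatches = left.as("l")
  .join(right.as("r"), Seq("key"), "full_outer")
  .where($"l.value".isNull || $"r.value".isNull || $"l.value" =!= $"r.value")
```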
This is likely (though not certainly) a gradle, not a spark,
problem, but I'm hoping someone else here has encountered this before?
Thanks in advance,
-Nathan Kronenfeld
> https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency
> See also elasticsearch's discussion on shading
>
> https://www.elastic.co/de/blog/to-shade-or-not-to-shade
>
> Best,
> Anastasios
> On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld,
y? To load 2.8.5
so that our code uses it, without messing up spark?
Thanks,
-Nathan Kronenfeld
>
> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit
>>> wrote:
>>>
Hi
I have been trying to do this simple operation. I want to land all
values with one key in same partition, and not have any different key in
the same partition. Is this possible? I am getting b and c
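For the first half of the requirement, partitionBy with a HashPartitioner is enough; the second half (no two keys sharing a partition) needs a custom Partitioner. A sketch, assuming a SparkContext `sc`:

```scala
import org.apache.spark.{HashPartitioner, Partitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// Guarantees all values for one key land in the same partition,
// but several keys may still share a partition.
val hashed = pairs.partitionBy(new HashPartitioner(4))

// One partition per distinct key: map each key to its own index.
val keys = pairs.keys.distinct().collect().zipWithIndex.toMap
val onePerKey = pairs.partitionBy(new Partitioner {
  def numPartitions: Int = keys.size
  def getPartition(key: Any): Int = keys(key.asInstanceOf[String])
})
```

Note that one-partition-per-key only scales to a modest number of distinct keys, since the distinct keys are collected to the driver.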
I don't know if this is the best way or not, but:
val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length
val toSeq = udf((a: Double, b: Double) => Seq(a,
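The udf above is cut off mid-expression; a self-contained sketch of the same approach (the `data` DataFrame and a second column `value` are assumptions, and the udf body is a guess at how the truncated line continued):

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.{col, udf}

val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length   // number of distinct labels

// Pack the label index and a second column into one array-valued column
val toSeq = udf((a: Double, b: Double) => Seq(a, b))
val packed = indexedData.withColumn("pair", toSeq(col("vrIdx"), col("value")))
```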
Looking in Dataset, there are select functions taking from 1 to 5
TypedColumn arguments.
Is there a built-in way to pull out more than 5 typed columns into a
Dataset (without having to resort to using a DataFrame, or manual
processing of the RDD)?
Thanks,
- Nathan Kronenfeld
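One workaround (not a dedicated typed API, just the untyped select followed by an encoder; the DataFrame `df` and its column names here are hypothetical):

```scala
import spark.implicits._

// select(Column*) has no arity limit; .as[...] restores the typing afterwards
val ds = df
  .select($"a", $"b", $"c", $"d", $"e", $"f")
  .as[(Int, String, Double, Long, Boolean, Int)]
```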
I think you want a join of type "left_anti"... See below log
scala> import spark.implicits._
import spark.implicits._
scala> case class Foo (a: String, b: Int)
defined class Foo
scala> case class Bar (a: String, d: Double)
defined class Bar
scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2),
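A self-contained version of the session above (the dataset contents past the truncation are guesses):

```scala
import spark.implicits._

case class Foo(a: String, b: Int)
case class Bar(a: String, d: Double)

val fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS()
val barDs = Seq(Bar("a", 1.0), Bar("c", 3.0)).toDS()

// left_anti keeps only the fooDs rows whose key "a" has no match in barDs
fooDs.join(barDs, Seq("a"), "left_anti").show()
// only Foo("b", 2) survives
```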
Thanks for the info
On Fri, Sep 1, 2017 at 12:06 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:
> No unfortunately not - as i recall storageLevel accesses some private
> methods to get the result.
>
> On Fri, 1 Sep 2017 at 17:55, Nathan Kronenfeld
> <nkrone
>
> Arguably isCached could be added to dataset too, shouldn't be a
> controversial change.
>
> On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld
> <nkronenfeld@uncharted.software>
> wrote:
>
>> I'm currently porting some of our code from RDDs to Datasets
at the moment to determine if a dataset is cached or not?
Thanks in advance
-Nathan Kronenfeld
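For what it's worth, in Spark 2.1+ Dataset.storageLevel is public, so a simple check is possible (a sketch; earlier versions would need the workaround discussed above):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// A dataset counts as cached if it has any storage level other than NONE
def isCached(ds: Dataset[_]): Boolean = ds.storageLevel != StorageLevel.NONE
```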
to have a CSV file format
exposed, and the only entry points we can find are when reading files.
What is the modern pattern for converting an already-read RDD of CSV lines
into a dataframe?
Thanks,
Nathan Kronenfeld
Uncharted Software
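Since Spark 2.2, DataFrameReader.csv accepts an already-loaded Dataset[String], which seems to be the modern entry point for this. A sketch, assuming `lines` is an RDD[String] of raw CSV records:

```scala
import spark.implicits._

val ds = lines.toDS()        // RDD[String] -> Dataset[String]
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(ds)                   // parse the in-memory lines as CSV
```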
it.
Is there a way to construct a graph so that it uses the partitions given
and doesn't shuffle everything around?
Thanks,
-Nathan Kronenfeld
Hello, folks.
We just recently switched to using Yarn on our cluster (when upgrading to
cloudera 5.4.1)
I'm trying to run a spark job from within a broader application (a web
service running on Jetty), so I can't just start it using spark-submit.
Does anyone know of an instructions page on how
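One option that avoids shelling out to spark-submit is the SparkLauncher API (available since Spark 1.4); the paths, class names, and settings below are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

val proc = new SparkLauncher()
  .setAppResource("/opt/myapp/myapp-assembly.jar")   // placeholder path
  .setMainClass("com.example.MySparkJob")            // placeholder class
  .setMaster("yarn-cluster")
  .setConf("spark.executor.memory", "4g")
  .launch()                                          // returns a java.lang.Process

proc.waitFor()
```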
What would it do with the following dataset?
(A, B)
(A, C)
(B, D)
On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com
wrote:
Hi,
I have a RDD of pairs of strings like below :
(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)
I need to transform/filter this into a RDD of pairs
(sc)
import sqlContext._
val d1 = sc.parallelize(1 to 10).map { i => Person(i, i + 1, i + 2) }
val d2 = d1.select('id, 'score, 'id + 'score)
d2.foreach(println)
2014-12-12 14:11 GMT+08:00 Nathan Kronenfeld nkronenf...@oculusinfo.com:
Hi, there.
I'm trying to understand how to augment data
?
Thanks,
-Nathan
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
Sent from the Apache Spark User List mailing list archive at Nabble.com.
of cobbling together the same function on accumulators,
when reduce/fold are simpler and have the behavior you suggest.
On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
I think I understand what is going on here, but I was hoping someone
could
confirm
and...@andrewash.com wrote:
Hi Nathan,
It sounds like what you're asking for has already been filed as
https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match
what you're proposing?
Andrew
On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We've
anyone know a way to get accumulators to
accumulate as results collect, rather than all at once at the end, so we
only have to hold a few in memory at a time, rather than all 400?
Thanks,
-Nathan
, that was keeping the job from finishing? It's not too uncommon to
see occasional performance regressions while caching due to GC, though 2
seconds to 8 minutes is a bit extreme.
On Wed, Nov 12, 2014 at 9:01 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
Sorry, I think I was not clear in what I
, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
Can anyone point me to a good primer on how spark decides where to send
what task, how it distributes them, and how it determines data locality?
I'm trying a pretty simple task - it's doing a foreach over cached data,
accumulating some
with locality level ANY. Why would that change when running
the exact same task twice in a row on cached data?
Any help or pointers that I could get would be much appreciated.
Thanks,
-Nathan
From the storage page of the Spark UI:

RDD Name: 8 (http://hadoop-s1.oculus.guest:4042/storage/rdd?id=8)
Storage Level: Memory Deserialized 1x Replicated
Cached Partitions: 426
Fraction Cached: 107%
Size in Memory: 59.7 GB
Size in Tachyon: 0.0 B
Size on Disk: 0.0 B

Anyone understand what it means to have more than 100% of an rdd cached?
Thanks,
-Nathan
org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb 500
spark.speculation true
On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
I'm trying to understand two things about how spark is working.
(1) When I try to cache an rdd that fits well within memory (about
some
relatively sizable accumulators; at the moment, we're creating one per
query, and running out of memory after far too few queries.
I've tried methods that don't involve accumulators; they involve a shuffle
instead, and take 10x as long.
Thanks,
-Nathan
: A => DataMonad[B] ) : DataMonad[B] = {
f( data )
}
def foreach ...
def withFilter ...
:
:
etc, something like that
}
On Wed, Dec 18, 2013 at 10:42 PM, Reynold Xin r...@apache.org wrote:
On Wed, Dec 18, 2013 at 12:17 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com
Does anyone know what this error means:
14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on
executor 27: r104u05.oculus.local (PROCESS_LOCAL)
14/07/21 23:07:22 INFO TaskSetManager: Serialized task
Iteration 70: 2.177
Iteration 80: 2.472
Iteration 90: 2.814
Iteration 99: 3.018
slightly slower - but not significantly.
Does anyone know, if the caching is working, why is iteration 100 slower
than iteration 1? And why is caching making so little difference?
Thanks,
-Nathan Kronenfeld
this
to true when using ext4 or xfs filesystems. On ext3, this option might
degrade performance on machines with many (>8) cores due to filesystem
limitations.
2014-05-23 16:00 GMT+02:00 Nathan Kronenfeld nkronenf...@oculusinfo.com:
In trying to sort some largish datasets, we came across
, what
problems should we anticipate?
Thanks,
-Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark.
Is there an easy way to read XML-based files like this that accounts for
partition boundaries and the like?
Thanks,
Nathan
into a string
record. -Xiangrui
On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We are trying to read some large GraphML files to use in spark.
Is there an easy way to read XML-based files like this that accounts for
partition boundaries and the like
/cloud9/collection/XMLInputFormat.java
On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
Unfortunately, I don't have a bunch of moderately big xml files; I have one really big file - big enough that reading it into memory as a single string.
--
Software Engineer
Analytics Engineering Team @ Box
Mountain View, CA
longer it takes in node_local mode, it seems like the whole
thing would probably run much faster just by waiting for the right jvm to
be free. Is there any way of forcing this?
Thanks,
-Nathan
traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
What does this error mean
by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186
]
?
summary
objects. Again, it saves a lot of object creation.
On Mon, Mar 24, 2014 at 8:57 AM, Jaonary Rabarisoa jaon...@gmail.comwrote:
Dear all,
Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map.
Thanks
Jaonary
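A sketch of when mapPartitions pays off (assumption: some expensive per-record setup, such as a parser or a DB client, that can be amortized per partition; `ExpensiveParser` and `rdd` are hypothetical):

```scala
// Contrived stand-in for a costly-to-construct helper
class ExpensiveParser {
  def parse(line: String): Int = line.length
}

// map: constructs one ExpensiveParser per record
val perRecord = rdd.map { line => new ExpensiveParser().parse(line) }

// mapPartitions: one ExpensiveParser per partition, reused for every record
val perPartition = rdd.mapPartitions { iter =>
  val parser = new ExpensiveParser()
  iter.map(parser.parse)
}
```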
].
How are subclasses expected to override this if it's private? If they
aren't, how and when should they now clear any extraneous data they have?
Similarly, I now see no way to get the timing information - how is a custom
dstream supposed to do this now?
Thanks,
-Nathan