Graph testing question

2015-12-01 Thread Nathan Kronenfeld
I give it. Is there a way to construct a graph so that it uses the partitions given and doesn't shuffle everything around? Thanks, -Nathan Kronenfeld
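
For context, a rough sketch of the kind of construction the question is about: building a GraphX graph from vertex and edge RDDs that were already partitioned upstream (the data, partitioner, and an existing SparkContext sc are all assumptions made for illustration).

  import org.apache.spark.HashPartitioner
  import org.apache.spark.graphx.{Edge, Graph}

  // vertex and edge RDDs partitioned ahead of time (placeholder contents)
  val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    .partitionBy(new HashPartitioner(4))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, "knows")))

  // Graph.apply builds GraphX's internal vertex/edge partitions from these RDDs;
  // the open question in the thread is whether that step can reuse the input partitioning
  val graph = Graph(vertices, edges, defaultVertexAttr = "missing")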

CSV conversion

2016-10-26 Thread Nathan Kronenfeld
only to have a CSV file format exposed, and the only entry points we can find are when reading files. What is the modern pattern for converting an already-read RDD of CSV lines into a dataframe? Thanks, Nathan Kronenfeld Uncharted Software
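
One sketch of a modern approach (column contents and schema inference settings here are invented for illustration): Spark 2.2 and later can parse a Dataset[String] of CSV text directly, so an already-read RDD of lines can be converted without going back through the file-based entry points.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-from-rdd").getOrCreate()
  import spark.implicits._

  // an already-read RDD of CSV lines (placeholder data)
  val lines = spark.sparkContext.parallelize(Seq("1,alice", "2,bob"))

  // Spark 2.2+: DataFrameReader.csv accepts a Dataset[String] of CSV text
  val df = spark.read
    .option("inferSchema", "true")
    .csv(lines.toDS())
  df.show()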

Migration issue upgrading from Spark 2.4.5 to Spark 3.0.1

2020-12-10 Thread Nathan Kronenfeld
...how to get around it, other than to stop passing around anonymous functions? Thanks, - Nathan Kronenfeld - Uncharted Software

isCached

2017-09-01 Thread Nathan Kronenfeld
Is there any way at the moment to determine if a dataset is cached or not? Thanks in advance -Nathan Kronenfeld

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
could be added to dataset too, shouldn't be a > controversial change. > > On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld > > wrote: > >> I'm currently porting some of our code from RDDs to Datasets. >> >> With RDDs it's pretty easy to figure out if th

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
Thanks for the info. On Fri, Sep 1, 2017 at 12:06 PM, Nick Pentreath wrote: > No unfortunately not - as I recall storageLevel accesses some private > methods to get the result. > > On Fri, 1 Sep 2017 at 17:55, Nathan Kronenfeld > > wrote: > >> Ah, in 2.1.0. >
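
Roughly what the thread converges on, as a sketch (assuming an existing SparkSession named spark): since Spark 2.1.0 a Dataset exposes its storage level, which can stand in for an isCached check.

  import org.apache.spark.storage.StorageLevel

  val ds = spark.range(100)
  ds.cache()

  // Dataset.storageLevel (Spark 2.1.0+) reports NONE when nothing is persisted
  val isCached = ds.storageLevel != StorageLevel.NONE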

Re: Apache Spark-Subtract two datasets

2017-10-13 Thread Nathan Kronenfeld
I think you want a join of type "left_anti"... See below log scala> import spark.implicits._ import spark.implicits._ scala> case class Foo (a: String, b: Int) defined class Foo scala> case class Bar (a: String, d: Double) defined class Bar scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("
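
A completed version of the truncated shell transcript above (the Bar rows are guessed for illustration, and spark.implicits._ assumes a live session): left_anti keeps only the rows of the left dataset whose join key has no match on the right, which is effectively a keyed subtract.

  import spark.implicits._

  case class Foo(a: String, b: Int)
  case class Bar(a: String, d: Double)

  val fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS()
  val barDs = Seq(Bar("a", 1.0), Bar("c", 3.0)).toDS()

  // "subtract" barDs from fooDs by key: rows of fooDs with no matching "a" in barDs
  val remaining = fooDs.join(barDs, Seq("a"), "left_anti")
  remaining.show()   // only the Foo("b", 2) row survives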

select with more than 5 typed columns

2018-01-08 Thread Nathan Kronenfeld
Looking in Dataset, there are select functions taking from 1 to 5 TypedColumn arguments. Is there a built-in way to pull out more than 5 typed columns into a Dataset (without having to resort to using a DataFrame, or manual processing of the RDD)? Thanks, - Nathan Kronenfeld
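
The typed select overloads do stop at five columns. One workaround (a sketch with invented column names, not necessarily the only route) is an untyped select followed by .as[...] with a matching tuple or case class type, which keeps the result a Dataset without dropping to a DataFrame view or the RDD.

  import spark.implicits._

  val df = Seq((1, "a", 2.0, true, 4L, "x", 7))
    .toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7")

  // untyped column selection, then re-attach an encoder for the 6-column result
  val ds = df.select("c1", "c2", "c3", "c4", "c5", "c6")
    .as[(Int, String, Double, Boolean, Long, String)]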

Re: Building SparkML vectors from long data

2018-06-12 Thread Nathan Kronenfeld
I don't know if this is the best way or not, but: val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx") val indexModel = indexer.fit(data) val indexedData = indexModel.transform(data) val variables = indexModel.labels.length val toSeq = udf((a: Double, b: Double) => Seq(a, b))
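
The reply is cut off above; here is a sketch of how that approach might continue (the id and value column names, the long-format layout, and the sample rows are assumptions): index the variable names, then collect each id's (index, value) pairs and assemble a sparse ML vector.

  import org.apache.spark.ml.feature.StringIndexer
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.functions._
  import spark.implicits._

  // long-format data: one (id, variable, value) row per entry (placeholder rows)
  val data = Seq((1L, "x", 1.0), (1L, "y", 2.0), (2L, "x", 3.0)).toDF("id", "vr", "value")

  val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
  val indexModel = indexer.fit(data)
  val indexedData = indexModel.transform(data)
  val numVariables = indexModel.labels.length

  // gather each id's (variable index, value) pairs and build a sparse feature vector
  val toVector = udf { (idx: Seq[Double], vals: Seq[Double]) =>
    val sorted = idx.map(_.toInt).zip(vals).sortBy(_._1)
    Vectors.sparse(numVariables, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
  val vectors = indexedData
    .groupBy("id")
    .agg(collect_list($"vrIdx").as("idx"), collect_list($"value").as("vals"))
    .withColumn("features", toVector($"idx", $"vals"))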

Re: RepartitionByKey Behavior

2018-06-22 Thread Nathan Kronenfeld
> > On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit >>> wrote: >>> Hi I have been trying to do this simple operation. I want to land all values with one key in the same partition, and not have any different key in the same partition. Is this possible? I am getting b and c always...
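
A sketch of one way to get that guarantee, assuming the keys can be enumerated up front and an existing SparkContext sc: partitionBy with a custom partitioner that gives every distinct key its own partition, so no partition ever mixes keys. A plain HashPartitioner only guarantees that equal keys land together, not that different keys stay apart.

  import org.apache.spark.Partitioner

  // one partition per distinct key, so no two keys ever share a partition
  class ExactKeyPartitioner(keys: Seq[String]) extends Partitioner {
    private val index = keys.zipWithIndex.toMap
    override def numPartitions: Int = keys.size
    override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
  }

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
  val keys = pairs.keys.distinct().collect().toSeq
  val partitioned = pairs.partitionBy(new ExactKeyPartitioner(keys))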

conflicting version question

2018-10-26 Thread Nathan Kronenfeld
... To load 2.8.5 so that our code uses it, without messing up spark? Thanks, -Nathan Kronenfeld

Re: conflicting version question

2018-10-26 Thread Nathan Kronenfeld
> > > https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency > > > See also elasticsearch's discussion on shading > > https://www.elastic.co/de/blog/to-shade-or-not-to-shade > > Best, > Anastasios > > > On Fri, 26 Oct 2018, 15:45 Nathan Kronen

Problem upgrading from 2.3.1 to 2.4.3 with gradle

2019-09-09 Thread Nathan Kronenfeld
e this is likely (though not certainly) a gradle, not a spark, problem, but I'm hoping someone else here has encountered this before? Thanks in advance, -Nathan Kronenfeld

Re: [Spark SQL]: Does Union operation followed by drop duplicate follow "keep first"

2019-09-13 Thread Nathan Kronenfeld
It's a bit of a pain, but you could just use an outer join (assuming there are no duplicates in the input datasets, of course): import org.apache.spark.sql.test.SharedSparkSession import org.scalatest.FunSpec class QuestionSpec extends FunSpec with SharedSparkSession { describe("spark list ques
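
A minimal sketch of that outer-join idea (column names invented, spark.implicits._ assumed): join on the key and prefer the first dataset's columns wherever both sides have the key, which mimics union-then-dropDuplicates with "keep first" semantics.

  import org.apache.spark.sql.functions.coalesce
  import spark.implicits._

  val first = Seq((1, "fromFirst"), (2, "fromFirst")).toDF("key", "value")
  val second = Seq((2, "fromSecond"), (3, "fromSecond")).toDF("key", "value")

  // full outer join, keeping first's value whenever a key appears in both inputs
  val kept = first.as("a").join(second.as("b"), Seq("key"), "full_outer")
    .select($"key", coalesce($"a.value", $"b.value").as("value"))
  kept.show()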

running spark on yarn

2015-05-21 Thread Nathan Kronenfeld
Hello, folks. We just recently switched to using Yarn on our cluster (when upgrading to cloudera 5.4.1) I'm trying to run a spark job from within a broader application (a web service running on Jetty), so I can't just start it using spark-submit. Does anyone know of an instructions page on how t
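
One pattern for this situation, offered as a sketch rather than a drop-in answer (paths and class names are placeholders, and SparkLauncher only appeared around Spark 1.4): the SparkLauncher API starts a Spark application from inside another JVM, such as a Jetty service, instead of shelling out to spark-submit.

  import org.apache.spark.launcher.SparkLauncher

  // launch a YARN-backed job from inside the web service's JVM
  val handle = new SparkLauncher()
    .setAppResource("/path/to/our-job.jar")        // placeholder
    .setMainClass("com.example.OurSparkJob")       // placeholder
    .setMaster("yarn-client")
    .setConf("spark.executor.memory", "4g")
    .startApplication()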

Adding a column to a SchemaRDD

2014-12-11 Thread Nathan Kronenfeld
this possible? If so, how? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Nathan Kronenfeld
> Otherwise, if your self-defining function is straightforward and you can > represent it by SQL, using Spark SQL or DSL is also a good choice. > > case class Person(id: Int, score: Int, value: Int) > > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > > import sqlContext.

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Nathan Kronenfeld
... mod_sum) > > You can use t1.printSchema() to print the schema of this SchemaRDD and > check whether it satisfies your requirements. > > > > 2014-12-13 0:00 GMT+08:00 Nathan Kronenfeld : >> >> (1) I understand about immutability, that's why I said I wanted a new ...
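
Piecing the truncated reply back together, the pattern it appears to describe (a reconstruction using the Spark 1.x-era API, with an existing SparkContext sc assumed) is to register the SchemaRDD as a table and select the extra column through SQL:

  case class Person(id: Int, score: Int, value: Int)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD

  val people = sc.parallelize(Seq(Person(1, 10, 2), Person(2, 20, 3)))
  people.registerTempTable("people")

  // "adds" a column by selecting a derived expression into a new SchemaRDD
  val t1 = sqlContext.sql("SELECT id, score, value, score + value AS mod_sum FROM people")
  t1.printSchema()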

spark failure

2014-02-24 Thread Nathan Kronenfeld
..."CoarseGrainedExecutorBackend" "akka://spark@hadoop-client.oculus.local:41101/user/CoarseGrainedScheduler" "4" "hadoop-s2.oculus.local" "32" "app-20140224234441-0003" 14/02/24 23:44:45 INFO worker.Worker: Executor app-20140224234441-0003/4 finished with...

Re: Trying to connect to spark from within a web server

2014-02-28 Thread Nathan Kronenfeld
...wrote: > Most likely all your classes/jars that are required to connect to Spark > are not being loaded, or the incorrect versions are being loaded, when you > start to do this from inside the web container (Tomcat). > > > > > > On Sat, Feb 22, 2014 at 1:51 PM, Nathan Kronenfeld...

spark-streaming

2014-03-14 Thread Nathan Kronenfeld
w? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-s2.oculus.loca\ l/192.168.0.47:45186 ] ? -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: mapPartitions use case

2014-03-24 Thread Nathan Kronenfeld
...set of summary objects. Again, it saves a lot of object creation. On Mon, Mar 24, 2014 at 8:57 AM, Jaonary Rabarisoa wrote: > Dear all, > > Sorry for asking such a basic question, but can someone explain when one > should use mapPartitions instead of map. > > Thanks > >
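
A small sketch of the point being made (placeholder data, existing SparkContext sc assumed): mapPartitions pays per-partition costs such as buffers, connections, or summary objects once per partition rather than once per record.

  val records = sc.parallelize(1 to 1000000)

  // one running (count, sum) summary object per partition instead of per record
  val summaries = records.mapPartitions { iter =>
    var count = 0L
    var sum = 0L
    iter.foreach { x => count += 1; sum += x }
    Iterator((count, sum))
  }
  val (totalCount, totalSum) =
    summaries.reduce { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }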

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Nathan Kronenfeld
work traffic during that period? If you > shared some information about how each node in your cluster is set up (heap > size, memory, CPU, etc) that might help with debugging. > > Andrew > > > On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld < > nkronenf...@oculusinfo.com>

Re: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Nathan Kronenfeld
another >> jar. >> >> I tried to set SPARK_CLASSPATH, ADD_JAR, I tried to use --addJar in the >> spark-class arguments, I always end up with a "Class not found exception" >> when I want to use classes defined in my jar. >> >> Any ideas? >> >> T

Re: spark-streaming

2014-04-02 Thread Nathan Kronenfeld
you want to get? > > TD > > > On Fri, Mar 14, 2014 at 8:57 PM, Nathan Kronenfeld < > nkronenf...@oculusinfo.com> wrote: > >> I'm trying to update some spark streaming code from 0.8.1 to 0.9.0. >> >> Among other things, I've found the function clearMe

process_local vs node_local

2014-04-14 Thread Nathan Kronenfeld
Given how much longer it takes in node_local mode, it seems like the whole thing would probably run much faster just by waiting for the right jvm to be free. Is there any way of forcing this? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2

Re: process_local vs node_local

2014-04-14 Thread Nathan Kronenfeld
...node_local mode, it seems like the whole > thing would probably run much faster just by waiting for the right jvm to > be free. Is there any way of forcing this? > > > > > > Thanks, > > -Nathan > > > > > > -- > > Nathan Kronenfeld > >
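
One knob that influences this, offered as a sketch rather than what the thread concluded: the spark.locality.wait settings control how long the scheduler holds a task back hoping for a better locality level before falling back to NODE_LOCAL or ANY. The values shown are arbitrary.

  import org.apache.spark.SparkConf

  // wait longer for a free executor JVM before giving up on PROCESS_LOCAL placement
  // (values are milliseconds in older releases; newer ones also accept suffixes like "30s")
  val conf = new SparkConf()
    .set("spark.locality.wait", "30000")
    .set("spark.locality.wait.process", "30000")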

Re: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-05-16 Thread Nathan Kronenfeld
...at >> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441) >> at >> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
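
For reference, a common shape of the fix (a sketch, with a hypothetical job class): the exception usually means a closure captured the enclosing object that holds the SparkContext, so mark that field @transient and copy whatever the closure actually needs into a local val.

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  class ScalingJob(@transient val sc: SparkContext, factor: Int) extends Serializable {
    def run(rdd: RDD[Int]): RDD[Int] = {
      val localFactor = factor          // copy to a local so the closure does not drag in `this`
      rdd.map(_ * localFactor)          // no reference to sc inside the closure
    }
  }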

reading large XML files

2014-05-20 Thread Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
a string > record. -Xiangrui > > On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld > wrote: > > We are trying to read some large GraphML files to use in spark. > > > > Is there an easy way to read XML-based files like this that accounts for > > partition boundari

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
u/umd/cloud9/collection/XMLInputFormat.java > > On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld > wrote: > > Unfortunately, I don't have a bunch of moderately big xml files; I have > one, > > really big file - big enough that reading it into memory as a single > string

Shuffle file consolidation

2014-05-23 Thread Nathan Kronenfeld
, what problems should we anticipate? Thanks, -Nathan Kronenfeld -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: Shuffle file consolidation

2014-05-29 Thread Nathan Kronenfeld
is recommended to set this > to "true" when using ext4 or xfs filesystems. On ext3, this option might > degrade performance on machines with many (>8) cores due to filesystem > limitations. > ``` > > > 2014-05-23 16:00 GMT+02:00 Nathan Kronenfeld : > > In trying
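
For completeness, a sketch of applying the setting the quoted documentation is describing (it belongs to the old hash-based shuffle and was removed in later Spark releases):

  import org.apache.spark.SparkConf

  // enables shuffle file consolidation for the (old) hash-based shuffle manager
  val conf = new SparkConf()
    .set("spark.shuffle.consolidateFiles", "true")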

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary wrote: > Hi, > > I have a RDD of pairs of strings like below : > > (A,B) > (B,C) > (C,D) > (A,D) > (E,F) > (B,F) > > I need to transform/filter this into a RDD of pairs that does

Iteration question

2014-07-11 Thread Nathan Kronenfeld
...why is iteration 100 slower than iteration 1? And why is caching making so little difference? Thanks, -Nathan Kronenfeld -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
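
One frequent culprit in this situation, offered as a guess rather than what the thread concluded: each iteration extends the RDD lineage, so iteration 100 re-plans a far deeper DAG than iteration 1 even when the data itself is cached; periodic checkpointing truncates the lineage. A sketch with placeholder data and path, assuming an existing SparkContext sc:

  sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

  var current = sc.parallelize(1 to 1000000).map(_.toDouble)
  for (i <- 1 to 100) {
    current = current.map(_ * 1.0001).cache()
    if (i % 10 == 0) {
      current.checkpoint()   // cut the lineage every 10 iterations
      current.count()        // force evaluation so the checkpoint actually happens
    }
  }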

persistence state of an RDD

2014-07-15 Thread Nathan Kronenfeld
Is there a way of determining programmatically the cache state of an RDD? Not its storage level, but how much of it is actually cached? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A

Re: persistence state of an RDD

2014-07-15 Thread Nathan Kronenfeld
Thanks On Tue, Jul 15, 2014 at 10:39 AM, Praveen Seluka wrote: > Nathan, you are looking for SparkContext.getRDDStorageInfo which returns > the information on how much is cached. > > > On Tue, Jul 15, 2014 at 8:01 PM, Nathan Kronenfeld < > nkronenf...@oculusinfo.com> wr
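
A small sketch of using that (placeholder RDD and name, existing SparkContext sc assumed): SparkContext.getRDDStorageInfo reports, per persisted RDD, how many partitions are actually cached and how much memory and disk they occupy.

  val rdd = sc.parallelize(1 to 1000000).setName("myRdd").cache()
  rdd.count()   // materialize so something is actually cached

  sc.getRDDStorageInfo
    .filter(_.name == "myRdd")
    .foreach { info =>
      println(s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }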

new error for me

2014-07-21 Thread Nathan Kronenfeld
Does anyone know what this error means: 14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks 14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on executor 27: r104u05.oculus.local (PROCESS_LOCAL) 14/07/21 23:07:22 INFO TaskSetManager: Serialized task 3.0:0

Very weird behavior

2014-07-22 Thread Nathan Kronenfeld
...workers disconnecting, and applications die. When I run foo.mapPartitions.saveAsHadoopDataset, it works fine. Anyone got an explanation for why that might be? -Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, To...

Re: Have different reduce key than mapper key

2014-07-23 Thread Nathan Kronenfeld

Re: Spark streaming vs. spark usage

2014-07-28 Thread Nathan Kronenfeld
...DataMonad( f( data ) ) > } > > def flatMap[B]( f : A => DataMonad[B] ) : DataMonad[B] = { > f( data ) > } > > def foreach ... > def withFilter ... > : > : > etc, something like that > } > > On Wed, Dec 18, 2013 at 10:...

Accumulator question

2014-10-03 Thread Nathan Kronenfeld
...queries using some relatively sizable accumulators; at the moment, we're creating one per query, and running out of memory after far too few queries. I've tried methods that don't involve accumulators; they involve a shuffle instead, and take 10x as long. Thanks, -Nathan

rdd caching and use thereof

2014-10-16 Thread Nathan Kronenfeld
help, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: rdd caching and use thereof

2014-10-16 Thread Nathan Kronenfeld
codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 500 spark.speculation true On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > I'm trying to understand two things about how spark is working. > > (1) When I try to cache an rdd tha

weird caching

2014-11-08 Thread Nathan Kronenfeld
RDD Name: 8 | Storage Level: Memory Deserialized 1x Replicated | Cached Partitions: 426 | Fraction Cached: 107% | Size in Memory: 59.7 GB | Size in Tachyon: 0.0 B | Size on Disk: 0.0 B. Anyone understand what it means to have more than 100% of an rdd cached? Thanks,

data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
the last 10-20% of the tasks end up with locality level ANY. Why would that change when running the exact same task twice in a row on cached data? Any help or pointers that I could get would be much appreciated. Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Dev

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
manages to work in 2 seconds, it's because all the >> tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the >> tasks end up with locality level ANY. Why would that change when running >> the exact same task twice in a row on cached data? >> >> An

Re: data locality, task distribution

2014-11-13 Thread Nathan Kronenfeld
tic variance in task times? Was it a single straggler, or many > tasks, that was keeping the job from finishing? It's not too uncommon to > see occasional performance regressions while caching due to GC, though 2 > seconds to 8 minutes is a bit extreme. > > On Wed, Nov 12, 2014 at 9

Another accumulator question

2014-11-20 Thread Nathan Kronenfeld
If so, does anyone know a way to get accumulators to accumulate as results collect, rather than all at once at the end, so we only have to hold a few in memory at a time, rather than all 400? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berke...

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
kind of cobbling together the same function on accumulators, > when reduce/fold are simpler and have the behavior you suggest. > > On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld > wrote: > > I think I understand what is going on here, but I was hoping someone > could >
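
A sketch of the reduce/fold style the reply is pointing at (placeholder data, existing SparkContext sc assumed): per-task summaries get merged as tasks finish, so the driver never has to hold one large accumulator per task.

  val data = sc.parallelize(1 to 1000000)

  // (count, sum) built per partition and merged pairwise, instead of via accumulators
  val (count, sum) = data.aggregate((0L, 0L))(
    { case ((c, s), x) => (c + 1, s + x) },              // fold a record into a partition summary
    { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }  // merge two partition summaries
  )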

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
at 6:08 PM, Andrew Ash wrote: > Hi Nathan, > > It sounds like what you're asking for has already been filed as > https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match > what you're proposing? > > Andrew > > On Fri, Nov 21, 2014 at 12:29 PM, Natha

Re: Alternatives to groupByKey

2014-12-03 Thread Nathan Kronenfeld