Migration issue upgrading from Spark 2.4.5 to Spark 3.0.1

2020-12-10 Thread Nathan Kronenfeld
to get around it, other than to stop passing around anonymous functions? Thanks, - Nathan Kronenfeld - Uncharted Software

Re: [Spark SQL]: Does Union operation followed by drop duplicate follow "keep first"

2019-09-13 Thread Nathan Kronenfeld
It's a bit of a pain, but you could just use an outer join (assuming there are no duplicates in the input datasets, of course): import org.apache.spark.sql.test.SharedSparkSession import org.scalatest.FunSpec class QuestionSpec extends FunSpec with SharedSparkSession { describe("spark list
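A minimal sketch of the outer-join approach described above, assuming Spark 2.x and an existing SparkSession named spark; dataset and column names are illustrative, not from the thread:

  import spark.implicits._

  val dsA = Seq(("k1", "fromA"), ("k2", "fromA")).toDF("key", "payload")
  val dsB = Seq(("k2", "fromB"), ("k3", "fromB")).toDF("key", "payload")

  // "Keep first": when a key appears in both inputs, prefer the row from dsA.
  val merged = dsA.as("a").join(dsB.as("b"), Seq("key"), "full_outer")
    .selectExpr("key", "coalesce(a.payload, b.payload) as payload")

This makes the surviving row explicit, whereas union followed by dropDuplicates leaves the choice of survivor unspecified.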

Problem upgrading from 2.3.1 to 2.4.3 with gradle

2019-09-09 Thread Nathan Kronenfeld
likely (though not certainly) a Gradle, not a Spark, problem, but I'm hoping someone else here has encountered this before? Thanks in advance, -Nathan Kronenfeld

Re: conflicting version question

2018-10-26 Thread Nathan Kronenfeld
> > > https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency > > > See also elasticsearch's discussion on shading > > https://www.elastic.co/de/blog/to-shade-or-not-to-shade > > Best, > Anastasios > > > On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld,
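A sketch of what shading looks like in practice, using sbt-assembly's rename rule; the package name is purely illustrative, and the Gradle Shadow plugin's relocate directive is the equivalent for Gradle builds:

  // build.sbt, with the sbt-assembly plugin enabled
  assemblyShadeRules in assembly := Seq(
    ShadeRule.rename("com.example.conflicting.**" -> "myapp.shaded.@1").inAll
  )

The relocated copy is bundled under the new package name, so application code can use its own version of a library while Spark keeps the version it ships with.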

conflicting version question

2018-10-26 Thread Nathan Kronenfeld
y? To load 2.8.5 so that our code uses it, without messing up spark? Thanks, -Nathan Kronenfeld

Re: RepartitionByKey Behavior

2018-06-22 Thread Nathan Kronenfeld
> > On Thu, Jun 21, 2018 at 4:51 PM, Chawla, Sumit >>> wrote: >>> Hi I have been trying to do this simple operation. I want to land all values with one key in the same partition, and not have any different key in the same partition. Is this possible? I am getting b and c
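For the first half of the question, a sketch of the standard approach on a pair RDD: partitionBy with a HashPartitioner guarantees all values for one key land in the same partition, though it cannot guarantee a partition holds only one key, since distinct keys can hash to the same bucket:

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
  // Same key -> same partition; different keys may still share a partition.
  val partitioned = pairs.partitionBy(new HashPartitioner(4))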

Re: Building SparkML vectors from long data

2018-06-12 Thread Nathan Kronenfeld
I don't know if this is the best way or not, but: val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx") val indexModel = indexer.fit(data) val indexedData = indexModel.transform(data) val variables = indexModel.labels.length val toSeq = udf((a: Double, b: Double) => Seq(a,
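A rough sketch of how the truncated snippet might continue, assuming long-format columns named id, vr, and value (names are guesses) and Spark ML's sparse vectors:

  import org.apache.spark.ml.feature.StringIndexer
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions._

  val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
  val indexModel = indexer.fit(data)
  val indexed = indexModel.transform(data)
  val numVars = indexModel.labels.length

  // Collect (variableIndex, value) pairs per id and assemble one sparse vector per row.
  val toVector = udf { pairs: Seq[Row] =>
    val sorted = pairs.map(r => (r.getDouble(0).toInt, r.getDouble(1))).sortBy(_._1)
    Vectors.sparse(numVars, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
  val vectors = indexed.groupBy("id")
    .agg(collect_list(struct(col("vrIdx"), col("value"))).as("pairs"))
    .select(col("id"), toVector(col("pairs")).as("features"))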

select with more than 5 typed columns

2018-01-08 Thread Nathan Kronenfeld
Looking in Dataset, there are select functions taking from 1 to 5 TypedColumn arguments. Is there a built-in way to pull out more than 5 typed columns into a Dataset (without having to resort to using a DataFrame, or manual processing of the RDD)? Thanks, - Nathan Kronenfeld
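One common workaround, not from the thread itself: route through an untyped select and re-attach the element type with as[...], which matches columns to case-class fields by name and has no five-column limit (ds and its columns are assumptions):

  import spark.implicits._

  case class Wide(a: Int, b: Int, c: Int, d: Int, e: Int, f: Int, g: Int)
  // Untyped select, then back to a typed Dataset; names must line up with Wide's fields.
  val wide = ds.select($"a", $"b", $"c", $"d", $"e", $"f", $"g").as[Wide]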

Re: Apache Spark-Subtract two datasets

2017-10-13 Thread Nathan Kronenfeld
I think you want a join of type "left_anti"... See the log below: scala> import spark.implicits._ import spark.implicits._ scala> case class Foo (a: String, b: Int) defined class Foo scala> case class Bar (a: String, d: Double) defined class Bar scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2),
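A sketch completing the truncated transcript above (values illustrative):

  import spark.implicits._

  case class Foo(a: String, b: Int)
  case class Bar(a: String, d: Double)
  val fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS
  val barDs = Seq(Bar("b", 2.0)).toDS
  // left_anti keeps rows of fooDs whose key "a" has no match in barDs:
  // here ("a", 1) and ("c", 3).
  val subtracted = fooDs.join(barDs, Seq("a"), "left_anti")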

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
Thanks for the info. On Fri, Sep 1, 2017 at 12:06 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > No unfortunately not - as I recall storageLevel accesses some private > methods to get the result. > > On Fri, 1 Sep 2017 at 17:55, Nathan Kronenfeld > <nkrone

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
> > Arguably isCached could be added to dataset too, shouldn't be a > controversial change. > > On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld > <nkronenfeld@uncharted.software> > wrote: > >> I'm currently porting some of our code from RDDs to Datasets

isCached

2017-09-01 Thread Nathan Kronenfeld
at the moment to determine if a dataset is cached or not? Thanks in advance -Nathan Kronenfeld
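For reference, two checks that are available in this era (assumptions, not confirmed by the thread): Dataset.storageLevel, public as of Spark 2.1, and the catalog check for named views (ds and spark are assumed to exist):

  import org.apache.spark.storage.StorageLevel

  val isCached = ds.storageLevel != StorageLevel.NONE  // Spark 2.1+
  // Or, for a registered view:
  ds.createOrReplaceTempView("my_view")
  val viewCached = spark.catalog.isCached("my_view")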

CSV conversion

2016-10-26 Thread Nathan Kronenfeld
to have a CSV file format exposed, and the only entry points we can find are when reading files. What is the modern pattern for converting an already-read RDD of CSV lines into a dataframe? Thanks, Nathan Kronenfeld Uncharted Software
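A sketch of the pattern that later became the direct answer, assuming Spark 2.2+, where DataFrameReader.csv accepts a Dataset[String] (spark and sc assumed):

  import spark.implicits._

  val lines = sc.parallelize(Seq("1,alice", "2,bob"))  // an already-read RDD of CSV lines
  val df = spark.read
    .option("inferSchema", "true")
    .csv(lines.toDS())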

Graph testing question

2015-12-01 Thread Nathan Kronenfeld
it. Is there a way to construct a graph so that it uses the partitions given and doesn't shuffle everything around? Thanks, -Nathan Kronenfeld

running spark on yarn

2015-05-21 Thread Nathan Kronenfeld
Hello, folks. We just recently switched to using YARN on our cluster (when upgrading to Cloudera 5.4.1). I'm trying to run a Spark job from within a broader application (a web service running on Jetty), so I can't just start it using spark-submit. Does anyone know of an instructions page on how
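One option that appeared around that time (an assumption, not the thread's confirmed answer) is SparkLauncher, added in Spark 1.4, which submits a job programmatically from inside another JVM such as a Jetty service; paths and class names below are placeholders:

  import org.apache.spark.launcher.SparkLauncher

  val process = new SparkLauncher()
    .setAppResource("/path/to/spark-job.jar")
    .setMainClass("com.example.SparkJob")
    .setMaster("yarn-client")
    .launch()
  process.waitFor()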

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have a RDD of pairs of strings like below : (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into a RDD of pairs

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Nathan Kronenfeld
(sc) import sqlContext._ val d1 = sc.parallelize(1 to 10).map { i => Person(i, i+1, i+2) } val d2 = d1.select('id, 'score, 'id + 'score) d2.foreach(println) 2014-12-12 14:11 GMT+08:00 Nathan Kronenfeld nkronenf...@oculusinfo.com: Hi, there. I'm trying to understand how to augment data
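The modern DataFrame equivalent of the SchemaRDD snippet above, as a sketch (Person's field names are partly guessed from the snippet):

  import org.apache.spark.sql.functions.col
  import spark.implicits._

  case class Person(id: Int, score: Int, extra: Int)
  val d1 = sc.parallelize(1 to 10).map(i => Person(i, i + 1, i + 2)).toDF()
  // Append a derived column instead of re-selecting every existing one.
  val d2 = d1.withColumn("idPlusScore", col("id") + col("score"))
  d2.show()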

Adding a column to a SchemaRDD

2014-12-11 Thread Nathan Kronenfeld
? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: Alternatives to groupByKey

2014-12-03 Thread Nathan Kronenfeld

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
of cobbling together the same function on accumulators, when reduce/fold are simpler and have the behavior you suggest. On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I think I understand what is going on here, but I was hoping someone could confirm
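A minimal sketch of the reduce-based alternative referred to above, which keeps the aggregation on the executors instead of shipping accumulator updates to the driver:

  val data = sc.parallelize(1 to 1000)
  // Map each element to a (count, sum) summary, then merge summaries pairwise.
  val (count, sum) = data.map(x => (1L, x.toLong))
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))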

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
and...@andrewash.com wrote: Hi Nathan, It sounds like what you're asking for has already been filed as https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match what you're proposing? Andrew On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We've

Another accumulator question

2014-11-20 Thread Nathan Kronenfeld
anyone know a way to get accumulators to accumulate as results collect, rather than all at once at the end, so we only have to hold a few in memory at a time, rather than all 400? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street

Re: data locality, task distribution

2014-11-13 Thread Nathan Kronenfeld
, that was keeping the job from finishing? It's not too uncommon to see occasional performance regressions while caching due to GC, though 2 seconds to 8 minutes is a bit extreme. On Wed, Nov 12, 2014 at 9:01 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Sorry, I think I was not clear in what I

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some

data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
with locality level ANY. Why would that change when running the exact same task twice in a row on cached data? Any help or pointers that I could get would be much appreciated. Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street

weird caching

2014-11-08 Thread Nathan Kronenfeld
RDD Name: 8 (http://hadoop-s1.oculus.guest:4042/storage/rdd?id=8) Storage Level: Memory Deserialized 1x Replicated Cached Partitions: 426 Fraction Cached: 107% Size in Memory: 59.7 GB Size in Tachyon: 0.0 B Size on Disk: 0.0 B Anyone understand what it means to have more than 100% of an rdd cached? Thanks,

rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 500 spark.speculation true On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I'm trying to understand two things about how spark is working. (1) When I try to cache an rdd that fits well within memory (about

Accumulator question

2014-10-03 Thread Nathan Kronenfeld
some relatively sizable accumulators; at the moment, we're creating one per query, and running out of memory after far too few queries. I've tried methods that don't involve accumulators; they involve a shuffle instead, and take 10x as long. Thanks, -Nathan -- Nathan Kronenfeld

Re: Spark streaming vs. spark usage

2014-07-28 Thread Nathan Kronenfeld
: A => DataMonad[B]): DataMonad[B] = { f(data) } def foreach ... def withFilter ... etc., something like that } On Wed, Dec 18, 2013 at 10:42 PM, Reynold Xin r...@apache.org wrote: On Wed, Dec 18, 2013 at 12:17 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com
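A minimal self-contained sketch of the wrapper the snippet gestures at (purely illustrative; essentially an identity monad over a value):

  case class DataMonad[A](data: A) {
    def map[B](f: A => B): DataMonad[B] = DataMonad(f(data))
    def flatMap[B](f: A => DataMonad[B]): DataMonad[B] = f(data)
    def foreach(f: A => Unit): Unit = f(data)
    // A real withFilter would need a way to represent an empty result.
    def withFilter(p: A => Boolean): DataMonad[A] = this
  }

  // Enables for-comprehensions:
  val result = for {
    a <- DataMonad(1)
    b <- DataMonad(2)
  } yield a + b   // DataMonad(3)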

Re: Have different reduce key than mapper key

2014-07-23 Thread Nathan Kronenfeld

new error for me

2014-07-21 Thread Nathan Kronenfeld
Does anyone know what this error means: 14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks 14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on executor 27: r104u05.oculus.local (PROCESS_LOCAL) 14/07/21 23:07:22 INFO TaskSetManager: Serialized task

Iteration question

2014-07-11 Thread Nathan Kronenfeld
70: 2.177 Iteration 80: 2.472 Iteration 90: 2.814 Iteration 99: 3.018 slightly slower - but not significantly. Does anyone know, if the caching is working, why is iteration 100 slower than iteration 1? And why is caching making so little difference? Thanks, -Nathan Kronenfeld
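A general note the numbers above invite (not the thread's conclusion): cache() is lazy, so unless the RDD is materialized once before the timed loop, the first pass pays the full compute-and-cache cost, and gradual slowdowns in later iterations often point at GC pressure rather than missing caching. A one-line guard, with inputRdd standing in for whatever is being iterated over:

  val cached = inputRdd.cache()
  cached.count()  // force materialization before timing the iterations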

Re: Shuffle file consolidation

2014-05-29 Thread Nathan Kronenfeld
this to true when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations. ``` 2014-05-23 16:00 GMT+02:00 Nathan Kronenfeld nkronenf...@oculusinfo.com: In trying to sort some largish datasets, we came across
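For reference, the setting under discussion as it was enabled in that era (the option was later removed from Spark):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Recommended on ext4/xfs; could degrade performance on ext3 with >8 cores.
    .set("spark.shuffle.consolidateFiles", "true")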

Shuffle file consolidation

2014-05-23 Thread Nathan Kronenfeld
, what problems should we anticipate? Thanks, -Nathan Kronenfeld -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

reading large XML files

2014-05-20 Thread Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
into a string record. -Xiangrui On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
/cloud9/collection/XMLInputFormat.java On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Unfortunately, I don't have a bunch of moderately big xml files; I have one, really big file - big enough that reading it into memory as a single string
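A sketch of the tag-delimited reading approach referenced above; the exact class and configuration keys depend on the library (Cloud9's XMLInputFormat and Mahout's XmlInputFormat are similar but not identical), so every name here is an assumption:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}

  val hadoopConf = new Configuration()
  hadoopConf.set("xmlinput.start", "<node>")  // key names vary by implementation
  hadoopConf.set("xmlinput.end", "</node>")
  val records = sc
    .newAPIHadoopFile("hdfs:///data/big.graphml",
      classOf[XmlInputFormat],  // supplied by the chosen library
      classOf[LongWritable], classOf[Text], hadoopConf)
    .map { case (_, text) => text.toString }  // one <node>...</node> block per record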

Re: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-05-16 Thread Nathan Kronenfeld
. -- Software Engineer Analytics Engineering Team@ Box Mountain View, CA -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

process_local vs node_local

2014-04-14 Thread Nathan Kronenfeld
longer it takes in node_local mode, it seems like the whole thing would probably run much faster just by waiting for the right jvm to be free. Is there any way of forcing this? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street
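The knob usually pointed at for this trade-off (an assumption; the thread's resolution isn't shown) is spark.locality.wait, which sets how long the scheduler holds a task for its preferred locality level before downgrading to the next one:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Newer Spark accepts duration suffixes; older versions took milliseconds.
    .set("spark.locality.wait", "10s")
    .set("spark.locality.wait.process", "30s")  // tune the process-local tier specifically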

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Nathan Kronenfeld
traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc) that might help with debugging. Andrew On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: What does this error mean

Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186 ] ? -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com

Re: mapPartitions use case

2014-03-24 Thread Nathan Kronenfeld
summary objects. Again, it saves a lot of object creation. On Mon, Mar 24, 2014 at 8:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Sorry for asking such a basic question, but can someone explain when one should use mapPartitions instead of map. Thanks Jaonary -- Nathan
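A sketch of the classic mapPartitions pattern described above: pay an expensive setup cost once per partition instead of once per element (ExpensiveParser is a stand-in for any costly resource):

  // Stand-in for an expensive, possibly non-serializable resource.
  class ExpensiveParser { def parse(s: String): Array[Int] = s.split(",").map(_.toInt) }

  val lines = sc.parallelize(Seq("1,2", "3,4", "5,6"))
  val parsed = lines.mapPartitions { iter =>
    val parser = new ExpensiveParser()  // constructed once per partition, not per line
    iter.map(parser.parse)
  }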

spark-streaming

2014-03-14 Thread Nathan Kronenfeld
]. How are subclasses expected to override this if it's private? If they aren't, how and when should they now clear any extraneous data they have? Similarly, I now see no way to get the timing information - how is a custom dstream supposed to do this now? Thanks, -Nathan -- Nathan