Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell
Ravi, did your issue ever get solved? I think I've been hitting the same thing: it looks like the spark.sql.autoBroadcastJoinThreshold logic isn't kicking in as expected; if I set it to -1, the computation proceeds successfully. On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal
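
For reference, a minimal sketch of the workaround in a Spark 2.0 session (assuming the usual spark handle in spark-shell):

  // Disable automatic broadcast joins; the planner then falls back to
  // sort-merge joins instead of broadcasting a mis-estimated table.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

  // or at submit time:
  //   --conf spark.sql.autoBroadcastJoinThreshold=-1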

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Ian O'Connell
object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM. On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks
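
Spelled out a bit more, a sketch of that pattern against the standard Stanford pipeline API (the annotator list here is just an example):

  import java.util.Properties
  import edu.stanford.nlp.pipeline.StanfordCoreNLP

  object MyCoreNLP {
    // Built lazily, once per classloader/JVM, and never shipped with
    // the closure because only the object reference is captured.
    @transient lazy val pipeline: StanfordCoreNLP = {
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit, pos")
      new StanfordCoreNLP(props)
    }
  }

  // inside a job: rdd.map(text => MyCoreNLP.pipeline.process(text))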

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
What's the error with the 2.10 version of Algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com wrote: I've tried: ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import
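
For anyone following along: once the jar is on the classpath those imports should succeed. A minimal sketch of exercising HyperLogLog from the shell, against the Algebird 0.8.x API as I remember it (treat the exact method names as assumptions):

  scala> val hll = new HyperLogLogMonoid(12)         // 2^12 registers
  scala> val sketch = hll("some-id".getBytes("UTF-8"))
  scala> println(sketch.estimatedSize)               // ~1.0 for one element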

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env. On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev buntu...@gmail.com wrote: Thanks.. I was using Scala 2.11.1 and was able to use algebird-core_2.10-0.1.11.jar with spark-shell. On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell i
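
As an aside, a minimal build.sbt sketch of letting sbt pick the right binary (versions taken from this thread, for illustration):

  // %% appends the Scala binary version (_2.10 / _2.11) to the
  // artifact name, so the jar matches scalaVersion automatically.
  scalaVersion := "2.11.1"
  libraryDependencies += "com.twitter" %% "algebird-core" % "0.8.0"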

Re: Kryo UnsupportedOperationException

2014-09-25 Thread Ian O'Connell
I would guess the field serializer is having issues reconstructing the class again; it's pretty much best-effort. Is this an intermediate type? On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We're running into an error (below) when trying to read spilled
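
If an intermediate type is the culprit, one workaround is to take it out of FieldSerializer's hands; a hedged sketch against the Spark 1.x Kryo hooks (ProblemType is a stand-in for whatever class is failing):

  import com.esotericsoftware.kryo.Kryo
  import com.esotericsoftware.kryo.serializers.JavaSerializer
  import org.apache.spark.serializer.KryoRegistrator

  class ProblemType  // stand-in for the class Kryo fails to rebuild

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // Bypass FieldSerializer for the troublesome type by falling
      // back to plain Java serialization.
      kryo.register(classOf[ProblemType], new JavaSerializer())
    }
  }

  // conf: spark.serializer=org.apache.spark.serializer.KryoSerializer
  //       spark.kryo.registrator=MyRegistrator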

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
Mmm, how many days' worth of data, and how deep is your data nesting? I suspect you're running into a current issue with Parquet (a fix is in master, but I don't believe it's released yet). It reads all the metadata onto the submitter node as part of scheduling the job. This can cause long start times (timeouts

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements, when computing distinct cardinality for hourly metrics a much more scalable method would be to use a HyperLogLog data structure. A Scala impl people have used with Spark would be
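
To make that concrete, a hedged sketch of swapping groupByKey for per-key HLL sketches with Algebird (the RDD shape, names, and bit count are illustrative):

  import com.twitter.algebird._

  val hll = new HyperLogLogMonoid(12)   // fixed-size sketch (~4 KB at 12 bits)
  // events: RDD[(String, String)] of (hourBucket, userId) -- assumed shape
  val distinctPerHour = events
    .map { case (hour, user) => (hour, hll(user.getBytes("UTF-8"))) }
    .reduceByKey(hll.plus(_, _))        // merges sketches; no giant groups
    .mapValues(_.estimatedSize)         // approximate distinct count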

Re: Spark and Java 8

2014-05-06 Thread Ian O'Connell
I think the distinction there might be that they never said they ran that code under CDH5, just that Spark supports it and Spark runs under CDH5; it doesn't follow that you can use these features while running under CDH5. They could use Mesos or the standalone scheduler to run them. On Tue, May 6, 2014 at 6:16 AM,

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for then, I believe. You just reference the object as an object in your closure, so it won't be swept up when your closure is serialized, and you can then reference variables of the object on the remote host. e.g.: object MyObject { val mmap =
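
The quoted snippet is cut off in the archive; a completed sketch of the pattern (key/value types and expensiveLookup are assumptions):

  import scala.collection.mutable

  object MyObject {
    // Resolved per-JVM on each executor; not serialized with the closure.
    val mmap = mutable.Map[String, String]()
  }

  // rdd.map { k => MyObject.mmap.getOrElseUpdate(k, expensiveLookup(k)) }
  // Caveat: every executor JVM has its own map; mutations never flow
  // back to the driver.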

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread Ian O'Connell
I'm guessing the other result was wrong, or just never evaluated here. Because the RDD transforms are lazy, the expression may have been allowed, but it wouldn't work: nested RDDs are not supported. On Mon, Mar 17, 2014 at 4:01 PM, anny9699 anny9...@gmail.com wrote: Hi Andrew, Thanks for the reply.
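
To spell out the nested-RDD point, a hedged sketch with made-up data:

  val rddA = sc.parallelize(Seq(1, 2, 3))
  val rddB = sc.parallelize(Seq(2, 3, 4))

  // Broken: rddB only exists on the driver, so referencing it inside
  // rddA's closure fails on the executors (often as an NPE):
  // rddA.map(x => rddB.filter(_ == x).count())

  // One fix: collect the small side and broadcast it instead.
  val smallSet = sc.broadcast(rddB.collect().toSet)
  val flags = rddA.map(x => smallSet.value.contains(x))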