Re: can spark take advantage of ordered data?

2017-03-10 Thread Jonathan Coveney
> >>> Hi Jonathan, >>> >>> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and https://github.com/tresata/spark-sorted (not part of spark, but it is available right now). >>> Hopefully that's
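
A minimal sketch of what stock Spark offered in this area at the time, independent of SPARK-3655: repartitionAndSortWithinPartitions sorts keys within each partition, which a mapPartitions pass can then exploit (data and partition count are illustrative):

    import org.apache.spark.{HashPartitioner, SparkContext}

    def sortedPass(sc: SparkContext) = {
      val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
      pairs
        .repartitionAndSortWithinPartitions(new HashPartitioner(4))  // sort within partitions
        .mapPartitions { it =>
          it.map { case (k, v) => (k, v + 1) }  // 'it' sees keys in sorted order per partition
        }
    }

Note this gives per-partition order only, not the cross-partition total ordering the thread asks Spark to exploit.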

Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
Threads On Friday, January 15, 2016, Kira wrote: > Hi, > > Can we run *simultaneous* actions on the *same RDD*? If yes, how can this > be done? > > Thank you, > Regards

Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
cture your RDD transformations > to compute the required results in one single operation. > > On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote: > >> Threads >> >> >> On Friday
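
A minimal sketch of the "Threads" answer, assuming a spark-shell style sc: cache the RDD and launch each action from its own thread (scala.concurrent.Future here) so the scheduler can run them concurrently:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val rdd = sc.parallelize(1 to 1000000).cache()   // cache so both actions reuse the data
    val countF = Future { rdd.count() }              // action #1 on its own thread
    val sumF   = Future { rdd.map(_.toLong).sum() }  // action #2 runs concurrently
    println(Await.result(countF, 10.minutes), Await.result(sumF, 10.minutes))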

Is there a way to delete task history besides using a ttl?

2015-11-17 Thread Jonathan Coveney
so I have the following... broadcast some stuff, cache an RDD, do a bunch of stuff, eventually calling actions which reduce it to an acceptable size. I'm getting an OOM on the driver (well, GC is getting out of control), largely because I have a lot of partitions and it looks like the job history

Re: Is there a way to delete task history besides using a ttl?

2015-11-17 Thread Jonathan Coveney
reading the code, is there any reason why setting spark.cleaner.ttl.MAP_OUTPUT_TRACKER directly won't get picked up? 2015-11-17 14:45 GMT-05:00 Jonathan Coveney <jcove...@gmail.com>: > so I have the following... > > broadcast some stuff > cache an rdd > do a bunch of stuf
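
For reference, spark.cleaner.ttl was set like any other conf key; a minimal sketch (the one-hour value is illustrative, and per the message above it is unclear whether the MAP_OUTPUT_TRACKER-specific variant is honored):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ttl-example")
      .set("spark.cleaner.ttl", "3600")  // forget metadata older than 3600 seconds
    val sc = new SparkContext(conf)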

Re: Sort Merge Join

2015-11-02 Thread Jonathan Coveney
Additionally, I'm curious if there are any JIRAs around making DataFrames support ordering better? There are a lot of operations that can be optimized if you know that you have a total ordering on your data... are there any plans, or at least JIRAs, around having the Catalyst optimizer handle this

Re: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Jonathan Coveney
Caused by: java.lang.ClassNotFoundException: scala.Some indicates that you don't have the scala libs present. How are you executing this? My guess is the issue is a conflict between scala 2.11.6 in your build and 2.11.7? Not sure...try setting your scala to 2.11.7? But really, first it'd be good

Re: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Jonathan Coveney
> you suggested. I am unclear as to why it works with 2.11.7 and not 2.11.6. > > Thanks, > Babar > > On Mon, Nov 2, 2015 at 2:10 PM Jonathan Coveney <jcove...@gmail.com> wrote: > >> Caused by: java.lang.ClassNotFou
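
A minimal build.sbt sketch of the resolution this thread converges on: pin scalaVersion to the Scala release your Spark artifacts were built against (2.11.7 per Babar's report; the Spark version is taken from the subject line):

    scalaVersion := "2.11.7"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
    )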

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread Jonathan Coveney
do you have JAVA_HOME set to a java 7 JDK? 2015-10-23 7:12 GMT-04:00 emlyn: > xjlin0 wrote > > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (with pre-built with > > or without Hadoop, or home compiled with ant or maven). There was no > > error message in v1.4.x,

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that onto 50 nodes means you'll have a ton of tiny partitions all finishing very quickly,
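
One way to probe the overhead theory: collapse a tiny dataset into a handful of partitions before the heavy work and compare runtimes (path and partition count are illustrative):

    // fewer, larger partitions => less scheduling/communication overhead on tiny data
    val small = sc.textFile("hdfs:///tiny/dataset").coalesce(4)
    small.map(_.length.toLong).sum()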

Re: RDD of ImmutableList

2015-10-06 Thread Jonathan Coveney
Nobody is saying not to use immutable data structures, only that guava's aren't natively supported. Scala's default collections library is all immutable: List, Vector, Map. This is what people generally use, especially in scala code! On Tuesday, October 6, 2015, Jakub Dubovsky <
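
To make that concrete, a small sketch of an RDD over Scala's own immutable collections, which work without Guava:

    // Scala's default List (immutable) works as an RDD element type out of the box
    val rdd = sc.parallelize(Seq(List(1, 2, 3), List(4, 5)))
    rdd.map(_.sum).collect()  // Array(6, 9)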

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Jonathan Coveney
LZO files are not splittable by default, but there are projects with Input and Output formats to make splittable LZO files. Check out Twitter's elephant-bird on GitHub. On Wednesday, October 7, 2015, Mohammed Guller wrote: > It is not uncommon to process datasets
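
A hedged sketch of reading splittable LZO text, assuming elephant-bird (plus the hadoop-lzo codec and its .index files) is on the classpath; the input-format class name is from memory of that project, so treat it as an assumption:

    import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    val lines = sc.newAPIHadoopFile(
      "hdfs:///logs/part.lzo",
      classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text]
    ).map(_._2.toString)  // drop the byte-offset key, keep the line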

Re: does KafkaCluster can be public ?

2015-10-06 Thread Jonathan Coveney
You can put a class in the org.apache.spark namespace to access anything that is private[spark]. You can then make enrichments there to access whatever you need. Just beware upgrade pain :) On Tuesday, October 6, 2015, Erwan ALLAIN wrote: > Hello, > > I'm
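
A minimal sketch of that trick: compile a file in your own project that declares itself inside the Spark package tree, so private[spark] members become visible (KafkaCluster's constructor signature here is an assumption based on the 1.x spark-streaming-kafka module):

    // the package line is what grants access to private[spark] members
    package org.apache.spark.streaming.kafka

    object KafkaClusterAccessor {
      def create(kafkaParams: Map[String, String]): KafkaCluster =
        new KafkaCluster(kafkaParams)
    }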

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Jonathan Coveney
It's highly conceivable to beat Spark's performance on tiny data sets like this. That's not really what it has been optimized for. On Tuesday, September 22, 2015, juljoin wrote: > Hello, > > I am trying to figure Spark out and I still have some

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Jonathan Coveney
Having a file per record is pretty inefficient on almost any file system. On Tuesday, September 22, 2015, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > We are trying to load around 10k avro files (each file holds only one > record) using spark-avro but it takes over 15

Re: Java vs. Scala for Spark

2015-09-08 Thread Jonathan Coveney
It worked for Twitter! Seriously though: scala is much much more pleasant. And scala has a great story for using Java libs. And since spark is kind of framework-y (use its scripts to submit, start up repl, etc) the projects tend to be lead projects, so even in a big company that uses Java the

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
Try adding the following to your build.sbt libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6" I believe that spark shades the scala library, and this is a library that it looks like you need in an unshaded way. 2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
Version := "2.11.6" > > libraryDependencies ++= Seq( > "org.apache.spark" %% "spark-core" % "1.4.1" % "provided", > "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided", > "

Re: extracting file path using dataframes

2015-09-01 Thread Jonathan Coveney
You can make a Hadoop input format which passes through the name of the file. I generally find it easier to just hit Hadoop, get the file names, and construct the RDDs, though. On Tuesday, September 1, 2015, Matt K wrote: > Just want to add - I'm looking to partition
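
A sketch of the second approach described above (list the files with the Hadoop FileSystem API, build one RDD per file, tag each record with its path; the directory is illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///data/input")).map(_.getPath.toString)
    // one RDD per file, each record paired with its source path, unioned back together
    val tagged = sc.union(files.map(p => sc.textFile(p).map(line => (p, line))).toSeq)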

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Jonathan Coveney
Array[String] doesn't pretty-print by default. Use .mkString(",") for example. On Thursday, August 27, 2015, Arun Luthra arun.lut...@gmail.com wrote: What types of RDD can saveAsObjectFile(path) handle? I tried a naive test with an RDD[Array[String]], but when I tried to read back the
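
For example, to read the objects back and print them legibly (paths illustrative):

    val back = sc.objectFile[Array[String]]("hdfs:///out/path")
    back.collect().foreach(arr => println(arr.mkString(",")))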

Re: spark and scala-2.11

2015-08-24 Thread Jonathan Coveney
I've used the instructions and it worked fine. Can you post exactly what you're doing, and what it fails with? Or are you just trying to understand how it works? 2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com: Hello, The instructions for building spark against scala-2.11

Re: How to set log level in spark-submit ?

2015-07-29 Thread Jonathan Coveney
Put a log4j.properties file in conf/. You can copy log4j.properties.template as a good base. On Wednesday, July 29, 2015, canan chen ccn...@gmail.com wrote: Anyone know how to set log level in spark-submit? Thanks

Re: broadcast variable question

2015-07-28 Thread Jonathan Coveney
That's great! Thanks. On Tuesday, July 28, 2015, Ted Yu yuzhih...@gmail.com wrote: If I understand correctly, there would be one value in the executor. Cheers On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney jcove...@gmail.com

broadcast variable question

2015-07-28 Thread Jonathan Coveney
I am running in coarse-grained mode, let's say with 8 cores per executor. If I use a broadcast variable, will all of the tasks in that executor share the same value? Or will each task broadcast its own value? I.e. in this case, would there be one value in the executor shared by the 8 tasks, or would
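
A sketch of the scenario being asked about; per Ted Yu's reply above, the broadcast value lives once per executor JVM and is shared by all 8 tasks (names and data illustrative):

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))  // one copy per executor, not per task
    val total = sc.textFile("hdfs:///words")
      .map(w => lookup.value.getOrElse(w, 0))
      .sum()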

--jars not working?

2015-06-12 Thread Jonathan Coveney
Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos 0.19.0)... Regardless, I'm running into a really weird situation where when I pass --jars to bin/spark-shell I can't reference those classes on the repl. Is this expected? The logs even tell me that my jars have been added, and

Re: Avro to Parquet ?

2015-05-07 Thread Jonathan Coveney
A helpful example of how to convert: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/ As far as performance, that depends on your data. If you have a lot of columns and use all of them, parquet deserialization is expensive. If you have a lot of columns and only need a few
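
A minimal conversion sketch using the spark-avro package of that era (assumes com.databricks:spark-avro is on the classpath and a SQLContext named sqlContext, as in spark-shell):

    // read Avro via spark-avro, write Parquet with Spark SQL
    val df = sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///in/avro")
    df.write.parquet("hdfs:///out/parquet")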

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Jonathan Coveney
Can you check your local and remote logs? 2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com: This problem happen in Spark 1.3.1. It happen when two jobs are running simultaneously each in its own Spark Context. I don’t remember seeing this bug in Spark

Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
As per my understanding, storing a 5-minute file means we could not create an RDD more granular than 5 minutes. This depends on the file format. Many file formats are splittable (like parquet), meaning that you can seek into various points of the file. 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
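
To illustrate: with a splittable format, partitioning is not bound to file boundaries, so even a 5-minute file can be split further (path and count illustrative):

    val rdd = sc.textFile("hdfs:///stream/2015-05-05/*.txt", minPartitions = 64)
    rdd.partitions.length  // >= 64 if the underlying files are splittable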

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
about a workload where that's relevant though, before going that route. Maybe if people are using SSD's that would make sense. - Patrick On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote: I'm surprised that I haven't been able to find this via google, but I haven't

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
a filter on each RDD first? We do not do this using Pig on M/R. Is it required in the Spark world? On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney jcove...@gmail.com wrote: My guess would be data skew. Do you know if there is some item id that is a catch-all? Can it be null? Item id 0? Lots

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
My guess would be data skew. Do you know if there is some item id that is a catch-all? Can it be null? Item id 0? Lots of data sets have this sort of value and it always kills joins. 2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: Code: val viEventsWithListings: RDD[(Long,
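
A quick skew check along those lines: count records per join key and inspect the heaviest ones (the RDD name follows the quoted code; the rest is illustrative):

    // null/0/catch-all ids surface at the top of this list
    val keyCounts = viEventsWithListings
      .map { case (itemId, _) => (itemId, 1L) }
      .reduceByKey(_ + _)
    keyCounts.top(20)(Ordering.by(_._2)).foreach(println)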

Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
I'm surprised that I haven't been able to find this via google, but I haven't... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding how this is configured... Thanks for any help!

What's the cleanest way to make spark aware of my custom scheduler?

2015-04-13 Thread Jonathan Coveney
I need to have my own scheduler to point to a proprietary remote execution framework. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152 I'm looking at where it decides on the backend and it doesn't look like there is a hook. Of course I can

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 [] | MapPartitionsRDD[33] at mapValues at <console>:23 [] | ShuffledRDD[32] at
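
The takeaway from that debug output, sketched directly via the partitioner field: mapValues preserves the partitioner (so the second reduceByKey avoids a shuffle), while a plain map drops it:

    val base = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
      .map((_, 1))
      .reduceByKey(_ + _)

    base.mapValues(_ + 1).partitioner  // Some(HashPartitioner) -- preserved
    base.map(identity).partitioner     // None -- plain map drops it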

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
, at 2:49 PM, Jonathan Coveney jcove...@gmail.com wrote: I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at <console>:23

can spark take advantage of ordered data?

2015-03-11 Thread Jonathan Coveney
Hello all, I am wondering if spark already has support for optimizations on sorted data and/or if such support could be added (I am comfortable dropping to a lower level if necessary to implement this, but I'm not sure if it is possible at all). Context: we have a number of data sets which are