Re: to retrieve full stack trace

2015-08-18 Thread Koert Kuipers
if your error is on the executors you need to check the executor logs for the full stacktrace On Tue, Aug 18, 2015 at 10:01 PM, satish chandra j jsatishchan...@gmail.com wrote: HI All, Please let me know if any arguments need to be passed in the CLI to retrieve the FULL STACK TRACE in Apache Spark. I am stuck in

Re: Starting a service with Spark Executors

2015-08-09 Thread Koert Kuipers
starting is easy, just use a lazy val. stopping is harder. i do not think executors have a cleanup hook currently... On Sun, Aug 9, 2015 at 5:29 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'd like to start a service with each Spark Executor upon initialization and have the
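A minimal sketch of the lazy-val approach (MyService and its start/process methods are hypothetical); the service is created at most once per executor JVM, the first time a task on that executor touches it:

    import org.apache.spark.rdd.RDD

    // hypothetical service; only the start()/process() lifecycle matters for the pattern
    class MyService {
      def start(): Unit = println("service started")
      def process(s: String): String = s.toUpperCase
    }

    object ExecutorSideService {
      // lazy val: initialized at most once per executor JVM, on first access
      lazy val service: MyService = {
        val s = new MyService
        s.start()
        s
      }
    }

    def withService(lines: RDD[String]): RDD[String] =
      lines.mapPartitions { iter =>
        val svc = ExecutorSideService.service // forces startup on the executor, not the driver
        iter.map(svc.process)
      }

As the reply notes, there is no matching shutdown hook, so anything started this way only goes away when the executor JVM exits.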

create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
has anyone tried to make HiveContext only if the class is available? i tried this: implicit lazy val sqlc: SQLContext = try { Class.forName("org.apache.spark.sql.hive.HiveContext", true, Thread.currentThread.getContextClassLoader)
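A sketch of that idea in full, loading HiveContext reflectively so the code compiles without spark-hive on the classpath and falls back to SQLContext at runtime (Spark 1.4-era APIs assumed):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    def makeSqlContext(sc: SparkContext): SQLContext =
      try {
        // look up HiveContext by name so there is no compile-time dependency on spark-hive
        val clazz = Class.forName("org.apache.spark.sql.hive.HiveContext",
          true, Thread.currentThread.getContextClassLoader)
        clazz.getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
      } catch {
        case _: ClassNotFoundException => new SQLContext(sc)
      }

Whether this works also depends on which classloader spark-submit puts the hive classes on; the classloader issue that comes up later in this thread went away on 1.4.1.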

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
i am using scala 2.11 spark jars are not in my assembly jar (they are provided), since i launch with spark-submit On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers ko...@tresata.com wrote: spark 1.4.0 spark-csv is a normal dependency of my project and in the assembly jar that i use but i

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037). What is the version of Spark you are using? How did you add the spark-csv jar? On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers ko...@tresata.com wrote: has anyone tried to make

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
that solved it, thanks! On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers ko...@tresata.com wrote: thanks i will try 1.4.1 On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai yh...@databricks.com wrote: Hi Koert, For the classloader issue, you probably hit https://issues.apache.org/jira/browse/SPARK

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
? Thanks, Yin On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers ko...@tresata.com wrote: i am using scala 2.11 spark jars are not in my assembly jar (they are provided), since i launch with spark-submit On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers ko...@tresata.com wrote: spark 1.4.0

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817 On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote: i see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case insensitive resolution. apparently the issue

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
, Akhil Das ak...@sigmoidanalytics.com wrote: I think you can open up a jira, not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert

duplicate names in sql allowed?

2015-07-02 Thread Koert Kuipers
i am surprised this is allowed... scala> sqlContext.sql("select name as boo, score as boo from candidates").schema res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true)) should StructType check for duplicate field names?
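A sketch of the check being asked about, against the Spark 1.4-era StructType API:

    import org.apache.spark.sql.types.StructType

    // returns the field names that occur more than once in a schema
    def duplicateFieldNames(schema: StructType): Seq[String] =
      schema.fieldNames
        .groupBy(identity)
        .collect { case (name, occurrences) if occurrences.length > 1 => name }
        .toSeq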

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Koert Kuipers
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) On Wed, Jul 1, 2015 at 11:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: How do i persist an RDD using StorageLevel.MEMORY_AND_DISK_SER ? -- Deepak

Re: Fine control with sc.sequenceFile

2015-06-29 Thread Koert Kuipers
see also: https://github.com/apache/spark/pull/6848 On Mon, Jun 29, 2015 at 12:48 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith(_)).get + /*,

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
you need 1) to publish to inhouse maven, so your application can depend on your version, and 2) use the spark distribution you compiled to launch your job (assuming you run with yarn so you can launch multiple versions of spark on same cluster) On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
-1.4.0/dist/lib/ cp: /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar: No such file or directory LM-SJL-00877532:spark-1.4.0 dvasthimal$ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver On Sun, Jun 28, 2015 at 1:41 PM, Koert

Re: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Koert Kuipers
spark is partitioner aware, so it can exploit a situation where 2 datasets are partitioned the same way (for example by doing a map-side join on them). map-red does not expose this. On Sun, Jun 28, 2015 at 12:13 PM, YaoPau jonrgr...@gmail.com wrote: I've heard Spark is not just MapReduce
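A sketch of what exploiting the partitioner looks like in user code (the two input RDDs are hypothetical): give both sides the same partitioner and cache them, and the join no longer needs to re-shuffle either side.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
    import org.apache.spark.rdd.RDD

    def coPartitionedJoin(users: RDD[(Int, String)],
                          orders: RDD[(Int, Double)]): RDD[(Int, (String, Double))] = {
      val p = new HashPartitioner(100)
      // both sides hashed the same way and kept around
      val u = users.partitionBy(p).cache()
      val o = orders.partitionBy(p).cache()
      // spark sees the matching partitioners and skips the shuffle for the join
      u.join(o)
    }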

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
-Phive-thriftserver On Sun, Jun 28, 2015 at 1:41 PM, Koert Kuipers ko...@tresata.com wrote: you need 1) to publish to inhouse maven, so your application can depend on your version, and 2) use the spark distribution you compiled to launch your job (assuming you run with yarn so you can launch

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
-guide.html#which-storage-level-to-choose When do i choose this setting ? (Attached is my code for reference) On Sun, Jun 28, 2015 at 2:57 PM, Koert Kuipers ko...@tresata.com wrote: a blockJoin spreads out one side while replicating the other. i would suggest replicating the smaller side
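A sketch of the replicate-and-spread idea (not the actual blockJoin implementation): salt the big side over r sub-keys and replicate the small side r times, so a single hot key is handled by r tasks instead of one.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag
    import scala.util.Random

    def skewJoin[K: ClassTag, V: ClassTag, W: ClassTag](
        big: RDD[(K, V)], small: RDD[(K, W)], r: Int): RDD[(K, (V, W))] = {
      // spread out the big side: every record gets a random salt in [0, r)
      val salted = big.map { case (k, v) => ((k, Random.nextInt(r)), v) }
      // replicate the small side once per salt value
      val replicated = small.flatMap { case (k, w) => (0 until r).map(i => ((k, i), w)) }
      salted.join(replicated).map { case ((k, _), vw) => (k, vw) }
    }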

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
. And are my assumptions on replication levels correct? Did you get a chance to look at my processing. On Sun, Jun 28, 2015 at 3:31 PM, Koert Kuipers ko...@tresata.com wrote: regarding your calculation of executors... RAM in executor is not really comparable to size on disk. if you read from

Re: Join highly skewed datasets

2015-06-26 Thread Koert Kuipers
we went through a similar process, switching from scalding (where everything just works on large datasets) to spark (where it does not). spark can be made to work on very large datasets, it just requires a little more effort. pay attention to your storage levels (should be memory-and-disk or

sql dataframe internal representation

2015-06-25 Thread Koert Kuipers
i noticed in DataFrame that to get the rdd out of it some conversions are done: val converter = CatalystTypeConverters.createToScalaConverter(schema) rows.map(converter(_).asInstanceOf[Row]) does this mean DataFrame internally does not use the standard scala types? why not?

org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Koert Kuipers
just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark sql code. the issue went away when i changed my tests to run consecutively...

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Koert Kuipers
could it be composed maybe? a general version and then a sql version that exploits the additional info/abilities available there and uses the general version internally... i assume the sql version can benefit from the logical phase optimization to pick join details. or is there more? On Tue, Jun

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Koert Kuipers
a skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks, see for example in scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala this would be a great addition to

Re: Recommended Scala version

2015-05-26 Thread Koert Kuipers
we are still running into issues with spark-shell not working on 2.11, but we are running on somewhat older master so maybe that has been resolved already. On Tue, May 26, 2015 at 11:48 AM, Dean Wampler deanwamp...@gmail.com wrote: Most of the 2.11 issues are being resolved in Spark 1.4. For a

Re: spark-shell breaks for scala 2.11 (with yarn)?

2015-05-08 Thread Koert Kuipers
i searched the jiras but couldnt find any recent mention of this. let me try with 1.4.0 branch and see if it goes away... On Wed, May 6, 2015 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hello all, i build spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and scala 2.11. i am

history server

2015-05-07 Thread Koert Kuipers
i am trying to launch the spark 1.3.1 history server on a secure cluster. i can see in the logs that it successfully logs into kerberos, and it is replaying all the logs, but i never see the log message that indicate the web server is started (i should see something like Successfully started

Re: history server

2015-05-07 Thread Koert Kuipers
:17 PM, Koert Kuipers ko...@tresata.com wrote: good idea i will take a look. it does seem to be spinning one cpu at 100%... On Thu, May 7, 2015 at 2:03 PM, Marcelo Vanzin van...@cloudera.com wrote: Can you get a jstack for the process? Maybe it's stuck somewhere. On Thu, May 7, 2015 at 11

Re: history server

2015-05-07 Thread Koert Kuipers
got it. thanks! On Thu, May 7, 2015 at 2:52 PM, Marcelo Vanzin van...@cloudera.com wrote: Ah, sorry, that's definitely what Shixiong mentioned. The patch I mentioned did not make it into 1.3... On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote: seems i got one thread

branch-1.4 scala 2.11

2015-05-07 Thread Koert Kuipers
i am having no luck using the 1.4 branch with scala 2.11 $ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package [error] /home/koert/src/opensource/spark/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in object RDDOperationScope, multiple overloaded

spark-shell breaks for scala 2.11 (with yarn)?

2015-05-06 Thread Koert Kuipers
hello all, i build spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and scala 2.11. i am running on a secure cluster. the deployment configs are identical. i can launch jobs just fine on both the scala 2.10 and scala 2.11 versions. spark-shell works on the scala 2.10 version, but not on

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Koert Kuipers
shoot me an email if you need any help with spark-sorted. it does not (yet?) have a java api, so you will have to work in scala On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote: I think this Spark Package may be what you're looking for!

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Koert Kuipers
our experience is that, unless you can benefit from spark features such as co-partitioning that allow for more efficient execution, spark is slightly slower for disk-to-disk. On Apr 27, 2015 10:34 PM, bit1...@163.com bit1...@163.com wrote: Hi, I am frequently asked why spark is also much

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Koert Kuipers
because CompactBuffer is considered an implementation detail. It is also not public for the same reason. On Thu, Apr 23, 2015 at 6:46 PM, Hao Ren inv...@gmail.com wrote: Should I repost this to dev list ? -- View this message in context:

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in spark-streaming-kafka package On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I want to consume messages from kafka queue using spark batch program not spark streaming, Is there any way to achieve this, other than using low

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
better idea :) On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers ko...@tresata.com wrote: Use KafkaRDD directly. It is in spark-streaming-kafka package On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I want to consume messages from kafka queue using spark

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
and streaming is just as good an approach. Not sure... On Sat, Apr 18, 2015 at 3:13 PM, Koert Kuipers ko...@tresata.com wrote: Yeah I think would pick the second approach because it is simpler operationally in case of any failures. But of course the smaller the window gets the more attractive

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Koert Kuipers
i believe it is a generalization of some classes inside graphx, where there was/is a need to keep stuff indexed for random access within each rdd partition On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote: Can somebody from Data Bricks shed more light on this Indexed RDD

Re: spark disk-to-disk

2015-03-24 Thread Koert Kuipers
is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD friends -- but I'll go into more discussion on that thread. On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires

SparkEnv

2015-03-23 Thread Koert Kuipers
is it safe to access SparkEnv.get inside say mapPartitions? i need to get a Serializer (so SparkEnv.get.serializer) thanks
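A small sketch of that usage (a concrete element type is assumed so a ClassTag is available for serialize):

    import org.apache.spark.SparkEnv
    import org.apache.spark.rdd.RDD

    // e.g. measure the serialized size of each record using the configured spark.serializer
    def serializedSizes(records: RDD[(String, Int)]): RDD[Int] =
      records.mapPartitions { iter =>
        val ser = SparkEnv.get.serializer.newInstance() // one serializer instance per task
        iter.map(record => ser.serialize(record).remaining())
      }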

objectFile uses only java serializer?

2015-03-23 Thread Koert Kuipers
in the comments on SparkContext.objectFile it says: "It will also be pretty slow if you use the default serializer (Java serialization)". this suggests the spark.serializer is used, which means i can switch to the much faster kryo serializer. however when i look at the code it uses

hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers
currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The convention seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko

spark disk-to-disk

2015-03-22 Thread Koert Kuipers
i would like to use spark for some algorithms where i make no attempt to work in memory, so read from hdfs and write to hdfs for every step. of course i would like every step to only be evaluated once. and i have no need for spark's RDD lineage info, since i persist to reliable storage. the

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
i added it On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote: Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote: currently spark provides many excellent algorithms for operations

spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
currently spark provides many excellent algorithms for operations per key as long as the data sent to the reducers per key fits in memory. operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small. and groupByKey simply

Re: Columnar-Oriented RDDs

2015-03-01 Thread Koert Kuipers
problems as it kinda gets converted back to a row oriented format. @Koert - that looks really exciting. Do you have any statistics on memory and scan performance? On Saturday, February 14, 2015, Koert Kuipers ko...@tresata.com wrote: i wrote a proof of concept to automatically store any RDD

bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Koert Kuipers
hey, running my first map-red like (meaning disk-to-disk, avoiding in memory RDDs) computation in spark on yarn i immediately got bitten by a too low spark.yarn.executor.memoryOverhead. however it took me about an hour to find out this was the cause. at first i observed failing shuffles leading to

build spark for cdh5

2015-02-18 Thread Koert Kuipers
does anyone have the right maven invocation for cdh5 with yarn? i tried: $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean package $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn test it builds and passes tests just fine, but when i deploy on cluster and i try to

Re: build spark for cdh5

2015-02-18 Thread Koert Kuipers
thanks! my bad On Wed, Feb 18, 2015 at 2:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Koert, You should be using -Phadoop-2.3 instead of -Phadoop2.3. -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote: does anyone have the right maven invocation

Re: Columnar-Oriented RDDs

2015-02-13 Thread Koert Kuipers
i wrote a proof of concept to automatically store any RDD of tuples or case classes in columnar format using arrays (and strongly typed, so you get the benefit of primitive arrays). see: https://github.com/tresata/spark-columnar On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
the whole spark.files.userClassPathFirst never really worked for me in standalone mode, since jars were added dynamically which means they had different classloaders leading to a real classloader hell if you tried to add a newer version of a jar that spark already used. see:

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
applications/situations. never thought i would say that. best On Wed, Feb 4, 2015 at 4:01 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Koert, On Wed, Feb 4, 2015 at 11:35 AM, Koert Kuipers ko...@tresata.com wrote: do i understand it correctly that on yarn the custom jars are truly placed

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
package for the 1.2.0 release. It shouldn't be causing conflicts. [1] https://issues.apache.org/jira/browse/SPARK-2848 On Wed, Feb 4, 2015 at 2:35 PM, Koert Kuipers ko...@tresata.com wrote: the whole spark.files.userClassPathFirst never really worked for me in standalone mode, since jars were

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
anyhow i am ranting... sorry On Wed, Feb 4, 2015 at 5:54 PM, Koert Kuipers ko...@tresata.com wrote: yeah i think we have been lucky so far. but i dont really see how i have a choice. it would be fine if say hadoop exposes a very small set of libraries as part of the classpath. but if i look

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
for example? or avro? it just makes my life harder. and i dont really see who benefits. the yarn classpath is insane too. On Wed, Feb 4, 2015 at 4:26 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Feb 4, 2015 at 1:12 PM, Koert Kuipers ko...@tresata.com wrote: about putting stuff on classpath

Re: Spark impersonation

2015-02-02 Thread Koert Kuipers
yes jobs run as the user that launched them. if you want to run jobs on a secure cluster then use yarn. spark standalone does not support secure hadoop. On Mon, Feb 2, 2015 at 5:37 PM, Jim Green openkbi...@gmail.com wrote: Hi Team, Does spark support impersonation? For example, when spark

spark on yarn succeeds but exit code 1 in logs

2015-01-31 Thread Koert Kuipers
i have a simple spark app that i run with spark-submit on yarn. it runs fine and shows up with finalStatus=SUCCEEDED in the resource manager logs. however in the nodemanager logs i see this: 2015-01-31 18:30:48,195 INFO

Re: spark on yarn succeeds but exit code 1 in logs

2015-01-31 Thread Koert Kuipers
clue there ? You can pastebin part of the RM log around the time your job ran ? What hadoop version are you using ? Thanks On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers ko...@tresata.com wrote: i have a simple spark app that i run with spark-submit on yarn. it runs fine and shows up

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
. -- *From:* Koert Kuipers ko...@tresata.com *To:* Mohit Jaggi mohitja...@gmail.com *Cc:* Tobias Pfeiffer t...@preferred.jp; Ganelin, Ilya ilya.gane...@capitalone.com; derrickburns derrickrbu...@gmail.com; user@spark.apache.org user@spark.apache.org *Sent:* Friday, January

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
operation such as this one? This use case reminds me of FIR filtering in DSP. It seems that RDDs could use something that serves the same purpose as scala.collection.Iterator.sliding. -- *From:* Koert Kuipers ko...@tresata.com *To:* Mohit Jaggi mohitja

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
assuming the data can be partitioned then you have many timeseries for which you want to detect potential gaps. also assuming the resulting gaps info per timeseries is much smaller data then the timeseries data itself, then this is a classical example to me of a sorted (streaming) foldLeft,
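A sketch of that shape, assuming events keyed by (seriesId, timestamp) and a gap threshold: partition on the series id alone, sort within partitions by (seriesId, timestamp), then make a single streaming pass that only keeps the previous row in memory.

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // partition on seriesId only, while the sort key is the full (seriesId, timestamp)
    class SeriesPartitioner(n: Int) extends Partitioner {
      def numPartitions: Int = n
      def getPartition(key: Any): Int = key match {
        case (series: String, _) => (series.hashCode % n + n) % n
      }
    }

    def findGaps(events: RDD[((String, Long), Unit)],
                 maxGap: Long): RDD[(String, (Long, Long))] =
      events
        .repartitionAndSortWithinPartitions(new SeriesPartitioner(100))
        .mapPartitions { iter =>
          // streaming foldLeft-style pass: remember only the previous (series, timestamp)
          var prev: Option[(String, Long)] = None
          iter.flatMap { case ((series, ts), _) =>
            val gap = prev.collect {
              case (pSeries, pTs) if pSeries == series && ts - pTs > maxGap =>
                (series, (pTs, ts))
            }
            prev = Some((series, ts))
            gap
          }
        }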

Re: different akka versions and spark

2015-01-15 Thread Koert Kuipers
spark 1.2.0, and it indeed does not run on CDH5.3.0. i get class incompatibility errors. On Tue, Jan 6, 2015 at 10:29 AM, Koert Kuipers ko...@tresata.com wrote: if the classes are in the original location than i think its safe to say that this makes it impossible for us to build one app that can

Re: different akka versions and spark

2015-01-06 Thread Koert Kuipers
it also changes some transitive dependencies which also have compatibility issues (e.g. the typesafe config library). But I believe it's needed to support Scala 2.11... On Mon, Jan 5, 2015 at 8:27 AM, Koert Kuipers ko...@tresata.com wrote: since spark shaded akka i wonder if it would work, but i

Re: different akka versions and spark

2015-01-05 Thread Koert Kuipers
, Jan 3, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: hey Ted, i am aware of the upgrade efforts for akka. however if spark 1.2 forces me to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force me to use akka 2.2.x then we cannot build one application that runs on all

Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Koert Kuipers
thats great. i tried this once and gave up after a few hours. On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote: Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be

Re: different akka versions and spark

2015-01-03 Thread Koert Kuipers
, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote: i noticed spark 1.2.0 bumps the akka version. since spark uses it's own akka version, does this mean it can co-exist with another akka version in the same JVM? has anyone tried this? we have some spark apps that also use akka

different akka versions and spark

2015-01-02 Thread Koert Kuipers
i noticed spark 1.2.0 bumps the akka version. since spark uses it's own akka version, does this mean it can co-exist with another akka version in the same JVM? has anyone tried this? we have some spark apps that also use akka (2.2.3) and spray. if different akka versions causes conflicts then

Re: Why so many tasks?

2014-12-16 Thread Koert Kuipers
sc.textFile uses a hadoop input format. hadoop input formats by default create one task per file, and they are not very suitable for many very small files. can you turn your 1000 files into one larger text file? otherwise maybe try: val data = sc.textFile("/user/foo/myfiles/*").coalesce(100) On

Re: spark kafka batch integration

2014-12-15 Thread Koert Kuipers
at 2:41 PM, Koert Kuipers ko...@tresata.com wrote: hello all, we at tresata wrote a library to provide for batch integration between spark and kafka (distributed write of rdd to kafka, distributed read of rdd from kafka). our main use cases are (in lambda architecture jargon): * periodic appends

spark kafka batch integration

2014-12-14 Thread Koert Kuipers
hello all, we at tresata wrote a library to provide for batch integration between spark and kafka (distributed write of rdd to kafka, distributed read of rdd from kafka). our main use cases are (in lambda architecture jargon): * periodic appends to the immutable master dataset on hdfs from kafka

Re: Efficient self-joins

2014-12-08 Thread Koert Kuipers
spark can do efficient joins if both RDDs have the same partitioner. so in case of a self join I would recommend creating an rdd that has an explicit partitioner and has been cached. On Dec 8, 2014 8:52 AM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: Hello all, I am working on a
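A sketch of that suggestion on a toy dataset: partition and cache once, then the self-join reuses the partitioner instead of shuffling both sides.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("self-join-sketch"))

    // hypothetical (src, dst) edge pairs, hash partitioned once and cached
    val edges = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c")))
      .partitionBy(new HashPartitioner(4))
      .cache()

    // both sides share the same partitioner, so the self-join adds no extra shuffle;
    // the result pairs up destinations that share a source
    val coNeighbors = edges.join(edges).values.filter { case (x, y) => x < y }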

Re: Efficient self-joins

2014-12-08 Thread Koert Kuipers
. } ??? // Return something. } On Mon, Dec 8, 2014 at 3:28 PM, Koert Kuipers ko...@tresata.com wrote: spark can do efficient joins if both RDDs have the same partitioner. so in case of self join I would recommend to create an rdd that has explicit partitioner and has been cached. On Dec 8, 2014 8

Re: run JavaAPISuite with maven

2014-12-07 Thread Koert Kuipers
at java.lang.Class.forName(Class.java:270) BTW I didn't find JavaAPISuite in test output either. Cheers On Sat, Dec 6, 2014 at 9:12 PM, Koert Kuipers ko...@tresata.com wrote: Ted, i mean core/src/test/java/org/apache/spark/JavaAPISuite.java On Sat, Dec 6, 2014 at 9:27 PM, Ted Yu yuzhih...@gmail.com

Re: run JavaAPISuite with maven

2014-12-07 Thread Koert Kuipers
</skipTests> </configuration> </plugin> <plugin> I was able to run JavaAPISuite using: mvn test -pl core -Dtest=JavaAPISuite But it takes a long time ... Cheers On Sun, Dec 7, 2014 at 8:56 AM, Koert Kuipers ko...@tresata.com wrote: hey guys, i was able to run the test

Re: run JavaAPISuite with maven

2014-12-07 Thread Koert Kuipers
://issues.apache.org/jira/browse/SPARK-661 I got bit by this too recently and meant to look into it. On Sun, Dec 7, 2014 at 4:50 PM, Koert Kuipers ko...@tresata.com wrote: so as part of the official build the java api does not get tested then? i am sure there is a good reason for it, but thats surprising

run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
when i run mvn test -pl core, i dont see JavaAPISuite being run. or if it is, its being very very quiet about it. is this by design?

Re: run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
. For usage example, see test case JavaAPISuite.testJavaJdbcRDD. ./core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala FYI On Sat, Dec 6, 2014 at 5:43 PM, Koert Kuipers ko...@tresata.com wrote: when i run mvn test -pl core, i dont see JavaAPISuite being run. or if it is, its being very very

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
i suddenly also run into the issue that maven is trying to download snapshots that dont exist for other sub projects. did something change in the maven build? does maven not have the capability to smartly compile the other sub-projects that a sub-project depends on? i'd rather avoid mvn install

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
i think what changed is that core now has dependencies on other sub projects. ok... so i am forced to install stuff because maven cannot compile what is needed. i will install On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers ko...@tresata.com wrote: i suddenly also run into the issue that maven

Re: Alternatives to groupByKey

2014-12-03 Thread Koert Kuipers
do these requirements boil down to a need for foldLeftByKey with sorting of the values? https://issues.apache.org/jira/browse/SPARK-3655 On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu ben...@gmail.com wrote: I have a similar requirement, take top N by key. right now I use groupByKey, but one key
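For the top-N-per-key case specifically, a sketch that keeps only about N values per key at any time by pushing the work map-side with aggregateByKey (concrete types assumed):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def topNByKey(scores: RDD[(String, Int)], n: Int): RDD[(String, Seq[Int])] =
      scores.aggregateByKey(Seq.empty[Int])(
        // fold one value into a partial top-n (runs map-side)
        (acc, v) => (acc :+ v).sorted(Ordering[Int].reverse).take(n),
        // merge two partial top-n lists (runs reduce-side)
        (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(n)
      )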

Re: Java api overhead?

2014-10-29 Thread Koert Kuipers
since spark holds data structures on heap (and by default tries to work with all data in memory) and it's written in Scala, seeing lots of scala Tuple2 is not unexpected. how do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I wanted

Re: Spark with HLists

2014-10-29 Thread Koert Kuipers
looks like a missing class issue? what makes you think its serialization? shapeless does indeed have a lot of helper classes that get sucked in and are not serializable. see here: https://groups.google.com/forum/#!topic/shapeless-dev/05_DXnoVnI4 and for a project that uses shapeless in spark

Re: Is Spark the right tool?

2014-10-28 Thread Koert Kuipers
spark can definitely very quickly answer queries like give me all transactions with property x. and you can put a http query server in front of it and run queries concurrently. but spark does not support inserts, updates, or fast random access lookups. this is because RDDs are immutable and

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessarily the case (for example RDD.take does not). instead i would use mapPartitionsWithContext, in which case you can write a function of the form f: (TaskContext, Iterator[T]) => Iterator[U]. now
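A sketch of that shape, assuming a hypothetical Resource with a close() method and the Spark 1.2-era TaskContext completion-listener API (mapPartitionsWithContext itself is a developer API); the cleanup is tied to task completion, so it also runs when the iterator is only partially consumed:

    import org.apache.spark.rdd.RDD

    // hypothetical resource; only the open/close lifecycle matters here
    class Resource {
      def lookup(s: String): String = s.reverse
      def close(): Unit = println("resource closed")
    }

    def withCleanup(lines: RDD[String]): RDD[String] =
      lines.mapPartitionsWithContext { (context, iter) =>
        val resource = new Resource
        // runs when the task finishes, whether or not the iterator was fully read
        context.addTaskCompletionListener(_ => resource.close())
        iter.map(resource.lookup)
      }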

Re: combine rdds?

2014-10-27 Thread Koert Kuipers
this requires evaluation of the rdd to do the count. val x: RDD[X] = ... val y: RDD[X] = ... x.cache val z = if (x.count > thres) x.union(y) else x On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote: Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is

Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread Koert Kuipers
you ran out of kryo buffer. are you using spark 1.1 (which supports buffer resizing) or spark 1.0 (which has a fixed size buffer)? On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running a simple rdd filter command. What does it mean? Here is the full stack trace (and code
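A sketch of the relevant settings, assuming the Spark 1.x-era MB-suffixed config names:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // initial buffer size and, on spark 1.1+, the ceiling it may grow to (both in MB);
      // on spark 1.0 the buffer stays fixed at the first setting
      .set("spark.kryoserializer.buffer.mb", "8")
      .set("spark.kryoserializer.buffer.max.mb", "256")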

run scalding on spark

2014-10-01 Thread Koert Kuipers
well, sort of! we make input/output formats (cascading taps, scalding sources) available in spark, and we ported the scalding fields api to spark. so it's for those of us that have a serious investment in cascading/scalding and want to leverage that in spark. blog is here:

Re: run scalding on spark

2014-10-01 Thread Koert Kuipers
thanks On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects . Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko

in memory assumption in cogroup?

2014-09-29 Thread Koert Kuipers
apologies for asking yet again about spark memory assumptions, but i cant seem to keep it in my head. if i use PairRDDFunctions.cogroup, it returns for every key 2 iterables. do the contents of these iterables have to fit in memory? or is the data streamed?

Re: Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-21 Thread Koert Kuipers
. On Mon, Sep 15, 2014 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote: in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set

secondary sort

2014-09-20 Thread Koert Kuipers
now that spark has a sort based shuffle, can we expect a secondary sort soon? there are some use cases where getting a sorted iterator of values per key is helpful.

Re: Adjacency List representation in Spark

2014-09-18 Thread Koert Kuipers
we build our own adjacency lists as well. the main motivation for us was that graphx has some assumptions about everything fitting in memory (it has .cache statements all over the place). however if my understanding is wrong and graphx can handle graphs that do not fit in memory i would be interested
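A sketch of building a plain adjacency-list RDD outside graphx, persisted with a disk-backed level so the graph as a whole does not have to fit in memory (each node's neighbor list still has to fit in a single task):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // edges assumed to be (src, dst) pairs
    def adjacencyList(edges: RDD[(Long, Long)]): RDD[(Long, Array[Long])] =
      edges
        .groupByKey()
        .mapValues(_.toArray)
        .persist(StorageLevel.MEMORY_AND_DISK_SER)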

Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-15 Thread Koert Kuipers
in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set spark.driver.extraClassPath or SPARK_CLASSPATH. SPARK_CLASSPATH is set in spark-env.sh

Re: SPARK_MASTER_IP

2014-09-15 Thread Koert Kuipers
hey mark, you think that this is on purpose, or is it an omission? thanks, koert On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover m...@apache.org wrote: Hi Koert, I work on Bigtop and CDH packaging and you are right, based on my quick glance, it doesn't seem to be used. Mark From: Koert

SPARK_MASTER_IP

2014-09-13 Thread Koert Kuipers
a grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sbin/start-slaves.sh are the only ones that use it. yet for example in CDH5 the spark-master is started from /etc/init.d/spark-master by running bin/spark-class. does that mean SPARK_MASTER_IP is simply ignored? it looks like that to

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Koert Kuipers
matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. however join requiring keys to fit in memory is still a big deal to me. does it apply to both sides of the join, or only one (while the other side is streaming)? On Sat, Aug 30,

SchemaRDD

2014-08-27 Thread Koert Kuipers
i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?

mllib style

2014-08-11 Thread Koert Kuipers
i was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala). is there any need for all the variables to be vars and to have all these setters around? it just leads to so much clutter. if you really want them to be vars it is safe in scala to make them public

spark-submit symlink

2014-08-05 Thread Koert Kuipers
spark-submit doesnt handle being a symlink currently: $ spark-submit /usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such file or directory /usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class: cannot execute: No such file or directory to fix i changed the
