Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-07-02 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote: My thinking is to maintain state in an RDD and update it and persist it with each 2-second pass, but this also seems like it could get messy

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Tobias Pfeiffer
Hi, On Thu, Mar 26, 2015 at 4:08 AM, Khandeshi, Ami ami.khande...@fmr.com.invalid wrote: I am seeing the same behavior. I have enough resources….. CPU *and* memory are sufficient? No previous (unfinished) jobs eating them? Tobias

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores ces...@gmail.com wrote: Thanks for both answers. One final question. *This registerTempTable is not an extra process that the SQL queries need to do that may decrease performance over the language integrated method calls? * As far as I

Re: Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Sean, On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise

Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi, On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie jie.hu...@intel.com wrote: According to my understanding, your approach is to register a series of tables by using transformWith, right? And then, you can get a new Dstream (i.e., SchemaDstream), which consists of lots of SchemaRDDs. Yep,

Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise an InvalidActorNameException. I use logger.info(stoppingStreamingContext)

Re: Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, I discovered what caused my issue when running on YARN and was able to work around it. On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: The processing itself is complete, i.e., the batch currently processed at the time of stop() is finished and no further batches

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Tobias Pfeiffer
Hi, On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide in using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone

Re: SQL with Spark Streaming

2015-03-10 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote: Intel has a prototype for doing this, SaiSai and Jason are the authors. Probably you can ask them for some materials. The github repository is here: https://github.com/intel-spark/stream-sql Also, what I did is

Re: Spark with data on NFS v HDFS

2015-03-05 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: I understand Spark can be used with Hadoop or standalone. I have certain questions related to use of the correct FS for Spark data. What is the efficiency trade-off in feeding data to Spark from NFS v

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid iras...@cloudera.com wrote: This doesn't involve spark at all, I think this is entirely an issue with how scala deals w/ primitives and boxing. Often it can hide the details for you, but IMO it just leads to far more confusing errors when

scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, I have a function with signature def aggFun1(rdd: RDD[(Long, (Long, Double))]): RDD[(Long, Any)] and one with def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]): RDD[(_Key, Double)] where all Double classes involved are scala.Double classes (according to

Re: Spark sql results can't be printed out to system console from spark streaming application

2015-03-03 Thread Tobias Pfeiffer
Hi, can you explain how you copied that into your *streaming* application? Like, how do you issue the SQL, what data do you operate on, how do you view the logs etc.? Tobias On Wed, Mar 4, 2015 at 8:55 AM, Cui Lin cui@hds.com wrote: Dear all, I found the below sample code can be

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Tobias Pfeiffer
Hi, On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote: Do you have enough resource in your cluster? You can check your resource manager to see the usage. Yep, I can confirm that this is a very annoying issue. If there is not enough memory or VCPUs available, your app

Re: Best practices for query creation in Spark SQL.

2015-03-02 Thread Tobias Pfeiffer
Hi, I think your chances for a satisfying answer would increase dramatically if you elaborated a bit more on what you actually want to know. (Holds for any of your last four questions about Spark SQL...) Tobias

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote: The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs? Well, RDBs are all

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary, On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote: I'm considering whether or not it is worth introducing Spark at my new company. The data is no-where near Hadoop size at this point (it sits in an RDS Postgres cluster). Will it ever become Hadoop size? Looking

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote: So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. You got me there ;-) Tobias

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore thanigai.vell...@gmail.com wrote: It appears that the function immediately returns even before the foreachRDD stage is executed. Is that possible? Sure, that's exactly what happens. foreachRDD() schedules a computation, it does not
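
A minimal sketch of what this points at (assuming a StreamingContext named ssc and a DStream[Int] named stream, both hypothetical): foreachRDD() only registers an output operation that runs once per batch after ssc.start(), so results have to be accumulated in a driver-side structure inside the closure rather than returned from the enclosing method.

    import scala.collection.mutable.ArrayBuffer

    val collected = ArrayBuffer[Array[Int]]()   // lives in the driver
    stream.foreachRDD { rdd =>
      collected += rdd.collect()                // runs once per batch, only after ssc.start()
    }
    ssc.start()
    ssc.awaitTermination()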

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Sean, thanks for your message! On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote: What I haven't investigated is whether you can enable checkpointing for the state in updateStateByKey separately from this mechanism, which is exactly your question. What happens if you set a

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Hi, On Tue, Feb 24, 2015 at 4:34 AM, Tathagata Das tathagata.das1...@gmail.com wrote: There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing which can be enabled only by called sparkContext.setCheckpointDirectory. But that does not enable Spark

Re: Use Spark Streaming for Batch?

2015-02-22 Thread Tobias Pfeiffer
Hi, On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote: /Might it be possible to perform large batches processing on HDFS time series data using Spark Streaming?/ 1. I understand that there is not currently an InputDStream that could do what's needed. I would have

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi Ayoub, thanks for your mail! On Thu, Jan 29, 2015 at 6:23 PM, Ayoub benali.ayoub.i...@gmail.com wrote: SQLContext and hiveContext have a jsonRDD method which accepts an RDD[String] where the string is a JSON string and returns a SchemaRDD; it extends RDD[Row], which is the type you want. After

SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi, I have data as RDD[(Long, String)], where the Long is a timestamp and the String is a JSON-encoded string. I want to infer the schema of the JSON and then do a SQL statement on the data (no aggregates, just column selection and UDF application), but still have the timestamp associated with

Re: spark challenge: zip with next???

2015-01-29 Thread Tobias Pfeiffer
Hi, On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Make a copy of your RDD with an extra entry in the beginning to offset. Then you can zip the two RDDs and run a map to generate an RDD of differences. Does that work? I recently tried something to compute
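
A minimal sketch of the offset-and-zip idea, assuming rdd: RDD[Double] whose order is already meaningful (e.g. sorted upstream): index every element, shift the indices by one, and join, so that each element is paired with its successor.

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x era)

    val indexed = rdd.zipWithIndex().map(_.swap)              // (i, x_i)
    val shifted = indexed.map { case (i, x) => (i - 1, x) }   // x_{i+1} keyed by i
    val diffs   = indexed.join(shifted).values.map { case (a, b) => b - a }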

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote: My thinking is to maintain state in an RDD and update it and persist it with each 2-second pass, but this also seems like it could get messy. Any thoughts or examples that might help me? I have just implemented some
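
A minimal sketch of one way to bucket by an embedded timestamp instead of arrival time (not necessarily the implementation mentioned above), assuming stream: DStream[(Long, Event)] where the Long is the event's own timestamp in milliseconds:

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (Spark 1.x era)

    val bucketSizeMs = 60 * 1000L   // hypothetical 1-minute buckets
    val countsPerBucket = stream
      .map { case (ts, event) => (ts / bucketSizeMs * bucketSizeMs, 1L) }
      .reduceByKey(_ + _)

Note that reduceByKey here only aggregates within one batch; totals for buckets that span several batches would still need updateStateByKey or similar.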

Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, in my Spark Streaming application, computations depend on users' input in terms of * user-defined functions * computation rules * etc. that can throw exceptions in various cases (think: exception in UDF, division by zero, invalid access by key etc.). Now I am wondering about what is a

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, On Wed, Jan 28, 2015 at 1:45 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: It is a Streaming application, so how/when do you plan to access the accumulator on driver? Well... maybe there would be some user command or web interface showing the errors that have happened during

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for your mail! On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That seems reasonable to me. Are you having any problems doing it this way? Well, actually I haven't done that yet. The idea of using accumulators to collect errors just came while

Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, I want to do something like dstream.foreachRDD(rdd => if (someCondition) ssc.stop()) so in particular the function does not touch any element in the RDD and runs completely within the driver. However, this fails with a NotSerializableException because $outer is not serializable etc. The

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for the answers! On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai saisai.s...@intel.com wrote: Also this `foreachFunc` is more like an action function of RDD, thinking of rdd.foreach(func), in which `func` need to be serializable. So maybe I think your way of use it is not a normal

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Hi, On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez snu...@hortonworks.com wrote: I’ve got a list of points: List[(Float, Float)] that represent (x,y) coordinate pairs and need to sum the distance. It’s easy enough to compute the distance: Are you saying you want all combinations (N^2) of
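
A minimal sketch for the plain-List case the question starts from, assuming points: List[(Float, Float)]: sliding(2) pairs each point with its successor so the distances can be summed.

    def dist(a: (Float, Float), b: (Float, Float)): Double =
      math.hypot(a._1 - b._1, a._2 - b._2)

    val totalDistance = points.sliding(2).collect { case List(a, b) => dist(a, b) }.sum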

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote: I am a beginner to spark streaming. So have a basic doubt regarding checkpoints. My use case is to calculate the no of unique users by day. I am using reduce by key and window for this. Where my window duration is 24

Re: [SQL] Conflicts in inferred Json Schemas

2015-01-25 Thread Tobias Pfeiffer
Hi, On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet cjno...@gmail.com wrote: Let's say I have 2 formats for json objects in the same file schema1 = { "location": "12345 My Lane" } schema2 = { "location": {"houseAddres": "1234 My Lane"} } From my tests, it looks like the current inferSchema() function will

Re: Serializability: for vs. while loops

2015-01-25 Thread Tobias Pfeiffer
Aaron, On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote: Scala for-loops are implemented as closures using anonymous inner classes which are instantiated once and invoked many times. This means, though, that the code inside the loop is actually sitting inside a class,

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Sean, On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen so...@cloudera.com wrote: Note that RDDs don't really guarantee anything about ordering though, so this only makes sense if you've already sorted some upstream RDD by a timestamp or sequence number. Speaking of order, is there some reading

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 9:13 PM, Bob Tiernay btier...@hotmail.com wrote: Maybe I'm misunderstanding something here, but couldn't this be done with broadcast variables? I think there is the following caveat from the docs: In addition, the object v should not be modified after it is broadcast

Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, I am developing a Spark Streaming application where I want every item in my stream to be assigned a unique, strictly increasing Long. My input data already has RDD-local integers (from 0 to N-1) assigned, so I am doing the following: var totalNumberOfItems = 0L // update the keys of the
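
A minimal sketch of the pattern described, assuming localIndexed: DStream[(Long, Item)] with RDD-local indices 0 to N-1 as keys (hypothetical names): the function given to transform() runs in the driver once per batch, so it sees the offset's current value, and foreachRDD() advances the offset afterwards.

    var totalNumberOfItems = 0L

    val globallyIndexed = localIndexed.transform { rdd =>
      val offset = totalNumberOfItems                 // read once per batch, in the driver
      rdd.map { case (localIndex, item) => (localIndex + offset, item) }
    }

    globallyIndexed.foreachRDD { rdd => totalNumberOfItems += rdd.count() }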

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about using accumulators http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators? As far as I understand, they solve the part of the problem that I am not worried about, namely increasing the

Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng pc...@uow.edu.au wrote: I'm talking about RDD1 (not persisted or checkpointed) in this situation: ...(somewhere) -> RDD1 -> RDD2

Re: How to get the master URL at runtime inside driver program?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Driver programs submitted by the spark-submit script will get the runtime spark master URL, but how it get the URL inside the main method when creating the SparkConf object? The master will be stored in the

Re: MatchError in JsonRDD.toLong

2015-01-19 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: The second parameter of jsonRDD is the sampling ratio when we infer schema. OK, I was aware of this, but I guess I understand the problem now. My sampling ratio is so low that I only see the Long values of data

Re: Determine number of running executors

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng szheng.c...@gmail.com wrote: Can you share more information about how do you do that? I also have similar question about this. Not very proud about it ;-), but here you go: // find the number of workers available to us. val _runCmd =
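
A simpler check that avoids parsing the command line (a sketch, assuming an already-initialized SparkContext sc): the block-manager status map has one entry per executor plus one for the driver, so in cluster modes the executor count is its size minus one. In local mode this yields 0, since driver and executor share one JVM.

    val numExecutors = sc.getExecutorMemoryStatus.size - 1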

Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi again, On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote: Now I'm wondering where this comes from (I haven't touched this component in a while, nor upgraded Spark etc.) [...] So the reason that the error is showing up now is that suddenly data from a different

Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Can you provide how you create the JsonRDD? This should be reproducible in the Spark shell: - import org.apache.spark.sql._ val sqlc = new SQLContext(sc)

Re: Testing if an RDD is empty?

2015-01-15 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote: I myself saw many times that my app threw out exceptions because an empty RDD cannot be saved. This is not big issue, but annoying. Having a cheap solution testing if an RDD is empty would be nice if there is no such
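
A minimal sketch of the usual cheap check: take(1) computes only as much as needed to produce a single element, unlike count().

    val empty = rdd.take(1).isEmpty                  // later Spark releases also ship rdd.isEmpty()
    if (!empty) rdd.saveAsObjectFile(outputPath)     // outputPath is a hypothetical val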

MatchError in JsonRDD.toLong

2015-01-15 Thread Tobias Pfeiffer
Hi, I am experiencing a weird error that suddenly popped up in my unit tests. I have a couple of HDFS files in JSON format and my test is basically creating a JsonRDD and then issuing a very simple SQL query over it. This used to work fine, but now suddenly I get: 15:58:49.039 [Executor task

Re: Serializability: for vs. while loops

2015-01-15 Thread Tobias Pfeiffer
Aaron, thanks for your mail! On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote: Scala for-loops are implemented as closures using anonymous inner classes [...] While loops, on the other hand, involve none of this trickery, and everyone is happy. Ah, I was suspecting

Re: *ByKey aggregations: performance + order

2015-01-14 Thread Tobias Pfeiffer
Sean, thanks for your message. On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen so...@cloudera.com wrote: On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote: OK, it seems like even on a local machine (with no network overhead), the groupByKey version is about 5 times slower

Re:

2015-01-14 Thread Tobias Pfeiffer
Hi, On Thu, Jan 15, 2015 at 12:23 AM, Ted Yu yuzhih...@gmail.com wrote: On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li flyingfromch...@gmail.com wrote: I am using Spark-1.1.1. When I used sbt test, I ran into the following exceptions. Any idea how to solve it? Thanks! I think somebody posted

Serializability: for vs. while loops

2015-01-14 Thread Tobias Pfeiffer
Hi, sorry, I don't like questions about serializability myself, but still... Can anyone give me a hint why for (i <- 0 to (maxId - 1)) { ... } throws a NotSerializableException in the loop body while var i = 0 while (i < maxId) { // same code as in the for loop i += 1 } works
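
A minimal sketch of the desugaring behind this (not the original code): a for loop without yield is sugar for a foreach call with an anonymous function, so the loop body lives inside a generated function class, and any Spark closure created there can end up with an $outer chain back to the enclosing, possibly non-serializable, object. A while loop keeps the body in the enclosing method, so its closures capture only the locals they use.

    for (i <- 0 to (maxId - 1)) { rdd.filter(_ == i).count() }   // hypothetical body

    // is compiled roughly as
    (0 to (maxId - 1)).foreach { i => rdd.filter(_ == i).count() }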

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer t...@preferred.jp wrote: If you think of items.map(x => /* throw exception */).count() then even though the count you want to get does not necessarily require the evaluation of the function in map() (i.e., the number is the same

Re: *ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer t...@preferred.jp wrote: Now I don't know (yet) if all of the functions I want to compute can be expressed in this way and I was wondering about *how much* more expensive we are talking about. OK, it seems like even on a local machine

*ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, I have an RDD[(Long, MyData)] where I want to compute various functions on lists of MyData items with the same key (these will in general be rather short lists, around 10 items per key). Naturally I was thinking of groupByKey() but was a bit intimidated by the warning: This operation may be
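
A minimal sketch of the trade-off, assuming rdd: RDD[(Long, MyData)] and a hypothetical numeric field value: if the per-key function can be phrased as an incremental fold, aggregateByKey combines values map-side and never materializes the per-key lists; groupByKey does materialize them, which is fine for ~10 items per key but shuffles all raw values.

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x era)

    // count and sum per key in one pass
    val stats = rdd.aggregateByKey((0L, 0.0))(
      (acc, d) => (acc._1 + 1, acc._2 + d.value),   // fold one value into the accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)        // merge two accumulators
    )

    // if the whole (short) per-key list really is needed:
    val lists = rdd.groupByKey().mapValues(_.toList)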

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Use the mapPartitions function. It returns an iterator to each partition. Then just get that length by converting to an array. On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:

Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis asimja...@gmail.com wrote: One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average over a partition, except for near the end point. Does anyone see how to fix that? Well, I guess this is not
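
A minimal sketch of one way around the partition-boundary problem, assuming values: RDD[Double] already in the intended order: MLlib's developer-API sliding() builds the windows across partition boundaries for you.

    import org.apache.spark.mllib.rdd.RDDFunctions._   // brings sliding() into scope

    val window = 3                                      // hypothetical window size
    val movingAvg = values.sliding(window).map(w => w.sum / w.length)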

Re: Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote: dstream processing bulk HDFS data- is something I don't feel is super well socialized yet, fingers crossed that base gets built up a little more. Just out of interest (and hoping not to hijack my own thread), why are you

Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, I have a setup where data from an external stream is piped into Kafka and also written to HDFS periodically for long-term storage. Now I am trying to build an application that will first process the HDFS files and then switch to Kafka, continuing with the first item that was not yet in HDFS.

Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Tobias Pfeiffer
Hi, On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote: I am a bit new to Spark, except that I tried simple things like word count, and the examples given in the spark sql programming guide. Now, I am investigating the internals of Spark, but I think I am almost lost, because I

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, it looks to me as if you need the whole user database on every node, so maybe put the id-name information as a Map[Id, String] in a broadcast variable and then do something like recommendations.map(line => { line.map(uid => usernames(uid)) }) or so? Tobias
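
A minimal sketch of that idea, assuming users: RDD[(Int, String)] of (id, name) pairs and recommendations: RDD[Seq[Int]] of user ids per line (both hypothetical names):

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x era)

    val usernames = sc.broadcast(users.collectAsMap())   // small map shipped once per executor
    val withNames = recommendations.map(line => line.map(uid => usernames.value(uid)))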

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras samarasrigi...@gmail.com wrote: Yes something like this. Can you please give me an example to create a Map? That depends heavily on the shape of your input file. What about something like: (for (line <- Source.fromFile(filename).getLines())
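
A minimal sketch of such a Map built on the driver, assuming lines like "42,Alice" and a hypothetical filename:

    import scala.io.Source

    val usernames: Map[Int, String] =
      (for (line <- Source.fromFile(filename).getLines()) yield {
        val parts = line.split(",")
        (parts(0).toInt, parts(1))
      }).toMap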

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras samarasrigi...@gmail.com wrote: exactly, that's what I'm looking for, my code is like this: //code val users_map = users_file.map{ s => val parts = s.split(",") (parts(0).toInt, parts(1)) }.distinct //code but I get the error:

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi Michael, On Tue, Jan 6, 2015 at 3:43 PM, Michael Armbrust mich...@databricks.com wrote: Oh sorry, I'm rereading your email more carefully. Its only because you have some setup code that you want to amortize? Yes, exactly that. Concerning the docs, I'd be happy to contribute, but I don't

Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi, I have a SchemaRDD where I want to add a column with a value that is computed from the rest of the row. As the computation involves a network operation and requires setup code, I can't use SELECT *, myUDF(*) FROM rdd, but I wanted to use a combination of: - get schema of input SchemaRDD
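
A rough sketch of that combination with the Spark 1.2-era API, where input is the SchemaRDD and setupClient/lookup stand in for the expensive setup and the per-row network call (all hypothetical):

    import org.apache.spark.sql._   // Row, StructType, StructField, StringType (1.2-era aliases)

    val newSchema = StructType(input.schema.fields :+ StructField("extra", StringType, nullable = true))

    val newRows = input.mapPartitions { rows =>
      val client = setupClient()                              // setup amortized once per partition
      rows.map(row => Row((row.toSeq :+ lookup(client, row)): _*))
    }

    val withExtra = sqlContext.applySchema(newRows, newSchema)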

Re: unable to do group by with 1st column

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote: How can I do it? Please help me to do. Have you considered using groupByKey? http://spark.apache.org/docs/latest/programming-guide.html#transformations Tobias

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow eyc...@hotmail.com wrote: I got some issues with mapPartitions with the following piece of code: val sessions = sc .newAPIHadoopFile( ... path to an avro file ...,

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Tobias Pfeiffer
Nick, uh, I would have expected a rather heated discussion, but the opposite seems to be the case ;-) Independent of my personal preferences w.r.t. usability, habits etc., I think it is not good for a software/tool/framework if questions and discussions are spread over too many places. I guess

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 10:13 AM, ey-chih chow eyc...@hotmail.com wrote: I should rephrase my question as follows: How to use the corresponding Hadoop Configuration of a HadoopRDD in defining a function as an input parameter to the MapPartitions function? Well, you could try to pull

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Tobias Pfeiffer
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD of String. How can we do that? Currently I am doing like this which is not converting correctly no exception but resultant strings are empty here is my code Hehe,
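
A minimal sketch, assuming schemaRdd is the SchemaRDD in question: each Row behaves like a Seq in the 1.x API, so it can be rendered directly; Spark 1.2 and later also offer toJSON.

    val asDelimited = schemaRdd.map(row => row.mkString(","))
    val asJson      = schemaRdd.toJSON                        // one JSON string per row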

Re: How to run an action and get output?

2014-12-23 Thread Tobias Pfeiffer
Hi, On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab as...@live.com wrote: def doSomething(entry: SomeEntry, session: Session): SomeOutput = { val result = session.SomeOp(entry) SomeOutput(entry.Key, result.SomeProp) } I could use a transformation for rdd.map, but in case of failures,

Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi, I have the following code in my application: tmpRdd.foreach(item => { println("abc: " + item) }) tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) In the output, I see only the abc prints

Re: Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi again, On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) Uh, with iter.foreach(...) it works... the reason being apparently
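
A minimal sketch of the fix: Iterator.map is lazy and its result is discarded inside foreachPartition, so nothing runs; consuming the iterator with foreach makes the side effect happen.

    tmpRdd.foreachPartition { iter =>
      iter.foreach(item => println("xyz: " + item))
    }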

Re: spark streaming kafa best practices ?

2014-12-17 Thread Tobias Pfeiffer
Hi, On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell pwend...@gmail.com wrote: On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote: I was wondering why one would choose for rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. Personally, I like to

Re: Spark SQL DSL for joins?

2014-12-16 Thread Tobias Pfeiffer
Jerry, On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj jerry@gmail.com wrote: Another problem with the DSL: t1.where('term == "dmin").count() returns zero. Looks like you need ===: https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD Tobias
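
A minimal sketch of the corrected query, assuming the DSL implicits are in scope (e.g. via import sqlContext._): '== compares the Symbol itself and therefore matches nothing, while === builds the intended column-equality expression.

    val matching = t1.where('term === "dmin").count()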

Re: Running spark-submit from a remote machine using a YARN application

2014-12-14 Thread Tobias Pfeiffer
Hi, On Fri, Dec 12, 2014 at 7:01 AM, ryaminal tacmot...@gmail.com wrote: Now our solution is to make a very simple YARN application which executes as its command spark-submit --master yarn-cluster s3n://application/jar.jar This seemed so simple and elegant, but it has some weird

Re: Adding a column to a SchemaRDD

2014-12-14 Thread Tobias Pfeiffer
Nathan, On Fri, Dec 12, 2014 at 3:11 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I can see how to do it if can express the added values in SQL - just run SELECT *,valueCalculation AS newColumnName FROM table I've been searching all over for how to do this if my added value is a

Count-based windows

2014-12-08 Thread Tobias Pfeiffer
Hi, I am interested in building an application that uses sliding windows not based on the time when the item was received, but on either * a timestamp embedded in the data, or * a count (like: every 10 items, look at the last 100 items). Also, I want to do this on stream data received from

Re: spark-submit on YARN is slow

2014-12-08 Thread Tobias Pfeiffer
Hi, On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Can you try using the YARN Fair Scheduler and set yarn.scheduler.fair.continuous-scheduling-enabled to true? I'm using Cloudera 5.2.0 and my configuration says yarn.resourcemanager.scheduler.class =

Re: spark-submit on YARN is slow

2014-12-07 Thread Tobias Pfeiffer
Hi, thanks for your responses! On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza sandy.r...@cloudera.com wrote: What version are you using? In some recent versions, we had a couple of large hardcoded sleeps on the Spark side. I am using Spark 1.1.1. As Andrew mentioned, I guess most of the 10

Re: Market Basket Analysis

2014-12-04 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com wrote: I'd like to do market basket analysis using spark, what're my options? To do it or not to do it ;-) Seriously, could you elaborate a bit on what you want to know? Tobias

Re: Stateful mapPartitions

2014-12-04 Thread Tobias Pfeiffer
Hi, On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote: Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open? If you're using Scala, you can use a singleton object, this will
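
A minimal sketch of the singleton idea, with hypothetical createConnection()/query(): a Scala object is initialized lazily, at most once per executor JVM, so the connection survives across partitions and jobs (it is never explicitly closed, which is the price of this approach).

    object ConnectionHolder {
      lazy val conn = createConnection()        // runs once per executor JVM, on first use
    }

    val results = rdd.mapPartitions { iter =>
      val conn = ConnectionHolder.conn
      iter.map(item => query(conn, item))
    }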

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: Is it a limitation that spark does not support more than one case class at a time. What do you mean? I do not have the slightest idea what you *could* possibly mean by to support a case class. Tobias

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have created objectfiles [person_obj,office_obj] from csv[person_csv,office_csv] files using case classes[person,office] with API (saveAsObjectFile) Now I restarted spark-shell and load

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have done so thats why spark is able to load objectfile [e.g. person_obj] and spark has maintained serialVersionUID [person_obj]. Next time when I am trying to load another objectfile [e.g.

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: 1. Copy csv files in current directory. 2. Open spark-shell from this directory. 3. Run one_scala file which will create object-files from csv-files in current directory. 4. Restart spark-shell

Does count() evaluate all mapped functions?

2014-12-03 Thread Tobias Pfeiffer
Hi, I have an RDD and a function that should be called on every item in this RDD once (say it updates an external database). So far, I used rdd.map(myFunction).count() or rdd.mapPartitions(iter => iter.map(myFunction)) but I am wondering if this always triggers the call of myFunction in both
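
A minimal sketch of the side-effect-only alternative: foreach is an action whose entire purpose is the per-item call, so there is no question of it being skipped; the mapPartitions variant, by contrast, only runs myFunction if the returned iterator is actually consumed.

    rdd.foreach(item => myFunction(item))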

Re: Spark SQL UDF returning a list?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj jerry@gmail.com wrote: Exception in thread main java.lang.RuntimeException: [1.57] failure: ``('' expected but identifier myudf found I also tried returning a List of Ints, that did not work either. Is there a way to write a UDF that returns

Re: textFileStream() issue?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain bahub...@gmail.com wrote: I am trying to use textFileStream(some_hdfs_location) to pick new files from a HDFS location.I am seeing a pretty strange behavior though. textFileStream() is not detecting new files when I move them from a location

Re: Best way to have some singleton per worker

2014-12-03 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab as...@live.com wrote: I've been doing this with foreachPartition (i.e. have the parameters for creating the singleton outside the loop, do a foreachPartition, create the instance, loop over entries in the partition, close the partition), but
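
A minimal sketch of that foreachPartition pattern, with a hypothetical ExpensiveClient and host/port values: setup and teardown happen once per partition instead of once per record.

    rdd.foreachPartition { iter =>
      val client = new ExpensiveClient(host, port)   // created once per partition
      iter.foreach(item => client.send(item))
      client.close()
    }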

spark-submit on YARN is slow

2014-12-03 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application to YARN in yarn-cluster mode. I have both the Spark assembly jar file as well as my application jar file put in HDFS and can see from the logging output that both files are used from there. However, it still takes about 10 seconds for my

Re: netty on classpath when using spark-submit

2014-12-03 Thread Tobias Pfeiffer
Markus, On Tue, Nov 11, 2014 at 10:40 AM, M. Dale medal...@yahoo.com wrote: I never tried to use this property. I was hoping someone else would jump in. When I saw your original question I remembered that Hadoop has something similar. So I searched and found the link below. A quick JIRA

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Tobias Pfeiffer
Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up last week) and spark.executor.extraJavaOptions at http://spark.apache.org/docs/latest/configuration.html#runtime-environment. Tobias

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J joshjd...@gmail.com wrote: I would like to setup a Kafka pipeline whereby I write my data to a single topic 1, then I continue to process using spark streaming and write the transformed results to topic2, and finally I read the results from topic

Re: Determine number of running executors

2014-11-25 Thread Tobias Pfeiffer
Hi, Thanks for your help! Sandy, I had a bit of trouble finding the spark.executor.cores property. (It wasn't there although its value should have been 2.) I ended up throwing regular expressions on scala.util.Properties.propOrElse("sun.java.command", ""), which worked surprisingly well ;-) Thanks

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Tobias Pfeiffer
Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning

Re: Setup Remote HDFS for Spark

2014-11-24 Thread Tobias Pfeiffer
Hi, On Sat, Nov 22, 2014 at 12:13 AM, EH eas...@gmail.com wrote: Unfortunately whether it is possible to have both Spark and HDFS running on the same machine is not under our control. :( Right now we have Spark and HDFS running in different machines. In this case, is it still possible to

spark-submit and logging

2014-11-20 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application jar to a YARN cluster. I want to deliver a single jar file to my users, so I would like to avoid to tell them also, please put that log4j.xml file somewhere and add that path to the spark-submit command. I thought it would be sufficient that

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Tobias Pfeiffer
Hi, it looks like what you are trying to use as a Tuple cannot be inferred to be a Tuple by the compiler. Try adding type declarations and maybe you will see where things fail. Tobias

Re: Spark Streaming with Kafka is failing with Error

2014-11-18 Thread Tobias Pfeiffer
Hi, do you have some logging backend (log4j, logback) on your classpath? This seems a bit like there is no particular implementation of the abstract `log()` method available. Tobias

Re: Parsing a large XML file using Spark

2014-11-18 Thread Tobias Pfeiffer
Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution. One issue with those XML files is that they cannot be processed line by line in parallel; plus you inherently need shared/global state to parse XML or check for well-formedness, I think. (Same issue with
