Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Deep Pradhan
When I add parts(0).collect().foreach(println) and parts(1).collect().foreach(println) to print parts, I get the following error: not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit bf: scala.collection.generic.CanBuildFrom[String,B,That])That. Unspecified value

How to join two PairRDD together?

2014-08-25 Thread Gefei Li
Hello everyone, I am porting a clustering algorithm to the Spark platform, and I have hit a problem that has confused me for a long time. Can someone help me? I have a PairRDD[Integer, Integer] named patternRDD, in which the key represents a number and the value stores information about the key. And I
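
A minimal sketch of a pair-RDD join, assuming spark-shell (so sc is defined) and hypothetical contents for the two RDDs:

    import org.apache.spark.SparkContext._  // pair-RDD functions (Spark 1.x)

    val patternRDD = sc.parallelize(Seq((1, 10), (2, 20)))
    val otherRDD = sc.parallelize(Seq((1, "a"), (2, "b")))
    // join matches pairs by key, yielding (key, (leftValue, rightValue))
    val joined = patternRDD.join(otherRDD)
    joined.collect().foreach(println)  // (1,(10,a)) and (2,(20,b))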

many fetch failure in BlockManager

2014-08-25 Thread 余根茂
HI ALL: My job is CPU intensive, and its resource configuration is 400 workers * 1 core * 3G. There are many fetch failures, like: 14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave1:33500) 14-08-23 08:34:52 INFO

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-25 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. \cc my school email; please include bamos_cmu.edu for further discussion. Hi Deb, Debasish Das wrote: Looks very cool... will try it out for ad-hoc analysis of our datasets and provide more feedback... Could you please give a bit

spark and matlab

2014-08-25 Thread Jaonary Rabarisoa
Hi all, Has anyone tried to pipe an RDD into a MATLAB script? I'm trying to do something similar, and would appreciate any hints. Best regards, Jao

Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Sean Owen
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println) and parts(1).collect().foreach(println) to print parts, I get the following error: not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit
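
The error message itself hints at the cause: this collect is resolving to Scala's StringOps.collect, which takes a PartialFunction[Char,B], so parts(0) must be a String rather than an RDD. A hedged illustration (the original program is not shown in this excerpt):

    val parts = "a b c".split(" ")  // Array[String]
    // parts(0) is a String, so parts(0).collect() resolves to
    // StringOps.collect(pf: PartialFunction[Char, B]) -- hence the error
    println(parts(0))               // a plain String can be printed directly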

apply at Option.scala:120

2014-08-25 Thread Wang, Jensen
Hi, all. When I run Spark applications, I see from the web UI that some stage descriptions are like "apply at Option.scala:120". Why does Spark label a stage with a line that is not in my Spark program but in a Scala library? Thanks, Jensen

StorageLevel error.

2014-08-25 Thread rapelly kartheek
Hi, Can someone help me with the following error:

    scala> val rdd = sc.parallelize(Array(1,2,3,4))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
    scala> rdd.persist(StorageLevel.MEMORY_ONLY)
    <console>:15: error: not found: value StorageLevel

Re: StorageLevel error.

2014-08-25 Thread taoist...@gmail.com
You need to import StorageLevel: import org.apache.spark.storage.StorageLevel. taoist...@gmail.com From: rapelly kartheek Date: 2014-08-25 18:22 To: user Subject: StorageLevel error. Hi, Can someone help me with the following error: scala> val rdd = sc.parallelize(Array(1,2,3,4)) rdd:
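
A minimal sketch of the fixed session, assuming a standard spark-shell where sc is already defined:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(Array(1, 2, 3, 4))
    rdd.persist(StorageLevel.MEMORY_ONLY)  // resolves now that StorageLevel is imported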

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread praveshjain1991
Hi, Thanks for your help the other day. I had one more question regarding the same. If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be unknown
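
A hedged sketch of that pattern, assuming Spark 1.0-era Spark SQL and a hypothetical DStream[String] named stream:

    import org.apache.spark.sql.SQLContext

    case class Record(word: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD

    stream.foreachRDD { rdd =>
      // both the registration and the query live inside foreachRDD
      rdd.map(Record(_)).registerAsTable("records")
      val counts = sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word")
      counts.collect().foreach(println)
    }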

Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread BertrandR
Hi, I'm working on big graph analytics, and am currently implementing a mean field inference algorithm in GraphX/Spark. I start with an arbitrary graph and keep a (sparse) probability distribution at each node, implemented as a Map[Long,Double]. At each iteration, from the current estimates of the

Request for help in writing to Textfile

2014-08-25 Thread yh18190
Hi guys, I am currently playing with huge data. I have an RDD which returns RDD[List[(tuples)]]. I need only the tuples to be written to the text file output using the saveAsTextFile function. Example: val mod = modify.saveAsTextFile() returns

Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet
Hello all, Could someone help me with manipulating CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string.

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann
On Thu, Aug 21, 2014 at 6:21 PM, pierred pie...@demartines.com wrote: So, what is the accepted wisdom in terms of IDE and development environment? I don't know what the accepted wisdom is. I've been getting by with the Scala IDE for Eclipse, though I am using the stable version - as you noted,

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Andrew Ash
Hi Patrick, For the spilling-within-one-key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or is there another ticket I should be following? Thanks! Andrew On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
Do you want to do this on one column or all numeric columns? On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, Could someone help me with the manipulation of csv file data. I have 'semicolon' separated csv data including doubles and strings. I

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
Hey Andrew, We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRAs for the major 1.2 issues at the beginning of September. - Patrick On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote: Hi Patrick, For the spilling-within-one-key work you mention

RE: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet
Hello Victor, I want to do it on multiple columns. I was able to do it on one column with Sean's help, using the code below.

    val matData = file.map(_.split(";"))
    val stats = matData.map(_(2).toDouble).stats()
    stats.mean
    stats.max

Thank you Vineet From: Victor Tso-Guillen

SPARK Hive Context UDF Class Not Found Exception,

2014-08-25 Thread S Malligarjunan
Hello All, I have added a jar from an S3 bucket to the classpath. I have tried the following options: 1. sc.addJar("s3n://mybucket/lib/myUDF.jar") 2. hiveContext.sparkContext.addJar("s3n://mybucket/lib/myUDF.jar") 3. ./bin/spark-shell --jars s3n://mybucket/lib/myUDF.jar I am getting a ClassNotFoundException when

How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
I was able to get JavaWordCount running with a local instance under IntelliJ. In order to do so I needed to use Maven to package my code and call String[] jars = { "/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" }; sparkConf.setJars(jars); After that the sample ran properly and

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Sean Owen
flatMap() is a transformation only. Calling it by itself does nothing, and it just describes the relationship between one RDD and another. You should see it swing into action if you invoke an action, like count(), on the words RDD. On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis
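
A minimal sketch of the distinction, with hypothetical names:

    val words = lines.flatMap(_.split(" "))  // transformation: nothing runs yet
    val n = words.count()                    // action: triggers the flatMap, so breakpoints fire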

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
That was not quite in English. My flatMap code is shown below. I know the code is called, since the answers are correct, but I would like to put a breakpoint in dropNonLetters to make sure that code works properly. I am running in the IntelliJ debugger but believe the code is executing on a

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread RodrigoB
Hi Dibyendu, My colleague has taken a look at the Spark Kafka consumer GitHub project you provided and started experimenting. We found that somehow when Spark has a failure after a data checkpoint, the expected re-computations corresponding to the metadata checkpoints are not recovered, so we lose

GraphX usecases

2014-08-25 Thread Sunita Arvind
Hi, I am exploring the GraphX library and trying to determine which use cases make the most sense for it. From what I initially thought, it looked like GraphX could be applied to data stored in RDBMSs, as Spark could translate the relational data into a graph representation. However, there seems to

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread Ankur Dave
At 2014-08-25 06:41:36 -0700, BertrandR bertrand.rondepierre...@gmail.com wrote: Unfortunately, this works well for extremely small graphs, but it becomes exponentially slow with the size of the graph and the number of iterations (doesn't finish 20 iterations with graphs having 48000 edges).

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind sunitarv...@gmail.com wrote: Does this: "We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
In general all PRs should be made against master. When necessary, we can back port them to the 1.1 branch as well. However, since we are in code-freeze for that branch, we'll only do that for major bug fixes at this point. On Thu, Aug 21, 2014 at 10:58 AM, Dmitriy Lyubimov dlie...@gmail.com

Read timeout while running a Job on data in S3

2014-08-25 Thread Arpan Ghosh
I am running a Spark job on ~124 GB of data in an S3 bucket. The job runs fine but occasionally throws the following exception during the first map stage, which involves reading and transforming the data from S3. Is there a config parameter I can set to increase this timeout limit? 14/08/23

Re: spark and matlab

2014-08-25 Thread Matei Zaharia
Have you tried the pipe() operator? It should work if you can launch your script from the command line. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa
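
A hedged sketch of the pipe() approach, assuming a hypothetical wrapper script run_matlab.sh that reads records on stdin and writes results to stdout:

    // each element is written as one line to the script's stdin;
    // each line the script prints becomes one element of the result RDD
    val input = sc.parallelize(Seq("1.0 2.0", "3.0 4.0"))
    val env = Map("MATLAB_HOME" -> "/opt/matlab")  // optional env vars, as Matei notes
    val out = input.pipe(Seq("./run_matlab.sh"), env)
    out.collect().foreach(println)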

Re: HiveContext ouput log file

2014-08-25 Thread Michael Armbrust
Just like with normal Spark Jobs, that command returns an RDD that contains the lineage for computing the answer but does not actually compute the answer. You'll need to run collect() on the RDD in order to get the result. On Mon, Aug 25, 2014 at 11:46 AM, S Malligarjunan
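
A minimal sketch of the point, assuming a 1.0-era HiveContext named hiveContext and an existing table src:

    val result = hiveContext.hql("SELECT key, value FROM src")  // lazy: just builds the lineage
    result.collect().foreach(println)                           // collect() actually runs the query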

Re: SPARK Hive Context UDF Class Not Found Exception,

2014-08-25 Thread Michael Armbrust
Which version of Spark SQL are you using? Several issues with custom Hive UDFs have been fixed in 1.1. On Mon, Aug 25, 2014 at 9:57 AM, S Malligarjunan smalligarju...@yahoo.com.invalid wrote: Hello All, I have added a jar from an S3 bucket to the classpath. I have tried the following options: 1.

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-25 Thread Michael Armbrust
In our case, the ROW has about 80 columns, which exceeds the case class limit. Starting with Spark 1.1 you'll be able to also use the applySchema API https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L126 .
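
A hedged sketch of applySchema, adapted from the Spark 1.1 SQL programming guide (hypothetical file and field names; a programmatic schema sidesteps the case class limit):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    // describe the columns programmatically instead of with a case class
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", StringType, nullable = true)))
    val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1)))
    val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
    peopleSchemaRDD.registerTempTable("people")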

Re: Spark SQL: Caching nested structures extremely slow

2014-08-25 Thread Michael Armbrust
One useful thing to do when you run into unexpected slowness is to run 'jstack' a few times on the driver and executors and see if there is any particular hotspot in the Spark SQL code. Also, it seems like a better option here might be to use the new applySchema API

Re: apply at Option.scala:120

2014-08-25 Thread Andrew Or
This should be fixed in the latest Spark. What branch are you running? 2014-08-25 1:32 GMT-07:00 Wang, Jensen jensen.w...@sap.com: Hi, All When I run spark applications, I see from the web-ui that some stage description are like “apply at Option.scala:120”. Why spark splits a

Re: Writeup on Spark SQL with GDELT

2014-08-25 Thread Michael Armbrust
Thanks for this very thorough write-up and for continuing to update it as you progress! As I said in the other thread it would be great to do a little profiling to see if we can get to the heart of the slowness with nested case classes (very little optimization has been done in this code path).

Re: Potential Thrift Server Bug on Spark SQL,perhaps with cache table?

2014-08-25 Thread Cheng Lian
Hi John, I tried to follow your description but failed to reproduce this issue. Would you mind providing some more details? Especially: - The exact Git commit hash of the snapshot version you were using. Mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba

Re: countByWindow save the count ?

2014-08-25 Thread Daniil Osipov
You could try to use foreachRDD on the result of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question. Though is there an example of where to save the output of countByWindow ? I
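
A hedged sketch of that suggestion, assuming a DStream named stream whose batch interval divides the window durations:

    import org.apache.spark.streaming.Seconds

    // countByWindow uses an incremental (inverse-reduce) window,
    // so ssc.checkpoint(...) must be set on the StreamingContext
    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"counts-${time.milliseconds}")  // hypothetical output path
    }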

Re: Merging two Spark SQL tables?

2014-08-25 Thread Michael Armbrust
"So I tried the above (why doesn't union or ++ have the same behavior btw?)" I don't think there is a good reason for this. I'd open a JIRA. "and it works, but is slow because the original RDDs are not cached and files must be read from disk. I also discovered you can recover the"

Request for Help

2014-08-25 Thread yh18190
Hi guys, I just want to know whether there is any way to determine which file is being handled by Spark from a group of files given as input inside a directory. Suppose I have 1000 files given as input; I want to determine which file is currently being handled by the Spark program, so that if any error

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
In general master should be a superset of what is in any of the release branches. In the particular case of Spark SQL master and branch-1.1 should be identical (though that will likely change once Patrick cuts the first RC). On Mon, Aug 25, 2014 at 12:50 PM, Dmitriy Lyubimov dlie...@gmail.com

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Sean Owen
PS from an offline exchange -- yes, more is being called here; the rest is the standard WordCount example. The trick was to make sure the task executes locally, and calling setMaster("local") on SparkConf in the example code does that. That seems to work fine in IntelliJ for debugging this. On Mon,
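
A minimal sketch of that setup, assuming the standard WordCount example:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WordCountDebug")
      .setMaster("local")  // tasks run in the driver JVM, so IDE breakpoints are hit
    val sc = new SparkContext(conf)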

Re: Storage Handlers in Spark SQL

2014-08-25 Thread Michael Armbrust
- dev list + user list You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server, and we'll update the official docs once the

Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nick Chammas
https://spark.apache.org/screencasts/1-first-steps-with-spark.html The embedded YouTube video shows up in Safari on OS X but not in Chrome. How come? Nick -- View this message in context:

RE: Hive From Spark

2014-08-25 Thread Andrew Lee
Hi Du, I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420 will be released in Spark 1.2.0 according to the current JIRA status. I'm tracking branch-1.1 instead of the master and haven't seen the

unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-25 Thread Du Li
Hi, I created an instance of LocalHiveContext and attempted to create a database. However, it failed with the message org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to

Does Spark Streaming count the number of windows processed?

2014-08-25 Thread jchen
Hi, Does anyone know whether Spark Streaming counts the number of windows processed? I am trying to keep a record of the results of processed windows and the corresponding timestamps, but I cannot find any related documents or examples. Thanks, -JC -- View this message in context:

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread bharatvenkat
I like this consumer for what it promises - better control over offsets and recovery from failures. If I understand this right, it still uses a single worker process to read from Kafka (one thread per partition) - is there a way to specify multiple worker processes (on different machines) to read

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com wrote: If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi again, On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer t...@preferred.jp wrote: On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com wrote: If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call

Re: Spark webUI - application details page

2014-08-25 Thread SK
Hi, I am able to access the Application details web page from the master UI page when I run Spark in standalone mode on my local machine. However, I am not able to access it when I run Spark on our private cluster. The Spark master runs on one of the nodes in the cluster. I am able to access the

RE: Request for help in writing to Textfile

2014-08-25 Thread Liu, Raymond
You can try to manipulate the string you want to output before saveAsTextFile, something like:

    modify.flatMap(x => x).map { x =>
      val s = x.toString
      s.subSequence(1, s.length - 1)
    }

There should be a more optimized way. Best Regards, Raymond Liu -Original Message- From: yh18190

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Matei Zaharia
It seems to be because you went there with https:// instead of http://. That said, we'll fix it so that it works on both protocols. Matei On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com) wrote: https://spark.apache.org/screencasts/1-first-steps-with-spark.html The

creating a subgraph with an edge predicate

2014-08-25 Thread dizzy5112
I'm currently creating a subgraph using the vertex predicate: subgraph(vpred = (vid, attr) => attr.split(",")(2) != "999") but wondering if a subgraph can be created using the edge predicate; if so, a sample would be great :) thanks Dave -- View this message in context:
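
A hedged sketch of the edge-predicate form, assuming String edge attributes shaped like the vertex ones (epred receives the full EdgeTriplet):

    // keep only edges whose third comma-separated attribute field is not 999
    val sub = graph.subgraph(epred = triplet => triplet.attr.split(",")(2) != "999")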

Re: How to join two PairRDD together?

2014-08-25 Thread Vida Ha
Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am transplanting a clustering algorithm to spark platform, and I meet a problem

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
Assuming the CSV is well-formed (every row has the same number of columns) and every column is a number, this is how you can do it. You can adjust so that you pick just the columns you want, of course, by mapping each row to a new Array that contains just the column values you want. Just be sure
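
Victor's code was truncated in this digest; a hedged sketch in the same spirit (hypothetical file name), computing every column's maximum with a single reduce:

    val rows = sc.textFile("test.csv").map(_.split(";").map(_.toDouble))
    // element-wise max across rows yields the max of every column at once
    val colMax = rows.reduce((a, b) => a.zip(b).map { case (x, y) => math.max(x, y) })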

Re: amp lab spark streaming twitter example

2014-08-25 Thread Forest D
Hi Jonathan, Thanks for the reply. I ran other exercises (movie recommendation and GraphX) on the same cluster and did not see these errors, so I think this might not be related to the memory setting. Thanks, Forest On Aug 24, 2014, at 10:27 AM, Jonathan Haddad j...@jonhaddad.com wrote:

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Michael Hausenblas
https://spark.apache.org/screencasts/1-first-steps-with-spark.html The embedded YouTube video shows up in Safari on OS X but not in Chrome. I'm using Chrome 36.0.1985.143 on MacOS 10.9.4 and it works like a charm for me. Cheers, Michael -- Michael Hausenblas Ireland,