Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Akhil Das
You need to run your app in local mode (i.e. master=local[2]) to debug it locally. If you are running it on a cluster, then you can use the remote debugging feature. For remote debugging, you need to pass th
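A minimal sketch of that local-mode setup (the app name is hypothetical); with everything in one JVM, IDE breakpoints inside RDD functions are hit:

    import org.apache.spark.{SparkConf, SparkContext}

    // local[2] runs the driver and two worker threads in a single JVM
    val conf = new SparkConf()
      .setAppName("DebugApp")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)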

Re: "Block input-* already exists on this machine; not re-adding it" warnings

2014-08-25 Thread Aniket Bhatnagar
Answering my own question, it seems that the warnings are expected as explained by TD @ http://apache-spark-user-list.1001560.n3.nabble.com/streaming-questions-td3281.html . Here is what he wrote: "Spark Streaming is designed to replicate the received data within the machines in a Spark cluster fo

Re: Spark webUI - application details page

2014-08-25 Thread Akhil Das
Have a look at the history server; it looks like you have enabled the history server on your local machine and not on the remote server. http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html Thanks Best Regards On Tue, Aug 26, 2014 at 7:01 AM, SK wrote: > Hi, > > I am able to access the App

Re: Only master is really busy at KMeans training

2014-08-25 Thread durin
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283 The error message is "ExecutorLostFailure (executor lost)". The program recovers by automatically repartitioning the whole dataset (126G), which takes very long and seems to only delay the inevitable

Re: amp lab spark streaming twitter example

2014-08-25 Thread Akhil Das
I think your *sparkUrl* points to an invalid cluster url. Just make sure you are giving the correct url (the one you see on top left in the master:8080 webUI). Thanks Best Regards On Tue, Aug 26, 2014 at 11:07 AM, Forest D wrote: > Hi Jonathan, > > Thanks for the reply. I ran other exercises (

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Michael Hausenblas
> https://spark.apache.org/screencasts/1-first-steps-with-spark.html > > The embedded YouTube video shows up in Safari on OS X but not in Chrome. I’m using Chrome 36.0.1985.143 on MacOS 10.9.4 and it works like a charm for me. Cheers, Michael -- Michael Hausenblas Ireland,

Re: amp lab spark streaming twitter example

2014-08-25 Thread Forest D
Hi Jonathan, Thanks for the reply. I ran other exercises (movie recommendation and GraphX) on the same cluster and did not see these errors. So I think this might not be related to the memory setting.. Thanks, Forest On Aug 24, 2014, at 10:27 AM, Jonathan Haddad wrote: > Could you be hittin

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
Assuming the CSV is well-formed (every row has the same number of columns) and every column is a number, this is how you can do it. You can adjust so that you pick just the columns you want, of course, by mapping each row to a new Array that contains just the column values you want. Just be sure th
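A minimal sketch of the projection step described above, assuming file holds the raw lines (the column indices are hypothetical):

    val matData = file.map(_.split(";"))
    // keep only columns 0 and 2 of each parsed row
    val projected = matData.map(row => Array(row(0), row(2)))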

Re: How to join two PairRDD together?

2014-08-25 Thread Vida Ha
Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li wrote: > Hello everyone, > I am transplanting a clustering algorithm to spark platform, and I > meet a problem confusing me for a long ti

creating a subgraph with an edge predicate

2014-08-25 Thread dizzy5112
I'm currently creating a subgraph using the vertex predicate: subgraph(vpred = (vid,attr) => attr.split(",")(2)!="999") but wondering if a subgraph can be created using the edge predicate; if so, a sample would be great :) thanks Dave -- View this message in context: http://apache-spark-user-li
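A minimal sketch using the edge predicate parameter (epred), assuming string edge attributes analogous to the vertex example above:

    // keep only edges whose attribute's third field is not "999"
    val sub = graph.subgraph(
      epred = triplet => triplet.attr.split(",")(2) != "999")

The epred function receives an EdgeTriplet, so it can also inspect triplet.srcAttr and triplet.dstAttr to filter on the endpoints.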

Pair RDD

2014-08-25 Thread Deep Pradhan
Hi, I have an input file of a graph in the format . When I use sc.textFile, it will turn the entire text file into an RDD of lines. How can I transform the file into key, value pairs and then eventually into paired RDDs? Thank You
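A minimal sketch, assuming a hypothetical whitespace-separated "source destination" line format (the original format marker was stripped in the archive above):

    // parse each line into a (key, value) tuple, yielding a pair RDD
    val pairs = sc.textFile("graph.txt").map { line =>
      val fields = line.split("\\s+")
      (fields(0), fields(1))
    }

Once the elements are tuples, the pair-RDD operations (reduceByKey, join, etc.) become available via the implicit PairRDDFunctions.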

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nicholas Chammas
Yeah, I just picked the link up from a post somewhere on Stack Overflow. Dunno where the original poster got it from. On Mon, Aug 25, 2014 at 9:50 PM, Matei Zaharia wrote: > It seems to be because you went there with https:// instead of http://. > That said, we'll fix it so that it works on both

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Matei Zaharia
It seems to be because you went there with https:// instead of http://. That said, we'll fix it so that it works on both protocols. Matei On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com) wrote: https://spark.apache.org/screencasts/1-first-steps-with-spark.html The e

RE: Request for help in writing to Textfile

2014-08-25 Thread Liu, Raymond
You can try to manipulate the string you want to output before saveAsTextFile, something like:

    modify.flatMap(x => x).map { x =>
      val s = x.toString
      s.subSequence(1, s.length - 1)
    }

There should be a more optimized way. Best Regards, Raymond Liu -Original Message- From: yh18190 [mailto:yh

Re: Spark webUI - application details page

2014-08-25 Thread SK
Hi, I am able to access the Application details web page from the master UI page when I run Spark in standalone mode on my local machine. However, I am not able to access it when I run Spark on our private cluster. The Spark master runs on one of the nodes in the cluster. I am able to access the

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi again, On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer wrote: > > On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 < > praveshjain1...@gmail.com> wrote: >> >> "If you want to issue an SQL statement on streaming data, you must have >> both >> the registerAsTable() and the sql() call *within*

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 wrote: > > "If you want to issue an SQL statement on streaming data, you must have > both > the registerAsTable() and the sql() call *within* the foreachRDD(...) > block, > or -- as you experienced -- the table name will be unknown" > > Since
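A minimal sketch of the pattern from the quoted advice, against the Spark 1.0-era API (stream, sc, and the Record case class are hypothetical):

    import org.apache.spark.sql.SQLContext

    case class Record(word: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    stream.foreachRDD { rdd =>
      // both the registration and the query live inside foreachRDD
      rdd.map(Record(_)).registerAsTable("records")
      sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word")
        .collect().foreach(println)
    }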

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread bharatvenkat
I like this consumer for what it promises - better control over offset and recovery from failures. If I understand this right, it still uses single worker process to read from Kafka (one thread per partition) - is there a way to specify multiple worker processes (on different machines) to read fro

Does Spark Streaming count the number of windows processed?

2014-08-25 Thread jchen
Hi, Does anyone know whether Spark Streaming counts the number of windows processed? I am trying to keep a record of the results of processed windows and the corresponding timestamps. But I cannot find any related documents or examples. Thanks, -JC -- View this message in context: http://apache-spa

Re: error from DecisonTree Training:

2014-08-25 Thread Joseph Bradley
Following up, this bug with using DecisionTree with Java has been fixed, and this update is in the current release candidate for 1.1. It also includes some more Java-friendly constructors trainClassifier() and trainRegressor(). Joseph On Mon, Jul 21, 2014 at 4:41 PM, Jack Yang wrote: > That is
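A rough sketch of the 1.1-style trainClassifier call (the argument list is from memory, so double-check it against the release docs):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def train(trainingData: RDD[LabeledPoint]) =
      DecisionTree.trainClassifier(
        trainingData,
        numClasses = 2,
        categoricalFeaturesInfo = Map[Int, Int](), // empty: all features continuous
        impurity = "gini",
        maxDepth = 5,
        maxBins = 32)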

unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-25 Thread Du Li
Hi, I created an instance of LocalHiveContext and attempted to create a database. However, it failed with message "org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to i

RE: Hive From Spark

2014-08-25 Thread Andrew Lee
Hi Du, I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420 will be released in Spark 1.2.0 according to the current JIRA status. I'm tracking branch-1.1 instead of the master and haven't seen the re

Re: Hive From Spark

2014-08-25 Thread Du Li
Never mind. I have resolved this issue by moving the local guava dependency forward. Du On 8/22/14, 5:08 PM, "Du Li" wrote: >I thought the fix had been pushed to the apache master, ref. commit >"[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my >previous email was based o

Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nick Chammas
https://spark.apache.org/screencasts/1-first-steps-with-spark.html The embedded YouTube video shows up in Safari on OS X but not in Chrome. How come? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Screencast-doesn-t-show-in-Chrome-on-OS-X-tp1

Re: Storage Handlers in Spark SQL

2014-08-25 Thread Michael Armbrust
- dev list + user list You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo, and we'll update the official docs once the r

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Sean Owen
PS from an offline exchange -- yes more is being called here, the rest is the standard WordCount example. The trick was to make sure the task executes locally, and calling setMaster("local") on SparkConf in the example code does that. That seems to work fine in IntelliJ for debugging this. On Mon

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
In general master should be a superset of what is in any of the release branches. In the particular case of Spark SQL master and branch-1.1 should be identical (though that will likely change once Patrick cuts the first RC). On Mon, Aug 25, 2014 at 12:50 PM, Dmitriy Lyubimov wrote: > Ok, I was

Request for Help

2014-08-25 Thread yh18190
Hi Guys, I just want to know whether there is any way to determine which file is being handled by Spark from a group of files input inside a directory. Suppose I have 1000 files which are given as input; I want to determine which file is being handled currently by the spark program so that if any error

Re: Spark QL and protobuf schema

2014-08-25 Thread Dmitriy Lyubimov
Ok, I was just asking whether the changes you've mentioned are likely to be found on the 1.1 branch, so it would make sense for my starting point to fork off 1.1. Or perhaps master. The question of a PR is fairly far off at this point, for legal reasons if nothing else. If and by the time the work is approv

Re: Merging two Spark SQL tables?

2014-08-25 Thread Michael Armbrust
> > So I tried the above (why doesn't union or ++ have the same behavior > btw?) I don't think there is a good reason for this. I'd open a JIRA. > and it works, but is slow because the original RDDs are not > cached and files must be read from disk. > > I also discovered you can recover the In

Re: countByWindow save the count ?

2014-08-25 Thread Daniil Osipov
You could try to use foreachRDD on the result of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J wrote: > Hi, > > Hopefully a simple question. Though is there an example of where to save > the output of countByWindow ? I would like to save
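A minimal sketch, assuming a DStream named stream, a 30-second window sliding every 10 seconds, and a hypothetical output path:

    import org.apache.spark.streaming.Seconds

    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { (rdd, time) =>
      // one small RDD per window; tag the output directory with the batch time
      rdd.saveAsTextFile(s"counts-${time.milliseconds}")
    }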

Re: Potential Thrift Server Bug on Spark SQL,perhaps with cache table?

2014-08-25 Thread Cheng Lian
Hi John, I tried to follow your description but failed to reproduce this issue. Would you mind providing some more details? Especially: - Exact Git commit hash of the snapshot version you were using Mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba

Re: Writeup on Spark SQL with GDELT

2014-08-25 Thread Michael Armbrust
Thanks for this very thorough write-up and for continuing to update it as you progress! As I said in the other thread it would be great to do a little profiling to see if we can get to the heart of the slowness with nested case classes (very little optimization has been done in this code path). I

Re: apply at Option.scala:120

2014-08-25 Thread Andrew Or
This should be fixed in the latest Spark. What branch are you running? 2014-08-25 1:32 GMT-07:00 Wang, Jensen : > Hi, All > >When I run spark applications, I see from the web-ui that some > stage description are like “apply at Option.scala:120”. > > Why spark splits a stage on a line t

Re: Spark SQL: Caching nested structures extremely slow

2014-08-25 Thread Michael Armbrust
One useful thing to do when you run into unexpected slowness is to run 'jstack' a few times on the driver and executors and see if there is any particular hotspot in the Spark SQL code. Also, it seems like a better option here might be to use the new applySchema API

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-25 Thread Michael Armbrust
> > In our case, the ROW has about 80 columns which exceeds the case class > limit. > Starting with Spark 1.1 you'll be able to also use the applySchema API.
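A minimal sketch of that 1.1-era applySchema pattern for rows wider than the 22-field case-class limit, assuming an RDD[Array[String]] named rawRDD (names and types hypothetical):

    import org.apache.spark.sql._

    // build an explicit schema instead of a case class
    val schema = StructType(
      (1 to 80).map(i => StructField(s"c$i", StringType, nullable = true)))
    val rowRDD = rawRDD.map(arr => Row(arr: _*))
    val wide = sqlContext.applySchema(rowRDD, schema)
    wide.registerTempTable("wide_table")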

Re: SPARK Hive Context UDF Class Not Found Exception,

2014-08-25 Thread Michael Armbrust
Which version of Spark SQL are you using? Several issues with custom hive UDFs have been fixed in 1.1. On Mon, Aug 25, 2014 at 9:57 AM, S Malligarjunan < smalligarju...@yahoo.com.invalid> wrote: > Hello All, > > I have added a jar from S3 instance into classpath, i have tried following > option

Re: HiveContext ouput log file

2014-08-25 Thread Michael Armbrust
Just like with normal Spark Jobs, that command returns an RDD that contains the lineage for computing the answer but does not actually compute the answer. You'll need to run collect() on the RDD in order to get the result. On Mon, Aug 25, 2014 at 11:46 AM, S Malligarjunan < smalligarju...@yahoo.
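A minimal sketch, reusing the query from the original post (myUDF must already be registered):

    // hql returns a SchemaRDD lazily; collect() actually runs the job
    val result = hiveContext.hql(
      "select count(t1.col1) from t1 join t2 where myUDF(t1.id, t2.id) = true")
    result.collect().foreach(println)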

Re: spark and matlab

2014-08-25 Thread Matei Zaharia
Have you tried the pipe() operator? It should work if you can launch your script from the command line. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa (jaon...@gmail
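A minimal sketch, with a hypothetical wrapper script and environment variable:

    // each RDD element is written to the script's stdin;
    // each line of the script's stdout becomes an element of piped
    val piped = rdd.pipe(
      Seq("/path/to/run_matlab.sh"),
      Map("MATLAB_HOME" -> "/opt/matlab"))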

Read timeout while running a Job on data in S3

2014-08-25 Thread Arpan Ghosh
I am running a spark job on ~124 GB of data in an S3 bucket. The job runs fine but occasionally returns the following exception during the first map stage, which involves reading and transforming the data from S3. Is there a config parameter I can set to increase this timeout limit? 14/08/23 04:45

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
In general all PRs should be made against master. When necessary, we can back port them to the 1.1 branch as well. However, since we are in code-freeze for that branch, we'll only do that for major bug fixes at this point. On Thu, Aug 21, 2014 at 10:58 AM, Dmitriy Lyubimov wrote: > ok i'll tr

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind wrote: > Does this "We introduce GraphX, which combines the advantages of both > data-parallel and graph-parallel systems by efficiently expressing graph > computation within the Spark data-parallel framework. We leverage new ideas > in distributed graph

Re: GraphX usecases

2014-08-25 Thread Sunita Arvind
Thanks for the clarification, Ankur. Appreciate it. Regards Sunita On Monday, August 25, 2014, Ankur Dave wrote: > At 2014-08-25 11:23:37 -0700, Sunita Arvind > wrote: > > Does this "We introduce GraphX, which combines the advantages of both > > data-parallel and graph-parallel systems by effici

HiveContext ouput log file

2014-08-25 Thread S Malligarjunan
Hello All, I have executed the following UDF query in my spark hivecontext: hiveContext.hql("select count(t1.col1) from t1 join t2 where myUDF(t1.id, t2.id) = true") Where do I find the count output? Thanks and Regards, Sankar S.

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread Ankur Dave
At 2014-08-25 06:41:36 -0700, BertrandR wrote: > Unfortunately, this works well for extremely small graphs, but it becomes > exponentially slow with the size of the graph and the number of iterations > (doesn't finish 20 iterations with graphs having 48000 edges). > [...] > It seems to me that a

GraphX usecases

2014-08-25 Thread Sunita Arvind
Hi, I am exploring GraphX library and trying to determine which usecases make most sense for/with it. From what I initially thought, it looked like GraphX could be applied to data stored in RDBMSs as Spark could translate the relational data into graphical representation. However, there seems to b

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread RodrigoB
Hi Dibyendu, My colleague has taken a look at the spark kafka consumer github you have provided and started experimenting. We found that somehow when Spark has a failure after a data checkpoint, the expected re-computations corresponding to the metadata checkpoints are not recovered, so we lose K

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
That was not quite in English. My flatMap code is shown below. I know the code is called since the answers are correct, but I would like to put a breakpoint in dropNonLetters to make sure that code works properly. I am running in the IntelliJ debugger but believe the code is executing on a Spar

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Sean Owen
flatMap() is a transformation only. Calling it by itself does nothing, and it just describes the relationship between one RDD and another. You should see it swing into action if you invoke an action, like count(), on the words RDD. On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis wrote: > I was able
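A minimal sketch of the point above, assuming the lines/words RDDs from the word-count example in this thread:

    val words = lines.flatMap(line => line.split(" "))  // lazy: nothing executes yet
    val n = words.count()  // an action: the flatMap now runs and breakpoints are hit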

How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
I was able to get JavaWordCount running with a local instance under IntelliJ. In order to do so I needed to use maven to package my code and call String[] jars = { "/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" }; sparkConf.setJars(jars); After that the sample ran properly and

SPARK Hive Context UDF Class Not Found Exception,

2014-08-25 Thread S Malligarjunan
Hello All, I have added a jar from an S3 instance into the classpath. I have tried the following options:
1. sc.addJar("s3n://mybucket/lib/myUDF.jar")
2. hiveContext.sparkContext.addJar("s3n://mybucket/lib/myUDF.jar")
3. ./bin/spark-shell --jars s3n://mybucket/lib/myUDF.jar
I am getting ClassNotFoundException wh

RE: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet
Hello Victor, I want to do it on multiple columns. I was able to do it on one column with the help of Sean, using the code below.

    val matData = file.map(_.split(";"))
    val stats = matData.map(_(2).toDouble).stats()
    stats.mean
    stats.max

Thank you Vineet From: Victor Tso-Guillen [mailto:v...@paxata.
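A minimal sketch extending the same approach to several columns, assuming each chosen column parses as Double (the indices are hypothetical):

    val cols = Seq(1, 2, 3)
    val statsByCol = cols.map(i => i -> matData.map(_(i).toDouble).stats())
    statsByCol.foreach { case (i, st) =>
      println(s"col $i: mean=${st.mean}, max=${st.max}")
    }

Note this makes one pass over the data per column; for many columns, a single aggregation over whole rows would be cheaper.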

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
Hey Andrew, We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRAs for the major 1.2 issues at the beginning of September. - Patrick On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash wrote: > Hi Patrick, > > For the spilling within on key work you mention might land in Spa

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
Do you want to do this on one column or all numeric columns? On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet wrote: > Hello all, > > Could someone help me with the manipulation of csv file data. I have > 'semicolon' separated csv data including doubles and strings. I want to > calculate the

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Andrew Ash
Hi Patrick, For the spilling within on key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or is there another ticket I should be following? Thanks! Andrew On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell wrote: > Hi Jens, > >

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann
On Thu, Aug 21, 2014 at 6:21 PM, pierred wrote: > So, what is the accepted wisdom in terms of IDE and development environment? I don't know what the accepted wisdom is. I've been getting by with the Scala IDE for Eclipse, though I am using the stable version - as you noted, this keeps me from up

Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet
Hello all, Could someone help me with the manipulation of csv file data. I have 'semicolon' separated csv data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Cou

Request for help in writing to Textfile

2014-08-25 Thread yh18190
Hi Guys, I am currently playing with huge data. I have an RDD which returns RDD[List[(tuples)]]. I need only the tuples to be written to a textfile output using the saveAsTextFile function. example: val mod = modify.saveAsTextFile() returns List((20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.

Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread BertrandR
Hi, I'm working on big graph analytics, and currently implementing a mean field inference algorithm in GraphX/Spark. I start with an arbitrary graph, keep a (sparse) probability distribution at each node implemented as a Map[Long,Double]. At each iteration, from the current estimates of the distri

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread praveshjain1991
Hi, Thanks for your help the other day. I had one more question regarding the same. "If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be unknown"

Re: StorageLevel error.

2014-08-25 Thread taoist...@gmail.com
you need to import StorageLevel first: import org.apache.spark.storage.StorageLevel taoist...@gmail.com From: rapelly kartheek Date: 2014-08-25 18:22 To: user Subject: StorageLevel error. Hi, Can someone help me with the following error: scala> val rdd = sc.parallelize(Array(1,2,3,4)) rdd: org
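A minimal sketch of the fix in the shell:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(Array(1, 2, 3, 4))
    rdd.persist(StorageLevel.MEMORY_ONLY)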

StorageLevel error.

2014-08-25 Thread rapelly kartheek
Hi, Can someone help me with the following error:

    scala> val rdd = sc.parallelize(Array(1,2,3,4))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

    scala> rdd.persist(StorageLevel.MEMORY_ONLY)
    <console>:15: error: not found: value StorageLevel
           rdd.persist(S

apply at Option.scala:120

2014-08-25 Thread Wang, Jensen
Hi, All When I run spark applications, I see from the web-ui that some stage descriptions are like "apply at Option.scala:120". Why does spark split a stage on a line that is not in my spark program but in a Scala library? Thanks Jensen

Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Sean Owen
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan wrote: > When I add > > parts(0).collect().foreach(println) > > parts(1).collect().foreach(println), for printing parts, I get the following > error > > not enough arguments for method collect: (pf: > PartialFunction[Char,B])(implicit > bf:scala.collec
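The truncated error hints that parts(0) is a String rather than an RDD: on a String, collect is Scala's collection method expecting a PartialFunction[Char, B], which matches the message. A minimal sketch of the likely fix, assuming parts comes from splitting a line:

    val parts = line.split("\\s+")  // Array[String] -- a plain local array, not an RDD
    println(parts(0))               // print elements directly; collect() is for RDDs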

spark and matlab

2014-08-25 Thread Jaonary Rabarisoa
Hi all, Has anyone tried to pipe an RDD into a MATLAB script? I'm trying to do something similar; it would help if one of you could share some hints. Best regards, Jao

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-25 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. \cc My school email. Please include bamos_cmu.edu for further discussion. Hi Deb, Debasish Das wrote > Looks very cool...will try it out for ad-hoc analysis of our datasets and > provide more feedback... > > Could you please give

many fetch failure in "BlockManager"

2014-08-25 Thread 余根茂
HI ALL: My job is CPU intensive, and its resource configuration is 400 workers * 1 core * 3G. There are many fetch failures, like:

    14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave1:33500)
    14-08-23 08:34:52 INFO [spark-