mapPartitionsWithIndex

2014-07-13 Thread Madhura
I have a text file consisting of a large number of random floating values separated by spaces. I am loading this file into an RDD in Scala. I have heard of mapPartitionsWithIndex but I haven't been able to implement it. For each partition I want to call a method (process in this case) to which I wan
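For reference, a minimal sketch of the pattern being asked about, assuming a hypothetical process(index, values) method and an input file of space-separated floats:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionProcessing {
      // Hypothetical per-partition method standing in for the poster's "process".
      def process(index: Int, values: Iterator[Double]): Iterator[Double] =
        values.map(_ * 2.0) // placeholder logic

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitions").setMaster("local[2]"))
        // Each line holds space-separated floats; split and convert to Double.
        val numbers = sc.textFile("numbers.txt").flatMap(_.split("\\s+")).map(_.toDouble)
        // mapPartitionsWithIndex hands each partition's iterator plus its index to the closure.
        val result = numbers.mapPartitionsWithIndex((idx, iter) => process(idx, iter))
        result.collect().foreach(println)
        sc.stop()
      }
    }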

Re: KMeans code is rubbish

2014-07-13 Thread Wanda Hawk
The problem is that I get the same results every time. On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar wrote: Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run kme
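As a rough illustration of that heuristic, MLlib's KMeans.train takes a runs argument so several random initializations are tried and the lowest-cost model is kept (a sketch, assuming sc is an existing SparkContext and a points.txt of space-separated coordinates):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    // k = 2, maxIterations = 20, runs = 10: multiple random starts reduce the
    // chance of settling on a poor local optimum.
    val model = KMeans.train(points, 2, 20, 10)
    println("Cost: " + model.computeCost(points))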

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Tobias I have been using "local[4]" to test. My problem is likely caused by the tcp host server that I am trying to emulate. I was trying to emulate the tcp host to send out messages. (although I am not sure at the moment :D) The first way I tried was to use a tcp tool called Hercules. Second w

Re: Powered By Spark: Can you please add our org?

2014-07-13 Thread Alex Gaudio
Awesome! Thanks, Reynold! On Tue, Jul 8, 2014 at 4:00 PM, Reynold Xin wrote: > I added you to the list. Cheers. > > > > On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio wrote: > >> Hi, >> >> Sailthru is also using Spark. Could you please add us to the Powered By >> Spark >>

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Are you sure the code running on the cluster has been updated? I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m assuming that’s taken care of, at least in theory. I just spun down the clusters I had up, but I will revisit this tomorrow and provide the information you requeste

Re: SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Interestingly if I don't cache the data it works. However, as I need to re-use the data to apply different kinds of filtering it really slows down the job as it needs to read from S3 again and again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-S3-

SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Hi I'm trying to read lzo compressed files from S3 using Spark. The lzo files are not indexed. The Spark job starts to read the files just fine but after a while it just hangs. No network throughput. I have to restart the worker process to get it back up. Any idea what could be causing this? We were u

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread Tobias Pfeiffer
Hi, I experienced exactly the same problems when using SparkContext with "local[1]" master specification, because in that case one thread is used for receiving data, the others for processing. As there is only one thread running, no processing will take place. Once you shut down the connection, th

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Michael Armbrust
Are you sure the code running on the cluster has been updated? We recently optimized the execution of LIKE queries that can be evaluated without using full regular expressions. So it's possible this error is due to missing functionality on the executors. > How can I trace this down for a bug rep

Catalyst dependency on Spark Core

2014-07-13 Thread Aniket Bhatnagar
As per the recent presentation given at Scala Days ( http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was mentioned that Catalyst is independent of Spark. But on inspecting the pom.xml of the sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for t

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Actually, this looks like it's some kind of regression in 1.0.1, perhaps related to assembly and packaging with spark-ec2. I don’t see this issue with the same data on a 1.0.0 EC2 cluster. How can I trace this down for a bug report? Nick On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas < nic

Re: Anaconda Spark AMI

2014-07-13 Thread Jeremy Freeman
Hi Ben, This is great! I just spun up an EC2 cluster and tested basic pyspark + ipython/numpy/scipy functionality, and all seems to be working so far. Will let you know if any issues arise. We do a lot with pyspark + scientific computing, and for EC2 usage I think this is a terrific way to ge

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1 yields java.lang.ExceptionInInitializerError. Nick ​ On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas wrote: > Is there a place where we can find an up-to-date list of supported SQL > syntax in Spark SQL? > > Nick > > > -
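For anyone trying to reproduce this, a prefix LIKE query on 1.0.x looks roughly like the following (a sketch assuming a simple case-class table; names are illustrative):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Albert", 40), Person("Bob", 25)))
    people.registerAsTable("people")

    // Prefix LIKE of the kind discussed in this thread.
    sqlContext.sql("SELECT name FROM people WHERE name LIKE 'Al%'").collect().foreach(println)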

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Chester Chen
Ron, Which distribution and version of Hadoop are you using? I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0), MRJobConfig does have the field: java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH; Chester On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez wr

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Akhil Das Thanks. I tried the code, and it works. There's a problem with my socket code that is not flushing the content out, and for the test tool, Hercules, I have to close the socket connection to "flush" the content out. I am going to troubleshoot why nc works, and the code and test t

Re: not getting output from socket connection

2014-07-13 Thread Walrus theCat
Hah, thanks for tidying up the paper trail here, but I was the OP (and solver) of the recent "reduce" thread that ended in this solution. On Sun, Jul 13, 2014 at 4:26 PM, Michael Campbell < michael.campb...@gmail.com> wrote: > Make sure you use "local[n]" (where n > 1) in your context setup too,

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Nicholas Chammas
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez wrote: > I can easily fix this by changing this to YarnConfiguration instead of > MRJobConfig but was wondering what the steps are for submitting a fix. > Relevant links: - https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi, I was doing programmatic submission of Spark yarn jobs and I saw code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH") MRJobConfig doesn't have this field so the created launch env is incomplete. Workaround i
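A sketch of the kind of workaround Ron describes, doing the reflective lookup but falling back to YarnConfiguration's own constant when the field is absent (not the actual ClientBase code, just an illustration):

    import org.apache.hadoop.mapreduce.MRJobConfig
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import scala.util.Try

    // Try the reflective lookup first; fall back to YarnConfiguration when the
    // field does not exist in the Hadoop version on the classpath.
    val defaultClasspath: Array[String] =
      Try(classOf[MRJobConfig]
            .getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
            .get(null)
            .asInstanceOf[Array[String]])
        .getOrElse(YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH)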

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
Also, the reason spark-streaming-kafka is not included in the Spark assembly is that we do not want dependencies of external systems like Kafka (which itself probably has a complex dependency tree) to cause conflicts with core Spark's functionality and stability. TD On Sun, Jul 13, 2014 a

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay wrote: > You may try to
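For convenience, the usual shape of that deduplication logic in a build.sbt is roughly as follows (a sketch in the older sbt-assembly 0.11.x style; exact keys and imports vary across plugin versions):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop duplicate manifests/signatures
      case "reference.conf"              => MergeStrategy.concat   // merge Typesafe config fragments
      case _                             => MergeStrategy.first    // otherwise keep the first copy
    }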

Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
I actually never got this to work, which is part of the reason why I filed that JIRA. Apart from using --jar when starting the shell, I don’t have any more pointers for you. :( ​ On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski wrote: > Nicholas, > > Thanks! > > How do I make spark assemble a

Re: can't print DStream after reduce

2014-07-13 Thread Sean Owen
How about a PR that rejects a context configured for local or local[1]? As I understand it, it is not intended to work and has bitten several people. On Jul 14, 2014 12:24 AM, "Michael Campbell" wrote: > This almost had me not using Spark; I couldn't get any output. It is not > at all obvious what's

Re: not getting output from socket connection

2014-07-13 Thread Michael Campbell
Make sure you use "local[n]" (where n > 1) in your context setup too, (if you're running locally), or you won't get output. On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat wrote: > Thanks! > > I thought it would get "passed through" netcat, but given your email, I > was able to follow this tuto

Re: can't print DStream after reduce

2014-07-13 Thread Michael Campbell
This almost had me not using Spark; I couldn't get any output. It is not at all obvious what's going on here to the layman (and to the best of my knowledge, not documented anywhere), but now you know you'll be able to answer this question for the numerous people that will also have it. On Sun, J

Ideal core count within a single JVM

2014-07-13 Thread lokesh.gidra
Hello, What would be an ideal core count to run a spark job in local mode to get best utilization of CPU? Actually I have a 48-core machine but the performance of local[48] is poor as compared to local[10]. Lokesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Great success! I was able to get output to the driver console by changing the construction of the Streaming Spark Context from: val ssc = new StreamingContext("local" /**TODO change once a cluster is up **/, "AppName", Seconds(1)) to: val ssc = new StreamingContext("local[2]" /**TODO

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi Michael Changing my col name to something other than 'count' fixed the parse error. Many thanks, Andy From: Michael Armbrust Reply-To: Date: Sunday, July 13, 2014 at 1:18 PM To: Cc: "u...@spark.incubator.apache.org" Subject: Re: SparkSql newbie problems with nested selects > Hi A

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
More strange behavior: lines.foreachRDD(x => println(x.first)) // works lines.foreachRDD(x => println((x.count,x.first))) // no output is printed to driver console. On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat wrote: > > Thanks for your interest. > > lines.foreachRDD(x => println(x.count))

Task serialized size dependent on size of RDD?

2014-07-13 Thread Sébastien Rainville
Hi, I'm having trouble serializing tasks for this code: val rddC = (rddA join rddB) .map { case (x, (y, z)) => z -> y } .reduceByKey( { (y1, y2) => Semigroup.plus(y1, y2) }, 1000) Somehow when running on a small data set the size of the serialized task is about 650KB, which is very big, and

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Michael Armbrust
Hi Andy, The SQL parser is pretty basic (we plan to improve this for the 1.2 release). In this case I think part of the problem is that one of your variables is "count", which is a reserved word. Unfortunately, we don't have the ability to escape identifiers at this point. However, I did manage
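In practice the fix is to rename the offending column before registering the table; a rough Scala sketch (the original thread uses Python; field and file names here are illustrative):

    import org.apache.spark.sql.SQLContext

    // "count" renamed to freqCount so the parser never sees the reserved word.
    case class Frequency(docid: String, term: String, freqCount: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val freq = sc.textFile("frequency.tsv")
      .map(_.split("\t"))
      .map(r => Frequency(r(0), r(1), r(2).toInt))
    freq.registerAsTable("frequency")

    sqlContext.sql("SELECT docid, SUM(freqCount) FROM frequency GROUP BY docid")
      .collect().foreach(println)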

SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi I am running into trouble with a nested query using python. To try and debug it, I first wrote the query I want using sqlite3 select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count) from Frequency as freq, (select term, docid, count from Frequency) as freqT

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Thanks for your interest. lines.foreachRDD(x => println(x.count)) And I got 0 every once in a while (which I think is strange, because lines.print prints the input I'm giving it over the socket.) When I tried: lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count)) I got no count.

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Update on this: val lines = ssc.socketTextStream("localhost", ) lines.print // works lines.map(_->1).print // works lines.map(_->1).reduceByKey(_+_).print // nothing printed to driver console Just lots of: 14/07/13 11:37:40 INFO receiver.BlockGenerator: Pushed block input-0-1405276660400

Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
my yarn environment does have less memory for the executors. I am checking if the RDDs are cached by calling sc.getRDDStorageInfo, which shows an RDD as fully cached in memory, yet it does not show up in the UI. On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia wrote: > The UI code is the same in

Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, "Walrus theCat" wrote: > Hi, > > I have a DStream that works just fine when I say: > > dstream.print > > If I say: > > dstream.map(_,1).print > > that works, too. However, if I d

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Aaron Davidson
Ah -- I should have been more clear, list concatenation isn't going to be any faster. In many cases I've seen people use groupByKey() when they are really trying to do some sort of aggregation, and thus constructing this concatenated list is more expensive than they need. On Sun, Jul 13, 2014 at
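A small illustration of the difference being described, summing values per key (the reduceByKey version combines values within each partition before the shuffle):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Ships every value across the network, then builds a full list per key.
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)

    // Combines values map-side first, so far less data is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)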

Re: Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski
Nicholas, Thanks! How do I make spark assemble against a local version of Hadoop? I have 2.4.1 running on a test cluster and I did "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly" but all it did was pull in hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1 HDFS). I am gue

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list concatenation operations, and found that the performance becomes even worse. So groupByKey is not that bad in my code. Best regards, - Guanhua From: Aaron Davidson Reply-To: Date: Sat, 12 Jul 2014 16:32:22 -0700 To:

Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
If you’re still seeing gibberish, it’s because Spark is not using the LZO libraries properly. In your case, I believe you should be calling newAPIHadoopFile() instead of textFile(). For example: sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data", c
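The full call looks roughly like the following, assuming the hadoop-lzo jar (which provides com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    val lzoRdd = sc.newAPIHadoopFile(
      "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Values are the decompressed lines; keys are byte offsets into the file.
    lzoRdd.map { case (_, line) => line.toString }.take(5).foreach(println)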

Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski
Hello, I have been trying to play with the Google ngram dataset provided by Amazon in form of LZO compressed files. I am having trouble understanding what is going on ;). I have added the compression jar and native library to the underlying Hadoop/HDFS installation, restarted the name node a

Error in JavaKafkaWordCount.java example

2014-07-13 Thread Mahebub Sayyed
Hello, I am referring to the following example: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java I am getting the following Compilation Error: \example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag Please help

RE: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-13 Thread Peretz, Gil
Thank you, I managed to activate spark-shell and to use it by re-compiling and packaging Spark. SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly Regards. --- Gil Peretz , +972-54-5597107 From: Michael Armbrust [mailto:mich...@d

Re: Nested Query With Spark SQL(1.0.1)

2014-07-13 Thread anyweil
Or is it supported? I know I could do it myself with filter, but if SQL could support it, that would be much better, thx! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Query-With-Spark-SQL-1-0-1-tp9544p9547.html Sent from the Apache Spark User List mai

Re: Large Task Size?

2014-07-13 Thread Kyle Ellrott
It uses the standard SquaredL2Updater, and I also tried to broadcast it as well. The input is an RDD created by taking the union of several inputs that have all been run against MLUtils.kFold to produce even more RDDs. If I run with 10 different inputs, each with 10 kFolds, I'm pretty certain that
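For context, the usual broadcast pattern for keeping a large read-only object out of each serialized task looks like this (a generic sketch, not Kyle's actual code):

    // Broadcast once, then dereference .value inside closures so the object is
    // shipped to each executor a single time rather than with every task.
    val bigObject: Array[Double] = Array.fill(1000000)(0.0)  // stand-in for a large shared value
    val bc = sc.broadcast(bigObject)

    val scored = sc.parallelize(0 until 100).map(i => bc.value(i))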

can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map(_,1).print that works, too. However, if I do the following: dstream.reduce{case(x,y) => x}.print I don't get anything on my console. What's going on? Thanks