Re: SQLCtx cacheTable

2014-08-05 Thread Gurvinder Singh
On 08/04/2014 10:57 PM, Michael Armbrust wrote: If Mesos is allocating a container that is exactly the same size as the max heap, then that leaves no buffer space for non-heap JVM memory, which seems wrong to me. This can be a cause. I am now wondering how Mesos picks up the size and setup

Re: Can't see any thing one the storage panel of application UI

2014-08-05 Thread Akhil Das
You need to persist or cache those RDDs for them to appear in the Storage tab. Unless you do, those RDDs will be recomputed. Thanks Best Regards On Tue, Aug 5, 2014 at 8:03 AM, binbinbin915 binbinbin...@live.cn wrote: Actually, if you don’t use a method like persist or cache, it does not even store
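
A minimal sketch of the point above, assuming an existing SparkContext named sc and an illustrative HDFS path: an RDD only shows up under Storage once it is marked for caching and an action has materialized it.

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///data/logs")   // nothing is cached yet
    logs.persist(StorageLevel.MEMORY_ONLY)        // equivalent to logs.cache()
    logs.count()                                  // first action computes and stores the blocks,
                                                  // after which they appear on the Storage tab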

Re: java.lang.IllegalStateException: unread block data while running the sampe WordCount program from Eclipse

2014-08-05 Thread nightwolf
Did you ever find a solution to this problem? I'm having similar issues. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-while-running-the-sampe-WordCount-program-from-Ecle-tp8388p11412.html Sent from the Apache

Re: Spark Deployment Patterns - Automated Deployment Performance Testing

2014-08-05 Thread nightwolf
Thanks AL! That's what I thought. I've set up Nexus to maintain Spark libs and download them when needed, for development purposes. Suppose we have a dev cluster. Is it possible to run the driver program locally (on a developer's machine)? I.e. just run the driver from the IDE and have it

Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread rafeeq s
Hi, I am new to Apache Spark and trying to develop a Spark Streaming program to *stream data from Kafka topics and output it as Parquet files on HDFS*. Please share a *sample reference* program that streams data from Kafka topics and outputs Parquet files on HDFS. Thanks in advance. Regards,

Running driver/SparkContent locally

2014-08-05 Thread nightwolf
I'm trying to run a local driver (on a development machine) and have this driver communicate with the Spark master and workers; however, I'm having a few problems getting the driver to connect and run a simple job from within an IDE. It all looks like it works, but when I try to do something simple

Re: about spark and using machine learning model

2014-08-05 Thread Julien Naour
In the following presentation you can find a simple example of a clustering model used to classify new incoming tweets: https://www.youtube.com/watch?v=sPhyePwo7FA Regards, Julien 2014-08-05 7:08 GMT+02:00 Xiangrui Meng men...@gmail.com: Some extra work is needed to close the loop. One related

Re: Running driver/SparkContent locally

2014-08-05 Thread nightwolf
The code for this example is very simple: object SparkMain extends App with Serializable { val conf = new SparkConf(false) //.setAppName("cc-test") //.setMaster("spark://hadoop-001:7077") //.setSparkHome("/tmp") .set("spark.driver.host", "192.168.23.108") .set("spark.cores.max", "10")

Re: Spark streaming at-least once guarantee

2014-08-05 Thread lalit1303
Hi Sanjeet, I have been using Spark Streaming for processing files present in S3 and HDFS. I am also using SQS messages for the same purpose as yours, i.e. as pointers to S3 files. As of now, I have a separate SQS job which receives messages from the SQS queue and gets the corresponding file from S3. Now,

java.lang.StackOverflowError

2014-08-05 Thread Chengi Liu
Hi, I am doing some basic preprocessing in pyspark (local mode) as follows: files = [ input files] def read(filename, sc): # process file return rdd if __name__ == "__main__": conf = SparkConf() conf.setMaster('local') sc = SparkContext(conf=conf) sc.setCheckpointDir(root + "temp/")

Re: Bad Digest error while doing aws s3 put

2014-08-05 Thread lmk
Is it possible that the Content-MD5 changes during multipart upload to S3? But even then, it succeeds if I increase the cluster configuration. For example, it throws a Bad Digest error after writing 48/100 files when the cluster is 3 m3.2xlarge slaves; it throws a Bad Digest error after writing 64/100

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread Dibyendu Bhattacharya
You can try this Kafka Spark Consumer, which I recently wrote. It uses the Low Level Kafka Consumer: https://github.com/dibbhatt/kafka-spark-consumer Dibyendu On Tue, Aug 5, 2014 at 12:52 PM, rafeeq s rafeeq.ec...@gmail.com wrote: Hi, I am new to Apache Spark and trying to develop a Spark

Re: Low Level Kafka Consumer for Spark

2014-08-05 Thread Dibyendu Bhattacharya
Thanks Jonathan. Yes, until non-ZK-based offset management is available in Kafka, I need to maintain the offsets in ZK. And yes, in both cases an explicit commit is necessary. I modified the Low Level Kafka Spark Consumer a little bit so that the Receiver spawns threads for every partition of the topic and

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread rafeeq s
Thanks Dibyendu. 1. Spark itself has an API jar for Kafka; do we still require manual offset management (using the simple consumer concept) and a manual consumer? 2. Kafka Spark Consumer is implemented against Kafka 0.8.0; can we use it for Kafka 0.8.1? 3. How to use Kafka Spark Consumer to produce output

Running Hive UDF from spark-shell fails due to datatype issue

2014-08-05 Thread visakh
Hi, I'm running Hive 0.13.1 and the latest master branch of Spark (built with SPARK_HIVE=true). I'm trying to compute Jaccard similarity using the Hive UDF from Brickhouse (https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/sketch/SetSimilarityUDF.java). *Hive table

Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Jens Kristian Geyti
I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, few gigs in total, line based, 500-2000 chars per line). I'm running Spark on 8 low-memory machines in a yarn cluster, i.e. something along the lines of: spark-submit ... --master yarn-client --num-executors 8

Setting spark.executor.memory problem

2014-08-05 Thread Grzegorz Białek
Hi, I wanted to make a simple Spark app running in local mode with 2g of spark.executor.memory and 1g for caching. But the following code: val conf = new SparkConf() .setMaster("local") .setAppName("app") .set("spark.executor.memory", "2g") .set("spark.storage.memoryFraction", "0.5") val sc = new

RE: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread Shao, Saisai
Hi Rafeeq, I think the current Spark Streaming API offers you the ability to fetch data from Kafka and store it in another external store. If you do not care about managing consumer offsets manually, there’s no need to use a low-level API such as SimpleConsumer. For Kafka 0.8.1 compatibility, you
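
A minimal sketch of this high-level approach (not code from the thread), assuming Spark 1.0/1.1-style APIs; the ZooKeeper quorum, topic name, consumer group, and output path are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class Event(line: String)

    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaToParquet")
        val ssc = new StreamingContext(conf, Seconds(60))
        val sqlContext = new SQLContext(ssc.sparkContext)
        import sqlContext.createSchemaRDD   // implicit: RDD[Event] -> SchemaRDD

        // High-level (ZooKeeper-based) Kafka receiver: topic "events", one receiver thread
        val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "parquet-writer", Map("events" -> 1))

        // Write each micro-batch out as a separate Parquet directory on HDFS
        messages.map { case (_, value) => Event(value) }.foreachRDD { (rdd, time) =>
          rdd.saveAsParquetFile(s"hdfs:///data/events/batch-${time.milliseconds}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }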

Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Dima Zhiyanov
I am also experiencing this Kryo buffer problem. My join is a left outer join with under 40 MB on the right side. I would expect the broadcast join to succeed in this case (Hive did). Another problem is that the optimizer chose a nested loop join for some reason; I would expect a broadcast (map-side) hash

Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Dima Zhiyanov
Yes Sent from my iPhone On Aug 5, 2014, at 7:38 AM, Dima Zhiyanov [via Apache Spark User List] ml-node+s1001560n11432...@n3.nabble.com wrote: I am also experiencing this kryo buffer problem. My join is left outer with under 40mb on the right side. I would expect the broadcast join to

master=local vs master=local[*]

2014-08-05 Thread Grzegorz Białek
Hi, I have a Spark application which computes a join of two RDDs. One contains around 150 MB of data (7 million entries), the second around 1.5 MB (80 thousand entries), and the result of this join contains 50 MB of data (2 million entries). When I run it on one core (with master=local) it works correctly (whole

Spark SQL Thrift Server

2014-08-05 Thread John Omernik
I have things working on my cluster with the Spark SQL Thrift server. (Thank you Yin Huai at Databricks!) That said, I was curious how I can cache a table via my instance here. I tried the Shark-like create table table_cached as select * from table, and that did not create a cached table.

Re: Spark SQL Thrift Server

2014-08-05 Thread Michael Armbrust
We are working on an overhaul of the docs before the 1.1 release. In the meantime, try: CACHE TABLE tableName. On Tue, Aug 5, 2014 at 9:02 AM, John Omernik j...@omernik.com wrote: I have things working on my cluster with the Spark SQL Thrift server. (Thank you Yin Huai at Databricks!) That
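
For reference, a minimal sketch of the equivalent programmatic calls, assuming an existing HiveContext or SQLContext named sqlContext and an illustrative table name; over the Thrift server's JDBC interface the statement is simply CACHE TABLE tableName.

    sqlContext.cacheTable("tableName")                 // mark the table for in-memory columnar caching
    sqlContext.sql("SELECT * FROM tableName").count()  // the first scan populates the cache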

Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Michael Armbrust
For outer joins I'd recommend upgrading to master or waiting for a 1.1 release candidate (which should be out this week). On Tue, Aug 5, 2014 at 7:38 AM, Dima Zhiyanov dimazhiya...@hotmail.com wrote: I am also experiencing this kryo buffer problem. My join is left outer with under 40mb on the

Re: master=local vs master=local[*]

2014-08-05 Thread Andre Bois-Crettez
The more cores you have, the less memory they will get. 512M is already quite small, and if you have 4 cores it will mean roughly 128M per task. Sometimes it is worthwhile to have fewer cores and more memory. How many cores do you have? André On 2014-08-05 16:43, Grzegorz Białek wrote: Hi,

Re: Can't see any thing one the storage panel of application UI

2014-08-05 Thread Andrew Or
Ah yes, Spark doesn't cache all of your RDDs by default. It turns out that caching things too aggressively can lead to suboptimal performance because there might be a lot of churn. If you don't call persist or cache then your RDDs won't actually be cached. Note that even once they're cached they

Re: Setting spark.executor.memory problem

2014-08-05 Thread Andrew Or
Hi Grzegorz, For local mode you only have one executor, and this executor is your driver, so you need to set the driver's memory instead. That said, in local mode, by the time you run spark-submit, a JVM has already been launched with the default memory settings, so setting spark.driver.memory

Re: Setting spark.executor.memory problem

2014-08-05 Thread Andrew Or
(Clarification: you'll need to pass in --driver-memory not just for local mode, but for any application you're launching with client deploy mode) 2014-08-05 9:24 GMT-07:00 Andrew Or and...@databricks.com: Hi Grzegorz, For local mode you only have one executor, and this executor is your
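
One way to check which setting actually took effect (a small sketch, not from the thread): print the driver JVM's max heap at startup, e.g. after launching with spark-submit --driver-memory 2g.

    object DriverMemoryCheck {
      def main(args: Array[String]): Unit = {
        // If spark.driver.memory was set too late (after the JVM started), this
        // will still show the default heap rather than the requested value.
        val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
        println(s"Driver JVM max heap: $maxHeapMb MB")
      }
    }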

Re: Gradient Boosted Machines

2014-08-05 Thread Manish Amde
Hi Daniel, Thanks a lot for your interest. Gradient boosting and AdaBoost algorithms are under active development and should be a part of release 1.2. -Manish On Mon, Jul 14, 2014 at 11:24 AM, Daniel Bendavid daniel.benda...@creditkarma.com wrote: Hi, My company is strongly considering

Re: Writing to RabbitMQ

2014-08-05 Thread jschindler
You are correct in that I am trying to publish inside of a foreachRDD loop. I am currently refactoring and will try publishing inside the foreachPartition loop. Below is the code showing the way it is currently written, thanks! object myData { def main(args: Array[String]) { val ssc =
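
For reference, a minimal sketch (not the poster's code) of the foreachPartition pattern being suggested, assuming a DStream named stream and an illustrative RabbitMQ host and queue: the connection is created once per partition on the worker rather than once per record.

    import com.rabbitmq.client.ConnectionFactory

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val factory = new ConnectionFactory()
        factory.setHost("rabbitmq-host")
        val connection = factory.newConnection()
        val channel = connection.createChannel()
        try {
          records.foreach { r =>
            channel.basicPublish("", "results", null, r.toString.getBytes("UTF-8"))
          }
        } finally {
          channel.close()
          connection.close()
        }
      }
    }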

java.lang.IllegalArgumentException: Unable to create serializer com.esotericsoftware.kryo.serializers.FieldSerializer

2014-08-05 Thread Sameer Tilak
Hi All, I am trying to move away from spark-shell to spark-submit and have been making some code changes. However, I am now having a problem with serialization. It used to work fine before the code update. Not sure what I did wrong. Here is the code, JaccardScore.scala: package

graph reduceByKey

2014-08-05 Thread Omer Holzinger
Hey all! I'm a total beginner with spark / hadoop / graph computation so please excuse my beginner question. I've created a graph using GraphX. Now, for every vertex, I want to get all its second-degree neighbors. So if my graph is: v1 -- v2, v1 -- v4, v1 -- v6, I want to get something like: v2
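
A rough sketch of one way to do this with GraphX (an assumption about the approach, not a canned API): collect each vertex's first-degree neighbour ids, then join that mapping with itself. This can be expensive for high-degree vertices, and the result may still contain first-degree neighbours.

    import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)
    import org.apache.spark.graphx._

    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 4L, 1), Edge(1L, 6L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    val firstDegree = graph.collectNeighborIds(EdgeDirection.Either)   // (vertex, Array[neighbourId])

    val secondDegree = firstDegree
      .flatMap { case (v, nbrs) => nbrs.map(n => (n, v)) }             // (neighbour, origin)
      .join(firstDegree)                                               // attach the neighbour's own neighbours
      .flatMap { case (_, (origin, nbrsOfNbr)) =>
        nbrsOfNbr.filter(_ != origin).map(n => (origin, n))            // (origin, second-degree neighbour)
      }
      .distinct()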

Problem running Spark shell (1.0.0) on EMR

2014-08-05 Thread Omer Holzinger
I'm having a similar problem to: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/browser I'm trying to follow the tutorial at: When I run: val file = sc.textFile("s3://bigdatademo/sample/wiki/") I get: WARN storage.BlockManager: Putting block broadcast_1 failed

pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi All, I have a data set where each record is serialized using JSON, and I'm interested in using SchemaRDDs to work with the data. Unfortunately I've hit a snag, since some fields in the data are maps and lists and are not guaranteed to be populated for each record. This seems to cause

Re: pyspark inferSchema

2014-08-05 Thread Nicholas Chammas
I was just about to ask about this. Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set. For example: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) a =

Spark Memory Issues

2014-08-05 Thread Sunny Khatri
Hi, I'm trying to run a Spark application with executor-memory 3G, but I'm running into the following error: 14/08/05 18:02:58 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5] at map at KMeans.scala:123), which has no missing parents 14/08/05 18:02:58 INFO DAGScheduler: Submitting 1

Re: Spark SQL Thrift Server

2014-08-05 Thread John Omernik
Thanks Michael. Is there a way to specify off_heap? I.e. Tachyon via the thrift server? Thanks! On Tue, Aug 5, 2014 at 11:06 AM, Michael Armbrust mich...@databricks.com wrote: We are working on an overhaul of the docs before the 1.1 release. In the mean time try: CACHE TABLE tableName.

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-05 Thread bumble123
Thank you!! Could you give me any sample code for the receiver? I'm still new to Spark and not quite sure how I would do that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-from-OpenTSDB-using-PySpark-or-Scala-Spark-tp11211p11454.html Sent

Re: Spark Memory Issues

2014-08-05 Thread Akhil Das
Are you able to see the job on the WebUI (8080)? If yes, how much memory are you seeing there specifically for this job? [image: Inline image 1] Here you can see I have 11.8 GB RAM on both workers and my app is using 11 GB. 1. What is all the memory that you are seeing in your case? 2. Make sure

Re: Spark shell creating a local SparkContext instead of connecting to connecting to Spark Master

2014-08-05 Thread Akhil Das
You can always start your spark-shell by specifying the master, as in MASTER=spark://*whatever*:7077 $SPARK_HOME/bin/spark-shell. Then it will connect to that *whatever* master. Thanks Best Regards On Tue, Aug 5, 2014 at 8:51 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi

Re: Spark Memory Issues

2014-08-05 Thread Sunny Khatri
The only UI I have currently is the Application Master (cluster mode), with the following executor nodes status: Executors (3) - *Memory:* 0.0 B Used (3.7 GB Total) - *Disk:* 0.0 B Used - Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi Nick, Thanks for the great response. I actually already investigated jsonRDD and jsonFile, although I did not realize they provide more complete schema inference. I did however have other problems with jsonRDD and jsonFile, but I will now describe in a separate thread with an appropriate

Re: Spark Memory Issues

2014-08-05 Thread Akhil Das
For that UI to show values, your process should perform some operation, which is not happening here (14/08/05 18:03:13 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory). Can you open up a

Re: pyspark inferSchema

2014-08-05 Thread Nicholas Chammas
Notice the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? Yep, my bad. I’m running off master at commit 184048f80b6fa160c89d5bb47b937a0a89534a95. Nick ​

trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I am interested to use jsonRDD and jsonFile to create a SchemaRDD out of some JSON data I have, but I've run into some instability involving the following java exception: An error occurred while calling o1326.collect. : org.apache.spark.SparkException: Job aborted due to stage failure:

Re: pyspark inferSchema

2014-08-05 Thread Davies Liu
On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I was just about to ask about this. Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set. For example:

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
I believe this is a known issue in 1.0.1 that's fixed in 1.0.2. See: SPARK-2376: Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException https://issues.apache.org/jira/browse/SPARK-2376 On Tue, Aug 5, 2014 at 2:55 PM, Brad Miller bmill...@eecs.berkeley.edu

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Got it. Thanks! On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Notice the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? Yep, my bad. I’m running off master at commit

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Michael Armbrust
Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC which should be coming out this week. Pyspark did not have good support for nested data previously. If you still encounter issues using a more recent version, please file a JIRA. Thanks! On Tue, Aug 5, 2014 at 11:55 AM, Brad

Re: java.lang.StackOverflowError

2014-08-05 Thread Chengi Liu
Bump. On Tuesday, August 5, 2014, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am doing some basic preprocessing in pyspark (local mode) as follows: files = [ input files] def read(filename, sc): # process file return rdd if __name__ == "__main__": conf = SparkConf()

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Nick: Thanks for both the original JIRA bug report and the link. Michael: This is on the 1.0.1 release. I'll update to master and follow-up if I have any problems. best, -Brad On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust mich...@databricks.com wrote: Is this on 1.0.1? I'd suggest

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi Davies, Thanks for the response and tips. Is the sample argument to inferSchema available in the 1.0.1 release of pyspark? I'm not sure (based on the documentation linked below) that it is. http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema It

Re: java.lang.StackOverflowError

2014-08-05 Thread Davies Liu
Could you create a reproducible script (and data) to allow us to investigate this? Davies On Tue, Aug 5, 2014 at 1:10 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am doing some basic preprocessing in pyspark (local mode as follows): files = [ input files] def read(filename,sc):

Re: Spark Memory Issues

2014-08-05 Thread Akhil Das
Are you sure that you were not running SparkPi in local mode? Thanks Best Regards On Wed, Aug 6, 2014 at 12:43 AM, Sunny Khatri sunny.k...@gmail.com wrote: Well I was able to run the SparkPi, that also does the similar stuff, successfully. On Tue, Aug 5, 2014 at 11:52 AM, Akhil Das

Re: Spark Memory Issues

2014-08-05 Thread Sunny Khatri
Yeah, ran it on yarn-cluster mode. On Tue, Aug 5, 2014 at 12:17 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you sure that you were not running SparkPi in local mode? Thanks Best Regards On Wed, Aug 6, 2014 at 12:43 AM, Sunny Khatri sunny.k...@gmail.com wrote: Well I was able

Re: pyspark inferSchema

2014-08-05 Thread Davies Liu
The sample argument of inferSchema is still not in master; I will try to add it if it makes sense. On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks for the response and tips. Is the sample argument to inferSchema available in the 1.0.1 release

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing sample to master will probably not be necessary. We believe that the link below was the bug I experienced, and I've been told it is fixed in master.

Re: Configuration setup and Connection refused

2014-08-05 Thread alamin.ishak
Hi, Anyone? Any input would be much appreciated. Thanks, Amin On 5 Aug 2014 00:31, Al Amin alamin.is...@gmail.com wrote: Hi all, Any help would be much appreciated. Thanks, Al On Mon, Aug 4, 2014 at 7:09 PM, Al Amin alamin.is...@gmail.com wrote: Hi all, I have set up 2 nodes (master

SELECT DISTINCT generates random results?

2014-08-05 Thread Nan Zhu
Hi all, I use “SELECT DISTINCT” to query data saved in Hive; it seems that this statement cannot understand the table structure and just outputs the data in other fields. Has anyone met a similar problem before? Best, -- Nan Zhu

Re: pyspark inferSchema

2014-08-05 Thread Yin Huai
Yes, SPARK-2376 has been fixed in master. Can you give it a try? Also, for inferSchema, because Python is dynamically typed, I agree with Davies that we should provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it. Thanks, Yin On Tue, Aug 5, 2014 at

Re: SELECT DISTINCT generates random results?

2014-08-05 Thread Nan Zhu
Never mind, it was a problem caused by ill-formatted raw data. -- Nan Zhu On Tuesday, August 5, 2014 at 3:42 PM, Nan Zhu wrote: Hi all, I use “SELECT DISTINCT” to query data saved in Hive; it seems that this statement cannot understand the table structure and just outputs the

spark-submit symlink

2014-08-05 Thread Koert Kuipers
spark-submit doesn't handle being a symlink currently: $ spark-submit /usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such file or directory /usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class: cannot execute: No such file or directory To fix it I changed the

spark-ec2 script with VPC

2014-08-05 Thread Erik Shilts
I'm trying to use the spark-ec2 script to launch a spark cluster within a virtual private cloud (VPC) but I don't see an option for that. Is there a way to specify the VPC while using the spark-ec2 script? I found an old spark-incubator mailing list comment which claims to have added that

[PySpark] [SQL] Going from RDD[dict] to SchemaRDD

2014-08-05 Thread Nicholas Chammas
Forking from this thread http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-inferSchema-tc11449.html. On Tue, Aug 5, 2014 at 3:01 PM, Davies Liu dav...@databricks.com http://mailto:dav...@databricks.com wrote: Before upcoming 1.1 release, we did not support nested structures via

issue with spark and bson input

2014-08-05 Thread Dmitriy Selivanov
Hello, I have an issue when trying to use a BSON file as Spark input. I use mongo-hadoop-connector 1.3.0 and Spark 1.0.0: val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val config = new Configuration() config.set("mongo.job.input.format",

Re: Include permalinks in mail footer

2014-08-05 Thread Nicholas Chammas
Looks like this feature has been turned off. Are these changes intentional? Or perhaps I'm not understanding how it's supposed to work. Nick On Fri, Jul 18, 2014 at 12:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looks like this has now been turned on for new threads? On

Re: [PySpark] [SQL] Going from RDD[dict] to SchemaRDD

2014-08-05 Thread Michael Armbrust
Maybe; I’m not sure just yet. Basically, I’m looking for something functionally equivalent to this: sqlContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) In other words, given an RDD of JSON-serializable Python dictionaries, I want to be able to infer a schema that is guaranteed to

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Jens Kristian Geyti
Patrick Wendell wrote: In the latest version of Spark we've added documentation to make this distinction clearer to users: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390 That is a very good addition to the documentation. Nice

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Emails sent from Nabble have it, while others don't. Unfortunately I haven't received a reply from ASF infra on this yet. Matei On August 5, 2014 at 2:04:10 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: Looks like this feature has been turned off. Are these changes intentional? Or

Re: Configuration setup and Connection refused

2014-08-05 Thread Mayur Rustagi
Spark is not able to communicate with your Hadoop HDFS. Is your HDFS running? If so, can you try to connect to it explicitly with the Hadoop command-line tools, giving the full hostname and port, or test the port using telnet localhost 9000. In all likelihood either your HDFS is down, or it is bound to a wrong port/IP that

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Oh actually sorry, it looks like infra has looked at it but they can't add permalinks. They can only add "here's how to unsubscribe" footers. My bad, I just didn't catch the email update from them. Matei On August 5, 2014 at 2:39:45 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Emails sent

Re: Include permalinks in mail footer

2014-08-05 Thread Nicholas Chammas
Ah, the user-specific to: address? I see. OK, thanks for looking into it! On Tue, Aug 5, 2014 at 5:40 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Oh actually sorry, it looks like infra has looked at it but they can't add permalinks. They can only add here's how to unsubscribe footers. My

Re: Configuration setup and Connection refused

2014-08-05 Thread Mayur Rustagi
Then don't specify hdfs:// when you read the file. Also, the community is generally quite active in responding, just be a little patient. If possible, look at the Spark training videos from Spark Summit 2014 and/or the AMPLab training on the Spark website. Mayur Rustagi Ph: +1 (760) 203 3257

Re: Configuration setup and Connection refused

2014-08-05 Thread Andrew Or
Hi Amin, This usually happens because your application can't talk to HDFS and thinks that the name node is listening on port 9000 when it's not. Are you using the EC2 scripts for standalone Spark? You can verify whether or not the port is correct by checking the configurations with

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
Hi Jens, Within a partition things will spill - so the current documentation is correct. This spilling can only occur *across keys* at the moment. Spilling cannot occur within a key at present. This is discussed in the video here:
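
A small sketch of the practical consequence, assuming an existing SparkContext named sc and an illustrative input path: when a single key's values are too large to hold at once, prefer an aggregation that combines values map-side over materializing them with groupByKey.

    import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)

    val pairs = sc.textFile("hdfs:///logs").map(line => (line.take(8), 1L))

    // groupByKey buffers every value of a key before the downstream code sees it:
    // val grouped = pairs.groupByKey()

    // reduceByKey keeps only one running total per key and combines map-side:
    val counts = pairs.reduceByKey(_ + _)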

[Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-05 Thread Anton Brazhnyk
Greetings, I modified the ActorWordCount example a little so it uses a simple case class as the message for Streaming instead of a primitive string. I also modified the launch code to not use the run-example script, but to set the Spark master in the code and attach the jar (setJars(...)) with all the classes
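
For context, a minimal sketch of the launch pattern described; the master URL and jar path are illustrative.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")        // instead of relying on the run-example script
      .setAppName("ActorWordCount")
      .setJars(Seq("/path/to/app-assembly.jar"))    // ships the jar (incl. the message case class) to executors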

Re: [PySpark] [SQL] Going from RDD[dict] to SchemaRDD

2014-08-05 Thread Nicholas Chammas
SPARK-2870: Thorough schema inference directly on RDDs of Python dictionaries https://issues.apache.org/jira/browse/SPARK-2870 On Tue, Aug 5, 2014 at 5:07 PM, Michael Armbrust mich...@databricks.com wrote: Maybe; I’m not sure just yet. Basically, I’m looking for something functionally

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I've built and deployed the current head of branch-1.0, but it seems to have only partly fixed the bug. This code now runs as expected with the indicated output: srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}'])) srdd.printSchema() root |-- foo:

python dependencies loaded but not on PYTHONPATH

2014-08-05 Thread Dominik Hübner
Hey, I just tried to submit a task to my Spark cluster using the following command: ./spark/bin/spark-submit --py-files file:///root/abc.zip --master spark://xxx.xxx.xxx.xxx:7077 test.py It seems like the dependency I’ve added gets loaded: 14/08/05 23:07:00 INFO spark.SparkContext: Added file

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
This looks to be fixed in master: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']) ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315 sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3],

Re: Using sbt-pack with Spark 1.0.0

2014-08-05 Thread lbustelo
Are there any workarounds for this? Seems to be a dead end so far. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649p11502.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-05 Thread Tathagata Das
Can you show us the modified version? The reason could very well be what you suggest, but I want to understand what conditions lead to this. TD On Tue, Aug 5, 2014 at 3:55 PM, Anton Brazhnyk anton.brazh...@genesys.com wrote: Greetings, I modified ActorWordCount example a little and it uses

Re: Unit Test for Spark Streaming

2014-08-05 Thread Tathagata Das
That function simply deletes a directory recursively. You can use alternative libraries; see this discussion: http://stackoverflow.com/questions/779519/delete-files-recursively-in-java On Tue, Aug 5, 2014 at 5:02 PM, JiajiaJing jj.jing0...@gmail.com wrote: Hi TD, I encountered a problem

Re: Spark Streaming fails - where is the problem?

2014-08-05 Thread Tathagata Das
@ Simon Any progress? On Tue, Aug 5, 2014 at 12:17 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to add twitter4j-*-3.0.3.jars to your class path Thanks Best Regards On Tue, Aug 5, 2014 at 7:18 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you able to run it

Re: Spark streaming at-least once guarantee

2014-08-05 Thread Tathagata Das
I can try answering the question even if I am not Sanjeet ;) There isn't a simple way to do this. In fact, the ideal way to do it would be to create a new InputDStream (just like FileInputDStream

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-05 Thread Tathagata Das
1. updateStateByKey should be called on all keys even if there is no data corresponding to that key. There is a unit test for that: https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala#L337 2. I am increasing the priority for
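
A small sketch of point 1, assuming a DStream of (String, Long) pairs named pairs: the update function also runs for keys with no new values in the current batch (newValues is then empty), so existing state can be refreshed or dropped.

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (Spark 1.x)

    // Requires a checkpoint directory, e.g. ssc.checkpoint("hdfs:///checkpoints")
    val stateCounts = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      // Invoked for every key that has state, even when newValues is empty for this batch.
      // Return None instead to drop the key from the state.
      Some(state.getOrElse(0L) + newValues.sum)
    }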

Can't zip RDDs with unequal numbers of partitions

2014-08-05 Thread Bin
Hi All, I met the error in the subject. The exception occurred at line 223, as shown below: 212 // read files 213 val lines = sc.textFile(path_edges).map(line => line.split(",")).map(line => ((line(0), line(1)), line(2).toDouble)).reduceByKey(_ + _).cache 214 215 val
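
A hedged note on the error itself (not from the thread): zip requires both RDDs to have the same number of partitions and the same number of elements per partition. One workaround, sketched below with illustrative data and an existing SparkContext sc, is to key both sides explicitly and join instead of zipping.

    import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)

    val left  = sc.parallelize(Seq("a", "b", "c"), 4)
    val right = sc.parallelize(Seq(1, 2, 3), 2)

    // left.zip(right) fails here: unequal numbers of partitions
    val joined = left.zipWithIndex().map(_.swap)
      .join(right.zipWithIndex().map(_.swap))
      .values                                // RDD[(String, Int)], paired by position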

Save an RDD to a SQL Database

2014-08-05 Thread Vida Ha
Hi, I would like to save an RDD to a SQL database. It seems like this would be a common enough use case. Are there any built-in libraries to do it? Otherwise, I'm just planning on mapping my RDD, and having that call a method to write to the database. Given that a lot of records are going to
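
If there is no built-in library for this, the usual pattern is foreachPartition with one JDBC connection per partition. A minimal sketch under that assumption; the JDBC URL, credentials, table, and column names are illustrative, and the JDBC driver jar must be on the executors' classpath.

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    def saveToDatabase(rdd: RDD[(Int, String)]): Unit = {
      rdd.foreachPartition { rows =>
        // One connection and prepared statement per partition, not per record
        val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO records (id, value) VALUES (?, ?)")
        try {
          rows.foreach { case (id, value) =>
            stmt.setInt(1, id)
            stmt.setString(2, value)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }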

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I checked out and built master. Note that Maven had a problem building Kafka (in my case, at least); I was unable to fix this easily so I moved on since it seemed unlikely to have any influence on the problem at hand. Master improves functionality (including the example Nicholas just

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
I've followed up in a thread more directly related to jsonRDD and jsonFile, but it seems like after building from the current master I'm still having some problems with nested dictionaries.

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Yin Huai
I tried jsonRDD(...).printSchema() and it worked. It seems the problem is that when we take the data back to the Python side, SchemaRDD#javaToPython fails on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
I concur that printSchema works; it just seems to be operations that use the data where trouble happens. Thanks for posting the bug. -Brad On Tue, Aug 5, 2014 at 10:05 PM, Yin Huai yh...@databricks.com wrote: I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we

type issue: found RDD[T] expected RDD[A]

2014-08-05 Thread Amit Kumar
Hi All, I am having some trouble trying to write generic code that uses sqlContext and RDDs. Can you suggest what might be wrong? class SparkTable[T : ClassTag](val sqlContext: SQLContext, val extractor: (String) => T) { private[this] var location: Option[String] = None private[this] var
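
A minimal compiling sketch of the general shape (an assumption about the intent, not the poster's full code): keep a single type parameter with a ClassTag context bound in scope wherever the RDD[T] is actually created.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    class SparkTable[T: ClassTag](val sqlContext: SQLContext, val extractor: String => T) {
      private[this] var location: Option[String] = None

      def atLocation(path: String): this.type = { location = Some(path); this }

      // The ClassTag for T is in scope here, so map can build an RDD[T]
      def load(): RDD[T] =
        sqlContext.sparkContext
          .textFile(location.getOrElse(sys.error("location not set")))
          .map(extractor)
    }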