Re: Separating classloader management from SparkContexts

2014-03-19 Thread Andrew Ash
Hi Punya, This seems like a problem that the recently-announced job-server would likely have run into at one point. I haven't tested it yet, but I'd be interested to see what happens when two jobs in the job server have conflicting classes. Does the server correctly segregate each job's classes
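For readers following along, the kind of per-job isolation under discussion can be sketched in a few lines of Scala. This is a hypothetical illustration (names and paths invented), not the job-server's actual mechanism:

    import java.net.{URL, URLClassLoader}

    // Hypothetical sketch: each job's jars get their own loader whose parent is
    // the application loader. Classes found only in a job's jars stay private to
    // that job; fully isolating classes that also exist in the parent would need
    // a child-first loader, which is more involved.
    def loaderForJob(jobJars: Seq[URL]): ClassLoader =
      new URLClassLoader(jobJars.toArray, getClass.getClassLoader)

    // Load a job's entry point through its own loader (path and class invented):
    val loader = loaderForJob(Seq(new URL("file:///tmp/jobs/jobA.jar")))
    val entry  = Class.forName("com.example.JobA", true, loader)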

Re: Pyspark worker memory

2014-03-19 Thread Matei Zaharia
Try checking spark-env.sh on the workers as well. Maybe code there is somehow overriding the spark.executor.memory setting. Matei On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote: Hello, I'm using the Github snapshot of PySpark and having trouble setting the worker memory
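For reference, the setting in question can also be pinned in the application itself, which narrows the surface for environment files to override it. A minimal Scala sketch (master URL and sizes are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Pin executor memory programmatically. Note: env-based settings such as
    // those sourced from spark-env.sh on workers can still take precedence in
    // some versions, which is exactly what this thread is chasing.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MemoryCheck")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)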

Re: Running spark examples/scala scripts

2014-03-19 Thread Pariksheet Barapatre
:-) Thanks for the suggestion. I was actually asking how to run Spark scripts as a standalone app. I am able to run Java code and Python code as standalone apps. One more doubt: the documentation says that to read an HDFS file, we need to add the dependency <groupId>org.apache.hadoop</groupId>
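For anyone with the same doubt: a standalone Scala app is an ordinary main class that builds its own SparkContext. A minimal sketch (master, app name, and path are placeholders):

    import org.apache.spark.SparkContext

    object SimpleApp {
      def main(args: Array[String]) {
        // "local" runs in-process; use a cluster URL for real deployments.
        val sc = new SparkContext("local", "Simple App")
        val lines = sc.textFile("hdfs://localhost:8020/user/user/input.txt")
        println("line count: " + lines.count())
        sc.stop()
      }
    }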

Re: Unable to read HDFS file -- SimpleApp.java

2014-03-19 Thread Prasad
Check this thread out: http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2807.html -- you have to remove the conflicting akka and protobuf versions. Thanks Prasad. -- View this message in context:
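Removing conflicting versions is usually done in the build file. A hedged sbt sketch (coordinates and versions are assumptions to adapt, not the linked thread's exact fix):

    // build.sbt sketch: keep Hadoop's protobuf from clashing with Spark's.
    libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
      .exclude("com.google.protobuf", "protobuf-java")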

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know the Spark issue, but the Hadoop context is clear: old API - org.apache.hadoop.mapred; new API - org.apache.hadoop.mapreduce. You might only need to change your import. Regards Bertrand On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre pbarapa...@gmail.com wrote: Hi,

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Seems like an import issue; I ran it with hadoopFile and it worked. I can't find the import statement for the TextInputFormat class location in the new API. Can anybody help? Thanks Pariksheet On 19 March 2014 16:05, Bertrand Dechoux decho...@gmail.com wrote: I don't know the Spark issue but the Hadoop context
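For the archive: the new-API TextInputFormat lives in org.apache.hadoop.mapreduce.lib.input. A minimal sketch of wiring it into newAPIHadoopFile (path hypothetical):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newAPIHadoopFile takes the new-API (mapreduce) format as a type
    // parameter; hadoopFile is the old-API (mapred) counterpart.
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://localhost:8020/user/user/input.txt")
    lines.map(_._2.toString).take(5).foreach(println)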

How to distribute external executable (script) with Spark ?

2014-03-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to build an evaluation platform based on Spark. The idea is to run a blackbox executable (built with C/C++ or some scripting language). This blackbox takes a set of data as input and outputs some metrics. Since I have a huge amount of data, I need to distribute the computation
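One approach that may fit here, sketched under assumptions (paths are hypothetical, and the binary must be executable on every worker): ship it with SparkContext.addFile and stream each partition through it with RDD.pipe.

    import org.apache.spark.SparkFiles

    // Ship the blackbox executable to every node.
    sc.addFile("/path/to/blackbox")

    // Stream each partition's records through the binary via stdin/stdout.
    // Note: resolving the shipped file's path on the workers can be
    // deployment-specific, so treat this as a sketch, not a recipe.
    val input   = sc.textFile("hdfs://localhost:8020/data/input")
    val metrics = input.pipe(SparkFiles.get("blackbox"))
    metrics.saveAsTextFile("hdfs://localhost:8020/data/metrics")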

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
If I don't call iter(), and just return treeiterator directly, I get an error message that the object is not of an iterator type. This is in Python 2.6...perhaps a bug? BUT I also realized my code was wrong. It results in an RDD containing all the tags in all the files. What I really want is

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Shixiong Zhu
Do you want to read the file content in the following statement? val ny_daily = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")) If so, you should use textFile, e.g., val ny_daily = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily") parallelize is used to
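A hedged sketch of the join the thread title asks about (the second file name and the field positions are assumptions about the NYSE data layout):

    import org.apache.spark.SparkContext._  // pair-RDD implicits (auto in shell)

    val ny_daily     = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")
    val ny_dividends = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends")

    // Assume the stock symbol is the second tab-separated field in both files.
    val dailyBySymbol     = ny_daily.map(line => (line.split("\t")(1), line))
    val dividendsBySymbol = ny_dividends.map(line => (line.split("\t")(1), line))

    // Inner join on the symbol key.
    val joined = dailyBySymbol.join(dividendsBySymbol)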

Re: Separating classloader management from SparkContexts

2014-03-19 Thread Punya Biswal
Hi Andrew, Thanks for pointing me to that example. My understanding of the JobServer (based on watching a demo of its UI) is that it maintains a set of spark contexts and allows people to add jars to them, but doesn't allow unloading or reloading jars within a spark context. The code in JobCache

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Thanks, it worked. Very basic question: I have created a custom input format, e.g. for stock data. How do I refer to this class as a custom input format? I.e., where do I keep this class on the Linux filesystem? Do I need to add its jar, and if so, how? I am running code through spark-shell. Thanks Pari On 19-Mar-2014 7:35
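A hedged sketch of one common way to do this with the 0.9-era shell. The jar path and the StockInputFormat class are hypothetical, and the class is assumed to implement the new (mapreduce) API with Text keys and values:

    // Launch the shell with the jar on its classpath, e.g. (0.9-era convention):
    //   ADD_JARS=/home/user/stock-format.jar ./bin/spark-shell
    // Then name the class as a type parameter:
    import org.apache.hadoop.io.Text
    import com.example.StockInputFormat

    val stocks = sc.newAPIHadoopFile[Text, Text, StockInputFormat](
      "hdfs://localhost:8020/user/user/stocks")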

RE: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Xia, Junluan
Very cool! -Original Message- From: Chanwit Kaewkasi [mailto:chan...@gmail.com] Sent: Wednesday, March 19, 2014 10:36 AM To: user@spark.apache.org Subject: Spark enables us to process Big Data on an ARM cluster !! Hi all, We are a small team doing research on low-power (and low-cost)

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Koert Kuipers
I don't know anything about ARM clusters, but it looks great. What are the specs? Do the nodes have no local disk at all? On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing research on low-power (and low-cost) ARM clusters. We built

Transitive dependency incompatibility

2014-03-19 Thread Jaka Jančar
Hi, I'm getting the following error: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at
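One common fix for this class of error, sketched under assumptions (sbt, and that the clash is over httpclient versions, as the missing constructor's DnsResolver parameter suggests code compiled against httpclient 4.2+):

    // build.sbt sketch: the DnsResolver-taking constructor appeared in
    // httpclient 4.2, so pin every transitive pull to a single 4.2+ version
    // so only one copy ends up on the classpath.
    dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.2.5"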

Re: How to distribute external executable (script) with Spark ?

2014-03-19 Thread Mayur Rustagi
I doubt there is something like this out of the box. The easiest thing is to package it into a jar and send that jar across. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Mar 19, 2014 at 6:57 AM, Jaonary Rabarisoa

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing research

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
Actually, thinking more on this question, Matei: I'd definitely say support for Avro. There's a lot of interest in this!! On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW one other thing -- in your experience, Diana, which non-text InputFormats would be most

Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no issues. On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote: it's a 0.9 snapshot from January running in standalone mode. Have these fixes been merged into 0.9? On Thu, Mar 6, 2014 at 12:45 AM,
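For later readers, the cancellation API being exercised looks roughly like this (a sketch; it assumes a Spark version in which job-group cancellation works, which per this thread is the 1.0.0-SNAPSHOT line):

    // Thread A: tag jobs you may want to kill with a group id.
    sc.setJobGroup("my-group", "cancellable ad-hoc query")
    val n = sc.textFile("hdfs://localhost:8020/data/big").count()

    // Thread B: cancel everything running under that group id.
    sc.cancelJobGroup("my-group")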

Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Mayur Rustagi
The data may be spilled to disk; hence HDFS is a necessity for Spark. You can run Spark on a single machine without using HDFS, but in distributed mode HDFS will be required. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On

Re: example of non-line oriented input data?

2014-03-19 Thread Jeremy Freeman
Another vote on this: support for simple SequenceFiles and/or Avro would be terrific, as using plain text can be very space-inefficient, especially for numerical data. -- Jeremy On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'd second the request for Avro
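SequenceFiles of Writable-convertible types are in fact already readable and writable from Scala; a minimal sketch (paths hypothetical; the SparkContext._ import brings in the implicit conversions and is auto-imported in the shell):

    import org.apache.spark.SparkContext._  // implicit Writable conversions

    // Write and read back a SequenceFile of (String, Int) pairs; Spark maps
    // these to (Text, IntWritable) under the hood.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
    counts.saveAsSequenceFile("hdfs://localhost:8020/tmp/counts.seq")

    val back = sc.sequenceFile[String, Int]("hdfs://localhost:8020/tmp/counts.seq")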

Re: spark 0.8 examples in local mode

2014-03-19 Thread maxpar
Just figured it out. I needed to add file:// to the URI. I guess it was not needed in previous Hadoop versions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-0-8-examples-in-local-mode-tp2892p2897.html Sent from the Apache Spark User List mailing list
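For others hitting the same thing, the resulting call looks like this (path hypothetical):

    // Without a scheme, newer Hadoop versions resolve the path against the
    // default filesystem (often HDFS); file:// forces the local filesystem.
    val lines = sc.textFile("file:///home/user/data/input.txt")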

saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
I am testing the performance of Spark to see how it behaves when the dataset size exceeds the amount of memory available. I am running wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB RAM per node). I limited spark.executor.memory to 64g, so I have 256g of memory available in
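For context, the shape of the job being benchmarked is presumably the standard wordcount; a sketch with hypothetical paths:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (auto in shell)

    val text   = sc.textFile("hdfs://localhost:8020/data/corpus")
    val counts = text.flatMap(_.split("\\s+"))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://localhost:8020/data/wordcounts")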

in SF until Friday

2014-03-19 Thread Nicholas Chammas
I'm in San Francisco until Friday for a conference (visiting from Boston). If any of y'all are up for a drink or something, I'd love to meet you in person and say hi. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/in-SF-until-Friday-tp2900.html

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread hequn cheng
persist and unpersist. unpersist: Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. 2014-03-19 16:40 GMT+08:00 林武康 vboylin1...@gmail.com: Hi, can anyone tell me about the lifecycle of an rdd? I searched through the official website and still can't figure it out.
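In code, that lifecycle looks like the following minimal sketch (path hypothetical):

    import org.apache.spark.storage.StorageLevel

    val data   = sc.textFile("hdfs://localhost:8020/data/input")
    val cached = data.persist(StorageLevel.MEMORY_AND_DISK)

    cached.count()      // the first action materializes the cached blocks
    cached.unpersist()  // explicitly drop the blocks from memory and disk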

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Related question: If I keep creating new RDDs and cache()-ing them, does Spark automatically unpersist the least recently used RDD when it runs out of memory? Or is an explicit unpersist the only way to get rid of an RDD (barring the PR Tathagata mentioned)? Also, does unpersist()-ing an RDD

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Matei Zaharia
Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. In both cases though, note that Java doesn’t garbage-collect the objects released until later. Matei On Mar 19, 2014, at 7:22 PM, Nicholas Chammas

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Okie doke, good to know. On Wed, Mar 19, 2014 at 7:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. In both cases though, note that Java doesn’t garbage-collect

Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-19 Thread Uri Laserson
I have the exact same error running on a bare metal cluster with CentOS6 and Python 2.6.6. Any other thoughts on the problem here? I only get the error on operations that require communication, like reduceByKey or groupBy. On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas

PySpark worker fails with IOError Broken Pipe

2014-03-19 Thread Nicholas Chammas
So I have the pyspark shell open and after some idle time I sometimes get this:

    PySpark worker failed with exception: Traceback (most recent call last):
      File "/root/spark/python/pyspark/worker.py", line 77, in main
        serializer.dump_stream(func(split_index, iterator), outfile)
      File

Relation between DStream and RDDs

2014-03-19 Thread Sanjay Awatramani
Hi, As I understand it, a DStream consists of one or more RDDs, and foreachRDD will run a given func on each and every RDD inside a DStream. I created a simple program which reads log files from a folder every hour: JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 *
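A rough Scala equivalent of the Java snippet above, for reference (batch interval and paths are hypothetical, and conf is assumed to be a SparkConf defined elsewhere):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc  = new StreamingContext(conf, Seconds(3600))  // hourly batches
    val logs = ssc.textFileStream("hdfs://localhost:8020/logs/")

    // foreachRDD runs the supplied function on the RDD behind each batch.
    logs.foreachRDD { rdd => println("records this batch: " + rdd.count()) }

    ssc.start()
    ssc.awaitTermination()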