Re: Relation between DStream and RDDs

2014-03-19 Thread Tathagata Das
That is a good question. If I understand correctly, you need multiple RDDs from a DStream in *every batch*. Can you elaborate on why you need multiple RDDs every batch? TD On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani wrote: > Hi, > > As I understand, a DStream consists of 1 or more RD

Relation between DStream and RDDs

2014-03-19 Thread Sanjay Awatramani
Hi, As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will run a given func on each and every RDD inside a DStream. I created a simple program which reads log files from a folder every hour: JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60
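
A minimal sketch of the pattern under discussion — an hour-long batch interval, with foreachRDD seeing exactly one RDD per batch. The folder path and the counting logic are illustrative assumptions, not taken from Sanjay's program:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HourlyLogCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HourlyLogCount")
        // One-hour batches, matching the Duration(60 * 60 * 1000) in the question.
        val ssc = new StreamingContext(conf, Seconds(60 * 60))

        // Spark Streaming hands foreachRDD a single RDD per batch interval.
        val logs = ssc.textFileStream("/var/logs/incoming")  // hypothetical folder
        logs.foreachRDD { rdd =>
          println(s"Records in this hourly batch: ${rdd.count()}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }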

PySpark worker fails with IOError Broken Pipe

2014-03-19 Thread Nicholas Chammas
So I have the pyspark shell open and after some idle time I sometimes get this: >>> PySpark worker failed with exception: > Traceback (most recent call last): > File "/root/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > Fi

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread 林武康
Thank you, everybody. Nice to know 😊 -Original Message- From: "Nicholas Chammas" Sent: 2014/3/20 10:23 To: "user" Subject: Re: What's the lifecycle of an rdd? Can I control it? Related question: If I keep creating new RDDs and cache()-ing them, does Spark automatically unpersist the least recentl

Re: Shark does not give any results with SELECT count(*) command

2014-03-19 Thread qingyang li
Have found the cause. My problem is: the format of the slaves file is not correct, so the tasks only run on the master. Explaining here to help others who encounter a similar problem. 2014-03-20 9:57 GMT+08:00 qingyang li : > Hi, i install spark0.9.0 and shark0.9 on 3 nodes , when i run select *
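
For anyone hitting the same symptom: the standalone scripts read the worker list from conf/slaves, one hostname per line with no extra punctuation. A hypothetical example (hostnames are illustrative):

    # conf/slaves — one worker hostname per line
    worker-node-1
    worker-node-2
    worker-node-3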

Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-19 Thread Uri Laserson
I have the exact same error running on a bare metal cluster with CentOS6 and Python 2.6.6. Any other thoughts on the problem here? I only get the error on operations that require communication, like reduceByKey or groupBy. On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas wrote: > Alright, so

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Okie doke, good to know. On Wed, Mar 19, 2014 at 7:35 PM, Matei Zaharia wrote: > Yes, Spark automatically removes old RDDs from the cache when you make new > ones. Unpersist forces it to remove them right away. In both cases though, > note that Java doesn’t garbage-collect the objects released u

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Matei Zaharia
Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. In both cases though, note that Java doesn’t garbage-collect the objects released until later. Matei On Mar 19, 2014, at 7:22 PM, Nicholas Chammas wrote: > Related

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Related question: If I keep creating new RDDs and cache()-ing them, does Spark automatically unpersist the least recently used RDD when it runs out of memory? Or is an explicit unpersist the only way to get rid of an RDD (barring the PR Tathagata mentioned)? Also, does unpersist()-ing an RDD imme

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Tathagata Das
Just a heads-up, there is an active *pull request* that will automatically unpersist RDDs that are no longer referenced/in scope from the application. TD On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng wrote: > persist and unpersist. > unpersist:Mark

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread hequn cheng
persist and unpersist. unpersist: Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. 2014-03-19 16:40 GMT+08:00 林武康 : > Hi, can any one tell me about the lifecycle of an rdd? I search through > the official website and still can't figure it out. Can I use an rdd in
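
A minimal sketch of the explicit persist/unpersist lifecycle described above, in the Scala shell; the input path and transformations are illustrative assumptions:

    val lines = sc.textFile("hdfs:///data/input.txt")    // hypothetical path
    val words = lines.flatMap(_.split(" ")).cache()      // keep blocks in memory

    // Reuse the cached RDD across several stages/actions.
    val total    = words.count()
    val distinct = words.distinct().count()

    // Once no later stage needs it, free its blocks from memory and disk.
    words.unpersist()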

Shark does not give any results with SELECT count(*) command

2014-03-19 Thread qingyang li
Hi, I installed Spark 0.9.0 and Shark 0.9 on 3 nodes. When I run select * from src, I can get results, but when I run select count(*) from src or select * from src limit 1, there is no result output. I have found a similar problem on Google Groups: https://groups.google.com/forum/#!searchin/spark-use

Re: Machine Learning on streaming data

2014-03-19 Thread Tathagata Das
Yes, of course you can conceptually apply machine learning algorithms on Spark Streaming. The current MLlib does not yet have direct support for Spark Streaming's DStreams; however, since DStreams are essentially a sequence of RDDs, you can apply MLlib algorithms on those RDDs. Take a look at
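
A hedged sketch of the idea TD describes — applying an MLlib algorithm to each RDD of a DStream via foreachRDD. It is written against the MLlib API in which LabeledPoint features are Vectors; the input directory, record format, and iteration count are assumptions for illustration:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumes an existing SparkContext named sc.
    val ssc = new StreamingContext(sc, Seconds(10))

    // Hypothetical input: lines of "label,f1,f2,..." dropped into a monitored directory.
    val points = ssc.textFileStream("/streaming/labeled-points").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }

    points.foreachRDD { rdd =>
      val n = rdd.count()
      if (n > 0) {
        // Each batch is an ordinary RDD, so MLlib can train on it directly.
        val model = LogisticRegressionWithSGD.train(rdd, 20)
        println(s"Trained on $n points; weights = ${model.weights}")
      }
    }

    ssc.start()
    ssc.awaitTermination()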

in SF until Friday

2014-03-19 Thread Nicholas Chammas
I'm in San Francisco until Friday for a conference (visiting from Boston). If any of y'all are up for a drink or something, I'd love to meet you in person and say hi. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/in-SF-until-Friday-tp2900.html Sent

saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
I am testing the performance of Spark to see how it behaves when the dataset size exceeds the amount of memory available. I am running wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB RAM per node). I limited spark.executor.memory to 64g, so I have 256g of memory available in
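
For context, a minimal word-count job of the kind being benchmarked; the paths are hypothetical and the memory setting simply mirrors the limit mentioned in the post:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions on older releases

    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.executor.memory", "64g")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///datasets/large-input")              // hypothetical input
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///datasets/wordcount-output")   // hypothetical output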

Re: spark-streaming

2014-03-19 Thread Tathagata Das
Hey Nathan, We made that private in order to reduce the visible public API, to have greater control in the future. Can you tell me more about the timing information that you want to get? TD On Fri, Mar 14, 2014 at 8:57 PM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > I'm trying to

Re: spark 0.8 examples in local mode

2014-03-19 Thread maxpar
Just figured it out. I need to add a "file://" prefix to the URI. I guess it was not needed in previous Hadoop versions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-0-8-examples-in-local-mode-tp2892p2897.html Sent from the Apache Spark User List mailing list ar
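
For example, the same run with an explicit scheme would look something like this (the directory is a hypothetical placeholder):

    ./run-example org.apache.spark.examples.SparkHdfsLR local file:///home/user/lr_data.txt 3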

Re: example of non-line oriented input data?

2014-03-19 Thread Jeremy Freeman
Another vote on this, support for simple SequenceFiles and/or Avro would be terrific, as using plain text can be very space-inefficient, especially for numerical data. -- Jeremy On Mar 19, 2014, at 5:24 PM, Nicholas Chammas wrote: > I'd second the request for Avro support in Python first, fo
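
For reference, the Scala API can already read simple SequenceFiles; a hedged sketch assuming a file of (LongWritable, Text) records at a hypothetical path:

    import org.apache.hadoop.io.{LongWritable, Text}

    val records = sc.sequenceFile("hdfs:///data/events.seq", classOf[LongWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }   // copy values out of reused Writables
    records.take(5).foreach(println)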

Re: example of non-line oriented input data?

2014-03-19 Thread Nicholas Chammas
I'd second the request for Avro support in Python first, followed by Parquet. On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin wrote: > > On 19 Mar 2014, at 19:54, Diana Carroll wrote: > > Actually, thinking more on this question, Matei: I'd definitely say > support for Avro. There's a lot of

Re: example of non-line oriented input data?

2014-03-19 Thread Evgeny Shishkin
On 19 Mar 2014, at 19:54, Diana Carroll wrote: > Actually, thinking more on this question, Matei: I'd definitely say support > for Avro. There's a lot of interest in this!! > Agree, and Parquet as the default Cloudera Impala format. > On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia > wrote:

how to sort within DStream or merge consecutive RDDs

2014-03-19 Thread Adrian Mocanu
Hi, I would like to know if it is possible to sort within a DStream. I know it's possible to sort within an RDD, and I know it's impossible to sort across the entire DStream, but I would be satisfied with sorting across 2 RDDs: 1) merge 2 consecutive RDDs 2) reduce by key + sort the merged data 3) ta
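
A hedged sketch of one way to get close to this with the existing API — a window spanning two batch intervals merges consecutive RDDs, and transform sorts within each merged window. The batch interval, source, and word-count pairing are illustrative assumptions:

    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Assumes an existing SparkContext named sc and a 10-second batch interval.
    val ssc = new StreamingContext(sc, Seconds(10))
    val pairs = ssc.socketTextStream("localhost", 9999)   // hypothetical source
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // window(20s, 20s): each emitted RDD covers two consecutive batches, with no overlap.
    val merged = pairs.window(Seconds(20), Seconds(20))

    // Each window is a single RDD, so reduce by key and sort inside transform.
    val sorted = merged.reduceByKey(_ + _).transform(_.sortByKey())
    sorted.print()

    ssc.start()
    ssc.awaitTermination()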

spark 0.8 examples in local mode

2014-03-19 Thread maxpar
Hi folks, I have just upgraded to Spark 0.8.1 and tried some examples like: ./run-example org.apache.spark.examples.SparkHdfsLR local lr_data.txt 3 It turns out that Spark keeps trying to read the file from HDFS rather than the local FS: Client: Retrying connect to server: Node1/192.168.0.101:90

workers die with AssociationError

2014-03-19 Thread Eric Kimbrel
I am running Spark on a Cloudera cluster, Spark version 0.9.0-cdh5.0.0-beta-2. While nothing else is running on the cluster I am having frequent worker failures with errors like AssociationError [akka.tcp://sparkWorker@worker5:7078] -> [akka.tcp://sparkExecutor@worker5:37487]: Error [Association

Re: Running spark examples/scala scripts

2014-03-19 Thread Mayur Rustagi
You have to pick the right client version for your Hadoop, so basically it's going to be your Hadoop version. A map of Hadoop versions to CDH & Hortonworks is given on the Spark website. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Mayur Rustagi
The data may be spilled to disk, hence HDFS is a necessity for Spark. You can run Spark on a single machine and not use HDFS, but in distributed mode HDFS will be required. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Chanwit Kaewkasi
Hi Koert, There's some NAND flash built into each node. We mount the NAND flash as a local directory for Spark to spill data to. A DZone article, also written by me, will tell more about the cluster. We really appreciate the design of Spark's RDD done by the Spark team. It turned out to be perfect

closure scope & Serialization

2014-03-19 Thread Domen Grabec
Hey, I have 3 classes where 2 are in circular dependency like this: package org.example import org.apache.spark.SparkContext class A(bLazy: => Option[B]) extends java.io.Serializable{ lazy val b: Option[B] = bLazy } class B(aLazy: => Option[A]) extends java.io.Serializable{ lazy val a: Opti

Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no issues. On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers wrote: > its 0.9 snapshot from january running in standalone mode. > > have these fixed been merged into 0.9? > > > On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia

Re: partitioning via groupByKey

2014-03-19 Thread Jaka Jančar
The former: a single new RDD is returned. Check the PairRDDFunctions docs (http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions): def groupByKey(): RDD[(K, Seq[V])] Group the values for each key in the RDD into a single sequence. On Wednesday, March 19,
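
A small self-contained illustration of that behaviour (the data is made up):

    import org.apache.spark.SparkContext._   // pair-RDD functions on older releases

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // After the shuffle this is still one RDD; tuples sharing a key land in the same partition.
    val grouped = pairs.groupByKey()
    grouped.collect().foreach(println)   // one (key, values) entry per key, all from a single RDD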

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
Actually, thinking more on this question, Matei: I'd definitely say support for Avro. There's a lot of interest in this!! On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia wrote: > BTW one other thing -- in your experience, Diana, which non-text > InputFormats would be most useful to support in Py

partitioning via groupByKey

2014-03-19 Thread Adrian Mocanu
When you partition via groupByKey, tuples (parts of the RDD) are moved from one node to another based on key (hash partitioning). Do the tuples remain part of 1 RDD as before but moved to different nodes, or does this shuffling create, say, several RDDs which will have parts of the original

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, "Chanwit Kaewkasi" wrote: > Hi all, > > We are a small team doing a research on low-power

Re: How to distribute external executable (script) with Spark ?

2014-03-19 Thread Mayur Rustagi
I doubt there is something like this out of the box. The easiest thing is to package it into a jar & send that jar across. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Mar 19, 2014 at 6:57 AM, Jaonary Rabarisoa

Re: Incrementally add/remove vertices in GraphX

2014-03-19 Thread Alessandro Lulli
Hi All, Thanks for your answers. Regarding GraphX streaming: - Is there an issue (pull request) to follow to keep track of the update? - Where is it possible to find a description and details of what will be provided? Thanks for your help and your time to answer my questions Alessandro O

Transitive dependency incompatibility

2014-03-19 Thread Jaka Jančar
Hi, I'm getting the following error: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(Pool

Is shutting down of SparkContext optional?

2014-03-19 Thread Roman Pastukhov
Hi, After switching from Spark 0.8.0 to Spark 0.9.0 (and to Scala 2.10), one application started hanging after the main thread is done (in 'local[2]' mode, without a cluster). Adding SparkContext.stop() at the end solves this. Is this behavior normal, and is shutting down the SparkContext required?
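
A minimal illustration of the fix being described; the application body is a placeholder:

    import org.apache.spark.SparkContext

    object MyApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "MyApp")
        try {
          println(sc.parallelize(1 to 100).count())   // placeholder for real work
        } finally {
          sc.stop()   // shut the context down so lingering non-daemon threads don't keep the JVM alive
        }
      }
    }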

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Koert Kuipers
I don't know anything about ARM clusters but it looks great. What are the specs? Do the nodes have no local disk at all? On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi wrote: > Hi all, > > We are a small team doing a research on low-power (and low-cost) ARM > clusters. We built a 20-node ARM

RE: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Xia, Junluan
Very cool! -Original Message- From: Chanwit Kaewkasi [mailto:chan...@gmail.com] Sent: Wednesday, March 19, 2014 10:36 AM To: user@spark.apache.org Subject: Spark enables us to process Big Data on an ARM cluster !! Hi all, We are a small team doing a research on low-power (and low-cost)

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Thanks, it worked. Very basic question: I have created a custom input format, e.g. stock. How do I refer to this class as a custom input format? I.e., where on the Linux filesystem should I keep this class? Do I need to add this jar, and if so, how? I am running the code through spark-shell. Thanks Pari On 19-Mar-2014 7:35 pm,

Re: Separating classloader management from SparkContexts

2014-03-19 Thread Punya Biswal
Hi Andrew, Thanks for pointing me to that example. My understanding of the JobServer (based on watching a demo of its UI) is that it maintains a set of spark contexts and allows people to add jars to them, but doesn't allow unloading or reloading jars within a spark context. The code in JobCache a

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Shixiong Zhu
Do you want to read the file content in the following statement? val ny_daily= sc.parallelize(List("hdfs://localhost:8020/user/user/ NYstock/NYSE_daily")) If so, you should use "textFile", e.g., val ny_daily= sc.textFile("hdfs://localhost:8020/user/user/ NYstock/NYSE_daily") "parallelize" is us

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Shixiong Zhu
The correct import statement is "import org.apache.hadoop.mapreduce.lib.input.TextInputFormat". Best Regards, Shixiong Zhu 2014-03-19 18:46 GMT+08:00 Pariksheet Barapatre : > Seems like import issue, ran with HadoopFile and it worked. Not getting > import statement for textInputFormat class loc
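
Putting the pieces together, a hedged sketch of the corrected call; the path is a hypothetical placeholder rather than the one from the original question:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat   // new-API class

    val file = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://namenode:8020/user/hue/data/sample.txt")
    file.map(_._2.toString).take(5).foreach(println)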

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
If I don't call iter(), and just return treeiterator directly, I get an error message that the object is not of an iterator type. This is in Python 2.6...perhaps a bug? BUT I also realized my code was wrong. It results in an RDD containing all the tags in all the files. What I really want is an

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Yana Kadiyska
Not sure what you mean by "not getting information how to join". If you mean that you can't see the result I believe you need to collect the result of the join on the driver, as in val joinedRdd=enKeyValuePair1.join(enKeyValuePair) joinedRdd.collect().map(prinltn) On Wed, Mar 19, 2014 at 4:57 A
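
A hedged sketch of the whole flow — read both files as text, key each record on its second tab-separated column, join, then collect to print. The second file name and column layout are illustrative assumptions:

    import org.apache.spark.SparkContext._   // pair-RDD functions on older releases

    val daily = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")
    val divs  = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends")  // hypothetical second file

    val dailyByKey = daily.map { line => val f = line.split("\t"); (f(1), line) }
    val divsByKey  = divs.map  { line => val f = line.split("\t"); (f(1), line) }

    // join is lazy; collect() brings the joined pairs back to the driver for printing.
    dailyByKey.join(divsByKey).collect().foreach(println)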

How to distribute external executable (script) with Spark ?

2014-03-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to build an evaluation platform based on Spark. The idea is to run a blackbox executable (built with C/C++ or some scripting language). This blackbox takes a set of data as input and outputs some metrics. Since I have a huge amount of data, I need to distribute the computation a
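
One shape this can take (a hedged sketch, not necessarily what the thread settled on): if the black-box executable is installed at the same path on every node, RDD.pipe streams each partition through it via stdin/stdout. The script path, input, and output locations are assumptions:

    // Assumes /opt/tools/compute_metrics is already present and executable on every worker.
    val input   = sc.textFile("hdfs:///data/records")          // hypothetical input, one record per line
    val metrics = input.pipe("/opt/tools/compute_metrics")     // external process per partition
    metrics.saveAsTextFile("hdfs:///data/metrics-out")         // hypothetical output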

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Seems like an import issue; running with hadoopFile worked. I can't find the right import statement for the TextInputFormat class in the new API. Can anybody help? Thanks Pariksheet On 19 March 2014 16:05, Bertrand Dechoux wrote: > I don't know the Spark issue but the Hadoop context is clear. > > old

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know the Spark issue but the Hadoop context is clear. old api -> org.apache.hadoop.mapred new api -> org.apache.hadoop.mapreduce You might only need to change your import. Regards Bertrand On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre wrote: > Hi, > > Trying to read HDFS fi

Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Hi, Trying to read HDFS file with TextInputFormat. scala> import org.apache.hadoop.mapred.TextInputFormat scala> import org.apache.hadoop.io.{LongWritable, Text} scala> val file2 = sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs:// 192.168.100.130:8020/user/hue/pig/examples/data/sonn

Joining two HDFS files in in Spark

2014-03-19 Thread Chhaya Vishwakarma
Hi, I want to join two files from HDFS using the spark shell. Both files are tab separated and I want to join on the second column. I tried the code below but it is not giving any output. val ny_daily= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")) val ny_daily_split = ny_daily.map(line =>l

What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread 林武康
Hi, can anyone tell me about the lifecycle of an RDD? I searched through the official website and still can't figure it out. Can I use an RDD in some stages and then destroy it to release memory, because no stages ahead will use this RDD any more? Is it possible? Thanks! Sincerely Lin

Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Sai Prasanna
Mayur, While reading a local file which is not in HDFS through the spark shell, does HDFS need to be up and running? On Tue, Mar 18, 2014 at 9:46 PM, Mayur Rustagi wrote: > Your hdfs is down. Probably forgot to format namenode. > check if namenode is running >ps -aef|grep Namenode > if n

Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
To document this, it would be nice to clarify which environment variables should be used to set which Java system properties, and what type of process they affect. I'd be happy to start a page if you can point me to the right place: SPARK_JAVA_OPTS: -Dspark.executor.memory can be set on the mach

Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
Thanks for the suggestion, Matei. I've tracked this down to a setting I had to make on the Driver. It looks like spark-env.sh has no impact on the Executor, which confused me for a long while with settings like SPARK_EXECUTOR_MEMORY. The only setting that mattered was setting the system property
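
A hedged sketch of the driver-side setting being described, in the pre-SparkConf configuration style; the value and master URL are illustrative:

    import org.apache.spark.SparkContext

    // The executor memory property must be set in the driver JVM *before* the SparkContext is created.
    System.setProperty("spark.executor.memory", "2g")              // illustrative value

    val sc = new SparkContext("spark://master:7077", "MemoryDemo") // hypothetical master URL
    println(System.getProperty("spark.executor.memory"))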

Re: Unable to read HDFS file -- SimpleApp.java

2014-03-19 Thread Prasad
Check this thread out, http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2807.html -- you have to remove conflicting akka and protobuf versions. Thanks Prasad. -- View this message in context: