Re: Spark job for demoing Spark metrics monitoring?

2015-01-21 Thread Akhil Das
I think you can easily run the Twitter popular hashtags example. You can also save the data into a DB and visualize it. Thanks Best Regards On Thu, Jan 22, 2015 at 12:37 AM, Otis

RE: spark 1.1.0 save data to hdfs failed

2015-01-21 Thread ey-chih chow
The HDFS release should be Hadoop 1.0.4. Ey-Chih Chow Date: Wed, 21 Jan 2015 16:56:25 -0800 Subject: Re: spark 1.1.0 save data to hdfs failed From: yuzhih...@gmail.com To: eyc...@hotmail.com CC: user@spark.apache.org What hdfs release are you using ? Can you check namenode log around time of err

Re: Exception in connection from worker to worker

2015-01-21 Thread Akhil Das
Can you try the following: - Use the Kryo serializer - Enable RDD compression - Repartition the data (use a hash partitioner, then all the similar keys will go to the same partition) Thanks Best Regards On Thu, Jan 22, 2015 at 4:05 AM, vantoniuk wrote: > I have temporary fix for my case. My sample fi

Re: Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread Raghavendra Pandey
You can use the Hadoop client API to remove files: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path, boolean). I don't think Spark has any wrapper around the Hadoop filesystem APIs. On Thu, Jan 22, 2015 at 12:15 PM, LinQili wrote: > Hi, all > I
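
For reference, a minimal sketch of that approach in Scala (the path is made up; this just wires the SparkContext's Hadoop configuration into the FileSystem API):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // reuse the Hadoop configuration the SparkContext already carries
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // second argument = recursive, so directories and their contents are removed too
    fs.delete(new Path("/tmp/spark-output"), true)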

Re: Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread Akhil Das
There is no direct way of doing it, but you can do something like this: val hadoopConf = ssc.sparkContext.hadoopConfiguration var hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConf) tmp_stream = ssc.textFileStream("/akhld/sigmoid/") // each line will have hdfs location to be deleted. tmp_s

Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread LinQili
Hi, all. I wonder how to delete an HDFS file/directory using the Spark API?

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Jacques Heunis
Ah I see, thanks! I was just confused because given the same configuration, I would have thought that Spark and Scikit would give more similar results, but I guess this is simply not the case (as in your example, in order to get spark to give an mse sufficiently close to scikit's you have to give i

Re: Saving a mllib model in Spark SQL

2015-01-21 Thread Divyansh Jain
Hey, Thanks Xiangrui Meng and Cheng Lian for your valuable suggestions. It works! Divyansh Jain. On Tue, January 20, 2015 2:49 pm, Xiangrui Meng wrote: > You can save the cluster centers as a SchemaRDD of two columns (id: > Int, center: Array[Double]). When you load it back, you can construct >

Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set “spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified in the job UI (port 4040) environment tab, I still get the “no heartbeat in 60 seconds” error. spark.core.connection.ack.wait.timeout=3600 15/01/22 07:29:

RE: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Bob Tiernay
Very well stated. Thanks for putting in the effort to formalize your thoughts, with which I agree entirely. How are these types of decisions traditionally made in the Spark community? Is there a formal process? What's the next step? Thanks again From: nicholas.cham...@gmail.com Date: Thu, 22 Jan 201

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform i

Re: reading a csv dynamically

2015-01-21 Thread Pankaj Narang
Yes, I think you first need to create a map that keeps the number of values in every line. Then you can group all the records with the same number of values, and you will know how many types of arrays you have. val dataRDD = sc.textFile("file.csv") val dataLengthRDD = dataRDD .map(line=>(_.sp
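
A rough sketch of that grouping idea (the code above is truncated here, so the variable names and the six-field example below are illustrative):

    val dataRDD = sc.textFile("file.csv")

    // pair each line with its field count, then group lines that share the same arity
    val byLength = dataRDD
      .map(line => (line.split(",").length, line))
      .groupByKey()

    // e.g. pull out all six-field records and parse only those
    val sixFieldLines = byLength.filter(_._1 == 6).flatMap(_._2)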

Spark 1.2 – How to change Default (Random) port ….

2015-01-21 Thread Shailesh Birari
Hello, Recently, I upgraded my setup from Spark 1.1 to Spark 1.2. I have a 4-node Ubuntu Spark cluster. With Spark 1.1, I used to write a Spark Scala program in Eclipse on my Windows development host and submit the job to the Ubuntu cluster from Eclipse (Windows machine). As on my network not all

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
What do you suggest? Should I send you the script so you can run it yourself? Yes, my broadcast variables are fairly large (1.7 MBytes). On Wed, Jan 21, 2015 at 8:20 PM, Davies Liu wrote: > Because that you have large broadcast, they need to be loaded into > Python worker for each tasks, if the

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
Because you have a large broadcast variable, it needs to be loaded into the Python worker for each task if the worker is not reused. We would really appreciate it if you could provide a short script to reproduce the freeze; then we can investigate the root cause and fix it. Also, file a JIRA for it, tha

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
I set spark.python.worker.reuse = false and now it seems to run longer than before (it has not crashed yet). However, it is very, very slow. How should I proceed? On Wed, Jan 21, 2015 at 2:21 AM, Davies Liu wrote: > Could you try to disable the new feature of reused worker by: > spark.python.worker.reu

Re: spark 1.1.0 save data to hdfs failed

2015-01-21 Thread Ted Yu
What hdfs release are you using ? Can you check namenode log around time of error below to see if there is some clue ? Cheers On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow wrote: > Hi, > > I used the following fragment of a scala program to save data to hdfs: > > contextAwareEvents > .

Announcing SF / East Bay Area Stream Processing Meetup

2015-01-21 Thread Siva Jagadeesan
Hi All, I have been running the Bay Area Storm meetup for almost 2 years. Instead of having separate meetups for Storm and Spark, I changed the Storm meetup to be a stream processing meetup where we can discuss all stream processing frameworks. http://www.meetup.com/Bay-Area-Stream-Processing/events/218816

spark 1.1.0 save data to hdfs failed

2015-01-21 Thread ey-chih chow
Hi, I used the following fragment of a scala program to save data to hdfs: contextAwareEvents .map(e => (new AvroKey(e), null)) .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir, classOf[AvroKey[GenericRecord]],
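
For reference, a complete save of this shape typically looks roughly like the following. This is a sketch only: it assumes contextAwareEvents is an RDD of GenericRecord and that eventSchema is its Avro schema, and the original job may have used a different value class or output format than the ones shown here.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._

    // new Job(conf) matches the Hadoop 1.x API mentioned in this thread
    val job = new Job(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, eventSchema)   // register the writer schema for the output

    contextAwareEvents
      .map(e => (new AvroKey[GenericRecord](e), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir,
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)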

reading a csv dynamically

2015-01-21 Thread daze5112
Hi all, I'm currently reading a csv file which has the following format: (String, Double, Double, Double, Double, Double) and can map this with no problems using: val dataRDD = sc.textFile("file.csv"). map(_.split (",")). map(a=> (Array(a(0)), Array(a(1).toDouble, a(2).toDouble), a(3), Array(

Re: How to use more executors

2015-01-21 Thread Nan Zhu
…not sure when it will be reviewed… but for now you can work around it by allowing multiple worker instances on a single machine: http://spark.apache.org/docs/latest/spark-standalone.html (search for SPARK_WORKER_INSTANCES). Best, -- Nan Zhu http://codingcat.me On Wednesday, January 21, 2015 at

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Robin East
I don’t get those results. I get: spark 0.14, scikit-learn 0.85. The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 and push iterations to 400 and you get an mse ~= 0. Of course the coefficients are both ~1 and the intercept ~0. Similarly, if you change the mll

Re: How to use more executors

2015-01-21 Thread Larry Liu
Will SPARK-1706 be included in next release? On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu wrote: > Please see SPARK-1706 > > On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu wrote: > >> I tried to submit a job with --conf "spark.cores.max=6" >> or --total-executor-cores 6 on a standalone cluster. But I

Re: How to use more executors

2015-01-21 Thread Ted Yu
Please see SPARK-1706 On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu wrote: > I tried to submit a job with --conf "spark.cores.max=6" > or --total-executor-cores 6 on a standalone cluster. But I don't see more > than 1 executor on each worker. I am wondering how to use multiple > executors when su

How to use more executors

2015-01-21 Thread Larry Liu
I tried to submit a job with --conf "spark.cores.max=6" or --total-executor-cores 6 on a standalone cluster. But I don't see more than 1 executor on each worker. I am wondering how to use multiple executors when submitting jobs. Thanks larry

Re: Exception in connection from worker to worker

2015-01-21 Thread vantoniuk
I have a temporary fix for my case. My sample file was 2 GB / 50M lines in size. My initial configuration was 1000 splits. Based on my understanding of distributed algorithms, the number of splits can affect the memory pressure in operations such as distinct and reduceByKey. So I tried to reduce the numbe

Re: Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread Zhan Zhang
You can put hive-site.xml in your conf/ directory. It will connect to Hive when HiveContext is initialized. Thanks. Zhan Zhang On Jan 21, 2015, at 12:35 PM, YaoPau wrote: > Is this possible, and if so what steps do I need to take to make this happen? > > > > > -- > View this message in
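
A minimal spark-shell sketch of what that looks like once hive-site.xml is in conf/ (the table name here is made up):

    // sc is the SparkContext provided by spark-shell
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // these queries go through the Hive metastore configured in hive-site.xml
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    hiveContext.sql("SELECT COUNT(*) FROM some_table").collect().foreach(println)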

loading utf16le file with sc.textFile

2015-01-21 Thread Nathan Stott
How can I load a utf16le file with BOM using sc.textFile? Right now I get a String that is garbled. I can't find documentation on using a different encoding when loading a text file. Any help is appreciated. - To unsubscribe, e-ma

Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread JacquesH
I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn, however it seems to me that even in simple cases, Scikit converges to an accurate model far faster than Spark does. For example I generated 1000 data points for a very simple linear function (z=x+y) with the fo

Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread YaoPau
Is this possible, and if so what steps do I need to take to make this happen? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connect-to-Hive-metastore-on-YARN-from-Spark-Shell-tp21300.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Davies Liu
On Tue, Jan 20, 2015 at 11:13 PM, Fengyun RAO wrote: > the LogParser instance is not serializable, and thus cannot be a broadcast, You could create an empty LogParser object (it's serializable), then load the data lazily in the executor. Could you add some logging to LogParser to check the behavior b
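
A minimal sketch of that lazy-loading pattern (LogParser is the thread's own class; the loadData() and parse() calls below are placeholders for however it actually initializes and is used):

    // Initialized lazily, once per executor JVM, so the heavy non-serializable
    // state never has to be shipped from the driver.
    object LogParserHolder {
      lazy val parser: LogParser = {
        val p = new LogParser()   // cheap, serializable shell
        p.loadData()              // placeholder: load the heavy state on the executor
        p
      }
    }

    // used inside a transformation:
    val parsed = logLines.map(line => LogParserHolder.parser.parse(line))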

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread pzecevic
Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could b

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Cheng Lian
Michael - I mean that although preparing and repartitioning the underlying data can't avoid the shuffle introduced by Spark SQL (Yin has explained why), it does help to reduce network IO. On 1/21/15 10:01 AM, Yin Huai wrote: Hello Michael, In Spark SQL, we have our internal concepts of Output

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Kay Ousterhout
Is it possible to re-run your job with spark.eventLog.enabled set to true, and send the resulting logs to the list? Those have more per-task information that can help diagnose this. -Kay On Wed, Jan 21, 2015 at 1:57 AM, Fengyun RAO wrote: > btw: Shuffle Write(11 GB) mean 11 GB per Executor, for eac

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Sean Owen
You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process. The number of partitions is determined by Hadoop input splits and generally makes a partition per b

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Frank Austin Nothaft
Hi Venkat/Nick, The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back from that process. Once you have the binary data loaded into an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data in the pipes to contain binary, not text),

Spark job for demoing Spark metrics monitoring?

2015-01-21 Thread Otis Gospodnetic
Hi, I'll be showing our Spark monitoring at the upcoming Spark Summit in NYC. I'd like to run some/any Spark job that really exercises Spark and makes it emit all its various metrics (so the metrics charts are full of data and not bla

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. wrote: > Hi all, > > Please help me to find out best

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Mukesh Jha
numStreams is 5 in my case. List<JavaPairDStream<byte[], byte[]>> kafkaStreams = new ArrayList<>(numStreams); for (int i = 0; i < numStreams; i++) { kafkaStreams.add(KafkaUtils.createStream(sc, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class, kafkaConf, topicMap, StorageLevel.MEMORY_ONLY_SER()))
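
The same multi-receiver-plus-union pattern in Scala, for reference (a sketch; ssc, zkQuorum, group and topicMap are assumed to be defined elsewhere, and it uses the simpler String-decoder overload of createStream rather than the byte[] decoders above):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 5
    // one receiver per stream; union them so downstream code sees a single DStream
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
    }
    val unified = ssc.union(kafkaStreams)

    // optionally spread the received data over more cores before heavy processing
    val repartitioned = unified.repartition(16)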

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This enforced RDD evaluation and helped avoid the choke point. regards Sunita Kopppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind wrote: > Hi, > > My spark jobs suddenly started getting hung and here is the debug leading

[SQL] Conflicts in inferred Json Schemas

2015-01-21 Thread Corey Nolet
Let's say I have 2 formats for json objects in the same file schema1 = { "location": "12345 My Lane" } schema2 = { "location":{"houseAddres":"1234 My Lane"} } From my tests, it looks like the current inferSchema() function will end up with only StructField("location", StringType). What would be

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Cheng Lian
Oh yes, thanks for adding that using sc.hadoopConfiguration.set also works :-) On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska wrote: > Thanks for looking Cheng. Just to clarify in case other people need this > sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata"," > false")d

RE: Spark 1.1.0 - spark-submit failed

2015-01-21 Thread ey-chih chow
Thanks for the help. I added the following dependency in my pom file and the problem went away: <dependency> <groupId>io.netty</groupId> <artifactId>netty</artifactId> <version>3.6.6.Final</version> </dependency> Ey-Chih Date: Tue, 20 Jan 2015 16:57:20 -0800 Subject: Re: Spark

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Venkat, Ankam
I am trying to solve similar problem. I am using option # 2 as suggested by Nick. I have created an RDD with sc.binaryFiles for a list of .wav files. But, I am not able to pipe it to the external programs. For example: >>> sq = sc.binaryFiles("wavfiles") <-- All .wav files stored on “wavfile

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-21 Thread Su She
Hello Sean & Akhil, I tried running the stop-all.sh script on my master and I got this message: localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). chown: changing ownership of `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation not permitted no org.apa

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
We have not met this issue, so I'm not sure whether there are bugs related to the reused worker or not. Could you provide more details about it? On Wed, Jan 21, 2015 at 2:27 AM, critikaled wrote: > I'm also facing the same issue. > is this a bug? > > > > -- > View this message in context: > http://apache-spark-us

How to delete graph checkpoints?

2015-01-21 Thread Cheuk Lam
This is a question about checkpointing on GraphX. We'd like to automate deleting checkpoint files of old graphs. The RDD class has a getCheckpointFile() function, which allows us to retrieve the checkpoint file of an old RDD and then delete it. However, I couldn't find a way to get hold of the c

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael, In Spark SQL, we have our internal concepts of Output Partitioning (representing the partitioning scheme of an operator's output) and Required Child Distribution (representing the requirement of input data distribution of an operator) for a physical operator. Let's say we have two o

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Noam Barcay
Maybe each of the file parts has many blocks? Did you try RDD.coalesce to reduce the number of partitions? It can be done with or without a data shuffle. Noam Barcay, Developer // Kenshoo
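
If the goal is just fewer partitions after loading, a coalesce along these lines is one option (a sketch; the element type MyDoc and the target count are made up):

    // without a shuffle: adjacent partitions are merged on the same executors
    val docs = sc.objectFile[MyDoc]("file:///tmp/mydir").coalesce(8)

    // with a shuffle: also rebalances the data evenly across the 8 partitions
    val rebalanced = sc.objectFile[MyDoc]("file:///tmp/mydir").coalesce(8, shuffle = true)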

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
Josh / Patrick, What do y’all think of the idea of promoting Stack Overflow as a place to ask questions over this list, as long as the questions fit SO’s guidelines ( how-to-ask , dont-ask )? The apache-spark

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Michael Davies
Hi Cheng, Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema) then Spark SQL will know that an SQL “group by” on Customer Code will not have to shuffle? But the prepared will have already shuffled so we p

[mllib] Decision Tree - prediction probabilites of label classes

2015-01-21 Thread Zsolt Tóth
Hi, I use DecisionTree for multi class classification. I can get the probability of the predicted label for every node in the decision tree from node.predict().prob(). Is it possible to retrieve or count the probability of every possible label class in the node? To be more clear: Say in Node A the

sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Wang, Ningjun (LNG-NPV)
Why does sc.objectFile(...) return an RDD with thousands of partitions? I saved an RDD to the file system using rdd.saveAsObjectFile("file:///tmp/mydir"). Note that the RDD contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions: part-0 part-2 part-4 part-6

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
Thanks for looking Cheng. Just to clarify in case other people need this sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata"," false")did work well in terms of dropping rowgroups/showing small input size. What was odd about that is that the overall time wasn't much better...but

Re: Finding most occurrences in a JSON Nested Array

2015-01-21 Thread Pankaj Narang
Send me the current code here. I will fix it and send it back to you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21295.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
Just found this in the documentation: "A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created." in this case, I assume the error I reported above is a b
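
The pattern from that quote looks roughly like this (a minimal sketch; the batch interval and stream definitions are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    var ssc = new StreamingContext(sc, Seconds(10))
    // ... define the first set of streams on ssc ...
    ssc.start()

    // stop only the streaming side, keeping the shared SparkContext alive
    ssc.stop(stopSparkContext = false)

    // build a fresh StreamingContext with the new streams on the same SparkContext
    ssc = new StreamingContext(sc, Seconds(10))
    // ... define the new streams ...
    ssc.start()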

Re: Error for first run from iPython Notebook

2015-01-21 Thread Dave
Is this the wrong list to be asking this question? I'm not even sure where to start troubleshooting. On Tue, Jan 20, 2015 at 9:48 AM, Dave wrote: > Not sure if anyone who can help has seen this. Any suggestions would be > appreciated, thanks! > > > On Mon Jan 19 2015 at 1:50:43 PM Dave wrote:

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
Yes, I have just found that. By replacing, rdd.map(_._1).distinct().foreach { case (game, category) => persist(game, category, minTime, maxTime, rdd) } with, rdd.map(_._1).distinct().collect().foreach { case (game, category) => persist(game, category, minTime, maxTime

Broadcast variable questions

2015-01-21 Thread Hao Ren
Hi, Spark 1.2.0, standalone, local mode (for testing). Here are several questions on broadcast variables: 1) Where is the broadcast variable cached on executors? In memory or on disk? I read somewhere that these variables are stored in spark.local.dir, but I can't find any info in Spark 1.2

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Sean Owen
It looks like you are trying to use the RDD in a distributed operation, which won't work. The context will be null. On Jan 21, 2015 1:50 PM, "Luis Ángel Vicente Sánchez" < langel.gro...@gmail.com> wrote: > The SparkContext is lost when I call the persist function from the sink > function, just bef

Confused about shuffle read and shuffle write

2015-01-21 Thread Darin McBeath
I have the following code in a Spark job. // Get the baseline input file(s) JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable = sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class); JavaPairRDD hsfBaselinePairRDD = hsfBaselinePairRDDReadable.mapToPair(new ConvertFr

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... It's a tricky subject since the entire ASF is run on mailing lists, hence there are so many different but equally sound ways of looking at this idea, which conflict with one another. > On Jan 21, 2015, at 7:03 AM, btiernay wrote: > > I think this is a re

Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
hi all, I have been experimenting with creating a sparkcontext -> streamingcontext -> a few streams -> starting -> stopping -> creating new streams -> starting a new (or the existing) streamingcontext with the new streams (I need to keep the existing sparkcontext alive as it would run other spark

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
The SparkContext is lost when I call the persist function from the sink function; just before the function call, everything works as intended, so I guess it's the FunctionN class serialisation that's causing the problem. I will try to embed the functionality in the sink method to verify that. 20

NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
The following functions, def sink(data: DStream[((GameID, Category), ((TimeSeriesKey, Platform), HLL))]): Unit = { data.foreachRDD { rdd => rdd.cache() val (minTime, maxTime): (Long, Long) = rdd.map { case (_, ((TimeSeriesKey(_, time), _), _)) => (time, time)

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 9:13 PM, Bob Tiernay wrote: > Maybe I'm misunderstanding something here, but couldn't this be done with > broadcast variables? I see there is the following caveat from the docs: > > "In addition, the object v should not be modified after it is broadcast > in order to ensu

Re: ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
Here is the stack trace for reference. Notice that this happens in when the job spawns a new thread. java.lang.ClassNotFoundException: com.myclass$$anonfun$8$$anonfun$9 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_71] at java.net.URLClassLoader$1.run(URLClas

RE: Closing over a var with changing value in Streaming application

2015-01-21 Thread Bob Tiernay
Maybe I'm misunderstanding something here, but couldn't this be done with broadcast variables? I see there is the following caveat from the docs: "In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e

ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
While implementing a Spark server, I realized that the thread's context classloader must be set to any dynamically loaded classloader so that ClosureCleaner can do its thing. Should the ClosureCleaner not use the classloader created by SparkContext (which has all dynamically added jars via SparkContext.addJar)

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread btiernay
I think this is a really great idea for really opening up the discussions that happen here. Also, it would be nice to know why there doesn't seem to be much interest. Maybe I'm misunderstanding some nuance of Apache projects. Cheers -- View this message in context: http://apache-spark-user-lis

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 21, 2015 at 4:53 PM, Tobias Pfeiffer wrote: > > On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das > wrote: > >> How about using accumulators >> ? >> > > As far as I understand, they solve the part of the probl

Are these numbers abnormal for spark streaming?

2015-01-21 Thread Ashic Mahtab
Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the applica

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
Hi Gerard, thanks, that makes sense. I'll try that out. Tamas On Wed, Jan 21, 2015 at 11:14 AM, Gerard Maas wrote: > Hi Tamas, > > I meant not changing the receivers, but starting/stopping the Streaming > jobs. So you would have a 'small' Streaming job for a subset of streams > that you'd conf

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
Hi Tamas, I meant not changing the receivers, but starting/stopping the Streaming jobs. So you would have a 'small' Streaming job for a subset of streams that you'd configure->start->stop on demand. I haven't tried myself yet, but I think it should also be possible to create a Streaming Job from

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
We were thinking along the same lines, that is, to fix the number of streams and change the input and output channels dynamically. But we could not make it work (it seems that the receiver does not allow any change in the config after it has started). Thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will stil

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
Thanks for the replies. Is this something we can get around? I tried to hack into the code without much success. On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai wrote: > Hi, > > I don't think current Spark Streaming supports this feature; all the > DStream lineage is fixed after the context is start

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread critikaled
I'm also facing the same issue. is this a bug? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21283.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-21 Thread Terry Hole
Here is a sample program to reproduce it, please check it: https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb We can see that the file is not included in 1.2. The file is in the customer folder, with timestamp 2015/01/21 15:41:22. On Wed, Jan 21, 2015 at 2:29 PM, Sean Owen wrote: > See a

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
btw: Shuffle Write (11 GB) means 11 GB per executor; for each task, it's ~40 MB 2015-01-21 17:53 GMT+08:00 Fengyun RAO : > I don't know how to debug distributed application, any tools or suggestion? > > but from spark web UI, > > the GC time (~0.1 s), Shuffle Write(11 GB) are similar for spark 1.1

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
I don't know how to debug a distributed application; any tools or suggestions? But from the Spark web UI, the GC time (~0.1 s) and Shuffle Write (11 GB) are similar for Spark 1.1 and 1.2. There are no Shuffle Read and Spill. The only difference is the task Duration distribution (Min / 25th percentile / Median / 75th percentile / Max)

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Gerard Maas
Hi Mukesh, How are you creating your receivers? Could you post the (relevant) code? -kr, Gerard. On Wed, Jan 21, 2015 at 9:42 AM, Mukesh Jha wrote: > Hello Guys, > > I've re partitioned my kafkaStream so that it gets evenly distributed > among the executors and the results are better. > Still

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks JaeBoo. In our case, the shuffle writes are similar. 2015-01-21 17:01 GMT+08:00 JaeBoo Jung : > I was recently faced with a similar issue, but unfortunately I could > not find out why it happened. > > Here's jira ticket https://issues.apache.org/jira/browse/SPARK-5081 of my > previous post

Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-21 Thread Vladimir Grigor
Thank you, Andrew, for your reply! I am very interested in having this feature. It is possible to run PySpark on AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923), but that kills the whole idea of running batch jobs in EMR on PySpark. Could you please (help to) create a task (wit

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread JaeBoo Jung
I was recently faced with a similar issue, but unfortunately I could not find out why it happened. Here's the JIRA ticket https://issues.apache.org/jira/browse/SPARK-5081 from my previous post. Please check your shuffle I/O differences between the two in sp

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-21 Thread Sean Owen
Singletons aren't a hack; they can be an entirely appropriate pattern for this. What exception do you get? From Spark or your code? I think this pattern is orthogonal to using Spark. On Jan 21, 2015 8:11 AM, "octavian.ganea" wrote: > In case someone has the same problem: > > The singleton hack works

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Mukesh Jha
Hello Guys, I've repartitioned my kafkaStream so that it gets evenly distributed among the executors and the results are better. Still, from the executors page it seems that only 1 executor has all 8 cores in use and the other executors are using just 1 core. Is this the correct interpretation

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Maybe you mean a different spark-submit script? We also use the same spark-submit script, and thus the same memory, cores, etc. configuration. 2015-01-21 15:45 GMT+08:00 Sean Owen : > I don't know of any reason to think the singleton pattern doesn't work or > works differently. I wonder if, for examp

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks, Paul. I don’t understand how subclassing FlatMapFunction helps; could you show some sample code? We need one instance per executor, not per partition, so mapPartitions() doesn’t help. 2015-01-21 16:07 GMT+08:00 Paul Wais : > To force one instance per executor, you could explicitly subclas

Re: Spark Streaming with Kafka

2015-01-21 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer option with Spark 1.2: https://github.com/dibbhatt/kafka-spark-consumer This consumer can recover from any underlying failure of the Spark platform or Kafka and either retry or restart the receiver. This has been working nicely for us. Regards, Dibyendu O

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks, Sean. I don't quite understand "you have *more* partitions across *more* workers". It's within the same cluster, with the same data, thus I think the same partitions and the same workers. We switched from Spark 1.1 to 1.2, and then it's 3x slower. (We upgraded from CDH 5.2.1 to CDH 5.3, hence spa

Re: RangePartitioner

2015-01-21 Thread Sandy Ryza
Hi Rishi, If you look in the Spark UI, have any executors registered? Are you able to collect a jstack of the driver process? -Sandy On Tue, Jan 20, 2015 at 9:07 PM, Rishi Yadav wrote: > I am joining two tables as below, the program stalls at below log line > and never proceeds. > What might

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
Thanks Cheng! For the list, I talked with Michael Armbrust at a recent Spark meetup and his comments were: * For a union of tables, use a view and the Hive metastore * SQLContext might have the directory-traversing logic I need in it already * The union() of sequence files I saw was slow becaus

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-21 Thread octavian.ganea
In case someone has the same problem: the singleton hack works for me sometimes and sometimes it doesn't in Spark 1.2.0; that is, sometimes I get a NullPointerException. Anyway, if you really need to work with big indexes and you want to have the smallest amount of communication between master and node

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazily create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap()) if you want one instance per partition. This approach worked well for me when
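
A sketch of the mapPartitions variant Paul mentions (LogParser construction and parse() are placeholders for the thread's own class):

    // one parser instance per partition, constructed inside the task so it is
    // never serialized from the driver
    val parsed = logLines.mapPartitions { iter =>
      val parser = new LogParser()          // expensive to build, not serializable
      iter.map(line => parser.parse(line))
    }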