Re: Spark job for demoing Spark metrics monitoring?

2015-01-21 Thread Akhil Das
I think you can easily run the Twitter popular hashtags example. You can also save the data into a DB and visualize it. Thanks Best Regards On Thu, Jan 22, 2015 at 12:37 AM, Otis

RE: spark 1.1.0 save data to hdfs failed

2015-01-21 Thread ey-chih chow
The HDFS release should be Hadoop 1.0.4. Ey-Chih Chow Date: Wed, 21 Jan 2015 16:56:25 -0800 Subject: Re: spark 1.1.0 save data to hdfs failed From: yuzhih...@gmail.com To: eyc...@hotmail.com CC: user@spark.apache.org What hdfs release are you using ? Can you check namenode log around time of err

Re: Exception in connection from worker to worker

2015-01-21 Thread Akhil Das
Can you try the following: - Use the Kryo serializer - Enable RDD compression - Repartition the data (use a hash partitioner, then all the similar keys will go to the same partition) Thanks Best Regards On Thu, Jan 22, 2015 at 4:05 AM, vantoniuk wrote: > I have temporary fix for my case. My sample fi

Re: Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread Raghavendra Pandey
You can use the Hadoop client API to remove files: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path, boolean). I don't think Spark has any wrapper around the Hadoop filesystem APIs. On Thu, Jan 22, 2015 at 12:15 PM, LinQili wrote: > Hi, all > I
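
For reference, a minimal sketch of that approach in Scala (the path is made up; this just wires the SparkContext's Hadoop configuration into the FileSystem API):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // reuse the Hadoop configuration the SparkContext already carries
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // second argument = recursive, so directories and their contents are removed too
    fs.delete(new Path("/tmp/spark-output"), true)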

Re: Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread Akhil Das
There is no direct way of doing it, but you can do something like this: val hadoopConf = ssc.sparkContext.hadoopConfiguration var hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConf) tmp_stream = ssc.textFileStream("/akhld/sigmoid/") // each line will have hdfs location to be deleted. tmp_s

Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread LinQili
Hi, all. I wonder how to delete an HDFS file/directory using the Spark API?

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Jacques Heunis
Ah I see, thanks! I was just confused because given the same configuration, I would have thought that Spark and Scikit would give more similar results, but I guess this is simply not the case (as in your example, in order to get spark to give an mse sufficiently close to scikit's you have to give i

Re: Saving a mllib model in Spark SQL

2015-01-21 Thread Divyansh Jain
Hey, Thanks Xiangrui Meng and Cheng Lian for your valuable suggestions. It works! Divyansh Jain. On Tue, January 20, 2015 2:49 pm, Xiangrui Meng wrote: > You can save the cluster centers as a SchemaRDD of two columns (id: > Int, center: Array[Double]). When you load it back, you can construct >

Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set “spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified in the job UI (port 4040) environment tab, I still get the “no heartbeat in 60 seconds” error. spark.core.connection.ack.wait.timeout=3600 15/01/22 07:29:

RE: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Bob Tiernay
Very well stated. Thanks for putting in the effort to formalize your thoughts, with which I agree entirely. How are these types of decisions traditionally made in the Spark community? Is there a formal process? What's the next step? Thanks again From: nicholas.cham...@gmail.com Date: Thu, 22 Jan 201

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
I think a few things need to be laid out clearly: 1. This mailing list is the “official” user discussion platform. That is, it is sponsored and managed by the ASF. 2. Users are free to organize independent discussion platforms focusing on Spark, and there is already one such platform i

Re: reading a csv dynamically

2015-01-21 Thread Pankaj Narang
Yes, I think you first need to create a map that keeps the number of values in every line. Then you can group all the records with the same number of values, and you will know how many types of arrays you have. val dataRDD = sc.textFile("file.csv") val dataLengthRDD = dataRDD .map(line=>(_.sp
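
A rough sketch of that grouping idea (the code above is truncated here, so the variable names and the six-field example below are illustrative):

    val dataRDD = sc.textFile("file.csv")

    // pair each line with its field count, then group lines that share the same arity
    val byLength = dataRDD
      .map(line => (line.split(",").length, line))
      .groupByKey()

    // e.g. pull out all six-field records and parse only those
    val sixFieldLines = byLength.filter(_._1 == 6).flatMap(_._2)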

Spark 1.2 – How to change Default (Random) port ….

2015-01-21 Thread Shailesh Birari
Hello, Recently, I upgraded my setup from Spark 1.1 to Spark 1.2. I have a 4-node Ubuntu Spark cluster. With Spark 1.1, I used to write a Spark Scala program in Eclipse on my Windows development host and submit the job to the Ubuntu cluster from Eclipse (Windows machine). As on my network not all

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
What do you suggest? Should I send you the script so you can run it yourself? Yes, my broadcast variables are fairly large (1.7 MBytes). On Wed, Jan 21, 2015 at 8:20 PM, Davies Liu wrote: > Because that you have large broadcast, they need to be loaded into > Python worker for each tasks, if the

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
Because you have a large broadcast variable, it needs to be loaded into the Python worker for each task if the worker is not reused. We would really appreciate it if you could provide a short script to reproduce the freeze; then we can investigate the root cause and fix it. Also, file a JIRA for it, tha

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
I set spark.python.worker.reuse = false and now it seems to run longer than before (it has not crashed yet). However, it is very, very slow. How should I proceed? On Wed, Jan 21, 2015 at 2:21 AM, Davies Liu wrote: > Could you try to disable the new feature of reused worker by: > spark.python.worker.reu

Re: spark 1.1.0 save data to hdfs failed

2015-01-21 Thread Ted Yu
What hdfs release are you using ? Can you check namenode log around time of error below to see if there is some clue ? Cheers On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow wrote: > Hi, > > I used the following fragment of a scala program to save data to hdfs: > > contextAwareEvents > .

Announcing SF / East Bay Area Stream Processing Meetup

2015-01-21 Thread Siva Jagadeesan
Hi All, I have been running the Bay Area Storm meetup for almost 2 years. Instead of having separate meetups for Storm and Spark, I changed the Storm meetup to be a stream processing meetup where we can discuss all stream processing frameworks. http://www.meetup.com/Bay-Area-Stream-Processing/events/218816

spark 1.1.0 save data to hdfs failed

2015-01-21 Thread ey-chih chow
Hi, I used the following fragment of a scala program to save data to hdfs: contextAwareEvents .map(e => (new AvroKey(e), null)) .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir, classOf[AvroKey[GenericRecord]],
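
For reference, a complete save of this shape typically looks roughly like the following. This is a sketch only: it assumes contextAwareEvents is an RDD of GenericRecord and that eventSchema is its Avro schema, and the original job may have used a different value class or output format than the ones shown here.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._

    // new Job(conf) matches the Hadoop 1.x API mentioned in this thread
    val job = new Job(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, eventSchema)   // register the writer schema for the output

    contextAwareEvents
      .map(e => (new AvroKey[GenericRecord](e), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir,
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)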

reading a csv dynamically

2015-01-21 Thread daze5112
Hi all, I'm currently reading a csv file which has the following format: (String, Double, Double, Double, Double, Double) and can map this with no problems using: val dataRDD = sc.textFile("file.csv"). map(_.split (",")). map(a=> (Array(a(0)), Array(a(1).toDouble, a(2).toDouble), a(3), Array(

Re: How to use more executors

2015-01-21 Thread Nan Zhu
…not sure when it will be reviewed… but for now you can work around it by allowing multiple worker instances on a single machine: http://spark.apache.org/docs/latest/spark-standalone.html (search for SPARK_WORKER_INSTANCES). Best, -- Nan Zhu http://codingcat.me On Wednesday, January 21, 2015 at

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Robin East
I don’t get those results. I get: spark 0.14, scikit-learn 0.85. The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 and push iterations to 400 and you get an mse ~= 0. Of course the coefficients are both ~1 and the intercept ~0. Similarly, if you change the mll

Re: How to use more executors

2015-01-21 Thread Larry Liu
Will SPARK-1706 be included in next release? On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu wrote: > Please see SPARK-1706 > > On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu wrote: > >> I tried to submit a job with --conf "spark.cores.max=6" >> or --total-executor-cores 6 on a standalone cluster. But I

Re: How to use more executors

2015-01-21 Thread Ted Yu
Please see SPARK-1706 On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu wrote: > I tried to submit a job with --conf "spark.cores.max=6" > or --total-executor-cores 6 on a standalone cluster. But I don't see more > than 1 executor on each worker. I am wondering how to use multiple > executors when su

How to use more executors

2015-01-21 Thread Larry Liu
I tried to submit a job with --conf "spark.cores.max=6" or --total-executor-cores 6 on a standalone cluster. But I don't see more than 1 executor on each worker. I am wondering how to use multiple executors when submitting jobs. Thanks larry

Re: Exception in connection from worker to worker

2015-01-21 Thread vantoniuk
I have a temporary fix for my case. My sample file was 2 GB / 50M lines in size. My initial configuration was 1000 splits. Based on my understanding of distributed algorithms, the number of splits can affect the memory pressure in operations such as distinct and reduceByKey. So I tried to reduce the numbe

Re: Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread Zhan Zhang
You can put hive-site.xml in your conf/ directory. It will connect to Hive when HiveContext is initialized. Thanks. Zhan Zhang On Jan 21, 2015, at 12:35 PM, YaoPau wrote: > Is this possible, and if so what steps do I need to take to make this happen? > > > > > -- > View this message in
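
A minimal spark-shell sketch of what that looks like once hive-site.xml is in conf/ (the table name here is made up):

    // sc is the SparkContext provided by spark-shell
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // these queries go through the Hive metastore configured in hive-site.xml
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    hiveContext.sql("SELECT COUNT(*) FROM some_table").collect().foreach(println)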

loading utf16le file with sc.textFile

2015-01-21 Thread Nathan Stott
How can I load a utf16le file with BOM using sc.textFile? Right now I get a String that is garbled. I can't find documentation on using a different encoding when loading a text file. Any help is appreciated. - To unsubscribe, e-ma

Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread JacquesH
I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn, however it seems to me that even in simple cases, Scikit converges to an accurate model far faster than Spark does. For example I generated 1000 data points for a very simple linear function (z=x+y) with the fo

Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread YaoPau
Is this possible, and if so what steps do I need to take to make this happen? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connect-to-Hive-metastore-on-YARN-from-Spark-Shell-tp21300.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Davies Liu
On Tue, Jan 20, 2015 at 11:13 PM, Fengyun RAO wrote: > the LogParser instance is not serializable, and thus cannot be a broadcast, You could create an empty LogParser object (it's serializable), then load the data lazily in the executor. Could you add some logging to LogParser to check the behavior b
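
A minimal sketch of that lazy-loading pattern (LogParser is the thread's own class; the loadData() and parse() calls below are placeholders for however it actually initializes and is used):

    // Initialized lazily, once per executor JVM, so the heavy non-serializable
    // state never has to be shipped from the driver.
    object LogParserHolder {
      lazy val parser: LogParser = {
        val p = new LogParser()   // cheap, serializable shell
        p.loadData()              // placeholder: load the heavy state on the executor
        p
      }
    }

    // used inside a transformation:
    val parsed = logLines.map(line => LogParserHolder.parser.parse(line))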

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread pzecevic
Hi, I tried to find the last reply by Nick Chammas (that I received in the digest) using the Nabble web interface, but I cannot find it (perhaps he didn't reply directly to the user list?). That's one example of Nabble's usability. Anyhow, I wanted to add my two cents... Apache user group could b

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Cheng Lian
Michael - I mean that although preparing and repartitioning the underlying data can't avoid the shuffle introduced by Spark SQL (Yin has explained why), it does help to reduce network IO. On 1/21/15 10:01 AM, Yin Huai wrote: Hello Michael, In Spark SQL, we have our internal concepts of Output

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Kay Ousterhout
Is it possible to re-run your job with spark.eventLog.enabled set to true, and send the resulting logs to the list? Those have more per-task information that can help diagnose this. -Kay On Wed, Jan 21, 2015 at 1:57 AM, Fengyun RAO wrote: > btw: Shuffle Write(11 GB) mean 11 GB per Executor, for eac

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Sean Owen
You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process. The number of partitions is determined by Hadoop input splits and generally makes a partition per b

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Frank Austin Nothaft
Hi Venkat/Nick, The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back from that process. Once you have the binary data loaded into an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data in the pipes to contain binary, not text),

Spark job for demoing Spark metrics monitoring?

2015-01-21 Thread Otis Gospodnetic
Hi, I'll be showing our Spark monitoring at the upcoming Spark Summit in NYC. I'd like to run some/any Spark job that really exercises Spark and makes it emit all its various metrics (so the metrics charts are full of data and not bla

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. wrote: > Hi all, > > Please help me to find out best

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Mukesh Jha
numStreams is 5 in my case. List<JavaPairDStream<byte[], byte[]>> kafkaStreams = new ArrayList<>(numStreams); for (int i = 0; i < numStreams; i++) { kafkaStreams.add(KafkaUtils.createStream(sc, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class, kafkaConf, topicMap, StorageLevel.MEMORY_ONLY_SER()))
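
The same multi-receiver-plus-union pattern in Scala, for reference (a sketch; ssc, zkQuorum, group and topicMap are assumed to be defined elsewhere, and it uses the simpler String-decoder overload of createStream rather than the byte[] decoders above):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 5
    // one receiver per stream; union them so downstream code sees a single DStream
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_ONLY_SER)
    }
    val unified = ssc.union(kafkaStreams)

    // optionally spread the received data over more cores before heavy processing
    val repartitioned = unified.repartition(16)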

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This enforced RDD evaluation and helped avoid the choke point. regards Sunita Kopppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind wrote: > Hi, > > My spark jobs suddenly started getting hung and here is the debug leading

[SQL] Conflicts in inferred Json Schemas

2015-01-21 Thread Corey Nolet
Let's say I have 2 formats for json objects in the same file schema1 = { "location": "12345 My Lane" } schema2 = { "location":{"houseAddres":"1234 My Lane"} } From my tests, it looks like the current inferSchema() function will end up with only StructField("location", StringType). What would be

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Cheng Lian
Oh yes, thanks for adding that using sc.hadoopConfiguration.set also works :-) On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska wrote: > Thanks for looking Cheng. Just to clarify in case other people need this > sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata"," > false")d

RE: Spark 1.1.0 - spark-submit failed

2015-01-21 Thread ey-chih chow
Thanks for the help. I added the following dependency in my pom file and the problem went away: <dependency> <groupId>io.netty</groupId> <artifactId>netty</artifactId> <version>3.6.6.Final</version> </dependency> Ey-Chih Date: Tue, 20 Jan 2015 16:57:20 -0800 Subject: Re: Spark

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Venkat, Ankam
I am trying to solve similar problem. I am using option # 2 as suggested by Nick. I have created an RDD with sc.binaryFiles for a list of .wav files. But, I am not able to pipe it to the external programs. For example: >>> sq = sc.binaryFiles("wavfiles") <-- All .wav files stored on “wavfile

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-21 Thread Su She
Hello Sean & Akhil, I tried running the stop-all.sh script on my master and I got this message: localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). chown: changing ownership of `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation not permitted no org.apa

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
We have not met this issue, so I'm not sure whether there are bugs related to the reused worker or not. Could you provide more details about it? On Wed, Jan 21, 2015 at 2:27 AM, critikaled wrote: > I'm also facing the same issue. > is this a bug? > > > > -- > View this message in context: > http://apache-spark-us

How to delete graph checkpoints?

2015-01-21 Thread Cheuk Lam
This is a question about checkpointing on GraphX. We'd like to automate deleting checkpoint files of old graphs. The RDD class has a getCheckpointFile() function, which allows us to retrieve the checkpoint file of an old RDD and then delete it. However, I couldn't find a way to get hold of the c

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael, In Spark SQL, we have our internal concepts of Output Partitioning (representing the partitioning scheme of an operator's output) and Required Child Distribution (representing the requirement of input data distribution of an operator) for a physical operator. Let's say we have two o

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Noam Barcay
Maybe each of the file parts has many blocks? Did you try RDD.coalesce to reduce the number of partitions? It can be done with or without a data shuffle. Noam Barcay, Developer // Kenshoo
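
If the goal is just fewer partitions after loading, a coalesce along these lines is one option (a sketch; the element type MyDoc and the target count are made up):

    // without a shuffle: adjacent partitions are merged on the same executors
    val docs = sc.objectFile[MyDoc]("file:///tmp/mydir").coalesce(8)

    // with a shuffle: also rebalances the data evenly across the 8 partitions
    val rebalanced = sc.objectFile[MyDoc]("file:///tmp/mydir").coalesce(8, shuffle = true)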

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
Josh / Patrick, What do y’all think of the idea of promoting Stack Overflow as a place to ask questions over this list, as long as the questions fit SO’s guidelines ( how-to-ask , dont-ask )? The apache-spark

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Michael Davies
Hi Cheng, Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema) then Spark SQL will know that an SQL “group by” on Customer Code will not have to shuffle? But the prepared will have already shuffled so we p

[mllib] Decision Tree - prediction probabilites of label classes

2015-01-21 Thread Zsolt Tóth
Hi, I use DecisionTree for multi class classification. I can get the probability of the predicted label for every node in the decision tree from node.predict().prob(). Is it possible to retrieve or count the probability of every possible label class in the node? To be more clear: Say in Node A the

sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Wang, Ningjun (LNG-NPV)
Why does sc.objectFile(...) return an RDD with thousands of partitions? I saved an RDD to the file system using rdd.saveAsObjectFile("file:///tmp/mydir"). Note that the RDD contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions: part-0 part-2 part-4 part-6

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
Thanks for looking Cheng. Just to clarify in case other people need this sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata"," false")did work well in terms of dropping rowgroups/showing small input size. What was odd about that is that the overall time wasn't much better...but

Re: Finding most occurrences in a JSON Nested Array

2015-01-21 Thread Pankaj Narang
Send me the current code here. I will fix it and send it back to you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21295.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
Just found this in the documentation: "A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created." in this case, I assume the error I reported above is a b
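
The pattern from that quote looks roughly like this (a minimal sketch; the batch interval and stream definitions are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    var ssc = new StreamingContext(sc, Seconds(10))
    // ... define the first set of streams on ssc ...
    ssc.start()

    // stop only the streaming side, keeping the shared SparkContext alive
    ssc.stop(stopSparkContext = false)

    // build a fresh StreamingContext with the new streams on the same SparkContext
    ssc = new StreamingContext(sc, Seconds(10))
    // ... define the new streams ...
    ssc.start()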

Re: Error for first run from iPython Notebook

2015-01-21 Thread Dave
Is this the wrong list to be asking this question? I'm not even sure where to start troubleshooting. On Tue, Jan 20, 2015 at 9:48 AM, Dave wrote: > Not sure if anyone who can help has seen this. Any suggestions would be > appreciated, thanks! > > > On Mon Jan 19 2015 at 1:50:43 PM Dave wrote:

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
Yes, I have just found that. By replacing, rdd.map(_._1).distinct().foreach { case (game, category) => persist(game, category, minTime, maxTime, rdd) } with, rdd.map(_._1).distinct().collect().foreach { case (game, category) => persist(game, category, minTime, maxTime

Broadcast variable questions

2015-01-21 Thread Hao Ren
Hi, Spark 1.2.0, standalone, local mode (for testing). Here are several questions on broadcast variables: 1) Where is the broadcast variable cached on executors? In memory or on disk? I read somewhere that these variables are stored in spark.local.dir, but I can't find any info in Spark 1.2

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Sean Owen
It looks like you are trying to use the RDD in a distributed operation, which won't work. The context will be null. On Jan 21, 2015 1:50 PM, "Luis Ángel Vicente Sánchez" < langel.gro...@gmail.com> wrote: > The SparkContext is lost when I call the persist function from the sink > function, just bef

Confused about shuffle read and shuffle write

2015-01-21 Thread Darin McBeath
I have the following code in a Spark job. // Get the baseline input file(s) JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable = sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class); JavaPairRDD hsfBaselinePairRDD = hsfBaselinePairRDDReadable.mapToPair(new ConvertFr

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... It's a tricky subject since the entire ASF is run on mailing lists, hence there are so many different but equally sound ways of looking at this idea, which conflict with one another. > On Jan 21, 2015, at 7:03 AM, btiernay wrote: > > I think this is a re

Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
hi all, I have been experimenting with creating a sparkcontext -> streamingcontext -> a few streams -> starting -> stopping -> creating new streams -> starting a new (or the existing) streamingcontext with the new streams (I need to keep the existing sparkcontext alive as it would run other spark

Re: NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
The SparkContext is lost when I call the persist function from the sink function; just before the function call, everything works as intended, so I guess it's the FunctionN class serialisation that's causing the problem. I will try to embed the functionality in the sink method to verify that. 20

NullPointer when access rdd.sparkContext (Spark 1.1.1)

2015-01-21 Thread Luis Ángel Vicente Sánchez
The following functions, def sink(data: DStream[((GameID, Category), ((TimeSeriesKey, Platform), HLL))]): Unit = { data.foreachRDD { rdd => rdd.cache() val (minTime, maxTime): (Long, Long) = rdd.map { case (_, ((TimeSeriesKey(_, time), _), _)) => (time, time)

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 9:13 PM, Bob Tiernay wrote: > Maybe I'm misunderstanding something here, but couldn't this be done with > broadcast variables? I see there is the following caveat from the docs: > > "In addition, the object v should not be modified after it is broadcast > in order to ensu

Re: ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
Here is the stack trace for reference. Notice that this happens in when the job spawns a new thread. java.lang.ClassNotFoundException: com.myclass$$anonfun$8$$anonfun$9 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_71] at java.net.URLClassLoader$1.run(URLClas

RE: Closing over a var with changing value in Streaming application

2015-01-21 Thread Bob Tiernay
Maybe I'm misunderstanding something here, but couldn't this be done with broadcast variables? I see there is the following caveat from the docs: "In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e

ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
While implementing a Spark server, I realized that the thread's context classloader must be set to any dynamically loaded classloader so that ClosureCleaner can do its thing. Should the ClosureCleaner not use the classloader created by SparkContext (which has all dynamically added jars via SparkContext.addJar)

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread btiernay
I think this is a really great idea for really opening up the discussions that happen here. Also, it would be nice to know why there doesn't seem to be much interest. Maybe I'm misunderstanding some nuance of Apache projects. Cheers -- View this message in context: http://apache-spark-user-lis

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 21, 2015 at 4:53 PM, Tobias Pfeiffer wrote: > > On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das > wrote: > >> How about using accumulators >> ? >> > > As far as I understand, they solve the part of the probl

Are these numbers abnormal for spark streaming?

2015-01-21 Thread Ashic Mahtab
Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the applica

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
Hi Gerard, thanks, that makes sense. I'll try that out. Tamas On Wed, Jan 21, 2015 at 11:14 AM, Gerard Maas wrote: > Hi Tamas, > > I meant not changing the receivers, but starting/stopping the Streaming > jobs. So you would have a 'small' Streaming job for a subset of streams > that you'd conf

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
Hi Tamas, I meant not changing the receivers, but starting/stopping the Streaming jobs. So you would have a 'small' Streaming job for a subset of streams that you'd configure->start->stop on demand. I haven't tried myself yet, but I think it should also be possible to create a Streaming Job from

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
We were thinking along the same lines, that is, to fix the number of streams and change the input and output channels dynamically. But we could not make it work (it seems that the receiver does not allow any change in the config after it has started). Thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will stil

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
Thanks for the replies. Is this something we can get around? I tried to hack into the code without much success. On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai wrote: > Hi, > > I don't think current Spark Streaming supports this feature; all the > DStream lineage is fixed after the context is start

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread critikaled
I'm also facing the same issue. is this a bug? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21283.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: [Spark Streaming] The FileInputDStream newFilesOnly=false does not work in 1.2 since

2015-01-21 Thread Terry Hole
Here is a sample program to reproduce it, please check it: https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb We can see that the file is not included in 1.2. The file is in the customer folder, with timestamp 2015/01/21 15:41:22. On Wed, Jan 21, 2015 at 2:29 PM, Sean Owen wrote: > See a

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
btw: Shuffle Write (11 GB) means 11 GB per executor; for each task, it's ~40 MB 2015-01-21 17:53 GMT+08:00 Fengyun RAO : > I don't know how to debug distributed application, any tools or suggestion? > > but from spark web UI, > > the GC time (~0.1 s), Shuffle Write(11 GB) are similar for spark 1.1

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
I don't know how to debug a distributed application; any tools or suggestions? But from the Spark web UI, the GC time (~0.1 s) and Shuffle Write (11 GB) are similar for Spark 1.1 and 1.2. There are no Shuffle Read and Spill. The only difference is the task Duration distribution (Min / 25th percentile / Median / 75th percentile / Max)

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Gerard Maas
Hi Mukesh, How are you creating your receivers? Could you post the (relevant) code? -kr, Gerard. On Wed, Jan 21, 2015 at 9:42 AM, Mukesh Jha wrote: > Hello Guys, > > I've re partitioned my kafkaStream so that it gets evenly distributed > among the executors and the results are better. > Still

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks JaeBoo. In our case, the shuffle writes are similar. 2015-01-21 17:01 GMT+08:00 JaeBoo Jung : > I was recently faced with a similar issue, but unfortunately I could > not find out why it happened. > > Here's jira ticket https://issues.apache.org/jira/browse/SPARK-5081 of my > previous post

Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-21 Thread Vladimir Grigor
Thank you, Andrew, for your reply! I am very interested in having this feature. It is possible to run PySpark on AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923), but that kills the whole idea of running batch jobs in EMR on PySpark. Could you please (help to) create a task (wit

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread JaeBoo Jung
I was recently faced with a similar issue, but unfortunately I could not find out why it happened. Here's the JIRA ticket https://issues.apache.org/jira/browse/SPARK-5081 from my previous post. Please check your shuffle I/O differences between the two in sp

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-21 Thread Sean Owen
Singletons aren't a hack; they can be an entirely appropriate pattern for this. What exception do you get? From Spark or your code? I think this pattern is orthogonal to using Spark. On Jan 21, 2015 8:11 AM, "octavian.ganea" wrote: > In case someone has the same problem: > > The singleton hack works

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Mukesh Jha
Hello Guys, I've repartitioned my kafkaStream so that it gets evenly distributed among the executors and the results are better. Still, from the executors page it seems that only 1 executor has all 8 cores in use and the other executors are using just 1 core. Is this the correct interpretation

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Maybe you mean a different spark-submit script? We also use the same spark-submit script, and thus the same memory, cores, etc. configuration. 2015-01-21 15:45 GMT+08:00 Sean Owen : > I don't know of any reason to think the singleton pattern doesn't work or > works differently. I wonder if, for examp

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks, Paul. I don’t understand how subclassing FlatMapFunction helps; could you show some sample code? We need one instance per executor, not per partition, so mapPartitions() doesn’t help. 2015-01-21 16:07 GMT+08:00 Paul Wais : > To force one instance per executor, you could explicitly subclas

Re: Spark Streaming with Kafka

2015-01-21 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer option with Spark 1.2: https://github.com/dibbhatt/kafka-spark-consumer This consumer can recover from any underlying failure of the Spark platform or Kafka and either retry or restart the receiver. This has been working nicely for us. Regards, Dibyendu O

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Fengyun RAO
Thanks, Sean. I don't quite understand "you have *more* partitions across *more* workers". It's within the same cluster, with the same data, thus I think the same partitions and the same workers. We switched from Spark 1.1 to 1.2, and then it's 3x slower. (We upgraded from CDH 5.2.1 to CDH 5.3, hence spa

Re: RangePartitioner

2015-01-21 Thread Sandy Ryza
Hi Rishi, If you look in the Spark UI, have any executors registered? Are you able to collect a jstack of the driver process? -Sandy On Tue, Jan 20, 2015 at 9:07 PM, Rishi Yadav wrote: > I am joining two tables as below, the program stalls at below log line > and never proceeds. > What might

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
Thanks Cheng! For the list, I talked with Michael Armbrust at a recent Spark meetup and his comments were: * For a union of tables, use a view and the Hive metastore * SQLContext might have the directory-traversing logic I need in it already * The union() of sequence files I saw was slow becaus

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-21 Thread octavian.ganea
In case someone has the same problem: the singleton hack works for me sometimes and sometimes it doesn't in Spark 1.2.0; that is, sometimes I get a NullPointerException. Anyway, if you really need to work with big indexes and you want to have the smallest amount of communication between master and node

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazily create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap()) if you want one instance per partition. This approach worked well for me when
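
A sketch of the mapPartitions variant Paul mentions (LogParser construction and parse() are placeholders for the thread's own class):

    // one parser instance per partition, constructed inside the task so it is
    // never serialized from the driver
    val parsed = logLines.mapPartitions { iter =>
      val parser = new LogParser()          // expensive to build, not serializable
      iter.map(line => parser.parse(line))
    }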