change default storage level

2015-07-09 Thread Michal Čizmazia
Is there a way to change the default storage level? If not, how can I properly change the storage level wherever necessary, if my input and intermediate results do not fit into memory? In this example: context.wholeTextFiles(...) .flatMap(s => ...) .flatMap(s => ...) Does persist()

Re: Spark Mesos task rescheduling

2015-07-09 Thread Iulian Dragoș
On Thu, Jul 9, 2015 at 12:32 PM, besil sbernardine...@beintoo.com wrote: Hi, We are experiencing scheduling errors due to a Mesos slave failing. It seems to be an open bug; more information can be found here: https://issues.apache.org/jira/browse/SPARK-3289 According to this link

GraphX Synth Benchmark

2015-07-09 Thread AshutoshRaghuvanshi
I am running a Spark cluster over SSH in standalone mode. I have run the PageRank LiveJournal example: MASTER=spark://172.17.27.12:7077 bin/run-example graphx.SynthBenchmark -app=pagerank -niters=100 -nverts=4847571 Output/soc-liveJounral.txt It's been running for more than 2 hours; I guess this is

Accessing Spark Web UI from another place than where the job actually ran

2015-07-09 Thread rroxanaioana
I have a Spark cluster with 1 master and 9 nodes. I am running in standalone mode. I do not have access to a web browser from any of the nodes in the cluster (I am connecting to the nodes through ssh -- it is a grid5000 cluster). I was wondering, is there any possibility to access the Spark Web UI in this

[SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
Hi folks, I just re-wrote a query from using UNION ALL to use WITH ROLLUP and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here is my code: case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i,

Re: Some BlockManager Doubts

2015-07-09 Thread Shixiong Zhu
MEMORY_AND_DISK will use the disk if there is not enough memory. If there is not enough memory when putting a MEMORY_AND_DISK block, BlockManager will store it to disk. And if a MEMORY_AND_DISK block is dropped from memory, MemoryStore will call BlockManager.dropFromMemory to store it to disk, see
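A minimal, illustrative sketch (not from the thread) of how a storage level is chosen for receiver-generated blocks in Spark Streaming; the host/port values are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("block-storage-demo")
val ssc = new StreamingContext(conf, Seconds(10))

// With MEMORY_ONLY, a receiver block that does not fit in memory cannot be
// spilled to disk; MEMORY_AND_DISK_SER_2 allows spilling and keeps a replica.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

lines.count().print()
ssc.start()
ssc.awaitTermination()
```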

Re: Spark Mesos task rescheduling

2015-07-09 Thread Silvio Bernardinello
Hi, Thank you for confirming my doubts and for the link. Yes, we actually run in fine-grained mode because we would like to dynamically resize our cluster as needed (thank you for the dynamic allocation link). However, we tried coarse-grained mode and Mesos seems not to relaunch the failed task.

Re: change default storage level

2015-07-09 Thread Shixiong Zhu
Spark won't store RDDs in memory unless you use a memory StorageLevel. By default, your input and intermediate results won't be put into memory. You can call persist if you want to avoid duplicate computation or reading. E.g., val r1 = context.wholeTextFiles(...) val r2 = r1.flatMap(s => ...) val
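A minimal sketch of the persist() pattern the reply describes, assuming sc is an existing SparkContext and the input path is a placeholder:

```scala
import org.apache.spark.storage.StorageLevel

val r1 = sc.wholeTextFiles("hdfs:///data/input")            // placeholder path
val r2 = r1.flatMap { case (_, text) => text.split("\\s+") }

// Keep r2 in memory, spilling to disk when it does not fit, so the two
// downstream transformations do not both recompute it from the input files.
r2.persist(StorageLevel.MEMORY_AND_DISK)

val lengths = r2.map(_.length)
val upper   = r2.map(_.toUpperCase)
```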

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Thanks Cody, so does that mean that if the old Kafka consumer 0.8.1.1 works with Kafka cluster version 0.8.2, then Spark Streaming 1.3 should also work? I have tested the standalone Kafka consumer 0.8.0 with Kafka cluster 0.8.2 and that works. On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger

orderBy + cache is invoking work on mesos cluster

2015-07-09 Thread Corey Stubbs
Spark Version: 1.3.1 Cluster: Mesos 0.22.0 Scala Version: 2.10.4 I am seeing work done on my cluster when invoking cache on an RDD. I would have expected the last line of the code below not to invoke any cluster work. Is there some condition under which cache will do cluster work? val sqlContext =

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread ayan guha
Can you please post the result of show()? On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I just re-wrote a query from using UNION ALL to use WITH ROLLUP and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
Hi Ashish, Your 00-pyspark-setup file looks very different from mine (and from the one described in the blog post). Questions: 1) Do you have SPARK_HOME set up in your environment? Because if not, it sets it to None in your code. You should provide the path to your Spark installation. In my case

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
+---+---+---+
|cnt|_c1|grp|
+---+---+---+
|  1| 31|  0|
|  1| 31|  1|
|  1|  4|  0|
|  1|  4|  1|
|  1| 42|  0|
|  1| 42|  1|
|  1| 15|  0|
|  1| 15|  1|
|  1| 26|  0|
|  1| 26|  1|
|  1| 37|  0|
|  1| 10|  0|
|  1| 37|  1|
|  1| 10|  1|
|  1| 48|  0|
|  1| 21|  0|
|  1| 48|  1|
|  1| 21|  1|

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
Yes, it should work, let us know if not. On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks Cody, so does that mean that if the old Kafka consumer 0.8.1.1 works with Kafka cluster version 0.8.2, then Spark Streaming 1.3 should also work? I have tested the standalone

Re: GraphX Synth Benchmark

2015-07-09 Thread Khaled Ammar
Hi, I am not a Spark expert but I found that passing a small value for the number of partitions might help. Try using this option: --numEPart=$partitions, where partitions=3 (number of workers) or at most 3*40 (total number of cores). Thanks, -Khaled On Thu, Jul 9, 2015 at 11:37 AM, AshutoshRaghuvanshi

spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Does Spark Streaming 1.3 require Kafka version 0.8.1.1, and is it not compatible with Kafka 0.8.2? As per the Maven dependencies of Spark Streaming 1.3 with Kafka: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.10</artifactId> <version>1.3.0

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
It's the consumer version. Should work with 0.8.2 clusters. On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora shushantaror...@gmail.com wrote: Does Spark Streaming 1.3 require Kafka version 0.8.1.1, and is it not compatible with Kafka 0.8.2? As per the Maven dependencies of Spark Streaming 1.3 with

What is a best practice for passing environment variables to Spark workers?

2015-07-09 Thread dgoldenberg
I have about 20 environment variables to pass to my Spark workers. Even though they're in the init scripts on the Linux box, the workers don't see these variables. Does Spark do something to shield itself from what may be defined in the environment? I see multiple pieces of info on how to pass

How to ignore features in mllib

2015-07-09 Thread Arun Luthra
Is it possible to ignore features in mllib? In other words, I would like to have some 'pass-through' data, Strings for example, attached to training examples and test data. A related stackoverflow question:

Scheduler delay vs. Getting result time

2015-07-09 Thread hbogert
Hi, In the Spark UI, under “Show additional metrics”, there are two extra metrics you can show: 1. Scheduler delay and 2. Getting result time. When hovering over “Scheduler Delay” it says (among other things): …time to send task result from executor… When hovering over “Getting result time”: Time that the

What is faster for SQL table storage, On-Heap or off-heap?

2015-07-09 Thread Brandon White
Is the read / aggregate performance better when caching Spark SQL tables on-heap with sqlContext.cacheTable() or off heap by saving it to Tachyon? Has anybody tested this? Any theories?

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
1. There will be a long-running job with description start(), as that is the job that is running the receivers. It will never end. 2. You need to set the number of cores given to the Spark executors by the YARN container. That is SparkConf spark.executor.cores, or --executor-cores in spark-submit.
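A minimal sketch, assuming a YARN deployment, of the executor-core settings the reply refers to; the values are placeholders:

```scala
import org.apache.spark.SparkConf

// Each receiver permanently occupies one core, so executors need spare
// cores for the actual processing tasks.
val conf = new SparkConf()
  .setAppName("streaming-on-yarn")
  .set("spark.executor.cores", "4")      // cores per executor (--executor-cores)
  .set("spark.executor.instances", "2")  // number of executors on YARN (--num-executors)
```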

Re: How to ignore features in mllib

2015-07-09 Thread Burak Yavuz
If you use the Pipelines API with DataFrames, you select which columns you would like to train on using the VectorAssembler. While using the VectorAssembler, you can choose not to select some features if you like. Best, Burak On Thu, Jul 9, 2015 at 10:38 AM, Arun Luthra arun.lut...@gmail.com
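A minimal sketch of that VectorAssembler approach, assuming an existing DataFrame df with hypothetical columns id (a pass-through String), f1 and f2 (numeric features), and label:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Only f1 and f2 go into the feature vector; id and label are carried
// along untouched and are simply ignored by the estimator.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val assembled = assembler.transform(df)
```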

Re: Remote spark-submit not working with YARN

2015-07-09 Thread Juan Gordon
Hi, I checked the logs and it looks like YARN is trying to communicate with my test server through the local IP (the Spark cluster and my test server are in different VPCs in Amazon EC2) and that's why YARN can't respond. I tried the same script in yarn-cluster mode and it runs correctly that way.

Re: spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread Nicholas Chammas
No plans to change that at the moment, but agreed it is against accepted convention. It would be a lot of work to change the tool, change the AMIs, and test everything. My suggestion is not to hold your breath for such a change. spark-ec2, as far as I understand, is not intended for spinning up

Caching in spark

2015-07-09 Thread vinod kumar
Hi Guys, Can anyone please share how to use the caching feature of Spark via Spark SQL queries? -Vinod
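A minimal sketch of the two usual ways to cache a table from Spark SQL, assuming a table or temp table named "people" is already registered on an existing sqlContext:

```scala
// Programmatic API: cache / uncache a registered table by name.
sqlContext.cacheTable("people")
sqlContext.uncacheTable("people")

// Or directly from SQL.
sqlContext.sql("CACHE TABLE people")
sqlContext.sql("UNCACHE TABLE people")
```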

SPARK vs SQL

2015-07-09 Thread vinod kumar
Hi Everyone, Is there any document/material which compares Spark with SQL Server? If so, please share the details. Thanks, Vinod

Numer of runJob at SparkPlan.scala:122 in Spark SQL

2015-07-09 Thread Wojciech Pituła
Hey, I was wondering if it is possible to tune the number of jobs generated by Spark SQL? Currently my query generates over 80 'runJob at SparkPlan.scala:122' jobs; every one of them gets executed in ~4 sec and contains only 5 tasks. As a result, most of my cores do nothing.

Performance slow

2015-07-09 Thread Ravisankar Mani
Hi everyone, More time is taken when I execute a query using (GROUP BY + ORDER BY) or (GROUP BY + CAST + ORDER BY) in the same query. Kindly refer to the following query. Could you please provide any solution regarding this performance issue? SELECT

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Never mind, I’ve created the jira issue at https://issues.apache.org/jira/browse/SPARK-8972. From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, July 10, 2015 9:15 AM To: yana.kadiy...@gmail.com; ayan guha Cc: user Subject: RE: [SparkSQL] Incorrect ROLLUP results Yes, this is a bug, do

Re: [X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Su She
Thanks Todd, this was helpful! I also got some help from the other forum, and for those that might run into this problem in the future, the solution that worked for me was: foreachRDD { r => r.map(x => data(x.getString(0), x.getInt(1))).saveToCassandra("demo", "sqltest") } On Thu, Jul 9, 2015 at 4:37
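A minimal, self-contained sketch of that pattern with the spark-cassandra-connector, assuming a keyspace demo and table sqltest whose columns match the (hypothetical) case-class field names:

```scala
import com.datastax.spark.connector._            // spark-cassandra-connector
import org.apache.spark.sql.Row
import org.apache.spark.streaming.dstream.DStream

case class SqlTestRow(name: String, value: Int)  // hypothetical row type

def save(results: DStream[Row]): Unit = {
  // Each micro-batch is mapped to the case class and written to Cassandra.
  results.foreachRDD { rdd =>
    rdd.map(x => SqlTestRow(x.getString(0), x.getInt(1)))
       .saveToCassandra("demo", "sqltest")
  }
}
```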

Re: Spark serialization in closure

2015-07-09 Thread Chen Song
Thanks Andrew. I tried with your suggestions and (2) works for me. (1) still doesn't work. Chen On Thu, Jul 9, 2015 at 4:58 PM, Andrew Or and...@databricks.com wrote: Hi Chen, I believe the issue is that `object foo` is a member of `object testing`, so the only way to access `object foo`

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Thanks Matei! It worked. On 9 July 2015 at 19:43, Matei Zaharia matei.zaha...@gmail.com wrote: This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread vinod kumar
For records below 50,000, SQL is better, right? On Fri, Jul 10, 2015 at 12:18 AM, ayan guha guha.a...@gmail.com wrote: With your load, either should be fine. I would suggest you run a couple of quick prototypes. Best Ayan On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Yes, this is a bug, do you mind creating a JIRA issue for this? I will fix it ASAP. BTW, what’s your Spark version? From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Friday, July 10, 2015 12:16 AM To: ayan guha Cc: user Subject: Re: [SparkSQL] Incorrect ROLLUP results

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread vinod kumar
Ayan, I want to process data of roughly 5 records to 2L records (in flat form). Is there any scale threshold for deciding which technology is best, SQL or Spark? On Thu, Jul 9, 2015 at 9:40 AM, ayan guha guha.a...@gmail.com wrote: It depends on workload. How much data

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread ayan guha
With your load, either should be fine. I would suggest you run a couple of quick prototypes. Best Ayan On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar vinodsachin...@gmail.com wrote: Ayan, I want to process data of roughly 5 records to 2L records (in flat form). Is there

[Spark Hive SQL] Set the hive connection in hive context is broken in spark 1.4.1-rc1?

2015-07-09 Thread Terry Hole
Hi, I am trying to set the Hive metadata destination to a MySQL database in the hive context. It works fine in Spark 1.3.1, but it seems broken in Spark 1.4.1-rc1, where it always connects to the default metadata (local). Is this a regression, or must we set the connection in hive-site.xml? The code

Apache Spark : Custom function for reduceByKey - missing arguments for method

2015-07-09 Thread ameyamm
I am trying to normalize a dataset (convert values for all attributes in the vector to 0-1 range). I created an RDD of tuple (attrib-name, attrib-value) for all the records in the dataset as follows: val attribMap : RDD[(String,DoubleDimension)] = contactDataset.flatMap(

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
Hi Ashish, Julian's approach is probably better, but a few observations: 1) Your SPARK_HOME should be C:\spark-1.3.0 (not C:\spark-1.3.0\bin). 2) If you have Anaconda Python installed (I saw that you had set this up in a separate thread), py4j should be part of the package - at least I think so.

Re: Connecting to nodes on cluster

2015-07-09 Thread Ashish Dutt
Hello Akhil, Thanks for the response. I will have to figure this out. Sincerely, Ashish On Thu, Jul 9, 2015 at 3:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, We have a cluster with 4 nodes. The cluster

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
If you are continuously unioning RDDs, then you are accumulating ever-increasing data, and you are processing an ever-increasing amount of data in every batch. Obviously this is not going to last for very long. You fundamentally cannot keep processing an ever-increasing amount of data with finite

Re: Problem in Understanding concept of Physical Cores

2015-07-09 Thread Tathagata Das
Query 1) What Spark runs is tasks in task slots; whatever the mapping of tasks to physical cores is, it does not matter. If there are two task slots (2 threads in local mode, or an executor with 2 task slots in distributed mode), it can only run two tasks concurrently. That is true even if the task

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
Do you have enough cores in the configured number of executors in YARN? On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wbi...@gmail.com wrote: I'm using Spark Streaming with Kafka, and submit it to a YARN cluster with mode yarn-cluster. But it hangs at SparkContext.start(). The Kafka config is

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
The data coming from the dstream have the same keys that are in myRDD, so the reduceByKey after the union keeps the overall tuple count in myRDD fixed. Or, even with a fixed tuple count, will it keep consuming more resources? On 9 July 2015 at 16:19, Tathagata Das t...@databricks.com wrote: If you are

query on Spark + Flume integration using push model

2015-07-09 Thread diplomatic Guru
Hello all, I'm trying to configure Flume to push data into a sink so that my streaming job could pick up the data. My events are in JSON format, but the Spark + Flume integration [1] document only refers to the Avro sink. [1] https://spark.apache.org/docs/latest/streaming-flume-integration.html I

Some BlockManager Doubts

2015-07-09 Thread Dibyendu Bhattacharya
Hi, Just would like to clarify a few doubts I have about how BlockManager behaves. This is mostly in regards to the Spark Streaming context. There are two possible cases where blocks may get dropped / not stored in memory. Case 1. While writing the block with MEMORY_ONLY settings, if the node's BlockManager does

S3 vs HDFS

2015-07-09 Thread Brandon White
Are there any significant performance differences between reading text files from S3 and hdfs?

spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread matd
Hi, The Spark EC2 scripts are useful, but they install everything as root. AFAIK, that's not a good practice ;-) Why is it so? Should these scripts be reserved for test/demo purposes, and not be used for a production system? Is it planned in some roadmap to improve that, or to replace the ec2 scripts

Re: Job completed successfully without processing anything

2015-07-09 Thread Akhil Das
Looks like a configuration problem with your spark setup, are you running the driver on a different network? Can you try a simple program from spark-shell and make sure your setup is proper? (like sc.parallelize(1 to 1000).collect()) Thanks Best Regards On Thu, Jul 9, 2015 at 1:02 AM, ÐΞ€ρ@Ҝ

spark streaming performance

2015-07-09 Thread Michel Hubert
Hi, I've developed a POC Spark Streaming application. But it seems to perform better on my development machine than on our cluster. I submit it to YARN on our Cloudera cluster. But my first question is more detailed: In the application UI (:4040) I see in the streaming section that the batch

Re: Connecting to nodes on cluster

2015-07-09 Thread Akhil Das
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two days I have been trying to connect my laptop to the server using the spark master ip:port but it's been unsuccessful. The server contains data

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-09 Thread Akhil Das
Did you try sc.shutdown and creating a new one? Thanks Best Regards On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole hujie.ea...@gmail.com wrote: I am using spark 1.4.1rc1 with default hive settings Thanks - Terry Hi All, I'd like to use the hive context in spark shell, i need to recreate the

Re: What does RDD lineage refer to ?

2015-07-09 Thread Akhil Das
Yes, just to add, see the following scenario of RDD lineage: RDD1 -> RDD2 -> RDD3 -> RDD4. Here RDD2 depends on RDD1's output and the lineage goes till RDD4. Now, say for some reason RDD3 is lost; Spark will recompute it from RDD2. Thanks Best Regards On Thu, Jul 9, 2015 at 5:51 AM, canan
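A minimal sketch (illustrative, not from the thread) that builds such a four-step lineage and prints it; sc is assumed to be an existing SparkContext:

```scala
val rdd1 = sc.parallelize(1 to 100)        // RDD1
val rdd2 = rdd1.map(_ * 2)                 // RDD2 depends on RDD1
val rdd3 = rdd2.filter(_ % 3 == 0)         // RDD3 depends on RDD2
val rdd4 = rdd3.map(_.toString)            // RDD4 depends on RDD3

// The recorded lineage: if a partition of rdd3 is lost, Spark re-runs the
// map and filter from rdd2 (or further back if needed) to rebuild it.
println(rdd4.toDebugString)
```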

Re: Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-09 Thread 诺铁
Ok, got it , thanks. On Thu, Jul 9, 2015 at 12:02 PM, prosp4300 prosp4...@163.com wrote: Seems what Feynman mentioned is the source code instead of documentation, vectorMean is private, see

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-09 Thread Terry Hole
Hi Akhil, I have tried; it does not work. This may be related to the newly added isolated classloader in the Spark hive context. The error call stack is: java.sql.SQLException: Failed to start database 'metastore_db' with class loader

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
What was the number of cores in the executor? It could be that you had only one core in the executor, which did all the 50 tasks serially, so 50 tasks x 15 ms = ~1 second. Could you take a look at the task details in the stage page to see when the tasks were added, to see whether it explains the 5

Re: S3 vs HDFS

2015-07-09 Thread Sujee Maniyam
Latency is much bigger for S3 (if that matters), and with HDFS you'd get data locality that will boost your app performance. I did some light experimenting on this; see my presentation here for some benchmark numbers, etc.: http://www.slideshare.net/sujee/hadoop-to-sparkv2 from slide #34 cheers

Questions about Fault tolerance of Spark

2015-07-09 Thread 牛兆捷
Hi All: We already know that Spark utilizes the lineage to recompute the RDDs when failure occurs. I want to study the performance of this fault-tolerant approach and have some questions about it. 1) Is there any benchmark (or standard failure model) to test the fault tolerance of these kinds of

Scheduler delay vs. Getting result time

2015-07-09 Thread Hans van den Bogert
Hi, In the Spark UI, under “Show additional metrics”, there are two extra metrics you can show: 1. Scheduler delay and 2. Getting result time. When hovering over “Scheduler Delay” it says (among other things): …time to send task result from executor… When hovering over “Getting result time”: Time that

Data Processing speed SQL Vs SPARK

2015-07-09 Thread vinod kumar
Hi Everyone, I am new to Spark. I am using SQL in my application to handle data, and I am thinking of moving to Spark now. Is the data processing speed of Spark better than SQL Server? Thanks, Vinod

Re: databases currently supported by Spark SQL JDBC

2015-07-09 Thread ayan guha
I suppose every RDBMS has a JDBC driver to connect to. I know Oracle, MySQL, SQL Server, Teradata, and Netezza have one. On Thu, Jul 9, 2015 at 10:09 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, I'm planning to use the Spark SQL JDBC datasource provider with various RDBMS databases. what are the

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using Spark Streaming with Kafka, and submit it to a YARN cluster with mode yarn-cluster. But it hangs at SparkContext.start(). The Kafka config is right since it can show some events in the Streaming tab of the web UI. The attached file is the screenshot of the Jobs tab of the web UI. The code in the

Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
Hi, I've an application in which an RDD is being updated with tuples coming from RDDs in a DStream with the following pattern: dstream.foreachRDD(rdd => { myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_+_) }) I'm using cache() and checkpointing to cache results. Over time, the lineage

RE: SparkR dataFrame read.df fails to read from aws s3

2015-07-09 Thread Sun, Rui
Hi Ben, 1) I guess this may be a JDK version mismatch. Could you check the JDK version? 2) I believe this is a bug in SparkR. I will file a JIRA issue for it. From: Ben Spark [mailto:ben_spar...@yahoo.com.au] Sent: Thursday, July 9, 2015 12:14 PM To: user Subject: SparkR dataFrame

RE: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Michel Hubert
Hi, I was just wondering how you generated the second image with the charts. What product? From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Thursday, July 9, 2015 11:48 To: spark users Subject: Breaking lineage and reducing stages in Spark Streaming Hi, I've an application in which an rdd

Spark Mesos task rescheduling

2015-07-09 Thread besil
Hi, We are experiencing scheduling errors due to a Mesos slave failing. It seems to be an open bug; more information can be found here: https://issues.apache.org/jira/browse/SPARK-3289 According to this link

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
That's from the Streaming tab of the Spark 1.4 web UI. On 9 July 2015 at 15:35, Michel Hubert mich...@vsnsystemen.nl wrote: Hi, I was just wondering how you generated the second image with the charts. What product? *From:* Anand Nalya [mailto:anand.na...@gmail.com] *Sent:* Thursday, 9

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Dean Wampler
Is myRDD outside a DStream? If so are you persisting on each batch iteration? It should be checkpointed frequently too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
Yes, myRDD is outside of the DStream. Following is the actual code, where newBase and current are the RDDs being updated with each batch: val base = sc.textFile... var newBase = base.cache() val dstream: DStream[String] = ssc.textFileStream... var current: RDD[(String, Long)] =

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-07-09 Thread RedOakMark
That’s correct. We were setting up a Spark EC2 cluster from the command line, then installing RStudio Server, logging into that through the web interface and attempting to initialize the cluster within RStudio. We have made some progress on this outside of the thread - I will see what I can

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://master IP:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Dean Wampler
I think you're complicating the cache behavior by aggressively re-using vars when temporary vals would be more straightforward. For example, newBase = newBase.unpersist()... effectively means that newBase's data is not actually cached when the subsequent .union(...) is performed, so it probably

Re: DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-09 Thread ashishdutt
Not really a clean solution but I solved the problem by reinstalling Anaconda -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DLL-load-failed-1-is-not-a-valid-win32-application-on-invoking-pyspark-tp23733p23743.html Sent from the Apache Spark User List

Re: Data Processing speed SQL Vs SPARK

2015-07-09 Thread ayan guha
It depends on workload. How much data would you want to process? On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am new to Spark. I am using SQL in my application to handle data, and I am thinking of moving to Spark now. Is the data processing speed

SPARK_WORKER_DIR and SPARK_LOCAL_DIR

2015-07-09 Thread corrius
Hello, I have a 4-node Spark cluster running on EC2 and it's running out of disk space. I'm running Spark 1.3.1. I have mounted a second SSD disk in every instance on /tmp/spark and set SPARK_LOCAL_DIRS and SPARK_WORKER_DIRS pointing to this folder: set | grep SPARK

Pyspark not working on yarn-cluster mode

2015-07-09 Thread jegordon
Hi to all, Is there any way to run pyspark scripts with yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a Django web app. When I try to run any script in yarn-cluster mode I get the following error:

[X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Su She
Hello All, I also posted this on the Spark/Datastax thread, but thought it was also 50% a Spark question (or mostly a Spark question). I was wondering what is the best practice for saving streaming Spark SQL (

Re: Spark serialization in closure

2015-07-09 Thread Andrew Or
Hi Chen, I believe the issue is that `object foo` is a member of `object testing`, so the only way to access `object foo` is to first pull `object testing` into the closure, then access a pointer to get to `object foo`. There are two workarounds that I'm aware of: (1) Move `object foo` outside
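A minimal sketch of workaround (1), moving the inner object to the top level so the closure no longer drags the enclosing object in; the names mirror the thread's example but are otherwise illustrative:

```scala
// Top-level object: referencing foo.v from a closure no longer requires
// serializing the (possibly non-serializable) enclosing object.
object foo {
  val v = 42
}

object testing {
  def func(sc: org.apache.spark.SparkContext): Unit = {
    val rdd = sc.parallelize(List(1, 2, 3))
    rdd.foreachPartition { it =>
      it.foreach(x => println(x + foo.v))
    }
  }
}
```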

RE: Feature Generation On Spark

2015-07-09 Thread Mohammed Guller
Take a look at the examples here: https://spark.apache.org/docs/latest/ml-guide.html Mohammed From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com] Sent: Saturday, July 4, 2015 10:49 PM To: ayan guha; Michal Čizmazia Cc: user Subject: RE: Feature Generation On Spark I have one document

Does spark guarantee that the same task will process the same key over time?

2015-07-09 Thread micvog
For example in the simplest word count example, I want to update the count in memory and always have the same word getting updated by the same task - not use any distributed memstore. I know that updateStateByKey should guarantee that, but how do you approach this problem outside of spark

Re: Spark serialization in closure

2015-07-09 Thread Chen Song
Thanks Erik. I saw the document too. That is why I am confused, because as per the article, it should be good as long as *foo* is serializable. However, what I have seen is that it works if *testing* is serializable, even when foo is not serializable, as shown below. I don't know if there is

Re: Spark serialization in closure

2015-07-09 Thread Chen Song
Reposting the code example: object testing extends Serializable { object foo { val v = 42 } val list = List(1,2,3) val rdd = sc.parallelize(list) def func = { val after = rdd.foreachPartition { it => println(foo.v) } } } On Thu, Jul 9, 2015 at 4:09

Re: Spark serialization in closure

2015-07-09 Thread Richard Marscher
Reading that article and applying it to your observations of what happens at runtime: shouldn't the closure require serializing testing? The foo singleton object is a member of testing, and then you call this foo value in the closure func and further in the foreachPartition closure. So following

Friend recommendation using collaborative filtering?

2015-07-09 Thread Diogo B.
Dear list, I have some questions regarding collaborative filtering. Although they are not specific to Spark, I hope the folks in this community might be able to help me somehow. We are looking for a simple way how to recommend users to other users, i.e., how to recommend new friends. Do you

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
Summarizing the main problems discussed by Dean: 1. If you have an infinitely growing lineage, bad things will eventually happen. You HAVE TO periodically (say, every 10th batch) checkpoint the information. 2. Unpersist the previous `current` RDD ONLY AFTER running an action on the `newCurrent`.
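A minimal sketch (with illustrative names, and assuming a checkpoint directory has been set via ssc.checkpoint) of the periodic-checkpoint-then-unpersist pattern being described:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def accumulate(dstream: DStream[(String, Long)],
               initial: RDD[(String, Long)]): Unit = {
  var current = initial.cache()
  var batch = 0L

  dstream.foreachRDD { rdd =>
    val newCurrent = current.union(rdd).reduceByKey(_ + _).cache()
    if (batch % 10 == 0) newCurrent.checkpoint()  // periodically truncate the lineage
    newCurrent.count()                            // materialize newCurrent first...
    current.unpersist()                           // ...then drop the old cached copy
    current = newCurrent
    batch += 1
  }
}
```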

Spark serialization in closure

2015-07-09 Thread Chen Song
I am not sure whether this is more of a question for Spark or just Scala, but I am posting my question here. The code snippet below shows an example of passing a reference to a closure in the rdd.foreachPartition method. ``` object testing { object foo extends Serializable { val v = 42 }

Re: Spark serialization in closure

2015-07-09 Thread Erik Erlandson
I think you have stumbled across this idiosyncrasy: http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/ - Original Message - I am not sure this is more of a question for Spark or just Scala but I am posting my question here. The code

DataFrame insertInto fails, saveAsTable works (Azure HDInsight)

2015-07-09 Thread Daniel Haviv
Hi, I'm running Spark 1.4 on Azure. DataFrame's insertInto fails, but saveAsTable works. It seems like some issue with accessing Azure's blob storage, but that doesn't explain why one type of write works and the other doesn't. This is the stack trace: Caused by:

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Spark version 1.4.0 in the Standalone mode 2015-07-09 20:12:02 INFO (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB) 2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 - Exception in task 0.0 in stage

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
I am not sure why you are getting node_local and not process_local. Also, there is probably no good documentation other than that configuration page - http://spark.apache.org/docs/latest/configuration.html (search for locality) On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert

Number of Threads in Executor to process Tasks

2015-07-09 Thread Aniruddh Sharma
Hi, I am new to Spark. I am confused about the correlation between threads and physical cores. As per my understanding, the number of tasks is created according to the number of partitions in the data set. For example, if I have a machine which has 10 physical cores and I have a data set which has 100 partitions, then in

work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Could anyone please give me pointers for an appropriate SparkConf to work around Size exceeds Integer.MAX_VALUE? Stacktrace: 2015-07-09 20:12:02 INFO (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB) 2015-07-09 20:12:02

How to specify PATHS for user defined functions.

2015-07-09 Thread Dan Dong
Hi All, I have a function and want to access it in my Spark programs, but I got: Exception in thread main java.lang.NoSuchMethodError in spark-submit. I put the function under ./src/main/scala/com/aaa/MYFUNC/MYFUNC.scala: package com.aaa.MYFUNC object MYFUNC{ def FUNC1(input:

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Ted Yu
Which release of Spark are you using ? Can you show the complete stack trace ? getBytes() could be called from: getBytes(file, 0, file.length) or: getBytes(segment.file, segment.offset, segment.length) Cheers On Thu, Jul 9, 2015 at 2:50 PM, Michal Čizmazia mici...@gmail.com wrote:

Re: Pyspark not working on yarn-cluster mode

2015-07-09 Thread Marcelo Vanzin
You cannot run Spark in cluster mode by instantiating a SparkContext like that. You have to launch it with the spark-submit command line script. On Thu, Jul 9, 2015 at 2:23 PM, jegordon jgordo...@gmail.com wrote: Hi to all, Is there any way to run pyspark scripts with yarn-cluster mode

Re: change default storage level

2015-07-09 Thread Michal Čizmazia
Thanks Shixiong! Your response helped me to understand the role of persist(). No persist() calls were required indeed. I solved my problem by setting spark.local.dir to allow more space for the Spark temporary folder. It works automatically. I am seeing logs like this: Not enough space to cache
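A minimal sketch of pointing spark.local.dir at a bigger scratch location (as done above); the directory path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir is where shuffle files and blocks spilled to disk land;
// a comma-separated list of directories on large/fast disks can be given.
val conf = new SparkConf()
  .setAppName("local-dir-demo")
  .set("spark.local.dir", "/mnt/spark-tmp")   // placeholder path

val sc = new SparkContext(conf)
```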

Re: [X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Todd Nist
foreachRDD returns a unit: def foreachRDD(foreachFunc: (RDD[T]) ⇒ Unit): Unit (RDD: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html) Apply a function to each RDD in this DStream. This is an output operator, so 'this' DStream will be registered as an output stream and

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of partitions higher in the sc.textFile, hadoopFile, etc. methods (it's an optional second parameter to
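A minimal sketch of the two ways to get more (hence smaller) partitions; the path and partition counts are placeholders:

```scala
// 1) Ask for more partitions when reading (optional minPartitions argument).
val lines = sc.textFile("hdfs:///data/big-input", 400)

// 2) Or split an existing RDD further before caching it.
val repartitioned = lines.repartition(800)
repartitioned.cache()
```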

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had used --total-executor-cores and didn't realize it changed. Tathagata Das t...@databricks.com wrote on Fri, Jul 10, 2015 at 3:11 AM: 1. There will be a long-running job with description start(), as that is the job that is running the receivers.

  1   2   >