Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-31 Thread Prashant Sharma
This looks like a bug to me. This happens because we serialize the code that starts the receiver and send it across. And since we have not registered the classes of the akka library, it does not work. I have not tried it myself, but maybe including something like chill-akka (
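A minimal sketch of the kind of workaround being suggested, assuming a hypothetical registrator class (com.example.AkkaRegistrator) that would register the Akka classes with Kryo, e.g. by delegating to chill-akka:

    import org.apache.spark.SparkConf

    // Sketch only: AkkaRegistrator is a hypothetical KryoRegistrator that
    // would register Akka's classes (for instance via chill-akka).
    val conf = new SparkConf()
      .setAppName("ActorReceiverApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.AkkaRegistrator")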

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
Hi All, The data size of my task is about 30 MB. It runs smoothly in local mode. However, when I submit it to the cluster, it throws the titled error (please see below for the complete output). Actually, my output is almost the same with

Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Haiyang, you are right, YARN takes over the resource management, but I constantly get a ConnectionRefused exception on the mentioned port. So I suppose some Spark-internal communication is done via this port... but I don't know what exactly, or how I can change it... Thank you, Konstantin

configuration needed to run twitter(25GB) dataset

2014-07-31 Thread Jiaxin Shi
We have a 6-node cluster; each node has 64GB of memory. Here is the command: ./bin/spark-submit --class org.apache.spark.examples.graphx.LiveJournalPageRank examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10 But it ran out

Re: Ports required for running spark

2014-07-31 Thread Haiyang Fu
Hi Konstantin, Could you please post your first container's stderr log here, which is always the AM log? As far as I know, ports except 8020, 8080, 8081, 50070, 50071 are all random socket ports determined by each job, so 33007 may just be a temporary port for data transfer. The deeper reason for

How to share a NonSerializable variable among tasks in the same worker node?

2014-07-31 Thread Fengyun RAO
As shown here: 2 - Why Is My Spark Job so Slow and Only Using a Single Thread? http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/ object JSONParser { def parse(raw: String): String = ... } object MyFirstSparkJob { def
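The pattern that post describes, reconstructed as a runnable sketch (SomeNonSerializableParser stands in for whatever class cannot be serialized): a Scala object is initialized lazily, once per JVM, so every task on a worker shares one parser instance and nothing non-serializable is shipped from the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object JSONParser {
      // Initialized lazily, once per worker JVM; never serialized or shipped.
      private lazy val parser = new SomeNonSerializableParser()
      def parse(raw: String): String = parser.parse(raw)
    }

    object MyFirstSparkJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MyFirstSparkJob"))
        sc.textFile("hdfs:///input/events")
          .map(JSONParser.parse)   // each task uses the worker-local singleton
          .saveAsTextFile("hdfs:///output/parsed")
        sc.stop()
      }
    }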

java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Sameer Tilak
Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new SparkConf().setMaster("local[2]").setAppName("ApproxStrMatch").set("spark.executor.memory", "3g").set("spark.storage.memoryFraction", "0.1") I am getting the following error. I tried setting up

Re: Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - Wrong FS

2014-07-31 Thread William Cox
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that. On Thu, Jul 31, 2014 at 9:50 AM, William Cox william@distilnetworks.com wrote: *The Shark-specific group appears to be in moderation pause, so I'm asking here.* I'm running Shark/Spark on EC2. I am using Shark to

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-31 Thread William Cox
Ah, thanks for the help! That worked great. On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang zonghen...@gmail.com wrote: To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
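For reference, a sketch of where that flag sits in a spark-ec2 launch command (key pair, identity file, slave count, and cluster name are placeholders):

    ./spark-ec2 -k my-keypair -i my-key.pem -s 20 --wait 600 launch my-cluster

The --wait value is how many seconds the script waits for instances to come up before trying to SSH in, so larger clusters generally need larger values.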

Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I'm working with a dataset where each row is stored as a single-line flat JSON object. I want to leverage Spark SQL to run relational queries on this data. Many of the object keys in this dataset have dots in them, e.g.: { "key.number1": "value1", "key.number2": "value2" … } I can successfully

store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread salemi
Hi, I was wondering what the best way is to store off DStreams in HDFS or Cassandra. Could somebody provide an example? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/store-spark-streaming-dstream-in-hdfs-or-cassandra-tp11064.html Sent from

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I still see the same "Unresolved attributes" error when using hql + backticks. Here's a code snippet that replicates this behavior: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two": "value2"}""")) val sampleTable =

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Yin Huai
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it. On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam bu...@amazon.com wrote: I still see the same “Unresolved attributes” error when using hql + backticks. Here’s a code snippet that replicates this behavior: val

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Darin McBeath
Ok, I set the number of Spark worker instances to 2 (below is my startup command). But this essentially had the effect of increasing my number of workers from 3 to 6 (which was good), while it also reduced my number of cores per worker from 8 to 4 (which was not so good). In the end, I would
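For comparison, a sketch of the standalone-mode settings that interact here, in conf/spark-env.sh (values are illustrative):

    # conf/spark-env.sh on each node
    SPARK_WORKER_INSTANCES=2   # worker JVMs per node
    SPARK_WORKER_CORES=8       # cores each worker JVM offers
    SPARK_WORKER_MEMORY=8g     # memory each worker JVM offers

Note that the total cores a node advertises is SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES, so the two usually need adjusting together.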

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Hari Shreedharan
Off the top of my head, you can use the ForEachDStream to which you pass in the code that writes to Hadoop, and then register that as an output stream, so the function you pass in is periodically executed and causes the data to be written to HDFS. If you are ok with the data being in text
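A minimal sketch of both routes (the dstream name and paths are placeholders): saveAsTextFiles is the built-in text-output operation, while foreachRDD registers a custom output operation that runs on every batch:

    // Built-in: one output directory per batch, named prefix-<batch time>.
    dstream.saveAsTextFiles("hdfs:///user/ali/stream-out")

    // Custom: full control over how each batch RDD is written.
    dstream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile("hdfs:///user/ali/stream-out-" + time.milliseconds)
    }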

SchemaRDD select expression

2014-07-31 Thread buntu
I'm looking to write a select statement to get a distinct count of userId grouped by the keyword column on a Parquet-file SchemaRDD; the SQL equivalent of: SELECT keyword, count(distinct(userId)) FROM table GROUP BY keyword. How do I write it using the chained select().groupBy() operations? Thanks! --

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
I got the number from the Hadoop admin. It's 1M actually. I suspect the consolidation didn't work as expected? Any other reason? On Thu, Jul 31, 2014 at 11:01 AM, Shao, Saisai saisai.s...@intel.com wrote: I don’t think it’s a bug of consolidated shuffle, it’s a Linux configuration problem.
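For reference, the setting under discussion; when enabled it merges per-map-task shuffle files into per-core file groups, which is what should keep the open-file count below the ulimit:

    // a sketch; property name as in Spark of this era
    val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")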

RDD operation examples with data?

2014-07-31 Thread Chris Curtin
Hi, I'm learning Spark and I am confused about when to use the many different operations on RDDs. Does anyone have any examples which show sample inputs and resulting outputs for the various RDD operations, and, if the operation takes a Function, a simple example of the code? For example,

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
So if I try this again but in the Scala shell (as opposed to the Python one), this is what I get: scala> val a = sc.textFile("s3n://some-path/*.json", minPartitions=sc.defaultParallelism * 3).cache() a: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala

Re: RDD operation examples with data?

2014-07-31 Thread Jacob Eisinger
I would check out the source examples on Spark's Github: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples And, Zhen He put together a great web page with summaries and examples of each function:

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct is recently added and is in 1.0.2. If you are using that or the master branch, you could try something like: r.select('keyword, countDistinct('userId)).groupBy('keyword) On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote: I'm looking to write a select statement

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Gerard Maas
To read from and write to Cassandra, I recommend using the Spark-Cassandra connector at [1]. With it, saving a Spark Streaming RDD to Cassandra is fairly easy: sparkConfig.set(CassandraConnectionHost, cassandraHost) val sc = new SparkContext(sparkConfig) val ssc = new StreamingContext(sc,
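A minimal sketch of the save path once the contexts are set up (the keyspace, table, and wordCounts DStream are placeholders):

    import com.datastax.spark.connector._

    // Write each micro-batch of (word, count) pairs to a Cassandra table.
    wordCounts.foreachRDD { rdd =>
      rdd.saveToCassandra("my_keyspace", "word_counts",
        SomeColumns("word", "count"))
    }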

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
Davies, That was it. Removing the call to cache() let the job run successfully, but this challenges my understanding of how Spark handles caching data. I thought it was safe to cache data sets larger than the cluster could hold in memory. What Spark would do is cache as much as it could and
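For context, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), which drops partitions that don't fit and recomputes them later rather than spilling them. A sketch of the spill-to-disk alternative:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK spills partitions that don't fit in memory to local
    // disk instead of dropping them and recomputing from the source.
    val a = sc.textFile("s3n://some-path/*.json")
      .persist(StorageLevel.MEMORY_AND_DISK)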

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
Looking at what this patch [1] has to do to achieve it, I am not sure if you can do the same thing in 1.0.0 using DSL only. Just curious, why don't you use the hql() / sql() methods and pass a query string in? [1] https://github.com/apache/spark/pull/1211/files On Thu, Jul 31, 2014 at 2:20 PM,
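Following that suggestion, a minimal sketch of the query-string route, assuming the Parquet-backed SchemaRDD has been registered under a hypothetical table name, events:

    import sqlContext._   // brings sql() and the implicits into scope

    parquetRDD.registerAsTable("events")
    val counts = sql(
      "SELECT keyword, COUNT(DISTINCT userId) FROM events GROUP BY keyword")
    counts.collect().foreach(println)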

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
I was not sure whether calling registerAsTable() and then querying against that table has an additional performance impact, and whether the DSL eliminates it. On Thu, Jul 31, 2014 at 2:33 PM, Zongheng Yang zonghen...@gmail.com wrote: Looking at what this patch [1] has to do to achieve it, I am not sure if you can do

Re: SchemaRDD select expression

2014-07-31 Thread Michael Armbrust
The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote: I was not sure if registerAsTable() and then query against that table have additional performance impact and if DSL eliminates that. On Thu, Jul 31, 2014 at

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
Thanks Michael for confirming! On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote: I was not sure if registerAsTable() and then

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread chaitu reddy
Hi Rahul, I am not sure about bootstrapping while creating, but we downloaded the tarball, extracted and configured it accordingly, and it worked fine. I believe you would want to write a custom script which does all these things and add it as a bootstrap action. Thanks, Sai On Jul 31, 2014

Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Ryan Tabora
Hey all, I was able to spin up a cluster, but when I try to submit a simple jar via spark-submit it fails to run. I am trying to run the simple Standalone Application from the quickstart. Oddly enough, I could get another application running through the spark-shell. What am I doing wrong

makeLinkRDDs in MLlib ALS

2014-07-31 Thread alwayforver
It seems to me that the way makeLinkRDDs works is by taking advantage of the fact that partition IDs happen to coincide with what we get from userPartitioner, since the HashPartitioner in *val grouped = ratingsByUserBlock.partitionBy(new HashPartitioner(numUserBlocks))* is actually preserving

Spark job finishes then command shell is blocked/hangs?

2014-07-31 Thread bumble123
Hi, My spark job finishes with this output: 14/07/31 16:33:25 INFO SparkContext: Job finished: count at RetrieveData.scala:18, took 0.013189 s However, the command line doesn't go back to normal and instead just hangs. This is my first time running a spark job - is this normal? If not, how do I

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread nit
Have you tried flag --spark-version of spark-ec2 ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-0-9-1-on-EMR-Cluster-tp11084p11096.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Example standalone app error!

2014-07-31 Thread Tathagata Das
When are you getting the error? When the SparkContext is created, or when it is being shut down? If this error is being thrown when the SparkContext is created, then one possible reason may be conflicting versions of Akka. Spark depends on a version of Akka which is different from that of Scala,

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. TD On Wed, Jul 30, 2014 at 8:32 PM, liuwei stupi...@126.com wrote: Hi, Tathagata Das: I followed your advice and solved this problem, thank you very much! On Jul 31, 2014, at 3:07 AM, Tathagata Das

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
I filed a JIRA for this task for future reference. https://issues.apache.org/jira/browse/SPARK-2780 On Thu, Jul 31, 2014 at 5:37 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. TD On Wed, Jul 30, 2014 at 8:32

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I didn't mean to say this was an error. According to the other thread I linked, right now there shouldn't be any conflicts, so I wanted to use streaming in the shell for easy testing. I thought I had to create my own project in which I'd add streaming as a dependency, but if I can

spark-submit registers the driver twice

2014-07-31 Thread salemi
Hi All, I am using the spark-submit command to submit my jar to a standalone cluster with two executors. When I use spark-submit, it deploys the application twice, and I see two application entries in the master UI. The master logs shown below also indicate that submit tries to deploy the

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread Tathagata Das
Hey Simon, The stuff you are trying to show - logs, contents of spark-env.sh, etc. are missing from the email. At least I am not able to see it (viewing through gmail). Are you pasting screenshots? Those might get blocked out somehow! TD On Thu, Jul 31, 2014 at 6:55 PM, durin

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread ratabora
Hey Dean! Thanks! Did you try running this on a local environment or one generated by the spark-ec2 script? The environment I am running on is a 4 data node 1 master spark cluster generated by the spark-ec2 script. I haven't modified anything in the environment except for adding data to the

Re: SQLCtx cacheTable

2014-07-31 Thread Michael Armbrust
cacheTable uses a special columnar caching technique that is optimized for SchemaRDDs. It is something similar to MEMORY_ONLY_SER, but not quite. You can specify the persistence level on the SchemaRDD itself and register that as a temporary table; however, it is likely you will not get as good
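A sketch contrasting the two approaches (table and file names are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Columnar, SchemaRDD-aware caching:
    sqlContext.cacheTable("people")

    // Plain RDD persistence with an explicit level, registered as a table;
    // per the above, likely less compact than the columnar cache.
    val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
    people.persist(StorageLevel.MEMORY_ONLY_SER)
    people.registerAsTable("people_ser")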

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Could you enable the HistoryServer and provide the properties and CLASSPATH for the spark-shell? And run the 'env' command to list your environment variables? By the way, what do the Spark logs say? Enable debug mode to see what's going on in spark-shell when it tries to interact and init HiveContext.

Re: Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Haiyang Fu
Glad to help you. On Fri, Aug 1, 2014 at 11:28 AM, Bin wubin_phi...@126.com wrote: Hi Haiyang, Thanks, it really is the reason. Best, Bin On 2014-07-31 08:05:34, Haiyang Fu haiyangfu...@gmail.com wrote: Have you tried to increase the driver memory? On Thu, Jul 31, 2014 at 3:54 PM, Bin

Re: java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Haiyang Fu
Hi, here are two tips for you: 1. increase the parallelism level; 2. increase the driver memory. On Fri, Aug 1, 2014 at 12:58 AM, Sameer Tilak ssti...@live.com wrote: Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new
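A sketch of both tips (values are illustrative; in local mode the driver JVM does the work, so its heap is the one that matters):

    // Tip 1: raise the parallelism level in the application's SparkConf.
    val conf = new SparkConf()
      .setAppName("ApproxStrMatch")
      .set("spark.default.parallelism", "16")

    // Tip 2: the driver heap is fixed at launch, e.g.:
    //   spark-submit --driver-memory 4g --class MyApp my-app.jar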