Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-18 Thread Yan Fang
Thank you, TD. This is important information for us. Will keep an eye on that. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, this is the limitation of the current implementation. But this will

error from DecisionTree Training:

2014-07-18 Thread Jack Yang
Hi All, I got an error while using DecisionTreeModel (my program is written in Java, Spark 1.0.1, Scala 2.10.1). I have read a local file, loaded it as an RDD, and then sent it to DecisionTree for training. See below for details: JavaRDD<LabeledPoint> Points = lines.map(new ParsePoint()).cache();

Re: Spark Streaming

2014-07-18 Thread Guangle Fan
Thanks Tathagata! I tried it, and worked out perfectly. On Thu, Jul 17, 2014 at 6:34 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Step 1: use rdd.mapPartitions(). Thats equivalent to mapper of the MapReduce. You can open connection, get all the data and buffer it, close connection,
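A minimal Scala sketch of the pattern TD describes, for reference; the Connection class and its lookup() are hypothetical stand-ins for whatever external store is involved, and rdd/sc are assumed from a shell session:

```scala
// Sketch only: Connection stands in for a real client (HBase, JDBC, ...).
class Connection {
  def lookup(key: String): String = key.reverse // placeholder for a real query
  def close(): Unit = ()
}

val rdd = sc.parallelize(Seq("a", "b", "c")) // stand-in input, sc from the shell
val enriched = rdd.mapPartitions { iter =>
  val conn = new Connection()                              // open once per partition
  val buffered = iter.map(k => (k, conn.lookup(k))).toList // buffer all results first
  conn.close()                                             // then close the connection
  buffered.iterator
}
```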

Re: Large scale ranked recommendation

2014-07-18 Thread Nick Pentreath
It is very true that making predictions in batch for all 1 million users against the 10k items will be quite onerous in terms of computation. I have run into this issue too in making batch predictions. Some ideas: 1. Do you really need to generate recommendations for each user in batch? How are

data locality

2014-07-18 Thread Haopu Wang
I have a standalone Spark cluster and an HDFS cluster which share some nodes. When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS for the location of each file block in order to pick the right worker node? How about a Spark cluster on YARN? Thank you very much!

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-18 Thread Tathagata Das
Good to know! I am bumping the priority of this issue in my head. Thanks for the feedback. Others seeing this thread, please comment if you think that this is an important issue for you as well. Not at my computer right now but I will make a Jira for this. TD On Jul 17, 2014 11:22 PM, Yan Fang

Re: data locality

2014-07-18 Thread Sandy Ryza
Hi Haopu, Spark will ask HDFS for file block locations and try to assign tasks based on these. There is a snag. Spark schedules its tasks inside of executor processes that stick around for the lifetime of a Spark application. Spark requests executors before it runs any jobs, i.e. before it has

Re: Large scale ranked recommendation

2014-07-18 Thread Bertrand Dechoux
And you might want to apply clustering before. It is likely that every user and every item are not unique. Bertrand Dechoux On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath nick.pentre...@gmail.com wrote: It is very true that making predictions in batch for all 1 million users against the 10k

Re: Last step of processing is using too much memory.

2014-07-18 Thread Roch Denis
Well, for what it's worth, I found the issue after spending the whole night running experiments ;). Basically, I needed to give a higher number of partitions to the groupByKey. I was simply using the default, which generated only 4 partitions, and so the whole thing blew up. -- View this
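For reference, a one-line sketch of that fix, assuming a pair RDD named pairs; 64 partitions is an arbitrary illustration to tune against your data size:

```scala
// Give groupByKey an explicit partition count instead of the small default.
val grouped = pairs.groupByKey(numPartitions = 64)
```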

RE: data locality

2014-07-18 Thread Haopu Wang
Sandy, do you mean the “preferred location” is working for the standalone cluster also? Because I checked the code of SparkContext and saw the comment below: // This is used only by YARN for now, but should be relevant to other cluster types (Mesos, // etc) too. This is typically

incompatible local class serialVersionUID with spark Shark

2014-07-18 Thread Megane1994
Hello, I want to run Shark on YARN. My environment: Shark-0.9.1, Spark-1.0.0, Hadoop-2.3.0. My first question is: is it possible to run Shark-0.9.1 with Spark-1.0.0 on YARN, or do Shark and Spark necessarily have to be matching versions? For the moment, when I make a request like show

concurrent jobs

2014-07-18 Thread Haopu Wang
By looking at the code of JobScheduler, I find the parameters below: private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1) private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs) Does that mean each app can have only one active stage?
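For anyone wanting to experiment, a hedged sketch of raising that (undocumented) knob; the value 2 is an arbitrary example:

```scala
import org.apache.spark.SparkConf

// Allow two streaming jobs to run concurrently instead of the default one.
val conf = new SparkConf().set("spark.streaming.concurrentJobs", "2")
```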

Re: GraphX Pragel implementation

2014-07-18 Thread Arun Kumar
Thanks On Fri, Jul 18, 2014 at 12:22 AM, Ankur Dave ankurd...@gmail.com wrote: If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store
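A minimal sketch of Ankur's suggestion, assuming Double vertex values and Double messages; the update rule, the 0.5 factor, and the 10 iterations are arbitrary illustrations:

```scala
import org.apache.spark.graphx._

// Carry the last received message inside the vertex attribute so sendMsg
// can read it: VD becomes (current value, last message).
def run(graph: Graph[Double, Int]): Graph[(Double, Double), Int] = {
  val withMsg = graph.mapVertices((_, v) => (v, 0.0))  // (value, lastMsg)
  withMsg.pregel(initialMsg = 0.0, maxIterations = 10)(
    (id, attr, msg) => (attr._1 + msg, msg),       // vprog stores the incoming msg
    t => Iterator((t.dstId, t.srcAttr._2 * 0.5)),  // sendMsg reads the last msg here
    _ + _)                                         // mergeMsg
}
```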

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-18 Thread Svend
Hi, Yes, the error still occurs when we replace the lambdas with named functions: (same error traces as in previous posts) -- View this message in context:

TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Martin Gammelsæter
Hi again! I am having problems when using GROUP BY on both SQLContext and HiveContext (same problem). My code (simplified as much as possible) can be seen here: http://pastebin.com/33rjW67H In short, I'm getting data from a Cassandra store with Datastax' new driver (which works great by the

spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Pei-Lun Lee
Hi, We have a query with left joining and got this error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 4 times, most recent failure: Exception failure in TID 5 on host ip-10-33-132-101.us-west-2.compute.internal:

What is shuffle spill to memory?

2014-07-18 Thread Sébastien Rainville
Hi, in the Spark UI, one of the metrics is shuffle spill (memory). What is it exactly? Spilling to disk when the shuffle data doesn't fit in memory I get it, but what does it mean to spill to memory? Thanks, - Sebastien

Re: Error with spark-submit (formatting corrected)

2014-07-18 Thread MEETHU MATHEW
Hi, instead of spark://10.1.3.7:7077 use spark://vmsparkwin1:7077. Try this: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://vmsparkwin1:7077 --executor-memory 1G --total-executor-cores 2 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10 Thanks Regards, Meethu M

Dividing tasks among Spark workers

2014-07-18 Thread Madhura
I am running my program on a spark cluster but when I look into my UI while the job is running I see that only one worker does most of the tasks. My cluster has one master and 4 workers where the master is also a worker. I want my task to complete as quickly as possible and I believe that if the

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-18 Thread Victor Sheng
Hi,Svend Your reply is very helpful to me. I'll keep an eye on that ticket. And also... Cheers :) Best Regards, Victor -- View this message in context:

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-18 Thread Chen Song
Thanks Andrew, I tried and it works. On Fri, Jul 18, 2014 at 12:53 AM, Andrew Or and...@databricks.com wrote: You will need to include that in the SPARK_JAVA_OPTS environment variable, so add the following line to spark-env.sh: export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC" This should

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Chen Song
Thanks Tathagata. It would be awesome if Spark Streaming could support limiting the receiving rate in general. I tried to explore the link you provided but could not find any specific JIRA related to this. Do you have the JIRA number for this? On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das

Re: Distribute data from Kafka evenly on cluster

2014-07-18 Thread Chen Song
Speaking of this, I have another related question. In my Spark Streaming job, I set up multiple consumers to receive data from Kafka, each worker consuming one partition. Initially, Spark is intelligent enough to associate each worker with each partition, making data consumption distributed.

Re: Dividing tasks among Spark workers

2014-07-18 Thread Shannon Quinn
The default # of partitions is the # of cores, correct? On 7/18/14, 10:53 AM, Yanbo Liang wrote: check how many partitions in your program. If only one, change it to more partitions will make the execution parallel. 2014-07-18 20:57 GMT+08:00 Madhura das.madhur...@gmail.com

Re: Distribute data from Kafka evenly on cluster

2014-07-18 Thread Tobias Pfeiffer
Hi, as far as I know, rebalance is triggered from Kafka in order to distribute partitions evenly. That is, to achieve the opposite of what you are seeing. I think it would be interesting to check the Kafka logs for the result of the rebalance operation and why you see what you are seeing. I know

Re: Dividing tasks among Spark workers

2014-07-18 Thread Yanbo Liang
Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size. You can pass the level of parallelism as a second argument or set the config property
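A short Scala sketch of the knobs Yanbo mentions; the path and the counts 32/64 are made-up illustrations:

```scala
// Raise parallelism per operation, or cluster-wide via the config property.
val lines = sc.textFile("hdfs:///data/events", 32)     // minimum input partitions
val counts = lines.map((_, 1)).reduceByKey(_ + _, 64)  // explicit shuffle partitions
// or as a cluster-wide default:
// conf.set("spark.default.parallelism", "64")
```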

Re: Spark Streaming Json file groupby function

2014-07-18 Thread srinivas
Hi, I am able to save to a local file the RDDs generated by Spark SQL from Spark Streaming. If I set the streaming context to 10 sec, only the data coming in that 10-sec window is processed by my SQL, and the data is stored in the location I specified, and for next

Python: saving/reloading RDD

2014-07-18 Thread Roch Denis
Hello, just to make sure I correctly read the docs and the forums: it's my understanding that currently, in Python with Spark 1.0.1, there is no way to save my RDD to disk such that I can just reload it. The Hadoop RDD methods are not yet present in Python. Is that correct? I just want to make sure that's the case

Re: Large scale ranked recommendation

2014-07-18 Thread Xiangrui Meng
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote: And you might want to apply clustering before. It is likely that every user and every item are not
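A hedged Scala sketch of that broadcast approach, assuming an MLlib ALS MatrixFactorizationModel named model and a SparkContext sc; topN = 10 and the dot-product scoring are illustrative:

```scala
// Collect the small item-factor matrix and ship it to every executor.
val itemFactors = model.productFeatures.collect()  // should be a few MBs
val bcItems = sc.broadcast(itemFactors)

val topN = 10  // arbitrary
val recs = model.userFeatures.mapPartitions { users =>
  val items = bcItems.value
  users.map { case (userId, uf) =>
    val scored = items.map { case (itemId, itf) =>
      (itemId, uf.zip(itf).map { case (a, b) => a * b }.sum)  // dot product
    }
    (userId, scored.sortBy(-_._2).take(topN))  // keep the top-N items per user
  }
}
```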

Re: Python: saving/reloading RDD

2014-07-18 Thread Xiangrui Meng
You can save RDDs to text files using RDD.saveAsTextFile and load them back using sc.textFile. But make sure the record-to-string conversion is correctly implemented if the type is not primitive, and that you have a parser to load them back. -Xiangrui On Jul 18, 2014, at 8:39 AM, Roch Denis

Re: Python: saving/reloading RDD

2014-07-18 Thread Shannon Quinn
+1, had to learn this the hard way when some of my objects were written as pointers, rather than translated correctly to strings :) On 7/18/14, 11:52 AM, Xiangrui Meng wrote: You can save RDDs to text files using RDD.saveAsTextFile and load it back using sc.textFile. But make sure the record

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-18 Thread Yifan LI
Hi Ankur, Thanks so much! :)) Yes: is it possible to define a custom partition strategy? And some other questions (2*4 cores machine, 24GB memory): - if I load one edges file (5 GB), without any cores/partitions setting, what is the default partitioning in graph construction? and how many cores

Re: Python: saving/reloading RDD

2014-07-18 Thread Roch Denis
Yeah, but I would still have to do a map pass with an ast.literal_eval() for each line, correct? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-saving-reloading-RDD-tp10172p10179.html Sent from the Apache Spark User List mailing list archive at

Re: What is shuffle spill to memory?

2014-07-18 Thread Andrew Or
Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both

Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread S Malligarjunan
Hello Experts, I would highly appreciate your input; please suggest or give me a hint as to what the issue might be here. Thanks and Regards, Malligarjunan S. On Thursday, 17 July 2014, 22:47, S Malligarjunan smalligarju...@yahoo.com wrote: Hello Experts, I am facing a performance problem when I use

Re: Spark Streaming timestamps

2014-07-18 Thread Bill Jay
Hi Tathagata, On Thu, Jul 17, 2014 at 6:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The RDD parameter in foreachRDD contains raw/transformed data from the last batch. So when forearchRDD is called with the time parameter as 5:02:01 and batch size is 1 minute, then the rdd will

Re: Large scale ranked recommendation

2014-07-18 Thread m3.sharma
Thanks Nick, the real-time suggestion is good; will see if we can add that to our deployment strategy, and you are correct, we may not need recommendations for each user. Will try adding more resources and the broadcasting-item-features suggestion, as currently they don't seem to be huge. As users and

Re: MLLib - Regularized logistic regression in python

2014-07-18 Thread fjeg
Thanks for all your helpful replies. Best, Francisco -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Regularized-logistic-regression-in-python-tp9780p10184.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Large scale ranked recommendation

2014-07-18 Thread Nick Pentreath
Agree GPUs may be interesting for this kind of massively parallel linear algebra on reasonable size vectors. These projects might be of interest in this regard: https://github.com/BIDData/BIDMach https://github.com/BIDData/BIDMat https://github.com/dlwh/gust Nick On Fri, Jul 18, 2014 at 7:40

Re: Decision tree classifier in MLlib

2014-07-18 Thread Joseph Bradley
Hi Sudha, Have you checked if the labels are being loaded correctly? It sounds like the DT algorithm can't find any useful splits to make, so maybe it thinks they are all the same? Some data loading functions threshold labels to make them binary. Hope it helps, Joseph On Fri, Jul 11, 2014 at

Job aborted due to stage failure: TID x failed for unknown reasons

2014-07-18 Thread Shannon Quinn
Hi all, I'm dealing with some strange error messages that I *think* comes down to a memory issue, but I'm having a hard time pinning it down and could use some guidance from the experts. I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; one has 16GB memory, the other

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Oops, wrong link! JIRA: https://github.com/apache/spark/pull/945/files Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote: Thanks Tathagata, That would be awesome if Spark streaming can support receiving rate in

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya
Hi Matei- Changing to coalesce(numNodes, true) still runs all partitions on a single node, which I verified by printing the hostname before I exec the external process. -- View this message in context:

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again! JIRA: https://issues.apache.org/jira/browse/SPARK-1341 Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Oops, wrong link! JIRA:

Spark Streaming with long batch / window duration

2014-07-18 Thread aaronjosephs
Would it be a reasonable use case of Spark Streaming to have a very large window size (let's say on the scale of weeks)? In this particular case the reduce function would be invertible, so that would aid in efficiency. I assume that having a larger batch size, since the window is so large, would also

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-18 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote: Yes: is it possible to define a custom partition strategy? Yes, you just need to create a subclass of PartitionStrategy as follows: import org.apache.spark.graphx._ object MyPartitionStrategy extends PartitionStrategy {
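The preview cuts the snippet off; a hedged completion is sketched below. The body is an illustrative choice (hash by source vertex only, so all out-edges of a vertex land in one partition), not Ankur's actual code:

```scala
import org.apache.spark.graphx._

// Illustrative custom strategy: co-locate all out-edges of each source vertex.
object MyPartitionStrategy extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    (src.abs % numParts).toInt
}

// usage: val repartitioned = graph.partitionBy(MyPartitionStrategy)
```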

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-18 Thread Ankur Dave
Sorry, I didn't read your vertex replication example carefully, so my previous answer is wrong. Here's the correct one: On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote: I don't understand, for instance, we have 3 edge partition tables(EA: a - b, a - c; EB: a - d, a - e;

Re: incompatible local class serialVersionUID with spark Shark

2014-07-18 Thread Michael Armbrust
There is no version of shark that works with spark 1.0. More details about the path forward here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html On Jul 18, 2014 4:53 AM, Megane1994 leumenilari...@yahoo.fr wrote: Hello, I want to run

Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Michael Armbrust
Sorry for the non-obvious error message. It is not valid SQL to include attributes in the select clause unless they are also in the group by clause or are inside of an aggregate function. On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi again! I am having
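For illustration, with a hypothetical table people registered against sqlContext, the failing and fixed forms look like:

```scala
// fails: `id` is neither grouped nor aggregated
// sqlContext.sql("SELECT id, name FROM people GROUP BY name")

// works: aggregate the non-grouped attribute...
sqlContext.sql("SELECT name, COUNT(id) FROM people GROUP BY name")
// ...or add it to the GROUP BY clause
sqlContext.sql("SELECT id, name FROM people GROUP BY id, name")
```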

Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread Michael Armbrust
It's likely that since your UDF is a black box to hive's query optimizer that it must choose a less efficient join algorithm that passes all possible matches to your function for comparison. This will happen any time your UDF touches attributes from both sides of the join. In general you can
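The preview truncates Michael's advice; one common rewrite along these lines, sketched under heavy assumptions (my_udf, tables a and b, and column x are all made-up names), is to precompute the UDF as a column on each side so the planner can use an ordinary equi-join:

```scala
// Hypothetical rewrite, assuming the UDF can be evaluated one side at a time:
// materialize it as a derived key column, then equi-join on that key.
val joined = hiveContext.hql(
  """SELECT ta.*, tb.*
    |FROM (SELECT *, my_udf(x) AS k FROM a) ta
    |JOIN (SELECT *, my_udf(x) AS k FROM b) tb
    |ON ta.k = tb.k""".stripMargin)
```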

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-18 Thread Michael Armbrust
Can you tell us more about your environment? Specifically, are you also running on Mesos? On Jul 18, 2014 12:39 AM, Victor Sheng victorsheng...@gmail.com wrote: when I run a query against a Hadoop file: mobile.registerAsTable("mobile") val count = sqlContext.sql("select count(1) from mobile") res5:

Re: Cannot connect to hive metastore

2014-07-18 Thread Michael Armbrust
See the section on advanced dependency management: http://spark.apache.org/docs/latest/submitting-applications.html On Jul 17, 2014 10:53 PM, linkpatrickliu linkpatrick...@live.com wrote: Seems like the mysql connector jar is not included in the classpath. Where can I set the jar to the

Re: can we insert and update with spark sql

2014-07-18 Thread Michael Armbrust
You can do insert into. As with other SQL on HDFS systems there is no updating of data. On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this what you are looking for? https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

Visualization/Summary tools for Spark Streaming data

2014-07-18 Thread Subodh Nijsure
Hi, this is my second week of working with Spark; pardon if this is an elementary question in the Spark domain. I am looking for ways to render the output of Spark Streaming. First let me describe the problem. I am monitoring (pushed from devices every minute) temperature/humidity and other environmental

Reading Avro Sequence Files

2014-07-18 Thread tcg
I'm trying to read an Avro Sequence File using the sequenceFile method on the Spark context object, and I get a NullPointerException. If I read the file outside of Spark using AvroSequenceFile.Reader I don't have any problems. Has anyone had success in doing this? Below is what I typed and saw

Re: Reading Avro Sequence Files

2014-07-18 Thread tcg
Correction: I get a null pointer exception when I attempt to perform an action like 'first'. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Avro-Sequence-Files-tp10201p10202.html Sent from the Apache Spark User List mailing list archive at

Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread aaronjosephs
I think you probably want to use `AvroSequenceFileInputFormat` with `newAPIHadoopFile`. I'm not even sure that in Hadoop you would use the SequenceFileInput format to read an Avro sequence file -- View this message in context:

Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread Sparky
Thanks for responding. I tried using the newAPIHadoopFile method and got an IOException with the message "Not a data file". If anyone has an example of this working I'd appreciate your input or examples. What I entered at the REPL and what I got back are below: val myAvroSequenceFile =

Re: Large scale ranked recommendation

2014-07-18 Thread Christopher Johnson
If you are performing recommendations via a latent factor model then I highly recommend you look into methods of approximate nearest neighbors. At Spotify we batch process top N recommendations for 40M users with a catalog of 40M items, but we avoid the naive O(n*m) process you are describing by

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Michael Armbrust
Unfortunately, this is a query where we just don't have an efficiently implementation yet. You might try switching the table order. Here is the JIRA for doing something more efficient: https://issues.apache.org/jira/browse/SPARK-2212 On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee

Re: Visualization/Summary tools for Spark Streaming data

2014-07-18 Thread bryanv
You might check out Bokeh (http://bokeh.pydata.org), which is a Python (and other languages) system for streaming and big data vis targeting the browser. I just gave a talk at SciPy 2014 where you can hear more and see examples: https://www.youtube.com/watch?v=B9NpLOyp-dI

Re: Memory compute-intensive tasks

2014-07-18 Thread rpandya
I also tried increasing --num-executors to numNodes * coresPerNode and using coalesce(numNodes*10,true), and it still ran all the tasks on one node. It seems like it is placing all the executors on one node (though not always the same node, which indicates it is aware of more than one!). I'm using

Re: Broadcasting a set in PySpark

2014-07-18 Thread Josh Rosen
You have to use `myBroadcastVariable.value` to access the broadcasted value; see https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania ved...@retentionscience.com wrote: Hi All, I am trying to broadcast a set in a

Re: Large scale ranked recommendation

2014-07-18 Thread m3.sharma
Christopher, that's a really great idea to search in the latent factor space rather than computing each entry of the matrix; the complexity of the problem drops drastically from the naive O(n*m). Since our data is not that huge I will try exact neighborhood search, then fall back to approximate if that

Re: Broadcasting a set in PySpark

2014-07-18 Thread Vedant Dhandhania
Hi Josh, I did make that change, however I get this error now: 568.492: [GC [PSYoungGen: 1412948K->207017K(1465088K)] 4494287K->3471149K(4960384K), 0.1280200 secs] [Times: user=0.23 sys=0.63, real=0.13 secs] 568.642: [Full GC Traceback (most recent call last): File "<stdin>", line 1, in <module>

Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2014-07-18 Thread _soumya_
I'm stumped with this one. I'm using YARN on EMR to distribute my spark job. While it seems initially, the job is starting up fine - the Spark Executor nodes are having trouble pulling the jars from the location on hdfs that the master just put the files on. [hadoop@ip-172-16-2-167 ~]$

RE: Hive From Spark

2014-07-18 Thread JiajiaJing
Hi Cheng Hao, Thank you very much for your reply. Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 . Some setups of the environment are done by running SPARK_HIVE=true sbt/sbt assembly/assembly, including the jar in all the workers, and copying the hive-site.xml to spark's conf dir.

Re: jar changed on src filesystem

2014-07-18 Thread cmti95035
Andrew, Yes, this works after cleaning up the .staging as you suggested. Thanks a lot! Jian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jar-changed-on-src-filesystem-tp10011p10216.html Sent from the Apache Spark User List mailing list archive at

BUG in spark-ec2 script (--ebs-vol-size) and workaround...

2014-07-18 Thread Ben Horner
Hello all, There is a bug in the spark-ec2 script (perhaps due to a change in the Amazon AMI). The --ebs-vol-size option directs the spark-ec2 script to add an EBS volume of the specified size, and mount it at /vol for a persistent HDFS. To do this, it uses mkfs.xfs which is not available

Re: Spark Streaming with long batch / window duration

2014-07-18 Thread Tathagata Das
If you want to process data that spans across weeks, then it best to use a dedicated data store (file system, sql / nosql database, etc.) that is designed for long term data storage and retrieval. Spark Streaming is not designed as a long term data store. Also it does not seem like you need low

Re: Spark Streaming with long batch / window duration

2014-07-18 Thread aaronjosephs
Unfortunately, for reasons I won't go into, my options for what I can use are limited; it was more of a curiosity to see if Spark could handle a use case like this, since the functionality I wanted fits perfectly into the reduceByKeyAndWindow frame of thinking. Anyway, thanks for answering. -- View

Java null pointer exception while saving hadoop file

2014-07-18 Thread durga
Hi, I am getting a null pointer exception while saving the data into Hadoop. Code as follows. If I change the last line to sorted_tup.take(count.toInt).foreach { case ((a, b, c), l) => sc.parallelize(l.toSeq).coalesce(1).saveAsTextFile(hdfsDir + a + "/" + b + "/" + c) }, I am able to save it. But for

Re: BUG in spark-ec2 script (--ebs-vol-size) and workaround...

2014-07-18 Thread Shivaram Venkataraman
Thanks a lot for reporting this. I think we just missed installing xfsprogs on the AMI. I have a fix for this at https://github.com/mesos/spark-ec2/pull/59. After the pull request is merged, any new clusters launched should have mkfs.xfs Thanks Shivaram On Fri, Jul 18, 2014 at 4:56 PM, Ben

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Gino Bustelo
I get TD's recommendation of sharing a connection among tasks. Now, is there a good way to determine when to close connections? Gino B. On Jul 17, 2014, at 7:05 PM, Yan Fang yanfang...@gmail.com wrote: Hi Sean, Thank you. I see your point. What I was thinking is that, do computation in

SparkSQL operator priority

2014-07-18 Thread Christos Kozanitis
Hello What is the order with which SparkSQL deserializes parquet fields? Is it possible to modify it? I am using SparkSQL to query a parquet file that consists of a lot of fields (around 30 or so). Let me call an example table MyTable and let's suppose the name of one of its fields is position.

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
That's a good question. My first thought is a timeout. Timing out after tens of seconds should be sufficient. So there should be a timer in the singleton that runs a check every second on when the singleton was last used, and closes the connection after a timeout. Any attempts to use the connection
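A minimal sketch of that timer idea; Connection is a hypothetical stand-in for a real client, nothing here is Spark API, and the 10s/1s intervals are illustrative:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical client; close() would release the real resource.
class Connection { def close(): Unit = () }

object ConnectionHolder {
  private val idleTimeoutMs = 10000L
  private var conn: Connection = null
  private var lastUsed = System.currentTimeMillis()

  // Check every second; close the connection once it has sat idle too long.
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
    new Runnable {
      def run(): Unit = ConnectionHolder.synchronized {
        if (conn != null && System.currentTimeMillis() - lastUsed > idleTimeoutMs) {
          conn.close()
          conn = null  // lazily re-opened on the next get()
        }
      }
    }, 1, 1, TimeUnit.SECONDS)

  def get(): Connection = synchronized {
    if (conn == null) conn = new Connection()
    lastUsed = System.currentTimeMillis()
    conn
  }
}
```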

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread Ankur Dave
Thanks for your interest. I should point out that the numbers in the arXiv paper are from GraphX running on top of a custom version of Spark with an experimental in-memory shuffle prototype. As a result, if you benchmark GraphX at the current master, it's expected that it will be 2-3x slower than

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
Actually, let me clarify further. There are a number of possibilities. 1. The easier, less efficient way is to create a connection object every time you do foreachPartition (as shown in the pseudocode earlier in the thread). For each partition, you create a connection, use it to push all the

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread ShreyanshB
Thanks a lot Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're

SeattleSparkMeetup: Spark at eBay - Troubleshooting the everyday issues

2014-07-18 Thread Denny Lee
We're coming off a great Seattle Spark Meetup session with Evan Chan (@evanfchan), Interactive OLAP Queries with @ApacheSpark and #Cassandra (http://www.slideshare.net/EvanChan2/2014-07olapcassspark), at Whitepages. Now, we're proud to announce that our next session is Spark at eBay -