Re: Spark Streaming: no job has started yet

2014-07-23 Thread Akhil Das
Can you paste the piece of code? Thanks Best Regards On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am running a spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4 Summary Metrics No tasks have started

Re: How could I start new spark cluster with hadoop2.0.2

2014-07-23 Thread Akhil Das
AFAIK you can use the --hadoop-major-version parameter with the spark-ec2 https://github.com/apache/spark/blob/master/ec2/spark_ec2.py script to switch the hadoop version. Thanks Best Regards On Wed, Jul 23, 2014 at 6:07 AM, durga durgak...@gmail.com wrote: Hi, I am trying to create spark
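
For reference, a spark-ec2 launch with that flag would look roughly like the following (the cluster name, key pair, and region are placeholders):

  ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -r us-east-1 \
    --hadoop-major-version=2 \
    launch my-spark-cluster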

Re: Spark clustered client

2014-07-23 Thread Nick Pentreath
At the moment your best bet for sharing SparkContexts across jobs will be Ooyala job server: https://github.com/ooyala/spark-jobserver It doesn't yet support spark 1.0 though I did manage to amend it to get it to build and run on 1.0 — Sent from Mailbox On Wed, Jul 23, 2014 at 1:21 AM, Asaf

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-23 Thread Alan Ngai
foreachRDD is how I extracted values in the first place, so that’s not going to make a difference. I don’t think it’s related to SPARK-1312 because I’m generating data every second in the first place and I’m using foreachRDD right after the window operation. The code looks something like val

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-07-23 Thread Dale Johnson
I'm having the exact same problem. I tried Mesos 0.19, 0.14.2, 0.14.1, hadoop 2.3.0, spark 0.9.1. # SIGSEGV (0xb) at pc=0x7fab70c55c4d, pid=31012, tid=140366980314880 # # JRE version: 6.0_31-b31 # Java VM: OpenJDK 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops) #

spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Haopu Wang
I have a DStream receiving data from a socket. I'm using local mode. I set spark.streaming.unpersist to false and leave spark.cleaner.ttl to be infinite. I can see files for input and shuffle blocks under spark.local.dir folder and the size of folder keeps increasing, although JVM's memory usage
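
For anyone reproducing this, the two settings in question can be set on the SparkConf before the StreamingContext is created; the values below are only illustrative:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("UnpersistTtlTest")
    .set("spark.streaming.unpersist", "false")  // keep generated RDDs instead of unpersisting them
    .set("spark.cleaner.ttl", "3600")           // periodic cleanup of old metadata/blocks, in seconds
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)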

Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
Hi, We have been using standalone spark for last 6 months and I used to run application jars fine on spark cluster with the following command. java -cp :/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar -Xms2g -Xmx2g

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Shao, Saisai
Hi Haopu, Please see the inline comments. Thanks Jerry -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 3:00 PM To: user@spark.apache.org Subject: spark.streaming.unpersist and spark.cleaner.ttl I have a DStream receiving data from a

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread buntu
If you need to run Spark apps through Hue, see if Ooyala's job server helps: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ -- View this message in context:

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
I found the issue... If you use Spark from git and generate the assembly jar, then org.apache.hadoop.io.Writable.class is packaged with it. If you use the assembly jar that ships with CDH in

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-23 Thread Alan Ngai
TD, it looks like your instincts were correct. I misunderstood what you meant. If I force an eval on the inputstream using foreachRDD, the windowing will work correctly. If I don’t do that, lazy eval somehow screws with window batches I eventually receive. Any reason the bug is categorized
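
A minimal sketch of the workaround described here, assuming a DStream named inputStream; the dummy foreachRDD forces the input stream to be evaluated before the window is applied:

  import org.apache.spark.streaming.Seconds

  // no-op output operation to force evaluation of the input stream
  inputStream.foreachRDD(rdd => rdd.foreach(_ => ()))

  // then apply the window as usual
  val windowed = inputStream.window(Seconds(30), Seconds(10))
  windowed.foreachRDD(rdd => println("window count: " + rdd.count()))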

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-07-23 Thread Dale Johnson
Okay, I finally got this. The project/SparkBuild needed to be set, and only 0.19.0 seems to work (out of 0.14.1, 0.14.2). "org.apache.mesos" % "mesos" % "0.19.0" was the one that worked. -- View this message in context:

Re: akka disassociated on GC

2014-07-23 Thread Xiangrui Meng
Yes, that's the plan. If you use broadcast, please also make sure TorrentBroadcastFactory is used, which became the default broadcast factory very recently. -Xiangrui On Tue, Jul 22, 2014 at 10:47 PM, Makoto Yui yuin...@gmail.com wrote: Hi Xiangrui, By using your treeAggregate and broadcast

Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread MEETHU MATHEW
Hi all, Sorry for taking this topic again, still I am confused on this. I set SPARK_DAEMON_JAVA_OPTS=-XX:+UseCompressedOops -Xmx8g. When I run my application, I got the following line in the logs. Spark Command: java -cp

spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
I'm using the spark-shell locally and working on a dataset of size 900MB. I initially ran into java.lang.OutOfMemoryError: GC overhead limit exceeded error and upon researching set SPARK_DRIVER_MEMORY to 4g. Now I run into ArrayIndexOutOfBoundsException, please let me know if there is some way to

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Bertrand Dechoux
Is there any documentation from Cloudera on how to run Spark apps on CDH Manager-deployed Spark? Asking the Cloudera community would be a good idea. http://community.cloudera.com/ In the end, only Cloudera will quickly fix issues with CDH... Bertrand Dechoux On Wed, Jul 23, 2014 at 9:28 AM,

Spark execution plan

2014-07-23 Thread Luis Guerra
Hi all, I was wondering how Spark may deal with an execution plan. Using PIG as an example and its DAG execution, I would like to manage Spark for a similar solution. For instance, if my code has 3 different parts, where A and B are self-sufficient parts: Part A: .. . . var output_a Part

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Haopu Wang
Jerry, thanks for the response. For the default storage level of DStream, it looks like Spark's document is wrong. In this link: http://spark.apache.org/docs/latest/streaming-programming-guide.html#memory-tuning It mentions: Default persistence level of DStreams: Unlike RDDs, the default

Re: hadoop version

2014-07-23 Thread mrm
Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-version-tp10405p10485.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

driver memory

2014-07-23 Thread mrm
Hi, How do I increase the driver memory? These are my configs right now: sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties # Environment variables and Spark

Re: Need info on log4j.properties for apache spark.

2014-07-23 Thread Sean Owen
You specify your own log4j configuration in the usual log4j way -- package it in your assembly, or specify on the command line for example. See http://logging.apache.org/log4j/1.2/manual.html The template you can start with is in core/src/main/resources/org/apache/spark/log4j-defaults.properties
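
One common way to wire that up at the time (file paths are placeholders; the extraJavaOptions settings shown are standard Spark configs, but check the docs for your exact version):

  # start from the bundled template
  cp core/src/main/resources/org/apache/spark/log4j-defaults.properties /path/to/my-log4j.properties

  # conf/spark-defaults.conf -- the file must be reachable on the driver and on every executor node
  spark.driver.extraJavaOptions   -Dlog4j.configuration=file:/path/to/my-log4j.properties
  spark.executor.extraJavaOptions -Dlog4j.configuration=file:/path/to/my-log4j.properties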

Re: java.lang.StackOverflowError when calling count()

2014-07-23 Thread lalit1303
Hi, Thanks TD for your reply. I am still not able to resolve the problem for my use case. I have, let's say, 1000 different RDDs, and I am applying a transformation function on each RDD, and I want the output of all RDDs combined into a single output RDD. For this, I am doing the following: *Loop

Re: driver memory

2014-07-23 Thread mrm
Hi, I figured out my problem so I wanted to share my findings. I was basically trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast, I got an OutOfMemory error. I fixed my problem by increasing the driver memory using:

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-23 Thread chutium
In Spark 1.1 it may not be as easy as in Spark 1.0, after this commit: https://issues.apache.org/jira/browse/SPARK-2446 Only binary columns with the UTF8 annotation will be recognized as strings after this commit, but in Impala strings are always written without the UTF8 annotation. -- View this message in context:

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Shao, Saisai
Yeah, the document may not be precisely aligned with latest code, so the best way is to check the code. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 5:56 PM To: user@spark.apache.org Subject: RE: spark.streaming.unpersist and

Down-scaling Spark on EC2 cluster

2014-07-23 Thread Shubhabrata
Hello, We plan to use Spark on EC2 for our data science pipeline. We successfully managed to set up the cluster as well as launch and run applications on remote clusters. However, to enhance scalability we would like to implement auto-scaling in EC2 for Spark applications, but I did not find any

Configuring Spark Memory

2014-07-23 Thread Martin Goodson
We are having difficulties configuring Spark, partly because we still don't understand some key concepts. For instance, how many executors are there per machine in standalone mode? This is after having closely read the documentation several times:

Re: Spark execution plan

2014-07-23 Thread Luis Guerra
Thanks for your answer. However, there has been a misunderstanding here. My question is related to controlling the execution in parallel of different parts of code, similarly to PIG, where there is a planning phase before the execution. On Wed, Jul 23, 2014 at 1:46 PM, chutium teng@gmail.com

Re: Down-scaling Spark on EC2 cluster

2014-07-23 Thread Akhil Das
Hi, Currently this is not supported out of the box. But you can of course add/remove workers in a running cluster. A better option would be to use a Mesos cluster, where adding/removing nodes is quite simple. But again, I believe adding a new worker in the middle of a task won't give you better

Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Chris Schneider
We are starting to use Spark, but we don't have any existing infrastructure related to big-data, so we decided to setup the standalone cluster, rather than mess around with Yarn or Mesos. But it appears like the driver program has to stay up on the client for the full duration of the job (client

Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Gary Malouf
I am aware that today PySpark can not load sequence files directly. Are there work-arounds people are using (short of duplicating all the data to text files) for accessing this data?

Re: Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Nick Pentreath
Load from sequenceFile for PySpark is in master and save is in this PR underway (https://github.com/apache/spark/pull/1338) I hope that Kan will have it ready to merge in time for 1.1 release window (it should be, the PR just needs a final review or two). In the meantime you can check out master

Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Steve Nunez
I'm also in early stages of setting up long running Spark jobs. Easiest way I've found is to set up a cluster and submit the job via YARN. Then I can come back and check in on progress when I need to. Seems the trick is tuning the queue priority and YARN preemption to get the job to run in a

Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
Hi Martin, In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So for example if you have two shells open, they'll each get two JVMs on each worker machine in the cluster. As far as the other docs, you can configure the total number of cores
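
To make that concrete, the per-application knobs are set on the SparkConf (the master URL and numbers are only examples):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MyApp")
    .setMaster("spark://master-host:7077")
    .set("spark.executor.memory", "3g")  // heap used by this app's executor on each worker
    .set("spark.cores.max", "8")         // cap on the total cores this app takes across the cluster
  val sc = new SparkContext(conf)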

Have different reduce key than mapper key

2014-07-23 Thread soumick86
How can I transform the mapper key at the reducer output? The functions I have encountered are combineByKey, reduceByKey, etc., which work on the values and not on the key. For example below, this is what I want to achieve, but it seems like I can only have K1 and not K2:

Re: Configuring Spark Memory

2014-07-23 Thread Sean Owen
On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote: In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So for example if you have two shells open, they'll each get two JVMs on each worker machine in the cluster. Dumb

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
I have the same issue. val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif") works perfectly fine with a.first, but val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif") does not work; d.first gives the following error message: java.io.FileNotFoundException:

Lost executors

2014-07-23 Thread Eric Friedman
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via Yarn and I am seeing these errors quite often. ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka client disassociated This is in an interactive shell

Re: Configuring Spark Memory

2014-07-23 Thread Martin Goodson
Thanks Andrew, So if there is only one SparkContext there is only one executor per machine? This seems to contradict Aaron's message from the link above: If each machine has 16 GB of RAM and 4 cores, for example, you might set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by

Re: Have different reduce key than mapper key

2014-07-23 Thread Nathan Kronenfeld
You do them sequentially in code; Spark will take care of combining them in execution. so something like: foo.map(fcn to [K1, V1]).reduceByKey(fcn from (V1, V1) to V1).map(fcn from (K1, V1) to (K2, V2)) On Wed, Jul 23, 2014 at 11:22 AM, soumick86 sdasgu...@dstsystems.com wrote: How can I
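
A small self-contained version of that pattern (the word-count example is made up for illustration): counts are reduced under the original key K1 (the word) and then re-keyed by K2 (the word length):

  import org.apache.spark.rdd.RDD

  val counts: RDD[(String, Int)] =                  // keyed by K1
    sc.parallelize(Seq("a", "bb", "bb", "ccc"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

  val byLength: RDD[(Int, Int)] =                   // re-keyed to K2 after the reduce
    counts.map { case (word, n) => (word.length, n) }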

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Nicholas Chammas
Do we have a JIRA issue to track this? I think I've run into a similar issue. On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai yh...@databricks.com wrote: It is caused by a bug in Spark REPL. I still do not know which part of the REPL code causes it... I think people working REPL may have better

Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Chris Schneider
Thanks Steve, but my goal is to hopefully avoid installing yet another component into my environment. Yarn is cool, but wouldn't be used for anything but Spark. We have no hadoop in our ecosystem (or HDFS). Ideally I'd avoid having to learn about yet another tool. On Wed, Jul 23, 2014 at 11:12

Re: Down-scaling Spark on EC2 cluster

2014-07-23 Thread Nicholas Chammas
There is a JIRA issue to track adding such functionality to spark-ec2: SPARK-2008 https://issues.apache.org/jira/browse/SPARK-2008 - Enhance spark-ec2 to be able to add and remove slaves to an existing cluster. On Wed, Jul 23, 2014 at 10:12 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi

akka 2.3.x?

2014-07-23 Thread Lee Mighdoll
Is there a branch somewhere with current Spark on a current version of Akka? I'm trying to embed Spark into a spray app. I can probably backport our app to 2.2.3 for a little while, but I wouldn't want to be stuck there too long. Related: do I need the protobuf shading if I'm using the

Re: How could I start new spark cluster with hadoop2.0.2

2014-07-23 Thread durga
Thanks Akhil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-start-new-spark-cluster-with-hadoop2-0-2-tp10450p10514.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Yin Huai
Yes, https://issues.apache.org/jira/browse/SPARK-2576 is used to track it. On Wed, Jul 23, 2014 at 9:11 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we have a JIRA issue to track this? I think I've run into a similar issue. On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Marcelo Vanzin
Discussions about how CDH packages Spark aside, you should be using the spark-class script (assuming you're still in 0.9) instead of executing Java directly. That will make sure that the environment needed to run Spark apps is set up correctly. CDH 5.1 ships with Spark 1.0.0, so it has

Re: How could I start new spark cluster with hadoop2.0.2

2014-07-23 Thread durga
Hi, It seems I can only give --hadoop-major-version=2, and it is taking 2.0.0. How could I say it should use 2.0.2? Is there any --hadoop-minor-version variable I can use? Thanks, D. -- View this message in context:

error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
with this will be great! scalac -classpath /apps/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar:/opt/cloudera/parcels/CDH/lib/spark/core/lib/spark-core_2.10-1.0.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH/lib/spark/lib/kryo-2.21.jar:/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-io-2.4

Re: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sean Owen
with this will be great! scalac -classpath /apps/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar:/opt/cloudera/parcels/CDH/lib/spark/core/lib/spark-core_2.10-1.0.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH/lib/spark/lib/kryo-2.21.jar:/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-io-2.4.jar

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Just wanted to add more info.. I was using SparkSQL, reading in the tab-delimited raw data files and converting the timestamp to Date format: sc.textFile("rawdata/*").map(_.split("\t")).map(p => Point(df.format(new Date( p(0).trim.toLong*1000L )), p(1), p(2).trim.toInt, p(3).trim.toInt, p(4).trim.toInt

why there is only getString(index) but no getString(columnName) in catalyst.expressions.Row.scala ?

2014-07-23 Thread chutium
I do not want to always use schemaRDD.map { case Row(xxx) => ... }; using case we must rewrite the table schema again. Is there any plan to implement this? Thanks -- View this message in context:

spark-submit to remote master fails

2014-07-23 Thread didi
Hi all, I guess the problem has something to do with the fact that I submit the job to a remote location. I submit from an OracleVM running Ubuntu and suspect some NAT issues maybe? akka.tcp tries this address, as shown in the STDERR print which is appended: akka.tcp://spark@LinuxDevVM.local:59266 STDERR

Re: Graphx : Perfomance comparison over cluster

2014-07-23 Thread ShreyanshB
Thanks Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're interested in

Re: spark-submit to remote master fails

2014-07-23 Thread Andrew Or
Yeah, seems to be the case. In general your executors should be able to reach the driver, which I don't think is the case for you currently (LinuxDevVM.local:59266 seems very private). What you need is some sort of gateway node that can be publicly reached from your worker machines to launch your

Re: Lost executors

2014-07-23 Thread Andrew Or
Hi Eric, Have you checked the executor logs? It is possible they died because of some exception, and the message you see is just a side effect. Andrew 2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com: I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.

Convert raw data files to Parquet format

2014-07-23 Thread buntu
I wanted to experiment with using Parquet data with SparkSQL. I got some tab-delimited files and wanted to know how to convert them to Parquet format. I'm using standalone spark-shell. Thanks! -- View this message in context:

Re: Convert raw data files to Parquet format

2014-07-23 Thread Michael Armbrust
Take a look at the programming guide for spark sql: http://spark.apache.org/docs/latest/sql-programming-guide.html On Wed, Jul 23, 2014 at 11:09 AM, buntu buntu...@gmail.com wrote: I wanted to experiment with using Parquet data with SparkSQL. I got some tab-delimited files and wanted to know
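
For the archives, the approach from that guide boils down to something like this in the shell; the Record schema and paths are made up:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes

  case class Record(id: String, value: Int)

  val records = sc.textFile("rawdata/*")
    .map(_.split("\t"))
    .map(p => Record(p(0), p(1).trim.toInt))

  records.saveAsParquetFile("data.parquet")             // write as Parquet
  val readBack = sqlContext.parquetFile("data.parquet") // read it back as a SchemaRDD
  readBack.registerAsTable("records")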

Re: How to do an interactive Spark SQL

2014-07-23 Thread hsy...@gmail.com
Anyone has any idea on this? On Tue, Jul 22, 2014 at 7:02 PM, hsy...@gmail.com hsy...@gmail.com wrote: But how do they do the interactive sql in the demo? https://www.youtube.com/watch?v=dJQ5lV5Tldw And if it can work in the local mode. I think it should be able to work in cluster mode,

Re: Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread Andrew Or
Hi Meethu, SPARK_DAEMON_JAVA_OPTS is not intended for setting memory. Please use SPARK_DAEMON_MEMORY instead. It turns out that java respects only the last -Xms and -Xmx values, and in spark-class we put SPARK_DAEMON_JAVA_OPTS before the SPARK_DAEMON_MEMORY. In general, memory configuration in
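
In other words, something like this in conf/spark-env.sh (the value is illustrative):

  # conf/spark-env.sh
  SPARK_DAEMON_MEMORY=8g                            # heap for the standalone master/worker daemons
  SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops"   # keep only non-memory JVM flags here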

Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Hello, Most of the tasks I've accomplished in Spark were fairly straightforward, but I can't figure out the following problem using the Spark API: Basically, I have an IP with a bunch of user IDs associated with it. I want to create a list of all user IDs that are associated together, even if some are

Re: driver memory

2014-07-23 Thread Andrew Or
Hi Maria, SPARK_MEM is actually deprecated because it was too general; the reason it worked was because SPARK_MEM applies to everything (drivers, executors, masters, workers, history servers...). In favor of more specific configs, we broke this down into SPARK_DRIVER_MEMORY and
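
For completeness, the more specific forms look roughly like this (the 4g value, class name, and jar name are placeholders):

  # environment variable form
  export SPARK_DRIVER_MEMORY=4g

  # or per-submit
  ./bin/spark-submit --driver-memory 4g --class my.app.Main my-app.jar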

spark github source build error

2014-07-23 Thread m3.sharma
I am trying to build spark after cloning from github repo: I am executing: ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly I am getting following error: [warn] ^ [error] [error] while compiling:

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class However, when I start my spark shell: spark-shell --jars /apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar /apps/sameert/software/approxstrmatch/target

Re: combineByKey at ShuffledDStream.scala

2014-07-23 Thread Bill Jay
The streaming program contains the following main stages: 1. receive data from Kafka 2. preprocessing of the data. These are all map and filtering stages. 3. Group by a field 4. Process the groupBy results using map. Inside this processing, I use collect, count. Thanks! Bill On Tue, Jul 22,

Re: Spark Streaming: no job has started yet

2014-07-23 Thread Bill Jay
The code is pretty long. But the main idea is to consume from Kafka, preprocess the data, and groupBy a field. I use multiple DStreams to add parallelism to the consumer. It seems when the number of DStreams is large, this happens often. Thanks, Bill On Tue, Jul 22, 2014 at 11:13 PM, Akhil Das

Re: spark github source build error

2014-07-23 Thread Xiangrui Meng
try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote: I am trying to build spark after cloning from github repo: I am executing: ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly I am getting following error: [warn]

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
Thanks Michael. If I read in multiple files and attempt to saveAsParquetFile() I get the ArrayIndexOutOfBoundsException. I don't see this if I try the same with a single file: case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String ) val point =

Error in History UI - Seeing stdout/stderr

2014-07-23 Thread balvisio
Hello all, I have noticed what I think is an erroneous behavior when using the WebUI: 1. Launching App from Eclipse to a cluster (with 3 workers) 2. Using Spark 0.9.0 (Cloudera distr 5.0.1) 3. The Application makes the worker write to the stdout using System.out.println(...) When the

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
That seems to be the issue, when I reduce the number of fields it works perfectly fine. Thanks again Michael.. that was super helpful!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10541.html Sent from

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Turns out to be an issue with the number of fields being read; one of the fields might be missing from the raw data file, causing this error. Michael Armbrust pointed it out in another thread. -- View this message in context:
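
A defensive tweak along those lines, guarding on the number of fields before building the Point case class quoted earlier in the thread (the simplified field handling is only a sketch):

  case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)

  val points = sc.textFile("rawdata/*")
    .map(_.split("\t"))
    .filter(_.length >= 6)   // skip rows with missing fields instead of hitting ArrayIndexOutOfBounds
    .map(p => Point(p(0), p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5)))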

Spark cluster spanning multiple data centers

2014-07-23 Thread Ray Qiu
Hi, Is it feasible to deploy a Spark cluster spanning multiple data centers if there are fast connections with not-too-high latency (30ms) between them? I don't know whether there are any presumptions in the software expecting all cluster nodes to be local (super low latency, for instance). Has

using shapeless in spark to optimize data layout in memory

2014-07-23 Thread Koert Kuipers
hello all, in case anyone is interested, i just wrote a short blog about using shapeless in spark to optimize data layout in memory. blog is here: http://tresata.com/tresata-open-sources-spark-columnar code is here: https://github.com/tresata/spark-columnar

Re: spark github source build error

2014-07-23 Thread m3.sharma
Thanks, it works now :) On Wed, Jul 23, 2014 at 11:45 AM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n10537...@n3.nabble.com wrote: try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma [hidden email]

Re: Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-23 Thread Sean Owen
So, given sets, you are joining overlapping sets until all of them are mutually disjoint, right? If graphs are out, then I also would love to see a slick distributed solution, but couldn't think of one. It seems like a cartesian product won't scale. You can write a simple method to implement

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
That worked for me as well, I was using spark 1.0 compiled against Hadoop 1.0, switching to 1.0.1 compiled against hadoop 2 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p10547.html Sent from the Apache Spark

Re: Spark job tracker.

2014-07-23 Thread abhiguruvayya
Is there anything equivalent to a Hadoop Job (org.apache.hadoop.mapreduce.Job) in Spark? Once I submit the Spark job I want to concurrently read the SparkListener interface implementation methods where I can grab the job status. I am trying to find a way to wrap the spark submit object into one

Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc. I think this is the timestamp for the beginning of each

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend all 0.9.x users to upgrade to this stable release. Contributions to this release came

persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread durga
Hi All, I have a question. For my company, we are planning to use the spark-ec2 scripts to create a cluster for us. I understand that persistent HDFS will make the HDFS available across cluster restarts. Question is: 1) What happens if I destroy and re-create? Do I lose the data? a) If I

Re: Where is the PowerGraph abstraction

2014-07-23 Thread Ankur Dave
We removed the PowerGraph abstraction layer (https://github.com/apache/spark/commit/732333d78e46ee23025d81ca9fbe6d1e13e9f253) when merging GraphX into Spark to reduce the maintenance costs. You can still read the code

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
/JaccardScore.classapproxstrmatch/JaccardScore$$anonfun$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class However, when I start my spark shell: spark-shell --jars /apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar

Re: Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Ah yes, you're quite right; with partitions I could probably process a good chunk of the data, but I didn't think a reduce would work. Sorry, I'm still new to Spark and MapReduce in general, but I thought that the reduce result wasn't an RDD and had to fit into memory. If the result of a reduce can

streaming sequence files?

2014-07-23 Thread Barnaby
If I save an RDD as a sequence file such as: val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.foreachRDD( d => { d.saveAsSequenceFile("tachyon://localhost:19998/files/WordCounts-" + (new SimpleDateFormat("MMdd-HHmmss") format

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-23 Thread kytay
Hi TD, You are right, I did not include \n to delimit the flushed string. That's the reason. Is there a way for me to define the delimiter, like SOH or ETX instead of \n? Regards kytay -- View this message in context:
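
socketTextStream is tied to newline-delimited UTF-8 text, but the lower-level ssc.socketStream takes a converter function, so a custom delimiter such as ETX can be handled there; a rough sketch (the host, port, and delimiter choice are assumptions):

  import java.io.{BufferedReader, InputStream, InputStreamReader}
  import org.apache.spark.storage.StorageLevel

  // split an InputStream into strings on an arbitrary delimiter character
  def delimitedIterator(in: InputStream, delim: Char): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    new Iterator[String] {
      private var buffered = read()
      private def read(): Option[String] = {
        val sb = new StringBuilder
        var c = reader.read()
        while (c != -1 && c.toChar != delim) { sb.append(c.toChar); c = reader.read() }
        if (c == -1 && sb.isEmpty) None else Some(sb.toString)
      }
      def hasNext = buffered.isDefined
      def next() = { val v = buffered.get; buffered = read(); v }
    }
  }

  // '\u0003' is ETX, as asked above
  val records = ssc.socketStream("localhost", 9999,
    (in: InputStream) => delimitedIterator(in, '\u0003'),
    StorageLevel.MEMORY_AND_DISK_SER_2)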

Re: Get Spark Streaming timestamp

2014-07-23 Thread Tobias Pfeiffer
Bill, Spark Streaming's DStream provides overloaded methods for transform() and foreachRDD() that allow you to access the timestamp of a batch: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream I think the timestamp is the end of the batch, not
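
For reference, the overload that exposes the batch time looks like this (the output path is a placeholder):

  wordCounts.foreachRDD { (rdd, time) =>
    // 'time' is the Time of the batch; milliseconds gives the epoch timestamp
    rdd.saveAsTextFile("output/result-" + time.milliseconds)
  }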

Re: Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
For what it's worth, I got it to work with a Cartesian product; even if it's very inefficient, it worked out all right for me. The trick was to flatMap it (step 4) after the Cartesian product so that I could do a reduceByKey and find the commonalities. After I was done, I could check if any Value

Re: persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread Mayur Rustagi
Yes, you lose the data. You can add machines, but it will require you to restart the cluster. Also, adding is manual; you add the nodes yourself. Regards Mayur On Wednesday, July 23, 2014, durga durgak...@gmail.com wrote: Hi All, I have a question. For my company, we are planning to use spark-ec2 scripts to

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Mayur Rustagi
Have a look at broadcast variables. On Tuesday, July 22, 2014, Parthus peng.wei@gmail.com wrote: Hi there, I was wondering if anybody could help me find an efficient way to make a MapReduce program like this: 1) For each map function, it needs to access some huge files, which are around

Re: Lost executors

2014-07-23 Thread Eric Friedman
hi Andrew, Thanks for your note. Yes, I see a stack trace now. It seems to be an issue with python interpreting a function I wish to apply to an RDD. The stack trace is below. The function is a simple factorial: def f(n): if n == 1: return 1 return n * f(n-1) and I'm trying to use it

Re: Lost executors

2014-07-23 Thread Eric Friedman
And... PEBCAK I mistakenly believed I had set PYSPARK_PYTHON to a python 2.7 install, but it was on a python 2.6 install on the remote nodes, hence incompatible with what the master was sending. Have set this to point to the correct version everywhere and it works. Apologies for the false

Re: persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread durga
Thanks Mayur. is there any documentation/readme with step by step process available for adding or deleting nodes? Thanks, D. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/persistent-HDFS-instance-for-cluster-restarts-destroys-tp10551p10565.html Sent from

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Aaron Davidson
In particular, take a look at the TorrentBroadcast, which should be much more efficient than HttpBroadcast (which was the default in 1.0) for large files. If you find that TorrentBroadcast doesn't work for you, then another way to solve this problem is to place the data on all nodes' local disks,
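
A hedged sketch combining both suggestions: the spark.broadcast.factory setting that existed at the time plus a plain sc.broadcast of the large read-only table (the table contents here are dummy data):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("BroadcastExample")
    .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
  val sc = new SparkContext(conf)

  // stand-in for the large read-only lookup data
  val bigTable: Map[String, Int] = (1 to 1000000).map(i => ("key" + i, i)).toMap
  val bcast = sc.broadcast(bigTable)

  val hits = sc.textFile("input.txt")
    .map(line => bcast.value.getOrElse(line, 0))   // tasks read the broadcast copy instead of re-shipping it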

Re: Configuring Spark Memory

2014-07-23 Thread Nishkam Ravi
See if this helps: https://github.com/nishkamravi2/SparkAutoConfig/ It's a very simple tool for auto-configuring default parameters in Spark. Takes as input high-level parameters (like number of nodes, cores per node, memory per node, etc) and spits out default configuration, user advice and

Re: Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi Tobias, It seems this parameter is an input to the function. What I am expecting is output from a function that tells me the starting or ending time of the batch. For instance, If I use saveAsTextFiles, it seems DStream will generate a batch every minute and the starting time is a complete