Re: Shark does not give any results with SELECT count(*) command

2014-03-25 Thread qingyang li
reopen this thread because i encounter this problem again. Here is my env: scala 2.10.3, spark 0.9.0 standalone mode, shark 0.9.0 (downloaded the source code and built it myself), hive hive-shark-0.11. I have copied hive-site.xml from my hadoop cluster; its hive version is 0.12. After copying, i

Re: Java API - Serialization Issue

2014-03-25 Thread santhoma
This worked great. Thanks a lot -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-Serialization-Issue-tp1460p3178.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

graph.persist error

2014-03-25 Thread moxing
Hi, I am dealing with a graph consisting of 20 million nodes and 2 billion edges. When I try to persist the graph, an exception is thrown: Caused by: java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level Here is my code: def main

How to set environment variable for a spark job

2014-03-25 Thread santhoma
Hello I have a requirement to set some env values for my spark jobs. Does anyone know how to set them? Specifically the following variables: 1) ORACLE_HOME 2) LD_LIBRARY_PATH thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-environment-varia

Re: How to set environment variable for a spark job

2014-03-25 Thread Sourav Chandra
You can pass them in the environment map used to create spark context. On Tue, Mar 25, 2014 at 2:29 PM, santhoma wrote: > Hello > > I have a requirement to set some env values for my spark jobs. > Does anyone know how to set them? Specifically following variables: > > 1) ORACLE_HOME > 2) LD_LIB
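
A minimal sketch of the environment-map approach described above, assuming the Spark 0.9 Scala constructor that accepts an environment Map (all paths and names below are illustrative placeholders):

    import org.apache.spark.SparkContext

    // Pass executor environment variables through the environment map
    // accepted by the SparkContext constructor.
    val env = Map(
      "ORACLE_HOME" -> "/opt/oracle/product/11.2.0",
      "LD_LIBRARY_PATH" -> "/opt/oracle/product/11.2.0/lib"
    )
    val sc = new SparkContext(
      "spark://master:7077",  // cluster URL
      "MyJob",                // application name
      "/opt/spark",           // Spark home on the workers
      Seq("myjob.jar"),       // jars shipped to executors
      env                     // environment variables set on each executor
    )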

Re: N-Fold validation and RDD partitions

2014-03-25 Thread Jaonary Rabarisoa
There is also a "randomSplit" method in the latest version of spark https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala On Tue, Mar 25, 2014 at 1:21 AM, Holden Karau wrote: > There is also https://github.com/apache/spark/pull/18 against the c
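
As a rough illustration, n-fold splits with randomSplit could look like the sketch below (assuming the RDD API linked above; sc is an existing SparkContext and the data is a placeholder):

    // Split an RDD into 5 roughly equal folds; the weights are normalized internally.
    val data = sc.parallelize(1 to 1000)
    val folds = data.randomSplit(Array.fill(5)(1.0), seed = 42L)
    // Use one fold for validation and union the rest for training.
    val validation = folds(0)
    val training = folds.drop(1).reduce(_ union _)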

tracking resource usage for spark-shell commands

2014-03-25 Thread Bharath Bhushan
Is there a way to see the resource usage of each spark-shell command — say time taken and memory used? I checked the WebUI of spark-shell and of the master and I don’t see any such breakdown. I see the time taken in the INFO logs but nothing about memory usage. It would also be nice to track the

Re: Shark does not give any results with SELECT count(*) command

2014-03-25 Thread Praveen R
Hi Qingyang Li, Shark-0.9.0 uses a patched version of hive-0.11, and using the configuration/metastore of hive-0.12 could be incompatible. May I know the reason you are using hive-site.xml from the previous hive version (to use an existing metastore?). Otherwise, you might just leave hive-site.xml blank. S

Re: Reply: Reply: RDD usage

2014-03-25 Thread hequn cheng
Hi~ I wrote a program to test. The non-idempotent "compute" function in foreach does change the value of the RDD. It may look a little crazy to do so, since modifying the RDD will make it impossible to keep the RDD fault-tolerant in spark :) 2014-03-25 11:11 GMT+08:00 林武康 : > Hi hequn, I dig into the sourc

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Andrew Ash
Possibly one of your executors is in the middle of a large stop-the-world GC and doesn't respond to network traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc) that might help with debugging. Andrew On Mon, Mar 2

Re: Pig on Spark

2014-03-25 Thread lalit1303
Hi, I have been following Aniket's spork github repository. https://github.com/aniket486/pig I have done all the changes mentioned in the recently modified pig-spark file. I am using: hadoop 2.0.5 alpha, spark-0.8.1-incubating, mesos 0.16.0 ##PIG variables export *HADOOP_CONF_DIR*=$HADOOP_INSTALL/etc/

Worker Threads Vs Spark Executor Memory

2014-03-25 Thread Annamalai, Sai IN BLR STS
Hi All, 1) Does the number of worker threads bear any relationship to setting executor memory? I have 16 GB RAM with an 8-core processor. I had set SPARK_MEM to 12g and was running locally with the default 1 thread. So this means there can be at most one executor in one node scheduled at any

Change print() in JavaNetworkWordCount

2014-03-25 Thread Eduardo Costa Alfaia
Hi Guys, I think that I already asked this question, but I don't remember if anyone has answered me. I would like to change, in the print() function, the number of words and the frequency counts that are sent to the driver's screen. The default value is 10. Could anyone help me with this? Best Rega

Re: Change print() in JavaNetworkWordCount

2014-03-25 Thread Sourav Chandra
You can extend DStream and override the print() method. Then you can create your own DStream extending from this. On Tue, Mar 25, 2014 at 6:07 PM, Eduardo Costa Alfaia < e.costaalf...@unibs.it> wrote: > Hi Guys, > I think that I already did this question, but I don't remember if anyone > has answere
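
A sketch of an alternative that avoids subclassing, written in Scala only to show the idea (the original example is Java; wordCounts stands in for the counted DStream and n is whatever limit is needed):

    // Print the first n elements of each batch instead of the default 10.
    val n = 100
    wordCounts.foreachRDD { rdd =>
      rdd.take(n).foreach(println)
    }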

tuple as keys in pyspark show up reversed

2014-03-25 Thread Friso van Vollenhoven
Hi, I have an example where I use a tuple of (int, int) in Python as the key for an RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the two ints reversed in order (which is problematic, as the ordering is part of the key). Here is an ipython notebook that has some code and demonstr

Re: tuple as keys in pyspark show up reversed

2014-03-25 Thread Friso van Vollenhoven
OK, forget about this question. It was a nasty, one character typo in my own code (sorting by rating instead of item at one point). Best, Friso On Tue, Mar 25, 2014 at 1:53 PM, Friso van Vollenhoven < f.van.vollenho...@gmail.com> wrote: > Hi, > > I have an example where I use a tuple of (int,int

K-means faster on Mahout then on Spark

2014-03-25 Thread Egor Pahomov
Hi, I'm running a benchmark which compares Mahout and SparkML. For now I have the following results for k-means: Number of iterations = 10, number of elements = 1000, mahout time = 602, spark time = 138. Number of iterations = 40, number of elements = 1000, mahout time = 1917, spark time = 330. Number of ite

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Guillaume Pitel (eXenSa)
Maybe with "MEMORY_ONLY", spark has to recompute the RDDs several times because they don't fit in memory. It makes things run slower. As a general safe rule, use MEMORY_AND_DISK_SER. Guillaume Pitel - Président d'eXenSa Prashant Sharma wrote: >I think Mahout uses FuzzyKmeans, which is dif
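
For reference, a one-line sketch of the suggested storage level (points is a placeholder for the RDD being cached):

    import org.apache.spark.storage.StorageLevel

    // Cache serialized and spill to disk rather than recomputing partitions
    // that do not fit in memory.
    val cached = points.persist(StorageLevel.MEMORY_AND_DISK_SER)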

RE: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Adrian Mocanu
Thanks TD! Is it possible to perhaps add another window method that doesn't generate partial windows? Or, is it possible to remove the first few partial windows? I'm thinking of using an accumulator to count how many windows there are. -A -Original Message- From: Tathagata Das [mail

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Suneel Marthi
Mahout does have a kmeans which can be executed in mapreduce and iterative modes. Sent from my iPhone > On Mar 25, 2014, at 9:25 AM, Prashant Sharma wrote: > > I think Mahout uses FuzzyKmeans, which is different algorithm and it is not > iterative. > > Prashant Sharma > > >> On Tue, Mar 2

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Egor Pahomov
Mahout used MR and ran one MR job on every iteration. It worked as predicted. My question is more about why spark was so slow. I would try MEMORY_AND_DISK_SER 2014-03-25 17:58 GMT+04:00 Suneel Marthi : > Mahout does have a kmeans which can be executed in mapreduce and iterative > modes. > > Sent from

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Prashant Sharma
I think Mahout uses FuzzyKmeans, which is different algorithm and it is not iterative. Prashant Sharma On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov wrote: > Hi, I'm running benchmark, which compares Mahout and SparkML. For now I > have next results for k-means: > Number of iterations= 10, numb

RE: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Adrian Mocanu
Let me rephrase that, Do you think it is possible to use an accumulator to skip the first few incomplete RDDs? -Original Message- From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-25-14 9:57 AM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: RE: [b

Running a task once on each executor

2014-03-25 Thread deenar.toraskar
Hi Is there a way in Spark to run a function on each executor just once? I have a couple of use cases. a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this

Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Julien Carme
Hello, I have been struggling for ages to use an external jar in my spark driver program, in yarn-standalone mode. I just want to use in my main program, outside the calls to spark functions, objects that are defined in another jar. I tried to set SPARK_CLASSPATH, ADD_JAR, I tried to use --addJar

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say "just once", have you defined what that means across multiple contexts (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that maintains

ClassCastException when using saveAsTextFile

2014-03-25 Thread Niko Stahl
Hi, I'm trying to save an RDD to HDFS with the saveAsTextFile method on my ec2 cluster and am encountering the following exception (the app is called GraphTest): Exception failure: java.lang.ClassCastException: cannot assign instance of GraphTest$$anonfun$3 to field org.apache.spark.rdd.MappedRDD

Implementation problem with Streaming

2014-03-25 Thread Sanjay Awatramani
Hi, I had initially thought of a streaming approach to solve my problem, but I am stuck at a few places and want an opinion on whether this problem is suitable for streaming, or whether it is better to stick to basic spark. Problem: I get chunks of log files in a folder and need to do some analysis on them on an h

Re: Running a task once on each executor

2014-03-25 Thread deenar.toraskar
Christopher It is once per JVM. TaskNonce would meet my needs. I guess if I want it once per thread, then a ThreadLocal would do the same. But how do I invoke TaskNonce? What is the best way to generate an RDD to ensure that there is one element per executor? Deenar -- View this message in c

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Nathan Kronenfeld
After digging deeper, I realized all the workers ran out of memory, giving an hs_error.log file in /tmp/jvm- with the header: # Native memory allocation (malloc) failed to allocate 2097152 bytes for committing reserved memory. # Possible reasons: # The system is out of physical RAM or swap space

Re: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Sandy Ryza
Hi Julien, Have you called SparkContext#addJars? -Sandy On Tue, Mar 25, 2014 at 10:05 AM, Julien Carme wrote: > Hello, > > I have been struggling for ages to use an external jar in my spark driver > program, in yarn-standalone mode. I just want to use in my main program, > outside the calls to
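
For reference, a minimal sketch of that call, assuming SparkContext.addJar and an illustrative path (this ships the jar to the executors at runtime):

    // Make an extra jar available to tasks running on the executors.
    sc.addJar("/path/to/myotherjar.jar")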

Re: ClassCastException when using saveAsTextFile

2014-03-25 Thread Niko Stahl
Ok, so I've been able to narrow down the problem to this specific case:
    def toCsv(userTuple: String) = { "a,b,c" }
    val dataTemp = Array("line1", "line2")
    val dataTempDist = sc.parallelize(dataTemp)
    val usersFormatted = dataTempDist.map(toCsv)
    usersFormatted.saveAsTextFile("hdfs://" + masterDomain +

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this:
    public class TaskNonce {
      private transient boolean mIsAlreadyDone;
      private static transient TaskNonce mSingleton = new TaskNonce();
      private transient Object mSyncObject = new Object();
      public TaskNonce getSing
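
Since the snippet above is cut off, here is a rough Scala sketch of the same once-per-JVM idea (all names are illustrative, and the mapPartitions call only shows one way to reach each executor JVM, not a guarantee of exactly one task per executor):

    // A JVM-wide nonce: the body passed to doThisOnce runs at most once per JVM.
    object TaskNonce {
      private var alreadyDone = false
      def doThisOnce(work: => Unit): Unit = this.synchronized {
        if (!alreadyDone) {
          work
          alreadyDone = true
        }
      }
    }

    // Invoked from inside tasks (rdd is any existing RDD), so each executor
    // JVM that runs a partition executes the body at most once.
    rdd.mapPartitions { iter =>
      TaskNonce.doThisOnce { /* e.g. initialize the native library */ }
      iter
    }.count()  // force evaluation so the initialization actually runs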

Re: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Nathan Kronenfeld
by 'use ... my main program' I presume you mean you have a main function in a class file you want to use as your entry point. SPARK_CLASSPATH, ADD_JAR, etc add your jars in on the master and the workers... but they don't on the client. For that, you're just using ordinary, everyday java/scala - so

Static ports for fileserver and httpbroadcast in Spark driver

2014-03-25 Thread Guillermo Cabrera2
Hi: I am setting up a Spark 0.9.0 cluster over multiple hosts using Docker. I use a combination of /etc/hosts editing and port mapping to handle correct routing between Spark Master and Worker containers. My issue arises when I try to do any operation involving a textFile (hdfs or local) in the

Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi All, I'm getting the following error when I execute start-master.sh which also invokes spark-class at the end. Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/ You need to build Spark with 'sbt/sbt assembly' before running this program. After digging into the cod

Re: spark executor/driver log files management

2014-03-25 Thread Tathagata Das
The logs from the executor are redirected to stdout only because there is a default log4j.properties that is configured to do so. If you put your own log4j.properties with a rolling file appender in the classpath (refer to the Spark docs for that), all the logs will get redirected to separate files that wi
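
A sketch of what such a log4j.properties could look like, using standard log4j 1.2 rolling-file settings (the file path and sizes are placeholders; see the Spark docs mentioned above for how to get it onto the executor classpath):

    log4j.rootCategory=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=/var/log/spark/executor.log
    log4j.appender.rolling.MaxFileSize=50MB
    log4j.appender.rolling.MaxBackupIndex=5
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n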

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Gary Malouf
Can anyone verify the claims from Aureliano regarding the Akka dependency protobuf collision? Our team has a major need to upgrade to protobuf 2.5.0 up the pipe and Spark seems to be the blocker here. On Fri, Mar 21, 2014 at 6:49 PM, Aureliano Buendia wrote: > > > > On Tue, Mar 18, 2014 at 12:5

Re: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Julien Carme
Thanks for your answer. I am using bin/spark-class org.apache.spark.deploy.yarn.Client --jar myjar.jar --class myclass ... myclass in myjar.jar contains a main that initializes a SparkContext in yarn-standalone mode. Then I am using some code that uses myotherjar.jar, but I do not execute it us

Re: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Paul Schooss
Andrew, I ran into the same problem and eventually settled on just running the jars directly with java. Since we use sbt to build our jars, we had all the dependencies built into the jar itself, so no need for random class paths. On Tue, Mar 25, 2014 at 1:47 PM, Andrew Lee wrote: > Hi All, > > I'm

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building spark? - Patrick On Fri, Mar 21, 2014 at

RE: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi Paul, I got it sorted out. The problem is that the JARs are built into the assembly JARs when you run sbt/sbt clean assembly. What I did is: sbt/sbt clean package. This will only give you the small JARs. The next step is to update the CLASSPATH in the bin/compute-classpath.sh script manually, app

RE: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Andrew Lee
Hi Julien, The ADD_JAR doesn't work in the command line. I checked spark-class, and I couldn't find any Bash shell bringing in the variable ADD_JAR to the CLASSPATH. Were you able to print out the properties and environment variables from the Web GUI? localhost:4040 This should give you an idea w

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread giive chen
Hi I am quite a beginner in spark and I had a similar issue last week. I don't know if my issue is the same as yours. I found that my program's jar contained protobuf, and when I removed this dependency from my program's pom.xml and rebuilt my program, it worked. Here is how I solved my own issue. Environ

Re: Spark 0.9.0-incubation + Apache Hadoop 2.2.0 + YARN encounter Compression codec com.hadoop.compression.lzo.LzoCodec not found

2014-03-25 Thread alee526
You can try to add the following to your shell. In bin/compute-classpath.sh, append the lzo JAR from Mapreduce:
    CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/mapreduce/lib/hadoop-lzo.jar
    export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native/
    export LD_LIBRARY_PATH=$LD_LIBRARY_

[BLOG] Shark on Cassandra

2014-03-25 Thread Brian O'Neill
As promised, here is that follow-up post for those looking to get started with Shark against Cassandra: -- Brian ONeill CTO, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42

[BLOG] : Shark on Cassandra

2014-03-25 Thread Brian O'Neill
As promised, here is that followup post for those looking to get started with Shark against Cassandra: http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html Again -- thanks to Rohit and the team at TupleJump. Great work. -brian -- Brian ONeill CTO, Health Market

Re: Spark Streaming ZeroMQ Java Example

2014-03-25 Thread Tathagata Das
Unfortunately there isn't one right now. But it is probably not too hard to start with the JavaNetworkWordCount and use the ZeroMQUtils in the same way as the Sca

Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Gary Malouf
Today, our cluster setup is as follows: Mesos 0.15, CDH 4.2.1-MRV1, Spark 0.9-pre-scala-2.10 off master build targeted at appropriate CDH4 version We are looking to upgrade all of these in order to get protobuf 2.5 working properly. The question is, which 'Hadoop version build' of Spark 0.9 is

Re: Implementation problem with Streaming

2014-03-25 Thread Mayur Rustagi
Two good benefits of Streaming: 1. it maintains windows as you move across time, removing & adding monads as you move through the window; 2. it connects with streaming systems like kafka to import data as it comes & process it. You don't seem to need either of these features, so you would be better off using Spark w

Re: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Tathagata Das
You can probably do it in a simpler but sort of hacky way! If your window size is W and sliding interval S, you can do some math to figure out how many of the first windows are actually partial windows. It's probably math.ceil(W/S). So in a windowDStream.foreachRDD() you can increment a global cou
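
A sketch of that counting approach (the window and slide durations and the stream name are placeholders; the foreachRDD body runs on the driver, so a plain var is enough here):

    val windowSeconds = 30   // W, must match the window() call
    val slideSeconds = 10    // S
    val numPartialWindows = math.ceil(windowSeconds.toDouble / slideSeconds).toInt
    var windowsSeen = 0
    windowedStream.foreachRDD { rdd =>
      windowsSeen += 1
      if (windowsSeen > numPartialWindows) {
        // only complete windows reach this point
        println(s"complete window with ${rdd.count()} records")
      }
    }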

Re: tracking resource usage for spark-shell commands

2014-03-25 Thread Mayur Rustagi
Time taken is shown in the Shark shell web ui (hosted on port 4040). Also, memory used is shown in terms of Storage of RDDs; how much shuffle data was written & read during the process is also highlighted there. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Writing RDDs to HDFS

2014-03-25 Thread Ognen Duzlevski
Well, my long running app has 512M per executor on a 16 node cluster where each machine has 16G of RAM. I could not run a second application until I restricted the spark.cores.max. As soon as I restricted the cores, I am able to run a second job at the same time. Ognen On 3/24/14, 7:46 PM, Ya
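
For reference, a minimal sketch of capping an application's cores with spark.cores.max on Spark 0.9 (the master URL, app name, and the value 16 are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Limit how many cores this standalone-mode app grabs so other apps can run.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("SecondJob")
      .set("spark.cores.max", "16")
    val sc = new SparkContext(conf)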

Re: [BLOG] : Shark on Cassandra

2014-03-25 Thread Matei Zaharia
Very cool, thanks for posting this! Matei On Mar 25, 2014, at 6:18 PM, Brian O'Neill wrote: > As promised, here is that followup post for those looking to get started with > Shark against Cassandra: > http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html > > Aga

any distributed cache mechanism available in spark ?

2014-03-25 Thread santhoma
I have been writing map-reduce jobs on hadoop using PIG, and am now trying to migrate to SPARK. My cluster consists of multiple nodes, and the jobs depend on a native library (.so files). In hadoop and PIG, I could distribute the files across nodes using the "-files" or "-archive" option, but I could no
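
One option worth looking at, sketched here under the assumption that SparkContext.addFile / SparkFiles.get (present in Spark 0.9) fit this use case, with placeholder paths:

    import org.apache.spark.SparkFiles

    // Ship a native library to every node, roughly like Hadoop's "-files" option.
    sc.addFile("/path/to/libnative.so")
    rdd.map { x =>
      val localSo = SparkFiles.get("libnative.so")  // node-local copy of the file
      // System.load(localSo) or similar before using the library
      x
    }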

Spark executor memory & relationship with worker threads

2014-03-25 Thread Sai Prasanna
Hi All, Does the number of worker threads bear any relationship to setting executor memory? I have 16 GB RAM with an 8-core processor. I had set SPARK_MEM to 12g and was running locally with the default 1 thread. So this means there can be at most one executor in one node scheduled at any point of tim

Re: rdd.saveAsTextFile problem

2014-03-25 Thread gaganbm
Hi Folks, Is this issue resolved? If yes, could you please throw some light on how to fix it? I am facing the same problem when writing to text files. When I do stream.foreachRDD(rdd =>{ rdd.saveAsTextFile(<"Some path">) }) This wo

Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want "2.3.0-mr1-cd

ALS memory limits

2014-03-25 Thread Debasish Das
Hi, For our usecases we are looking into 20 x 1M matrices, which comes in the similar ranges as outlined by the paper over here: http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html Does the exponential runtime growth in spark ALS as outlined by the blog still exist in recommen

Re: How to set environment variable for a spark job

2014-03-25 Thread santhoma
I tried it, it did not work
    conf.setExecutorEnv("ORACLE_HOME", orahome)
    conf.setExecutorEnv("LD_LIBRARY_PATH", ldpath)
Any idea how to set it using java.library.path ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-environment-variabl

RDD Collect returns empty arrays

2014-03-25 Thread gaganbm
I am getting strange behavior with the RDDs. All I want is to persist the RDD contents in a single file. saveAsTextFile() saves them in multiple text files, one for each partition. So I tried rdd.coalesce(1,true).saveAsTextFile(). This fails with the exception: org.apache.spark.SparkExcepti

Re: How to set environment variable for a spark job

2014-03-25 Thread Sourav Chandra
Did you try to access the variables in worker using System.getenv(...) and it failed? On Wed, Mar 26, 2014 at 11:42 AM, santhoma wrote: > I tried it, it did not work > > conf.setExecutorEnv("ORACLE_HOME", orahome) > conf.setExecutorEnv("LD_LIBRARY_PATH", ldpath) > > Any idea how to set
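
A small sketch of that check: set the variables on the driver and read them back from the executors (the master URL and paths are placeholders; this only verifies what the executor JVM sees and does not by itself set java.library.path):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("EnvCheck")
      .setExecutorEnv("ORACLE_HOME", "/opt/oracle/product/11.2.0")
      .setExecutorEnv("LD_LIBRARY_PATH", "/opt/oracle/product/11.2.0/lib")
    val sc = new SparkContext(conf)
    // Each task reports the value it actually sees in its environment.
    sc.parallelize(1 to 4, 4)
      .map(_ => System.getenv("ORACLE_HOME"))
      .collect()
      .foreach(println)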