Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread Sean Owen
Not directly. If you could access brzPi and brzTheta in the NaiveBayesModel, you could repeat its same computation in predict() and exponentiate it to get back class probabilities, since input and internal values are in log space. Hm I wonder how people feel about exposing those fields or a
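
A minimal sketch of that computation, assuming the log priors and log likelihoods were exposed as plain arrays (the names and shapes here are illustrative, not the model's actual private fields):

    // Multinomial NB in log space: log P(c|x) is proportional to pi(c) + theta(c) . x
    def classProbabilities(pi: Array[Double], theta: Array[Array[Double]],
                           x: Array[Double]): Array[Double] = {
      val logPosterior = pi.indices.map { c =>
        pi(c) + theta(c).zip(x).map { case (t, xi) => t * xi }.sum
      }
      // Exponentiate and normalize; subtract the max first for numerical stability.
      val m = logPosterior.max
      val unnormalized = logPosterior.map(lp => math.exp(lp - m))
      val z = unnormalized.sum
      unnormalized.map(_ / z).toArray
    }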

Error stopping receiver in running Spark+Flume sample code FlumeEventCount.scala

2014-11-10 Thread Ping Tang
Hi, Can somebody help me to understand why this error occurred?
2014-11-10 00:17:44,512 INFO [Executor task launch worker-0] receiver.BlockGenerator (Logging.scala:logInfo(59)) - Started BlockGenerator
2014-11-10 00:17:44,513 INFO [Executor task launch worker-0]

dealing with large values in kv pairs

2014-11-10 Thread YANG Fan
Hi, I've got a huge list of key-value pairs, where the key is an integer and the value is a long string (around 1 KB). I want to concatenate the strings with the same keys. Initially I did something like: pairs.reduceByKey((a, b) => a + " " + b) Then tried to save the result to HDFS. But it was

Re: dealing with large values in kv pairs

2014-11-10 Thread Sean Owen
You are suggesting that the String concatenation is slow? It probably is, because of all the allocation. Consider foldByKey instead, which starts with an empty StringBuilder as its zero value. This will build up the result far more efficiently. On Nov 10, 2014 8:37 AM, YANG Fan idd...@gmail.com
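
A rough sketch of the idea, here using aggregateByKey rather than foldByKey, since it allows a StringBuilder accumulator while the values remain Strings (`pairs` is assumed to be an RDD[(Int, String)] as in the original post):

    val concatenated = pairs.aggregateByKey(new StringBuilder)(
      (sb, s) => { if (sb.nonEmpty) sb.append(' '); sb.append(s) },            // fold one value in
      (a, b)  => { if (a.nonEmpty && b.nonEmpty) a.append(' '); a.append(b) }  // merge partial results
    ).mapValues(_.toString)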

canopy clustering

2014-11-10 Thread amin mohebbi
I want to run MLlib's k-means on a big dataset. It seems that for big datasets we need to perform pre-clustering methods such as canopy clustering. By starting with an initial clustering, the number of more expensive distance measurements can be significantly reduced by ignoring points outside of
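
For reference, the canopy algorithm itself is short; a non-distributed sketch, with t1 > t2 as the loose and tight thresholds and dist any cheap distance function:

    def canopies(points: List[Array[Double]], t1: Double, t2: Double)
                (dist: (Array[Double], Array[Double]) => Double): List[List[Array[Double]]] = {
      var remaining = points
      var result = List.empty[List[Array[Double]]]
      while (remaining.nonEmpty) {
        val center = remaining.head
        // Everything within the loose threshold joins this canopy...
        result ::= remaining.filter(p => dist(center, p) < t1)
        // ...but only points within the tight threshold leave the candidate pool.
        remaining = remaining.filterNot(p => dist(center, p) < t2)
      }
      result.reverse
    }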

closure serialization behavior driving me crazy

2014-11-10 Thread Sandy Ryza
I'm experiencing some strange behavior with closure serialization that is totally mind-boggling to me. It appears that two arrays of equal size take up vastly different amounts of space inside closures if they're generated in different ways. The basic flow of my app is to run a bunch of tiny

index File create by mapFile can't read

2014-11-10 Thread buring
Hi, Recently I wanted to save a big RDD[(k,v)] in the form of index and data, so I decided to use a Hadoop MapFile. I tried some examples like this: https://gist.github.com/airawat/6538748 The code runs well and generates an index and a data file. I can use the command hadoop fs -text
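
One way to produce a MapFile from Spark (a sketch, untested here; path and key/value types illustrative): MapFile requires keys written in sorted order, so sort first and save with the old-API MapFileOutputFormat:

    import org.apache.spark.SparkContext._   // pair-RDD sort/save implicits (Spark 1.x)
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.MapFileOutputFormat

    rdd.sortByKey()   // MapFile requires keys in sorted order
       .map { case (k, v) => (new IntWritable(k), new Text(v)) }
       .saveAsHadoopFile("/out/mapfile", classOf[IntWritable], classOf[Text],
                         classOf[MapFileOutputFormat])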

Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks for the answer. The variables brzPi and brzTheta are declared private. I am writing my code in Java; otherwise I could have replicated the Scala class and performed the desired computation, which, as I observed, is a multiplication of brzTheta with the test vector, adding this value to brzPi.

Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-11-10 Thread MEETHU MATHEW
Hi, This question was asked earlier and I did it in the way specified. I am getting java.lang.ClassNotFoundException. Can somebody explain all the steps required to build a Spark app using IntelliJ (latest version), starting from creating the project to running it? I searched a lot but couldn't

Solidifying Understanding of Standalone Mode

2014-11-10 Thread Ashic Mahtab
Hello, I'm hoping to understand exactly what happens when a Spark-compiled app is submitted to a Spark standalone cluster master. Say our master is A, and workers are W1 and W2. Client machine C is submitting an app to the master using spark-submit. Here's what I think happens: * C submits

Mysql retrieval and storage using JdbcRDD

2014-11-10 Thread akshayhazari
So far I have tried this and I am able to compile it successfully. There isn't enough documentation on Spark for its usage with databases. I am using AbstractFunction0 and AbstractFunction1 here. I am unable to access the database. The jar just runs without doing anything when submitted. I want
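
For comparison, a Scala sketch of JdbcRDD usage (the Java version wraps the two closures in AbstractFunction0/AbstractFunction1, as described above; connection string, table, and bounds are illustrative). Note the query must contain exactly two '?' placeholders for the partition bounds:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",  // the two ?s are required
      1, 1000000, 10,                        // lowerBound, upperBound, numPartitions
      rs => (rs.getInt(1), rs.getString(2))  // map each ResultSet row
    )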

Backporting spark 1.1.0 to CDH 5.1.3

2014-11-10 Thread Zalzberg, Idan (Agoda)
Hello, I have a big cluster running CDH 5.1.3 which I can't upgrade to 5.2.0 at the current time. I would like to run Spark-on-YARN in that cluster. I tried to compile Spark against CDH 5.1.3 and I got HDFS to work, but I am having problems with the connection to Hive: java.sql.SQLException: Could

Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
How can I remove all the INFO logs that appear on the console when I submit an application using spark-submit?

Re: Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
It works. Thanks On Mon, Nov 10, 2014 at 6:32 PM, YANG Fan idd...@gmail.com wrote: Hi, In conf/log4j.properties, change the following log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console This works for me. Best, Fan On Mon, Nov 10, 2014 at 8:21 PM, Ritesh
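
For reference, the change in conf/log4j.properties:

    # was: log4j.rootCategory=INFO, console
    log4j.rootCategory=WARN, console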

Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread Sean Owen
It's hacky, but you could access these fields via reflection. It'd be better to propose opening them up in a PR. On Mon, Nov 10, 2014 at 9:25 AM, jatinpreet jatinpr...@gmail.com wrote: Thanks for the answer. The variables brzPi and brzTheta are declared private. I am writing my code with Java
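
A sketch of the reflection route (the field name is an assumption; Scala private vals may be compiled under mangled names, so verify against your Spark version):

    val f = model.getClass.getDeclaredField("brzPi")
    f.setAccessible(true)
    val brzPi = f.get(model)  // returns Object; cast to the breeze vector type as needed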

Running Spark on SPARC64 X+

2014-11-10 Thread Greg Jennings
Hello all, I'm hoping someone can help me with this hardware question. We have an upcoming need to run our machine learning application on physical hardware. Up until now, we've just rented a cloud-based high performance cluster, so my understanding of the real relative performance tradeoffs

Increase Executor Memory on YARN

2014-11-10 Thread Mudassar Sarwar
Hi, How can we increase the executor memory of a running Spark cluster on YARN? We want to increase the executor memory upon the addition of new nodes to the cluster. We are running Spark version 1.0.2. Thanks Mudassar

To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Lijun Wang
Hi, I need a matrix with each row having an index, e.g., index = 0 for the first row, index = 1 for the second row. Could someone tell me how to generate such an IndexedRowMatrix from a RowMatrix? Besides, does anyone have experience doing multiplication of two distributed matrices, e.g., two

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-10 Thread lev
I see, thanks. I'm not running on EC2, and I wouldn't like to start copying jars to all the servers in the cluster. Any ideas of how I can add this jar in a simple way? Here are my failed attempts so far: - adding the math3 jar to the lib folder in my project root. The math3 classes did appear in

Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Cheng Lian
You may use RDD.zipWithIndex. On 11/10/14 10:03 PM, Lijun Wang wrote: Hi, I need a matrix with each row having an index, e.g., index = 0 for the first row, index = 1 for the second row. Could someone tell me how to generate such an IndexedRowMatrix from a RowMatrix? Besides, is there anyone
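
A sketch of that suggestion (`rowMat` is the assumed RowMatrix):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // Pair each row with its position, then wrap the result in an IndexedRowMatrix.
    val indexedMat = new IndexedRowMatrix(
      rowMat.rows.zipWithIndex.map { case (row, i) => IndexedRow(i, row) }
    )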

Re: Increase Executor Memory on YARN

2014-11-10 Thread Arun Ahuja
If you are using spark-submit with --master yarn, you can also pass --executor-memory as a flag. On Mon, Nov 10, 2014 at 8:58 AM, Mudassar Sarwar mudassar.sar...@northbaysolutions.net wrote: Hi, How can we increase the executor memory of a running spark cluster on YARN? We want to increase
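
For example (class and jar names illustrative; note this takes effect for newly submitted applications, not for executors that are already running):

    spark-submit --master yarn \
      --executor-memory 4g \
      --class com.example.MyApp myapp.jar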

Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Hao Ren
Hey guys, feel free to ask for more details if my questions are not clear. Any insight here?

Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks, I will try it out and raise a request for making the variables accessible. An unrelated question, do you think the probability value thus calculated will be a good measure of confidence in prediction? I have been reading mixed opinions about the same. Jatin - Novice Big Data

Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Cheng Lian
On 11/6/14 1:39 AM, Hao Ren wrote: Hi, I would like to understand the pipeline of Spark's operations (transformations and actions) and some details on block storage. Let's consider the following code: val rdd1 = sc.textFile("hdfs://...") rdd1.map(func1).map(func2).count For example, we

Re: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
On Mon, Nov 10, 2014 at 10:52 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Tasks are now getting submitted, but many tasks don't happen. Like, after opening the spark-shell, I load a text file from disk and try printing its contents as:

Fwd: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
-- Forwarded message -- From: Ritesh Kumar Singh riteshoneinamill...@gmail.com Date: Mon, Nov 10, 2014 at 10:52 PM Subject: Re: Executor Lost Failure To: Akhil Das ak...@sigmoidanalytics.com Tasks are now getting submitted, but many tasks don't happen. Like, after opening the

Question about textFileStream

2014-11-10 Thread Saiph Kappa
Hi, In my application I am doing something like this: new StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I get some unknown exceptions when I copy a file of about 800 MB to that folder (logs/). I have a single worker running with 512 MB of memory. Can anyone tell me if

Kafka version dependency in Spark 1.2

2014-11-10 Thread Bhaskar Dutta
Hi, Is there any plan to bump the Kafka version dependency in Spark 1.2 from 0.8.0 to 0.8.1.1? Current dependency is still on Kafka 0.8.0 https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml Thanks Bhaskie

Re: spark SNAPSHOT repo

2014-11-10 Thread Sean Owen
I don't think there are any regular SNAPSHOT builds published to Maven Central. You can always mvn install the build into your local repo or any shared repo you want. If you just want a recentish build of 1.2.0 without rolling your own, you could point to

Re: spark SNAPSHOT repo

2014-11-10 Thread jamborta
thanks, that looks good.

Re: Question about textFileStream

2014-11-10 Thread Soumitra Kumar
Entire file in a window. On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, In my application I am doing something like this: new StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I get some unknown exceptions when I copy a file of about 800 MB

which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
I have some previous experience with Apache Oozie from when I was developing in Apache Pig. Now, I am working exclusively with Apache Spark and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use / recommend?

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Jimmy McErlain
I have used Oozie for all our workflows with Spark apps, but you will have to use a Java action as the workflow element. I am interested in anyone's experience with Luigi and/or any other tools. On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: I have some

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version exactly? Matei On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote: Hi, Is there any plan to bump the Kafka

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Sean McNamara
"Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka?" Yes it can. 0.8.1 is fully compatible with 0.8. It is buried on this page: http://kafka.apache.org/documentation.html In addition to the pom version bump, SPARK-2492 would bring the Kafka streaming receiver (which was originally

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Helena
Version 0.8.2-beta is published. I'd consider waiting on this; it has quite a few nice changes coming. https://archive.apache.org/dist/kafka/0.8.2-beta/RELEASE_NOTES.html I started the 0.8.1.1 upgrade in a branch a few weeks ago but abandoned it because I wasn't sure if there was interest beyond

Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Akshat Aranya
Hi, Does there exist a way to serialize Row objects to JSON? In the absence of such a way, is the right way to go: * get hold of the schema using SchemaRDD.schema * iterate through each individual Row as a Seq and use the schema to convert values in the row to JSON types. Thanks, Akshat

Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Tried --driver-java-options and SPARK_JAVA_OPTS; none of them worked. Had to change the default one and rebuild.

Status of MLLib exporting models to PMML

2014-11-10 Thread Aris
Hello Spark and MLlib folks, A common problem in the real world of using machine learning is that some data analysts use tools like R, while data engineers will use more advanced systems like Spark MLlib or even Python's scikit-learn. In the real world, I want to have a system

MLLib Decision Tress algorithm hangs, others fine

2014-11-10 Thread tsj
Hello all, I have some text data that I am running different algorithms on. I had no problems with LibSVM and Naive Bayes on the same data, but when I run Decision Tree, the execution hangs in the middle of DecisionTree.trainClassifier(). The only difference from the example given on the site

Custom persist or cache of RDD?

2014-11-10 Thread Benyi Wang
When I have a multi-step process flow like this: A -> B -> C -> D -> E -> F, I need to store B's and D's results into Parquet files: B.saveAsParquetFile D.saveAsParquetFile If I don't cache/persist any step, Spark might recompute from A, B, C, D and E if something is wrong in F. Of course, I'd better

Re: Backporting spark 1.1.0 to CDH 5.1.3

2014-11-10 Thread Marcelo Vanzin
Hello, CDH 5.1.3 ships with a version of Hive that's not entirely the same as the Hive Spark 1.1 supports. So when building your custom Spark, you should make sure you change all the dependency versions to point to the CDH versions. IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have

Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Even after changing core/src/main/resources/org/apache/spark/log4j-defaults.properties to WARN followed by a rebuild, the log level is still INFO. Any other suggestions?

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Srinivas Chamarthi
I am trying to use Spark with Spray and I have a dependency problem with quasiquotes. The issue comes up only when I include the Spark dependencies. I am not sure how this one can be excluded. Jianshi: can you let me know what versions of Spray + Akka + Spark you are using? [error]

Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Some console messages:
14/11/10 20:04:33 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:46713
14/11/10 20:04:33 INFO util.Utils: Successfully started service 'HTTP file server' on port 46713.
14/11/10 20:04:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:04:34 INFO

Spray with Spark-sql build fails with Incompatible dependencies

2014-11-10 Thread Srinivas Chamarthi
I am trying to use Spark with Spray and I have a dependency problem with quasiquotes. The issue comes up only when I include the Spark dependencies. I am not sure how this one can be excluded. Has anyone tried this before and got it working? [error] Modules were resolved with conflicting

streaming linear regression is not building the model

2014-11-10 Thread Bui, Tri
Hi, The model weights are not updating for streaming linear regression. The code and data below are what I am running. import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

Re: Fwd: Executor Lost Failure

2014-11-10 Thread Ankur Dave
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Tasks are now getting submitted, but many tasks don't happen. Like, after opening the spark-shell, I load a text file from disk and try printing its contents as: sc.textFile("/path/to/file").foreach(println)
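
The snippet cuts off, but the usual explanation for this pattern applies: foreach(println) runs on the executors, so the output lands in the executors' stdout, not the driver's shell. A sketch of the fix:

    // Bring the data back to the driver first; prefer take(n) over collect() for large files.
    sc.textFile("/path/to/file").take(20).foreach(println)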

Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Hi, I'm running Spark in standalone mode: 1 master, 15 slaves. I started the cluster with the ec2 script, and I'm currently breaking the job into many small parts (~2,000) to better examine progress and failure. Pretty basic - submitting a PySpark job (via spark-submit) to the cluster. The job

Re: Custom persist or cache of RDD?

2014-11-10 Thread Sean Owen
Well you can always create C by loading B from disk, and likewise for E / D. No need for any custom procedure. On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote: When I have a multi-step process flow like this: A - B - C - D - E - F I need to store B and D's results
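
A sketch of that pattern with the 1.x SQL API (paths illustrative):

    B.saveAsParquetFile("hdfs://.../B.parquet")
    // Re-read so downstream lineage starts at the file, not at A.
    val bFromDisk = sqlContext.parquetFile("hdfs://.../B.parquet")
    // Then derive C from bFromDisk instead of B; likewise for E from a re-read D.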

RE: Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Never mind - I don't know what I was thinking with the below. It's just maxTaskFailures causing the job to fail. From: Griffiths, Michael (NYC-RPM) [mailto:michael.griffi...@reprisemedia.com] Sent: Monday, November 10, 2014 4:48 PM To: user@spark.apache.org Subject: Spark Master crashes job on

convert List[String] to DStream

2014-11-10 Thread Josh J
Hi, I have some data generated by some utilities that return the results as a List[String]. I would like to join this with a DStream of strings. How can I do this? I tried the following, though I get Scala compiler errors: val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())

JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Something Something
I am embarrassed to admit it, but I can't get a basic 'word count' to work under Kafka/Spark streaming. My code looks like this. I don't see any word counts in the console output. Also, I don't see any output in the UI. Needless to say, I am a newbie in both 'Spark' as well as 'Kafka'. Please help. Thanks.

Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
Hello, I am trying to build Spark from source using the following: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package This works OK with branch-1.1; when I switch to branch-1.2, I get the

Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread sadhan
I ran into the same issue; reverting this commit seems to work: https://github.com/apache/spark/commit/bd86cb1738800a0aa4c88b9afdba2f97ac6cbf25

Re: JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Tathagata Das
What is the Spark master that you are using? Use local[4], not local, if you are running locally. On Mon, Nov 10, 2014 at 3:01 PM, Something Something mailinglist...@gmail.com wrote: I am embarrassed to admit but I can't get a basic 'word count' to work under Kafka/Spark streaming. My code
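
A sketch of the suggested change, assuming the 1.x streaming API:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A receiver occupies one core, so bare "local" (one core) leaves
    // nothing for processing; streaming needs at least local[2].
    val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))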

Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
ah, thanks, reverted a few days back; works now.

Re: JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Something Something
I am not running locally. The Spark master is: spark://machine name:7077 On Mon, Nov 10, 2014 at 3:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: What is the Spark master that you are using. Use local[4], not local if you are running locally. On Mon, Nov 10, 2014 at 3:01 PM,

thrift jdbc server probably running queries as hive query

2014-11-10 Thread Sadhan Sood
I was testing out the Spark thrift JDBC server by running a simple query in the beeline client. Spark itself is running on a YARN cluster. However, when I run a query in beeline, I see no running jobs in the Spark UI (completely empty), and the YARN UI seems to indicate that the submitted query

Re: disable log4j for spark-shell

2014-11-10 Thread lordjoe
public static void main(String[] args) throws Exception {
    System.out.println("Set Log to Warn");
    Logger rootLogger = Logger.getRootLogger();
    rootLogger.setLevel(Level.WARN);
    ...
works for me

Re: convert List[String] to DStream

2014-11-10 Thread Tobias Pfeiffer
Josh, On Tue, Nov 11, 2014 at 7:43 AM, Josh J joshjd...@gmail.com wrote: I have some data generated by some utilities that return the results as a List[String]. I would like to join this with a DStream of strings. How can I do this? I tried the following, though I get Scala compiler errors: val
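
One way that should compile (a sketch; `dstream` and `listvalues` are the names from the original question, and both sides are keyed with a dummy value so they can be joined):

    import org.apache.spark.SparkContext._  // pair-RDD join implicit (Spark 1.x)

    // Turn the static list into an RDD once, then join it against each batch.
    val listRDD = ssc.sparkContext.parallelize(listvalues).map(s => (s, ()))
    val joined = dstream.map(s => (s, ())).transform(rdd => rdd.join(listRDD))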

Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Tobias Pfeiffer
Akshat, On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote: Does there exist a way to serialize Row objects to JSON? I can't think of any other way than the one you proposed. A Row is more or less an Array[Object], so you need to read the JSON key and data type from the

Question about RDD Union and SubtractByKey

2014-11-10 Thread Darin McBeath
I have the following code where I'm using RDD 'union' and 'subtractByKey' to create a new baseline RDD. All of my RDDs are key pairs with the 'key' a String and the 'value' a String (an XML document). // Merge the daily

inconsistent edge counts in GraphX

2014-11-10 Thread Buttler, David
Hi, I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number of edges between different runs. However, if I use a single

Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread buring
You should supply more information about your input data. For example, I generate an IndexedRowMatrix from the ALS algorithm's input data format; my code looks like this: val inputData = sc.textFile(fname).map { line => val parts = line.trim.split(' ')

Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Michael Armbrust
There is a JIRA for adding this: https://issues.apache.org/jira/browse/SPARK-4228 Your described approach sounds reasonable. On Mon, Nov 10, 2014 at 5:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: Akshat On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote: Does there
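
A rough sketch of the schema-walking approach for flat rows (using the 1.1-era type aliases; escaping here is minimal, and nested structs, arrays, and dates would need recursion and more care):

    import org.apache.spark.sql._  // Row, StructType

    def rowToJson(row: Row, schema: StructType): String =
      schema.fields.zip(row).map { case (field, value) =>
        val v = value match {
          case null      => "null"
          case s: String => "\"" + s.replace("\"", "\\\"") + "\""
          case other     => other.toString  // numbers, booleans
        }
        "\"" + field.name + "\": " + v
      }.mkString("{", ", ", "}")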

Checkpoint bugs in GraphX

2014-11-10 Thread Xu Lijie
Hi, all. I'm not sure whether someone has reported this bug: there should be a checkpoint() method in EdgeRDD and VertexRDD as follows: override def checkpoint(): Unit = { partitionsRDD.checkpoint() } The current EdgeRDD and VertexRDD use RDD.checkpoint(), which only checkpoints the
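
Formatted, the override proposed above:

    // In EdgeRDD / VertexRDD: delegate checkpointing to the RDD that
    // actually holds the partition data.
    override def checkpoint(): Unit = {
      partitionsRDD.checkpoint()
    }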

Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian
Hey Sadhan, I really don't think this is a Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of

Discuss how to do checkpoint more efficently

2014-11-10 Thread Xu Lijie
Hi, all. I want to seek suggestions on how to do checkpointing more efficiently, especially for iterative applications written with GraphX. For iterative applications, the lineage of a job can be very long, which easily causes a StackOverflowError. A solution is to do checkpointing. However,

Re: Checkpoint bugs in GraphX

2014-11-10 Thread Xu Lijie
Nice; we currently encounter a StackOverflowError caused by this bug. We also found that val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])], val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) will not be serialized even without adding @transient. However, transient can

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
Hi Srinivas, Here are the versions I'm using: <spark.version>1.2.0-SNAPSHOT</spark.version> <spray.version>1.3.2</spray.version> <spray.json.version>1.3.0</spray.json.version> <akka.group>org.spark-project.akka</akka.group> <akka.version>2.3.4-spark</akka.version> I'm using

Strange behavior of spark-shell while accessing hdfs

2014-11-10 Thread hmxxyy
I am trying spark-shell on a single host and got some strange behavior. If I run bin/spark-shell without connecting to a master, it can access an HDFS file on a remote cluster with Kerberos authentication. scala> val textFile =

Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy, Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents of the objects. In addition, something else that helps is to do the following: { val _arr = arr; models.map(... _arr ...) } Basically, copy the global variable into a local one.
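
Spelled out (`models` and `arr` are the names from the original message; score() is illustrative):

    val _arr = arr  // local val: the closure now captures only the array,
                    // not the enclosing object whose field `arr` lives in
    val results = models.map(m => score(m, _arr))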

Re: how to use JNI in spark?

2014-11-10 Thread tangweihan
You just need to pass --driver-library-path with the directory in your submit command. And on your worker nodes, put the lib in the right work directory.

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
Hi again, As Jimmy said, any thoughts about Luigi and/or any other tools? So far it seems that Oozie is the best and only choice here. Is that right? On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com wrote: I have used Oozie for all our workflows with Spark apps, but you