Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
Ok, I'll wait until -Pscala-2.11 is more stable and used by more people. Thanks for the help! Jianshi On Tue, Nov 18, 2014 at 3:49 PM, Ye Xianjin advance...@gmail.com wrote: Hi Prashant Sharma, It's not even ok to build with scala-2.11 profile on my machine. Just check out the

Logging problem in Spark when using Flume Log4jAppender

2014-11-18 Thread QiaoanChen
Hi, I want to do log aggregation in a Spark standalone mode cluster, using Apache Flume. But something weird happened. Here are my operations: (1) Start a flume agent, listening on port 3. ( flume-conf.properties

how to know the Spark worker Mechanism

2014-11-18 Thread tangweihan
I'm a newbie in Spark. I know that the work to be done is written in the RDD. But I want to make the worker load a native lib so that I can do something to change the content of the lib in memory. How can I do that? I can do it on the driver, but not on a worker. I always get a fatal error. The jvm report A

Re: Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-18 Thread Ritesh Kumar Singh
Not sure how to solve this, but spotted these lines in the logs: 14/11/18 14:28:23 INFO YarnAllocationHandler: Container marked as *failed*: container_1415961020140_0325_01_02 14/11/18 14:28:38 INFO YarnAllocationHandler: Container marked as *failed*: container_1415961020140_0325_01_03

Slave Node Management in Standalone Cluster

2014-11-18 Thread Kenichi Maehashi
Hi, I'm operating Spark in a standalone cluster configuration (3 slaves) and have some questions. 1. How can I stop a slave on a specific node? Under the `sbin/` directory, there are `start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but no `stop-slave`. Is there any way to stop

Re: how to know the Spark worker Mechanism

2014-11-18 Thread Yanbo Liang
Did you set spark.executor.extraLibraryPath to the directory in which your native library exists? 2014-11-18 16:13 GMT+08:00 tangweihan tangwei...@360.cn: I'm a newbie in Spark. I know that the work to be done is written in the RDD. But I want to make the worker load a native lib and I can do
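For reference, one way to pass that property at submit time (a sketch only; the library path and class name below are made up, and the setting could equally go into conf/spark-defaults.conf):

    bin/spark-submit \
      --conf spark.executor.extraLibraryPath=/opt/native/lib \
      --class com.example.SegmentJob \
      my-job.jar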

[Spark/ Spark Streaming] Spark 1.1.0 fails working with akka 2.3.6

2014-11-18 Thread Sourav Chandra
Hi, I have created a spark streaming application based on spark-1.1.0. While running, it failed with an akka jar version mismatch. Some other projects are using akka 2.3.6, so I cannot change the akka version as it will affect them. What should I do? Caused by: akka.ConfigurationException:

Re: how to know the Spark worker Mechanism

2014-11-18 Thread tangweihan
Ok. I don't put it in the path, because this is not a lib I want to use permanently. Here is my code in the RDD: val fileaddr = SparkFiles.get("segment.so"); System.load(fileaddr); val config = SparkFiles.get("qsegconf.ini"); val segment = new Segment // this

Re: inconsistent edge counts in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-11 01:51:43 +, Buttler, David buttl...@llnl.gov wrote: I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number

Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Akhil Das
1. You can comment out the rest of the workers from the conf/slaves file and do a stop-slaves.sh from that machine to stop the specific worker. 2. There is no direct command for it, but you can do something like the following: $ curl localhost:8080 | grep Applications -C 10 | head -n20 Where
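As an aside (not part of Akhil's reply), the standalone master also serves the same status as JSON, which is easier to script against than grepping the HTML; assuming the default web UI port:

    $ curl http://localhost:8080/json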

Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote: I'm using GraphX and playing around with its PageRank algorithm. However, I can't see from the documentation how to use edge weights when running PageRank. Is it possible to consider edge weights, and how would I do it?

RE: Null pointer exception with larger datasets

2014-11-18 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, November 18, 2014 1:19 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Null pointer exception with larger datasets Make sure your list is not null, if that is null then its more like

Re: Pagerank implementation

2014-11-18 Thread Ankur Dave
At 2014-11-15 18:01:22 -0700, tom85 tom.manha...@gmail.com wrote: This line: val newPR = oldPR + (1.0 - resetProb) * msgSum makes no sense to me. Should it not be: val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum ? This is an unusual version of PageRank where the

New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Hi, I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which contains

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I was going through the graphx section in the Spark API in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
So landmark can contain just one vertex right? Which algorithm has been used to compute the shortest path? Thank You On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I was going through the

Re: Running PageRank in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? As far as I understand, the

Re: Running PageRank in GraphX

2014-11-18 Thread Deep Pradhan
There are no vertices of zero outdegree. The total rank for the graph with numIter = 10 is 4.99 and for the graph with numIter = 100 it is 5.99. I do not know why there is so much variation. On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 12:02:52 +0530, Deep Pradhan

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: So landmark can contain just one vertex right? Right. Which algorithm has been used to compute the shortest path? It's distributed Bellman-Ford. Ankur
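For context, a minimal spark-shell sketch of how ShortestPaths is typically invoked (assuming a GraphX build that includes org.apache.spark.graphx.lib.ShortestPaths, the class linked above; the graph and landmark are toy values):

    import org.apache.spark.graphx._
    import org.apache.spark.graphx.lib.ShortestPaths

    // tiny toy graph 1 -> 2 -> 3, built in the spark-shell where sc already exists
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // a single landmark is allowed, as discussed above
    val result = ShortestPaths.run(graph, Seq(3L))

    // each vertex now carries a Map[VertexId, Int] of hop counts to each reachable landmark
    result.vertices.collect().foreach(println)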

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
Does Bellman-Ford give the best solution? On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: So landmark can contain just one vertex right? Right. Which algorithm has been used to compute the

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
The code present in 2 can be run with the command $SPARK_HOME/bin/spark-submit --master local[*] --class org.apache.spark.graphx.lib.Analytics $SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar pagerank /edge-list-file.txt --numEPart=8 --numIter=10

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx which contains LiveJournalPageRank.scala 2.

Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Kousuke Saruta
Hi Kenichi 1. How can I stop a slave on a specific node? Under the `sbin/` directory, there are `start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but no `stop-slave`. Is there any way to stop the specific (e.g., the 2nd) slave via the command line? You can use

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:29:08 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Does Bellman-Ford give the best solution? It gives the same solution as any other algorithm, since there's only one correct solution for shortest paths and it's guaranteed to find it eventually. There are probably

Spark Streaming with Kafka is failing with Error

2014-11-18 Thread Sourav Chandra
Hi, While running my spark streaming application built on spark 1.1.0 I am getting the error below. 14/11/18 15:35:30 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.AbstractMethodError at org.apache.spark.Logging$class.log(Logging.scala:52) at

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
What command should I use to run the LiveJournalPageRank.scala? If you want to write a separate application, the ideal way is to do it in a separate project that links in Spark as a dependency [1]. But even for this, I have to do the build every time I change the code, right? Thank You On Tue,

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:35:13 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Now, how do I run the LiveJournalPageRank.scala that is there in 1? I think it should work to use MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank /edge-list-file.txt --numEPart=8 --numIter=10

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Yes, the above command works, but there is this problem: most of the time, the total rank is NaN (Not a Number). Why is it so? Thank You On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: What command should I use to run the LiveJournalPageRank.scala? If you want

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:44:31 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I meant to ask whether it gives the solution faster than other algorithms. No, it's just that it's much simpler and easier to implement than the others. Section 5.2 of the Pregel paper [1] justifies using it for a graph

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, the above command works, but there is this problem: most of the time, the total rank is NaN (Not a Number). Why is it so? I've also seen this, but I'm not sure why it happens. If you could find out which vertices

Kestrel and Spark Stream

2014-11-18 Thread Eduardo Alfaia
Hi guys, Has anyone already tried doing this work? Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155

Re: Kestrel and Spark Stream

2014-11-18 Thread Akhil Das
You can implement a custom receiver http://spark.apache.org/docs/latest/streaming-custom-receivers.html to connect to Kestrel and use it. I think someone has already tried it, not sure if it is working though. Here's the link
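For reference, a minimal receiver skeleton along the lines of that guide (a sketch only; the Kestrel client calls are placeholders, not a working integration):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Sketch of a receiver that would poll a Kestrel queue; the actual Kestrel
    // client (e.g. over its memcache-style protocol) is left as a placeholder.
    class KestrelReceiver(host: String, port: Int, queue: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("Kestrel Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // close the Kestrel client connection here
      }

      private def receive(): Unit = {
        while (!isStopped) {
          val item: Option[String] = None // placeholder: fetch the next item from the queue
          item.foreach(store)             // hand each received item to Spark
        }
      }
    }

It would then be wired into a stream with ssc.receiverStream(new KestrelReceiver("localhost", 22133, "myqueue")).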

Re: Kestrel and Spark Stream

2014-11-18 Thread prabeesh k
You can refer the following link https://github.com/prabeesh/Spark-Kestrel On Tue, Nov 18, 2014 at 3:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can implement a custom receiver http://spark.apache.org/docs/latest/streaming-custom-receivers.html to connect to Kestrel and use it. I

How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
Hi, In my spark application, I am loading some rows from a database into Spark RDDs. Each row has several fields and a string key. Due to my requirements I need to work with consecutive numeric ids (starting from 1 to N, where N is the number of unique keys) instead of string keys. Also, several

Re: Building Spark with hive does not work

2014-11-18 Thread Cheng Lian
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :) On 11/18/14 1:50 AM, Ted Yu wrote: Looks like this was where you got that commandline: http://search-hadoop.com/m/JW1q5RlPrl Cheers On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com mailto:inv...@gmail.com

ReduceByKey but with different functions depending on key

2014-11-18 Thread jelgh
Hello everyone, I'm new to Spark and I have the following problem: I have this large JavaRDD<MyClass> collection, which I group by creating a hashcode from some fields in MyClass: JavaRDD<MyClass> collection = ...; JavaPairRDD<Integer, Iterable<MyClass>> grouped = collection.groupBy(...); //

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian
A not so efficient way can be this: val r0: RDD[OriginalRow] = ... val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row)) val r2 = r1.keys.distinct().zipWithIndex() val r3 = r2.join(r1).values On 11/18/14 8:54 PM, shahab wrote: Hi, In my spark application, I am loading some

Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
I am writing yet another Spark job server and have been able to submit jobs and return/save results. I let multiple jobs use the same spark context, but I set a job group while firing each job so that I can cancel jobs in the future. Further, what I'd like to do is provide some kind of status

Is sorting persisted after pair rdd transformations?

2014-11-18 Thread Aniket Bhatnagar
I am trying to figure out if sorting is persisted after applying Pair RDD transformations, and I am not able to decisively tell after reading the documentation. For example: val numbers = ... // RDD of numbers val pairedNumbers = numbers.map(number => (number % 100, number)) val sortedPairedNumbers

Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
I started some quick hack for that in the notebook, you can head to: https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am writing yet

Re: Building Spark with hive does not work

2014-11-18 Thread Hao Ren
nvm, it would be better if the correctness of flags could be checked by sbt during the build. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19183.html Sent from the Apache Spark User List mailing list archive at

Re: Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
Thanks Andy. This is very useful. This gives me all active stages and their percentage completion, but I am unable to tie stages to a job group (or a specific job). I looked at Spark's code and to me, it seems org.apache.spark.scheduler.ActiveJob's group ID should get propagated to StageInfo (possibly in

Re: Getting spark job progress programmatically

2014-11-18 Thread andy petrella
Yep, we should also propose adding this stuff to the public API. Any other ideas? On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Thanks Andy. This is very useful. This gives me all active stages and their percentage completion but I am unable to tie stages

sum/avg group by specified ranges

2014-11-18 Thread tridib
Hello Experts, I need to get the total of an amount field for a specified date range. Now that group by on a calculated field does not work (https://issues.apache.org/jira/browse/SPARK-4296), what is the best way to get this done? I thought to do it using spark, but I suspect I will miss the performance

Re: Status of MLLib exporting models to PMML

2014-11-18 Thread Charles Earl
Yes, the case is convincing for PMML with Oryx. I will also investigate the parameter server. Cheers, Charles On Tuesday, November 18, 2014, Sean Owen so...@cloudera.com wrote: I'm just using PMML. I haven't hit any limitation of its expressiveness for the model types it supports. I don't think

Nightly releases

2014-11-18 Thread Arun Ahuja
Are nightly releases posted anywhere? There are quite a few vital bugfixes and performance improvements being committed to Spark, and using the latest commits is useful (or even necessary for some jobs). Is there a place to post them? It doesn't seem like it would be difficult to run make-dist nightly

Re: Nightly releases

2014-11-18 Thread Arun Ahuja
Of course we can run this as well to get the latest, but the build is fairly long and this seems like a resource many would need. On Tue, Nov 18, 2014 at 10:21 AM, Arun Ahuja aahuj...@gmail.com wrote: Are nightly releases posted anywhere? There are quite a few vital bugfixes and performance

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Yanbo
First use groupByKey(); you get a tuple RDD with (key: K, value: ArrayBuffer[V]). Then use map() on this RDD with a function that has different operations depending on the key, which acts as a parameter of this function. On Nov 18, 2014, at 8:59 PM, jelgh johannes.e...@gmail.com wrote: Hello everyone, I'm

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner so be careful about the performance...groupByKey does shuffle even for local groups... reduceByKey and aggregateByKey does run a combiner but if you want a separate function for each key, you can have a key to closure map that you can broadcast and use it in
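A rough spark-shell sketch of that key-to-closure idea (Int keys and values are used purely for illustration):

    // one reduce function per key; unknown keys fall back to summation
    val data = sc.parallelize(Seq((0, 1), (0, 2), (1, 3), (1, 4)))
    val perKeyFns: Map[Int, (Int, Int) => Int] =
      Map(0 -> ((a: Int, b: Int) => a + b), 1 -> ((a: Int, b: Int) => a max b))
    val fns = sc.broadcast(perKeyFns)

    // carry the key inside the value so the reduce function can see it;
    // reduceByKey still gets its map-side combine, unlike groupByKey
    val reduced = data
      .map { case (k, v) => (k, (k, v)) }
      .reduceByKey { (a, b) =>
        val f = fns.value.getOrElse(a._1, (x: Int, y: Int) => x + y)
        (a._1, f(a._2, b._2))
      }
      .mapValues(_._2)

    reduced.collect().foreach(println)   // e.g. (0,3), (1,4)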

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to increment them like so: val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1) If the number of distinct keys is relatively small, you might consider collecting them into a map and broadcasting them rather than using
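A rough sketch of that broadcast alternative, reusing the r1 RDD from Cheng's earlier snippet (keyed by the string key) and assuming the distinct keys fit comfortably in driver memory:

    // collect the distinct keys, number them 1..N, and broadcast the lookup table
    val keyToId: Map[String, Long] =
      r1.keys.distinct().collect().zipWithIndex.map { case (k, i) => (k, i + 1L) }.toMap
    val keyToIdBc = sc.broadcast(keyToId)

    // replace each string key with its numeric id without a join
    val withIds = r1.map { case (k, row) => (keyToIdBc.value(k), row) }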

Re: Exception in spark sql when running a group by query

2014-11-18 Thread Sadhan Sood
ah makes sense - Thanks Michael! On Mon, Nov 17, 2014 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote: You are perhaps hitting an issue that was fixed by #3248 https://github.com/apache/spark/pull/3248? On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

Re: Nightly releases

2014-11-18 Thread Andrew Ash
I can see this being valuable for users wanting to live on the cutting edge without building CI infrastructure themselves, myself included. I think Patrick's recent work on the build scripts for 1.2.0 will make delivering nightly builds to a public maven repo easier. On Tue, Nov 18, 2014 at

Is there a way to create key based on counts in Spark

2014-11-18 Thread Blind Faith
As it is difficult to explain this, I will show what I want. Let us say I have an RDD A with the following value: A = [word1, word2, word3] I want to have an RDD with the following value: B = [(1, word1), (2, word2), (3, word3)] That is, it gives a unique number to each entry as a key value.

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread lordjoe
Map the key/value into a (key, Tuple2<key, value>) and process that - Also ask the Spark maintainers for a version of keyed operations where the key is passed in as an argument - I run into these cases all the time /** * map a tuple into a key tuple pair to ensure subsequent processing has

JavaKafkaWordCount

2014-11-18 Thread Eduardo Costa Alfaia
Hi Guys, I am doing some tests with JavaKafkaWordCount. My cluster is composed of 8 workers and 1 driver with spark-1.1.0. I am using Kafka too and I have some questions about it. 1 - When I launch the command: bin/spark-submit --class org.apache.spark.examples.streaming.JavaKafkaWordCount

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Normally these operations are used for dictionary building and so I am hoping you can cache the dictionary of
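Concretely, a spark-shell sketch of the transformation asked for in the original post (cached first, per the caveat above):

    val A = sc.parallelize(Seq("word1", "word2", "word3")).cache()

    // zipWithIndex is zero-based, so shift by one and flip into (number, word) pairs
    val B = A.zipWithIndex().map { case (word, idx) => (idx + 1, word) }

    B.collect().foreach(println)   // (1,word1), (2,word2), (3,word3)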

Spark on YARN

2014-11-18 Thread Alan Prando
Hi Folks! I'm running Spark on a YARN cluster installed with Cloudera Manager Express. The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G RAM. My Spark job is working fine, however it seems that just 2 of 3 slaves are working (htop shows 2 slaves working 100% on 32 cores,

Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as: HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit --master yarn --jars test-job.jar --executor-cores 4 --num-executors 10 --executor-memory 16g --driver-memory 4g --class TestClass test.jar It uses HADOOP_CONF_DIR to schedule executors and

Re: Spark on YARN

2014-11-18 Thread Marcelo Vanzin
Can you check in your RM's web UI how much of each resource does Yarn think you have available? You can also check that in the Yarn configuration directly. Perhaps it's not configured to use all of the available resources. (If it was set up with Cloudera Manager, CM will reserve some room for

Re: Spark on YARN

2014-11-18 Thread Sandy Ryza
Hey Alan, Spark's application master will take up 1 core on one of the nodes on the cluster. This means that that node will only have 31 cores remaining, not enough to fit your third executor. -Sandy On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote: Hi Folks! I'm

Re: Spark on YARN

2014-11-18 Thread Sean Owen
My guess is you're asking for all cores of all machines but the driver needs at least one core, so one executor is unable to find a machine to fit on. On Nov 18, 2014 7:04 PM, Alan Prando a...@scanboo.com.br wrote: Hi Folks! I'm running Spark on YARN cluster installed with Cloudera Manager

Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
Hi Jerry, I looked at KafkaUtils.createStream api and found actually the spark.default.parallelism is specified in SparkConf instead. I do not remember the exact stacks of the exception. But the exception was incurred when createStream was called if we do not specify the

Re: Pyspark Error

2014-11-18 Thread Shannon Quinn
My best guess would be a networking issue--it looks like the Python socket library isn't able to connect to whatever hostname you're providing Spark in the configuration. On 11/18/14 9:10 AM, amin mohebbi wrote: Hi there, *I have already downloaded Pre-built spark-1.1.0, I want to run

Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
Hi all, This is somewhat related to my previous question ( http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html , for additional context) but for all practical purposes this is its own issue. As in my previous question, I'm making

Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
To clarify about what, precisely, is impossible: the crash happens with INDEX == 1 in func2, but func2 is only called in the reduceByKey transformation when INDEX == 0. And according to the output of the foreach() in line 4, that reduceByKey(func2) works just fine. How is it then invoked again

unsubscribe

2014-11-18 Thread Abdul Hakeem
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Pyspark Error

2014-11-18 Thread Davies Liu
It seems that `localhost` can not be resolved in your machines, I had filed https://issues.apache.org/jira/browse/SPARK-4475 to track it. On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: Hi there, I have already downloaded Pre-built spark-1.1.0, I want to run

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Could you point some link about

Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
Sorry everyone--turns out an oft-forgotten single line of code was required to make this work: index = 0 INDEX = sc.broadcast(index) M = M.flatMap(func1).reduceByKey(func2) M.foreach(debug_output) *M.cache()* index = 1 INDEX = sc.broadcast(index) M = M.flatMap(func1) M.foreach(debug_output)

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
Those are probably related. It looks like we are somehow not being thread safe when initializing various parts of the scala compiler. Since code gen is pretty experimental we probably won't have the resources to investigate backporting a fix. However, if you can reproduce the problem in Spark

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Sean Owen
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote: On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed

Problems launching 1.2.0-SNAPSHOT cluster with Hive support on EC2

2014-11-18 Thread curtkohler
I've developed a Spark application using the 1.2.0-SNAPSHOT branch that leverages Spark Streaming and Hive, and I can run it locally with no problem (I need some fixes in the 1.2.0 branch). I successfully launched my EC2 cluster by specifying a git commit hash from the 1.2.0-SNAPSHOT branch as the

GraphX twitter

2014-11-18 Thread tom85
I'm having problems running the twitter graph on a cluster with 4 nodes, each having over 100GB of RAM and 32 virtual cores. I do have a pre-installed spark version (built against hadoop 2.3, because it didn't compile on my system), but I'm loading my graph file from disk without hdfs.

Re: unsubscribe

2014-11-18 Thread Corey Nolet
Abdul, Please send an email to user-unsubscr...@spark.apache.org On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote: - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Lost executors

2014-11-18 Thread Pala M Muthaia
Hi, I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark shell. I am running a job that essentially reads a bunch of HBase keys, looks up HBase data, and performs some filtering and aggregation. The job works fine in smaller datasets, but when i try to execute on the full

Re: Lost executors

2014-11-18 Thread Sandy Ryza
Hi Pala, Do you have access to your YARN NodeManager logs? Are you able to check whether they report killing any containers for exceeding memory limits? -Sandy On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am using Spark 1.0.1 on Yarn 2.5, and

spark-shell giving me error of unread block data

2014-11-18 Thread Anson Abraham
I'm essentially loading a file and saving the output to another location: val source = sc.textFile("/tmp/testfile.txt") source.saveAsTextFile("/tmp/testsparkoutput") When I do so, I'm hitting this error: 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at console:15

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Davies Liu
I see, thanks! On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote: On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote: On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote: Use zipWithIndex but cache the data before you run

Re: spark-shell giving me error of unread block data

2014-11-18 Thread Ritesh Kumar Singh
It can be a serialization issue. Happens when there are different versions installed on the same system. What do you mean by the first time you installed and tested it out? On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham anson.abra...@gmail.com wrote: I'm essentially loading a file and saving

Cores on Master

2014-11-18 Thread Pat Ferrel
I see the default and max cores settings but these seem to control total cores per cluster. My cobbled together home cluster needs the Master to not use all its cores or it may lock up (it does other things). Is there a way to control max cores used for a particular cluster machine in

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores n' to the master? No config/env way? On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote: I see the default and max cores settings but these seem to control total cores

JdbcRDD

2014-11-18 Thread Krishna
Hi, Are there any examples of using JdbcRDD in Java available? It's not clear what the last argument is in this example ( https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ): sc = new SparkContext("local", "test") val rdd = new JdbcRDD( sc, () =>

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote: Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores

Re: JdbcRDD

2014-11-18 Thread mykidong
I also had the same problem using JdbcRDD in Java. For me, I have written a class in Scala to get a JdbcRDD, and I call this instance from Java. For instance, JdbcRDDWrapper.scala looks like this: ... import java.sql._ import org.apache.spark.SparkContext import org.apache.spark.rdd.JdbcRDD import
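For reference, a hedged sketch of such a wrapper (the connection URL and SQL are made up); it also illustrates that the "last argument" asked about above is the mapRow function that converts each ResultSet row into a value:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    object JdbcRDDWrapper {
      // The SQL must contain two '?' placeholders, which JdbcRDD fills with each
      // partition's lower/upper bound, e.g. "SELECT id FROM t WHERE id >= ? AND id <= ?"
      def firstColumn(sc: SparkContext, url: String, sql: String,
                      lower: Long, upper: Long, partitions: Int): JdbcRDD[Int] = {
        new JdbcRDD(
          sc,
          () => DriverManager.getConnection(url),
          sql,
          lower, upper, partitions,
          (r: ResultSet) => r.getInt(1))   // mapRow: the final constructor argument
      }
    }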

Re: JdbcRDD

2014-11-18 Thread Krishna
Thanks Kidong. I'll try your approach. On Tue, Nov 18, 2014 at 4:22 PM, mykidong mykid...@gmail.com wrote: I had also same problem to use JdbcRDD in java. For me, I have written a class in scala to get JdbcRDD, and I call this instance from java. for instance, JdbcRDDWrapper.scala like

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all, I am running a Spark Streaming job. It was able to produce the correct results up to some time. Later on, the job was still running but producing no result. I checked the Spark streaming UI and found that 4 tasks of a stage failed. The error messages showed that Job aborted due to stage

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
OK hacking the start-slave.sh did it On Nov 18, 2014, at 4:12 PM, Pat Ferrel p...@occamsmachete.com wrote: This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:
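A config-based alternative (untested here) is to cap the co-located worker via conf/spark-env.sh on that one machine; SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are the standalone-mode settings, and the values below are only examples:

    # conf/spark-env.sh on the master machine only
    SPARK_WORKER_CORES=4      # cap the worker running on this box at 4 cores
    SPARK_WORKER_MEMORY=8g    # optionally cap its memory as well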

Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the Wikipedia dump, 44GB, or the larger dump that has all revision information as well) that is stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a PullParser or Iteratee? My current solution is to read the single

Re: Lost executors

2014-11-18 Thread Pala M Muthaia
Sandy, Good point - I forgot about the NM logs. When I looked up the NM logs, I only see the following statements that align with the driver-side log about the lost executor. Many executors show the same log statement at the same time, so it seems like the decision to kill many if not all executors

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Eric Zhen
Okay, thank you Michael. On Wed, Nov 19, 2014 at 3:45 AM, Michael Armbrust mich...@databricks.com wrote: Those are probably related. It looks like we are somehow not being thread safe when initializing various parts of the scala compiler. Since code gen is pretty experimental we probably

Re: Spark Streaming with Kafka is failing with Error

2014-11-18 Thread Tobias Pfeiffer
Hi, do you have some logging backend (log4j, logback) on your classpath? This seems a bit like there is no particular implementation of the abstract `log()` method available. Tobias

Re: Parsing a large XML file using Spark

2014-11-18 Thread Tobias Pfeiffer
Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution. One issue with those XML files is that they cannot be processed line by line in parallel; plus you inherently need shared/global state to parse XML or check for well-formedness, I think. (Same issue with

Re: Sourcing data from RedShift

2014-11-18 Thread Gary Malouf
Hi guys, We ultimately needed to add 8 EC2 xl's to get better performance. As was suspected, we could not fit all the data into RAM. This worked great with files around 100-350MB in size, as our initial export task produced. Unfortunately, for the partition settings that we were able to

Re: Slave Node Management in Standalone Cluster

2014-11-18 Thread Kenichi Maehashi
Hi Akhil and Kousuke, Thank you for your quick response. Monitoring through JSON API seems straightforward and cool. Thanks again! 2014-11-18 19:06 GMT+09:00 Kousuke Saruta saru...@oss.nttdata.co.jp: Hi Kenichi 1. How can I stop a slave on the specific node? Under `sbin/` directory,

Converting a json struct to map

2014-11-18 Thread Daniel Haviv
Hi, I'm loading a json file into an RDD and then saving that RDD as parquet. One of the fields is a map of keys and values, but it is being translated and stored as a struct. How can I convert the field into a map? Thanks, Daniel

k-means clustering

2014-11-18 Thread amin mohebbi
Hi there, I would like to do text clustering using k-means and Spark on a massive dataset. As you know, before running the k-means, I have to apply pre-processing methods such as TFIDF and NLTK on my big dataset. The following is my code in Python: if __name__ == '__main__': #
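As a pointer, MLlib already ships TF-IDF and k-means; a rough Scala sketch of that pipeline is below (the original code is Python and pyspark exposes equivalent classes; tokenization and NLTK-style cleanup are assumed to happen upstream):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    // docs: RDD[Seq[String]] of already-tokenized documents
    def clusterDocs(docs: RDD[Seq[String]], k: Int, iterations: Int) = {
      val tf = new HashingTF().transform(docs).cache()   // hashed term frequencies
      val tfidf = new IDF().fit(tf).transform(tf)        // reweight by inverse document frequency
      KMeans.train(tfidf, k, iterations)                 // returns a KMeansModel
    }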

SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi, Just to give some context: we are using the Hive metastore with csv Parquet files as a part of our ETL pipeline. We query these with SparkSQL to do some downstream work. I'm curious what's the best way to go about testing Hive SparkSQL? I'm using 1.1.0. I see that the LocalHiveContext has been

A partitionBy problem

2014-11-18 Thread Tao Xiao
Hi all, I tested the partitionBy feature in a wordcount application, and I'm puzzled by a phenomenon. In this application, I created an RDD from some text files in HDFS (about 100GB in size), each of which has lines composed of words separated by the character #. I wanted to count the occurrence for

Re: Converting a json struct to map

2014-11-18 Thread Akhil Das
Something like this? val map_rdd = json_rdd.map(json => { val mapper = new ObjectMapper() with ScalaObjectMapper mapper.registerModule(DefaultScalaModule) val myMap = mapper.readValue[Map[String,String]](json) myMap }) Thanks Best Regards On Wed, Nov 19, 2014 at