Re: What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-20 Thread Xiangrui Meng
We implemented chi-squared tests in v1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166 and we will add more after v1.1. Feedback on which tests should come first would be greatly appreciated. -Xiangrui On Tue, Aug 19, 2014 at
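For reference, a minimal goodness-of-fit sketch against the 1.1 API (the observed counts below are made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val observed = Vectors.dense(10.0, 20.0, 30.0)   // made-up observed frequencies
    val result = Statistics.chiSqTest(observed)      // goodness of fit against a uniform expected distribution
    println(result.pValue)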

Got NotSerializableException when access broadcast variable

2014-08-20 Thread 田毅
Hi everyone! I got an exception when I run my script with spark-shell. I added SPARK_JAVA_OPTS=-Dsun.io.serialization.extendedDebugInfo=true in spark-env.sh to show the following stack: org.apache.spark.SparkException: Task not serializable at

Re: Performance problem on collect

2014-08-20 Thread Emmanuel Castanier
It did the job. Thanks. :) On 19 August 2014 at 10:20, Sean Owen so...@cloudera.com wrote: In that case, why not collectAsMap() and have the whole result as a simple Map in memory? Then lookups are trivial. RDDs aren't distributed maps. On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier
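A small sketch of the collectAsMap() suggestion, assuming the pair RDD is small enough to hold on the driver:

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val lookup = pairs.collectAsMap()     // a plain scala.collection.Map[String, Int] on the driver
    println(lookup("b"))                  // lookups are now ordinary in-memory map accesses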

RDD Row Index

2014-08-20 Thread TJ Klein
Hi, I wonder if there is something like a (row) index for the elements in the RDD. Specifically, my RDD is generated from a series of files, where the value corresponds to the file contents. Ideally, I would like the keys to be an enumeration of the file number, e.g. (0, file contents

[Spark SQL] How to select first row in each GROUP BY group?

2014-08-20 Thread Fengyun RAO
I have a table with 4 columns: a, b, c, time. What I need is something like: SELECT a, b, GroupFirst(c) FROM t GROUP BY a, b. GroupFirst means the first item of column c in each group, and by first I mean the one with minimal time in that group. In Oracle/SQL Server, we could write: WITH summary AS (
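One non-SQL sketch of the same idea, reducing each (a, b) group to the row with the smallest time (the schema and sample rows are assumptions based on the question):

    import org.apache.spark.SparkContext._

    case class Row4(a: String, b: String, c: String, time: Long)   // hypothetical schema

    val rows = sc.parallelize(Seq(
      Row4("x", "y", "first",  1L),
      Row4("x", "y", "second", 2L)))

    val firstPerGroup = rows
      .map(r => ((r.a, r.b), r))
      .reduceByKey((r1, r2) => if (r1.time <= r2.time) r1 else r2)   // keep the earliest row per group
      .values
    firstPerGroup.collect().foreach(println)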

Accessing to elements in JavaDStream

2014-08-20 Thread cuongpham92
Hi, I am a newbie to Spark Streaming, and I am quite confused about JavaDStream in Spark Streaming. In my situation, after catching a message Hello world from Kafka in a JavaDStream, I want to access the JavaDStream and change this message to Hello John, but I could not figure out how to do it. Any idea

Re: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Looking at the source code of DStream.scala: /** * Return a new DStream in which each RDD has a single element generated by counting each RDD * of this DStream. */ def count(): DStream[Long] = { this.map(_ => (null, 1L))

Broadcast vs simple variable

2014-08-20 Thread Julien Naour
Hi, I have a question about broadcast. I'm working on a clustering algorithm close to KMeans. It seems that KMeans broadcasts the cluster centers at each step. For the moment I just use my centers as an Array that I reference directly in my map at each step. Could it be more efficient to use broadcast

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread tianyi
Thanks for the help. I ran this script again with "bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" in the console. I can see: scala> sc.getConf.getAll.foreach(println) (spark.tachyonStore.folderName,spark-eaabe986-03cb-41bd-bde5-993c7db3f048)

Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi, Whats the difference between amplab docker https://github.com/amplab/docker-scripts and spark docker https://github.com/apache/spark/tree/master/docker? Thanks, Josh

Re: OutOfMemory Error

2014-08-20 Thread MEETHU MATHEW
Hi, how do I increase the heap size? What is the difference between spark executor memory and heap size? Thanks and Regards, Meethu M On Monday, 18 August 2014 12:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I believe spark.shuffle.memoryFraction is the one you are looking for.

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
How about this: val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition }) new RDD[V](prev) { protected def getPartitions = prev.partitions def compute(split: Partition, context: TaskContext) = { context.addOnCompleteCallback(() => /*cleanup()*/)
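A simpler sketch of the same idea that stays inside mapPartitions, doing setup and cleanup around each partition (the StringBuilder just stands in for a real per-partition resource such as a connection):

    val data = sc.parallelize(1 to 100, 4)
    val doubled = data.mapPartitions { partition =>
      val resource = new StringBuilder("setup")   // setup, once per partition
      val out = partition.map(_ * 2).toList       // materialize before cleanup, since map is lazy
      resource.clear()                            // cleanup, once per partition
      out.iterator
    }
    doubled.count()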

Re: hdfs read performance issue

2014-08-20 Thread Gurvinder Singh
I got some time to look into it. It appears that Spark (latest git) is doing this operation much more often compared to the Aug 1 version. Here is the log from the operation I am referring to: 14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not found, computing it 14/08/19 12:37:26 INFO

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
And duh, of course, you can do the setup in that new RDD as well :) On Wed, Aug 20, 2014 at 1:59 AM, Victor Tso-Guillen v...@paxata.com wrote: How about this: val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition }) new RDD[V](prev) { protected def getPartitions =

RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread Shao, Saisai
Hi, I don't think there's an NPE issue when using DStream.count(), even when there is no data fed into Spark Streaming. I tested using Kafka in my local settings, and both cases are OK, with and without data consumed. Actually you can see the details in ReceiverInputDStream; even when there is no data in this

RE: OutOfMemory Error

2014-08-20 Thread Shao, Saisai
Hi Meethu, The spark.executor.memory is the Java heap size of forked executor process. Increasing the spark.executor.memory can actually increase the runtime heap size of executor process. For the details of Spark configurations, you can check:

Hi

2014-08-20 Thread rapelly kartheek
Hi, I have this doubt: I understand that each Java process runs on a different JVM instance. Now, if I have a single executor on my machine and run several Java processes, then there will be several JVM instances running. Now, process_local means the data is located on the same JVM as the task

RE: Hi

2014-08-20 Thread Shao, Saisai
Hi, actually several Java task threads run in a single executor, not separate processes, so each executor has only one JVM runtime, which is shared by the different task threads. Thanks, Jerry From: rapelly kartheek [mailto:kartheek.m...@gmail.com] Sent: Wednesday, August 20, 2014 5:29 PM To:

Re: RDD Row Index

2014-08-20 Thread Sean Owen
zipWithIndex() will give you something like an index for each element in the RDD. If your files are small, you can use SparkContext.wholeTextFiles() to load an RDD where each element is (filename, content). Maybe that's what you are looking for if you are really looking to extract an ID from the
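A minimal sketch of both suggestions ("/path/to/files" is a placeholder):

    val indexed = sc.parallelize(Seq("a", "b", "c")).zipWithIndex()   // RDD[(String, Long)]

    // Or load each file as a (path, contents) pair and number the files instead:
    val byFile = sc.wholeTextFiles("/path/to/files")
    val numbered = byFile.zipWithIndex().map { case ((path, contents), i) => (i, contents) }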

[no subject]

2014-08-20 Thread Cường Phạm

RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Thank you for the reply. I implemented my InputDStream to return None when there's no data. After changing it to return an empty RDD, the exception is gone. I am curious as to why all other processing worked correctly with my old, incorrect implementation, with or without data? My actual code,
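For reference, a sketch of the fix described above: a custom InputDStream whose compute() returns an empty RDD instead of None when no data arrived (fetchBatch is a placeholder for the real source):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class MyInputDStream(streamingCtx: StreamingContext) extends InputDStream[String](streamingCtx) {
      override def start(): Unit = {}
      override def stop(): Unit = {}
      override def compute(validTime: Time): Option[RDD[String]] = {
        val batch = fetchBatch(validTime)
        if (batch.isEmpty)
          Some(streamingCtx.sparkContext.parallelize(Seq.empty[String]))   // empty RDD instead of None
        else
          Some(streamingCtx.sparkContext.parallelize(batch))
      }
      private def fetchBatch(t: Time): Seq[String] = Seq.empty   // placeholder for the real data source
    }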

Web UI doesn't show some stages

2014-08-20 Thread Grzegorz Białek
Hi, I am wondering why in the web UI some stages (like join, filter) are not visible. For example this code: val simple = sc.parallelize(Array.range(0,100)) val simple2 = sc.parallelize(Array.range(0,100)) val toJoin = simple.map(x => (x, x.toString + x.toString)) val rdd = simple2 .map(x =>

Re: Segmented fold count

2014-08-20 Thread fil
Could I write groupCount() in Scala, and then use it from PySpark? Care to supply an example? I'm finding them hard to find :) It's doable, but not so convenient. If you really care about the performance difference, you should write your program in Scala. Is it possible to write my

Potential Thrift Server Bug on Spark SQL,perhaps with cache table?

2014-08-20 Thread John Omernik
I am working with Spark SQL and the Thrift server. I ran into an interesting bug, and I am curious on what information/testing I can provide to help narrow things down. My setup is as follows: Hive 0.12 with a table that has lots of columns (50+) stored as rcfile. Spark-1.1.0-SNAPSHOT with Hive

Advantage of using cache()

2014-08-20 Thread Grzegorz Białek
Hi, I tried to write a small program which shows that using cache() can speed up execution, but the results with and without cache were similar. Could you help me with this issue? I compute an RDD and use it later in two places; I thought that in the second usage this RDD would be recomputed, but it isn't:

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Yes, I’m pretty sure my YARN and HDFS HA configuration is correct. I can use the UIs and HDFS command line tools with HA support as expected (failing over namenodes and resourcemanagers, etc) so I believe this to be a Spark issue. Like I mentioned earlier, if I manipulate the

How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Darin McBeath
Can't seem to figure this out. I've tried several different approaches without success. For example, I've tried setting spark.executor.extraJavaOptions in spark-defaults.conf (prior to starting the spark-shell) but this seems to have no effect. Outside of spark-shell (within a java

Spark exception while reading different inputs

2014-08-20 Thread durga
Hi, I am using the below program in spark-shell to load and filter data from the data sets. I am getting exceptions if I run the program multiple times; if I restart the shell it works fine. 1) Please let me know what I am doing wrong. 2) Also, is there a way to make the program better

Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is a long running Spark Streaming job running in YARN, Spark v1.0.2 on CDH5. The jobs will run for about 34-37 hours then die due to this FileNotFoundException. There’s very little CPU or RAM usage, I’m running 2 x cores, 2 x executors, 4g memory, YARN cluster mode. Here’s the stack

Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Tam, Ken K
Is Spark SQL Thrift Server part of the 1.0.2 release? If not, which release is the target? Thanks, Ken

RE: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Tam, Ken K
What is the best way to run Hive queries in 1.0.2? In my case, Hive queries will be invoked from a middle-tier webapp. I am thinking of using the Hive JDBC driver. Thanks, Ken From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, August 20, 2014 9:38 AM To: Tam, Ken K Cc:

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-20 Thread Steve Lewis
I have made a little progress - by downloading a prebuilt version of Spark I can call spark-shell.cmd and bring up a Spark shell. In the shell, things run. Next I go to my development environment and try to run JavaWordCount. I try -Dspark.master=spark://local[*]:55519

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote: An “unaccepted” reply to this thread from Dean Chen suggested building Spark with a newer version of Hadoop (2.4.1), and this has worked to some extent. I’m now able to submit jobs (omitting an explicit

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Ah, sorry, forgot to talk about the second issue. On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote: However, now the Spark jobs running in the ApplicationMaster on a given node fails to find the active resourcemanager. Below is a log excerpt from one of the assigned

GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi All: I have a question about how to do the following operation in GraphX. Suppose I have a graph with the following vertices and scores on the edges: (V1 {type:B}) --100-- (V2 {type:A}) --10-- (V3 {type:A}) --100-- (V4 {type:B}). I would

MLlib: issue with increasing maximum depth of the decision tree

2014-08-20 Thread Sameer Tilak
Hi All, My dataset is fairly small -- a CSV file with around half a million rows and 600 features. Everything works when I set the maximum depth of the decision tree to 5 or 6. However, I get this error for larger values of that parameter -- for example when I set it to 10. Have others encountered a

Spark-job error on writing result into hadoop w/ switch_user=false

2014-08-20 Thread Jongyoul Lee
Hi, I've used hdfs 2.3.0-cdh5.0.1, mesos 0.19.1 and spark 1.0.2 that is re-compiled. For security reasons, we run hdfs and mesos under the hdfs account, which is not in the root group, and a non-root user submits the spark job on mesos. With no-switch_user, a simple job, which only reads data from

Re: GraphX question about graph traversal

2014-08-20 Thread glxc
I don't know if Pregel would be necessary since it's not iterative. You could filter the graph by looking at edge triplets, and testing if source type = B, dest type = A, and edge value > 5 -- View this message in context:
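A sketch of that triplet filter, assuming vertex attributes carry the type as a String and edge attributes are the numeric scores:

    import org.apache.spark.graphx.Graph

    // graph: Graph[String, Int] with vertex attribute "A" or "B" and edge attribute = score
    def bToAEdges(graph: Graph[String, Int]) =
      graph.triplets
        .filter(t => t.srcAttr == "B" && t.dstAttr == "A" && t.attr > 5)
        .map(_.srcId)   // the B-side vertex ids of qualifying edges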

Personalized Page rank in graphx

2014-08-20 Thread Mohit Singh
Hi, I was wondering if the Personalized Page Rank algorithm is implemented in graphx. If the talks and presentations are to be believed ( https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it is.. but I can't find the algorithm code (

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Marcelo, Specifying the driver-class-path yields behavior like https://issues.apache.org/jira/browse/SPARK-2420 and https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a can of worms here if I also need to replace the guava dependencies. Wouldn’t calling

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Aaron Davidson
This is likely due to a bug in shuffle file consolidation (which you have enabled) which was hopefully fixed in 1.1 with this patch: https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd Until 1.0.3 or 1.1 are released, the simplest solution is to disable
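The workaround, as a conf setting (it can also be passed with --conf on spark-submit):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.consolidateFiles", "false")   // back to non-consolidated shuffle files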

How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Benyi Wang
I want to use opencsv's CSVParser to parse csv lines using a script like below in spark-shell: import au.com.bytecode.opencsv.CSVParser; import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator import org.apache.hadoop.fs.{Path, FileSystem} class MyKryoRegistrator

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Hi, On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com wrote: Specifying the driver-class-path yields behavior like https://issues.apache.org/jira/browse/SPARK-2420 and https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a can of worms here if I also

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Ok Marcelo, Thanks for the quick and thorough replies. I’ll keep an eye on these tickets and the mailing list to see how things move along. mn On Aug 20, 2014, at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi, On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com

RE: How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Sameer Tilak
Hi Wang, have you tried doing this in your application? conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "yourpackage.MyKryoRegistrator") You then don't need to specify it via the command line. Date: Wed, 20 Aug 2014 12:25:14 -0700

FW: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
From: ssti...@live.com To: men...@gmail.com Subject: RE: Decision tree: categorical variables Date: Wed, 20 Aug 2014 12:09:52 -0700 Hi Xiangrui, My data is in the following format:

RE: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
Was able to resolve the parsing issue. Thanks! From: ssti...@live.com To: user@spark.apache.org Subject: FW: Decision tree: categorical variables Date: Wed, 20 Aug 2014 12:48:10 -0700 From: ssti...@live.com To: men...@gmail.com Subject: RE: Decision tree: categorical variables Date: Wed, 20

Re: How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Benyi Wang
I can do that in my application, but I really want to know how I can do it in spark-shell because I usually prototype in spark-shell before I put the code into an application. On Wed, Aug 20, 2014 at 12:47 PM, Sameer Tilak ssti...@live.com wrote: Hi Wang, Have you tried doing this in your
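One way to do it from the command line, assuming the registrator class is packaged into a jar (my-registrator.jar and mypackage are placeholders; the --conf flag requires a release that supports it):

    bin/spark-shell --jars my-registrator.jar \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.kryo.registrator=mypackage.MyKryoRegistrator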

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hey, thanks for your response. And I had seen the triplets, but I'm not quite sure how the triplets would get me that V1 is connected to V4. Maybe I need to spend more time understanding it, I guess. -Cesar On Wed, Aug 20, 2014 at 10:56 AM, glxc r.ryan.mcc...@gmail.com wrote: I don't know

Small input split sizes

2014-08-20 Thread David Rosenstrauch
I'm still bumping up against this issue: spark (and shark) are breaking my inputs into 64MB-sized splits. Anyone know where/how to configure spark so that it either doesn't split the inputs, or at least uses a much larger split size? (E.g., 512MB.) Thanks, DR On 07/15/2014 05:58 PM, David

Re: Personalized Page rank in graphx

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:57:57 -0700, Mohit Singh mohit1...@gmail.com wrote: I was wondering if Personalized Page Rank algorithm is implemented in graphx. If the talks and presentation were to be believed (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
Thanks, I’ll go ahead and disable that setting for now. From: Aaron Davidson ilike...@gmail.commailto:ilike...@gmail.com Date: Wednesday, August 20, 2014 at 3:20 PM To: Silvio Fiorito silvio.fior...@granturing.commailto:silvio.fior...@granturing.com Cc:

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote: I would like to get the type B vertices that are connected through type A vertices where the edges have a score greater than 5. So, from the example above I would like to get V1 and V4. It sounds like you're trying to

Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM,
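A sketch that isolates the effect of cache(): the same RDD counted twice, with and without caching. The difference only becomes visible when the recomputed work is expensive, which this toy map is not:

    val expensive = sc.parallelize(1 to 1000000).map(x => x * 2)   // stand-in for a costly computation

    expensive.count()        // computes the map
    expensive.count()        // recomputes it, since nothing is cached

    val cached = expensive.cache()
    cached.count()           // computes and materializes the partitions
    cached.count()           // served from the cache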

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed,
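A sketch of broadcasting the centers rather than capturing the Array in each closure (the centers, points and distance metric here are dummies):

    val centers: Array[Array[Double]] = Array(Array(0.0, 0.0), Array(1.0, 1.0))
    val bcCenters = sc.broadcast(centers)

    val points = sc.parallelize(Seq(Array(0.1, 0.2), Array(0.9, 1.1)))
    val assignments = points.map { p =>
      val cs = bcCenters.value                  // read the broadcast copy on the executor
      cs.indices.minBy { i =>
        cs(i).zip(p).map { case (a, b) => (a - b) * (a - b) }.sum   // squared Euclidean distance
      }
    }
    assignments.collect()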

Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage.

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi Ankur, thank you for your response. I already looked at the sample code you sent. And I think the modification you are referring to is on the tryMatch function of the PartialMatch class. I noticed you have a case in there that checks for a pattern match, and I think that's the code I need to

java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Chris Jones
New to Apache Spark, trying to build a scalatest. Below is the error I'm consistently seeing. Somehow Spark is trying to load a scalatest AssertionHelper class which is not serializable. The scalatest I have specified doesn't even have any assertions in it. I added the JVM flag 

Re: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Michael Armbrust
You could use the programmatic API http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables to make the hive queries directly. On Wed, Aug 20, 2014 at 9:47 AM, Tam, Ken K ken@verizon.com wrote: What is the best way to run Hive queries in 1.0.2? In my case, Hive queries
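A minimal sketch of the programmatic route (my_hive_table is a placeholder; hql() is the 1.0.x method, later releases use sql()):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val rows = hiveContext.hql("SELECT key, value FROM my_hive_table LIMIT 10")
    rows.collect().foreach(println)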

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Marcelo Vanzin
My guess is that your test is trying to serialize a closure referencing connectionInfo; that closure will have a reference to the test instance, since the instance is needed to execute that method. Try to make the connectionInfo method local to the method where it's needed, or declare it in an
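A sketch of that fix: copy the value into a local val so the closure captures only the value, not the enclosing (non-serializable) test instance (connectionInfo and the RDD contents are assumptions):

    def run(sc: org.apache.spark.SparkContext, connectionInfo: String): Long = {
      val info = connectionInfo                        // local copy; the closure no longer references `this`
      sc.parallelize(1 to 10).map(x => s"$info-$x").count()
    }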

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Vida Ha
Hi, I doubt that the broadcast variable is your problem, since you are seeing: org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3 We have a knowledgebase article that explains why this happens - it's

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Vida Ha
Hi Chris, We have a knowledge base article to explain what's happening here: https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md Let me know if the article is not clear enough - I would be happy to edit and improve it. -Vida On Wed,

Spark memory settings on yarn

2014-08-20 Thread centerqi hu
Spark's memory settings are very confusing to me. My code is as follows. spark-1.0.2-bin-2.4.1/bin/spark-submit --class SimpleApp \ --master yarn \ --deploy-mode cluster \ --queue sls_queue_1 \ --num-executors 3 \ --driver-memory 6g \ --executor-memory 10g \ --executor-cores 5 \

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
If you want to filter the table name, you can use hc.sql("show tables").filter(row => !"test".equals(row.getString(0 Seems making functionRegistry transient can fix the error. On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha v...@databricks.com wrote: Hi, I doubt that the broadcast variable is your

DStream cannot write to text file

2014-08-20 Thread cuongpham92
Hi, I tried to write to a text file from a DStream in Spark Streaming, using DStream.saveAsTextFile("test", "output"), but it did not work. Any suggestions? Thanks in advance. Cuong -- View this message in context:
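The DStream method is actually saveAsTextFiles(prefix, suffix), note the plural; it writes one directory per batch interval named prefix-<time>.suffix. A sketch (stream and ssc are assumed from the original program):

    stream.saveAsTextFiles("test", "output")   // one output directory per batch interval
    ssc.start()
    ssc.awaitTermination()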

RE: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
PR is https://github.com/apache/spark/pull/2074. -- From: Yin Huai huaiyin@gmail.com Sent: ‎8/‎20/‎2014 10:56 PM To: Vida Ha v...@databricks.com Cc: tianyi tia...@asiainfo.com; Fengyun RAO raofeng...@gmail.com; user@spark.apache.org Subject: Re: Got

Re: Web UI doesn't show some stages

2014-08-20 Thread Zhan Zhang
Trying to answer your other question: one sortByKey is triggered by rangePartition, which samples the data to calculate the range boundaries, and that in turn triggers the first reduceByKey. The second sortByKey does the real work of sorting based on the partitions calculated, which again triggers the

Re: Spark memory settings on yarn

2014-08-20 Thread Marcelo Vanzin
That command line you mention in your e-mail doesn't look like something started by Spark. Spark would start one of ApplicationMaster, ExecutableRunner or CoarseGrainedSchedulerBackend, not org.apache.hadoop.mapred.YarnChild. On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu cente...@gmail.com wrote:

Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread praveshjain1991
I am trying to run SQL queries over streaming data in Spark. This looks pretty straightforward, but when I try it, I get the error table not found: tablename. It is unable to find the table I've registered. Using Spark SQL with batch data works fine, so I'm thinking it has to do with how I'm calling

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread Tobias Pfeiffer
Hi, On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote: Using Spark SQL with batch data works fine so I'm thinking it has to do with how I'm calling streamingcontext.start(). Any ideas what is the issue? Here is the code: Please have a look at
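For reference, a common sketch for this pattern: register each batch's RDD as a temp table inside foreachRDD so the table exists when the query runs (names are assumptions; registerTempTable is the 1.1 name, 1.0.x uses registerAsTable):

    import org.apache.spark.sql.SQLContext

    case class Record(word: String)

    // stream: DStream[String] (assumed)
    stream.foreachRDD { rdd =>
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.createSchemaRDD                      // implicit RDD[case class] -> SchemaRDD
      rdd.map(Record(_)).registerTempTable("records")
      sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word").collect().foreach(println)
    }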