Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
Maybe this is me misunderstanding the Spark system property behavior, but I'm not clear why the class being loaded ends up having '/' rather than '.' in its fully qualified name. When I tested this out locally, the '/' characters were preventing the class from being loaded. On Fri, Jul 25, 2014 at 2:27 PM
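For reference, here is a minimal sketch of a registrator wired up with a dotted class name (the package and class names are made up for illustration, not taken from the thread):

    package com.example

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator; register whatever classes your job serializes.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Array[Byte]])
      }
    }

    object KryoConfExample {
      val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Note the dots: a slash-separated name such as "com/example/MyRegistrator"
        // is not a valid fully qualified class name and will not load.
        .set("spark.kryo.registrator", "com.example.MyRegistrator")
    }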

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-25 Thread Alan Ngai
The stack trace was from running the Actor count sample directly, without a spark cluster, so I guess the logs would be from both? I enabled more logging and got this stack trace 14/07/25 17:55:26 [INFO] SecurityManager: Changing view acls to: alan 14/07/25 17:55:26 [INFO] SecurityManager: Secu

Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
solution: opened all ports on the ec2 machine that the driver was running on. need to narrow down what ports akka wants... but the issue is solved. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-but-workers-are-in
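For anyone hitting the same thing, a hedged sketch of pinning the one port that can be pinned in this era of Spark (the port number and master URL below are arbitrary; most other services still pick ephemeral ports, which is why opening a range, or all ports within the security group, ends up being the practical fix):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")     // hypothetical master URL
      .setAppName("PortPinningExample")
      .set("spark.driver.port", "51000")         // fixed Akka port for the driver
    val sc = new SparkContext(conf)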

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi, Could you tell us which HBase version you're using? Also, as a quick sanity check, can the Workers access HBase? Were you able to manually write one record to HBase with the serialize function? Hardcode and test it? From: jianshi.hu...@gmail.com Date: Fri, 25 Jul 2014 15
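A sketch of the kind of standalone sanity check suggested above, using the classic HBase client API with hard-coded values (table, column family and qualifier names are made up; adjust to your schema and HBase version):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val hbaseConf = HBaseConfiguration.create()        // picks up hbase-site.xml on the classpath
    val table = new HTable(hbaseConf, "test_table")    // hypothetical table name
    val put = new Put(Bytes.toBytes("row-1"))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("hello"))
    table.put(put)
    table.close()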

Re: saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
I just saw another error after my job was run for 2 hours: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any

RE: Spark SQL and Hive tables

2014-07-25 Thread Andrew Lee
Hi Michael, If I understand correctly, the assembly JAR file is deployed onto HDFS /user/$USER/.stagingSpark folders that will be used by all computing (worker) nodes when people run in yarn-cluster mode. Could you elaborate on what the document means by this? It is a bit misleading and I

RE: Spark SQL and Hive tables

2014-07-25 Thread sstilak
Thanks! Will do. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Michael Armbrust Date:07/25/2014 3:24 PM (GMT-08:00) To: user@spark.apache.org Subject: Re: Spark SQL and Hive tables > > [S]ince Hive has a large number of dependencies,

Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
> > [S]ince Hive has a large number of dependencies, it is not included in the > default Spark assembly. In order to use Hive you must first run > ‘SPARK_HIVE=true > sbt/sbt assembly/assembly’ (or use -Phive for maven). This command builds > a new assembly jar that includes Hive. Note that this Hi

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-25 Thread Tathagata Das
Is this error on the executor or on the driver? Can you provide a larger snippet of the logs, from the driver as well as, if possible, from the executors? TD On Thu, Jul 24, 2014 at 10:28 PM, Alan Ngai wrote: > bump. any ideas? > > On Jul 24, 2014, at 3:09 AM, Alan Ngai wrote: > > it looks like when you con

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Jerry, I am having trouble with this. Maybe something is wrong with my import or version, etc. scala> import org.apache.spark.sql._;import org.apache.spark.sql._ scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc):24: error: object hive is not a member of package org.apache.s

Re: saveAsTextFiles file not found exception

2014-07-25 Thread Tathagata Das
Can you give a stack trace and logs of the exception? It's hard to say anything without any associated stack trace and logs. TD On Fri, Jul 25, 2014 at 1:32 PM, Bill Jay wrote: > Hi, > > I am running a Spark Streaming job that uses saveAsTextFiles to save > results into hdfs files. However, it

Re: Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread Tathagata Das
Spark Streaming is built as part of the whole Spark repository. Hence follow Spark's building instructions to build Spark Streaming along with Spark. Spark Streaming 0.8.1 was built with kafka 0.7.2. You can take a look. If necessary, I

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Thanks, Michael. From: mich...@databricks.com Date: Fri, 25 Jul 2014 14:49:00 -0700 Subject: Re: Spark SQL and Hive tables To: user@spark.apache.org From the programming guide: When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Thanks, Jerry. Date: Fri, 25 Jul 2014 17:48:27 -0400 Subject: Re: Spark SQL and Hive tables From: chiling...@gmail.com To: user@spark.apache.org Hi Sameer, The blog post you referred to is about Spark SQL. I don't think the intent of the article is meant to guide you how to read data from Hive v

Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
From the programming guide: When working with Hive one must construct a HiveContext, which inherits > from SQLContext, and adds support for finding tables in the MetaStore > and writing queries using HiveQL. conf/ is a top level directory in the spark distribution that you downloaded. On

Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
Hi Sameer, The blog post you referred to is about Spark SQL. I don't think the intent of the article is meant to guide you how to read data from Hive via Spark SQL. So don't worry too much about the blog post. The programming guide I referred to demonstrate how to read data from Hive using Spark

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Michael, Thanks. I am not creating a HiveContext; I am creating a SQLContext. I am using CDH 5.1. Can you please let me know which conf/ directory you are talking about? From: mich...@databricks.com Date: Fri, 25 Jul 2014 14:34:53 -0700 Subject: Re: Spark SQL and Hive tables To: user@spark.apache

Re: Emacs Setup Anyone?

2014-07-25 Thread Andrei
I have never tried the Spark REPL from within Emacs, but I remember that switching from normal Python to Pyspark was as simple as changing the interpreter name at the beginning of a session. Seems like ensime [1] (together with ensime-emacs [2]) should be a good starting point. For example, take a look at en

Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
In particular, have you put your hive-site.xml in the conf/ directory? Also, are you creating a HiveContext instead of a SQLContext? On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam wrote: > Hi Sameer, > > Maybe this page will help you: > https://spark.apache.org/docs/latest/sql-programming-guide.ht
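Putting those two points together, a minimal sketch for the Spark 1.0.x shell, assuming the assembly was built with Hive support and hive-site.xml sits in conf/ (the table name is illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)     // sc is the shell's SparkContext
    // hql() runs HiveQL against the tables registered in the MetaStore.
    val rows = hiveContext.hql("SELECT * FROM some_hive_table LIMIT 10")
    rows.collect().foreach(println)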

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Jerry, Thanks for your reply. I was following the steps in this programming guide; it does not mention anything about creating a HiveContext or using HQL explicitly. http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Users(userId INT, name String, email ST

Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
Hi Sameer, Maybe this page will help you: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Best Regards, Jerry On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak wrote: > Hi All, > I am trying to load data from Hive tables using Spark SQL. I am using > spark-shell. Her

Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi All, I am trying to load data from Hive tables using Spark SQL. I am using spark-shell. Here is what I see: val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender, demographics.birth_year, demographics.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id"""

saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
Hi, I am running a Spark Streaming job that uses saveAsTextFiles to save results into hdfs files. However, it has an exception after 20 batches result-140631234/_temporary/0/task_201407251119__m_03 does not exist. When the job is running, I do not change any file in the folder. Does
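For context, the call pattern in question looks roughly like the sketch below (paths, source and batch interval are illustrative, not taken from the job in the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("SaveAsTextFilesExample")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint("hdfs://namenode:8020/apps/data/checkpoint")   // hypothetical path

    val lines = ssc.socketTextStream("localhost", 9999)           // hypothetical source
    // Each batch is written to a new directory named <prefix>-<batchTime>.<suffix>,
    // via a _temporary directory that is renamed on commit.
    lines.saveAsTextFiles("hdfs://namenode:8020/apps/data/result", "txt")

    ssc.start()
    ssc.awaitTermination()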

Re: Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread Michael Armbrust
Hmm, in general we try to support all the UDAFs, but this one must be using a different base class that we don't have a wrapper for. JIRA here: https://issues.apache.org/jira/browse/SPARK-2693 On Fri, Jul 25, 2014 at 8:06 AM, wrote: > > Hi all, > > I am using Spark 1.0.0 with CDH 5.1.0. > > I

Re: sparkcontext stop and then start again

2014-07-25 Thread Davies Liu
Hey Mohit, Behind the pyspark.SparkContext there is a SparkContext in the JVM, so the overhead of creating a SparkContext is pretty high. Also, starting and stopping a SparkContext involves a lot of setup and teardown, and there may be some corner cases that make it not entirely solid.

Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going well. Looking at the Mesos slave logs, I noticed: ERROR KryoSerializer: Failed to run spark.kryo.registrator java.lang.ClassNotFoundException: com/mediacrossing/verrazano/kryo/MxDataRegistrator My spark-env.sh has the follow

Re: Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Turns out the spark assembly was in the target/dependency dir, so resource localization failed: Spark was localizing both the assembly from the dependency dir and the assembly localized by the main code path. YARN doesn't like it if the file is o

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
That's right, I'm looking to depend on spark in general and change only the hadoop client deps. The spark master and slaves use the spark-1.0.1-bin-hadoop1 binaries from the downloads page. The relevant snippet from the app's maven pom is as follows: org.apache.spark

Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at what's going on? On Fri, Jul 25, 2014 at 10:51 AM, SK wrote: > yes, the output is continuous. So I used a threshold to get binary labels. > If prediction < threshold, then class is 0 else 1. I use this binary label > t

Re: memory leak query

2014-07-25 Thread Rico
Hi Michael, I have similar question before. My problem was that my data was too large to be cached in memory because of serializatio

Re: Decision tree classifier in MLlib

2014-07-25 Thread SK
yes, the output is continuous. So I used a threshold to get binary labels. If prediction < threshold, then class is 0 else 1. I use this binary label to then compute the accuracy. Even with this binary transformation, the accuracy with decision tree model is low compared to LR or SVM (for the spec
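A sketch of that thresholding step, assuming a trained regression-style model whose predict() returns a continuous score and an RDD[LabeledPoint] called testData (both assumed, not shown in the thread):

    val threshold = 0.5   // illustrative cutoff
    val predictionsAndLabels = testData.map { point =>
      val score = model.predict(point.features)
      val predicted = if (score < threshold) 0.0 else 1.0
      (predicted, point.label)
    }
    val accuracy =
      predictionsAndLabels.filter { case (p, l) => p == l }.count.toDouble / testData.count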

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-25 Thread Rico
I figured out the issue. In fact, I did not realize before that when loaded into memory, the data is deserialized. As a result, what seems to be a 21Gb dataset occupies 77Gb in memory. Details about this are clearly explained in the guide on serialization and memory tuning
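The serialized storage level from that guide is the usual mitigation; a minimal sketch (path illustrative), trading some CPU for a much smaller in-memory footprint:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///path/to/dataset")
    // Keep partitions in memory as serialized bytes rather than deserialized Java objects.
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    data.count()   // materializes the cache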

Re: Are all transformations lazy?

2014-07-25 Thread Rico
It may be confusing at first, but there is also an important difference between the reduce and reduceByKey operations. reduce is an action on an RDD. Hence, it will trigger the evaluation of the transformations that produced the RDD. In contrast, reduceByKey is a transformation on PairRDDs, not an act
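A small illustration of the distinction (spark-shell style; the data is made up):

    import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs

    val nums = sc.parallelize(1 to 10)
    val total = nums.reduce(_ + _)            // action: evaluates the lineage, returns an Int

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val summed = pairs.reduceByKey(_ + _)     // transformation: returns a new, still-lazy RDD
    summed.collect()                          // only now is the lineage evaluated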

sparkcontext stop and then start again

2014-07-25 Thread Mohit Jaggi
Folks, I had some pyspark code which used to hang with no useful debug logs. It got fixed when I changed my code to keep the sparkcontext forever instead of stopping it and then creating another one later. Is this a bug or expected behavior? Mohit.

Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all, Currently we have Kafka 0.7.2 running in production and can't upgrade for external reasons; however, spark streaming (1.0.1) was built with Kafka 0.8.0. What is the best way to use spark streaming with older versions of Kafka? Currently I'm investigating building spark streaming mysel

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
If you link against the pre-built binary, that's for Hadoop 1.0.4. Can you show your deps to clarify what you are depending on? Building custom Spark and depending on it is a different thing from depending on plain Spark and changing its deps. I think you want the latter. On Fri, Jul 25, 2014 at 5
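For the 'depend on plain Spark, swap only the Hadoop client' route, a hedged sbt-style sketch of the dependency shape (the thread itself uses Maven; the coordinates and versions below follow the Spark docs' hadoop-client convention and may need adjusting for a CDH3 cluster):

    libraryDependencies ++= Seq(
      // Plain Spark from Maven Central, with its transitive hadoop-client excluded.
      "org.apache.spark" %% "spark-core" % "1.0.1"
        exclude("org.apache.hadoop", "hadoop-client"),
      // The Hadoop client matching the cluster (cdh3u5 in this thread).
      "org.apache.hadoop" % "hadoop-client" % "0.20.2-cdh3u5"
    )

    resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"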

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Thanks for responding. I used the pre built spark binaries meant for hadoop1,cdh3u5. I do not intend to build spark against a specific distribution. Irrespective of whether I build my app with the explicit cdh hadoop client dependency, I get the same error message. I also verified that my app's u

Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
thx for the reply, the UI says my application has cores and mem: ID: app-20140725164107-0001, Name: SectionsAndSeamsPipeline, Cores: 6, Memory per Node: 512.0 MB, Submitted Time: 2014/07/25 16:41:07, User: tercel, State: RUNNING, Duration: 21 s -- View this message

Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks,   I've been able to submit simple jobs to yarn thus far. However, when I did something more complicated that added 194 dependency jars using --addJars, the job fails in YARN with no logs. What ends up happening is that no container logs get created (app master or executor). If I add just

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
Hi Xiangrui, I have 16 * 40 cpu cores in total. But I am only using 200 partitions on the 200 executors. I use coalesce without shuffle to reduce the default partition of RDD. The shuffle size from the WebUI is nearly 100m. On Jul 25, 2014, at 23:51, Xiangrui Meng wrote: > How many partition
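For reference, the no-shuffle coalesce mentioned above looks like the sketch below (input path and target partition count are illustrative; the target is normally chosen to be close to the total executor core count):

    val input = sc.textFile("hdfs:///path/to/training/data")
    // Merge the default input splits down to 200 partitions locally, without a shuffle.
    val coalesced = input.coalesce(200, shuffle = false)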

Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
How many partitions did you use and how many CPU cores in total? The former shouldn't be much larger than the latter. Could you also check the shuffle size from the WebUI? -Xiangrui On Fri, Jul 25, 2014 at 4:10 AM, Charles Li wrote: > Hi Xiangrui, > > Thanks for your treeAggregate patch. It is ve

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
This indicates your app is not actually using the version of the HDFS client you think. You built Spark from source with the right deps it seems, but are you sure you linked to your build in your app? On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar wrote: > Any suggestions to work around thi

sharing spark context among machines

2014-07-25 Thread myxjtu
Is it possible now to share a spark context among machines (through serialization or some other way)? I am looking for ways to make spark job submission HA (highly available). For example, if a job submitted to machine A fails in the middle (due to a machine A crash), I want this

Re: Spark got stuck with a loop

2014-07-25 Thread Denis RP
Can anyone help? I'm using spark 1.0.1. I'm confused: if the block is found, why are no non-empty blocks returned, and why does the process keep running forever? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-got-stuck-with-a-loop-tp10590p10663.html Se

Re: NMF implementation in Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia wrote: > Hi, > > Is there an implementation for Nonnegative Matrix Factorization in Spark? I > understand that MLlib comes with matrix factorization, but it does not seem > to cover the nonnegative case.
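A sketch of that, assuming an MLlib version whose ALS builder exposes setNonnegative, per the note above (the ratings file format is illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      .setNonnegative(true)   // constrain the factor matrices to be nonnegative
      .run(ratings)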

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Any suggestions to work around this issue ? The pre built spark binaries don't appear to work against cdh as documented, unless there's a build issue, which seems unlikely. On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar" wrote: > > I'm encountering a hadoop client protocol mismatch trying to read f

Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Nicholas Chammas
No idea. Right now implementing this is up for grabs by the community. On Fri, Jul 25, 2014 at 5:40 AM, Shubhabrata wrote: > Any idea about the probable dates for this implementation. I believe it > would > be a wonderful (and essential) functionality to gain more acceptance in the > community.

Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Ed Sweeney
Hi all, Amazon Linux, AWS, Spark 1.0.1 reading a file. The UI shows there are workers and shows this app context with the 2 tasks waiting. All the hostnames resolve properly so I am guessing the message is correct and that the workers won't accept the job for mem reasons. What params do I tweak

Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread vinay . kashyap
Hi all, I am using Spark 1.0.0 with CDH 5.1.0. I want to aggregate the data in a raw table using a simple query like the one below: SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day FROM raw_data_table GROUP BY year, month, day. MIN, MAX and AVG functions work fine for m

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
The map and flatMap methods have a similar purpose, but map is 1 to 1, while flatMap is 1 to 0-N (outputting 0 is similar to a filter, except of course it could be outputting a different type). On Thu, Jul 24, 2014 at 6:41 PM, abhiguruvayya wrote: > Can any one help me understand the key differ
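A tiny illustration of the 1-to-1 versus 1-to-0..N distinction (spark-shell style, made-up data):

    val lines = sc.parallelize(Seq("a b c", "", "d e"))

    // map is 1 to 1: exactly one output per input, possibly of a different type.
    val lengths = lines.map(_.length)                            // 5, 0, 3

    // flatMap is 1 to 0..N: the empty line contributes nothing, acting like a filter.
    val words = lines.flatMap(_.split(" ").filter(_.nonEmpty))   // a, b, c, d, e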

Re: Strange exception on coalesce()

2014-07-25 Thread Sean Owen
I'm pretty sure this was already fixed last week in SPARK-2414: https://github.com/apache/spark/commit/7c23c0dc3ed721c95690fc49f435d9de6952523c On Fri, Jul 25, 2014 at 1:34 PM, innowireless TaeYun Kim wrote: > Hi, > I'm using Spark 1.0.0. > > On filter() - map() - coalesce() - saveAsText() sequen

NMF implementation in Spark

2014-07-25 Thread Aureliano Buendia
Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the nonnegative case.

How to pass additional options to Mesos when submitting job?

2014-07-25 Thread Krisztián Szűcs
Hi, We’re trying to use Docker containerization within Mesos via Deimos. We’re submitting Spark jobs from localhost to our cluster. We’ve managed to get it working (with a fixed Deimos configuration), but we have issues passing some options (like a job-dependent container image) in TaskInfo to Mesos du

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on apache jira and post a new ticket/enhancement/issue/bug... Bertrand Dechoux On Fri, Jul 25, 2014 at 4:07 PM, Sparky wrote: > Thanks for the suggestion. I can confirm that my problem is I have files > with zero bytes. It's a known bug and is marked as a hig

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
Thanks for the suggestion. I can confirm that my problem is I have files with zero bytes. It's a known bug and is marked as a high priority: https://issues.apache.org/jira/browse/SPARK-1960 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when
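Until that bug is fixed, one hedged workaround is to list the directory yourself and drop the empty files before handing paths to newAPIHadoopFile; a sketch (the directory path is illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val dir = new Path("hdfs://namenode:8020/data/avro")     // hypothetical directory
    val fs = FileSystem.get(dir.toUri, sc.hadoopConfiguration)

    // Keep only entries that actually contain data; zero-byte files are skipped.
    val nonEmptyFiles = fs.listStatus(dir)
      .filter(_.getLen > 0)
      .map(_.getPath.toString)

    // Read each surviving path with the same newAPIHadoopFile call as before
    // and union the per-file RDDs into one.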

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Akhil Das
Try without the * val avroRdd = sc.newAPIHadoopFile("hdfs://:8020//", classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord],NullWritable]], classOf[AvroKey[GenericRecord]], classOf[NullWritable]) avroRdd.collect() Thanks Best Regards On Fri, Jul 25, 2014 at 7:22 PM, Sparky wrote: > I'm

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
I'm pretty sure my problem is related to this unresolved bug regarding files with size zero: https://issues.apache.org/jira/browse/SPARK-1960 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10649.

EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
I'm trying to list and then process all files in an hdfs directory. I'm able to run the code below when I supply a specific AvroSequence file, but if I use a wildcard to get all Avro sequence files in the directory it fails. Anyone know how to do this? val avroRdd = sc.newAPIHadoopFile("hdfs://

RE: Strange exception on coalesce()

2014-07-25 Thread innowireless TaeYun Kim
(Sorry for resending, I've reformatted the text as HTML.) Hi, I'm using Spark 1.0.0. On filter() - map() - coalesce() - saveAsText() sequence, the following exception is thrown. Exception in thread "main" java.util.NoSuchElementException: None.get at scala.None$.get(Option.scal

Re: Bad Digest error while doing aws s3 put

2014-07-25 Thread Akhil Das
Bad Digest error means the file you are trying to upload actually changed while uploading. If you can make a temporary copy of the file before uploading then you won't face this problem. Thanks Best Regards On Fri, Jul 25, 2014 at 5:34 PM, lmk wrote: > Can someone look into this and help me r

Strange exception on coalesce()

2014-07-25 Thread innowireless TaeYun Kim
Hi, I'm using Spark 1.0.0. On filter() - map() - coalesce() - saveAsText() sequence, the following exception is thrown. Exception in thread "main" java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.r

Re: Bad Digest error while doing aws s3 put

2014-07-25 Thread lmk
Can someone look into this and help me resolve this error pls.. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p10644.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: data locality

2014-07-25 Thread Tsai Li Ming
Hi, In the standalone mode, how can we check data locality is working as expected when tasks are assigned? Thanks! On 23 Jul, 2014, at 12:49 am, Sandy Ryza wrote: > On standalone there is still special handling for assigning tasks within > executors. There just isn't special handling for w

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
Hi Xiangrui, Thanks for your treeAggregate patch. It is very helpful. After applying your patch in my local repos, the new spark can handle more partitions than before. But after some iterations (mapPartitions + reduceByKey), the reducers seem to become slower and slower and finally hang. The logs show the

Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
I'm encountering a hadoop client protocol mismatch trying to read from HDFS (cdh3u5) using the pre-build spark from the downloads page (linked under "For Hadoop 1 (HDP1, CDH3)"). I've also followed the instructions at http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html (i.e.

Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Shubhabrata
Any idea about the probable dates for this implementation. I believe it would be a wonderful (and essential) functionality to gain more acceptance in the community. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p106

Re: GraphX Pragel implementation

2014-07-25 Thread Arun Kumar
Hi, Thanks for the quick response. I am new to Scala, so some help would be appreciated. Regards -Arun On Fri, Jul 25, 2014 at 10:37 AM, Ankur Dave wrote: > On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar wrote: > >> While using pregel API for Iterations how to figure out which super step >> the itera

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Jianshi Huang
I nailed it down to a union operation, here's my code snippet: val properties: RDD[((String, String, String), Externalizer[KeyValue])] = vertices.map { ve => val (vertices, dsName) = ve val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName) val (_, rvalAsc, r