Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread Sean Owen
No, this is just standard Maven informational license info in META-INF. It is not going to affect runtime behavior or how classes are loaded. On Mon, Jun 23, 2014 at 6:30 AM, anoldbrain anoldbr...@gmail.com wrote: I checked the META-INF/DEPENDENCIES file in the spark-assembly jar from official

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread anoldbrain
I used Java Decompiler to check the included org.apache.commons.codec.binary.Base64 .class file (in the spark-assembly jar file), and for both encodeBase64 and decodeBase64 there is only the (byte[]) version and no encodeBase64/decodeBase64(String). I have encountered the reported issue. This conflicts

Kafka Streaming - Error Could not compute split

2014-06-23 Thread Kanwaldeep
We are using Spark 1.0.0 deployed on a Spark Standalone cluster, and I'm getting the following exception. With the previous version I saw this error occur along with OutOfMemory errors, which I'm not seeing with Spark 1.0. Any suggestions? Job aborted due to stage failure: Task 3748.0:20 failed 4

Re: Shark vs Impala

2014-06-23 Thread Aaron Davidson
Note that regarding a long load time, data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far, far better than it would reading from gzipped S3 files. You must also be careful
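For anyone unfamiliar with what that looks like in practice, here is a minimal sketch using the Spark 1.0-era Spark SQL API; the paths and the Record case class are placeholders, not anything from the benchmark discussed above:

    import org.apache.spark.sql.SQLContext

    // assumes an existing SparkContext `sc`, e.g. in the spark-shell
    case class Record(key: Int, value: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD

    val records = sc.textFile("hdfs://namenode/input.csv").map { line =>
      val Array(k, v) = line.split(",", 2)
      Record(k.toInt, v)
    }

    // write out as columnar Parquet, then query it with Spark SQL
    records.saveAsParquetFile("hdfs://namenode/records.parquet")
    val parquetData = sqlContext.parquetFile("hdfs://namenode/records.parquet")
    parquetData.registerAsTable("records")
    sqlContext.sql("SELECT key, COUNT(*) FROM records GROUP BY key").collect().foreach(println)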

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-23 Thread Fedechicco
I'm getting the same behavior, and it's crucial that I get it fixed for an evaluation of Spark + Mesos within my company. I'm adding a +1 to the request to include this fix in 1.0.1 if possible! Thanks, Federico 2014-06-20 20:51 GMT+02:00 Sébastien Rainville sebastienrainvi...@gmail.com : Hi,

Multiclass classification evaluation measures

2014-06-23 Thread Ulanov, Alexander
Hi, I've implemented a class with measures for evaluating multiclass classification (as well as unit tests). They are per-class and averaged Precision, Recall, and F1-measure. As far as I know, Spark has only a binary classification evaluator, given that Spark's Bayesian classifier
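For reference, these are the standard per-class definitions behind those measures (a sketch; the averaging actually used in the implementation may differ):

    For a class c, with true positives TP_c, false positives FP_c, false negatives FN_c:
      Precision_c = TP_c / (TP_c + FP_c)
      Recall_c    = TP_c / (TP_c + FN_c)
      F1_c        = 2 * Precision_c * Recall_c / (Precision_c + Recall_c)
    Macro-averaged values are the unweighted means of the per-class values over all classes.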

Re: implicit ALS dataSet

2014-06-23 Thread redocpot
Hi, The real-world dataset is a bit larger, so I tested on the MovieLens data set and found the same results: alpha lambda rank top1 top5 EPR_in EPR_out 40 0.001 50 297 559 0.05855

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread anoldbrain
found a workaround by adding SPARK_CLASSPATH=.../commons-codec-xxx.jar to spark-env.sh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-Spark-Accumulo-Error-java-lang-NoSuchMethodError-org-apache-commons-codec-binary-Base64-eng-tp7667p8117.html

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das debasish.da...@gmail.com wrote: 600s for Spark vs 5s for Redshift... The numbers look much different from the AMPLab benchmark... https://amplab.cs.berkeley.edu/benchmark/ Is it SSDs or something like that helping Redshift, or is the whole data

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson ilike...@gmail.com wrote: Note that regarding a long load time, data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far,

Re: Serialization problem in Spark

2014-06-23 Thread rrussell25
Thanks for the pointer... I tried Kryo and ran into a strange error: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: Unable to find class: rg.apache.hadoop.hbase.io.ImmutableBytesWritable It is
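For reference, a minimal sketch of how Kryo registration for the HBase writable is typically wired up in this era of Spark; the registrator class and app name are hypothetical, and this is not a confirmed diagnosis of the error above:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator: registers the HBase writable with Kryo.
    // The class must also be on the executors' classpath.
    class HBaseKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[ImmutableBytesWritable])
      }
    }

    val conf = new SparkConf()
      .setAppName("hbase-kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[HBaseKryoRegistrator].getName)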

Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
Hi folks, hoping someone can explain to me what's going on: I have the following code, largely based on RecoverableNetworkWordCount example ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala ): I am setting

Re: pyspark-Failed to run first

2014-06-23 Thread Congrui Yi
So it does not work for files on HDFS either? That is really a problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-tp7691p8128.html

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Marcelo Vanzin
object in Scala is similar to a class with only static fields / methods in Java. So when you set its fields in the driver, the object does not get serialized and sent to the executors; they have their own copy of the class and its static fields, which haven't been initialized. Use a proper class,
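A minimal sketch of the difference (names are hypothetical):

    import org.apache.spark.SparkContext

    // Fields set on a Scala object in the driver are static: the object itself is
    // not serialized with the closure, so executors see their own copy with the
    // fields still at their defaults.
    object Settings {
      var threshold: Int = 0
    }

    // A serializable class instance, by contrast, is captured by the closure and
    // shipped to the executors along with the task.
    class Config(val threshold: Int) extends Serializable

    val sc = new SparkContext("local[*]", "object-vs-class")
    Settings.threshold = 5        // only visible in the driver JVM
    val config = new Config(5)    // travels with the closure

    val data = sc.parallelize(1 to 10)
    data.filter(_ > config.threshold).count()     // uses 5, as expected
    // data.filter(_ > Settings.threshold)        // on a real cluster, executors would still see 0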

Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I am new to Scala and Spark. I have a basic question. I have the following import statements in my Scala program. I want to pass my function (printScore) to Spark. It will compare a string: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import
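Since the message is cut off above, here is a minimal sketch of the general pattern of passing a scoring function to Spark; the signature and scoring logic of printScore are assumptions, not the poster's actual code:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Hypothetical scorer: fraction of words two strings have in common.
    def printScore(source: String, target: String): Double = {
      val a = source.split("\\s+").toSet
      val b = target.split("\\s+").toSet
      if (a.isEmpty || b.isEmpty) 0.0 else a.intersect(b).size.toDouble / a.union(b).size
    }

    val sc = new SparkContext("local[*]", "score-example")
    val query = "the quick brown fox"
    val scored = sc.textFile("hdfs://namenode/dataset.txt")   // placeholder path
      .map(line => (line, printScore(query, line)))
    scored.take(5).foreach(println)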

about a JavaWordCount example with spark-core_2.10-1.0.0.jar

2014-06-23 Thread Alonso Isidoro Roman
Hi all, I am new to Spark, so this is probably a basic question. I want to explore the possibilities of this framework, specifically using it in conjunction with third-party libs like MongoDB, for example. I have been following the instructions from http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
Thank you so much! I was trying for a singleton and opted against a class but clearly this backfired. Clearly time to revisit Scala lessons. Thanks again On Mon, Jun 23, 2014 at 1:16 PM, Marcelo Vanzin van...@cloudera.com wrote: object in Scala is similar to a class with only static fields /

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-23 Thread Andrew Lee
I checked the source code, it looks like it was re-added back based on JIRA SPARK-1588, but I don't know if there's any test case associated with this? SPARK-1588. Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN. Sandy Ryza sa...@cloudera.com 2014-04-29 12:54:02 -0700

Re: pyspark regression results way off

2014-06-23 Thread frol
Here is my conversation about the same issue with regression methods: https://issues.apache.org/jira/browse/SPARK-1859 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p8139.html

Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread anoldbrain
Assuming this should not happen, I don't want to have to keep building a custom version of Spark for every new release, so I prefer the workaround.

How to use K-fold validation in spark-1.0?

2014-06-23 Thread holdingonrobin
Hello, I noticed there are some discussions about implementing K-fold validation in MLlib on Spark, and I believe it should be in Spark 1.0 now. However, there isn't any documentation or example about how to use it in training. While I am reading the code to find out, does anyone use it
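A minimal sketch, assuming the helper in question is MLUtils.kFold, which takes an RDD, the number of folds, and a seed, and returns (training, validation) RDD pairs; the data path and the choice of model are placeholders:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // assumes an existing SparkContext `sc`, e.g. in the spark-shell
    val data = MLUtils.loadLibSVMFile(sc, "hdfs://namenode/data.libsvm")

    val folds = MLUtils.kFold(data, 5, 42)   // 5 folds, seed 42

    val errors = folds.map { case (training, validation) =>
      val model = LogisticRegressionWithSGD.train(training, 100)
      val wrong = validation.filter(p => model.predict(p.features) != p.label).count()
      wrong.toDouble / validation.count()
    }
    println("mean CV error: " + (errors.sum / errors.length))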

Re: how to make saveAsTextFile NOT split output into multiple file?

2014-06-23 Thread holdingonrobin
I used some standard Java IO libraries to write files directly to the cluster. It is a little bit trivial, though: val sc = getSparkContext val hadoopConf = SparkHadoopUtil.get.newConfiguration val hdfsPath = hdfs://your/path val fs = FileSystem.get(hadoopConf) val path
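The preview above is cut off, so here is a minimal completed sketch of the same idea, writing a single output file through the Hadoop FileSystem API; paths are placeholders, and this only makes sense when the result fits comfortably in driver memory:

    import java.io.PrintWriter
    import org.apache.hadoop.fs.{FileSystem, Path}

    // assumes an existing SparkContext `sc`, e.g. in the spark-shell
    val data = sc.textFile("hdfs://namenode/input").map(_.toUpperCase)

    val outPath = new Path("hdfs://namenode/output/result.txt")
    val fs = outPath.getFileSystem(sc.hadoopConfiguration)
    val writer = new PrintWriter(fs.create(outPath))
    try {
      // collect to the driver and write everything into one file
      data.collect().foreach(writer.println)
    } finally {
      writer.close()
    }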

Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-23 Thread Aaron Dossett
I am relatively new to Spark and am getting stuck trying to do the following: - My input is integer key, value pairs where the key is not unique. I'm interested in information about all possible distinct key combinations, thus the Cartesian product. - My first attempt was to create a separate

Re: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-23 Thread Aaron
Sorry, I got my sample outputs wrong: (1,1) - 400, (1,2) - 500, (2,2) - 600 On Jun 23, 2014, at 4:29 PM, Aaron Dossett [via Apache Spark User List] ml-node+s1001560n8144...@n3.nabble.com wrote: I am relatively new to Spark and am getting stuck trying
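A minimal sketch (in Scala rather than pyspark, purely for illustration) of one way to get every distinct key combination without taking the Cartesian product of the raw records: aggregate per key first, then take the Cartesian product of the much smaller per-key summaries. The combine logic here is a stand-in, not the poster's actual computation:

    // assumes the spark-shell, so `sc` and the pair-RDD implicits are in scope
    val pairs = sc.parallelize(Seq((1, 100), (1, 300), (2, 200), (2, 400)))

    val perKey = pairs.reduceByKey(_ + _)               // e.g. sum of values per key

    val combos = perKey.cartesian(perKey)
      .filter { case ((k1, _), (k2, _)) => k1 <= k2 }   // keep each unordered pair once
      .map { case ((k1, v1), (k2, v2)) => ((k1, k2), v1 + v2) }

    combos.collect().foreach(println)
    // with the toy data above: ((1,1),800), ((1,2),1000), ((2,2),1200)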

Run Spark on Mesos? Add yourself to the #PoweredByMesos list

2014-06-23 Thread Dave Lester
Hi All, It's great to see a growing number of companies Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark! If you're running Spark on Apache Mesos http://mesos.apache.org, drop me a line or post to the u...@mesos.apache.org list and we'll also be happy to add

balancing RDDs

2014-06-23 Thread Sean McNamara
We have a use case where we’d like something to execute once on each node and I thought it would be good to ask here. Currently we achieve this by setting the parallelism to the number of nodes and use a mod partitioner: val balancedRdd = sc.parallelize( (0 until Settings.parallelism)
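The code in the preview is cut off, so here is a minimal sketch of the mod-partitioner pattern described above; names are hypothetical, and note that Spark does not strictly guarantee that each partition lands on a distinct node:

    import org.apache.spark.{Partitioner, SparkContext}
    import org.apache.spark.SparkContext._

    class ModPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = key.asInstanceOf[Int] % parts
    }

    val sc = new SparkContext("local[4]", "balanced-rdd")
    val parallelism = 4   // stand-in for Settings.parallelism (number of nodes)

    val balancedRdd = sc
      .parallelize((0 until parallelism).map(i => (i, i)), parallelism)
      .partitionBy(new ModPartitioner(parallelism))

    balancedRdd.foreach { case (slot, _) =>
      // per-slot work goes here; each partition holds exactly one element
      println("running slot " + slot + " on " + java.net.InetAddress.getLocalHost.getHostName)
    }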

Re: hi

2014-06-23 Thread Andrew Or
Ah, never mind. The 0.0.0.0 is for the UI, not for the Master, which uses the output of the hostname command. But yes, long story short, go to the web UI and use that URL. 2014-06-23 11:13 GMT-07:00 Andrew Or and...@databricks.com: Hm, spark://localhost:7077 should work, because the standalone

Error when running unit tests

2014-06-23 Thread SK
I am using Spark 1.0.0. I am able to successfully run sbt package. However, when I run sbt test or sbt test-only class, I get the following error: [error] error while loading root, zip file is empty scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not

Bug in Spark REPL

2014-06-23 Thread Shivani Rao
I have two jars with the following packages: package a.b.c.d.z, found in jar1, and package a.b.e, found in jar2. In the Scala REPL (no Spark) both imports work just fine, but in the Spark REPL I found that import a.b.c.d.z gives me the following error: object c is not a member of package a.b Has

DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
Hi All, I am using Spark for text analysis. I have a source file that has a few thousand sentences and a dataset of tens of millions of statements. I want to compare each statement from the sourceFile with each statement from the dataset and generate a score. I am having the following problem. I

Re: Bug in Spark REPL

2014-06-23 Thread Shivani Rao
Actually I figured it out. The problem was that I was loading the sbt package-ed jar into the classpath and not the sbt assembly-ed jar. Once I put in the right jar for package a.b.c.d.z, everything worked. Thanks, Shivani On Mon, Jun 23, 2014 at 4:38 PM, Shivani Rao raoshiv...@gmail.com

RE: DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
The subject should be: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException, and not DAGScheduler: Failed to run foreach. If I call printScoreCanndedString with a hard-coded string and an identical 2nd parameter, it works fine.

Re: DAGScheduler: Failed to run foreach

2014-06-23 Thread Aaron Davidson
Please note that this: for (sentence <- sourcerdd) { ... } is actually Scala syntactic sugar, which is converted into sourcerdd.foreach { sentence => ... } What this means is that this will actually run on the cluster, which is probably not what you want if you're trying to print them. Try
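A minimal sketch of the difference, reusing the sourcerdd name from the question:

    // for (sentence <- sourcerdd) { println(sentence) }
    // desugars to:
    // sourcerdd.foreach(sentence => println(sentence))
    // which runs on the executors, so the output appears in the executors'
    // stdout, not in the driver's console.

    // To see the contents from the driver, bring them back first:
    sourcerdd.take(10).foreach(println)     // a small sample
    sourcerdd.collect().foreach(println)    // everything, if it fits in driver memory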

which function can generate a ShuffleMapTask

2014-06-23 Thread lihu
I see that a task will either be a ShuffleMapTask or a ResultTask. I wonder which function generates a ShuffleMapTask and which generates a ResultTask?