Re: return probability / confidence instead of actual class

2014-09-22 Thread Adamantios Corais
Nobody? If that's not supported already, can you please at least give me a few hints on how to implement it? Thanks! On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais adamantios.cor...@gmail.com wrote: Hi, I am working with the SVMWithSGD classification algorithm on Spark. It works fine
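One possible route, offered here only as a sketch rather than the thread's answer: MLlib's SVMModel has a clearThreshold() method, after which predict() returns the raw margin rather than a 0/1 label. The margin is a confidence-like score, not a calibrated probability, and the RDD names below are assumptions.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // 'training' and 'test' are placeholders for RDDs you already have.
    def rawScores(training: RDD[LabeledPoint], test: RDD[Vector]): RDD[Double] = {
      val model = SVMWithSGD.train(training, 100)
      // With the threshold cleared, predict() returns the raw margin
      // (signed distance from the separating hyperplane) instead of 0.0 / 1.0.
      model.clearThreshold()
      test.map(model.predict)
    }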

Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread innowireless TaeYun Kim
Hi, I'm confused by saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset. What's the difference between the two? What are the individual use cases of the two APIs? Could you describe the internal flows of the two APIs briefly? I've used Spark for several months, but I have no experience on

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a JobConf. This means that Dataset is more general (it can also save to storage systems that are not file systems, such as key-value stores), but is more annoying to use if you actually have a file. Matei On September 21, 2014 at
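A minimal sketch contrasting the two calls, assuming the Hadoop 2 ("new API") classes; the paths and key/value classes are illustrative, not from the thread:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.SparkContext._

    val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }

    // File variant: the destination path is an explicit argument.
    counts.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/counts",                                   // hypothetical path
      classOf[Text],
      classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])

    // Dataset variant: everything, including the destination, is carried by the
    // job configuration, so it also fits output formats that are not file based.
    val job = Job.getInstance(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
    TextOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/counts"))  // hypothetical path
    counts.saveAsNewAPIHadoopDataset(job.getConfiguration)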

Re: Error while calculating the max temperature

2014-09-22 Thread Sean Owen
If your map() sometimes does not emit an element, then you need to call flatMap() instead, and emit Some(value) (or any collection of values) if there is an element to return, or None otherwise. On Mon, Sep 22, 2014 at 4:50 PM, Praveen Sripati praveensrip...@gmail.com wrote: During the map based
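A minimal sketch of that pattern, assuming the spark-shell and a made-up comma-separated station,temperature record layout:

    val records = sc.textFile("hdfs:///data/temperatures")   // hypothetical input path
    val maxTemps = records.flatMap { line =>
      val fields = line.split(",")
      // Emit Some(pair) when the record parses, None otherwise;
      // flatMap flattens the Options so bad records simply disappear.
      if (fields.length >= 2) Some((fields(0), fields(1).toInt)) else None
    }.reduceByKey((a, b) => math.max(a, b))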

Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
The only way I find is to turn it into a list - in effect holding everything in memory (see code below). Surely Spark has a better way. Also what about unterminated iterables like a Fibonacci series - (useful only if limited in some other way ) /** * make an RDD from an iterable *
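The thread is about the Java API, but the workaround being described boils down to materializing the iterable on the driver before parallelizing it. A Scala sketch of that idea, with a hypothetical limit to guard against unbounded sequences:

    import org.apache.spark.SparkContext
    import scala.reflect.ClassTag

    // Only works for iterables that fit in driver memory; an unbounded sequence
    // (e.g. a Fibonacci stream) has to be truncated first, hence the limit.
    def rddFromIterable[T: ClassTag](sc: SparkContext, it: Iterable[T], limit: Int = 1000000) =
      sc.parallelize(it.take(limit).toSeq)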

spark time out

2014-09-22 Thread Chen Song
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the following exception. java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
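One configuration knob commonly tuned for this error is the connection ack timeout; a sketch, assuming the property name from the Spark 1.1 configuration docs, and noting that raising it only hides the symptom if executors are stalled by GC or load:

    import org.apache.spark.SparkConf

    // Give slow executors longer to acknowledge messages before a fetch is failed.
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "120")   // seconds; default is 60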

Accumulator Immutability?

2014-09-22 Thread Vikram Kalabi
Consider this snippet from the Spark scaladoc https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator : scala> val accum = sc.accumulator(0) accum: spark.Accumulator[Int] = 0 scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) ...10/09/29 18:41:08 INFO

Re: Accumulator Immutability?

2014-09-22 Thread Sean Owen
'accum' is a reference that can't point to another object because it's val. However the object it points to can certainly change state. 'val' has an effect mostly like 'final' in Java. Although the accum += ... syntax might lead you to believe it's executing accum = accum + ..., as it would in
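A short illustration of the distinction, assuming the spark-shell:

    // 'val' fixes the reference, not the object's state: the Accumulator object
    // that 'accum' points to is mutated in place.
    val accum = sc.accumulator(0)
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
    // 'accum += x' resolves to the Accumulator's '+=' method (a mutation),
    // not to 'accum = accum + x' (a reassignment), so the val is never reassigned.
    println(accum.value)   // 10 once the foreach has finished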

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Michael Armbrust
These are coming from the parquet library and as far as I know can be safely ignored. On Mon, Sep 22, 2014 at 3:27 AM, Andrew Ash and...@andrewash.com wrote: Hi All, I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 -- is this warning a known issue? I don't see any open

Running Spark in Local Mode vs. Single Node Cluster

2014-09-22 Thread kriskalish
I'm in a situation where I'm running Spark streaming on a single machine right now. The plan is to ultimately run it on a cluster, but for the next couple months it will probably stay on one machine. I tried to do some digging and I can't find any indication of whether it's better to run spark as
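For reference, the two setups being compared differ mainly in the master URL handed to the driver; the host name and core counts below are illustrative:

    import org.apache.spark.SparkConf

    val localMode  = new SparkConf().setAppName("stream").setMaster("local[8]")           // one JVM, 8 worker threads
    val standalone = new SparkConf().setAppName("stream").setMaster("spark://host:7077")  // separate master/worker/executor JVMs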

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-22 Thread kriskalish
Thanks for the insight, I didn't realize there was internal object reuse going on. Is this a mechanism of Scala/Java or is this a mechanism of Spark? I actually just converted the code to use immutable case classes everywhere, so it will be a little tricky to test foldByKey(). I'll try to get to
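A sketch of the immutable-value pattern being described, with a made-up Stats case class and an assumed RDD of (key, Stats) pairs named 'pairs':

    case class Stats(count: Long, sum: Double)

    // Return a brand-new value from the reduce function instead of mutating
    // either argument, so reused objects can never leak between records.
    val reduced = pairs.reduceByKey { (a, b) =>
      Stats(a.count + b.count, a.sum + b.sum)
    }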

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread John Omernik
Any thoughts on this? On Sat, Sep 20, 2014 at 12:16 PM, John Omernik j...@omernik.com wrote: I am running the Thrift server in SparkSQL, and running it on the node I compiled spark on. When I run it, tasks only work if they landed on that node, other executors started on nodes I didn't

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread Dean Wampler
The Mesos install guide says this: To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos. For example, putting it in HDFS or copying it to each node in the same location should do the trick.
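A sketch of that setup in configuration terms, assuming the spark.executor.uri property from the Mesos run guide; the ZooKeeper and HDFS locations are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("mesos://zk://zkhost:2181/mesos")                                // or mesos://host:5050
      .set("spark.executor.uri", "hdfs:///spark/spark-1.1.0-bin-hadoop2.4.tgz")   // binary package reachable from every slave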

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-22 Thread Xiangrui Meng
Does the feature size 43839 equal the number of terms? Check the output dimension of your feature vectorizer and reduce the number of partitions to match the number of physical cores. I saw you set spark.storage.memoryFraction to 0.0. Maybe it is better to keep the default. Also please confirm the
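A sketch of those two suggestions, with illustrative numbers and an assumed training RDD named 'trainingData':

    val conf = new org.apache.spark.SparkConf()
      .set("spark.storage.memoryFraction", "0.6")   // the default; 0.0 leaves no room for cached blocks

    // Match the number of partitions to the number of physical cores (16 here is illustrative).
    val repartitioned = trainingData.coalesce(16)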

Spark SQL CLI

2014-09-22 Thread gtinside
Hi, I have been using the Spark shell to execute all SQLs. I am connecting to Cassandra, converting the data to JSON, and then running queries on it. I am using HiveContext (and not SQLContext) because of the explode functionality in it. I want to see how I can use the Spark SQL CLI for directly running

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory to the memoryOverhead values. I tracked it down to this code:

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Greg, if you look carefully, the code is enforcing that the memoryOverhead be lower (and not higher) than spark.driver.memory. Thanks, Nishkam On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill greg.h...@rackspace.com wrote: I thought I had this all figured out, but I'm getting some weird errors now

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
Gah, ignore me again. I was reading the logic backwards. For some reason it isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the default of 512m. Probably an environmental issue. Greg From: Greg greg.h...@rackspace.com Date: Monday,

The wikipedia Extraction (WEX) Dataset

2014-09-22 Thread daidong
I watched several presentations from the AMP Camp 2013. Many of the Spark examples are about extracting information from the TSV-format Wikipedia extraction dataset (around 66 GB). It used to be provided as an open data set in Amazon EBS, but now it has disappeared. I really want to use these

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Maybe try --driver-memory if you are using spark-submit? Thanks, Nishkam On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill greg.h...@rackspace.com wrote: Ah, I see. It turns out that my problem is that that comparison is ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m. Is that

Streaming: HdfsWordCount does not print any output

2014-09-22 Thread SK
Hi, I tried running the HdfsWordCount program in the streaming examples in Spark 1.1.0. I provided a directory in the distributed filesystem as input. This directory has one text file. However, the only thing that the program keeps printing is the time - but not the word count. I have not used

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Andrew Ash
Thanks for the info Michael. I see this in a few other places in the Impala+Parquet context, but a quick scan didn't reveal any leads on this warning. I'll ignore it for now. Andrew On Mon, Sep 22, 2014 at 12:16 PM, Michael Armbrust mich...@databricks.com wrote: These are coming from the

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
Hi Gaurav, Can you put hive-site.xml in conf/ and try again? Thanks, Yin On Mon, Sep 22, 2014 at 4:02 PM, gtinside gtins...@gmail.com wrote: Hi , I have been using spark shell to execute all SQLs. I am connecting to Cassandra , converting the data in JSON and then running queries on it,

Re: Streaming: HdfsWordCount does not print any output

2014-09-22 Thread SK
This issue is resolved. The file needs to be created after the program has started to execute. thanks
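A minimal sketch of the HdfsWordCount setup, assuming the spark-shell and a hypothetical monitored directory; textFileStream only picks up files that appear in the directory after the context has started, which is what tripped up this thread:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("hdfs:///stream/input")   // hypothetical directory
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()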

Re: The wikipedia Extraction (WEX) Dataset

2014-09-22 Thread daidong
Really sorry to bother everybody. It is my mistake. The data set is still on Amazon and can be downloaded. The reason for my failure is that I started an instance outside the U.S., so I could not attach the EBS volume.

Change number of workers and memory

2014-09-22 Thread Dhimant
I have a Spark cluster with some high-performance nodes and others with commodity specs (lower configuration). When I configure worker memory and instances in spark-env.sh, it applies to all the nodes. Can I change the SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES properties per

Why recommend 2-3 tasks per CPU core ?

2014-09-22 Thread myasuka
We are now implementing a matrix multiplication algorithm on Spark, which was previously designed in the traditional MPI style. It assumes every core in the grid computes in parallel. Now in our development environment, each executor node has 16 cores, and I assign 16 tasks to each executor node
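A sketch of what the 2-3 tasks-per-core recommendation amounts to in configuration terms, with illustrative node and core counts:

    // e.g. 4 nodes x 16 cores x 3 tasks per core, so short tasks can backfill
    // idle cores while longer tasks are still running.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.default.parallelism", (4 * 16 * 3).toString)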