Re: Type problem in Java when using flatMapValues

2014-10-03 Thread Robin Keunen
Damn, you're right, I wasn't looking at it properly. I was confused by IntelliJ I guess. Many thanks! On 2014-10-02 19:02, Sean Owen wrote: Eh, is it not that you are mapping the values of an RDD whose keys are (String, String) pairs, but expecting the keys to be Strings? That's also about what the
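A minimal Scala sketch of the type issue (the thread is about the Java API, but the shape is the same; all names here are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // brings in pair-RDD functions such as flatMapValues

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("FlatMapValuesDemo"))
    // The keys here are (String, String) pairs, not plain Strings.
    val pairs = sc.parallelize(Seq((("a", "b"), 1), (("c", "d"), 2)))
    // flatMapValues transforms the values only; keys keep their original type,
    // so the result is RDD[((String, String), Int)], not RDD[(String, Int)].
    val expanded = pairs.flatMapValues(v => Seq(v, v * 10))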

Re: new error for me

2014-10-03 Thread Akhil Das
I used to face this while running on a single-node machine when I allocated more memory for the executor (i.e., my machine had 28GB of memory and I allocated 26GB for the executor; dropping the executor memory from 26GB to 20GB solved my issue). If you are seeing an executor lost exception then you can
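As a rough Scala sketch of the fix described above (the 20g figure is just the value that worked in this report, not a general rule):

    import org.apache.spark.SparkConf

    // On a 28GB machine, leave headroom for the OS and other daemons
    // instead of handing nearly all physical memory to the executor.
    val conf = new SparkConf().set("spark.executor.memory", "20g")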

Re: Any issues with repartition?

2014-10-03 Thread Akhil Das
What is your cluster setup, and how much memory are you allocating to the executor? Thanks Best Regards On Fri, Oct 3, 2014 at 7:52 AM, jamborta jambo...@gmail.com wrote: Hi Arun, Have you found a solution? It seems that I have the same problem. thanks, -- View this message in context:

SparkSQL on Hive error

2014-10-03 Thread Kevin Paul
Hi all, I tried to launch my application with spark-submit; the command I use is: bin/spark-submit --class ${MY_CLASS} --jars ${MY_JARS} --master local myApplicationJar.jar I've built Spark with SPARK_HIVE=true, was able to start HiveContext, and was able to run commands like,

Re: SparkSQL on Hive error

2014-10-03 Thread Michael Armbrust
Are you running master? There was briefly a regression here that is hopefully fixed by spark#2635 https://github.com/apache/spark/pull/2635. On Fri, Oct 3, 2014 at 1:43 AM, Kevin Paul kevinpaulap...@gmail.com wrote: Hi all, I tried to launch my application with spark-submit, the command I use

spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException

2014-10-03 Thread serkan.dogan
Hi, I installed hbase-0.98.6-hadoop2. It's working with no problem. When I try to run the Spark HBase Python examples (the wordcount examples work, so it's not a Python issue): ./bin/spark-submit --master local --driver-class-path ./examples/target/spark-examples_2.10-1.1.0.jar

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Michael Armbrust
Often java.lang.NoSuchMethodError means that you have more than one version of a library on your classpath, in this case it looks like hive. On Thu, Oct 2, 2014 at 8:44 PM, Li HM hmx...@gmail.com wrote: I have rebuild package with -Phive Copied hive-site.xml to conf (I am using hive-0.12)

Re: Setup/Cleanup for RDD closures?

2014-10-03 Thread Mayur Rustagi
The current approach is to use mapPartitions: initialize the connection at the beginning, iterate through the data, then close off the connection. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Oct 3, 2014 at 10:16 AM, Stephen
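A minimal Scala sketch of that pattern (the Connection class and its methods are placeholders for whatever client library is actually used):

    import org.apache.spark.rdd.RDD

    // Placeholder client, standing in for a real connector library.
    class Connection { def lookup(s: String): String = s; def close(): Unit = () }
    def openConnection(): Connection = new Connection

    def withConnectionPerPartition(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        val conn = openConnection()             // setup: once per partition
        val out = iter.map(conn.lookup).toList  // materialize results first...
        conn.close()                            // ...so nothing touches a closed connection
        out.iterator                            // cleanup done; hand results back
      }

Note that the results are materialized with toList before the connection is closed; returning the lazy iterator directly and then closing would break, which may be part of what Sean Owen's reply below is pointing at.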

Re: SparkSQL on Hive error

2014-10-03 Thread Cheng Lian
Also make sure to call hiveContext.sql within the same thread where hiveContext is created, because Hive uses a thread-local variable to initialize Driver.conf. On 10/3/14 4:52 PM, Michael Armbrust wrote: Are you running master? There was briefly a regression here that is hopefully
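In code form (a minimal sketch, assuming an existing SparkContext named sc; the database name is illustrative):

    import org.apache.spark.sql.hive.HiveContext

    // Create and use the HiveContext on one and the same thread: Hive keeps
    // per-thread state, so calls made from other threads may not see its conf.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("USE mydb")
    hiveContext.sql("SHOW TABLES").collect().foreach(println)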

Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Irina Fedulova
Hi, I have set up a Spark 0.9.2 standalone cluster using CDH5 and the pre-built Spark distribution archive for Hadoop 2. I was not using the spark-ec2 scripts because I am not on the EC2 cloud. Spark-shell seems to be working properly -- I am able to perform simple RDD operations, as well as e.g. SparkPi

Re: Any issues with repartition?

2014-10-03 Thread jamborta
I have two nodes with 96G RAM and 16 cores each; my setup is as follows: conf = (SparkConf() .setMaster("yarn-cluster") .set("spark.executor.memory", "30G") .set("spark.cores.max", "32") .set("spark.executor.instances", "2") .set("spark.executor.cores", "8")

The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
Hi, I am quite a new user of Spark, and I have a stupid question about mounting the ephemeral disk for AWS EC2. If I understand the spark_ec2.py script well, it is spark-ec2/setup-slave.sh that mounts the ephemeral disks for AWS EC2 (Instance Store Volumes). However, in setup-slave.sh, it seems that these

Could Spark make use of Intel Xeon Phi?

2014-10-03 Thread 余 浪
Hi, I have set up Spark 1.0.2 on the cluster using standalone mode, and the input is managed by HDFS. One node of the cluster has an Intel Xeon Phi 5110P coprocessor. Is there any possibility that Spark could be aware of the Phi and run jobs on it? Do I have to modify the code of the scheduler?

Re: Could Spark make use of Intel Xeon Phi?

2014-10-03 Thread 牛兆捷
What are the specific features of the Intel Xeon Phi that can be utilized by Spark? 2014-10-03 18:09 GMT+08:00 余 浪 yulan...@gmail.com: Hi, I have set up Spark 1.0.2 on the cluster using standalone mode and the input is managed by HDFS. One node of the cluster has Intel Xeon Phi 5110P

Breeze Library usage in Spark

2014-10-03 Thread Priya Ch
Hi Team, When I try to use DenseMatrix from the breeze library in Spark, it throws the following error: java.lang.NoClassDefFoundError: breeze/storage/Zero Can someone help me with this? Thanks, Padma Ch

How to save Spark log into file

2014-10-03 Thread arthur.hk.c...@gmail.com
Hi, How can the Spark log be saved into a file instead of being shown on the console? Below is my conf/log4j.properties: ### # Root logger option log4j.rootLogger=INFO, file # Direct log messages to a log file log4j.appender.file=org.apache.log4j.RollingFileAppender # Redirect
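For reference, a complete file-only configuration along these lines might look as follows (a sketch: the log path and rotation settings are placeholder values to adjust):

    # Root logger option
    log4j.rootLogger=INFO, file

    # Direct log messages to a rolling log file instead of the console
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    log4j.appender.file.File=/var/log/spark/spark.log
    log4j.appender.file.MaxFileSize=10MB
    log4j.appender.file.MaxBackupIndex=10
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n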

Re: how to debug ExecutorLostFailure

2014-10-03 Thread jamborta
Digging a bit deeper: the executors get lost when the memory gets close to the physical memory size: http://apache-spark-user-list.1001560.n3.nabble.com/file/n15680/memory_usage.png I'm not clear whether I am allocating too much or too little memory in this case. thanks, -- View this message

Re: Setup/Cleanup for RDD closures?

2014-10-03 Thread Sean Owen
Yes, though it's a little more complex than that: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E On Fri, Oct 3, 2014 at 9:58 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Current approach is to use

Re: Spark inside Eclipse

2014-10-03 Thread Sanjay Subramanian
cool thanks, will set this up and report back how things went. regards, sanjay From: Daniel Siegmann daniel.siegm...@velos.io To: Ashish Jain ashish@gmail.com Cc: Sanjay Subramanian sanjaysubraman...@yahoo.com; user@spark.apache.org user@spark.apache.org Sent: Thursday, October 2, 2014

Re: Spark inside Eclipse

2014-10-03 Thread jay vyas
For IntelliJ + SBT, you can also follow the directions at http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html . It's really easy to run Spark in an IDE. The process for Eclipse is virtually identical. On Fri, Oct 3, 2014 at 10:03 AM, Sanjay Subramanian

Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-10-03 Thread irina
Hi ssimanta, were you able to resolve the problem with the failing standalone Scala program while the Spark REPL works just fine? I am getting the same issue... Thanks, Irina -- View this message in context:

Question about addFiles()

2014-10-03 Thread Tom Weber
Just getting started with Spark, so hopefully this is all there and I just haven't found it yet. I have a driver program on my client machine; I can use addFiles to distribute files to the remote worker nodes of the cluster. They are there to be found by my code running in the executors, so all is

Re: [SparkSQL] Function parity with Shark?

2014-10-03 Thread Yana Kadiyska
Thanks -- it does appear that I misdiagnosed a bit: case works generally but it doesn't seem to like the bit operation, which does not seem to work (the type of bit_field in Hive is bigint): Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1 = 1

Re: Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Yana Kadiyska
When you're running spark-shell and the example, are you actually specifying --master spark://master:7077 as shown here: http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark ? Because if you're not, your spark-shell is running in local mode and not actually connecting to
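The difference in code, as a minimal Scala sketch (the host name and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Connects to the standalone cluster. With setMaster("local") the job
    // would instead run entirely inside the driver JVM and never touch the cluster.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MyStandaloneApp")
    val sc = new SparkContext(conf)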

Using GraphX with Spark Streaming?

2014-10-03 Thread Arko Provo Mukherjee
Hello Spark Gurus, I am trying to learn Spark, and I am especially interested in GraphX. Since Spark can be used in a streaming context as well, I wanted to know whether it is possible to use Spark toolkits like GraphX or MLlib in the streaming context? Apologies if this is a stupid question but I am

Re: Breeze Library usage in Spark

2014-10-03 Thread Xiangrui Meng
Did you add a different version of breeze to the classpath? In Spark 1.0 we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze version you used is different from the one that comes with Spark, you might see class-not-found errors. -Xiangrui On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch
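In sbt terms, the safe setup is to depend on the breeze version Spark already bundles rather than pulling in a different one (a sketch for a Spark 1.1.x build, using the versions named above):

    // build.sbt -- keep breeze aligned with the version bundled in Spark 1.1
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided",
      // Only needed when using breeze directly; 0.9 matches Spark 1.1.
      "org.scalanlp" %% "breeze" % "0.9" % "provided"
    )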

Re: HiveContext: cache table not supported for partitioned table?

2014-10-03 Thread Du Li
Thanks for your explanation. From: Cheng Lian lian.cs@gmail.com Date: Thursday, October 2, 2014 at 8:01 PM To: Du Li l...@yahoo-inc.com.INVALID, d...@spark.apache.org

MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread jw.cmu
I was able to run collaborative filtering with low rank numbers, like 20~160, on the Netflix dataset, but it fails with the following error when I set the rank to 1000: 14/10/03 03:27:36 WARN TaskSetManager: Loss was due to java.lang.IllegalArgumentException java.lang.IllegalArgumentException:

[ANN] SparkSQL support for Cassandra with Calliope

2014-10-03 Thread Rohit Rai
Hi All, A year ago we started this journey and laid the path for the Spark + Cassandra stack. We established the groundwork and direction for Spark Cassandra connectors and we have been happy seeing the results. With the Spark 1.1.0 and SparkSQL release, it's time to take Calliope

array size limit vs partition number

2014-10-03 Thread anny9699
Hi, Sorry, I am not very familiar with Java. I found that if I set the RDD partition number higher, I get this error message: java.lang.OutOfMemoryError: Requested array size exceeds VM limit; however, if I set the RDD partition number lower, the error is gone. My AWS EC2 cluster has 72

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
The current impl of ALS constructs least squares subproblems in memory. So for rank 100, the total memory it requires is about 480,189 * 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA for optimizing ALS:
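Spelling out that arithmetic (a sketch; 480,189 is the Netflix-dataset dimension used above, and the formula is the n * rank^2 / 2 * 8-byte estimate quoted in the message):

    // Approximate memory for the in-memory least-squares subproblems.
    def alsSubproblemBytes(n: Long, rank: Long): Long = n * rank * rank / 2 * 8

    alsSubproblemBytes(480189L, 100L)   // ~1.9e10 bytes, i.e. roughly 20GB
    alsSubproblemBytes(480189L, 1000L)  // ~1.9e12 bytes, i.e. roughly 2TB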

Re: Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Irina Fedulova
Yana, many thanks for looking into this! I am not running spark-shell in local mode; I am really starting spark-shell with --master spark://master:7077 and running in cluster mode. The second thing is that I tried to set spark.driver.host to master, both in the Scala app when creating the context, and in

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread jw.cmu
Thanks, Xiangrui. I didn't check the test error yet. I agree that rank 1000 might overfit for this particular dataset. Currently I'm just running some scalability tests - I'm trying to see how large the model can be scaled to given a fixed amount of hardware. -- View this message in context:

Re: [SparkSQL] Function parity with Shark?

2014-10-03 Thread Michael Armbrust
Thanks for digging in! These both look like they should have JIRAs. On Fri, Oct 3, 2014 at 8:14 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Thanks -- it does appear that I misdiagnosed a bit: case works generally but it doesn't seem to like the bit operation, which does not seem to work

Re: window every n elements instead of time based

2014-10-03 Thread Michael Allman
Hi, I also have a use for count-based windowing. I'd like to process data batches by size as opposed to time. Is this feature on the development roadmap? Is there a JIRA ticket for it? Thank you, Michael -- View this message in context:

Re: Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Yana Kadiyska
I don't think it's a red herring... (btw. spark.driver.host needs to be set to the IP or FQDN of the machine where you're running the program). I am running 0.9.2 on CDH4 and the beginning of my executor log looks like below (I've obfuscated the IP -- this is the log from executor

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-03 Thread Boromir Widas
Thanks Matei, will check out the MLLib implementation. On Wed, Oct 1, 2014 at 2:24 PM, Andy Twigg andy.tw...@gmail.com wrote: Yes, that makes sense. It's similar to the all reduce pattern in vw. On Wednesday, 1 October 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Some of the MLlib

Re: The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
I have taken a look at the code of the mesos spark-ec2 project and the AWS documentation, and I think I have found the answer. In fact, there are two types of AMI in AWS: EBS-backed AMIs and instance-store-backed AMIs. For an EBS-backed AMI, we can add instance store volumes when we create the image (the details can

Re: Spark Monitoring with Ganglia

2014-10-03 Thread TANG Gen
Maybe you can follow the instructions at this link: https://github.com/mesos/spark-ec2/tree/v3/ganglia . For me it works well. -- View this message in context:

pyspark on python 3

2014-10-03 Thread Ariel Rokem
Hi everyone, What is the state of affairs w.r.t. Python 3? Is this post still a good description of the situation? https://groups.google.com/forum/#!topic/spark-users/GRKmVo0ZDBc Thanks! Ariel

Re: pyspark on python 3

2014-10-03 Thread Gen
According to the official Spark site, the latest version of Spark (1.1.0) does not work with Python 3: Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used. -- View this message in context:

Worker with no Executor (YARN client-mode)

2014-10-03 Thread jonathan.keebler
Hi all, We're running Spark 1.0 on CDH 5.1.2. We're using Spark in YARN-client mode. We're seeing that one of our nodes is not being assigned any tasks, and no resources (RAM, CPU) are being used on this node. In the CM UI this worker node is in good health and the Spark Worker process is

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Li HM
This is my SPARK_CLASSPATH after cleanup: SPARK_CLASSPATH=/home/test/lib/hcatalog-core.jar:$SPARK_CLASSPATH Now use mydb works, but show tables and select * from test still give exceptions: spark-sql> show tables; OK java.io.IOException: java.io.IOException: Cannot create an instance of

Re: partitions number with variable number of cores

2014-10-03 Thread Gen
Maybe I am wrong, but how many resources a Spark application can use depends on the mode of deployment (the type of resource manager); you can take a look at https://spark.apache.org/docs/latest/job-scheduling.html . For your case, I

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Michael Armbrust
Why are you including hcatalog-core.jar? That is probably causing the issues. On Fri, Oct 3, 2014 at 3:03 PM, Li HM hmx...@gmail.com wrote: This is my SPARK_CLASSPATH after cleanup SPARK_CLASSPATH=/home/test/lib/hcatalog-core.jar:$SPARK_CLASSPATH now use mydb works. but show tables and

problem with user@spark.apache.org spam filter

2014-10-03 Thread Andy Davidson
Any idea why my email was returned with the following error message? Thanks Andy This is the mail system at host smtprelay06.hostedemail.com. I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below. For further assistance,

Accumulator question

2014-10-03 Thread Nathan Kronenfeld
I notice that accumulators register themselves with a private Accumulators object. I don't notice any way to unregister them when one is done. Am I missing something? If not, is there any plan for how to free up that memory? I have a case where we're gathering data from repeated queries using
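For context, the pattern being described looks roughly like this (a Scala sketch; per the question above, each accumulator created this way is registered internally with no apparent way to unregister it):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Each call creates -- and internally registers -- a fresh accumulator.
    def runQuery(sc: SparkContext, data: RDD[Int]): Long = {
      val counter = sc.accumulator(0L, "rows seen this query")
      data.foreach(_ => counter += 1L)
      counter.value
    }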

any good library to implement multilabel classification on spark?

2014-10-03 Thread critikaled
Hi, Going through the Spark MLlib docs I have noticed that it supports multiclass classification. Can anybody help me in implementing multilabel classification on Spark, like the Mulan http://mulan.sourceforge.net/index.html and Meka http://meka.sourceforge.net/ libraries? -- View this message

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
It would be really helpful if you can help test the scalability of the new ALS impl: https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It should be faster and more scalable, but the code is messy now. Best, Xiangrui On Fri, Oct 3, 2014 at 11:57

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Li HM
If I don't have that jar, I am getting the following error: Exception in thread main java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hcatalog.security.HdfsAuthorizationProvider at

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Hmxxyy
No, it is hive 0.12.4. Let me try your suggestion. It is an existing hive db. I am using the original hive-site.xml as is. Sent from my iPhone On Oct 3, 2014, at 5:02 PM, Edwin Chiu edwin.c...@manage.com wrote: Are you using hive 0.13? Switching back to HadoopDefaultAuthenticator in

Spark Streaming writing to HDFS

2014-10-03 Thread Abraham Jacob
Hi All, Would really appreciate it if someone in the community could help me with this. I have a simple Java Spark Streaming application - NetworkWordCount: SparkConf sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("Streaming WordCount"); JavaStreamingContext jssc = new
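The overall flow, as a minimal Scala sketch (the thread's code is Java, but the shape is the same; host, port, batch interval, and output path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions

    val conf = new SparkConf().setAppName("Streaming WordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Writes one directory of part files per batch under the given HDFS prefix.
    counts.saveAsTextFiles("hdfs:///user/me/wordcounts/batch")

    ssc.start()
    ssc.awaitTermination()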

Re: Spark inside Eclipse

2014-10-03 Thread Sanjay Subramanian
So some progress but still errors: object WordCount { def main(args: Array[String]) { if (args.length < 1) { System.err.println("Usage: WordCount file"); System.exit(1) } val conf = new SparkConf().setMaster("local").setAppName("Whatever"); val sc = new SparkContext(conf);

Re: pyspark on python 3

2014-10-03 Thread tomo cocoa
Hi, I would prefer that PySpark could also be executed on Python 3. Do you have some reason or demand to use PySpark with Python 3? If you create an issue on JIRA, I will try to resolve it. On 4 October 2014 06:47, Gen gen.tan...@gmail.com wrote: According to the official site of spark, for the

Re: pyspark on python 3

2014-10-03 Thread Josh Rosen
It would be great if we supported Python 3 and I'd be happy to review any pull requests to add it. I don't know that Python 3 is very widely used, but I'm open to supporting it if it won't require too much work. By the way, we recently added support for PyPy:

Null values in Date field only when RDD is saved as File.

2014-10-03 Thread Manas Kar
Hi, I am using a library that parses AIS messages. My code, which follows these simple steps, gives me null values in the Date field: 1) Get the message from a file. 2) Parse the message. 3) Map the message RDD to keep only (Date, SomeInfo). 4) Take the top 100 elements. Result: the Date field appears fine

Trouble getting filtering on field correct

2014-10-03 Thread Chop
Given an RDD with multiple lines of the form: u'207.86.121.131 207.86.121.131 2012-11-27 13:02:17 titlestring 622592 27 184464' (fields are separated by a space) What pyspark function/commands do I use to filter out those lines where line[8] >= x? (i.e. line[8] >= 125) When I use line.split(' ') I get

Re: Null values in Date field only when RDD is saved as File.

2014-10-03 Thread manasdebashiskar
Correction to my question: (5) should read: 5) save the tuple RDD (created at step 3) to HDFS using saveAsTextFile. Can someone please guide me in the right direction? Thanks in advance, Manas - Manas Kar -- View this message in context:

Re: Null values in Date field only when RDD is saved as File.

2014-10-03 Thread manasdebashiskar
Correction to my question: (5) should read: 5) save the tuple RDD (created at step 3) to HDFS using saveAsTextFile. Can someone please guide me in the right direction? Thanks in advance, Manas On Fri, Oct 3, 2014 at 11:42 PM, manasdebashiskar [via Apache Spark User List]

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Li HM
It won't work with <value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>. Just wondering how and why it works for you guys. Here is the new error: Exception in thread main java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Li HM
If I change it to <value>org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProvider</value> the error becomes: Exception in thread main java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.NoSuchMethodException:

Re: Fwd: Breeze Library usage in Spark

2014-10-03 Thread DB Tsai
You don't have to include the breeze jar, which is already in the Spark assembly jar. For the native one, it's optional. Sent from my Google Nexus 5 On Oct 3, 2014 8:04 PM, Priya Ch learnings.chitt...@gmail.com wrote: yes. I have included breeze-0.9 in the build.sbt file. I'll change this to 0.7. Apart from

Re: Trouble getting filtering on field correct

2014-10-03 Thread Davies Liu
rdd.filter(lambda line: int(line.split(' ')[8]) >= 125) On Fri, Oct 3, 2014 at 8:16 PM, Chop thomrog...@att.net wrote: Given an RDD with multiple lines of the form: u'207.86.121.131 207.86.121.131 2012-11-27 13:02:17 titlestring 622592 27 184464' (fields are separated by a space) What pyspark