Re: Spark SQL JDBC

2014-08-13 Thread Cheng Lian
Oh, thanks for reporting this. This should be a bug: since SPARK_HIVE was deprecated, we shouldn't rely on it any more. On Wed, Aug 13, 2014 at 1:23 PM, ZHENG, Xu-dong dong...@gmail.com wrote: Just found this is because the lines below in make_distribution.sh don't work: if [ $SPARK_HIVE ==

Re: training recsys model

2014-08-13 Thread Xiangrui Meng
You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html -Xiangrui On Tue, Aug 12, 2014 at 8:01 PM,
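For a concrete starting point, a minimal Scala sketch of that grid-search idea (the ratings file path and the train/validation split are placeholders; the metric is RMSE, as in the AMP Camp tutorial):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    // Load ratings (assumed format: user,product,rating) and hold out a validation set.
    val allRatings: RDD[Rating] = sc.textFile("ratings.csv").map { line =>
      val f = line.split(",")
      Rating(f(0).toInt, f(1).toInt, f(2).toDouble)
    }
    val Array(training, validation) = allRatings.randomSplit(Array(0.8, 0.2), seed = 42L)

    // RMSE of a trained model on the held-out set.
    def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
      val predictions = model.predict(data.map(r => (r.user, r.product)))
                             .map(r => ((r.user, r.product), r.rating))
      val joined = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
      math.sqrt(joined.map { case (_, (actual, pred)) => (actual - pred) * (actual - pred) }.mean())
    }

    // Grid search: train one model per parameter combination and keep the best.
    val candidates = for (rank <- Seq(8, 12); lambda <- Seq(0.01, 0.1, 1.0); iters <- Seq(10, 20)) yield {
      val model = ALS.train(training, rank, iters, lambda)
      ((rank, lambda, iters), rmse(model, validation))
    }
    val (bestParams, bestRmse) = candidates.minBy(_._2)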

Re: Killing spark app problem

2014-08-13 Thread grzegorz-bialek
In the next run of this application it exited safely after Ctrl-\ (SIGQUIT). grzegorz-bialek wrote: Hi, when I run some Spark application on my local machine using spark-submit: $SPARK_HOME/bin/spark-submit --driver-memory 1g <class> <jar> When I want to interrupt the computation with Ctrl-C it

Viewing web UI after fact

2014-08-13 Thread grzegorz-bialek
Hi, I wanted to access the Spark web UI after the application stops. I set spark.eventLog.enabled to true and the logs are available in JSON format in /tmp/spark-event, but the web UI isn't available at http://driver-node:4040. I'm running Spark in standalone mode. What should I do to access the web UI

Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-13 Thread ajatix
Thanks a lot! Worked like a charm.

Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi, I am trying to run and test some graph APIs using Java. I started with connected components; here is my code. JavaRDD<Edge<Long>> vertices; // code to populate vertices .. .. ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class); ClassTag<Float> floatTag =
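For comparison, the same call in a minimal Scala sketch (the Java version wraps this API through the ClassTags shown above; the sample edges here are made up):

    import org.apache.spark.graphx.{Edge, Graph}

    // Toy edge list; in practice load this from your data.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 5L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // connectedComponents() labels each vertex with the smallest vertex id in its component.
    graph.connectedComponents().vertices.collect().foreach {
      case (vertexId, componentId) => println(s"$vertexId -> $componentId")
    }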

Re: Re: Compile spark code with idea succesful but run SparkPi error with java.lang.SecurityException

2014-08-13 Thread Zhanfeng Huo
Thank you, Ron. That helps a lot. I want to debug Spark code to trace state transformations, so I use sbt as my build tool and compile the Spark code in IntelliJ IDEA. Zhanfeng Huo From: Ron's Yahoo! Date: 2014-08-12 03:46 To: Zhanfeng Huo CC: user Subject: Re: Compile spark code with idea

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread lancezhange
Let's say you have a model of class org.apache.spark.mllib.classification.LogisticRegressionModel. You can save the model to disk as follows: import java.io.FileOutputStream; import java.io.ObjectOutputStream; val fos = new FileOutputStream("e:/model.obj"); val oos = new
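A slightly fuller sketch of that approach (plain Java serialization to a local file; the path is only an example, and it assumes the model class is Serializable and is read back with the same MLlib version):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.classification.LogisticRegressionModel

    def saveModel(model: LogisticRegressionModel, path: String): Unit = {
      val oos = new ObjectOutputStream(new FileOutputStream(path))
      try oos.writeObject(model) finally oos.close()
    }

    def loadModel(path: String): LogisticRegressionModel = {
      val ois = new ObjectInputStream(new FileInputStream(path))
      try ois.readObject().asInstanceOf[LogisticRegressionModel] finally ois.close()
    }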

Is there any interest in handling XML within Spark ?

2014-08-13 Thread Darin McBeath
I've been playing around with Spark off and on for the past month and have developed some XML helper utilities that enable me to filter an XML dataset as well as transform an XML dataset (we have a lot of XML content).  I'm posting this email to see if there would be any interest in this effort

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Jaideep Dhok
Hi, I have faced a similar issue when trying to run a map function with predict. In my case I had some non-serializable fields in my calling class. After making those fields transient, the error went away. On Wed, Aug 13, 2014 at 6:39 PM, lancezhange lancezha...@gmail.com wrote: let's say you

Re: Lost executors

2014-08-13 Thread rpandya
After a lot of grovelling through logs, I found out that the Nagios monitor process detected that the machine was almost out of memory, and killed the SNAP executor process. So why is the machine running out of memory? Each node has 128GB of RAM, 4 executors, about 40GB of data. It did run out of

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Sean Owen
PS I think that solving not serializable exceptions by adding 'transient' is usually a mistake. It's a band-aid on a design problem. transient causes the default serialization mechanism to not serialize the field when the object is serialized. When deserialized, this field will be null, which
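A small sketch of the alternative Sean is pointing at: instead of marking a field transient (and getting null after deserialization), copy just the values the closure needs into local vals so the task ships those values rather than the whole enclosing object. Class names here are invented for illustration:

    import org.apache.spark.rdd.RDD

    class AppConfig(val threshold: Double)   // imagine this also holds non-serializable resources

    class FeatureScorer(config: AppConfig) {
      def countAbove(data: RDD[Double]): Long = {
        // The closure captures the local `threshold` (a Double), not `this`,
        // so nothing non-serializable needs to be shipped to the executors.
        val threshold = config.threshold
        data.filter(_ > threshold).count()
      }
    }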

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread lancezhange
My prediction code is simple enough, as follows: val labelsAndPredsOnGoodData = goodDataPoints.map { point => val prediction = model.predict(point.features); (point.label, prediction) } When the model is the loaded one, the above code just doesn't work. Can you catch the error? Thanks. PS. I use

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
+1 what Sean said. And if there are too many state/argument parameters for your taste, you can always create a dedicated (serializable) class to encapsulate them. Sent while mobile. Pls excuse typos etc. On Aug 13, 2014 6:58 AM, Sean Owen so...@cloudera.com wrote: PS I think that solving not
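A tiny sketch of that "dedicated class" idea (names invented): a case class is serializable by default and keeps the closure's dependencies explicit.

    import org.apache.spark.rdd.RDD

    // Case classes are serializable by default, so they ship cleanly to executors.
    case class PredictParams(threshold: Double, weights: Array[Double])

    def score(data: RDD[Array[Double]], p: PredictParams): RDD[Boolean] =
      data.map(xs => xs.zip(p.weights).map { case (x, w) => x * w }.sum > p.threshold)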

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually to see if that throws any visible exceptions. Sent while mobile. Pls excuse typos etc. On Aug 13,

SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Using the SchemaRDD insertInto method, is there any way to support partitions on a field in the RDD? If not, what's the alternative, register a table and do an insert into via SQL statement? Any plans to support partitioning via insertInto? What other options are there for inserting into a

Re: Contribution to Spark MLLib

2014-08-13 Thread Debasish Das
Dennis, if it is PLSA with least-squares loss then the QuadraticMinimizer that we open sourced should be able to solve it for a modest number of topics (up to 1000, I believe)... If we integrate a CG solver for equality constraints (Nocedal's KNITRO paper is the reference), the topic size can be increased much larger than

Getting percentile from Spark Streaming?

2014-08-13 Thread bumble123
Hi, I'm trying to figure out how to constantly update, say, the 95th percentile of a set of data through Spark Streaming. I'm not sure how to order the dataset though, and while I can find percentiles in regular Spark, I can't seem to figure out how to get that to transfer over to Spark
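One rough per-batch approach, as a Scala sketch (exact, sort-based, so only sensible for modest batch sizes; it assumes the values arrive as a DStream[Double]):

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
    import org.apache.spark.streaming.dstream.DStream

    def printP95(values: DStream[Double]): Unit = {
      values.foreachRDD { rdd =>
        val n = rdd.count()
        if (n > 0) {
          // Sort the batch, index it, and look up the element at the 95th-percentile position.
          val indexed = rdd.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
          val p95 = indexed.lookup(((n - 1) * 0.95).toLong).head
          println(s"95th percentile of this batch: $p95")
        }
      }
    }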

Re: Lost executors

2014-08-13 Thread Matei Zaharia
What is your Spark executor memory set to? (You can see it in Spark's web UI at http://driver:4040 under the executors tab). One thing to be aware of is that the JVM never really releases memory back to the OS, so it will keep filling up to the maximum heap size you set. Maybe 4 executors with

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Michael Armbrust
This is not supported at the moment. There are no concrete plans to support it through the programmatic API, but it should work using SQL as you suggested. On Wed, Aug 13, 2014 at 8:22 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Using the SchemaRDD *insertInto*

Re: Viewing web UI after fact

2014-08-13 Thread Andrew Or
The Spark UI isn't available through the same address; otherwise new applications won't be able to bind to it. Once the old application finishes, the standalone Master renders the after-the-fact application UI and exposes it under a different URL. To see this, go to the Master UI (master-url:8080)
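For reference, a minimal sketch of the event-log settings involved (the directory is a placeholder and must be somewhere the standalone Master can read):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "/tmp/spark-events")   // placeholder path
    val sc = new SparkContext(conf)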

Re: Support for ORC Table in Shark/Spark

2014-08-13 Thread Michael Armbrust
I would expect this to work with Spark SQL (available in 1.0+), but there is a JIRA open to confirm this works: SPARK-2883 (https://issues.apache.org/jira/browse/SPARK-2883). On Mon, Aug 11, 2014 at 10:23 PM, vinay.kash...@socialinfra.net wrote: Hi all, Is it possible to use table with ORC

Re: Lost executors

2014-08-13 Thread Shivaram Venkataraman
If the JVM heap size is close to the memory limit the OS sometimes kills the process under memory pressure. I've usually found that lowering the executor memory size helps. Shivaram On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com wrote: What is your Spark executor

Re: How to direct insert vaules into SparkSQL tables?

2014-08-13 Thread Michael Armbrust
I do not believe this is true. If you are using a HiveContext you should be able to register an RDD as a temporary table and then use INSERT INTO (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries) to add data to a hive
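A rough sketch of that pattern against the 1.0-era Scala API (table and column names are invented; in 1.0.x the calls are hql() and registerAsTable(), later renamed sql() and registerTempTable()):

    import org.apache.spark.sql.hive.HiveContext

    case class LogLine(page: String, userId: String)

    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD   // implicit: RDD[case class] -> SchemaRDD

    val newRows = sc.textFile("hdfs:///logs/today")
      .map(_.split("\t"))
      .map(f => LogLine(f(0), f(1)))
    newRows.registerAsTable("staging_logs")

    val insert = hiveContext.hql("INSERT INTO TABLE page_views SELECT page, userId FROM staging_logs")
    insert.count()   // forces execution in versions where the INSERT is lazy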

Re: Lost executors

2014-08-13 Thread Andrew Or
To add to the pile of information we're asking you to provide, what version of Spark are you running? 2014-08-13 11:11 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu : If the JVM heap size is close to the memory limit the OS sometimes kills the process under memory pressure. I've

Re: Lost executors

2014-08-13 Thread rpandya
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size would indeed run out of memory (the machine has 110GB). And in fact they would get repeatedly restarted and killed until eventually Spark gave up. I'll try with a smaller limit, but it'll be a while - somehow my HDFS got

groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
Hi, let me begin by describing my Spark setup on EC2 (launched using the provided spark-ec2.py script): - 100 c3.2xlarge workers (8 cores, 15 GB memory each) - 1 c3.2xlarge master (only running the master daemon) - Spark 1.0.2 - 8 GB mounted at /, 80 GB mounted at /mnt

Re: Lost executors

2014-08-13 Thread Andrew Or
Hi Ravi, Setting SPARK_MEMORY doesn't do anything. I believe you confused it with SPARK_MEM, which is now deprecated. You should set SPARK_EXECUTOR_MEMORY instead, or spark.executor.memory as a config in conf/spark-defaults.conf. Assuming you haven't set the executor memory through a different

Re: Kafka - streaming from multiple topics

2014-08-13 Thread maddenpj
Can you link to the JIRA issue? I'm having to work around this bug and it would be nice to monitor the JIRA so I can change my code when it's fixed.

Re: Cached RDD Block Size - Uneven Distribution

2014-08-13 Thread anthonyjschu...@gmail.com
I am having a similar problem: I have a large dataset in HDFS and (for a few possible reasons, including a filter operation and some of my computation nodes simply not being HDFS datanodes) have a large skew in my RDD blocks: the master node always has the most, while the worker nodes have few...

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Davies Liu
Arpan, which version of Spark are you using? Could you try the master or 1.1 branch, which can spill the data to disk during groupByKey()? PS: it's better to use reduceByKey() or combineByKey() to reduce the data size during shuffle. Maybe there is a huge key in the data set; you can find it in
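As a concrete illustration of the reduceByKey() suggestion (Scala; a per-key count stands in for the real aggregation):

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    val pairs = sc.parallelize(Seq((1, 10L), (1, 20L), (2, 5L)))

    // groupByKey materialises every value of a key at once...
    val countsViaGroup = pairs.groupByKey().mapValues(_.size)

    // ...while reduceByKey combines values map-side and keeps per-key state tiny.
    val countsViaReduce = pairs.mapValues(_ => 1).reduceByKey(_ + _)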

Open source project: Deploy Spark to a cluster with Puppet and Fabric.

2014-08-13 Thread bdamos
Hi Spark community, We're excited about Spark at Adobe Research and have just open sourced a project we use to automatically provision a Spark cluster and submit applications. The project is on GitHub, and we're happy for any feedback from the community:

Open source project: Example Spark project using Parquet as a columnar store with Thrift objects.

2014-08-13 Thread bdamos
Hi Spark community, We're excited about Spark at Adobe Research and have just open sourced an example project writing and reading Thrift objects to Parquet with Spark. The project is on GitHub, and we're happy for any feedback: https://github.com/adobe-research/spark-parquet-thrift-example

spark streaming : what is the best way to make a driver highly available

2014-08-13 Thread salemi
Hi all, what is the best way to make a Spark Streaming driver highly available? I would like the backup driver to pick up the processing if the primary driver dies. Thanks, Ali
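One option documented for Spark Streaming around this time is checkpoint-based driver recovery: run the driver under something that restarts it (for example --supervise in standalone cluster mode) and rebuild the StreamingContext from a checkpoint directory, rather than keeping a hot-standby driver. A minimal sketch, with the paths and the stream source as placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("MyStreamingApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      val lines = ssc.socketTextStream("some-host", 9999)        // placeholder source
      lines.count().print()
      ssc
    }

    // On a clean start this builds a new context; after a driver restart it
    // recovers the context (and pending batches) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()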

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-13 Thread Yin Huai
I think the problem is that when you are using yarn-cluster mode, because the Spark driver runs inside the application master, the hive-conf is not accessible by the driver. Can you try to set those confs by using hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the node

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest release) I'll try changing it to a reduceByKey() and check the size of the largest key and post the results here. UPDATE: If I run this job and DO NOT specify the number of partitions for the input textFile() (124 GB being

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Davies Liu
The 1.1 release will come out this month or next; we would really appreciate it if you could test it with your real case. Davies On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh ar...@automatic.com wrote: Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest release) I'll try

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Yin Huai
Hi Silvio, You can insert into a static partition via SQL statement. Dynamic partitioning is not supported at the moment. Thanks, Yin On Wed, Aug 13, 2014 at 2:03 PM, Michael Armbrust mich...@databricks.com wrote: This is not supported at the moment. There are no concrete plans at the
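For completeness, a sketch of the static-partition insert Yin describes (1.0-era hql(); the table, columns and partition value are invented):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val insert = hiveContext.hql(
      "INSERT INTO TABLE page_views PARTITION (dt = '2014-08-13') " +
      "SELECT page, userId FROM staging_logs")
    insert.count()   // forces execution in versions where the INSERT is lazy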

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
So you are saying that in spite of spark.shuffle.spill being set to true by default, version 1.0.2 does not spill data to disk during a groupByKey()? On Wed, Aug 13, 2014 at 2:05 PM, Davies Liu dav...@databricks.com wrote: The 1.1 release will come out this or next month, we will really

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Davies Liu
In Spark (Scala/Java), it will spill the data to disk, but in PySpark, it will not. On Wed, Aug 13, 2014 at 2:10 PM, Arpan Ghosh ar...@automatic.com wrote: So you are saying that in-spite of spark.shuffle.spill being set to true by default, version 1.0.2 does not spill data to disk during a

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-13 Thread DNoteboom
Hi, I have YARN available to test and I'm currently working on getting my application to run correctly on the YARN cluster. I'll get back to you with the results once I'm able to run it successfully (successfully meaning I at least get to the point where my application currently fails). On Tue,

Using Hadoop InputFormat in Python

2014-08-13 Thread Tassilo Klein
Hi, I'd like to read a (binary) file from Python for which I have defined a Java InputFormat definition (.java). However, now I am stuck on how to use that in Python and didn't find anything in the newsgroups either. As far as I know, I have to use the newAPIHadoopRDD function. However, I am not

Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Sunny Khatri
Not that familiar with the Python APIs, but you should be able to configure a Job object with your custom InputFormat and pass the required configuration (i.e. job.getConfiguration()) to newAPIHadoopRDD to get the required RDD. On Wed, Aug 13, 2014 at 2:59 PM, Tassilo Klein tjkl...@gmail.com

Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Kan Zhang
Tassilo, newAPIHadoopRDD has been added to PySpark in master and yet-to-be-released 1.1 branch. It allows you specify your custom InputFormat. Examples of using it include hbase_inputformat.py and cassandra_inputformat.py in examples/src/main/python. Check it out. On Wed, Aug 13, 2014 at 3:12

Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Tassilo Klein
Yes, that seems logical. But where/how do I pass the InputFormat definition (.jar/.java/.class) to Spark? I mean, when using Hadoop I need to call something like 'hadoop jar myInputFormat.jar -inFormat myFormat other stuff' to register the file format definition.

Spark Akka/actor failures.

2014-08-13 Thread ldmtwo
Need help getting around these errors. I have a program that runs fine on smaller input sizes. As the input gets larger, Spark has increasing difficulty staying efficient and functioning without errors. We have about 46 GB free on each node. The workers and executors are configured to use this up

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Arpan Ghosh
Here are the biggest keys: [ (17634, 87874097), (8407, 38395833), (20092, 14403311), (9295, 4142636), (14359, 3129206), (13051, 2608708), (14133, 2073118), (4571, 2053514), (16175, 2021669), (5268, 1908557), (3669, 1687313), (14051,

How to access the individual elements of RDD[Iterable[Float]] to do sum(),stdev() ?

2014-08-13 Thread KRaman
Hello all, I'm a beginner in Spark and Scala. I have the following code, which does a groupBy on 2 keys: val rdd2 = rdd1.groupBy(x => (x._2._1._1, x._2._1._2)) rdd1 looks like below; rdd1 is obtained as the result of a left outer join between 2 RDDs. Class[_ <: org.apache.spark.rdd.RDD[((Int,
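Once the grouped values are narrowed down to the Float field of interest (so you have an RDD[((Int, Int), Iterable[Float])]), mapValues can collapse each group into the statistics you need. A sketch, with the key and value types simplified from the original:

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Returns (sum, mean, population standard deviation) per key.
    def summarize(grouped: RDD[((Int, Int), Iterable[Float])]): RDD[((Int, Int), (Float, Float, Double))] =
      grouped.mapValues { vs =>
        val n = vs.size
        val sum = vs.sum
        val mean = sum / n
        val stdev = math.sqrt(vs.map(v => (v - mean) * (v - mean)).sum / n)
        (sum, mean, stdev)
      }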

java.lang.UnknownError: no bin was found for continuous variable.

2014-08-13 Thread Sameer Tilak
Hi All, I am using the decision tree algorithm and I get the following error. Any help would be great! java.lang.UnknownError: no bin was found for continuous variable. at org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492) at

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-13 Thread Davies Liu
For the hottest key, it will need about 1-2 GB of memory for the Python worker to do groupByKey(). These configurations cannot help with the memory of the Python worker. So, two options: 1) use reduceByKey() or combineByKey() to reduce the memory consumption in the Python worker; 2) try the master or 1.1 branch

Re: spark streaming : what is the best way to make a driver highly available

2014-08-13 Thread Tobias Pfeiffer
Hi, On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu wrote: what is the best way to make a spark streaming driver highly available. I would also be interested in that. In particular for Streaming applications where the Spark driver is running for a long time, this might be

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Yin, Michael, thanks, I'll try the SQL route. From: Yin Huai huaiyin@gmail.com Date: Wednesday, August 13, 2014 at 5:04 PM To: Michael Armbrust mich...@databricks.com Cc: Silvio Fiorito

SPARK_LOCAL_DIRS option

2014-08-13 Thread Debasish Das
Hi, I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can use more shuffle space... Does Spark clean all the shuffle files once the runs are done? It seems to me that the shuffle files are not cleaned... Do I need to set this variable: spark.cleaner.ttl? Right now we are

How to debug: Runs locally but not on cluster

2014-08-13 Thread jerryye
Hi all, I have an issue where I'm able to run my code in standalone mode but not on my cluster. I've isolated it to a few things but am at a loss as to how to debug this. Below is the code; any suggestions would be much appreciated. Thanks! 1) RDD size is causing the problem. The code below as is

Ways to partition the RDD

2014-08-13 Thread bdev
I've got ~500 tab-delimited log files, 25 GB each, with the page name and the userId who viewed the page, along with a timestamp. I'm trying to build a basic Spark app to get the unique visitors per page. I was able to achieve this using Spark SQL by registering the RDD of a case class and running a select
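Without Spark SQL, the same aggregation can be expressed directly on the RDD. A sketch, assuming the page name is in column 0 and the userId in column 1 of each tab-delimited line:

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    val logs = sc.textFile("hdfs:///logs/*.tsv")                 // placeholder path
    val uniqueVisitorsPerPage = logs
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1)))                     // (page, userId)
      .distinct()                                                // one record per (page, user)
      .map { case (page, _) => (page, 1L) }
      .reduceByKey(_ + _)                                        // unique visitors per page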

Re: SPARK_LOCAL_DIRS option

2014-08-13 Thread Andrew Ash
Hi Deb, If you don't have long-running Spark applications (those taking more than spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good solution. If however you have a mix of long-running and short-running applications, then the TTL-based solution will fail. It will clean up

Spark SQL Stackoverflow error

2014-08-13 Thread Vishal Vibhandik
Hi, I tried running the sample SQL code JavaSparkSQL but keep getting this error. The error comes on the line JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19"); C:\spark-submit --class org.apache.spark.examples.sql.JavaSparkSQL --master local

Script to deploy spark to Google compute engine

2014-08-13 Thread Soumya Simanta
Before I start doing something on my own I wanted to check if someone has created a script to deploy the latest version of Spark to Google Compute Engine. Thanks -Soumya

Re: Ways to partition the RDD

2014-08-13 Thread bdev
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node yarn-cluster.

Re: Support for ORC Table in Shark/Spark

2014-08-13 Thread vinay.kashyap
Thanks Michael for the info.

Re: Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi all, sorry for reposting this, in the hope of getting some clues. Best regards, Sonal, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Aug 13, 2014 at 3:53 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am trying to run and test some graph APIs