Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Ji ZHANG
Hi, I'm currently migrating from Shark 0.9 to Spark SQL 1.2; my CDH version is 4.5, with Hive 0.11. I've managed to set up the Spark SQL Thrift server, and normal queries work fine, but custom UDFs are not usable. The symptom is that when executing CREATE TEMPORARY FUNCTION, the query hangs on a lock request:
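For context, the statement in question is the standard Hive UDF registration. A minimal sketch issued through a HiveContext (the function name and UDF class are hypothetical, and sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")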

what is the default log4j configuration passed to yarn container

2014-12-22 Thread Venkata ramana gollamudi
Hi, In the case of an MR task, the log4j configuration and container log folder for a container are explicitly set in the container launch context by org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties, i.e. from the MapReduce YARN client code rather than from YARN code itself. This is also visible from

Re: Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Cheng Lian
Hi Ji, Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive API/protocol compatibility issues. When interacting with Hive 0.11.x, connections and simple queries may succeed, but things may go crazy in unexpected corners (like UDF). Cheng On 12/22/14 4:15 PM, Ji ZHANG

Graceful shutdown in spark streaming

2014-12-22 Thread Jesper Lundgren
Hello all, I have a Spark Streaming application running in a standalone cluster (deployed with spark-submit --deploy-mode cluster). I am trying to add graceful shutdown functionality to this application, but I am not sure what the best practice for this is. Currently I am using this code:
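One commonly suggested pattern (a sketch, not the poster's code; the source and batch interval are illustrative) is to register a JVM shutdown hook that stops the StreamingContext gracefully, so data already received is processed before the process exits:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("streaming-app"))
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    sys.addShutdownHook {
      // Finish processing received batches, then stop the SparkContext as well.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    ssc.start()
    ssc.awaitTermination()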

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-22 Thread pradhandeep
Did you try running PageRank.scala instead of LiveJournalPageRank.scala? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20808.html Sent from the Apache Spark User List mailing list

Re: How to get list of edges between two Vertex ?

2014-12-22 Thread pradhandeep
Do you need the multiple edges, or can you get the work done with a single edge between two vertices? In my view, you can group the edges using groupEdges, which will group the same edges together. It may work because the messages passed between the vertices through the same edges (replicated)
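A minimal sketch of that suggestion with toy data (note that groupEdges only merges parallel edges that are co-located, hence the partitionBy):

    import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    // Collapse duplicate edges between the same pair of vertices, summing their attributes.
    val deduped = graph.partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _)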

Possible problems in packaging mlllib

2014-12-22 Thread shkesar
I am trying to run the twitter classifier https://github.com/databricks/reference-apps A NoClassDefFoundError pops up. I've checked the library and the HashingTF class file is there. Some Stack Overflow questions suggest it might be a problem with packaging the class. Exception in thread main

Re: Possible problems in packaging mlllib

2014-12-22 Thread Sean Owen
Are you using an old version of Spark? I think this appeared in 1.1. You don't usually package this class or MLlib, so your packaging probably is not relevant, but it has to be available at runtime on your cluster then. On Mon, Dec 22, 2014 at 10:16 AM, shkesar shubhamke...@live.com wrote: I am
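If the build does bundle Spark, a common arrangement (an sbt sketch under the assumption of an sbt build; versions are illustrative) is to mark the Spark artifacts as provided, so they are not packaged and the cluster's runtime copies are used instead:

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
    )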

Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi, Say we have 4 nodes with 2 cores each in standalone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark-submit with: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead,

Re: Using more cores on machines

2014-12-22 Thread Sean Owen
I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by:

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi Sean, Thanks for the response. It seems --num-executors is ignored. Specifying --num-executors 2 --executor-cores 2 is giving the app all 8 cores across 4 machines. -Ashic. From: so...@cloudera.com Date: Mon, 22 Dec 2014 10:57:31 + Subject: Re: Using more cores on machines To:

Re: java.sql.SQLException: No suitable driver found

2014-12-22 Thread Michael Orr
Here is a script I use to submit a directory of jar files. It assumes jar files are in target/dependency or lib/ DRIVER_PATH= DEPEND_PATH= if [ -d lib ]; then DRIVER_PATH=lib DEPEND_PATH=lib else DRIVER_PATH=target DEPEND_PATH=target/dependency fi DEPEND_JARS=log4j.properties for f in

Re: locality sensitive hashing for spark

2014-12-22 Thread Michael Orr
The implementation closely aligns with jaccard. It should be possible to swap out the hash functions to a family that is compatible with other distance measures. On Dec 22, 2014, at 1:16 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Looks interesting thanks for sharing. Does it

Re: S3 files , Spark job hungsup

2014-12-22 Thread Shuai Zheng
Is it possible that too many connections are open reading from S3 from one node? I had this issue before because I opened a few hundred files on S3 from one node. It just blocked without any error until timing out later. On Monday, December 22, 2014, durga durgak...@gmail.com wrote: Hi All, I

Can Spark SQL thrift server UI provide JOB kill operate or any REST API?

2014-12-22 Thread Xiaoyu Wang
Hello everyone! As the title says: I start the Spark SQL 1.2.0 Thrift server and use beeline to connect to the server and execute SQL. I want to kill one SQL job running in the Thrift server without killing the Thrift server itself. I set the property spark.ui.killEnabled=true in spark-defaults.conf, but in the UI, only

Re: Fetch Failure

2014-12-22 Thread steghe
Which version of spark are you running? It could be related to this https://issues.apache.org/jira/browse/SPARK-3633 fixed in 1.1.1 and 1.2.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html Sent from the Apache Spark

RE: Effects problems in logistic regression

2014-12-22 Thread Franco Barrientos
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me! From: Franco Barrientos [mailto:franco.barrien...@exalitica.com] Sent: Thursday, December 18, 2014 16:42 To: 'DB Tsai' CC: 'Sean Owen'; user@spark.apache.org Subject: RE: Effects problems in logistic regression
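For readers following along, a minimal sketch of the API that ended up working (toy data, not the original feature set):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(2.0, 1.5)),
      LabeledPoint(0.0, Vectors.dense(0.1, 0.3))))
    val model = new LogisticRegressionWithLBFGS().run(training)
    val prediction = model.predict(Vectors.dense(1.0, 1.0))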

Spark exception when sending message to akka actor

2014-12-22 Thread Priya Ch
Hi All, I have Akka remote actors running on 2 nodes. I submitted the Spark application from node1. In the Spark code, in one of the RDDs, I am sending a message to the actor running on node1. My Spark code is as follows: class ActorClient extends Actor with Serializable { import context._ val

Re: Effects problems in logistic regression

2014-12-22 Thread DB Tsai
Sounds great. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: Thanks again DB Tsai,

Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi Josh, I'm not looking to change the 1:1 ratio. What I'm trying to do is get both cores on two machines working, rather than one core on all four machines. With --total-executor-cores 4, I have 1 core per machine working for an app. I'm looking for something that'll let me use 2 cores per

Re: MLlib, classification label problem

2014-12-22 Thread Sean Owen
Yeah, it's mentioned in the doc: Note that, in the mathematical formulation in this guide, a training label y is denoted as either +1 (positive) or −1 (negative), which is convenient for the formulation. However, the negative label is represented by 0 in MLlib instead of −1, to be consistent with
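In other words, training examples should carry labels of 0.0 and 1.0 rather than -1/+1 (a small illustration):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val positive = LabeledPoint(1.0, Vectors.dense(1.0, 2.0))
    val negative = LabeledPoint(0.0, Vectors.dense(3.0, 4.0))  // 0.0, not -1.0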

Re: Using more cores on machines

2014-12-22 Thread Boromir Widas
If you are looking to reduce network traffic then setting spark.deploy.spreadOut to false may help. On Mon, Dec 22, 2014 at 11:44 AM, Ashic Mahtab as...@live.com wrote: Hi Josh, I'm not looking to change the 1:1 ratio. What I'm trying to do is get both cores on two machines working, rather

Re: S3 files , Spark job hungsup

2014-12-22 Thread durga katakam
Yes. I am reading thousands of files every hour. Is there any way I can tell Spark to time out? Thanks for your help. -D On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com wrote: Is it possible too many connections open to read from s3 from one node? I have this issue before

Re: custom python converter from HBase Result to tuple

2014-12-22 Thread Ted Yu
Which HBase version are you using? Can you show the full stack trace? Cheers On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, can anyone please give me some help with writing a custom converter of HBase data to (for example) tuples of ((family,

Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi Tim, That would be awesome. We have seen some really disparate Mesos allocations for our Spark Streaming jobs (like (7,4,1) over 3 executors for 4 Kafka consumers instead of the ideal (3,3,3,3)). For network-dependent consumers, achieving an even deployment would provide a reliable and

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-22 Thread Sadhan Sood
Thanks Cheng, Michael - that was super helpful. On Sun, Dec 21, 2014 at 7:27 AM, Cheng Lian lian.cs@gmail.com wrote: Would like to add that the compression schemes built into the in-memory columnar storage only support primitive columns (int, string, etc.); complex types like array, map and
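For reference, the knobs being discussed, shown as a sketch against Spark SQL 1.2 (the table name is hypothetical):

    // Enable compression for the in-memory columnar cache, then cache the table.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("events")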

Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
Hi Sean and Madhu, Thank you for the explanation. I really appreciate it. Best Regards, Jerry On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote: coalesce actually changes the number of partitions. Unless the original RDD had just 1 partition, coalesce(1) will make an RDD

MLLib beginner question

2014-12-22 Thread boci
Hi! I want to try out Spark MLlib in my Spark project, but I have a little problem. I have training data (an external file), but the real data comes from another RDD. How can I do that? I tried simply using the same SparkContext to build the RDD (first I create an RDD using sc.textFile() and after
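One way to wire that together (a sketch with hypothetical paths and a simple comma-separated input format): build both RDDs from the same SparkContext, train on the file-backed RDD, then apply the model to the other RDD.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    // Training data stored on disk in LIBSVM format.
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")
    val model = new LogisticRegressionWithLBFGS().run(training)

    // "Real" data arriving as another RDD of feature vectors.
    val realData = sc.textFile("hdfs:///data/incoming.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    val predictions = model.predict(realData)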

Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread Sean Owen
Just closing the loop -- FWIW this was indeed on purpose -- https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's not encouraged to depend on the REPL as a module. On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote: I'm only speculating, but I wonder if it was

Long-running job cleanup

2014-12-22 Thread Ganelin, Ilya
Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of Spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that
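For reference, that parameter is set on the SparkConf with a value in seconds (a sketch; note it forcibly forgets metadata and clears persisted RDDs older than the TTL):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("long-running-job")
      .set("spark.cleaner.ttl", "3600")  // clean up metadata and shuffles older than one hour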

Re: Spark in Standalone mode

2014-12-22 Thread durga
Please check the Spark version and Hadoop version in your Maven build as well as in your local Spark setup. If the Hadoop versions do not match you might get this issue. Thanks, -D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-Standalone-mode-tp20780p20815.html

Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread peng
Thanks a lot for pointing it out. I also found it in pom.xml. A new ticket for reverting it has been submitted: https://issues.apache.org/jira/browse/SPARK-4923 At first I assumed that further development on it had been moved to Databricks Cloud, but the JIRA ticket was already there in September.

Re: Announcing Spark Packages

2014-12-22 Thread peng
Me 2 :) On 12/22/2014 06:14 PM, Andrew Ash wrote: Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com

Re: Announcing Spark Packages

2014-12-22 Thread Hitesh Shah
Hello Xiangrui, If you have not already done so, you should look at http://www.apache.org/foundation/marks/#domains for the policy on use of ASF trademarked terms in domain names. thanks — Hitesh On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Dear Spark users and

Re: Can Spark SQL thrift server UI provide JOB kill operate or any REST API?

2014-12-22 Thread Michael Armbrust
I would expect that killing a stage would kill the whole job. Are you not seeing that happen? On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote: Hello everyone! Like the title. I start the Spark SQL 1.2.0 thrift server. Use beeline connect to the server to execute SQL.

Re: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Xiangrui Meng
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use LIBSVM format to specify my input feature vector, which used 1-based index. When I run regression the o/p is 0-indexed

Re: MLLib beginner question

2014-12-22 Thread Xiangrui Meng
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out Spark MLlib in my Spark project, but I have a little problem. I have training data (an external file), but the real data comes from another RDD.

RE: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Sameer Tilak
Hi, It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... This was the confusing part in the documentation: where the indices are one-based and in ascending order. After loading, the feature indices are
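A small illustration of the index shift being described (the path is hypothetical): on disk the indices are one-based, while the loaded vectors are zero-based.

    import org.apache.spark.mllib.util.MLUtils

    // A line such as "1 1:0.5 3:2.0" loads with label 1.0 and a sparse vector whose
    // non-zeros sit at positions 0 and 2 (zero-based), so model weights line up
    // with 0-based positions.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample.libsvm")
    data.take(1).foreach(println)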

Re: Announcing Spark Packages

2014-12-22 Thread Nicholas Chammas
Hitesh, From your link http://www.apache.org/foundation/marks/#domains: You may not use ASF trademarks such as “Apache” or “ApacheFoo” or “Foo” in your own domain names if that use would be likely to confuse a relevant consumer about the source of software or services provided through your

Re: Announcing Spark Packages

2014-12-22 Thread Nicholas Chammas
Okie doke! (I just assumed there was an issue since the policy was brought up.) On Mon Dec 22 2014 at 8:33:53 PM Patrick Wendell pwend...@gmail.com wrote: Hey Nick, I think Hitesh was just trying to be helpful and point out the policy - not necessarily saying there was an issue. We've taken

Re: spark streaming python + kafka

2014-12-22 Thread Davies Liu
There is a WIP pull request[1] working on this, it should be merged into master soon. [1] https://github.com/apache/spark/pull/3715 On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I've just seen that streaming spark supports python from 1.2 version.

Re: Who manage the log4j appender while running spark on yarn?

2014-12-22 Thread WangTaoTheTonic
After some discussions with the Hadoop guys, I now understand how the mechanism works. If we don't add -Dlog4j.configuration to the Java options of the container (AM or executors), it will use the log4j.properties (if any) found on the container's classpath (extraClasspath plus yarn.application.classpath). If we want to customize

Re: Who manage the log4j appender while running spark on yarn?

2014-12-22 Thread Marcelo Vanzin
If you don't specify your own log4j.properties, Spark will load the default one (from core/src/main/resources/org/apache/spark/log4j-defaults.properties, which ends up being packaged with the Spark assembly). You can easily override the config file if you want to, though; check the Debugging
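One way to do that override on YARN (a sketch using documented properties, equivalent to passing --files to spark-submit; the local path is hypothetical) is to ship a log4j.properties with the application and point the executor JVMs at it:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.dist.files", "file:///local/path/log4j.properties")
      .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")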

broadcasting object issue

2014-12-22 Thread Henry Hung
Hi All, I have a problem with broadcasting a serializable class object that is returned by another, non-serializable class. Here is the sample code: class A extends java.io.Serializable { def halo(): String = "halo" } class B { def getA() = new A } val list = List(1) val b = new B val a = b.getA

Re: custom python converter from HBase Result to tuple

2014-12-22 Thread Antony Mayi
Using HBase 0.98.6. There is no stack trace, just this short error. I just noticed it does the fallback to toString as in the message, as this is what I get back in Python: hbase_rdd.collect() [(u'key1', u'List(cf1:12345:14567890, cf2:123:14567896)')] So the question is why it falls back to
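A sketch of the kind of custom converter being asked about (class name and output shape are illustrative, not the thread's actual code): implement org.apache.spark.api.python.Converter so each HBase Result becomes a Java list of family:qualifier:value strings rather than falling back to toString, and pass the class name as the valueConverter of newAPIHadoopRDD on the Python side.

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.api.python.Converter

    import scala.collection.JavaConverters._

    class HBaseResultToStringListConverter extends Converter[Any, java.util.List[String]] {
      override def convert(obj: Any): java.util.List[String] = {
        val result = obj.asInstanceOf[Result]
        // One "family:qualifier:value" string per cell in the row.
        result.rawCells().map { cell =>
          Bytes.toString(CellUtil.cloneFamily(cell)) + ":" +
            Bytes.toString(CellUtil.cloneQualifier(cell)) + ":" +
            Bytes.toString(CellUtil.cloneValue(cell))
        }.toList.asJava
      }
    }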

Joins in Spark

2014-12-22 Thread Deep Pradhan
Hi, I have two RDDs, vertices and edges. Vertices is an RDD and edges is a pair RDD. I want to take a three-way join of these two. Joins work only when both RDDs are pair RDDs, right? So, how am I supposed to take a three-way join of these RDDs? Thank You

Re: broadcasting object issue

2014-12-22 Thread madhu phatak
Hi, Just ran your code on spark-shell. If you replace val bcA = sc.broadcast(a) with val bcA = sc.broadcast(new B().getA) it seems to work. Not sure why. On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote: Hi All, I have a problem with broadcasting a serialize

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)) The above gives you a pair RDD. After the join, make sure that you remove the superfluous null values. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have two

Joins in Spark

2014-12-22 Thread pradhandeep
Hi, I have two RDDs, vertices, which is an RDD, and edges, which is a pair RDD. I have to do a three-way join of these two. Joins work only when both the RDDs are pair RDDs, so how can we perform a three-way join of these RDDs? Thank You -- View this message in context:

Fwd: Joins in Spark

2014-12-22 Thread Deep Pradhan
This gives me two pair RDDs: one is the edges RDD and the other is the vertices RDD with each vertex padded with a null value. But I have to take a three-way join of these two RDDs, and I have only one common attribute in the two RDDs. How can I go about doing the three-way join?
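One way to express such a join with plain pair RDDs (a sketch with toy data: vertices as (id, attr), edges as (srcId, dstId)) is to join on the source id, re-key by the destination id, and join again:

    import org.apache.spark.SparkContext._

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L)))

    val joined = edges
      .join(vertices)                                            // (src, (dst, srcAttr))
      .map { case (src, (dst, srcAttr)) => (dst, (src, srcAttr)) }
      .join(vertices)                                            // (dst, ((src, srcAttr), dstAttr))
      .map { case (dst, ((src, srcAttr), dstAttr)) => (src, dst, srcAttr, dstAttr) }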

Consistent hashing of RDD row

2014-12-22 Thread lev
Hello, I have a process where I need to create a random number for each row in an RDD. That new RDD will be used in a few iterations, and it is necessary that the numbers do not change between iterations (i.e., if a partition gets evicted from the cache, the numbers of that partition will be
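A common approach (a sketch, not the poster's code): derive each partition's random seed from a fixed base seed plus the partition index, so a recomputed partition regenerates exactly the same numbers, provided the partition's contents and ordering are deterministic.

    import scala.util.Random

    val baseSeed = 42L
    val rdd = sc.parallelize(1 to 100, numSlices = 4)  // stand-in for the real data
    val withRandom = rdd.mapPartitionsWithIndex { (partIdx, rows) =>
      val rng = new Random(baseSeed + partIdx)
      rows.map(row => (row, rng.nextDouble()))
    }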

Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj
Michael, Thanks. Is this still turned off in the released 1.2? Is it possible to turn it on just to get an idea of how much of a difference it makes? -Jerry On 05/12/14 12:40 am, Michael Armbrust wrote: I'll add that some of our data formats will actually infer this sort of useful information