Hi,
I'm currently migrating from Shark 0.9 to Spark SQL 1.2; my CDH version
is 4.5, with Hive 0.11. I've managed to set up the Spark SQL Thrift
server, and normal queries work fine, but custom UDFs are not usable.
The symptom is that when executing CREATE TEMPORARY FUNCTION, the query
hangs on a lock request:
Hi,
In the case of an MR task, the log4j configuration and the container log
folder are explicitly set in the container launch context by
org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties, i.e.
from the MapReduce YARN client code, not from YARN itself.
This is also visible from
Hi Ji,
Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive
API/protocol compatibility issues. When interacting with Hive 0.11.x,
connections and simple queries may succeed, but things may go crazy in
unexpected corners (like UDFs).
Cheng
On 12/22/14 4:15 PM, Ji ZHANG
Hello all,
I have a Spark Streaming application running in a standalone cluster
(deployed with spark-submit --deploy-mode cluster). I am trying to add
graceful shutdown functionality to this application, but I am not sure
what the best practice for this is.
Currently I am using this code:
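A minimal sketch of the graceful-stop pattern (assuming a StreamingContext named ssc; the stop flags are the standard StreamingContext.stop API):

sys.addShutdownHook {
  // Let in-flight batches finish before tearing down the SparkContext too.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
ssc.start()
ssc.awaitTermination()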
Did you try running PageRank.scala instead of LiveJournalPageRank.scala?
Do you need the multiple edges, or can you get the work done with a single
edge between two vertices?
In my view, you can group the edges using groupEdges, which will group
the same edges together. It may work because the messages passed between
the vertices travel through the same (replicated) edges.
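A minimal sketch (assuming a Graph with Int edge attributes; note that groupEdges only merges edges that are co-located, hence the partitionBy call first):

import org.apache.spark.graphx._

// "graph" is an assumed Graph[_, Int]; merge parallel edges by summing weights.
val merged = graph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + b)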
I am trying to run the Twitter classifier from
https://github.com/databricks/reference-apps
A NoClassDefFoundError pops up. I've checked the library and the HashingTF
class file is there. Some Stack Overflow questions suggest this might be a
problem with how the class is packaged.
Exception in thread main
Are you using an old version of Spark? I think this appeared in 1.1.
You don't usually package this class or MLlib, so your packaging is
probably not relevant, but the class then has to be available at runtime
on your cluster.
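For example, a build.sbt sketch that keeps Spark and MLlib out of the application jar (artifact names and version assumed):

// Mark Spark artifacts "provided": they are not bundled into the app jar
// and must instead be available on the cluster at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
)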
On Mon, Dec 22, 2014 at 10:16 AM, shkesar shubhamke...@live.com wrote:
I am
Hi,
Say we have 4 nodes with 2 cores each in standalone mode. I'd like to dedicate
4 cores to a streaming application. I can do this via spark-submit with:
spark-submit --total-executor-cores 4
However, this assigns one core per machine. I would like to use 2 cores on 2
machines instead.
I think you want:
--num-executors 2 --executor-cores 2
On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Say we have 4 nodes with 2 cores each in standalone mode. I'd like to
dedicate 4 cores to a streaming application. I can do this via spark submit
by:
Hi Sean,
Thanks for the response.
It seems --num-executors is ignored. Specifying --num-executors 2
--executor-cores 2 is giving the app all 8 cores across 4 machines.
-Ashic.
From: so...@cloudera.com
Date: Mon, 22 Dec 2014 10:57:31 +
Subject: Re: Using more cores on machines
To:
Here is a script I use to submit a directory of jar files. It assumes the jar
files are in target/dependency or lib/:

DRIVER_PATH=
DEPEND_PATH=
if [ -d lib ]; then
  DRIVER_PATH=lib
  DEPEND_PATH=lib
else
  DRIVER_PATH=target
  DEPEND_PATH=target/dependency
fi
DEPEND_JARS=log4j.properties
# Build the comma-separated jar list (loop body reconstructed for illustration).
for f in "$DEPEND_PATH"/*.jar; do
  DEPEND_JARS="$DEPEND_JARS,$f"
done
The implementation closely aligns with Jaccard. It should be possible to swap
out the hash functions for a family that is compatible with other distance
measures.
On Dec 22, 2014, at 1:16 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
Looks interesting thanks for sharing.
Does it
Is it possible that too many connections are open reading from S3 from one
node? I had this issue before because I opened a few hundred files on S3
from one node. It just blocked itself without error until it timed out later.
On Monday, December 22, 2014, durga durgak...@gmail.com wrote:
Hi All,
I
Hello everyone!
As the title says: I start the Spark SQL 1.2.0 Thrift server and use beeline
to connect to it and execute SQL.
I want to kill one SQL job running in the Thrift server without killing the
Thrift server itself.
I set the property spark.ui.killEnabled=true in spark-defaults.conf.
But in the UI, only
Which version of Spark are you running?
It could be related to this
https://issues.apache.org/jira/browse/SPARK-3633
fixed in 1.1.1 and 1.2.0
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!
From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, December 18, 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression
Hi All,
I have Akka remote actors running on 2 nodes. I submitted the Spark
application from node1. In the Spark code, in one of the RDDs, I am sending
a message to the actor running on node1. My Spark code is as follows:
class ActorClient extends Actor with Serializable {
import context._
val
Sounds great.
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos
franco.barrien...@exalitica.com wrote:
Thanks again DB Tsai,
Hi,
After facing issues with the performance of some of our Spark Streaming
jobs, we invested quite some effort figuring out the factors that affect
the performance characteristics of a Streaming job. We defined an
empirical model that helps us reason about Streaming jobs and applied it to
tune
Hi Josh,
I'm not looking to change the 1:1 ratio.
What I'm trying to do is get both cores on two machines working, rather than
one core on all four machines. With --total-executor-cores 4, I have 1 core per
machine working for an app. I'm looking for something that'll let me use 2
cores per
Yeah, it's mentioned in the doc:
Note that, in the mathematical formulation in this guide, a training label
y is denoted as either +1 (positive) or −1 (negative), which is
convenient for the formulation. However, the negative label is
represented by 0 in MLlib instead of −1, to be consistent with
If you are looking to reduce network traffic then setting
spark.deploy.spreadOut
to false may help.
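For example (spark-defaults.conf on the standalone master; with spreading disabled, the master packs cores onto as few workers as possible):

# conf/spark-defaults.conf -- standalone mode
spark.deploy.spreadOut  false

Combined with --total-executor-cores 4, this should yield 2 cores on each of 2 workers rather than 1 core on each of 4.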
On Mon, Dec 22, 2014 at 11:44 AM, Ashic Mahtab as...@live.com wrote:
Hi Josh,
I'm not looking to change the 1:1 ratio.
What I'm trying to do is get both cores on two machines working, rather
Yes, I am reading thousands of files every hour. Is there any way I can
tell Spark to time out?
Thanks for your help.
-D
On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Is it possible that too many connections are open reading from S3 from one
node? I had this issue before
Which HBase version are you using ?
Can you show the full stack trace ?
Cheers
On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
can anyone please give me some help with writing a custom converter of HBase
data to (for example) tuples of ((family,
Hi Tim,
That would be awesome. We have seen some really disparate Mesos allocations
for our Spark Streaming jobs (like (7,4,1) over 3 executors for 4 Kafka
consumers instead of the ideal (3,3,3,3)).
For network-dependent consumers, achieving an even deployment would
provide a reliable and
Dear Spark users and developers,
I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install
Thanks Cheng, Michael - that was super helpful.
On Sun, Dec 21, 2014 at 7:27 AM, Cheng Lian lian.cs@gmail.com wrote:
Would like to add that the compression schemes built into the in-memory
columnar storage only support primitive column types (int, string, etc.);
complex types like array, map and
Hi Sean and Madhu,
Thank you for the explanation. I really appreciate it.
Best Regards,
Jerry
On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote:
coalesce actually changes the number of partitions. Unless the
original RDD had just 1 partition, coalesce(1) will make an RDD
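A quick illustration (names assumed):

val rdd = sc.parallelize(1 to 8, 4)  // 4 partitions
rdd.coalesce(1).partitions.length    // => 1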
Hi!
I want to try out Spark MLlib in my Spark project, but I have a little
problem. I have training data (an external file), but the real data comes
from another RDD. How can I do that?
I tried simply using the same SparkContext to build the RDD (first I create
an RDD using sc.textFile() and after
Just closing the loop -- FWIW this was indeed on purpose --
https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's
not encouraged to depend on the REPL as a module.
On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote:
I'm only speculating, but I wonder if it was
Hi all, I have a long-running job iterating over a huge dataset. Parts of
this operation are cached. Since the job runs for so long, the overhead of
Spark shuffles eventually accumulates, culminating in the driver starting to
swap.
I am aware of the spark.cleaner.ttl parameter that
Please check the Spark version and Hadoop version in your mvn build as well
as in your local Spark setup. If the Hadoop versions don't match, you might
get this issue.
Thanks,
-D
Thanks a lot for pointing it out. I also found it in pom.xml.
A new ticket for reverting it has been submitted:
https://issues.apache.org/jira/browse/SPARK-4923
At first I assumed that further development on it had moved to Databricks
Cloud, but the JIRA ticket was already there in September.
Me 2 :)
On 12/22/2014 06:14 PM, Andrew Ash wrote:
Hi Xiangrui,
That link is currently returning a 503 Over Quota error message.
Would you mind pinging back out when the page is back up?
Thanks!
Andrew
On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com
Hello Xiangrui,
If you have not already done so, you should look at
http://www.apache.org/foundation/marks/#domains for the policy on use of ASF
trademarked terms in domain names.
thanks
— Hitesh
On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:
Dear Spark users and
I would expect that killing a stage would kill the whole job. Are you not
seeing that happen?
On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote:
Hello everyone!
Like the title.
I start the Spark SQL 1.2.0 thrift server. Use beeline connect to the
server to execute SQL.
Did you check the indices in the LIBSVM data and the master file? Do
they match? -Xiangrui
On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I use the LIBSVM format to specify my input feature vectors, which uses
1-based indices. When I run regression, the output is 0-indexed
How big is the dataset you want to use in prediction? -Xiangrui
On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote:
Hi!
I want to try out Spark MLlib in my Spark project, but I have a little
problem. I have training data (an external file), but the real data comes
from another RDD.
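For what it's worth, a minimal sketch of that pattern (the file path and RDD names are assumed; LogisticRegressionWithLBFGS as mentioned earlier in this digest):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Train on the external LIBSVM-format file...
val training = MLUtils.loadLibSVMFile(sc, "data/training.libsvm")
val model = new LogisticRegressionWithLBFGS().run(training)

// ...then predict on feature vectors built from the other RDD.
// "realData" is an assumed RDD[Vector] derived from that RDD.
def predictOn(realData: RDD[Vector]) = model.predict(realData)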
Hi,
It is a text format in which each line represents a labeled sparse feature
vector using the following format:
label index1:value1 index2:value2 ...
This was the confusing part in the documentation:
where the indices are one-based and in ascending order. After loading, the
feature indices are
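To illustrate the conversion (a sketch; the file contents are hypothetical):

import org.apache.spark.mllib.util.MLUtils

// Suppose sample.libsvm contains the single line: 1.0 1:0.5 3:0.2
// (one-based indices 1 and 3).
val data = MLUtils.loadLibSVMFile(sc, "sample.libsvm")
data.first().features  // sparse vector with zero-based indices 0 and 2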
Hitesh,
From your link http://www.apache.org/foundation/marks/#domains:
You may not use ASF trademarks such as “Apache” or “ApacheFoo” or “Foo” in
your own domain names if that use would be likely to confuse a relevant
consumer about the source of software or services provided through your
Okie doke! (I just assumed there was an issue since the policy was brought
up.)
On Mon Dec 22 2014 at 8:33:53 PM Patrick Wendell pwend...@gmail.com wrote:
Hey Nick,
I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken
There is a WIP pull request [1] working on this; it should be merged
into master soon.
[1] https://github.com/apache/spark/pull/3715
On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I've just seen that streaming spark supports python from 1.2 version.
After some discussions with the Hadoop guys, I understand how the mechanism
works.
If we don't add -Dlog4j.configuration to the Java options of a container (AM
or executors), it will use the log4j.properties (if any) on the container's
classpath (extraClassPath plus yarn.application.classpath).
If we want to customize
If you don't specify your own log4j.properties, Spark will load the
default one (from
core/src/main/resources/org/apache/spark/log4j-defaults.properties,
which ends up being packaged with the Spark assembly).
You can easily override the config file if you want to, though; check
the Debugging
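For example, on YARN one documented route is to ship a custom file with --files and point both JVMs at it (the file name is assumed):

spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  ...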
Hi All,
I have a problem with broadcasting a serializable class object that is
returned by another, non-serializable class. Here is the sample code:
class A extends java.io.Serializable {
  def halo(): String = "halo"
}
class B {
  def getA() = new A
}
val list = List(1)
val b = new B
val a = b.getA
Using HBase 0.98.6.
There is no stack trace, just this short error.
I just noticed it does the fallback to toString, as the message says; this is
what I get back in Python:
hbase_rdd.collect()
[(u'key1', u'List(cf1:12345:14567890, cf2:123:14567896)')]
So the question is why it falls back to
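For comparison, a minimal custom converter sketch (assuming Spark's org.apache.spark.api.python.Converter trait and the HBase 0.98 client API; the class name is illustrative):

import org.apache.spark.api.python.Converter
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Turn each HBase Result into a list of "family:qualifier=value" strings
// instead of relying on the default toString fallback.
class HBaseResultToStringListConverter extends Converter[Any, java.util.List[String]] {
  override def convert(obj: Any): java.util.List[String] = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map { cell =>
      Bytes.toString(CellUtil.cloneFamily(cell)) + ":" +
        Bytes.toString(CellUtil.cloneQualifier(cell)) + "=" +
        Bytes.toString(CellUtil.cloneValue(cell))
    }.toList.asJava
  }
}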
Hi,
I have two RDDs, vertices and edges. Vertices is a plain RDD and edges is a
pair RDD. I want to take a three-way join of these two. Joins work only when
both RDDs are pair RDDs, right? So how am I supposed to take a three-way
join of these RDDs?
Thank You
Hi,
Just ran your code on spark-shell. If you replace
val bcA = sc.broadcast(a)
with
val bcA = sc.broadcast(new B().getA)
it seems to work. Not sure why.
On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote:
Hi All,
I have a problem with broadcasting a serializable
Hi,
You can map your vertices RDD as follows:
val pairVertices = verticesRDD.map(vertex => (vertex, null))
The above gives you a pair RDD. After the join, make sure that you remove
the superfluous null values.
On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I have two
Hi,
I have two RDDs: vertices, which is a plain RDD, and edges, which is a pair
RDD. I have to do a three-way join of these two. Joins work only when both
RDDs are pair RDDs, so how can we perform a three-way join of these RDDs?
Thank You
This gives me two pair RDDs: one is the edgesRDD and the other is the
verticesRDD with each vertex padded with a null value. But I have to take a
three-way join of these two RDDs, and I have only one common attribute
between them. How can I go about doing the three-way join?
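A sketch of one way to do it with plain RDD joins (assuming edges is an RDD[(Long, Long)] of (srcId, dstId) and vertices is an RDD[(Long, String)] of (id, attr); the attribute type is arbitrary):

import org.apache.spark.SparkContext._  // pair-RDD functions (needed pre-1.3)

// Join on the source id, re-key by destination id, join again.
val withSrc = edges.join(vertices)  // (src, (dst, srcAttr))
  .map { case (src, (dst, srcAttr)) => (dst, (src, srcAttr)) }
val triplets = withSrc.join(vertices)  // (dst, ((src, srcAttr), dstAttr))
  .map { case (dst, ((src, srcAttr), dstAttr)) => (src, dst, srcAttr, dstAttr) }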
Hello,
I have a process where I need to create a random number for each row in an
RDD.
That new RDD will be used in a few iterations, and it is necessary that the
numbers won't change between iterations
(i.e., if a partition gets evicted from the cache, the numbers of that
partition will be
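One common way to make this robust (a sketch; the base seed is arbitrary) is a deterministic per-partition seed, so a recomputed partition regenerates exactly the same numbers:

import scala.util.Random

val seed = 42L  // assumed base seed
val withRandom = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  val rng = new Random(seed + partitionIndex)
  iter.map(row => (row, rng.nextDouble()))
}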
Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to
turn it on just to get an idea of how much of a difference it makes?
-Jerry
On 05/12/14 12:40 am, Michael Armbrust wrote:
I'll add that some of our data formats will actually infer this sort of
useful information