Hi wangyi:
Do you have more detailed information? I guess it may be caused by a jar that hasn't been uploaded to the workers, such as the one containing your main class.
./bin/spark-class org.apache.spark.deploy.Client launch \
  [client-options] \
  <cluster-url> <application-jar-url> <main-class> \
  [application-options]
Hi,
Spark RuntimeException due to Unsupported datatype NullType, when saving null primitives in a jsonRDD with .saveAsParquetFile().
Code: I am trying to store a jsonRDD into a Parquet file using saveAsParquetFile with the code below.
JavaRDD<String> javaRDD =
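A minimal sketch in Scala of the flow being described, assuming a Spark 1.x build whose SQLContext provides jsonRDD; the object name, paths, and sample records are illustrative, not from the original message. A field that is null in every record is inferred as NullType, which Parquet cannot represent, and that is the likely source of the reported exception.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonToParquet"))
    val sqlContext = new SQLContext(sc)

    // "score" is null in every record, so no concrete type can be inferred for it.
    val jsonLines = sc.parallelize(Seq(
      """{"name": "alice", "score": null}""",
      """{"name": "bob",   "score": null}"""))

    val schemaRDD = sqlContext.jsonRDD(jsonLines)
    schemaRDD.printSchema()                                 // shows the NullType column
    schemaRDD.saveAsParquetFile("hdfs:///tmp/out.parquet")  // this is where NullType trips Parquet
    sc.stop()
  }
}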
If you're running on Ubuntu, run ulimit -n, which gives the max number of allowed open files. You will have to change the value in /etc/security/limits.conf to a larger value, then log out and log back in.
Thanks,
Ron
Sent from my iPad
On Aug 10, 2014, at 10:19 PM, Davies Liu
I want to run the query below, which I run in Impala calling a C++ UDF, in Spark SQL, where pnl_flat_pp and pfh_flat are both partitioned Impala tables. Can Spark SQL do that?
select a.pnl_type_code,percentile_udf_cloudera(cast(90.0 as
Hi,
I’m new to this mailing list as well as spark-streaming.
I'm using spark-streaming in a Cloudera environment to consume a Kafka source and store all the data into HDFS. There is a great volume of data, and our issue is that the Kafka consumer is going too fast for HDFS; it fills up the storage
The hive-thriftserver does not work with Parquet tables in the Hive metastore either; will this PR fix that too?
Is there no need to change any pom.xml?
Hi Yana,
I notice there is GC happening in every executor, around 400ms on average. Do you think it has a major impact on the overall query time?
And regarding the memory for a single worker, I have tried distributing the memory by increasing the number of workers per node and
No, Spark SQL cannot insert into or update text files yet; it can only insert into Parquet files. But
people.union(new_people).registerAsTable("people")
could be an idea.
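A minimal sketch (Spark 1.x SQL API; the case class, file name, and table name are illustrative) of that union-then-re-register idea:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)            // assumes an existing SparkContext sc
import sqlContext.createSchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))
val newPeople = sc.parallelize(Seq(Person("Carol", 35)))

// Re-register the combined data under the same table name; subsequent SQL
// queries then see both the old and the new rows.
people.union(newPeople).registerAsTable("people")
sqlContext.sql("SELECT name, age FROM people").collect().foreach(println)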
Hi,
We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it.
Our goal is something like this:
Step 1: consume Kafka
Step 2: archive to HDFS AND send to step 3
Step 3: transform data
Step 4: save transformed data to HDFS as input for
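A minimal sketch (Spark Streaming 1.x with the Kafka receiver API) of the four steps above; the ZooKeeper address, topic name, paths, batch interval, and the placeholder transform are all assumptions, not details from the thread.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaArchiveAndTransform {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaArchive"), Seconds(30))

    // Step 1: consume Kafka
    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "archive-group", Map("raw-topic" -> 4),
      StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

    // Step 2: archive the raw records to HDFS
    messages.saveAsTextFiles("hdfs:///archive/raw/batch")

    // Step 3: transform the same stream (placeholder transformation)
    val transformed = messages.map(_.toUpperCase)

    // Step 4: save the transformed data to HDFS
    transformed.saveAsTextFiles("hdfs:///archive/transformed/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}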
hi,
I have googled and found a similar question without a good answer:
http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark
In short, I would like to separate the raw data, divide it by some key, for example creation date, and put it in directories
Hi
I've installed Spark on a Windows 7 machine. I can get the Spark shell up and running, but when running through the simple example in Getting Started, I get the following error (tried running as administrator as well) - any ideas?
scala> val textFile = sc.textFile("README.md")
14/08/11 08:55:52
I didn't reply to the last part of your message:
My source is Kafka; Kafka already acts as a buffer with a lot of space.
So when I start my Spark job, there is a lot of data to catch up on (and it is critical not to lose any), but the Kafka consumer goes as fast as it can (and it's faster than my
All,
I have been searching the web for a few days looking for a definitive Spark/Spark Streaming RDD.pipe() example and cannot find one. Would it be possible to share with the group an example of both the Java/Scala side and the script (e.g. Python) side? Any help or response would be
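A minimal sketch (not from this thread) of RDD.pipe() in Scala, with the external script described in comments; the script name and how it reaches the executors are assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeExample"))
    val nums = sc.parallelize(1 to 10)

    // "double.py" is a hypothetical script, shipped to the executors' working
    // directory (e.g. with spark-submit --files double.py), that reads one
    // integer per line from stdin and prints its double to stdout:
    //   #!/usr/bin/env python
    //   import sys
    //   for line in sys.stdin:
    //       print(int(line) * 2)
    val doubled = nums.pipe("./double.py")   // each element is written to the script's stdin
    doubled.collect().foreach(println)       // the script's stdout lines come back as strings
    sc.stop()
  }
}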
Thanks Davies and Ron!
It was indeed due to the ulimit issue. Thanks a lot!
Best,
Baoqiang Cao
Blog: http://baoqiang.org
Email: bqcaom...@gmail.com
On Aug 11, 2014, at 3:08 AM, Ron Gonzalez zlgonza...@yahoo.com wrote:
If you're running on Ubuntu, do ulimit -n, which gives the max number of
Hi,
I ran a Spark application in local mode with the command:
$SPARK_HOME/bin/spark-submit --driver-memory 1g --class <class> <jar>
with master set to local.
After around 10 minutes of computing it started to slow down significantly: the next stage took around 50 minutes, and the one after that was only 80% done after 5 hours, and CPU
I'm using Spark 1.0.0
On Mon, Aug 11, 2014 at 4:14 PM, Grzegorz Białek
grzegorz.bia...@codilime.com wrote:
Hi,
I ran a Spark application in local mode with the command:
$SPARK_HOME/bin/spark-submit --driver-memory 1g --class <class> <jar>
with master set to local.
After around 10 minutes of computing it
I got the same exception after the streaming job had run for a while. The ERROR message complained about a temp file not being found in the output folder.
14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job
140774430 ms.0
java.io.FileNotFoundException: File
The exception was thrown in the application master (the Spark Streaming driver), and the job shut down after this exception.
On Mon, Aug 11, 2014 at 10:29 AM, Chen Song chen.song...@gmail.com wrote:
I got the same exception after the streaming job had run for a while. The ERROR message complained
Sharing/reusing RDDs is always useful for many use cases; is this possible by persisting an RDD on Tachyon?
For example, off-heap persisting a named RDD into a given path (instead of /tmp_spark_tachyon/spark-xxx-xxx-xxx),
or
saveAsParquetFile on Tachyon.
I tried to save a SchemaRDD on Tachyon:
val
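A minimal sketch (not the poster's code, which is cut off above) of the two approaches mentioned, assuming a Spark 1.x build with Tachyon support; the Tachyon URL, paths, and the Event case class are illustrative.

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

case class Event(id: Int, name: String)

val sqlContext = new SQLContext(sc)          // assumes an existing SparkContext sc
import sqlContext.createSchemaRDD

val events = sc.parallelize(1 to 100).map(i => Event(i, s"event-$i"))

// Off-heap persist: blocks land under spark.tachyonStore.baseDir
// (the /tmp_spark_tachyon/spark-... path mentioned above).
events.persist(StorageLevel.OFF_HEAP)
events.count()

// Alternatively, write Parquet directly to a Tachyon path so that other
// applications can reload it by a well-known name.
events.saveAsParquetFile("tachyon://tachyon-master:19998/shared/events.parquet")
val reloaded = sqlContext.parquetFile("tachyon://tachyon-master:19998/shared/events.parquet")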
Bill,
Did you get this resolved somehow? Does anyone have any insight into this problem?
Chen
On Mon, Aug 11, 2014 at 10:30 AM, Chen Song chen.song...@gmail.com wrote:
The exception was thrown in the application master (the Spark Streaming driver), and the job shut down after this exception.
On Mon,
I've also been seeing similar stack traces on Spark core (not Streaming) and have a theory that it's related to spark.speculation being turned on. Do you have that enabled, by chance?
On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote:
Bill
Did you get this resolved somehow?
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in the real application it would be tens to hundreds.
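The dummy example itself is cut off in this digest; the following is a minimal sketch of the kind of thing being described, with illustrative types and a placeholder per-element computation.

// Three key-value pairs; each value is an array of arrays.
val dataAll: Array[(String, Array[Array[Double]])] = Array(
  ("a", Array(Array(1.0, 2.0), Array(3.0))),
  ("b", Array(Array(4.0), Array(5.0, 6.0))),
  ("c", Array(Array(7.0, 8.0, 9.0))))

// One partition per element, so each worker task handles one (key, value) pair.
val result = sc.parallelize(dataAll, dataAll.length)
  .map { case (key, arrays) =>
    (key, arrays.map(_.sum).sum)   // placeholder computation per element
  }
  .collect()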
Reviving this discussion again...
I'm interested in using Spark as the engine for a web service.
The SparkContext and its RDDs only exist in the JVM that started it. While
RDDs are resilient, this means the context owner isn't resilient, so I may
be able to serve requests out of a single
I've seen this same exact problem too and I've been ignoring it, but I wonder if I'm losing data. Can anyone at least comment on this?
Not sure if this problem reached the Spark guys, because Nabble shows that "This post has NOT been accepted by the mailing list yet."
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-td10613.html#a11902
I'm resubmitting.
Greetings,
I'm currently
Andrew, that is a good finding.
Yes, I have speculative execution turned on, because I saw tasks stalled in the HDFS client.
If I turn off speculative execution, is there a way to circumvent the hanging-task issue?
On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash and...@andrewash.com wrote:
I've
Since you were using hql(...), it's probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks
On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs to be enabled explicitly with ./sbt/sbt -Phive-thriftserver assembly.
On Tue, Aug 5, 2014 at 4:54 AM, John Omernik j...@omernik.com wrote:
I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with
the
Hi Chen,
You need to set the max input split size so that the underlying hadoop
libraries will calculate the splits appropriately. I have done the
following successfully:
val job = new Job()
FileInputFormat.setMaxInputSplitSize(job, 12800L)
And then use job.getConfiguration when creating a
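A minimal sketch of the full recipe being described (the message is cut off above); the path, input format, and key/value classes are assumptions, not details from the original.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = new Job()
FileInputFormat.setMaxInputSplitSize(job, 12800L)

// Pass job.getConfiguration so the splits are computed with the limit above.
val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",              // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  job.getConfiguration)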
Hi Christopher,
I also need a single function call at the node level. Your suggestion makes sense as a solution to the requirement, but it still feels like a workaround; this check will get called on every row... Also, having static members and methods created specially on a
Hi All,
I read on the mailing list that a random forest implementation was on the roadmap. I wanted to check on its status. We are currently using Weka and would like to move over to MLlib for performance.
Look at: https://github.com/ooyala/spark-jobserver
On Mon, Aug 11, 2014 at 11:48 AM, coolfrood aara...@quantcast.com wrote:
Reviving this discussion again...
I'm interested in using Spark as the engine for a web service.
The SparkContext and its RDDs only exist in the JVM that started it.
Hi,
Thanks for the reference to the LBFGS optimizer.
I tried to use the LBFGS optimizer, but I am not able to pass it as an
input to the LogisticRegression model for binary classification. After
studying the code in mllib/classification/LogisticRegression.scala, it
appears that the only
I was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala).
Is there any need for all the variables to be vars and to have all these setters around? It just leads to so much clutter.
If you really want them to be vars, it is safe in Scala to make them public.
Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0)
It seems that Python jobs I'm sending to YARN show up as succeeded even if they failed... Am I doing something wrong, or is this a known issue?
Thanks,
Shay
Could you share which cluster manager you're using and exactly where the error shows up (driver or executor)?
A quick look reveals that Standalone and YARN use different options to control this, for example. (Maybe that in itself should be considered a bug.)
On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom
Not sure what your environment is, but this happened to me before because I had a couple of servlet-api jars in the path which did not match.
I was building a system that programmatically submitted jobs, so I had my own jars that conflicted with those of Spark. The solution is to run mvn
Hi Jeremy,
Thanks for the reply.
We got Spark running on our setup after a similar script was adapted to work with LSF.
Really appreciate your help.
Will keep in touch on Twitter.
Thanks, @sidkashyap :)
From: freeman.jer...@gmail.com
Subject: Re: Spark on an HPC setup
Date: Thu, 29 May 2014 00:37:54
I'm currently running on my local machine in standalone mode. The error shows up in my code when I am closing resources using TaskContext.addOnCompleteCallback. However, the cause of this error is a faulty classLoader, which must be created in the Executor's createClassLoader function.
Hey all,
Is there any kind of API to access the information about resources, executors, and applications in a standalone cluster that is displayed in the web UI?
Currently I'm using 1.0.x, but I'm interested in experimenting with the bleeding edge.
Thanks,
Wonha
We have an open-sourced Random Forest at Alpine Data Labs under the Apache license. We're also trying to get it merged into Spark MLlib now.
https://github.com/AlpineNow/alpineml
It's been tested a lot, and the accuracy and training-time benchmarks are great. There could be some bugs here and
Hi,
// Initialize the optimizer using logistic regression as the loss function with
L2 regularization
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
// Set the hyperparameters
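The message is cut off at this point; the following is a sketch (not the original code) of how the hyperparameters might be set, assuming the setter names of MLlib's LBFGS optimizer in the 1.x API:

lbfgs
  .setNumCorrections(10)       // history length for the L-BFGS approximation
  .setConvergenceTol(1e-4)
  .setMaxNumIterations(100)
  .setRegParam(0.1)

// The optimizer can then be run directly on (label, features) pairs:
// val weights = lbfgs.optimize(data, initialWeights)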
Hi Friends,
Any response on this? I looked into the documentation but could not find any information.
--Manoj
On Fri, Aug 8, 2014 at 6:56 AM, Manoj kumar manojkumarr2...@gmail.com
wrote:
Hi Team,
Do we have job ACLs for Spark similar to Hadoop job ACLs, where I can restrict who
Hi Jenny,
How's your metastore configured for both Hive and Spark SQL? Which
metastore mode are you using (based on
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
)?
Thanks,
Yin
On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao linlin200...@gmail.com wrote:
you can
Is speculative execution enabled?
Best,
Haoyuan
On Mon, Aug 11, 2014 at 8:08 AM, chutium teng@gmail.com wrote:
sharing /reusing RDDs is always useful for many use cases, is this possible
via persisting RDD on tachyon?
such as off heap persist a named RDD into a given path (instead
Thanks Yin!
Here is my hive-site.xml, which I copied from $HIVE_HOME/conf. I didn't experience any problem connecting to the metastore through Hive, which uses DB2 as the metastore database.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software
Hi,
On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
gwenhael.pasqui...@ericsson.com wrote:
We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it.
Our goal is something like this:
Step 1: consume Kafka
Step 2: archive to
In general (and I am prototyping), I have a better idea :)
- Consume Kafka in Spark from topic-A
- Transform the data in Spark (normalize, enrich, etc.)
- Feed it back to Kafka (into a different topic-B)
- Have Flume-to-HDFS (for M/R, Impala, Spark batch) or Spark Streaming or any other compute
I'm trying to apply KMeans training to some text data, which consists of lines that each contain between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case it
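A minimal sketch (illustrative only, not the poster's code) of this approach: build a word-to-index dictionary and encode each line as a sparse vector whose size is fixed by the dictionary, so every vector handed to KMeans has the same dimensionality.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val lines = sc.textFile("hdfs:///data/text")            // placeholder path
val tokenized = lines.map(_.split("\\s+").toSeq)

// Dictionary: every unique word gets a stable index; its size fixes the
// dimensionality of every vector (they must all have the same size).
val dictionary = tokenized.flatMap(identity).distinct().collect().zipWithIndex.toMap
val dim = dictionary.size

val vectors = tokenized.map { words =>
  val counts = words.groupBy(identity).map { case (w, ws) => (dictionary(w), ws.size.toDouble) }
  val (indices, values) = counts.toSeq.sortBy(_._1).unzip
  Vectors.sparse(dim, indices.toArray, values.toArray)
}.cache()

val model = KMeans.train(vectors, 10, 20)                // k = 10 clusters, 20 iterations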
Hi TD, I also fell into the trap of long lineage, and your suggestions do work well. But I don't understand why a long lineage can cause a stack overflow, and where it takes effect.
I am trying to train a KMeans model with sparse vectors on Spark 1.0.1.
When I run the training I get the following exception:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at
Hi all,
We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib will create a task for every partition at each iteration. But because our dimensionality is also very high, such a large number of tasks will
Hi
It may be a simple question, but I cannot figure out the most efficient way.
There is an RDD containing lists.
RDD
(
List(1,2,3,4,5)
List(6,7,8,9,10)
)
I want to transform this to
RDD
(
List(1,6)
List(2,7)
List(3,8)
List(4,9)
List(5,10)
)
And I want to achieve this without using collect
Try something like this.
scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val b = sc.parallelize(List(6,7,8,9,10))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at
How about increasing the HDFS block size? E.g., if the current value is 128M, we could make it 512M or bigger.
On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong dong...@gmail.com wrote:
Hi all,
We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS
Hi all,
Is it possible to use tables in ORC format in Shark 0.9.1 with Spark 0.9.2 and Hive 0.12.0?
I have tried creating the ORC table in Shark using the query below:
create table orc_table (x int, y string) stored as orc
The create table works, but when I try to insert values
hello:
I want to know: if I use historical data to train a model, how can I then use this model in another app?
Should I save the model to disk, and load it from disk when I use it? But I don't know how to save an MLlib model and reload it.
I would be very glad if
Have you solved this problem?
And could you share how to save a model to HDFS and reload it?
Thanks
XiaoQinyu
I think this has the same effect and the same issue as #1, right?
On Tue, Aug 12, 2014 at 1:08 PM, Jiusheng Chen chenjiush...@gmail.com
wrote:
How about increasing the HDFS block size? E.g., if the current value is 128M, we could make it 512M or bigger.
On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong
Not sure which stalled-HDFS-client issue you're referring to, but there was one fixed in Spark 1.0.2 that could help you out -- https://github.com/apache/spark/pull/1409. I've still seen one related to Configuration objects not being thread-safe, though, so you'd still need to keep speculation on to
Hi ssimanta,
The first line creates an RDD[Int], not an RDD[List[Int]].
In the case of List, I cannot zip all list elements in the RDD like a.zip(b), and I cannot use only Tuple2 because the real-world RDD has more List elements in the source RDD.
So I guess the expected result depends on the number of original Lists.
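A minimal sketch (not from the thread) of one way to transpose an RDD of equal-length Lists without collecting to the driver: tag every element with its position, group by position, then rebuild the lists. Note that the ordering of elements within each output list follows partition order and is not strictly guaranteed.

val rdd = sc.parallelize(Seq(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10)))

val transposed = rdd
  .flatMap(_.zipWithIndex)               // (value, columnIndex)
  .map { case (v, i) => (i, v) }
  .groupByKey()                           // all values sharing a column index
  .sortByKey()
  .map { case (_, vs) => vs.toList }

transposed.collect().foreach(println)     // List(1, 6), List(2, 7), ...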