You can use Tachyon-based storage for that, and every time the client queries, you just fetch it from there.
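A minimal sketch of that idea, assuming Spark 1.x with a Tachyon master at its default address and an existing SparkContext sc; the path, pair type, and id below are placeholders, not from this thread:

import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

// Assumes spark.tachyonStore.url points at the Tachyon master, e.g. tachyon://localhost:19998.
// Key the documents by id once, keep them off-heap (Tachyon-backed in Spark 1.x),
// and serve each client query with a lookup instead of a full join.
val docsById = sc.objectFile[(String, String)]("hdfs:///docs-by-id")  // (id, document)
  .persist(StorageLevel.OFF_HEAP)

val hits = docsById.lookup("some-document-id")  // Seq of matching documents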
Thanks
Best Regards
On Mon, Apr 6, 2015 at 6:01 PM, Siddharth Ubale siddharth.ub...@syncoms.com
wrote:
Hi ,
In the Spark web application, the RDD is generated every time the client is
You could try leaving all the configuration values at their defaults and running your application to see if you are still hitting the heap issue. If so, try adding swap space to the machines, which will definitely help. Another way would be to set the heap space manually (export _JAVA_OPTIONS=-Xmx5g).
When you say "done fetching documents", do you mean that you are stopping the streamingContext (ssc.stop), or that you completed fetching documents for a batch? If possible, you could paste your custom receiver code so that we can have a look at it.
Thanks
Best Regards
On Tue, Apr 7, 2015 at
Why are you not using sbin/start-all.sh?
Thanks
Best Regards
On Wed, Apr 8, 2015 at 10:24 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the
Hi,
If you have the same SparkContext, then you can cache the query result by caching the table (sqlContext.cacheTable("tableName")).
Maybe you can have a look at the Ooyala job server as well.
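A small sketch of the caching part, assuming Spark 1.3, an existing SparkContext sc, and a hypothetical table name "events":

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.cacheTable("events")  // materialized in memory on first use
val counts = sqlContext.sql("SELECT count(*) FROM events").collect()
// later queries against the same SQLContext reuse the cached table
sqlContext.uncacheTable("events")  // release it when no longer needed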
On Tue, Apr 14, 2015 at 11:36 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can use a tachyon based
One hack you can put in would be to bring the Result class
http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hbase/hbase/0.89.20100924-28/org/apache/hadoop/hbase/client/Result.java/?v=source
in locally, make it serializable (implements Serializable), and use that.
Just make sure you import the following:
import org.apache.spark.SparkContext._
import org.apache.spark.StreamingContext._
Thanks
Best Regards
On Wed, Apr 8, 2015 at 6:38 AM, Su She suhsheka...@gmail.com wrote:
Hello Everyone,
I am trying to implement this example (Spark Streaming with
I think what happened is that the narrowest possible common type was applied. Type widening is required, and the narrowest common type between a string and an int is string.
Hi,
In one of the applications we built (which had no clone stuff), we set the value of spark.storage.memoryFraction very low, and yes, that gave us performance benefits.
Regarding that issue, you should also look at the data you are trying to broadcast, as sometimes creating that data
If you want to use 2g of memory on each worker, you can simply export SPARK_WORKER_MEMORY=2g inside your spark-env.sh on every machine in the cluster.
Thanks
Best Regards
On Wed, Apr 8, 2015 at 7:27 AM, Jia Yu jia...@asu.edu wrote:
Hi guys,
Currently I am running Spark program on Amazon EC2.
Can you share a bit more information on the type of application that you are running? From the stack trace I can only say that, for some reason, your connection timed out (probably a GC pause or a network issue).
Thanks
Best Regards
On Wed, Apr 8, 2015 at 9:48 PM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi Team,
I am running the Spark word count example (https://github.com/sryza/simplesparkapp); if I go with master as local it works fine.
But when I change the master to yarn, it ends with retries connecting to the resource manager (stack trace below):
15/04/14 11:31:57 INFO RMProxy: Connecting
You can do this:
val strLen = udf((s: String) => s.length())
cleanProcessDF.withColumn("dii", strLen(col("di")))
(You might need to play with the type signature a little bit to get it to compile.)
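A slightly fuller sketch of the same idea, assuming Spark 1.3's DataFrame API; cleanProcessDF and the column names are taken from the question above:

import org.apache.spark.sql.functions.{col, udf}

val strLen = udf((s: String) => s.length)  // column-level string length
val withLen = cleanProcessDF.withColumn("dii", strLen(col("di")))
withLen.printSchema()  // "dii" should now show up as an int column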
On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi, I'm running into some
That totally depends on your disk IO and the number of CPUs that you have in the cluster. For example, if you have a disk IO of 100 MB/s and a handful of CPUs (say 40 cores on 10 machines), then it could take you to roughly 1 GB/s (10 machines x 100 MB/s), I believe.
Thanks
Best Regards
On Tue, Apr 7, 2015 at 2:48 AM,
That explains it. Thanks Reynold.
Justin
On Mon, Apr 13, 2015 at 11:26 PM, Reynold Xin r...@databricks.com wrote:
I think what happened is that the narrowest possible common type was applied. Type widening is required, and the narrowest common type between a string and an int is string.
We have a similar version (Sigstream); you can find more over here:
https://sigmoid.com/
Thanks
Best Regards
On Wed, Apr 8, 2015 at 9:25 AM, haopu hw...@qilinsoft.com wrote:
I'm also interested in this project. Do you have any update on it? Is it
still active?
Thank you!
--
View this
Where exactly is it throwing the null pointer exception? Are you starting your program from another program or something? It looks like you are invoking ProcessBuilder etc.
Thanks
Best Regards
On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya somnath_pand...@infosys.com
wrote:
JavaRDD&lt;String&gt;
Hello,
I tried to follow the Spark SQL tutorial, but I am not able to call saveAsParquetFile on an RDD of a case class.
Here is my Main.scala and build.sbt
https://gist.github.com/pishen/939cad3da612ec03249f
At line 34, the compiler says that value saveAsParquetFile is not a member of
OK, it does work.
Maybe it would be better to update this usage in the official Spark SQL tutorial:
http://spark.apache.org/docs/latest/sql-programming-guide.html
Thanks,
pishen
2015-04-14 15:30 GMT+08:00 fightf...@163.com fightf...@163.com:
Hi,there
If you want to use the saveAsParquetFile,
Hi,
I am trying to use Spark in combination with YARN with 3rd-party code which is unaware of distributed file systems. Providing HDFS file references thus does not work.
My idea to resolve this issue was the following:
Within a function I take the HDFS file reference I get as a parameter and
Hi guys
I have parquet data written by Impala:
Server version: impalad version 2.1.2-cdh5 RELEASE (build
36aad29cee85794ecc5225093c30b1e06ffb68d3)
When using Spark SQL 1.3.0 (spark-assembly-1.3.0-hadoop2.4.0) I get the
following error:
val correlatedEventData = sqlCtx.sql(
s
You can try something like this:
eventsDStream.foreachRDD(rdd => {
  val curdate = new DateTime()
  val fmt = DateTimeFormat.forPattern("dd_MM_")
  rdd.saveAsTextFile("s3n://bucket_name/test/events_" + fmt.print(curdate) + "/events")
})
Thanks
Best Regards
On Fri, Apr 10, 2015 at 4:22
Can you please share the data formats natively supported by Spark?
Two I can see are Parquet and text files:
sc.parquetFile
sc.textFile
I see that Hadoop input formats (Avro) are having issues, which I faced in earlier threads and which seem to be well known.
This usually means something didn't start due to a fairly low-level
error, like a class not found or incompatible Spark versions
somewhere. At least, that's also what I see in unit tests when things
like that go wrong.
On Tue, Apr 14, 2015 at 8:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
There's sc.objectFile also.
Thanks
Best Regards
On Tue, Apr 14, 2015 at 2:59 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Can you please share the data formats natively supported by Spark?
Two I can see are Parquet and text files:
sc.parquetFile
sc.textFile
I see that Hadoop
Sorry for the off-topic question; I have not found a specific MLlib forum.
Please advise a good overview of using clustering algorithms to group users according to their purchase and browsing history on a web site.
Thanks!
Hi guys
I have parquet data written by Impala:
Server version: impalad version 2.1.2-cdh5 RELEASE (build
36aad29cee85794ecc5225093c30b1e06ffb68d3)
When using Spark SQL 1.3.0 (spark-assembly-1.3.0-hadoop2.4.0) I get the
following error:
val correlatedEventData = sqlCtx.sql(
s
Dear All,
I would like to know if it's possible to configure SparkConf() in order to interact with a remote kerberized cluster in yarn-client mode.
Spark will not be installed on the cluster itself and the localhost can't ask for a ticket, but a keytab has been generated for this purpose and
Hi.
It doesn't work.
val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
file.repartition(127)
println(file.partitions.size.toString()) // Returns 27!
Regards
On Fri, Apr 10, 2015 at 4:50 PM, Felix C felixcheun...@hotmail.com wrote:
RDD.repartition(1000)?
Here is a related problem:
http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html
but no answer.
What I'm trying to do: wrap spark-history with an /etc/init.d script.
Problems I have: I can't make it read spark-defaults.conf.
I've put this file here:
If your localhost can't talk to a KDC, you can't access a kerberized cluster. A keytab file alone is not enough.
-Neal
On 4/14/15, 3:54 AM, philippe L lanckvrind.p@gmail.com wrote:
Dear All,
I would like to know if it's possible to configure SparkConf() in order to interact with a remote
Would you mind opening a JIRA for this?
I think your suspicion makes sense. Will have a look at this tomorrow.
Thanks for reporting!
Cheng
On 4/13/15 7:13 PM, zhangxiongfei wrote:
Hi experts,
I ran the code below in the Spark shell to access Parquet files in Tachyon.
1. First, created a DataFrame by
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically.
From the source code comments:
// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.
On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Does
That's true, spill dirs don't get cleaned up when something goes wrong. We are restarting long-running jobs once in a while for cleanups and have spark.cleaner.ttl set to a lower value than the default.
On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Right, I
I need some help to convert the date pattern in my Scala code for Spark 1.3. I
am reading the dates from two flat files having two different date formats.
File 1:
2015-03-27
File 2:
02-OCT-12
09-MAR-13
This format of file 2 is not being recognized by my Spark SQL when I am
comparing it in a
Your YARN access is not configured. 0.0.0.0:8032 is the default YARN address. I guess you don't have yarn-site.xml in your classpath.
-Neal
From: Vineet Mishra clearmido...@gmail.com
Date: Tuesday, April 14, 2015 at 12:05 AM
To:
Hi,
I've gotten an application working with sbt-assembly and Spark, so I thought I'd present an option. In my experience, trying to bundle any of the Spark libraries in your uber jar is going to be a major pain. There will be a lot of deduplication to work through, and even if you resolve it all, it can
Yes, I think the default Spark builds are on Scala 2.10. You need to
follow instructions at
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
to build 2.11 packages. -Xiangrui
On Mon, Apr 13, 2015 at 4:00 PM, Jay Katukuri jkatuk...@apple.com wrote:
Hi Xiangrui,
Hi Vadim,
After removing "provided" from org.apache.spark %% spark-streaming-kinesis-asl, I ended up with a huge number of deduplicate
errors:
https://gist.github.com/trienism/3d6f8d6b7ff5b7cead6a
It would be nice if you could share some pieces of your mergeStrategy code
for reference.
Also, after
Hi Twinkle,
To be completely honest, I'm not sure; I had never heard of spark.task.cpus before. But I could imagine two different use cases:
a) instead of just relying on Spark's creation of tasks for parallelism, a user wants to run multiple threads *within* a task. This is sort of going against
Spark SQL (which can also give you an RDD for use with the standard Spark RDD API) has support for JSON, Parquet, and Hive tables
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources.
There is also a library for Avro https://github.com/databricks/spark-avro.
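For example, a short sketch of reading those sources, assuming Spark 1.3, an existing SparkContext sc, and placeholder paths:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val jsonDF    = sqlContext.jsonFile("hdfs:///data/events.json")       // JSON
val parquetDF = sqlContext.parquetFile("hdfs:///data/events.parquet") // Parquet

// Every DataFrame also exposes the underlying RDD[Row] for the plain RDD API.
val rows = parquetDF.rdd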
On Tue, Apr 14,
Can you open a JIRA?
On Tue, Apr 14, 2015 at 1:56 AM, Clint McNeil cl...@impactradius.com
wrote:
Hi guys
I have parquet data written by Impala:
Server version: impalad version 2.1.2-cdh5 RELEASE (build
36aad29cee85794ecc5225093c30b1e06ffb68d3)
When using Spark SQL 1.3.0
There is an example here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
On Mon, Apr 13, 2015 at 6:07 PM, doovs...@sina.com wrote:
Hi all,
Who knows how to access PostgreSQL from Spark SQL? Do I need to add the postgresql dependency in build.sbt and set
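A minimal sketch following that example, assuming Spark 1.3 and a hypothetical PostgreSQL host and table; the PostgreSQL JDBC driver jar also has to be on the driver and executor classpaths:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val people = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "dbtable" -> "public.people"))

people.registerTempTable("people")
sqlContext.sql("SELECT count(*) FROM people").show()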
RDDs are immutable. Running .repartition does not change the RDD, but
instead returns *a new RDD *with more partitions.
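A minimal illustration of that point, assuming an existing SparkContext sc; the result of repartition has to be captured in a new reference:

val original = sc.parallelize(1 to 1000, 27)
val repartitioned = original.repartition(127)
println(original.partitions.size)       // still 27: the original RDD is unchanged
println(repartitioned.partitions.size)  // 127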
On Tue, Apr 14, 2015 at 3:59 AM, Masf masfwo...@gmail.com wrote:
Hi.
It doesn't work.
val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
Running on Spark 1.1.1, Hadoop 2.4 with YARN, AWS dedicated cluster (non-EMR).
Is this in our code or config? I've never run into a TaskResultLost; not sure what can cause that.
TaskResultLost (result lost from block manager)
nivea.m [11:01 AM]
collect at
> Also, it can be a problem when reusing the same SparkContext for many runs.
That is what happened to me. We use spark-jobserver and use one SparkContext for all jobs. The SPARK_LOCAL_DIRS is not cleaned up and is eating disk space quickly.
Ningjun
From: Marius Soutier
Hi
I am running a Spark Streaming application, but nothing is getting printed on the console.
I am doing:
1. bin/spark-shell --master clusterMgrUrl
2. import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
With Spark 1.3, xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2.
The config is CDH 5.3.0 (Hadoop 2.3) with Kerberos.
15/04/14 18:06:15 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0
(TID 17) on executor node1078.svc.devpg.pdx.wd: java.lang.SecurityException
I forgot to mention that the imageId field is a custom Scala object. Do I need to implement some special methods to make it work (equals, hashCode)?
On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Dear all,
In the latest version of spark there's a feature called :
Hi Tobias,
It should be possible to get an InputStream from an HDFS file. However, if
your libraries only work directly on files, then maybe that wouldn't work?
If that's the case and different tasks need different files, your way is
probably the best way. If all tasks need the same file, a
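A small sketch of getting an InputStream from an HDFS file inside a task, assuming the Hadoop configuration is reachable from the executors; the path is a placeholder:

import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def openHdfsFile(uri: String): InputStream = {
  val path = new Path(uri)
  val fs = path.getFileSystem(new Configuration())
  fs.open(path)  // FSDataInputStream, which is a java.io.InputStream
}

val in = openHdfsFile("hdfs://namenode:8020/data/model.bin")
try {
  // hand the stream to the third-party library here
} finally {
  in.close()
}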
Dear all,
In the latest version of Spark there's a feature called automatic partition discovery and schema migration for Parquet. As far as I know,
this gives the ability to split the DataFrame into several parquet files,
and by just loading the parent directory one can get the global schema of
Hi,
Is it possible to store RDDs using custom output formats, for example ORC?
Thanks,
Daniel
I have an RDD that contains millions of Document objects. Each document has a unique id that is a string. I need to find the documents by id quickly. Currently I use an RDD join, as follows.
First I save the RDD as an object file:
allDocs: RDD[Document] = getDocs() // this RDD contains 7 million
Right, I remember now, the only problematic case is when things go bad
and the cleaner is not executed.
Also, it can be a problem when reusing the same SparkContext for many runs.
Guillaume
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned
automatically. From the source code
Fundamentally, stream processing systems are designed for processing
streams of data, not for storing large volumes of data for a long period of
time. So if you have to maintain that much state for months, then it's best to use another system that is designed for long-term storage (like Cassandra).
Thanks guys. This might explain why I am having problems.
Vadim
On Tue, Apr 14, 2015 at 5:27 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Richard,
Your response was very helpful and actually resolved my issue. In case
others run into a similar issue, I followed the procedure:
Hmm, I don't know why IntelliJ is unhappy, but you can always fall back to getting the class from the String:
Class.forName("scala.reflect.ClassTag$$anon$1")
Perhaps the class is package private or something, and the REPL somehow subverts it ...
On Tue, Apr 14, 2015 at 5:44 PM, Arun Lists
What version of Spark are you using? There was a known bug which could be
causing this. It got fixed in Spark 1.3
TD
On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
When you say done fetching documents, does it mean that you are stopping
the streamingContext?
Hi Akhil,
I am running my program standalone. I am getting a null pointer exception when I run the Spark program locally and try to save my RDD as a text file.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, April 14, 2015 12:41 PM
To: Somnath Pandeya
Cc:
Thanks, yes. I was using Int for my V and didn't get the second param in
the second closure right :)
On Mon, Apr 13, 2015 at 1:55 PM, Dean Wampler deanwamp...@gmail.com wrote:
That appears to work, with a few changes to get the types correct:
input.distinct().combineByKey((s: String) => 1,
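For reference, a complete sketch of that pattern, under the assumption that the goal is counting distinct values per key; the input data and names are made up:

val input = sc.parallelize(Seq(("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")))

val distinctValueCounts = input.distinct().combineByKey(
  (s: String) => 1,                      // createCombiner: first value seen for a key
  (count: Int, s: String) => count + 1,  // mergeValue: another value for the same key
  (c1: Int, c2: Int) => c1 + c2)         // mergeCombiners: merge partial counts across partitions

distinctValueCounts.collect().foreach(println)  // e.g. (a,2), (b,1)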
Have you tried marking only spark-streaming-kinesis-asl as not provided,
and the rest as provided? Then you will not even need to add
kinesis-asl.jar in the spark-submit.
TD
On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Richard,
You response was very helpful
Could you see something like this in the console?
---
Time: 142905487 ms
---
Best Regards,
Shixiong(Ryan) Zhu
2015-04-15 2:11 GMT+08:00 Shushant Arora shushantaror...@gmail.com:
Hi
I am running a spark
Wow, it all works now! Thanks, Imran!
In case someone else finds this useful, here are the additional classes that I had to register (in addition to my application-specific classes):
val tuple3ArrayClass = classOf[Array[Tuple3[Any, Any, Any]]]
val anonClass =
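A sketch of how those registrations can be wired into the SparkConf, assuming Spark 1.2+ with spark.kryo.registrationRequired=true; the application-specific classes are omitted:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array[Class[_]](
    classOf[Array[Tuple3[Any, Any, Any]]],
    Class.forName("scala.reflect.ClassTag$$anon$1")
    // ... plus the application-specific classes
  ))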
Thank you Tathagata, very helpful answer.
Though, I would like to highlight that recent stream processing systems are trying to help users implement use cases that hold such large state (like 2 months of data). I would mention here Samza's state management
Hi guys,
Trying to use a Spark SQL context's .load("jdbc", ...) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however
as soon as we try and run on YARN we have problems.
We seem to be running into problems with the class path and loading up the
Just an update, tried with the old JdbcRDD and that worked fine.
From: Nathan
nathan.mccar...@quantium.com.au
Date: Wednesday, 15 April 2015 1:57 pm
To: user@spark.apache.org
I've changed it to
import sqlContext.implicits._
but it still doesn't work. (I've updated the gist)
BTW, using .toDF() does work, thanks for this information.
Regards,
pishen
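For anyone landing on this thread, a minimal sketch of the working pattern, assuming Spark 1.3, an existing SparkContext sc, and a made-up case class and path:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings rdd.toDF() into scope

val people = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))
people.toDF().saveAsParquetFile("hdfs:///tmp/people.parquet")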
2015-04-14 20:35 GMT+08:00 Todd Nist tsind...@gmail.com:
I think the docs are correct. If you follow the example from the
Hi Robert,
A lot of task metrics are already available for individual tasks. You can
get these programmatically by registering a SparkListener, and you can also view them in the UI. E.g., for each task, you can see runtime,
serialization time, amount of shuffle data read, etc. I'm working on
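A small sketch of the programmatic route, assuming Spark 1.3's listener API and an existing SparkContext sc; only a couple of the available metrics are printed:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"runTime=${metrics.executorRunTime} ms, gcTime=${metrics.jvmGCTime} ms")
    }
  }
})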
If you are using Scala/Java or
pyspark.mllib.classification.LogisticRegressionModel, you should be
able to call weights and intercept to get the model coefficients. If
you are using the pipeline API in Python, you can try
model._java_model.weights(); we are going to add a method to get the
weights
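On the Scala side, a short sketch, assuming spark.mllib's LogisticRegressionWithLBFGS, an existing SparkContext sc, and a toy training set:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))

val model = new LogisticRegressionWithLBFGS().run(training)
println(model.weights)    // one coefficient per feature
println(model.intercept)  // intercept term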
If you want to see an example that calls MLlib's FPGrowth, you can find one under the examples/ folder:
Scala:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala,
Java:
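A minimal FPGrowth sketch with toy data, assuming Spark 1.3's spark.mllib API, where freqItemsets is an RDD of (itemset, count) pairs:

import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c")))

val model = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(2)
  .run(transactions)

model.freqItemsets.collect().foreach { case (items, count) =>
  println(items.mkString("[", ",", "]") + " -> " + count)
}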
Hi Arun,
It can be hard to use Kryo with required registration because of issues like this -- there isn't a good way to register all the classes that you
need transitively. In this case, it looks like one of your classes has a
reference to a ClassTag, which in turn has a reference to some
Hi Shuai,
I don't think this is a bug with Kryo; it's just a subtlety of how Kryo works. I *think* that it would also work if you changed your PropertiesUtil class to either (a) remove the no-arg constructor or (b) instead of extending Properties, make it a contained member variable.
I wish
Hello,
I would like to know if there is a way of catching an exception thrown from an executor in the driver program. Here is an example:
try {
  val x = sc.parallelize(Seq(1,2,3)).map(e => e / 0).collect
} catch {
  case e: SparkException => {
    println(s"ERROR: $e")
    println(s"CAUSE:
If you're doing this in Scala, then you can probably just reference Joda-Time or the Java Date/Time classes. If you are using Spark SQL, then you can use the various Hive date functions for conversion.
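A sketch of the plain Joda-Time route for the two formats from the question (2015-03-27 and 02-OCT-12), assuming the default locale accepts the abbreviated month names:

import org.joda.time.format.DateTimeFormat

val isoFormat   = DateTimeFormat.forPattern("yyyy-MM-dd")
val abbrvFormat = DateTimeFormat.forPattern("dd-MMM-yy")

val d1 = isoFormat.parseLocalDate("2015-03-27")
val d2 = abbrvFormat.parseLocalDate("02-OCT-12")

// Normalize both to one canonical string before comparing in Spark SQL.
println(isoFormat.print(d2))  // 2012-10-02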
On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote:
I need some help to convert
I am a total newbie to Spark, so be kind.
I am looking for an example that implements the FP-Growth algorithm so I can better understand both the algorithm and Spark. The one example I found (on the spark.apache.org examples page) was incomplete.
Thanks,
Eric
Hey guys, could you please help me with a question I asked on
Stackoverflow:
https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
? I'll be really grateful for your help!
I'm also pasting the question below:
I'm trying to
Interesting, my gut instinct is the same as Sean's. I'd suggest debugging
this in plain old Scala first, without involving Spark. Even just in the Scala shell, create one of your Array[T], try calling .toSet and calling .distinct. If those aren't the same, then it's got nothing to do with Spark.
Shuffle write could be a good indication of skew, but it looks like the
task in question hasn't generated any shuffle write yet, because it's still
working on the shuffle-read side. So I wouldn't read too much into the
fact that the shuffle write is 0 for a task that is still running.
The
Hi,
I am training a model using the logistic regression algorithm in ML. I was wondering if there is any API to access the weight vector (a.k.a. the coefficients for each feature). I need those coefficients for real-time predictions.
Thanks,
Jianguo
Hi Imran,
Thanks for the response! However, I am still not there yet.
In the Scala interpreter, I can do:
scala> classOf[scala.reflect.ClassTag$$anon$1]
but when I try to do this in my program in IntelliJ, it indicates an error:
Cannot resolve symbol ClassTag$$anon$1
Hence I am not any closer
(+dev)
Hi Justin,
short answer: no, there is no way to do that.
I'm just guessing here, but I imagine this was done to eliminate
serialization problems (e.g., what if we got an error trying to serialize
the user exception to send from the executors back to the driver?).
Though, actually that
I think the docs are correct. If you follow the example from the docs and add the import shown below, I believe you will get what you're looking for:
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
You could also simply take your rdd and do the following:
Richard,
Your response was very helpful and actually resolved my issue. In case others run into a similar issue, I followed this procedure:
- Upgraded to Spark 1.3.0
- Marked all Spark-related libraries as provided
- Included Spark transitive library dependencies
where my build.sbt file
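A sketch of what that build.sbt shape can look like, with assumed artifact versions; the core Spark artifacts are marked provided (supplied by the cluster) while the Kinesis ASL jar is bundled into the assembly:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
)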
More info on why toDF is required:
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13
On Tue, Apr 14, 2015 at 6:55 AM, pishen tsai pishe...@gmail.com wrote:
I've changed it to
import sqlContext.implicits._
but it still doesn't work. (I've