That is completely alright, as the system will make sure the work gets done.
My major concern is the data drop. Will using async stop data loss?
On Thu, May 21, 2015 at 4:55 PM, Tathagata Das t...@databricks.com wrote:
If you cannot push data as fast as you are generating it, then async isn't
Thanks for the JIRA. I will look into this issue.
TD
On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
I ran into one of the issues potentially caused by this
and have logged a JIRA bug -
https://issues.apache.org/jira/browse/SPARK-7788
I want to evaluate some different distance measures for time-space clustering,
so I need an API to implement my own distance function in Java.
2015-05-19 22:08 GMT+02:00 Xiangrui Meng men...@gmail.com:
Just curious, what distance measure do you need? -Xiangrui
On Mon, May 11, 2015 at 8:28 AM, Jaonary
Can you share the code? Maybe I or someone else can help you out.
Thanks
Best Regards
On Thu, May 21, 2015 at 1:45 PM, Allan Jie allanmcgr...@gmail.com wrote:
Hi,
I just checked the logs of the datanode; they look like this:
2015-05-20 11:42:14,605 INFO
If you cannot push data as fast as you are generating it, then async isn't
going to help either. The work is just going to keep piling up as many,
many async jobs, even though your batch processing times will be low, because that
processing time does not reflect how much of the overall work is pending
Hi there,
This question may seem to be kind of naïve, but what's the difference between
MEMORY_AND_DISK and MEMORY_AND_DISK_SER?
If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager won't
serialize the RDD?
Thanks,
Zhipeng
Thanks a lot for your help. Now I have split my project and it works.
2015-05-19 15:44 GMT+02:00 Alexander Alexandrov
alexander.s.alexand...@gmail.com:
Sorry, we're using a forked version which changed groupID.
2015-05-19 15:15 GMT+02:00 Till Rohrmann trohrm...@apache.org:
I guess it's a typo:
Would partitioning your data based on the key and then running
mapPartitions help?
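Something along these lines, perhaps (an untested sketch assuming a spark-shell sc and a pair RDD keyed by a Long timestamp; the data, partition count and per-record work are placeholders):

import org.apache.spark.HashPartitioner

val keyed = sc.parallelize(Seq((1432000000L, "a"), (1432000001L, "b"), (1432000000L, "c")))

val result = keyed
  .partitionBy(new HashPartitioner(8))   // route identical keys to the same partition
  .mapPartitions { iter =>
    // every record for a given key now sits in the same partition, so per-partition work is safe
    iter.map { case (ts, v) => (ts, v.toUpperCase) }
  }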
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, May 21, 2015 at 4:33 AM, roy rp...@njit.edu wrote:
I have a key-value RDD, key is a timestamp
Hi,
I was testing Spark reading data from Hive using HiveContext. I got the
following error when I used a simple query with constants in predicates.
I am using Spark 1.3. Has anyone encountered an error like this?
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
How many part files do you have? Did you try repartitioning to a smaller
number of partitions so that you end up with fewer, bigger files?
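Roughly something like this (an untested sketch; the paths and the partition count are placeholders):

val data = sc.textFile("hdfs:///input/many-small-parts")
data.coalesce(32)   // fewer, bigger output files; use repartition(32) if you want a full shuffle
    .saveAsTextFile("hdfs:///output/fewer-bigger-parts")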
Thanks
Best Regards
On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com
wrote:
Hi
I'm using spark 1.3.1 and now I can't set the size of
textFile does read all files in a directory.
We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
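For the simple cases, plain textFile should already do it; just a sketch, with placeholder paths:

val flat   = sc.textFile("s3n://my-bucket/logs/")      // every file directly under the prefix
val nested = sc.textFile("s3n://my-bucket/logs/*/*")   // one level of nesting via a glob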
Hi ,
I tried the workaround shared here, but I am still facing the same issue...
Thanks.
+user
If this was in cluster mode, you should provide a path on a shared file
system, e.g., HDFS, instead of a local path. If this is in local mode, I'm
not sure what went wrong.
On Wed, May 20, 2015 at 2:09 PM, Eric Tanner eric.tan...@justenough.com
wrote:
Here is the stack trace. Thanks for
This thread is from a year back. Can you please share what issue you are
facing? Which version of Spark are you using? What is your system
environment? Exception stack trace?
Thanks
Best Regards
On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com
wrote:
Hi ,
I had tried
Sure, the code is very simple. I think you can understand it from the main
function.
public class Test1 {
public static double[][] createBroadcastPoints(String localPointPath, int row, int col) throws IOException {
BufferedReader br = RAWF.reader(localPointPath);
String line = null;
int rowIndex
Hi,
Many thanks for the help. My Spark version is 1.3.0 too, and I run it on YARN.
Following your advice, I have changed the configuration. Now my program can
read hbase-site.xml correctly, and it can also authenticate with ZooKeeper
successfully.
But I have run into a new problem: my
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks
for reading and replying. However, I have a follow-up question:
I don't think I understand the block replication completely. Are the
blocks replicated immediately after they are received by the receiver? Or
are they
Never mind... I didn't realize you were referring to the first table as df.
So you want to do a join between the first table and an RDD?
The right way to do it within the data frame construct is to think of it as
a join: map the second RDD to a data frame and do an inner join on ip.
On Thu, May
Looks like somehow the file size reported by the FSInputDStream of
Tachyon's FileSystem interface is returning zero.
On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Just to follow up on this thread further.
I was doing some fault-tolerance testing
Thanks. I suspected that, but a df query inside a map sounds so
intuitive that I didn't want to just give up.
I've tried the join, even better with a DStream.transform(), and it works!
freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda (x,y):
y[1]))
Thank you for the help!
I don't need it to be 100% random. How about randomly picking a few partitions and
returning all docs in those partitions? Is
rdd.mapPartitionsWithIndex() the right method to use to process just a small
portion of the partitions?
Ningjun
-Original Message-
From: Sean Owen
From the performance and scalability standpoint, is it better to plug in, say,
a multi-threaded pipeliner into a Spark job, or to implement pipelining via
Spark's own transformation mechanisms such as map or filter?
I'm seeing some reference architectures where things like 'morphlines' are
I have not seen this error but have seen another user have weird parser
issues before:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E
I would attach a debugger and see what is going on -- if I'm looking
From the docs,
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence:
Storage Level: MEMORY_ONLY
Meaning: Store RDD as deserialized Java objects in
the JVM. If the RDD does not fit in memory, some partitions will not be
cached and will be recomputed on the fly each time they're
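In code the two levels look like this (a minimal, untested spark-shell sketch with placeholder data; an RDD can only have one storage level, hence the second call is commented out):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

rdd.persist(StorageLevel.MEMORY_AND_DISK)        // stored as deserialized Java objects
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // stored as serialized bytes: more compact, but CPU cost to deserialize on access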
Can someone provide an example of Spark Streaming using Java?
I have Cassandra running but have not configured Spark yet, and would like to
create a DStream.
Thanks
Hi. I have a Spark Streaming + Cassandra application which you can
probably borrow pretty easily.
You can always rewrite a part of it in Java if you need to, or else you
can just use Scala (see the blog post below if you want a Java-style dev
workflow with Scala using IntelliJ).
This application
Doesn't seem like a Cassandra-specific issue. Could you give us more
information (code, errors, stack traces)?
On Thu, May 21, 2015 at 1:33 PM, tshah77 tejasrs...@gmail.com wrote:
TD,
Do you have any example about reading from cassandra using spark streaming
in java?
I am trying to connect
I would like to know whether foreachPartition results in better
performance, due to a higher level of parallelism, compared to the foreach
method, considering the case in which I'm going through an RDD in order to
perform some sums into an accumulator variable.
Thank you,
Beniamino.
TD,
Do you have any example about reading from cassandra using spark streaming
in java?
I am trying to connect to Cassandra using Spark Streaming and it is throwing
an error saying it could not parse the master URL.
Thanks
Tejas
Are the worker nodes colocated with the HBase region servers?
Were you running as the hbase superuser?
You may need to login, using code similar to the following:
if (isSecurityEnabled()) {
SecurityUtil.login(conf, fileConfKey, principalConfKey, localhost);
}
SecurityUtil is
Hi, everybody.
There are some cases in which I can obtain the same results by using either the
mapPartitions or the foreach method.
For example, in a typical MapReduce approach one would perform a reduceByKey
immediately after a mapPartitions that transforms the original RDD into a
collection of tuples.
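For concreteness, the pattern I mean looks roughly like this (an illustrative, untested word-count-style sketch; the input path is a placeholder):

val lines = sc.textFile("hdfs:///input/data.txt")

val counts = lines
  .mapPartitions { iter => iter.flatMap(_.split("\\s+")).map(w => (w, 1L)) } // per-partition transformation into tuples
  .reduceByKey(_ + _)                                                        // cluster-wide reduce per key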
Hi Spark Devs and Users,
BerkeleyX and Databricks are currently developing
two Spark-related MOOCs on edX (intro
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x,
ml
https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x),
the first of which
What I found with CDH 5.4.1's Spark 1.3 is that the
spark.executor.extraClassPath setting is not working. I had to use
SPARK_CLASSPATH instead.
On Thursday, May 21, 2015, Ted Yu yuzhih...@gmail.com wrote:
Are the worker nodes colocated with HBase region servers ?
Were you running as hbase super user
This got resolved after cleaning /user/spark/applicationHistory/*
Hello, folks.
We just recently switched to using YARN on our cluster (when upgrading to
Cloudera 5.4.1).
I'm trying to run a Spark job from within a broader application (a web
service running on Jetty), so I can't just start it using spark-submit.
Does anyone know of an instructions page on how
Doesn't work for me so far.
I used the command below but got the following output. What should I check to fix the issue?
Any configuration parameters...?
[root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId
application_1426424283508_0048
15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service
Hi Spark Users Group,
I'm doing groupBy operations on my DataFrame df as follows, to get the
count for each value of col1:
df.groupBy("col1").agg("col1" -> "count").show // I don't know if I should write it like this.
col1  COUNT(col1#347)
aaa   2
bbb   4
ccc   4
...
and more...
As I'd like to
On Thu, May 21, 2015 at 4:17 PM, Howard Yang howardyang2...@gmail.com
wrote:
I followed
http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os
to build the latest version of Hadoop on my Windows machine,
and added the environment variable HADOOP_HOME and
Hi,
Thanks for the answer; I'll open a ticket.
In the meantime I have found a workaround. The recipe is the following:
1. Create a new account/group on all machines (let's call it sparkuser).
Run Spark from this account.
2. Add your user to the sparkuser group.
3. If you decide to write
df.groupBy($"col1").agg(count($"col1").as("c")).show
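If all you need is the per-value counts, I believe the built-in count on the grouped data is even shorter (untested, from memory of the 1.3 API):

df.groupBy("col1").count().show()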
On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu sliznmail...@gmail.com wrote:
Hi Spark Users Group,
I'm doing groupBy operations on my DataFrame df as follows, to get the
count for each value of col1:
df.groupBy("col1").agg("col1" -> "count").show // I
Hi,
is this fixed in master?
Grega
On Thu, May 14, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com
wrote:
End of the month is the target:
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh
Hello!
I just started with Spark. I have an application which counts words in a
file (1 MB file).
The file is stored locally. I loaded the file using native code and then
created the RDD from it.
JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);
Is there any other way to solve the problem? Let me state the use case.
I have an RDD[Document] containing over 7 million items. The RDD needs to be saved
to persistent storage (currently I save it as an object file on disk). Then I
need to get a small random sample of Document objects (e.g. 10,000
I guess the fundamental issue is that these aren't stored in a way
that allows random access to a Document.
Underneath, Hadoop has a concept of a MapFile which is like a
SequenceFile with an index of offsets into the file where records
begin. Although Spark doesn't use it, you could maybe create
Hi,
I have a JavaPairRDD<String, List<Integer>>, and here is an example of what I want to
get.
user_id  cat1  cat2  cat3  cat4
522      0     1     2     0
62       1     0     3     0
661      1     2     0     1
query: the users who have a number (except 0) in the cat1 and cat3 columns
answer: cat2 - 522,611 cat3-522,62 = user 522
How can I
Hey, I think I found the problem. It turns out that the file I saved is
too large.
On 21 May 2015 at 16:44, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you share the code? Maybe I or someone else can help you out.
Thanks
Best Regards
On Thu, May 21, 2015 at 1:45 PM, Allan Jie
After deserialization, something seems to be wrong with my pandas DataFrames.
It looks like the timezone information is lost, and subsequent errors ensue.
Serializing and deserializing a timezone-aware DataFrame tests just fine, so
it must be Spark that somehow changes the data.
My program runs
These are relevant:
JIRA: https://issues.apache.org/jira/browse/SPARK-6411
PR: https://github.com/apache/spark/pull/6250
On Thu, May 21, 2015 at 3:16 PM, Def_Os njde...@gmail.com wrote:
After deserialization, something seems to be wrong with my pandas DataFrames.
It looks like the timezone
I stumbled upon this thread, and I conjecture that this may affect restoring a
checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928
In my case I have 1600+ fragmented
Awesome,
Thanks a ton for helping us all and for planning ahead,
Much appreciated,
Regards,
Kartik
On May 21, 2015 4:41 PM, Marco Shaw marco.s...@gmail.com wrote:
Hi Spark Devs and Users,
BerkeleyX and Databricks are currently developing
two Spark-related MOOCs on edX
Could you try specifying PYSPARK_PYTHON as the path to the python in
your virtualenv? For example:
PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py
On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote:
Hi all,
I am running the Python process that communicates with
Hi all,
We are running Spark Streaming with version 1.1.1. Recently we found
an odd problem.
In stage 44554, all the tasks finished, but the stage was marked as finished
a long time later, as you can see in the log below; the last task finished at 15/05/21
21:17:36.
And also the
Hello,
New to Spark. I wanted to know if it is possible to use an RDD of LabeledPoint
in org.apache.spark.mllib.clustering.KMeans. After I cluster my data, I would
like to be able to identify which observations were grouped with each
centroid.
Thanks
You can predict and then zip the predictions with the points RDD to get roughly the same
information as the LabeledPoints.
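Roughly like this (an untested sketch assuming an existing RDD[LabeledPoint] called labeled; k and the iteration count are arbitrary):

import org.apache.spark.mllib.clustering.KMeans

val features = labeled.map(_.features)
val model    = KMeans.train(features, 5, 20)           // k = 5, 20 iterations

// pair each original labeled point with the cluster id it was assigned to
val clustered = labeled.zip(model.predict(features))   // RDD[(LabeledPoint, Int)]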
Cheers
k/
On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com
wrote:
Hello,
New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
in org.apache.spark.mllib.clustering.KMeans.
Or you can simply use `reduceByKeyLocally` if you don't want to worry about
implementing accumulators and such, assuming that the reduced values
will fit in the memory of the driver (which you are already assuming by using
accumulators).
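A minimal sketch of what that looks like (untested, with toy data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// sums per key and returns an ordinary Map on the driver -- no accumulators involved
val sums: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
// sums == Map(a -> 4, b -> 2)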
Best,
Burak
On Thu, May 21, 2015 at 2:46 PM, ben
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
When log aggregation isn't turned on, logs are retained locally on each
machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or
$HADOOP_HOME/logs/userlogs depending on the Hadoop version and
Hi Tathagata,
Thanks for looking into this. Investigating further, I found that the issue
is that Tachyon does not support file append. The streaming receiver, which
writes to the WAL, when it failed and was restarted was not able to append to the same
WAL file after the restart.
I raised this with the Tachyon user
Hi guys, I'm pretty new to LDA. I notice Spark 1.3.0 MLlib provides an EM-based
LDA implementation. It returns both the topics and the topic distributions.
My question is: how can I use these parameters to predict on a new document?
And I notice there is an online LDA implementation in the Spark master branch,
Your original code snippet seems incomplete, and there isn't enough
information to figure out what problem you actually ran into.
In your original code snippet there is an rdd variable which is well
defined and a df variable that is not defined in the snippet of code you
sent.
One way to make
If sampling whole partitions is sufficient (or a part of a partition),
sure, you could mapPartitionsWithIndex and decide whether to process a
partition at all based on its index, and skip the rest. That's much faster.
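Something like this, for instance (an untested sketch; the path and the sampling fraction are placeholders):

import scala.util.Random

val docs = sc.textFile("hdfs:///path/to/documents")

// pick roughly 1% of the partition indices up front, then keep those partitions wholesale
val chosen = Random.shuffle(docs.partitions.indices.toList)
  .take(docs.partitions.length / 100 max 1)
  .toSet

val sample = docs.mapPartitionsWithIndex { (idx, iter) =>
  if (chosen(idx)) iter else Iterator.empty
}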
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com
Hi,
I am using Spark 1.2.0. Can you suggest Docker containers which can be
deployed in production? I found a lot of Spark images on
https://registry.hub.docker.com/ but could not figure out which one to
use. None of them seems to be an official image.
Does anybody have any recommendation?
Thanks
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a
broadcast issue, it would throw the same error.
On 21 May 2015 14:21, allanjie allanmcgr...@gmail.com wrote:
Sure, the code is very simple. I think you can understand it from the main
function.
public class
Hi,
After restarting the Spark HistoryServer, it failed to come up. I checked the
logs for the Spark HistoryServer and found the following messages:
2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus:
Log path provided contains no log files.
2015-05-21 11:38:52,319 INFO
My 2 cents:
As per javadoc:
https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread)
Shutdown hooks should also finish their work quickly. When a program
invokes exit the expectation is that the virtual machine will promptly shut
down and exit. When the
Hi,
it looks like you are writing to a local filesystem. Could you try writing
to a location visible to all nodes (master and workers), e.g. an NFS share?
HTH,
Tomasz
On 21.05.2015 at 17:16, rroxanaioana wrote:
Hello!
I just started with Spark. I have an application which counts words in a
Seems like there might be a mismatch between your Spark jars and your
cluster's HDFS version. Make sure you're using the Spark jar that matches
the hadoop version of your cluster.
On Thu, May 21, 2015 at 8:48 AM, roy rp...@njit.edu wrote:
Hi,
After restarting Spark HistoryServer, it failed
I have a dataframe as a reference table for IP frequencies.
e.g.,
ip freq
10.226.93.67 1
10.226.93.69 1
161.168.251.101 4
10.236.70.2 1
161.168.251.105 14
All I need is to query the df in a map.
rdd = sc.parallelize(['208.51.22.18',
Hi Cody,
That is clear. Thanks!
Bill
On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger c...@koeninger.org wrote:
If you checkpoint, the job will start from the successfully consumed
offsets. If you don't checkpoint, by default it will start from the
highest available offset, and you will
So DataFrames, like RDDs, can only be accessed from the driver. If your IP
frequency table is small enough, you could collect it and distribute it as a
hashmap with broadcast, or you could also join your RDD with the IP
frequency table. Hope that helps :)
On Thursday, May 21, 2015, ping yan