That's awesome Yan. I was considering Phoenix for SQL calls to HBase since
Cassandra supports CQL but HBase QL support was lacking. I will get back to
you as I start using it on our loads.
I am assuming the latencies won't be much different from accessing HBase
through tsdb asynchbase as that's
Hi,
I have the following code that uses checkpointing to checkpoint the heavy ops, which
works well in that the last heavyOpRDD.foreach(println) does not recompute from
the beginning.
But when I re-run this program, the RDD computing chain is recomputed from
the beginning; I thought that it would
I am using pyspark and I want to test the sql function. I get this Java
tree error. Any ideas?
iwaggDF.registerTempTable('iwagg')
hierDF.registerTempTable('hier')
res3=sqlc.sql('select name, sum(amount) as amount from iwagg a left
join hier b on a.segm=b.segm group by name order by sum(amount)
Hi,
I’ve posted this question to stackoverflow.com here
http://stackoverflow.com/questions/31534458/failing-integration-test-for-apache-spark-streaming
but it’s not getting any responses.
I've been trying to track down an issue with some unit/integration tests
I've been writing for a Spark
That happens when your batch duration is less than your processing time; you
need to set the StorageLevel to MEMORY_AND_DISK. If you are using the latest
version of Spark and you are just exploring things, then you can go with
the Kafka consumers that come with Spark itself. You will not have this
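For reference, a minimal sketch of passing that storage level to a receiver-based Kafka stream (the ZooKeeper quorum, group id, and topic map here are hypothetical values):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// the last argument sets the receiver's storage level
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group",
  Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK)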
Yeah, got it.
Thanks Akhil.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, July 28, 2015 2:47 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client
That happens when you batch duration is less than your processing
Hi Ayan, thanks for the reply.
It's around 5 GB across 10 tables... this data changes very frequently, with a
few updates every minute.
It's difficult to keep this data in Spark; if any updates happen on the main
tables, how can I refresh the Spark data?
On 28 July 2015 at 02:11, ayan guha guha.a...@gmail.com wrote:
Hi, I recompiled and retried; now it's looking like this with s3a:
com.amazonaws.AmazonClientException: Unable to load AWS credentials
from any provider in the chain
S3n is working fine (the only problem is still the endpoint)
Hello all,
I am currently having an error with Spark SQL accessing Elasticsearch using the
Elasticsearch Spark integration. Below is the series of commands I issued
along with the stacktrace. I am unclear what the error could mean. I can
print the schema correctly but error out if I try and display a
Put a try-catch inside your code and, inside the catch, print out the length
or the list itself that causes the ArrayIndexOutOfBounds. It might happen
that some of your data is not proper.
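A rough sketch of that debugging pattern, assuming comma-delimited records (the field layout is hypothetical):

rdd.foreach { line =>
  val fields = line.split(",")
  try {
    val first = fields(0)  // the access that can throw
    // ... normal processing of the record ...
  } catch {
    case e: ArrayIndexOutOfBoundsException =>
      println(s"Bad record with ${fields.length} fields: $line")
  }
}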
Thanks
Best Regards
On Mon, Jul 27, 2015 at 8:24 PM, Manohar753 manohar.re...@happiestminds.com
wrote:
Hi
Hi Akhil,
Thanks for the reply. I found the root cause but don't know how to solve it.
Below is the cause: this map function is not executing, and because of this
all my list fields are empty.
Please let me know what might be the cause of this snippet of code not
executing. The below map is
I am trying to do that, but there will always be a data mismatch, since by the
time Sqoop is fetching, the main database will get so many updates. There is
something called incremental data fetch using Sqoop, but that hits the
database rather than reading the WAL edits.
On 28 July 2015 at 02:52,
You need to find the bottleneck here; it could be your network (if the data is
huge) or your producer code isn't pushing at 20k/s. If you are able to
produce at 20k/s then make sure you are able to receive at that rate (try
it without Spark).
Thanks
Best Regards
On Sat, Jul 25, 2015 at 3:29 PM,
Hi,
I would like some recommendations on using the Spark-Cassandra connector's
DataFrame feature.
I was trying to save a DataFrame containing 8 million rows to Cassandra through
the Spark-Cassandra connector. Based on the Spark log, this single action took
about 60 minutes to complete. I think it
Hi Haripriya,
Your master has registered its public IP as 127.0.0.1:5050, which won't
be reachable by the slave node.
If Mesos didn't pick up the right IP you can specify one yourself via the
--ip flag.
Tim
On Mon, Jul 27, 2015 at 8:32 PM, Haripriya Ayyalasomayajula
Hi All,
I was wondering why the recommended number for parallelism was 2-3 times
the number of cores on your cluster.
Is the heuristic explained in any of the Spark papers? Or is it more of an
agreed upon rule of thumb?
Thanks,
Rahul P
Thanks Owen. The problem under Cygwin is that running spark-submit under
1.4.0 simply reports:
Error: Could not find or load main class org.apache.spark.launcher.Main
This is because under Cygwin spark-class makes the LAUNCH_CLASSPATH
Thanks Akhil, that solved it, but below is the new stack trace.
Don't feel bad, I am looking into it, but if it is at your fingertips please
15/07/28 09:03:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0
(TID 77, ip-10-252-7-70.us-west-2.compute.internal): java.lang.Exception: Could
With s3n try this out:
s3service.s3-endpoint: The host name of the S3 service. You should only
ever change this value from the default if you need to contact an
alternative S3 endpoint for testing purposes.
Default: s3.amazonaws.com
Thanks
Best Regards
On Tue, Jul 28, 2015 at 1:54 PM, Schmirr
Did you try it with just (comment out line 27):
println "Count of spark: " + file.filter({ s -> s.contains('spark') }).count()
Thanks
Best Regards
On Sun, Jul 26, 2015 at 12:43 AM, tog guillaume.all...@gmail.com wrote:
Hi
I have been using Spark for quite some time using either Scala or Python.
NO!
On Tue, Jul 28, 2015 at 5:03 PM, Harshvardhan Chauhan ha...@gumgum.com
wrote:
--
*Harshvardhan Chauhan* | Software Engineer
*GumGum* http://www.gumgum.com/ | *Ads that stick*
310-260-9666 | ha...@gumgum.com
Can you run the windows batch files (e.g. spark-submit.cmd) from the cygwin
shell?
On Tue, Jul 28, 2015 at 7:26 PM, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Owen
Add back the cygwin classpath detection can pass the issue mentioned
before, but there seems lack of further support in the
Hi
I do not think the OP asks about attempt failure, but about stage failure
finally leading to job failure. In that case, the RDD info from the last run
is gone even if it was cached, isn't it?
Ayan
On 29 Jul 2015 07:01, Tathagata Das t...@databricks.com wrote:
If you are using the same RDDs in the both the
@Das, Is there any way to identify a Kafka topic when we have a unified
stream? As of now, for each topic I create a dedicated DStream and use
foreachRDD on each of these streams. If I have, say, 100 Kafka topics, then
how can I use a unified stream and still take topic-specific actions inside
foreachRDD?
Thank you Tathagata. My main use case for the 500 streams is to append new
elements into their corresponding Spark SQL tables. Every stream is mapped
to a table, so I'd like to use the streams to append the new RDDs to the
table. If I union all the streams, appending new elements becomes a
That's great! Thanks
On Tuesday, July 28, 2015, Ted Yu yuzhih...@gmail.com wrote:
If I understand correctly, there would be one value in the executor.
Cheers
On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney jcove...@gmail.com
Hi guys,
A job hung for about 16 hours when I ran the random forest algorithm; I don't
know why that happened.
I use Spark 1.4.0 on YARN and here is the code:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n24047/1.png
and the following picture is from the Spark UI.
Hi, Owen
Adding back the Cygwin classpath detection gets past the issue mentioned
before, but there seems to be a lack of further support in the launch lib;
see the stacktrace below:
LAUNCH_CLASSPATH:
C:\spark-1.4.0-bin-hadoop2.3\lib\spark-assembly-1.4.0-hadoop2.3.0.jar
java -cp
Thanks Vanzin, spark-submit.cmd works
Thanks
Proust
From: Marcelo Vanzin van...@cloudera.com
To: Proust GZ Feng/China/IBM@IBMCN
Cc: Sean Owen so...@cloudera.com, user user@spark.apache.org
Date: 07/29/2015 10:35 AM
Subject: Re: NO Cygwin Support in bin/spark-class in Spark
@Ashwin: You could append the topic in the data.
val kafkaStreams = topics.map { topic =>
  KafkaUtils.createDirectStream(topic...).map { x => (x, topic) }
}
val unionedStream = context.union(kafkaStreams)
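A hedged sketch of taking topic-specific actions downstream of that union (the topic names are hypothetical):

unionedStream.foreachRDD { rdd =>
  // split on the topic tag attached in the map above
  val topicA = rdd.filter { case (_, topic) => topic == "topicA" }
  val topicB = rdd.filter { case (_, topic) => topic == "topicB" }
  // ... topic-specific processing of topicA / topicB ...
}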
@Brandon:
I don't recommend it, but you could do something crazy like use the
Hi guys
When I run the random forest algorithm, a job hangs for 15.8h; I cannot figure
out why that happened.
Here is the code:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n24046/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7_2015-07-29_%E4%B8%8A%E5%8D%8810.png
I use Spark 1.4.0 on YARN.
I don't think anyone has really run 500 text streams.
And the parSequences do nothing there; you are only parallelizing the setup
code, which does not really compute anything. It also sets up 500 foreachRDD
operations that will get executed in each batch sequentially, so it does not
make sense. The
Hi TD,
We have a requirement to maintain the user session state and to
maintain/update the metrics for minute, day and hour granularities for a
user session in our Streaming job. Can I keep those granularities in the
state and recalculate each time there is a change? How would the
I am running in coarse-grained mode, let's say with 8 cores per executor.
If I use a broadcast variable, will all of the tasks in that executor share
the same value? Or will each task get its own value, i.e., in this case,
would there be one value in the executor shared by the 8 tasks, or would
If I understand correctly, there would be one value in the executor.
Cheers
On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney jcove...@gmail.com
wrote:
I am running in coarse-grained mode, let's say with 8 cores per executor.
If I use a broadcast variable, will all of the tasks in that
If you are trying to keep such long-term state, it will be more robust in
the long term to use a dedicated data store (Cassandra/HBase/etc.) that is
designed for long-term storage.
On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote:
Hi TD,
We have a requirement to
Can you show the full stack trace?
Which Spark release are you using ?
Thanks
On Jul 27, 2015, at 10:07 AM, Wayne Song wayne.e.s...@gmail.com wrote:
Hello,
I am trying to start a Spark master for a standalone cluster on an EC2 node.
The CLI command I'm using looks like this:
Is this average supposed to be across all partitions? If so, it will
require one of the reduce operations in every batch interval. If
that's too slow for the data rate, I would investigate using
PairDStreamFunctions.updateStateByKey to compute the sum + count of the 2nd
integers, per 1st
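A minimal sketch of that updateStateByKey approach, assuming a DStream[(Int, Int)] named pairs keyed by the 1st integer (checkpointing must be enabled for stateful operations):

val sumCounts = pairs.updateStateByKey[(Long, Long)] {
  (values: Seq[Int], state: Option[(Long, Long)]) =>
    val (sum, count) = state.getOrElse((0L, 0L))
    Some((sum + values.sum, count + values.size))
}
// average per key = sum / count
val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }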
You can try out a few tricks employed by folks at Lynx Analytics...
Daniel Darabos gave some details at Spark Summit:
https://www.youtube.com/watch?v=zt1LdVj76LUindex=13list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs
On 22.7.2015. 17:00, Louis Hust wrote:
My code like below:
Map<String,
Sorry about self-promotion, but there's a really nice tutorial for
setting up Eclipse for Spark in Spark in Action book:
http://www.manning.com/bonaci/
On 27.7.2015. 10:22, Akhil Das wrote:
You can follow this doc
One approach would be to store the batch data in intermediate storage
(like HBase/MySQL or even ZooKeeper), and inside your filter function
you just go and read the previous value from this storage and do whatever
operation you are supposed to do.
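A rough sketch of that pattern, with hypothetical storeGet/storePut stubs standing in for the HBase/MySQL/ZooKeeper access:

// hypothetical helpers over the external store
def storeGet(key: String): Option[String] = ???
def storePut(key: String, value: String): Unit = ???

val filtered = batchRDD.filter { case (key, value) =>
  val previous = storeGet(key)
  val keep = previous.forall(_ != value)  // example predicate: keep only changed values
  if (keep) storePut(key, value)
  keep
}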
Thanks
Best Regards
On Sun, Jul 26,
I'm using Spark Streaming and I want to configure checkpointing to manage
fault tolerance.
I've been reading the documentation. Is it necessary to create and
configure the InputDStream in the getOrCreate function?
I checked the example in
Hi Akhil
I have it working now with the Groovy REPL in a form similar to the one you are
mentioning. Still, I don't understand why the previous form (with the Function)
is raising that exception.
Cheers
Guillaume
On 28 July 2015 at 08:56, Akhil Das ak...@sigmoidanalytics.com wrote:
Did you try it
Did you try binding to 0.0.0.0?
Thanks
Best Regards
On Mon, Jul 27, 2015 at 10:37 PM, Wayne Song wayne.e.s...@gmail.com wrote:
Hello,
I am trying to start a Spark master for a standalone cluster on an EC2
node.
The CLI command I'm using looks like this:
Note that I'm specifying the
You need to trigger an action on your rowrdd for it to execute the map;
you can do a rowrdd.count() for that.
Thanks
Best Regards
On Tue, Jul 28, 2015 at 2:18 PM, Manohar Reddy
manohar.re...@happiestminds.com wrote:
Hi Akhil,
Thanks for the reply. I found the root cause but don't know how
I tried those 3 possibilities, and everything keeps working => the endpoint
param is not being applied:
sc.hadoopConfiguration.set("s3service.s3-endpoint", "test")
sc.hadoopConfiguration.set("fs.s3n.endpoint", "test")
sc.hadoopConfiguration.set("fs.s3n.s3-endpoint", "test")
2015-07-28 10:28 GMT+02:00 Akhil Das
You don't have to use some other package in order to get access to the
offsets.
Shushant, have you read the available documentation at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
or watched
Can you show us the snippet of the exception stack?
Thanks
On Jul 27, 2015, at 10:22 PM, Stephen Boesch java...@gmail.com wrote:
when using spark-submit: which directory contains third party libraries that
will be loaded on each of the slaves? I would like to scp one or more
libraries
Path to log file should be displayed when you launch the master.
e.g.
/mnt/var/log/apps/spark
-hadoop-org.apache.spark.deploy.master.Master-MACHINENAME.out
On Mon, Jul 27, 2015 at 11:28 PM, Jack Yang j...@uow.edu.au wrote:
Hi all,
I have questions with regarding to the log file directory.
Yes, you need to follow the documentation. Configure your stream,
including the transformations made to it, inside the getOrCreate function.
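A minimal sketch of that pattern (the checkpoint directory and source are hypothetical):

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(10))  // conf assumed built elsewhere
  ssc.checkpoint("hdfs:///checkpoints/myapp")
  val lines = ssc.socketTextStream("host", 9999)  // define sources here
  lines.map(_.length).print()                     // and all transformations here
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myapp", createContext _)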
On Tue, Jul 28, 2015 at 3:14 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
I'm using SparkStreaming and I want to configure checkpoint to manage
When you say you installed Spark, did you install the master and slave
services for standalone mode as described here
http://spark.apache.org/docs/latest/spark-standalone.html? If you
intended to run Spark on Hadoop, see here
http://spark.apache.org/docs/latest/running-on-yarn.html.
It looks like
Hi all.
I am writing a Twitter connector using Spark Streaming. I have written the
following code to maintain the checkpoint:
val ssc = StreamingContext.getOrCreate("hdfs://192.168.23.109:9000/home/cloud9/twitterCheckpoint", () => managingContext())
def managingContext(): StreamingContext =
{
That stacktrace looks like an out-of-heap-space error on the driver while
writing a checkpoint, not on the worker nodes. How much memory are you giving
the driver? How big are your stored checkpoints?
On Tue, Jul 28, 2015 at 9:30 AM, Nicolas Phung nicolas.ph...@gmail.com
wrote:
Hi,
After using
Hi TD,
Thanks for the info. I have the scenario like this.
I am reading the data from kafka topic. Let's say kafka has 3 partitions
for the topic. In my streaming application, I would configure 3 receivers
with 1 thread each such that they would receive 3 dstreams (from 3
partitions of kafka
Sorry about self-promotion, but there's a really nice tutorial for
setting up Eclipse for Spark in Spark in Action book:
http://www.manning.com/bonaci/
On 24.7.2015. 7:26, Siva Reddy wrote:
Hi All,
I am trying to setup the Eclipse (LUNA) with Maven so that I create a
maven projects
Hi Ted, yes, the Cloudera blog and your code were my starting point - but I
needed something more Spark-centric rather than HBase-centric: basically doing a
lot of ad-hoc transformations with RDDs that were based on HBase tables and
then mutating them after a series of iterative (BSP-like) steps.
On 28 July
Oops, yes, I'm still messing with the repo on a daily basis.. fixed
On 28 July 2015 at 17:11, Ted Yu yuzhih...@gmail.com wrote:
I got a compilation error:
[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to
Cool, will revisit. Is your latest code visible publicly somewhere?
On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote:
Yup you should be able to do that with the APIs that are going into HBase.
Let me know if you need to chat about the problem and how to implement it
with
If I have a Hive table with six columns and create a DataFrame (Spark
1.4.1) using a sqlContext.sql("select * from ...") query, the resulting
physical plan shown by explain reflects the goal of returning all six
columns.
If I then call select("one_column") on that first DataFrame, the resulting
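For what it's worth, a quick way to compare the two plans (the table and column names are hypothetical):

val df = sqlContext.sql("select * from my_table")
df.explain()                       // physical plan lists all six columns
df.select("one_column").explain()  // check whether column pruning kicks in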
The documentation for the NumPy dependency for MLlib seems somewhat vague [1].
Is NumPy only a dependency for the driver node, or must it also be installed on
every worker node?
Thanks,
Alek
[1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies
Hi all, for the last couple of months I've been working on large graph analytics,
and along the way I have written from scratch an HBase-Spark integration, as
none of the ones out there worked either in terms of scale or in the way
they integrated with the RDD interface. This week I have generalised it
into
Hi, I am using sc.parallelize(...32k of items) several times for 1 job. Each
executor takes x amount of time to process its items, but this results in
some executors finishing quickly and staying idle till the others catch up.
Only after all executors complete the first 32k batch, the next batch
Hello,
I'm using spark-csv instead of sc.textFile() to work with CSV files.
How can I set the number of partitions that will be created when reading a CSV?
Basically, an equivalent of minPartitions in textFile():
val myrdd = sc.textFile("my.csv", 24)
Srikanth
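I'm not aware of a read-time minPartitions option in spark-csv; one hedged workaround is to repartition right after the load, at the cost of a shuffle:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")  // assumption: the file has a header row
  .load("my.csv")
  .repartition(24)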
I got a compilation error:
[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes
at 1438099569598
[ERROR]
/home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36:
Hi,
I'm using a simple Akka actor to create an actorStream. The actor just
forwards the messages received to the stream by
calling super[ActorHelper].store(msg). This works ok when I create the
stream with
ssc.actorStream[A](Props(new ProxyReceiverActor[A]), receiverActorName)
but when I try to
I have K/V pairs where V is an Iterable (from a previous groupBy). I use the
JAVA API.
What I want is to iterate over the values by key, and on every element set
the previousElementId attribute, that is, the id of the previous element in the
sorted list.
I try to do this with mapValues. I create an
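A sketch of one way to do this (in Scala for brevity; the element type, sort key, and previousElementId field are hypothetical):

val linked = grouped.mapValues { values =>
  val sorted = values.toSeq.sortBy(_.timestamp)      // hypothetical sort key
  val prevIds = None +: sorted.map(v => Some(v.id))  // previous ids, shifted by one
  sorted.zip(prevIds).map { case (v, prev) =>
    v.copy(previousElementId = prev)                 // hypothetical case-class field
  }
}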
Hi
I am processing Kafka messages using Spark Streaming 1.3.
I am using the mapPartitions function to process the Kafka messages.
How can I access the offset number of each individual message being processed?
JavaPairInputDStream<byte[], byte[]> directKafkaStream
= KafkaUtils.createDirectStream(..);
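A minimal sketch (in Scala for brevity) of recovering per-partition offsets from the direct stream via HasOffsetRanges:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // the cast must be done on the RDD produced directly by the direct stream
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}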
Does adding back the cygwin detection and this clause make it work?
if $cygwin; then
CLASSPATH=`cygpath -wp $CLASSPATH`
fi
If so, I imagine that's fine to bring back, if it's still needed.
On Tue, Jul 28, 2015 at 9:49 AM, Proust GZ Feng pf...@cn.ibm.com wrote:
Thanks Owen, the problem under
Hi Xiangrui and Spark people,
I recently got around to writing an evaluation framework for Spark that I
was hoping to PR into MLlib, which would solve some of the aforementioned
issues. I have put the code on GitHub in a separate repo for now as I
would like to get some sandboxed feedback. The
Hi, I'd like to run spark-submit remotely from a local machine to submit a job
to a Spark cluster (cluster mode).
What method do I use to authenticate myself to the cluster? For example, how do
I pass a user id, password, or private key to the cluster?
Any help is appreciated.
Hi,
I'm starting R on Spark via the sparkR script but I can't access the
sparkcontext as described in the programming guide. Any ideas?
Thanks,
Siegfried
You may check out Apache Phoenix on top of HBase for this. However, it does
not have ODBC drivers, only JDBC ones. Maybe Hive 1.2 with a new version of
Tez will also serve your purpose. You should run some proof of concept with
these technologies using real or generated data. About how much data
Hi,
I'm using spark-submit cluster mode to submit a job from a local machine to a
Spark cluster. There are input files, output files, and job log files that I
need to transfer in and out between the local machine and the Spark cluster.
Are there any recommended methods for these file transfers? Is there any future
We want these user actions to respond within 2 to 5 seconds.
I think this goal is a stretch for Spark. Some queries may run faster than
that on a large dataset,
but in general you can't put an SLA like this on it. For example, if you have
to join some huge datasets,
you'll likely be much over that.
Sqoop's incremental data fetch will reduce the data size you need to pull from
the source, but by the time that incremental fetch completes, won't the data be
out of date again if its velocity is high?
Maybe you can put a trigger in Postgres to send data to the big data cluster
as
Try using: org.apache.hadoop.hive.serde2.RegexSerDe
GP
On 27 Jul 2015, at 09:35, ZhuGe t...@outlook.com wrote:
Hi all:
I am testing the performance of Hive on Spark SQL.
The existing table is created with
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
Hey Stephen,
In case these libraries exist on the client in the form of Maven artifacts, you
can use --packages to ship the library and all its dependencies, without
building an uber jar.
Best,
Burak
On Tue, Jul 28, 2015 at 10:23 AM, Marcelo Vanzin van...@cloudera.com
wrote:
Hi Stephen,
There
Can the source write to Kafka/Flume/HBase in addition to Postgres? No,
it can't; this is due to the fact that there are many applications
producing this PostgreSQL data. I can't really ask all the teams
to start writing to some other source.
The velocity of the application is too
Hi Stephen,
There is no such directory currently. If you want to add an existing jar to
every app's classpath, you need to modify two config values:
spark.driver.extraClassPath and spark.executor.extraClassPath.
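For example (the jar path and main class here are hypothetical):

spark-submit \
  --conf spark.driver.extraClassPath=/opt/libs/thirdparty.jar \
  --conf spark.executor.extraClassPath=/opt/libs/thirdparty.jar \
  --class com.example.Main app.jar

Note that extraClassPath does not copy files; the jar must already exist at that path on every node.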
On Mon, Jul 27, 2015 at 10:22 PM, Stephen Boesch java...@gmail.com wrote:
when
Can you put some transparent cache in front of the database? Or some JDBC
proxy?
On Tue, Jul 28, 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com wrote:
Can the source write to Kafka/Flume/HBase in addition to Postgres? No,
it can't; this is due to the fact that there are many
I just cloned the master repository of Spark from GitHub. I am running it on
OS X 10.9, Spark 1.4.1 and Scala 2.10.4.
I just tried to run the SparkPi example program using IntelliJ IDEA but get
the error: akka.actor.ActorNotFound: Actor not found for:
Thanks Corey for your answer,
Do you mean that the "final status: SUCCEEDED" in the terminal logs means that
the YARN RM could clean up the resources after the application finished (and
that the application finishing does not necessarily mean it succeeded or
failed)? With that logic it totally makes sense.
Basically the
Hi,
I'm puzzling over the following problem: when I cache a small sample of a
big dataframe, the small dataframe is recomputed when selecting a column
(but not if show() or count() is invoked).
Why is that so and how can I avoid recomputation of the small sample
dataframe?
More details:
- I
Hi all,
I am experimenting with and learning about performance on big tasks locally, on
a 32-core node with more than 64 GB of RAM. Data is loaded from a database
through a JDBC driver, and heavy computations are launched against it. I am
presented with two questions:
1. My RDD is poorly distributed. I
This might be an issue with how pyspark propagates the error back to the
AM. I'm pretty sure this does not happen for Scala / Java apps.
Have you filed a bug?
On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Thanks Corey for your answer,
Do you mean that final
Hi Saif,
Are you using JdbcRDD directly from Spark?
If yes, then the poor distribution could be due to the bound key you used.
See the JdbcRDD Scala doc at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
:
sql: the text of the query. The query must contain
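A minimal JdbcRDD sketch showing the two '?' placeholders and the partitioning bounds (the URL, table, and bounds are hypothetical):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:postgresql://host/db"),
  "SELECT id, amount FROM orders WHERE id >= ? AND id <= ?",
  1L, 1000000L,  // lowerBound / upperBound on the partitioning key
  32,            // numPartitions
  rs => (rs.getLong("id"), rs.getDouble("amount")))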
Thank you for your response, Zhen.
I am using some vendor-specific JDBC driver JAR file (honestly I don't know
where it came from). Its API is NOT like JdbcRDD; instead, it is more like jdbc
from DataFrameReader.
BTW this is most probably caused by this line in PythonRunner.scala:
System.exit(process.waitFor())
The YARN backend doesn't like applications calling System.exit().
On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin van...@cloudera.com
wrote:
This might be an issue with how pyspark
We will try to address this before Spark 1.5 is released:
https://issues.apache.org/jira/browse/SPARK-9141
On Tue, Jul 28, 2015 at 11:50 AM, Kristina Rogale Plazonic kpl...@gmail.com
wrote:
Hi,
I'm puzzling over the following problem: when I cache a small sample of a
big dataframe, the
Is it possible to restart the job from the last successful stage instead of
from the beginning?
For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long
time and is successful, but the job fails on stage 1, it would be useful to
be able to restart from the output of stage 0
Saif,
I am guessing at your use case, but I'm not sure. Are you retrieving the entire
table into Spark? If yes, do you have a primary key on your table?
If also yes, then JdbcRDD should be efficient. DataFrameReader.jdbc gives
you more options; again, it depends on your use case. Is it possible for you to
describe
The problem was that I was trying to start the example app in standalone
cluster mode by passing in
*-Dspark.master=spark://myhost:7077* as an argument to the JVM. I launched
the example app locally using -*Dspark.master=local* and it worked.
But then how can we know, in a programmatic way (Java), whether the job is
making progress?
Or whether the job has failed or succeeded?
Is looking at the application log files the only way of knowing the job's final
status (failed/succeeded)?
Because when a job fails, the Job History server does not have much info about
Thanks a lot for feedback, Marcelo.
I've filed a bug just now - SPARK-9416
https://issues.apache.org/jira/browse/SPARK-9416
On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin van...@cloudera.com
wrote:
BTW this is most probably caused by this line in PythonRunner.scala:
There's a spark-submit.cmd file for Windows. Does that work?
On 27 Jul 2015, at 21:19, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Spark Users
It looks like Spark 1.4.0 cannot work with Cygwin due to the removal of Cygwin
support in bin/spark-class
The changeset is
Hi Stahlman,
finalRDDStorageLevel is the storage level for the final user/item
factors. It is not common to set it to StorageLevel.NONE, unless you
want to save the factors directly to disk. So if it is NONE, we cannot
unpersist the intermediate RDDs (in/out blocks) because the final
user/item
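For example, a sketch of setting it explicitly via the setFinalRDDStorageLevel setter (the other parameters are hypothetical):

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .run(ratings)  // ratings: RDD[Rating], assumed defined elsewhere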
The partitioner is not saved with the RDD. So when you load the model
back, we lose the partitioner information. You can call repartition on
the user/product factors and then create a new
MatrixFactorizationModel object using the repartitioned RDDs. It would
be useful to create a utility method
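A hedged sketch of such a utility, rebuilding the model from repartitioned factor RDDs:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def repartitionModel(model: MatrixFactorizationModel,
                     numPartitions: Int): MatrixFactorizationModel =
  new MatrixFactorizationModel(
    model.rank,
    model.userFeatures.repartition(numPartitions).cache(),
    model.productFeatures.repartition(numPartitions).cache())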
It wasn't removed, but rewritten. Cygwin is just a distribution of
POSIX-related utilities, so you should be able to use the normal .sh
scripts. In any event, you didn't say what the problem is.
On Tue, Jul 28, 2015 at 5:19 AM, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Spark Users
Looks like