SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I
think it will give the size of the df object.
With an RDD, I sample() and collect() and sum the size of each row. If I do the
same with a DataFrame, it will no longer be the size when represented in
columnar format.
I'd also like to know how
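A minimal sketch of that sampling approach in Scala, assuming SizeEstimator is
callable on your version (it became public around Spark 1.5); note it measures
the deserialized Row objects on the JVM heap, not the columnar representation:

import org.apache.spark.util.SizeEstimator

// Sketch: estimate per-row JVM size on a small sample, then scale by the
// total row count. The 1% fraction is an arbitrary assumption.
val sample = df.sample(withReplacement = false, fraction = 0.01).collect()
val sampleBytes = sample.map(row => SizeEstimator.estimate(row)).sum
val estimatedBytes =
  if (sample.nonEmpty) sampleBytes * (df.count().toDouble / sample.length) else 0.0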
That looks like it's during recovery from a checkpoint, so it'd be driver
memory not executor memory.
How big is the checkpoint directory that you're trying to restore from?
On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
We're getting the below error.
Consider two code snippets as the following:
// Java code:
abstract class Ops implements Serializable {
    public abstract Integer apply(Integer x);
    public void doSomething(JavaRDD<Integer> rdd) {
        rdd.map(x -> x + apply(x))
           .collect()
           .forEach(System.out::println);
    }
}
public
I wonder whether, during recovery from a checkpoint, we can estimate the size
of the checkpoint and compare it with Runtime.getRuntime().freeMemory().
If the size of the checkpoint is much bigger than free memory, log a warning, etc.
Cheers
On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg
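A rough sketch of that check in Scala, assuming the checkpoint lives on an
HDFS-compatible filesystem (the path and the threshold are placeholders):

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: compare the on-disk checkpoint size with the driver's current free
// heap and warn before attempting recovery.
val fs = FileSystem.get(sc.hadoopConfiguration)
val checkpointBytes =
  fs.getContentSummary(new Path("/path/to/checkpoint")).getLength // placeholder path
val freeBytes = Runtime.getRuntime.freeMemory()
if (checkpointBytes > freeBytes) {
  println(s"WARN: checkpoint ($checkpointBytes bytes) exceeds free driver memory ($freeBytes bytes)")
}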
The difference is really that Java and Scala work differently. In
Java, your anonymous subclass of Ops defined in (a method of)
AbstractTest captures a reference to it. That much is 'correct' in
that it's how Java is supposed to work, and AbstractTest is indeed not
serializable since you didn't
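The usual workaround, sketched here in Scala: copy what the closure needs into
a local val, so the closure captures only that value and not the enclosing
(non-serializable) instance:

import org.apache.spark.rdd.RDD

// Sketch: Outer is not Serializable, so referencing its field directly inside
// the closure would drag `this` into the serialized task. A local copy avoids that.
class Outer(val offset: Int) {
  def addOffset(rdd: RDD[Int]): RDD[Int] = {
    val localOffset = offset // capture the Int, not `this`
    rdd.map(x => x + localOffset)
  }
}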
I figured out the issue - it had to do with the Cassandra jar I had
compiled. I had tested a previous version. Using --jars (comma separated)
and --driver-class-path (colon separated) is working.
On Mon, Aug 10, 2015 at 1:08 AM ayan guha guha.a...@gmail.com wrote:
Easiest way should be to add
We're getting the below error. Tried increasing spark.executor.memory e.g.
from 1g to 2g but the below error still happens.
Any recommendations? Something to do with specifying -Xmx in the submit job
scripts?
Thanks.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes
of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output
would be used in the broadcast join as the 'build' relation.
FYI
On Mon, Aug 10, 2015 at 8:04 AM, Srikanth srikanth...@gmail.com wrote:
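For reference, the threshold is configurable per SQLContext; a sketch of
raising it (the 20 MB value is only an example):

// Sketch: relations whose estimated size is at or below this threshold are
// broadcast to all executors for the join. -1 disables broadcasting entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)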
Would there be a way to chunk up/batch up the contents of the checkpointing
directories as they're being processed by Spark Streaming? Is it mandatory
to load the whole thing in one go?
On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:
I wonder during recovery from a
You need to keep a certain number of rdds around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.
On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Would there be a way to chunk up/batch up the contents of the
Looks like the workaround is to reduce the *window length*.
*Cheers*
On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger c...@koeninger.org wrote:
You need to keep a certain number of rdds around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.
On Mon, Aug
Thank you Gen, the changes to HBaseConverters.scala now appear to return
all column qualifiers, as follows:
(u'row1', {u'qualifier': u'a', u'timestamp': u'1438716994027', u'value':
u'value1', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'})
(u'row1', {u'qualifier': u'b',
https://www.youtube.com/watch?v=H07zYvkNYL8
On Mon, Aug 10, 2015 at 10:55 AM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at the first section of
https://spark.apache.org/community
Cheers
On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos phil.kal...@gmail.com
wrote:
please
Older versions of Spark (i.e. when it was still called SchemaRDD instead of
DataFrame) did not support merging different parquet schema. However,
Spark 1.4+ should.
On Sat, Aug 8, 2015 at 8:58 PM, sim s...@swoop.com wrote:
Adam, did you find a solution for this?
--
View this message in
Michael, is there an example anywhere that demonstrates how this works with the
schema changing over time?
Must the Hive tables be set up as external tables outside of saveAsTable? In my
experience, in 1.4.1, writing to a table with SaveMode.Append fails if the
schemas don't match.
Thanks,
Sim
Hi everyone,
I recently started using the new Kafka direct approach.
Now, as far as I understood, each Kafka partition /is/ an RDD partition that
will be processed by a single core.
What I don't understand is the relation between those partitions and the
blocks generated every blockInterval.
You can't create a DataFrame from an arbitrary object since we don't know
how to figure out the schema. You can either create a JavaBean
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
or manually create a row + specify the schema
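A minimal Scala sketch of the second option, manually creating Rows and
specifying the schema (the column names are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Sketch: build Rows by hand and describe them with an explicit StructType.
val rows = sc.parallelize(Seq(Row("alice", 1), Row("bob", 2)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("id", IntegerType, nullable = true)))
val df = sqlContext.createDataFrame(rows, schema)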
I am running as yarn-client, which probably means that the program that
submitted the job is where the listening is also occurring? I thought that
YARN is only used to negotiate resources in yarn-client master mode.
On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das t...@databricks.com wrote:
I tested this in master (1.5 release), it worked as expected (changed
spark.driver.maxResultSize to 10m):
len(sc.range(10).map(lambda i: '*' * (1 << 23)).take(1))
1
len(sc.range(10).map(lambda i: '*' * (1 << 24)).take(1))
15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized
results of 1
Thank you! :)
2015-08-10 19:58 GMT+02:00 Cody Koeninger c...@koeninger.org:
There's no long-running receiver pushing blocks of messages, so
blockInterval isn't relevant.
Batch interval is what matters.
On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote:
Hi everyone,
I
If you are running on a cluster, the listening is occurring on one of the
executors, not in the driver.
On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying to run this program as a yarn-client. The job seems to be
submitting successfully however I don't
In yarn-client mode, the driver is on the machine where you ran the
spark-submit. The executors are running in the YARN cluster nodes, and the
socket receiver listening on port is running in one of the executors.
On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
Umesh:
Please take a look at the classes under:
sql/core/src/main/scala/org/apache/spark/sql/parquet
FYI
On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi Bo, thanks much, let me explain. Please see the following code:
JavaPairRDD<String, PortableDataStream> pairRdd =
I am thinking of using the *toLocalIterator* method or something like that,
but I have doubts about memory and parallelism, and surely there is a better
way to do it.
It will still run all earlier parts of the job in parallel. Only the
actual retrieving of the final partitions will be serial. This is
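A sketch of that pattern; toLocalIterator fetches one partition at a time, so
the driver only ever holds a single partition's rows:

// Sketch: persisting first is an assumption about your job -- it avoids
// recomputing the lineage once per partition as the iterator advances.
rdd.persist()
rdd.toLocalIterator.foreach { record =>
  println(record) // stand-in for your per-record processing
}
rdd.unpersist()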
Any recommendations for gracefully shutting down a Spark Streaming
application?
I am running it on YARN and need a way to signal it from outside, either the
yarn application -kill command or some other way, but the current batch needs
to be processed completely and the checkpoint saved before shutting
There's no long-running receiver pushing blocks of messages, so
blockInterval isn't relevant.
Batch interval is what matters.
On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote:
Hi everyone,
I recently started using the new Kafka direct approach.
Now, as far as I
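A sketch of the direct-stream setup for reference (broker and topic are
placeholders); each Kafka topic-partition becomes one RDD partition per batch,
covering whatever offsets arrived during the batch interval:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: no receiver and no blockInterval involved; the 5-second batch
// interval is what bounds each generated RDD.
val ssc = new StreamingContext(sc, Seconds(5))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"), // placeholder broker
  Set("mytopic"))                                // placeholder topic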
Hi Michael, thanks for the reply. I know that I can create a DataFrame using
a JavaBean or StructType. I want to know how I can create a DataFrame from
the above code, which is a custom Hadoop format.
On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust mich...@databricks.com
wrote:
You can't create a
Hi Bo, thanks much, let me explain. Please see the following code:
JavaPairRDD<String, PortableDataStream> pairRdd =
javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();
DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd,
Hi, I have my own custom Hadoop InputFormat which I want to use with DataFrames.
How do we do that? I know I can use sc.hadoopFile(..), but then how do I
convert it into a DataFrame?
JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =
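One way to bridge this, sketched in Scala: read the InputFormat as an RDD,
convert each Writable to a Row, and apply an explicit schema.
MyInputFormat/MyRecordWritable come from the question; the single string
column is an assumption:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch: Writables are reused by Hadoop record readers and are not
// serializable, so convert them to plain values inside the map.
val raw = sc.newAPIHadoopFile(
  "/hdfs/path/to/input", // placeholder path
  classOf[MyInputFormat],
  classOf[Void],
  classOf[MyRecordWritable])
val rows = raw.map { case (_, w) => Row(w.toString) } // assumes one string field
val schema = StructType(Seq(StructField("record", StringType, nullable = true)))
val df = sqlContext.createDataFrame(rows, schema)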
You need to keep a certain number of rdds around for checkpointing --
that seems like a hefty expense to pay in order to achieve fault
tolerance. Why does Spark persist whole RDDs of data? Shouldn't it be
sufficient to just persist the offsets, to know where to resume from?
Thanks.
On Mon,
Hi,
I am sorry to bother you again.
When I do a join as follows:
df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
on condition1 *or* condition2")
df.first()
The program failed because the result size is bigger than
spark.driver.maxResultSize.
It is really strange, as one record is no
I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0
FYI
On Mon, Aug 10, 2015 at 5:12 AM, mark manwoodv...@googlemail.com wrote:
Hi All
I need to be able to create, submit and report on Spark jobs
programmatically in response to events arriving on a Kafka bus. I also
I think this may be expected. When the streaming context is stopped without
the SparkContext, then the receivers are stopped by the driver. The
receiver sends back the message that it has been stopped. This is being
(probably incorrectly) logged with ERROR level.
On Sun, Aug 9, 2015 at 12:52 AM,
Is it receiving any data? If so, then it must be listening.
Alternatively, to test these theories, you can locally run a Spark
standalone cluster (a one-node standalone cluster on your local machine), and
submit your app in client mode on that to see whether you see the
process listening on
From logs, it seems that Spark Streaming does handle *kill -SIGINT* with
graceful shutdown.
Please could you confirm?
Thanks!
On 30 July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote:
Yes, I was doing the same. If you mean that this is the correct way to do it,
then I will verify it.
Please help, as I am not sure what is incorrect with the below code; it gives
me a compilation error in Eclipse:
SparkConf sparkConf = new
SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Well, RDDs also contain data, don't they?
The question is, what can be so hefty in the checkpointing directory as to
cause the Spark driver to run out of memory? It seems that it makes
checkpointing expensive, in terms of I/O and memory consumption. Two
network hops -- to the driver, then to the workers.
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
iPython Notebook. I'm getting collect() to work just fine, but take()
errors. (I'm having issues with collect() on other datasets ... but take()
seems to break every time I run it.)
My code is below. Any thoughts?
sc
Hi All,
I have an RDD of case class objects.
scala> case class Entity(
     |   value: String,
     |   identifier: String
     | )
defined class Entity
scala> Entity("hello", "id1")
res25: Entity = Entity(hello,id1)
During a map operation, I'd like to return a new RDD that contains all of
the
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
https://www.youtube.com/watch?v=fXnNEq1v3VA
On Mon, Aug 10, 2015 at 4:32 PM, Shushant
When will window functions be integrated into Spark (without HiveContext)?
Sent with AquaMail for Android
http://www.aqua-mail.com
On 10 August 2015 at 23:04:22, Michael Armbrust mich...@databricks.com wrote:
You will need to use a HiveContext for window functions to work.
On Mon, Aug 10,
Thanks!
On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das t...@databricks.com wrote:
1. RPC can be done in many ways, and a web service is one of many ways. An
even more hacky version can be the app polling a file in a file system: if
the file exists, start shutting down.
2. No need to set a
You will need to use a HiveContext for window functions to work.
On Mon, Aug 10, 2015 at 1:26 PM, Jerry jerry.c...@gmail.com wrote:
Hello,
Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries
to a data frame and I'm trying to figure out if I just have a bad setup or
if
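For reference, a Scala sketch of lag/lead through the DataFrame API on 1.4.x;
it only resolves when the context is a HiveContext, and the column names are
illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead}

// Sketch: window functions require a HiveContext in Spark 1.4.x.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

// Assumes df was created through hiveContext and has these columns.
val w = Window.partitionBy("customer").orderBy("time")
val withNeighbors = df.select(
  $"customer", $"time", $"price",
  lag($"price", 1).over(w).as("prev_price"),
  lead($"price", 1).over(w).as("next_price"))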
Hi
How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3
because of executor failure?
1. Can I somehow access accumulators of the failed task in the retry task to
skip those many events which were already processed by the failed task on this
partition?
2. Or will I have to persist each
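One common pattern with the direct stream (a sketch; directStream is assumed
to come from KafkaUtils.createDirectStream, and saveToMySink is a hypothetical
helper) is to make writes idempotent and key them by offset range, so a
replayed partition overwrites instead of duplicating:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Sketch: each direct-stream RDD knows exactly which offsets it covers, so
// the sink can upsert results keyed by (topic, partition, fromOffset) and
// safely absorb replays after an executor failure.
directStream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val range = ranges(TaskContext.get.partitionId)
    saveToMySink(range.topic, range.partition, range.fromOffset, iter) // hypothetical upsert
  }
}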
There was a similar problem reported before on this list.
Weird Python errors like this generally mean you have different
versions of Python on the nodes of your cluster. Can you check that?
From the error stack, you are using 2.7.10 (Anaconda 2.3.0),
while the OS/CDH version of Python is probably 2.6.
--
I am using the same exact code:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java
Submitting like this:
yarn:
/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/bin/spark-submit --class
If you want to convert the hash back to the word, the very idea defeats the
purpose of hashing.
You can map words to hashes, but the reverse mapping wouldn't be good.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086p24203.html
Sent from the
Is it possible that you have Python 2.7 on the driver, but Python 2.6
on the workers?
PySpark requires that you have the same minor version of Python on
both driver and workers. In PySpark 1.4+, it will do this check before
running any tasks.
On Mon, Aug 10, 2015 at 2:53 PM, YaoPau
Hi All,
Any explanation for this?
As Reece said, I can do operations with Hive, but
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) -- gives an error.
I have already created spark ec2 cluster with the spark-ec2 script. How can
I build it again?
Thanks
_Roni
On Tue, Jul 28, 2015
I use Play JSON; maybe it's the most well known.
If you would like to try it, below is the sbt dependency:
"com.typesafe.play" % "play-json_2.10" % "2.2.1",
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Json-parsing-library-for-Spark-Streaming-tp24016p24204.html
Sent from
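For anyone trying it, a minimal usage sketch against that version:

import play.api.libs.json.Json

// Sketch: parse a JSON string and extract a field.
val js = Json.parse("""{"user": "a", "count": 1}""")
val count = (js \ "count").as[Int] // count == 1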
Hi,
If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of
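A sketch of the indexing step described above (column names are illustrative):

import org.apache.spark.ml.feature.StringIndexer

// Sketch: StringIndexer assigns indices by label frequency (most frequent
// label -> 0.0), even when the input labels are already numeric.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
val indexed = indexer.fit(df).transform(df)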
I do see this message:
15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am using the same
1. When you are running locally, make sure the master in the SparkConf
reflects that and is not somehow set to yarn-client
2. You may not be getting any resources from YARN at all, so no executors,
so no receiver running. That is why I asked the most basic question - Is it
receiving data? That
Hi! Sorry if this is a repost.
I'm using Spark + Kinesis ASL to process and persist stream data to
ElasticSearch. For the most part it works nicely.
There is a subtle issue I'm running into about how failures are handled.
For example's sake, let's say I am processing a Kinesis stream that
What is the error you are getting? It would also be awesome if you could
try with Spark 1.5 when the first preview comes out (hopefully early next
week).
On Mon, Aug 10, 2015 at 11:41 AM, Simeon Simeonov s...@swoop.com wrote:
Michael, is there an example anywhere that demonstrates how this
Michael, please, see
http://apache-spark-user-list.1001560.n3.nabble.com/Schema-evolution-in-tables-tt23999.html
The exception is
java.lang.RuntimeException: Relation[ ... ]
org.apache.spark.sql.parquet.ParquetRelation2@83a73a05
requires that the query in the SELECT clause of the INSERT
Hello,
Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries
to a data frame and I'm trying to figure out if I just have a bad setup or
if this is a bug. As for the exceptions I get: when using selectExpr() with
a string as an argument, I get NoSuchElementException: key not
The rdd is indeed defined by mostly just the offsets / topic partitions.
On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
You need to keep a certain number of rdds around for checkpointing --
that seems like a hefty expense to pay in order to achieve fault
Dear Sir / Madam,
I plan to contribute some code for passing filters down to a data source
during physical planning.
In more detail, I understand that when we want to build up filter operations
on data like Parquet (actually reading and filtering HDFS blocks up front,
rather than filtering in memory
Anyone know this? Thanks
On Fri, Aug 7, 2015 at 4:20 PM, canan chen ccn...@gmail.com wrote:
Is there any reason that the history server uses another property for the event
log dir? Thanks
Hi Jerry, Akhil,
Thanks for your help. With s3n, the entire file is downloaded even while
just creating the RDD with sqlContext.read.parquet(). It seems like even
just opening and closing the InputStream causes the entire data to get
fetched.
As it turned out, I was able to use s3a and avoid
Hi, everyone.
I have a question about loading data: between the Spark data source API and
JDBC, which way is more efficient?
Hi All,
I have tried the commands as mentioned below but it still errors:
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars /home/missingmerch/
postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-
cassandra-connector-java_2.10-1.1.1.jar
Hi,
As I understand, JDBC is meant for a moderate volume of data, but the
Datasource API is a better option if the volume of data is larger.
The Datasource API is not available in lower versions of Spark such as 1.2.0.
Regards,
Satish
On Tue, Aug 11, 2015 at 8:53 AM, 李铖 lidali...@gmail.com wrote:
DataFrameReader.json() can handle gzipped JSONlines files automatically but
there doesn't seem to be a way to get DataFrameWriter.json() to write
compressed JSONlines files. Uncompressed JSONlines is very expensive from
an I/O standpoint because field names are included in every record.
Is
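A workaround sketch for 1.4.x: go through toJSON, which yields an RDD of JSON
strings, and save it with a compression codec:

import org.apache.hadoop.io.compress.GzipCodec

// Sketch: DataFrame.toJSON produces RDD[String] (one JSON object per line),
// and RDD.saveAsTextFile accepts a codec for compressed output.
df.toJSON.saveAsTextFile("/path/to/output", classOf[GzipCodec]) // placeholder path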
For monitoring, please take a look at
http://spark.apache.org/docs/latest/monitoring.html
Especially REST API section.
Cheers
On Mon, Aug 10, 2015 at 8:33 AM, Ted Yu yuzhih...@gmail.com wrote:
I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0
FYI
On Mon, Aug 10,
In general, it is a little risky to put long running stuff in a shutdown
hook as it may delay shutdown of the process which may delay other things.
That said, you could try it out.
A better way to explicitly shutdown gracefully is to use an RPC to signal
the driver process to start shutting down,
Note that this is true only from Spark 1.4 where the shutdown hooks were
added.
On Mon, Aug 10, 2015 at 12:12 PM, Michal Čizmazia mici...@gmail.com wrote:
From logs, it seems that Spark Streaming does handle *kill -SIGINT* with
graceful shutdown.
Please could you confirm?
Thanks!
On 30
By RPC do you mean a web service exposed on the driver which listens and sets
some flag, and the driver checks that flag at the end of each batch and, if
set, gracefully stops the application?
On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com wrote:
In general, it is a little risky to put
1. RPC can be done in many ways, and a web service is one of many ways. An
even more hacky version can be the app polling a file in a file system: if
the file exists, start shutting down.
2. No need to set a flag. When you get the signal from RPC, you can just
call context.stop(stopGracefully =
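A sketch of the file-polling variant (the marker path and poll interval are
arbitrary):

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: the driver polls for a marker file; when it appears, stop
// gracefully so the current batch completes and the checkpoint is written.
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop-streaming-app") // hypothetical marker file
var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000) // true once the context stops
  if (!stopped && fs.exists(marker)) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}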
No, it's not like a given KafkaRDD object contains an array of messages
that gets serialized with the object. Its compute method generates an
iterator of messages as needed, by connecting to Kafka.
I don't know what was so hefty in your checkpoint directory, because you
deleted it. My
By the way, if Hive is present in the Spark install, does it show up in the
text when you start the Spark shell? Are there any commands I can run to check
if it exists? I didn't set up the Spark machine that I use, so I don't know
what's present or absent.
Thanks,
Jerry
On Mon, Aug 10, 2015 at 2:38 PM,
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not
saving checkpoints correctly and was in general not reliable (even with WAL
enabled). We have improved this in Spark 1.5 with updated Kinesis receiver,
that keeps track of the Kinesis sequence numbers as part of the Spark
Yan / Bing:
Mind taking a look at HBASE-14181
https://issues.apache.org/jira/browse/HBASE-14181 'Add Spark DataFrame
DataSource to HBase-Spark Module' ?
Thanks
On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com
wrote:
We are happy to announce the availability of the Spark
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was
the issue, so we've tried many combinations since then with all three of
2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and
PYTHONPATHs each time. Every combination has produced the same error.
We
Following Hadoop conventions, Spark won't overwrite an existing directory.
You need to provide a unique output path every time you run the program, or
delete or rename the target directory before you run the job.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
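For completeness, a Scala sketch of deleting the target programmatically just
before saving (the path is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: remove the output directory recursively right before writing,
// so saveAsTextFile never sees an existing path.
val out = new Path("hdfs://namenode:54310/weblogReadResult") // placeholder
val fs = FileSystem.get(out.toUri, sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)
rdd.saveAsTextFile(out.toString)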
Have a look at the various versions of
PairDStreamFunctions.updateStateByKey (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
It supports updating running state in memory. (You can persist the state to
a database/files
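A sketch of that running-state approach, assuming `pairs` is a DStream of
(key, price) tuples; the checkpoint directory is a placeholder:

// Sketch: keep a running total per key across batches. updateStateByKey
// requires a checkpoint directory to be set.
ssc.checkpoint("/tmp/checkpoints") // placeholder
val runningTotals = pairs.updateStateByKey[Double] {
  (newPrices: Seq[Double], total: Option[Double]) =>
    Some(total.getOrElse(0.0) + newPrices.sum)
}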
Hi All,
Please help me fix a Spark Cassandra Connector issue; the details are
below.
*Command:*
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar
///home/missingmerch/etl-0.0.1-SNAPSHOT.jar
*Error:*
WARN
Add the other Cassandra dependencies (dse.jar,
spark-cassandra-connect-java_2.10) to your --jars argument on the command
line.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler
Hi All,
I just wanted to know how the direct API for Spark Streaming compares with
the earlier receiver-based API. Has anyone used the direct API approach in
production, or is it still being used for POCs?
Also, since I'm new to Spark, could anyone share a starting point from
where I could find a
org.apache.spark.streaming.twitter.TwitterInputDStream is a small class.
You could write your own that lets you change the filters at run time. Then
provide a mechanism in your app, like periodic polling of a database table
or file for the list of filters.
Dean Wampler, Ph.D.
Author: Programming
Thanks Dean, I am giving a unique output path, and every time I also delete
the directory before I run the job.
2015-08-10 15:30 GMT+03:00 Dean Wampler deanwamp...@gmail.com:
Following Hadoop conventions, Spark won't overwrite an existing directory.
You need to provide a unique output path
So, just before running the job, if you run the HDFS command at a shell
prompt: hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult.
Does it say the path doesn't exist?
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Easiest way should be to add both jars in SPARK_CLASSPATH as a colon
separated string.
On 10 Aug 2015 06:20, Jonathan Haddad j...@jonhaddad.com wrote:
I'm trying to write a simple job for Pyspark 1.4 migrating data from MySQL
to Cassandra. I can work with either the MySQL JDBC jar or the
Hi All,
I am creating a Spark Twitter streaming connection in my app over a long
period of time. When I have some new keywords I need to add them to the
Spark streaming connection, so I need to stop and start the current Twitter
streaming connection in this case.
I have tried akka actor scheduling but
Isn't it space-separated data? It is not comma (,) separated nor pipe (|)
separated data.
Thanks
Best Regards
On Mon, Aug 10, 2015 at 12:06 PM, Netwaver wanglong_...@163.com wrote:
Hi Spark experts,
I am now using Spark 1.4.1 and trying Spark SQL/DataFrame
API with text
Thanks for the response, Robin.
Is this the same for both directed and undirected graphs?
val edges =
Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
Array(3L -> 4L, 4L -> 5L, 5L -> 3L) ++
Array(6L -> 0L, 5L -> 7L)
val rawEdges = sc.parallelize(edges)
val graph =
Hi Spark experts,
I am now using Spark 1.4.1 and trying the Spark SQL/DataFrame API
with a text file in the below format:
id gender height
1 M 180
2 F 167
... ...
But I meet
How can I find cliques using Spark GraphX? A quick code snippet would be
appreciated.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-Cliques-tp24191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I have looked at the UI scheduler tab and it appears my new user was
allocated fewer cores than my other user. Is there any way I can avoid this
happening?
Thanks,
Jem
On Sat, Aug 8, 2015 at 8:32 PM Shushant Arora shushantaror...@gmail.com
wrote:
Which scheduler is your cluster using?
Hi,
I'm using Spark on a Google Compute Engine cluster with the Google Cloud
Storage connector (instead of HDFS, as recommended here
https://cloud.google.com/hadoop/google-cloud-storage-connector#benefits
), and get a lot of rate limit errors, as added below.
The errors relate to temp files
I am using Spark 1.4.1 with connector 1.4.0.
When I post events slowly and they are being picked up one by one, everything
runs smoothly, but when the stream starts delivering batched records there is
no obvious way to separate them.
Am I missing something? How do I separate the records when they are
hi guys,
I have a question about Spark Streaming.
There's an application that keeps sending transaction records into a Spark
stream at about 50k TPS.
Each record represents sales information, including customer id / product id /
time / price columns.
The application is required to monitor the change
What is the difference between loading data using JDBC and loading data
using the Spark data source API?
Or the difference between loading data using mongo-hadoop and loading data
using the native Java driver?
Which way is better?
I don't know if DSE changed spark-submit, but you have to use a
comma-separated list of jars to --jars. It probably looked for HelloWorld
in the second one, the dse.jar file. Do this:
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars /home/missingmerch/
For direct stream questions:
https://github.com/koeninger/kafka-exactly-once
Yes, it is used in production.
For general spark streaming question:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal durgapalmo...@gmail.com
Hi,
Thanks for the quick input; now I am getting a class-not-found error.
*Command:*
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar
///home/missingmerch/dse.jar
You don't connect to Spark exactly. The Spark client (running on your remote
machine) submits jobs to the YARN cluster running on HDP. What you probably
need is yarn-cluster or yarn-client with the YARN client configs as downloaded
from the Ambari actions menu.
Simon
On 10 Aug 2015, at
Hi!
I want to know how I can connect to Hortonworks Spark from another machine.
There is an HDP 2.2 cluster and I want to connect to it remotely via the
Java API.
Do you have any suggestions?
Thanks!
Regards,
--
*Egyed Zsombor *
Junior Big Data Engineer
Mobile: +36 70 320 65 81 |