In Spark this is no different from any other program. However, a web service
and JSON are probably not very suitable for large data volumes.
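If the volume is modest, here is a minimal sketch (the URL and bucket are
placeholders, assuming Spark 1.x with a SQLContext) of fetching the response
on the driver and handing it to Spark:

import scala.io.Source
val json = Source.fromURL("https://example.com/api/data").mkString   // fetch on the driver
val df = sqlContext.read.json(sc.parallelize(Seq(json)))             // parse the JSON response
df.write.parquet("s3a://my-bucket/output/")                          // persist, e.g. back to S3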
> On 03 May 2016, at 04:45, KhajaAsmath Mohammed
> wrote:
>
> Hi,
>
> I am working on a project to pull data from sprinklr
Hello,
Spark is a general framework for distributed in-memory processing. You can
always write a highly specialized piece of code that is faster than Spark, but
then it can do only one thing, and if you need something else you will have to
rewrite everything from scratch. This is why Spark is
Hello,
I am trying to find some performance figures for Spark vs. various other
languages for an ALS-based recommender system. I am using the 20 million
ratings MovieLens dataset. The test environment involves one big 30-core
machine with 132 GB of memory. I am using the Scala version of the script provided
print() isn't really the best way to benchmark things, since it calls
take(10) under the covers, but 380 records / second for a single
receiver doesn't sound right in any case.
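For a rough throughput number, a sketch (stream is a placeholder for the
input DStream) that counts every record in each batch instead of printing a
sample:

var total = 0L
stream.foreachRDD { rdd =>
  val n = rdd.count()          // forces every record in the batch to be read
  total += n                   // the foreachRDD body runs on the driver, so a local var is fine
  println(s"batch: $n records, running total: $total")
}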
Am I understanding correctly that you're trying to process a large
number of already-existing kafka messages, not keep
Hi,
it's 4:40 in the morning here, so I might not be getting things right.
But there is a very high chance of getting spurious results if you have
created that variable more than once in the IPython or pyspark shell,
cached it, and are reusing it. Please close the sessions and create the
I use this command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package -DskipTests -X
and get this failure message:
[INFO]
[INFO] BUILD FAILURE
[INFO]
I rebuilt 1.6.1 locally:
[INFO] Spark Project External Kafka ... SUCCESS [ 30.868 s]
[INFO] Spark Project Examples . SUCCESS [02:29 min]
[INFO] Spark Project External Kafka Assembly .. SUCCESS [ 9.644 s]
[INFO]
Hi,
This is not a continuation of a previous query, and I am now building with a
connection to the internet, without a proxy as before.
After disabling Zinc, I get this error message:
[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on
project
Hi,
I am working on a project to pull data from Sprinklr every 15 minutes
and process it in Spark. After processing it, I need to save the result back
to an S3 bucket.
Is there a way that I can talk to the web service in Spark directly and parse
the JSON response?
Thanks,
Asmath
Looks like this is a continuation of your previous query.
If that is the case, please use the original thread so that people have
more context.
Have you tried disabling the Zinc server?
What version of Java / Maven are you using?
Are you behind a proxy?
Finally the 1.6.1 artifacts are
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 14.765 s
[INFO] Finished at: 2016-05-03T10:08:46+08:00
[INFO] Final Memory: 35M/191M
[INFO]
You might have a settings.xml that is forcing your internal Maven
repository to be the mirror of external repositories and thus not finding
the dependency.
On Mon, May 2, 2016 at 6:11 PM, Hien Luu wrote:
> No, I am not. I am considering downloading it manually and placing it
No, I am not. I am considering downloading it manually and placing it in my
local repository.
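If you do go the manual route, the usual sketch (the jar path is a
placeholder) is:

mvn install:install-file -DgroupId=com.oracle -DartifactId=ojdbc6 \
  -Dversion=11.2.0.1.0 -Dpackaging=jar -Dfile=/path/to/ojdbc6.jar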
On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:
> Oracle JDBC is not part of the Maven repository; are you keeping a downloaded
> file in your local repo?
>
>
From the output of dependency:tree of the master branch:
[INFO]
[INFO] Building Spark Project Docker Integration Tests 2.0.0-SNAPSHOT
[INFO]
[WARNING] The
Oracle JDBC is not part of the Maven repository; are you keeping a downloaded
file in your local repo?
Best, RS
On May 2, 2016 8:51 PM, "Hien Luu" wrote:
> Hi all,
>
> I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
> It kept getting "Operation timed
Hi All,
I am using Eclipse with Maven for developing Spark applications. I got an
error reading from S3 in Scala, but it works fine in Java when I run
them in the same project in Eclipse. The Scala/Java code and the error are
below:
Scala
val uri = URI.create("s3a://" + key + ":" + seckey
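As a side note, a hedged sketch (variable names are placeholders, assuming
hadoop-aws and the AWS SDK are on the classpath) of supplying the keys through
the Hadoop configuration instead of embedding them in the URI:

sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)   // accessKey is a placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)   // secretKey is a placeholder
val lines = sc.textFile("s3a://my-bucket/path/to/data")      // bucket and path are placeholders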
Hi all,
I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
It kept getting "Operation timed out" while building Spark Project Docker
Integration Tests module (see the error below).
Has anyone run into this problem before? If so, how did you work around it?
[INFO]
The workers and executors run as separate JVM processes in standalone mode.
Whether to use multiple workers on a single machine depends on how you will be
using the cluster. If you run multiple Spark applications simultaneously, each
application gets its own executors. So, for example, if
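If you do want several smaller worker JVMs per machine instead of one huge
one, a hedged spark-env.sh sketch (the values are only illustrative for a
192 GB / 30-core box):

# conf/spark-env.sh on each node
SPARK_WORKER_INSTANCES=4      # four worker JVMs per machine
SPARK_WORKER_CORES=7          # cores each worker can hand to executors
SPARK_WORKER_MEMORY=45g       # memory each worker can hand to executors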
Hello again. I searched for "backport kafka" in the list archives but
couldn't find anything but a post from Spark 0.7.2. I was going to
use accumulators to make a counter, but then saw on the Streaming tab
the Receiver Statistics. Then I removed all other "functionality"
except:
See http://spark.apache.org/docs/latest/running-on-yarn.html,
especially the parts that talk about
spark.yarn.historyServer.address.
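A minimal spark-defaults.conf sketch (host name and log directory are
placeholders):

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.history.fs.logDirectory     hdfs:///spark-history
spark.yarn.historyServer.address  historyhost:18080

With that in place, the "History" link in the YARN ResourceManager UI should
point at the Spark history server.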
On Mon, May 2, 2016 at 2:14 PM, satish saley wrote:
>
>
> Hello,
>
> I am running pyspark job using yarn-cluster mode. I can see spark job
Hello,
I am running a pyspark job in yarn-cluster mode. I can see the Spark job in
YARN, but I am not able to go from the "log history" link in YARN to the Spark
history server. How would I keep track of a YARN log and its corresponding
log in the Spark history server? Is there any setting in YARN/Spark that lets
Yong,
Sorry, let me explain my deduction; it is going to be difficult to get sample
data out since the dataset I am using is proprietary.
From the above set of queries (the ones mentioned in the comments above), both
inner and outer joins are producing the same counts. They are basically pulling
out selected
Hi Bill,
You can try the direct stream and increase the number of partitions on the
Kafka topic; the input DStream will then pick up the partitions from the Kafka
topic without any repartitioning.
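A minimal direct-stream sketch (broker list and topic name are placeholders,
ssc is the StreamingContext), where the topic's partition count directly sets
the DStream's parallelism:

import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("mytopic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)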
Can you please share your event timeline chart from the Spark UI? You need to
tune your configuration to match the computation. The Spark UI
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at
We are still not sure what the problem is if you cannot show us some
example data.
For dps with 42632 rows and swig with 42034 rows: if dps full outer join with
swig on 3 columns, with additional filters, gets the same result-set row count
as dps left outer join with swig on 3 columns,
Gourav,
I wish that were the case, but I have done a select count on each of the two
tables individually and they return different numbers of rows:
dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")
dps.count()
RESULT: 42632
swig.count()
RESULT: 42034
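One quick sanity check (a sketch, assuming date, account and ad are the join
keys on both sides, as in the query above) is to compare the key sets
directly:

val dpsKeys  = dps.select("date", "account", "ad")
val swigKeys = swig.select("date", "account", "ad")
println(dpsKeys.except(swigKeys).count())    // dps keys with no match in swig
println(swigKeys.except(dpsKeys).count())    // swig keys with no match in dps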
I have the following simple example that I can't get to work correctly.
In [1]:
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType,
StringType
from pyspark.sql.functions import asc, desc, sum, count
sqlContext = SQLContext(sc)
error_schema
This shows that both the tables have matching records and no mismatches.
Therefore obviously you have the same results irrespective of whether you
use right or left join.
I think that there is no problem here, unless I am missing something.
Regards,
Gourav
On Mon, May 2, 2016 at 7:48 PM, kpeng1
Also, the results of the inner query produced the same results:
sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS
d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d ON (s.date
Adding back user@spark.
Since the top of the stack trace is in Datastax class(es), I suggest asking on
their mailing list.
On Mon, May 2, 2016 at 11:29 AM, Piyush Verma
wrote:
> Hmm weird. They show up on the Web interface.
>
> Wait, got it. It's wrapped up inside the < raw
Hi Kevin,
Thanks.
Please post the result of the same query with INNER JOIN and then it will
give us a bit of insight.
Regards,
Gourav
On Mon, May 2, 2016 at 7:10 PM, Kevin Peng wrote:
> Gourav,
>
> Apologies. I edited my post with this information:
> Spark version: 1.6
>
Maybe you were trying to embed pictures for the error and your code - but
they didn't go through.
On Mon, May 2, 2016 at 10:32 AM, meson10 wrote:
> Hi,
>
> I am trying to save a RDD to Cassandra but I am running into the following
> error:
>
>
>
> The Python code looks
Jorn,
what aspects are you speaking about?
My response was absolutely pertinent to Jinan because he would not even face
the problem if he used Scala. So it was along the lines of teaching a person
to fish rather than giving him a fish.
And by the way your blinkered and biased response missed
Hi Cody,
I'm going to use an accumulator right now to get an idea of the
throughput. Thanks for mentioning the back-ported module. Also, it
looks like I missed this section:
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch
from the
Hi David,
My current concern is that I'm using a spark hbase bulk put driver
written for Spark 1.2 on the version of CDH my spark / yarn cluster is
running on. Even if I were to run on another Spark cluster, I'm
concerned that I might have issues making the put requests into hbase.
However I
You are oversimplifying here, and some of your statements are not correct. There
are also other aspects to consider. Finally, it would be better to support him
with the problem, because Spark supports Java. Java and Scala run on the same
underlying JVM.
> On 02 May 2016, at 17:42, Gourav
Gourav,
Apologies. I edited my post with this information:
Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014
Thanks,
KP
On Mon,
Have you tested for read throughput (without writing to hbase, just
deserialize)?
Are you limited to using spark 1.2, or is upgrading possible? The
kafka direct stream is available starting with 1.3. If you're stuck
on 1.2, I believe there have been some attempts to backport it, search
the
Hi,
I have worked with 300 GB of data by reading it from CSV (using spark-csv),
writing it to Parquet format, then querying the Parquet data to partition it
and write out individual CSV files, without any issues on a single-node Spark
installation.
Are you trying to
Hi,
As always, can you please share details about your Spark cluster -
the version, OS, IDE used, etc.?
Regards,
Gourav Sengupta
On Mon, May 2, 2016 at 5:58 PM, kpeng1 wrote:
> Hi All,
>
> I am running into a weird result with Spark SQL Outer joins. The results
>
I've written an application to get content from a Kafka topic with 1.7
billion entries, deserialize the protobuf entries, and insert them into
HBase. Currently the environment that I'm running in is Spark 1.2.
With 8 executors and 2 cores, and 2 jobs, I'm only getting between
0-2500 writes /
Hi,
I am trying to save an RDD to Cassandra but I am running into the following
error:
The Python code looks like this:
I am using DSE 4.8.6 which runs Spark 1.4.2
I ran through a bunch of existing posts on this mailing list and have
already performed the following routines:
* Ensure that
Hi All,
I am running into a weird result with Spark SQL Outer joins. The results
for all of them seem to be the same, which does not make sense due to the
data. Here are the queries that I am running with the results:
sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS
If you're confused about the type of an argument, you're probably
better off looking at documentation that includes static types:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
createDirectStream's fromOffsets parameter takes a map from
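In the Scala API that is a Map[TopicAndPartition, Long]; a minimal sketch
(topic name, broker list and offset are illustrative, ssc is the
StreamingContext):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 1000L)   // partition 0, start at offset 1000
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (m: MessageAndMetadata[String, String]) => (m.key(), m.message()))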
Thanks Ted, I thought the avg. block size was already low and less than the
usual 128mb. If I need to reduce it further via parquet.block.size, it
would mean an increase in the number of blocks and that should increase the
number of tasks/executors. Is that the correct way to interpret this?
On
I am still a little bit confused about workers, executors and JVMs in
standalone mode.
Are worker processes and executors independent JVMs or do executors run
within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying
massive JVMs due to well known performance
Hi,
I just found out that we can have lots of empty input partitions when
reading from parquet files.
Sample code as following:
val hconf = sc.hadoopConfiguration
val job = new Job(hconf)
FileInputFormat.setInputPaths(job, new Path("path_to_data"))
Java does not parallelize easily, Java is verbose, it uses different classes
for serialization, and on top of that you are using RDDs instead of
DataFrames.
Should a senior project not have an implied understanding that it should be
technically superior?
Why not use Scala?
Regards,
Gourav
On Mon,
Because I am doing this for my senior project using Java. I tried an s3a
URI like this: s3a://accessId:secret@bucket/path
It shows me an error: Exception in thread "main" java.lang.NoSuchMethodError:
Please consider decreasing block size.
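If the dataset can be rewritten, a sketch of shrinking the Parquet row-group
(block) size before writing it out again (df is a placeholder for the
DataFrame, and the value is illustrative):

sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)   // 32 MB row groups
df.write.parquet("path_to_output")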
Thanks
> On May 1, 2016, at 9:19 PM, Buntu Dev wrote:
>
> I got a 10g limitation on the executors and operating on parquet dataset with
> block size 70M with 200 blocks. I keep hitting the memory limits when doing a
> 'select * from
Hi,
I tried to monitor Spark applications through the Spark APIs. I can submit a new
application/driver with the REST API (POST
http://spark-cluster-ip:6066/v1/submissions/create ...). The API returns the
driver's id (submissionId). I can check the driver's status and kill it with
the same API.
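For monitoring the applications themselves (as opposed to the driver
submissions), there is also the REST API exposed by the UI and the history
server; a sketch, with host names as placeholders:

curl http://driver-host:4040/api/v1/applications                  # running application, via the driver UI
curl http://history-host:18080/api/v1/applications                # completed applications, via the history server
curl http://driver-host:4040/api/v1/applications/<app-id>/jobs    # per-job status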
I could solve the issue, but the solution is very weird.
I ran this command: cat old_script.py > new_script.py, and then submitted the
job using the new script.
This is the second time I have faced such an issue with a Python script and I
have no explanation for what happened.
I hope this trick helps someone.
Hi,
I'm trying to start consuming messages from a kafka topic (via direct
stream) from a given offset.
The documentation of createDirectStream says:
:param fromOffsets: Per-topic/partition Kafka offsets defining the
(inclusive) starting
point of the stream.
However it expects a dictionary
How many executors are you running? Does your partitioning scheme ensure data
is distributed evenly? It is possible that your data is skewed and one of
the executors is failing. Maybe you can try reducing per-executor memory and
increasing the number of partitions.
On 2 May 2016 14:19, "Buntu Dev"
Hi,
I agree with Steve, just start by using vanilla Spark on EMR.
You can look at point #4 here for dynamic allocation of executors:
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin
.
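A minimal spark-defaults.conf sketch for dynamic allocation (the bounds are
illustrative):

spark.dynamicAllocation.enabled       true
spark.shuffle.service.enabled         true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20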
Note that dynamic allocation of
Hi folks,
I am suddenly seeing:
Error:scalac: bad symbolic reference. A signature in Logging.class refers
to type Logger
in package org.slf4j which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version
Can you try creating your own schema and using it to read the XML?
I had a similar issue but resolved it with a custom schema, specifying
each attribute in it.
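A sketch of that approach with spark-xml (the row tag, field names and path
are placeholders):

import org.apache.spark.sql.types._
val customSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("amount", IntegerType, nullable = true)))
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .schema(customSchema)        // skip schema inference and use the declared fields
  .load("path_to_xml")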
Pradeep
> On May 1, 2016, at 9:45 AM, Hyukjin Kwon wrote:
>
> To be more clear,
>
> If
Hi, after stopping the Zinc server, I got this error message:
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]