Each Spark partition will contain messages only from a single Kafka
TopicPartition. Use HasOffsetRanges to tell which Kafka TopicPartition
it's from. See the docs:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
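A minimal Scala sketch of that pattern, assuming a direct stream (here named
stream) created with KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Each OffsetRange describes the Kafka topic/partition backing one Spark partition.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} " +
      s"offsets=[${o.fromOffset}, ${o.untilOffset})")
  }
}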
On Sun, Aug 23, 2015 at 10:56 AM, Spark Enthusiast
If you set the concurrentJobs flag to 2, it lets you run two jobs in
parallel. It will be a bit hard for you to predict the application behavior
with this flag, so debugging would be a headache.
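For example (a sketch; spark.streaming.concurrentJobs is an undocumented
setting, and the app name is made up):

val conf = new org.apache.spark.SparkConf()
  .setAppName("my-streaming-app")
  .set("spark.streaming.concurrentJobs", "2")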
Thanks
Best Regards
On Sun, Aug 23, 2015 at 10:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
wrote:
Hi
Hi Ted,
Thanks for the reply. I tried setting both the key ID and access key via
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
However, the error still occurs for ORC format.
If I change the format to JSON, although the
Ted,
Thanks for the suggestions. Actually I tried both s3n and s3 and the
result remains the same.
On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu yuzhih...@gmail.com wrote:
In your case, I would specify fs.s3.awsAccessKeyId /
fs.s3.awsSecretAccessKey since you use the s3 protocol.
On Sun, Aug 23,
Another addition to the Powered by Spark page:
B2i Healthcare (http://b2i.sg) uses Spark in healthcare analytics with
medical ontologies like SNOMED CT. Our Snow Owl MQ (
http://b2i.sg/snow-owl-mq) product relies on the Spark ecosystem to analyze
~1 billion health records with over 70 healthcare
spark-env.sh works for me in Spark 1.4 but not
spark.executor.extraJavaOptions.
On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I think the only way to pass environment variables to the worker nodes is to
write them in the spark-env.sh file on each worker node.
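As a quick check that a variable exported in spark-env.sh actually reaches
the executors, something like this sketch can be run (MY_VAR is a made-up
name for illustration):

// Runs one task per default partition and reads the variable on the executor side.
val seen = sc.parallelize(1 to sc.defaultParallelism)
  .map(_ => sys.env.getOrElse("MY_VAR", "<unset>"))
  .distinct()
  .collect()
println(seen.mkString(", "))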
Folks,
I use the following Streaming API from KafkaUtils:
public JavaPairInputDStream<String, String> inputDStream() {
    HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
Hi, I am hitting an issue of long GC pauses in my Spark job, and because of it
YARN is killing executors one by one and the Spark job becomes slower and
slower. I came across an article that mentioned using G1GC. I tried to use the
same command but something seems wrong.
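For reference, the usual way to request G1GC for executors is through
spark.executor.extraJavaOptions, along the lines of this sketch (the
GC-logging flags are typical examples, not the article's exact command):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")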
:135] Downloading
'http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz'
to
'/var/mesos/slaves/20150823-110659-1862270986-5050-3230-S1/frameworks/20150823-191138-1862270986-5050-3768-/executors/0/runs/a5a13cd4-013a-4ebc-8ef7-eb9c33358429/spark-1.4.1-bin-hadoop2.6.tgz'
I0823 19:14
In your case, I would specify fs.s3.awsAccessKeyId /
fs.s3.awsSecretAccessKey since you use the s3 protocol.
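In other words, match the property prefix to the URI scheme, e.g.
(placeholders for the actual keys):

sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<access-key>")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<secret-key>")
// fs.s3n.* keys only apply to s3n:// URIs; s3:// URIs read the fs.s3.* keys.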
On Sun, Aug 23, 2015 at 11:03 AM, lostrain A donotlikeworkingh...@gmail.com
wrote:
Hi Ted,
Thanks for the reply. I tried setting both the key ID and access key via
I'm currently having the same issues. The documentation for the Mesos
dispatcher is sparse. I'll also add that I'm able to see the framework running
in the Mesos and Spark driver UIs, but when viewing the Spark job UI on a
slave, no job is seen.
I think the only way to pass environment variables to the worker nodes is to
write them in the spark-env.sh file on each worker node.
On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Check for spark.driver.extraJavaOptions and
spark.executor.extraJavaOptions in the
Hi,
I'm trying to save a simple dataframe to S3 in ORC format. The code is as
follows:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val df=sc.parallelize(1 to 1000).toDF()
df.write.format("orc").save("s3://logs/dummy")
I
You may have seen this:
http://search-hadoop.com/m/q3RTtdSyM52urAyI
On Aug 23, 2015, at 1:01 AM, lostrain A donotlikeworkingh...@gmail.com
wrote:
Hi,
I'm trying to save a simple dataframe to S3 in ORC format. The code is as
follows:
val sqlContext = new
Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
in the following article. I think you can use -D to pass system vars:
spark.apache.org/docs/latest/configuration.html#runtime-environment
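A sketch of that approach (my.prop is a made-up property name): set the
options on the conf, then read the property back with sys.props inside a task:

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dmy.prop=some-value")
  .set("spark.executor.extraJavaOptions", "-Dmy.prop=some-value")
// Inside a task on an executor:
// sys.props.get("my.prop")  // should yield Some("some-value")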
Hi,
I am starting a spark streaming job in standalone mode with spark-submit.
Is
Hi,
I am starting a spark streaming job in standalone mode with spark-submit.
Is there a way to make the UNIX environment variables with which spark-submit
is started available to the processes started on the worker nodes?
Jan
Hi Bcjaes,
Sorry, I didn't see the previous thread, so I'm not sure what issues you are
running into.
In cluster mode the driver logs and results are all available through the Mesos
UI; you need to look at terminated frameworks if it's a job that's already
finished.
I'll try to add more docs as we
Does anybody have any suggestions?
On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:
Is there a workaround without updating Hadoop? I would really appreciate it if
someone could explain what Spark is trying to do here and what an easy way to
turn this off is. Thanks all!
On
SPARK-8458 is in the 1.4.1 release.
You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release.
On Sun, Aug 23, 2015 at 2:05 PM, lostrain A donotlikeworkingh...@gmail.com
wrote:
Hi Zhan,
Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it
looks like this is most likely
If you are using spark-1.4.0, it is probably caused by
SPARK-8458 (https://issues.apache.org/jira/browse/SPARK-8458).
Thanks.
Zhan Zhang
On Aug 23, 2015, at 12:49 PM, lostrain A
donotlikeworkingh...@gmail.com wrote:
Ted,
Thanks for the suggestions. Actually
Hi Zhan,
Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it
looks like this is most likely the reason. I'll verify this again once we
make the upgrade.
Best,
los
On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang zzh...@hortonworks.com wrote:
If you are using spark-1.4.0,
Here's a longer version of that talk that I gave, which goes into more
detail on the internals:
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
On Fri, Aug 21, 2015 at 8:28 AM, Sameer Farooqui same...@databricks.com
wrote:
Have you seen the Spark SQL paper?:
1 minute to discover 1000s of partitions -- yes, that is what I have
observed. And I would assert that is very slow.
On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust mich...@databricks.com
wrote:
We should not be actually scanning all of the data of all of the
partitions, but we do need to at
GraphX is a graph analytics engine rather than a graph database. Its typical
use case is running large-scale graph algorithms like PageRank, connected
components, label propagation and so on. It can be an element of complex
processing pipelines that involve other Spark components such as Data
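For a flavor of the API, a minimal sketch running PageRank on a toy edge list:

import org.apache.spark.graphx.{Edge, Graph}
// Three vertices connected in a line; edge attributes are unused by PageRank.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))
val graph = Graph.fromEdges(edges, defaultValue = 1.0)
val ranks = graph.pageRank(tol = 0.001).vertices
ranks.collect().foreach(println)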
We should not be actually scanning all of the data of all of the
partitions, but we do need to at least list all of the available
directories so that we can apply your predicates to the actual values that
are present when we are deciding which files need to be read in a given
spark job. While
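Concretely (a hypothetical layout and column name for illustration): with a
table partitioned by a date column, a filter on that column can only be turned
into directory pruning after the available directories have been listed:

import sqlContext.implicits._
// Data laid out as e.g. /logs/date=2015-08-22/..., /logs/date=2015-08-23/...
val logs = sqlContext.read.parquet("/logs")
// Listing must discover the date=... directories first; the predicate then
// prunes which directories' files are actually read.
logs.filter($"date" === "2015-08-23").count()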
Hello,
I am new to Spark and just running some tests to get familiar with the APIs.
When calling the rollup function on my DataFrame, I get different results
when I alias the columns I am grouping on (see below for an example data set).
I was expecting the alias function to only affect the column name.
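A sketch of the kind of comparison meant (column names k and v are made up
for illustration):

import sqlContext.implicits._
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("k", "v")
// Grouping on the raw column vs. the aliased column; only the output column
// name would be expected to change.
data.rollup($"k").count().show()
data.rollup($"k".as("key")).count().show()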