Re: How to parse multiple event types using Kafka

2015-08-23 Thread Cody Koeninger
Each Spark partition will contain messages only from a single Kafka topic-partition. Use HasOffsetRanges to tell which Kafka topic-partition it's from. See the docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Sun, Aug 23, 2015 at 10:56 AM, Spark Enthusiast
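
A minimal Scala sketch of the pattern Cody describes, assuming the Spark 1.4-era direct stream API; the app name, broker address, and topic names are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val conf = new SparkConf().setAppName("multi-event-demo") // hypothetical app name
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics = Set("eventsA", "eventsB")                          // placeholder topics

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // One OffsetRange per Kafka topic-partition, index-aligned with the RDD's partitions.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val range = offsetRanges(TaskContext.get.partitionId)
        // range.topic identifies the source topic, so the matching parser
        // for that event type can be applied to `iter` here.
      }
    }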

Re: Spark streaming multi-tasking during I/O

2015-08-23 Thread Akhil Das
If you set the concurrentJobs flag to 2, it lets you run two jobs in parallel. It will be a bit hard for you to predict the application behavior with this flag, thus debugging would be a headache. Thanks Best Regards On Sun, Aug 23, 2015 at 10:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi
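
The flag Akhil refers to is the undocumented spark.streaming.concurrentJobs setting; a minimal sketch of enabling it (the app name is a placeholder), with his caveat about unpredictable behavior still applying:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-jobs-demo")         // hypothetical app name
      .set("spark.streaming.concurrentJobs", "2") // let two jobs run in parallel
    val ssc = new StreamingContext(conf, Seconds(10))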

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Ted, Thanks for the reply. I tried setting both the key ID and access key via sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***") and sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**"). However, the error still occurs for ORC format. If I change the format to JSON, although the

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Ted, Thanks for the suggestions. Actually I tried both s3n and s3 and the result remains the same. On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu yuzhih...@gmail.com wrote: In your case, I would specify fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey since you use s3 protocol. On Sun, Aug 23,

B2i Healthcare Powered by Spark addition

2015-08-23 Thread Brandon Ulrich
Another addition to the Powered by Spark page: B2i Healthcare (http://b2i.sg) uses Spark in healthcare analytics with medical ontologies like SNOMED CT. Our Snow Owl MQ (http://b2i.sg/snow-owl-mq) product relies on the Spark ecosystem to analyze ~1 billion health records with over 70 healthcare

Re: How to set environment of worker applications

2015-08-23 Thread Sathish Kumaran Vairavelu
spark-env.sh works for me in Spark 1.4 but not spark.executor.extraJavaOptions. On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I think the only way to pass on environment variables to worker node is to write it in spark-env.sh file on each worker node.

How to parse multiple event types using Kafka

2015-08-23 Thread Spark Enthusiast
Folks, I use the following Streaming API from KafkaUtils: public JavaPairInputDStream<String, String> inputDStream() { HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(","))); HashMap<String, String> kafkaParams = new HashMap<String, String>();

Spark YARN executors are not launching when using +UseG1GC

2015-08-23 Thread unk1102
Hi, I am hitting an issue of long GC pauses in my Spark job, and because of it YARN is killing executors one by one and the Spark job becomes slower and slower. I came across an article where they mentioned using G1GC. I tried to use the same command but something seems wrong
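
One way to pass G1GC flags to the executors is through spark.executor.extraJavaOptions; a sketch only, with the pause-time target illustrative rather than a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Applied to every executor JVM launched on YARN; the 200 ms pause
      // target is an illustrative value, not a tuning recommendation.
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")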

is there a 'knack' to docker and mesos?

2015-08-23 Thread Dick Davies
:135] Downloading 'http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz' to '/var/mesos/slaves/20150823-110659-1862270986-5050-3230-S1/frameworks/20150823-191138-1862270986-5050-3768-/executors/0/runs/a5a13cd4-013a-4ebc-8ef7-eb9c33358429/spark-1.4.1-bin-hadoop2.6.tgz' I0823 19:14

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
In your case, I would specify fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey since you use the s3 protocol. On Sun, Aug 23, 2015 at 11:03 AM, lostrain A donotlikeworkingh...@gmail.com wrote: Hi Ted, Thanks for the reply. I tried setting both the key ID and access key via
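
Ted's suggestion expressed as code, with the key values masked as in the rest of the thread:

    // Matches the s3:// scheme; for s3n:// URIs the fs.s3n.* keys apply instead.
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "***")
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "***")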

Re: Spark Mesos Dispatcher

2015-08-23 Thread bcajes
I'm currently having the same issues. The documentation for the Mesos dispatcher is sparse. I'll also add that I'm able to see the framework running in the Mesos and Spark driver UIs, but when viewing the Spark job UI on a slave, no job is seen.

Re: How to set environment of worker applications

2015-08-23 Thread Raghavendra Pandey
I think the only way to pass environment variables to worker nodes is to write them in the spark-env.sh file on each worker node. On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the

Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi, I'm trying to save a simple dataframe to S3 in ORC format. The code is as follows: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 1000).toDF() df.write.format("orc").save("s3://logs/dummy") I

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
You may have seen this: http://search-hadoop.com/m/q3RTtdSyM52urAyI On Aug 23, 2015, at 1:01 AM, lostrain A donotlikeworkingh...@gmail.com wrote: Hi, I'm trying to save a simple dataframe to S3 in ORC format. The code is as follows: val sqlContext = new

Re: How to set environment of worker applications

2015-08-23 Thread Hemant Bhanawat
Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest/configuration.html#runtime-environment Hi, I am starting a spark streaming job in standalone mode with spark-submit. Is
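
A sketch of Hemant's -D approach; MY_SETTING is a hypothetical system property the job would read back with sys.props on the executors:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-DMY_SETTING=foo")
    // In client mode the driver JVM is already running, so the driver-side
    // equivalent is normally passed on the spark-submit command line:
    //   --conf spark.driver.extraJavaOptions=-DMY_SETTING=foo
    // Inside the job: sys.props.get("MY_SETTING")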

How to set environment of worker applications

2015-08-23 Thread Jan Algermissen
Hi, I am starting a Spark Streaming job in standalone mode with spark-submit. Is there a way to make the UNIX environment variables with which spark-submit is started available to the processes started on the worker nodes? Jan

Re: Spark Mesos Dispatcher

2015-08-23 Thread Timothy Chen
Hi Bcjaes, Sorry, I didn't see the previous thread so I'm not sure what issues you are running into. In cluster mode the driver logs and results are all available through the Mesos UI; you need to look at terminated frameworks if it's a job that's already finished. I'll try to add more docs as we

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
Does anybody have any suggestions? On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Is there a workaround without updating Hadoop? Would really appreciate it if someone can explain what Spark is trying to do here and what is an easy way to turn this off. Thanks all! On

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
SPARK-8458 is in the 1.4.1 release. You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release. On Sun, Aug 23, 2015 at 2:05 PM, lostrain A donotlikeworkingh...@gmail.com wrote: Hi Zhan, Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it looks like this is most likely

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Zhan Zhang
If you are using spark-1.4.0, it is probably caused by SPARK-8458 (https://issues.apache.org/jira/browse/SPARK-8458). Thanks. Zhan Zhang On Aug 23, 2015, at 12:49 PM, lostrain A donotlikeworkingh...@gmail.com wrote: Ted, Thanks for the suggestions. Actually

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Zhan, Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it looks like this is most likely the reason. I'll verify this again once we make the upgrade. Best, los On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang zzh...@hortonworks.com wrote: If you are using spark-1.4.0,

Re: SparkSQL concerning materials

2015-08-23 Thread Michael Armbrust
Here's a longer version of that talk that I gave, which goes into more detail on the internals: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune On Fri, Aug 21, 2015 at 8:28 AM, Sameer Farooqui same...@databricks.com wrote: Have you seen the Spark SQL paper?:

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Philip Weaver
1 minute to discover 1000s of partitions -- yes, that is what I have observed. And I would assert that is very slow. On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust mich...@databricks.com wrote: We should not be actually scanning all of the data of all of the partitions, but we do need to at

Re: Spark GraphaX

2015-08-23 Thread Robineast
GraphX is a graph analytics engine rather than a graph database. Its typical use case is running large-scale graph algorithms like PageRank, connected components, label propagation and so on. It can be an element of complex processing pipelines that involve other Spark components such as Data
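
A minimal sketch of the analytics use case described here, assuming an edge-list file at a placeholder path:

    import org.apache.spark.graphx.GraphLoader

    // Placeholder path; the file holds one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    val ranks = graph.pageRank(0.0001).vertices // run PageRank to a 0.0001 tolerance
    ranks.take(10).foreach(println)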

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Michael Armbrust
We should not be actually scanning all of the data of all of the partitions, but we do need to at least list all of the available directories so that we can apply your predicates to the actual values that are present when we are deciding which files need to be read in a given Spark job. While
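
For illustration, a query against a hypothetical table partitioned by a date column; the predicate prunes which files are read, but Spark still lists all partition directories first to know which date values exist:

    // Placeholder path to a table partitioned by `date`.
    val events = sqlContext.read.parquet("s3://bucket/events")
    events.filter(events("date") === "2015-08-23").count()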

DataFrame rollup with alias?

2015-08-23 Thread Isabelle Phan
Hello, I am new to Spark and just running some tests to get familiar with the APIs. When calling the rollup function on my DataFrame, I get different results when I alias the columns I am grouping on (see below for an example data set). I was expecting the alias function to only affect the column name.
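
A toy reconstruction of the comparison described above (column names hypothetical); the puzzle is why the aliased form below returns different rollup results:

    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")

    df.rollup(df("key")).count().show()         // rollup on the raw column
    df.rollup(df("key").as("k")).count().show() // rollup on the aliased column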