question about Union in pyspark and preserving partitioners

2016-05-16 Thread Cameron Davidson-Pilon
I'm looking into how to do more efficient jobs by using partition strategies, but I'm hitting a blocker after I do a `union` between two RDDs. Suppose A and B are both RDDs with the same partitioner, that is, `A.partitioner == B.partitioner`. If I do `A.union(B)`, the resulting RDD has None
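A minimal PySpark sketch of one workaround, assuming A and B are key-partitioned pair RDDs: re-apply partitionBy after the union if the partitioner comes back as None. This costs one shuffle, but restores co-partitioning for downstream joins and reduceByKey calls.

```python
# Sketch: restore a partitioner after union() in PySpark (illustrative names).
from pyspark import SparkContext

sc = SparkContext(appName="union-partitioner-sketch")

num_parts = 4
A = sc.parallelize([(i, i) for i in range(100)]).partitionBy(num_parts)
B = sc.parallelize([(i, -i) for i in range(100)]).partitionBy(num_parts)

unioned = A.union(B)
if unioned.partitioner is None:
    # The Python-side union may drop the partitioner; re-partition with the
    # same scheme so later co-partitioned operations avoid extra shuffles.
    unioned = unioned.partitionBy(num_parts)

print(unioned.partitioner is not None)
```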

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Mail.com
Hi Muthu, Are you on Spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for simple string messages. Console producer and consumer work fine, but Spark always returns an empty RDD. I am using the receiver-based approach. Thanks, Pradeep > On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman >

Re: Will spark swap memory out to disk if the memory is not enough?

2016-05-16 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtRbEiIXuOOS=Re+PySpark+issue+with+sortByKey+IndexError+list+index+out+of+range+ which led to SPARK-4384 On Mon, May 16, 2016 at 8:09 PM, kramer2...@126.com wrote: > I know the cache operation can cache data in

Will spark swap memory out to disk if the memory is not enough?

2016-05-16 Thread kramer2...@126.com
I know the cache operation can cache data in memory/disk... But I would like to know whether other operations do the same. For example, I created a dataframe called df. The df is big, so when I run some action like df.sort(column_name).show() or df.collect(), it will throw an error like:
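As a hedged illustration (names and sizes below are made up, and this uses the 2.x SparkSession entry point): persisting with a disk-backed storage level lets Spark evict cached partitions to disk instead of dropping them, while collect() still materializes everything on the driver and can fail regardless.

```python
# Sketch: let a cached DataFrame spill to disk rather than rely on memory alone.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spill-sketch").getOrCreate()
df = spark.range(10000000).withColumnRenamed("id", "value")

df.persist(StorageLevel.MEMORY_AND_DISK)   # evicted partitions go to disk, not away
df.sort("value").show(5)

# collect() pulls every row to the driver, so it can still blow up there;
# prefer a bounded take() while debugging.
rows = df.take(100)
```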

What is wrong when adding one column to the dataframe

2016-05-16 Thread Zhiliang Zhu
Hi All, For a given DataFrame created by Hive SQL, it is required to add one more column based on an existing column, while keeping the previous columns in the resulting DataFrame. final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0; //DAYS_30 seems difficult to call
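A small PySpark sketch of the usual pattern (the column names here are invented; only the DAYS_30 constant comes from the thread): withColumn keeps every existing column and appends the derived one.

```python
# Sketch: add a derived column while preserving the existing columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("add-column-sketch").getOrCreate()

DAYS_30 = 1000 * 60 * 60 * 24 * 30.0   # 30 days in milliseconds, as in the thread

df = spark.createDataFrame(
    [(1, 1463400000000.0), (2, 1460800000000.0)], ["id", "event_ts"])

# 'event_ts' and 'windows_of_30_days' are illustrative names.
df2 = df.withColumn("windows_of_30_days", F.col("event_ts") / F.lit(DAYS_30))
df2.show()
```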

Re: Why can't the Spark 1.6.1 master monitor and start an auto-stopped worker?

2016-05-16 Thread sunday2000
I use $SPARK_HOME/sbin/start-slave.sh client10 and get this error: org.apache.spark.deploy.worker.Worker running as process 5490. Stop it first. -- Original message -- From: "Ted Yu"; Date: 2016-05-17 10:22; To:

Re: Why can't the Spark 1.6.1 master monitor and start an auto-stopped worker?

2016-05-16 Thread Ted Yu
I guess 2.0 would be released before Spark Summit. On Mon, May 16, 2016 at 7:19 PM, sunday2000 <2314476...@qq.com> wrote: > Hi, > I found the bug status is: Solved. When will the fixed > version be released? > > > -- Original message -- > *From:* "Ted

Re: Why can't the Spark 1.6.1 master monitor and start an auto-stopped worker?

2016-05-16 Thread sunday2000
Hi, I found the bug status is: Solved. When will the fixed version be released? -- Original message -- From: "Ted Yu"; Date: 2016-05-17 9:58; To: "sunday2000"<2314476...@qq.com>;

Re: Why can't the Spark 1.6.1 master monitor and start an auto-stopped worker?

2016-05-16 Thread Ted Yu
Please take a look at this JIRA: [SPARK-13604][CORE] Sync worker's state after registering with master On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote: > Hi, > > A client worker stopped and has this error message; do you know why this > happens? > > 16/05/17 03:42:20 INFO

Why can't the Spark 1.6.1 master monitor and start an auto-stopped worker?

2016-05-16 Thread sunday2000
Hi, A client worker stopped and has this error message; do you know why this happens? 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already. 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077

Re: How to get the batch information from Streaming UI

2016-05-16 Thread John Trengrove
You would want to add a listener to your Spark Streaming context. Have a look at the StatsReportListener [1]. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StatsReportListener 2016-05-17 7:18 GMT+10:00 Samuel Zhou : > Hi, >

Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you want to share RDDs, it might be a good idea to check out Tachyon / Alluxio. For the Thrift server, I believe the datasets are located in your Spark cluster as RDDs and you just communicate with it via the Thrift JDBC Distributed Query Engine connector. 2016-05-17 5:12 GMT+10:00

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Ramaswamy, Muthuraman
Yes, I can see the messages. Also, I wrote a quick custom decoder for avro and it works fine for the following: >> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": >> brokers}, valueDecoder=decoder) But, when I use the Confluent Serializers to leverage the Schema
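For reference, a rough sketch of what a registry-aware valueDecoder could look like (an assumption about the setup, not the poster's code): Confluent-serialized values start with a magic byte and a 4-byte schema id before the Avro body, so a plain Avro decoder fails unless that header is stripped. The registry URL, helper names, and library calls below are illustrative.

```python
# Hedged sketch of a Confluent-aware Avro decoder for
# KafkaUtils.createDirectStream(..., valueDecoder=decode_confluent_avro).
# Assumes the 'avro' and 'requests' packages and a reachable schema registry.
import io
import struct
import avro.io
import avro.schema
import requests

SCHEMA_REGISTRY_URL = "http://localhost:8081"   # placeholder
_schema_cache = {}

def _fetch_schema(schema_id):
    if schema_id not in _schema_cache:
        resp = requests.get("%s/schemas/ids/%d" % (SCHEMA_REGISTRY_URL, schema_id))
        _schema_cache[schema_id] = avro.schema.parse(resp.json()["schema"])
    return _schema_cache[schema_id]

def decode_confluent_avro(payload):
    if payload is None:
        return None
    # Confluent wire format: 1 magic byte (0) + 4-byte big-endian schema id + Avro body.
    _magic, schema_id = struct.unpack(">bI", payload[:5])
    schema = _fetch_schema(schema_id)
    decoder = avro.io.BinaryDecoder(io.BytesIO(payload[5:]))
    return avro.io.DatumReader(schema).read(decoder)
```

The function would then be passed as valueDecoder= in the same createDirectStream call quoted above.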

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
Good news - and Java 8 as well. I saw Matei after his talk at Scala Days and he said he would look into a 2.11 default, but it seems that is already the plan. Scala 2.12 is getting closer as well. On Mon, May 16, 2016 at 2:55 PM, Ted Yu wrote: > For 2.0, I believe that is

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Ted Yu
For 2.0, I believe that is the case. Jenkins jobs have been running against Scala 2.11: [INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ java8-tests_2.11 --- FYI On Mon, May 16, 2016 at 2:45 PM, Eric Richardson wrote: > On Thu, May 12,

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
On Thu, May 12, 2016 at 9:23 PM, Luciano Resende wrote: > Spark has moved to build using Scala 2.11 by default in master/trunk. > Does this mean that the pre-built binaries for download will also move to 2.11 as well? > > > As for the 2.0.0-SNAPSHOT, it is actually the

How to get the batch information from Streaming UI

2016-05-16 Thread Samuel Zhou
Hi, Does anyone know how to get the batch information (like batch time, input size, processing time, status) from the Streaming UI by using the Scala/Java API? Because I want to put the information in log files and the streaming jobs are managed by YARN. Thanks, Samuel

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Could you please consider a short answer regarding the Apache Beam Capability Matrix todo’s for future Spark 2.0 release [4]? (some related references below [5][6]) Thanks [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what

Re: Monitoring Spark application progress

2016-05-16 Thread Василец Дмитрий
Hello, use Google Translate and https://mkdev.me/posts/ci-i-monitoring-spark-prilozheniy On Mon, May 16, 2016 at 6:13 PM, Ashok Kumar wrote: > Hi, > > I would like to know the approach and tools please to get the full > performance for a Spark app running through

Re: broadcast variable not picked up

2016-05-16 Thread Davies Liu
broadcast_var is only defined in foo(), I think you should have `global` for it. def foo(): global broadcast_var broadcast_var = sc.broadcast(var) On Fri, May 13, 2016 at 3:53 PM, abi wrote: > def kernel(arg): > input = broadcast_var.value + 1 > #some
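A self-contained sketch of the pattern Davies describes (function names follow the thread; everything else is illustrative): declare the handle global inside the function that creates the broadcast, so closures shipped to executors can resolve it.

```python
# Sketch: make the broadcast handle visible to executor-side closures.
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")
broadcast_var = None   # module-level handle

def foo(var):
    global broadcast_var         # without this, the assignment stays local to foo()
    broadcast_var = sc.broadcast(var)

def kernel(arg):
    # Runs on the executors and reads the module-level broadcast handle.
    return broadcast_var.value + arg

foo(10)
print(sc.parallelize([1, 2, 3]).map(kernel).collect())   # [11, 12, 13]
```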

Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case... we were considering using the Thrift server as a way to allow multiple clients to access shared RDDs. Within the Thrift context, we create an RDD and expose it as a Hive table. The question is… where does the RDD exist? On the Thrift service node itself, or is that just a

Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can document this on the website and say that it's mostly for users to find other users. Development discussions should happen on the dev

Re: Kafka stream message sampling

2016-05-16 Thread Samuel Zhou
Hi, Mich, I created the Kafka DStream with following Java code: sparkConf = new SparkConf().setAppName(this.getClass().getSimpleName() + ", topic: " + topics); jssc = new JavaStreamingContext(sparkConf, Durations.seconds(batchInterval )); HashSet topicsSet = new

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Cody Koeninger
Have you checked to make sure you can receive messages just using a byte array for value? On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman wrote: > I am trying to consume AVRO formatted message through > KafkaUtils.createDirectStream. I followed the listed
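A minimal sketch of that sanity check (broker and topic names are placeholders): pass an identity valueDecoder so the raw bytes come through untouched.

```python
# Sketch: confirm raw bytes arrive before debugging the Avro/Confluent decoder.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="raw-bytes-check")
ssc = StreamingContext(sc, 10)

brokers = "localhost:9092"   # placeholder
topic = "test-topic"         # placeholder

raw = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers},
    valueDecoder=lambda v: v)        # identity decoder: keep values as raw bytes
raw.map(lambda kv: len(kv[1]) if kv[1] else 0).pprint()

ssc.start()
ssc.awaitTermination()
```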

KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Ramaswamy, Muthuraman
I am trying to consume AVRO formatted message through KafkaUtils.createDirectStream. I followed the listed below example (refer link) but the messages are not being fetched by the Stream. http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser Is there any code missing

How to add one more column to a DataFrame

2016-05-16 Thread Zhiliang Zhu
Hi All, For a given DataFrame created by Hive SQL, it is required to add one more column based on an existing column, while keeping the previous columns in the resulting DataFrame. final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0; //DAYS_30 seems difficult to call

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Dood
On 5/16/2016 9:53 AM, Yuval Itzchakov wrote: AFAIK, the underlying data represented under the DataSet[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, if needed they will be spilled to local disk on the worker. There is another option in case of

Re: Apache Spark Slack

2016-05-16 Thread Dood
On 5/16/2016 9:52 AM, Xinh Huynh wrote: I just went to IRC. It looks like the correct channel is #apache-spark. So, is this an "official" chat room for Spark? Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is an official channel on IRC for spark :-)

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
AFAIK, the underlying data represented under the DataSet[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, if needed they will be spilled to local disk on the worker. On Mon, May 16, 2016, 19:47 Benjamin Kim wrote: > I have a curiosity

Re: Apache Spark Slack

2016-05-16 Thread Xinh Huynh
I just went to IRC. It looks like the correct channel is #apache-spark. So, is this an "official" chat room for Spark? Xinh On Mon, May 16, 2016 at 9:35 AM, Dood@ODDO wrote: > On 5/16/2016 9:30 AM, Paweł Szulc wrote: > >> >> Just realized that people have to be invited

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Benjamin Kim
I have a curiosity question. These forever/unlimited DataFrames/DataSets will persist and remain queryable. I'm still foggy about how this data will be stored. As far as I know, memory is finite. Will the data be spilled to disk and be retrievable if the query spans data not in memory? Is

Re: How to get and save core dump of native library in executors

2016-05-16 Thread prateek arora
Please help me solve my problem. Regards Prateek -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945p26967.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Efficient for loops in Spark

2016-05-16 Thread Erik Erlandson
Regarding the specific problem of generating random folds in a more efficient way, this should help: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions It uses a sort of multiplexing formalism on RDDs:

Re: Apache Spark Slack

2016-05-16 Thread Dood
On 5/16/2016 9:30 AM, Paweł Szulc wrote: Just realized that people have to be invited to this thing. You see, that's why Gitter is just simpler. I will try to figure it out ASAP You don't need invitations to IRC and it has been around for decades. You can just go to webchat.freenode.net

Re: Apache Spark Slack

2016-05-16 Thread Paweł Szulc
Just realized that people have to be invited to this thing. You see, that's why Gitter is just simpler. I will try to figure it out ASAP On 16 May 2016 at 15:40, "Paweł Szulc" wrote: > I've just created this https://apache-spark.slack.com for ad-hoc > communications within

Re: Monitoring Spark application progress

2016-05-16 Thread Василец Дмитрий
spark + zabbix + jmx https://translate.google.ru/translate?sl=ru=en=y=_t=en=UTF-8=https%3A%2F%2Fmkdev.me%2Fposts%2Fci-i-monitoring-spark-prilozheniy= On Mon, May 16, 2016 at 6:13 PM, Ashok Kumar wrote: > Hi, > > I would like to know the approach and tools please to

Monitoring Spark application progress

2016-05-16 Thread Ashok Kumar
Hi, I would like to know the approach and tools please to get the full performance for a Spark app running through Spark-shell and Spark-submit - Through Spark GUI at 4040? - Through OS utilities top, SAR - Through Java tools like jbuilder etc - Through integrating Spark with

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
To understand the issue, you need to describe more about your case: what version of Spark do you use, and what is your job? Also, what if you directly use the Scala interfaces instead of the Python ones? On Mon, May 16, 2016 at 11:56 PM, Aleksandr Modestov < aleksandrmodes...@gmail.com> wrote: > Hi, >

Re: GC overhead limit exceeded

2016-05-16 Thread Aleksandr Modestov
Hi, "Why did you though you have enough memory for your task? You checked task statistics in your WebUI?". I mean that I have jnly about 5Gb data but spark.driver memory in 60Gb. I check task statistics in web UI. But really spark says that *"05-16 17:50:06.254 127.0.0.1:54321

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
Hi, Why did you think you have enough memory for your task? Did you check task statistics in your WebUI? Anyway, if you get stuck with the GC issue, you'd be better off increasing the number of partitions. // maropu On Mon, May 16, 2016 at 10:00 PM, AlexModestov wrote:
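A hedged sketch of that suggestion (sizes and column names are invented): more, smaller partitions shrink each task's working set, which is usually the first lever against GC-overhead errors.

```python
# Sketch: reduce per-task memory pressure by increasing the number of partitions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gc-pressure-sketch")
         .config("spark.sql.shuffle.partitions", "400")   # smaller shuffle partitions
         .getOrCreate())

df = spark.range(0, 50000000).repartition(400)            # smaller input partitions
df.groupBy((df.id % 1000).alias("bucket")).count().show(5)
```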

Re: Apache Spark Slack

2016-05-16 Thread Dood
On 5/16/2016 6:40 AM, Paweł Szulc wrote: I've just created this https://apache-spark.slack.com for ad-hoc communications within the community. Everybody's welcome! Why not just IRC? Slack is yet another place to create an account etc. - IRC is much easier. What does Slack give you that's so

Re: apache spark on gitter?

2016-05-16 Thread Sean Owen
We'll have to be careful to not present this as an official project chat, since it's not affiliated. Even calling it apache-spark.slack.com is potentially problematic since it gives some impression it's from the ASF. Ideally, call this something that is unambiguously not related to Apache itself,

Apache Spark Slack

2016-05-16 Thread Paweł Szulc
I've just created this https://apache-spark.slack.com for ad-hoc communications within the community. Everybody's welcome! -- Regards, Paul Szulc twitter: @rabbitonweb blog: www.rabbitonweb.com

Re: apache spark on gitter?

2016-05-16 Thread Paweł Szulc
I've just created https://apache-spark.slack.com On Thu, May 12, 2016 at 9:28 AM, Paweł Szulc wrote: > Hi, > > well I guess the advantage of Gitter over a mailing list is the same as with > IRC. It's not actually a replacement, because the mailing list is also important. > But it is

GC overhead limit exceeded

2016-05-16 Thread AlexModestov
I get the error in Apache Spark... "spark.driver.memory 60g spark.python.worker.memory 60g spark.master local[*]" The amount of data is about 5Gb, but Spark says "GC overhead limit exceeded". I guess that my conf-file gives enough resources. "16/05/16 15:13:02 WARN

Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Ted Yu
From https://spark.apache.org/docs/latest/monitoring.html#metrics : - JmxSink: Registers metrics for viewing in a JMX console. FYI On Sun, May 15, 2016 at 11:54 PM, Mich Talebzadeh wrote: > Have you tried the Spark GUI on port 4040? This will show jobs being executed by

Re: Lost names of struct fields in UDF

2016-05-16 Thread Alexander Chermenin
Hi. It's surprising, but this code solves my problem: private static Column namedStruct(Column... cols) {    List<Expression> exprs = Arrays.stream(cols)    .flatMap(c ->    Stream.of(    new Literal(UTF8String.fromString(((NamedExpression) c.expr()).name()),

Re: Renaming nested columns in dataframe

2016-05-16 Thread Alexander Chermenin
Hello. I think you can use something like this:  df.select(    struct(        column("site.site_id").as("id"),        column("site.site_name").as("name"),        column("site.site_domain").as("domain"),        column("site.site_cat").as("cat"),        struct(            

What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Hi, We can see in [2] many interesting (and expected!) improvements (promises) like extended SQL support, unified API (DataFrames, DataSets), improved engine (Tungsten relates to ideas from modern compilers and MPP databases - similar to Flink [3]), structured streaming etc. It seems we

Renaming nested columns in dataframe

2016-05-16 Thread Prashant Bhardwaj
Hi, How can I rename nested columns in a dataframe through the Scala API? Like the following schema: > |-- site: struct (nullable = false) > > ||-- site_id: string (nullable = true) > > ||-- site_name: string (nullable = true) > > ||-- site_domain: string (nullable = true) > > ||-- site_cat:
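The thread asks about the Scala API; as an illustration of the same shape, here is a hedged PySpark sketch that rebuilds the struct with aliased fields (field types beyond the schema excerpt are assumed to be strings):

```python
# Sketch: rename nested struct fields by rebuilding the struct with aliases.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rename-nested-sketch").getOrCreate()

df = spark.createDataFrame(
    [(("s1", "example", "example.com", "news"),)],
    "site struct<site_id:string,site_name:string,site_domain:string,site_cat:string>")

renamed = df.select(
    F.struct(
        F.col("site.site_id").alias("id"),
        F.col("site.site_name").alias("name"),
        F.col("site.site_domain").alias("domain"),
        F.col("site.site_cat").alias("cat"),
    ).alias("site"))
renamed.printSchema()
```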

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Oh, that looks neat! Thx, will read up on that. On Mon, May 16, 2016, 14:10 Ofir Manor wrote: > Yuval, > Not sure what in-scope to land in 2.0, but there is another new infra bit > to manage state more efficiently called State Store, whose initial version > is already

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Ofir Manor
Yuval, Not sure what is in scope to land in 2.0, but there is another new infra bit to manage state more efficiently, called State Store, whose initial version is already committed: SPARK-13809 - State Store: A new framework for state management for computing Streaming Aggregates

Re: spark udf can not change a json string to a map

2016-05-16 Thread ??????
Hi Ted, I found a built-in function called str_to_map, which can transform a string to a map. But it cannot meet my need, because my string may be a map with an array nested in its value, for example, map. I think it will not work in my situation. Cheers

Issue with creation of EC2 cluster using spark scripts

2016-05-16 Thread Marco Mistroni
Hi all, I am experiencing issues when creating EC2 clusters using the scripts in the spark/ec2 directory. I launched the following command: ./spark-ec2 -k sparkkey -i sparkAccessKey.pem -r us-west2 -s 4 launch MM-Cluster My output is stuck with the following (has been for the last 20 minutes). I

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Also, re-reading the relevant part from the Structured Streaming documentation ( https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x ): Discretized streams (aka dstream) Unlike Storm, dstream exposes a higher level API similar to RDDs. There

Re: Kafka stream message sampling

2016-05-16 Thread Mich Talebzadeh
Hi Samuel, How do you create your RDD based on the Kafka direct stream? Do you have your code snippet? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Issue with Spark Streaming UI

2016-05-16 Thread Mich Talebzadeh
Have you checked the Streaming tab in the Spark GUI? [image: Inline images 1] HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Mich Talebzadeh
Have you tried the Spark GUI on port 4040? This will show jobs being executed by executors in each stage and the line of code as well. [image: Inline images 1] Also command line tools like jps and jmonitor HTH Dr Mich Talebzadeh LinkedIn *

Re: Executors and Cores

2016-05-16 Thread Mich Talebzadeh
Hi Pradeep, Resources allocated for each Spark app can be capped to allow balanced resourcing for all apps. However, you really need to monitor each app. One option would be to use the jmonitor package to look at resource usage (heap, CPU, memory etc.) for each job. In general you should not