HBase / Spark Kerberos problem

2016-05-18 Thread philipp.meyerhoefer
Hi all, I have been puzzling over a Kerberos problem for a while now and wondered if anyone can help. For spark-submit, I specify --master yarn-client --keytab x --principal y, which creates my SparkContext fine. Connections to Zookeeper Quorum to find the HBase master work well too. But when i

Re: Tar File: On Spark

2016-05-18 Thread Sun Rui
1. Create a temp dir on HDFS, say “/tmp”. 2. Write a script that creates, in the temp dir, one file for each tar file; each file has only one line: 3. Write a spark application. It is like: val rdd = sc.textFile () rdd.map { line => construct an untar command using the path information in “l
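A rough sketch of the approach described above, assuming a listing file on HDFS with one tar path per line; all paths, the output location, and the helper names are illustrative and not from the original mail:

```scala
import scala.sys.process._
import org.apache.spark.{SparkConf, SparkContext}

object UntarOnExecutors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("untar"))

    // /tmp/tar-list.txt holds one HDFS path per line, e.g. /data/tar1.tar
    val tarPaths = sc.textFile("/tmp/tar-list.txt")

    val exitCodes = tarPaths.map { line =>
      val name = line.split("/").last.stripSuffix(".tar")
      // pull the tar locally, extract it, and push the contents back to HDFS
      val cmd = s"hadoop fs -get $line /tmp/$name.tar && " +
                s"mkdir -p /tmp/$name && tar -xf /tmp/$name.tar -C /tmp/$name && " +
                s"hadoop fs -put /tmp/$name /untarred/$name"
      Seq("bash", "-c", cmd).!   // exit code, 0 on success
    }.collect()

    println(s"${exitCodes.count(_ == 0)} of ${exitCodes.length} tar files extracted")
    sc.stop()
  }
}
```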

Tar File: On Spark

2016-05-18 Thread ayan guha
Hi, I have a few tar files in HDFS in a single folder. Each tar file has multiple files in it. tar1: - f1.txt - f2.txt tar2: - f1.txt - f2.txt (each tar file will have the exact same number of files, with the same names). I am trying to find a way (spark or pig) to extract them to their own f

Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-18 Thread Jeff Zhang
I want to pass a custom hadoop conf to the spark thrift server so that both the driver and executor sides can see it. I also want this custom hadoop conf to be visible only to the jobs of the user who set it. Is this possible with the spark thrift server now? Thanks -- Best Regards Jeff Zhang

Re: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Cassa L
I tried all combinations of spark-cassandra connector. Didn't work. Finally, I downgraded spark to 1.5.1 and now it works. LCassa On Wed, May 18, 2016 at 11:11 AM, Mohammed Guller wrote: > As Ben mentioned, Spark 1.5.2 does work with C*. Make sure that you are > using the correct version of the

Latency experiment without losing executors

2016-05-18 Thread gkumar7
I would like to test the latency (tasks/s) perceived in a simple application on Apache Spark. The idea: The workers will generate random data to be placed in a list. The final action (count) will count the total number of data points generated. Below, the numberOfPartitions is equal to the number
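A minimal sketch of an experiment along these lines (the partition count, data volume, and timing code are illustrative assumptions, not the poster's actual code):

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object LatencyExperiment {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("latency-test"))

    val numberOfPartitions = sc.defaultParallelism   // e.g. one partition per worker core
    val pointsPerPartition = 1000000

    // each partition generates its own random data
    val data = sc.parallelize(0 until numberOfPartitions, numberOfPartitions)
      .flatMap(_ => Iterator.fill(pointsPerPartition)(Random.nextDouble()))

    val start = System.nanoTime()
    val total = data.count()                          // final action
    val seconds = (System.nanoTime() - start) / 1e9

    println(f"counted $total points in $seconds%.2f s (${total / seconds}%.0f points/s)")
    sc.stop()
  }
}
```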

RE: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-18 Thread Ramaswamy, Muthuraman
I am using Spark 1.6.1 and Kafka 0.9+. It works for both receiver and receiver-less mode. One thing I noticed: when you specify an invalid topic name, KafkaUtils doesn't fetch any messages. So, check that you have specified the topic name correctly. ~Muthu From: M

How to perform reduce operation in the same order as partition indexes

2016-05-18 Thread Pulasthi Supun Wickramasinghe
Hi Devs/All, I am pretty new to Spark. I have a program which does some map reduce operations with matrices. Here *shortrddFinal* is of type *RDD[Array[Short]]* and consists of several partitions: *var BC = shortrddFinal.mapPartitionsWithIndex(calculateBCInternal).reduce(mergeBC)* The map fu
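One possible workaround (a sketch, not from the thread): tag each partition's result with its index, collect, and merge on the driver in index order. `calculateBC` and `mergeBC` below are placeholders standing in for the poster's own functions, and the result type is assumed for illustration.

```scala
// placeholders for the poster's functions
def calculateBC(it: Iterator[Array[Short]]): Array[Double] = ???
def mergeBC(a: Array[Double], b: Array[Double]): Array[Double] = ???

val perPartition: Array[(Int, Array[Double])] =
  shortrddFinal
    .mapPartitionsWithIndex { (idx, it) => Iterator((idx, calculateBC(it))) }
    .collect()

// merge strictly in partition-index order on the driver
val BC = perPartition.sortBy(_._1).map(_._2).reduceLeft(mergeBC)
```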

Does Structured Streaming support Kafka as data source?

2016-05-18 Thread Todd
Hi, I skimmed the spark code, and it looks like structured streaming doesn't support kafka as a data source yet?

Spark UI metrics - Task execution time and number of records processed

2016-05-18 Thread Nirav Patel
Hi, I have been noticing that for shuffled tasks (groupBy, Join) the reducer tasks are not evenly loaded. Most of them (90%) finish super fast, but there are some outliers that take much longer, as you can see from the "Max" value in the following metric. The metric is from a Join operation on two RDDs. I trie

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Ted Yu
Depending on the version of hadoop you use, you may find a tarball prebuilt with Scala 2.11: https://s3.amazonaws.com/spark-related-packages FYI On Wed, May 18, 2016 at 3:35 PM, Koert Kuipers wrote: > no but you can trivially build spark 1.6.1 for scala 2.11 > > On Wed, May 18, 2016 at 6:11 PM,

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Koert Kuipers
no but you can trivially build spark 1.6.1 for scala 2.11 On Wed, May 18, 2016 at 6:11 PM, Sergey Zelvenskiy wrote: > >

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use: https://github.com/mkuthan/example-spark http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ https://github.com/holdenk/spark-testing-base On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy wrote: > Hi Lars, > > Do you have any examples for the methods
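For reference, a minimal test using spark-testing-base (the last link above); `SharedSparkContext` is that library's trait, and the rest of the example is illustrative:

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts words across lines") {
    // sc is provided by SharedSparkContext
    val lines = sc.parallelize(Seq("a b", "b c", "b"))
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("b") === 3)
  }
}
```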

Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Sergey Zelvenskiy

Couldn't find leader offsets

2016-05-18 Thread samsayiam
I have seen questions posted about this on SO and on this list but haven't seen a response that addresses my issue. I am trying to create a direct stream connection to a kafka topic but it fails with Couldn't find leader offsets for Set(...). If I run a kafka consumer I can read the topic but can

Re: Can Pyspark access Scala API?

2016-05-18 Thread Ted Yu
Please take a look at python/pyspark/ml/wrapper.py On Wed, May 18, 2016 at 1:08 PM, Abi wrote: > Thanks for that. But the question is more general. Can pyspark access > Scala somehow ? > > > On May 18, 2016 3:53:50 PM EDT, Ted Yu wrote: >> >> Not sure if you have seen this (for 2.0): >> >> [SPA

Re: Can Pyspark access Scala API?

2016-05-18 Thread Abi
Thanks for that. But the question is more general. Can pyspark access Scala somehow ? On May 18, 2016 3:53:50 PM EDT, Ted Yu wrote: >Not sure if you have seen this (for 2.0): > >[SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only >value > >Can you tell us your use case ? > >On

Re: Can Pyspark access Scala API?

2016-05-18 Thread Ted Yu
Not sure if you have seen this (for 2.0): [SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value Can you tell us your use case ? On Tue, May 17, 2016 at 9:16 PM, Abi wrote: > Can Pyspark access Scala API? The accumulator in pysPark does not have > local variable available

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread swetha kasireddy
Hi Lars, Do you have any examples for the methods that you described for Spark batch and Streaming? Thanks! On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson wrote: > Thanks! > > It is on my backlog to write a couple of blog posts on the topic, and > eventually some example code, but I am curre

RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
As Ben mentioned, Spark 1.5.2 does work with C*. Make sure that you are using the correct version of the Spark Cassandra Connector. Mohammed Author: Big Data Analytics with Spark From: Ben Slater [mailto:ben.sla...@i
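For context, reading a C* table from the shell with the connector looks roughly like the sketch below; the keyspace/table names and connection host are placeholders, and the connector jar matching your Spark version must be on the classpath:

```scala
// spark-shell started e.g. with:
//   --conf spark.cassandra.connection.host=<cassandra-host> --jars <connector-jar>
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.count())
rdd.take(5).foreach(println)
```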

Re: SLF4J binding error while running Spark using YARN as Cluster Manager

2016-05-18 Thread Marcelo Vanzin
Hi Anubhav, This is happening because you're trying to use the configuration generated for CDH with upstream Spark. The CDH configuration will add extra needed jars that we don't include in our build of Spark, so you'll end up getting duplicate classes. You can either try to use a different Spark

SLF4J binding error while running Spark using YARN as Cluster Manager

2016-05-18 Thread Anubhav Agarwal
Hi, I am having log4j trouble while running Spark using YARN as cluster manager in CDH 5.3.3. I get the following error:- SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/12/yarn/nm/filecache/34/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticL

Re: [Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Michael Armbrust
The state store for structured streaming is an internal concept, and isn't designed to be consumed by end users. I'm hoping to write some documentation on how to do aggregation, but support for reading from Kafka and other sources will likely come in Spark 2.1+ On Wed, May 18, 2016 at 3:50 AM, Sh

Re: 2 tables join happens at Hive but not in spark

2016-05-18 Thread Davies Liu
What do the schemas of the two tables look like? Could you also show the explain output for the query? On Sat, Feb 27, 2016 at 2:10 AM, Sandeep Khurana wrote: > Hello > > We have 2 tables (tab1, tab2) exposed using hive. The data is in different > hdfs folders. We are trying to join these 2 tables on certa

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-18 Thread Mail.com
Adding back users. > On May 18, 2016, at 11:49 AM, Mail.com wrote: > > Hi Uladzimir, > > I run is as below. > > Spark-submit --class com.test --num-executors 4 --executor-cores 5 --queue > Dev --master yarn-client --driver-memory 512M --executor-memory 512M test.jar > > Thanks, > Pradeep >

Submit python egg?

2016-05-18 Thread Darren Govoni
Hi, I have a python egg with a __main__.py in it. I am able to execute the egg by itself fine. Is there a way to just submit the egg to spark and have it run? It seems an external .py script is needed, which would be unfortunate if true. Thanks

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Ok, it happens only in YARN+cluster mode. It works with snappy in YARN+client mode. I've started to hit this problem when I switched to cluster mode. 2016-05-18 16:31 GMT+02:00 Ted Yu : > According to: > > http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
According to: http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of the performance of snappy and lzf was on par. Maybe lzf has a lower memory requirement. On Wed, May 18, 2016 at 7:22 AM, Serega Sheypak wrote: > Switching from snappy to lzf helped

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Switching from snappy to lzf helped me: *spark.io.compression.codec=lzf* Do you know why? :) I can't find an exact explanation... 2016-05-18 15:41 GMT+02:00 Ted Yu : > Please increase the number of partitions. > > Cheers > > On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak > wrote: > >> Hi, plea
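For completeness, the codec switch mentioned above can be set on the SparkConf (or passed as --conf on spark-submit); a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// switch shuffle/IO compression from snappy to lzf
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.io.compression.codec", "lzf")

val sc = new SparkContext(conf)
```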

Re: spark udf can not change a json string to a map

2016-05-18 Thread Ted Yu
Please take a look at JavaUtils#mapAsSerializableJavaMap FYI On Mon, May 16, 2016 at 3:24 AM, 喜之郎 <251922...@qq.com> wrote: > > hi, Ted. > I found a built-in function called str_to_map, which can transform a string > to a map. > But it cannot meet my need, > > because my string may be a map with
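A sketch (not from the thread) of a UDF that turns a JSON object string into a Map[String, String], using the Scala standard-library JSON parser; the column and UDF names are illustrative:

```scala
import scala.util.parsing.json.JSON
import org.apache.spark.sql.functions.udf

// parse a JSON object string into a string-to-string map; empty map on parse failure
val jsonToMap = udf { (s: String) =>
  JSON.parseFull(s) match {
    case Some(m: Map[_, _]) => m.map { case (k, v) => k.toString -> v.toString }
    case _                  => Map.empty[String, String]
  }
}

// usage: df.withColumn("props", jsonToMap(df("json_col")))
```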

Spark Task not serializable with lag Window function

2016-05-18 Thread luca_guerra
I've noticed that after I use a Window function over a DataFrame, if I call map() with a function, Spark throws a "Task not serializable" exception. This is my code: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc.implicits._ import org.apache.spark.sql.expressions.Window import
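For reference, a typical lag-over-window pattern looks like the sketch below (illustrative, not the poster's full code). A common cause of "Task not serializable" in this situation is a later closure capturing a non-serializable object such as the HiveContext itself.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val w = Window.partitionBy("key").orderBy("ts")
val withPrev = df.withColumn("prev_value", lag("value", 1).over(w))

// the map() closure should reference only serializable values, never hc or sc
val prevValues = withPrev.map(row => row.getAs[Any]("prev_value"))
prevValues.take(5).foreach(println)
```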

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
Please increase the number of partitions. Cheers On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak wrote: > Hi, please have a look at log snippet: > 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; > tracker endpoint = > NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xx

Re: File not found exception while reading from folder using textFileStream

2016-05-18 Thread Saisai Shao
From my understanding, we should copy the file into another folder and move it to the source folder after the copy is finished; otherwise we will read half-copied data or hit the issue you mentioned above. On Wed, May 18, 2016 at 8:32 PM, Ted Yu wrote: > The following should handle the situation y
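A sketch of that copy-then-rename pattern using the Hadoop FileSystem API (paths are illustrative); the point is that the rename into the watched folder happens only after the slow copy has finished:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

val staging = new Path("/data/_staging/part-0001.txt")
val target  = new Path("/data/incoming/part-0001.txt")   // folder watched by textFileStream

// slow step: copy into a staging folder outside the watched directory
fs.copyFromLocalFile(new Path("/local/files/part-0001.txt"), staging)

// fast step: move the complete file into the source folder
fs.rename(staging, target)
```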

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Michael Segel
Yes, but he’s using Phoenix, which may not work cleanly with your HBase spark module. The key issue here may be Phoenix, which is separate from HBase. > On May 18, 2016, at 5:36 AM, Ted Yu wrote: > > Please see HBASE-14150 > > The hbase-spark module would be available in the upcoming hbase 2

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Ted Yu
Please see HBASE-14150 The hbase-spark module would be available in the upcoming hbase 2.0 release. On Tue, May 17, 2016 at 11:48 PM, Takeshi Yamamuro wrote: > Hi, > > Have you checked this? > > http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctgua

Re: File not found exception while reading from folder using textFileStream

2016-05-18 Thread Ted Yu
The following should handle the situation you encountered: diff --git a/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala b/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.sca index ed93058..f79420b 100644 --- a/streaming/src/main/scala

Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Hi, please have a look at log snippet: 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations 16/05/18 03:27:16 INFO st

[Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Shekhar Bansal
Hi, what is the right way of using the spark 2.0 state store feature in spark streaming? I referred to the test cases in this pull request (https://github.com/apache/spark/pull/11645/files) and implemented word count using the state store. My source is kafka (1 topic, 10 partitions). My data pump is pushing number

File not found exception while reading from folder using textFileStream

2016-05-18 Thread Yogesh Vyas
Hi, I am trying to read the files in a streaming way using Spark Streaming. For this I am copying files from my local folder to the source folder from where spark reads the file. After reading and printing some of the files, it gives the following error: Caused by: org.apache.hadoop.ipc.RemoteExce

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-18 Thread Sean Owen
Late to the thread, but why is counting distinct elements over a 24-hour window not possible? You can certainly do it now, and I'd presume it's possible with structured streaming with a window. countByValueAndWindow should do it, right? The keys (with non-zero counts, I suppose) in a window are th
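A minimal DStream sketch of that suggestion; the batch interval, window/slide durations, and the socket source are illustrative, and checkpointing is needed because the windowed count keeps state:

```scala
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
ssc.checkpoint("/tmp/streaming-checkpoints")

val events = ssc.socketTextStream("localhost", 9999)

// (value, count) pairs over the last 24 hours, recomputed every 10 minutes
val counts = events.countByValueAndWindow(Minutes(24 * 60), Minutes(10))

// number of distinct values seen in the current window
counts.filter(_._2 > 0).count().print()

ssc.start()
ssc.awaitTermination()
```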

Re: Error joining dataframes

2016-05-18 Thread ram kumar
I tried it, eg: df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")

+----+----+----+----+
|  id|   A|  id|   B|
+----+----+----+----+
|   1|   0|null|null|
|   2|   0|   2|   0|
|null|null|   3|   0|
+----+----+----+----+

if I try, df_join = df1.join(df2, df1("Id") === df2("Id"),

Re: Error joining dataframes

2016-05-18 Thread Divya Gehlot
Can you try

var df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter").drop(df1("Id"))

On May 18, 2016 2:16 PM, "ram kumar" wrote:
I tried
scala> var df_join = df1.join(df2, "Id", "fullouter")
:27: error: type mismatch;
 found   : String("Id")
 required: org.apache.spark.sql.Column

HBase / Spark Kerberos problem

2016-05-18 Thread philipp.meyerhoefer
Hi all, I have been puzzling over a Kerberos problem for a while now and wondered if anyone can help. For spark-submit, I specify --keytab x --principal y, which creates my SparkContext fine. Connections to Zookeeper Quorum to find the HBase master work well too. But when it comes to a .count()

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
Ah, yes. `df_join` has two `id` columns, so you need to select which id you use:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")
val df3 = df1.join(df2, df1("id") === df2("id"), "outer")
df3.print
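Building on that, one way (a sketch, not verbatim from the thread) to end up with a single id column after the outer join is to select one side explicitly or coalesce the two:

```scala
// in spark-shell, where the implicits for toDF are already imported
import org.apache.spark.sql.functions.coalesce

val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")
val joined = df1.join(df2, df1("id") === df2("id"), "outer")

// keep one side's id ...
joined.select(df1("id"), df1("A"), df2("B"))

// ... or merge both so the outer join never yields a null id
joined.select(coalesce(df1("id"), df2("id")).as("id"), df1("A"), df2("B"))
```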

Re: Error joining dataframes

2016-05-18 Thread ram kumar
When you register a temp table from the dataframe eg: var df_join = df1.join(df2, df1("id") === df2("id"), "outer") df_join.registerTempTable("test") sqlContext.sql("select * from test") +++++ | id| A| id| B| +++++ | 1| 0|null|null| | 2| 0| 2|

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
Looks weird; it seems spark-v1.5.x can accept the query. What's the difference between the example and your query?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

scala> :paste
// Enterin