spark streaming write orc suddenly slow?

2017-03-27 Thread 446463...@qq.com
Hi All: when I use Spark Streaming to consume Kafka data and store it in HDFS as ORC files, it is fast in the beginning but becomes slow several hours later. environment: Spark version: 2.1.0, Kafka version: 0.10.1.1, spark-streaming-kafka jar version: spark-streaming-kafka-0-8_2.11
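
For context, the pipeline described here typically looks like the minimal sketch below, assuming the spark-streaming-kafka-0-8 direct-stream API; the broker address, topic, batch interval, and output path are placeholders, not details from the original post:

import kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val spark = SparkSession.builder.appName("kafka-to-orc").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // placeholder topic

stream.foreachRDD { rdd =>
  import spark.implicits._
  // append each micro-batch as ORC files on HDFS
  rdd.map(_._2).toDF("value").write.mode("append").orc("hdfs:///data/events")
}

ssc.start()
ssc.awaitTermination()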

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
Thanks for your confirmation. On 28 Mar 2017 5:02 a.m., "Jacek Laskowski" wrote: Hi Hyukjin, It was a false alarm as I had a local change to `def schema` in `Dataset` that caused the issue. I apologize for the noise. Sorry and thanks a lot for the prompt response. I

Re: Kafka failover with multiple data centers

2017-03-27 Thread nguyen duc Tuan
Hi Soumitra, We're working on that. The idea here is to use Kafka to get the brokers' information for the topic and use the Kafka client to find the corresponding offsets on the new cluster ( https://jeqo.github.io/post/2017-01-31-kafka-rewind-consumers-offset/). You need Kafka >= 0.10.1.0 because it supports
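
A sketch of that timestamp-based rewind using the plain Kafka >= 0.10.1.0 consumer API (offsetsForTimes is the call added in that release); the broker address, group id, topic, partition, and the one-hour rewind window are placeholders:

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "dc2-broker:9092") // placeholder: brokers in the failover cluster
props.put("group.id", "failover-rewind")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val tp = new TopicPartition("events", 0) // placeholder topic/partition
consumer.assign(Collections.singletonList(tp))

// offsetsForTimes maps a timestamp to the first offset at or after it
val ts: java.lang.Long = System.currentTimeMillis() - 60 * 60 * 1000L // e.g. rewind one hour
val found = consumer.offsetsForTimes(Map(tp -> ts).asJava).get(tp)
if (found != null) consumer.seek(tp, found.offset())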

Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause for Spark 2.x? Thanks!

Re: Upgrade the scala code using the most updated Spark version

2017-03-27 Thread vvshvv
it yo On Jörn Franke, Mar 28, 2017 12:11 AM wrote: Usually you define the dependencies to the Spark library as provided. You also seem to mix different Spark versions, which should be avoided. The Hadoop library seems to be outdated and should also only be provided. The other

Re: Kafka failover with multiple data centers

2017-03-27 Thread Soumitra Johri
Hi, did you guys figure it out? Thanks, Soumitra On Sun, Mar 5, 2017 at 9:51 PM nguyen duc Tuan wrote: > Hi everyone, > We are deploying a Kafka cluster for ingesting streaming data. But > sometimes some of the nodes in the cluster have trouble (a node dies, the Kafka > daemon is

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
When the RDD using them goes out of scope. On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar wrote: > Thanks Mark! Follow-up question: do you know when shuffle files are > usually un-referenced? > > On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra
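
Concretely, a minimal sketch of that lifecycle (assuming an existing SparkContext `sc`; names are illustrative):

// given an existing SparkContext `sc`
val grouped = sc.parallelize(1 to 1000000)
  .map(i => (i % 10, i))
  .reduceByKey(_ + _)   // this stage writes shuffle files on the local disks

grouped.count()          // the shuffle files stay so `grouped` can be recomputed cheaply

// once `grouped` goes out of scope and is garbage-collected on the driver,
// ContextCleaner asynchronously deletes the shuffle files it referenced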

Re: Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Thanks Mark! Follow-up question: do you know when shuffle files are usually un-referenced? On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra wrote: > Shuffle files are cleaned when they are no longer referenced. See > https://github.com/apache/spark/blob/master/core/src/ >

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
Ah, I understand what you are asking now. There is no API for specifying a Kafka-specific "decoder", since Spark SQL already has a rich language for expressing transformations. The DataFrame code I gave will parse the JSON and materialize it in a class, very similar to what
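
That approach looks roughly like the following sketch, where `df` is assumed to be the Kafka source DataFrame and the JSON schema fields are placeholders:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// placeholder schema for the JSON payload
val schema = new StructType()
  .add("id", LongType)
  .add("text", StringType)

val parsed = df
  .select(from_json(col("value").cast("string"), schema).as("event")) // bytes -> string -> struct
  .select("event.*")                                                  // flatten into columns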

Re: Upgrade the scala code using the most updated Spark version

2017-03-27 Thread Mich Talebzadeh
check these versions

function create_build_sbt_file {
  BUILD_SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/build.sbt
  [ -f ${BUILD_SBT_FILE} ] && rm -f ${BUILD_SBT_FILE}
  cat >> $BUILD_SBT_FILE << !
lazy val root = (project in file(".")).
  settings(
    name :=

Re: Spark shuffle files

2017-03-27 Thread Mark Hamstra
Shuffle files are cleaned when they are no longer referenced. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar < ashan...@netflix.com.invalid> wrote: > Hi! > > In spark on yarn, when are

Re: Upgrade the scala code using the most updated Spark version

2017-03-27 Thread Jörn Franke
Usually you define the dependencies to the Spark library as provided. You also seem to mix different Spark versions, which should be avoided. The Hadoop library seems to be outdated and should also only be provided. The other dependencies you could assemble in a fat jar. > On 27 Mar 2017, at
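
Applied to this thread, that advice could look like the build.sbt sketch below; the project name and versions are assumptions, not taken from the original file:

name := "my-spark-app"        // placeholder
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
  // non-Spark, non-Hadoop dependencies go here unscoped and end up in the fat jar
)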

Re: unable to stream kafka messages

2017-03-27 Thread kaniska Mandal
Yup, that solves the compilation issue :-) One quick question regarding specifying a Decoder in the Kafka stream: please note that I am encoding the message as follows while sending data to Kafka -

String msg = objectMapper.writeValueAsString(tweetEvent);
return msg.getBytes();

I have a

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
You need to import col from org.apache.spark.sql.functions. On Mon, Mar 27, 2017 at 1:20 PM, kaniska Mandal wrote: > Hi Michael, > > Can you please check if I am using correct version of spark-streaming > library as specified in my pom (specified in the email) ? > >
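
That is, in Scala (from Java the equivalent is the static import: import static org.apache.spark.sql.functions.col;):

import org.apache.spark.sql.functions.col

val asString = df.select(col("value").cast("string")) // `df` is the Kafka source DataFrame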

Re: unable to stream kafka messages

2017-03-27 Thread kaniska Mandal
Hi Michael, Can you please check whether I am using the correct version of the spark-streaming library, as specified in my pom (included in the email)? col("value").cast("string") throws the error 'cannot find symbol method col(java.lang.String)'. I tried $"value", which results in a similar compilation

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Jacek Laskowski
Hi Hyukjin, It was a false alarm as I had a local change to `def schema` in `Dataset` that caused the issue. I apologize for the noise. Sorry and thanks a lot for the prompt response. I appreciate. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2

Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Hi! In Spark on YARN, when are shuffle files on local disk removed? (Is it when the app completes, once all the shuffle files are fetched, or at the end of the stage?) Thanks, Ashwin

Upgrade the scala code using the most updated Spark version

2017-03-27 Thread Anahita Talebi
Hi friends, I have code written in Scala. Scala version 2.10.4 and Spark version 1.5.2 are used to run it. I would like to upgrade the code to the most recent version of Spark, namely 2.1.0. Here is the build.sbt:

import AssemblyKeys._
assemblySettings
name :=

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
Sorry, I don't think that I understand the question. value is just a binary blob that we get from Kafka and pass to you. If it's stored as JSON, I think the code I provided is a good option, but if you are using a different encoding you may need to write a UDF. On Fri, Mar 24, 2017 at 4:58 PM,
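
A sketch of such a UDF, with UTF-16 standing in for whatever custom encoding the producer actually used; `df` is assumed to be the Kafka source DataFrame:

import java.nio.charset.StandardCharsets
import org.apache.spark.sql.functions.{col, udf}

// stand-in decoder: replace the body with the real decoding logic
val decodeValue = udf { bytes: Array[Byte] => new String(bytes, StandardCharsets.UTF_16) }

val decoded = df.select(decodeValue(col("value")).as("text"))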

Re: How to insert nano seconds in the TimestampType in Spark

2017-03-27 Thread Michael Armbrust
The timestamp type has only microsecond precision. You would need to store the value on your own (as binary, or a limited-range long, or something similar) if you require nanosecond precision. On Mon, Mar 27, 2017 at 5:29 AM, Devender Yadav < devender.ya...@impetus.co.in> wrote: > Hi All, > > I am using spark
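
For example, a sketch of the limited-range-long option, packing nanoseconds since the epoch into a plain Long (which covers roughly the years 1678 through 2262 at nanosecond resolution):

import java.time.Instant

// pack an Instant into epoch nanoseconds and unpack it again
def toEpochNanos(i: Instant): Long = i.getEpochSecond * 1000000000L + i.getNano

val nanos: Long = toEpochNanos(Instant.now())  // store this in a LongType column
val restored = Instant.ofEpochSecond(nanos / 1000000000L, nanos % 1000000000L)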

Application kill from UI does not propagate exception

2017-03-27 Thread Noorul Islam K M
Hi all, I am trying to trap the UI kill event of a Spark application from the driver. Somehow the exception thrown is not propagated to the driver main program. See for example the spark-shell session below. Is there a way to get hold of this event and shut down the driver program? Regards, Noorul

This is a test mail, please ignore!

2017-03-27 Thread Noorul Islam K M
Sending a plain text mail to test whether my mail appears on the list.

Re: spark-submit config via file

2017-03-27 Thread Thakrar, Jayesh
Roy - can you check if you have HADOOP_CONF_DIR and YARN_CONF_DIR set to the directory containing the HDFS and YARN configuration files? From: Sandeep Nemuri Date: Monday, March 27, 2017 at 9:44 AM To: Saisai Shao Cc: Yong Zhang

Re: spark-submit config via file

2017-03-27 Thread Sandeep Nemuri
You should try adding your NameNode host and port to the URL, e.g. hdfs://<namenode-host>:8020/path rather than hdfs:///path. On Mon, Mar 27, 2017 at 11:03 AM, Saisai Shao wrote: > It's quite obvious your hdfs URL is not complete; please look at the > exception: your hdfs URI doesn't have host, port. Normally it should be OK > if HDFS is

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
I just tried to build against the current master to help check - https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4a0ae568613 It seems I can't reproduce this, as below:

scala> spark.range(1).printSchema
root
 |-- id: long (nullable = false)

scala>

How to insert nano seconds in the TimestampType in Spark

2017-03-27 Thread Devender Yadav
Hi All, I am using Spark version 1.6.1. I have a text table in Hive with a `timestamp` datatype at nanosecond precision.

Hive table schema:
c_timestamp timestamp

Hive table data:
hive> select * from tbl1;
OK
00:00:00.1
12:12:12.123456789

apache-spark: Converting List of Rows into Dataset Java

2017-03-27 Thread Karin Valisova
Hello! I am running Spark on Java and bumped into a problem I can't solve or find anything helpful about among answered questions, so I would really appreciate your help. I am running some calculations, creating rows for each result:

List<Row> results = new LinkedList<Row>();
for (something) {
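
The usual route is SparkSession.createDataFrame(List<Row>, StructType). A minimal Scala sketch follows (the same method is callable from Java); the schema and row values are placeholders:

import scala.collection.JavaConverters._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("rows-to-dataset").master("local[*]").getOrCreate()

// placeholder schema describing each result row
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("score", DoubleType, nullable = false)))

val results = List(Row("a", 1.0), Row("b", 2.0)) // stand-in for the computed rows
val df = spark.createDataFrame(results.asJava, schema)
df.show()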

Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Jacek Laskowski
Hi, While toying with selectExpr I've noticed that the schema changes to include an id column. I can't seem to explain it. Anyone?

scala> spark.range(1).printSchema
root
 |-- value: long (nullable = true)

scala> spark.range(1).selectExpr("*").printSchema
root
 |-- id: long (nullable = false)

p.s.

Re: spark-submit config via file

2017-03-27 Thread Saisai Shao
It's quite obvious your hdfs URL is not complete; please look at the exception: your hdfs URI doesn't have host, port. Normally it should be OK if HDFS is your default FS. I think the problem is that you're running on HDI, in which the default FS is wasb. So here a short name without host:port will lead to