Spark writes hex null string terminators into columns

2017-05-12 Thread Afshin, Bardia
I’m running a process where I load the original data, remove some columns, and write the remaining columns out to an output file. Spark is writing hex 00 (null characters) into some of the columns, and this is causing issues when importing into Redshift. What’s the most efficient way to resolve this?
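A minimal sketch of one possible fix, assuming the DataFrame API and a hypothetical output path: strip embedded null characters from every string column before writing.

    import org.apache.spark.sql.functions.{col, regexp_replace}
    import org.apache.spark.sql.types.StringType

    // Replace embedded null characters (\u0000) in all string columns before writing.
    val cleaned = df.columns.foldLeft(df) { (d, c) =>
      if (d.schema(c).dataType == StringType)
        d.withColumn(c, regexp_replace(col(c), "\u0000", ""))
      else d
    }
    cleaned.write.csv("s3://my-bucket/output")   // hypothetical output location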

Re: Restful API Spark Application

2017-05-12 Thread Василец Дмитрий
and livy https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/ On Fri, May 12, 2017 at 10:51 PM, Sam Elamin wrote: > Hi Nipun > > Have you checked out the job server > > https://github.com/spark-jobserver/spark-jobserver > > Regards > Sam > On Fri, 12 May

Re: Restful API Spark Application

2017-05-12 Thread Sam Elamin
Hi Nipun Have you checked out the job server https://github.com/spark-jobserver/spark-jobserver Regards Sam On Fri, 12 May 2017 at 21:00, Nipun Arora wrote: > Hi, > > We have written a java spark application (primarily uses spark sql). We > want to expand this to

Restful API Spark Application

2017-05-12 Thread Nipun Arora
Hi, We have written a Java Spark application (it primarily uses Spark SQL). We want to expand this to provide our application "as a service". For this, we are trying to write a REST API. While a simple REST API can be made easily, and I can get Spark to run through the launcher, I wonder how the
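A minimal sketch of launching the existing application programmatically from a REST handler via SparkLauncher (shown in Scala for brevity; the Java API is the same). The jar path, main class, and master below are placeholders:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Launch the existing Spark SQL application and keep a handle so the
    // REST layer can report status. Paths and class names are hypothetical.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/our-spark-app.jar")
      .setMainClass("com.example.OurSparkApp")
      .setMaster("yarn")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // handle.getState and handle.getAppId can be polled and returned to REST clients.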

Re: Reading Avro messages from Kafka using Structured Streaming in Spark 2.1

2017-05-12 Thread Michael Armbrust
I believe that Avro/Kafka messages have a few bytes at the beginning of the message to denote which schema is being used. Have you tried using the KafkaAvroDecoder inside of the map instead? On Fri, May 12, 2017 at 9:26 AM, Revin Chalil wrote: > Just following up on this;
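A minimal sketch of decoding the Confluent wire format by hand inside a map, assuming a known reader schema and the standard 5-byte header (1 magic byte + 4-byte schema id); the names here are hypothetical:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory

    val schema = new Schema.Parser().parse(schemaJson)   // schemaJson: the Avro schema as a JSON string
    val reader = new GenericDatumReader[GenericRecord](schema)

    // Skip the 5-byte Confluent header, then decode the Avro payload.
    def decode(bytes: Array[Byte]): GenericRecord = {
      val decoder = DecoderFactory.get().binaryDecoder(bytes, 5, bytes.length - 5, null)
      reader.read(null, decoder)
    }

    // e.g. kafkaDf.select($"value").as[Array[Byte]].map(b => decode(b).toString)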

Re: Convert DStream into Streaming Dataframe

2017-05-12 Thread Michael Armbrust
Are there any particular things that the DataFrame or Dataset APIs are missing? On Fri, May 12, 2017 at 9:49 AM, Tejinder Aulakh wrote: > Hi, > > Is there any way to convert a DStream to a streaming dataframe? I want to > use Structured streaming in a new common module

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-12 Thread Dirceu Semighini Filho
Hi Mathew, thanks for answering this, I've also tried with a simple case class and it works fine. I'm using this case class structure, which is failing:

    import java.text.SimpleDateFormat
    import java.util.Calendar
    import scala.annotation.tailrec

    trait TabbedToString { _: Product => override

Re: Convert DStream into Streaming Dataframe

2017-05-12 Thread Tathagata Das
Unfortunately, no. DStreams and streaming DataFrames are so different in their abstractions and implementations that there is no way to convert them. On Fri, May 12, 2017 at 9:49 AM, Tejinder Aulakh wrote: > Hi, > > Is there any way to convert a DStream to a streaming
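In practice this means rewriting the source as a Structured Streaming read rather than wrapping the existing DStream. A minimal sketch, assuming a Kafka source with hypothetical broker and topic names:

    // Structured Streaming equivalent of a Kafka DStream source (Spark 2.1+).
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // hypothetical brokers
      .option("subscribe", "events")                       // hypothetical topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")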

Question on whether to use Java 8 or Scala for writing Spark applications

2017-05-12 Thread raghavendran_c
Hi, Our organization is a Java shop. All of our developers are used to Java 7 and are gearing up to migrate to Java 8. They have a fair knowledge of Hadoop and MapReduce and are planning to learn Spark. With Java 8 available (with the conciseness of its lambda expressions), is it still beneficial to

Convert DStream into Streaming Dataframe

2017-05-12 Thread Tejinder Aulakh
Hi, Is there any way to convert a DStream to a streaming DataFrame? I want to use Structured Streaming in a new common module that I'm developing. The existing code uses DStreams, so I'm trying to figure out how to convert a DStream to a streaming DataFrame. The documentation only describes how to read

RE: Reading Avro messages from Kafka using Structured Streaming in Spark 2.1

2017-05-12 Thread Revin Chalil
Just following up on this; would appreciate any responses on this. Thanks. From: Revin Chalil [mailto:rcha...@expedia.com] Sent: Wednesday, May 10, 2017 11:21 PM To: user@spark.apache.org Subject: Reading Avro messages from Kafka using Structured Streaming in Spark 2.1 I am trying to convert

Re: Spark Shuffle Encryption

2017-05-12 Thread Marcelo Vanzin
http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior All the options you need to know are there. On Fri, May 12, 2017 at 9:11 AM, Shashi Vishwakarma wrote: > Hi > > I was doing research on encrypting spark shuffle data and found that Spark > 2.1 has
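For reference, a minimal sketch of the relevant settings as of Spark 2.1, set programmatically (the values shown are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    // Enables encryption of shuffle and other data Spark writes to local disk;
    // keys are generated per application.
    val conf = new SparkConf()
      .set("spark.io.encryption.enabled", "true")
      .set("spark.io.encryption.keySizeBits", "128")            // 128, 192, or 256
      .set("spark.io.encryption.keygen.algorithm", "HmacSHA1")  // the default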

Spark Shuffle Encryption

2017-05-12 Thread Shashi Vishwakarma
Hi, I was doing research on encrypting Spark shuffle data and found that Spark 2.1 has got that feature. https://issues.apache.org/jira/browse/SPARK-5682 Does anyone have more documentation around it? How do I use this feature in a real production environment, keeping in mind that I need to

Re: BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-12 Thread Yanbo Liang
Yeah, for binary data, you can also use MulticlassClassificationEvaluator to evaluate other metrics which BinaryClassificationEvaluator doesn't cover, such as accuracy, f1, weightedPrecision and weightedRecall. Thanks Yanbo On Thu, May 11, 2017 at 10:31 PM, Lan Jiang
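A minimal sketch, assuming a predictions DataFrame with the default label and prediction columns:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // Works for binary labels too; metricName can be "accuracy", "f1",
    // "weightedPrecision", or "weightedRecall".
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracy = evaluator.evaluate(predictions)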

Re: Spark <--> S3 flakiness

2017-05-12 Thread Gene Pang
Hi, Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post on Spark + Alluxio + S3, and here is some documentation for configuring Alluxio + S3
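Once an S3 bucket is mounted into the Alluxio namespace, Spark simply reads and writes alluxio:// paths. A minimal sketch with a hypothetical master address and mount point (it assumes the Alluxio client jar is on Spark's classpath):

    // Read from and write back to S3 through Alluxio's namespace.
    val df = spark.read.parquet("alluxio://alluxio-master:19998/s3/input")
    df.write.parquet("alluxio://alluxio-master:19998/s3/output")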

Re: Best Practice for Enum in Spark SQL

2017-05-12 Thread Anastasios Zouzias
Hi Mike, FYI: If you are using Spark 2.x, you might have issues with encoders if you use a case class with an Enumeration-typed field, see https://issues.apache.org/jira/browse/SPARK-17248 For (1) and (2), I would guess Int would be better (space-wise), but I am not familiar with Parquet's internals.
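A minimal sketch of that workaround with a hypothetical Color enumeration: keep the stored field as an Int (which Spark can encode) and convert at the edges:

    // Hypothetical enumeration; Spark's encoders cannot handle Enumeration fields directly (SPARK-17248).
    object Color extends Enumeration { val Red, Green, Blue = Value }

    // Store the id rather than the Enumeration value so the case class stays encodable.
    case class Record(id: Long, colorId: Int)

    val r = Record(1L, Color.Red.id)
    val c: Color.Value = Color(r.colorId)   // Enumeration#apply(id) recovers the value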

Re: GraphX subgraph from list of VertexIds

2017-05-12 Thread Robineast
it would be listVertices.contains(vid), wouldn't it? - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action
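A minimal sketch of that vertex predicate in graph.subgraph, assuming listVertices is a collection of VertexIds (converting it to a Set first keeps the lookup cheap):

    import org.apache.spark.graphx._

    // Keep only vertices whose id appears in the list; edges losing an endpoint are dropped too.
    val keep: Set[VertexId] = listVertices.toSet
    val sub = graph.subgraph(vpred = (vid, _) => keep.contains(vid))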

Re: CROSSVALIDATION and hypothetical fail

2017-05-12 Thread Jörn Franke
Use several jobs and orchestrate them, e.g. via Oozie. These jobs can then save intermediate results to disk and load them from there. Alternatively (or additionally!) you may use persist (to memory and disk), but I am not sure this is suitable for such long-running applications. > On 12. May
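A minimal sketch of the first suggestion (shown in Scala; the PySpark equivalent is analogous), with hypothetical paths and a placeholder prepare() step: write each intermediate result to durable storage so a failed run can resume from the last completed step instead of recomputing everything:

    // Job 1: do the expensive preparation once and persist it durably.
    val prepared = prepare(rawDf)   // prepare() is a placeholder for your feature pipeline
    prepared.write.mode("overwrite").parquet("hdfs:///checkpoints/prepared")

    // Job 2 (or a restarted run): pick up from the saved result instead of recomputing.
    val resumed = spark.read.parquet("hdfs:///checkpoints/prepared")
    // ...run the grid search / cross-validation on `resumed`...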

CROSSVALIDATION and hypothetical fail

2017-05-12 Thread issues solution
Hi, we often perform a grid search and cross-validation under PySpark to find the best parameters, but sometimes you hit an error that is related not to the computation but to the network or anything else. How can we save intermediate results, particularly when you have a large process running for 3 or 4 days