Re: Encoder with empty bytes deserializes with non-empty bytes

2018-02-21 Thread David Capwell
Ok, found my issue:

    case c if c == classOf[ByteString] => StaticInvoke(classOf[Protobufs], ArrayType(ByteType), "fromByteString", parent :: Nil)

should be

    case c if c == classOf[ByteString] => StaticInvoke(classOf[Protobufs], BinaryType, "fromByteString", parent :: Nil)

This causes the java
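
For reference, a hypothetical shape of the helper that the corrected case invokes; the real Protobufs class is the poster's own code. BinaryType corresponds to Array[Byte] in Spark's internal row format, so the conversion just unwraps the protobuf ByteString.

    import com.google.protobuf.ByteString

    object Protobufs {
      // Return the raw byte array that BinaryType expects, rather than an
      // ArrayType(ByteType) wrapper.
      def fromByteString(bs: ByteString): Array[Byte] =
        if (bs == null) null else bs.toByteArray
    }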

Encoder with empty bytes deserializes with non-empty bytes

2018-02-21 Thread David Capwell
I am trying to create an Encoder for protobuf data and noticed something rather weird. When we have an empty ByteString (not null, just empty) and we deserialize it, we get back a non-empty array of length 8. I took the generated code and see something weird going on. UnsafeRowWriter 1. public

Re: Return statements aren't allowed in Spark closures

2018-02-21 Thread Lian Jiang
Sorry Bryan. Unfortunately, this is not the root cause. Any other ideas? This is blocking my scenario. Thanks. On Wed, Feb 21, 2018 at 4:26 PM, Bryan Jeffrey wrote: > Lian, > > You're writing Scala. Just remove the 'return'. No need for it in Scala.

Consuming Data in Parallel using Spark Streaming

2018-02-21 Thread Vibhakar, Beejal
I am trying to process data from 3 different Kafka topics using 3 InputDStreams with a single StreamingContext. I am currently testing this under a sandbox where I see data processed from one Kafka topic followed by the other. Question #1: I want to understand, when I run this program in a Hadoop clu
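
The question is cut off, but as a minimal sketch of the multi-topic setup it describes (topic names, broker address, group id, and batch interval are all placeholders), one direct stream per topic can be created against a single StreamingContext; whether their batches actually run in parallel then depends on available executor cores and scheduling settings.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("multi-topic-demo")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",               // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "latest")

    // One direct stream per topic; each stream reads its own Kafka partitions.
    val topics = Seq("topicA", "topicB", "topicC")
    val streams = topics.map { t =>
      KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq(t), kafkaParams))
    }

    streams.foreach { stream =>
      stream.map(_.value).foreachRDD { rdd =>
        println(s"batch record count: ${rdd.count()}")    // per-topic processing goes here
      }
    }

    ssc.start()
    ssc.awaitTermination()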

I got weird error from a join

2018-02-21 Thread hsy...@gmail.com
    from pyspark.sql import Row
    A_DF = sc.parallelize([Row(id='A123', name='a1'), Row(id='A234', name='a2')]).toDF()
    B_DF = sc.parallelize([Row(id='A123', pid='A234', ename='e1')]).toDF()
    join_df = B_DF.join(A_DF, B_DF.id==A_DF.id).drop(B_DF.id)
    final_joi
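
The snippet is PySpark and is cut off before the error, so the failure itself isn't visible here. As a sketch of the same join in Scala (data as in the post), joining on a column-name Seq sidesteps the duplicate-id ambiguity that commonly trips up this join-then-drop pattern.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("join-demo").getOrCreate()
    import spark.implicits._

    val aDf = Seq(("A123", "a1"), ("A234", "a2")).toDF("id", "name")
    val bDf = Seq(("A123", "A234", "e1")).toDF("id", "pid", "ename")

    // Joining on a Seq of column names produces a single `id` column in the
    // result, so there is no ambiguous reference left to drop afterwards.
    val joined = bDf.join(aDf, Seq("id"))
    joined.show()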

Re: parquet vs orc files

2018-02-21 Thread Stephen Joung
In the case of parquet, the best source for me on configuring and ensuring "min/max statistics" was https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide --- I don't have any experience with orc. On Thu, Feb 22, 2018 at 6:59 AM, Kane Kim wrote: > Thanks, how does min/max index
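
As a minimal sketch of the parquet side (paths and column names are assumed): writing the data sorted on the filter column keeps each row group's min/max range narrow, so a pushed-down filter can skip whole row groups on read.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("parquet-stats-demo").getOrCreate()

    val df = spark.read.parquet("/data/events")          // placeholder input
    df.sortWithinPartitions("event_date")                // cluster rows by the filter column
      .write.mode("overwrite").parquet("/data/events_sorted")

    // filterPushdown is on by default; the filter below can then prune row
    // groups whose min/max range excludes the value.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    spark.read.parquet("/data/events_sorted")
      .filter("event_date = '2018-02-21'")
      .count()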

Re: Return statements aren't allowed in Spark closures

2018-02-21 Thread Bryan Jeffrey
Lian, You're writing Scala. Just remove the 'return'. No need for it in Scala. From: Lian Jiang Sent: Wednesday, February 21, 2018 4:16:08 PM To: user Subject: Return statements aren't allowed in Spark closures I c

Re: parquet vs orc files

2018-02-21 Thread Kane Kim
Thanks, how does min/max index work? Can spark itself configure bloom filters when saving as orc? On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke wrote: > In the latest version both are equally well supported. > > You need to insert the data sorted on filtering columns > Then you will benefit from

Re: parquet vs orc files

2018-02-21 Thread Jörn Franke
In the latest version both are equally well supported. You need to insert the data sorted on the filtering columns; then you will benefit from min/max indexes and, in the case of ORC, additionally from bloom filters, if you configure them. In any case I also recommend partitioning of files (do not confuse wit
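
A sketch of these suggestions for ORC under assumed paths and column names: sort on the column used in filters, request bloom filters via a writer option, and partition the output. "orc.bloom.filter.columns" is an ORC writer property; whether .option() forwards it to the ORC writer depends on the Spark/ORC version in use.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("orc-layout-demo").getOrCreate()

    val df = spark.read.parquet("/data/events")            // placeholder input

    df.sortWithinPartitions("user_id")                     // sort on the column used in filters
      .write
      .mode("overwrite")
      .option("orc.bloom.filter.columns", "user_id")       // ORC writer property (version-dependent)
      .partitionBy("event_date")                           // directory-level partitioning
      .orc("/data/events_orc")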

Return statements aren't allowed in Spark closures

2018-02-21 Thread Lian Jiang
I can run the below code in spark-shell using yarn client mode.

    val csv = spark.read.option("header", "true").csv("my.csv")
    def queryYahoo(row: Row): Int = { return 10; }
    csv.repartition(5).rdd.foreachPartition { p => p.foreach(r => { queryYahoo(r) }) }

However, the same code failed when run using s
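
A sketch of Bryan's suggestion applied to the snippet above (note the follow-up in this thread reports it did not resolve the failure): in Scala the last expression of a method is its result, so the explicit `return` can simply be dropped, and "Return statements aren't allowed in Spark closures" is the error Spark raises when it finds one inside a closure it processes.

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder.appName("no-return-demo").getOrCreate()

    // No `return`: the literal 10 is the method's result.
    def queryYahoo(row: Row): Int = 10

    val csv = spark.read.option("header", "true").csv("my.csv")
    csv.repartition(5).rdd.foreachPartition { p =>
      p.foreach(r => queryYahoo(r))
    }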

parquet vs orc files

2018-02-21 Thread Kane Kim
Hello, Which format is better supported in spark, parquet or orc? Will spark use internal sorting of parquet/orc files (and how to test that)? Can spark save sorted parquet/orc files? Thanks!

Re: Job never finishing

2018-02-21 Thread Nikhil Goyal
Thanks for the help :) On Tue, Feb 20, 2018 at 4:22 PM, Femi Anthony wrote: > You can use spark speculation as a way to get around the problem. > > Here is a useful link: > http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/ > > Sent
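
A sketch of enabling speculative execution as suggested above; the threshold values are illustrative, not recommendations.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("speculation-demo")
      .config("spark.speculation", "true")
      .config("spark.speculation.interval", "1s")      // how often to check for slow tasks
      .config("spark.speculation.multiplier", "2")     // a task is "slow" at 2x the median runtime
      .config("spark.speculation.quantile", "0.9")     // fraction of tasks that must finish before checking
      .getOrCreate()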

FINAL REMINDER: CFP for Apache EU Roadshow Closes 25th February

2018-02-21 Thread Sharan F
Hello Apache Supporters and Enthusiasts This is your FINAL reminder that the Call for Papers (CFP) for the Apache EU Roadshow is closing soon. Our Apache EU Roadshow will focus on Cloud, IoT, Apache Tomcat, Apache Http and will run from 13-14 June 2018 in Berlin. Note that the CFP deadline has

CSV use case

2018-02-21 Thread SNEHASISH DUTTA
Hi, I am using the Spark 2.2 CSV reader. I have data in the following format: 123|123|"abc"||""|"xyz", where || is null and "" is one blank character, as per the requirement. I was using the option sep as pipe and the option quote as "". I parsed the data and, using regex, was able to fulfill all the mentioned condit
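
A sketch of the read itself under assumptions (invented path, no header): the sample row is parsed with a pipe separator and the default double-quote character. On Spark 2.2 both the unquoted empty field (||) and the quoted empty string ("") typically come back as null, which is why the poster fell back to regex; newer releases expose an emptyValue CSV option to keep the two distinct.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pipe-csv-demo").getOrCreate()

    val df = spark.read
      .option("header", "false")
      .option("sep", "|")          // pipe-delimited input
      .option("quote", "\"")       // values wrapped in double quotes
      .csv("/data/sample.psv")     // placeholder path

    df.show(truncate = false)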

Re: Log analysis with GraphX

2018-02-21 Thread JB Data
Hi, interesting discussion, let me add my *shell* point of view. My focus is only prediction. To keep pure data scientists from crying wolf ("crier aux loups"), I point out how simple my *datayse* of the problem is: - No use of the button in the model, only page navigation. - User navigation 're-init' when clicking on a page ever c

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-21 Thread vermanurag
Try to_json on the vector column. That should do it.
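
The reply suggests to_json; as an alternative sketch (column names and paths are assumed), a UDF can flatten an ML Vector into a plain string before writing, since the CSV sink cannot serialize vector columns directly.

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder.appName("vector-to-csv-demo").getOrCreate()
    import spark.implicits._

    // Flatten the vector into a comma-separated string so the CSV writer accepts it.
    val vecToString = udf((v: Vector) => if (v == null) null else v.toArray.mkString(","))

    val df = spark.read.parquet("/data/features")      // placeholder input with a "features" vector column
    df.withColumn("features_str", vecToString($"features"))
      .drop("features")
      .write.mode("overwrite").option("header", "true").csv("/out/features_csv")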