Re: Lost leader exception in Kafka Direct for Streaming

2015-10-21 Thread Cody Koeninger
You can try running the driver in the cluster manager with --supervise, but that's basically the same as restarting it when it fails. There is no reasonable automatic "recovery" when something is fundamentally wrong with your kafka cluster. On Wed, Oct 21, 2015 at 12:46 AM, swetha kasireddy <

Can we add an unsubscribe link in the footer of every email?

2015-10-21 Thread Nicholas Chammas
Every week or so someone emails the list asking to unsubscribe. Of course, that's not the right way to do it. You're supposed to email a different address than this one to unsubscribe, yet this is not in-your-face obvious, so many people miss it. And

How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
Hello All, I have a Spark Streaming job that should do some action only if the RDD is not empty. This can be done easily with a Spark batch RDD, as I could .take(1) and check whether it is empty or not. But this cannot be done with a Spark Streaming DStream JavaPairInputDStream

Re: Can we add an unsubscribe link in the footer of every email?

2015-10-21 Thread Ted Yu
The number of such occurrences is low. I don't think we currently need to add the footer. I checked several other Apache projects whose user@ lists I subscribe to - there is no such footer. Cheers On Wed, Oct 21, 2015 at 7:38 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: >

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
I tried the code below, but it still carries out the action even though there is no new data. JavaPairInputDStream input = ssc.fileStream(iFolder, LongWritable.class, Text.class, TextInputFormat.class); if (input != null) { // do some action if it is not empty } On 21 October 2015 at

Spark 1.5.1 with Hive 0.13.1

2015-10-21 Thread Sébastien Rainville
Hi, I'm trying to get Spark 1.5.1 to work with Hive 0.13.1. I set the following properties in spark-defaults.conf: spark.sql.hive.metastore.version 0.13.1 spark.sql.hive.metastore.jars /usr/lib/hadoop/client/*:/opt/hive/current/lib/* but I get the following exception when launching the shell:

Re: [Spark Streaming] Design Patterns forEachRDD

2015-10-21 Thread Sandip Mehta
Does this help ? final JavaHBaseContext hbaseContext = new JavaHBaseContext(javaSparkContext, conf); customerModels.foreachRDD(new Function() { private static final long serialVersionUID = 1L; @Override public Void call(JavaRDD currentRDD) throws Exception {

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Ali Tajeldin EDU
Furthermore, even adding aliasing as suggested by the warning doesn't seem to help either. Slight modification to example below: > scala> val largeValues = df.filter('value >= 10).as("lv") And just looking at the join results: > scala> val j = smallValues > .join(largeValues,

[Spark Streaming] Design Patterns forEachRDD

2015-10-21 Thread Nipun Arora
Hi All, Can anyone provide a design pattern for the following code shown in the Spark User Manual, in Java? I have the same exact use case, and for some reason the design pattern for Java is missing. The Scala version was taken from:

Re: can I use Spark as alternative for gem fire cache ?

2015-10-21 Thread Jags Ramnarayanan
Kali, This is possible depending on the access pattern of your ETL logic. If you only read (no point mutations) and you can pay the additional price of having to scan your dimension data each time you have to look something up, then Spark could work out. Note that a KV RDD isn't really a Map
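
For illustration, a minimal Scala sketch of what a scan-style lookup against a cached key-value RDD looks like (the dimension data and key are made up; a KV RDD is not a hash map, so lookups without a known partitioner touch the whole dataset):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dim-lookup").setMaster("local[*]"))

// Illustrative dimension data cached as a key-value RDD.
val dimensions = sc.parallelize(Seq(("cust-1", "Alice"), ("cust-2", "Bob"))).cache()

// A "lookup" is really a filter, i.e. a scan over the cached partitions.
val name = dimensions.filter { case (k, _) => k == "cust-1" }.values.collect().headOption

// PairRDDFunctions.lookup behaves similarly: it only narrows the search to a
// single partition when the RDD has a known partitioner; otherwise it also scans.
val viaLookup = dimensions.lookup("cust-1")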

Re: Using spark in cluster mode

2015-10-21 Thread Jacek Laskowski
Hi, Start here -> http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds and then hop to http://spark.apache.org/docs/latest/spark-standalone.html. Once done, come back with your questions. I think it's gonna help a lot. Pozdrawiam, Jacek -- Jacek

Re: How to check whether the RDD is empty or not

2015-10-21 Thread Tathagata Das
What do you mean by checking whether a "DStream is empty"? A DStream represents an endless stream of data, and checking whether it is empty at a point in time does not make sense. FYI, there is RDD.isEmpty(). On Wed, Oct 21, 2015 at 10:03 AM, diplomatic Guru wrote: >

Mapping to multiple groups in Apache Spark

2015-10-21 Thread Jeffrey Richley
I am in a situation where I am using Apache Spark and its map/reduce functionality. I am now at a stage where I have been able to map to a data set that conceptually has many "rows" of data. Now what I need is to do a reduce, which usually is a straightforward thing. My real need though is

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
Unfortunately, the mechanisms that we use to differentiate columns automatically don't work particularly well in the presence of self joins. However, you can get it work if you use the $"column" syntax consistently: val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key",
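
For reference, a minimal Scala sketch of that suggestion (the "key"/"value" column names, the value-10 threshold, and the "sv"/"lv" aliases are assumed from the thread, not confirmed by it):

import sqlContext.implicits._   // provides the $"..." column syntax (sqlContext assumed in scope)

val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

// Alias each side of the self-join and stick to $"alias.column" throughout,
// instead of df("column"), which binds eagerly to one side of the join.
val smallValues = df.filter($"value" < 10).as("sv")
val largeValues = df.filter($"value" >= 10).as("lv")

val joined = smallValues.join(largeValues, $"sv.key" === $"lv.key")
joined.select($"sv.key", $"sv.value".as("small"), $"lv.value".as("large")).show()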

Re: Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Cody Koeninger
The RDD partitions are 1:1 with Kafka topic-partitions, so you can use the offset ranges to figure out which topic a given RDD partition is for and proceed accordingly. See the Kafka integration guide in the Spark Streaming docs for more details, or https://github.com/koeninger/kafka-exactly-once As
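
A minimal Scala sketch of that approach (the broker address, topic names, and the keepFor/send helpers are placeholders, not anything from the thread):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("topic-filter"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker
val topics = Set("topicA", "topicB")                              // placeholder topics

// Placeholder hooks; replace with the real topic-specific filter and producer call.
def keepFor(topic: String, record: (String, String)): Boolean = true
def send(topic: String, record: (String, String)): Unit = ()

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Offset ranges are indexed the same way as the RDD's partitions (1:1 with Kafka).
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // The partition id must be read here, inside foreachPartition, i.e. on the executor.
    val range = offsetRanges(TaskContext.get.partitionId)
    val topic = range.topic
    iter.filter(keepFor(topic, _)).foreach(send(topic, _))
  }
}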

Re: dataframe average error: Float does not take parameters

2015-10-21 Thread Ali Tajeldin EDU
Which version of Spark are you using? I just tried the example below on 1.5.1 and it seems to work as expected: scala> val res = df.groupBy("key").count.agg(min("count"), avg("count")) res: org.apache.spark.sql.DataFrame = [min(count): bigint, avg(count): double] scala> res.show

Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Dave Ariens
Hey folks, I have a very large number of Kafka topics (many thousands of partitions) that I want to consume, filter based on topic-specific filters, then produce back to filtered topics in Kafka. Using the receiver-less based approach with Spark 1.4.1 (described

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Ajay Chander
Thanks for your kind inputs. Right now I am running Spark 1.3.1 on YARN (4-node cluster) on a Hortonworks distribution. Now I want to upgrade Spark 1.3.1 to Spark 1.5.1. So at this point of time, do I have to manually go and copy the spark-1.5.1 tarball to all the nodes, or is there any alternative so

dataframe average error: Float does not take parameters

2015-10-21 Thread Carol McDonald
This used to work: // What's the min number of bids per item? What's the average? What's the max? auction.groupBy("item", "auctionid").count.agg(min("count"), avg("count"), max("count")).show // MIN(count) AVG(count) MAX(count) // 1 16.992025518341308 75 but this now gives an error val

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-21 Thread Jacek Laskowski
Hi Holden, What a great idea! I'd love to join, but since I'm in Europe it's not gonna happen by this Fri. Any plans to visit Europe or perhaps Warsaw, Poland and host office hours here? ;-) p.s. What about a virtual event with Google Hangouts on Air? Pozdrawiam, Jacek -- Jacek Laskowski |

Re: How to check whether the RDD is empty or not

2015-10-21 Thread Gerard Maas
As TD mentions, there's no such thing as an 'empty DStream'. Some intervals of a DStream could be empty, in which case the related RDD will be empty. This means that you should express such a condition based on the RDDs of the DStream. Translated into code: dstream.foreachRDD{ rdd => if
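
Putting TD's RDD.isEmpty() pointer and the per-RDD check together, a minimal Scala sketch (the ssc streaming context, the input directory, and the action are placeholders):

val dstream = ssc.textFileStream("hdfs:///some/input/dir")   // placeholder source

dstream.foreachRDD { rdd =>
  // RDD.isEmpty() is cheap relative to count(): it only looks for a first element.
  if (!rdd.isEmpty()) {
    // do some action only when this batch interval actually produced data
    println(s"processing ${rdd.count()} records")
  }
}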

Re: dataframe average error: Float does not take parameters

2015-10-21 Thread Carol McDonald
version 1.3.1 scala> auction.printSchema root |-- auctionid: string (nullable = true) |-- bid: float (nullable = false) |-- bidtime: float (nullable = false) |-- bidder: string (nullable = true) |-- bidderrate: integer (nullable = true) |-- openbid: float (nullable = false) |--

Re: How to check whether the RDD is empty or not

2015-10-21 Thread diplomatic Guru
Tathagata, thank you for the response. I have two receivers in my Spark Streaming job; one reads an endless stream of data from Flume and the other reads data from an HDFS directory. However, files do not get moved into HDFS frequently (let's say they get moved every 10 minutes). This is where I need to

spark streaming 1.51. uses very old version of twitter4j

2015-10-21 Thread Andy Davidson
While digging around the Spark source today I discovered it depends on version 3.0.3 of twitter4j. This version was released on Dec 2, 2012. I noticed that the current version is 4.0.4, released on 6/23/2015. I am not aware of any particular problems. Are there any plans to upgrade? What is

Poor use cases for Spark

2015-10-21 Thread Ben Thompson
Hello, I'm interested in hearing use cases and parallelism problems where Spark was *not* a good fit for you. This is an effort to understand the limits of MapReduce style parallelism. Some broad things that pop out: -- recursion -- problems where the task graph is not known ahead of time --

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-21 Thread Holden Karau
Probably no trips to Warsaw planned by me in the next little while, but a few people have asked for a hangouts office hours. I'll try and schedule one after Spark Summit Europe :) On Wed, Oct 21, 2015 at 11:54 AM, Jacek Laskowski wrote: > Hi Holden, > > What a great idea! I'd

Slow activation using Spark Streaming's new receiver scheduling mechanism

2015-10-21 Thread Budde, Adam
Hi all, My team uses Spark Streaming to implement the batch processing component of a lambda architecture with 5 min intervals. We process roughly 15 TB/day using three discrete Spark clusters and about 250 receivers per cluster. We've been having some issues migrating our platform from Spark

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Artem Ervits
You can use these steps http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/ 1.5.1 is not officially supported yet but should be coming in a month or so. On Oct 21, 2015 1:56 PM, "Ajay Chander" wrote: > Thanks for your kind inputs. Right now

Distributed caching of a file in SPark Streaming

2015-10-21 Thread swetha
Hi, I need to cache a file in a distributed fashion, like the Hadoop Distributed Cache, and be able to use it when needed. Is doing the following the right way to do it? Also, by doing SparkFiles.get(fileName), would it just give all the contents in the form of a String?
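
A minimal Scala sketch of that pattern (the file path is illustrative). Note that SparkFiles.get(fileName) returns the local path of the distributed copy on each executor, not the file contents as a String; you still read the file yourself:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}
import scala.io.Source

val sc = new SparkContext(new SparkConf().setAppName("dist-cache"))
sc.addFile("hdfs:///config/lookup.txt")          // shipped to every executor

val filtered = sc.parallelize(1 to 100).mapPartitions { part =>
  // Resolve the local copy by file name on the worker and read it there.
  val localPath = SparkFiles.get("lookup.txt")
  val keep = Source.fromFile(localPath).getLines().toSet
  part.filter(i => keep.contains(i.toString))
}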

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-21 Thread Luciano Resende
On Tue, Oct 20, 2015 at 3:55 PM, Holden Karau wrote: > Hi SF based folks, > > I'm going to try doing some simple office hours this Friday afternoon > outside of Paramo Coffee. If no one comes by I'll just be drinking coffee > hacking on some Spark PRs so if you just want to

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
Yeah, I was suggesting that you avoid using org.apache.spark.sql.DataFrame.apply(colName: String) when you are working with self-joins, as it eagerly binds to a specific column in a way that breaks when we do the rewrite of one side of the query. Using the apply method constructs a resolved column

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
Thanks Michael and Ali for the reply! I'll make sure to use unresolved columns when working with self-joins then. As pointed out by Ali, isn't there still an issue with the aliasing? It works when using the org.apache.spark.sql.functions.col(colName: String) method, but not when using

RE: Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Dave Ariens
Cody, First off, thanks for your contributions and blog post, which I actually linked to in my original question. You'll have to forgive me as I've only been using Spark and writing Scala for a few days. I'm aware that the RDD partitions are 1:1 with Kafka topic partitions and you can get the offset

Re: Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Cody Koeninger
Yeah, that's the general idea. Regarding the question in your code comments ... The code inside of foreachPartition is what's running on the executor. It wouldn't make any sense to try to get a partition ID before that block. On Wed, Oct 21, 2015 at 4:07 PM, Dave Ariens

--jars option not working for spark on Mesos in cluster mode

2015-10-21 Thread Virag Kothari
Hi, I am trying to run a Spark job on Mesos in cluster mode using the following command ./bin/spark-submit --deploy-mode cluster --master mesos://172.17.0.1:7077 --jars http://172.17.0.2:18630/mesos/extraJars.jar --class MyClass http://172.17.0.2:18630/mesos/foo.jar The application jar

Re: spark streaming 1.51. uses very old version of twitter4j

2015-10-21 Thread Luciano Resende
Thanks for catching that, I have created a JIRA to track it, and hopefully I can submit a fix for the next release https://issues.apache.org/jira/browse/SPARK-11245 On Wed, Oct 21, 2015 at 1:11 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > While digging around the spark source

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Holden Karau
On Wednesday, October 21, 2015, Mark Vervuurt wrote: > Hi Holden, > > Thanks for the information, I think that a Java Base Class in order to > test SparkStreaming using Java would be useful for the community. > Unfortunately not all of our customers are willing to use

Spark_sql

2015-10-21 Thread Ajay Chander
Hi Everyone, I have a use case where I have to create a DataFrame inside the map() function. To create a DataFrame, it needs a sqlContext or hiveContext. Now how do I pass the context to my map function? And I am doing it in Java. I tried creating a class "TestClass" which implements "Function

Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Ajay Chander
Hi Saisai, Thanks for your time. I have followed your inputs and downloaded "spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I ran a Pi test everything seemed to be working fine, except that the Spark history server running on this node1 has gone down. It was complaining about

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
How did you start the history server? Do you still use the history server of 1.3.1, or did you start the history server from 1.5.1? The Spark tarball you used is the community version, so the Application Timeline Server based history provider is not supported; you could comment out this configuration

Re: Getting info from DecisionTreeClassificationModel

2015-10-21 Thread sethah
I believe this question will give you the answer you're looking for: Decision Tree Accuracy. Basically, you can traverse the tree from the root node.
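
A minimal Scala sketch of that traversal (it assumes a fitted model is already in scope; the printed fields are just examples of what the nodes expose):

import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Recursively walk the tree, printing split information for internal nodes
// and the prediction for leaves.
def describe(node: Node, depth: Int = 0): Unit = node match {
  case n: InternalNode =>
    println(" " * depth + s"split on feature ${n.split.featureIndex} (gain ${n.gain})")
    describe(n.leftChild, depth + 2)
    describe(n.rightChild, depth + 2)
  case l: LeafNode =>
    println(" " * depth + s"predict ${l.prediction} (impurity ${l.impurity})")
}

// model: a fitted org.apache.spark.ml.classification.DecisionTreeClassificationModel
// describe(model.rootNode)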

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
Hi Ajay, You don't need to copy the tarball to all the nodes; the one node from which you want to run the Spark application is enough (usually the master node). YARN will help to distribute the Spark dependencies. The link I mentioned before is the one you could follow; please read my previous mail. Thanks Saisai

java.util.NoSuchElementException: key not found error

2015-10-21 Thread Sourav Mazumder
In 1.5.0, if I use randomSplit on a data frame I get this error. Here is the code snippet - val splitData = merged.randomSplit(Array(70,30)) val trainData = splitData(0).persist() val testData = splitData(1) trainData.registerTempTable("trn") %sql select * from trn The exception goes like this

problems with spark 1.5.1 streaming TwitterUtils.createStream()

2015-10-21 Thread Andy Davidson
Hi, I want to use Twitter's public streaming API to follow a set of IDs. I want to implement my driver using Java. The current TwitterUtils is a wrapper around twitter4j and does not expose the full Twitter streaming API. I started by digging through the source code. Unfortunately I do not know

Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-21 Thread Conor Fennell
Hi, Firstly want to say a big thanks to Cody for contributing the kafka direct stream. I have been using the receiver based approach for months but the direct stream is a much better solution for my use case. The job in question is now ported over to the direct stream doing idempotent outputs

Re: java.util.NoSuchElementException: key not found error

2015-10-21 Thread Josh Rosen
This is https://issues.apache.org/jira/browse/SPARK-10422, which has been fixed in Spark 1.5.1. On Wed, Oct 21, 2015 at 4:40 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.0 if I use randomSplit on a data frame I get this error. > > Here is teh code snippet - > > val

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
Ok, got it. Thanks a lot Michael for the detailed reply! On Oct 21, 2015 1:54 PM, "Michael Armbrust" wrote: > Yeah, I was suggesting that you avoid using > org.apache.spark.sql.DataFrame.apply(colName: > String) when you are working with selfjoins as it eagerly binds to

spark SQL thrift server - support for more features via jdbc (table catalog)

2015-10-21 Thread rkrist
Hello, support for some operations seems to be missing in the Spark SQL Thrift Server. To be more specific - when connected to our Spark SQL instance (1.5.1, standalone deployment) from a standard JDBC SQL client (SQuirreL SQL and a few others) via the Thrift Server, SQL query processing seems to

how to use Trees and ensembles: class probabilities

2015-10-21 Thread r7raul1...@163.com
How do I use trees and ensembles with class probabilities in Spark 1.5.0? Any example or document? r7raul1...@163.com

Re: Spark_sql

2015-10-21 Thread Ted Yu
I don't think passing sqlContext to map() is supported. Can you describe your use case in more detail? Why do you need to create a DataFrame inside the map() function? Cheers On Wed, Oct 21, 2015 at 6:32 PM, Ajay Chander wrote: > Hi Everyone, > > I have a use case where

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Frans Thamura
Talking about Spark in HDP: is there a reference for SparkR, and what version should we install in R? -- Frans Thamura (曽志胜) Java Champion Shadow Master and Lead Investor Meruvian. Integrated Hypermedia Java Solution Provider. Mobile: +628557888699 Blog:

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
SparkR is shipped with the Hortonworks version of Spark 1.4.1; there's no difference compared to the community version, so you could refer to the docs of Apache Spark. It would be better to ask HDP-related questions in ( http://hortonworks.com/community/forums/forum/spark/ ). Sorry for not being so familiar with

Getting info from DecisionTreeClassificationModel

2015-10-21 Thread rake
I’m trying to use Spark ml to create a classification tree model and examine the resulting model. I have managed to create a DecisionTreeClassificationModel (class org.apache.spark.ml.classification.DecisionTreeClassificationModel), but have not been able to obtain basic information from the

Re: Issue in spark batches

2015-10-21 Thread Tathagata Das
Unfortunately, you will have to write that code yourself. TD On Tue, Oct 20, 2015 at 11:28 PM, varun sharma wrote: > Hi TD, > Is there any way in spark I can fail/retry batch in case of any > exceptions or do I have to write code to explicitly keep on retrying? >
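
There is no built-in knob for this, so as TD says it has to be hand-rolled; a rough Scala sketch of one way to retry the output action of a batch (it assumes the write is idempotent, and dstream/writePartition are placeholders):

def withRetries[T](maxAttempts: Int)(body: => T): T = {
  var attempt = 0
  var result: Option[T] = None
  while (result.isEmpty) {
    attempt += 1
    try {
      result = Some(body)
    } catch {
      // Retry while attempts remain; once exhausted the exception propagates
      // and the batch (and eventually the job) fails.
      case e: Exception if attempt < maxAttempts =>
        println(s"attempt $attempt failed: ${e.getMessage}; retrying")
    }
  }
  result.get
}

dstream.foreachRDD { rdd =>
  withRetries(maxAttempts = 3) {
    rdd.foreachPartition(partition => writePartition(partition))   // writePartition is a placeholder
  }
}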

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-21 Thread Sebastian Nadorp
What we're trying to achieve is a fast way of testing the validity of our SQL queries within unit tests without going through the time-consuming task of setting up a Hive test context. If there is any way to speed this step up, any help would be appreciated. Thanks, Sebastian *Sebastian Nadorp*

How to use And Operator in filter (PySpark)

2015-10-21 Thread Jeff Zhang
I can do it with the Scala API, but I'm not sure what the syntax is in PySpark (I didn't find it in the Python API). Here's what I tried; both failed: >>> df.filter(df.age>3 & df.name=="Andy").collect() >>> df.filter(df.age>3 and df.name=="Andy").collect() -- Best Regards Jeff Zhang

Spark-Testing-Base Q/A

2015-10-21 Thread Mark Vervuurt
Hi Everyone, I am busy trying out ‘Spark-Testing-Base’. I have the following questions: Can you test Spark Streaming jobs using Java? Can I use Spark-Testing-Base 1.3.0_0.1.1 together with Spark 1.3.1? Thanks. Greetings, Mark

Re: spark-shell (1.5.1) not starting cleanly on Windows.

2015-10-21 Thread Steve Loughran
You've hit this: https://wiki.apache.org/hadoop/WindowsProblems The next version of Hadoop will fail with a more useful message, including that wiki link. On 21 Oct 2015, at 00:36, Renato Perini wrote: java.lang.RuntimeException:

Re: Job splling to disk and memory in Spark Streaming

2015-10-21 Thread Adrian Tanase
+1 - you can definitely make it work by making sure you are using the same partitioner (including the same number of partitions). For most operations like reduceByKey and updateStateByKey, simply specifying it is enough. There are some gotchas for other operations: * mapValues and

Spark on Yarn

2015-10-21 Thread Raghuveer Chanda
Hi all, I am trying to run Spark on YARN in the Cloudera QuickStart VM. It already has Spark 1.3 and Hadoop 2.6.0-cdh5.4.0 installed. (I am not using spark-submit since I want to run a different version of Spark.) I am able to run Spark 1.3 on YARN but get the below error for Spark 1.4. The log shows

Problem with applying Multivariate Gaussian Model

2015-10-21 Thread Eyal Sharon
Hi, I have been trying to apply an anomaly detection model using Spark MLlib. I am using this library: org.apache.spark.mllib.stat.distribution.MultivariateGaussian As input, I give the model a mean vector and a covariance matrix, assuming my features have covariance, hence the covariance
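
For reference, a minimal Scala sketch of constructing and querying the distribution (all numbers are made up; the covariance matrix must be symmetric positive semi-definite):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

val mu    = Vectors.dense(1.0, 2.0)
// Column-major 2x2 covariance matrix.
val sigma = Matrices.dense(2, 2, Array(1.0, 0.3, 0.3, 2.0))

val gaussian = new MultivariateGaussian(mu, sigma)

val x = Vectors.dense(1.2, 1.8)
val density    = gaussian.pdf(x)      // density at x
val logDensity = gaussian.logpdf(x)   // usually nicer numerically for thresholding anomalies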

RE: Spark on Yarn

2015-10-21 Thread Jean-Baptiste Onofré
Hi, The compiled version (master side) and the client version diverge on Spark network JavaUtils. You should use the same/aligned version. Regards, JB Sent from my Samsung device Original message From: Raghuveer Chanda Date: 21/10/2015 12:33

Inner Joins on Cassandra RDDs

2015-10-21 Thread Priya Ch
Hello All, I have two Cassandra RDDs. I am using joinWithCassandraTable, which is doing a Cartesian join, because of which we are getting unwanted rows. How do I perform an inner join on Cassandra RDDs? If I intend to use a normal join, I have to read the entire table, which is costly. Is there any

Re: Spark on Yarn

2015-10-21 Thread Raghuveer Chanda
Hi, So does this mean I can't run a Spark 1.4 fat jar on YARN without installing Spark 1.4? I am including Spark 1.4 in my pom.xml, so doesn't that mean it's compiling against 1.4? On Wed, Oct 21, 2015 at 4:38 PM, Jean-Baptiste Onofré wrote: > Hi > > The compiled version (master

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Mark Vervuurt
Hi Holden, Thanks for the information. I think that a Java base class for testing Spark Streaming using Java would be useful for the community. Unfortunately not all of our customers are willing to use Scala or Python. If I am not wrong it’s 4:00 AM for you in California ;) Regards, Mark

Re: Spark on Yarn

2015-10-21 Thread Adrian Tanase
The question is whether the Spark dependency is marked as provided or is included in the fat jar. For example, we are compiling the Spark distro separately for Java 8 + Scala 2.11 + Hadoop 2.6 (with Maven) and marking it as provided in sbt. -adrian From: Raghuveer Chanda Date: Wednesday, October 21,
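
A minimal build.sbt sketch of the "provided" approach Adrian describes (artifact list and version are illustrative): Spark is available at compile time but kept out of the assembled fat jar.

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.1" % "provided"
)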

Re: Spark on Yarn

2015-10-21 Thread Raghuveer Chanda
Please find the attached pom.xml. I am using Maven to build the fat jar and trying to run it on YARN using *hadoop jar simple-yarn-app-master/target/simple-yarn-app-1.1.0-shaded.jar com.hortonworks.simpleyarnapp.Client hdfs://quickstart.cloudera:8020/simple-yarn-app-1.1.0-shaded.jar* Basically I

Re: Whether Spark is appropriate for our use case.

2015-10-21 Thread Adrian Tanase
Can you share your approximate data size? All should be valid use cases for Spark; I'm wondering if you are providing enough resources. Also - do you have some expectations in terms of performance? What does "slow down" mean? For this use case I would personally favor Parquet over a DB, and

Re: Issue in spark batches

2015-10-21 Thread varun sharma
Hi TD, Is there any way in Spark I can fail/retry a batch in case of any exceptions, or do I have to write code to explicitly keep on retrying? Also, if some batch fails, I want to block further batches from being processed, as it would create inconsistency in updating ZooKeeper offsets and maybe kill

Re: Job splling to disk and memory in Spark Streaming

2015-10-21 Thread Tathagata Das
Well, reduceByKey needs to shuffle if your intermediate data is not already partitioned in the same way as reduceByKey's partitioning. reduceByKey() has other signatures that take in a partitioner, or simply the number of partitions, so you can set the same partitioner as your previous stage.
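
A minimal Scala sketch of passing the same partitioner to both stages (the events stream, the partition count, and the update logic are placeholders):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(16)   // placeholder partition count

// events: DStream[(String, Long)] produced earlier in the job (assumed in scope)
val updateFunc = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

val totals  = events.updateStateByKey(updateFunc, partitioner)
// Reusing the same partitioner here keeps the data partitioned consistently,
// which is what avoids the extra shuffle/spill TD mentions.
val reduced = events.reduceByKey(_ + _, partitioner)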

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-21 Thread Ranadip Chatterjee
T3l, Did Sean Owen's suggestion help? If not, can you please share the behaviour? Cheers. On 20 Oct 2015 11:02 pm, "Lan Jiang" wrote: > I think the data file is binary per the original post. So in this case, > sc.binaryFiles should be used. However, I still recommend against