Re: Best way to read XML data from RDD

2016-08-19 Thread Felix Cheung
Ah. Have you tried Jackson? https://github.com/FasterXML/jackson-dataformat-xml/blob/master/README.md From: Diwakar Dhanuskodi Sent: Friday, August 19, 2016 9:41 PM Subject: Re: Best way to read

Re: Best way to read XML data from RDD

2016-08-19 Thread Diwakar Dhanuskodi
Yes. It accepts an XML file as source but not an RDD. The XML data embedded inside JSON is streamed from a Kafka cluster, so I could get it as an RDD. Right now I am using the spark-xml XML.loadString method inside an RDD map function, but performance-wise I am not happy, as it takes 4 minutes to
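Setting the Spark machinery aside, the extract-and-parse step described above (an XML string embedded in each JSON record) can be sketched in plain Python for illustration; the `payload`/`order`/`id` field and element names are hypothetical, not from the thread:

```python
import json
import xml.etree.ElementTree as ET

def parse_record(line):
    """Parse one JSON record whose 'payload' field holds an XML string.

    This is the kind of function you would apply per record inside an
    RDD map; the field names here are made up for the example.
    """
    record = json.loads(line)                 # decode the JSON envelope
    root = ET.fromstring(record["payload"])   # parse the embedded XML
    return root.tag, root.attrib.get("id")

# e.g. each Kafka message is one JSON line like this:
line = '{"payload": "<order id=\\"42\\"><item>book</item></order>"}'
print(parse_record(line))  # prints ('order', '42')
```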

Re: Best way to read XML data from RDD

2016-08-19 Thread Felix Cheung
Have you tried https://github.com/databricks/spark-xml ? On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" > wrote: Hi, There is a RDD with json data. I could read json data using rdd.read.json . The json data has
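For reference, basic spark-xml usage looks roughly like this (a sketch against the Spark 1.6-era API; the `rowTag` value and file path are placeholders):

```scala
// minimal sketch of reading XML files with spark-xml;
// requires the --packages com.databricks:spark-xml_2.10:0.3.3 artifact (or the _2.11 build)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("xml-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")   // the XML element that maps to one row
  .load("books.xml")          // placeholder path

df.printSchema()
```

Note this reads XML from files, which is exactly the limitation raised in the reply: it does not take an existing RDD of XML strings as input.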

Re: How Spark HA works

2016-08-19 Thread Charles Nnamdi Akalugwu
I am experiencing this exact issue. Does anyone know what's going on with the ZooKeeper setup? On Jul 5, 2016 10:34 AM, "Akmal Abbasov" wrote: > > Hi, > I'm trying to understand how Spark HA works. I'm using Spark 1.6.1 and Zookeeper 3.4.6. > I've added the following line

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread chandan prakash
Ohh, that explains the reason. My use case does not need state management, so I guess I am better off without checkpointing. Thank you for the clarification. Regards, Chandan On Sat, Aug 20, 2016 at 8:24 AM, Cody Koeninger wrote: > Checkpointing is required to be turned on in

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread Cody Koeninger
Checkpointing is required to be turned on in certain situations (e.g. updateStateByKey), but you're certainly not required to rely on it for fault tolerance. I try not to. On Fri, Aug 19, 2016 at 1:51 PM, chandan prakash wrote: > Thanks Cody for the pointer. > > I am

Re: Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
I did some extra digging. Running the query "select column1 from myTable" I can reproduce the problem on a frame with a single row - it occurs exactly when the frame has more than 200 columns, which smells a bit like a hardcoded limit. Interestingly the problem disappears when replacing the query

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-19 Thread Everett Anderson
Hi! Just following up on this -- When people talk about a shared session/context for testing like this, I assume it's still within one test class. So it's still the case that if you have a lot of test classes that test Spark-related things, you must configure your build system to not run in them
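One common way to handle this, assuming an sbt build (other build tools have equivalents), is to stop test suites from running in parallel so only one SparkContext/SparkSession is live at a time:

```scala
// build.sbt (sbt 0.13.x syntax) — run test classes sequentially so
// Spark contexts created by different suites do not collide
parallelExecution in Test := false

// optionally fork a separate JVM for tests so Spark's global state
// is isolated from the build JVM
fork in Test := true
```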

Spark 2.0 regression when querying very wide data frames

2016-08-19 Thread mhornbech
Hi, We currently have some workloads in Spark 1.6.2 with queries operating on a data frame with 1500+ columns (17000 rows). This has never been quite stable, and some queries, such as "select *", would yield empty result sets, but queries restricted to specific columns have mostly worked. Needless

Re: "Schemaless" Spark

2016-08-19 Thread Sebastian Piu
You can do operations without a schema just fine; obviously, the more you know about your data, the more tools you will have. It is hard to say more without context on what you are trying to achieve. On Fri, 19 Aug 2016, 22:55 Efe Selcuk, wrote: > Hi Spark community, > > This is a

"Schemaless" Spark

2016-08-19 Thread Efe Selcuk
Hi Spark community, This is a bit of a high level question as frankly I'm not well versed in Spark or related tech. We have a system in place that reads columnar data in through CSV and represents the data in relational tables as it operates. It's essentially schema-based ETL. This restricts our

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
The OOM happens in the driver; you may also need more memory for the driver. On Fri, Aug 19, 2016 at 2:33 PM, Davies Liu wrote: > You are using lots of tiny executors (128 executors with only 2G > memory), could you try with bigger executors (for example 16G x 16)? > > On Fri, Aug

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
You are using lots of tiny executors (128 executors with only 2G memory); could you try with bigger executors (for example, 16G x 16)? On Fri, Aug 19, 2016 at 8:19 AM, Ben Teeuwen wrote: > > So I wrote some code to reproduce the problem. > > I assume here that a pipeline should
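The suggested change maps onto spark-submit flags roughly as follows (the counts and sizes are the ones proposed in the thread, not general tuning recommendations, and the app jar is a placeholder):

```shell
# before: many small executors
spark-submit --num-executors 128 --executor-memory 2g  your-app.jar

# suggested: fewer, larger executors (same total memory, bigger heaps)
spark-submit --num-executors 16  --executor-memory 16g your-app.jar
```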

Re: Spark SQL concurrent runs fails with java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

2016-08-19 Thread Davies Liu
The query failed to finish broadcast in 5 minutes, you could decrease the broadcast threshold (spark.sql.autoBroadcastJoinThreshold) or increase the conf: spark.sql.broadcastTimeout On Tue, Jun 28, 2016 at 3:35 PM, Jesse F Chen wrote: > > With the Spark 2.0 build from 0615,
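Both knobs can be set in spark-defaults.conf or via --conf on spark-submit; the values below are illustrative, not recommendations:

```
# bytes below which a table is broadcast for joins; lower it (or set -1
# to disable broadcast joins entirely) if broadcasts are too large
spark.sql.autoBroadcastJoinThreshold  10485760

# seconds to wait for a broadcast to complete (default 300)
spark.sql.broadcastTimeout            600
```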

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-19 Thread Richard M
I was using the 1.1 driver. I upgraded that library to 2.1 and it resolved my problem.

Best way to read XML data from RDD

2016-08-19 Thread Diwakar Dhanuskodi
Hi, There is an RDD with JSON data. I could read the JSON data using rdd.read.json. The JSON data has XML data in a couple of key-value pairs. Which is the best method to read and parse XML from an RDD? Are there any specific XML libraries for Spark? Could anyone help on this? Thanks.

Re: Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Mich Talebzadeh
Thanks. --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.11-1.6.1.jar sorted it out. Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Tathagata Das
You seem to be combining Scala 2.10 and 2.11 libraries: your sbt project is 2.11, whereas you are trying to pull in spark-streaming-kafka-assembly_2.10-1.6.1.jar. On Fri, Aug 19, 2016 at 11:24 AM, Mich Talebzadeh wrote: > Hi, > > My spark streaming app with 1.6.1

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-19 Thread Chang Lim
What command did you use to connect? Try this: beeline> !connect jdbc:hive2://localhost:1?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice On Thu, Aug 11, 2016 at 9:23 AM, Richard M [via Apache Spark User List] < ml-node+s1001560n27513...@n3.nabble.com> wrote: >

Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Mich Talebzadeh
Hi, My spark streaming app with 1.6.1 used to work. Now with scala> sc version res0: String = 2.0.0 Compiling with sbt assembly as before, with the following: version := "1.0", scalaVersion := "2.11.8", mainClass in Compile := Some("myPackage.${APPLICATION}") )
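In sbt, using `%%` makes the build append the Scala binary version automatically, so every Spark artifact matches `scalaVersion` and a mix of `_2.10` and `_2.11` jars on the classpath is avoided. A sketch, with version numbers taken from the thread:

```scala
// build.sbt — with scalaVersion := "2.11.8", %% resolves these to the
// _2.11 artifacts, matching the Scala version the app is compiled with
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)
```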

Re: 2.0.1/2.1.x release dates

2016-08-19 Thread Michael Gummelt
Adrian, We haven't had any reports of hangs on Mesos in 2.0, so it's likely that if you wait until the release, your problem still won't be solved unless you file a bug. Can you create a JIRA so we can look into it? On Thu, Aug 18, 2016 at 2:40 AM, Sean Owen wrote: >

Re: Attempting to accept an unknown offer

2016-08-19 Thread Michael Gummelt
That error message occurs when the Mesos scheduler tries to accept an offer that doesn't exist. It should never happen. Can you submit a JIRA and cc me to it? Also, what libmesos and mesos master version are you running? On Wed, Aug 17, 2016 at 9:23 AM, vr spark wrote:

Re: [Spark2] Error writing "complex" type to CSV

2016-08-19 Thread Efe Selcuk
Okay so this is partially PEBKAC. I just noticed that there's a debugging field at the end that's another case class with its own simple fields - *that's* the struct that was showing up in the error, not the entry itself. This raises a different question. What has changed that this is no longer

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread chandan prakash
Thanks Cody for the pointer. I am able to do this now. Not using checkpointing; rather, storing offsets in ZooKeeper for fault tolerance. Spark config changes are now getting reflected on code deployment. Using this API: KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
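A rough Scala outline of that approach against the Spark 1.6 direct-stream API, supplying starting offsets read from ZooKeeper instead of relying on a checkpoint. The `readOffsetsFromZk`/`saveOffsetsToZk` helpers, `ssc`, and `kafkaParams` are assumed to exist and are not from the thread:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Starting offsets come from ZooKeeper, not from a Spark checkpoint,
// so configuration changes take effect on redeploy.
val fromOffsets: Map[TopicAndPartition, Long] = readOffsetsFromZk() // hypothetical helper

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  // process the batch, then persist its offset ranges back to ZooKeeper
  saveOffsetsToZk(rdd) // hypothetical helper
}
```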

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
So I wrote some code to reproduce the problem. I assume here that a pipeline should be able to transform a categorical feature with a few million levels. So I create a dataframe with the categorical feature (‘id’), apply a StringIndexer and OneHotEncoder transformer, and run a loop where I

How to continuous update or refresh RandomForestClassificationModel

2016-08-19 Thread 陈哲
Hi All, I'm using my training data to generate the RandomForestClassificationModel, and I can use this to predict the upcoming data. But if prediction fails, I'll put the failed features into the training data. Here is my question: how can I update or refresh the model? Which API should