How to read large files from a directory?

2017-05-09 Thread ashwini anand
I posted this question yesterday, but its formatting was very bad, so I am posting it again. Below is my question: I am reading a directory of files using wholeTextFiles. After that I call a function on each element of the RDD using map. The whole program uses

[jira] Lantao Jin shared "SPARK-20680: Spark-sql do not support for void column datatype of view" with you

2017-05-09 Thread Lantao Jin (JIRA)
Lantao Jin shared an issue with you:
> Spark-sql do not support for void column datatype of view
> Key: SPARK-20680
> URL:

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
>> df = spark.sqlContext.read.csv('out/df_in.csv')
> Shouldn't this be just df = spark.read.csv('out/df_in.csv')? SparkSession itself is the entry point to DataFrames and SQL functionality.
Our bootstrap is a bit messy, so in our case, no. In the general case, yes. On 9 May 2017 at 16:56,

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread Pushkar.Gujar
> df = spark.sqlContext.read.csv('out/df_in.csv')
Shouldn't this be just df = spark.read.csv('out/df_in.csv')? SparkSession itself is the entry point to DataFrames and SQL functionality. Thank you, Pushkar Gujar. On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra
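A minimal sketch of the pattern being suggested, in Scala (the thread itself is PySpark, where the same shape applies; the header option is an assumption about the file):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-read").getOrCreate()

    // In Spark 2.x the SparkSession is the single entry point; there is no
    // need to go through a separate SQLContext to get a DataFrameReader.
    val df = spark.read
      .option("header", "true")   // assumption: the file carries a header row
      .csv("out/df_in.csv")

    df.printSchema()

The session still carries a sqlContext for backward compatibility, which is presumably why the original line runs at all.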

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916. We will allow picking up the internal one if that change gets merged. On 10 May 2017 7:09 am, "Mark Hamstra" wrote:
> Looks to me like it is a conflict between a Databricks library and Spark 2.1.
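Until then, a hedged workaround sketch: when the short name "csv" is registered by both the built-in Spark 2.x source and the old spark-csv package, spelling out the class you mean sidesteps the clash (the class names below are the usual ones, but verify them against what is actually on your classpath):

    // Fully qualified source names avoid the ambiguous short name "csv".
    val builtIn = spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .load("out/df_in.csv")
    val external = spark.read
      .format("com.databricks.spark.csv")
      .load("out/df_in.csv")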

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
Ah, thanks. Your code also works for me; I figured my case failed because I tried to encode a tuple of (MyClass, Int):

package org.apache.spark
/** */
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Encoders,
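A possible sketch for that tuple case, composing the pair encoder explicitly (MyClass is a stand-in for the bean in the thread, not a class from the source):

    import org.apache.spark.sql.{Encoder, Encoders}

    // Build an encoder for (MyClass, Int) from its parts instead of relying
    // on an implicit product encoder, which cannot see inside a plain
    // (non-case) class.
    val pairEncoder: Encoder[(MyClass, Int)] =
      Encoders.tuple(Encoders.bean(classOf[MyClass]), Encoders.scalaInt)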

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread Mark Hamstra
Looks to me like it is a conflict between a Databricks library and Spark 2.1. That's an issue for Databricks to resolve or provide guidance on. On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com wrote:
> I'm a bit confused by that answer; I'm assuming it's Spark deciding

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
I'm a bit confused by that answer; I'm assuming it's Spark deciding which lib to use. On 9 May 2017 at 14:30, Mark Hamstra wrote:
> This looks more like a matter for Databricks support than spark-user.
> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com

Re: Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread Mark Hamstra
This looks more like a matter for Databricks support than spark-user. On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com wrote:
>> df = spark.sqlContext.read.csv('out/df_in.csv')
>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore.

Multiple CSV libs cause issues in Spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
df = spark.sqlContext.read.csv('out/df_in.csv')

17/05/09 15:51:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/05/09 15:51:29 WARN ObjectStore: Failed to get database

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Michael Armbrust
Must be a bug. This works for me in Spark 2.1. On Tue, May 9, 2017 at 12:10 PM, Yang wrote: > somehow the

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
Somehow the schema check is here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L697-L750. Supposedly beans are handled there, but it's not clear to me which line handles the bean type. If that's clear, I could

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
2.0.2 with Scala 2.11. On Tue, May 9, 2017 at 11:30 AM, Michael Armbrust wrote:
> Which version of Spark?
> On Tue, May 9, 2017 at 11:28 AM, Yang wrote:
>> actually with var it's the same:
>>
>> scala> class Person4 {
>>      |

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Michael Armbrust
Which version of Spark? On Tue, May 9, 2017 at 11:28 AM, Yang wrote:
> actually with var it's the same:
>
> scala> class Person4 {
>      |   @scala.beans.BeanProperty var X: Int = 1
>      | }
> defined class Person4
>
> scala> val personEncoder =

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
actually with var it's the same:

scala> class Person4 {
     |   @scala.beans.BeanProperty var X: Int = 1
     | }
defined class Person4

scala> val personEncoder = Encoders.bean[Person4](classOf[Person4])
personEncoder: org.apache.spark.sql.Encoder[Person4] = class[x[0]: int]

scala> val

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
Thanks Michael. I could not use a case class here since I need to later modify the output of getX() so that the output is dynamically generated. The bigger context is this: I want to implement topN() using a BoundedPriorityQueue. Basically I include a queue in reduce() or aggregateByKey(), but
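For reference, a minimal sketch of that top-N-per-key shape with aggregateByKey. Spark's own BoundedPriorityQueue is private[spark], so this bounds a plain scala.collection.mutable.PriorityQueue by hand; the (String, Int) pair type is an assumption.

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    def topNByKey(rdd: RDD[(String, Int)], n: Int): RDD[(String, List[Int])] = {
      // Reverse ordering makes this a min-heap: the smallest retained value
      // sits at the head, ready to be evicted when a larger one arrives.
      val ord = Ordering.Int.reverse
      def bounded(q: mutable.PriorityQueue[Int], v: Int) = {
        q.enqueue(v)
        if (q.size > n) q.dequeue()   // drop the current minimum
        q
      }
      rdd.aggregateByKey(mutable.PriorityQueue.empty[Int](ord))(
        bounded,
        (q1, q2) => { q2.foreach(bounded(q1, _)); q1 }
      ).mapValues(_.dequeueAll.toList)  // ascending; reverse for largest-first
    }

Mutating the per-key queue is safe here because aggregateByKey deserializes a fresh copy of the zero value for each key.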

Re: How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Michael Armbrust
I think you are supposed to set BeanProperty on a var, as they do here. If you are using Scala, though, I'd consider using the case
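A sketch of the shape being suggested, assuming a class like the thread's Person4 (defining it top-level in compiled code, rather than in the REPL, avoids outer-reference surprises):

    import org.apache.spark.sql.{Encoders, SparkSession}

    // @BeanProperty on a var generates the getX/setX pair that
    // Encoders.bean introspects, and a plain class provides the
    // no-arg constructor the bean encoder needs.
    class Person4 {
      @scala.beans.BeanProperty var x: Int = 1
    }

    val spark = SparkSession.builder().appName("bean-encoder").getOrCreate()
    val personEncoder = Encoders.bean(classOf[Person4])
    val ds = spark.createDataset(Seq(new Person4))(personEncoder)
    ds.show()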

Spark RandomForestClassifier and balancing classes

2017-05-09 Thread issues solution
Hi, I have already asked this question but am still without an answer. Can someone help me figure out how I can balance my classes when I use the fit method of RandomForestClassifier? Thanks in advance.
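In case it helps, a hedged sketch of one common workaround: RandomForestClassifier in the Spark 2.x ML API has no sample-weight option, so rebalance the training DataFrame itself with stratified sampling. The train DataFrame, the label column name, the 0.0/1.0 classes, and the 10% fraction are all assumptions about the data.

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Downsample the majority class so both classes contribute comparably.
    val fractions = Map(0.0 -> 0.1, 1.0 -> 1.0)  // keep 10% of class 0.0, all of 1.0
    val balanced = train.stat.sampleBy("label", fractions, 42L)

    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(balanced)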

Re: How to read large files from a directory?

2017-05-09 Thread Alonso Isidoro Roman
Please create a GitHub repo and upload the code there... Alonso Isidoro Roman, about.me/alonso.isidoro.roman. 2017-05-09 8:47 GMT+02:00 ashwini anand

Re: Join streams in Apache Spark

2017-05-09 Thread tencas
Hi scorpio, thanks for your reply. I don't understand your approach. Is it possible to receive data from different clients through the same port on Spark? Surely I'm confused, and I'd appreciate your opinion. Regarding the word count example from the Spark Streaming documentation, Spark acts as
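For what it's worth, a minimal sketch of the usual two-source shape: each socketTextStream is a single receiver that connects out to one host:port (Spark is the client there), so two data sources generally mean two streams. Hosts, ports, batch interval, and the comma-keyed record format are all assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("join-streams")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One receiver per source; key each record so the streams can be joined.
    val left  = ssc.socketTextStream("host1", 9999).map(l => (l.split(",")(0), l))
    val right = ssc.socketTextStream("host2", 9998).map(l => (l.split(",")(0), l))

    // join matches records that share a key within the same batch interval.
    left.join(right).print()

    ssc.start()
    ssc.awaitTermination()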

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-09 Thread Matthew cao
Hi, I have tried a simple test like this:

case class A(id: Long)
val sample = spark.range(0, 10).as[A]
sample.createOrReplaceTempView("sample")
val df = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]
df.union(df1)

It runs OK. And for nullability, I thought that issue has

How to mark a (bean) class with a schema for Catalyst?

2017-05-09 Thread Yang
I'm trying to use Encoders.bean() to create an encoder for my custom class, but it fails, complaining that it can't find the schema:

class Person4 {
  @scala.beans.BeanProperty def setX(x: Int): Unit = {}
  @scala.beans.BeanProperty def getX(): Int = { 1 }
}
val personEncoder = Encoders.bean[

RDD.cacheDataSet() not working intermittently

2017-05-09 Thread jasbir.sing
Hi, I have a scenario in which I am caching my RDDs for future use. But I observed that when I use my RDD, the complete DAG is re-executed and the RDD gets created again. How can I avoid this and make sure that RDD.cacheDataSet() caches the RDD every time? Regards, Jasbir Singh
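For comparison, a sketch of the standard pattern; cacheDataSet() is not a method I recognize on the RDD API, where the calls are cache()/persist(), and caching is lazy. The SparkContext sc and the input path are assumptions.

    // cache() only marks the RDD; nothing is stored until an action runs.
    val data = sc.textFile("hdfs:///data/input").map(_.toUpperCase).cache()

    data.count()  // the first action materializes the cached partitions
    data.take(5)  // reuses the cached blocks while they remain resident

    // If executors are lost or blocks are evicted under memory pressure,
    // the lineage (DAG) is re-run, but only for the missing partitions.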

How to read large files from a directory?

2017-05-09 Thread ashwini anand
I am reading each file of a directory using wholeTextFiles. After that I call a function on each element of the RDD using map. The whole program uses just 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result =
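A minimal Scala sketch of one way to keep memory in check (the thread's code is PySpark, and the input path is hypothetical): since only the first 50 lines of each file matter, cut each file down immediately after wholeTextFiles materializes it.

    // wholeTextFiles yields (path, fullContents) pairs, and each file's
    // entire contents must fit in a single executor's memory: the usual
    // pain point with large files.
    val pairs = sc.wholeTextFiles("hdfs:///data/input")

    // Trim to the needed 50 lines right away so the rest of each file can
    // be dropped before any further transformation touches it.
    val heads = pairs.mapValues(_.split("\n").take(50).toList)

    heads.mapValues(_.length).collect().foreach(println)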