Spark SQL - JDBC connectivity

2016-08-09 Thread Soni spark
Hi,

I would like to know the steps to connect to Spark SQL from the Spring
framework (web UI). Also, how do I run and deploy the web application?
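
One common setup is to start the Spark Thrift Server (sbin/start-thriftserver.sh)
and connect to it over JDBC with the Hive JDBC driver, which a Spring application
can treat like any other JDBC data source. A minimal sketch of such a connection
(the host, port, credentials and query are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
// the Thrift Server listens on port 10000 by default
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT 1")
while (rs.next()) println(rs.getInt(1))
conn.close()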


Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi Siva,

Does the topic have partitions? Which version of Spark are you using?

On Wed, Aug 10, 2016 at 2:38 AM, Sivakumaran S  wrote:

> Hi,
>
> Here is a working example I did.
>
> HTH
>
> Regards,
>
> Sivakumaran S
>
> val topics = "test"
> val brokers = "localhost:9092"
> val topicsSet = topics.split(",").toSet
> val sparkConf = new SparkConf().setAppName("KafkaWeatherCalc").setMaster("local") //spark://localhost:7077
> val sc = new SparkContext(sparkConf)
> val ssc = new StreamingContext(sc, Seconds(60))
> val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
> messages.foreachRDD(rdd => {
>   if (rdd.isEmpty()) {
>     println("Failed to get data from Kafka. Please check that the Kafka producer is streaming data.")
>     System.exit(-1)
>   }
>   val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
>   val weatherDF = sqlContext.read.json(rdd.map(_._2)).toDF()
>   // Process your DF as required from here on
> })
>
>
>
> On 09-Aug-2016, at 9:47 PM, Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com> wrote:
>
> Hi,
>
> I am reading json messages from Kafka. The topic has 2 partitions. When
> running the streaming job using spark-submit, I can see that val dataFrame =
> sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
> something wrong here? Below is the code. This environment is a Cloudera
> sandbox. The same issue occurs in the Hadoop production cluster, but that
> environment is restricted, which is why I tried to reproduce the issue in
> the Cloudera sandbox. Kafka 0.10 and Spark 1.4.
>
> val kafkaParams = Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092", "group.id" -> "xyz","auto.offset.reset"->"smallest")
> val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> val ssc = new StreamingContext(conf, Seconds(1))
>
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
>
> val topics = Set("gpp.minf")
> val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topics)
>
> kafkaStream.foreachRDD(
>   rdd => {
>     if (rdd.count > 0) {
>       val dataFrame = sqlContext.read.json(rdd.map(_._2))
>       dataFrame.printSchema()
>       //dataFrame.foreach(println)
>     }
>   })
>
>
>


Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
It stops working at sqlContext.read.json(rdd.map(_._2)). Topics without
partitions work fine. Do I need to set any other configs?

val kafkaParams = Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092", "group.id" -> "xyz","auto.offset.reset"->"smallest")

Spark version is 1.6.2

kafkaStream.foreachRDD(
  rdd => {
   rdd.foreach(println)
   val dataFrame = sqlContext.read.json(rdd.map(_._2))
   dataFrame.foreach(println)
}
)

On Wed, Aug 10, 2016 at 9:05 AM, Cody Koeninger  wrote:

> No, you don't need a conditional.  read.json on an empty rdd will
> return an empty dataframe.  Foreach on an empty dataframe or an empty
> rdd won't do anything (a task will still get run, but it won't do
> anything).
>
> Leave the conditional out.  Add one thing at a time to the working
> rdd.foreach example and see when it stops working, then take a closer
> look at the logs.
>
>
> On Tue, Aug 9, 2016 at 10:20 PM, Diwakar Dhanuskodi
>  wrote:
> > Hi Cody,
> >
> > Without conditional . It is going with fine. But any processing inside
> > conditional get on to waiting (or) something.
> > Facing this issue with partitioned topics. I would need conditional to
> skip
> > processing when batch is empty.
> > kafkaStream.foreachRDD(
> >   rdd => {
> >
> >val dataFrame = sqlContext.read.json(rdd.map(_._2))
> >/*if (dataFrame.count() > 0) {
> >dataFrame.foreach(println)
> >}
> >else
> >{
> >  println("Empty DStream ")
> >}*/
> > })
> >
> > On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger 
> wrote:
> >>
> >> Take out the conditional and the sqlcontext and just do
> >>
> >> rdd => {
> >>   rdd.foreach(println)
> >>
> >>
> >> as a base line to see if you're reading the data you expect
> >>
> >> On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
> >>  wrote:
> >> > Hi,
> >> >
> >> > I am reading json messages from kafka . Topics has 2 partitions. When
> >> > running streaming job using spark-submit, I could see that  val
> >> > dataFrame =
> >> > sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
> >> > something wrong here. Below is code .This environment is cloudera
> >> > sandbox
> >> > env. Same issue in hadoop production cluster mode except that it is
> >> > restricted thats why tried to reproduce issue in Cloudera sandbox.
> Kafka
> >> > 0.10 and  Spark 1.4.
> >> >
> >> > val kafkaParams =
> >> > Map[String,String]("bootstrap.servers"->"localhost:9093,
> localhost:9092",
> >> > "group.id" -> "xyz","auto.offset.reset"->"smallest")
> >> > val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> >> > val ssc = new StreamingContext(conf, Seconds(1))
> >> >
> >> > val sqlContext = new org.apache.spark.sql.
> SQLContext(ssc.sparkContext)
> >> >
> >> > val topics = Set("gpp.minf")
> >> > val kafkaStream = KafkaUtils.createDirectStream[String, String,
> >> > StringDecoder,StringDecoder](ssc, kafkaParams, topics)
> >> >
> >> > kafkaStream.foreachRDD(
> >> >   rdd => {
> >> > if (rdd.count > 0){
> >> > val dataFrame = sqlContext.read.json(rdd.map(_._2))
> >> >dataFrame.printSchema()
> >> > //dataFrame.foreach(println)
> >> > }
> >> > }
> >
> >
>


Re: Get distinct column data from grouped data

2016-08-09 Thread Selvam Raman
My friend suggested this way:

val fil = sc.textFile("hdfs:///user/vijayc/data/test-spk.tx")

val res = fil.map(l => l.split(",")).map(l => (l(0), l(1))).groupByKey.map(rd => (rd._1, rd._2.toList.distinct))


Another useful function is *collect_set* in the DataFrame API.
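
For example, a minimal DataFrame sketch of the same grouping (assuming Spark 2.0,
where collect_set is available in org.apache.spark.sql.functions, and an existing
SparkSession named spark; the column names are illustrative):

import org.apache.spark.sql.functions.collect_set
import spark.implicits._

// small sample matching the data in the quoted question
val df = Seq(("sel1","test"), ("sel1","test"), ("sel1","ok"),
             ("sel2","ok"), ("sel2","test")).toDF("key", "value")

// collect_set keeps only the distinct values per key
df.groupBy("key").agg(collect_set("value").as("values")).show()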


Thanks,

selvam R

On Tue, Aug 9, 2016 at 4:19 PM, Selvam Raman  wrote:

> Example:
>
> sel1 test
> sel1 test
> sel1 ok
> sel2 ok
> sel2 test
>
>
> expected result:
>
> sel1, [test,ok]
> sel2,[test,ok]
>
> How to achieve the above result using spark dataframe.
>
> please suggest me.
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
No, you don't need a conditional.  read.json on an empty rdd will
return an empty dataframe.  Foreach on an empty dataframe or an empty
rdd won't do anything (a task will still get run, but it won't do
anything).

Leave the conditional out.  Add one thing at a time to the working
rdd.foreach example and see when it stops working, then take a closer
look at the logs.
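
A minimal sketch of what that looks like once the conditional is removed
(reusing the kafkaStream and sqlContext from the earlier snippets; printing the
schema is just illustrative):

kafkaStream.foreachRDD { rdd =>
  // read.json on an empty RDD simply yields an empty DataFrame,
  // so no emptiness check is needed
  val dataFrame = sqlContext.read.json(rdd.map(_._2))
  dataFrame.printSchema()
}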


On Tue, Aug 9, 2016 at 10:20 PM, Diwakar Dhanuskodi
 wrote:
> Hi Cody,
>
> Without conditional . It is going with fine. But any processing inside
> conditional get on to waiting (or) something.
> Facing this issue with partitioned topics. I would need conditional to skip
> processing when batch is empty.
> kafkaStream.foreachRDD(
>   rdd => {
>
>val dataFrame = sqlContext.read.json(rdd.map(_._2))
>/*if (dataFrame.count() > 0) {
>dataFrame.foreach(println)
>}
>else
>{
>  println("Empty DStream ")
>}*/
> })
>
> On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger  wrote:
>>
>> Take out the conditional and the sqlcontext and just do
>>
>> rdd => {
>>   rdd.foreach(println)
>>
>>
>> as a base line to see if you're reading the data you expect
>>
>> On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
>>  wrote:
>> > Hi,
>> >
>> > I am reading json messages from kafka . Topics has 2 partitions. When
>> > running streaming job using spark-submit, I could see that  val
>> > dataFrame =
>> > sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
>> > something wrong here. Below is code .This environment is cloudera
>> > sandbox
>> > env. Same issue in hadoop production cluster mode except that it is
>> > restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka
>> > 0.10 and  Spark 1.4.
>> >
>> > val kafkaParams =
>> > Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092",
>> > "group.id" -> "xyz","auto.offset.reset"->"smallest")
>> > val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
>> > val ssc = new StreamingContext(conf, Seconds(1))
>> >
>> > val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
>> >
>> > val topics = Set("gpp.minf")
>> > val kafkaStream = KafkaUtils.createDirectStream[String, String,
>> > StringDecoder,StringDecoder](ssc, kafkaParams, topics)
>> >
>> > kafkaStream.foreachRDD(
>> >   rdd => {
>> > if (rdd.count > 0){
>> > val dataFrame = sqlContext.read.json(rdd.map(_._2))
>> >dataFrame.printSchema()
>> > //dataFrame.foreach(println)
>> > }
>> > }
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi Cody,

Without the conditional it runs fine, but any processing inside the
conditional appears to hang (or something similar). I am facing this issue
only with partitioned topics. I need the conditional to skip processing when
the batch is empty.
kafkaStream.foreachRDD(
  rdd => {

   val dataFrame = sqlContext.read.json(rdd.map(_._2))
   /*if (dataFrame.count() > 0) {
   dataFrame.foreach(println)
   }
   else
   {
 println("Empty DStream ")
   }*/
})

On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger  wrote:

> Take out the conditional and the sqlcontext and just do
>
> rdd => {
>   rdd.foreach(println)
>
>
> as a base line to see if you're reading the data you expect
>
> On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
>  wrote:
> > Hi,
> >
> > I am reading json messages from kafka . Topics has 2 partitions. When
> > running streaming job using spark-submit, I could see that  val
> dataFrame =
> > sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
> > something wrong here. Below is code .This environment is cloudera sandbox
> > env. Same issue in hadoop production cluster mode except that it is
> > restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka
> > 0.10 and  Spark 1.4.
> >
> > val kafkaParams =
> > Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092",
> > "group.id" -> "xyz","auto.offset.reset"->"smallest")
> > val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> > val ssc = new StreamingContext(conf, Seconds(1))
> >
> > val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> >
> > val topics = Set("gpp.minf")
> > val kafkaStream = KafkaUtils.createDirectStream[String, String,
> > StringDecoder,StringDecoder](ssc, kafkaParams, topics)
> >
> > kafkaStream.foreachRDD(
> >   rdd => {
> > if (rdd.count > 0){
> > val dataFrame = sqlContext.read.json(rdd.map(_._2))
> >dataFrame.printSchema()
> > //dataFrame.foreach(println)
> > }
> > }
>


UNSUBSCRIBE

2016-08-09 Thread James Ding







Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Cool, learn something new every day.  Thanks again.

On Tue, Aug 9, 2016 at 4:08 PM ayan guha  wrote:

> Thanks for reporting back. Glad it worked for you. Actually sum with
> partitioning behaviour is same in oracle too.
> On 10 Aug 2016 03:01, "Jon Barksdale"  wrote:
>
>> Hi Santoshakhilesh,
>>
>> I'd seen that already, but I was trying to avoid using rdds to perform
>> this calculation.
>>
>> @Ayan, it seems I was mistaken, and doing a sum(b) over(order by b)
>> totally works.  I guess I expected the windowing with sum to work more like
>> oracle.  Thanks for the suggestion :)
>>
>> Thank you both for your help,
>>
>> Jon
>>
>> On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <
>> santosh.akhil...@huawei.com> wrote:
>>
>>> You could check following link.
>>>
>>>
>>> http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
>>>
>>>
>>>
>>> *From:* Jon Barksdale [mailto:jon.barksd...@gmail.com]
>>> *Sent:* 09 August 2016 08:21
>>> *To:* ayan guha
>>> *Cc:* user
>>> *Subject:* Re: Cumulative Sum function using Dataset API
>>>
>>>
>>>
>>> I don't think that would work properly, and would probably just give me
>>> the sum for each partition. I'll give it a try when I get home just to be
>>> certain.
>>>
>>> To maybe explain the intent better, if I have a column (pre sorted) of
>>> (1,2,3,4), then the cumulative sum would return (1,3,6,10).
>>>
>>> Does that make sense? Naturally, if ordering a sum turns it into a
>>> cumulative sum, I'll gladly use that :)
>>>
>>> Jon
>>>
>>> On Mon, Aug 8, 2016 at 4:55 PM ayan guha  wrote:
>>>
>>> You mean you are not able to use sum(col) over (partition by key order
>>> by some_col) ?
>>>
>>>
>>>
>>> On Tue, Aug 9, 2016 at 9:53 AM, jon  wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to write a function that calculates a cumulative sum as a
>>> column
>>> using the Dataset API, and I'm a little stuck on the implementation.
>>> From
>>> what I can tell, UserDefinedAggregateFunctions don't seem to support
>>> windowing clauses, which I think I need for this use case.  If I write a
>>> function that extends from AggregateWindowFunction, I end up needing
>>> classes
>>> that are package private to the sql package, so I need to make my
>>> function
>>> under the org.apache.spark.sql package, which just feels wrong.
>>>
>>> I've also considered writing a custom transformer, but haven't spend as
>>> much
>>> time reading through the code, so I don't know how easy or hard that
>>> would
>>> be.
>>>
>>> TLDR; What's the best way to write a function that returns a value for
>>> every
>>> row, but has mutable state, and gets row in a specific order?
>>>
>>> Does anyone have any ideas, or examples?
>>>
>>> Thanks,
>>>
>>> Jon
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-Sum-function-using-Dataset-API-tp27496.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Best Regards,
>>> Ayan Guha
>>>
>>>


Re: spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Hi,


The problem has been solved simply by updating the Scala SDK version from the
incompatible 2.10.x to the correct version, 2.11.x.


From: Michael Jay 
Sent: Tuesday, August 9, 2016 10:11:12 PM
To: user@spark.apache.org
Subject: spark 2.0 in intellij


Dear all,

I am a newbie to Spark. Currently I am trying to import the source code of Spark
2.0 as a module into an existing client project.

I have imported Spark-core, Spark-sql and Spark-catalyst as maven dependencies 
in this client project.

During compilation errors as missing SqlBaseParser.java occurred.

After searching online, I found an article in StackOverflow 
http://stackoverflow.com/questions/35617277/spark-sql-has-no-sparksqlparser-scala-file-when-compiling-in-intellij-idea
 to solve this issue.


So I used mvn to build Spark 2.0 first and imported
...catalyst/target/generated-sources/antlr4 as a new source folder in the maven
dependency "Spark-catalyst".
Now the problem is that I still get the following errors:

Error:scalac: error while loading package, Missing dependency 'bad symbolic 
reference. A signature in package.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
package.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/package.class
Error:scalac: error while loading SparkSession, Missing dependency 'bad 
symbolic reference. A signature in SparkSession.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
SparkSession.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/SparkSession.class
Error:scalac: error while loading RDD, Missing dependency 'bad symbolic 
reference. A signature in RDD.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
RDD.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/core/target/scala-2.11/classes/org/apache/spark/rdd/RDD.class
Error:scalac: error while loading JavaRDDLike, Missing dependency 'bad symbolic 
reference. A signature in JavaRDDLike.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
JavaRDDLike.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/core/target/scala-2.11/classes/org/apache/spark/api/java/JavaRDDLike.class
Error:scalac: error while loading Dataset, Missing dependency 'bad symbolic 
reference. A signature in Dataset.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
Dataset.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/Dataset.class
Error:scalac: error while loading ColumnName, Missing dependency 'bad symbolic 
reference. A signature in ColumnName.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
ColumnName.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/ColumnName.class
Error:scalac: error while loading Encoder, Missing dependency 'bad symbolic 
reference. A signature in Encoder.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
Encoder.class.', required by 

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread ayan guha
Thanks for reporting back. Glad it worked for you. Actually sum with
partitioning behaviour is same in oracle too.
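
For reference, a minimal sketch of the running-sum-via-window approach discussed
in this thread (assuming Spark 2.0 and an existing SparkSession named spark; the
column name is illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = Seq(1, 2, 3, 4).toDF("b")
// frame from the first row up to the current row, ordered by b
val w = Window.orderBy("b").rowsBetween(Long.MinValue, 0)
// produces 1, 3, 6, 10 for the column above
df.withColumn("cumulative_sum", sum("b").over(w)).show()
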
On 10 Aug 2016 03:01, "Jon Barksdale"  wrote:

> Hi Santoshakhilesh,
>
> I'd seen that already, but I was trying to avoid using rdds to perform
> this calculation.
>
> @Ayan, it seems I was mistaken, and doing a sum(b) over(order by b)
> totally works.  I guess I expected the windowing with sum to work more like
> oracle.  Thanks for the suggestion :)
>
> Thank you both for your help,
>
> Jon
>
> On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <
> santosh.akhil...@huawei.com> wrote:
>
>> You could check following link.
>>
>> http://stackoverflow.com/questions/35154267/how-to-
>> compute-cumulative-sum-using-spark
>>
>>
>>
>> *From:* Jon Barksdale [mailto:jon.barksd...@gmail.com]
>> *Sent:* 09 August 2016 08:21
>> *To:* ayan guha
>> *Cc:* user
>> *Subject:* Re: Cumulative Sum function using Dataset API
>>
>>
>>
>> I don't think that would work properly, and would probably just give me
>> the sum for each partition. I'll give it a try when I get home just to be
>> certain.
>>
>> To maybe explain the intent better, if I have a column (pre sorted) of
>> (1,2,3,4), then the cumulative sum would return (1,3,6,10).
>>
>> Does that make sense? Naturally, if ordering a sum turns it into a
>> cumulative sum, I'll gladly use that :)
>>
>> Jon
>>
>> On Mon, Aug 8, 2016 at 4:55 PM ayan guha  wrote:
>>
>> You mean you are not able to use sum(col) over (partition by key order by
>> some_col) ?
>>
>>
>>
>> On Tue, Aug 9, 2016 at 9:53 AM, jon  wrote:
>>
>> Hi all,
>>
>> I'm trying to write a function that calculates a cumulative sum as a
>> column
>> using the Dataset API, and I'm a little stuck on the implementation.  From
>> what I can tell, UserDefinedAggregateFunctions don't seem to support
>> windowing clauses, which I think I need for this use case.  If I write a
>> function that extends from AggregateWindowFunction, I end up needing
>> classes
>> that are package private to the sql package, so I need to make my function
>> under the org.apache.spark.sql package, which just feels wrong.
>>
>> I've also considered writing a custom transformer, but haven't spend as
>> much
>> time reading through the code, so I don't know how easy or hard that would
>> be.
>>
>> TLDR; What's the best way to write a function that returns a value for
>> every
>> row, but has mutable state, and gets row in a specific order?
>>
>> Does anyone have any ideas, or examples?
>>
>> Thanks,
>>
>> Jon
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Cumulative-Sum-function-using-
>> Dataset-API-tp27496.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>>


Re: DataFrame equivalent to RDD.partionByKey

2016-08-09 Thread Davies Liu
I think you are looking for `def repartition(numPartitions: Int,
partitionExprs: Column*)`
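
A one-line sketch of that overload (illustrative: df, the "id" column and the
partition count of 200 are assumptions):

import org.apache.spark.sql.functions.col

// co-locates rows with the same id in the same partition, across 200 partitions
val partitioned = df.repartition(200, col("id"))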

On Tue, Aug 9, 2016 at 9:36 AM, Stephen Fletcher
 wrote:
> Is there a DataFrameReader equivalent to the RDD's partitionByKey for RDD?
> I'm reading data from a file data source and I want to partition this data
> I'm reading in to be partitioned the same way as the data I'm processing
> through a spark streaming RDD in the process.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Mich Talebzadeh
Hi,

Is this table created as an external table in Hive?

Do you see the data through spark-sql or the Hive Thrift Server?

There is a known issue with Zeppelin not seeing data when connecting to the
Spark Thrift Server: rows display null values.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 22:32, cdecleene  wrote:

> Some details of an example table hive table that spark 2.0 could not
> read...
>
> SerDe Library:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>
> COLUMN_STATS_ACCURATE   false
> kite.compression.type   snappy
> numFiles                0
> numRows                 -1
> rawDataSize             -1
> totalSize               0
>
> All fields within the table are of type "string" and there are less than 20
> of them.
>
> When I say that spark 2.0 cannot read the hive table, I mean that when I
> attempt to execute the following from a pyspark shell...
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")
>
> ... the dataframe df has the correct number of rows and the correct
> columns,
> but all values read as "None".
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-
> created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Davies Liu
Can you get all the fields back using Scala or SQL (bin/spark-sql)?
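
For example, a quick check from spark-shell (a sketch; it simply reruns the same
query through the Scala API and prints the schema plus a few rows):

val df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")
df.printSchema()
df.show(5)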

On Tue, Aug 9, 2016 at 2:32 PM, cdecleene  wrote:
> Some details of an example table hive table that spark 2.0 could not read...
>
> SerDe Library:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>
> COLUMN_STATS_ACCURATE   false
> kite.compression.type   snappy
> numFiles                0
> numRows                 -1
> rawDataSize             -1
> totalSize               0
>
> All fields within the table are of type "string" and there are less than 20
> of them.
>
> When I say that spark 2.0 cannot read the hive table, I mean that when I
> attempt to execute the following from a pyspark shell...
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")
>
> ... the dataframe df has the correct number of rows and the correct columns,
> but all values read as "None".
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on mesos in docker not getting parameters

2016-08-09 Thread Michael Gummelt
> However, they are missing in subsequent child processes and the final java
process started doesn't contain them either.

I don't see any evidence of this in your process list.  `launcher.Main` is
not the final java process.  `launcher.Main` prints a java command, which
`spark-class` then runs.  That command is the final java process.
`launcher.Main` should take the contents of SPARK_EXECUTOR_OPTS and include
those opts in the command which it prints out.

If you could include the process listing for that final command, and you
observe it doesn't contain the aws system properties from
SPARK_EXECUTOR_OPTS, then I would see something wrong.

On Tue, Aug 9, 2016 at 10:13 AM, Jim Carroll  wrote:

> I'm running spark 2.0.0 on Mesos using spark.mesos.executor.docker.image
> to
> point to a docker container that I built with the Spark installation.
>
> Everything is working except the Spark client process that's started inside
> the container doesn't get any of my parameters I set in the spark config in
> the driver.
>
> I set spark.executor.extraJavaOptions and spark.executor.extraClassPath in
> the driver and they don't get passed all the way through. Here is a capture
> of the chain of processes that are started on the mesos slave, in the
> docker
> container:
>
> root  1064  1051  0 12:46 ?00:00:00 docker -H
> unix:///var/run/docker.sock run --cpu-shares 8192 --memory 4723834880 -e
> SPARK_CLASSPATH=[path to my jar] -e SPARK_EXECUTOR_OPTS=
> -Daws.accessKeyId=[myid] -Daws.secretKey=[mykey] -e SPARK_USER=root -e
> SPARK_EXECUTOR_MEMORY=4096m -e MESOS_SANDBOX=/mnt/mesos/sandbox -e
> MESOS_CONTAINER_NAME=mesos-90e2c720-1e45-4dbc-8271-
> f0c47a33032a-S0.772f8080-6278-4a35-9e57-0009787ac605
> -v
> /tmp/mesos/slaves/90e2c720-1e45-4dbc-8271-f0c47a33032a-
> S0/frameworks/f5794f8a-b56f-4958-b906-f05c426dcef0-0001/
> executors/0/runs/772f8080-6278-4a35-9e57-0009787ac605:/mnt/mesos/sandbox
> --net host --entrypoint /bin/sh --name
> mesos-90e2c720-1e45-4dbc-8271-f0c47a33032a-S0.772f8080-6278-
> 4a35-9e57-0009787ac605
> [my docker image] -c  "/opt/spark/./bin/spark-class"
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
> --hostname 192.168.10.145 --cores 8 --app-id
> f5794f8a-b56f-4958-b906-f05c426dcef0-0001
>
> root  1193  1175  0 12:46 ?00:00:00 /bin/sh -c
> "/opt/spark/./bin/spark-class"
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
> --hostname 192.168.10.145 --cores 8 --app-id
> f5794f8a-b56f-4958-b906-f05c426dcef0-0001
>
> root  1208  1193  0 12:46 ?00:00:00 bash
> /opt/spark/./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
> --hostname 192.168.10.145 --cores 8 --app-id
> f5794f8a-b56f-4958-b906-f05c426dcef0-0001
>
> root  1213  1208  0 12:46 ?00:00:00 bash
> /opt/spark/./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
> --hostname 192.168.10.145 --cores 8 --app-id
> f5794f8a-b56f-4958-b906-f05c426dcef0-0001
>
> root  1215  1213  0 12:46 ?00:00:00
> /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx128m -cp /opt/spark/jars/*
> org.apache.spark.launcher.Main
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
> --hostname 192.168.10.145 --cores 8 --app-id
> f5794f8a-b56f-4958-b906-f05c426dcef0-0001
>
> Notice, in the initial process started by mesos both the SPARK_CLASSPATH is
> set to the value of spark.executor.extraClassPath and the -D options are
> set
> as I set them on spark.executor.extraJavaOptions (in this case, to my aws
> creds) in the drive configuration.
>
> However, they are missing in subsequent child processes and the final java
> process started doesn't contain them either.
>
> I "fixed" the classpath problem by putting my jar in /opt/spark/jars
> (/opt/spark is the location I have spark installed in the docker
> container).
>
> Can someone tell me what I'm missing?
>
> Thanks
> Jim
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-on-mesos-in-docker-not-
> getting-parameters-tp27500.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread cdecleene
Some details of an example hive table that spark 2.0 could not read...

SerDe Library:  
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

COLUMN_STATS_ACCURATE   false
kite.compression.type   snappy
numFiles                0
numRows                 -1
rawDataSize             -1
totalSize               0

All fields within the table are of type "string" and there are less than 20
of them. 

When I say that spark 2.0 cannot read the hive table, I mean that when I
attempt to execute the following from a pyspark shell... 

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")

... the dataframe df has the correct number of rows and the correct columns,
but all values read as "None". 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Can you give some outline as to what you mean? Should I broadcast a dataframe,
and register the broadcasted df as a temp table? And then use a lookup UDF in a
SELECT query?

I've managed to get it working by loading the 1.5GB dataset into an embedded
redis instance on the driver, and used a mapPartitions on the big dataframe to
map it to the required triples by doing the lookup from redis. It took around
13 minutes to load the data into redis using 4 cores, and the subsequent map on
the main dataset was quite fast.

From: gourav.sengu...@gmail.com
Date: Tue, 9 Aug 2016 21:13:51 +0100
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: mich.talebza...@gmail.com; samkiller@gmail.com; deepakmc...@gmail.com; 
user@spark.apache.org

In case of skewed data the joins will mess things up. Try to write a UDF with 
the lookup on broadcast variable and then let me know the results. It should 
not take more than 40 mins in a 32 GB RAM system with 6 core processors.

Gourav
On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab  wrote:



Hi Mich,
Hardware: AWS EMR cluster with 15 nodes of Rx3.2xlarge (CPU, RAM fine, disk a
couple of hundred gig).

When we do:

onPointFiveTB.join(onePointFiveGig.cache(), "id")

we're seeing that the temp directory is filling up fast, until a node gets
killed. And then everything dies.
-Ashic.

From: mich.talebza...@gmail.com
Date: Tue, 9 Aug 2016 17:25:23 +0100
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: samkiller@gmail.com; deepakmc...@gmail.com; user@spark.apache.org

Hi Sam,
What is your spark Hardware spec, No of nodes, RAM per node and disks please?
I don't understand this should not really be an issue. Underneath the bonnet it 
is a hash join. The small table I gather can be cached and the big table will 
do multiple passes using the temp space.
HTH

Dr Mich Talebzadeh


 


LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


 


http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 9 August 2016 at 15:46, Ashic Mahtab  wrote:



Hi Sam,Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's 
no progress. The spark UI doesn't even show up.
-Ashic. 

From: samkiller@gmail.com
Date: Tue, 9 Aug 2016 16:21:27 +0200
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: deepakmc...@gmail.com; user@spark.apache.org

Have you tried to broadcast your small table in order to perform your join?

joined = bigDF.join(broadcast(smallDF), ...)

On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:



Hi Deepak,
No...not really. Upping the disk size is a solution, but more expensive as you
can't attach EBS volumes to EMR clusters configured with data pipelines easily
(which is what we're doing). I've tried collecting the 1.5G dataset in a
hashmap, and broadcasting. Timeouts seem to prevent that (even after upping the
max driver result size). Increasing partition counts didn't help (the shuffle
used up the temp space). I'm now looking at some form of clever broadcasting,
or maybe falling back to chunking up the input, producing interim output, and
unioning them for the final output. Might even try using Spark Streaming
pointing to the parquet and seeing if that helps.
-Ashic.

From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 17:31:19 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com

Hi Ashic,
Did you find the resolution to this issue? Just curious to know what helped in
this scenario.

Thanks,
Deepak


On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:



Hi Deepak,
Thanks for the response.

Registering the temp tables didn't help. Here's what I have:

val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")

results.write.parquet(...)

Is there something I'm missing?

Cheers,
Ashic.
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp
table. This should resolve your issue.

Thanks,
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:



Hello,
We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int  (1.3GB)

We need to join these two to get (id, Number, Name). We've tried two

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
Thanks, that makes sense.
So it must be this queue - which is kept because of the UDF - that is running
out of memory, because without the UDF field there is no out-of-memory error,
and the UDF field itself is pretty small, so it is unlikely that it alone would
take us above the memory limit.

In either case, thanks for your help. I think I now understand how the UDFs
and the fields, together with the number of rows, can result in our
out-of-memory scenario.

On Tue, Aug 9, 2016 at 5:06 PM, Davies Liu  wrote:

> When you have a Python UDF, only the input to UDF are passed into
> Python process,
> but all other fields that are used together with the result of UDF are
> kept in a queue
> then join with the result from Python. The length of this queue is depend
> on the
> number of rows is under processing by Python (or in the buffer of
> Python process).
> The amount of memory required also depend on how many fields are used in
> the
> results.
>
> On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor 
> wrote:
> >> Does this mean you only have 1.6G memory for executor (others left for
> >> Python) ?
> >> The cached table could take 1.5G, it means almost nothing left for other
> >> things.
> > True. I have also tried with memoryOverhead being set to 800 (10% of the
> 8Gb
> > memory), but no difference. The "GC overhead limit exceeded" is still the
> > same.
> >
> >> Python UDF do requires some buffering in JVM, the size of buffering
> >> depends on how much rows are under processing by Python process.
> > I did some more testing in the meantime.
> > Leaving the UDFs as-is, but removing some other, static columns from the
> > above SELECT FROM command has stopped the memoryOverhead error from
> > occurring. I have plenty enough memory to store the results with all
> static
> > columns, plus when the UDFs are not there only the rest of the static
> > columns are, then it runs fine. This makes me believe that having UDFs
> and
> > many columns causes the issue together. Maybe when you have UDFs then
> > somehow the memory usage depends on the amount of data in that record
> (the
> > whole row), which includes other fields too, which are actually not used
> by
> > the UDF. Maybe the UDF serialization to Python serializes the whole row
> > instead of just the attributes of the UDF?
> >
> > On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu 
> wrote:
> >>
> >> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor 
> >> wrote:
> >> > Hi all,
> >> >
> >> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> >> > 2.0.0
> >> > using pyspark.
> >> >
> >> > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into
> 300
> >> > executors's memory in SparkSQL, on which we would do some calculation
> >> > using
> >> > UDFs in pyspark.
> >> > If I run my SQL on only a portion of the data (filtering by one of the
> >> > attributes), let's say 800 million records, then all works well. But
> >> > when I
> >> > run the same SQL on all the data, then I receive
> >> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from
> basically
> >> > all
> >> > of the executors.
> >> >
> >> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> >> > causing this "GC overhead limit being exceeded".
> >> >
> >> > Details:
> >> >
> >> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >> >
> >> > - 300 executors, each with 2 CPU cores and 8Gb memory (
> >> > spark.yarn.executor.memoryOverhead=6400 )
> >>
> >> Does this mean you only have 1.6G memory for executor (others left for
> >> Python) ?
> >> The cached table could take 1.5G, it means almost nothing left for other
> >> things.
> >>
> >> Python UDF do requires some buffering in JVM, the size of buffering
> >> depends on
> >> how much rows are under processing by Python process.
> >>
> >> > - a table of 5.6 Billions rows loaded into the memory of the executors
> >> > (taking up 450Gb of memory), partitioned evenly across the executors
> >> >
> >> > - creating even the simplest UDF in SparkSQL causes 'GC overhead limit
> >> > exceeded' error if running on all records. Running the same on a
> smaller
> >> > dataset (~800 million rows) does succeed. If no UDF, the query succeed
> >> > on
> >> > the whole dataset.
> >> >
> >> > - simplified pyspark code:
> >> >
> >> > from pyspark.sql.types import StringType
> >> >
> >> > def test_udf(var):
> >> > """test udf that will always return a"""
> >> > return "a"
> >> > sqlContext.registerFunction("test_udf", test_udf, StringType())
> >> >
> >> > sqlContext.sql("""CACHE TABLE ma""")
> >> >
> >> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
> >> > test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
> >> > ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
> >> > STANDARD_ACCOUNT_CITY_SRC)
> >> >  /
> >> > CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)>LENGTH
> >> > (STANDARD_ACCOUNT_CITY_SRC)
> 

UNSUBSCRIBE

2016-08-09 Thread abhishek singh



Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Sivakumaran S
Hi,

Here is a working example I did.

HTH

Regards,

Sivakumaran S

val topics = "test"
val brokers = "localhost:9092"
val topicsSet = topics.split(",").toSet
val sparkConf = new SparkConf().setAppName("KafkaWeatherCalc").setMaster("local") //spark://localhost:7077
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(60))
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
messages.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    println("Failed to get data from Kafka. Please check that the Kafka producer is streaming data.")
    System.exit(-1)
  }
  val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
  val weatherDF = sqlContext.read.json(rdd.map(_._2)).toDF()
  // Process your DF as required from here on
})



> On 09-Aug-2016, at 9:47 PM, Diwakar Dhanuskodi  
> wrote:
> 
> Hi,
> 
> I am reading json messages from kafka . Topics has 2 partitions. When running 
> streaming job using spark-submit, I could see that  val dataFrame = 
> sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing 
> something wrong here. Below is code .This environment is cloudera sandbox 
> env. Same issue in hadoop production cluster mode except that it is 
> restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka 0.10 
> and  Spark 1.4.
> 
> val kafkaParams = 
> Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092", 
> "group.id " -> "xyz","auto.offset.reset"->"smallest")
> val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> val ssc = new StreamingContext(conf, Seconds(1))
> 
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> 
> val topics = Set("gpp.minf")
> val kafkaStream = KafkaUtils.createDirectStream[String, String, 
> StringDecoder,StringDecoder](ssc, kafkaParams, topics)
> 
> kafkaStream.foreachRDD(
>   rdd => {
> if (rdd.count > 0){
> val dataFrame = sqlContext.read.json(rdd.map(_._2)) 
>dataFrame.printSchema()
> //dataFrame.foreach(println)
> }
> }



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
When you have a Python UDF, only the inputs to the UDF are passed into the
Python process, but all other fields that are used together with the result of
the UDF are kept in a queue and then joined with the result from Python. The
length of this queue depends on the number of rows under processing by Python
(or in the buffer of the Python process). The amount of memory required also
depends on how many fields are used in the results.

On Tue, Aug 9, 2016 at 11:09 AM, Zoltan Fedor  wrote:
>> Does this mean you only have 1.6G memory for executor (others left for
>> Python) ?
>> The cached table could take 1.5G, it means almost nothing left for other
>> things.
> True. I have also tried with memoryOverhead being set to 800 (10% of the 8Gb
> memory), but no difference. The "GC overhead limit exceeded" is still the
> same.
>
>> Python UDF do requires some buffering in JVM, the size of buffering
>> depends on how much rows are under processing by Python process.
> I did some more testing in the meantime.
> Leaving the UDFs as-is, but removing some other, static columns from the
> above SELECT FROM command has stopped the memoryOverhead error from
> occurring. I have plenty enough memory to store the results with all static
> columns, plus when the UDFs are not there only the rest of the static
> columns are, then it runs fine. This makes me believe that having UDFs and
> many columns causes the issue together. Maybe when you have UDFs then
> somehow the memory usage depends on the amount of data in that record (the
> whole row), which includes other fields too, which are actually not used by
> the UDF. Maybe the UDF serialization to Python serializes the whole row
> instead of just the attributes of the UDF?
>
> On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu  wrote:
>>
>> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor 
>> wrote:
>> > Hi all,
>> >
>> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
>> > 2.0.0
>> > using pyspark.
>> >
>> > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300
>> > executors's memory in SparkSQL, on which we would do some calculation
>> > using
>> > UDFs in pyspark.
>> > If I run my SQL on only a portion of the data (filtering by one of the
>> > attributes), let's say 800 million records, then all works well. But
>> > when I
>> > run the same SQL on all the data, then I receive
>> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically
>> > all
>> > of the executors.
>> >
>> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
>> > causing this "GC overhead limit being exceeded".
>> >
>> > Details:
>> >
>> > - using Spark 2.0.0 on a Hadoop YARN cluster
>> >
>> > - 300 executors, each with 2 CPU cores and 8Gb memory (
>> > spark.yarn.executor.memoryOverhead=6400 )
>>
>> Does this mean you only have 1.6G memory for executor (others left for
>> Python) ?
>> The cached table could take 1.5G, it means almost nothing left for other
>> things.
>>
>> Python UDF do requires some buffering in JVM, the size of buffering
>> depends on
>> how much rows are under processing by Python process.
>>
>> > - a table of 5.6 Billions rows loaded into the memory of the executors
>> > (taking up 450Gb of memory), partitioned evenly across the executors
>> >
>> > - creating even the simplest UDF in SparkSQL causes 'GC overhead limit
>> > exceeded' error if running on all records. Running the same on a smaller
>> > dataset (~800 million rows) does succeed. If no UDF, the query succeed
>> > on
>> > the whole dataset.
>> >
>> > - simplified pyspark code:
>> >
>> > from pyspark.sql.types import StringType
>> >
>> > def test_udf(var):
>> > """test udf that will always return a"""
>> > return "a"
>> > sqlContext.registerFunction("test_udf", test_udf, StringType())
>> >
>> > sqlContext.sql("""CACHE TABLE ma""")
>> >
>> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
>> > test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
>> > ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
>> > STANDARD_ACCOUNT_CITY_SRC)
>> >  /
>> > CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)>LENGTH
>> > (STANDARD_ACCOUNT_CITY_SRC)
>> > THEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)
>> > ELSE LENGTH (STANDARD_ACCOUNT_CITY_SRC)
>> >END),2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
>> > STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
>> > FROM ma""")
>> >
>> > results_df.registerTempTable("m")
>> > sqlContext.cacheTable("m")
>> >
>> > results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
>> > print(results_df.take(1))
>> >
>> >
>> > - the error thrown on the executors:
>> >
>> > 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
>> > writer for /hadoop/cloudera/parcels/Anaconda/bin/python
>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>> > at
>> >
>> > 

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
Take out the conditional and the sqlcontext and just do

rdd => {
  rdd.foreach(println)


as a base line to see if you're reading the data you expect

On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
 wrote:
> Hi,
>
> I am reading json messages from kafka . Topics has 2 partitions. When
> running streaming job using spark-submit, I could see that  val dataFrame =
> sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
> something wrong here. Below is code .This environment is cloudera sandbox
> env. Same issue in hadoop production cluster mode except that it is
> restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka
> 0.10 and  Spark 1.4.
>
> val kafkaParams =
> Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092",
> "group.id" -> "xyz","auto.offset.reset"->"smallest")
> val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> val ssc = new StreamingContext(conf, Seconds(1))
>
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
>
> val topics = Set("gpp.minf")
> val kafkaStream = KafkaUtils.createDirectStream[String, String,
> StringDecoder,StringDecoder](ssc, kafkaParams, topics)
>
> kafkaStream.foreachRDD(
>   rdd => {
> if (rdd.count > 0){
> val dataFrame = sqlContext.read.json(rdd.map(_._2))
>dataFrame.printSchema()
> //dataFrame.foreach(println)
> }
> }

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi,

I am reading json messages from Kafka. The topic has 2 partitions. When
running the streaming job using spark-submit, I can see that val dataFrame =
sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing
something wrong here? Below is the code. This environment is a Cloudera
sandbox. The same issue occurs in the Hadoop production cluster, but that
environment is restricted, which is why I tried to reproduce the issue in the
Cloudera sandbox. Kafka 0.10 and Spark 1.4.

val kafkaParams =
Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092",
"group.id" -> "xyz","auto.offset.reset"->"smallest")
val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
val ssc = new StreamingContext(conf, Seconds(1))

val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)

val topics = Set("gpp.minf")
val kafkaStream = KafkaUtils.createDirectStream[String, String,
StringDecoder,StringDecoder](ssc, kafkaParams, topics)

kafkaStream.foreachRDD(
  rdd => {
    if (rdd.count > 0) {
      val dataFrame = sqlContext.read.json(rdd.map(_._2))
      dataFrame.printSchema()
      //dataFrame.foreach(println)
    }
  })


Re: Writing all values for same key to one file

2016-08-09 Thread neil90
Why not just create a partitions for they key you want to groupby and save it
in there? Appending to a file already written to HDFS isn't the best idea
IMO.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-all-values-for-same-key-to-one-file-tp27455p27501.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark join and large temp files

2016-08-09 Thread Gourav Sengupta
In case of skewed data the joins will mess things up. Try to write a UDF with
the lookup on a broadcast variable and then let me know the results. It should
not take more than 40 mins on a 32 GB RAM system with 6-core processors.


Gourav
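
A minimal sketch of that broadcast-lookup idea (illustrative only: bigDF,
smallDF, the "id"/"number" columns and the SparkSession named spark are
assumptions, and it presumes the small table's id -> number map fits
comfortably on the driver):

import org.apache.spark.sql.functions.{col, udf}

// collect the small side as a plain map on the driver and broadcast it
val smallMap = smallDF.rdd.map(r => (r.getString(0), r.getInt(1))).collectAsMap()
val numberById = spark.sparkContext.broadcast(smallMap)

// resolve the lookup with a UDF instead of a shuffle join; Option covers missing ids
val lookupNumber = udf((id: String) => numberById.value.get(id))
val joined = bigDF.withColumn("number", lookupNumber(col("id")))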

On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab  wrote:

> Hi Mich,
> Hardware: AWS EMR cluster with 15 nodes with Rx3.2xlarge (CPU, RAM fine,
> disk a couple of hundred gig).
>
> When we do:
>
> onPointFiveTB.join(onePointFiveGig.cache(), "id")
>
> we're seeing that the temp directory is filling up fast, until a node gets
> killed. And then everything dies.
>
> -Ashic.
>
> --
> From: mich.talebza...@gmail.com
> Date: Tue, 9 Aug 2016 17:25:23 +0100
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: samkiller@gmail.com; deepakmc...@gmail.com; user@spark.apache.org
>
>
> Hi Sam,
>
> What is your spark Hardware spec, No of nodes, RAM per node and disks
> please?
>
> I don't understand this should not really be an issue. Underneath the
> bonnet it is a hash join. The small table I gather can be cached and the
> big table will do multiple passes using the temp space.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 August 2016 at 15:46, Ashic Mahtab  wrote:
>
> Hi Sam,
> Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's
> no progress. The spark UI doesn't even show up.
>
> -Ashic.
>
> --
> From: samkiller@gmail.com
> Date: Tue, 9 Aug 2016 16:21:27 +0200
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: deepakmc...@gmail.com; user@spark.apache.org
>
>
> Have you tried to broadcast your small table in order to perform your join?
>
> joined = bigDF.join(broadcast(smallDF), ...)
>
>
> On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> No...not really. Upping the disk size is a solution, but more expensive as
> you can't attach EBS volumes to EMR clusters configured with data pipelines
> easily (which is what we're doing). I've tried collecting the 1.5G dataset
> in a hashmap, and broadcasting. Timeouts seems to prevent that (even after
> upping the max driver result size). Increasing partition counts didn't help
> (the shuffle used up the temp space). I'm now looking at some form of
> clever broadcasting, or maybe falling back to chunking up the input,
> producing interim output, and unioning them for the final output. Might
> even try using Spark Streaming pointing to the parquet and seeing if that
> helps.
>
> -Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 17:31:19 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
>
> Hi Ashic
> Did you find the resolution to this issue?
> Just curious to know like what helped in this scenario.
>
> Thanks
> Deepak
>
>
> On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> Thanks for the response.
>
> Registering the temp tables didn't help. Here's what I have:
>
> val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id")
> val b = sqlContext.read.parquet(...).select("id", "number")
>
> a.registerTempTable("a")
> b.registerTempTable("b")
>
> val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")
>
> results.write.parquet(...)
>
> Is there something I'm missing?
>
> Cheers,
> Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 00:01:32 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: user@spark.apache.org
>
>
> Register you dataframes as temp tables and then try the join on the temp
> table.
> This should resolve your issue.
>
> Thanks
> Deepak
>
> On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String  (1.5TB)
> b: id:String, Number:Int  (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to 

spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Dear all,

I am a newbie to Spark. Currently I am trying to import the source code of Spark 
2.0 as a module into an existing client project.

I have imported spark-core, spark-sql and spark-catalyst as Maven dependencies 
in this client project.

During compilation, errors about a missing SqlBaseParser.java occurred.

After searching online, I found an article in StackOverflow 
http://stackoverflow.com/questions/35617277/spark-sql-has-no-sparksqlparser-scala-file-when-compiling-in-intellij-idea
 to solve this issue.


So I used mvn to build Spark 2.0 first and imported 
...catalyst/target/generated-sources/antlr4 as a new source folder in the Maven 
dependency "spark-catalyst".
Now the problem is that I still get the following errors:

Error:scalac: error while loading package, Missing dependency 'bad symbolic 
reference. A signature in package.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
package.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/package.class
Error:scalac: error while loading SparkSession, Missing dependency 'bad 
symbolic reference. A signature in SparkSession.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
SparkSession.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/SparkSession.class
Error:scalac: error while loading RDD, Missing dependency 'bad symbolic 
reference. A signature in RDD.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
RDD.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/core/target/scala-2.11/classes/org/apache/spark/rdd/RDD.class
Error:scalac: error while loading JavaRDDLike, Missing dependency 'bad symbolic 
reference. A signature in JavaRDDLike.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
JavaRDDLike.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/core/target/scala-2.11/classes/org/apache/spark/api/java/JavaRDDLike.class
Error:scalac: error while loading Dataset, Missing dependency 'bad symbolic 
reference. A signature in Dataset.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
Dataset.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/Dataset.class
Error:scalac: error while loading ColumnName, Missing dependency 'bad symbolic 
reference. A signature in ColumnName.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
ColumnName.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/core/target/scala-2.11/classes/org/apache/spark/sql/ColumnName.class
Error:scalac: error while loading Encoder, Missing dependency 'bad symbolic 
reference. A signature in Encoder.class refers to term annotation
in package org.apache.spark which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
Encoder.class.', required by 
/home/weiping/workspace/tools/spark-2.0.0/sql/catalyst/target/scala-2.11/classes/org/apache/spark/sql/Encoder.class


Can anyone help me?

Thank you,
Mic
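
A note on one possible cause (an assumption about this setup, not verified
against it): in Spark 2.0 the org.apache.spark.annotation classes live in the
separate spark-tags module, so "missing ... term annotation in package
org.apache.spark" often points at that module being absent from the classpath.
A minimal sketch of adding it, written as an sbt line for brevity; the
equivalent Maven coordinates would be groupId org.apache.spark, artifactId
spark-tags_2.11, version 2.0.0.

// build.sbt sketch: assumes Scala 2.11 and Spark 2.0.0; match it to the
// versions of the other Spark modules already imported into the project.
libraryDependencies += "org.apache.spark" %% "spark-tags" % "2.0.0"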


Unsubscribe.

2016-08-09 Thread Martin Somers
Unsubscribe.

Thanks
M


Unsubscribe

2016-08-09 Thread Hogancamp, Aaron
Unsubscribe.

Thanks,

Aaron Hogancamp
Data Scientist



Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
alrighty then!

bcc'ing user list.  cc'ing dev list.

@user list people:  do not read any further or you will be in violation of
ASF policies!

On Tue, Aug 9, 2016 at 11:50 AM, Mark Hamstra 
wrote:

> That's not going to happen on the user list, since that is against ASF
> policy (http://www.apache.org/dev/release.html):
>
> During the process of developing software and preparing a release, various
>> packages are made available to the developer community for testing
>> purposes. Do not include any links on the project website that might
>> encourage non-developers to download and use nightly builds, snapshots,
>> release candidates, or any other similar package. The only people who
>> are supposed to know about such packages are the people following the dev
>> list (or searching its archives) and thus aware of the conditions placed on
>> the package. If you find that the general public are downloading such test
>> packages, then remove them.
>>
>
> On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly  wrote:
>
>> this is a valid question.  there are many people building products and
>> tooling on top of spark and would like access to the latest snapshots and
>> such.  today's ink is yesterday's news to these people - including myself.
>>
>> what is the best way to get snapshot releases including nightly and
>> specially-blessed "preview" releases so that we, too, can say "try the
>> latest release in our product"?
>>
>> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
>> ignored because of conflicting/confusing/changing responses.  and i'd
>> rather not dig through jenkins builds to figure this out as i'll likely get
>> it wrong.
>>
>> please provide the relevant snapshot/preview/nightly/whatever repos (or
>> equivalent) that we need to include in our builds to have access to the
>> absolute latest build assets for every major and minor release.
>>
>> thanks!
>>
>> -chris
>>
>>
>> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> LOL
>>>
>>> Ink has not dried on Spark 2 yet so to speak :)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>>>
 What are you expecting to find?  There currently are no releases beyond
 Spark 2.0.0.

 On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
 wrote:

> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I 
> can't
> find the newer versions on Maven central.
>
> Thank you!
> Jestin
>


>>>
>>
>>
>> --
>> *Chris Fregly*
>> Research Scientist @ PipelineIO
>> San Francisco, CA
>> pipeline.io
>> advancedspark.com
>>
>>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Sean Owen
Nightlies are built and made available in the ASF snapshot repo, from
master. This is noted at the bottom of the downloads page, and at
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds
. This hasn't changed in as long as I can recall.

Nightlies are not blessed, and are not for consumption other than by
developers. That is you shouldn't bundle them in a release, shouldn't
release a product based on "2.0.1 snapshot" for example because no
such ASF release exists. This info isn't meant to be secret, but is
not made obvious to casual end users for this reason. Yes it's for
developers who want to test other products in advance.

So-called preview releases are really just normal releases and are
made available in the usual way. They just have a different name. I
don't know if another one of those will happen; maybe for 3.0.

The published master snapshot would give you 2.1.0-SNAPSHOT at the
moment. Other branches don't have nightlies, but are likely to be of
less interest.

You can always "mvn -DskipTests install" from a checkout of any branch
to make the branch's SNAPSHOT available in your local Maven repo, or
even publish it to your private repo.
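
For reference, a minimal sketch of pulling a nightly into a build, written here
as sbt/Scala for brevity. Assumptions to verify against the nightly-builds page
linked above: the ASF snapshot repository URL and the 2.1.0-SNAPSHOT version
string.

// build.sbt sketch: nightly snapshots are for development and testing only,
// as described above; do not ship a product against them.
resolvers += "Apache Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0-SNAPSHOT" % "provided"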

On Tue, Aug 9, 2016 at 7:32 PM, Chris Fregly  wrote:
> this is a valid question.  there are many people building products and
> tooling on top of spark and would like access to the latest snapshots and
> such.  today's ink is yesterday's news to these people - including myself.
>
> what is the best way to get snapshot releases including nightly and
> specially-blessed "preview" releases so that we, too, can say "try the
> latest release in our product"?
>
> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
> ignored because of conflicting/confusing/changing responses.  and i'd rather
> not dig through jenkins builds to figure this out as i'll likely get it
> wrong.
>
> please provide the relevant snapshot/preview/nightly/whatever repos (or
> equivalent) that we need to include in our builds to have access to the
> absolute latest build assets for every major and minor release.
>
> thanks!
>
> -chris
>
>
> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh 
> wrote:
>>
>> LOL
>>
>> Ink has not dried on Spark 2 yet so to speak :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>>>
>>> What are you expecting to find?  There currently are no releases beyond
>>> Spark 2.0.0.
>>>
>>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
>>> wrote:

 If we want to use versions of Spark beyond the official 2.0.0 release,
 specifically on Maven + Java, what steps should we take to upgrade? I can't
 find the newer versions on Maven central.

 Thank you!
 Jestin
>>>
>>>
>>
>
>
>
> --
> Chris Fregly
> Research Scientist @ PipelineIO
> San Francisco, CA
> pipeline.io
> advancedspark.com
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
That's not going to happen on the user list, since that is against ASF
policy (http://www.apache.org/dev/release.html):

During the process of developing software and preparing a release, various
> packages are made available to the developer community for testing
> purposes. Do not include any links on the project website that might
> encourage non-developers to download and use nightly builds, snapshots,
> release candidates, or any other similar package. The only people who are
> supposed to know about such packages are the people following the dev list
> (or searching its archives) and thus aware of the conditions placed on the
> package. If you find that the general public are downloading such test
> packages, then remove them.
>

On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly  wrote:

> this is a valid question.  there are many people building products and
> tooling on top of spark and would like access to the latest snapshots and
> such.  today's ink is yesterday's news to these people - including myself.
>
> what is the best way to get snapshot releases including nightly and
> specially-blessed "preview" releases so that we, too, can say "try the
> latest release in our product"?
>
> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
> ignored because of conflicting/confusing/changing responses.  and i'd
> rather not dig through jenkins builds to figure this out as i'll likely get
> it wrong.
>
> please provide the relevant snapshot/preview/nightly/whatever repos (or
> equivalent) that we need to include in our builds to have access to the
> absolute latest build assets for every major and minor release.
>
> thanks!
>
> -chris
>
>
> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> LOL
>>
>> Ink has not dried on Spark 2 yet so to speak :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>>
>>> What are you expecting to find?  There currently are no releases beyond
>>> Spark 2.0.0.
>>>
>>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
>>> wrote:
>>>
 If we want to use versions of Spark beyond the official 2.0.0 release,
 specifically on Maven + Java, what steps should we take to upgrade? I can't
 find the newer versions on Maven central.

 Thank you!
 Jestin

>>>
>>>
>>
>
>
> --
> *Chris Fregly*
> Research Scientist @ PipelineIO
> San Francisco, CA
> pipeline.io
> advancedspark.com
>
>


Re: Spark Job Doesn't End on Mesos

2016-08-09 Thread Michael Gummelt
Is this a new issue?
What version of Spark?
What version of Mesos/libmesos?
Can you run the job with debug logging turned on and attach the output?
Do you see the corresponding message in the mesos master that indicates it
received the teardown?

On Tue, Aug 9, 2016 at 1:28 AM, Todd Leo  wrote:

> Hi,
>
> I’m running Spark jobs on Mesos. When the job finishes, *SparkContext* is
> manually closed by sc.stop(). Then Mesos log shows:
>
> I0809 15:48:34.132014 11020 sched.cpp:1589] Asked to stop the driver
> I0809 15:48:34.132181 11277 sched.cpp:831] Stopping framework 
> '20160808-170425-2365980426-5050-4372-0034'
>
> However, the process doesn’t quit after all. This is critical, because I’d
> like to use SparkLauncher to submit such jobs. If my job doesn’t end, jobs
> will pile up and fill up the memory. Pls help. :-|
>
> —
> BR,
> Todd Leo
> ​
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
this is a valid question.  there are many people building products and
tooling on top of spark and would like access to the latest snapshots and
such.  today's ink is yesterday's news to these people - including myself.

what is the best way to get snapshot releases including nightly and
specially-blessed "preview" releases so that we, too, can say "try the
latest release in our product"?

there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
ignored because of conflicting/confusing/changing responses.  and i'd
rather not dig through jenkins builds to figure this out as i'll likely get
it wrong.

please provide the relevant snapshot/preview/nightly/whatever repos (or
equivalent) that we need to include in our builds to have access to the
absolute latest build assets for every major and minor release.

thanks!

-chris


On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh 
wrote:

> LOL
>
> Ink has not dried on Spark 2 yet so to speak :)
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 August 2016 at 17:56, Mark Hamstra  wrote:
>
>> What are you expecting to find?  There currently are no releases beyond
>> Spark 2.0.0.
>>
>> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
>> wrote:
>>
>>> If we want to use versions of Spark beyond the official 2.0.0 release,
>>> specifically on Maven + Java, what steps should we take to upgrade? I can't
>>> find the newer versions on Maven central.
>>>
>>> Thank you!
>>> Jestin
>>>
>>
>>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com


Unsubscribe

2016-08-09 Thread Aakash Basu



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
> Does this mean you only have 1.6G memory for executor (others left for
Python) ?
> The cached table could take 1.5G, it means almost nothing left for other
things.
True. I have also tried with memoryOverhead being set to 800 (10% of the
8Gb memory), but no difference. The "GC overhead limit exceeded" is still
the same.

> Python UDF do requires some buffering in JVM, the size of buffering
depends on how much rows are under processing by Python process.
I did some more testing in the meantime.
Leaving the UDFs as-is, but removing some other, static columns from the
above SELECT FROM command has stopped the memoryOverhead error
from occurring. I have plenty enough memory to store the results with all
static columns, plus when the UDFs are not there only the rest of the
static columns are, then it runs fine. This makes me believe that having
UDFs and many columns causes the issue together. Maybe when you have UDFs
then somehow the memory usage depends on the amount of data in that record
(the whole row), which includes other fields too, which are actually not
used by the UDF. Maybe the UDF serialization to Python serializes the whole
row instead of just the attributes of the UDF?

On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu  wrote:

> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor 
> wrote:
> > Hi all,
> >
> > I have an interesting issue trying to use UDFs from SparkSQL in Spark
> 2.0.0
> > using pyspark.
> >
> > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300
> > executors's memory in SparkSQL, on which we would do some calculation
> using
> > UDFs in pyspark.
> > If I run my SQL on only a portion of the data (filtering by one of the
> > attributes), let's say 800 million records, then all works well. But
> when I
> > run the same SQL on all the data, then I receive
> > "java.lang.OutOfMemoryError: GC overhead limit exceeded" from basically
> all
> > of the executors.
> >
> > It seems to me that pyspark UDFs in SparkSQL might have a memory leak,
> > causing this "GC overhead limit being exceeded".
> >
> > Details:
> >
> > - using Spark 2.0.0 on a Hadoop YARN cluster
> >
> > - 300 executors, each with 2 CPU cores and 8Gb memory (
> > spark.yarn.executor.memoryOverhead=6400 )
>
> Does this mean you only have 1.6G memory for executor (others left for
> Python) ?
> The cached table could take 1.5G, it means almost nothing left for other
> things.
>
> Python UDF do requires some buffering in JVM, the size of buffering
> depends on
> how much rows are under processing by Python process.
>
> > - a table of 5.6 Billions rows loaded into the memory of the executors
> > (taking up 450Gb of memory), partitioned evenly across the executors
> >
> > - creating even the simplest UDF in SparkSQL causes 'GC overhead limit
> > exceeded' error if running on all records. Running the same on a smaller
> > dataset (~800 million rows) does succeed. If no UDF, the query succeed on
> > the whole dataset.
> >
> > - simplified pyspark code:
> >
> > from pyspark.sql.types import StringType
> >
> > def test_udf(var):
> > """test udf that will always return a"""
> > return "a"
> > sqlContext.registerFunction("test_udf", test_udf, StringType())
> >
> > sqlContext.sql("""CACHE TABLE ma""")
> >
> > results_df = sqlContext.sql("""SELECT SOURCE, SOURCE_SYSTEM,
> > test_udf(STANDARD_ACCOUNT_STREET_SRC) AS TEST_UDF_OP,
> > ROUND(1.0 - (levenshtein(STANDARD_ACCOUNT_CITY_SRC,
> > STANDARD_ACCOUNT_CITY_SRC)
> >  /
> > CASE WHEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)>LENGTH
> > (STANDARD_ACCOUNT_CITY_SRC)
> > THEN LENGTH (STANDARD_ACCOUNT_CITY_SRC)
> > ELSE LENGTH (STANDARD_ACCOUNT_CITY_SRC)
> >END),2) AS SCORE_ED_STANDARD_ACCOUNT_CITY,
> > STANDARD_ACCOUNT_STATE_SRC, STANDARD_ACCOUNT_STATE_UNIV
> > FROM ma""")
> >
> > results_df.registerTempTable("m")
> > sqlContext.cacheTable("m")
> >
> > results_df = sqlContext.sql("""SELECT COUNT(*) FROM m""")
> > print(results_df.take(1))
> >
> >
> > - the error thrown on the executors:
> >
> > 16/08/08 15:38:17 ERROR util.Utils: Uncaught exception in thread stdout
> > writer for /hadoop/cloudera/parcels/Anaconda/bin/python
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at
> > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(
> UnsafeRow.java:503)
> > at
> > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(
> UnsafeRow.java:61)
> > at
> > org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$
> 1.apply(BatchEvalPythonExec.scala:64)
> > at
> > org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$
> 1.apply(BatchEvalPythonExec.scala:64)
> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> > at
> > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.
> scala:1076)
> > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
> > at 

Sparking Water (Spark 1.6.0 + H2O 3.8.2.6 ) on CDH 5.7.1

2016-08-09 Thread RK Aduri
All,

Ran into one strange issue. If I initialize an H2O context and start it (NOT 
using it anywhere), the count action on a Spark data frame results in an 
error. The same count action on the Spark data frame works fine when the 
H2O context is not initialized. 

hc = H2OContext(sc).start()

It fails in py4j/protocol.py in the function get_return_value. Has anyone out 
there who has played with Sparkling Water faced any issues with H2O + Spark on Cloudera?

Thanks,
RK
-- 
Collective[i] dramatically improves sales and marketing performance using 
technology, applications and a revolutionary network designed to provide 
next generation analytics and decision-support directly to business users. 
Our goal is to maximize human potential and minimize mistakes. In most 
cases, the results are astounding. We cannot, however, stop emails from 
sometimes being sent to the wrong person. If you are not the intended 
recipient, please notify us by replying to this email's sender and deleting 
it (and any attachments) permanently from your system. If you are, please 
respect the confidentiality of this communication's contents.


Spark on mesos in docker not getting parameters

2016-08-09 Thread Jim Carroll
I'm running spark 2.0.0 on Mesos using spark.mesos.executor.docker.image to
point to a docker container that I built with the Spark installation.

Everything is working except the Spark client process that's started inside
the container doesn't get any of my parameters I set in the spark config in
the driver.

I set spark.executor.extraJavaOptions and spark.executor.extraClassPath in
the driver and they don't get passed all the way through. Here is a capture
of the chain of processes that are started on the mesos slave, in the docker
container:

root  1064  1051  0 12:46 ?00:00:00 docker -H
unix:///var/run/docker.sock run --cpu-shares 8192 --memory 4723834880 -e
SPARK_CLASSPATH=[path to my jar] -e SPARK_EXECUTOR_OPTS=
-Daws.accessKeyId=[myid] -Daws.secretKey=[mykey] -e SPARK_USER=root -e
SPARK_EXECUTOR_MEMORY=4096m -e MESOS_SANDBOX=/mnt/mesos/sandbox -e
MESOS_CONTAINER_NAME=mesos-90e2c720-1e45-4dbc-8271-f0c47a33032a-S0.772f8080-6278-4a35-9e57-0009787ac605
-v
/tmp/mesos/slaves/90e2c720-1e45-4dbc-8271-f0c47a33032a-S0/frameworks/f5794f8a-b56f-4958-b906-f05c426dcef0-0001/executors/0/runs/772f8080-6278-4a35-9e57-0009787ac605:/mnt/mesos/sandbox
--net host --entrypoint /bin/sh --name
mesos-90e2c720-1e45-4dbc-8271-f0c47a33032a-S0.772f8080-6278-4a35-9e57-0009787ac605
[my docker image] -c  "/opt/spark/./bin/spark-class"
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
--hostname 192.168.10.145 --cores 8 --app-id
f5794f8a-b56f-4958-b906-f05c426dcef0-0001

root  1193  1175  0 12:46 ?00:00:00 /bin/sh -c 
"/opt/spark/./bin/spark-class"
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
--hostname 192.168.10.145 --cores 8 --app-id
f5794f8a-b56f-4958-b906-f05c426dcef0-0001

root  1208  1193  0 12:46 ?00:00:00 bash
/opt/spark/./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
--hostname 192.168.10.145 --cores 8 --app-id
f5794f8a-b56f-4958-b906-f05c426dcef0-0001

root  1213  1208  0 12:46 ?00:00:00 bash
/opt/spark/./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
--hostname 192.168.10.145 --cores 8 --app-id
f5794f8a-b56f-4958-b906-f05c426dcef0-0001

root  1215  1213  0 12:46 ?00:00:00
/usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx128m -cp /opt/spark/jars/*
org.apache.spark.launcher.Main
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@192.168.10.145:46121 --executor-id 0
--hostname 192.168.10.145 --cores 8 --app-id
f5794f8a-b56f-4958-b906-f05c426dcef0-0001

Notice that in the initial process started by Mesos, both the SPARK_CLASSPATH is
set to the value of spark.executor.extraClassPath and the -D options are set
as I set them on spark.executor.extraJavaOptions (in this case, to my AWS
creds) in the driver configuration.

However, they are missing in subsequent child processes and the final java
process started doesn't contain them either.

I "fixed" the classpath problem by putting my jar in /opt/spark/jars
(/opt/spark is the location I have spark installed in the docker container).

Can someone tell me what I'm missing?

Thanks
Jim




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-mesos-in-docker-not-getting-parameters-tp27500.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Mich,
Hardware: AWS EMR cluster with 15 nodes with Rx3.2xlarge (CPU, RAM fine, disk 
a couple of hundred gig).

When we do:

onPointFiveTB.join(onePointFiveGig.cache(), "id")

we're seeing that the temp directory is filling up fast, until a node gets 
killed. And then everything dies. 
-Ashic. 

From: mich.talebza...@gmail.com
Date: Tue, 9 Aug 2016 17:25:23 +0100
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: samkiller@gmail.com; deepakmc...@gmail.com; user@spark.apache.org

Hi Sam,
What is your spark Hardware spec, No of nodes, RAM per node and disks please?
I don't understand this should not really be an issue. Underneath the bonnet it 
is a hash join. The small table I gather can be cached and the big table will 
do multiple passes using the temp space.
HTH

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 9 August 2016 at 15:46, Ashic Mahtab  wrote:



Hi Sam,
Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no 
progress. The spark UI doesn't even show up.
-Ashic. 

From: samkiller@gmail.com
Date: Tue, 9 Aug 2016 16:21:27 +0200
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: deepakmc...@gmail.com; user@spark.apache.org

Have you tried to broadcast your small table table in order to perform your 
join ?
joined = bigDF.join(broadcast(smallDF), Seq("id"))

On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:



Hi Deepak,
No... not really. Upping the disk size is a solution, but more expensive as 
you can't attach EBS volumes to EMR clusters configured with data pipelines 
easily (which is what we're doing). I've tried collecting the 1.5G dataset in 
a hashmap and broadcasting. Timeouts seem to prevent that (even after upping 
the max driver result size). Increasing partition counts didn't help (the 
shuffle used up the temp space). I'm now looking at some form of clever 
broadcasting, or maybe falling back to chunking up the input, producing 
interim output, and unioning them for the final output. Might even try using 
Spark Streaming pointing to the parquet and seeing if that helps. 
-Ashic. 

From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 17:31:19 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com

Hi Ashic,
Did you find the resolution to this issue? Just curious to know what helped 
in this scenario.

Thanks
Deepak


On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:



Hi Deepak,
Thanks for the response. 

Registering the temp tables didn't help. Here's what I have:

val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")

results.write.parquet(...)

Is there something I'm missing?

Cheers,
Ashic.
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp 
table. This should resolve your issue.

Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:



Hello,
We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int  (1.3GB)

We need to join these two to get (id, Number, Name). We've tried two approaches:

a.join(b, Seq("id"), "right_outer")

where a and b are dataframes. We also tried taking the rdds, mapping them to 
pair rdds with id as the key, and then joining. What we're seeing is that temp 
file usage is increasing on the join stage, and filling up our disks, causing 
the job to crash. Is there a way to join these two data sets without 
well... crashing?

Note, the ids are unique, and there's a one-to-one mapping between the two 
datasets. 
Any help would be appreciated.
-Ashic. 



  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  

  

  

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Hi Santoshakhilesh,

I'd seen that already, but I was trying to avoid using rdds to perform this
calculation.

@Ayan, it seems I was mistaken, and doing a sum(b) over(order by b) totally
works.  I guess I expected the windowing with sum to work more like
oracle.  Thanks for the suggestion :)

Thank you both for your help,

Jon
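
For anyone who finds this thread later, a minimal sketch of that working
approach in Scala, assuming a DataFrame df with a numeric column b (in Spark
1.6, window functions require a HiveContext):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// An ordered window's default frame is "unbounded preceding to current row",
// so an ordered sum is already a running (cumulative) total.
val w = Window.orderBy(col("b"))   // add partitionBy(...) for a per-key running total
val withCumSum = df.withColumn("cum_sum", sum(col("b")).over(w))
// For the pre-sorted input 1, 2, 3, 4 this yields 1, 3, 6, 10.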

On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh 
wrote:

> You could check following link.
>
>
> http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
>
>
>
> *From:* Jon Barksdale [mailto:jon.barksd...@gmail.com]
> *Sent:* 09 August 2016 08:21
> *To:* ayan guha
> *Cc:* user
> *Subject:* Re: Cumulative Sum function using Dataset API
>
>
>
> I don't think that would work properly, and would probably just give me
> the sum for each partition. I'll give it a try when I get home just to be
> certain.
>
> To maybe explain the intent better, if I have a column (pre sorted) of
> (1,2,3,4), then the cumulative sum would return (1,3,6,10).
>
> Does that make sense? Naturally, if ordering a sum turns it into a
> cumulative sum, I'll gladly use that :)
>
> Jon
>
> On Mon, Aug 8, 2016 at 4:55 PM ayan guha  wrote:
>
> You mean you are not able to use sum(col) over (partition by key order by
> some_col) ?
>
>
>
> On Tue, Aug 9, 2016 at 9:53 AM, jon  wrote:
>
> Hi all,
>
> I'm trying to write a function that calculates a cumulative sum as a column
> using the Dataset API, and I'm a little stuck on the implementation.  From
> what I can tell, UserDefinedAggregateFunctions don't seem to support
> windowing clauses, which I think I need for this use case.  If I write a
> function that extends from AggregateWindowFunction, I end up needing
> classes
> that are package private to the sql package, so I need to make my function
> under the org.apache.spark.sql package, which just feels wrong.
>
> I've also considered writing a custom transformer, but haven't spend as
> much
> time reading through the code, so I don't know how easy or hard that would
> be.
>
> TLDR; What's the best way to write a function that returns a value for
> every
> row, but has mutable state, and gets row in a specific order?
>
> Does anyone have any ideas, or examples?
>
> Thanks,
>
> Jon
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-Sum-function-using-Dataset-API-tp27496.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mich Talebzadeh
LOL

Ink has not dried on Spark 2 yet so to speak :)

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 17:56, Mark Hamstra  wrote:

> What are you expecting to find?  There currently are no releases beyond
> Spark 2.0.0.
>
> On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma 
> wrote:
>
>> If we want to use versions of Spark beyond the official 2.0.0 release,
>> specifically on Maven + Java, what steps should we take to upgrade? I can't
>> find the newer versions on Maven central.
>>
>> Thank you!
>> Jestin
>>
>
>


Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
What are you expecting to find?  There currently are no releases beyond
Spark 2.0.0.

On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma  wrote:

> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the newer versions on Maven central.
>
> Thank you!
> Jestin
>


Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Jestin Ma
If we want to use versions of Spark beyond the official 2.0.0 release,
specifically on Maven + Java, what steps should we take to upgrade? I can't
find the newer versions on Maven central.

Thank you!
Jestin


DataFrame equivalent to RDD.partionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey?
I'm reading data from a file data source and I want the data I'm reading in
to be partitioned the same way as the data I'm processing through a Spark
Streaming RDD in the same job.
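
As far as I know there is no DataFrameReader option for this. A sketch of the
closest DataFrame-side equivalent in Spark 1.6, assuming a hypothetical
partition column named "key" and a hypothetical input path; note this
hash-partitions the DataFrame by that column but does not guarantee
co-partitioning with an RDD produced elsewhere:

import org.apache.spark.sql.functions.col

// Spark 1.6+: repartition by a column expression (hash partitioning on "key").
val df = sqlContext.read.parquet("/path/to/data")      // hypothetical source
val partitioned = df.repartition(200, col("key"))      // explicit partition count is optional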


Re: Spark join and large temp files

2016-08-09 Thread Mich Talebzadeh
Hi Sam,

What is your spark Hardware spec, No of nodes, RAM per node and disks
please?

I don't understand this should not really be an issue. Underneath the
bonnet it is a hash join. The small table I gather can be cached and the
big table will do multiple passes using the temp space.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 15:46, Ashic Mahtab  wrote:

> Hi Sam,
> Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's
> no progress. The spark UI doesn't even show up.
>
> -Ashic.
>
> --
> From: samkiller@gmail.com
> Date: Tue, 9 Aug 2016 16:21:27 +0200
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: deepakmc...@gmail.com; user@spark.apache.org
>
>
> Have you tried to broadcast your small table table in order to perform
> your join ?
>
> joined = bigDF.join(broadcast(smallDF), Seq("id"))
>
>
> On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> No...not really. Upping the disk size is a solution, but more expensive as
> you can't attach EBS volumes to EMR clusters configured with data pipelines
> easily (which is what we're doing). I've tried collecting the 1.5G dataset
> in a hashmap, and broadcasting. Timeouts seems to prevent that (even after
> upping the max driver result size). Increasing partition counts didn't help
> (the shuffle used up the temp space). I'm now looking at some form of
> clever broadcasting, or maybe falling back to chunking up the input,
> producing interim output, and unioning them for the final output. Might
> even try using Spark Streaming pointing to the parquet and seeing if that
> helps.
>
> -Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 17:31:19 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
>
> Hi Ashic
> Did you find the resolution to this issue?
> Just curious to know like what helped in this scenario.
>
> Thanks
> Deepak
>
>
> On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> Thanks for the response.
>
> Registering the temp tables didn't help. Here's what I have:
>
> val a = sqlContext.read.parquet(...).select("eid.id",
> "name").withColumnRenamed("eid.id", "id")
> val b = sqlContext.read.parquet(...).select("id", "number")
>
> a.registerTempTable("a")
> b.registerTempTable("b")
>
> val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join
> b y on x.id=y.id")
>
> results.write.parquet(...)
>
> Is there something I'm missing?
>
> Cheers,
> Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 00:01:32 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: user@spark.apache.org
>
>
> Register you dataframes as temp tables and then try the join on the temp
> table.
> This should resolve your issue.
>
> Thanks
> Deepak
>
> On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String  (1.5TB)
> b: id:String, Number:Int  (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to crash. Is there a way to join these two data sets
> without well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.
>
>
>
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>


Spark 1.6.1 and regexp_replace

2016-08-09 Thread Andrés Ivaldi
I'm seeing strange behaviour with regexp_replace. I'm trying to trim the
string and also collapse runs of more than one space into a single space.

Given a string like "   A  B   ", with trim only I get "A  B", which is
perfect; if I add regexp_replace I get "  A B".

Text1 is the column so I did

df.withColumn("Text1", expr ( "trim(regexp_replace(Text1,'\\s+',' ') )) )

Also tried another expressions with no luck either

Any idea?

thanks
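
One thing worth trying (a sketch, not a confirmed diagnosis of the behaviour
above): build the expression with the Column-based functions instead of an
expr() string, which avoids a second layer of backslash escaping in the SQL
parser.

import org.apache.spark.sql.functions.{col, regexp_replace, trim}

// Collapse runs of whitespace to a single space, then trim both ends.
val cleaned = df.withColumn("Text1", trim(regexp_replace(col("Text1"), "\\s+", " ")))
// "   A  B   "  becomes  "A B"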


RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Sam,
Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no 
progress. The spark UI doesn't even show up.
-Ashic. 

From: samkiller@gmail.com
Date: Tue, 9 Aug 2016 16:21:27 +0200
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: deepakmc...@gmail.com; user@spark.apache.org

Have you tried to broadcast your small table table in order to perform your 
join ?
joined = bigDF.join(broadcast(smallDF), Seq("id"))

On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:



Hi Deepak,
No... not really. Upping the disk size is a solution, but more expensive as 
you can't attach EBS volumes to EMR clusters configured with data pipelines 
easily (which is what we're doing). I've tried collecting the 1.5G dataset in 
a hashmap and broadcasting. Timeouts seem to prevent that (even after upping 
the max driver result size). Increasing partition counts didn't help (the 
shuffle used up the temp space). I'm now looking at some form of clever 
broadcasting, or maybe falling back to chunking up the input, producing 
interim output, and unioning them for the final output. Might even try using 
Spark Streaming pointing to the parquet and seeing if that helps. 
-Ashic. 

From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 17:31:19 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com

Hi Ashic,
Did you find the resolution to this issue? Just curious to know what helped 
in this scenario.

Thanks
Deepak


On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:



Hi Deepak,
Thanks for the response. 

Registering the temp tables didn't help. Here's what I have:

val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")

results.write.parquet(...)

Is there something I'm missing?

Cheers,
Ashic.
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp 
table. This should resolve your issue.

Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:



Hello,
We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int  (1.3GB)

We need to join these two to get (id, Number, Name). We've tried two approaches:

a.join(b, Seq("id"), "right_outer")

where a and b are dataframes. We also tried taking the rdds, mapping them to 
pair rdds with id as the key, and then joining. What we're seeing is that temp 
file usage is increasing on the join stage, and filling up our disks, causing 
the job to crash. Is there a way to join these two data sets without 
well... crashing?

Note, the ids are unique, and there's a one-to-one mapping between the two 
datasets. 
Any help would be appreciated.
-Ashic. 



  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  

  

Re: Spark join and large temp files

2016-08-09 Thread Sam Bessalah
Have you tried to broadcast your small table table in order to perform your
join ?

joined = bigDF.join(broadcast(smallDF), Seq("id"))


On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:

> Hi Deepak,
> No...not really. Upping the disk size is a solution, but more expensive as
> you can't attach EBS volumes to EMR clusters configured with data pipelines
> easily (which is what we're doing). I've tried collecting the 1.5G dataset
> in a hashmap, and broadcasting. Timeouts seems to prevent that (even after
> upping the max driver result size). Increasing partition counts didn't help
> (the shuffle used up the temp space). I'm now looking at some form of
> clever broadcasting, or maybe falling back to chunking up the input,
> producing interim output, and unioning them for the final output. Might
> even try using Spark Streaming pointing to the parquet and seeing if that
> helps.
>
> -Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 17:31:19 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
>
> Hi Ashic
> Did you find the resolution to this issue?
> Just curious to know like what helped in this scenario.
>
> Thanks
> Deepak
>
>
> On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> Thanks for the response.
>
> Registering the temp tables didn't help. Here's what I have:
>
> val a = sqlContext.read.parquet(...).select("eid.id",
> "name").withColumnRenamed("eid.id", "id")
> val b = sqlContext.read.parquet(...).select("id", "number")
>
> a.registerTempTable("a")
> b.registerTempTable("b")
>
> val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join
> b y on x.id=y.id")
>
> results.write.parquet(...)
>
> Is there something I'm missing?
>
> Cheers,
> Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 00:01:32 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: user@spark.apache.org
>
>
> Register you dataframes as temp tables and then try the join on the temp
> table.
> This should resolve your issue.
>
> Thanks
> Deepak
>
> On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String  (1.5TB)
> b: id:String, Number:Int  (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to crash. Is there a way to join these two data sets
> without well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.
>
>
>
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Deepak,
No... not really. Upping the disk size is a solution, but more expensive as 
you can't attach EBS volumes to EMR clusters configured with data pipelines 
easily (which is what we're doing). I've tried collecting the 1.5G dataset in 
a hashmap and broadcasting. Timeouts seem to prevent that (even after upping 
the max driver result size). Increasing partition counts didn't help (the 
shuffle used up the temp space). I'm now looking at some form of clever 
broadcasting, or maybe falling back to chunking up the input, producing 
interim output, and unioning them for the final output. Might even try using 
Spark Streaming pointing to the parquet and seeing if that helps. 
-Ashic. 
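
For reference, a minimal sketch of the collect-and-broadcast variant described
above, assuming the small side fits in driver memory once reduced to
(id, number) pairs and that the driver result-size limit allows the collect;
names follow the earlier example (a = big table, b = small table):

import org.apache.spark.sql.functions.{col, udf}

// Collect the small table as a plain Map on the driver and broadcast it once.
val smallMap: Map[String, Int] = b.select("id", "number")
  .rdd
  .map(r => r.getString(0) -> r.getInt(1))
  .collectAsMap()
  .toMap

val numberById = sc.broadcast(smallMap)

// Map-side "join": look up each big-table id in the broadcast map, no shuffle.
val lookup = udf((id: String) => numberById.value.get(id))
val joined = a.withColumn("number", lookup(col("id")))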

From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 17:31:19 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com

Hi Ashic,
Did you find the resolution to this issue? Just curious to know what helped 
in this scenario.

Thanks
Deepak


On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:



Hi Deepak,
Thanks for the response. 

Registering the temp tables didn't help. Here's what I have:

val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y on x.id=y.id")

results.write.parquet(...)

Is there something I'm missing?

Cheers,
Ashic.
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp 
table. This should resolve your issue.

Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:



Hello,
We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int  (1.3GB)

We need to join these two to get (id, Number, Name). We've tried two approaches:

a.join(b, Seq("id"), "right_outer")

where a and b are dataframes. We also tried taking the rdds, mapping them to 
pair rdds with id as the key, and then joining. What we're seeing is that temp 
file usage is increasing on the join stage, and filling up our disks, causing 
the job to crash. Is there a way to join these two data sets without 
well... crashing?

Note, the ids are unique, and there's a one-to-one mapping between the two 
datasets. 
Any help would be appreciated.
-Ashic. 



  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
  

Re: update specifc rows to DB using sqlContext

2016-08-09 Thread Mich Talebzadeh
Hi,


   1. what is the underlying DB, say Hive etc
   2. Is table transactional or you are going to do insert/overwrite to the
   same table
   3. can you do all this in the database itself assuming it is an RDBMS
   4. Can you provide the sql or pseudo code for such an update


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 13:39, sujeet jog  wrote:

> Hi,
>
> Is it possible to update certain columnr records  in DB  from spark,
>
> for example i have 10 rows with 3 columns  which are read from Spark SQL,
>
> i want to update specific column entries  and write back to DB, but since
> RDD"s are immutable i believe this would be difficult, is there a
> workaround.
>
>
> Thanks,
> Sujeet
>


update specifc rows to DB using sqlContext

2016-08-09 Thread sujeet jog
Hi,

Is it possible to update certain column records in the DB from Spark?

For example, I have 10 rows with 3 columns which are read from Spark SQL.

I want to update specific column entries and write back to the DB, but since
RDDs are immutable I believe this would be difficult. Is there a workaround?


Thanks,
Sujeet
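
One possible workaround, as a sketch only: keep the DataFrame immutable,
compute the new values in Spark, and push the changes back with plain JDBC
from foreachPartition. Everything below (the connection URL, credentials,
table and column names, and the "updated" DataFrame with (id, newValue)
columns) is hypothetical and assumes a JDBC-reachable RDBMS with a primary
key to update against.

import java.sql.DriverManager

val url = "jdbc:postgresql://dbhost:5432/mydb"   // hypothetical connection details
val (user, pass) = ("dbuser", "dbpass")

// updated: a DataFrame with columns (id, newValue) computed in Spark.
updated.select("id", "newValue").foreachPartition { rows =>
  val conn = DriverManager.getConnection(url, user, pass)
  val stmt = conn.prepareStatement("UPDATE my_table SET some_col = ? WHERE id = ?")
  try {
    rows.foreach { r =>
      stmt.setString(1, r.getString(1))   // new value
      stmt.setString(2, r.getString(0))   // key
      stmt.executeUpdate()                // batching would be better for real volumes
    }
  } finally {
    stmt.close()
    conn.close()
  }
}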


OrientDB through JDBC: query field names wrapped by double quote

2016-08-09 Thread Roberto Franchini
Hi to all,
I'm the maintainer of the JDBC driver OrientDB.

We are trying to fetch data into Spark from OrientDB using the JDBC driver.

I'm facing some issues:


To gather metadata, Spark performs a "test query" of this form: select
* from TABLE_NAME where 1=0.
For this case, I wrote a workaround inside the driver, getting rid of
the where 1=0 and replacing it with LIMIT 1.

After that query, it then performs a query with each field wrapped by
double quote:

SELECT "stringKey", "intKey" FROM Item

In OrientDB's SQL dialect a double quote means a string value, so for
each record of the result set it will return stringKey and intKey as
values.

row 1: stringKey:strinKey, intKey:intKey
row 2: stringKey:strinKey, intKey:intKey
row 3: stringKey:strinKey, intKey:intKey



Is there a way to configure the SQLContext to avoid the double quoting of
field names?

I'm using Java with spark 1.6.2:

Map<String, String> options = new HashMap<String, String>() {{
  put("url", "jdbc:orient:plocal:./target/databases/sparkTest");
  put("dbtable", "Item");
}};

SQLContext sqlCtx = new SQLContext(ctx);

DataFrame jdbcDF = sqlCtx.read().format("jdbc").options(options).load();


I found that someone has the same problem with SAS JDBC.
As a workaround I will implement a query cleaner inside the driver,
but an option to configure the quoting char would be better.
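
For what it's worth, a sketch of another angle, assuming the JdbcDialect
extension point present in the Spark 1.6 line (worth verifying against your
exact version): Spark lets you register a custom dialect whose quoteIdentifier
decides how column names are quoted, so an OrientDB dialect could leave
identifiers unquoted instead of wrapping them in double quotes.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch of a dialect that leaves identifiers unquoted for OrientDB URLs.
object OrientDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:orient:")
  override def quoteIdentifier(colName: String): String = colName
}

JdbcDialects.registerDialect(OrientDbDialect)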

Regards,
RF

-- 
Roberto Franchini
"The impossible is inevitable"
https://github.com/robfrank/ https://twitter.com/robfrankie
hangout:ro.franchini skype:ro.franchini

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark 1.6]-increment value column based on condition + Dataframe

2016-08-09 Thread Divya Gehlot
Hi,
I have a column with values like
Value
30
12
56
23
12
16
12
89
12
5
6
4
8

I need to create another column (newColValue below) that resets to 0 on each
row where col("value") > 30 and otherwise increments by 1 from the previous row:
newColValue
0
1
0
1
2
3
4
0
1
2
3
4
5

How can I create such an incrementing column?
The grouping is happening based on some other columns which are not mentioned
here.
When I try the window sum function, it sums, but instead of incrementing,
the total sum is displayed in all the rows.
val overWin = Window.partitionBy('col1,'col2,'col3).orderBy('Value)
val total = sum('Value).over(overWin)

With this logic
I am getting the below result
0
1
0
4
4
4
4
0
5
5
5
5
5

I have also written my own UDF, but custom UDFs are not supported in window
functions in Spark 1.6.

Would really appreciate the help.


Thanks,
Divya




Am I missing something?
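
A sketch of one way to get that result with only built-in window functions
(no UDF), assuming Spark 1.6 with a HiveContext, an input DataFrame df, and an
ordering column (hypothetically called ts here) that fixes the row order
inside each (col1, col2, col3) group:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byGroup = Window.partitionBy(col("col1"), col("col2"), col("col3")).orderBy(col("ts"))

// 1 marks a row that starts a new run (Value > 30), 0 otherwise.
val marker = when(col("Value") > 30, 1).otherwise(0)

// A running sum of the markers labels each run with an id;
// rowsBetween(Long.MinValue, 0) means "unbounded preceding to current row".
val withRun = df.withColumn("runId",
  sum(marker).over(byGroup.rowsBetween(Long.MinValue, 0)))

// Numbering rows inside each run and subtracting 1 gives 0, 1, 2, ...
// restarting at every marker row.
val byRun = Window.partitionBy(col("col1"), col("col2"), col("col3"), col("runId"))
  .orderBy(col("ts"))
val result = withRun.withColumn("newColValue", row_number().over(byRun) - 1)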


Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Aasish Kumar
Hi Sandeep,

I have not enabled checkpointing. I will try enabling checkpointing and
observe the memory pattern, but what do you really want to correlate with
checkpointing? I don't know much about checkpointing.
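
For reference, a minimal checkpointing sketch for a direct-Kafka streaming
job; the HDFS path below is hypothetical and the 1-minute batch interval
follows the earlier description:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val checkpointDir = "hdfs:///user/spark/checkpoints/my-streaming-app"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaDirectJob")
  val ssc = new StreamingContext(conf, Minutes(1))
  ssc.checkpoint(checkpointDir)   // enables checkpointing of DStream metadata and state
  // build the direct Kafka stream and the aggregation that writes to Mongo here
  ssc
}

// Recovers from the checkpoint if one exists, otherwise builds a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()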


Thanks and rgds

Aashish Kumar

Software Engineer

Avekshaa Technologies (P) Ltd. | www.avekshaa.com

+91 -9164495083

Performance Excellence Assured

*Deloitte Technology Fast 50 India *|* Technology Fast 500 APAC 2014*

*NASSCOM* Emerge 50, 2013
*Express IT Awards *- IT Innovation: Winner (silver) 2015

*P* *Every 3000 A4 paper costs 1 tree. Please **do not **print unless you
really need it, save environment & energy*

On Tue, Aug 9, 2016 at 5:30 PM, Sandeep Nemuri  wrote:

> Hi Aashish,
>
> Do you have checkpointing enabled ? if not, Can you try enabling
> checkpointing and observe the memory pattern.
>
> Thanks,
> Sandeep
> ᐧ
>
> On Tue, Aug 9, 2016 at 4:25 PM, Mich Talebzadeh  > wrote:
>
>> Hi Aashish,
>>
>> You are running in standalone mode with one node
>>
>> As I read you start master and 5 workers pop up from
>> SPARK_WORKER_INSTANCES=5. I gather you use start-slaves.sh?
>>
>> Now that is the number of workers and low memory on them port 8080
>> should  show practically no memory used (idle). Also every worker has been
>> allocated 1 core SPARK_WORKER_CORE=1
>>
>> Now it all depends how you start your spark-submit job and what
>> parameters you pass to it.
>>
>> ${SPARK_HOME}/bin/spark-submit \
>> --driver-memory 1G \
>> --num-executors 2 \
>> --executor-cores 1 \
>> --executor-memory 1G \
>> --master spark://:7077 \
>>
>> What are your parameters here? From my experience standalone mode has
>> mind of its own and it does not follow what you have asked.
>>
>> If you increase the number of cores for workers, you may reduce the
>> memory issue because effectively multiple tasks can be run on sub-set of
>> your data.
>>
>> HTH
>>
>> P.S. I don't use SPARK_MASTER_OPTS
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 9 August 2016 at 11:21, aasish.kumar 
>> wrote:
>>
>>> Hi,
>>>
>>> I am running spark v 1.6.1 on a single machine in standalone mode, having
>>> 64GB RAM and 16cores.
>>>
>>> I have created five worker instances to create five executor as in
>>> standalone mode, there cannot be more than one executor in one worker
>>> node.
>>>
>>> *Configuration*:
>>>
>>> SPARK_WORKER_INSTANCES 5
>>> SPARK_WORKER_CORE 1
>>> SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
>>>
>>> all other configurations are default in spark_env.sh
>>>
>>> I am running a spark streaming direct kafka job at an interval of 1 min,
>>> which takes data from kafka and after some aggregation write the data to
>>> mongo.
>>>
>>> *Problems:*
>>>
>>> > when I start master and slave, it starts one master process and five
>>> > worker processes. each only consume about 212 MB of ram.when i submit
>>> the
>>> > job , it again creates 5 executor processes and 1 job process and also
>>> the
>>> > memory uses grows to 8GB in total and keeps growing over time (slowly)
>>> > also when there is no data to process.
>>>
>>> I am also unpersisting cached rdd at the end also set spark.cleaner.ttl
>>> to
>>> 600. but still memory is growing.
>>>
>>> > one more thing, I have seen the merged SPARK-1706, then also why i am
>>> > unable to create multiple executor within a worker.and also in
>>> > spark_env.sh file , setting any configuration related to executor comes
>>> > under YARN only mode.
>>>
>>> I have also tried running example program but same problem.
>>>
>>> Any help would be greatly appreciated,
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Spark-Streaming-Job-Keeps-growing-memo
>>> ry-over-time-tp27498.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *  Regards*
> *  Sandeep Nemuri*
>


Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Sandeep Nemuri
Hi Aashish,

Do you have checkpointing enabled ? if not, Can you try enabling
checkpointing and observe the memory pattern.

Thanks,
Sandeep

On Tue, Aug 9, 2016 at 4:25 PM, Mich Talebzadeh 
wrote:

> Hi Aashish,
>
> You are running in standalone mode with one node
>
> As I read you start master and 5 workers pop up from
> SPARK_WORKER_INSTANCES=5. I gather you use start-slaves.sh?
>
> Now that is the number of workers and low memory on them port 8080 should
> show practically no memory used (idle). Also every worker has been
> allocated 1 core SPARK_WORKER_CORE=1
>
> Now it all depends how you start your spark-submit job and what parameters
> you pass to it.
>
> ${SPARK_HOME}/bin/spark-submit \
> --driver-memory 1G \
> --num-executors 2 \
> --executor-cores 1 \
> --executor-memory 1G \
> --master spark://:7077 \
>
> What are your parameters here? From my experience standalone mode has mind
> of its own and it does not follow what you have asked.
>
> If you increase the number of cores for workers, you may reduce the memory
> issue because effectively multiple tasks can be run on sub-set of your data.
>
> HTH
>
> P.S. I don't use SPARK_MASTER_OPTS
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 August 2016 at 11:21, aasish.kumar  wrote:
>
>> Hi,
>>
>> I am running spark v 1.6.1 on a single machine in standalone mode, having
>> 64GB RAM and 16cores.
>>
>> I have created five worker instances to create five executor as in
>> standalone mode, there cannot be more than one executor in one worker
>> node.
>>
>> *Configuration*:
>>
>> SPARK_WORKER_INSTANCES 5
>> SPARK_WORKER_CORE 1
>> SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
>>
>> all other configurations are default in spark_env.sh
>>
>> I am running a spark streaming direct kafka job at an interval of 1 min,
>> which takes data from kafka and after some aggregation write the data to
>> mongo.
>>
>> *Problems:*
>>
>> > when I start master and slave, it starts one master process and five
>> > worker processes. each only consume about 212 MB of ram.when i submit
>> the
>> > job , it again creates 5 executor processes and 1 job process and also
>> the
>> > memory uses grows to 8GB in total and keeps growing over time (slowly)
>> > also when there is no data to process.
>>
>> I am also unpersisting cached rdd at the end also set spark.cleaner.ttl to
>> 600. but still memory is growing.
>>
>> > one more thing, I have seen the merged SPARK-1706, then also why i am
>> > unable to create multiple executor within a worker.and also in
>> > spark_env.sh file , setting any configuration related to executor comes
>> > under YARN only mode.
>>
>> I have also tried running example program but same problem.
>>
>> Any help would be greatly appreciated,
>>
>> Thanks
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-Streaming-Job-Keeps-growing-memo
>> ry-over-time-tp27498.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
*  Regards*
*  Sandeep Nemuri*


Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Mich Talebzadeh
Hi Aashish,

You are running in standalone mode with one node

As I read you start master and 5 workers pop up from
SPARK_WORKER_INSTANCES=5. I gather you use start-slaves.sh?

Now, that is just the number of workers, and with the low memory on them port
8080 should show practically no memory used (idle). Also, every worker has
been allocated 1 core (SPARK_WORKER_CORE=1).

Now it all depends on how you start your spark-submit job and what parameters
you pass to it.

${SPARK_HOME}/bin/spark-submit \
--driver-memory 1G \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1G \
--master spark://:7077 \

What are your parameters here? From my experience standalone mode has a mind
of its own and does not always follow what you have asked for.

If you increase the number of cores for the workers, you may reduce the memory
issue because effectively multiple tasks can be run on subsets of your data.

HTH

P.S. I don't use SPARK_MASTER_OPTS


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 11:21, aasish.kumar  wrote:

> Hi,
>
> I am running spark v 1.6.1 on a single machine in standalone mode, having
> 64GB RAM and 16cores.
>
> I have created five worker instances to create five executor as in
> standalone mode, there cannot be more than one executor in one worker node.
>
> *Configuration*:
>
> SPARK_WORKER_INSTANCES 5
> SPARK_WORKER_CORE 1
> SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
>
> all other configurations are default in spark_env.sh
>
> I am running a spark streaming direct kafka job at an interval of 1 min,
> which takes data from kafka and after some aggregation write the data to
> mongo.
>
> *Problems:*
>
> > when I start master and slave, it starts one master process and five
> > worker processes. each only consume about 212 MB of ram.when i submit the
> > job , it again creates 5 executor processes and 1 job process and also
> the
> > memory uses grows to 8GB in total and keeps growing over time (slowly)
> > also when there is no data to process.
>
> I am also unpersisting cached rdd at the end also set spark.cleaner.ttl to
> 600. but still memory is growing.
>
> > one more thing, I have seen the merged SPARK-1706, then also why i am
> > unable to create multiple executor within a worker.and also in
> > spark_env.sh file , setting any configuration related to executor comes
> > under YARN only mode.
>
> I have also tried running example program but same problem.
>
> Any help would be greatly appreciated,
>
> Thanks
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Streaming-Job-Keeps-growing-
> memory-over-time-tp27498.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Get distinct column data from grouped data

2016-08-09 Thread Selvam Raman
Example:

sel1 test
sel1 test
sel1 ok
sel2 ok
sel2 test


expected result:

sel1, [test,ok]
sel2,[test,ok]

How can I achieve the above result using a Spark DataFrame?

Please suggest an approach.
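
A sketch of one way to do it, assuming the two columns are named "key" and
"value" (collect_set needs a HiveContext on Spark 1.6 and is available
natively from 2.0):

import org.apache.spark.sql.functions._

// collect_set keeps each distinct value once per group.
val grouped = df.groupBy("key")
  .agg(collect_set(col("value")).alias("values"))

grouped.show()
// sel1  [test, ok]
// sel2  [test, ok]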
-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread aasish.kumar
Hi,

I am running Spark v1.6.1 on a single machine in standalone mode, with
64 GB RAM and 16 cores.

I have created five worker instances to get five executors, since in
standalone mode there cannot be more than one executor per worker node.

*Configuration*:

SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORE 1
SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"

all other configurations are default in spark_env.sh

I am running a Spark Streaming direct Kafka job at an interval of 1 min,
which takes data from Kafka and, after some aggregation, writes the data to
Mongo.

*Problems:*

> When I start the master and slaves, it starts one master process and five
> worker processes, each consuming only about 212 MB of RAM. When I submit the
> job, it creates 5 executor processes and 1 job process, and the total memory
> usage grows to 8GB and keeps growing (slowly) over time, even when there is
> no data to process.

I am also unpersisting cached RDDs at the end and have set spark.cleaner.ttl
to 600, but memory is still growing.

> One more thing: I have seen that SPARK-1706 is merged, so why am I unable to
> create multiple executors within a worker? Also, in the spark_env.sh file,
> any configuration related to executors seems to apply only to YARN mode.

I have also tried running the example programs, but I see the same problem.

Any help would be greatly appreciated,

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Job-Keeps-growing-memory-over-time-tp27498.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Cumulative Sum function using Dataset API

2016-08-09 Thread Santoshakhilesh
You could check following link.
http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
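
For reference, the approach in that link boils down to a running-total window.
A sketch with assumed column names ("key" for the partition, "ts" for the
ordering, "value" to accumulate):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Frame from the first row of the partition up to the current row.
val w = Window
  .partitionBy("key")
  .orderBy("ts")
  .rowsBetween(Long.MinValue, 0)

val withCumSum = df.withColumn("cum_sum", sum("value").over(w))

With that rows frame, a pre-sorted column (1, 2, 3, 4) comes out as
(1, 3, 6, 10) within each key.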

From: Jon Barksdale [mailto:jon.barksd...@gmail.com]
Sent: 09 August 2016 08:21
To: ayan guha
Cc: user
Subject: Re: Cumulative Sum function using Dataset API

I don't think that would work properly, and would probably just give me the sum 
for each partition. I'll give it a try when I get home just to be certain.

To maybe explain the intent better, if I have a column (pre sorted) of 
(1,2,3,4), then the cumulative sum would return (1,3,6,10).

Does that make sense? Naturally, if ordering a sum turns it into a cumulative 
sum, I'll gladly use that :)

Jon
On Mon, Aug 8, 2016 at 4:55 PM ayan guha 
> wrote:
You mean you are not able to use sum(col) over (partition by key order by 
some_col) ?

On Tue, Aug 9, 2016 at 9:53 AM, jon 
> wrote:
Hi all,

I'm trying to write a function that calculates a cumulative sum as a column
using the Dataset API, and I'm a little stuck on the implementation.  From
what I can tell, UserDefinedAggregateFunctions don't seem to support
windowing clauses, which I think I need for this use case.  If I write a
function that extends from AggregateWindowFunction, I end up needing classes
that are package private to the sql package, so I need to make my function
under the org.apache.spark.sql package, which just feels wrong.

I've also considered writing a custom transformer, but haven't spent as much
time reading through the code, so I don't know how easy or hard that would
be.

TLDR; What's the best way to write a function that returns a value for every
row, but has mutable state, and receives rows in a specific order?

Does anyone have any ideas, or examples?

Thanks,

Jon




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-Sum-function-using-Dataset-API-tp27496.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



--
Best Regards,
Ayan Guha


Re: coalesce serialising earlier work

2016-08-09 Thread ayan guha
How about running a count step to force Spark to materialise the data frame,
and then repartitioning to 1?
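
A sketch of that idea, using the names from the original mail (not verified on
this job):

// Materialise the filtered/calculated DataFrame once, in parallel, then
// shuffle it down to a single partition just for the write.
df.persist()
df.count()                 // forces the upstream work to run with full parallelism
df.repartition(1)          // one output file, but reached via a shuffle
  .write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(output_path)
df.unpersist()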
On 9 Aug 2016 17:11, "Adrian Bridgett"  wrote:

> In short:  df.coalesce(1).write seems to make all the earlier calculations
> about the dataframe go through a single task (rather than on multiple tasks
> and then the final dataframe to be sent through a single worker).  Any idea
> how we can force the job to run in parallel?
>
> In more detail:
>
> We have a job that we wish to write out as a single CSV file.  We have two
> approaches (code below):
>
> df = (filtering, calculations)
> df.coalesce(num).write.
>   format("com.databricks.spark.csv").
>   option("codec", "org.apache.hadoop.io.compress.GzipCodec").
>   save(output_path)
> Option A: (num=100)
> - dataframe calculated in parallel
> - upload in parallel
> - gzip in parallel
> - but we then have to run "hdfs dfs -getmerge" to download all data and
> then write it back again.
>
> Option B: (num=1)
> - single gzip (but gzip is pretty quick)
> - uploads go through a single machine
> - no HDFS commands
> - dataframe is _not_ calculated in parallel (we can see filters getting
> just a single task)
>
> What I'm not sure is why spark (1.6.1) is deciding to run just a single
> task for the calculation - and what we can do about it?   We really want
> the df to be calculated in parallel and then this is _then_ coalesced
> before being written.  (It may be that the -getmerge approach will still be
> faster)
>
> df.coalesce(100).coalesce(1).write.  doesn't look very likely to help!
>
> Adrian
> --
> *Adrian Bridgett*
>


Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Another follow-up: I have narrowed it down to the first 32 partitions,
but from that point it gets strange.

Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It
must be specified manually;'


Removing *any* of the subdirs from that set removes the error.

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] +
subdirs[i+1:32]))


Here's the punchline: schemas for the first 31 and for the last 31 of
those 32 subdirs are the same:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() ==
spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?
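
In case a workaround is needed while this is being investigated: schema
inference can be bypassed by supplying the schema explicitly. A sketch in
Scala mirroring the Python session above (subdirs is a placeholder for the
list of partition directories):

// Take the schema from one partition that is known to load fine and reuse it
// for the full set of subdirectories, so no inference happens.
val knownSchema = spark.read.parquet(subdirs.head).schema
val df = spark.read.schema(knownSchema).parquet(subdirs: _*)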

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again  wrote:
> Some follow-up information:
>
> - dataset size is ~150G
>
> - the data is partitioned by one of the columns, _locality_code:
> $ ls -1
> _locality_code=AD
> _locality_code=AE
> _locality_code=AF
> _locality_code=AG
> _locality_code=AI
> _locality_code=AL
> _locality_code=AM
> _locality_code=AN
> 
> _locality_code=YE
> _locality_code=YT
> _locality_code=YU
> _locality_code=ZA
> _locality_code=ZM
> _locality_code=ZW
> _SUCCESS
>
> - some of the partitions contain only one row, but all partitions are
> in place (ie number of directories matches number of distinct
> localities
> val counts = 
> sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()
>
> scala> counts.slice(counts.length-10, counts.length)
> res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
> [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
> [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])
>
> scala> counts.slice(0, 10)
> res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
> [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
>
>
> On Tue, Aug 9, 2016 at 11:10 AM, immerrr again  wrote:
>> Hi everyone
>>
>> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but run into an issue
>> reading the existing data. Here's how the traceback looks in
>> spark-shell:
>>
>> scala> spark.read.parquet("/path/to/data")
>> org.apache.spark.sql.AnalysisException: Unable to infer schema for
>> ParquetFormat at /path/to/data. It must be specified manually;
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>>   ... 48 elided
>>
>> If I enable DEBUG log with sc.setLogLevel("DEBUG"), here's what I
>> additionally see in the output:
>> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of
>> course, that same data is read and processed by spark-1.6.2 correctly.
>>
>> Any idea what might be wrong here?
>>
>> Cheers,
>> immerrr

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
maybe this problem is not so easy to understand, so I attached my full code.
Hope this could help in solving the problem.



 

Thanks
Best regards!
San.Luo

- Original Message -
From:
To: "user" 
Subject: saving DF to HDFS in parquet format very slow in SparkSQL app
Date: 2016-08-09 15:34

Hi there,
I have a problem: saving a DF to HDFS in parquet format is very slow. I have
attached a pic which shows that a lot of time is spent in getting the result.
The code is:
streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData")
I don't quite understand why my app is so slow in getting the result. I tried
to access my HDFS while the app was running slowly, and HDFS is OK.
Any idea will be appreciated.




 

Thanks
Best regards!
San.Luo


DataExtractor.scala
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: SparkR error when repartition is called

2016-08-09 Thread Felix Cheung
I think it's saying a string isn't being sent properly from the JVM side.

Does it work for you if you change the dapply UDF to something simpler?

Do you have any log from YARN?


_
From: Shane Lee 
>
Sent: Tuesday, August 9, 2016 12:19 AM
Subject: Re: SparkR error when repartition is called
To: Sun Rui >
Cc: User >


Sun,

I am using spark in yarn client mode in a 2-node cluster with hadoop-2.7.2. My 
R version is 3.3.1.

I have the following in my spark-defaults.conf:
spark.executor.extraJavaOptions =-XX:+PrintGCDetails 
-XX:+HeapDumpOnOutOfMemoryError
spark.r.command=c:/R/R-3.3.1/bin/x64/Rscript
spark.ui.killEnabled=true
spark.executor.instances = 3
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.file.buffer = 1m
spark.driver.maxResultSize=0
spark.executor.memory=8g
spark.executor.cores = 6

I also ran into some other R errors that I was able to bypass by modifying the 
worker.R file (attached). In a nutshell I was getting the "argument is length 
of zero" error sporadically so I put in extra checks for it.

Thanks,

Shane

On Monday, August 8, 2016 11:53 PM, Sun Rui 
> wrote:


I can't reproduce your issue with len=1 in local mode.
Could you give more environment information?
On Aug 9, 2016, at 11:35, Shane Lee 
> wrote:

Hi All,

I am trying out SparkR 2.0 and have run into an issue with repartition.

Here is the R code (essentially a port of the pi-calculating scala example in 
the spark package) that can reproduce the behavior:

schema <- structType(structField("input", "integer"),
structField("output", "integer"))

library(magrittr)

len = 3000
data.frame(n = 1:len) %>%
as.DataFrame %>%
SparkR:::repartition(10L) %>%
dapply(., function (df)
{
library(plyr)
ddply(df, .(n), function (y)
{
data.frame(z =
{
x1 = runif(1) * 2 - 1
y1 = runif(1) * 2 - 1
z = x1 * x1 + y1 * y1
if (z < 1)
{
1L
}
else
{
0L
}
})
})
}
, schema
) %>%
SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len

For me it runs fine as long as len is less than 5000, otherwise it errors out 
with the following message:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
stage 56.0 failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 
(TID 899, LARBGDV-VM02): org.apache.spark.SparkException: R computation failed 
with
 Error in readBin(con, raw(), stringLen, endian = "big") :
  invalid 'n' argument
Calls:  -> readBin
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
at 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$

If the repartition call is removed, it runs fine again, even with very large 
len.

After looking through the documentations and searching the web, I can't seem to 
find any clues how to fix this. Anybody has seen similary problem?

Thanks in advance for your help.

Shane








Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:
$ ls -1
_locality_code=AD
_locality_code=AE
_locality_code=AF
_locality_code=AG
_locality_code=AI
_locality_code=AL
_locality_code=AM
_locality_code=AN

_locality_code=YE
_locality_code=YT
_locality_code=YU
_locality_code=ZA
_locality_code=ZM
_locality_code=ZW
_SUCCESS

- some of the partitions contain only one row, but all partitions are
in place (i.e. the number of directories matches the number of distinct
localities):
val counts = 
sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()

scala> counts.slice(counts.length-10, counts.length)
res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
[AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
[UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

scala> counts.slice(0, 10)
res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
[WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])


On Tue, Aug 9, 2016 at 11:10 AM, immerrr again  wrote:
> Hi everyone
>
> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but run into an issue
> reading the existing data. Here's how the traceback looks in
> spark-shell:
>
> scala> spark.read.parquet("/path/to/data")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for
> ParquetFormat at /path/to/data. It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>   ... 48 elided
>
> If I enable DEBUG log with sc.setLogLevel("DEBUG"), here's what I
> additionally see in the output:
> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of
> course, that same data is read and processed by spark-1.6.2 correctly.
>
> Any idea what might be wrong here?
>
> Cheers,
> immerrr

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark Job Doesn't End on Mesos

2016-08-09 Thread Todd Leo
Hi,

I’m running Spark jobs on Mesos. When the job finishes, *SparkContext* is
manually closed by sc.stop(). Then Mesos log shows:

I0809 15:48:34.132014 11020 sched.cpp:1589] Asked to stop the driver
I0809 15:48:34.132181 11277 sched.cpp:831] Stopping framework
'20160808-170425-2365980426-5050-4372-0034'

However, the process doesn't actually exit. This is critical, because I'd
like to use SparkLauncher to submit such jobs. If my jobs don't end, they
will pile up and fill up the memory. Please help. :-|

—
BR,
Todd Leo
​


RE: bisecting kmeans model tree

2016-08-09 Thread Huang, Qian
There seems to be an existing JIRA for this.
https://issues.apache.org/jira/browse/SPARK-11664

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Saturday, July 16, 2016 6:18 PM
To: roni 
Cc: user@spark.apache.org
Subject: Re: bisecting kmeans model tree

Currently we do not expose the APIs to get the Bisecting KMeans tree
structure; they are private to the ml.clustering package scope.
But I think we should make a plan to expose these APIs, as we did for
Decision Tree.

Thanks
Yanbo

2016-07-12 11:45 GMT-07:00 roni 
>:
Hi Spark/MLlib experts,
Can anyone shed some light on this?
Thanks
_R

On Thu, Apr 21, 2016 at 12:46 PM, roni 
> wrote:
Hi,
I want to get the bisecting k-means tree structure to show a dendrogram on the
heatmap I am generating based on the hierarchical clustering of data.
How do I get that using MLlib?
Thanks
-Roni




Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Hi everyone

I tried upgrading Spark-1.6.2 to Spark-2.0.0 but run into an issue
reading the existing data. Here's how the traceback looks in
spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for
ParquetFormat at /path/to/data. It must be specified manually;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
  ... 48 elided

If I enable DEBUG log with sc.setLogLevel("DEBUG"), here's what I
additionally see in the output:
https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of
course, that same data is read and processed by spark-1.6.2 correctly.

Any idea what might be wrong here?

Cheers,
immerrr

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Hi there,
I have a problem: saving a DF to HDFS in parquet format is very slow. I have
attached a pic which shows that a lot of time is spent in getting the result.
The code is:
streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData")
I don't quite understand why my app is so slow in getting the result. I tried
to access my HDFS while the app was running slowly, and HDFS is OK.
Any idea will be appreciated.




 

Thanks
Best regards!
San.Luo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-09 Thread Sean Owen
Fewer features doesn't necessarily mean better predictions, because indeed
you are subtracting data. It might, because when done well you subtract
more noise than signal. It is usually done to make data sets smaller or
more tractable or to improve explainability.

But you have an unsupervised clustering problem where talking about feature
importance doesn't make as much sense. Important to what? There is no target
variable.

PCA will not 'improve' clustering per se but can make it faster.
You may want to specify what you are actually trying to optimize.
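
If a lower-dimensional representation (rather than a subset of the original
factors) is acceptable, the spark.ml PCA transformer is one way to get it.
A sketch, assuming a DataFrame df with the 112 factors already assembled into
a Vector column called "features":

import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(20)                  // keep the 20 directions of highest variance
  .fit(df)

val reduced = pca.transform(df)   // cluster on "pcaFeatures" instead of "features"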

On Tue, Aug 9, 2016, 03:23 Rohit Chaddha  wrote:

> I would rather have less features to make better inferences on the data
> based on the smaller number of factors,
> Any suggestions Sean ?
>
> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen  wrote:
>
>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>> really want to select features or just obtain a lower-dimensional
>> representation of them, with less redundancy?
>>
>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane  wrote:
>> > There must be an algorithmic way to figure out which of these factors
>> > contribute the least and remove them in the analysis.
>> > I am hoping same one can throw some insight on this.
>> >
>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S 
>> wrote:
>> >>
>> >> Not an expert here, but the first step would be devote some time and
>> >> identify which of these 112 factors are actually causative. Some domain
>> >> knowledge of the data may be required. Then, you can start of with PCA.
>> >>
>> >> HTH,
>> >>
>> >> Regards,
>> >>
>> >> Sivakumaran S
>> >>
>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane  wrote:
>> >>
>> >> Great question Rohit.  I am in my early days of ML as well and it
>> would be
>> >> great if we get some idea on this from other experts on this group.
>> >>
>> >> I know we can reduce dimensions by using PCA, but i think that does not
>> >> allow us to understand which factors from the original are we using in
>> the
>> >> end.
>> >>
>> >> - Tony L.
>> >>
>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <
>> rohitchaddha1...@gmail.com>
>> >> wrote:
>> >>>
>> >>>
>> >>> I have a data-set where each data-point has 112 factors.
>> >>>
>> >>> I want to remove the factors which are not relevant, and say reduce
>> to 20
>> >>> factors out of these 112 and then do clustering of data-points using
>> these
>> >>> 20 factors.
>> >>>
>> >>> How do I do these and how do I figure out which of the 20 factors are
>> >>> useful for analysis.
>> >>>
>> >>> I see SVD and PCA implementations, but I am not sure if these give
>> which
>> >>> elements are removed and which are remaining.
>> >>>
>> >>> Can someone please help me understand what to do here
>> >>>
>> >>> thanks,
>> >>> -Rohit
>> >>>
>> >>
>> >>
>> >
>>
>
>


Re: SparkR error when repartition is called

2016-08-09 Thread Shane Lee
Sun,
I am using spark in yarn client mode in a 2-node cluster with hadoop-2.7.2. My 
R version is 3.3.1.
I have the following in my spark-defaults.conf:
spark.executor.extraJavaOptions = -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError
spark.r.command=c:/R/R-3.3.1/bin/x64/Rscript
spark.ui.killEnabled=true
spark.executor.instances = 3
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.file.buffer = 1m
spark.driver.maxResultSize=0
spark.executor.memory=8g
spark.executor.cores = 6
I also ran into some other R errors that I was able to bypass by modifying the 
worker.R file (attached). In a nutshell I was getting the "argument is length 
of zero" error sporadically so I put in extra checks for it.
Thanks,
Shane
On Monday, August 8, 2016 11:53 PM, Sun Rui  wrote:
 

I can’t reproduce your issue with len=1 in local mode. Could you give more
environment information?

On Aug 9, 2016, at 11:35, Shane Lee  wrote:
Hi All,
I am trying out SparkR 2.0 and have run into an issue with repartition. 
Here is the R code (essentially a port of the pi-calculating scala example in 
the spark package) that can reproduce the behavior:
schema <- structType(structField("input", "integer"),
                     structField("output", "integer"))

library(magrittr)

len = 3000
data.frame(n = 1:len) %>%
  as.DataFrame %>%
  SparkR:::repartition(10L) %>%
  dapply(., function (df)
  {
    library(plyr)
    ddply(df, .(n), function (y)
    {
      data.frame(z =
      {
        x1 = runif(1) * 2 - 1
        y1 = runif(1) * 2 - 1
        z = x1 * x1 + y1 * y1
        if (z < 1) { 1L } else { 0L }
      })
    })
  }
  , schema
  ) %>%
  SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len
For me it runs fine as long as len is less than 5000, otherwise it errors out 
with the following message:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in
stage 56.0 failed 4 times, most recent failure: Lost task 6.3 in stage 56.0
(TID 899, LARBGDV-VM02): org.apache.spark.SparkException: R computation failed
with
 Error in readBin(con, raw(), stringLen, endian = "big") :
  invalid 'n' argument
Calls:  -> readBin
Execution halted
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
If the repartition call is removed, it runs fine again, even with very large 
len.
After looking through the documentations and searching the web, I can't seem to 
find any clues how to fix this. Anybody has seen similary problem?
Thanks in advance for your help.
Shane




  

worker.R
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

coalesce serialising earlier work

2016-08-09 Thread Adrian Bridgett
In short: df.coalesce(1).write seems to make all the earlier calculations
about the dataframe go through a single task (rather than running on multiple
tasks and then sending the final dataframe through a single worker). Any idea
how we can force the job to run in parallel?


In more detail:

We have a job that we wish to write out as a single CSV file.  We have 
two approaches (code below):


df = (filtering, calculations)
df.coalesce(num).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)

Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all data and 
then write it back again.


Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see filters getting 
just a single task)


What I'm not sure about is why Spark (1.6.1) decides to run just a single
task for the calculation - and what we can do about it? We really want the df
to be calculated in parallel and only _then_ coalesced before being written.
(It may be that the -getmerge approach will still be faster.)


df.coalesce(100).coalesce(1).write.  doesn't look very likely to help!
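
One thing that may be worth trying (a sketch, not verified on this job):
repartition(1) forces a shuffle, so the filtering and calculations upstream
keep their parallelism and only the final write runs as a single task, whereas
coalesce(1) avoids the shuffle and can pull the whole lineage into one task:

df.repartition(1).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)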

Adrian

--
*Adrian Bridgett*


Re: SparkR error when repartition is called

2016-08-09 Thread Sun Rui
I can’t reproduce your issue with len=1 in local mode.
Could you give more environment information?
> On Aug 9, 2016, at 11:35, Shane Lee  wrote:
> 
> Hi All,
> 
> I am trying out SparkR 2.0 and have run into an issue with repartition. 
> 
> Here is the R code (essentially a port of the pi-calculating scala example in 
> the spark package) that can reproduce the behavior:
> 
> schema <- structType(structField("input", "integer"), 
> structField("output", "integer"))
> 
> library(magrittr)
> 
> len = 3000
> data.frame(n = 1:len) %>%
> as.DataFrame %>%
> SparkR:::repartition(10L) %>%
>   dapply(., function (df)
>   {
>   library(plyr)
>   ddply(df, .(n), function (y)
>   {
>   data.frame(z = 
>   {
>   x1 = runif(1) * 2 - 1
>   y1 = runif(1) * 2 - 1
>   z = x1 * x1 + y1 * y1
>   if (z < 1)
>   {
>   1L
>   }
>   else
>   {
>   0L
>   }
>   })
>   })
>   }
>   , schema
>   ) %>% 
>   SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len
> 
> For me it runs fine as long as len is less than 5000, otherwise it errors out 
> with the following message:
> 
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 
> in stage 56.0 failed 4 times, most recent failure: Lost task 6.3 in stage 
> 56.0 (TID 899, LARBGDV-VM02): org.apache.spark.SparkException: R computation 
> failed with
>  Error in readBin(con, raw(), stringLen, endian = "big") : 
>   invalid 'n' argument
> Calls:  -> readBin
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
> 
> If the repartition call is removed, it runs fine again, even with very large 
> len.
> 
> After looking through the documentations and searching the web, I can't seem 
> to find any clues how to fix this. Anybody has seen similary problem?
> 
> Thanks in advance for your help.
> 
> Shane
> 



How to get the parameters of bestmodel while using paramgrid and crossvalidator?

2016-08-09 Thread colin
I'm using CrossValidator and a param grid to find the best parameters of my
model.
After cross-validation I get a CrossValidatorModel, but when I want to get the
parameters of my pipeline, there's nothing in the parameter list of the
bestModel.

Here's the code runing on jupyter notebook:
sq=SQLContext(sc)
df=sq.createDataFrame([(1,0,1),(2,0,2),(1,0,1),(2,1,2),(1,1,1),(2,1,2),(3,1,1),(2,1,2),(1,1,2)],["a","b","c"])
lableIndexer=StringIndexer(inputCol="c",outputCol="label").fit(df)
vecAssembler=VectorAssembler(inputCols=["a","b"],outputCol="featureVector")
lr=LogisticRegression(featuresCol="featureVector",labelCol="label")
pip=Pipeline()
pip.setStages([lableIndexer,vecAssembler,lr])
grid=ParamGridBuilder().addGrid(lr.maxIter,[10,20,30,40]).build()
evaluator=MulticlassClassificationEvaluator()
cv=CrossValidator(estimator=pip,estimatorParamMaps=grid,evaluator=evaluator,numFolds=3)
cvModel=cv.fit(df)
cvPrediction=cvModel.transform(df)
cvPrediction
0.66
cvModel.bestModel.params
[]

There's no error, so what's wrong?
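
One way to see which candidate won is to pair each ParamMap in the grid with
its cross-validated metric. A sketch using the Scala API (whether the same
avgMetrics attribute is exposed on the PySpark model depends on the version,
so treat this as an illustration of the idea):

// cv is the CrossValidator, cvModel the fitted CrossValidatorModel.
// avgMetrics holds one averaged metric per ParamMap, in grid order.
val (bestParams, bestMetric) = cv.getEstimatorParamMaps
  .zip(cvModel.avgMetrics)
  .maxBy(_._2)              // MulticlassClassificationEvaluator: larger is better

println(bestParams)         // e.g. the maxIter value that won
println(bestMetric)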



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-parameters-of-bestmodel-while-using-paramgrid-and-crossvalidator-tp27497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org