Re: Not per-key state in spark streaming

2016-12-08 Thread Anty Rao
Thank you very much for your reply , Daniel

On Thu, Dec 8, 2016 at 7:07 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> There's no need to extend Spark's API, look at mapWithState for examples.
>
> On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao  wrote:
>
>>
>>
>> On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao  wrote:
>>
>>> Hi,
>>> I'm new to Spark. I'm doing some research to see if Spark Streaming can
>>> solve my problem. I don't want to keep per-key state, because my data set
>>> is very large and is retained for a fairly long time, so it is not viable to
>>> keep all per-key state in memory. Instead, I want to have a Bloom-filter-based
>>> state. Is it possible to achieve this in Spark Streaming?
>>>
>>> Is it possible to achieve this by extending Spark API?
>>
>>> --
>>> Anty Rao
>>>
>>
>>
>>
>> --
>> Anty Rao
>>
>
>


-- 
Anty Rao
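
A minimal sketch of the mapWithState direction mentioned above, with an immutable BitSet standing in for a real Bloom filter; the socket source, bucketing key scheme, and hash sizing are illustrative assumptions only:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

import scala.collection.immutable.BitSet

val conf = new SparkConf().setMaster("local[2]").setAppName("bloomStateSketch")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")                        // stateful operations need a checkpoint dir

// hypothetical input: bucket each record under a small number of coarse keys,
// so state is kept per bucket rather than per element
val events = ssc.socketTextStream("localhost", 9999)
  .map(line => (math.abs(line.hashCode) % 16, line))

// one compact membership structure per bucket instead of full per-key state
val spec = StateSpec.function((key: Int, value: Option[String], state: State[BitSet]) => {
  val filter = state.getOption().getOrElse(BitSet.empty)
  val hash = value.map(v => math.abs(v.hashCode) % 100000)  // stand-in for Bloom filter hashing
  val seenBefore = hash.exists(filter.contains)
  hash.foreach(h => state.update(filter + h))
  (value, seenBefore)                                       // emit whether the element was (probably) seen
})

val flagged = events.mapWithState(spec)
flagged.print()

ssc.start()
ssc.awaitTermination()
```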


reading data from s3

2016-12-08 Thread Hitesh Goyal
Hi team,
I want to read the text file from s3. I am doing it using DataFrame. Like 
below:-
DataFrame d=sql.read().text("s3://my_first_text_file.txt");
  d.registerTempTable("table1");
  DataFrame d1=sql.sql("Select * from table1");
  d1.printSchema();
  d1.show();

But it is not registering the text file as a temp table, so I can't make SQL
queries on it. Can't I do this with a text file? Or, if I can, please suggest a
way to do it.
If I try the same thing with a JSON file, it is successful.

Regards,
Hitesh Goyal
Simpli5d Technologies
Cont No.: 9996588220
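
A minimal Scala sketch of one way this is commonly handled (not a diagnosis of the specific failure above): `read.text` yields a DataFrame with a single string column named `value`, so unlike JSON no schema is inferred and the columns have to be derived by hand before querying. The bucket path and the two-field split are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("textToTable").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// read.text returns a DataFrame with one string column called "value"
val lines = sqlContext.read.text("s3://some-bucket/my_first_text_file.txt")  // hypothetical path

// split each line into fields and name the columns explicitly
val parsed = lines
  .map(_.getString(0).split(","))
  .map(a => (a(0), a(1)))          // assumes two comma-separated fields per line
  .toDF("col1", "col2")

parsed.registerTempTable("table1")
sqlContext.sql("SELECT * FROM table1").show()
```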



Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely because HashingTF returns ml vectors while you need mllib
vectors for RowMatrix.

I'd recommend using the vector conversion utils (I think in
mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly).
There are utility methods for converting single vectors as well as vector
columns of a DataFrame. Check the MLlib user guide for 2.0 for details.
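
A small sketch of that conversion (assuming Spark 2.0; the hand-built DataFrame below stands in for the HashingTF/IDF output shown in the quoted code). `mllib.linalg.Vectors.fromML` converts a single spark.ml vector; for whole columns there is also `MLUtils.convertVectorColumnsFromML`.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("vectorConversion").getOrCreate()

// stand-in for the "features" column produced by HashingTF/IDF (spark.ml vectors)
val featurized = spark.createDataFrame(Seq(
  (0, MLVectors.sparse(4, Seq((0, 1.0), (2, 3.0)))),
  (1, MLVectors.dense(0.0, 2.0, 0.0, 4.0))
)).toDF("label", "features")

// convert each spark.ml vector to a spark.mllib vector before building the RowMatrix
val rows = featurized.select("features").rdd
  .map(_.getAs[MLVector]("features"))
  .map(OldVectors.fromML)

val mat = new RowMatrix(rows)
println(s"RowMatrix of ${mat.numRows()} x ${mat.numCols()}")
```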
On Fri, 9 Dec 2016 at 04:42, satyajit vegesna 
wrote:

> Hi All,
>
> PFB code.
>
>
> import org.apache.spark.ml.feature.{HashingTF, IDF}
> import org.apache.spark.ml.linalg.SparseVector
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.{SparkConf, SparkContext}
>
> /**
>   * Created by satyajit on 12/7/16.
>   */
> object DIMSUMusingtf extends App {
>
>   val conf = new SparkConf()
> .setMaster("local[1]")
> .setAppName("testColsim")
>   val sc = new SparkContext(conf)
>   val spark = SparkSession
> .builder
> .appName("testColSim").getOrCreate()
>
>   import org.apache.spark.ml.feature.Tokenizer
>
>   val sentenceData = spark.createDataFrame(Seq(
> (0, "Hi I heard about Spark"),
> (0, "I wish Java could use case classes"),
> (1, "Logistic regression models are neat")
>   )).toDF("label", "sentence")
>
>   val tokenizer = new 
> Tokenizer().setInputCol("sentence").setOutputCol("words")
>
>   val wordsData = tokenizer.transform(sentenceData)
>
>
>   val hashingTF = new HashingTF()
> .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
>
>   val featurizedData = hashingTF.transform(wordsData)
>
>
>   val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
>   val idfModel = idf.fit(featurizedData)
>   val rescaledData = idfModel.transform(featurizedData)
>   rescaledData.show()
>   rescaledData.select("features", "label").take(3).foreach(println)
>   val check = rescaledData.select("features")
>
>   val row = check.rdd.map(row => row.getAs[SparseVector]("features"))
>
>   val mat = new RowMatrix(row) // I am basically trying to use a DenseVector
>   // as a direct input to RowMatrix, but I get an error that the RowMatrix
>   // constructor cannot be resolved
>
>   row.foreach(println)
> }
>
> Any help would be appreciated.
>
> Regards,
> Satyajit.
>
>
>
>


Re: how can I set the log configuration file for spark history server ?

2016-12-08 Thread Don Drake
You can update $SPARK_HOME/spark-env.sh by setting the environment
variable SPARK_HISTORY_OPTS.

See
http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options
for options (spark.history.fs.logDirectory) you can set.

There is log rotation built into the history server (by time, not size); you
need to enable and configure it.

Hope that helps.

-Don

On Thu, Dec 8, 2016 at 9:20 PM, John Fang 
wrote:

> ./start-history-server.sh
> starting org.apache.spark.deploy.history.HistoryServer,
> logging to /home/admin/koala/data/versions/0/SPARK/2.0.2/
> spark-2.0.2-bin-hadoop2.6/logs/spark-admin-org.apache.
> spark.deploy.history.HistoryServer-1-v069166214.sqa.zmf.out
>
> Then the history server will print all logs to XXX.sqa.zmf.out, so I can't
> limit the maximum file size. I want to limit the size of the log file.
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


how can I set the log configuration file for spark history server ?

2016-12-08 Thread John Fang
./start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to 
/home/admin/koala/data/versions/0/SPARK/2.0.2/spark-2.0.2-bin-hadoop2.6/logs/spark-admin-org.apache.spark.deploy.history.HistoryServer-1-v069166214.sqa.zmf.out
Then the history server will print all logs to XXX.sqa.zmf.out, so I can't limit
the maximum file size. I want to limit the size of the log file.

flatmap pair

2016-12-08 Thread im281

The class 'Detector' has a function 'detectFeature(cluster)'.
However, the method has changed to return a list of features, as opposed to
the single feature it returns below.
How do I change this so it returns a list of feature objects instead?


// creates key-value pairs for Isotope cluster ID and Isotope cluster
// groups them by keys and passes
// each collection of isotopes to the feature detector
JavaPairRDD features =
scans.flatMapToPair(keyData).groupByKey().mapValues(values -> {

System.out.println("Inside Feature Detection Reduce !");
Detector d = new Detector();
ArrayList clusters = 
(ArrayList)
StreamSupport
.stream(values.spliterator(),
false).map(d::GetClustersfromKey).collect(Collectors.toList());
return d.WriteFeatureValue(d.DetectFeature(clusters));
});
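
A generic Scala sketch of the usual pattern when a per-key function starts returning many values instead of one (an illustration with made-up stand-in types, not the poster's Detector code): `mapValues` keeps a one-to-one shape, while `flatMapValues` flattens a returned collection into multiple output records. The Java API has the same pair of methods on `JavaPairRDD`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("flatMapValuesSketch").setMaster("local[*]"))

// hypothetical stand-ins for isotope clusters keyed by cluster ID
val clustersById = sc.parallelize(Seq((1, "clusterA"), (1, "clusterB"), (2, "clusterC")))

val featuresById = clustersById
  .groupByKey()
  .flatMapValues { clusters =>
    // a detector that returns a *list* of features per group of clusters
    clusters.map(c => s"feature-of-$c")
  }

// each (key, feature) pair becomes its own record; ordering may vary
featuresById.collect().foreach(println)
```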



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/flatmap-pair-tp28187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread satyajit vegesna
Hi All,

PFB code.


import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by satyajit on 12/7/16.
  */
object DIMSUMusingtf extends App {

  val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("testColsim")
  val sc = new SparkContext(conf)
  val spark = SparkSession
.builder
.appName("testColSim").getOrCreate()

  import org.apache.spark.ml.feature.Tokenizer

  val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
  )).toDF("label", "sentence")

  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

  val wordsData = tokenizer.transform(sentenceData)


  val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

  val featurizedData = hashingTF.transform(wordsData)


  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show()
  rescaledData.select("features", "label").take(3).foreach(println)
  val check = rescaledData.select("features")

  val row = check.rdd.map(row => row.getAs[SparseVector]("features"))

  val mat = new RowMatrix(row) // I am basically trying to use a DenseVector
  // as a direct input to RowMatrix, but I get an error that the RowMatrix
  // constructor cannot be resolved

  row.foreach(println)
}

Any help would be appreciated.

Regards,
Satyajit.


Re: unit testing in spark

2016-12-08 Thread Miguel Morales
Sure, I'd love to participate.  Being new to Scala, things like dependency 
injection are still a bit iffy.  Would love to exchange ideas.

Sent from my iPhone

> On Dec 8, 2016, at 4:29 PM, Holden Karau  wrote:
> 
> Maybe diverging a bit from the original question - but would it maybe make 
> sense for those of us that all care about testing to try and do a hangout at 
> some point so that we can exchange ideas?
> 
>> On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales  
>> wrote:
>> I would be interested in contributing.  I've created my own library for this 
>> as well.  In my blog post I talk about testing with Spark in RSpec style: 
>> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941
>> 
>> Sent from my iPhone
>> 
>>> On Dec 8, 2016, at 4:09 PM, Holden Karau  wrote:
>>> 
>>> There are also libraries designed to simplify testing Spark in the various 
>>> platforms, spark-testing-base for Scala/Java/Python (& video 
>>> https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused 
>>> property based), pyspark.test (python focused with py.test instead of 
>>> unittest2) (& blog post from nextdoor 
>>> https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>>>  )
>>> 
>>> Good luck on your Spark Adventures :)
>>> 
>>> P.S.
>>> 
>>> If anyone is interested in helping improve spark testing libraries I'm 
>>> always looking for more people to be involved with spark-testing-base 
>>> because I'm lazy :p
>>> 
 On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson  wrote:
 I wrote some advice in a previous post on the list:
 http://markmail.org/message/bbs5acrnksjxsrrs
 
 It does not mention python, but the strategy advice is the same. Just
 replace JUnit/Scalatest with pytest, unittest, or your favourite
 python test framework.
 
 
 I recently held a presentation on the subject. There is a video
 recording at https://vimeo.com/192429554 and slides at
 http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458
 
 You can find more material on test strategies at
 http://www.mapflat.com/lands/resources/reading-list/index.html
 
 
 
 
 Lars Albertsson
 Data engineering consultant
 www.mapflat.com
 https://twitter.com/lalleal
 +46 70 7687109
 Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
 
 
 On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp  
 wrote:
 > Can someone tell me how I can write unit tests for PySpark?
 > (book, tutorial ...)
 
 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 
>>> 
>>> 
>>> 
>>> -- 
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Holden Karau
Maybe diverging a bit from the original question - but would it maybe make
sense for those of us that all care about testing to try and do a hangout
at some point so that we can exchange ideas?

On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales 
wrote:

> I would be interested in contributing.  I've created my own library for
> this as well.  In my blog post I talk about testing with Spark in RSpec
> style:
> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-
> 746082b44941
>
> Sent from my iPhone
>
> On Dec 8, 2016, at 4:09 PM, Holden Karau  wrote:
>
> There are also libraries designed to simplify testing Spark in the various
> platforms, spark-testing-base
>  for Scala/Java/Python (&
> video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
>  (scala focused property based),
> pyspark.test (python focused with py.test instead of unittest2) (& blog
> post from nextdoor https://engblog.nextdoor.com/unit-testing-
> apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9 )
>
> Good luck on your Spark Adventures :)
>
> P.S.
>
> If anyone is interested in helping improve spark testing libraries I'm
> always looking for more people to be involved with spark-testing-base
> because I'm lazy :p
>
> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson  wrote:
>
>> I wrote some advice in a previous post on the list:
>> http://markmail.org/message/bbs5acrnksjxsrrs
>>
>> It does not mention python, but the strategy advice is the same. Just
>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>> python test framework.
>>
>>
>> I recently held a presentation on the subject. There is a video
>> recording at https://vimeo.com/192429554 and slides at
>> http://www.slideshare.net/lallea/test-strategies-for-data-
>> processing-pipelines-67244458
>>
>> You can find more material on test strategies at
>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>
>>
>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp 
>> wrote:
>> > Can someone tell me how I can write unit tests for PySpark?
>> > (book, tutorial ...)
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Miguel Morales
I would be interested in contributing.  I've created my own library for this as 
well.  In my blog post I talk about testing with Spark in RSpec style: 
https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941

Sent from my iPhone

> On Dec 8, 2016, at 4:09 PM, Holden Karau  wrote:
> 
> There are also libraries designed to simplify testing Spark in the various 
> platforms, spark-testing-base for Scala/Java/Python (& video 
> https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused property 
> based), pyspark.test (python focused with py.test instead of unittest2) (& 
> blog post from nextdoor 
> https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>  )
> 
> Good luck on your Spark Adventures :)
> 
> P.S.
> 
> If anyone is interested in helping improve spark testing libraries I'm always 
> looking for more people to be involved with spark-testing-base because I'm 
> lazy :p
> 
>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson  wrote:
>> I wrote some advice in a previous post on the list:
>> http://markmail.org/message/bbs5acrnksjxsrrs
>> 
>> It does not mention python, but the strategy advice is the same. Just
>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>> python test framework.
>> 
>> 
>> I recently held a presentation on the subject. There is a video
>> recording at https://vimeo.com/192429554 and slides at
>> http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458
>> 
>> You can find more material on test strategies at
>> http://www.mapflat.com/lands/resources/reading-list/index.html
>> 
>> 
>> 
>> 
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>> 
>> 
>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp  wrote:
>> > Can someone tell me how I can write unit tests for PySpark?
>> > (book, tutorial ...)
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Holden Karau
There are also libraries designed to simplify testing Spark in the various
platforms, spark-testing-base 
for Scala/Java/Python (& video https://www.youtube.com/watch?v=f69gSGSLGrY),
sscheck  (scala focused property based),
pyspark.test (python focused with py.test instead of unittest2) (& blog
post from nextdoor
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
 )

Good luck on your Spark Adventures :)

P.S.

If anyone is interested in helping improve spark testing libraries I'm
always looking for more people to be involved with spark-testing-base
because I'm lazy :p

On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson  wrote:

> I wrote some advice in a previous post on the list:
> http://markmail.org/message/bbs5acrnksjxsrrs
>
> It does not mention python, but the strategy advice is the same. Just
> replace JUnit/Scalatest with pytest, unittest, or your favourite
> python test framework.
>
>
> I recently held a presentation on the subject. There is a video
> recording at https://vimeo.com/192429554 and slides at
> http://www.slideshare.net/lallea/test-strategies-for-
> data-processing-pipelines-67244458
>
> You can find more material on test strategies at
> http://www.mapflat.com/lands/resources/reading-list/index.html
>
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp 
> wrote:
> > Can someone tell me how I can write unit tests for PySpark?
> > (book, tutorial ...)
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
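
A minimal sketch of a self-contained Spark test in the ScalaTest style discussed in this thread (plain ScalaTest with a local[*] session rather than spark-testing-base; assumes Spark 2.x and ScalaTest on the test classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {

  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
  }

  override def afterAll(): Unit = {
    spark.stop()
  }

  test("word count of a small in-memory dataset") {
    val counts = spark.sparkContext
      .parallelize(Seq("a b", "b"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()

    assert(counts == Map("a" -> 1, "b" -> 2))
  }
}
```

The same shape carries over to PySpark with pytest: build a local SparkSession in a fixture, run the job on a tiny in-memory dataset, and assert on the collected results.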


Fwd: Question about SPARK-11374 (skip.header.line.count)

2016-12-08 Thread Dongjoon Hyun
+dev

I forgot to add @user.

Dongjoon.

-- Forwarded message -
From: Dongjoon Hyun 
Date: Thu, Dec 8, 2016 at 16:00
Subject: Question about SPARK-11374 (skip.header.line.count)
To: 


Hi, All.



Could you give me some opinion?



There is an old SPARK issue, SPARK-11374, about removing header lines from
text file.

Currently, Spark supports removing CSV header lines by the following way.



```
scala> spark.read.option("header","true").csv("/data").show
+---+---+
| c1| c2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
```



In SQL world, we can support that like the Hive way,
`skip.header.line.count`.



```
scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data' TBLPROPERTIES('skip.header.line.count'='1')")

scala> sql("SELECT * FROM t1").show
+---+-+
| id|value|
+---+-+
|  1|a|
|  2|b|
+---+-+
```



Although I made a PR for this based on the JIRA issue, I want to know whether
this is a really needed feature.

Is it needed for your use cases? Or is it enough for you to remove the headers
in a preprocessing stage?

If this is too old and no longer appropriate these days, I'll close the PR and
JIRA issue as WON'T FIX.



Thank you for all in advance!



Bests,

Dongjoon.



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: .tar.bz2 in spark

2016-12-08 Thread Jörn Franke
Tar is not out of the box supported. Just store the file as .json.bz2 without 
using tar.


> On 8 Dec 2016, at 20:18, Maurin Lenglart  wrote:
> 
> Hi,
> I am trying to load a JSON file compressed as .tar.bz2, but Spark throws an error.
> I am using PySpark with Spark 1.6.2 (Cloudera 5.9).
>  
> What would be the best way to handle that?
> I don’t want to have a non-Spark job that just uncompresses the data…
>  
> thanks
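
A small sketch of the suggestion above (the path is hypothetical): a plain .bz2 file, without the tar wrapper, is decompressed transparently by the Hadoop input codecs, so the JSON reader can load it directly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("bz2Json").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// .json.bz2 (no tar) is handled by the built-in bzip2 codec
val df = sqlContext.read.json("/data/events.json.bz2")   // hypothetical path
df.printSchema()
```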


Re: Design patterns for Spark implementation

2016-12-08 Thread Mich Talebzadeh
Another use case for Spark is to use its in-memory and parallel processing
on RDBMS data.

This may sound a bit strange, but you can access your RDBMS table from
Spark via JDBC with parallel processing and engage the speed of Spark to
accelerate the queries.

To do this you may need to parallelise your JDBC connection to the RDBMS table,
and you will need to have a primary key on the table.

I am going to test it to see how performant it is to offer Spark as a fast
query engine for RDBMS.
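
A minimal sketch of such a parallel JDBC read (all connection details below are assumptions, and the JDBC driver is assumed to be on the classpath): Spark splits the table into numPartitions ranges of the numeric key column and reads them concurrently.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("jdbcParallelRead").getOrCreate()

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales")  // hypothetical RDBMS
  .option("dbtable", "orders")
  .option("user", "spark")
  .option("password", "secret")
  .option("partitionColumn", "order_id")                 // numeric, ideally the primary key
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")                          // 8 concurrent JDBC connections
  .load()

orders.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders").show()
```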

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 December 2016 at 19:51, Sachin Naik  wrote:

> Not sure if you are aware of these
>
> 1) Edx/Berkely/Databricks has three Spark related certifications. Might be
> a good start.
>
> 2) Fair understanding of scala/distributed collection patterns to better
> appreciate the internals of Spark. Coursera has three scala courses. I know
> there are other language bindings. The Edx course goes in great detail on
> those.
>
> 3) Advanced Analytics on Spark book.
>
> --sachin
>
> Sent from my iPhone
>
> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi 
> wrote:
>
> Keeping in mind Spark is a parallel computing engine, Spark does not
> change your data infrastructure/data architecture.  These days it's
> relatively convenient to read data from a variety of sources (S3, HDFS,
> Cassandra, ...) and ditto on the output side.
>
> For example, for one of my use-cases, I store 10's of gigs of time-series
> data in Cassandra.  It just so happens I like to analyze all of it at once
> using Spark, which writes a very nice, small text file table of results I
> look at using Python/Pandas, in a Jupyter notebook, on a laptop.
>
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and
> output side (small text file, ingestible by a laptop) the same way.  The
> only difference would be, instead of importing and processing in Spark, my
> fictional group of 5,000 assistants would each download a portion of the
> data into their Excel spreadsheet, then have a big meeting to produce my
> small text file.
>
> So my view is the nature of your data and specific objectives determine
> your infrastructure and architecture, not the presence or absence of Spark.
>
>
>
>
>
> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina 
> wrote:
>
>> Hi,
>>
>> I know this is a broad question. If this is not the right forum,
>> appreciate if you can point to other sites/areas that may be helpful.
>>
>> Before posing this question, I did use our friend Google, but sanitizing
>> the query results from my need angle hasn't been easy.
>>
>> Who I am:
>>- Have done data processing and analytics, but relatively new to Spark
>> world
>>
>> What I am looking for:
>>   - Architecture/Design of a ML system using Spark
>>   - In particular, looking for best practices that can support/bridge
>> both Engineering and Data Science teams
>>
>> Engineering:
>>- Build a system that has typical engineering needs, data processing,
>> scalability, reliability, availability, fault-tolerance etc.
>>- System monitoring etc.
>> Data Science:
>>- Build a system for Data Science team to do data exploration
>> activities
>>- Develop models using supervised learning and tweak models
>>
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured
>> (some data from transaction systems, weblogs, click stream etc.)
>>   - Streaming, in near term, but not to begin with
>>
>> Data Storage:
>>   - Data is expected to grow on a daily basis...so, system should be able
>> to support and handle big data
>>   - May be, after further analysis, there might be a possibility/need to
>> archive some of the data...it all depends on how the ML models were built
>> and results were stored/used for future usage
>>
>> Data Analysis:
>>   - Obvious data related aspects, such as data cleansing, data
>> transformation, data partitioning etc
>>   - May be run models on windows of data. For example: last 1-year,
>> 2-years etc.
>>
>> ML models:
>>   - Ability to store model versions and previous results
>>   - Compare results of different variants of models
>>
>> Consumers:
>>   - RESTful webservice clients to look at the results
>>
>> *So, the questions I have are:*
>> 1) Are there architectural and design patterns that I can use based on
>> industry best-practices. In particular:
>>   - data ingestion
>>   - data 

Re: unit testing in spark

2016-12-08 Thread Lars Albertsson
I wrote some advice in a previous post on the list:
http://markmail.org/message/bbs5acrnksjxsrrs

It does not mention python, but the strategy advice is the same. Just
replace JUnit/Scalatest with pytest, unittest, or your favourite
python test framework.


I recently held a presentation on the subject. There is a video
recording at https://vimeo.com/192429554 and slides at
http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

You can find more material on test strategies at
http://www.mapflat.com/lands/resources/reading-list/index.html




Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com


On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp  wrote:
> Can someone tell me how I can write unit tests for PySpark?
> (book, tutorial ...)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread Michael Armbrust
I would guess Spark 2.3, but maybe sooner maybe later depending on demand.
I created https://issues.apache.org/jira/browse/SPARK-18791 so people can
describe their requirements / stay informed.

On Thu, Dec 8, 2016 at 11:16 AM, ljwagerfield 
wrote:

> Hi there,
>
> Structured Streaming currently only supports stream-to-batch joins.
>
> Is there an ETA for stream-to-stream joins?
>
> Kindest regards (and keep up the awesome work!),
> Lawrence
>
> (p.s. I've traversed the JIRA roadmaps but couldn't see anything)
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/When-will-Structured-Streaming-
> support-stream-to-stream-joins-tp28185.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: few basic questions on structured streaming

2016-12-08 Thread Michael Armbrust
>
> 1. What happens if an event arrives a few days late? It looks like we have an
> unbounded table with sorted time intervals as keys, but I assume Spark doesn't
> keep several days' worth of data in memory; rather it would checkpoint
> parts of the unbounded table to storage at a specified interval, such that
> if an event comes a few days late it would update the part of the table that
> is in memory plus the parts of the table in storage that contain
> the interval (again, this is just my assumption, I don't know what it really
> does). Is this correct so far?
>

The state we need to keep will be unbounded, unless you specify a
watermark.  This watermark tells us how long to wait for late data to
arrive and thus allows us to bound the amount of state that we keep in
memory.  Since we purge state for aggregations that are below the
watermark, we must also drop data that arrives even later than your
specified watermark (if any).  Note that the watermark is calculated based
on observed data, not on the actual time of processing.  So we should be
robust to cases where the stream is down for extended periods of time.


> 2.  Say I am running a Spark Structured Streaming job for 90 days with a
> window interval of 10 mins and a slide interval of 5 mins. Does the output
> of this job always return the entire history in a table? In other words,
> does the output on the 90th day contain a table of 10-minute time intervals
> from day 1 to day 90? If so, wouldn't that be too big to return as an
> output?
>

This depends on the output mode.  In complete mode, we output the entire
result every time (thus, complete mode probably doesn't make sense for this
use case).  In update mode, we will output
continually updated estimates of the final answer as the stream progresses
(useful if you are for example updating a database).  In append mode
(supported in 2.1) we only output finalized aggregations that have fallen
beneath the watermark.

Relatedly, SPARK-16738 talks
about making the distributed state store queryable.  With this feature, you
could run your query in complete mode (given enough machines).  Even though
the results are large, you can still interact with the complete results of
the aggregation as a distributed DataFrame.


> 3. For Structured Streaming, is it required to have distributed storage
> such as HDFS? My guess would be yes (based on what I said in #1), but I
> would like to confirm.
>

Currently this is the only place that we can write the offset log (records
what data is in each batch) and the state checkpoints.  I think its likely
that we'll add support for other storage systems here in the future.


> 4. I briefly heard about watermarking. Are there any pointers where I can
> learn about it in more detail? Specifically, how watermarks could help in
> Structured Streaming, and so on.
>

Here's the best docs available: https://github.com/apache/spark/pull/15702

We are working on something for the programming guide / a blog post in the
next few weeks.
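
A minimal sketch of the watermark-plus-window pattern described above (Spark 2.1-era API; the socket source, host/port, and interval sizes are assumptions): the watermark bounds how long state for late data is kept, and append mode only emits windows once they fall below it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").appName("watermarkSketch").getOrCreate()
import spark.implicits._

// socket source with includeTimestamp yields columns (value: String, timestamp: Timestamp)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

val counts = lines
  .withWatermark("timestamp", "1 hour")                     // state/data later than 1 hour is dropped
  .groupBy(window($"timestamp", "10 minutes", "5 minutes")) // 10-minute windows sliding every 5
  .count()

val query = counts.writeStream
  .outputMode("append")      // emit only windows finalized below the watermark
  .format("console")
  .start()

query.awaitTermination()
```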


KMediods in Spark java

2016-12-08 Thread Shak S
Is there any example of implementing k-medoids clustering in Spark and Java? I
searched the Spark API and it looks like Spark has not yet implemented k-medoids.
Any example or input will be appreciated.

Thanks.


Phoenix Plugin for Spark - connecting to Phoenix in secured cluster.

2016-12-08 Thread Marcin Pastecki
Hello all,

I have a problem accessing HBase using the Spark Phoenix Plugin in a secured
cluster.

Versions:

Spark 1.6.1,
HBase 1.1.2.2.4,
Phoenix 4.4.0


Using sqlline.py works just fine.

I have valid Kerberos ticket.

I'm trying to get this to work in local mode first. What I'm doing is the basic
test described here: https://phoenix.apache.org/phoenix_spark.html, using
"Load as a DataFrame using the Data Source API" as the example.

This is Ambari managed Hortonworks cluster. HBase client is installed on
the node I run it. Still I'm adding hbase-site.xml to the classpath.

So the code that has troubles is this:

*val df = sqlContext.load(*
*  "org.apache.phoenix.spark",*
*  Map("table" -> "TABLE1", "zkUrl" -> "zookeeper:2181:/hbase-secure")*
*)*


Tried also this:

*val df = sqlContext.load(*
*  "org.apache.phoenix.spark",*
*  Map("table" -> "TABLE1", "zkUrl" ->
"zookeeper:2181:/hbase-secure:hbase@DOMAIN:/etc/security/keytabs/hbase.headless.keytab")*
*)*


Once executed it tries to connect to Phoenix, lines worth mentioning:


16/12/08 22:03:52 INFO ZooKeeper: Initiating client connection,
connectString=zookeeper:2181 sessionTimeout=9
watcher=hconnection-0x55a7c430x0, quorum=zookeeper:2181,
baseZNode=/hbase-secure
16/12/08 22:03:52 INFO ClientCnxn: Opening socket connection to server
zookeeper/ip.ip.ip.ip:2181. *Will not attempt to authenticate using SASL
(unknown error)*
(...)
16/12/08 22:03:52 WARN RpcControllerFactory: *Cannot load configured
"hbase.rpc.controllerfactory.class"
*(org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory)
from hbase-site.xml, falling back to use default RpcControllerFactory


Then it repeats that once again and then it throws those lines every minute
or so.

16/12/08 22:04:41 INFO RpcRetryingCaller: Call exception, tries=10,
retries=35, started=48381 ms ago, cancelled=false, msg=
16/12/08 22:05:01 INFO RpcRetryingCaller: Call exception, tries=11,
retries=35, started=68424 ms ago, cancelled=false, msg=
16/12/08 22:05:21 INFO RpcRetryingCaller: Call exception, tries=12,
retries=35, started=88520 ms ago, cancelled=false, msg=
16/12/08 22:05:41 INFO RpcRetryingCaller: Call exception, tries=13,
retries=35, started=108677 ms ago, cancelled=false, msg=


And ... that's pretty much it.

I tried to replace phoenix-client.jar with the latest
one: phoenix-4.9.0-HBase-1.1-client.jar

But the only thing that changed was that instead of repeating those last
lines, it throws this error:


java.sql.SQLException:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=36, exceptions:
Thu Dec 08 21:28:16 CET 2016, null, java.net.SocketTimeoutException:
callTimeout=6, callDuration=68388: row 'SYSTEM:CATALOG,,' on table
'hbase:meta' at region=hbase:meta,,1.1588230740,
hostname=hostname,16020,1479200343216, seqNum=0

at
org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2432)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2352)
at
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
(...)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
after attempts=36, exceptions:
Thu Dec 08 21:28:16 CET 2016, null, java.net.SocketTimeoutException:
callTimeout=6, callDuration=68388: row 'SYSTEM:CATALOG,,' on table
'hbase:meta' at region=hbase:meta,,1.1588230740,
hostname=hostname,16020,1479200343216, seqNum=0


Any help would be much appreciated.

The goal is to make it work in Yarn cluster mode...

Thanks,
Marcin


Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try to coalesce with a value of 2 or so.  You could dynamically calculate how 
many partitions to have to obtain an optimal file size.

Sent from my iPhone

> On Dec 8, 2016, at 1:03 PM, Kevin Tran  wrote:
> 
> How many partitions should there be when streaming? In a streaming process the 
> data keeps growing in size; is there any configuration to limit file size 
> and write to a new file if it is more than x (let's say 128 MB per file)?
> 
> Another question is about performance when querying these parquet files. What is 
> the best practice for file size and number of files?
> 
> How can I compact small parquet files into a smaller number of bigger parquet files?
> 
> Thanks,
> Kevin.
> 
>> On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low  wrote:
>> Try limiting the partitions: spark.sql.shuffle.partitions
>> 
>> This controls the number of files generated.
>> 
>> 
>>> On 28 Nov 2016 8:29 p.m., "Kevin Tran"  wrote:
>>> Hi Denny,
>>> Thank you for your inputs. I also use 128 MB but still too many files 
>>> generated by Spark app which is only ~14 KB each ! That's why I'm asking if 
>>> there is a solution for this if some one has same issue.
>>> 
>>> Cheers,
>>> Kevin.
>>> 
 On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee  wrote:
 Generally, yes - you should try to have larger data sizes due to the 
 overhead of opening up files.  Typical guidance is between 64MB-1GB; 
 personally I usually stick with 128MB-512MB with the default of snappy 
 codec compression with parquet.  A good reference is Vida Ha's 
 presentation Data Storage Tips for Optimal Spark Performance.  
 
> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:
> Hi Everyone,
> Does anyone know what is the best practise of writing parquet file from 
> Spark ?
> 
> As Spark app write data to parquet and it shows that under that directory 
> there are heaps of very small parquet file (such as 
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 
> 15KB
> 
> Should it write each chunk of  bigger data size (such as 128 MB) with 
> proper number of files ?
> 
> Does anyone find out any performance changes when changing data size of 
> each parquet file ?
> 
> Thanks,
> Kevin.
>>> 
> 
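
A small sketch of the coalesce-before-write idea from this thread (an illustration, not the poster's job; the size estimate and paths are assumptions): reduce the number of output partitions so each Parquet file lands near a target size such as 128 MB.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("parquetCoalesce").getOrCreate()

val df = spark.range(0, 10000000L).toDF("id")      // stand-in for the real dataset

// crude sizing: estimate the total output size and divide by the desired file size
val desiredFileSizeBytes = 128L * 1024 * 1024
val estimatedOutputBytes = 512L * 1024 * 1024      // assumption; measure in practice
val numFiles = math.max(1, (estimatedOutputBytes / desiredFileSizeBytes).toInt)

df.coalesce(numFiles)
  .write
  .mode("overwrite")
  .parquet("/tmp/output-parquet")                  // hypothetical path
```

For periodic compaction of already-written small files, the same idea applies: read the directory back, coalesce, and write to a new location.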


Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
How many partitions should there be when streaming? In a streaming process
the data keeps growing in size; is there any configuration to limit file
size and write to a new file if it is more than x (let's say 128 MB per file)?

Another question is about performance when querying these parquet files. What
is the best practice for file size and number of files?

How can I compact small parquet files into a smaller number of bigger parquet
files?

Thanks,
Kevin.

On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low  wrote:

> Try limiting the partitions: spark.sql.shuffle.partitions
>
> This controls the number of files generated.
>
> On 28 Nov 2016 8:29 p.m., "Kevin Tran"  wrote:
>
>> Hi Denny,
>> Thank you for your inputs. I also use 128 MB but still too many files
>> generated by Spark app which is only ~14 KB each ! That's why I'm asking if
>> there is a solution for this if some one has same issue.
>>
>> Cheers,
>> Kevin.
>>
>> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee  wrote:
>>
>>> Generally, yes - you should try to have larger data sizes due to the
>>> overhead of opening up files.  Typical guidance is between 64MB-1GB;
>>> personally I usually stick with 128MB-512MB with the default of snappy
>>> codec compression with parquet.  A good reference is Vida Ha's presentation 
>>> Data
>>> Storage Tips for Optimal Spark Performance
>>> .
>>>
>>>
>>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:
>>>
 Hi Everyone,
 Does anyone know what is the best practise of writing parquet file from
 Spark ?

 As Spark app write data to parquet and it shows that under that
 directory there are heaps of very small parquet file (such as
 e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is
 only 15KB

 Should it write each chunk of  bigger data size (such as 128 MB) with
 proper number of files ?

 Does anyone find out any performance changes when changing data size of
 each parquet file ?

 Thanks,
 Kevin.

>>>
>>


Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
You could have posted just the error, which is at the end of my response.

Why are you trying to use WebHDFS? I'm not really sure how
authentication works with that. But generally applications use HDFS
(which uses a different URI scheme), and Spark should work fine with
that.


Error:
Authentication required
org.apache.hadoop.security.AccessControlException: Authentication required
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:457)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:113)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:738)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:582)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:612)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:608)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1507)
at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:523)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:140)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)


On Thu, Dec 8, 2016 at 12:29 PM, Gerard Casey  wrote:
> Sure - I wanted to check with admin before sharing. I’ve attached it now, 
> does this help?
>
> Many thanks again,
>
> G
>
>
>
>> On 8 Dec 2016, at 20:18, Marcelo Vanzin  wrote:
>>
>> Then you probably have a configuration error somewhere. Since you
>> haven't actually posted the error you're seeing, it's kinda hard to
>> help any further.
>>
>> On Thu, Dec 8, 2016 at 11:17 AM, Gerard Casey  
>> wrote:
>>> Right. I’m confident that is setup correctly.
>>>
>>> I can run the SparkPi test script. The main difference between it and my 
>>> application is that it doesn’t access HDFS.
>>>
 On 8 Dec 2016, at 18:43, Marcelo Vanzin  wrote:

 On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  
 wrote:
> To be specific, where exactly should spark.authenticate be set to true?

 spark.authenticate has nothing to do with kerberos. It's for
 authentication between different Spark processes belonging to the same
 app.

 --
 Marcelo

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org

>>>
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Gerard Casey
Sure - I wanted to check with admin before sharing. I’ve attached it now, does 
this help?

Many thanks again,

G

Container: container_e34_1479877553404_0174_01_03 on 
hdp-node12.xcat.cluster_45454_1481228528201

LogType:directory.info
Log Upload Time:Thu Dec 08 20:22:08 + 2016
LogLength:5138
Log Contents:
ls -l:
total 28
lrwxrwxrwx 1 my_user_name hadoop   70 Dec  8 20:21 __app__.jar -> 
/hadoop/yarn/local/usercache/my_user_name/filecache/26/graphx_sp_2.10-1.0.jar
lrwxrwxrwx 1 my_user_name hadoop   63 Dec  8 20:21 __spark__.jar -> 
/hadoop_1/hadoop/yarn/local/filecache/11/spark-hdp-assembly.jar
lrwxrwxrwx 1 my_user_name hadoop   94 Dec  8 20:21 __spark_conf__ -> 
/hadoop_1/hadoop/yarn/local/usercache/my_user_name/filecache/25/__spark_conf__2528926660896665250.zip
-rw--- 1 my_user_name hadoop  340 Dec  8 20:21 container_tokens
-rwx-- 1 my_user_name hadoop 6195 Dec  8 20:21 launch_container.sh
drwxr-s--- 2 my_user_name hadoop 4096 Dec  8 20:21 tmp
find -L . -maxdepth 5 -ls:
27527034 drwxr-s---   3 my_user_namehadoop   4096 Dec  8 20:21 .
106430595 184304 -r-xr-xr-x   1 yarn hadoop   188727178 Dec  5 18:22 
./__spark__.jar
1064315274 drwx--   2 my_user_namemy_user_name4096 Dec  8 
20:21 ./__spark_conf__
1064315594 -r-x--   1 my_user_namemy_user_name 951 Dec  8 
20:21 ./__spark_conf__/mapred-env.cmd
1064315584 -r-x--   1 my_user_namemy_user_name1000 Dec  8 
20:21 ./__spark_conf__/ssl-server.xml
1064315288 -r-x--   1 my_user_namemy_user_name5410 Dec  8 
20:21 ./__spark_conf__/hadoop-env.sh
1064315534 -r-x--   1 my_user_namemy_user_name2316 Dec  8 
20:21 ./__spark_conf__/ssl-client.xml.example
1064315324 -r-x--   1 my_user_namemy_user_name3979 Dec  8 
20:21 ./__spark_conf__/hadoop-env.cmd
106431546   12 -r-x--   1 my_user_namemy_user_name8217 Dec  8 
20:21 ./__spark_conf__/hdfs-site.xml
1064315458 -r-x--   1 my_user_namemy_user_name5637 Dec  8 
20:21 ./__spark_conf__/yarn-env.sh
1064315524 -r-x--   1 my_user_namemy_user_name1602 Dec  8 
20:21 ./__spark_conf__/health_check
1064315374 -r-x--   1 my_user_namemy_user_name1631 Dec  8 
20:21 ./__spark_conf__/kms-log4j.properties
1064315638 -r-x--   1 my_user_namemy_user_name5511 Dec  8 
20:21 ./__spark_conf__/kms-site.xml
1064315308 -r-x--   1 my_user_namemy_user_name7353 Dec  8 
20:21 ./__spark_conf__/mapred-site.xml
1064315484 -r-x--   1 my_user_namemy_user_name1072 Dec  8 
20:21 ./__spark_conf__/container-executor.cfg
1064315360 -r-x--   1 my_user_namemy_user_name   0 Dec  8 
20:21 ./__spark_conf__/yarn.exclude
1064315628 -r-x--   1 my_user_namemy_user_name4113 Dec  8 
20:21 ./__spark_conf__/mapred-queues.xml.template
1064315384 -r-x--   1 my_user_namemy_user_name2250 Dec  8 
20:21 ./__spark_conf__/yarn-env.cmd
1064315474 -r-x--   1 my_user_namemy_user_name1020 Dec  8 
20:21 ./__spark_conf__/commons-logging.properties
1064315434 -r-x--   1 my_user_namemy_user_name 758 Dec  8 
20:21 ./__spark_conf__/mapred-site.xml.template
1064315544 -r-x--   1 my_user_namemy_user_name1527 Dec  8 
20:21 ./__spark_conf__/kms-env.sh
1064315564 -r-x--   1 my_user_namemy_user_name 760 Dec  8 
20:21 ./__spark_conf__/slaves
1064315614 -r-x--   1 my_user_namemy_user_name 945 Dec  8 
20:21 ./__spark_conf__/taskcontroller.cfg
1064315424 -r-x--   1 my_user_namemy_user_name2358 Dec  8 
20:21 ./__spark_conf__/topology_script.py
1064315394 -r-x--   1 my_user_namemy_user_name 884 Dec  8 
20:21 ./__spark_conf__/ssl-client.xml
1064315314 -r-x--   1 my_user_namemy_user_name2207 Dec  8 
20:21 ./__spark_conf__/hadoop-metrics2.properties
1064315644 -r-x--   1 my_user_namemy_user_name 506 Dec  8 
20:21 ./__spark_conf__/__spark_conf__.properties
1064315508 -r-x--   1 my_user_namemy_user_name4221 Dec  8 
20:21 ./__spark_conf__/task-log4j.properties
1064315514 -r-x--   1 my_user_namemy_user_name 856 Dec  8 
20:21 ./__spark_conf__/mapred-env.sh
106431529   12 -r-x--   1 my_user_namemy_user_name9313 Dec  8 
20:21 ./__spark_conf__/log4j.properties
1064315414 -r-x--   1 my_user_namemy_user_name3518 Dec  8 
20:21 ./__spark_conf__/kms-acls.xml
1064315348 -r-x--   1 my_user_namemy_user_name7634 Dec  8 
20:21 ./__spark_conf__/core-site.xml
1064315574 -r-x--   1 my_user_namemy_user_name2081 Dec  8 
20:21 ./__spark_conf__/topology_mappings.data
1064315494 -r-x--   1 

Re: Design patterns for Spark implementation

2016-12-08 Thread Sachin Naik
Not sure if you are aware of these

1) Edx/Berkely/Databricks has three Spark related certifications. Might be a 
good start. 

2) Fair understanding of scala/distributed collection patterns to better 
appreciate the internals of Spark. Coursera has three scala courses. I know 
there are other language bindings. The Edx course goes in great detail on 
those. 

3) Advanced Analytics on Spark book. 

--sachin

Sent from my iPhone

> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi  wrote:
> 
> Keeping in mind Spark is a parallel computing engine, Spark does not change 
> your data infrastructure/data architecture.  These days it's relatively 
> convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) 
> and ditto on the output side.  
> 
> For example, for one of my use-cases, I store 10's of gigs of time-series 
> data in Cassandra.  It just so happens I like to analyze all of it at once 
> using Spark, which writes a very nice, small text file table of results I 
> look at using Python/Pandas, in a Jupyter notebook, on a laptop. 
> 
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and 
> output side (small text file, ingestible by a laptop) the same way.  The only 
> difference would be, instead of importing and processing in Spark, my 
> fictional group of 5,000 assistants would each download a portion of the data 
> into their Excel spreadsheet, then have a big meeting to produce my small 
> text file.
> 
> So my view is the nature of your data and specific objectives determine your 
> infrastructure and architecture, not the presence or absence of Spark.
> 
> 
> 
> 
> 
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina  
>> wrote:
>> Hi,
>> 
>> I know this is a broad question. If this is not the right forum, appreciate 
>> if you can point to other sites/areas that may be helpful.
>> 
>> Before posing this question, I did use our friend Google, but sanitizing the 
>> query results from my need angle hasn't been easy.
>> 
>> Who I am: 
>>- Have done data processing and analytics, but relatively new to Spark 
>> world
>> 
>> What I am looking for:
>>   - Architecture/Design of a ML system using Spark
>>   - In particular, looking for best practices that can support/bridge both 
>> Engineering and Data Science teams
>> 
>> Engineering:
>>- Build a system that has typical engineering needs, data processing, 
>> scalability, reliability, availability, fault-tolerance etc.
>>- System monitoring etc.
>> Data Science:
>>- Build a system for Data Science team to do data exploration activities
>>- Develop models using supervised learning and tweak models
>> 
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured 
>> (some data from transaction systems, weblogs, click stream etc.)
>>   - Streaming, in near term, but not to begin with 
>> 
>> Data Storage:
>>   - Data is expected to grow on a daily basis...so, system should be able to 
>> support and handle big data
>>   - May be, after further analysis, there might be a possibility/need to 
>> archive some of the data...it all depends on how the ML models were built 
>> and results were stored/used for future usage
>> 
>> Data Analysis:
>>   - Obvious data related aspects, such as data cleansing, data 
>> transformation, data partitioning etc
>>   - May be run models on windows of data. For example: last 1-year, 2-years 
>> etc.
>> 
>> ML models:
>>   - Ability to store model versions and previous results
>>   - Compare results of different variants of models
>>  
>> Consumers:
>>   - RESTful webservice clients to look at the results
>> 
>> So, the questions I have are:
>> 1) Are there architectural and design patterns that I can use based on 
>> industry best-practices. In particular:
>>   - data ingestion
>>   - data storage (for eg. go with HDFS or not)
>>   - data partitioning, especially in Spark world
>>   - running parallel ML models and combining results etc.
>>   - consumption of final results by clients (for eg. by pushing results 
>> to Cassandra, NoSQL dbs etc.)
>> 
>> Again, I know this is a broad question... Pointers to some best practices in 
>> some of the areas, if not all, would be highly appreciated. I'm open to purchasing 
>> any books that may have relevant information.
>> 
>> Thanks much folks,
>> Vasu.
>> 
> 


Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keeping in mind Spark is a parallel computing engine, Spark does not change
your data infrastructure/data architecture.  These days it's relatively
convenient to read data from a variety of sources (S3, HDFS, Cassandra,
...) and ditto on the output side.

For example, for one of my use-cases, I store 10's of gigs of time-series
data in Cassandra.  It just so happens I like to analyze all of it at once
using Spark, which writes a very nice, small text file table of results I
look at using Python/Pandas, in a Jupyter notebook, on a laptop.

If we didn't have Spark, I'd still be doing the input side (Cassandra) and
output side (small text file, ingestible by a laptop) the same way.  The
only difference would be, instead of importing and processing in Spark, my
fictional group of 5,000 assistants would each download a portion of the
data into their Excel spreadsheet, then have a big meeting to produce my
small text file.

So my view is the nature of your data and specific objectives determine
your infrastructure and architecture, not the presence or absence of Spark.





On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina 
wrote:

> Hi,
>
> I know this is a broad question. If this is not the right forum,
> appreciate if you can point to other sites/areas that may be helpful.
>
> Before posing this question, I did use our friend Google, but sanitizing
> the query results from my need angle hasn't been easy.
>
> Who I am:
>- Have done data processing and analytics, but relatively new to Spark
> world
>
> What I am looking for:
>   - Architecture/Design of a ML system using Spark
>   - In particular, looking for best practices that can support/bridge both
> Engineering and Data Science teams
>
> Engineering:
>- Build a system that has typical engineering needs, data processing,
> scalability, reliability, availability, fault-tolerance etc.
>- System monitoring etc.
> Data Science:
>- Build a system for Data Science team to do data exploration activities
>- Develop models using supervised learning and tweak models
>
> Data:
>   - Batch and incremental updates - mostly structured or semi-structured
> (some data from transaction systems, weblogs, click stream etc.)
>   - Streaming, in near term, but not to begin with
>
> Data Storage:
>   - Data is expected to grow on a daily basis...so, system should be able
> to support and handle big data
>   - May be, after further analysis, there might be a possibility/need to
> archive some of the data...it all depends on how the ML models were built
> and results were stored/used for future usage
>
> Data Analysis:
>   - Obvious data related aspects, such as data cleansing, data
> transformation, data partitioning etc
>   - May be run models on windows of data. For example: last 1-year,
> 2-years etc.
>
> ML models:
>   - Ability to store model versions and previous results
>   - Compare results of different variants of models
>
> Consumers:
>   - RESTful webservice clients to look at the results
>
> *So, the questions I have are:*
> 1) Are there architectural and design patterns that I can use based on
> industry best-practices. In particular:
>   - data ingestion
>   - data storage (for eg. go with HDFS or not)
>   - data partitioning, especially in Spark world
>   - running parallel ML models and combining results etc.
>   - consumption of final results by clients (for eg. by pushing
> results to Cassandra, NoSQL dbs etc.)
>
> Again, I know this is a broad question... Pointers to some best practices
> in some of the areas, if not all, would be highly appreciated. I'm open to
> purchasing any books that may have relevant information.
>
> Thanks much folks,
> Vasu.
>
>


SparkContext not creating due Logger initialization

2016-12-08 Thread Adnan Ahmed
Hi,


Sometimes I get this error when I submit a Spark job. It doesn't happen every time,
but when it does, the SparkContext doesn't get created.


16/12/08 08:02:18 INFO [akka.event.slf4j.Slf4jLogger] 80==> Slf4jLogger started
error while starting up loggers
akka.ConfigurationException: Logger specified in config can't be loaded 
[akka.event.slf4j.Slf4jLogger] due to 
[akka.event.Logging$LoggerInitializationException: Logger log1-Slf4jLogger did 
not respond with LoggerInitialized, sent instead [TIMEOUT]]
at 
akka.event.LoggingBus$$anonfun$4$$anonfun$apply$1.applyOrElse(Logging.scala:116)
at 
akka.event.LoggingBus$$anonfun$4$$anonfun$apply$1.applyOrElse(Logging.scala:115)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at akka.event.LoggingBus$$anonfun$4.apply(Logging.scala:115)
at akka.event.LoggingBus$$anonfun$4.apply(Logging.scala:110)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at akka.event.LoggingBus$class.startDefaultLoggers(Logging.scala:110)
at akka.event.EventStream.startDefaultLoggers(EventStream.scala:26)
at akka.actor.LocalActorRefProvider.init(ActorRefProvider.scala:623)
at 
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:157)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
at 
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.(SparkContext.scala:457)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2304)


Thanks

Adnan Ahmed


Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
Then you probably have a configuration error somewhere. Since you
haven't actually posted the error you're seeing, it's kinda hard to
help any further.

On Thu, Dec 8, 2016 at 11:17 AM, Gerard Casey  wrote:
> Right. I’m confident that is setup correctly.
>
> I can run the SparkPi test script. The main difference between it and my 
> application is that it doesn’t access HDFS.
>
>> On 8 Dec 2016, at 18:43, Marcelo Vanzin  wrote:
>>
>> On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  
>> wrote:
>>> To be specific, where exactly should spark.authenticate be set to true?
>>
>> spark.authenticate has nothing to do with kerberos. It's for
>> authentication between different Spark processes belonging to the same
>> app.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



.tar.bz2 in spark

2016-12-08 Thread Maurin Lenglart
Hi,
I am trying to load a JSON file compressed in .tar.bz2, but Spark throws an error.
I am using pyspark with spark 1.6.2. (Cloudera 5.9)

What will be the best way to handle that?
I don't want to have a non-Spark job that just uncompresses the data…

thanks
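
One approach that avoids a separate decompression job is to read each archive
as a whole binary file and untar it on the executors. A rough sketch in Scala
(untested; the path is a placeholder, and it assumes Apache commons-compress is
on the classpath, which it normally is since Hadoop ships it). Note that each
.tar.bz2 is read and decompressed by a single task, so this only works if the
individual archives fit comfortably in executor memory:

```
import java.io.BufferedInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import scala.io.Source

// One (path, stream) pair per archive; emit every text line of every tar entry.
val jsonLines = sc.binaryFiles("/data/archives/*.tar.bz2").flatMap { case (_, pds) =>
  val tar = new TarArchiveInputStream(
    new BZip2CompressorInputStream(new BufferedInputStream(pds.open())))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .flatMap { _ =>
      // Reads only the current tar entry; TarArchiveInputStream stops at its end.
      Source.fromInputStream(tar, "UTF-8").getLines().toList
    }
    .toList
}

val df = sqlContext.read.json(jsonLines)   // Spark 1.6: json(RDD[String])
```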


Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Gerard Casey
Right. I’m confident that is setup correctly.

I can run the SparkPi test script. The main difference between it and my 
application is that it doesn’t access HDFS. 

> On 8 Dec 2016, at 18:43, Marcelo Vanzin  wrote:
> 
> On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  
> wrote:
>> To be specific, where exactly should spark.authenticate be set to true?
> 
> spark.authenticate has nothing to do with kerberos. It's for
> authentication between different Spark processes belonging to the same
> app.
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread ljwagerfield
Hi there,

Structured Streaming currently only supports stream-to-batch joins. 

Is there an ETA for stream-to-stream joins?

Kindest regards (and keep up the awesome work!),
Lawrence

(p.s. I've traversed the JIRA roadmaps but couldn't see anything)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-will-Structured-Streaming-support-stream-to-stream-joins-tp28185.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark reshape hive table and save to parquet

2016-12-08 Thread Georg Heiler
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

Anton Kravchenko  schrieb am Do., 8. Dez.
2016 um 17:53 Uhr:

> Hello,
>
> I wonder if there is a way (preferably efficient) in Spark to reshape hive
> table and save it to parquet.
>
> Here is a minimal example, input hive table:
> col1 col2 col3
> 1 2 3
> 4 5 6
>
> output parquet:
> col1 newcol2
> 1 [2 3]
> 4 [5 6]
>
> p.s. The real input hive table has ~1000 columns.
>
> Thank you,
> Anton
>
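
For the array-collapse shape asked about here (as opposed to a pivot), a
minimal sketch, assuming Spark 2.0 with Hive support enabled; the table name
and output path are placeholders, and array() requires the collapsed columns
to share a compatible type (cast them first if they do not):

```
import org.apache.spark.sql.functions.array

val df = spark.table("my_input_table")

// Collapse every column except col1 into one array column; with ~1000
// columns, build the column list programmatically.
val valueCols = df.columns.filter(_ != "col1").map(df(_))
val reshaped = df.select(df("col1"), array(valueCols: _*).as("newcol2"))

reshaped.write.parquet("/tmp/reshaped_parquet")
```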


Question about the DirectKafkaInputDStream

2016-12-08 Thread John Fang
The source is DirectKafkaInputDStream, which can ensure exactly-once semantics on 
the consumer side, but I have a question based on the following code. As we know, 
"graph.generateJobs(time)" creates the RDDs and generates the jobs, and the source 
RDD is a KafkaRDD, which contains the offsetRange. The jobs are submitted 
successfully by "jobScheduler.submitJobSet" and the cluster starts running them. 
If the driver then crashes suddenly, it loses the offsetRange, because it has not 
yet run "eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))".

```
  private def generateJobs(time: Time) {
// Checkpoint all RDDs marked for checkpointing to ensure their lineages are
// truncated periodically. Otherwise, we may run into stack overflows 
(SPARK-6847).
ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, 
"true")
Try {
  jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate 
received blocks to batch
  graph.generateJobs(time) // generate jobs using allocated block
} match {
  case Success(jobs) =>
val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
  case Failure(e) =>
jobScheduler.reportError("Error generating jobs for time " + time, e)
PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
}
eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }
  ```
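
One way to avoid depending only on the driver-side checkpoint for offsets is
to track and commit them explicitly after each batch's output has been
written. A sketch, assuming the spark-streaming-kafka-0-10 integration and an
existing StreamingContext ssc; broker, group and topic names are placeholders:

```
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  // Grab the ranges before any transformation loses the KafkaRDD type.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write this batch's output to the external store here ...

  // Commit only after the output succeeds: a driver crash before the commit
  // replays the batch instead of losing the range.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

On its own this gives at-least-once semantics; exactly-once still needs an
idempotent or transactional sink for the batch output.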

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  wrote:
> To be specific, where exactly should spark.authenticate be set to true?

spark.authenticate has nothing to do with kerberos. It's for
authentication between different Spark processes belonging to the same
app.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark reshape hive table and save to parquet

2016-12-08 Thread Anton Kravchenko
Hello,

I wonder if there is a way (preferably efficient) in Spark to reshape hive
table and save it to parquet.

Here is a minimal example, input hive table:
col1 col2 col3
1 2 3
4 5 6

output parquet:
col1 newcol2
1 [2 3]
4 [5 6]

p.s. The real input hive table has ~1000 columns.

Thank you,
Anton


Re: OS killing Executor due to high (possibly off heap) memory usage

2016-12-08 Thread Aniket Bhatnagar
I did some instrumentation to trace where DirectByteBuffers are being
created, and it turns out that setting the following system properties, in
addition to spark.shuffle.io.preferDirectBufs=false in the Spark config,
helps:

io.netty.noUnsafe=true
io.netty.threadLocalDirectBufferSize=0

This should force netty to mostly use on-heap buffers and thus increase the
stability of Spark jobs that perform a lot of shuffle. I have created the
defect SPARK-18787 to either force these settings when
spark.shuffle.io.preferDirectBufs=false is set in the Spark config, or to
document it.
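
For anyone who wants to try the same workaround: the two netty settings are
JVM system properties, so they have to reach the executor JVMs. A sketch of
one way to wire that up (adjust to your own submit mechanism):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")
  // System properties for the executor JVMs; add the same string to
  // spark.driver.extraJavaOptions if the driver needs it too.
  .set("spark.executor.extraJavaOptions",
       "-Dio.netty.noUnsafe=true -Dio.netty.threadLocalDirectBufferSize=0")
```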

Hope it will be helpful for other users as well.

Thanks,
Aniket

On Sat, Nov 26, 2016 at 3:31 PM Koert Kuipers  wrote:

> i agree that offheap memory usage is unpredictable.
>
> when we used rdds the memory was mostly on heap and total usage
> predictable, and we almost never had yarn killing executors.
>
> now with dataframes the memory usage is both on and off heap, and we have
> no way of limiting the off heap memory usage by spark, yet yarn requires a
> maximum total memory usage and if you go over it yarn kills the executor.
>
> On Fri, Nov 25, 2016 at 12:14 PM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
> Thanks Rohit, Roddick and Shreya. I tried
> changing spark.yarn.executor.memoryOverhead to be 10GB and lowering
> executor memory to 30 GB and both of these didn't work. I finally had to
> reduce the number of cores per executor to be 18 (from 36) in addition to
> setting higher spark.yarn.executor.memoryOverhead and lower executor memory
> size. I had to trade off performance for reliability.
>
> Unfortunately, spark does a poor job reporting off heap memory usage. From
> the profiler, it seems that the job's heap usage is pretty static but the
> off heap memory fluctuates quiet a lot. It looks like bulk of off heap is
> used by io.netty.buffer.UnpooledUnsafeDirectByteBuf while the shuffle
> client is trying to read block from shuffle service. It looks
> like org.apache.spark.network.util.TransportFrameDecoder retains them
> in buffers field while decoding responses from the shuffle service. So far,
> it's not clear why it needs to hold multiple GBs in the buffers. Perhaps
> increasing the number of partitions may help with this.
>
> Thanks,
> Aniket
>
> On Fri, Nov 25, 2016 at 1:09 AM Shreya Agarwal 
> wrote:
>
> I don’t think it’s just memory overhead. It might be better to use an
> execute with lesser heap space(30GB?). 46 GB would mean more data load into
> memory and more GC, which can cause issues.
>
>
>
> Also, have you tried to persist data in any way? If so, then that might be
> causing an issue.
>
>
>
> Lastly, I am not sure if your data has a skew and if that is forcing a lot
> of data to be on one executor node.
>
>
>
> Sent from my Windows 10 phone
>
>
>
> *From: *Rodrick Brown 
> *Sent: *Friday, November 25, 2016 12:25 AM
> *To: *Aniket Bhatnagar 
> *Cc: *user 
> *Subject: *Re: OS killing Executor due to high (possibly off heap) memory
> usage
>
>
> Try setting spark.yarn.executor.memoryOverhead 1
>
> On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
> Hi Spark users
>
> I am running a job that does join of a huge dataset (7 TB+) and the
> executors keep crashing randomly, eventually causing the job to crash.
> There are no out of memory exceptions in the log and looking at the dmesg
> output, it seems like the OS killed the JVM because of high memory usage.
> My suspicion is towards off heap usage of executor is causing this as I am
> limiting the on heap usage of executor to be 46 GB and each host running
> the executor has 60 GB of RAM. After the executor crashes, I can see that
> the external shuffle manager
> (org.apache.spark.network.server.TransportRequestHandler) logs a lot of
> channel closed exceptions in yarn node manager logs. This leads me to
> believe that something triggers out of memory during shuffle read. Is there
> a configuration to completely disable usage of off heap memory? I have
> tried setting spark.shuffle.io.preferDirectBufs=false but the executor is
> still getting killed by the same error.
>
> Cluster details:
> 10 AWS c4.8xlarge hosts
> RAM on each host - 60 GB
> Number of cores on each host - 36
> Additional hard disk on each host - 8 TB
>
> Spark configuration:
> dynamic allocation enabled
> external shuffle service enabled
> spark.driver.memory 1024M
> spark.executor.memory 47127M
> Spark master yarn-cluster
>
> Sample error in yarn node manager:
> 2016-11-24 10:34:06,507 ERROR
> org.apache.spark.network.server.TransportRequestHandler
> (shuffle-server-50): Error sending result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123,
> chunkIndex=0},
> 

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
I wish I could provide additional suggestions. Maybe one of the admins can
step in and help. I'm just another random user trying (with mixed success)
to be helpful. 

Sorry again to everyone about my spam, which just added to the problem.

On Thu, Dec 8, 2016 at 11:22 AM Chen, Yan I  wrote:

> I’m pretty sure I didn’t.
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:56 AM
> *To:* Chen, Yan I; Di Zhu
>
>
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Oh, hmm...
>
> Did you perhaps subscribe with a different address than the one you're
> trying to unsubscribe from?
>
> For example, you subscribed with myemail+sp...@gmail.com but you send the
> unsubscribe email from myem...@gmail.com
>
> On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I wrote:
>
> The reason I sent that email is because I did sent emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu 
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
> *This e-mail message, including any attachments, is for the sole use of
> the person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil. *
>
> * Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação. *
>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser 

RE: unsubscribe

2016-12-08 Thread Chen, Yan I
I’m pretty sure I didn’t.

From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Thursday, December 08, 2016 10:56 AM
To: Chen, Yan I; Di Zhu
Cc: user @spark
Subject: Re: unsubscribe

Oh, hmm...

Did you perhaps subscribe with a different address than the one you're trying 
to unsubscribe from?

For example, you subscribed with 
myemail+sp...@gmail.com but you send the 
unsubscribe email from myem...@gmail.com
On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I wrote:
The reason I sent that email is because I did sent emails to 
user-unsubscr...@spark.apache.org and 
dev-unsubscr...@spark.apache.org two 
months ago. But I can still receive a lot of emails every day. I even did that 
again before 10AM EST and got confirmation that I’m unsubscribed, but I still 
received this email.


From: Nicholas Chammas 
[mailto:nicholas.cham...@gmail.com]
Sent: Thursday, December 08, 2016 10:02 AM
To: Di Zhu
Cc: user @spark
Subject: Re: unsubscribe

Yes, sorry about that. I didn't think before responding to all those who asked 
to unsubscribe.

On Thu, Dec 8, 2016 at 10:00 AM Di Zhu 
> wrote:
Could you send to individual privately without cc to all users every time?


On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:

To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva 
> wrote:

This e-mail message, including any attachments, is for the sole use of the 
person to whom it has been sent and may contain information that is 
confidential or legally protected. If you are not the intended recipient or 
have received this message in error, you are not authorized to copy, 
distribute, or otherwise use it or its attachments. Please notify the sender 
immediately by return email and permanently delete this message and any 
attachments. NeoGrid makes no warranty that this email is error or virus free. 
NeoGrid Europe Limited is a company registered in the United Kingdom with the 
registration number 7717968. The registered office is 8-10 Upper Marlborough 
Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid Netherlands B.V. is a 
company registered in the Netherlands with the registration number 3416.6499 
and registered office at Science Park 400, 1098 XH Amsterdam, NL. NeoGrid North 
America Limited is a company registered in the United States with the 
registration number 52-2242825. The registered office is 55 West Monroe Street, 
Suite 3590-60603, Chicago, IL, USA. NeoGrid Japan is located at New Otani 
Garden Court 7F, 4-1 Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid 
Software SA is a company registered in Brazil, with the registration number 
CNPJ: 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105, 
Joinville - SC – Brazil.

Esta mensagem pode conter informação confidencial ou privilegiada, sendo seu 
sigilo protegido por lei. Se você não for o destinatário ou a pessoa autorizada 
a receber esta mensagem, não pode usar, copiar ou divulgar as informações nela 
contidas ou tomar qualquer ação baseada nessas informações. Se você recebeu 
esta mensagem por engano, por favor, avise imediatamente ao remetente, 
respondendo o e-mail e em seguida apague-a. Agradecemos sua cooperação.


___

If you received this email in error, please advise the sender (by return email 
or otherwise) immediately. You have consented to receive the attached 
electronically at the above-noted email address; please retain a copy of this 
confirmation for future reference.

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur 
immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté 
de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse 
courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation 
pour les fins de reference future.
___
If you received this email in error, please advise the sender (by return email 
or otherwise) immediately. You have consented to receive the attached 
electronically at the above-noted email address; please retain a copy of this 
confirmation for future reference.  

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur 
immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté 
de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse 
courriel 

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Oh, hmm...

Did you perhaps subscribe with a different address than the one you're
trying to unsubscribe from?

For example, you subscribed with myemail+sp...@gmail.com but you send the
unsubscribe email from myem...@gmail.com
On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I  wrote:

> The reason I sent that email is because I did sent emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu 
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
>
>
> *This e-mail message, including any attachments, is for the sole use of
> the person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil. Esta mensagem pode conter informação confidencial
> ou privilegiada, sendo seu sigilo protegido por lei. Se você não for o
> destinatário ou a pessoa autorizada a receber esta mensagem, não pode usar,
> copiar ou divulgar as informações nela contidas ou tomar qualquer ação
> baseada nessas informações. Se você recebeu esta mensagem por engano, por
> favor, avise imediatamente ao remetente, respondendo o e-mail e em seguida
> apague-a. Agradecemos sua cooperação. *
>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
>


Re: Managed memory leak : spark-2.0.2

2016-12-08 Thread Appu K
Hi,

I didn't hit any OOM issues. Thanks for the pointer. I guess it'll be safe
to ignore, since TaskMemoryManager automatically releases the memory.

Just wondering what could have been the cause in this case; I couldn't see
any task failures in the log, but there are some references to
ExternalAppendOnlyMap acquiring memory which it perhaps didn't release:
———
2016-12-08 16:12:35,100 [Executor task launch worker-0]
(TaskMemoryManager.java:185) DEBUG Task 1 acquired 5.0 MB for
org.apache.spark.util.collection.ExternalAppendOnlyMap@28462783
2016-12-08 16:12:35,156 [Executor task launch worker-0]
(TaskMemoryManager.java:185) DEBUG Task 1 acquired 10.0 MB for
org.apache.spark.util.collection.ExternalAppendOnlyMap@28462783
2016-12-08 16:12:35,265 [Executor task launch worker-0]
(TaskMemoryManager.java:185) DEBUG Task 1 acquired 31.2 MB for
org.apache.spark.util.collection.ExternalAppendOnlyMap@28462783
2016-12-08 16:12:35,485 [Executor task launch worker-0]
(TaskMemoryManager.java:381) WARN leak 46.2 MB memory from
org.apache.spark.util.collection.ExternalAppendOnlyMap@28462783
———

thanks again

cheers
appu


On 8 December 2016 at 8:53:29 PM, Takeshi Yamamuro (linguin@gmail.com)
wrote:

Hi,

Did you hit some troubles from the memory leak?
I think we can ignore the message in most cases because TaskMemoryManager
automatically releases the memory. In fact, spark degraded the message
in SPARK-18557.
https://issues.apache.org/jira/browse/SPARK-18557

// maropu

On Thu, Dec 8, 2016 at 8:10 PM, Appu K  wrote:

> Hello,
>
> I’ve just ran into an issue where the job is giving me "Managed memory
> leak" with spark version 2.0.2
>
> —
> 2016-12-08 16:31:25,231 [Executor task launch worker-0]
> (TaskMemoryManager.java:381) WARN leak 46.2 MB memory from
> org.apache.spark.util.collection.ExternalAppendOnlyMap@22719fb8
> 2016-12-08 16:31:25,232 [Executor task launch worker-0] (Logging.scala:66)
> WARN Managed memory leak detected; size = 48442112 bytes, TID = 1
> —
>
>
> The program itself is very basic and looks like take() is causing the
> issue
>
> Program: https://gist.github.com/kutt4n/87cfcd4e794b1865b6f880412dd80bbf
> Debug Log: https://gist.github.com/kutt4n/ba3cf812dced34ceadc588856edc
>
>
> TaskMemoryManager.java:381 says that it's normal to see leaked memory if
> one of the tasks failed.  In this case from the debug log - it is not quite
> apparent which task failed and the reason for failure.
>
> When the TSV file itself is small the issue doesn’t exist. In this
> particular case, the file is a 21MB clickstream data from wikipedia
> available at https://ndownloader.figshare.com/files/5036392
>
> Where could i read up more about managed memory leak. Any pointers on what
> might be the issue would be highly helpful
>
> thanks
> appu
>
>
>
>


--
---
Takeshi Yamamuro


Re: unit testing in spark

2016-12-08 Thread ndjido
Hi Pseudo,

Just use unittest https://docs.python.org/2/library/unittest.html .

> On 8 Dec 2016, at 19:14, pseudo oduesp  wrote:
> 
> somone can tell me how i can make unit test on pyspark ?
> (book, tutorial ...)


Re: Managed memory leak : spark-2.0.2

2016-12-08 Thread Takeshi Yamamuro
Hi,

Did you hit some troubles from the memory leak?
I think we can ignore the message in most cases because TaskMemoryManager
automatically releases the memory. In fact, spark degraded the message
in SPARK-18557.
https://issues.apache.org/jira/browse/SPARK-18557

// maropu

On Thu, Dec 8, 2016 at 8:10 PM, Appu K  wrote:

> Hello,
>
> I’ve just ran into an issue where the job is giving me "Managed memory
> leak" with spark version 2.0.2
>
> —
> 2016-12-08 16:31:25,231 [Executor task launch worker-0]
> (TaskMemoryManager.java:381) WARN leak 46.2 MB memory from
> org.apache.spark.util.collection.ExternalAppendOnlyMap@22719fb8
> 2016-12-08 16:31:25,232 [Executor task launch worker-0] (Logging.scala:66)
> WARN Managed memory leak detected; size = 48442112 bytes, TID = 1
> —
>
>
> The program itself is very basic and looks like take() is causing the
> issue
>
> Program: https://gist.github.com/kutt4n/87cfcd4e794b1865b6f880412dd80bbf
> Debug Log: https://gist.github.com/kutt4n/ba3cf812dced34ceadc588856edc
>
>
> TaskMemoryManager.java:381 says that it's normal to see leaked memory if
> one of the tasks failed.  In this case from the debug log - it is not quite
> apparent which task failed and the reason for failure.
>
> When the TSV file itself is small the issue doesn’t exist. In this
> particular case, the file is a 21MB clickstream data from wikipedia
> available at https://ndownloader.figshare.com/files/5036392
>
> Where could i read up more about managed memory leak. Any pointers on what
> might be the issue would be highly helpful
>
> thanks
> appu
>
>
>
>


-- 
---
Takeshi Yamamuro


unit testing in spark

2016-12-08 Thread pseudo oduesp
Can someone tell me how I can write unit tests for PySpark?
(book, tutorial, ...)


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Yes, sorry about that. I didn't think before responding to all those who
asked to unsubscribe.

On Thu, Dec 8, 2016 at 10:00 AM Di Zhu  wrote:

> Could you send to individual privately without cc to all users every time?
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
> This e-mail message, including any attachments, is for the sole use of the
> person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil.
>
> Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação.
>
>
>


Re: unsubscribe

2016-12-08 Thread Di Zhu
Could you send to individual privately without cc to all users every time?


> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas  
> wrote:
> 
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> This is explained here: http://spark.apache.org/community.html#mailing-lists 
> 
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva  > wrote:
>  
> 
> This e-mail message, including any attachments, is for the sole use of the 
> person to whom it has been sent and may contain information that is 
> confidential or legally protected. If you are not the intended recipient or 
> have received this message in error, you are not authorized to copy, 
> distribute, or otherwise use it or its attachments. Please notify the sender 
> immediately by return email and permanently delete this message and any 
> attachments. NeoGrid makes no warranty that this email is error or virus 
> free. NeoGrid Europe Limited is a company registered in the United Kingdom 
> with the registration number 7717968. The registered office is 8-10 Upper 
> Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid Netherlands 
> B.V. is a company registered in the Netherlands with the registration number 
> 3416.6499 and registered office at Science Park 400, 1098 XH Amsterdam, NL. 
> NeoGrid North America Limited is a company registered in the United States 
> with the registration number 52-2242825. The registered office is 55 West 
> Monroe Street, Suite 3590-60603, Chicago, IL, USA. NeoGrid Japan is located 
> at New Otani Garden Court 7F, 4-1 Kioi-cho, Chiyoda-ku, Tokyo 102-0094, 
> Japan. NeoGrid Software SA is a company registered in Brazil, with the 
> registration number CNPJ: 03.553.145/0001-08 and located at Av. Santos 
> Dumont, 935, 89.218-105, Joinville - SC – Brazil. 
> 
> Esta mensagem pode conter informação confidencial ou privilegiada, sendo seu 
> sigilo protegido por lei. Se você não for o destinatário ou a pessoa 
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as 
> informações nela contidas ou tomar qualquer ação baseada nessas informações. 
> Se você recebeu esta mensagem por engano, por favor, avise imediatamente ao 
> remetente, respondendo o e-mail e em seguida apague-a. Agradecemos sua 
> cooperação.



Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva 
wrote:

>
> This e-mail message, including any attachments, is for the sole use of the
> person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil.
>
> Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação.
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:46 AM Tao Lu  wrote:

>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 8:01 AM Niki Pavlopoulou  wrote:

> unsubscribe
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:50 AM Juan Caravaca 
wrote:

> unsubscribe
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:54 AM Kishorkumar Patil
 wrote:

>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:42 AM Chen, Yan I  wrote:

>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:17 AM Prashant Singh Thakur <
prashant.tha...@impetus.co.in> wrote:

>
>
>
>
> Best Regards,
>
> Prashant Thakur
>
> Work : 6046
>
> Mobile: +91-9740266522
>
>
>
> --
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:08 AM Kranthi Gmail 
wrote:

>
>
> --
> Kranthi
>
> PS: Sent from mobile, pls excuse the brevity and typos.
>
> On Dec 7, 2016, at 8:05 PM, Siddhartha Khaitan <
> siddhartha.khai...@gmail.com> wrote:
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 6:27 AM Vinicius Barreto <
vinicius.s.barr...@gmail.com> wrote:

> Unsubscribe
>
> Em 7 de dez de 2016 17:46, "map reduced"  escreveu:
>
> Hi,
>
> I am trying to solve this problem - in my streaming flow, every day few
> jobs fail due to some (say kafka cluster maintenance etc, mostly
> unavoidable) reasons for few batches and resumes back to success.
> I want to reprocess those failed jobs programmatically (assume I have a
> way of getting start-end offsets for kafka topics for failed jobs). I was
> thinking of these options:
> 1) Somehow pause streaming job when it detects failing jobs - this seems
> not possible.
> 2) From driver - run additional processing to check every few minutes
> using driver rest api (/api/v1/applications...) what jobs have failed and
> submit batch jobs for those failed jobs
>
> 1 - doesn't seem to be possible, and I don't want to kill streaming
> context just for few failing batches to stop the job for some time and
> resume after few minutes.
> 2 - seems like a viable option, but a little complicated, since even the
> batch job can fail due to whatever reasons and I am back to tracking that
> separately etc.
>
> Does anyone has faced this issue or have any suggestions?
>
> Thanks,
> KP
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:54 AM Roger Holenweger  wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 1:34 AM smith_666  wrote:

>
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:12 AM Ajit Jaokar 
wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


TO ALL WHO WANT TO UNSUBSCRIBE

2016-12-08 Thread 5g2w35+83j86k7gefujk
I swear, the next one trying to unsubscribe from this list or 
u...@spark.incubator.apache.org by sending "unsubscribe" to this list will be 
signed up for mailbait ... (you are welcome).

HERE ARE THE INFOS ON HOW TO UNSUBSCRIBE. READ THEM!

> --- Administrative commands for the user list ---
> 
> I can handle administrative requests automatically. Please
> do not send them to the list address! Instead, send
> your message to the correct command address:
> 
> To subscribe to the list, send a message to:
>
> 
> To remove your address from the list, send a message to:
>
> 
> Send mail to the following for info and FAQ for this list:
>
>
> 
> Similar addresses exist for the digest list:
>
>
> 
> To get messages 123 through 145 (a maximum of 100 per request), mail:
>
> 
> To get an index with subject and author for messages 123-456 , mail:
>
> 
> They are always returned as sets of 100, max 2000 per request,
> so you'll actually get 100-499.
> 
> To receive all messages with the same subject as message 12345,
> send a short message to:
>
> 
> The messages should contain one line or word of text to avoid being
> treated as sp@m, but I will ignore their content.
> Only the ADDRESS you send to is important.
> 
> You can start a subscription for an alternate address,
> for example "john@host.domain", just add a hyphen and your
> address (with '=' instead of '@') after the command word:
> 

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Wed, Dec 7, 2016 at 10:53 PM Ajith Jose  wrote:

>
>


Unsubscribe

2016-12-08 Thread Kishorkumar Patil



Unsubscribe

2016-12-08 Thread Tao Lu



Unsubscribe

2016-12-08 Thread Chen, Yan I


___
If you received this email in error, please advise the sender (by return email 
or otherwise) immediately. You have consented to receive the attached 
electronically at the above-noted email address; please retain a copy of this 
confirmation for future reference.  

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur 
immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté 
de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse 
courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation 
pour les fins de reference future.



Re: Unsubscribe

2016-12-08 Thread Jeff Sadowski
I think some people have a problem following instructions. Sigh.

On Wed, Dec 7, 2016 at 10:54 PM, Roger Holenweger 
wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Records processed metric for intermediate datasets

2016-12-08 Thread Aniket R More
Hi ,


I have created a Spark job using the Dataset API. There is a chain of 
operations performed until the final result, which is collected on HDFS.

But I also need to know how many records were read for each intermediate 
dataset. Let's say I apply 5 operations on a dataset (map, groupBy, etc.); 
I need to know how many records there were for each of the 5 intermediate 
datasets. Can anybody suggest how this can be obtained at the dataset level? 
I guess I can find this out at the task level (using listeners) but am not 
sure how to get it at the dataset level.
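
One low-tech option is to thread a LongAccumulator through each intermediate
Dataset with an identity map, at the cost of one extra narrow transformation
per step. A sketch (the helper and names are made up); note the values are
only meaningful after an action has run, and recomputed or speculative tasks
can inflate them:

```
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.util.LongAccumulator

// Wrap a Dataset so every record flowing through it bumps a named accumulator.
def countRecords[T : Encoder](spark: SparkSession, name: String,
                              ds: Dataset[T]): (Dataset[T], LongAccumulator) = {
  val acc = spark.sparkContext.longAccumulator(name)
  (ds.map { r => acc.add(1); r }, acc)
}

// Usage sketch:
//   val (step2, step2Count) = countRecords(spark, "afterFilter", step1.filter(...))
//   ... build the rest of the pipeline, run the final action, then read
//   step2Count.value (named accumulators are also visible in the web UI).
```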

Thanks

**Disclaimer**
 This e-mail message and any attachments may contain confidential information 
and is for the sole use of the intended recipient(s) only. Any views or 
opinions presented or implied are solely those of the author and do not 
necessarily represent the views of BitWise. If you are not the intended 
recipient(s), you are hereby notified that disclosure, printing, copying, 
forwarding, distribution, or the taking of any action whatsoever in reliance on 
the contents of this electronic information is strictly prohibited. If you have 
received this e-mail message in error, please immediately notify the sender and 
delete the electronic message and any attachments.BitWise does not accept 
liability for any virus introduced by this e-mail or any attachments. 



unsubscribe

2016-12-08 Thread Niki Pavlopoulou
unsubscribe


unsubscribe

2016-12-08 Thread Juan Caravaca
unsubscribe


unsubscribe

2016-12-08 Thread Ramon Rosa da Silva



RE: few basic questions on structured streaming

2016-12-08 Thread Mendelson, Assaf
For watermarking you can read this excellent two-part article: part 1: 
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101, part 2: 
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. It explains 
more than just watermarking, but it helped me understand a lot of the concepts 
behind Structured Streaming.
In any case, watermarking is not implemented yet; I believe it is targeted at 
Spark 2.1, which is supposed to come out soon.
Assaf.
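
For reference, the event-time windowed aggregation discussed below looks
roughly like this with the Spark 2.0 API (a sketch only; the socket source,
host/port and column names are placeholders, and no watermark is involved yet):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("event-time-windows").getOrCreate()
import spark.implicits._

// Placeholder source: lines like "2016-12-08 10:00:00,foo" from a socket.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line =>
    val Array(ts, word) = line.split(",", 2)
    (java.sql.Timestamp.valueOf(ts), word)
  }
  .toDF("eventTime", "word")

// 10-minute windows sliding every 5 minutes, keyed by event time, not arrival time.
val counts = events
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"), $"word")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // without watermarking, all windows are kept and re-emitted
  .format("console")
  .start()
```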

From: kant kodali [mailto:kanth...@gmail.com]
Sent: Thursday, December 08, 2016 1:50 PM
To: user @spark
Subject: few basic questions on structured streaming

Hi All,

I read the documentation on Structured Streaming based on event time and I have 
the following questions.

1. what happens if an event arrives few days late? Looks like we have an 
unbound table with sorted time intervals as keys but I assume spark doesn't 
keep several days worth of data in memory but rather it would checkpoint parts 
of the unbound table to a storage at a specified interval such that if an event 
comes few days late it would update the part of the table that is in memory 
plus the parts of the table that are in storage which contains the interval 
(Again this is just my assumption, I don't know what it really does). is this 
correct so far?

2.  Say I am running a Spark Structured streaming Job for 90 days with a window 
interval of 10 mins and a slide interval of 5 mins. Does the output of this Job 
always return the entire history in a table? other words the does the output on 
90th day contains a table of 10 minute time intervals from day 1 to day 90? If 
so, wouldn't that be too big to return as an output?

3. For Structured Streaming is it required to have a distributed storage such 
as HDFS? my guess would be yes (based on what I said in #1) but I would like to 
confirm.

4. I briefly heard about watermarking. Are there any pointers where I can know 
them more in detail? Specifically how watermarks could help in structured 
streaming and so on.

Thanks,
kant



few basic questions on structured streaming

2016-12-08 Thread kant kodali
Hi All,

I read the documentation on Structured Streaming based on event time and I
have the following questions.

1. What happens if an event arrives a few days late? It looks like we have an
unbounded table with sorted time intervals as keys, but I assume Spark doesn't
keep several days' worth of data in memory; rather, it would checkpoint parts
of the unbounded table to storage at a specified interval, so that if an event
comes a few days late it would update the part of the table that is in memory
plus the parts in storage that contain that interval. (Again, this is just my
assumption; I don't know what it really does.) Is this correct so far?

2. Say I am running a Spark Structured Streaming job for 90 days with a
window interval of 10 minutes and a slide interval of 5 minutes. Does the
output of this job always return the entire history in a table? In other
words, does the output on the 90th day contain a table of 10-minute intervals
from day 1 to day 90? If so, wouldn't that be too big to return as an output?

3. For Structured Streaming, is it required to have distributed storage such
as HDFS? My guess would be yes (based on what I said in #1), but I would like
to confirm.

4. I briefly heard about watermarking. Are there any pointers where I can
learn about it in more detail? Specifically, how watermarks could help in
Structured Streaming and so on.

Thanks,
kant


Unsubscribe

2016-12-08 Thread Vinicius Barreto
Unsubscribe

Em 7 de dez de 2016 17:46, "map reduced"  escreveu:

> Hi,
>
> I am trying to solve this problem - in my streaming flow, every day few
> jobs fail due to some (say kafka cluster maintenance etc, mostly
> unavoidable) reasons for few batches and resumes back to success.
> I want to reprocess those failed jobs programmatically (assume I have a
> way of getting start-end offsets for kafka topics for failed jobs). I was
> thinking of these options:
> 1) Somehow pause streaming job when it detects failing jobs - this seems
> not possible.
> 2) From driver - run additional processing to check every few minutes
> using driver rest api (/api/v1/applications...) what jobs have failed and
> submit batch jobs for those failed jobs
>
> 1 - doesn't seem to be possible, and I don't want to kill streaming
> context just for few failing batches to stop the job for some time and
> resume after few minutes.
> 2 - seems like a viable option, but a little complicated, since even the
> batch job can fail due to whatever reasons and I am back to tracking that
> separately etc.
>
> Does anyone has faced this issue or have any suggestions?
>
> Thanks,
> KP
>


Managed memory leak : spark-2.0.2

2016-12-08 Thread Appu K
Hello,

I've just run into an issue where the job is giving me a "Managed memory
leak" warning with Spark version 2.0.2.

—
2016-12-08 16:31:25,231 [Executor task launch worker-0]
(TaskMemoryManager.java:381) WARN leak 46.2 MB memory from
org.apache.spark.util.collection.ExternalAppendOnlyMap@22719fb8
2016-12-08 16:31:25,232 [Executor task launch worker-0] (Logging.scala:66)
WARN Managed memory leak detected; size = 48442112 bytes, TID = 1
—


The program itself is very basic, and it looks like take() is causing the issue.

Program: https://gist.github.com/kutt4n/87cfcd4e794b1865b6f880412dd80bbf
Debug Log: https://gist.github.com/kutt4n/ba3cf812dced34ceadc588856edc


TaskMemoryManager.java:381 says that it's normal to see leaked memory if
one of the tasks failed. In this case, from the debug log, it is not quite
apparent which task failed or why.

When the TSV file itself is small, the issue doesn't exist. In this
particular case, the file is 21 MB of Wikipedia clickstream data, available
at https://ndownloader.figshare.com/files/5036392

Where could I read up more about managed memory leaks? Any pointers on what
might be the issue would be highly helpful.

thanks
appu


Re: Not per-key state in spark streaming

2016-12-08 Thread Daniel Haviv
There's no need to extend Spark's API, look at mapWithState for examples.

On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao  wrote:

>
>
> On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao  wrote:
>
>> Hi
>> I'm new to Spark. I'm doing some research to see if spark streaming can
>> solve my problem. I don't want to keep per-key state,b/c my data set is
>> very huge and keep a little longer time, it not viable to keep all per key
>> state in memory.Instead, i want to have a bloom filter based state. Does it
>> possible to achieve this in Spark streaming.
>>
>> Is it possible to achieve this by extending Spark API?
>
>> --
>> Anty Rao
>>
>
>
>
> --
> Anty Rao
>


"Failed to find data source: libsvm" while running Spark application with jar

2016-12-08 Thread Md. Rezaul Karim
Hi there,

I am getting the following error while trying to read an input file in libsvm
format when running a Spark application jar:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to
find data source: libsvm.
at
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)

The rest of the error log contains similar messages.

The Java application works fine in Eclipse. However, when packaging and
running the corresponding jar file, I get the above error, which is
really weird!

I believe it's all about the format of the input file. Any kind of help is
appreciated.
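
Two things worth checking: the "libsvm" source is registered by spark-mllib,
so spark-mllib must be on the runtime classpath, and a shaded/fat jar needs
its META-INF/services entries merged, since that is how Spark discovers data
sources. For comparison, a minimal read sketch (Spark 2.0; the path points at
the sample file shipped with the Spark distribution):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("libsvm-read").getOrCreate()

// Works only if spark-mllib is available at runtime.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
data.printSchema()   // expected: label: double, features: vector
```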


Regards,
_
*Md. Rezaul Karim* BSc, MSc
Ph.D. Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: Publishing of the Spectral LDA model on Spark Packages

2016-12-08 Thread François Garillot
This is very cool! Thanks a lot for making this more accessible!

Best,
-- 
FG

On Wed, Dec 7, 2016 at 11:46 PM Jencir Lee  wrote:

> Hello,
>
> We just published the Spectral LDA model on Spark Packages. It’s an
> alternative approach to the LDA modelling based on tensor decompositions.
> We first build the 2nd, 3rd-moment tensors from empirical word counts, then
> orthogonalise them and perform decomposition on the 3rd-moment tensor. The
> convergence is guaranteed by theory, in contrast to most current
> approaches. We achieve comparable log-perplexity in much shorter running
> time.
>
> You could find the package at
>
> https://spark-packages.org/package/FurongHuang/SpectralLDA-TensorSpark
>
>
> We’d welcome any thoughts or feedback on it.
>
> Thanks very much,
>
> Furong Huang
> Jencir Lee
> Anima Anandkumar
>


RE: How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Mendelson, Assaf
Groupby is not an actual result but a construct to allow defining aggregations.

So you can do:


import org.apache.spark.sql.{functions => func}

 val resDF = df.groupBy("client_id").agg(func.collect_set(df("Date")))


Note that collect_set can be a little heavy in terms of performance so if you 
just want to count, you should probably use approxCountDistinct
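
For that count-only variant, a quick sketch reusing the same df and func
import as above (the result column name is made up):

```
val countsDF = df.groupBy("client_id")
  .agg(func.approxCountDistinct(df("Date")).as("distinct_dates"))
```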
Assaf.

From: Devi P.V [mailto:devip2...@gmail.com]
Sent: Thursday, December 08, 2016 10:38 AM
To: user @spark
Subject: How to find unique values after groupBy() in spark dataframe ?

Hi all,

I have a dataframe like following,
+-+--+
|client_id|Date  |
+ +--+
| a   |2016-11-23|
| b   |2016-11-18|
| a   |2016-11-23|
| a   |2016-11-23|
| a   |2016-11-24|
+-+--+
I want to find unique dates of each client_id using spark dataframe.
expected output

a  (2016-11-23, 2016-11-24)
b   2016-11-18
I tried with df.groupBy("client_id").But I don't know how to find distinct 
values after groupBy().
How to do this?
Is any other efficient methods are available for doing this ?
I am using scala 2.11.8 & spark 2.0

Thanks


How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Devi P.V
Hi all,

I have a DataFrame like the following:

+---------+----------+
|client_id|Date      |
+---------+----------+
| a       |2016-11-23|
| b       |2016-11-18|
| a       |2016-11-23|
| a       |2016-11-23|
| a       |2016-11-24|
+---------+----------+

I want to find the unique dates for each client_id using a Spark DataFrame.

Expected output:

a  (2016-11-23, 2016-11-24)
b   2016-11-18

I tried df.groupBy("client_id"), but I don't know how to find the distinct
values after groupBy().
How can I do this?
Are there other, more efficient methods available for doing this?
I am using Scala 2.11.8 and Spark 2.0.


Thanks


RE: filter RDD by variable

2016-12-08 Thread Mendelson, Assaf
Can you provide the sample code you are using?
In general, RDD.filter takes a function as its argument. The function's input 
is a single record of the RDD and its output is a Boolean indicating whether 
or not to include that record in the result. So you can create any function 
you want; for example, see the sketch below.
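
A minimal sketch for the case in the question (Scala; the input path and the
words array are placeholders):

```
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("filter-by-variable"))

val lines = sc.textFile("hdfs:///path/to/input.txt")
val words = Array("Error", "Warning", "Info")
val i = 1

// The predicate closes over the variable exactly as it would over a literal.
val filtered = lines.filter(line => line.contains(words(i)))
filtered.take(5).foreach(println)
```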
Assaf.

From: Soheila S. [mailto:soheila...@gmail.com]
Sent: Wednesday, December 07, 2016 6:23 PM
To: user@spark.apache.org
Subject: filter RDD by variable

Hi
I am new to Spark and have a question from my first steps of learning it.

How can I filter an RDD using a String variable (for example words[i])
instead of a fixed one like "Error"?

Thanks a lot in advance.
Soheila