Re: Spark dataset cache vs tempview

2016-11-06 Thread Mich Talebzadeh
With regard to the use of a tempTable:

createOrReplaceTempView is backed by an in-memory hash table that maps a
table name (a string) to a logical query plan. Fragments of that logical
query plan may or may not be cached. However, registering the view alone will
not result in any materialization of results.

If your dataset is very large, then one option is to create a tempView out
of that DF and use that in your processing. My assumption here is that your
data will stay the same, in other words that the tempView will always be
valid. You can of course drop that tempView afterwards:

scala> df.toDF.createOrReplaceTempView("tmp")

scala> spark.sql("drop view if exists tmp")

Check UI (port 4040) storage page to see what is cached etc.
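
Nothing will show up there for the view alone; if you want the data
materialized you have to cache it explicitly. A minimal sketch, assuming
Spark 2.x and reusing the same view name:

scala> spark.catalog.cacheTable("tmp")     // or: spark.sql("CACHE TABLE tmp")

scala> spark.catalog.uncacheTable("tmp")   // release the cached data when done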

Just try both options to see which one performs better; Option 2 may be the
more efficient of the two.

HTH



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 November 2016 at 03:44, Rohit Verma  wrote:

> I have a parquet file which I read at least 4-5 times within my
> application. I was wondering what is the most efficient thing to do.
>
> Option 1. While writing the parquet file, immediately read it back into a
> dataset and call cache. I am assuming that by doing an immediate read I might
> use some existing hdfs/spark cache left over from the write process.
>
> Option 2. In my application, when I need the dataset for the first time, call
> cache at that point.
>
> Option 3. While writing the parquet file, after completion create a temp view
> out of it. In all subsequent usage, use the view.
>
> I am also not very clear about the efficiency of reading from a tempview vs a
> parquet dataset.
>
> FYI, for the datasets I am referring to, it is not possible to fit all of the
> data in memory. They are very large.
>
> Regards..
> Rohit
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


mapWithState and DataFrames

2016-11-06 Thread Daniel Haviv
Hi,
How can I utilize mapWithState and DataFrames?
Right now I stream json messages from Kafka, update their state, output the
updated state as json and compose a dataframe from it.
It seems inefficient both in terms of processing and storage (a long string
instead of a compact DF).

Is there a way to manage state for DataFrame?

Thank you,
Daniel


Re: mapWithState and DataFrames

2016-11-06 Thread Victor Shafran
Hi Daniel,

If you use the state in the same app, use the foreachRDD method of the
stateSnapshot DStream to either persist the RDD (rdd.persist) or
convert it to a DataFrame and call the createOrReplaceTempView method.
Code from
https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html

val wordCountStateStream = wordStream.mapWithState(stateSpec)
wordCountStateStream.print()

// A snapshot of the state for the current batch. This dstream contains
// one entry per key.
val stateSnapshotStream = wordCountStateStream.stateSnapshots()
stateSnapshotStream.foreachRDD { rdd =>
  rdd.toDF("word", "count").registerTempTable("batch_word_count")
}
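
Each batch re-registers the table, so a SQL query against it sees the latest
snapshot, e.g. (a usage sketch, assuming the 1.6-era sqlContext from that
example):

sqlContext.sql("SELECT word, `count` FROM batch_word_count ORDER BY `count` DESC").show()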

Hope it helps
Victor



On Sun, Nov 6, 2016 at 12:53 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> How can I utilize mapWithState and DataFrames?
> Right now I stream json messages from Kafka, update their state, output
> the updated state as json and compose a dataframe from it.
> It seems inefficient both in terms of processing and storage (a long
> string instead of a compact DF).
>
> Is there a way to manage state for DataFrame?
>
> Thank you,
> Daniel
>



-- 

Victor Shafran

VP R&D| Equalum

Mobile: +972-523854883 | Email: victor.shaf...@equalum.io


Re: mLIb solving linear regression with sparse inputs

2016-11-06 Thread Robineast
Here’s a way of creating sparse vectors in MLLib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd = sc.textFile("A.txt").map(line => line.split(",")).
  map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))

val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))

val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) =>
  (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) =>
  (a._1 ++ b._1, a._2 ++ b._2)

val A = pairRdd.combineByKey(create, combine, merge).map(el =>
  Vectors.sparse(3, el._2._1, el._2._2))

If you have a separate file of b’s then you would need to manipulate this 
slightly to join the b’s to the A RDD and then create LabeledPoints. I guess 
there is a way of doing this using the newer ML interfaces but it’s not 
particularly obvious to me how.

One point: In the example you give the b’s are exactly the same as col 2 in the 
A matrix. I presume this is just a quick hacked together example because that 
would give a trivial result.
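
As a rough sketch of the join I mean (untested; it keeps the row key on A, and
assumes the b triples live in a "b.txt" with the same row,column,value layout):

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val aByRow = pairRdd.combineByKey(create, combine, merge)
  .mapValues { case (cols, vals) => Vectors.sparse(3, cols, vals) }  // (row, features)

val bByRow = sc.textFile("b.txt").map(_.split(","))
  .map(ary => (ary(0).toInt, ary(2).toDouble))                       // (row, label)

val points = aByRow.join(bByRow).map { case (_, (features, label)) =>
  LabeledPoint(label, features)
}

val model = LinearRegressionWithSGD.train(points, 100)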

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] 
>  wrote:
> 
> I would like to use it. But how do I do the following 
> 1) Read sparse data (from text or database) 
> 2) pass the sparse data to the linearRegression class? 
> 
> For example: 
> 
> Sparse matrix A 
> row, column, value 
> 0,0,.42 
> 0,1,.28 
> 0,2,.89 
> 1,0,.83 
> 1,1,.34 
> 1,2,.42 
> 2,0,.23 
> 3,0,.42 
> 3,1,.98 
> 3,2,.88 
> 4,0,.23 
> 4,1,.36 
> 4,2,.97 
> 
> Sparse vector b 
> row, column, value 
> 0,2,.89 
> 1,2,.42 
> 3,2,.88 
> 4,2,.97 
> 
> Solve Ax = b??? 
> 
> 
> 




-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28027.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi,

Have you maybe tried the quote related options specified in the
documentation?

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
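
In particular, since the embedded quotes in your data are doubled ("") rather
than backslash-escaped, setting the escape character to a double quote may be
worth a try. A sketch, untested against your file (shown in Scala, but the
same option names exist in the Python API):

val sdf = spark.read
  .option("header", "true")
  .option("quote", "\"")    // the default quote character
  .option("escape", "\"")   // treat "" inside a quoted field as an escaped quote
  .csv("malformed_data.csv")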

Thanks.

2016-11-06 6:58 GMT+09:00 Femi Anthony :

> Hi, I am trying to process a very large comma delimited csv file and I am
> running into problems.
> The main problem is that some fields contain quoted strings with embedded
> commas.
> It seems as if PySpark is unable to properly parse lines containing such
> fields like say Pandas does.
>
> Here is the code I am using to read the file in Pyspark
>
> df_raw=spark.read.option("header","true").csv(csv_path)
>
> Here is an example of a good and 'bad' line in such a file:
>
>
> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,
> col12,col13,col14,col15,col16,col17,col18,col19
> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,
> cyclingstats,2012-25-19,432,2023-05-17,CODERED
> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,
> OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,
> 2019-11-23,CODEBLUE
>
> Line 0 is the header
> Line 1 is the 'problematic' line
> Line 2 is a good line.
>
> Pandas can handle this easily:
>
>
> [1]: import pandas as pd
>
> In [2]: pdf = pd.read_csv('malformed_data.csv')
>
> In [4]: pdf[['col12','col13','col14']]
> Out[4]:
> col12
> col13  \
> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#
> YOUGOTSOUL~BRINGDANOISE
> 1 NaN OUTKAST#THROOTS~WUTANG#RUNDMC
>
>col14
> 0   23.0
> 10.0
>
>
> while Pyspark seems to parse this erroneously:
>
> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('
> malformed_data.csv',header=True)
>
> [6]: sdf.select("col12","col13",'col14').show()
> +--+++
> | col12|   col13|   col14|
> +--+++
> |"32 XIY ""W""   JK|  RE LK"|SOMETHINGLIKEAPHE...|
> |  null|OUTKAST#THROOTS~W...| 0.0|
> +--+++
>
>  Is this a bug or am I doing something wrong ?
>  I am working with Spark 2.0
>  Any help is appreciated
>
> Thanks,
> -- Femi
>
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>


Fwd: A Spark long running program as web server

2016-11-06 Thread Reza zade
Hi

I have written multiple spark driver programs that load some data from HDFS
to data frames and accomplish spark sql queries on it and persist the
results in HDFS again. Now I need to provide a long running java program in
order to receive requests and their some parameters(such as the number of
top rows should be returned) from a web application (e.g. a dashboard) via
post and get and send back the results to web application. My web
application is somewhere out of the Spark cluster. Briefly my goal is to
send requests and their accompanying data from web application via
something such as POST to long running java program. then it receives the
request and runs the corresponding spark driver (spark app) and returns the
results for example in JSON format.


Whats is your suggestion to develop this use case?
Is Livy a good choise? If your answer is positive what should I do?

Thanks.


Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
Hello

The dynamic allocation feature allows you to add executors and scale
computation power. This is great; however, I feel we also need a way
to dynamically scale storage. Currently, if the disks are not able to hold
the spilled/shuffle data, the job is aborted, causing frustration and loss
of time. In deployments like AWS EMR, it is possible to run an agent that
adds disks on the fly if it sees that the disks are running out of space, and
it would be great if Spark could immediately start using the added disks,
just as it does when new executors are added.

Thanks,
Aniket


Re: Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
If people agree that is desired, I am willing to submit a SIP for this and
find time to work on it.

Thanks,
Aniket

On Sun, Nov 6, 2016 at 1:06 PM Aniket Bhatnagar 
wrote:

> Hello
>
> Dynamic allocation feature allows you to add executors and scale
> computation power. This is great, however, I feel like we also need a way
> to dynamically scale storage. Currently, if the disk is not able to hold
> the spilled/shuffle data, the job is aborted causing frustration and loss
> of time. In deployments like AWS EMR, it is possible to run an agent that
> add disks on the fly if it sees that the disks are running out of space and
> it would be great if Spark could immediately start using the added disks
> just as it does when new executors are added.
>
> Thanks,
> Aniket
>


Re: A Spark long running program as web server

2016-11-06 Thread vincent gromakowski
Hi,
Spark jobserver seems to be more mature than Livy, but both would work I
think. You will just get more functionality with the jobserver, except for
impersonation, which is only in Livy.
If you need to publish a business API, I would recommend using Akka HTTP with
Spark actors sharing a preloaded Spark context, so you can publish a more
user-friendly API. Jobserver has no way to specify endpoint URLs and API
verbs; its endpoints look more like a series of random numbers.
The other way to publish a business API is to build a classic API application
that requests jobserver or Livy jobs over HTTP, but I think chaining two HTTP
requests adds too much latency.
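
As a rough sketch of what I mean by sharing a preloaded context behind Akka
HTTP (my own illustration, not production code; the path, port, data location
and view name are placeholders):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.sql.SparkSession

object QueryServer extends App {
  implicit val system = ActorSystem("query-server")
  implicit val materializer = ActorMaterializer()
  implicit val ec = system.dispatcher

  // One SparkSession built at startup and shared by every request.
  val spark = SparkSession.builder().appName("long-running-sql").getOrCreate()
  spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

  val route =
    path("query" / "top") {
      get {
        parameter("n".as[Int]) { n =>
          // Run the SQL on the shared session and return the rows as a JSON array.
          val rows = spark.sql(s"SELECT * FROM events LIMIT $n").toJSON.collect()
          complete(rows.mkString("[", ",", "]"))
        }
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}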

2016-11-06 14:06 GMT+01:00 Reza zade :

> Hi
>
> I have written multiple spark driver programs that load some data from
> HDFS to data frames and accomplish spark sql queries on it and persist the
> results in HDFS again. Now I need to provide a long running java program in
> order to receive requests and their some parameters(such as the number of
> top rows should be returned) from a web application (e.g. a dashboard) via
> post and get and send back the results to web application. My web
> application is somewhere out of the Spark cluster. Briefly my goal is to
> send requests and their accompanying data from web application via
> something such as POST to long running java program. then it receives the
> request and runs the corresponding spark driver (spark app) and returns the
> results for example in JSON format.
>
>
> Whats is your suggestion to develop this use case?
> Is Livy a good choise? If your answer is positive what should I do?
>
> Thanks.
>
>


Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
I'm doing some processing and then clustering of a small dataset (~150 MB). 
Everything seems to work fine, until the end; the last few lines of my program 
are log statements, but after printing those, nothing seems to happen for a 
long time...many minutes; I'm not usually patient enough to let it go, but I 
think one time when I did just wait, it took over an hour (and did eventually 
exit on its own). Any ideas on what's happening, or how to troubleshoot?
(This happens both when running locally, using the localhost mode, as well as 
on a small cluster with four 4-processor nodes each with 15GB of RAM; in both 
cases the executors have 2GB+ of RAM, and none of the inputs/outputs on any of 
the stages is more than 75 MB...)
Thanks,Michael

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
In order to know what's going on, you can study the thread dumps either
from spark UI or from any other thread dump analysis tool.

Thanks,
Aniket

On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson
 wrote:

> I'm doing some processing and then clustering of a small dataset (~150
> MB). Everything seems to work fine, until the end; the last few lines of my
> program are log statements, but after printing those, nothing seems to
> happen for a long time...many minutes; I'm not usually patient enough to
> let it go, but I think one time when I did just wait, it took over an hour
> (and did eventually exit on its own). Any ideas on what's happening, or how
> to troubleshoot?
>
> (This happens both when running locally, using the localhost mode, as well
> as on a small cluster with four 4-processor nodes each with 15GB of RAM; in
> both cases the executors have 2GB+ of RAM, and none of the inputs/outputs
> on any of the stages is more than 75 MB...)
>
> Thanks,
> Michael
>


Re: Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
Thanks; I tried looking at the thread dumps for the driver and the one executor 
that had that option in the UI, but I'm afraid I don't know how to interpret 
what I saw...  I don't think it could be my code directly, since at this point 
my code has all completed? Could GC be taking that long? 
(I could also try grabbing the thread dumps and pasting them here, if that 
would help?)

On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar 
 wrote:
 

 In order to know what's going on, you can study the thread dumps either from 
spark UI or from any other thread dump analysis tool.
Thanks,Aniket
On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson 
 wrote:

I'm doing some processing and then clustering of a small dataset (~150 MB). 
Everything seems to work fine, until the end; the last few lines of my program 
are log statements, but after printing those, nothing seems to happen for a 
long time...many minutes; I'm not usually patient enough to let it go, but I 
think one time when I did just wait, it took over an hour (and did eventually 
exit on its own). Any ideas on what's happening, or how to troubleshoot?
(This happens both when running locally, using the localhost mode, as well as 
on a small cluster with four 4-processor nodes each with 15GB of RAM; in both 
cases the executors have 2GB+ of RAM, and none of the inputs/outputs on any of 
the stages is more than 75 MB...)
Thanks,Michael


   

Re: mLIb solving linear regression with sparse inputs

2016-11-06 Thread im281


Thank you! Would you happen to have this code in Java?

This is extremely helpful!

Iman






On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User List]" wrote:

> Here’s a way of creating sparse vectors in MLLib:
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.rdd.RDD
>
> val rdd = sc.textFile("A.txt").map(line => line.split(",")).
>   map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))
>
> val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))
>
> val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
> val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) =>
>   (head._1 :+ tail._2, head._2 :+ tail._3)
> val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) =>
>   (a._1 ++ b._1, a._2 ++ b._2)
>
> val A = pairRdd.combineByKey(create, combine, merge).map(el =>
>   Vectors.sparse(3, el._2._1, el._2._2))
>
> If you have a separate file of b’s then you would need to manipulate this
> slightly to join the b’s to the A RDD and then create LabeledPoints. I guess
> there is a way of doing this using the newer ML interfaces but it’s not
> particularly obvious to me how.
>
> One point: In the example you give the b’s are exactly the same as col 2 in the
> A matrix. I presume this is just a quick hacked together example because that
> would give a trivial result.
>
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28028.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
I doubt it's GC as you mentioned that the pause is several minutes. Since
it's reproducible in local mode, can you run the spark application locally
and once your job is complete (and application appears paused), can you
take 5 thread dumps (using jstack or jcmd on the local spark JVM process)
with 1 second delay between each dump and attach them? I can take a look.

Thanks,
Aniket

On Sun, Nov 6, 2016 at 2:21 PM Michael Johnson 
wrote:

> Thanks; I tried looking at the thread dumps for the driver and the one
> executor that had that option in the UI, but I'm afraid I don't know how to
> interpret what I saw...  I don't think it could be my code directly, since
> at this point my code has all completed? Could GC be taking that long?
>
> (I could also try grabbing the thread dumps and pasting them here, if that
> would help?)
>
> On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>
> In order to know what's going on, you can study the thread dumps either
> from spark UI or from any other thread dump analysis tool.
>
> Thanks,
> Aniket
>
> On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson
>  wrote:
>
> I'm doing some processing and then clustering of a small dataset (~150
> MB). Everything seems to work fine, until the end; the last few lines of my
> program are log statements, but after printing those, nothing seems to
> happen for a long time...many minutes; I'm not usually patient enough to
> let it go, but I think one time when I did just wait, it took over an hour
> (and did eventually exit on its own). Any ideas on what's happening, or how
> to troubleshoot?
>
> (This happens both when running locally, using the localhost mode, as well
> as on a small cluster with four 4-processor nodes each with 15GB of RAM; in
> both cases the executors have 2GB+ of RAM, and none of the inputs/outputs
> on any of the stages is more than 75 MB...)
>
> Thanks,
> Michael
>
>
>
>


Re: Optimized way to use spark as db to hdfs etl

2016-11-06 Thread Sabarish Sasidharan
Please be aware that accumulators involve communication back with the driver
and may not be efficient. I think the OP wants some way to extract the stats
from the SQL plan, if they are being stored in some internal data structure.

Regards
Sab
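
For reference, the accumulator approach being discussed would look roughly
like the sketch below (names are placeholders). Note that the value is only
meaningful after the write action has run, and that re-executed tasks can
make the counter over-count:

val rowCount = spark.sparkContext.longAccumulator("rowsWritten")

val df = spark.read.jdbc(jdbcUrl, query, properties)
val counted = df.rdd.map { row => rowCount.add(1); row }

spark.createDataFrame(counted, df.schema)
  .write.format("parquet").save("pdfs-path")

val total = rowCount.value   // read on the driver after the write completes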

On 5 Nov 2016 9:42 p.m., "Deepak Sharma"  wrote:

> Hi Rohit
> You can use accumulators and increase it on every record processing.
> At last you can get the value of accumulator on driver , which will give
> you the count.
>
> HTH
> Deepak
>
> On Nov 5, 2016 20:09, "Rohit Verma"  wrote:
>
>> I am using spark to read from database and write in hdfs as parquet file.
>> Here is code snippet.
>>
>> private long etlFunction(SparkSession spark){
>> spark.sqlContext().setConf("spark.sql.parquet.compression.codec",
>> “SNAPPY");
>> Properties properties = new Properties();
>> properties.put("driver”,”oracle.jdbc.driver");
>> properties.put("fetchSize”,”5000");
>> Dataset dataset = spark.read().jdbc(jdbcUrl, query, properties);
>> dataset.write.format(“parquet”).save(“pdfs-path”);
>> return dataset.count();
>> }
>>
>> When I look at spark ui, during write I have stats of records written,
>> visible in sql tab under query plan.
>>
>> While the count itself is a heavy task.
>>
>> Can someone suggest best way to get count in most optimized way.
>>
>> Thanks all..
>>
>


Re: mLIb solving linear regression with sparse inputs

2016-11-06 Thread im281
Hi Robin,
It looks like the linear regression model takes in a dataset not a matrix?
It would be helpful for this example if you could set up the whole problem
end to end using one of the columns of the matrix as b. So A is a sparse
matrix and b is a sparse vector
Best regards.
Iman

On Sun, Nov 6, 2016 at 6:43 AM  wrote:

> Thank you! Would happen to have this code in Java?.
> This is extremely helpful!
>
>
> Iman
>
>
>
>
> On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User
> List]"  wrote:
>
> Here’s a way of creating sparse vectors in MLLib:
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.rdd.RDD
>
> val rdd = sc.textFile("A.txt").map(line => line.split(",")).
>  map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))
>
> val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))
>
> val create = (first: (Int, Int, Double)) => (Array(first._2),
> Array(first._3))
> val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int,
> Double)) => (head._1 :+ tail._2, head._2 :+ tail._3)
> val merge = (a: (Array[Int], Array[Double]), b: (Array[Int],
> Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)
>
> val A = pairRdd.combineByKey(create,combine,merge).map(el =>
> Vectors.sparse(3,el._2._1,el._2._2))
>
> If you have a separate file of b’s then you would need to manipulate this
> slightly to join the b’s to the A RDD and then create LabeledPoints. I
> guess there is a way of doing this using the newer ML interfaces but it’s
> not particularly obvious to me how.
>
> One point: In the example you give the b’s are exactly the same as col 2
> in the A matrix. I presume this is just a quick hacked together example
> because that would give a trivial result.
>
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] <[hidden
> email] > wrote:
>
> I would like to use it. But how do I do the following
> 1) Read sparse data (from text or database)
> 2) pass the sparse data to the linearRegression class?
>
> For example:
>
> Sparse matrix A
> row, column, value
> 0,0,.42
> 0,1,.28
> 0,2,.89
> 1,0,.83
> 1,1,.34
> 1,2,.42
> 2,0,.23
> 3,0,.42
> 3,1,.98
> 3,2,.88
> 4,0,.23
> 4,1,.36
> 4,2,.97
>
> Sparse vector b
> row, column, value
> 0,2,.89
> 1,2,.42
> 3,2,.88
> 4,2,.97
>
> Solve Ax = b???
>
>
>
>
>
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28029.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: mLIb solving linear regression with sparse inputs

2016-11-06 Thread im281
Also in Java as well. Thanks again!
Iman

On Sun, Nov 6, 2016 at 8:28 AM Iman Mohtashemi 
wrote:

Hi Robin,
It looks like the linear regression model takes in a dataset not a matrix?
It would be helpful for this example if you could set up the whole problem
end to end using one of the columns of the matrix as b. So A is a sparse
matrix and b is a sparse vector
Best regards.
Iman

On Sun, Nov 6, 2016 at 6:43 AM  wrote:

Thank you! Would happen to have this code in Java?.
This is extremely helpful!


Iman




On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User
List]"  wrote:

Here’s a way of creating sparse vectors in MLLib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd = sc.textFile("A.txt").map(line => line.split(",")).
 map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))

val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))

val create = (first: (Int, Int, Double)) => (Array(first._2),
Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double))
=> (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int],
Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)

val A = pairRdd.combineByKey(create,combine,merge).map(el =>
Vectors.sparse(3,el._2._1,el._2._2))

If you have a separate file of b’s then you would need to manipulate this
slightly to join the b’s to the A RDD and then create LabeledPoints. I
guess there is a way of doing this using the newer ML interfaces but it’s
not particularly obvious to me how.

One point: In the example you give the b’s are exactly the same as col 2 in
the A matrix. I presume this is just a quick hacked together example
because that would give a trivial result.

---
Robin East
*Spark GraphX in Action* Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] <[hidden email]
> wrote:

I would like to use it. But how do I do the following
1) Read sparse data (from text or database)
2) pass the sparse data to the linearRegression class?

For example:

Sparse matrix A
row, column, value
0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97

Sparse vector b
row, column, value
0,2,.89
1,2,.42
3,2,.88
4,2,.97

Solve Ax = b???






Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Very long pause/hang at end of execution

2016-11-06 Thread Gourav Sengupta
Hi,

In case your process finishes after a lag, please check whether you are
writing by converting to Pandas or using coalesce (in which case the entire
traffic is directed to a single node), or writing to S3, in which case there
can be lags.

Regards,
Gourav

On Sun, Nov 6, 2016 at 1:28 PM, Michael Johnson <
mjjohnson@yahoo.com.invalid> wrote:

> I'm doing some processing and then clustering of a small dataset (~150
> MB). Everything seems to work fine, until the end; the last few lines of my
> program are log statements, but after printing those, nothing seems to
> happen for a long time...many minutes; I'm not usually patient enough to
> let it go, but I think one time when I did just wait, it took over an hour
> (and did eventually exit on its own). Any ideas on what's happening, or how
> to troubleshoot?
>
> (This happens both when running locally, using the localhost mode, as well
> as on a small cluster with four 4-processor nodes each with 15GB of RAM; in
> both cases the executors have 2GB+ of RAM, and none of the inputs/outputs
> on any of the stages is more than 75 MB...)
>
> Thanks,
> Michael
>


Re: distribute partitions evenly to my cluster

2016-11-06 Thread heather79
Thanks for your reply, Vipin!

I am using the spark-perf benchmark. The command to create the RDD is:
val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, m, n, numPartitions,
seed)

After I set numPartitions, for example to 40 partitions, I think the Spark
core code will allocate those partitions to executors. I do not know how and
where Spark does that. My impression is that Spark will do it like this: if
executor 1 has 10 cores, it will allocate 10 partitions to executor 1, then
allocate 10 to executor 2. The Spark core code will not try to distribute the
40 partitions to the 8 nodes evenly, right?

I do not understand what you mentioned about hashing a key from my dataset.
Could you explain more?
Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distribute-partitions-evenly-to-my-cluster-tp27998p28031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: A Spark long running program as web server

2016-11-06 Thread Oddo Da
The spark jobserver will do what you describe for you. I have built an app
where we have a bunch of queries being submitted via
http://something/query/
via POST (all parameters for the query are in JSON POST request). This is a
scalatra layer that talks to spark jobserver via HTTP.

On Sun, Nov 6, 2016 at 8:06 AM, Reza zade  wrote:

> Hi
>
> I have written multiple spark driver programs that load some data from
> HDFS to data frames and accomplish spark sql queries on it and persist the
> results in HDFS again. Now I need to provide a long running java program in
> order to receive requests and their some parameters(such as the number of
> top rows should be returned) from a web application (e.g. a dashboard) via
> post and get and send back the results to web application. My web
> application is somewhere out of the Spark cluster. Briefly my goal is to
> send requests and their accompanying data from web application via
> something such as POST to long running java program. then it receives the
> request and runs the corresponding spark driver (spark app) and returns the
> results for example in JSON format.
>
>
> Whats is your suggestion to develop this use case?
> Is Livy a good choise? If your answer is positive what should I do?
>
> Thanks.
>
>


Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread raghav
I am a newbie in the world of big data analytics. I want to teach myself
Apache Spark and be able to write scripts to tinker with data.

I have some understanding of MapReduce but have not had a chance to get my
hands dirty. There are tons of resources for Spark, but I am looking for
some guidance on starter material, or videos.

Thanks.

Raghav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Structured Streaming with Kafka source, does it work??

2016-11-06 Thread shyla deshpande
I am trying to do Structured Streaming with Kafka Source. Please let me
know where I can find some sample code for this. Thanks


Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread shyla deshpande
Hi Jaya!

Thanks for the reply. Structured streaming works fine for me with socket
text stream . I think structured streaming with kafka source not yet
supported.

Please if anyone has got it working with kafka source, please provide me
some sample code or direction.

Thanks


On Sun, Nov 6, 2016 at 5:17 PM, Jayaradha Natarajan 
wrote:

> Shyla!
>
> Check
> https://databricks.com/blog/2016/07/28/structured-
> streaming-in-apache-spark.html
>
> Thanks,
> Jayaradha
>
> On Sun, Nov 6, 2016 at 5:13 PM, shyla  wrote:
>
>> I am trying to do Structured Streaming with Kafka Source. Please let me
>> know
>> where I can find some sample code for this. Thanks
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/Structured-Streaming-with-
>> Kafka-Source-does-it-work-tp19748.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


RE: expected behavior of Kafka dynamic topic subscription

2016-11-06 Thread Haopu Wang
Cody, thanks for the response. Do you think it's a Spark issue or Kafka issue? 
Can you please let me know the jira ticket number?

-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org] 
Sent: 4 November 2016 22:35
To: Haopu Wang
Cc: user@spark.apache.org
Subject: Re: expected behavior of Kafka dynamic topic subscription

That's not what I would expect from the underlying kafka consumer, no.

But this particular case (no matching topics, then add a topic after
SubscribePattern stream starts) actually isn't part of unit tests for
either the DStream or the structured stream.

I'll make a jira ticket.

On Thu, Nov 3, 2016 at 9:43 PM, Haopu Wang  wrote:
> I'm using Kafka010 integration API to create a DStream using
> SubscriberPattern ConsumerStrategy.
>
> The specified topic doesn't exist when I start the application.
>
> Then I create the topic and publish some test messages. I can see them in
> the console subscriber.
>
> But the spark application doesn't seem to get the messages.
>
> I think this is not expected, right? What should I check to resolve it?
>
> Thank you very much!

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread ayan guha
I would start with Spark documentation, really. Then you would probably
start with some older videos from youtube, especially spark summit
2014, 2015 and 2016 videos. Regarding practice, I would strongly suggest
Databricks cloud (or download a prebuilt Spark from the Spark site). You can
also take courses from edX/Berkeley, which are very good starter courses.

On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:

> I am newbie in the world of big data analytics, and I want to teach myself
> Apache Spark, and want to be able to write scripts to tinker with data.
>
> I have some understanding of Map Reduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance for starter material, or videos.
>
> Thanks.
>
> Raghav
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Newbie-question-Best-way-to-
> bootstrap-with-Spark-tp28032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Re: Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread warmb...@qq.com
EDX/Berkley +1



___
HuangPengCheng (黄鹏程)
China Minsheng Bank, Head Office IT Development Department, DBA Group & Application Operations Center No. 4
"Standard operations, proactive maintenance, timely handling"
"Gentle, kind, courteous, restrained, magnanimous"
Address: China Minsheng Bank headquarters campus, Shun'an South Road, Shunyi District, Beijing
Postcode: 101300
Tel: 010-56361701
Mobile: 13488788499
Email: huangpengch...@cmbc.com.cn, gnu...@gmail.com
 
From: ayan guha
Date: 2016-11-07 10:08
To: raghav
CC: user
Subject: Re: Newbie question - Best way to bootstrap with Spark
I would start with Spark documentation, really. Then you would probably start 
with some older videos from youtube, especially spark summit 2014,2015 and 2016 
videos. Regading practice, I would strongly suggest Databricks cloud (or 
download prebuilt from spark site). You can also take courses from EDX/Berkley, 
which are very good starter courses. 

On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
I am newbie in the world of big data analytics, and I want to teach myself
Apache Spark, and want to be able to write scripts to tinker with data.

I have some understanding of Map Reduce but have not had a chance to get my
hands dirty. There are tons of resources for Spark, but I am looking for
some guidance for starter material, or videos.

Thanks.

Raghav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-- 
Best Regards,
Ayan Guha


Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Femi Anthony
The quote options seem to be related to escaping quotes, and the dataset
isn't escaping quotes. As I said, quoted strings with embedded commas are
something that pandas handles easily, and even Excel handles them as well.


Femi

On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon  wrote:

> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.
> html#pyspark.sql.DataFrameReader.csv
>
> Thanks.
>
> 2016-11-06 6:58 GMT+09:00 Femi Anthony :
>
>> Hi, I am trying to process a very large comma delimited csv file and I am
>> running into problems.
>> The main problem is that some fields contain quoted strings with embedded
>> commas.
>> It seems as if PySpark is unable to properly parse lines containing such
>> fields like say Pandas does.
>>
>> Here is the code I am using to read the file in Pyspark
>>
>> df_raw=spark.read.option("header","true").csv(csv_path)
>>
>> Here is an example of a good and 'bad' line in such a file:
>>
>>
>> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col
>> 12,col13,col14,col15,col16,col17,col18,col19
>> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
>> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#Y
>> OUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023
>> -05-17,CODERED
>> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUT
>> KAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,201
>> 9-11-23,CODEBLUE
>>
>> Line 0 is the header
>> Line 1 is the 'problematic' line
>> Line 2 is a good line.
>>
>> Pandas can handle this easily:
>>
>>
>> [1]: import pandas as pd
>>
>> In [2]: pdf = pd.read_csv('malformed_data.csv')
>>
>> In [4]: pdf[['col12','col13','col14']]
>> Out[4]:
>> col12
>> col13  \
>> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUG
>> OTSOUL~BRINGDANOISE
>> 1 NaN
>> OUTKAST#THROOTS~WUTANG#RUNDMC
>>
>>col14
>> 0   23.0
>> 10.0
>>
>>
>> while Pyspark seems to parse this erroneously:
>>
>> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed
>> _data.csv',header=True)
>>
>> [6]: sdf.select("col12","col13",'col14').show()
>> +--+++
>> | col12|   col13|   col14|
>> +--+++
>> |"32 XIY ""W""   JK|  RE LK"|SOMETHINGLIKEAPHE...|
>> |  null|OUTKAST#THROOTS~W...| 0.0|
>> +--+++
>>
>>  Is this a bug or am I doing something wrong ?
>>  I am working with Spark 2.0
>>  Any help is appreciated
>>
>> Thanks,
>> -- Femi
>>
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current 
release candidate: 
https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E
 . You can try that right now if you want from the staging Maven repo shown 
there. The vote looks likely to pass so an actual release should hopefully also 
be out soon.

Matei

> On Nov 6, 2016, at 5:25 PM, shyla deshpande  wrote:
> 
> Hi Jaya!
> 
> Thanks for the reply. Structured streaming works fine for me with socket text 
> stream . I think structured streaming with kafka source not yet supported.
> 
> Please if anyone has got it working with kafka source, please provide me some 
> sample code or direction.
> 
> Thanks
> 
> 
> On Sun, Nov 6, 2016 at 5:17 PM, Jayaradha Natarajan  > wrote:
> Shyla!
> 
> Check
> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>  
> 
> 
> Thanks,
> Jayaradha
> 
> On Sun, Nov 6, 2016 at 5:13 PM, shyla  > wrote:
> I am trying to do Structured Streaming with Kafka Source. Please let me know
> where I can find some sample code for this. Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Kafka-Source-does-it-work-tp19748.html
>  
> 
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 



hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread litg
I'm a postgraduate from Shanghai Jiao Tong University, China. Recently, I
have been carrying out a project on implementing artificial-intelligence
algorithms on Spark in Python. However, I am not familiar with this field,
and furthermore there are few Chinese books about Spark.
Actually, I strongly want to study this field further. I hope someone can
kindly recommend some books about the internals of Spark, or just give me
suggestions about how to program with Spark.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/hope-someone-can-recommend-some-books-for-me-a-spark-beginner-tp28033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Structured Streaming with Kafka source, does it work??

2016-11-06 Thread 余根茂(木艮)
docs:
https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md

From: shyla deshpande
Sent: Monday, 7 November 2016, 09:15
To: user
Subject: Structured Streaming with Kafka source, does it work??
I am trying to do Structured Streaming with Kafka Source. Please let me know 
where I can find some sample code for this. Thanks



Re: Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
Hm. Something must have changed, as it was happening quite consistently and now 
I can't get it to reproduce. Thank you for the offer, and if it happens again I 
will try grabbing thread dumps and I will see if I can figure out what is going 
on. 

On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar 
 wrote:
 

 I doubt it's GC as you mentioned that the pause is several minutes. Since it's 
reproducible in local mode, can you run the spark application locally and once 
your job is complete (and application appears paused), can you take 5 thread 
dumps (using jstack or jcmd on the local spark JVM process) with 1 second delay 
between each dump and attach them? I can take a look.
Thanks,Aniket
On Sun, Nov 6, 2016 at 2:21 PM Michael Johnson  wrote:

Thanks; I tried looking at the thread dumps for the driver and the one executor 
that had that option in the UI, but I'm afraid I don't know how to interpret 
what I saw...  I don't think it could be my code directly, since at this point 
my code has all completed? Could GC be taking that long? 
(I could also try grabbing the thread dumps and pasting them here, if that 
would help?)

On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar 
 wrote:
 

 In order to know what's going on, you can study the thread dumps either from 
spark UI or from any other thread dump analysis tool.
Thanks,Aniket
On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson 
 wrote:

I'm doing some processing and then clustering of a small dataset (~150 MB). 
Everything seems to work fine, until the end; the last few lines of my program 
are log statements, but after printing those, nothing seems to happen for a 
long time...many minutes; I'm not usually patient enough to let it go, but I 
think one time when I did just wait, it took over an hour (and did eventually 
exit on its own). Any ideas on what's happening, or how to troubleshoot?
(This happens both when running locally, using the localhost mode, as well as 
on a small cluster with four 4-processor nodes each with 15GB of RAM; in both 
cases the executors have 2GB+ of RAM, and none of the inputs/outputs on any of 
the stages is more than 75 MB...)
Thanks,Michael


   


   

Error while creating tables in Parquet format in 2.0.1 (No plan for InsertIntoTable)

2016-11-06 Thread Kiran Chitturi
Hello,

I am encountering a new problem with Spark 2.0.1 that didn't happen with
Spark 1.6.x.

These SQL statements ran successfully via spark-thrift-server in 1.6.x:


> CREATE TABLE test2 USING solr OPTIONS (zkhost "localhost:9987", collection
> "test", fields "id" );
>
> CREATE TABLE test_stored STORED AS PARQUET LOCATION
>  '/Users/kiran/spark/test.parquet' AS SELECT * FROM test;


but with Spark 2.0.x, the last statement throws this below error


> CREATE TABLE test_stored1 STORED AS PARQUET LOCATION

'/Users/kiran/spark/test.parquet' AS SELECT * FROM test2;





Error: java.lang.AssertionError: assertion failed: No plan for
> InsertIntoTable Relation[id#3] parquet, true, false
> +- Relation[id#2] com.lucidworks.spark.SolrRelation@57d735e9
> (state=,code=0)


The full stack trace is at
https://gist.github.com/kiranchitturi/8b3637723e0887f31917f405ef1425a1

SolrRelation class (
https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala
)

This error message doesn't seem very meaningful to me. I am not quite sure
how to track this down or fix this. Is there something I need to implement
in the SolrRelation class to be able to create Parquet tables from Solr
tables.

Looking forward to your suggestions.

Thanks,
-- 
Kiran Chitturi


Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s) ?  When running a query
for machine learning algorithms the results are not encouraging.


https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22

There are 62 packages. Only a few have actual releases - and even less with
dates in the past twelve months.

There are several from DataBricks among the chosen few that have recent
releases.

Here is one that actually seems to be in reasonable shape: the DB port of
Stanford coreNLP.

https://github.com/databricks/spark-corenlp

But .. one or two solid packages .. ?

It seems the spark-packages approach has not picked up steam.
Are there other suggested places to look for algorithms not included in
MLlib itself?


Re: Error while creating tables in Parquet format in 2.0.1 (No plan for InsertIntoTable)

2016-11-06 Thread Kiran Chitturi
I get the same error with the JDBC Datasource as well

0: jdbc:hive2://localhost:1> CREATE TABLE jtest USING jdbc OPTIONS
> ("url" "jdbc:mysql://localhost/test", "driver" "com.mysql.jdbc.Driver",
> "dbtable" "stats");
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.156 seconds)
>


0: jdbc:hive2://localhost:1> CREATE TABLE test_stored STORED AS PARQUET
> LOCATION  '/Users/kiran/spark/test5.parquet' AS SELECT * FROM jtest;
> Error: java.lang.AssertionError: assertion failed: No plan for
> InsertIntoTable
> Relation[id#14,stat_repository_type#15,stat_repository_id#16,stat_holder_type#17,stat_holder_id#18,stat_coverage_type#19,stat_coverage_id#20,stat_membership_type#21,stat_membership_id#22,context#23]
> parquet, true, false
> +-
> Relation[id#4,stat_repository_type#5,stat_repository_id#6,stat_holder_type#7,stat_holder_id#8,stat_coverage_type#9,stat_coverage_id#10,stat_membership_type#11,stat_membership_id#12,context#13]
> JDBCRelation(stats) (state=,code=0)
>

JDBCRelation also extends the BaseRelation as well. Is there any workaround
for the Datasources that extend BaseRelation ?



On Sun, Nov 6, 2016 at 8:08 PM, Kiran Chitturi <
kiran.chitt...@lucidworks.com> wrote:

> Hello,
>
> I am encountering a new problem with Spark 2.0.1 that didn't happen with
> Spark 1.6.x.
>
> These SQL statements ran successfully spark-thrift-server in 1.6.x
>
>
>> CREATE TABLE test2 USING solr OPTIONS (zkhost "localhost:9987",
>> collection "test", fields "id" );
>>
>> CREATE TABLE test_stored STORED AS PARQUET LOCATION
>>  '/Users/kiran/spark/test.parquet' AS SELECT * FROM test;
>
>
> but with Spark 2.0.x, the last statement throws this below error
>
>
>> CREATE TABLE test_stored1 STORED AS PARQUET LOCATION
>
> '/Users/kiran/spark/test.parquet' AS SELECT * FROM test2;
>
>
>
>
>
> Error: java.lang.AssertionError: assertion failed: No plan for
>> InsertIntoTable Relation[id#3] parquet, true, false
>> +- Relation[id#2] com.lucidworks.spark.SolrRelation@57d735e9
>> (state=,code=0)
>
>
> The full stack trace is at https://gist.github.com/kiranchitturi/
> 8b3637723e0887f31917f405ef1425a1
>
> SolrRelation class (https://github.com/lucidworks/spark-solr/blob/
> master/src/main/scala/com/lucidworks/spark/SolrRelation.scala)
>
> This error message doesn't seem very meaningful to me. I am not quite sure
> how to track this down or fix this. Is there something I need to implement
> in the SolrRelation class to be able to create Parquet tables from Solr
> tables.
>
> Looking forward to your suggestions.
>
> Thanks,
> --
> Kiran Chitturi
>
>


-- 
Kiran Chitturi


Re: Spark-packages

2016-11-06 Thread Holden Karau
I think there is a bit more life in the connector side of things for
spark-packages, but there seem to be some outstanding issues with Python
support that are waiting on progress (see
https://github.com/databricks/sbt-spark-package/issues/26 ).
It's possible others are just distributing on Maven Central instead of
putting in the effort to publish to spark-packages, but I don't know of any
comprehensive index besides spark-packages currently.

On Sunday, November 6, 2016, Stephen Boesch  wrote:

>
> What is the state of the spark-packages project(s) ?  When running a query
> for machine learning algorithms the results are not encouraging.
>
>
> https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22
>
> There are 62 packages. Only a few have actual releases - and even less
> with dates in the past twelve months.
>
> There are several from DataBricks among the chosen few that have recent
> releases.
>
> Here is one that actually seems to be in reasonable shape: the DB port of
> Stanford coreNLP.
>
> https://github.com/databricks/spark-corenlp
>
> But .. one or two solid packages .. ?
>
> It seems the  spark-packages approach seems not to have picked up  steam..
>   Are there other places suggested to look for algorithms not included in
> mllib itself ?
>
>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
Hi

By receiver I meant the Spark Streaming receiver architecture, i.e. the worker
nodes are different from the receiver nodes. Is there no direct/low-level
consumer for Kinesis in Spark Streaming, like there is for Kafka?

Is there any limitation on the checkpoint interval, such as a minimum of 1
second, in Spark Streaming with Kinesis? As far as I can tell there is no such
limit on the checkpoint interval on the KCL side.

Thanks
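
For reference, this is how I understand the checkpoint interval is specified
today with the receiver-based API; a rough sketch with placeholder stream and
application names:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sc, Seconds(10))

// The checkpointInterval argument controls how often the KCL writes shard
// sequence numbers to DynamoDB; there is no public hook to checkpoint manually.
val stream = KinesisUtils.createStream(
  ssc,
  "my-kinesis-app",          // KCL application name / DynamoDB table (placeholder)
  "my-stream",               // Kinesis stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",
  "us-east-1",
  InitialPositionInStream.LATEST,
  Seconds(10),               // checkpoint interval
  StorageLevel.MEMORY_AND_DISK_2)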

On Tue, Oct 25, 2016 at 8:36 AM, Takeshi Yamamuro 
wrote:

> I'm not exactly sure about the receiver you pointed though,
> if you point the "KinesisReceiver" implementation, yes.
>
> Also, we currently cannot disable the interval checkpoints.
>
> On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> Thanks!
>>
>> Is kinesis streams are receiver based only? Is there non receiver based
>> consumer for Kinesis ?
>>
>> And Instead of having fixed checkpoint interval,Can I disable auto
>> checkpoint and say  when my worker has processed the data after last record
>> of mapPartition now checkpoint the sequence no using some api.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> The only thing you can do for Kinesis checkpoints is tune the interval
>>> of them.
>>> https://github.com/apache/spark/blob/master/external/kinesis
>>> -asl/src/main/scala/org/apache/spark/streaming/kinesis/Kines
>>> isUtils.scala#L68
>>>
>>> Whether the dataloss occurs or not depends on the storage level you set;
>>> if you set StorageLevel.MEMORY_AND_DISK_2, Spark may continue processing
>>> in case of the dataloss because the stream data Spark receives are
>>> replicated across executors.
>>> However,  all the executors that have the replicated data crash,
>>> IIUC the dataloss occurs.
>>>
>>> // maropu
>>>
>>> On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
 Does spark streaming consumer for kinesis uses Kinesis Client Library
  and mandates to checkpoint the sequence number of shards in dynamo db.

 Will it lead to dataloss if consumed datarecords are not yet processed
 and kinesis checkpointed the consumed sequenece numbers in dynamo db and
 spark worker crashes - then spark launched the worker on another node but
 start consuming from dynamo db's checkpointed sequence number which is
 ahead of processed sequenece number .

 is there a way to checkpoint the sequenece numbers ourselves in Kinesis
 as it is in Kafka low level consumer ?

 Thanks


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Raghav
Can you please point out the right courses from EDX/Berkeley ?

Many thanks.

On Sun, Nov 6, 2016 at 6:08 PM, ayan guha  wrote:

> I would start with the Spark documentation, really. Then you would probably
> move on to some older videos from YouTube, especially the Spark Summit
> 2014, 2015 and 2016 videos. Regarding practice, I would strongly suggest
> Databricks cloud (or download a prebuilt Spark from the Spark site). You can
> also take courses from edX/Berkeley, which are very good starter courses.
>
> On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
>
>> I am a newbie in the world of big data analytics, and I want to teach
>> myself Apache Spark and be able to write scripts to tinker with data.
>>
>> I have some understanding of MapReduce but have not had a chance to get
>> my hands dirty. There are tons of resources for Spark, but I am looking
>> for some guidance on starter material or videos.
>>
>> Thanks.
>>
>> Raghav
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-
>> with-Spark-tp28032.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Spark Exits with exception

2016-11-06 Thread Shivansh Srivastava
This is the stackTrace that I am getting while running the application:

16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 233 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID
217, 10.178.149.243): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.0 in stage 11.0
(TID 225) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 1]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.1 in stage 11.0
(TID 234, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.0 in stage 11.0
(TID 232) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 2]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 234 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.1 in stage 11.0
(TID 235, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.0 in stage 11.0
(TID 233) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 3]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 235 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.1 in stage 11.0
(TID 236, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 236 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.1 in stage 11.0
(TID 235) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 4]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.2 in stage 11.0
(TID 237, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 237 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.1 in stage 11.0
(TID 234) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 5]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.2 in stage 11.0
(TID 238, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 238 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.1 in stage 11.0
(TID 236) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 6]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.2 in stage 11.0
(TID 239, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 239 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.2 in stage 11.0
(TID 237) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 7]
16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.3 in stage 11.0
(TID 240, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.2 in stage 11.0
(TID 238) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 8]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 240 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.3 in stage 11.0
(TID 241, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.2 in stage 11.0
(TID 239) on executor 10.178.149.243: java.util.NoSuchElementException
(None.get) [duplicate 9]
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 241 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.3 in stage 11.0
(TID 242, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
Launching task 242 on executor id: 4 hostname: 10.178.149.243.
16/11/03 11:25:45 INFO TaskSetManager: Lost ta

Re: Spark Exits with exception

2016-11-06 Thread Shivansh Srivastava
Can someone help me figure out what I am actually doing wrong?

The Spark UI shows that multiple apps are getting submitted, but I am
submitting only a single application to Spark, and all the applications are
in WAITING state except the main one.



On Mon, Nov 7, 2016 at 12:45 PM, Shivansh Srivastava 
wrote:

>
>
> This is the stackTrace that I am getting while running the application:
>
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 233 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 WARN TaskSetManager: Lost task 1.0 in stage 11.0
> (TID 217, 10.178.149.243): java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at org.apache.spark.storage.BlockInfoManager.
> releaseAllLocksForTask(BlockInfoManager.scala:343)
> at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(
> BlockManager.scala:644)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:281)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.0 in stage 11.0
> (TID 225) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 1]
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.1 in stage
> 11.0 (TID 234, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.0 in stage 11.0
> (TID 232) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 2]
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 234 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.1 in stage
> 11.0 (TID 235, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.0 in stage 11.0
> (TID 233) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 3]
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 235 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.1 in stage
> 11.0 (TID 236, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 236 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.1 in stage 11.0
> (TID 235) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 4]
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.2 in stage
> 11.0 (TID 237, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 237 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.1 in stage 11.0
> (TID 234) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 5]
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.2 in stage
> 11.0 (TID 238, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 238 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.1 in stage 11.0
> (TID 236) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 6]
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 24.2 in stage
> 11.0 (TID 239, 10.178.149.243, partition 24, NODE_LOCAL, 9185 bytes)
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 239 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 22.2 in stage 11.0
> (TID 237) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 7]
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 22.3 in stage
> 11.0 (TID 240, 10.178.149.243, partition 22, NODE_LOCAL, 9066 bytes)
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 14.2 in stage 11.0
> (TID 238) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [duplicate 8]
> 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:
> Launching task 240 on executor id: 4 hostname: 10.178.149.243.
> 16/11/03 11:25:45 INFO TaskSetManager: Starting task 14.3 in stage
> 11.0 (TID 241, 10.178.149.243, partition 14, NODE_LOCAL, 8828 bytes)
> 16/11/03 11:25:45 INFO TaskSetManager: Lost task 24.2 in stage 11.0
> (TID 239) on executor 10.178.149.243: java.util.NoSuchElementException
> (None.get) [d

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Science and Engineering with Apache
Spark series at
https://www.edx.org/xseries/data-science-engineering-apacher-sparktm.

Note, a great quick start is the Getting Started with Apache Spark on
Databricks at https://databricks.com/product/getting-started-guide
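
If you want something concrete to tinker with while working through those, a
minimal word count along the following lines is a common first step (just a
sketch against a local Spark 2.x install; the file path is a placeholder):

import org.apache.spark.sql.SparkSession

object FirstApp {
  def main(args: Array[String]): Unit = {
    // run locally, using all available cores
    val spark = SparkSession.builder()
      .appName("FirstApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // read a plain-text file and count words
    val lines = spark.read.textFile("/path/to/some/file.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .groupByKey(identity)
      .count()

    counts.show(20)
    spark.stop()
  }
}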

HTH!

On Sun, Nov 6, 2016 at 22:20 Raghav  wrote:

> Can you please point out the right courses from EDX/Berkeley ?
>
> Many thanks.
>
> On Sun, Nov 6, 2016 at 6:08 PM, ayan guha  wrote:
>
> I would start with the Spark documentation, really. Then you would probably
> move on to some older videos from YouTube, especially the Spark Summit
> 2014, 2015 and 2016 videos. Regarding practice, I would strongly suggest
> Databricks cloud (or download a prebuilt Spark from the Spark site). You can
> also take courses from edX/Berkeley, which are very good starter courses.
>
> On Mon, Nov 7, 2016 at 11:57 AM, raghav  wrote:
>
> I am a newbie in the world of big data analytics, and I want to teach myself
> Apache Spark and be able to write scripts to tinker with data.
>
> I have some understanding of MapReduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance on starter material or videos.
>
> Thanks.
>
> Raghav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-question-Best-way-to-bootstrap-with-Spark-tp28032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good
starting point is the Apache Spark Documentation at:
http://spark.apache.org/documentation.html


The two books that immediately come to mind are

- Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do
(there's also a Chinese language version of this book)

- Advanced Analytics with Apache Spark:
http://shop.oreilly.com/product/mobile/0636920035091.do

You can also find a pretty decent listing of Apache Spark resources at:
https://sparkhub.databricks.com/resources/

HTH!


On Sun, Nov 6, 2016 at 19:00 litg <1933443...@qq.com> wrote:

> I'm a postgraduate from Shanghai Jiao Tong University, China. Recently I
> have been carrying out a project on implementing artificial intelligence
> algorithms on Spark in Python. However, I am not familiar with this field,
> and furthermore there are few Chinese books about Spark.
> I strongly want to study this field further, and I hope someone can kindly
> recommend some books on how Spark works internally, or just give me
> suggestions about how to program with Spark.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hope-someone-can-recommend-some-books-for-me-a-spark-beginner-tp28033.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

