Re: Already subscribed to user@spark.apache.org

2016-11-06 Thread Maitray Thaker
On Mon, Nov 7, 2016 at 1:26 PM, wrote: > Hi! This is the ezmlm program. I'm managing the > user@spark.apache.org mailing list. > > Acknowledgment: The address > > maitraytha...@gmail.com > > was already on the user mailing list when I received > your request, and

Re: hope someone can recommend some books for me, a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good starting point is the Apache Spark Documentation at: http://spark.apache.org/documentation.html The two books that immediately come to mind are - Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do (there's

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache Spark at https://www.edx.org/xseries/data-science-engineering-apacher-sparktm. Note, a great quick start is the Getting Started with Apache Spark on Databricks at https://databricks.com/product/getting-started-guide HTH!

Re: Spark Exits with exception

2016-11-06 Thread Shivansh Srivastava
Can someone help me out and tell me what I am actually doing wrong? The Spark UI shows that multiple apps are getting submitted, but I am submitting only a single application on Spark, and all the applications are in WAITING state except the main one! On Mon, Nov 7, 2016 at 12:45 PM, Shivansh

Spark Exits with exception

2016-11-06 Thread Shivansh Srivastava
This is the stack trace that I am getting while running the application: 16/11/03 11:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 233 on executor id: 4 hostname: 10.178.149.243. 16/11/03 11:25:45 WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 217,

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Raghav
Can you please point out the right courses from EDX/Berkeley? Many thanks. On Sun, Nov 6, 2016 at 6:08 PM, ayan guha wrote: > I would start with the Spark documentation, really. Then you would probably > start with some older videos from YouTube, especially Spark Summit >

Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
Hi, by receiver I meant the Spark Streaming receiver architecture, where worker nodes are different from receiver nodes. Is there no direct/low-level consumer in Kinesis Spark Streaming, like Kafka's? Is there any limitation on the checkpoint interval - minimum of 1 second in Spark Streaming

Re: Spark-packages

2016-11-06 Thread Holden Karau
I think there is a bit more life in the connector side of things for spark-packages, but there seem to be some outstanding issues with Python support that are waiting on progress (see https://github.com/databricks/sbt-spark-package/issues/26 ). It's possible others are just distributing on maven

Re: Error while creating tables in Parquet format in 2.0.1 (No plan for InsertIntoTable)

2016-11-06 Thread Kiran Chitturi
I get the same error with the JDBC datasource as well:

0: jdbc:hive2://localhost:1> CREATE TABLE jtest USING jdbc OPTIONS
> ("url" "jdbc:mysql://localhost/test", "driver" "com.mysql.jdbc.Driver",
> "dbtable" "stats");
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows

Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s)? When running a query for machine learning algorithms, the results are not encouraging. https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22 There are 62 packages. Only a few have actual releases - and even fewer with dates in the past

Error while creating tables in Parquet format in 2.0.1 (No plan for InsertIntoTable)

2016-11-06 Thread Kiran Chitturi
Hello, I am encountering a new problem with Spark 2.0.1 that didn't happen with Spark 1.6.x. These SQL statements ran successfully in spark-thrift-server in 1.6.x: > CREATE TABLE test2 USING solr OPTIONS (zkhost "localhost:9987", collection > "test", fields "id"); > > CREATE TABLE test_stored

Re: Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
Hm. Something must have changed, as it was happening quite consistently and now I can't get it to reproduce. Thank you for the offer, and if it happens again I will try grabbing thread dumps and I will see if I can figure out what is going on. On Sunday, November 6, 2016 10:02 AM, Aniket

Re: Structured Streaming with Kafka source, does it work??

2016-11-06 Thread 余根茂(木艮)
docs: https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md -- From: shyla deshpande Sent: Monday, November 7, 2016 09:15 To: user Subject: Structured Streaming with

hope someone can recommend some books for me, a spark beginner

2016-11-06 Thread litg
I'm a postgraduate from Shanghai Jiao Tong University, China. Recently, I have been carrying out a project on realizing artificial intelligence algorithms on Spark in Python. However, I am not familiar with this field. Furthermore, there are few Chinese books about Spark. Actually, I strongly want to

Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current release candidate: https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E . You can try it right now if you want, from the staging Maven repo shown
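Once on 2.0.2, with the spark-sql-kafka-0-10 artifact on the classpath, the source can be wired up roughly as in the sketch below. The broker address and topic name are placeholders; nothing connects to Kafka until a streaming query is actually started.

```scala
import org.apache.spark.sql.SparkSession

object KafkaSourceSketch {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("kafka-source-sketch")
    .getOrCreate()

  // A streaming DataFrame over a Kafka topic; the schema is fixed by the source
  // (key, value, topic, partition, offset, timestamp, timestampType).
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
    .option("subscribe", "events")                       // placeholder topic
    .load()

  // Kafka delivers binary key/value; cast them for downstream processing.
  val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
}
```

Starting the query (`messages.writeStream...start()`) is when the broker is first contacted, so defining the stream works even before Kafka is reachable.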

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Femi Anthony
The quote options seem to be related to escaping quotes, and the dataset isn't escaping quotes. As I said, quoted strings with embedded commas are something that pandas handles easily, and even Excel does as well. Femi On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon wrote:

Re: Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread warmb...@qq.com
EDX/Berkeley +1 ___ 黄鹏程 (Huang Pengcheng), China Minsheng Bank, Head Office Technology Development DBA Group & Application Operations Center 4. Address: China Minsheng Bank headquarters, Shun'an South Road, Shunyi District, Beijing, 101300. Tel: 010-56361701 Mobile: 13488788499 Email: huangpengch...@cmbc.com.cn, gnu...@gmail.com

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread ayan guha
I would start with the Spark documentation, really. Then you would probably move on to some older videos from YouTube, especially Spark Summit 2014, 2015, and 2016 videos. Regarding practice, I would strongly suggest Databricks cloud (or download a prebuilt version from the Spark site). You can also take courses from

RE: expected behavior of Kafka dynamic topic subscription

2016-11-06 Thread Haopu Wang
Cody, thanks for the response. Do you think it's a Spark issue or a Kafka issue? Can you please let me know the jira ticket number? -Original Message- From: Cody Koeninger [mailto:c...@koeninger.org] Sent: November 4, 2016 22:35 To: Haopu Wang Cc: user@spark.apache.org Subject: Re: expected

Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread shyla deshpande
Hi Jaya! Thanks for the reply. Structured Streaming works fine for me with a socket text stream. I think Structured Streaming with a Kafka source is not yet supported. If anyone has got it working with a Kafka source, please provide me some sample code or direction. Thanks On Sun, Nov 6, 2016

Structured Streaming with Kafka source, does it work??

2016-11-06 Thread shyla deshpande
I am trying to do Structured Streaming with Kafka Source. Please let me know where I can find some sample code for this. Thanks

Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread raghav
I am a newbie in the world of big data analytics, and I want to teach myself Apache Spark and be able to write scripts to tinker with data. I have some understanding of MapReduce but have not had a chance to get my hands dirty. There are tons of resources for Spark, but I am looking for

Re: A Spark long running program as web server

2016-11-06 Thread Oddo Da
The spark jobserver will do what you describe for you. I have built an app where we have a bunch of queries being submitted via http://something/query/ via POST (all parameters for the query are in JSON POST request). This is a scalatra layer that talks to spark jobserver via HTTP. On Sun, Nov 6,

Re: distribute partitions evenly to my cluster

2016-11-06 Thread heather79
Thanks for your reply, Vipin! I am using the spark-perf benchmark. The command to create the RDD is: val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, m, n, numPartitions, seed) After I set numPartitions, for example 40 partitions, I think the Spark core code will allocate those partitions to
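For reference, a minimal local-mode sketch of the same call, with small stand-ins for the benchmark's m and n, showing how the partition count comes out and how rows are spread across partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

object PartitionsSketch {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[4]").setAppName("partitions-sketch"))

  val numPartitions = 40
  // 1000 rows of 10-dimensional standard-normal vectors across 40 partitions.
  val data: RDD[Vector] =
    RandomRDDs.normalVectorRDD(sc, 1000L, 10, numPartitions, seed = 11L)

  // Each partition becomes one task; with local[4] they run on 4 cores in waves.
  val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
}
```

Each of the 40 partitions is scheduled as a separate task, so how evenly they land on the cluster is up to the scheduler and locality, not the RDD itself.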

Re: Very long pause/hang at end of execution

2016-11-06 Thread Gourav Sengupta
Hi, if your process finishes after a lag, please check whether you are writing by converting to Pandas, or using coalesce (in which case the entire traffic is directed to a single node), or writing to S3, in which case there can be lags. Regards, Gourav On Sun, Nov 6, 2016 at 1:28
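To illustrate the coalesce point (this is not the original poster's code, just a sketch): coalesce(1) funnels the whole output through a single task, while repartition keeps the final stage parallel:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  val spark = SparkSession.builder()
    .master("local[4]")
    .appName("coalesce-sketch")
    .getOrCreate()

  val df = spark.range(0L, 100000L).toDF("id")

  // coalesce(1): one task writes everything -- handy for a single output file,
  // but it serializes the job's final stage (the lag described above).
  val single = df.coalesce(1)

  // repartition(n): the write stays parallel across n tasks.
  val parallel = df.repartition(8)
}
```

So a job that "finishes" its computation quickly can still sit for minutes inside a coalesce(1) write.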

Re: MLlib solving linear regression with sparse inputs

2016-11-06 Thread im281
In Java as well, please. Thanks again! Iman On Sun, Nov 6, 2016 at 8:28 AM Iman Mohtashemi wrote: Hi Robin, It looks like the linear regression model takes in a dataset, not a matrix? It would be helpful for this example if you could set up the whole problem end to end

Re: MLlib solving linear regression with sparse inputs

2016-11-06 Thread im281
Hi Robin, It looks like the linear regression model takes in a dataset, not a matrix? It would be helpful for this example if you could set up the whole problem end to end, using one of the columns of the matrix as b. So A is a sparse matrix and b is a sparse vector. Best regards, Iman On Sun, Nov

Re: Optimized way to use spark as db to hdfs etl

2016-11-06 Thread Sabarish Sasidharan
Please be aware that accumulators involve communication back with the driver and may not be efficient. I think the OP wants some way to extract the stats from the SQL plan if they are being stored in some internal data structure. Regards Sab On 5 Nov 2016 9:42 p.m., "Deepak Sharma"
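For context, a minimal accumulator sketch (Spark 2.x AccumulatorV2 API): tasks add to the counter, every update ships back to the driver with the task result (the communication cost mentioned above), and only the driver can read the total reliably after an action:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorSketch {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("acc-sketch")
    .getOrCreate()
  val sc = spark.sparkContext

  // Driver-side counter updated from tasks.
  val rowsSeen = sc.longAccumulator("rowsSeen")
  sc.parallelize(1 to 1000, 4).foreach(_ => rowsSeen.add(1))
  // rowsSeen.value is 1000 here on the driver, after the action completes.
}
```

Note that accumulator updates inside transformations (as opposed to actions like foreach) can be double-counted on task retries, which is another reason to prefer plan-level statistics when they are available.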

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
I doubt it's GC as you mentioned that the pause is several minutes. Since it's reproducible in local mode, can you run the spark application locally and once your job is complete (and application appears paused), can you take 5 thread dumps (using jstack or jcmd on the local spark JVM process)

Re: MLlib solving linear regression with sparse inputs

2016-11-06 Thread im281
Thank you! Would you happen to have this code in Java? This is extremely helpful! Iman On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User List]" wrote: Here's a way of creating sparse vectors in MLlib:

Re: Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
Thanks; I tried looking at the thread dumps for the driver and the one executor that had that option in the UI, but I'm afraid I don't know how to interpret what I saw...  I don't think it could be my code directly, since at this point my code has all completed? Could GC be taking that long?

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
In order to know what's going on, you can study the thread dumps either from spark UI or from any other thread dump analysis tool. Thanks, Aniket On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson wrote: > I'm doing some processing and then clustering of a small

Very long pause/hang at end of execution

2016-11-06 Thread Michael Johnson
I'm doing some processing and then clustering of a small dataset (~150 MB). Everything seems to work fine, until the end; the last few lines of my program are log statements, but after printing those, nothing seems to happen for a long time...many minutes; I'm not usually patient enough to let

Re: A Spark long running program as web server

2016-11-06 Thread vincent gromakowski
Hi, Spark jobserver seems to be more mature than Livy, but both would work, I think. You will just get more functionality with the jobserver, except for impersonation, which is only in Livy. If you need to publish business APIs, I would recommend Akka HTTP with Spark actors sharing a preloaded

Re: Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
If people agree that is desired, I am willing to submit a SIP for this and find time to work on it. Thanks, Aniket On Sun, Nov 6, 2016 at 1:06 PM Aniket Bhatnagar wrote: > Hello > > Dynamic allocation feature allows you to add executors and scale > computation

Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
Hello Dynamic allocation feature allows you to add executors and scale computation power. This is great, however, I feel like we also need a way to dynamically scale storage. Currently, if the disk is not able to hold the spilled/shuffle data, the job is aborted causing frustration and loss of

Fwd: A Spark long running program as web server

2016-11-06 Thread Reza zade
Hi, I have written multiple Spark driver programs that load some data from HDFS into data frames, run Spark SQL queries on it, and persist the results to HDFS again. Now I need to provide a long-running Java program in order to receive requests and some of their parameters (such as the number

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi, have you maybe tried the quote-related options specified in the documentation? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv Thanks. 2016-11-06 6:58 GMT+09:00 Femi Anthony : > Hi, I am trying to process a very
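A sketch of those options in practice (Scala shown; the Python reader takes the same option names). With quote and escape both set, a field like "Doe, Jane" survives its embedded comma; the file contents below are made up for illustration:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvQuoteSketch {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("csv-sketch")
    .getOrCreate()

  // Tiny sample file: the second field carries an embedded comma inside quotes.
  val dir = Files.createTempDirectory("csv")
  val path = dir.resolve("sample.csv")
  Files.write(path, java.util.Arrays.asList("id,name", "1,\"Doe, Jane\""))

  val df = spark.read
    .option("header", "true")
    .option("quote", "\"")    // the default, shown explicitly
    .option("escape", "\"")   // so "" inside a quoted field reads as a literal quote
    .csv(path.toString)
  // df.first().getString(1) is "Doe, Jane" -- the comma stayed inside the field.
}
```

Setting escape to the quote character matches the doubled-quote convention that Excel and pandas default to.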

Re: MLlib solving linear regression with sparse inputs

2016-11-06 Thread Robineast
Here's a way of creating sparse vectors in MLlib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd = sc.textFile("A.txt").map(line => line.split(",")).
  map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))
val pairRdd: RDD[(Int, (Int, Int, Double))]
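A self-contained sketch of the same idea, with inline (row, col, value) triples standing in for the parsed A.txt lines (the grouping step is one plausible way the truncated snippet continues, not necessarily Robin's original):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object SparseVectorSketch {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("sparse-sketch"))

  // (row, col, value) triples of a sparse matrix -- stand-ins for A.txt.
  val triples = sc.parallelize(Seq((0, 1, 2.0), (0, 3, 4.0), (1, 0, 1.5)))
  val numCols = 4

  // Group the (col, value) pairs by row index and build one sparse vector per
  // row; this Vectors.sparse overload accepts unordered (index, value) pairs.
  val rows: RDD[(Int, Vector)] = triples
    .map { case (i, j, v) => (i, (j, v)) }
    .groupByKey()
    .mapValues(cols => Vectors.sparse(numCols, cols.toSeq))
}
```

From here the (rowIndex, Vector) pairs can feed an IndexedRowMatrix or be sorted into plain training rows.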

Re: mapWithState and DataFrames

2016-11-06 Thread Victor Shafran
Hi Daniel, if you use the state in the same app, use the foreachRDD method of the stateSnapshot DStream to either persist the RDD to disk (rdd.persist) or convert it to a DataFrame and call the createOrReplaceTempView method. Code from

mapWithState and DataFrames

2016-11-06 Thread Daniel Haviv
Hi, How can I utilize mapWithState and DataFrames? Right now I stream json messages from Kafka, update their state, output the updated state as json and compose a dataframe from it. It seems inefficient both in terms of processing and storage (a long string instead of a compact DF). Is there a

Re: Spark dataset cache vs tempview

2016-11-06 Thread Mich Talebzadeh
With regard to the use of tempTable: createOrReplaceTempView is backed by an in-memory hash table that maps a table name (a string) to a logical query plan. Fragments of that logical query plan may or may not be cached. However, calling register alone will not result in any materialization of results.
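A short sketch of that distinction: registering the view only records a name-to-plan mapping, while caching is a separate, lazy step that materializes on the first action to scan the data:

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("view-sketch")
    .getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "tag")

  // Maps the name "t" to df's logical plan -- nothing runs or is cached here.
  df.createOrReplaceTempView("t")

  // Explicit, lazy caching: materialized by the first action that scans "t".
  spark.table("t").cache()
  val n = spark.sql("SELECT count(*) FROM t").first().getLong(0)
}
```

Dropping the view later removes only the name mapping; cached data must be released separately with unpersist.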