Re: Not per-key state in spark streaming

2016-12-08 Thread Anty Rao
Thank you very much for your reply, Daniel. On Thu, Dec 8, 2016 at 7:07 PM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote: > There's no need to extend Spark's API; look at mapWithState for examples. > > On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao wrote: > >> >> >> On

reading data from s3

2016-12-08 Thread Hitesh Goyal
Hi team, I want to read a text file from S3. I am doing it using a DataFrame, like below: DataFrame d = sql.read().text("s3://my_first_text_file.txt"); d.registerTempTable("table1"); DataFrame d1 = sql.sql("Select * from table1"); d1.printSchema();
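A minimal Scala sketch of the same read, assuming a Spark 1.6-style SQLContext and an S3A-accessible bucket; the bucket name and path are placeholders, and AWS credentials are assumed to be configured for Hadoop.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("read-s3-text"))
val sqlContext = new SQLContext(sc)

// s3a:// (or s3n:// on older Hadoop builds) needs AWS credentials available
// to Hadoop, e.g. via core-site.xml or environment variables.
val df = sqlContext.read.text("s3a://my-bucket/my_first_text_file.txt")
df.registerTempTable("table1")
sqlContext.sql("SELECT * FROM table1").printSchema()
```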

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely it's because HashingTF returns ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as
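A hedged Scala sketch of the conversion, assuming a DataFrame 'featureDF' produced by HashingTF/IDF with an ml-package vector column named "features" (both names are assumptions):

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert each ml.linalg.Vector to an mllib.linalg.Vector, then build the RowMatrix.
val rowRdd = featureDF.rdd.map { row =>
  OldVectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("features"))
}
val mat = new RowMatrix(rowRdd)
println(s"${mat.numRows()} x ${mat.numCols()}")
```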

Re: how can I set the log configuration file for spark history server ?

2016-12-08 Thread Don Drake
You can update $SPARK_HOME/spark-env.sh by setting the environment variable SPARK_HISTORY_OPTS. See http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options for options (spark.history.fs.logDirectory) you can set. There is log rotation built in (by time, not size) to the

how can I set the log configuration file for spark history server ?

2016-12-08 Thread John Fang
./start-history-server.sh starting org.apache.spark.deploy.history.HistoryServer, logging to  /home/admin/koala/data/versions/0/SPARK/2.0.2/spark-2.0.2-bin-hadoop2.6/logs/spark-admin-org.apache.spark.deploy.history.HistoryServer-1-v069166214.sqa.zmf.out Then the history will print all log to the

flatmap pair

2016-12-08 Thread im281
The class 'Detector' has a function 'detectFeature(cluster)'. However, the method has changed to return a list of features, as opposed to one feature as it is below. How do I change this so it returns a list of feature objects instead? // creates key-value pairs for Isotope cluster ID and Isotope
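A rough Scala sketch of the general pattern (the thread's code appears to be Java, and 'Detector', 'Feature', 'clusters', and the id field are all assumed names): when the per-element function returns a collection, switch from map to flatMap so each returned feature becomes its own record.

```scala
// 'clusters' is an assumed RDD of isotope clusters; detectFeature is assumed
// to now return a Scala Seq[Feature] instead of a single Feature.
val featuresByCluster = clusters.flatMap { cluster =>
  val detector = new Detector()
  detector.detectFeature(cluster).map(feature => (cluster.id, feature))
}
```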

Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread satyajit vegesna
Hi All, PFB code. import org.apache.spark.ml.feature.{HashingTF, IDF} import org.apache.spark.ml.linalg.SparseVector import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.sql.SparkSession import org.apache.spark.{SparkConf, SparkContext} /** * Created by satyajit

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
Sure, I'd love to participate. Being new at Scala, things like dependency injection are still a bit iffy. Would love to exchange ideas. Sent from my iPhone > On Dec 8, 2016, at 4:29 PM, Holden Karau wrote: > > Maybe diverging a bit from the original question - but would

Re: unit testing in spark

2016-12-08 Thread Holden Karau
Maybe diverging a bit from the original question - but would it maybe make sense for those of us that all care about testing to try and do a hangout at some point so that we can exchange ideas? On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales wrote: > I would be

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
I would be interested in contributing. I've created my own library for this as well. In my blog post I talk about testing with Spark in RSpec style: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941 Sent from my iPhone > On Dec 8, 2016, at 4:09 PM, Holden

Re: unit testing in spark

2016-12-08 Thread Holden Karau
There are also libraries designed to simplify testing Spark in the various platforms, spark-testing-base for Scala/Java/Python (& video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused property
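A minimal sketch of a spark-testing-base test in Scala, assuming spark-testing-base and ScalaTest are on the test classpath; the suite name and RDD contents are illustrative only.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

// SharedSparkContext provides a SparkContext as 'sc' and reuses it across tests.
class WordCountSuite extends FunSuite with SharedSparkContext {
  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
  }
}
```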

Fwd: Question about SPARK-11374 (skip.header.line.count)

2016-12-08 Thread Dongjoon Hyun
+dev. I forgot to add @user. Dongjoon. -- Forwarded message - From: Dongjoon Hyun Date: Thu, Dec 8, 2016 at 16:00 Subject: Question about SPARK-11374 (skip.header.line.count) To: Hi, All. Could you give me some opinion? There

Re: .tar.bz2 in spark

2016-12-08 Thread Jörn Franke
Tar is not supported out of the box. Just store the file as .json.bz2 without using tar. > On 8 Dec 2016, at 20:18, Maurin Lenglart wrote: > > Hi, > I am trying to load a json file compressed in .tar.bz2 but spark throws an error. > I am using pyspark with spark 1.6.2.
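A short sketch of reading a bzip2-compressed JSON file directly (the path is a placeholder); Spark/Hadoop decompresses .bz2 transparently as long as there is no tar wrapper around the file.

```scala
// Works the same from pyspark; shown here in Scala for consistency.
val df = sqlContext.read.json("hdfs:///data/events.json.bz2")
df.printSchema()
```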

Re: Design patterns for Spark implementation

2016-12-08 Thread Mich Talebzadeh
Another use case for Spark is to use its in-memory and parallel processing on RDBMS data. This may sound a bit strange, but you can access your RDBMS table from Spark via JDBC with parallel processing and engage the speed of Spark to accelerate the queries. To do this you may need to parallelise
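A hedged sketch of a parallel JDBC read; the URL, table, partition column, and bounds are placeholders, and the JDBC driver jar is assumed to be on the classpath.

```scala
// Spark opens numPartitions connections, each reading a slice of
// partitionColumn between lowerBound and upperBound.
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
  "dbtable"         -> "sales",
  "partitionColumn" -> "sale_id",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"
)).load()
```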

Re: unit testing in spark

2016-12-08 Thread Lars Albertsson
I wrote some advice in a previous post on the list: http://markmail.org/message/bbs5acrnksjxsrrs It does not mention python, but the strategy advice is the same. Just replace JUnit/Scalatest with pytest, unittest, or your favourite python test framework. I recently held a presentation on the

Re: When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread Michael Armbrust
I would guess Spark 2.3, but maybe sooner maybe later depending on demand. I created https://issues.apache.org/jira/browse/SPARK-18791 so people can describe their requirements / stay informed. On Thu, Dec 8, 2016 at 11:16 AM, ljwagerfield wrote: > Hi there, > >

Re: few basic questions on structured streaming

2016-12-08 Thread Michael Armbrust
> > 1. what happens if an event arrives few days late? Looks like we have an > unbound table with sorted time intervals as keys but I assume spark doesn't > keep several days worth of data in memory but rather it would checkpoint > parts of the unbound table to a storage at a specified interval

KMediods in Spark java

2016-12-08 Thread Shak S
Is there any example of implementing k-medoids clustering in Spark with Java? I searched the Spark API and it looks like Spark has not yet implemented k-medoids. Any example or inputs will be appreciated. Thanks.

Phoenix Plugin for Spark - connecting to Phoenix in secured cluster.

2016-12-08 Thread Marcin Pastecki
Hello all, I have problem accessing HBase using Spark Phoenix Plugin in secured cluster. Versions: Spark 1.6.1, HBase 1.1.2.2.4, Phoenix 4.4.0 Using sqlline.py works just fine. I have valid Kerberos ticket. Trying to get this to work in local mode first. What I'm doing is basic test as

Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try to coalesce with a value of 2 or so. You could dynamically calculate how many partitions to have to obtain an optimal file size. Sent from my iPhone > On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote: > > How many partition should it be when streaming? - As in streaming
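A small sketch of the suggestion, assuming a DataFrame 'df' and a target partition count chosen from the data size:

```scala
// Fewer output partitions means fewer (larger) parquet files per write.
val targetPartitions = 2
df.coalesce(targetPartitions)
  .write
  .mode("append")
  .parquet("hdfs:///out/events")
```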

Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
How many partitions should there be when streaming? As the streaming process runs, the data will grow in size; is there any configuration to limit file size and write to a new file once it exceeds x (let's say 128MB per file)? Another question is about performance when querying these parquet files.

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
You could have posted just the error, which is at the end of my response. Why are you trying to use WebHDFS? I'm not really sure how authentication works with that. But generally applications use HDFS (which uses a different URI scheme), and Spark should work fine with that. Error:

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Gerard Casey
Sure - I wanted to check with admin before sharing. I’ve attached it now, does this help? Many thanks again, G Container: container_e34_1479877553404_0174_01_03 on hdp-node12.xcat.cluster_45454_1481228528201

Re: Design patterns for Spark implementation

2016-12-08 Thread Sachin Naik
Not sure if you are aware of these: 1) EdX/Berkeley/Databricks has three Spark-related certifications. Might be a good start. 2) A fair understanding of Scala/distributed collection patterns to better appreciate the internals of Spark. Coursera has three Scala courses. I know there are other

Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keeping in mind Spark is a parallel computing engine, Spark does not change your data infrastructure/data architecture. These days it's relatively convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) and ditto on the output side. For example, for one of my use-cases, I

SparkContext not creating due Logger initialization

2016-12-08 Thread Adnan Ahmed
Hi, Sometimes I get this error when I submit a Spark job. It doesn't happen every time, but when it comes up the SparkContext doesn't get created. 16/12/08 08:02:18 INFO [akka.event.slf4j.Slf4jLogger] 80==> Slf4jLogger started error while starting up loggers akka.ConfigurationException: Logger specified

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
Then you probably have a configuration error somewhere. Since you haven't actually posted the error you're seeing, it's kinda hard to help any further. On Thu, Dec 8, 2016 at 11:17 AM, Gerard Casey wrote: > Right. I’m confident that is setup correctly. > > I can run

.tar.bz2 in spark

2016-12-08 Thread Maurin Lenglart
Hi, I am trying to load a json file compressed in .tar.bz2 but spark throws an error. I am using pyspark with spark 1.6.2. (Cloudera 5.9) What would be the best way to handle that? I don’t want to have a non-spark job that will just uncompress the data… thanks

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Gerard Casey
Right. I’m confident that is setup correctly. I can run the SparkPi test script. The main difference between it and my application is that it doesn’t access HDFS. > On 8 Dec 2016, at 18:43, Marcelo Vanzin wrote: > > On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey

When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread ljwagerfield
Hi there, Structured Streaming currently only supports stream-to-batch joins. Is there an ETA for stream-to-stream joins? Kindest regards (and keep up the awesome work!), Lawrence (p.s. I've traversed the JIRA roadmaps but couldn't see anything) -- View this message in context:

Re: spark reshape hive table and save to parquet

2016-12-08 Thread Georg Heiler
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html Anton Kravchenko wrote on Thu, 8 Dec 2016 at 17:53: > Hello, > > I wonder if there is a way (preferably efficient) in Spark to reshape a hive > table and save it to parquet.

Question about the DirectKafkaInputDStream

2016-12-08 Thread John Fang
The source is DirectKafkaInputDStream, which can ensure exactly-once semantics on the consumer side. But I have a question based on the following code. As we know, "graph.generateJobs(time)" will create RDDs and generate jobs. And the source RDD is KafkaRDD, which contains the offsetRange. The jobs are

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey wrote: > To be specific, where exactly should spark.authenticate be set to true? spark.authenticate has nothing to do with kerberos. It's for authentication between different Spark processes belonging to the same app. --

spark reshape hive table and save to parquet

2016-12-08 Thread Anton Kravchenko
Hello, I wonder if there is a way (preferably efficient) in Spark to reshape a hive table and save it to parquet. Here is a minimal example.
Input hive table:
col1 col2 col3
1    2    3
4    5    6
Output parquet:
col1 newcol2
1    [2 3]
4    [5 6]
p.s. The real input hive table has ~1000 columns. Thank you,
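A hedged Scala sketch of one way to do this for the two-column example; the DataFrame name and column names are taken from the example above, and for the ~1000-column case the array() call could be built programmatically from df.columns.

```scala
import org.apache.spark.sql.functions.{array, col}

// Collapse col2 and col3 into a single array column named "newcol2".
val reshaped = df.select(
  col("col1"),
  array(col("col2"), col("col3")).as("newcol2")
)
reshaped.write.parquet("hdfs:///out/reshaped.parquet")

// For many columns, build the array from the column list instead:
// array(df.columns.filter(_ != "col1").map(col): _*)
```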

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-12-08 Thread Aniket Bhatnagar
I did some instrumentation to figure out traces of where DirectByteBuffers are being created and it turns out that setting the following system properties in addition to setting spark.shuffle.io.preferDirectBufs=false in spark config: io.netty.noUnsafe=true io.netty.threadLocalDirectBufferSize=0
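A sketch of setting those properties through SparkConf rather than on the command line; the property names come from the message above, and the exact flags needed may differ by Spark/Netty version.

```scala
import org.apache.spark.SparkConf

// Prefer heap buffers for shuffle I/O and pass the Netty system properties
// to the executor JVMs.
val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")
  .set("spark.executor.extraJavaOptions",
       "-Dio.netty.noUnsafe=true -Dio.netty.threadLocalDirectBufferSize=0")
```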

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
I wish I could provide additional suggestions. Maybe one of the admins can step in and help. I'm just another random user trying (with mixed success) to be helpful.  Sorry again to everyone about my spam, which just added to the problem. On Thu, Dec 8, 2016 at 11:22 AM Chen, Yan I

RE: unsubscribe

2016-12-08 Thread Chen, Yan I
I’m pretty sure I didn’t. From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com] Sent: Thursday, December 08, 2016 10:56 AM To: Chen, Yan I; Di Zhu Cc: user @spark Subject: Re: unsubscribe Oh, hmm... Did you perhaps subscribe with a different address than the one you're trying to

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Oh, hmm... Did you perhaps subscribe with a different address than the one you're trying to unsubscribe from? For example, you subscribed with myemail+sp...@gmail.com but you send the unsubscribe email from myem...@gmail.com On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I wrote: > The

Re: Managed memory leak : spark-2.0.2

2016-12-08 Thread Appu K
Hi, I didn't hit any OOM issues. Thanks for the pointer. I guess it'll be safe to ignore since TaskMemoryManager automatically releases the memory. Just wondering what would have been the cause in this case - I couldn't see any task failures in the log, but there is some reference to ExternalAppendOnlyMap acquiring

Re: unit testing in spark

2016-12-08 Thread ndjido
Hi Pseudo, Just use unittest https://docs.python.org/2/library/unittest.html . > On 8 Dec 2016, at 19:14, pseudo oduesp wrote: > > somone can tell me how i can make unit test on pyspark ? > (book, tutorial ...)

Re: Managed memory leak : spark-2.0.2

2016-12-08 Thread Takeshi Yamamuro
Hi, Did you hit any trouble from the memory leak? I think we can ignore the message in most cases because TaskMemoryManager automatically releases the memory. In fact, Spark downgraded the message in SPARK-18557. https://issues.apache.org/jira/browse/SPARK-18557 // maropu On Thu, Dec 8, 2016 at

unit testing in spark

2016-12-08 Thread pseudo oduesp
Can someone tell me how I can write unit tests for PySpark? (book, tutorial ...)

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Yes, sorry about that. I didn't think before responding to all those who asked to unsubscribe. On Thu, Dec 8, 2016 at 10:00 AM Di Zhu wrote: > Could you send to individual privately without cc to all users every time? > > > On 8 Dec 2016, at 3:58 PM, Nicholas

Re: unsubscribe

2016-12-08 Thread Di Zhu
Could you send to individual privately without cc to all users every time? > On 8 Dec 2016, at 3:58 PM, Nicholas Chammas > wrote: > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > This is explained here:

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva wrote: > > This e-mail message, including any attachments, is for the sole use of

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 9:46 AM Tao Lu wrote: > >

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 8:01 AM Niki Pavlopoulou wrote: > unsubscribe >

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 7:50 AM Juan Caravaca wrote: > unsubscribe >

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 9:54 AM Kishorkumar Patil wrote: > >

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 9:42 AM Chen, Yan I wrote: > > > > ___ > > If you

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 12:17 AM Prashant Singh Thakur < prashant.tha...@impetus.co.in> wrote: > > > > > Best Regards, > > Prashant Thakur > > Work : 6046 > >

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 12:08 AM Kranthi Gmail wrote: > > > -- > Kranthi > > PS: Sent from mobile, pls excuse the brevity and typos. > >

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 6:27 AM Vinicius Barreto < vinicius.s.barr...@gmail.com> wrote: > Unsubscribe > > Em 7 de dez de 2016 17:46, "map reduced"

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 12:54 AM Roger Holenweger wrote: > > > - > To

Re: unscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 1:34 AM smith_666 wrote: > > > >

Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Thu, Dec 8, 2016 at 12:12 AM Ajit Jaokar wrote: > > > - > To

TO ALL WHO WANT TO UNSUBSCRIBE

2016-12-08 Thread 5g2w35+83j86k7gefujk
I swear, the next one trying to unsubscribe from this list or u...@spark.incubator.apache.org by sending "unsubscribe" to this list will be signed up for mailbait ... (you are welcome). HERE ARE THE INFOS ON HOW TO UNSUBSCRIBE. READ THEM! > --- Administrative commands for the user list --- >

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org This is explained here: http://spark.apache.org/community.html#mailing-lists On Wed, Dec 7, 2016 at 10:53 PM Ajith Jose wrote: > >

Unsubscribe

2016-12-08 Thread Kishorkumar Patil

Unsubscribe

2016-12-08 Thread Tao Lu

Unsubscribe

2016-12-08 Thread Chen, Yan I
___ If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of

Re: Unsubscribe

2016-12-08 Thread Jeff Sadowski
I think some people have a problem following instructions. Sigh. On Wed, Dec 7, 2016 at 10:54 PM, Roger Holenweger wrote: > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >

Records processed metric for intermediate datasets

2016-12-08 Thread Aniket R More
Hi, I have created a Spark job using the Dataset API. There is a chain of operations performed until the final result, which is collected on HDFS. But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (could be map, groupBy etc),
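One hedged way to get such counts is an accumulator incremented inside a pass-through map. This is only a sketch, not a claim about the author's job: names like 'inputDs' are assumptions, and accumulator values are only meaningful after an action has run (and can over-count if stages are retried).

```scala
import spark.implicits._  // assumes a SparkSession named 'spark'

// Count records flowing through an intermediate Dataset without changing it.
val stage1Count = spark.sparkContext.longAccumulator("stage1-records")
val stage1 = inputDs.map { r => stage1Count.add(1); r }

// ...apply the remaining operations on 'stage1', then run an action...
stage1.write.parquet("hdfs:///out/result")
println(s"stage1 records: ${stage1Count.value}")
```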

unsubscribe

2016-12-08 Thread Niki Pavlopoulou
unsubscribe

unsubscribe

2016-12-08 Thread Juan Caravaca
unsubscribe

unsubscribe

2016-12-08 Thread Ramon Rosa da Silva
This e-mail message, including any attachments, is for the sole use of the person to whom it has been sent and may contain information that is confidential or legally protected. If you are not the intended recipient or have received this message in error, you are not authorized to copy,

RE: few basic questions on structured streaming

2016-12-08 Thread Mendelson, Assaf
For watermarking you can read this excellent two-part article: part 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101, part 2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. It explains more than just watermarking, but it helped me understand a lot of the concepts

few basic questions on structured streaming

2016-12-08 Thread kant kodali
Hi All, I read the documentation on Structured Streaming based on event time and I have the following questions. 1. what happens if an event arrives few days late? Looks like we have an unbound table with sorted time intervals as keys but I assume spark doesn't keep several days worth of data in
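For context, a minimal sketch of the event-time grouping the question refers to (Spark 2.0-style API); 'events' is an assumed streaming DataFrame with 'timestamp' and 'word' columns. The engine keeps aggregate state per window key rather than retaining every raw row in memory.

```scala
import org.apache.spark.sql.functions.window

// Count words per 10-minute event-time window on a streaming DataFrame.
val counts = events
  .groupBy(window(events("timestamp"), "10 minutes"), events("word"))
  .count()
```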

Unsubscribe

2016-12-08 Thread Vinicius Barreto
Unsubscribe On Dec 7, 2016 at 17:46, "map reduced" wrote: > Hi, > > I am trying to solve this problem - in my streaming flow, every day a few > jobs fail due to some (say kafka cluster maintenance etc, mostly > unavoidable) reasons for a few batches and resume back to

Managed memory leak : spark-2.0.2

2016-12-08 Thread Appu K
Hello, I’ve just run into an issue where the job is giving me "Managed memory leak" with spark version 2.0.2 — 2016-12-08 16:31:25,231 [Executor task launch worker-0] (TaskMemoryManager.java:381) WARN leak 46.2 MB memory from

Re: Not per-key state in spark streaming

2016-12-08 Thread Daniel Haviv
There's no need to extend Spark's API, look at mapWithState for examples. On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao wrote: > > > On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote: > >> Hi >> I'm new to Spark. I'm doing some research to see if spark streaming
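A minimal mapWithState sketch in Scala showing per-key state (a running count); 'keyedStream' is an assumed DStream[(String, Int)], and checkpointing must be enabled on the StreamingContext.

```scala
import org.apache.spark.streaming.{State, StateSpec}

// For each key, add the incoming value to the stored count and emit the total.
val spec = StateSpec.function(
  (key: String, value: Option[Int], state: State[Int]) => {
    val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }
)
val runningCounts = keyedStream.mapWithState(spec)
```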

"Failed to find data source: libsvm" while running Spark application with jar

2016-12-08 Thread Md. Rezaul Karim
Hi there, I am getting the following error while trying to read an input file in libsvm format while running a Spark application jar. Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: libsvm. at
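For reference, a hedged sketch of how the libsvm source is normally used in Spark 2.x; the "libsvm" format is provided by spark-mllib, so one common cause of this error is that the mllib jar is missing from the application's classpath (e.g. dropped when building the assembly jar). The path below is a placeholder.

```scala
// Loads a DataFrame with "label" and "features" columns from a LIBSVM file.
val data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
data.printSchema()
```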

Re: Publishing of the Spectral LDA model on Spark Packages

2016-12-08 Thread François Garillot
This is very cool! Thanks a lot for making this more accessible! Best, -- FG On Wed, Dec 7, 2016 at 11:46 PM Jencir Lee wrote: > Hello, > > We just published the Spectral LDA model on Spark Packages. It’s an > alternative approach to the LDA modelling based on tensor

RE: How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Mendelson, Assaf
Groupby is not an actual result but a construct to allow defining aggregations. So you can do: import org.apache.spark.sql.{functions => func} val resDF = df.groupBy("client").agg(func.collect_set(df("Date"))) Note that collect_set can be a little heavy in terms
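A slightly fuller sketch of the same approach, assuming the column names from the question ('client_id' and 'Date'); collect_set gathers the distinct dates per client into an array column.

```scala
import org.apache.spark.sql.{functions => func}

val resDF = df
  .groupBy("client_id")
  .agg(func.collect_set(df("Date")).as("unique_dates"))
resDF.show(false)
```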

How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Devi P.V
Hi all, I have a dataframe like the following:
+---------+----------+
|client_id|Date      |
+---------+----------+
|a        |2016-11-23|
|b        |2016-11-18|
|a        |2016-11-23|
|a        |2016-11-23|
|a        |2016-11-24|
+---------+----------+
I want to find the unique dates of each client_id

RE: filter RDD by variable

2016-12-08 Thread Mendelson, Assaf
Can you provide the sample code you are using? In general, RDD filter receives as an input a function. The function’s input is the single record in the RDD and the output is a Boolean whether or not to include it in the result. So you can create any function you want… Assaf. From: Soheila S.
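A tiny sketch of the idea: a driver-side variable is simply captured in the filter closure (the variable name and RDD are illustrative).

```scala
// 'numbers' is an assumed RDD[Int]; 'threshold' is a plain driver-side variable.
val threshold = 10
val filtered = numbers.filter(n => n > threshold)
```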