Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
Hi, sorry, it's not clear to me whether you want help moving the data to the cluster or defining the best structure for your files on the cluster for efficient processing. Are you on standalone or using HDFS? On Tuesday, May 23, 2017, docdwarf wrote: > tesmai4 wrote > > I am

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Ah, that's right. I didn't mention it: I have 10 executors in my cluster, so when I do .coalesce(10) and right after that save ORC to S3, does coalescing really affect parallelism? To me it looks like no, because we went from 100 tasks that are executed in parallel by 10 executors to 10

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread John Compitello
Spark performs operations on each partition in parallel. If you decrease the number of partitions, you’re potentially doing less work in parallel, depending on your cluster setup. > On May 23, 2017, at 4:23 PM, Andrii Biletskyi > wrote: > > > No, I didn't

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
No, I didn't try to use repartition. How exactly does it impact the parallelism? In my understanding, coalesce simply "unions" multiple partitions located on the same executor "one on top of the other", while repartition does a hash-based shuffle, decreasing the number of output partitions. So how this

Spark Application hangs without trigger SparkShutdownHook

2017-05-23 Thread Xiaoye Sun
Hi all, I am running a Spark (v1.6.1) application using the ./bin/spark-submit script. I made some changes to the HttpBroadcast module. However, after the application finishes completely, the spark master program hangs at the end of the application. The ShutdownHook is supposed to be called at

Re: Are there any Kafka forEachSink examples?

2017-05-23 Thread kant kodali
Thanks a lot Michael! I am not sure why Google search doesn't take me to databricks blog when I typed in relevant keywords on various things. Perhaps the blog needs some metadata for the search engine to index or Google is more focused on Ads than relevant docs?! On Tue, May 23, 2017 at 12:17

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Michael Armbrust
coalesce is nice because it does not shuffle, but the consequence of avoiding a shuffle is it will also reduce parallelism of the preceding computation. Have you tried using repartition instead? On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi < andrii.bilets...@yahoo.com.invalid> wrote: > Hi
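
The trade-off in this reply can be sketched with a toy model. This is plain Scala and not Spark code: `Step`, `Narrow`, `Shuffle`, and `stageTaskCounts` are hypothetical names invented for illustration, and the 100-partition join output is an assumption taken from the numbers mentioned elsewhere in the thread. The model only encodes one rule: narrow steps are pipelined into the current stage, so the stage runs one task per partition of its last step, while a shuffle starts a new stage. In Spark itself the two calls being compared are `df.coalesce(10)` and `df.repartition(10)`.

```scala
// Toy model of stage boundaries (NOT Spark code; all names are hypothetical).
object CoalesceVsRepartitionModel {
  sealed trait Step { def name: String; def outPartitions: Int }
  // A narrow step is pipelined into the current stage (e.g. map, coalesce).
  final case class Narrow(name: String, outPartitions: Int) extends Step
  // A shuffle step ends the current stage and starts a new one (e.g. repartition).
  final case class Shuffle(name: String, outPartitions: Int) extends Step

  // Returns the number of tasks per stage, in execution order.
  // The accumulator is kept reversed: its head is the stage being built.
  def stageTaskCounts(plan: List[Step]): List[Int] =
    plan.foldLeft(List.empty[Int]) {
      case (Nil, step)               => List(step.outPartitions)
      case (stages, Shuffle(_, n))   => n :: stages // new stage after the shuffle
      case (_ :: done, Narrow(_, n)) => n :: done   // same stage, task count follows the last narrow step
    }.reverse

  // Assumed plan: the join produces 100 partitions, then we shrink to 10 and write.
  val withCoalesce: List[Step] =
    List(Narrow("join output", 100), Narrow("coalesce(10)", 10), Narrow("write orc", 10))
  val withRepartition: List[Step] =
    List(Narrow("join output", 100), Shuffle("repartition(10)", 10), Narrow("write orc", 10))

  def main(args: Array[String]): Unit = {
    println(stageTaskCounts(withCoalesce))    // prints List(10)
    println(stageTaskCounts(withRepartition)) // prints List(100, 10)
  }
}
```

In this model the coalesce plan collapses into a single stage of 10 tasks, so the join work itself is done by 10 tasks. The repartition plan keeps the 100-task stage for the join and pays for an extra shuffle before the 10-task write stage. Whether that shuffle is worth it depends on how expensive the preceding computation is relative to moving the data.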

Re: Are there any Kafka forEachSink examples?

2017-05-23 Thread Michael Armbrust
There is an example in this post: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html On Tue, May 23, 2017 at 11:35 AM, kant kodali wrote: > Hi All, > > Are there any Kafka forEachSink examples

Re: 2.2. release date ?

2017-05-23 Thread Michael Armbrust
Mark is right. I will cut another RC as soon as the known issues are resolved. In the meantime it would be very helpful for people to test RC2 and report issues. On Tue, May 23, 2017 at 11:10 AM, Mark Hamstra wrote: > I heard that once we reach release candidates it's

Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Hi all, I'm trying to understand the impact of the coalesce operation on Spark job performance. As a side note: we are using emrfs (i.e. AWS S3) as both the source and the target for the job. Omitting unnecessary details, the job can be described as: join a 200M-record DataFrame stored in ORC format on emrfs with

Are there any Kafka forEachSink examples?

2017-05-23 Thread kant kodali
Hi All, Are there any Kafka forEachSink examples preferably in Java but Scala is fine too? Thanks!

Re: 2.2. release date ?

2017-05-23 Thread Mark Hamstra
I heard that once we reach release candidates it's not a question of time or a target date, but only whether blockers are resolved and the code is ready to release. On Tue, May 23, 2017 at 11:07 AM, kant kodali wrote: > Heard its end of this month (May) > > On Tue, May 23,

Re: 2.2. release date ?

2017-05-23 Thread kant kodali
Heard it's the end of this month (May) On Tue, May 23, 2017 at 9:41 AM, mojhaha kiklasds wrote: > Hello, > > I could see a RC2 candidate for Spark 2.2, but not sure about the expected > release timeline on that. > Would be great if somebody can confirm it. > > Thanks, >

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-23 Thread Manish Malhotra
Thanks ! On Mon, May 22, 2017 at 5:58 PM kant kodali wrote: > Well there are few things here. > > 1. What is the Spark Version? > cdh 1.6 2. You said there is OOM error but what is the cause that appears in the > log message or stack trace? OOM can happen for various

2.2. release date ?

2017-05-23 Thread mojhaha kiklasds
Hello, I could see a RC2 candidate for Spark 2.2, but not sure about the expected release timeline on that. Would be great if somebody can confirm it. Thanks, Mhojaha

Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread docdwarf
tesmai4 wrote > I am converting my Java based NLP parser to execute it on my Spark > cluster. I know that Spark can read multiple text files from a directory > and convert into RDDs for further processing. My input data is not only in > text files, but in a multitude of different file formats. >

Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Mark Hamstra
On Tue, May 23, 2017 at 7:48 AM, Xiangyu Li wrote: > Thank you for the answer. > > So basically it is not recommended to install Spark to your local maven > repository? I thought if they wanted to enforce scalastyle for better open > source contributions, they would have

Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Xiangyu Li
Thank you for the answer. So basically it is not recommended to install Spark to your local maven repository? I thought if they wanted to enforce scalastyle for better open source contributions, they would have fixed all the scalastyle warnings. On a side note, my posts on Nabble never got

user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
From: Arun [mailto:arunbm...@gmail.com] Sent: Saturday, May 20, 2017 9:48 PM To: user@spark.apache.org Subject: Rmse recomender system Hi all, I am new to machine learning. I am working on a recommender system. For the training dataset RMSE is 0.08, while on test data it is

user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
From: Abir Chakraborty [mailto:abi...@247-inc.com] Sent: Sunday, May 21, 2017 4:17 AM To: user@spark.apache.org Subject: unsubscribe unsubscribe

user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
From: Bibudh Lahiri [mailto:bibudhlah...@gmail.com] Sent: Sunday, May 21, 2017 9:34 AM To: user Subject: unsubscribe unsubscribe

user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
user-unsubscr...@spark.apache.org From: 萝卜丝炒饭 [mailto:1427357...@qq.com] Sent: Sunday, May 21, 2017 8:15 PM To: user Subject: Are tachyon and akka removed from 2.1.1 please Hi all, I read some papers about the source code; the papers are based on version 1.2. They refer the

Dependencies for starting Master / Worker in maven

2017-05-23 Thread Jens Teglhus Møller
Hi, I just joined a project that runs on spark-1.6.1 and I have no prior Spark experience. The project build is quite fragile when it comes to runtime dependencies. Often the project builds fine, but after deployment we end up with ClassNotFoundExceptions or NoSuchMethodErrors when submitting a

How to generate stage for this RDD DAG please?

2017-05-23 Thread 萝卜丝炒饭
Hi all, I read some papers about stages, and I know about narrow dependencies and shuffle dependencies. For the RDD DAG below, how does Spark generate the stage DAG, please? And is this RDD DAG legal, please?

Re: OptionalDataException during Naive Bayes Training

2017-05-23 Thread elitejyo
Hi Xiangrui, We are also getting the same exception while running our Spark application in both local mode and distributed mode. Do you have any insights on how to fix this? Any help is highly appreciated. TIA!

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-23 Thread 萝卜丝炒饭
thanks gromakowski and chin wei. ---Original--- From: "vincent gromakowski" Date: 2017/5/23 00:54:33 To: "Chin Wei Low"; Cc: "user";"??"<1427357...@qq.com>;"Gene Pang"; Subject: Re: Are

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-23 Thread 萝卜丝炒饭
thanks Gene. ---Original--- From: "Gene Pang" Date: 2017/5/22 22:19:47 To: "??"<1427357...@qq.com>; Cc: "user"; Subject: Re: Are tachyon and akka removed from 2.1.1 please Hi, Tachyon has been renamed to Alluxio. Here is the documentation

Re: Spark Launch programatically - Basics!

2017-05-23 Thread vimal dinakaran
We are using the code below for integration tests. You need to wait for the process state. .startApplication( new Listener { override def infoChanged(handle: SparkAppHandle): Unit = { println("*** info changed * ", handle.getAppId, handle.getState)

Custom function cannot be accessed across database

2017-05-23 Thread 李斌松
Custom functions cannot be accessed across databases. Example: the function json_extract_value is registered in database A, and A.json_extract_value cannot be called from database B. SessionCatalog.java externalCatalog.getFunction(currentDb, name.funcName) to