Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
Hi,

Sorry, it's not clear to me whether you want help moving the data to the cluster
or with defining the best structure for your files on the cluster for
efficient processing. Are you on standalone or using HDFS?

On Tuesday, May 23, 2017, docdwarf  wrote:

> tesmai4 wrote
> > I am converting my Java based NLP parser to execute it on my Spark
> > cluster.  I know that Spark can read multiple text files from a directory
> > and convert into RDDs for further processing. My input data is not only
> > in text files, but in a multitude of different file formats.
> >
> > My question is: How can I efficiently read the input files
> > (PDF/Text/Word/HTML) in my Java based Spark program for processing these
> > files in Spark cluster.
>
> I will suggest flume. Flume is a distributed, reliable, and available
> service for efficiently collecting, aggregating, and moving large amounts
> of log data.
>
> I will also mention kafka. Kafka is a distributed streaming platform.
>
> It is also popular to use both flume and kafka together (flafka:
> flume-meets-apache-kafka-for-event-processing/).
>
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-PDF-text-word-file-efficiently-with-Spark-tp28699p28705.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>
>

-- 
Thanks,
Sonal
Nube Technologies 




Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Ah, that's right. I didn't mention it: I have 10 executors in my cluster,
and so when I do .coalesce(10) right before saving the orc to s3 - does
coalescing really affect parallelism? To me it looks like no, because we
went from 100 tasks executed in parallel by 10 executors to 10 tasks
executed by the same 10 executors. Now, I understand that there may be some
data skew that may result in uneven partitions, but that's not really my
case (according to the Spark UI).
Again, I'm trying to understand first of all how coalescing a dataset
impacts executor memory, gc, etc. Maybe if coalesce is done before writing
the dataset, each of the resulting partitions needs to be evaluated and
thus stored in memory? - just a guess.

Andrii

2017-05-23 23:42 GMT+03:00 John Compitello :

> Spark is doing operations on each partition in parallel. If you decrease
> the number of partitions, you’re potentially doing less work in parallel
> depending on your cluster setup.
>
> On May 23, 2017, at 4:23 PM, Andrii Biletskyi wrote:
>
>
> No, I didn't try to use repartition, how exactly does it impact the
> parallelism?
> In my understanding coalesce simply "unions" multiple partitions located
> on the same executor "one on top of the other", while repartition does a
> hash-based shuffle decreasing the number of output partitions. So how
> exactly does this affect the parallelism, and at which stage of the job?
>
> Thanks,
> Andrii
>
>
>
> On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>
> coalesce is nice because it does not shuffle, but the consequence of
> avoiding a shuffle is it will also reduce parallelism of the preceding
> computation.  Have you tried using repartition instead?
>
> On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi <
> andrii.bilets...@yahoo.com.invalid> wrote:
>
> Hi all,
>
> I'm trying to understand the impact of the coalesce operation on spark job
> performance.
>
> As a side note: we are using emrfs (i.e. aws s3) as source and target
> for the job.
>
> Omitting unnecessary details, the job can be explained as: join a
> 200M-record Dataframe stored in orc format on emrfs with another
> 200M-record cached Dataframe, with the result of the join put back to
> emrfs. The first DF is a set of wide rows (Spark UI shows 300 GB) and the
> second is relatively small (Spark shows 20 GB).
>
> I have enough resources in my cluster to perform the job, but I don't like
> the fact that the output datasource contains 200 part orc files (as
> spark.sql.shuffle.partitions defaults to 200), so before saving orc to
> emrfs I'm doing .coalesce(10). From the documentation coalesce looks like
> a quite harmless operation: no repartitioning etc.
>
> But with this setup my job fails to write the dataset on the last stage.
> Right now the error is OOM: GC overhead. When I change .coalesce(10) to
> .coalesce(100) the job runs much faster and finishes without errors.
>
> So what's the impact of .coalesce in this case? And how can I do in-place
> concatenation of files (not involving hive) to end up with a smaller
> number of bigger files? With .coalesce(100) the job generates 100 orc
> snappy encoded files ~300MB each.
>
> Thanks,
> Andrii
>
>
>
>
>
>


Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread John Compitello
Spark is doing operations on each partition in parallel. If you decrease the
number of partitions, you’re potentially doing less work in parallel depending on your
cluster setup. 

> On May 23, 2017, at 4:23 PM, Andrii Biletskyi 
>  wrote:
> 
>  
> No, I didn't try to use repartition, how exactly does it impact the parallelism?
> In my understanding coalesce simply "unions" multiple partitions located on the
> same executor "one on top of the other", while repartition does a hash-based
> shuffle decreasing the number of output partitions. So how exactly does this
> affect the parallelism, and at which stage of the job?
> 
> Thanks,
> Andrii
> 
> 
> 
> On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust  
> wrote:
> 
> 
> coalesce is nice because it does not shuffle, but the consequence of avoiding 
> a shuffle is it will also reduce parallelism of the preceding computation.  
> Have you tried using repartition instead?
> 
> On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi 
>  > wrote:
> Hi all,
>
> I'm trying to understand the impact of the coalesce operation on spark job
> performance.
>
> As a side note: we are using emrfs (i.e. aws s3) as source and target
> for the job.
>
> Omitting unnecessary details, the job can be explained as: join a
> 200M-record Dataframe stored in orc format on emrfs with another
> 200M-record cached Dataframe, with the result of the join put back to
> emrfs. The first DF is a set of wide rows (Spark UI shows 300 GB) and the
> second is relatively small (Spark shows 20 GB).
>
> I have enough resources in my cluster to perform the job, but I don't like
> the fact that the output datasource contains 200 part orc files (as
> spark.sql.shuffle.partitions defaults to 200), so before saving orc to
> emrfs I'm doing .coalesce(10). From the documentation coalesce looks like
> a quite harmless operation: no repartitioning etc.
>
> But with this setup my job fails to write the dataset on the last stage.
> Right now the error is OOM: GC overhead. When I change .coalesce(10) to
> .coalesce(100) the job runs much faster and finishes without errors.
>
> So what's the impact of .coalesce in this case? And how can I do in-place
> concatenation of files (not involving hive) to end up with a smaller
> number of bigger files? With .coalesce(100) the job generates 100 orc
> snappy encoded files ~300MB each.
>
> Thanks,
> Andrii
> 
> 
> 



Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
No, I didn't try to use repartition, how exactly does it impact the parallelism?
In my understanding coalesce simply "unions" multiple partitions located on the
same executor "one on top of the other", while repartition does a hash-based
shuffle decreasing the number of output partitions. So how exactly does this
affect the parallelism, and at which stage of the job?

Thanks,
Andrii
 

On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust 
 wrote:
 

 coalesce is nice because it does not shuffle, but the consequence of avoiding 
a shuffle is it will also reduce parallelism of the preceding computation.  
Have you tried using repartition instead?
On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi 
 wrote:

Hi all,

I'm trying to understand the impact of the coalesce operation on spark job
performance.

As a side note: we are using emrfs (i.e. aws s3) as source and target for
the job.

Omitting unnecessary details, the job can be explained as: join a 200M-record
Dataframe stored in orc format on emrfs with another 200M-record cached
Dataframe, with the result of the join put back to emrfs. The first DF is a set
of wide rows (Spark UI shows 300 GB) and the second is relatively small (Spark
shows 20 GB).

I have enough resources in my cluster to perform the job, but I don't like the
fact that the output datasource contains 200 part orc files (as
spark.sql.shuffle.partitions defaults to 200), so before saving orc to emrfs
I'm doing .coalesce(10). From the documentation coalesce looks like a quite
harmless operation: no repartitioning etc.

But with this setup my job fails to write the dataset on the last stage. Right
now the error is OOM: GC overhead. When I change .coalesce(10) to
.coalesce(100) the job runs much faster and finishes without errors.

So what's the impact of .coalesce in this case? And how can I do in-place
concatenation of files (not involving hive) to end up with a smaller number of
bigger files? With .coalesce(100) the job generates 100 orc snappy encoded
files ~300MB each.

Thanks,
Andrii



   

Spark Application hangs without trigger SparkShutdownHook

2017-05-23 Thread Xiaoye Sun
Hi all,

I am running a Spark (v1.6.1) application using the ./bin/spark-submit
script. I made some changes to the HttpBroadcast module. However, after the
application finishes completely, the spark master program hangs at the end
of the application. The ShutdownHook is supposed to be called at this
point.

I am wondering what is the condition to trigger the
SparkShutdownHook.runAll() and where is the related code.


2017/05/23 14:47:26.030 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
tasks have all completed, from pool
2017/05/23 14:47:26.030 INFO DAGScheduler: ResultStage 6 (saveAsTextFile at
package.scala:169) finished in 1.180 s
2017/05/23 14:47:26.030 DEBUG DAGScheduler: After removal of stage 5,
remaining stages = 1
2017/05/23 14:47:26.030 DEBUG DAGScheduler: After removal of stage 6,
remaining stages = 0
2017/05/23 14:47:26.030 INFO DAGScheduler: Job 3 finished: saveAsTextFile
at package.scala:169, took 1.265678 s
*(***the following should have happened, but the master hangs here without
calling the runAll in SparkShutdownHook***)*
2017/05/23 14:47:26.065 INFO SparkShutdownHookManager: runAll is called
2017/05/23 14:47:26.068 INFO SparkContext: Invoking stop() from shutdown
hook
2017/05/23 14:47:26.156 INFO SparkUI: Stopped Spark web UI at
http://192.168.50.127:4040


Thanks!
Xiaoye


Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
No, I didn't try to use repartition, how exactly does it impact the parallelism?
In my understanding coalesce simply "unions" multiple partitions located on the
same executor "one on top of the other", while repartition does a hash-based
shuffle decreasing the number of output partitions. So how exactly does this
affect the parallelism, and at which stage of the job?

Thanks,
Andrii

2017-05-23 22:19 GMT+03:00 Michael Armbrust :

> coalesce is nice because it does not shuffle, but the consequence of
> avoiding a shuffle is it will also reduce parallelism of the preceding
> computation.  Have you tried using repartition instead?
>
> On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi <
> andrii.bilets...@yahoo.com.invalid> wrote:
>
>> Hi all,
>>
>> I'm trying to understand the impact of the coalesce operation on spark job
>> performance.
>>
>> As a side note: we are using emrfs (i.e. aws s3) as source and target
>> for the job.
>>
>> Omitting unnecessary details, the job can be explained as: join a
>> 200M-record Dataframe stored in orc format on emrfs with another
>> 200M-record cached Dataframe, with the result of the join put back to
>> emrfs. The first DF is a set of wide rows (Spark UI shows 300 GB) and the
>> second is relatively small (Spark shows 20 GB).
>>
>> I have enough resources in my cluster to perform the job, but I don't like
>> the fact that the output datasource contains 200 part orc files (as
>> spark.sql.shuffle.partitions defaults to 200), so before saving orc to
>> emrfs I'm doing .coalesce(10). From the documentation coalesce looks like
>> a quite harmless operation: no repartitioning etc.
>>
>> But with this setup my job fails to write the dataset on the last stage.
>> Right now the error is OOM: GC overhead. When I change .coalesce(10) to
>> .coalesce(100) the job runs much faster and finishes without errors.
>>
>> So what's the impact of .coalesce in this case? And how can I do in-place
>> concatenation of files (not involving hive) to end up with a smaller
>> number of bigger files? With .coalesce(100) the job generates 100 orc
>> snappy encoded files ~300MB each.
>>
>> Thanks,
>> Andrii
>>
>
>


Re: Are there any Kafka forEachSink examples?

2017-05-23 Thread kant kodali
Thanks a lot Michael! I am not sure why Google search doesn't take me to
the databricks blog when I type in relevant keywords on various things.
Perhaps the blog needs some metadata for the search engine to index, or
Google is more focused on Ads than relevant docs?!



On Tue, May 23, 2017 at 12:17 PM, Michael Armbrust 
wrote:

> There is an example in this post:
>
> https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
>
> On Tue, May 23, 2017 at 11:35 AM, kant kodali  wrote:
>
>> Hi All,
>>
>> Are there any Kafka forEachSink examples preferably in Java but Scala is
>> fine too?
>>
>> Thanks!
>>
>
>


Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Michael Armbrust
coalesce is nice because it does not shuffle, but the consequence of
avoiding a shuffle is it will also reduce parallelism of the preceding
computation.  Have you tried using repartition instead?

On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi <
andrii.bilets...@yahoo.com.invalid> wrote:

> Hi all,
>
> I'm trying to understand the impact of the coalesce operation on spark job
> performance.
>
> As a side note: we are using emrfs (i.e. aws s3) as source and target
> for the job.
>
> Omitting unnecessary details, the job can be explained as: join a
> 200M-record Dataframe stored in orc format on emrfs with another
> 200M-record cached Dataframe, with the result of the join put back to
> emrfs. The first DF is a set of wide rows (Spark UI shows 300 GB) and the
> second is relatively small (Spark shows 20 GB).
>
> I have enough resources in my cluster to perform the job, but I don't like
> the fact that the output datasource contains 200 part orc files (as
> spark.sql.shuffle.partitions defaults to 200), so before saving orc to
> emrfs I'm doing .coalesce(10). From the documentation coalesce looks like
> a quite harmless operation: no repartitioning etc.
>
> But with this setup my job fails to write the dataset on the last stage.
> Right now the error is OOM: GC overhead. When I change .coalesce(10) to
> .coalesce(100) the job runs much faster and finishes without errors.
>
> So what's the impact of .coalesce in this case? And how can I do in-place
> concatenation of files (not involving hive) to end up with a smaller
> number of bigger files? With .coalesce(100) the job generates 100 orc
> snappy encoded files ~300MB each.
>
> Thanks,
> Andrii
>
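
As a rough illustration of the difference (a sketch only, not code from this
thread; it assumes the Spark 2.x SparkSession API, and the paths and join key
are made up):

import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

    val big    = spark.read.orc("s3://bucket/big")            // large DF of wide rows
    val small  = spark.read.orc("s3://bucket/small").cache()  // smaller cached DF
    val joined = big.join(small, Seq("id"))                   // by default the shuffle produces
                                                              // spark.sql.shuffle.partitions (200) partitions

    // coalesce(10) avoids an extra shuffle, but the reduced partition count is
    // pushed upstream: the post-shuffle join-and-write stage runs as only 10
    // tasks, each holding roughly 20x more data than one of the 200 tasks would.
    joined.coalesce(10).write.orc("s3://bucket/out-coalesce")

    // repartition(10) adds one more shuffle after the join, so the join keeps
    // its full parallelism and only the final write runs as 10 tasks.
    joined.repartition(10).write.orc("s3://bucket/out-repartition")

    spark.stop()
  }
}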


Re: Are there any Kafka forEachSink examples?

2017-05-23 Thread Michael Armbrust
There is an example in this post:

https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html

On Tue, May 23, 2017 at 11:35 AM, kant kodali  wrote:

> Hi All,
>
> Are there any Kafka forEachSink examples preferably in Java but Scala is
> fine too?
>
> Thanks!
>
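
For reference, a minimal sketch of a foreach sink that writes to Kafka (a
sketch only; the topic name, broker address and the (String, String) record
shape are assumptions, and the kafka-clients producer must be on the classpath):

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

class KafkaForeachWriter(brokers: String, topic: String)
    extends ForeachWriter[(String, String)] {

  private var producer: KafkaProducer[String, String] = _

  // Called once per partition per trigger; returning true means "process it".
  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  // Called for every row in the partition.
  override def process(record: (String, String)): Unit =
    producer.send(new ProducerRecord(topic, record._1, record._2))

  // Called at the end of the partition (errorOrNull is null on success).
  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

// Usage, assuming a streaming Dataset[(String, String)] named pairs:
// pairs.writeStream.foreach(new KafkaForeachWriter("broker:9092", "out-topic")).start()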


Re: 2.2. release date ?

2017-05-23 Thread Michael Armbrust
Mark is right.  I will cut another RC as soon as the known issues are
resolved.  In the meantime it would be very helpful for people to test RC2
and report issues.

On Tue, May 23, 2017 at 11:10 AM, Mark Hamstra 
wrote:

> I heard that once we reach release candidates it's not a question of time
> or a target date, but only whether blockers are resolved and the code is
> ready to release.
>
> On Tue, May 23, 2017 at 11:07 AM, kant kodali  wrote:
>
>> Heard it's the end of this month (May)
>>
>> On Tue, May 23, 2017 at 9:41 AM, mojhaha kiklasds <
>> sesca.syst...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I could see a RC2 candidate for Spark 2.2, but not sure about the
>>> expected release timeline on that.
>>> Would be great if somebody can confirm it.
>>>
>>> Thanks,
>>> Mhojaha
>>>
>>
>>
>


Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Hi all,

I'm trying to understand the impact of the coalesce operation on spark job
performance.

As a side note: we are using emrfs (i.e. aws s3) as source and target
for the job.

Omitting unnecessary details, the job can be explained as: join a
200M-record Dataframe stored in orc format on emrfs with another
200M-record cached Dataframe, with the result of the join put back to
emrfs. The first DF is a set of wide rows (Spark UI shows 300 GB) and the
second is relatively small (Spark shows 20 GB).

I have enough resources in my cluster to perform the job, but I don't like
the fact that the output datasource contains 200 part orc files (as
spark.sql.shuffle.partitions defaults to 200), so before saving orc to
emrfs I'm doing .coalesce(10). From the documentation coalesce looks like
a quite harmless operation: no repartitioning etc.

But with this setup my job fails to write the dataset on the last stage.
Right now the error is OOM: GC overhead. When I change .coalesce(10) to
.coalesce(100) the job runs much faster and finishes without errors.

So what's the impact of .coalesce in this case? And how can I do in-place
concatenation of files (not involving hive) to end up with a smaller
number of bigger files? With .coalesce(100) the job generates 100 orc
snappy encoded files ~300MB each.

Thanks,
Andrii


Are there any Kafka forEachSink examples?

2017-05-23 Thread kant kodali
Hi All,

Are there any Kafka forEachSink examples preferably in Java but Scala is
fine too?

Thanks!


Re: 2.2. release date ?

2017-05-23 Thread Mark Hamstra
I heard that once we reach release candidates it's not a question of time
or a target date, but only whether blockers are resolved and the code is
ready to release.

On Tue, May 23, 2017 at 11:07 AM, kant kodali  wrote:

> Heard it's the end of this month (May)
>
> On Tue, May 23, 2017 at 9:41 AM, mojhaha kiklasds  > wrote:
>
>> Hello,
>>
>> I could see a RC2 candidate for Spark 2.2, but not sure about the
>> expected release timeline on that.
>> Would be great if somebody can confirm it.
>>
>> Thanks,
>> Mhojaha
>>
>
>


Re: 2.2. release date ?

2017-05-23 Thread kant kodali
Heard it's the end of this month (May)

On Tue, May 23, 2017 at 9:41 AM, mojhaha kiklasds 
wrote:

> Hello,
>
> I could see a RC2 candidate for Spark 2.2, but not sure about the expected
> release timeline on that.
> Would be great if somebody can confirm it.
>
> Thanks,
> Mhojaha
>


Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-23 Thread Manish Malhotra
Thanks !

On Mon, May 22, 2017 at 5:58 PM kant kodali  wrote:

> Well there are few things here.
>
> 1. What is the Spark Version?
>
cdh 1.6

> 2. You said there is an OOM error, but what is the cause that appears in the
> log message or stack trace? OOM can happen for various reasons and JVM
> usually specifies the cause in the error message.
>
GC heap reached. Will send some logs as well.

>
> 3. What is the driver and executor memory?
>
Driver : 4g
Executor: 40g

> 4. What is the receive throughput per second and what is the data size of
> an average message?
>
Msg size : 2KB
1/sec per receiver. Running 2 receivers.

> 5. What OS are you using?
>

Red hat Linux.

> StorageLevel.MEMORY_AND_DISK_SER_2 - this means that after the receiver
> receives the data, it is replicated across worker nodes.
>
Yes, but after a batch is finished, or after a few batches, shouldn't the
receiver and worker nodes discard the old data?


>
>
>
> On Mon, May 22, 2017 at 5:20 PM, Manish Malhotra <
> manish.malhotra.w...@gmail.com> wrote:
>
>> thanks Alonso,
>>
>> Sorry, but there are some security reservations.
>>
>> But we can assume the receiver is equivalent to writing a JMS based
>> custom receiver, where we register a message listener, and each message
>> delivered by JMS is stored by calling the store method from the listener.
>>
>>
>> Something like :
>> https://github.com/tbfenet/spark-jms-receiver/blob/master/src/main/scala/org/apache/spark/streaming/jms/JmsReceiver.scala
>>
>> The difference is that this JMS receiver is using a block generator and
>> then calling store.
>> I'm calling store when I receive a message.
>> And also I'm not using a block generator.
>> Not sure if that is something that would make the memory balloon up.
>>
>> Btw I also ran the same message consumer code standalone and never saw
>> this memory issue.
>>
>> On Sun, May 21, 2017 at 10:20 AM, Alonso Isidoro Roman <
>> alons...@gmail.com> wrote:
>>
>>> could you share the code?
>>>
>>> Alonso Isidoro Roman
>>> https://about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2017-05-20 7:54 GMT+02:00 Manish Malhotra <
>>> manish.malhotra.w...@gmail.com>:
>>>
 Hello,

 I have implemented a Java based custom receiver, which consumes from a
 messaging system, say JMS.
 Once a message is received, I call store(object) ... I'm storing a spark
 Row object.

 It runs for around 8 hrs and then goes OOM, and the OOM is happening on
 the receiver nodes.
 I also tried to run multiple receivers to distribute the load, but I face
 the same issue.

 There is something fundamental we are doing wrong that should tell the
 custom receiver/spark to release the memory,
 but I'm not able to crack that, at least till now.

 any help is appreciated !!

 Regards,
 Manish


>>>
>>
>
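
For context, a minimal sketch of the store()-per-message pattern being
described (a sketch only against the Spark 1.6 receiver API; pollBroker and
the connection details are hypothetical stand-ins for the real JMS client,
not the code from this thread):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class QueueReceiver(brokerUrl: String, queue: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart itself does not block.
    new Thread("queue-receiver") {
      override def run(): Unit = receiveLoop()
    }.start()
  }

  override def onStop(): Unit = {
    // Close the broker connection here so the receiving thread can exit.
  }

  private def receiveLoop(): Unit = {
    while (!isStopped()) {
      val msg = pollBroker()  // hypothetical blocking call to the messaging system
      store(msg)              // hand the message to Spark one item at a time
    }
  }

  private def pollBroker(): String = ???  // placeholder for the real client call
}

Either way (with or without a block generator), store() hands the received
data to Spark at the chosen storage level; the receiver itself should not keep
its own references to delivered messages.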


2.2. release date ?

2017-05-23 Thread mojhaha kiklasds
Hello,

I could see a RC2 candidate for Spark 2.2, but not sure about the expected
release timeline on that.
Would be great if somebody can confirm it.

Thanks,
Mhojaha


Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread docdwarf
tesmai4 wrote
> I am converting my Java based NLP parser to execute it on my Spark
> cluster.  I know that Spark can read multiple text files from a directory
> and convert into RDDs for further processing. My input data is not only in
> text files, but in a multitude of different file formats. 
> 
> My question is: How can I efficiently read the input files
> (PDF/Text/Word/HTML) in my Java based Spark program for processing these
> files in Spark cluster.

I will suggest flume. Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving large amounts of log data.

I will also mention kafka. Kafka is a distributed streaming platform.

It is also popular to use both flume and kafka together (flafka).
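
On the Spark side itself, one possible starting point (a sketch only, not from
this thread): sc.binaryFiles reads each file whole as bytes, which can then be
handed to an existing parser. parseDocument below is a hypothetical stand-in
for the poster's NLP/format parser (for example, something wrapping a
PDF/Word/HTML library).

import org.apache.spark.{SparkConf, SparkContext}

object MultiFormatRead {

  // Hypothetical stand-in: dispatch on the file name/extension to a real parser.
  def parseDocument(fileName: String, bytes: Array[Byte]): String =
    new String(bytes, "UTF-8")  // placeholder only

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-format-read"))

    // binaryFiles yields one (path, PortableDataStream) pair per input file.
    val docs = sc.binaryFiles("hdfs:///input/docs/*")
      .map { case (path, stream) => (path, parseDocument(path, stream.toArray())) }

    docs.saveAsTextFile("hdfs:///output/parsed")
    sc.stop()
  }
}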






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-PDF-text-word-file-efficiently-with-Spark-tp28699p28705.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Mark Hamstra
On Tue, May 23, 2017 at 7:48 AM, Xiangyu Li  wrote:

> Thank you for the answer.
>
> So basically it is not recommended to install Spark to your local maven
> repository? I thought if they wanted to enforce scalastyle for better open
> source contributions, they would have fixed all the scalastyle warnings.
>

That isn't a valid conclusion. There is nothing wrong with using maven's
"install" with Spark. There shouldn't be any scalastyle violations.


> On a side note, my posts on Nabble never got accepted by the mailing list
> for some reason (I am subscribed to the mail list), and your reply does not
> show as a reply to my question on Nabble probably for the same reason.
> Sorry for the late reply but is using email the only way to communicate on
> the mail list? I got another reply to this question through email but the
> two replies are not even in the same "email conversation".
>

I don't know the mechanics of why posts do or don't show up via Nabble, but
Nabble is neither the canonical archive nor the system of record for Apache
mailing lists.


> On Thu, May 4, 2017 at 8:11 PM, Mark Hamstra 
> wrote:
>
>> The check goal of the scalastyle plugin runs during the "verify" phase,
>> which is between "package" and "install"; so running just "package" will
>> not run scalastyle:check.
>>
>> On Thu, May 4, 2017 at 7:45 AM, yiskylee  wrote:
>>
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>>> package
>>> works, but
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>>> install
>>> triggers scalastyle violation error.
>>>
>>> Is the scalastyle check not used on package but only on install? To
>>> install,
>>> should I turn off "failOnViolation" in the pom?
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28653.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Sincerely
> Xiangyu Li
>
> 
>


Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Xiangyu Li
Thank you for the answer.

So basically it is not recommended to install Spark to your local maven
repository? I thought if they wanted to enforce scalastyle for better open
source contributions, they would have fixed all the scalastyle warnings.

On a side note, my posts on Nabble never got accepted by the mailing list
for some reason (I am subscribed to the mail list), and your reply does not
show as a reply to my question on Nabble probably for the same reason.
Sorry for the late reply but is using email the only way to communicate on
the mail list? I got another reply to this question through email but the
two replies are not even in the same "email conversation".

On Thu, May 4, 2017 at 8:11 PM, Mark Hamstra 
wrote:

> The check goal of the scalastyle plugin runs during the "verify" phase,
> which is between "package" and "install"; so running just "package" will
> not run scalastyle:check.
>
> On Thu, May 4, 2017 at 7:45 AM, yiskylee  wrote:
>
>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>> package
>> works, but
>> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>> install
>> triggers scalastyle violation error.
>>
>> Is the scalastyle check not used on package but only on install? To
>> install,
>> should I turn off "failOnViolation" in the pom?
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28653.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
Sincerely
Xiangyu Li




Re: scalastyle violation on mvn install but not on mvn package

2017-05-23 Thread Xiangyu Li
Thank you for the answer.

So basically it is not recommended to install Spark to your local maven
repository? I thought if they wanted to enforce scalastyle for better open
source contributions, they would have fixed all the scalastyle warnings.

On a side note, my posts on Nabble never got accepted by the mailing list
for some reason (I am subscribed to the mail list), and your reply does not
show as a reply to my question on Nabble probably for the same reason.
Sorry for the late reply but is using email the only way to communicate on
the mail list? I got another reply to this question through email but the
two replies are not even in the same "email conversation".

On Wed, May 17, 2017 at 8:48 PM, Marcelo Vanzin  wrote:

> scalastyle runs on the "verify" phase, which is after package but
> before install.
>
> On Wed, May 17, 2017 at 5:47 PM, yiskylee  wrote:
> > ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> > package
> > works, but
> > ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> > install
> > triggers scalastyle violation error.
> >
> > Is the scalastyle check not used on package but only on install? To
> install,
> > should I turn off "failOnViolation" in the pom?
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28693.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>



-- 
Sincerely
Xiangyu Li




user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
 

 

From: Arun [mailto:arunbm...@gmail.com] 
Sent: Saturday, May 20, 2017 9:48 PM
To: user@spark.apache.org
Subject: Rmse recommender system

 

 

hi all..

 

I am new to machine learning.

 

I am working on a recommender system. For the training dataset the RMSE is 0.08, while on
the test data it is 2.345.

 

What's the conclusion, and what steps can I take to improve?

 

 

 

Sent from Samsung tablet



user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
 

 

From: Abir Chakraborty [mailto:abi...@247-inc.com] 
Sent: Sunday, May 21, 2017 4:17 AM
To: user@spark.apache.org
Subject: unsubscribe

 

unsubscribe

 

 

 



user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
 

 

From: Bibudh Lahiri [mailto:bibudhlah...@gmail.com] 
Sent: Sunday, May 21, 2017 9:34 AM
To: user 
Subject: unsubscribe

 

unsubscribe  



user-unsubscr...@spark.apache.org

2017-05-23 Thread williamtellme123
user-unsubscr...@spark.apache.org

 

From: 萝卜丝炒饭 [mailto:1427357...@qq.com] 
Sent: Sunday, May 21, 2017 8:15 PM
To: user 
Subject: Are tachyon and akka removed from 2.1.1 please

 

HI all,

I read some papers about the source code; the papers are based on version 1.2 and
refer to tachyon and akka. When I read the 2.1 code, I cannot find the code for
akka and tachyon.

 

Are tachyon and akka removed from 2.1.1 please?



Dependencies for starting Master / Worker in maven

2017-05-23 Thread Jens Teglhus Møller
Hi

I just joined a project that runs on spark-1.6.1 and I have no prior spark
experience.

The project build is quite fragile when it comes to runtime dependencies.
Often the project builds fine but after deployment we end up with
ClassNotFoundException's or NoSuchMethodError's when submitting a job.

To catch these issues early, I'm trying to set up integration tests
with maven. In the pre-integration-test phase I would like to start up a master
and a worker (using process-exec-maven-plugin).

I have managed to get it working for spark 1.6.1 (against a downloaded
spark distribution), but would prefer to be able to download all the
required jars as maven dependencies. Is there a relatively simple way to
get all the required dependencies? It is ok if it's only for 2.x since we
are planning to migrate.

I would prefer to do this without docker.

Has anyone done something similar already or is there a simpler way?

Best regards Jens


How to generate stage for this RDD DAG please?

2017-05-23 Thread 萝卜丝炒饭
Hi all,


I read some papers about stages, and I know about narrow dependencies and shuffle
dependencies.


About the RDD DAG below (the attached image did not come through in the archive),
how does spark generate the stage DAG please?
And is this RDD DAG legal please?
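
A generic sketch of how stage boundaries fall (not the DAG from the
attachment): narrow dependencies are pipelined into one stage, and each
shuffle dependency starts a new one.

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[*]"))

    val words  = sc.textFile("input.txt").flatMap(_.split("\\s+"))  // narrow dependency: pipelined
    val pairs  = words.map(w => (w, 1))                             // narrow dependency: still the same stage
    val counts = pairs.reduceByKey(_ + _)                           // shuffle dependency: a new stage starts here

    // The action makes the DAGScheduler walk the lineage backwards and cut it
    // at the shuffle, so this job runs as two stages.
    counts.saveAsTextFile("out")
    sc.stop()
  }
}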

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: OptionalDataException during Naive Bayes Training

2017-05-23 Thread elitejyo
Hi Xiangrui,

We are also getting same exception while running our Spark application both
in local mode and distributed mode.

Do you have any insights on how to fix this?
Any help is highly appreciated.
TIA!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OptionalDataException-during-Naive-Bayes-Training-tp21059p28704.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Are tachyon and akka removed from 2.1.1 please

2017-05-23 Thread 萝卜丝炒饭
thanks gromakowski and chin wei.


 
---Original---
From: "vincent gromakowski"
Date: 2017/5/23 00:54:33
To: "Chin Wei Low";
Cc: "user";"??"<1427357...@qq.com>;"Gene 
Pang";
Subject: Re: Are tachyon and akka removed from 2.1.1 please


Akka has been replaced by netty in 1.6

Le 22 mai 2017 15:25, "Chin Wei Low"  a écrit :
I think akka has been removed since 2.0.

On 22 May 2017 10:19 pm, "Gene Pang"  wrote:
Hi,

Tachyon has been renamed to Alluxio. Here is the documentation for running 
Alluxio with Spark.


Hope this helps,
Gene


On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
HI all,
I read some papers about the source code; the papers are based on version 1.2 and
refer to tachyon and akka. When I read the 2.1 code, I cannot find the code for
akka and tachyon.


Are tachyon and akka removed from 2.1.1 please?
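
For completeness, a minimal sketch of using Alluxio from Spark through its
alluxio:// scheme (the master hostname and paths are assumptions; 19998 is the
default Alluxio master port, and the Alluxio client jar has to be on the Spark
classpath):

import org.apache.spark.{SparkConf, SparkContext}

object AlluxioExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-example"))

    // Read and write through the Hadoop-compatible alluxio:// filesystem scheme.
    val lines = sc.textFile("alluxio://alluxio-master:19998/input/data.txt")
    lines.map(_.toUpperCase)
         .saveAsTextFile("alluxio://alluxio-master:19998/output/data-upper")

    sc.stop()
  }
}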

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-23 Thread 萝卜丝炒饭
thanks Gene.


 
---Original---
From: "Gene Pang"
Date: 2017/5/22 22:19:47
To: "??"<1427357...@qq.com>;
Cc: "user";
Subject: Re: Are tachyon and akka removed from 2.1.1 please


Hi,

Tachyon has been renamed to Alluxio. Here is the documentation for running 
Alluxio with Spark.


Hope this helps,
Gene


On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
HI all,
I read some papers about the source code; the papers are based on version 1.2 and
refer to tachyon and akka. When I read the 2.1 code, I cannot find the code for
akka and tachyon.


Are tachyon and akka removed from 2.1.1 please?

Re: Spark Launch programatically - Basics!

2017-05-23 Thread vimal dinakaran
We are using the below code for integration tests. You need to wait for
the process state.
.startApplication(
new Listener {
  override def infoChanged(handle: SparkAppHandle): Unit = {
println("*** info changed * ", handle.getAppId,
handle.getState)
  }

  override def stateChanged(handle: SparkAppHandle): Unit = {
println("*** state changed *", handle.getAppId,
handle.getState)
  }
})

// Initial state goes to unknown
// To avoid the UNKNOWN state check below.
Thread.sleep(1);

def waitTillComplete(handler: SparkAppHandle): Unit = {
while (!handler.getState.isFinal && handler.getState !=
SparkAppHandle.State.UNKNOWN) {
  println("State :%s".format(handler.getState()))
  Thread.sleep(5000)
}
  }

On Thu, May 18, 2017 at 2:17 AM, Nipun Arora 
wrote:

> Hi,
>
> I am trying to get a simple spark application to run programmatically. I
> looked at http://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/launcher/package-summary.html,
> at the following code.
>
>public class MyLauncher {
>  public static void main(String[] args) throws Exception {
>SparkAppHandle handle = new SparkLauncher()
>  .setAppResource("/my/app.jar")
>  .setMainClass("my.spark.app.Main")
>  .setMaster("local")
>  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
>  .startApplication();
>// Use handle API to monitor / control application.
>  }
>}
>
>
> I don't have any errors in running this for my application, but I am
> running spark in local mode and the launcher class immediately exits after
> executing this function. Are we supposed to wait for the process state etc.
>
> Is there a more detailed example of how to monitor inputstreams etc. any
> github link or blogpost would help.
>
> Thanks
> Nipun
>


Custom function cannot be accessed across database

2017-05-23 Thread 李斌松
A custom function cannot be accessed across databases.
For example: the function json_extract_value is registered in database A, and
A.json_extract_value cannot be called from database B.

In SessionCatalog.scala,

externalCatalog.getFunction(currentDb, name.funcName)

could be changed to

externalCatalog.getFunction(name.database.getOrElse(currentDb), name.funcName)