RE: Associating user objects with SparkContext/SparkStreamingContext

2016-06-27 Thread Simon Scott
“Move the functions you are passing” – yes, this is what I had already done, and 
what I hope to avoid.

Thank you, however, for the reminder about @transient – with that I am able to 
create a function value that includes the non-serializable state as a 
@transient val, which at least packages the solution closer to the code that 
causes the problem.
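
For the record, that function value has roughly this shape (a sketch only; the 
class and field names here are made up):

// Hypothetical stand-in for the real non-serializable driver-side state.
class MetricsRegistry {
  def record(word: String): Unit = println(word)
}

// A serializable function value carrying the driver-only state as a
// @transient val: checkpointing skips the field, and after deserialization
// on a worker it is null, so it must only be dereferenced on the driver.
class Enrich(@transient val registry: MetricsRegistry)
    extends (String => String) with Serializable {
  def apply(word: String): String = {
    if (registry != null) registry.record(word) // driver side only
    word.toUpperCase
  }
}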

Cheers
Simon

From: Evan Sparks [mailto:evan.spa...@gmail.com]
Sent: 24 June 2016 16:12
To: Simon Scott 
Cc: dev@spark.apache.org
Subject: Re: Associating user objects with SparkContext/SparkStreamingContext

I would actually think about this the other way around: move the functions you 
are passing to the streaming jobs out to their own object if possible. Spark's 
closure capture rules are necessarily far-reaching and serialize the object 
that contains these methods, which is a common cause of the problem you're 
seeing.

Another option is to mark the non-serializable state as "@transient" if it is 
never accessed by the worker processes.
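
Concretely, the first suggestion amounts to something like this (a rough 
sketch with made-up names):

// Functions defined on a top-level object capture no enclosing instance,
// so the closure cleaner has nothing non-serializable to drag along.
object Transforms {
  def isLong(word: String): Boolean = word.length > 5
}

// In the streaming job, reference the object's method instead of a method
// on the (possibly non-serializable) class that builds the stream, e.g.:
//   stream.filter(Transforms.isLong)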

On Jun 24, 2016, at 1:23 AM, Simon Scott 
<simon.sc...@viavisolutions.com> wrote:
Hi,

I am developing a streaming application using checkpointing on Spark 1.5.1

I have just run into a NotSerializableException because some of the state that 
my streaming functions need cannot be serialized. This state is only used in 
the driver process; it is the checkpointing that requires the serialization.

So I am considering moving that state into a Scala “object” – i.e. a global 
singleton that must be mutable so the state can be set at application start.
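
Something like this minimal sketch (names are hypothetical):

// Hypothetical stand-in; the real state is not serializable.
case class AppConfig(endpoint: String)

// Global singleton: mutable only so it can be set once at application start.
// It is never serialized, as long as no closure captures it by value.
object DriverState {
  @volatile var config: Option[AppConfig] = None
}

// At application start: DriverState.config = Some(AppConfig("http://..."))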

I would prefer to be able to create immutable state and attach it to either the 
SparkContext or SparkStreamingContext, but I can’t find an API for that.

Does anybody else think this is a good idea? Is there a better way? Or would 
such an API be a useful enhancement to Spark?

Thanks in advance
Simon

Research Developer
Viavi Solutions


Re: Using SHUFFLE_SERVICE_ENABLED for MesosCoarseGrainedSchedulerBackend, BlockManager, and Utils?

2016-06-27 Thread Sean Owen
That seems OK. If it introduces another module dependency we'd have to
think about it. I assume these constants should really be used
consistently everywhere if possible, because otherwise it means
duplicating the defaults, possibly incorrectly. I think you could
have a look at that more broadly too.
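
For reference, the pattern under discussion looks roughly like this 
(paraphrased from the files linked below, so treat it as a sketch rather than 
the exact code):

// Centralized definition in the org.apache.spark.internal.config package object:
private[spark] val SHUFFLE_SERVICE_ENABLED =
  ConfigBuilder("spark.shuffle.service.enabled").booleanConf.createWithDefault(false)

// The duplication the PR would remove: each file re-declares the key and default.
val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)

// After the cleanup, every call site reads the one shared constant:
val shuffleServiceEnabled = conf.get(SHUFFLE_SERVICE_ENABLED)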

On Sun, Jun 26, 2016 at 1:58 PM, Jacek Laskowski  wrote:
> Hi,
>
> I've just noticed that there is the private[spark] val
> SHUFFLE_SERVICE_ENABLED in package object config [1]
>
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L74-L75
>
> However MesosCoarseGrainedSchedulerBackend [2], BlockManager [3] and
> Utils [4] are all using their own copies.
>
> Would it be acceptable* to send a pull request to get rid of this
> redundancy?
>
> [*] I'm staring at @srowen for his nodding in agreement :-)
>
> [2] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L71
> [3] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L73-L74
> [4] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L748
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>




Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Nicholas Chammas
Howdy,

It seems like every week we have at least a couple of people emailing the
user list in vain with "Unsubscribe" in the subject, the body, or both.

I remember a while back that every email on the user list used to include a
footer with a quick link to unsubscribe. It was removed, I believe, because
someone thought it was unnecessary clutter.

I disagree.

Please add that footer back in. It's a tiny cost to pay for giving users
the convenience of easily unsubscribing; it makes it much less likely that
people will mistakenly spam everybody as they have been; and it follows
modern email list conventions. The link can just be a "mailto:..." link
that does the same thing that you have to do today to unsubscribe.

I don't think we need to do the same thing on the dev list since the volume
of email is much lower, and since people on average here are probably more
familiar with the mechanics of subscribing to and unsubscribing from Apache
mailing lists.

Nick


run spark sql with script transformation failed

2016-06-27 Thread linxi zeng
Hi all,
Recently we have been comparing Spark SQL with Hive on MR. I tried to run
Spark SQL (Spark 1.6 RC2) with script transformation, and the Spark job
failed with an error message like:

16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 19.054534 ms
16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: /bin/bash: test.py: command not found
16/06/26 11:01:28 ERROR util.Utils: Uncaught exception in thread Thread-ScriptTransformation-Feed
java.io.IOException: Stream closed
    at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
    at java.io.OutputStream.write(OutputStream.java:116)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:277)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:255)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:255)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:244)
16/06/26 11:01:28 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Thread-ScriptTransformation-Feed,5,main]
java.io.IOException: Stream closed
    at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
    at java.io.OutputStream.write(OutputStream.java:116)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:277)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:255)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:255)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
    at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:244)


The command is:

> spark-1.6/bin/spark-sql -f transform.sql


The SQL and the Python script are:
transform.sql (which executed successfully on Hive):

> add file /tmp/spark_sql_test/test.py;
> select transform(cityname) using 'test.py' as (new_cityname) from
> test.spark2_orc where dt='20160622' limit 5 ;

test.py:

> #!/usr/bin/env python
> #coding=utf-8
> import sys
> import string
> reload(sys)
> sys.setdefaultencoding('utf8')
> for line in sys.stdin:
>     cityname = line.strip("\n").split("\t")[0]
>     lt = []
>     lt.append(cityname + "_zlx")
>     print "\t".join(lt)


After making two modifications:
(1) chmod +x test.py
(2) in transform.sql, change using 'test.py' to using './test.py'
the SQL executed successfully.
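
For reference, the equivalent working invocation from a Spark shell with Hive 
support would look roughly like this (a sketch reusing the file and table 
names above):

// spark-shell (Spark 1.6, Hive support); sketch only:
sqlContext.sql("add file /tmp/spark_sql_test/test.py")
sqlContext.sql(
  """select transform(cityname) using './test.py' as (new_cityname)
    |from test.spark2_orc where dt='20160622' limit 5""".stripMargin).show()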
I wonder whether this is the intended way to run Spark SQL with script
transformation. Has anyone else met the same problem?


Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
Let me look into this...



Re: Using SHUFFLE_SERVICE_ENABLED for MesosCoarseGrainedSchedulerBackend, BlockManager, and Utils?

2016-06-27 Thread Jacek Laskowski
Thanks Sean. I'm going to create a JIRA for it and start the work under it.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski






Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-27 Thread Egor Pahomov
-1: SPARK-16228 [SQL] - "Percentile" needs an explicit cast to double,
otherwise it throws an error. I cannot move my existing 100500 queries to
2.0 transparently.

2016-06-24 11:52 GMT-07:00 Matt Cheah :

> -1 because of SPARK-16181 which is a correctness regression from 1.6.
> Looks like the patch is ready though:
> https://github.com/apache/spark/pull/13884 – it would be ideal for this
> patch to make it into the release.
>
> -Matt Cheah
>
> From: Nick Pentreath 
> Date: Friday, June 24, 2016 at 4:37 AM
> To: "dev@spark.apache.org" 
> Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1)
>
> I'm getting the following when trying to run ./dev/run-tests (not
> happening on master) from the extracted source tar. Anyone else seeing
> this?
>
> error: Could not access 'fc0a1475ef'
> **********************************************************************
> File "./dev/run-tests.py", line 69, in __main__.identify_changed_files_from_git_commits
> Failed example:
>     [x.name for x in determine_modules_for_files(
>         identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
>     Traceback (most recent call last):
>       File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in __run
>         compileflags, 1) in test.globs
>       File "<doctest __main__.identify_changed_files_from_git_commits[0]>", line 1, in <module>
>         [x.name for x in determine_modules_for_files(
>             identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>       File "./dev/run-tests.py", line 86, in identify_changed_files_from_git_commits
>         universal_newlines=True)
>       File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573, in check_output
>         raise CalledProcessError(retcode, cmd, output=output)
>     CalledProcessError: Command '['git', 'diff', '--name-only', 'fc0a1475ef', '5da21f07']' returned non-zero exit status 1
> error: Could not access '50a0496a43'
> **********************************************************************
> File "./dev/run-tests.py", line 71, in __main__.identify_changed_files_from_git_commits
> Failed example:
>     'root' in [x.name for x in determine_modules_for_files(
>         identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
> Exception raised:
>     Traceback (most recent call last):
>       File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in __run
>         compileflags, 1) in test.globs
>       File "<doctest __main__.identify_changed_files_from_git_commits[1]>", line 1, in <module>
>         'root' in [x.name for x in determine_modules_for_files(
>             identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
>       File "./dev/run-tests.py", line 86, in identify_changed_files_from_git_commits
>         universal_newlines=True)
>       File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573, in check_output
>         raise CalledProcessError(retcode, cmd, output=output)
>     CalledProcessError: Command '['git', 'diff', '--name-only', '50a0496a43', '6765ef9']' returned non-zero exit status 1
> **********************************************************************
> 1 items had failures:
>    2 of   2 in __main__.identify_changed_files_from_git_commits
> ***Test Failed*** 2 failures.
>
>
>
> On Fri, 24 Jun 2016 at 06:59 Yin Huai  wrote:
>
>> -1 because of https://issues.apache.org/jira/browse/SPARK-16121.
>>
>> This JIRA was resolved after 2.0.0-RC1 was cut. Without the fix, Spark
>> SQL effectively only uses the driver to list files when loading datasets,
>> and the driver-side file listing is very slow for datasets 

SPARK-15982 breaks external DataSources

2016-06-27 Thread Koert Kuipers
hey,

since SPARK-15982 was fixed (https://github.com/apache/spark/pull/13727), I
believe all external DataSources that rely on using .load(path) without
being a FileFormat themselves are broken.

I noticed this because our unit tests for the Elasticsearch datasource
broke.
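
For anyone trying to reproduce, the failing call pattern is just the ordinary 
external-source read path, along these lines (a sketch; the Elasticsearch 
format name and index shown are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("es-load").getOrCreate()

// External sources of this kind implement RelationProvider rather than
// FileFormat, so the argument to load() is an opaque identifier (here an
// ES index/type), not a file path Spark should try to resolve on disk.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("myindex/mytype")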

I commented on the pull request with the specific issue I am facing.

best,
koert


Spark streaming connectors available for testing

2016-06-27 Thread Luciano Resende
The Apache Bahir project is voting on a release based on Spark 2.0.0-preview.
https://www.mail-archive.com/dev@bahir.apache.org/msg00085.html

It currently provides the following Apache Spark Streaming connectors:

streaming-akka
streaming-mqtt
streaming-twitter
streaming-zeromq

While we continue to work towards a release supporting Spark 2.0.0, we would
appreciate your help testing this release and the current Spark Streaming
connectors.

To add the connectors to your Scala application, the best way is to build the
Bahir source with 'mvn clean install', which makes the necessary dependencies
available in your local Maven repository, lets you reference the connectors in
your application, and lets you submit your application to a local Spark test
environment using --packages.

Build:
mvn clean install

Add the repository to your Scala application (build.sbt):
resolvers += "Local Maven Repository" at "file://" +
Path.userHome.absolutePath + "/.m2/repository"
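
The matching dependency line would be along these lines (a sketch; the
coordinates mirror the --packages example below):

// build.sbt: pick the connector you need from the list above, e.g. Akka:
libraryDependencies += "org.apache.bahir" %% "spark-streaming-akka" % "2.0.0-preview"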

Submit your application to a local Spark test environment:
bin/spark-submit --master spark://127.0.0.1:7077 --packages
org.apache.bahir:spark-streaming-akka_2.11:2.0.0-preview --class
org.apache.spark.examples.streaming.akka.ActorWordCount
~/opensource/apache/bahir/streaming-akka-examples/target/scala-2.11/streaming-akka-examples_2.11-1.0.jar
localhost 


The Bahir community welcomes questions, comments, bug reports and all your
feedback.

http://bahir.apache.org/community/

Thanks


Re: SPARK-15982 breaks external DataSources

2016-06-27 Thread Reynold Xin
Yup this is bad. Can you create a JIRA ticket too?




[ANNOUNCE] Announcing Spark 1.6.2

2016-06-27 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.2! This maintenance
release includes fixes across several areas of Spark. You can find the list
of changes here: https://s.apache.org/spark-1.6.2

And download the release here: http://spark.apache.org/downloads.html


Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
Filed infra ticket: https://issues.apache.org/jira/browse/INFRA-12185





Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
If people want this to happen, please go comment on the INFRA ticket:

https://issues.apache.org/jira/browse/INFRA-12185

Otherwise it will probably be dropped.


>