Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
Thanks, but the whole point is not setting it explicitly; it should be
derived from its parent RDDs.
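For illustration, a minimal RDD-level sketch of the behaviour I mean (local mode,
pair-RDD API; the names are placeholders). DataFrame joins on 1.5.x instead take
their shuffle output size from configuration rather than from the parents:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerDerivation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitioner-derivation").setMaster("local[*]"))

    val part  = new HashPartitioner(20)
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(part)
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(part)

    // Both parents share the same partitioner, so the join reuses it:
    // the result keeps 20 partitions rather than falling back to a default.
    val joined = left.join(right)
    println(joined.partitions.length)         // 20
    println(joined.partitioner == Some(part)) // true

    sc.stop()
  }
}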

Thanks

On Fri, Jun 24, 2016 at 6:09 AM, ayan guha  wrote:

> You can change the shuffle parallelism like the following:
>
> from pyspark import SparkConf, SparkContext
>
> conf = SparkConf()
> conf.set('spark.sql.shuffle.partitions', 10)
> sc = SparkContext(conf=conf)
>
>
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh 
> wrote:
>
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each, and the joined dataframe has 100 partitions. I want to know what is the
>> way to keep it at 20 (other than repartition and coalesce).
>>
>> Also, when I join these 2 dataframes I am using 4 columns as join
>> columns. The dataframes are partitioned on the first 2 columns of the join,
>> so in effect each partition should only need to join with its corresponding
>> partition and not with the rest of the partitions, so why is Spark shuffling
>> all the data?
>>
>> Similarly, when my dataframe is partitioned by col1,col2 and I use group
>> by on col1,col2,col3,col4, why does it shuffle everything, when it should
>> only need to sort each partition and do the grouping there?
>>
>> Bit confusing; I am using 1.5.1.
>>
>> Is this fixed in future versions?
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Associating user objects with SparkContext/SparkStreamingContext

2016-06-24 Thread Simon Scott
Hi,

I am developing a streaming application using checkpointing on Spark 1.5.1

I have just run into a NotSerializableException because some of the state that 
my streaming functions need cannot be serialized. This state is only used in 
the driver process, it is the checkpointing that requires the serialization.

So I am considering moving that state into a Scala "object" - i.e. a global 
singleton that must be mutable to allow the state to be set at application 
start.
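Roughly what I have in mind (a sketch only, placeholder names):

// Placeholder for the non-serializable, driver-only state.
class NonSerializableState

object DriverState {
  // Mutable only so it can be assigned once at application start;
  // never referenced from code that runs on executors or gets checkpointed.
  @volatile var state: Option[NonSerializableState] = None
}

// At application start, before the StreamingContext is started:
// DriverState.state = Some(new NonSerializableState)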

I would prefer to be able to create immutable state and attach it to either the 
SparkContext or SparkStreamingContext but I can't find an api for that.

Does anybody else think this is a good idea? Is there a better way? Or would such an 
api be a useful enhancement to Spark?

Thanks in advance
Simon

Research Developer
Viavi Solutions


HistoryServer: missing information

2016-06-24 Thread ElfoLiNk
In Spark 1.6.1 the HistoryServer doesn't show coresGranted, coresPerExecutor, or
memoryPerExecutorMB for applications. Is this normal?

Example:

*http://MasterIP:8080/api/v1/applications*
{
  "id" : "app-20160623171554-0006",
  "name" : "TeraSort",
  "coresGranted" : 20,
  "maxCores" : 20,
  "coresPerExecutor" : 10,
  "memoryPerExecutorMB" : 1024,
  "attempts" : [ {
    "startTime" : "2016-06-23T15:15:54.477GMT",
    "endTime" : "2016-06-23T15:20:42.976GMT",
    "sparkUser" : "xxx",
    "completed" : true
  } ]
}

*http://HistoryServerIP:18080/api/v1/applications*
{
  "id" : "app-20160623171554-0006",
  "name" : "TeraSort",
  "attempts" : [ {
    "startTime" : "2016-06-23T15:15:49.212GMT",
    "endTime" : "2016-06-23T15:20:42.863GMT",
    "sparkUser" : "xxx",
    "completed" : true
  } ]
}

Notice also the different startTime.








Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Nick Pentreath
I'm getting the following when trying to run ./dev/run-tests (not happening
on master) from the extracted source tar. Anyone else seeing this?

error: Could not access 'fc0a1475ef'
**
File "./dev/run-tests.py", line 69, in
__main__.identify_changed_files_from_git_commits
Failed example:
[x.name for x in determine_modules_for_files(
identify_changed_files_from_git_commits("fc0a1475ef",
target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in
__run
compileflags, 1) in test.globs
  File "",
line 1, in 
[x.name for x in determine_modules_for_files(
identify_changed_files_from_git_commits("fc0a1475ef",
target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 86, in
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only',
'fc0a1475ef', '5da21f07']' returned non-zero exit status 1
error: Could not access '50a0496a43'
**
File "./dev/run-tests.py", line 71, in
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in [x.name for x in determine_modules_for_files(
 identify_changed_files_from_git_commits("50a0496a43",
target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in
__run
compileflags, 1) in test.globs
  File "",
line 1, in 
'root' in [x.name for x in determine_modules_for_files(
 identify_changed_files_from_git_commits("50a0496a43",
target_ref="6765ef9"))]
  File "./dev/run-tests.py", line 86, in
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only',
'50a0496a43', '6765ef9']' returned non-zero exit status 1
**
1 items had failures:
   2 of   2 in __main__.identify_changed_files_from_git_commits
***Test Failed*** 2 failures.



On Fri, 24 Jun 2016 at 06:59 Yin Huai  wrote:

> -1 because of https://issues.apache.org/jira/browse/SPARK-16121.
>
> This jira was resolved after 2.0.0-RC1 was cut. Without the fix, Spark
> SQL effectively only uses the driver to list files when loading datasets
> and the driver-side file listing is very slow for datasets having many
> files and partitions. Since this bug causes a serious performance
> regression, I am giving -1.
>
> On Thu, Jun 23, 2016 at 1:25 AM, Pete Robbins  wrote:
>
>> I'm also seeing some of these same failures:
>>
>> - spilling with compression *** FAILED ***
>> I have seen this occasionally
>>
>> - to UTC timestamp *** FAILED ***
>> This was fixed yesterday in branch-2.0 (
>> https://issues.apache.org/jira/browse/SPARK-16078)
>>
>> - offset recovery *** FAILED ***
>> Haven't seen this for a while and thought the flaky test was fixed but it
>> popped up again in one of our builds.
>>
>> StateStoreSuite:
>> - maintenance *** FAILED ***
>> Just seen this has been failing for last 2 days on one build machine
>> (linux amd64)
>>
>>
>> On 23 June 2016 at 08:51, Sean Owen  wrote:
>>
>>> First pass of feedback on the RC: all the sigs, hashes, etc are fine.
>>> Licensing is up to date to the best of my knowledge.
>>>
>>> I'm hitting test failures, some of which may be spurious. Just putting
>>> them out there to see if they ring bells. This is Java 8 on Ubuntu 16.
>>>
>>>
>>> - spilling with compression *** FAILED ***
>>>   java.lang.Exception: Test failed with compression using codec
>>> org.apache.spark.io.SnappyCompressionCodec:
>>> assertion failed: expected cogroup to spill, but did not
>>>   at scala.Predef$.assert(Predef.scala:170)
>>>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>>>   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org
>>> $apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:263)
>>> ...
>>>
>>> I feel like I've seen this before, and see some possibly relevant
>>> fixes, but they're in 2.0.0 already:
>>> https://github.com/apache/spark/pull/10990
>>> Is this something where a native library needs to be installed or
>>> something?
>>>
>>>
>>> - to UTC timestamp *** FAILED ***
>>>   "2016-03-13 [02]:00:00.0" did not equal "2016-03-13 [10]:00:00.0"
>>> (DateTimeUtilsSuite.scala:506)
>>>
>>> I know, we talked about this for the 1.6.2 RC, but I reproduced this
>>> locally too. I will investigate, could still be spurious.
>>>

Re: Associating user objects with SparkContext/SparkStreamingContext

2016-06-24 Thread Evan Sparks
I would actually think about this the other way around. Move the functions you 
are passing to the streaming jobs out to their own object if possible. Spark's 
closure capture rules are necessarily far-reaching and serialize the object 
that contains these methods, which is a common cause of the problem you're 
seeing.

Another option is to mark the non-serializable state as "@transient" if it is 
never accessed by the worker processes. 
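A rough sketch of both options (types and names are placeholders, not an existing API):

import org.apache.spark.streaming.dstream.DStream

// Placeholder for the driver-only, non-serializable state.
class DriverOnlyState

// Option 1: mark the reference @transient so checkpointing the enclosing
// instance does not attempt to serialize the driver-only state.
class StreamingJob(@transient val driverState: DriverOnlyState) extends Serializable {
  def wireUp(lines: DStream[String]): DStream[Int] =
    lines.map(StreamingFunctions.lineLength _)
}

// Option 2: keep the functions handed to the streaming job in a standalone
// object, so no enclosing instance is captured by the closure at all.
object StreamingFunctions {
  def lineLength(line: String): Int = line.length
}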

> On Jun 24, 2016, at 1:23 AM, Simon Scott  
> wrote:
> 
> Hi,
>  
> I am developing a streaming application using checkpointing on Spark 1.5.1
>  
> I have just run into a NotSerializableException because some of the state 
> that my streaming functions need cannot be serialized. This state is only 
> used in the driver process, it is the checkpointing that requires the 
> serialization.
>  
> So I am considering moving that state into a Scala “object” – i.e. a global 
> singleton that must be mutable to allow the state to be set at application 
> start.
>  
> I would prefer to be able to create immutable state and attach it to either 
> the SparkContext or SparkStreamingContext but I can’t find an api for that.
>  
> Does anybody else think this is a good idea? Is there a better way? Or would such 
> an api be a useful enhancement to Spark?
>  
> Thanks in advance
> Simon
>  
> Research Developer
> Viavi Solutions


Re: Jar for Spark developement

2016-06-24 Thread joshuata
With regard to the Spark JAR files, I have had really good success with the
sbt plugin. You can set the desired Spark version along with any plugins, and it
will automatically fetch your dependencies and put them on the classpath.
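For comparison, the equivalent manual setup in plain sbt looks roughly like this
(a sketch only; the Spark and Scala versions are just examples):

// build.sbt -- minimal manual setup for developing against Spark;
// the plugin mentioned above automates this kind of dependency wiring.
// The versions here are examples only.
name := "spark-dev-sandbox"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)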






Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Matt Cheah
-1 because of SPARK-16181, which is a correctness regression from 1.6. Looks 
like the patch is ready though: https://github.com/apache/spark/pull/13884 – it 
would be ideal for this patch to make it into the release.

-Matt Cheah

From: Nick Pentreath <nick.pentre...@gmail.com>
Date: Friday, June 24, 2016 at 4:37 AM
To: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

I'm getting the following when trying to run ./dev/run-tests (not happening on 
master) from the extracted source tar. Anyone else seeing this?

error: Could not access 'fc0a1475ef'
**
File "./dev/run-tests.py", line 69, in 
__main__.identify_changed_files_from_git_commits
Failed example:

[x.name
 for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in 
__run
compileflags, 1) in test.globs
  File "", 
line 1, in 

[x.name
 for x in determine_modules_for_files( 
identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 86, in 
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573, in 
check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only', 'fc0a1475ef', 
'5da21f07']' returned non-zero exit status 1
error: Could not access '50a0496a43'
**
File "./dev/run-tests.py", line 71, in 
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in 
[x.name
 for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in 
__run
compileflags, 1) in test.globs
  File "", 
line 1, in 
'root' in 
[x.name
 for x in determine_modules_for_files(  
identify_changed_files_from_git_commits("50a0496a43", target_ref="6765ef9"))]
  File "./dev/run-tests.py", line 86, in 
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573, in 
check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only', '50a0496a43', 
'6765ef9']' returned non-zero exit status 1
**
1 items had failures:
   2 of   2 in __main__.identify_changed_files_from_git_commits
***Test Failed*** 2 failures.



On Fri, 24 Jun 2016 at 06:59 Yin Huai <yh...@databricks.com> wrote:
-1 because of https://issues.apache.org/jira/browse/SPARK-16121.

This jira was resolved after 2.0.0-RC1 was cut. Without the fix, Spark SQL 
effectively only uses the driver to list files when loading datasets and the 
driver-side file listing is very slow for datasets having many files and 
partitions. Since this bug causes a serious performance regression, I am giving 
-1.

On Thu, Jun 23, 2016 at 1:25 AM, Pete Robbins <robbin...@gmail.com> wrote:
I'm also seeing some of these same failures:

- spilling with compression *** FAILED ***
I have seen this occas