[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-10-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189651#comment-14189651
 ] 

Zhan Zhang commented on SPARK-1537:
---

Hi Marcelo,

Do you have an update on this? If you don't mind, I can work on your branch to get 
this done ASAP. Please let me know what you think.


> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Commented] (SPARK-4149) ISO 8601 support for json date time strings

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189618#comment-14189618
 ] 

Apache Spark commented on SPARK-4149:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/3012

> ISO 8601 support for json date time strings
> ---
>
> Key: SPARK-4149
> URL: https://issues.apache.org/jira/browse/SPARK-4149
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
>
> Parse JSON date-time strings like "2014-10-29T20:05:00-08:00" or "2014-10-29T20:05:00Z".






[jira] [Commented] (SPARK-4150) rdd.setName returns None in PySpark

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189592#comment-14189592
 ] 

Apache Spark commented on SPARK-4150:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/3011

> rdd.setName returns None in PySpark
> ---
>
> Key: SPARK-4150
> URL: https://issues.apache.org/jira/browse/SPARK-4150
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Trivial
>
> We should return self so we can do 
> {code}
> rdd.setName('abc').cache().count()
> {code}






[jira] [Created] (SPARK-4150) rdd.setName returns None in PySpark

2014-10-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4150:


 Summary: rdd.setName returns None in PySpark
 Key: SPARK-4150
 URL: https://issues.apache.org/jira/browse/SPARK-4150
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Trivial


We should return self so we can do 

{code}
rdd.setName('abc').cache().count()
{code}
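
For illustration only, a minimal, self-contained sketch of the return-self pattern the 
description asks for; {{NamedRDD}} is a hypothetical stand-in for the PySpark RDD class, 
not actual Spark code:

{code}
class NamedRDD(object):
    """Hypothetical stand-in for pyspark.RDD, illustrating the desired chaining."""

    def __init__(self):
        self.name = None

    def setName(self, name):
        self.name = name
        return self  # returning self (instead of None) makes the call chainable

    def cache(self):
        return self  # stand-in for RDD.cache(), which already returns the RDD

    def count(self):
        return 0     # stand-in for RDD.count()


# The chained usage the issue asks for:
NamedRDD().setName('abc').cache().count()
{code}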







[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4148:
-
Description: 
The current way of seed distribution makes the random sequences from partition 
i and i+1 offset by 1.

{code}
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
{code}

So the second value from partition 1 is the same as the first value from 
partition 2.

  was:We should have different seeds. Otherwise, we get the same sequence from 
each partition.


> PySpark's sample uses the same seed for all partitions
> --
>
> Key: SPARK-4148
> URL: https://issues.apache.org/jira/browse/SPARK-4148
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> The current way of seed distribution makes the random sequences from 
> partition i and i+1 offset by 1.
> {code}
> In [14]: import random
> In [15]: r1 = random.Random(10)
> In [16]: r1.randint(0, 1)
> Out[16]: 1
> In [17]: r1.random()
> Out[17]: 0.4288890546751146
> In [18]: r1.random()
> Out[18]: 0.5780913011344704
> In [19]: r2 = random.Random(10)
> In [20]: r2.randint(0, 1)
> Out[20]: 1
> In [21]: r2.randint(0, 1)
> Out[21]: 0
> In [22]: r2.random()
> Out[22]: 0.5780913011344704
> {code}
> So the second value from partition 1 is the same as the first value from 
> partition 2.
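
One way to picture a fix (a sketch only, not the actual patch): derive each partition's 
seed from both the global seed and the partition index, e.g. via a hash, so the 
per-partition streams are no longer shifted copies of one another. Plain Python, no 
Spark required:

{code}
import random

def partition_rng(global_seed, partition_index):
    # Mixing the partition index into the seed avoids the
    # "partition i and i+1 offset by 1" pattern shown above.
    mixed = hash((global_seed, partition_index)) & 0xFFFFFFFF
    return random.Random(mixed)

r1 = partition_rng(10, 1)
r2 = partition_rng(10, 2)
print([r1.random() for _ in range(3)])
print([r2.random() for _ in range(3)])  # not a shifted copy of r1's stream
{code}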






[jira] [Commented] (SPARK-4148) PySpark's sample uses the same seed for all partitions

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189579#comment-14189579
 ] 

Apache Spark commented on SPARK-4148:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/3010

> PySpark's sample uses the same seed for all partitions
> --
>
> Key: SPARK-4148
> URL: https://issues.apache.org/jira/browse/SPARK-4148
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should have different seeds. Otherwise, we get the same sequence from each 
> partition.






[jira] [Created] (SPARK-4149) ISO 8601 support for json date time strings

2014-10-29 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-4149:
--

 Summary: ISO 8601 support for json date time strings
 Key: SPARK-4149
 URL: https://issues.apache.org/jira/browse/SPARK-4149
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Priority: Minor


Parse JSON date-time strings like "2014-10-29T20:05:00-08:00" or "2014-10-29T20:05:00Z".
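
For readers unfamiliar with the two shapes above, a tiny plain-Python sketch (unrelated 
to the actual Spark SQL patch) that parses both an explicit UTC offset and the trailing 
'Z' designator:

{code}
from datetime import datetime

def parse_iso8601(s):
    # Normalize the two shapes mentioned above so %z can handle them.
    if s.endswith("Z"):                    # 'Z' means UTC
        s = s[:-1] + "+0000"
    elif len(s) >= 6 and s[-3] == ":":     # turn -08:00 into -0800
        s = s[:-3] + s[-2:]
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z")

print(parse_iso8601("2014-10-29T20:05:00-08:00"))
print(parse_iso8601("2014-10-29T20:05:00Z"))
{code}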






[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4148:
-
Affects Version/s: (was: 1.0.0)
   1.0.2

> PySpark's sample uses the same seed for all partitions
> --
>
> Key: SPARK-4148
> URL: https://issues.apache.org/jira/browse/SPARK-4148
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should have different seeds. Otherwise, we get the same sequence from each 
> partition.






[jira] [Created] (SPARK-4148) PySpark's sample uses the same seed for all partitions

2014-10-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4148:


 Summary: PySpark's sample uses the same seed for all partitions
 Key: SPARK-4148
 URL: https://issues.apache.org/jira/browse/SPARK-4148
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0, 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We should have different seeds. Otherwise, we get the same sequence from each 
partition.






[jira] [Created] (SPARK-4147) Remove log4j dependency

2014-10-29 Thread Tobias Pfeiffer (JIRA)
Tobias Pfeiffer created SPARK-4147:
--

 Summary: Remove log4j dependency
 Key: SPARK-4147
 URL: https://issues.apache.org/jira/browse/SPARK-4147
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Tobias Pfeiffer


spark-core has a hard dependency on log4j, which shouldn't be necessary since 
slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
sbt file.

Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
However, removing the log4j dependency fails because in 
https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is 
not in use.

I guess removing all dependencies on log4j may be a bigger task, but it would 
be a great help if LogManager were accessed only when log4j is detected to be 
in use. (This is a 2-line change.)






[jira] [Updated] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark

2014-10-29 Thread Jie Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang updated SPARK-4146:
-
Fix Version/s: 1.2.0

> [GraphX] Modify option name according to example doc in SynthBenchmark 
> ---
>
> Key: SPARK-4146
> URL: https://issues.apache.org/jira/browse/SPARK-4146
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Jie Huang
>Priority: Minor
> Fix For: 1.2.0
>
>
> The graphx.SynthBenchmark example has an iteration-count option named 
> "niter". However, its documentation calls it "niters". The mismatch between 
> the implementation and the documentation causes an 
> IllegalArgumentException when running the example.






[jira] [Updated] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4144:
-
Assignee: Liquan Pei

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Liquan Pei
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."






[jira] [Updated] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark

2014-10-29 Thread Jie Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang updated SPARK-4146:
-
Affects Version/s: 1.1.1

> [GraphX] Modify option name according to example doc in SynthBenchmark 
> ---
>
> Key: SPARK-4146
> URL: https://issues.apache.org/jira/browse/SPARK-4146
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Jie Huang
>Priority: Minor
> Fix For: 1.2.0
>
>
> The graphx.SynthBenchmark example has an iteration-count option named 
> "niter". However, its documentation calls it "niters". The mismatch between 
> the implementation and the documentation causes an 
> IllegalArgumentException when running the example.






[jira] [Resolved] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark

2014-10-29 Thread Jie Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang resolved SPARK-4146.
--
Resolution: Fixed

> [GraphX] Modify option name according to example doc in SynthBenchmark 
> ---
>
> Key: SPARK-4146
> URL: https://issues.apache.org/jira/browse/SPARK-4146
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Jie Huang
>Priority: Minor
>
> The graphx.SynthBenchmark example has an iteration-count option named 
> "niter". However, its documentation calls it "niters". The mismatch between 
> the implementation and the documentation causes an 
> IllegalArgumentException when running the example.






[jira] [Created] (SPARK-4146) [GraphX] Modify option name according to example doc in SynthBenchmark

2014-10-29 Thread Jie Huang (JIRA)
Jie Huang created SPARK-4146:


 Summary: [GraphX] Modify option name according to example doc in 
SynthBenchmark 
 Key: SPARK-4146
 URL: https://issues.apache.org/jira/browse/SPARK-4146
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.1.0
Reporter: Jie Huang
Priority: Minor


The graphx.SynthBenchmark example has an iteration-count option named "niter". 
However, its documentation calls it "niters". The mismatch between the 
implementation and the documentation causes an IllegalArgumentException when 
running the example.






[jira] [Updated] (SPARK-4078) New FsPermission instance w/o FsPermission.createImmutable in eventlog

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-4078:
-
Assignee: Jason Dai

> New FsPermission instance w/o FsPermission.createImmutable in eventlog
> --
>
> Key: SPARK-4078
> URL: https://issues.apache.org/jira/browse/SPARK-4078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Jie Huang
>Assignee: Jason Dai
>
> By default, Spark builds its package against Hadoop 1.0.4. That version has an 
> FsPermission bug (see HADOOP-7629 by Todd Lipcon), which was fixed in Hadoop 
> 1.1. When the FsPermission.createImmutable() API is used, end users may see an 
> RPC exception like the one below (if event logging over HDFS is turned on). 
> {quote}
> Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 
> failed on local exception: java.io.EOFException
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
> at org.apache.hadoop.ipc.Client.call(Client.java:1118)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> at $Proxy6.setPermission(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
> at $Proxy6.setPermission(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
> {quote}






[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-4094:
-
Assignee: Zhang, Liye

> checkpoint should still be available after rdd actions
> --
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>
> rdd.checkpoint() must be called before any action on the RDD; if any other 
> action runs first, the checkpoint never succeeds. Take the following code as 
> an example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This RDD will never be checkpointed. This does not happen with RDD caching: 
> cache() always takes effect for subsequent actions, no matter whether any 
> action ran before cache() was called.
> So rdd.checkpoint() should behave the same way as rdd.cache().
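
To make the ordering requirement concrete, a small PySpark sketch of the behavior 
described above (the checkpoint directory path is a placeholder, and the comments 
restate the issue's claim rather than document guaranteed behavior):

{code}
from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-ordering")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder path

rdd = sc.parallelize(range(100))
rdd.collect()                  # an action runs before checkpoint() ...
rdd.checkpoint()               # ... so, per the description, this never takes effect
rdd.count()
print(rdd.isCheckpointed())    # reported as False in this ordering

fresh = sc.parallelize(range(100))
fresh.checkpoint()             # marking before any action ...
fresh.count()
print(fresh.isCheckpointed())  # ... is what makes the checkpoint happen
sc.stop()
{code}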






[jira] [Updated] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-2926:
-
Assignee: Saisai Shao

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly improves 
> IO performance and reduces memory consumption when the number of reducers is 
> very large. The reducer side, however, still uses the hash-based shuffle 
> reader, which ignores the ordering of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Commented] (SPARK-4145) Create jobs overview and job details pages on the web UI

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189429#comment-14189429
 ] 

Apache Spark commented on SPARK-4145:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3009

> Create jobs overview and job details pages on the web UI
> 
>
> Key: SPARK-4145
> URL: https://issues.apache.org/jira/browse/SPARK-4145
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should create a jobs overview and job details page on the web UI.  The 
> overview page would list all jobs in the SparkContext and would replace the 
> current "Stages" page as the default web UI page.  The job details page would 
> provide information on the stages triggered by a particular job; it would 
> also serve as a place to show DAG visualizations and other debugging aids.
> I still plan to keep the current "Stages" page, which lists all stages of all 
> jobs, since it's a useful debugging aid for figuring out how resources are 
> being consumed across all jobs in a Spark Cluster.






[jira] [Created] (SPARK-4145) Create jobs overview and job details pages on the web UI

2014-10-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4145:
-

 Summary: Create jobs overview and job details pages on the web UI
 Key: SPARK-4145
 URL: https://issues.apache.org/jira/browse/SPARK-4145
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Josh Rosen
Assignee: Josh Rosen


We should create a jobs overview and job details page on the web UI.  The 
overview page would list all jobs in the SparkContext and would replace the 
current "Stages" page as the default web UI page.  The job details page would 
provide information on the stages triggered by a particular job; it would also 
serve as a place to show DAG visualizations and other debugging aids.

I still plan to keep the current "Stages" page, which lists all stages of all 
jobs, since it's a useful debugging aid for figuring out how resources are 
being consumed across all jobs in a Spark Cluster.






[jira] [Closed] (SPARK-4053) Block generator throttling in NetworkReceiverSuite is flaky

2014-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4053.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Block generator throttling in NetworkReceiverSuite is flaky
> ---
>
> Key: SPARK-4053
> URL: https://issues.apache.org/jira/browse/SPARK-4053
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 1.2.0
>
>
> In the unit test that checks whether blocks generated by the throttled block 
> generator have the expected number of records, the thresholds were too tight, 
> which sometimes led to the test failing.






[jira] [Closed] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors

2014-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3795.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Add scheduler hooks/heuristics for adding and removing executors
> 
>
> Key: SPARK-3795
> URL: https://issues.apache.org/jira/browse/SPARK-3795
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.2.0
>
>
> To support dynamic scaling of a Spark application, Spark's scheduler will 
> need to have hooks around explicitly decommissioning executors. We'll also 
> need basic heuristics governing when to start/stop executors based on load. 
> An initial goal is to keep this very simple.






[jira] [Created] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2014-10-29 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-4144:
---

 Summary: Support incremental model training of Naive Bayes 
classifier
 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly


Per Xiangrui Meng from the following user list discussion:  
http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
   

"For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for the updates."
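
As a toy illustration of the bookkeeping described above (keeping raw class and feature 
counts so priors and conditional probabilities can be refreshed as new batches arrive), 
here is a plain-Python sketch; it is not MLlib code:

{code}
from collections import defaultdict

class IncrementalNB(object):
    """Toy multinomial Naive Bayes that keeps counts so it can be updated later."""

    def __init__(self, num_features):
        self.num_features = num_features
        self.class_counts = defaultdict(int)       # number of observations per class
        self.feature_counts = defaultdict(lambda: [0.0] * num_features)

    def update(self, batch):
        """batch: iterable of (label, feature_vector) pairs."""
        for label, features in batch:
            self.class_counts[label] += 1
            acc = self.feature_counts[label]
            for j, value in enumerate(features):
                acc[j] += value

    def priors(self):
        total = float(sum(self.class_counts.values()))
        return {c: n / total for c, n in self.class_counts.items()}

nb = IncrementalNB(num_features=2)
nb.update([(0, [1, 0]), (1, [0, 2])])
nb.update([(1, [1, 1])])   # a later mini-batch: counts, and hence priors, are refreshed
print(nb.priors())
{code}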






[jira] [Commented] (SPARK-4143) Move inner class DeferredObjectAdapter to top level

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189380#comment-14189380
 ] 

Apache Spark commented on SPARK-4143:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3007

> Move inner class DeferredObjectAdapter to top level
> ---
>
> Key: SPARK-4143
> URL: https://issues.apache.org/jira/browse/SPARK-4143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Trivial
>
> DeferredObjectAdapter is an inner class of HiveGenericUdf, which may add some 
> overhead to closure serialization/deserialization. Move it to the top level.






[jira] [Created] (SPARK-4143) Move inner class DeferredObjectAdapter to top level

2014-10-29 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4143:


 Summary: Move inner class DeferredObjectAdapter to top level
 Key: SPARK-4143
 URL: https://issues.apache.org/jira/browse/SPARK-4143
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Trivial


DeferredObjectAdapter is an inner class of HiveGenericUdf, which may add some 
overhead to closure serialization/deserialization. Move it to the top level.






[jira] [Commented] (SPARK-4132) Spark uses incompatible HDFS API

2014-10-29 Thread kuromatsu nobuyuki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189351#comment-14189351
 ] 

kuromatsu nobuyuki commented on SPARK-4132:
---

Owen, thank you for pointing that out.
It looks very much like the problem I'm seeing.

> Spark uses incompatible HDFS API
> 
>
> Key: SPARK-4132
> URL: https://issues.apache.org/jira/browse/SPARK-4132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark1.1.0 on Hadoop1.2.1
> CentOS 6.3 64bit
>Reporter: kuromatsu nobuyuki
>Priority: Minor
>
> When I enable event logging and set it to output to HDFS, initialization 
> fails with 'java.lang.ClassNotFoundException' (see trace below).
> I found that an API incompatibility in 
> org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop 
> 1.1.0 (and above) causes this error 
> (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't 
> exist in my 1.2.1 environment).
> I think the Spark jar file pre-built for Hadoop 1.x should be built against the 
> stable Hadoop version (Hadoop 1.2.1).
> 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server 
> listener on 9000: 
> readAndProcess threw exception java.lang.RuntimeException: 
> readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. 
> Count of bytes read: 0
> java.lang.RuntimeException: readObject can't find class 
> org.apache.hadoop.fs.permission.FsPermission$2
> at 
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233)
> at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106)
> at 
> org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347)
> at 
> org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326)
> at 
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:701)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.permission.FsPermission$2
> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
> at 
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231)
> ... 9 more






[jira] [Commented] (SPARK-4142) Bad Default for GraphLoader Edge Partitions

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189347#comment-14189347
 ] 

Apache Spark commented on SPARK-4142:
-

User 'jegonzal' has created a pull request for this issue:
https://github.com/apache/spark/pull/3006

> Bad Default for GraphLoader Edge Partitions
> ---
>
> Key: SPARK-4142
> URL: https://issues.apache.org/jira/browse/SPARK-4142
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Joseph E. Gonzalez
>
> The default number of edge partitions for the GraphLoader is set to 1 rather 
> than the default parallelism.






[jira] [Commented] (SPARK-2672) Support compression in wholeFile()

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189320#comment-14189320
 ] 

Apache Spark commented on SPARK-2672:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3005

> Support compression in wholeFile()
> --
>
> Key: SPARK-2672
> URL: https://issues.apache.org/jira/browse/SPARK-2672
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> wholeFile() cannot read compressed files; it should be able to, just like 
> textFile().






[jira] [Created] (SPARK-4142) Bad Default for GraphLoader Edge Partitions

2014-10-29 Thread Joseph E. Gonzalez (JIRA)
Joseph E. Gonzalez created SPARK-4142:
-

 Summary: Bad Default for GraphLoader Edge Partitions
 Key: SPARK-4142
 URL: https://issues.apache.org/jira/browse/SPARK-4142
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez


The default number of edge partitions for the GraphLoader is set to 1 rather 
than the default parallelism.






[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189093#comment-14189093
 ] 

Xiangrui Meng commented on SPARK-3080:
--

Btw, the `ArrayIndexOutOfBoundsException` is from the driver log. Could you 
also check the executor logs? They may contain the root cause.

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances






[jira] [Resolved] (SPARK-4097) Race condition in org.apache.spark.ComplexFutureAction.cancel

2014-10-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4097.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Shixiong Zhu

> Race condition in org.apache.spark.ComplexFutureAction.cancel
> -
>
> Key: SPARK-4097
> URL: https://issues.apache.org/jira/browse/SPARK-4097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: bug, race-condition
> Fix For: 1.1.1, 1.2.0
>
>
> There is a chance that `thread` is null when calling `thread.interrupt()`.
> {code:java}
>   override def cancel(): Unit = this.synchronized {
> _cancelled = true
> if (thread != null) {
>   thread.interrupt()
> }
>   }
> {code}
> Should put `thread = null` into a `synchronized` block to fix the race 
> condition.
> {code:java}
>   try {
> p.success(func)
>   } catch {
> case e: Exception => p.failure(e)
>   } finally {
> thread = null
>   }
> {code}






[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-29 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189039#comment-14189039
 ] 

Ilya Ganelin commented on SPARK-3080:
-

Hi all - I have managed to make some substantial progress! What I discovered is 
that the default parallelization setting is critical. I did two things that got 
me around this blocker: 
1) I increased the amount of memory available to nodes - by itself this did not 
solve the problem
2) I set .set("spark.default.parallelism","300")

I believe the latter is critical because even if I partitioned the data before 
feeding it into ALS.train, the internal operations would produce RDDs coalesced 
into fewer partitions. Consequently, I believe this smaller number of (presumably 
large in-memory) partitions would create memory issues, ultimately leading to this 
and other hard-to-pin-down issues. Forcing the default parallelism ensured that 
even these internal operations would shard appropriately.
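
For anyone else hitting this, a minimal PySpark sketch of the workaround described 
above; the application name, the synthetic data, and the value 300 are illustrative only:

{code}
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, Rating

conf = (SparkConf()
        .setAppName("als-parallelism-workaround")
        .set("spark.default.parallelism", "300"))   # keep internal ALS RDDs widely partitioned
sc = SparkContext(conf=conf)

ratings = sc.parallelize(
    [Rating(u, p, float((u * p) % 5)) for u in range(100) for p in range(100)])
model = ALS.train(ratings, rank=10, iterations=5)
sc.stop()
{code}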

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances






[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-10-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189014#comment-14189014
 ] 

Josh Rosen commented on SPARK-4133:
---

Also, could you enable debug logging and share the executor logs?  If you're 
able to reliably reproduce this bug, please email me at 
joshro...@databricks.com and I'd be glad to hop on Skype to help you configure 
logging, etc.

> PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
> --
>
> Key: SPARK-4133
> URL: https://issues.apache.org/jira/browse/SPARK-4133
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Antonio Jesus Navarro
>Priority: Blocker
>
> Snappy-related problems were found when trying to upgrade an existing Spark 
> Streaming app from 1.0.2 to 1.1.0.
> We cannot run an existing 1.0.2 Spark app once it is upgraded to 1.1.0:
> > An IOException is thrown by snappy (PARSING_ERROR(2))
> > Only the Spark version changed
> As far as we have checked, snappy throws this error when dealing with 
> zero-length byte arrays.
> We have tried:
> > changing from snappy to LZF,
> > changing broadcast.compression to false,
> > changing from TorrentBroadcast to HTTPBroadcast,
> but with no luck so far.
> {code}
> [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0]  
> org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 
> 0.0 (TID 0)
> java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-4141) Hide Accumulators column on stage page when no accumulators exist

2014-10-29 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-4141:
-

 Summary: Hide Accumulators column on stage page when no 
accumulators exist
 Key: SPARK-4141
 URL: https://issues.apache.org/jira/browse/SPARK-4141
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Priority: Minor


The task table on the details page for each stage has a column for 
accumulators. We should only show this column if the stage has accumulators, 
otherwise it clutters the UI.






[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188998#comment-14188998
 ] 

Apache Spark commented on SPARK-3466:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3003

> Limit size of results that a driver collects for each action
> 
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Davies Liu
>Priority: Critical
>
> Right now, operations like {{collect()}} and {{take()}} can crash the driver 
> with an OOM if they bring back too much data. We should add a 
> {{spark.driver.maxResultSize}} setting (or something like that) that will 
> make the driver abort a job if its result is too big. We can set it to some 
> fraction of the driver's memory by default, or to something like 100 MB.
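
A sketch of how an application might use such a cap, assuming the proposed setting 
lands under the name {{spark.driver.maxResultSize}}; the values here are illustrative:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("bounded-collect")
        .set("spark.driver.maxResultSize", "100m"))  # abort jobs whose results exceed ~100 MB

sc = SparkContext(conf=conf)
# A collect() whose result exceeded the cap would then fail fast
# instead of taking down the driver with an OOM.
small = sc.parallelize(range(1000)).collect()
sc.stop()
{code}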






[jira] [Closed] (SPARK-4126) Do not set `spark.executor.instances` if not needed (yarn-cluster)

2014-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4126.

Resolution: Won't Fix

superseded by SPARK-4138

> Do not set `spark.executor.instances` if not needed (yarn-cluster)
> --
>
> Key: SPARK-4126
> URL: https://issues.apache.org/jira/browse/SPARK-4126
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> In yarn cluster mode, we currently always set `spark.executor.instances` 
> regardless of whether this is set by the user. While not a huge deal, this 
> prevents us from knowing whether the user did specify a starting number of 
> executors.
> This is needed in SPARK-3795 to throw the appropriate exception when this is 
> set AND dynamic executor allocation is turned on.






[jira] [Closed] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers

2014-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3822.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Expose a mechanism for SparkContext to ask for / remove Yarn containers
> ---
>
> Key: SPARK-3822
> URL: https://issues.apache.org/jira/browse/SPARK-3822
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.2.0
>
>
> This is one of the core components for the umbrella issue SPARK-3174. 
> Currently, the only agent in Spark that communicates directly with the RM is 
> the AM. This means the only way for the SparkContext to ask for / remove 
> containers from the RM is through the AM. The communication link between the 
> SparkContext and the AM needs to be added.






[jira] [Commented] (SPARK-3573) Dataset

2014-10-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188980#comment-14188980
 ] 

Joseph K. Bradley commented on SPARK-3573:
--

[~sparks]  Trying to simplify things, am I right that the main question is:
_Should ML data instances/examples/rows be flat vectors or have structure?_
Breaking this down,
(1) Should we allow structure?
(2) Should we encourage flatness or structure, and how?
(3) How does a Dataset used in a full ML pipeline resemble/differ from a 
Dataset used by a specific ML algorithm?

My thoughts:
(1) We should allow structure.  For general (complicated) pipelines, it will be 
important to provide structure to make it easy to select groups of features.
(2) We should encourage flatness where possible; e.g., unigram features from a 
document should be stored as a Vector instead of a bunch of Doubles in the 
Schema.  We should encourage structure where meaningful; e.g., the output of a 
learning algorithm should be appended as a new column (new element in the 
Schema) by default, rather than being appended to a big Vector of features.
(3) As in my comment for (2), a Dataset for a full pipeline should have 
structure where meaningful.  However, I agree that most common ML algorithms 
expect flat Vectors of features.  There needs to be an easy way to select 
relevant features and transform them to a Vector, LabeledPoint, etc.  Having 
structured Datasets in the pipeline should be useful for selecting relevant 
features.  To transform the selection, it will be important to provide helper 
methods for mushing the data into Vectors or other common formats.

The big challenge in my mind is (2): Figuring out default behavior and perhaps 
column naming/selection conventions which make it easy to select subsets of 
features (or even have an implicit selection if possible).

What do you think?

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}






[jira] [Created] (SPARK-4140) Document the dynamic allocation feature

2014-10-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4140:


 Summary: Document the dynamic allocation feature
 Key: SPARK-4140
 URL: https://issues.apache.org/jira/browse/SPARK-4140
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or


This blocks on SPARK-3795 and SPARK-3822.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4138) Guard against incompatible settings on the number of executors

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188960#comment-14188960
 ] 

Apache Spark commented on SPARK-4138:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/3002

> Guard against incompatible settings on the number of executors
> --
>
> Key: SPARK-4138
> URL: https://issues.apache.org/jira/browse/SPARK-4138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> After SPARK-3822 and SPARK-3795, we now set a lower bound and an upper bound 
> for the number of executors. These settings are incompatible if the user sets 
> the number of executors explicitly, however. We need to add a guard against 
> this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4139) Start the number of executors at the max if dynamic allocation is enabled

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188961#comment-14188961
 ] 

Apache Spark commented on SPARK-4139:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/3002

> Start the number of executors at the max if dynamic allocation is enabled
> -
>
> Key: SPARK-4139
> URL: https://issues.apache.org/jira/browse/SPARK-4139
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> SPARK-3795 allows us to dynamically scale the number of executors up and 
> down. We should start the number at the max instead of from 0 in the 
> beginning, because the first job will likely run immediately after the 
> SparkContext is set up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4139) Start the number of executors at the max if dynamic allocation is enabled

2014-10-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4139:


 Summary: Start the number of executors at the max if dynamic 
allocation is enabled
 Key: SPARK-4139
 URL: https://issues.apache.org/jira/browse/SPARK-4139
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or


SPARK-3795 allows us to dynamically scale the number of executors up and down. 
We should start the number at the max instead of from 0 in the beginning, 
because the first job will likely run immediately after the SparkContext is set 
up.
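
As a rough sketch of the intended behavior (not the actual patch), and assuming 
the dynamic-allocation bounds from SPARK-3795/SPARK-3822 end up exposed through 
the config keys used below (the key names are illustrative assumptions):

{code}
import org.apache.spark.SparkConf

// Illustrative only: choose the initial executor target under dynamic allocation.
def initialExecutorTarget(conf: SparkConf): Int = {
  val min = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
  val max = conf.getInt("spark.dynamicAllocation.maxExecutors", min)
  // Start at the upper bound rather than 0: the first job typically arrives
  // right after the SparkContext is created, so ramping up from 0 wastes time.
  math.max(min, max)
}
{code}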



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4138) Guard against incompatible settings on the number of executors

2014-10-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4138:


 Summary: Guard against incompatible settings on the number of 
executors
 Key: SPARK-4138
 URL: https://issues.apache.org/jira/browse/SPARK-4138
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or


After SPARK-3822 and SPARK-3795, we now set a lower bound and an upper bound 
for the number of executors. These settings are incompatible if the user sets 
the number of executors explicitly, however. We need to add a guard against 
this.
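
A minimal sketch of the kind of guard being proposed. The config key names and 
the place where the check would live are assumptions for illustration, not the 
final implementation:

{code}
import org.apache.spark.{SparkConf, SparkException}

// Illustrative guard: a fixed executor count conflicts with the
// dynamic-allocation bounds introduced by SPARK-3795 / SPARK-3822.
def validateExecutorSettings(conf: SparkConf): Unit = {
  val dynamicAllocation = conf.getBoolean("spark.dynamicAllocation.enabled", false)
  val numExecutorsSet = conf.contains("spark.executor.instances")
  if (dynamicAllocation && numExecutorsSet) {
    throw new SparkException(
      "Setting the number of executors explicitly (spark.executor.instances / " +
      "--num-executors) is incompatible with dynamic allocation; remove one of the two.")
  }
}
{code}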



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3796) Create shuffle service for external block storage

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188947#comment-14188947
 ] 

Apache Spark commented on SPARK-3796:
-

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/3001

> Create shuffle service for external block storage
> -
>
> Key: SPARK-3796
> URL: https://issues.apache.org/jira/browse/SPARK-3796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Aaron Davidson
>
> This task will be broken up into two parts -- the first being to refactor 
> our internal shuffle service to use a BlockTransferService which we can 
> easily extract out into its own service, and the second being to actually 
> do the extraction.
> Here is the design document for the low-level service, nicknamed "Sluice", on 
> top of which will be Spark's BlockTransferService API:
> https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188938#comment-14188938
 ] 

Nicholas Chammas commented on SPARK-3398:
-

No problem. I've opened [SPARK-4137] to track this issue, and [PR 
2988|https://github.com/apache/spark/pull/2988] to resolve it.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4137) Relative paths don't get handled correctly by spark-ec2

2014-10-29 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-4137:
---

 Summary: Relative paths don't get handled correctly by spark-ec2
 Key: SPARK-4137
 URL: https://issues.apache.org/jira/browse/SPARK-4137
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Nicholas Chammas
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188928#comment-14188928
 ] 

Xiangrui Meng commented on SPARK-3080:
--

SimpleALS is not merged yet. You need to build it and submit it as an 
application: http://spark.apache.org/docs/latest/submitting-applications.html

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3573) Dataset

2014-10-29 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919
 ] 

Evan Sparks commented on SPARK-3573:


This comment originally appeared on the PR associated with this feature. 
(https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level 
concern I have is around the programming pattern that this encourages: complex 
nesting of otherwise simple structure that may make it difficult to program 
against Datasets for sufficiently complicated applications.

A 'dataset' is now a collection of Row, where we have the guarantee that all 
rows in a Dataset conform to the same schema. A schema is a list of (name, 
type) pairs which describe the attributes available in the dataset. This seems 
like a good thing to me, and is pretty much what we described in MLI (and how 
conventional databases have been structured forever). So far, so good.

The concern that I have is that we are now encouraging these attributes to be 
complex types. For example, where I might have had 
val x = Schema(('a', classOf[String]), ('b', classOf[Double]), ..., ("z", 
classOf[Double]))
This would become
val x = Schema(('a', classOf[String]), ('bGroup', classOf[Vector]), .., 
("zGroup", classOf[Vector]))

So, great, my schema now has these vector things in them, which I can create 
separately, pass around, etc.

This clearly has its merits:
1) Features are grouped together logically based on the process that creates 
them.
2) Managing one short schema where each record is comprised of a few large 
objects (say, 4 vectors, each of length 1000) is probably easier than managing 
a really big schema comprised of lots of small objects (say, 4000 doubles).

But, there are some major drawbacks:
1) Why stop at only one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is 
there an implicit conversion that flattens these things to RDD[LabeledPoint]? 
Do we want to guarantee these semantics?
3) Manipulating and subsetting nested schemas like this might be tricky. Where 
before I might be able to write:

val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
now I might have to write
val groupSelections = Seq(Seq(0,1,2,4),Seq(0,1),Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map {case (gs, col) => 
col(gs) }

Ignoring raw syntax and semantics of how you might actually map an operation 
over the columns of a Dataset and get back a well-structured dataset, I think 
this makes two conflicting points:
1) In the first example - presumably all the work goes into figuring out which 
subset of features you want in this really wide feature space.
2) In the second example - there’s a lot of gymnastics that goes into 
subsetting feature groups. I think it’s clear that working with lots of feature 
groups might get unreasonable pretty quickly.

If we look at R or pandas/scikit-learn as examples of projects that have 
(arguably quite successfully) dealt with these interface issues, there is one 
basic pattern: learning algorithms expect big tables of numbers as input. Even 
here, there are some important differences:

For example, in scikit-learn, categorical features aren’t supported directly by 
most learning algorithms. Instead, users are responsible for getting data from 
“table with heterogeneously typed columns” to “table of numbers” with something 
like OneHotEncoder and other feature transformers. In R, on the other hand, 
categorical features are (sometimes frustratingly) first-class citizens by 
virtue of the “factor” data type - which is essentially an enum. Most 
out-of-the-box learning algorithms (like glm()) accept data frames with 
categorical inputs and handle them sensibly - implicitly one-hot encoding (or 
creating dummy variables, if you prefer) the categorical features.
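
Purely for illustration, here is a tiny standalone sketch of that explicit 
scikit-learn-style step (turning one categorical column into 0/1 feature 
columns). It is plain Scala, not tied to any proposed Dataset API, and the 
{{oneHot}} helper is made up for this example:

{code}
// Toy one-hot encoding of a single categorical column.
def oneHot(column: Seq[String]): (Seq[String], Seq[Array[Double]]) = {
  val categories = column.distinct.sorted        // stable ordering of the levels
  val index = categories.zipWithIndex.toMap      // category -> column position
  val encoded = column.map { value =>
    val row = Array.fill(categories.size)(0.0)
    row(index(value)) = 1.0                      // exactly one 1.0 per input row
    row
  }
  (categories, encoded)
}

// e.g. oneHot(Seq("US", "FR", "US")) yields categories Seq("FR", "US")
// and rows [0,1], [1,0], [0,1].
{code}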

While I have a slight preference for representing things as big flat tables, I 
would be fine coding either way - but I wanted to raise the issue for 
discussion here before the interfaces are set in stone.

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.

[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2014-10-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188916#comment-14188916
 ] 

Josh Rosen commented on SPARK-4105:
---

It seems plausible that SPARK-4107 could have caused this issue, but I'm 
waiting for confirmation that its fix resolves it.

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occurrence of a similar error:
> {code}
> 

[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2014-10-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188910#comment-14188910
 ] 

Josh Rosen commented on SPARK-3630:
---

*Decompression errors during shuffle fetching*: If you've seen errors like 
{{FAILED_TO_UNCOMPRESS(5)}} during shuffle fetching, then please see 
SPARK-4105.  We believe that this might be fixed by SPARK-4107, but we're 
awaiting confirmation from the folks that have been able to reproduce these 
errors.

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty

2014-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4136:
-
Affects Version/s: (was: 1.1.0)
   1.2.0

> Under dynamic allocation, cancel outstanding executor requests when pending 
> task queue is empty
> ---
>
> Key: SPARK-4136
> URL: https://issues.apache.org/jira/browse/SPARK-4136
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty

2014-10-29 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4136:
-

 Summary: Under dynamic allocation, cancel outstanding executor 
requests when pending task queue is empty
 Key: SPARK-4136
 URL: https://issues.apache.org/jira/browse/SPARK-4136
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4135) Error reading Parquet file generated with SparkSQL

2014-10-29 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-4135:
--
Attachment: _metadata
part-r-1.parquet

Files generated by SparkSQL that cannot be read.

> Error reading Parquet file generated with SparkSQL
> --
>
> Key: SPARK-4135
> URL: https://issues.apache.org/jira/browse/SPARK-4135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Hossein Falaki
> Attachments: _metadata, part-r-1.parquet
>
>
> I read a tsv version of the one million songs dataset (available here: 
> http://tbmmsd.s3.amazonaws.com/)
> After reading it I create a SchemaRDD with following schema:
> {code}
> root
>  |-- track_id: string (nullable = true)
>  |-- analysis_sample_rate: string (nullable = true)
>  |-- artist_7digitalid: string (nullable = true)
>  |-- artist_familiarity: double (nullable = true)
>  |-- artist_hotness: double (nullable = true)
>  |-- artist_id: string (nullable = true)
>  |-- artist_latitude: string (nullable = true)
>  |-- artist_location: string (nullable = true)
>  |-- artist_longitude: string (nullable = true)
>  |-- artist_mbid: string (nullable = true)
>  |-- artist_mbtags: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_mbtags_count: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_name: string (nullable = true)
>  |-- artist_playmeid: string (nullable = true)
>  |-- artist_terms: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- artist_terms_freq: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- artist_terms_weight: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- audio_md5: string (nullable = true)
>  |-- bars_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- bars_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- beats_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- danceability: double (nullable = true)
>  |-- duration: double (nullable = true)
>  |-- end_of_fade_in: double (nullable = true)
>  |-- energy: double (nullable = true)
>  |-- key: string (nullable = true)
>  |-- key_confidence: double (nullable = true)
>  |-- loudness: double (nullable = true)
>  |-- mode: double (nullable = true)
>  |-- mode_confidence: double (nullable = true)
>  |-- release: string (nullable = true)
>  |-- release_7digitalid: string (nullable = true)
>  |-- sections_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- sections_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_max_time: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_loudness_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_pitches: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- segments_timbre: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- similar_artists: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- song_hotness: double (nullable = true)
>  |-- song_id: string (nullable = true)
>  |-- start_of_fade_out: double (nullable = true)
>  |-- tatums_confidence: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tatums_start: array (nullable = true)
>  ||-- element: double (containsNull = true)
>  |-- tempo: double (nullable = true)
>  |-- time_signature: double (nullable = true)
>  |-- time_signature_confidence: double (nullable = true)
>  |-- title: string (nullable = true)
>  |-- track_7digitalid: string (nullable = true)
>  |-- year: double (nullable = true)
> {code}
> I select a single record from it and save it using saveAsParquetFile(). 
> When I read it later and try to query it I get the following exception:
> {code}
> Error in SQL statement: java.lang.RuntimeException: 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(

[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-10-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188815#comment-14188815
 ] 

Josh Rosen commented on SPARK-4133:
---

Also, can you paste more of the log leading up to the error?  It would be 
helpful to see any other log messages from broadcast, such as messages about it 
fetching pieces / blocks.

> PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
> --
>
> Key: SPARK-4133
> URL: https://issues.apache.org/jira/browse/SPARK-4133
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Antonio Jesus Navarro
>Priority: Blocker
>
> Snappy related problems found when trying to upgrade existing Spark Streaming 
> App from 1.0.2 to 1.1.0.
> We can not run an existing 1.0.2 spark app if upgraded to 1.1.0
> > IOException is thrown by snappy (parsing_error(2))
> > Only spark version changed
> As far as we have checked, snappy will throw this error when dealing with 
> zero-length byte arrays.
> We have tried:
> > Changing from snappy to LZF, 
> > Changing broadcast.compression false
> > Changing from TorrentBroadcast to HTTPBroadcast.
> but with no luck for the moment.
> {code}
> [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0]  
> org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 
> 0.0 (TID 0)
> java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4135) Error reading Parquet file generated with SparkSQL

2014-10-29 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-4135:
-

 Summary: Error reading Parquet file generated with SparkSQL
 Key: SPARK-4135
 URL: https://issues.apache.org/jira/browse/SPARK-4135
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Hossein Falaki


I read a tsv version of the one million songs dataset (available here: 
http://tbmmsd.s3.amazonaws.com/)

After reading it I create a SchemaRDD with following schema:

{code}
root
 |-- track_id: string (nullable = true)
 |-- analysis_sample_rate: string (nullable = true)
 |-- artist_7digitalid: string (nullable = true)
 |-- artist_familiarity: double (nullable = true)
 |-- artist_hotness: double (nullable = true)
 |-- artist_id: string (nullable = true)
 |-- artist_latitude: string (nullable = true)
 |-- artist_location: string (nullable = true)
 |-- artist_longitude: string (nullable = true)
 |-- artist_mbid: string (nullable = true)
 |-- artist_mbtags: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- artist_mbtags_count: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- artist_name: string (nullable = true)
 |-- artist_playmeid: string (nullable = true)
 |-- artist_terms: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- artist_terms_freq: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- artist_terms_weight: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- audio_md5: string (nullable = true)
 |-- bars_confidence: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- bars_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- beats_confidence: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- beats_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- danceability: double (nullable = true)
 |-- duration: double (nullable = true)
 |-- end_of_fade_in: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: string (nullable = true)
 |-- key_confidence: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: double (nullable = true)
 |-- mode_confidence: double (nullable = true)
 |-- release: string (nullable = true)
 |-- release_7digitalid: string (nullable = true)
 |-- sections_confidence: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- sections_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_confidence: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_loudness_max: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_loudness_max_time: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_loudness_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_pitches: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- segments_timbre: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- similar_artists: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- song_hotness: double (nullable = true)
 |-- song_id: string (nullable = true)
 |-- start_of_fade_out: double (nullable = true)
 |-- tatums_confidence: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- tatums_start: array (nullable = true)
 ||-- element: double (containsNull = true)
 |-- tempo: double (nullable = true)
 |-- time_signature: double (nullable = true)
 |-- time_signature_confidence: double (nullable = true)
 |-- title: string (nullable = true)
 |-- track_7digitalid: string (nullable = true)
 |-- year: double (nullable = true)
{code}

I select a single record from it and save it using saveAsParquetFile(). 
When I read it later and try to query it I get the following exception:

{code}
Error in SQL statement: java.lang.RuntimeException: 
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor208.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getSplits$1.apply(ParquetTableOperations.scala:472)
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getSplits$1.apply(ParquetTableOperations.scala:457)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.sca

[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3080:
-
 Target Version/s: 1.2.0
Affects Version/s: 1.1.0

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4081) Categorical feature indexing

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188810#comment-14188810
 ] 

Apache Spark commented on SPARK-4081:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/3000

> Categorical feature indexing
> 
>
> Key: SPARK-4081
> URL: https://issues.apache.org/jira/browse/SPARK-4081
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> DecisionTree and RandomForest require that categorical features and labels be 
> indexed 0,1,2  There is currently no code to aid with indexing a dataset. 
>  This is a proposal for a helper class for computing indices (and also 
> deciding which features to treat as categorical).
> Proposed functionality:
> * This helps process a dataset of unknown vectors into a dataset with some 
> continuous features and some categorical features. The choice between 
> continuous and categorical is based upon a maxCategories parameter.
> * This can also map categorical feature values to 0-based indices.
> Usage:
> {code}
> val myData1: RDD[Vector] = ...
> val myData2: RDD[Vector] = ...
> val datasetIndexer = new DatasetIndexer(maxCategories)
> datasetIndexer.fit(myData1)
> val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
> datasetIndexer.fit(myData2)
> val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
> val categoricalFeaturesInfo: Map[Double, Int] = 
> datasetIndexer.getCategoricalFeatureIndexes()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4003) Add {Big Decimal, Timestamp, Date} types to Java SqlContext

2014-10-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4003.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2850
[https://github.com/apache/spark/pull/2850]

> Add {Big Decimal, Timestamp, Date} types to Java SqlContext
> ---
>
> Key: SPARK-4003
> URL: https://issues.apache.org/jira/browse/SPARK-4003
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
> Fix For: 1.2.0
>
>
> in JavaSqlContext, we need to let java program use big decimal, timestamp, 
> date types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4081) Categorical feature indexing

2014-10-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4081:
-
Description: 
DecisionTree and RandomForest require that categorical features and labels be 
indexed 0,1,2  There is currently no code to aid with indexing a dataset.  
This is a proposal for a helper class for computing indices (and also deciding 
which features to treat as categorical).

Proposed functionality:
* This helps process a dataset of unknown vectors into a dataset with some 
continuous features and some categorical features. The choice between 
continuous and categorical is based upon a maxCategories parameter.
* This can also map categorical feature values to 0-based indices.

Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Double, Int] = 
datasetIndexer.getCategoricalFeatureIndexes()
{code}


  was:
DecisionTree and RandomForest require that categorical features and labels be 
indexed 0,1,2  There is currently no code to aid with indexing a dataset.  
This is a proposal for a helper class for computing indices (and also deciding 
which features to treat as categorical).

Proposed functionality:
* This helps process a dataset of unknown vectors into a dataset with some 
continuous features and some categorical features. The choice between 
continuous and categorical is based upon a maxCategories parameter.
* This can also map categorical feature values to 0-based indices.

Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Int, Int] = 
datasetIndexer.getCategoricalFeaturesInfo()
{code}



> Categorical feature indexing
> 
>
> Key: SPARK-4081
> URL: https://issues.apache.org/jira/browse/SPARK-4081
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> DecisionTree and RandomForest require that categorical features and labels be 
> indexed 0,1,2  There is currently no code to aid with indexing a dataset. 
>  This is a proposal for a helper class for computing indices (and also 
> deciding which features to treat as categorical).
> Proposed functionality:
> * This helps process a dataset of unknown vectors into a dataset with some 
> continuous features and some categorical features. The choice between 
> continuous and categorical is based upon a maxCategories parameter.
> * This can also map categorical feature values to 0-based indices.
> Usage:
> {code}
> val myData1: RDD[Vector] = ...
> val myData2: RDD[Vector] = ...
> val datasetIndexer = new DatasetIndexer(maxCategories)
> datasetIndexer.fit(myData1)
> val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
> datasetIndexer.fit(myData2)
> val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
> val categoricalFeaturesInfo: Map[Double, Int] = 
> datasetIndexer.getCategoricalFeatureIndexes()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3958) Possible stream-corruption issues in TorrentBroadcast

2014-10-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3958:
--
Affects Version/s: 1.1.0

Adding 1.1.0 as an affected version, since a user has observed this in 1.1.0, 
too; see SPARK-4133.

> Possible stream-corruption issues in TorrentBroadcast
> -
>
> Key: SPARK-3958
> URL: https://issues.apache.org/jira/browse/SPARK-3958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> TorrentBroadcast deserialization sometimes fails with decompression errors, 
> which are most likely caused by stream-corruption exceptions.  For example, 
> this can manifest itself as a Snappy PARSING_ERROR when deserializing a 
> broadcasted task:
> {code}
> 14/10/14 17:20:55.016 DEBUG BlockManager: Getting local block broadcast_8
> 14/10/14 17:20:55.016 DEBUG BlockManager: Block broadcast_8 not registered 
> locally
> 14/10/14 17:20:55.016 INFO TorrentBroadcast: Started reading broadcast 
> variable 8
> 14/10/14 17:20:55.017 INFO TorrentBroadcast: Reading broadcast variable 8 
> took 5.3433E-5 s
> 14/10/14 17:20:55.017 ERROR Executor: Exception in task 2.0 in stage 8.0 (TID 
> 18)
> java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>   at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:170)
>   at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> SPARK-3630 is an umbrella ticket for investigating all causes of these Kryo 
> and Snappy deserialization errors.  This ticket is for a more 
> narrowly-focused exploration of the TorrentBroadcast version of these errors, 
> since the similar errors that we've seen in sort-based shuffle seem to be 
> explained by a different cause (see SPARK-3948).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-10-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188665#comment-14188665
 ] 

Josh Rosen commented on SPARK-4133:
---

Since you mentioned that you see a similar issue when using HTTPBroadcast, 
could you post the stacktrace from that case, too?  Similarly, can you post the 
stacktrace when broadcast compression is disabled?

> PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
> --
>
> Key: SPARK-4133
> URL: https://issues.apache.org/jira/browse/SPARK-4133
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Antonio Jesus Navarro
>Priority: Blocker
>
> Snappy related problems found when trying to upgrade existing Spark Streaming 
> App from 1.0.2 to 1.1.0.
> We can not run an existing 1.0.2 spark app if upgraded to 1.1.0
> > IOException is thrown by snappy (parsing_error(2))
> > Only spark version changed
> As far as we have checked, snappy will throw this error when dealing with 
> zero-length byte arrays.
> We have tried:
> > Changing from snappy to LZF, 
> > Changing broadcast.compression false
> > Changing from TorrentBroadcast to HTTPBroadcast.
> but with no luck for the moment.
> {code}
> [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0]  
> org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 
> 0.0 (TID 0)
> java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4129:
-
Assignee: DB Tsai

> Performance tuning in MultivariateOnlineSummarizer
> --
>
> Key: SPARK-4129
> URL: https://issues.apache.org/jira/browse/SPARK-4129
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.2.0
>
>
> In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
> through the nonZero elements in the vector. However, activeIterator doesn't 
> perform well due to lots of overhead. In this PR, native while loop is used 
> for both DenseVector and SparseVector.
> The benchmark result with 20 executors using mnist8m dataset:
> Before:
> DenseVector: 48.2 seconds
> SparseVector: 16.3 seconds
> After:
> DenseVector: 17.8 seconds
> SparseVector: 11.2 seconds
> Since MultivariateOnlineSummarizer is used in several places, the overall 
> performance gain in mllib library will be significant with this PR. 
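As an illustration only, here is a rough sketch of the while-loop pattern described above, written against the public MLlib vector classes rather than the actual patch (so the details are an assumption):

{code:scala}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// Visit the non-zero entries of a vector with plain while loops instead of
// breeze's activeIterator, which pays for an iterator and a tuple per element.
def foreachNonZero(v: Vector)(f: (Int, Double) => Unit): Unit = v match {
  case dv: DenseVector =>
    val values = dv.values
    var i = 0
    while (i < values.length) {
      if (values(i) != 0.0) f(i, values(i))
      i += 1
    }
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    var k = 0
    // only the stored (non-zero) entries of the sparse vector
    while (k < indices.length) {
      f(indices(k), values(k))
      k += 1
    }
}
{code}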



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4129.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2992
[https://github.com/apache/spark/pull/2992]

> Performance tuning in MultivariateOnlineSummarizer
> --
>
> Key: SPARK-4129
> URL: https://issues.apache.org/jira/browse/SPARK-4129
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.2.0
>
>
> In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
> through the nonZero elements in the vector. However, activeIterator doesn't 
> perform well due to lots of overhead. In this PR, native while loop is used 
> for both DenseVector and SparseVector.
> The benchmark result with 20 executors using mnist8m dataset:
> Before:
> DenseVector: 48.2 seconds
> SparseVector: 16.3 seconds
> After:
> DenseVector: 17.8 seconds
> SparseVector: 11.2 seconds
> Since MultivariateOnlineSummarizer is used in several places, the overall 
> performance gain in mllib library will be significant with this PR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3182) Twitter Streaming Geolocation Filter

2014-10-29 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188544#comment-14188544
 ] 

Brennon York commented on SPARK-3182:
-

Hey all, I'm looking to contribute back to Spark :) I'd like to take this as a 
first issue. Could you please assign it to me? Thanks!

> Twitter Streaming Geolocation Filter
> 
>
> Key: SPARK-3182
> URL: https://issues.apache.org/jira/browse/SPARK-3182
> Project: Spark
>  Issue Type: Wish
>  Components: Streaming
>Affects Versions: 1.0.0, 1.0.2
>Reporter: Daniel Kershaw
>  Labels: features
> Fix For: 1.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add a geolocation filter to the Twitter Streaming Component. 
> This should take a sequence of double to indicate the bounding box for the 
> stream. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4134) Tone down scary executor lost messages when killing on purpose

2014-10-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4134:


 Summary: Tone down scary executor lost messages when killing on 
purpose
 Key: SPARK-4134
 URL: https://issues.apache.org/jira/browse/SPARK-4134
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or


After SPARK-3822 goes in, we are now able to dynamically kill executors after 
an application has started. However, when we do that we get a ton of scary 
error messages telling us that we've done something wrong. It would be good to 
detect when this is the case and prevent these messages from surfacing.

This may be difficult, however, because the connection manager tends to be quite 
verbose in unconditionally logging disconnection messages. This is a very 
nice-to-have for 1.2 but certainly not a blocker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-29 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188357#comment-14188357
 ] 

Michael Griffiths edited comment on SPARK-3398 at 10/29/14 1:58 PM:


Hi Nicholas,

Thanks for the thorough investigation!

Making the path absolute does work for me, when called with spark-ec2.

Thanks!


was (Author: michael.griffiths):
Hi Nicholas,

Thanks for the thorough investigation!

Making the path absolute does work for me, when called with spark-ec2.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-29 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188357#comment-14188357
 ] 

Michael Griffiths commented on SPARK-3398:
--

Hi Nicholas,

Thanks for the thorough investigation!

Making the path absolute does work for me, when called with spark-ec2.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-29 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188354#comment-14188354
 ] 

Ilya Ganelin commented on SPARK-3080:
-

Hello Xiangrui - happy to hear that you're on this!

Regarding the first question, I have not seen any spilling to disk, but I have 
seen executor loss on a relatively frequent basis. I don't know whether this is 
a function of how our cluster is used or an internal Spark issue.

Regarding upgrading ALS, can I simply replace the old SimpleALS.scala with the 
new one, or will there be additional dependencies? I am interested in doing a 
piecemeal upgrade of MLlib (without upgrading the rest of Spark from version 
1.1) in order to maintain compatibility with CDH 5.2.

Please let me know, thank you.

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-29 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188325#comment-14188325
 ] 

RJ Nowling commented on SPARK-2429:
---

The sparsity tests look good.  Have you compared training and assignment time 
to KMeans yet?  An improvement in assignment time will be important.  Also, 
I don't see a breakdown of the total time into splitting clusters, assignments, 
etc.  It doesn't need to cover every combination of parameters, just one or two.  
That would be very helpful.  Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.
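For concreteness, a very rough sketch of the first approach (top-down, recursive application of KMeans) against the existing MLlib API; this illustrates the idea only and is not the implementation being benchmarked here:

{code:scala}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Split the data in two with plain KMeans, then recurse on each half until
// the requested number of leaf clusters is reached.
def bisect(data: RDD[Vector], numLeaves: Int, maxIterations: Int = 20): Seq[RDD[Vector]] = {
  if (numLeaves <= 1 || data.count() < 2) {
    Seq(data)
  } else {
    val model = KMeans.train(data, 2, maxIterations)
    val left  = data.filter(v => model.predict(v) == 0).cache()
    val right = data.filter(v => model.predict(v) == 1).cache()
    bisect(left, numLeaves / 2, maxIterations) ++
      bisect(right, numLeaves - numLeaves / 2, maxIterations)
  }
}
{code}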



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-29 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2429:
---
Attachment: benchmark-result.2014-10-29.html

I added a new performance test result named 
`benchmark-result.2014-10-29.html`.  The main change from the previous result 
is the addition of a benchmark on vector sparsity. Please take a look.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-10-29 Thread Antonio Jesus Navarro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188284#comment-14188284
 ] 

Antonio Jesus Navarro commented on SPARK-4133:
--

An existing Spark Streaming app cannot be upgraded from 1.0.2 to 1.1.0.

> PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
> --
>
> Key: SPARK-4133
> URL: https://issues.apache.org/jira/browse/SPARK-4133
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Antonio Jesus Navarro
>Priority: Blocker
>
> Snappy related problems found when trying to upgrade existing Spark Streaming 
> App from 1.0.2 to 1.1.0.
> We can not run an existing 1.0.2 spark app if upgraded to 1.1.0
> > IOException is thrown by snappy (parsing_error(2))
> > Only spark version changed
> As far as we have checked, snappy will throw this error when dealing with 
> zero bytes length arrays.
> We have tried:
> > Changing from snappy to LZF, 
> > Changing broadcast.compression false
> > Changing from TorrentBroadcast to HTTPBroadcast.
> but with no luck for the moment.
> {code}
> [ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0]  
> org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 
> 0.0 (TID 0)
> java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2014-10-29 Thread Antonio Jesus Navarro (JIRA)
Antonio Jesus Navarro created SPARK-4133:


 Summary: PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
 Key: SPARK-4133
 URL: https://issues.apache.org/jira/browse/SPARK-4133
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Antonio Jesus Navarro
Priority: Blocker


Snappy-related problems were found when trying to upgrade an existing Spark 
Streaming app from 1.0.2 to 1.1.0.

We cannot run an existing 1.0.2 Spark app once it is upgraded to 1.1.0:

> An IOException is thrown by Snappy (PARSING_ERROR(2))
> Only the Spark version changed

As far as we have checked, Snappy will throw this error when dealing with 
zero-length byte arrays.

We have tried:

> Changing the codec from Snappy to LZF
> Setting broadcast.compression to false
> Changing from TorrentBroadcast to HTTPBroadcast

but with no luck so far.

{code}
[ERROR] 2014-10-29 11:23:26,396 [Executor task launch worker-0]  
org.apache.spark.executor.Executor logError - Exception in task 0.0 in stage 
0.0 (TID 0)
java.io.IOException: PARSING_ERROR(2)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
at 
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
at 
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:169)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-29 Thread Tamas Jambor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188227#comment-14188227
 ] 

Tamas Jambor commented on SPARK-3683:
-

OK, makes sense. Thanks.

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-29 Thread Tamas Jambor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamas Jambor closed SPARK-3683.
---
Resolution: Not a Problem

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188194#comment-14188194
 ] 

Cheng Lian commented on SPARK-3683:
---

[~jamborta] Your concern is legitimate. Unfortunately, however, we have to take 
Hive compatibility into consideration in this case; otherwise people who run 
legacy Hive scripts with Spark SQL may get wrong query results.

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-29 Thread Tamas Jambor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188185#comment-14188185
 ] 

Tamas Jambor commented on SPARK-3683:
-

Thanks for the comments. 
From my perspective this is a matter of inconsistency, as all the other types 
are represented as None in Python, except string. So I run another pass on the 
data and convert all the 'NULL' values to None. 
I think the problem with the literal string "NULL" is that you cannot build the 
logic to handle it in subsequent steps, as it is not represented in the 
appropriate way (missing values are usually handled as a special case). 




> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-683) Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation

2014-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188160#comment-14188160
 ] 

Sean Owen commented on SPARK-683:
-

PS I think this also turns out to be the same as SPARK-4078

> Spark 0.7 with Hadoop 1.0 does not work with current AMI's HDFS installation
> 
>
> Key: SPARK-683
> URL: https://issues.apache.org/jira/browse/SPARK-683
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 0.7.0
>Reporter: Tathagata Das
>
> A simple saveAsObjectFile() leads to the following error.
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
> java.lang.NoSuchMethodException: 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, 
> org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, 
> boolean, short, long)
>   at java.lang.Class.getMethod(Class.java:1622)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:416)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4132) Spark uses incompatible HDFS API

2014-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4132.
--
Resolution: Duplicate

I'm all but certain you're describing the same thing as SPARK-4078

> Spark uses incompatible HDFS API
> 
>
> Key: SPARK-4132
> URL: https://issues.apache.org/jira/browse/SPARK-4132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark1.1.0 on Hadoop1.2.1
> CentOS 6.3 64bit
>Reporter: kuromatsu nobuyuki
>Priority: Minor
>
> When I enable event logging and set it to output to HDFS, initialization 
> fails with 'java.lang.ClassNotFoundException' (see trace below).
> I found that an API incompatibility in 
> org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop 
> 1.1.0 (and above) causes this error 
> (org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't 
> exist in my 1.2.1 environment).
> I think that the Spark jar file pre-built for Hadoop1.X should be built on 
> Hadoop Stable version(Hadoop 1.2.1).
> 2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server 
> listener on 9000: 
> readAndProcess threw exception java.lang.RuntimeException: 
> readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. 
> Count of bytes read: 0
> java.lang.RuntimeException: readObject can't find class 
> org.apache.hadoop.fs.permission.FsPermission$2
> at 
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233)
> at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106)
> at 
> org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347)
> at 
> org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326)
> at 
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:701)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.permission.FsPermission$2
> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
> at 
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231)
> ... 9 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Description: 
Writing data into the filesystem from queries,SparkSql is not support .
eg:
insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
from page_views;


  was:
Writing data into the filesystem from queries,SparkSql is not support .
eg:
insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
from page_views; 
out:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF



> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select 
> * from page_views;
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188158#comment-14188158
 ] 

Apache Spark commented on SPARK-4131:
-

User 'wangxiaojing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2997

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Description: 
Writing data into the filesystem from queries,SparkSql is not support .
eg:
insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
from page_views; 
out:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF


  was:
Writing data into the filesystem from queries,SparkSql is not support .
eg:

insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * from 
page_views; 
out:


java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF




> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4132) Spark uses incompatible HDFS API

2014-10-29 Thread kuromatsu nobuyuki (JIRA)
kuromatsu nobuyuki created SPARK-4132:
-

 Summary: Spark uses incompatible HDFS API
 Key: SPARK-4132
 URL: https://issues.apache.org/jira/browse/SPARK-4132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Spark1.1.0 on Hadoop1.2.1
CentOS 6.3 64bit
Reporter: kuromatsu nobuyuki
Priority: Minor


When I enable event logging and set it to output to HDFS, initialization fails 
with 'java.lang.ClassNotFoundException' (see the trace below).

I found that an API incompatibility in 
org.apache.hadoop.fs.permission.FsPermission between Hadoop 1.0.4 and Hadoop 
1.1.0 (and above) causes this error 
(org.apache.hadoop.fs.permission.FsPermission$2 is used in 1.0.4 but doesn't 
exist in my 1.2.1 environment).

I think the Spark jar file pre-built for Hadoop 1.x should be built against the 
stable Hadoop version (Hadoop 1.2.1).


2014-10-24 10:43:22,893 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 9000: 
readAndProcess threw exception java.lang.RuntimeException: 
readObject can't find class org.apache.hadoop.fs.permission.FsPermission$2. 
Count of bytes read: 0
java.lang.RuntimeException: readObject can't find class 
org.apache.hadoop.fs.permission.FsPermission$2
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:233)
at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:106)
at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1347)
at 
org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1326)
at 
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1226)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:577)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:384)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.fs.permission.FsPermission$2
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231)
... 9 more
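For context, a minimal sketch of the event-logging setup described at the top of this report (property names as I read the 1.1.0 configuration docs; the HDFS URI is a placeholder):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: enable event logging and point it at HDFS.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:9000/spark-events") // placeholder path
{code}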





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Description: 
Writing data into the filesystem from queries,SparkSql is not support .
eg:

insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * from 
page_views; 
out:


java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF



  was:
Writing data into the filesystem from queries,SparkSql is not support .
eg:

insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * from 
page_views; 
out:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF




> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> 
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF
> 
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Description: 
Writing data into the filesystem from queries,SparkSql is not support .
eg:

insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * from 
page_views; 
out:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF



  was:
Writing data into the filesystem from queries,SparkSql is not support .
eg:
insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * from 
page_views; 
out:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF


> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> 
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF
> 
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188147#comment-14188147
 ] 

Ravindra Pesala commented on SPARK-4131:


I will work on this issue.

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Summary: Support "Writing data into the filesystem from queries"  (was: 
support “Writing data into the filesystem from queries”)

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4131) support “Writing data into the filesystem from queries”

2014-10-29 Thread XiaoJing wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaoJing wang updated SPARK-4131:
-
Summary: support “Writing data into the filesystem from queries”  (was: 
support “insert overwrite LOCAL DIRECTORY ‘dir’ select * from tablename;”)

> support “Writing data into the filesystem from queries”
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Critical
> Fix For: 1.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries,SparkSql is not support .
> eg:
> insert overwrite LOCAL DIRECTORY  '/data1/wangxj/sql_spark' select * 
> from page_views; 
> out:
> 
> java.lang.RuntimeException: 
> Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
> '/data1/wangxj/sql_spark' select * from page_views
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> page_views
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_LOCAL_DIR
> '/data1/wangxj/sql_spark'
> TOK_SELECT
>   TOK_SELEXPR
> TOK_ALLCOLREF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4131) support “insert overwrite LOCAL DIRECTORY ‘dir’ select * from tablename;”

2014-10-29 Thread XiaoJing wang (JIRA)
XiaoJing wang created SPARK-4131:


 Summary: support “insert overwrite LOCAL DIRECTORY ‘dir’ select * 
from tablename;”
 Key: SPARK-4131
 URL: https://issues.apache.org/jira/browse/SPARK-4131
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: XiaoJing wang
Priority: Critical
 Fix For: 1.3.0


Spark SQL does not support "Writing data into the filesystem from queries".
e.g.:
insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from 
page_views; 
output:

java.lang.RuntimeException: 
Unsupported language features in query: insert overwrite LOCAL DIRECTORY  
'/data1/wangxj/sql_spark' select * from page_views
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
page_views
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
'/data1/wangxj/sql_spark'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF
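Until this is supported, a possible interim workaround, sketched under the assumption that fetching the rows through HiveContext and writing them with the RDD API is acceptable (this is not equivalent to Hive's INSERT OVERWRITE LOCAL DIRECTORY semantics):

{code:scala}
import org.apache.spark.sql.hive.HiveContext

// Run the query and write the rows out manually instead of relying on
// "INSERT OVERWRITE LOCAL DIRECTORY", which the parser currently rejects.
val hiveContext = new HiveContext(sc)          // assumes an existing SparkContext `sc`
val rows = hiveContext.sql("SELECT * FROM page_views")
rows.map(_.mkString("\u0001"))                 // Hive's default field delimiter
  .saveAsTextFile("/data1/wangxj/sql_spark")   // on a cluster this is typically an HDFS path, not a local dir
{code}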



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1442) Add Window function support

2014-10-29 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: Window Function.pdf

> Add Window function support
> ---
>
> Key: SPARK-1442
> URL: https://issues.apache.org/jira/browse/SPARK-1442
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Chengxiang Li
> Attachments: Window Function.pdf
>
>
> Similar to Hive, add window function support for catalyst.
> https://issues.apache.org/jira/browse/HIVE-4197
> https://issues.apache.org/jira/browse/HIVE-896



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1442) Add Window function support

2014-10-29 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: (was: Window Function.pdf)

> Add Window function support
> ---
>
> Key: SPARK-1442
> URL: https://issues.apache.org/jira/browse/SPARK-1442
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Chengxiang Li
> Attachments: Window Function.pdf
>
>
> Similar to Hive, add window function support for catalyst.
> https://issues.apache.org/jira/browse/HIVE-4197
> https://issues.apache.org/jira/browse/HIVE-896



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188110#comment-14188110
 ] 

Apache Spark commented on SPARK-4130:
-

User 'jegonzal' has created a pull request for this issue:
https://github.com/apache/spark/pull/2996

> loadLibSVMFile does not handle extra whitespace
> ---
>
> Key: SPARK-4130
> URL: https://issues.apache.org/jira/browse/SPARK-4130
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Joseph E. Gonzalez
>
> When testing MLlib on the splice site data 
> (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site),
> the loadLibSVMFile call fails.  To reproduce in the Spark shell:
> {code:scala}
> import org.apache.spark.mllib.util.MLUtils
> val data =  MLUtils.loadLibSVMFile(sc, 
> "hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
> {code}
> generates the error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0.0:73 failed 4 times, most recent failure: Exception failure in TID 335 on 
> host ip-172-31-31-54.us-west-2.compute.internal: 
> java.lang.NumberFormatException: For input string: ""
> 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> java.lang.Integer.parseInt(Integer.java:504)
> java.lang.Integer.parseInt(Integer.java:527)
> 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)
> 
> org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)
> 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
> org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
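
The trace bottoms out in {{java.lang.NumberFormatException: For input string: ""}}, i.e. an empty token reaching {{toInt}}. That is exactly what a split on a single space produces when a line contains consecutive spaces, whereas splitting on a whitespace run does not. A quick illustration (Python here for brevity; the sample line is made up):

{code:python}
line = "1  2:0.5 4:1.0"   # note the double space after the label

line.split(" ")           # ['1', '', '2:0.5', '4:1.0'] -> the empty token is what
                          # later fails as int('') / "".toInt
line.split()              # ['1', '2:0.5', '4:1.0']     -> splitting on a whitespace
                          # run never yields empty tokens
{code}

The pull request linked above (https://github.com/apache/spark/pull/2996) is where the loader-side handling is proposed.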



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-29 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188105#comment-14188105
 ] 

Davies Liu commented on SPARK-3683:
---

[~jamborta] It seems that this is a feature, not a bug. Does this work for you?

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, which does 
> not convert Hive NULL into Python None; instead it keeps the string 'NULL'. 
> It's only an issue with the String type; it works with other types.
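
A minimal sketch of the reported behaviour, assuming a Hive-enabled context {{hiveCtx}} and a made-up table whose string column contains NULLs:

{code:python}
# Illustrative only: table and column names are invented.
rows = hiveCtx.sql("SELECT name FROM people_with_missing_names").collect()

rows[0][0]   # reported: the string 'NULL'
             # expected by the reporter: None
             # (non-string columns reportedly convert to None as expected)
{code}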



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4124) Simplify serialization and call API in MLlib Python

2014-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188104#comment-14188104
 ] 

Apache Spark commented on SPARK-4124:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2995

> Simplify serialization and call API in MLlib Python
> ---
>
> Key: SPARK-4124
> URL: https://issues.apache.org/jira/browse/SPARK-4124
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There is a lot of repeated code doing similar things: converting an RDD into a Java 
> object, converting arguments into Java, and converting the result RDD/object back into 
> Python. These paths could be simplified to share the same code.
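
A hedged sketch of the direction described above; every name here ({{call_mllib_func}}, {{_py_to_java}}, {{_java_to_py}}) is hypothetical, not the API from the pull request. The point is simply that one entry point can own all of the Python-to-Java and Java-to-Python conversion:

{code:python}
from pyspark import SparkContext

def _py_to_java(sc, obj):
    """Hypothetical shared converter: Python argument/RDD -> Java counterpart."""
    raise NotImplementedError("conversion details elided in this sketch")

def _java_to_py(sc, obj):
    """Hypothetical shared converter: Java result/RDD -> Python counterpart."""
    raise NotImplementedError("conversion details elided in this sketch")

def call_mllib_func(name, *args):
    """Call PythonMLLibAPI.<name> on the JVM, doing all conversion in one place."""
    sc = SparkContext._active_spark_context
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    jargs = [_py_to_java(sc, a) for a in args]
    return _java_to_py(sc, api(*jargs))

# With such a helper each algorithm wrapper shrinks to roughly one call, e.g.
# something like: call_mllib_func("trainKMeansModel", rdd, k, maxIterations)
{code}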



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-4130:
--
Description: 
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

{code:scala}
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
{code}

generates the error:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

  was:
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

{code:scala}
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
{code}

generates the error:

{{
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:

[jira] [Updated] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-4130:
--
Description: 
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

{code:scala}
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
{code}

generates the error:

{{
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
}}

  was:
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

{code:scala}
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
{code}

generates the error:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
   

[jira] [Updated] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-4130:
--
Description: 
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

{code:scala}
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
{code}

generates the error:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
```

  was:
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

```
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
```
generates the error:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.

[jira] [Updated] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-4130:
--
Description: 
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

```
import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")
```
generates the error:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
```

  was:
When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")

generates the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$ano

[jira] [Created] (SPARK-4130) loadLibSVMFile does not handle extra whitespace

2014-10-29 Thread Joseph E. Gonzalez (JIRA)
Joseph E. Gonzalez created SPARK-4130:
-

 Summary: loadLibSVMFile does not handle extra whitespace
 Key: SPARK-4130
 URL: https://issues.apache.org/jira/browse/SPARK-4130
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Joseph E. Gonzalez


When testing MLlib on the splice site data 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site), 
the loadLibSVMFile call fails. To reproduce in spark shell:

import org.apache.spark.mllib.util.MLUtils
val data =  MLUtils.loadLibSVMFile(sc, 
"hdfs://ec2-54-200-69-227.us-west-2.compute.amazonaws.com:9000/splice_site.t")

generates the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:73 
failed 4 times, most recent failure: Exception failure in TID 335 on host 
ip-172-31-31-54.us-west-2.compute.internal: java.lang.NumberFormatException: 
For input string: ""

java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
java.lang.Integer.parseInt(Integer.java:504)
java.lang.Integer.parseInt(Integer.java:527)
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
scala.collection.immutable.StringOps.toInt(StringOps.scala:31)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:81)

org.apache.spark.mllib.util.MLUtils$$anonfun$4$$anonfun$5.apply(MLUtils.scala:79)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:79)
org.apache.spark.mllib.util.MLUtils$$anonfun$4.apply(MLUtils.scala:76)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org