[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo

2014-07-03 Thread ankit bhardwaj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052198#comment-14052198
 ] 

ankit bhardwaj commented on SPARK-2306:
---

Created a pull request for it: https://github.com/apache/spark/pull/1298

> BoundedPriorityQueue is private and not registered with Kryo
> 
>
> Key: SPARK-2306
> URL: https://issues.apache.org/jira/browse/SPARK-2306
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>
> Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top 
> cannot be used when using Kryo (the recommended configuration).
> Curiously BoundedPriorityQueue is registered by GraphKryoRegistrator. But 
> that's the wrong registrator. (Is there one for Spark Core?)
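
A minimal user-side workaround sketch, assuming the class keeps its current fully 
qualified name org.apache.spark.util.BoundedPriorityQueue (since the class is 
private, it has to be registered reflectively; the registrator name here is 
hypothetical):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class TopWorkaroundRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // BoundedPriorityQueue is private[spark], so register it by name.
    kryo.register(Class.forName("org.apache.spark.util.BoundedPriorityQueue"))
  }
}
// Enabled by setting spark.serializer to org.apache.spark.serializer.KryoSerializer
// and spark.kryo.registrator to the fully qualified name of the registrator above.
{code}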



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2059) Unresolved Attributes should cause a failure before execution time

2014-07-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2059.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> Unresolved Attributes should cause a failure before execution time
> --
>
> Key: SPARK-2059
> URL: https://issues.apache.org/jira/browse/SPARK-2059
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.0.1, 1.1.0
>
>
> Here's a partial solution: 
> https://github.com/marmbrus/spark/tree/analysisChecks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2282:
---

Affects Version/s: 0.9.1

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
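
A minimal sketch of the in-Spark alternative mentioned above (the helper name is 
hypothetical and this is not the actual PythonAccumulatorParam code):

{code}
import java.net.{InetSocketAddress, Socket}

// Enable SO_REUSEADDR before connecting so that sockets lingering in
// TIME_WAIT do not exhaust the ephemeral port range.
def openAccumulatorSocket(host: String, port: Int): Socket = {
  val socket = new Socket()
  socket.setReuseAddress(true) // must be set before bind/connect
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}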



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2282.


   Resolution: Fixed
Fix Version/s: 1.0.0
   1.0.1
   0.9.2

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.1, 1.0.0
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2350:
---

Fix Version/s: 0.9.2

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.1, 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from a list while iterating through this.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}
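
A standalone illustration of the hazard and the usual fix (a sketch only; the 
actual change in PR 1289 may differ): iterate over a snapshot of the mutable 
collection so that removing elements does not disturb the iteration.

{code}
import scala.collection.mutable.ArrayBuffer

val waiting = ArrayBuffer("d1", "d2", "d3", "d4")
// waiting.toList takes an immutable snapshot first, so removing elements
// from `waiting` inside the loop is safe.
for (d <- waiting.toList) {
  if (d == "d2" || d == "d3") {
    waiting -= d
  }
}
// waiting now contains "d1" and "d4"
{code}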



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2307) SparkUI Storage page cached statuses incorrect

2014-07-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052183#comment-14052183
 ] 

Patrick Wendell commented on SPARK-2307:


There was a follow up patch:
https://github.com/apache/spark/pull/1255

> SparkUI Storage page cached statuses incorrect
> --
>
> Key: SPARK-2307
> URL: https://issues.apache.org/jira/browse/SPARK-2307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.0.1, 1.1.0
>
> Attachments: Screen Shot 2014-06-27 at 11.09.54 AM.png
>
>
> See attached: the executor has 512MB, but somehow it has cached (279 + 27 + 
> 279 + 27) = 612MB? (The correct answer is 279MB).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2350:
---

Assignee: Aaron Davidson

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Aaron Davidson
> Fix For: 1.0.1, 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from a list while iterating through this.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2350) Master throws NPE

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2350.


   Resolution: Fixed
Fix Version/s: 1.0.1

Issue resolved by pull request 1289
[https://github.com/apache/spark/pull/1289]

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.0.1, 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from a list while iterating through this.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2349:
-

Fix Version/s: 1.1.0

> Fix NPE in ExternalAppendOnlyMap
> 
>
> Key: SPARK-2349
> URL: https://issues.apache.org/jira/browse/SPARK-2349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.0.1, 1.1.0
>
>
> It throws an NPE on null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2349:
-

Fix Version/s: 1.0.1

> Fix NPE in ExternalAppendOnlyMap
> 
>
> Key: SPARK-2349
> URL: https://issues.apache.org/jira/browse/SPARK-2349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.0.1, 1.1.0
>
>
> It throws an NPE on null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2014-07-03 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052127#comment-14052127
 ] 

Ankur Dave commented on SPARK-2365:
---

Proposed implementation: https://github.com/apache/spark/pull/1297

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2014-07-03 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-2365:
-

 Summary: Add IndexedRDD, an efficient updatable key-value store
 Key: SPARK-2365
 URL: https://issues.apache.org/jira/browse/SPARK-2365
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, Spark Core
Reporter: Ankur Dave
Assignee: Ankur Dave


RDDs currently provide a bulk-updatable, iterator-based interface. This imposes 
minimal requirements on the storage layer, which only needs to support 
sequential access, enabling on-disk and serialized storage.

However, many applications would benefit from a richer interface. Efficient 
support for point lookups would enable serving data out of RDDs, but it 
currently requires iterating over an entire partition to find the desired 
element. Point updates similarly require copying an entire iterator. Joins are 
also expensive, requiring a shuffle and local hash joins.

To address these problems, we propose IndexedRDD, an efficient key-value store 
built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
uniqueness and pre-indexing the entries for efficient joins and point lookups, 
updates, and deletions.

It would be implemented by (1) hash-partitioning the entries by key, (2) 
maintaining a hash index within each partition, and (3) using purely functional 
(immutable and efficiently updatable) data structures to enable efficient 
modifications and deletions.

GraphX would be the first user of IndexedRDD, since it currently implements a 
limited form of this functionality in VertexRDD. We envision a variety of other 
uses for IndexedRDD, including streaming updates to RDDs, direct serving from 
RDDs, and as an execution strategy for Spark SQL.
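
An interface sketch of the operations described above; the trait name and 
signatures are illustrative only and are not taken from PR 1297:

{code}
import org.apache.spark.rdd.RDD

// Illustrative only: an RDD[(Long, V)] with unique keys and a per-partition
// hash index could expose point operations roughly like this.
trait IndexedRDDSketch[V] {
  def get(k: Long): Option[V]                              // point lookup
  def put(k: Long, v: V): IndexedRDDSketch[V]              // functional point update
  def delete(ks: Array[Long]): IndexedRDDSketch[V]         // batch deletion
  def join[W](other: RDD[(Long, W)]): RDD[(Long, (V, W))]  // pre-indexed join
}
{code}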



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-03 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052103#comment-14052103
 ] 

Rui Li commented on SPARK-2277:
---

With [PR #892|https://github.com/apache/spark/pull/892], we'll check whether a 
task's preferred location is available when adding it to the pending lists. 
TaskScheduler tracks information about executors and hosts, so that 
TaskSetManager can check if the preferred executor/host is available.

TaskScheduler also provides getRackForHost to get the corresponding rack for a 
host (currently it only returns None). I think this is prior knowledge about the 
cluster topology, and it does not indicate whether any host on that rack has 
been granted to this Spark app. Therefore we don't know the availability of the 
preferred rack.
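
A sketch of the proposed bookkeeping (the names are illustrative, not Spark's 
actual fields): the scheduler would keep a map of alive hosts per rack so that 
TaskSetManager can ask whether a preferred rack currently has any host at all.

{code}
import scala.collection.mutable

val hostsByRack = new mutable.HashMap[String, mutable.HashSet[String]]

def addAliveHost(host: String, rack: Option[String]): Unit =
  rack.foreach(r => hostsByRack.getOrElseUpdate(r, new mutable.HashSet[String]) += host)

def removeAliveHost(host: String, rack: Option[String]): Unit =
  rack.foreach { r =>
    hostsByRack.get(r).foreach { hosts =>
      hosts -= host
      if (hosts.isEmpty) hostsByRack -= r // rack has no alive host left
    }
  }

def hasAliveHostOnRack(rack: String): Boolean = hostsByRack.contains(rack)
{code}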

> Make TaskScheduler track whether there's host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the tasks's 
> preferred location is available. Regarding RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack, and provides an 
> API for TaskSetManager to check if there's host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2364) ShuffledDStream run tasks only when dstream has partition items

2014-07-03 Thread guowei (JIRA)
guowei created SPARK-2364:
-

 Summary: ShuffledDStream run tasks only when dstream has partition 
items
 Key: SPARK-2364
 URL: https://issues.apache.org/jira/browse/SPARK-2364
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: guowei


ShuffledDStream runs tasks regardless of whether the dstream has any partition items.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers

2014-07-03 Thread Alexander Albul (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052083#comment-14052083
 ] 

Alexander Albul commented on SPARK-2346:


You're right, thanks.

> Register as table should not accept table names that start with numbers
> ---
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-07-03 Thread Mukul Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052080#comment-14052080
 ] 

Mukul Jain commented on SPARK-1378:
---

This still seems to be an issue:
Downloading: 
https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
Jul 3, 2014 6:22:27 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing request: 
Connection timed out
Jul 3, 2014 6:22:27 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: Retrying request
It is failing to download. I am running behind a corporate firewall; I'm not 
sure whether that has anything to do with it. My build got stuck exactly like 
this earlier in the process while trying to download the Scala compiler jar, but 
after a few attempts it was able to proceed and download the file. This looks 
like a repository issue.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2363) Clean MLlib's sample data files

2014-07-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2363:


 Summary: Clean MLlib's sample data files
 Key: SPARK-2363
 URL: https://issues.apache.org/jira/browse/SPARK-2363
 Project: Spark
  Issue Type: Task
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor


MLlib has sample data under several folders:

1) data/mllib
2) data/
3) mllib/data/*

Per previous discussion with [~matei], we want to put them under `data/mllib` 
and clean outdated files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2354) BitSet Range Expanded when creating new one

2014-07-03 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-2354:
--

Affects Version/s: 1.0.0

> BitSet Range Expanded when creating new one
> ---
>
> Key: SPARK-2354
> URL: https://issues.apache.org/jira/browse/SPARK-2354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Yijie Shen
>Priority: Minor
>
> BitSet has a constructor parameter "numBits: Int" that indicates how many bits 
> it holds.
> There is also a function called "capacity", which gives the number of long 
> words needed to hold those bits.
> When creating a new BitSet, for example in '|', I thought the newly created one 
> shouldn't be sized by the longer word count; instead, it should use the larger 
> set's number of bits:
> {code}def |(other: BitSet): BitSet = {
> val newBS = new BitSet(math.max(numBits, other.numBits)) 
> // I know by now the numBits isn't a field
> {code}
> Is there some other reason for expanding the BitSet range that I'm not aware of?
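
A standalone illustration of the size difference being asked about (plain 
arithmetic, not Spark's BitSet code): sizing by the larger word capacity rounds 
up to a multiple of 64 bits, which can exceed max(numBits, other.numBits).

{code}
// Number of 64-bit words needed to hold numBits bits.
def wordsFor(numBits: Int): Int = (numBits + 63) / 64

val numBits      = 100
val otherNumBits = 70
val byNumBits  = math.max(numBits, otherNumBits)                          // 100 bits
val byCapacity = math.max(wordsFor(numBits), wordsFor(otherNumBits)) * 64 // 128 bits
{code}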



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2362) newFilesOnly = true FileInputDStream processes existing files in a directory

2014-07-03 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052076#comment-14052076
 ] 

Tathagata Das commented on SPARK-2362:
--

https://github.com/apache/spark/pull/1077

> newFilesOnly = true FileInputDStream processes existing files in a directory
> 
>
> Key: SPARK-2362
> URL: https://issues.apache.org/jira/browse/SPARK-2362
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2325) Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly

2014-07-03 Thread YanTang Zhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052075#comment-14052075
 ] 

YanTang Zhai commented on SPARK-2325:
-

I've created a PR: https://github.com/apache/spark/pull/1281. Please help review 
it. Thanks.

> Utils.getLocalDir had better check the directory and choose a good one 
> instead of choosing the first one directly
> -
>
> Key: SPARK-2325
> URL: https://issues.apache.org/jira/browse/SPARK-2325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>
> If the first directory of spark.local.dir is bad, the application will exit 
> with the following exception:
> Exception in thread "main" java.io.IOException: Failed to create a temp 
> directory (under /data1/sparkenv/local) after 10 attempts!
> at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
> at 
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
> at 
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
> at 
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
> at 
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
> at 
> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
> at org.apache.spark.SparkContext.(SparkContext.scala:202)
> at JobTaskJoin$.main(JobTaskJoin.scala:9)
> at JobTaskJoin.main(JobTaskJoin.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Utils.getLocalDir should check the directories and choose a good one instead 
> of picking the first one directly. For example, if spark.local.dir is 
> /data1/sparkenv/local,/data2/sparkenv/local and the disk data1 is bad while 
> data2 is good, we could choose data2 rather than data1.
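
A sketch of the proposed behaviour (an illustrative helper, not the code in PR 
1281): scan the spark.local.dir entries and return the first one that exists (or 
can be created) and is writable, instead of always taking the first entry.

{code}
import java.io.File

def chooseLocalDir(localDirSetting: String): Option[String] =
  localDirSetting.split(",").map(_.trim).find { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }

// e.g. chooseLocalDir("/data1/sparkenv/local,/data2/sparkenv/local") would skip
// /data1/sparkenv/local if that disk is bad and return Some("/data2/sparkenv/local").
{code}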



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2362) newFilesOnly = true FileInputDStream processes existing files in a directory

2014-07-03 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-2362:


 Summary: newFilesOnly = true FileInputDStream processes existing 
files in a directory
 Key: SPARK-2362
 URL: https://issues.apache.org/jira/browse/SPARK-2362
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL

2014-07-03 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052068#comment-14052068
 ] 

Kan Zhang commented on SPARK-2010:
--

Sounds reasonable. A named tuple is a better fit than a dictionary for the struct 
type. Presumably it was the lack of pickling support for named tuples that made 
us resort to dictionaries for the Python schema definition. Nested dictionaries, 
however, should be treated as the map type.

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2014-07-03 Thread Bert Greevenbosch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bert Greevenbosch updated SPARK-2352:
-

Summary: [MLLIB] Add Artificial Neural Network (ANN) to Spark  (was: Add 
Artificial Neural Network (ANN) to Spark)

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2361) Decide whether to broadcast or serialize the weights directly in MLlib algorithms

2014-07-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2361:


 Summary: Decide whether to broadcast or serialize the weights 
directly in MLlib algorithms
 Key: SPARK-2361
 URL: https://issues.apache.org/jira/browse/SPARK-2361
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng


In the current implementation, MLlib serializes the weights directly into the 
closure. This is okay for a small feature dimension, but it is not efficient once 
the dimension goes beyond roughly 1M, especially since the default akka.frameSize 
is 10 MB. We should use a broadcast variable when the serialized task is going to 
be large.
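
A sketch of the difference (illustrative code, not MLlib's actual optimizer): 
broadcast the weight vector once instead of capturing it in every task closure, 
which matters once the vector has millions of entries.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: sum of per-example least-squares gradients.
def gradientSumSketch(sc: SparkContext,
                      data: RDD[(Double, Array[Double])],
                      weights: Array[Double]): Array[Double] = {
  val bcWeights = sc.broadcast(weights)          // shipped to each executor once
  data.map { case (label, features) =>
    val margin = (features, bcWeights.value).zipped.map(_ * _).sum
    features.map(_ * (margin - label))           // gradient contribution
  }.reduce((a, b) => (a, b).zipped.map(_ + _))
}
{code}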



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2360) CSV import to SchemaRDDs

2014-07-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2360:
---

 Summary: CSV import to SchemaRDDs
 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL

2014-07-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052035#comment-14052035
 ] 

Michael Armbrust commented on SPARK-2010:
-

I think probably the right thing to do here is to use named tuples instead of 
dictionaries as the Python struct equivalent. Dictionaries can then be used for 
maps. One issue is that we will need to fix the pickling library used by PySpark, 
as it cannot serialize named tuples.

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052030#comment-14052030
 ] 

Xiangrui Meng commented on SPARK-2358:
--

PR: https://github.com/apache/spark/pull/1295

> Add an option to include native BLAS/LAPACK loader in the build
> ---
>
> Key: SPARK-2358
> URL: https://issues.apache.org/jira/browse/SPARK-2358
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be easy for users to include the netlib-java jniloader in the spark 
> jar, which is LGPL-licensed. We can follow the same approach as ganglia 
> support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at 
> build time. We can use "SPARK_NETLIB_LGPL" flag for this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2346) Register as table should not accept table names that start with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052020#comment-14052020
 ] 

Michael Armbrust edited comment on SPARK-2346 at 7/4/14 12:03 AM:
--

The goal of the sql method is to provide something that is close to SQL-92, 
which explicitly disallows identifiers that start with numbers (as this makes 
expressions like 2e2 kind of ambiguous).  I think your query will work if you 
run it using the hive parser, using the hql method, instead.


was (Author: marmbrus):
The goal of the SQL method is to provide something that is close to SQL-92, 
which explicitly disallows identifiers that start with numbers (as this makes 
expressions like 2e2 kind of ambiguous).  I think your query will work if you 
run it using the hive parser, using the hql method, instead.

> Register as table should not accept table names that start with numbers
> ---
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052020#comment-14052020
 ] 

Michael Armbrust commented on SPARK-2346:
-

The goal of the SQL method is to provide something that is close to SQL-92, 
which explicitly disallows identifiers that start with numbers (as this makes 
expressions like 2e2 kind of ambiguous).  I think your query will work if you 
run it using the hive parser, using the hql method, instead.
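
A sketch of that workaround (it assumes a Hive-enabled build; the Data case class 
mirrors the reproduction below, and backticks are used to quote the identifier 
for safety):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

case class Data(value: String)

object HqlWorkaround {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "sql")
    val hiveCtx = new HiveContext(sc)
    import hiveCtx._

    sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
    hql("SELECT * FROM `123_table`").collect().foreach(println)
  }
}
{code}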

> Register as table should not accept table names that start with numbers
> ---
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers

2014-07-03 Thread Alexander Albul (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052018#comment-14052018
 ] 

Alexander Albul commented on SPARK-2346:


Well, it depends actually.
The link that you sent just shows a limitation of the Postgres database, which 
means they do not have an optimal lexer.

On the other hand, Hive supports any kind of table name. I found this bug 
because I migrated from Shark to Spark SQL, and some of my tests that used 
tables starting with numbers started to fail.

> Register as table should not accept table names that start with numbers
> ---
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2346:


Labels: starter  (was: Parser SQL)

> Error parsing table names that starts with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2346:


Fix Version/s: 1.1.0

> Error parsing table names that starts with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Register as table should not accept table names that start with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2346:


Summary: Register as table should not accept table names that start with 
numbers  (was: Error parsing table names that starts with numbers)

> Register as table should not accept table names that start with numbers
> ---
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052004#comment-14052004
 ] 

Michael Armbrust commented on SPARK-2346:
-

Here is more info: 
http://stackoverflow.com/questions/15917064/table-or-column-name-cannot-start-with-numeric

> Error parsing table names that starts with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2346:


Priority: Minor  (was: Major)

> Error parsing table names that starts with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052002#comment-14052002
 ] 

Michael Armbrust commented on SPARK-2346:
-

I think this is actually a bug in the registerAsTable function.  It is not 
valid SQL to start a table name with a number AFAIK.

> Error parsing table names that starts with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>  Labels: Parser, SQL
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start from numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When i am changing from 123_table to table_123 problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-03 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051998#comment-14051998
 ] 

Tathagata Das commented on SPARK-1853:
--

If you look at what is shown in any Spark program's stage description, it shows 
the lines of "user code" that created that stage. For streaming, it shows the 
lines of internal code (Spark Streaming code) instead of the user code that 
created it. 

So in the case of this screenshot, it should show

4520 - take at Tutorial.scala:34
4521 - map at Tutorial.scala:XXX
...
4513 - reduceByKey at Tutorial.scala:YYY



> Show Streaming application code context (file, line number) in Spark Stages UI
> --
>
> Key: SPARK-1853
> URL: https://issues.apache.org/jira/browse/SPARK-1853
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>Assignee: Mubarak Seyed
> Fix For: 1.1.0
>
> Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png
>
>
> Right now, the code context (file, and line number) shown for streaming jobs 
> in stages UI is meaningless as it refers to internal DStream: 
> rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1853:
-

Assignee: Mubarak Seyed

> Show Streaming application code context (file, line number) in Spark Stages UI
> --
>
> Key: SPARK-1853
> URL: https://issues.apache.org/jira/browse/SPARK-1853
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>Assignee: Mubarak Seyed
> Fix For: 1.1.0
>
> Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png
>
>
> Right now, the code context (file, and line number) shown for streaming jobs 
> in stages UI is meaningless as it refers to internal DStream: 
> rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2349.
---

Resolution: Fixed

https://github.com/apache/spark/pull/1288

> Fix NPE in ExternalAppendOnlyMap
> 
>
> Key: SPARK-2349
> URL: https://issues.apache.org/jira/browse/SPARK-2349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> It throws an NPE on null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2359) Supporting common statistical functions in MLlib

2014-07-03 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2359:
-

Summary: Supporting common statistical functions in MLlib  (was: Supporting 
common statistical estimators in MLlib)

> Supporting common statistical functions in MLlib
> 
>
> Key: SPARK-2359
> URL: https://issues.apache.org/jira/browse/SPARK-2359
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Reynold Xin
>Assignee: Doris Xin
>
> This was originally proposed by [~falaki].
> This is a proposal for a new package within the Spark distribution to support 
> common statistical estimators. We think consolidating statistics-related 
> functions in a separate package will help with the readability of the core 
> source code and encourage Spark users to submit their functions back.
> Please see the initial design document here: 
> https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2359) Supporting common statistical estimators in MLlib

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2359:
-

Summary: Supporting common statistical estimators in MLlib  (was: Spark 
Stats package: supporting common statistical estimators for Big Data)

> Supporting common statistical estimators in MLlib
> -
>
> Key: SPARK-2359
> URL: https://issues.apache.org/jira/browse/SPARK-2359
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Reynold Xin
>Assignee: Doris Xin
>
> This was originally proposed by [~falaki].
> This is a proposal for a new package within the Spark distribution to support 
> common statistical estimators. We think consolidating statistics-related 
> functions in a separate package will help with the readability of the core 
> source code and encourage Spark users to submit their functions back.
> Please see the initial design document here: 
> https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data

2014-07-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2359:
--

 Summary: Spark Stats package: supporting common statistical 
estimators for Big Data
 Key: SPARK-2359
 URL: https://issues.apache.org/jira/browse/SPARK-2359
 Project: Spark
  Issue Type: New Feature
Reporter: Reynold Xin


This was originally proposed by [~falaki].

This is a proposal for a new package within the Spark distribution to support 
common statistical estimators. We think consolidating statistics-related 
functions in a separate package will help with the readability of the core 
source code and encourage Spark users to submit their functions back.

Please see the initial design document here: 
https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2359:
-

Component/s: MLlib

> Spark Stats package: supporting common statistical estimators for Big Data
> --
>
> Key: SPARK-2359
> URL: https://issues.apache.org/jira/browse/SPARK-2359
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Reynold Xin
>Assignee: Doris Xin
>
> This was originally proposed by [~falaki].
> This is a proposal for a new package within the Spark distribution to support 
> common statistical estimators. We think consolidating statistics-related 
> functions in a separate package will help with the readability of the core 
> source code and encourage Spark users to submit their functions back.
> Please see the initial design document here: 
> https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data

2014-07-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2359:
---

Assignee: Doris Xin

> Spark Stats package: supporting common statistical estimators for Big Data
> --
>
> Key: SPARK-2359
> URL: https://issues.apache.org/jira/browse/SPARK-2359
> Project: Spark
>  Issue Type: New Feature
>Reporter: Reynold Xin
>Assignee: Doris Xin
>
> This was originally proposed by [~falaki].
> This is a proposal for a new package within the Spark distribution to support 
> common statistical estimators. We think consolidating statistics-related 
> functions in a separate package will help with the readability of the core 
> source code and encourage Spark users to submit their functions back.
> Please see the initial design document here: 
> https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051977#comment-14051977
 ] 

Reynold Xin commented on SPARK-2017:


It is definitely a browser-side problem (but it affects all browsers!). That's why 
I think showing just the aggregated metrics by default, plus the list of tasks 
that failed, is probably a good idea. Thoughts?



> web ui stage page becomes unresponsive when the number of tasks is large
> 
>
> Key: SPARK-2017
> URL: https://issues.apache.org/jira/browse/SPARK-2017
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>  Labels: starter
>
> {code}
> sc.parallelize(1 to 1000000, 1000000).count()
> {code}
> The above code creates one million tasks to be executed. The stage detail web 
> ui page takes forever to load (if it ever completes).
> There are again a few different alternatives:
> 0. Limit the number of tasks we show.
> 1. Pagination
> 2. By default only show the aggregate metrics and failed tasks, and hide the 
> successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1516:
-

Fix Version/s: 0.9.2

> Yarn Client should not call System.exit, should throw exception instead.
> 
>
> Key: SPARK-1516
> URL: https://issues.apache.org/jira/browse/SPARK-1516
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: DB Tsai
> Fix For: 0.9.2, 1.0.1
>
>
> People submit Spark jobs to a YARN cluster from inside their own applications 
> using the Spark YARN client, and it's not desirable for the YARN client to call 
> System.exit, which will terminate the parent application as well.
> We should throw an exception instead, so people can determine which action they 
> want to take given the exception.
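A minimal sketch of the proposed behaviour (class and method names below are 
hypothetical, not the actual patch): the client reports failures by throwing, and 
the embedding application decides how to react:

{code}
// Hypothetical sketch: report failure by throwing instead of exiting the JVM.
class SparkYarnClientException(msg: String) extends Exception(msg)

def submitApplication(args: Array[String]): Unit = {
  if (args.isEmpty) {
    // previously something like: System.exit(1)
    throw new SparkYarnClientException("No application arguments given")
  }
  // ... build the ApplicationMaster request and submit it to YARN ...
}

// The embedding application chooses how to react to a failed submission.
try {
  submitApplication(Array.empty)
} catch {
  case e: SparkYarnClientException =>
    println(s"Spark submission failed, continuing without it: ${e.getMessage}")
}
{code}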



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051975#comment-14051975
 ] 

Xiangrui Meng commented on SPARK-1516:
--

PR for branch-0.9: https://github.com/apache/spark/pull/1099

> Yarn Client should not call System.exit, should throw exception instead.
> 
>
> Key: SPARK-1516
> URL: https://issues.apache.org/jira/browse/SPARK-1516
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: DB Tsai
> Fix For: 0.9.2, 1.0.1
>
>
> People submit Spark jobs to a YARN cluster from inside their own applications 
> using the Spark YARN client, and it's not desirable for the YARN client to call 
> System.exit, which will terminate the parent application as well.
> We should throw an exception instead, so people can determine which action they 
> want to take given the exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-03 Thread Mubarak Seyed (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051967#comment-14051967
 ] 

Mubarak Seyed commented on SPARK-1853:
--

Hi TD,

The Stages UI shows the _code context_ for both the application file and the 
internal DStream. Are you referring to removing the _internal DStream_ code 
context from the description?

!Screen Shot 2014-07-03 at 2.54.05 PM.png|width=300,height=500!

Thanks,
Mubarak

> Show Streaming application code context (file, line number) in Spark Stages UI
> --
>
> Key: SPARK-1853
> URL: https://issues.apache.org/jira/browse/SPARK-1853
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
> Fix For: 1.1.0
>
> Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png
>
>
> Right now, the code context (file and line number) shown for streaming jobs 
> in the Stages UI is meaningless, as it refers to internal DStream code 
> rather than the user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2109) Setting SPARK_MEM for bin/pyspark does not work.

2014-07-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2109.


Resolution: Fixed

Fixed in master and 1.0 via https://github.com/apache/spark/pull/1050/files

> Setting SPARK_MEM for bin/pyspark does not work. 
> -
>
> Key: SPARK-2109
> URL: https://issues.apache.org/jira/browse/SPARK-2109
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> prashant@sc:~/work/spark$ SPARK_MEM=10G bin/pyspark 
> Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
> [GCC 4.8.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "/home/prashant/work/spark/python/pyspark/shell.py", line 43, in <module>
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
>   File "/home/prashant/work/spark/python/pyspark/context.py", line 94, in 
> __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/home/prashant/work/spark/python/pyspark/context.py", line 190, in 
> _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "/home/prashant/work/spark/python/pyspark/java_gateway.py", line 51, 
> in launch_gateway
> gateway_port = int(proc.stdout.readline())
> ValueError: invalid literal for int() with base 10: 'Warning: SPARK_MEM is 
> deprecated, please use a more specific config option\n'



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-03 Thread Mubarak Seyed (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mubarak Seyed updated SPARK-1853:
-

Attachment: Screen Shot 2014-07-03 at 2.54.05 PM.png

> Show Streaming application code context (file, line number) in Spark Stages UI
> --
>
> Key: SPARK-1853
> URL: https://issues.apache.org/jira/browse/SPARK-1853
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
> Fix For: 1.1.0
>
> Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png
>
>
> Right now, the code context (file and line number) shown for streaming jobs 
> in the Stages UI is meaningless, as it refers to internal DStream code 
> rather than the user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051946#comment-14051946
 ] 

Xiangrui Meng commented on SPARK-2308:
--

Is there a reference paper/work about using uniform sampling in k-means? 
Usually in practice the clusters are not balanced. With uniform sampling, you 
may miss many points from a small cluster.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
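By way of illustration, a minimal single-machine sketch of one mini-batch update 
with the commonly used per-center learning rate (all names hypothetical; the real 
MLlib version would build on the existing KMeans/KMeans|| code and RDDs):

{code}
import scala.util.Random

object MiniBatchKMeansSketch {
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One mini-batch step: sample a batch, assign each sampled point to its
  // nearest center, then move that center toward the point with a step size
  // that shrinks as the center accumulates more assignments.
  def miniBatchStep(
      data: IndexedSeq[Array[Double]],
      centers: Array[Array[Double]],
      counts: Array[Long],
      batchSize: Int,
      rng: Random): Unit = {
    val batch = IndexedSeq.fill(batchSize)(data(rng.nextInt(data.size)))
    batch.foreach { x =>
      val j = centers.indices.minBy(i => squaredDistance(x, centers(i)))
      counts(j) += 1
      val eta = 1.0 / counts(j) // per-center learning rate
      centers(j) = centers(j).zip(x).map { case (c, xi) => (1 - eta) * c + eta * xi }
    }
  }
}
{code}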



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-2358:


Assignee: Xiangrui Meng

> Add an option to include native BLAS/LAPACK loader in the build
> ---
>
> Key: SPARK-2358
> URL: https://issues.apache.org/jira/browse/SPARK-2358
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be easy for users to include the netlib-java jniloader in the spark 
> jar, which is LGPL-licensed. We can follow the same approach as ganglia 
> support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at 
> build time. We can use "SPARK_NETLIB_LGPL" flag for this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2353) ArrayIndexOutOfBoundsException in scheduler

2014-07-03 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-2353:
---

Description: 
I suspect the recent changes from SPARK-1937 to compute valid locality levels 
(and ignore those which are not applicable) have resulted in this issue.
Specifically, some of the code using currentLocalityIndex (and lastLaunchTime 
actually) seems to be assuming 
a) a constant population of locality levels, and
b) probably also immutability/repeatability of locality levels.

These do not hold any longer.
I do not have the exact values for which this failure was observed (since this 
is from the logs of a failed job) - but the code path is suspect.

Also note that the line numbers/classes might not exactly match master since we 
are in the middle of a merge. But the issue should hopefully be evident.

java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439)
at 
org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:241)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:133)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:86)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Unfortunately, we do not have the bandwidth to tackle this issue - it would be 
great if someone could take a look at it! Thanks.

  was:

I suspect the recent changes from SPARK-1937 to compute valid locality levels 
(and ignore those which are not applicable) have resulted in this issue.
Specifically, some of the code using currentLocalityIndex (and lastLaunchTime 
actually) seems to be assuming 
a) a constant population of locality levels, and
b) probably also immutability/repeatability of locality levels.

These do not hold any longer.
I do not have the exact values for which this failure was observed (since this 
is from the logs of a failed job) - but the code path is highly suspect.

Also note that the line numbers/classes might not exactly match master since we 
are in the middle of a merge. But the issue should hopefully be evident.

java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439)
at 
org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241)
at 
scala.collection.Indexe

[jira] [Created] (SPARK-2358) Add an option to include native BLAS/LAPACK libraries in the build

2014-07-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2358:


 Summary: Add an option to include native BLAS/LAPACK libraries in 
the build
 Key: SPARK-2358
 URL: https://issues.apache.org/jira/browse/SPARK-2358
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


It would be easy for users to include the netlib-java jniloader in the spark 
jar, which is LGPL-licensed. We can follow the same approach as ganglia support 
in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at build time. We 
can use "SPARK_NETLIB_LGPL" flag for this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2358:
-

Summary: Add an option to include native BLAS/LAPACK loader in the build  
(was: Add an option to include native BLAS/LAPACK libraries in the build)

> Add an option to include native BLAS/LAPACK loader in the build
> ---
>
> Key: SPARK-2358
> URL: https://issues.apache.org/jira/browse/SPARK-2358
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> It would be easy for users to include the netlib-java jniloader in the spark 
> jar, which is LGPL-licensed. We can follow the same approach as ganglia 
> support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at 
> build time. We can use "SPARK_NETLIB_LGPL" flag for this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2357) HashFilteredJoin doesn't match some equi-join query

2014-07-03 Thread Zongheng Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zongheng Yang resolved SPARK-2357.
--

Resolution: Not a Problem

> HashFilteredJoin doesn't match some equi-join query
> ---
>
> Key: SPARK-2357
> URL: https://issues.apache.org/jira/browse/SPARK-2357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Zongheng Yang
>Priority: Minor
>
> For instance, this query:
> hql("""SELECT * FROM src a JOIN src b ON a.key = 238""") 
> is a case where the HashFilteredJoin pattern doesn't match.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2342.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.0.1, 1.1.0
>
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended  to do computations for Numeric add/Minus/Multipy.
> Just as the comment suggest : {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in code, function f was casted to function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I thought it as a typo and the correct should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051882#comment-14051882
 ] 

Xiangrui Meng commented on SPARK-1997:
--

[~gq] Could you help test the following?

1) dependencies changes in breeze 0.8.1 and their license, including libraries 
added and removed
2) number of files in the breeze 0.8.1 jar

> Update breeze to version 0.8.1
> --
>
> Key: SPARK-1997
> URL: https://issues.apache.org/jira/browse/SPARK-1997
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> {{breeze 0.7}} does not support {{scala 2.11}} .



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2357) HashFilteredJoin doesn't match some equi-join query

2014-07-03 Thread Zongheng Yang (JIRA)
Zongheng Yang created SPARK-2357:


 Summary: HashFilteredJoin doesn't match some equi-join query
 Key: SPARK-2357
 URL: https://issues.apache.org/jira/browse/SPARK-2357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Zongheng Yang
Priority: Minor


For instance, this query:

hql("""SELECT * FROM src a JOIN src b ON a.key = 238""") 

is a case where the HashFilteredJoin pattern doesn't match.
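For contrast, a query whose join condition equates columns from both relations 
(the shape an equi-join pattern is meant to match) would look like:

{code}
hql("""SELECT * FROM src a JOIN src b ON a.key = b.key""")
{code}

In the query above, "a.key = 238" only constrains one side of the join, so it is 
really a filter rather than an equi-join key, which is presumably why this was 
later resolved as Not a Problem.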



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1675.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1171
[https://github.com/apache/spark/pull/1171]

> Make clear whether computePrincipalComponents requires centered data
> 
>
> Key: SPARK-1675
> URL: https://issues.apache.org/jira/browse/SPARK-1675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Trivial
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2355) Check for the number of clusters to avoid ArrayIndexOutOfBoundsException

2014-07-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2355.
--

Resolution: Duplicate

> Check for the number of clusters to avoid ArrayIndexOutOfBoundsException
> 
>
> Key: SPARK-2355
> URL: https://issues.apache.org/jira/browse/SPARK-2355
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Liang-Chi Hsieh
>
> When the number of clusters requested from 
> org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode 
> is greater than the number of data points, it will throw an 
> ArrayIndexOutOfBoundsException.
> The KMeans class should check that the number of clusters is not greater 
> than the number of data points.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:62)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
> at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:297)
> at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:294)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.Range.foreach(Range.scala:141)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:294)
> at 
> org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
> at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
> at 
> org.apache.spark.examples.mllib.DenseKMeans$.run(DenseKMeans.scala:102)
> at 
> org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:72)
> at 
> org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:71)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.examples.mllib.DenseKMeans$.main(DenseKMeans.scala:71)
> at org.apache.spark.examples.mllib.DenseKMeans.main(DenseKMeans.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2330) Spark shell has weird scala semantics

2014-07-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2330.
-

Resolution: Duplicate

Going to close this as a duplicate.  We should have a fix for the original 
issue soon.

> Spark shell has weird scala semantics
> -
>
> Key: SPARK-2330
> URL: https://issues.apache.org/jira/browse/SPARK-2330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1, 1.0.0
> Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2
>Reporter: Andrea Ferretti
>  Labels: scala, shell
>
> Normal scala expressions are interpreted in a strange way in the spark shell. 
> For instance
> {noformat}
> case class Foo(x: Int)
> def print(f: Foo) = f.x
> val f = Foo(3)
> print(f)
> :24: error: type mismatch;
>  found   : Foo
>  required: Foo
> {noformat}
> For another example
> {noformat}
> trait Currency
> case object EUR extends Currency
> case object USD extends Currency
> def nextCurrency: Currency = nextInt(2) match {
>   case 0 => EUR
>   case _ => USD
> }
> :22: error: type mismatch;
>  found   : EUR.type
>  required: Currency
>  case 0 => EUR
> :24: error: type mismatch;
>  found   : USD.type
>  required: Currency
>  case _ => USD
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-07-03 Thread Kostiantyn Kudriavtsev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostiantyn Kudriavtsev updated SPARK-2356:
--

Summary: Exception: Could not locate executable null\bin\winutils.exe in 
the Hadoop   (was: Exaption: Could not locate executable null\bin\winutils.exe 
in the Hadoop )

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>
> I'm trying to run some transformations on Spark. They work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) from a unit test, I get errors (I don't use Hadoop; I read the file 
> from the local filesystem):
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> This happens because the Hadoop config is initialised every time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or allowing this configuration to be started manually).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2356) Exaption: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-07-03 Thread Kostiantyn Kudriavtsev (JIRA)
Kostiantyn Kudriavtsev created SPARK-2356:
-

 Summary: Exaption: Could not locate executable 
null\bin\winutils.exe in the Hadoop 
 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev


I'm trying to run some transformations on Spark. They work fine on a cluster (YARN, 
Linux machines). However, when I try to run them on a local machine (Windows 
7) from a unit test, I get errors (I don't use Hadoop; I read the file from the 
local filesystem):
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.(Shell.java:326)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.(Groups.java:77)
at 
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at 
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at 
org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
at 
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
at 
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.(SparkContext.scala:228)
at org.apache.spark.SparkContext.(SparkContext.scala:97)

This happens because the Hadoop config is initialised every time a SparkContext 
is created, regardless of whether Hadoop is required or not.

I propose adding a special flag to indicate whether the Hadoop config is required 
(or allowing this configuration to be started manually).
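As a stop-gap for local tests on Windows (independent of the proposed flag), 
Hadoop's Shell utility looks for winutils.exe under the "hadoop.home.dir" system 
property (falling back to HADOOP_HOME), so pointing it at a directory that 
contains bin\winutils.exe before the SparkContext is created avoids the error. 
A sketch, with an example path:

{code}
// Hypothetical workaround for Windows unit tests: tell Hadoop's Shell where
// bin\winutils.exe lives before any SparkContext is created.
System.setProperty("hadoop.home.dir", "C:\\hadoop") // example path only

val sc = new org.apache.spark.SparkContext("local[2]", "unit-test")
{code}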



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2324) SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error

2014-07-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2324.
---

Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/1274

> SparkContext should not exit directly when spark.local.dir is a list of 
> multiple paths and one of them has error
> 
>
> Key: SPARK-2324
> URL: https://issues.apache.org/jira/browse/SPARK-2324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>
> The spark.local.dir is configured as a list of multiple paths, for example 
> /data1/sparkenv/local,/data2/sparkenv/local. If the data2 disk of the driver 
> node has an error, the application will exit since DiskBlockManager exits 
> directly in createLocalDirs. If the data2 disk of a worker node has an error, 
> the executor will exit as well.
> DiskBlockManager should not exit directly in createLocalDirs if one of the 
> spark.local.dir paths has an error. Since spark.local.dir has multiple paths, a 
> problem with one of them should not affect the overall situation.
> I think DiskBlockManager could ignore the bad directory at createLocalDirs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2355) Check for the number of clusters to avoid ArrayIndexOutOfBoundsException

2014-07-03 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-2355:
--

 Summary: Check for the number of clusters to avoid 
ArrayIndexOutOfBoundsException
 Key: SPARK-2355
 URL: https://issues.apache.org/jira/browse/SPARK-2355
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Liang-Chi Hsieh


When the number of clusters requested from 
org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode 
is greater than the number of data points, it will throw an 
ArrayIndexOutOfBoundsException.

The KMeans class should check that the number of clusters is not greater than 
the number of data points.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:62)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:297)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:294)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:294)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at 
org.apache.spark.examples.mllib.DenseKMeans$.run(DenseKMeans.scala:102)
at 
org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:72)
at 
org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:71)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.examples.mllib.DenseKMeans$.main(DenseKMeans.scala:71)
at org.apache.spark.examples.mllib.DenseKMeans.main(DenseKMeans.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
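A minimal sketch of the kind of guard the report asks for (illustrative only, not 
the actual patch):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Fail fast with a clear message instead of hitting an
// ArrayIndexOutOfBoundsException deep inside the parallel initialization.
def checkNumClusters(data: RDD[Vector], k: Int): Unit = {
  val numPoints = data.count()
  require(k <= numPoints,
    s"Number of clusters ($k) must not exceed the number of data points ($numPoints)")
}
{code}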




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2354) BitSet Range Expanded when creating new one

2014-07-03 Thread Yijie Shen (JIRA)
Yijie Shen created SPARK-2354:
-

 Summary: BitSet Range Expanded when creating new one
 Key: SPARK-2354
 URL: https://issues.apache.org/jira/browse/SPARK-2354
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Yijie Shen
Priority: Minor


BitSet has a constructor parameter named "numBits: Int" that indicates the number 
of bits it holds.
There is also a function called "capacity" which gives the number of long words 
used to hold those bits.

When creating a new BitSet, for example in '|', I think the newly created one 
shouldn't be sized by the longer word count; instead, it should use the larger 
of the two sets' bit counts:
{code}def |(other: BitSet): BitSet = {
  val newBS = new BitSet(math.max(numBits, other.numBits))
  // I know that numBits isn't currently a field
{code}

Is there any other reason to expand the BitSet range that I'm not aware of?
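To make the distinction concrete, bits are stored in 64-bit words, so sizing the 
new set by word count rather than bit count expands its range (illustrative 
arithmetic only):

{code}
val numBits = 65
val capacityWords = (numBits + 63) / 64      // 2 words are needed for 65 bits
val rangeIfSizedByWords = capacityWords * 64 // 128 bit positions, not 65
{code}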



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-07-03 Thread Alex (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex updated SPARK-2344:


  Description: 
I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.

FCM is very similar to K-Means, which is already implemented; they differ 
only in the degree of relationship each point has with each cluster 
(in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).

As part of the implementation I would like to:
- create a base class for K-Means and FCM
- implement the relationship for each algorithm differently (in its own class)



I'd like this to be assigned to me.

  was:
I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.

FCM is very similar to K-Means, which is already implemented; they differ 
only in the degree of relationship each point has with each cluster 
(in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).

As part of the implementation I would like to:
- create a base class for K-Means and FCM
- implement the relationship for each algorithm differently (in its own class)

 Priority: Minor  (was: Major)
Affects Version/s: (was: 1.0.0)

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ 
> only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.
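For readers unfamiliar with FCM, a sketch of the soft membership computation that 
replaces K-Means' hard 0/1 assignment (the standard formula with fuzzifier m > 1; 
plain local code, not a distributed implementation):

{code}
// Degree of membership of point x in cluster j, given all cluster centers:
//   u_j(x) = 1 / sum_k (d(x, c_j) / d(x, c_k)) ^ (2 / (m - 1))
// Assumes x does not coincide exactly with any center (all distances > 0).
def membership(x: Array[Double], centers: Array[Array[Double]], j: Int, m: Double): Double = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)
  val dj = dist(x, centers(j))
  1.0 / centers.map(ck => math.pow(dj / dist(x, ck), 2.0 / (m - 1.0))).sum
}
{code}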



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-03 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051575#comment-14051575
 ] 

Mridul Muralidharan commented on SPARK-2277:


I have not rechecked the code, but the way I originally wrote it was:

a) Task preference is decoupled from the availability of the node.
For example, we need not have an executor on a host for which a block has a host 
preference (e.g. DFS blocks on a shared cluster).
Also note that a block might have one or more preferred locations.

b) We look up the rack for the preferred location to get the preferred rack.
As with (a), there need not be an executor on that rack. This is just the rack 
preference.


c) At schedule time, for an executor, we look up the host/rack of the executor's 
location and decide appropriately based on that.



In this context, I think your requirement is already handled.
Even if we don't have any hosts alive on a rack, those tasks would still be 
listed with a rack-local preference in the TaskSetManager.
When an executor comes in (existing or new), we check that executor's rack 
against the task preference, and it would then be marked rack-local.

> Make TaskScheduler track whether there's host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the tasks's 
> preferred location is available. Regarding RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack, and provides an 
> API for TaskSetManager to check if there's host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050721#comment-14050721
 ] 

Yin Huai edited comment on SPARK-2339 at 7/3/14 1:46 PM:
-

Also, the names of registered tables are case-sensitive, but the names of Hive 
tables are case-insensitive. This may cause confusion when a user is using 
HiveContext. I guess we want to keep registered tables case-sensitive. I will 
add docs to registerAsTable and registerRDDAsTable.


was (Author: yhuai):
Also, the names of registered tables are case sensitive, but the names of Hive 
tables are case insensitive. This will cause confusion when a user is using 
HiveContext. I think it may be good to treat all identifiers as case-insensitive 
when a user is using HiveContext and make HiveContext.sql an alias of 
HiveContext.hql (basically, do not expose Catalyst's SQLParser in HiveContext).

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Yin Huai
> Fix For: 1.1.0
>
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case regardless of whether the parser is case sensitive.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics

2014-07-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051298#comment-14051298
 ] 

Sean Owen commented on SPARK-2330:
--

Hm. Yes I have the same error now. I don't know why I didn't before as I am 
pretty sure I just copied and pasted this code exactly as given.

> Spark shell has weird scala semantics
> -
>
> Key: SPARK-2330
> URL: https://issues.apache.org/jira/browse/SPARK-2330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1, 1.0.0
> Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2
>Reporter: Andrea Ferretti
>  Labels: scala, shell
>
> Normal scala expressions are interpreted in a strange way in the spark shell. 
> For instance
> {noformat}
> case class Foo(x: Int)
> def print(f: Foo) = f.x
> val f = Foo(3)
> print(f)
> :24: error: type mismatch;
>  found   : Foo
>  required: Foo
> {noformat}
> For another example
> {noformat}
> trait Currency
> case object EUR extends Currency
> case object USD extends Currency
> def nextCurrency: Currency = nextInt(2) match {
>   case 0 => EUR
>   case _ => USD
> }
> :22: error: type mismatch;
>  found   : EUR.type
>  required: Currency
>  case 0 => EUR
> :24: error: type mismatch;
>  found   : USD.type
>  required: Currency
>  case _ => USD
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-07-03 Thread Andrea Ferretti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051282#comment-14051282
 ] 

Andrea Ferretti commented on SPARK-1199:


More examples on https://issues.apache.org/jira/browse/SPARK-2330 which should 
also be a duplicate

> Type mismatch in Spark shell when using case class defined in shell
> ---
>
> Key: SPARK-1199
> URL: https://issues.apache.org/jira/browse/SPARK-1199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Andrew Kerr
>Assignee: Prashant Sharma
>Priority: Blocker
>
> Define a class in the shell:
> {code}
> case class TestClass(a:String)
> {code}
> and an RDD
> {code}
> val data = sc.parallelize(Seq("a")).map(TestClass(_))
> {code}
> define a function on it and map over the RDD
> {code}
> def itemFunc(a:TestClass):TestClass = a
> data.map(itemFunc)
> {code}
> Error:
> {code}
> :19: error: type mismatch;
>  found   : TestClass => TestClass
>  required: TestClass => ?
>   data.map(itemFunc)
> {code}
> Similarly with a mapPartitions:
> {code}
> def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a
> data.mapPartitions(partitionFunc)
> {code}
> {code}
> :19: error: type mismatch;
>  found   : Iterator[TestClass] => Iterator[TestClass]
>  required: Iterator[TestClass] => Iterator[?]
> Error occurred in an application involving default arguments.
>   data.mapPartitions(partitionFunc)
> {code}
> The behavior is the same whether in local mode or on a cluster.
> This isn't specific to RDDs. A Scala collection in the Spark shell has the 
> same problem.
> {code}
> scala> Seq(TestClass("foo")).map(itemFunc)
> :15: error: type mismatch;
>  found   : TestClass => TestClass
>  required: TestClass => ?
>   Seq(TestClass("foo")).map(itemFunc)
> ^
> {code}
> When run in the Scala console (not the Spark shell) there are no type 
> mismatch errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics

2014-07-03 Thread Andrea Ferretti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051280#comment-14051280
 ] 

Andrea Ferretti commented on SPARK-2330:


I have tried this by pulling from git and I still have the same issues. I am 
not sure why you cannot reproduce it. What do you get when you open a shell and 
paste the lines above?

In any case, it does seem to be a duplicate of the issue you linked.

> Spark shell has weird scala semantics
> -
>
> Key: SPARK-2330
> URL: https://issues.apache.org/jira/browse/SPARK-2330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1, 1.0.0
> Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2
>Reporter: Andrea Ferretti
>  Labels: scala, shell
>
> Normal scala expressions are interpreted in a strange way in the spark shell. 
> For instance
> {noformat}
> case class Foo(x: Int)
> def print(f: Foo) = f.x
> val f = Foo(3)
> print(f)
> :24: error: type mismatch;
>  found   : Foo
>  required: Foo
> {noformat}
> For another example
> {noformat}
> trait Currency
> case object EUR extends Currency
> case object USD extends Currency
> def nextCurrency: Currency = nextInt(2) match {
>   case 0 => EUR
>   case _ => USD
> }
> :22: error: type mismatch;
>  found   : EUR.type
>  required: Currency
>  case 0 => EUR
> :24: error: type mismatch;
>  found   : USD.type
>  required: Currency
>  case _ => USD
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194
 ] 

Sean Owen commented on SPARK-2341:
--

[~mengxr] For regression, rather than further overloading "multiclass" to mean 
"regression", how about modifying the argument to take on three values (as an 
enum, string, etc.) to distinguish the three modes? The current method would 
stay, but be deprecated.

multiclass=false is for binary classification. libsvm uses "0" and "1" (or any 
ints) for binary classification. But this parses it as a real number, and 
rounds to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic 
overload when you want to transform a continuous value to a 0/1 indicator, but 
is that implied by libsvm format or just a transformation the caller should 
make? multiclass=true treats libsvm integer labels as doubles, but not 
continuous values. It seems like inviting more confusion to have this mode also 
double as the mode that parses continuous-valued labels as continuous 
values.

libsvm is widely used but it's old; I don't think its file format from long 
ago should necessarily inform API design now. There are other serializations 
besides libsvm (plain CSV for instance) and other algorithms (random decision 
forests).

You can make utilities to convert classes to numbers for benefit of the 
implementation on the front, and I'll have to in order to use this. Maybe we 
can start there -- at least if a utility is in the project people aren't all 
reinventing this in order to use an SVM with actual labels. The caller carries 
around a dictionary then to do the reverse mapping. The model seems like the 
place to hold that info, if in fact internally it converts classes to some 
other representation. Maybe the need would be clearer once the utility is 
created.

As you say I'm concerned that the API is already locked down early and some of 
these changes are going to be viewed as infeasible just for that reason.

> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
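A minimal sketch of the class-to-index utility discussed in the comment above 
(hypothetical names; it returns the dictionary so the caller can reverse the 
mapping later):

{code}
import org.apache.spark.rdd.RDD

// Map arbitrary string class labels to double-valued indices.
def indexLabels(labels: RDD[String]): (RDD[Double], Map[String, Double]) = {
  val dict = labels.distinct().collect().sorted.zipWithIndex
    .map { case (label, idx) => (label, idx.toDouble) }
    .toMap
  val dictBc = labels.sparkContext.broadcast(dict)
  (labels.map(l => dictBc.value(l)), dict)
}
{code}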



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2353) ArrayIndexOutOfBoundsException in scheduler

2014-07-03 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-2353:
--

 Summary: ArrayIndexOutOfBoundsException in scheduler
 Key: SPARK-2353
 URL: https://issues.apache.org/jira/browse/SPARK-2353
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Mridul Muralidharan
Priority: Blocker



I suspect the recent changes from SPARK-1937 to compute valid locality levels 
(and ignore those which are not applicable) have resulted in this issue.
Specifically, some of the code using currentLocalityIndex (and lastLaunchTime 
actually) seems to be assuming 
a) a constant population of locality levels, and
b) probably also immutability/repeatability of locality levels.

These do not hold any longer.
I do not have the exact values for which this failure was observed (since this 
is from the logs of a failed job) - but the code path is highly suspect.

Also note that the line numbers/classes might not exactly match master since we 
are in the middle of a merge. But the issue should hopefully be evident.

java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439)
at 
org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:241)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:133)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:86)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
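The suspected failure mode can be reproduced in miniature: an index remembered 
against one array is later applied to a shorter, recomputed array (a toy 
illustration, not the scheduler code itself):

{code}
// Toy illustration: the set of locality levels shrinks between computations,
// but an index into the old set is reused.
var levels = Array("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
val currentIndex = 2                   // valid for the array above
levels = Array("PROCESS_LOCAL", "ANY") // levels recomputed, now shorter
println(levels(currentIndex))          // java.lang.ArrayIndexOutOfBoundsException: 2
{code}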



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051151#comment-14051151
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] Instead of taking string labels directly, we can provide tools to 
convert them to integer labels (still Double typed). LIBLINEAR/LIBSVM do not 
support string labels either, but they are still among the top choices for 
logistic regression and SVM.

[~eustache] Unfortunately, the argument name in Scala is part of the API and 
loadLibSVMFile is not marked as experimental. So we cannot update the argument 
name to `multiclassOrRegression`, which is too long anyway. Could you update 
the doc and change the first sentence from "multiclass: whether the input 
labels contain more than two classes" to "multiclass: whether the input labels 
are continuous-valued (for regression) or contain more than two classes"? 

> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)