[jira] [Updated] (SPARK-6727) Model export/import for spark.ml: HashingTF

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6727:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: HashingTF
 ---

 Key: SPARK-6727
 URL: https://issues.apache.org/jira/browse/SPARK-6727
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6788) Model export/import for spark.ml: Tokenizer

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6788:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: Tokenizer
 ---

 Key: SPARK-6788
 URL: https://issues.apache.org/jira/browse/SPARK-6788
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6790) Model export/import for spark.ml: LinearRegression

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6790:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: LinearRegression
 --

 Key: SPARK-6790
 URL: https://issues.apache.org/jira/browse/SPARK-6790
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6789) Model export/import for spark.ml: ALS

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6789:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: ALS
 -

 Key: SPARK-6789
 URL: https://issues.apache.org/jira/browse/SPARK-6789
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: meta-algorithms

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6791:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: meta-algorithms
 -

 Key: SPARK-6791
 URL: https://issues.apache.org/jira/browse/SPARK-6791
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Algorithms: Pipeline, CrossValidator (and associated models)
 This task will block on all other subtasks for [SPARK-6725].  This task will 
 also include adding export/import as a required part of the PipelineStage 
 interface since meta-algorithms will depend on sub-algorithms supporting 
 save/load.
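 Illustratively, saving a Pipeline can only work by delegating to its stages, which is the motivation for the interface requirement; a minimal sketch, assuming a hypothetical save method on PipelineStage (names are not the actual API):
 {code}
 // Hypothetical sketch: a Pipeline can only be saved by delegating to its stages,
 // which is why save/load must become part of the PipelineStage contract first.
 import org.apache.spark.SparkContext

 trait StageSaveable {                      // assumed addition to PipelineStage
   def save(sc: SparkContext, path: String): Unit
 }

 class PipelineSaver(stages: Seq[StageSaveable]) {
   def save(sc: SparkContext, path: String): Unit =
     stages.zipWithIndex.foreach { case (stage, i) =>
       stage.save(sc, s"$path/stages/$i")   // each sub-stage writes its own directory
     }
 }
 {code}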



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6725:
-
Description: 
This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface which forces 
future additions to support save/load.

*UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  Other 
libraries and formats can support this, and it would be great if we could too.  
We could do either of the following:
* save() optionally takes a dataset (or schema), and load will return a (model, 
schema) pair.
* Models themselves save the input schema.

Both options would mean inheriting from new Saveable, Loadable types.
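An illustrative sketch of option 2, with hypothetical trait and method names (not an agreed API):
{code}
// Hypothetical sketch of option 2: the fitted model itself records the input
// schema it was trained on, so load() can return it without a separate file.
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType

trait MLSaveable {
  /** Schema of the training input, captured at fit time (None if not recorded). */
  def inputSchema: Option[StructType]
  /** Save params, model data, and the recorded schema under `path` (Parquet-based). */
  def save(sc: SparkContext, path: String): Unit
}

trait MLLoadable[M] {
  /** Load the saved stage; any stored input schema comes back with the model. */
  def load(sc: SparkContext, path: String): M
}
{code}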


  was:
This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface which forces 
future additions to support save/load.



 Model export/import for Pipeline API
 

 Key: SPARK-6725
 URL: https://issues.apache.org/jira/browse/SPARK-6725
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 This is an umbrella JIRA for adding model export/import to the spark.ml API.  
 This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
 format, not for other formats like PMML.
 This will require the following steps:
 * Add export/import for all PipelineStages supported by spark.ml
 ** This will include some Transformers which are not Models.
 ** These can use almost the same format as the spark.mllib model save/load 
 functions, but the model metadata must store a different class name (marking 
 the class as a spark.ml class).
 * After all PipelineStages support save/load, add an interface which forces 
 future additions to support save/load.
 *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
 Other libraries and formats can support this, and it would be great if we 
 could too.  We could do either of the following:
 * save() optionally takes a dataset (or schema), and load will return a 
 (model, schema) pair.
 * Models themselves save the input schema.
 Both options would mean inheriting from new Saveable, Loadable types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6725:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for Pipeline API
 

 Key: SPARK-6725
 URL: https://issues.apache.org/jira/browse/SPARK-6725
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 This is an umbrella JIRA for adding model export/import to the spark.ml API.  
 This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
 format, not for other formats like PMML.
 This will require the following steps:
 * Add export/import for all PipelineStages supported by spark.ml
 ** This will include some Transformers which are not Models.
 ** These can use almost the same format as the spark.mllib model save/load 
 functions, but the model metadata must store a different class name (marking 
 the class as a spark.ml class).
 * After all PipelineStages support save/load, add an interface which forces 
 future additions to support save/load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message

2015-04-20 Thread Tom Hubregtsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503730#comment-14503730
 ] 

Tom Hubregtsen edited comment on SPARK-7002 at 4/20/15 9:46 PM:


Your speculation was correct:

After the above computation, I performed the next extra steps:

I first tried to remove the data from rdd3, unpersisting it
{code}
scala> rdd3.unpersist() 
scala> rdd3.collect()
{code}
--> This did not work, rdd2 was still not on the disk

I then looked in the file system and found shuffle data. I removed these 
manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked 
the action on the child
{code}
scala> rdd3.collect()
{code}
--> This worked, rdd2 appeared on disk

Next to this, I also looked if a different action that could not rely on these 
shuffle files would invoke computation on rdd2 (as per your suggestion; FYI, I 
performed these two experiments separately from each other so that they don't 
influence each other):
{code}
scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y)
scala> rdd4.collect()
{code}
--> This worked too, rdd2 appeared on disk again

Conclusion: Rdd2 was actually not recomputed, as rdd3 was using the shuffle 
data that was stored on disk. 

Action: Should we still do something about the message in .toDebugString? It 
currently mentions when data is persisted on either disk or memory, but does 
not mention that it is saving the shuffle data. I do believe this is something 
you want to know. I at least called this method with the intention to know 
where in my DAG data is actually present, and got to believe data was not 
present, while in fact it was.


was (Author: thubregtsen):
Your speculation was correct:

After the above computation, I performed the next extra steps:

I first tried to remove the data from rdd3, unpersisting it
scala> rdd3.unpersist() 
scala> rdd3.collect()
--> This did not work, rdd2 was still not on the disk

I then looked in the file system and found shuffle data. I removed these 
manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked 
the action on the child
scala> rdd3.collect()
--> This worked, rdd2 appeared on disk

Next to this, I also looked if a different action that could not rely on these 
shuffle files would invoke computation on rdd2 (as per your suggestion; FYI, I 
performed these two experiments separately from each other so that they don't 
influence each other):
scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y)
scala> rdd4.collect()
--> This worked too, rdd2 appeared on disk again

Conclusion: Rdd2 was actually not recomputed, as rdd3 was using the shuffle 
data that was stored on disk. 

Action: Should we still do something about the message in .toDebugString? It 
currently mentions when data is persisted on either disk or memory, but does 
not mention that it is saving the shuffle data. I do believe this is something 
you want to know. I at least called this method with the intention to know 
where in my DAG data is actually present, and got to believe data was not 
present, while in fact it was.

 Persist on RDD fails the second time if the action is called on a child RDD 
 without showing a FAILED message
 

 Key: SPARK-7002
 URL: https://issues.apache.org/jira/browse/SPARK-7002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: Platform: Power8
 OS: Ubuntu 14.10
 Java: java-8-openjdk-ppc64el
Reporter: Tom Hubregtsen
Priority: Minor
  Labels: disk, persist, unpersist

 The major issue is: Persist on RDD fails the second time if the action is 
 called on a child RDD without showing a FAILED message. This is pointed out 
 at 2)
 next to this:
 toDebugString on a child RDD does not show that the parent RDD is [Disk 
 Serialized 1x Replicated]. This is pointed out at 1)
 Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is 
 not physically stored, as I did not want to solely rely on a missing line in 
 .toDebugString (see comments in trace)
 {code}
 scala> val rdd1 = sc.parallelize(List(1,2,3))
 scala> val rdd2 = rdd1.map(x => (x,x+1))
 scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y)
 scala> import org.apache.spark.storage.StorageLevel
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res4: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res5: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) 

[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message

2015-04-20 Thread Tom Hubregtsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503730#comment-14503730
 ] 

Tom Hubregtsen commented on SPARK-7002:
---

Your speculation was correct:

After the above computation, I performed the next extra steps:

I first tried to remove the data from rdd3, unpersisting it
scala> rdd3.unpersist() 
scala> rdd3.collect()
--> This did not work, rdd2 was still not on the disk

I then looked in the file system and found shuffle data. I removed these 
manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked 
the action on the child
scala> rdd3.collect()
--> This worked, rdd2 appeared on disk

Next to this, I also looked if a different action that could not rely on these 
shuffle files would invoke computation on rdd2 (as per your suggestion; FYI, I 
performed these two experiments separately from each other so that they don't 
influence each other):
scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y)
scala> rdd4.collect()
--> This worked too, rdd2 appeared on disk again

Conclusion: Rdd2 was actually not recomputed, as rdd3 was using the shuffle 
data that was stored on disk. 

Action: Should we still do something about the message in .toDebugString? It 
currently mentions when data is persisted on either disk or memory, but does 
not mention that it is saving the shuffle data. I do believe this is something 
you want to know. I at least called this method with the intention to know 
where in my DAG data is actually present, and got to believe data was not 
present, while in fact it was.

 Persist on RDD fails the second time if the action is called on a child RDD 
 without showing a FAILED message
 

 Key: SPARK-7002
 URL: https://issues.apache.org/jira/browse/SPARK-7002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: Platform: Power8
 OS: Ubuntu 14.10
 Java: java-8-openjdk-ppc64el
Reporter: Tom Hubregtsen
Priority: Minor
  Labels: disk, persist, unpersist

 The major issue is: Persist on RDD fails the second time if the action is 
 called on a child RDD without showing a FAILED message. This is pointed out 
 at 2)
 next to this:
 toDebugString on a child RDD does not show that the parent RDD is [Disk 
 Serialized 1x Replicated]. This is pointed out at 1)
 Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is 
 not physically stored, as I did not want to solely rely on a missing line in 
 .toDebugString (see comments in trace)
 {code}
 scala> val rdd1 = sc.parallelize(List(1,2,3))
 scala> val rdd2 = rdd1.map(x => (x,x+1))
 scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y)
 scala> import org.apache.spark.storage.StorageLevel
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res4: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res5: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // 1) rdd3 does not show that the other RDD's are [Disk Serialized 1x Replicated], 
 // but the data is on disk. This is verified by
 // a) The line starting with CachedPartitions
 // b) a find in spark_local_dir: find . -name \* | grep rdd returns 
 // ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, 
 // where * are the number of partitions
 scala> rdd2.unpersist()
 scala> rdd2.toDebugString
 res8: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 scala> rdd3.toDebugString
 res9: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // successfully unpersisted, also not visible on disk
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res18: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res19: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  

[jira] [Commented] (SPARK-6921) Spark SQL API saveAsParquetFile will output tachyon file with different block size

2015-04-20 Thread Sebastian YEPES FERNANDEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503921#comment-14503921
 ] 

Sebastian YEPES FERNANDEZ commented on SPARK-6921:
--

I can also validate this with v1.3.1

 Spark SQL API saveAsParquetFile will output tachyon file with different 
 block size
 

 Key: SPARK-6921
 URL: https://issues.apache.org/jira/browse/SPARK-6921
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: zhangxiongfei
Priority: Blocker

 I ran the code below in the Spark shell to access Parquet files in Tachyon.
 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
 val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
 2. Second, set fs.local.block.size to 256M to make sure that the block size of the 
 output files in Tachyon is 256M:
 sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
 3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
 ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
 After the above code ran successfully, the output Parquet files were stored in 
 Tachyon, but these files have different block sizes. Below is the information for 
 those files in the path 
 tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test:
   File Name          Size       Block Size  In-Memory  Pin  Creation Time
   _SUCCESS           0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
   _common_metadata   1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
   _metadata          22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
   part-r-1.parquet   177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
   part-r-2.parquet   177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
   part-r-3.parquet   177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
   part-r-4.parquet   177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
   part-r-5.parquet   177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
   part-r-6.parquet   177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
  It seems that the saveAsParquetFile API does not distribute/broadcast the Hadoop 
 configuration to the executors the way other APIs such as saveAsTextFile do. The 
 configuration fs.local.block.size only takes effect on the driver.
  If I set that configuration before loading the Parquet files, the problem is gone.
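 Consolidating the reported workaround into one snippet (a sketch: paths as given above; behavior as reported rather than re-verified):
 {code}
 // Workaround from the report: set the block size BEFORE loading, so the value
 // is in the Hadoop configuration that the Parquet write actually uses.
 sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)  // 256 MB

 val ta3 = sqlContext.parquetFile(
   "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")

 ta3.saveAsParquetFile(
   "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
 {code}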



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503957#comment-14503957
 ] 

Apache Spark commented on SPARK-7022:
-

User 'oefirouz' has created a pull request for this issue:
https://github.com/apache/spark/pull/5601
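
For context, a sketch of the existing Scala {{ParamGridBuilder}} that the PySpark version is expected to mirror (the estimator and parameter values below are illustrative):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Cross product of the listed values: 2 x 2 = 4 ParamMaps to feed a CrossValidator.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(10, 100))
  .build()
{code}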

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7022:
---

Assignee: Apache Spark

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz
Assignee: Apache Spark

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7022:
---

Assignee: (was: Apache Spark)

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504084#comment-14504084
 ] 

Joseph K. Bradley commented on SPARK-6635:
--

Just to clarify, does that mean {{withColumn}} does *not* replace columns, but 
{{withName}} does?  (I'm not sure what {{withName}} is.)
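
(As an aside, not from the thread: the proposed replace-on-add semantics can be emulated today with {{select}}; the helper below is hypothetical and uses only the public 1.3 DataFrame API.)
{code}
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper: rebuild the DataFrame via select, dropping the old column
// of the same name, so no duplicate column is ever created.
def withColumnReplaced(df: DataFrame, name: String, col: Column): DataFrame = {
  val others = df.columns.filter(_ != name).map(df.apply)
  df.select((others :+ col.as(name)): _*)
}

// Usage: withColumnReplaced(df, "x", df("x") + 1) yields [x: int], not [x: int, x: int].
{code}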

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at 

[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message

2015-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503761#comment-14503761
 ] 

Sean Owen commented on SPARK-7002:
--

The shuffle data is a sort of hidden, second type of caching that goes on. I 
don't know how much it's supposed to be exposed. My hunch is that if there's an 
easy API already to access this info, go ahead and propose adding it to the 
debug string, but if it's not otherwise easily accounted for, may not be worth 
adding it. It's good to know that there is a logic to what is happening, at 
least, rather than a bug.

 Persist on RDD fails the second time if the action is called on a child RDD 
 without showing a FAILED message
 

 Key: SPARK-7002
 URL: https://issues.apache.org/jira/browse/SPARK-7002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: Platform: Power8
 OS: Ubuntu 14.10
 Java: java-8-openjdk-ppc64el
Reporter: Tom Hubregtsen
Priority: Minor
  Labels: disk, persist, unpersist

 The major issue is: Persist on RDD fails the second time if the action is 
 called on a child RDD without showing a FAILED message. This is pointed out 
 at 2)
 next to this:
 toDebugString on a child RDD does not show that the parent RDD is [Disk 
 Serialized 1x Replicated]. This is pointed out at 1)
 Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is 
 not physically stored, as I did not want to solely rely on a missing line in 
 .toDebugString (see comments in trace)
 {code}
 scala> val rdd1 = sc.parallelize(List(1,2,3))
 scala> val rdd2 = rdd1.map(x => (x,x+1))
 scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y)
 scala> import org.apache.spark.storage.StorageLevel
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res4: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res5: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // 1) rdd3 does not show that the other RDD's are [Disk Serialized 1x Replicated], 
 // but the data is on disk. This is verified by
 // a) The line starting with CachedPartitions
 // b) a find in spark_local_dir: find . -name \* | grep rdd returns 
 // ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, 
 // where * are the number of partitions
 scala> rdd2.unpersist()
 scala> rdd2.toDebugString
 res8: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 scala> rdd3.toDebugString
 res9: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // successfully unpersisted, also not visible on disk
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res18: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res19: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // 2) The data is not visible on disk through the find command previously 
 // mentioned, and is also not mentioned in the toDebugString (no line starting 
 // with CachedPartitions, even though [Disk Serialized 1x Replicated] is 
 // mentioned). It does work when you call the action on the actual RDD:
 scala> rdd2.collect()
 scala> rdd2.toDebugString
 res21: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res22: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
      |  ParallelCollectionRDD[0] at 

[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message

2015-04-20 Thread Tom Hubregtsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503783#comment-14503783
 ] 

Tom Hubregtsen commented on SPARK-7002:
---

Great, thanks for your help :)

I will be happy to propose this. What is the proper way to do this? Do I close 
this issue and start a new issue with type either "New feature" or "Wish", 
in which I explain what I believe is missing from .toDebugString and why? 
Anything else I should add?

Thanks,

Tom Hubregtsen

 Persist on RDD fails the second time if the action is called on a child RDD 
 without showing a FAILED message
 

 Key: SPARK-7002
 URL: https://issues.apache.org/jira/browse/SPARK-7002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: Platform: Power8
 OS: Ubuntu 14.10
 Java: java-8-openjdk-ppc64el
Reporter: Tom Hubregtsen
Priority: Minor
  Labels: disk, persist, unpersist

 The major issue is: Persist on RDD fails the second time if the action is 
 called on a child RDD without showing a FAILED message. This is pointed out 
 at 2)
 next to this:
 toDebugString on a child RDD does not show that the parent RDD is [Disk 
 Serialized 1x Replicated]. This is pointed out at 1)
 Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is 
 not physically stored, as I did not want to solely rely on a missing line in 
 .toDebugString (see comments in trace)
 {code}
 scala> val rdd1 = sc.parallelize(List(1,2,3))
 scala> val rdd2 = rdd1.map(x => (x,x+1))
 scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y)
 scala> import org.apache.spark.storage.StorageLevel
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res4: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res5: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // 1) rdd3 does not show that the other RDD's are [Disk Serialized 1x Replicated], 
 // but the data is on disk. This is verified by
 // a) The line starting with CachedPartitions
 // b) a find in spark_local_dir: find . -name \* | grep rdd returns 
 // ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, 
 // where * are the number of partitions
 scala> rdd2.unpersist()
 scala> rdd2.toDebugString
 res8: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 scala> rdd3.toDebugString
 res9: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // successfully unpersisted, also not visible on disk
 scala> rdd2.persist(StorageLevel.DISK_ONLY)
 scala> rdd3.collect()
 scala> rdd2.toDebugString
 res18: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res19: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // 2) The data is not visible on disk through the find command previously 
 // mentioned, and is also not mentioned in the toDebugString (no line starting 
 // with CachedPartitions, even though [Disk Serialized 1x Replicated] is 
 // mentioned). It does work when you call the action on the actual RDD:
 scala> rdd2.collect()
 scala> rdd2.toDebugString
 res21: String = 
 (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated]
    |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
 scala> rdd3.toDebugString
 res22: String = 
 (100) ShuffledRDD[2] at reduceByKey at <console>:25 []
   +-(100) MapPartitionsRDD[1] at map at <console>:23 []
      |  CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B
      |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 // Data appears on disk again (using find command previously 

[jira] [Created] (SPARK-7019) Build docs on doc changes

2015-04-20 Thread Brennon York (JIRA)
Brennon York created SPARK-7019:
---

 Summary: Build docs on doc changes
 Key: SPARK-7019
 URL: https://issues.apache.org/jira/browse/SPARK-7019
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Brennon York


Currently when a pull request changes the {{docs/}} directory, the docs aren't 
actually built. When a PR is submitted the {{git}} history should be checked to 
see if any doc changes were made and, if so, properly build the docs and report 
any issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-20 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504092#comment-14504092
 ] 

Michael Armbrust commented on SPARK-6635:
-

Sorry, updated.  I meant {{withColumn}}.

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
   at 

[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503754#comment-14503754
 ] 

Sean Owen commented on SPARK-7009:
--

Or warnings, yes. These add to the case that updating to Java 7 would resolve 
gotchas that are currently merely documented or warned against.

 Build assembly JAR via ant to avoid zip64 problems
 --

 Key: SPARK-7009
 URL: https://issues.apache.org/jira/browse/SPARK-7009
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
 Environment: Java 7+
Reporter: Steve Loughran
   Original Estimate: 2h
  Remaining Estimate: 2h

 SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
 format incompatible with Java and pyspark.
 Provided the total number of .class files+resources is under 64K, ant can be used 
 to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
 then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6726:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: LogisticRegression
 

 Key: SPARK-6726
 URL: https://issues.apache.org/jira/browse/SPARK-6726
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6786) Model export/import for spark.ml: Normalizer

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6786:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: Normalizer
 

 Key: SPARK-6786
 URL: https://issues.apache.org/jira/browse/SPARK-6786
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6787) Model export/import for spark.ml: StandardScaler

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6787:
-
Target Version/s:   (was: 1.4.0)

 Model export/import for spark.ml: StandardScaler
 

 Key: SPARK-6787
 URL: https://issues.apache.org/jira/browse/SPARK-6787
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-20 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504030#comment-14504030
 ] 

Michael Armbrust commented on SPARK-6635:
-

+1 to {{withName}} overwriting existing columns.

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
   

[jira] [Commented] (SPARK-7008) An Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504114#comment-14504114
 ] 

zhengruifeng commented on SPARK-7008:
-

thanks for this information!

 An Implement of Factorization Machine (LibFM)
 -

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machines are a kind of machine learning algorithm for 
 multi-linear regression and are widely used for recommendation.
 Factorization Machines have performed well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-04-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503752#comment-14503752
 ] 

Steve Loughran commented on SPARK-7009:
---

most of the others seemed fixed by documentation patches...

 Build assembly JAR via ant to avoid zip64 problems
 --

 Key: SPARK-7009
 URL: https://issues.apache.org/jira/browse/SPARK-7009
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
 Environment: Java 7+
Reporter: Steve Loughran
   Original Estimate: 2h
  Remaining Estimate: 2h

 SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
 format incompatible with Java and pyspark.
 Provided the total number of .class files+resources is under 64K, ant can be used 
 to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
 then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python

2015-04-20 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-7016:

Summary: Refactor dev/run-tests(-jenkins) from Bash to Python  (was: 
Refactor {{dev/run-tests(-jenkins)}} from Bash to Python)

 Refactor dev/run-tests(-jenkins) from Bash to Python
 

 Key: SPARK-7016
 URL: https://issues.apache.org/jira/browse/SPARK-7016
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York

 Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written 
 in Bash and are becoming quite unwieldy to manage, both in their current state 
 and for future contributions.
 This proposal is to refactor both scripts into Python to allow for better 
 manageability by the community, make it easier to add features, and 
 provide a simpler approach to calling / running the various test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7016) Refactor {{dev/run-tests(-jenkins)}} from Bash to Python

2015-04-20 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-7016:

Summary: Refactor {{dev/run-tests(-jenkins)}} from Bash to Python  (was: 
Refactor {dev/run-tests(-jenkins)} from Bash to Python)

 Refactor {{dev/run-tests(-jenkins)}} from Bash to Python
 

 Key: SPARK-7016
 URL: https://issues.apache.org/jira/browse/SPARK-7016
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York

 Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written 
 in Bash and are becoming quite unwieldy to manage, both in their current state 
 and for future contributions.
 This proposal is to refactor both scripts into Python to allow for better 
 manageability by the community, easier capability to add features, and 
 provide a simpler approach to calling / running the various test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7020) Restrict module testing based on commit contents

2015-04-20 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-7020:

Description: Currently all builds trigger all tests. This does not need to 
happen; to minimize the test window, the {{git}} commit contents should be 
checked to determine which modules were affected, and only the tests for those 
modules should be run.

 Restrict module testing based on commit contents
 

 Key: SPARK-7020
 URL: https://issues.apache.org/jira/browse/SPARK-7020
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York

 Currently all builds trigger all tests. This does not need to happen; to 
 minimize the test window, the {{git}} commit contents should be checked to 
 determine which modules were affected, and only the tests for those modules 
 should be run.
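 As an added illustration (not part of the original ticket), a minimal Scala 
 sketch of the idea, assuming a hypothetical mapping from path prefixes to 
 modules (the prefixes and module names below are illustrative only):
 {code}
 // Hypothetical sketch: map the files touched by a commit (e.g. the output of
 // `git diff --name-only`) to the modules whose tests need to run.
 object AffectedModules {
   private val modulesByPrefix = Seq(
     "mllib/"     -> "mllib",
     "sql/"       -> "sql",
     "streaming/" -> "streaming",
     "python/"    -> "pyspark",
     "core/"      -> "core")

   /** Return the modules to test; fall back to everything when a changed file
     * does not match any known prefix. */
   def affected(changedPaths: Seq[String]): Set[String] = {
     val matched = changedPaths.map { path =>
       modulesByPrefix.collectFirst { case (prefix, m) if path.startsWith(prefix) => m }
     }
     if (matched.exists(_.isEmpty)) modulesByPrefix.map(_._2).toSet
     else matched.flatten.toSet
   }
 }
 {code}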



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land

2015-04-20 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503915#comment-14503915
 ] 

Harry Brundage commented on SPARK-6917:
---

[~davies] or [~joshrosen] any idea why this might be happening? I can dig in if 
you give me some pointers but I don't really know where to start! 

 Broken data returned to PySpark dataframe if any large numbers used in Scala 
 land
 -

 Key: SPARK-6917
 URL: https://issues.apache.org/jira/browse/SPARK-6917
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3, Python 2.7.6, Scala 2.10
Reporter: Harry Brundage
 Attachments: part-r-1.parquet


 When trying to access data stored in a Parquet file with an INT96 column 
 (read: TimestampType() encoded for Impala), if the INT96 column is included 
 in the fetched data, other, smaller numeric types come back broken.
 {code}
 In [1]: 
 sql.sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col',
  'long_col').first()
 Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))
 In [2]: 
 sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first()
 Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, 
 str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, 
 date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 
 'America/Toronto' EDT-1 day, 19:00:00 DST>))
 {code}
 Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being 
 returned for the {{int_col}} and {{long_col}} columns in the second loop 
 above. This only happens if I select the {{date_col}} which is stored as 
 {{INT96}}. 
 I don't know much about Scala boxing, but I assume that somehow by including 
 numeric columns that are bigger than a machine word I trigger some different, 
 slower execution path somewhere that boxes stuff and causes this problem.
 If anyone could give me any pointers on where to get started fixing this I'd 
 be happy to dive in!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5995) Make ML Prediction Developer APIs public

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5995:
-
Description: 
Previously, some Developer APIs were added to spark.ml for classification and 
regression to make it easier to add new algorithms and models: [SPARK-4789]  
There are ongoing discussions about the best design of the API.  This JIRA is 
to continue that discussion and try to finalize those Developer APIs so that 
they can be made public.

Please see [this design doc from SPARK-4789 | 
https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
 for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
** Proposal: No
* Should the strongly typed API for transform() be public (vs. protected)?
** Proposal: Protected for now
* What transformation methods should the API make developers implement for 
classification?
** Proposal: See design doc
* Should there be a way to transform a single Row (instead of only DataFrames)?
** Proposal: Not for now

  was:
Previously, some Developer APIs were added to spark.ml for classification and 
regression to make it easier to add new algorithms and models: [SPARK-4789]  
There are ongoing discussions about the best design of the API.  This JIRA is 
to continue that discussion and try to finalize those Developer APIs so that 
they can be made public.

Please see [this design doc from SPARK-4789 | 
https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
 for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
* Should the strongly typed API for transform() be public (vs. protected)?
* What transformation methods should the API make developers implement for 
classification?  (See details below.)
* Should there be a way to transform a single Row (instead of only DataFrames)?

More on What transformation methods should the API make developers implement 
for classification?:
* Goals:
** Optimize transform: Make it fast, and make it output only the desired 
columns.
** Easy development
** Support Classifier, Regressor, and ProbabilisticClassifier
* (currently) Developers implement predictX methods for each output column X.  
They may override transform() to optimize speed.
** Pros: predictX is easy to understand.
** Cons: An optimized transform() is annoying to write.
* Developers implement more basic transformation methods, such as features2raw, 
raw2pred, raw2prob.
** Pros: Abstract classes may implement optimized transform().
** Cons: Different types of predictors require different methods:
*** Predictor and Regressor: features2pred
*** Classifier: features2raw, raw2pred
*** ProbabilisticClassifier: raw2prob
* Developers implement a single predict() method which takes parameters for 
what columns to output (returning tuple or some type with None for missing 
values).  Abstract classes take the outputs they want and put them into columns.
** Pros: Developers only write 1 method and can optimize it as much as they 
want.  It could be more optimized than the previous 2 options; e.g., if 
LogisticRegressionModel only wants the prediction, then it never has to 
construct intermediate results such as the vector of raw predictions.
** Cons: predict() will have a different signature for different abstractions, 
based on the possible output columns.
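As an added illustration (not part of the original description), a rough Scala 
sketch of the second option above, i.e. developers implement basic 
transformation methods and an abstract class supplies the optimized 
transform(); the names below are assumptions, not the actual spark.ml API:
{code}
// Illustrative only: the shape of "developers implement features2raw/raw2pred,
// the abstract class owns the optimized bulk transformation".
trait Vec
case class Prediction(raw: Vec, label: Double)

abstract class SketchClassifierModel {
  // Developers implement the two primitive transformations...
  protected def features2raw(features: Vec): Vec
  protected def raw2pred(raw: Vec): Double

  // ...and the shared code computes the raw scores once per row and derives
  // the prediction from them, so no intermediate result is re-computed.
  def predictAll(features: Iterator[Vec]): Iterator[Prediction] =
    features.map { f =>
      val raw = features2raw(f)
      Prediction(raw, raw2pred(raw))
    }
}
{code}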



 Make ML Prediction Developer APIs public
 

 Key: SPARK-5995
 URL: https://issues.apache.org/jira/browse/SPARK-5995
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Previously, some Developer APIs were added to spark.ml for classification and 
 regression to make it easier to add new algorithms and models: [SPARK-4789]  
 There are ongoing discussions about the best design of the API.  This JIRA is 
 to continue that discussion and try to finalize those Developer APIs so that 
 they can be made public.
 Please see [this design doc from SPARK-4789 | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
  for details on the original API design.
 Some issues under debate:
 * Should there be strongly typed APIs for fit()?
 ** Proposal: No
 * Should the strongly typed API for transform() be public (vs. protected)?
 ** Proposal: Protected for now
 * What transformation methods should the API make developers implement for 
 classification?
 ** Proposal: See design doc
 * Should there be a way to transform a single Row (instead of only 
 DataFrames)?
 ** Proposal: Not for now



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Issue Comment Deleted] (SPARK-3530) Pipeline and Parameters

2015-04-20 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3530:
-
Comment: was deleted

(was: Hi Xiangrui,

Which part of this pipeline project would you like us to work on? 

Thanks!


)

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.2.0


 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7017) Refactor dev/run-tests into Python

2015-04-20 Thread Brennon York (JIRA)
Brennon York created SPARK-7017:
---

 Summary: Refactor dev/run-tests into Python
 Key: SPARK-7017
 URL: https://issues.apache.org/jira/browse/SPARK-7017
 Project: Spark
  Issue Type: Sub-task
Reporter: Brennon York


This issue is to specifically track the progress of refactoring the 
{{dev/run-tests}} script into Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-04-20 Thread Brennon York (JIRA)
Brennon York created SPARK-7018:
---

 Summary: Refactor dev/run-tests-jenkins into Python
 Key: SPARK-7018
 URL: https://issues.apache.org/jira/browse/SPARK-7018
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Brennon York


This issue is to specifically track the progress of refactoring the 
{{dev/run-tests-jenkins}} script into Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Omede Firouz (JIRA)
Omede Firouz created SPARK-7022:
---

 Summary: PySpark is missing ParamGridBuilder
 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz


PySpark is missing the entirety of ML.Tuning (see: 
https://issues.apache.org/jira/browse/SPARK-6940)

This is a subticket specifically to track the ParamGridBuilder. The 
CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Omede Firouz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omede Firouz updated SPARK-7022:

Description: 
PySpark is missing the entirety of ML.Tuning (see: 
https://issues.apache.org/jira/browse/SPARK-6940)

This is a subticket specifically to track the ParamGridBuilder. The 
CrossValidator will be dealt with in a followup.

  was:
PySpark is missing the entirety of ML.Tuning (see: 
vhttps://issues.apache.org/jira/browse/SPARK-6940)

This is a subticket specifically to track the ParamGridBuilder. The 
CrossValidator will be dealt with in a followup.


 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative

2015-04-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6954:
-
Priority: Major  (was: Minor)

 Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should 
 never become negative
 -

 Key: SPARK-6954
 URL: https://issues.apache.org/jira/browse/SPARK-6954
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
  Labels: yarn

 I have a simple test case for dynamic allocation on YARN that fails with the 
 following stack trace-
 {code}
 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -21 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 My test is as follows:
 # Start spark-shell with a single executor.
 # Run a {{select count(\*)}} query. The number of executors rises since the 
 input size is non-trivial.
 # After the job finishes, the number of executors falls as most of them 
 become idle.
 # Rerun the same query, and the request to add executors fails with the 
 above error. In fact, the job itself continues to run with whatever executors 
 it already has, but it never gets more executors unless the shell is closed 
 and restarted.
 This error only happens when I configure {{executorIdleTimeout}} to be 
 very small. For example, I can reproduce it with the following configs:
 {code}
 spark.dynamicAllocation.executorIdleTimeout 5
 spark.dynamicAllocation.schedulerBacklogTimeout 5
 {code}
 Although I can simply increase {{executorIdleTimeout}} to something like 60 
 secs to avoid the error, I think this is still a bug to be fixed.
 The root cause seems to be that {{numExecutorsPending}} accidentally becomes 
 negative if executors are killed too aggressively (i.e. 
 {{executorIdleTimeout}} is too small), because under that circumstance the 
 new target # of executors can be smaller than the current # of executors. 
 When that happens, {{ExecutorAllocationManager}} ends up trying to add a 
 negative number of executors, which throws an exception.
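 As an added note (not from the reporter), a minimal sketch of the kind of 
 guard this analysis suggests, assuming hypothetical field names rather than 
 the real ExecutorAllocationManager internals:
 {code}
 // Illustrative only: clamp the executor delta so a shrinking target never
 // turns into a negative add request to the cluster manager.
 class AllocationSketch(var numExecutorsPending: Int) {
   def executorsToAdd(targetTotal: Int, currentTotal: Int): Int = {
     val delta = targetTotal - (currentTotal + numExecutorsPending)
     val toAdd = math.max(delta, 0)   // never request a negative number
     numExecutorsPending += toAdd
     toAdd
   }
 }
 {code}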



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7021) JUnit output for Python tests

2015-04-20 Thread Brennon York (JIRA)
Brennon York created SPARK-7021:
---

 Summary: JUnit output for Python tests
 Key: SPARK-7021
 URL: https://issues.apache.org/jira/browse/SPARK-7021
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York
Priority: Minor


Currently the Python tests return their output in their own format. It would be 
preferable if the Python test runner could output its test results in JUnit 
format to better match the rest of the Jenkins test output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python

2015-04-20 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-7016:

Description: 
Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are 
written in Bash and are becoming quite unwieldy to manage, both in their current 
state and for future contributions.

This proposal is to refactor both scripts into Python to allow for better 
manageability by the community, easier capability to add features, and provide 
a simpler approach to calling / running the various test suites.

  was:
Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written 
in Bash and are becoming quite unwieldy to manage, both in their current state 
and for future contributions.

This proposal is to refactor both scripts into Python to allow for better 
manageability by the community, easier capability to add features, and provide 
a simpler approach to calling / running the various test suites.


 Refactor dev/run-tests(-jenkins) from Bash to Python
 

 Key: SPARK-7016
 URL: https://issues.apache.org/jira/browse/SPARK-7016
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York

 Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are 
 written in Bash and are becoming quite unwieldy to manage, both in their 
 current state and for future contributions.
 This proposal is to refactor both scripts into Python to allow for better 
 manageability by the community, easier capability to add features, and 
 provide a simpler approach to calling / running the various test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7016) Refactor {dev/run-tests(-jenkins)} from Bash to Python

2015-04-20 Thread Brennon York (JIRA)
Brennon York created SPARK-7016:
---

 Summary: Refactor {dev/run-tests(-jenkins)} from Bash to Python
 Key: SPARK-7016
 URL: https://issues.apache.org/jira/browse/SPARK-7016
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York


Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written 
in Bash and are becoming quite unwieldy to manage, both in their current state 
and for future contributions.

This proposal is to refactor both scripts into Python to allow for better 
manageability by the community, easier capability to add features, and provide 
a simpler approach to calling / running the various test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-20 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504030#comment-14504030
 ] 

Michael Armbrust edited comment on SPARK-6635 at 4/21/15 1:07 AM:
--

+1 to {{withColumn}} overwriting existing columns.


was (Author: marmbrus):
+1 to {{withName}} overwriting existing columns.

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at 

[jira] [Commented] (SPARK-5995) Make ML Prediction Developer APIs public

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504090#comment-14504090
 ] 

Joseph K. Bradley commented on SPARK-5995:
--

I just updated the design doc linked above with a new section "Post-Part 1 
Assessment" detailing a few issues.

 Make ML Prediction Developer APIs public
 

 Key: SPARK-5995
 URL: https://issues.apache.org/jira/browse/SPARK-5995
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Previously, some Developer APIs were added to spark.ml for classification and 
 regression to make it easier to add new algorithms and models: [SPARK-4789]  
 There are ongoing discussions about the best design of the API.  This JIRA is 
 to continue that discussion and try to finalize those Developer APIs so that 
 they can be made public.
 Please see [this design doc from SPARK-4789 | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
  for details on the original API design.
 Some issues under debate:
 * Should there be strongly typed APIs for fit()?
 * Should the strongly typed API for transform() be public (vs. protected)?
 * What transformation methods should the API make developers implement for 
 classification?  (See details below.)
 * Should there be a way to transform a single Row (instead of only 
 DataFrames)?
 More on What transformation methods should the API make developers implement 
 for classification?:
 * Goals:
 ** Optimize transform: Make it fast, and make it output only the desired 
 columns.
 ** Easy development
 ** Support Classifier, Regressor, and ProbabilisticClassifier
 * (currently) Developers implement predictX methods for each output column X. 
  They may override transform() to optimize speed.
 ** Pros: predictX is easy to understand.
 ** Cons: An optimized transform() is annoying to write.
 * Developers implement more basic transformation methods, such as 
 features2raw, raw2pred, raw2prob.
 ** Pros: Abstract classes may implement optimized transform().
 ** Cons: Different types of predictors require different methods:
 *** Predictor and Regressor: features2pred
 *** Classifier: features2raw, raw2pred
 *** ProbabilisticClassifier: raw2prob
 * Developers implement a single predict() method which takes parameters for 
 what columns to output (returning tuple or some type with None for missing 
 values).  Abstract classes take the outputs they want and put them into 
 columns.
 ** Pros: Developers only write 1 method and can optimize it as much as they 
 want.  It could be more optimized than the previous 2 options; e.g., if 
 LogisticRegressionModel only wants the prediction, then it never has to 
 construct intermediate results such as the vector of raw predictions.
 ** Cons: predict() will have a different signature for different 
 abstractions, based on the possible output columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7025:
--

 Summary: Create a Java-friendly input source API
 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, it would be great if that input 
source can be binary compatible in years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal:

An InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.
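As an added illustration (the names here are assumptions, not a committed 
API), a rough Scala sketch of the shape this proposal describes:
{code}
// Illustrative sketch: an InputSource is a set of partitions plus a way to
// obtain a record reader per partition; there is no key/value split, and
// nothing here relies on Scala-only features, so it stays Java-friendly.
trait InputPartition extends java.io.Serializable

trait RecordReader[T] extends java.io.Closeable {
  def hasNext: Boolean
  def next(): T
}

trait InputSource[T] extends java.io.Serializable {
  def getPartitions(): Array[InputPartition]
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}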




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6529) Word2Vec transformer

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504180#comment-14504180
 ] 

Joseph K. Bradley commented on SPARK-6529:
--

[~yinxusen] brings up a good point (in the PR) that Word2Vec and Word2VecModel 
take input columns of different types.  This is a problem with current 
Estimator-Model approaches since they always share the same {{inputCol}} param.

Thinking about this, I believe the Estimator and Model {{inputCol}} params must 
be different.  In a Pipeline, we need to be able to specify both input columns 
before fitting, and we will not always have the chance to reset the input 
column before testing.

CC: [~mengxr]  since you'll be interested
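As an added illustration (illustrative names and column types, not the real 
spark.ml classes): if the estimator and the model it produces consume columns 
of different types, each needs its own input-column param rather than one 
shared {{inputCol}}, so both can be configured up front in a Pipeline.
{code}
// Illustrative only: estimator and model each carry a separate input column.
case class Word2VecSketch(estimatorInputCol: String) {      // e.g. a Seq[String] column
  def fit(): Word2VecModelSketch =
    Word2VecModelSketch(modelInputCol = "word")             // e.g. a String column
}
case class Word2VecModelSketch(modelInputCol: String)
{code}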

 Word2Vec transformer
 

 Key: SPARK-6529
 URL: https://issues.apache.org/jira/browse/SPARK-6529
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Assignee: Xusen Yin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7022:
-
Assignee: Omede Firouz

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz
Assignee: Omede Firouz

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-04-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7022:
-
Target Version/s: 1.4.0

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz
Assignee: Omede Firouz

 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4521) Parquet fails to read columns with spaces in the name

2015-04-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-4521.
---
Resolution: Done

This ticket is covered by SPARK-6607.

 Parquet fails to read columns with spaces in the name
 -

 Key: SPARK-4521
 URL: https://issues.apache.org/jira/browse/SPARK-4521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust

 I think this is actually a bug in parquet, but it would be good to track it 
 here as well.  To reproduce:
 {code}
 jsonRDD(sparkContext.parallelize("""{"number of clusters": 
 1}""" :: Nil)).saveAsParquetFile("test")
 parquetFile("test").collect()
 {code}
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': 
 expected ';' but got 'of' at line 1:   optional int32 number of
   at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209)
   at 
 parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182)
   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108)
   at 
 parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
   at 
 parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189)
   at 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135)
   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6635.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5541
[https://github.com/apache/spark/pull/5541]

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
 Fix For: 1.4.0


 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at 

[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-20 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spills 2.2 GB of data to disk:

{code}

15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
in-memory map of 2.2 GB to disk (61 times so far)
15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

But the file size is only 2.2M.

{code}
ll -h 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
total 2.2M
-rw-r- 1 spark users 2.2M Apr  7 20:27 
temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

The GC log shows that the JVM memory is less than 1 GB.
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
{code}

The estimateSize is hugely different from the spill file size; there is a bug in 

  was:
ExternalAppendOnlyMap spills 2.2 GB of data to disk:

{code}

15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
in-memory map of 2.2 GB to disk (61 times so far)
15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

But the file size is only 2.2M.

{code}
ll -h 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
total 2.2M
-rw-r- 1 spark users 2.2M Apr  7 20:27 
temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

The GC log shows that the JVM memory is less than 1 GB.
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
{code}

The estimateSize is hugely different from the spill file size


 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize is hugely different from the spill file size; there is a bug in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-20 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen reopened SPARK-6738:
--

There is a bug in SizeEstimator

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize is hugely different from the spill file size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-20 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-6738:
-
Description: 
ExternalAppendOnlyMap spills 2.2 GB of data to disk:

{code}

15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
in-memory map of 2.2 GB to disk (61 times so far)
15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

But the file size is only 2.2M.

{code}
ll -h 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
total 2.2M
-rw-r- 1 spark users 2.2M Apr  7 20:27 
temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

The GC log shows that the JVM memory is less than 1 GB.
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
{code}

The estimateSize is hugely different from the spill file size; there is a bug in 
SizeEstimator.visitArray.

  was:
ExternalAppendOnlyMap spills 2.2 GB of data to disk:

{code}

15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
in-memory map of 2.2 GB to disk (61 times so far)
15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

But the file size is only 2.2M.

{code}
ll -h 
/data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
total 2.2M
-rw-r- 1 spark users 2.2M Apr  7 20:27 
temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
{code}

The GC log shows that the JVM memory is less than 1 GB.
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
{code}

The estimateSize is hugely different from the spill file size; there is a bug in 


 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize is hugely different from the spill file size; there is a bug in 
 SizeEstimator.visitArray.
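 As an added note (an illustration of the general sample-and-extrapolate 
 approach to array sizing, not the actual SizeEstimator.visitArray code): if a 
 large array is sized by sampling a few elements and scaling up, an array 
 whose slots mostly reference the same object can be counted many times over, 
 which is one plausible way an estimate could drift far above the real spilled 
 size.
 {code}
 // Illustrative only: naive sampling-based array sizing. Heavily shared
 // elements are charged once per slot, so the estimate can vastly exceed the
 // actual heap footprint.
 object SamplingSizeSketch {
   def estimate(arr: Array[AnyRef], sizeOf: AnyRef => Long, sampleSize: Int = 100): Long = {
     require(arr.nonEmpty, "illustration assumes a non-empty array")
     val rng = new scala.util.Random(42)
     val samples = Array.fill(math.min(sampleSize, arr.length)) {
       arr(rng.nextInt(arr.length))
     }
     val avgPerElement = samples.map(sizeOf).sum.toDouble / samples.length
     (avgPerElement * arr.length).toLong
   }
 }
 {code}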



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Comment Edited] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-20 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504202#comment-14504202
 ] 

Hong Shen edited comment on SPARK-6738 at 4/21/15 2:54 AM:
---

There is a bug in SizeEstimator


was (Author: shenhong):
There is a in SizeEstimator

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills 2.2 GB of data to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
 {code}
 The estimateSize is hugely different from the spill file size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4131) Support Writing data into the filesystem from queries

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4131:
---

Assignee: Fei Wang  (was: Apache Spark)

 Support Writing data into the filesystem from queries
 ---

 Key: SPARK-4131
 URL: https://issues.apache.org/jira/browse/SPARK-4131
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: XiaoJing wang
Assignee: Fei Wang
Priority: Critical
   Original Estimate: 0.05h
  Remaining Estimate: 0.05h

 Writing data into the filesystem from queries is not supported by SparkSql.
 e.g.:
 {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
 from page_views;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4131) Support Writing data into the filesystem from queries

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4131:
---

Assignee: Apache Spark  (was: Fei Wang)

 Support Writing data into the filesystem from queries
 ---

 Key: SPARK-4131
 URL: https://issues.apache.org/jira/browse/SPARK-4131
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: XiaoJing wang
Assignee: Apache Spark
Priority: Critical
   Original Estimate: 0.05h
  Remaining Estimate: 0.05h

 Spark SQL does not support writing query results directly into the filesystem.
 e.g.:
 {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
 from page_views;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7025:
---
Description: 
The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD
2. Hadoop MapReduce InputFormat

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, it would be great if that input 
source can be binary compatible in years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.
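A rough sketch of what such an interface could look like (the names InputSource, 
InputPartition and RecordReader come from the proposal; the signatures and 
generics are assumptions, not a final API):
{code}
// Sketch only: every signature here is an assumption.
trait InputPartition extends java.io.Serializable

trait RecordReader[T] extends java.io.Closeable {
  def initialize(partition: InputPartition): Unit
  def next(): Boolean   // advance to the next record; false at end of input
  def get(): T          // current record; no key/value split
}

trait InputSource[T] extends java.io.Serializable {
  // May run arbitrary driver-side code (e.g. to set up broadcast state).
  def getPartitions(): Array[InputPartition]
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}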


  was:
The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, it would be great if that input 
source can be binary compatible in years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.



 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD
 2. Hadoop MapReduce InputFormat
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504197#comment-14504197
 ] 

Ram Sriharsha commented on SPARK-7015:
--

Sounds good. Let me know what reference you had in mind. I am familiar with 
Beygelzimer and Langford's error-correcting tournaments 
(http://hunch.net/~beygel/tournament.pdf), but if you have a better reference in 
mind, let me know and I can use that as the starting point.

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.
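 A minimal sketch of the one-vs-all flavour of such a reduction, written against 
 an abstract binary learner (all names here are illustrative, not the proposed 
 spark.ml API):
 {code}
 // Illustrative one-vs-all reduction; none of these names are the real API.
 case class LabeledPoint(label: Double, features: Array[Double])

 trait BinaryClassifier {
   def predictRaw(features: Array[Double]): Double  // larger = "more positive"
 }

 trait BinaryLearner {
   def train(data: Seq[LabeledPoint]): BinaryClassifier
 }

 class OneVsAllModel(models: Array[BinaryClassifier]) {
   // Pick the class whose binary model gives the highest raw score.
   def predict(features: Array[Double]): Int =
     models.map(_.predictRaw(features)).zipWithIndex.maxBy(_._1)._2
 }

 object OneVsAll {
   def train(data: Seq[LabeledPoint], numClasses: Int, learner: BinaryLearner): OneVsAllModel = {
     val models = Array.tabulate(numClasses) { k =>
       // Relabel: class k becomes the positive class, everything else negative.
       val relabeled = data.map(p => p.copy(label = if (p.label.toInt == k) 1.0 else 0.0))
       learner.train(relabeled)
     }
     new OneVsAllModel(models)
   }
 }
 {code}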



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative

2015-04-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated SPARK-6954:
-
Attachment: without_fix.png
with_fix.png

I am uploading two diagrams that show how the following variables move over 
time with and without my patch:
* numExecutorsPending
* executorIds.size
* executorsPendingToRemove.size
* targetNumExecutors

# The {{with_fix.png}} diagram shows 4 consecutive runs of my query. As can be seen, 
{{targetNumExecutors}} and {{numExecutorsPending}} stay above zero.
# The {{without_fix.png}} diagram shows a single run of my query. As can be seen, 
{{targetNumExecutors}} and {{numExecutorsPending}} go negative after the 1st 
run.

Here is how I collected the data in the source code:
{code}
private def targetNumExecutors(): Int = {
  logInfo("ZZZ " +
    numExecutorsPending + "," +
    executorIds.size + "," +
    executorsPendingToRemove.size + "," +
    (numExecutorsPending + executorIds.size - executorsPendingToRemove.size))
  numExecutorsPending + executorIds.size - executorsPendingToRemove.size
}
{code}
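For reference, the kind of lower bound that would keep the request non-negative 
looks like this (a sketch of the idea only, not the actual patch):
{code}
// Sketch only: clamp the computed executor target at zero so the scheduler
// backend is never asked for a negative number of executors.
object AllocationTarget {
  def targetNumExecutors(pending: Int, live: Int, pendingToRemove: Int): Int =
    math.max(pending + live - pendingToRemove, 0)
}
{code}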

 Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should 
 never become negative
 -

 Key: SPARK-6954
 URL: https://issues.apache.org/jira/browse/SPARK-6954
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
  Labels: yarn
 Attachments: with_fix.png, without_fix.png


 I have a simple test case for dynamic allocation on YARN that fails with the 
 following stack trace:
 {code}
 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -21 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 My test is as follows:
 # Start spark-shell with a single executor.
 # Run a {{select count(\*)}} query. The number of executors rises because the 
 input size is non-trivial.
 # After the job finishes, the number of executors falls as most of them 
 become idle.
 # Rerun the same query, and the request to add executors fails with the 
 above error. In fact, the job itself continues to run with whatever executors 
 it already has, but it never gets more executors unless the shell is closed 
 and restarted. 
 In fact, this error only happens when I configure {{executorIdleTimeout}} to be 
 very small. For example, I can reproduce it with the following configs:
 {code}
 spark.dynamicAllocation.executorIdleTimeout 5
 spark.dynamicAllocation.schedulerBacklogTimeout 5
 {code}
 Although I can simply increase {{executorIdleTimeout}} to something like 60 
 secs to avoid the error, I think this is still a bug to be fixed.
 The root cause seems to be that {{numExecutorsPending}} accidentally becomes 
 negative if executors are killed too aggressively (i.e. 
 {{executorIdleTimeout}} is too small) because under that circumstance, 

[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7025:
---
Description: 
The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, it would be great if that input 
source can be binary compatible in years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.


  was:
The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, it would be great if that input 
source can be binary compatible in years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal:

An InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.



 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD API
 2. Hadoop MapReduce InputFormat API
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504189#comment-14504189
 ] 

Joseph K. Bradley edited comment on SPARK-7015 at 4/21/15 2:43 AM:
---

+1  I'd strongly vote for supporting error-correcting output codes from early 
on.  It's not that much harder to implement, and it can perform much better in 
practice (and in theory).  I can provide some references if it'd be helpful.


was (Author: josephkb):
+1  I'd strongly vote for supporting error-correcting output codes from the 
beginning.  It's not that much harder to implement, and it can perform much 
better in practice (and in theory).  I can provide some references if it'd be 
helpful.

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504189#comment-14504189
 ] 

Joseph K. Bradley commented on SPARK-7015:
--

+1  I'd strongly vote for supporting error-correcting output codes from the 
beginning.  It's not that much harder to implement, and it can perform much 
better in practice (and in theory).  I can provide some references if it'd be 
helpful.

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7023) [Spark SQL] Can't populate table size information into Hive metastore when creating a table or inserting into a table

2015-04-20 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-7023:
--

 Summary: [Spark SQL] Can't populate table size information into 
Hive metastore when creating a table or inserting into a table
 Key: SPARK-7023
 URL: https://issues.apache.org/jira/browse/SPARK-7023
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Zhou


After running the CREATE TABLE statement below in Spark SQL, there are no 
'totalSize', 'numRows', or 'rawDataSize' properties in the 'parameters' field.

CREATE TABLE IF NOT EXISTS customer
STORED AS PARQUET
AS
SELECT * FROM customer_temp;

hive> describe extended customer;
OK
c_customer_sk   bigint
c_customer_id   string
c_current_cdemo_sk  bigint
c_current_hdemo_sk  bigint
c_current_addr_sk   bigint
c_first_shipto_date_sk  bigint
c_first_sales_date_sk   bigint
c_salutationstring
c_first_namestring
c_last_name string
c_preferred_cust_flag   string
c_birth_day int
c_birth_month   int
c_birth_yearint
c_birth_country string
c_login string
c_email_address string
c_last_review_date  string

Detailed Table Information 
... parameters:{transient_lastDdlTime=1429582149}...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5100) Spark Thrift server monitor page

2015-04-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504132#comment-14504132
 ] 

Cheng Lian commented on SPARK-5100:
---

Had an offline discussion with [~tianyi]; he's rebasing PR #3946. I'll revisit it 
once he finishes rebasing.

 Spark Thrift server monitor page
 

 Key: SPARK-5100
 URL: https://issues.apache.org/jira/browse/SPARK-5100
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Web UI
Reporter: Yi Tian
Priority: Critical
 Attachments: Spark Thrift-server monitor page.pdf, 
 prototype-screenshot.png


 In the latest Spark release, there is a Spark Streaming tab on the driver web 
 UI, which shows information about the running streaming application. It would be 
 helpful to provide a similar monitor page for the Thrift server, because both 
 streaming and the Thrift server are long-running applications, and their 
 details do not show on the stage or job pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6368.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5497
[https://github.com/apache/spark/pull/5497]

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.4.0

 Attachments: Kryo.nps, SchemaBased.nps


 Kryo is still pretty slow because it works on individual objects and is relatively 
 expensive to allocate. For the Exchange operator, because the schemas for the key 
 and value are already defined, we can create a specialized serializer to handle 
 those specific schemas. 
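 As a toy illustration of the idea (not the serializer added by the PR): with a 
 fixed row schema, each field can be written with a primitive codec instead of a 
 generic object serializer.
 {code}
 import java.io.{ByteArrayOutputStream, DataOutputStream}

 // Toy sketch assuming a fixed (Int, Long, String) schema.
 object SpecializedRowWriter {
   def write(key: Int, count: Long, name: String): Array[Byte] = {
     val bytes = new ByteArrayOutputStream()
     val out = new DataOutputStream(bytes)
     out.writeInt(key)      // field 1: Int, 4 bytes
     out.writeLong(count)   // field 2: Long, 8 bytes
     out.writeUTF(name)     // field 3: String, length-prefixed modified UTF-8
     out.flush()
     bytes.toByteArray
   }
 }
 {code}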



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7015:
-
Component/s: (was: MLlib)
 ML

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4521) Parquet fails to read columns with spaces in the name

2015-04-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504127#comment-14504127
 ] 

Cheng Lian commented on SPARK-4521:
---

Yes, I'm resolving this one.

 Parquet fails to read columns with spaces in the name
 -

 Key: SPARK-4521
 URL: https://issues.apache.org/jira/browse/SPARK-4521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust

 I think this is actually a bug in parquet, but it would be good to track it 
 here as well.  To reproduce:
 {code}
 jsonRDD(sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil)).saveAsParquetFile("test")
 parquetFile("test").collect()
 {code}
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': 
 expected ';' but got 'of' at line 1:   optional int32 number of
   at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209)
   at 
 parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182)
   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108)
   at 
 parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
   at 
 parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
   at 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189)
   at 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135)
   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4766) ML Estimator Params should be distinct from Transformer Params

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4766:
-
Description: 
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator.  These could sometimes be the same, but for other models, we may 
need either (a) to make them distinct or (b) to have the Estimator params class 
extend the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}

It's also weird to be able to:
* Wrap LogisticRegressionModel (a Transformer) with CrossValidator
* Pass a set of ParamMaps to CrossValidator which includes parameter 
LogisticRegressionModel.maxIter
* (CrossValidator would try to set that parameter.)
* I'm not sure if this would cause a failure or just be a noop.

See the comment below about Word2Vec as well.
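As a toy illustration of option (b) above, the trait layout could look like this 
(illustrative names, not the actual spark.ml classes):
{code}
// Toy sketch: the Estimator's params extend the Transformer's params, so the
// fitted model never exposes training-only settings such as maxIter.
trait LogisticRegressionModelParams {
  var threshold: Double = 0.5   // used at prediction time
}

trait LogisticRegressionParams extends LogisticRegressionModelParams {
  var maxIter: Int = 100        // only meaningful for training
}

class LogisticRegression extends LogisticRegressionParams            // Estimator side
class LogisticRegressionModel extends LogisticRegressionModelParams  // Transformer side
{code}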

  was:
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator, where the Estimator params class extends the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}

It's also weird to be able to:
* Wrap LogisticRegressionModel (a Transformer) with CrossValidator
* Pass a set of ParamMaps to CrossValidator which includes parameter 
LogisticRegressionModel.maxIter
* (CrossValidator would try to set that parameter.)
* I'm not sure if this would cause a failure or just be a noop.


 ML Estimator Params should be distinct from Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator.  These could sometimes be the same, but for other models, 
 we may need either (a) to make them distinct or (b) to have the Estimator 
 params class extend the Transformer one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 It's also weird to be able to:
 * Wrap LogisticRegressionModel (a Transformer) with CrossValidator
 * Pass a set of ParamMaps to CrossValidator which includes parameter 
 LogisticRegressionModel.maxIter
 * (CrossValidator would try to set that parameter.)
 * I'm not sure if this would cause a failure or just be a noop.
 See the comment below about Word2Vec as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504181#comment-14504181
 ] 

Joseph K. Bradley commented on SPARK-4766:
--

*Update*: A new issue was brought up by the PR for Word2Vec for this JIRA: 
[https://issues.apache.org/jira/browse/SPARK-6529]

Basically, the Estimator and Model take different input column types, so they 
should (probably) use different input column parameters.  See that JIRA for the 
discussion.

 ML Estimator Params should subclass Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator, where the Estimator params class extends the Transformer 
 one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 It's also weird to be able to:
 * Wrap LogisticRegressionModel (a Transformer) with CrossValidator
 * Pass a set of ParamMaps to CrossValidator which includes parameter 
 LogisticRegressionModel.maxIter
 * (CrossValidator would try to set that parameter.)
 * I'm not sure if this would cause a failure or just be a noop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4766) ML Estimator Params should be distinct from Transformer Params

2015-04-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4766:
-
Summary: ML Estimator Params should be distinct from Transformer Params  
(was: ML Estimator Params should subclass Transformer Params)

 ML Estimator Params should be distinct from Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator, where the Estimator params class extends the Transformer 
 one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 It's also weird to be able to:
 * Wrap LogisticRegressionModel (a Transformer) with CrossValidator
 * Pass a set of ParamMaps to CrossValidator which includes parameter 
 LogisticRegressionModel.maxIter
 * (CrossValidator would try to set that parameter.)
 * I'm not sure if this would cause a failure or just be a noop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6900) spark ec2 script enters infinite loop when run-instance fails

2015-04-20 Thread Guodong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504236#comment-14504236
 ] 

Guodong Wang commented on SPARK-6900:
-

Hi Nick, sorry for my late reply.

I marked this as a major issue because I am using the spark-ec2 script to 
launch/set up/destroy Spark clusters automatically in AWS. This is 
integrated with our computation platform service; we don't expect any manual 
operations when launching the cluster. I agree with you that it would be a 
major issue if I were using spark-ec2 manually. But in my case, I use the script 
as an automation tool, so I think it would be nice if the script could handle this 
case. Although such AWS failures are rare, we are using AWS heavily 
now (launching and destroying a number of separate Spark clusters each day). It 
would be nice if the spark-ec2 script could handle such AWS failures.

Here is more information about my case.
In my environment, the spark-ec2 script just waited forever for all the instances 
to become 'ssh-ready'. It would not try to SSH to any instance before 
exiting the loop, so I had to kill the script process in that scenario.

I went through the spark-ec2 script, and I think SSHing to the instance hosts 
only happens after all the instances enter the running state. Because one of the 
instances was terminated as soon as it was launched, it never entered the 
running state. Then is_cluster_ssh_available is short-circuited, because not 
all the instances are running. Here is the code:
{code}
if all(i.state == 'running' for i in cluster_instances) and \
   all(s.system_status.status == 'ok' for s in statuses) and \
   all(s.instance_status.status == 'ok' for s in statuses) and \
   is_cluster_ssh_available(cluster_instances, opts):
{code}
Then the script enters the infinite loop and never prints any SSH 
failure message.

If I made any mistakes in the above analysis, please tell me.

 spark ec2 script enters infinite loop when run-instance fails
 -

 Key: SPARK-6900
 URL: https://issues.apache.org/jira/browse/SPARK-6900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Guodong Wang

 I am using spark-ec2 scripts to launch Spark clusters in AWS.
 Recently, in our AWS region, there were some technical issues with the AWS EC2 
 service. 
 When spark-ec2 sent the run-instance requests to EC2, not all the requested 
 instances were launched. Some instances were terminated by the EC2 service 
 before they were up.
 But the spark-ec2 script waits for all the instances to enter 'ssh-ready' 
 status, so the script enters an infinite loop, because the terminated 
 instances will never be 'ssh-ready'.
 In my opinion, it should be OK if some of the slave instances were 
 terminated. As long as the master node is running, the terminated slaves 
 should be filtered out and the cluster should still be set up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7024) Improve performance of function containsStar

2015-04-20 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-7024:


 Summary: Improve performance of function containsStar
 Key: SPARK-7024
 URL: https://issues.apache.org/jira/browse/SPARK-7024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yadong Qi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7024) Improve performance of function containsStar

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7024:
---

Assignee: (was: Apache Spark)

 Improve performance of function containsStar
 

 Key: SPARK-7024
 URL: https://issues.apache.org/jira/browse/SPARK-7024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yadong Qi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7024) Improve performance of function containsStar

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504255#comment-14504255
 ] 

Apache Spark commented on SPARK-7024:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5602

 Improve performance of function containsStar
 

 Key: SPARK-7024
 URL: https://issues.apache.org/jira/browse/SPARK-7024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yadong Qi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6900) spark ec2 script enters infinite loop when run-instance fails

2015-04-20 Thread Guodong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504253#comment-14504253
 ] 

Guodong Wang commented on SPARK-6900:
-

In my opinion, it would not cost us much to fix this issue.

Currently, I propose two ways to fix it:
1. The first is a simple fix: *add a timeout to wait_for_cluster_state*. If 
wait_for_cluster_state times out, just exit the script with a non-zero code. 
Then we can add the --resume option when retrying to launch the cluster next time.
2. The second is more robust: *filter out terminated instances while 
wait_for_cluster_state waits for ssh-ready*. If all the non-terminated instances 
are ssh-ready, return from the function. Then, if the master is terminated, 
cluster setup will fail; otherwise, the cluster comes up even though some slave 
instances are down.

What is your opinion, [~nchammas]?
I would be happy to discuss the fix with you and provide a patch.

Thanks

 spark ec2 script enters infinite loop when run-instance fails
 -

 Key: SPARK-6900
 URL: https://issues.apache.org/jira/browse/SPARK-6900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Guodong Wang

 I am using spark-ec2 scripts to launch Spark clusters in AWS.
 Recently, in our AWS region, there were some technical issues with the AWS EC2 
 service. 
 When spark-ec2 sent the run-instance requests to EC2, not all the requested 
 instances were launched. Some instances were terminated by the EC2 service 
 before they were up.
 But the spark-ec2 script waits for all the instances to enter 'ssh-ready' 
 status, so the script enters an infinite loop, because the terminated 
 instances will never be 'ssh-ready'.
 In my opinion, it should be OK if some of the slave instances were 
 terminated. As long as the master node is running, the terminated slaves 
 should be filtered out and the cluster should still be set up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7024) Improve performance of function containsStar

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7024:
---

Assignee: Apache Spark

 Improve performance of function containsStar
 

 Key: SPARK-7024
 URL: https://issues.apache.org/jira/browse/SPARK-7024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yadong Qi
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504335#comment-14504335
 ] 

Joseph K. Bradley edited comment on SPARK-7015 at 4/21/15 5:21 AM:
---

Your reference looks newer than ones I've used before.  After a quick glance, 
it looks like it examines generalizations of methods I've seen.

These are the ones I've used:
* Dietterich & Bakiri.  Solving Multiclass Learning Problems via 
Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]
* Allwein et al.  Reducing Multiclass to Binary: A Unifying Approach for 
Margin Classifiers. 2000.
** [http://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf]

Thinking about it, I'm fine if we start by supporting one-vs-all or something 
simple which everyone has heard of and will expect to find, and then add better 
approaches later (after I've had time to refresh myself on that literature!).


was (Author: josephkb):
Your reference looks newer than ones I've used before.  After a quick glance, 
it looks like it examines generalizations of methods I've seen.

This is the one I've used:
* Dietterich & Bakiri.  Solving Multiclass Learning Problems via 
Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]

Thinking about it, I'm fine if we start by supporting one-vs-all or something 
simple which everyone has heard of and will expect to find, and then add better 
approaches later (after I've had time to refresh myself on that literature!).

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504335#comment-14504335
 ] 

Joseph K. Bradley commented on SPARK-7015:
--

Your reference looks newer than ones I've used before.  After a quick glance, 
it looks like it examines generalizations of methods I've seen.

This is the one I've used:
* Dietterich & Bakiri.  Solving Multiclass Learning Problems via 
Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]

Thinking about it, I'm fine if we start by supporting one-vs-all or something 
simple which everyone has heard of and will expect to find, and then add better 
approaches later (after I've had time to refresh myself on that literature!).

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta algorithms.
 GBDT and SVM today are binary classifiers and we can implement multi class 
 classification as a One vs All, or All vs All (or even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multi class 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504368#comment-14504368
 ] 

Apache Spark commented on SPARK-7025:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5603

 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD API
 2. Hadoop MapReduce InputFormat API
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7025:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD API
 2. Hadoop MapReduce InputFormat API
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7008) An Implementation of Factorization Machine (LibFM)

2015-04-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504369#comment-14504369
 ] 

Xiangrui Meng commented on SPARK-7008:
--

[~podongfeng] Your implementation assumes that the model can be stored locally, 
which is not true for big models. [~gq]'s GraphX-based implementation should 
have better scalability, but is slower on small datasets. We need more time to 
understand the algorithm and decide whether to include it in MLlib. As Sean 
suggested, it would be nice if you could submit both packages to 
spark-packages.org. 

[~podongfeng] and [~gq], I like the simplicity and the expressiveness of FM. I 
have a few questions to understand FM better. FM uses SGD on a non-convex 
objective. What convergence rate have you observed for FM in practice? Is it 
sensitive to local minima (run FM multiple times and see whether there is 
large variance in the objective values)? Is it sensitive to the learning rate?


 An Implementation of Factorization Machine (LibFM)
 -

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machines are a kind of machine learning algorithm for 
 multi-linear regression, and are widely used for recommendation.
 Factorization Machines have performed well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
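 For context, a second-order factorization machine predicts with a global bias, 
 linear weights, and pairwise factor interactions. A minimal prediction sketch 
 (illustrative only; not the API proposed in the patch):
 {code}
 // v(i) is the k-dimensional factor vector of feature i.
 object FM {
   def predict(x: Array[Double], w0: Double, w: Array[Double], v: Array[Array[Double]]): Double = {
     val linear = (x zip w).map(p => p._1 * p._2).sum
     val k = v(0).length
     // Pairwise term in O(k * n) via the standard reformulation:
     // sum_{i<j} <v_i, v_j> x_i x_j
     //   = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
     val pairwise = 0.5 * (0 until k).map { f =>
       val s  = x.indices.map(i => v(i)(f) * x(i)).sum
       val s2 = x.indices.map { i => val t = v(i)(f) * x(i); t * t }.sum
       s * s - s2
     }.sum
     w0 + linear + pairwise
   }
 }
 {code}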



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7025) Create a Java-friendly input source API

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7025:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Create a Java-friendly input source API
 ---

 Key: SPARK-7025
 URL: https://issues.apache.org/jira/browse/SPARK-7025
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Apache Spark

 The goal of this ticket is to create a simple input source API that we can 
 maintain and support long term.
 Spark currently has two de facto input source APIs:
 1. RDD API
 2. Hadoop MapReduce InputFormat API
 Neither of the above is ideal:
 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
 class tags. In addition, the RDD API depends on Scala's runtime library, 
 which does not preserve binary compatibility across Scala versions. If a 
 developer chooses Java to implement an input source, it would be great if 
 that input source can be binary compatible in years to come.
 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
 example, it forces key-value semantics, and does not support running 
 arbitrary code on the driver side (an example of why this is useful is 
 broadcast). In addition, it is somewhat awkward to tell developers that in 
 order to implement an input source for Spark, they should learn the Hadoop 
 MapReduce API first.
 So here's the proposal: an InputSource is described by:
 * an array of InputPartition that specifies the data partitioning
 * a RecordReader that specifies how data on each partition can be read
 This interface would be similar to Hadoop's InputFormat, except that there is 
 no explicit key/value separation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502695#comment-14502695
 ] 

Sean Owen commented on SPARK-7009:
--

Let's see if I remember this correctly: Java 7 supports zip64, so there's no 
problem if building/running with Java 7+ only. Some  (early) Java 6 won't read 
zip64 correctly though. I think the implicit workaround there was to update to 
a later Java 6, since it doesn't affect most releases. Java 6 has some 
*different* hacky extension to zip that lets it read/write more than 65K files 
though, which means that weirdly Java 6-built assemblies might work on old Java 
6 after all. 

I think we only officially support the zip64 version. Implicitly, actually, 
early Java 6 doesn't necessarily work with Spark.

So... does this end up helping this weird situation if Ant is only making zip64 
archives? (Nice that this doesn't actually involve adding an Ant script)

 Build assembly JAR via ant to avoid zip64 problems
 --

 Key: SPARK-7009
 URL: https://issues.apache.org/jira/browse/SPARK-7009
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
 Environment: Java 7+
Reporter: Steve Loughran
   Original Estimate: 2h
  Remaining Estimate: 2h

 SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
 format incompatible with Java and pyspark.
 Provided the total number of .class files+resources is under 64K, ant can be used 
 to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
 then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.

2015-04-20 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-7011:
--

Assignee: Prashant Sharma

 Build fails with scala 2.11 option, because a protected[sql] type is accessed 
 in ml package.
 

 Key: SPARK-7011
 URL: https://issues.apache.org/jira/browse/SPARK-7011
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 I am not sure why this does not fail when building with Scala 2.10; it looks 
 like a Scala bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming

2015-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3276:
-
Assignee: Emre Sevinç

 Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input 
 in streaming
 --

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Assignee: Emre Sevinç
Priority: Minor

 Currently there is only one API, textFileStream in StreamingContext, for 
 creating a text file dstream, and it always ignores old files. Sometimes the 
 old files are still useful.
 We need an API that lets the user choose whether old files should be ignored or not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
   fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
 }
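 For illustration, a sketch of what such a switch could look like on StreamingContext. The parameter name includeOldFiles is hypothetical, not an existing Spark API; the sketch simply forwards to the existing fileStream overload that takes a newFilesOnly flag:
 {code}
 // Hypothetical convenience method, not an existing Spark API.
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.{LongWritable, Text}
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

 def textFileStream(directory: String, includeOldFiles: Boolean): DStream[String] =
   fileStream[LongWritable, Text, TextInputFormat](
     directory,
     (path: Path) => !path.getName.startsWith("."),  // skip hidden/temp files
     newFilesOnly = !includeOldFiles                 // false => also pick up old files
   ).map(_._2.toString)
 {code}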



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-04-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502657#comment-14502657
 ] 

Steve Loughran commented on SPARK-7009:
---

It's only 30 lines of diff including the antrun plugin and its config; trivial 
compared to the shade plugin itself.

As you note though, it's not enough: there are more than 64K .class files.

Which means that the "use Java 6 to compile" warning note of SPARK-1911 probably 
isn't going to work either, unless a Java 6 build includes fewer classes in the 
shaded jar.

 Build assembly JAR via ant to avoid zip64 problems
 --

 Key: SPARK-7009
 URL: https://issues.apache.org/jira/browse/SPARK-7009
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
 Environment: Java 7+
Reporter: Steve Loughran
   Original Estimate: 2h
  Remaining Estimate: 2h

 SPARK-1911 shows the problem: JDK7+ uses zip64 to build large JARs, a format 
 incompatible with (early) Java 6 and with pyspark.
 Provided the total number of .class files + resources is under 64K, ant can be 
 used to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
 and then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7011:
---

Assignee: Apache Spark  (was: Prashant Sharma)

 Build fails with scala 2.11 option, because a protected[sql] type is accessed 
 in ml package.
 

 Key: SPARK-7011
 URL: https://issues.apache.org/jira/browse/SPARK-7011
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Assignee: Apache Spark

 I am not sure why this does not fail when building with Scala 2.10; looks 
 like a Scala bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming

2015-04-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502702#comment-14502702
 ] 

Emre Sevinç commented on SPARK-3276:


Can someone with enough access rights assign this issue to me? (Currently it is 
not assigned to anyone.) I've already discussed it with the Spark developers and 
prepared a pull request on GitHub.

 Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input 
 in streaming
 --

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Priority: Minor

 Currently there is only one API, textFileStream in StreamingContext, for 
 creating a text file dstream, and it always ignores old files. Sometimes the 
 old files are still useful.
 We need an API that lets the user choose whether old files should be ignored or not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
   fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7011:
---

Assignee: Prashant Sharma  (was: Apache Spark)

 Build fails with scala 2.11 option, because a protected[sql] type is accessed 
 in ml package.
 

 Key: SPARK-7011
 URL: https://issues.apache.org/jira/browse/SPARK-7011
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 I am not sure why this does not fail when building with Scala 2.10; looks 
 like a Scala bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502703#comment-14502703
 ] 

Apache Spark commented on SPARK-7011:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/5593

 Build fails with scala 2.11 option, because a protected[sql] type is accessed 
 in ml package.
 

 Key: SPARK-7011
 URL: https://issues.apache.org/jira/browse/SPARK-7011
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 I am not sure why this does not fail when building with Scala 2.10; looks 
 like a Scala bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-04-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502675#comment-14502675
 ] 

Steve Loughran commented on SPARK-7009:
---

Looking at the [openJDK 
issue|https://bugs.openjdk.java.net/browse/JDK-4828461], Java 6 appears to 
generate a header/footer that stops at 64K, and it doesn't bother reading that 
header when enumerating the zip file. Java 7 (presumably) handles reads the same 
way, but uses zip64 to generate the artifacts. Ant can be told not to generate 
zip64 files, but then it does zip16 properly, rejecting source filesets that are 
too large.

There isn't an obvious or immediate solution for this on Java 7+, except to extend 
Ant to generate the same hacked zip files and then wait for that to trickle into 
the maven ant-run plugin, which would be about 3+ months after ant 1.9.x ships. 
That's a long-term project, though something to consider starting now, to get 
the feature later in 2015.

 Build assembly JAR via ant to avoid zip64 problems
 --

 Key: SPARK-7009
 URL: https://issues.apache.org/jira/browse/SPARK-7009
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
 Environment: Java 7+
Reporter: Steve Loughran
   Original Estimate: 2h
  Remaining Estimate: 2h

 SPARK-1911 shows the problem: JDK7+ uses zip64 to build large JARs, a format 
 incompatible with (early) Java 6 and with pyspark.
 Provided the total number of .class files + resources is under 64K, ant can be 
 used to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
 and then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status

2015-04-20 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-7007:
--

 Summary: Add metrics source for ExecutorAllocationManager to 
expose internal status
 Key: SPARK-7007
 URL: https://issues.apache.org/jira/browse/SPARK-7007
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Saisai Shao
Priority: Minor


Add a metrics source to expose the internal status of ExecutorAllocationManager, 
to better monitor executor allocation when running on YARN.
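For illustration, a minimal sketch of such a source built on Spark's metrics Source trait and Codahale gauges. The class name, constructor parameters, and gauge names are assumptions, not the eventual implementation; the internal counters are passed in as functions so the example stays independent of ExecutorAllocationManager's private fields:
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.metrics.source.Source

// Sketch only: register gauges so the metrics system can poll internal state.
private[spark] class ExecutorAllocationSource(
    targetExecutors: () => Int,
    pendingToRemove: () => Int) extends Source {

  override val sourceName: String = "ExecutorAllocationManager"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  private def gauge(name: String)(value: => Int): Unit =
    metricRegistry.register(MetricRegistry.name("executors", name),
      new Gauge[Int] { override def getValue: Int = value })

  gauge("numberTargetExecutors")(targetExecutors())
  gauge("numberExecutorsPendingToRemove")(pendingToRemove())
}
{code}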



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7010) How can i custom the external initialize when start the spark cluster

2015-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7010.
--
Resolution: Invalid

(Ask questions at u...@spark.apache.org)

 How can i custom the external initialize when start the spark cluster
 -

 Key: SPARK-7010
 URL: https://issues.apache.org/jira/browse/SPARK-7010
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jacky19820629

 How can I configure custom initialization when starting Spark, like caching a 
 table, creating a temporary table, etc.?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7006) Inconsistent behavior for ctrl-c in Spark shells

2015-04-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502612#comment-14502612
 ] 

Cheolsoo Park commented on SPARK-7006:
--

Thanks for asking about Ctrl-D. While a job is running, Ctrl-D doesn't seem to 
have any effect (i.e. no response), but after the job finishes, it 
terminates the shell.

Actually, FB Presto uses Ctrl-D to exit the shell and Ctrl-C to cancel the 
running job. A lot of users find this quite convenient.

 Inconsistent behavior for ctrl-c in Spark shells
 

 Key: SPARK-7006
 URL: https://issues.apache.org/jira/browse/SPARK-7006
 Project: Spark
  Issue Type: Wish
  Components: Spark Shell, YARN
Affects Versions: 1.3.1
 Environment: YARN
Reporter: Cheolsoo Park
Priority: Minor
  Labels: shell, yarn

 When ctrl-c is pressed in a shell, the behavior is not consistent across 
 spark-sql, spark-shell, and pyspark, which confuses users. Here is 
 the summary:
 ||shell||after ctrl-c||
 |spark-sql|cancels the running job|
 |spark-shell|exits the shell|
 |pyspark|throws an error \[1\] and doesn't cancel the job|
 pyspark is the worst of the three because it gives the wrong impression that the 
 job is cancelled when it is not.
 Ideally, every shell should act like {{spark-sql}}, because that lets users 
 cancel the running job while staying in the shell. (Pressing ctrl-c twice exits 
 the shell.) 
 \[1\] pyspark error for ctrl-c
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/home/cheolsoop/spark/jars/spark-1.3.1/python/pyspark/sql/dataframe.py", line 284, in count
     return self._jdf.count()
   File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 536, in __call__
   File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 364, in send_command
   File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 473, in send_command
   File "/usr/lib/python2.7/socket.py", line 430, in readline
     data = recv(1)
 KeyboardInterrupt
 {code}
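 For illustration, a Scala sketch of the spark-sql-like behaviour asked for above. This is not what spark-shell currently does; the use of sun.misc.Signal and the helper name are assumptions:
 {code}
 import sun.misc.{Signal, SignalHandler}
 import org.apache.spark.SparkContext

 // Sketch only: trap Ctrl-C and cancel running jobs instead of exiting the shell.
 def installInterruptHandler(sc: SparkContext): Unit =
   Signal.handle(new Signal("INT"), new SignalHandler {
     override def handle(sig: Signal): Unit = {
       // Cancel all active jobs but keep the shell session alive.
       sc.cancelAllJobs()
     }
   })
 {code}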



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502616#comment-14502616
 ] 

Guoqiang Li edited comment on SPARK-7008 at 4/20/15 10:34 AM:
--

Here's a GraphX-based implementation (WIP): 
https://github.com/witgo/zen/tree/FactorizationMachine


was (Author: gq):
Here's a graphx-based implementation: 
https://github.com/witgo/zen/tree/FactorizationMachine

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines have worked well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
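 For reference, the second-order factorization machine model described in the Rendle (2010) paper cited above, with global bias w_0, linear weights w_i, and latent factor vectors v_i:
 {code}
 \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
            + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
 {code}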



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.

2015-04-20 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-7011:
--

 Summary: Build fails with scala 2.11 option, because a 
protected[sql] type is accessed in ml package.
 Key: SPARK-7011
 URL: https://issues.apache.org/jira/browse/SPARK-7011
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma


I am not sure why this does not fail when building with Scala 2.10; looks like 
a Scala bug?
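For illustration, a minimal example of the access pattern the title describes (not the actual offending Spark code; a method is used here for brevity): a protected[sql] member is visible inside org.apache.spark.sql but not from org.apache.spark.ml, so ml code that touches it should not compile.
{code}
// Sketch only, not the real Spark sources.
package org.apache.spark.sql {
  class Exposed {
    protected[sql] def internal: Int = 42   // accessible within the sql package only
  }
}

package org.apache.spark.ml {
  object Consumer {
    // new org.apache.spark.sql.Exposed().internal   // error: not accessible from ml
  }
}
{code}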



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7005) resetProb error in pagerank

2015-04-20 Thread lisendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lisendong updated SPARK-7005:
-
Comment: was deleted

(was: oh...you are right...
I'm so sorry, the result is exactly being scaled by N...
)

 resetProb error in pagerank
 ---

 Key: SPARK-7005
 URL: https://issues.apache.org/jira/browse/SPARK-7005
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: lisendong
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 In the PageRank code, the resetProb should be divided by the number of vertices, 
 according to Wikipedia:
 http://en.wikipedia.org/wiki/PageRank
 that is:
 PR[i] = alpha / N + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
 but the code (org.apache.spark.graphx.lib.PageRank) computes
 PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
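 For reference, the two variants from the description written out, where alpha is the reset probability and N the number of vertices; the report is that the code uses the second form, without the division by N:
 {code}
 \text{Wikipedia:}\quad PR(i) = \frac{\alpha}{N} + (1 - \alpha) \sum_{j \in \mathrm{inNbrs}(i)} \frac{PR(j)}{\mathrm{outDeg}(j)}
 \text{GraphX:}\quad   PR(i) = \alpha          + (1 - \alpha) \sum_{j \in \mathrm{inNbrs}(i)} \frac{PR(j)}{\mathrm{outDeg}(j)}
 {code}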



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502542#comment-14502542
 ] 

Apache Spark commented on SPARK-7007:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5589

 Add metrics source for ExecutorAllocationManager to expose internal status
 --

 Key: SPARK-7007
 URL: https://issues.apache.org/jira/browse/SPARK-7007
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Saisai Shao
Priority: Minor

 Add a metrics source to expose the internal status of ExecutorAllocationManager, 
 to better monitor executor allocation when running on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7007:
---

Assignee: (was: Apache Spark)

 Add metrics source for ExecutorAllocationManager to expose internal status
 --

 Key: SPARK-7007
 URL: https://issues.apache.org/jira/browse/SPARK-7007
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Saisai Shao
Priority: Minor

 Add a metrics source to expose the internal status of ExecutorAllocationManager, 
 to better monitor executor allocation when running on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1911) Warn users if their assembly jars are not built with Java 6

2015-04-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502589#comment-14502589
 ] 

Steve Loughran commented on SPARK-1911:
---

This doesn't fix the problem, merely documents it.

It should be doable by using Ant's zip task, which doesn't use the JDK zip 
routines. The assembly would be unzipped first, then rezipped with the zip64 
option set to never.

see [https://ant.apache.org/manual/Tasks/zip.html]



 Warn users if their assembly jars are not built with Java 6
 ---

 Key: SPARK-1911
 URL: https://issues.apache.org/jira/browse/SPARK-1911
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Sean Owen
 Fix For: 1.2.2, 1.3.0


 The root cause of the problem is detailed in: 
 https://issues.apache.org/jira/browse/SPARK-1520.
 In short, an assembly jar built with Java 7+ is not always accessible by 
 Python or other versions of Java (especially Java 6). If the assembly jar is 
 not built on the cluster itself, this problem may manifest itself in strange 
 exceptions that are not trivial to debug. This is an issue especially for 
 PySpark on YARN, which relies on the python files included within the 
 assembly jar.
 Currently we warn users only in make-distribution.sh, but most users build 
 the jars directly. At the very least we need to emphasize this in the docs 
 (currently missing entirely). The next step is to add a warning prompt in the 
 mvn scripts whenever Java 7+ is detected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7008) An Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Description: 
An implement of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


  was:
An implementation of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


Summary: An Implement of Factorization Machine (LibFM)  (was: Implement 
of Factorization Machine (LibFM))

 An Implement of Factorization Machine (LibFM)
 -

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines have worked well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7008:
---

Assignee: Apache Spark

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
Assignee: Apache Spark
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines have worked well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502644#comment-14502644
 ] 

Apache Spark commented on SPARK-7008:
-

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5591

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines have worked well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


