[jira] [Commented] (SPARK-6844) Memory leak occurs when register temp table with cache table on

2015-04-16 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497627#comment-14497627
 ] 

Jack Hu commented on SPARK-6844:


Hi, [~marmbrus]

Do we have a plan to port this to the 1.3.x branch? 


 Memory leak occurs when register temp table with cache table on
 ---

 Key: SPARK-6844
 URL: https://issues.apache.org/jira/browse/SPARK-6844
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jack Hu
  Labels: Memory, SQL
 Fix For: 1.4.0


 There is a memory leak when registering a temp table with caching on.
 This is simple code to reproduce the issue:
 {code}
 val sparkConf = new SparkConf().setAppName("LeakTest")
 val sparkContext = new SparkContext(sparkConf)
 val sqlContext = new SQLContext(sparkContext)
 val tableName = "tmp"
 val jsonrdd = sparkContext.textFile("sample.json")
 var loopCount = 1L
 while (true) {
   sqlContext.jsonRDD(jsonrdd).registerTempTable(tableName)
   sqlContext.cacheTable(tableName)
   println("L: " + loopCount + " R: " + sqlContext.sql("select count(*) from tmp").count())
   sqlContext.uncacheTable(tableName)
   loopCount += 1
 }
 {code}
 The cause is that {{InMemoryRelation}} and {{InMemoryColumnarTableScan}} use 
 accumulators ({{InMemoryRelation.batchStats}}, {{InMemoryColumnarTableScan.readPartitions}}, 
 {{InMemoryColumnarTableScan.readBatches}}) to collect some information from the 
 partitions or for tests. These accumulators register themselves into a static 
 map in {{Accumulators.originals}} and never get cleaned up.
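 To make the failure mode concrete, here is a minimal, self-contained Scala sketch 
 (not Spark's actual internals; the {{Registry}} object is only an illustrative 
 stand-in for the static map mentioned above) showing how registering fresh 
 accumulators on every iteration grows an unbounded static map:
 {code}
 import scala.collection.mutable
 
 // Illustrative stand-in for a static accumulator registry such as
 // Accumulators.originals; the name Registry is hypothetical.
 object Registry {
   val originals = mutable.Map[Long, AnyRef]()
   private var nextId = 0L
   def register(acc: AnyRef): Long = {
     nextId += 1
     originals(nextId) = acc
     nextId
   }
 }
 
 // Each loop iteration stands in for cacheTable/uncacheTable: new
 // accumulator-like objects are registered, nothing is ever removed,
 // so the static map grows until the driver runs out of memory.
 for (i <- 1 to 5) {
   Registry.register(new Object) // batchStats
   Registry.register(new Object) // readPartitions
   Registry.register(new Object) // readBatches
   println(s"iteration $i, registry size = ${Registry.originals.size}")
 }
 {code}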



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6933) Thrift Server couldn't strip .inprogress suffix after being stopped

2015-04-16 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang closed SPARK-6933.
---
Resolution: Duplicate

 Thrift Server couldn't strip .inprogress suffix after being stopped
 ---

 Key: SPARK-6933
 URL: https://issues.apache.org/jira/browse/SPARK-6933
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tao Wang

 When I stop the Thrift server using stop-thriftserver.sh, the following 
 exception occurs:
 15/04/15 21:48:53 INFO Utils: path = 
 /tmp/spark-f05dd451-46a8-47d0-836b-a25004f87ed9/blockmgr-971f5b1c-33ed-4be6-ac63-2fbb739bc649,
  already present as root for deletion.
 15/04/15 21:48:53 ERROR LiveListenerBus: Listener EventLoggingListener threw 
 an exception
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
 at 
 org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
 at 
 org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
 at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 at 
 org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53)
 at 
 org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
 at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
 at 
 org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1197)
 at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 Caused by: java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985)
 at 
 org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946)
 at 
 org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
 ... 17 more
 15/04/15 21:48:53 INFO ThriftCLIService: Thrift server has stopped
 15/04/15 21:48:53 INFO AbstractService: Service:ThriftBinaryCLIService is 
 stopped.
 15/04/15 21:48:53 INFO HiveMetaStore: 1: Shutting down the object store...
 15/04/15 21:48:53 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=Shutting 
 down the object store...
 15/04/15 21:48:53 INFO HiveMetaStore: 1: Metastore shutdown complete.
 15/04/15 21:48:53 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=Metastore 
 shutdown complete.
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/metrics/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/static,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/executors/threadDump,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/executors/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/executors,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/environment/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/environment,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/storage/rdd,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/storage/json,null}
 15/04/15 21:48:53 INFO ContextHandler: stopped 
 o.s.j.s.ServletContextHandler{/storage,null}
 15/04/15 21:48:53 INFO ContextHandler: 

[jira] [Created] (SPARK-6959) Support for datetime comparisons in filter for dataframes in pyspark

2015-04-16 Thread cynepia (JIRA)
cynepia created SPARK-6959:
--

 Summary: Support for datetime comparisons in filter for 
dataframes in pyspark
 Key: SPARK-6959
 URL: https://issues.apache.org/jira/browse/SPARK-6959
 Project: Spark
  Issue Type: Improvement
Reporter: cynepia


Currently, strings can be passed to filter for comparison with a date-type 
column, but this does not address the case where dates may be passed in 
different formats. We should support datetime.date, or some other standard 
format, for comparison with date-type columns.

Currently,
df.filter(df.Datecol > datetime.date(2015, 1, 1)).show()
is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6893) Better handling of pipeline parameters in PySpark

2015-04-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6893.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5534
[https://github.com/apache/spark/pull/5534]

 Better handling of pipeline parameters in PySpark
 -

 Key: SPARK-6893
 URL: https://issues.apache.org/jira/browse/SPARK-6893
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.4.0


 This is SPARK-5957 for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6955:
-
 Priority: Minor  (was: Major)
Affects Version/s: (was: 1.4.0)

 Do not let Yarn Shuffle Server retry its server port.
 -

 Key: SPARK-6955
 URL: https://issues.apache.org/jira/browse/SPARK-6955
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, YARN
Reporter: SaintBacchus
Priority: Minor

  It's better to let the NodeManager go down than to retry on another port 
 when `spark.shuffle.service.port` conflicts while starting the Spark Yarn 
 Shuffle Server, because the retry mechanism makes the shuffle port 
 inconsistent and also makes clients fail to find the port.
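 A minimal Scala sketch of the proposed fail-fast behaviour (illustrative only, 
 not the actual Yarn shuffle service code): bind exactly the configured port and 
 abort on a conflict instead of retrying the next port.
 {code}
 import java.net.{BindException, InetSocketAddress, ServerSocket}
 
 // Bind only the configured shuffle port; on conflict, fail fast so the
 // NodeManager sees the error instead of the service silently moving to
 // another port that clients will never look up.
 def bindExactly(port: Int): ServerSocket = {
   val socket = new ServerSocket()
   try {
     socket.bind(new InetSocketAddress(port))
     socket
   } catch {
     case e: BindException =>
       socket.close()
       throw new IllegalStateException(
         s"spark.shuffle.service.port $port is already in use; refusing to retry", e)
   }
 }
 {code}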



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH

2015-04-16 Thread Weizhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-6869:

Description: 
From SPARK-1920 and SPARK-1520 we know that PySpark on YARN cannot work when the 
assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via 
YARN with --py-files. The pyspark archive name must contain spark-pyspark.

1st: zip pyspark to spark-pyspark_2.10.zip
2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files 
spark-pyspark_2.10.zip app.py args

  was:From SPARK-1920 and SPARK-1520 we know that PySpark on YARN cannot work when 
the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in 
spark-env.sh) to the executor so that the executor Python process can read pyspark 
files from the local file system rather than from the assembly jar.

Summary: Add pyspark archives path to PYTHONPATH  (was: Pass PYTHONPATH 
to executor, so that executor can read pyspark file from local file system on 
executor node)

 Add pyspark archives path to PYTHONPATH
 ---

 Key: SPARK-6869
 URL: https://issues.apache.org/jira/browse/SPARK-6869
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Weizhong
Priority: Minor

 From SPARK-1920 and SPARK-1520 we know that PySpark on YARN cannot work when the 
 assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors 
 via YARN with --py-files. The pyspark archive name must contain 
 spark-pyspark.
 1st: zip pyspark to spark-pyspark_2.10.zip
 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files 
 spark-pyspark_2.10.zip app.py args



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4783:
-
Priority: Minor  (was: Major)

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.4.0


 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running SparkContexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that SparkContext frequently calls System.exit when it 
 encounters a problem, which means the server can only be re-spawned at the 
 process level; that is much messier than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 SparkContext with the throwing of a fatal error. 
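 A small Scala sketch of the gateway loop above, assuming a hypothetical 
 {{GatewayServer}} type and assuming SparkContext surfaces failures as exceptions 
 instead of calling System.exit:
 {code}
 import scala.util.control.NonFatal
 
 // Hypothetical gateway server wrapping one or more long-running SparkContexts.
 trait GatewayServer { def run(): Unit }
 
 def logAndExamineError(e: Throwable): Boolean = {
   Console.err.println(s"Spark job failed: ${e.getMessage}")
   true // illustrative policy: keep the gateway alive; a real policy would inspect e
 }
 
 def serve(server: GatewayServer): Unit = {
   var keepRunning = true
   while (keepRunning) {
     try {
       server.run() // blocks while handling Spark jobs
     } catch {
       // If SparkContext throws instead of exiting the JVM, the gateway can
       // log the failure and decide whether to keep serving.
       case NonFatal(e) => keepRunning = logAndExamineError(e)
     }
   }
 }
 {code}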



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4783.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5492
[https://github.com/apache/spark/pull/5492]

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria
 Fix For: 1.4.0


 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running SparkContexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that SparkContext frequently calls System.exit when it 
 encounters a problem, which means the server can only be re-spawned at the 
 process level; that is much messier than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 SparkContext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-4783:


Assignee: Sean Owen

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria
Assignee: Sean Owen
 Fix For: 1.4.0


 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running SparkContexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that SparkContext frequently calls System.exit when it 
 encounters a problem, which means the server can only be re-spawned at the 
 process level; that is much messier than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 SparkContext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6528) IDF transformer

2015-04-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6528:
-
Assignee: Xusen Yin

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Assignee: Xusen Yin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4194:
-
 Target Version/s:   (was: 1.2.0)
Affects Version/s: 1.2.0
   1.3.0
 Assignee: Marcelo Vanzin

 Exceptions thrown during SparkContext or SparkEnv construction might lead to 
 resource leaks or corrupted global state
 -

 Key: SPARK-4194
 URL: https://issues.apache.org/jira/browse/SPARK-4194
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Marcelo Vanzin
Priority: Critical
 Fix For: 1.4.0


 The SparkContext and SparkEnv constructors instantiate a bunch of objects 
 that may need to be cleaned up after they're no longer needed.  If an 
 exception is thrown during SparkContext or SparkEnv construction (e.g. due to 
 a bad configuration setting), then objects created earlier in the constructor 
 may not be properly cleaned up.
 This is unlikely to cause problems for batch jobs submitted through 
 {{spark-submit}}, since failure to construct SparkContext will probably cause 
 the JVM to exit, but it is a potentially serious issue in interactive 
 environments where a user might attempt to create SparkContext with some 
 configuration, fail due to an error, and re-attempt the creation with new 
 settings.  In this case, resources from the previous creation attempt might 
 not have been cleaned up and could lead to confusing errors (especially if 
 the old, leaked resources share global state with the new SparkContext).
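 A small Scala sketch (illustrative only, not Spark's actual fix) of one way to 
 avoid this class of leak: track each object as it is created in the constructor 
 and close everything created so far if a later step throws.
 {code}
 import scala.collection.mutable
 import scala.util.control.NonFatal
 
 // FakeResource stands in for the listener bus, block manager, etc. that the
 // real constructors create; the names here are hypothetical.
 class FakeResource(name: String) extends AutoCloseable {
   override def close(): Unit = println(s"closed $name")
 }
 
 class Env(conf: Map[String, String]) {
   private val created = mutable.ListBuffer.empty[AutoCloseable]
   private def track[T <: AutoCloseable](r: T): T = { created += r; r }
 
   try {
     val listenerBus  = track(new FakeResource("listenerBus"))
     val blockManager = track(new FakeResource("blockManager"))
     if (!conf.contains("spark.master")) {
       throw new IllegalArgumentException("A master URL must be set in your configuration")
     }
   } catch {
     case NonFatal(e) =>
       // Clean up everything constructed so far, newest first, then rethrow,
       // so a failed attempt leaves no leaked resources or global state behind.
       created.reverseIterator.foreach(r => try r.close() catch { case NonFatal(_) => () })
       throw e
   }
 }
 
 // new Env(Map.empty)  // closes blockManager and listenerBus, then rethrows
 {code}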



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4194.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5335
[https://github.com/apache/spark/pull/5335]

 Exceptions thrown during SparkContext or SparkEnv construction might lead to 
 resource leaks or corrupted global state
 -

 Key: SPARK-4194
 URL: https://issues.apache.org/jira/browse/SPARK-4194
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Priority: Critical
 Fix For: 1.4.0


 The SparkContext and SparkEnv constructors instantiate a bunch of objects 
 that may need to be cleaned up after they're no longer needed.  If an 
 exception is thrown during SparkContext or SparkEnv construction (e.g. due to 
 a bad configuration setting), then objects created earlier in the constructor 
 may not be properly cleaned up.
 This is unlikely to cause problems for batch jobs submitted through 
 {{spark-submit}}, since failure to construct SparkContext will probably cause 
 the JVM to exit, but it is a potentially serious issue in interactive 
 environments where a user might attempt to create SparkContext with some 
 configuration, fail due to an error, and re-attempt the creation with new 
 settings.  In this case, resources from the previous creation attempt might 
 not have been cleaned up and could lead to confusing errors (especially if 
 the old, leaked resources share global state with the new SparkContext).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table

2015-04-16 Thread pin_zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497701#comment-14497701
 ] 

pin_zhang commented on SPARK-6923:
--

In Spark 1.1.0, the client uses the JDBC API below to get the table schema
age (bigint), id (string),
while in Spark 1.3.0 it returns {name=col, type=array<string>}.
That's not expected.

ArrayList<Map<String, String>> results = new ArrayList<Map<String, String>>();
DatabaseMetaData meta = cnn.getMetaData();
ResultSet rsColumns = meta.getColumns(database, null, table, null);
while (rsColumns.next()) {
    Map<String, String> col = new HashMap<String, String>();
    col.put("name", rsColumns.getString("COLUMN_NAME"));
    String typeName = rsColumns.getString("TYPE_NAME");
    col.put("type", typeName);
    results.add(col);
}
rsColumns.close();


 Get invalid hive table columns after save DataFrame to hive table
 -

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang

 HiveContext hctx = new HiveContext(sc);
 List<String> sample = new ArrayList<String>();
 sample.add("{\"id\": \"id_1\", \"age\": 1}");
 RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
 DataFrame df = hctx.jsonRDD(sampleRDD);
 String table = "test";
 df.saveAsTable(table, "json", SaveMode.Overwrite);
 Table t = hctx.catalog().client().getTable(table);
 System.out.println(t.getCols());
 --
 With the code above saving a DataFrame to a Hive table,
 getting the table cols returns one column named 'col':
 [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
 The expected return is the field schema id, age.
 As a result, the JDBC API cannot retrieve the table columns via ResultSet 
 DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
 tableNamePattern, String columnNamePattern),
 but the ResultSet metadata for the query "select * from test" contains the 
 fields id and age.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497876#comment-14497876
 ] 

Yu Ishikawa commented on SPARK-6935:


[~oleksii.mdr] Thank you for replying. I guess this issue hasn't been resolved 
yet, although I read the script on the master branch.
I understand you are using the 1.2 release. I mean, I would like to confirm whether 
the option names you suggested are wrong or not. The script at 1.2.0 has 
`\-\-master-instance-type` too.

https://github.com/apache/spark/blob/v1.2.1/ec2/spark_ec2.py#L80

Alright. This is a small problem. I understand what you meant. Thank you for 
your suggestion.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=<instance_type>    # applies to slaves only
              --instance-type-master=<instance_type>   # applies to master only
              --instance-type=<instance_type>          # default, applies to both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497852#comment-14497852
 ] 

Yu Ishikawa commented on SPARK-6935:


That's reasonable. Spark v1.3.0 has the `\-\-master-instance-type` option, not 
`\-\-instance-type-master`. Do you mean that we should add other new options, 
deprecating the current option?

And I feel like refactoring the script, because it has many functions with long 
lines, so the source code is a little hard to maintain. Just a comment.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=<instance_type>    # applies to slaves only
              --instance-type-master=<instance_type>   # applies to master only
              --instance-type=<instance_type>          # default, applies to both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5886) Add StringIndexer

2015-04-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5886:
-
Description: 
`StringIndexer` takes a column of string labels (raw categories) and outputs an 
integer column with labels indexed by their frequency.

{code}
val li = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}

In the output column, we should store the label to index map as an ML 
attribute. The index should be ordered by frequency, where the most frequent 
label gets index 0, to enhance sparsity.

We can discuss whether this should index multiple columns at the same time.
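
As a plain-Scala illustration of the frequency ordering described above (independent 
of the pipeline API; the data here is made up), the label-to-index map could be 
computed like this:
{code}
// Most frequent label gets index 0; ties are broken arbitrarily here.
val labels = Seq("us", "uk", "us", "fr", "us", "uk")
val labelToIndex: Map[String, Int] = labels
  .groupBy(identity)
  .toSeq
  .sortBy { case (_, occurrences) => -occurrences.size }
  .map { case (label, _) => label }
  .zipWithIndex
  .toMap
// labelToIndex == Map("us" -> 0, "uk" -> 1, "fr" -> 2)
{code}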

  was:
`LabelIndexer` takes a column of labels (raw categories) and outputs an integer 
column with labels indexed by their frequency.

{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}

In the output column, we should store the label to index map as an ML 
attribute. The index should be ordered by frequency, where the most frequent 
label gets index 0, to enhance sparsity.

We can discuss whether this should index multiple columns at the same time.


 Add StringIndexer
 -

 Key: SPARK-5886
 URL: https://issues.apache.org/jira/browse/SPARK-5886
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.4.0


 `StringIndexer` takes a column of string labels (raw categories) and outputs 
 an integer column with labels indexed by their frequency.
 {code}
 val li = new StringIndexer()
   .setInputCol("country")
   .setOutputCol("countryIndex")
 {code}
 In the output column, we should store the label to index map as an ML 
 attribute. The index should be ordered by frequency, where the most frequent 
 label gets index 0, to enhance sparsity.
 We can discuss whether this should index multiple columns at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5886) Add StringIndexer

2015-04-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5886:
-
Summary: Add StringIndexer  (was: Add LabelIndexer)

 Add StringIndexer
 -

 Key: SPARK-5886
 URL: https://issues.apache.org/jira/browse/SPARK-5886
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.4.0


 `LabelIndexer` takes a column of labels (raw categories) and outputs an 
 integer column with labels indexed by their frequency.
 {code}
 val li = new LabelIndexer()
   .setInputCol("country")
   .setOutputCol("countryIndex")
 {code}
 In the output column, we should store the label to index map as an ML 
 attribute. The index should be ordered by frequency, where the most frequent 
 label gets index 0, to enhance sparsity.
 We can discuss whether this should index multiple columns at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server

2015-04-16 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497840#comment-14497840
 ] 

uncleGen edited comment on SPARK-6932 at 4/16/15 9:38 AM:
--

There are two ways to use a Parameter Server: one is an independent, standalone 
Parameter Server service, and the other is to integrate it into Spark. If we 
adopt the first one, we may need to maintain a long-running Parameter Server 
cluster, and Spark can access the Parameter Server seamlessly through some 
configuration without breaking any code in Spark core. IMHO, that is not an 
efficient way. For us, big model training is just one kind of job. We want to 
run a Spark PS job just like other Spark jobs, and launch a Parameter Server 
cluster within the job dynamically. So, if we want to use it dynamically, we 
may need to modify the current Spark core more or less. Well, as [~chouqin] 
said, it is just a prototype; many more problems need to be resolved. 


was (Author: unclegen):
There are two ways to use a Parameter Server: one is an independent, standalone 
Parameter Server service, and the other is for it to be integrated into Spark. 
If we adopt the first one, we may need to maintain a long-running Parameter Server 
cluster, and Spark can access the Parameter Server seamlessly through some 
configuration without breaking any code in Spark core. IMHO, that is not an 
efficient way. For us, big model training is just one kind of job. We want to 
run a Spark PS job just like other Spark jobs, and launch a Parameter Server 
cluster within the job dynamically. So, if we want to use it dynamically, we 
may need to modify the current Spark core more or less. Well, as [~chouqin] 
said, it is just a prototype; many more problems need to be resolved. 

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and 
 the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   A careful investigation was done of most existing Parameter Server 
 systems (including [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], and 
 [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
 designed by absorbing the essence of all these systems. 
 * *Prototype of distributed array*
   IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the number 
 of key updates per second is not very high. 
   So we implement a distributed HashMap to store the parameters, which can 
 be easily extended to get better performance.
 
 * *Minimal code change*
   Quite a lot of effort is made to avoid code changes in Spark core. Tasks 
 which need the parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with the parameter server through a client object, 
 over *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. 
 Finally, each worker updates the parameters on the parameter server. A worker 
 communicates with the parameter server through a parameter server client, which 
 is initialized in the `TaskContext` of this worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add parameter indexed by `key` by `delta`, 
 // if multiple `delta` to update on the same parameter,
 // use `reduceFunc` 

[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-04-16 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497840#comment-14497840
 ] 

uncleGen commented on SPARK-6932:
-

There are two ways to use a Parameter Server: one is an independent, standalone 
Parameter Server service, and the other is for it to be integrated into Spark. 
If we adopt the first one, we may need to maintain a long-running Parameter Server 
cluster, and Spark can access the Parameter Server seamlessly through some 
configuration without breaking any code in Spark core. IMHO, that is not an 
efficient way. For us, big model training is just one kind of job. We want to 
run a Spark PS job just like other Spark jobs, and launch a Parameter Server 
cluster within the job dynamically. So, if we want to use it dynamically, we 
may need to modify the current Spark core more or less. Well, as [~chouqin] 
said, it is just a prototype; many more problems need to be resolved. 

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and 
 the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   A careful investigation was done of most existing Parameter Server 
 systems (including [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], and 
 [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
 designed by absorbing the essence of all these systems. 
 * *Prototype of distributed array*
   IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the number 
 of key updates per second is not very high. 
   So we implement a distributed HashMap to store the parameters, which can 
 be easily extended to get better performance.
 
 * *Minimal code change*
   Quite a lot of effort is made to avoid code changes in Spark core. Tasks 
 which need the parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with the parameter server through a client object, 
 over *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. 
 Finally, each worker updates the parameters on the parameter server. A worker 
 communicates with the parameter server through a parameter server client, which 
 is initialized in the `TaskContext` of this worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add `delta` to the parameter indexed by `key`;
 // if there are multiple `delta`s to apply to the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 // update multiple parameters at the same time, using the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit
 
 // advance clock to indicate that current iteration is finished.
 def clock(): Unit
  
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
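 The following is an illustrative (non-normative) worker-side sketch of how the 
 interface above could be used in one training iteration; the {{PSClient}} trait 
 stub and {{computeDelta}} are hypothetical placeholders that only make the 
 sketch self-contained:
 {code}
 // Stub of the proposed interface, repeated here only so the sketch compiles.
 trait PSClient {
   def multiGet[T](keys: Array[String]): Array[T]
   def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit
   def clock(): Unit
   def sync(): Unit
 }
 
 // One iteration on a worker: pull parameters, compute local deltas on this
 // partition's data, push them back, then advance the clock and synchronize.
 def runIteration[T](client: PSClient,
                     keys: Array[String],
                     computeDelta: Array[T] => Array[T],
                     merge: (T, T) => T): Unit = {
   val params = client.multiGet[T](keys)
   val deltas = computeDelta(params)
   client.multiUpdate(keys, deltas, merge)
   client.clock()
   client.sync()
 }
 {code}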
 *`PSContext` provides the following functions to use on the driver:*
 {code}
 // load parameters from existing rdd.
 def loadPSModel[T](model: RDD[(String, T)]) 
 // fetch parameters from parameter server to construct model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been added to `RDD` to run parameter server tasks:*
 {code}
 // run the provided 

[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server

2015-04-16 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497840#comment-14497840
 ] 

uncleGen edited comment on SPARK-6932 at 4/16/15 9:39 AM:
--

There are two ways to use a Parameter Server: one is an independent, standalone 
Parameter Server service, and the other is to integrate it into Spark. If we 
adopt the first one, we may need to maintain a long-running Parameter Server 
cluster, and Spark can access the Parameter Server seamlessly through some 
configurations without breaking any code in Spark core. IMHO, that is not an 
efficient way. For us, big model training is just one kind of job. We want to 
run a Spark PS job just like other Spark jobs, and launch a Parameter Server 
cluster within the job dynamically. So, if we want to use it dynamically, we 
may need to modify the current Spark core more or less. Well, as [~chouqin] 
said, it is just a prototype; many more problems need to be resolved. 


was (Author: unclegen):
There are two ways to use a Parameter Server: one is an independent, standalone 
Parameter Server service, and the other is to integrate it into Spark. If we 
adopt the first one, we may need to maintain a long-running Parameter Server 
cluster, and Spark can access the Parameter Server seamlessly through some 
configuration without breaking any code in Spark core. IMHO, that is not an 
efficient way. For us, big model training is just one kind of job. We want to 
run a Spark PS job just like other Spark jobs, and launch a Parameter Server 
cluster within the job dynamically. So, if we want to use it dynamically, we 
may need to modify the current Spark core more or less. Well, as [~chouqin] 
said, it is just a prototype; many more problems need to be resolved. 

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and 
 the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   A careful investigation was done of most existing Parameter Server 
 systems (including [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], and 
 [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
 designed by absorbing the essence of all these systems. 
 * *Prototype of distributed array*
   IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the number 
 of key updates per second is not very high. 
   So we implement a distributed HashMap to store the parameters, which can 
 be easily extended to get better performance.
 
 * *Minimal code change*
   Quite a lot of effort is made to avoid code changes in Spark core. Tasks 
 which need the parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with the parameter server through a client object, 
 over *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. 
 Finally, each worker updates the parameters on the parameter server. A worker 
 communicates with the parameter server through a parameter server client, which 
 is initialized in the `TaskContext` of this worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add parameter indexed by `key` by `delta`, 
 // if multiple `delta` to update on the same parameter,
 // use `reduceFunc` 

[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Oleksii Mandrychenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497861#comment-14497861
 ] 

Oleksii Mandrychenko commented on SPARK-6935:
-

I didn't know there was a `--master-instance-type` option in the 1.3 release. I 
am still using 1.2 release, which doesn't have it.

I guess if it is present then this ticket can be closed as fixed in 1.3.

Refactoring should probably be another ticket IMO

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=<instance_type>    # applies to slaves only
              --instance-type-master=<instance_type>   # applies to master only
              --instance-type=<instance_type>          # default, applies to both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6935.
--
Resolution: Not A Problem

Ah right, it is already there as --master-instance-type

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=<instance_type>    # applies to slaves only
              --instance-type-master=<instance_type>   # applies to master only
              --instance-type=<instance_type>          # default, applies to both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Oleksii Mandrychenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497885#comment-14497885
 ] 

Oleksii Mandrychenko commented on SPARK-6935:
-

Oh... It's actually there in 1.2 release as well. I did not notice it, because 
documentation is missing on this point

https://spark.apache.org/docs/1.2.1/ec2-scripts.html

Perhaps somebody can go through the available options in the script and double 
check that documentation has matching entries.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=<instance_type>    # applies to slaves only
              --instance-type-master=<instance_type>   # applies to master only
              --instance-type=<instance_type>          # default, applies to both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497887#comment-14497887
 ] 

Apache Spark commented on SPARK-6635:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5541

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame that already has a column of the 
 same name, then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1, 2, 3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
   at $iwC$$iwC$$iwC.<init>(<console>:39)
   at $iwC$$iwC.<init>(<console>:41)
   at $iwC.<init>(<console>:43)
   at <init>(<console>:45)
   at .<init>(<console>:49)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at 

[jira] [Assigned] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6635:
---

Assignee: Apache Spark

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame that already has a column of the 
 same name, then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1, 2, 3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
   at $iwC$$iwC$$iwC.<init>(<console>:39)
   at $iwC$$iwC.<init>(<console>:41)
   at $iwC.<init>(<console>:43)
   at <init>(<console>:45)
   at .<init>(<console>:49)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
   at 

[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497966#comment-14497966
 ] 

Sean Owen commented on SPARK-6935:
--

[~yuu.ishik...@gmail.com] What do you want to work on? It looks like this 
option already exists.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497989#comment-14497989
 ] 

Sean Owen commented on SPARK-6935:
--

I don't know if it's that important to update the 1.2 docs since there may be 
no additional 1.2.x release after 1.2.2 coming out ... today. It's also in the 
usage message already.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6635:
---

Assignee: (was: Apache Spark)

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
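
Until the replace-on-add behaviour proposed above is implemented, one way to avoid the ambiguity is simply to never create the second column under the same name. A minimal workaround sketch, assuming the Spark 1.3 DataFrame API and the {{df}} from the reproduction above (the column name {{x2}} is just an illustrative choice):

{code}
// Hedged workaround sketch: compute the new column under a distinct alias so the
// resulting DataFrame never contains two columns named "x".
val df2 = df.select(df("x"), (df("x") + 1).as("x2"))
// df2: org.apache.spark.sql.DataFrame = [x: int, x2: int] -- no ambiguous reference
{code}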
 

[jira] [Commented] (SPARK-2734) DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497913#comment-14497913
 ] 

Arush Kharbanda commented on SPARK-2734:


This issue is occurring for me; I am using Spark 1.3.0. 

 DROP TABLE should also uncache table
 

 Key: SPARK-2734
 URL: https://issues.apache.org/jira/browse/SPARK-2734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.1.0


 Steps to reproduce:
 {code}
 hql("CREATE TABLE test(a INT)")
 hql("CACHE TABLE test")
 hql("DROP TABLE test")
 hql("SELECT * FROM test")
 {code}
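
For Spark versions without the fix (per this ticket it landed in 1.1.0), a hedged workaround sketch is to uncache explicitly before dropping; whether the UNCACHE TABLE statement or the context's uncacheTable method is available in a given 1.0.x build is an assumption here:

{code}
// Hedged workaround sketch for pre-fix versions: drop the cache entry yourself
// before dropping the table, so no stale cached plan is left behind.
hql("CACHE TABLE test")
// ... run queries against the cached table ...
hql("UNCACHE TABLE test")   // assumed available; alternatively uncacheTable("test") on the context
hql("DROP TABLE test")
{code}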



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3284) saveAsParquetFile not working on windows

2015-04-16 Thread Bogdan Niculescu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498019#comment-14498019
 ] 

Bogdan Niculescu edited comment on SPARK-3284 at 4/16/15 1:02 PM:
--

I get the same type of exception in Spark 1.3.0 under Windows when trying to 
save to a parquet file.
Here is my code :

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)

  dataFrame.saveAsParquetFile("test.parquet")
}
{code}

The exception that I'm seeing is :
{code}
Exception in thread main java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
...
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
at 
sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19)
at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)
{code}


was (Author: bogdannb):
I get the same type of exception in Spark 1.3.0 under Windows when trying to 
save to a parquet file.
Here is my code :

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)

  dataFrame.saveAsParquetFile("test.parquet")
}
{code}

The exception that I'm seeing is :
Exception in thread main java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
...
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
at 
sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19)
at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)

 saveAsParquetFile not working on windows
 

 Key: SPARK-3284
 URL: https://issues.apache.org/jira/browse/SPARK-3284
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.2
 Environment: Windows
Reporter: Pravesh Jain
Priority: Minor

 {code}
 object parquet {
   case class Person(name: String, age: Int)
   def main(args: Array[String]) {
 val sparkConf = new 
 SparkConf().setMaster("local").setAppName("HdfsWordCount")
 val sc = new SparkContext(sparkConf)
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
 import sqlContext.createSchemaRDD
 val people = 
 sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p
  => Person(p(0), p(1).trim.toInt))
 
 people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
 val parquetFile = 
 sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
   }
 }
 {code}
 gives the error
 Exception in thread "main" java.lang.NullPointerException at 
 org.apache.spark.parquet$.main(parquet.scala:16)
 which is the line calling saveAsParquetFile.
 This works fine on Linux, but running it from Eclipse on Windows gives the error.
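
The NullPointerException coming out of org.apache.hadoop.util.Shell on Windows is commonly caused by Hadoop not finding its native winutils.exe; treating that as the cause here is an assumption, not something confirmed in this ticket. A hedged sketch of the usual workaround (the C:/hadoop path is hypothetical and must contain bin/winutils.exe):

{code}
// Hedged sketch, assuming the usual missing-winutils cause on Windows:
// point Hadoop at a local directory containing bin\winutils.exe before any
// Spark/Hadoop code runs.
System.setProperty("hadoop.home.dir", "C:/hadoop")   // hypothetical path
val sc = new org.apache.spark.SparkContext(
  new org.apache.spark.SparkConf().setMaster("local").setAppName("ParquetTest"))
{code}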



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3937) ç

2015-04-16 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-3937:

Summary: ç  (was: Unsafe memory access inside of Snappy library)

 ç
 -

 Key: SPARK-3937
 URL: https://issues.apache.org/jira/browse/SPARK-3937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Patrick Wendell

 This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
 have much information about this other than the stack trace. However, it was 
 concerning enough I figured I should post it.
 {code}
 java.lang.InternalError: a fault occurred in a recent unsafe memory access 
 operation in compiled Java code
 org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
 org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
 org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
 
 org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
 
 org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
 org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
 
 java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
 
 java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
 
 java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497882#comment-14497882
 ] 

Yu Ishikawa edited comment on SPARK-6935 at 4/16/15 10:38 AM:
--

[~sowen]

I would like to work this issue. Please assign me to it?
However, I don't know how to test `ec2/spark_ec2.py`. If you know the way, 
would you tell me that?


was (Author: yuu.ishik...@gmail.com):
[~sowen]

I would like to work this issue. Please assign me to it?
However, I don't know how to test `ec2/spark_ec2.py`. If do you know the way, 
would you tell me that?

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library

2015-04-16 Thread Taro L. Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497899#comment-14497899
 ] 

Taro L. Saito commented on SPARK-3937:
--

I think this error occurs when a wrong memory position or corrupted data is read 
by snappy. I would like to check the binary data read by SnappyInputStream.
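
One hedged way to narrow this down (a diagnostic assumption, not something proposed in the ticket) is to switch the compression codec away from Snappy and see whether the fault disappears or turns into a cleaner decompression error; spark.io.compression.codec is the standard Spark 1.x setting and lzf is one of the built-in alternatives:

{code}
// Hedged diagnostic sketch: take Snappy out of the picture by switching the
// compression codec, then re-run the failing job. App name is hypothetical.
val conf = new org.apache.spark.SparkConf()
  .setAppName("snappy-fault-repro")
  .set("spark.io.compression.codec", "lzf")   // alternatives: lz4, snappy (the default)
val sc = new org.apache.spark.SparkContext(conf)
{code}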

 Unsafe memory access inside of Snappy library
 -

 Key: SPARK-3937
 URL: https://issues.apache.org/jira/browse/SPARK-3937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Patrick Wendell

 This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
 have much information about this other than the stack trace. However, it was 
 concerning enough I figured I should post it.
 {code}
 java.lang.InternalError: a fault occurred in a recent unsafe memory access 
 operation in compiled Java code
 org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
 org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
 org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
 
 org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
 
 org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
 org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
 
 java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
 
 java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
 
 java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497987#comment-14497987
 ] 

Yu Ishikawa commented on SPARK-6935:


Oh sorry, I misunderstood: I thought we should add a --slave-instance-type option instead 
of --instances-type. But, as Oleksii said, I think we should modify the 
documentation at v1.2.x. That should be another ticket. Thanks

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6932) A Prototype of Parameter Server

2015-04-16 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-6932:

Component/s: Spark Core

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, Spark Core
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590],it would be 
 very helpful to integrate parameter server into Spark for machine learning 
 algorithms, especially for those with ultra high dimensions features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing],and
  the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we proposed a prototype of Parameter Server on Spark(Ps-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   Careful investigation is done to most existing Parameter Server 
 systems(including:  [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], 
 [paracel|https://github.com/douban/paracel]) and a user friendly interface is 
 design by absorbing essence from all these system. 
 * *Prototype of distributed array*
 IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for a distributed array, because in most cases the #key 
 updates/second is not very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended to get better performance.
 
 * *Minimal code change*
   Quite a lot of effort is put in to avoid code changes to Spark core. Tasks 
 which need parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with parameter server with a client object, 
 through *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from parameter server then computes 
 new parameters based on old parameters and data in the partition. Finally 
 each worker updates parameters to the parameter server. A worker communicates with 
 the parameter server through a parameter server client, which is initialized in 
 the `TaskContext` of this worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We refer to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add parameter indexed by `key` by `delta`, 
 // if multiple `delta` to update on the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 // update multiple parameters at the same time, use the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit
 
 // advance clock to indicate that current iteration is finished.
 def clock(): Unit
  
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
 *`PSContext` provides following functions to use on driver:*
 {code}
 // load parameters from existing rdd.
 def loadPSModel[T](model: RDD[(String, T)]) 
 // fetch parameters from parameter server to construct model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been add to `RDD` to run parameter server tasks:*
 {code}
 // run the provided `func` on each partition of this RDD. 
 // This function can use data of this partition(the first argument) 
 // and a parameter server client(the second argument). 
 // See the following Logistic Regression for an example.
 def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]

 {code}
 h2. Example
 Here is an example of using our prototype to implement logistic regression:
 {code:title=LogisticRegression.scala|borderStyle=solid}
 def train(
 sc: SparkContext,
 input: RDD[LabeledPoint],
 numIterations: Int,
 stepSize: Double,
 miniBatchFraction: Double): LogisticRegressionModel = {
 
 // initialize weights
 val numFeatures = input.map(_.features.size).first()
 val initialWeights = new Array[Double](numFeatures)

[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497882#comment-14497882
 ] 

Yu Ishikawa commented on SPARK-6935:


[~sowen]

I would like to work this issue. Please assign me to it?
However, I don't know how to test `ec2/spark_ec2.py`. If do you know the way, 
would you tell me that?

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server

2015-04-16 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497840#comment-14497840
 ] 

uncleGen edited comment on SPARK-6932 at 4/16/15 12:56 PM:
---

There are two ways to use Parameter Server, one is an independent and 
standalone Parameter Server Service, and the another one is to integrate it 
into Spark. If we adopt the first one, we may need to maintain a long-running 
Parameter Server cluster, and Spark can access the Parameter Server 
seamlessly through some configurations without any code-broken in Spark core. 
IMHO, it is not an efficient way. For us, big model training is just one kind 
of job. We want to run a Spark PS job just like other spark jobs, and launch  
a Parameter Server Cluster in job dynamically. So, if we want to use it 
dynamically,  we may modify current spark core more or less. Well, just as 
[~chouqin] said, it is just a prototype, much more problems need to be 
resolved. 


was (Author: unclegen):
There are two ways to use Parameter Server, one is an independent and 
standalone Parameter Server Service, and the another one is to integrate it 
into Spark. If we adopt the first one, we may need to maintain a long-running 
Parameter Server cluster, and Spark can access the Parameter Server 
seamlessly through some configurations without any code-broken in Spark core. 
IMHO, it is not an efficient way. For us, big model training is just one kind 
of job. We want to run a Spark PS job just like other spark jobs, and launch  
a Parameter Server Cluster in job dynamically. So, if we want to use it 
dynamically,  we may modify current spark core more or less. Well, just as 
[~chouqin], it is just a prototype, much more problems need to be resolved. 

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590],it would be 
 very helpful to integrate parameter server into Spark for machine learning 
 algorithms, especially for those with ultra high dimensions features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing],and
  the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we proposed a prototype of Parameter Server on Spark(Ps-on-Spark), with 
 several key design concerns:
 * *User friendly interface*
   Careful investigation is done to most existing Parameter Server 
 systems(including:  [petuum|http://petuum.github.io], [parameter 
 server|http://parameterserver.org], 
 [paracel|https://github.com/douban/paracel]) and a user friendly interface is 
 design by absorbing essence from all these system. 
 * *Prototype of distributed array*
 IndexRDD (see 
 [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
 to be a good option for distributed array, because in most case, the #key 
 updates/second is not be very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended to get better performance.
 
 * *Minimal code change*
   Quite a lot of effort in done to avoid code change of Spark core. Tasks 
 which need parameter server are still created and scheduled by Spark's 
 scheduler. Tasks communicate with parameter server with a client object, 
 through *akka* or *netty*.
 With all these concerns we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in RDD and is partitioned across workers. During each 
 iteration, each worker gets parameters from parameter server then computes 
 new parameters based on old parameters and data in the partition. Finally 
 each worker updates parameters to parameter server.Worker communicates with 
 parameter server through a parameter server client,which is initialized in 
 `TaskContext` of this worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplanted it to other modes. 
 h3. Interface
 We refer to existing parameter server systems(petuum, parameter server, 
 paracel) when design the interface of parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 //  get parameter indexed by key from parameter server
 def get[T](key: String): T
 // get multiple parameters from parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 // add parameter indexed by `key` by `delta`, 
 // if multiple `delta` to update on the same parameter,
 // use 

[jira] [Updated] (SPARK-3937) Unsafe memory access inside of Snappy library

2015-04-16 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-3937:

Summary: Unsafe memory access inside of Snappy library  (was: ç)

 Unsafe memory access inside of Snappy library
 -

 Key: SPARK-3937
 URL: https://issues.apache.org/jira/browse/SPARK-3937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Patrick Wendell

 This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
 have much information about this other than the stack trace. However, it was 
 concerning enough I figured I should post it.
 {code}
 java.lang.InternalError: a fault occurred in a recent unsafe memory access 
 operation in compiled Java code
 org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
 org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
 org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
 
 org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
 
 org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
 org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
 
 java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
 
 java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
 
 java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves

2015-04-16 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498001#comment-14498001
 ] 

Yu Ishikawa commented on SPARK-6935:


I got it. I agree. Thank you for letting me know.

 spark/spark-ec2.py add parameters to give different instance types for master 
 and slaves
 

 Key: SPARK-6935
 URL: https://issues.apache.org/jira/browse/SPARK-6935
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Oleksii Mandrychenko
Priority: Minor
  Labels: easyfix
   Original Estimate: 24h
  Remaining Estimate: 24h

 I want to start a cluster where I give beefy AWS instances to slaves, such as 
 memory-optimised R3, but master is not really performing much number 
 crunching work. So it is a waste to allocate a powerful instance for master, 
 where a regular one would suffice.
 Suggested syntax:
 {code}
 sh spark-ec2 --instance-type-slave=instance_type # applies to slaves 
 only 
  --instance-type-master=instance_type# applies to master 
 only
  --instance-type=instance_type   # default, applies to 
 both
 # in real world
 sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3284) saveAsParquetFile not working on windows

2015-04-16 Thread Bogdan Niculescu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498019#comment-14498019
 ] 

Bogdan Niculescu commented on SPARK-3284:
-

I get the same type of exception in Spark 1.3.0 under Windows when trying to 
save to a parquet file.
Here is my code :
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)

  dataFrame.saveAsParquetFile("test.parquet")
}

The exception that I'm seeing is :
Exception in thread main java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
...
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
at 
sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19)
at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)

 saveAsParquetFile not working on windows
 

 Key: SPARK-3284
 URL: https://issues.apache.org/jira/browse/SPARK-3284
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.2
 Environment: Windows
Reporter: Pravesh Jain
Priority: Minor

 {code}
 object parquet {
   case class Person(name: String, age: Int)
   def main(args: Array[String]) {
 val sparkConf = new 
 SparkConf().setMaster("local").setAppName("HdfsWordCount")
 val sc = new SparkContext(sparkConf)
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
 import sqlContext.createSchemaRDD
 val people = 
 sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p
  => Person(p(0), p(1).trim.toInt))
 
 people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
 val parquetFile = 
 sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
   }
 }
 {code}
 gives the error
 Exception in thread "main" java.lang.NullPointerException at 
 org.apache.spark.parquet$.main(parquet.scala:16)
 which is the line calling saveAsParquetFile.
 This works fine on Linux, but running it from Eclipse on Windows gives the error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6960) Hardcoded scala version in multiple files. SparkluginBuild.scala, as well as docs/ with self-deprecating comment.

2015-04-16 Thread Michah Lerner (JIRA)
Michah Lerner created SPARK-6960:


 Summary: Hardcoded scala version in multiple files. 
SparkluginBuild.scala, as well as docs/ with self-deprecating comment.
 Key: SPARK-6960
 URL: https://issues.apache.org/jira/browse/SPARK-6960
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
Reporter: Michah Lerner
Priority: Minor
 Fix For: 1.3.1


The Scala version 2.10 is hard-coded in multiple files, including:
docs/_plugins/copy_api_dirs.rb, docs/_config.yml and 
spark-1.3.0/project/project/SparkPluginBuild.scala.

[error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't 
expand macros compiled by previous versions of Scala

The following generally builds successfully (provided the Scala version is edited 
manually), except when the -Pnetlib-lgpl option is added to the mvn build.

mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package
sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc
cd docs
jekyll b
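
One possible direction, sketched under assumptions rather than taken from this ticket: have SparkPluginBuild.scala resolve the Scala version from a build property instead of a literal. The property name scala.version and the fallback value below are illustrative only.

{code}
// Hedged sketch for project/project/SparkPluginBuild.scala: resolve the Scala
// version from a JVM system property (hypothetical "-Dscala.version=2.11.6")
// and fall back to the currently hardcoded value.
scalaVersion := sys.props.getOrElse("scala.version", "2.10.4")
{code}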





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6962:
-
Priority: Major  (was: Blocker)

Until this is more vetted, reducing from Blocker. 
It's hard to get any further without more info, like what was stuck where. Do 
you have a stack dump? Or any code to reproduce?
This may be a duplicate of https://issues.apache.org/jira/browse/SPARK-4395 or 
https://issues.apache.org/jira/browse/SPARK-5060 for example; it's worth 
searching JIRA first.

 Spark gets stuck on a step, hangs forever - jobs do not complete
 

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase

 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller (~20 5MB files) 
 datasets.  When I increase the scope to include more data (several hundred 
 ~5MB files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com
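
For reference, a minimal sketch of applying the workaround named in this report from application code (it can equally be passed as --conf spark.shuffle.blockTransferService=nio to spark-submit); the app name is hypothetical:

{code}
// Hedged sketch: fall back to the pre-1.2 NIO block transfer service, as
// described in this report, to work around the hang.
val conf = new org.apache.spark.SparkConf()
  .setAppName("shuffle-hang-workaround")
  .set("spark.shuffle.blockTransferService", "nio")
val sc = new org.apache.spark.SparkContext(conf)
{code}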



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6962:
-
Component/s: (was: Spark Core)
 SQL

 Spark gets stuck on a step, hangs forever - jobs do not complete
 

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase

 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller (~20 5MB files) 
 datasets.  When I increase the scope to include more data (several hundred 
 ~5MB files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6694.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5345
[https://github.com/apache/spark/pull/5345]

 SparkSQL CLI must be able to specify an option --database on the command line.
 --

 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi
Priority: Critical
 Fix For: 1.4.0


 SparkSQL CLI has an option --database as follows.
 But, the option --database is ignored.
 {code}
 $ spark-sql --help
 :
 CLI options:
 :
 --database <databasename>  Specify the database to use
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6959) Support for datetime comparisons in filter for dataframes in pyspark

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6959:
-
Component/s: SQL

 Support for datetime comparisons in filter for dataframes in pyspark
 -

 Key: SPARK-6959
 URL: https://issues.apache.org/jira/browse/SPARK-6959
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: cynepia

 Currently, in filter, strings can be passed for comparison with a date type 
 column. But this does not address the case where dates may be passed in 
 different formats. We should have support for datetime.date or some standard 
 format for comparison with date type columns.
 Currently,
 df.filter(df.Datecol > datetime.date(2015,1,1)).show()
 is not supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498233#comment-14498233
 ] 

Jon Chase edited comment on SPARK-6962 at 4/16/15 4:17 PM:
---

Attaching the stack dumps I took when Spark is hanging.


was (Author: jonchase):
Here are the stack dumps I took when Spark is hanging.

 Spark gets stuck on a step, hangs forever - jobs do not complete
 

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase
 Attachments: jstacks.txt


 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller (~20 5MB files) 
 datasets.  When I increase the scope to include more data (several hundred 
 ~5MB files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable

2015-04-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6675:
--
Priority: Critical  (was: Major)

 HiveContext setConf is not stable
 -

 Key: SPARK-6675
 URL: https://issues.apache.org/jira/browse/SPARK-6675
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: AWS ec2 xlarge2 cluster launched by spark's script
Reporter: Hao Ren
Priority: Critical

 I find HiveContext.setConf does not work correctly. Here are some code 
 snippets showing the problem:
 snippet 1:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.{SparkConf, SparkContext}
 object Main extends App {
   val conf = new SparkConf()
     .setAppName("context-test")
     .setMaster("local[8]")
   val sc = new SparkContext(conf)
   val hc = new HiveContext(sc)
   hc.setConf("spark.sql.shuffle.partitions", "10")
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
 }
 {code}
 Results:
 (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
 (spark.sql.shuffle.partitions,10)
 snippet 2:
 {code}
 ...
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.setConf("spark.sql.shuffle.partitions", "10")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
 ...
 {code}
 Results:
 (hive.metastore.warehouse.dir,/user/hive/warehouse)
 (spark.sql.shuffle.partitions,10)
 You can see that I just permuted the two setConf calls, and that leads to two 
 different Hive configurations.
 It seems that HiveContext cannot set a new value for the 
 hive.metastore.warehouse.dir key in the first setConf call.
 You need another setConf call before changing 
 hive.metastore.warehouse.dir. For example, set 
 hive.metastore.warehouse.dir twice and it behaves like snippet 1.
 snippet 3:
 {code}
 ...
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
 ...
 {code}
 Results:
 (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
 You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, 
 htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33)
 I have also tested the released 1.3.0 (htag = 
 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498185#comment-14498185
 ] 

Reynold Xin edited comment on SPARK-6635 at 4/16/15 3:38 PM:
-

Should we throw an exception if the name is identical, or just replace it? For 
example, what happens if the user does
{code}
df.select(df("*"), lit(1).as("col"))
{code}

If df already contains a column named col?


was (Author: rxin):
Should we throw an exception if the name is identical, or just replace it?

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 

[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library

2015-04-16 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498189#comment-14498189
 ] 

Guoqiang Li commented on SPARK-3937:


[~joshrosen]
This bug occurs every time. I'm not sure whether I can use local-cluster mode to 
reproduce the bug. If I can, I'll post the code.
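
For reference, a minimal sketch of what launching against a local-cluster master 
can look like (the worker count and memory below are arbitrary, not taken from this report):
{code}
// Sketch only: local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
// starts separate executor JVMs on one machine, unlike plain local[*].
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("snappy-repro")
  .setMaster("local-cluster[2,1,1024]")
val sc = new SparkContext(conf)
{code}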

 Unsafe memory access inside of Snappy library
 -

 Key: SPARK-3937
 URL: https://issues.apache.org/jira/browse/SPARK-3937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Patrick Wendell

 This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
 have much information about this other than the stack trace. However, it was 
 concerning enough I figured I should post it.
 {code}
 java.lang.InternalError: a fault occurred in a recent unsafe memory access 
 operation in compiled Java code
 org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
 org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
 org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
 
 org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
 
 org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
 org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
 
 java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
 
 java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
 
 java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 
 org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498232#comment-14498232
 ] 

Jon Chase commented on SPARK-6962:
--

I think it's different from SPARK-4395, as calling/omitting .cache() doesn't 
have any effect.  Also, once it hangs, I've never seen it finish (even after 
waiting many hours).  

Also different from SPARK-5060, I believe, as the web UI accurately reports the 
remaining tasks as unfinished (or in progress in case of the ones running when 
the hang occurs).  

Here's my original post from the email thread:

===

Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB),
executor memory 20GB, driver memory 10GB

I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread
out over roughly 2,000 Parquet files and my queries frequently hang. Simple
queries like "select count(*) from ..." on the entire data set work ok.
Slightly more demanding ones with group by's and some aggregate functions
(percentile_approx, avg, etc.) work ok as well, as long as I have some
criteria in my where clause to keep the number of rows down.

Once I hit some limit on query complexity and rows processed, my queries
start to hang.  I've left them for up to an hour without seeing any
progress.  No OOM's either - the job is just stuck.

I've tried setting spark.sql.shuffle.partitions to 400 and even 800, but
with the same results: usually near the end of the tasks (like 780 of 800
complete), progress just stops:

15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 788.0 in
stage 1.0 (TID 1618) in 800 ms on
ip-10-209-22-211.eu-west-1.compute.internal (748/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 793.0 in
stage 1.0 (TID 1623) in 622 ms on
ip-10-105-12-41.eu-west-1.compute.internal (749/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 797.0 in
stage 1.0 (TID 1627) in 616 ms on ip-10-90-2-201.eu-west-1.compute.internal
(750/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 799.0 in
stage 1.0 (TID 1629) in 611 ms on ip-10-90-2-201.eu-west-1.compute.internal
(751/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 795.0 in
stage 1.0 (TID 1625) in 669 ms on
ip-10-105-12-41.eu-west-1.compute.internal (752/800)

^^^ this is where it stays forever

Looking at the Spark UI, several of the executors still list active tasks.
I do see that the Shuffle Read for executors that don't have any tasks
remaining is around 100MB, whereas it's more like 10MB for the executors
that still have tasks.

The first stage, mapPartitions, always completes fine.  It's the second
stage (takeOrdered), that hangs.

I've had this issue in 1.2.0 and 1.2.1 as well as 1.3.0.  I've also
encountered it when using JSON files (instead of Parquet).
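
For completeness, a minimal sketch of applying the nio workaround mentioned in the 
description when building the context programmatically (it can also be passed as 
--conf spark.shuffle.blockTransferService=nio to spark-shell or spark-submit):
{code}
// Sketch only: fall back from the netty block transfer service to nio,
// the workaround reported to let these queries complete.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("nio-workaround")
  .set("spark.shuffle.blockTransferService", "nio")
val sc = new SparkContext(conf)
{code}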


 Spark gets stuck on a step, hangs forever - jobs do not complete
 

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase

 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller (~20 5MB files) 
 datasets.  When I increase the scope to include more data (several hundred 
 ~5MB files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498185#comment-14498185
 ] 

Reynold Xin commented on SPARK-6635:


Should we throw an exception if the name is identical, or just replace it?

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at 

[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable

2015-04-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6675:
--
Shepherd: Cheng Lian

 HiveContext setConf is not stable
 -

 Key: SPARK-6675
 URL: https://issues.apache.org/jira/browse/SPARK-6675
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: AWS ec2 xlarge2 cluster launched by spark's script
Reporter: Hao Ren
Priority: Critical

 I find HiveContext.setConf does not work correctly. Here are some code 
 snippets showing the problem:
 snippet 1:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.{SparkConf, SparkContext}
 object Main extends App {
   val conf = new SparkConf()
     .setAppName("context-test")
     .setMaster("local[8]")
   val sc = new SparkContext(conf)
   val hc = new HiveContext(sc)
   hc.setConf("spark.sql.shuffle.partitions", "10")
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
 }
 {code}
 Results:
 (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
 (spark.sql.shuffle.partitions,10)
 snippet 2:
 {code}
 ...
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.setConf("spark.sql.shuffle.partitions", "10")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
 ...
 {code}
 Results:
 (hive.metastore.warehouse.dir,/user/hive/warehouse)
 (spark.sql.shuffle.partitions,10)
 You can see that I just permuted the two setConf calls, and that leads to two 
 different Hive configurations.
 It seems that HiveContext cannot set a new value for the 
 hive.metastore.warehouse.dir key in the first setConf call.
 You need another setConf call before changing 
 hive.metastore.warehouse.dir. For example, set 
 hive.metastore.warehouse.dir twice and it behaves like snippet 1.
 snippet 3:
 {code}
 ...
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
 ...
 {code}
 Results:
 (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
 You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, 
 htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33)
 I have also tested the released 1.3.0 (htag = 
 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6953) Speedup tests of PySpark, reduce logging

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6953:
-
Component/s: PySpark

 Speedup tests of PySpark, reduce logging
 

 Key: SPARK-6953
 URL: https://issues.apache.org/jira/browse/SPARK-6953
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

 Right now, it takes about 30 minutes to complete the PySpark tests with Python 
 2.6, Python 3.4 and PyPy. It would be better to decrease this.
 Also, when running pyspark/tests.py, the logging is pretty scary (lots of 
 exceptions); it would be nice to mute the exceptions when they are expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Chase updated SPARK-6962:
-
Attachment: jstacks.txt

Here are the stack dumps I took when Spark is hanging.

 Spark gets stuck on a step, hangs forever - jobs do not complete
 

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase
 Attachments: jstacks.txt


 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller (~20 5MB files) 
 datasets.  When I increase the scope to include more data (several hundred 
 ~5MB files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint

2015-04-16 Thread Andrew Or (JIRA)
Andrew Or created SPARK-6963:


 Summary: Flaky test: o.a.s.ContextCleanerSuite automatically 
cleanup checkpoint
 Key: SPARK-6963
 URL: https://issues.apache.org/jira/browse/SPARK-6963
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Guoqiang Li
Priority: Critical


{code}
sbt.ForkMain$ForkError: 
fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc,
 rddId).get) was true
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint

2015-04-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6963:
-
Description: 
Observed on an unrelated streaming PR https://github.com/apache/spark/pull/5428
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30389/

{code}
sbt.ForkMain$ForkError: 
fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc,
 rddId).get) was true
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
{code}

  was:
{code}
sbt.ForkMain$ForkError: 
fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc,
 rddId).get) was true
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 

[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-04-16 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498305#comment-14498305
 ] 

Paul Wu commented on SPARK-6936:


You are right: I used spark-hive_2.10 instead of spark-hive_2.11 in my build. 
Sorry, I deleted my comment before reading your comment. Thanks,

 SQLContext.sql() caused deadlock in multi-thread env
 

 Key: SPARK-6936
 URL: https://issues.apache.org/jira/browse/SPARK-6936
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: JDK 1.8.x, RedHat
 Linux version 2.6.32-431.23.3.el6.x86_64 
 (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
Reporter: Paul Wu
  Labels: deadlock, sql, threading

 Running the same query in more than one thread with SQLContext.sql may lead 
 to deadlock. Here is a way to reproduce it (since this is a multi-thread issue, 
 the reproduction may or may not be easy).
 1. Register a relatively big table.
 2. Create two different classes and, in each class, run the same query in a 
 method, put the results in a set, and print out the set size.
 3. Create two threads that each use an object of one of these classes in the run 
 method. Start the threads. For my tests, a deadlock occurs within just a few runs. 
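
 A minimal Scala sketch of those steps (names here are made up; it assumes a 
 shared sqlContext with a temp table big_table already registered):
 {code}
 // Sketch only: two classes issue the same query on one shared SQLContext
 // from two threads, following steps 1-3 above.
 val query = "SELECT * FROM big_table"

 class WorkerA extends Runnable {
   def run(): Unit = {
     val rows = sqlContext.sql(query).collect().toSet
     println("A: " + rows.size)
   }
 }
 class WorkerB extends Runnable {
   def run(): Unit = {
     val rows = sqlContext.sql(query).collect().toSet
     println("B: " + rows.size)
   }
 }

 val t1 = new Thread(new WorkerA)
 val t2 = new Thread(new WorkerB)
 t1.start(); t2.start()
 t1.join(); t2.join()
 {code}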



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4897) Python 3 support

2015-04-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-4897:
--
Priority: Blocker  (was: Minor)

 Python 3 support
 

 Key: SPARK-4897
 URL: https://issues.apache.org/jira/browse/SPARK-4897
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Josh Rosen
Assignee: Davies Liu
Priority: Blocker

 It would be nice to have Python 3 support in PySpark, provided that we can do 
 it in a way that maintains backwards-compatibility with Python 2.6.
 I started looking into porting this; my WIP work can be found at 
 https://github.com/JoshRosen/spark/compare/python3
 I was able to use the 
 [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
 tool to handle the basic conversion of things like {{print}} statements, etc. 
 and had to manually fix up a few imports for packages that moved / were 
 renamed, but the major blocker that I hit was {{cloudpickle}}:
 {code}
 [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
 Python 3.4.2 (default, Oct 19 2014, 17:52:17)
 [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Traceback (most recent call last):
   File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
     import pyspark
   File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
     from pyspark.context import SparkContext
   File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
     from pyspark import accumulators
   File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
     from pyspark.cloudpickle import CloudPickler
   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
     class CloudPickler(pickle.Pickler):
   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
     dispatch = pickle.Pickler.dispatch.copy()
 AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
 {code}
 This code looks like it will be difficult to port to Python 3, so this 
 might be a good reason to switch to 
 [Dill|https://github.com/uqfoundation/dill] for Python serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint

2015-04-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6963:
-
Labels: flaky-test  (was: )

 Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
 --

 Key: SPARK-6963
 URL: https://issues.apache.org/jira/browse/SPARK-6963
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Guoqiang Li
Priority: Critical
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: 
 fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc,
  rddId).get) was true
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252)
   at 
 org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
   at 
 org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46)
   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
   at 
 org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498329#comment-14498329
 ] 

Joseph K. Bradley commented on SPARK-6635:
--

[~rxin] Your select statement does highlight how the correct behavior is 
ambiguous.  What is the best way to replace 1 column while leaving all others 
the same?  That seems like a useful operation.  (Right now, I can only think of 
getting all of the column names, removing the one being replaced, and using 
that list in a select statement.)
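
A minimal sketch of that select-based workaround (assuming a DataFrame df whose 
column x is being replaced):
{code}
// Sketch only: keep every column except "x", then append the replacement
// column under the same name, so no duplicate name is created.
val replaced = df.select(
  df.columns.filter(_ != "x").map(df.apply) :+ (df("x") + 1).as("x"): _*
)
{code}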

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $iwC$$iwC$$iwC$$iwC.init(console:37)
   at $iwC$$iwC$$iwC.init(console:39)
   at $iwC$$iwC.init(console:41)
   at $iwC.init(console:43)
   at init(console:45)
   at .init(console:49)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 

[jira] [Updated] (SPARK-4081) Categorical feature indexing

2015-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4081:
-
Description: 
**Updated Description**

Decision Trees and tree ensembles require that categorical features be indexed 
0,1,2  There is currently no code to aid with indexing a dataset.  This is 
a proposal for a helper class for computing indices (and also deciding which 
features to treat as categorical).

Proposed functionality:
* This helps process a dataset of unknown vectors into a dataset with some 
continuous features and some categorical features. The choice between 
continuous and categorical is based upon a maxCategories parameter.
* This can also map categorical feature values to 0-based indices.

This is implemented in the spark.ml package for the Pipelines API, and it 
stores the indexes as column metadata.
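
A rough sketch of the intended Pipelines-style usage (the class, column names and 
maxCategories value below are illustrative only):
{code}
// Sketch only: fit the indexer on a DataFrame with a vector column, letting
// maxCategories decide which features are treated as categorical.
import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)

val indexerModel = indexer.fit(dataset)       // dataset: DataFrame with a "features" column
val indexed = indexerModel.transform(dataset)
{code}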


  was:
DecisionTree and RandomForest require that categorical features and labels be 
indexed 0,1,2  There is currently no code to aid with indexing a dataset.  
This is a proposal for a helper class for computing indices (and also deciding 
which features to treat as categorical).

Proposed functionality:
* This helps process a dataset of unknown vectors into a dataset with some 
continuous features and some categorical features. The choice between 
continuous and categorical is based upon a maxCategories parameter.
* This can also map categorical feature values to 0-based indices.

Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Double, Int] = 
datasetIndexer.getCategoricalFeatureIndexes()
{code}



 Categorical feature indexing
 

 Key: SPARK-4081
 URL: https://issues.apache.org/jira/browse/SPARK-4081
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
 Fix For: 1.4.0


 **Updated Description**
 Decision Trees and tree ensembles require that categorical features be 
 indexed 0,1,2  There is currently no code to aid with indexing a dataset. 
  This is a proposal for a helper class for computing indices (and also 
 deciding which features to treat as categorical).
 Proposed functionality:
 * This helps process a dataset of unknown vectors into a dataset with some 
 continuous features and some categorical features. The choice between 
 continuous and categorical is based upon a maxCategories parameter.
 * This can also map categorical feature values to 0-based indices.
 This is implemented in the spark.ml package for the Pipelines API, and it 
 stores the indexes as column metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-04-16 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498247#comment-14498247
 ] 

Paul Wu commented on SPARK-6936:


Not sure about HiveContext. I tried to run the following program and I got an 
exception (env: JDK 1.8 / Spark 1.3). Why did I get the error on HiveContext?


-- exec-maven-plugin:1.2.1:exec (default-cli) @ Spark-Sample ---
Exception in thread main java.lang.NoSuchMethodError: 
scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:574)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:245)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:234)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
at 

[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-04-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498258#comment-14498258
 ] 

Michael Armbrust commented on SPARK-6936:
-

NoSuchMethodError almost always means you are mixing incompatible versions
of libraries (in this case probably scala?) on your classpath.



 SQLContext.sql() caused deadlock in multi-thread env
 

 Key: SPARK-6936
 URL: https://issues.apache.org/jira/browse/SPARK-6936
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: JDK 1.8.x, RedHat
 Linux version 2.6.32-431.23.3.el6.x86_64 
 (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
Reporter: Paul Wu
  Labels: deadlock, sql, threading

 Running the same query from more than one thread with SQLContext.sql may lead 
 to a deadlock. Here is a way to reproduce it (since this is a multi-threading 
 issue, the reproduction may or may not be easy); a hedged sketch follows the 
 steps below.
 1. Register a relatively big table.
 2. Create two different classes; in each class, run the same query in a method, 
 put the results in a set, and print out the set size.
 3. Create two threads that each use an object of one of the classes in their 
 run method, then start the threads. In my tests, a deadlock showed up within 
 just a few runs.
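 A minimal Scala sketch of that scenario (not the reporter's original code), 
 assuming sqlContext is an existing SQLContext and a table named "big_table" 
 has already been registered:
 {code}
 // Hypothetical sketch of the reproduction described above; "big_table" and
 // the query are placeholders, not the reporter's actual code.
 class QueryRunnerA(sqlContext: org.apache.spark.sql.SQLContext) extends Runnable {
   override def run(): Unit = {
     val results = sqlContext.sql("SELECT * FROM big_table").collect().toSet
     println("A: " + results.size)
   }
 }

 class QueryRunnerB(sqlContext: org.apache.spark.sql.SQLContext) extends Runnable {
   override def run(): Unit = {
     val results = sqlContext.sql("SELECT * FROM big_table").collect().toSet
     println("B: " + results.size)
   }
 }

 val t1 = new Thread(new QueryRunnerA(sqlContext))
 val t2 = new Thread(new QueryRunnerB(sqlContext))
 t1.start(); t2.start()
 t1.join(); t2.join()
 {code}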



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table

2015-04-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498271#comment-14498271
 ] 

Michael Armbrust commented on SPARK-6923:
-

Only Spark 1.3 has the ability to read tables that are created with the
datasource API.



 Get invalid hive table columns after save DataFrame to hive table
 -

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang

 HiveContext hctx = new HiveContext(sc);
 List<String> sample = new ArrayList<String>();
 sample.add("{\"id\": \"id_1\", \"age\":1}");
 RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
 DataFrame df = hctx.jsonRDD(sampleRDD);
 String table = "test";
 df.saveAsTable(table, "json", SaveMode.Overwrite);
 Table t = hctx.catalog().client().getTable(table);
 System.out.println(t.getCols());
 --
 With the code above saving the DataFrame to a Hive table, getting the table 
 columns returns a single column named 'col':
 [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
 The expected field schema is id, age.
 As a result, the JDBC API cannot retrieve the table columns via 
 DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
 tableNamePattern, String columnNamePattern), even though the ResultSet 
 metadata for the query "select * from test" does contain the fields id and 
 age.
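 For reference, a hedged sketch of the JDBC metadata call in question (the 
 Thrift server URL and credentials are placeholders, not taken from this 
 report); once the table metadata is correct it should list id and age:
 {code}
 // Hypothetical sketch: inspect the Hive table's columns through standard JDBC metadata.
 // The connection URL and user below are placeholders.
 import java.sql.DriverManager

 val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
 try {
   val cols = conn.getMetaData.getColumns(null, null, "test", null)
   while (cols.next()) {
     println(cols.getString("COLUMN_NAME") + ": " + cols.getString("TYPE_NAME"))
   }
 } finally {
   conn.close()
 }
 {code}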



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-04-16 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-6936:
---
Comment: was deleted

(was: Not sure about HiveContext. I tried the following program and got an 
exception (env: JDK 1.8 / Spark 1.3). Why did I get this error on HiveContext?


-- exec-maven-plugin:1.2.1:exec (default-cli) @ Spark-Sample ---
Exception in thread main java.lang.NoSuchMethodError: 
scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:574)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:245)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:234)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)
at 

[jira] [Commented] (SPARK-6960) Hardcoded scala version in multiple files. SparkluginBuild.scala, as well as docs/ with self-deprecating comment.

2015-04-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498057#comment-14498057
 ] 

Sean Owen commented on SPARK-6960:
--

Micah, the build doesn't work at all for 2.11 unless you use the script to 
translate all the hardcoded occurrences. Did you apply this script? It should be 
under dev/. The docs plugins probably don't matter, as there is no separate docs 
build for 2.11.

 Hardcoded scala version in multiple files. SparkluginBuild.scala, as well as 
 docs/ with self-deprecating comment.
 -

 Key: SPARK-6960
 URL: https://issues.apache.org/jira/browse/SPARK-6960
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
Reporter: Michah Lerner
Priority: Minor
 Fix For: 1.3.1


 The Scala version 2.10 is hard-coded in multiple files, including 
 docs/_plugins/copy_api_dirs.rb, docs/_config.yml and 
 spark-1.3.0/project/project/SparkPluginBuild.scala.
 [error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't 
 expand macros compiled by previous versions of Scala
 The following generally builds successfully (except when the -Pnetlib-lgpl 
 option is added to the mvn build), provided the Scala version is edited 
 manually:
 mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package
 sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc
 cd docs
 jekyll b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6960) Hardcoded scala version in multiple files. SparkluginBuild.scala, as well as docs/ with self-deprecating comment.

2015-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6960:
-
Target Version/s:   (was: 1.3.1)
   Fix Version/s: (was: 1.3.1)

Don't set fix or target version; 1.3.1 is out the door anyway. 

 Hardcoded scala version in multiple files. SparkluginBuild.scala, as well as 
 docs/ with self-deprecating comment.
 -

 Key: SPARK-6960
 URL: https://issues.apache.org/jira/browse/SPARK-6960
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
Reporter: Michah Lerner
Priority: Minor

 The Scala version 2.10 is hard-coded in multiple files, including 
 docs/_plugins/copy_api_dirs.rb, docs/_config.yml and 
 spark-1.3.0/project/project/SparkPluginBuild.scala.
 [error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't 
 expand macros compiled by previous versions of Scala
 The following generally builds successfully (except when the -Pnetlib-lgpl 
 option is added to the mvn build), provided the Scala version is edited 
 manually:
 mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package
 sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc
 cd docs
 jekyll b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server

2015-04-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6964:

Description: There is already a hook in {{ExecuteStatementOperation}}, we 
just need to connect it to the job group cancellation support we already have 
and make sure the various drivers support it.  (was: There is already a hook in 
)

 Support Cancellation in the Thrift Server
 -

 Key: SPARK-6964
 URL: https://issues.apache.org/jira/browse/SPARK-6964
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical

 There is already a hook in {{ExecuteStatementOperation}}, we just need to 
 connect it to the job group cancellation support we already have and make 
 sure the various drivers support it.
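 For context, a minimal sketch of the job group cancellation support referred 
 to above (standard SparkContext API; the group id and query below are 
 placeholders, and sc / sqlContext are assumed to be in scope):
 {code}
 // Sketch of the job-group cancellation mechanism already available on SparkContext.
 // "stmt-42" and the query are placeholder values.
 sc.setJobGroup("stmt-42", "query submitted via the Thrift server", interruptOnCancel = true)
 val rows = sqlContext.sql("SELECT count(*) FROM some_table").collect()  // jobs run under the group

 // Elsewhere, e.g. when the client cancels the operation:
 sc.cancelJobGroup("stmt-42")
 {code}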



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.

2015-04-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6955:
-
Affects Version/s: 1.2.0

 Do not let Yarn Shuffle Server retry its server port.
 -

 Key: SPARK-6955
 URL: https://issues.apache.org/jira/browse/SPARK-6955
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, YARN
Affects Versions: 1.2.0
Reporter: SaintBacchus
Priority: Minor

  It's better to let the NodeManager go down rather than retry on another port 
 when `spark.shuffle.service.port` conflicts while starting the Spark YARN 
 shuffle server, because the retry mechanism makes the shuffle port 
 inconsistent and also makes clients fail to find the port.
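 A rough sketch of the intended fail-fast behaviour (illustrative only, not 
 the actual YARN shuffle service code):
 {code}
 // Illustrative only: bind exactly the configured port and fail instead of retrying,
 // so the port clients derive from spark.shuffle.service.port never drifts.
 import java.net.{BindException, InetSocketAddress, ServerSocket}

 def bindOrFail(port: Int): ServerSocket = {
   val socket = new ServerSocket()
   try {
     socket.bind(new InetSocketAddress(port))
     socket
   } catch {
     case e: BindException =>
       socket.close()
       throw new IllegalStateException("Port " + port + " is already in use; not retrying", e)
   }
 }
 {code}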



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.

2015-04-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6955:
-
Assignee: SaintBacchus

 Do not let Yarn Shuffle Server retry its server port.
 -

 Key: SPARK-6955
 URL: https://issues.apache.org/jira/browse/SPARK-6955
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, YARN
Affects Versions: 1.2.0
Reporter: SaintBacchus
Assignee: SaintBacchus
Priority: Minor

  It's better to let the NodeManager go down rather than retry on another port 
 when `spark.shuffle.service.port` conflicts while starting the Spark YARN 
 shuffle server, because the retry mechanism makes the shuffle port 
 inconsistent and also makes clients fail to find the port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-16 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498390#comment-14498390
 ] 

Matt Cheah edited comment on SPARK-6950 at 4/16/15 5:57 PM:


There's one way I could reproduce this locally, but I can't confirm if this is 
what is happening in production. I had to force a race condition to occur for 
it.

Basically sometimes the Spark Master can attempt to rebuild the UI before the 
event logging listener renames the event log file removing the .inprogress 
extension. If the spark master reaches that point first before the event 
logging listener renames the file, it will never check again if the file is 
renamed and never build the UI. The SparkContext.stop() method requests the 
eventLogger to stop, which may not execute until after the Master has called 
rebuildSparkUi() for the completed application.

However I think the fix to SPARK-6107 will inadvertently also solve this issue. 
I'll try applying that patch.


was (Author: mcheah):
There's one way I could reproduce this locally, but I can't confirm if this is 
what is happening in production. I had to force a race condition to occur for 
it.

Basically sometimes the Spark Master can attempt to rebuild the UI before the 
event logging listener renames the event log file removing the .inprogress 
extension. If the spark master reaches that point first before the event 
logging listener renames the file, it will never check again if the file is 
renamed and never build the UI. The SparkContext.stop() method requests the 
eventLogger to stop, which may not execute until after the Master has called 
rebuildSparkUi() for the completed application.

However I think the fix to SPARK-6107 will inadvertently also solve this issue.

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah

 In Spark 1.2.x, I was able to set my spark event log directory to be a 
 different location from the default, and after the job finishes, I can replay 
 the UI by clicking on the appropriate link under Completed Applications.
 Now, on a non-deterministic basis (but seems to happen most of the time), 
 when I click on the link under Completed Applications, I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-16 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498390#comment-14498390
 ] 

Matt Cheah commented on SPARK-6950:
---

There's one way I could reproduce this locally, but I can't confirm if this is 
what is happening in production. I had to force a race condition to occur for 
it.

Basically sometimes the Spark Master can attempt to rebuild the UI before the 
event logging listener renames the event log file removing the .inprogress 
extension. If the spark master reaches that point first before the event 
logging listener renames the file, it will never check again if the file is 
renamed and never build the UI. The SparkContext.stop() method requests the 
eventLogger to stop, which may not execute until after the Master has called 
rebuildSparkUi() for the completed application.

However I think the fix to SPARK-6107 will inadvertently also solve this issue.

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah

 In Spark 1.2.x, I was able to set my spark event log directory to be a 
 different location from the default, and after the job finishes, I can replay 
 the UI by clicking on the appropriate link under Completed Applications.
 Now, on a non-deterministic basis (but seems to happen most of the time), 
 when I click on the link under Completed Applications, I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6964) Support Cancellation in the Thrift Server

2015-04-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6964:
---

 Summary: Support Cancellation in the Thrift Server
 Key: SPARK-6964
 URL: https://issues.apache.org/jira/browse/SPARK-6964
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1442) Add Window function support

2015-04-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1442:

Priority: Blocker  (was: Critical)

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
Priority: Blocker
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server

2015-04-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6964:

Target Version/s: 1.4.0

 Support Cancellation in the Thrift Server
 -

 Key: SPARK-6964
 URL: https://issues.apache.org/jira/browse/SPARK-6964
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server

2015-04-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6964:

Description: There is already a hook in 

 Support Cancellation in the Thrift Server
 -

 Key: SPARK-6964
 URL: https://issues.apache.org/jira/browse/SPARK-6964
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical

 There is already a hook in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6965) StringIndexer should convert input to Strings

2015-04-16 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6965:


 Summary: StringIndexer should convert input to Strings
 Key: SPARK-6965
 URL: https://issues.apache.org/jira/browse/SPARK-6965
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


StringIndexer should convert non-String input types to String.  That way, it 
can handle any basic type such as Int, Double, etc.
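Until then, a hedged sketch of the usual workaround, casting the column to 
string before indexing (assuming an existing DataFrame df; the column names 
are placeholders):
{code}
// Workaround sketch: cast a numeric column to string before feeding it to StringIndexer.
// "category" / "categoryStr" / "categoryIndex" are placeholder column names.
import org.apache.spark.ml.feature.StringIndexer

val withStringCol = df.withColumn("categoryStr", df("category").cast("string"))
val indexer = new StringIndexer()
  .setInputCol("categoryStr")
  .setOutputCol("categoryIndex")
val indexed = indexer.fit(withStringCol).transform(withStringCol)
{code}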



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.

2015-04-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6955:
-
Target Version/s: 1.4.0

 Do not let Yarn Shuffle Server retry its server port.
 -

 Key: SPARK-6955
 URL: https://issues.apache.org/jira/browse/SPARK-6955
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, YARN
Affects Versions: 1.2.0
Reporter: SaintBacchus
Priority: Minor

  It's better to let the NodeManager go down rather than retry on another port 
 when `spark.shuffle.service.port` conflicts while starting the Spark YARN 
 shuffle server, because the retry mechanism makes the shuffle port 
 inconsistent and also makes clients fail to find the port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2734) DROP TABLE should also uncache table

2015-04-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498403#comment-14498403
 ] 

Michael Armbrust commented on SPARK-2734:
-

How do you know it is occurring?  What queries are you running?

 DROP TABLE should also uncache table
 

 Key: SPARK-2734
 URL: https://issues.apache.org/jira/browse/SPARK-2734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.1.0


 Steps to reproduce:
 {code}
 hql("CREATE TABLE test(a INT)")
 hql("CACHE TABLE test")
 hql("DROP TABLE test")
 hql("SELECT * FROM test")
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6940) PySpark ML.Tuning Wrappers are missing

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498663#comment-14498663
 ] 

Joseph K. Bradley commented on SPARK-6940:
--

[~omede] Can you please coordinate with [~punya] who also opened up a similar 
ticket?

As far as conflicting JIRAs/PRs, you should primarily be aware of [SPARK-5874] 
which [~mengxr] is working on.

 PySpark ML.Tuning Wrappers are missing
 --

 Key: SPARK-6940
 URL: https://issues.apache.org/jira/browse/SPARK-6940
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz

 PySpark doesn't currently have wrappers for any of the ML.Tuning classes: 
 CrossValidator, CrossValidatorModel, ParamGridBuilder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-16 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498555#comment-14498555
 ] 

Matt Cheah edited comment on SPARK-6950 at 4/16/15 7:32 PM:


This is no longer an issue on the tip of branch-1.3, as:

(1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and
(2) I believe somewhere between the released 1.3 and the current 1.3, the event 
logging listener in the SparkContext was forced to stop before the Spark master 
UI was notified that the application completed.

I'll close this out for now but will re-open it if I continue to see the issue 
after 1.3.1 is released and I update to it.


was (Author: mcheah):
This is no longer an issue on the tip of branch-1.3, as:

(1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and
(2) I believe somewhere between the released 1.3 and the current 1.3, the event 
log stopping was forced to stop before the Spark master UI was notified that 
the application completed.

I'll close this out for now but will re-open it if I continue to see the issue 
after 1.3.1 is released and I update to it.

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah
 Fix For: 1.3.1


 In Spark 1.2.x, I was able to set my spark event log directory to be a 
 different location from the default, and after the job finishes, I can replay 
 the UI by clicking on the appropriate link under Completed Applications.
 Now, on a non-deterministic basis (but seems to happen most of the time), 
 when I click on the link under Completed Applications, I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6947) Make ml.tuning accessible from Python API

2015-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6947:
-
Fix Version/s: (was: SPARK-6940)

 Make ml.tuning accessible from Python API
 -

 Key: SPARK-6947
 URL: https://issues.apache.org/jira/browse/SPARK-6947
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Punya Biswal

 {{CrossValidator}} and {{ParamGridBuilder}} should be available for use in 
 PySpark-based ML pipelines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6844) Memory leak occurs when register temp table with cache table on

2015-04-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498587#comment-14498587
 ] 

Michael Armbrust commented on SPARK-6844:
-

I was not planning to.  I do not think that it is a regression from 1.2 and it 
is a little risky to backport changes to the way we initialize cached relations.

 Memory leak occurs when register temp table with cache table on
 ---

 Key: SPARK-6844
 URL: https://issues.apache.org/jira/browse/SPARK-6844
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jack Hu
  Labels: Memory, SQL
 Fix For: 1.4.0


 There is a memory leak in register temp table with cache on
 This is the simple code to reproduce this issue:
 {code}
 val sparkConf = new SparkConf().setAppName("LeakTest")
 val sparkContext = new SparkContext(sparkConf)
 val sqlContext = new SQLContext(sparkContext)
 val tableName = "tmp"
 val jsonrdd = sparkContext.textFile("sample.json")
 var loopCount = 1L
 while(true) {
   sqlContext.jsonRDD(jsonrdd).registerTempTable(tableName)
   sqlContext.cacheTable(tableName)
   println("L: " + loopCount + " R: " + sqlContext.sql("select count(*) from tmp").count())
   sqlContext.uncacheTable(tableName)
   loopCount += 1
 }
 {code}
 The cause is that {{InMemoryRelation}} and {{InMemoryColumnarTableScan}} use 
 accumulators ({{InMemoryRelation.batchStats}}, 
 {{InMemoryColumnarTableScan.readPartitions}}, 
 {{InMemoryColumnarTableScan.readBatches}}) to collect information from the 
 partitions or for tests. These accumulators register themselves into a static 
 map in {{Accumulators.originals}} and never get cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6857.
--
Resolution: Not A Problem

 Python SQL schema inference should support numpy types
 --

 Key: SPARK-6857
 URL: https://issues.apache.org/jira/browse/SPARK-6857
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 **UPDATE**: Closing this JIRA since a better fix will be better UDT support.  
 See discussion in comments.
 If you try to use SQL's schema inference to create a DataFrame out of a list 
 or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
 numpy types.  It would be handy if it did.
 E.g.:
 {code}
 import numpy
 from collections import namedtuple
 from pyspark.sql import SQLContext
 MyType = namedtuple('MyType', 'x')
 myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
 sqlContext = SQLContext(sc)
 data = sqlContext.createDataFrame(myValues)
 {code}
 The above code fails with:
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in 
 createDataFrame
 return self.inferSchema(data, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in 
 inferSchema
 schema = self._inferSchema(rdd, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in 
 _inferSchema
 schema = _infer_schema(first)
   File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in 
 _infer_schema
 fields = [StructField(k, _infer_type(v), True) for k, v in items]
   File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in 
 _infer_type
  raise ValueError("not supported type: %s" % type(obj))
  ValueError: not supported type: <type 'numpy.int64'>
 {code}
 But if we cast to int (not numpy types) first, it's OK:
 {code}
 myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
 data = sqlContext.createDataFrame(myNativeValues) # OK
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6857:
-
Description: 
**UPDATE**: Closing this JIRA since a better fix will be better UDT support.  
See discussion in comments.

If you try to use SQL's schema inference to create a DataFrame out of a list or 
RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy 
types.  It would be handy if it did.

E.g.:
{code}
import numpy
from collections import namedtuple
from pyspark.sql import SQLContext
MyType = namedtuple('MyType', 'x')
myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(myValues)
{code}

The above code fails with:
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in 
createDataFrame
return self.inferSchema(data, samplingRatio)
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in 
inferSchema
schema = self._inferSchema(rdd, samplingRatio)
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in 
_inferSchema
schema = _infer_schema(first)
  File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in 
_infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in 
_infer_type
raise ValueError("not supported type: %s" % type(obj))
ValueError: not supported type: <type 'numpy.int64'>
{code}

But if we cast to int (not numpy types) first, it's OK:
{code}
myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
data = sqlContext.createDataFrame(myNativeValues) # OK
{code}


  was:
If you try to use SQL's schema inference to create a DataFrame out of a list or 
RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy 
types.  It would be handy if it did.

E.g.:
{code}
import numpy
from collections import namedtuple
from pyspark.sql import SQLContext
MyType = namedtuple('MyType', 'x')
myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(myValues)
{code}

The above code fails with:
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in 
createDataFrame
return self.inferSchema(data, samplingRatio)
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in 
inferSchema
schema = self._inferSchema(rdd, samplingRatio)
  File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in 
_inferSchema
schema = _infer_schema(first)
  File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in 
_infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in 
_infer_type
raise ValueError("not supported type: %s" % type(obj))
ValueError: not supported type: <type 'numpy.int64'>
{code}

But if we cast to int (not numpy types) first, it's OK:
{code}
myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
data = sqlContext.createDataFrame(myNativeValues) # OK
{code}



 Python SQL schema inference should support numpy types
 --

 Key: SPARK-6857
 URL: https://issues.apache.org/jira/browse/SPARK-6857
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 **UPDATE**: Closing this JIRA since a better fix will be better UDT support.  
 See discussion in comments.
 If you try to use SQL's schema inference to create a DataFrame out of a list 
 or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
 numpy types.  It would be handy if it did.
 E.g.:
 {code}
 import numpy
 from collections import namedtuple
 from pyspark.sql import SQLContext
 MyType = namedtuple('MyType', 'x')
 myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
 sqlContext = SQLContext(sc)
 data = sqlContext.createDataFrame(myValues)
 {code}
 The above code fails with:
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in 
 createDataFrame
 return self.inferSchema(data, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in 
 inferSchema
 schema = self._inferSchema(rdd, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in 
 _inferSchema
 schema = _infer_schema(first)
   File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in 
 _infer_schema
 fields = [StructField(k, 

[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498725#comment-14498725
 ] 

Reynold Xin commented on SPARK-6635:


cc [~marmbrus] to chime in.

Thinking about it more, withName should probably overwrite an existing 
column (or maybe take an argument to control the behavior?). However, we might 
also want a broader discussion about column names.
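In the meantime, a hedged sketch of the replace-by-select workaround (assuming 
a DataFrame df with a column "x", as in the report):
{code}
// Workaround sketch: rebuild the DataFrame so the new "x" replaces the old one
// instead of adding a duplicate column.
import org.apache.spark.sql.functions.col

val newCols = df.columns.filterNot(_ == "x").map(col) :+ (df("x") + 1).as("x")
val replaced = df.select(newCols: _*)
{code}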


 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
   at $iwC$$iwC$$iwC.<init>(<console>:39)
   at $iwC$$iwC.<init>(<console>:41)
   at $iwC.<init>(<console>:43)
   at <init>(<console>:45)
   at .<init>(<console>:49)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 

[jira] [Created] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-04-16 Thread John Muller (JIRA)
John Muller created SPARK-6970:
--

 Summary: Document what the options: Map[String, String] does on 
DataFrame.save and DataFrame.saveAsTable
 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 1.3.0
Reporter: John Muller


The save options on DataFrames are not easily discerned:
[ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
  is where the pattern match occurs:

{code:title=ddl.scala|borderStyle=solid}
case dataSource: SchemaRelationProvider =>
  dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
{code}

Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet
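For illustration, a hedged sketch of how the options map is usually passed in 
1.3 (assuming the save(source, mode, options) and saveAsTable overloads and an 
existing DataFrame df; the paths and keys below are placeholders, and the 
recognized keys depend on the data source):
{code}
// Illustrative sketch: the options map is handed to the data source, which
// interprets keys such as "path" itself. All values below are placeholders.
import org.apache.spark.sql.SaveMode

df.save("json", SaveMode.Overwrite, Map("path" -> "/tmp/people_json"))
df.saveAsTable("people", "parquet", SaveMode.Overwrite, Map("path" -> "/tmp/people_parquet"))
{code}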



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5427) Add support for floor function in Spark SQL

2015-04-16 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-5427:
--
Description: 
floor() function is supported in Hive SQL.
This issue is to add floor() function to Spark SQL.

Related thread: http://search-hadoop.com/m/JW1q563fc22

  was:
floor() function is supported in Hive SQL.

This issue is to add floor() function to Spark SQL.

Related thread: http://search-hadoop.com/m/JW1q563fc22


 Add support for floor function in Spark SQL
 ---

 Key: SPARK-5427
 URL: https://issues.apache.org/jira/browse/SPARK-5427
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Ted Yu
  Labels: math

 floor() function is supported in Hive SQL.
 This issue is to add floor() function to Spark SQL.
 Related thread: http://search-hadoop.com/m/JW1q563fc22



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498593#comment-14498593
 ] 

Davies Liu commented on SPARK-6857:
---

It's not good that we use array or numpy.array as part of the API, and we 
cannot change it right now. I'd suggest using Vector as part of the API in ml, 
and supporting easy and fast conversion from/to numpy.array.

numpy/scipy are only useful for mllib/ml; it's better to keep them out of the 
scope of SQL.

 Python SQL schema inference should support numpy types
 --

 Key: SPARK-6857
 URL: https://issues.apache.org/jira/browse/SPARK-6857
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 **UPDATE**: Closing this JIRA since a better fix will be better UDT support.  
 See discussion in comments.
 If you try to use SQL's schema inference to create a DataFrame out of a list 
 or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
 numpy types.  It would be handy if it did.
 E.g.:
 {code}
 import numpy
 from collections import namedtuple
 from pyspark.sql import SQLContext
 MyType = namedtuple('MyType', 'x')
 myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
 sqlContext = SQLContext(sc)
 data = sqlContext.createDataFrame(myValues)
 {code}
 The above code fails with:
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in 
 createDataFrame
 return self.inferSchema(data, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in 
 inferSchema
 schema = self._inferSchema(rdd, samplingRatio)
   File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in 
 _inferSchema
 schema = _infer_schema(first)
   File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in 
 _infer_schema
 fields = [StructField(k, _infer_type(v), True) for k, v in items]
   File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in 
 _infer_type
  raise ValueError("not supported type: %s" % type(obj))
  ValueError: not supported type: <type 'numpy.int64'>
 {code}
 But if we cast to int (not numpy types) first, it's OK:
 {code}
 myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
 data = sqlContext.createDataFrame(myNativeValues) # OK
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498736#comment-14498736
 ] 

Joseph K. Bradley commented on SPARK-6635:
--

Btw, when I say that seems like a useful operation, I really mean that I 
wanted to do that for MLlib.

 DataFrame.withColumn can create columns with identical names
 

 Key: SPARK-6635
 URL: https://issues.apache.org/jira/browse/SPARK-6635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 DataFrame lets you create multiple columns with the same name, which causes 
 problems when you try to refer to columns by name.
 Proposal: If a column is added to a DataFrame with a column of the same name, 
 then the new column should replace the old column.
 {code}
 scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
 df: org.apache.spark.sql.DataFrame = [x: int]
 scala> val df3 = df.withColumn("x", df("x") + 1)
 df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
 scala> df3.collect()
 res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
 scala> df3("x")
 org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
 x, x.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
   at $iwC$$iwC$$iwC.<init>(<console>:39)
   at $iwC$$iwC.<init>(<console>:41)
   at $iwC.<init>(<console>:43)
   at <init>(<console>:45)
   at .<init>(<console>:49)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
   at 

[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-16 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498555#comment-14498555
 ] 

Matt Cheah edited comment on SPARK-6950 at 4/16/15 7:31 PM:


This is no longer an issue on the tip of branch-1.3, as:

(1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and
(2) I believe somewhere between the released 1.3 and the current 1.3, the event 
log stopping was forced to stop before the Spark master UI was notified that 
the application completed.

I'll close this out for now but will re-open it if I continue to see the issue 
after 1.3.1 is released and I update to it.


was (Author: mcheah):
This is no longer an issue on the tip of branch-1.3, as:

(1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and
(2) I believe somewhere between the released 1.3 and the current 1.3, the event 
log stopping was forced to stop before the Spark master UI was notified that 
the application completed.

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah
 Fix For: 1.3.1


 In Spark 1.2.x, I was able to set my spark event log directory to be a 
 different location from the default, and after the job finishes, I can replay 
 the UI by clicking on the appropriate link under Completed Applications.
 Now, on a non-deterministic basis (but seems to happen most of the time), 
 when I click on the link under Completed Applications, I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-16 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-6950.
---
   Resolution: Cannot Reproduce
Fix Version/s: 1.3.1

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah
 Fix For: 1.3.1


 In Spark 1.2.x, I was able to set my Spark event log directory to a 
 different location from the default, and after the job finishes, I can replay 
 the UI by clicking on the appropriate link under "Completed Applications".
 Now, on a non-deterministic basis (but it seems to happen most of the time), 
 when I click on the link under "Completed Applications", I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-16 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-6962:
--
Summary: Netty BlockTransferService hangs in the middle of SQL query  (was: 
Spark gets stuck on a step, hangs forever - jobs do not complete)

 Netty BlockTransferService hangs in the middle of SQL query
 ---

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase
 Attachments: jstacks.txt


 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface this, so I mention Spark SQL) hang 
 indefinitely under certain (not totally understood) circumstances.  
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to netty as the issue.  Netty was set as the default for the 
 block transport layer in 1.2.0, which is when this issue started.  Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller datasets (~20 5MB 
 files).  When I increase the scope to include more data (several hundred ~5MB 
 files), the queries will get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com
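
As a reference, a hedged sketch of the workaround described above (falling back from the default netty transport to nio), assuming PySpark; the application name is illustrative:

{code}
from pyspark import SparkConf, SparkContext

# Workaround from the description: switch the block transfer service from the
# default "netty" back to "nio" (configuration key used in the 1.2/1.3 line).
conf = (SparkConf()
        .setAppName("nio-transport-workaround")
        .set("spark.shuffle.blockTransferService", "nio"))

sc = SparkContext(conf=conf)
{code}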



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498651#comment-14498651
 ] 

Joseph K. Bradley commented on SPARK-6857:
--

Based on past discussions with [~mengxr], ML should use numpy and scipy types, 
rather than re-implementing all of that functionality.

Supporting numpy and scipy types in SQL does not actually mean having numpy 
or scipy code in SQL.  It would mean (a sketch follows this list):
* Extending UDTs so users can register their own UDTs with the SQLContext.
* Adding UDTs for numpy and scipy types in MLlib.
* Allowing users to import or call something which registers those MLlib UDTs 
with SQL.
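
To make the first bullet concrete, a purely hypothetical sketch of what a user-registered UDT for numpy.float64 might look like; the class name, type mapping, and the registerUDT call are illustrative only and are not part of today's API:

{code}
# Hypothetical sketch only: pyspark.sql.types.UserDefinedType is an existing
# developer API, but the registerUDT() call at the bottom is the *proposed*
# feature discussed above, not something Spark provides today.
from pyspark.sql.types import UserDefinedType, DoubleType

class NumpyFloat64UDT(UserDefinedType):
    """Illustrative UDT mapping numpy.float64 to a SQL double."""

    @classmethod
    def sqlType(cls):
        return DoubleType()

    @classmethod
    def module(cls):
        return "__main__"

    def serialize(self, obj):
        return float(obj)

    def deserialize(self, datum):
        import numpy
        return numpy.float64(datum)

# Proposed registration call (does not exist yet):
# sqlContext.registerUDT(numpy.float64, NumpyFloat64UDT())
{code}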


 Python SQL schema inference should support numpy types
 --

 Key: SPARK-6857
 URL: https://issues.apache.org/jira/browse/SPARK-6857
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 **UPDATE**: Closing this JIRA since a better fix will be better UDT support.  
 See discussion in comments.
 If you try to use SQL's schema inference to create a DataFrame out of a list 
 or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
 numpy types.  It would be handy if it did.
 E.g.:
 {code}
 import numpy
 from collections import namedtuple
 from pyspark.sql import SQLContext
 MyType = namedtuple('MyType', 'x')
 myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
 sqlContext = SQLContext(sc)
 data = sqlContext.createDataFrame(myValues)
 {code}
 The above code fails with:
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
     return self.inferSchema(data, samplingRatio)
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
     schema = self._inferSchema(rdd, samplingRatio)
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
     schema = _infer_schema(first)
   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
     fields = [StructField(k, _infer_type(v), True) for k, v in items]
   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
     raise ValueError("not supported type: %s" % type(obj))
 ValueError: not supported type: <type 'numpy.int64'>
 {code}
 But if we cast to int (not numpy types) first, it's OK:
 {code}
 myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
 data = sqlContext.createDataFrame(myNativeValues) # OK
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6969) Refresh the cached table when REFRESH TABLE is used

2015-04-16 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6969:
---

 Summary: Refresh the cached table when REFRESH TABLE is used
 Key: SPARK-6969
 URL: https://issues.apache.org/jira/browse/SPARK-6969
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical


Right now, {{REFRESH TABLE}} only invalidates the metadata of a table. If a 
table is cached and new files are added manually to the table, users still see 
the cached data after {{REFRESH TABLE}}.
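
Until that is fixed, a minimal sketch of the manual workaround implied by the description, assuming an existing sqlContext and an illustrative table name:

{code}
# Hedged workaround sketch: explicitly drop and rebuild the cache after the
# refresh; "my_table" is an illustrative table name.
sqlContext.sql("REFRESH TABLE my_table")  # today this only invalidates metadata
sqlContext.uncacheTable("my_table")       # drop the stale cached data
sqlContext.cacheTable("my_table")         # re-cache so newly added files are visible
{code}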



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6966) JDBC datasources use Class.forName to load driver

2015-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498802#comment-14498802
 ] 

Apache Spark commented on SPARK-6966:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5543

 JDBC datasources use Class.forName to load driver
 -

 Key: SPARK-6966
 URL: https://issues.apache.org/jira/browse/SPARK-6966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6966) JDBC datasources use Class.forName to load driver

2015-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6966:
---

Assignee: Apache Spark  (was: Michael Armbrust)

 JDBC datasources use Class.forName to load driver
 -

 Key: SPARK-6966
 URL: https://issues.apache.org/jira/browse/SPARK-6966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Apache Spark
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6966) JDBC datasources use Class.forName to load driver

2015-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6966:
---

Assignee: Michael Armbrust  (was: Apache Spark)

 JDBC datasources use Class.forName to load driver
 -

 Key: SPARK-6966
 URL: https://issues.apache.org/jira/browse/SPARK-6966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6967) Internal DateType not handled correctly in caching

2015-04-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6967:

Target Version/s: 1.3.2, 1.4.0

 Internal DateType not handled correctly in caching
 --

 Key: SPARK-6967
 URL: https://issues.apache.org/jira/browse/SPARK-6967
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Adrian Wang
Priority: Blocker

 From the user list.  It looks like dates are not handled correctly in 
 in-memory caching.  We should also check the JDBC datasource support for date.
 {code}
 Stack trace of an exception being reported since upgrade to 1.3.0:
 java.lang.ClassCastException: java.sql.Date cannot be cast to
 java.lang.Integer
 at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:105)
 ~[scala-library-2.11.6.jar:na]
 at
 org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:83)
 ~[spark-catalyst_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:191)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:135)
 ~[spark-sql_2.11-1.3.0.jar:1.3.0]
 at
 {code}
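
For reference, a hedged PySpark sketch of the cache-a-date-column scenario the stack trace points at; the table and column names are illustrative, and whether it reproduces the error depends on the exact code path:

{code}
# Illustrative repro sketch for Spark 1.3.0: cache a DataFrame with a date
# column, then scan it; gathering column statistics during caching is where
# the ClassCastException above is reported.
import datetime
from pyspark.sql import Row

rows = [Row(d=datetime.date(2015, 4, 16)), Row(d=datetime.date(2015, 4, 15))]
df = sqlContext.createDataFrame(rows)
df.registerTempTable("dates_tbl")
sqlContext.cacheTable("dates_tbl")
sqlContext.sql("SELECT COUNT(*) FROM dates_tbl").collect()  # may hit the cast error
{code}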



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498508#comment-14498508
 ] 

Joseph K. Bradley commented on SPARK-6857:
--

[~davies] Yes, that's OK with me.  It's a bit inconsistent:
* In MLlib, we want to encourage users to use numpy and scipy types, rather 
than the mllib.linalg.* types.
* In SQL, it's better if users use Python types or mllib.linalg.* types (for 
which UDTs handle the conversion).

Perhaps the best fix will be better UDTs: If we can register any type (such as 
numpy.array) with the SQLContext as a UDT, then users will be able to use numpy 
and scipy types everywhere.  I hope we can add that support before too long.
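
As a point of comparison, a small hedged example of the path that already works through an existing UDT: MLlib's Vector type carries a UDT, so it can be used directly when creating a DataFrame (column names are illustrative):

{code}
# MLlib vectors already carry a UDT (VectorUDT), so schema inference accepts
# them even though plain numpy types are rejected; names are illustrative.
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row

rows = [Row(label=1.0, features=Vectors.dense([0.1, 0.2])),
        Row(label=0.0, features=Vectors.dense([0.3, 0.4]))]
df = sqlContext.createDataFrame(rows)
df.printSchema()  # the features column shows up as the vector UDT's SQL type
{code}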

 Python SQL schema inference should support numpy types
 --

 Key: SPARK-6857
 URL: https://issues.apache.org/jira/browse/SPARK-6857
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 If you try to use SQL's schema inference to create a DataFrame out of a list 
 or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
 numpy types.  It would be handy if it did.
 E.g.:
 {code}
 import numpy
 from collections import namedtuple
 from pyspark.sql import SQLContext
 MyType = namedtuple('MyType', 'x')
 myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
 sqlContext = SQLContext(sc)
 data = sqlContext.createDataFrame(myValues)
 {code}
 The above code fails with:
 {code}
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
     return self.inferSchema(data, samplingRatio)
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
     schema = self._inferSchema(rdd, samplingRatio)
   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
     schema = _infer_schema(first)
   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
     fields = [StructField(k, _infer_type(v), True) for k, v in items]
   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
     raise ValueError("not supported type: %s" % type(obj))
 ValueError: not supported type: <type 'numpy.int64'>
 {code}
 But if we cast to int (not numpy types) first, it's OK:
 {code}
 myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
 data = sqlContext.createDataFrame(myNativeValues) # OK
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


