[jira] [Closed] (SPARK-10174) refactor out project, filter, ordering generator from SparkPlan

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10174.
---
Resolution: Won't Fix

Closing this since the PR has been closed.

> refactor out project, filter, ordering generator from SparkPlan
> ---
>
> Key: SPARK-10174
> URL: https://issues.apache.org/jira/browse/SPARK-10174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10154) remove the no-longer-necessary CatalystScan

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-10154.
---
Resolution: Won't Fix

Keeping CatalystScan, based on the PR discussion.

> remove the no-longer-necessary CatalystScan
> ---
>
> Key: SPARK-10154
> URL: https://issues.apache.org/jira/browse/SPARK-10154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-8436.
--
Resolution: Won't Fix

> Inconsistent behavior when converting a Timestamp column to Integer/Long and 
> then convert back to Timestamp
> ---
>
> Key: SPARK-8436
> URL: https://issues.apache.org/jira/browse/SPARK-8436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Le Minh Tu
>Priority: Minor
>
> I'm aware that when converting from Integer/LongType to Timestamp, the
> column's values should be in milliseconds. However, I was surprised when I
> tried
> `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()`
> and got back a totally different datetime ('event_time' is initially a
> TimestampType). There must be some constraint in the implementation that I'm
> not aware of, but it would be nice if a double conversion like this returned
> the initial value, as one might expect.
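For reference, a minimal PySpark sketch of the round trip described above; it assumes an existing SparkSession named spark, and the exact values returned depend on the Spark version:
{code}
# Hedged repro sketch of the Timestamp -> Long -> Timestamp round trip from the
# report. Assumes an existing SparkSession named `spark`.
from pyspark.sql.functions import current_timestamp
from pyspark.sql.types import LongType, TimestampType

a = spark.range(1).select(current_timestamp().alias("event_time"))

roundtrip = a.select(
    a["event_time"].astype(LongType()).astype(TimestampType()).alias("event_time"))

# Compare the original value with the doubly cast value.
print(a.first(), roundtrip.first())
{code}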



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557282#comment-15557282
 ] 

Xiao Li commented on SPARK-8436:


After reading the PR description, I believe this is no longer valid.

> Inconsistent behavior when converting a Timestamp column to Integer/Long and 
> then convert back to Timestamp
> ---
>
> Key: SPARK-8436
> URL: https://issues.apache.org/jira/browse/SPARK-8436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Le Minh Tu
>Priority: Minor
>
> I'm aware that when converting from Integer/LongType to Timestamp, the
> column's values should be in milliseconds. However, I was surprised when I
> tried
> `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()`
> and got back a totally different datetime ('event_time' is initially a
> TimestampType). There must be some constraint in the implementation that I'm
> not aware of, but it would be nice if a double conversion like this returned
> the initial value, as one might expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557281#comment-15557281
 ] 

Yanbo Liang commented on SPARK-17825:
-

Sure. You can definitely contribute to this issue after my PR. We will replace
mllib with ml in the future. Thanks.

> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate cluster number.
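For illustration, a hedged sketch of the kind of usage being requested, written against the spark.ml GaussianMixture API; it assumes a SparkSession named spark, and the summary.logLikelihood attribute is an assumption about how the value might be exposed rather than a confirmed API:
{code}
# Hedged sketch: compare training log likelihood across cluster counts.
# `model.summary.logLikelihood` is an assumed accessor, not a confirmed API.
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.1]),), (Vectors.dense([0.2, 0.0]),),
     (Vectors.dense([5.0, 5.1]),), (Vectors.dense([5.2, 4.9]),),
     (Vectors.dense([9.0, 0.1]),), (Vectors.dense([9.2, 0.3]),)],
    ["features"])

for k in (2, 3):
    model = GaussianMixture(k=k, seed=1).fit(data)
    print(k, model.summary.logLikelihood)
{code}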



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14229.
---
Resolution: Won't Fix

I don't think this is really a bug. If you want to save from DataFrames, there
is the DataFrame writer API, and if you want to convert to RDDs you will need
to get your data from Rows into a class that the save function you are working
with understands. That being said, this should be much easier given the new
Dataset API, which allows the data to come back as a case class and can be much
clearer.
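As a rough illustration of the conversion step described above, a hedged sketch that reuses the names from the report (on_time_dataframe and the mongo-hadoop classes); the asDict() mapping is illustrative, not a tested recipe:
{code}
# Hedged sketch: turn each Row into plain key/value Python objects before
# handing the RDD to the Hadoop output format.
config = {"mongo.output.uri":
          "mongodb://localhost:27017/agile_data_science.on_time_performance"}

(on_time_dataframe.rdd
    .map(lambda row: (None, row.asDict()))   # Row -> plain dict the writer can handle
    .saveAsNewAPIHadoopFile(
        path='file://unused',
        outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
        keyClass='org.apache.hadoop.io.Text',
        valueClass='org.apache.hadoop.io.MapWritable',
        conf=config))
{code}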

> PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
> --
>
> Key: SPARK-14229
> URL: https://issues.apache.org/jira/browse/SPARK-14229
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, Spark Shell
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>
> I am able to save data to MongoDB from any RDD... provided that RDD does not 
> belong to a DataFrame. If I use DataFrame.rdd, it is not possible to save via 
> saveAsNewAPIHadoopFile whatsoever. I have tested that this applies to saving 
> to MongoDB, BSON Files, and ElasticSearch.
> I get the following error when I try to save to a HadoopFile:
> config = {"mongo.output.uri": 
> "mongodb://localhost:27017/agile_data_science.on_time_performance"}
> In [3]: on_time_dataframe.rdd.saveAsNewAPIHadoopFile(
>...:   path='file://unused', 
>...:   outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
>...:   keyClass='org.apache.hadoop.io.Text', 
>...:   valueClass='org.apache.hadoop.io.MapWritable', 
>...:   conf=config
>...: )
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 62.7 KB, free 147.3 KB)
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1_piece0 stored 
> as bytes in memory (estimated size 20.4 KB, free 167.7 KB)
> 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
> memory on localhost:61301 (size: 20.4 KB, free: 511.1 MB)
> 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 1 from 
> javaToPython at NativeMethodAccessorImpl.java:-2
> 16/03/28 19:59:57 INFO Configuration.deprecation: mapred.min.split.size is 
> deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 16/03/28 19:59:57 INFO parquet.ParquetRelation: Reading Parquet file(s) from 
> file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet
> 16/03/28 19:59:57 INFO spark.SparkContext: Starting job: take at 
> SerDeUtil.scala:231
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Got job 1 (take at 
> SerDeUtil.scala:231) with 1 output partitions
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 
> (take at SerDeUtil.scala:231)
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting ResultStage 1 
> (MapPartitionsRDD[6] at mapPartitions at SerDeUtil.scala:146), which has no 
> missing parents
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 14.9 KB, free 182.6 KB)
> 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 7.5 KB, free 190.1 KB)
> 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on localhost:61301 (size: 7.5 KB, free: 511.1 MB)
> 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1006
> 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 1 (MapPartitionsRDD[6] at mapPartitions at 
> SerDeUtil.scala:146)
> 16/03/28 19:59:57 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 
> 1 tasks
> 16/03/28 19:59:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 1.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2739 bytes)
> 16/03/28 19:59:57 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 
> 8)
> 16/03/28 19:59:58 INFO 
> parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1: Input split: 
> ParquetInputSplit{part: 
> file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet
>  start: 0 end: 134217728 length: 134217728 hosts: []}
> 16/03/28 19:59:59 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 16/03/28 19:59:59 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 8)
> net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for 

[jira] [Commented] (SPARK-13585) addPyFile behavior change between 1.6 and before

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557248#comment-15557248
 ] 

holdenk commented on SPARK-13585:
-

What is the use case for overwriting the old pyFile? The current Scala API
doesn't have an overwrite flag we can plumb through directly, but we could
think about alternatives, or about whether it would be worth adding one on the
Scala side too.

> addPyFile behavior change between 1.6 and before
> 
>
> Key: SPARK-13585
> URL: https://issues.apache.org/jira/browse/SPARK-13585
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Santhosh Gorantla Ramakrishna
>Priority: Minor
>
> addPyFile in earlier versions would remove the .py file if it already 
> existed. In 1.6, it throws an exception "__.py exists and does not match 
> contents of __.py".
> This might be because the underlying scala code needs an overwrite parameter, 
> and this is being defaulted to false when called from python.
>   private def copyFile(
>   url: String,
>   sourceFile: File,
>   destFile: File,
>   fileOverwrite: Boolean,
>   removeSourceFile: Boolean = false): Unit = {
> Would be good if addPyFile takes a parameter to set the overwrite and default 
> to false.
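A minimal repro sketch of the behavior change being described, assuming an active SparkContext named sc; the error text is paraphrased from the report:
{code}
# Hedged repro sketch. Assumes an active SparkContext named `sc`.
with open("helpers.py", "w") as f:
    f.write("VALUE = 1\n")
sc.addPyFile("helpers.py")   # first call distributes the file

with open("helpers.py", "w") as f:
    f.write("VALUE = 2\n")   # contents changed on disk
sc.addPyFile("helpers.py")   # 1.6 reportedly raises
                             # "helpers.py exists and does not match contents of ..."
                             # instead of overwriting as earlier versions did
{code}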



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557244#comment-15557244
 ] 

holdenk commented on SPARK-13606:
-

Are you still experiencing this?

> Error from python worker:   /usr/local/bin/python2.7: undefined symbol: 
> _PyCodec_LookupTextEncoding
> ---
>
> Key: SPARK-13606
> URL: https://issues.apache.org/jira/browse/SPARK-13606
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Avatar Zhang
>
> Error from python worker:
>   /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: 
> undefined symbol: _PyCodec_LookupTextEncoding
> PYTHONPATH was:
>   
> /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557242#comment-15557242
 ] 

holdenk commented on SPARK-13534:
-

For people following along: Arrow is in the middle of voting on its next
release. While it's likely not yet at the point where we can start using it, it
will be good for those interested (like myself) to take a look once the release
is ready :)

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.
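For context, a hedged sketch of the collect-then-convert path the description criticizes (assumes a SparkSession named spark and pandas installed on the driver):
{code}
# Hedged sketch of the existing slow path: rows are deserialized as Python
# objects and then handed to pandas via from_records, roughly what
# DataFrame.toPandas() did at the time.
import pandas as pd

df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
rows = df.collect()   # list of Row objects, deserialized one by one in Python
pdf = pd.DataFrame.from_records(rows, columns=df.columns)
print(pdf.head())
{code}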



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-13368.
---
Resolution: Fixed

> PySpark JavaModel fails to extract params from Spark side automatically
> ---
>
> Key: SPARK-13368
> URL: https://issues.apache.org/jira/browse/SPARK-13368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> JavaModel fails to extract params from the Spark side automatically, which
> causes model.extractParamMap() to always be empty, as shown in the example
> code below copied from the Spark Guide
> https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param
> {code}
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique
> # IDs for this LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> {code}
> The result of model1.extractParamMap() is {}.
> Question is, should we provide the feature or not? If yes, we need either let 
> Model share same params with Estimator or adds a parent in Model and points 
> to its Estimator; if not, we should remove those lines from example code.
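Until the model-side extraction is sorted out, a hedged workaround sketch that reuses the lr and model1 objects from the example code above:
{code}
# Hedged workaround sketch: read the parameters from the estimator that
# produced the model rather than from the model itself.
print(lr.extractParamMap())      # populated on the estimator
print(model1.extractParamMap())  # reported to be {} at the time of this issue
{code}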



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557239#comment-15557239
 ] 

holdenk commented on SPARK-13368:
-

It seems that we don't have this in the example anymore, although 
https://issues.apache.org/jira/browse/SPARK-10931 / 
https://github.com/apache/spark/pull/14653 are working on this.

> PySpark JavaModel fails to extract params from Spark side automatically
> ---
>
> Key: SPARK-13368
> URL: https://issues.apache.org/jira/browse/SPARK-13368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> JavaModel fails to extract params from the Spark side automatically, which
> causes model.extractParamMap() to always be empty, as shown in the example
> code below copied from the Spark Guide
> https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param
> {code}
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
> (1.0, Vectors.dense([0.0, 1.1, 0.1])),
> (0.0, Vectors.dense([2.0, 1.0, -1.0])),
> (0.0, Vectors.dense([2.0, 1.3, 1.0])),
> (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit().
> # This prints the parameter (name: value) pairs, where names are unique
> # IDs for this LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> {code}
> The result of model1.extractParamMap() is {}.
> Question is, should we provide the feature or not? If yes, we need either let 
> Model share same params with Estimator or adds a parent in Model and points 
> to its Estimator; if not, we should remove those lines from example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9965) Scala, Python SQLContext input methods' deprecation statuses do not match

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-9965:
---
Component/s: (was: SQL)

> Scala, Python SQLContext input methods' deprecation statuses do not match
> -
>
> Key: SPARK-9965
> URL: https://issues.apache.org/jira/browse/SPARK-9965
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Scala's SQLContext has several methods for data input (jsonFile, jsonRDD, 
> etc.) deprecated.  These methods are not deprecated in Python's SQLContext.  
> They should be, but only after Python's DataFrameReader implements analogous 
> methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9938) Constant folding in binaryComparison

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557235#comment-15557235
 ] 

Xiao Li commented on SPARK-9938:


After reading the PR discussion, I think we can close it for now. If needed,
we can reopen it. Thanks!

> Constant folding in binaryComparison
> 
>
> Key: SPARK-9938
> URL: https://issues.apache.org/jira/browse/SPARK-9938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yijie Shen
>
> When we compare an int column with a long literal, for example a <
> Int.MaxValue.toLong + 1L, the analyzed plan is LessThan(Cast(a, LongType),
> Literal).
> The result of the predicate is already determined, so we could fold it at the
> optimization phase.
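A hedged illustration of the predicate in question, assuming a SparkSession named spark; for an int column a, the comparison holds for every possible value, so it could be folded to a constant instead of casting a per row:
{code}
# Hedged illustration. 2147483648 is Int.MaxValue.toLong + 1L.
spark.range(10).selectExpr("CAST(id AS INT) AS a").createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE a < 2147483648").explain(True)
# The analyzed plan shows the comparison as Cast(a, LongType) < 2147483648L,
# which the optimizer could fold away since it is always true for an int column.
{code}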



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9938) Constant folding in binaryComparison

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9938.
--
Resolution: Won't Fix

> Constant folding in binaryComparison
> 
>
> Key: SPARK-9938
> URL: https://issues.apache.org/jira/browse/SPARK-9938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yijie Shen
>
> When we compare an int column with a long literal, for example a <
> Int.MaxValue.toLong + 1L, the analyzed plan is LessThan(Cast(a, LongType),
> Literal).
> The result of the predicate is already determined, so we could fold it at the
> optimization phase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13303) Spark fails with pandas import error when pandas is not explicitly imported by user

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557230#comment-15557230
 ] 

holdenk commented on SPARK-13303:
-

What if we added a requirements file? We have one for our dev tools; having
one for PySpark itself should be pretty reasonable.

> Spark fails with pandas import error when pandas is not explicitly imported 
> by user
> ---
>
> Key: SPARK-13303
> URL: https://issues.apache.org/jira/browse/SPARK-13303
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: The python installation used by the driver (edge node)
> has pandas installed on it, while the Python runtimes used on the data nodes
> do not have pandas installed. Pandas is never explicitly imported by
> pi.py.
>Reporter: Juliet Hougland
>
> Running `spark-submit pi.py` results in:
>   File 
> "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 98, in main
> command = pickleSer._read_with_length(infile)
>   File 
> "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 164, in _read_with_length
> return self.loads(obj)
>   File 
> "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 422, in loads
> return pickle.loads(obj)
> ImportError: No module named pandas.algos
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> This is unexpected and hard for users to unravel why they may see this error, 
> as they themselves have not explicitly done anything with pandas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11722) Rdds could be different between orginal one and save-out-then-read-in one

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557226#comment-15557226
 ] 

holdenk commented on SPARK-11722:
-

Is this still an issue you are experiencing, and if so, do you have repro
code? I'm not completely sure what issue you are hitting.

> Rdds could be different between orginal one and save-out-then-read-in one
> -
>
> Key: SPARK-11722
> URL: https://issues.apache.org/jira/browse/SPARK-11722
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: redhat6.4  64bit;   standalone-cluster ; 3 machines
>Reporter: liangguoning
>
> I found a bug in PySpark.
> I did some operations to create an RDD A, but I found that the data differ
> between the original A and the version saved to HDFS and read back, called B.
> I also printed all of the data inside my function and discovered that A
> indeed contains one record that differs from B.
> That record causes a different result under the same functions.
> I got B through two methods: A.saveAsTextFile and sc.textFile.
> I also checked the raw data and found that B is the correct RDD.
> ---
> I tried another RDD, A2, created via sc.parallelize(A.collect()), and got the
> same result as A.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12776) Implement Python API for Datasets

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557224#comment-15557224
 ] 

holdenk commented on SPARK-12776:
-

Just re-opening discussion here: the migration to Datasets was given as the
reason to take map out of DataFrames in Python, but as [~rxin] mentioned in
SPARK-13233, there aren't any concrete plans to add Datasets right now (as of a
while back). Are there any plans on the Python side for Datasets (cc
[~davies] & [~marmbrus])?

> Implement Python API for Datasets
> -
>
> Key: SPARK-12776
> URL: https://issues.apache.org/jira/browse/SPARK-12776
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kevin Cox
>Priority: Minor
>
> Now that the Dataset API is in Scala and Java it would be awesome to see it 
> show up in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9842) Push down Spark SQL UDF to datasource UDF

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557222#comment-15557222
 ] 

Xiao Li commented on SPARK-9842:


cc [~tsuresh]

> Push down Spark SQL UDF to datasource UDF
> -
>
> Key: SPARK-9842
> URL: https://issues.apache.org/jira/browse/SPARK-9842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>
> Many data sources have UDFs, and Spark SQL has UDFs as well, but filtering
> currently happens only on the Spark SQL side.
> We should create a way to push Spark SQL UDFs down to datasource UDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12100) bug in spark/python/pyspark/rdd.py portable_hash()

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557219#comment-15557219
 ] 

holdenk commented on SPARK-12100:
-

Just noting related progress in https://github.com/apache/spark/pull/11211 / 
SPARK-13330

> bug in spark/python/pyspark/rdd.py portable_hash()
> --
>
> Key: SPARK-12100
> URL: https://issues.apache.org/jira/browse/SPARK-12100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: hashing, pyspark
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I get an exception 'Randomness of hash of string
> should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not
> just set PYTHONHASHSEED?
> Should I file a bug?
> Kind regards
> Andy
> details
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
> Example from documentation does not work out of the box
> Subtract(other, numPartitions=None)
> Return each value in self that is not contained in other.
> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
> >>> y = sc.parallelize([("a", 3), ("c", None)])
> >>> sorted(x.subtract(y).collect())
> [('a', 1), ('b', 4), ('b', 5)]
> It raises 
> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> The following script fixes the problem 
> Sudo printf "\n# set PYTHONHASHSEED so python3 will not generate 
> Exception'Randomness of hash of string should be disabled via 
> PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh  
> /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
> Sudo for i in `cat slaves` ; do scp spark-env.sh 
> root@$i:/root/spark/conf/spark-env.sh; done
> This is how I am starting spark
> export PYSPARK_PYTHON=python3.4
> export PYSPARK_DRIVER_PYTHON=python3.4
> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  
> $SPARK_ROOT/bin/pyspark \
> --master $MASTER_URL \
> --total-executor-cores $numCores \
> --driver-memory 2G \
> --executor-memory 2G \
> $extraPkgs \
> $*
> see email thread "possible bug spark/python/pyspark/rdd.py portable_hash()' 
> on user@spark for more info
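A hedged sketch of pinning the executor-side hash seed through Spark configuration rather than editing spark-env.sh by hand; spark.executorEnv.* is the standard way to set environment variables on executors, and the driver process may additionally need PYTHONHASHSEED exported before launch:
{code}
# Hedged sketch. Assumes no SparkContext is running yet.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(conf=conf)

x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])
print(sorted(x.subtract(y).collect()))   # [('a', 1), ('b', 4), ('b', 5)]
{code}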



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11874) DistributedCache for PySpark

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557217#comment-15557217
 ] 

holdenk commented on SPARK-11874:
-

I think this is not intended to be supported, although I'm not super sure since 
the title of the JIRA seems a bit different than the text. Is it possible that 
some of the work around virtualenv support might meet your needs should it move 
forward?

> DistributedCache for PySpark
> 
>
> Key: SPARK-11874
> URL: https://issues.apache.org/jira/browse/SPARK-11874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Ranjana Rajendran
>
> I have access only to the workbench of a cluster. All the nodes have only 
> python 2.6. I want to use PySpark with iPython notebook with Python 2.7. 
> I created a python2.7 virtual environment as follows:
> conda create -n py27 python=2.7 anaconda
> source activate py27
> I installed all required modules in py27 .
> Created a zip for the py27 virtual environment.
> zip -r py27.zip py27
> hadoop fs -put py27.zip
> Now
> export PYSPARK_DRIVER_PYTHON=ipython
> export PYSPARK_DRIVER_PYTHON_OPTS=notebook
> export PYSPARK_PYTHON=./py27/bin/python
> export 
> PYTHONPATH=/opt/spark/python/lib/py4j-0.8.2.1-src.zip:/opt/spark/python/:PYSPARK_DRIVER_PYTHON=ipython
> I launched pyspark as follows:
> /opt/spark/bin/pyspark --verbose  --name iPythondemo --conf 
> spark.yarn.executor.memoryOverhead=2048 --conf 
> spark.eventLog.dir=${spark_event_log_dir}$USER/ --master yarn --deploy-mode 
> client --archives hdfs:///user/alti_ranjana/py27.zip#py27 --executor-memory 
> 8G  --executor-cores 2 --queue default --num-executors 48 $spark_opts_extra
> When I try to run a job in client mode, i.e. making use of executors running 
> on all the nodes, 
> I get an error stating that the file ./py27/bin/python does not exist.
> I also tried launching pyspark specifying the argument --file py27.zip#py27,
> and I get the error
> Exception in thread "main" java.lang.IllegalArgumentException: pyspark does 
> not support any application options.
> Am I doing this the right way ?  Is there something wrong in the way I am 
> doing this or is this a known issue ?  Is PySpark working for 
> DistributedCache of zip files ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12774) DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator or rows

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-12774.
---
Resolution: Won't Fix

In some ways, yes, avoiding unnecessary iteration can be good, but allowing
Spark to spill is also important. That being said, map and mapPartitions have
been temporarily removed from DataFrame while the Dataset API is sorted out in
Python, so I don't think this is likely to get in (although it may be worth
getting involved in the new DataFrame API discussions if you are interested).

> DataFrame.mapPartitions apply function operates on Pandas DataFrame instead 
> of a generator or rows
> --
>
> Key: SPARK-12774
> URL: https://issues.apache.org/jira/browse/SPARK-12774
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh
>  Labels: dataframe, mapPartitions, pandas
>
> Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions
> in both Spark and PySpark. The function _f_ that is applied to each partition
> must operate on a generator of rows. This is, however, very inefficient in
> Python. It would be more logical and efficient if the apply function _f_
> operated on Pandas DataFrames instead and also returned a DataFrame. This
> avoids unnecessary iteration in Python, which is slow.
> Currently:
> {code}
> def apply_function(rows):
>     df = pd.DataFrame(list(rows))
>     df = df % 100   # Do something on df
>     return df.values.tolist()
> table = sqlContext.read.parquet("")
> table = table.mapPartitions(apply_function)
> {code}
> New apply function would accept a Pandas DataFrame and return a DataFrame:
> {code}
> def apply_function(df):
>     df = df % 100   # Do something on df
>     return df
> {code}
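A hedged sketch of how the proposed pandas-per-partition behavior can be emulated on the existing RDD API today; it assumes a DataFrame named table with numeric columns and pandas available on the executors:
{code}
# Hedged emulation sketch: wrap a pandas-level function so each partition is
# converted to and from a pandas DataFrame.
import pandas as pd

def with_pandas(f):
    def run(rows):
        pdf = pd.DataFrame([r.asDict() for r in rows])
        if pdf.empty:
            return iter([])
        return iter(f(pdf).values.tolist())   # back to plain rows
    return run

result = table.rdd.mapPartitions(with_pandas(lambda df: df % 100))
{code}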



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9732) remove the unsafe -> safe conversion

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557203#comment-15557203
 ] 

Xiao Li commented on SPARK-9732:


Should we close this?

> remove the unsafe -> safe conversion
> 
>
> Key: SPARK-9732
> URL: https://issues.apache.org/jira/browse/SPARK-9732
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11571) Twitter Api for PySpark

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557205#comment-15557205
 ] 

holdenk commented on SPARK-11571:
-

Is there anything specific you are looking to do with this API? It can
certainly be useful for demos, but I'm not sure it would be worth doing.

> Twitter Api for PySpark
> ---
>
> Key: SPARK-11571
> URL: https://issues.apache.org/jira/browse/SPARK-11571
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Streaming
>Reporter: Arda Mert
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557195#comment-15557195
 ] 

holdenk commented on SPARK-3600:


Is this something we still want to work on or does `Datasets` make this 
addition less interesting?

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3513) Provide a utility for running a function once on each executor

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557193#comment-15557193
 ] 

holdenk commented on SPARK-3513:


This seems closely related to SPARK-650 and SPARK-636 as well. Is this general
group of tasks something we want to look at enabling, or are the current
workarounds (either mapPartitions or broadcast with custom initialization)
"good enough"? (cc [~matei], [~joshrosen] & [~patrick], who have commented on
the different related JIRAs)

> Provide a utility for running a function once on each executor
> --
>
> Key: SPARK-3513
> URL: https://issues.apache.org/jira/browse/SPARK-3513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> This is minor, but it would be nice to have a utility where you can pass a
> function and it will run that arbitrary function once on each executor
> and return the result to you (e.g. you could perform a jstack from within the
> JVM). You could probably hack it together with custom locality preferences,
> accessing the list of live executors, and mapPartitions.
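A hedged sketch of the mapPartitions-style workaround mentioned above, assuming an active SparkContext named sc; one task per default partition is only a rough stand-in for "once per executor":
{code}
# Hedged workaround sketch: run a small diagnostic closure in every task slot.
import socket

def run_diagnostic(_):
    # Arbitrary per-JVM work would go here (e.g. triggering a jstack).
    yield socket.gethostname()

n = sc.defaultParallelism
hosts = sc.parallelize(range(n), n).mapPartitions(run_diagnostic).collect()
print(set(hosts))
{code}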



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9764) Spark SQL uses table metadata inconsistently

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9764.
--
Resolution: Fixed

> Spark SQL uses table metadata inconsistently
> 
>
> Key: SPARK-9764
> URL: https://issues.apache.org/jira/browse/SPARK-9764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> For the same table, {{DESCRIBE}} and {{SHOW COLUMNS}} produce different 
> results. The former shows the correct column names. The latter always shows 
> just a single column named {{col}}. This is true for any table created with 
> {{HiveContext.read.json}}.
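A hedged repro sketch of the reported mismatch, assuming a Hive-enabled SparkSession named spark; the JSON path and table name are placeholders:
{code}
# Hedged repro sketch of DESCRIBE vs. SHOW COLUMNS on a JSON-derived table.
people = spark.read.json("/tmp/people.json")
people.write.saveAsTable("people_json")

spark.sql("DESCRIBE people_json").show()          # expected: the real column names
spark.sql("SHOW COLUMNS IN people_json").show()   # reportedly showed a single `col`
{code}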



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9764) Spark SQL uses table metadata inconsistently

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557188#comment-15557188
 ] 

Xiao Li commented on SPARK-9764:


This should be resolved in the latest branch. Please check it. Thanks!

> Spark SQL uses table metadata inconsistently
> 
>
> Key: SPARK-9764
> URL: https://issues.apache.org/jira/browse/SPARK-9764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> For the same table, {{DESCRIBE}} and {{SHOW COLUMNS}} produce different 
> results. The former shows the correct column names. The latter always shows 
> just a single column named {{col}}. This is true for any table created with 
> {{HiveContext.read.json}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9764) Spark SQL uses table metadata inconsistently

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557188#comment-15557188
 ] 

Xiao Li edited comment on SPARK-9764 at 10/8/16 4:38 AM:
-

This should be resolved by the native DDL support introduced in 2.0. Please
check it. Thanks!


was (Author: smilegator):
This should be resolved in the latest branch. Please check it. Thanks!

> Spark SQL uses table metadata inconsistently
> 
>
> Key: SPARK-9764
> URL: https://issues.apache.org/jira/browse/SPARK-9764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> For the same table, {{DESCRIBE}} and {{SHOW COLUMNS}} produce different 
> results. The former shows the correct column names. The latter always shows 
> just a single column named {{col}}. This is true for any table created with 
> {{HiveContext.read.json}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3348) Support user-defined SparkListeners properly

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557186#comment-15557186
 ] 

holdenk commented on SPARK-3348:


Is there still interest in seeing this happen? Should we ping the dev@ list on 
this one to see if anyone wants to collect those events for some purpose (and 
if so what)?

> Support user-defined SparkListeners properly
> 
>
> Key: SPARK-3348
> URL: https://issues.apache.org/jira/browse/SPARK-3348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> Because of the current initialization ordering, user-defined SparkListeners 
> do not receive certain events that are posted before application code is run. 
> We need to expose a constructor that allows the given SparkListeners to 
> receive all events.
> There has been interest in this for a while, but I have searched through the 
> JIRA history and have not found a related issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9342) Spark SQL views don't work

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557184#comment-15557184
 ] 

Xiao Li commented on SPARK-9342:


This should be resolved as of Spark 2.0. Please check it, and if you still hit
the issue, reopen it. Thanks!

> Spark SQL views don't work
> --
>
> Key: SPARK-9342
> URL: https://issues.apache.org/jira/browse/SPARK-9342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: sql, views
>
> The Spark SQL documentation's section on Hive support claims that views are 
> supported. However, even basic view operations fail with exceptions related 
> to column resolution. 
> For example,
> {code}
> // The test table has columns category & num
> ctx.sql("create view view1 as select * from test")
> ctx.table("view1").printSchema
> {code}
> generates
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'test.col' given input 
> columns category, num; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ...
> {code}
> You can see a standalone reproducible example with full spark-shell output 
> demonstrating the problem at 
> [https://gist.github.com/ssimeonov/57164f9d6b928ba0cfde]
> The problem is that {{ctx.sql("create view view1 as select * from test")}} 
> puts the following in the metastore including {{cols:[FieldSchema(name:col, 
> type:string, comment:null)]}} even though the {{test}} table has {{category}} 
> and {{num}} columns:
> {code}
> 15/07/26 15:47:28 INFO HiveMetaStore: 0: create_table: Table(tableName:view1, 
> dbName:default, owner:ubuntu, createTime:1437925648, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:string, 
> comment:null)], location:null, 
> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:select * from test, viewExpandedText:select `test`.`col` 
> from `default`.`test`, tableType:VIRTUAL_VIEW)
> 15/07/26 15:47:28 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
> cmd=create_table: Table(tableName:view1, dbName:default, owner:ubuntu, 
> createTime:1437925648, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:col, type:string, comment:null)], 
> location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:select * from test, viewExpandedText:select `test`.`col` 
> from `default`.`test`, tableType:VIRTUAL_VIEW)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9342) Spark SQL views don't work

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-9342.

Resolution: Fixed

> Spark SQL views don't work
> --
>
> Key: SPARK-9342
> URL: https://issues.apache.org/jira/browse/SPARK-9342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: sql, views
>
> The Spark SQL documentation's section on Hive support claims that views are 
> supported. However, even basic view operations fail with exceptions related 
> to column resolution. 
> For example,
> {code}
> // The test table has columns category & num
> ctx.sql("create view view1 as select * from test")
> ctx.table("view1").printSchema
> {code}
> generates
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'test.col' given input 
> columns category, num; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ...
> {code}
> You can see a standalone reproducible example with full spark-shell output 
> demonstrating the problem at 
> [https://gist.github.com/ssimeonov/57164f9d6b928ba0cfde]
> The problem is that {{ctx.sql("create view view1 as select * from test")}} 
> puts the following in the metastore including {{cols:[FieldSchema(name:col, 
> type:string, comment:null)]}} even though the {{test}} table has {{category}} 
> and {{num}} columns:
> {code}
> 15/07/26 15:47:28 INFO HiveMetaStore: 0: create_table: Table(tableName:view1, 
> dbName:default, owner:ubuntu, createTime:1437925648, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:string, 
> comment:null)], location:null, 
> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:select * from test, viewExpandedText:select `test`.`col` 
> from `default`.`test`, tableType:VIRTUAL_VIEW)
> 15/07/26 15:47:28 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
> cmd=create_table: Table(tableName:view1, dbName:default, owner:ubuntu, 
> createTime:1437925648, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:col, type:string, comment:null)], 
> location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{}, 
> viewOriginalText:select * from test, viewExpandedText:select `test`.`col` 
> from `default`.`test`, tableType:VIRTUAL_VIEW)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3312) Add a groupByKey which returns a special GroupBy object like in pandas

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-3312.
--
Resolution: Won't Fix

> Add a groupByKey which returns a special GroupBy object like in pandas
> --
>
> Key: SPARK-3312
> URL: https://issues.apache.org/jira/browse/SPARK-3312
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> A common pattern which causes problems for new Spark users is using
> groupByKey followed by a reduce. I'd like to make a special version of
> groupByKey which returns a groupBy object (like pandas' groupby object).
> The resulting class would have a number of functions (min, max, stats, reduce)
> which could all be implemented efficiently.
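A hedged illustration of the pattern this issue targets, assuming an active SparkContext named sc; the aggregations a GroupBy-style object would expose (min, max, stats, reduce) can already be computed with reduceByKey without buffering whole groups:
{code}
# Hedged illustration of the problematic pattern vs. the efficient equivalent.
pairs = sc.parallelize([("a", 1), ("a", 5), ("b", 2), ("b", 7)])

maxes_slow = pairs.groupByKey().mapValues(max)   # buffers every value per key
maxes_fast = pairs.reduceByKey(max)              # what a GroupBy object could wrap

print(sorted(maxes_fast.collect()))              # [('a', 5), ('b', 7)]
{code}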



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-07 Thread Lei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557181#comment-15557181
 ] 

Lei Wang commented on SPARK-17825:
--

That's good. May I take part in this work?
By the way, are you planning to replace mllib with ml in the future?



> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate cluster number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3312) Add a groupByKey which returns a special GroupBy object like in pandas

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557182#comment-15557182
 ] 

holdenk commented on SPARK-3312:


I'm going to go ahead and close this: now that `Datasets` are here, they pretty
much provide a better version of this than we could have built with RDDs.

> Add a groupByKey which returns a special GroupBy object like in pandas
> --
>
> Key: SPARK-3312
> URL: https://issues.apache.org/jira/browse/SPARK-3312
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> A common pattern which causes problems for new Spark users is using 
> groupByKey followed by a reduce. I'd like to make a special version of 
> groupByKey which returns a groupBy object (like pandas' groupby object). 
> The resulting class would have a number of functions (min, max, stats, reduce) 
> which could all be implemented efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557178#comment-15557178
 ] 

holdenk commented on SPARK-3132:


Is there any progress on this, or would it be ok for me to take a look? Since 
I've been working on Python accumulators, improving Python broadcast sounds 
reasonable too :)

> Avoid serialization for Array[Byte] in TorrentBroadcast
> ---
>
> Key: SPARK-3132
> URL: https://issues.apache.org/jira/browse/SPARK-3132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Davies Liu
>
> If the input data is a byte array, we should allow TorrentBroadcast to skip 
> serializing and compressing the input.
> To do this, we should add a new parameter (shortCircuitByteArray) to 
> TorrentBroadcast, and then avoid serialization if the input is a byte array 
> and shortCircuitByteArray is true.
> We should then also do compression in task serialization itself instead of 
> doing it in TorrentBroadcast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8957) Backport Hive 1.X support to Branch 1.4

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-8957.
--
Resolution: Won't Fix

Thanks. Closing it now.

> Backport Hive 1.X support to Branch 1.4
> ---
>
> Key: SPARK-8957
> URL: https://issues.apache.org/jira/browse/SPARK-8957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> We almost never do feature backports. But I think it would be really useful 
> to backport support for newer Hive versions to the 1.4 branch, for the 
> following reasons:
> 1. It blocks a large number of users from using Hive support.
> 2. It's a "relatively" small set of patches, since most of the heavy lifting 
> was done in Spark 1.4.0's classloader refactoring.
> 3. Some distributions have already done this, with success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2722) Mechanism for escaping spark configs is not consistent

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557158#comment-15557158
 ] 

holdenk commented on SPARK-2722:


I think at this point, trying to change the behavior of the different 
configuration escaping mechanisms is likely to break more things than it fixes - 
how do people feel about closing this?

> Mechanism for escaping spark configs is not consistent
> --
>
> Key: SPARK-2722
> URL: https://issues.apache.org/jira/browse/SPARK-2722
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> Currently, you can specify a spark config in spark-defaults.conf as follows:
> {code}
> spark.magic "Mr. Johnson"
> {code}
> and this will preserve the double quotes as part of the string. Naturally, if 
> you want to do the equivalent in spark.*.extraJavaOptions, you would use the 
> following:
> {code}
> spark.executor.extraJavaOptions "-Dmagic=\"Mr. Johnson\""
> {code}
> However, this fails because the backslashes go away and it tries to interpret 
> "Johnson" as the main class argument. Instead, you have to do the following:
> {code}
> spark.executor.extraJavaOptions "-Dmagic=\\\"Mr. Johnson\\\""
> {code}
> which is not super intuitive.
> Note that this only applies to standalone mode. In YARN it's not even 
> possible to use quoted strings in config values (SPARK-2718).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1792) Missing Spark-Shell Configure Options

2016-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-1792.
--
Resolution: Fixed

> Missing Spark-Shell Configure Options
> -
>
> Key: SPARK-1792
> URL: https://issues.apache.org/jira/browse/SPARK-1792
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Joseph E. Gonzalez
>
> The `conf/spark-env.sh.template` does not have configuration options for the 
> spark shell. For example, to enable Kryo for GraphX when using the spark 
> shell in standalone mode, it appears you must add:
> {code}
> SPARK_SUBMIT_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
>  "
> SPARK_SUBMIT_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
>   "
> {code}
> However SPARK_SUBMIT_OPTS is not documented anywhere.  Perhaps the 
> spark-shell should have its own options (e.g., SPARK_SHELL_OPTS).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2032) Add an RDD.samplePartitions method for partition-level sampling

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557153#comment-15557153
 ] 

holdenk commented on SPARK-2032:


I'm assuming, since there hasn't been any activity for a while, that [~prashant_] isn't 
working on this anymore? Is this something we still think would make sense to 
add to the RDD API?
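
For reference, a rough Scala sketch of what partition-level sampling can look like today 
with mapPartitionsWithIndex (the helper name and fraction are made up; this still launches 
a task per partition, it just never iterates records from the dropped partitions):

{code}
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Keep roughly `fraction` of the partitions wholesale and drop the rest.
def samplePartitions[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long = 42L): RDD[T] = {
  val rng = new Random(seed)
  val keep = (0 until rdd.partitions.length).filter(_ => rng.nextDouble() < fraction).toSet
  rdd.mapPartitionsWithIndex(
    (idx, iter) => if (keep.contains(idx)) iter else Iterator.empty,
    preservesPartitioning = true)
}
{code}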

> Add an RDD.samplePartitions method for partition-level sampling
> ---
>
> Key: SPARK-2032
> URL: https://issues.apache.org/jira/browse/SPARK-2032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>Priority: Minor
>
> This would allow us to sample a percent of the partitions and not have to 
> materialize all of them. It's less uniform but much faster and may be useful 
> for quickly exploring data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1865) Improve behavior of cleanup of disk state

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557149#comment-15557149
 ] 

holdenk commented on SPARK-1865:


So ALS specifically has a workaround for this by cleaning up shuffle files 
on checkpoint, but if this is impacting other workloads we could look at 
trying to make that solution more general.

> Improve behavior of cleanup of disk state
> -
>
> Key: SPARK-1865
> URL: https://issues.apache.org/jira/browse/SPARK-1865
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Aaron Davidson
>
> Right now the behavior of disk cleanup is centered around the exit hook of 
> the executor, which attempts to cleanup shuffle files and disk manager 
> blocks, but may fail. We should make this behavior more predictable, perhaps 
> by letting the Standalone Worker cleanup the disk state, and adding a flag to 
> disable having the executor cleanup its own state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1792) Missing Spark-Shell Configure Options

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557146#comment-15557146
 ] 

holdenk commented on SPARK-1792:


It feels like we've already got a pretty good mechanism for handling this with 
`spark-defaults.conf`, and we've made a lot of progress with documenting many of 
the configuration environment variables in 
http://spark.apache.org/docs/latest/submitting-applications.html and 
http://spark.apache.org/docs/latest/configuration.html , so personally I think 
we should probably mark this issue as resolved.
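
For example, the settings from the description can go in `conf/spark-defaults.conf` 
instead of SPARK_SUBMIT_OPTS (a sketch, not tied to any particular deployment):

{code}
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  org.apache.spark.graphx.GraphKryoRegistrator
{code}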

> Missing Spark-Shell Configure Options
> -
>
> Key: SPARK-1792
> URL: https://issues.apache.org/jira/browse/SPARK-1792
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Joseph E. Gonzalez
>
> The `conf/spark-env.sh.template` does not have configuration options for the 
> spark shell. For example, to enable Kryo for GraphX when using the spark 
> shell in standalone mode, it appears you must add:
> {code}
> SPARK_SUBMIT_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
>  "
> SPARK_SUBMIT_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
>   "
> {code}
> However SPARK_SUBMIT_OPTS is not documented anywhere.  Perhaps the 
> spark-shell should have its own options (e.g., SPARK_SHELL_OPTS).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9309) Support DecimalType and TimestampType in UnsafeRowConverter

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9309.
--
Resolution: Won't Fix

Based on the above discussion, it sounds like this JIRA is not needed. Closing it now.

> Support DecimalType and TimestampType in UnsafeRowConverter
> ---
>
> Key: SPARK-9309
> URL: https://issues.apache.org/jira/browse/SPARK-9309
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Takeshi Yamamuro
>
> Spark SQL currently does not support these two types in UnsafeRowConverter.
> That is, a DecimalUnsafeColumnWriter and a TimestampUnsafeColumnWriter should be 
> added to UnsafeRowConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17825:

Component/s: (was: MLlib)
 ML

> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate cluster number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557137#comment-15557137
 ] 

Yanbo Liang commented on SPARK-17825:
-

[~is03wlei] This task depends on copying the GaussianMixture implementation 
from mllib to ml, which I have started working on. I will ping you when the PR is 
available; it should be up in a few days. Thanks!
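
For context, a small spark-shell sketch of the current RDD-based API (toy data made up here): 
the fitted model exposes the mixture weights and gaussians, but not the final log likelihood 
of the EM run, which is what this issue asks to expose.

{code}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Fit a 2-component GMM on a toy 1-D dataset (sc is the spark-shell SparkContext).
val data = sc.parallelize(Seq(1.0, 1.2, 0.9, 9.8, 10.1).map(x => Vectors.dense(x)))
val gmm = new GaussianMixture().setK(2).setSeed(1L).run(data)

// Today we can inspect the mixture components...
gmm.weights.zip(gmm.gaussians).foreach { case (w, g) =>
  println(s"weight=$w mean=${g.mu}")
}
// ...but there is no log likelihood on the model to use for choosing the cluster number.
{code}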

> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate cluster number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8957) Backport Hive 1.X support to Branch 1.4

2016-10-07 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557139#comment-15557139
 ] 

Michael Armbrust commented on SPARK-8957:
-

Yeah, close it.


> Backport Hive 1.X support to Branch 1.4
> ---
>
> Key: SPARK-8957
> URL: https://issues.apache.org/jira/browse/SPARK-8957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> We almost never do feature backports. But I think it would be really useful 
> to backport support for newer Hive versions to the 1.4 branch, for the 
> following reasons:
> 1. It blocks a large number of users from using Hive support.
> 2. It's a "relatively" small set of patches, since most of the heavy lifting 
> was done in Spark 1.4.0's classloader refactoring.
> 3. Some distributions have already done this, with success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1762) Add functionality to pin RDDs in cache

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557133#comment-15557133
 ] 

holdenk commented on SPARK-1762:


Is this something we are still interested in? I could see it becoming more 
important with `Datasets`/`DataFrames`, where a partial cache miss is potentially 
much more expensive than with `RDD`s.
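
As a partial workaround today (not true pinning), an important RDD can at least be 
persisted at a level that spills to disk instead of being dropped on eviction; a quick 
spark-shell sketch with a made-up path:

{code}
import org.apache.spark.storage.StorageLevel

// Under memory pressure these blocks spill to disk rather than being discarded outright.
val important = sc.textFile("hdfs:///data/important").persist(StorageLevel.MEMORY_AND_DISK)
important.count()  // materialize the cache
{code}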

> Add functionality to pin RDDs in cache
> --
>
> Key: SPARK-1762
> URL: https://issues.apache.org/jira/browse/SPARK-1762
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> Right now, all RDDs are created equal, and there is no mechanism to identify 
> a certain RDD to be more important than the rest. This is a problem if the 
> RDD fraction is small, because just caching a few RDDs can evict more 
> important ones.
> A side effect of this feature is that we can now more safely allocate a 
> smaller spark.storage.memoryFraction if we know how large our important RDDs 
> are, without having to worry about them being evicted. This allows us to use 
> more memory for shuffles, for instance, and avoid disk spills.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6802) User Defined Aggregate Function Refactoring

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-6802:
---
Component/s: (was: SQL)

> User Defined Aggregate Function Refactoring
> ---
>
> Key: SPARK-6802
> URL: https://issues.apache.org/jira/browse/SPARK-6802
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: We use Spark Dataframe, SQL along with json, sql and 
> pandas quite a bit
>Reporter: cynepia
>
> While trying to use custom aggregates in Spark (something which is common in 
> pandas), we realized that custom aggregate functions aren't well supported 
> across various features/functions in Spark beyond what is supported by Hive. 
> There are further discussions on the topic vis-a-vis SPARK-3947, 
> which points to similar improvement tickets opened earlier for refactoring 
> the UDAF area.
> While we refactor the interface for aggregates, it would make sense to take 
> into consideration the recently added DataFrame, GroupedData, and possibly 
> also sql.dataframe.Column, which looks different from pandas.Series and doesn't 
> currently support any aggregations.
> We would like to get feedback from the folks who are actively looking at this.
> We would be happy to participate and contribute if there are any discussions 
> on the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9189) Takes locality and the sum of partition length into account when partition is instance of HadoopPartition in operator coalesce

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-9189:
---
Component/s: (was: SQL)
 Spark Core

> Takes locality and the sum of partition length into account when partition is 
> instance of HadoopPartition in operator coalesce
> --
>
> Key: SPARK-9189
> URL: https://issues.apache.org/jira/browse/SPARK-9189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Yadong Qi
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9194) fix case-insensitive bug for aggregation expression which is not PartialAggregate

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-9194.
--
Resolution: Won't Fix

> fix case-insensitive bug for aggregation expression which is not 
> PartialAggregate
> -
>
> Key: SPARK-9194
> URL: https://issues.apache.org/jira/browse/SPARK-9194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8624) DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-8624.

Resolution: Won't Fix

> DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet
> 
>
> Key: SPARK-8624
> URL: https://issues.apache.org/jira/browse/SPARK-8624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Rex Xiong
>  Labels: parquet
>
> In 1.4.0, parquet is read by DataFrameReader.parquet; when creating the 
> ParquetRelation2 object, "parameters" is hard-coded as "Map.empty[String, 
> String]", so ParquetRelation2.shouldMergeSchemas is always true (the default 
> value).
> In the previous version, the spark.sql.hive.convertMetastoreParquet.mergeSchema 
> config was respected.
> This bug degrades performance a lot for a folder with hundreds of parquet 
> files when we don't want a schema merge.
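
For readers hitting this: in later releases the merge behavior can be controlled per read 
(or globally via spark.sql.parquet.mergeSchema), so schema merging does not have to be paid 
on every folder scan. A spark-shell sketch with a made-up path:

{code}
// Read a folder of parquet files without merging schemas across files.
val df = spark.read
  .option("mergeSchema", "false")
  .parquet("hdfs:///warehouse/events")
{code}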



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8957) Backport Hive 1.X support to Branch 1.4

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557099#comment-15557099
 ] 

Xiao Li commented on SPARK-8957:


Is this still needed? Maybe we should close it?

> Backport Hive 1.X support to Branch 1.4
> ---
>
> Key: SPARK-8957
> URL: https://issues.apache.org/jira/browse/SPARK-8957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> We almost never do feature backports. But I think it would be really useful 
> to backport support for newer Hive versions to the 1.4 branch, for the 
> following reasons:
> 1. It blocks a large number of users from using Hive support.
> 2. It's a "relatively" small set of patches, since most of the heavy lifting 
> was done in Spark 1.4.0's classloader refactoring.
> 3. Some distributions have already done this, with success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10161) Support Pyspark shell over Mesos Cluster Mode

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557069#comment-15557069
 ] 

holdenk commented on SPARK-10161:
-

That being said - I'm not sure I see the value of this?

> Support Pyspark shell over Mesos Cluster Mode
> -
>
> Key: SPARK-10161
> URL: https://issues.apache.org/jira/browse/SPARK-10161
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, PySpark
>Reporter: Timothy Chen
>
> It's not possible to run the PySpark shell in cluster mode, since the shell that 
> is running in the cluster is not able to interact with the client.
> We could build a proxy that transfers the user's input and the 
> shell's output, and that also lets the user connect and reconnect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10161) Support Pyspark shell over Mesos Cluster Mode

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557068#comment-15557068
 ] 

holdenk commented on SPARK-10161:
-

I think this is an issue across cluster modes; maybe using the Jupyter protocol 
would help?

> Support Pyspark shell over Mesos Cluster Mode
> -
>
> Key: SPARK-10161
> URL: https://issues.apache.org/jira/browse/SPARK-10161
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, PySpark
>Reporter: Timothy Chen
>
> It's not possible to run the PySpark shell in cluster mode, since the shell that 
> is running in the cluster is not able to interact with the client.
> We could build a proxy that transfers the user's input and the 
> shell's output, and that also lets the user connect and reconnect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557034#comment-15557034
 ] 

Vincent commented on SPARK-17219:
-

[~josephkb] [~srowen] [~timhunter] let me know what I can do to help if there 
is anything.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there, though, in case future data has nulls even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557021#comment-15557021
 ] 

Vincent commented on SPARK-17219:
-

In this PR (https://github.com/apache/spark/pull/14858), NaN values are always 
put into the last bucket.
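
A minimal spark-shell sketch of the discretizer on a column with NaNs (toy data; how the 
NaN rows are handled, extra bucket versus error, depends on the Spark version and the 
linked PR):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import spark.implicits._

val df = Seq((0, 18.0), (1, 35.0), (2, 52.0), (3, Double.NaN)).toDF("id", "age")

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(3)

// fit() computes the quantile splits; transform() assigns each row a bucket index.
discretizer.fit(df).transform(df).show()
{code}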

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there, though, in case future data has nulls even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-9487:
---
Labels: starter  (was: )

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557018#comment-15557018
 ] 

holdenk commented on SPARK-9487:


This will maybe break some tests in the process, but it would probably be good. 
I'd go with 4 rather than 2, if only for the old streaming tests (so if we want to 
be consistent, 4 everywhere). Is this something people are interested in 
pursuing? If so, maybe we should make it a starter issue?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17782) Kafka 010 test is flaky

2016-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557013#comment-15557013
 ] 

Apache Spark commented on SPARK-17782:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15401

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Herman van Hovell
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8760) allow moving and symlinking binaries

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-8760.
--
Resolution: Fixed

This is a "partially fixed" but I think fixed is a close enough description. We 
don't use readlink but we do check for the current SPARK_HOME before running 
and if the user has a custom SPARK_HOME then can just point it there since we 
no longer overwrite it in `spark-submit`.

> allow moving and symlinking binaries
> 
>
> Key: SPARK-8760
> URL: https://issues.apache.org/jira/browse/SPARK-8760
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell, Spark Submit, SparkR
>Affects Versions: 1.4.0
>Reporter: Philipp Angerer
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> You use the following line to determine {{$SPARK_HOME}} in all binaries:
> {code:none}
> export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
> {code}
> However, users should be able to override this. Also, symlinks should be 
> followed:
> {code:none}
> if [[ -z "$SPARK_HOME" ]]; then
>   export SPARK_HOME="$(dirname "$(readlink -f "$0")")"
> fi
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8757) Check missing and add user guide for MLlib Python API

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-8757.
--
Resolution: Fixed

All sub-issues are fixed, and we are well past the 1.5 release.

> Check missing and add user guide for MLlib Python API
> -
>
> Key: SPARK-8757
> URL: https://issues.apache.org/jira/browse/SPARK-8757
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>
> Some MLlib algorithms are missing a user guide for Python; we need to check and add 
> them.
> The algorithms that are missing Python user guides are listed below. Please 
> add to the list here if you find one more.
> * For MLlib
> ** Isotonic regression (Python example)
> ** LDA (Python example)
> ** Streaming k-means (Java/Python examples)
> ** PCA (Python example)
> ** SVD (Python example)
> ** FP-growth (Python example)
> * For ML
> ** feature
> *** CountVectorizerModel (user guide and examples)
> *** DCT (user guide and examples)
> *** MinMaxScaler (user guide and examples)
> *** StopWordsRemover (user guide and examples)
> *** VectorSlicer (user guide and examples)
> *** ElementwiseProduct (python example)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8842) Spark SQL - Insert into table Issue

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556986#comment-15556986
 ] 

Xiao Li commented on SPARK-8842:


Could you retry it in the latest master branch? Thanks!


> Spark SQL - Insert into table Issue
> ---
>
> Key: SPARK-8842
> URL: https://issues.apache.org/jira/browse/SPARK-8842
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: James Greenwood
>
> I am running spark 1.4 and currently experiencing an issue when inserting 
> data into a table. The data is loaded into an initial table and then selected 
> from this table, processed and then inserted into a second table. The issue 
> is that some of the data goes missing when inserted into the second table 
> when running in a multi-worker configuration (a master, a worker on the 
> master and then a worker on a different host). 
> I have narrowed down the problem to the insert into the second table. An 
> example process to generate the problem is below. 
> Generate a file (for example /home/spark/test) with the numbers 1 to 50 on 
> separate lines. 
> spark-sql --master spark://spark-master:7077 --hiveconf 
> hive.metastore.warehouse.dir=/spark 
> (/spark is shared between all hosts) 
> create table test(field string); 
> load data inpath '/home/spark/test' into table test; 
> create table processed(field string); 
> from test insert into table processed select *; 
> select * from processed; 
> The result from the final select does not contain all the numbers 1 to 50. 
> I have also run the above example in some different configurations :- 
> - When there is just one worker running on the master. The result of the 
> final select is the rows 1-50 i.e all data as expected. 
> - When there is just one worker running on a host which is not the master. 
> The final select returns no rows.
> No errors are logged in the log files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11272) Support importing and exporting event logs from HistoryServer web portal

2016-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556866#comment-15556866
 ] 

Apache Spark commented on SPARK-11272:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/15400

> Support importing and exporting event logs from HistoryServer web portal
> 
>
> Key: SPARK-11272
> URL: https://issues.apache.org/jira/browse/SPARK-11272
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Saisai Shao
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> Many users who hit problems when running Spark applications will send the 
> logs to more experienced people to seek help; normally this running log is too 
> verbose to dig into the details.
> This proposal tries to handle that problem: a user could download the event 
> log of a particular application and send it to others, who could 
> replay this log by uploading it to the history server. This is quite useful for 
> debugging and finding issues for many customers who are using Spark.
> Here is the screenshot: you can download and upload an event log with a 
> button click, and the imported event log will be named with the suffix "-imported".
> !https://issues.apache.org/jira/secure/attachment/12768211/screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth

2016-10-07 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556862#comment-15556862
 ] 

yuhao yang commented on SPARK-14503:


Yes, let me just send what I got.

> spark.ml API for FPGrowth
> -
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17819:


Assignee: (was: Apache Spark)

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://:/?hive.server2.transport.mode=http;hive.server2.thrift.http.path=
> {code}
> However, anything specified in  results in being put in default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> ++--+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> ++--+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> ++--+--+
> 1 row selected (0.633 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: 

[jira] [Commented] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556857#comment-15556857
 ] 

Apache Spark commented on SPARK-17819:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15399

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://:/?hive.server2.transport.mode=http;hive.server2.thrift.http.path=
> {code}
> However, anything specified in  results in being put in default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> ++--+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> ++--+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> 

[jira] [Assigned] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17819:


Assignee: Apache Spark

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Assignee: Apache Spark
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://:/?hive.server2.transport.mode=http;hive.server2.thrift.http.path=
> {code}
> However, anything specified in  results in being put in default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> ++--+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> ++--+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> | foo| false|
> ++--+--+
> 1 row selected (0.633 seconds)
> Beeline version 1.2.1.spark2 by 

[jira] [Closed] (SPARK-8719) Adding Python support for 1-sample, 2-sided Kolmogorov Smirnov Test

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-8719.
--
Resolution: Duplicate

> Adding Python support for 1-sample, 2-sided Kolmogorov Smirnov Test
> ---
>
> Key: SPARK-8719
> URL: https://issues.apache.org/jira/browse/SPARK-8719
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jose Cambronero
>Priority: Minor
>
> Provide python support for the MLlib patch that implements 1-sample, 2-sided 
> Kolmogorov Smirnov test (found at  
> https://issues.apache.org/jira/browse/SPARK-8598)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8605) Exclude files in StreamingContext. textFileStream(directory)

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-8605:
---
Component/s: (was: PySpark)
 Streaming

> Exclude files in StreamingContext. textFileStream(directory)
> 
>
> Key: SPARK-8605
> URL: https://issues.apache.org/jira/browse/SPARK-8605
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Noel Vo
>  Labels: streaming, streaming_api
>
> Currently, Spark Streaming can monitor a directory and it will process the 
> newly added files. This causes a bug if the files copied to the directory 
> are big. For example, in HDFS, while a file is being copied, its name is 
> file_name._COPYING_. Spark will pick up the file and process it. However, when 
> the copy finishes, the file name becomes file_name. This causes a 
> FileDoesNotExist error. It would be great if we could exclude files in the 
> directory using a regex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8605) Exclude files in StreamingContext. textFileStream(directory)

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556785#comment-15556785
 ] 

holdenk commented on SPARK-8605:


This is semi-documented (namely, only atomic moves are supported), but adding a 
filter for ._COPYING_ files could be useful.
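
A Scala sketch of that kind of filter with the existing fileStream API (the directory 
path and batch interval are made up; sc is the spark-shell SparkContext):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))

// Ignore files that HDFS is still copying in (they end in ._COPYING_).
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///incoming",
    (path: Path) => !path.getName.endsWith("._COPYING_"),
    newFilesOnly = true)
  .map(_._2.toString)

lines.print()
ssc.start()
{code}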

> Exclude files in StreamingContext. textFileStream(directory)
> 
>
> Key: SPARK-8605
> URL: https://issues.apache.org/jira/browse/SPARK-8605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Noel Vo
>  Labels: streaming, streaming_api
>
> Currently, Spark Streaming can monitor a directory and process newly added 
> files. This causes a bug if the files copied into the directory are big. For 
> example, in HDFS, while a file is being copied its name is 
> file_name._COPYING_. Spark will pick the file up and start processing it. 
> However, once the copy finishes, the file name becomes file_name, which 
> causes a FileDoesNotExist error. It would be great if we could exclude files 
> in the directory using a regex.
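
A minimal sketch of the atomic-move practice mentioned in the comment above: write the file into a staging directory first and only rename it into the monitored directory once it is complete. The paths and function name below are hypothetical, and the same idea applies to HDFS via a move rather than a copy.

{code}
import os

staging_dir = "/data/staging"      # not watched by Spark Streaming
monitored_dir = "/data/incoming"   # the directory passed to ssc.textFileStream()

def publish(filename):
    src = os.path.join(staging_dir, filename)
    dst = os.path.join(monitored_dir, filename)
    # os.rename is atomic on the same filesystem, so the stream never
    # observes a half-written file (the HDFS equivalent is `hdfs dfs -mv`).
    os.rename(src, dst)
{code}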



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7177) Create standard way to wrap Spark CLI scripts for external projects

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556780#comment-15556780
 ] 

holdenk commented on SPARK-7177:


I've run into similar challenges when working on Sparkling Pandas.

> Create standard way to wrap Spark CLI scripts for external projects
> ---
>
> Key: SPARK-7177
> URL: https://issues.apache.org/jira/browse/SPARK-7177
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Reporter: Uri Laserson
>
> Many external projects that are built on Spark support CLI scripts to launch 
> their applications.  For example, the ADAM project has {{adam-submit}}, 
> {{adam-shell}}, and {{adam-pyspark}} commands that mirror the Spark versions 
> but modify the CLASSPATH and set some extra options to make it easy for the 
> user.  However, because these applications can take a mix of Spark-specific 
> and application-specific options, they need to use the internal CLI tools to 
> separate them out before calling {{spark-submit}} or {{spark-shell}} (e.g., 
> using {{gatherSparkSubmitOpts}}).  However, because this functionality is 
> considered internal, it has changed a few times in the past.
> It would be great if there was a stable way that could be an "extensibility" 
> API for people to make shell wrappers that would be unlikely to need 
> significant changes when the Spark functionality changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7941) Cache Cleanup Failure when job is killed by Spark

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556775#comment-15556775
 ] 

holdenk commented on SPARK-7941:


Are you still experiencing this issue [~cqnguyen] or would it be ok for us to 
close this?

> Cache Cleanup Failure when job is killed by Spark 
> --
>
> Key: SPARK-7941
> URL: https://issues.apache.org/jira/browse/SPARK-7941
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.3.1
>Reporter: Cory Nguyen
> Attachments: screenshot-1.png
>
>
> Problem/Bug:
> If a job is running and Spark kills the job intentionally, the cache files 
> remain on the local/worker nodes and are not cleaned up properly. Over time 
> the old cache builds up and causes a "No Space Left on Device" error. 
> The cache is cleaned up properly when the job succeeds. I have not verified 
> whether the cache remains when the user intentionally kills the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8780) Move Python doctest code example from models to algorithms

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-8780:
---
Labels: starter  (was: )

> Move Python doctest code example from models to algorithms
> --
>
> Key: SPARK-8780
> URL: https://issues.apache.org/jira/browse/SPARK-8780
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: starter
>
> Almost all doctest code examples are in the model classes of PySpark MLlib.
> Since users usually start with algorithms rather than models, we should move 
> the examples from the models to the algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly

2016-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556767#comment-15556767
 ] 

Apache Spark commented on SPARK-17647:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/15398

> SQL LIKE does not handle backslashes correctly
> --
>
> Key: SPARK-17647
> URL: https://issues.apache.org/jira/browse/SPARK-17647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>  Labels: correctness
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated 
> as a Java string but not raw string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17647) SQL LIKE does not handle backslashes correctly

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17647:


Assignee: Apache Spark

> SQL LIKE does not handle backslashes correctly
> --
>
> Key: SPARK-17647
> URL: https://issues.apache.org/jira/browse/SPARK-17647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: correctness
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated 
> as a Java string but not raw string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17647) SQL LIKE does not handle backslashes correctly

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17647:


Assignee: (was: Apache Spark)

> SQL LIKE does not handle backslashes correctly
> --
>
> Key: SPARK-17647
> URL: https://issues.apache.org/jira/browse/SPARK-17647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>  Labels: correctness
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated 
> as a Java string but not raw string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8780) Move Python doctest code example from models to algorithms

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556763#comment-15556763
 ] 

holdenk commented on SPARK-8780:


Is this something we still want to do? This could be a great starter issue if 
so :)

> Move Python doctest code example from models to algorithms
> --
>
> Key: SPARK-8780
> URL: https://issues.apache.org/jira/browse/SPARK-8780
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> Almost all doctest code examples are in the model classes of PySpark MLlib.
> Since users usually start with algorithms rather than models, we should move 
> the examples from the models to the algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6831) Document how to use external data sources

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556758#comment-15556758
 ] 

holdenk commented on SPARK-6831:


Is this something we are planning to do at all? It doesn't seem to have 
happened for a while. Perhaps we could add a link to the spark-packages 
formats list from the SQL documentation?

> Document how to use external data sources
> -
>
> Key: SPARK-6831
> URL: https://issues.apache.org/jira/browse/SPARK-6831
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark, SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> We should include some instructions on how to use an external datasource for 
> users who are beginners. Do they need to install it on all the machines ? Or 
> just the master ? Are there are any special flags they need to pass to 
> `bin/spark-submit` etc.
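
As a point of reference for such documentation, the usual workflow is to pass the package coordinates to spark-submit and then reference the data source by its format name; jars declared with {{--packages}} are shipped to the executors automatically, so nothing has to be installed on the workers by hand. A sketch for the spark-csv package (the coordinates and path are illustrative):

{code}
# Launched with, e.g.:
#   bin/spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 my_app.py
df = (sqlContext.read
      .format("com.databricks.spark.csv")   # external data source by name
      .option("header", "true")
      .load("/data/example.csv"))
df.show()
{code}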



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6780) Add saveAsTextFileByKey method for PySpark

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-6780.
--
Resolution: Won't Fix

Since SPARK-3533 is WON'T FIX, this one should be too.

> Add saveAsTextFileByKey method for PySpark
> --
>
> Key: SPARK-6780
> URL: https://issues.apache.org/jira/browse/SPARK-6780
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ilya Ganelin
>
> The PySpark API should have a method to allow saving a key-value RDD to 
> subdirectories organized by key as in :
> https://issues.apache.org/jira/browse/SPARK-3533
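
Although this is Won't Fix, a DataFrame-based alternative in Spark 2.x can split output into per-key subdirectories; a minimal sketch, assuming a SparkSession named {{spark}} and a key/value pair RDD:

{code}
pairs = sc.parallelize([("a", "1"), ("b", "2"), ("a", "3")])
df = spark.createDataFrame(pairs, ["key", "value"])

# Writes one subdirectory per distinct key, e.g. /out/key=a/, /out/key=b/
df.write.partitionBy("key").json("/out")
{code}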



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7613) Serialization fails in pyspark for lambdas referencing class data members

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-7613.
--
Resolution: Won't Fix

I believe this is expected behaviour and the current best practice is simply to 
make a local copy of any required variables.
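
To make the local-copy advice concrete, the failing case below can be rewritten so the lambda closes over a plain local variable instead of {{self}}; the class name here is hypothetical and {{sc}} is an existing SparkContext:

{code}
class LambdaFixed():
    def __init__(self, exp):
        self.exp = exp
        exp_local = self.exp                        # local copy, no self in the closure
        self.f_function = (lambda x: x ** exp_local)

rdd = sc.parallelize(range(0, 10))
print(rdd.map(LambdaFixed(2).f_function).collect())
{code}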

> Serialization fails in pyspark for lambdas referencing class data members
> -
>
> Key: SPARK-7613
> URL: https://issues.apache.org/jira/browse/SPARK-7613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0, 1.3.1
> Environment: Python 2.7.6, Java 8
>Reporter: Nate Crosswhite
>
> The following code snippet works in pyspark 1.1.0, but fails post 1.2 with 
> the indicated error.  It appears the failure is caused by cloudpickle 
> attempting to serialize the second lambda function twice.
> {code}
> ## Begin PySpark code
> class LambdaFine():
>     def __init__(self, exp):
>         self.exp = exp
>         self.f_function = (lambda x: x**exp)
> class LambdaFail():
>     def __init__(self, exp):
>         self.exp = exp
>         self.f_function = (lambda x: x**self.exp)
> rdd = sc.parallelize(range(0,10))
> print 'LambdaFine:', rdd.map(LambdaFine(2).f_function).collect()  # works
> print 'LambdaFail:', rdd.map(LambdaFail(2).f_function).collect() # fails in 
> spark 1.2+
> ### End PySpark code
> {code}
> ### Output:
> {code}
> LambdaFine: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
> LambdaFail:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 
> 745, in collect
> port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 
> 2345, in _jrdd
> pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, 
> command, self)
>   File "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 
> 2265, in _prepare_for_python_RDD
> pickled_command = ser.dumps((command, sys.version_info[:2]))
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/serializers.py", 
> line 427, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", 
> line 622, in dumps
> cp.dump(obj)
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", 
> line 107, in dump
> return Pickler.dump(self, obj)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 224, in dump
> self.save(obj)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 548, in save_tuple
> save(element)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 562, in save_tuple
> save(element)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", 
> line 199, in save_function
> self.save_function_tuple(obj)
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", 
> line 236, in save_function_tuple
> save((code, closure, base_globals))
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 548, in save_tuple
> save(element)
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 600, in save_list
> self._batch_appends(iter(obj))
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 636, in _batch_appends
> save(tmp[0])
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", 
> line 193, in save_function
> 

[jira] [Commented] (SPARK-7638) Python API for pmml.export

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556732#comment-15556732
 ] 

holdenk commented on SPARK-7638:


Do we still want to do this or focus on adding PMML export on ML given our 
intention to only do new feature development on ML? (cc [~mlnick] & [~josephkb])

> Python API for pmml.export
> --
>
> Key: SPARK-7638
> URL: https://issues.apache.org/jira/browse/SPARK-7638
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Add Python API for pmml.export



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6174) Improve doc: Python ALS, MatrixFactorizationModel

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556720#comment-15556720
 ] 

holdenk commented on SPARK-6174:


I think Bryan did a good job on this, so I'd be in favour of closing the 
ticket. If no one objects, let's just close it tomorrow :)

> Improve doc: Python ALS, MatrixFactorizationModel
> -
>
> Key: SPARK-6174
> URL: https://issues.apache.org/jira/browse/SPARK-6174
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The Python docs for recommendation have almost no content except an example.  
> Add class, method & attribute descriptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5981) pyspark ML models should support predict/transform on vector within map

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556714#comment-15556714
 ] 

holdenk commented on SPARK-5981:


I'm not sure porting the models to Python is a good idea; given our push to 
put more computation into the JVM, it seems counterintuitive. I'm personally 
in favour of closing this as a WONTFIX - what are your thoughts [~josephkb] & 
[~mlnick]?

> pyspark ML models should support predict/transform on vector within map
> ---
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Currently, most Python models only have limited support for single-vector 
> prediction.
> E.g., one can call {code}model.predict(myFeatureVector){code} for a single 
> instance, but that fails within a map for Python ML models and transformers 
> which use JavaModelWrapper:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).  (It works for linear models, which do prediction within 
> Python.)
> Supporting prediction within a map would require storing the model and doing 
> prediction/transformation within Python.
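
As a workaround today, JavaModelWrapper-backed models accept an RDD of feature vectors directly, so the prediction can be pushed down to the JVM instead of being called inside a map. A rough sketch using a decision tree, assuming an existing SparkContext named {{sc}}:

{code}
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint

train = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])
model = DecisionTree.trainClassifier(train, numClasses=2,
                                     categoricalFeaturesInfo={})

features = train.map(lambda lp: lp.features)
# features.map(lambda v: model.predict(v)) fails for JVM-backed models;
# predicting on the whole RDD works and returns an RDD of predictions.
predictions = model.predict(features)
print(predictions.collect())
{code}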



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4851) "Uninitialized staticmethod object" error in PySpark

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-4851.

Resolution: Fixed

The provided repro now runs (although we need to provide it with the correct 
number of args).
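
For completeness, the "correct number of args" remark refers to the spurious {{self}} parameter in the repro; dropping it makes the example run as expected (a sketch, assuming an existing SparkContext named {{sc}}):

{code}
class A:
    @staticmethod
    def foo(x):          # a staticmethod takes no self
        return x

sc.parallelize([1]).map(lambda x: A.foo(x)).count()
{code}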

> "Uninitialized staticmethod object" error in PySpark
> 
>
> Key: SPARK-4851
> URL: https://issues.apache.org/jira/browse/SPARK-4851
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.2.0, 1.3.0
>Reporter: Nadav Grossug
>Priority: Minor
>
> *Reproduction:*
> {code}
> class A:
>     @staticmethod
>     def foo(self, x):
>         return x
> sc.parallelize([1]).map(lambda x: A.foo(x)).count()
> {code}
> This gives
> {code}
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/worker.py", line 107, 
> in main
> process()
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/worker.py", line 98, 
> in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, 
> in pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, 
> in pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, 
> in pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 247, in 
> func
> return f(iterator)
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 818, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 818, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "", line 1, in 
> RuntimeError: uninitialized staticmethod object
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:173)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1425) PySpark can crash Executors if worker.py fails while serializing data

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556680#comment-15556680
 ] 

holdenk commented on SPARK-1425:


Is this still an issue, or do we have a repro case for it? The current framed 
serializer seems to only write out the object once it's fully pickled, although 
we are still using the same pipe for both error messages and data.

> PySpark can crash Executors if worker.py fails while serializing data
> -
>
> Key: SPARK-1425
> URL: https://issues.apache.org/jira/browse/SPARK-1425
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>
> The PythonRDD code that talks to the worker will keep calling 
> stream.readInt() and allocating an array of that size. Unfortunately, if the 
> worker gives it corrupted data, it will attempt to allocate a huge array and 
> get an OutOfMemoryError. It would be better to use a different stream to give 
> feedback, *or* only write an object out to the stream once it's been properly 
> pickled to bytes or to a string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5160) Python module in jars

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-5160.
--
Resolution: Fixed

This is now supported.

> Python module in jars
> -
>
> Key: SPARK-5160
> URL: https://issues.apache.org/jira/browse/SPARK-5160
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Reporter: Davies Liu
>
> In order to simplify publishing Spark packages with Python modules, we could 
> put the Python module into jars (a jar is a zip with a different extension). 
> Python can import a module from a jar when:
> 1) the module is at the top level of the jar
> 2) the path to the jar is on sys.path
> So, we should put the path of the jar onto the PYTHONPATH in the driver and 
> executors.
> cc [~pwendell]
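
Since a jar is just a zip archive, the standard zipimport machinery already handles the import once the jar is on sys.path; a minimal sketch (the jar path and module name are hypothetical, and the jar still needs to be shipped to the executors, e.g. via --jars or --py-files):

{code}
import sys

# Jar containing mymodule.py at its root.
sys.path.insert(0, "/path/to/mylib.jar")

import mymodule   # resolved through zipimport
{code}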



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17834:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17834:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556663#comment-15556663
 ] 

Apache Spark commented on SPARK-17834:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15397

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17834:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15406

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4488) Add control over map-side aggregation

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556650#comment-15556650
 ] 

holdenk commented on SPARK-4488:


So while the associated PR is closed, we ended up adding the option to disable 
map-side aggregation in the Scala API. This could be a pretty good medium-type 
issue for someone interested in digging into PySpark issues, and I'd be happy 
to help out (otherwise if no one gets to it by December I'll take a look 
myself). cc [~joshrosen] who was involved in the code review :)

> Add control over map-side aggregation
> -
>
> Key: SPARK-4488
> URL: https://issues.apache.org/jira/browse/SPARK-4488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17834:
-
Summary: Fetch the earliest offsets manually in KafkaSource instead of 
counting on KafkaConsumer  (was: Fetch the initial offsets manually in 
KafkaSource instead of counting on KafkaConsumer)

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17834) Fetch the initial offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17834:


 Summary: Fetch the initial offsets manually in KafkaSource instead 
of counting on KafkaConsumer
 Key: SPARK-17834
 URL: https://issues.apache.org/jira/browse/SPARK-17834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556645#comment-15556645
 ] 

Reynold Xin commented on SPARK-17626:
-

Thanks - this makes sense (especially the bushy tree part).

For runtime, a lot of known optimizations one can do (e.g. based on RI to turn 
hash map lookups into dense array lookups) are already done entirely adaptively 
during query execution in whole stage code generation.

Also looping in [~ron8hu] and [~ZenWzh].

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas with 
> dimensions linking to other dimensions. A star schema consists of a fact 
> table referencing a number of dimension tables. The fact table holds the 
> main data about a business. A dimension table, usually a smaller table, 
> describes data reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2999) Compress all the serialized data

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-2999.

Resolution: Fixed

Fixed in b5c51c8df480f1a82a82e4d597d8eea631bffb4e

> Compress all the serialized data
> 
>
> Key: SPARK-2999
> URL: https://issues.apache.org/jira/browse/SPARK-2999
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> LZ4 is so fast that we can have performance benefit for all network/disk IO.
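
The codec used for these paths is controlled by configuration; a minimal sketch of selecting LZ4 explicitly when building the context:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("lz4-example")
        # Codec used for shuffle outputs, broadcast variables and RDD blocks.
        .set("spark.io.compression.codec", "lz4"))
sc = SparkContext(conf=conf)
{code}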



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8791) Make a better hashcode for InternalRow

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-8791.

Resolution: Fixed

> Make a better hashcode for InternalRow
> --
>
> Key: SPARK-8791
> URL: https://issues.apache.org/jira/browse/SPARK-8791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Currently, InternalRow does not handle complex data types well when 
> computing the hashCode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8791) Make a better hashcode for InternalRow

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556603#comment-15556603
 ] 

Xiao Li commented on SPARK-8791:


This seems to have been resolved in a later version. Let me close it now. 
Please reopen it if you think we still need it. Thanks!

> Make a better hashcode for InternalRow
> --
>
> Key: SPARK-8791
> URL: https://issues.apache.org/jira/browse/SPARK-8791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Currently, InternalRow does not handle complex data types well when 
> computing the hashCode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2868) Support named accumulators in Python

2016-10-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556582#comment-15556582
 ] 

holdenk commented on SPARK-2868:


Is this something we are still interested in pursuing (cc [~rxin], who did the 
Scala accumulator API update)? I'd be happy to take on this issue once we've 
decided what to do around data property accumulators, since I've already been 
working with accumulators a lot.

> Support named accumulators in Python
> 
>
> Key: SPARK-2868
> URL: https://issues.apache.org/jira/browse/SPARK-2868
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Patrick Wendell
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to 
> make some additional changes. One potential path is to have a 1:1 
> correspondence with Scala accumulators (instead of a one-to-many). A 
> challenge is exposing the stringified values of the accumulators to the Scala 
> code.
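
For context, PySpark already supports unnamed accumulators; what is missing is the name (and therefore the web UI display) that SPARK-2380 added on the JVM side. A sketch of the existing API, assuming an existing SparkContext named {{sc}}:

{code}
acc = sc.accumulator(0)   # no `name` parameter in the Python API today

# Count even numbers on the executors.
sc.parallelize(range(10)).foreach(lambda x: acc.add(1) if x % 2 == 0 else None)

print(acc.value)   # 5, but the accumulator never shows up in the web UI
{code}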



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2654) Leveled logging in PySpark

2016-10-07 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-2654.

Resolution: Fixed

This has been fixed in SPARK-3444 / ae98eec730125c1153dcac9ea941959cc79e4f42

> Leveled logging in PySpark
> --
>
> Key: SPARK-2654
> URL: https://issues.apache.org/jira/browse/SPARK-2654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> Add more leveled logging in PySpark; the logging level should be easily 
> controlled by configuration and command-line arguments.
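
The level can now be adjusted at runtime as well as through conf/log4j.properties; a minimal example:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="logging-example")
# Also accepts ALL, DEBUG, ERROR, FATAL, INFO, OFF and TRACE.
sc.setLogLevel("WARN")
{code}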



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8527) StructType's Factory method does not work in java code

2016-10-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556567#comment-15556567
 ] 

Xiao Li commented on SPARK-8527:


This should have been resolved. Could you retry it on the master branch? If you 
still hit it, please reopen it. Thanks!

> StructType's Factory method does not work in java code
> --
>
> Key: SPARK-8527
> URL: https://issues.apache.org/jira/browse/SPARK-8527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Hao Ren
>
> According to the following line of code:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L209
> The cast should succeed; however, after type erasure I encounter a 
> {{java.lang.ClassCastException}}:
> {code}
> ArrayList<StructField> structFields = new ArrayList<>();
> // Some add operation
> return StructType$.MODULE$.apply(structFields); // run time error
> Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; 
> cannot be cast to [Lorg.apache.spark.sql.types.StructField;
>   at org.apache.spark.sql.types.StructType$.apply(StructType.scala:209)
> {code}
> Am I missing anything ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8527) StructType's Factory method does not work in java code

2016-10-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-8527.

Resolution: Fixed

> StructType's Factory method does not work in java code
> --
>
> Key: SPARK-8527
> URL: https://issues.apache.org/jira/browse/SPARK-8527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Hao Ren
>
> According to the following line of code:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L209
> The cast should succeed; however, after type erasure I encounter a 
> {{java.lang.ClassCastException}}:
> {code}
> ArrayList<StructField> structFields = new ArrayList<>();
> // Some add operation
> return StructType$.MODULE$.apply(structFields); // run time error
> Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; 
> cannot be cast to [Lorg.apache.spark.sql.types.StructField;
>   at org.apache.spark.sql.types.StructType$.apply(StructType.scala:209)
> {code}
> Am I missing anything ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


