[jira] [Commented] (SPARK-14768) Remove expectedType arg for PySpark Param

2016-04-21 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252280#comment-15252280
 ] 

Jason C Lee commented on SPARK-14768:
-

Seems straightforward enough. I am working on the PR

> Remove expectedType arg for PySpark Param
> -
>
> Key: SPARK-14768
> URL: https://issues.apache.org/jira/browse/SPARK-14768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> We could go ahead and remove the expectedType arg in 2.0. The only reason to 
> keep it is to avoid breaking 3rd-party implementations of PipelineStages in 
> Python that happen to use the expectedType arg. I doubt there are many 
> such implementations.
> Keeping that argument has led to problems with mixing the expectedType and 
> typeConverter args, as in [https://github.com/apache/spark/pull/12480]
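For reference, a minimal sketch of the two overlapping styles described above (the Param/TypeConverters names are assumed from pyspark.ml.param circa 2.0; illustrative only, not the final API):
{code}
from pyspark.ml.param import Param, Params, TypeConverters

# old style this issue proposes removing (the expectedType arg):
# maxIter = Param(Params._dummy(), "maxIter", "max number of iterations",
#                 expectedType=int)

# preferred style: validate/convert the value through a typeConverter
maxIter = Param(Params._dummy(), "maxIter", "max number of iterations",
                typeConverter=TypeConverters.toInt)
{code}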



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14564) Python Word2Vec missing setWindowSize method

2016-04-12 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237719#comment-15237719
 ] 

Jason C Lee commented on SPARK-14564:
-

Looks straightforward enough. I will give it a shot!

> Python Word2Vec missing setWindowSize method
> 
>
> Key: SPARK-14564
> URL: https://issues.apache.org/jira/browse/SPARK-14564
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 1.6.0, 1.6.1
> Environment: pyspark
>Reporter: Brad Willard
>Priority: Minor
>  Labels: ml, mllib, pyspark, python, word2vec
>
> The setWindowSize method for configuring the Word2Vec model is available in 
> Scala but missing in Python, so you're stuck with the default window size of 5.
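For reference, a hedged sketch of the missing call, mirroring the Scala API (the setWindowSize line below is the desired addition and, per this issue, does not exist in the Python wrapper yet; tokenized_df is an assumed DataFrame with a "tokens" column):
{code}
from pyspark.ml.feature import Word2Vec

w2v = Word2Vec(vectorSize=100, minCount=5, inputCol="tokens", outputCol="vectors")
# w2v.setWindowSize(10)  # desired API; currently Scala-only, so Python is stuck at 5
model = w2v.fit(tokenized_df)
{code}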



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-29 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216695#comment-15216695
 ] 

Jason C Lee commented on SPARK-13802:
-

I tried making a fix where I treat a kwargs Row as a dict and reorder its fields 
based on the schema's field names, but my fix failed one of the test cases in 
python/pyspark/sql/tests.py:
{noformat}
    def test_toDF_with_schema_string(self):
        data = [Row(key=i, value=str(i)) for i in range(100)]
        rdd = self.sc.parallelize(data, 5)

        df = rdd.toDF("key: int, value: string")
        self.assertEqual(df.schema.simpleString(), "struct<key:int,value:string>")
        self.assertEqual(df.collect(), data)

        # different but compatible field types can be used.
        df = rdd.toDF("key: string, value: string")
        self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
        self.assertEqual(df.collect(),
                         [Row(key=str(i), value=str(i)) for i in range(100)])

        # field names can differ.
        df = rdd.toDF(" a: int, b: string ")
        self.assertEqual(df.schema.simpleString(), "struct<a:int,b:string>")
        self.assertEqual(df.collect(), data)
{noformat}

This shows that the schema's field names don't have to correspond to the Row's 
field names. Fields are ordered by the Row's names, not the schema's. By providing 
schema names, you are essentially 'renaming' the columns. 
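To make the ordering concrete, here is a small sketch (plain PySpark, nothing assumed beyond pyspark.sql.Row) showing that Row(**kwargs) sorts its fields alphabetically, so a schema you pass along only renames the columns positionally:
{code}
from pyspark.sql import Row

row = Row(id="39", first_name="Szymon")
print(row)             # Row(first_name='Szymon', id='39') -- fields sorted by name
print(row.__fields__)  # ['first_name', 'id']
{code}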

So, maybe a better approach for you is to leave out the schema. This way the 
schema can just be inferred from the first row:
{noformat}
row = Row(id="39", first_name="Szymon")
df = sqlContext.createDataFrame([row])
df.show(1)

+----------+---+
|first_name| id|
+----------+---+
|    Szymon| 39|
+----------+---+

{noformat}



> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using the Row constructor with kwargs, the fields in the underlying tuple 
> are sorted by name. When the schema reads the row, it does not use the fields 
> in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-21 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205446#comment-15205446
 ] 

Jason C Lee commented on SPARK-13802:
-

I will give it a shot! Working on the PR at the moment. 

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using the Row constructor with kwargs, the fields in the underlying tuple 
> are sorted by name. When the schema reads the row, it does not use the fields 
> in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2016-02-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135630#comment-15135630
 ] 

Jason C Lee commented on SPARK-12519:
-

The memory leak in question comes from a TungstenAggregationIterator's hashmap, 
which is normally freed at the end of iterating through its input. However, show() 
by default takes only the first 20 rows and usually does not reach the end of the 
input, so the hashmap does not get freed. Any ideas on how to fix this are 
welcome.
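A minimal sketch of the early-termination point described above (assumes an existing sqlContext; the numbers are arbitrary):
{code}
df = sqlContext.range(0, 100000).distinct()  # aggregation-backed DataFrame

df.show()     # pulls only the first 20 rows; the aggregation iterator feeding
              # it may never be driven to the end of its input
df.show(100)  # still stops after 100 rows, so the full iteration never happens
{code}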

> "Managed memory leak detected" when using distinct on PySpark DataFrame
> ---
>
> Key: SPARK-12519
> URL: https://issues.apache.org/jira/browse/SPARK-12519
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: OS X 10.9.5, Java 1.8.0_66
>Reporter: Paul Shearer
>
> After running the distinct() method to transform a DataFrame, subsequent 
> actions like count() and show() may report a managed memory leak. Here is a 
> minimal example that reproduces the bug on my machine:
> h1. Script
> {noformat}
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )
> import string
> import random
> def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
> return ''.join(random.choice(chars) for _ in range(size))
> nrow   = 8
> ncol   = 20
> ndrow  = 4 # number distinct rows
> tmp = [id_generator() for i in xrange(ndrow*ncol)]
> tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in 
> xrange(nrow)] 
> dat = sc.parallelize(tmp,1000).toDF()
> dat = dat.distinct() # if this line is commented out, no memory leak will be 
> reported
> # dat = dat.rdd.distinct().toDF() # if this line is used instead of the 
> above, no leak
> ct = dat.count()
> print ct  
> # memory leak warning prints at this point in the code
> dat.show()  
> {noformat}
> h1. Output
> When this script is run in PySpark (with IPython kernel), I get this error:
> {noformat}
> $ pyspark --executor-memory 12G --driver-memory 12G
> Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
> Type "copyright", "credits" or "license" for more information.
> IPython 4.0.0 -- An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help  -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> <<<... usual loading info...>>>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
>       /_/
> Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
> SparkContext available as sc, SQLContext available as sqlContext.
> In [1]: execfile('bugtest.py')
> 4
> 15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 
> 16777216 bytes, TID = 2202
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |_1|_2|_3|_4|_5|_6|_7|_8|_9|   _10|   
> _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
> |BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
> |IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
> |S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
> |U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
> |3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
> |XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
> |4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
> |USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
> 

[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime

2016-01-15 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102749#comment-15102749
 ] 

Jason C Lee commented on SPARK-12683:
-

Looks like collect() eventually calls collectToPython through py4j, which then 
returns the port of the socket that carries the wrong answer. I am not all that 
familiar with how py4j works, so any py4j expert is welcome to chime in here!

> SQL timestamp is wrong when accessed as Python datetime
> ---
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
>Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but 
> when accessing it (as Python {{datetime}}) through {{.collect()}}, it is 
> wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
> spark_context = SparkContext(appName='SparkBugTimestampHour')
> sql_context = SQLContext(spark_context)
> sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
> data_frame = sql_context.sql(sql_text)
> data_frame.show(truncate=False)
> # Result from .show() (as expected, looks correct):
> # +----------------------+
> # |ts                    |
> # +----------------------+
> # |2100-09-09 12:11:10.09|
> # +----------------------+
> rows = data_frame.collect()
> row = rows[0]
> ts = row[0]
> print('ts={ts}'.format(ts=ts))
> # Expected result from this print statement:
> # ts=2100-09-09 12:11:10.09
> #
> # Actual, wrong result (note the hours being 18 instead of 12):
> # ts=2100-09-09 18:11:10.09
> #
> # This error seems to be dependent on some characteristic of the system. 
> We couldn't reproduce
> # this on all of our systems, but it is not clear what the differences 
> are. One difference is
> # the processor: it failed on Intel Xeon E5-2687W v2.
> assert isinstance(ts, datetime)
> assert ts.year == 2100 and ts.month == 9 and ts.day == 9
> assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 9
> if ts.hour != 12:
> print('hour is not correct; should be 12, is actually 
> {hour}'.format(hour=ts.hour))
> spark_context.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime

2016-01-12 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094877#comment-15094877
 ] 

Jason C Lee commented on SPARK-12683:
-

For me, on my machine, the difference is one hour. The difference does not always 
occur; it depends on the specified year, month, and day. I will trace through the 
code and get to the bottom of it.
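A quick probe sketch for that date dependence (assumes a live sqlContext; the years chosen are arbitrary), comparing the hour of the collected Python datetime across a few dates:
{code}
for year in (1970, 2015, 2038, 2100):
    sql = "select cast('{y}-09-09 12:11:10.09' as timestamp) as ts".format(y=year)
    ts = sqlContext.sql(sql).collect()[0][0]
    print(year, ts.hour)  # expect 12 for every year; any other hour flags the bug
{code}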

> SQL timestamp is wrong when accessed as Python datetime
> ---
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
>Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but 
> when accessing it (as Python {{datetime}}) through {{.collect()}}, it is 
> wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
> spark_context = SparkContext(appName='SparkBugTimestampHour')
> sql_context = SQLContext(spark_context)
> sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
> data_frame = sql_context.sql(sql_text)
> data_frame.show(truncate=False)
> # Result from .show() (as expected, looks correct):
> # +----------------------+
> # |ts                    |
> # +----------------------+
> # |2100-09-09 12:11:10.09|
> # +----------------------+
> rows = data_frame.collect()
> row = rows[0]
> ts = row[0]
> print('ts={ts}'.format(ts=ts))
> # Expected result from this print statement:
> # ts=2100-09-09 12:11:10.09
> #
> # Actual, wrong result (note the hours being 18 instead of 12):
> # ts=2100-09-09 18:11:10.09
> #
> # This error seems to be dependent on some characteristic of the system. 
> We couldn't reproduce
> # this on all of our systems, but it is not clear what the differences 
> are. One difference is
> # the processor: it failed on Intel Xeon E5-2687W v2.
> assert isinstance(ts, datetime)
> assert ts.year == 2100 and ts.month == 9 and ts.day == 9
> assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 9
> if ts.hour != 12:
> print('hour is not correct; should be 12, is actually 
> {hour}'.format(hour=ts.hour))
> spark_context.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12655) GraphX does not unpersist RDDs

2016-01-08 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090165#comment-15090165
 ] 

Jason C Lee commented on SPARK-12655:
-

The VertexRDD and EdgeRDD you see are created during the intermediate step of 
g.connectedComponents(). They are not properly unpersisted at the moment. I 
will look into this. 

> GraphX does not unpersist RDDs
> --
>
> Key: SPARK-12655
> URL: https://issues.apache.org/jira/browse/SPARK-12655
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Alexander Pivovarov
>
> Looks like Graph does not clean all RDDs from the cache on unpersist
> {code}
> // open spark-shell 1.5.2 or 1.6.0
> // run
> import org.apache.spark.graphx._
> val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
> val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
> val g0 = Graph(vert, edges)
> val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
> val cc = g.connectedComponents()
> cc.unpersist()
> g.unpersist()
> g0.unpersist()
> vert.unpersist()
> edges.unpersist()
> {code}
> open http://localhost:4040/storage/
> Spark UI 4040 Storage page still shows 2 items
> {code}
> VertexRDD  Memory Deserialized 1x Replicated   1   100%   1688.0 B   0.0 B   0.0 B
> EdgeRDD    Memory Deserialized 1x Replicated   2   100%   4.7 KB     0.0 B   0.0 B
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-14 Thread Jason C Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason C Lee updated SPARK-10943:

Comment: was deleted

(was: I'd like to work on this. Thanx)

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
>   at 
> 

[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-06 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945783#comment-14945783
 ] 

Jason C Lee commented on SPARK-10943:
-

I'd like to work on this. Thanx

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
>   at 
> 

[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943956#comment-14943956
 ] 

Jason C Lee commented on SPARK-10877:
-

I ran your SparkFilterByKeyTest.scala from spark-shell but did not run into the 
problem you stated above. At which statement did you run into the exception? 
randomUUID()?

scala> df1.printSchema
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: double (nullable = true)
 |-- col5: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- indexfordefaultordering: integer (nullable = true)


scala> df2.printSchema
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: double (nullable = true)
 |-- col5: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- indexfordefaultordering: integer (nullable = true)

scala> renamedRightDf.printSchema
root
 |-- cole9cf9a80b37641a1957509ad61e1f823: integer (nullable = true)
 |-- col3e0904a3138e4f2ba987b3d282d61927: integer (nullable = true)
 |-- colb92786c9d3944106a536038604bcb4ee: string (nullable = true)
 |-- coled77bd68bd634a328787c1489564e7ac: double (nullable = true)
 |-- colf8b2385adf014d7cb215ddbffb8d1b24: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col0d1102e414e5441faab32b617a27ac60: integer (nullable = true)

scala> joinedDf.collect
res6: Array[org.apache.spark.sql.Row] = Array()


> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944141#comment-14944141
 ] 

Jason C Lee commented on SPARK-10877:
-

I enabled assertions by specifying the following in my build.sbt
javaOptions += "-ea"
and used sbt package to build. 

I also ran it with spark-submit instead of spark-shell... and I still don't see 
what you see:
$SPARK_HOME/bin/spark-submit --class "SparkFilterByKeyTest" --master local[2] 
target/scala-2.10/simple-project_2.10-1.0.jar 



> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944303#comment-14944303
 ] 

Jason C Lee commented on SPARK-10925:
-

I removed your 2nd step, "apply a UDF on column "name"", and was still able to 
recreate the problem. I reduced your test case to the following:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object TestCase2 {

  case class Individual(id: String, name: String, surname: String, birthDate: 
String)

  def main(args: Array[String]) {

val sc = new SparkContext("local", "join DFs")
val sqlContext = new SQLContext(sc)

val rdd = sc.parallelize(Seq(
  Individual("14", "patrick", "andrews", "10/10/1970")
))

val df = sqlContext.createDataFrame(rdd)
df.show()

val df1 = df;
val df2 = df1.withColumn("surname1", df("surname"))
df2.show()

val df3 = df2.withColumn("birthDate1", df("birthDate"))
df3.show()

val cardinalityDF1 = df3.groupBy("name")
  .agg(count("name").as("cardinality_name"))
cardinalityDF1.show()

val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name"))
df4.show()

val cardinalityDF2 = df4.groupBy("surname1")
  .agg(count("surname1").as("cardinality_surname"))
cardinalityDF2.show()

val df5 = df4.join(cardinalityDF2, df4("surname") === 
cardinalityDF2("surname1"))
df5.show()
  }
}

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Issue Comment Deleted] (SPARK-10925) Exception when joining DataFrames

2015-10-05 Thread Jason C Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason C Lee updated SPARK-10925:

Comment: was deleted

(was: I removed your 2nd step "apply an UDF on column "name"" and was able to 
also recreate the problem. I reduced your test case to the following:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object TestCase2 {

  case class Individual(id: String, name: String, surname: String, birthDate: 
String)

  def main(args: Array[String]) {

val sc = new SparkContext("local", "join DFs")
val sqlContext = new SQLContext(sc)

val rdd = sc.parallelize(Seq(
  Individual("14", "patrick", "andrews", "10/10/1970")
))

val df = sqlContext.createDataFrame(rdd)
df.show()

val df1 = df;
val df2 = df1.withColumn("surname1", df("surname"))
df2.show()

val df3 = df2.withColumn("birthDate1", df("birthDate"))
df3.show()

val cardinalityDF1 = df3.groupBy("name")
  .agg(count("name").as("cardinality_name"))
cardinalityDF1.show()

val df4 = df3.join(cardinalityDF1, df3("name") === cardinalityDF1("name"))
df4.show()

val cardinalityDF2 = df4.groupBy("surname1")
  .agg(count("surname1").as("cardinality_surname"))
cardinalityDF2.show()

val df5 = df4.join(cardinalityDF2, df4("surname") === 
cardinalityDF2("surname1"))
df5.show()
  }
})

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-10-05 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943670#comment-14943670
 ] 

Jason C Lee commented on SPARK-10847:
-

You're welcome!

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> 
>
> Key: SPARK-10847
> URL: https://issues.apache.org/jira/browse/SPARK-10847
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Windows 7
> java version "1.8.0_60" (64bit)
> Python 3.4.x
> Standalone cluster mode (not local[n]; a full local cluster)
>Reporter: Shea Parkes
>Priority: Minor
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
> pythonic `None`, the `pyspark.SparkContext.createDataFrame` will fail with a 
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
> types.StructField(
> 'name',
> types.StringType(),
> nullable=True,
> metadata={'comment': 'From accounting system.'}
> ),
> types.StructField(
> 'age',
> types.IntegerType(),
> nullable=True,
> metadata={'comment': None}
> ),
> ])
> literal_rdd = sc.parallelize([
> ['Bob', 34],
> ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
> literal_rdd,
> literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
>  line 408, in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
>  line 538, in __call__
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at 
> org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Unknown Source)
> {noformat}
> I believe the most important line of the traceback is this one:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : 

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-28 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933707#comment-14933707
 ] 

Jason C Lee commented on SPARK-10847:
-

I would like to work on this.

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> 
>
> Key: SPARK-10847
> URL: https://issues.apache.org/jira/browse/SPARK-10847
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Windows 7
> java version "1.8.0_60" (64bit)
> Python 3.4.x
> Standalone cluster mode (not local[n]; a full local cluster)
>Reporter: Shea Parkes
>Priority: Minor
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
> pythonic `None`, the `pyspark.SparkContext.createDataFrame` will fail with a 
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
> types.StructField(
> 'name',
> types.StringType(),
> nullable=True,
> metadata={'comment': 'From accounting system.'}
> ),
> types.StructField(
> 'age',
> types.IntegerType(),
> nullable=True,
> metadata={'comment': None}
> ),
> ])
> literal_rdd = sc.parallelize([
> ['Bob', 34],
> ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
> literal_rdd,
> literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
>  line 408, in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
>  line 538, in __call__
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at 
> org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Unknown Source)
> {noformat}
> I believe the most important line of the traceback is this one:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-28 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934044#comment-14934044
 ] 

Jason C Lee commented on SPARK-10847:
-

Instead of:
Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class scala.Tuple2.

would it be helpful if the error message were this:
Py4JJavaError: An error occurred while calling o76.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class java.lang.String : 
class org.json4s.JsonAST$JNull$.
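Separately from the error message, a user-side workaround sketch (not the fix itself, just an assumption-labeled illustration) is to drop None-valued metadata entries before building the schema, since it is the metadata JSON round-trip that rejects them:
{code}
import pyspark.sql.types as types

def clean_metadata(md):
    # drop None-valued entries, which the JVM-side metadata parser rejects
    return {k: v for k, v in md.items() if v is not None}

age_field = types.StructField('age', types.IntegerType(), nullable=True,
                              metadata=clean_metadata({'comment': None}))
{code}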

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> 
>
> Key: SPARK-10847
> URL: https://issues.apache.org/jira/browse/SPARK-10847
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Windows 7
> java version "1.8.0_60" (64bit)
> Python 3.4.x
> Standalone cluster mode (not local[n]; a full local cluster)
>Reporter: Shea Parkes
>Priority: Minor
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
> pythonic `None`, the `pyspark.SparkContext.createDataFrame` will fail with a 
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
> types.StructField(
> 'name',
> types.StringType(),
> nullable=True,
> metadata={'comment': 'From accounting system.'}
> ),
> types.StructField(
> 'age',
> types.IntegerType(),
> nullable=True,
> metadata={'comment': None}
> ),
> ])
> literal_rdd = sc.parallelize([
> ['Bob', 34],
> ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
> literal_rdd,
> literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
>  line 408, in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
>  line 538, in __call__
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at 
> org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   

[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC

2015-06-25 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601569#comment-14601569
 ] 

Jason C Lee commented on SPARK-8616:


I would like to work on this.

 SQLContext doesn't handle tricky column names when loading from JDBC
 

 Key: SPARK-8616
 URL: https://issues.apache.org/jira/browse/SPARK-8616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0
Reporter: Gergely Svigruha

 Reproduce:
  - create a table in a relational database (in my case sqlite) with a column 
 name containing a space:
  CREATE TABLE my_table (id INTEGER, "tricky column" TEXT);
  - try to create a DataFrame using that table:
 sqlContext.read.format("jdbc").options(Map(
   "url" -> "jdbc:sqlite:...",
   "dbtable" -> "my_table")).load()
 java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
 column: tricky)
 According to the SQL spec this should be valid:
 http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier
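In the meantime, a possible user-side workaround sketch (shown as the PySpark equivalent; the URL is a placeholder) is to push the quoting into a dbtable subquery so Spark never has to emit the tricky column name itself:
{code}
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlite:/path/to/db",  # placeholder URL
    dbtable='(SELECT id, "tricky column" AS tricky_column FROM my_table) AS t',
).load()
{code}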



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org