[jira] [Comment Edited] (SPARK-36203) Spark SQL can't use "group by" on the column of map type.

2023-09-04 Thread Paul Praet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761796#comment-17761796
 ] 

Paul Praet edited comment on SPARK-36203 at 9/4/23 11:55 AM:
-

This problem is still present in 3.3.2 (and I think also in 3.4.1).

I cannot do a UNION on a DataFrame with a map<> column.
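A minimal Scala sketch of the workaround we are considering (not an official fix): canonicalise the map column into a sorted array of structs before the set operation, and rebuild the map afterwards if needed. df1/df2 and the column name extend_value are placeholders; map_entries/map_from_entries require Spark 2.4+.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// UNION/DISTINCT/EXCEPT/INTERSECT reject map columns, so compare on a canonical
// array<struct<key,value>> representation instead. df1 and df2 are placeholders.
def canonicalize(df: DataFrame): DataFrame =
  df.withColumn("extend_value", array_sort(map_entries(col("extend_value"))))

val unioned = canonicalize(df1).union(canonicalize(df2)).distinct()

// Rebuild the map form afterwards if it is still needed downstream.
val restored = unioned.withColumn("extend_value", map_from_entries(col("extend_value")))
{code}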


was (Author: praetp):
This problem is still present in 3.3.2 (and I think also in 3.4.1).

> Spark SQL can't use "group by" on the column of map type.
> -
>
> Key: SPARK-36203
> URL: https://issues.apache.org/jira/browse/SPARK-36203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Bruce Wong
>Priority: Major
>
> I want to know why 'group by' can't be used on a column of map type.
>  
> *sql:*
> select distinct id, cols, extend_value from test.test_table 
> -- extend_value's type is map.
> *error:*
> SQL execution error: org.apache.spark.sql.AnalysisException: Cannot have 
> map type columns in DataFrame which calls set operations(intersect, except, 
> etc.), but the type of column extend_value is map;
>  






[jira] [Commented] (SPARK-36203) Spark SQL can't use "group by" on the column of map type.

2023-09-04 Thread Paul Praet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761796#comment-17761796
 ] 

Paul Praet commented on SPARK-36203:


This problem is still present in 3.3.2 (and I think also in 3.4.1).

> Spark SQL can't use "group by" on the column of map type.
> -
>
> Key: SPARK-36203
> URL: https://issues.apache.org/jira/browse/SPARK-36203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Bruce Wong
>Priority: Major
>
> I want to know why 'group by' can't be used on a column of map type.
>  
> *sql:*
> select distinct id, cols, extend_value from test.test_table 
> -- extend_value's type is map.
> *error:*
> SQL execution error: org.apache.spark.sql.AnalysisException: Cannot have 
> map type columns in DataFrame which calls set operations(intersect, except, 
> etc.), but the type of column extend_value is map;
>  






[jira] [Commented] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2022-09-29 Thread Paul Praet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610965#comment-17610965
 ] 

Paul Praet commented on SPARK-34779:


We are seeing spurious failures on the assert():
{noformat}
22/09/29 09:46:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker for task 3063.0 in stage 1997.0 (TID 
677249),5,main]
java.lang.AssertionError: assertion failed: task count shouldn't below 0
at scala.Predef$.assert(Predef.scala:223)
at 
org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130)
at 
org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135)
at 
java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822)
at 
org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:737)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
22/09/29 09:46:24 INFO MemoryStore: MemoryStore cleared
22/09/29 09:46:24 INFO BlockManager: BlockManager stopped
22/09/29 09:46:24 INFO ShutdownHookManager: Shutdown hook called
22/09/29 09:46:24 INFO ShutdownHookManager: Deleting directory 
/mnt/yarn/usercache/hadoop/appcache/application_1664443624160_0001/spark-93efc2d4-84de-494b-a3b7-2cb1c3a45426{noformat}
Feels like overkill? Seen on Spark 3.2.0.

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat 
> occurs
> ---
>
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.2.0
>
>
> The current implementation of ExecutorMetricsPoller uses the task count in each 
> stage to decide whether to keep a stage entry or not. When the executor has only 
> 1 core, this may cause these issues:
>  # Peak metrics missing (due to the stage entry being removed within a heartbeat 
> interval)
>  # Unnecessary and frequent hashmap entry removal and insertion.
> Assuming an executor with 1 core has 2 tasks (task1 and task2, both belonging to 
> stage (0, 0)) to execute within a heartbeat interval, the workflow in the current 
> ExecutorMetricsPoller implementation would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
> increments to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count 
> increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 7. heartbeat() -> empty or inaccurate peak metrics for stage (0, 0) reported.
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP map 
> until a heartbeat occurs. At the heartbeat, after reporting the peak metrics 
> for each stage, we scan each stage in stageTCMP and remove entries with task 
> count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
> increments to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, but the entry (0, 0) 
> still remains.
> 4. task2 start -> task count of stage (0, 0) increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, but the entry (0, 0) 
> still remains.
> 7. heartbeat() -> accurate peak metrics for stage (0, 0) reported. Remove the 
> entry for stage (0, 0) from stageTCMP because its task count is 0.
>  
> How to verify the behavior? 
> Submit a job with a custom polling interval (e.g., 2s) and 
> spark.executor.cores=1 and check the debug logs of ExecutorMetricsPoller.
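For readers of this archive, a minimal Scala sketch (invented names, not Spark's actual code) of the "decrement but keep until heartbeat" idea described above:

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// stageTCMP keyed by (stageId, stageAttemptId) -> running task count (sketch only).
val stageTCMP = new ConcurrentHashMap[(Int, Int), AtomicLong]()

def onTaskStart(stage: (Int, Int)): Unit =
  stageTCMP.computeIfAbsent(stage, _ => new AtomicLong(0)).incrementAndGet()

def onTaskCompletion(stage: (Int, Int)): Unit =
  // Decrement, but do NOT remove the entry here, so poll() can still record peak metrics.
  stageTCMP.computeIfPresent(stage, (_, count) => { count.decrementAndGet(); count })

def onHeartbeat(): Unit =
  // After the peaks have been reported, drop only the stages that are idle.
  stageTCMP.entrySet().removeIf(e => e.getValue.get() == 0)
{code}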






[jira] [Created] (SPARK-40388) SQL configuration spark.sql.mapKeyDedupPolicy not always applied

2022-09-07 Thread Paul Praet (Jira)
Paul Praet created SPARK-40388:
--

 Summary: SQL configuration spark.sql.mapKeyDedupPolicy not always 
applied
 Key: SPARK-40388
 URL: https://issues.apache.org/jira/browse/SPARK-40388
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Paul Praet


I have set spark.sql.mapKeyDedupPolicy to LAST_WIN.

However, I had still one failure where I got
{quote}Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 7 in stage 1201.0 failed 4 times, most recent failure: Lost task 
7.3 in stage 1201.0 (TID 1011313) (ip-10-1-34-47.eu-west-1.compute.internal 
executor 228): java.lang.RuntimeException: Duplicate map key domain was found, 
please check the input data. If you want to remove the duplicated keys, you can 
set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last 
takes precedence.
{quote}
We are confident we set the right configuration in SparkConf (we can find it on 
the Spark UI -> Environment).

It is our impression this configuration is not propagated reliably to the 
executors.
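For reference, a sketch of where we set the option, plus a minimal statement that reproduces the duplicate-key error when the policy does not take effect (the application name is ours):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("map-key-dedup-check") // placeholder name
  .config("spark.sql.mapKeyDedupPolicy", "LAST_WIN") // shows up under Spark UI -> Environment
  .getOrCreate()

// A session-level override is also possible:
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

// With the default EXCEPTION policy this fails with "Duplicate map key domain was found";
// with LAST_WIN it should return {domain -> 2}.
spark.sql("SELECT map('domain', 1, 'domain', 2) AS m").show(false)
{code}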






[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2021-05-10 Thread Paul Praet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342341#comment-17342341
 ] 

Paul Praet commented on SPARK-12837:


Seeing the same issue on Spark 2.4.6: an excessive number of tasks and then the 
same exception as [~hryhoriev.nick]. Disabling broadcast joins did not help.
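What we tried, as a sketch (assuming a spark-shell session {{spark}}; values are examples only):

{code:scala}
// Disable broadcast joins for the session (did not help in our case).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// spark.driver.maxResultSize is a driver setting and has to be set before startup, e.g.
//   spark-submit --conf spark.driver.maxResultSize=2g ...
{code}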

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Executing a sql statement with a large number of partitions requires a high 
> memory space for the driver even there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}






[jira] [Commented] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.

2018-11-20 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693200#comment-16693200
 ] 

Paul Praet commented on SPARK-16044:


Still has issues. See https://issues.apache.org/jira/browse/SPARK-26128

> input_file_name() returns empty strings in data sources based on NewHadoopRDD.
> --
>
> Key: SPARK-16044
> URL: https://issues.apache.org/jira/browse/SPARK-16044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 1.6.3, 2.0.0
>
>
> The issue is that the {{input_file_name()}} function does not return file paths when 
> data sources use {{NewHadoopRDD}}. It is currently only supported for 
> {{FileScanRDD}} and {{HadoopRDD}}.
> To be clear, this does not affect Spark's internal data sources because 
> currently they all do not use {{NewHadoopRDD}}.
> However, there are several datasources using this. For example,
>  
> spark-redshift - 
> [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149]
> spark-xml - 
> [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47]
> Currently, using this function shows the output below:
> {code}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}






[jira] [Created] (SPARK-26128) filter breaks input_file_name

2018-11-20 Thread Paul Praet (JIRA)
Paul Praet created SPARK-26128:
--

 Summary: filter breaks input_file_name
 Key: SPARK-26128
 URL: https://issues.apache.org/jira/browse/SPARK-26128
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.2
Reporter: Paul Praet


This works:
{code:java}
scala> 
spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false)
+-+
|input_file_name()  
  |
+-+
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
+-+

{code}
When adding a filter:
{code:java}
scala> 
spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false)
+-+
|input_file_name()|
+-+
| |
| |
| |
| |
| |
+-+

{code}






[jira] [Commented] (SPARK-18660) Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImp

2018-10-09 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643418#comment-16643418
 ] 

Paul Praet commented on SPARK-18660:


It's really polluting our logs. Any workaround?
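Not a fix, but a sketch of how the message can be silenced, assuming the log4j 1.x API bundled with Spark 2.x (the same levels can be set in conf/log4j.properties):

{code:scala}
import org.apache.log4j.{Level, Logger}

// Raise the level of the Parquet loggers so the spurious counter warning is dropped.
Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)
Logger.getLogger("parquet").setLevel(Level.ERROR)
{code}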

> Parquet complains "Can not initialize counter due to context is not a 
> instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl "
> --
>
> Key: SPARK-18660
> URL: https://issues.apache.org/jira/browse/SPARK-18660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Major
>
> The Parquet record reader always complains "Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl". Looks like we 
> always create TaskAttemptContextImpl 
> (https://github.com/apache/spark/blob/2f7461f31331cfc37f6cfa3586b7bbefb3af5547/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L368).
>  But, Parquet wants to use TaskInputOutputContext, which is a subclass of 
> TaskAttemptContextImpl. 






[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-03 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636957#comment-16636957
 ] 

Paul Praet commented on SPARK-21402:


Still there in Spark 2.3.1.

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all through the way before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +-+--+
> |1498854000|1498870800  |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.






[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-09 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575026#comment-16575026
 ] 

Paul Praet commented on SPARK-24502:


We are only creating SparkSessions with SparkSession.builder().getOrCreate() and 
then calling sparkSession.close() when we are done.

Can you confirm this is not enough then?
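Concretely, the pattern looks like this (a sketch with our own names; the two clear*Session() calls at the end are the extra cleanup we are unsure about):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
try {
  // ... run the test body against `spark` ...
} finally {
  spark.close() // alias for stop()
  // Is this extra cleanup required after close()?
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()
}
{code}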

 

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}






[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-09 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574410#comment-16574410
 ] 

Paul Praet commented on SPARK-24502:


I find it hard to believe the above pull requests were accepted rather than 
tackling the root cause, which is the apparent resource leak when closing a Spark 
session.

We are upgrading from Spark 2.2.1 to 2.3.1 and we find our tests have become 
flaky as well (when running them consecutively). Even in our production code we 
are creating and closing multiple Spark sessions... 

I prefer not to pollute our code with those boilerplate statements.
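If it does turn out to be necessary, a sketch of how we would keep the boilerplate out of the individual tests (ScalaTest, all names are ours):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkSessionForTests extends BeforeAndAfterAll { self: Suite =>
  @transient protected var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder().master("local[2]").appName(suiteName).getOrCreate()
  }

  override def afterAll(): Unit = {
    try {
      if (spark != null) spark.stop()
      SparkSession.clearActiveSession()
      SparkSession.clearDefaultSession()
    } finally {
      super.afterAll()
    }
  }
}
{code}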

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}






[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-08 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573149#comment-16573149
 ] 

Paul Praet commented on SPARK-24502:


[~cloud_fan] Does this mean users also have to call this function when writing 
tests? Where is this documented? I would hope closing a Spark session is 
enough...

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}






[jira] [Created] (SPARK-21747) Java encoders - switch fields on collectAsList

2017-08-16 Thread Paul Praet (JIRA)
Paul Praet created SPARK-21747:
--

 Summary: Java encoders - switch fields on collectAsList
 Key: SPARK-21747
 URL: https://issues.apache.org/jira/browse/SPARK-21747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Paul Praet


Duplicate of https://issues.apache.org/jira/browse/SPARK-21402 for 2.2.0






[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121401#comment-16121401
 ] 

Paul Praet commented on SPARK-21402:


It seems changing the order of the fields in the struct can give some 
improvement, but when I add more fields the problem just gets worse: some 
fields never get filled in, or get filled in twice.
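For what it is worth, a sketch (Scala, placeholder DataFrame {{dfFlat}}) of the reordering that reduced the mismatches for us; it only works around the symptom, on the assumption that the bean encoder binds array-of-struct fields positionally in alphabetical property order:

{code:scala}
import org.apache.spark.sql.functions._

// Build the nested struct with its fields in alphabetical order (id, ssid, type)
// before the groupBy/collect_set, so the positions line up with the bean properties.
val dfStruct = dfFlat
  .withColumn("nodes", struct(col("id"), col("ssid"), col("type")))
  .select("writeKey", "nodes")
{code}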

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all through the way before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +-+--+
> |1498854000|1498870800  |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.






[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121297#comment-16121297
 ] 

Paul Praet edited comment on SPARK-21402 at 8/10/17 9:04 AM:
-

I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
        .agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List<Node> nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List<Node> getNodes() {
return nodes;
}

public void setNodes(List<Node> nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}




was (Author: praetp):
I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = fal

[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121297#comment-16121297
 ] 

Paul Praet edited comment on SPARK-21402 at 8/10/17 9:03 AM:
-

I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset dfArray = dfStruct.groupBy("writeKey")
.agg(functions.collect_set("nodes").alias("nodes"));
  Encoder topologyEncoder = Encoders.bean(Topology.class);
Dataset datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List getNodes() {
return nodes;
}

public void setNodes(List nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}




was (Author: praetp):
I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.
I have a datamodel like this (all Strings)

{noformat}
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes

[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121297#comment-16121297
 ] 

Paul Praet commented on SPARK-21402:


I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.
I have a datamodel like this (all Strings)

{noformat}
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
        .agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List<Node> nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List<Node> getNodes() {
return nodes;
}

public void setNodes(List<Node> nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}



> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map data;
> private Long offset;
>  }
> 

[jira] [Created] (SPARK-21666) Cannot handle Parquet type FIXED_LEN_BYTE_ARRAY

2017-08-08 Thread Paul Praet (JIRA)
Paul Praet created SPARK-21666:
--

 Summary: Cannot handle Parquet type FIXED_LEN_BYTE_ARRAY
 Key: SPARK-21666
 URL: https://issues.apache.org/jira/browse/SPARK-21666
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Paul Praet


I have a Parquet schema that looks like this:
{noformat}
optional group connection {
  required fixed_len_byte_array(6) localMacAddress;
  required fixed_len_byte_array(6) remoteMacAddress;
}
{noformat}

When I try to load this Parquet file in Spark, I get:
{noformat}
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY;
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:126)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:193)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:108)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:90)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84)
{noformat}

We are not able to change the schema so this issue prevents us from processing 
the data.


Duplicate of https://issues.apache.org/jira/browse/SPARK-2489






[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2017-08-08 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118093#comment-16118093
 ] 

Paul Praet commented on SPARK-2489:
---

No progress on this?

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Pei-Lun Lee
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}






[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2017-05-19 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017912#comment-16017912
 ] 

Paul Praet commented on SPARK-4820:
---

I confirm - still an issue when trying to build Spark 2.1.1 on Ubuntu 16.04.

> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Theodore Vasiloudis
>Priority: Minor
> Fix For: 1.4.0
>
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  -Xmax-classfile-name
> +  128
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}


