[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153888#comment-16153888
 ] 

Marco Gaido commented on SPARK-21888:
-

[~tgraves] Sorry, I misread. Of course, this doesn't add it to the client classpath, only 
to the driver and the executors. But in the example you made, i.e. writing to 
HBase, I can't see why you would need it on the client: it is enough to load the 
conf in the driver and the executors.
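
For reference, a minimal sketch of what I mean (the {{/etc/hbase/conf}} path is only a 
placeholder): the HBase client configuration can be put on the driver and executor 
classpaths with the standard extraClassPath settings, without touching the client.

{code:java}
import org.apache.spark.SparkConf

// Illustrative sketch only: ship the HBase configuration directory to the driver
// and the executors; nothing is added to the launcher (client) classpath.
val conf = new SparkConf()
  .set("spark.driver.extraClassPath", "/etc/hbase/conf")
  .set("spark.executor.extraClassPath", "/etc/hbase/conf")
{code}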

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> config files to the client classpath. For example, suppose you want to run an 
> application that uses HBase: unless we copy the necessary HBase config files 
> into the Spark conf folder, we cannot specify their exact locations on the 
> client-side classpath, which we could previously do by setting the environment 
> variable "SPARK_CLASSPATH".






[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154033#comment-16154033
 ] 

Marco Gaido commented on SPARK-21918:
-

What do you mean by "works correctly"? Actually, all the jobs are executed as 
the user who started the STS.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run as user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only inside the 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.
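
As a rough illustration of the per-thread sharing described above (a sketch only, not 
the actual patch), the cached Hive handle could live in a ThreadLocal instead of a 
single shared field:

{code:java}
// Hedged sketch: keep one Hive handle per thread instead of a single shared
// instance, so each thread's metastore client carries the right user.
private val threadHive = new ThreadLocal[Hive] {
  override def initialValue(): Hive = Hive.get(conf)
}

private def client: Hive = threadHive.get()
{code}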






[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155047#comment-16155047
 ] 

Marco Gaido commented on SPARK-21918:
-

What I meant is that if we want to support doAs, we shouldn't support it only 
for DDL operations, but also for all DML & DQL. I am pretty sure your fix won't 
affect the DML & DQL behavior, i.e. with your change we would support doAs only 
for DDL operations. This would leave a hybrid situation: doAs would work for DDL 
but not for DML & DQL, which is not a desirable condition.

PS: Out of curiosity, may I ask how you tested that your DDL commands were run 
as the session user?
Thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run as user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only inside the 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.






[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155162#comment-16155162
 ] 

Marco Gaido commented on SPARK-21918:
-

Yes, I think this would be great, thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run as user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only inside the 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.






[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156118#comment-16156118
 ] 

Marco Gaido commented on SPARK-21938:
-

It would be helpful if you could post some sample code that reproduces the issue, 
along with some sample data. Thanks.

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring, unpredictable bug where only a 
> partial write to CSV in S3 is performed for one partition of our Dataset. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with Parquet, which we also use, but moving all of 
> our data to Parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.






[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-07 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157091#comment-16157091
 ] 

Marco Gaido commented on SPARK-21944:
-

Could you please provide some sample data to reproduce the issue? Thanks.

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly:
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a DataFrame with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they always fall in 
> 1970-01-01), so the watermark is calculated incorrectly.






[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158406#comment-16158406
 ] 

Marco Gaido commented on SPARK-21944:
-

[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`
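
To make it concrete, a minimal sketch of the suggested change, adapted from the snippet 
in the description (untested, just to show where the watermark goes):

{code:java}
val counts = events
  .withWatermark("time", "10 seconds")            // watermark on the event-time column
  .select(window($"time", "5 seconds"), $"id")
  .dropDuplicates("id", "window")
  .groupBy("window")
  .count
{code}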

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly:
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a DataFrame with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they always fall in 
> 1970-01-01), so the watermark is calculated incorrectly.






[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 9:57 AM:
-

[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`


was (Author: mgaido):
[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly:
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a DataFrame with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they always fall in 
> 1970-01-01), so the watermark is calculated incorrectly.






[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 10:31 AM:
--

[~KevinZwx] you should define the watermark on the column {{"time"}}, not the 
column {{"window"}}


was (Author: mgaido):
[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly:
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a DataFrame with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they always fall in 
> 1970-01-01), so the watermark is calculated incorrectly.






[jira] [Created] (SPARK-21957) Add current_user function

2017-09-08 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-21957:
---

 Summary: Add current_user function
 Key: SPARK-21957
 URL: https://issues.apache.org/jira/browse/SPARK-21957
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido
Priority: Minor


Spark doesn't support the {{current_user}} function.

Although the user can be retrieved in other ways, the function would make it easier 
to migrate existing Hive queries to Spark, and it can also be convenient for people 
who interact with Spark purely through SQL.
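
Until such a function exists, a possible stop-gap (only a sketch, based on Hadoop's 
UserGroupInformation, which may not reflect an impersonated user in every deployment) 
is to register a small UDF:

{code:java}
import org.apache.hadoop.security.UserGroupInformation

// Hedged workaround sketch: expose the current Hadoop user as a SQL function.
spark.udf.register("current_user",
  () => UserGroupInformation.getCurrentUser.getShortUserName)

spark.sql("SELECT current_user()").show()
{code}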






[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162764#comment-16162764
 ] 

Marco Gaido commented on SPARK-21981:
-

[~yanboliang] Yes, thanks. I will post a PR ASAP.

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516; we should expose an 
> API for PySpark.
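
For reference, the existing Scala API that a PySpark wrapper would mirror looks roughly 
like this ({{predictions}} is a hypothetical DataFrame with "features" and "prediction" 
columns produced by a clustering model):

{code:java}
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Sketch of the current Scala-side usage that PySpark should expose.
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")

val silhouette = evaluator.evaluate(predictions)
{code}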






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168969#comment-16168969
 ] 

Marco Gaido commented on SPARK-22036:
-

This happens because there is an overflow in the operation. I am not sure what 
should be done in this case. The current implementation returns null when an 
operation causes a loss of precision.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169023#comment-16169023
 ] 

Marco Gaido commented on SPARK-22036:
-

Yes, it happens only for multiplications. The reason is that for a multiplication 
the result is expected to have a scale which is the sum of the scales of the two 
operands. When the result of the operation overflows, it is rounded and its scale 
ends up one less than expected; in this situation the result is set to null.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169059#comment-16169059
 ] 

Marco Gaido commented on SPARK-22036:
-

Honestly I don't know, that is why I said that I don't know what should be done.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169074#comment-16169074
 ] 

Marco Gaido commented on SPARK-22036:
-

Maybe the "bad" part is that by default spark creates the columns as 
{{Decimal(38, 18)}}. This is the problem. With a multiplication this leads to a 
{{Decimal(38, 36)}}, which as you can easily understand is the root of the 
problem of your operation. If you cast the two columns before the 
multiplication, like {{ds("a").cast(DecimalType(20,14))}}, you won't have any 
problem anymore.
Currently you should suggest Spark which are the right values to use.
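
To make the workaround concrete, a hedged sketch based on the reproduction above (the 
{{DecimalType(20, 14)}} values are only an illustrative choice that leaves room for the 
product):

{code:java}
import org.apache.spark.sql.types.DecimalType

// Cast the operands to a narrower decimal so the product's scale still fits
// within precision 38; (20, 14) is just an example.
val result = ds
  .select(ds("a").cast(DecimalType(20, 14)) * ds("b").cast(DecimalType(20, 14)))
  .collect
  .head
println(result) // expected to contain the product instead of [null]
{code}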

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22040) current_date function with timezone id

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169124#comment-16169124
 ] 

Marco Gaido commented on SPARK-22040:
-

May I work on this?

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts an 
> optional timezone id, but there's no function variant that exposes it.
> This is to have another {{current_date}} with the timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}
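
For what it's worth, until such an overload exists, a rough approximation is possible 
with existing functions (sketch only; whether this matches the intended semantics 
exactly depends on the session timezone handling):

{code:java}
import org.apache.spark.sql.functions.{current_timestamp, from_utc_timestamp, to_date}

// Hedged workaround: derive "today" in a given timezone from the current timestamp.
val todayInRome = to_date(from_utc_timestamp(current_timestamp(), "Europe/Rome"))
{code}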






[jira] [Created] (SPARK-22119) Add cosine distance to KMeans

2017-09-25 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22119:
---

 Summary: Add cosine distance to KMeans
 Key: SPARK-22119
 URL: https://issues.apache.org/jira/browse/SPARK-22119
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Marco Gaido
Priority: Minor


Currently, KMeans assumes the only possible distance measure is the Euclidean 
distance.

In some use cases, e.g. text mining, other distance measures like the cosine 
distance are widely used. Thus, for such use cases, it would be good to support 
multiple distance measures.

This ticket is to support the cosine distance measure in KMeans. Later, other 
algorithms can be extended to support several distance measures, and further 
distance measures can be added.
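
For concreteness, a minimal sketch of the cosine distance between two dense vectors 
(just the formula 1 - a·b / (||a|| ||b||), not the eventual MLlib implementation):

{code:java}
// Hedged sketch of the cosine distance measure, assuming non-zero vectors.
def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  1.0 - dot / (normA * normB)
}
{code}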






[jira] [Created] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22146:
---

 Summary: FileNotFoundException while reading ORC files containing 
'%'
 Key: SPARK-22146
 URL: https://issues.apache.org/jira/browse/SPARK-22146
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


Reading ORC files containing "strange" characters like '%' fails with a 
FileNotFoundException.

For instance, if you have:

{noformat}
/tmp/orc_test/folder %3Aa/orc1.orc
/tmp/orc_test/folder %3Ab/orc2.orc
{noformat}

and you try to read the ORC files with:


{noformat}
spark.read.format("orc").load("/tmp/orc_test/*/*").show
{noformat}

you will get a:

{noformat}
java.io.FileNotFoundException: File file:/tmp/orc_test/folder%20%253Aa/orc1.orc 
does not exist
  at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
  at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
  at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
  at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
  at 
org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
  at 
org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
  at scala.Option.orElse(Option.scala:289)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
  ... 48 elided
{noformat}

Note that the same code works for Parquet and text files.






[jira] [Commented] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182650#comment-16182650
 ] 

Marco Gaido commented on SPARK-22146:
-

If you look carefully at the file which Spark is looking for, you'll see that 
it doesn't exist because its name is the result of an improper encoding.
So, yes, the right file exists, but Spark is looking for the wrong one.
We tried both on HDFS and on the local filesystem; the error is the same, and 
it is due to the encoding of the path in the inferSchema process. I am 
preparing a PR to fix it and will post it as soon as it is ready.
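
To illustrate the double encoding (a hedged sketch using java.net.URI, whose 
multi-argument constructor percent-encodes the raw path it is given):

{code:java}
import java.net.URI

// A directory literally named "folder %3Aa" is encoded a second time:
// the space becomes %20 and the '%' becomes %25, giving "folder%20%253Aa".
val encoded = new URI("file", null, "/tmp/orc_test/folder %3Aa/orc1.orc", null)
println(encoded) // file:/tmp/orc_test/folder%20%253Aa/orc1.orc
{code}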

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.






[jira] [Comment Edited] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182686#comment-16182686
 ] 

Marco Gaido edited comment on SPARK-22146 at 9/27/17 2:58 PM:
--

Yes, that is a local file and I am running `spark-shell` locally on my machine 
from the current master.


was (Author: mgaido):
Yes, that is a local file and I am running `spark-shell` locally from the 
current master.

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.






[jira] [Commented] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182686#comment-16182686
 ] 

Marco Gaido commented on SPARK-22146:
-

Yes, that is a local file and I am running `spark-shell` locally from the 
current master.

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.






[jira] [Resolved] (SPARK-22040) current_date function with timezone id

2017-10-02 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22040.
-
Resolution: Invalid

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts an 
> optional timezone id, but there's no function variant that exposes it.
> This is to have another {{current_date}} with the timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}






[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-06-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049105#comment-16049105
 ] 

Marco Gaido commented on SPARK-19909:
-

[~rvoyer] There is an easy workaround: set the {{checkpointLocation}} option or the 
{{spark.sql.streaming.checkpointLocation}} configuration parameter.
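
For example (sketch only; the HDFS path is a placeholder):

{code:java}
// Hedged sketch: give the query an explicit checkpoint location on HDFS, so the
// temporary local checkpoint directory is never used.
val handle = stream.writeStream
  .format("console")
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/my-query")
  .start()
{code}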

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail because of an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> This is because the temporary checkpoint directory is created on the local file 
> system, while the metadata, whose path is based on the checkpoint directory, is 
> created on HDFS.





