[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-02-01 Thread Cheng Lian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485215#comment-17485215
 ] 

Cheng Lian commented on SPARK-37980:


[~prakharjain09], as you've mentioned, it's not super straightforward to 
customize the Parquet code paths in Spark to achieve this goal. In the 
meantime, this functionality is quite useful in general. I can imagine it 
enabling other systems in the Parquet ecosystem to build more sophisticated 
indexing solutions. Instead of doing heavy customizations in Spark, would it 
be better to make the changes in upstream {{parquet-mr}} so that other 
systems can benefit from them more easily?

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for file-based data 
> sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within a file, e.g. the 5th row 
> in a file will have ROW_INDEX 5.
>  
> Use cases:
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark 
> rows, e.g. by an indexer.
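For illustration, a minimal spark-shell style sketch of how such a column could be consumed. The {{_metadata.file_path}} field already exists via SPARK-37273; the {{_metadata.row_index}} field below is hypothetical (it is the field this ticket proposes), and the table path is a placeholder.

{code:scala}
import org.apache.spark.sql.functions.col

// Pair every row with its provenance: the file it came from and (proposed)
// its position within that file.
val df = spark.read.parquet("/path/to/table")
  .select(
    col("_metadata.file_path"),  // hidden metadata column added by SPARK-37273
    col("_metadata.row_index"),  // hypothetical field proposed by this ticket
    col("*"))                    // all regular data columns

df.show(truncate = false)
{code}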






[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-30 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-31935:
---
Affects Version/s: (was: 3.0.1)
   (was: 3.1.0)
   2.4.6
   3.0.0

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the Hadoop configuration used 
> by the method `checkAndGlobPathIfNecessary`.
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws IOException {
>     String scheme = uri.getScheme();
>     String authority = uri.getAuthority();
>     if (scheme == null && authority == null) {     // use default FS
>       return get(conf);
>     }
>     if (scheme != null && authority == null) {     // no authority
>       URI defaultUri = getDefaultUri(conf);
>       if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
>           && defaultUri.getAuthority() != null) {  // & default has authority
>         return get(defaultUri, conf);              // return default
>       }
>     }
>
>     String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
>     if (conf.getBoolean(disableCacheName, false)) {
>       return createFileSystem(uri, conf);
>     }
>     return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme- and authority-related configurations 
> for the file systems being scanned.
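As an illustration of the behavior this enables, a minimal spark-shell style sketch (the bucket, path, and environment variables are placeholders; the fs.s3a.* keys are ordinary Hadoop configuration names passed here as data source options):

{code:scala}
// With the fix, these options take effect already while the input paths are
// resolved/globbed, not only when the files are read.
val df = spark.read
  .option("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .option("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .parquet("s3a://some-bucket/some/path")

df.printSchema()
{code}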






[jira] [Updated] (SPARK-26352) Join reordering should not change the order of output attributes

2020-05-29 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26352:
---
Summary: Join reordering should not change the order of output attributes  
(was: join reordering should not change the order of output attributes)

> Join reordering should not change the order of output attributes
> 
>
> Key: SPARK-26352
> URL: https://issues.apache.org/jira/browse/SPARK-26352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} 
> performs join reordering on inner joins. It was introduced by SPARK-12032 in 
> December 2015.
> After reordering the joins, however, it didn't check whether the column order 
> (in terms of the {{output}} attribute list) was still the same as before. 
> Thus, it's possible to have a mismatch between the reordered column order and 
> the schema that the DataFrame thinks it has.
> This can be demonstrated with the following example:
> {code:none}
> spark.sql("create table table_a (x int, y int) using parquet")
> spark.sql("create table table_b (i int, j int) using parquet")
> spark.sql("create table table_c (a int, b int) using parquet")
> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i")
> {code}
> here's what the DataFrame thinks:
> {code:none}
> scala> df.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: integer (nullable = true)
>  |-- i: integer (nullable = true)
>  |-- j: integer (nullable = true)
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> {code}
> here's what the optimized plan thinks, after join reordering:
> {code:none}
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- a: integer
> |-- b: integer
> |-- i: integer
> |-- j: integer
> {code}
> If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule 
> exclusion feature), it's back to normal:
> {code:none}
> scala> spark.conf.set("spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.ReorderJoin")
> scala> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i")
> df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields]
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- i: integer
> |-- j: integer
> |-- a: integer
> |-- b: integer
> {code}
> Note that this column ordering problem leads to data corruption, and can 
> manifest itself in various symptoms:
> * Silent data corruption: if the reordered columns happen to have matching or 
> sufficiently compatible types (e.g. all fixed-length primitive types are 
> considered "sufficiently compatible" in an UnsafeRow), the resulting data is 
> simply wrong but might not trigger any alarms immediately; or
> * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, 
> or even SIGSEGVs.
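A small spark-shell style sketch of the mismatch check described above, assuming the three tables from the example exist; on an affected Spark version with {{ReorderJoin}} enabled, the assertion fails:

{code:scala}
val df = spark.sql(
  "with df1 as (select * from table_a cross join table_b) " +
  "select * from df1 join table_c on a = x and b = i")

// User-facing schema order vs. the optimized plan's output attribute order.
val schemaOrder = df.schema.fieldNames.toSeq
val planOrder   = df.queryExecution.optimizedPlan.output.map(_.name)
assert(schemaOrder == planOrder, s"column order mismatch: $schemaOrder vs $planOrder")
{code}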






Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we
will need a hive-2.3 profile anyway, regardless of whether we keep the
hive-1.2 profile or not.

On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian  wrote:

> Just to summarize my points:
>
>1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>optional. End-users may choose between Hive 1.2/2.3 via a new profile
>(either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>me, depending on which Hive version we pick as the default version).
>2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>people may have more choices in production, and makes Spark 3.0 migration
>easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>2.3 and/or JDK 11.).
>3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>have a preference as long as the above two are met.
>
>
> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:
>
>> Dongjoon, I don't think we have any conflicts here. As stated in other
>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>> can be decoupled, I have no preference over picking which Hive/Hadoop
>> version as the default version. So the following two plans both work for me:
>>
>>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>have an extra hive-2.3 profile.
>>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>have an extra hive-1.2 profile.
>>
>> BTW, I was also discussing Hive dependency issues with other people
>> offline, and I realized that the Hive isolated client loader is not well
>> known, and caused unnecessary confusion/worry. So I would like to provide
>> some background context for readers who are not familiar with Spark Hive
>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
>> you can only interact with Hive 1.2.1.*
>>
>> Spark does work with different versions of Hive metastore via an isolated
>> classloading mechanism. *Even if Spark itself is built with the Hive
>> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
>> been true ever since Spark 1.x.* In order to do this, just set the
>> following two options according to instructions in our official doc page
>> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
>> :
>>
>>- spark.sql.hive.metastore.version
>>- spark.sql.hive.metastore.jars
>>
>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>> dependencies from Maven at runtime when initializing the Hive metastore
>> client. And those dependencies will NOT conflict with the built-in Hive
>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>> classloader (see here
>> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
>> Historically, we call these two sets of Hive dependencies "execution Hive"
>> and "metastore Hive". The former is mostly used for features like SerDe,
>> while the latter is used to interact with Hive metastore. And the Hive
>> version upgrade we are discussing here is about the execution Hive.
>>
>> Cheng
>>
>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
>> wrote:
>>
>>> Nice. That's a progress.
>>>
>>> Let's narrow down to the path. We need to clarify what is the criteria
>>> we can agree.
>>>
>>> 1. What does `battle-tested for years` mean exactly?
>>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>>
>>> 2. What is the new "Hive integration in Spark"?
>>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>>> said.
>>> Most of code is shared for Hive 1.2 and Hive 2.3.
>>> That means if there is a bug inside this shared code, both of them
>>> will be affected.
>>> Of course, we can fix this because it's Spark code. We will learn
>>> and fix it as you said.
>>>
>>> >  Yes, there are issues, but people have learned how to get along
>>> with these issues.
>>>
>>> The only non-shared code are the following.
>>> Do you have a concern on the following directories?
>>> If there is no bugs

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now
I'm inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What made me
concerned initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely
officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
Thanks for starting the discussion!

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun 
wrote:

> Yes. Right. That's the situation we are hitting and the result I expected.
> We need to change our default with Hive 2 in the POM.
>
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:
>
>> Yes, good point. A user would get whatever the POM says without
>> profiles enabled so it matters.
>>
>> Playing it out, an app _should_ compile with the Spark dependency
>> marked 'provided'. In that case the app that is spark-submit-ted is
>> agnostic to the Hive dependency as the only one that matters is what's
>> on the cluster. Right? we don't leak through the Hive API in the Spark
>> API. And yes it's then up to the cluster to provide whatever version
>> it wants. Vendors will have made a specific version choice when
>> building their distro one way or the other.
>>
>> If you run a Spark cluster yourself, you're using the binary distro,
>> and we're already talking about also publishing a binary distro with
>> this variation, so that's not the issue.
>>
>> The corner cases where it might matter are:
>>
>> - I unintentionally package Spark in the app and by default pull in
>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>> causes other problems
>> - I run tests locally in my project, which will pull in a default
>> version of Hive defined by the POM
>>
>> Double-checking, is that right? if so it kind of implies it doesn't
>> matter. Which is an argument either way about what's the default. I
>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>> something about the implication?
>>
>> (That fork will stay published forever anyway, that's not an issue per
>> se.)
>>
>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
>> wrote:
>> > Sean, our published POM is pointing and advertising the illegitimate
>> Hive 1.2 fork as a compile dependency.
>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>> like that?
>> > If someone want to use that illegitimate Hive 1.2 fork, let them
>> override it. We are unable to delete those illegitimate Hive 1.2 fork.
>> > Those artifacts will be orphans.
>> >
>>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the
spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles,
"hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud
POM (here
<https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L174> and
here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L213>).
But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile
is mentioned. And I came to the wrong conclusion that spark-hadoop-cloud in
Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for
the misleading information.

Cheng



On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas 
wrote:

> > I don't think the default Hadoop version matters except for the
> spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
> profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian  wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifact, I don't think the default Hadoop version
>> matters except for the spark-hadoop-cloud module, which is only meaningful
>> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
>> Maven central are Hadoop-version-neutral.
>>
>> Another issue about switching the default Hadoop version to 3.2 is
>> PySpark distribution. Right now, we only publish PySpark artifacts prebuilt
>> with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency
>> to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
>> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via
>> the proposed hive-2.3 profile, I personally don't have a preference over
>> having Hadoop 2.7 or 3.2 as the default Hadoop version. But just for
>> minimizing the release management work, in case we decided to publish other
>> spark-* Maven artifacts from a Hadoop 2.7 build, we can still special case
>> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
>> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss Hive issue
>>>
>>> because this thread was originally for `hadoop` version.
>>>
>>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>>> `hadoop-3.0` versions.
>>>
>>> We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>>> wrote:
>>>
>>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>>> It is old and rather buggy; and It’s been *years*
>>>>
>>>> I think we should decouple hive change from everything else if people
>>>> are concerned?
>>>>
>>>> --
>>>> *From:* Steve Loughran 
>>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>>> *To:* Cheng Lian 
>>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>>> Dongjoon Hyun ; dev ;
>>>> Yuming Wang 
>>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>>
>>>> Can I take this moment to remind everyone that the version of hive
>>>> which spark has historically bundled (the org.spark-project one) is an
>>>> orphan project put together to deal with Hive's shading issues and a source
>>>> of unhappiness in the Hive project. What ever get shipped should do its
>>>> best to avoid including that file.
>>>>
>>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the
>>>> safest move from a risk minimisation perspective. If something has broken
>>>> then it is you can start with the assumption that it is in the o.a.s
>>>> packages without having to debug o.a.hadoop and o.a.hive first. There is a
>>>> cost: if there are problems with the hadoop / hive dependencies those teams
>>>> will inevitably ignore filed bug reports for the same reason spark team
>>>> will probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for
>>>> the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear
>>>> that in mind. It's not been tested, it has dependencies on artifacts we
>>>> know are incompatible, and as far as the Hadoop project is concern

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Just to summarize my points:

   1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
   optional. End-users may choose between Hive 1.2/2.3 via a new profile
   (either adding a hive-1.2 profile or adding a hive-2.3 profile works for
   me, depending on which Hive version we pick as the default version).
   2. Decouple Hive version upgrade and Hadoop version upgrade, so that
   people may have more choices in production, and makes Spark 3.0 migration
   easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
   2.3 and/or JDK 11.).
   3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
   have a preference as long as the above two are met.


On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:

> Dongjoon, I don't think we have any conflicts here. As stated in other
> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
> can be decoupled, I have no preference over picking which Hive/Hadoop
> version as the default version. So the following two plans both work for me:
>
>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
>an extra hive-2.3 profile.
>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>have an extra hive-1.2 profile.
>
> BTW, I was also discussing Hive dependency issues with other people
> offline, and I realized that the Hive isolated client loader is not well
> known, and caused unnecessary confusion/worry. So I would like to provide
> some background context for readers who are not familiar with Spark Hive
> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
> you can only interact with Hive 1.2.1.*
>
> Spark does work with different versions of Hive metastore via an isolated
> classloading mechanism. *Even if Spark itself is built with the Hive
> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
> been true ever since Spark 1.x.* In order to do this, just set the
> following two options according to instructions in our official doc page
> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
> :
>
>- spark.sql.hive.metastore.version
>- spark.sql.hive.metastore.jars
>
> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
> dependencies from Maven at runtime when initializing the Hive metastore
> client. And those dependencies will NOT conflict with the built-in Hive
> 1.2.1 jars, because the downloaded jars are loaded using an isolated
> classloader (see here
> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
> Historically, we call these two sets of Hive dependencies "execution Hive"
> and "metastore Hive". The former is mostly used for features like SerDe,
> while the latter is used to interact with Hive metastore. And the Hive
> version upgrade we are discussing here is about the execution Hive.
>
> Cheng
>
> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
> wrote:
>
>> Nice. That's a progress.
>>
>> Let's narrow down to the path. We need to clarify what is the criteria we
>> can agree.
>>
>> 1. What does `battle-tested for years` mean exactly?
>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>
>> 2. What is the new "Hive integration in Spark"?
>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>> said.
>> Most of code is shared for Hive 1.2 and Hive 2.3.
>> That means if there is a bug inside this shared code, both of them
>> will be affected.
>> Of course, we can fix this because it's Spark code. We will learn and
>> fix it as you said.
>>
>> >  Yes, there are issues, but people have learned how to get along
>> with these issues.
>>
>> The only non-shared code are the following.
>> Do you have a concern on the following directories?
>> If there is no bugs on the following codebase, can we switch?
>>
>> $ find . -name v2.3.5
>> ./sql/core/v2.3.5
>> ./sql/hive-thriftserver/v2.3.5
>>
>> 3. We know that we can keep both code bases, but the community should
>> choose Hive 2.3 officially.
>> That's the right choice in the Apache project policy perspective. At
>> least, Sean and I prefer that.
>> If someone really want to stick to Hive 1.2 fork, they can use it at
>> their own risks.
>>
>> > for Spark 3.0 end-

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Dongjoon, I don't think we have any conflicts here. As stated in other
threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
can be decoupled, I have no preference over picking which Hive/Hadoop
version as the default version. So the following two plans both work for me:

   1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
   an extra hive-2.3 profile.
   2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and have
   an extra hive-1.2 profile.

BTW, I was also discussing Hive dependency issues with other people
offline, and I realized that the Hive isolated client loader is not well
known, and caused unnecessary confusion/worry. So I would like to provide
some background context for readers who are not familiar with Spark Hive
integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
you can only interact with Hive 1.2.1.*

Spark does work with different versions of Hive metastore via an isolated
classloading mechanism. *Even if Spark itself is built with the Hive 1.2.1
fork, you can still interact with a Hive 2.3 metastore, and this has been
true ever since Spark 1.x.* In order to do this, just set the following two
options according to instructions in our official doc page
<http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
:

   - spark.sql.hive.metastore.version
   - spark.sql.hive.metastore.jars

Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
"spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
dependencies from Maven at runtime when initializing the Hive metastore
client. And those dependencies will NOT conflict with the built-in Hive
1.2.1 jars, because the downloaded jars are loaded using an isolated
classloader (see here
<https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
Historically, we call these two sets of Hive dependencies "execution Hive"
and "metastore Hive". The former is mostly used for features like SerDe,
while the latter is used to interact with Hive metastore. And the Hive
version upgrade we are discussing here is about the execution Hive.
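(A minimal sketch of the two settings above, added here for illustration; the application name and query are placeholders, and both settings must be in place before the Hive metastore client is first initialized.)

{code:scala}
import org.apache.spark.sql.SparkSession

// Build a session whose metastore client talks to a Hive 2.3.6 metastore,
// independently of the execution Hive version Spark itself was built with.
val spark = SparkSession.builder()
  .appName("metastore-isolation-sketch")
  .config("spark.sql.hive.metastore.version", "2.3.6") // metastore Hive version
  .config("spark.sql.hive.metastore.jars", "maven")    // fetch metastore jars at runtime
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}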

Cheng

On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
wrote:

> Nice. That's a progress.
>
> Let's narrow down to the path. We need to clarify what is the criteria we
> can agree.
>
> 1. What does `battle-tested for years` mean exactly?
> How and when can we start the `battle-tested` stage for Hive 2.3?
>
> 2. What is the new "Hive integration in Spark"?
> During introducing Hive 2.3, we fixed the compatibility stuff as you
> said.
> Most of code is shared for Hive 1.2 and Hive 2.3.
> That means if there is a bug inside this shared code, both of them
> will be affected.
> Of course, we can fix this because it's Spark code. We will learn and
> fix it as you said.
>
> >  Yes, there are issues, but people have learned how to get along
> with these issues.
>
> The only non-shared code are the following.
> Do you have a concern on the following directories?
> If there is no bugs on the following codebase, can we switch?
>
> $ find . -name v2.3.5
> ./sql/core/v2.3.5
> ./sql/hive-thriftserver/v2.3.5
>
> 3. We know that we can keep both code bases, but the community should
> choose Hive 2.3 officially.
> That's the right choice in the Apache project policy perspective. At
> least, Sean and I prefer that.
> If someone really want to stick to Hive 1.2 fork, they can use it at
> their own risks.
>
> > for Spark 3.0 end-users who really don't want to interact with this
> Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>
> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the
> default Hive 2.3 pom at least?
> How do you think about that way, Cheng?
>
> Bests,
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian  wrote:
>
>> Hey Dongjoon and Felix,
>>
>> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
>> wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>>
>> However, *"Hive" and "Hive integration in Spark" are two quite different
>> things*, and I don't think anybody has ever mentioned "the forked Hive
>> 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
>> double-checked all my replies).
>>
>> What I really care about is the stability and quality of "Hive
>> integration in Spark", which have gone through some major updates due to
>> the recent Hive 2.3 upgrade 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, *"Hive" and "Hive integration in Spark" are two quite different
things*, and I don't think anybody has ever mentioned "the forked Hive
1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
double-checked all my replies).

What I really care about is the stability and quality of "Hive integration
in Spark", which have gone through some major updates due to the recent
Hive 2.3 upgrade in Spark 3.0. We had already found bugs in this piece, and
empirically, for a significant upgrade like this one, it is not surprising
that other bugs/regressions can be found in the near future. On the other
hand, the Hive 1.2 integration code path in Spark has been battle-tested
for years. Yes, there are issues, but people have learned how to get along
with these issues. And please don't forget that, for Spark 3.0 end-users
who really don't want to interact with this Hive 1.2 fork, they can always
use Hive 2.3 at their own risks.

True, "stable" is quite vague a criterion, and hard to be proven. But that
is exactly the reason why we may want to be conservative and wait for some
time and see whether there are further signals suggesting that the Hive 2.3
integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor
releases, if we've fixed all the outstanding issues and no more significant
ones are showing up, we can declare that the Hive 2.3 integration in Spark
3.x is stable, and then we can consider removing reference to the Hive 1.2
fork. Does that make sense?

Cheng

On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung 
wrote:

> Just to add - hive 1.2 fork is definitely not more stable. We know of a
> few critical bug fixes that we cherry picked into a fork of that fork to
> maintain ourselves.
>
>
> --
> *From:* Dongjoon Hyun 
> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
> *To:* Sean Owen 
> *Cc:* dev 
> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>
> Thanks. That will be a giant step forward, Sean!
>
> > I'd prefer making it the default in the POM for 3.0.
>
> Bests,
> Dongjoon.
>
> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:
>
> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
> same old and buggy that's been there a while. "stable" in that sense
> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
> bug fixes that are important; the question isn't just 1.x releases.
>
> What I don't know is how much affects Spark, as it's a Hive client
> mostly. Clearly some do.
>
> I'd prefer making it the default in the POM for 3.0. Mostly on the
> grounds that its effects are on deployed clusters, not apps. And
> deployers can still choose a binary distro with 1.x or make the choice
> they want. Those that don't care should probably be nudged to 2.x.
> Spark 3.x is already full of behavior changes and 'unstable', so I
> think this is minor relative to the overall risk question.
>
> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > I'm sending this email because it's important to discuss this topic
> narrowly
> > and make a clear conclusion.
> >
> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
> is
> > stabler than XXX, please give us the evidence. Then, we can fix it.
> > Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
> >
> > Historically, the following forked Hive 1.2.1 has never been stable.
> > It's just frozen. Since the forked Hive is out of our control, we
> ignored bugs.
> > That's all. The reality is a way far from the stable status.
> >
> > https://mvnrepository.com/artifact/org.spark-project.hive/
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
> (2015 August)
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
> (2016 April)
> >
> > First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and
> 1.2.3,
> >
> > Apache Hive 1.2.2 has 50 bug fixes.
> > Apache Hive 1.2.3 has 9 bug fixes.
> >
> > I will not cover all of them, but Apache Hive community also backports
> > important patches like Apache Spark community.
> >
> > Second, let's move to SPARK issues because we aren't exposed to all Hive
> issues.
> >
> > SPARK-19109 ORC metadata section can sometimes exceed protobuf
> message size limit
> > SPARK-22267 Spark SQL incorrectly reads ORC file when column order
> is different
> >
> > These were reported since Apache Spark 1.6.x because the forked Hive
> doesn't have
> > a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
> >
> > Since we couldn't update the frozen forked Hive, we added Apache ORC
> dependency
> > at SPARK-20682 (2.3.0), added a switching 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 to 3.1.0, but I don't think we should do it
immediately after cutting branch-3.0. The Hive 1.2 code paths can only
be removed once the Hive 2.3 code paths are proven to be stable. If they
turn out to be buggy in Spark 3.1, we may want to further postpone
SPARK-20202 to 3.2.0.

On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun 
wrote:

> Yes. It does. I meant SPARK-20202.
>
> Thanks. I understand that it can be considered like Scala version issue.
> So, that's the reason why I put this as a `policy` issue from the
> beginning.
>
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
>
> In the policy perspective, we should remove this immediately if we have a
> solution to fix this.
> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
> the current discussion status.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> And, if there is no other issues, I'll create a PR to remove it from
> `master` branch when we cut `branch-3.0`.
>
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
> you think about this, Sean?
> The preparation is already started in another email thread and I believe
> that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:
>
>> It's kinda like Scala version upgrade. Historically, we only remove the
>> support of an older Scala version when the newer version is proven to be
>> stable after one or more Spark minor versions.
>>
>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:
>>
>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>> version. After all, for end-users and providers who need a particular
>>> version combination, they can always build Spark with proper profiles
>>> themselves.
>>>
>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>> it's due to the folder name.
>>>
>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>
>>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>>>
>>>> We can replace it immediately if we want right now.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, Cheng.
>>>>>
>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>>>> If we consider them, it could be the followings.
>>>>>
>>>>> +------------+-----------------+--------------------+
>>>>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
>>>>> +------------+-----------------+--------------------+
>>>>> | Legitimate |        X        |         O          |
>>>>> | JDK11      |        X        |         O          |
>>>>> | Hadoop3    |        X        |         O          |
>>>>> | Hadoop2    |        O        |         O          |
>>>>> | Functions  |    Baseline     |        More        |
>>>>> | Bug fixes  |    Baseline     |        More        |
>>>>> +------------+-----------------+--------------------+
>>>>>
>>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>>
>>>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>>>> to give more visibility to the whole community,
>>>>>
>>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>>> distribution
>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>>> after `branch-3.0` branch cut.
>>>>>
>>>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>>>> But, it's time to prepare. 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian
Hey Steve,

In terms of Maven artifact, I don't think the default Hadoop version
matters except for the spark-hadoop-cloud module, which is only meaningful
under the hadoop-3.2 profile. All  the other spark-* artifacts published to
Maven central are Hadoop-version-neutral.

Another issue about switching the default Hadoop version to 3.2 is PySpark
distribution. Right now, we only publish PySpark artifacts prebuilt with
Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
3.2 is feasible for PySpark users. Or maybe we should publish PySpark
prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.

Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
proposed hive-2.3 profile, I personally don't have a preference over having
Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
the release management work, in case we decided to publish other spark-*
Maven artifacts from a Hadoop 2.7 build, we can still special case
spark-hadoop-cloud and publish it using a hadoop-3.2 build.

On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
wrote:

> I also agree with Steve and Felix.
>
> Let's have another thread to discuss Hive issue
>
> because this thread was originally for `hadoop` version.
>
> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
> `hadoop-3.0` versions.
>
> We don't need to mix both.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
> wrote:
>
>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
>> is old and rather buggy; and It’s been *years*
>>
>> I think we should decouple hive change from everything else if people are
>> concerned?
>>
>> ------
>> *From:* Steve Loughran 
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian 
>> *Cc:* Sean Owen ; Wenchen Fan ;
>> Dongjoon Hyun ; dev ;
>> Yuming Wang 
>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>
>> Can I take this moment to remind everyone that the version of hive which
>> spark has historically bundled (the org.spark-project one) is an orphan
>> project put together to deal with Hive's shading issues and a source of
>> unhappiness in the Hive project. What ever get shipped should do its best
>> to avoid including that file.
>>
>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>> move from a risk minimisation perspective. If something has broken then it
>> is you can start with the assumption that it is in the o.a.s packages
>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>> there are problems with the hadoop / hive dependencies those teams will
>> inevitably ignore filed bug reports for the same reason spark team will
>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>> in mind. It's not been tested, it has dependencies on artifacts we know are
>> incompatible, and as far as the Hadoop project is concerned: people should
>> move to branch 3 if they want to run on a modern version of Java
>>
>> It would be really really good if the published spark maven artefacts (a)
>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
>> That way people doing things with their own projects will get up-to-date
>> dependencies and don't get WONTFIX responses themselves.
>>
>> -Steve
>>
>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>> the transition release.
>>
>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>>
>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>> seemed risky, and therefore we only introduced Hive 2.3 under the
>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>> here...
>>
>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>> upgrade together looks too risky.
>>
>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16,

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kinda like Scala version upgrade. Historically, we only remove the
support of an older Scala version when the newer version is proven to be
stable after one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:

> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
> forked Hive 1.2 dependencies completely, no? As long as we still keep the
> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
> particular preference between using Hive 1.2 or 2.3 as the default Hive
> version. After all, for end-users and providers who need a particular
> version combination, they can always build Spark with proper profiles
> themselves.
>
> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
> due to the folder name.
>
> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
>
>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>
>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>
>> We can replace it immediately if we want right now.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Cheng.
>>>
>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>> If we consider them, it could be the followings.
>>>
>>> +------------+-----------------+--------------------+
>>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
>>> +------------+-----------------+--------------------+
>>> | Legitimate |        X        |         O          |
>>> | JDK11      |        X        |         O          |
>>> | Hadoop3    |        X        |         O          |
>>> | Hadoop2    |        O        |         O          |
>>> | Functions  |    Baseline     |        More        |
>>> | Bug fixes  |    Baseline     |        More        |
>>> +------------+-----------------+--------------------+
>>>
>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>> (including Jenkins/GitHubAction/AppVeyor).
>>>
>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>> to give more visibility to the whole community,
>>>
>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>> distribution
>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>> after `branch-3.0` branch cut.
>>>
>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>> But, it's time to prepare. Without them, we are going to be insufficient
>>> again and again.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
>>> wrote:
>>>
>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>> and here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>> .)
>>>>
>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>>> good enough for covering such major upgrades.
>>>>
>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>>
>>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>>>> 3.1
>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>> reference
>>>>> immediately after `branch-3.0` cut.
>>>>>
>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>> `hadoop-2.7`.
>>>>>
>>>>> -
>>>>&

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Hmm, what exactly did you mean by "remove the usage of forked `hive` in
Apache Spark 3.0 completely officially"? I thought you wanted to remove the
forked Hive 1.2 dependencies completely, no? As long as we still keep the
Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
particular preference between using Hive 1.2 or 2.3 as the default Hive
version. After all, for end-users and providers who need a particular
version combination, they can always build Spark with proper profiles
themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For directory name, we use '1.2.1' and '2.3.5' because we just delayed the
> renaming the directories until 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>> If we consider them, it could be the followings.
>>
>> +------------+-----------------+--------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
>> +------------+-----------------+--------------------+
>> | Legitimate |        X        |         O          |
>> | JDK11      |        X        |         O          |
>> | Hadoop3    |        X        |         O          |
>> | Hadoop2    |        O        |         O          |
>> | Functions  |    Baseline     |        More        |
>> | Bug fixes  |    Baseline     |        More        |
>> +------------+-----------------+--------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>> to give more visibility to the whole community,
>>
>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>> after `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to (1) and (2) due to its burden.
>> But, it's time to prepare. Without them, we are going to be insufficient
>> again and again.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>> and here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>> .)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>> good enough for covering such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>
>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>>> 3.1
>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>> reference
>>>> immediately after `branch-3.0` cut.
>>>>
>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>>
>>>> -
>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>
>>>> The way I see this is that it's not a user problem. Apache Spark
>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>> We need to drop it by ourselves because we created it and it's our bad.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>>>
>>>>> Just to clarify, as even I have lost the details over time: ha

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
referring both Hive 2.3.6 and 2.3.5 at the moment, see here

and here

.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
3.0. For preview releases, I'm afraid that their visibility is not good
enough for covering such major upgrades.

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
wrote:

> Thank you for feedback, Hyujkjin and Sean.
>
> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
> if we can make a decision to eliminate the illegitimate Hive fork reference
> immediately after `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see this is that it's not a user problem. Apache Spark community
> didn't try to drop the illegitimate Hive fork yet.
> We need to drop it by ourselves because we created it and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? it isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity.
>> Question is simply how much risk that entails. Keeping in mind that
>> Spark 3.0 is already something that people understand works
>> differently. We can accept some behavior changes.
>>
>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>> > Also, this is orthogonal from `hadoop` version discussion.
>> >
>> > Apache Spark community kept (not maintained) the forked Apache Hive
>> > 1.2.1 because there has been no other options before. As we see at
>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-20202
>> >
>> > Also, please note that we `kept`, not `maintained`, because we know
>> it's not good.
>> > There are several attempt to update that forked repository
>> > for several reasons (Hadoop 3 support is one of the example),
>> > but those attempts are also turned down.
>> >
>> > From Apache Spark 3.0, it seems that we have a new feasible option
>> > `hive-2.3` profile. What about moving forward in this direction further?
>> >
>> > For example, can we remove the usage of forked `hive` in Apache Spark
>> 3.0
>> > completely officially? If someone still needs to use the forked `hive`,
>> we can
>> > have a profile `hive-1.2`. Of course, it should not be a default
>> profile in the community.
>> >
>> > I want to say this is a goal we should achieve someday.
>> > If we don't do anything, nothing happen. At least we need to prepare
>> this.
>> > Without any preparation, Spark 3.1+ will be the same.
>> >
>> > Shall we focus on what are our problems with Hive 2.3.6?
>> > If the only reason is that we didn't use it before, we can release
>> another
>> > `3.0.0-preview` for that.
>> >
>> > Bests,
>> > Dongjoon.
>>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
upgrade together looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:

> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> wrote:
> >>
> >> Thank you for suggestion.
> >>
> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Cc Yuming, Steve, and Dongjoon

On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian  wrote:

> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>- Hadoop 3.2
>- Hive 2.3
>- JDK 11
>
> We have already found and fixed some feature and performance regressions
> related to these upgrades. Empirically, I’m not surprised at all if more
> regressions are lurking somewhere. On the other hand, we do want help from
> the community to help us to evaluate and stabilize these new changes.
> Following that, I’d like to propose:
>
>1.
>
>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>Hadoop/Hive/JDK version combinations.
>
>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>profile, so that users may try out some less risky Hadoop/Hive/JDK
>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>face potential regressions introduced by the Hadoop 3.2 upgrade.
>
>Yuming Wang has already sent out PR #26533
><https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>profile yet), and the result looks promising: the Kafka streaming and Arrow
>related test failures should be irrelevant to the topic discussed here.
>
>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
>of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
>version. For users who are still using Hadoop 2.x in production, they will
>have to use a hadoop-provided prebuilt package or build Spark 3.0
>against their own 2.x version anyway. It does make a difference for cloud
>users who don’t use Hadoop at all, though. And this probably also helps to
>stabilize the Hadoop 3.2 code path faster since our PR builder will
>exercise it regularly.
>2.
>
>Defer Hadoop 2.x upgrade to Spark 3.1+
>
>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>2.10. Steve has already stated the benefits very well. My worry here is
>still quality control: Spark 3.0 has already had tons of changes and major
>component version upgrades that are subject to all kinds of known and
>hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>next 1 or 2 Spark 3.x releases.
>
> Cheng
>
> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>
>> I get that CDH and HDP backport a lot and in that way left 2.7 behind,
>> but they kept the public APIs stable at the 2.7 level, because that's kind
>> of the point. Aren't those the Hadoop APIs Spark uses?
>>
>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>  wrote:
>>
>>>
>>>
>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>  wrote:
>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>
>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>> make it Hadoop 2.8 or something newer?
>>>>
>>>
>>> go for 2.9
>>>
>>>>
>>>> Koert Kuipers  wrote:
>>>>
>>>>> Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2
>>>>> profile to the latest version would probably be an issue for us.
>>>>
>>>>
>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>
>>>
>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>> large proportion of the later branch-2 patches are backported. 2.7 was left
>>> behind a long time ago.
>>>
>>>
>>>
>>>
>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile covers
too many major component upgrades, i.e.:

   - Hadoop 3.2
   - Hive 2.3
   - JDK 11

We have already found and fixed some feature and performance regressions
related to these upgrades. Empirically, I’m not surprised at all if more
regressions are lurking somewhere. On the other hand, we do want help from
the community to help us to evaluate and stabilize these new changes.
Following that, I’d like to propose:

   1.

   Introduce a new profile hive-2.3 to enable (hopefully) less risky
   Hadoop/Hive/JDK version combinations.

   This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
   profile, so that users may try out some less risky Hadoop/Hive/JDK
   combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
   face potential regressions introduced by the Hadoop 3.2 upgrade.

   Yuming Wang has already sent out PR #26533
    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7
   + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
   profile yet), and the result looks promising: the Kafka streaming and Arrow
   related test failures should be irrelevant to the topic discussed here.

   After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
   of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
   version. For users who are still using Hadoop 2.x in production, they will
   have to use a hadoop-provided prebuilt package or build Spark 3.0
   against their own 2.x version anyway. It does make a difference for cloud
   users who don’t use Hadoop at all, though. And this probably also helps to
   stabilize the Hadoop 3.2 code path faster since our PR builder will
   exercise it regularly.
   2.

   Defer Hadoop 2.x upgrade to Spark 3.1+

   I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
   Steve has already stated the benefits very well. My worry here is still
   quality control: Spark 3.0 has already had tons of changes and major
   component version upgrades that are subject to all kinds of known and
   hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
   it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
   to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
   next 1 or 2 Spark 3.x releases.

Cheng

On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:

> I get that CDH and HDP backport a lot and in that way left 2.7 behind, but
> they kept the public APIs stable at the 2.7 level, because that's kind of
> the point. Aren't those the Hadoop APIs Spark uses?
>
> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran 
> wrote:
>
>>
>>
>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>  wrote:
>>>
 It would be really good if the spark distributions shipped with later
 versions of the hadoop artifacts.

>>>
>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>> make it Hadoop 2.8 or something newer?
>>>
>>
>> go for 2.9
>>
>>>
>>> Koert Kuipers  wrote:
>>>
 Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2
 profile to the latest version would probably be an issue for us.
>>>
>>>
>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>
>>
>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>> large proportion of the later branch-2 patches are backported. 2.7 was left
>> behind a long time ago.
>>
>>
>>
>>
>


[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-10-30 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-29667:
---
Environment: (was: spark-2.4.3-bin-dbr-5.5-snapshot-9833d0f)

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL:
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))]
> The SQL AND clause:
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the SQL ran just
> fine. Can the SQL engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-10-30 Thread Cheng Lian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963305#comment-16963305
 ] 

Cheng Lian commented on SPARK-29667:


Reproduced this with the following snippet:
{code}
spark.range(10).select($"id" cast DecimalType(18, 0)).createOrReplaceTempView("t1")
spark.range(10).select($"id" cast DecimalType(28, 0)).createOrReplaceTempView("t2")
sql("SELECT * FROM t1 WHERE t1.id IN (SELECT id FROM t2)").explain(true)
{code}
Exception:
{noformat}
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(t1.`id`:decimal(18,0), t2.`id`:decimal(28,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(28,0)].; line 1 pos 29;
'Project [*]
+- 'Filter id#16 IN (list#22 [])
   :  +- Project [id#20]
   : +- SubqueryAlias `t2`
   :+- Project [cast(id#18L as decimal(28,0)) AS id#20]
   :   +- Range (0, 10, step=1, splits=Some(8))
   +- SubqueryAlias `t1`
  +- Project [cast(id#14L as decimal(18,0)) AS id#16]
 +- Range (0, 10, step=1, splits=Some(8))
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:123)
...
{noformat}
It seems that Postgres does support this kind of implicit casting:
{noformat}
postgres=# SELECT CAST(1 AS BIGINT) IN (CAST(1 AS INT));

 ?column?
--
 t
(1 row)
{noformat}
I believe the problem in Spark is that 
{{o.a.s.s.c.expressions.In#checkInputDataTypes()}} is too strict.
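
In the meantime, the explicit cast mentioned in the issue description works as a workaround. A minimal sketch, reusing the temp views from the snippet above; widening the left-hand side to the wider decimal type is an assumption about what the query can tolerate:
{code}
import org.apache.spark.sql.types.DecimalType

// Same setup as above (run in spark-shell, where $ and spark are available).
spark.range(10).select($"id" cast DecimalType(18, 0)).createOrReplaceTempView("t1")
spark.range(10).select($"id" cast DecimalType(28, 0)).createOrReplaceTempView("t2")

// Widening the left-hand side to decimal(28,0) gives both sides of the IN
// comparison the same type, so the analyzer accepts the query.
sql("SELECT * FROM t1 WHERE CAST(t1.id AS DECIMAL(28,0)) IN (SELECT id FROM t2)").explain(true)
{code}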

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: spark-2.4.3-bin-dbr-5.5-snapshot-9833d0f
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL:
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))]
> The SQL AND clause:
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the SQL ran just
> fine. Can the SQL engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Reporter: Cheng Lian  (was: liancheng)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Description: 
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:
{code:java}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

  was:
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

This issue was reported by [~liancheng]
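
For illustration, a minimal sketch of why this happens. This is a simplified stand-in for the real accumulator, not Spark's actual EventTimeStats implementation; the only assumption is that the average is maintained as a count-weighted combination of the two sides:
{code}
// Simplified stand-in: avg is kept as a count-weighted average.
case class SimpleStats(max: Long, min: Long, avg: Double, count: Long) {
  def merge(that: SimpleStats): SimpleStats = SimpleStats(
    math.max(max, that.max),
    math.min(min, that.min),
    // With count == that.count == 0 this is 0.0 / 0.0 == NaN, and NaN then
    // propagates through every subsequent merge.
    (avg * count + that.avg * that.count) / (count + that.count),
    count + that.count)
}

val zero = SimpleStats(Long.MinValue, Long.MaxValue, 0.0, 0L)
val merged = zero.merge(zero)
merged.avg        // NaN
merged.avg.toLong // 0, which renders as 1970-01-01T00:00:00.000Z in the report
{code}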


> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> ----
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code:java}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources

2019-06-11 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-27369:
--

Assignee: wuyi

> Standalone worker can load resource conf and discover resources
> ---
>
> Key: SPARK-27369
> URL: https://issues.apache.org/jira/browse/SPARK-27369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: wuyi
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27611) Redundant javax.activation dependencies in the Maven build

2019-05-01 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-27611:
--

Assignee: Cheng Lian

> Redundant javax.activation dependencies in the Maven build
> --
>
> Key: SPARK-27611
> URL: https://issues.apache.org/jira/browse/SPARK-27611
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> [PR #23890|https://github.com/apache/spark/pull/23890] introduced 
> {{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an 
> unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} 
> was also pulled in as a transitive dependency. As a result, for the Maven 
> build, both of the following two jars can be found under 
> {{assembly/target/scala-2.12/jars}}:
> {noformat}
> activation-1.1.1.jar
> jakarta.activation-api-1.2.1.jar
> {noformat}
> Discussed this with [~srowen] offline and we agreed that we should probably 
> exclude the Jakarta one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27611) Redundant javax.activation dependencies in the Maven build

2019-04-30 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-27611:
--

 Summary: Redundant javax.activation dependencies in the Maven build
 Key: SPARK-27611
 URL: https://issues.apache.org/jira/browse/SPARK-27611
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.0.0
Reporter: Cheng Lian


[PR #23890|https://github.com/apache/spark/pull/23890] introduced 
{{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an 
unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} was 
also pulled in as a transitive dependency. As a result, for the Maven build, 
both of the following two jars can be found under 
{{assembly/target/scala-2.12/jars}}:
{noformat}
activation-1.1.1.jar
jakarta.activation-api-1.2.1.jar
{noformat}
Discussed this with [~srowen] offline and we agreed that we should probably 
exclude the Jakarta one.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678595#comment-16678595
 ] 

Cheng Lian commented on SPARK-25966:


[~andrioni], just realized that I might have misunderstood this part of your
statement:
{quote}
This job used to work fine with Spark 2.2.1
[...]
{quote}
I thought you could read the same problematic files using Spark 2.2.1. Now I 
guess you probably only meant that the same job worked fine with Spark 2.2.1 
previously (with different sets of historical files).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>

[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:

[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, and those extra 
columns/row groups happened to contain some corrupted data, which would also 
indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian commented on SPARK-25966:


Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).
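
For reference, a minimal spark-shell sketch of the fallback suggested above. Only the config name comes from this comment; the dataset path is a placeholder for the affected files:
{code}
// Disable the vectorized Parquet reader for this session so reads go through
// the vanilla parquet-mr code path, then re-scan the suspect files.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("s3a://your-bucket/path/to/problematic/files")  // placeholder path
df.count()  // forces a full scan; success here would point at the vectorized reader
{code}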

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>  

[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-24927:
--

Assignee: Cheng Lian

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>

[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24927:
---
Description: 
Reproduction:
{noformat}
wget 
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
wget 
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xzf spark-2.3.1-bin-without-hadoop.tgz
tar xzf hadoop-2.7.3.tar.gz

export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
...
scala> 
spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
{noformat}
Exception:
{noformat}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
  ... 69 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
  at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
  at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
  at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
  at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
  at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
  at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sq

[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16557603#comment-16557603
 ] 

Cheng Lian commented on SPARK-24927:


Downgraded from blocker to major, since it's not a regression. Just realized
that this issue has existed since at least 1.6.
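
As a side note, a small diagnostic sketch for anyone hitting this with a hadoop-provided build: it prints which snappy-java jar the JVM actually loaded. An incompatible copy coming in via SPARK_DIST_CLASSPATH is a plausible cause of the UnsatisfiedLinkError above, but that is an assumption for troubleshooting, not a conclusion from this ticket:
{code}
// Run in spark-shell: show where the loaded org.xerial.snappy.Snappy class
// comes from, to spot a conflicting snappy-java jar on the classpath.
val source = Option(classOf[org.xerial.snappy.Snappy].getProtectionDomain.getCodeSource)
println(source.map(_.getLocation.toString).getOrElse("unknown code source"))
{code}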

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageW

[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24927:
---
Priority: Major  (was: Blocker)

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   

[jira] [Created] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-24927:
--

 Summary: The hadoop-provided profile doesn't play well with 
Snappy-compressed Parquet files
 Key: SPARK-24927
 URL: https://issues.apache.org/jira/browse/SPARK-24927
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.1, 2.3.2
Reporter: Cheng Lian


Reproduction:
{noformat}
wget 
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
wget 
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xzf spark-2.3.1-bin-without-hadoop.tgz
tar xzf hadoop-2.7.3.tar.gz

export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
...
scala> 
spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
{noformat}
Exception:
{noformat}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
  ... 69 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
  at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
  at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
  at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
  at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
  at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
  at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(Fil
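
If the root cause is indeed the older snappy-java pulled in from the Hadoop 2.7.3 
classpath, one possible mitigation (a sketch only, not a fix for the packaging 
problem itself) is to write Parquet with a codec that does not need the Snappy 
native bindings:
{code}
// Sketch of a mitigation in spark-shell: avoid the Snappy native code path by switching
// the Parquet codec. This assumes the UnsatisfiedLinkError comes from the snappy-java
// version on the hadoop-provided classpath; it does not fix the underlying packaging issue.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

spark.range(1)
  .repartition(1)
  .write
  .mode("overwrite")
  .parquet("file:///tmp/test.parquet")
{code}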

[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-24895:
--

Assignee: Eric Chang

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.
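
One way to spot such a mismatch quickly is to pull the snapshot metadata and compare 
the {{value}} entries, e.g. with a small Ammonite/Scala sketch like the one below 
(the URL is only illustrative, and the {{scala-xml}} module is assumed to be 
available on the classpath):
{code}
// Sketch: fetch a module's snapshot maven-metadata.xml and flag disagreeing snapshot values.
// Assumes the scala-xml module is available; the URL is only an example.
import scala.io.Source
import scala.xml.XML

val url = "https://repository.apache.org/snapshots/org/apache/spark/" +
  "spark-mllib-local_2.11/2.4.0-SNAPSHOT/maven-metadata.xml"

val metadata = XML.loadString(Source.fromURL(url).mkString)
val values = (metadata \\ "snapshotVersion" \ "value").map(_.text).distinct

if (values.size > 1) {
  println(s"Mismatched snapshot values: ${values.mkString(", ")}")
}
{code}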



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24895:
---
Description: 
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:
{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{noformat}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.

  was:
Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
repo has mismatched filenames:

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{code}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.


> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched 

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Cheng Lian

+1 (binding)

Passed all the tests, looks good.

Cheng


On 2/23/18 15:00, Holden Karau wrote:

+1 (binding)
PySpark artifacts install in a fresh Py3 virtual env

On Feb 23, 2018 7:55 AM, "Denny Lee" wrote:


+1 (non-binding)

On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough wrote:

New to testing out Spark RCs for the community but I was able
to run some of the basic unit tests without error so for what
it's worth, I'm a +1.

On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal wrote:

Please vote on releasing the following candidate as Apache
Spark version 2.3.0. The vote is open until Tuesday
February 27, 2018 at 8:00:00 am UTC and passes if a
majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see
https://spark.apache.org/

The tag to be voted on is v2.3.0-rc5:
https://github.com/apache/spark/tree/v2.3.0-rc5

(992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

List of JIRA tickets resolved in this release can be found
here:
https://issues.apache.org/jira/projects/SPARK/versions/12339551


The release files, including signatures, digests, etc. can
be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/


Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS


The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1266/



The documentation corresponding to this release can be
found at:

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html




FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of
writing, there are currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release
by taking an existing Spark workload and running on this
release candidate, then reporting any regressions.

If you're working in PySpark you can set up a virtual env
and install the current RC and see if anything important
breaks; in Java/Scala you can add the staging
repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going
forward).
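
As an illustration of the Java/Scala path above, a minimal build.sbt sketch could 
point at the staging repository like this (the module and coordinates are only 
examples; use whatever your project actually depends on):
{code}
// build.sbt sketch: resolve the RC from the staging repository listed above.
// "spark-sql" is just an example module.
resolvers += "Apache Spark 2.3.0 RC5 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1266/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
{code}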

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely
important bug fixes, documentation, and API tweaks that
impact compatibility should be worked on immediately.
Everything else please retarget to 2.3.1 or 2.4.0 as
appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not
hold the release unless the bug in question is a
regression from 2.2.0. That being said, if there is
something which is a regression from 2.2.0 and has not
been correctly targeted please ping me or a committer to
help target the issue (you can see the open issues listed
as impacting Spark 2.3.0 at https://s.apache.org/WmoI).






[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372289#comment-16372289
 ] 

Cheng Lian commented on SPARK-19737:


[~LANDAIS Christophe], I filed SPARK-23486 for this. It should be relatively 
straightforward to fix, and I'd like a new contributor to try it as a 
starter task. Thanks for reporting!

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.
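
As a rough illustration of the rule described above, here is a simplified, 
self-contained sketch. This is not Spark's actual analyzer code; the case classes 
and the hard-coded registry are stand-ins used only to show the idea of checking 
function names early, without resolving any relations:
{code}
// Toy sketch of the LookupFunctions idea: collect unresolved function names, check them
// against a registry, and fail fast before any relation resolution would happen.
sealed trait Expr
case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr
case class Column(name: String) extends Expr

object LookupFunctions {
  // Stand-in for the real function registry.
  val registry: Set[String] = Set("abs", "upper", "substring")

  private def collectFunctionNames(e: Expr): Seq[String] = e match {
    case UnresolvedFunction(name, args) => name +: args.flatMap(collectFunctionNames)
    case _                              => Seq.empty
  }

  // Reports unregistered functions without touching relation resolution,
  // so no partition discovery is ever triggered.
  def check(projectList: Seq[Expr]): Unit = {
    val unknown = projectList.flatMap(collectFunctionNames).filterNot(registry.contains)
    if (unknown.nonEmpty) {
      throw new IllegalArgumentException(s"Undefined function(s): ${unknown.mkString(", ")}")
    }
  }
}

// For SELECT foo(a) FROM t, the check fails before `t` would ever be resolved:
// LookupFunctions.check(Seq(UnresolvedFunction("foo", Seq(Column("a")))))
{code}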



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-23486:
---
Labels: starter  (was: )

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>    Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372285#comment-16372285
 ] 

Cheng Lian commented on SPARK-23486:


Please refer to [this 
comment|https://issues.apache.org/jira/browse/SPARK-19737?focusedCommentId=16371377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16371377]
 for more details.

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>    Reporter: Cheng Lian
>Priority: Major
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-23486:
--

 Summary: LookupFunctions should not check the same function name 
more than once
 Key: SPARK-23486
 URL: https://issues.apache.org/jira/browse/SPARK-23486
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.3.0
Reporter: Cheng Lian


For a query invoking the same function multiple times, the current 
{{LookupFunctions}} rule performs a check for each invocation. For users using 
Hive metastore as external catalog, this issues unnecessary metastore accesses 
and can slow down the analysis phase quite a bit.
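
A minimal sketch of the intended behavior, under the assumption that the fix simply 
deduplicates the names before consulting the catalog (the helper below is 
hypothetical, not the actual Spark change):
{code}
// Sketch only: `existsInCatalog` stands in for the (expensive) external catalog lookup.
// Deduplicating first means each distinct name is checked exactly once per query.
def checkFunctionsOnce(invocations: Seq[String])(existsInCatalog: String => Boolean): Unit = {
  val missing = invocations.distinct.filterNot(existsInCatalog)
  if (missing.nonEmpty) {
    throw new IllegalArgumentException(s"Undefined function(s): ${missing.mkString(", ")}")
  }
}

// "foo" is invoked three times but looked up only once:
// checkFunctionsOnce(Seq("foo", "foo", "foo", "upper"))(Set("upper").contains)
{code}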



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-22951.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20174
[https://github.com/apache/spark/pull/20174]

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
> Fix For: 2.3.0
>
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-22951:
--

Assignee: Feng Liu

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Target Version/s: 2.3.0

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Labels: correctness  (was: )

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HADOOP-15086) NativeAzureFileSystem.rename is not atomic

2017-12-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275422#comment-16275422
 ] 

Cheng Lian commented on HADOOP-15086:
-

To be more specific, when multiple threads rename files to the same target 
path, more than one *but not all* of the threads can succeed. This is because the 
check-and-copy sequence in {{NativeAzureFileSystem#rename()}} is not atomic.

The problem here is that the expected semantics of 
{{NativeAzureFileSystem#rename()}} are unclear:

- If the semantics are "error if the destination file already exists", then only 
one thread can succeed.
- If the semantics are "overwrite if the destination file already exists", then 
all threads should succeed.
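
A rough sketch of that race, using only the standard {{FileSystem.rename(Path, Path)}} 
API (the paths and the file system setup are placeholders): on an atomic 
implementation at most one of the renames should return {{true}}, while the behavior 
described above lets more than one succeed.
{code}
// Sketch of the check-then-copy race: several threads rename distinct sources to the
// same destination and we count how many renames report success.
import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

import org.apache.hadoop.fs.{FileSystem, Path}

def countSuccessfulRenames(fs: FileSystem, sources: Seq[Path], dest: Path): Int = {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(sources.size))
  val attempts = sources.map(src => Future(fs.rename(src, dest)))
  // A result greater than 1 exhibits the non-atomic behavior described above.
  Await.result(Future.sequence(attempts), 2.minutes).count(identity)
}
{code}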

> NativeAzureFileSystem.rename is not atomic
> --
>
> Key: HADOOP-15086
> URL: https://issues.apache.org/jira/browse/HADOOP-15086
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/azure
>Affects Versions: 2.7.3
>Reporter: Shixiong Zhu
> Attachments: RenameReproducer.java
>
>
> When multiple threads rename files to the same target path, more than one 
> thread can succeed. This is because the check and copy in `rename` is not 
> atomic.
> I would expect it to be atomic, just like HDFS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Cheng Lian

+1


On 10/12/17 20:10, Liwei Lin wrote:

+1 !

Cheers,
Liwei

On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan wrote:


+1

Regards,
Vaquar khan

On Oct 11, 2017 10:14 PM, "Weichen Xu" wrote:

+1

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li wrote:

+1

Xiao

On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin wrote:

+1

One thing with MetadataSupport - It's a bad idea to
call it that unless adding new functions in that trait
wouldn't break source/binary compatibility in the future.


On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan wrote:

I'm adding my own +1 (binding).

On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan wrote:

I'm going to update the proposal: for the last
point, although the user-facing API
(`df.write.format(...).option(...).mode(...).save()`)
mixes data and metadata operations, we are
still able to separate them in the data source
write API. We can have a mix-in trait
`MetadataSupport` which has a method
`create(options)`, so that data sources can
mix in this trait and provide metadata
creation support. Spark will call this
`create` method inside `DataFrameWriter.save`
if the specified data source has it.

Note that file format data sources can ignore
this new trait and still write data without
metadata (they don't have metadata anyway).
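
Sketched out (using only the names from this thread; this is the proposal as 
described, not a released Spark API), the mix-in could look roughly like:
{code}
// Rough sketch of the proposed mix-in; names come from this thread, not from a
// released Spark API.
trait MetadataSupport {
  // Called by Spark inside DataFrameWriter.save() before any data is written,
  // so the data source can create its metadata from the user-supplied options.
  def create(options: Map[String, String]): Unit
}

// A source that needs metadata creation mixes the trait in; file-format sources
// can simply skip it and keep writing data only.
class ExampleCatalogBackedSource extends MetadataSupport {
  override def create(options: Map[String, String]): Unit = {
    // e.g. create the target table in an external catalog based on `options`
  }
}
{code}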

With this updated proposal, I'm calling a new
vote for the data source v2 write path.

The vote will be up for the next 72 hours.
Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because
of the following technical reasons.

Thanks!

On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan wrote:

Hi all,

After we merge the infrastructure of data
source v2 read path, and have some
discussion for the write path, now I'm
sending this email to call a vote for Data
Source v2 write path.

The full document of the Data Source API
V2 is:

https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit



The ready-for-review PR that implements
the basic infrastructure for the write path:
https://github.com/apache/spark/pull/19269



The Data Source V1 write path asks
implementations to write a DataFrame
directly, which is painful:
1. Exposing upper-level API like DataFrame
to Data Source API is not good for
maintenance.
2. Data sources may need to preprocess the
input data before writing, like
cluster/sort the input by some columns.
It's better to do the preprocessing in
Spark instead of in the data source.
3. Data sources need to take care of
transactions themselves, which is hard. And
different data sources may come up with very
similar approaches to transactions, which
leads to a lot of duplicated code.

To solve these pain points, 

[jira] [Assigned] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned PARQUET-1102:
---

Assignee: Cheng Lian

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Cheng Lian
>    Assignee: Cheng Lian
>Priority: Blocker
> Fix For: format-2.3.2
>
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PARQUET-1091) Wrong and broken links in README

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-1091.
-
   Resolution: Fixed
Fix Version/s: format-2.3.2

Issue resolved by pull request 65
[https://github.com/apache/parquet-format/pull/65]

> Wrong and broken links in README
> 
>
> Key: PARQUET-1091
> URL: https://issues.apache.org/jira/browse/PARQUET-1091
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Cheng Lian
>    Assignee: Cheng Lian
>Priority: Minor
> Fix For: format-2.3.2
>
>
> Multiple links in README.md still point to the old {{Parquet/parquet-format}} 
> repository, which is now removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-1102.
-
   Resolution: Fixed
Fix Version/s: format-2.3.2

Issue resolved by pull request 66
[https://github.com/apache/parquet-format/pull/66]

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Cheng Lian
>Priority: Blocker
> Fix For: format-2.3.2
>
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-1102:

Priority: Blocker  (was: Major)

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Cheng Lian
>Priority: Blocker
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-09-12 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-1102:
---

 Summary: Travis CI builds are failing for parquet-format PRs
 Key: PARQUET-1102
 URL: https://issues.apache.org/jira/browse/PARQUET-1102
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Cheng Lian


Travis CI builds are failing for parquet-format PRs, probably due to the 
migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
official blog 
post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1091) Wrong and broken links in README

2017-09-07 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-1091:
---

 Summary: Wrong and broken links in README
 Key: PARQUET-1091
 URL: https://issues.apache.org/jira/browse/PARQUET-1091
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


Multiple links in README.md still point to the old {{Parquet/parquet-format}} 
repository, which is now removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-08-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated HADOOP-14700:

Description: 
{{NativeAzureFileSystem}} instances are associated with the blob container used 
to initialize the file system. Assuming that a file system instance {{fs}} is 
associated with a container {{A}}, when trying to access a blob inside another 
container {{B}}, {{fs}} still tries to find the blob inside container {{A}}. If 
there happens to be two blobs with the same name inside both containers, the 
user may get a wrong result because {{fs}} reads the contents from the blob 
inside container {{A}} instead of container {{B}}.

You may reproduce it by running the following self-contained Scala script using 
[Ammonite|http://ammonite.io/]:
{code}
#!/usr/bin/env amm --no-remote-logging

import $ivy.`com.jsuereth::scala-arm:2.0`
import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
import $ivy.`org.scalatest::scalatest:3.0.3`

import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.time.{Duration, Instant}
import java.util.{Date, EnumSet}

import com.microsoft.azure.storage.{CloudStorageAccount, 
StorageCredentialsAccountAndKey}
import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
SharedAccessBlobPolicy}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
import org.scalatest.Assertions._
import resource._

// Utility implicit conversion for auto resource management.
implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
Resource[T] {
  override def close(closable: T): Unit = closable.close()
}

// Credentials information
val ACCOUNT = "** REDACTED **"
val ACCESS_KEY = "** REDACTED **"

// We'll create two different containers, both contain a blob named "test-blob" 
but with different
// contents.
val CONTAINER_A = "container-a"
val CONTAINER_B = "container-b"
val TEST_BLOB = "test-blob"

val blobClient = {
  val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
  val account = new CloudStorageAccount(credentials, /* useHttps */ true)
  account.createCloudBlobClient()
}

// Generates a read-only SAS key restricted within "container-a".
val sasKeyForContainerA = {
  val since = Instant.now() minus Duration.ofMinutes(10)
  val duration = Duration.ofHours(1)
  val policy = new SharedAccessBlobPolicy()

  policy.setSharedAccessStartTime(Date.from(since))
  policy.setSharedAccessExpiryTime(Date.from(since plus duration))
  policy.setPermissions(EnumSet.of(
SharedAccessBlobPermissions.READ,
SharedAccessBlobPermissions.LIST
  ))

  blobClient
.getContainerReference(CONTAINER_A)
.generateSharedAccessSignature(policy, null)
}

// Sets up testing containers and blobs using the Azure storage SDK:
//
//   container-a/test-blob => "foo"
//   container-b/test-blob => "bar"
{
  val containerARef = blobClient.getContainerReference(CONTAINER_A)
  val containerBRef = blobClient.getContainerReference(CONTAINER_B)

  containerARef.createIfNotExists()
  containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")

  containerBRef.createIfNotExists()
  containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
}

val pathA = new 
Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
val pathB = new 
Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")

for {
  // Creates a file system associated with "container-a".
  fs <- managed {
val conf = new Configuration
conf.set("fs.wasbs.impl", classOf[NativeAzureFileSystem].getName)
conf.set(s"fs.azure.sas.$CONTAINER_A.$ACCOUNT.blob.core.windows.net", 
sasKeyForContainerA)
pathA.getFileSystem(conf)
  }

  // Opens a reader pointing to "container-a/test-blob". We expect to get the 
string "foo" written
  // to this blob previously.
  readerA <- managed(new BufferedReader(new InputStreamReader(fs open pathA)))

  // Opens a reader pointing to "container-b/test-blob". We expect to get an 
exception since the SAS
  // key used to create the `FileSystem` instance is restricted to 
"container-a".
  readerB <- managed(new BufferedReader(new InputStreamReader(fs open pathB)))
} {
  // Should get "foo"
  assert(readerA.readLine() == "foo")

  // Should catch an exception ...
  intercept[AzureException] {
// ... but instead, we get string "foo" here, which indicates that the 
readerB was reading from
// "container-a" instead of "conta

[jira] [Updated] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-08-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated HADOOP-14700:

Description: 
{{NativeAzureFileSystem}} instances are associated with the blob container used 
to initialize the file system. Assuming that a file system instance {{fs}} is 
associated with a container {{A}}, when trying to access a blob inside another 
container {{B}}, {{fs}} still tries to find the blob inside container {{A}}. If 
there happens to be two blobs with the same name inside both containers, the 
user may get a wrong result because {{fs}} reads the contents from the blob 
inside container {{A}} instead of container {{B}}.

The following self-contained Scala code snippet illustrates this issue. You may 
reproduce it by running the following Scala script using 
[Ammonite|http://ammonite.io/].
{code}
#!/usr/bin/env amm

import $ivy.`com.jsuereth::scala-arm:2.0`
import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
import $ivy.`org.scalatest::scalatest:3.0.3`

import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.time.{Duration, Instant}
import java.util.{Date, EnumSet}

import com.microsoft.azure.storage.{CloudStorageAccount, 
StorageCredentialsAccountAndKey}
import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
SharedAccessBlobPolicy}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
import org.scalatest.Assertions._
import resource._

// Utility implicit conversion for auto resource management.
implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
Resource[T] {
  override def close(closable: T): Unit = closable.close()
}

// Credentials information
val ACCOUNT = "** REDACTED **"
val ACCESS_KEY = "** REDACTED **"

// We'll create two different containers, both contain a blob named "test-blob" 
but with different
// contents.
val CONTAINER_A = "container-a"
val CONTAINER_B = "container-b"
val TEST_BLOB = "test-blob"

val blobClient = {
  val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
  val account = new CloudStorageAccount(credentials, /* useHttps */ true)
  account.createCloudBlobClient()
}

// Generates a read-only SAS key restricted within "container-a".
val sasKeyForContainerA = {
  val since = Instant.now() minus Duration.ofMinutes(10)
  val duration = Duration.ofHours(1)
  val policy = new SharedAccessBlobPolicy()

  policy.setSharedAccessStartTime(Date.from(since))
  policy.setSharedAccessExpiryTime(Date.from(since plus duration))
  policy.setPermissions(EnumSet.of(
SharedAccessBlobPermissions.READ,
SharedAccessBlobPermissions.LIST
  ))

  blobClient
.getContainerReference(CONTAINER_A)
.generateSharedAccessSignature(policy, null)
}

// Sets up testing containers and blobs using the Azure storage SDK:
//
//   container-a/test-blob => "foo"
//   container-b/test-blob => "bar"
{
  val containerARef = blobClient.getContainerReference(CONTAINER_A)
  val containerBRef = blobClient.getContainerReference(CONTAINER_B)

  containerARef.createIfNotExists()
  containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")

  containerBRef.createIfNotExists()
  containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
}

val pathA = new 
Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
val pathB = new 
Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")

for {
  // Creates a file system associated with "container-a".
  fs <- managed {
val conf = new Configuration
conf.set("fs.wasbs.impl", classOf[NativeAzureFileSystem].getName)
conf.set(s"fs.azure.sas.$CONTAINER_A.$ACCOUNT.blob.core.windows.net", 
sasKeyForContainerA)
pathA.getFileSystem(conf)
  }

  // Opens a reader pointing to "container-a/test-blob". We expect to get the 
string "foo" written
  // to this blob previously.
  readerA <- managed(new BufferedReader(new InputStreamReader(fs open pathA)))

  // Opens a reader pointing to "container-b/test-blob". We expect to get an 
exception since the SAS
  // key used to create the `FileSystem` instance is restricted to 
"container-a".
  readerB <- managed(new BufferedReader(new InputStreamReader(fs open pathB)))
} {
  // Should get "foo"
  assert(readerA.readLine() == "foo")

  // Should catch an exception ...
  intercept[AzureException] {
// ... but instead, we get string "foo" here, which indicates that the 
readerB was reading from
// &

[jira] [Commented] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-08-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111645#comment-16111645
 ] 

Cheng Lian commented on HADOOP-14700:
-

Oops... Thanks for pointing out the typo, [~ste...@apache.org]! This issue 
still remains after fixing the path, though.

> NativeAzureFileSystem.open() ignores blob container name
> 
>
> Key: HADOOP-14700
> URL: https://issues.apache.org/jira/browse/HADOOP-14700
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs
>Affects Versions: 3.0.0-beta1, 3.0.0-alpha4
>    Reporter: Cheng Lian
>
> {{NativeAzureFileSystem}} instances are associated with the blob container 
> used to initialize the file system. Assuming that a file system instance 
> {{fs}} is associated with a container {{A}}, when trying to access a blob 
> inside another container {{B}}, {{fs}} still tries to find the blob inside 
> container {{A}}. If there happens to be two blobs with the same name inside 
> both containers, the user may get a wrong result because {{fs}} reads the 
> contents from the blob inside container {{A}} instead of container {{B}}.
> The following self-contained Scala code snippet illustrates this issue. You 
> may reproduce it by running the script inside the [Ammonite 
> REPL|http://ammonite.io/].
> {code}
> #!/usr/bin/env amm
> import $ivy.`com.jsuereth::scala-arm:2.0`
> import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
> import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
> import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
> import $ivy.`org.scalatest::scalatest:3.0.3`
> import java.io.{BufferedReader, InputStreamReader}
> import java.net.URI
> import java.time.{Duration, Instant}
> import java.util.{Date, EnumSet}
> import com.microsoft.azure.storage.{CloudStorageAccount, 
> StorageCredentialsAccountAndKey}
> import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
> SharedAccessBlobPolicy}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
> import org.scalatest.Assertions._
> import resource._
> // Utility implicit conversion for auto resource management.
> implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
> Resource[T] {
>   override def close(closable: T): Unit = closable.close()
> }
> // Credentials information
> val ACCOUNT = "** REDACTED **"
> val ACCESS_KEY = "** REDACTED **"
> // We'll create two different containers, both contain a blob named 
> "test-blob" but with different
> // contents.
> val CONTAINER_A = "container-a"
> val CONTAINER_B = "container-b"
> val TEST_BLOB = "test-blob"
> val blobClient = {
>   val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
>   val account = new CloudStorageAccount(credentials, /* useHttps */ true)
>   account.createCloudBlobClient()
> }
> // Generates a read-only SAS key restricted within "container-a".
> val sasKeyForContainerA = {
>   val since = Instant.now() minus Duration.ofMinutes(10)
>   val duration = Duration.ofHours(1)
>   val policy = new SharedAccessBlobPolicy()
>   policy.setSharedAccessStartTime(Date.from(since))
>   policy.setSharedAccessExpiryTime(Date.from(since plus duration))
>   policy.setPermissions(EnumSet.of(
> SharedAccessBlobPermissions.READ,
> SharedAccessBlobPermissions.LIST
>   ))
>   blobClient
> .getContainerReference(CONTAINER_A)
> .generateSharedAccessSignature(policy, null)
> }
> // Sets up testing containers and blobs using the Azure storage SDK:
> //
> //   container-a/test-blob => "foo"
> //   container-b/test-blob => "bar"
> {
>   val containerARef = blobClient.getContainerReference(CONTAINER_A)
>   val containerBRef = blobClient.getContainerReference(CONTAINER_B)
>   containerARef.createIfNotExists()
>   containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")
>   containerBRef.createIfNotExists()
>   containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
> }
> val pathA = new 
> Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
> val pathB = new 
> Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
> for {
>   // Creates a file system associated with "container-a".
>   fs <- managed {
> val conf = new Configuration
> conf.set("fs.wasb

[jira] [Updated] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-08-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated HADOOP-14700:

Description: 
{{NativeAzureFileSystem}} instances are associated with the blob container used 
to initialize the file system. Assuming that a file system instance {{fs}} is 
associated with a container {{A}}, when trying to access a blob inside another 
container {{B}}, {{fs}} still tries to find the blob inside container {{A}}. If 
there happens to be two blobs with the same name inside both containers, the 
user may get a wrong result because {{fs}} reads the contents from the blob 
inside container {{A}} instead of container {{B}}.

The following self-contained Scala code snippet illustrates this issue. You may 
reproduce it by running the script inside the [Ammonite 
REPL|http://ammonite.io/].
{code}
#!/usr/bin/env amm

import $ivy.`com.jsuereth::scala-arm:2.0`
import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
import $ivy.`org.scalatest::scalatest:3.0.3`

import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.time.{Duration, Instant}
import java.util.{Date, EnumSet}

import com.microsoft.azure.storage.{CloudStorageAccount, 
StorageCredentialsAccountAndKey}
import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
SharedAccessBlobPolicy}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
import org.scalatest.Assertions._
import resource._

// Utility implicit conversion for auto resource management.
implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
Resource[T] {
  override def close(closable: T): Unit = closable.close()
}

// Credentials information
val ACCOUNT = "** REDACTED **"
val ACCESS_KEY = "** REDACTED **"

// We'll create two different containers, both contain a blob named "test-blob" 
but with different
// contents.
val CONTAINER_A = "container-a"
val CONTAINER_B = "container-b"
val TEST_BLOB = "test-blob"

val blobClient = {
  val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
  val account = new CloudStorageAccount(credentials, /* useHttps */ true)
  account.createCloudBlobClient()
}

// Generates a read-only SAS key restricted within "container-a".
val sasKeyForContainerA = {
  val since = Instant.now() minus Duration.ofMinutes(10)
  val duration = Duration.ofHours(1)
  val policy = new SharedAccessBlobPolicy()

  policy.setSharedAccessStartTime(Date.from(since))
  policy.setSharedAccessExpiryTime(Date.from(since plus duration))
  policy.setPermissions(EnumSet.of(
SharedAccessBlobPermissions.READ,
SharedAccessBlobPermissions.LIST
  ))

  blobClient
.getContainerReference(CONTAINER_A)
.generateSharedAccessSignature(policy, null)
}

// Sets up testing containers and blobs using the Azure storage SDK:
//
//   container-a/test-blob => "foo"
//   container-b/test-blob => "bar"
{
  val containerARef = blobClient.getContainerReference(CONTAINER_A)
  val containerBRef = blobClient.getContainerReference(CONTAINER_B)

  containerARef.createIfNotExists()
  containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")

  containerBRef.createIfNotExists()
  containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
}

val pathA = new 
Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
val pathB = new 
Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")

for {
  // Creates a file system associated with "container-a".
  fs <- managed {
val conf = new Configuration
conf.set("fs.wasbs.impl", classOf[NativeAzureFileSystem].getName)
conf.set(s"fs.azure.sas.$CONTAINER_A.$ACCOUNT.blob.core.windows.net", 
sasKeyForContainerA)
pathA.getFileSystem(conf)
  }

  // Opens a reader pointing to "container-a/test-blob". We expect to get the 
string "foo" written
  // to this blob previously.
  readerA <- managed(new BufferedReader(new InputStreamReader(fs open pathA)))

  // Opens a reader pointing to "container-b/test-blob". We expect to get an 
exception since the SAS
  // key used to create the `FileSystem` instance is restricted to 
"container-a".
  readerB <- managed(new BufferedReader(new InputStreamReader(fs open pathB)))
} {
  // Should get "foo"
  assert(readerA.readLine() == "foo")

  // Should catch an exception ...
  intercept[AzureException] {
// ... but instead, we get string "foo" here, which indicates that the 
readerB was reading from
// "cont

[jira] [Updated] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-07-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated HADOOP-14700:

Description: 
{{NativeAzureFileSystem}} instances are associated with the blob container used 
to initialize the file system. Assuming that a file system instance {{fs}} is 
associated with a container {{A}}, when trying to access a blob inside another 
container {{B}}, {{fs}} still tries to find the blob inside container {{A}}. If 
there happens to be two blobs with the same name inside both containers, the 
user may get a wrong result because {{fs}} reads the contents from the blob 
inside container {{A}} instead of container {{B}}.

The following self-contained Scala code snippet illustrates this issue. You may 
reproduce it by running the script inside the [Ammonite 
REPL|http://ammonite.io/].
{code}
import $ivy.`com.jsuereth::scala-arm:2.0`
import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
import $ivy.`org.scalatest::scalatest:3.0.3`

import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.time.{Duration, Instant}
import java.util.{Date, EnumSet}

import com.microsoft.azure.storage.{CloudStorageAccount, 
StorageCredentialsAccountAndKey}
import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
SharedAccessBlobPolicy}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
import org.scalatest.Assertions._
import resource._

// Utility implicit conversion for auto resource management.
implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
Resource[T] {
  override def close(closable: T): Unit = closable.close()
}

// Credentials information
val ACCOUNT = "** REDACTED **"
val ACCESS_KEY = "** REDACTED **"

// We'll create two different containers, both contain a blob named "test-blob" 
but with different
// contents.
val CONTAINER_A = "container-a"
val CONTAINER_B = "container-b"
val TEST_BLOB = "test-blob"

val blobClient = {
  val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
  val account = new CloudStorageAccount(credentials, /* useHttps */ true)
  account.createCloudBlobClient()
}

// Generates a read-only SAS key restricted within "container-a".
val sasKeyForContainerA = {
  val since = Instant.now() minus Duration.ofMinutes(10)
  val duration = Duration.ofHours(1)
  val policy = new SharedAccessBlobPolicy()

  policy.setSharedAccessStartTime(Date.from(since))
  policy.setSharedAccessExpiryTime(Date.from(since plus duration))
  policy.setPermissions(EnumSet.of(
SharedAccessBlobPermissions.READ,
SharedAccessBlobPermissions.LIST
  ))

  blobClient
.getContainerReference(CONTAINER_A)
.generateSharedAccessSignature(policy, null)
}

// Sets up testing containers and blobs using the Azure storage SDK:
//
//   container-a/test-blob => "foo"
//   container-b/test-blob => "bar"
{
  val containerARef = blobClient.getContainerReference(CONTAINER_A)
  val containerBRef = blobClient.getContainerReference(CONTAINER_B)

  containerARef.createIfNotExists()
  containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")

  containerBRef.createIfNotExists()
  containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
}

val pathA = new Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
val pathB = new Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")

for {
  // Creates a file system associated with "container-a".
  fs <- managed {
val conf = new Configuration
conf.set("fs.wasbs.impl", classOf[NativeAzureFileSystem].getName)
conf.set(s"fs.azure.sas.$CONTAINER_A.$ACCOUNT.blob.core.windows.net", 
sasKeyForContainerA)
pathA.getFileSystem(conf)
  }

  // Opens a reader pointing to "container-a/test-blob". We expect to get the 
string "foo" written
  // to this blob previously.
  readerA <- managed(new BufferedReader(new InputStreamReader(fs open pathA)))

  // Opens a reader pointing to "container-b/test-blob". We expect to get an 
exception since the SAS
  // key used to create the `FileSystem` instance is restricted to 
"container-a".
  readerB <- managed(new BufferedReader(new InputStreamReader(fs open pathB)))
} {
  // Should get "foo"
  assert(readerA.readLine() == "foo")

  // Should catch an exception ...
  intercept[AzureException] {
    // ... but instead, we get string "foo" here, which indicates that
    // readerB was reading from "container-a" instead of "container-b".
    readerB.readLine()
  }
}
{code}

[jira] [Created] (HADOOP-14700) NativeAzureFileSystem.open() ignores blob container name

2017-07-28 Thread Cheng Lian (JIRA)
Cheng Lian created HADOOP-14700:
---

 Summary: NativeAzureFileSystem.open() ignores blob container name
 Key: HADOOP-14700
 URL: https://issues.apache.org/jira/browse/HADOOP-14700
 Project: Hadoop Common
  Issue Type: Sub-task
  Components: fs
Affects Versions: 3.0.0-alpha4, 3.0.0-beta1
Reporter: Cheng Lian


{{NativeAzureFileSystem}} instances are associated with the blob container used 
to initialize the file system. Assuming that a file system instance {{fs}} is 
associated with a container {{A}}, when trying to access a blob inside another 
container {{B}}, {{fs}} still tries to find the blob inside container {{A}}. If 
there happens to be two blobs with the same name inside both containers, the 
user may get a wrong result because {{fs}} reads the contents from the blob 
inside container {{A}} instead of container {{B}}.

The following self-contained Scala code snippet illustrates this issue. You may 
reproduce it by running the script inside the [Ammonite 
REPL|http://ammonite.io/].
{code}
import $ivy.`com.jsuereth::scala-arm:2.0`
import $ivy.`com.microsoft.azure:azure-storage:5.2.0`
import $ivy.`org.apache.hadoop:hadoop-azure:3.0.0-alpha4`
import $ivy.`org.apache.hadoop:hadoop-common:3.0.0-alpha4`
import $ivy.`org.scalatest::scalatest:3.0.3`

import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import java.time.{Duration, Instant}
import java.util.{Date, EnumSet}

import com.microsoft.azure.storage.{CloudStorageAccount, 
StorageCredentialsAccountAndKey}
import com.microsoft.azure.storage.blob.{SharedAccessBlobPermissions, 
SharedAccessBlobPolicy}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.azure.{AzureException, NativeAzureFileSystem}
import org.scalatest.Assertions._
import resource._

// Utility implicit conversion for auto resource management.
implicit def `Closable->Resource`[T <: { def close() }]: Resource[T] = new 
Resource[T] {
  override def close(closable: T): Unit = closable.close()
}

// Credentials information
val ACCOUNT = "** REDACTED **"
val ACCESS_KEY = "** REDACTED **"

// We'll create two different containers, both contain a blob named "test-blob"
// but with different contents.
val CONTAINER_A = "container-a"
val CONTAINER_B = "container-b"
val TEST_BLOB = "test-blob"

val blobClient = {
  val credentials = new StorageCredentialsAccountAndKey(ACCOUNT, ACCESS_KEY)
  val account = new CloudStorageAccount(credentials, /* useHttps */ true)
  account.createCloudBlobClient()
}

// Generates a read-only SAS key restricted within "container-a".
val sasKeyForContainerA = {
  val since = Instant.now() minus Duration.ofMinutes(10)
  val duration = Duration.ofHours(1)
  val policy = new SharedAccessBlobPolicy()

  policy.setSharedAccessStartTime(Date.from(since))
  policy.setSharedAccessExpiryTime(Date.from(since plus duration))
  policy.setPermissions(EnumSet.of(
SharedAccessBlobPermissions.READ,
SharedAccessBlobPermissions.LIST
  ))

  blobClient
.getContainerReference(CONTAINER_A)
.generateSharedAccessSignature(policy, null)
}

// Sets up testing containers and blobs using the Azure storage SDK:
//
//   container-a/test-blob => "foo"
//   container-b/test-blob => "bar"
{
  val containerARef = blobClient.getContainerReference(CONTAINER_A)
  val containerBRef = blobClient.getContainerReference(CONTAINER_B)

  containerARef.createIfNotExists()
  containerARef.getBlockBlobReference(TEST_BLOB).uploadText("foo")

  containerBRef.createIfNotExists()
  containerBRef.getBlockBlobReference(TEST_BLOB).uploadText("bar")
}

val pathA = new Path(s"wasbs://$CONTAINER_A@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")
val pathB = new Path(s"wasbs://$CONTAINER_B@$ACCOUNT.blob.core.windows.net/$TEST_BLOB")

for {
  // Creates a file system associated with "container-a".
  fs <- managed {
val conf = new Configuration
conf.set("fs.wasbs.impl", classOf[NativeAzureFileSystem].getName)
conf.set(s"fs.azure.sas.$CONTAINER_A.$ACCOUNT.blob.core.windows.net", 
sasKeyForContainerA)
pathA.getFileSystem(conf)
  }

  // Opens a reader pointing to "container-a/test-blob". We expect to get the 
string "foo" written
  // to this blob previously.
  readerA <- managed(new BufferedReader(new InputStreamReader(fs open pathA)))

  // Opens a reader pointing to "container-b/test-blob". We expect to get an 
exception since the SAS
  // key used to create the `FileSystem` instance is restricted to 
"container-a".
  readerB <- managed(new BufferedReader(new InputStreamReader(fs open pathB)))
} {

  // Should get "foo"
  assert(readerA.readLine() == "

[jira] [Assigned] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2017-07-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-9686:
-

Assignee: (was: Cheng Lian)

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5. Run the following JDBC client code:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:10000/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>     null, null, null);
> Problem:
>   No tables are returned by this API, though it works in Spark 1.3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043478#comment-16043478
 ] 

Cheng Lian commented on SPARK-20958:


[~marmbrus], here is the draft release note entry:
{quote}
SPARK-20958: For users who use parquet-avro together with Spark 2.2, please use 
parquet-avro 1.8.1 instead of parquet-avro 1.8.2. This is because parquet-avro 
1.8.2 upgrades avro from 1.7.6 to 1.8.1, which is backward incompatible with 
1.7.6.
{quote}
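
For application builds that pick up parquet-avro transitively, a rough sketch of pinning it back to 1.8.1 (assuming an sbt build; the Maven/Gradle equivalent is analogous):
{code}
// Force parquet-avro 1.8.1 even if another dependency pulls in 1.8.2.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"
dependencyOverrides += "org.apache.parquet" % "parquet-avro" % "1.8.1"
{code}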

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Labels: release-notes release_notes releasenotes  (was: release-notes)

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035149#comment-16035149
 ] 

Cheng Lian commented on SPARK-20958:


Thanks [~rdblue]! I'm also reluctant to roll it back considering those fixes we 
wanted so badly... We decided to give this a try because, from the perspective 
of release management, we'd like to avoid cutting a release with known 
conflicting dependencies, even transitive ones. For a Spark 2.2 user, it's 
quite natural to choose parquet-avro 1.8.2, which is part of parquet-mr 1.8.2, 
which in turn, is a direct dependency of Spark 2.2.0.

However, due to PARQUET-389, rolling back is already not an option. Two options 
I can see here are:

# Release Spark 2.2.0 as is with a statement in the release notes saying that 
users should use parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro 
compatibility issue.
# Wait for parquet-mr 1.8.3, which hopefully resolves this dependency issue 
(e.g., by reverting PARQUET-358).

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034310#comment-16034310
 ] 

Cheng Lian commented on SPARK-20958:


[~rdblue] I think the root cause here is we cherry-picked parquet-mr [PR 
#318|https://github.com/apache/parquet-mr/pull/318] to parquet-mr 1.8.2, and 
introduced this avro upgrade.

Tried to roll parquet-mr back to 1.8.1 but it doesn't work well because 
this brings back 
[PARQUET-389|https://issues.apache.org/jira/browse/PARQUET-389] and breaks some 
test cases involving schema evolution. 

It would be nice if we can have a parquet-mr 1.8.3 or 1.8.2.1 release that has 
[PR #318|https://github.com/apache/parquet-mr/pull/318] reverted from 1.8.2? I 
think cherry-picking that PR is also problematic for parquet-mr because it 
introduces a backward-incompatible dependency change in a maintenance release.

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Description: 
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 
and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.

  was:
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.


> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-20958:
--

 Summary: Roll back parquet-mr 1.8.2 to parquet-1.8.1
 Key: SPARK-20958
 URL: https://issues.apache.org/jira/browse/SPARK-20958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007326#comment-16007326
 ] 

Cheng Lian edited comment on PARQUET-980 at 5/11/17 10:46 PM:
--

The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read 1 or more column chunks consisting of multiple 
pages into a single byte array (or {{ByteBuffer}}) no larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can ever be 
written. Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet files successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.


was (Author: lian cheng):
The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read 1 or more column chunks consisting of multiple 
pages into a single byte array (or {{ByteBuffer}}) no larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can ever be 
written. Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See:https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 10000).
>  * - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
>

[jira] [Commented] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007326#comment-16007326
 ] 

Cheng Lian commented on PARQUET-980:


The current write path ensures that it never writes a page that is larger than 
2GB, but the read path may read 1 or more column chunks consisting of multiple 
pages into a single byte array (or {{ByteBuffer}}) no larger than 2GB.

We hit this issue in production because the data distribution happened to be 
similar to the situation mentioned in the JIRA description and produced a 
skewed row group containing a column chunk larger than 2GB.

I think there are two separate issues to fix:

# On the write path, the strategy that dynamically adjusts memory check 
intervals needs some tweaking. The assumption that sizes of adjacent records 
are similar can be easily broken.
# On the read path, the {{ConsecutiveChunkList.readAll()}} method should 
support reading data larger than 2GB, probably by using multiple buffers.

Another option is to ensure that no row groups larger than 2GB can ever be 
written. Thoughts?

BTW, the [parquet-python|https://github.com/jcrobak/parquet-python/] library 
can read this kind of malformed Parquet file successfully with [this 
patch|https://github.com/jcrobak/parquet-python/pull/56]. We used it to recover 
our data from the malformed Parquet file.
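
To make the second point above concrete, here is a rough sketch (an illustration only, not the actual parquet-mr change) of reading a byte range larger than 2GB into multiple buffers, since a single JVM byte array is capped at roughly 2^31 - 1 elements:
{code}
import java.io.{EOFException, InputStream}

// Reads exactly `totalLength` bytes from `in`, splitting the result across
// fixed-size buffers so the total may exceed the 2GB single-array limit.
def readAllChunked(in: InputStream, totalLength: Long, bufSize: Int = 8 << 20): Seq[Array[Byte]] = {
  val buffers = Seq.newBuilder[Array[Byte]]
  var remaining = totalLength
  while (remaining > 0) {
    val len = math.min(remaining, bufSize.toLong).toInt
    val buf = new Array[Byte](len)
    var off = 0
    while (off < len) {
      val n = in.read(buf, off, len - off)
      if (n < 0) throw new EOFException(s"Unexpected EOF, $remaining bytes still expected")
      off += n
    }
    buffers += buf
    remaining -= len
  }
  buffers.result()
}
{code}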

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See:https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 10000).
>  * - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
> while (j < numChars) {
>   // Generate a char (borrowed from scala.util.Random)
>   buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>   j += 1
> }
> // create a string: the string constructor will copy the buffer.
> new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
> ...
> {noformat}
> This seems to be fixed by commit 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
>  This can happen when 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-980) Cannot read row group larger than 2GB

2017-05-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-980:
---
Affects Version/s: 1.8.1
   1.8.2

> Cannot read row group larger than 2GB
> -
>
> Key: PARQUET-980
> URL: https://issues.apache.org/jira/browse/PARQUET-980
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1, 1.8.2
>Reporter: Herman van Hovell
>
> Parquet MR 1.8.2 does not support reading row groups which are larger than 2 
> GB. 
> See:https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1064
> We are seeing this when writing skewed records. This throws off the 
> estimation of the memory check interval in the InternalParquetRecordWriter. 
> The following spark code illustrates this:
> {noformat}
> /**
>  * Create a data frame that will make parquet write a file with a row group 
> larger than 2 GB. Parquet
>  * only checks the size of the row group after writing a number of records. 
> This number is based on
>  * average row size of the already written records. This is problematic in 
> the following scenario:
>  * - The initial (100) records in the record group are relatively small.
>  * - The InternalParquetRecordWriter checks if it needs to write to disk (it 
> should not), it assumes
>  *   that the remaining records have a similar size, and (greatly) increases 
> the check interval (usually
>  *   to 10000).
>  * - The remaining records are much larger than expected, making the row 
> group larger than 2 GB (which
>  *   makes reading the row group impossible).
>  *
>  * The data frame below illustrates such a scenario. This creates a row group 
> of approximately 4GB.
>  */
> val badDf = spark.range(0, 2200, 1, 1).mapPartitions { iterator =>
>   var i = 0
>   val random = new scala.util.Random(42)
>   val buffer = new Array[Char](750000)
>   iterator.map { id =>
> // the first 200 records have a length of 1K and the remaining 2000 have 
> a length of 750K.
> val numChars = if (i < 200) 1000 else 750000
> i += 1
> // create a random array
> var j = 0
> while (j < numChars) {
>   // Generate a char (borrowed from scala.util.Random)
>   buffer(j) = (random.nextInt(0xD800 - 1) + 1).toChar
>   j += 1
> }
> // create a string: the string constructor will copy the buffer.
> new String(buffer, 0, numChars)
>   }
> }
> badDf.write.parquet("somefile")
> val corruptedDf = spark.read.parquet("somefile")
> corruptedDf.select(count(lit(1)), max(length($"value"))).show()
> {noformat}
> The latter fails with the following exception:
> {noformat}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
> ...
> {noformat}
> This seems to be fixed by commit 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  in parquet 1.9.x. Is there any chance that we can fix this in 1.8.x?
>  This can happen when 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (SPARK-20132) Add documentation for column string functions

2017-05-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20132:
---
Fix Version/s: 2.2.0

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 2.2.0, 2.3.0
>
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation

2017-04-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20246:
---
Labels: correctness  (was: )

> Should check determinism when pushing predicates down through aggregation
> -
>
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weiluo Ren
>  Labels: correctness
>
> {code}import org.apache.spark.sql.functions._
> spark.range(1,1000).distinct.withColumn("random", 
> rand()).filter(col("random") > 0.3).orderBy("random").show{code}
> gives wrong result.
>  In the optimized logical plan, it shows that the filter with the 
> non-deterministic predicate is pushed beneath the aggregate operator, which 
> should not happen.
> cc [~lian cheng]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation

2017-04-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959946#comment-15959946
 ] 

Cheng Lian commented on SPARK-20246:


[This 
line|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L795]
 should be the root cause. We didn't check determinism of the predicates before 
pushing them down.

The same thing also applies when pushing predicates through union and window 
operators.

cc [~cloud_fan]
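
Until the fix lands, a user-level way to sidestep this (a sketch based on my understanding, not a guaranteed contract) is to materialize the non-deterministic column before filtering, so the filter is evaluated against already-materialized values instead of re-evaluating {{rand()}} under a pushed-down predicate:
{code}
import org.apache.spark.sql.functions._

val df = spark.range(1, 1000).distinct.withColumn("random", rand())

// Cache and force materialization so the rand() values are fixed before any
// subsequent filter is applied.
val materialized = df.persist()
materialized.count()

materialized.filter(col("random") > 0.3).orderBy("random").show()
{code}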

> Should check determinism when pushing predicates down through aggregation
> -
>
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weiluo Ren
>
> {code}import org.apache.spark.sql.functions._
> spark.range(1,1000).distinct.withColumn("random", 
> rand()).filter(col("random") > 0.3).orderBy("random").show{code}
> gives wrong result.
>  In the optimized logical plan, it shows that the filter with the 
> non-deterministic predicate is pushed beneath the aggregate operator, which 
> should not happen.
> cc [~lian cheng]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19716:
---
Fix Version/s: (was: 2.3.0)
   2.2.0

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except for struct 
> type fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19716.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17398
[https://github.com/apache/spark/pull/17398]
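
For reference, a small end-to-end illustration of the scenario described below (the class and column names are mine, chosen only for the example):
{code}
case class Data(a: Int, c: Int)
case class ComplexData(arr: Seq[Data])

import spark.implicits._

// The struct elements carry an extra field `b`; by-name resolution should
// extract only `a` and `c` from each element of the array.
val df = spark.sql("SELECT array(named_struct('a', 1, 'b', 2, 'c', 3)) AS arr")
val ds = df.as[ComplexData]
ds.show()
{code}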

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except for struct 
> type fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-19716:
--

Assignee: Wenchen Fan

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except for struct 
> type fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing Hive metastore level partition pruning

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Summary: String literals are not escaped while performing Hive metastore 
level partition pruning  (was: String literals are not escaped while performing 
partition pruning at Hive metastore level)

> String literals are not escaped while performing Hive metastore level 
> partition pruning
> ---
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [unresolvedalias('a, None)]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   a: int
> [info]   Project [a#26]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Project [a#26]
> [info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
> [info]  +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Physical Plan ==
> [info]   *Project [a#26]
> [info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
> true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 
> 0, PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], 
> PushedFilters: [], ReadSchema: struct<a:int>
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
> [info]struct<>   struct<>
> [info]   ![2]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Description: 
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("SPARK-19912") {
withTable("spark_19912") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", 
"q").saveAsTable("spark_19912")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}
The above test case fails like this:
{noformat}
[info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
[info]   Results do not match for query:
[info]   Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
[info]   Timezone Env:
[info]
[info]   == Parsed Logical Plan ==
[info]   'Project [unresolvedalias('a, None)]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Analyzed Logical Plan ==
[info]   a: int
[info]   Project [a#26]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Optimized Logical Plan ==
[info]   Project [a#26]
[info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
[info]  +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Physical Plan ==
[info]   *Project [a#26]
[info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], PushedFilters: 
[], ReadSchema: struct<a:int>
[info]   == Results ==
[info]
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
[info]struct<>   struct<>
[info]   ![2]
{noformat}
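
The gist of a possible fix (a sketch of the idea only, not necessarily the shape of the final patch) is to escape quote characters in string literals before embedding them into the Hive filter string passed to the metastore:
{code}
// Hypothetical helper: quotes a string literal for a Hive partition filter,
// escaping backslashes and double quotes embedded in the value.
def quoteStringLiteral(value: String): String =
  "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

// e.g. the partition value from the test case above becomes one properly
// escaped literal instead of terminating the filter string early:
quoteStringLiteral("p1\" and q=\"q1")
{code}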

  was:
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}


> String literals are not escaped while performing partition pruning at Hive 
> metastore level
> --
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.cale

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Labels: correctness  (was: )

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--------------------------+---+
> // |c  |a                         |b  |
> // +---+--------------------------+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1                        |1  |
> // |2  |p2                        |2  |
> // +---+--------------------------+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19912:
--

 Summary: String literals are not escaped while performing 
partition pruning at Hive metastore level
 Key: SPARK-19912
 URL: https://issues.apache.org/jira/browse/SPARK-19912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.2.0
Reporter: Cheng Lian


{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Affects Version/s: 2.2.0

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--------------------------+---+
> // |c  |a                         |b  |
> // +---+--------------------------+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1                        |1  |
> // |2  |p2                        |2  |
> // +---+--------------------------+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19905:
--

 Summary: Dataset.inputFiles is broken for Hive SerDe tables
 Key: SPARK-19905
 URL: https://issues.apache.org/jira/browse/SPARK-19905
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The following snippet reproduces this issue:
{code}
spark.range(10).createOrReplaceTempView("t")
spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
spark.table("u").inputFiles.foreach(println)
{code}
In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
{noformat}
file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
{noformat}
on my laptop.
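
For contrast, a hedged illustration (not from the original report) of the expected 
behaviour: for a file-based data source the same API does list the underlying 
files, which is what one would expect for Hive SerDe tables as well. The path 
below is arbitrary:
{code}
// Write a small Parquet dataset and list its input files; inputFiles behaves
// as expected for file-based data sources.
spark.range(10).write.mode("overwrite").parquet("/tmp/input_files_demo")
spark.read.parquet("/tmp/input_files_demo").inputFiles.foreach(println)
// Prints the part-* file URIs under /tmp/input_files_demo, which is the kind
// of output Dataset.inputFiles should also produce for Hive SerDe tables.
{code}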



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of persisted partitioned tables, 
this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of persisted partitioned tables, 
this magic string is not interpreted as {{NULL}} but as a regular string.


> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a   

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of persisted partitioned tables, 
this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case persisted partitioned table, this magic string is not interpreted as 
{{NULL}} but a regular string.



> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a 

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Summary: __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition 
value in partitioned persisted tables  (was: __HIVE_DEFAULT_PARTITION__ not 
interpreted as NULL partition value in partitioned persisted tables)

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned table uses magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of persisted partitioned tables, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19887:
--

 Summary: __HIVE_DEFAULT_PARTITION__ not interpreted as NULL 
partition value in partitioned persisted tables
 Key: SPARK-19887
 URL: https://issues.apache.org/jira/browse/SPARK-19887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Cheng Lian


The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case of persisted partitioned tables, this magic string is not interpreted as 
{{NULL}} but as a regular string.
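
A minimal sketch of the expected interpretation (illustrative only, not Spark's 
internal code; the constant and function names below are hypothetical): when a 
partition value read back from the catalog equals the Hive default-partition 
marker, it should surface as {{NULL}} rather than as a literal string.
{code}
// Illustrative helper only.
val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

def interpretPartitionValue(raw: String): Option[String] =
  if (raw == HiveDefaultPartition) None else Option(raw)

interpretPartitionValue("p1")                          // Some("p1")
interpretPartitionValue("__HIVE_DEFAULT_PARTITION__")  // None, i.e. NULL
{code}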




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903749#comment-15903749
 ] 

Cheng Lian commented on SPARK-19887:


cc [~cloud_fan]

> __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in 
> partitioned persisted tables
> --
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned table uses magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of persisted partitioned tables, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19737.

Resolution: Fixed

Issue resolved by pull request 17168
[https://github.com/apache/spark/pull/17168]

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-19737:
--

Assignee: Cheng Lian

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed befo

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocation
> # Look up the function name from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIR

[jira] [Created] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19737:
--

 Summary: New analysis rule for reporting unregistered functions 
without relying on relation resolution
 Key: SPARK-19737
 URL: https://issues.apache.org/jira/browse/SPARK-19737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
 Fix For: 2.2.0


Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.
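
As an illustration of the idea, here is a self-contained toy model (not the 
actual Catalyst rule; the real implementation would pattern-match 
{{UnresolvedFunction}} inside a {{Rule[LogicalPlan]}}). The point is that the 
check only needs the function registry and never touches relations:
{code}
// Toy stand-ins for Catalyst expressions, for illustration only.
sealed trait Expr
case class Column(name: String) extends Expr
case class UnresolvedFn(name: String, args: Seq[Expr]) extends Expr

// Report unregistered functions without resolving any columns or relations.
def lookupFunctions(exprs: Seq[Expr], registry: Set[String]): Unit = {
  def check(e: Expr): Unit = e match {
    case UnresolvedFn(name, args) =>
      if (!registry.contains(name.toLowerCase))
        sys.error(s"Undefined function: '$name'")
      args.foreach(check)
    case _: Column => ()  // left untouched: no ResolveRelations, no partition discovery
  }
  exprs.foreach(check)
}

// lookupFunctions(Seq(UnresolvedFn("foo", Seq(Column("a")))), Set("upper", "lower"))
// => fails fast with "Undefined function: 'foo'"
{code}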



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-23 Thread Cheng Lian

This one seems to be relevant, but it's already fixed in 2.1.0.

One way to debug is to turn on trace-level logging and check how the 
analyzer/optimizer behaves.



On 2/22/17 11:11 PM, StanZhai wrote:
Could this be related to 
https://issues.apache.org/jira/browse/SPARK-17733 ?



-- Original --
*From: * "Cheng Lian-3 [via Apache Spark Developers List]";<[hidden 
email] >;

*Send time:* Thursday, Feb 23, 2017 9:43 AM
*To:* "Stan Zhai"<[hidden email] 
>;

*Subject: * Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

Just from the thread dump you provided, it seems that this particular 
query plan jams our optimizer. However, it's also possible that the 
driver just happened to be running optimizer rules at that particular 
point in time.


Since query planning doesn't touch any actual data, could you please 
try to minimize this query by replacing the actual relations with 
temporary views derived from Scala local collections? In this way, it 
would be much easier for others to reproduce the issue.
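
For example, a hedged sketch (not from the original thread) of a repro built 
from local collections; the view names and columns are arbitrary:
{code}
// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("t1")
Seq((1, 10.0), (2, 20.0)).toDF("id", "score").createOrReplaceTempView("t2")

// Substitute the real (complex) query here, but run it against the tiny views.
val df = spark.sql("SELECT t1.id, t1.name, t2.score FROM t1 JOIN t2 ON t1.id = t2.id")
df.rdd  // the step reported to hang; with local data the plan is easy to share
{code}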


Cheng


On 2/22/17 5:16 PM, Stan Zhai wrote:

Thanks for Lian's reply.

Here is the QueryPlan generated by Spark 1.6.2 (I can't get it in 
Spark 2.1.0):

|...|
||

-- Original --
*Subject:* Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

What is the query plan? We once observed query plans that grow 
exponentially in iterative ML workloads, causing the query planner to hang 
forever. For example, each iteration combines 4 plan trees from the 
previous iteration into a larger plan tree, so the plan tree can easily 
reach billions of nodes after 15 iterations (4^15 is roughly 1.07 billion).



On 2/22/17 9:29 AM, Stan Zhai wrote:

Hi all,

The driver hangs at DataFrame.rdd in Spark 2.1.0 when the 
DataFrame (SQL) is complex. The following is the thread dump of my driver:

...















[jira] [Updated] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups

2017-02-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-893:
---
Description: 
The following Spark snippet reproduces this issue with Spark 2.1 (with 
parquet-mr 1.8.1) and Spark 2.2-SNAPSHOT (with parquet-mr 1.8.2):

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()

df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  ||-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema =
  new StructType().
add("f0", new StructType().
  // This nested field name differs from the original one
  add("f01", IntegerType)).
add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  ||-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the 
written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at 
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}

According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} 
[doesn't check for empty 
groups|https://github.com/apache/parquet-mr/blob/apa

[jira] [Created] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups

2017-02-22 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-893:
--

 Summary: GroupColumnIO.getFirst() doesn't check for empty groups
 Key: PARQUET-893
 URL: https://issues.apache.org/jira/browse/PARQUET-893
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Cheng Lian


The following Spark 2.1 snippet reproduces this issue:

{code}
import org.apache.spark.sql.types._

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()

df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  ||-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema =
  new StructType().
add("f0", new StructType().
  // This nested field name differs from the original one
  add("f01", IntegerType)).
add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  ||-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the 
written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/parquet-test/part-7-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at 
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}

According to this stack trace, it seems

[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19529:
---
Target Version/s: 1.6.3, 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.1, 2.2.0)

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> which will eventually time out). This has a bad impact on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.
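
A hedged sketch of that suggestion (not the actual TransportClientFactory code; 
the wrapper below and its timeout handling are assumptions): wait on the connect 
future with the interruptible, timed {{await()}} so a cancelled task can actually 
be interrupted.
{code}
import java.io.IOException
import io.netty.channel.ChannelFuture

// Hypothetical wrapper for illustration only.
@throws[InterruptedException]
def waitForConnection(cf: ChannelFuture, connectionTimeoutMs: Long): Unit = {
  // ChannelFuture.await(timeoutMillis) throws InterruptedException when the
  // waiting thread is interrupted, unlike awaitUninterruptibly().
  if (!cf.await(connectionTimeoutMs)) {
    throw new IOException(s"Connecting timed out ($connectionTimeoutMs ms)")
  }
  if (!cf.isSuccess) {
    throw new IOException("Failed to connect", cf.cause())
  }
}
{code}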



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19529:
---
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.1)

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> that will eventually time out). This hurts task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607)
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object@1849476028})
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.
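
As a rough illustration of the direction suggested above (a hedged sketch only, not the actual Spark patch; the helper name, timeout handling, and IOException wrapping are assumptions), switching to await() also means handling InterruptedException explicitly:

{code}
import java.io.IOException

import io.netty.channel.ChannelFuture

object InterruptibleConnect {
  // Hypothetical helper: wait for the connect future interruptibly and surface
  // interruption as an IOException so IOException-based callers keep working.
  def waitForConnection(cf: ChannelFuture, timeoutMs: Long): Unit = {
    try {
      // await(timeout) throws InterruptedException if the waiting thread is
      // interrupted, so a cancelled task no longer blocks until the timeout.
      if (!cf.await(timeoutMs)) {
        throw new IOException(s"Connecting timed out ($timeoutMs ms)")
      }
    } catch {
      case e: InterruptedException =>
        Thread.currentThread().interrupt() // preserve the interrupt status
        throw new IOException("Interrupted while waiting for connection", e)
    }
  }
}
{code}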



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Fix Version/s: 2.1.1

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map<int, string>) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 307, Column 108: No applicable constructor/method found for actual parameters "java.lang.String, scala.collection.Map"; candidates are: "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}
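
For versions without the fix, one possible mitigation (not stated in this ticket, and not verified here) is to widen the case class field to scala.collection.Map so it matches the parameter type in the generated constructor call:

{code}
// Hypothetical workaround sketch for affected versions: widening the field type
// so the generated "scala.collection.Map" argument matches the constructor.
// The table name and query follow the ticket's example.
import scala.collection.Map

case class TestWidened(id: String, map_test: Map[Long, String])

// spark.sql("SELECT * FROM xyz.map_test").as[TestWidened].map(t => t).collect()
{code}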



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Affects Version/s: 2.1.0

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map<int, string>) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 307, Column 108: No applicable constructor/method found for actual parameters "java.lang.String, scala.collection.Map"; candidates are: "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >