Re: spark messing up handling of native dependency code?

2017-06-02 Thread Georg Heiler
When tested without any parallelism, the same problem persists. Actually,
NiFi shows the same issues, so it is probably not related to Spark.

Maciej Szymkiewicz  wrote on Sat., 3 June 2017 at
01:37:

> Maybe not related, but in general GeoTools is not thread safe, so using it
> from workers is most likely a gamble.
> On 06/03/2017 01:26 AM, Georg Heiler wrote:
>
> Hi,
>
> There is a weird problem with Spark when handling native dependency code:
> I want to use a library (JAI) with Spark to parse some spatial raster
> files. Unfortunately, there are some strange issues. JAI only works when
> running via the build tool, i.e. `sbt run`, but not when executed in Spark.
>
> When executed via spark-submit, the error is:
>
> java.lang.IllegalArgumentException: The input argument(s) may not be
> null.
> at
> javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
> at
> org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
>
> It looks like some native dependency (I think GEOS is used in the
> background) is not set up correctly.
>
> Assuming something was wrong with the classpath, I tried to run a plain
> Java/Scala function, but that one works just fine.
>
> Is Spark messing with the classpath?
>
> I created a minimal example here:
> https://github.com/geoHeil/jai-packaging-problem
>
>
> Hope someone can shed some light on this problem,
> Regards,
> Georg
>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
This should probably fail the vote.  I'll follow up with an RC4.

On Fri, Jun 2, 2017 at 4:11 PM, Wenchen Fan  wrote:

> I'm -1 on this.
>
> I merged a PR  to master/2.2
> today and broke the build. I'm really sorry for the trouble, and I should
> not have been so aggressive when merging PRs. The actual reason is some misleading
> comments in the code and a bug in Spark's testing framework: it never
> runs REPL tests unless you change code in the REPL module.
>
> I will be more careful in the future, and should NEVER backport
> non-bug-fix commits to an RC branch. Sorry again for the trouble!
>
> On Fri, Jun 2, 2017 at 2:40 PM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc3
>>  (cc5dbd55b0b312a
>> 661d21a4b605ce5ead2ba5218)
>>
>> A list of JIRA tickets resolved can be found with this filter.
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1239/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


Re: spark messing up handling of native dependency code?

2017-06-02 Thread Maciej Szymkiewicz
Maybe not related, but in general GeoTools is not thread safe, so using it
from workers is most likely a gamble.
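
(For illustration only — a minimal sketch, not specific to GeoTools, of one way to confine a non-thread-safe helper to a single task by instantiating it inside mapPartitions; `RasterParser` is a hypothetical stand-in. Note this only helps when the unsafety lives in instance state rather than in global/static registries.)

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for a non-thread-safe helper (e.g. a raster process).
class RasterParser {
  def parse(path: String): String = s"parsed:$path"
}

// One instance per task, created inside mapPartitions, so it is never shared
// between threads or serialized from the driver.
def extractPolygons(paths: RDD[String]): RDD[String] =
  paths.mapPartitions { iter =>
    val parser = new RasterParser()
    iter.map(parser.parse)
  }
```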

On 06/03/2017 01:26 AM, Georg Heiler wrote:
> Hi,
>
> There is a weird problem with Spark when handling native dependency code:
> I want to use a library (JAI) with Spark to parse some spatial raster
> files. Unfortunately, there are some strange issues. JAI only works
> when running via the build tool, i.e. `sbt run`, but not when executed in Spark.
>
> When executed via spark-submit, the error is:
>
> java.lang.IllegalArgumentException: The input argument(s) may not
> be null.
> at
> javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
> at
> org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
>
> It looks like some native dependency (I think GEOS is used in
> the background) is not set up correctly.
>
> Assuming something was wrong with the classpath, I tried to run a plain
> Java/Scala function, but that one works just fine.
>
> Is Spark messing with the classpath?
>
> I created a minimal example here:
> https://github.com/geoHeil/jai-packaging-problem
>
>
> Hope someone can shed some light on this problem,
> Regards,
> Georg 



spark messing up handling of native dependency code?

2017-06-02 Thread Georg Heiler
Hi,

There is a weird problem with Spark when handling native dependency code:
I want to use a library (JAI) with Spark to parse some spatial raster
files. Unfortunately, there are some strange issues. JAI only works when
running via the build tool, i.e. `sbt run`, but not when executed in Spark.

When executed via spark-submit, the error is:

java.lang.IllegalArgumentException: The input argument(s) may not be
null.
at
javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at
org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)

It looks like some native dependency (I think GEOS is used in the
background) is not set up correctly.

Assuming something was wrong with the classpath, I tried to run a plain
Java/Scala function, but that one works just fine.

Is Spark messing with the classpath?

I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem


Hope someone can shed some light on this problem,
Regards,
Georg
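
(Not a confirmed diagnosis for this report, but one commonly reported cause of `ParameterBlockJAI` failures inside assembled jars is that JAI's operation-registry and Java service files get dropped or overwritten during packaging. A sketch of an sbt-assembly merge strategy, assuming the project builds a fat jar with sbt-assembly; the exact file names depend on the JAI/JAI-EXT artifacts in use.)

```scala
// build.sbt sketch (assumes the sbt-assembly plugin): concatenate JAI's
// registry and service files instead of keeping only the first copy, so the
// operation descriptors that ParameterBlockJAI looks up survive packaging.
assemblyMergeStrategy in assembly := {
  case "META-INF/registryFile.jai"                  => MergeStrategy.concat
  case "META-INF/registryFile.jaiext"               => MergeStrategy.concat // only if jai-ext is used
  case PathList("META-INF", "services", xs @ _*)    => MergeStrategy.concat
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}
```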


Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Wenchen Fan
I'm -1 on this.

I merged a PR  to master/2.2
today and broke the build. I'm really sorry for the trouble, and I should not
have been so aggressive when merging PRs. The actual reason is some misleading
comments in the code and a bug in Spark's testing framework: it never
runs REPL tests unless you change code in the REPL module.

I will be more careful in the future, and should NEVER backport non-bug-fix
commits to an RC branch. Sorry again for the trouble!

On Fri, Jun 2, 2017 at 2:40 PM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc3
>  (cc5dbd55b0b312a
> 661d21a4b605ce5ead2ba5218)
>
> A list of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1239/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc3
 (
cc5dbd55b0b312a661d21a4b605ce5ead2ba5218)

A list of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1239/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.
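
(For anyone who wants to run an existing sbt-based workload against the candidate, a sketch of pointing a build at the staging repository above; this assumes the staged artifacts use the plain 2.2.0 version string.)

```scala
// build.sbt sketch for testing the release candidate
resolvers += "Apache Spark 2.2.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1239/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)
```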


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-02 Thread Michael Armbrust
This vote fails. Following shortly with RC3.

On Thu, Jun 1, 2017 at 8:28 PM, Reynold Xin  wrote:

> Again (I've probably said this more than 10 times already in different
> threads), SPARK-18350 has no impact on whether the timestamp type is with
> timezone or without timezone. It simply allows a session specific timezone
> setting rather than having Spark always rely on the machine timezone.
>
> On Wed, May 31, 2017 at 11:58 AM, Kostas Sakellis 
> wrote:
>
>> Hey Michael,
>>
>> There is a discussion on TIMESTAMP semantics going on the thread "SQL
>> TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should
>> we make a decision there before voting on the next RC for Spark 2.2?
>>
>> Thanks,
>> Kostas
>>
>> On Tue, May 30, 2017 at 12:09 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Last call, anything else important in-flight for 2.2?
>>>
>>> On Thu, May 25, 2017 at 10:56 AM, Michael Allman 
>>> wrote:
>>>
 PR is here: https://github.com/apache/spark/pull/18112


 On May 25, 2017, at 10:28 AM, Michael Allman 
 wrote:

 Michael,

 If you haven't started cutting the new RC, I'm working on a
 documentation PR right now I'm hoping we can get into Spark 2.2 as a
 migration note, even if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888.

 Michael


 On May 22, 2017, at 11:39 AM, Michael Armbrust 
 wrote:

 I'm waiting for SPARK-20814 at Marcelo's request and I'd also like to
 include SPARK-20844. I think we should be able to cut another RC midweek.

 On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> All the outstanding ML QA doc and user guide items are done for 2.2 so
> from that side we should be good to cut another RC :)
>
>
> On Thu, 18 May 2017 at 00:18 Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Seeing an issue with the DataScanExec and some of our integration
>> tests for the SCC. Running dataframe read and writes from the shell seems
>> fine but the Redaction code seems to get a "None" when doing
>> SparkSession.getActiveSession.get in our integration tests. I'm not
>> sure why, but I'll dig into this later if I get a chance.
>>
>> Example Failed Test
>> https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>>
>> ```
>> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: None.get
>> [info] java.util.NoSuchElementException: None.get
>> [info] at scala.None$.get(Option.scala:347)
>> [info] at scala.None$.get(Option.scala:345)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>> ```
>>
>> Again, this only seems to repro in our IT suite, so I'm not sure if this
>> is a real issue.
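>>
>> (For reference, a minimal sketch of the pattern behind the None.get failure
>> and a defensive variant — not a claim about where the fix belongs.)
>>
>> ```scala
>> import org.apache.spark.sql.SparkSession
>>
>> // getActiveSession is thread-local: on a thread that never registered an
>> // active session it returns None, and .get then throws
>> // java.util.NoSuchElementException: None.get (as in the trace above).
>> def sessionConfUnsafe(key: String): String =
>>   SparkSession.getActiveSession.get.conf.get(key)
>>
>> // Defensive variant: fall back to the default session or a supplied default.
>> def sessionConfSafe(key: String, default: String): String =
>>   SparkSession.getActiveSession
>>     .orElse(SparkSession.getDefaultSession)
>>     .map(_.conf.get(key, default))
>>     .getOrElse(default)
>> ```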
>>
>>
>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
>> wrote:
>>
>>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.
>>> Thanks everyone who helped out on those!
>>>
>>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they
>>> are essentially all for documentation.
>>>
>>> Joseph
>>>
>>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin >> > wrote:
>>>
 Since you'll be creating a new RC, I'd wait until SPARK-20666 is
 fixed, since the change that caused it is in branch-2.2. Probably a
 good idea to raise it to blocker at this point.

 On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
  wrote:
 > I'm going to -1 given the outstanding issues and lack of +1s.
 I'll create
 > another RC once ML has had time to take care of the more critical
 problems.
 > In the meantime please keep testing this release!
 >
 > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <
 ishiz...@jp.ibm.com>
 > wrote:
 >>
 >> +1 (non-binding)
 >>
 >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the
 tests for
 >> core have passed.
 >>
 >> $ java -version
 >> openjdk version "1.8.0_111"
 >> OpenJDK Runtime Environment (build
 >> 1.8.0_111-8u111-b1

stuck on one of the jobs in spark streaming app

2017-06-02 Thread shara.st...@gmail.com
Hello

I have a Spark Streaming app which consumes Kafka messages using the Kafka 0.9
directStream and EsSpark saveToEs.

Almost every time after I submit it, it runs fine, and then after several hours
one of the jobs just gets stuck and keeps running. Has anybody seen the same or
a similar issue?
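
(For context, a minimal sketch of the kind of pipeline described, written against the standard kafka-0-10 direct stream API rather than the MapR kafka09 package that appears in the stack trace below; broker, topic, group, batch interval, and the Elasticsearch resource name are placeholders.)

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.elasticsearch.spark.rdd.EsSpark

object KafkaToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-es")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "es-indexer",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream over the "events" topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Index each micro-batch into Elasticsearch.
    stream.map(r => Map("key" -> r.key, "value" -> r.value)).foreachRDD { rdd =>
      EsSpark.saveToEs(rdd, "events/doc")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```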

The stack trace is

2017-06-02 14:28:13
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):

"Attach Listener" #1031 daemon prio=9 os_prio=0 tid=0x7f97987f6000
nid=0x989 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"Thread-904" #1023 daemon prio=5 os_prio=0 tid=0x0334c000 nid=0x5db1
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"Thread-903" #1024 daemon prio=5 os_prio=0 tid=0x7f97b64b3000 nid=0x5daf
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"Thread-902" #1022 daemon prio=5 os_prio=0 tid=0x7f97b64ae000 nid=0x5dac
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"Thread-901" #1021 daemon prio=5 os_prio=0 tid=0x7f97a04b8800 nid=0x5daa
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None
.
.
.
.
"Thread-52" #138 daemon prio=5 os_prio=0 tid=0x7f97b4121800 nid=0x3df5
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"Thread-51" #137 daemon prio=5 os_prio=0 tid=0x7f97a00ee000 nid=0x3df4
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

.
.
.
"Thread-3" #86 daemon prio=5 os_prio=0 tid=0x7f9798222800 nid=0x3e1c
runnable [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

"shuffle-client-5-1" #75 daemon prio=5 os_prio=0 tid=0x7f97a0024800
nid=0x3dcc runnable [0x7f9786488000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x000702c84460> (a
io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x000702c84480> (a java.util.Collections$UnmodifiableSet)
- locked <0x000702c84418> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:760)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:401)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None

"Executor task launch worker-0" #83 daemon prio=5 os_prio=0
tid=0x7f97b5a96800 nid=0x3dc9 runnable [0x7f9786fa]
   java.lang.Thread.State: RUNNABLE
at com.mapr.fs.jni.MarlinJniListener.Poll(Native Method)
at
com.mapr.streams.impl.listener.MarlinListenerImpl.poll(MarlinListenerImpl.java:271)
at
com.mapr.streams.impl.listener.MarlinListener.poll(MarlinListener.java:100)
at
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1118)
at
org.apache.spark.streaming.kafka09.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:104)
at
org.apache.spark.streaming.kafka09.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:71)
at
org.apache.spark.streaming.kafka09.KafkaRDD$KafkaRDDIterator.skipGapsAndGetNext$1(KafkaRDD.scala:233)
at
org.apache.spark.streaming.kafka09.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:248)
at
org.apache.spark.streaming.kafka09.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:197)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadP

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Michael Allman
Hi Zoltan,

I don't fully understand your proposal for table-specific timestamp type 
semantics. I think it will be helpful to everyone in this conversation if you 
can identify the expected behavior for a few concrete scenarios.

Suppose we have a Hive metastore table hivelogs with a column named ts with the
Hive timestamp type as described here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp.
This table was created by Hive and is usually accessed through Hive or Presto.

Suppose again we have a Hive metastore table sparklogs with a column named ts
with the Spark SQL timestamp type as described here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$.
This table was created by Spark SQL and is usually accessed through Spark SQL.

Let's say Spark SQL sets and reads a table property called timestamp_interp to 
determine timestamp type semantics for that table. Consider a dataframe df 
defined by sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM 
hivelogs"). Suppose the timestamp_interp table property is absent from 
hivelogs. For each possible value of timestamp_interp set on the table 
sparklogs,

1. does df successfully pass analysis (i.e. is it a valid query)?
2. if it's a valid dataframe, what is the type of the ts column?
3. if it's a valid dataframe, what are the semantics of the type of the ts 
column?

Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can you 
answer the same three questions for each combination of timestamp_interp on 
hivelogs and sparklogs?
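
(A sketch of the scenario in spark-shell terms. `timestamp_interp` is the hypothetical table property under discussion, not an existing Spark or Hive option; hivelogs is assumed to already exist in the metastore with a column hts, and the column and property values are placeholders.)

```scala
// Hypothetical setup only: timestamp_interp is not a real Spark property.
spark.sql("""
  CREATE TABLE sparklogs (sts TIMESTAMP)
  TBLPROPERTIES ('timestamp_interp' = 'UTC')   -- placeholder value
""")

val df = spark.sql(
  "SELECT sts AS ts FROM sparklogs UNION ALL SELECT hts AS ts FROM hivelogs")

df.printSchema()   // question 2: what is the reported type of ts?
df.show()          // question 3: what semantics does ts carry?
```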

Thank you.

Michael


> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi  wrote:
> 
> Hi,
> 
> We would like to solve the problem of interoperability of existing data, and 
> that is the main use case for having table-level control. Spark should be 
> able to read timestamps written by Impala or Hive and at the same time read 
> back its own data. These have different semantics, so having a single flag is 
> not enough.
> 
> Two separate types will solve this problem indeed, but only once every 
> component involved supports them. Unfortunately, adding these separate SQL 
> types is a larger effort that is only feasible in the long term and we would 
> like to provide a short-term solution for interoperability in the meantime.
> 
> Br,
> 
> Zoltan
> 
> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin  > wrote:
> Yea I don't see why this needs to be per table config. If the user wants to 
> configure it per table, can't they just declare the data type on a per table 
> basis, once we have separate types for timestamp w/ tz and w/o tz? 
> 
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman  > wrote:
> I would suggest that making timestamp type behavior configurable and 
> persisted per-table could introduce some real confusion, e.g. in queries 
> involving tables with different timestamp type semantics.
> 
> I suggest starting with the assumption that timestamp type behavior is a 
> per-session flag that can be set in a global `spark-defaults.conf` and 
> consider more granular levels of configuration as people identify solid use 
> cases.
> 
> Cheers,
> 
> Michael
> 
> 
> 
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi > > wrote:
>> 
>> Hi,
>> 
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time 
>> semantics even before Spark 2, so I can understand that Spark considers it 
>> to be the "established" behavior that must not be broken. Unfortunately, 
>> this behavior does not provide interoperability with other SQL engines of 
>> the Hadoop stack.
>> 
>> Let me summarize the findings of this e-mail thread so far:
>> - Timezone-agnostic TIMESTAMP semantics would be beneficial for
>>   interoperability and SQL compliance.
>> - Spark can not make a breaking change. For backward-compatibility with
>>   existing data, timestamp semantics should be user-configurable on a
>>   per-table level.
>> Before going into the specifics of a possible solution, do we all agree on 
>> these points?
>> 
>> Thanks,
>> 
>> Zoltan
>> 
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid > > wrote:
>> I had asked zoltan to bring this discussion to the dev list because I think 
>> it's a question that extends beyond a single jira (we can't figure out the
>> semantics of timestamp in parquet if we don't know the overall goal of the
>> timestamp type) and since it's a design question the entire community should
>> be involved.
>> 
>> I think that a lot of the confusion comes because we're talking about
>> different ways time zones affect behavior: (1) parsing and (2) behavior when
>> changing time zones for processing data.

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Zoltan Ivanfi
Hi,

We would like to solve the problem of interoperability of existing data,
and that is the main use case for having table-level control. Spark should
be able to read timestamps written by Impala or Hive and at the same time
read back its own data. These have different semantics, so having a single
flag is not enough.

Two separate types will solve this problem indeed, but only once every
component involved supports them. Unfortunately, adding these separate SQL
types is a larger effort that is only feasible in the long term and we
would like to provide a short-term solution for interoperability in the
meantime.

Br,

Zoltan

On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin  wrote:

> Yea I don't see why this needs to be per table config. If the user wants
> to configure it per table, can't they just declare the data type on a per
> table basis, once we have separate types for timestamp w/ tz and w/o tz?
>
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman 
> wrote:
>
>> I would suggest that making timestamp type behavior configurable and
>> persisted per-table could introduce some real confusion, e.g. in queries
>> involving tables with different timestamp type semantics.
>>
>> I suggest starting with the assumption that timestamp type behavior is a
>> per-session flag that can be set in a global `spark-defaults.conf` and
>> consider more granular levels of configuration as people identify solid use
>> cases.
>>
>> Cheers,
>>
>> Michael
>>
>>
>>
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi  wrote:
>>
>> Hi,
>>
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time
>> semantics even before Spark 2, so I can understand that Spark considers it
>> to be the "established" behavior that must not be broken. Unfortunately,
>> this behavior does not provide interoperability with other SQL engines of
>> the Hadoop stack.
>>
>> Let me summarize the findings of this e-mail thread so far:
>>
>>- Timezone-agnostic TIMESTAMP semantics would be beneficial for
>>interoperability and SQL compliance.
>>- Spark can not make a breaking change. For backward-compatibility
>>with existing data, timestamp semantics should be user-configurable on a
>>per-table level.
>>
>> Before going into the specifics of a possible solution, do we all agree
>> on these points?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid 
>> wrote:
>>
>>> I had asked zoltan to bring this discussion to the dev list because I
>>> think it's a question that extends beyond a single jira (we can't figure
>>> out the semantics of timestamp in parquet if we don't know the overall goal
>>> of the timestamp type) and since it's a design question the entire community
>>> should be involved.
>>>
>>> I think that a lot of the confusion comes because we're talking about
>>> different ways time zones affect behavior: (1) parsing and (2) behavior when
>>> changing time zones for processing data.
>>>
>>> It seems we agree that spark should eventually provide a timestamp type
>>> which does conform to the standard.   The question is, how do we get
>>> there?  Has spark already broken compliance so much that it's impossible to
>>> go back without breaking user behavior?  Or perhaps spark already has
>>> inconsistent behavior / broken compatibility within the 2.x line, so its
>>> not unthinkable to have another breaking change?
>>>
>>> (Another part of the confusion is on me -- I believed the behavior
>>> change was in 2.2, but actually it looks like it's in 2.0.1.  That changes
>>> how we think about this in the context of what goes into a 2.2
>>> release.  SPARK-18350 isn't the origin of the difference in behavior.)
>>>
>>> First: consider processing data that is already stored in tables, and
>>> then accessing it from machines in different time zones.  The standard is
>>> clear that "timestamp" should be just like "timestamp without time zone":
>>> it does not represent one instant in time, rather it's always displayed the
>>> same, regardless of time zone.  This was the behavior in spark 2.0.0 (and
>>> 1.6),  for hive tables stored as text files, and for spark's json formats.
>>>
>>> Spark 2.0.1  changed the behavior of the json format (I believe
>>> with SPARK-16216), so that it behaves more like timestamp *with* time
>>> zone.  It also makes csv behave the same (timestamp in csv was basically
>>> broken in 2.0.0).  However it did *not* change the behavior of a hive
>>> textfile; it still behaves like "timestamp with*out* time zone".  Here's
>>> some experiments I tried -- there are a bunch of files there for
>>> completeness, but mostly focus on the difference between
>>> query_output_2_0_0.txt vs. query_output_2_0_1.txt
>>>
>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>>
>>> Given that spark has changed this behavior post 2.0.0, is it still out
>>> of the question to change this behavior to bring it back in line with the
>>> sql standard for timestamp (without time zone) in the 2.x line?
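
(For reference, a sketch of what SPARK-18350 itself adds in 2.2 — a session-level time zone used when parsing and rendering SQL timestamps, instead of always relying on the JVM default. It does not by itself change which semantics the timestamp type has; zone names and the literal below are just examples.)

```scala
// spark-shell (Spark 2.2) sketch: set the session time zone explicitly.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT cast('2017-06-02 12:00:00' AS timestamp) AS ts").show()

// Switching the session time zone affects how subsequent queries interpret
// and display timestamps, without touching the machine's default zone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT cast('2017-06-02 12:00:00' AS timestamp) AS ts").show()
```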

Re: Spark Issues on ORC

2017-06-02 Thread Dong Joon Hyun
Thank you for confirming, Steve.

I removed the dependency of SPARK-20799 on SPARK-20901.

Bests,
Dongjoon.

From: Steve Loughran 
Date: Friday, June 2, 2017 at 4:42 AM
To: Dong Joon Hyun 
Cc: Apache Spark Dev 
Subject: Re: Spark Issues on ORC


On 26 May 2017, at 19:02, Dong Joon Hyun <dh...@hortonworks.com> wrote:

Hi, All.

Today, while I was looking over JIRA issues for Spark 2.2.0 in Apache Spark,
I noticed that there are many unresolved community requests and related efforts 
over `Feature parity for ORC with Parquet`.
Some examples I found are the following. I created SPARK-20901 to organize 
these although I’m not in the body to do this.
Please let me know if this is not a proper way in the Apache Spark community.
I think we can leverage or transfer the improvement of Parquet in Spark.


SPARK-20799   Unable to infer schema for ORC on reading ORC from S3


Fixed that one for you by changing title: SPARK-20799 Unable to infer schema 
for ORC/Parquet on S3N when secrets are in the URL

I'd recommend closing that as a WONTFIX, as it's related to some security work
in HADOOP-3733 where Path.toString/toURI now strip out the AWS credentials, and
with the way things get passed around as Path.toString(), it's losing them. The
current model meant that everything which logged a path would be logging AWS
secrets, and the logs & exceptions weren't being treated as the sensitive
documents they became the moment that happened.

It could count as a regression, but as it never worked if there was a "/" in
the secret, it's always been a bit patchy.

If this is really needed then it could be pushed back into Hadoop 2.8.2 but 
disabled by default unless you set some option like 
"fs.s3a.insecure.secrets.in.URL".

Maybe also (somehow) change to only support AWS session token triples (id,
session-secret, session-token), so that the damage caused by secrets in logs,
bug reports &c is less destructive.




Re: Spark Issues on ORC

2017-06-02 Thread Steve Loughran

On 26 May 2017, at 19:02, Dong Joon Hyun <dh...@hortonworks.com> wrote:

Hi, All.

Today, while I was looking over JIRA issues for Spark 2.2.0 in Apache Spark,
I noticed that there are many unresolved community requests and related efforts 
over `Feature parity for ORC with Parquet`.
Some examples I found are the following. I created SPARK-20901 to organize 
these although I’m not in the body to do this.
Please let me know if this is not a proper way in the Apache Spark community.
I think we can leverage or transfer the improvement of Parquet in Spark.


SPARK-20799   Unable to infer schema for ORC on reading ORC from S3


Fixed that one for you by changing title: SPARK-20799 Unable to infer schema 
for ORC/Parquet on S3N when secrets are in the URL

I'd recommend closing that as a WONTFIX, as it's related to some security work
in HADOOP-3733 where Path.toString/toURI now strip out the AWS credentials, and
with the way things get passed around as Path.toString(), it's losing them. The
current model meant that everything which logged a path would be logging AWS
secrets, and the logs & exceptions weren't being treated as the sensitive
documents they became the moment that happened.

It could count as a regression, but as it never worked if there was a "/" in
the secret, it's always been a bit patchy.

If this is really needed then it could be pushed back into Hadoop 2.8.2 but 
disabled by default unless you set some option like 
"fs.s3a.insecure.secrets.in.URL".

Maybe also (somehow) change to only support AWS session token triples (id,
session-secret, session-token), so that the damage caused by secrets in logs,
bug reports &c is less destructive.
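
(A sketch of the configuration-based alternative to putting secrets in the URL, using the standard Hadoop S3A keys; the bucket, path, and environment-variable names are placeholders.)

```scala
import org.apache.spark.sql.SparkSession

// Pass S3A credentials through Hadoop configuration rather than s3a:// URLs,
// so Path.toString() and logs never see them.
val spark = SparkSession.builder()
  .appName("orc-on-s3a")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  // only needed for temporary/session credentials:
  .config("spark.hadoop.fs.s3a.session.token", sys.env.getOrElse("AWS_SESSION_TOKEN", ""))
  .getOrCreate()

val df = spark.read.orc("s3a://my-bucket/path/to/orc/")  // no secrets in the URL
df.printSchema()
```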




Re: [Spark SQL] Nanoseconds in Timestamps are set as Microseconds

2017-06-02 Thread Anton Okolnychyi
Then let me provide a PR so that we can discuss an alternative way.

2017-06-02 8:26 GMT+02:00 Reynold Xin :

> Seems like a bug we should fix? I agree some form of truncation makes more
> sense.
>
>
> On Thu, Jun 1, 2017 at 1:17 AM, Anton Okolnychyi <
> anton.okolnyc...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to ask what the community thinks regarding the way how Spark
>> handles nanoseconds in the Timestamp type.
>>
>> As far as I see in the code, Spark assumes microseconds precision.
>> Therefore, I expect to have a truncated to microseconds timestamp or an
>> exception if I specify a timestamp with nanoseconds. However, the current
>> implementation just silently sets nanoseconds as microseconds in [1], which
>> results in a wrong timestamp. Consider the example below:
>>
>> spark.sql("SELECT cast('2015-01-02 00:00:00.1' as
>> TIMESTAMP)").show(false)
>> +----------------------------------------+
>> |CAST(2015-01-02 00:00:00.1 AS TIMESTAMP)|
>> +----------------------------------------+
>> |2015-01-02 00:00:00.01                  |
>> +----------------------------------------+
>>
>> This issue was already raised in SPARK-17914 but I do not see any
>> decision there.
>>
>> [1] - org.apache.spark.sql.catalyst.util.DateTimeUtils, toJavaTimestamp,
>> line 204
>>
>> Best regards,
>> Anton
>>
>
>
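
(A standalone sketch of the truncation behaviour proposed above — not Spark's actual DateTimeUtils code: keep at most microsecond precision and drop any extra fractional digits rather than reinterpreting them. The example timestamp is illustrative only.)

```scala
import java.sql.Timestamp

// Truncate a java.sql.Timestamp to microsecond precision by dropping
// sub-microsecond digits from its nanos field.
def truncateToMicros(ts: Timestamp): Timestamp = {
  val micros    = ts.getNanos / 1000          // drop sub-microsecond digits
  val truncated = new Timestamp(ts.getTime)   // copy the instant
  truncated.setNanos(micros * 1000)           // fractional second, micro-aligned
  truncated
}

// Example: 1 nanosecond past midnight truncates to a whole second rather than
// silently becoming a different (wrong) timestamp.
val t = Timestamp.valueOf("2015-01-02 00:00:00.000000001")
println(truncateToMicros(t))                  // 2015-01-02 00:00:00.0
```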