[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation

2018-08-23 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590441#comment-16590441
 ] 

Antonio Murgia commented on SPARK-24768:


Will this support UDT to the extent the parquet reader/writer does?

> Have a built-in AVRO data source implementation
> ---
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. 
> It is widely used in the Spark and Hadoop ecosystems, especially for 
> Kafka-based data pipelines. Using the external package 
> [https://github.com/databricks/spark-avro], Spark SQL can read and write 
> Avro data. Making spark-avro built-in can provide a better experience for 
> first-time users of Spark SQL and Structured Streaming, and we expect a 
> built-in Avro data source to further improve the adoption of Structured 
> Streaming. The proposal is to inline the code from the spark-avro package 
> ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
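For orientation, a short sketch of what reading and writing Avro through such a data source looks like (assumes a SparkSession {{spark}}; the paths are placeholders, and the short format name "avro" is the one the external spark-avro package already registers):

{code:java}
// Sketch: load and save Avro through the DataFrame reader/writer API.
val df = spark.read.format("avro").load("/path/to/input")
df.write.format("avro").save("/path/to/output")
{code}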






[jira] [Issue Comment Deleted] (SPARK-24772) support reading AVRO logical types - Date

2018-08-23 Thread Antonio Murgia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonio Murgia updated SPARK-24772:
---
Comment: was deleted

(was: Will this support UDT to the extent the parquet reader/writer does?)

> support reading AVRO logical types - Date
> -
>
> Key: SPARK-24772
> URL: https://issues.apache.org/jira/browse/SPARK-24772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Commented] (SPARK-24772) support reading AVRO logical types - Date

2018-08-23 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590439#comment-16590439
 ] 

Antonio Murgia commented on SPARK-24772:


Will this support UDT to the extent the parquet reader/writer does?

> support reading AVRO logical types - Date
> -
>
> Key: SPARK-24772
> URL: https://issues.apache.org/jira/browse/SPARK-24772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-21 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551582#comment-16551582
 ] 

Antonio Murgia commented on SPARK-24862:


We can check if {{y}} is also synthesized as a field and, if it is, we can access 
it through reflection. About the inconsistency, I actually don't know. Maybe you 
are right and the inconsistency may cause issues. If that is the case, we might 
throw an exception earlier (when generating the encoder) instead of throwing it 
when the first action is called. I am just sketching out ideas anyway.
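A minimal sketch of that probe, using the {{Multi}} example from the issue description (the names and the fail-fast behaviour here are illustrative, not an actual patch):

{code:java}
case class Multi(x: String)(y: Int)

object FieldProbe extends App {
  val instance = Multi("a")(1)
  // Probe via Java reflection whether the second-parameter-list value y was
  // captured as a backing field on the generated class at all.
  classOf[Multi].getDeclaredFields.find(_.getName.contains("y")) match {
    case Some(f) =>
      f.setAccessible(true)
      println(s"found field ${f.getName} = ${f.get(instance)}")
    case None =>
      // No synthesized field: an encoder could fail fast here instead of at the first action.
      println("y is not synthesized as a field on Multi")
  }
}
{code}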

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for 
> multiple argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonf

[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-20 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551312#comment-16551312
 ] 

Antonio Murgia commented on SPARK-24862:


Yeah, they are definitely not supported. Therefore I think the encoder 
generator should generate the schema based on the first parameter list and the 
ser/de based on all the parameter lists. I can put together a PR if you'd like.
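For reference, a minimal sketch of the arity-vs-schema mismatch being discussed (same {{Multi}} class as in the description):

{code:java}
import org.apache.spark.sql.Encoders

case class Multi(x: String)(y: Int)

object AritySketch extends App {
  val m = Multi("a")(1)
  // Scala treats only the first parameter list as part of the Product:
  println(m.productArity)                          // prints 1
  // ...while the derived encoder schema reports two fields, x and y:
  Encoders.product[Multi].schema.printTreeString()
}
{code}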

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for 
> multiple argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala

[jira] [Created] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-19 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-24862:
--

 Summary: Spark Encoder is not consistent to scala case class 
semantic for multiple argument lists
 Key: SPARK-24862
 URL: https://issues.apache.org/jira/browse/SPARK-24862
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Antonio Murgia


The Spark Encoder is not consistent with Scala case class semantics for multiple 
argument lists.

For example, if I create a case class with multiple constructor argument lists:
{code:java}
case class Multi(x: String)(y: Int){code}
Scala creates a product with arity 1, while if I apply 
{code:java}
Encoders.product[Multi].schema.printTreeString{code}
I get
{code:java}
root
|-- x: string (nullable = true)
|-- y: integer (nullable = false){code}
That is not consistent and leads to:
{code:java}
Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
it.enel.next.platform.service.events.common.massive.immutable.Multi
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).x, 
true) AS x#0
assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).y 
AS y#1
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
Couldn't find y on class 
it.enel.next.platform.service.events.common.massive.immutable.Multi
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).x, 
true) AS x#0
assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).y 
AS y#1
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1750)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1685)
at org.

[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-07-02 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529479#comment-16529479
 ] 

Antonio Murgia commented on SPARK-24673:


I have created a PR and added the overload to both functions. Could you have a 
look at it? Especially [Xiao 
Li|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=smilegator] and 
[Takuya 
Ueshin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ueshin], 
regarding the user-facing API.

[https://github.com/apache/spark/pull/21693]

 

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Priority: Minor
>
> As of 2.3.1, the Scala API for the built-in function from_utc_timestamp 
> (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
> SQL counterpart. In particular, given a dataset/dataframe with the following 
> schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL API I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic API I simply cannot, because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> takes a String as its second argument.






[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-06-28 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526307#comment-16526307
 ] 

Antonio Murgia commented on SPARK-24673:


Looks doable. Should I go with a method overload, resulting in:
{code:java}
functions.from_utc_timestamp(ts: Column, tz: String)

functions.from_utc_timestamp(ts: Column, tz: Column)
{code}
Or is there some limitation I am not aware of?

Also, do you think
{code:java}
to_utc_timestamp{code}
should receive the same treatment?
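A short usage sketch of what the proposed overload would enable, assuming a DataFrame {{df}} with the {{ts}} and {{tz}} columns from the issue description:

{code:java}
import org.apache.spark.sql.functions.{col, from_utc_timestamp}

// With a (Column, Column) overload, per-row time zones become expressible
// from the Scala API as well:
val withLocalTs = df.select(from_utc_timestamp(col("ts"), col("tz")).as("local_ts"))
{code}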

> scala sql function from_utc_timestamp second argument could be Column instead 
> of String
> ---
>
> Key: SPARK-24673
> URL: https://issues.apache.org/jira/browse/SPARK-24673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Antonio Murgia
>Priority: Minor
>
> As of 2.3.1, the Scala API for the built-in function from_utc_timestamp 
> (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
> SQL counterpart. In particular, given a dataset/dataframe with the following 
> schema:
> {code:java}
> CREATE TABLE MY_TABLE (
>   ts TIMESTAMP,
>   tz STRING
> ){code}
> from the SQL API I can do something like:
> {code:java}
> SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
> while from the programmatic API I simply cannot, because
> {code:java}
> functions.from_utc_timestamp(ts: Column, tz: String){code}
> takes a String as its second argument.






[jira] [Created] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String

2018-06-28 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-24673:
--

 Summary: scala sql function from_utc_timestamp second argument 
could be Column instead of String
 Key: SPARK-24673
 URL: https://issues.apache.org/jira/browse/SPARK-24673
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Antonio Murgia


As of 2.3.1, the Scala API for the built-in function from_utc_timestamp 
(org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its 
SQL counterpart. In particular, given a dataset/dataframe with the following 
schema:
{code:java}
CREATE TABLE MY_TABLE (
  ts TIMESTAMP,
  tz STRING
){code}
from the SQL API I can do something like:
{code:java}
SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
while from the programmatic API I simply cannot, because
{code:java}
functions.from_utc_timestamp(ts: Column, tz: String){code}
takes a String as its second argument.
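Until such an overload exists, one possible workaround sketch is to go through the SQL expression parser, which (per the SQL example above) already accepts a column as the second argument; this assumes a DataFrame {{df}} with the {{ts}} and {{tz}} columns:

{code:java}
import org.apache.spark.sql.functions.expr

// Workaround sketch: functions.expr routes through the SQL parser, where the
// second argument of FROM_UTC_TIMESTAMP can be a column reference.
val converted = df.select(expr("from_utc_timestamp(ts, tz)").as("local_ts"))
{code}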






[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-04 Thread Antonio Murgia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315394#comment-15315394
 ] 

Antonio Murgia commented on SPARK-15740:


As of now the memory requirement would be something like 2k.
I made a pull request for SPARK-15740. I was able to run it successfully, but I 
don't know how to measure the memory requirements of the test. Can you point me 
to some resource where I can learn how to do that?


> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-03 Thread Antonio Murgia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313830#comment-15313830
 ] 

Antonio Murgia commented on SPARK-15740:


In order to set a small partition size w.r.t. spark.kryoserializer.buffer.max, I 
need to read that conf in the first place and change the Word2Vec save/load code 
to be adaptive. I think I can retrieve the conf with:
{code:java}
spark.conf.get("spark.kryoserializer.buffer.max")
{code}
But then how could I convert a string like "64m" into a number of bytes? Is 
there a utility somewhere I can take advantage of?

I propose to make the partitioning adaptive to that value and to set 
spark.kryoserializer.buffer.max to a small size only for the test, so that a 
small array is enough to avoid OOMs.
Let me know what you think about it.
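One possible sketch of that, assuming a SparkContext {{sc}} is in scope (e.g. inside the test suite): SparkConf already knows how to parse size strings such as "64m", so the conf can be read directly as bytes.

{code:java}
// Sketch: read the Kryo buffer ceiling in bytes instead of parsing "64m" by hand.
val maxBufferBytes: Long =
  sc.getConf.getSizeAsBytes("spark.kryoserializer.buffer.max", "64m")

// The target partition size could then be derived from it rather than hard-coded.
val targetPartitionBytes: Long = maxBufferBytes / 2
{code}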

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-02 Thread Antonio Murgia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313754#comment-15313754
 ] 

Antonio Murgia commented on SPARK-15740:


Looking into it right now.

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Commented] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max

2015-11-25 Thread Antonio Murgia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028016#comment-15028016
 ] 

Antonio Murgia commented on SPARK-11994:


Since `spark.kryoserializer.buffer.max` defaults to 64MB, I decided to partition 
the model so that each partition stays at half that size (32MB).
One Word2Vec entry consists of an array of floats of size vectorSize and a 
string; since the size of the string is variable and considerably smaller than 
the size of the array, I'm not going to consider it in my size computation. And 
obviously we have numWords entries. Given that the size of a Float is 4 bytes, 
the number of partitions the model gets split into is given by the formula:
(4 * numWords * vectorSize / 33554432) + 1
I added a test to verify that the save/load methods work when the file is split 
into 2 parts.
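A minimal sketch of that computation (the names here are illustrative, not the identifiers used in the patch):

{code:java}
// Partition count so that each partition holds at most ~32MB of float data.
// 33554432 = 32 * 1024 * 1024 bytes; each Float takes 4 bytes.
def numPartitions(numWords: Long, vectorSize: Int): Int =
  ((4L * numWords * vectorSize) / 33554432L).toInt + 1

// Example: 1,000,000 words with 100-dimensional vectors
//   4 * 1,000,000 * 100 = 400,000,000 bytes -> 400,000,000 / 33,554,432 = 11 -> 12 partitions
{code}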


> Word2VecModel load and save cause SparkException when model is bigger than 
> spark.kryoserializer.buffer.max
> --
>
> Key: SPARK-11994
> URL: https://issues.apache.org/jira/browse/SPARK-11994
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Antonio Murgia
>  Labels: kryo, mllib
>
> When loading a Word2VecModel of compressed size 58MB using the 
> Word2VecModel.load() method introduced in Spark 1.4.0, I get an 
> `org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 2` exception.
> This happens because the model is saved as a single file with no partitioning, 
> and the Kryo buffer overflows when it tries to serialize it all.
> Increasing `spark.kryoserializer.buffer.max` works as a temporary solution, 
> but it needs to be increased again whenever the model size grows.






[jira] [Commented] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max

2015-11-25 Thread Antonio Murgia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027648#comment-15027648
 ] 

Antonio Murgia commented on SPARK-11994:


Sure.

> Word2VecModel load and save cause SparkException when model is bigger than 
> spark.kryoserializer.buffer.max
> --
>
> Key: SPARK-11994
> URL: https://issues.apache.org/jira/browse/SPARK-11994
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Antonio Murgia
>  Labels: kryo, mllib
>
> When loading a Word2VecModel of compressed size 58MB using the 
> Word2VecModel.load() method introduced in Spark 1.4.0, I get an 
> `org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 2` exception.
> This happens because the model is saved as a single file with no partitioning, 
> and the Kryo buffer overflows when it tries to serialize it all.
> Increasing `spark.kryoserializer.buffer.max` works as a temporary solution, 
> but it needs to be increased again whenever the model size grows.






[jira] [Created] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max

2015-11-25 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-11994:
--

 Summary: Word2VecModel load and save cause SparkException when 
model is bigger than spark.kryoserializer.buffer.max
 Key: SPARK-11994
 URL: https://issues.apache.org/jira/browse/SPARK-11994
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.5.1, 1.4.1
Reporter: Antonio Murgia


When loading a Word2VecModel of compressed size 58MB using the 
Word2VecModel.load() method introduced in Spark 1.4.0, I get an 
`org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
Available: 0, required: 2` exception.
This happens because the model is saved as a single file with no partitioning, 
and the Kryo buffer overflows when it tries to serialize it all.
Increasing `spark.kryoserializer.buffer.max` works as a temporary solution, but 
it needs to be increased again whenever the model size grows.
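For reference, a hedged sketch of the temporary workaround mentioned above (the 512m value is only an example):

{code:java}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the Kryo buffer ceiling before creating the context. This only delays
// the problem, since a bigger model eventually needs an even larger value.
val conf = new SparkConf()
  .setMaster("local[*]")   // for a local run; normally set via spark-submit
  .setAppName("word2vec-load")
  .set("spark.kryoserializer.buffer.max", "512m")
val sc = new SparkContext(conf)
{code}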






[jira] [Created] (SPARK-11993) https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855

2015-11-25 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-11993:
--

 Summary: 
https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
 Key: SPARK-11993
 URL: https://issues.apache.org/jira/browse/SPARK-11993
 Project: Spark
  Issue Type: Bug
Reporter: Antonio Murgia









[jira] [Closed] (SPARK-11993) https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855

2015-11-25 Thread Antonio Murgia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonio Murgia closed SPARK-11993.
--
Resolution: Invalid

> https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
> -
>
> Key: SPARK-11993
> URL: https://issues.apache.org/jira/browse/SPARK-11993
> Project: Spark
>  Issue Type: Bug
>Reporter: Antonio Murgia
>







[jira] [Created] (SPARK-11350) There is no best practice to handle warnings or messages produced by Executors in a distributed manner

2015-10-27 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-11350:
--

 Summary: There is no best practice to handle warnings or messages 
produced by Executors in a distributed manner
 Key: SPARK-11350
 URL: https://issues.apache.org/jira/browse/SPARK-11350
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Reporter: Antonio Murgia


I looked around on the web and I couldn't find any way to deal, in a distributed 
way, with malformed/faulty records during computation. All I was able to find 
was the flatMap/Some/None technique plus logging.
I'm facing this problem because I have a processing algorithm that extracts more 
than one value from each record but can fail in extracting one of those multiple 
values, and I want to keep track of those failures. Logging is not feasible 
because this "warning" happens so frequently that the logs would become 
overwhelming and impossible to read.
Since I have 3 different possible outcomes from my processing, I modeled it with 
this class hierarchy:

http://i.imgur.com/NIesYUm.png?1

It holds a result and/or warnings. Since Result implements Traversable, it can 
be used in a flatMap, discarding all warnings and failure results; on the other 
hand, if we want to keep track of the warnings, we can process them and output 
them if we need to.
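A rough Scala sketch of the kind of hierarchy described above (the real class names are only visible in the linked image, so everything here is illustrative):

{code:java}
// Illustrative names; the actual hierarchy is the one in the linked diagram.
sealed trait Outcome[+A] extends Traversable[A] {
  def warnings: Seq[String]
}
case class Ok[A](value: A, warnings: Seq[String] = Nil) extends Outcome[A] {
  def foreach[U](f: A => U): Unit = f(value)
}
case class Failed(warnings: Seq[String]) extends Outcome[Nothing] {
  def foreach[U](f: Nothing => U): Unit = ()
}

// Because Outcome is Traversable, flatMap keeps only the successful values:
//   val values   = rdd.flatMap(record => process(record))            // Ok payloads only
//   val warnings = rdd.flatMap(record => process(record).warnings)   // collected separately
{code}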






[jira] [Updated] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation

2015-08-18 Thread Antonio Murgia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonio Murgia updated SPARK-10105:
---
Description: 
When training Word2Vec on a really big dataset, it's really hard to estimate the 
right minCount parameter; it would really help to have a parameter to choose how 
many words you want in the vocabulary.
Furthermore, the original Word2Vec paper states that the authors took into 
account the most frequent 1M words.
 

  was:
When training Word2Vec on a really big dataset, it's really hard to evaluate 
the right minCount parameter, it would really help having a parameter to choose 
how many words you want to be in the vocabulary.
Furthermore, the original Word2Vec paper, state that they took into account the 
first 30k words.
 


> Adding most k frequent words parameter to Word2Vec implementation
> -
>
> Key: SPARK-10105
> URL: https://issues.apache.org/jira/browse/SPARK-10105
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Antonio Murgia
>Priority: Minor
>  Labels: mllib, top-k, word2vec
>
> When training Word2Vec on a really big dataset, it's really hard to estimate 
> the right minCount parameter; it would really help to have a parameter to 
> choose how many words you want in the vocabulary.
> Furthermore, the original Word2Vec paper states that the authors took into 
> account the most frequent 1M words.
>  






[jira] [Created] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation

2015-08-18 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-10105:
--

 Summary: Adding most k frequent words parameter to Word2Vec 
implementation
 Key: SPARK-10105
 URL: https://issues.apache.org/jira/browse/SPARK-10105
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Antonio Murgia
Priority: Minor


When training Word2Vec on a really big dataset, it's really hard to estimate the 
right minCount parameter; it would really help to have a parameter to choose how 
many words you want in the vocabulary.
Furthermore, the original Word2Vec paper states that the authors took into 
account the first 30k words.
 






[jira] [Created] (SPARK-10046) Hive warehouse dir not set in current directory when not providing hive-site.xml

2015-08-16 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-10046:
--

 Summary: Hive warehouse dir not set in current directory when not 
providing hive-site.xml
 Key: SPARK-10046
 URL: https://issues.apache.org/jira/browse/SPARK-10046
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.3.1
 Environment: OS X 10.10.4
Java 1.7.0_79-b15
Scala 2.10.5
Spark 1.3.1
Reporter: Antonio Murgia


When running Spark in a local environment (for unit-testing purposes) and without 
providing any `hive-site.xml`, databases other than the default one are created 
in Hive's default hive.metastore.warehouse.dir and not in the current directory 
(as stated in the [Spark 
docs](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)).
This code snippet, tested with Spark 1.3.1, demonstrates the issue: 
https://github.com/tmnd1991/spark-hive-bug/blob/master/src/main/scala/Main.scala
You can see that the exception is thrown when executing the CREATE DATABASE 
statement, stating `Unable to create database path 
file:/user/hive/warehouse/abc.db, failed to create database abc)`, where 
`/user/hive/warehouse/abc.db` is not the current directory as stated in the docs.
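A possible workaround sketch for local unit tests, pointing the warehouse at an explicit local directory rather than relying on the documented default (whether setConf takes effect here depends on when the metastore initializes, so treat this as a sketch, not a verified fix):

{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("hive-warehouse-test"))
val hive = new HiveContext(sc)

// Point the Hive warehouse at a directory under the current working directory.
hive.setConf("hive.metastore.warehouse.dir",
  new java.io.File("spark-warehouse").getAbsolutePath)
hive.sql("CREATE DATABASE abc")
{code}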


