[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590441#comment-16590441 ]

Antonio Murgia commented on SPARK-24768:
----------------------------------------

Will this support UDT to the extent the parquet reader/writer does?

> Have a built-in AVRO data source implementation
> -----------------------------------------------
>
>                 Key: SPARK-24768
>                 URL: https://issues.apache.org/jira/browse/SPARK-24768
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Gengliang Wang
>            Priority: Major
>         Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format.
> It is widely used in the Spark and Hadoop ecosystem, especially for
> Kafka-based data pipelines. Using the external package
> [https://github.com/databricks/spark-avro], Spark SQL can read and write
> Avro data. Making spark-avro built-in can provide a better experience for
> first-time users of Spark SQL and Structured Streaming, and we expect the
> built-in Avro data source to further improve the adoption of Structured
> Streaming. The proposal is to inline the code from the spark-avro package
> ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
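A minimal sketch of how the built-in data source is used once inlined (a hedged illustration, not from the ticket: it assumes Spark 2.4 with the spark-avro module on the classpath, illustrative paths, and {{spark}} being the usual SparkSession):

{code:java}
// Read and write Avro through the short "avro" format name (Spark 2.4+).
val df = spark.read.format("avro").load("/tmp/events.avro")
df.printSchema()
df.write.format("avro").save("/tmp/events-copy.avro")
{code}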
[jira] [Issue Comment Deleted] (SPARK-24772) support reading AVRO logical types - Date
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonio Murgia updated SPARK-24772:
-----------------------------------
    Comment: was deleted

(was: Will this support UDT to the extent the parquet reader/writer does?)

> support reading AVRO logical types - Date
> ------------------------------------------
>
>                 Key: SPARK-24772
>                 URL: https://issues.apache.org/jira/browse/SPARK-24772
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 2.4.0
[jira] [Commented] (SPARK-24772) support reading AVRO logical types - Date
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590439#comment-16590439 ]

Antonio Murgia commented on SPARK-24772:
----------------------------------------

Will this support UDT to the extent the parquet reader/writer does?

> support reading AVRO logical types - Date
> ------------------------------------------
>
>                 Key: SPARK-24772
>                 URL: https://issues.apache.org/jira/browse/SPARK-24772
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 2.4.0
[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551582#comment-16551582 ]

Antonio Murgia commented on SPARK-24862:
----------------------------------------

We can check if {{y}} is also synthesized as a field, and if it is we can access it through reflection. About the inconsistency, I actually don't know; maybe you are right and it may cause issues. If that is the case, we might throw an exception earlier (when generating the encoder) instead of throwing it when the first action is called. I am just sketching out ideas anyway.

> Spark Encoder is not consistent to scala case class semantic for multiple
> argument lists
> --------------------------------------------------------------------------
>
>                 Key: SPARK-24862
>                 URL: https://issues.apache.org/jira/browse/SPARK-24862
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Antonio Murgia
>            Priority: Major
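To make the first suggestion concrete, a small sketch of the reflection check (an illustration, not the ticket's code; note that for this exact definition no backing field for {{y}} is synthesized, since {{y}} is unused in the class body):

{code:java}
case class Multi(x: String)(y: Int)

object FieldCheck extends App {
  // Java reflection over the compiled class: was a field synthesized for y?
  val hasY = classOf[Multi].getDeclaredFields.exists(_.getName == "y")
  println(hasY) // false here; it becomes true if y is referenced in a val or method
}
{code}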
[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551312#comment-16551312 ]

Antonio Murgia commented on SPARK-24862:
----------------------------------------

Yeah, they are definitely not supported. Therefore I think the encoder generator should generate the schema based on the first parameter list and the ser/de based on all the parameter lists. I can put together a PR if you'd like.

> Spark Encoder is not consistent to scala case class semantic for multiple
> argument lists
> --------------------------------------------------------------------------
>
>                 Key: SPARK-24862
>                 URL: https://issues.apache.org/jira/browse/SPARK-24862
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Antonio Murgia
>            Priority: Major
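For reference, the arity mismatch the thread is discussing can be seen in a plain Scala REPL (a sketch; the schema side is the {{Encoders.product}} output quoted in the ticket):

{code:java}
case class Multi(x: String)(y: Int)

val m = Multi("a")(1)
// The Product view covers only the first parameter list...
println(m.productArity)      // 1
println(m.productElement(0)) // "a"
// ...while Encoders.product[Multi].schema contains both x and y,
// which is the inconsistency reported here.
{code}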
[jira] [Created] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
Antonio Murgia created SPARK-24862:
--------------------------------------

             Summary: Spark Encoder is not consistent to scala case class semantic for multiple argument lists
                 Key: SPARK-24862
                 URL: https://issues.apache.org/jira/browse/SPARK-24862
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Antonio Murgia


The Spark Encoder is not consistent with Scala's case class semantics for multiple argument lists.

For example, if I create a case class with multiple constructor argument lists:
{code:java}
case class Multi(x: String)(y: Int){code}
Scala creates a product with arity 1, while if I apply
{code:java}
Encoders.product[Multi].schema.printTreeString{code}
I get
{code:java}
root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = false){code}
That is not consistent and leads to:
{code:java}
Error while encoding: java.lang.RuntimeException: Couldn't find y on class it.enel.next.platform.service.events.common.massive.immutable.Multi
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).x, true) AS x#0
assertnotnull(assertnotnull(input[0, it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).y AS y#1
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Couldn't find y on class it.enel.next.platform.service.events.common.massive.immutable.Multi
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
  at it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
  at it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
  at it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
  at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
  at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
  at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
  at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
  at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
  at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
  at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
  at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
  at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
  at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
  at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
  at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1750)
  at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1685)
{code}
[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
[ https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529479#comment-16529479 ]

Antonio Murgia commented on SPARK-24673:
----------------------------------------

I have created a PR adding the overload to both functions. Can you have a look at it? Especially [Xiao Li|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=smilegator] and [Takuya Ueshin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ueshin], for the user-facing API question.

[https://github.com/apache/spark/pull/21693]

> scala sql function from_utc_timestamp second argument could be Column instead
> of String
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24673
>                 URL: https://issues.apache.org/jira/browse/SPARK-24673
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Antonio Murgia
>            Priority: Minor
[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
[ https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526307#comment-16526307 ]

Antonio Murgia commented on SPARK-24673:
----------------------------------------

Looks doable. Should I go with a method overload, resulting in:
{code:java}
functions.from_utc_timestamp(ts: Column, tz: String)
functions.from_utc_timestamp(ts: Column, tz: Column)
{code}
Or is there some limitation I am not aware of? Also, do you think {{to_utc_timestamp}} should receive the same treatment?

> scala sql function from_utc_timestamp second argument could be Column instead
> of String
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24673
>                 URL: https://issues.apache.org/jira/browse/SPARK-24673
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Antonio Murgia
>            Priority: Minor
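A sketch of how the proposed overload would be used once available (hypothetical until the PR is merged; assumes a DataFrame {{df}} with {{ts}} and {{tz}} columns):

{code:java}
import org.apache.spark.sql.functions.{col, from_utc_timestamp}

// With a Column-typed timezone the value can vary per row,
// matching SQL's FROM_UTC_TIMESTAMP(TS, TZ).
val converted = df.select(from_utc_timestamp(col("ts"), col("tz")))
{code}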
[jira] [Created] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
Antonio Murgia created SPARK-24673:
--------------------------------------

             Summary: scala sql function from_utc_timestamp second argument could be Column instead of String
                 Key: SPARK-24673
                 URL: https://issues.apache.org/jira/browse/SPARK-24673
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Antonio Murgia


As of 2.3.1, the Scala API for the built-in function from_utc_timestamp (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its SQL counterpart. In particular, given a dataset/dataframe with the following schema:
{code:java}
CREATE TABLE MY_TABLE (
  ts TIMESTAMP,
  tz STRING
){code}
from the SQL API I can do something like:
{code:java}
SELECT FROM_UTC_TIMESTAMP(TS, TZ){code}
while from the programmatic API I simply cannot, because the second argument of
{code:java}
functions.from_utc_timestamp(ts: Column, tz: String){code}
is a String.
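Until an overload like that exists, one workaround is to fall back to the SQL expression parser (a sketch, assuming a DataFrame {{df}} with {{ts}} and {{tz}} columns):

{code:java}
import org.apache.spark.sql.functions.expr

// expr parses the SQL form, which already accepts a column as the timezone.
val converted = df.select(expr("from_utc_timestamp(ts, tz)"))
{code}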
[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315394#comment-15315394 ]

Antonio Murgia commented on SPARK-15740:
----------------------------------------

As of now the memory requirement would be something like 2k. I made a pull request named SPARK-15740. I was able to run it successfully, but I don't know how to measure the memory requirements of the test. Can you point me to a resource where I can learn how?

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> -------------------------------------------------------------------------
>
>                 Key: SPARK-15740
>                 URL: https://issues.apache.org/jira/browse/SPARK-15740
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save"
> in Word2VecSuite, e.g.,
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
> It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]
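One rough, profiler-free way to gauge a test's heap footprint from inside the JVM (a crude sketch using only standard JDK calls; this is an editorial assumption, not anything suggested in the thread, and no substitute for a real profiler):

{code:java}
// Approximate heap delta around the code under test; GC first to reduce noise.
def approxHeapUsed(): Long = {
  System.gc()
  Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
}

val before = approxHeapUsed()
// ... run the big model save/load here ...
val after = approxHeapUsed()
println(s"approx heap delta: ${(after - before) / (1024 * 1024)} MB")
{code}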
[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313830#comment-15313830 ]

Antonio Murgia commented on SPARK-15740:
----------------------------------------

In order to set a small partition size with respect to spark.kryoserializer.buffer.max, I need to read that conf in the first place and make the Word2Vec serialization code adaptive. I think I can retrieve the conf with {{spark.conf.get("spark.kryoserializer.buffer.max")}}. But then how can I convert a string in a format like "64m" to a number of bytes? Is there a utility somewhere I can take advantage of?

I propose to make the partitioning adaptive to that value and to set spark.kryoserializer.buffer.max to a small size only for the test, so that a small array suffices and we avoid OOMs. Let me know what you think.

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> -------------------------------------------------------------------------
>
>                 Key: SPARK-15740
>                 URL: https://issues.apache.org/jira/browse/SPARK-15740
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
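On the size-string question: SparkConf has a public helper for exactly this (available since Spark 1.4, to the best of my knowledge). A minimal sketch:

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Parses size strings such as "64m" or "1g" into bytes;
// the second argument is the default used when the key is unset.
val maxBufBytes = conf.getSizeAsBytes("spark.kryoserializer.buffer.max", "64m")
println(maxBufBytes) // 67108864 when the key is unset
{code}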
[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313754#comment-15313754 ]

Antonio Murgia commented on SPARK-15740:
----------------------------------------

Looking into it right now.

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> -------------------------------------------------------------------------
>
>                 Key: SPARK-15740
>                 URL: https://issues.apache.org/jira/browse/SPARK-15740
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
[jira] [Commented] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max
[ https://issues.apache.org/jira/browse/SPARK-11994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028016#comment-15028016 ]

Antonio Murgia commented on SPARK-11994:
----------------------------------------

Since spark.kryoserializer.buffer.max defaults to 64MB, I decided to split the model into partitions of at most half that size (32MB). One word2vec entry consists of an array of floats of length vectorSize plus a string; since the string's size is variable and considerably smaller than the array's, I do not consider it in the size computation. There are numWords entries, and a Float is 4 bytes, so the number of partitions the model gets split into is given by the formula:

(4 * numWords * vectorSize / 33554432) + 1

I added a test to verify that the save/load methods work when the file is split into 2 parts.

> Word2VecModel load and save cause SparkException when model is bigger than
> spark.kryoserializer.buffer.max
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-11994
>                 URL: https://issues.apache.org/jira/browse/SPARK-11994
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.1, 1.5.1
>            Reporter: Antonio Murgia
>              Labels: kryo, mllib
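That formula as a sketch in Scala (variable names mirror the comment; 33554432 is 32MB in bytes):

{code:java}
// Size each partition below half the default 64MB kryo buffer.
val floatSize   = 4L
val bufferLimit = 32L * 1024 * 1024 // 33554432 bytes

def numPartitions(numWords: Long, vectorSize: Long): Int =
  (floatSize * numWords * vectorSize / bufferLimit + 1).toInt

println(numPartitions(1000000L, 100L)) // 12 partitions for a 1M x 100 model
{code}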
[jira] [Commented] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max
[ https://issues.apache.org/jira/browse/SPARK-11994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027648#comment-15027648 ]

Antonio Murgia commented on SPARK-11994:
----------------------------------------

Sure.

> Word2VecModel load and save cause SparkException when model is bigger than
> spark.kryoserializer.buffer.max
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-11994
>                 URL: https://issues.apache.org/jira/browse/SPARK-11994
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.1, 1.5.1
>            Reporter: Antonio Murgia
>              Labels: kryo, mllib
[jira] [Created] (SPARK-11994) Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max
Antonio Murgia created SPARK-11994:
--------------------------------------

             Summary: Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max
                 Key: SPARK-11994
                 URL: https://issues.apache.org/jira/browse/SPARK-11994
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.5.1, 1.4.1
            Reporter: Antonio Murgia


When loading a Word2VecModel of compressed size 58MB using the Word2VecModel.load() method introduced in Spark 1.4.0, I get an `org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 2` exception.

This happens because the model is saved as a single file with no partitioning, and the kryo buffer overflows when it tries to serialize it all at once.

Increasing `spark.kryoserializer.buffer.max` works as a temporary solution, but it needs to be increased again whenever the model size grows.
[jira] [Created] (SPARK-11993) https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
Antonio Murgia created SPARK-11993:
--------------------------------------

             Summary: https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
                 Key: SPARK-11993
                 URL: https://issues.apache.org/jira/browse/SPARK-11993
             Project: Spark
          Issue Type: Bug
            Reporter: Antonio Murgia
[jira] [Closed] (SPARK-11993) https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
[ https://issues.apache.org/jira/browse/SPARK-11993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonio Murgia closed SPARK-11993.
----------------------------------
    Resolution: Invalid

> https://github.com/streamatica/TrafficAnalytics/issues/393#issuecomment-159685855
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-11993
>                 URL: https://issues.apache.org/jira/browse/SPARK-11993
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Antonio Murgia
[jira] [Created] (SPARK-11350) There is no best practice to handle warnings or messages produced by Executors in a distributed manner
Antonio Murgia created SPARK-11350:
--------------------------------------

             Summary: There is no best practice to handle warnings or messages produced by Executors in a distributed manner
                 Key: SPARK-11350
                 URL: https://issues.apache.org/jira/browse/SPARK-11350
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
            Reporter: Antonio Murgia


I looked around on the web and I couldn't find any way to deal, in a distributed fashion, with malformed/faulty records during computation. All I was able to find was the flatMap/Some/None technique plus logging.

I'm facing this problem because I have a processing algorithm that extracts more than one value from each record but can fail to extract one of those values, and I want to keep track of those failures. Logging is not feasible because this "warning" happens so frequently that the logs would become overwhelming and impossible to read.

Since I have 3 different possible outcomes from my processing, I modeled it with this class hierarchy: http://i.imgur.com/NIesYUm.png?1

It holds a result and/or warnings. Since Result implements Traversable, it can be used in a flatMap, discarding all warnings and failed results; on the other hand, if we want to keep track of warnings, we can process them and output them if needed.
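A minimal sketch of what such a hierarchy could look like (hypothetical names; the linked image is the actual design, and this assumes Scala 2.x, where Traversable exists):

{code:java}
// A Result carries warnings alongside an optional value. Implementing
// Traversable lets flatMap keep successful values and drop failures silently.
sealed trait Result[+A] extends Traversable[A] {
  def warnings: Seq[String]
}

case class Success[A](value: A, warnings: Seq[String] = Nil) extends Result[A] {
  def foreach[U](f: A => U): Unit = f(value)
}

case class Failure(warnings: Seq[String]) extends Result[Nothing] {
  def foreach[U](f: Nothing => U): Unit = ()
}

// rdd.flatMap(process) keeps only successes, while
// rdd.map(process).flatMap(_.warnings) collects the warnings separately.
{code}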
[jira] [Updated] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation
[ https://issues.apache.org/jira/browse/SPARK-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonio Murgia updated SPARK-10105:
-----------------------------------
    Description:
When training Word2Vec on a really big dataset, it is really hard to choose the right minCount parameter; it would really help to have a parameter for choosing how many words should end up in the vocabulary. Furthermore, the original Word2Vec paper states that they took into account the most frequent 1M words.

  was:
When training Word2Vec on a really big dataset, it is really hard to choose the right minCount parameter; it would really help to have a parameter for choosing how many words should end up in the vocabulary. Furthermore, the original Word2Vec paper states that they took into account the first 30k words.

> Adding most k frequent words parameter to Word2Vec implementation
> ------------------------------------------------------------------
>
>                 Key: SPARK-10105
>                 URL: https://issues.apache.org/jira/browse/SPARK-10105
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Antonio Murgia
>            Priority: Minor
>              Labels: mllib, top-k, word2vec
>
> When training Word2Vec on a really big dataset, it is really hard to choose
> the right minCount parameter; it would really help to have a parameter for
> choosing how many words should end up in the vocabulary. Furthermore, the
> original Word2Vec paper states that they took into account the most frequent
> 1M words.
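What the requested parameter amounts to, as a sketch (hypothetical {{maxVocabSize}} name; assumes {{words: RDD[String]}} holding the corpus tokens):

{code:java}
// Keep only the k most frequent words instead of every word above minCount.
val maxVocabSize = 1000000 // e.g. the 1M most frequent words from the paper

val vocab = words
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
  .top(maxVocabSize)(Ordering.by[(String, Long), Long](_._2)) // collected to driver
  .toMap
{code}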
[jira] [Created] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation
Antonio Murgia created SPARK-10105:
--------------------------------------

             Summary: Adding most k frequent words parameter to Word2Vec implementation
                 Key: SPARK-10105
                 URL: https://issues.apache.org/jira/browse/SPARK-10105
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Antonio Murgia
            Priority: Minor


When training Word2Vec on a really big dataset, it is really hard to choose the right minCount parameter; it would really help to have a parameter for choosing how many words should end up in the vocabulary. Furthermore, the original Word2Vec paper states that they took into account the first 30k words.
[jira] [Created] (SPARK-10046) Hive warehouse dir not set in current directory when not providing hive-site.xml
Antonio Murgia created SPARK-10046:
--------------------------------------

             Summary: Hive warehouse dir not set in current directory when not providing hive-site.xml
                 Key: SPARK-10046
                 URL: https://issues.apache.org/jira/browse/SPARK-10046
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 1.3.1
         Environment: OS X 10.10.4, Java 1.7.0_79-b15, Scala 2.10.5, Spark 1.3.1
            Reporter: Antonio Murgia


When running Spark in a local environment (for unit-testing purposes) and without providing any hive-site.xml, databases apart from the default one are created under Hive's default hive.metastore.warehouse.dir and not in the current directory (as stated in the [Spark docs](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)).

This code snippet, tested with Spark 1.3.1, demonstrates the issue: https://github.com/tmnd1991/spark-hive-bug/blob/master/src/main/scala/Main.scala

You can see that the exception is thrown when executing the CREATE DATABASE statement: `Unable to create database path file:/user/hive/warehouse/abc.db, failed to create database abc`, where `/user/hive/warehouse/abc.db` is not the current directory as stated in the docs.
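A sketch of a test-side workaround (an editorial assumption, not from the report: pointing the warehouse at an explicit local directory via HiveContext; depending on the version, the property may need to be set in hive-site.xml before the metastore initializes):

{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("hive-warehouse-test"))
val hive = new HiveContext(sc)

// Redirect the metastore warehouse to a directory under the CWD so that
// CREATE DATABASE does not try to write to file:/user/hive/warehouse.
hive.setConf("hive.metastore.warehouse.dir",
  new java.io.File("spark-warehouse").getAbsolutePath)

hive.sql("CREATE DATABASE abc") // should now succeed locally
{code}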