[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971893#comment-15971893
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] It makes sense. I hadn't realized that in the direct
streams the driver is in charge of handing the executors the metadata they need
to pull data. So yes, you're right: it's "incompatible" with the Kafka way of
being "master-free", where each consumer really doesn't know and shouldn't care
how many other consumers there are. I think this ticket can now be closed (just
re-open it if you disagree). Maybe it would be worth opening a KIP on Kafka for
some APIs that would let Spark be a bit more "optimized", but it all seems okay
for now. Cheers!
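
For reference, a minimal sketch of the driver-side setup being described (assuming the spark-streaming-kafka-0-10 integration and a spark-shell {{sc}}; broker, topic and group names are placeholders). The driver computes offset ranges per batch, and each executor task polls only the topic-partitions it was assigned:

{code:scala}
// Sketch only; bootstrap.servers, topic and group.id are placeholder values.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// The driver decides which offsets each executor reads; executors fetch their
// assigned partitions directly rather than joining a consumer group.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("topicA"), kafkaParams))
stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
ssc.start()
{code}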

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Closed] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek closed SPARK-20287.
---
Resolution: Not A Problem

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-12 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966973#comment-15966973
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] 
How about using the subscribe pattern?
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

```
public void subscribe(Collection<String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. 
Topic subscriptions are not incremental. This list will replace the current 
assignment (if there is one). It is not possible to combine topic subscription 
with group management with manual partition assignment through 
assign(Collection). If the given list of topics is empty, it is treated the 
same as unsubscribe().
```

Then you let Kafka handle the partition assignments? Since all the consumers
share the same group.id, the data would be effectively distributed across all
the Spark instances?

But then I guess you may already have explored that option and it goes against
the Spark DirectStream API? (I'm not a Spark expert, just trying to understand
the limitations. I believe you when you say you did it the most straightforward
way.)
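
For context, a minimal sketch of the subscribe() pattern quoted above (plain Kafka 0.10 client, not Spark code; broker, topics and group.id are placeholders): every consumer sharing the same group.id gets a subset of the partitions assigned and rebalanced automatically by Kafka.

{code:scala}
// Sketch only: every process running this with the same group.id is assigned
// a subset of the partitions, and Kafka rebalances them automatically.
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "shared-group")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("topicA", "topicB"))  // group management
while (true) {
  val records = consumer.poll(100)  // partitions may be reassigned between polls
  // process records here
}
{code}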

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-11 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963939#comment-15963939
 ] 

Stephane Maarek commented on SPARK-20287:
-

The other issue I can see is the coordinator work needed to rebalance that many
Kafka consumers should one go down. That's more expensive with 100 consumers
than with a few. But as you said, it should be driven by an actual performance
limitation; right now that would be speculation.

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-11 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963938#comment-15963938
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~srowen] Those are good points. In the case of 100 separate machines running
100 tasks, I agree you end up with 100 Kafka consumers no matter what. As you
said, my optimisation would only apply when tasks on the same machine could
share a Kafka consumer.
My concern is, as you said, the number of connections opened to Kafka, which
might be high even when it isn't needed. I understand that one Kafka consumer
feeding multiple tasks may bind them together on the receive side, and I'm not
a Spark expert so I can't measure the performance implications of that.

My other concern is the spark.streaming.kafka.consumer.cache.maxCapacity
parameter. Is it truly needed?
Say one executor consumes from 100 partitions; do we really need a maxCapacity
parameter? The executor should just spin up as many consumers as needed.
Similarly, in a distributed context, can't the individual executors figure out
how many Kafka consumers they need?
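
For reference, a minimal sketch of where that cache bound is raised today (the property name is the one from this ticket; the value and app name are illustrative only):

{code:scala}
// Illustrative values only; the property bounds the per-executor cache of
// Kafka consumers used by the 0-10 direct stream.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-direct-example")
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
{code}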

Thanks for the discussion, I appreciate it

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Created] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-10 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-20287:
---

 Summary: Kafka Consumer should be able to subscribe to more than 
one topic partition
 Key: SPARK-20287
 URL: https://issues.apache.org/jira/browse/SPARK-20287
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Stephane Maarek


As I understand and as it stands, one Kafka Consumer is created for each topic 
partition in the source Kafka topics, and they're cached.

cf 
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48

In my opinion, that makes the design an anti-pattern for Kafka and highly
inefficient:
- Each Kafka consumer creates a connection to Kafka
- Spark doesn't leverage the power of the Kafka consumers, which is that it 
automatically assigns and balances partitions amongst all the consumers that 
share the same group.id
- You can still cache your Kafka consumer even if it has multiple partitions.

I'm not sure about how that translates to the spark underlying RDD 
architecture, but from a Kafka standpoint, I believe creating one consumer per 
partition is a big overhead, and a risk as the user may have to increase the 
spark.streaming.kafka.consumer.cache.maxCapacity parameter. 

Happy to discuss to understand the rationale
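
To illustrate the point above, a minimal sketch (plain Kafka 0.10 client, not the Spark code path; broker and topic names are placeholders) showing that a single consumer can be assigned several topic partitions and fetch from all of them through one client:

{code:scala}
// Sketch only: one consumer instance assigned two partitions of "topicA".
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.assign(Arrays.asList(new TopicPartition("topicA", 0),
                              new TopicPartition("topicA", 1)))
val records = consumer.poll(1000)  // returns records from both partitions
{code}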






[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604903#comment-15604903
 ] 

Stephane Maarek commented on SPARK-18068:
-

Thanks for the links guys! Really helpful




> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----------------+
> |            value|
> +-----------------+
> |             null|
> +-----------------+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 

[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604443#comment-15604443
 ] 

Stephane Maarek commented on SPARK-18068:
-

Would be awesome to expose it in the online docs!




> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----------------+
> |            value|
> +-----------------+
> |             null|
> +-----------------+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 

[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604400#comment-15604400
 ] 

Stephane Maarek commented on SPARK-18068:
-

[~hyukjin.kwon]
Thanks! I hadn't seen this option; where is it documented?
It's a nice workaround for the time being. It would be good if the default
implementation supported any ISO 8601-compliant timestamp :)
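
A hedged sketch for later readers, assuming the option referred to above is the JSON source's timestampFormat read option (the pattern string below is illustrative, and jsonRDD is the one defined in the issue description):

{code:scala}
// Assumption: the option discussed is the JSON reader's "timestampFormat";
// the pattern accepts minutes-only times such as "2016-10-07T11:15Z".
import org.apache.spark.sql.types._

val schema = StructType(StructField("tst", TimestampType, true) :: Nil)
val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mmX")
  .json(jsonRDD)
df.show()
{code}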

> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----------------+
> |            value|
> +-----------------+
> |             null|
> +-----------------+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at 

[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15600732#comment-15600732
 ] 

Stephane Maarek commented on SPARK-18068:
-

I see TimestampType is a wrapper for java.sql.Timestamp

It seems that it can't parse a string without seconds.

{code}
scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> Timestamp.valueOf("2016-10-07T11:15Z")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd
hh:mm:ss[.fffffffff]
  at java.sql.Timestamp.valueOf(Timestamp.java:204)
  ... 32 elided
{code}

A workaround would be to first parse the string with the Java 8 time API and
then convert it to a java.sql.Timestamp.
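
A minimal sketch of that workaround (my example values, nothing Spark-specific):

{code:scala}
// Parse the seconds-less ISO 8601 string with java.time, then convert it to a
// java.sql.Timestamp.
import java.sql.Timestamp
import java.time.OffsetDateTime

val parsed = OffsetDateTime.parse("2016-10-07T11:15Z")  // seconds default to 0
val ts = Timestamp.from(parsed.toInstant)               // instant 2016-10-07T11:15:00Z
{code}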

> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----------------+
> |            value|
> +-----------------+
> |             null|
> +-----------------+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> 

[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Priority: Major  (was: Critical)

> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----------------+
> |            value|
> +-----------------+
> |             null|
> +-----------------+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> 

[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Description: 
The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

{code}
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at <console>:25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-----------------+
|            value|
+-----------------+
|2016-10-07T11:15Z|
+-----------------+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-----------------+
|            value|
+-----------------+
|             null|
+-----------------+
{code}


And the schema usage errors out right away:

{code}
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at <console>:33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 

[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-18068:

Description: 
The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

{code:scala}
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at <console>:25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-----------------+
|            value|
+-----------------+
|2016-10-07T11:15Z|
+-----------------+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-----------------+
|            value|
+-----------------+
|             null|
+-----------------+
{code}


And the schema usage errors out right away:

{code:scala}
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at <console>:33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 

[jira] [Created] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-23 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-18068:
---

 Summary: Spark SQL doesn't parse some ISO 8601 formatted dates
 Key: SPARK-18068
 URL: https://issues.apache.org/jira/browse/SPARK-18068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Stephane Maarek
Priority: Critical


The following fail, but shouldn't according to the ISO 8601 standard (seconds 
can be omitted). Not sure where the issue lies (probably an external library?)

```
scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
parallelize at <console>:25

scala> res1.toDF
res2: org.apache.spark.sql.DataFrame = [value: string]

scala> res2.select("value").show()
+-----------------+
|            value|
+-----------------+
|2016-10-07T11:15Z|
+-----------------+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> res2.select(col("value").cast(TimestampType)).show()
+-----------------+
|            value|
+-----------------+
|             null|
+-----------------+
```

And the schema errors out right away:

```
scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
parallelize at <console>:33

scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(tst,TimestampType,true))

scala> val df = spark.read.schema(schema).json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [tst: timestamp]

scala> df.show()
16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
java.lang.IllegalArgumentException: 2016-10-07T11:15Z
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
at 
org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, 
localhost): 

[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-08-14 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420491#comment-15420491
 ] 

Stephane Maarek commented on SPARK-11374:
-

Hi,

Thanks for the PR. Can you also test the footer option? Might as well solve
both issues.

Thanks
Stéphane




> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.






[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244986#comment-15244986
 ] 

Stephane Maarek commented on SPARK-14586:
-

Hi [~tsuresh], thanks for your reply. It makes sense! I'm using Hive 1.2.1.
My only concern is that, looking at the code, I understand why the number
wouldn't be parsed correctly in either Spark or Hive, but I don't understand
why the Hive 1.2.1 CLI does parse the number correctly (as seen in my
troubleshooting). Isn't Spark using the exact same logic as Hive?

> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL    3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +--------+--------+
> |column_1|column_2|
> +--------+--------+
> |       a|    null|
> |    null|    3.00|
> +--------+--------+
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
> have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
> se, but it looks like a necessary improvement for the two engines to 
> converge. Hive version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}
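
A possible workaround sketch (mine, not from the ticket): declare column_2 as a string in the Hive DDL, then trim and cast it on the Spark side so the leading space no longer turns the value into null:

{code:scala}
// Assumes column_2 was declared as string in the DDL instead of decimal(4,2).
import org.apache.spark.sql.functions.{col, trim}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val fixed = sqlContext.sql("select * from spark_testing.test_csv_2")
  .withColumn("column_2", trim(col("column_2")).cast("decimal(4,2)"))
fixed.show()
{code}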






[jira] [Updated] (SPARK-14583) SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions

2016-04-14 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Summary: SparkSQL doesn't apply 
TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions  
(was: SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when 
Hive Table has partitions)

> SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when 
> Hive Table has partitions
> ---
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> It seems that Spark forgets or fails to read the TBLPROPERTIES metadata after
> an MSCK REPAIR is issued from within Hive.
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14583) SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions

2016-04-14 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Summary: SparkSQL doesn't read 
TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions  
(was: SparkSQL doesn't read hive table properly after MSCK REPAIR)

> SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive 
> Table has partitions
> --
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-14 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242132#comment-15242132
 ] 

Stephane Maarek commented on SPARK-12741:
-

Hi Sean,
What do you mean by the behavior on master? Do you want me to run the query on 
something other than spark-shell, or spark-shell --master yarn --deploy-mode client? 
Sorry, I'm just starting out with this kind of bug report and I don't have the 
expertise to dive into the Spark code.
Thanks for working with me through this.
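
If by "master" you mean checking the behavior on a current development build of 
Spark, here is a minimal sketch of the check I could run there; df stands in for the 
DataFrame built from the NoSQL source in Sasi's report, so this is an assumption 
rather than an exact repro.

{code}
// Sketch of a count-vs-collect consistency check; "df" stands for the DataFrame
// built from the NoSQL source in the report.
df.cache()
df.count()                            // first action materializes the cache

println(s"count():          ${df.count()}")
println(s"collect().length: ${df.collect().length}")   // the two numbers should match
{code}
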
Regards,
Stephane

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-13 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238723#comment-15238723
 ] 

Stephane Maarek commented on SPARK-12741:
-

can we please re-open the issue?

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-04-13 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238689#comment-15238689
 ] 

Stephane Maarek commented on SPARK-11374:
-

Any updates on this?
Here is another log of the repro:

{code}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='', "skip.header.line.count"="1");
select * from spark_testing.test_csv_2;

hive> select * from spark_testing.test_csv_2;
OK
NULL    3
{code}

spark:

{code}

scala> sqlContext.sql("select * from spark_testing.test_csv_2").show()

+--------+--------+
|column_1|column_2|
+--------+--------+
|       a|    null|
|    null|    3.00|
+--------+--------+

{code}

That's a big problem.
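
Until HiveContext honours skip.header.line.count, a workaround sketch on the Spark 
side is to drop the header row manually. In the toy table above the header line's 
first field happens to be "a"; in a real table you would substitute the actual 
header label.

{code}
// Workaround sketch: drop the header row manually, since the table property is ignored.
// Keep null rows explicitly, because a null comparison would otherwise filter them out.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val withHeader = sqlContext.sql("select * from spark_testing.test_csv_2")
val noHeader = withHeader.filter(withHeader("column_1").isNull || (withHeader("column_1") !== "a"))
noHeader.show()
{code}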

> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-04-13 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-11374:

Comment: was deleted

(was: I may add that more metadata isn't processed, namely TBLPROPERTIES 
('serialization.null.format'='')
Also, another issue (may still be related to Spark not reading Hive Metadata or 
not properly using Hive), but if you create a csv with the following (spaces 
intended)

1, 2,3
4, 5,6

use Hive as this:
CREATE EXTERNAL TABLE `my_table`(
  `c1` DECIMAL,
  `c2` DECIMAL,
  `c3` DECIMAL) ... etc

select * from my_table will return in Hive
1,2,3
4,5,6

But using a hive context, in Spark
1,null,3
4,null,6)

> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
> have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
> se, but it looks like a necessary improvement for the two engines to 
> converge. Hive version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0


  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0



> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14583) SparkSQL doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Summary: SparkSQL doesn't read hive table properly after MSCK REPAIR  (was: 
Spark doesn't read hive table properly after MSCK REPAIR)

> SparkSQL doesn't read hive table properly after MSCK REPAIR
> ---
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-14586:
---

 Summary: SparkSQL doesn't parse decimal like Hive
 Key: SPARK-14586
 URL: https://issues.apache.org/jira/browse/SPARK-14586
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Stephane Maarek


create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Description: 
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:none}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:java}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 


  was:
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:csv}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:scala}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 



> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> 

[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Description: 
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:csv}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:scala}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 


  was:
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
a,2
,3

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b

(you can see the NULL)

now onto Spark:
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+


As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 



> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:csv}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:scala}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the 

[jira] [Commented] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238352#comment-15238352
 ] 

Stephane Maarek commented on SPARK-14583:
-

Pretty much the same behavior occurs if, instead of MSCK REPAIR, we run ALTER TABLE 
spark_testing.test_csv ADD PARTITION (part_a="a", part_b="b");
This makes me believe it's the partitioning that makes Spark fail.
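
As a stopgap on our side, here is a sketch of a workaround: re-apply the '' -> NULL 
convention by hand in Spark for the partitioned table. It only patches the string 
column and is obviously not a fix.

{code}
// Workaround sketch, not a fix: re-apply serialization.null.format('') manually
// on the Spark side for the partitioned table.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""
  select case when column_1 = '' then null else column_1 end as column_1,
         column_2, part_a, part_b
  from spark_testing.test_csv
""").show()
{code}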

> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> a,2
> ,3
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> (you can see the NULL)
> now onto Spark:
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-14583:
---

 Summary: Spark doesn't read hive table properly after MSCK REPAIR
 Key: SPARK-14583
 URL: https://issues.apache.org/jira/browse/SPARK-14583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.1
Reporter: Stephane Maarek


it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
a,2
,3

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b

(you can see the NULL)

now onto Spark:
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+


As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-04-06 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229401#comment-15229401
 ] 

Stephane Maarek commented on SPARK-11374:
-

I may add that more metadata isn't processed, namely 
TBLPROPERTIES('serialization.null.format'='').
Also, another issue (it may still be related to Spark not reading the Hive metadata, 
or not using Hive properly): if you create a CSV with the following content (spaces 
intended)

1, 2,3
4, 5,6

and use Hive like this:
CREATE EXTERNAL TABLE `my_table`(
  `c1` DECIMAL,
  `c2` DECIMAL,
  `c3` DECIMAL) ... etc

then select * from my_table returns in Hive
1,2,3
4,5,6

but with a HiveContext, Spark returns
1,null,3
4,null,6

> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-05 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227688#comment-15227688
 ] 

Stephane Maarek commented on SPARK-12741:
-

Hi,

This may be related to:
http://stackoverflow.com/questions/36438898/spark-dataframe-count-doesnt-return-the-same-results-when-run-twice

I don't have code to generate the input file; it's just a simple Hive table though.

Cheers,
Stephane

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

2015-02-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324349#comment-14324349
 ] 

Stephane Maarek commented on SPARK-5480:


Hi Sean,

We have included the following code before and after the graph gets created:

println(s"Vertices count ${vertices.count}")
println(s"Edges count ${edges.count}")

val defaultArticle = ("Missing", None, List.empty, None)

// create the graph, making sure we default to a defaultArticle when we have a missing relation (prevents nulls)
val graph = Graph(vertices, edges, defaultArticle).cache

println(s"After graph: Vertices count ${graph.vertices.count}")
println(s"After graph: Edges count ${graph.edges.count}")

What we see across multiple runs with the exact same configuration is that the vertex 
and edge counts before the graph is created are always the same.

The constant:
Vertices count: 192190
Edges count: 4582582

After graph:
(trial one - generated the error)
After graph: Vertices count: 2450854
After graph: Edges count: 4188635
(trial two - terminated correctly)
After graph: Vertices count: 2450854
After graph: Edges count: 4582582
(trial three - generated the error)
After graph: Vertices count: 2450854
After graph: Edges count: 4000218

As we can replicate the issue, please let us know if we should add any code to help 
you debug.
Our code is deterministic, so before creating the graph we always see the same output.
What's odd is that after creating the graph, the vertices count is constant but the 
edges count varies.
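
In case it helps narrow it down, here is a minimal determinism check we can run, 
reusing vertices, edges and defaultArticle from the snippet above (a sketch only):

{code}
// Determinism check (sketch), reusing vertices, edges and defaultArticle from above:
// materialize the cached graph once, then count the same cached edge RDD repeatedly.
// Across separate application runs the post-construction edge count is what varies
// for us; this at least verifies the counts are stable within one run.
import org.apache.spark.graphx.Graph

val graph = Graph(vertices, edges, defaultArticle).cache()
graph.vertices.count()   // first actions materialize the cached graph
graph.edges.count()

val edgeCounts = (1 to 3).map(_ => graph.edges.count())
println(s"Edge counts across repeated actions: ${edgeCounts.mkString(", ")}")
assert(edgeCounts.distinct.size == 1, "edge count changed between actions in a single run")
{code}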


 GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: 
 ---

 Key: SPARK-5480
 URL: https://issues.apache.org/jira/browse/SPARK-5480
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
 Environment: Yarn client
Reporter: Stephane Maarek

 Running the following code:
 val subgraph = graph.subgraph (
   vpred = (id,article) = //working predicate)
 ).cache()
 println( sSubgraph contains ${subgraph.vertices.count} nodes and 
 ${subgraph.edges.count} edges)
 val prGraph = subgraph.staticPageRank(5).cache
 val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
   (v, title, rank) = (rank.getOrElse(0.0), title)
 }
 titleAndPrGraph.vertices.top(13) {
   Ordering.by((entry: (VertexId, (Double, _))) = entry._2._1)
 }.foreach(t = println(t._2._2._1 + :  + t._2._1 + , id: + t._1))
 Returns a graph with 5000 nodes and 4000 edges.
 Then it crashes during the PageRank with the following:
 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 
 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 
 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
 at 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 

[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

2015-02-05 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307628#comment-14307628
 ] 

Stephane Maarek commented on SPARK-5480:


It happened once after one of my servers failed, but counting the graph vertices
and edges did work. It doesn't happen systematically... we're having trouble
reproducing it.

val subgraph = graph.subgraph (
  vpred = (id, article) => article._1.toLowerCase.contains(stringToSearchFor)
    || article._3.exists(keyword => keyword.contains(stringToSearchFor))
    || (article._2 match {
      case None => false
      case Some(articleAbstract) =>
        articleAbstract.toLowerCase.contains(stringToSearchFor)
    })
).cache()
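
One thing we might try to make runs more comparable (a sketch only, with the
predicate shortened for brevity, and not verified against this bug) is to fully
materialize the cached subgraph before calling staticPageRank, so the PageRank
stages reuse the cached partitions instead of re-evaluating the subgraph lineage:

// sketch only: force both the vertex and edge partitions of the cached
// subgraph to materialize before PageRank, so later stages do not recompute
// the subgraph predicate through lineage.
val subgraph = graph.subgraph(
  vpred = (id, article) => article._1.toLowerCase.contains(stringToSearchFor)
).cache()
subgraph.vertices.count()  // materializes the vertex partitions
subgraph.edges.count()     // materializes the edge partitions
val prGraph = subgraph.staticPageRank(5)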

 GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: 
 ---

 Key: SPARK-5480
 URL: https://issues.apache.org/jira/browse/SPARK-5480
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
 Environment: Yarn client
Reporter: Stephane Maarek

 Running the following code:
 val subgraph = graph.subgraph (
   vpred = (id, article) => // working predicate
 ).cache()
 println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
 val prGraph = subgraph.staticPageRank(5).cache
 val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
   (v, title, rank) => (rank.getOrElse(0.0), title)
 }
 titleAndPrGraph.vertices.top(13) {
   Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
 }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
 Returns a graph with 5000 nodes and 4000 edges.
 Then it crashes during the PageRank with the following:
 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 
 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 
 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
 at 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at 

[jira] [Created] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

2015-01-29 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-5480:
--

 Summary: GraphX pageRank: 
java.lang.ArrayIndexOutOfBoundsException: 
 Key: SPARK-5480
 URL: https://issues.apache.org/jira/browse/SPARK-5480
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
 Environment: Yarn client
Reporter: Stephane Maarek


Running the following code:


val subgraph = graph.subgraph (
  vpred = (id, article) => // working predicate
).cache()

println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")

val prGraph = subgraph.staticPageRank(5).cache

val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title)
}

titleAndPrGraph.vertices.top(13) {
  Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
}.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))

Returns a graph with 5000 nodes and 4000 edges.
Then it crashes during the PageRank with the following:


15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 
39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 
(TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at 
org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at 
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at 
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
at 
org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




