Re: ORC Transaction Table - Spark

2017-08-24 Thread Aviral Agarwal
Are there any plans to include it in future releases of Spark?

Regards,
Aviral Agarwal

On Thu, Aug 24, 2017 at 3:11 PM, Akhil Das <ak...@hacked.work> wrote:

> How are you reading the data? It's clearly saying
> *java.lang.NumberFormatException: For input string: "0645253_0001"*
>
> On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal <aviral12...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to read a Hive ORC transactional table through Spark, but I am
>> getting the following error:
>>
>> Caused by: java.lang.RuntimeException: serious problem
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>> ...
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.NumberFormatException: For input string: "0645253_0001"
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
>> ... 118 more
>>
>> Any help would be appreciated.
>>
>> Thanks and Regards,
>> Aviral Agarwal
>>
>
> --
> Cheers!


Fwd: ORC Transaction Table - Spark

2017-08-22 Thread Aviral Agarwal
Hi,

I am trying to read a Hive ORC transactional table through Spark, but I am
getting the following error:

Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
...
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0645253_0001"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal
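
The NumberFormatException is thrown while OrcInputFormat computes splits, apparently
while parsing a file name like "0645253_0001"; uncompacted Hive ACID (transactional)
tables keep their data in delta/bucket files whose names the Spark 1.x Hive reader does
not understand. One workaround that may help, as a minimal sketch assuming major
compaction is permitted on the table (the database and table names below are
placeholders): run a major compaction in Hive so the rows land in base files, then read
the table from Spark.

// Hive side first (HiveQL, placeholder names):
//   ALTER TABLE my_db.my_acid_table COMPACT 'major';

// Spark side (1.6-era HiveContext), once the compaction has finished:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.table("my_db.my_acid_table")
df.show()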


Re: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
This works. Thanks!

- Aviral Agarwal

On Wed, Jun 21, 2017 at 6:07 PM, Eduardo Mello <eedu.me...@gmail.com> wrote:

> You can add "?zeroDateTimeBehavior=convertToNull" to the connection
> string.
>
> On Wed, Jun 21, 2017 at 9:04 AM, Aviral Agarwal <aviral12...@gmail.com>
> wrote:
>
>> The exception is happening in JDBC RDD code where getNext() is called to
>> get the next row.
>> I do not have access to the result set. I am operating on a DataFrame.
>>
>> Thanks and Regards,
>> Aviral Agarwal
>>
>> On Jun 21, 2017 17:19, "Mahesh Sawaiker" <mahesh_sawai...@persistent.com>
>> wrote:
>>
>>> This has to do with how you are creating the timestamp object from the
>>> resultset (I guess).
>>>
>>> If you can provide more code it will help, but you could surround the
>>> parsing code with a try/catch and then just ignore the exception.
>>>
>>>
>>>
>>> *From:* Aviral Agarwal [mailto:aviral12...@gmail.com]
>>> *Sent:* Wednesday, June 21, 2017 2:37 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* JDBC RDD Timestamp Parsing Issue
>>>
>>>
>>>
>>> Hi,
>>>
>>> I am using a JDBC RDD to read from a MySQL RDBMS.
>>> My Spark job fails with the error below:
>>>
>>> java.sql.SQLException: Value '0000-00-00 00:00:00.000' can not be
>>> represented as java.sql.Timestamp
>>>
>>> Now instead of the whole job failing, I want to skip this record and
>>> continue processing the rest.
>>> Any leads on how that can be done?
>>>
>>>
>>> Thanks and Regards,
>>> Aviral Agarwal


RE: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
The exception is happening in JDBC RDD code where getNext() is called to
get the next row.
I do not have access to the result set. I am operating on a DataFrame.

Thanks and Regards,
Aviral Agarwal

On Jun 21, 2017 17:19, "Mahesh Sawaiker" <mahesh_sawai...@persistent.com>
wrote:

> This has to do with how you are creating the timestamp object from the
> resultset (I guess).
>
> If you can provide more code it will help, but you could surround the
> parsing code with a try/catch and then just ignore the exception.
>
>
>
> *From:* Aviral Agarwal [mailto:aviral12...@gmail.com]
> *Sent:* Wednesday, June 21, 2017 2:37 PM
> *To:* user@spark.apache.org
> *Subject:* JDBC RDD Timestamp Parsing Issue
>
>
>
> Hi,
>
> I am using a JDBC RDD to read from a MySQL RDBMS.
> My Spark job fails with the error below:
>
> java.sql.SQLException: Value '0000-00-00 00:00:00.000' can not be
> represented as java.sql.Timestamp
>
> Now instead of the whole job failing, I want to skip this record and
> continue processing the rest.
> Any leads on how that can be done?
>
>
> Thanks and Regards,
> Aviral Agarwal


JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
Hi,

I am using a JDBC RDD to read from a MySQL RDBMS.
My Spark job fails with the error below:

java.sql.SQLException: Value '0000-00-00 00:00:00.000' can not be
represented as java.sql.Timestamp

Now instead of the whole job failing, I want to skip this record and
continue processing the rest.
Any leads on how that can be done?


Thanks and Regards,
Aviral Agarwal
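
A minimal sketch of the connection-string fix suggested upthread, assuming the MySQL
Connector/J driver and Spark's DataFrameReader JDBC source; the host, database, table,
and credentials below are placeholders:

// zeroDateTimeBehavior=convertToNull tells Connector/J to return NULL for
// '0000-00-00 00:00:00' values instead of throwing SQLException.
val url = "jdbc:mysql://db-host:3306/my_db?zeroDateTimeBehavior=convertToNull"

val df = sqlContext.read // sqlContext: an existing SQLContext
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "my_table")
  .option("user", "my_user")
  .option("password", "my_password")
  .load()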


[SparkSQL] Project using NamedExpression

2017-03-21 Thread Aviral Agarwal
Hi guys,

I want to transform Rows using NamedExpressions.

Below is the code snippet that I am using:


// Imports assumed for this snippet (Spark 1.6-era APIs):
import scala.collection.JavaConversions._

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.catalyst.{InternalRow, SqlParser}
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}

def apply(dataFrame: DataFrame,
          selectExpressions: java.util.List[String]): RDD[UnsafeRow] = {

  // Parse each expression string into a named Catalyst expression.
  val exprArray = selectExpressions.map(s =>
    Column(SqlParser.parseExpression(s)).named
  )

  // Input attributes, taken from the DataFrame's logical plan.
  val inputSchema = dataFrame.logicalPlan.output

  // Build an UnsafeProjection once per partition and apply it to each row.
  val transformedRDD = dataFrame.mapPartitions { iter =>
    val project = UnsafeProjection.create(exprArray, inputSchema)
    iter.map { row =>
      project(InternalRow.fromSeq(row.toSeq))
    }
  }

  transformedRDD
}


The problem is that the expression becomes unevaluable:

Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: 'a
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:233)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.genCode(unresolved.scala:53)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext.generateExpressions(CodeGenerator.scala:464)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:281)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:324)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:317)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:635)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:125)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:135)
at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(ScalaTransform.scala:31)
at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(ScalaTransform.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


This might be because the Expression is unresolved.

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal
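
One way around the unresolved-attribute failure, as a sketch assuming Spark 1.6 (where
queryExecution is exposed as a @DeveloperApi field on DataFrame): let the analyzer do the
resolution by routing the expression strings through the DataFrame API (selectExpr), then
take the internal rows from the executed plan instead of building an UnsafeProjection by
hand.

import scala.collection.JavaConverters._

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.InternalRow

def project(dataFrame: DataFrame,
            selectExpressions: java.util.List[String]): RDD[InternalRow] = {
  // selectExpr parses each string and the analyzer resolves the attributes
  // against dataFrame's schema, so no expression stays Unresolved.
  val projected = dataFrame.selectExpr(selectExpressions.asScala: _*)
  // toRdd exposes the executed plan's rows as InternalRows (UnsafeRows at
  // runtime), matching what the hand-built projection was producing.
  projected.queryExecution.toRdd
}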


Spark SQL Skip and Log bad records

2017-03-15 Thread Aviral Agarwal
Hi guys,

Is there a way to skip some bad records and log them when using the
DataFrame API?


Thanks and Regards,
Aviral Agarwal
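
One pattern that may work, sketched under the assumption that the bad records fail in a
conversion step you control; rawLines, Record, parseRecord, and the output path below
are placeholders, not Spark API: attempt each conversion with scala.util.Try, save the
failures for inspection, and build the DataFrame from the successes.

import scala.util.{Failure, Success, Try}

// Placeholders standing in for your own schema and fallible parser.
case class Record(id: Long, name: String)
def parseRecord(line: String): Record = {
  val parts = line.split(",")
  Record(parts(0).toLong, parts(1))
}

// rawLines: RDD[String] (placeholder). Cached because it feeds two branches.
val attempts = rawLines.map(line => line -> Try(parseRecord(line))).cache()

// Log the bad records together with the reason they failed.
attempts.collect { case (line, Failure(e)) => s"$line\t${e.getMessage}" }
  .saveAsTextFile("/tmp/bad-records") // path is a placeholder

// Keep only the good rows and continue with the DataFrame API.
val good = attempts.collect { case (_, Success(rec)) => rec }
val df = sqlContext.createDataFrame(good) // sqlContext: an existing SQLContext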


Mismatched datatype in Case statement

2017-02-04 Thread Aviral Agarwal
Hi,
I was trying Spark 1.6.0 when I ran into the error described in the
following Hive JIRA:
https://issues.apache.org/jira/browse/HIVE-5825
The error occurred in both cases, with SQLContext as well as HiveContext.

Any indication whether this has been fixed in a later Spark version? If yes,
which version?

Thanks and Regards,
Aviral Agarwal